Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve file URI ergonomics in urllib.request #125866

Open
barneygale opened this issue Oct 23, 2024 · 9 comments
Open

Improve file URI ergonomics in urllib.request #125866

barneygale opened this issue Oct 23, 2024 · 9 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@barneygale
Copy link
Contributor

barneygale commented Oct 23, 2024

Feature or enhancement

I request that we make pathname2url and url2pathname easier to use:

  • pathname2url() is made to accept an optional include_scheme argument that sticks file: on the front when true
  • url2pathname() is made to strip any file: prefix from its argument.

I think this would go a long way towards making these functions usable, and allow us to remove the scary "This does not accept/produce a complete URL" warnings from the docs.

Linked PRs

@barneygale barneygale added the type-feature A feature request or enhancement label Oct 23, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 25, 2024
Merge `URL2PathNameTests` `PathName2URLTests` and test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
barneygale added a commit to barneygale/cpython that referenced this issue Oct 25, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 29, 2024
barneygale added a commit that referenced this issue Oct 29, 2024
…125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 29, 2024
…()` (pythonGH-125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
(cherry picked from commit 6742f14)

Co-authored-by: Barney Gale <barney.gale@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 29, 2024
…()` (pythonGH-125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
(cherry picked from commit 6742f14)

Co-authored-by: Barney Gale <barney.gale@gmail.com>
barneygale added a commit that referenced this issue Oct 29, 2024
…e()` (GH-125993) (#126145)

GH-125866: Improve tests for `pathname2url()` and `url2pathname()` (GH-125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
(cherry picked from commit 6742f14)

Co-authored-by: Barney Gale <barney.gale@gmail.com>
barneygale added a commit that referenced this issue Oct 29, 2024
…e()` (GH-125993) (#126144)

GH-125866: Improve tests for `pathname2url()` and `url2pathname()` (GH-125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
(cherry picked from commit 6742f14)

Co-authored-by: Barney Gale <barney.gale@gmail.com>
barneygale added a commit to barneygale/cpython that referenced this issue Oct 29, 2024
Adjust `urllib.request.pathname2url` and `url2pathname()` to generate and
accept file URIs as described in RFC8089.

`pathname2url()` gains a new *include_scheme* argument, which defaults to
false. When set to true, the returned URL includes a `file:` prefix.

`url2pathname()` now automatically removes a `file:` prefix if present.

On Windows, `pathname2url()` now generates URIs that begin with two slashes
rather than four when given a UNC path.

On other platforms, `pathname2url()` now generates URIs that begin with
three slashes rather than one when given an absolute path. `url2pathname()`
now performs the opposite transformation, so `file:///etc/hosts` becomes
`/etc/hosts`. Furthermore, `url2pathname()` now ignores local hosts (like
"localhost" or any alias) and raises `URLError` for non-local hosts.
@picnixz picnixz added the stdlib Python modules in the Lib dir label Oct 30, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Oct 30, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 19, 2024
barneygale added a commit to barneygale/cpython that referenced this issue Nov 22, 2024
Stop converting Windows drive letters to uppercase in
`urllib.request.pathname2url()` and `url2pathname()`. This behaviour is
unnecessary and inconsistent with pathlib's file URI implementation.
barneygale added a commit that referenced this issue Nov 23, 2024
Stop converting Windows drive letters to uppercase in
`urllib.request.pathname2url()` and `url2pathname()`. This behaviour is
unnecessary and inconsistent with pathlib's file URI implementation.
barneygale added a commit to barneygale/cpython that referenced this issue Nov 23, 2024
@serhiy-storchaka
Copy link
Member

file:/etc/hosts, file:///etc/hosts and file://localhost/etc/hosts are equivalent URIs. I do not see anything in RFC 8089 that suggests that file:///etc/hosts is more preferable.

Actually, file:/etc/hosts is a new form. It was not allowed in RFC 1738, and only allowed in RFC 8089. So you can say that it is more modern representation.

@barneygale
Copy link
Contributor Author

I do not see anything in RFC 8089 that suggests that file:///etc/hosts is more preferable.

RFC 8089 is mainly non-normative, but it describes file:/// as "traditional" and file:/ as "minimal"

The freedesktop spec notes:

Some current apps generate URIs of the form "file:/". These
are not correct according to RFC1738, so they should not be
generated. However for backwards compatibility, it is recommended that
such URIs are interpreted as file URIs with an empty hostname.

The wikipedia article lists the file:/// form first, and notes KDE as an exception.

@barneygale
Copy link
Contributor Author

barneygale commented Nov 24, 2024

Also, using file:/// is:

  1. More consistent with the Windows version of pathname2url(), which adds an empty authority for DOS drive paths - e.g. C:\foo is converted to ///C:/foo, not /C:/foo
  2. Consistent with pathlib.Path.as_uri()

@serhiy-storchaka
Copy link
Member

RFC 8089 updates RFC 1738. The file:/// form is RFC 1738 compliant. The file:/ form is not RFC 1738 compliant, but RFC 8089 compliant. So the change does not match the title of this issue. This is why it raises questions.

It may be a justified change, but it needs different justification than RFC 8089 compliance.

@barneygale
Copy link
Contributor Author

Good point. I'll revise the issue - it's already evolved a bit since I logged it.

@barneygale barneygale changed the title RFC8089-compliant file URI support in urllib.request Improve file URI ergonomics in urllib.request Nov 24, 2024
@barneygale
Copy link
Contributor Author

barneygale commented Nov 25, 2024

@serhiy-storchaka I've revised the description as follows:

I request that we make pathname2url and url2pathname easier to use:

  • pathname2url() is made to accept an optional include_scheme argument that sticks file: on the front when true
  • url2pathname() is made to strip any file: prefix from its argument.

Does that seem like a good idea to you?

If we go for these changes, and the user provides a URL with a non-file: scheme to url2pathname(), what should happen in your opinion? (I'm erring towards raising URLError.)

Relatedly, what do you think should happen if the user provides a query or fragment in the URL? Currently it's returned as part of the file path. (This seems under-specified to me, but I'm leaning towards raising URLError here too.)

Thank you

@serhiy-storchaka
Copy link
Member

I wonder -- would not be better to add such functions in os.path? The implementation is clearly platform depending, and if we add support for platform with different path format (like old macpath), it will need a completely different implementation. In custom build for exotic platform it is easier to provide a separate os.path implementation than patch urllib.request.

@barneygale
Copy link
Contributor Author

Good question! It's something that's come up before:

The current Windows + POSIX implementations don't have much in common, but I think that will change when I address this issue and #123599. To solve them, I'm planning to use urlsplit() to parse the URL, replacing parts of the janky implementation in url2pathname(). Here's a rough draft of where I want to end up:

def url2pathname(url):
    scheme, authority, path, query, fragment = urlsplit(url, scheme='file')
    if scheme != 'file':
        raise URLError(f'URL uses non-file scheme: {url!r}')
    if query:
        raise URLError(f'file URL has query: {url!r}')
    if fragment:
        raise URLError(f'file URL has fragment: {url!r}')

    if os.name == 'nt':
        # Windows-specific file URI quirks.
        if authority and authority != 'localhost':
            # e.g. file://server/share/file.txt
            path = '//' + authority + path
        elif path[:3] == '///':
            # e.g. file://///server/share/file.txt
            path = path[1:]
        else:
            if path[:1] == '/' and path[2:3] in ':|':
                # Skip past extra slash before DOS drive in URL path.
                path = path[1:]
            if path[1:2] == '|':
                # Older URLs use a pipe after a drive letter
                path = path.replace('|', ':', 1)
        path = path.replace('/', '\\')
    elif not _is_local_authority(authority):
        # POSIX only: reject URL if authority doesn't resolve to localhost.
        raise URLError(f'file URL has non-local authority: {url!r}')

    encoding = sys.getfilesystemencoding()
    errors = sys.getfilesystemencodeerrors()
    return unquote(path, encoding=encoding, errors=errors)


def pathname2url(pathname, include_scheme=False):
    if os.name == 'nt':
        pathname = pathname.replace('\\', '/')
    encoding = sys.getfilesystemencoding()
    errors = sys.getfilesystemencodeerrors()
    prefix = 'file:' if include_scheme else ''
    drive, root, tail = os.path.splitroot(pathname)

    if drive:
        if drive[:4] == '//?/':
            drive = drive[4:]
            if drive[:4].upper() == 'UNC/':
                drive = '//' + drive[4:]
        if drive[1:] == ':':
            prefix += '///'
        drive = quote(drive, encoding=encoding, errors=errors, safe='/:')
    elif root:
        prefix += '//'

    tail = quote(tail, encoding=encoding, errors=errors)
    return prefix + drive + root + tail

There's clearly some OS-specific code in there, but there's also a fair amount of shared and URL-specific code which might feel out-of-place in os.path.

Besides, if we added functions in os.path we'd need to either live with near-duplicate functionality in urllib, or deprecate the urllib functions (which isn't much fun for their users). I'm not ruling that out, but it seems unnecessarily if there's a reasonable path towards making pathname2url() and url2pathname() a bit more usable.

@barneygale
Copy link
Contributor Author

barneygale commented Dec 3, 2024

FWIW, if we go with the above merger, then we can deprecate the oddball nturl2path module.

picnixz pushed a commit to picnixz/cpython that referenced this issue Dec 8, 2024
…()` (python#125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…()` (python#125993)

Merge `URL2PathNameTests` and `PathName2URLTests` test cases (which test
only the Windows-specific implementations from `nturl2path`) into the main
`Pathname_Tests` test case for these functions.

Copy/port some test cases for `pathlib.Path.as_uri()` and `from_uri()`.
ebonnal pushed a commit to ebonnal/cpython that referenced this issue Jan 12, 2025
…hon#127138)

Stop converting Windows drive letters to uppercase in
`urllib.request.pathname2url()` and `url2pathname()`. This behaviour is
unnecessary and inconsistent with pathlib's file URI implementation.
barneygale added a commit to barneygale/cpython that referenced this issue Mar 18, 2025
Add `tests.test_nturl2path` to exercise `nturl2path`.

`nturl2path` is currently used by `urllib` and tested by `test_urllib`, but
that will change when we deprecate `nturl2path` and add support for Windows
paths/URLs in `urllib` itself.
barneygale added a commit to barneygale/cpython that referenced this issue Mar 19, 2025
barneygale added a commit that referenced this issue Mar 19, 2025
Deprecate the `nturl2path` module. Its functionality is merged into
`urllib.request`.

Add `tests.test_nturl2path` to exercise `nturl2path`, as it's no longer
covered by `test_urllib`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement
Projects
None yet
Development

No branches or pull requests

3 participants