Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-130942: Fix path seperator matched in character ranges for glob.translate #130989

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

dmitya26
Copy link

@dmitya26 dmitya26 commented Mar 8, 2025

Copy link

cpython-cla-bot bot commented Mar 8, 2025

All commit authors signed the Contributor License Agreement.
CLA signed

@bedevere-app bedevere-app bot added the tests Tests in the Lib/test dir label Mar 8, 2025
@bedevere-app
Copy link

bedevere-app bot commented Mar 8, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@bedevere-app
Copy link

bedevere-app bot commented Mar 8, 2025

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@dmitya26 dmitya26 marked this pull request as draft March 8, 2025 23:09
@dmitya26 dmitya26 marked this pull request as ready for review March 10, 2025 19:16
@dmitya26
Copy link
Author

@barneygale @picnixz
PR is ready for review! :)

@dmitya26
Copy link
Author

+type-bug -tests

@barneygale
Copy link
Contributor

barneygale commented Mar 10, 2025

Thanks v much for taking a look!

Range expressions like [%-0] are still valid, so we should evaluate them as wildcards rather than matching literally IMO. Basically we just need to apply an additional restriction: don't match a separator. We could do that with a lookahead (untested):

diff --git a/Lib/fnmatch.py b/Lib/fnmatch.py
index 865baea2346..ee35dd4d24c 100644
--- a/Lib/fnmatch.py
+++ b/Lib/fnmatch.py
@@ -145,8 +145,10 @@ def _translate(pat, star, question_mark):
                     add('(?!)')
                 elif stuff == '!':
                     # Negated empty range: match any character.
-                    add('.')
+                    add(question_mark)
                 else:
+                    if question_mark != '.':
+                        add(f'(?={question_mark})')
                     # Escape set operations (&&, ~~ and ||).
                     stuff = _re_setops_sub(r'\\\1', stuff)
                     if stuff[0] == '!':

@barneygale barneygale added type-bug An unexpected behavior, bug, or error and removed tests Tests in the Lib/test dir labels Mar 10, 2025
@@ -514,6 +514,9 @@ def fn(pat):
self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z')
self.assertEqual(fn('**/*'), r'(?s:(?:.+[/\\])?[^/\\]+)\Z')

self.assertEqual(fn('foo[%-0]bar'), r'(?s:foo\[%-0\]bar)\Z')
Copy link

@dkaszews dkaszews Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is correct. From my understanding of manpages quoted in the issue, a class should be escaped only if it contains a literal path separator, not a range encompassing it. In latter case, we need to just exclude the separator.

[%-0] => (?!/)[%-0]
[ab/] => \[ab/\]

Edge case to be tested in bash and glob.glob: is a range beginning with separator ([%-/] or [/-0]) the first case or the second one? What about corner case of single element range [/-/]? I would say that all three should be escaped since they "contain an explicit / character".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does

A range containing an
explicit '/' character is syntactically incorrect. (POSIX requires that
syntactically incorrect patterns are left unchanged.)

mean that entire glob should be escaped, or just the part with the separator? I.e, does [ab][0/][xy] map to [ab]\[0/\][xy] or \[ab\]\[0/\]\[xy\]?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Author

@dmitya26 dmitya26 Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Thus, "[]-]" matches just the two characters ']' and
'-', and "[--0]" matches the three characters '-', '.', and '0',
since '/' cannot be matched.)

This would indicate that a range which includes a '/' character as a non-literal would match that range but exclude the '/' character, at least with my interpretation.

I got that from the glob manpage.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2.13.3.1 looks to back my interpretation:

  1. If path separator appears between brackets, be it a single character or next to a hyphen, escape entire bracket expression. For '\\' in seps case, be careful to check it is not an escape but actual '\\\\'.
  2. Else, if any hyphened range spans a separator, add a negative lookahead. For simplicity, it can also be added for any bracket expression with a hyphen, or any bracket at all - result is the same, just simplifies regex in most cases.
  3. All bracket expressions are analyzed separately, so path separator in one does not invalidate and escape all others.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmitya26 I don't have Python on hand, can you just quickly run glob.glob('a[/-b]c') on a following tree:

|-- abc
`-- a[
    `-- -b]c

If it returns [abc], then you are correct, if it returns the file in subdir then my interpretation seems to match existing implementation.

Copy link
Author

@dmitya26 dmitya26 Mar 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns '[]'.

edit: Oh wait I think I might've misread how the directories need to be structured.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

├── a[
│   └── -b]c
└── abc

and

glob.glob('a[/-b]c')

would return

['a[/-b]c']

for me.

Copy link
Author

@dmitya26 dmitya26 Mar 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait so regarding the spec, do you think we should be disallowing only '/' characters, the system's path separator (os.path.sep), or all path separators mentioned like the ones in glob.translate?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current implementation already extends the spec to all given separators, e.g. glob.translate('abc?', seps=['/', '\\']) maps to '(?s:abc[^/\\\\])\\Z'.

@dmitya26
Copy link
Author

@barneygale Alright . I just pushed the implementation Dkaszews proposed earlier as that seems to be the most compliant with the spec you mentioned earlier on. I can also get you the initial implementation you showed where it uses a lookahead to exclude path separators from the range though if you feel that would be better. Feel free to take a look! :)

@dkaszews
Copy link

@dmitya26 Looks good, could you please also add test cases for [abc/], [%-/], [/-0] and [/-/] to show that they are all escaped?

@picnixz picnixz removed the type-bug An unexpected behavior, bug, or error label Mar 12, 2025
@picnixz
Copy link
Member

picnixz commented Mar 12, 2025

(type-bug is reserved for the issues generally)

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert all new lines changes that are un-necessary. New lines are added to separate logical sections of a function (the stdlib is quite compactly written).

In addition, please add more tests, some tests with multiple ranges, [%-0][1-9] for instance, some with incomplete ranges, some with side-by-side ranges, some with collapsing ranges. I may think of more once the implementation is stable.

@@ -263,7 +263,6 @@ def escape(pathname):
_dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0)
_no_recurse_symlinks = object()


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert.

@bedevere-app
Copy link

bedevere-app bot commented Mar 12, 2025

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

@dkaszews
Copy link

dkaszews commented Mar 12, 2025

Just though of something, doesn't [!a] currently translate trivially to [^a]? Because that also needs a negative lookahead, otherwise a[!b]c will also falsely match a/c.

Edit: Instead of negative lookahead, a more compact solution would be to replace [!...] with [^/...].

@dmitya26
Copy link
Author

I have made the requested changes; please review again.

@bedevere-app
Copy link

bedevere-app bot commented Mar 17, 2025

Thanks for making the requested changes!

@picnixz: please review the changes made to this pull request.

@bedevere-app bedevere-app bot requested a review from picnixz March 17, 2025 07:07
@dmitya26
Copy link
Author

dmitya26 commented Mar 17, 2025

Just though of something, doesn't [!a] currently translate trivially to [^a]? Because that also needs a negative lookahead, otherwise a[!b]c will also falsely match a/c.

Edit: Instead of negative lookahead, a more compact solution would be to replace [!...] with [^/...].

I definitely did implement this at some point, and it definitely is way easier than what's on my fork right now, I'm just not entirely confident it's spec compliant.

@dkaszews
Copy link

Wouldn't this not be spec compliant though?

What spec? The only spec concerns behavior of glob.glob, which says no class can match path separator. So (?!/)[^...] and [^/...] are the same, because they match the exact same set of files.

@dmitya26
Copy link
Author

Oh, my mistake. I can have the changes out to you later today! :)

Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked exactly at the implementation again because I want to be sure we're on the same page, especially concerning empty ranges.

@@ -263,7 +263,6 @@ def escape(pathname):
_dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0)
_no_recurse_symlinks = object()


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please revert.

else:
negative_lookahead=''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
negative_lookahead=''
negative_lookahead = ''

Lib/fnmatch.py Outdated
@@ -135,6 +138,9 @@ def _translate(pat, star, question_mark):
if chunks[k-1][-1] > chunks[k][0]:
chunks[k-1] = chunks[k-1][:-1] + chunks[k][1:]
del chunks[k]
if len(chunks)>1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if len(chunks)>1:
if len(chunks) > 1:

@@ -513,7 +513,14 @@ def fn(pat):
return glob.translate(pat, recursive=True, include_hidden=True, seps=['/', '\\'])
self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z')
self.assertEqual(fn('**/*'), r'(?s:(?:.+[/\\])?[^/\\]+)\Z')

self.assertEqual(fn('foo[!a]bar'), r'(?s:foo(?![/\\])[^a]bar)\Z')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need new tests for fnmatch.translate.

self.assertEqual(fn('foo[%-0]bar'), r'(?s:foo(?![/\\])[%-0]bar)\Z')
self.assertEqual(fn('foo[%-0][1-9]bar'), r'(?s:foo(?![/\\])[%-0][1-9]bar)\Z')
self.assertEqual(fn('foo[0-%]bar'), r'(?s:foo(?!)bar)\Z')
self.assertEqual(fn('foo[^-'), r'(?s:foo\[\^\-)\Z')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need also a test case with multiple ranges and incomplete ones, e.g., [0-%][0-%[0-%]. And possibly with an additional tail after the last range.

@@ -513,7 +513,14 @@ def fn(pat):
return glob.translate(pat, recursive=True, include_hidden=True, seps=['/', '\\'])
self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More generally, can you upodate test_translate_matching and include the examples of https://man7.org/linux/man-pages/man7/glob.7.html so that we have a compliant implementation?

@@ -0,0 +1 @@
Glob.translate negative-lookaheads path separators regex ranges that ecompass path seperator. For ranges which include path separator literals, the range is escaped.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requires a better indication. In addition, a versionchanged:: next should be added for both glob.translate() and fnmatch.translate(). Note that the meaning of / in fnmatch.translate() is different from glob.translate() because / is not special at all.

Suggested change
Glob.translate negative-lookaheads path separators regex ranges that ecompass path seperator. For ranges which include path separator literals, the range is escaped.
:func:`glob.translate` now correctly handles ranges implicitly containing path
separators (for instance, ``[0-%]`` contains ``/``). In addition, ranges including
path separator literals are now correctly escaped, as specified by POSIX specifications.

This suggestion is not perfect so we will likely come back later. However for the translate() functions need to be updated.

@dmitya26
Copy link
Author

The empty ranges were replaced with a negative lookahead before I even opened the PR. I think we should leave it as is and remove the test case. The reason I wrote that test case was to insure that I wasn't altering its behavior by accident when we were discussing how to handle invalid ranges all the way back in the beginning of the issue thread.

@dkaszews
Copy link

To clarify, because "empty ranges" can be a bit ambiguous:

  • Immediately closed class [] - ] as first character gets implicitly escaped, may become part of bigger class such as [][] is actually [\]\[], i.e. either literal [ or ]. Since glob spec matches Python regex, no special handling needed.
  • Classes that are not empty, but nevertheless cannot match anything, usually due to a backwards range such as [z-a]. Again, could be left alone, but current implementation simplifies them to empty negative lookahead (?!) which has the same semantic of never matching anything.

@dmitya26
Copy link
Author

Yea, I think it's best to leave it as is. I never intended on changing it and I don't think it is impacting the current issue at all.

@picnixz
Copy link
Member

picnixz commented Mar 17, 2025

but current implementation simplifies them to empty negative lookahead (?!) which has the same semantic of never matching anything.

Ups, I think I only remembered the part were we remove empty ranges, but then make them match nothing. False alarm, my bad!

@dmitya26
Copy link
Author

dmitya26 commented Mar 19, 2025

Alright.

I changed the negative lookahead for '!' matching. I also added some more tests which account for rules mentioned in the manpage as you suggested. I am seeing now that I was a bit lacking on the test_translate_matching testcases, so I'll get to adding more of those, but if you see anything more that I haven't noticed yet lmk.

edit: In my next commit I'm also going to remove the newline in the documentation file that's currently failing the CI.

@dmitya26
Copy link
Author

@picnixz I've made the requested changes! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants