-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
gh-130942: Fix path seperator matched in character ranges for glob.translate #130989
base: main
Are you sure you want to change the base?
Conversation
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool. If this change has little impact on Python users, wait for a maintainer to apply the |
… into glob_translations
… into glob_translations
@barneygale @picnixz |
+type-bug -tests |
Thanks v much for taking a look! Range expressions like diff --git a/Lib/fnmatch.py b/Lib/fnmatch.py
index 865baea2346..ee35dd4d24c 100644
--- a/Lib/fnmatch.py
+++ b/Lib/fnmatch.py
@@ -145,8 +145,10 @@ def _translate(pat, star, question_mark):
add('(?!)')
elif stuff == '!':
# Negated empty range: match any character.
- add('.')
+ add(question_mark)
else:
+ if question_mark != '.':
+ add(f'(?={question_mark})')
# Escape set operations (&&, ~~ and ||).
stuff = _re_setops_sub(r'\\\1', stuff)
if stuff[0] == '!': |
Lib/test/test_glob.py
Outdated
@@ -514,6 +514,9 @@ def fn(pat): | |||
self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z') | |||
self.assertEqual(fn('**/*'), r'(?s:(?:.+[/\\])?[^/\\]+)\Z') | |||
|
|||
self.assertEqual(fn('foo[%-0]bar'), r'(?s:foo\[%-0\]bar)\Z') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is correct. From my understanding of manpages quoted in the issue, a class should be escaped only if it contains a literal path separator, not a range encompassing it. In latter case, we need to just exclude the separator.
[%-0] => (?!/)[%-0]
[ab/] => \[ab/\]
Edge case to be tested in bash and glob.glob
: is a range beginning with separator ([%-/]
or [/-0]
) the first case or the second one? What about corner case of single element range [/-/]
? I would say that all three should be escaped since they "contain an explicit /
character".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, does
A range containing an
explicit '/' character is syntactically incorrect. (POSIX requires that
syntactically incorrect patterns are left unchanged.)
mean that entire glob should be escaped, or just the part with the separator? I.e, does [ab][0/][xy]
map to [ab]\[0/\][xy]
or \[ab\]\[0/\]\[xy\]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Relevant standard seems to be here: https://pubs.opengroup.org/onlinepubs/9699919799.2008edition/utilities/V3_chap02.html#tag_18_13_01
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Thus, "[]-]" matches just the two characters ']' and
'-', and "[--0]" matches the three characters '-', '.', and '0',
since '/' cannot be matched.)
This would indicate that a range which includes a '/' character as a non-literal would match that range but exclude the '/' character, at least with my interpretation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2.13.3.1 looks to back my interpretation:
- If path separator appears between brackets, be it a single character or next to a hyphen, escape entire bracket expression. For
'\\' in seps
case, be careful to check it is not an escape but actual'\\\\'
. - Else, if any hyphened range spans a separator, add a negative lookahead. For simplicity, it can also be added for any bracket expression with a hyphen, or any bracket at all - result is the same, just simplifies regex in most cases.
- All bracket expressions are analyzed separately, so path separator in one does not invalidate and escape all others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dmitya26 I don't have Python on hand, can you just quickly run glob.glob('a[/-b]c')
on a following tree:
|-- abc
`-- a[
`-- -b]c
If it returns [abc]
, then you are correct, if it returns the file in subdir then my interpretation seems to match existing implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns '[]'.
edit: Oh wait I think I might've misread how the directories need to be structured.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
├── a[
│ └── -b]c
└── abc
and
glob.glob('a[/-b]c')
would return
['a[/-b]c']
for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait so regarding the spec, do you think we should be disallowing only '/' characters, the system's path separator (os.path.sep), or all path separators mentioned like the ones in glob.translate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Current implementation already extends the spec to all given separators, e.g. glob.translate('abc?', seps=['/', '\\'])
maps to '(?s:abc[^/\\\\])\\Z'
.
…ors in fnmatch._translate
…h separator in fnmatch._translate().
@barneygale Alright . I just pushed the implementation Dkaszews proposed earlier as that seems to be the most compliant with the spec you mentioned earlier on. I can also get you the initial implementation you showed where it uses a lookahead to exclude path separators from the range though if you feel that would be better. Feel free to take a look! :) |
@dmitya26 Looks good, could you please also add test cases for |
(type-bug is reserved for the issues generally) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert all new lines changes that are un-necessary. New lines are added to separate logical sections of a function (the stdlib is quite compactly written).
In addition, please add more tests, some tests with multiple ranges, [%-0][1-9]
for instance, some with incomplete ranges, some with side-by-side ranges, some with collapsing ranges. I may think of more once the implementation is stable.
@@ -263,7 +263,6 @@ def escape(pathname): | |||
_dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0) | |||
_no_recurse_symlinks = object() | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert.
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Just though of something, doesn't Edit: Instead of negative lookahead, a more compact solution would be to replace |
…pass path separator in fnmatch._translate()."
I have made the requested changes; please review again. |
Thanks for making the requested changes! @picnixz: please review the changes made to this pull request. |
I definitely did implement this at some point, and it definitely is way easier than what's on my fork right now, I'm just not entirely confident it's spec compliant. |
What spec? The only spec concerns behavior of |
Oh, my mistake. I can have the changes out to you later today! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked exactly at the implementation again because I want to be sure we're on the same page, especially concerning empty ranges.
@@ -263,7 +263,6 @@ def escape(pathname): | |||
_dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0) | |||
_no_recurse_symlinks = object() | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please revert.
else: | ||
negative_lookahead='' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
negative_lookahead='' | |
negative_lookahead = '' |
Lib/fnmatch.py
Outdated
@@ -135,6 +138,9 @@ def _translate(pat, star, question_mark): | |||
if chunks[k-1][-1] > chunks[k][0]: | |||
chunks[k-1] = chunks[k-1][:-1] + chunks[k][1:] | |||
del chunks[k] | |||
if len(chunks)>1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if len(chunks)>1: | |
if len(chunks) > 1: |
Lib/test/test_glob.py
Outdated
@@ -513,7 +513,14 @@ def fn(pat): | |||
return glob.translate(pat, recursive=True, include_hidden=True, seps=['/', '\\']) | |||
self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z') | |||
self.assertEqual(fn('**/*'), r'(?s:(?:.+[/\\])?[^/\\]+)\Z') | |||
|
|||
self.assertEqual(fn('foo[!a]bar'), r'(?s:foo(?![/\\])[^a]bar)\Z') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need new tests for fnmatch.translate
.
self.assertEqual(fn('foo[%-0]bar'), r'(?s:foo(?![/\\])[%-0]bar)\Z') | ||
self.assertEqual(fn('foo[%-0][1-9]bar'), r'(?s:foo(?![/\\])[%-0][1-9]bar)\Z') | ||
self.assertEqual(fn('foo[0-%]bar'), r'(?s:foo(?!)bar)\Z') | ||
self.assertEqual(fn('foo[^-'), r'(?s:foo\[\^\-)\Z') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need also a test case with multiple ranges and incomplete ones, e.g., [0-%][0-%[0-%]
. And possibly with an additional tail after the last range.
@@ -513,7 +513,14 @@ def fn(pat): | |||
return glob.translate(pat, recursive=True, include_hidden=True, seps=['/', '\\']) | |||
self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More generally, can you upodate test_translate_matching
and include the examples of https://man7.org/linux/man-pages/man7/glob.7.html so that we have a compliant implementation?
@@ -0,0 +1 @@ | |||
Glob.translate negative-lookaheads path separators regex ranges that ecompass path seperator. For ranges which include path separator literals, the range is escaped. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires a better indication. In addition, a versionchanged:: next
should be added for both glob.translate()
and fnmatch.translate()
. Note that the meaning of /
in fnmatch.translate()
is different from glob.translate()
because /
is not special at all.
Glob.translate negative-lookaheads path separators regex ranges that ecompass path seperator. For ranges which include path separator literals, the range is escaped. | |
:func:`glob.translate` now correctly handles ranges implicitly containing path | |
separators (for instance, ``[0-%]`` contains ``/``). In addition, ranges including | |
path separator literals are now correctly escaped, as specified by POSIX specifications. |
This suggestion is not perfect so we will likely come back later. However for the translate() functions need to be updated.
The empty ranges were replaced with a negative lookahead before I even opened the PR. I think we should leave it as is and remove the test case. The reason I wrote that test case was to insure that I wasn't altering its behavior by accident when we were discussing how to handle invalid ranges all the way back in the beginning of the issue thread. |
To clarify, because "empty ranges" can be a bit ambiguous:
|
Yea, I think it's best to leave it as is. I never intended on changing it and I don't think it is impacting the current issue at all. |
Ups, I think I only remembered the part were we remove empty ranges, but then make them match nothing. False alarm, my bad! |
Alright. I changed the negative lookahead for '!' matching. I also added some more tests which account for rules mentioned in the manpage as you suggested. I am seeing now that I was a bit lacking on the test_translate_matching testcases, so I'll get to adding more of those, but if you see anything more that I haven't noticed yet lmk. edit: In my next commit I'm also going to remove the newline in the documentation file that's currently failing the CI. |
@picnixz I've made the requested changes! :) |
Issue: #130942