gh-130942: Fix path seperator matched in character ranges for glob.translate #130989

dmitya26 · 2025-03-08T22:55:57Z

Issue: glob.translate incorrectly matches path separator in character ranges #130942

cpython-cla-bot · 2025-03-08T22:55:59Z

All commit authors signed the Contributor License Agreement.

bedevere-app · 2025-03-08T22:56:01Z

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

dmitya26 · 2025-03-10T19:17:27Z

@barneygale @picnixz
PR is ready for review! :)

dmitya26 · 2025-03-10T19:28:48Z

+type-bug -tests

barneygale · 2025-03-10T19:28:54Z

Thanks v much for taking a look!

Range expressions like [%-0] are still valid, so we should evaluate them as wildcards rather than matching literally IMO. Basically we just need to apply an additional restriction: don't match a separator. We could do that with a lookahead (untested):

diff --git a/Lib/fnmatch.py b/Lib/fnmatch.py
index 865baea2346..ee35dd4d24c 100644
--- a/Lib/fnmatch.py
+++ b/Lib/fnmatch.py
@@ -145,8 +145,10 @@ def _translate(pat, star, question_mark):
                     add('(?!)')
                 elif stuff == '!':
                     # Negated empty range: match any character.
-                    add('.')
+                    add(question_mark)
                 else:
+                    if question_mark != '.':
+                        add(f'(?={question_mark})')
                     # Escape set operations (&&, ~~ and ||).
                     stuff = _re_setops_sub(r'\\\1', stuff)
                     if stuff[0] == '!':
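
The effect of the proposed lookahead can be tried with plain `re`, independently of `fnmatch`. This is only a sketch: it assumes `/` is the sole separator, so the `question_mark` argument from the diff is taken to be `[^/]`, and the variable names `plain` and `guarded` are made up for illustration.

```python
import re

# Assumption: '/' is the only separator, so the "any non-separator" pattern
# (the question_mark argument in the diff above) is '[^/]'.
question_mark = "[^/]"

# The range [%-0] covers '%' through '0', which includes '/'.
plain = re.compile(r"foo[%-0]bar")
# Same range, preceded by a lookahead that refuses a separator.
guarded = re.compile(rf"foo(?={question_mark})[%-0]bar")

for name in ("foo.bar", "foo-bar", "foo/bar"):
    print(name, bool(plain.fullmatch(name)), bool(guarded.fullmatch(name)))
```

Only `foo/bar` should differ between the two columns: the unguarded pattern accepts it, the guarded one does not.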

dkaszews · 2025-03-11T18:12:48Z

Lib/test/test_glob.py

@@ -514,6 +514,9 @@ def fn(pat):
        self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z')
        self.assertEqual(fn('**/*'), r'(?s:(?:.+[/\\])?[^/\\]+)\Z')

+        self.assertEqual(fn('foo[%-0]bar'), r'(?s:foo\[%-0\]bar)\Z')


I'm not sure this is correct. From my understanding of the manpages quoted in the issue, a class should be escaped only if it contains a literal path separator, not a range encompassing it. In the latter case, we just need to exclude the separator.

[%-0] => (?!/)[%-0]
[ab/] => \[ab/\]

Edge case to be tested in bash and glob.glob: is a range beginning with separator ([%-/] or [/-0]) the first case or the second one? What about corner case of single element range [/-/]? I would say that all three should be escaped since they "contain an explicit / character".

Also, does

A range containing an
explicit '/' character is syntactically incorrect. (POSIX requires that
syntactically incorrect patterns are left unchanged.)

mean that the entire glob should be escaped, or just the part with the separator? I.e., does [ab][0/][xy] map to [ab]\[0/\][xy] or \[ab\]\[0/\]\[xy\]?

Relevant standard seems to be here: https://pubs.opengroup.org/onlinepubs/9699919799.2008edition/utilities/V3_chap02.html#tag_18_13_01

(Thus, "[]-]" matches just the two characters ']' and
'-', and "[--0]" matches the three characters '-', '.', and '0',
since '/' cannot be matched.)

This would indicate that a range which includes a '/' character as a non-literal would match that range but exclude the '/' character, at least with my interpretation.

I got that from the glob manpage.

2.13.3.1 looks to back my interpretation:

If path separator appears between brackets, be it a single character or next to a hyphen, escape entire bracket expression. For '\\' in seps case, be careful to check it is not an escape but actual '\\\\'.

Else, if any hyphened range spans a separator, add a negative lookahead. For simplicity, it can also be added for any bracket expression with a hyphen, or any bracket at all - result is the same, just simplifies regex in most cases.

All bracket expressions are analyzed separately, so path separator in one does not invalidate and escape all others.
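
The three rules above can be sketched as a small per-bracket helper. This is an illustration only, not the code from this PR: `translate_bracket` is a hypothetical name, it receives just the text between `[` and `]`, and it ignores backslash escapes, POSIX classes such as `[:alpha:]`, and empty bodies.

```python
import re

def translate_bracket(body, seps="/"):
    """Translate the text between '[' and ']' into a regex fragment.

    Hypothetical helper illustrating the rules above; not the PR's code.
    """
    # Rule 1: a literal separator anywhere inside the brackets (alone or as
    # a range endpoint) means the whole expression is matched literally.
    if any(sep in body for sep in seps):
        return re.escape(f"[{body}]")

    negate = body.startswith("!")
    items = body[1:] if negate else body

    # Rule 2: a hyphenated range such as %-0 may span a separator.
    spans_sep = any(
        lo < sep < hi
        for lo, hi in re.findall(r"(.)-(.)", items)
        for sep in seps
    )

    # Build the regex character class, keeping a leading '^' literal.
    cls = items.replace("\\", "\\\\")
    if negate:
        cls = "^" + cls
    elif cls.startswith("^"):
        cls = "\\" + cls

    # Negated classes would otherwise match a separator too, so they get
    # the same guard as ranges that span one.
    guard = f"(?![{re.escape(seps)}])" if (spans_sep or negate) else ""
    return f"{guard}[{cls}]"
```

Because the helper looks at one bracket expression at a time, a pattern like `[ab][0/][xy]` would only have its middle expression escaped, which corresponds to the third rule.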

@dmitya26 I don't have Python on hand, can you just quickly run glob.glob('a[/-b]c') on the following tree:

|-- abc
`-- a[
    `-- -b]c

If it returns [abc], then you are correct, if it returns the file in subdir then my interpretation seems to match existing implementation.

It returns '[]'.

edit: Oh wait I think I might've misread how the directories need to be structured.

├── a[
│ └── -b]c
└── abc

and

glob.glob('a[/-b]c')

would return

['a[/-b]c']

for me.

Wait so regarding the spec, do you think we should be disallowing only '/' characters, the system's path separator (os.path.sep), or all path separators mentioned like the ones in glob.translate?

Current implementation already extends the spec to all given separators, e.g. glob.translate('abc?', seps=['/', '\\']) maps to '(?s:abc[^/\\\\])\\Z'.
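
As a quick check of the regex quoted above, it can be compiled directly with `re`. The sketch copies the pattern from the comment rather than calling `glob.translate`, so it does not depend on which Python version is installed.

```python
import re

# Regex copied from the comment above (the result for seps=['/', '\\']).
pattern = re.compile(r"(?s:abc[^/\\])\Z")

print(bool(pattern.match("abcd")))   # True: 'd' is not a separator
print(bool(pattern.match("abc/")))   # False: '/' is excluded
print(bool(pattern.match("abc\\")))  # False: '\' is excluded as well
```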

dmitya26 · 2025-03-12T07:31:15Z

@barneygale Alright. I just pushed the implementation dkaszews proposed earlier, as that seems to be the most compliant with the spec you mentioned earlier on. I can also restore the initial implementation you showed, which uses a lookahead to exclude path separators from the range, if you feel that would be better. Feel free to take a look! :)

dkaszews · 2025-03-12T08:39:04Z

@dmitya26 Looks good, could you please also add test cases for [abc/], [%-/], [/-0] and [/-/] to show that they are all escaped?

picnixz · 2025-03-12T14:28:04Z

(type-bug is reserved for the issues generally)

picnixz

Please revert all newline changes that are unnecessary. Blank lines are added to separate logical sections of a function (the stdlib is quite compactly written).

In addition, please add more tests, some tests with multiple ranges, [%-0][1-9] for instance, some with incomplete ranges, some with side-by-side ranges, some with collapsing ranges. I may think of more once the implementation is stable.

Lib/fnmatch.py

picnixz · 2025-03-12T14:30:40Z

Lib/glob.py

@@ -263,7 +263,6 @@ def escape(pathname):
 _dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0)
 _no_recurse_symlinks = object()

-


Please revert

Please revert.

bedevere-app · 2025-03-12T14:41:35Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

dkaszews · 2025-03-12T22:00:06Z

Just thought of something: doesn't [!a] currently translate trivially to [^a]? Because that also needs a negative lookahead, otherwise a[!b]c will also falsely match a/c.

Edit: Instead of negative lookahead, a more compact solution would be to replace [!...] with [^/...].

dmitya26 · 2025-03-17T07:07:26Z

I have made the requested changes; please review again.

bedevere-app · 2025-03-17T07:07:31Z

Thanks for making the requested changes!

@picnixz: please review the changes made to this pull request.

dmitya26 · 2025-03-17T15:20:29Z

Just thought of something: doesn't [!a] currently translate trivially to [^a]? Because that also needs a negative lookahead, otherwise a[!b]c will also falsely match a/c.

Edit: Instead of negative lookahead, a more compact solution would be to replace [!...] with [^/...].

I definitely did implement this at some point, and it definitely is way easier than what's on my fork right now, I'm just not entirely confident it's spec compliant.

dkaszews · 2025-03-17T15:23:20Z

Wouldn't this not be spec compliant though?

What spec? The only spec concerns behavior of glob.glob, which says no class can match path separator. So (?!/)[^...] and [^/...] are the same, because they match the exact same set of files.
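
The claimed equivalence is easy to check by brute force. The sketch below compares the two forms over all ASCII characters, using `[!ab]` as an arbitrary example class and assuming `/` is the only separator.

```python
import re

# Two candidate translations of the glob class [!ab].
with_lookahead = re.compile(r"(?!/)[^ab]")
with_extra_member = re.compile(r"[^/ab]")

# Compare the two patterns on every single ASCII character.
chars = [chr(i) for i in range(128)]
same = all(
    bool(with_lookahead.fullmatch(c)) == bool(with_extra_member.fullmatch(c))
    for c in chars
)
print(same)
```

Both patterns reject `/`, `a` and `b` and accept everything else, so the choice between them is about readability of the generated regex rather than behavior.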

dmitya26 · 2025-03-17T15:25:53Z

Oh, my mistake. I can have the changes out to you later today! :)

picnixz

I haven't looked exactly at the implementation again because I want to be sure we're on the same page, especially concerning empty ranges.

picnixz · 2025-03-17T15:35:41Z

Lib/glob.py

@@ -263,7 +263,6 @@ def escape(pathname):
 _dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0)
 _no_recurse_symlinks = object()

-


Please revert.

picnixz · 2025-03-17T15:37:23Z

Lib/fnmatch.py

                else:
+                    negative_lookahead=''


Suggested change

-negative_lookahead=''
+negative_lookahead = ''

picnixz · 2025-03-17T16:17:41Z

Lib/fnmatch.py

@@ -135,6 +138,9 @@ def _translate(pat, star, question_mark):
                        if chunks[k-1][-1] > chunks[k][0]:
                            chunks[k-1] = chunks[k-1][:-1] + chunks[k][1:]
                            del chunks[k]
+                    if len(chunks)>1:


Suggested change

-if len(chunks)>1:
+if len(chunks) > 1:

picnixz · 2025-03-17T16:19:06Z

Lib/test/test_glob.py

@@ -513,7 +513,14 @@ def fn(pat):
            return glob.translate(pat, recursive=True, include_hidden=True, seps=['/', '\\'])
        self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z')
        self.assertEqual(fn('**/*'), r'(?s:(?:.+[/\\])?[^/\\]+)\Z')
-
+        self.assertEqual(fn('foo[!a]bar'), r'(?s:foo(?![/\\])[^a]bar)\Z')


We also need new tests for fnmatch.translate.

picnixz · 2025-03-17T16:25:53Z

Lib/test/test_glob.py

+        self.assertEqual(fn('foo[%-0]bar'), r'(?s:foo(?![/\\])[%-0]bar)\Z')
+        self.assertEqual(fn('foo[%-0][1-9]bar'), r'(?s:foo(?![/\\])[%-0][1-9]bar)\Z')
+        self.assertEqual(fn('foo[0-%]bar'), r'(?s:foo(?!)bar)\Z')
+        self.assertEqual(fn('foo[^-'), r'(?s:foo\[\^\-)\Z')


We need also a test case with multiple ranges and incomplete ones, e.g., [0-%][0-%[0-%]. And possibly with an additional tail after the last range.

Lib/test/test_glob.py

picnixz · 2025-03-17T16:35:11Z

Lib/test/test_glob.py

@@ -513,7 +513,14 @@ def fn(pat):
            return glob.translate(pat, recursive=True, include_hidden=True, seps=['/', '\\'])
        self.assertEqual(fn('foo/bar\\baz'), r'(?s:foo[/\\]bar[/\\]baz)\Z')


More generally, can you update test_translate_matching and include the examples of https://man7.org/linux/man-pages/man7/glob.7.html so that we have a compliant implementation?

picnixz · 2025-03-17T16:39:16Z

Misc/NEWS.d/next/Library/2025-03-08-23-26-50.gh-issue-130942.jxRMK_.rst

@@ -0,0 +1 @@
+Glob.translate negative-lookaheads path separators regex ranges that ecompass path seperator. For ranges which include path separator literals, the range is escaped.


This requires a better indication. In addition, a versionchanged:: next should be added for both glob.translate() and fnmatch.translate(). Note that the meaning of / in fnmatch.translate() is different from glob.translate() because / is not special at all.

Suggested change

-Glob.translate negative-lookaheads path separators regex ranges that ecompass path seperator. For ranges which include path separator literals, the range is escaped.
+:func:`glob.translate` now correctly handles ranges implicitly containing path
+separators (for instance, ``[%-0]`` contains ``/``). In addition, ranges including
+path separator literals are now correctly escaped, as specified by POSIX specifications.

This suggestion is not perfect, so we will likely come back to it later. However, the documentation for the translate() functions needs to be updated.
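
The remark that `/` is not special for `fnmatch` can be seen with the public matching function. A minimal sketch, using `fnmatch.fnmatchcase` so the result does not depend on platform case normalization:

```python
import fnmatch

# In fnmatch, '/' is an ordinary character: wildcards and classes match it.
print(fnmatch.fnmatchcase("a/c", "a?c"))              # True
print(fnmatch.fnmatchcase("a/c", "a[!b]c"))           # True
print(fnmatch.fnmatchcase("dir/file.txt", "*.txt"))   # True
```

This is why the separator restriction discussed in this PR belongs to the `glob.translate()` path and should not alter plain `fnmatch` matching.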

dmitya26 · 2025-03-17T20:15:09Z

The empty ranges were replaced with a negative lookahead before I even opened the PR. I think we should leave it as is and remove the test case. The reason I wrote that test case was to ensure that I wasn't altering its behavior by accident when we were discussing how to handle invalid ranges all the way back in the beginning of the issue thread.

dkaszews · 2025-03-17T20:33:03Z

To clarify, because "empty ranges" can be a bit ambiguous:

1. Immediately closed class []: a ] as the first character gets implicitly escaped and may become part of a bigger class, e.g. [][] is actually [\]\[], i.e. either a literal [ or ]. Since the glob spec matches Python regex here, no special handling is needed.
2. Classes that are not empty, but nevertheless cannot match anything, usually due to a backwards range such as [z-a]. Again, could be left alone, but current implementation simplifies them to empty negative lookahead (?!) which has the same semantic of never matching anything.
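
Both cases can be illustrated with `re` alone. The sketch uses the `[]-]` example from the manpage quoted earlier for the first case and a hand-written `(?!)` for the second; it does not call `fnmatch.translate`, so it says nothing about what a given Python version actually emits.

```python
import re

# Case 1: a leading ']' is a literal member of the class, as in glob.
leading_bracket = re.compile(r"[]-]")
print(bool(leading_bracket.fullmatch("]")))  # True
print(bool(leading_bracket.fullmatch("-")))  # True
print(bool(leading_bracket.fullmatch("a")))  # False

# Case 2: an empty negative lookahead can never succeed, so a pattern
# containing it matches nothing.
never = re.compile(r"foo(?!)bar")
print(never.fullmatch("foobar"))  # None
print(never.search("foo/bar"))    # None
```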

dmitya26 · 2025-03-17T20:35:59Z

Yea, I think it's best to leave it as is. I never intended on changing it and I don't think it is impacting the current issue at all.

picnixz · 2025-03-17T20:41:36Z

but current implementation simplifies them to empty negative lookahead (?!) which has the same semantic of never matching anything.

Oops, I think I only remembered the part where we remove empty ranges, but not that we then make them match nothing. False alarm, my bad!

dmitya26 · 2025-03-19T05:41:23Z

Alright.

I changed the negative lookahead for '!' matching. I also added some more tests which account for rules mentioned in the manpage as you suggested. I am seeing now that I was a bit lacking on the test_translate_matching testcases, so I'll get to adding more of those, but if you see anything more that I haven't noticed yet lmk.

edit: In my next commit I'm also going to remove the newline in the documentation file that's currently failing the CI.

dmitya26 · 2025-03-20T17:12:52Z

@picnixz I've made the requested changes! :)

added testcase for globbing with a ranged seperator

be37b54

bedevere-app bot added the tests Tests in the Lib/test dir label Mar 8, 2025

bedevere-app bot added the awaiting review label Mar 8, 2025

bedevere-app bot mentioned this pull request Mar 8, 2025

glob.translate incorrectly matches path separator in character ranges #130942

Open

Merge branch 'main' into glob_translations

b874745

dmitya26 marked this pull request as draft March 8, 2025 23:09

bedevere-app bot removed the awaiting review label Mar 8, 2025

blurb-it bot and others added 8 commits March 8, 2025 23:26

📜🤖 Added by blurb_it.

6990566

Merge branch 'main' into glob_translations

cc03a6d

WIP - need to refine glob testcases.

cea1f5e

Merge branch 'glob_translations' of https://github.com/dmitya26/cpython into glob_translations

5251d75

Escape regex ranges including seperators in glob.translate.

dd1b155

Merge branch 'main' into glob_translations

e8b3559

Typo function name in glob.py

9f461a5

Merge branch 'glob_translations' of https://github.com/dmitya26/cpython into glob_translations

4820018

dmitya26 marked this pull request as ready for review March 10, 2025 19:16

bedevere-app bot added the awaiting review label Mar 10, 2025

barneygale added type-bug An unexpected behavior, bug, or error and removed tests Tests in the Lib/test dir labels Mar 10, 2025

dkaszews reviewed Mar 11, 2025

dmitya26 added 2 commits March 12, 2025 00:05

Lookahead to ignore path separators in ranges which span path separators in fnmatch._translate

c7f6d87

Added empty negative lookahead in front of ranges which encompass path separator in fnmatch._translate().

d5748b8

picnixz removed the type-bug An unexpected behavior, bug, or error label Mar 12, 2025

picnixz requested changes Mar 12, 2025

bedevere-app bot added awaiting changes and removed awaiting review labels Mar 12, 2025

dmitya26 added 2 commits March 13, 2025 11:34

Revert "Added empty negative lookahead in front of ranges which encompass path separator in fnmatch._translate()."

95b4ccf

Refine testcases and and escape ranges including path separator literals.

cdfcf47

bedevere-app bot added awaiting change review and removed awaiting changes labels Mar 17, 2025

bedevere-app bot requested a review from picnixz March 17, 2025 07:07

fix blurb.

3929b06

picnixz reviewed Mar 17, 2025

Refine fnmatch translate and glob translate testcases.

e5abc80

Add some more matching tests for glob tests.

93c3092
