-
-
Notifications
You must be signed in to change notification settings - Fork 32k
mishandling of c-strings in parser #96670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
There was a discussion about the larger issue of code being ”misleadingly different from what it looks like", and the general consensus was that this should be solved in editors, linters and code review tools, rather than in Python. IMO it was this thread but might need more digging. |
that was the rationale for making this public, but it's different than those as it's a mishandling by the parser rather than a quirk of how unicode displays |
Ugh, this is freakin amazing.
And run it:
|
FWIW I'd be in favor of that behavior change as a Feature. I wouldn't backport it as a bug though. |
there's kinda 3 things I think:
does it make sense to pursue these three things and separately? |
Agreed, it's changing behavior, so can't backport. I like SyntaxError for this situation. The root exception isn't in ast.parse(), it's in compile(), probably even deeper.
That's the biggest hole, should probably do this first.
Depends on where that fix is (can you tell I haven't looked at the source code yet? :-). If any of the affected functions are public it's going to be more difficult.
Likely. |
Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work. |
Are you sure that works? Unlike Unix shells, Python parses the entire source file before executing any code. How would you get Python to ignore a blob of arbitrary binary data embedded in the source code, even if But if I had to do something like that I'd probably just embed the Python code in a bash script as a "here" document and end the bash script with |
Oh you're right, I guess what I've seen do that is a bash+python+data hybrid monster. |
Okay then we should be safe banning |
I suspect one could probably hack something together like this unfortunately # coding: latin1
with open(__file__, 'rb') as f:
contents = f.read().split(b'### BINARY\n')[1]
GARB = '''\
### BINARY
(actual binary here)
### BINARY
''' |
You'd still have to arrange for the actual binary not to contain the sequence |
latin1 is to prevent a decoding error while parsing the source -- I linked an in-the-wild example of such a file in the original post |
Okay, that's impressive. It looks like the blob is lightly encoded -- The proposed change in 3.12 will break them, but they have version checks and they can just cope with it, I don't think we need to preserve this machine-dependent quirk forever. The code looks like it had to deal with various other versioning issues already (e.g. it tries to handle Python 2 and 3!). |
The root issue of this bug is that the Python parser is implemented in C which treats the NUL byte/character as the string terminator? So it's more a limitation of the current CPython implementation, but Python might support NUL bytes/characters later? Or do you think that it's always a bad practice to have NUL bytes/characters in a source file? IMO it's always misleading to have NUL in a source file and it should be banned. I don't see any legit use case for that. By the way, in Python, it's trivial there are many ways to create a byte, character, or string containing NUL:
|
I agree we should always ban NUL in source files, like we already do when parsing from a string. |
Oh right, I didn't notice that!
So yeah, I agree to always ban null characters/bytes in files as well. |
This is actually a real (albeit very odd) use case for me! It can be used to demonstrate hash collision attacks. There are tools which will generate two blobs of binary data with the same MD5 hash, which can be used to create two Python files with the same hash, but different behaviors. This is a fairly common lab for intro security classes (example, other example). This is just another reason to disable this functionality IMO, but there could be other unexpected reasons why somebody may need to include arbitrary binary in a source file. |
It seems like the issue #97556 is a duplicate of this issue. |
They're very similar, although I don't believe they're exact duplicates- #97556 deals with a particular parsing error in Python 3.10, which fatally fails to parse a string literal containing a NULL character. This issue deals with the larger problem of all source code being ignored after a null character: it applies even outside of string literals, although doing so is a syntax error, and in pre-Python 3.10 versions as well. |
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
Closing this. Starting with Python 3.12, NUL bytes won't be allowed in source code read from files. We're not backporting this since this could be considered a feature by some. |
Just for reference: The problem is already mentioned in https://peps.python.org/pep-0672/#control-characters And Pylint checks for null characters already. |
…honGH-97594). (cherry picked from commit aab01e3) Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>
Bug report
the parser mishandles lines containing null bytes when parsing source -- this allows the code to be misleadingly different from what it looks like.
I've been told by security@ that it is ok to post this publicly.
in the below example,
<NUL>
is an actual null byte:and the execution and appearance in the terminal:
it appears that after splitting the source into lines, the individual lines are treated as c strings and so the null terminator is misinterpreted, jamming the string contents together and it executes similar to this:
note that if you want to write out a file like this here's a simple bit of code you can paste into an interactive prompt:
here is perhaps a shorter example:
I originally found this due to a bug report where the
ast
parser rejects code containing null bytes:ideally I would want the interpreter to reject files containing null bytes as a
SyntaxError
(and update theast.parse
error to aSyntaxError
as well) -- though it appears there are some of these files in the wild -- such as https://github.com/univention/univention-corporate-server/blob/5.0-2/services/univention-ldb-modules/buildtools/bin/waf-svnYour environment
Linked PRs
The text was updated successfully, but these errors were encountered: