mishandling of c-strings in parser #96670

asottile · 2022-09-08T01:06:35Z

Bug report

the parser mishandles lines containing null bytes when parsing source -- this allows the code to be misleadingly different from what it looks like.

I've been told by security@ that it is ok to post this publicly.

in the below example, <NUL> is an actual null byte:

x = '<NUL>' nothing to see here
';import os;os.system('echo pwnd')

and the execution and appearance in the terminal:

$ cat t.py
x = '' nothing to see here
';import os;os.system('echo pwnd')
$ python3 t.py
pwnd

it appears that after splitting the source into lines, the individual lines are treated as c strings and so the null terminator is misinterpreted, jamming the string contents together and it executes similar to this:

x = '';import os;os.system('echo pwnd')
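
A rough model of the behaviour described above, for illustration only: it splits the file into lines and cuts each line at its first NUL, the way a C string function would. This is not the tokenizer's actual code, and the name `simulate_c_string_lines` is made up here.

```python
def simulate_c_string_lines(source: bytes) -> bytes:
    """Model (assumption): each physical line is read as a C string,
    so everything from the first NUL to the end of that line is lost,
    including the newline."""
    pieces = []
    # bytes.splitlines only splits on \n, \r and \r\n
    for line in source.splitlines(keepends=True):
        head, _nul, _rest = line.partition(b"\0")
        pieces.append(head)  # whole line if it has no NUL
    return b"".join(pieces)


example = b"x = '\0' nothing to see here\n';import os;os.system('echo pwnd')\n"
print(simulate_c_string_lines(example))
```

Under this model the two lines of the example collapse into the single statement line shown above.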

note that if you want to write out a file like this here's a simple bit of code you can paste into an interactive prompt:

open('t.py', 'w').write("x = '\0' nothing to see here\n';import os;os.system('echo pwnd')\n")

here is perhaps a shorter example:

open('t.py', 'w').write("x = 1\0 + 1\n+2\nprint(x)\n")

I originally found this due to a bug report where the ast parser rejects code containing null bytes:

>>> import ast
>>> ast.parse("x = '\0'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
ValueError: source code string cannot contain null bytes
>>> ast.parse(b"x = '\0'")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.8/ast.py", line 47, in parse
    return compile(source, filename, mode, flags,
ValueError: source code string cannot contain null bytes

ideally I would want the interpreter to reject files containing null bytes as a SyntaxError (and update the ast.parse error to a SyntaxError as well) -- though it appears there are some of these files in the wild -- such as https://github.com/univention/univention-corporate-server/blob/5.0-2/services/univention-ldb-modules/buildtools/bin/waf-svn
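
Until the interpreter rejects such files, a check outside the interpreter can flag them. A minimal sketch, assuming the file has already been read as bytes; `find_nul_lines` is a hypothetical helper, not an existing tool:

```python
def find_nul_lines(data: bytes) -> list[int]:
    """Return the 1-based numbers of the lines that contain a NUL byte."""
    return [
        lineno
        for lineno, line in enumerate(data.split(b"\n"), start=1)
        if b"\0" in line
    ]


# The shorter example from this report has its NUL on the first line.
print(find_nul_lines(b"x = 1\0 + 1\n+2\nprint(x)\n"))
```

Reading the file in binary mode matters here: decoding first could fail or hide the byte, depending on the declared encoding.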

Your environment

CPython versions tested on: 3.7 ... 3.11rc1 (though pretty sure this reproduces on all versions)
Operating system and architecture: ubuntu 22.04, linux, x86_64

Linked PRs

[3.11] gh-96670: Raise SyntaxError when parsing NULL bytes (GH-97594) #104195

encukou · 2022-09-08T08:07:32Z

There was a discussion about the larger issue of code being "misleadingly different from what it looks like", and the general consensus was that this should be solved in editors, linters and code review tools, rather than in Python. IMO it was this thread but might need more digging.
That doesn't mean handling of \0 can't be changed. Or of control characters in general: see the informational PEP 672 for some more details.
Perhaps all unusual control characters should be banned? Even in strings? With a warning for 2 releases, according to the backwards compatibility policy – waf would need to switch to e.g. base64.
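
To illustrate the base64 idea: a sketch of storing a binary payload in a source file as ASCII text, so the file contains no control characters at all. The names `embed_blob`, `load_blob` and `PAYLOAD` are hypothetical, and this is not how waf packs its data today.

```python
import base64


def embed_blob(blob: bytes) -> str:
    """Return one line of Python source holding `blob` as base64 text."""
    encoded = base64.b64encode(blob).decode("ascii")
    return f"PAYLOAD = {encoded!r}\n"


def load_blob(source: str) -> bytes:
    """Execute the generated line and decode the payload again."""
    namespace = {}
    exec(compile(source, "<embedded>", "exec"), namespace)
    return base64.b64decode(namespace["PAYLOAD"])


blob = bytes(range(256))      # includes NUL and every other byte value
source = embed_blob(blob)
assert "\0" not in source     # the generated source is plain ASCII
assert load_blob(source) == blob
```

The cost is size: base64 text is roughly a third larger than the raw bytes.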

asottile · 2022-09-08T12:02:46Z

that was the rationale for making this public, but it's different than those as it's a mishandling by the parser rather than a quirk of how unicode displays

ypankovych · 2022-09-09T23:11:17Z

Ugh, this is freakin amazing.

C:\Users\user>more t.py

"""
How do i print 'Hello World' in Python?

Here is how:
print('Hello World')
"""

And run it:

C:\Users\user>py t.py
Hello World

open('t.py', 'w').write("\0'''\0How do i print \'Hello World\' in Python?\n\0Here is how:\nprint(\'Hello World\')\0'''")

gpshead · 2022-09-13T21:40:26Z

ideally I would want the interpreter to reject files containing null bytes as a SyntaxError (and update the ast.parse error to a SyntaxError as well)

FWIW I'd be in favor of that behavior change as a Feature. I wouldn't backport it as a bug though.

asottile · 2022-09-13T22:06:49Z

there's kinda 3 things I think:

1. convert ValueError => SyntaxError in ast.parse (probably 3.12+)
2. treat source files containing null bytes as SyntaxError (probably also 3.12+)
3. fix the thing considering it as a c string by using (char*, size_t) or whatever function (probably backportable?)

does it make sense to pursue these three things and separately?

gvanrossum · 2022-09-13T22:14:53Z

there's kinda 3 things I think:

1. convert `ValueError` => `SyntaxError` in `ast.parse` (probably 3.12+)

Agreed, it's changing behavior, so can't backport. I like SyntaxError for this situation. The root exception isn't in ast.parse(), it's in compile(), probably even deeper.

2. treat source files containing null bytes as `SyntaxError` (probably also 3.12+)

That's the biggest hole, should probably do this first.

3. fix the thing considering it as a c string by using `(char*, size_t)` or whatever function (probably backportable?)

Depends on where that fix is (can you tell I haven't looked at the source code yet? :-). If any of the affected functions are public it's going to be more difficult.

does it make sense to pursue these three things and separately?

Likely.

gpshead · 2022-09-14T00:56:36Z

2. treat source files containing null bytes as `SyntaxError` (probably also 3.12+)
That's the biggest hole, should probably do this first.

Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work.

gvanrossum · 2022-09-14T02:43:00Z

Be careful about this one. Python pre-pended to raw binary data for the executed Python to locate within the file and use (embedded zip or other data) is a common idiom that must continue to work.

Are you sure that works? Unlike Unix shells, Python parses the entire source file before executing any code. How would you get Python to ignore a blob of arbitrary binary data embedded in the source code, even if \0 is accepted? If the blob contains \n characters you can't hide it behind a # comment. I suppose you could prefix it with a """ quote, if you can arrange for the file to also end in """, and you're lucky that the blob doesn't contain embedded """ sequences.

But if I had to do something like that I'd probably just embed the Python code in a bash script as a "here" document and end the bash script with exit.

gpshead · 2022-09-14T02:58:34Z

Oh you're right, I guess what I've seen do that is a bash+python+data hybrid monster.

gvanrossum · 2022-09-14T03:00:13Z

Okay then we should be safe banning \0 in files starting with 3.12.

asottile · 2022-09-14T03:02:01Z

I suspect one could probably hack something together like this unfortunately

# coding: latin1
with open(__file__, 'rb') as f:
    contents = f.read().split(b'### BINARY\n')[1]
GARB = '''\
### BINARY
(actual binary here)
### BINARY
'''

gvanrossum · 2022-09-14T03:07:48Z

You'd still have to arrange for the actual binary not to contain the sequence ''' -- because that would end the string started at GARB =. I don't see how setting the coding to Latin-1 changes matters. (Honestly it seems you're making the same mistake as Greg.)

asottile · 2022-09-14T03:20:54Z

latin1 is to prevent a decoding error while parsing the source -- I linked an in-the-wild example of such a file in the original post

gvanrossum · 2022-09-14T04:00:26Z

Okay, that's impressive. It looks like the blob is lightly encoded -- \n and \r are encoded using #. and #&. (I'm curious what they'd do if the blob contains one of those sequences, there doesn't seem to be a way to quote them.)

The proposed change in 3.12 will break them, but they have version checks and they can just cope with it, I don't think we need to preserve this machine-dependent quirk forever. The code looks like it had to deal with various other versioning issues already (e.g. it tries to handle Python 2 and 3!).

vstinner · 2022-09-15T07:10:41Z

fix the thing considering it as a c string by using (char*, size_t) or whatever function (probably backportable?)

The root issue of this bug is that the Python parser is implemented in C which treats the NUL byte/character as the string terminator? So it's more a limitation of the current CPython implementation, but Python might support NUL bytes/characters later? Or do you think that it's always a bad practice to have NUL bytes/characters in a source file?

IMO it's always misleading to have NUL in a source file and it should be banned. I don't see any legit use case for that.

By the way, in Python, there are many trivial ways to create a byte, character, or string containing NUL:

bytes((0,)): byte
chr(0): character
b'[\x00]': bytes string
'[\x00]': Unicode string

gvanrossum · 2022-09-15T16:01:44Z

I agree we should always ban NUL in source files, like we already do when parsing from a string.

vstinner · 2022-09-16T08:09:18Z

Oh right, I didn't notice that!

>>> compile("x=1\x00", "string", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: source code string cannot contain null bytes

So yeah, I agree to always ban null characters/bytes in files as well.
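
For tools that call `compile()` or `ast.parse()` on untrusted text, the exception type for NUL depends on the Python version: `ValueError` in the output above, `SyntaxError` under the change proposed in this thread. A version-agnostic sketch that catches both; `source_compiles` is a hypothetical helper name:

```python
def source_compiles(source: str, filename: str = "<string>") -> bool:
    """Return True if `source` compiles, False if it is rejected.

    Both exception types are caught because SyntaxError is not a
    subclass of ValueError.
    """
    try:
        compile(source, filename, "exec")
    except (SyntaxError, ValueError):
        return False
    return True


print(source_compiles("x = 1"))
print(source_compiles("x=1\x00"))
```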

apccurtiss · 2022-09-25T01:13:42Z

I suspect one could probably hack something together like this unfortunately

# coding: latin1
with open(__file__, 'rb') as f:
    contents = f.read().split(b'### BINARY\n')[1]
GARB = '''\
### BINARY
(actual binary here)
### BINARY
'''

This is actually a real (albeit very odd) use case for me! It can be used to demonstrate hash collision attacks. There are tools which will generate two blobs of binary data with the same MD5 hash, which can be used to create two Python files with the same hash, but different behaviors. This is a fairly common lab for intro security classes (example, other example). This is just another reason to disable this functionality IMO, but there could be other unexpected reasons why somebody may need to include arbitrary binary in a source file.

vstinner · 2022-09-26T07:57:52Z

It seems like the issue #97556 is a duplicate of this issue.

apccurtiss · 2022-09-26T18:40:59Z

It seems like the issue #97556 is a duplicate of this issue.

They're very similar, although I don't believe they're exact duplicates: #97556 deals with a particular parsing error in Python 3.10, which fatally fails to parse a string literal containing a NULL character. This issue deals with the larger problem of all source code being ignored after a null character: it applies even outside of string literals, although doing so is a syntax error, and in pre-Python 3.10 versions as well.

gvanrossum · 2022-09-27T23:04:34Z

Closing this. Starting with Python 3.12, NUL bytes won't be allowed in source code read from files. We're not backporting this since this could be considered a feature by some.

vstinner · 2022-09-28T15:22:26Z

Starting with Python 3.12, NUL bytes won't be allowed in source code read from files

It was changed by the commit aab01e3: PR #97594.

CarliJoy · 2022-10-17T20:22:49Z

Just for reference: The problem is already mentioned in https://peps.python.org/pep-0672/#control-characters

And Pylint checks for null characters already.

asottile added the type-bug An unexpected behavior, bug, or error label Sep 8, 2022

vstinner mentioned this issue Sep 26, 2022

Null characters in strings cause a C SystemError #97556

Closed

apccurtiss added a commit to apccurtiss/cpython that referenced this issue Sep 26, 2022

Removed test, which is blocked by issue python#96670

fd4161d

bedevere-bot mentioned this issue Sep 26, 2022

gh-97556: Fix crash on null characters in string literal parser #97577

Open

bedevere-bot mentioned this issue Sep 27, 2022

gh-96670: Raise SyntaxError when parsing NULL bytes #97594

Merged

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

fc28337

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

ff3931b

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

7cebcc7

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

fixup! pythongh-96670: Raise SyntaxError when parsing NULL bytes

7b23309

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

fixup! fixup! pythongh-96670: Raise SyntaxError when parsing NULL bytes

09f4d22

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

750c691

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

97a8f83

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

cb89392

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

490f5bd

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

4b1105c

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit to pablogsal/cpython that referenced this issue Sep 27, 2022

pythongh-96670: Raise SyntaxError when parsing NULL bytes

e10909d

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>

pablogsal added a commit that referenced this issue Sep 27, 2022

gh-96670: Raise SyntaxError when parsing NULL bytes (#97594)

aab01e3

gvanrossum closed this as completed Sep 27, 2022

asottile mentioned this issue Oct 27, 2022

ValueError: source code string cannot contain null bytes PyCQA/flake8#1682

Closed

bskinn mentioned this issue May 4, 2023

gh-97556: Raise null bytes syntax error upon null in multiline string #104136

Merged

lysnikolaou added a commit to lysnikolaou/cpython that referenced this issue May 5, 2023

[3.11] pythongh-96670: Raise SyntaxError when parsing NULL bytes (pyt…

e6d28ba

…honGH-97594). (cherry picked from commit aab01e3) Co-authored-by: Pablo Galindo Salgado <Pablogsal@gmail.com>

bedevere-bot mentioned this issue May 5, 2023

[3.11] gh-96670: Raise SyntaxError when parsing NULL bytes (GH-97594) #104195

Merged

pablogsal pushed a commit that referenced this issue May 7, 2023

[3.11] gh-96670: Raise SyntaxError when parsing NULL bytes (GH-97594) (…

a09d390

…#104195)

JelleZijlstra mentioned this issue May 15, 2023

Add more error handling carljm/compfinder#2

Merged

hrnciar mentioned this issue May 16, 2023

Changes in Python 3.12 break tests PythonCharmers/python-future#618

Closed

vEpiphyte mentioned this issue Jun 7, 2023

Handle SyntaxError which can be raised by cpython in 3.11.4+ (SYN-5563) vertexproject/synapse#3172

Merged

asottile mentioned this issue Jul 31, 2023

NUL bytes in commented lines #64314

Closed

akshayka mentioned this issue Dec 31, 2023

improvement: use virtual file for image marimo-team/marimo#489

Merged

This was referenced Jun 21, 2024

Error when compiling with Python3.12 (WAF files contain null bytes) mdaus/nitro#614

Open

coda-oss waf build fails with Python 3.12 (Ubuntu 24.04) mdaus/coda-oss#773

Open

dscorbett mentioned this issue Feb 21, 2025

PLE2514 fix should be marked unsafe and can modify octal escape sequences astral-sh/ruff#16309

Open
