Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Alignment fault in PyMember_SetOne under freethreading on aarch64 #129675

Open
hawkinsp opened this issue Feb 5, 2025 · 10 comments
Open

Alignment fault in PyMember_SetOne under freethreading on aarch64 #129675

hawkinsp opened this issue Feb 5, 2025 · 10 comments
Labels
docs Documentation in the Doc dir topic-C-API topic-free-threading type-bug An unexpected behavior, bug, or error

Comments

@hawkinsp
Copy link
Contributor

hawkinsp commented Feb 5, 2025

Bug report

Bug description:

I saw a SIGBUS crash with the following backtrace immediately on startup when running a free-threaded build of https://github.com/jax-ml/jax on Python 3.13t:

#0  PyMember_SetOne (addr=0x20006fa48bc "", addr@entry=0x20006fa4890 "`\367N~", <incomplete sequence \346>, l=<optimized out>,
    v=v@entry='Context manager for `jax2tf_associative_scan_reductions` config option.\n\nJAX has two separate lowering rules for the cumulative reduction primitives (cumsum, cumprod, cummax, cummin). On CPUs and GPUs it uses a lax.associative_scan, while for TPUs it uses the HLO ReduceWindow. The latter has a slow implementation on CPUs and GPUs. By default, jax2tf uses the TPU lowering. Set this flag to True to use the associative scan lowering usage, and only if it makes a difference for your application. See the jax2tf README.md for more details.') at Python/structmember.c:308
#1  0x0000b1fea968e908 in member_set (self=<member_descriptor at remote 0x20006384210>, obj=<State at remote 0x20006fa4890>,
    value='Context manager for `jax2tf_associative_scan_reductions` config option.\n\nJAX has two separate lowering rules for the cumulative reduction primitives (cumsum, cumprod, cummax, cummin). On CPUs and GPUs it uses a lax.associative_scan, while for TPUs it uses the HLO ReduceWindow. The latter has a slow implementation on CPUs and GPUs. By default, jax2tf uses the TPU lowering. Set this flag to True to use the associative scan lowering usage, and only if it makes a difference for your application. See the jax2tf README.md for more details.') at Objects/descrobject.c:238
#2  0x0000b1fea96e9928 in _PyObject_GenericSetAttrWithDict (obj=<State at remote 0x20006fa4890>, name='__doc__',
    value='Context manager for `jax2tf_associative_scan_reductions` config option.\n\nJAX has two separate lowering rules for the cumulative reduction primitives (cumsum, cumprod, cummax, cummin). On CPUs and GPUs it uses a lax.associative_scan, while for TPUs it uses the HLO ReduceWindow. The latter has a slow implementation on CPUs and GPUs. By default, jax2tf uses the TPU lowering. Set this flag to True to use the associative scan lowering usage, and only if it makes a difference for your application. See the jax2tf README.md for more details.', dict=dict@entry=0x0) at Objects/object.c:1778

What we're doing here is running this code:

obj.__doc__ = "..."

and crashing.

We're at this line of code:

FT_ATOMIC_STORE_PTR_RELEASE(*(PyObject **)addr, Py_XNewRef(v));

Digging a little deeper, the relevant disassembly is:

(gdb) display /10i $pc-20
10: x/10i $pc-20
   0xb1fea9860094 <PyMember_SetOne+1596>:       bl      0xb1fea985f690 <_PyCriticalSection_BeginMutex>
   0xb1fea9860098 <PyMember_SetOne+1600>:       ldr     x19, [x19, x21]
   0xb1fea986009c <PyMember_SetOne+1604>:       cbz     x20, 0xb1fea98600a8 <PyMember_SetOne+1616>
   0xb1fea98600a0 <PyMember_SetOne+1608>:       mov     x0, x20
   0xb1fea98600a4 <PyMember_SetOne+1612>:       bl      0xb1fea985f42c <Py_INCREF>
=> 0xb1fea98600a8 <PyMember_SetOne+1616>:       stlr    x20, [x24]
   0xb1fea98600ac <PyMember_SetOne+1620>:       add     x0, sp, #0x8
   0xb1fea98600b0 <PyMember_SetOne+1624>:       bl      0xb1fea985f774 <_PyCriticalSection_End>
   0xb1fea98600b4 <PyMember_SetOne+1628>:       mov     x0, x19
   0xb1fea98600b8 <PyMember_SetOne+1632>:       bl      0xb1fea985f664 <Py_XDECREF>

(gdb) info registers
...
x20            0x200064f9610       2199129134608
x21            0x2c                44
x22            0x10                16
x23            0x0                 0
x24            0x20006fa48bc       2199140321468
x25            0x0                 0
...

What's going on here is that stlr's target address must be 8-byte aligned on aarch64: https://developer.arm.com/documentation/102336/0100/Load-Acquire-and-Store-Release-instructions

but the __doc__ field of this object is only 4-byte aligned, with offset = 44.

(gdb) up
(gdb) print *descr->d_member
$38 = {name = 0xb1fea9bdc950 <_PyRuntime+63952> "__doc__", type = 16, offset = 44, flags = 0, doc = 0x0}

We've placed an object field at an unaligned address, and used it in an atomic access that requires alignment, which is an error.

How should we fix this?

This seems like it's a CPython bug: CPython shouldn't choose underaligned slot offsets for object fields.

However, in this particular case it comes from a Python subclass of a C extension base class that has tp_basicsize=44; I'm also not aware of any rule that says tp_basicsize has to be a multiple of the word size. Perhaps CPython should either enforce that or round up the size of base classes to ensure alignment.

Or we can argue that CPython shouldn't be using an aligned atomic in this case.

What do you think?

CPython versions tested on:

3.13

Operating systems tested on:

Linux

Linked PRs

@hawkinsp hawkinsp added the type-bug An unexpected behavior, bug, or error label Feb 5, 2025
@hawkinsp
Copy link
Contributor Author

hawkinsp commented Feb 5, 2025

It occurs to me that this problem may also be present but in a sense worse on x86. I believe on x86, the "atomic" access is silently not atomic if it's unaligned. Perhaps there's code to check this somewhere in the CPython implementation, though, otherwise this means ostensibly atomic accesses are actually racy.

@colesbury
Copy link
Contributor

Can you provide some more information about how you end up with an offset and tp_basicsize of 44?

@hawkinsp
Copy link
Contributor Author

hawkinsp commented Feb 5, 2025

Here's a trivial repro, using nanobind:

repro.cc:

#include <Python.h>

#include "nanobind/nanobind.h"

namespace nb = nanobind;

struct AClass {
    int x;
};

NB_MODULE(repro, m) {
    nb::class_<AClass> c(m, "AClass");
    c.def(nb::init<>());
}

r.py

import repro

class Child(repro.AClass):
  __slots__ = ("x",)


c = Child()
c.x = "hello"

CMakeLists.txt:

cmake_minimum_required(VERSION 3.16)

project(repro_project)

# https://nanobind.readthedocs.io/en/latest/building.html


if (CMAKE_VERSION VERSION_LESS 3.18)
  set(DEV_MODULE Development)
else()
  set(DEV_MODULE Development.Module)
endif()

find_package(Python 3.8 COMPONENTS Interpreter ${DEV_MODULE} REQUIRED)

if (NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
  set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE)
  set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo")
endif()

function(append value)
  foreach(variable ${ARGN})
    set(${variable} "${${variable}} ${value}" PARENT_SCOPE)
  endforeach(variable)
endfunction()

# Detect the installed nanobind package and import it into CMake
execute_process(
  COMMAND "${Python_EXECUTABLE}" -m nanobind --cmake_dir
  OUTPUT_STRIP_TRAILING_WHITESPACE OUTPUT_VARIABLE nanobind_ROOT)

find_package(nanobind CONFIG REQUIRED)

# Abseil requires C++14
set(CMAKE_CXX_STANDARD 14)

nanobind_add_module(
    repro
    FREE_THREADED
    repro.cc
)

set_target_properties(repro PROPERTIES POSITION_INDEPENDENT_CODE ON)
(sbenv) phawkins@phawkins-c4a:~/repro$ cmake -S. -Bbuild
-- The C compiler identification is GNU 13.3.0
-- The CXX compiler identification is GNU 13.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Python: /home/phawkins/sbenv/bin/python3 (found suitable version "3.13.2", minimum required is "3.8") found components: Interpreter Development.Module
-- Configuring done (1.0s)
-- Generating done (0.0s)
-- Build files have been written to: /home/phawkins/repro/build
(sbenv) phawkins@phawkins-c4a:~/repro$ cmake --build build
[  7%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_internals.cpp.o
[ 14%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_func.cpp.o
[ 21%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_type.cpp.o
[ 28%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_enum.cpp.o
[ 35%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_ndarray.cpp.o
[ 42%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_static_property.cpp.o
[ 50%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/nb_ft.cpp.o
[ 57%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/common.cpp.o
[ 64%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/error.cpp.o
[ 71%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/trampoline.cpp.o
[ 78%] Building CXX object CMakeFiles/nanobind-static-ft.dir/home/phawkins/sbenv/lib/python3.13t/site-packages/nanobind/src/implicit.cpp.o
[ 85%] Linking CXX static library libnanobind-static-ft.a
[ 85%] Built target nanobind-static-ft
[ 92%] Building CXX object CMakeFiles/repro.dir/repro.cc.o
[100%] Linking CXX shared module repro.cpython-313td-aarch64-linux-gnu.so
[100%] Built target repro
(sbenv) phawkins@phawkins-c4a:~/repro$ ls build
CMakeCache.txt  CMakeFiles  Makefile  cmake_install.cmake  libnanobind-static-ft.a  repro.cpython-313td-aarch64-linux-gnu.so
(sbenv) phawkins@phawkins-c4a:~/repro$ PYTHONPATH=build python r.py
Bus error (core dumped)

@colesbury
Copy link
Contributor

Thanks for the repro.

I think this is a bug in nanobind, although the Python documentation and validation could be improved.

  • Unaligned accesses are undefined behavior in C and C++ even though many architectures support them for plain loads and stores. It's not just limited to atomics.
  • tp_basicsize needs to meet PyObject size and alignment. In C and C++, struct size is a multiple of alignment, which effectively means that it must be at least pointer-aligned. The docs say "the only correct way to get an initializer for the tp_basicsize is to use the sizeof operator". The sizeof() operator will always get you the right thing. We can probably make the alignment requirements more clear in the docs.
  • For PyType_Spec.basicsize, there are the same requirements when using a positive size. For negative sizes, CPython will adjust it to ensure the alignment necessary.

We can also add some additional validation of tp_basicsize in PyType_Ready, but I don't think we should adjust positive values of tp_basicsize.

@colesbury
Copy link
Contributor

cc @encukou since this is related to PEP 697

hawkinsp added a commit to hawkinsp/nanobind that referenced this issue Feb 6, 2025
See python/cpython#129675 (comment)
tp_basicsize needs to be a multiple of the alignment of PyObject, which
is in effect pointer aligned.

This fixes a SIGBUS alignment crash found when testing free-threaded
nanobind-using code on aarch64.
wjakob pushed a commit to wjakob/nanobind that referenced this issue Feb 6, 2025
See python/cpython#129675 (comment)
tp_basicsize needs to be a multiple of the alignment of PyObject, which
is in effect pointer aligned.

This fixes a SIGBUS alignment crash found when testing free-threaded
nanobind-using code on aarch64.
@hawkinsp
Copy link
Contributor Author

hawkinsp commented Feb 6, 2025

This is now fixed in nanobind, but I think there's still a couple of action items here for CPython to document and enforce this.

@encukou encukou added docs Documentation in the Doc dir topic-C-API labels Feb 6, 2025
@markshannon
Copy link
Member

I think might be a bug in CPython.
Take this struct:

struct object44 {
    PyObject header;
    int array[7];
};

and the corresponding PyTypeObject:

    .tp_basicsize = sizeof(object44),

For all major compilers, sizeof(struct object44) == 48 is true, but I don't think the C spec requires it.
I think a size of 44 is legal, but I might be wrong.

@colesbury
Copy link
Contributor

Not if pointers have 8 byte alignment (and if PyObject contains pointers). You need trailing padding so that the size of object44 is a multiple of its alignment. Otherwise C arrays would be broken. For example, if you declare a fixed size array in C:

struct object44 arr[N];

Or allocate it using malloc:

struct object44 *arr = malloc(sizeof(struct object44) * N);

Every element must be satisfy the alignment of struct object44.

struct object44 *ptr = &arr[1]; // this would be wrong! if sizeof(struct object44) == 44

Some compilers support __attribute__((packed)) or similar as an extension, but that's outside of the standard and doesn't work with arrays. (Some things are just silently broken with __attribute__((packed))!)

@encukou
Copy link
Member

encukou commented Feb 8, 2025

See #129850 for the docs part. (If we add a warning, we shouldn't backport it, so it should be a separate PR.)

I don't think the C spec requires [rounding struct size up to its alignment].

Huh, right, that's curious. I can't find it spelled out explicitly!
But, the spec does give sizeof array / sizeof array[0] as an example for getting an array size, and more than just Python would break if malloc(sizeof(t) * n) didn't give you a t[n] array.

@encukou
Copy link
Member

encukou commented Feb 24, 2025

I'd appreciate a review on the docs PR, especially from someone who knows* what tp_basicsize/tp_itemsize should mean -- everyone has a slightly different idea :)

I'll merge it next week if there are no objections.

encukou added a commit that referenced this issue Mar 11, 2025
* Update documentation for tp_basicsize & tp_itemsize

- Add alignment requirement
- Mention that ob_size is unreliable if you don't control it
- Add some links for context
- basicsize should include the base type in generaly not just PyObject

This adds a “by-the-way” link to `PyObject_New`, which shouldn't be
used for GC types. In order to be comfortable linking to it, I also
add a link to `PyObject_GC_New` from its docs. And the same for
`*Var` variants, while I'm here.

* Strongly suggest Py_SIZE & Py_SET_SIZE
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 11, 2025
…ythonGH-129850)

* Update documentation for tp_basicsize & tp_itemsize

- Add alignment requirement
- Mention that ob_size is unreliable if you don't control it
- Add some links for context
- basicsize should include the base type in generaly not just PyObject

This adds a “by-the-way” link to `PyObject_New`, which shouldn't be
used for GC types. In order to be comfortable linking to it, I also
add a link to `PyObject_GC_New` from its docs. And the same for
`*Var` variants, while I'm here.

* Strongly suggest Py_SIZE & Py_SET_SIZE
(cherry picked from commit ad0f618)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Mar 11, 2025
…ythonGH-129850)

* Update documentation for tp_basicsize & tp_itemsize

- Add alignment requirement
- Mention that ob_size is unreliable if you don't control it
- Add some links for context
- basicsize should include the base type in generaly not just PyObject

This adds a “by-the-way” link to `PyObject_New`, which shouldn't be
used for GC types. In order to be comfortable linking to it, I also
add a link to `PyObject_GC_New` from its docs. And the same for
`*Var` variants, while I'm here.

* Strongly suggest Py_SIZE & Py_SET_SIZE
(cherry picked from commit ad0f618)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
encukou added a commit that referenced this issue Mar 17, 2025
…GH-129850) (GH-131078)

- Add alignment requirement
- Mention that ob_size is unreliable if you don't control it
- Add some links for context
- basicsize should include the base type in generaly not just PyObject
- suggest Py_SIZE & Py_SET_SIZE

This adds a “by-the-way” link to `PyObject_New`, which shouldn't be
used for GC types. In order to be comfortable linking to it, I also
add a link to `PyObject_GC_New` from its docs. And the same for
`*Var` variants, while I'm here.

(cherry picked from commit ad0f618)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
encukou added a commit that referenced this issue Mar 17, 2025
…GH-129850) (GH-131079)

- Add alignment requirement
- Mention that ob_size is unreliable if you don't control it
- Add some links for context
- basicsize should include the base type in generaly not just PyObject
- suggest Py_SIZE & Py_SET_SIZE

This adds a “by-the-way” link to `PyObject_New`, which shouldn't be
used for GC types. In order to be comfortable linking to it, I also
add a link to `PyObject_GC_New` from its docs. And the same for
`*Var` variants, while I'm here.

(cherry picked from commit ad0f618)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation in the Doc dir topic-C-API topic-free-threading type-bug An unexpected behavior, bug, or error
Projects
Status: Todo
Development

No branches or pull requests

5 participants