GH-113464: Generate a more efficient JIT #118512

brandtbucher · 2024-05-02T15:51:04Z

This breaks up the JIT into smaller functions, reduces a lot of branching in hot inner loops, and generally makes the C code cleaner (and probably faster).

Currently, we generate declarative structures at build time that we then loop over in order to emit the desired machine code at runtime. For example, the _STORE_FAST stencil looks like this:

static const unsigned char _STORE_FAST_code_body[61] = {
    0x50, 0x48, 0x8b, 0x45, 0xf8, 0x48, 0x83, 0xc5,
    0xf8, 0x0f, 0xb7, 0x0d, 0x00, 0x00, 0x00, 0x00,
    0x49, 0x8b, 0x7c, 0xcd, 0x48, 0x49, 0x89, 0x44,
    0xcd, 0x48, 0x48, 0x85, 0xff, 0x74, 0x0f, 0x48,
    0x8b, 0x07, 0x85, 0xc0, 0x78, 0x08, 0x48, 0xff,
    0xc8, 0x48, 0x89, 0x07, 0x74, 0x07, 0x58, 0xff,
    0x25, 0x00, 0x00, 0x00, 0x00, 0xff, 0x15, 0x00,
    0x00, 0x00, 0x00, 0x58,
};
static const Hole _STORE_FAST_code_holes[4] = {
    {0xc, HoleKind_R_X86_64_GOTPCREL, HoleValue_DATA, NULL, -0x4},
    {0x31, HoleKind_R_X86_64_GOTPCRELX, HoleValue_DATA, NULL, 0x4},
    {0x37, HoleKind_R_X86_64_GOTPCRELX, HoleValue_DATA, NULL, 0xc},
};
static const unsigned char _STORE_FAST_data_body[25] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};
static const Hole _STORE_FAST_data_holes[4] = {
    {0x0, HoleKind_R_X86_64_64, HoleValue_OPARG, NULL, 0x0},
    {0x8, HoleKind_R_X86_64_64, HoleValue_CONTINUE, NULL, 0x0},
    {0x10, HoleKind_R_X86_64_64, HoleValue_ZERO, &_Py_Dealloc, 0x0},
};

This very general approach means that we have a lot of complex logic in our hot inner loop to decode instructions and set up values for patching that may not even be needed. It's also very branchy, since we're essentially "interpreting" the array of holes for each instruction.
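To make the cost of that "interpreting" concrete, here is a minimal sketch of what a generic hole-patching loop can look like. It is an illustration under simplifying assumptions, not the code from Python/jit.c: the `Sketch*` types, the two hole kinds, the `values` lookup array, and `sketch_copy_and_patch` are all hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the generated Hole structures. */
typedef enum { KIND_ABS_64, KIND_REL_32 } SketchHoleKind;
typedef enum { VALUE_ZERO, VALUE_OPARG, VALUE_CONTINUE, VALUE_DATA } SketchHoleValue;

typedef struct {
    size_t offset;          /* where in the copied body to patch */
    SketchHoleKind kind;    /* how the value is encoded at that offset */
    SketchHoleValue value;  /* which runtime value to start from */
    uint64_t symbol;        /* extra address added to the value (0 if unused) */
    int64_t addend;         /* constant added to the value */
} SketchHole;

/* Copy a body, then walk its holes, deciding at runtime how to patch each one. */
static void
sketch_copy_and_patch(unsigned char *dst, const unsigned char *body, size_t size,
                      const SketchHole *holes, size_t nholes,
                      const uint64_t values[4])
{
    memcpy(dst, body, size);
    for (size_t i = 0; i < nholes; i++) {
        const SketchHole *hole = &holes[i];
        unsigned char *location = dst + hole->offset;
        /* Every hole pays for a table lookup and two additions, needed or not. */
        uint64_t value = values[hole->value] + hole->symbol + (uint64_t)hole->addend;
        switch (hole->kind) {
            case KIND_ABS_64: {
                /* Absolute 64-bit value written as-is. */
                assert(hole->offset + 8 <= size);
                memcpy(location, &value, sizeof(value));
                break;
            }
            case KIND_REL_32: {
                /* 32-bit value relative to the patched location. */
                assert(hole->offset + 4 <= size);
                uint32_t relative = (uint32_t)(value - (uint64_t)(uintptr_t)location);
                memcpy(location, &relative, sizeof(relative));
                break;
            }
        }
    }
}
```

Even in this reduced form, each hole costs a loop iteration, a value lookup, and a switch on the hole kind, all driven by data that was already known at build time.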

With this PR, jit_stencils.h instead contains the following function:

void
emit__STORE_FAST(
    unsigned char *code, unsigned char *data, _PyExecutorObject *executor,
    const _PyUOpInstruction *instruction, uintptr_t instruction_starts[])
{
    const unsigned char code_body[60] = {
        0x50, 0x48, 0x8b, 0x45, 0xf8, 0x48, 0x83, 0xc5,
        0xf8, 0x0f, 0xb7, 0x0d, 0x00, 0x00, 0x00, 0x00,
        0x49, 0x8b, 0x7c, 0xcd, 0x48, 0x49, 0x89, 0x44,
        0xcd, 0x48, 0x48, 0x85, 0xff, 0x74, 0x0f, 0x48,
        0x8b, 0x07, 0x85, 0xc0, 0x78, 0x08, 0x48, 0xff,
        0xc8, 0x48, 0x89, 0x07, 0x74, 0x07, 0x58, 0xff,
        0x25, 0x00, 0x00, 0x00, 0x00, 0xff, 0x15, 0x00,
        0x00, 0x00, 0x00, 0x58,
    };
    const unsigned char data_body[24] = {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    };
    memcpy(data, data_body, sizeof(data_body));
    patch_64(data + 0x0, instruction->oparg);
    patch_64(data + 0x8, (uintptr_t)code + sizeof(code_body));
    patch_64(data + 0x10, (uintptr_t)&_Py_Dealloc);
    memcpy(code, code_body, sizeof(code_body));
    patch_32r(code + 0xc, (uintptr_t)data + -0x4);
    patch_x86_64_32rx(code + 0x31, (uintptr_t)data + 0x4);
    patch_x86_64_32rx(code + 0x37, (uintptr_t)data + 0xc);
}

This function is called directly to emit the machine code for every _STORE_FAST instruction, and hardcodes the logic for all of the necessary copies and patches. The result is one indirect call, no unnecessary branching, and (in my opinion) cleaner code, since a lot of the tricky logic is now hidden away in generated files.
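The "one indirect call" can be pictured as a table of emitter functions indexed by opcode. The following is a rough sketch under that assumption, not the layout actually generated by this PR: the `sketch_*` names, the two toy opcodes, the reduced emitter signature, and the placeholder bytes are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical emitter signature, much simpler than the one in jit_stencils.h. */
typedef void (*sketch_emitter)(unsigned char *code, unsigned char *data, uint64_t oparg);

/* Write an absolute 64-bit value at the given location. */
static void
sketch_patch_64(unsigned char *location, uint64_t value)
{
    memcpy(location, &value, sizeof(value));
}

/* One function per opcode: the copies and patches are hardcoded, with no loop. */
static void
sketch_emit_OP_A(unsigned char *code, unsigned char *data, uint64_t oparg)
{
    static const unsigned char code_body[4] = {0x90, 0x90, 0x90, 0xc3};  /* placeholder bytes */
    static const unsigned char data_body[8] = {0};
    memcpy(data, data_body, sizeof(data_body));
    sketch_patch_64(data + 0x0, oparg);
    memcpy(code, code_body, sizeof(code_body));
}

/* An opcode with no data section and nothing to patch. */
static void
sketch_emit_OP_B(unsigned char *code, unsigned char *data, uint64_t oparg)
{
    static const unsigned char code_body[1] = {0xc3};  /* placeholder byte */
    (void)data;
    (void)oparg;
    memcpy(code, code_body, sizeof(code_body));
}

enum { SKETCH_OP_A, SKETCH_OP_B, SKETCH_OP_COUNT };

/* Table of emitters, indexed by opcode. */
static const sketch_emitter sketch_emitters[SKETCH_OP_COUNT] = {
    [SKETCH_OP_A] = sketch_emit_OP_A,
    [SKETCH_OP_B] = sketch_emit_OP_B,
};

/* Emitting an instruction is a bounds check plus a single indirect call. */
static void
sketch_emit(int opcode, unsigned char *code, unsigned char *data, uint64_t oparg)
{
    assert(opcode >= 0 && opcode < SKETCH_OP_COUNT);
    sketch_emitters[opcode](code, data, oparg);
}
```

The design trade-off is more generated code in exchange for straight-line emitters: everything that was previously decoded from a hole array at runtime becomes a constant in the function body.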

I know this is right before feature freeze, but I'd really like to get this in 3.13 since it will make backporting any fixes much easier. It doesn't change the actual jitted code in any way.

Note to reviewers: the diff is a bit messy, so it may make more sense to compare the before-vs-after files side-by-side instead.

Issue: JIT Compilation #113464

brandtbucher · 2024-05-02T15:52:11Z

@savannahostrowski, I'd love to get your review of this if you have a few cycles.

savannahostrowski

A couple of comments and questions but after sitting and reading through this code a bunch over the last week or two, I'm excited about how much more readable this will get with this change! 💆‍♀️

Python/jit.c

Tools/jit/_stencils.py

savannahostrowski · 2024-05-02T23:26:09Z

Tools/jit/_writer.py

    yield ""


 def dump(groups: dict[str, _stencils.StencilGroup]) -> typing.Iterator[str]:
    """Yield a JIT compiler line-by-line as a C header file."""
-    yield from _dump_header()
-    for opname, group in groups.items():
+    for opname, group in sorted(groups.items()):


Is there a reason that this needs to be sorted?

Nope, I just like it that way (if you couldn't tell by now). ;)

Tools/jit/_stencils.py

savannahostrowski

Thanks for adding the comment about the naming conventions - I think that helps! Otherwise, this looks pretty solid to me (barring some Windows CI failures). Lots of moving things into functions, but it's a whole lot more readable! 🎉

brandtbucher · 2024-05-03T23:40:53Z

Windows JIT CI fixed in GH-118564.

brandtbucher and others added 16 commits May 1, 2024 15:15

Replace stencils with dedicated writer functions

2901caf

Generate patching logic

f30fa64

Cleanup

23e211c

uint64_t -> uintptr_t

431fbed

uintptr_t -> uint64_t

3e6b25c

Linting

82030c8

Restore AArch64 pair folding

236af82

Cleanup

3b7e693

Add missing relocations

fbb97fc

Fix AArch64 folds

c40bb34

Dedent

bd570b5

Use a single array of structs

b2fd9d2

Move C initializer formation to StencilGroup

7aa12a2

Add comment on why data is first

9ec64ac

Silence warnings

eb0826f

Add missing space

a04d7f8

brandtbucher added the performance, interpreter-core, and build labels May 2, 2024

brandtbucher self-assigned this May 2, 2024

bedevere-app bot added the awaiting core review label May 2, 2024

bedevere-app bot mentioned this pull request May 2, 2024

JIT Compilation #113464

Closed

brandtbucher requested a review from markshannon May 2, 2024 15:51

brandtbucher added the skip news label May 2, 2024

savannahostrowski reviewed May 2, 2024

brandtbucher added 2 commits May 3, 2024 14:30

Clarify which patch functions are relaxing (and what that means)

b919fcc

Explain why GOT is commented out

46adf09

savannahostrowski approved these changes May 3, 2024

brandtbucher merged commit 1b7e5e6 into python:main May 3, 2024
57 of 59 checks passed

bedevere-app bot removed the awaiting core review label May 3, 2024

SonicField pushed a commit to SonicField/cpython that referenced this pull request May 8, 2024

pythonGH-113464: Generate a more efficient JIT (pythonGH-118512)

1493daa

brandtbucher added the topic-JIT label May 31, 2024
