Skip to content

GH-113464: Generate a more efficient JIT #118512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
May 3, 2024

Conversation

brandtbucher
Copy link
Member

@brandtbucher brandtbucher commented May 2, 2024

This breaks up the JIT into smaller functions, reduces a lot of branching in hot inner loops, and generally makes the C code cleaner (and probably faster).

Currently, we generate declarative structures at build time that we then loop over in order to emit the desired machine code at runtime. For example, the _STORE_FAST stencil looks like this:

static const unsigned char _STORE_FAST_code_body[61] = {
    0x50, 0x48, 0x8b, 0x45, 0xf8, 0x48, 0x83, 0xc5,
    0xf8, 0x0f, 0xb7, 0x0d, 0x00, 0x00, 0x00, 0x00,
    0x49, 0x8b, 0x7c, 0xcd, 0x48, 0x49, 0x89, 0x44,
    0xcd, 0x48, 0x48, 0x85, 0xff, 0x74, 0x0f, 0x48,
    0x8b, 0x07, 0x85, 0xc0, 0x78, 0x08, 0x48, 0xff,
    0xc8, 0x48, 0x89, 0x07, 0x74, 0x07, 0x58, 0xff,
    0x25, 0x00, 0x00, 0x00, 0x00, 0xff, 0x15, 0x00,
    0x00, 0x00, 0x00, 0x58,
};
static const Hole _STORE_FAST_code_holes[4] = {
    {0xc, HoleKind_R_X86_64_GOTPCREL, HoleValue_DATA, NULL, -0x4},
    {0x31, HoleKind_R_X86_64_GOTPCRELX, HoleValue_DATA, NULL, 0x4},
    {0x37, HoleKind_R_X86_64_GOTPCRELX, HoleValue_DATA, NULL, 0xc},
};
static const unsigned char _STORE_FAST_data_body[25] = {
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
};
static const Hole _STORE_FAST_data_holes[4] = {
    {0x0, HoleKind_R_X86_64_64, HoleValue_OPARG, NULL, 0x0},
    {0x8, HoleKind_R_X86_64_64, HoleValue_CONTINUE, NULL, 0x0},
    {0x10, HoleKind_R_X86_64_64, HoleValue_ZERO, &_Py_Dealloc, 0x0},
};

This very general approach means that we have a lot of complex logic in our hot inner loop to decode instructions and set up values for patching that may not even be needed. It's also very branchy, since we're essentially "interpreting" the array of holes for each instruction.

With this PR, jit_stencils.h instead contains the following function:

void
emit__STORE_FAST(
    unsigned char *code, unsigned char *data, _PyExecutorObject *executor,
    const _PyUOpInstruction *instruction, uintptr_t instruction_starts[])
{
    const unsigned char code_body[60] = {
        0x50, 0x48, 0x8b, 0x45, 0xf8, 0x48, 0x83, 0xc5,
        0xf8, 0x0f, 0xb7, 0x0d, 0x00, 0x00, 0x00, 0x00,
        0x49, 0x8b, 0x7c, 0xcd, 0x48, 0x49, 0x89, 0x44,
        0xcd, 0x48, 0x48, 0x85, 0xff, 0x74, 0x0f, 0x48,
        0x8b, 0x07, 0x85, 0xc0, 0x78, 0x08, 0x48, 0xff,
        0xc8, 0x48, 0x89, 0x07, 0x74, 0x07, 0x58, 0xff,
        0x25, 0x00, 0x00, 0x00, 0x00, 0xff, 0x15, 0x00,
        0x00, 0x00, 0x00, 0x58,
    };
    const unsigned char data_body[24] = {
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    };
    memcpy(data, data_body, sizeof(data_body));
    patch_64(data + 0x0, instruction->oparg);
    patch_64(data + 0x8, (uintptr_t)code + sizeof(code_body));
    patch_64(data + 0x10, (uintptr_t)&_Py_Dealloc);
    memcpy(code, code_body, sizeof(code_body));
    patch_32r(code + 0xc, (uintptr_t)data + -0x4);
    patch_x86_64_32rx(code + 0x31, (uintptr_t)data + 0x4);
    patch_x86_64_32rx(code + 0x37, (uintptr_t)data + 0xc);
}

This function is called directly to emit the machine code for every _STORE_FAST instruction, and hardcodes the logic for all of the necessary copies and patches. The result is one indirect call, no unnecessary branching, and (in my opinion) cleaner code, since a lot of the tricky logic is now hidden away in generated files.


I know this is right before feature freeze, but I'd really like to get this in 3.13 since it will make backporting any fixes much easier. It doesn't change the actual jitted code in any way.

Note to reviewers: the diff is a bit messy, so it may make more sense to compare the before-vs-after files side-by-side instead.

@brandtbucher brandtbucher added performance Performance or resource usage interpreter-core (Objects, Python, Grammar, and Parser dirs) build The build process and cross-build labels May 2, 2024
@brandtbucher brandtbucher self-assigned this May 2, 2024
@bedevere-app bedevere-app bot mentioned this pull request May 2, 2024
@brandtbucher brandtbucher requested a review from markshannon May 2, 2024 15:51
@brandtbucher
Copy link
Member Author

@savannahostrowski, I'd love to get your review of this if you have a few cycles.

Copy link
Member

@savannahostrowski savannahostrowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments and questions but after sitting and reading through this code a bunch over the last week or two, I'm excited about how much more readable this will get with this change! 💆‍♀️

yield ""


def dump(groups: dict[str, _stencils.StencilGroup]) -> typing.Iterator[str]:
"""Yield a JIT compiler line-by-line as a C header file."""
yield from _dump_header()
for opname, group in groups.items():
for opname, group in sorted(groups.items()):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that this needs to be sorted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, I just like it that way (if you couldn't tell by now). ;)

Copy link
Member

@savannahostrowski savannahostrowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding in the comment about the naming conventions - I think that helps! Otherwise, this looks pretty solid to me (barring some Windows CI failures). Lots of moving things into function but it's a whole lot more readable! 🎉

@brandtbucher
Copy link
Member Author

Windows JIT CI fixed in GH-118564.

@brandtbucher brandtbucher merged commit 1b7e5e6 into python:main May 3, 2024
57 of 59 checks passed
SonicField pushed a commit to SonicField/cpython that referenced this pull request May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build The build process and cross-build interpreter-core (Objects, Python, Grammar, and Parser dirs) performance Performance or resource usage skip news topic-JIT
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants