-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip] Pass opline as argument to opcode handlers in CALL VM #17952
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really interesting. This will work with MSVC as well, right?
On bench.php callgrind shows significant improvement without JIT, and slight improvement with tracing JIT.
As I understood, the main improvement comes from elimination of EX(opline)
loading in each instruction handler (elimination of USE_OPLINE
) and improvement of prologues and epilogues of short handlers. Right? May be I missed something?
I'm really surprised in the effect. I understand the improvement coming from caching EX(opline)
in a preserved CPU register, but in my opinion, passing and returning it across all handlers should smooth the effect from load elimination. It seems I was wrong.
I think, this should be finalized, carefully reviewed and merged.
Zend/zend_vm_execute.h
Outdated
# define ZEND_OPCODE_HANDLER_ARGS zend_execute_data *execute_data | ||
# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU execute_data | ||
# define ZEND_OPCODE_HANDLER_ARGS_DC , ZEND_OPCODE_HANDLER_ARGS | ||
# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC , ZEND_OPCODE_HANDLER_ARGS_PASSTHRU | ||
# define ZEND_OPCODE_HANDLER_ARGS zend_execute_data *execute_data, const zend_op *opline | ||
# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU execute_data, opline | ||
# define ZEND_OPCODE_HANDLER_ARGS_DC ZEND_OPCODE_HANDLER_ARGS, | ||
# define ZEND_OPCODE_HANDLER_ARGS_PASSTHRU_CC ZEND_OPCODE_HANDLER_ARGS_PASSTHRU, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally PHP used _D
, _C
, _DC
and _CC
suffixes for macros that implements implementation defined arguments. I can't remember what does they mean. Probably C
- comma and D
- declaration.
@derickr do you remeber?
Probably, we shouldn't use the same convention for something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. I'm not sure this change (commit 7d7be6d) makes a difference anymore, so I may revert it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, call, comma and declaration as far as I remember. CC = call comma; DC = declaration comma.
So this still fits.
Zend/zend_vm_execute.h
Outdated
# define ZEND_VM_ENTER_BIT (1ULL<<(UINTPTR_WIDTH-1)) | ||
# define ZEND_VM_ENTER_EX() return (zend_op*)((uintptr_t)opline | ZEND_VM_ENTER_BIT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumption that the high bit of address must be zero may be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's right at least on x86_64, but I will check other platforms. I hope I can keep this scheme as it allows this check to be a single instruction:
Line 2150 in 3ce2089
out($f, $m[1]."if (UNEXPECTED((intptr_t)OPLINE <= 0))".$m[3]."\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added support for systems where the high bit may be set in user space addresses: 41335c4
On these systems I assume that alignment is > 1 (we make this assumption in a few places already), and I define ZEND_VM_ENTER_BIT
to 1
.
ZEND_VM_RETURN()
becomes return ZEND_VM_ENTER_BIT
instead of return 0
, and ZEND_VM_LEAVE()
/ ZEND_VM_ENTER()
remain return opline | ZEND_VM_ENTER_BIT
.
The check
if (UNEXPECTED((intptr_t)opline <= 0)) {
becomes
if (UNEXPECTED(((uintptr_t)opline & ZEND_VM_ENTER_BIT))) {
which compiles very similarly:
test $0x1,%al
je 0xa5f740
Instead of
test %rax,%rax
jg 0xa5f6f0
Both have the same encoded size.
The alignment variant is very slightly faster according to both valgrind and wall time. I think that's because the constant 1ULL<<63
results in slightly bigger code in other places.
So I may keep only the alignment variant, as it's more portable. WDYT?
#ifdef ZEND_VM_FP_GLOBAL_REG | ||
execute_data = vm_stack_data.orig_execute_data; | ||
# ifdef ZEND_VM_IP_GLOBAL_REG | ||
opline = vm_stack_data.orig_opline; | ||
# endif | ||
return; | ||
#else | ||
if (EXPECTED(ret > 0)) { | ||
if (EXPECTED(opline != NULL && (uintptr_t)opline != ZEND_VM_ENTER_BIT)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check must be more expensive...
The second part looks incorrect. I suppose it should check only a single bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have opline == ZEND_VM_ENTER_BIT
because the JIT will return opline | ZEND_VM_ENTER_BIT
unconditionally to force the VM to reload execute_data
, even when opline
might be 0
.
I've updated this to
opline = (const zend_op*)((uintptr_t)opline & ~ZEND_VM_ENTER_BIT);
if (EXPECTED(opline != NULL)) {
Which is similar, but clearer. This adds just one instruction compared to the base branch.
static int ZEND_FASTCALL zend_runtime_jit(void) | ||
static ZEND_OPCODE_HANDLER_RET ZEND_FASTCALL zend_runtime_jit(ZEND_OPCODE_HANDLER_ARGS) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ops. It looks like the previous prototype was wrong.
ext/opcache/jit/zend_jit.c
Outdated
#if GCC_GLOBAL_REGS | ||
return; // ZEND_VM_CONTINUE | ||
#else | ||
return op_array->opcodes; // ZEND_VM_CONTINUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may skip leading RECV instructions. Actually, you should return the opline
passed as argument or EX(opline)
.
#ifndef HAVE_GCC_GLOBAL_REGS | ||
opline = EX(opline); | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this work with other VM types (GOTO and SWITCH)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will check!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JIT doesn't support the GOTO and SWITCH VMs currently:
php-src/ext/opcache/jit/zend_jit.c
Line 3677 in f75dd82
zend_error(E_WARNING, "JIT is compatible only with CALL and HYBRID VM. JIT disabled."); |
I'm assuming it's because zend_get_opcode_handler_func()
can not be implemented, as there are no opcode handler functions:
php-src/Zend/zend_vm_execute.skl
Line 126 in f75dd82
ZEND_API const void* ZEND_FASTCALL zend_get_opcode_handler_func(const zend_op *op) |
We could support these VM kinds in the future by generating handler funcs in addition to the big SWITCH/GOTO. If these functions use the same conventions as the CALL VM, then the changes in this PR will work with these VM kinds. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't mind supporting JIT for SWITCH/GOTO. Just SWITCH/GOTO VM interpreter.
I assume so, but I still need to test on Windows/MSVC, x86, aarch64.
Yes, it is my intuition that eliminating the load/stores of A comparison of the annotated code of Base branch:
Current branch:
It's possibly slightly slower than using global fixed registers, but in comparison to using EX(opline), it's almost equivalent. In the fast path, opline will just be held in either %rsi or %rax and will not be spilled. It will need to be moved back and forth between the two registers when returning from and calling op handlers, but this is less expensive than loading/storing In the case of |
Right. I missed store. And this should make the biggest impact to real-life performance.
These specialized instructions are rare on real-life loads. Your tests showed relatively small improvement on Symfony. Note that finalizing may decrease this even more. On the other hand some related optimizations may be discovered.
I see, and this makes sense. |
@@ -1934,6 +1934,7 @@ function run_test(string $php, $file, array $env): string | |||
$diff_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'diff'; | |||
$log_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'log'; | |||
$exp_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'exp'; | |||
$stdin_filename = $temp_dir . DIRECTORY_SEPARATOR . $main_file_name . 'stdin'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes in this file will be go to a separate PR
So that execute_ex/opline are always passed via (the same) registers
Instead of returning -opline. This simplifies JIT as we can simply return -1 for VM_ENTER/VM_LEAVE. However, this implies that EX(opline) must be in sync.
This reverts commit 9aaceaf6a2480836ae59d9657d37b10bbe04268e.
This simplies JIT compared to returning -opline, and doesn't require EX(opline) to be in sync.
Saving them in execute_ex is not safe when not assigning them to global regs, as the compiler allocate them.
…U_CC As they do not follow the _DC / _CC convention anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not able to make a deep review now.
The patch is big, but I think it may make sense.
It shows improvement without JIT.
With JIT I see degradation (because of useless moving argument to %r15
).
It may be not easy to eliminate this.
I don't like ZEND_VM_ENTER_BIT
magic. If reloading opline form EX(opline) on ENTER/LEAVE won't make a big difference, I would prefer to do this.
Please, see the questions.
#ifdef ZEND_VM_IP_GLOBAL_REG | ||
USE_OPLINE | ||
|
||
SAVE_OPLINE(); | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This removes EX(opline) = opline
when ZEND_VM_IP_GLOBAL_REG
is defined.
This may cause output of incorrect line number in error message.
May be I miss something?
out($f,"# define ZEND_VM_ENTER_BIT 1ULL\n"); | ||
out($f,"# endif\n"); | ||
out($f,"# define ZEND_VM_ENTER_EX() return (zend_op*)((uintptr_t)opline | ZEND_VM_ENTER_BIT)\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can't guaranty that C compiler always align labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return just some constant (e.g. NULL
) and then reload opline
from EX(opline)
?
Or this is exactly that you are trying to optimize?
# ifdef ZEND_HIGH_HALF_KERNEL | ||
# define ZEND_VM_ENTER_BIT (1ULL<<(UINTPTR_WIDTH-1)) | ||
# define ZEND_VM_RETURN_VAL 0 | ||
# else | ||
# define ZEND_VM_ENTER_BIT 1ULL | ||
# define ZEND_VM_RETURN_VAL ZEND_VM_ENTER_BIT | ||
# endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we redifine ZEND_VM_ENTER_BIT
here?
#ifndef HAVE_GCC_GLOBAL_REGS | ||
opline = EX(opline); | ||
#endif | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didn't mind supporting JIT for SWITCH/GOTO. Just SWITCH/GOTO VM interpreter.
JIT leave also became more expansive
|
BTW: I told about degradation in term of callgrind instructions, that is not the same as real-time. |
Related:
This extracts the part of #17849 that passes
opline
as opcode handler argument in the Call VM. This should reduce the size of #17849, and also unify the Hybrid and Call VMs slightly.Currently we have two VMs:
execute_data
andopline
are global register variablesexecute_data
is passed as opcode handler arg, andopline
is loaded/stored from/toexecute_data->opline
.The Call VM looks like this:
Opcode handlers return a positive value to signal that the loop must load a new
execute_data
fromEG(current_execute_data)
, typically when entering or leaving a function.Changes in this PR
opline
as opcode handler argumentopline
from opcode handlersopline | (1<<63)
to signal thatexecute_data
must be reloaded fromEG(current_execute_data)
This gives us:
In addition to making the changes of #17849 smaller and to unifying VMs slightly, this improves performances of the Call VM:
bench.php
is 23% faster:Symfony Demo
is2.8%
faster:JIT
When using the Hybrid VM, JIT stores execute_data/opline in two fixed callee-saved registers and rarely touches
EX(opline)
, just like the VM.Since the registers are callee-saved, the JIT'ed code doesn't have to save them before calling other functions, and can assume they always contain execute_data/opline. The code also avoids saving/restoring them in prologue/epilogue, as
execute_ex
takes care of that (JIT'ed code is called exclusively from there).When using the Call VM, we can do that, too, except that we can't rely on
execute_ex
to save the registers for us, as it may use these registers it self. So we have to save/restore the two registers in JIT'ed code prologue/epilogue.TODO