Memory corruption on 8.4 #17974

Open
danog opened this issue Mar 5, 2025 · 14 comments

@danog
Contributor

danog commented Mar 5, 2025

Description

After switching to PHP 8.4, we're seeing memory corruption segfaults occurring during zend_deactivate (no JIT, just opcache).

Not sure whether it's viable to run with ASAN on prod to further debug the issue...

Ref phpredis/phpredis#2630

While redis is in the stack trace, obviously the corruption could have happened sooner.

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x0000778718bc227e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x0000778718ba58ff in __GI_abort () at ./stdlib/abort.c:79
#5  0x000059634ccbbb76 in zend_mm_panic (message=0x59634d0042b7 "zend_mm_heap corrupted") at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_alloc.c:398
#6  0x000059634ccbbe70 in zend_mm_get_next_free_slot (slot=<optimized out>, bin_num=<optimized out>, heap=<optimized out>)
    at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_alloc.c:1326
#7  zend_mm_alloc_small (bin_num=<optimized out>, heap=<optimized out>) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_alloc.c:1410
#8  zend_mm_alloc_heap (size=<optimized out>, heap=<optimized out>) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_alloc.c:1488
#9  _emalloc (size=<optimized out>) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_alloc.c:2740
#10 0x000059634cfc5cac in zend_string_alloc (persistent=false, len=<optimized out>) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_string.h:176
#11 smart_str_erealloc (len=128, str=0x7ffcddb1bf20) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_smart_str.c:36
#12 smart_str_erealloc (str=str@entry=0x7ffcddb1bf20, len=len@entry=128) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_smart_str.c:30
#13 0x0000778711c52f73 in smart_str_alloc (persistent=false, len=128, str=0x7ffcddb1bf20) at /usr/include/php/20240924/Zend/zend_smart_str.h:50
#14 redis_pool_spprintf (redis_sock=redis_sock@entry=0x77870816d380, fmt=0x42259268 "") at /usr/src/php-redis-6.1.0-2+ubuntu24.04.1+deb.sury.org+1/build-8.4/library.c:876
#15 0x0000778711c53420 in redis_sock_get_connection_pool (redis_sock=0x77870816d380) at /usr/src/php-redis-6.1.0-2+ubuntu24.04.1+deb.sury.org+1/build-8.4/library.c:114
#16 0x0000778711c5e628 in redis_sock_disconnect (redis_sock=0x77870816d380, force=0, is_reset_mode=1)
    at /usr/src/php-redis-6.1.0-2+ubuntu24.04.1+deb.sury.org+1/build-8.4/library.c:3233
#17 0x0000778711c18cda in free_redis_object (object=0x778708094cc8) at /usr/src/php-redis-6.1.0-2+ubuntu24.04.1+deb.sury.org+1/build-8.4/redis.c:201
#18 0x000059634cfb7af3 in zend_objects_store_free_object_storage (objects=objects@entry=0x59634d180fd8 <executor_globals+856>, fast_shutdown=fast_shutdown@entry=true)
    at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_objects_API.c:105
#19 0x000059634cf17349 in zend_shutdown_executor_values (fast_shutdown=fast_shutdown@entry=true) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_execute_API.c:425
#20 0x000059634cf179e2 in shutdown_executor () at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend_execute_API.c:442
#21 0x000059634cfd5f59 in zend_deactivate () at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/Zend/zend.c:1347
#22 0x000059634ce6704b in php_request_shutdown (dummy=dummy@entry=0x0) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/main/main.c:1950
#23 0x000059634cce0c7e in main (argc=<optimized out>, argv=<optimized out>) at /usr/src/php8.4-8.4.4-1+ubuntu24.04.1+deb.sury.org+1/sapi/fpm/fpm/fpm_main.c:1966

PHP Version

PHP 8.4.4

Operating System

Ubuntu 24.04.1

@danog
Contributor Author

danog commented Mar 5, 2025

Curiously, manually extracting the shadow, shadow key and next ptr for slot 8 (should be the correct slot number for the allocation size 231) from the coredump and manually recomputing the next ptr from the shadow+shadowkey returns a correct value...

Might have chosen the wrong slot tho
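
For checking the slot number, here is a minimal sketch of the size-to-bin lookup. It assumes the small-bin size table from Zend/zend_alloc_sizes.h as reproduced from memory below, so the list should be verified against the source of the build in question; `BIN_SIZES` and `bin_for_size` are hypothetical names used only for illustration.

```python
from bisect import bisect_left

# Assumption: the 30 small-bin sizes of ZendMM, copied from memory.
# Verify against Zend/zend_alloc_sizes.h before relying on them.
BIN_SIZES = [
    8, 16, 24, 32, 40, 48, 56, 64, 80, 96,
    112, 128, 160, 192, 224, 256, 320, 384, 448, 512,
    640, 768, 896, 1024, 1280, 1536, 1792, 2048, 2560, 3072,
]


def bin_for_size(size: int) -> int:
    """Return the index of the smallest bin that can hold `size` bytes."""
    if not 0 < size <= BIN_SIZES[-1]:
        raise ValueError("size is outside the small-allocation range")
    # bisect_left finds the first bin whose size is >= the requested size.
    return bisect_left(BIN_SIZES, size)


index = bin_for_size(231)
print(index, BIN_SIZES[index])
```

Under that table a 231-byte request would land in the 256-byte bin (index 15) rather than index 8, which would fit the doubt about having picked the wrong slot.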

@danog
Contributor Author

danog commented Mar 5, 2025

It would be super useful to have tooling in .gdbinit to dump the stack & automatically validate the shadow values for all slots, and print out the contents of invalid slots (to see specifically what data caused the corruption)

@iluuu1994
Member

Well, can you reproduce this reliably? Would it be possible to disable Redis? Ofc, this could hide the issue even if it isn't caused by Redis. In any case, I really don't know how we can help without a reproducer...

@danog
Contributor Author

danog commented Mar 7, 2025

I can't reproduce this reliably unless I were to try and run ASAN in production, which isn't something I can realistically do (this is a very common condition actually).

I did, however, come up with a small gdb script to analyze the zend heap and pinpoint the exact change that trips the assertion: it's a single flipped bit. That would make me think of cosmic rays, except that it happens regularly and only when persistent redis connections are enabled.

define dh
	# Check the first free slot of every small bin in the ZendMM heap.
	set $mm_heap = alloc_globals.mm_heap
	set $i = 0
	while $i < 30
		set $free_slot = $mm_heap->free_slot[$i]
		if $free_slot != 0
			set $next = $free_slot->next_free_slot
			if $next != 0
				# The shadow copy of the next pointer sits in the last pointer-sized word of the slot.
				set $sz = bin_data_size[$i] - sizeof(zend_mm_free_slot*)

				set $shadow = *((zend_mm_free_slot**)((char*)($free_slot) + $sz))
				printf "Init shadow: %p\n", $shadow
				# Decode: XOR with the per-heap key, then byte-swap the 64-bit value.
				set $shadow = (uintptr_t)$shadow ^ $mm_heap->shadow_key
				printf "After XOR  : %p\n", $shadow
				set $shadow = (($shadow & 0xFFu) << 56) | (($shadow & 0xFF00u) << 40) | (($shadow & 0xFF0000u) << 24) | (($shadow & 0xFF000000u) << 8) | (($shadow & 0xFF00000000u) >> 8) | (($shadow & 0xFF0000000000u) >> 24) | (($shadow & 0xFF000000000000u) >> 40) | (($shadow & 0xFF00000000000000u) >> 56)
				printf "After bswap: %p\n", $shadow
				set $shadow = (zend_mm_free_slot*) $shadow
				printf "Expecting  : %p\n", $next
				if $shadow != $next
					# Mismatch: dump the slot contents (without the shadow word) to a file.
					eval "set $fname = \"small_slot_%d.dump\"", $i

					set $contents = (char*) $free_slot
					set $end_contents = $contents + $sz

					eval "dump binary memory %s $contents $end_contents", $fname

					printf "Slot %d at address %p is corrupted, expected guard value %p, got %p, saved %d bytes to %s\n", $i, $free_slot, $next, $shadow, $sz, $fname
				end
			end
		end
		set $i=$i+1
	end
end

Resulting in (some manual padding added for readability):

Init shadow: 0x7bd131090d0a6d10
After XOR  : 0x002fad9d85780000
After bswap: 0x000078859dad2f00
Expecting  : 0x000078059dad2f00

Slot 15 at address 0x78859daf8b00 is corrupted, expected guard value 0x78059dad2f00, got 0x78859dad2f00, saved 248 bytes to small_slot_15.dump
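
The decode steps of the script can be replayed outside gdb. The following is a minimal sketch that mirrors only what the script does (XOR with the key, then a 64-bit byte swap); `bswap64` and `decode_shadow` are hypothetical helper names, and the key is inferred from the two printed values rather than read from the core dump.

```python
MASK64 = (1 << 64) - 1


def bswap64(value: int) -> int:
    """Reverse the byte order of a 64-bit unsigned integer."""
    return int.from_bytes((value & MASK64).to_bytes(8, "little"), "big")


def decode_shadow(raw_shadow: int, shadow_key: int) -> int:
    """Mirror the gdb script: XOR with the key, then byte-swap."""
    return bswap64(raw_shadow ^ shadow_key)


RAW_SHADOW = 0x7BD131090D0A6D10  # "Init shadow" from the output above
AFTER_XOR = 0x002FAD9D85780000   # "After XOR" from the output above

# Assumption: the key is recovered from the two printed values,
# since the thread does not print shadow_key itself.
SHADOW_KEY = RAW_SHADOW ^ AFTER_XOR

decoded = decode_shadow(RAW_SHADOW, SHADOW_KEY)
print(f"0x{decoded:016x}")  # 0x000078859dad2f00, the "After bswap" value
```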

The difference between 0x85 and 0x05 is a single bit...
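
To make the single-bit claim concrete, a short sketch that XORs the two pointer values from the output and lists the bit positions that differ (variable names are illustrative only):

```python
EXPECTED = 0x000078059DAD2F00  # "Expecting": next_free_slot as stored in the slot
DECODED = 0x000078859DAD2F00   # "After bswap": value decoded from the shadow copy

# XOR leaves a 1 in every bit position where the two values disagree.
diff = EXPECTED ^ DECODED
flipped_bits = [bit for bit in range(64) if (diff >> bit) & 1]

print(hex(diff), flipped_bits)  # 0x8000000000 [39]
```

Bit 39 is the top bit of byte 4 (counting from the least significant byte), which is the 0x05 versus 0x85 difference. The check only shows that the two copies disagree in one bit; it does not say which copy was modified.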

The rest of the page doesn't contain anything I can make sense of either:

root@dev(50cb372aaa1e):~ # xxd small_slot_15.dump
00000000: 002f ad9d 0578 0000 4611 0000 0000 0000  ./...x..F.......
00000010: e018 1d9f 8578 0000 2079 90e5 bc5c 0000  .....x.. y...\..
00000020: 0000 0000 0000 0000 48b7 be9d 8578 0000  ........H....x..
00000030: 0803 0000 0000 0000 48b8 be9d 8578 0000  ........H....x..
00000040: 0803 0000 0000 0000 60a1 5762 fe7f 0000  ........`.Wb....
00000050: 0000 0000 0100 0000 0000 0000 0000 0000  ................
00000060: 0400 0000 0000 0000 0200 0000 0000 0000  ................
00000070: 0200 0000 0000 0000 0200 0000 0000 0000  ................
00000080: 0100 0000 0000 0000 0200 0000 0000 0000  ................
00000090: 0200 0000 0000 0000 406d b09e 8578 0000  ........@m...x..
000000a0: 0600 0000 0000 0000 8050 aba1 8578 0000  .........P...x..
000000b0: 0601 0000 0000 0000 48e3 bc9d 8578 0000  ........H....x..
000000c0: 0703 0000 0000 0000 0200 0000 0000 0000  ................
000000d0: 0100 0000 0000 0000 00a0 40a1 8578 0000  ..........@..x..
000000e0: 0803 0000 0000 0000 0200 0000 0000 0000  ................
000000f0: 0200 0000 0000 0000                      ........
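
One way to read the dump is as a sequence of little-endian 64-bit words, which is how pointers are laid out on x86-64. A minimal sketch over the first two rows of the xxd output above (the `HEAD` constant is typed in from those rows):

```python
import struct

# First 32 bytes of small_slot_15.dump, copied from the xxd output.
HEAD = bytes.fromhex(
    "002fad9d05780000" "4611000000000000"
    "e0181d9f85780000" "207990e5bc5c0000"
)

# "<4Q" = four unsigned 64-bit integers, little-endian.
words = struct.unpack("<4Q", HEAD)
for offset, word in zip(range(0, len(HEAD), 8), words):
    print(f"+0x{offset:02x}: 0x{word:016x}")
```

The word at offset 0 is 0x000078059dad2f00, the same value the script prints as "Expecting", because the dump starts at the slot's next_free_slot field.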

@smelchior

We are currently also seeing this issue with 8.3.19, with a similar GDB trace where the error happens in ./nptl/pthread_kill.c. The issue did not occur with 8.3.17.

@iluuu1994
Member

@smelchior It's more relevant what comes before zend_mm_panic(). Can you share your full trace?

@smelchior

oh yes sorry, in this case i am getting

(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007ffff74a9f4f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff745afb2 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff7445472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007ffff749e42f in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff75b8459 "%s\n") at ../sysdeps/posix/libc_fatal.c:156
#5  0x00007ffff74b386a in malloc_printerr (str=str@entry=0x7ffff75baf60 "malloc_consolidate(): unaligned fastbin chunk detected") at ./malloc/malloc.c:5660
#6  0x00007ffff74b42cc in malloc_consolidate (av=av@entry=0x7ffff75f1c60 <main_arena>) at ./malloc/malloc.c:4746
#7  0x00007ffff74b68d8 in _int_malloc (av=av@entry=0x7ffff75f1c60 <main_arena>, bytes=bytes@entry=4096) at ./malloc/malloc.c:3961
#8  0x00007ffff74b7aba in __GI___libc_malloc (bytes=bytes@entry=4096) at ./malloc/malloc.c:3315
#9  0x000055555582d799 in __zend_malloc (len=4096) at ./Zend/zend_alloc.c:3128
#10 0x00007ffff7c38983 in ?? () from /lib/x86_64-linux-gnu/libpcre2-8.so.0
#11 0x00007ffff7c3bccd in pcre2_jit_compile_8 () from /lib/x86_64-linux-gnu/libpcre2-8.so.0
#12 0x00005555556d62ad in pcre_get_compiled_regex_cache_ex (regex=0x7fffee9081e0, locale_aware=locale_aware@entry=1) at ./ext/pcre/php_pcre.c:810
#13 0x00005555556d658a in pcre_get_compiled_regex_cache (regex=<optimized out>) at ./ext/pcre/php_pcre.c:896
#14 0x00005555556d8ea3 in php_pcre_replace (regex=<optimized out>, subject_str=0x7fffee719528, subject=0x7fffee719540 "development.*", subject_len=13, replace_str=0x7fffee8817e0, limit=18446744073709551615, replace_count=0x7fffffffb0c8) at ./ext/pcre/php_pcre.c:1572
#15 0x00005555556d9391 in php_replace_in_subject (replace_count=0x7fffffffb0c8, limit=<optimized out>, subject=<optimized out>, replace_ht=0x0, replace_str=<optimized out>, regex_ht=0x0, regex_str=<optimized out>) at ./ext/pcre/php_pcre.c:2148
#16 preg_replace_common (execute_data=<optimized out>, return_value=0x7ffff4e16900, is_filter=false) at ./ext/pcre/php_pcre.c:2286
#17 0x00005555558cd4ad in ZEND_DO_ICALL_SPEC_RETVAL_USED_HANDLER () at ./Zend/zend_vm_execute.h:1337
#18 execute_ex (ex=0x1828a) at ./Zend/zend_vm_execute.h:57246
#19 0x00005555558d2525 in zend_execute (op_array=0x7ffff4e89000, return_value=0x0) at ./Zend/zend_vm_execute.h:61634
#20 0x000055555585e078 in zend_execute_scripts (type=type@entry=8, retval=retval@entry=0x0, file_count=file_count@entry=3) at ./Zend/zend.c:1895
#21 0x00005555557f229e in php_execute_script (primary_file=primary_file@entry=0x7fffffffd700) at ./main/main.c:2529
#22 0x000055555594a62d in do_cli (argc=4, argv=0x555555b033d0) at ./sapi/cli/php_cli.c:966
#23 0x000055555567fb97 in main (argc=4, argv=0x555555b033d0) at ./sapi/cli/php_cli.c:1341

which indeed looks different ;) I might therefore be mistaken.

@iluuu1994
Member

@smelchior Likely unrelated, but who knows. Do you have a reliable way to reproduce this? If you can identify which endpoint caused the crash, you may try to run it with Valgrind (which will be very slow, but might give some insight).

@smelchior

@iluuu1994

I do, this happens on a CLI call. From my strace i can see that it dies when parsing a specific XML config file.
I pulled the valgrind traces with:

valgrind --tool=memcheck --num-callers=30 --log-file=php.log php8.3 ./bin/doctrine migrations:migrate --no-interaction

valgrind-error-1.log was recorded with the following set, as suggested in https://bugs.php.net/bugs-getting-valgrind-log.php:

export USE_ZEND_ALLOC=0
export ZEND_DONT_UNLOAD_MODULES=1

valgrind-error-2.log was recorded without those variables; it also shows the errors in the xml lib, so it might help a bit.

I am really sorry if this does not quite match the initial issue and i am happy to open another one if that makes more sense.

valgrind-error-1.log
valgrind-error-2.log

@iluuu1994
Member

@smelchior This does look useful, thank you! /cc @nielsdos

Relevant excerpt

==680065== Invalid read of size 4
==680065==    at 0x8B758F3: dom_objects_free_storage (php_dom.c:997)
==680065==    by 0x4A9592: zend_objects_store_del (zend_objects_API.c:200)
==680065==    by 0x47D21D: i_zval_ptr_dtor (zend_variables.h:44)
==680065==    by 0x47D21D: i_free_compiled_variables (zend_execute.c:3883)
==680065==    by 0x47D21D: execute_ex (zend_vm_execute.h:57105)
==680065==    by 0x486524: zend_execute (zend_vm_execute.h:61634)
==680065==    by 0x412077: zend_execute_scripts (zend.c:1895)
==680065==    by 0x3A629D: php_execute_script (main.c:2529)
==680065==    by 0x4FE62C: do_cli (php_cli.c:966)
==680065==    by 0x233B96: main (php_cli.c:1341)
==680065==  Address 0x10392648 is 8 bytes inside a block of size 96 free'd
==680065==    at 0x484417B: free (vg_replace_malloc.c:872)
==680065==    by 0x499EBCB: xmlFreeNode (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x49F9DBB: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x49FBAE4: xmlXIncludeProcessFlags (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x8B80FAB: zim_DOMDocument_xinclude (document.c:1668)
==680065==    by 0x484ABE: ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER (zend_vm_execute.h:1976)
==680065==    by 0x484ABE: execute_ex (zend_vm_execute.h:57282)
==680065==    by 0x486524: zend_execute (zend_vm_execute.h:61634)
==680065==    by 0x412077: zend_execute_scripts (zend.c:1895)
==680065==    by 0x3A629D: php_execute_script (main.c:2529)
==680065==    by 0x4FE62C: do_cli (php_cli.c:966)
==680065==    by 0x233B96: main (php_cli.c:1341)
==680065==  Block was alloc'd at
==680065==    at 0x48417B4: malloc (vg_replace_malloc.c:381)
==680065==    by 0x499D343: xmlNewNsPropEatName (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x4A6C5CF: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x4A6F575: xmlSAX2StartElementNs (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x4989048: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x498A5A0: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x49925CF: ??? (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x49935A7: xmlParseElement (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x49939E9: xmlParseDocument (in /usr/lib/x86_64-linux-gnu/libxml2.so.2.9.14)
==680065==    by 0x8B8091F: dom_document_parser.isra.0 (document.c:1351)
==680065==    by 0x8B80A18: dom_parse_document (document.c:1436)
==680065==    by 0x484ABE: ZEND_DO_FCALL_SPEC_RETVAL_USED_HANDLER (zend_vm_execute.h:1976)
==680065==    by 0x484ABE: execute_ex (zend_vm_execute.h:57282)
==680065==    by 0x486524: zend_execute (zend_vm_execute.h:61634)
==680065==    by 0x412077: zend_execute_scripts (zend.c:1895)
==680065==    by 0x3A629D: php_execute_script (main.c:2529)
==680065==    by 0x4FE62C: do_cli (php_cli.c:966)
==680065==    by 0x233B96: main (php_cli.c:1341)

@nielsdos
Member

@iluuu1994 @smelchior The bug you're seeing is definitely different from OP's bug because libxml always uses system malloc, so it can't corrupt ZendMM.
It seems like fallout from fixing #17847, but it's not immediately clear to me why. I'll try to spend some time on it this evening.

@nielsdos
Member

@iluuu1994 @smelchior Based on the stacktrace I was able to make a guess and a reproducer, and the fix is here (hopefully): #18100
Note that this issue reproduces all the way back to PHP 5.x, so this is nothing new really. It surprises me that this only surfaces now...

@smelchior

Thank you so much @nielsdos for tackling this so quickly!

@nielsdos
Member

The DOM issue is resolved, but the original issue needs more information before we can do anything about it. Either a reproducer or an ASAN trace would help. We're not even sure the problem is inside PHP.
