Skip to content

hash: Add SHA-NI implementation of SHA-256 #15152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Aug 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions NEWS
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,8 @@ PHP NEWS
- Hash:
. Deprecated passing incorrect data types for options to ext/hash functions.
(nielsdos)
. Added SSE2 and SHA-NI implementation of SHA-256. (timwolla, Colin Percival,
Graham Percival)

- PHPDBG:
. array out of bounds, stack overflow handled for segfault handler on windows.
Expand Down
2 changes: 1 addition & 1 deletion README.REDIST.BINS
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
18. avifinfo (ext/standard/libavifinfo) see ext/standard/libavifinfo/LICENSE
19. xxHash (ext/hash/xxhash)
20. Lexbor (ext/dom/lexbor/lexbor) see ext/dom/lexbor/LICENSE

21. Portions of libcperciva (ext/hash/hash_sha_{ni,sse2}.c) see the header in the source file

3. pcre2lib (ext/pcre)

Expand Down
4 changes: 4 additions & 0 deletions UPGRADING
Original file line number Diff line number Diff line change
Expand Up @@ -955,6 +955,10 @@ PHP 8.4 UPGRADE NOTES
. Improved the performance of FTP uploads up to a factor of 10x for large
uploads.

- Hash:
. Added SSE2 and SHA-NI implementations of SHA-256. This improves the performance
on supported CPUs by ~1.3x (SSE2) and 3x - 5x (SHA-NI).

- MBString:
. The performance of strspn() and strcspn() is greatly improved.
They now run in linear time instead of being bounded by quadratic time.
Expand Down
1 change: 1 addition & 0 deletions Zend/zend_cpuinfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ typedef enum _zend_cpu_feature {
ZEND_CPU_FEATURE_AVX512F = (1<<16 | ZEND_CPU_EBX_MASK),
ZEND_CPU_FEATURE_AVX512DQ = (1<<17 | ZEND_CPU_EBX_MASK),
ZEND_CPU_FEATURE_AVX512CD = (1<<28 | ZEND_CPU_EBX_MASK),
ZEND_CPU_FEATURE_SHA = (1<<29 | ZEND_CPU_EBX_MASK),
/* intentionally don't support = (1<<30 | ZEND_CPU_EBX_MASK) */
/* intentionally don't support = (1<<31 | ZEND_CPU_EBX_MASK) */

Expand Down
2 changes: 1 addition & 1 deletion ext/hash/config.m4
Original file line number Diff line number Diff line change
Expand Up @@ -34,7 +34,7 @@ else
PHP_HASH_CFLAGS="$PHP_HASH_CFLAGS -I@ext_srcdir@/$SHA3_DIR -DKeccakP200_excluded -DKeccakP400_excluded -DKeccakP800_excluded -DZEND_ENABLE_STATIC_TSRMLS_CACHE=1"
fi

EXT_HASH_SOURCES="hash.c hash_md.c hash_sha.c hash_ripemd.c hash_haval.c \
EXT_HASH_SOURCES="hash.c hash_md.c hash_sha.c hash_sha_sse2.c hash_sha_ni.c hash_ripemd.c hash_haval.c \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have support for this on Windows, too; maybe just try to add these files to ext/hash/config.w32 and see what CI makes of it. At least SSE2 shouldn't be an issue on Windows, since the build system defaults to this for many years (CI even requires AVX2 instructions).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made the suggested change and pushed. I don't use Windows myself, though. Please check yourself and push any necessary fixes into my branch (pushing is enabled for maintainers):

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem are the static restrict qualifiers when some array arguments are declared. Removing these makes the build pass. I've pushed a quick fix, but according to https://stackoverflow.com/questions/53863084/what-optimization-benefit-does-pointerrestrict-static-1-bring-when-declare-a this construct is supported by C, but not by C++, so the #ifdef would need to be fixed. I'm not sure about that, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you been able to verify if the execution on Windows has become faster / uses the SSE / SHA-NI implementation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A default build runs the SSE2 code branch. An AVX2 build still does for me (not sure if I have support for Intel SHA instructions). I'll check that, but first a couple of Windows build issues need to be resolved (e.g. HAVE_IMMINTRISIC_H is never defined, although that header file is apparently available (Windows SDK 10.0.20348.0).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that whole AVX detection is apparently pretty much messed up in php-src. Prior to this PR, HAVE_IMMINTRISIC_H is only used once in zend_portability.h, to conditionally define PHP_HAVE_AVX2 (but not for MSVC), which is then used to conditionally define ZEND_INTRIN_AVX2_RESOLVER (which is always done on Windows except for ARM64).

If immintrin.h is included, it is guarded by __SSE2__, __SSE3__, __AVX__, __AVX2__, ZEND_INTRIN_AVX2_NATIVE, or ZEND_INTRIN_AVX2_RESOLVER. Not sure how to generally solve this, but maybe this PR should not use HAVE_IMMINTRIN_H header at all. although it would be possible to define that symbol on Windows (probably unconditionally).

And there is another issue, namely in hash_sha_ni.c and php_hash_sha.h we have #if (defined(__i386__) || defined(__x86_64__)) && defined(HAVE_IMMINTRIN_H). but these are never defined by MSVC. Not sure if we have some portable symbols for that; otherwise __WIN32 and __WIN64 could be used (although the latter is also defined for ARM64 which is probably an issue).

Anyway, I've just skipped these checks for a test build, verified that SHA256_Transform_shani() is actually called, and ext/hash/tests/sha256.phpt passed. \o/

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh, that whole AVX detection is apparently pretty much messed up in php-src.

Yeah, that's what I meant by the open task item in the original issue: I have no idea what I was doing here and the existing source files where the intrinsics are used didn't help either, because they are preprocessor hell.

So I just did what looked reasonable and worked for me, hoping for someone knowledgeable to advice.

hash_tiger.c hash_gost.c hash_snefru.c hash_whirlpool.c hash_adler32.c \
hash_crc32.c hash_fnv.c hash_joaat.c $EXT_HASH_SHA3_SOURCES
murmur/PMurHash.c murmur/PMurHash128.c hash_murmur.c hash_xxhash.c"
Expand Down
2 changes: 1 addition & 1 deletion ext/hash/config.w32
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ if (PHP_MHASH != 'no') {

PHP_HASH = 'yes';

EXTENSION('hash', 'hash.c hash_md.c hash_sha.c hash_ripemd.c hash_haval.c ' +
EXTENSION('hash', 'hash.c hash_md.c hash_sha.c hash_sha_sse2.c hash_sha_ni.c hash_ripemd.c hash_haval.c ' +
'hash_tiger.c hash_gost.c hash_snefru.c hash_whirlpool.c ' +
'hash_adler32.c hash_crc32.c hash_joaat.c hash_fnv.c ' +
'hash_sha3.c hash_murmur.c hash_xxhash.c', false);
Expand Down
19 changes: 19 additions & 0 deletions ext/hash/hash_sha.c
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

#include "php_hash.h"
#include "php_hash_sha.h"
#include "Zend/zend_cpuinfo.h"

static const unsigned char PADDING[128] =
{
Expand Down Expand Up @@ -160,6 +161,24 @@ PHP_HASH_API void PHP_SHA256InitArgs(PHP_SHA256_CTX * context, ZEND_ATTRIBUTE_UN
*/
static void SHA256Transform(uint32_t state[8], const unsigned char block[64])
{
#if defined(PHP_HASH_INTRIN_SHA_NATIVE)
SHA256_Transform_shani(state, block);
return;
#elif defined(PHP_HASH_INTRIN_SHA_RESOLVER)
if (zend_cpu_supports(ZEND_CPU_FEATURE_SSSE3) && zend_cpu_supports(ZEND_CPU_FEATURE_SHA)) {
SHA256_Transform_shani(state, block);
return;
}
#endif

#if defined(__SSE2__)
uint32_t tmp32[72];

SHA256_Transform_sse2(state, block, &tmp32[0], &tmp32[64]);
ZEND_SECURE_ZERO((unsigned char*) tmp32, sizeof(tmp32));
return;
#endif

uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
uint32_t e = state[4], f = state[5], g = state[6], h = state[7];
uint32_t x[16], T1, T2, W[64];
Expand Down
176 changes: 176 additions & 0 deletions ext/hash/hash_sha_ni.c
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
/*-
* Copyright 2018 Tarsnap Backup Inc.
* All rights reserved.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
*
* THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*/

#include "php_hash.h"
#include "php_hash_sha.h"

#if (defined(__i386__) || defined(__x86_64__)) && defined(HAVE_IMMINTRIN_H)

# include <immintrin.h>

# if PHP_HASH_INTRIN_SHA_RESOLVER
static __m128i be32dec_128(const uint8_t * src) __attribute__((target("ssse3")));
void SHA256_Transform_shani(uint32_t state[PHP_STATIC_RESTRICT 8], const uint8_t block[PHP_STATIC_RESTRICT 64]) __attribute__((target("ssse3,sha")));
# endif

/* Original implementation from libcperciva follows.
*
* Modified to use `PHP_STATIC_RESTRICT` for MSVC compatibility.
*/

/**
* This code uses intrinsics from the following feature sets:
* SHANI: _mm_sha256msg1_epu32, _mm_sha256msg2_epu32, _mm_sha256rnds2_epu32
* SSSE3: _mm_shuffle_epi8, _mm_alignr_epi8
* SSE2: Everything else
*
* The SSSE3 intrinsics could be avoided at a slight cost by using a few SSE2
* instructions in their place; we have not done this since to our knowledge
* there are presently no CPUs which support the SHANI instruction set but do
* not support SSSE3.
*/

/* Load 32-bit big-endian words. */
static __m128i
be32dec_128(const uint8_t * src)
{
const __m128i SHUF = _mm_set_epi8(12, 13, 14, 15, 8, 9, 10, 11,
4, 5, 6, 7, 0, 1, 2, 3);
__m128i x;

/* Load four 32-bit words. */
x = _mm_loadu_si128((const __m128i *)src);

/* Reverse the order of the bytes in each word. */
return (_mm_shuffle_epi8(x, SHUF));
}

/* Convert an unsigned 32-bit immediate into a signed value. */
#define I32(a) ((UINT32_C(a) >= UINT32_C(0x80000000)) ? \
-(int32_t)(UINT32_C(0xffffffff) - UINT32_C(a)) - 1 : (int32_t)INT32_C(a))

/* Load four unsigned 32-bit immediates into a vector register. */
#define IMM4(a, b, c, d) _mm_set_epi32(I32(a), I32(b), I32(c), I32(d))

/* Run four rounds of SHA256. */
#define RND4(S, W, K0, K1, K2, K3) do { \
__m128i M; \
\
/* Add the next four words of message schedule and round constants. */ \
M = _mm_add_epi32(W, IMM4(K3, K2, K1, K0)); \
\
/* Perform two rounds of SHA256, using the low two words in M. */ \
S[1] = _mm_sha256rnds2_epu32(S[1], S[0], M); \
\
/* Shift the two words of M down and perform the next two rounds. */ \
M = _mm_srli_si128(M, 8); \
S[0] = _mm_sha256rnds2_epu32(S[0], S[1], M); \
} while (0)

/* Compute the ith set of four words of message schedule. */
#define MSG4(W, i) do { \
W[(i + 0) % 4] = _mm_sha256msg1_epu32(W[(i + 0) % 4], W[(i + 1) % 4]); \
W[(i + 0) % 4] = _mm_add_epi32(W[(i + 0) % 4], \
_mm_alignr_epi8(W[(i + 3) % 4], W[(i + 2) % 4], 4)); \
W[(i + 0) % 4] = _mm_sha256msg2_epu32(W[(i + 0) % 4], W[(i + 3) % 4]); \
} while (0)

/* Perform 4 rounds of SHA256 and generate more message schedule if needed. */
#define RNDMSG(S, W, i, K0, K1, K2, K3) do { \
RND4(S, W[i % 4], K0, K1, K2, K3); \
if (i < 12) \
MSG4(W, i + 4); \
} while (0)

/**
* SHA256_Transform_shani(state, block):
* Compute the SHA256 block compression function, transforming ${state} using
* the data in ${block}. This implementation uses x86 SHANI and SSSE3
* instructions, and should only be used if CPUSUPPORT_X86_SHANI and _SSSE3
* are defined and cpusupport_x86_shani() and _ssse3() return nonzero.
*/
void
SHA256_Transform_shani(uint32_t state[PHP_STATIC_RESTRICT 8],
const uint8_t block[PHP_STATIC_RESTRICT 64])
{
__m128i S3210, S7654;
__m128i S0123, S4567;
__m128i S0145, S2367;
__m128i W[4];
__m128i S[2];

/* Load state. */
S3210 = _mm_loadu_si128((const __m128i *)&state[0]);
S7654 = _mm_loadu_si128((const __m128i *)&state[4]);

/* Shuffle the 8 32-bit values into the order we need them. */
S0123 = _mm_shuffle_epi32(S3210, 0x1B);
S4567 = _mm_shuffle_epi32(S7654, 0x1B);
S0145 = _mm_unpackhi_epi64(S4567, S0123);
S2367 = _mm_unpacklo_epi64(S4567, S0123);

/* Load input block; this is the start of the message schedule. */
W[0] = be32dec_128(&block[0]);
W[1] = be32dec_128(&block[16]);
W[2] = be32dec_128(&block[32]);
W[3] = be32dec_128(&block[48]);

/* Initialize working variables. */
S[0] = S0145;
S[1] = S2367;

/* Perform 64 rounds, 4 at a time. */
RNDMSG(S, W, 0, 0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5);
RNDMSG(S, W, 1, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5);
RNDMSG(S, W, 2, 0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3);
RNDMSG(S, W, 3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174);
RNDMSG(S, W, 4, 0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc);
RNDMSG(S, W, 5, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da);
RNDMSG(S, W, 6, 0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7);
RNDMSG(S, W, 7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967);
RNDMSG(S, W, 8, 0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13);
RNDMSG(S, W, 9, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85);
RNDMSG(S, W, 10, 0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3);
RNDMSG(S, W, 11, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070);
RNDMSG(S, W, 12, 0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5);
RNDMSG(S, W, 13, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3);
RNDMSG(S, W, 14, 0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208);
RNDMSG(S, W, 15, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2);

/* Mix local working variables into global state. */
S0145 = _mm_add_epi32(S0145, S[0]);
S2367 = _mm_add_epi32(S2367, S[1]);

/* Shuffle state back to the original word order and store. */
S0123 = _mm_unpackhi_epi64(S2367, S0145);
S4567 = _mm_unpacklo_epi64(S2367, S0145);
S3210 = _mm_shuffle_epi32(S0123, 0x1B);
S7654 = _mm_shuffle_epi32(S4567, 0x1B);
_mm_storeu_si128((__m128i *)&state[0], S3210);
_mm_storeu_si128((__m128i *)&state[4], S7654);
}

#endif
Loading
Loading