[X86][AVX] Prefer per-element vector shifts for known splats #39424
Comments
Likely steps:
Hi! This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:
If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.
@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)
| | |
| --- | --- |
| Bugzilla Link | [40077](https://llvm.org/bz40077) |
| Version | trunk |
| OS | Windows NT |
| CC | @adibiagio,@topperc,@RKSimon,@rotateright |
Extended Description

As detailed on https://reviews.llvm.org/rL340813, many recent machines have better throughput for the 'per-element' variable vector shifts than the old style 'scalar-count-in-xmm' variable shifts if we know that the shift amount is already splatted:

Probably the wrong place to report this, but I looked at some other sequences:

For Skylake, variable-shifts (vpsraVd) are a single uop, but count-in-xmm shifts are 2 uops. Probably they're implemented internally as a broadcast feeding the SIMD variable-shift hardware.

The above is 3 uops / 3c latency on SKL. So for AVX2 Skylake (but not Broadwell or earlier) we want this 2 uop / 2c latency implementation:

Same for SKX AVX512 with vpsravw and so on. There are some test cases where we use the same shift-count register multiple times, and it would be significantly better to broadcast it and use variable-shifts instead of count-from-the-low-element shifts.

But on Ryzen, and on Broadwell and earlier, variable-shifts cost more. (Interestingly, on Ryzen they run on a different execution port from the normal count-in-xmm shifts; still a single uop per lane, but 3c latency and not fully pipelined.) Ryzen has shift-in-xmm shifts as efficient as immediate shifts, unlike Intel, where shift-in-xmm is always 2 uops (port5 + shift port).

KNL is horrible for pslld xmm,xmm (13c throughput/latency), but it has the same throughput as immediate shifts for variable shifts like VPSRLVD z,z,z. I don't totally trust Agner's numbers for x,x shifts; maybe he only used the non-VEX encoding?

Anyway, for AVX512 we should prefer broadcast + variable-shift instead of pmovzxb/wq + regular shift, because it's better on SKX and at least as good on KNL. This includes 16-bit elements for AVX512BW, unlike AVX2.

(With AVX1 we don't have variable shifts, so the earlier implementation with vpsrad is our best option.)
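The code snippets from the original Bugzilla report were lost in the migration, so here is a rough illustration only (my own sketch: the vpmovzxdq count-preparation step and the register assignments are assumptions, and the uop counts are the Skylake figures quoted above) of the kind of rewrite being asked for when an arithmetic right shift's count is already splatted across ymm1:

```asm
; Count-in-xmm form: the splatted count must first be turned into a
; "count in the low 64 bits" value (the low qword of a splatted ymm1 holds
; two copies of the count, which would be a huge out-of-range 64-bit count),
; and then the shift itself pays for the 2-uop VPSRAD ymm, ymm, xmm form.
vpmovzxdq xmm2, xmm1          ; zero-extend the low dword count to a qword (1 uop)
vpsrad    ymm0, ymm0, xmm2    ; scalar-count-in-xmm shift (2 uops on SKL)

; Per-element form: since the count is already splatted, the variable shift
; can consume it directly.
vpsravd   ymm0, ymm0, ymm1    ; per-element variable shift (1 uop on SKL)
```

If the count instead starts out only in the low element, the variable-shift route still wins on Skylake: a single vpbroadcastd plus vpsravd is 2 uops / 2c, versus the 3 uops / 3c of the count-in-xmm sequence above, which appears to be the 3 uop vs 2 uop comparison made in the description.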
@RKSimon Is this task still available? I'd be happy to work on it if it is.
@SahilPatidar You're welcome to work on anything you'd like! I assigned this issue to you, but you shouldn't feel pressure to fix it.
…shift amount. Noticed while trying to compare splat vs per-element shift perf stats for #39424. Confirmed with uops.info.
Another example:

```llvm
define <8 x i32> @f(<8 x i32> noundef %x, i32 noundef %s) {
  %vecinit = insertelement <8 x i32> poison, i32 %s, i64 0
  %vecinit7 = shufflevector <8 x i32> %vecinit, <8 x i32> poison, <8 x i32> zeroinitializer
  %shl = shl <8 x i32> %x, %vecinit7
  ret <8 x i32> %shl
}
```

This currently compiles to:

```asm
vmovd   xmm1, edi
vpslld  ymm0, ymm0, xmm1
ret
```

On AVX512 targets (which can broadcast from a scalar reg) we'd be better off with:

```asm
vpbroadcastd ymm1, edi
vpsllvd      ymm0, ymm0, ymm1
ret
```
Where should the TuningPreferPerEltVectorShift flag be added?
X86.td (llvm/lib/Target/X86/X86.td), alongside the existing tuning features.
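For reference, a hedged sketch of what such a tuning flag could look like, modelled on the existing Tuning* SubtargetFeature definitions in llvm/lib/Target/X86/X86.td; the feature string, field name, and description below are only illustrative, not a committed interface:

```tablegen
// Illustrative sketch only: follows the shape of the existing Tuning*
// SubtargetFeature definitions in llvm/lib/Target/X86/X86.td.
def TuningPreferPerEltVectorShift
    : SubtargetFeature<"prefer-per-element-vector-shift",
                       "PreferPerEltVectorShift", "true",
                       "Prefer per-element vector shifts over scalar-count-in-xmm shifts when the shift amount is a splat">;
```

Presumably the flag would then be appended to the tuning lists of the processors where it pays off (Skylake-class AVX2 and AVX512 targets, per the description above) and queried from the X86 shift lowering code via the generated subtarget accessor.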