[X86][AVX] Prefer per-element vector shifts for known splats #39424

Open
RKSimon opened this issue Dec 18, 2018 · 8 comments · May be fixed by #87913
Labels: backend:X86, bugzilla (issues migrated from Bugzilla), good first issue (https://github.com/llvm/llvm-project/contribute)

Comments

@RKSimon
Collaborator

RKSimon commented Dec 18, 2018

Bugzilla Link: 40077 (https://llvm.org/bz40077)
Version: trunk
OS: Windows NT
CC: @adibiagio, @topperc, @RKSimon, @rotateright

Extended Description

As detailed on https://reviews.llvm.org/rL340813, many recent machines have better throughput for the 'per-element' variable vector shifts than the old style 'scalar-count-in-xmm' variable shifts if we know that the shift amount is already splatted:

Probably the wrong place to report this, but I looked at some other sequences:

; AVX-LABEL: splatvar_shift_v4i32:
; AVX:       # %bb.0:
; AVX-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero   # 1 uop / 1c latency
; AVX-NEXT:    vpsrad %xmm1, %xmm0, %xmm0                # 2 uops / 2c latency on Intel since Haswell at least
; AVX-NEXT:    retq

For Skylake, variable-shifts (vpsraVd) are single uop, but count-in-xmm shifts are 2 uops. Probably they're implemented internally as broadcast to feed the SIMD variable-shift hardware.

The above is 3 uops / 3c latency on SKL.

So for AVX2 Skylake (but not Broadwell or earlier) we want this 2 uop / 2c latency implementation:

vpbroadcastd %xmm1, %xmm1         = xmm1[0],xmm1[1],xmm1[2],xmm1[3]   # 1 uop / 1c latency
vpsravd      %xmm1, %xmm0, %xmm0                          # 1 uop / 1c latency on SKL.   3 / 3 on BDW and earlier.

Same for SKX AVX512 with vpsravw and so on. There are some test cases where we use the same shift-count register multiple times, and it would be significantly better to broadcast it and use variable-shifts instead of count-from-the-low-element shifts.

But on Ryzen, and on Broadwell and earlier, variable shifts cost more. (Interestingly, on Ryzen they run on a different execution port from normal count-in-xmm shifts; still a single uop per lane, but 3c latency and not fully pipelined.) Ryzen's count-in-xmm shifts are as efficient as immediate shifts, unlike Intel, where count-in-xmm shifts are always 2 uops (port 5 + shift port).

KNL is horrible for pslld xmm,xmm (13c throughput/latency), but variable shifts like VPSRLVD z,z,z have the same throughput as immediate-count shifts. I don't totally trust Agner's numbers for the x,x shifts; maybe he only tested the non-VEX encoding?

Anyway, for AVX512 we should prefer broadcast + variable-shift instead of pmovzxb/wq / regular shift, because it's better on SKX and at least as good on KNL. This includes 16-bit elements for AVX512BW, unlike AVX2.

(With AVX1, we don't have variable shifts so the earlier implementation with vpsrad is our best option.)
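
For reference, here is the splat-shift pattern written out with AVX2 intrinsics as a standalone sketch (the function names are invented for this example, and which instruction sequence a compiler actually emits depends on -march and its own cost model):

#include <immintrin.h>

// Count-in-xmm form: roughly what the vpmovzxdq + vpsrad sequence above computes.
__m128i sra_splat_scalar_count(__m128i x, int s) {
  __m128i cnt = _mm_cvtsi32_si128(s); // shift count in the low element (vmovd)
  return _mm_sra_epi32(x, cnt);       // vpsrad xmm, xmm, xmm
}

// Per-element form: broadcast the count and use the variable shift; this is the
// broadcast + vpsravd shape that is preferable on Skylake and later.
__m128i sra_splat_per_element(__m128i x, int s) {
  __m128i cnt = _mm_set1_epi32(s);    // typically vmovd + vpbroadcastd with AVX2
  return _mm_srav_epi32(x, cnt);      // vpsravd xmm, xmm, xmm
}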

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
@RKSimon
Collaborator Author

RKSimon commented Mar 4, 2024

Likely steps:

  • Add a TuningPreferPerEltVectorShift flag and enable it for Skylake (and later) tuning
  • Update vector shift lowering to not use LowerShiftByScalarVariable if the tuning flag is set and the per-element shift is supported (check supportedVectorVarShift); a rough standalone sketch follows below
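
Not the actual patch, just a minimal standalone sketch of the decision being proposed; TuningPreferPerEltVectorShift, supportedVectorVarShift and LowerShiftByScalarVariable are the names from the steps above, but the struct, fields and signatures here are assumptions made up for illustration:

// Standalone model of the lowering decision (assumed names/structure; the real
// change would live in X86.td and X86ISelLowering.cpp, not in a helper like this).
struct SubtargetModel {
  bool PreferPerEltVectorShift; // proposed TuningPreferPerEltVectorShift flag
  bool HasAVX2;                 // per-element 32/64-bit shifts (vpsravd, vpsllvq, ...)
  bool HasBWI;                  // AVX512BW adds 16-bit per-element shifts (vpsravw, ...)
};

// Stand-in for supportedVectorVarShift(): does the modeled target have a
// per-element shift at this element width?
static bool supportsPerEltShift(const SubtargetModel &ST, unsigned EltBits) {
  if (EltBits == 16)
    return ST.HasBWI;
  return (EltBits == 32 || EltBits == 64) && ST.HasAVX2;
}

// Lowering would skip LowerShiftByScalarVariable() whenever this returns true,
// keeping the broadcast + per-element shift form instead.
static bool preferPerEltLowering(const SubtargetModel &ST, unsigned EltBits) {
  return ST.PreferPerEltVectorShift && supportsPerEltShift(ST, EltBits);
}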

RKSimon added the good first issue (https://github.com/llvm/llvm-project/contribute) label on Mar 4, 2024
@llvmbot
Member

llvmbot commented Mar 4, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request for it to be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Member

llvmbot commented Mar 4, 2024

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

@SahilPatidar
Contributor

@RKSimon Is this task still available? I'd be happy to work on it if it is.

@Endilll
Contributor

Endilll commented Mar 25, 2024

@SahilPatidar You're welcome to work on anything you'd like! I assigned this issue to you, but you shouldn't feel pressure to fix it.

RKSimon added a commit that referenced this issue Mar 25, 2024
…shift amount

Noticed while trying to compare splat vs per-element shift perf stats for #39424

Confirmed with uops.info
RKSimon added a commit that referenced this issue Mar 26, 2024
…ift amount

Noticed while trying to compare splat vs per-element shift perf stats for #39424

Confirmed with uops.info
@RKSimon
Collaborator Author

RKSimon commented Apr 5, 2024

Another example:

define <8 x i32> @f(<8 x i32> noundef %x, i32 noundef %s) {
  %vecinit = insertelement <8 x i32> poison, i32 %s, i64 0
  %vecinit7 = shufflevector <8 x i32> %vecinit, <8 x i32> poison, <8 x i32> zeroinitializer
  %shl = shl <8 x i32> %x, %vecinit7
  ret <8 x i32> %shl
}
        vmovd   xmm1, edi
        vpslld  ymm0, ymm0, xmm1
        ret

On AVX512 targets (which can broadcast from scalar reg) we'd be better off with:

        vpbroadcastd    ymm1, edi
        vpsllvd ymm0, ymm0, ymm1
        ret
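
For comparison, the same computation written with AVX2 intrinsics (an illustrative sketch only; the function name is made up, and whether you get the vmovd + vpslld or the broadcast + vpsllvd sequence is up to the compiler):

#include <immintrin.h>

// Shift every element of x left by the same scalar amount s.
__m256i shl_v8i32_by_splat(__m256i x, int s) {
  __m256i cnt = _mm256_set1_epi32(s); // vmovd + vpbroadcastd (AVX2), or vpbroadcastd ymm, edi with AVX512VL
  return _mm256_sllv_epi32(x, cnt);   // vpsllvd ymm, ymm, ymm
}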

@SahilPatidar
Contributor

Likely steps:

  • Add a TuningPreferPerEltVectorShift flag and enable it for Skylake (and later) tuning
  • Update vector shift lowering to not use LowerShiftByScalarVariable if the tuning flag is set and the per-element shift is supported (check supportedVectorVarShift)

Where should the TuningPreferPerEltVectorShift flag be added?

@RKSimon
Collaborator Author

RKSimon commented Apr 5, 2024

X86.td

SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 7, 2024
RKSimon changed the title from "[X86][AVX] Prefer VPSRAV to VPSRA style shifts for known splats" to "[X86][AVX] Prefer per-element vector shifts for known splats" on Apr 8, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 9, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 22, 2024