[X86][AVX] Prefer per-element vector shifts for known splats #39424

Open
RKSimon opened this issue Dec 18, 2018 · 8 comments · May be fixed by #87913
Labels: backend:X86, bugzilla (issues migrated from Bugzilla), good first issue (https://github.com/llvm/llvm-project/contribute)

Comments

@RKSimon
Collaborator

RKSimon commented Dec 18, 2018

Bugzilla Link: 40077 (https://llvm.org/bz40077)
Version: trunk
OS: Windows NT
CC: @adibiagio, @topperc, @RKSimon, @rotateright

Extended Description

As detailed on https://reviews.llvm.org/rL340813, many recent machines have better throughput for the 'per-element' variable vector shifts than the old style 'scalar-count-in-xmm' variable shifts if we know that the shift amount is already splatted:

Probably the wrong place to report this, but I looked at some other sequences:

; AVX-LABEL: splatvar_shift_v4i32:
; AVX:       # %bb.0:
; AVX-NEXT:    vpmovzxdq {{.*#+}} xmm1 = xmm1[0],zero,xmm1[1],zero   # 1 uop / 1c latency
; AVX-NEXT:    vpsrad %xmm1, %xmm0, %xmm0                # 2 uops / 2c latency on Intel since Haswell at least
; AVX-NEXT:    retq

For Skylake, variable-shifts (vpsraVd) are single uop, but count-in-xmm shifts are 2 uops. Probably they're implemented internally as broadcast to feed the SIMD variable-shift hardware.

The above is 3 uops / 3c latency on SKL.

So for AVX2 Skylake (but not Broadwell or earlier) we want this 2 uop / 2c latency implementation:

vpbroadcastd %xmm1, %xmm1         = xmm1[0],xmm1[1],xmm1[2],xmm1[3]   # 1 uop / 1c latency
vpsravd      %xmm1, %xmm0, %xmm0                          # 1 uop / 1c latency on SKL.   3 / 3 on BDW and earlier.

Same for SKX AVX512 with vpsravw and so on. There are some test cases where we use the same shift-count register multiple times, and it would be significantly better to broadcast it and use variable-shifts instead of count-from-the-low-element shifts.

But on Ryzen, and on Broadwell and earlier, variable shifts cost more. (Interestingly, on Ryzen they run on a different execution port from normal count-in-xmm shifts; still a single uop per lane, but 3c latency and not fully pipelined.) Ryzen's count-in-xmm shifts are as efficient as immediate shifts, unlike Intel, where count-in-xmm shifts are always 2 uops (port 5 + shift port).

KNL is horrible for pslld xmm,xmm (13c throughput/latency), but variable shifts like VPSRLVD z,z,z have the same throughput as immediate-count shifts. I don't totally trust Agner's numbers for the x,x shifts; maybe he only tested the non-VEX encoding?

Anyway, for AVX512 we should prefer broadcast + variable-shift instead of pmovzxb/wq / regular shift, because it's better on SKX and at least as good on KNL. This includes 16-bit elements for AVX512BW, unlike AVX2.

(With AVX1, we don't have variable shifts so the earlier implementation with vpsrad is our best option.)
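
For reference, here is the splat-shift pattern written out with AVX2 intrinsics as a standalone sketch (the function names are invented for this example, and which instruction sequence a compiler actually emits depends on -march and its own cost model):

#include <immintrin.h>

// Count-in-xmm form: roughly what the vpmovzxdq + vpsrad sequence above computes.
__m128i sra_splat_scalar_count(__m128i x, int s) {
  __m128i cnt = _mm_cvtsi32_si128(s); // shift count in the low element (vmovd)
  return _mm_sra_epi32(x, cnt);       // vpsrad xmm, xmm, xmm
}

// Per-element form: broadcast the count and use the variable shift; this is the
// broadcast + vpsravd shape that is preferable on Skylake and later.
__m128i sra_splat_per_element(__m128i x, int s) {
  __m128i cnt = _mm_set1_epi32(s);    // typically vmovd + vpbroadcastd with AVX2
  return _mm_srav_epi32(x, cnt);      // vpsravd xmm, xmm, xmm
}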

llvmbot transferred this issue from llvm/llvm-bugzilla-archive on Dec 10, 2021
@RKSimon
Collaborator Author

RKSimon commented Mar 4, 2024

Likely steps:

  • Add a TuningPreferPerEltVectorShift flag and enable it for Skylake (and later) tuning
  • Update vector shift lowering to not use LowerShiftByScalarVariable if the tuning flag is set and the per-element shift is supported (check supportedVectorVarShift); a rough standalone sketch follows below
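
Not the actual patch, just a minimal standalone sketch of the decision being proposed; TuningPreferPerEltVectorShift, supportedVectorVarShift and LowerShiftByScalarVariable are the names from the steps above, but the struct, fields and signatures here are assumptions made up for illustration:

// Standalone model of the lowering decision (assumed names/structure; the real
// change would live in X86.td and X86ISelLowering.cpp, not in a helper like this).
struct SubtargetModel {
  bool PreferPerEltVectorShift; // proposed TuningPreferPerEltVectorShift flag
  bool HasAVX2;                 // per-element 32/64-bit shifts (vpsravd, vpsllvq, ...)
  bool HasBWI;                  // AVX512BW adds 16-bit per-element shifts (vpsravw, ...)
};

// Stand-in for supportedVectorVarShift(): does the modeled target have a
// per-element shift at this element width?
static bool supportsPerEltShift(const SubtargetModel &ST, unsigned EltBits) {
  if (EltBits == 16)
    return ST.HasBWI;
  return (EltBits == 32 || EltBits == 64) && ST.HasAVX2;
}

// Lowering would skip LowerShiftByScalarVariable() whenever this returns true,
// keeping the broadcast + per-element shift form instead.
static bool preferPerEltLowering(const SubtargetModel &ST, unsigned EltBits) {
  return ST.PreferPerEltVectorShift && supportsPerEltShift(ST, EltBits);
}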

RKSimon added the good first issue (https://github.com/llvm/llvm-project/contribute) label on Mar 4, 2024
@llvmbot
Member

llvmbot commented Mar 4, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. In the comments of the issue, request for it to be assigned to you.
  2. Fix the issue locally.
  3. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  4. Create a Git commit.
  5. Run git clang-format HEAD~1 to format your changes.
  6. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Member

llvmbot commented Mar 4, 2024

@llvm/issue-subscribers-good-first-issue

Author: Simon Pilgrim (RKSimon)

@SahilPatidar
Contributor

@RKSimon Is this task still available? I'd be happy to work on it if it is.

@Endilll
Contributor

Endilll commented Mar 25, 2024

@SahilPatidar You're welcome to work on anything you'd like! I assigned this issue to you, but you shouldn't feel pressure to fix it.

RKSimon added a commit that referenced this issue Mar 25, 2024
…shift amount

Noticed while trying to compare splat vs per-element shift perf stats for #39424

Confirmed with uops.info
RKSimon added a commit that referenced this issue Mar 26, 2024
…ift amount

Noticed while trying to compare splat vs per-element shift perf stats for #39424

Confirmed with uops.info
@RKSimon
Collaborator Author

RKSimon commented Apr 5, 2024

Another example:

define <8 x i32> @f(<8 x i32> noundef %x, i32 noundef %s) {
  %vecinit = insertelement <8 x i32> poison, i32 %s, i64 0
  %vecinit7 = shufflevector <8 x i32> %vecinit, <8 x i32> poison, <8 x i32> zeroinitializer
  %shl = shl <8 x i32> %x, %vecinit7
  ret <8 x i32> %shl
}
        vmovd   xmm1, edi
        vpslld  ymm0, ymm0, xmm1
        ret

On AVX512 targets (which can broadcast from scalar reg) we'd be better off with:

        vpbroadcastd    ymm1, edi
        vpsllvd ymm0, ymm0, ymm1
        ret
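
For comparison, the same computation written with AVX2 intrinsics (an illustrative sketch only; the function name is made up, and whether you get the vmovd + vpslld or the broadcast + vpsllvd sequence is up to the compiler):

#include <immintrin.h>

// Shift every element of x left by the same scalar amount s.
__m256i shl_v8i32_by_splat(__m256i x, int s) {
  __m256i cnt = _mm256_set1_epi32(s); // vmovd + vpbroadcastd (AVX2), or vpbroadcastd ymm, edi with AVX512VL
  return _mm256_sllv_epi32(x, cnt);   // vpsllvd ymm, ymm, ymm
}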

@SahilPatidar
Contributor

Likely steps:

  • Add a TuningPreferPerEltVectorShift flag and enable it for Skylake (and later) tuning
  • Update vector shift lowering to not use LowerShiftByScalarVariable if the tuning flag is set and the per-element shift is supported (check supportedVectorVarShift)

Where should the TuningPreferPerEltVectorShift flag be added?

@RKSimon
Collaborator Author

RKSimon commented Apr 5, 2024

X86.td

SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 7, 2024
RKSimon changed the title from "[X86][AVX] Prefer VPSRAV to VPSRA style shifts for known splats" to "[X86][AVX] Prefer per-element vector shifts for known splats" on Apr 8, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 9, 2024
SahilPatidar added a commit to SahilPatidar/llvm-project that referenced this issue Apr 22, 2024