MIR passes do not take into account if an operation is convergent #137086
Comments
@rust-lang/wg-mir-opt and @rust-lang/opsem might be interested as well.
Seems like that's a pretty cursed intrinsic, and it should not be exposed as a regular Rust intrinsic. This is another case of hardware vendors thinking they can just add something to C or Rust and say it works "like this hardware instruction"... but that's not how this goes; hardware vendors need to work with language designers to find a way to add those semantics into the language.
Can this be worked around by modeling the intrinsic as a call to an arbitrary external function? In other words: is the "convergent" concept in LLVM only needed to preserve more optimizations around uses of these operations, or is it even more special (language-spec-breaking) than that?
GPU vendors did not add this to C initially. Instead, when Nvidia introduced programmable shader units, they and Microsoft created a C-like language (in, I think, 2001), which they called Cg (C for graphics) and HLSL (high level shading language) respectively (according to https://gamedev.net/forums/topic/281349-hlsl-vs-cg-any-significant-difference/2759856/ these two languages are nearly identical). It was only much later that real C and C++ got supported on GPUs (CUDA released in 2007, OpenCL in 2009, which used C++ and C respectively as the basis for their programming languages).
Using an extern function call would allow the call to be duplicated into two branches. The convergence requirement means that every single thread within a thread group has to reach the exact same instruction. Some GPUs implement threads within a thread group by basically having a single execution stream, using a separate vector lane for each "thread" and masking/unmasking lanes depending on which branch of an `if` those threads took. Instructions that require convergence don't respect this masking. And on other GPUs I guess it would cause issues like deadlocks and other misbehavior.
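To make the lane-masking model above concrete, here is a toy simulation (purely illustrative; the `Inst` variants and the mask values are made up, and real SIMT hardware is far more involved): a warp executes instructions in lockstep under an active-lane mask, a divergent branch narrows the mask, and a convergent barrier is only valid when executed with the full mask.

```rust
// Toy SIMT model: a "warp" executes in lockstep under an active-lane
// mask. A convergent op (like a barrier) is only valid when it runs
// with the full mask, i.e. outside any divergent branch.

#[derive(Clone, Copy)]
enum Inst {
    Branch,  // lanes diverge on some condition; mask narrows
    Barrier, // convergent: requires all lanes to be active
    Other,   // anything non-convergent
}

/// Returns true iff every Barrier executes with the full lane mask.
fn barriers_converged(program: &[Inst], full_mask: u32) -> bool {
    let mut mask = full_mask;
    for inst in program {
        match inst {
            Inst::Branch => mask &= 0b0101_0101, // some lanes masked off
            Inst::Barrier => {
                if mask != full_mask {
                    return false; // barrier reached under divergence
                }
            }
            Inst::Other => {}
        }
    }
    true
}

fn main() {
    let full = 0xFF;
    // Barrier before any divergence: fine.
    assert!(barriers_converged(&[Inst::Barrier, Inst::Branch, Inst::Other], full));
    // After an (invalid) duplication, the barrier sits inside the
    // divergent region and executes with a partial mask.
    assert!(!barriers_converged(&[Inst::Branch, Inst::Barrier, Inst::Other], full));
    println!("ok");
}
```

This is why duplicating such a call into two branches (as an optimizer may do with an ordinary extern call) is unsound: each copy executes with a partial mask.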
I think as far as Rust is concerned, it should be good enough to have a target flag that says "this target may have convergent operations" and disable certain MIR optimizations like JumpThreading entirely in that case. The necessary complexity to correctly perform control flow optimizations in the presence of convergent operations is not worth the bother at Rust's level.
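The gating proposed above could look roughly like the following. This is a minimal sketch, not the real rustc API: the `may_have_convergent_ops` flag, the `TargetInfo` struct, and the simplified `MirPass` trait are all hypothetical stand-ins.

```rust
// Hypothetical target descriptor (NOT rustc's real Target struct).
struct TargetInfo {
    may_have_convergent_ops: bool,
}

// Simplified stand-in for a MIR pass: passes that are unsafe in the
// presence of convergent operations opt out via preserves_convergence.
trait MirPass {
    fn preserves_convergence(&self) -> bool {
        true
    }
    fn run(&self, body: &mut Vec<String>);
}

struct JumpThreading;
impl MirPass for JumpThreading {
    fn preserves_convergence(&self) -> bool {
        false // may duplicate calls across branches
    }
    fn run(&self, body: &mut Vec<String>) {
        body.push("jump-threaded".into());
    }
}

fn run_passes(target: &TargetInfo, passes: &[&dyn MirPass], body: &mut Vec<String>) {
    for pass in passes {
        if target.may_have_convergent_ops && !pass.preserves_convergence() {
            continue; // skip control-flow opts on convergent targets
        }
        pass.run(body);
    }
}

fn main() {
    let passes: [&dyn MirPass; 1] = [&JumpThreading];

    let mut gpu_body = Vec::new();
    run_passes(&TargetInfo { may_have_convergent_ops: true }, &passes, &mut gpu_body);
    assert!(gpu_body.is_empty()); // pass skipped on the GPU-like target

    let mut cpu_body = Vec::new();
    run_passes(&TargetInfo { may_have_convergent_ops: false }, &passes, &mut cpu_body);
    assert_eq!(cpu_body, ["jump-threaded"]); // pass runs normally
    println!("ok");
}
```

The design choice here mirrors the comment: rather than teaching each pass the full convergence rules, the target declares the hazard once and unsound passes are disabled wholesale.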
So a convergent call that's present transitively anywhere inside a function I call means that call has to be treated specially? How can LLVM even handle this soundly, e.g. when function pointers are involved?
Clang marks all functions and calls as convergent by default; Rust should do that as well.
In the short term, we should dull the pain by adding a hacky check in the pass. What has me really confused is that we're only hearing about this problem now: JumpThreading was enabled six stable releases ago. Have GPU users been blanket-disabling all MIR opts since before 1.78 or something?
In my research the bug manifested only in a very specific case; it really was a coincidence that I encountered it. @kjetilkjeka, as one of the more frequent users, did you encounter something like this in the past?
Not really. Our GPU code that uses thread syncing also uses other features that are not straightforward in Rust, and is thus written in C. I'm sure if things like shared memory were available in Rust, we would have a lot of cases where this broke for us. Our Rust GPU code almost always reads an input buffer, performs some calculation, and writes to a separate output buffer.
I am currently working on a fix which does what @nikic suggested:
This should initially reduce the pain. However, a few things are still unclear to me:
It may be an option to use convergence control tokens as soon as these are stable in LLVM. (To my knowledge, at the moment clang only emits these for HLSL/SPIR-V; there is active development ongoing to respect convergence control tokens in LLVM's passes and targets.)
Is there a good reason to leave any enabled? |
Issue
I encountered an issue when using `core::arch::nvptx::_syncthreads()` while the MIR pass JumpThreading is enabled. The issue can be reproduced with a simple kernel when executed with the following parameters:

```rust
block_dim = BlockDim { x: 512, y: 1, z: 1 };
grid_dim = GridDim { x: 2, y: 1, z: 1 };
n = 1000;
```
`compute-sanitizer --tool synccheck` complains about barrier errors. The resulting `.ptx` shows the reason for that. The code is transformed to the following:

I could track down this transformation to the MIR pass JumpThreading. However, `_syncthreads()` is a convergent operation, and this property must be considered when doing code transformations (see this LLVM issue for reference). Therefore, turning off the MIR pass JumpThreading completely prevents this transformation from happening, and the resulting code is correct (`compute-sanitizer` no longer complains).

PTX with JumpThreading
PTX without JumpThreading
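The problematic shape of the transformation can be simulated on the host, with OS threads standing in for GPU threads. The sketch below (not real GPU code; the call-site names are made up) has `sync` record the static call site each "thread" reaches before waiting, which is roughly the invariant synccheck verifies: after jump threading duplicates the barrier into both branch arms, threads arrive at different sites.

```rust
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

// Records the static call site this "thread" reached, then waits.
fn sync(site: &'static str, barrier: &Barrier, log: &Mutex<Vec<&'static str>>) {
    log.lock().unwrap().push(site);
    barrier.wait();
}

/// Runs two "threads" that disagree on `cond` and returns the distinct
/// barrier call sites they arrived at.
fn run(jump_threaded: bool) -> Vec<&'static str> {
    let barrier = Arc::new(Barrier::new(2));
    let log = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = [true, false]
        .iter()
        .map(|&cond| {
            let (barrier, log) = (Arc::clone(&barrier), Arc::clone(&log));
            thread::spawn(move || {
                if jump_threaded {
                    // JumpThreading duplicated the barrier into both arms:
                    if cond {
                        sync("barrier_in_then", &barrier, &log);
                    } else {
                        sync("barrier_in_else", &barrier, &log);
                    }
                } else {
                    // Original program: one barrier after the branch.
                    let _x = if cond { 1 } else { 2 };
                    sync("barrier_after_if", &barrier, &log);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let mut sites = log.lock().unwrap().clone();
    sites.sort();
    sites.dedup();
    sites
}

fn main() {
    assert_eq!(run(false).len(), 1); // all threads at the same site: OK
    assert_eq!(run(true).len(), 2);  // divergent barrier sites: the bug
    println!("ok");
}
```

On the host with `std::sync::Barrier` this merely looks suspicious; on a GPU, where the barrier must be the exact same static instruction for the whole thread group, it is undefined behavior.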
In the above example, `_syncthreads()` does nothing useful and can be omitted. However, I encountered this issue in a more complex stencil kernel, where these transformations lead to side effects and race conditions.

Compiler arguments
With JumpThreading:
Without JumpThreading:
Background
Targets like nvptx64-nvidia-cuda, amdgpu, and probably also spir-v (rust-gpu) make use of so-called convergent operations (like `_syncthreads()`). LLVM provides a detailed explanation of this type of operation. Special care must be taken when code that involves convergent operations is transformed. To my knowledge, rustc does not know whether an operation is convergent, so passes do not handle these operations correctly.
Zulip-Stream