Language Extensions#
SCALE has various opt-in language extensions that aim to improve the experience of writing GPU code. More language extensions are in development.
Unless otherwise noted, SCALE accepts these language extensions in both 
clang and nvcc modes. Note that there are dialect differences
between the two modes.
Since NVIDIA's compiler does not support SCALE language extensions, if you want to retain the ability to compile for NVIDIA GPUs you must do one of two things:
- Guard use of language extensions behind the __REDSCALE__ macro, hiding them from NVIDIA's nvcc.
- Use SCALE's clang compiler to compile for both NVIDIA and AMD targets. This will require changes to your build system.
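As a rough illustration of the first option, the sketch below wraps one extension (the loop-unroll attribute described later on this page) in a __REDSCALE__ check. The names GPU_FN and sumFirstN are hypothetical and exist only for this example. The __CUDACC__ check is there so that the same file also builds with a plain host compiler.

```cpp
// GPU_FN is a hypothetical helper macro: outside CUDA compilation
// (for example in a host-only unit test) __device__ does not exist,
// so the macro expands to nothing there.
#if defined(__CUDACC__)
#define GPU_FN __device__
#else
#define GPU_FN
#endif

// Sums the first n elements of data (hypothetical example function).
template <int UnrollAmount>
GPU_FN int sumFirstN(const int* data, int n) {
    int total = 0;
#if defined(__REDSCALE__)
    // Only SCALE sees the extension; every other compiler has this
    // line removed by the preprocessor and compiles a plain loop.
    [[clang::loop_unroll UnrollAmount]]
#endif
    for (int i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}
```

The result is the same with or without the attribute. Only the generated code differs, so compilers that never see the extension still build a correct program.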
[[clang::loop_unroll]]#
GPU code frequently contains loops that need to be partially unrolled, where the degree of unrolling is a tradeoff between instruction-level parallelism (ILP) and register usage.
Finding the optimal amount to unroll is not usually possible in the compiler because the number of threads to be used is a runtime value. Programmers therefore usually want to set unroll depth by hand.
The existing #pragma unroll N allows this to be set at the preprocessor level.
The new [[clang::loop_unroll N]] allows doing this in a template-dependent
way:
template<int UnrollAmount>
__device__ void example(int n) {
    [[clang::loop_unroll UnrollAmount]]
    for (int i = 0; i < n; i++) {
        // ...
    }
}
__builtin_provable(bool X)#
__builtin_provable(X) accepts a boolean, X, and:
- If the compiler is able to prove, during optimisation, that X is a compile-time constant true, the entire expression evaluates to a compile-time constant true.
- Otherwise (if X is unknown, or provably false), the entire expression evaluates to a compile-time constant false.
This allows you to write code that opportunistically optimises for a special case, without the risk of runtime branching overhead or the inconvenience of propagating this information through your entire program using templates. For example:
__device__ int myCleverFunction(int input) {
    if (__builtin_provable(input % 2 == 0)) {
        // Special fast code for the case where `input` is divisible
        // by 2 goes here.
    } else {
        // Slow, general case goes here.
    }
}
During optimisation, as calls to myCleverFunction get inlined, the 
compiler may be able to prove that input % 2 == 0 for specific calls to 
this function. Those cases will be compiled with the "fast path", while all 
others will be compiled to the "slow path". The if statement will never 
compile to an actual conditional.
Since there are no guarantees that the optimiser is able to prove the condition, the program must produce identical outputs from either path, or the behaviour is undefined.
This feature differs from the standard C++17 if constexpr in that it is 
not required that the input boolean be constexpr. __builtin_provable() 
communicates with the optimiser, not the template system. Consequently:
- You don't need to use templates to propagate "optimisation knowledge" throughout the program.
- Compilation may be faster, as a result of not having to template everything.
- Some cases may be missed where optimisation fails. Such cases are probably independently worth investigating (Why did optimisation fail? That's a source of additional slowness).
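To make the "identical outputs from either path" rule concrete, here is a minimal sketch with both paths filled in. The macro MY_PROVABLE and the function wrapIndex are hypothetical names for this example. The fallback to false on compilers without the builtin is an assumption based on the description above: false is the value the builtin yields whenever nothing can be proven, so always taking the general path should stay correct.

```cpp
// MY_PROVABLE is a hypothetical wrapper, not part of SCALE. Without
// the builtin it always answers false, which selects the general path.
#if defined(__REDSCALE__)
#define MY_PROVABLE(x) __builtin_provable(x)
#else
#define MY_PROVABLE(x) false
#endif

// GPU_FN is a hypothetical helper so the sketch also builds on the host.
#if defined(__CUDACC__)
#define GPU_FN __device__
#else
#define GPU_FN
#endif

// Wraps index into the range [0, size). The caller must pass size != 0.
GPU_FN inline unsigned wrapIndex(unsigned index, unsigned size) {
    if (MY_PROVABLE(size != 0 && (size & (size - 1)) == 0)) {
        // Fast path: size is provably a power of two, so a bit mask
        // gives the same result as the remainder below.
        return index & (size - 1);
    }
    // General path: correct for every non-zero size.
    return index % size;
}
```

For any power-of-two size the two return statements produce the same value, so it does not matter which one the optimiser ends up selecting for a given call site.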
Improved support for non-32 warpSize#
Not all AMD GPUs have a warp size of 32. To mitigate this, we offer a variety of compiler and API features:
- cudaLaneMask_t: An integer type with as many bits as there are threads in a warp. Use it with functions such as __ballot() to avoid discarding half the bits.
- Use of cudaLaneMask_t in appropriate places in the CUDA APIs (such as the return value of __ballot()).
- Diagnostics to catch implicit casts from cudaLaneMask_t to narrower types.
In practice, this means the compiler detects the majority of cases where code is written in a way that will break on a device with a warp size of 64.
Programmers should modify their CUDA code to be agnostic to warp size. NVIDIA's documentation recommends this practice, but a lot of real-world CUDA code does it incorrectly because no current NVIDIA hardware has a warp size other than 32.
Since NVIDIA's nvcc does not have cudaLaneMask_t, programmers should use
auto to declare variables that hold the results of functions such as
__ballot() that return it. This will compile correctly on all platforms.
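The following sketch shows one way to keep mask-handling code independent of the mask width. The helper names countActiveLanes, lanesWithPredicate and GPU_HD are hypothetical, and the code assumes that cudaLaneMask_t behaves like a built-in integer type (something std::make_unsigned accepts). The device function is only compiled during CUDA compilation. The counting helper is plain C++ and can be checked on the host.

```cpp
#include <cstdint>
#include <type_traits>

// GPU_HD is a hypothetical helper so the sketch also builds on the host.
#if defined(__CUDACC__)
#define GPU_HD __host__ __device__
#else
#define GPU_HD
#endif

// Counts the set bits in a lane mask of any integer width, so the same
// code works for 32-lane and 64-lane warps.
template <typename Mask>
GPU_HD int countActiveLanes(Mask mask) {
    // Use the unsigned counterpart so the bit trick below is well defined.
    using Bits = typename std::make_unsigned<Mask>::type;
    Bits bits = static_cast<Bits>(mask);
    int count = 0;
    while (bits != 0) {
        bits &= bits - 1;  // clears the lowest set bit
        ++count;
    }
    return count;
}

#if defined(__CUDACC__)
// Number of lanes in the warp for which predicate is true.
__device__ int lanesWithPredicate(bool predicate) {
    // auto keeps the full mask width: cudaLaneMask_t under SCALE
    // (per the text above), and whatever nvcc declares otherwise.
    auto mask = __ballot(predicate);
    return countActiveLanes(mask);
}
#endif
```

Writing `unsigned int mask = __ballot(predicate);` instead would keep only the low 32 bits on a 64-lane device, which is the kind of narrowing the diagnostics mentioned above are meant to catch.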