# Compiler Diagnostics Reference
This page documents the meaning of compiler diagnostics provided by SCALE that are not found in other compilers. Diagnostics not listed here are provided by Clang, and may be found in the Clang Compiler Diagnostics Reference.
As with all diagnostics, these can be enabled or disabled both globally and in a particular region of code. See the diagnostic flags reference.
Since many of them represent undefined behaviour even on NVIDIA platforms, fixing the underlying problem is recommended.
## -Wptx-binding-as-address
Any pointer argument to a PTX asm() statement is passed as a generic address.
It is therefore invalid to directly use an asm() input as an address operand
to any PTX instruction that doesn't use generic addressing. Such code will
work correctly whenever the generic address space has an identical layout to the
target address space of the instruction (as is relatively often the case for
global, for example), but will fail unpredictably on other targets.
To achieve correct behaviour across all GPUs, use the cvta
PTX instruction to convert the incoming pointer to the desired address space
before passing it to the PTX memory instruction. In cases where this conversion
is a no-op, the optimiser will remove this extra step (including with NVIDIA's compiler!).
When the conversion is not a no-op, both SCALE and NVIDIA nvcc have compiler
optimisations that attempt to deduce the address space of the pointer and
rewrite it into the target address space for its entire lifetime.
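As an illustrative sketch (the function name and the choice of the global address space are assumptions for this example), the conversion can be performed inside the asm() statement itself:

```cuda
// Hypothetical example: read through a pointer known to point into the
// global address space. cvta.to.global converts the generic address before
// the load, so the ld.global is valid on every target; where the conversion
// is a no-op, the optimiser removes it.
__device__ int load_from_global(const int *p) {
    int value;
    asm("{\n\t"
        ".reg .u64 gp;\n\t"
        "cvta.to.global.u64 gp, %1;\n\t"  // generic -> global address
        "ld.global.u32 %0, [gp];\n\t"
        "}"
        : "=r"(value)
        : "l"(p));
    return value;
}
```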
## -Wptx-unused-local-variable
Identifies unused local variables (.reg declarations) in PTX asm() statements.
## -Wptx-local-variable-leak
Identifies PTX variable declarations that may lead to ptxas failures when
compiling for NVIDIA.
When a device function contains a PTX variable declaration, repeated inlining of calls to it may lead to duplicate variable declarations in the generated PTX.
```cuda
__device__ void suffering() {
    asm(".reg .b32 pain;");
}

__global__ void explode() {
    // The function body will inline twice, causing this kernel's final PTX to
    // contain two declarations of the same PTX variable. This produces a
    // confusing ptxas error.
    suffering();
    suffering();
}
```
To resolve this, all device functions that make PTX .reg declarations should
enclose them in PTX {}. This limits the scope of the inlined variable
declaration to the inlined function body, allowing multiple copies to
coexist.
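A minimal sketch of the fix, reusing the example function from above:

```cuda
__device__ void suffering() {
    // The braces open a PTX scope, so each inlined copy of this declaration
    // is confined to its own inlined body and the copies no longer collide.
    asm("{\n\t"
        ".reg .b32 pain;\n\t"
        // ... instructions using pain go here, inside the braces ...
        "}");
}
```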
This issue can never cause a problem when building for AMD targets.
## -Wptx-wave64
Detects hardcoded lanemask operands that have all zeros in the top 32 bits when
compiling for native wave64 mode. Such code is likely a mistake, such as
hardcoding 0xFFFFFFFF for a ballot's mask argument instead of the more
portable -1. On a wave64 target, 0xFFFFFFFF is really
0x00000000FFFFFFFF, turning off half the warp, when the intent was likely
to turn every thread on.
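A hedged sketch of the difference (this example assumes that SCALE's native wave64 mode widens the lanemask and ballot result types to 64 bits):

```cuda
__global__ void vote(unsigned long long *out) {
    // On a native wave64 target, 0xFFFFFFFF is 0x00000000FFFFFFFF and masks
    // off lanes 32-63. The literal -1 converts to an all-ones mask at
    // whatever width the lanemask type has, so it is warp-size portable.
    *out = __ballot_sync(-1, threadIdx.x % 2 == 0);            // every lane
    // *out = __ballot_sync(0xFFFFFFFF, threadIdx.x % 2 == 0); // half the warp!
}
```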
Note that SCALE's default compilation mode is to emulate a warp size of 32 on all targets, so you can usually ignore this class of problems initially. Most programs don't suffer a measurable performance degradation from this emulation process, but certain patterns (such as sending alternating warps down different control flow paths) would be pathological. It is desirable to migrate your code to be truly warp-size portable.
## Errors relating to the PTX carry bit
PTX offers extended-precision integer math instructions, with implicit carry-out and carry-in. However, the PTX manual notes:
> The condition code register is not preserved across calls and is mainly intended for use in straight-line code sequences for computing extended-precision integer addition, subtraction, and multiplication.
It is therefore undefined behaviour to end one device function with a PTX
asm() statement that writes the carry-bit, and then try to read that stored
carry-bit using an asm() statement at the start of another device function. If
you need to write that kind of code, you can either:
- Use compiler intrinsics to access add-with-carry operations more directly.
- Use 128-bit integer types (where possible) to avoid needing this kind of asm entirely.
- Use macros instead of functions for affected regions of code.
- Refactor so both the reader and writer of the carry bit are in the same device function.
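As an illustration of the last option, here is a hypothetical 64-bit add built from 32-bit halves. Keeping the carry-producing and carry-consuming instructions in one asm() statement means the condition code never has to survive a function call:

```cuda
__device__ unsigned long long add64_via_cc(unsigned lo_a, unsigned hi_a,
                                           unsigned lo_b, unsigned hi_b) {
    unsigned lo, hi;
    asm("add.cc.u32 %0, %2, %3;\n\t"  // writes the carry bit
        "addc.u32   %1, %4, %5;"      // reads it immediately, same statement
        : "=r"(lo), "=r"(hi)
        : "r"(lo_a), "r"(lo_b), "r"(hi_a), "r"(hi_b));
    return ((unsigned long long)hi << 32) | lo;
}
```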
When compiling for NVIDIA, such code will usually work if the functions inline and the asm blocks end up adjacent (so there is no actual function call to discard the carry bit). This is optimiser-dependent behaviour, and will fail if the compiler decides to reorder code or not perform the inlining.
When compiling for AMD with SCALE, we cannot create that behaviour, so it is simply an error to attempt to return the carry-bit.