Inline PTX support#
SCALE accepts inline PTX asm
blocks in CUDA programs and will attempt to
compile it for AMD along with the rest of your program.
Wave64 considerations#
A small number of PTX instructions depend on the warp size of the GPU being used. Since all NVIDIA GPUs and many AMD ones have a warp size of 32, much code implicitly relies on this. As a result, issues can appear when targeting wave64 devices.
SCALE provides several tools and compiler warnings to help you write portable PTX code. In most cases only small tweaks are required to get things working. Since so little PTX actually depends on the warp size, most projects are unaffected by the issues documented in this section. Nevertheless, it is useful to adjust your code to be warp-size-agnostic, since doing so can be done with no downsides.
Querying warp size#
PTX defines the WARP_SIZE
global constant which can be used to access the
warp size directly. It's a compile-time constant in nvidia's implementation
as well as in SCALE, so there is no cost to using this and doing arithmetic
with it (like with warpSize
in CUDA code).
Lanemask inputs#
The length of lanemask operands on instructions will always have a number of
bits equal to the warp size on the target GPU. For
example, when compiling for a wave64 GPU, the lanemask argument to shfl.sync
is a b64
, not b32
.
The following rules are applied to help detect problems with such operands:
- If a non-constant lanemask operand is used, and its bit-length is <= the warp size, an error is raised.
- If a constant lanemask operand is used with no 1-bits in the high 32 bits,
while compiling for a wave64 architecture, a warning is raised (which can
be disabled). This catches the common case of hardcoded lanemasks like
0xFFFFFFFF
which will typecheck asb64
, but will probably not do what you want.
In the common case where you want an all-ones lanemask, the most convenient
thing to do is write -1
instead of 0xFFFFFFFF
: this will give you the
correct number of 1-bits in all cases, including on nvidia platforms.
The c
argument to shfl
instructions#
The shfl
PTX instruction has a funky operand, c
, used for clamping etc.
See the documentation.
The c
operand is really two operands packed together: cval
in
bits 0-4, and segmask
in bits 8-12. For wave64, an extra bit is needed. Since
there is space for an extra bit in each of these values, we simply add it in
the obvious place.
A portable way of reasoning about this is to assume that cval
is in bits 0-7
and segmask
in bits 8-15.
Here's a concrete example of a reverse cumsum that works on either warp size:
__global__ void shuffleRevCumsumKernel(float *dst)
{
float out;
const int C = warpSize - 1;
asm(
".reg .f32 Rx;"
".reg .f32 Ry;"
".reg .pred p;"
"mov.b32 Rx, %1;"
"shfl.sync.down.b32 Ry|p, Rx, 0x1, %2, -1;"
"@p add.f32 Rx, Ry, Rx;"
"shfl.sync.down.b32 Ry|p, Rx, 0x2, %2, -1;"
"@p add.f32 Rx, Ry, Rx;"
"shfl.sync.down.b32 Ry|p, Rx, 0x4, %2, -1;"
"@p add.f32 Rx, Ry, Rx;"
"shfl.sync.down.b32 Ry|p, Rx, 0x8, %2, -1;"
"@p add.f32 Rx, Ry, Rx;"
"shfl.sync.down.b32 Ry|p, Rx, 0x10, %2, -1;"
"@p add.f32 Rx, Ry, Rx;"
// One extra shuffle is needed for the larger warp size.
#if __SCALE_WARP_SIZE__ > 32
"shfl.sync.down.b32 Ry|p, Rx, 0x20, %2, -1;"
"@p add.f32 Rx, Ry, Rx;"
#endif // __SCALE_WARP_SIZE__
"mov.b32 %0, Rx;"
: "=f"(out) : "f"(1.0f), "n"(C)
);
dst[threadIdx.x] = out;
}
And here's how to do a portable butterfly shuffle reduction:
__global__ void shuffleBflyKernel(float *dst)
{
const int C = warpSize - 1;
float out;
asm(
".reg .f32 Rx;"
".reg .f32 Ry;"
".reg .pred p;"
"mov.b32 Rx, %1;"
#if __SCALE_WARP_SIZE__ > 32
"shfl.sync.bfly.b32 Ry, Rx, 0x20, %2, -1;"
"add.f32 Rx, Ry, Rx;"
#endif // __SCALE_WARP_SIZE__
"shfl.sync.bfly.b32 Ry, Rx, 0x10, %2, -1;"
"add.f32 Rx, Ry, Rx;"
"shfl.sync.bfly.b32 Ry, Rx, 0x8, %2, -1;"
"add.f32 Rx, Ry, Rx;"
"shfl.sync.bfly.b32 Ry, Rx, 0x4, %2, -1;"
"add.f32 Rx, Ry, Rx;"
"shfl.sync.bfly.b32 Ry, Rx, 0x2, %2, -1;"
"add.f32 Rx, Ry, Rx;"
"shfl.sync.bfly.b32 Ry, Rx, 0x1, %2, -1;"
"add.f32 Rx, Ry, Rx;"
"mov.b32 %0, Rx;"
: "=f"(out) : "f"((float) threadIdx.x), "n"(C)
);
dst[threadIdx.x] = out;
}
Dialect differences#
The SCALE compiler accepts a more permissive dialect of PTX than NVIDIA's implementation does.
Integer lengths#
Most PTX instructions are defined to work only for a specific, arbitrary set of integer types. We didn't bother to implement such restrictions except in cases where they are needed for correctness, so many PTX instructions accept a wider selection of types than nvcc accepts.
One amusing consequence of this is that most of the simple instructions work
for any bit-length: add.s17
is allowed (but will of course lead to
extra sext/trunc instructions, so isn't necessarily a good idea).
Divergent exit
#
AMD hardware does not seem to have a mechanism by which individual threads
can terminate early (only entire warps). As a result, the exit
instruction may be used only in converged contexts. We transform it into
approximately:
if (__activemask() == -1) {
exit_entire_warp();
} else {
// This situation is unrepresentable
__trap();
}
Code that uses exit
as a performance optimisation for nvidia hardware may
benefit from being adjusted for AMD.
Empty asm volatile
blocks#
To cater to "interesting" user code, the SCALE compiler will not touch
asm volatile
blocks containing no instructions. We've seen
real-world CUDA code that uses these as a kind of ad-hoc optimisation
barrier to prevent the compiler breaking programs that contain undefined
behaviour. This pragmatic choice should reduce how often such broken programs
fail to function, but such code is broken by definition.
Note that the volatile
on non-empty volatile asm
blocks has no effect on the
behaviour of the SCALE compiler. volatile
asm is a conservative feature that
allows the compiler to model "unknowable" implicit dependencies of the actions
takenby the inline asm. Since we're compiling the asm to IR, the actual
dependencies and properties of everything it does are known and modelled. This
can improve optimisation, but may break programs that have undefined behaviour
that was being hidden by the optimisation barrier effect of the volatile asm
block.
asm
input/output types#
nvcc
doesn't appear to consistently follow its own tying rules for PTX asm
inputs/outputs. It allows the following invalid things to occur in some cases
(and real programs depend on this):
- Writes to read-only asm bindings are permitted (such as writing to an "r") constraint. The result of the write is not visible to the caller: it's effectively a temporary inside the scope of the asm block.
=r
(write-only) constraints can be used in a read-write fashion (as if they were+r
).- Values passed to the asm block are sometimes, but not always, type checked, implicitly widened, or implicitly truncated.
To avoid having to characterise and define the perimeter of this buggy behaviour, SCALE's implementation defines the following consistent rules which are intended to maximise compatibility (and minimise "weirdness"):
- All read-only inputs may be written to. The results of these writes are visible only within the scope of the asm block (as if they were local variables being passed by value into a function).
- All write-only outputs are implicitly read-write. ie.: there is no
difference between
+r
and=r
. - The type of an input or output binding is governed by the type of the expression, not the constraint letter. Once "in PTX", the usual PTX rules about implicit truncation/widening/etc. apply. This nuance won't change the behaviour of programs unless they rely on using a "too short" PTX constraint type to truncate a value, and then implicitly widen it within PTX (hence zeroing out some of the bits). Since such truncations are inconsistently applied even with nvidia nvcc mode, they are probably best achieved with an explicit cast.
Performance considerations#
In most cases, there isn't a performance penalty from using PTX asm in CUDA code: it will usually convert to the same IR as the C++ you could have written instead, and may actually be faster due to not needing to be as conservative about optimisation compared to the usual rules of asm blocks.
Since the compiler effectively converts it to the CUDA code you could have
written to achieve the same effect without the use of the PTX asm, it
doesn't come with the optimisation-hindering downsides asm blocks
normally imply. The compiler will respect the ordering/synchronisation/etc.
requirements of each operation individually, rather than having to regard an
entire asm volatile
block as an opaque, immutable unit.
Programs that have already added support for HIP might have multiple codepaths: one for CUDA that uses inline PTX, and one for AMD which doesn't. In such cases, it is worth experimentally determining which is actually superior: the result can vary on a case-by-case basis.
Supported constraints#
The following PTX constraint letters are supported. See above commentary on nuances regarding how they are interpreted.
h
: u16
r
: u32
l
: u64
f
: f32
d
: f64
n
: constants
The "C"
constraint type is in development, and seems likely to prove
useful for authors wishing to generalise their PTX code for wave sizes other
than 32.
Supported instructions#
The following instructions are currently supported.
Caveat: since the bf16
, fp8
and tf32
floating point formats are not
currently supported in SCALE, they are also not supported here.
Instruction | Notes |
---|---|
abs | |
activemask | |
add | |
addc | |
and | |
atom | |
bfe | |
bfi | |
bfind | |
bfind.shiftamt | |
bmsk | |
bra | |
brev | |
brkpt | Currently a no-op |
clz | |
cnot | |
copysign | |
cvt | |
cvt.pack | |
discard | Currently a no-op |
div | |
dp2a | |
dp4a | |
elect | |
exit | Only from convergent code |
fma | |
fns | |
griddepcontrol.launch_dependents | Currently a no-op |
griddepcontrol.wait | Currently a no-op |
isspacep | |
ld | |
ld.nc | |
ldu | |
lop3 | |
mad | |
mad24 | |
madc | |
match.all | |
match.any | |
max | |
max.xorsign.abs | |
min | |
min.xorsign.abs | |
mov | |
mul | |
mul24 | |
nanosleep | |
neg | |
not | |
or | |
pmevent | Currently a no-op |
popc | |
prefetch | |
prefetchu | |
prmt | |
prmt.b4e | |
prmt.ecl | |
prmt.ecr | |
prmt.f4e | |
prmt.rc16 | |
prmt.rc8 | |
rcp | |
red | |
redux | |
rem | |
sad | |
selp | |
set | |
setp | |
shf.l | |
shfl.bfly | |
shfl.down | |
shfl.idx | |
shfl.up | |
shf.r | |
shl | |
shr | |
slct | |
st | |
sub | |
subc | |
szext | |
testp.finite | |
testp.infinite | |
testp.normal | |
testp.notanumber | |
testp.number | |
testp.subnormal | |
trap | |
vabsdiff | |
vadd | |
vmax | |
vmin | |
vote.all | |
vote.any | |
vote.ballot | |
vote.uni | |
vshl | |
vshr | |
vsub | |
xor |