# Inline PTX support

SCALE accepts inline PTX `asm` blocks in CUDA programs and will attempt to compile them for AMD along with the rest of your program.
## Dialect differences

The SCALE compiler accepts a more permissive dialect of PTX than NVIDIA's implementation does.
### Integer lengths

Most PTX instructions are defined to work only for a specific, arbitrary set of integer types. We didn't bother to implement such restrictions except in cases where they are needed for correctness, so many PTX instructions accept a wider selection of types than `nvcc` accepts.

One amusing consequence of this is that most of the simple instructions work for any bit length: `add.s17` is allowed (but will of course lead to extra sext/trunc instructions, so isn't necessarily a good idea).
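As an illustration, here is a hypothetical device function using such a nonstandard width; it is accepted by SCALE's permissive dialect but rejected by `nvcc`:

```cuda
// Hypothetical sketch: `add.s17` is not a standard PTX type width, but
// SCALE's permissive dialect accepts it. The operands live in 32-bit "r"
// bindings, so the compiler inserts extra truncation/extension around the add.
__device__ int add_s17(int a, int b) {
    int result;
    asm("add.s17 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}
```

As noted above, the extra sext/trunc instructions mean this is rarely worth doing deliberately.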
### Divergent exit

AMD hardware does not seem to have a mechanism by which individual threads can terminate early (only entire warps). As a result, the `exit` instruction may be used only in converged contexts.
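A sketch of the distinction (illustrative only; what matters is whether the guarding condition is uniform across the warp):

```cuda
__global__ void exit_demo(const int *flags) {
    // Converged: blockIdx is uniform across the block, so every thread in
    // every warp takes the same branch, and `exit` is fine.
    if (blockIdx.x == 0) {
        asm volatile("exit;");
    }

    // Divergent: this condition varies per thread within a warp, so only
    // some lanes would reach `exit` - not supported on AMD hardware.
    // if (flags[threadIdx.x]) { asm volatile("exit;"); }
}
```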
### Empty asm volatile blocks

To cater to "interesting" user code, the SCALE compiler will not touch `asm volatile` blocks containing no instructions. We've seen real-world CUDA code that uses these as a kind of ad-hoc optimisation barrier to prevent the compiler breaking programs that contain undefined behaviour. This pragmatic choice should reduce how often such broken programs fail to function, but such code is broken by definition.
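For example, a pattern of the kind described (hypothetical; the empty block has no defined semantics, and code relying on it is broken by definition):

```cuda
__device__ int sum_with_barrier(const int *data, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += data[i];
        // Empty block used as an ad-hoc "optimisation barrier".
        // SCALE passes it through untouched rather than deleting it.
        asm volatile("");
    }
    return sum;
}
```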
### asm input/output types

`nvcc` doesn't appear to consistently follow its own typing rules for PTX `asm` inputs/outputs. It allows the following invalid things to occur in some cases (and real programs depend on this):

- Writes to read-only `asm` bindings are permitted (such as writing to an `"r"` constraint). The result of the write is not visible to the caller: it's effectively a temporary inside the scope of the `asm` block.
- `=r` (write-only) constraints can be used in a read-write fashion (as if they were `+r`).
- Values passed to the `asm` block are sometimes, but not always, type checked, implicitly widened, or implicitly truncated.
To avoid having to characterise and define the perimeter of this buggy behaviour, SCALE's implementation defines the following consistent rules, which are intended to maximise compatibility (and minimise "weirdness"):

- All read-only inputs may be written to. The results of these writes are visible only within the scope of the `asm` block (as if they were local variables being passed by value into a function).
- All write-only outputs are implicitly read-write, i.e. there is no difference between `+r` and `=r`.
- The type of an input or output binding is governed by the type of the expression, not the constraint letter. Once "in PTX", the usual PTX rules about implicit truncation/widening/etc. apply. This nuance won't change the behaviour of programs unless they rely on using a "too short" PTX constraint type to truncate a value, and then implicitly widen it within PTX (hence zeroing out some of the bits). Since such truncations are inconsistently applied even in nvidia mode, they are probably best achieved with an explicit cast. In any case, we have thus far never seen such code in the wild!
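A sketch of how the first two rules play out in practice (hypothetical example, under SCALE's semantics as described above):

```cuda
__device__ unsigned binding_rules_demo(unsigned a) {
    unsigned out;
    asm("mov.u32 %0, %1;\n\t"     // write the "=r" output...
        "add.u32 %0, %0, %1;\n\t" // ...then read it back: "=r" acts as "+r"
        "add.u32 %1, %1, 1;"      // writing the read-only "r" input is allowed,
                                  // but the caller never sees the change
        : "=r"(out)
        : "r"(a));
    return out;  // out = 2*a; `a` is unmodified from the caller's perspective
}
```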
## Performance considerations

In most cases, there is no performance penalty for using PTX `asm` in CUDA code: it will usually convert to the same IR as the C++ you could have written instead, and may actually be faster, since the compiler does not need to be as conservative as the usual rules of `asm` blocks require.

Since the compiler effectively converts it to the CUDA code you could have written to achieve the same effect without the PTX `asm`, it doesn't come with the optimisation-hindering downsides `asm` blocks normally imply. The compiler will respect the ordering/synchronisation/etc. requirements of each operation individually, rather than having to treat an entire `asm volatile` block as an opaque, immutable unit.
Programs that have already added support for HIP might have multiple codepaths: one for CUDA that uses inline PTX, and one for AMD which doesn't. In such cases, it is worth experimentally determining which is actually superior: the result can vary on a case-by-case basis.
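Such dual codepaths often look something like the following sketch (the platform macro and function name are assumptions for illustration; real projects vary):

```cuda
__device__ unsigned lane_id() {
#if defined(__HIP_PLATFORM_AMD__)
    // Native HIP path: no inline PTX available.
    return threadIdx.x % warpSize;
#else
    // CUDA path using inline PTX. SCALE compiles this branch too, so it is
    // worth benchmarking both rather than assuming either is faster.
    unsigned id;
    asm("mov.u32 %0, %%laneid;" : "=r"(id));
    return id;
#endif
}
```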
## Supported constraints

The following PTX constraint letters are supported. See the commentary above for nuances regarding how they are interpreted.
- `h`: u16
- `r`: u32
- `l`: u64
- `f`: f32
- `d`: f64
- `n`: constants
The `"C"` constraint type is in development, and seems likely to prove useful for authors wishing to generalise their PTX code for wave sizes other than 32.
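For reference, a sketch exercising several of the supported constraint letters (ordinary inline PTX; nothing here is SCALE-specific):

```cuda
__device__ double constraint_demo(unsigned short h, unsigned r,
                                  unsigned long long l, float f, double d) {
    unsigned short h2;      // "h": u16
    unsigned r2;            // "r": u32
    unsigned long long l2;  // "l": u64
    float f2;               // "f": f32
    double d2;              // "d": f64
    asm("add.u16 %0, %5, %5;\n\t"
        "add.u32 %1, %6, %6;\n\t"
        "add.u64 %2, %7, %7;\n\t"
        "add.f32 %3, %8, %8;\n\t"
        "add.f64 %4, %9, %9;"
        : "=h"(h2), "=r"(r2), "=l"(l2), "=f"(f2), "=d"(d2)
        : "h"(h), "r"(r), "l"(l), "f"(f), "d"(d));
    return h2 + r2 + l2 + f2 + d2;  // each input doubled, then summed
}
```

The `n` constraint binds an immediate constant instead of a register, e.g. `asm("shl.b32 %0, %1, %2;" : "=r"(x) : "r"(y), "n"(4));`.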
## Supported instructions

The following instructions are currently supported.

Caveat: since the `bf16`, `fp8`, and `tf32` floating-point formats are not currently supported in SCALE, they are also not supported here.
Instruction | Notes |
---|---|
abs | |
activemask | |
add | |
addc | |
and | |
atom | |
bfe | |
bfi | |
bfind | |
bfind.shiftamt | |
bmsk | |
bra | |
brev | |
brkpt | Currently a no-op |
clz | |
cnot | |
copysign | |
cvt | |
cvt.pack | |
discard | Currently a no-op |
div | |
dp2a | |
dp4a | |
elect | |
exit | Only from convergent code |
fma | |
griddepcontrol.launch_dependents | Currently a no-op |
griddepcontrol.wait | Currently a no-op |
isspacep | |
ld | |
ld.nc | |
ldu | |
lop3 | |
mad | |
mad24 | |
madc | |
match.all | |
match.any | |
max | |
max.xorsign.abs | |
min | |
min.xorsign.abs | |
mov | |
mul | |
mul24 | |
nanosleep | |
neg | |
not | |
or | |
pmevent | Currently a no-op |
popc | |
prefetch | |
prefetchu | |
prmt | |
prmt.b4e | |
prmt.ecl | |
prmt.ecr | |
prmt.f4e | |
prmt.rc16 | |
prmt.rc8 | |
rcp | |
red | |
redux | |
rem | |
sad | |
selp | |
set | |
setp | |
shf.l | |
shfl.bfly | |
shfl.down | |
shfl.idx | |
shfl.up | |
shf.r | |
shl | |
shr | |
slct | |
st | |
sub | |
subc | |
szext | |
testp.finite | |
testp.infinite | |
testp.normal | |
testp.notanumber | |
testp.number | |
testp.subnormal | |
trap | |
vabsdiff | |
vadd | |
vmax | |
vmin | |
vote.all | |
vote.any | |
vote.ballot | |
vote.uni | |
vshl | |
vshr | |
vsub | |
xor |