Release 1.5.1 (2026-02-03)
Compiler: NVCC emulation
- -Wregister no longer defaults to an error in nvcc dialect mode.
- -Xfatbin and -Xptxas are now parsed to extract meaningful flags from them.
- nvcc -v's path outputs now reflect the actual CUDA install search process.
- Fixed an edge case of the target_overload attribute that led to incompatibility with some versions of the standard library.
- Don't crash when a variable is illegally declared __host__ __device__.
- nv_weak is now an alias of weak.
- Added support for the -ptx flag.
- Fixed some edge cases of argument translation for list-form arguments.
- Dependency information is no longer duplicated for CUDA mode (for -M and friends).
- Even more missing-typename relaxations to support the nvcc dialect.
- Unqualified lookups of dependent members in base classes are now allowed in nvcc mode.
- Fix some builtin macro pedantry so programs that check with #if <foo> == 1, as well as those that simply use #ifdef <foo>, work.
- Implement some simple IR-level optimisations for __byte_perm().
- --extended-lambda and --relaxed-constexpr now drive the same macro as they do in NVIDIA's implementation, even though they still do not change the behaviour of the semantic analyzer.
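As background for the __byte_perm() item above, here is a minimal host-side sketch of the byte-selection semantics NVIDIA documents for that intrinsic. The name byte_perm_ref is hypothetical, and the sketch says nothing about how the IR-level optimisations in this release are implemented. It only illustrates that, when the selector is a constant, the result is a fixed shuffle of the input bytes, which is the kind of call a compiler can plausibly simplify.

```cpp
#include <cstdint>

// Hypothetical host-side reference for __byte_perm(x, y, s).
// x supplies bytes 0-3 and y supplies bytes 4-7 of an eight-byte pool.
// Each of the four low nibbles of s picks one pool byte for the output;
// only the low 3 bits of each nibble are used.
constexpr std::uint32_t byte_perm_ref(std::uint32_t x, std::uint32_t y,
                                      std::uint32_t s) {
    const std::uint64_t pool = (static_cast<std::uint64_t>(y) << 32) | x;
    std::uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const std::uint32_t index = (s >> (4 * i)) & 0x7u;  // pool byte to pick
        const std::uint32_t byte =
            static_cast<std::uint32_t>((pool >> (8 * index)) & 0xFFu);
        result |= byte << (8 * i);                          // place at output byte i
    }
    return result;
}

// With a constant selector the whole call can be evaluated at compile time.
static_assert(byte_perm_ref(0x03020100u, 0x07060504u, 0x3210u) == 0x03020100u,
              "selector 0x3210 returns x unchanged");
static_assert(byte_perm_ref(0x03020100u, 0x07060504u, 0x7654u) == 0x07060504u,
              "selector 0x7654 returns y unchanged");
static_assert(byte_perm_ref(0x03020100u, 0x07060504u, 0x0123u) == 0x00010203u,
              "selector 0x0123 reverses the bytes of x");
```

The three static_assert lines double as usage examples: an identity on x, an identity on y, and a byte reversal of x.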
Compiler: Inline PTX support
- Add minifloat types (fp8/6/4) to the PTX type system, pending implementation of the corresponding tensor core ops.
- Fix subtle miscompiles surrounding edge cases for .relu PTX operations. Seems the semantics of the .relu operator differ between opcodes...
- Fix miscompile when both .satf and .relu are provided on the same instruction.
- Implement __nv_is_extended_device_lambda_with_preserved_return_type. Although it is largely meaningless given our proper lambda support, user code relies on it.
- Treat .mmio as an alias of .volatile, at least until we support some devices with actual IO devices stuck to the GPU.
- Fix a compiler crash when passing a pointer to a struct directly into a PTX asm block.
- Fix a compiler crash when compiling certain SIMD PTX instructions, such as fma.sat.f32x2.
- Fix a type inference failure for instructions that include vector expressions with constants in them, such as asm("mov.b32 %0, {0, %1};" : "=f"(f) : "h"(h));.
- Fix a compiler crash when a mandatory round-mode operand was omitted from a float-to-int cvt instruction.
- Fix failure to compile certain cases of cvt, such as f64<->f16.
- Fix failure to compile vector-splat cases of cvt for small float types.
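The round-mode item above refers to the integer rounding modifier that PTX requires on a float-to-int cvt (.rni, .rzi, .rmi or .rpi). As a rough illustration of what each modifier means, the following host-side sketch models them with standard library rounding functions. The enum IntRound and the helper cvt_s32_f32 are hypothetical names, and the model assumes finite, in-range inputs and the default host rounding mode; it does not attempt to reproduce PTX saturation or NaN behaviour.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Integer rounding modifiers accepted by a float-to-int cvt.
enum class IntRound { rni, rzi, rmi, rpi };

// Hypothetical host-side model of cvt.<mode>.s32.f32.
// Assumes x is finite and fits in int32_t, and that the host rounding
// mode is the default round-to-nearest (needed for std::nearbyint).
inline std::int32_t cvt_s32_f32(float x, IntRound mode) {
    switch (mode) {
        case IntRound::rni:  // nearest integer, ties to even
            return static_cast<std::int32_t>(std::nearbyint(x));
        case IntRound::rzi:  // toward zero
            return static_cast<std::int32_t>(std::trunc(x));
        case IntRound::rmi:  // toward negative infinity
            return static_cast<std::int32_t>(std::floor(x));
        case IntRound::rpi:  // toward positive infinity
            return static_cast<std::int32_t>(std::ceil(x));
    }
    return 0;  // not reached for valid enum values
}

// The same input converts differently under each modifier, which is why
// the modifier cannot simply be omitted.
inline void check_cvt_examples() {
    assert(cvt_s32_f32(2.5f, IntRound::rni) == 2);
    assert(cvt_s32_f32(-2.7f, IntRound::rzi) == -2);
    assert(cvt_s32_f32(-2.1f, IntRound::rmi) == -3);
    assert(cvt_s32_f32(2.1f, IntRound::rpi) == 3);
}
```

Calling check_cvt_examples() from a test or from main exercises one input per modifier.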
Compiler: Other
- Fix a miscompilation in which rounding-mode state would sometimes be leaked, leaving the rounding mode register set to a non-default value after executing certain combinations of special-rounding-mode operations.
- clangd now crashes less often.
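The rounding-mode leak described above is a GPU code generation issue, but the general hazard (changing a rounding mode and failing to restore it) has a familiar host-side analogue. The sketch below is only that analogy, written against the standard <cfenv> interface; ScopedRounding and divide_rounding_up are hypothetical names and have nothing to do with how the compiler fix is implemented.

```cpp
#include <cassert>
#include <cfenv>

// RAII guard: switch the host rounding mode, restore the old one on exit.
class ScopedRounding {
public:
    explicit ScopedRounding(int mode) : saved_(std::fegetround()) {
        std::fesetround(mode);
    }
    ~ScopedRounding() { std::fesetround(saved_); }
    ScopedRounding(const ScopedRounding&) = delete;
    ScopedRounding& operator=(const ScopedRounding&) = delete;

private:
    int saved_;
};

// Performs one division with upward rounding. The volatile accesses keep
// the compiler from folding the division or moving it outside the guard.
inline float divide_rounding_up(float a, float b) {
    ScopedRounding guard(FE_UPWARD);
    volatile float va = a;
    volatile float vb = b;
    volatile float result = va / vb;
    return result;
}

// After the special-rounding operation, the mode must be what it was before.
inline void check_rounding_restored() {
    const int before = std::fegetround();
    const float up = divide_rounding_up(1.0f, 3.0f);
    assert(std::fegetround() == before);  // nothing leaked
    assert(up >= 1.0f / 3.0f);            // upward result is never below nearest
}
```

If the destructor were missing, every later floating-point operation on the thread would silently round upward, which is the host equivalent of the leaked state described in the release note.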
Library
- Newly-added CUDA APIs:
  - cudaMalloc3D
  - cuCtxSynchronize_v2
  - cuCtxGetDevice_v2
  - cuCtxWaitEvent
  - cuFuncGetName
  - cuFuncGetParamInfo
  - cuStreamWriteValue32
- Add the NVML APIs for querying process information.
- Small improvements to exception handling performance.
- Fix an implicit synchronisation bug for memsets. Memsets targeting pinned host memory are supposed to implicitly synchronise the affected stream.
- Fix a crash when using the extra argument of various launch APIs to pass a user-provided kernel argument buffer containing objects with "diamond of doom" inheritance.
- half2char now properly converts to signed char, not unsigned char.
- Fixed handling of -0 and NaN in various Math API routines.
- Fixed a few cases where denorms were flushed on the "wrong side" of the operation. NVIDIA don't document which operations flush denorms before doing the operation (e.g. min()), and which ones flush afterwards.
- Fixed an endianness issue affecting bf16x2 assembly APIs such as __low2bfloat16.
- Entirely disabled SDMA for gfx9xx devices except gfx90a, due to apparent hardware bugs. This should resolve issues where cudaMemcpys would mysteriously not synchronise with the host even when asked to. Seems SDMA barriers sometimes just don't work...?
- Fixed various header defects (missing typedef enum, implicit STL includes, etc.)
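To see why the side on which denorms are flushed matters (the denorm item above), consider this small host-side sketch. It uses multiplication purely as an illustration, because the difference is easy to observe there; it makes no claim about which order any particular NVIDIA operation uses. The helper names are hypothetical, and the sketch assumes the host itself is not running with flush-to-zero enabled.

```cpp
#include <cassert>
#include <cmath>

// Flush a subnormal value to a zero of the same sign; pass others through.
inline float flush_denorm(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL ? std::copysign(0.0f, v) : v;
}

// Model A: flush the inputs first, then operate.
inline float mul_flush_inputs(float a, float b) {
    return flush_denorm(a) * flush_denorm(b);
}

// Model B: operate on the raw inputs, then flush the result.
inline float mul_flush_result(float a, float b) {
    return flush_denorm(a * b);
}

// A subnormal input times a large factor gives a normal product, so the
// two models disagree: A yields zero, B yields a nonzero value.
inline void check_flush_order() {
    const float tiny = 1e-40f;  // below the smallest normal binary32 value
    const float big = 1e10f;
    assert(std::fpclassify(tiny) == FP_SUBNORMAL);
    assert(mul_flush_inputs(tiny, big) == 0.0f);
    assert(mul_flush_result(tiny, big) > 0.0f);
}
```

An emulation layer that picks the wrong model for an operation therefore returns a visibly different result for such inputs, even though both models agree whenever no subnormal is involved.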
Other
- scaleenv no longer leaks PS1 to the env.
- The scripts in scale-validation are now much simpler, hopefully making it more obvious that they are just copies of the upstream build scripts!