Release 1.4.2 (2025-09-19)
Platform/Packaging
- Simplified packaging/installation: removed the dependency on the ROCm package repos. Removal of more ROCm components from the package is in development.
- Now using the ROCm 6.4.1 versions of rocBLAS etc.
Compiler Diagnostics
- Warn about PTX instruction flags that don't actually do anything.
- Warn about PTX variable declarations leaking into the enclosing function, since this may cause ptxas failures when building for NVIDIA.
- Warn about passing a generic address space C++ pointer into an asm operand for a non-generic PTX memory instruction if the corresponding addrspacecast is not a no-op.
- Detect when the user seems to have gotten `cvta.to` and `cvta` mixed up.
- Cleanly reject `.param` address space in PTX: using that in inline asm is undefined behaviour.
- Diagnose accidental address operands (eg. use of `[]` for `cvta` is a common screwup).
- Proper errors for integer width mismatch in PTX address space conversions. Implicitly truncating pointers is bad.
- Disallow overloads that differ only in return type and sideness in clang-dialect mode.
- Implement a fancier-than-nvcc edition of `-Wdefault-stream-launch`, warning about implicit use of the legacy default stream via any launch mechanism, with opt-in support for warning about any use of the default stream (ptds or not).
- PTX diagnostics now correctly point at the offending operand, where applicable.
- Correctly report source locations for diagnostics in PTX blocks that use `C` constraints.
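As a rough illustration of the `-Wdefault-stream-launch` item above, the sketch below contrasts a launch that implicitly uses the default stream with one that names its stream explicitly. The kernel `scale` and the sizes are made up for the example, and exactly which launches the warning fires on is an assumption based on the description in this list.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, used only to have something to launch.
__global__ void scale(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    // No stream argument: the launch implicitly goes to the default stream.
    // This is the pattern the warning is described as targeting.
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);

    // Explicit stream: the launch states which stream it runs on.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d, 2.0f, n);
    cudaStreamSynchronize(stream);

    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}
```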
Compiler Optimisation
- Use DPP to optimise applicable `__shfl*`/`__reduce*` calls, and loops that perform common reduction/cumsum idioms.
- Improved instruction selection for shared-memory reads of sub-word types.
- Vectorisation support for `__shfl*` (eg. turns multiple i8 shuffles into an i32 or i64 shuffle). Particularly useful for exploiting architectures that have support for i64 shuffles.
- Ability to move extending conversions across shuffles, to avoid shuffling useless bits.
- Don't generate redundant warp reductions prior to uniform atomic ops.
- Constant propagation for `__shfl*`.
- Improved program startup time in the presence of CUDA translation units with no device code.
- Improvements to `i1` vectorisation.
- Don't discard `__builtin_provable` too early when doing LTO or building bitcode libraries.
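For readers unfamiliar with the "common reduction idioms" mentioned above, a shuffle-based warp sum typically looks like the sketch below. It is ordinary CUDA, not SCALE-specific, and it assumes a 32-lane warp with every lane active and a block size that is a multiple of 32. The names `warp_sum` and `sum_kernel` are illustrative, and whether the compiler rewrites this exact loop is not something these notes confirm.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sum a value across one warp. Assumes 32 lanes, all of them active.
__device__ float warp_sum(float v) {
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;  // lane 0 now holds the warp total
}

__global__ void sum_kernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Out-of-range threads contribute zero but still take part in the shuffle.
    float v = (i < n) ? in[i] : 0.0f;
    v = warp_sum(v);
    // One atomic per warp instead of one per thread.
    if ((threadIdx.x & 31) == 0) atomicAdd(out, v);
}

int main() {
    const int n = 1024;
    float host[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, sizeof(float));
    cudaMemcpy(in, host, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(out, 0, sizeof(float));

    sum_kernel<<<n / 256, 256>>>(in, out, n);

    float total = 0.0f;
    cudaMemcpy(&total, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("total = %f\n", total);

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```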
NVCC Compatibility
- Support for `__shared__` and `__constant__` keywords on anonymous unions/structs.
- Allow `__device__` keyword to redundantly accompany `__shared__` or `__constant__`.
- Remove spurious warning about passing C++ pointers to memory ops.
- Commas in asm statement argument lists are, it turns out, optional.
- Support `forward-unknown-opts`.
- Correctly handle missing `template` keywords in more dependent name lookup scenarios.
- Tolerate wrong-sided access during constant evaluation.
- Cleaner diagnostics when handling redundant commas in the presence of multiple templates closing at once.
- Allow out-of-line redeclaration of namespaced template functions with conflicting signatures (this becomes an overload, when it should be a compile error).
Inline PTX Support
- Support for undocumented syntax quirks, like spaces in the middle of directives.
- Support for a wider variety of constant expressions.
- Fix miscompilation of `min.xorsign.abs`.
- Fix miscompilation of `testp.finite` in `-ffast-math` mode.
- Correct behaviour of `dp4a`/`dp2a` in the presence of overflow.
- Correctly parse identifiers including `%` in asm blocks with no inputs/outputs.
- Fix miscompile of shifts/bmsk with extremely large shift amounts.
- Implement insane implicit asm-input conversion rules.
- Fix miscompiles in video instructions using min/max as the secondary op, since real behaviour turned out to differ from the manual.
- Avoid compiler crash when trying to constant-evaluate a PTX `n` input that has a type error.
- `tf32` support, via upcasting to `fp32`.
- Fix some corner cases of mixing vector splats with `_` operands.
- Fix crash parsing video byte selectors.
- Correctly handle implicit vector-of-vector types, because it turns out those are at least semantically a thing.
- Newly-supported instructions:
    - `cp.async.*`
    - `min`/`max`, 3-input edition.
    - `add`/`sub`/`mul`/`fma`/`cvt` with non-default rounding modes.
    - `mma`/`wmma` for sub-byte integer types.
    - `mma`/`wmma` for all remaining shapes, in all datatypes except fp8/6/4.
    - `red`/`atom` for vector types.
    - `cvt.pack`, 4-input edition.
    - `nanosleep`
- Newly-supported special registers:
    - `%globaltimer`
    - `%globaltimer_hi`
    - `%globaltimer_lo`
    - `%warpid`
    - `%nwarpid`
    - `%smid`
    - `%nsmid`
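Special registers such as those listed above are normally read through a `mov` in inline PTX. The following is a minimal sketch of that pattern for `%globaltimer` and `%smid`; the helper names (`global_timer_ns`, `sm_id`, `stamp`) are invented for the example, and the values returned on non-NVIDIA hardware are not described in these notes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the 64-bit global timer (PTX documents it as a nanosecond counter).
__device__ unsigned long long global_timer_ns() {
    unsigned long long t;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
    return t;
}

// Read the identifier of the SM the calling thread is running on.
__device__ unsigned sm_id() {
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// Thread 0 of each block records a timestamp and its SM id.
__global__ void stamp(unsigned long long* times, unsigned* sms) {
    if (threadIdx.x == 0) {
        times[blockIdx.x] = global_timer_ns();
        sms[blockIdx.x] = sm_id();
    }
}

int main() {
    const int blocks = 4;
    unsigned long long* d_times = nullptr;
    unsigned* d_sms = nullptr;
    cudaMalloc(&d_times, blocks * sizeof(unsigned long long));
    cudaMalloc(&d_sms, blocks * sizeof(unsigned));

    stamp<<<blocks, 32>>>(d_times, d_sms);

    unsigned long long times[blocks];
    unsigned sms[blocks];
    cudaMemcpy(times, d_times, sizeof(times), cudaMemcpyDeviceToHost);
    cudaMemcpy(sms, d_sms, sizeof(sms), cudaMemcpyDeviceToHost);
    for (int b = 0; b < blocks; ++b)
        printf("block %d: sm %u, t = %llu\n", b, sms[b], times[b]);

    cudaFree(d_times);
    cudaFree(d_sms);
    return 0;
}
```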
Compiler Misc.
- Fixed compiler crash when doing `decltype(lambda)` in certain conditions.
- `__syncthreads_and()` and friends no longer use any shared memory.
- Improvements to compile speed, especially in the presence of inline PTX. Extremely so in the presence of `mma`/`wmma`.
Library: New APIs
See the API diffs for precise information.
- Added most commonly-used NVML APIs.
- More `cuSOLVER`/`cuSPARSE`.
- Non-default-rounding-mode APIs (conversions and arithmetic).
- Added 1D CUarray copy APIs.
- Added device-side versions of device/context property query APIs (`cudaDeviceGetAttribute` etc).
- `bmma_sync`.
- `atomicAdd` for `float2` and `float4`.
- Added API for programmatically controlling SCALE's exception behaviour.
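To give a sense of the vector `atomicAdd` overloads, here is a small sketch that accumulates 2D points into a single `float2`. It assumes the overload follows the NVIDIA CUDA signature `float2 atomicAdd(float2*, float2)`, which on NVIDIA hardware requires compute capability 9.x, operates on global memory, and guarantees atomicity per component rather than for the vector as a whole. Any differences in SCALE's implementation are not covered by these notes, and the kernel name `accumulate` is illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread adds one point into the shared running sum.
__global__ void accumulate(const float2* pts, float2* sum, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, pts[i]);  // vector overload, assumed available
}

int main() {
    const int n = 256;
    float2 host[n];
    for (int i = 0; i < n; ++i) host[i] = make_float2(1.0f, 2.0f);

    float2 *pts = nullptr, *sum = nullptr;
    cudaMalloc(&pts, n * sizeof(float2));
    cudaMalloc(&sum, sizeof(float2));
    cudaMemcpy(pts, host, n * sizeof(float2), cudaMemcpyHostToDevice);
    cudaMemset(sum, 0, sizeof(float2));

    accumulate<<<(n + 127) / 128, 128>>>(pts, sum, n);

    float2 result;
    cudaMemcpy(&result, sum, sizeof(float2), cudaMemcpyDeviceToHost);
    printf("sum = (%f, %f)\n", result.x, result.y);

    cudaFree(pts);
    cudaFree(sum);
    return 0;
}
```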
Library: Fixes
- Allocator improvements prevent premature OOM due to address space fragmentation in long-running applications with a high level of memory churn.
- Fix const-correctness in coop-groups.
- Removed some macro leaks.
- `__sad()` now behaves correctly in the presence of integer overflow.
- Fix bugs in FFT plan creation.
- Fix some edgecases relating to deleting the active CUcontext before popping it.
- Fixed use of `cuCtxSetCurrent` when stack is empty.
- Don't crash when unloading a module at the same time as launching one of its kernels.
- `scaleenv` now works with zsh, and does not pollute the shell environment.
- Endless tiny fixes, random macros, header compatibility tweaks, etc.
- Device PCI IDs now match the format of NVIDIA CUDA exactly.
- Fix some edgecases where denorms weren't being flushed, but should be.
- Stream creation is faster.
- Don't crash when the printf buffer is larger than 4GB.
- Fix rare hang when using the IPC APIs.