Release 1.3.0 (2025-04-23)#
Platform#
- Upgraded from LLVM 17 to LLVM 19.1.7.
- Support for ROCm 6.3.1.
- Support for the `gfx902` architecture.
- Enterprise edition: support for new architectures: `gfx908`, `gfx90a`, `gfx942`.
Packaging#
- Packages for Rocky 9 are now available.
- Package repos for Ubuntu and Rocky 9 to simplify installation/upgrades.
New Features#
- Added `scaleenv`, a new and much easier way to use SCALE.
- Support for simulating a warp size of 32 even on wave64 platforms, fixing many projects on such platforms.
- Support for `bfloat16`.
- Compatibility improvements with non-CMake build systems.
- Added the `SCALE_CUDA_VERSION` environment variable to tell SCALE to impersonate a specific version of CUDA.
- `SCALE_EXCEPTIONS` now supports a non-fatal mode.
Library wrappers#
- Added most of cuFFT.
- Added lots more cuSolver and cuSPARSE.
- Filled in some missing NVTX APIs.
- Added homeopathic quantities of nvtx3.
Library#
- Lazy-initialisation of primary contexts now works properly, fixing some subtle lifecycle issues.
- Added some missing undocumented headers such as `texture_types.h`.
- Added the IPC memory/event APIs.
- Added many multi-GPU APIs.
- Added `cuMemcpyPeer`/`cuMemcpyPeerAsync`.
- Rewrote the device allocator to work around HSA bugs and performance issues.
- Fixed a crash when initialising SCALE on systems with many GPUs and huge amounts of memory.
- Added CUDA IPC APIs. Among other things, this enables CUDA-MPI applications to work, including AMGX's distributed mode.
- Fixed lots of multi-GPU brokenness.
- Implemented the `CU_CTX_SYNC_MEMOPS` context flag.
- Fixed accuracy issues in some of the CUDA Math APIs.
- fp16 headers no longer produce warnings for projects that include them without `-isystem`.
- Improved performance and correctness of `cudaMemcpy`/`cudaMemset`.
- Fixed subtle issues with the pointer attribute APIs.
- Improvements to C89 compatibility of headers.
- Added more Cooperative Groups APIs.
- Support for `grid_sync()`.
- Fixed some wave64 issues with `cooperative_groups.h`.
Compiler#
- `__launch_bounds__` now works correctly, significantly improving performance.
- Device atomics are now much more efficient.
- Denorm-flushing optimisations are no longer skipped when they aren't supposed to be.
- Ability to use DPP to optimise warp shuffles in some cases. Currently, this only works if the individual shuffle is provably equivalent to a DPP op, not when loop analysis would be required. `__shfl_xor` is your friend.
NVCC Interface#
- Corrected the behaviour of the nvcc `-odir` flag during dependency generation.
- Added the nvcc `-prec-sqrt` and `-prec-div` flags.
- `use_fast_math` now matches NVIDIA's behaviour, instead of mapping to clang's `-ffast-math`, which does too much.
- `--device-c` no longer inappropriately triggers the linker.
- Newly-supported nvcc flags:
  - `-arch=native`
  - `-jump-table-density` (ignored)
  - `-compress-mode` (ignored)
  - `-split-compile-extended` (ignored)
NVCC Semantics#
- Support broken template code in more situations in nvcc mode.
- Allow invalid const-correctness in unexpanded template code in nvcc mode.
- Allow trailing commas in template argument lists in nvcc mode.
- Fixed a parser crash when explicitly calling `operator<<<int>()` in CUDA mode.
- Fixed a crash when using `--compiler-options` to pass huge numbers of options through to `-Wl`.
Diagnostics#
- Warning for unused PTX variables.
- Error for attempts to return the carry bit (undefined behaviour on NVIDIA).
- Compiler diagnostic to catch some undefined behaviour patterns with CUDA atomics.
PTX#
- New instructions supported:
  - `sm_100` variants of `redux`.
  - Mixed-precision FP `add`/`sub`/`fma` instructions.
  - `membar`
  - `bar.warp.sync`
  - `fence` (partial)
  - `mma` (software emulated)
  - `wmma` (software emulated)
- Fixed parsing of hex-float constants.
- Support for PTX `C` constraints (dynamic asm strings).
- f16/bf16 PTX instructions no longer depend on the corresponding C++ header.
- asm blocks can now refer to variables declared in other asm blocks, including absolutely cursed patterns.
- Fixed an issue where template-dependent asm strings were mishandled.
- Fixed various parsing issues (undocumented syntax quirks etc.).
- Fixed a crash when trying to XOR floating point numbers together.
Third-party project demos#
Things that now appear to work include:
- CUDA-aware MPI
- MAGMA
- whisper.cpp
- TCLB