Release 1.3.0 (2025-04-23)#
Platform#
- Upgraded from LLVM 17 to LLVM 19.1.7.
- Support for ROCm 6.3.1.
- Support for the `gfx902` architecture.
- Enterprise edition: support for new architectures: `gfx908`, `gfx90a`, `gfx942`.
Packaging#
- Packages for Rocky 9 are now available.
- Package repos for Ubuntu and Rocky 9 to simplify installation/upgrades.
New Features#
- Added `scaleenv`, a new and much easier way to use SCALE.
- Support for simulating a warp size of 32 even on wave64 platforms, fixing many projects on such platforms.
- Support for `bfloat16`.
- Compatibility improvements with non-CMake build systems.
- Added the `SCALE_CUDA_VERSION` environment variable to tell SCALE to impersonate a specific version of CUDA.
- `SCALE_EXCEPTIONS` now supports a non-fatal mode.
Library wrappers#
- Added most of cuFFT.
- Added lots more cuSolver and cuSPARSE.
- Filled in some missing NVTX APIs.
- Added homeopathic quantities of nvtx3.
Library#
- Lazy-initialisation of primary contexts now works properly, fixing some subtle lifecycle issues.
- Added some missing undocumented headers such as `texture_types.h`.
- Added the IPC memory/event APIs.
- Added many multi-GPU APIs.
- Added `cuMemcpyPeer`/`cuMemcpyPeerAsync`.
- Rewrote the device allocator to work around HSA bugs and performance issues.
- Fixed a crash when initialising SCALE on systems with many GPUs and huge amounts of memory.
- Added CUDA IPC APIs. Among other things, this enables CUDA-MPI applications to work, including AMGX's distributed mode.
- Fixed lots of multi-GPU brokenness.
- Implemented the `CU_CTX_SYNC_MEMOPS` context flag.
- Fixed accuracy issues in some of the CUDA Math APIs.
- fp16 headers no longer produce warnings for projects that include them without `-isystem`.
- Improved performance and correctness of `cudaMemcpy`/`cudaMemset`.
- Fixed subtle issues with the pointer attribute APIs.
- Improvements to C89 compatibility of headers.
- Added more Cooperative Groups APIs.
- Support for `grid_sync()`.
- Fixed some wave64 issues with `cooperative_groups.h`.
Compiler#
- `__launch_bounds__` now works correctly, significantly improving performance.
- Device atomics are now much more efficient.
- Denorm-flushing optimisations are no longer skipped when they aren't supposed to be.
- Ability to use DPP to optimise warp shuffles in some cases. Currently, this only works if the individual shuffle is provably equivalent to a DPP op, not when loop analysis would be required. `__shfl_xor` is your friend.
NVCC Interface#
- Corrected the behaviour of the nvcc `-odir` flag during dependency generation.
- Added the nvcc `-prec-sqrt` and `-prec-div` flags.
- `use_fast_math` now matches NVIDIA's behaviour, instead of mapping to clang's `-ffast-math`, which does too much.
- `--device-c` no longer inappropriately triggers the linker.
- Newly-supported nvcc flags:
  - `-arch=native`
  - `-jump-table-density` (ignored)
  - `-compress-mode` (ignored)
  - `-split-compile-extended` (ignored)
NVCC Semantics#
- Support broken template code in more situations in nvcc mode.
- Allow invalid const-correctness in unexpanded template code in nvcc mode.
- Allow trailing commas in template argument lists in nvcc mode.
- Fixed a parser crash when explicitly calling `operator<<<int>()` in CUDA mode.
- Fixed a crash when using `--compiler-options` to pass huge numbers of options through to `-Wl`.
Diagnostics#
- Warning for unused PTX variables.
- Error for attempts to return the carry bit (undefined behaviour on NVIDIA).
- Compiler diagnostic to catch some undefined behaviour patterns with CUDA atomics.
PTX#
- New instructions supported:
  - `sm_100` variants of `redux`.
  - Mixed-precision FP `add`/`sub`/`fma` instructions.
  - `membar`
  - `bar.warp.sync`
  - `fence` (partial)
  - `mma` (software emulated)
  - `wmma` (software emulated)
- Fixed parsing of hex-float constants.
- Support for PTX `C` constraints (dynamic asm strings).
- f16/bf16 PTX instructions no longer depend on the corresponding C++ header.
- asm blocks can now refer to variables declared in other asm blocks, including absolutely cursed patterns.
- Fixed an issue where template-dependent asm strings were mishandled.
- Fixed various parsing issues (undocumented syntax quirks etc.).
- Fixed a crash when trying to XOR floating point numbers together.
Third-party project demos#
Things that now appear to work include:
- CUDA-aware MPI
- MAGMA
- whisper.cpp
- TCLB