Release 1.3.0 (2025-04-23)

Platform

  • Upgraded from LLVM 17 to LLVM 19.1.7.
  • Support for ROCm 6.3.1.
  • Support for gfx902 architecture.
  • Enterprise edition: Support for new architectures:
    • gfx908
    • gfx90a
    • gfx942

Packaging

  • Packages for Rocky9 are now available.
  • Package repos for Ubuntu and Rocky9 to simplify installation/upgrades.

New Features

  • Added scaleenv, a new and much easier way to use SCALE.
  • Support for simulating a warp size of 32 even on wave64 platforms, fixing many projects on such platforms.
  • Support for bfloat16.
  • Compatibility improvements with non-CMake build systems.
  • Added SCALE_CUDA_VERSION environment variable to tell SCALE to impersonate a specific version of CUDA.
  • SCALE_EXCEPTIONS now supports a non-fatal mode.
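
To illustrate why a simulated warp size of 32 matters on wave64 hardware, here is a minimal host-side C++ sketch. It does not show how SCALE implements the feature. It only models the assumption that much CUDA code bakes in: lane and warp indices are derived by dividing by 32, and lane masks are 32 bits wide. The names `LanePos`, `lane_pos`, `kFullMask32` and `mask_can_name_lane` are hypothetical and exist only for this example.

```cpp
#include <cstdint>

// Hypothetical helper type for this sketch: a thread's warp index and lane index.
struct LanePos {
    unsigned warp;
    unsigned lane;
};

// Where a thread lands inside its block for a given warp width.
inline LanePos lane_pos(unsigned thread_in_block, unsigned warp_size) {
    return LanePos{thread_in_block / warp_size, thread_in_block % warp_size};
}

// The "all lanes" mask that 32-lane CUDA code typically passes to *_sync intrinsics.
constexpr std::uint32_t kFullMask32 = 0xffffffffu;

// True if the 32-bit full mask has a bit for the given lane.
// The range check comes first so the shift amount is always below 32.
inline bool mask_can_name_lane(unsigned lane) {
    return lane < 32 && ((kFullMask32 >> lane) & 1u) != 0;
}
```

With a warp width of 32, thread 40 of a block is lane 8 of warp 1, which a 32-bit mask can address. With a native width of 64, the same thread is lane 40 of warp 0, which the mask cannot name. Code written against the first layout tends to misbehave under the second.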

Library wrappers

  • Added most of cuFFT.
  • Added lots more cuSolver and cuSPARSE.
  • Filled in some missing NVTX APIs.
  • Added homeopathic quantities of nvtx3.

Library

  • Lazy-initialisation of primary contexts now works properly, fixing some subtle lifecycle issues.
  • Added some missing undocumented headers like texture_types.h.
  • Added the IPC memory/event APIs.
  • Added many multi-GPU APIs.
  • Added cuMemcpyPeer/cuMemcpyPeerAsync.
  • Rewrote the device allocator to work around HSA bugs and performance issues.
  • Fixed a crash when initialising SCALE with many GPUs with huge amounts of memory.
  • Added CUDA IPC APIs. Among other things, this enables CUDA-MPI applications to work, including AMGX's distributed mode.
  • Fixed lots of multi-GPU brokenness.
  • Implemented the CU_CTX_SYNC_MEMOPS context flag.
  • Fixed accuracy issues in some of the CUDA Math APIs.
  • fp16 headers no longer produce warnings for projects that include them without -isystem.
  • Improved performance and correctness of cudaMemcpy/memset.
  • Fixed subtle issues with pointer attribute APIs.
  • Improvements to C89 compatibility of headers.
  • Added more Cooperative Groups APIs.
  • Support for grid_sync().
  • Fixed some wave64 issues with cooperative_groups.h.

Compiler

  • __launch_bounds__ now works correctly, significantly improving performance.
  • Device atomics are now much more efficient.
  • Denorm-flushing optimisations are no longer skipped when they aren't supposed to be.
  • Ability to use DPP to optimise warp shuffles in some cases. Currently, this only works if the individual shfl is provably equivalent to a DPP op, not when loop analysis would be required. __shfl_xor is your friend.
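
As an illustration of the kind of shuffle pattern the last item points at, here is a host-side C++ model of a butterfly (XOR) warp reduction. Each step is a separate shuffle with a fixed lane mask rather than a loop over offsets, which is the shape the note suggests is easiest to analyse. Whether SCALE lowers any particular step to a DPP operation is not stated here and is an assumption of this sketch. The names `Warp`, `shfl_xor`, `add` and `warp_sum` are hypothetical; on the device, each step would correspond to something like `v += __shfl_xor_sync(0xffffffffu, v, 16);`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;
using Warp = std::array<int, kWarpSize>;  // one value per lane

// Host model of an XOR shuffle: lane i reads the value held by lane (i ^ lane_mask).
inline Warp shfl_xor(const Warp& v, std::size_t lane_mask) {
    assert(lane_mask < kWarpSize);
    Warp out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = v[lane ^ lane_mask];
    }
    return out;
}

// Lane-wise addition of two warps' worth of values.
inline Warp add(const Warp& a, const Warp& b) {
    Warp out{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        out[lane] = a[lane] + b[lane];
    }
    return out;
}

// Butterfly reduction with every step written out and every lane mask a constant.
inline Warp warp_sum(Warp v) {
    v = add(v, shfl_xor(v, 16));
    v = add(v, shfl_xor(v, 8));
    v = add(v, shfl_xor(v, 4));
    v = add(v, shfl_xor(v, 2));
    v = add(v, shfl_xor(v, 1));
    return v;  // every lane now holds the sum over all 32 lanes
}
```

After the five steps, every lane holds the full sum, so no final broadcast is needed.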

NVCC Interface

  • Corrected the behaviour of the nvcc -odir flag during dependency generation.
  • Added the nvcc -prec-sqrt and -prec-div flags.
  • -use_fast_math now matches NVIDIA's behaviour, instead of mapping to clang's -ffast-math, which does too much.
  • --device-c no longer inappropriately triggers the linker.
  • Newly-supported nvcc flags:
    • -arch=native
    • -jump-table-density (ignored)
    • -compress-mode (ignored)
    • -split-compile-extended (ignored)

NVCC Semantics

  • Support broken template code in more situations in nvcc mode.
  • Allow invalid const-correctness in unexpanded template code in nvcc mode.
  • Allow trailing commas in template argument lists in nvcc mode.
  • Fix a parser crash when explicitly calling operator<<<int>() in CUDA mode.
  • Fix a crash when using --compiler-options to pass huge numbers of options through to -Wl.
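
For readers wondering what "explicitly calling operator<<<int>()" looks like, here is a small sketch in plain C++. In standard C++ the characters `<<<` lex as `<<` followed by `<`, so the call is an explicit template argument on `operator<<`. In CUDA mode `<<<` also opens a kernel launch configuration, which is presumably where the ambiguity came from; that explanation is an assumption, not something stated above. `Sink` and `explicit_call_demo` are hypothetical names.

```cpp
// A type with a templated stream-style operator.
struct Sink {
    int total = 0;

    template <typename T>
    Sink& operator<<(T value) {
        total += static_cast<int>(value);
        return *this;
    }
};

// Calls operator<< by name with an explicit template argument, in both spellings.
inline int explicit_call_demo() {
    Sink s;
    s.operator<<<int>(40);   // unspaced: "<<" then "<int>"
    s.operator<< <int>(2);   // spaced form, same meaning
    return s.total;
}
```

Writing the spaced form is a simple way to keep such calls unambiguous for any front end.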

Diagnostics

  • Warning for unused PTX variables.
  • Error for attempts to return the carry bit (undefined behaviour on NVIDIA).
  • Compiler diagnostic to catch some undefined behaviour patterns with CUDA atomics.

PTX

  • New instructions supported:
    • sm_100 variants of redux.
    • Mixed-precision add/sub/fma FP instructions.
    • membar
    • bar.warp.sync
    • fence (partial)
    • mma (software emulated)
    • wmma (software emulated)
  • Fixed parsing of hex-float constants.
  • Support for PTX C constraints (dynamic asm strings).
  • f16/bf16 PTX instructions no longer depend on the corresponding C++ header.
  • asm blocks can now refer to variables declared in other asm blocks, including absolutely cursed patterns.
  • Fixed an issue where template-dependent asm strings were mishandled.
  • Fixed various parsing issues (undocumented syntax quirks etc.).
  • Fixed a crash when trying to XOR floating point numbers together.
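
As background for the hex-float item, and assuming it refers to PTX's bit-pattern float literals (`0f` or `0F` followed by exactly eight hex digits for a single-precision value), here is a minimal C++ sketch of how such a literal maps to a `float`. It is an illustration only, not SCALE's parser, and `decode_ptx_f32` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Decodes a PTX-style single-precision literal such as "0f3F800000".
// Returns false if the text is not "0f"/"0F" followed by exactly 8 hex digits.
inline bool decode_ptx_f32(const std::string& lit, float& out) {
    if (lit.size() != 10 || lit[0] != '0' || (lit[1] != 'f' && lit[1] != 'F')) {
        return false;
    }
    std::uint32_t bits = 0;
    for (std::size_t i = 2; i < lit.size(); ++i) {
        const char c = lit[i];
        std::uint32_t digit = 0;
        if (c >= '0' && c <= '9') {
            digit = static_cast<std::uint32_t>(c - '0');
        } else if (c >= 'a' && c <= 'f') {
            digit = static_cast<std::uint32_t>(c - 'a') + 10u;
        } else if (c >= 'A' && c <= 'F') {
            digit = static_cast<std::uint32_t>(c - 'A') + 10u;
        } else {
            return false;
        }
        bits = (bits << 4) | digit;
    }
    // The digits are the IEEE-754 bit pattern, so reinterpret rather than convert.
    std::memcpy(&out, &bits, sizeof out);
    return true;
}
```

The literal is a raw bit pattern, not a C-style `0x1.8p+1` hexadecimal float, which is why the decoder copies bits instead of doing numeric conversion.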

Third-party project demos

Things that now appear to work include:

  • CUDA-aware MPI
  • MAGMA
  • whisper.cpp
  • TCLB