
Release 1.5.0 (2025-11-15)

Platform

  • Compiler is now based on LLVM 20.1.8
  • Now using the ROCm 7.1 versions of rocBLAS etc.

Supported architectures

  • All architectures are now enabled in the free version of SCALE. See the new EULA for details.
  • Newly-supported AMD GPU architectures:
    • gfx950 (MI350x/MI355x)
    • gfx1151 (Strix Halo etc.)
    • gfx1200 (RX 9060 XT etc.)
    • gfx1201 (RX 9070 XT etc.)
    • gfx908 (MI100)
    • gfx906 (MI50/MI60)

Compiler: Inline PTX support

  • Improve PTX diagnostics for unused instruction flags
  • Support for q constraints (128-bit asm inputs)
  • Diagnostics for implicit truncation via asm constraints.
  • New PTX instructions:
    • movmatrix
    • bar/barrier: only cases that can be represented as __syncthreads_*.
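
As an illustration of the q constraint mentioned above, here is a minimal sketch that passes an unsigned __int128 through a 128-bit PTX register. It assumes the standard CUDA inline-PTX syntax for "q" operands and a toolchain that accepts mov.b128 (on NVIDIA, that means PTX ISA 8.3 and sm_70 or newer). The release notes do not say which 128-bit instructions are covered, so treat the choice of mov.b128 as an assumption. The names copy128 and roundtrip128 are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Moves a 128-bit value through a PTX register using the "q" constraint.
// Assumption: the toolchain accepts mov.b128 in inline PTX.
__global__ void copy128(const unsigned __int128 *in, unsigned __int128 *out) {
    unsigned __int128 src = *in;
    unsigned __int128 dst;
    asm("mov.b128 %0, %1;" : "=q"(dst) : "q"(src));
    *out = dst;
}

// Hypothetical helper: returns true if the value survives the round trip.
bool roundtrip128() {
    const unsigned __int128 value =
        ((unsigned __int128)0x0123456789abcdefULL << 64) | 0xfedcba9876543210ULL;

    unsigned __int128 *in = nullptr;
    unsigned __int128 *out = nullptr;
    if (cudaMalloc(&in, sizeof(value)) != cudaSuccess) return false;
    if (cudaMalloc(&out, sizeof(value)) != cudaSuccess) {
        cudaFree(in);
        return false;
    }

    cudaMemcpy(in, &value, sizeof(value), cudaMemcpyHostToDevice);
    copy128<<<1, 1>>>(in, out);

    // The blocking copy back also waits for the kernel to finish.
    unsigned __int128 result = 0;
    cudaError_t err =
        cudaMemcpy(&result, out, sizeof(result), cudaMemcpyDeviceToHost);

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess && result == value;
}

int main() {
    std::printf("128-bit round trip %s\n", roundtrip128() ? "ok" : "failed");
    return 0;
}
```

Binding a narrower variable to a wide operand is the sort of mismatch the new implicit-truncation diagnostic is presumably aimed at, so keeping the C++ type and the constraint width in agreement, as above, is the safer habit.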

Compiler: NVCC emulation

  • New NVCC flags:
    • -keep
    • -keep-dir
    • -link
    • -preprocess (alias of -E)
    • -libdevice-directory/-ldir: Meaningless
    • -target-directory/-target-dir: Meaningless
    • -cudadevrt: no-op, because our devrt implementation uses no shared memory.
    • -opt-info: alias of -Rpass, so it can do more than just -inline
  • Fix resolution of relative paths for nvcc's -options-file flag.
  • Fix compiler crash when fence.cluster appears in inline PTX in dead code.
  • Accept even more cases of invalid identifiers in unused code in nvcc mode.
  • Improvements and fixes to the __shfl* optimiser. More DPP, less UB.
  • Add a compiler warning to complain about mixing __constant__ and constexpr.

Compiler (misc.)

  • Introduce builtins for conversion between arbitrary types (fp8 here we come)
  • Further improvements to deferred diagnostics, especially surrounding typo'd identifiers
  • Add [[clang::getter]]
  • Fix kernel argument buffer alignment sometimes being wrong
  • Microsoft extensions are no longer enabled in CUDA mode.
  • Arithmetic involving threadIdx and friends now compiles.
  • Various small optimiser enhancements.

Runtime

  • Fix a rare race condition resolving cudaEvents
  • Slightly improve performance of every API by optimising error handling routines.
  • Introduce nvtx3 support as a header-only library, like it should be.
  • New API: cudaEventRecordWithFlags
  • Respect CUDA_GRAPHS_USE_NODE_PRIORITY environment variable.
  • Implement UUID-query APIs, and make them consistent between the driver API and NVML
  • Support the new __nv_* atomic functions
  • Support (and ignore) the ancient CU_CTX_MAP_HOST/cudaDeviceMapHost flags. This feature is always enabled.
  • Startup-time improvements.
  • Fix crash when empty CUDA graphs are executed.
  • Fix occasionally picking the wrong cache coherence domain for SDMA engines and breaking everything.
  • fp16x2/bf16x2 are now trivially-copyable in C++11+: a very tiny extension.
  • Fix intermittent crash in cuMemUnmap
  • Add the new CUDA 13 aligned vector types
  • Work around an SDMA bug that gave incorrect results for cuMemsetD16 on some devices.
  • Many random header compatibility/typedef fixes.
  • Accuracy improvements to tanh() and sin().
  • Fix crash when millions of cudaStreams are destroyed all at once.
  • Crash correctly when the GPU executes a trap instruction.
  • Fix the GPU trap handler on GFX94x devices.
  • Fix a few C89-compatibility issues in less-commonly-used headers.
  • Make headers warning-free on GCC, since not everyone uses -isystem properly.
  • Slightly improve performance of NVRTC.
  • Raise an error if the user attempts to execute PTX as AMDGPU machine code, instead of actually trying it.
  • Fix a few cases where the runtime library and RTC compiler would disagree about what architecture to build for.
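
For readers who have not used the newly added cudaEventRecordWithFlags, the sketch below times a trivial kernel with it. It relies only on the standard CUDA runtime signature (event, stream, flags) and passes cudaEventRecordDefault, which behaves like plain cudaEventRecord. The kernel fill and the helper timeFill are hypothetical names chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Writes the same value to every element of the buffer.
__global__ void fill(int *data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical helper: returns the kernel time in milliseconds, or -1 on error.
float timeFill(int n) {
    int *data = nullptr;
    if (cudaMalloc(&data, n * sizeof(int)) != cudaSuccess) return -1.0f;

    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record an event on the stream before and after the kernel launch.
    cudaEventRecordWithFlags(start, stream, cudaEventRecordDefault);
    fill<<<(n + 255) / 256, 256, 0, stream>>>(data, n, 42);
    cudaEventRecordWithFlags(stop, stream, cudaEventRecordDefault);

    float ms = -1.0f;
    if (cudaEventSynchronize(stop) != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(data);
    return ms;
}

int main() {
    float ms = timeFill(1 << 20);
    if (ms < 0.0f) {
        std::printf("timing failed\n");
        return 1;
    }
    std::printf("fill took %.3f ms\n", ms);
    return 0;
}
```

The other flag defined by CUDA, cudaEventRecordExternal, only matters during stream capture. Whether SCALE treats it any differently is not stated in these notes.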