
What's new?#

Release 1.5.0 (2025-11-15)#

Platform#

  • Compiler is now based on llvm 20.1.8
  • Using rocm 7.1 versions of rocBLAS etc.

Supported architectures#

  • All architectures are now enabled in the free version of SCALE. See the new EULA for details.
  • Newly-supported AMD GPU architectures:
    • gfx950 (MI350x/MI355x)
    • gfx1151 (Strix-halo etc.)
    • gfx1200 (RX 9060 XT etc.)
    • gfx1201 (RX 9070 XT etc.)
    • gfx908 (MI100)
    • gfx906 (MI50/MI60)

Compiler: Inline PTX support#

  • Improve PTX diagnostics for unused instruction flags
  • Support for q constraints (128-bit asm inputs)
  • Diagnostics for implicit truncation via asm constraints.
  • New PTX instructions:
    • movmatrix
    • bar/barrier: only cases that can be represented as __syncthreads_*.
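
For illustration, a minimal sketch of using the newly supported movmatrix instruction from inline PTX, assuming the m8n8.trans.b16 form and a hypothetical helper name:

    // Transpose an 8x8 tile of b16 elements held collectively by the warp.
    __device__ unsigned transpose_tile(unsigned src) {
        unsigned dst;
        asm volatile("movmatrix.sync.aligned.m8n8.trans.b16 %0, %1;"
                     : "=r"(dst)
                     : "r"(src));
        return dst;
    }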

Compiler: NVCC emulation#

  • New NVCC flags:
    • -keep
    • -keep-dir
    • -link
    • -preprocess (alias of -E)
    • -libdevice-directory/-ldir: Meaningless
    • -target-directory/-target-dir: Meaningless
    • -cudadevrt: no-op, since our devrt implementation uses no smem, so the flag serves no purpose.
    • -opt-info: alias of -Rpass, so it can do more than just -inline
  • Fix resolution of relative paths for nvcc's --options-file flag.
  • Fix compiler crash when fence.cluster appears in inline PTX in dead code.
  • Accept even more cases of invalid identifiers in unused code in nvcc mode.
  • Improvements and fixes to the __shfl* optimiser. More DPP, less UB.
  • Add a compiler warning to complain about mixing __constant__ and constexpr.
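
A sketch of the kind of code the new __constant__/constexpr warning flags (variable and kernel names are illustrative):

    // Mixing __constant__ with constexpr now triggers a compiler warning.
    __constant__ constexpr int kBlockSize = 256;

    __global__ void fill(int *out) {
        out[threadIdx.x] = kBlockSize;
    }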

Compiler (misc.)#

  • Introduce builtins for conversion between arbitrary types (fp8 here we come)
  • Further improvements to deferred diagnostics, especially surrounding typo'd identifiers
  • Add [[clang::getter]]
  • Fix kernel argument buffer alignment sometimes being wrong
  • Microsoft extensions are no longer enabled in CUDA mode.
  • Arithmetic involving threadIdx and friends now compiles.
  • Various small optimiser enhancements.

Runtime#

  • Fix a rare race condition resolving cudaEvents
  • Slightly improve performance of every API by optimising error handling routines.
  • Introduce nvtx3 support as a header-only library, like it should be.
  • New API: cudaEventRecordWithFlags (usage sketch after this list)
  • Respect CUDA_GRAPHS_USE_NODE_PRIORITY environment variable.
  • Implement UUID-query APIs, and make them consistent between the driver API and nvml
  • Support the new __nv_* atomic functions
  • Support (and ignore) the ancient CU_CTX_MAP_HOST/cudaDeviceMapHost flags. This feature is always enabled.
  • Startup-time improvements.
  • Fix crash when empty CUDA graphs are executed.
  • Fix occasionally picking the wrong cache coherence domain for SDMA engines and breaking everything.
  • fp16x2/bf16x2 are now trivially-copyable in C++11+: a very tiny extension.
  • Fix intermittent crash in cuMemUnmap
  • Add the new CUDA13 aligned vector types
  • Workaround SDMA bug that gave incorrect results for cuMemsetD16 on some devices.
  • Many random header compatibility/typedef fixes.
  • Accuracy improvements to tanh() and sin().
  • Fix crash when millions of cudaStreams are destroyed all at once.
  • Crash correctly when the GPU executes a trap instruction.
  • Fix the GPU trap handler on GFX94x devices.
  • Fix a few C89-compatibility issues in less-commonly-used headers.
  • Make headers warning-free on GCC, since not everyone uses -isystem properly.
  • Slightly improve performance of nvRTC.
  • Raise an error if the user attempts to execute PTX as AMDGPU machine code, instead of actually trying it.
  • Fix a few cases where the runtime library and RTC compiler would disagree about what architecture to build for.
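
A minimal usage sketch for the new cudaEventRecordWithFlags API (error checking elided; names are illustrative):

    #include <cuda_runtime.h>

    __global__ void touch(int *p) { p[threadIdx.x] = threadIdx.x; }

    int main() {
        int *buf = nullptr;
        cudaMalloc(&buf, 64 * sizeof(int));

        cudaStream_t stream;
        cudaEvent_t done;
        cudaStreamCreate(&stream);
        cudaEventCreate(&done);

        touch<<<1, 64, 0, stream>>>(buf);

        // Like cudaEventRecord, but with an explicit flags argument.
        cudaEventRecordWithFlags(done, stream, cudaEventRecordDefault);
        cudaEventSynchronize(done);

        cudaEventDestroy(done);
        cudaStreamDestroy(stream);
        cudaFree(buf);
        return 0;
    }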

Release 1.4.2 (2025-09-19)#

Platform/Packaging#

  • Simplified packaging/installation: removed dependency on rocm package repos. Removal of more rocm components from the package is in development.
  • Using rocm 6.4.1 versions of rocBLAS etc.

Compiler Diagnostics#

  • Warn about PTX instruction flags that don't actually do anything.
  • Warn about PTX variable declarations leaking into the enclosing function, since this may cause ptxas failures when building for NVIDIA.
  • Warn about passing a generic address space C++ pointer into an asm operand for a non-generic PTX memory instruction if the corresponding addrspacecast is not a no-op.
  • Detect when the user seems to have gotten cvta.to and cvta mixed up.
  • Cleanly reject .param address space in PTX: using that in inline asm is undefined behaviour.
  • Diagnose accidental address operands (eg. use of [] for cvta is a common screwup).
  • Proper errors for integer width mismatch in PTX address space conversions. Implicitly truncating pointers is bad.
  • Disallow overloads that differ only in return type and sidedness in clang-dialect mode.
  • Implement a fancier-than-nvcc edition of -Wdefault-stream-launch, warning about implicit use of the legacy default stream via any launch mechanism, with opt-in support for warning about any use of the default stream (ptds or not).
  • PTX diagnostics now correctly point at the offending operand, where applicable.
  • Correctly report source locations for diagnostics in PTX blocks that use C constraints.
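
A sketch of the launches the extended -Wdefault-stream-launch distinguishes (trivial kernel for illustration):

    __global__ void k() {}

    void launch(cudaStream_t s) {
        k<<<1, 1>>>();                          // implicit legacy default stream: warned
        k<<<1, 1, 0, cudaStreamPerThread>>>();  // ptds default stream: warned only in the opt-in mode
        k<<<1, 1, 0, s>>>();                    // explicit stream: not warned
    }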

Compiler Optimisation#

  • Use DPP to optimise applicable __shfl*/__reduce* calls, and loops that perform common reduction/cumsum idioms (see the sketch after this list).
  • Improved instruction selection for shared-memory reads of sub-word types.
  • Vectorisation support for __shfl* (eg. turns multiple i8 shuffles into an i32 or i64 shuffle). Particularly useful for exploiting architectures that have support for i64 shuffles.
  • Ability to move extending conversions across shuffles, to avoid shuffling useless bits.
  • Don't generate redundant warp reductions prior to uniform atomic ops.
  • Constant propagation for __shfl*.
  • Improved program startup time in the presence of CUDA translation units with no device code.
  • Improvements to i1 vectorisation.
  • Don't discard __builtin_provable too early when doing LTO or building bitcode libraries.
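
A sketch of the kind of warp reduction the __shfl* optimisations are aimed at (a standard butterfly reduction; this assumes a 32-lane warp, and whether it lowers to DPP is up to the optimiser):

    __device__ float warp_sum(float v) {
        // Each step shuffles across a fixed lane distance, the pattern
        // the DPP optimisation recognises.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_xor_sync(0xffffffffu, v, offset);
        return v;
    }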

NVCC Compatibility#

  • Support for __shared__ and __constant__ keywords on anonymous unions/structs (example after this list).
  • Allow __device__ keyword to redundantly accompany __shared__ or __constant__.
  • Remove spurious warning about passing C++ pointers to memory ops.
  • Commas in asm statement argument lists are, it turns out, optional.
  • Support -forward-unknown-opts.
  • Correctly handle missing template keywords in more dependent name lookup scenarios.
  • Tolerate wrong-sided access during constant evaluation.
  • Cleaner diagnostics when handling redundant commas in the presence of multiple templates closing at once.
  • Allow out-of-line redeclaration of namespaced template functions with conflicting signatures (this becomes an overload, when it should be a compile error).
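
A sketch of the anonymous-union form that is now accepted (hypothetical kernel; member names are illustrative):

    __global__ void histogram(const unsigned char *in, unsigned *out, int n) {
        // __shared__ applied directly to an anonymous union: the members are
        // referenced by name and share one block of shared memory.
        __shared__ union {
            unsigned counts[256];
            float    scratch[256];
        };

        if (threadIdx.x < 256) counts[threadIdx.x] = 0;
        __syncthreads();
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            atomicAdd(&counts[in[i]], 1u);
        __syncthreads();
        if (threadIdx.x < 256) out[threadIdx.x] = counts[threadIdx.x];
    }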

Inline PTX Support#

  • Support for undocumented syntax quirks, like spaces in the middle of directives.
  • Support for a wider variety of constant expressions.
  • Fix miscompilation of min.xorsign.abs.
  • Fix miscompilation of testp.finite in -ffast-math mode.
  • Correct behaviour of dp4a/dp2a in the presence of overflow.
  • Correctly parse identifiers including % on asm blocks with no inputs/outputs.
  • Fix miscompile of shifts/bmsk with extremely large shift amounts.
  • Implement insane implicit asm-input conversion rules.
  • Fix miscompiles in video instructions using min/max as the secondary op, since real behaviour turned out to differ from the manual.
  • Avoid compiler crash when trying to constant-evaluate an "n"-constraint PTX input that has a type error.
  • tf32 support, via upcasting to fp32.
  • Fix some corner cases of mixing vector splats with _ operands.
  • Fix crash parsing video byte selectors.
  • Correctly handle implicit vector-of-vector types, because it turns out those are at least semantically a thing.
  • Newly-supported instructions:
    • cp.async.*
    • min/max, 3-input edition.
    • add/sub/mul/fma/cvt with non-default rounding modes.
    • mma/wmma for sub-byte integer types.
    • mma/wmma for all remaining shapes, in all datatypes except fp8/6/4.
    • red/atom for vector types.
    • cvt.pack, 4-input edition.
    • nanosleep
  • Newly-supported special registers:
    • %globaltimer
    • %globaltimer_hi
    • %globaltimer_lo
    • %warpid
    • %nwarpid
    • %smid
    • %nsmid
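
A sketch reading one of the newly supported special registers (%globaltimer) from inline PTX:

    // Read the 64-bit global timer special register.
    __device__ unsigned long long global_timer() {
        unsigned long long t;
        asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(t));
        return t;
    }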

Compiler Misc.#

  • Fixed compiler crash when doing decltype(lambda) in certain conditions.
  • __syncthreads_and() and friends no longer use any shared memory.
  • Improvements to compile speed, especially in the presence of inline PTX. Extremely so in the presence of mma/wmma.
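
For reference, the __syncthreads_*() variants mentioned above combine a per-thread predicate across the block and return the result to every thread; a minimal sketch (kernel name is illustrative):

    __global__ void all_positive(const float *data, int *result) {
        int ok = data[threadIdx.x] > 0.0f;
        // Nonzero only if the predicate was nonzero for every thread in the block.
        if (__syncthreads_and(ok) && threadIdx.x == 0)
            *result = 1;
    }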

Library: New APIs#

See the API diffs for precise information.

  • Added most commonly-used NVML APIs.
  • More cuSOLVER/cuSPARSE.
  • Non-default-rounding-mode APIs (conversions and arithmetic).
  • Added 1D CUarray copy APIs.
  • Added device-side versions of device/context property query APIs (cudaDeviceGetAttribute etc).
  • bmma_sync.
  • atomicAdd for float2 and float4.
  • Added API for programmatically controlling SCALE's exception behaviour.
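
A sketch of the new float2 atomicAdd overload (float4 is analogous; the kernel is illustrative):

    __global__ void accumulate(float2 *sum, const float2 *vals, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(sum, vals[i]);   // float2 overload: each component is added atomically
    }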

Library: Fixes#

  • Allocator improvements prevent premature OOM due to address space fragmentation in long-running applications with a high level of memory churn.
  • Fix const-correctness in coop-groups.
  • Removed some macro leaks.
  • __sad() now behaves correctly in the presence of integer overflow.
  • Fix bugs in FFT plan creation.
  • Fix some edgecases relating to deleting the active CUcontext before popping it.
  • Fixed use of cuCtxSetCurrent when the stack is empty
  • Don't crash when unloading a module at the same time as launching one of its kernels.
  • scaleenv now works with zsh, and does not pollute the shell environment.
  • Endless tiny fixes, random macros, header compatibility tweaks, etc.
  • Device PCI IDs now match the format of NVIDIA CUDA exactly.
  • Fix some edgecases where denorms weren't being flushed, but should be.
  • Stream creation is faster.
  • Don't crash when the printf buffer is larger than 4GB.
  • Fix rare hang when using the IPC APIs.

Release 1.3.1 (2025-05-12)#

Compiler#

  • Fixed a bug in the handling of weak device-side symbols which broke the device binaries for certain projects.
  • Fixed various PTX miscompilations.
  • Added support for approximate-math PTX instructions (lg2.approx and friends).

Library#

  • Fixed many small bugs in the device-side APIs.
  • Per-thread-default-stream actually works now, rather than silently using the legacy stream.
  • Fixed a race condition in the fft library.

Thirdparty Project demos#

  • GROMACS now works. SCALE appears to support a wider selection of AMD architectures than the HIP port, and seems to perform somewhat better (on MI210, at least!).

Release 1.3.0 (2025-04-23)#

Platform#

  • Upgraded from llvm17 to llvm19.1.7.
  • Support for rocm 6.3.1.
  • Support for gfx902 architecture.
  • Enterprise edition: Support for new architectures:
    • gfx908
    • gfx90a
    • gfx942

Packaging#

  • Packages for Rocky9 are now available.
  • Package repos for Ubuntu and Rocky9 to simplify installation/upgrades.

New Features#

  • Added scaleenv, a new and much easier way to use SCALE.
  • Support for simulating a warp size of 32 even on wave64 platforms, fixing many projects on such platforms.
  • Support for bfloat16.
  • Compatibility improvements with non-cmake buildsystems.
  • Added SCALE_CUDA_VERSION environment variable to tell SCALE to impersonate a specific version of CUDA.
  • SCALE_EXCEPTIONS now supports a non-fatal mode.

Library wrappers#

  • Added most of cuFFT.
  • Added lots more cuSolver and cuSPARSE.
  • Filled in some missing NVTX APIs.
  • Added homeopathic quantities of nvtx3.

Library#

  • Lazy-initialisation of primary contexts now works properly, fixing some subtle lifecycle issues.
  • Added some missing undocumented headers like texture_types.h.
  • Added the IPC memory/event APIs
  • Added many multi-GPU APIs
  • Added cuMemcpyPeer/cuMemcpyPeerAsync.
  • Rewritten device allocator to work around HSA bugs and performance issues.
  • Fix a crash when initialising SCALE with many GPUs with huge amounts of memory.
  • Added CUDA IPC APIs. Among other things, this enables CUDA-MPI applications to work, including AMGX's distributed mode (usage sketch after this list).
  • Fixed lots of multi-GPU brokenness.
  • Implemented the CU_CTX_SYNC_MEMOPS context flag.
  • Fixed accuracy issues in some of the CUDA Math APIs.
  • fp16 headers no longer produce warnings for projects that include them without -isystem.
  • Improved performance and correctness of cudaMemcpy/memset.
  • Fix subtle issues with pointer attribute APIs.
  • Improvements to C89 compatibility of headers.
  • Added more Cooperative Groups APIs.
  • Support for grid_sync().
  • Fix some wave64 issues with cooperative_groups.h.
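
A sketch of the IPC memory APIs in use, split across two processes (transporting the handle between processes, e.g. over MPI or a pipe, is elided):

    // Process A: allocate device memory and export a handle to it.
    cudaIpcMemHandle_t handle;
    float *dev_buf = nullptr;
    cudaMalloc(&dev_buf, 1024 * sizeof(float));
    cudaIpcGetMemHandle(&handle, dev_buf);
    // ... send `handle` to process B ...

    // Process B: map the other process's allocation into this address space.
    void *peer_buf = nullptr;
    cudaIpcOpenMemHandle(&peer_buf, handle, cudaIpcMemLazyEnablePeerAccess);
    // ... use peer_buf with kernels, cudaMemcpy, etc. ...
    cudaIpcCloseMemHandle(peer_buf);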

Compiler#

  • __launch_bounds__ now works correctly, significantly improving performance (example after this list).
  • Device atomics are now much more efficient.
  • Denorm-flushing optimisations are no longer skipped when they aren't supposed to be.
  • Ability to use DPP to optimise warp shuffles in some cases. Currently, this only works if the individual shfl is provably equivalent to a DPP op, not when loop analysis would be required. __shfl_xor is your friend.
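
For reference, a minimal __launch_bounds__ example as mentioned above (numbers and kernel name are illustrative):

    // At most 256 threads per block, and aim for at least 4 resident blocks
    // per multiprocessor when allocating registers.
    __global__ void __launch_bounds__(256, 4) scale_by(float *data, float s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= s;
    }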

NVCC Interface#

  • Corrected the behaviour of the nvcc -odir flag during dependency generation.
  • Added the nvcc -prec-sqrt and -prec-div flags.
  • -use_fast_math now matches NVIDIA's behaviour, instead of mapping to clang's -ffast-math, which does too much.
  • --device-c no longer inappropriately triggers the linker.
  • Newly-supported nvcc flags:
    • -arch=native
    • -jump-table-density (ignored)
    • -compress-mode (ignored)
    • -split-compile-extended (ignored)

NVCC Semantics#

  • Support broken template code in more situations in nvcc mode.
  • Allow invalid const-correctness in unexpanded template code in nvcc mode.
  • Allow trailing commas in template argument lists in nvcc mode.
  • Fix a parser crash when explicitly calling operator<<<int>() in CUDA mode.
  • Fix a crash when using --compiler-options to pass huge numbers of options through to -Wl.

Diagnostics#

  • Warning for unused PTX variables
  • Error for attempts to return the carry bit (undefined behaviour on NVIDIA).
  • Compiler diagnostic to catch some undefined behaviour patterns with CUDA atomics.

PTX#

  • New instructions supported
    • sm_100 variants of redux.
    • Mixed-precision add/sub/fma FP instructions.
    • membar
    • bar.warp.sync
    • fence (partial)
    • mma (software emulated)
    • wmma (software emulated)
  • Fixed parsing of hex-float constants.
  • Support for PTX C constraints (dynamic asm strings).
  • f16/bf16 PTX instructions no longer depend on the corresponding C++ header.
  • asm blocks can now refer to variables declared in other asm blocks, including absolutely cursed patterns.
  • Fixed an issue where template-dependent asm strings were mishandled.
  • Fixed various parsing issues (undocumented syntax quirks etc.).
  • Fixed a crash when trying to XOR floating point numbers together.

Thirdparty Project demos#

Things that now appear to work include:

  • CUDA-aware MPI
  • MAGMA
  • whisper.cpp
  • TCLB

Release 1.2.0 (2024-11-27)#

Library Enhancements#

  • Support for gfx900 architecture.
  • Support for gfx1102 architecture.

Other#

  • Support for Ubuntu 24.04.
  • Upgraded to ROCm 6.2.2.

Release 1.1.0 (2024-10-31)#

Library Enhancements#

  • Added much of the CUDA graph API.
  • Improvements to multi-GPU handling.
  • Fixed rare shutdown-time segfaults.
  • Added many random API functions. As usual, see the diff.

PTX#

  • f16x2, u16x2 and s16x2 types.
  • fns instruction
  • Fixed miscompile of sad instruction.

Thirdparty Project demos#

The scale-validation repo now has working demos for the following additional projects:

  • FLAMEGPU2
  • GPUJPEG
  • gpu_jpeg2k

Release 1.0.2.0 (2024-09-05)#

Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.

Library Enhancements#

  • Kernel arguments larger than 4KB no longer crash the library.
  • Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
  • Fixed synchronisation bugs when using synchronous cuMemset* APIs.
  • Fixed implicit synchronisation behaviour of cuMemcpy2D/cuMemcpy2DAsync().
  • Fixed precision issues in fp16 exp2(), rsqrt(), and h2log().
  • cudaEventRecord for an empty event no longer returns a time in the past.
  • Fixed occupancy API behaviour in edgecases that are not multiples of warp size.
  • Fixed rare crashes during static de-initialisation when library wrappers were in use.
  • All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
  • Various small header fixes.

Compiler Enhancements#

  • decltype() now works correctly for __host__ __device__ functions.
  • -Winvalid-constexpr no longer defaults to -Werror, for consistency with nvcc.
  • PTX variable names including % are no longer rejected.
  • Support for nvcc's nonstandard permissiveness surrounding missing typename keywords in dependent types.
  • Support for nvcc's wacky "split declaration" syntax for __host__ __device__ functions (with a warning):
    int foo();
    __device__ int foo();
    __host__ int foo() {
        return 5;
    }
    // foo() is a __host__ __device__ function. :D
    
  • Newly-supported compiler flags (all of which are aliases for standard flags, or combinations thereof):
    • -device-c
    • -device-w
    • -pre-include
    • -library
    • -output-file
    • -define-macro
    • -undefine-macro

New CUDA APIs#

Math APIs#

  • exp10(__half)
  • exp2(__half)
  • rcp(__half)
  • rint(__half)
  • h2exp10(__half2)
  • h2exp2(__half2)
  • h2rcp(__half2)
  • h2rint(__half2)
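
A sketch using a couple of these half-precision APIs (kernel name is illustrative; cuda_fp16.h provides __half2):

    #include <cuda_fp16.h>

    __global__ void transform(__half2 *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            v[i] = h2exp2(v[i]);   // elementwise 2^x on both halves
            v[i] = h2rcp(v[i]);    // elementwise reciprocal
        }
    }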

Release 1.0.1.0 (2024-07-24)#

This release primarily fixes issues that prevent people from successfully compiling their projects with SCALE. Many thanks to those users who submitted bug reports.

CUDA APIs#

  • The extra argument to cuLaunchKernel is now supported (see the sketch after this list).
  • Added support for some more undocumented NVIDIA headers.
  • Fix various overload resolution issues with atomic APIs.
  • Fix overload resolution issues with min/max.
  • Added various undocumented macros to support projects that are explicitly checking cuda include guard macros.
  • lrint() and llrint() no longer crash the compiler. :D
  • Newly supported CUDA APIs:
    • nvrtcGetNumSupportedArchs
    • nvrtcGetSupportedArchs
    • cudaLaunchKernelEx, cuLaunchKernelEx, cudaLaunchKernelExC: some of the performance-hint launch options are no-ops.
    • __vavgs2, __vavgs4
    • All the atomic*_block() and atomic*_system() variants.
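
A sketch of the extra-argument form of cuLaunchKernel, which passes all kernel arguments in one packed buffer instead of kernelParams (the CUfunction handle, launch dimensions, and argument layout are illustrative, and filling the buffer is elided):

    char arg_buffer[256];
    size_t arg_size = 0;
    // ... pack the kernel's arguments into arg_buffer and set arg_size ...

    void *extra[] = {
        CU_LAUNCH_PARAM_BUFFER_POINTER, arg_buffer,
        CU_LAUNCH_PARAM_BUFFER_SIZE,    &arg_size,
        CU_LAUNCH_PARAM_END
    };
    cuLaunchKernel(func, /*grid*/ 1, 1, 1, /*block*/ 64, 1, 1,
                   /*sharedMemBytes*/ 0, /*stream*/ NULL,
                   /*kernelParams*/ NULL, extra);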

Compiler#

  • Improved parsing of nvcc arguments:
    • Allow undocumented option variants (-foo bar, --foo bar, --foo=bar, and -foo=bar are always allowed, it seems).
    • Implement "interesting" quoting/escaping rules in nvcc arguments, such as embedded quotes and \,. We now correctly handle cursed arguments like: '-Xcompiler=-Wl\,-O1' '-Xcompiler=-Wl\,-rpath\,/usr/lib,-Wl\,-rpath-link\,/usr/lib'
  • Support for more nvcc arguments:

    • NVCC-style diagnostic flags: -Werror, -disable-warnings, etc.
    • --run, --run-args
    • -Xlinker, -linker-options
    • -no-exceptions, -noeh
    • -minimal: no-op. Exact semantics are undocumented, and build times are reasonably fast anyway.
    • -gen-opt-lto, -dlink-time-opt, -dlto. No-ops: device LTO not yet supported.
    • -t, --threads, -split-compile: No-ops: they're flags for making compilation faster and are specific to how nvcc is implemented.
    • -device-int128: no-op: we always enable int128.
    • -extra-device-vectorization: no-op: vectorisation optimisations are controlled by the usual -O* flags.
    • -entries, -source-in-ptx, -src-in-ptx: no-ops: there is no PTX.
    • -use-local-env, -idp, -ddp, -dp, etc.: ignored since they are meaningless except on Windows.
  • Allow variadic device functions in non-evaluated functions.

  • Don't warn about implicit conversion from cudaLaneMask_t to bool.
  • __builtin_provable no longer causes compiler crashes in -O0/-O1 builds.
  • Fixed a bug causing PTX asm blocks inside non-template, non-dependent member functions of template classes to sometimes not be compiled, causing PTX to end up in the AMD binary unmodified.
  • CUDA launch tokens with spaces (i.e. myKernel<< <1, 1>> >()) are now supported.
  • Building non-cuda C translation units with SCALE-nvcc now works.

Other#

  • The meson build system no longer regards SCALE-nvcc as a "broken" compiler.
  • hsakmtsysinfo no longer explodes if it doesn't like your GPU.
  • New documentation pages.
  • Published more details about thirdparty testing, including the build scripts.

Release 1.0.0.0 (2024-07-15)#

Initial release