What's new?#
Release 1.5.0 (2025-11-15)#
Platform#
- Compiler is now based on llvm 20.1.8
- Using rocm 7.1 versios of rocBLAS etc.
Supported architectures#
- All architectures are now enabled in the free version of SCALE. See new EULA for details.
- Newly-supported AMD GPU architectures:
gfx950(MI350x/MI355x)gfx1151(Strix-halo etc.)gfx1200(RX 9060 XT etc.)gfx1201(RX 9070 XT etc.)gfx908(MI100)gfx906(MI50/MI60)
Compiler: Inline PTX support#
- Improve PTX diagnostics for unused instruction flags
- Support for
qconstraints (128-bit asm inputs) - Diagnostics for implicit truncation via asm constraints.
- New PTX instructions:
movmatrixbar/barrier: only cases that can be represented as__synthreads_*.
Compiler: NVCC emulation#
- New NVCC flags:
-keep-keep-dir-link-preprocess(alias of-E)-libdevice-directory/-ldir: Meaningless-target-directory/target-dir: Meaningless-cudadevrt: no-op because our devrt implementation uses no smem, so no point.-opt-info: alias of-Rpass, so it can do more than just-inline
- Fix resolution of relative paths for nvcc's option-files flag.
- Fix compiler crash when
fence.clusterappears in inline PTX in dead code. - Accept even more cases of invalid identifiers in unused code in nvcc mode.
- Improvements and fixes to the
__shfl*optimiser. More DPP, less UB. - Add a compiler warning to complain about mixing
__constant__andconstexpr.
Compiler (misc.)#
- Introduce builtins for conversion between arbitrary types (fp8 here we come)
- Further improvements to deferred diagnostics, especialy surrounding typo'd identifiers
- Add
[[clang::getter]] - Fix kernel argument buffer alignment sometimes being wrong
- Microsoft extensions are no longer enabled in CUDA mode.
- Arithmetic involving
threadIdxand friends now compiles. - Various small optimiser enhancements.
Runtime#
- Fix a rare race condition resolving cudaEvents
- Slightly improve performance of every API by optimising error handling routines.
- Introduce nvtx3 support as a header-only library, like it should be.
- New API:
cudaEventRecordWithFlags - Respect
CUDA_GRAPHS_USE_NODE_PRIORITYenvironment variable. - Implement UUID-query APIs, and make them consistent between the driver API and nvml
- Support the new
__nv_*atomic functions - Support (and ignore) the ancient
CU_CTX_MAP_HOST/cudaDeviceMapHostflags. This feature is always enabled. - Startup-time improvements.
- Fix crash when empty CUDA graphs are executed.
- Fix occasionally picking the wrong cache coherence domain for SDMA engines and breaking everything.
- fp16x2/bf16x2 are now trivially-copyable in C++11+: a very tiny extension.
- Fix intermittent crash in cuMemUnmap
- Add the new CUDA13 aligned vector types
- Workaround SDMA bug that gave incorrect results for cuMemsetD16 on some devices.
- Many random header compatibility/typedef fixes.
- Accuracy improvements to
tanh()andsin(). - Fix crash when millions of cudaStreams are destroyed all at once.
- Crash correctly when the GPU executes a trap instruction.
- Fix the GPU trap handler on GFX94x devices.
- Fix a few C89-compatibility issues in less-commonly-used headers.
- Make headers warning-free on GCC, since not everyone uses
-isystemproperly. - Slightly improve performance of nvRTC.
- Raise an error if the user attempts to execute PTX as AMDGPU machine code, instead of actually trying it.
- Fix a few cases where the runtime library and RTC compiler would disagree about what architecture to build for.
Release 1.4.2 (2025-09-19)#
Platform/Packaging#
- Simplified packaging/installation: removed dependency on rocm package repos. Removal of more rocm components from the package is in development.
- Using rocm 6.4.1 versions of rocBLAS etc.
Compiler Diagnostics#
- Warn about PTX instruction flags that don't actually do anything.
- Warn about PTX variable declarations leaking into the enclosing function, since this may cause ptxas failures when building for NVIDIA.
- Warn about passing a generic address space C++ pointer into an asm operand for a non-generic PTX memory instruction if the corresponding addrspacecast is not a no-op.
- Detect when the user seems to have gotten
cvta.toandcvtamixed up. - Cleanly reject
.paramaddress space in PTX: using that in inline asm is undefined behaviour. - Diagnose accidental address operands (eg. use of
[]forcvtais a common screwup). - Proper errors for integer width mismatch in PTX address space conversions. Implicitly truncating pointers is bad.
- Disallow overloads that differ only in return type and sideness in clang-dialect mode.
- Implement a fancier-than-nvcc edition of
-Wdefault-stream-launch, warning about implicit use of the legacy default stream via any launch mechanism, with opt-in support for warning about any use of the default stream (ptds or not). - PTX diagnositcs now correctly point at the offending operand, where applicable.
- Correctly report source locations for diagnostics in PTX blocks that use
Cconstraints.
Compiler Optimisation#
- Use DPP to optimise applicable
__shfl*/__reduce*calls, and loops that perform common reduction/cumsum idioms. - Improved instruction selection for shared-memory reads of sub-word types.
- Vectorisation support for
__shfl*(eg. turns multiple i8 shuffles into an i32 or i64 shuffle). Particularly useful for exploiting architectures that have support for i64 shuffles. - Ability to move extending conversions across shuffles, to avoid shuffling useless bits.
- Don't generate redundant warp reductions prior to uniform atomic ops.
- Constant propagation for
__shfl*. - Improved program startup time in the presence of CUDA translation units with no device code.
- Improvements to
i1vectorisation. - Don't discard
__builtin_provabletoo early when doing LTO or building bitcode libraries.
NVCC Compatibility#
- Support for
__shared__and__constant__keywords an anonymous unions/structs. - Allow
__device__keyword to redundantly accompany__shared__or__constant__. - Remove spurious warning about passing C++ pointers to memory ops.
- Commas in asm statement argument lists are, it turns out, optional.
- Support
forward-unknown-opts. - Correctly handle missing
templatekeywords in more dependent name lookup scenarios. - Tolerate wrong-sided access during constant evaluation.
- Cleaner diagnostics when handling redundant commas in the presence of multiple templates closing at once.
- Allow out-of-line redeclaration of namespaced template functions with conflicting signatures (this becomes an overload, when it should be a compile error).
Inline PTX Support#
- Support for undocumented syntax quirks, like spaces in the middle of directives.
- Support for a wider variety of constant expressions.
- Fix miscompilation of
min.xorsign.abs. - Fix miscompilation of
testp.finitein-ffast-mathmode. - Correct behaviour of
dp4a/dp2ain the presence of overflow. - Correctly parse identifiers including
%on asm blocks with no inputs/outputs. - Fix miscompile of shifts/bmsk with extremely large shift amounts.
- Implement insane implicit asm-input conversion rules.
- Fix miscompiles in video instructions using min/max as the secondary op, since real behaviour turned out to differ from the manual.
- Avoid compiler crash when trying to constant-evaluate a PTX
ninput that has a type error. tf32support, via upcasting tofp32.- Fix some corner cases of mixing vector splats with
_operands. - Fix crash parsing video byte selectors.
- Correctly handle implicit vector-of-vector types, because it turns out those are at least semantically a thing.
- Newly-supported instructions:
cp.async.*min/max, 3-input edition.add/sub/mul/fma/cvtwith non-default rounding modes.mma/wmmafor sub-byte integer types.mma/wmmafor all remaining shapes, in all datatypes except fp8/6/4.red/atomfor vector types.cvt.pack, 4-input edition.nanosleep
- Newly-supported special registers:
%globaltimer%globaltimer_hi%globaltimer_lo%warpid%nwarpid%smid%nsmid
Compiler Misc.#
- Fixed compiler crash when doing
decltype(lambda)in certain conditions. __syncthreads_and()and friends no longer use any shared memory.- Improvements to compile speed, espeically in the presence of inline PTX. Extremely so in the presence of
mma/wmma.
Library: New APIs#
See the API diffs for precise information.
- Added most commonly-used NVML APIs.
- More
cuSOLVER/cuSPARSE. - Non-default-rounding-mode APIs (conversions and arithmetic).
- Added 1D CUarray copy APIs.
- Added device-side versions of device/context property query APIs (
cudaDeviceGetAttributeetc). bmma_sync.atomicAddforfloat2andfloat4.- Added API for programmatically controlling SCALE's exception behaviour.
Library: Fixes#
- Allocator improvements prevent premature OOM due to address space fragmentation in long-running applications with a high level of memory churn.
- Fix const-correctness in coop-groups.
- Removed some macro leaks.
__sad()now behaves correctly in the presence of integer overload.- Fix bugs in FFT plan creation.
- Fix some edgecases relating to delting the active CUcontext before popping it.
- Fixed use of
cuCtxSetCurrentwhen stack is empty - Don't crash when unloading a module at the same time as launching one of its kernels.
scaleenvnow works with zsh, and does not pollute the shell environment.- Endless tiny fixes, random macros, header compatibility tweaks, etc.
- Device PCI IDs now match the format of NVIDIA CUDA exactly.
- Fix some edgecases where denorms weren't being flushed, but should be.
- Stream creation is faster.
- Don't crash when the printf buffer is larger than 4GB.
- Fix rare hang when using the IPC APIs.
Release 1.3.1 (2025-05-12)#
Compiler#
- Fixed a bug in the handling of weak device-side symbols which broke the device binaries for certain projects.
- Fixed various PTX miscompilations.
- Added support for approximate-math PTX instructions (
lg2.approxand friends).
Library#
- Fixed many small bugs in the device-side APIs.
- Per-thread-default-stream actually works now, rather than silently using the legacy stream.
- Fixed a race condition in the fft library.
Thirdparty Project demos#
- GROMACS now works. SCALE appears to support a wider selection of AMD architectures than the HIP port, and seems to perform somewhat better (on MI210, at least!).
Release 1.3.0 (2025-04-23)#
Platform#
- Upgraded from llvm17 to llvm19.1.7.
- Support for rocm 6.3.1.
- Support for
gfx902architecture. - Enterprise edition: Support for new architectures:
gfx908gfx90agfx942
Packaging#
- Packages for Rocky9 are now available.
- Package repos for Ubuntu and Rocky9 to simplify installation/upgrades.
New Features#
- Added
scaleenv, a new and much easier way to use SCALE. - Support for simulating a warp size of 32 even on wave64 platforms, fixing many projects on such platforms.
- Support for
bfloat16. - Compatibility improvements with non-cmake buildsystems.
- Added
SCALE_CUDA_VERSIONenvironment variable to tell SCALE to impersonate a specific version of CUDA. SCALE_EXCEPTIONSnow supports a non-fatal mode.
Library wrappers#
- Added most of cuFFT.
- Added lots more cuSolver and cuSPARSE.
- Filled in some missing NVTX APIs.
- Added homeopathic quantities of nvtx3.
Library#
- Lazy-initialisation of primary contexts now works properly, fixing some subtle lifecycle issues.
- Added some missing undocumented headers like
texture_types.h. - Added the IPC memory/event APIs
- Added many multi-GPU APIs
- Added
cuMemcpyPeer/cuMemcpyPeerAsync. - Rewritten device allocator to work around HSA bugs and performance issues.
- Fix a crash when initialising SCALE with many GPUs with huge amounts of memory.
- Added CUDA IPC APIs. Among other things, this enables CUDA-MPI applications to work, including AMGX's distributed mode.
- Fixed lots of multi-GPU brokenness.
- Implemented the
CU_CTX_SYNC_MEMOPScontext flag. - Fixed accuracy issues in some of the CUDA Math APIs.
- fp16 headers no longer produce warnings for projects that include them without
-isystem. - Improved performance and correctness of cudaMemcpy/memset.
- Fix subtle issues with pointer attribute APIs.
- Improvements to C89 compatibility of headers.
- Added more Cooperative Groups APIs.
- Support for
grid_sync(). - Fix some wave64 issues with
cooperative_groups.h.
Compiler#
__launch_bounds__now works correctly, significantly improving performance.- Device atomics are now much more efficient.
- Denorm-flushing optimisations are no longer skipped when they aren't supposed to be.
- Ability to use DPP to optimise warp shuffles in some cases. Currently,
this only works if the individual shfl is provably equivalent to a DPP op,
not when loop analysis would be required.
__shfl_xoris your friend.
NVCC Interface#
- Corrected the behaviour of the nvcc
-odirflag in during dependency generation. - Added the nvcc
-prec-sqrtand-prec-divflags. use_fast_mathnow matches nivida's behaviour, instead of mapping to clang's-ffast-math, which does too much.--device-cno longer inappropriately triggers the linker.- Newly-supported
nvccflags:-arch=native-jump-table-density(ignored)-compress-mode(ignored)-split-compile-extended(ignored)
NVCC Semantics#
- Support broken template code in more situations in nvcc mode.
- Allow invalid const-correctness in unexpanded template code in nvcc mode.
- Allow trailing commas in template argument lists in nvcc mode.
- Fix a parser crash when explicitly calling
operator<<<int>()in CUDA mode. - Fix a crash when using
--compiler-optionsto pass huge numbers of options through to-Wl.
Diagnostics#
- Warning for unused PTX variables
- Error for attempts to return the carry bit (undefined behaviour on NVIDIA).
- Compiler diagnostic to catch some undefined behaviour patterns with CUDA atomics.
PTX#
- New instructions supported
sm_100variants ofredux.- Mixed-precision
add/sub/fmaFP instructions. membarbar.warp.syncfence(partial)mma(software emulated)wmma(software emulated)
- Fixed parsing of hex-float constants.
- Support for PTX
Cconstraints (dynamic asm strings). - f16/bf16 PTX instructions no longer depend on the corresponding C++ header.
- asm blocks can now refer to variables declared in other asm blocks, including absolutely cursed patterns.
- Fixed an issue where template-dependent asm strings were mishandled.
- Fixed various parsing issues (undocumented syntax quirks etc.).
- Fixed a crash when trying to XOR floating point numbers together.
Thirdparty Project demos#
Things that now appear to work include:
- CUDA-aware MPI
- MAGMA
- whisper.cpp
- TCLB
Release 1.2.0 (2024-11-27)#
Library Enhancements#
- Support for
gfx900architecture. - Support for
gfx1102architecture.
PTX#
- Improved handling of wave64 in inline PTX.
- Various inline PTX compilation fixes.
Other#
- Support for Ubuntu 24.04.
- Upgraded to ROCm 6.2.2.
Release 1.1.0 (2024-10-31)#
Library Enhancements#
- Added much of the CUDA graph API.
- Improvements to multi-GPU handling.
- Fixed rare shutdown-time segfaults.
- Added many random API functions. As usual, see The diff.
PTX#
f16x2,u16x2ands16x2types.fnsinstruction- Fixed miscompile of
sadinstruction.
Thirdparty Project demos#
The scale-validation repo now has working demos for the following
additional projects:
- FLAMEGPU2
- GPUJPEG
- gpu_jpeg2k
Release 1.0.2.0 (2024-09-05)#
Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.
Library Enhancements#
- Kernel arguments larger than 4kb no longer crash the library.
- Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
- Fixed synchronisation bugs when using synchronous
cuMemset*APIs. - Fixed implicit synchronisation behaviour of
cuMemcpy2D/cuMemcpy2DAsync(). - Fixed precision issues in fp16
exp2(),rsqrt(), andh2log(). cudaEventRecordfor an empty event no longer returns a time in the past.- Fixed occupancy API behaviour in edgecases that are not multiples of warp size.
- Fixed rare crashes during static de-initialisation when library wrappers were in use.
- All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
- Various small header fixes.
Compiler Enhancements#
decltype()now works correctly for__host__ __device__functions.-Winvalid-constexprno longer defaults to-Werror, for consistency with nvcc.- PTX variable names including
%are no longer rejected. - Support for nvcc's nonstandard permissiveness surrounding missing
typenamekeywords in dependent types. - Support for nvcc's wacky "split declaration" syntax for
__host__ __devicefunctions (with a warning):int foo(); __device__ int foo(); __host__ int foo() { return 5; } // foo() is a __host__ __device__ function. :D - Newly-supported compiler flags (all of which are aliases for
standard flags, or combinations thereof):
-device-c-device-w-pre-include-library-output-file-define-macro-undefine-macro
New CUDA APIs#
Math APIs#
exp10(__half)exp2(__half)rcp(__half)rint(__half)h2exp10(__half2)h2exp2(__half2)h2rcp(__half2)h2rint(__half2)
Release 1.0.1.0 (2024-07-24)#
This release primarily fixes issues that prevent people from successfully compiling their projects with SCALE. Many thanks to those users who submitted bug reports.
CUDA APIs#
- The
extraargument tocuLaunchKernelis now supported. - Added support for some more undocumented NVIDIA headers.
- Fix various overload resolution issues with atomic APIs.
- Fix overload resolution issues with min/max.
- Added various undocumented macros to support projects that are explicitly checking cuda include guard macros.
lrint()andllrint()no longer crash the compiler. :D- Newly supported CUDA APIs:
nvrtcGetNumSupportedArchsnvrtcGetSupportedArchscudaLaunchKernelEx,cuLaunchKernelEx,cudaLaunchKernelExC: some of the performance-hint launch options are no-ops.__vavgs2,__vavgs4- All the
atomic*_block()andatomic*_system()variants.
Compiler#
- Improved parsing of nvcc arguments:
- Allow undocumented option variants (
-foo bar,--foo bar,--foo=bar, and-foo=barare always allowed, it seems). - Implement "interesting" quoting/escaping rules in nvcc arguments, such as
embedded quotes and
\,. We now correctly handle cursed arguments like:'-Xcompiler=-Wl\,-O1' '-Xcompiler=-Wl\,-rpath\,/usr/lib,-Wl\,-rpath-link\,/usr/lib'
- Allow undocumented option variants (
-
Support for more nvcc arguments:
- NVCC-style diagnostic flags:
-Werror,-disable-warnings, etc. --run,--run-args-Xlinker,-linker-options-no-exceptions,-noeh-minimal: no-op. Exact semantics are undocumented, and build times are reasonably fast anyway.-gen-opt-lto,-dlink-time-opt,-dlto. No-ops: device LTO not yet supported.-t,--threads,-split-compile: No-ops: they're flags for making compilation faster and are specific to how nvcc is implemented.-device-int128: no-op: we always enable int128.-extra-device-vectorization: no-op: vectorisation optimisations are controlled by the usual-O*flags.-entries,-source-in-ptx,-src-in-ptx: no-ops: there is no PTX.-use-local-env,-idp,-ddp,-dp, etc.: ignored since they are meaningless except on Windows.
- NVCC-style diagnostic flags:
-
Allow variadic device functions in non-evaluated functions.
- Don't warn about implicit conversion from
cudaLaneMask_ttobool. __builtin_provableno longer causes compiler crashes in-O0/-O1builds.- Fixed a bug causing PTX
asmblocks inside non-template, non-dependent member functions of template classes to sometimes not be compiled, causing PTX to end up in the AMD binary unmodified. - CUDA launch tokens with spaces (ie.:
myKernel<< <1, 1>> >()) are now supported. - Building non-cuda C translation units with SCALE-nvcc now works.
Other#
- The
mesonbuild system no longer regards SCALE-nvcc as a "broken" compiler. hsakmtsysinfono longer explodes if it doesn't like your GPU.- New documentation pages.
- Published more details about thirdparty testing, including the build scripts.
Release 1.0.0.0 (2024-07-15)#
Initial release