What's new?#

Release 1.2.0 (2024-11-27)#

Library Enhancements#

Support for gfx900 architecture.
Support for gfx1102 architecture.

PTX#

Improved handling of wave64 in inline PTX.
Various inline PTX compilation fixes.

Other#

Support for Ubuntu 24.04.
Upgraded to ROCm 6.2.2.

Release 1.1.0 (2024-10-31)#

Library Enhancements#

Added much of the CUDA graph API.
Improvements to multi-GPU handling.
Fixed rare shutdown-time segfaults.
Added many random API functions. As usual, see The diff.

PTX#

f16x2, u16x2 and s16x2 types.
fns instruction
Fixed miscompile of sad instruction.

Thirdparty Project demos#

The scale-validation repo now has working demos for the following additional projects:

FLAMEGPU2
GPUJPEG
gpu_jpeg2k

Release 1.0.2.0 (2024-09-05)#

Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.

Library Enhancements#

Kernel arguments larger than 4kb no longer crash the library.
Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
Fixed synchronisation bugs when using synchronous cuMemset* APIs.
Fixed implicit synchronisation behaviour of cuMemcpy2D/cuMemcpy2DAsync().
Fixed precision issues in fp16 exp2(), rsqrt(), and h2log().
cudaEventRecord for an empty event no longer returns a time in the past.
Fixed occupancy API behaviour in edgecases that are not multiples of warp size.
Fixed rare crashes during static de-initialisation when library wrappers were in use.
All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
Various small header fixes.

Compiler Enhancements#

decltype() now works correctly for __host__ __device__ functions.
-Winvalid-constexpr no longer defaults to -Werror, for consistency with nvcc.
PTX variable names including % are no longer rejected.
Support for nvcc's nonstandard permissiveness surrounding missing typename keywords in dependent types.

Support for nvcc's wacky "split declaration" syntax for __host__ __device functions (with a warning):

int foo();
__device__ int foo();
__host__ int foo() {
    return 5;
}
// foo() is a __host__ __device__ function. :D

Newly-supported compiler flags (all of which are aliases for standard flags, or combinations thereof):
- -device-c
- -device-w
- -pre-include
- -library
- -output-file
- -define-macro
- -undefine-macro

New CUDA APIs#

Math APIs#

exp10(__half)
exp2(__half)
rcp(__half)
rint(__half)
h2exp10(__half2)
h2exp2(__half2)
h2rcp(__half2)
h2rint(__half2)

Release 1.0.1.0 (2024-07-24)#

This release primarily fixes issues that prevent people from successfully compiling their projects with SCALE. Many thanks to those users who submitted bug reports.

CUDA APIs#

The extra argument to cuLaunchKernel is now supported.
Added support for some more undocumented NVIDIA headers.
Fix various overload resolution issues with atomic APIs.
Fix overload resolution issues with min/max.
Added various undocumented macros to support projects that are explicitly checking cuda include guard macros.
lrint() and llrint() no longer crash the compiler. :D
Newly supported CUDA APIs:
- nvrtcGetNumSupportedArchs
- nvrtcGetSupportedArchs
- cudaLaunchKernelEx, cuLaunchKernelEx, cudaLaunchKernelExC: some of the performance-hint launch options are no-ops.
- __vavgs2, __vavgs4
- All the atomic*_block() and atomic*_system() variants.

Compiler#

Improved parsing of nvcc arguments:
- Allow undocumented option variants (-foo bar, --foo bar, --foo=bar, and -foo=bar are always allowed, it seems).
- Implement "interesting" quoting/escaping rules in nvcc arguments, such as embedded quotes and \,. We now correctly handle cursed arguments like: '-Xcompiler=-Wl\,-O1' '-Xcompiler=-Wl\,-rpath\,/usr/lib,-Wl\,-rpath-link\,/usr/lib'
Support for more nvcc arguments:
- NVCC-style diagnostic flags: -Werror, -disable-warnings, etc.
- --run, --run-args
- -Xlinker, -linker-options
- -no-exceptions, -noeh
- -minimal: no-op. Exact semantics are undocumented, and build times are reasonably fast anyway.
- -gen-opt-lto, -dlink-time-opt, -dlto. No-ops: device LTO not yet supported.
- -t, --threads, -split-compile: No-ops: they're flags for making compilation faster and are specific to how nvcc is implemented.
- -device-int128: no-op: we always enable int128.
- -extra-device-vectorization: no-op: vectorisation optimisations are controlled by the usual -O* flags.
- -entries, -source-in-ptx, -src-in-ptx: no-ops: there is no PTX.
- -use-local-env, -idp, -ddp, -dp, etc.: ignored since they are meaningless except on Windows.
Allow variadic device functions in non-evaluated functions.
Don't warn about implicit conversion from cudaLaneMask_t to bool.
__builtin_provable no longer causes compiler crashes in -O0/-O1 builds.
Fixed a bug causing PTX asm blocks inside non-template, non-dependent member functions of template classes to sometimes not be compiled, causing PTX to end up in the AMD binary unmodified.
CUDA launch tokens with spaces (ie.: myKernel<< <1, 1>> >()) are now supported.
Building non-cuda C translation units with SCALE-nvcc now works.

Other#

The meson build system no longer regards SCALE-nvcc as a "broken" compiler.
hsakmtsysinfo no longer explodes if it doesn't like your GPU.
New documentation pages.
Published more details about thirdparty testing, including the build scripts.

Release 1.0.0.0 (2024-07-15)#

Initial release