What's new?#
Release 1.2.0 (2024-11-27)#
Library Enhancements#
- Support for
gfx900architecture. - Support for
gfx1102architecture.
PTX#
- Improved handling of wave64 in inline PTX.
- Various inline PTX compilation fixes.
Other#
- Support for Ubuntu 24.04.
- Upgraded to ROCm 6.2.2.
Release 1.1.0 (2024-10-31)#
Library Enhancements#
- Added much of the CUDA graph API.
- Improvements to multi-GPU handling.
- Fixed rare shutdown-time segfaults.
- Added many random API functions. As usual, see The diff.
PTX#
f16x2,u16x2ands16x2types.fnsinstruction- Fixed miscompile of
sadinstruction.
Thirdparty Project demos#
The scale-validation repo now has working demos for the following
additional projects:
- FLAMEGPU2
- GPUJPEG
- gpu_jpeg2k
Release 1.0.2.0 (2024-09-05)#
Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.
Library Enhancements#
- Kernel arguments larger than 4kb no longer crash the library.
- Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
- Fixed synchronisation bugs when using synchronous
cuMemset*APIs. - Fixed implicit synchronisation behaviour of
cuMemcpy2D/cuMemcpy2DAsync(). - Fixed precision issues in fp16
exp2(),rsqrt(), andh2log(). cudaEventRecordfor an empty event no longer returns a time in the past.- Fixed occupancy API behaviour in edgecases that are not multiples of warp size.
- Fixed rare crashes during static de-initialisation when library wrappers were in use.
- All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
- Various small header fixes.
Compiler Enhancements#
decltype()now works correctly for__host__ __device__functions.-Winvalid-constexprno longer defaults to-Werror, for consistency with nvcc.- PTX variable names including
%are no longer rejected. - Support for nvcc's nonstandard permissiveness surrounding missing
typenamekeywords in dependent types. - Support for nvcc's wacky "split declaration" syntax for
__host__ __devicefunctions (with a warning):int foo(); __device__ int foo(); __host__ int foo() { return 5; } // foo() is a __host__ __device__ function. :D - Newly-supported compiler flags (all of which are aliases for
standard flags, or combinations thereof):
-device-c-device-w-pre-include-library-output-file-define-macro-undefine-macro
New CUDA APIs#
Math APIs#
exp10(__half)exp2(__half)rcp(__half)rint(__half)h2exp10(__half2)h2exp2(__half2)h2rcp(__half2)h2rint(__half2)
Release 1.0.1.0 (2024-07-24)#
This release primarily fixes issues that prevent people from successfully compiling their projects with SCALE. Many thanks to those users who submitted bug reports.
CUDA APIs#
- The
extraargument tocuLaunchKernelis now supported. - Added support for some more undocumented NVIDIA headers.
- Fix various overload resolution issues with atomic APIs.
- Fix overload resolution issues with min/max.
- Added various undocumented macros to support projects that are explicitly checking cuda include guard macros.
lrint()andllrint()no longer crash the compiler. :D- Newly supported CUDA APIs:
nvrtcGetNumSupportedArchsnvrtcGetSupportedArchscudaLaunchKernelEx,cuLaunchKernelEx,cudaLaunchKernelExC: some of the performance-hint launch options are no-ops.__vavgs2,__vavgs4- All the
atomic*_block()andatomic*_system()variants.
Compiler#
- Improved parsing of nvcc arguments:
- Allow undocumented option variants (
-foo bar,--foo bar,--foo=bar, and-foo=barare always allowed, it seems). - Implement "interesting" quoting/escaping rules in nvcc arguments, such as
embedded quotes and
\,. We now correctly handle cursed arguments like:'-Xcompiler=-Wl\,-O1' '-Xcompiler=-Wl\,-rpath\,/usr/lib,-Wl\,-rpath-link\,/usr/lib'
- Allow undocumented option variants (
-
Support for more nvcc arguments:
- NVCC-style diagnostic flags:
-Werror,-disable-warnings, etc. --run,--run-args-Xlinker,-linker-options-no-exceptions,-noeh-minimal: no-op. Exact semantics are undocumented, and build times are reasonably fast anyway.-gen-opt-lto,-dlink-time-opt,-dlto. No-ops: device LTO not yet supported.-t,--threads,-split-compile: No-ops: they're flags for making compilation faster and are specific to how nvcc is implemented.-device-int128: no-op: we always enable int128.-extra-device-vectorization: no-op: vectorisation optimisations are controlled by the usual-O*flags.-entries,-source-in-ptx,-src-in-ptx: no-ops: there is no PTX.-use-local-env,-idp,-ddp,-dp, etc.: ignored since they are meaningless except on Windows.
- NVCC-style diagnostic flags:
-
Allow variadic device functions in non-evaluated functions.
- Don't warn about implicit conversion from
cudaLaneMask_ttobool. __builtin_provableno longer causes compiler crashes in-O0/-O1builds.- Fixed a bug causing PTX
asmblocks inside non-template, non-dependent member functions of template classes to sometimes not be compiled, causing PTX to end up in the AMD binary unmodified. - CUDA launch tokens with spaces (ie.:
myKernel<< <1, 1>> >()) are now supported. - Building non-cuda C translation units with SCALE-nvcc now works.
Other#
- The
mesonbuild system no longer regards SCALE-nvcc as a "broken" compiler. hsakmtsysinfono longer explodes if it doesn't like your GPU.- New documentation pages.
- Published more details about thirdparty testing, including the build scripts.
Release 1.0.0.0 (2024-07-15)#
Initial release