What's new?#
Release 1.0.2.0 (2024-09-05)#
Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.
Library Enhancements#
- Kernel arguments larger than 4kb no longer crash the library.
- Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
- Fixed synchronisation bugs when using synchronous
cuMemset*
APIs. - Fixed implicit synchronisation behaviour of
cuMemcpy2D/cuMemcpy2DAsync()
. - Fixed precision issues in fp16
exp2()
,rsqrt()
, andh2log()
. cudaEventRecord
for an empty event no longer returns a time in the past.- Fixed occupancy API behaviour in edgecases that are not multiples of warp size.
- Fixed rare crashes during static de-initialisation when library wrappers were in use.
- All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
- Various small header fixes.
Compiler Enhancements#
decltype()
now works correctly for__host__ __device__
functions.-Winvalid-constexpr
no longer defaults to-Werror
, for consistency with nvcc.- PTX variable names including
%
are no longer rejected. - Support for nvcc's nonstandard permissiveness surrounding missing
typename
keywords in dependent types. - Support for nvcc's wacky "split declaration" syntax for
__host__ __device
functions (with a warning):int foo(); __device__ int foo(); __host__ int foo() { return 5; } // foo() is a __host__ __device__ function. :D
- Newly-supported compiler flags (all of which are aliases for
standard flags, or combinations thereof):
-device-c
-device-w
-pre-include
-library
-output-file
-define-macro
-undefine-macro
New CUDA APIs#
Math APIs#
exp10(__half)
exp2(__half)
rcp(__half)
rint(__half)
h2exp10(__half2)
h2exp2(__half2)
h2rcp(__half2)
h2rint(__half2)
Release 1.0.1.0 (2024-07-24)#
This release primarily fixes issues that prevent people from successfully compiling their projects with SCALE. Many thanks to those users who submitted bug reports.
CUDA APIs#
- The
extra
argument tocuLaunchKernel
is now supported. - Added support for some more undocumented NVIDIA headers.
- Fix various overload resolution issues with atomic APIs.
- Fix overload resolution issues with min/max.
- Added various undocumented macros to support projects that are explicitly checking cuda include guard macros.
lrint()
andllrint()
no longer crash the compiler. :D- Newly supported CUDA APIs:
nvrtcGetNumSupportedArchs
nvrtcGetSupportedArchs
cudaLaunchKernelEx
,cuLaunchKernelEx
,cudaLaunchKernelExC
: some of the performance-hint launch options are no-ops.__vavgs2
,__vavgs4
- All the
atomic*_block()
andatomic*_system()
variants.
Compiler#
- Improved parsing of nvcc arguments:
- Allow undocumented option variants (
-foo bar
,--foo bar
,--foo=bar
, and-foo=bar
are always allowed, it seems). - Implement "interesting" quoting/escaping rules in nvcc arguments, such as
embedded quotes and
\,
. We now correctly handle cursed arguments like:'-Xcompiler=-Wl\,-O1' '-Xcompiler=-Wl\,-rpath\,/usr/lib,-Wl\,-rpath-link\,/usr/lib'
- Allow undocumented option variants (
-
Support for more nvcc arguments:
- NVCC-style diagnostic flags:
-Werror
,-disable-warnings
, etc. --run
,--run-args
-Xlinker
,-linker-options
-no-exceptions
,-noeh
-minimal
: no-op. Exact semantics are undocumented, and build times are reasonably fast anyway.-gen-opt-lto
,-dlink-time-opt
,-dlto
. No-ops: device LTO not yet supported.-t
,--threads
,-split-compile
: No-ops: they're flags for making compilation faster and are specific to how nvcc is implemented.-device-int128
: no-op: we always enable int128.-extra-device-vectorization
: no-op: vectorisation optimisations are controlled by the usual-O*
flags.-entries
,-source-in-ptx
,-src-in-ptx
: no-ops: there is no PTX.-use-local-env
,-idp
,-ddp
,-dp
, etc.: ignored since they are meaningless except on Windows.
- NVCC-style diagnostic flags:
-
Allow variadic device functions in non-evaluated functions.
- Don't warn about implicit conversion from
cudaLaneMask_t
tobool
. __builtin_provable
no longer causes compiler crashes in-O0
/-O1
builds.- Fixed a bug causing PTX
asm
blocks inside non-template, non-dependent member functions of template classes to sometimes not be compiled, causing PTX to end up in the AMD binary unmodified. - CUDA launch tokens with spaces (ie.:
myKernel<< <1, 1>> >()
) are now supported. - Building non-cuda C translation units with SCALE-nvcc now works.
Other#
- The
meson
build system no longer regards SCALE-nvcc as a "broken" compiler. hsakmtsysinfo
no longer explodes if it doesn't like your GPU.- New documentation pages.
- Published more details about thirdparty testing, including the build scripts.
Release 1.0.0.0 (2024-07-15)#
Initial release