Release 1.0.2 (2024-09-05)
Documented a record of the CUDA APIs already available in SCALE, and those still to come: Implemented APIs.
Library Enhancements
- Kernel arguments larger than 4 KB no longer crash the library.
- Programs that ignore CUDA error codes can no longer get stuck in a state where the library always returns the error code you ignored.
- Fixed synchronisation bugs when using synchronous `cuMemset*` APIs.
- Fixed implicit synchronisation behaviour of `cuMemcpy2D`/`cuMemcpy2DAsync()`.
- Fixed precision issues in fp16 `exp2()`, `rsqrt()`, and `h2log()`.
- `cudaEventRecord` for an empty event no longer returns a time in the past.
- Fixed occupancy API behaviour in edge cases that are not multiples of warp size.
- Fixed rare crashes during static de-initialisation when library wrappers were in use.
- All flags supported by SCALE's nvcc are now also accepted by our nvrtc implementation.
- Various small header fixes.
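The occupancy fix listed above concerns block sizes that are not a multiple of the warp size. As a rough illustration of why such sizes are an edge case, the sketch below rounds a block size up to whole warps and counts the unused lanes. The warp size of 32 and the helper names (`warpsPerBlock`, `idleLanes`, `checkWarpExamples`) are assumptions made for this example; this is not SCALE's occupancy calculation.

```cpp
#include <cassert>

// Assumption for illustration only: 32 threads per warp. Real code should
// query the device (for example via cudaDeviceProp::warpSize) instead.
constexpr int kAssumedWarpSize = 32;

// Number of warps needed to hold `blockSize` threads, rounded up.
constexpr int warpsPerBlock(int blockSize, int warpSize = kAssumedWarpSize) {
    return (blockSize + warpSize - 1) / warpSize;
}

// Lanes in the final warp that have no thread assigned to them.
constexpr int idleLanes(int blockSize, int warpSize = kAssumedWarpSize) {
    return warpsPerBlock(blockSize, warpSize) * warpSize - blockSize;
}

// Small self-check of the two cases discussed in the text.
inline void checkWarpExamples() {
    assert(warpsPerBlock(128) == 4);  // exact multiple: four full warps
    assert(idleLanes(128) == 0);
    assert(warpsPerBlock(100) == 4);  // not a multiple: still four warps
    assert(idleLanes(100) == 28);     // but 28 lanes of the last warp are idle
}
```

A block of 100 threads therefore consumes the same number of warps as a block of 128, which is the kind of rounding an occupancy query has to account for.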
Compiler Enhancements
- `decltype()` now works correctly for `__host__ __device__` functions.
- `-Winvalid-constexpr` no longer defaults to `-Werror`, for consistency with nvcc.
- PTX variable names including `%` are no longer rejected.
- Support for nvcc's nonstandard permissiveness surrounding missing `typename` keywords in dependent types.
- Support for nvcc's wacky "split declaration" syntax for `__host__ __device__` functions (with a warning):

        int foo();
        __device__ int foo();
        __host__ int foo() { return 5; }

        // foo() is a __host__ __device__ function. :D

- Newly-supported compiler flags (all of which are aliases for standard flags, or combinations thereof):
    - `-device-c`
    - `-device-w`
    - `-pre-include`
    - `-library`
    - `-output-file`
    - `-define-macro`
    - `-undefine-macro`
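Two of the compiler items above, the `typename` permissiveness and `decltype()` on `__host__ __device__` functions, can be illustrated with a short sketch. The release note does not say exactly which contexts were affected, so this only shows the general shape. `Boxed`, `unbox`, `twice` and `checkCompilerSketch` are hypothetical names, and the qualifier macros are defined away so the snippet also builds with a plain host C++ compiler.

```cpp
#include <cassert>
#include <type_traits>

// Assumption: outside of CUDA compilation, define the execution-space
// qualifiers to nothing so a host-only compiler accepts the sketch.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical wrapper type, used only for this illustration.
template <typename T>
struct Boxed {
    using value_type = T;
    T value;
};

// Portable spelling: `typename` marks the dependent name W::value_type as a type.
template <typename W>
__host__ __device__ typename W::value_type unbox(const W& w) {
    typename W::value_type result = w.value;
    // The permissive spelling omits the keyword, for example:
    //     W::value_type result = w.value;
    // Standard C++ rejects that form at this position.
    return result;
}

// decltype applied to a call of a __host__ __device__ function.
__host__ __device__ inline float twice(float x) { return 2.0f * x; }
using TwiceResult = decltype(twice(1.0f));
static_assert(std::is_same<TwiceResult, float>::value,
              "decltype should yield the declared return type");

// Small runtime self-check of both helpers.
inline void checkCompilerSketch() {
    assert(unbox(Boxed<int>{7}) == 7);
    assert(twice(1.5f) == 3.0f);
}
```

Writing `typename` explicitly keeps the code portable across compilers, regardless of how permissive a particular front end is.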
New CUDA APIs
Math APIs
- `exp10(__half)`
- `exp2(__half)`
- `rcp(__half)`
- `rint(__half)`
- `h2exp10(__half2)`
- `h2exp2(__half2)`
- `h2rcp(__half2)`
- `h2rint(__half2)`
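For sanity-checking device results, it can help to have single-precision host references for the operations named above. The sketch below assumes the usual meanings (10^x, 2^x, 1/x, and rounding to the nearest integer with ties to even). The `ref*` helpers and `closeTo` are hypothetical names for this example, not part of SCALE or CUDA, and fp16 results will only match them to within half-precision accuracy.

```cpp
#include <cassert>
#include <cmath>

// Host-side float references. These encode assumptions about the intended
// maths; they are not SCALE's fp16 implementations.
inline float refExp10(float x) { return std::pow(10.0f, x); }  // 10^x
inline float refExp2(float x)  { return std::exp2(x); }        // 2^x
inline float refRcp(float x)   { return 1.0f / x; }            // 1/x
// std::rint rounds ties to even under the default rounding mode.
inline float refRint(float x)  { return std::rint(x); }

// Relative comparison with a tolerance, suitable for low-precision results.
inline bool closeTo(float actual, float expected, float relTol = 1e-3f) {
    return std::fabs(actual - expected) <= relTol * std::fabs(expected);
}

// Small self-check of the reference helpers.
inline void checkMathReferences() {
    assert(closeTo(refExp10(2.0f), 100.0f));
    assert(closeTo(refExp2(3.0f), 8.0f));
    assert(refRcp(4.0f) == 0.25f);
    assert(refRint(2.5f) == 2.0f);  // tie rounds to the even neighbour
    assert(refRint(3.5f) == 4.0f);
}
```

A device test could copy results back to the host, convert them to `float`, and compare against these references with a tolerance chosen for half precision.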