Differences from NVIDIA CUDA

There are some areas where SCALE's implementation of a feature behaves differently from NVIDIA CUDA's implementation of the same feature. This document does not enumerate missing CUDA APIs/features.

Defects

NVRTC differences

SCALE's current implementation of the NVRTC API invokes the compiler as a subprocess rather than as a library. This differs from how NVIDIA's implementation works, and means the library must be able to locate the compiler in order to invoke it.

If your program uses the NVRTC APIs and fails with errors relating to being unable to locate the compiler, ensure that SCALE's nvcc is first in PATH.
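As a minimal sketch of where this can surface, the following uses the standard NVRTC entry points to compile a trivial kernel at runtime; under SCALE the compile step spawns the compiler as a subprocess, so a locate-the-compiler failure would show up here:

```cuda
// Minimal NVRTC sketch: compile a trivial kernel at runtime. Under SCALE,
// nvrtcCompileProgram() invokes the compiler as a subprocess, so SCALE's
// nvcc must be locatable (first in PATH).
#include <nvrtc.h>
#include <cstdio>
#include <string>

int main() {
    const char *src = "__global__ void k() {}";
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "k.cu", 0, nullptr, nullptr);
    if (nvrtcCompileProgram(prog, 0, nullptr) != NVRTC_SUCCESS) {
        // A failure here that mentions a missing compiler usually means
        // SCALE's nvcc is not first in PATH.
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::string log(logSize, '\0');
        nvrtcGetProgramLog(prog, &log[0]);
        std::printf("%s\n", log.c_str());
    }
    nvrtcDestroyProgram(&prog);
}
```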

Stream synchronization

SCALE does not yet support per-thread default stream behaviour.

Instead, the default stream is used in place of the per-thread default stream. This will not break programs, but is likely to reduce performance.

A workaround, which will also slightly improve the performance of your program when run on NVIDIA GPUs, is to use explicit nonblocking CUDA streams rather than relying on the implicit default stream.
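A sketch of that workaround, using the standard runtime API to create a nonblocking stream and launch work on it:

```cuda
// Workaround sketch: launch on an explicit non-blocking stream instead of
// relying on the implicit default stream.
#include <cuda_runtime.h>

__global__ void kernel() {}

int main() {
    cudaStream_t stream;
    // cudaStreamNonBlocking: this stream does not synchronize with the
    // legacy default stream.
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    kernel<<<1, 1, 0, stream>>>();   // enqueue on the explicit stream

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```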

Host-side __half support

The CUDA API allows many __half math functions to be used on both host and device.

When compiling non-CUDA translation units, you can include <cuda_fp16.h> and use the __half math APIs in host code. When you do this, NVIDIA's CUDA implementation converts the __half to 32-bit float, does the calculation, and converts back.

SCALE only allows these functions to be used on the host when the host compiler supports compiling fp16 code directly (via the _Float16 type). Current versions of gcc and clang both support this.

In practice, this difference only applies to non-CUDA translation units built with an old host compiler that lacks _Float16 support.

This means:

  • All __half APIs work in both host and device code in .cu files.
  • __half APIs that perform floating point math will not compile in host code in non-CUDA translation units if an old host compiler is used.
  • The outcome of __half calculations on host/device will always be the same.
  • APIs for using __half as a storage type are always supported.

SCALE bundles a modern host compiler at <SCALE_DIR>/targets/gfxXXX/bin/clang++, which you can use as a workaround if this edge case becomes a problem.
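A small illustration of the distinction between storage/conversion APIs and fp16 math APIs (assuming a CUDA version whose cuda_fp16.h exposes __hadd on the host):

```cuda
// Host-side __half sketch. In a .cu file this always compiles. In a plain
// C++ translation unit, SCALE requires a host compiler with _Float16
// support (recent gcc/clang), whereas NVIDIA emulates via 32-bit float.
#include <cuda_fp16.h>
#include <cstdio>

int main() {
    __half a = __float2half(1.5f);    // conversion/storage: always available
    __half b = __float2half(2.25f);

    // fp16 arithmetic: under SCALE, in a non-CUDA translation unit, this
    // needs a _Float16-capable host compiler.
    __half sum = __hadd(a, b);

    std::printf("%f\n", __half2float(sum));
}
```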

Enhancements

Contexts where CUDA APIs are forbidden

NVIDIA's implementation forbids CUDA APIs in various contexts, such as from host-side functions enqueued onto streams.

This implementation allows CUDA API calls in such cases.
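For example (a sketch using the standard cudaLaunchHostFunc API), a host function enqueued onto a stream can itself call CUDA APIs under SCALE, which NVIDIA's documentation forbids:

```cuda
// Sketch: a host function enqueued on a stream that itself calls a CUDA
// API. NVIDIA forbids CUDA calls from such callbacks; SCALE permits them.
#include <cuda_runtime.h>
#include <cstdio>

void hostFn(void *) {
    // Calling a CUDA API here is permitted under SCALE, but is forbidden
    // by NVIDIA's implementation.
    int device = -1;
    cudaGetDevice(&device);
    std::printf("callback ran; current device: %d\n", device);
}

int main() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaLaunchHostFunc(stream, hostFn, nullptr);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```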

Static initialization and deinitialization

This implementation permits the use of CUDA API functions during global static initialization and thread_local static initialization.

It is not permitted to use CUDA API functions during static deinitialization.

This is more permissive than what is allowed by NVIDIA's implementation.
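A sketch of what this permits: a global object whose constructor (run during static initialization) calls a CUDA API, while its destructor deliberately does not:

```cuda
// Sketch: CUDA API use during global static initialization, which SCALE
// permits (NVIDIA does not).
#include <cuda_runtime.h>

struct DeviceCounter {
    int count = 0;

    DeviceCounter() {
        // Allowed under SCALE: runs during static initialization.
        cudaGetDeviceCount(&count);
    }

    // Note: no CUDA calls in the destructor. Static *de*initialization is
    // not permitted to use CUDA APIs, even under SCALE.
    ~DeviceCounter() = default;
};

static DeviceCounter g_counter;   // constructed before main()

int main() {
    return g_counter.count > 0 ? 0 : 1;
}
```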

Device printf

SCALE's device printf accepts an unlimited number of arguments if you compile with at least C++11.

If you target an older version of C++, it is limited to 32 arguments, matching NVIDIA's implementation.
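For instance, this kernel passes 33 arguments to printf, one more than NVIDIA's limit; under SCALE it compiles when targeting C++11 or later:

```cuda
// Sketch: 33 printf arguments - one over NVIDIA's 32-argument limit.
// Compiles under SCALE when targeting C++11 or later.
#include <cstdio>

__global__ void manyArgs() {
    printf("%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d "
           "%d %d %d %d %d %d %d %d %d %d %d %d %d %d %d %d\n",
           1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
           18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33);
}

int main() {
    manyArgs<<<1, 1>>>();
    cudaDeviceSynchronize();
}
```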

Contexts

If cuCtxDestroy() is used to destroy the context that is current to a different CPU thread, and that CPU thread then issues an API call that depends on the context without first setting a different context to be current, the behaviour is undefined.

In NVIDIA's implementation, this condition returns CUDA_ERROR_CONTEXT_IS_DESTROYED.

Matching NVIDIA's behaviour would have incurred a small performance penalty on many operations to handle an edge case that is invalid usage anyway.
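The safe pattern is to establish a new current context before issuing any context-dependent call after a destroy. A single-thread simplification of the hazard (the documented case involves a second CPU thread, but the remedy is the same):

```cuda
// Sketch of the safe pattern after cuCtxDestroy(): set a fresh context
// current before any context-dependent driver API call.
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);   // ctx becomes current to this thread
    cuCtxDestroy(ctx);

    // At this point the thread's current context is destroyed. Issuing a
    // context-dependent call now is undefined under SCALE (NVIDIA would
    // return CUDA_ERROR_CONTEXT_IS_DESTROYED). Re-establish one first:
    CUcontext fresh;
    cuCtxCreate(&fresh, 0, dev);
    cuCtxSetCurrent(fresh);
    // ... context-dependent calls are safe again ...
    cuCtxDestroy(fresh);
}
```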

Kernel argument size

SCALE accepts kernel arguments up to 2 GB, whereas NVIDIA CUDA allows only 32 KB (and 4 KB before CUDA 12.1).

This is more an implementation quirk than a feature, since huge kernel arguments are unlikely to perform well compared with achieving the same effect through async copies, memory mapping, etc.
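A sketch of the recommended alternative: stage bulk data into device memory with a copy and pass only a pointer, rather than a multi-megabyte by-value kernel argument:

```cuda
// Sketch: pass a device pointer instead of a huge by-value argument.
#include <cuda_runtime.h>
#include <vector>

__global__ void process(const float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        // ... use data[i] ...
    }
}

int main() {
    const size_t n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // Stage the bulk data with a copy; the kernel argument is just a
    // pointer and a size.
    cudaMemcpy(d_data, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);

    process<<<(unsigned)((n + 255) / 256), 256>>>(d_data, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
}
```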