There are a few things to note up-front:
- Refer to the half-precision intrinsics.
- Note that most or all of these intrinsics are only supported in device code. (However, @njuffa has created a set of host-usable conversion functions here)
- Note that devices of compute capability 5.2 and below do not natively support half-precision arithmetic. This means that any arithmetic operations to be performed must be done on some supported type, such as float. Devices of compute capability 5.3 (Tegra TX1, currently), and presumably future devices, will support "native" half-precision arithmetic operations, but these are currently exposed through such intrinsics as __hmul. An intrinsic like __hmul will be undefined on devices that do not support native operations.
- You should include cuda_fp16.h in any file where you intend to make use of these types and intrinsics in device code.
With the above points in mind, here is a simple example that takes a set of float quantities, converts them to half quantities, and scales them by a scale factor:
$ cat t924.cu
#include <stdio.h>
#include <stdlib.h>
#include <cuda_fp16.h>
#define DSIZE 4
#define SCF 0.5f
#define nTPB 256
// Convert each float input to half, scale it by SCF, and write the result back as float
__global__ void half_scale_kernel(float *din, float *dout, int dsize){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  if (idx < dsize){
    half scf = __float2half(SCF);
    half kin = __float2half(din[idx]);
    half kout;
#if __CUDA_ARCH__ >= 530
    // cc5.3 and higher: native half-precision multiply
    kout = __hmul(kin, scf);
#else
    // cc5.2 and lower: convert to float, multiply, convert back to half
    kout = __float2half(__half2float(kin)*__half2float(scf));
#endif
    dout[idx] = __half2float(kout);
  }
}
int main(){
  float *hin, *hout, *din, *dout;
  hin = (float *)malloc(DSIZE*sizeof(float));
  hout = (float *)malloc(DSIZE*sizeof(float));
  for (int i = 0; i < DSIZE; i++) hin[i] = i;
  cudaMalloc(&din, DSIZE*sizeof(float));
  cudaMalloc(&dout, DSIZE*sizeof(float));
  cudaMemcpy(din, hin, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
  half_scale_kernel<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(din, dout, DSIZE);
  cudaMemcpy(hout, dout, DSIZE*sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < DSIZE; i++) printf("%f\n", hout[i]);
  return 0;
}
$ nvcc -o t924 t924.cu
$ cuda-memcheck ./t924
========= CUDA-MEMCHECK
0.000000
0.500000
1.000000
1.500000
========= ERROR SUMMARY: 0 errors
$
If you study the above code, you'll note that, except in the case of cc5.3 and higher devices, the arithmetic is being done as a regular float operation. This is consistent with the third note above.
The takeaways are as follows:
- On devices of cc5.2 and below, the half datatype may still be useful, but principally as a storage optimization (and, relatedly, perhaps a memory bandwidth optimization, since e.g. a given 128-bit vector load could load 8 half quantities at once). For example, if you have a large neural network, and you've determined that the weights can tolerate being stored as half-precision quantities (thereby doubling the storage density, or approximately doubling the size of the neural network that can be represented in the storage space of a GPU), then you could store the neural network weights as half-precision. Then, when you need to perform a forward pass (inference) or a backward pass (training), you could load the weights in from memory, convert them on-the-fly (using the intrinsics) to float quantities, perform the necessary operation (perhaps including adjusting the weight due to training), then (if necessary) store the weight again as a half quantity.
- For cc5.3 and future devices, if the algorithm will tolerate it, it may be possible to perform a similar operation as above, but without conversion to float (and perhaps back to half), instead leaving all data in half representation and doing the necessary arithmetic directly (using e.g. the __hmul or __hadd intrinsics).
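The storage-only pattern from the first takeaway can be sketched as follows. This is a minimal illustration, not a tuned implementation: the names to_half, to_float, update_weights and run_update are hypothetical, and the "training step" is just a plain subtraction of a scaled gradient. The weights sit in device memory as half, while the arithmetic is done in float, so it should behave the same on devices with or without native half arithmetic.

```cuda
#include <cstdio>
#include <cuda_fp16.h>

// All function names in this sketch are hypothetical, chosen for illustration.

// Convert float -> half on the device, so the host never interprets half values.
__global__ void to_half(const float *in, half *out, int n){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = __float2half(in[idx]);
}

// Convert half -> float on the device.
__global__ void to_float(const half *in, float *out, int n){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = __half2float(in[idx]);
}

// Weights live in memory as half; the update itself is ordinary float math.
__global__ void update_weights(half *w, const float *grad, float lr, int n){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n){
    float wf = __half2float(w[idx]);  // load and widen to float
    wf -= lr * grad[idx];             // arithmetic in float
    w[idx] = __float2half(wf);        // narrow and store as half
  }
}

// Host helper: applies one update step to w (in place), using half storage on the device.
void run_update(float *w, const float *grad, float lr, int n){
  const int nTPB = 256;
  const int nBlocks = (n + nTPB - 1) / nTPB;
  float *d_f, *d_grad;
  half *d_w;
  cudaMalloc(&d_f, n * sizeof(float));
  cudaMalloc(&d_grad, n * sizeof(float));
  cudaMalloc(&d_w, n * sizeof(half));
  cudaMemcpy(d_f, w, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_grad, grad, n * sizeof(float), cudaMemcpyHostToDevice);
  to_half<<<nBlocks, nTPB>>>(d_f, d_w, n);               // store weights as half
  update_weights<<<nBlocks, nTPB>>>(d_w, d_grad, lr, n); // float math, half storage
  to_float<<<nBlocks, nTPB>>>(d_w, d_f, n);              // widen again for the host
  cudaMemcpy(w, d_f, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_f); cudaFree(d_grad); cudaFree(d_w);
}

int main(){
  float w[4]    = {1.0f, 2.0f, 3.0f, 4.0f};
  float grad[4] = {0.5f, 0.5f, 0.5f, 0.5f};
  run_update(w, grad, 0.5f, 4);
  for (int i = 0; i < 4; i++) printf("%f\n", w[i]);
  return 0;
}
```

The values here were picked so that every intermediate result is exactly representable in half precision; with real weights, each store back to half rounds the value, which is the tolerance question raised above.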
Although I haven't demonstrated it here, the half datatype is "usable" in host code. By that, I mean you can allocate storage for items of that type, and perform e.g. cudaMemcpy operations on it. But the host code doesn't know anything about the half data type (e.g. how to do arithmetic on it, or print it out, or do type conversions) and the intrinsics are not usable in host code. Therefore, you could certainly allocate storage for a large array of half data type if you wanted to (perhaps to store a set of neural network weights), but you could only directly manipulate that data with any ease from device code, not host code.
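As a rough sketch of what "usable in host code" means here: the host can allocate a half buffer and move it with cudaMemcpy, while the conversion itself is left to a kernel. The helper name float_to_half_bits is hypothetical, and the host side only looks at the raw 16-bit patterns (assuming sizeof(half) is 2 bytes) rather than interpreting them as numbers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_fp16.h>

// Device-side conversion; the host never calls a half intrinsic in this sketch.
__global__ void to_half(const float *in, half *out, int n){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = __float2half(in[idx]);
}

// Hypothetical helper: converts n floats to half on the device and returns
// the raw 16-bit patterns in bits[], treating half purely as opaque storage.
void float_to_half_bits(const float *in, unsigned short *bits, int n){
  float *d_in;
  half *d_out;
  half *h_out = (half *)malloc(n * sizeof(half));  // host allocation of half storage
  cudaMalloc(&d_in, n * sizeof(float));
  cudaMalloc(&d_out, n * sizeof(half));
  cudaMemcpy(d_in, in, n * sizeof(float), cudaMemcpyHostToDevice);
  to_half<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
  // Copying half data between device and host works like any other type.
  cudaMemcpy(h_out, d_out, n * sizeof(half), cudaMemcpyDeviceToHost);
  // The host can inspect the bytes, but not do arithmetic on them directly.
  memcpy(bits, h_out, n * sizeof(unsigned short));
  free(h_out);
  cudaFree(d_in); cudaFree(d_out);
}

int main(){
  float in[4] = {0.0f, 0.5f, 1.0f, 1.5f};
  unsigned short bits[4];
  float_to_half_bits(in, bits, 4);
  for (int i = 0; i < 4; i++) printf("%f -> 0x%04x\n", in[i], (unsigned)bits[i]);
  return 0;
}
```

To print or otherwise use the values on the host, they would still need to be converted back to float, either in a kernel or with host-side conversion routines like the ones mentioned in the notes at the top.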
A few more comments:
The CUBLAS library implements a matrix-matrix multiply designed to work directly on half data. The description above should give some insight as to what is likely going on "under the hood" for different device types (i.e. compute capabilities).
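For illustration, here is a hedged sketch of calling one cuBLAS routine that accepts half-precision matrices, cublasSgemmEx with the CUDA_R_16F data type (assuming CUDA 8 or later). Whether this is the routine referred to above is an assumption; cublasHgemm is the other candidate, and it takes half alpha/beta values instead. The helper name half_gemm and the conversion kernels are hypothetical. Build with something like nvcc -o gemm gemm.cu -lcublas.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_fp16.h>

__global__ void to_half(const float *in, half *out, int n){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = __float2half(in[idx]);
}

__global__ void to_float(const half *in, float *out, int n){
  int idx = threadIdx.x + blockDim.x * blockIdx.x;
  if (idx < n) out[idx] = __half2float(in[idx]);
}

// Hypothetical helper: C = alpha * A * B for n x n column-major matrices.
// A, B and C are stored as half on the device; alpha and beta stay float.
// Returns 0 on success, 1 if a cuBLAS call failed.
int half_gemm(const float *A, const float *B, float *C, float alpha, int n){
  const int elems = n * n;
  const int nTPB = 256, nBlocks = (elems + nTPB - 1) / nTPB;
  float *d_tmp;
  half *d_A, *d_B, *d_C;
  cudaMalloc(&d_tmp, elems * sizeof(float));
  cudaMalloc(&d_A, elems * sizeof(half));
  cudaMalloc(&d_B, elems * sizeof(half));
  cudaMalloc(&d_C, elems * sizeof(half));
  // Stage each input through a float buffer and convert to half on the device.
  cudaMemcpy(d_tmp, A, elems * sizeof(float), cudaMemcpyHostToDevice);
  to_half<<<nBlocks, nTPB>>>(d_tmp, d_A, elems);
  cudaMemcpy(d_tmp, B, elems * sizeof(float), cudaMemcpyHostToDevice);
  to_half<<<nBlocks, nTPB>>>(d_tmp, d_B, elems);

  cublasHandle_t handle;
  cublasStatus_t st = cublasCreate(&handle);
  if (st == CUBLAS_STATUS_SUCCESS){
    const float beta = 0.0f;  // C is overwritten, so it need not be initialized
    st = cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                       &alpha, d_A, CUDA_R_16F, n,
                               d_B, CUDA_R_16F, n,
                       &beta,  d_C, CUDA_R_16F, n);
    cublasDestroy(handle);
  }
  // Convert the half result back to float for the host.
  to_float<<<nBlocks, nTPB>>>(d_C, d_tmp, elems);
  cudaMemcpy(C, d_tmp, elems * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_tmp); cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  return (st == CUBLAS_STATUS_SUCCESS) ? 0 : 1;
}

int main(){
  // Column-major 2x2: A = [1 2; 3 4], B = identity, so C = 0.5 * A.
  float A[4] = {1.0f, 3.0f, 2.0f, 4.0f};
  float B[4] = {1.0f, 0.0f, 0.0f, 1.0f};
  float C[4];
  if (half_gemm(A, B, C, 0.5f, 2) != 0){ printf("cuBLAS call failed\n"); return 1; }
  for (int i = 0; i < 4; i++) printf("%f\n", C[i]);
  return 0;
}
```

This variant mirrors the storage-only pattern: the matrices occupy half the memory, while the scaling factors are ordinary float values supplied from the host.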
A related question about use of half in thrust is here.