It's not possible to stop a running kernel in CUDA without either:
- assistance from the kernel code itself, or
- corrupting the CUDA context, which makes all subsequent CUDA operations fail.
Since the second option is not acceptable, "asynchronously" stopping a running kernel requires the kernel code (all of its threads) to poll a memory location that signals when to stop.
A typical way to provide such a memory location is with pinned/zero-copy memory. In numba, this kind of memory is allocated as mapped memory, which is accessible from host and device at the same time. An additional wrinkle is that the device code must not cache its own copy of the communication location; the only method I found in numba to guarantee this is to use atomics.
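For instance, allocating such a buffer is a one-liner (a minimal sketch, assuming a CUDA-capable GPU is present; the name flag is just illustrative):

import numpy as np
from numba import cuda

flag = cuda.mapped_array(1, dtype=np.int32)  # pinned host memory, mapped into the device address space
flag[0] = 0  # an ordinary host write; a running kernel can observe it, provided it reads via atomics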
Here is a worked example combining these ideas:
$ cat t51.py
import numpy as np
import numba as nb
from numba import cuda

@cuda.jit
def test(arr):
    while nb.cuda.atomic.max(arr, 0, 0) < 1:  # poll for signal to stop
        nb.cuda.atomic.add(arr, 1, 1)         # do "other work"
    arr[2] = 1                                # acknowledge stop signal

if __name__ == '__main__':
    arr = nb.cuda.mapped_array(3, dtype=np.int32)
    arr[0] = 0  # stop signal goes here
    arr[1] = 1  # monitoring "other work"
    arr[2] = 0  # acknowledgment of stop signal
    my_str = nb.cuda.stream()
    griddim = (1,1)
    blockdim = (1,1,1)
    test[griddim, blockdim, my_str](arr)  # launch work to be done
    for i in range(1000):  # for demo, give kernel time to start
        if arr[1] < 2:
            print(arr[1])
    print(arr[0])
    while arr[2] != 1:  # send stop signal, and wait for acknowledgment
        arr[0] = 1
    print(arr[0])  # for demo
    nb.cuda.synchronize()  # if stop is working correctly, code will not hang here
    print(arr[0])  # for demo
    print(arr[1])
$ python t51.py
0
1
1
1600
$
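For demonstration purposes the kernel above runs a single thread; the final print shows it completed 1600 "other work" iterations before honoring the stop signal. The same pattern extends to a full grid, since every thread polls the same flag. Here is a minimal sketch of such a multi-threaded variant (the worker name and the choice of thread 0 for the acknowledgment are my own illustrative assumptions, not part of the example above); the host-side logic is unchanged:

import numba as nb
from numba import cuda

@cuda.jit
def worker(arr):
    i = cuda.grid(1)                          # global thread index
    while nb.cuda.atomic.max(arr, 0, 0) < 1:  # every thread polls the stop flag
        nb.cuda.atomic.add(arr, 1, 1)         # per-thread "other work"
    if i == 0:
        arr[2] = 1                            # one thread acknowledges the stop

# launched exactly as before, just with a larger configuration, e.g.:
# worker[(64,), (128,), my_str](arr)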