Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

multiprocessing - How to stop/cancel a CUDA kernel launched by Numba before it ends?

I have a simulation written in Python/Numba that uses several CUDA GPUs. Each GPU is driven from a different process with its own CUDA context. Each kernel runs a very long loop and, at the end, reports its result to the parent process, which stores the best result so far; the process then keeps going.

When a GPU / process finishes its kernel and reports a new best result, I would like to kill the kernel executions on the other processes / GPUs so they can pick up this new best result and iterate on it, instead of waiting for them to finish. Each execution can take 30 minutes, so killing one that has just started and restarting it with better data would save me a lot of time.

I can't seem to find a way to stop a launched CUDA kernel.

Can this be done?

I'm using Numba 0.51.

question from:https://stackoverflow.com/questions/65929101/how-to-transfer-a-single-float32-number-from-cpu-to-gpu-and-back-fast-in-python


1 Answer


It's not possible to stop a running kernel in CUDA without:

  1. assistance from the kernel code itself (or)
  2. corrupting the CUDA context, making any subsequent CUDA operations fail

Item 2 is not satisfactory. Therefore, stopping a running kernel "asynchronously" requires the kernel code (all threads) to "poll" a memory location that the host can set to signal a stop.

A typical way to provide such a memory location is to use pinned/zero-copy techniques in CUDA. In Numba, this type of memory is allocated as a mapped array (cuda.mapped_array). Such memory is accessible from both host and device at the same time. An additional wrinkle is that the device code must not cache any copies of the memory locations used for communication; otherwise a thread could keep reading a stale value and never see the stop signal. The only method I found in Numba to accomplish this is to use atomics.

Here is a worked example combining these ideas:

$ cat t51.py
import numpy as np
import numba as nb

from numba import cuda

@cuda.jit
def test(arr):
    while nb.cuda.atomic.max(arr, 0, 0) < 1: #poll for signal to stop
        nb.cuda.atomic.add(arr, 1, 1)        #do "other work"
    arr[2] = 1                               #acknowledge stop signal

if __name__ == '__main__':

    arr = nb.cuda.mapped_array(3, dtype=np.int32)
    arr[0] = 0   # stop signal goes here
    arr[1] = 1   # monitoring "other work"
    arr[2] = 0   # acknowledgment of stop signal
    my_str = nb.cuda.stream()
    griddim = (1,1)
    blockdim = (1,1,1)
    test[griddim, blockdim, my_str](arr)   # launch work to be done
    for i in range(1000):  # for demo, give kernel time to start
        if arr[1] < 2:
            print(arr[1])
    print(arr[0])
    while arr[2] != 1:     # send stop signal, and wait for acknowledgment
        arr[0] = 1
    print(arr[0])          # for demo
    nb.cuda.synchronize()  # if stop is working correctly code will not hang here
    print(arr[0])          # for demo
    print(arr[1])
$ python t51.py
0
1
1
1600
$
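
To connect the example above to the multi-GPU setup in the question, here is a rough host-side sketch of the coordination: the first worker to finish raises the stop flag of every other worker. It runs without a GPU because everything in it is a stand-in: the Worker class, the run_race function and the "gpu0"/"gpu1" names are hypothetical, threads replace the separate processes, and a threading.Event plays the role of arr[0] in each worker's mapped array.

```python
import threading


class Worker:
    """Hypothetical stand-in for one GPU process and its polling kernel."""

    def __init__(self, name, iterations):
        self.name = name
        self.iterations = iterations    # length of the full "kernel" loop
        self.stop = threading.Event()   # plays the role of arr[0]
        self.work_done = 0              # plays the role of arr[1]
        self.completed = False          # True only if the loop ran to the end

    def run(self, on_complete):
        for _ in range(self.iterations):
            if self.stop.is_set():      # poll for the stop signal
                return                  # abandon this run early
            self.work_done += 1         # do "other work"
        self.completed = True
        on_complete(self)               # report the new best result


def run_race(sizes):
    """Start one worker per entry in sizes; the first to complete stops the rest."""
    workers = [Worker(f"gpu{i}", n) for i, n in enumerate(sizes)]

    def on_complete(winner):
        for w in workers:
            if w is not winner:
                # Assumption: real code would write 1 into that worker's mapped flag.
                w.stop.set()

    threads = [threading.Thread(target=w.run, args=(on_complete,)) for w in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return workers


if __name__ == "__main__":
    for w in run_race([1_000, 10**9]):
        state = "completed" if w.completed else "stopped early"
        print(w.name, state, w.work_done)
```

In a real version, each stopped worker would presumably reset its flag, load the new best result and relaunch its kernel; that part is left out here.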
