Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
277 views
in Technique[技术] by (71.8m points)

parallel processing - Fastest way to find the differences between two images with different offsets

I am coding an algorithm in CUDA, the most time consuming part of which is calculating the differences between a target image to a list of comparison images, each comparison image compared a number of times, each time with a different offset/shift. The target and comparison images are the same width and height. The offsets calculate in my case are sequential and go from 0 to x, e.g. the comparison image is shifted by 0 pixels, then I calculate diffs with target image, then comparison image is shifted by 1 to the right and compared to target image, etc until x shifts. I am interested in saving the differences for each x,y coordinate, for each offset (e.g. results size = whoffsets). This part of the algorithm takes the majority of the time and any speed improvements I can make will have a big impact on the overall execution speed. My question is, are there any clear ineffiencies in my code / is this the correct way to implement such an algorithm?

I am not necessarily looking if this algorithm is 100% correct, etc, I am mostly looking for ways to improve the speed at which this code runs to perform these comparisons.

My current algorithm works like this:

 - I have a kernel<<<dim3((w*h/1024+1), numberOfOffsets/numberOfOffsetsPerThread+1), 1024>>>:
      aka a kernel running in an x,y grid where:
          - x = the number of blocks required to process the whole image
          - y = the number of blocks required to process all of the offsets being calculated 

I have my arrays in host device like this:

    leftImage = new int[w*h]; // leftImage = target image
    rightImages = new int*[numberOfFrames]; // righImages = comparison images
    // numberOfFrames = number of comparison images
    results = new int[w*h*numberOfOffsets];
    // numberOfOffsets = the number of offsets each comparison image is shifted

Where rightImages is initialised with rightImages[i] = new int[w*h+offsetCount];.

I am then running my CUDA kernel in a for loop like this:

int blockdim = 1024; // number of xy coordinates each block of threads can work on
// offsetsPerThread = Number of offsets calculated per thread
// offsetGroups = Number of blocks required to calculate comparisons for all offsets
// imageGroups = Number of blocks required to calculate comparisons for all pixels in an image
dim3 griddim(imageGroups, offsetGroups);

for(int f=0; f<frames; f++){
    // groupSLen = 
    costMatrix<<<griddim, blockdim>>>(devResults, devLeftImage, devRightImages[f], 
        groupSLen, w, h, offsetCount);
}
if (cudaDeviceSynchronize() != cudaSuccess)
    printf("%s
", gpuErrorMessage());

My kernel then looks like this:

__global__ void costMatrix(int* res, int* leftImage, int* rightImage, int offsetsPerThread, int w, int h, int offsetCount){
    int offsetGroupIndex = blockIdx.y; // used to find out which offsets this thread will calculate
    int xy = blockIdx.x * blockDim.x + threadIdx.x; //  x,y coordinate on image

    int offsetStart = offsetGroupIndex * offsetsPerThread + 1; // first offset to calculate
    int offsetLimit = offsetStart + offsetsPerThread; // last offset to calculate

    int resultStart = xy*offsetCount + (offsetStart-1); // the starting index in the results array

    for(int offset=offsetStart; offset<offsetLimit; offset++){
        if((offset < 0 && xy < offset) || ((offset > 0) && (xy >= w*h)) || (offset > offsetCount))
            continue;

        int lArgb = leftImage[xy]; // argb values
        int rArgb = rightImage[xy+offset];
        
        int rd = abs((0xFF & (lArgb >> 16)) - (0xFF & (rArgb >> 16)));
        int gd = abs((0xFF & (lArgb >> 8)) - (0xFF & (rArgb >> 8)));
        int bd = abs((0xFF & lArgb) - (0xFF & rArgb));
        atomicAdd(&res[resultStart++], rd + gd + bd);
    }
}

Is there a faster way to perform this calculation? I noticed atomicAdd(&res[resultStart++], rd + gd + bd); and even res[resultStart++] += rd + gd + bd; is very slow in comparison to res[resultStart++] = rd + gd + bd;, but the memory requirements to do that are very high.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)
等待大神答复

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...