I am coding an algorithm in CUDA whose most time-consuming part is calculating the differences between a target image and a list of comparison images, where each comparison image is compared a number of times, each time with a different offset/shift. The target and comparison images all have the same width and height. The offsets I calculate are sequential and go from 0 to x: the comparison image is shifted by 0 pixels and diffed against the target image, then shifted by 1 pixel to the right and diffed again, and so on until x shifts. I want to save the difference for each x,y coordinate at each offset (i.e. results size = `w*h*offsets`). This part of the algorithm takes the majority of the time, so any speed improvement I can make will have a big impact on the overall execution speed. My question is: are there any clear inefficiencies in my code, and is this the correct way to implement such an algorithm?
I am not necessarily asking whether this algorithm is 100% correct; I am mostly looking for ways to make these comparisons run faster.
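For clarity, here is the computation expressed as a plain serial loop. This is a simplified sketch of what I am computing (using the same array names and the 1-based offsets my kernel uses below), not my actual code:

```cpp
#include <cstdlib> // abs

// Serial reference (sketch): for every frame, every offset and every pixel,
// accumulate the summed per-channel absolute ARGB difference into results.
void costMatrixCPU(int* results, const int* leftImage, int* const* rightImages,
                   int numberOfFrames, int numberOfOffsets, int w, int h){
    for(int f=0; f<numberOfFrames; f++)
        for(int offset=1; offset<=numberOfOffsets; offset++)
            for(int xy=0; xy<w*h; xy++){
                int l = leftImage[xy];
                int r = rightImages[f][xy+offset]; // shifted comparison pixel
                int rd = abs((0xFF & (l >> 16)) - (0xFF & (r >> 16)));
                int gd = abs((0xFF & (l >> 8))  - (0xFF & (r >> 8)));
                int bd = abs((0xFF & l)         - (0xFF & r));
                results[xy*numberOfOffsets + (offset-1)] += rd + gd + bd;
            }
}
```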
My current algorithm works like this:
- I have a kernel launched as `kernel<<<dim3((w*h/1024+1), numberOfOffsets/numberOfOffsetsPerThread+1), 1024>>>`, i.e. a kernel running in an x,y grid where:
- x = the number of blocks required to process the whole image
- y = the number of blocks required to process all of the offsets being calculated
I have my arrays set up in host code like this:
```cpp
leftImage = new int[w*h];               // leftImage = target image
rightImages = new int*[numberOfFrames]; // rightImages = comparison images
                                        // numberOfFrames = number of comparison images
results = new int[w*h*numberOfOffsets]; // numberOfOffsets = the number of offsets
                                        // each comparison image is shifted
```
Each `rightImages[i]` is initialised with `rightImages[i] = new int[w*h+offsetCount];`.
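The device-side buffers used in the launch loop below are set up roughly like this (a sketch, not my exact code; `devRightImages` is a host-side array holding one device pointer per frame):

```cpp
int *devLeftImage, *devResults;
int** devRightImages = new int*[numberOfFrames]; // host array of device pointers

cudaMalloc(&devLeftImage, w*h*sizeof(int));
cudaMalloc(&devResults, w*h*numberOfOffsets*sizeof(int));
cudaMemcpy(devLeftImage, leftImage, w*h*sizeof(int), cudaMemcpyHostToDevice);
cudaMemset(devResults, 0, w*h*numberOfOffsets*sizeof(int)); // results accumulate, so zero first

for(int f=0; f<numberOfFrames; f++){
    cudaMalloc(&devRightImages[f], (w*h+offsetCount)*sizeof(int));
    cudaMemcpy(devRightImages[f], rightImages[f],
               (w*h+offsetCount)*sizeof(int), cudaMemcpyHostToDevice);
}
```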
I am then running my CUDA kernel in a for loop like this:
```cpp
int blockdim = 1024; // number of x,y coordinates each block of threads works on
// offsetsPerThread = number of offsets calculated per thread
// offsetGroups = number of blocks required to calculate comparisons for all offsets
// imageGroups = number of blocks required to calculate comparisons for all pixels in an image
dim3 griddim(imageGroups, offsetGroups);
for(int f=0; f<frames; f++){
    // groupSLen = offsetsPerThread, the number of offsets each thread calculates
    costMatrix<<<griddim, blockdim>>>(devResults, devLeftImage, devRightImages[f],
                                      groupSLen, w, h, offsetCount);
}
if (cudaDeviceSynchronize() != cudaSuccess)
    printf("%s\n", gpuErrorMessage());
```
My kernel then looks like this:
```cpp
__global__ void costMatrix(int* res, int* leftImage, int* rightImage,
                           int offsetsPerThread, int w, int h, int offsetCount){
    int offsetGroupIndex = blockIdx.y;              // which offsets this thread will calculate
    int xy = blockIdx.x * blockDim.x + threadIdx.x; // x,y coordinate on image
    int offsetStart = offsetGroupIndex * offsetsPerThread + 1; // first offset to calculate
    int offsetLimit = offsetStart + offsetsPerThread;          // last offset to calculate
    int resultStart = xy*offsetCount + (offsetStart-1); // the starting index in the results array
    for(int offset=offsetStart; offset<offsetLimit; offset++){
        if((offset < 0 && xy < offset) || ((offset > 0) && (xy >= w*h)) || (offset > offsetCount))
            continue;
        int lArgb = leftImage[xy]; // argb values
        int rArgb = rightImage[xy+offset];
        int rd = abs((0xFF & (lArgb >> 16)) - (0xFF & (rArgb >> 16)));
        int gd = abs((0xFF & (lArgb >> 8)) - (0xFF & (rArgb >> 8)));
        int bd = abs((0xFF & lArgb) - (0xFF & rArgb));
        atomicAdd(&res[resultStart++], rd + gd + bd);
    }
}
```
Is there a faster way to perform this calculation? I noticed that `atomicAdd(&res[resultStart++], rd + gd + bd);`, and even `res[resultStart++] += rd + gd + bd;`, is very slow in comparison to `res[resultStart++] = rd + gd + bd;`, but the memory requirements to do that are very high, since plain assignment would need a separate results slice per frame instead of accumulating everything into one array.
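For context, the plain-assignment variant I am referring to would look roughly like this sketch (the added `frameIndex` parameter and the enlarged `res` buffer of `w*h*offsetCount*frames` ints are hypothetical changes to make each per-frame write unique, which is exactly where the memory cost comes from):

```cpp
// Sketch: plain-assignment variant. Each frame writes to its own slice of an
// enlarged results buffer (w*h*offsetCount*frames ints), so no atomics or
// read-modify-write traffic are needed, at the cost of frames-times the memory.
__global__ void costMatrixAssign(int* res, int* leftImage, int* rightImage,
                                 int offsetsPerThread, int w, int h,
                                 int offsetCount, int frameIndex){
    int xy = blockIdx.x * blockDim.x + threadIdx.x;
    if(xy >= w*h)
        return;
    int offsetStart = blockIdx.y * offsetsPerThread + 1;
    int offsetLimit = offsetStart + offsetsPerThread;
    // every (frame, pixel, offset) triple now has its own unique slot
    int resultStart = frameIndex*w*h*offsetCount + xy*offsetCount + (offsetStart-1);
    int lArgb = leftImage[xy];
    for(int offset=offsetStart; offset<offsetLimit; offset++){
        if(offset > offsetCount)
            break;
        int rArgb = rightImage[xy+offset];
        int rd = abs((0xFF & (lArgb >> 16)) - (0xFF & (rArgb >> 16)));
        int gd = abs((0xFF & (lArgb >> 8)) - (0xFF & (rArgb >> 8)));
        int bd = abs((0xFF & lArgb) - (0xFF & rArgb));
        res[resultStart++] = rd + gd + bd; // plain store, no atomicAdd
    }
}
```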