Windows 7, NVIDIA GeForce 425M.
I wrote a simple CUDA program that calculates the row sums of a matrix. The matrix is stored in a one-dimensional, row-major representation (a pointer to float). The serial version of the code is below (it has 2 loops, as expected):
void serial_rowSum(float* m, float* output, int nrow, int ncol) {
    float sum;
    for (int i = 0; i < nrow; i++) {
        sum = 0;
        for (int j = 0; j < ncol; j++)
            sum += m[i * ncol + j];
        output[i] = sum;
    }
}
In the CUDA version, I launch the kernel so that the matrix is swept by rows, one thread per row. Below is the kernel launch snippet:
dim3 threadsPerBlock((unsigned int) nThreadsPerBlock);  // has to be a multiple of 32
dim3 blocksPerGrid((unsigned int) ceil(nrow / (float) nThreadsPerBlock));
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
and the kernel function, which performs the parallel sum of the rows (it still has 1 loop):
__global__ void kernel_rowSum(float *m, float *s, int nrow, int ncol) {
    // one thread per row: each thread serially sums its own row
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float sum = 0;
        for (int k = 0; k < ncol; k++)
            sum += m[rowIdx * ncol + k];
        s[rowIdx] = sum;
    }
}
So far so good. The serial and parallel (CUDA) results are equal.
The whole point is that the CUDA version takes almost twice as long as the serial one, no matter how I change the nThreadsPerBlock parameter: I tested nThreadsPerBlock from 32 to 1024 (the maximum number of threads per block allowed for my card). IMO, the matrix dimension is big enough to justify parallelization: 90,000 x 1,000.
Below, I report the time elapsed for the serial and parallel versions using different values of nThreadsPerBlock. Times are in msec, averaged over 100 samples:
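For reference, this is roughly how a single CUDA sample can be timed with CUDA events (a minimal sketch, error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
kernel_rowSum<<<blocksPerGrid, threadsPerBlock>>>(d_m, d_output, nrow, ncol);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float msec = 0.0f;
cudaEventElapsedTime(&msec, start, stop);  // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);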
Matrix: nrow = 90,000, ncol = 1,000

Serial: average time elapsed per sample in msec (100 samples): 289.18
CUDA (32 ThreadsPerBlock): average time elapsed per sample in msec (100 samples): 497.11
CUDA (1024 ThreadsPerBlock): average time elapsed per sample in msec (100 samples): 699.66
Just in case: the version with 32 nThreadsPerBlock is the fastest one, and the version with 1024 is the slowest.
I understand that there is some overhead in copying from host to device and back, but maybe the slowness is because I am not writing the fastest possible code.
Since I am far from being a CUDA expert:
Am I coding the fastest version for this task? How could I improve my code?
Can I get rid of the loop in the kernel function?
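To make the second question concrete: one direction I can imagine (an untested sketch, with a hypothetical BLOCK_SIZE constant) is a block-per-row reduction. It would also make the global loads coalesced, whereas in my kernel the threads of a warp read addresses that are ncol floats apart:

// Untested sketch: one block per row, with a grid stride over rows so the
// block count stays under the 65,535 limit of compute capability 2.x.
// BLOCK_SIZE must be a power of two and match the launch configuration.
#define BLOCK_SIZE 256

__global__ void kernel_rowSum_blockPerRow(const float *m, float *s,
                                          int nrow, int ncol) {
    __shared__ float partial[BLOCK_SIZE];

    for (int row = blockIdx.x; row < nrow; row += gridDim.x) {
        // Each thread strides over the columns of the row; consecutive
        // threads touch consecutive addresses, so the loads are coalesced.
        float sum = 0.0f;
        for (int j = threadIdx.x; j < ncol; j += blockDim.x)
            sum += m[row * ncol + j];
        partial[threadIdx.x] = sum;
        __syncthreads();

        // Standard shared-memory tree reduction.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            s[row] = partial[0];
        __syncthreads();  // partial[] is reused in the next row iteration
    }
}

It would be launched with something like kernel_rowSum_blockPerRow<<<min(nrow, 65535), BLOCK_SIZE>>>(d_m, d_output, nrow, ncol); but I have not benchmarked it, so I don't know whether it is the right approach.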
Any thoughts appreciated.
EDIT 1
Although I describe a standard rowSum, I am actually interested in the AND/OR operation over rows whose values are in {0,1}, i.e. rowAND/rowOR. That, unfortunately, prevents me from exploiting the cuBLAS trick of multiplying by a column vector of 1's, as suggested by some commentators.
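For concreteness, the rowAND variant is just the kernel above with the accumulation changed (a sketch, assuming the entries are exactly 0.0f or 1.0f; rowOR would use fmaxf and a 0.0f identity instead):

// Sketch of rowAND: same structure as kernel_rowSum, only the accumulation
// differs. Assumes every entry is exactly 0.0f or 1.0f.
__global__ void kernel_rowAND(const float *m, float *s, int nrow, int ncol) {
    int rowIdx = threadIdx.x + blockIdx.x * blockDim.x;
    if (rowIdx < nrow) {
        float acc = 1.0f;                            // identity element of AND
        for (int k = 0; k < ncol; k++)
            acc = fminf(acc, m[rowIdx * ncol + k]);  // AND on {0,1} == min
        s[rowIdx] = acc;
    }
}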
EDIT 2
As suggested by other users and endorsed here:
FORGET ABOUT TRYING TO WRITE YOUR OWN FUNCTIONS; use the Thrust library instead and the magic comes.
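For reference, this is roughly what the Thrust solution looks like: a single thrust::reduce_by_key call with row keys computed on the fly (a sketch, assuming a row-major device_vector; the helper functor name is mine):

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>

// Maps a linear (row-major) element index to its row index.
struct linear_index_to_row_index : public thrust::unary_function<int, int> {
    int ncol;
    __host__ __device__ linear_index_to_row_index(int ncol) : ncol(ncol) {}
    __host__ __device__ int operator()(int i) const { return i / ncol; }
};

// Sums each row of the nrow x ncol matrix d_m into d_row_sums.
void thrust_rowSum(const thrust::device_vector<float>& d_m,
                   thrust::device_vector<float>& d_row_sums,
                   int nrow, int ncol) {
    thrust::reduce_by_key(
        thrust::make_transform_iterator(thrust::counting_iterator<int>(0),
                                        linear_index_to_row_index(ncol)),
        thrust::make_transform_iterator(thrust::counting_iterator<int>(nrow * ncol),
                                        linear_index_to_row_index(ncol)),
        d_m.begin(),
        thrust::make_discard_iterator(),  // the row keys themselves are not needed
        d_row_sums.begin());
}

For the rowAND/rowOR case from EDIT 1, the same call should work with the reduction operator swapped, e.g. passing thrust::equal_to<int>() and thrust::minimum<float>() (AND on {0,1}) or thrust::maximum<float>() (OR) as the last two arguments of reduce_by_key.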