Recently, I installed OpenBLAS under the Windows Subsystem for Linux on Windows 10 so that I could run optimised matrix calculations in C++, but I don't think the library is making full use of the hardware I am running it on.
For example, a simple dgemm call to multiply two 10,000x10,000 matrices takes roughly 10-11 seconds, while numpy, on a matrix of exactly the same size and the same datatype (double/float64), takes only 4-5 seconds. Looking in Task Manager, it appears that numpy is able to use roughly 16 of my 32 threads, while OpenBLAS only uses 4 (confirmed by calling openblas_get_num_threads()).
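For context, the benchmark is essentially a minimal sketch like the one below (the all-ones initial values, row-major layout and alpha = 1 / beta = 0 are just placeholders for illustration; only the cblas_dgemm call and the timing matter):

#include <chrono>
#include <iostream>
#include <vector>
#include <cblas.h>   // OpenBLAS CBLAS interface

int main() {
    const int n = 10000;                                   // 10,000 x 10,000 matrices
    std::vector<double> A(static_cast<size_t>(n) * n, 1.0);
    std::vector<double> B(static_cast<size_t>(n) * n, 1.0);
    std::vector<double> C(static_cast<size_t>(n) * n, 0.0);

    auto t0 = std::chrono::steady_clock::now();
    // C = 1.0 * A * B + 0.0 * C
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n, B.data(), n,
                0.0, C.data(), n);
    auto t1 = std::chrono::steady_clock::now();

    std::cout << "dgemm: "
              << std::chrono::duration<double>(t1 - t0).count()
              << " s\n";                                   // ~10-11 s here vs ~4-5 s for numpy
}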
Even after explicitly telling OpenBLAS to use more threads, I still only get 4, as shown in the code below:
#include <iostream>
#include <cblas.h>   // OpenBLAS header declaring the openblas_*/goto_* helpers

int main() {
    openblas_set_num_threads(8); // This should set the number of OpenBLAS threads to 8
    goto_set_num_threads(8);     // This should also set the number of OpenBLAS threads to 8

    std::cout << "OpenBLAS number of threads: " << openblas_get_num_threads() << "\n"; // Always gives 4
    std::cout << "Number of cores: " << openblas_get_num_procs() << "\n";              // 32 (correct)
    std::cout << "Parallel type: " << openblas_get_parallel() << "\n";                 // 1 -- default parallel type, i.e. no OpenMP
}
My question is: is there a hard-coded limit of 4 threads set in the libopenblas.lib file or elsewhere, or is there something I can do to make the dgemm call run on more threads and boost performance, ideally reaching or exceeding numpy's time?
Thanks in advance.
=========== EDIT ===========
I have played around with this some more and found that there is, in fact, a limit of 4 threads being set, but I can't find a way to change it. I tried setting it in the make configuration like this:
make MAX_THREADS=32 ......
but this hasn't changed anything. Is there some way of fixing this?
Here is how I found that there is a set limit of 4:
std::cout << "Config type: " << openblas_get_config() << "
"; // ... MAX_THREADS=4