Sharing an implementation of SGEMM using GPU and OpenCL

Hi,

I just created a project with the XOP and source files for a simple (and un-optimized) version of SGEMM (Single-precision General Matrix Multiplication).
http://www.igorexchange.com/project/OpenCLSGEMM

I am new to C programming and only spent a few days assembling the code from tutorials around the web (I tried to reference each of them within the source code).
At the moment I only get a factor of 3 improvement (for large input matrices) using an AMD HD 7950 over an Intel Core i7-4790. While I expect a newer and faster GPU will help quite a lot, the implementation can definitely be improved by a fair amount.
It uses this kernel, with work per thread set to 4:
https://cnugteren.github.io/tutorial/pages/page5.html
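For readers who don't want to click through: the idea is that each work-item computes several elements of C instead of one, which amortizes the index arithmetic and reuses loaded values. A minimal sketch of that idea (my own illustration, assuming row-major storage and M divisible by WPT; the tutorial's actual kernel uses column-major layout and differs in detail):

// Each work-item computes WPT consecutive rows of one column of C.
#define WPT 4

__kernel void sgemm_wpt(const int M, const int N, const int K,
                        const __global float* A,
                        const __global float* B,
                        __global float* C)
{
    const int row = get_global_id(0) * WPT;    // first row handled by this work-item
    const int col = get_global_id(1);          // column of C

    float acc[WPT];
    for (int w = 0; w < WPT; w++)
        acc[w] = 0.0f;

    for (int k = 0; k < K; k++) {
        const float b = B[k * N + col];        // loaded once, reused WPT times
        for (int w = 0; w < WPT; w++)
            acc[w] += A[(row + w) * K + k] * b;
    }

    for (int w = 0; w < WPT; w++)
        C[(row + w) * N + col] = acc[w];
}

The host then enqueues an (M/WPT) x N NDRange instead of M x N.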

I will try to improve the performance over time, but it would be great if someone who is experienced with OpenCL and GPU programming could give me some hints.
If you are also interested in trying it out, I will be glad to know whether it works for you, and I will provide help where possible.
Performance was measured with this script:
function test(runs)
    variable runs

    variable m
    variable dim = 8192

    // Per-run timings and the input/output matrices (zero-initialized)
    make/o/n=(runs) CPUTakes, GPUTakes
    make/o/n=(dim,dim) AA, BB, CC, DD

    variable timer, GPUTook, CPUTook

    for (m=0; m<runs; m=m+1)
        // Time the GPU multiplication: CC = AA x BB
        timer = startMSTimer
        GPU_SGEMM(AA, BB, CC)
        GPUTook = StopMSTimer(timer)/1e6    // microseconds -> seconds
        GPUTakes[m] = GPUTook

        // Time the CPU multiplication with all threads enabled
        MultiThreadingControl setMode=8
        timer = startMSTimer
        MatrixOp/O DD = AA x BB
        CPUTook = StopMSTimer(timer)/1e6
        MultiThreadingControl setMode=0
        CPUTakes[m] = CPUTook
    endfor

    variable/g CPUTake = mean(CPUTakes)
    variable/g GPUTake = mean(GPUTakes)

    // SGEMM performs 2*dim^3 floating-point operations:
    // one multiply and one add for each of the dim^3 multiply-accumulates
    variable/g CPUGFLOPS = 2*dim*dim*dim/CPUTake*1e-9
    variable/g GPUGFLOPS = 2*dim*dim*dim/GPUTake*1e-9

    print "CPU SGEMM GFLOPS = " + num2str(CPUGFLOPS)
    print "GPU SGEMM GFLOPS = " + num2str(GPUGFLOPS)

    killwaves/Z CPUTakes, GPUTakes, AA, BB, CC, DD
end


Just some performance metrics:
Intel Core i7-6700k@4.4GHz SGEMM GFLOPS = 239.81
Intel Core i7-4790 SGEMM GFLOPS = 196.84
AMD HD 7950 SGEMM GFLOPS = 585.71
AMD RX 480 SGEMM GFLOPS = 908.5

Also, with Python and NumPy, a Threadripper 1950X gets SGEMM GFLOPS = 238.38 (AMD has to optimize this).


Nice project!

Have you seen http://www.igorexchange.com/project/IgorCL/ (https://github.com/pdedecker…), which allows you to load OpenCL code from within Igor? In addition, with this XOP you can also execute the CL code on the CPU.

Here I get:

•test(1)
Result is equal 1
CPU SGEMM GFLOPS = 216.82
GPU SGEMM GFLOPS = 472.03

with a Radeon (TM) Pro WX 5100 and an i7-3930k.

When doing these tests it is usually a good idea to compare the results, e.g. with printf "Result is equal %d\r", EqualWaves(DD, CC, 1, 1e-6).

I've tried fiddling with the TS/WPT parameters in the OpenCL file, but whenever I do that I get wrong results.



For further performance tuning, you may want to refer to this page:
https://cnugteren.github.io/tutorial/pages/page1.html

A large portion of my implementation was based on it.
His kernels 6 and 7 are another factor of 2-3 faster than what my source currently uses, but I couldn't get correct results with them; I am finding out why. He also has an auto-tuner: https://github.com/CNugteren/CLTune
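For anyone following along, the step those later kernels build on is tiling through local memory: each work-group stages TS x TS blocks of A and B into __local memory so every element is read from global memory only once per tile. A minimal sketch of that technique (my own illustration, assuming row-major storage and dimensions divisible by TS, not his exact code):

#define TS 16    // tile size = work-group size in each dimension

__kernel void sgemm_tiled(const int M, const int N, const int K,
                          const __global float* A,
                          const __global float* B,
                          __global float* C)
{
    const int localRow  = get_local_id(0);
    const int localCol  = get_local_id(1);
    const int globalRow = get_group_id(0) * TS + localRow;
    const int globalCol = get_group_id(1) * TS + localCol;

    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];

    float acc = 0.0f;
    for (int t = 0; t < K / TS; t++) {
        // Stage one TS x TS tile of A and of B into local memory
        Asub[localRow][localCol] = A[globalRow * K + (t * TS + localCol)];
        Bsub[localRow][localCol] = B[(t * TS + localRow) * N + globalCol];
        barrier(CLK_LOCAL_MEM_FENCE);    // wait until the tile is complete

        for (int k = 0; k < TS; k++)
            acc += Asub[localRow][k] * Bsub[k][localCol];
        barrier(CLK_LOCAL_MEM_FENCE);    // don't overwrite tiles still in use
    }
    C[globalRow * N + globalCol] = acc;
}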
Also, it might be good to see whether the clBLAS kernel can be used, which doesn't look too straightforward at the moment: https://github.com/clMathLibraries/clBLAS
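In case it helps, the clBLAS entry point is clblasSgemm; a hedged host-side sketch (assuming the context, queue, and cl_mem buffers already exist; the names here are placeholders and error checking is mostly omitted):

#include <clBLAS.h>

// C = alpha*A*B + beta*C with alpha=1, beta=0, row-major, no transposes.
// A is MxK (lda=K), B is KxN (ldb=N), C is MxN (ldc=N).
int run_clblas_sgemm(cl_command_queue queue,
                     cl_mem bufA, cl_mem bufB, cl_mem bufC,
                     size_t M, size_t N, size_t K)
{
    cl_event event;

    if (clblasSetup() != clblasSuccess)
        return -1;                            // library initialization failed

    clblasStatus err = clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans,
                                   M, N, K,
                                   1.0f, bufA, 0, K,
                                   bufB, 0, N,
                                   0.0f, bufC, 0, N,
                                   1, &queue, 0, NULL, &event);
    if (err == clblasSuccess)
        clWaitForEvents(1, &event);           // block until the GEMM finishes

    clblasTeardown();
    return (err == clblasSuccess) ? 0 : -1;
}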
If I could push the GFLOPS of the RX 480 to close to 3000 I would be grateful, and then I could probably convince my boss to get me a Vega :D.

I found a bug in the memory check (I had the size of a float in bits and forgot to divide by 8 to convert to bytes); I will update it soon, but you may just change it in the source.
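For anyone patching it themselves, the check boils down to something like this (a sketch with hypothetical names, not the XOP's actual code): the three dim x dim float buffers have to fit in the device's global memory, and no single buffer may exceed the maximum allocation size.

#include <CL/cl.h>

int check_device_memory(cl_device_id device, size_t dim)
{
    cl_ulong globalMem = 0, maxAlloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalMem), &globalMem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                    sizeof(maxAlloc), &maxAlloc, NULL);

    // sizeof(float) is 4 (bytes), so this is already in bytes
    cl_ulong perBuffer = (cl_ulong)dim * dim * sizeof(float);
    return (3 * perBuffer <= globalMem) && (perBuffer <= maxAlloc);
}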
*Forgot to add: the implementation I used only works for matrices whose dimensions are divisible by 16 (or 4, I need to check); otherwise the boundary results will be invalid.
This kind of boundary issue also needs to be handled more gracefully.
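The usual pattern (a sketch of the general idea, not what the project currently does) is to round the NDRange up to a multiple of the work-group size and let the out-of-range work-items do nothing:

// One element of C per work-item, with a boundary guard; the same pattern
// (plus zero-padding the tile loads) extends to the tiled/WPT kernels.
__kernel void sgemm_guarded(const int M, const int N, const int K,
                            const __global float* A,
                            const __global float* B,
                            __global float* C)
{
    const int row = get_global_id(0);
    const int col = get_global_id(1);
    if (row >= M || col >= N)
        return;                     // extra threads from the rounded-up NDRange

    float acc = 0.0f;
    for (int k = 0; k < K; k++)
        acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

On the host side the global size is then rounded up to a multiple of the work-group size, e.g. ((M + TS - 1) / TS) * TS in each dimension.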