
CLOO: Getting started, questions about performance (or lack thereof).
Posted Wednesday, 17 February, 2010 - 20:05 by Jeff
I'm running the "AddVector" example and the performance seems pretty poor compared to a simple single-threaded C# implementation.
GPU: Quadro FX 570, here are some of the details:
MaxClockFrequency: 920
MaxComputeUnits: 2
MaxConstantArguments: 9
MaxConstantBufferSize: 65536
MaxMemoryAllocationSize: 134217728
MaxParameterSize: 4352
MaxReadImageArguments: 128
MaxSamplers: 16
MaxWorkGroupSize: 512
CPU: 2.33 GHz Xeon (2 CPUs, 4 cores each, but that shouldn't matter since my comparison C# code is single-threaded).
.NET 2.0, CLOO 0.6.1
Across a range of vector sizes, the GPU seems to fairly consistently take 10 times as long as the CPU. That's after moving the "new ComputeContextPropertyList", "new ComputeContext", "new ComputeProgram", "program.Build", and "program.CreateKernel" outside the loop (since presumably it is okay to reuse the kernel over and over).
Here are the times for all the CLOO calls (for 2400 runs, vector sizes from 2 to 16Mb):
Total: 00:00:00.0312500, Count: 1, Average: 00:00:00.0312500. Key: OpenCL new ComputeContextPropertyList
Total: 00:00:00.0468750, Count: 1, Average: 00:00:00.0468750. Key: OpenCL new ComputeContext
Total: 00:00:00, Count: 1, Average: 00:00:00. Key: OpenCL new ComputeProgram
Total: 00:00:00.2968750, Count: 1, Average: 00:00:00.2968750. Key: OpenCL program.Build
Total: 00:00:00, Count: 1, Average: 00:00:00. Key: OpenCL program.CreateKernel
Total: 00:00:33.4531250, Count: 4800, Average: 00:00:00.0069694. Key: OpenCL new ComputeBuffer (input)
Total: 00:00:00.0156250, Count: 2400, Average: 00:00:00.0000065. Key: OpenCL new ComputeBuffer (output)
Total: 00:00:00.0156250, Count: 7200, Average: 00:00:00.0000021. Key: OpenCL kernel.SetMemoryArgument
Total: 00:00:00.0156250, Count: 2400, Average: 00:00:00.0000065. Key: OpenCL new ComputeCommandQueue
Total: 00:01:11.1250000, Count: 2400, Average: 00:00:00.0296354. Key: OpenCL queue.Execute
Total: 00:04:17.1875000, Count: 2400, Average: 00:00:00.1071614. Key: OpenCL queue.Read
It seems odd to me that the longest time is spent in the "read" call, since I'm inputting twice as much data as I'm reading.
Which call is the one that actually passes the data up to the GPU's memory? Execute? Or constructing the input ComputeBuffer?
Should I be specifying WorkGroupSize (or any other optimization parameters)? Or is that handled automatically by CLOO (or OpenCL itself) based on the data being worked on?


Comments
Re: CLOO: Getting started, questions about performance (or ...
The "AddVector" example is more of a test that things work; it's not uncommon for it to run slower than a CPU version. One of the main reasons is that "AddVector" doesn't really use what the GPU is good at, which is doing calculations. Instead it spends most of its time reading from and writing to memory. That is quite slow, depending on the memory type used, and there is probably not enough other work to keep the GPU busy in the meantime.
The GPU is very different from the CPU. The CPU gets the data and processes it, and to speed up the data fetching it has a cache. If the GPU needs data that it can't get quickly, it switches to another thread.
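To put a rough number on the point about memory traffic: adding two float vectors does one addition per element but moves three floats (two in, one out). Below is a minimal sketch in plain C of that back-of-the-envelope arithmetic. The function names are made up for illustration, and any bandwidth value passed in is an assumption, not something measured in this thread.

```c
#include <assert.h>

/* Bytes moved for one element of c[i] = a[i] + b[i]:
   two float inputs go to the device, one float result comes back. */
static double bytes_per_element(void)
{
    return 3.0 * (double)sizeof(float);
}

/* Arithmetic intensity: floating-point operations per byte moved. */
static double flops_per_byte(double flops_per_element)
{
    assert(flops_per_element >= 0.0);
    return flops_per_element / bytes_per_element();
}

/* Lower bound (in seconds) on the time spent just moving data for n
   elements, given an ASSUMED sustained bandwidth in bytes per second. */
static double transfer_seconds(double n, double assumed_bandwidth)
{
    assert(n >= 0.0);
    assert(assumed_bandwidth > 0.0);
    return n * bytes_per_element() / assumed_bandwidth;
}
```

With 4-byte floats, flops_per_byte(1.0) comes out to 1/12, or about 0.083 operations per byte. However fast the device is at arithmetic, a kernel like that is limited by how quickly the data can be moved.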
Edit: look at this YouTube video and you'll see the difference in a more appropriate example
Re: CLOO: Getting started, questions about performance (or ...
Hello, and welcome!
First off, I ran the VectorAdd sample back when Cloo was at version 0.4, and I can confirm that using Cloo/OpenCL DOES result in a dramatic speedup of the calculations for me when I compared a single-threaded version to the Cloo version on my dual-core CPU (no GPU for me). However, the VectorAdd demo does not have the timing triggers in the right places, and the sample lacks the queue.Finish(); command that should be called after every queue.Execute(....). The VectorAdd demo is meant to be a simple example and is not optimized. If you post the code for your VectorAdd sample along with the single-threaded version that you are using to benchmark, I can take a look at it.
Re: CLOO: Getting started, questions about performance (or ...
"VectorAdd" comes from the OpenCL ice ages, a time when, given the state of the drivers and whatnot, I had to spend a good part of the afternoon (and sometimes the night) just to get a silly "Execute(...)" working. That's why the kernel is very simple and doesn't really take advantage of parallel processing.
I haven't had the time to add a more complex example which makes use of the OpenCL processing power. We've been busy hunting down bugs and generally improving Cloo internals for a couple of weeks now.
Anyway, I have this hypothesis about the "Read" anomaly (or the last command in general). If you're specifying a blocking read behavior then you're possibly measuring not only the "Read" command itself but also the execution of the previous command(s) which might have piled up in the queue. However, I've never tested this hypothesis and I can't say more without a look into the code.
the sample lacks the queue.Finish(); command that should be called after every queue.Execute(....).
Not necessarily. If you take advantage of the event lists and your last command is blocking (Read, if you tell it to) then you needn't specify a finish command. In this case Read will wait for the specified events AND itself to finish.
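The timing effect described above can be illustrated with a toy model. This is a sketch in plain C, not the Cloo or OpenCL API: ToyQueue and the toy_* functions are hypothetical names, costs are in arbitrary units, and it assumes an in-order queue where enqueueing a kernel returns immediately.

```c
#include <assert.h>

/* Toy model of an in-order command queue (hypothetical, for illustration). */
typedef struct {
    double pending;   /* cost of work enqueued but not yet waited for */
} ToyQueue;

/* Non-blocking enqueue: the work is queued and the call returns at once,
   so a timer around this call observes (close to) zero. */
static double toy_enqueue_kernel(ToyQueue *q, double kernel_cost)
{
    assert(kernel_cost >= 0.0);
    q->pending += kernel_cost;
    return 0.0;
}

/* Finish: blocks until all queued work is done; the caller observes
   the full cost of whatever was still pending. */
static double toy_finish(ToyQueue *q)
{
    double waited = q->pending;
    q->pending = 0.0;
    return waited;
}

/* Blocking read: first waits for everything ahead of it in the queue,
   then does the copy. A timer around this call observes both. */
static double toy_blocking_read(ToyQueue *q, double copy_cost)
{
    assert(copy_cost >= 0.0);
    return toy_finish(q) + copy_cost;
}
```

In this model, enqueueing a kernel of cost 30 and then doing a blocking read of cost 5 makes the read appear to take 35. Calling toy_finish between the two attributes 30 to the kernel and only 5 to the read. If the real queue behaves similarly, a timer around a blocking Read would be charged for the kernel as well.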
Edit:
Which call is the one that actually passes the data up to the GPU's memory? Execute? Or constructing the input ComputeBuffer?
Should I be specifying WorkGroupSize (or any other optimization parameters)? Or is that handled automatically by CLOO (or OpenCL itself) based on the data being worked on?
1) Either the ComputeBuffer constructor or the Write commands.
2) Cloo automates only primitive work. Although definitely possible, high-level optimizations are left to the user because in Cloo they might have a negative impact on the range of functionality. One of the design choices was to follow the OpenCL specs as closely as possible.
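On the work-group size question, one detail from OpenCL 1.x is worth a sketch: if you pass an explicit local work size, the global work size has to be a multiple of it, whereas passing none lets the implementation choose. A common host-side idiom is to round the global size up and have the kernel ignore the extra work-items. The helper names below are hypothetical, and how Cloo 0.6.1 exposes these parameters is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size. */
static size_t round_up_global_size(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    return remainder == 0 ? global_size
                          : global_size + (local_size - remainder);
}

/* Number of work-groups the padded range is split into. */
static size_t work_group_count(size_t global_size, size_t local_size)
{
    return round_up_global_size(global_size, local_size) / local_size;
}
```

With the MaxWorkGroupSize of 512 reported in the original post, a 1000-element vector would be padded to 1024 work-items in 2 groups. The kernel then needs a bounds check so that the 24 extra work-items do nothing.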
Re: CLOO: Getting started, questions about performance (or ...
My C# code for testing is about what you'd expect:
Here's my code that uses CLOO:
Init and cleanup of reused objects:
Add vector implementation:
And the code that executes it looks basically like this:
I removed the timing checks around SetMemoryArgument because it seems to take effectively no time.
And for the exceptionally nit-picky, yes I realize I am not timing the various calls to VariousClooObjects.Dispose();
Re: CLOO: Getting started, questions about performance (or ...
OK so responding to the comments earlier, I tried to make the kernel more complicated to see if I can see some benefit of GPU over CPU. However I'm now getting an "InvalidKernelNameComputeException". Here's my new kernel:
Any idea what I'm doing wrong?
Re: CLOO: Getting started, questions about performance (or ...
length is an OpenCL function/reserved word. I guess that's what's causing the issue. http://www.khronos.org/opencl/sdk/1.0/docs/man/xhtml/length.html
Edit: Oh, and remove the ("global" or any other) memory location decoration from ints. You are not allowed to specify such a thing on primitive types.
Re: CLOO: Getting started, questions about performance (or ...
There were several problems. As nythrix mentioned, I had some syntax errors in the kernel code. I think the exception, however, was because I declared the kernel with the name "VectorSmooth" and then called CreateKernel with "SmoothVector" (oops).
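For anyone hitting the same errors, here is a sketch of what a kernel following the advice above might look like. The kernel that was originally posted is not reproduced here, so the three-point smoothing body is purely hypothetical. To keep the example testable as plain C, `kernel`, `global` and get_global_id are replaced with host-side stand-ins. In real OpenCL C those are built in and the stand-ins must be dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side stand-ins so the kernel compiles as plain C (emulation only). */
#define kernel
#define global
static size_t emulated_global_id;
static size_t get_global_id(unsigned int dim)
{
    (void)dim;
    return emulated_global_id;
}

/* Hypothetical smoothing kernel. Pointer arguments carry `global`;
   the element count is a plain int called `count`, which avoids the
   built-in function name `length`. */
kernel void VectorSmooth(global const float *input,
                         global float *output,
                         int count)
{
    int i = (int)get_global_id(0);
    if (i >= count) return;                    /* skip padded work-items */
    int left  = (i > 0) ? i - 1 : i;           /* clamp at the edges */
    int right = (i < count - 1) ? i + 1 : i;
    output[i] = (input[left] + input[i] + input[right]) / 3.0f;
}

/* Emulates launching one work-item per element on the host. */
static void run_vector_smooth(const float *input, float *output, int count)
{
    assert(count >= 0);
    for (int i = 0; i < count; ++i) {
        emulated_global_id = (size_t)i;
        VectorSmooth(input, output, count);
    }
}
```

The name passed to CreateKernel has to match the function name in the kernel source exactly, so here it would be "VectorSmooth".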