I'm running the "AddVector" example, and its performance seems quite poor compared to a simple single-threaded C# implementation.
Here are some details of my setup:
GPU: Quadro FX 570
CPU: 2.33 GHz Xeon (2 CPUs, 4 cores each, but that shouldn't matter since my comparison C# code is single-threaded)
.NET 2.0, CLOO 0.6.1
Across a range of vector sizes, the GPU fairly consistently takes about 10 times as long as the CPU. That's after moving "new ComputeContextPropertyList", "new ComputeContext", "new ComputeProgram", "program.Build", and "program.CreateKernel" outside the loop (since presumably it's okay to reuse the kernel over and over).
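For reference, here is roughly how I've structured it (a sketch against the CLOO 0.6-era API; exact constructor signatures may differ between versions, and `kernelSource`/`"VectorAdd"` stand in for the example's actual source and kernel name):

```csharp
// One-time setup, hoisted out of the timing loop:
ComputeContextPropertyList properties =
    new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
ComputeContext context = new ComputeContext(
    ComputeDeviceTypes.Gpu, properties, null, IntPtr.Zero);
ComputeProgram program = new ComputeProgram(context, kernelSource);
program.Build(null, null, null, IntPtr.Zero);
ComputeKernel kernel = program.CreateKernel("VectorAdd");

for (int run = 0; run < runCount; run++)
{
    // Per-run work stays inside the loop: create buffers,
    // set kernel arguments, create/reuse the queue, Execute, Read.
}
```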
Here are the times for all the CLOO calls (for 2400 runs, vector sizes from 2 to 16Mb):
OpenCL new ComputeContextPropertyList — Total: 00:00:00.0312500, Count: 1, Average: 00:00:00.0312500
OpenCL new ComputeContext — Total: 00:00:00.0468750, Count: 1, Average: 00:00:00.0468750
OpenCL new ComputeProgram — Total: 00:00:00, Count: 1, Average: 00:00:00
OpenCL program.Build — Total: 00:00:00.2968750, Count: 1, Average: 00:00:00.2968750
OpenCL program.CreateKernel — Total: 00:00:00, Count: 1, Average: 00:00:00
OpenCL new ComputeBuffer (input) — Total: 00:00:33.4531250, Count: 4800, Average: 00:00:00.0069694
OpenCL new ComputeBuffer (output) — Total: 00:00:00.0156250, Count: 2400, Average: 00:00:00.0000065
OpenCL kernel.SetMemoryArgument — Total: 00:00:00.0156250, Count: 7200, Average: 00:00:00.0000021
OpenCL new ComputeCommandQueue — Total: 00:00:00.0156250, Count: 2400, Average: 00:00:00.0000065
OpenCL queue.Execute — Total: 00:01:11.1250000, Count: 2400, Average: 00:00:00.0296354
OpenCL queue.Read — Total: 00:04:17.1875000, Count: 2400, Average: 00:00:00.1071614
It seems odd to me that the longest time is spent in the "read" call, since I'm uploading twice as much data as I'm reading back.
Which call actually transfers the data to the GPU's memory? Execute? Or constructing the input ComputeBuffer?
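My current understanding (hedged, and based on general OpenCL semantics rather than anything CLOO-specific) is that the upload can happen at buffer construction when the CopyHostPointer flag is passed, and that a blocking Read waits for everything queued before it, so kernel time can end up counted under Read. A sketch of what I mean, with approximate CLOO 0.6-era signatures:

```csharp
// (a) If CopyHostPointer is passed, the host-to-device copy is
// associated with buffer construction, not with Execute:
ComputeBuffer<float> input = new ComputeBuffer<float>(
    context,
    ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
    hostData);

// (b) queue.Execute only enqueues the kernel and returns quickly.
// A blocking Read afterwards waits for the whole queue to drain,
// so the kernel's runtime may be attributed to the Read call in
// wall-clock timings like the ones above.
queue.Execute(kernel, null, new long[] { hostData.Length }, null, null);
queue.Read(output, true, 0, hostData.Length, resultPtr, null); // blocks
```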
Should I be specifying a work-group size (or any other optimization parameters)? Or is that handled automatically by CLOO (or OpenCL itself) based on the data being worked on?
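In case it matters, this is what I mean by specifying it (a hedged sketch; I believe passing null for the local work size lets the OpenCL runtime pick one, and the 64 below is just an illustrative value, not something I've tuned):

```csharp
// Global size = total work items; local size = work-group size.
// The local size should divide the global size evenly and stay
// within the device's limit for this kernel.
long[] globalWorkSize = new long[] { vectorLength };
long[] localWorkSize  = new long[] { 64 }; // illustrative, untuned
queue.Execute(kernel, null, globalWorkSize, localWorkSize, null);

// Passing null instead lets the implementation choose:
queue.Execute(kernel, null, globalWorkSize, null, null);
```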