carga's picture

Cloo performance? [ OpenCL CPU vs Pure .NET ]


I managed to run the VectorAdd sample from the Cloo project. In my environment there is no GPGPU available for OpenCL, so it runs on the CPU only.

I wanted to compare Cloo performance with what .NET provides out of the box. Here is my result for a vector with 10,000,000 elements:
------------------| Start VectorAdd |------------------
Dim(a)=10000000 GPU Time: 290 msec
Dim(a)=10000000 .NET Time: 87 msec
-------------------| End VectorAdd |-------------------

Pure .NET is about 3 times faster.
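For reference, the sequential baseline in a comparison like this is just a loop over the arrays. The following is a minimal sketch in plain C, not the code of the Cloo sample; the function names `vector_add` and `time_vector_add_ms` are hypothetical and chosen only for illustration. It times the addition alone, with allocation and initialisation left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Sequential baseline: c[i] = a[i] + b[i] for n elements. */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Times a single call of vector_add and returns the elapsed CPU time
   in milliseconds. The arrays are allocated and filled by the caller,
   so that preparation work is not part of the measurement. */
double time_vector_add_ms(const float *a, const float *b, float *c, size_t n)
{
    clock_t start = clock();
    vector_add(a, b, c, n);
    clock_t stop = clock();
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A caller would allocate three arrays of 10,000,000 floats, fill the two inputs, and pass them to `time_vector_add_ms`. Note that `clock()` reports processor time, so a wall-clock timer may be the better choice when comparing against a multi-threaded OpenCL run.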

Could someone post the result of this test run in an environment with a GPGPU available?

I would like to see at least a 10-times speed-up from OpenCL; otherwise it is just a waste of time to use such a complicated technology.

Best regards,

PS I observed a similar situation when using the Mono-to-SSE bindings: if SSE is available, we get a 2-times speed-up; if not, a 2-times slow-down.


viewon01's picture

Someone at AMD has an answer to the problem.

Take a look at the solution at:


viewon01's picture


I have done some tests, and in my "Test" application I receive only the message "Compilation failed".

But when I put my "CL" code in your application I receive a complete error description by calling:

Here is the code I use:
ComputeContext context = new ComputeContext(DeviceTypeFlags.DeviceTypeDefault, null, null);
ComputeProgram program = new ComputeProgram(context, kernelSource);

try
{
    program.Build(context.Devices, null, null, IntPtr.Zero);
}
catch (ComputeException e)
{
    // Log the compiler output for the first device in the context.
    LogManager.Trace(LogSeverity.CriticalError, LogType.Rendering, program.GetBuildLog(context.Devices[0]));
}

Why do I receive only a "short" description of the error and not the complete one, as in your application?


viewon01's picture

I have found the problem, but I don't know how to solve it.

When I run the application as a console application all is fine, but here it is a WPF application that launches several threads.
Do the OpenCL calls have to be made from the main thread, or is it something related?


nythrix's picture

Since this topic has grown a bit off-topic, please direct further posts to the bug report I've created. Thanks.

PS.: I will probably not fix this issue before the weekend. I'm out of time for the next two days, guys.

carga's picture

I could not run speed tests on my nVidia GPU since Sony is too slow in updating drivers for their VAIO notebooks, so the TriangleIntersection example is computed on the general-purpose CPU. Here is my result:

------------------| Start OpenCL platform info |------------------
For test only: Expires on Sun Feb 28 00:00:00 2010
name: ATI Stream
version: OpenCL 1.0 ATI-Stream-v2.0-beta4
vendor: Advanced Micro Devices, Inc.

name: Intel(R) Core(TM)2 Duo CPU T7700 @ 2.40GHz
driver: 1.0
vendor: GenuineIntel
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
+ cl_khr_local_int32_base_atomics
+ cl_khr_local_int32_extended_atomics
+ cl_khr_byte_addressable_store
-------------------| End OpenCL platform info |-------------------

------------------| Start Triangle intersection |------------------
Cloo ticks: 17484, milliseconds: 1
.NET ticks: 666924, milliseconds: 46
-------------------| End Triangle intersection |-------------------

I am very happy to see a speed-up of roughly 40 to 50 times (about 38 times by tick count, 46 times by the rounded milliseconds)! Basically it means that OpenCL executes parallel kernels on a general-purpose CPU far more efficiently than .NET executes its sequential code.

Could anybody rewrite the .NET intersect(...) method using Parallel LINQ? Would it give a comparable speed-up for .NET?

Best regards,

PS The compilation error on my system was due to the 'dir' declaration in the kernel. The correct kernel signature looks like:

kernel void intersect(
    float4 dir,
    global float4 * pointA,
    global float4 * pointB,
    global float4 * pointC,
    global float *    hits )

I suggest that any non-array argument that is passed through kernel.SetValueArg(...) should not have a global or local modifier.

viewon01's picture


1 - I think there is an error in the way we compute the "performance". When using OpenCL we must:
a) take into account the data transfer between .NET memory and the CPU or GPU memory (these are two different kinds of transfer);
b) take into account the fact that the code is compiled: in some cases we have to create a new context, compile again and transfer the data again;
c) in some cases, also free some OpenCL memory.

2 - Using .NET Parallel LINQ will not speed up the processing by a factor of 50, unless you have at least 50 cores in your CPU.
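The core-count argument can be made concrete with Amdahl's law, which bounds the speed-up obtained by spreading the same code over several cores. This is a minimal sketch in plain C; the function name `amdahl_speedup` is hypothetical. The bound only covers thread-level parallelism of identical code, so it says nothing about gains from SIMD vectorisation or from a better compiler.

```c
#include <assert.h>

/* Amdahl's law: upper bound on the speed-up when a fraction
   `parallel_fraction` (between 0 and 1) of the run time is spread
   evenly over `cores` cores and the rest stays sequential. */
double amdahl_speedup(double parallel_fraction, int cores)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}
```

For the dual-core Core 2 Duo T7700 reported earlier in the thread, `amdahl_speedup(1.0, 2)` gives 2, so even a perfectly parallel Parallel LINQ version could at best halve the .NET time on that machine.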

So it is very difficult to compare the two!

What I suggest is that we should put the 'CL data creation + CL data transfer' in the speed test.
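One way to follow this suggestion is to record each phase separately and report both a kernel-only and an end-to-end figure. The sketch below is plain C and purely illustrative: the struct `cl_run_times`, its fields and both function names are assumptions made for this example, not part of Cloo or OpenCL.

```c
#include <assert.h>

/* Hypothetical breakdown of one OpenCL run, all values in milliseconds. */
struct cl_run_times {
    double build_ms;     /* context creation and kernel compilation */
    double upload_ms;    /* host-to-device data transfer */
    double kernel_ms;    /* kernel execution only */
    double readback_ms;  /* device-to-host transfer of the results */
    double release_ms;   /* freeing OpenCL memory objects */
};

/* Speed-up when only kernel execution is compared with the baseline. */
double kernel_only_speedup(double baseline_ms, const struct cl_run_times *t)
{
    assert(t->kernel_ms > 0.0);
    return baseline_ms / t->kernel_ms;
}

/* Speed-up when every phase is charged to the OpenCL side. */
double end_to_end_speedup(double baseline_ms, const struct cl_run_times *t)
{
    double total = t->build_ms + t->upload_ms + t->kernel_ms
                 + t->readback_ms + t->release_ms;
    assert(total > 0.0);
    return baseline_ms / total;
}
```

With made-up numbers, a 50 ms baseline against a 1 ms kernel looks like a 50-times gain, but if compilation, transfers and clean-up add another 24 ms the end-to-end gain is only 2 times. Which figure is the relevant one depends on how often those extra phases are actually paid.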


carga's picture
viewon01 wrote:

What I suggest is that we should put the 'CL data creation + CL data transfer' in the speed test.

It is very common to have all the data prepared before starting any heavy CPU-intensive processing, so there is no need to measure the 'preparation' step. Otherwise we would measure the speed of our random generator together with the speed of the algorithm itself. It is also wise to create the context and initialize the kernels early in the program, so this overhead should not be measured either.

But I absolutely agree that the timer must be started just before the first kernel.Set*Arg() and stopped just after the last kernel.Read*(). I measured the performance of VectorAdd in exactly this way. And TriangleIntersection also does it this way, doesn't it?

Enjoy IT!

viewon01's picture

Hi Anto,

I understand your point of view and agree...

Except that in some cases it is not possible to convert .NET data directly to OpenCL data. We can do some "basic performance tests", but in reality there are cases where we must take more parameters into account to compute the "real" performance gain ;-)

In some cases we can even send all the triangle information at application start-up and then run all the intersection tests without repeating the transfer (Set*Arg) on each call...

When playing with this kind of technology we must adapt the algorithms for massive parallelism and take all the constraints into account :-)
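The effect of uploading the triangles once and reusing them can be estimated with simple arithmetic. This is a small sketch in plain C under the assumption that every kernel launch costs the same; the function name `amortized_ms_per_call` is hypothetical.

```c
#include <assert.h>

/* Average cost per call, in milliseconds, when the data is uploaded
   once and then reused for `calls` kernel launches of equal cost. */
double amortized_ms_per_call(double upload_ms, double kernel_ms, int calls)
{
    assert(calls >= 1);
    return upload_ms / calls + kernel_ms;
}
```

With illustrative values of 10 ms for the upload and 1 ms per launch, a single call costs 11 ms, while ten calls on the same data average 2 ms each.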

nythrix's picture

Could anybody rewrite .NET intersect(...) method using Parallel LINQ? Will it give comparable speed-up for .NET performance?
I have no experience with LINQ whatsoever, I'm sorry. All of my projects so far have been based on .NET 2.0 for compatibility reasons. I think the same applies to OpenTK and, by extension, to Cloo. I will take any advice on this.

[intersection kernel]
That has been fixed and will be released along with Cloo-0.3.1 in a couple of hours.
Because of certain limitations it wasn't possible to pass "__local" arguments to kernels. That's been fixed as well.

[speed comparisons]
Generally speaking, OpenCL vs. .NET is hard to compare. You have to take into account whether the result finishes onscreen (no readback penalty) or the data travel back and forth through different memory stacks. It is therefore obvious that the determining factor for speedups is the nature of the algorithm. Other players, like HW, drivers or OpenTK+Cloo, are far less important (asymptotically speaking).

Edit: the read-back operation is not included in the timer because raytracing results usually end up on screen and not on the CPU. Oh, and one more thing: most OpenCL commands can run asynchronously, so direct comparison with real-world apps is compromised once again :)

viewon01's picture


Are you trying to build a raytracer based on OpenCL?