carga's picture

Cloo performance? [ OpenCL CPU vs Pure .NET ]

Hello!

I managed to run the VectorAdd sample from the Cloo project. In my particular environment there is no GPGPU available for OpenCL, so it runs on the CPU only.

I wanted to compare Cloo performance with what .NET provides out of the box. Here is my result for a vector with 10,000,000 elements (the "GPU Time" line is the OpenCL run, which in my case executes on the CPU):
------------------| Start VectorAdd |------------------
Dim(a)=10000000 GPU Time: 290 msec
Dim(a)=10000000 .NET Time: 87 msec
-------------------| End VectorAdd |-------------------

Pure .NET is about 3.3 times faster (87 msec vs 290 msec).

Could someone please post the result of this test from an environment with a GPGPU available?

I would like to see at least a 10x speedup from OpenCL; otherwise it is just a waste of time to use such a complicated technology.
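
For reference, a plain-CPU baseline of this kind can be sketched in C as follows. This is only an illustration of the measurement, not the Cloo sample's actual code (which is not shown in this thread); the function names vector_add, time_vector_add_ms and speedup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Element-wise c[i] = a[i] + b[i]: the plain CPU baseline. */
void vector_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Hypothetical helper: processor time of one vector_add call, in msec. */
double time_vector_add_ms(const float *a, const float *b, float *c, size_t n)
{
    clock_t start = clock();
    vector_add(a, b, c, n);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}

/* Ratio of two timings; a value above 1 means the candidate is faster. */
double speedup(double baseline_ms, double candidate_ms)
{
    assert(candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}
```

With the numbers above, speedup(290.0, 87.0) is roughly 3.3. Note that clock() reports processor time, so for a multi-threaded OpenCL CPU run a wall-clock timer would be the fairer choice.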

Best regards,
Anton.
http://kyta.spb.ru

PS I observed a similar situation when using the Mono-to-SSE bindings: if SSE is available, we get a 2x speedup; if not, a 2x slowdown.


Comments

nythrix's picture

Yes. Never been a fan of rasterized graphics. My hobby engine uses a SW raytracer and a couple of GL rasterizers but I'm not happy with any of them. So I want to try out an OpenCL raytracer. If that works out, my next move is pushing the whole engine on top of OpenCL. A lot of work there.

But first, I have to polish Cloo. I'm not going anywhere without a decent base.

viewon01's picture

Great...

That's what I'm doing too...

Doing intersection per primitive is slow... so I'm currently working on implementing the acceleration structure in OpenCL... in order to handle millions of polygons ;-)

Take a look at:
- http://www.cs.unc.edu/~lauterb/GPUBVH/paper.pdf
- http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.pdf

carga's picture

I just realized that the huge advantage OpenCL shows in the TriangleIntersect sample is just the difference between the implementations of dot and cross products in .NET (OpenTK Vector4 operations) and in OpenCL.

The OpenCL kernel and the .NET Intersect() code look very similar, but there is a lot of math under the hood. The difference in execution time is the difference between OCL's float uw = dot( EdgeAB, w ); and OTK's float uw = Vector4.Dot( EdgeAB, w );, between the implementations of vector subtraction and normalisation, and between OCL's float4 Normal = normalize( cross( EdgeAB, EdgeAC ) ); and OTK's Vector4 Normal = Vector4.Normalize( new Vector4( Vector3.Cross( EdgeAB.Xyz, EdgeAC.Xyz ), 0 ) );.

In my environment the sample is executed on the CPU in both cases, so the results should be more or less equal. The 50x speedup is (mostly) the result of different overheads and inefficiencies in the OpenTK implementation of all these complex vector operators.

I also expect that on GPUs with hardware support for vector operators this kernel will execute much faster. But nythrix and I both get similar results... Strange...

Mr nythrix, could you please run the TriangleIntersector sample in two different ways:
1. Forcing OpenCL to use CPU as computing device (here we compare OpenCL CPU vs .NET CPU);
2. Forcing OpenCL to use GPU as computing device (here we compare OpenCL GPU vs .NET CPU);

Having these two results, we will be able to compare OpenCL CPU vs OpenCL GPU. THIS challenge is even more interesting!!!

I would even suggest including one more sample in the next release: one that runs all the kernels we have seen so far via OpenCL on every device available in the system, one device at a time, in a loop. My environment is very limited, but it would be very interesting to see results from people with several devices available.

Best regards,
Anton.

the Fiddler's picture

I haven't seen the example code yet, but this is very slow code:

Vector4 Normal = Vector4.Normalize( new Vector4( Vector3.Cross( EdgeAB.Xyz, EdgeAC.Xyz ), 0 ) );

It constructs and copies 5 temporary vectors (the Xyz properties, the Cross and Normalize methods and the Vector4 constructor).

Rearranging the code to reduce the number of temporaries will have a positive impact on performance. Using the ref overloads of the OpenTK methods will also help a lot:

float uw = Vector4.Dot(EdgeAB, w); // slow
 
float uw;
Vector4.Dot(ref EdgeAB, ref w, out uw); // fast

Unfortunately, it is very hard to write performant math code in .NET (this is a common complaint against XNA too). The lack of const references means you have to choose between multiple temporary objects and ugly syntax. (Even worse, sometimes it's impossible to avoid temporaries.) Then, there's the lack of generic operators and other similar annoyances that make math code very tedious to implement.
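
For comparison, this is roughly what the same trade-off looks like in C, where const pointers play the role of const references. It is only a sketch: the vec4 struct and both function names are hypothetical, and a C compiler may inline either version and remove the copies, so it illustrates the calling conventions rather than making a performance claim.

```c
#include <assert.h>

/* Hypothetical 16-byte vector type, similar in layout to a Vector4. */
typedef struct { float x, y, z, w; } vec4;

/* By value: both arguments are copied at the call site (unless inlined).
   This mirrors: float uw = Vector4.Dot(EdgeAB, w); */
float dot_by_value(vec4 a, vec4 b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

/* By const pointer with an out parameter: no struct copies are made.
   This mirrors: Vector4.Dot(ref EdgeAB, ref w, out uw); */
void dot_by_ref(const vec4 *a, const vec4 *b, float *result)
{
    assert(a && b && result);
    *result = a->x * b->x + a->y * b->y + a->z * b->z + a->w * b->w;
}
```

Both versions compute the same value; only the way the operands reach the function differs. The const qualifier also lets the callee promise not to modify its inputs, which is the guarantee a C# ref parameter cannot express.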

I believe that even a CPU-powered OpenCL implementation will be faster for non-trivial applications.

viewon01's picture

Also,

Take a look at this sample: http://kioku.sys-k.net/archives/code/ , it is an OpenCL port of the AO Bench ;-)

Quite simple, but it can be useful for running some tests.

Regards

nythrix's picture

I would say that CL vector operations are better optimized even on a CPU. There's no match for them in the managed world unless you get down to assembler/SSE. Still, the gap is so big that this is not the only difference. Depending on the implementation of OpenCL, multithreading capabilities of CPUs are exploited darn well, I'd say.

Right now, the only way of trying out the CPU vs GPU test is this:
1) First run the test on a CL capable nVidia card.
2) Uninstall nVidia drivers and run the test with ATI Stream on the CPU.

As of now, clGetPlatformIDs will not list both of these platforms. The vendors haven't reached an agreement on interoperability yet, so we're stuck with one platform for now.

But you could still use multiple devices in one platform, right? Well, not so fast:
1) nVidia needs to cooperate with CPU vendors to release CPU capable OpenCL. [Edit: never mind the sentence here. I'm still asleep]
2) ATI releases GPU capable drivers. I love their HW since it's usually ahead of the competition. But the drivers have been a pain in the arse since I was a kid.
3) Get a Mac. It looks like you can run kernels on multiple devices with their integrated drivers. Not having one, I cannot confirm this.

nythrix's picture
the Fiddler wrote:

...Then, there's the lack of generic operators and other similar annoyances that make math code very tedious to implement...

Apparently one simple thing that could've helped here is patented.

the Fiddler's picture

2) AMD Stream 2.0 beta 4 is both CPU- and GPU-capable, so that's a viable route.

If you have an Nvidia GPU, there might be a way to hack a CPU/GPU testing environment together: first install AMD's Stream SDK (CPU only). Afterwards, install Nvidia's OpenCL-capable drivers (GPU only). If you run OpenCL now, you should get Nvidia's GPU implementation. If you wish to test with AMD's CPU implementation, simply copy the relevant DLLs and EXEs from %Program Files%\Amd\Stream 2.0 to the test program's folder. To go back to Nvidia, simply rename AMD's OpenCL.dll to something else.

Edit: about that patent, I wouldn't worry about it too much. There's a *lot* of prior art (the oldest I could find is from 2004: http://www.codeproject.com/KB/cs/genericnumerics.aspx, but it's likely you can find even older implementations) and it probably won't hold up in court.

nythrix's picture

2) AMD Stream 2.0 beta 4 is both CPU- and GPU-capable, so that's a viable route.
Thanks for the pointer, I'll update the docs.

Copying DLLs here and there is a faster route, yes. But it remains an "external" procedure. I was playing with the idea of querying the CL platforms using home-baked methods. I'll reconsider it if nVidia doesn't come up with CPU support by January.

carga's picture
viewon01 wrote:

Take a look at this sample: http://kioku.sys-k.net/archives/code/ , it is an OpenCL port of the AO Bench ;-)

Quite simple, but it can be useful for running some tests.

Extremely interesting performance comparison sample!

The AOBench kernel for OpenCL has a single output parameter, of type global uint * result.

1. What is the size of this array for the default picture size of 256x256? 65536 (256 x 256 elements), I guess?

2. How should I interpret this uint[] output to reproduce the picture? I am able to get the picture using AOBench for C#, and I wish to reproduce it with the OpenCL implementation as well (just to be sure that both results are exactly identical).

After this step I will be ready to contribute an AOBenchTest for Cloo (if anybody is interested =) ).

Best regards,
Anton.

PS What about Cloo 0.3.x, announced a few days ago?