carga's picture

Cloo performance? [ OpenCL CPU vs Pure .NET ]

Hello!

I managed to run the VectorAdd sample from the Cloo project. In my environment there is no GPGPU device available for OpenCL, so it runs on the CPU only.

I wanted to compare Cloo's performance with what .NET provides out of the box. Here is my result for a vector of 10,000,000 elements (the "GPU Time" line is the OpenCL run, which here executes on the CPU):
------------------| Start VectorAdd |------------------
Dim(a)=10000000 GPU Time: 290 msec
Dim(a)=10000000 .NET Time: 87 msec
-------------------| End VectorAdd |-------------------

Pure .NET is about 3.3 times faster (290 msec vs 87 msec).

Could someone please post the result of this test run in an environment where a GPGPU is available?

I would like to see at least a 10x speedup from OpenCL; otherwise it is just a waste of time to use such a complicated technology.

Best regards,
Anton.
http://kyta.spb.ru

PS: I observed a similar situation when using the Mono-to-SSE bindings: if SSE is available, we get a 2x speedup; if not, a 2x slowdown.


Comments

nythrix's picture

First of all, thanks for testing it out.
I've been running similar tests and I can confirm this behaviour for the VectorAdd example. With a GeForce 9600GT it ran 8-9x slower than a simple for(;;) loop over an array of 50 000 items.

You see, Cloo and the underlying OpenCL carry a little overhead. With such a small kernel (just a simple c[i] = a[i] + b[i]) there is no performance gain. Actually, there is a huge performance drop.

VectorAdd is a dummy test that is supposed to tell you things work. I'll add a more complex kernel to the examples so that we have a better match against the real world.
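
To make the overhead argument concrete, here is a minimal sketch in plain C of a cost model, assuming the OpenCL path pays a fixed per-call overhead plus a per-item cost while the plain loop pays only a per-item cost. The struct, the function names, and every number used with them are hypothetical; nothing here was measured on the setups in this thread.

```c
/* Hypothetical cost model, not measured data. Times are in microseconds. */
typedef struct {
    double fixed_overhead_us;   /* one-time cost of the OpenCL call path (copies, launch) */
    double device_per_item_us;  /* device time per element */
    double host_per_item_us;    /* plain-loop time per element */
} cost_model;

/* Estimated time of the offloaded run for n items. */
double offload_time_us(const cost_model *m, double n) {
    return m->fixed_overhead_us + n * m->device_per_item_us;
}

/* Estimated time of the plain host loop for n items. */
double host_time_us(const cost_model *m, double n) {
    return n * m->host_per_item_us;
}

/* Item count at which both paths take the same time; above it offloading wins.
 * Returns -1 if the device is not faster per item, because the fixed
 * overhead can then never be recovered. */
double break_even_items(const cost_model *m) {
    double gain = m->host_per_item_us - m->device_per_item_us;
    if (gain <= 0.0) {
        return -1.0;
    }
    return m->fixed_overhead_us / gain;
}
```

With a made-up 1000 us of overhead, 0.5 us per item on the device and 2.5 us per item on the host, `break_even_items` gives 500 items. If the device is no faster per item, which may well be the case for something as cheap as c[i] = a[i] + b[i] once transfers are counted, the function returns -1, which would be consistent with the slowdown reported for VectorAdd.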

nythrix's picture

I'm posting some numbers that might make you happy. This test intersects a ray with a number of triangles and it was run in both OpenCL (on GF 9600 GT) and .NET (release config).

10 triangles:

Cloo ticks: 952072,             milliseconds: 0
.NET ticks: 5891232,            milliseconds: 2

100 triangles:

Cloo ticks: 950576,             milliseconds: 0
.NET ticks: 6017176,            milliseconds: 2

1 000 triangles:

Cloo ticks: 1767872,            milliseconds: 0
.NET ticks: 7309984,            milliseconds: 2

10 000 triangles:

Cloo ticks: 2695160,            milliseconds: 1
.NET ticks: 19414624,           milliseconds: 7

65 535 triangles:

Cloo ticks: 9008704,            milliseconds: 3
.NET ticks: 107045968,          milliseconds: 40

The test will be included in the next release of Cloo.
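
For anyone who wants the speedup factors rather than raw ticks, here is a small C sketch that divides the .NET tick count by the Cloo tick count for each run. The tick values are copied from the numbers above; the type and function names are only illustrative.

```c
#include <stddef.h>

/* One row of the measurements posted above (tick counts as quoted). */
typedef struct {
    long triangles;
    double cloo_ticks;
    double net_ticks;
} timing_row;

static const timing_row ROWS[] = {
    {    10,  952072.0,   5891232.0 },
    {   100,  950576.0,   6017176.0 },
    {  1000, 1767872.0,   7309984.0 },
    { 10000, 2695160.0,  19414624.0 },
    { 65535, 9008704.0, 107045968.0 },
};

/* Number of rows in the table above. */
#define ROW_COUNT (sizeof ROWS / sizeof ROWS[0])

/* Speedup of the Cloo run over the .NET run: .NET ticks / Cloo ticks. */
double speedup(const timing_row *row) {
    return row->net_ticks / row->cloo_ticks;
}
```

Calling `speedup` on each row gives roughly 6.2x, 6.3x, 4.1x, 7.2x and 11.9x, so only the 65 535 triangle run clears the 10x bar asked for at the top of the thread.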

carga's picture

Yes, this is much, much better! Thank you for the good news!

I would like to test a more "general-purpose" task such as the standard N-body problem. I will post my results and source as soon as I can.

Best regards,
Anton.

viewon01's picture

Hi nythrix,

Can you provide the code you used for your performance test? I would like to run some tests too.

Thx

nythrix's picture

Yes. Cloo 0.2.1 is out.

viewon01's picture

It sounds great...

I ran it and got the following error:

"InvalidKernelNameComputeException"

Here is the stack trace:

Cloo.dll!Cloo.ComputeException.ThrowIfError(OpenTK.Compute.CL10.ErrorCode errorCode = InvalidKernelName) Line 150 C#
Cloo.dll!Cloo.ComputeKernel.ComputeKernel(string functionName = "intersect", Cloo.ComputeProgram program = {ComputeProgram(92622517)}) Line 87 + 0x8 bytes C#
Cloo.dll!Cloo.ComputeProgram.CreateKernel(string functionName = "intersect") Line 230 + 0x1b bytes C#
ClooTester.exe!ClooTester.TriangleIntersector.Run() Line 97 + 0x10 bytes C#
ClooTester.exe!ClooTester.Program.Main() Line 20 + 0x1d bytes C#

I am running the ATI SDK with:
- CPU: Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
- Graphics card: Intel Eaglelake

viewon01's picture

Here is the output I got:

------------------| Start OpenCL platform info |------------------
For test only: Expires on Sun Feb 28 00:00:00 2010
name: ATI Stream
version: OpenCL 1.0 ATI-Stream-v2.0-beta4
profile: FULL_PROFILE
vendor: Advanced Micro Devices, Inc.
extensions:
+

devices:
name: Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
driver: 1.0
vendor: GenuineIntel
extensions:
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
+ cl_khr_local_int32_base_atomics
+ cl_khr_local_int32_extended_atomics
+ cl_khr_byte_addressable_store
-------------------| End OpenCL platform info |-------------------

------------------| Start Program binary |------------------
file://C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACC.tmp.dll
-------------------| End Program binary |-------------------

------------------| Start Memory Object mapping |------------------
Original content:
6754592535810294784
5609699999583784960
7831835295529076736
8435040577537347584
4115809016406119424
5747797109613563904
Mapped content:
6754592535810294784
5609699999583784960
7831835295529076736
8435040577537347584
4115809016406119424
5747797109613563904
-------------------| End Memory Object mapping |-------------------

------------------| Start VectorAdd |------------------
33,86629 + 96,28593 = 130,1522
59,61269 + 27,52616 = 87,13885
31,76719 + 9,762908 = 41,53009
92,57583 + 97,41137 = 189,9872
19,51515 + 12,30295 = 31,8181
21,562 + 42,41216 = 63,97417
64,71769 + 8,006876 = 72,72456
54,36967 + 78,10004 = 132,4697
28,51088 + 84,06319 = 112,5741
25,838 + 5,153872 = 30,99187
-------------------| End VectorAdd |-------------------

------------------| Start Triangle intersection |------------------
C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACE.tmp.cl(3): error: a parameter cannot
be allocated in a named address space
global float4 dir,
^

1 error detected in the compilation of "C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACE.tmp.cl".

nythrix's picture

Strange, this doesn't occur with nVidia drivers. I have to find out whether this is caused by the kernel or by the ATI drivers. Anyway, I can't tell for sure until I get home tonight.

viewon01's picture

Thanks,

I have posted about this on the AMD forums: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=123037...

I'll do my best to help, if you don't find the cause first ;-)

pceric's picture

I get the same error on nVidia drivers:

------------------| Start OpenCL platform info |------------------
name: NVIDIA CUDA
version: OpenCL 1.0 CUDA 3.0.1
profile: FULL_PROFILE
vendor: NVIDIA Corporation
extensions:
+ cl_khr_byte_addressable_store
+ cl_khr_gl_sharing
+ cl_nv_compiler_options
+ cl_nv_device_attribute_query

devices:
name: GeForce 9600 GT
driver: 195.62
vendor: NVIDIA Corporation
extensions:
+ cl_khr_byte_addressable_store
+ cl_khr_gl_sharing
+ cl_nv_compiler_options
+ cl_nv_device_attribute_query
+
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
-------------------| End OpenCL platform info |-------------------

------------------| Start Program binary |------------------
//
// Generated by NVIDIA NVPTX Backend for LLVM
//

.version 1.5
.target sm_11, texmode_independent, map_f64_to_f32

// Global Launch Offsets
.const[0] .s32 %_global_block_offset[3];
.const[0] .s32 %_global_launch_offset[3];
.const[0] .s32 %_global_num_groups[3];
.const[0] .s32 %_global_size[3];
.const[0] .u32 %_work_dim;

.const .align 8 .b8 def___internal_i2opi_d[144] = { 0x08, 0x5D, 0x8D, 0x1F, 0xB1, 0x5F, 0xFB, 0x6B, 0xEA, 0x92, 0x52, 0x8A, 0xF7, 0x39, 0x07, 0x3D, 0x7B, 0xF1, 0xE5, 0xEB, 0xC7, 0xBA, 0x27, 0x75, 0x2D, 0xEA, 0x5F, 0x9E, 0x66, 0x3F, 0x46, 0x4F, 0xB7, 0x09, 0xCB, 0x27, 0xCF, 0x7E, 0x36, 0x6D, 0x1F, 0x6D, 0x0A, 0x5A, 0x8B, 0x11, 0x2F, 0xEF, 0x0F, 0x98, 0x05, 0xDE, 0xFF, 0x97, 0xF8, 0x1F, 0x3B, 0x28, 0xF9, 0xBD, 0x8B, 0x5F, 0x84, 0x9C, 0xF4, 0x39, 0x53, 0x83, 0x39, 0xD6, 0x91, 0x39, 0x41, 0x7E, 0x5F, 0xB4, 0x26, 0x70, 0x9C, 0xE9, 0x84, 0x44, 0xBB, 0x2E, 0xF5, 0x35, 0x82, 0xE8, 0x3E, 0xA7, 0x29, 0xB1, 0x1C, 0xEB, 0x1D, 0xFE, 0x1C, 0x92, 0xD1, 0x09, 0xEA, 0x2E, 0x49, 0x06, 0xE0, 0xD2, 0x4D, 0x42, 0x3A, 0x6E, 0x24, 0xB7, 0x61, 0xC5, 0xBB, 0xDE, 0xAB, 0x63, 0x51, 0xFE, 0x41, 0x90, 0x43, 0x3C, 0x99, 0x95, 0x62, 0xDB, 0xC0, 0xDD, 0x34, 0xF5, 0xD1, 0x57, 0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, 0x6E, 0x83, 0xF9, 0xA2 };

.entry vectorAdd
(
.param .b32 vectorAdd_param_0,
.param .b32 vectorAdd_param_1,
.param .b32 vectorAdd_param_2
)
{
.reg .f32 %f<4>;
.reg .s32 %r<9>;

_vectorAdd:
{
// get_global_id(0)
.reg .u32 %vntidx;
.reg .u32 %vctaidx;
.reg .u32 %vtidx;
cvt.u32.u16 %vntidx, %ntid.x;
cvt.u32.u16 %vctaidx, %ctaid.x;
cvt.u32.u16 %vtidx, %tid.x;
mad.lo.s32 %r1, %vntidx, %vctaidx, %vtidx;
.reg .u32 %temp;
ld.const.u32 %temp, [%_global_launch_offset+0];
add.u32 %r1, %r1, %temp;
}

shl.b32 %r2, %r1, 2;
ld.param.u32 %r3, [vectorAdd_param_1];
ld.param.u32 %r4, [vectorAdd_param_0];
add.s32 %r5, %r4, %r2;
add.s32 %r6, %r3, %r2;
ld.param.u32 %r7, [vectorAdd_param_2];
ld.global.f32 %f1, [%r5];
ld.global.f32 %f2, [%r6];
add.rn.ftz.f32 %f3, %f1, %f2;
add.s32 %r8, %r7, %r2;
st.global.f32 [%r8], %f3;
ret;
}

-------------------| End Program binary |-------------------

------------------| Start Memory Object mapping |------------------
Original content:
2934802780230322176
8527569823997102080
204372612966262688
7046751857986121728
3974656866072502784
1309497746332879616
Mapped content:
2934802780230322176
8527569823997102080
204372612966262688
7046751857986121728
3974656866072502784
1309497746332879616
-------------------| End Memory Object mapping |-------------------

------------------| Start VectorAdd |------------------
56.57155 + 58.45175 = 115.0233
58.53479 + 70.96561 = 129.5004
7.859627 + 63.40628 = 71.26591
6.03061 + 52.21336 = 58.24397
49.79776 + 95.02283 = 144.8206
46.15215 + 2.900972 = 49.05312
57.18979 + 83.5145 = 140.7043
72.97466 + 0.7222652 = 73.69693
8.511994 + 2.004705 = 10.5167
75.73506 + 20.94896 = 96.68402
-------------------| End VectorAdd |-------------------

------------------| Start Triangle intersection |------------------
:3: error: invalid address space for argument to __kernel function
global float4 dir,
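
Both compilers point at the same declaration. As far as the OpenCL 1.0 specification goes, only pointer arguments of a `__kernel` function take an address space qualifier (`__global`, `__constant` or `__local`); arguments passed by value are private. A possible fix, which is an assumption and not confirmed anywhere in this thread, is to drop `global` from `dir`. The sketch below shows that idea as a kernel source string embedded in C. Only the kernel name `intersect` and the parameter `float4 dir` come from the logs above; the other parameters, the body, and the helper function are invented for illustration.

```c
#include <string.h>

/* Hypothetical kernel source. Only "intersect" and "float4 dir" appear in the
 * logs; the remaining parameters and the body are placeholders. */
static const char KERNEL_SRC[] =
    "__kernel void intersect(\n"
    "    __global const float4 *triangles, /* pointer: needs an address space */\n"
    "    float4 dir,                       /* by value: no address space qualifier */\n"
    "    __global float *hits)\n"
    "{\n"
    "    size_t i = get_global_id(0);\n"
    "    hits[i] = dot(triangles[i], dir); /* placeholder, not a real intersection test */\n"
    "}\n";

/* Returns 1 if the source still contains the declaration both compilers reject. */
int has_qualified_value_param(const char *src) {
    return strstr(src, "global float4 dir") != NULL;
}
```

If `dir` really needs to live in global memory, the alternative would be to pass it as a pointer (`__global const float4 *dir`) backed by a one-element buffer, at the cost of an extra buffer on the host side.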