
Cloo performance? [ OpenCL CPU vs Pure .NET ]
Posted Friday, 27 November, 2009 - 11:17 by carga
Hello!
I managed to run the VectorAdd sample from the Cloo project. In my environment there is no GPGPU available for OpenCL, so it runs on the CPU only.
I wanted to compare Cloo performance with what .NET provides out of the box. Here is my result for vectors of 10,000,000 elements (the "GPU Time" line is the OpenCL run, which used the CPU device here):
------------------| Start VectorAdd |------------------
Dim(a)=10000000 GPU Time: 290 msec
Dim(a)=10000000 .NET Time: 87 msec
-------------------| End VectorAdd |-------------------
Pure .NET is about 3.3 times faster (290 msec vs 87 msec).
Could someone post the result of this test from an environment with a GPGPU available?
I would like to see at least a 10x speedup from OpenCL; otherwise it is a waste of time to use such a complicated technology.
Best regards,
Anton.
http://kyta.spb.ru
PS I observed a similar situation when using the Mono-to-SSE bindings: if SSE is available, we get a 2x speedup; if not, a 2x slowdown.


Comments
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
First of all thanks for testing it out.
I've been running similar tests and I can confirm this behaviour for the VectorAdd example. With a GeForce 9600GT it was running 8-9x slower than a simple for(;;) through an array of 50 000 items.
You see, Cloo and the underlying OpenCL carry some overhead. With such a small kernel (just a simple c[i] = a[i] + b[i]) that overhead outweighs any gain, so instead of a speedup there is a large performance drop.
VectorAdd is a dummy test that is supposed to tell you things work. I'll add a more complex kernel to the examples so that we have a better match against the real world.
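To make the overhead argument above concrete, here is a minimal sketch, written in plain C rather than C#/Cloo. It shows the CPU loop that VectorAdd competes against and a simple break-even estimate. The function names and the linear cost model are assumptions for illustration only; they are not part of Cloo or OpenCL, and no measured numbers are implied.

```c
#include <stddef.h>

/* Plain CPU baseline: the same per-element work as the VectorAdd kernel. */
void vector_add_cpu(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

/* Hypothetical linear cost model (every parameter is an assumption):
 *   device time = fixed_overhead + n * device_cost_per_item
 *   host time   = n * host_cost_per_item
 * Returns the smallest n for which the device time is strictly lower,
 * or 0 if the device never wins under this model. */
size_t break_even_items(double fixed_overhead,
                        double device_cost_per_item,
                        double host_cost_per_item)
{
    if (device_cost_per_item >= host_cost_per_item)
        return 0; /* per-item cost is no better, so the overhead is never repaid */

    double n = fixed_overhead / (host_cost_per_item - device_cost_per_item);
    return (size_t)n + 1;
}
```

With a hypothetical overhead of 100 time units and per-item costs of 1 (device) and 2 (host), break_even_items(100.0, 1.0, 2.0) returns 101: below that size the plain loop wins. A kernel that does more work per item widens the per-item gap and lowers the break-even point, which is the motivation for testing a more complex kernel.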
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
I'm posting some numbers that might make you happy. This test intersects a ray with a number of triangles and it was run in both OpenCL (on GF 9600 GT) and .NET (release config).
10 triangles:
100 triangles
1 000 triangles
10 000 triangles
65 535 triangles
The test will be included in the next release of Cloo.
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Yes, this is much, much better! Thank you for the good news!
I would like to test a more "general-purpose" task such as the standard N-body problem. I will post my results and source as soon as possible.
Best regards,
Anton.
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Hi nythrix,
Can you provide the code you use for your performance test? I would like to run some tests too.
Thanks
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Yes. Cloo 0.2.1 is out.
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Sounds great...
I ran it and got the following error:
"InvalidKernelNameComputeException"
Here is the stack trace:
Cloo.dll!Cloo.ComputeException.ThrowIfError(OpenTK.Compute.CL10.ErrorCode errorCode = InvalidKernelName) Line 150 C#
Cloo.dll!Cloo.ComputeKernel.ComputeKernel(string functionName = "intersect", Cloo.ComputeProgram program = {ComputeProgram(92622517)}) Line 87 + 0x8 bytes C#
Cloo.dll!Cloo.ComputeProgram.CreateKernel(string functionName = "intersect") Line 230 + 0x1b bytes C#
ClooTester.exe!ClooTester.TriangleIntersector.Run() Line 97 + 0x10 bytes C#
ClooTester.exe!ClooTester.Program.Main() Line 20 + 0x1d bytes C#
I am running the ATI SDK with:
- Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
- Graphics card: Intel Eaglelake
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Here is the output I got:
------------------| Start OpenCL platform info |------------------
For test only: Expires on Sun Feb 28 00:00:00 2010
name: ATI Stream
version: OpenCL 1.0 ATI-Stream-v2.0-beta4
profile: FULL_PROFILE
vendor: Advanced Micro Devices, Inc.
extensions:
+
devices:
name: Intel(R) Core(TM)2 Duo CPU E7400 @ 2.80GHz
driver: 1.0
vendor: GenuineIntel
extensions:
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
+ cl_khr_local_int32_base_atomics
+ cl_khr_local_int32_extended_atomics
+ cl_khr_byte_addressable_store
-------------------| End OpenCL platform info |-------------------
------------------| Start Program binary |------------------
file://C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACC.tmp.dll
-------------------| End Program binary |-------------------
------------------| Start Memory Object mapping |------------------
Original content:
6754592535810294784
5609699999583784960
7831835295529076736
8435040577537347584
4115809016406119424
5747797109613563904
Mapped content:
6754592535810294784
5609699999583784960
7831835295529076736
8435040577537347584
4115809016406119424
5747797109613563904
-------------------| End Memory Object mapping |-------------------
------------------| Start VectorAdd |------------------
33,86629 + 96,28593 = 130,1522
59,61269 + 27,52616 = 87,13885
31,76719 + 9,762908 = 41,53009
92,57583 + 97,41137 = 189,9872
19,51515 + 12,30295 = 31,8181
21,562 + 42,41216 = 63,97417
64,71769 + 8,006876 = 72,72456
54,36967 + 78,10004 = 132,4697
28,51088 + 84,06319 = 112,5741
25,838 + 5,153872 = 30,99187
-------------------| End VectorAdd |-------------------
------------------| Start Triangle intersection |------------------
C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACE.tmp.cl(3): error: a parameter cannot
be allocated in a named address space
global float4 dir,
^
1 error detected in the compilation of "C:\DOCUME~1\guest\LOCALS~1\Temp\OCL1ACE.tmp.cl".
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Strange, this doesn't occur with nVidia drivers. I have to find out whether this is caused by the kernel or by the ATI drivers. Anyway, I can't tell for sure until I get home tonight.
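The compiler message quoted above points at the kernel source rather than the driver: in OpenCL C, a kernel argument passed by value lives in the private address space, so only pointer arguments may carry the global qualifier. The following is a hedged sketch, not the actual Cloo test kernel. It is written so that it also compiles as plain C: the float4 struct and the empty kernel/global macros are host-side shims, the name intersect_sketch, its parameter list apart from dir, and its body are hypothetical, and the index i stands in for get_global_id(0).

```c
/* Host-side shims so the OpenCL-C-style code below compiles as plain C.
 * A real .cl file does not need them. */
typedef struct { float x, y, z, w; } float4;
#define kernel
#define global

/* Form rejected in the log above (address space on a by-value argument):
 *     global float4 dir,
 *
 * Hypothetical corrected signature: dir is passed by value with no
 * address-space qualifier; only the pointer arguments keep global. */
kernel void intersect_sketch(float4 dir,
                             global const float4 *verts,
                             global float *out,
                             int i /* stands in for get_global_id(0) */)
{
    /* Placeholder body (a dot product), not the real intersection test. */
    out[i] = dir.x * verts[i].x + dir.y * verts[i].y + dir.z * verts[i].z;
}
```

An alternative under the same rule would be to keep the data in a buffer and declare the argument as a pointer (global float4 *dir), at the cost of an extra buffer on the host side.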
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
Thanks,
I have posted about this on the AMD forums: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=123037...
I'll do my best to help you, if you don't find the cause first ;-)
Re: Cloo performance? [ OpenCL CPU vs Pure .NET ]
I get the same error on nVidia drivers:
------------------| Start OpenCL platform info |------------------
name: NVIDIA CUDA
version: OpenCL 1.0 CUDA 3.0.1
profile: FULL_PROFILE
vendor: NVIDIA Corporation
extensions:
+ cl_khr_byte_addressable_store
+ cl_khr_gl_sharing
+ cl_nv_compiler_options
+ cl_nv_device_attribute_query
devices:
name: GeForce 9600 GT
driver: 195.62
vendor: NVIDIA Corporation
extensions:
+ cl_khr_byte_addressable_store
+ cl_khr_gl_sharing
+ cl_nv_compiler_options
+ cl_nv_device_attribute_query
+
+ cl_khr_global_int32_base_atomics
+ cl_khr_global_int32_extended_atomics
-------------------| End OpenCL platform info |-------------------
------------------| Start Program binary |------------------
//
// Generated by NVIDIA NVPTX Backend for LLVM
//
.version 1.5
.target sm_11, texmode_independent, map_f64_to_f32
// Global Launch Offsets
.const[0] .s32 %_global_block_offset[3];
.const[0] .s32 %_global_launch_offset[3];
.const[0] .s32 %_global_num_groups[3];
.const[0] .s32 %_global_size[3];
.const[0] .u32 %_work_dim;
.const .align 8 .b8 def___internal_i2opi_d[144] = { 0x08, 0x5D, 0x8D, 0x1F, 0xB1, 0x5F, 0xFB, 0x6B, 0xEA, 0x92, 0x52, 0x8A, 0xF7, 0x39, 0x07, 0x3D, 0x7B, 0xF1, 0xE5, 0xEB, 0xC7, 0xBA, 0x27, 0x75, 0x2D, 0xEA, 0x5F, 0x9E, 0x66, 0x3F, 0x46, 0x4F, 0xB7, 0x09, 0xCB, 0x27, 0xCF, 0x7E, 0x36, 0x6D, 0x1F, 0x6D, 0x0A, 0x5A, 0x8B, 0x11, 0x2F, 0xEF, 0x0F, 0x98, 0x05, 0xDE, 0xFF, 0x97, 0xF8, 0x1F, 0x3B, 0x28, 0xF9, 0xBD, 0x8B, 0x5F, 0x84, 0x9C, 0xF4, 0x39, 0x53, 0x83, 0x39, 0xD6, 0x91, 0x39, 0x41, 0x7E, 0x5F, 0xB4, 0x26, 0x70, 0x9C, 0xE9, 0x84, 0x44, 0xBB, 0x2E, 0xF5, 0x35, 0x82, 0xE8, 0x3E, 0xA7, 0x29, 0xB1, 0x1C, 0xEB, 0x1D, 0xFE, 0x1C, 0x92, 0xD1, 0x09, 0xEA, 0x2E, 0x49, 0x06, 0xE0, 0xD2, 0x4D, 0x42, 0x3A, 0x6E, 0x24, 0xB7, 0x61, 0xC5, 0xBB, 0xDE, 0xAB, 0x63, 0x51, 0xFE, 0x41, 0x90, 0x43, 0x3C, 0x99, 0x95, 0x62, 0xDB, 0xC0, 0xDD, 0x34, 0xF5, 0xD1, 0x57, 0x27, 0xFC, 0x29, 0x15, 0x44, 0x4E, 0x6E, 0x83, 0xF9, 0xA2 };
.entry vectorAdd
(
.param .b32 vectorAdd_param_0,
.param .b32 vectorAdd_param_1,
.param .b32 vectorAdd_param_2
)
{
.reg .f32 %f<4>;
.reg .s32 %r<9>;
_vectorAdd:
{
// get_global_id(0)
.reg .u32 %vntidx;
.reg .u32 %vctaidx;
.reg .u32 %vtidx;
cvt.u32.u16 %vntidx, %ntid.x;
cvt.u32.u16 %vctaidx, %ctaid.x;
cvt.u32.u16 %vtidx, %tid.x;
mad.lo.s32 %r1, %vntidx, %vctaidx, %vtidx;
.reg .u32 %temp;
ld.const.u32 %temp, [%_global_launch_offset+0];
add.u32 %r1, %r1, %temp;
}
shl.b32 %r2, %r1, 2;
ld.param.u32 %r3, [vectorAdd_param_1];
ld.param.u32 %r4, [vectorAdd_param_0];
add.s32 %r5, %r4, %r2;
add.s32 %r6, %r3, %r2;
ld.param.u32 %r7, [vectorAdd_param_2];
ld.global.f32 %f1, [%r5];
ld.global.f32 %f2, [%r6];
add.rn.ftz.f32 %f3, %f1, %f2;
add.s32 %r8, %r7, %r2;
st.global.f32 [%r8], %f3;
ret;
}
-------------------| End Program binary |-------------------
------------------| Start Memory Object mapping |------------------
Original content:
2934802780230322176
8527569823997102080
204372612966262688
7046751857986121728
3974656866072502784
1309497746332879616
Mapped content:
2934802780230322176
8527569823997102080
204372612966262688
7046751857986121728
3974656866072502784
1309497746332879616
-------------------| End Memory Object mapping |-------------------
------------------| Start VectorAdd |------------------
56.57155 + 58.45175 = 115.0233
58.53479 + 70.96561 = 129.5004
7.859627 + 63.40628 = 71.26591
6.03061 + 52.21336 = 58.24397
49.79776 + 95.02283 = 144.8206
46.15215 + 2.900972 = 49.05312
57.18979 + 83.5145 = 140.7043
72.97466 + 0.7222652 = 73.69693
8.511994 + 2.004705 = 10.5167
75.73506 + 20.94896 = 96.68402
-------------------| End VectorAdd |-------------------
------------------| Start Triangle intersection |------------------
:3: error: invalid address space for argument to __kernel function
global float4 dir,