nythrix's picture

Cloo - Compute Language, Object Oriented

The first testing release is out! Grab a copy and test your OpenCL installation.

Please report any findings!

P.S: The support for images is a work in progress so any related API method will punch you with a NotImplementedException. You don't have to report those.


Comments

the Fiddler's picture

I hope to have the image-related wrappers fixed by tomorrow.

nythrix's picture

Thanks. No rush, though. I can get busy elsewhere in the code. As for you, I guess the priority ATM is OpenTK 1.0.

carga's picture

Hello,

I would like to use the kernel from the NBody demo, which has this signature:

kernel void nbody_sim(
    global float4* pos,
    global float4* vel,
 
    int numBodies,
    float deltaTime,
    float epsSqr,
 
    local float4* localPos)

What C# type should be mapped to float4*? Is it possible to use float[4, SIZE]? What type should I provide to ComputeBuffer?

Is there a C# struct in Cloo that is designed to map to vector types?

Best regards,
Anton.

nythrix's picture

You can use any struct that has exactly 4 float fields:

struct Vector4f
{
    float x;
    float y;
    float z;
    float w;
 
    // methods here ....
}

If you don't have such a structure in your project you can use OpenTK.Vector4 instead.
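The reason any such struct works is that an OpenCL float4 is simply four consecutive 32-bit floats (16 bytes). A minimal plain-C sketch of that layout check follows; it assumes 4-byte floats, and `vec4_array_bytes` is a hypothetical helper that is not part of Cloo or OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side mirror of an OpenCL float4: four consecutive 32-bit floats. */
typedef struct {
    float x;
    float y;
    float z;
    float w;
} Vector4f;

/* Hypothetical helper: bytes needed for a buffer of `count` float4 items. */
static size_t vec4_array_bytes(size_t count)
{
    assert(sizeof(Vector4f) == 4 * sizeof(float)); /* no padding expected */
    return count * sizeof(Vector4f);
}
```

If the struct had a different size or field order, the kernel would read shifted or scrambled components, so a check like this is cheap insurance.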

carga's picture

Thank you for the OpenTK.Vector4 idea!

Currently I do the following:

Vector4[] pos = new Vector4[count];
Vector4[] vel = new Vector4[count];
Vector4[] buf = new Vector4[count];
 
ComputeBuffer<Vector4> a = new ComputeBuffer<Vector4>(context, MemFlags.MemReadWrite | MemFlags.MemCopyHostPtr, pos);
ComputeBuffer<Vector4> b = new ComputeBuffer<Vector4>(context, MemFlags.MemReadWrite | MemFlags.MemCopyHostPtr, vel); 
ComputeBuffer<Vector4> c = new ComputeBuffer<Vector4>(context, MemFlags.MemReadWrite | MemFlags.MemCopyHostPtr, buf);
 
kernel.SetMemoryArg(0, a);
kernel.SetMemoryArg(1, b);
kernel.SetValueArg(2, count);
kernel.SetValueArg(3, 0.1f);
kernel.SetValueArg(4, 1e-6f);
 
kernel.SetMemoryArg(5, c);

When executing the last call (the argument with index 5), I receive a ComputeException with ErrorCode.InvalidArgValue.

If I do not set that argument at all, I receive a ComputeException with ErrorCode.InvalidKernelArgs.

How should I initialize the kernel argument declared as local float4* localPos?

Thank you in advance,
Anton.

nythrix's picture

I've never tried setting a local argument. Chapter 3.3 of the OpenCL specs:

Local Memory: A memory region local to a work-group. This memory region can be
used to allocate variables that are shared by all work-items in that work-group. It may be
implemented as dedicated regions of memory on the OpenCL device. Alternatively, the
local memory region may be mapped onto sections of the global memory.

Table 3.1 states that the host cannot access (read or write) such arguments; it can only allocate them. Try removing the MemFlags.MemReadWrite flag when you create c and see what happens.

Edit: You can also create buffers without specifying an array:
ComputeBuffer<float> c = new ComputeBuffer<float>( context, flags, count );

carga's picture

I've tried
ComputeBuffer<Vector4> c = new ComputeBuffer<Vector4>(context, MemFlags.MemUseHostPtr, buf);
but I still get ErrorCode.InvalidArgValue.

Then I replaced "local" with "global" in the kernel signature and it works now.

Thank you very much!
Anton.

viewon01's picture

Any news about the "local" problem?

Thx

nythrix's picture

This is what I've found in the OpenCL specs:

If the argument is declared with the __local qualifier, the entry arg_value must be null

However, the current implementation will probably crash if you call kernel.Set*Arg( index, null ). I'm working on a fix. I will also post a howto on setting kernel arguments. It's a dark area where I get lost too.
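For reference, in the OpenCL C API a local argument is set by passing the number of bytes to allocate together with a NULL value pointer. The sketch below only computes that byte count in plain C. It assumes the kernel stores one float4 per work-item of the work-group, and `local_arg_bytes` and the work-group size of 64 are illustrative assumptions, not taken from the NBody demo.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of local memory for `local float4* localPos`, assuming the kernel
   stores one float4 (4 floats) per work-item of the work-group. */
static size_t local_arg_bytes(size_t work_group_size)
{
    assert(work_group_size > 0);
    return work_group_size * 4 * sizeof(float);
}

/* With the OpenCL C API the argument would then be set as:
       clSetKernelArg(kernel, 5, local_arg_bytes(64), NULL);
   i.e. a non-zero size and a NULL value pointer. */
```

The key point is that no host buffer is involved at all: the host states a size, and the device allocates the memory per work-group.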

carga's picture

Thank you very much for the new release (0.3.1) and for the new test (KernelArgsTester).

In my environment (Intel CPU), compilation of the kernel fails with the following error messages:

------------------| Start Kernel args test |------------------
C:\Users\xxx\AppData\Local\Temp\OCL9414.tmp.cl(4): error: kernel pointer
          arguments must point to addrSpace global, local, or constant
      kernel void k03(          image3d_t img ) {}
                  ^
 
C:\Users\xxx\AppData\Local\Temp\OCL9414.tmp.cl(14): error: a parameter
          cannot be allocated in a named address space
      kernel void k11( global   image3d_t img ) {}
                       ^
 
C:\Users\xxx\AppData\Local\Temp\OCL9414.tmp.cl(14): error: kernel pointer
          arguments must point to addrSpace global, local, or constant
      kernel void k11( global   image3d_t img ) {}
                  ^
 
3 errors detected in the compilation of "C:\Users\xxx\AppData\Local\Temp\OCL9414.tmp.cl".
    image3d_t im
-------------------| End Kernel args test |-------------------

That's ok, but just for your info...

Are you going to implement some kind of automatic .NET-to-OpenCL kernel translation? There is the Brahma project, http://brahma.ananthonline.net/, which took some steps toward a similar goal, but it is completely stalled now. =((( It was an attempt to translate a general computational LINQ expression into its parallel equivalent and execute it on the GPU using a deprecated set of DirectX GPGPU libraries.

I do not feel comfortable enough writing parallel expressions in LINQ, but LINQ is still better than OpenCL. =) I mean, studying a new, deeply graphics-oriented C-like dialect is not exactly my dream.

Have a fast code!
Anton.
http://kyta.spb.ru

nythrix's picture

As I pointed out before, defining/setting kernel args is a bit obscure. It would have been great if Khronos had included a summary table with clSetKernelArg.

With KernelArgsTester I set out to try every possible combination of global/constant/local/none with simple type/image/sampler/buffer. Then I commented out the ones that don't compile. I'll recheck this example when I get home.

The LINQ to OpenCL conversion is quite an interesting idea. It is definitely worth considering. However it requires three things:
1) Me learning enough LINQ to tell whether this is possible at all. Probably yes, but you didn't hear me promise anything.
2) Cloo (and possibly the whole Xloo/OpenTK 2.0) will have to target C# 3.0. Which hasn't been discussed yet.
3) Enough time for me to actually implement the thing. Given my ongoing exams season, that's not to happen until February. Or even spring.

carga's picture
nythrix wrote:

With KernelArgsTester I set out to try every possible combination of global/constant/local/none with simple type/image/sampler/buffer. Then I commented out the ones that don't compile. I'll recheck this example when I get home.

No problem. Chances are that the problem is on my side: you provide kernels for the nVidia implementation and I try to compile them using ATI's 2.0-beta4 driver.

nythrix wrote:

The LINQ to OpenCL conversion is quite an interesting idea. It is definitely worth considering. However it requires three things:
1) Me learning enough LINQ to tell whether this is possible at all. Probably yes, but you didn't hear me promise anything.

I am not advertising LINQ (I just mentioned the Brahma project as a reference): a) MS has already announced PLINQ; b) it is hard to write general computations in this syntax. We all like conditions and loops and all the other procedural benefits that C-like languages give us. =)

Hmm!.. I wonder about starting with an Expression-to-kernel conversion. System.Linq.Expressions trees are a very general way to represent a program tree with all its conditional branching and loops. LINQ is no more than a short way to write some complicated Expression tree...

But at the end of the game I dream of having a stand-alone .NET class written completely in C#, [probably] without any external dependencies and [probably] completely covered with the usual unit tests. This class performs just one CPU-intensive task and it _IS_ able to do the job, but it is too slow. Then I dream of this class being able to automatically analyze its own IL and emit the corresponding OpenCL kernel. After that (thanks to Cloo) it takes just a few seconds to get a 10-100 times speedup with OpenCL-on-CPU or even a 100-1000 times speedup with OpenCL-on-GPU! Does anybody have a robust IL-to-OpenCL translator? =DDD

nythrix wrote:

2) Cloo (and possibly the whole Xloo/OpenTK 2.0) will have to target C# 3.0. Which hasn't been discussed yet.

It's a serious point. =|

nythrix wrote:

3) Enough time for me to actually implement the thing. Given my ongoing exams season, that's not to happen until February. Or even spring.

Even a more serious point. But you are ready to show them the excellence, aren't you? ;-)

Have a fast code!
Anton.
http://kyta.spb.ru

bungee's picture

Hi,

I have some samples working and the performance is excellent. It could be even better if I found a way to send my polygons to the OpenCL program using Cloo and C#. I have no idea how to do that.
My second question is how to create and pass an array of float2 from C#.

Cheers

Michael

nythrix's picture
Quote:

It could be even better if I found a way to send my polygons to the OpenCL program using Cloo and C#.

How do you store your polygons in main memory?

Quote:

My second question is how to create and pass an array of float2 from C#.

Any way you like. If you declare a float2* in the kernel you can just as well pass a float[]. The kernel will then read the items in pairs. Just don't forget that the kernel sees half as many float2 items as there are floats in the array, so watch your for loops :)
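To make the pairing concrete, here is a small plain-C sketch of how a flat float array lines up with float2 items. `float2_count` and `read_pair` are hypothetical helper names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of float2 items a kernel sees in a float array of `float_count`. */
static size_t float2_count(size_t float_count)
{
    assert(float_count % 2 == 0); /* an odd length would leave half a pair */
    return float_count / 2;
}

/* Read the i-th pair the way a `global float2*` kernel argument would. */
static void read_pair(const float *data, size_t i, float *x, float *y)
{
    *x = data[2 * i];
    *y = data[2 * i + 1];
}
```

So a float[6] holds three float2 items, and item 2 is made of elements 4 and 5 of the original array.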

Edit: Or you can use any struct that has exactly two floats:

struct MyFloat2
{
   float f1, f2;
}

You can use this struct both in a ComputeBuffer and inside your OpenCL code:

typedef struct
{
   float f1, f2;
} MyFloat2;
 
kernel void K( global MyFloat2* floatArray )
{
...
}

bungee's picture

Thank you.

Currently I get them as WKB http://dev.mysql.com/doc/refman/5.0/en/gis-wkb-format.html from the database

Is there a good way to pass strings from C#, for example to encrypt/decrypt them?

nythrix's picture

I don't have much experience with SQL and I've never seen WKB before. I don't even know what you're trying to achieve (rendering?) therefore I'm not sure what to suggest.
Instead I'm posting the signature of a kernel that does some sort of hybrid raytracing. You probably don't need all this so take it as purely inspirational code:

kernel void render( 
           read_only  int     width,  // frame width
           read_only  int     height, // frame height
    global write_only float4* renderbuffer,  // the rendered image
    global            float*  depthbuffer,   // depth buffer information
    global read_only  float*  caminfo,  // camera info: location, view direction, up vector, fov, clipping distances
    global read_only  float*  inversematrix,  //  the inverse of the transformation matrix for this mesh. it is applied to the ray before ray-mesh intersection test occurs
           read_only  float4  boundvol, // the bounding sphere around this mesh
    global read_only  float*  vertices, // vertex positions (v1.x, v1.y, v1.z, v2.x, v2.y, v2.z, etc. )
    global read_only  uint*   indices,   // indices that point to the vertices array. specifies how to create triangles (v3, v6, v2)
           read_only  uint    prim_count ) // number of triangles in indices

As you see you can pass your polys any way you like. Except for some limits that OpenCL specifies that is.
Passing a string would look like this.
OpenCL code:

kernel void Test( global char* str )
{...}

Cloo code:

ComputeBuffer<char> str = new ... ;
kernel.SetMemoryArgument( 0, str );

Last but not least: watch out for char encoding differences between .NET and OpenCL C.
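The encoding difference in a nutshell: a .NET char is a 16-bit UTF-16 code unit, while an OpenCL C char is 8 bits wide. A minimal plain-C sketch of narrowing one to the other follows; `utf16_to_ascii` is a hypothetical helper, and replacing non-ASCII characters with '?' is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow UTF-16 code units (what a .NET char holds) to 8-bit chars
   (what OpenCL C calls char). Anything outside ASCII becomes '?'. */
static void utf16_to_ascii(const uint16_t *src, size_t n, char *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = (src[i] < 0x80) ? (char)src[i] : '?';
    }
}
```

On the C# side the equivalent step would be converting the string to a byte array before creating the buffer, so that the kernel's char* really walks one character per byte.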

bungee's picture

Great. The first thing is now clear. But I still have problems with strings.

What I want is to pass an array of strings and AES-encrypt them. Since string is not working, I thought I could make my own 'NameField' which contains a char[30].

The definition:

public struct NameField
{
    private char[] arr1;
 
    public NameField(char[] charVal)
    {
        this.arr1 = new char[30];
        for (int i = 0; i < 30; i++)
        {
            if (i < charVal.Length)
                this.arr1[i] = charVal[i];
            else
                this.arr1[i] = ' ';
        }
    }
}

Creating and filling the array works:

NameField[] source = new NameField[5000];
 
for (int i = 0; i < 5000; i++)
{
    // one NameField per entry; the constructor pads or truncates to 30 chars
    source[i] = new NameField("AbCdEfGhIjKlM nOpQrStUvWxYz".ToCharArray());
}
 
ComputeBuffer<NameField> a = new ComputeBuffer<NameField>(_AesEncryptionKernel.Context,
                                                          ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, source);

But transferring it to OpenCL causes an error in ComputeBuffer.cs at line 128 (GCHandle dataPtr = GCHandle.Alloc( data, GCHandleType.Pinned )):
Object does not contain simple Data
It's possibly because a char[] is not simple.

OpenCL Code

typedef struct {
char arr1[30];
} NameField;
__kernel void AesEncrypt(global read_only NameField* arrIn,global write_only NameField* arrOut)
{
...
}
 
nythrix's picture

This is not possible. OpenCL forbids pointer-to-pointer kernel arguments, and on the C# side a char[] field is a reference to a separate array object rather than 30 inline characters. Therefore you cannot pass an array of structs that contain an inner array. As a result, Cloo issues a warning (or crashes) when you attempt it.
If your strings are always char[30], you can pack them all inside a single ComputeBuffer of length 30*5000 (or whatever). You can then read/write different pieces of the buffer through the ComputeCommandQueue.Read/Write methods.
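The packing idea can be sketched in plain C as follows. Record i occupies bytes i*30 to i*30+29 of one flat buffer, which is the layout the kernel's `NameField` struct with `char arr1[30]` expects. `pack_name` and `FIELD_LEN` are hypothetical names, and padding with spaces mirrors the NameField constructor above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FIELD_LEN 30 /* fixed record width, as in NameField */

/* Copy `name` into record `index` of a flat buffer, padding with spaces. */
static void pack_name(char *flat, size_t index, const char *name)
{
    size_t len = strlen(name);
    assert(flat != NULL);
    if (len > FIELD_LEN) {
        len = FIELD_LEN; /* truncate names that are too long */
    }
    memset(flat + index * FIELD_LEN, ' ', FIELD_LEN);
    memcpy(flat + index * FIELD_LEN, name, len);
}
```

With 5000 names the flat buffer is 150000 bytes long, and the whole thing can be uploaded as one array of a simple element type.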

carga's picture

Very unexpected behavior: when I call the same kernel with different parameters on the GPU, it hangs after about 50 calls on average. I have also noticed that the GPU calls slow down from iteration to iteration.

Basically I do the following:
1. Prepare context for GPU platform (static member)
2. Compile program (static member)
3. Create kernel from program (static member)

4. Then I loop:
4.1. Prepare input data (create ComputeBuffer)
4.2. Set kernel arguments
4.3. Create command queue
4.4. Execute
4.5. Read result

I do not experience any problem in this scenario when executing on CPU. Also CPU version works much faster.

Is there any problem with this scenario?

Should I explicitly free/dispose compute buffers after the result is read from ComputeQueue?

Have a fast code!
Anton.
http://kyta.spb.ru

PS OS Ubuntu 10.04 64bit, ATI Stream SDK 2.1, Mono 2.6.4, Cloo 0.7.2, ATI Radeon HD 5750.

carga's picture
carga wrote:

Very unexpected behavior: when I call the same kernel with different parameters on the GPU, it hangs after about 50 calls on average. I have also noticed that the GPU calls slow down from iteration to iteration <...>

Since my last message I have changed the code: the job is now done in a single kernel call. But now there is another issue: my computer hangs during the computation (on the GPU). The only way to wake it up is to press the reset button. Note that exactly the same kernel works fine on the CPU.

Is there a workaround for my case? I am trying to execute the kernel over a global work size of 256x256x256. According to the platform info, the GPU supports such dimensions. It also has 9 compute units (the CPU has only 2).

After the kernel starts, my display freezes: no mouse movement, no cursor blinking. For shorter tasks it wakes up after a while, but THIS long-running kernel completely kills the PC. Does anybody else experience a similar problem? What is the solution?

Thank you in advance,
Anton.
http://kyta.spb.ru

PS OS Ubuntu 10.04 64bit, ATI Stream SDK 2.1, Mono 2.6.4, Cloo 0.7.2, ATI Radeon HD 5750.

nythrix's picture
carga wrote:

Very unexpected behavior: when I call the same kernel with different parameters on the GPU, it hangs after about 50 calls on average. I have also noticed that the GPU calls slow down from iteration to iteration.

Basically I do the following:
1. Prepare context for GPU platform (static member)
2. Compile program (static member)
3. Create kernel from program (static member)

4. Then I loop:
4.1. Prepare input data (create ComputeBuffer)
4.2. Set kernel arguments
4.3. Create command queue
4.4. Execute
4.5. Read result

I do not experience any problem in this scenario when executing on CPU. Also CPU version works much faster.

Is there any problem with this scenario?

Should I explicitly free/dispose compute buffers after the result is read from ComputeQueue?

When creating a large number of Cloo/OpenCL objects with data in GPU memory, it may be necessary to manually dispose them because the GC cannot know the GPU memory consumption rate. This rate is usually much higher than the consumption rate of the available RAM (which holds only pointers to the native OpenCL objects). Therefore, the created objects may not be subjected to garbage collection until it's too late.

carga wrote:

Since my last message I have changed the code: the job is now done in a single kernel call. But now there is another issue: my computer hangs during the computation (on the GPU). The only way to wake it up is to press the reset button. Note that exactly the same kernel works fine on the CPU.

Is there a workaround for my case? I am trying to execute the kernel over a global work size of 256x256x256. According to the platform info, the GPU supports such dimensions. It also has 9 compute units (the CPU has only 2).

After the kernel starts, my display freezes: no mouse movement, no cursor blinking. For shorter tasks it wakes up after a while, but THIS long-running kernel completely kills the PC. Does anybody else experience a similar problem? What is the solution?

Max global work dims apply only to very small or empty kernels (that's marketing for you). Complex ones take up a lot of space which affects the available memory for memory buffers, images and the maximum number of running threads.

Try decreasing the dimension sizes.

These and other problems don't usually occur when running kernels on the CPU because it has much larger memory at its disposal.
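One way to decrease the dimension sizes without shrinking the problem is to cover the 256x256x256 range with several smaller launches, for example 16 launches of 256x256x16 (1,048,576 work-items each). The plain-C sketch below only computes the slab boundaries. `Slab` and `split_depth` are hypothetical names, and it assumes the kernel is changed to take the z offset as an extra argument, which is not part of the code discussed above.

```c
#include <assert.h>
#include <stddef.h>

/* One launch covering z in [z_offset, z_offset + z_count). */
typedef struct {
    size_t z_offset;
    size_t z_count;
} Slab;

/* Split `depth` z-layers into slabs of at most `max_layers` each.
   Returns the number of slabs written to `out` (capacity `cap`). */
static size_t split_depth(size_t depth, size_t max_layers, Slab *out, size_t cap)
{
    size_t n = 0;
    assert(max_layers > 0);
    for (size_t z = 0; z < depth && n < cap; z += max_layers) {
        out[n].z_offset = z;
        out[n].z_count = (depth - z < max_layers) ? depth - z : max_layers;
        n++;
    }
    return n;
}
```

Each slab would then be enqueued as its own kernel call with a global size of 256 x 256 x z_count, so that no single call runs for long.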