CodyIrons's picture

Converting the N-Body C tutorial

Hi Guys,

I got some free time and decided to grab all the latest OpenCL / Cloo stuff. I want to convert the C tutorial linked from ATI's Stream SDK to C#, but I'm having a bit of a conundrum. In the tutorial found here: they make use of the type cl_float4, which maps nicely onto their kernel when they want to use

__kernel void nbody_kern(
float dt1, float eps,
__global float4* pos_old,
__global float4* pos_new,
__global float4* vel,
__local float4* pblock

Do Cloo or OpenTK provide these types and I'm just not seeing them? I was trying the following to get the values over there:

float[] arrPos1 = new float[nparticle * 4];
ComputeBuffer<float> pos1 = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrPos1);

but I just don't have faith that the float4* in the kernel is going to behave properly on my float ComputeBuffer.

Then a bonus question:
The last argument in their kernel is a __local float4* pblock, which they set with clarg_set_local(krn,5,nthread*sizeof(cl_float4));
I'm trying to mentally map that into something like this with Cloo: kernel.SetArgument(5, IntPtr.Zero, IntPtr.Zero);
But I just can't seem to figure out what I should be setting argument 5 to.

Has anyone already tried converting this tutorial over to Cloo? I'm not worried about displaying it just yet; I just want to get it calculating the numbers.



*I have to admit I'm a bit rusty with the Cloo bindings, as I haven't made free time for messing with them since version 0.4.1, I think.
Here is the code I hacked together, for good measure. It fails at runtime at SetArgument 5.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Runtime.InteropServices;
using Cloo;

namespace NBodyOpenCL
{
    class Program
    {
        private static string kernelSource = @"
__kernel void nbody_kern(
    float dt1, float eps,
    __global float4* pos_old,
    __global float4* pos_new,
    __global float4* vel,
    __local float4* pblock
) {
    const float4 dt = (float4)(dt1,dt1,dt1,0.0f);
    int gti = get_global_id(0);
    int ti = get_local_id(0);
    int n = get_global_size(0);
    int nt = get_local_size(0);
    int nb = n/nt;
    float4 p = pos_old[gti];
    float4 v = vel[gti];
    float4 a = (float4)(0.0f,0.0f,0.0f,0.0f);
    for(int jb=0; jb < nb; jb++) {
        pblock[ti] = pos_old[jb*nt+ti];
        for(int j=0; j<nt; j++){
            float4 p2 = pblock[j];
            float4 d = p2 - p;
            float invr = rsqrt(d.x*d.x + d.y*d.y + d.z*d.z + eps);
            float f = p2.w*invr*invr*invr;
            a += f*d;
        }
    }
    p += dt*v + 0.5f*dt*dt*a;
    v += dt*a;
    pos_new[gti] = p;
    vel[gti] = v;
}";

        static void Main(string[] args)
        {
            ComputePlatform platform = ComputePlatform.Platforms[0];
            ComputeContextPropertyList properties = new ComputeContextPropertyList(platform);
            ComputeContext context = new ComputeContext(platform.Devices, properties, null, IntPtr.Zero);
            int step, burst;
            int nparticle = 8192;
            int nstep = 100;
            int nburst = 20;
            int nthread = 64;
            Random rand = new Random();
            float dt = 0.0001f;
            float eps = 0.0001f;
            float[] arrPos1 = new float[nparticle * 4];
            for (int i = 0; i < arrPos1.Length; i++) { arrPos1[i] = (float)rand.NextDouble(); }
            float[] arrPos2 = new float[nparticle * 4];
            float[] arrVel = new float[nparticle * 4];
            for (int i = 0; i < arrVel.Length; i++) { arrVel[i] = 0.0f; }
            ComputeBuffer<float> pos1 = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrPos1);
            ComputeBuffer<float> pos2 = new ComputeBuffer<float>(context, ComputeMemoryFlags.WriteOnly, arrPos2.Length);
            ComputeBuffer<float> vel = new ComputeBuffer<float>(context, ComputeMemoryFlags.WriteOnly, arrVel.Length);
            ComputeProgram program = new ComputeProgram(context, new string[] { kernelSource });
            program.Build(null, null, null, IntPtr.Zero);
            ComputeKernel kernel = program.CreateKernel("nbody_kern");
            kernel.SetValueArgument<float>(0, dt);
            kernel.SetValueArgument<float>(1, eps);
            kernel.SetMemoryArgument(2, pos1);
            kernel.SetMemoryArgument(3, pos2);
            kernel.SetMemoryArgument(4, vel);
            // fails here at runtime
            kernel.SetArgument(5, IntPtr.Zero, IntPtr.Zero);
            ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
            ComputeEventList events = new ComputeEventList();
            commands.Execute(kernel, null, new long[] { nparticle * 4 }, null, events);
            arrPos2 = new float[nparticle * 4];
            arrVel = new float[nparticle * 4];
            GCHandle arrPos2Handle = GCHandle.Alloc(arrPos2, GCHandleType.Pinned);
            GCHandle arrVelHandle = GCHandle.Alloc(arrVel, GCHandleType.Pinned);
            commands.Read(pos2, false, 0, nparticle * 4, arrPos2Handle.AddrOfPinnedObject(), events);
            commands.Read(vel, false, 0, nparticle * 4, arrVelHandle.AddrOfPinnedObject(), events);
        }
    }
}


nythrix's picture

1) That definitely works. OpenCL operates at a lower level, where a buffer like this is just treated as a memory area filled with floats. The following holds true for any primitive type T: T16 = T8[2] = T4[4] = T2[8] = T[16]. That said, you can use OpenTK vectors and matrices on the host, or even make up your own (how about float64x64 :). If you watch out for alignment quirks, you can also mix elaborate structs and arrays of them.
2) Try: kernel.SetArgument(5, new IntPtr(nthread * 4 * sizeof(float)), IntPtr.Zero);
You can also use nthread * Marshal.SizeOf(typeof(Vector4)), where Vector4 is a struct of 4 floats such as the one in OpenTK.
Edit: Code line.
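
The two points above can be sketched on the host side in plain C, with no OpenCL or Cloo involved. The names float4_host, local_block_bytes and same_layout are made up for this illustration and come from no library. The sketch assumes a struct of four floats carries no padding, which holds on common platforms but is worth checking on yours.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical host-side stand-in for cl_float4: four consecutive floats. */
typedef struct { float x, y, z, w; } float4_host;

/* Bytes of __local storage needed to hold one float4 per work-item,
   i.e. the size passed for a __local float4* kernel argument. */
size_t local_block_bytes(size_t nthread) {
    return nthread * sizeof(float4_host);
}

/* Returns 1 when a flat float array and an array of float4_host hold
   byte-for-byte identical data, 0 otherwise. */
int same_layout(const float *flat, const float4_host *vec, size_t nvec) {
    return memcmp(flat, vec, nvec * sizeof(float4_host)) == 0;
}
```

With nthread = 64 and 4-byte floats, local_block_bytes(64) comes to 1024 bytes, the same value as nthread * 4 * sizeof(float).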

CodyIrons's picture

Ah, that is very good to know. Hopefully I'll get to play with this when I get home from work today. I keep coming across the "OpenCL Galaxies Demo" on YouTube and I think it would be pretty cool to have a working version using Cloo and C#.

LikeKT's picture

Hi, I was looking into the OpenCL SDK example from Nvidia.

I have the following questions regarding creating compute buffers for input arguments:

Question 1:
if the input is
int MeshHeight = 8*64;

is it necessary to define

int[] MeshHeight = new int[1];
MeshHeight[0] = 8*64;
 ComputeBuffer<int> mMeshHeight = new ComputeBuffer<int>(context, ComputeMemoryFlags.WriteOnly, MeshHeight);

Question 2:

Looking at the code below => createVBO(),
2(a) what is the right way to create the ComputeBuffer for vbo (which is cl_mem)?
2(b) what is the equivalent of cl_mem in

//How to convert this C++ statement into Cloo.NET?
vbo_cl = clCreateFromGLBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, *vbo, NULL);
//One possibility I can think of; however, what is the right <T> for the statement below? (int[] vbo = new int[1])
ComputeBuffer<T>.CreateFromGLBuffer<T>(cxGPUContext, ComputeMemoryFlags.WriteOnly, vbo[0]);

Question 3:

After defining the right ComputeBuffer for vbo (call it mVBO),

how do I define the argument input for the kernel?

//Given (unsigned int VboSize = mesh_width * mesh_height * 4 * sizeof(float);)
IntPtr mVBOSize = new IntPtr(VboSize);
//Which of the statements below is correct, or are they interchangeable?
kernel.SetArgument(0, mVBOSize, mVBO);

I hope you can see that I am at the evaluation phase, checking whether I have understood which Cloo.NET statements are equivalent to the OpenCL SDK ones. Have a good weekend. If any of you has some time to answer even partially, I would appreciate it.

cl_mem vbo_cl;

// Create VBO
void createVBO(GLuint* vbo)
{
    // create VBO
    unsigned int size = mesh_width * mesh_height * 4 * sizeof(float);

    // create buffer object
    glGenBuffers(1, vbo);
    glBindBuffer(GL_ARRAY_BUFFER, *vbo);

    // initialize buffer object
    glBufferData(GL_ARRAY_BUFFER, size, 0, GL_DYNAMIC_DRAW);

    #ifdef GL_INTEROP
        // create OpenCL buffer from GL VBO
        vbo_cl = clCreateFromGLBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, *vbo, NULL);
    #else
        // create standard OpenCL mem buffer
        vbo_cl = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, size, NULL, &ciErrNum);
    #endif
}

LikeKT's picture

Is there an equivalent of this function from shrUtils.h (reference: Nvidia OpenCL SDK)?

shrCheckErrorEX(ciErrNum, CL_SUCCESS, pCleanup);


How do I debug OpenCL code when using Cloo.NET?

Thanks in advance.

LikeKT's picture

After looking at OpenCLTemplate 1.10, OpenCL 1.1, I think the answers to some of my questions can be found there. It is a great source for learning OpenCL in .NET.

nythrix's picture

1) No, you should pass 8*64 as the last argument to the constructor:
ComputeBuffer<int> mMeshHeight = new ComputeBuffer<int>(context, ComputeMemoryFlags.WriteOnly, 8*64);
If you pass an array the buffer will be the size of the array i.e. one int wide! This is documented in Cloo (check Intellisense).

2) cl_mem is the ComputeBuffer class. The line you posted is the right way to do it. T should be the type of the elements of the VBO. So for int[] vbo you should specify ComputeBuffer<int>. Oh, and be sure to create a shared OpenCL/OpenGL context before any attempt to data interop between the two:

3) They are interchangeable since all the higher level methods translate to the basic SetArgument(). However, in order to avoid possible errors when setting kernel args (it's one of the most error prone parts of any program), you should use the specific versions where available (cl_mem, cl_image, cl_sampler and simple values are all covered). They're type safe and thoroughly tested.

error codes) All internal calls to OpenCL in Cloo have their error code automatically checked. If an error occurs, the appropriate exception is thrown. If you want to check for a certain error code yourself you can use static ComputeException.ThrowOnError(...)

shrCheckErrorEx(...) is not a standard OpenCL function and as such it is not covered by Cloo.

Hope this helps.

LikeKT's picture

Hi nythrix, yes, I did set up the OpenCL/OpenGL interop context:

// Create OpenCL/OpenGL interop context
IntPtr curDC = wglGetCurrentDC();
OpenTK.Graphics.IGraphicsContextInternal ctx = (OpenTK.Graphics.IGraphicsContextInternal)OpenTK.Graphics.GraphicsContext.CurrentContext;
IntPtr raw_context_handle = ctx.Context.Handle;
ComputeContextProperty p1 = new ComputeContextProperty(ComputeContextPropertyName.CL_GL_CONTEXT_KHR, raw_context_handle);
ComputeContextProperty p2 = new ComputeContextProperty(ComputeContextPropertyName.CL_WGL_HDC_KHR, curDC);
ComputeContextProperty p3 = new ComputeContextProperty(ComputeContextPropertyName.Platform, ComputePlatform.Platforms[0].Handle);
List<ComputeContextProperty> props = new List<ComputeContextProperty>() { p1, p2, p3 };
ComputeContextPropertyList Properties = new ComputeContextPropertyList(props);
cxGPUContext = new ComputeContext(ComputeDeviceTypes.Gpu, Properties, null, IntPtr.Zero);

// continued from the code above
kernel = program.CreateKernel("sine_wave");
mWidth = new ComputeBuffer<int>(cxGPUContext, ComputeMemoryFlags.WriteOnly, meshWidth);   // meshWidth = 256
mHeight = new ComputeBuffer<int>(cxGPUContext, ComputeMemoryFlags.WriteOnly, meshHeight); // meshHeight = 256
kernel.SetMemoryArgument(0, mVbo);   // created in createVBO()
kernel.SetMemoryArgument(1, mWidth);
kernel.SetMemoryArgument(2, mHeight);
cqCommandQueue = new ComputeCommandQueue(cxGPUContext, cxGPUContext.Devices[0], ComputeCommandQueueFlags.None);

This is the code for creating the VBO according to your recommendation.

private void createVBO()
{
    if (vbo[0] != 0)
    {
        GL.DeleteBuffers(1, ref vbo[0]);
        vbo[0] = 0;
    }
    GL.GenBuffers(1, out vbo[0]);
    GL.BindBuffer(BufferTarget.ArrayBuffer, vbo[0]);
    vboSize = meshWidth * meshHeight * 4 * sizeof(float);   // int vboSize
    vboMemSize = Marshal.AllocHGlobal((int)vboSize);        // IntPtr vboMemSize
    GL.BufferData(BufferTarget.ArrayBuffer, vboMemSize, IntPtr.Zero, BufferUsageHint.DynamicDraw);
    if (GL_Interop)
    {
        // ComputeBuffer<int> mVbo
        mVbo = ComputeBuffer<int>.CreateFromGLBuffer<int>(cxGPUContext, ComputeMemoryFlags.WriteOnly, vbo[0]);
    }
    else
    {
        mVbo = new ComputeBuffer<int>(cxGPUContext, ComputeMemoryFlags.WriteOnly, vbo[0]);
    }
}

protected override void OnRenderFrame(FrameEventArgs e)
{
    // run OpenCL kernel to generate vertex positions
    // rest of the OpenTK code for rendering the sine wave using the vertex positions generated by the OpenCL kernel
}

=> This is the section where I always end up stuck: getting values back from the command queue.
With GL_Interop,
I cannot do cqCommandQueue.Finish(); //ERROR ==> ComputeErrorCode = outofResource
Without GL_Interop,
cqCommandQueue.Read(mVbo, true, 0, vboSize, pointer, null);//ERROR ==> ComputeErrorCode = InvalidValue

My graphics card is an Nvidia 8 series with the latest CUDA 3.0 driver.
I read in a previous discussion that it is not possible to do asynchronous blocking on Nvidia cards through Cloo. Has the issue been resolved? ref ""

Do I need to insert additional code to debug or check what is wrong here?

I consider myself at an early stage of learning OpenCL.
I could get the program to work without OpenCL, so the OpenGL part is fine.
Thanks for your help.

private void runKernel()
{
    // declared up front so both GL_Interop blocks below can see it
    List<ComputeMemory> c = new List<ComputeMemory>() { mVbo };     // ComputeBuffer<int> wrapping vbo[0]
    if (GL_Interop)
    {
        cqCommandQueue.AcquireGLObjects(c, null);
    }
    fixed (float* anim = &animationState[0])   // float animationState = 0.0f;
    {
        IntPtr animPointer = new IntPtr((void*) anim);
        kernel.SetArgument(3, new IntPtr(sizeof(float)), animPointer);
    }
    globalWorkSize = new long[2];
    globalWorkSize[0] = meshWidth;
    globalWorkSize[1] = meshHeight;
    cqCommandQueue.Execute(kernel, null, globalWorkSize, null, null);
    if (GL_Interop)
    {
        cqCommandQueue.ReleaseGLObjects(c, null);
        cqCommandQueue.Finish(); //ERROR ==> ComputeErrorCode = outofResource
    }
    else
    {
        GL.BindBuffer(BufferTarget.ArrayBuffer, vbo[0]);
        IntPtr pointer = GL.MapBuffer(BufferTarget.ArrayBuffer, BufferAccess.WriteOnly);
        long vboSize = meshWidth * meshHeight * 4 * sizeof(float);
        cqCommandQueue.Read<int>(mVbo, true, 0, vboSize, pointer, null); //ERROR ==> ComputeErrorCode = InvalidValue
    }
}
LikeKT's picture

"However, in order to avoid possible errors when setting kernel args (it's one of the most error prone parts of any program), you should use the specific versions where available (cl_mem, cl_image, cl_sampler and simple values are all covered). They're type safe and thoroughly tested."

How do I port this to Cloo?
ciErrNum  = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void *) &vbo_cl);

Is this correct?

kernel.SetMemoryArgument(0, mVbo);   // created in createVBO()

How do I do kernel.SetArgument(0, sizeof(cl_mem), pointer to vbo[0])?

Thanks again, I appreciate that you are sharing your know-how.

nythrix's picture

[Warning] When debugging OpenCL keep in mind this: More often than not, the problem lies anywhere between the point an exception occurs and the start of the code. So, if you come across an error that doesn't make sense, be sure to check some of the previous commands as well. [End of Warning]

vboMemSize seems to be allocated and not used or freed. Leaking memory!!

Your VBO contains floats so you should use ComputeBuffer<float>.

if (GL_Interop)
{
    // ComputeBuffer<float> mVbo
    mVbo = ComputeBuffer<float>.CreateFromGLBuffer<float>(cxGPUContext, ComputeMemoryFlags.WriteOnly, vbo[0]);
}
else
{
    // change last arg:
    mVbo = new ComputeBuffer<float>(cxGPUContext, ComputeMemoryFlags.WriteOnly, /*not vbo[0] but*/ numberOfElementsInVBO);
}

Do not cq.ReleaseGLObjects(...) before cq.Finish() because OpenCL might not be done with them yet.

If you're passing a simple float (or any other value type) into the kernel (and not reading it back) use kernel.SetValueArgument(...). It's much simpler.

Instead of:
cqCommandQueue.Read<int>(mVbo, true, 0, vboSize, pointer, null)
use:
cqCommandQueue.Read<float>(mVbo, true, 0, meshWidth*meshHeight*4 /*no sizeof(float) here: this is the count of elements, not the size in bytes. Or stay safe and use mVbo.Count.*/, pointer, null)
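
The difference between an element count and a byte size is easy to mix up, so here is a small plain C sketch of the two quantities for this mesh. The function names vbo_element_count and vbo_byte_size are invented for the example and are not part of Cloo, OpenTK or OpenCL. The element count is the kind of number the typed Read<float> call above is described as taking, while the byte size is what the C calls in the original sample (glBufferData, clCreateBuffer) take.

```c
#include <assert.h>
#include <stddef.h>

/* Number of float elements in a mesh that stores 4 floats per vertex. */
size_t vbo_element_count(size_t mesh_width, size_t mesh_height) {
    return mesh_width * mesh_height * 4;
}

/* Size in bytes of the same data: element count times sizeof(float). */
size_t vbo_byte_size(size_t mesh_width, size_t mesh_height) {
    return vbo_element_count(mesh_width, mesh_height) * sizeof(float);
}
```

For a 256 x 256 mesh the element count is 262144; passing the byte size where a count is expected would request sizeof(float) times as many elements as the buffer holds.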

The issue with async calls in Cloo has been solved.

Try the above and let me know.

Edit: Yes, kernel.SetMemoryArgument(...) is used in that case.

LikeKT's picture

Hi nythrix ,

Thanks for taking the time to help. YES YES YES, the OpenGL interop part works!!!

However, the version without OpenCL/OpenGL interop does not work:

cqCommandQueue.Read<float>(mVbo, true, 0, meshWidth * meshHeight * 4, pointer, null); //again InvalidValue

So I made the following change. However, the original C++ code does not need such a modification, and I wonder why.

float[] VerticeArray = new float[meshWidth*meshHeight*4];
ComputeBuffer<float> mVertices = new ComputeBuffer<float>(cxGPUContext, ComputeMemoryFlags.WriteOnly, VerticeArray);   
cqCommandQueue.Read<float>(mVertices , true, 0,meshWidth * meshHeight * 4, pointer, null); //again invalidValue