XtremeLeo's picture

CLOO Kernel Help

Hi,

I am new to CLOO, and to OpenCL and OpenTK programming in general. I am developing a simple program that computes pi on both the GPU and the CPU. I used the CLOO VectorAddTest as a base and modified it into the following:

using System;
using Cloo;
 
namespace PI
{
    class Test: AbstractTest 
    {
     private string kernelSource = @"
kernel void GaussLegendre(
    global float* BufferA,
    global float* BufferB,
    global float* BufferT,
    global float* BufferP,
    global float* BufferAP,
    global float* BufferBP,
    global float* BufferTP,
    global float* BufferPP,
    global float* BufferResult )
{
    BufferResult[1] = ((BufferA[1] + BufferB[1])*(BufferA[1] + BufferB[1]))/(4 * BufferT[1]);
 
    BufferAP[1] = BufferA[1];
    BufferBP[1] = BufferB[1];
    BufferTP[1] = BufferT[1];
    BufferPP[1] = BufferP[1];
 
    BufferA[1] = (BufferAP[1] + BufferBP[1]) / 2;
    BufferB[1] = sqrt(BufferAP[1]*BufferBP[1]);
    BufferT[1] = BufferTP[1] - (BufferPP * (pow((BufferAP[1]-BufferA[1]), 2)));
    BufferP[1] = BufferPP[1] * 2;
}
";
     private int device;
        public Test(int ComputeDevice)
            : base( "Gauss-Legendre - PI" )
        {
            device = ComputeDevice;
        }
 
        protected override void RunInternal()
        {
            ComputeContextPropertyList cpl = new ComputeContextPropertyList( ComputePlatform.Platforms[ 0 ] );
            ComputeContext context;
 
            if(device == 0)
                context = new ComputeContext( ComputeDeviceTypes.Cpu, cpl, null, IntPtr.Zero );
            else
                context = new ComputeContext( ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero );
 
            ComputeProgram program = new ComputeProgram( context, new string[]{ kernelSource } );
            program.Build( null, null, null, IntPtr.Zero );
            ComputeKernel kernel = program.CreateKernel( "GaussLegendre" );
 
 
            float[] a, b, t, p, ap, bp, tp, pp, result;
            a = new float[1];
            b = new float[1];
            t = new float[1];
            p = new float[1];
            ap = new float[1];
            bp = new float[1];
            tp = new float[1];
            pp = new float[1];
            result = new float[1];
            a[1] = 1.0f;
            b[1] = 1 / (float)Math.Sqrt(2);
            t[1] = 0.25f;
            p[1] = 1;
 
            ComputeBuffer<float> BufferA = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, a);
            ComputeBuffer<float> BufferB = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, b);
            ComputeBuffer<float> BufferT = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, t);
            ComputeBuffer<float> BufferP = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, p);
            ComputeBuffer<float> BufferAP = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, ap);
            ComputeBuffer<float> BufferBP = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, bp);
            ComputeBuffer<float> BufferTP = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, tp);
            ComputeBuffer<float> BufferPP = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, pp);
            ComputeBuffer<float> BufferResult = new ComputeBuffer<float>(context, ComputeMemoryFlags.AllocateHostPointer, result);
 
            kernel.SetMemoryArgument(0, BufferA);
            kernel.SetMemoryArgument(1, BufferB);
            kernel.SetMemoryArgument(2, BufferT);
            kernel.SetMemoryArgument(3, BufferP);
            kernel.SetMemoryArgument(4, BufferAP);
            kernel.SetMemoryArgument(5, BufferBP);
            kernel.SetMemoryArgument(6, BufferTP);
            kernel.SetMemoryArgument(7, BufferPP);
            kernel.SetMemoryArgument(8, BufferResult);
 
 
            ComputeCommandQueue queue = new ComputeCommandQueue( context, context.Devices[ 0 ], ComputeCommandQueueFlags.None );            
            queue.Execute( kernel, null, new long[] { 8 }, null, null );
            Console.WriteLine(queue.Device.Name.ToString());
            float[] PI = new float[1];
            PI = queue.Read(BufferResult, false, 0, 1, null);
            Console.WriteLine("PI = " + PI[1].ToString());
        }
    }
}

I get an error in the kernel. The editor reports that on line 1 the expression should have an arithmetic type. Could you please help me solve this error? (And if you find and correct any others, I would be very grateful ;))


Comments

ctk's picture

Looking at your kernel, you won't get any parallel processing from OpenCL, since it is just one serial calculation. I'm guessing this is a test program to get familiar with OpenCL? I fixed up your code so that it should run (I didn't test it on my system). Your problem was that multiple threads in the kernel were all writing to the same location when, I assume, you only want to do it once. Also, indexes in OpenCL and C# start at zero, not one. You also needed to enable read/write for the output buffer. If you are new to OpenCL, I highly recommend reading the manuals that come with either the ATI Stream SDK or the Nvidia SDK before trying anything complex, and trying to modify the ClooUtils program in Cloo 0.6.1.

 
using System;
using Cloo;
 
namespace PI
{
    class Test: AbstractTest 
    {
     private string kernelSource = @"
kernel void GaussLegendre(
    global float* BufferA,
    global float* BufferB,
    global float* BufferT,
    global float* BufferP,
    global float* BufferAP,
    global float* BufferBP,
    global float* BufferTP,
    global float* BufferPP,
    global float* BufferResult )
{
 if (get_global_id(0) == 0)
 {
    BufferResult[0] = ((BufferA[0] + BufferB[0])*(BufferA[0] + BufferB[0]))/(4 * BufferT[0]);
 
    BufferAP[0] = BufferA[0];
    BufferBP[0] = BufferB[0];
    BufferTP[0] = BufferT[0];
    BufferPP[0] = BufferP[0];
 
    BufferA[0] = (BufferAP[0] + BufferBP[0]) / 2;
    BufferB[0] = sqrt(BufferAP[0]*BufferBP[0]);
    BufferT[0] = BufferTP[0] - (BufferPP * (pow((BufferAP[0]-BufferA[0]), 2)));
    BufferP[0] = BufferPP[0] * 2;
 }
}
";
     private int device;
        public Test(int ComputeDevice)
            : base( "Gauss-Legendre - PI" )
        {
            device = ComputeDevice;
        }
 
        protected override void RunInternal()
        {
            ComputeContextPropertyList cpl = new ComputeContextPropertyList( ComputePlatform.Platforms[ 0 ] );
            ComputeContext context;
 
            if(device == 0)
                context = new ComputeContext( ComputeDeviceTypes.Cpu, cpl, null, IntPtr.Zero );
            else
                context = new ComputeContext( ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero );
 
            ComputeProgram program = new ComputeProgram( context, new string[]{ kernelSource } );
            program.Build( null, null, null, IntPtr.Zero );
            ComputeKernel kernel = program.CreateKernel( "GaussLegendre" );
 
 
            float[] a, b, t, p, ap, bp, tp, pp, result;
            a = new float[1];
            b = new float[1];
            t = new float[1];
            p = new float[1];
            ap = new float[1];
            bp = new float[1];
            tp = new float[1];
            pp = new float[1];
            result = new float[1];
            a[0] = 1.0f;
            b[0] = 1 / (float)Math.Sqrt(2);
            t[0] = 0.25f;
            p[0] = 1;
 
            ComputeBuffer<float> BufferA = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, a);
            ComputeBuffer<float> BufferB = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, b);
            ComputeBuffer<float> BufferT = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, t);
            ComputeBuffer<float> BufferP = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, p);
            ComputeBuffer<float> BufferAP = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, ap);
            ComputeBuffer<float> BufferBP = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, bp);
            ComputeBuffer<float> BufferTP = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, tp);
            ComputeBuffer<float> BufferPP = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, pp);
            ComputeBuffer<float> BufferResult = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, result);
 
            kernel.SetMemoryArgument(0, BufferA);
            kernel.SetMemoryArgument(1, BufferB);
            kernel.SetMemoryArgument(2, BufferT);
            kernel.SetMemoryArgument(3, BufferP);
            kernel.SetMemoryArgument(4, BufferAP);
            kernel.SetMemoryArgument(5, BufferBP);
            kernel.SetMemoryArgument(6, BufferTP);
            kernel.SetMemoryArgument(7, BufferPP);
            kernel.SetMemoryArgument(8, BufferResult);
 
 
            ComputeCommandQueue queue = new ComputeCommandQueue( context, context.Devices[ 0 ], ComputeCommandQueueFlags.None );            
            queue.Execute( kernel, null, new long[] { 1 }, null, null );
            Console.WriteLine(queue.Device.Name.ToString());
            float[] PI = new float[1];
            PI = queue.Read(BufferResult, true, 0, 1, null);
            Console.WriteLine("PI = " + PI[0].ToString());
        }
    }
}
nythrix's picture

As ctk said, you should have a look at the various documents. My all-time favourite is the OpenCL specification. You also have to keep an eye out for the quirks of the managed world. With those in mind, AllocateHostPointer is considered an advanced technique.
Now that I think of it, you can't use this flag with the current constructors. I should fix that.

Tip of the day: if you want to launch a single kernel, you can use: commandQueue.Execute( kernel, events );

XtremeLeo's picture

Thanks for your very fast response. I wanted to copy the whole array into the kernel, but yes, I forgot to change the index in the variable initialization section of the code, and I can see it will be easier to copy only one element. As you said, I'm just getting familiar with OpenCL. Still, I would like this kernel to run in parallel on my GPU; how could I achieve that? Finally, the modifications ctk kindly made didn't work; I'm getting the same error.

XtremeLeo's picture

I was thinking about the kernel, and since this algorithm requires an iterative approach, I don't see real benefits in using OpenCL, as it is not possible to parallelize the kernel effectively. Do you think it can be parallelized?

nythrix's picture

Converging algorithms like this one are a poor fit for OpenCL: each iteration depends on the result of the previous one, so they can't be parallelized. Instead, you could use OpenCL where your algorithm has a loop processing an array of data.
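To illustrate the kind of loop that does map onto OpenCL, here is a minimal host-side sketch in plain C. It is not the Gauss-Legendre method from this thread; it swaps in a different technique, the midpoint rule for the integral of 4/(1+x^2) over [0,1], because every term of that sum is independent of the others. The names pi_term and pi_midpoint are hypothetical and only for illustration.

```c
#include <stddef.h>

/* One independent term of the midpoint rule for the integral of
   4/(1+x^2) on [0,1]. Hypothetical helper: in an OpenCL kernel, i would
   come from get_global_id(0) and each work-item would compute one term. */
double pi_term(size_t i, size_t n)
{
    double h = 1.0 / (double)n;          /* width of one slice */
    double x = ((double)i + 0.5) * h;    /* midpoint of slice i */
    return 4.0 / (1.0 + x * x) * h;
}

/* Serial stand-in for the parallel launch plus the final sum. No
   iteration reads a value written by another one, which is what makes
   the loop body a candidate for a kernel. */
double pi_midpoint(size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += pi_term(i, n);
    return sum;
}
```

Calling pi_midpoint(100000) gives an estimate of pi. On a device, the loop body would become the kernel and the additions would have to be done as a separate reduction step, which this sketch does not show.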

Hiran47's picture

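The error is in this line (in ctk's code); have a closer look:
BufferT[0] = BufferTP[0] - (BufferPP * (pow((BufferAP[0]-BufferA[0]), 2)));

You should put BufferPP[0] instead. BufferPP on its own is a pointer, which is why the compiler complains that the expression should have an arithmetic type. :)
Good luck!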
