Jeff's picture

CLOO: Getting started, questions about performance (or lack thereof).

I'm running the "AddVector" example and the performance seems pretty poor compared to a simple single-threaded C# implementation.

GPU: Quadro FX 570, here are some of the details:
MaxClockFrequency: 920
MaxComputeUnits: 2
MaxConstantArguments: 9
MaxConstantBufferSize: 65536
MaxMemoryAllocationSize: 134217728
MaxParameterSize: 4352
MaxReadImageArguments: 128
MaxSamplers: 16
MaxWorkGroupSize: 512

CPU: 2.33 GHz Xeon (2 CPUs, 4 cores each, but that shouldn't matter since my comparison C# code is single-threaded).

.NET 2.0, CLOO 0.6.1

Across a range of vector sizes, the GPU seems to fairly consistently take 10 times as long as the CPU. That's after moving the "new ComputeContextPropertyList", "new ComputeContext", "new ComputeProgram", "program.Build", and "program.CreateKernel" outside the loop (since presumably it is okay to reuse the kernel over and over).

Here are the times for all the CLOO calls (for 2400 runs, vector sizes from 2 to 16M elements):

Total: 00:00:00.0312500, Count:       1, Average: 00:00:00.0312500.  Key: OpenCL new ComputeContextPropertyList
Total: 00:00:00.0468750, Count:       1, Average: 00:00:00.0468750.  Key: OpenCL new ComputeContext
Total: 00:00:00        , Count:       1, Average: 00:00:00        .  Key: OpenCL new ComputeProgram
Total: 00:00:00.2968750, Count:       1, Average: 00:00:00.2968750.  Key: OpenCL program.Build
Total: 00:00:00        , Count:       1, Average: 00:00:00        .  Key: OpenCL program.CreateKernel
Total: 00:00:33.4531250, Count:    4800, Average: 00:00:00.0069694.  Key: OpenCL new ComputeBuffer (input)
Total: 00:00:00.0156250, Count:    2400, Average: 00:00:00.0000065.  Key: OpenCL new ComputeBuffer (output)
Total: 00:00:00.0156250, Count:    7200, Average: 00:00:00.0000021.  Key: OpenCL kernel.SetMemoryArgument
Total: 00:00:00.0156250, Count:    2400, Average: 00:00:00.0000065.  Key: OpenCL new ComputeCommandQueue
Total: 00:01:11.1250000, Count:    2400, Average: 00:00:00.0296354.  Key: OpenCL queue.Execute
Total: 00:04:17.1875000, Count:    2400, Average: 00:00:00.1071614.  Key: OpenCL queue.Read

It seems odd to me that the longest time is spent during the "read" call, since I'm inputting twice as much data as I'm reading.

Which call is the one that actually passes the data up to the GPU's memory? Execute? Or constructing the input ComputeBuffer?

Should I be specifying WorkGroupSize (or any other optimization parameters)? Or is that handled automatically by CLOO (or OpenCL itself) based on the data being worked on?


Comments

Laar's picture

The "AddVector" example is more a test that everything works than a benchmark; it's not uncommon for it to run slower than a CPU version. One of the main reasons is that "AddVector" doesn't really use what the GPU is good at (doing calculations). It spends most of its time reading and writing memory, which, depending on the memory type used, is quite slow, and there is probably not enough other work to keep the GPU busy in the meantime.
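To put a rough number on that (a back-of-the-envelope sketch, not a measurement): each element of VectorAdd needs two 4-byte loads and one 4-byte store for a single addition, so the kernel performs only about one floating-point operation per 12 bytes of memory traffic. It is bandwidth-bound, not compute-bound:

```csharp
using System;

class VectorAddIntensity
{
    // Flops per byte of memory traffic for c[i] = a[i] + b[i]
    // (back-of-the-envelope, assuming the streamed data is not cached).
    public static double FlopsPerByte()
    {
        int bytesPerElement = 3 * sizeof(float); // load a[i], load b[i], store c[i]
        int flopsPerElement = 1;                 // one addition
        return (double)flopsPerElement / bytesPerElement;
    }

    static void Main()
    {
        Console.WriteLine($"{FlopsPerByte():F3} flops per byte"); // 0.083
    }
}
```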

The GPU is very different from the CPU. The CPU fetches data and processes it, using a cache to speed up the fetching. If the GPU needs data that it can't get quickly, it instead switches to another thread.

Edit: look at this YouTube video and you'll see the difference in a more appropriate example

ctk's picture

Hello, and welcome!

First off, I ran the VectorAdd sample back when Cloo was at version 0.4, and I can confirm that using Cloo/OpenCL DOES result in a dramatic speedup of the calculations for me, comparing a single-threaded version against the Cloo version on my dual-core CPU (no GPU for me). However, the VectorAdd demo does not have the timing triggers in the right places, and the sample lacks the queue.Finish(); command that should be called after every queue.Execute(....). The VectorAdd demo is meant to be a simple example and is not optimized. If you post the code for your VectorAdd sample along with the single-threaded version you are benchmarking against, I can take a look at it.
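To illustrate why the placement of those timing triggers matters, here's a plain C# sketch (no Cloo involved) using a Task as a stand-in for an asynchronous command queue: stopping the clock right after launching the work measures almost nothing, and you must wait for completion, the analogue of queue.Finish(), to get a real number.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class TimingDemo
{
    // Time a "kernel launch" that takes ~200 ms; optionally wait for it
    // to finish before stopping the clock (the analogue of queue.Finish()).
    public static long TimeLaunch(bool waitForCompletion)
    {
        var sw = Stopwatch.StartNew();
        Task work = Task.Run(() => Thread.Sleep(200)); // stands in for queue.Execute
        if (waitForCompletion)
        {
            work.Wait();                               // stands in for queue.Finish
        }
        sw.Stop();
        work.Wait(); // let the work drain before the next measurement
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine($"without Finish: {TimeLaunch(false)} ms"); // near zero
        Console.WriteLine($"with Finish:    {TimeLaunch(true)} ms");  // ~200 ms
    }
}
```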

nythrix's picture

"VectorAdd" comes from the OpenCL ice ages, a time when, given the state of the drivers and whatnot, I had to spend a good part of the afternoon (and sometimes the night) just to make a silly "Execute(...)" call work. That's why the kernel is very simple and doesn't really take advantage of parallel processing.
I haven't had the time to add a more complex example that makes use of the OpenCL processing power. We've been busy hunting down bugs and generally improving Cloo internals for a couple of weeks now.
Anyway, I have a hypothesis about the "Read" anomaly (or the last command in general). If you're specifying a blocking read, then you're possibly measuring not only the "Read" command itself but also the execution of the previous command(s), which may have piled up in the queue. However, I've never tested this hypothesis and I can't say more without a look at the code.

Quote:

the sample lacks the queue.Finish(); command that should be called after every queue.Execute(....).

Not necessarily. If you take advantage of the event lists and your last command is blocking (Read, if you tell it to be), then you needn't issue a finish command. In that case Read will wait for the specified events AND for itself to finish.

Edit:

Quote:

Which call is the one that actually passes the data up to the GPU's memory? Execute? Or constructing the input ComputeBuffer?

Should I be specifying WorkGroupSize (or any other optimization parameters)? Or is that handled automatically by CLOO (or OpenCL itself) based on the data being worked on?

1) Either the ComputeBuffer constructor or the Write commands.
2) Cloo automates only primitive work. Although definitely possible, high-level optimizations are left to the user because in Cloo they might have a negative impact on the range of functionality. One of the design choices was to follow the OpenCL specs as closely as possible.

Jeff's picture

My C# code for testing is about what you'd expect:

        public static float[] AddVectorsCSharp(float[] vec1, float[] vec2)
        {
            int i;
            float[] resultVec = new float[vec1.Length];
            for (i = 0; i < resultVec.Length; i++)
            {
                resultVec[i] = vec1[i] + vec2[i];
            }
            return resultVec;
        }

Here's my code that uses CLOO:

Init and cleanup of reused objects:

        private static ComputeContext _addVectorContext;
        private static ComputeProgram _addVectorProgram;
        private static ComputeKernel _addVectorKernel;
 
        public static void InitPrograms()
        {
            string addVectorSource = @"
__kernel void VectorAdd(
    global read_only float* a,
    global read_only float* b,
    global write_only float* c )
{
    int index = get_global_id(0);
    c[index] = a[index] + b[index];
}
";
            Chronometer.BeginTiming("OpenCL new ComputeContextPropertyList");
            ComputeContextPropertyList cpl = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
            Chronometer.EndTiming("OpenCL new ComputeContextPropertyList");
 
            Chronometer.BeginTiming("OpenCL new ComputeContext");
            // This was ComputeDeviceTypes.Default, I tried Gpu instead with no noticeable difference.
            _addVectorContext = new ComputeContext(ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero);
            Chronometer.EndTiming("OpenCL new ComputeContext");
 
            Chronometer.BeginTiming("OpenCL new ComputeProgram");
            _addVectorProgram = new ComputeProgram(_addVectorContext, new[] { addVectorSource });
            Chronometer.EndTiming("OpenCL new ComputeProgram");
 
            Chronometer.BeginTiming("OpenCL program.Build");
            // I am specifying the first device, the original example did not, but it does not make a difference in performance.
            _addVectorProgram.Build(new[] { _addVectorContext.Devices[0] }, null, null, IntPtr.Zero);
            Chronometer.EndTiming("OpenCL program.Build");
 
            Chronometer.BeginTiming("OpenCL program.CreateKernel");
            _addVectorKernel = _addVectorProgram.CreateKernel("VectorAdd");
            Chronometer.EndTiming("OpenCL program.CreateKernel");
        }
 
        public static void CleanupPrograms()
        {
            // Empty catches are just for testing purposes.
            try { _addVectorKernel.Dispose(); } catch { }
            try { _addVectorProgram.Dispose(); } catch { }
            try { _addVectorContext.Dispose(); } catch { }
        }

Add vector implementation:

        public static float[] AddVectorsOpenCL(float[] vec1, float[] vec2)
        {
            if (_addVectorKernel == null)
            {
                throw new Exception("Call InitPrograms first!");
            }
 
            Chronometer.BeginTiming("OpenCL new ComputeBuffer (input)");
 
            using (ComputeBuffer<float> a = new ComputeBuffer<float>(_addVectorKernel.Context,
                                                              ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, vec1))
            {
                Chronometer.EndTiming("OpenCL new ComputeBuffer (input)");
 
                Chronometer.BeginTiming("OpenCL new ComputeBuffer (input)");
                using (ComputeBuffer<float> b = new ComputeBuffer<float>(_addVectorKernel.Context,
                                                                  ComputeMemoryFlags.ReadOnly |
                                                                  ComputeMemoryFlags.CopyHostPointer, vec2))
                {
                    Chronometer.EndTiming("OpenCL new ComputeBuffer (input)");
 
                    Chronometer.BeginTiming("OpenCL new ComputeBuffer (output)");
                    using (ComputeBuffer<float> c = new ComputeBuffer<float>(_addVectorKernel.Context,
                                                                      ComputeMemoryFlags.WriteOnly, vec1.Length))
                    {
                        Chronometer.EndTiming("OpenCL new ComputeBuffer (output)");
 
                        _addVectorKernel.SetMemoryArgument(0, a);
                        _addVectorKernel.SetMemoryArgument(1, b);
                        _addVectorKernel.SetMemoryArgument(2, c);
 
                        Chronometer.BeginTiming("OpenCL new ComputeCommandQueue");
                        using (ComputeCommandQueue queue = new ComputeCommandQueue(_addVectorKernel.Context,
                                                                            _addVectorKernel.Context.Devices[0],
                                                                            ComputeCommandQueueFlags.None))
                        {
                            Chronometer.EndTiming("OpenCL new ComputeCommandQueue");
 
                            Chronometer.BeginTiming("OpenCL queue.Execute");
                            queue.Execute(_addVectorKernel, null, new long[] {vec1.Length}, null, null);
                            Chronometer.EndTiming("OpenCL queue.Execute");
 
                            Chronometer.BeginTiming("OpenCL queue.Read");
                            float[] retVal = queue.Read(c, true, 0, vec1.Length, null);
                            Chronometer.EndTiming("OpenCL queue.Read");
                            return retVal;
                        }
                    }
                }
            }
        }

And the code that executes it looks basically like this:

    InitPrograms();
    for (int x = 0; x <= 24; x++) {
        for (int y = 0; y < 100; y++) {
            AddVectorsOpenCL(/*two vectors of random floats, of size 2^x*/);
        }
    }
    CleanupPrograms();
    Chronometer.ReportTimes();

I removed the timing checks around SetMemoryArgument because it seems to take effectively no time.
And for the exceptionally nit-picky, yes I realize I am not timing the various calls to VariousClooObjects.Dispose();

Jeff's picture

OK, so responding to the comments earlier, I tried to make the kernel more complicated to see if I could see some benefit of the GPU over the CPU. However, I'm now getting an "InvalidKernelNameComputeException". Here's my new kernel:

kernel void VectorSmooth(
    global read_only float* a,
    global read_only int length,
    global read_only int distance,
    global write_only float* b )
{
    int index = get_global_id(0);
    int minIndex = index - distance;
    if (minIndex < 0) {
        minIndex = 0;
    }
    int maxIndex = index + distance;
    if (maxIndex >= length) {
        maxIndex = length - 1;
    }
    int numCells = maxIndex - minIndex + 1;
    float total = 0;
    for (int x = minIndex; x < maxIndex; x++) {
        total += a[x];
    }
    b[index] = total / numCells;
}
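For reference, here's a plain single-threaded C# sketch of the smoothing I'm trying to express in the kernel. Note that it sums through maxIndex inclusive so the sum matches numCells, whereas the kernel's `x < maxIndex` loop stops one cell short of that:

```csharp
using System;

class SmoothReference
{
    // Single-threaded reference for the smoothing kernel: each output is
    // the average of the inputs within `distance` cells, clamped at the
    // array edges. Sums a[minIndex..maxIndex] inclusive so the divisor
    // matches the number of cells actually summed.
    public static float[] SmoothVector(float[] a, int distance)
    {
        float[] b = new float[a.Length];
        for (int index = 0; index < a.Length; index++)
        {
            int minIndex = Math.Max(index - distance, 0);
            int maxIndex = Math.Min(index + distance, a.Length - 1);
            float total = 0;
            for (int x = minIndex; x <= maxIndex; x++)
            {
                total += a[x];
            }
            b[index] = total / (maxIndex - minIndex + 1);
        }
        return b;
    }

    static void Main()
    {
        float[] result = SmoothVector(new float[] { 1, 2, 3, 4, 5 }, 1);
        Console.WriteLine(string.Join(", ", result)); // 1.5, 2, 3, 4, 4.5
    }
}
```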

Any idea what I'm doing wrong?

nythrix's picture

length is an OpenCL function/reserved word. I guess that's what's causing the issue.

http://www.khronos.org/opencl/sdk/1.0/docs/man/xhtml/length.html

Edit: Oh, and remove the "global" (or any other) memory qualifier from the ints. You are not allowed to specify address space qualifiers on primitive (non-pointer) kernel arguments.

Jeff's picture

There were several problems. As nythrix mentioned, I had some syntax errors in the kernel code. I think that particular exception, however, was because I declared the kernel with the name "VectorSmooth" and then called CreateKernel with "SmoothVector" (oops).