Jeff's picture

CLOO (or probably OpenCL): Anyone used non-blocking ComputeCommandQueue.Read?

I'm trying to do non-blocking ComputeCommandQueue.Reads and I can't seem to make it not block. Passing "true" or "false" for the blocking parameter doesn't seem to have any effect.

float[] retVal = queue.Read(c, true, 0, vec1.Length, events);

or

float[] retVal = queue.Read(c, false, 0, vec1.Length, events);

In both cases retVal contains the answer the instant control returns from the Read call, even for operations that take more than 3 seconds (read just blocks for the 3 seconds until it is complete, regardless of the value passed for "blocking").

Is there something else I'm supposed to do to enable non-blocking operations other than just passing "false" to ComputeCommandQueue.Read?


Comments

nythrix's picture

Up till now asynchronous calls are not an option if you're running nVidia. This is a known "feature".
I don't know about ATI.

ctk's picture

Nythrix, maybe you can shed some light on this, but I don't see how the current queue.Read(....) command in Cloo can be asynchronous in it's current form even if the underlying graphics card and driver supports it. For example, say we wanted async reads like:

float[] resultsA = queue.Read(a, false, 0, resultsLength, events);
float[] resultsB = queue.Read(b, false, 0, resultsLength, events);
float[] resultsC = queue.Read(c, false, 0, resultsLength, events);
queue.Finish();

Each line in the code would have to be blocking for the array variables to store valid data since they have to wait for queue.Read(....) to read and generate the results into .Net form. And I don't believe you can have 3 separate return calls running at the same time in .Net like this without a different syntax.

nythrix's picture

Yes, I'm aware of this problem. I'm keeping a comment/warning about possible misbehavior of asynchronous calls right on top of ComputeCommandQueue.cs so that it doesn't get lost in the woods. I will redesign this part although without a working asynchronous mechanism it's a bit of a blind shot.

Getting our hands dirty, this relates to the old keep-alive problem we were facing in the case of kernel arguments. Unfortunately, getting rid of it completely is not easy. For example, I've just discovered that keeping alive kernel arguments randomly blows up disposing of resources (but only some of them).
You'll see another two or three versions of Cloo come out, I guess :)

Jeff's picture

I think I don't understand the problem... It looks to me like the implementation of read right now creates the return array and returns it empty, having passed a pointer to the memory to the underlying call to populate at leisure.

Is the problem that the "gcHandle.Free" really shouldn't be called until the async call is finished, but it instead is called right away (thereby allowing the GC to move or blow away your memory before the underlying call gets around to writing to it)?

Well, if I can't make async calls, do you have any other ideas how I might achieve what I want? I have a large array (millions) of data I need to load, process, and return, but I was hoping to queue it up as a series of smaller chunks (thousands) to try to get the GPU working on the first chunk while the second one was being transferred.

Right now (if I understand everything correctly) the GPU sits idle until the entire array is loaded into its memory, then it processes the whole thing, then it sits idle again while the results are sent back.

The other thread you linked to implies that they couldn't get a data transfer and processing to happen at the same time no matter what they tried... which would be unfortunate.

ctk's picture

Nythrix, perhaps the simplest solution to all these problems would simply to require the user to manually call a clean up function at the end of their code instead of trying to anticipate when to dispose with totally managed code. For example, in my simulation, I currently use the following code to read and write single structs into the kernel (not arrays, just single structs, I have a lot of single variable data required for my simulation and structs make them easier to manage):

using Cloo;
using System;
using OpenTK;
using OpenTK.Compute.CL10;
using System.IO;
using System.Runtime.InteropServices;
using System.Diagnostics;
using System.Collections.Generic;
 
namespace OpenCLTest
{
    /// <summary>
    /// Enables the reading and writing of structs in a kernel.
    /// </summary>
    public static class ClooHelpers
    {
        /// <summary>
        /// Sets a kernel argument of arbitrary type.
        /// </summary>
        /// <param name="data">The data to place in the kernel argument.</param>
        /// <param name="context">A valid context we are currently using.</param>
        /// <param name="hMemList">A list holding pointers to the buffers. Used to properly dispose the buffer (User does it manually in this case).</param>        
        /// <param name="kernel">The current kernel we are putting data into.</param>
        /// <param name="memFlags">Memory flags.</param>
        /// <param name="kernelArgCount">The index of the argument in the kernel.</param>
        /// <returns></returns>
        public static void SetKernelArg<T>(ref T data, ref ComputeContext context, ref List<IntPtr> hMemList,
            ref ComputeKernel kernel, MemFlags memFlags, int kernelArgCount)
        {
            unsafe
            {
                GCHandle dataPtr = GCHandle.Alloc(data, GCHandleType.Pinned);
                IntPtr hDeviceMem;
                ErrorCode error;
 
                try
                {
                    hDeviceMem = CL.CreateBuffer(context.Handle,
                       memFlags,
                       new IntPtr(Marshal.SizeOf(data)),
                       dataPtr.AddrOfPinnedObject(),
                       out error);
 
                    if (error != ErrorCode.Success)
                        throw new Exception(error.ToString());
                }
                finally
                {
                    dataPtr.Free();
                }
 
                hMemList.Add(hDeviceMem);
                CL.SetKernelArg(kernel.Handle, kernelArgCount, new IntPtr(Marshal.SizeOf(typeof(IntPtr))), new IntPtr(&hDeviceMem));
            };
        }
 
        /// <summary>
        /// Enqueues a read for a struct object in the kernel.
        /// </summary>
        /// <param name="data">Where we will read our data to</param>
        /// <param name="queue">The queue used to read the data from.</param>
        /// <param name="buffer">A valid buffer reference the struct object in the kernel.</param>        
        /// <returns></returns>
        public static int ReadKernelArg<T>(ref T data, ref ComputeCommandQueue queue, IntPtr buffer) where T : struct
        {
            int result;
            unsafe
            {
                GCHandle dataPtr = GCHandle.Alloc(data, GCHandleType.Pinned);                
 
                try
                {
                    // block_reading must be set to true or the data is not read (bug?)
                    result = CL.EnqueueReadBuffer(queue.Handle, buffer, true, new IntPtr(0),
                        new IntPtr(Marshal.SizeOf(data)), ref data, 0, (IntPtr*)null, (IntPtr*)null);
                }
                finally
                {
                    dataPtr.Free();
                }
            }
            return result;
        }
    }
}

I keep a list of the IntPtr's to the buffers with List hMemList passed into SetKernelArg, and I manually

            for (int i = 0; i < hMemList.Count; i++)
            {
                CL.ReleaseMemObject(hMemList[i]);
            }

at the end of my program.

In my ReadKernelArg implementation, you could theoretically do an async read by changing the true to a false in EnqueueReadBuffer, although the pinned memory would be a problem to manage. However, on my system (duo core CPU, ATI Stream SDK), I've found that async reads did not work and data could not be read out. That's why it's set to block above.

BTW, nythrix, any chance of enabling the read and write of SINGLE structs into the kernel for the next release of CLOO? I know that you can read and write arrays of structs right now with CLOO but I haven't found an easy way to just read and write a single struct. Am I missing some obvious command here?

nythrix's picture

Ok. One bit at a time.
1. Read/Write methods
You are right, they don't work properly in case the implementation supports asynchronous calls. Letting the user manage the resources is a simple solution but I'm not a fan of it because it is quite error prone. Besides I believe there's this better approach I've been playing with for some time now. The main idea is to bind the GCHandles to the ComputeEvents that the R/W methods generate thus delaying the call to GCHandle.Free() until the corresponding command is complete. This is fully transparent to the user and doesn't require any changes to the API.
2.

Quote:

// block_reading must be set to true or the data is not read (bug?)

Have you hit this issue? I haven't been able to enforce this exact bug, even though I was sure it would happen. I think this proves that async calls are working in your environment. I would very much appreciate some info on how to reproduce this!
3. Kernel args
There is
ComputeKernel.SetValueArgument<T>( int index, T data ) for sending in any blittable structure, or
ComputeKernel.SetArgument( int index, IntPtr dataSize, IntPtr dataAddr ) which, apart from the error check, just calls the underlying clSetKernelArg so you can use it to send in virtually anything.
You cannot read back an argument unless you wrap it up in a ComputeBuffer. The kernel treats pointers as buffers anyway so the hack you're using is pretty much the only way.

ctk's picture
nythrix wrote:

2.

Quote:

// block_reading must be set to true or the data is not read (bug?)

Have you hit this issue? I haven't been able to enforce this exact bug, even though I was sure it would happen. I think this proves that async calls are working in your environment. I would very much appreciate some info on how to reproduce this!

Yep, several times I tried doing an async read by setting the block to false and my program would run and finish but the expected data would not be read out. So it does look like async reads work on my computer. Maybe I'm not getting the data out correctly since I'm pinning the struct in .Net and the read is asynchronous. To me, it looks like the struct would be unpinned before the async read would be able to write data to it.

I'll pull the code out and make a test case to try and reproduce this. I'll post it up later.

nythrix wrote:

3. Kernel args
There is
ComputeKernel.SetValueArgument<T>( int index, T data ) for sending in any blittable structure, or
ComputeKernel.SetArgument( int index, IntPtr dataSize, IntPtr dataAddr ) which, apart from the error check, just calls the underlying clSetKernelArg so you can use it to send in virtually anything.
You cannot read back an argument unless you wrap it up in a ComputeBuffer. The kernel treats pointers as buffers anyway so the hack you're using is pretty much the only way.

Ok, I will stick with my hack since I need both read and write for single structs and it looks like the path of least resistance.

ctk's picture
Jeff wrote:

I think I don't understand the problem... It looks to me like the implementation of read right now creates the return array and returns it empty, having passed a pointer to the memory to the underlying call to populate at leisure.

Is the problem that the "gcHandle.Free" really shouldn't be called until the async call is finished, but it instead is called right away (thereby allowing the GC to move or blow away your memory before the underlying call gets around to writing to it)?

Well, if I can't make async calls, do you have any other ideas how I might achieve what I want? I have a large array (millions) of data I need to load, process, and return, but I was hoping to queue it up as a series of smaller chunks (thousands) to try to get the GPU working on the first chunk while the second one was being transferred.

Right now (if I understand everything correctly) the GPU sits idle until the entire array is loaded into its memory, then it processes the whole thing, then it sits idle again while the results are sent back.

The other thread you linked to implies that they couldn't get a data transfer and processing to happen at the same time no matter what they tried... which would be unfortunate.

Does the second chunk of your data depend on the first chunk of data to return a final result or are they totally independent? If they are totally independent, you could try creating host threads in .Net with each one having a command queue as suggested in the link or, you could split up your data into separate files and run multiple copies of your program with each pointed to a separate file (I can confirm this second method works). If your data chunks depend on one another to return a final result, I think you should wait and see if I have any success getting the async read to work on my computer.

nythrix's picture
Quote:

I'll pull the code out and make a test case to try and reproduce this. I'll post it up later.

Thanks.

Quote:

If they are totally independent, you could try creating host threads in .Net with each one having a command queue as suggested in the link

It doesn't really matter how many host threads you have, IMHO. Unless you really need several host threads a single host and separate command queues for reading/writing/executing will get you quite far without much hassle.

None of this will help if the backend is broken, though.

ctk's picture

Attached is a test case that shows that async reads execute on my computer but it does not properly read the data from the kernel when using my ClooHelper hack for structs. Example output:

---Async Read Demo---
Executing.....
Blocking Output:
testInt: 12345
testFloat: 1.2345

Async Output:
testInt: -1
testFloat: -1

The correct result is from the blocking output. I ruled out that GCAlloc is causing problems by manually allocating the pointer to the struct myself and freeing it manually at the end of the program.

AttachmentSize
AsyncReadTest.7z10.26 KB
nythrix's picture

Good news is ATI Stream definitely supports async calls.
As for the bad news there was something fishy in that code because I couldn't reproduce the issue using only Cloo functionality. Indeed, the part that's causing the problem is one of the OpenTK calls:

CL.EnqueueReadBuffer( 
    queue.Handle, buffer, block_read, new IntPtr(0), new IntPtr(Marshal.SizeOf(data)), 
    ref data,  // <- Don't do this! Use new IntPtr(&data) instead!
    0, (IntPtr*)null, (IntPtr*)null );

Cloo.Bindings are very strict and don't include overloads like this because I believe that bypassing the marshaler altogether for APIs such as OpenCL is the best thing to do.

However, this doesn't mean that the hi-level R/W operations in Cloo are solid. It is just a different problem.

the Fiddler's picture
Quote:

Cloo.Bindings are very strict and don't include overloads like this because I believe that bypassing the marshaler altogether for APIs such as OpenCL is the best thing to do.

Ref parameters in OpenTK bindings do bypass the marshaler (otherwise half of these overloads wouldn't even work):

public static unsafe int EnqueueReadBuffer<T5>(IntPtr command_queue, IntPtr buffer, bool blocking_read, IntPtr offset, IntPtr cb, [InAttribute, OutAttribute] ref T5 ptr, Int32 num_events_in_wait_list, IntPtr* event_wait_list, IntPtr* @event)
            where T5 : struct
        {
            GCHandle ptr_ptr = GCHandle.Alloc(ptr, GCHandleType.Pinned);
            try
            {
                int retval = Delegates.clEnqueueReadBuffer((IntPtr)command_queue, (IntPtr)buffer, (bool)blocking_read, (IntPtr)offset, (IntPtr)cb, (IntPtr)ptr_ptr.AddrOfPinnedObject(), (uint)num_events_in_wait_list, (IntPtr*)event_wait_list, (IntPtr*)@event);
                ptr = (T5)ptr_ptr.Target;
                return retval;
            }
            finally
            {
                ptr_ptr.Free();
            }
        }

That said, this doesn't really help here as the parameter is pinned only for the duration of the call. Then again, pinning the structure for the duration of the program should rule that issue out.

Some piece of the puzzle is missing here, I'll check again tomorrow with a clear brain.

nythrix's picture

Oh, now I see. I thought this was caused by the marshaler. Unfortunately, I tend to miss the details late in the night.

I believe there's no way this mechanism would work for async calls. Not at this level anyway. Wouldn't it be better if we ruled out these overloads?

ctk's picture

Does anyone actually have asynchronous reads working in a C/C++ program with OpenCL?

the Fiddler's picture

That's one possibility. Another might be to use the .Net async pattern, which works like this:

IAsyncResult result = BeginEnqueueReadBuffer(...);
// do other stuff
EndEnqueueReadBuffer(result);

The idea is that the begin method will pin the necessary resources and invoke the async call and the End method will block until the results are ready and clear the pins. The larger the distance between Begin and End, the lower the chance End will block.

Personally, I really dislike this pattern for all the reasons outlined here. The problem is I can't think of any good way to solve this without some form of explicit pin (via an explicit GCHandle or via a Begin* method).

What would help is if OpenCL had an event to signal when the read is complete. If that was available (is it?), the async Read method could pin the buffer internally and unpin it when the 'ready' signal comes (possibly firing a 'ReadFinished' event or something similar). This is basically a less ugly implementation of the above async pattern, but it could only work if such a signal existed.

nythrix's picture

This works:

string source = @"
kernel void Copy(
    global read_only float* bufferIn,
    global write_only float* bufferOut )
{
    int index = get_global_id(0);
    bufferOut[index] = bufferIn[index];
}
";
ComputeContextPropertyList cpl = new ComputeContextPropertyList( platform );
ComputeContext context = new ComputeContext( ComputeDeviceTypes.Default, cpl, null, IntPtr.Zero );
 
int size = 10;
float[] arrayIn = new float[ size ];
for( int i = 0; i < size; i++ ) arrayIn[ i ] = i;
 
ComputeBuffer<float> bufferIn = new ComputeBuffer<float>( context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrayIn );
ComputeBuffer<float> bufferOut = new ComputeBuffer<float>( context, ComputeMemoryFlags.WriteOnly, size );
 
ComputeProgram program = new ComputeProgram( context, source );
program.Build( null, null, null, IntPtr.Zero );
ComputeKernel kernel = program.CreateKernel( "Copy" );
kernel.SetMemoryArgument( 0, bufferIn );
kernel.SetMemoryArgument( 1, bufferOut );
 
ComputeCommandQueue commands = new ComputeCommandQueue( context, context.Devices[ 0 ], ComputeCommandQueueFlags.None );
commands.Execute( kernel, null, new long[]{ size }, null, null );
 
float[] arrayOut = new float[ size ];
GCHandle arrayOutHandle = GCHandle.Alloc( arrayOut, GCHandleType.Pinned );
unsafe
{
    CL.EnqueueReadBuffer(
        commands.Handle, bufferOut.Handle, false, IntPtr.Zero, new IntPtr( bufferOut.Size ),
        arrayOutHandle.AddrOfPinnedObject(), 0, ( IntPtr* )null, ( IntPtr* )null );
}
commands.Finish();
arrayOutHandle.Free();
nythrix's picture
Quote:

What would help is if OpenCL had an event to signal when the read is complete. If that was available (is it?), the async Read method could pin the buffer internally and unpin it when the 'ready' signal comes (possibly firing a 'ReadFinished' event or something similar). This is basically a less ugly implementation of the above async pattern, but it could only work if such a signal existed.

There is no such event. However, as I pointed out before, this problem is deterministically solvable using Cloo. Basically it will work like this:
1) The host enqueues an async command.
2) Cloo pins the needed structures and binds the GCHandles with the event that the async command generates.
3) The host carries on execution as normal.
4) Cloo intercepts one of the synchronizing commands (Finish, Wait, etc.) which the host is required to issue any time after an async call, and releases the GCHandles from the events that carry them.

It'd be lovely to have an option of specifying an event callback that would signal its completion. Unfortunately, as of OpenCL 1.0 no such mechanism exists.

ctk's picture

@Nythrix, yes that asynchronous read code you post works on my system, too! However, I can't seem to make it work for asynchronous read of single structs. Could you take a look at the code I posted before and see if you could make that work? I tried following your code but I still don't get any data read out for a single struct in non-blocking mode.

nythrix's picture

This method in ClooHelpers should do it:

public static int ReadKernelArgAsync<T>( ref ComputeCommandQueue queue, IntPtr buffer, GCHandle dataPtr )
    where T : struct
{
    unsafe
    {
        result = CL.EnqueueReadBuffer( queue.Handle, buffer, false, new IntPtr( 0 ),
            new IntPtr( Marshal.SizeOf( typeof(T) ) ), dataPtr.AddrOfPinnedObject(), 0, ( IntPtr* )null, ( IntPtr* )null );
    }
    return result;
}
ctk's picture

Nope, that ReadKernelArgSync did not work. In fact, using dataPtr.AddrOfPinnedObject() will cause no data to be read for a single struct no matter if it is blocking or not. You must use ref data in order for a single struct to be read successfully into .Net. I've tried out other variants of EnqueueReadBuffer, such reading a Vector4 array and a TestStruct array asynchronously and those work. I don't know exactly why this is so, but maybe I should just covert all the single structs in my .Net code to struct arrays and be done with the async read issue. I don't think I would have to change anything on the kernel side since the pointer notation would still work for structs array of length 1.

the Fiddler's picture

Is the pack size for the struct correct? Try specifying 1 instead of 8 and see if it works then (or does OpenCL expect alignment on 8byte boundaries?)

ctk's picture
the Fiddler wrote:

Is the pack size for the struct correct? Try specifying 1 instead of 8 and see if it works then (or does OpenCL expect alignment on 8byte boundaries?)

Just tried a pack size of 1, 4, and 8, but it didn't change anything. OpenCL expects a pack size that aligns vectors (8 for Vector2, 16 for Vector4, ie: 4 bytes per each float in a Vector). However, since I'm using a simple TestStruct with just a plain int and a float, I don't expect any alignment issues here. It's quite puzzling why the pinned memory doesn't work at all.

nythrix's picture

Technically speaking there is no such thing as a single struct going in and out of the kernel. In fact you are already using an array containing one TestStruct to achieve the same effect. This means you have to carefully track whether you're working with the array or with the struct itself. I believe that this inconsistency is the cause of the problems in your case.
A new release of Cloo is due out tomorrow. It comes with new internals which should take care of the asynchronous calls of OpenCL completely transparently.
All that said you might as well use a ComputeBuffer instead of a bare handle for your hack.

In any case, I've attached one of your previous files which shows the async call working.

AttachmentSize
AsyncReadTest.cs4.5 KB
ctk's picture

@Nythrix, that fixed it, thanks. The problem seems to be that "ref data" was not working in the async read command and instead required "new IntPtr(&data)". Looks like this wraps it up.

nythrix's picture
ctk wrote:

The problem seems to be that "ref data" was not working in the async read command and instead required "new IntPtr(&data)".

And there was light... :D

nythrix wrote:
CL.EnqueueReadBuffer( 
    queue.Handle, buffer, block_read, new IntPtr(0), new IntPtr(Marshal.SizeOf(data)), 
    ref data,  // <- Don't do this! Use new IntPtr(&data) instead!
    0, (IntPtr*)null, (IntPtr*)null );
ctk's picture
nythrix wrote:
ctk wrote:

The problem seems to be that "ref data" was not working in the async read command and instead required "new IntPtr(&data)".

And there was light... :D

nythrix wrote:
CL.EnqueueReadBuffer( 
    queue.Handle, buffer, block_read, new IntPtr(0), new IntPtr(Marshal.SizeOf(data)), 
    ref data,  // <- Don't do this! Use new IntPtr(&data) instead!
    0, (IntPtr*)null, (IntPtr*)null );

Yeah, I didn't get why you didn't like "ref data" before, but now I see. With "new IntPtr(&data)", there is now a minor problem where the computer says that it can't take the address of "data" if it is a generic type T since it's managed, but that's a minor problem I can work around by just overloading the ClooHelpers. But, I think you will run into this say compiler error when you modify public T[] Read( ......) in Cloo since you will have to take out the pinned memory address with a variant of "new IntPtr(&data)".

nythrix's picture

Mmm, no. Read/Write commands in Cloo behave differently. They pin the array (T[]) while you're pinning the struct itself (T). That's why you're having problems IMHO.

Some good news in the end: The public part of the Cloo API will probably not change in order to accommodate GCHandles tracking. It's fitting in without a fuss so far.

Jeff's picture

Nythrix, just to clarify, your improvements will fix asynchronous operations in concept, I.E. against any drivers that support it, but not for NVidia because their driver still does asynchronous operations synchronously.

So the only way (maybe, haven't tried it yet) to do asynchronous operations against NVidia is still multiple command queues.

nythrix's picture

Yes I can only improve the frontend (i.e. Cloo) because there's nothing I can do about the drivers.

Interesting thing is these async calls work perfectly in MacOS, or so I've heard. So let's see what they come up with for the rest of us.
I even read some hypothesis that nVidia is crippling OpenCL because of CUDA but this isn't very plausible. nVidia knows that people could switch HW and not the technology if things don't improve. And nVidia can't afford such risks with a standard that's rapidly gaining popularity.

ctk's picture

Speaking of which, when is Intel going to do anything with OpenCL? They are really late to the game with their own drivers. Seems like they don't want to be a part of it.