Jeff's picture

CLOO (or probably OpenCL): Anyone used non-blocking ComputeCommandQueue.Read?

I'm trying to do non-blocking ComputeCommandQueue.Reads and I can't seem to make it not block. Passing "true" or "false" for the blocking parameter doesn't seem to have any effect.

float[] retVal = queue.Read(c, true, 0, vec1.Length, events);

or

float[] retVal = queue.Read(c, false, 0, vec1.Length, events);

In both cases retVal contains the answer the instant control returns from the Read call, even for operations that take more than 3 seconds (read just blocks for the 3 seconds until it is complete, regardless of the value passed for "blocking").

Is there something else I'm supposed to do to enable non-blocking operations other than just passing "false" to ComputeCommandQueue.Read?


Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
nythrix's picture

Up till now asynchronous calls are not an option if you're running nVidia. This is a known "feature".
I don't know about ATI.

ctk's picture

Nythrix, maybe you can shed some light on this, but I don't see how the current queue.Read(....) command in Cloo can be asynchronous in it's current form even if the underlying graphics card and driver supports it. For example, say we wanted async reads like:

float[] resultsA = queue.Read(a, false, 0, resultsLength, events);
float[] resultsB = queue.Read(b, false, 0, resultsLength, events);
float[] resultsC = queue.Read(c, false, 0, resultsLength, events);
queue.Finish();

Each line in the code would have to be blocking for the array variables to store valid data since they have to wait for queue.Read(....) to read and generate the results into .Net form. And I don't believe you can have 3 separate return calls running at the same time in .Net like this without a different syntax.

nythrix's picture

Yes, I'm aware of this problem. I'm keeping a comment/warning about possible misbehavior of asynchronous calls right on top of ComputeCommandQueue.cs so that it doesn't get lost in the woods. I will redesign this part although without a working asynchronous mechanism it's a bit of a blind shot.

Getting our hands dirty, this relates to the old keep-alive problem we were facing in the case of kernel arguments. Unfortunately, getting rid of it completely is not easy. For example, I've just discovered that keeping alive kernel arguments randomly blows up disposing of resources (but only some of them).
You'll see another two or three versions of Cloo come out, I guess :)

Jeff's picture

I think I don't understand the problem... It looks to me like the implementation of read right now creates the return array and returns it empty, having passed a pointer to the memory to the underlying call to populate at leisure.

Is the problem that the "gcHandle.Free" really shouldn't be called until the async call is finished, but it instead is called right away (thereby allowing the GC to move or blow away your memory before the underlying call gets around to writing to it)?

Well, if I can't make async calls, do you have any other ideas how I might achieve what I want? I have a large array (millions) of data I need to load, process, and return, but I was hoping to queue it up as a series of smaller chunks (thousands) to try to get the GPU working on the first chunk while the second one was being transferred.

Right now (if I understand everything correctly) the GPU sits idle until the entire array is loaded into its memory, then it processes the whole thing, then it sits idle again while the results are sent back.

The other thread you linked to implies that they couldn't get a data transfer and processing to happen at the same time no matter what they tried... which would be unfortunate.

ctk's picture

Nythrix, perhaps the simplest solution to all these problems would simply to require the user to manually call a clean up function at the end of their code instead of trying to anticipate when to dispose with totally managed code. For example, in my simulation, I currently use the following code to read and write single structs into the kernel (not arrays, just single structs, I have a lot of single variable data required for my simulation and structs make them easier to manage):

using Cloo;
using System;
using OpenTK;
using OpenTK.Compute.CL10;
using System.IO;
using System.Runtime.InteropServices;
using System.Diagnostics;
using System.Collections.Generic;
 
namespace OpenCLTest
{
    /// <summary>
    /// Enables the reading and writing of structs in a kernel.
    /// </summary>
    public static class ClooHelpers
    {
        /// <summary>
        /// Sets a kernel argument of arbitrary type.
        /// </summary>
        /// <param name="data">The data to place in the kernel argument.</param>
        /// <param name="context">A valid context we are currently using.</param>
        /// <param name="hMemList">A list holding pointers to the buffers. Used to properly dispose the buffer (User does it manually in this case).</param>        
        /// <param name="kernel">The current kernel we are putting data into.</param>
        /// <param name="memFlags">Memory flags.</param>
        /// <param name="kernelArgCount">The index of the argument in the kernel.</param>
        /// <returns></returns>
        public static void SetKernelArg<T>(ref T data, ref ComputeContext context, ref List<IntPtr> hMemList,
            ref ComputeKernel kernel, MemFlags memFlags, int kernelArgCount)
        {
            unsafe
            {
                GCHandle dataPtr = GCHandle.Alloc(data, GCHandleType.Pinned);
                IntPtr hDeviceMem;
                ErrorCode error;
 
                try
                {
                    hDeviceMem = CL.CreateBuffer(context.Handle,
                       memFlags,
                       new IntPtr(Marshal.SizeOf(data)),
                       dataPtr.AddrOfPinnedObject(),
                       out error);
 
                    if (error != ErrorCode.Success)
                        throw new Exception(error.ToString());
                }
                finally
                {
                    dataPtr.Free();
                }
 
                hMemList.Add(hDeviceMem);
                CL.SetKernelArg(kernel.Handle, kernelArgCount, new IntPtr(Marshal.SizeOf(typeof(IntPtr))), new IntPtr(&hDeviceMem));
            };
        }
 
        /// <summary>
        /// Enqueues a read for a struct object in the kernel.
        /// </summary>
        /// <param name="data">Where we will read our data to</param>
        /// <param name="queue">The queue used to read the data from.</param>
        /// <param name="buffer">A valid buffer reference the struct object in the kernel.</param>        
        /// <returns></returns>
        public static int ReadKernelArg<T>(ref T data, ref ComputeCommandQueue queue, IntPtr buffer) where T : struct
        {
            int result;
            unsafe
            {
                GCHandle dataPtr = GCHandle.Alloc(data, GCHandleType.Pinned);                
 
                try
                {
                    // block_reading must be set to true or the data is not read (bug?)
                    result = CL.EnqueueReadBuffer(queue.Handle, buffer, true, new IntPtr(0),
                        new IntPtr(Marshal.SizeOf(data)), ref data, 0, (IntPtr*)null, (IntPtr*)null);
                }
                finally
                {
                    dataPtr.Free();
                }
            }
            return result;
        }
    }
}

I keep a list of the IntPtr's to the buffers with List hMemList passed into SetKernelArg, and I manually

            for (int i = 0; i < hMemList.Count; i++)
            {
                CL.ReleaseMemObject(hMemList[i]);
            }

at the end of my program.

In my ReadKernelArg implementation, you could theoretically do an async read by changing the true to a false in EnqueueReadBuffer, although the pinned memory would be a problem to manage. However, on my system (duo core CPU, ATI Stream SDK), I've found that async reads did not work and data could not be read out. That's why it's set to block above.

BTW, nythrix, any chance of enabling the read and write of SINGLE structs into the kernel for the next release of CLOO? I know that you can read and write arrays of structs right now with CLOO but I haven't found an easy way to just read and write a single struct. Am I missing some obvious command here?

nythrix's picture

Ok. One bit at a time.
1. Read/Write methods
You are right, they don't work properly in case the implementation supports asynchronous calls. Letting the user manage the resources is a simple solution but I'm not a fan of it because it is quite error prone. Besides I believe there's this better approach I've been playing with for some time now. The main idea is to bind the GCHandles to the ComputeEvents that the R/W methods generate thus delaying the call to GCHandle.Free() until the corresponding command is complete. This is fully transparent to the user and doesn't require any changes to the API.
2.

Quote:

// block_reading must be set to true or the data is not read (bug?)

Have you hit this issue? I haven't been able to enforce this exact bug, even though I was sure it would happen. I think this proves that async calls are working in your environment. I would very much appreciate some info on how to reproduce this!
3. Kernel args
There is
ComputeKernel.SetValueArgument<T>( int index, T data ) for sending in any blittable structure, or
ComputeKernel.SetArgument( int index, IntPtr dataSize, IntPtr dataAddr ) which, apart from the error check, just calls the underlying clSetKernelArg so you can use it to send in virtually anything.
You cannot read back an argument unless you wrap it up in a ComputeBuffer. The kernel treats pointers as buffers anyway so the hack you're using is pretty much the only way.

ctk's picture
nythrix wrote:

2.

Quote:

// block_reading must be set to true or the data is not read (bug?)

Have you hit this issue? I haven't been able to enforce this exact bug, even though I was sure it would happen. I think this proves that async calls are working in your environment. I would very much appreciate some info on how to reproduce this!

Yep, several times I tried doing an async read by setting the block to false and my program would run and finish but the expected data would not be read out. So it does look like async reads work on my computer. Maybe I'm not getting the data out correctly since I'm pinning the struct in .Net and the read is asynchronous. To me, it looks like the struct would be unpinned before the async read would be able to write data to it.

I'll pull the code out and make a test case to try and reproduce this. I'll post it up later.

nythrix wrote:

3. Kernel args
There is
ComputeKernel.SetValueArgument<T>( int index, T data ) for sending in any blittable structure, or
ComputeKernel.SetArgument( int index, IntPtr dataSize, IntPtr dataAddr ) which, apart from the error check, just calls the underlying clSetKernelArg so you can use it to send in virtually anything.
You cannot read back an argument unless you wrap it up in a ComputeBuffer. The kernel treats pointers as buffers anyway so the hack you're using is pretty much the only way.

Ok, I will stick with my hack since I need both read and write for single structs and it looks like the path of least resistance.

ctk's picture
Jeff wrote:

I think I don't understand the problem... It looks to me like the implementation of read right now creates the return array and returns it empty, having passed a pointer to the memory to the underlying call to populate at leisure.

Is the problem that the "gcHandle.Free" really shouldn't be called until the async call is finished, but it instead is called right away (thereby allowing the GC to move or blow away your memory before the underlying call gets around to writing to it)?

Well, if I can't make async calls, do you have any other ideas how I might achieve what I want? I have a large array (millions) of data I need to load, process, and return, but I was hoping to queue it up as a series of smaller chunks (thousands) to try to get the GPU working on the first chunk while the second one was being transferred.

Right now (if I understand everything correctly) the GPU sits idle until the entire array is loaded into its memory, then it processes the whole thing, then it sits idle again while the results are sent back.

The other thread you linked to implies that they couldn't get a data transfer and processing to happen at the same time no matter what they tried... which would be unfortunate.

Does the second chunk of your data depend on the first chunk of data to return a final result or are they totally independent? If they are totally independent, you could try creating host threads in .Net with each one having a command queue as suggested in the link or, you could split up your data into separate files and run multiple copies of your program with each pointed to a separate file (I can confirm this second method works). If your data chunks depend on one another to return a final result, I think you should wait and see if I have any success getting the async read to work on my computer.

nythrix's picture
Quote:

I'll pull the code out and make a test case to try and reproduce this. I'll post it up later.

Thanks.

Quote:

If they are totally independent, you could try creating host threads in .Net with each one having a command queue as suggested in the link

It doesn't really matter how many host threads you have, IMHO. Unless you really need several host threads a single host and separate command queues for reading/writing/executing will get you quite far without much hassle.

None of this will help if the backend is broken, though.

ctk's picture

Attached is a test case that shows that async reads execute on my computer but it does not properly read the data from the kernel when using my ClooHelper hack for structs. Example output:

---Async Read Demo---
Executing.....
Blocking Output:
testInt: 12345
testFloat: 1.2345

Async Output:
testInt: -1
testFloat: -1

The correct result is from the blocking output. I ruled out that GCAlloc is causing problems by manually allocating the pointer to the struct myself and freeing it manually at the end of the program.

AttachmentSize
AsyncReadTest.7z10.26 KB