
CLOO (or probably OpenCL): Anyone used non-blocking ComputeCommandQueue.Read?
Posted Thursday, 4 March, 2010 - 21:14 by Jeff inI'm trying to do non-blocking ComputeCommandQueue.Reads and I can't seem to make it not block. Passing "true" or "false" for the blocking parameter doesn't seem to have any effect.
float[] retVal = queue.Read(c, true, 0, vec1.Length, events);
or
float[] retVal = queue.Read(c, false, 0, vec1.Length, events);
In both cases retVal contains the answer the instant control returns from the Read call, even for operations that take more than 3 seconds (read just blocks for the 3 seconds until it is complete, regardless of the value passed for "blocking").
Is there something else I'm supposed to do to enable non-blocking operations other than just passing "false" to ComputeCommandQueue.Read?


Comments
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Up till now asynchronous calls are not an option if you're running nVidia. This is a known "feature".
I don't know about ATI.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Nythrix, maybe you can shed some light on this, but I don't see how the current queue.Read(....) command in Cloo can be asynchronous in it's current form even if the underlying graphics card and driver supports it. For example, say we wanted async reads like:
Each line in the code would have to be blocking for the array variables to store valid data since they have to wait for queue.Read(....) to read and generate the results into .Net form. And I don't believe you can have 3 separate return calls running at the same time in .Net like this without a different syntax.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Yes, I'm aware of this problem. I'm keeping a comment/warning about possible misbehavior of asynchronous calls right on top of ComputeCommandQueue.cs so that it doesn't get lost in the woods. I will redesign this part although without a working asynchronous mechanism it's a bit of a blind shot.
Getting our hands dirty, this relates to the old keep-alive problem we were facing in the case of kernel arguments. Unfortunately, getting rid of it completely is not easy. For example, I've just discovered that keeping alive kernel arguments randomly blows up disposing of resources (but only some of them).
You'll see another two or three versions of Cloo come out, I guess :)
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
I think I don't understand the problem... It looks to me like the implementation of read right now creates the return array and returns it empty, having passed a pointer to the memory to the underlying call to populate at leisure.
Is the problem that the "gcHandle.Free" really shouldn't be called until the async call is finished, but it instead is called right away (thereby allowing the GC to move or blow away your memory before the underlying call gets around to writing to it)?
Well, if I can't make async calls, do you have any other ideas how I might achieve what I want? I have a large array (millions) of data I need to load, process, and return, but I was hoping to queue it up as a series of smaller chunks (thousands) to try to get the GPU working on the first chunk while the second one was being transferred.
Right now (if I understand everything correctly) the GPU sits idle until the entire array is loaded into its memory, then it processes the whole thing, then it sits idle again while the results are sent back.
The other thread you linked to implies that they couldn't get a data transfer and processing to happen at the same time no matter what they tried... which would be unfortunate.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Nythrix, perhaps the simplest solution to all these problems would simply to require the user to manually call a clean up function at the end of their code instead of trying to anticipate when to dispose with totally managed code. For example, in my simulation, I currently use the following code to read and write single structs into the kernel (not arrays, just single structs, I have a lot of single variable data required for my simulation and structs make them easier to manage):
I keep a list of the IntPtr's to the buffers with List hMemList passed into SetKernelArg, and I manually
at the end of my program.
In my ReadKernelArg implementation, you could theoretically do an async read by changing the true to a false in EnqueueReadBuffer, although the pinned memory would be a problem to manage. However, on my system (duo core CPU, ATI Stream SDK), I've found that async reads did not work and data could not be read out. That's why it's set to block above.
BTW, nythrix, any chance of enabling the read and write of SINGLE structs into the kernel for the next release of CLOO? I know that you can read and write arrays of structs right now with CLOO but I haven't found an easy way to just read and write a single struct. Am I missing some obvious command here?
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Ok. One bit at a time.
1. Read/Write methods
You are right, they don't work properly in case the implementation supports asynchronous calls. Letting the user manage the resources is a simple solution but I'm not a fan of it because it is quite error prone. Besides I believe there's this better approach I've been playing with for some time now. The main idea is to bind the GCHandles to the ComputeEvents that the R/W methods generate thus delaying the call to GCHandle.Free() until the corresponding command is complete. This is fully transparent to the user and doesn't require any changes to the API.
2.
// block_reading must be set to true or the data is not read (bug?)
Have you hit this issue? I haven't been able to enforce this exact bug, even though I was sure it would happen. I think this proves that async calls are working in your environment. I would very much appreciate some info on how to reproduce this!
3. Kernel args
There is
ComputeKernel.SetValueArgument<T>( int index, T data )for sending in any blittable structure, orComputeKernel.SetArgument( int index, IntPtr dataSize, IntPtr dataAddr )which, apart from the error check, just calls the underlyingclSetKernelArgso you can use it to send in virtually anything.You cannot read back an argument unless you wrap it up in a ComputeBuffer. The kernel treats pointers as buffers anyway so the hack you're using is pretty much the only way.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
2.
// block_reading must be set to true or the data is not read (bug?)
Have you hit this issue? I haven't been able to enforce this exact bug, even though I was sure it would happen. I think this proves that async calls are working in your environment. I would very much appreciate some info on how to reproduce this!
Yep, several times I tried doing an async read by setting the block to false and my program would run and finish but the expected data would not be read out. So it does look like async reads work on my computer. Maybe I'm not getting the data out correctly since I'm pinning the struct in .Net and the read is asynchronous. To me, it looks like the struct would be unpinned before the async read would be able to write data to it.
I'll pull the code out and make a test case to try and reproduce this. I'll post it up later.
3. Kernel args
There is
ComputeKernel.SetValueArgument<T>( int index, T data )for sending in any blittable structure, orComputeKernel.SetArgument( int index, IntPtr dataSize, IntPtr dataAddr )which, apart from the error check, just calls the underlyingclSetKernelArgso you can use it to send in virtually anything.You cannot read back an argument unless you wrap it up in a ComputeBuffer. The kernel treats pointers as buffers anyway so the hack you're using is pretty much the only way.
Ok, I will stick with my hack since I need both read and write for single structs and it looks like the path of least resistance.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
I think I don't understand the problem... It looks to me like the implementation of read right now creates the return array and returns it empty, having passed a pointer to the memory to the underlying call to populate at leisure.
Is the problem that the "gcHandle.Free" really shouldn't be called until the async call is finished, but it instead is called right away (thereby allowing the GC to move or blow away your memory before the underlying call gets around to writing to it)?
Well, if I can't make async calls, do you have any other ideas how I might achieve what I want? I have a large array (millions) of data I need to load, process, and return, but I was hoping to queue it up as a series of smaller chunks (thousands) to try to get the GPU working on the first chunk while the second one was being transferred.
Right now (if I understand everything correctly) the GPU sits idle until the entire array is loaded into its memory, then it processes the whole thing, then it sits idle again while the results are sent back.
The other thread you linked to implies that they couldn't get a data transfer and processing to happen at the same time no matter what they tried... which would be unfortunate.
Does the second chunk of your data depend on the first chunk of data to return a final result or are they totally independent? If they are totally independent, you could try creating host threads in .Net with each one having a command queue as suggested in the link or, you could split up your data into separate files and run multiple copies of your program with each pointed to a separate file (I can confirm this second method works). If your data chunks depend on one another to return a final result, I think you should wait and see if I have any success getting the async read to work on my computer.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
I'll pull the code out and make a test case to try and reproduce this. I'll post it up later.
Thanks.
If they are totally independent, you could try creating host threads in .Net with each one having a command queue as suggested in the link
It doesn't really matter how many host threads you have, IMHO. Unless you really need several host threads a single host and separate command queues for reading/writing/executing will get you quite far without much hassle.
None of this will help if the backend is broken, though.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Attached is a test case that shows that async reads execute on my computer but it does not properly read the data from the kernel when using my ClooHelper hack for structs. Example output:
---Async Read Demo---
Executing.....
Blocking Output:
testInt: 12345
testFloat: 1.2345
Async Output:
testInt: -1
testFloat: -1
The correct result is from the blocking output. I ruled out that GCAlloc is causing problems by manually allocating the pointer to the struct myself and freeing it manually at the end of the program.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Good news is ATI Stream definitely supports async calls.
As for the bad news there was something fishy in that code because I couldn't reproduce the issue using only Cloo functionality. Indeed, the part that's causing the problem is one of the OpenTK calls:
Cloo.Bindings are very strict and don't include overloads like this because I believe that bypassing the marshaler altogether for APIs such as OpenCL is the best thing to do.
However, this doesn't mean that the hi-level R/W operations in Cloo are solid. It is just a different problem.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Cloo.Bindings are very strict and don't include overloads like this because I believe that bypassing the marshaler altogether for APIs such as OpenCL is the best thing to do.
Ref parameters in OpenTK bindings do bypass the marshaler (otherwise half of these overloads wouldn't even work):
That said, this doesn't really help here as the parameter is pinned only for the duration of the call. Then again, pinning the structure for the duration of the program should rule that issue out.
Some piece of the puzzle is missing here, I'll check again tomorrow with a clear brain.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Oh, now I see. I thought this was caused by the marshaler. Unfortunately, I tend to miss the details late in the night.
I believe there's no way this mechanism would work for async calls. Not at this level anyway. Wouldn't it be better if we ruled out these overloads?
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Does anyone actually have asynchronous reads working in a C/C++ program with OpenCL?
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
That's one possibility. Another might be to use the .Net async pattern, which works like this:
The idea is that the begin method will pin the necessary resources and invoke the async call and the End method will block until the results are ready and clear the pins. The larger the distance between Begin and End, the lower the chance End will block.
Personally, I really dislike this pattern for all the reasons outlined here. The problem is I can't think of any good way to solve this without some form of explicit pin (via an explicit GCHandle or via a Begin* method).
What would help is if OpenCL had an event to signal when the read is complete. If that was available (is it?), the async Read method could pin the buffer internally and unpin it when the 'ready' signal comes (possibly firing a 'ReadFinished' event or something similar). This is basically a less ugly implementation of the above async pattern, but it could only work if such a signal existed.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
This works:
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
What would help is if OpenCL had an event to signal when the read is complete. If that was available (is it?), the async Read method could pin the buffer internally and unpin it when the 'ready' signal comes (possibly firing a 'ReadFinished' event or something similar). This is basically a less ugly implementation of the above async pattern, but it could only work if such a signal existed.
There is no such event. However, as I pointed out before, this problem is deterministically solvable using Cloo. Basically it will work like this:
1) The host enqueues an async command.
2) Cloo pins the needed structures and binds the GCHandles with the event that the async command generates.
3) The host carries on execution as normal.
4) Cloo intercepts one of the synchronizing commands (Finish, Wait, etc.) which the host is required to issue any time after an async call, and releases the GCHandles from the events that carry them.
It'd be lovely to have an option of specifying an event callback that would signal its completion. Unfortunately, as of OpenCL 1.0 no such mechanism exists.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
@Nythrix, yes that asynchronous read code you post works on my system, too! However, I can't seem to make it work for asynchronous read of single structs. Could you take a look at the code I posted before and see if you could make that work? I tried following your code but I still don't get any data read out for a single struct in non-blocking mode.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
This method in ClooHelpers should do it:
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Nope, that ReadKernelArgSync did not work. In fact, using dataPtr.AddrOfPinnedObject() will cause no data to be read for a single struct no matter if it is blocking or not. You must use ref data in order for a single struct to be read successfully into .Net. I've tried out other variants of EnqueueReadBuffer, such reading a Vector4 array and a TestStruct array asynchronously and those work. I don't know exactly why this is so, but maybe I should just covert all the single structs in my .Net code to struct arrays and be done with the async read issue. I don't think I would have to change anything on the kernel side since the pointer notation would still work for structs array of length 1.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Is the pack size for the struct correct? Try specifying 1 instead of 8 and see if it works then (or does OpenCL expect alignment on 8byte boundaries?)
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Is the pack size for the struct correct? Try specifying 1 instead of 8 and see if it works then (or does OpenCL expect alignment on 8byte boundaries?)
Just tried a pack size of 1, 4, and 8, but it didn't change anything. OpenCL expects a pack size that aligns vectors (8 for Vector2, 16 for Vector4, ie: 4 bytes per each float in a Vector). However, since I'm using a simple TestStruct with just a plain int and a float, I don't expect any alignment issues here. It's quite puzzling why the pinned memory doesn't work at all.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Technically speaking there is no such thing as a single struct going in and out of the kernel. In fact you are already using an array containing one TestStruct to achieve the same effect. This means you have to carefully track whether you're working with the array or with the struct itself. I believe that this inconsistency is the cause of the problems in your case.
A new release of Cloo is due out tomorrow. It comes with new internals which should take care of the asynchronous calls of OpenCL completely transparently.
All that said you might as well use a ComputeBuffer instead of a bare handle for your hack.
In any case, I've attached one of your previous files which shows the async call working.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
@Nythrix, that fixed it, thanks. The problem seems to be that "ref data" was not working in the async read command and instead required "new IntPtr(&data)". Looks like this wraps it up.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
The problem seems to be that "ref data" was not working in the async read command and instead required "new IntPtr(&data)".
And there was light... :D
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
The problem seems to be that "ref data" was not working in the async read command and instead required "new IntPtr(&data)".
And there was light... :D
Yeah, I didn't get why you didn't like "ref data" before, but now I see. With "new IntPtr(&data)", there is now a minor problem where the computer says that it can't take the address of "data" if it is a generic type T since it's managed, but that's a minor problem I can work around by just overloading the ClooHelpers. But, I think you will run into this say compiler error when you modify public T[] Read( ......) in Cloo since you will have to take out the pinned memory address with a variant of "new IntPtr(&data)".
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Mmm, no. Read/Write commands in Cloo behave differently. They pin the array (T[]) while you're pinning the struct itself (T). That's why you're having problems IMHO.
Some good news in the end: The public part of the Cloo API will probably not change in order to accommodate GCHandles tracking. It's fitting in without a fuss so far.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Nythrix, just to clarify, your improvements will fix asynchronous operations in concept, I.E. against any drivers that support it, but not for NVidia because their driver still does asynchronous operations synchronously.
So the only way (maybe, haven't tried it yet) to do asynchronous operations against NVidia is still multiple command queues.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Yes I can only improve the frontend (i.e. Cloo) because there's nothing I can do about the drivers.
Interesting thing is these async calls work perfectly in MacOS, or so I've heard. So let's see what they come up with for the rest of us.
I even read some hypothesis that nVidia is crippling OpenCL because of CUDA but this isn't very plausible. nVidia knows that people could switch HW and not the technology if things don't improve. And nVidia can't afford such risks with a standard that's rapidly gaining popularity.
Re: CLOO (or probably OpenCL): Anyone used non-blocking ...
Speaking of which, when is Intel going to do anything with OpenCL? They are really late to the game with their own drivers. Seems like they don't want to be a part of it.