
CLOO: Memory Leaking Troubles
Posted Tuesday, 2 February, 2010 - 13:01 by CodyIrons inHi guys hope all is going well,
I'm running into a bit of an issue when using CLoo (o.5.1 and 0.6.0) and attempting to have a program call a kernel an obscene amount of times over the coarse of a benchmark. I'm just using the vectorAdd kernel and basically what i'm doing is generating a set of random inputs each iteration then calling the kernel to compute. I've tried to separate as much as possible the setup and execution of the kernel as i would like this to become fairly modular in any future projects I decide to do. But I just can't seem to pinpoint what could be causing this leak.
The leak seems to be revolving around the creating and disposing of the ComputeBuffers (but if it's some of my code please feel free to say so.) The project is only two classes and i tried to document them as much as possible.
Class one is Program.cs
using System; using System.Collections.Generic; using System.Linq; using System.Text; using Cloo; namespace OpenCLTesting { class Program { static void Main(string[] args) { ///first lets collect all the compute devices on all detected platforms ///this may be a little overkill but i was experimenting with ideas at the time Dictionary<ComputeContext, List<ComputeDevice>> thesystem = new Dictionary<ComputeContext, List<ComputeDevice>>(); List<ComputeDevice> allDevices; ComputeContext context; foreach(ComputePlatform cp in ComputePlatform.Platforms) { ComputeContextPropertyList propertyList = new ComputeContextPropertyList(cp); context = new ComputeContext(ComputeDeviceTypes.All, propertyList, null, IntPtr.Zero); allDevices = new List<ComputeDevice>(); foreach(ComputeDevice cd in context.Devices) { allDevices.Add(cd); } thesystem.Add(context, allDevices); } string vectorAddKernel = @"kernel void vectorAdd(global read_only float * a, global read_only float * b, global write_only float * c) { // Vector element index int nIndex = get_global_id(0); c[nIndex] = a[nIndex] + b[nIndex]; }"; int testSize = 1024; int testLengthInSeconds = 60; ///we will be tested each device in each platform for performance ///we will print off the data so it can be collected and observed foreach (KeyValuePair<ComputeContext, List<ComputeDevice>> platform in thesystem) { ComputeContext localContext = platform.Key; List<ComputeDevice> devices = platform.Value; foreach (ComputeDevice device in devices) { //Testing each individual device Console.WriteLine("DevicePlatform = {0}",device.Platform.Name); Console.WriteLine("DeviceName = {0}",device.Name); Console.WriteLine("DeviceCUnits = {0}", device.MaxComputeUnits); Console.WriteLine("DeviceSpeed = {0}", device.MaxClockFrequency); float result = testDevice(localContext, device, testLengthInSeconds, testSize, vectorAddKernel, "vectorAdd"); Console.WriteLine("DevicePerformance = {0}",result); //float result1 = testYetAgain(localContext, device, testLengthInSeconds, testSize, vectorAddKernel, "vectorAdd"); //Console.WriteLine("DevicePerformance1 = {0}", result1 ); } } Console.ReadLine(); } /// <summary> /// here we attempt to separate the oonstruction of the values needed /// by the kernel during execution time. This is being done in an /// attempt to make execution of the kernel slightly more abstracted /// </summary> /// <param name="localContext">the compute context</param> /// <param name="device">the device we are running on</param> /// <param name="lengthOfTestSeconds">how long in seconds we would like the test to run</param> /// <param name="testSize">the number of objects the kernel will be performing in one batch 'these are just floats for now'</param> /// <param name="kernelSource">the kernels source</param> /// <param name="kernelName">the name of the kernel in the source</param> /// <returns>a float value representing how well this device performed</returns> private static float testDevice( ComputeContext localContext, ComputeDevice device, int lengthOfTestSeconds, int testSize, string kernelSource, string kernelName) { float result = 0.0f; //create our test kernel instance Test2Kernel t2k = new Test2Kernel(localContext, device, kernelSource, kernelName); //the number of inputs to the kernel int inputCount = 2; List<float[]> inputs = new List<float[]>(); for (int i = 0; i < inputCount; i++) { float[] arrI = new float[testSize]; inputs.Add(arrI); } //an array for our outputs int outputCount = 1; List<float[]> outputs = new List<float[]>(); for (int i = 0; i < outputCount; i++) { float[] arrC = new float[testSize]; outputs.Add(arrC); } //just setting up random number gen some timing stuff and the //arrays list that will store all of our communication with the kernel Random rand = new Random(); DateTime start = DateTime.Now; TimeSpan testLength = new TimeSpan(0, 0, lengthOfTestSeconds); List<float[]> arrays = new List<float[]>(); //just looping through till time is up while ((DateTime.Now - start) < testLength) { //clear our array arrays.Clear(); //for the size of the test, (currently set to 1024) for (int i = 0; i < testSize; i++) { //for each input buffer we will genereate a random double for (int l_inputs = 0; l_inputs < inputCount; l_inputs++) { inputs[l_inputs][i] = (float)(rand.NextDouble() * 100); } //for each output buffer we will initialize to zero for (int l_outputs = 0; l_outputs < outputCount; l_outputs++) { outputs[l_outputs][i] = 0.0f; } } //now loop through our inputs and add them to arrays for (int i = 0; i < inputCount; i++) { arrays.Add(inputs[i]); } //do the same with outputs for (int i = 0; i < outputCount; i++) { arrays.Add(outputs[i]); } //now perform our calculation by lettting the kernel know how long //the test is, where the arrays are, the number of inputs and the //number of outputs t2k.performCalculation(testSize, ref arrays, inputCount, outputCount); //just some debuging to let us know it is running //Console.WriteLine("{0}){1} + {2} = {3}", result, arrays[0][0], arrays[1][0], arrays[2][0]); result++; //desparate attempt to get the memory leak to go away GC.Collect(); GC.WaitForPendingFinalizers(); } //we are calcualting performance as the size of the test (so 1024) //multiplied by the number of times it got through the loop //devided by the length of the test //this should possibly give us something like calculations per second return (float)testSize * result / lengthOfTestSeconds; } /// <summary> /// flattened version of test2kernel to rule out anything weird happening /// in our test2kernel instance based test. /// </summary> /// <param name="localContext"></param> /// <param name="device"></param> /// <param name="lengthOfTestSeconds"></param> /// <param name="testSize"></param> /// <param name="kernelSource"></param> /// <param name="kernelName"></param> /// <returns></returns> public static float testYetAgain(ComputeContext localContext, ComputeDevice device, int lengthOfTestSeconds, int testSize, string kernelSource, string kernelName) { float result = 0.0f; ComputeProgram m_computeProgram; ComputeKernel m_computeKernel; ComputeBuffer<float> tempBuffer; ComputeCommandQueue m_queue; ComputeBuffer<float>[] m_buffers; m_computeProgram = new ComputeProgram(localContext, new string[] { kernelSource }); m_computeProgram.Build(null, null, null, IntPtr.Zero); m_computeKernel = m_computeProgram.CreateKernel(kernelName); m_queue = new ComputeCommandQueue(localContext, device, ComputeCommandQueueFlags.None); m_buffers = new ComputeBuffer<float>[3]; // the number of values we want to run through the kernel each pass //int count = 10; //the number of inputs to the kernel int inputCount = 2; List<float[]> inputs = new List<float[]>(); for (int i = 0; i < inputCount; i++) { float[] arrI = new float[testSize]; inputs.Add(arrI); } //an array for our outputs int outputCount = 1; List<float[]> outputs = new List<float[]>(); for (int i = 0; i < outputCount; i++) { float[] arrC = new float[testSize]; outputs.Add(arrC); } Random rand = new Random(); DateTime start = DateTime.Now; TimeSpan testLength = new TimeSpan(0, 0, lengthOfTestSeconds); List<float[]> arrays = new List<float[]>(); while ((DateTime.Now - start) < testLength) { arrays.Clear(); for (int i = 0; i < testSize; i++) { for (int l_inputs = 0; l_inputs < inputCount; l_inputs++) { inputs[l_inputs][i] = (float)(rand.NextDouble() * 100); } for (int l_outputs = 0; l_outputs < outputCount; l_outputs++) { outputs[l_outputs][i] = 0.0f; } } for (int i = 0; i < inputCount; i++) { arrays.Add(inputs[i]); } for (int i = 0; i < outputCount; i++) { arrays.Add(outputs[i]); } //t2k.performCalculation(testSize, ref arrays, inputCount, outputCount); for (int i = 0; i < inputCount; i++) { m_buffers[i] = new ComputeBuffer<float>(localContext, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]); } for (int i = inputCount; i < inputCount + outputCount; i++) { m_buffers[i] = new ComputeBuffer<float>(localContext, ComputeMemoryFlags.WriteOnly, arrays[i].Length); } for (int i = 0; i < inputCount + outputCount; i++) { m_computeKernel.SetMemoryArgument(i, m_buffers[i]); } m_queue.Execute(m_computeKernel, null, new long[] { testSize }, null, null); for (int i = inputCount; i < inputCount + outputCount; i++) { arrays[i] = m_queue.Read(m_buffers[i], true, 0, testSize, null); } for (int i = 0; i < inputCount + outputCount; i++) { m_buffers[i].Dispose(); } //Console.WriteLine("{0})", result); result++; //GC.Collect(); //GC.WaitForPendingFinalizers(); } return (float)testSize * result / lengthOfTestSeconds; } } }
class 2 is called Test2Kernel.cs
using System; using System.Collections.Generic; using System.Linq; using System.Text; using Cloo; namespace OpenCLTesting { class Test2Kernel { ComputeContext m_context; ComputeDevice m_device; ComputeProgram m_computeProgram; ComputeKernel m_computeKernel; ComputeBuffer<float> tempBuffer; ComputeCommandQueue m_queue; //ComputeBuffer<float>[] m_buffers; List<ComputeBuffer<float>> m_buffers; string m_kernelSource; string m_kernelName; /// <summary> /// Constructor prepares a device given the the kernel information /// </summary> /// <param name="context"></param> /// <param name="device"></param> /// <param name="kernelSource"></param> /// <param name="kernelName"></param> public Test2Kernel(ComputeContext context, ComputeDevice device, String kernelSource, String kernelName) { m_context = context; m_device = device; m_kernelSource = kernelSource; m_kernelName = kernelName; initialize(); } /// <summary> /// pull out the instantiation of everything /// originally thought to 're-initialize' everything if certain conditions /// arize during execution /// </summary> private void initialize() { m_computeProgram = new ComputeProgram(m_context, new string[]{m_kernelSource}); m_computeProgram.Build(null, null, null, IntPtr.Zero); m_computeKernel = m_computeProgram.CreateKernel(m_kernelName); m_queue = new ComputeCommandQueue(m_context, m_device, ComputeCommandQueueFlags.None); //m_buffers = new ComputeBuffer<float>[3]; m_buffers = new List<ComputeBuffer<float>>(); } /// <summary> /// this perofrms the actual construciton of our compute buffers /// and then performs the calculation. The only Cloo items that /// are being reset each time are the ComputeBuffers. /// /// Have tried several ways of storing the buffers 'as array' /// 'as a list' but each way we still end up with a rather nasty /// memory leak /// </summary> /// <param name="count"></param> /// <param name="arrays"></param> /// <param name="inputCount"></param> /// <param name="outputCount"></param> public void performCalculation( int count, ref List<float[]> arrays, int inputCount, int outputCount) { //add our 'input' compute buffers to the compute buffer list for (int i = 0; i < inputCount; i++) { tempBuffer = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]); m_buffers.Add(tempBuffer); //m_buffers[i] = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]); } //add our 'output' compute bufferes to the main compute buffer list for (int i = inputCount; i < inputCount + outputCount; i++) { tempBuffer = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.WriteOnly, arrays[i].Length); m_buffers.Add(tempBuffer); //m_buffers[i] = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.WriteOnly, arrays[i].Length); } //map our compute buffers to the kernel for (int i = 0; i < inputCount + outputCount; i++) { m_computeKernel.SetMemoryArgument(i, m_buffers[i]); } //execute our kernel m_queue.Execute(m_computeKernel, null, new long[] { count }, null, null); //read from each output buffer and save it into our array reference for (int i = inputCount; i < inputCount + outputCount; i++) { arrays[i] = m_queue.Read(m_buffers[i], true, 0, count, null); } //dispose of each buffer for (int i = 0; i < inputCount+outputCount; i++) { m_buffers[i].Dispose(); } m_buffers.Clear(); //tempBuffer.Dispose(); } } }
But basically when you run this it will iterate through your platforms (which is always just one as far as i know) collecting contexts and devices for that context. It will then perform the benchmark on each device. The benchmark is set to take 60 seconds and if you have your taskmanager up you can see all your memory slowly creep away. I've also left in some commented code so you can see different things i've tried.
I just can't seem to find what could be leaking or a way to make it stop.
I guess a pertinent question would be, is there a way to clear the memory used by a device that i have not noticed yet?
Thanks for any help.
-Cody


Comments
Re: CLOO: Memory Leaking Troubles
This signature contains only value types, so it won't leak by itself. However, the surrounding code may generate garbage in some cases: for instance,
GCHandle.Allocmay cause boxing when used on value types.Any such leaks will be collected by the GC, however, and will not cause memory usage to grow indefinitely. This can only happen if a leak happens in native code and/or in calls to
Marshal.AllocHGlobaland similar.Edit: attached is a quick test that checks whether this signature leaks memory. Ran this on Mono 2.4.3 and no leak was detected:
On Linux, the test can be compiled and ran with the following commnads:
On Windows/MSVC, you'd need to add __declspec(dllexport) before the function definition in native.c, and compile native.c with cl and test.cs with csc (sorry, don't remember the command to create a dll off-hand).
Re: CLOO: Memory Leaking Troubles
I've been able to trace the cause of the problem back to the flat CL bindings. And it's not only Cloo's bindings, OpenTK's suffer from the same problem. My guts tell me we're somehow messing with marshaling and that is the root of all evil.
This is the C definition of a typical leaking function:
which translates to:
Is anyone able to see through this?
I've tried different signatures, especially replacing pointers with arrays, to no effect.
Just a minor observation, but should the Int32 not be UInt32 instead?
Re: CLOO: Memory Leaking Troubles
Problem is,
uintis not CLS-compliant and thus cannot be used in the public API. Or, a CLS-compiant overload should be provided along with the non-CLS-compliant one (which is what OpenTK does).That said, in most cases it doesn't really matter. If a
uintoverload isn't provided, at worst you'll have to use a cast:(int)0x80000000. Unless you are working with bit-patterns (GL.LineStipple?), the need for such casts doesn't arise very often (how often would you need more than 2^31 - 1 work units?)Re: CLOO: Memory Leaking Troubles
Okay, I've figured out what is causing the memory leak problem we are seeing here. The problem is that the bindings are using "out IntPtr newEvent" instead of "IntPtr* newEvent", which creates memory leaks in Cloo since OpenCl is the one responsible for freeing this pointer instead of the Garbage Collector. This is a problem with many of the interop bindings in Cloo/OpenTK (not just the ones we've used in this thread).
For example, in Cloo 0.6.0 under CL10.cs:
We can verify that it should be "IntPtr* newEvent" from the OpenCL man pages: http://www.khronos.org/opencl/sdk/1.0/docs/man/xhtml/index.html. For the above binding, the relevant page is http://www.khronos.org/opencl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKer...
Attached is my the copy of Cloo 0.6.0 on my computer with the VectorAdd demo I've posted in the previous page of this thread. The previous VectorAdd had a memory leak whereas I've manually edited the bindings for EnqueueNDRangeKernel, EnqueueTask, and EnqueueRead to demonstrate the fix for the memory leak. The relevant Cloo functions in ComputeCommandQueue.cs were also changed. This version of VectorAdd is stable and will hover around 20 megs of memory use for a 64-bit system.
For the relevant functions in ComputeCommandQueue.cs, we will have to additionally modify:
Re: CLOO: Memory Leaking Troubles
You're my hero mate!
Actually, it's not the "out" parameter itself. Instead it's the goddamn events that are leaking! I retrieve them and stick them in the list. But if no event list is provided they just keep piling up because I never dispose of them!!!
I shall update the library immediately.
What a stupid bug...
Re: CLOO: Memory Leaking Troubles
A new version is online. This memory problem is hopefully gone. The Clootils + ATI Stream combo problem should be fixed as well.
Looking forward to your feedback.
Re: CLOO: Memory Leaking Troubles
Yes! Cloo 0.6.1 has fixed both the memory leak and the ATI Stream output in WinForms. Great job, I will be porting my AI code to the new version.
I think it would be a good idea to update the gallery page for Cloo to reflect the latest version: http://opentk.com/project/cloo or have more links to http://sourceforge.net/projects/cloo/ so that it is easier to reach.
Re: CLOO: Memory Leaking Troubles
I can confirm on my work computer that my tests do not leak with sizes 10, 1000, or 100000. Most excellent!!!
And good work guys.
Re: CLOO: Memory Leaking Troubles
Good to hear Cody. And just in case you didn't know this trick:
I use this in my AI code to compute things when I have more items to compute than the global work item size limits. It may help you to improve CPU utilization.
Re: CLOO: Memory Leaking Troubles
I think it would be a good idea to update the gallery page for Cloo to reflect the latest version: http://opentk.com/project/cloo or have more links to http://sourceforge.net/projects/cloo/ so that it is easier to reach.
I stopped updating the gallery page to prevent people from extensively using the opentk forums for cloo related stuff. Despite the projects being related and Fiddler being fine with everything, I was afraid the forums here would end up being a mess.
I know, I haven't bothered with letting people know Cloo exists. Perhaps, I should go out and spread the word. Thing is, advertisment is way low on my TODO list :)
Thank you all for your most valuable help!
Re: CLOO: Memory Leaking Troubles
I stopped updating the gallery page to prevent people from extensively using the opentk forums for cloo related stuff.
In that case, it would probably be a good idea to add http://sourceforge.net/projects/cloo/ as the homepage in http://opentk.com/project/cloo, plus a note at the top for using the sourceforge forums and/or bug tracker.
Regarding the download, you can create a 'dummy' release (called e.g. 1.0-dev) and use an http redirect to sourceforge as the download file. This way, users will always get the latest version by clicking on the download link, without you having to do anything.
The attached file should do the trick (edit as desired).
Re: CLOO: Memory Leaking Troubles
Done. Thanks for the advice.