CodyIrons's picture

CLOO: Memory Leaking Troubles

Hi guys hope all is going well,

I'm running into a bit of an issue when using CLoo (o.5.1 and 0.6.0) and attempting to have a program call a kernel an obscene amount of times over the coarse of a benchmark. I'm just using the vectorAdd kernel and basically what i'm doing is generating a set of random inputs each iteration then calling the kernel to compute. I've tried to separate as much as possible the setup and execution of the kernel as i would like this to become fairly modular in any future projects I decide to do. But I just can't seem to pinpoint what could be causing this leak.

The leak seems to be revolving around the creating and disposing of the ComputeBuffers (but if it's some of my code please feel free to say so.) The project is only two classes and i tried to document them as much as possible.

Class one is Program.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Cloo;
 
namespace OpenCLTesting
{
    class Program
    {
        static void Main(string[] args)
        {
            ///first lets collect all the compute devices on all detected platforms
            ///this may be a little overkill but i was experimenting with ideas at the time
            Dictionary<ComputeContext, List<ComputeDevice>> thesystem
                = new Dictionary<ComputeContext, List<ComputeDevice>>();
            List<ComputeDevice> allDevices;
            ComputeContext context;
            foreach(ComputePlatform cp in ComputePlatform.Platforms)
            {
 
                ComputeContextPropertyList propertyList = new ComputeContextPropertyList(cp);
                context = new ComputeContext(ComputeDeviceTypes.All, propertyList, null, IntPtr.Zero);
                allDevices = new List<ComputeDevice>();
                foreach(ComputeDevice cd in context.Devices)
                {
                    allDevices.Add(cd);
                }
                thesystem.Add(context, allDevices);
            }
 
 
            string vectorAddKernel = @"kernel void
            vectorAdd(global read_only float * a,
                      global read_only float * b,
                      global write_only float * c)
            {
                // Vector element index
                int nIndex = get_global_id(0);
                c[nIndex] = a[nIndex] + b[nIndex];
            }";
 
            int testSize = 1024;
            int testLengthInSeconds = 60;
 
            ///we will be tested each device in each platform for performance
            ///we will print off the data so it can be collected and observed
            foreach (KeyValuePair<ComputeContext, List<ComputeDevice>> platform in thesystem)
            {
                ComputeContext localContext = platform.Key;
                List<ComputeDevice> devices = platform.Value;
 
                foreach (ComputeDevice device in devices)
                {
                    //Testing each individual device
                    Console.WriteLine("DevicePlatform = {0}",device.Platform.Name);
                    Console.WriteLine("DeviceName = {0}",device.Name);
                    Console.WriteLine("DeviceCUnits = {0}", device.MaxComputeUnits);
                    Console.WriteLine("DeviceSpeed = {0}", device.MaxClockFrequency);
 
                    float result = testDevice(localContext, device, testLengthInSeconds, testSize, vectorAddKernel, "vectorAdd");
                    Console.WriteLine("DevicePerformance = {0}",result);
 
                    //float result1 = testYetAgain(localContext, device, testLengthInSeconds, testSize, vectorAddKernel, "vectorAdd");
                    //Console.WriteLine("DevicePerformance1 = {0}", result1 );
                }
            }
            Console.ReadLine();
        }
 
        /// <summary>
        /// here we attempt to separate the oonstruction of the values needed 
        /// by the kernel during execution time.  This is being done in an
        /// attempt to make execution of the kernel slightly more abstracted
        /// </summary>
        /// <param name="localContext">the compute context</param>
        /// <param name="device">the device we are running on</param>
        /// <param name="lengthOfTestSeconds">how long in seconds we would like the test to run</param>
        /// <param name="testSize">the number of objects the kernel will be performing in one batch 'these are just floats for now'</param>
        /// <param name="kernelSource">the kernels source</param>
        /// <param name="kernelName">the name of the kernel in the source</param>
        /// <returns>a float value representing how well this device performed</returns>
        private static float testDevice(
            ComputeContext localContext, 
            ComputeDevice device, 
            int lengthOfTestSeconds, 
            int testSize,
            string kernelSource, 
            string kernelName)
        {
            float result = 0.0f;
 
            //create our test kernel instance
            Test2Kernel t2k = new Test2Kernel(localContext, device, kernelSource, kernelName);
 
            //the number of inputs to the kernel
            int inputCount = 2;
            List<float[]> inputs = new List<float[]>();
            for (int i = 0; i < inputCount; i++)
            {
                float[] arrI = new float[testSize];
                inputs.Add(arrI);
            }
 
            //an array for our outputs
            int outputCount = 1;
            List<float[]> outputs = new List<float[]>();
            for (int i = 0; i < outputCount; i++)
            {
                float[] arrC = new float[testSize];
                outputs.Add(arrC);
            }
 
            //just setting up random number gen some timing stuff and the
            //arrays list that will store all of our communication with the kernel
            Random rand = new Random();
            DateTime start = DateTime.Now;
            TimeSpan testLength = new TimeSpan(0, 0, lengthOfTestSeconds);
            List<float[]> arrays = new List<float[]>();
 
            //just looping through till time is up
            while ((DateTime.Now - start) < testLength)
            {
                //clear our array
                arrays.Clear();
 
                //for the size of the test, (currently set to 1024)
                for (int i = 0; i < testSize; i++)
                {
                    //for each input buffer we will genereate a random double
                    for (int l_inputs = 0; l_inputs < inputCount; l_inputs++)
                    {
                        inputs[l_inputs][i] = (float)(rand.NextDouble() * 100);
                    }
                    //for each output buffer we will initialize to zero
                    for (int l_outputs = 0; l_outputs < outputCount; l_outputs++)
                    {
                        outputs[l_outputs][i] = 0.0f;
                    }
                }
 
                //now loop through our inputs and add them to arrays
                for (int i = 0; i < inputCount; i++)
                {
                    arrays.Add(inputs[i]);
                }
                //do the same with outputs
                for (int i = 0; i < outputCount; i++)
                {
                    arrays.Add(outputs[i]);
                }
 
                //now perform our calculation by lettting the kernel know how long
                //the test is, where the arrays are, the number of inputs and the
                //number of outputs
                t2k.performCalculation(testSize, ref arrays, inputCount, outputCount);
 
                //just some debuging to let us know it is running
                //Console.WriteLine("{0}){1} + {2} = {3}", result, arrays[0][0], arrays[1][0], arrays[2][0]);
                result++;
 
                //desparate attempt to get the memory leak to go away
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
 
            //we are calcualting performance as the size of the test (so 1024)
            //multiplied by the number of times it got through the loop
            //devided by the length of the test
            //this should possibly give us something like calculations per second
            return (float)testSize * result / lengthOfTestSeconds;
        }
 
        /// <summary>
        /// flattened version of test2kernel to rule out anything weird happening
        /// in our test2kernel instance based test.
        /// </summary>
        /// <param name="localContext"></param>
        /// <param name="device"></param>
        /// <param name="lengthOfTestSeconds"></param>
        /// <param name="testSize"></param>
        /// <param name="kernelSource"></param>
        /// <param name="kernelName"></param>
        /// <returns></returns>
        public static float testYetAgain(ComputeContext localContext,
            ComputeDevice device,
            int lengthOfTestSeconds,
            int testSize,
            string kernelSource,
            string kernelName)
        {
            float result = 0.0f;
 
            ComputeProgram m_computeProgram;
            ComputeKernel m_computeKernel;
            ComputeBuffer<float> tempBuffer;
            ComputeCommandQueue m_queue;
            ComputeBuffer<float>[] m_buffers;
 
            m_computeProgram = new ComputeProgram(localContext, new string[] { kernelSource });
            m_computeProgram.Build(null, null, null, IntPtr.Zero);
            m_computeKernel = m_computeProgram.CreateKernel(kernelName);
            m_queue = new ComputeCommandQueue(localContext, device, ComputeCommandQueueFlags.None);
            m_buffers = new ComputeBuffer<float>[3];
 
            // the number of values we want to run through the kernel each pass
            //int count = 10;
            //the number of inputs to the kernel
            int inputCount = 2;
            List<float[]> inputs = new List<float[]>();
            for (int i = 0; i < inputCount; i++)
            {
                float[] arrI = new float[testSize];
                inputs.Add(arrI);
            }
 
            //an array for our outputs
            int outputCount = 1;
            List<float[]> outputs = new List<float[]>();
            for (int i = 0; i < outputCount; i++)
            {
                float[] arrC = new float[testSize];
                outputs.Add(arrC);
            }
 
 
            Random rand = new Random();
            DateTime start = DateTime.Now;
            TimeSpan testLength = new TimeSpan(0, 0, lengthOfTestSeconds);
            List<float[]> arrays = new List<float[]>();
 
            while ((DateTime.Now - start) < testLength)
            {
                arrays.Clear();
 
                for (int i = 0; i < testSize; i++)
                {
 
                    for (int l_inputs = 0; l_inputs < inputCount; l_inputs++)
                    {
                        inputs[l_inputs][i] = (float)(rand.NextDouble() * 100);
                    }
                    for (int l_outputs = 0; l_outputs < outputCount; l_outputs++)
                    {
                        outputs[l_outputs][i] = 0.0f;
                    }
                }
 
                for (int i = 0; i < inputCount; i++)
                {
                    arrays.Add(inputs[i]);
                }
                for (int i = 0; i < outputCount; i++)
                {
                    arrays.Add(outputs[i]);
                }
 
                //t2k.performCalculation(testSize, ref arrays, inputCount, outputCount);
                for (int i = 0; i < inputCount; i++)
                {
                    m_buffers[i] = new ComputeBuffer<float>(localContext, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]);
                }
 
                for (int i = inputCount; i < inputCount + outputCount; i++)
                {
                    m_buffers[i] = new ComputeBuffer<float>(localContext, ComputeMemoryFlags.WriteOnly, arrays[i].Length);
                }
 
                for (int i = 0; i < inputCount + outputCount; i++)
                {
                    m_computeKernel.SetMemoryArgument(i, m_buffers[i]);
                }
 
                m_queue.Execute(m_computeKernel, null, new long[] { testSize }, null, null);
 
                for (int i = inputCount; i < inputCount + outputCount; i++)
                {
                    arrays[i] = m_queue.Read(m_buffers[i], true, 0, testSize, null);
                }
 
                for (int i = 0; i < inputCount + outputCount; i++)
                {
                    m_buffers[i].Dispose();
                }
 
 
                //Console.WriteLine("{0})", result);
                result++;
                //GC.Collect();
                //GC.WaitForPendingFinalizers();
            }
 
            return (float)testSize * result / lengthOfTestSeconds;
        }
    }
}

class 2 is called Test2Kernel.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Cloo;
 
namespace OpenCLTesting
{
    class Test2Kernel
    {
        ComputeContext m_context;
        ComputeDevice m_device;
        ComputeProgram m_computeProgram;
        ComputeKernel m_computeKernel;
        ComputeBuffer<float> tempBuffer;
        ComputeCommandQueue m_queue;
        //ComputeBuffer<float>[] m_buffers;
        List<ComputeBuffer<float>> m_buffers;
        string m_kernelSource;
        string m_kernelName;
 
        /// <summary>
        /// Constructor prepares a device given the the kernel information
        /// </summary>
        /// <param name="context"></param>
        /// <param name="device"></param>
        /// <param name="kernelSource"></param>
        /// <param name="kernelName"></param>
        public Test2Kernel(ComputeContext context, ComputeDevice device, String kernelSource, String kernelName)
        {
            m_context = context;
            m_device = device;
            m_kernelSource = kernelSource;
            m_kernelName = kernelName;
            initialize();
 
        }
 
 
        /// <summary>
        /// pull out the instantiation of everything
        /// originally thought to 're-initialize' everything if certain conditions
        /// arize during execution 
        /// </summary>
        private void initialize()
        {
            m_computeProgram = new ComputeProgram(m_context, new string[]{m_kernelSource});
            m_computeProgram.Build(null, null, null, IntPtr.Zero);
            m_computeKernel = m_computeProgram.CreateKernel(m_kernelName);
            m_queue = new ComputeCommandQueue(m_context, m_device, ComputeCommandQueueFlags.None);
            //m_buffers = new ComputeBuffer<float>[3];
            m_buffers = new List<ComputeBuffer<float>>();
        }
 
        /// <summary>
        /// this perofrms the actual construciton of our compute buffers
        /// and then performs the calculation.  The only Cloo items that
        /// are being reset each time are the ComputeBuffers.
        /// 
        /// Have tried several ways of storing the buffers 'as array'
        /// 'as a list' but each way we still end up with a rather nasty
        /// memory leak
        /// </summary>
        /// <param name="count"></param>
        /// <param name="arrays"></param>
        /// <param name="inputCount"></param>
        /// <param name="outputCount"></param>
        public void performCalculation( 
            int count,
            ref List<float[]> arrays,
            int inputCount,
            int outputCount)
        {
            //add our 'input' compute buffers to the compute buffer list
            for (int i = 0; i < inputCount; i++)
            {
                tempBuffer = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]);
                m_buffers.Add(tempBuffer);
                //m_buffers[i] = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]);
            }
 
            //add our 'output' compute bufferes to the main compute buffer list
            for (int i = inputCount; i < inputCount + outputCount; i++)
            {
                tempBuffer = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.WriteOnly, arrays[i].Length);
                m_buffers.Add(tempBuffer);
                //m_buffers[i] = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.WriteOnly, arrays[i].Length);
            }
 
            //map our compute buffers to the kernel
            for (int i = 0; i < inputCount + outputCount; i++)
            {
                m_computeKernel.SetMemoryArgument(i, m_buffers[i]);
            }
 
            //execute our kernel
            m_queue.Execute(m_computeKernel, null, new long[] { count }, null, null);
 
            //read from each output buffer and save it into our array reference
            for (int i = inputCount; i < inputCount + outputCount; i++)
            {
                arrays[i] = m_queue.Read(m_buffers[i], true, 0, count, null);
            }
 
            //dispose of each buffer
            for (int i = 0; i < inputCount+outputCount; i++)
            {
                m_buffers[i].Dispose();
            }
 
            m_buffers.Clear();
            //tempBuffer.Dispose();
        }
    }
}

But basically when you run this it will iterate through your platforms (which is always just one as far as i know) collecting contexts and devices for that context. It will then perform the benchmark on each device. The benchmark is set to take 60 seconds and if you have your taskmanager up you can see all your memory slowly creep away. I've also left in some commented code so you can see different things i've tried.

I just can't seem to find what could be leaking or a way to make it stop.

I guess a pertinent question would be, is there a way to clear the memory used by a device that i have not noticed yet?

Thanks for any help.

-Cody


Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
the Fiddler's picture

This signature contains only value types, so it won't leak by itself. However, the surrounding code may generate garbage in some cases: for instance, GCHandle.Alloc may cause boxing when used on value types.

Any such leaks will be collected by the GC, however, and will not cause memory usage to grow indefinitely. This can only happen if a leak happens in native code and/or in calls to Marshal.AllocHGlobal and similar.

Edit: attached is a quick test that checks whether this signature leaks memory. Ran this on Mono 2.4.3 and no leak was detected:

102400 # First call to GetTotalMemory()
241664 # Second call, presumably this is due to JIT ovehead
241664 # After first round of p/invokes
241664 # After second  round of p/invokes
241664 # After third round of p/invokes

On Linux, the test can be compiled and ran with the following commnads:

gmcs test.cs /unsafe
gcc native.c -fPIC -shared -o libnative.so
mono test.exe

On Windows/MSVC, you'd need to add __declspec(dllexport) before the function definition in native.c, and compile native.c with cl and test.cs with csc (sorry, don't remember the command to create a dll off-hand).

AttachmentSize
test.cs1.32 KB
native.c477 bytes
ctk's picture
nythrix wrote:

I've been able to trace the cause of the problem back to the flat CL bindings. And it's not only Cloo's bindings, OpenTK's suffer from the same problem. My guts tell me we're somehow messing with marshaling and that is the root of all evil.
This is the C definition of a typical leaking function:

 cl_int   clEnqueueNDRangeKernel  (  	cl_command_queue command_queue,
  	cl_kernel kernel,
  	cl_uint work_dim,
  	const size_t *global_work_offset,
  	const size_t *global_work_size,
  	const size_t *local_work_size,
  	cl_uint num_events_in_wait_list,
  	const cl_event *event_wait_list,
  	cl_event *event)

which translates to:

public extern static unsafe ComputeErrorCode
        EnqueueNDRangeKernel(
            IntPtr command_queue,
            IntPtr kernel,
            Int32 work_dim,
            /* const */ IntPtr* global_work_offset,
            /* const */ IntPtr* global_work_size,
            /* const */ IntPtr* local_work_size,
            Int32 num_events_in_wait_list,
            /* const */ IntPtr* event_wait_list,
            out IntPtr newEvent );

Is anyone able to see through this?
I've tried different signatures, especially replacing pointers with arrays, to no effect.

Just a minor observation, but should the Int32 not be UInt32 instead?

public extern static unsafe ComputeErrorCode
        EnqueueNDRangeKernel(
            IntPtr command_queue,
            IntPtr kernel,
            UInt32 work_dim,
            /* const */ IntPtr* global_work_offset,
            /* const */ IntPtr* global_work_size,
            /* const */ IntPtr* local_work_size,
            UInt32 num_events_in_wait_list,
            /* const */ IntPtr* event_wait_list,
            out IntPtr newEvent );
the Fiddler's picture

Problem is, uint is not CLS-compliant and thus cannot be used in the public API. Or, a CLS-compiant overload should be provided along with the non-CLS-compliant one (which is what OpenTK does).

That said, in most cases it doesn't really matter. If a uint overload isn't provided, at worst you'll have to use a cast: (int)0x80000000. Unless you are working with bit-patterns (GL.LineStipple?), the need for such casts doesn't arise very often (how often would you need more than 2^31 - 1 work units?)

ctk's picture

Okay, I've figured out what is causing the memory leak problem we are seeing here. The problem is that the bindings are using "out IntPtr newEvent" instead of "IntPtr* newEvent", which creates memory leaks in Cloo since OpenCl is the one responsible for freeing this pointer instead of the Garbage Collector. This is a problem with many of the interop bindings in Cloo/OpenTK (not just the ones we've used in this thread).

For example, in Cloo 0.6.0 under CL10.cs:

        // // "out IntPtr newEvent" causes a memory leak, should instead be "IntPtr* newEvent"
        //[DllImport( dll, EntryPoint = "clEnqueueNDRangeKernel" )]
        //internal extern static unsafe ComputeErrorCode
        //EnqueueNDRangeKernel(
        //    IntPtr command_queue,
        //    IntPtr kernel,
        //    Int32 work_dim,
        //    /* const */ IntPtr* global_work_offset,
        //    /* const */ IntPtr* global_work_size,
        //    /* const */ IntPtr* local_work_size,
        //    Int32 num_events_in_wait_list,
        //    /* const */ IntPtr* event_wait_list,
        //    out IntPtr newEvent );
 
        [DllImport(dll, EntryPoint = "clEnqueueNDRangeKernel")]
        internal extern static unsafe ComputeErrorCode
        EnqueueNDRangeKernel(
            IntPtr command_queue,
            IntPtr kernel,
            Int32 work_dim,
            /* const */ IntPtr* global_work_offset,
            /* const */ IntPtr* global_work_size,
            /* const */ IntPtr* local_work_size,
            Int32 num_events_in_wait_list,
            /* const */ IntPtr* event_wait_list,
            IntPtr* newEvent);

We can verify that it should be "IntPtr* newEvent" from the OpenCL man pages: http://www.khronos.org/opencl/sdk/1.0/docs/man/xhtml/index.html. For the above binding, the relevant page is http://www.khronos.org/opencl/sdk/1.0/docs/man/xhtml/clEnqueueNDRangeKer...

Attached is my the copy of Cloo 0.6.0 on my computer with the VectorAdd demo I've posted in the previous page of this thread. The previous VectorAdd had a memory leak whereas I've manually edited the bindings for EnqueueNDRangeKernel, EnqueueTask, and EnqueueRead to demonstrate the fix for the memory leak. The relevant Cloo functions in ComputeCommandQueue.cs were also changed. This version of VectorAdd is stable and will hover around 20 megs of memory use for a 64-bit system.

For the relevant functions in ComputeCommandQueue.cs, we will have to additionally modify:

            // NEED TO MODIFY FOLLOWING TO HANDLE INTPTR* INSTEAD OF INTPTR FOR newEventHandle
            if( events != null )
                events.Add( new ComputeEvent( newEventHandle, this ) );
AttachmentSize
Cloo-0.6.0.7z45.07 KB
nythrix's picture

You're my hero mate!
Actually, it's not the "out" parameter itself. Instead it's the goddamn events that are leaking! I retrieve them and stick them in the list. But if no event list is provided they just keep piling up because I never dispose of them!!!

I shall update the library immediately.

What a stupid bug...

nythrix's picture

A new version is online. This memory problem is hopefully gone. The Clootils + ATI Stream combo problem should be fixed as well.
Looking forward to your feedback.

ctk's picture

Yes! Cloo 0.6.1 has fixed both the memory leak and the ATI Stream output in WinForms. Great job, I will be porting my AI code to the new version.

I think it would be a good idea to update the gallery page for Cloo to reflect the latest version: http://opentk.com/project/cloo or have more links to http://sourceforge.net/projects/cloo/ so that it is easier to reach.

CodyIrons's picture

I can confirm on my work computer that my tests do not leak with sizes 10, 1000, or 100000. Most excellent!!!
And good work guys.

ctk's picture

Good to hear Cody. And just in case you didn't know this trick:

// In the OpenCL kernel, do:
for (size_t nIndex = get_global_id(0); nIndex < numItemsToCompute; nIndex += get_global_size(0)) 
{
        // Compute items here
	result[nIndex] = a[nIndex ] + b[nIndex];
}

I use this in my AI code to compute things when I have more items to compute than the global work item size limits. It may help you to improve CPU utilization.

nythrix's picture
Quote:

I think it would be a good idea to update the gallery page for Cloo to reflect the latest version: http://opentk.com/project/cloo or have more links to http://sourceforge.net/projects/cloo/ so that it is easier to reach.

I stopped updating the gallery page to prevent people from extensively using the opentk forums for cloo related stuff. Despite the projects being related and Fiddler being fine with everything, I was afraid the forums here would end up being a mess.
I know, I haven't bothered with letting people know Cloo exists. Perhaps, I should go out and spread the word. Thing is, advertisment is way low on my TODO list :)

Thank you all for your most valuable help!