CodyIrons's picture

CLOO: Memory Leaking Troubles

Hi guys, hope all is going well,

I'm running into a bit of an issue when using Cloo (0.5.1 and 0.6.0) and attempting to have a program call a kernel an obscene number of times over the course of a benchmark. I'm just using the vectorAdd kernel, and basically what I'm doing is generating a set of random inputs each iteration, then calling the kernel to compute. I've tried to separate the setup and execution of the kernel as much as possible, as I would like this to become fairly modular in any future projects I decide to do. But I just can't seem to pinpoint what could be causing this leak.

The leak seems to revolve around the creating and disposing of the ComputeBuffers (but if it's some of my code, please feel free to say so). The project is only two classes and I tried to document them as much as possible.

Class one is Program.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Cloo;
 
namespace OpenCLTesting
{
    class Program
    {
        static void Main(string[] args)
        {
            ///first lets collect all the compute devices on all detected platforms
            ///this may be a little overkill but i was experimenting with ideas at the time
            Dictionary<ComputeContext, List<ComputeDevice>> thesystem
                = new Dictionary<ComputeContext, List<ComputeDevice>>();
            List<ComputeDevice> allDevices;
            ComputeContext context;
            foreach(ComputePlatform cp in ComputePlatform.Platforms)
            {
 
                ComputeContextPropertyList propertyList = new ComputeContextPropertyList(cp);
                context = new ComputeContext(ComputeDeviceTypes.All, propertyList, null, IntPtr.Zero);
                allDevices = new List<ComputeDevice>();
                foreach(ComputeDevice cd in context.Devices)
                {
                    allDevices.Add(cd);
                }
                thesystem.Add(context, allDevices);
            }
 
 
            string vectorAddKernel = @"kernel void
            vectorAdd(global read_only float * a,
                      global read_only float * b,
                      global write_only float * c)
            {
                // Vector element index
                int nIndex = get_global_id(0);
                c[nIndex] = a[nIndex] + b[nIndex];
            }";
 
            int testSize = 1024;
            int testLengthInSeconds = 60;
 
            ///we will be testing each device in each platform for performance
            ///we will print off the data so it can be collected and observed
            foreach (KeyValuePair<ComputeContext, List<ComputeDevice>> platform in thesystem)
            {
                ComputeContext localContext = platform.Key;
                List<ComputeDevice> devices = platform.Value;
 
                foreach (ComputeDevice device in devices)
                {
                    //Testing each individual device
                    Console.WriteLine("DevicePlatform = {0}",device.Platform.Name);
                    Console.WriteLine("DeviceName = {0}",device.Name);
                    Console.WriteLine("DeviceCUnits = {0}", device.MaxComputeUnits);
                    Console.WriteLine("DeviceSpeed = {0}", device.MaxClockFrequency);
 
                    float result = testDevice(localContext, device, testLengthInSeconds, testSize, vectorAddKernel, "vectorAdd");
                    Console.WriteLine("DevicePerformance = {0}",result);
 
                    //float result1 = testYetAgain(localContext, device, testLengthInSeconds, testSize, vectorAddKernel, "vectorAdd");
                    //Console.WriteLine("DevicePerformance1 = {0}", result1 );
                }
            }
            Console.ReadLine();
        }
 
        /// <summary>
        /// here we attempt to separate the construction of the values needed
        /// by the kernel during execution time.  This is being done in an
        /// attempt to make execution of the kernel slightly more abstracted
        /// </summary>
        /// <param name="localContext">the compute context</param>
        /// <param name="device">the device we are running on</param>
        /// <param name="lengthOfTestSeconds">how long in seconds we would like the test to run</param>
        /// <param name="testSize">the number of objects the kernel will be performing in one batch 'these are just floats for now'</param>
        /// <param name="kernelSource">the kernels source</param>
        /// <param name="kernelName">the name of the kernel in the source</param>
        /// <returns>a float value representing how well this device performed</returns>
        private static float testDevice(
            ComputeContext localContext, 
            ComputeDevice device, 
            int lengthOfTestSeconds, 
            int testSize,
            string kernelSource, 
            string kernelName)
        {
            float result = 0.0f;
 
            //create our test kernel instance
            Test2Kernel t2k = new Test2Kernel(localContext, device, kernelSource, kernelName);
 
            //the number of inputs to the kernel
            int inputCount = 2;
            List<float[]> inputs = new List<float[]>();
            for (int i = 0; i < inputCount; i++)
            {
                float[] arrI = new float[testSize];
                inputs.Add(arrI);
            }
 
            //an array for our outputs
            int outputCount = 1;
            List<float[]> outputs = new List<float[]>();
            for (int i = 0; i < outputCount; i++)
            {
                float[] arrC = new float[testSize];
                outputs.Add(arrC);
            }
 
            //just setting up random number gen some timing stuff and the
            //arrays list that will store all of our communication with the kernel
            Random rand = new Random();
            DateTime start = DateTime.Now;
            TimeSpan testLength = new TimeSpan(0, 0, lengthOfTestSeconds);
            List<float[]> arrays = new List<float[]>();
 
            //just looping through till time is up
            while ((DateTime.Now - start) < testLength)
            {
                //clear our array
                arrays.Clear();
 
                //for the size of the test, (currently set to 1024)
                for (int i = 0; i < testSize; i++)
                {
                    //for each input buffer we will generate a random double
                    for (int l_inputs = 0; l_inputs < inputCount; l_inputs++)
                    {
                        inputs[l_inputs][i] = (float)(rand.NextDouble() * 100);
                    }
                    //for each output buffer we will initialize to zero
                    for (int l_outputs = 0; l_outputs < outputCount; l_outputs++)
                    {
                        outputs[l_outputs][i] = 0.0f;
                    }
                }
 
                //now loop through our inputs and add them to arrays
                for (int i = 0; i < inputCount; i++)
                {
                    arrays.Add(inputs[i]);
                }
                //do the same with outputs
                for (int i = 0; i < outputCount; i++)
                {
                    arrays.Add(outputs[i]);
                }
 
                //now perform our calculation by letting the kernel know how long
                //the test is, where the arrays are, the number of inputs and the
                //number of outputs
                t2k.performCalculation(testSize, ref arrays, inputCount, outputCount);
 
                //just some debugging to let us know it is running
                //Console.WriteLine("{0}){1} + {2} = {3}", result, arrays[0][0], arrays[1][0], arrays[2][0]);
                result++;
 
                //desperate attempt to get the memory leak to go away
                GC.Collect();
                GC.WaitForPendingFinalizers();
            }
 
            //we are calculating performance as the size of the test (so 1024)
            //multiplied by the number of times it got through the loop
            //divided by the length of the test
            //this should possibly give us something like calculations per second
            return (float)testSize * result / lengthOfTestSeconds;
        }
 
        /// <summary>
        /// flattened version of test2kernel to rule out anything weird happening
        /// in our test2kernel instance based test.
        /// </summary>
        /// <param name="localContext"></param>
        /// <param name="device"></param>
        /// <param name="lengthOfTestSeconds"></param>
        /// <param name="testSize"></param>
        /// <param name="kernelSource"></param>
        /// <param name="kernelName"></param>
        /// <returns></returns>
        public static float testYetAgain(ComputeContext localContext,
            ComputeDevice device,
            int lengthOfTestSeconds,
            int testSize,
            string kernelSource,
            string kernelName)
        {
            float result = 0.0f;
 
            ComputeProgram m_computeProgram;
            ComputeKernel m_computeKernel;
            ComputeBuffer<float> tempBuffer;
            ComputeCommandQueue m_queue;
            ComputeBuffer<float>[] m_buffers;
 
            m_computeProgram = new ComputeProgram(localContext, new string[] { kernelSource });
            m_computeProgram.Build(null, null, null, IntPtr.Zero);
            m_computeKernel = m_computeProgram.CreateKernel(kernelName);
            m_queue = new ComputeCommandQueue(localContext, device, ComputeCommandQueueFlags.None);
            m_buffers = new ComputeBuffer<float>[3];
 
            // the number of values we want to run through the kernel each pass
            //int count = 10;
            //the number of inputs to the kernel
            int inputCount = 2;
            List<float[]> inputs = new List<float[]>();
            for (int i = 0; i < inputCount; i++)
            {
                float[] arrI = new float[testSize];
                inputs.Add(arrI);
            }
 
            //an array for our outputs
            int outputCount = 1;
            List<float[]> outputs = new List<float[]>();
            for (int i = 0; i < outputCount; i++)
            {
                float[] arrC = new float[testSize];
                outputs.Add(arrC);
            }
 
 
            Random rand = new Random();
            DateTime start = DateTime.Now;
            TimeSpan testLength = new TimeSpan(0, 0, lengthOfTestSeconds);
            List<float[]> arrays = new List<float[]>();
 
            while ((DateTime.Now - start) < testLength)
            {
                arrays.Clear();
 
                for (int i = 0; i < testSize; i++)
                {
 
                    for (int l_inputs = 0; l_inputs < inputCount; l_inputs++)
                    {
                        inputs[l_inputs][i] = (float)(rand.NextDouble() * 100);
                    }
                    for (int l_outputs = 0; l_outputs < outputCount; l_outputs++)
                    {
                        outputs[l_outputs][i] = 0.0f;
                    }
                }
 
                for (int i = 0; i < inputCount; i++)
                {
                    arrays.Add(inputs[i]);
                }
                for (int i = 0; i < outputCount; i++)
                {
                    arrays.Add(outputs[i]);
                }
 
                //t2k.performCalculation(testSize, ref arrays, inputCount, outputCount);
                for (int i = 0; i < inputCount; i++)
                {
                    m_buffers[i] = new ComputeBuffer<float>(localContext, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]);
                }
 
                for (int i = inputCount; i < inputCount + outputCount; i++)
                {
                    m_buffers[i] = new ComputeBuffer<float>(localContext, ComputeMemoryFlags.WriteOnly, arrays[i].Length);
                }
 
                for (int i = 0; i < inputCount + outputCount; i++)
                {
                    m_computeKernel.SetMemoryArgument(i, m_buffers[i]);
                }
 
                m_queue.Execute(m_computeKernel, null, new long[] { testSize }, null, null);
 
                for (int i = inputCount; i < inputCount + outputCount; i++)
                {
                    arrays[i] = m_queue.Read(m_buffers[i], true, 0, testSize, null);
                }
 
                for (int i = 0; i < inputCount + outputCount; i++)
                {
                    m_buffers[i].Dispose();
                }
 
 
                //Console.WriteLine("{0})", result);
                result++;
                //GC.Collect();
                //GC.WaitForPendingFinalizers();
            }
 
            return (float)testSize * result / lengthOfTestSeconds;
        }
    }
}

Class two is Test2Kernel.cs

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Cloo;
 
namespace OpenCLTesting
{
    class Test2Kernel
    {
        ComputeContext m_context;
        ComputeDevice m_device;
        ComputeProgram m_computeProgram;
        ComputeKernel m_computeKernel;
        ComputeBuffer<float> tempBuffer;
        ComputeCommandQueue m_queue;
        //ComputeBuffer<float>[] m_buffers;
        List<ComputeBuffer<float>> m_buffers;
        string m_kernelSource;
        string m_kernelName;
 
        /// <summary>
        /// Constructor prepares a device given the kernel information
        /// </summary>
        /// <param name="context"></param>
        /// <param name="device"></param>
        /// <param name="kernelSource"></param>
        /// <param name="kernelName"></param>
        public Test2Kernel(ComputeContext context, ComputeDevice device, String kernelSource, String kernelName)
        {
            m_context = context;
            m_device = device;
            m_kernelSource = kernelSource;
            m_kernelName = kernelName;
            initialize();
 
        }
 
 
        /// <summary>
        /// pull out the instantiation of everything
        /// originally thought to 're-initialize' everything if certain conditions
        /// arise during execution
        /// </summary>
        private void initialize()
        {
            m_computeProgram = new ComputeProgram(m_context, new string[]{m_kernelSource});
            m_computeProgram.Build(null, null, null, IntPtr.Zero);
            m_computeKernel = m_computeProgram.CreateKernel(m_kernelName);
            m_queue = new ComputeCommandQueue(m_context, m_device, ComputeCommandQueueFlags.None);
            //m_buffers = new ComputeBuffer<float>[3];
            m_buffers = new List<ComputeBuffer<float>>();
        }
 
        /// <summary>
        /// this performs the actual construction of our compute buffers
        /// and then performs the calculation.  The only Cloo items that
        /// are being reset each time are the ComputeBuffers.
        /// 
        /// Have tried several ways of storing the buffers 'as array'
        /// 'as a list' but each way we still end up with a rather nasty
        /// memory leak
        /// </summary>
        /// <param name="count"></param>
        /// <param name="arrays"></param>
        /// <param name="inputCount"></param>
        /// <param name="outputCount"></param>
        public void performCalculation( 
            int count,
            ref List<float[]> arrays,
            int inputCount,
            int outputCount)
        {
            //add our 'input' compute buffers to the compute buffer list
            for (int i = 0; i < inputCount; i++)
            {
                tempBuffer = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]);
                m_buffers.Add(tempBuffer);
                //m_buffers[i] = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer, arrays[i]);
            }
 
            //add our 'output' compute buffers to the main compute buffer list
            for (int i = inputCount; i < inputCount + outputCount; i++)
            {
                tempBuffer = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.WriteOnly, arrays[i].Length);
                m_buffers.Add(tempBuffer);
                //m_buffers[i] = new ComputeBuffer<float>(m_context, ComputeMemoryFlags.WriteOnly, arrays[i].Length);
            }
 
            //map our compute buffers to the kernel
            for (int i = 0; i < inputCount + outputCount; i++)
            {
                m_computeKernel.SetMemoryArgument(i, m_buffers[i]);
            }
 
            //execute our kernel
            m_queue.Execute(m_computeKernel, null, new long[] { count }, null, null);
 
            //read from each output buffer and save it into our array reference
            for (int i = inputCount; i < inputCount + outputCount; i++)
            {
                arrays[i] = m_queue.Read(m_buffers[i], true, 0, count, null);
            }
 
            //dispose of each buffer
            for (int i = 0; i < inputCount+outputCount; i++)
            {
                m_buffers[i].Dispose();
            }
 
            m_buffers.Clear();
            //tempBuffer.Dispose();
        }
    }
}

Basically, when you run this it iterates through your platforms (which is always just one as far as I know), collecting a context and the devices for that context. It then performs the benchmark on each device. The benchmark is set to take 60 seconds, and if you have Task Manager open you can watch your memory slowly creep away. I've also left in some commented-out code so you can see the different things I've tried.

I just can't seem to find what could be leaking or a way to make it stop.

I guess a pertinent question would be: is there a way to clear the memory used by a device that I have not noticed yet?

Thanks for any help.

-Cody


Comments

CodyIrons's picture

So I've been messing with this tonight after work, and I went ahead and refined it a lot and cleaned up some of the unimportant code. I was able to move the buffer creation into the init function, along with the SetMemoryArgument calls. I also had to change my ComputeMemoryFlags.CopyHostPointer to ComputeMemoryFlags.UseHostPointer to get this thing working how I think it should, but in debugging I seem to have stumbled into something weird where I have no idea what is happening.

I currently have the program set up to run for 1 second, with the buffer size set to 10. After each kernel call I print index 0 of the inputs and outputs. The first few lines printed to the console appear to be correct, but after 10 to 15 prints it starts to repeat the same result.

I've attached the project this time instead of pasting the code, as that may be more useful. Any ideas on what I'm doing incorrectly are appreciated.

Attachment: OpenCLTesting.zip (171.26 KB)
CodyIrons's picture

So it seems that in the latest project I submitted, if you set "testSize" to something really large (100000) and the time to 60 seconds, it performs exactly as desired with no memory leak.

I did a little more digging just now, and any value below 6755 will eventually return bad results (it starts returning zeros if left to run for a long time). Any value of 6756 and over returns proper results.

CodyIrons's picture

Other observations (sorry for using this as a blog of sorts):
I switched from my Intel-based netbook (just an Atom processor) to my desktop (AMD quad core with 3 graphics adapters), and the only device that performed the benchmark properly was the CPU. Each of the video cards always returns the same value for every result, and no exceptions seem to be thrown (I'm not even attempting to catch any; has Cloo abstracted away the ability to check error codes after kernel compilation?).

Another interesting result, though, is that no more than one core of the CPU is ever utilized. I double-checked this on my work PC as well (dual quad-core Xeon) and the kernel only ever attempts to execute on one core. For some reason I recall another test of mine working across all cores.

Is anyone able to use Cloo to completely saturate a device? I would be interested in testing more complex kernels across my diverse set of platforms. I would love to use Cloo to do some AI/swarm-based learning, but if I can't get the basic tests to completely utilize a device, I'm really apprehensive about putting the time into converting my PSO/pathfinding algorithms to OpenCL using Cloo.

nythrix's picture

Hi,
sorry for being a bit unresponsive. My real life is a bit challenging at the moment, so I'm not able to reply promptly all the time.
Some observations:

1) I did some testing using your original code and I'm pretty sure there are no leaks in the sense of OpenCL objects. Everything that's created is disposed of properly, either manually or automatically. I ran the program for more than 30 minutes with hundreds of thousands of objects getting created and destroyed, and the counts always matched. With that out of the way, I can now focus on other things like GC handles and the like. No tests have been run on your other code yet. Hopefully, I'll be back with the results in a couple of hours.

2) Cloo introduces no restrictions over raw OpenCL. That is its main design goal, so if you're having problems then you've possibly hit a bug or a missing bit. In that case I'll do my best to fix it.

ctk's picture

I'm also using Cloo to do AI and I have no problems utilizing both of my cores in my dual core AMD CPU to 95-100%. And this is for a very complex kernel with lots of function calls that can take 20+ minutes. The memory usage is stable since I only allocate memory once and it's a console program and exits at the end of the run.

I ran the code from the OP and I do appear to get ever-increasing memory usage, and there isn't anything wrong with the code that I can see. More debugging will be needed.
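One way to narrow this down further than watching Task Manager is to log the managed heap size next to the process's private bytes on every Nth iteration. The sketch below is only an illustration: the class and method names (MemoryProbe, ManagedBytes, PrivateBytes, Snapshot) are hypothetical and not part of Cloo or the original project. It uses only standard .NET calls. If the managed figure stays flat while the private figure keeps climbing, the growth is probably happening outside the managed heap.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of Cloo or the original project.
static class MemoryProbe
{
    // Size of the managed heap in bytes, measured after forcing a full collection.
    public static long ManagedBytes()
    {
        return GC.GetTotalMemory(true);
    }

    // Private bytes of the whole process (managed and native allocations together).
    public static long PrivateBytes()
    {
        using (Process current = Process.GetCurrentProcess())
        {
            return current.PrivateMemorySize64;
        }
    }

    // One-line summary suitable for Console.WriteLine inside the benchmark loop.
    public static string Snapshot(string label)
    {
        return string.Format("{0}: managed={1} B, private={2} B",
            label, ManagedBytes(), PrivateBytes());
    }
}
```

For example, calling Console.WriteLine(MemoryProbe.Snapshot("iteration " + result)) every few thousand passes of the benchmark loop gives a rough trend without slowing the test too much.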

nythrix's picture
Quote:

I also had to change my ComputeMemoryFlags.CopyHostPointer to ComputeMemoryFlags.UseHostPointer to get this thing working how I think it should, but in debugging I seem to have stumbled into something weird where I have no idea what is happening.

UseHostPointer means OpenCL will operate in place (i.e., using the array you've provided). That said, the array should be pinned; otherwise you may experience unexpected behaviour and random access violations!

CodyIrons's picture
nythrix wrote:
Quote:

I also had to change my ComputeMemoryFlags.CopyHostPointer to ComputeMemoryFlags.UseHostPointer to get this thing working how I think it should, but in debugging I seem to have stumbled into something weird where I have no idea what is happening.

UseHostPointer means OpenCL will operate in place (i.e., using the array you've provided). That said, the array should be pinned; otherwise you may experience unexpected behaviour and random access violations!

Interesting, I had not heard of such a thing in C# before (I had to google it to see what it was). But that would explain why, at some random point in the program, the values would all default to a bad value. So could something like GCHandle pinnedRawData = GCHandle.Alloc(foo, GCHandleType.Pinned); then be passed to the compute buffers, or is it just that at some point after setting up my List<float[]> foo; I need to pin it so that the GC doesn't move it around? I'll experiment with this in a little bit; still at work.

nythrix's picture

You need to pin the arrays, i.e., all the float[]s, because those are used as the buffers' content, not the list. Basically, you need to prevent the GC from moving the data while OpenCL accesses it.
It's actually what I'm fighting with most of the time while putting together Cloo :)
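A minimal sketch of the pinning described above, using only System.Runtime.InteropServices. The helper names (PinnedArrays, PinAll, FreeAll, AddressSurvivesCollection) are made up for illustration and are not part of Cloo. The assumption is that the handles stay allocated for as long as any buffer created with UseHostPointer over those arrays is alive, and are freed only afterwards.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of Cloo.
static class PinnedArrays
{
    // Pins each float[] in the list (not the list itself) and returns the handles.
    public static GCHandle[] PinAll(IList<float[]> arrays)
    {
        GCHandle[] handles = new GCHandle[arrays.Count];
        for (int i = 0; i < arrays.Count; i++)
        {
            handles[i] = GCHandle.Alloc(arrays[i], GCHandleType.Pinned);
        }
        return handles;
    }

    // Releases the pins; index access is used so each handle is freed in place.
    public static void FreeAll(GCHandle[] handles)
    {
        for (int i = 0; i < handles.Length; i++)
        {
            if (handles[i].IsAllocated)
            {
                handles[i].Free();
            }
        }
    }

    // Shows that a pinned array keeps its address across a forced collection.
    public static bool AddressSurvivesCollection(float[] data)
    {
        GCHandle[] pins = PinAll(new List<float[]> { data });
        try
        {
            IntPtr before = pins[0].AddrOfPinnedObject();
            GC.Collect();
            GC.WaitForPendingFinalizers();
            IntPtr after = pins[0].AddrOfPinnedObject();
            return before == after;
        }
        finally
        {
            FreeAll(pins);
        }
    }
}
```

In the benchmark this would mean calling PinAll once on the input and output arrays before creating the buffers, and FreeAll in a finally block after the buffers have been disposed. Pinning many arrays for a long time can fragment the managed heap, so it is worth keeping the pinned set small.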

ctk's picture

Okay, I've determined the cause of the bug using the code in the original post of this thread. It's a weird bug, that's for sure. In the original code, where the main while loop is:

while ((DateTime.Now - start) < testLength)
{
       ..........
}

If you instead change it to a for loop, like:

for (int j = 0; j < 100; j++)
{
        .........
}

Then you will get no more memory leaks. The task manager shows that the memory use is now absolutely stable! Why this is so, I have absolutely no idea! In my OpenCL code, I've always used the for loop method instead of the while loop method so I haven't encountered this bug before.

It would certainly be interesting if someone could figure out the reason that a while loop causes a memory leak and a for loop doesn't.

ctk's picture

After a little more tinkering, I've now determined that it isn't the while loop itself that is the cause of the bug, but rather the (DateTime.Now - start) part that is causing the memory leak. The exact reason for this is still unknown.

The (DateTime.Now - start) part will cause a memory leak if used either in

while ((DateTime.Now - start) < testLength)
{
       ..........
}

or

for(int j = -60000; j < (DateTime.Now - start).Seconds; j++)
{
       ..........
}

I think the lesson here is to avoid using DateTime and TimeSpan in the comparators. However, the following works fine:

while (start.Second + 60 < DateTime.Now.Second)
{
     ......
}

or

int j = 0;
while (j < 100)
{
     j++;
     .......
}
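
One caveat about the first of those two loops: DateTime.Second is the seconds component of the time and is always in the range 0 to 59, so start.Second + 60 can never be less than DateTime.Now.Second and that loop body would not execute at all. It may be worth re-checking the result with a loop that really does run for the full duration. A minimal sketch of such a timed loop, using System.Diagnostics.Stopwatch rather than DateTime arithmetic, is below; the TimedLoop and RunFor names are hypothetical, and this is not a claim about what causes the memory growth.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper for running a benchmark body for a fixed wall-clock duration.
static class TimedLoop
{
    // Calls 'body' repeatedly until 'duration' has elapsed and returns the iteration count.
    public static long RunFor(TimeSpan duration, Action body)
    {
        Stopwatch clock = Stopwatch.StartNew();
        long iterations = 0;
        while (clock.Elapsed < duration)
        {
            body();
            iterations++;
        }
        return iterations;
    }
}
```

In the original testDevice method, the body passed to RunFor would be the code that fills the arrays and calls t2k.performCalculation, with TimeSpan.FromSeconds(lengthOfTestSeconds) as the duration.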