klaus's picture

Performance Problems with Cloo

Hi,

There was a similar thread somewhere here, but it is a few years old.

So here's my problem: I've implemented an object segmentation algorithm for videos. It uses graph cutting to separate foreground and background. The main performance cost is the calculation of the weights of the graph's edges.

I have written everything in C#. Now I wanted to have the weights calculated on my graphics card with OpenCL, and I used Cloo to do so. Unfortunately, it is very slow.

My pure C# implementation needs 30 ms. My OpenCL implementation takes 20 ms on the CPU but 200 ms on the GPU. Is this normal overhead, or am I doing something wrong? I think it has to do with how I copy the video for the kernel. I hope you have some suggestions.

thx!


Comments

nythrix's picture

Indeed, this might be a "not the right tool" situation. Unsurprisingly, OpenCL+GPU works best if your data already resides in GPU RAM or transits from/to OpenGL. Otherwise you need a fair amount of "complexity" (a non-trivial kernel processing a large amount of data) to compensate for the overhead. For example, you can add two large vectors on the GPU, but you need tens or hundreds of millions of numbers to see it run faster than a simple C# "for" loop. I think the OpenCL+CPU time measurement is telling you exactly that: your "complexity" isn't enough to make up for the data transfer overhead.
I might be able to give you more hints should you wish to disclose more info or code details.
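
To get a rough feel for the transfer side of that trade-off, here is a minimal sketch in plain C that estimates how long one copy of a packed BGR video would take. The function names, the frame dimensions used below, and the bandwidth figure are all assumptions for illustration; none of them come from the original post, and real host-to-device rates vary by system.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes in a packed 8-bit BGR video: 3 bytes per pixel. */
size_t video_bytes(size_t frames, size_t rows, size_t cols)
{
    return frames * rows * cols * 3;
}

/* Estimated one-way copy time in milliseconds.
 * bandwidth_gb_per_s is an ASSUMED effective host<->device rate,
 * not a measured value; 1 GB is taken as 1e9 bytes. */
double transfer_ms(size_t bytes, double bandwidth_gb_per_s)
{
    assert(bandwidth_gb_per_s > 0.0);
    return (double)bytes / (bandwidth_gb_per_s * 1e9) * 1000.0;
}
```

For a hypothetical 10-frame clip of 480 rows by 640 columns, `video_bytes(10, 480, 640)` is 9,216,000 bytes, and at an assumed 4 GB/s `transfer_ms` comes out to roughly 2.3 ms for the input alone. Comparing such an estimate with the measured kernel time can hint at whether the copy or the computation dominates.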

klaus's picture

Hi,

My kernel looks like this:

float l2norm(uchar pixel1_b, uchar pixel1_g, uchar pixel1_r,
    uchar pixel2_b, uchar pixel2_g, uchar pixel2_r)
{
    return sqrt((pown((float)pixel1_b - (float)pixel2_b, 2)
                + pown((float)pixel1_g - (float)pixel2_g, 2)
                + pown((float)pixel1_r - (float)pixel2_r, 2)));
}

kernel void E2(__constant read_only uchar* bgr,
    __constant read_only float* beta,
    __constant read_only int* video_size,
    global write_only int* e2_out)
{
    int frame = get_global_id(0);
    int row = get_global_id(1);
    int col = get_global_id(2);

    int video_rows = video_size[0];
    int video_cols = video_size[1];

    int index = (frame * video_rows * video_cols + row * video_cols + col) * 3;
    int offset = get_global_size(0) * video_rows * video_cols;

    uchar b1 = bgr[index];
    uchar g1 = bgr[index+1];
    uchar r1 = bgr[index+2];

    float beta_ = beta[0];

    //bottom neighbour
    if(row < video_rows - 1)
    {
        int neighbour_index = (frame * video_rows * video_cols + (row+1) * video_cols + col) * 3;
        uchar b2 = bgr[neighbour_index];
        uchar g2 = bgr[neighbour_index + 1];
        uchar r2 = bgr[neighbour_index + 2];

        float l2_norm = l2norm(b1, g1, r1, b2, g2, r2);
        e2_out[index] = (int)exp(-1 * beta_ * pown(l2_norm, 2));
    }

    //right neighbour
    if(col < video_cols - 1)
    {
        int neighbour_index = (frame * video_rows * video_cols + row * video_cols + col + 1) * 3;
        uchar b2 = bgr[neighbour_index];
        uchar g2 = bgr[neighbour_index + 1];
        uchar r2 = bgr[neighbour_index + 2];

        float l2_norm = l2norm(b1, g1, r1, b2, g2, r2);
        e2_out[offset + index] = (int)exp(-1 * beta_ * pown(l2_norm, 2));
    }
}

And I set up the buffers and call the kernel like this:

//Video in array
Byte[] bgr = new Byte[(video.nFrames - 2) * video.Rows * video.Cols * 2 * 3];

//copy video into array
for (int fi = 1; fi < (video.nFrames - 1); fi++)
{
    Image<Bgr, Byte> frame = video.getFrame(fi);
    for (int x = 0; x < frame.Width; x++)
    {
        for (int y = 0; y < frame.Height; y++)
        {
            int index = ((fi - 1) * video.Rows * video.Cols + y * video.Rows + x) * 3;
            bgr[index] = frame.Data[y, x, 0];
            bgr[index+1] = frame.Data[y, x, 1];
            bgr[index+2] = frame.Data[y, x, 2];
        }
    }
}

ComputeBuffer<Byte> in_bgr = new ComputeBuffer<Byte>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, bgr);

float[] fBeta = new float[1];
fBeta[0] = (float)beta;
ComputeBuffer<float> in_beta = new ComputeBuffer<float>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, fBeta);

int[] video_size = new int[2] { video.Rows, video.Cols };
ComputeBuffer<int> in_video_size = new ComputeBuffer<int>(context, ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.UseHostPointer, video_size);

ComputeBuffer<int> out_e2 = new ComputeBuffer<int>(context, ComputeMemoryFlags.WriteOnly, count);

ComputeKernel kernel = program.CreateKernel("E2");

kernel.SetMemoryArgument(0, in_bgr);

kernel.SetMemoryArgument(1, in_video_size);
kernel.SetMemoryArgument(2, in_beta);
kernel.SetMemoryArgument(3, out_e2);

//Measure time
Stopwatch stopWatch = Stopwatch.StartNew();

ComputeEventList eventList = new ComputeEventList();
ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);
commands.Execute(kernel, null, new long[] { video.nFrames - 2, video.Rows, video.Cols }, new long[] { 1, 1, 1 }, eventList);

//Read Buffer
commands.ReadFromBuffer(out_e2, ref weights_cl, false, eventList);
commands.Finish();

stopWatch.Stop();
double time_elapsed = (stopWatch.ElapsedTicks * 1000.0) / Stopwatch.Frequency;
Console.Write("E2 in CL: " + time_elapsed + " ms\n");
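
As a side check on the copy loop above, here is a minimal sketch in plain C of the row-major index that the kernel's expression (frame * rows * cols + row * cols + col) * 3 computes. The function name is hypothetical, and the sketch assumes that frame.Height equals video.Rows and frame.Width equals video.Cols, which the post does not state explicitly. It may help when verifying that the host fills the array with the same stride the kernel reads with.

```c
#include <assert.h>
#include <stddef.h>

/* Flat index of channel 0 for pixel (row, col) of a frame in a packed,
 * row-major BGR video; mirrors the kernel's index expression. */
size_t bgr_index(size_t frame, size_t row, size_t col,
                 size_t rows, size_t cols)
{
    assert(row < rows && col < cols);
    return (frame * rows * cols + row * cols + col) * 3;
}
```

With 480 rows and 640 columns, pixel (row 1, col 0) of frame 0 maps to 1920 here, whereas a stride of rows, as in `y * video.Rows + x`, would give (1 * 480 + 0) * 3 = 1440. The two layouts agree only for square frames, so the row stride in the copy loop seems worth double-checking.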