There was a similar thread somewhere here, but it is a few years old.
So here's my problem: I've implemented an object segmentation algorithm for videos. It uses graph cuts to separate foreground and background. The main performance bottleneck is calculating the weights of the edges of the graph.
I have written everything in C#. I wanted to have the weights calculated on my graphics card with OpenCL, and used Cloo to do so. Unfortunately, it is very slow.
My pure C# implementation needs 30 ms. My OpenCL implementation takes 20 ms on the CPU but 200 ms on the GPU. Is this a normal overhead, or am I doing something wrong? I think it has to do with how I copy the video data to the kernel. I hope you have some suggestions.
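As a rough sanity check on the copy suspicion, here is a back-of-the-envelope estimate of how long it takes just to move the frame data across the bus. It is only a sketch in Python for the arithmetic. The function name `estimate_transfer_ms`, the frame size, the pixel format, and the bandwidth figure are all assumptions made for illustration, not values from my actual setup.

```python
def estimate_transfer_ms(width, height, bytes_per_pixel, frames, bandwidth_gb_per_s):
    """Rough one-way time, in milliseconds, to copy `frames` frames host <-> device.

    Ignores per-call latency, buffer allocation, and driver overhead, so it is
    a lower bound on the copy cost rather than a prediction.
    """
    total_bytes = width * height * bytes_per_pixel * frames
    seconds = total_bytes / (bandwidth_gb_per_s * 1e9)
    return seconds * 1000.0


# Hypothetical numbers: one 1920x1080 frame, 4 bytes per pixel,
# and an assumed effective bus bandwidth of 6 GB/s.
one_frame_ms = estimate_transfer_ms(1920, 1080, 4, 1, 6.0)

# Same assumptions, but copying 30 frames at once.
thirty_frames_ms = estimate_transfer_ms(1920, 1080, 4, 30, 6.0)

print(f"1 frame:   {one_frame_ms:.2f} ms")
print(f"30 frames: {thirty_frames_ms:.2f} ms")
```

With these made-up numbers, a single frame costs roughly 1.4 ms per direction. If the estimate for the real data volume comes out far below the gap between 20 ms and 200 ms, the raw copy size alone probably does not explain it, and it may be worth checking whether buffer creation, kernel compilation, or many small transfers are inside the timed region.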