tgz's picture

Fast, but not that fast... any cloos?

Hi

I'm working on a multi-agent simulation system, and I'm trying to harness my GPU's computing power. I've programmed in CUDA before, but now I've decided to switch to OpenGL. The system is written in C#, and I use Cloo to interact with the GPU. As a test I created a simple Game of Life program, and I was hoping I could get a tenfold or so increase in speed. Well, I didn't. There was a slight increase, something like 0.9 seconds instead of 1.3, but that's not exactly what I was looking for. I can't believe that's it, so I'm guessing I'm doing something really wrong. So I'm looking for someone to tell me, "man, you got it all wrong, take a look at these and those samples, and how can you use uchar * anyway".

I'm using two ComputeBuffers: one contains the current state while the other acts as the target buffer, and they switch roles every turn.
oclKernel stores the kernel, which is compiled elsewhere in the code. map[0] and map[1] are two byte arrays storing the cell values. The map wraps around toroidally (or something like that), meaning that if you wander off the map on one side, you come back on the other.

oclBuffer = new ComputeBuffer<byte>[2];

oclBuffer[0] = new ComputeBuffer<byte>(oclEnv.OclContext, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, map[0]);
oclBuffer[1] = new ComputeBuffer<byte>(oclEnv.OclContext, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, map[1]);

oclKernel.SetValueArgument(2, width);
oclKernel.SetValueArgument(3, height);

Here's the kernel I wrote:

// Advance the state of a Game of Life map by one step:
// 1. Count how many neighbours of the cell are alive (live cells have a value of 1, dead cells a value of 0)
// 2. a) If the cell is alive, two or three live neighbours keep it alive; fewer or more kill it
// 2. b) If the cell is dead, exactly three live neighbours bring it to life
__kernel void Advance(
	__global uchar *map,	// starting array
	__global uchar *dst,		// destination array
	int width,					// map width
	int height)					// map height
{
	int p0 = get_global_id(0);	// which cell?
	int size = width * height;		// map size
	if (p0 >= size)				// are we on the map at all?
		return;
 
	int x = p0 % width;			// x coordinate
 
	int p1, p2;
	int litNeighbours = 0;		// the number of neighbours that are lit / alive
 
	//
	// Find the top and bottom neighbours
	//
	p1 = p0 - width;			// p1 goes north
	if (p1 < 0)					// if it's off the map, come back on the other side
		p1 += size;
	p2 = p0 + width;			// p2 goes south...
	if (p2 >= size)
		p2 -= size;
 
	litNeighbours += map[p1];
	litNeighbours += map[p2];
 
	//
	// neighbours on the left side
	//
	if (x > 0)					
	{
		p0--;							
		p1--;							
		p2--;							
		litNeighbours += map[p0];
		litNeighbours += map[p1];
		litNeighbours += map[p2];
		p0++;
		p1++;
		p2++;
	}
	else
	{
		p0 += width - 1;
		p1 += width - 1;
		p2 += width - 1;
		litNeighbours += map[p0];
		litNeighbours += map[p1];
		litNeighbours += map[p2];
		p0 = p0 - width + 1;
		p1 = p1 - width + 1;
		p2 = p2 - width + 1;
	}
 
	//
	// neighbours on the right side
	//
	if (x < width - 1)					
	{
		p0++;							
		p1++;							
		p2++;							
		litNeighbours += map[p0];
		litNeighbours += map[p1];
		litNeighbours += map[p2];
	}
	else
	{
		p0 = p0 - width + 1;
		p1 = p1 - width + 1;
		p2 = p2 - width + 1;
		litNeighbours += map[p0];
		litNeighbours += map[p1];
		litNeighbours += map[p2];
	}
 
	// As all cells contain either 0 or 1, litNeighbours now holds the number of live neighbouring cells.
 
	p0 = get_global_id(0);
	if (map[p0])
	{
		// if the cell is alive, fewer than two live neighbours will kill it (loneliness),
		// and more than three live neighbours will kill it as well (overpopulation)
		dst[p0] = (litNeighbours == 2 || litNeighbours == 3) ? 1 : 0;
	}
	else
	{
		dst[p0] = litNeighbours == 3 ? 1 : 0;
	}
 
}

Finally, here's the code that schedules the kernel for execution:

// selector alternates between values 0 and 1
oclKernel.SetMemoryArgument(0, oclBuffer[selector]);
oclKernel.SetMemoryArgument(1, oclBuffer[1 - selector]);

oclEnv.OclCommands.Execute(oclKernel, null, new long[] { size },
    new long[] { oclEnv.OclDevice.MaxWorkGroupSize }, oclEnv.OclEvents);

oclEnv.OclCommands.Finish();
map[1 - selector] = oclEnv.OclCommands.Read(oclBuffer[1 - selector], oclEnv.OclEvents);
oclEnv.OclCommands.Finish();

Any hints would be most appreciated.

TGZ :)


Comments

Hortus Longus's picture

What do you think about simple programming?
Perhaps instead of:

if (x > 0)
{
	p0--;
	p1--;
	p2--;
	litNeighbours += map[p0];
	litNeighbours += map[p1];
	litNeighbours += map[p2];
	p0++;
	p1++;
	p2++;
}

try

if (x > 0)
{
	litNeighbours += map[p0 - 1];
	litNeighbours += map[p1 - 1];
	litNeighbours += map[p2 - 1];
}
nythrix's picture

I couldn't find anything algorithmically wrong in your code. However, there's one thing that usually hurts performance on GPUs: code branching. In order to process a lot of threads, the GPU "ties" them together and processes them identically regardless of such jumps. That means ALL of the code is touched by the GPU.
Example: take an if-else statement and two threads. Suppose one of the threads logically follows the if part and the other the else part of the statement. The GPU will still process both the if and the else in both threads, but only modify the data in the valid branch for each thread. I've seen this behaviour explained somewhere in the nVidia docs that come with the CUDA platform.
CPUs are usually better at handling branching. I've had kernels running faster (!) on a 2-core CPU than on a 64-core GPU because of this branching bottleneck.

Try two things: reduce the number of ifs, or try running the kernel on a CPU. Does it make any difference?
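
For illustration, a minimal sketch (untested) of the first option: the left/right wrap can be folded into two offsets computed with ternaries, which usually compile to select instructions rather than divergent branches. It assumes the same row-major toroidal map and the p0/p1/p2 row indices from your kernel:

	int dxl = (x == 0) ? width - 1 : -1;		// offset to the left neighbour, with wrap
	int dxr = (x == width - 1) ? 1 - width : 1;	// offset to the right neighbour, with wrap

	int litNeighbours = map[p1] + map[p2]		// top and bottom neighbours
		+ map[p0 + dxl] + map[p1 + dxl] + map[p2 + dxl]		// left column
		+ map[p0 + dxr] + map[p1 + dxr] + map[p2 + dxr];	// right column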

tgz's picture

Haha, yeah, I really should get some sleep, I guess. :)

tgz's picture
nythrix wrote:

I couldn't find anything algorithmically wrong in your code. However, there's one thing that usually hurts performance on GPUs: code branching. In order to process a lot of threads, the GPU "ties" them together and processes them identically regardless of such jumps. That means ALL of the code is touched by the GPU.
Example: take an if-else statement and two threads. Suppose one of the threads logically follows the if part and the other the else part of the statement. The GPU will still process both the if and the else in both threads, but only modify the data in the valid branch for each thread. I've seen this behaviour explained somewhere in the nVidia docs that come with the CUDA platform.
CPUs are usually better at handling branching. I've had kernels running faster (!) on a 2-core CPU than on a 64-core GPU because of this branching bottleneck.

Try two things: reduce the number of ifs, or try running the kernel on a CPU. Does it make any difference?

Yes, I read those documents too. Unfortunately, I can think of no way of reducing the number of ifs, unless I don't process the cells on the edges of the map. However, processing them is necessary.
I was thinking about building a "border" around the actual map, copying the cells in from the other side, and only processing the inside elements. I might just give it a try, but I'm afraid of the cost of such preprocessing.
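
Roughly what I have in mind, as an untested sketch: a padded map with row pitch width + 2, whose border cells the host refreshes from the opposite edges before each step, launched in 2D over the interior only:

	__kernel void AdvancePadded(
		__global const uchar *map,	// padded source, (width + 2) * (height + 2) cells
		__global uchar *dst,		// padded destination, same layout
		int width,					// interior width
		int height)					// interior height
	{
		int x = get_global_id(0) + 1;	// interior columns run 1..width
		int y = get_global_id(1) + 1;	// interior rows run 1..height
		int pitch = width + 2;			// padded row length
		int p = y * pitch + x;

		// The border holds copies of the opposite edges, so no wrap checks are needed.
		int lit = map[p - pitch - 1] + map[p - pitch] + map[p - pitch + 1]
				+ map[p - 1]                          + map[p + 1]
				+ map[p + pitch - 1] + map[p + pitch] + map[p + pitch + 1];

		dst[p] = (map[p] ? (lit == 2 || lit == 3) : (lit == 3)) ? 1 : 0;
	}

The preprocessing would be copying 2 * (width + height) + 4 border cells per step, which is small next to the width * height interior.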

carga's picture

Hello, tgz.

I have no ready solution, just a few ideas.

1. Make sure you do not include the kernel compile time in the total time.

2. You do not need both buffers to be ReadWrite: probably it's enough to have the first one ReadOnly and the second one WriteOnly. I think this MAY save some msecs.

3. AFAIK you should explicitly enable byte-addressing mode in your kernel (I have to do it with ATI SDK v2.1). Something like "#pragma OPENCL EXTENSION cl_khd_byte_addressing : enable" (please google it, because I don't remember the exact syntax right now).

4. On my CPU the max work item dimension is 1024 and I have 3 dimensions available, so I can address only 1024 work items in linear mode (and YOU use linear mode too), but about 10^9 work items in 3D. I expect you will not get any measurable benefit from 1024 work items. Please try 1 million or 1 billion.

For the GPU the situation is even worse: my ATI Radeon 5750 has 3 dimensions with 256 work items in each. That is only about 16 million simultaneous work items. Not that many THREADS! =)

5. The greatest warning (and a question for nythrix): I experience trouble with byte array arguments. =((( My code works perfectly when executed on the CPU: it returns its result through a byte array in the same way as your "__global uchar *dst". As I said, it works perfectly on the CPU. But when I execute it on the GPU, the code runs 10 times SLOWER than on the CPU and does not return the expected result. =(

I am going to prepare the shortest possible example and share it with the community to debug this issue, but currently I guess that there is some internal issue with byte (8-bit unsigned) arrays either in Cloo or in the ATI Stream implementation.

Have fast code!
Anton.
http://kyta.spb.ru

PS I use Cloo v0.7.2 + ATI Stream SDK v2.1 + Ubuntu 10.04 64-bit + Mono v2.6.4.0 built from the release tarball. GPU: AMD Radeon 5750.

PPS The exact instruction to enable byte addressing:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
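
For illustration, a sketch of where it goes: at the top of the kernel source, before any byte-sized stores.

	#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

	__kernel void Advance(
		__global uchar *map,
		__global uchar *dst,
		int width,
		int height)
	{
		/* ... kernel body as before ... */
	}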

ctk's picture
tgz wrote:

// selector alternates between values 0 and 1
oclKernel.SetMemoryArgument(0, oclBuffer[selector]);
oclKernel.SetMemoryArgument(1, oclBuffer[1 - selector]);

oclEnv.OclCommands.Execute(oclKernel, null, new long[] { size },
new long[] { oclEnv.OclDevice.MaxWorkGroupSize }, oclEnv.OclEvents);

oclEnv.OclCommands.Finish();
map[1 - selector] = oclEnv.OclCommands.Read(oclBuffer[1 - selector], oclEnv.OclEvents);
oclEnv.OclCommands.Finish();

It looks to me like your problem is that you are reading and writing a large chunk of memory to and from the GPU on every iteration of the code above, which is a no-no, because the constant transferring makes you bandwidth-bound.

tgz's picture
ctk wrote:

It looks to me like your problem is that you are reading and writing a large chunk of memory to and from the GPU on every iteration of the code above, which is a no-no, because the constant transferring makes you bandwidth-bound.

Hmm, I might be wrong, but I don't (or at least I don't intend to) write the memory every iteration. I just set the memory arguments, but those are existing, preallocated buffers. Unfortunately, I do have to read the results back every iteration, since I have to display them in a WinForms PictureBox.

tgz's picture
carga wrote:

Hello, tgz.

I have no ready solution, just a few ideas.

1. Make sure you do not include the kernel compile time in the total time.

2. You do not need both buffers to be ReadWrite: probably it's enough to have the first one ReadOnly and the second one WriteOnly. I think this MAY save some msecs.

3. AFAIK you should explicitly enable byte-addressing mode in your kernel (I have to do it with ATI SDK v2.1). Something like "#pragma OPENCL EXTENSION cl_khd_byte_addressing : enable" (please google it, because I don't remember the exact syntax right now).

4. On my CPU the max work item dimension is 1024 and I have 3 dimensions available, so I can address only 1024 work items in linear mode (and YOU use linear mode too), but about 10^9 work items in 3D. I expect you will not get any measurable benefit from 1024 work items. Please try 1 million or 1 billion.

For the GPU the situation is even worse: my ATI Radeon 5750 has 3 dimensions with 256 work items in each. That is only about 16 million simultaneous work items. Not that many THREADS! =)

5. The greatest warning (and a question for nythrix): I experience trouble with byte array arguments. =((( My code works perfectly when executed on the CPU: it returns its result through a byte array in the same way as your "__global uchar *dst". As I said, it works perfectly on the CPU. But when I execute it on the GPU, the code runs 10 times SLOWER than on the CPU and does not return the expected result. =(

I am going to prepare the shortest possible example and share it with the community to debug this issue, but currently I guess that there is some internal issue with byte (8-bit unsigned) arrays either in Cloo or in the ATI Stream implementation.

Have fast code!
Anton.
http://kyta.spb.ru

PS I use Cloo v0.7.2 + ATI Stream SDK v2.1 + Ubuntu 10.04 64-bit + Mono v2.6.4.0 built from the release tarball. GPU: AMD Radeon 5750.

PPS The exact instruction to enable byte addressing:
#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable

Hello!

Thanks for your advice.
1. Unfortunately, the kernel compilation is not included in the total time.

2. Those buffers switch roles every turn: one turn the first is the input buffer and the other is the output, and the next turn they swap. It's done this way so I don't have to copy the data all the time; I'd think copying would do more harm than setting both modes to ReadWrite.

3. Thanks, I'll try that one.

4. OK, I might not get how this works. I have a GeForce 8600 GT video card with a max workgroup size of 512. I don't think using two dimensions would double that number, and I'm absolutely sure the number of transistors in my GPU is constant, so even if I have sixty billion work items, my GPU will only work on 512 at a time.

5. uchar seems to be working fine here. Actually, I tried using int and float as well, but saw no improvement in speed, so I decided to stick with uchar, which takes up less space.

douglas125's picture

All right, my 2 cents here. You switched to OpenCL and not OpenGL, right?
Now seriously,

1. First of all, I think you may want to consider using work_dim = 2, i.e., launching the kernel with a two-element Global_Work_Size of { Width, Height }.

This would allow you to avoid this line:

int x = p0 % width;			// x coordinate

Modulus is very slow. If you insist on the 1D version, try

	int temp = p0 / width;
	int x = p0 - temp * width;

which gives the same x without the modulus operator.
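
For illustration, a sketch (untested) of the 2D variant; with a { width, height } global size, the coordinates come straight from get_global_id and the modulus disappears entirely:

	__kernel void Advance(
		__global uchar *map,
		__global uchar *dst,
		int width,
		int height)
	{
		int x = get_global_id(0);	// column, no modulus needed
		int y = get_global_id(1);	// row
		int p0 = y * width + x;		// linear index into the map
		/* ... neighbour counting as before ... */
	}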

2. I don't get this:

	if (p0 > size)					// are we on the map at all?
		return;

Don't you already know that it will be on the map? Make sure to use the correct Global_Work_Size and remove this check.

3. Copy the neighbours of the pixel you are analyzing into private variables before anything else. You are reading data from global memory in a pattern that GPUs don't like: GPUs prefetch memory when you read globals (coalesced access is faster). You can simply name them
a11, a12, a13, a21, a22, a23, a31, a32, a33 (a22 being the pixel you are analyzing)
and assign the values just as the kernel starts.
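
For illustration, a sketch (untested) of that pattern; it assumes xl/xr hold the wrapped left/right column indices and yu/yd the wrapped upper/lower row indices, computed once at the top of the kernel:

	// Pull the 3x3 neighbourhood into private registers, then work on the copies.
	uchar a11 = map[yu * width + xl], a12 = map[yu * width + x], a13 = map[yu * width + xr];
	uchar a21 = map[y  * width + xl], a22 = map[y  * width + x], a23 = map[y  * width + xr];
	uchar a31 = map[yd * width + xl], a32 = map[yd * width + x], a33 = map[yd * width + xr];

	int lit = a11 + a12 + a13 + a21 + a23 + a31 + a32 + a33;	// all eight neighbours
	dst[y * width + x] = (a22 ? (lit == 2 || lit == 3) : (lit == 3)) ? 1 : 0;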

4. I don't get this either:

	// As all cells contained either zero or 1, litNeighbours will now hold the number of live neighbouring cells.
 
	p0 = get_global_id(0);

Don't you already know p0?

5. Oh! You are using the maximum work size?? EDIT: Sorry, I saw only the local work size. You should always be on the map. Also, you don't need to Finish the queue if you enqueue a blocking Read.

6. There's a size = width * height in your kernel. Why not use int size = get_global_size(0)?

You've got some work to do there. Let me know if you don't understand something I wrote.