ctk's picture

ATI Stream SDK 2.01 Solved a Mysterious Crash Issue for Me

Just in case someone else out there is seeing mysterious crashes in Cloo/OpenCL with the ATI Stream SDK 2.0, I highly recommend upgrading to the ATI Stream SDK 2.01, which was just released. It solved a mysterious intermittent crash issue for me when running a highly complex kernel. It could save you a lot of time debugging your Cloo/OpenCL code when the bug may just be in the driver itself. My crashes were unexplained and occurred about 20% of the time, with the other 80% of runs completing fine even with the exact same code and data.


Comments

nythrix's picture

That's great news!
I also experience mysterious random crashes here and there, on both ATI and nVidia. I guess you always have to wait for new technologies to settle in.

Edit: When you say mysterious, can you be a bit more verbose? Any observed patterns would help me.
The OpenCL specs are not fully clear in places (explicit statements about blocking/non-blocking behaviour are missing), and the resulting freedom in the various implementations has exposed some logically weaker points inside Cloo. I haven't been able to force a crash where I expect it to happen, though.

ctk's picture

In all my runs, I'm using an AMD dual core CPU. For my mysterious crashes, I would be using the exact same code each time, and every few runs or so, about 5 to 30 seconds into a simulation, I would get the unexplained crash and be forced to exit. In my simulation, I repeatedly call the OpenCL kernel to perform calculations, and I only allocate memory and write values to it at the beginning of the simulation.

There are no errors or warnings from the OpenCL compiler either. In my kernel code, I'm not even using any local/shared memory, only global right now. I also do not have any barriers inside the kernel code. I'm also pretty sure that I'm not accidentally trying to access memory out of bounds because when my simulations run to completion, they always give the same results.

I'm also not pushing the memory limits of my system (yet), so I have plenty of room in that regard. My simulation takes up about 50 MB for testing. In my kernel code, I do calculations on several float and uint arrays and use the built-in native math functions where possible for speed.

When the program crashes, most of the time I get a Windows 7 crash dialog saying that the program has crashed and needs to exit, with options to debug it or look online for help. But sometimes when it crashes, I instead get a mysterious error message in the console from the OpenCL driver itself, referencing a line of code in the driver with a message that says "Should not reach here!".

In my Cloo code, I made sure to reference the Cloo memory objects at the end of the code to rule out the premature garbage collection issue.

ctk's picture

After further code development, I think I know what was causing my mysterious crashes, and it was my fault! I realized that I was updating a variable in my kernel while it was possibly being read: a classic threading bug. Moving the update to a serialized portion of my kernel fixed that problem. It appears that version 2.01 of the ATI Stream SDK handles execution/memory access slightly differently compared to 2.0, which resulted in the crashes going away. However, I noticed that the calculations my kernel was performing were not always giving consistent results, which drove me to isolate the portion of the code that was causing the problem. The lesson here is to code carefully and do a code review to make sure you don't have read/write conflicts. That, or have better debugging tools that can detect read/write conflicts.
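
For anyone hunting a similar bug, here is a minimal sketch of the kind of read/write conflict described above. It is a plain Python simulation on the host, not OpenCL or Cloo code, and every function name in it is hypothetical. Each simulated work item reads a neighbour's slot and writes its own; running the items in two different orders shows whether the result depends on scheduling. The double-buffered variant is one common way to remove such a conflict and is shown only for comparison; it is not the serialization fix mentioned above.

```python
# Sketch of a read/write conflict between work items, simulated on the host.
# All names are hypothetical; this is not OpenCL or Cloo code.


def step_in_place(data, gid):
    """Work item `gid` reads its left neighbour's slot and overwrites its own."""
    left = (gid - 1) % len(data)
    # Conflict: data[left] may or may not have been updated already,
    # depending on the order in which the work items happen to run.
    data[gid] = data[gid] + data[left]


def step_double_buffered(src, dst, gid):
    """Same calculation, but reads only from `src` and writes only to `dst`."""
    left = (gid - 1) % len(src)
    dst[gid] = src[gid] + src[left]


def run_in_place(initial, order):
    """Execute every work item once, in the given order, on a single buffer."""
    data = list(initial)
    for gid in order:
        step_in_place(data, gid)
    return data


def run_double_buffered(initial, order):
    """Execute every work item once, reading the input and writing a copy."""
    src = list(initial)
    dst = [0] * len(src)
    for gid in order:
        step_double_buffered(src, dst, gid)
    return dst


def is_order_dependent(run, initial):
    """Run the work items forwards and backwards and compare the results."""
    n = len(initial)
    forward = run(initial, range(n))
    backward = run(initial, range(n - 1, -1, -1))
    return forward != backward


if __name__ == "__main__":
    values = [1, 2, 3, 4]
    print("in-place order dependent:",
          is_order_dependent(run_in_place, values))
    print("double-buffered order dependent:",
          is_order_dependent(run_double_buffered, values))
```

A difference between the two orders signals a conflict, but agreement between them does not prove that none exists; a real device can interleave work items in many more ways than these two orderings.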

nythrix's picture

That's not great news about those changes in ATI Stream. Silently skipping bugs isn't exactly what you'd call "healthy".