XZibit's picture

Make native Calls visible or use them in GL at least?

Cheers,

I have been trying out a few things around the TK and have some questions about the implementation of the GL class.
I got the current version, where all GL calls go through delegates, but in my profiling test the non-public native calls
were much faster than the delegated calls (e.g. a glLoadIdentity took 6 to 10 ns delegated and 2 ns extern) on .NET 2.0,
on a 2.3 GHz dual-core system with an ATI Radeon HD, outside of debug mode.

So why not make the native calls visible for use, or at least use them instead of the delegated ones?


Comments

the Fiddler's picture

Delegates are necessary for OpenGL extensions that must be loaded at runtime through Marshal.GetDelegateForFunctionPointer(). This includes everything after OpenGL 1.1 on Windows (and different subsets on other operating systems).

It is not possible to avoid these function pointers, as they are a fundamental part of the OpenGL design. However, with some ingenuity it is possible to avoid the delegates themselves. The answer lies in the 'calli' (call indirect) IL instruction, which can be used to call through a function pointer (which may point into native code!) Caveats:

  1. Calli is rather uncommon, so its performance characteristics may vary. Some people have suggested it performs better than DllImports; others have reported it performs worse.
  2. Calli is generally not available on mobile devices. That's not a problem, since (a) OpenGL ES lies in separate namespaces and (b) most mobile devices should be able to use DllImports directly (no delegates).
  3. Calli is unverifiable, which means it must be enclosed in an 'unsafe' block (not a problem, can be handled inside OpenTK with the current design).
  4. There is no way to issue this instruction through C#.

The last point is the significant one. So how could one use calli in the GL class? In one of two ways:

  • (Easier) Disassemble OpenTK.dll, insert the calli instructions and re-assemble it. We didn't have cross-platform tools for that back in 2006, but now this is possible through Cecil or IKVM.Reflection.Emit. This could be done as a custom post-build step.
  • (Harder) Modify the binding generator to generate IL, rather than C#, assemble and merge the resulting module into OpenTK.dll (or leave it as a separate dll).

This is actually one feature that I'd really love to see in OpenTK at some point. Creating the delegates at startup is very costly, both in memory and CPU time. Storing the function pointers as plain addresses is quite a bit better, plus there's a good chance this would be faster.

Edit: actually, I've been giving this some thought and it might be possible to handle this by hiding the calli instruction behind a regular function, say call(IntPtr address). We could implement 'call' in pure IL and invoke it through regular C#. A trivial change to the generator and we could eliminate both the DllImports and delegates from the GL class (reducing dll size by ~1MB and improving startup time tremendously!) The 'call' function itself would be inlined by the compiler, so no overhead on that part.

Has anyone attempted this before? This could be very interesting.

XZibit's picture

As I said before, we tested it in several ways in profiling tests within our application. In particular, we got about 1000 frames more than
with the delegated calls themselves while rendering a GUI with mixed ortho and projection. We tested it very thoroughly to get the best performance for our application. I don't know about other users' platform configuration and code style, but on the systems we tested, including a quad core with nearly 60000 fps on a .NET 4.0 platform, we got the same values in general. In the end we made the "core" itself public and used it mixed with the GL class.

Alternatively, you can test it yourself by comparing a direct call to a C# method with a delegated one. The delegated one is often over 4 times slower than the direct one (about 1.2 ns direct and 5 to 6 ns delegated). That isn't much by itself, but when making a whole lot of thousands of calls per second there is a difference. Complex games in particular, something like Skyrim or Far Cry, would profit from it.

2. Regarding OpenGL ES: I expect there are different native calls there, for which your "provider" would build the functions if there weren't any.

3. I think users who work with OpenGL in C# should have a little knowledge of unsafe code when they use DllImports.

Even if users use extension calls like shaders and so on, they use the call returned by GetProcAddress or its siblings; the rest of the delegated calls are the ones bound by DllImport. It would also speed up the creation of a valid context if only the extensions had to be bound at creation time.

BTW
I read that it wouldn't be possible to create complex games with OpenTK, but in our game development course we have a different view on that and tried it out with some small tests. C++ may be a few times faster, but C# is as fast as C++ in a way.

the Fiddler's picture
XZibit wrote:

As I said before, we tested it in several ways in profiling tests within our application. In particular, we got about 1000 frames more than
with the delegated calls themselves while rendering a GUI with mixed ortho and projection. We tested it very thoroughly to get the best performance for our application. I don't know about other users' platform configuration and code style, but on the systems we tested, including a quad core with nearly 60000 fps on a .NET 4.0 platform, we got the same values in general. In the end we made the "core" itself public and used it mixed with the GL class.

60000 fps vs 61000 fps is a difference of 0.273us per frame, i.e. roughly equivalent to 60 fps vs 60.001 fps. More performance is good, but this difference is unlikely to make or break an application in itself, considering other common bottlenecks (e.g. a gen-0 GC cycle is roughly three orders of magnitude slower).
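As a sanity check, the frame-time arithmetic can be sketched in a few lines of Python, used here purely as a calculator. The helper name fps_after_saving is made up for illustration, and the sketch assumes the saving is the same fixed amount of time in every frame.

```python
def fps_after_saving(fps, saved_seconds_per_frame):
    """Frame rate after removing a fixed amount of time from every frame."""
    return 1.0 / (1.0 / fps - saved_seconds_per_frame)

# Time saved per frame when going from 60000 fps to 61000 fps.
saved = 1.0 / 60000 - 1.0 / 61000

print(f"saved per frame: {saved * 1e9:.0f} ns")                   # 273 ns
print(f"60 fps becomes:  {fps_after_saving(60, saved):.4f} fps")  # 60.0010 fps
```

A saving of roughly a quarter of a microsecond per frame is what moves 60 fps to about 60.001 fps.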

Quote:

Alternatively, you can test it yourself by comparing a direct call to a C# method with a delegated one. The delegated one is often over 4 times slower than the direct one (about 1.2 ns direct and 5 to 6 ns delegated). That isn't much by itself, but when making a whole lot of thousands of calls per second there is a difference. Complex games in particular, something like Skyrim or Far Cry, would profit from it.

This is indeed a real concern. Modern game developers porting console games to the PC regularly face problems with draw call overhead. I recall an interview regarding a game that could issue up to 100000 calls/sec on a console but hit a brick wall at ~30000 calls/sec on the PC.

Consider what this means for OpenTK. 30000 calls/sec at 4.5ns (additional) overhead per call totals 135us per second. Even if all 135us fell within a single frame, that would be equivalent to 60 fps vs 60.49 fps, a statistically significant but not world-ending difference. Truth is, at 30000 calls/sec the primary problem is kernel switch overhead, rather than call overhead.
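A rough check of this estimate, sketched in Python as a calculator. The constant and function names are hypothetical, and the 4.5ns figure is simply the per-call difference quoted above. The sketch evaluates two readings: the 135us spread over one second of 60 frames, and the pessimistic case where the full 135us lands in every single frame.

```python
CALLS = 30_000            # GL calls, as quoted in the post
EXTRA_NS_PER_CALL = 4.5   # additional delegate overhead per call (from the post)

overhead_s = CALLS * EXTRA_NS_PER_CALL * 1e-9  # total extra time in seconds

def fps_without_overhead(fps, overhead_per_frame_s):
    """Frame rate once a fixed per-frame overhead has been removed."""
    return 1.0 / (1.0 / fps - overhead_per_frame_s)

# Reading 1: 30000 calls per second, so the overhead is shared by 60 frames.
per_second = fps_without_overhead(60, overhead_s / 60)
# Reading 2 (pessimistic): 30000 calls in every frame.
per_frame = fps_without_overhead(60, overhead_s)

print(f"overhead:         {overhead_s * 1e6:.1f} us")  # 135.0 us
print(f"calls per second: {per_second:.3f} fps")       # 60.008 fps
print(f"calls per frame:  {per_frame:.2f} fps")        # 60.49 fps
```

Either way the gain stays below one percent of the frame rate, which fits the point that delegate overhead is measurable but not a show-stopper.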

What I am getting at is that the overhead of delegates is measurable but not a show-stopper. I've been using OpenTK for half a decade now and I've never seen call overhead cause performance problems by itself in modern OpenGL. (The only time it becomes significant is when using GL.Begin-GL.End, but that's slow even in pure C so it doesn't really matter).

There are still good reasons why avoiding delegates at the lowest level is a good idea, and it's something I wish to test in the future. However, do note that your approach (making the DllImports public) is only applicable to GL 1.1 functions - and, frankly, this is the wrong optimization to perform.

Quote:

BTW
I read that it wouldn't be possible to create complex games with OpenTK, but in our game development course we have a different view on that and tried it out with some small tests. C++ may be a few times faster, but C# is as fast as C++ in a way.

Indeed. The main disadvantage of C# vs C++ is mathematical operations - graphics (OpenGL) will be within a few percent of C++ when used correctly.

XZibit's picture
the Fiddler wrote:

60000 fps vs 61000 fps is a difference of 0.273us per frame, i.e. roughly equivalent to 60 fps vs 60.001 fps. More performance is good, but this difference is unlikely to make or break an application in itself, considering other common bottlenecks (e.g. a gen-0 GC cycle is roughly three orders of magnitude slower).

I don't know what the environment is or why it comes to such a result; I only know that it is a .NET 4 x64 platform, so I don't worry about it. We'll take the current frame rate anyway, but it's nice to see how many frames we could get at most. It also makes room for some operations like a GUI, where an example I saw here had a difference of 1000 frames between showing the GUI and doing something else, so performance may make applications more efficient.

I attached a little program that analyzes which GL functions are included in the current library and which have to be bound by GetProcAddress. You may find it useful.

http://www.file-upload.net/download-4038567/Analyzer.zip.html

EDIT

Just a screenshot from a performance test calling glLoadIdentity() 1000 times on a Win7 .NET 4 x64 quad-core system. Values are in nanoseconds.

c2woody's picture

Posting "Application is 1000 frames more" is like saying "My new car goes 1000 kilometers more", which really is completely useless information (does the car die after 1000 kilometers? did you just drive 1000 kilometers more than last time?)