This chapter discusses advanced topics on the interaction of .Net/Mono, OpenGL and OpenAL. It builds on the previous two chapters and a good grasp of C#, OpenGL and OpenAL is assumed.
The .Net Framework features an aggressive, generational and compacting Garbage Collector (GC): aggressive because it knows the location and reachability of every managed object, generational because it distinguishes long-lived objects from temporary ones, and compacting because it moves data in memory to avoid leaving holes behind. The GC is a great tool in the .Net arsenal, not only because it increases productivity but also because it provides extremely fast memory allocations (compared to standard C/C++ malloc/new).
[Describe the unmanaged resource pool, pinning and performance considerations]
As discussed in the previous chapter, GC finalization occurs on the finalizer thread. This poses a problem for OpenGL resource deallocation, since the context used to create the resources is not available on the finalizer thread!
Since OpenGL functions cannot be called in finalizers, a different methodology must be followed. By implementing the disposable pattern, we can use the Dispose() method to deterministically destroy OpenGL resources in the main thread. By modifying the finalizer logic, we can flag resources as 'dead' and destroy them later from the main thread. Finally, by extending the concept of the OpenGL context, we can be notified of context destruction and release all related resources.
The following code describes the implementation of the "OpenGL disposable pattern" in OpenTK, but it is easy to adapt this code to any managed OpenGL project:
In OpenTK, each GLContext class maintains a queue of OpenGL resources that need to be destroyed. Resources are added to this queue through the RegisterForDisposal() call, and they are destroyed through the DisposeResources() method. The whole process is deterministic: it is your responsibility to call DisposeResources at appropriate time intervals (or set up a timer event to do this for you).
Resource creation takes a small performance hit due to the call to GLContext.CurrentContext, while garbage-collectable OpenGL resources consume slightly more memory (due to the reference to the GLContext). Prefer calling the Dispose() method to destroy resources instead of relying on the GC, as finalizable resources are only collected on a generation 1 or 2 GC sweep.
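The queue described above can be sketched without any OpenGL calls. The following is a minimal illustration and not the OpenTK source: SketchContext, SketchTexture and the deleteHandle delegate are hypothetical names, and the delegate merely stands in for whatever GL delete call a real resource would need.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a GL context. It owns the queue of handles
// that finalizers have flagged as dead and destroys them on request.
sealed class SketchContext
{
    private readonly object sync = new object();
    private readonly Queue<int> deadHandles = new Queue<int>();
    private readonly Action<int> deleteHandle; // assumption: wraps a GL delete call

    public SketchContext(Action<int> deleteHandle) { this.deleteHandle = deleteHandle; }

    // Safe to call from any thread, including the finalizer thread.
    public void RegisterForDisposal(int handle)
    {
        lock (sync) deadHandles.Enqueue(handle);
    }

    // Must be called on the thread that owns the context.
    // Returns the number of handles that were destroyed.
    public int DisposeResources()
    {
        int count = 0;
        while (true)
        {
            int handle;
            lock (sync)
            {
                if (deadHandles.Count == 0) break;
                handle = deadHandles.Dequeue();
            }
            deleteHandle(handle);
            count++;
        }
        return count;
    }

    // Immediate destruction, used by Dispose() on the context thread.
    public void Delete(int handle) { deleteHandle(handle); }
}

sealed class SketchTexture : IDisposable
{
    private readonly SketchContext context;
    private readonly int handle;
    private bool disposed;

    public SketchTexture(SketchContext context, int handle)
    {
        this.context = context;
        this.handle = handle;
    }

    // Deterministic path: destroy the resource right away.
    public void Dispose()
    {
        if (disposed) return;
        disposed = true;
        context.Delete(handle);
        GC.SuppressFinalize(this);
    }

    // Finalizer path: no GL calls here, only flag the handle as dead.
    ~SketchTexture()
    {
        if (!disposed) context.RegisterForDisposal(handle);
    }
}

static class Program
{
    static void Main()
    {
        var context = new SketchContext(h => Console.WriteLine("deleting handle " + h));

        using (var texture = new SketchTexture(context, 1)) { }

        // Simulates what a finalizer would do for a forgotten resource.
        context.RegisterForDisposal(2);

        // Called periodically on the context thread, e.g. once per frame.
        context.DisposeResources();
    }
}
```

The lock keeps the queue consistent when the finalizer thread and the main thread touch it at the same time, while the delete call itself only ever runs on the thread that calls Dispose() or DisposeResources().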
The current implementation in OpenTK does not take shared contexts into account - this will be taken care of in the near future.
Introduction
A widely available Texture Compression scheme comes from S3. Its wide availability is mostly due to Microsoft licensing it and including it in DirectX. It was added to the DDS (DirectDraw Surface) file format, which is basically a copy of the Texture as it is stored in Video Memory. Every graphics accelerator compatible with DirectX 7 or higher supports this Texture Compression.
The DXT Formats
S3 Texture Compression (abbreviated S3TC; the Formats are named DXTn, where 1 <= n <= 5) encodes the Image in Blocks of 4x4 Texels instead of storing every single Texel. The ideal dimensions of a compressed Texture are therefore multiples of 4, like 640x480, or powers of 2, which fit nicely into these Blocks. This is an ideal and not a restriction: the specification allows any non-power-of-two dimension, but even a Texture of size 2x1 will internally occupy a full 4x4 Block (the other Texels in the Block are undefined).
This results in a 6:1 compression ratio for DXT1 (compared to 24-bit RGB) and a 4:1 ratio for DXT3/5 (compared to 32-bit RGBA), which translates into smaller files, shorter load times and also shorter render times.
DXT1, 8 Bytes per Block, Accuracy: R5G6B5 or R5G5B5A1
DXT3, 16 Bytes per Block, Accuracy: R5G6B5 Color plus explicit 4-bit Alpha per Texel
DXT5, 16 Bytes per Block, Accuracy: R5G6B5 Color plus Alpha interpolated between two 8-bit values
The formats DXT2 and DXT4 do exist, but they contain pre-multiplied Alpha, which is problematic when blending with images that use explicit Alpha (RGBA, DXT3/5, etc). As a result those formats have barely been used, and some hardware and export/import tools do not support them. Avoiding DXT2/4 is strongly recommended, as they offer no beneficial functionality over the established formats.
Compressed vs. Uncompressed
You probably guessed it already: there is a catch involved in reliably shrinking an image to 25% of its uncompressed size. S3TC is a lossy compression technique. The resulting quality loss, which can be influenced by tweaking the Filter options when compressing the image, differs from the one introduced by JPG compression. Although both formats - .dds and .jpg - are designed to compress an Image, the S3TC format was developed with graphics hardware in mind.
A bilinear Texture lookup usually reads 2x2 Texels from the Texture and interpolates those 4 Texels to get the final Color. Since a Block consists of 4x4 Texels, there is a good chance that all 4 Texels - which must be examined for the bilinear lookup - are in the same Block. This means that the worst case scenario involves reading 4 Blocks, but usually only 1-2 Blocks are used to achieve the bilinear lookup. When using uncompressed Textures, every bilinear lookup requires reading 4 Texels.
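The "usually only 1-2 Blocks" estimate can be checked by counting. The sketch below enumerates all 16 positions a 2x2 footprint can take relative to a 4x4 Block, assuming every position is equally likely and ignoring wrapping at the Texture edges; the names are made up for the illustration.

```csharp
using System;

static class BilinearFootprint
{
    // Number of 4x4 blocks touched by the 2x2 footprint whose
    // first texel is (x, y).
    public static int BlocksTouched(int x, int y)
    {
        int blocksX = (x % 4 == 3) ? 2 : 1; // footprint straddles a vertical block edge
        int blocksY = (y % 4 == 3) ? 2 : 1; // footprint straddles a horizontal block edge
        return blocksX * blocksY;
    }

    // Average over the 16 possible footprint positions within a block.
    public static double AverageBlocksTouched()
    {
        int total = 0;
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                total += BlocksTouched(x, y);
        return total / 16.0; // 25 / 16
    }

    static void Main()
    {
        Console.WriteLine(AverageBlocksTouched());
    }
}
```

Under these assumptions 9 of the 16 positions stay inside a single Block, 6 touch two Blocks and only 1 touches four, which gives an average of 25/16, or roughly 1.56 Blocks per lookup.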
If you do the maths now you will notice that the compressed image actually needs 16 Bytes for 1 Block of RGBA Color, but the 4 uncompressed Texels of RGBA need 16 Bytes too. And yes, if you only drew a single Pixel on the screen, all this would not bring any noticeable performance gain. It would actually be slower if multiple Blocks had to be read to do the lookup.
However, in OpenGL you typically draw more than a single Pixel, at least a Triangle. When the Triangle is rasterized, a lot of Pixels will be very close to each other, which means their 2x2 lookup is very likely to fall into the same 4x4 Block used by the last lookup, or into a close neighbour. Graphics cards usually exploit this locality by using a small amount of memory on the chip as a dedicated Texture Cache. If a Cache hit is made, the cost for reading the Texels is very low compared to reading from Video Memory.
That's why S3TC does decrease render times: the earlier mentioned 16 Bytes of a DXTn Block contain 16 Texels (1 Byte per Texel), while 16 Bytes of uncompressed Texture only contain 4 Texels (4 Bytes per Texel). A lot more data is stored in the 16 Bytes of DXTn, and a lot of lookups will be able to use the fast Texture Cache. The game Quake 3 Arena's Framerate increases by ~20% when using compressed Textures, compared to using uncompressed Textures.
Restrictions
Although you might be convinced now that Texture Compression is something worth looking into, do handle it with care. After all, it's a lossy compression Technique which introduces compression Artifacts into the Texture. For Textures that are close to the Viewer these Artifacts will be noticeable. That's why 2D Elements which are drawn very close to the near Plane - like the Mouse Cursor, Fonts or User Interface Elements like the Health display - are usually done with uncompressed Textures, which do not suffer from Artifacts.
As a rule of thumb, do not use Texture Compression where 1 Texel in the Texture will map to 1 Pixel on the Screen.
Using OpenTK.Utilities .dds loader
At the time of writing, the .dds loader included with OpenTK can handle compressed 2D Textures and compressed Cube Maps. Keep in mind that the loader expects a valid OpenGL Context to be present. It will only read the file from disk and upload all MipMap levels to OpenGL. It will NOT set the minification/magnification filter or the wrapping mode, because it cannot guess how you intend to use the Texture.
void LoadFromDisk( string filename, bool flip, out int texturehandle, out TextureTarget dimension)
Input Parameter: filename
A string used to locate the DDS file on the hard disk. Note that escape-sequences like "\n" are NOT stripped from the string.
Input Parameter: flip
The DDS format is designed to be used with DirectX, which defines GL.TexCoord2(0.0, 0.0) as the top-left corner of the Image, while OpenGL uses the bottom-left corner. If you wish to use the default OpenGL Texture Matrix, the Image must be flipped before loading it as a Texture into OpenGL.
Output Parameter: texturehandle
If any error occurred while loading, the loader will return "0" in this parameter. A value >0 is a valid Texture handle that can be used with GL.BindTexture.
Output Parameter: dimension
This parameter is used to identify what was loaded, currently it can return "Invalid", "Texture2D" or "TextureCube".
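To illustrate what the flip parameter implies for compressed data, here is a sketch of a vertical flip of DXT1 Blocks. It is not taken from the OpenTK loader. The class and method names are invented, and it assumes a single MipMap level in the standard DXT1 Block layout (two 16-bit Colors followed by four Bytes of 2-bit indices, one Byte per Texel row) whose height is a multiple of 4.

```csharp
using System;

static class Dxt1Flip
{
    const int BlockBytes = 8; // DXT1: 2 x 16-bit colors + 4 rows of 2-bit indices

    // Returns a vertically flipped copy of one DXT1 mipmap level.
    public static byte[] FlipVertically(byte[] data, int width, int height)
    {
        if (height % 4 != 0)
            throw new ArgumentException("height must be a multiple of 4");

        int blocksX = (width + 3) / 4;
        int blocksY = height / 4;
        int rowBytes = blocksX * BlockBytes;
        if (data.Length != rowBytes * blocksY)
            throw new ArgumentException("unexpected data length");

        byte[] result = new byte[data.Length];
        for (int by = 0; by < blocksY; by++)
        {
            // Rows of blocks swap places: the first becomes the last.
            int src = by * rowBytes;
            int dst = (blocksY - 1 - by) * rowBytes;
            for (int bx = 0; bx < blocksX; bx++)
            {
                int s = src + bx * BlockBytes;
                int d = dst + bx * BlockBytes;

                // Bytes 0-3 hold the two reference colors; copy unchanged.
                Array.Copy(data, s, result, d, 4);

                // Bytes 4-7 hold one texel row each; reverse their order.
                result[d + 4] = data[s + 7];
                result[d + 5] = data[s + 6];
                result[d + 6] = data[s + 5];
                result[d + 7] = data[s + 4];
            }
        }
        return result;
    }

    static void Main()
    {
        // A 4x8 image: two blocks stacked vertically, filled with
        // recognizable byte values.
        byte[] image = new byte[16];
        for (int i = 0; i < image.Length; i++) image[i] = (byte)i;
        Console.WriteLine(BitConverter.ToString(FlipVertically(image, 4, 8)));
    }
}
```

The flip never decompresses a Block: only the order of the Blocks and of the index rows inside each Block changes, so no additional quality is lost. DXT3 and DXT5 would also need their Alpha data reordered, which this sketch does not cover.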
Example Usage
Remember that you must first GL.Enable the states Texture2D or TextureCube, before using the Texture in drawing.
Useful links:
ATi Compressonator:
http://ati.amd.com/developer/compressonator.html
nVidia's Photoshop Plugin:
http://developer.nvidia.com/object/photoshop_dds_plugins.html
nVidia's GPU-accelerated Texture Tools:
http://developer.nvidia.com/object/texture_tools.html
Detailed comparison of uncompressed vs. compressed Images:
http://www.digit-life.com/articles/reviews3tcfxt1/
OpenGL Extension Specification:
http://www.opengl.org/registry/specs/EXT/texture_compression_s3tc.txt
Microsoft's .dds file format specification (was used to build the OpenTK .dds loader)
http://msdn.microsoft.com/archive/default.asp?url=/archive/en-us/dx81_c/...
DXT Compression using CUDA
http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/d...
Real-Time YCoCg-DXT Compression
http://news.developer.nvidia.com/2007/10/real-time-ycocg.html
Last Update of the Links: January 2008
Graphics cards usually have 2 Caches designed to help with processing Vertices, one of their favorite tasks.
Pre T&L Cache
This Cache merely stores the untransformed Vertices read from a VBO. Optimizing for this Cache simply means sorting your Vertices in order of appearance, so the IBO issues Triangles in the order (0,1,2,0,2,3) rather than (999,17,2044,999,2044,2). This Cache is typically extremely large, being able to hold ~64k Vertices on a Geforce 3 and up.
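Sorting Vertices in order of appearance can be sketched as a single pass over the Index Array. The helper below is a hypothetical illustration, not an OpenTK function: it moves each Vertex to the position of its first use, rewrites the indices in place, and drops Vertices that are never referenced.

```csharp
using System;
using System.Collections.Generic;

static class VertexFetchOrder
{
    // Returns the vertices sorted by first use and rewrites 'indices'
    // in place so they refer to the new vertex positions.
    public static T[] Reorder<T>(T[] vertices, int[] indices)
    {
        int[] remap = new int[vertices.Length];
        for (int i = 0; i < remap.Length; i++) remap[i] = -1; // -1 = not placed yet

        var reordered = new List<T>(vertices.Length);
        for (int i = 0; i < indices.Length; i++)
        {
            int old = indices[i];
            if (remap[old] < 0)
            {
                // First appearance: append the vertex to the new array.
                remap[old] = reordered.Count;
                reordered.Add(vertices[old]);
            }
            indices[i] = remap[old];
        }
        return reordered.ToArray();
    }

    static void Main()
    {
        string[] vertices = { "v0", "v1", "v2", "v3", "v4", "v5" };
        int[] indices = { 5, 1, 4, 5, 4, 0 };

        string[] sorted = Reorder(vertices, indices);

        Console.WriteLine(string.Join(",", indices)); // 0,1,2,0,2,3
        Console.WriteLine(string.Join(",", sorted));  // v5,v1,v4,v0
    }
}
```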
Post T&L Cache
The more valuable Cache is the one storing the transformed results from the Vertex Shader. This Cache is typically very small, holding only a few Entries (8 is the minimum, 12-24 is common). It only works with indexed primitives passed to GL.DrawElements, because with GL.DrawArrays the GPU cannot know which Vertices are actually identical.
While Pre T&L Cache optimization only operates on the Vertices, Post T&L optimization only operates on the Indices (Primitives). Typically the Post T&L optimization is performed first, and the Pre T&L sorting step is then performed on the optimized Index Array.
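To compare different index orderings, the Post T&L Cache can be simulated on the CPU. The sketch below assumes a strict First-In-First-Out Cache of a given size and an indexed Triangle list; real hardware may behave differently, and the names are invented for this example. The resulting number of transformed Vertices per Triangle is commonly called the average cache miss ratio (ACMR).

```csharp
using System;
using System.Collections.Generic;

static class PostTransformCache
{
    // Counts how many vertices must be transformed when drawing
    // 'indices' with a strict FIFO cache of 'cacheSize' entries.
    public static int CountMisses(int[] indices, int cacheSize)
    {
        if (cacheSize < 1)
            throw new ArgumentOutOfRangeException(nameof(cacheSize));

        var fifo = new Queue<int>(cacheSize);
        int misses = 0;
        foreach (int index in indices)
        {
            if (fifo.Contains(index)) continue; // hit: a FIFO is not reordered

            misses++;
            if (fifo.Count == cacheSize) fifo.Dequeue(); // evict the oldest entry
            fifo.Enqueue(index);
        }
        return misses;
    }

    // Transformed vertices per triangle; lower is better.
    public static double Acmr(int[] indices, int cacheSize)
    {
        return (double)CountMisses(indices, cacheSize) / (indices.Length / 3);
    }

    static void Main()
    {
        int[] quad = { 0, 1, 2, 0, 2, 3 };
        Console.WriteLine(CountMisses(quad, 8)); // 4

        // The same triangle drawn twice, with another one in between.
        int[] scattered = { 0, 1, 2, 3, 4, 5, 0, 1, 2 };
        Console.WriteLine(CountMisses(scattered, 3)); // 9: every vertex misses
        Console.WriteLine(CountMisses(scattered, 6)); // 6: the repeat hits
    }
}
```

A non-indexed draw corresponds to 3 transformed Vertices per Triangle, so any ordering that scores well below 3 in this simulation is benefiting from the Cache.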
Links for further reading
http://ati.amd.com/developer/i3d2006/I3D2006-Sander-TOO.pdf
http://www.cs.princeton.edu/gfx/pubs/Sander_2007_%3ETR/index.php
http://www.cs.umd.edu/Honors/reports/Vertex_Reordering_for_Cache_Coheren...
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
http://ati.amd.com/developer/tootle.html
http://developer.nvidia.com/object/vertex_cache_opt.html (ancient)
http://developer.nvidia.com/object/nvtristrip_library.html
http://www.clootie.ru/delphi/dxtools.html (DirectX based detector)
Useful quotes:
truncated quote from: http://developer.nvidia.com/object/devnews005.html
"When rendering using the hardware transform-and-lighting (TnL) pipeline or vertex-shaders, the GPU intermittently caches transformed and lit vertices. Storing these post-transform and lighting (post-TnL) vertices avoids recomputing the same values whenever a vertex is shared between multiple triangles and thus saves time. The post-TnL cache increases rendering performance by up to 2x. ...
...The post-TnL cache is a strict First-In-First-Out buffer, and varies in size from effectively 10 (actual 16) vertices on GeForce 256, GeForce 2, and GeForce 4 MX chipsets to effectively 18 (actual 24) on GeForce 3 and GeForce 4 Ti chipsets. Non-indexed draw-calls cannot take advantage of the cache, as it is then impossible for the GPU to know which vertices are shared. ...
...The mesh needs to be submitted in a single draw-call to optimize batch-size. The draw-call must be with an indexed primitive-type (see above), either strips or lists -- the performance difference between strips and lists is negligible when taking advantage of the post-TnL cache."
Last Update of the Links: January 2008