Chapter 8: Advanced Topics

This chapter discusses advanced topics on the interaction of .Net/Mono, OpenGL and OpenAL. It builds on the previous two chapters and assumes a good grasp of C#, OpenGL and OpenAL.

Vertex Cache Optimizations

Graphics cards usually have two caches designed to speed up vertex processing, one of their most frequent tasks.

Pre T&L Cache
This cache merely stores the untransformed vertices read from a VBO. Optimizing for it simply means sorting your vertices in order of first use, so that the IBO issues triangles in an order like (0,1,2,0,2,3) rather than (999,17,2044,999,2044,2). This cache is typically very large, being able to hold ~64k vertices on a GeForce 3 and up.

Post T&L Cache
The more valuable cache is the one that stores the transformed results from the vertex shader. It is typically very small, holding only a few entries (8 is the minimum, 12-24 is common). It only works with indexed primitives passed to GL.DrawElements, because GL.DrawArrays cannot make any assumptions about which vertices are actually identical.
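
To get a feel for how index order affects the Post T&L Cache, the following is a minimal sketch that simulates a strict FIFO cache (the behaviour described in the NVIDIA quote later in this section) and counts how many vertices would have to be transformed. The class and method names are hypothetical and not part of OpenTK, and the cache size is a parameter you have to guess for your target hardware.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of OpenTK: simulates a strict FIFO
// post-transform cache and counts how many vertices must be transformed.
static class VertexCacheSim
{
    public static int CountMisses(int[] indices, int cacheSize)
    {
        var fifo = new Queue<int>(cacheSize);
        var inCache = new HashSet<int>();
        int misses = 0;

        foreach (int index in indices)
        {
            if (inCache.Contains(index))
                continue; // cache hit: a FIFO does not reorder entries on a hit

            misses++;
            if (fifo.Count == cacheSize)
                inCache.Remove(fifo.Dequeue()); // evict the oldest entry
            fifo.Enqueue(index);
            inCache.Add(index);
        }
        return misses;
    }

    // Average cache miss ratio: transformed vertices per triangle
    // (assumes a triangle list, i.e. three indices per triangle).
    public static double Acmr(int[] indices, int cacheSize)
    {
        return (double)CountMisses(indices, cacheSize) / (indices.Length / 3);
    }
}
```

Comparing VertexCacheSim.Acmr before and after an index reordering pass gives a rough, hardware-independent indication of whether the reordering helped. Lower is better.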

While Pre T&L Cache optimization operates only on the vertices, Post T&L optimization operates only on the indices (primitives). Typically the Post T&L optimization is performed first, and the Pre T&L sorting step is then applied to the optimized index array.
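
The Pre T&L sorting step can be sketched as follows: walk the (already optimized) index array, move each vertex to the next free slot the first time it is referenced, and rewrite the indices to match. This is only an illustration under the assumption that vertices live in a plain array; the names are made up for this example.

```csharp
using System;

// Hypothetical helper, not part of OpenTK.
static class PreTnlSort
{
    // Returns a copy of 'vertices' sorted by first use in 'indices'.
    // 'indices' is rewritten in place to point at the new positions.
    public static T[] SortByFirstUse<T>(T[] vertices, int[] indices)
    {
        int[] remap = new int[vertices.Length];
        for (int i = 0; i < remap.Length; i++)
            remap[i] = -1; // -1 marks "not placed yet"

        T[] sorted = new T[vertices.Length];
        int next = 0;

        for (int i = 0; i < indices.Length; i++)
        {
            int old = indices[i];
            if (remap[old] < 0)
            {
                // First reference to this vertex: give it the next free slot
                remap[old] = next;
                sorted[next] = vertices[old];
                next++;
            }
            indices[i] = remap[old];
        }

        // Vertices that are never referenced are appended at the end
        for (int old = 0; old < vertices.Length; old++)
            if (remap[old] < 0)
                sorted[next++] = vertices[old];

        return sorted;
    }
}
```

After this pass the index array reads the vertex buffer front to back, which is the access pattern the Pre T&L Cache favours.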

Links for further reading

http://ati.amd.com/developer/i3d2006/I3D2006-Sander-TOO.pdf
http://www.cs.princeton.edu/gfx/pubs/Sander_2007_%3ETR/index.php
http://www.cs.umd.edu/Honors/reports/Vertex_Reordering_for_Cache_Coheren...
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
http://ati.amd.com/developer/tootle.html
http://developer.nvidia.com/object/vertex_cache_opt.html (ancient)
http://developer.nvidia.com/object/nvtristrip_library.html
http://www.clootie.ru/delphi/dxtools.html (DirectX based detector)

Useful quotes:
truncated quote from: http://developer.nvidia.com/object/devnews005.html

"When rendering using the hardware transform-and-lighting (TnL) pipeline or vertex-shaders, the GPU intermittently caches transformed and lit vertices. Storing these post-transform and lighting (post-TnL) vertices avoids recomputing the same values whenever a vertex is shared between multiple triangles and thus saves time. The post-TnL cache increases rendering performance by up to 2x. ...

...The post-TnL cache is a strict First-In-First-Out buffer, and varies in size from effectively 10 (actual 16) vertices on GeForce 256, GeForce 2, and GeForce 4 MX chipsets to effectively 18 (actual 24) on GeForce 3 and GeForce 4 Ti chipsets. Non-indexed draw-calls cannot take advantage of the cache, as it is then impossible for the GPU to know which vertices are shared. ...

...The mesh needs to be submitted in a single draw-call to optimize batch-size. The draw-call must be with an indexed primitive-type (see above), either strips or lists -- the performance difference between strips and lists is negligible when taking advantage of the post-TnL cache."

Last Update of the Links: January 2008

Garbage Collection Performance

The .Net Framework features a precise, generational and compacting Garbage Collector (GC): precise because it specifically traces only managed object pointers, generational because it distinguishes long-lived objects from temporary ones, and compacting because it moves data in memory to avoid leaving holes behind. The GC is a great tool in the .Net arsenal, not only because it increases productivity but also because it provides extremely fast memory allocations (compared to standard C/C++ malloc/new).

One of the challenges of working with garbage collection is collector pauses. If a pause lasts longer than 5-7ms, it can disrupt interactive updates. While pause times are highly dependent on the implementation, they are typically proportional to the number of managed pointers, because the collector has to trace (mark) the tenured heap. Note that even in the CLR 4.0 "concurrent" collector, there is still a pause for the tenured-generation mark.

There are two main strategies for dealing with garbage collection pauses: first, minimize the duration of each pause; second, minimize how often pauses occur.

To minimize the duration of the pause, minimize the total number of long-lived reachable managed pointers. It's important to note this is not the same as keeping the heap small. Large value-type arrays are not scanned by the GC if they contain no pointers. Likewise, raw byte[] data buffers are also not scanned by the GC. Placing large data chunks, such as vertex buffers, index buffers, textures, and other raw data, into value-type arrays which don't contain pointers can substantially reduce the number of GC-traced pointers.
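
As a small illustration of the difference, the sketch below declares the same vertex data once as a pointer-free struct and once as a class. An array of the struct is a single object with nothing inside for the GC to trace, while an array of the class holds one managed pointer per element. The type names are made up for this example, and the check relies on RuntimeHelpers.IsReferenceOrContainsReferences, which as far as I know is only available on newer runtimes (.NET Core 2.0 and later), so treat that part as an assumption about your target platform.

```csharp
using System;
using System.Runtime.CompilerServices;

// Pointer-free value type: an array of these is one object
// that contains no managed references.
struct VertexData
{
    public float X, Y, Z;
    public float U, V;
}

// Reference type: an array of these holds one managed pointer per element,
// each of which the GC has to trace.
class VertexObject
{
    public float X, Y, Z;
    public float U, V;
}

// Hypothetical helper for this example.
static class GcFriendlyStorage
{
    // True if T is a value type without any reference-type fields.
    public static bool IsPointerFree<T>()
    {
        return !RuntimeHelpers.IsReferenceOrContainsReferences<T>();
    }
}
```

A check such as GcFriendlyStorage.IsPointerFree<VertexData>() can be placed in a debug assertion to catch the case where someone later adds a string or object field to a struct that is meant to stay pointer-free.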

To minimize the frequency of the pause, minimize tenured churn. Tenured churn is when objects are frequently allocated, survive long enough to make it into the tenured generation, and then are released. Reclaiming the space from those objects requires the GC to trace the entire tenured generation, causing the pause. This can be avoided by avoiding heap allocated objects which live too long and then die, and in general by avoiding allocation when possible. Ideally all objects either die very fast, or live a very long time. Aside from minimizing allocation through stack-allocated types, churn can be minimized by using reusable object pools for medium-lifetime objects.
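
A reusable object pool for medium-lifetime objects can be as simple as the following sketch. It is not thread-safe and does not reset the state of returned objects, both of which a real pool would probably need; the Pool and Particle types are hypothetical names for this example.

```csharp
using System;
using System.Collections.Generic;

// Minimal single-threaded object pool (illustrative only).
sealed class Pool<T> where T : class, new()
{
    private readonly Stack<T> free = new Stack<T>();

    // Number of instances currently waiting to be reused.
    public int FreeCount { get { return free.Count; } }

    // Hands out a recycled instance if one is available,
    // otherwise allocates a new one.
    public T Rent()
    {
        return free.Count > 0 ? free.Pop() : new T();
    }

    // Gives an instance back to the pool instead of letting it die.
    public void Return(T item)
    {
        if (item == null) throw new ArgumentNullException("item");
        free.Push(item);
    }
}

// Example of a medium-lifetime object worth pooling.
sealed class Particle
{
    public float X, Y, Life;
}
```

Because pooled instances are never released, they eventually settle in the tenured generation and stay there, so they stop contributing to tenured churn. The price is that the caller must reinitialize each rented object.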

Other memory-related topics

The managed heap is not the only memory usage in the process. Buffers handed to OpenGL may be copied into the unmanaged resource pool. This creates memory usage outside of the managed heap.

Another important memory consideration is the OpenGL robustness configuration. Historically, OpenGL preserves all state and buffers in system-ram, even when those assets are sent to the video card, so it can restore those assets in the event that the application loses and regains control of the 3d hardware graphics context. For some applications, this can mean assets like textures and vertex buffers are stored three times, once in the application ram, once in OpenGL ram, and once in video-card ram.

As of OpenGL 3.2, the GL_ARB_robustness extension can be used to control OpenGL robustness, allowing OpenGL to forgo storing system-ram copies of resources. If resources are lost, the application will need to restore them. In OpenGL ES, only non-robust operation is allowed, mirroring the behavior of Direct3D.

[Describe the unmanaged resource pool, pinning and performance considerations]

GC & OpenGL (work in progress)

As discussed in the previous chapter, GC finalization occurs on the finalizer thread. This poses some problems on OpenGL resource deallocation, since the context used to create the resources is not available in the finalizer thread!

Since OpenGL functions cannot be called in finalizers, a different methodology must be followed. By implementing the disposable pattern, we can use the Dispose() method to deterministically destroy OpenGL resources in the main thread. By modifying the finalizer logic we can provide a way to flag resources as 'dead', and destroy them from the main thread. Last, by extending the concept of the OpenGL context, we can be notified of context destruction, to release all related resources.

The following code describes the implementation of the "OpenGL disposable pattern" in OpenTK, but it is easy to adapt this code to any managed OpenGL project:

// This code is out-of-date. Please do not use it!
 
// The OpenGL disposable pattern
class GraphicsResource: IDisposable
{
    int resource_handle;    // The OpenGL handle to the resource
    GraphicsContext context;      // The context which owns this resource
    bool disposed;          // True once the OpenGL resource has been destroyed
 
    public GraphicsResource()
    {
        // Obtain the current OpenGL context, and allocate the resource
        context = GraphicsContext.CurrentContext;
        if (context == null)
            throw new InvalidOperationException(String.Format(
                "No OpenGL context available in thread {0}.",
                System.Threading.Thread.CurrentThread.ManagedThreadId));
 
        resource_handle = [...];
 
        context.Destroy += ContextDisposed;
    }
 
    #region --- Disposable Pattern ---
 
    private void ContextDisposed(IGraphicsContext sender, EventArgs e)
    {
        context.Destroy -= ContextDisposed;
        // TODO: Shared resources shouldn't be destroyed here.
        Dispose();
    }
 
    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }
 
    // If the owning context is current then destroy the resource,
    // otherwise flag it (so it will be destroyed from the correct thread)..
    // TODO: Is the "manual" flag necessary? Simply checking for the
    // owning context should be enough.
    private void Dispose(bool manual)
    {
        if (!disposed)
        {
            if (!context.IsCurrent || !manual)
            {
                GC.KeepAlive(this);
                context.RegisterForDisposal(this);
            }
            else
            {
                // Destroy resource_handle through OpenGL
                disposed = true;
            }
        }
    }
 
    ~GraphicsResource()
    {
        Dispose(false);
    }
 
    #endregion
}

In OpenTK, each GraphicsContext class maintains a queue of OpenGL resources that need to be destroyed. Resources are added to this queue through the RegisterForDisposal() call, and they are destroyed through the DisposeResources() method. The whole process is deterministic: it is your responsibility to call DisposeResources at appropriate time intervals (or set up a timer event to do this for you).
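
The idea behind such a queue can be sketched independently of OpenTK. The class below is a simplified stand-in, not OpenTK's actual implementation: handles can be registered from any thread (including the finalizer thread), and they are only destroyed when the thread that owns the context drains the queue. The delete callback is an assumption of this example; in a real program it would wrap the matching OpenGL delete call.

```csharp
using System;
using System.Collections.Concurrent;

// Simplified stand-in for a per-context disposal queue (illustrative only).
sealed class DisposalQueue
{
    private readonly ConcurrentQueue<int> dead = new ConcurrentQueue<int>();
    private readonly Action<int> deleteHandle;

    // 'deleteHandle' is supplied by the caller and is expected to destroy
    // one OpenGL object; it is only ever invoked from DisposeResources().
    public DisposalQueue(Action<int> deleteHandle)
    {
        if (deleteHandle == null) throw new ArgumentNullException("deleteHandle");
        this.deleteHandle = deleteHandle;
    }

    // Safe to call from any thread, including the finalizer thread.
    public void Register(int handle)
    {
        dead.Enqueue(handle);
    }

    // Call from the thread on which the owning context is current.
    // Returns the number of handles that were destroyed.
    public int DisposeResources()
    {
        int count = 0;
        int handle;
        while (dead.TryDequeue(out handle))
        {
            deleteHandle(handle);
            count++;
        }
        return count;
    }
}
```

Calling DisposeResources() once per frame, for example at the end of the render loop, keeps the number of pending dead resources small without requiring a separate timer.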

Resource creation takes a small performance hit due to the call to GraphicsContext.CurrentContext, while garbage-collectable OpenGL resources consume slightly more memory (due to the reference to the GraphicsContext). Prefer calling the Dispose() method to destroy resources instead of relying on the GC, as finalizable resources are only collected on a generation 1 or 2 GC sweep.

The current implementation in OpenTK does not take shared contexts into account - this will be taken care of in the near future.

Uniform Buffer Objects (UBO) using the std140 layout specification

If we have information that we need to set for multiple programs, we can either set the uniform each time we use a new program:

// Global Variables
int programID;
int uniformLocation;
 
// Done after successful program linking
uniformLocation = GL.GetUniformLocation( programID, uniformName ); // Gets the uniform variable Location
 
// Done at render stage (GLControl Paint Event / GameWindow OnRenderFrame Event / Wherever your rendering is done using FBO, etc.)
GL.UseProgram( programID ); // Sets the current shader program
GL.Uniform4( uniformLocation, ref uniformVariable ); // Sets the uniform value for the programs use, in this case a Vector4

Or we could store the information in a UBO, point the shader programs at it, and use that.

The advantage shows when we have a lot of information (e.g. a list of lights, materials, etc.): the number of calls needed to set it on a per-program level can become enormous and generate a heavy amount of undesired overhead. One solution is to use Uniform Buffer Objects, which are set once per frame or once per load, depending on the use.

Here I will only cover the layout std140 specification defined in the OpenGL 3.3 Specification (Section 2.11.4, Pg71).

std140 specifies a layout which is implementation independent. The other layouts are implementation dependent and require you to query information and format your buffers accordingly. To get started, however, std140 will do fine. (Note: std140 defines a specific way to lay out the buffer; it may not necessarily be the best or most optimized way to use the buffer.)

Discussion of the std140 layout:

According to the specification (referenced above), the block alignment is set at 4N, where N is the size of a basic data type. Basic data types all fit into a single DWORD, and according to the specification they are bool, float, int and uint. So in essence the block aligns to 4 x sizeof(bool|float|int|uint), which is 16 bytes.

The shader variable alignment is as follows: (I'll only cover the basic floats here, but the principle applies all round)
vec4 - 4N
vec3 - 4N
vec2 - 2N
float - N

Best way to explain this is with a picture :)

If we have a Data Block of 8N

NNNNNNNN

The layout states, everything will work with the alignment of 4N, so we get this

NNNN
NNNN

Simply put, we have chunks of 4N to work with, and it is wise to fit our data into those chunks. Anything that would cross a chunk boundary is placed into a new chunk.

For example, if we have float values for N we can have:

1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f

which will break into chunks like this

1.0f, 2.0f, 3.0f, 4.0f, 
5.0f, 6.0f, 7.0f, 8.0f

Now to see how this fits into variables: I can have a float array in C#

float[] UBOData = { 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f };

and have a uniform block in my shader defined as

layout(std140) uniform UBOData {
	float value1;
	float value2;
	float value3;
	float value4;
	float value5;
	float value6;
	float value7;
	float value8;
};

This will place the array data into the relevant slots sequentially, as expected. However, this is certainly not very useful, so we use some proper variables, like so:

layout(std140) uniform UBOData {
	vec4 firstHalfValues;
	vec4 secondHalfValues;
};

More useful yes, but how do the values look in the vectors?

firstHalfValues looks like this (1.0f, 2.0f, 3.0f, 4.0f)
and
secondHalfValues looks like this (5.0f, 6.0f, 7.0f, 8.0f)

So where do the alignment and boundaries come into play? Well, if we change the uniform block to this:

layout(std140) uniform UBOData {
	vec3 firstValue;
	vec4 secondValue;
	float thirdValue;
};

Looking at this, we are still defining 8 floats (vec3 = 3, vec4 = 4, float = 1). One might expect the result to be this:

  • firstValue = (1.0f, 2.0f, 3.0f)
  • secondValue = (4.0f, 5.0f, 6.0f, 7.0f)
  • thirdValue = 8.0f

But it is not so, the actual values end up as such:

  • firstValue = (1.0f, 2.0f, 3.0f)
  • secondValue = (5.0f, 6.0f, 7.0f, 8.0f)
  • thirdValue = 0.0f

The first variable is as expected, but the second and third are not. The reason is the 4N alignment in the spec: if the next variable defined in a block cannot fit within the remainder of the current chunk, it is aligned to the start of the next chunk. (More precisely, every member starts at a multiple of its own alignment from the list above, so a vec3 or vec4 always starts on a chunk boundary.)

To show the calculation it goes something like this:

Start of block
firstValue has a size of 3 floats, chunk has 4 floats available, so there is 1 remainder in the chunk
secondValue has a size of 4 floats, chunk has 1 float available, so skip the remainder and start at the next chunk
thirdValue has a size of 1 float, chunk has 0 floats available, so move to the beginning of the next chunk
End of Block

As can be seen here, 3 chunks are used in total, 1 for each variable. Knowing this, we can correct the input array by padding it where a skip is expected, like so:

float[] UBOData = { 1.0f, 2.0f, 3.0f, 0.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f };

this will give the values as we expect them like this:

  • firstValue = (1.0f, 2.0f, 3.0f)
  • secondValue = (4.0f, 5.0f, 6.0f, 7.0f)
  • thirdValue = 8.0f

Alternately we could change the array as follows

float[] UBOData = { 1.0f, 2.0f, 3.0f, 8.0f, 4.0f, 5.0f, 6.0f, 7.0f };

and change the shaders Block definition to this:

layout(std140) uniform UBOData {
	vec3 firstValue;
	float thirdValue;
	vec4 secondValue;
};

Here no padding is needed: by changing the order of the data and of the block variables, we still get the desired result. The calculation for this is as follows:

Start of block
firstValue has a size of 3 floats, chunk has 4 floats available, so there is 1 remainder in the chunk
thirdValue has a size of 1 float, chunk has 1 float available, so fill the variable
secondValue has a size of 4 floats, chunk has 0 floats available, so move to the beginning of the next chunk
End of Block

As can be seen, this is now only using 2 chunks according to the rules.
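
The two calculations above can be reproduced with a few lines of C#. The sketch below only handles float, vec2, vec3 and vec4 members (given as component counts 1 to 4) and uses the alignments listed earlier; arrays, structs and matrices have additional rules in the specification that are not covered here. The Std140 class is a hypothetical helper for this example.

```csharp
using System;

// Hypothetical helper: computes std140 byte offsets for a sequence of
// float / vec2 / vec3 / vec4 members, given as component counts (1..4).
static class Std140
{
    public static int[] Offsets(params int[] componentCounts)
    {
        const int N = 4; // size of one basic data type in bytes
        int[] offsets = new int[componentCounts.Length];
        int cursor = 0;  // first free byte in the block

        for (int i = 0; i < componentCounts.Length; i++)
        {
            int c = componentCounts[i];
            if (c < 1 || c > 4)
                throw new ArgumentOutOfRangeException("componentCounts");

            // float aligns to N, vec2 to 2N, vec3 and vec4 to 4N
            int align = (c == 1 ? 1 : c == 2 ? 2 : 4) * N;

            // Round the cursor up to the next multiple of the alignment
            cursor = (cursor + align - 1) / align * align;
            offsets[i] = cursor;
            cursor += c * N;
        }
        return offsets;
    }
}
```

Std140.Offsets(3, 4, 1) describes the vec3, vec4, float block and places the float at byte 32 (the third chunk), while Std140.Offsets(3, 1, 4) describes the reordered block and fits everything into the first 32 bytes.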

A more structured approach

When filling uniform blocks, it is much more useful to use an approach which does not rely on float arrays as input data. We can use a struct in C# to define our data in a friendlier manner, and then match that struct in the shader.

For example our shader defines:

layout(std140) uniform UBOData {
	vec3 firstValue;
	float thirdValue;
	vec4 secondValue;
};

and our C# struct will look like this:

[Serializable]
[StructLayout(LayoutKind.Sequential)]
struct UBOData {
    public Vector3 firstValue;
    public float thirdValue;
    public Vector4 secondValue;
};

This allows us to load the UBO with the struct, knowing that it will match correctly. We can also let the C# struct and the uniform block differ in their member definitions, as long as the memory layout still matches.
For example, take a light structure where the uniform is expected to hold the light position/direction in the first 3 components of a vec4, the fourth component is 0 for a directional light or 1 for a point light, and a second vec4 holds the intensity.
This definition in the shader will look like this:

layout(std140) uniform Light {
	vec4 dirPosType;
	vec4 intensity;
};

and the matching C# struct would look like this:

[Serializable]
[StructLayout(LayoutKind.Sequential)]
struct Light {
    public Vector4 dirPosType;
    public Vector4 intensity;
};

This is all well and good, as the structure matches correctly. However, in the C# code you would need to remember what is what in the dirPosType variable. We could change the structure to look like this and still keep in line with what the shader expects:

[Serializable]
[StructLayout(LayoutKind.Sequential)]
struct Light {
    public Vector3 dirPos;
    public float type;
    public Vector4 intensity;
};

This makes it a bit more readable within the C# code.

Now off to some code :)

To get a uniform block going, a few steps need to be completed, namely:

  • Create and setup a Uniform Buffer Object
  • Bind the Uniform Buffer Object to a Buffer Index
  • Bind the Program Uniform Block to the Buffer Index

There are 3 locations which are used here

  1. UBO Location given by OpenGL when you generate one
  2. Uniform Block Location given by OpenGL when queried from a successfully linked shader program
  3. A USER supplied Binding Buffer Index

Now the Buffer Index is a number you pick from the range [0, maxUniformIndex). The maxUniformIndex can be retrieved from OpenGL, once a valid context exists, with the following command:

int maxUniformIndex;
GL.GetInteger(GetPName.MaxUniformBufferBindings, out maxUniformIndex);

The maximum is very implementation dependent; between my machines I have values of 24, 36 and 72.

Setting up a Buffer is done like this in the initialization of your code after the context exists:

// Global Variables
int BufferUBO; // Location for the UBO given by OpenGL
int BufferIndex = 0; // Index to use for the buffer binding (All good things start at 0 )
int UniformBlockLocation; // Uniform Block Location in the program given by OpenGL
 
Light UBOData;
 
void InitializeUniformBuffer() {
	GL.GenBuffers(1, out BufferUBO); // Generate the buffer
	GL.BindBuffer(BufferTarget.UniformBuffer, BufferUBO); // Bind the buffer for writing
	GL.BufferData(BufferTarget.UniformBuffer, (IntPtr)(sizeof(float) * 8), IntPtr.Zero, BufferUsageHint.DynamicDraw); // Request the memory to be allocated (no initial data)
 
	GL.BindBufferRange(BufferTarget.UniformBuffer, BufferIndex, BufferUBO, (IntPtr)0, (IntPtr)(sizeof(float) * 8)); // Bind the created Uniform Buffer to the Buffer Index
}

Note: In the above code the BufferUsageHint is DynamicDraw, which means we are planning to update the data occasionally. If you plan to update the data every frame, I would suggest changing the hint to StreamDraw.

Next, we link the Buffer Index to the Uniform Block of the shader program. This is done only once for each program, usually after creation:

UniformBlockLocation = GL.GetUniformBlockIndex(programID, "Light");
GL.UniformBlockBinding(programID, UniformBlockLocation, BufferIndex);

And then whenever we want to load the uniform blocks data we can fill it by calling a function like this:

void FillUniformBuffer() {
	GL.BindBuffer(BufferTarget.UniformBuffer, BufferUBO);
	GL.BufferSubData(BufferTarget.UniformBuffer, (IntPtr)0, (IntPtr)(sizeof(float) * 8), ref UBOData);
	GL.BindBuffer(BufferTarget.UniformBuffer, 0);
}

Admittedly this is not a particularly useful example. A more useful implementation would be an array of lights: updating a list of 3 or 4 lights for 20 programs would be time consuming without a UBO.

To create an array of lights, the shader changes slightly to the following:

const int Light_Count = 4;
 
struct LightInformation {
	vec4 dirPosType;
	vec4 intensity;
};
 
layout(std140) uniform Light {
	LightInformation Lights[Light_Count];
};

As you can see, a struct definition in the shader is similar to our C# one.

Our C# Struct remains the same, but the variable changes to this:

Light[] UBOData = new Light[4];

And all the buffer creation and filling sizes change to this:

sizeof(float) * 8 * UBOData.Length

Which is the Size of the Data, multiplied by the number of elements. So our code will change to this:

// Global Variables
int BufferUBO; // Location for the UBO given by OpenGL
int BufferIndex = 0; // Index to use for the buffer binding (All good things start at 0 )
int UniformBlockLocation; // Uniform Block Location in the program given by OpenGL
 
Light[] UBOData = new Light[4];
 
void InitializeUniformBuffer() {
	GL.GenBuffers(1, out BufferUBO); // Generate the buffer
	GL.BindBuffer(BufferTarget.UniformBuffer, BufferUBO); // Bind the buffer for writing
	GL.BufferData(BufferTarget.UniformBuffer, (IntPtr)(sizeof(float) * 8 * UBOData.Length), IntPtr.Zero, BufferUsageHint.DynamicDraw); // Request the memory to be allocated (no initial data)
 
	GL.BindBufferRange(BufferTarget.UniformBuffer, BufferIndex, BufferUBO, (IntPtr)0, (IntPtr)(sizeof(float) * 8 * UBOData.Length)); // Bind the created Uniform Buffer to the Buffer Index
}
 
void FillUniformBuffer() {
	GL.BindBuffer(BufferTarget.UniformBuffer, BufferUBO);
	GL.BufferSubData(BufferTarget.UniformBuffer, (IntPtr)0, (IntPtr)(sizeof(float) * 8 * UBOData.Length), UBOData);
	GL.BindBuffer(BufferTarget.UniformBuffer, 0);
}

A point to note is how our UBOData variable is passed: in the first example it was passed as a ref, but now that it is an array, it can no longer be passed as a ref.

In the shader, accessing a particular light's information is now done like this (to get the intensity of the second light):

Lights[1].intensity; // As always a Zero based Index, 1 is the Second Light in the array

If any new information is added to the struct, when it is an array, please bear in mind the Alignment of 4N, as the entire array can become useless if this rule is not obeyed.

For Example adding a light range in the shader:

struct LightInformation {
	vec4 dirPosType;
	vec4 intensity;
	float maxRange;
};

And in C#:

struct Light {
    public Vector3 dirPos;
    public float type;
    public Vector4 intensity;
    public float maxRange;
};

This is going to throw all of the array's values out of sync: at index 0 the information will be correct, but the rest are doomed. This is because the rules are applied to the shader's struct, and not to the C# struct.
To correct this we apply the rules ourselves. To recap, the shader will align the next element in the array to the base alignment, hence:

  • dirPosType is 4 floats of the first chunk
  • intensity is 4 floats of the second chunk
  • maxRange is one float of the third chunk leaving 3 remainder which OpenGL will skip and leave unused

For a total of 3 chunks of 4 floats ( total used space of 12 floats).

The C# struct has this

  • dirPos is 3 floats of the first chunk
  • type is 1 float for the remainder of the first chunk
  • intensity is 4 floats of the second chunk
  • maxRange is 1 float of the third chunk

Because the UBOData array is laid out contiguously, the dirPos of the next array element will land in the last 3 floats of the third chunk. This is not according to the rules, so to correct it we need to add the appropriate padding, like so:

struct Light {
    public Vector3 dirPos;
    public float type;
    public Vector4 intensity;
    public float maxRange;
    public float padTheSecondFloatOfTheThirdChunk;
    public float padTheThirdFloatOfTheThirdChunk;
    public float padTheFourthFloatOfTheThirdChunk;
};

Of course the sizeof statements in the code will need to be updated for a size of 12 floats, not 8 as it originally was.

Doing the struct like this will bring it back into alignment and everything will work correctly.

As can be seen above, padding brings us back into alignment, however we are wasting 3 floats of space for each element in the array.
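
Rather than hard-coding the element size as a number of floats, the size and field offsets of the padded struct can be checked from C# with the Marshal class. The sketch below uses stand-in Vec3 and Vec4 structs so that it is self-contained; in a real project these would be OpenTK's Vector3 and Vector4, which I assume have the same sequential float layout. PaddedLight and LightLayout are hypothetical names for this example.

```csharp
using System;
using System.Runtime.InteropServices;

// Stand-ins for OpenTK's Vector3 / Vector4 (assumed to be plain
// sequential floats), so the example compiles on its own.
[StructLayout(LayoutKind.Sequential)]
struct Vec3 { public float X, Y, Z; }

[StructLayout(LayoutKind.Sequential)]
struct Vec4 { public float X, Y, Z, W; }

// The padded light from the text: 3 chunks of 4 floats each.
[StructLayout(LayoutKind.Sequential)]
struct PaddedLight
{
    public Vec3 dirPos;
    public float type;
    public Vec4 intensity;
    public float maxRange;
    public float pad0, pad1, pad2; // fills the rest of the third chunk
}

static class LightLayout
{
    // Size in bytes of one array element as seen by the buffer upload.
    public static readonly int Stride = Marshal.SizeOf<PaddedLight>();

    // Byte offset of a named field inside PaddedLight.
    public static int OffsetOf(string field)
    {
        return (int)Marshal.OffsetOf<PaddedLight>(field);
    }

    // Total buffer size in bytes for an array of lights.
    public static int BufferSize(int lightCount)
    {
        return Stride * lightCount;
    }
}
```

Using a value like LightLayout.BufferSize(UBOData.Length) in place of the hand-written sizeof(float) * 12 * UBOData.Length expression means the buffer size follows the struct automatically when fields are added, and asserting that Stride is a multiple of 16 catches missing padding early.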

And that's it for the Uniform Buffer Object and the std140 specification.

Happy Coding.