Vertex Cache Optimizations

Graphic cards usually have 2 Caches designed to help processing Vertices, one of their favorite tasks.

Pre T&L Cache
This Cache merely stores the untransformed Vertex read from a VBO. Optimizations regarding this part of the Cache are simply sorting your Vertices in order of appearance, so the IBO issues Triangles in this order (0,1,2,0,2,3) rather then (999,17,2044,999,2044,2). This Cache is typically extremely large, being able to hold ~64k Vertices on a Geforce 3 and up.

Post T&L Cache
The more valuable Cache is the one storing the transformed results from the Vertex Shader, this Cache is typically very small (8 is minimum, 12-24 common) holding only very few Entries. It will only work with indexed primitives passed to GL.DrawElements, because GL.DrawArrays cannot make any assumptions which Vertices are actually identical.

While Pre-T&L Cache optimization only operates on the Vertices, Post T&L optimization will only operate on Indices (Primitives). Typically the Post T&L is calculated first, and the Pre T&L sorting step is performed on the optimized Indices Array.

Links for further reading

http://ati.amd.com/developer/i3d2006/I3D2006-Sander-TOO.pdf
http://www.cs.princeton.edu/gfx/pubs/Sander_2007_%3ETR/index.php
http://www.cs.umd.edu/Honors/reports/Vertex_Reordering_for_Cache_Coheren...
http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
http://ati.amd.com/developer/tootle.html
http://developer.nvidia.com/object/vertex_cache_opt.html (ancient)
http://developer.nvidia.com/object/nvtristrip_library.html
http://www.clootie.ru/delphi/dxtools.html (DirectX based detector)

Useful quotes:
truncated quote from: http://developer.nvidia.com/object/devnews005.html

"When rendering using the hardware transform-and-lighting (TnL) pipeline or vertex-shaders, the GPU intermittently caches transformed and lit vertices. Storing these post-transform and lighting (post-TnL) vertices avoids recomputing the same values whenever a vertex is shared between multiple triangles and thus saves time. The post-TnL cache increases rendering performance by up to 2x. ...

...The post-TnL cache is a strict First-In-First-Out buffer, and varies in size from effectively 10 (actual 16) vertices on GeForce 256, GeForce 2, and GeForce 4 MX chipsets to effectively 18 (actual 24) on GeForce 3 and GeForce 4 Ti chipsets. Non-indexed draw-calls cannot take advantage of the cache, as it is then impossible for the GPU to know which vertices are shared. ...

...The mesh needs to be submitted in a single draw-call to optimize batch-size. The draw-call must be with an indexed primitive-type (see above), either strips or lists -- the performance difference between strips and lists is negligible when taking advantage of the post-TnL cache."

Last Update of the Links: January 2008