...so how about crunching the framebuffer that will be sampled into a compressed texture format like BC1, and sampling from that instead?
The ARB_copy_image extension (core in OpenGL 4.3) makes it possible to do texture compression in realtime, completely on the GPU (PDF here). In short, the idea is to render into a uint16 framebuffer attachment (RGBA16UI; 8 bytes, the size of one BC1 block): encode the uncompressed source framebuffer into compressed 4x4-texel BC1 blocks in the fragment shader, then use GL.CopyImageSubData() to copy the result into a blank BC1-format texture and bind that as a sampler for future rendering, i.e. it is treated exactly as if the compressed BC1 texture had been loaded from a .dds file.
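As a CPU-side reference for what the encoder shader has to do per 4x4 block, here is a deliberately naive BC1 block encoder sketch in Python. The component-wise min/max endpoint selection is an assumption for brevity (real encoders fit the endpoints to the block's color line for better quality); the shader version would write the 8 result bytes as one RGBA16UI texel.

```python
# Minimal reference BC1 encoder for one 4x4 block (naive min/max endpoints).
# A real implementation would do this per block in the fragment shader and
# write the 8 result bytes as one RGBA16UI texel.

def rgb565(r, g, b):
    """Quantize an 8-bit RGB triple to a packed 16-bit 5:6:5 value."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack565(c):
    """Expand a packed 5:6:5 value back to 8-bit RGB."""
    return (((c >> 11) & 31) * 255 // 31,
            ((c >> 5) & 63) * 255 // 63,
            (c & 31) * 255 // 31)

def encode_bc1_block(texels):
    """texels: 16 (r, g, b) tuples, row-major 4x4. Returns the 8-byte block."""
    # Naive endpoints: component-wise min/max box of the block.
    lo = tuple(min(t[i] for t in texels) for i in range(3))
    hi = tuple(max(t[i] for t in texels) for i in range(3))
    c0, c1 = rgb565(*hi), rgb565(*lo)
    if c0 < c1:                     # c0 > c1 selects the opaque 4-color mode
        c0, c1 = c1, c0
    p0, p1 = unpack565(c0), unpack565(c1)
    palette = [p0, p1,
               tuple((2 * a + b) // 3 for a, b in zip(p0, p1)),
               tuple((a + 2 * b) // 3 for a, b in zip(p0, p1))]
    indices = 0
    for i, t in enumerate(texels):  # 2-bit index per texel, nearest palette entry
        best = min(range(4),
                   key=lambda k: sum((t[j] - palette[k][j]) ** 2 for j in range(3)))
        indices |= best << (2 * i)
    # Block layout: two little-endian uint16 endpoints + a uint32 of index bits.
    return (c0.to_bytes(2, 'little') + c1.to_bytes(2, 'little')
            + indices.to_bytes(4, 'little'))

print(encode_bc1_block([(255, 0, 0)] * 16).hex())  # -> 00f800f800000000
```

Better endpoint fitting (e.g. along the principal axis of the block's colors) improves quality at extra encode cost, which matters for the quality/speed tradeoff mentioned below.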
The advantage of compressed texture formats like BC1 is caches: graphics cards store the compressed BC1 blocks in L2 cache and only unpack texels into L1 cache (L1 size can be expected to be 4-8 KB (ryg)).
In the worst case you get a compression ratio of 2:1, but even that is twice as many texels in the same amount of memory. Compressed, the memory cost is 0.5 or 1 byte/texel.
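To put numbers on the cache argument, here is a quick back-of-the-envelope count of how many texels fit into one cache line (the 64-byte line size is an assumption; actual GPU line sizes vary):

```python
CACHE_LINE = 64  # bytes; actual GPU cache line size varies, 64 is an assumption

texels_rgba8 = CACHE_LINE // 4          # uncompressed RGBA8: 4 bytes/texel
texels_bc3   = (CACHE_LINE // 16) * 16  # BC3: 16-byte block covers 4x4 = 16 texels
texels_bc1   = (CACHE_LINE // 8) * 16   # BC1: 8-byte block covers 4x4 = 16 texels

print(texels_rgba8, texels_bc3, texels_bc1)  # -> 16 64 128
```

So one cache line of BC1 data covers 8x as many texels as RGBA8, which is where the L1/L2 win comes from.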
However, the compression is lossy, and compressing the framebuffer has a cost (0.57 ms for a 1024² texture in the paper), so this only makes sense for operations that sample the texture a lot. My question now is:
Has this already been tried on a per-frame basis for expensive sampling operations like SSAO? Or do people commonly encode impostors like this? My google-fu failed me here: there was some research into realtime texture compression about five years ago, but it seems it never really got adopted. Has anyone heard of this before, thinks it makes sense, or can see the flaw that would make it fail?
P.S. The BC1 color compression from the beginning can also be applied to the G-buffer in deferred shading: run the geometry pass on uncompressed attachments as usual, then compress to DXT/RGTC/whatever is available, and do the lighting from compressed samplers. Simple deferred shading setup, all FBO attachments 32 bit, 4 bytes/texel:
Geometry pass (formats that can be rendered to):
- depth: Depth24Stencil8
- color 0: RGBA8 - Material Albedo RGB, Material specular
- color 1: RG16f - Normal
convert to compressed formats for read-only:
- depth: to half-precision float, 2 bytes/texel (drops stencil, but accurate depth is needed to reconstruct position)
- color 0: to BC3 or BC7, 1 byte/texel
- color 1: to BC5, 1 byte/texel
uncompressed: 4+4+4 = 12 bytes/texel
compressed: 2+1+1 = 4 bytes/texel
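For the layout above, the per-frame savings work out as follows (the 1920x1080 resolution is an arbitrary example):

```python
w, h = 1920, 1080  # arbitrary example resolution

uncompressed = (4 + 4 + 4) * w * h  # Depth24Stencil8 + RGBA8 + RG16f
compressed   = (2 + 1 + 1) * w * h  # half-float depth + BC3/BC7 + BC5

print(f"{uncompressed / 2**20:.1f} MiB -> {compressed / 2**20:.1f} MiB")
# -> 23.7 MiB -> 7.9 MiB
```

That is a 3:1 reduction in the data every lighting-pass sample has to pull through the caches.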
->OpenCL/compute shaders/CUDA: would they offer any advantages here?
->How about rendering to RGB9E5 with this hack? (a nice Vector3 format in 4 bytes)
->Is DXT5 possible with uint32? (Yes: RGBA32UI, 16 bytes, the size of one BC3/DXT5 block)
->The whole technique adds just a simple optional step to the pipeline, i.e. you can always fall back to sampling the uncompressed source if compression benchmarks worse on the current platform or is not supported at all.
->Only the few encoder shaders have to be written; the hardware decodes the formats on its own:
Input sampler: Color
1 channel, R = BC4 (0.5 bytes/texel)
2 channels, RG = BC5 (1 byte/texel)
3 channels, RGB = BC1 (0.5 bytes/texel)
4 channels, RGBA = BC3 (1 byte/texel)
Input sampler: Depth
1 byte/texel: BC5 with depth = red*256 + green; range [0..2^16-1], but needs decoder logic
2 bytes/texel: half-precision float
My train of thought is to compress the depth buffer into one of the compressed texture formats and then sample from there instead of the full-precision depth buffer. Something like casting to half-precision float, or BC4, BC6/BC7, or the BC5 hack with depth = red*256 + green. Or would this be worse than the hardware's internal depth optimizations?
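The red*256 + green hack sketched in Python (function names and the 16-bit quantization are illustrative assumptions): pack on encode, and the sampling shader needs the matching decode. Note that BC5 compresses red and green independently and lossily, so a one-step error in red alone shifts the decoded depth by 256/65535, which may well be the flaw in this variant.

```python
# Pack a normalized depth value into two 8-bit channels (red, green) so it
# can be fed to a BC5 encoder, and decode it back as the sampling shader would.
# Names and the 16-bit quantization are illustrative assumptions.

def pack_depth16(d):
    """d in [0, 1): quantize to 16 bits, split into (red, green) bytes."""
    v = min(int(d * 65535.0 + 0.5), 65535)
    return v >> 8, v & 0xFF

def unpack_depth16(red, green):
    """The decoder logic the sampling shader needs: depth = red*256 + green."""
    return (red * 256 + green) / 65535.0

r, g = pack_depth16(0.5)
# Round-trip error is at most 0.5/65535 before BC5's own loss is added on top.
print(r, g, unpack_depth16(r, g))
```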
->It could make sense to split up the FBO attachments; e.g. four attachments with 2-3 channels each might compress better than two attachments with 4 channels each.
->Are there other common depth optimizations besides ATI's Hi-Z?
->Common deferred shading optimizations here, which do not list the above method.
->Likely the best results would come if AMD/Intel/NV picked up this idea and designed texture compression formats around it.