Bloom with Hardware Mipmap Generation
This post covers the bloom system I built for a custom DirectX 12 deferred rendering engine. The engine runs an Iris-compatible shader pipeline, and the bloom is part of the EnigmaDefault ShaderBundle targeting a Complementary Reimagined visual style.
The system generates a 7-level tile atlas by sampling hardware mipmaps of the HDR scene texture with a 7x7 Gaussian kernel, then composites the result back onto the scene before tonemapping. The mipmap chain is produced by a GPU compute shader that dispatches once per mip level, using RWTexture2D UAV access to avoid SRV/UAV resource state conflicts. The engine manages per-mip UAV descriptors as persistent members on D12Texture, eliminating per-frame allocation overhead entirely.
Rendering Pipeline Overview
The bloom system spans two composite passes and a compute mipmap generation step that runs between them. The mipmap generation is triggered automatically after each composite sub-pass for any render target that has colortexNMipmapEnabled = true declared in the ShaderBundle.
flowchart TD
subgraph Deferred["Deferred Pass"]
DL["Deferred Lighting, colortex0 = HDR Scene"]
end
subgraph Composite["Composite Passes"]
C1["Composite 1, Volumetric Light + Underwater"]
MIP["Compute: Mipmap Generation, 10 Dispatch per mip chain"]
C4["Composite 4, Bloom Tile Atlas Generation, 7x7 Gaussian x 7 LODs"]
C5["Composite 5, Bloom Application + Tonemapping"]
end
DL --> C1
C1 -->|"colortex0 written"| MIP
MIP -->|"colortex0 mip 0-10 ready"| C4
C4 -->|"colortex3 = tile atlas"| C5
C5 -->|"colortex0 = LDR output"| Final["Final Pass"]
Mipmap Generation System
The bloom algorithm requires downsampled versions of the HDR scene texture (colortex0) at multiple resolutions. Rather than manually downsampling in the bloom shader, the engine generates a full mipmap chain via a GPU compute shader. This provides pre-filtered data at each resolution level, improving both quality and cache coherence.
ShaderBundle Configuration
Mipmap generation is opt-in per render target. The ShaderBundle declares which textures need mipmaps through a directive in the shader source, following the Iris const bool convention:
// rt_formats.hlsl
const bool colortex0MipmapEnabled = true;
The engine’s PackRenderTargetDirectives parser scans all shader files during bundle loading and applies the setting to the RenderTargetConfig. When enabled, the render target is created with a full mip chain (MipLevels = floor(log2(max(width, height))) + 1). For a 1920x1080 texture, this produces 11 mip levels.
Compute Shader
Each mip level is generated by a single compute dispatch using an 8x8 thread group. The shader reads from the source mip via RWTexture2D load and writes to the destination mip via RWTexture2D store, implementing a manual 2x2 box filter:
[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
if (DTid.x >= g_dstWidth || DTid.y >= g_dstHeight)
return;
RWTexture2D<float4> srcMip = ResourceDescriptorHeap[g_srcTextureIndex];
RWTexture2D<float4> dstMip = ResourceDescriptorHeap[g_dstMipUavIndex];
uint2 srcCoord = DTid.xy * 2;
uint srcWidth = g_dstWidth * 2;
uint srcHeight = g_dstHeight * 2;
float4 s00 = srcMip[min(srcCoord + uint2(0, 0), uint2(srcWidth - 1, srcHeight - 1))];
float4 s10 = srcMip[min(srcCoord + uint2(1, 0), uint2(srcWidth - 1, srcHeight - 1))];
float4 s01 = srcMip[min(srcCoord + uint2(0, 1), uint2(srcWidth - 1, srcHeight - 1))];
float4 s11 = srcMip[min(srcCoord + uint2(1, 1), uint2(srcWidth - 1, srcHeight - 1))];
dstMip[DTid.xy] = (s00 + s10 + s01 + s11) * 0.25;
}
Both source and destination are accessed through UAV (RWTexture2D) rather than mixing SRV reads with UAV writes. This is a deliberate design choice: in DX12, transitioning the entire resource to UNORDERED_ACCESS state means SRV reads (Texture2D.SampleLevel) would fail because the resource is not in PIXEL_SHADER_RESOURCE state. Per-subresource barriers were attempted but did not resolve the conflict, since the SRV descriptor covers all mip levels and any single mip in UAV state invalidates the entire SRV read. The RWTexture2D approach sidesteps this entirely by keeping everything in UAV state.
D3D12RenderSystem::CreateUAV
The engine provides a generic static factory for UAV creation that encapsulates bindless index allocation and descriptor heap writes in a single call:
static uint32_t CreateUAV(
ID3D12Resource* resource,
const D3D12_UNORDERED_ACCESS_VIEW_DESC& desc
);
The caller builds the D3D12_UNORDERED_ACCESS_VIEW_DESC (format, dimension, mip slice) and receives a bindless index back. This keeps D3D12RenderSystem generic while letting each resource type (texture, buffer) construct its own descriptor. The same pattern applies to the existing TransitionResource and UAVBarrier helpers.
D12Texture Persistent UAV Indices
A key architectural decision was making per-mip UAV descriptors persistent members of D12Texture rather than allocating them per-frame in the MipmapGenerator.
The initial implementation allocated temporary UAV indices for each dispatch, freed them after the loop, and relied on the GPU reading them before they were overwritten. This failed because CreateUnorderedAccessView writes directly to the GPU-visible descriptor heap (CPU-side, immediate), but the GPU reads the descriptors later during command list execution. Freeing and reallocating an index within the same frame caused the descriptor to be overwritten before the GPU could read it.
The solution: D12Texture lazily creates per-mip UAV descriptors on the first GenerateMips() call and stores them permanently. The MipmapGenerator simply reads texture->GetMipUavIndex(mip) with zero allocation overhead. Cleanup happens in the destructor.
flowchart LR
subgraph D12Texture["D12Texture"]
MUI["m_mipUavIndices: [0] [1] [2] ... [10]"]
GM["GenerateMips()"]
CMUD["createMipUavDescriptors(), Lazy init, called once"]
FREE["~D12Texture(), freeMipUavDescriptors()"]
end
subgraph MipmapGenerator["MipmapGenerator (stateless)"]
LOOP["Per-mip dispatch loop"]
GET["texture->GetMipUavIndex(mip)"]
end
GM -->|"first call"| CMUD
CMUD -->|"D3D12RenderSystem::CreateUAV()"| MUI
GM --> LOOP
LOOP --> GET
GET --> MUI
Bloom Algorithm
The bloom effect uses a 7-level tile atlas approach ported from Complementary Reimagined. Instead of the traditional ping-pong downsampling/upsampling chain, all 7 LOD levels are generated in a single pass and packed into a tile atlas stored in colortex3 (RGBA8).
Tile Atlas Layout
Each LOD level occupies a non-overlapping region of the atlas in normalized UV space. The layout is designed for 1920x1080 and automatically scales for lower resolutions via GetBloomRescale():
+---------------------------------------+
| LOD 2 (1/4 res) |
| offset: (0.0, 0.0) |
+-------------------+-------------------+
| LOD 3 (1/8) | LOD 4 (1/16) |
| (0.0, 0.26) | (0.135, 0.26) |
+----------+--------+--------+----------+
| LOD 5 | LOD 6 | LOD 7 | LOD 8 |
| (1/32) | (1/64) | (1/128)| (1/256) |
+----------+--------+--------+----------+
Gaussian Blur with Hardware Mipmaps
The BloomTile() function samples colortex0 at the matching hardware mip level using a 7x7 Gaussian kernel (Pascal’s triangle row 6, weights: 1, 6, 15, 20, 15, 6, 1). The hardware mipmap provides pre-filtered downsampling, and the Gaussian kernel adds additional spatial blur on top:
float3 BloomTile(float lod, float2 offset, float2 scaledCoord)
{
float scale = exp2(lod);
// ... tile boundary check ...
for (int i = -3; i <= 3; i++)
for (int j = -3; j <= 3; j++)
{
float wg = bloomWeight[i + 3] * bloomWeight[j + 3];
float2 bloomCoord = (scaledCoord - offset + float2(i,j) * pixelSize) * scale;
bloom += colortex0.SampleLevel(sampler0, bloomCoord, lod).rgb * wg;
}
bloom /= 4096.0; // 64 * 64
return pow(max(bloom / 128.0, 0.0), 0.25); // Gamma encode
}
At mip level lod, each texel covers exp2(lod) original pixels. The kernel offsets of 1 screen pixel multiplied by scale equal exactly 1 mip texel, giving a proper 7-texel Gaussian blur at each resolution. This combination produces wider effective blur than either technique alone.
Gamma Encoding for HDR in RGBA8
The bloom atlas (colortex3) uses RGBA8 format to save bandwidth, but the scene data is HDR. A gamma encoding scheme preserves the dynamic range:
| Stage | Formula | Purpose |
|---|---|---|
| Encode (composite4) | pow(x / 128.0, 0.25) | Compress HDR to [0,1] for RGBA8 |
| Decode (composite5) | x^4 * 128.0 | Restore original HDR range |
The 4th-root encoding allocates more precision to darker values where the eye is most sensitive, while the /128 normalization handles typical HDR scene brightness.
Bloom Pass Pipeline
Composite 4: Bloom Generation
This pass reads the HDR scene from colortex0 (with hardware mipmaps) and writes the 7-LOD tile atlas to colortex3. Higher LOD levels receive reduced weights to prevent over-blur at extreme downsampling:
blur += BloomTile(2.0, float2(0.0, 0.0), scaledCoord); // 1.0x
blur += BloomTile(3.0, float2(0.0, 0.26), scaledCoord); // 1.0x
blur += BloomTile(4.0, float2(0.135, 0.26), scaledCoord); // 1.0x
blur += BloomTile(5.0, float2(0.2075, 0.26), scaledCoord) * 0.8; // reduced
blur += BloomTile(6.0, float2(0.135, 0.3325), scaledCoord) * 0.8;
blur += BloomTile(7.0, float2(0.160625, 0.3325), scaledCoord) * 0.6;
blur += BloomTile(8.0, float2(0.1784375,0.3325), scaledCoord) * 0.4;
Composite 5: Bloom Application + Tonemapping
This pass reads both the HDR scene (colortex0) and the bloom atlas (colortex3), applies bloom before tonemapping, then outputs the final LDR result:
- Read HDR scene color
DoBloom()averages all 7 LOD tiles (* 0.14) and blends withlerp(color, blur, bloomStrength)- Lottes 2016 tonemapping (HDR to LDR)
- Color saturation and vibrance adjustment
- Output to
colortex0(LDR)
The bloom strength includes a darkness boost (+ 0.2 * darknessFactor) that intensifies the glow in dark scenes, matching how the human eye perceives bloom more strongly in low-light conditions.
Bloom Configuration
All bloom parameters are exposed in settings.hlsl as compile-time defines with slider ranges, following the Iris/OptiFine shader options convention:
| Parameter | Default | Range | Purpose |
|---|---|---|---|
BLOOM_ENABLED | 1 | -1, 1 | Enable or disable bloom entirely |
BLOOM_STRENGTH | 0.52 | 0.027 to 10.00 | Bloom intensity (lerp factor) |
The colortex0MipmapEnabled directive in rt_formats.hlsl controls whether the engine generates hardware mipmaps for the scene texture. Disabling it falls back to mip-0-only sampling, which still produces bloom but with lower quality downsampling.
Final Results
Design Philosophy
Across the bloom and mipmap generation systems, several principles guided the architecture:
Texture Owns Its Resources — Per-mip UAV descriptors live on D12Texture as persistent members, created lazily on first use and freed in the destructor. The MipmapGenerator remains a stateless utility with zero allocation overhead per frame. This follows the same ownership pattern as the existing SRV bindless index on D12Resource.
Generic Engine API, Specific Usage — D3D12RenderSystem::CreateUAV accepts a raw D3D12_UNORDERED_ACCESS_VIEW_DESC without assuming texture dimensions or mip slices. D3D12RenderSystem::UAVBarrier wraps the synchronization primitive. Each resource type builds its own descriptor and calls the generic API. No texture-specific knowledge leaks into the render system layer.
ShaderBundle-Driven Configuration — Mipmap generation is opt-in per render target through a const bool colortexNMipmapEnabled directive in the shader source. The engine parses this during bundle loading and configures the render target accordingly. Artists and shader developers control the feature without touching C++ code.
Compute Over Copy — Mipmap generation uses a dedicated compute shader rather than GenerateMips or CPU-side downsampling. The SM6.6 bindless RWTexture2D approach avoids the SRV/UAV state conflict that plagues mixed-access patterns in DX12, and the box filter produces identical results to hardware bilinear sampling for power-of-two downsampling.
Looking For More?
Check out some of our other blogs if you haven't already!