Bloom with Hardware Mipmap Generation

This post covers the bloom system I built for a custom DirectX 12 deferred rendering engine. The engine runs an Iris-compatible shader pipeline, and the bloom is part of the EnigmaDefault ShaderBundle targeting a Complementary Reimagined visual style.

The system generates a 7-level tile atlas by sampling hardware mipmaps of the HDR scene texture with a 7x7 Gaussian kernel, then composites the result back onto the scene before tonemapping. The mipmap chain is produced by a GPU compute shader that dispatches once per mip level, using RWTexture2D UAV access to avoid SRV/UAV resource state conflicts. The engine manages per-mip UAV descriptors as persistent members on D12Texture, eliminating per-frame allocation overhead entirely.

Rendering Pipeline Overview

The bloom system spans two composite passes and a compute mipmap generation step that runs between them. The mipmap generation is triggered automatically after each composite sub-pass for any render target that has colortexNMipmapEnabled = true declared in the ShaderBundle.

flowchart TD
    subgraph Deferred["Deferred Pass"]
        DL["Deferred Lighting, colortex0 = HDR Scene"]
    end

    subgraph Composite["Composite Passes"]
        C1["Composite 1, Volumetric Light + Underwater"]
        MIP["Compute: Mipmap Generation, 10 Dispatches per mip chain"]
        C4["Composite 4, Bloom Tile Atlas Generation, 7x7 Gaussian x 7 LODs"]
        C5["Composite 5, Bloom Application + Tonemapping"]
    end

    DL --> C1
    C1 -->|"colortex0 written"| MIP
    MIP -->|"colortex0 mip 0-10 ready"| C4
    C4 -->|"colortex3 = tile atlas"| C5
    C5 -->|"colortex0 = LDR output"| Final["Final Pass"]

Mipmap Generation System

The bloom algorithm requires downsampled versions of the HDR scene texture (colortex0) at multiple resolutions. Rather than manually downsampling in the bloom shader, the engine generates a full mipmap chain via a GPU compute shader. This provides pre-filtered data at each resolution level, improving both quality and cache coherence.

ShaderBundle Configuration

Mipmap generation is opt-in per render target. The ShaderBundle declares which textures need mipmaps through a directive in the shader source, following the Iris const bool convention:

// rt_formats.hlsl
const bool colortex0MipmapEnabled = true;

The engine’s PackRenderTargetDirectives parser scans all shader files during bundle loading and applies the setting to the RenderTargetConfig. When enabled, the render target is created with a full mip chain (MipLevels = floor(log2(max(width, height))) + 1). For a 1920x1080 texture, this produces 11 mip levels.
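
As a rough illustration of the two steps described above, here is a minimal C++ sketch that scans shader source for the directive and computes the mip chain length. The names ParseMipmapDirectives and ComputeMipLevels are hypothetical and are not the engine's PackRenderTargetDirectives implementation; the regex is also an assumption and does not skip directives inside comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <regex>
#include <set>
#include <string>

// Hypothetical helper: collects the index N of every
// "const bool colortexNMipmapEnabled = true;" directive found in the source.
std::set<int> ParseMipmapDirectives(const std::string& source)
{
    static const std::regex directive(
        R"(const\s+bool\s+colortex(\d+)MipmapEnabled\s*=\s*(true|false)\s*;)");

    std::set<int> enabled;
    for (std::sregex_iterator it(source.begin(), source.end(), directive), end; it != end; ++it)
    {
        const int index = std::stoi((*it)[1].str());
        if ((*it)[2].str() == "true")
            enabled.insert(index);
        else
            enabled.erase(index); // in this sketch a later "false" overrides an earlier "true"
    }
    return enabled;
}

// MipLevels = floor(log2(max(width, height))) + 1, computed with integer shifts.
uint32_t ComputeMipLevels(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    uint32_t largest = std::max(width, height);
    uint32_t levels  = 1;
    while (largest > 1)
    {
        largest >>= 1;
        ++levels;
    }
    return levels;
}
```

Calling ComputeMipLevels(1920, 1080) gives 11, matching the mip 0 through mip 10 range used throughout this post.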

Compute Shader

Each mip level is generated by a single compute dispatch using an 8x8 thread group. The shader reads from the source mip via RWTexture2D load and writes to the destination mip via RWTexture2D store, implementing a manual 2x2 box filter:

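[numthreads(8, 8, 1)]
void main(uint3 DTid : SV_DispatchThreadID)
{
    // Discard threads that fall outside the destination mip (partial edge groups).
    if (DTid.x >= g_dstWidth || DTid.y >= g_dstHeight)
        return;

    // SM6.6 dynamic resources: both mips are fetched as UAVs by bindless index.
    RWTexture2D<float4> srcMip = ResourceDescriptorHeap[g_srcTextureIndex];
    RWTexture2D<float4> dstMip = ResourceDescriptorHeap[g_dstMipUavIndex];

    // Top-left texel of the 2x2 source footprint for this destination texel.
    uint2 srcCoord = DTid.xy * 2;
    // Source extent derived from the destination size (exact when the source dimensions are even).
    uint srcWidth  = g_dstWidth * 2;
    uint srcHeight = g_dstHeight * 2;

    // Load the four source texels, clamping each tap to the derived extent.
    float4 s00 = srcMip[min(srcCoord + uint2(0, 0), uint2(srcWidth - 1, srcHeight - 1))];
    float4 s10 = srcMip[min(srcCoord + uint2(1, 0), uint2(srcWidth - 1, srcHeight - 1))];
    float4 s01 = srcMip[min(srcCoord + uint2(0, 1), uint2(srcWidth - 1, srcHeight - 1))];
    float4 s11 = srcMip[min(srcCoord + uint2(1, 1), uint2(srcWidth - 1, srcHeight - 1))];

    // Equal-weight average of the four taps (2x2 box filter).
    dstMip[DTid.xy] = (s00 + s10 + s01 + s11) * 0.25;
}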

Both source and destination are accessed through UAV (RWTexture2D) rather than mixing SRV reads with UAV writes. This is a deliberate design choice: in DX12, once the whole resource has been transitioned to the UNORDERED_ACCESS state, SRV reads (Texture2D.SampleLevel) are no longer valid, because a compute shader can only read an SRV whose subresources are in the NON_PIXEL_SHADER_RESOURCE state. Per-subresource barriers were attempted but did not resolve the conflict, since the SRV descriptor covered all mip levels, so a single mip in UAV state made the whole SRV read invalid. The RWTexture2D approach sidesteps this entirely by keeping everything in UAV state.
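
To make the one-dispatch-per-mip loop concrete, the following C++ sketch computes the destination size and thread group counts for each dispatch, assuming the 8x8 group size from the shader above. MipDispatch and BuildMipDispatchPlan are hypothetical names for illustration, not part of the engine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct MipDispatch
{
    uint32_t mip;     // destination mip level
    uint32_t width;   // destination width in texels
    uint32_t height;  // destination height in texels
    uint32_t groupsX; // thread groups along X for numthreads(8, 8, 1)
    uint32_t groupsY; // thread groups along Y
};

// Builds one entry per destination mip, from mip 1 down to the 1x1 level.
std::vector<MipDispatch> BuildMipDispatchPlan(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    constexpr uint32_t kGroupSize = 8;

    std::vector<MipDispatch> plan;
    for (uint32_t mip = 1; width > 1 || height > 1; ++mip)
    {
        // Each level halves the previous one, never dropping below 1 texel.
        width  = std::max(width / 2, 1u);
        height = std::max(height / 2, 1u);

        // Round up so partial groups still cover the edge texels.
        const uint32_t groupsX = (width + kGroupSize - 1) / kGroupSize;
        const uint32_t groupsY = (height + kGroupSize - 1) / kGroupSize;
        plan.push_back({mip, width, height, groupsX, groupsY});
    }
    return plan;
}
```

For a 1920x1080 target this yields 10 entries, starting at 960x540 (120x68 groups) and ending at 1x1 (a single group). Since each dispatch reads what the previous one wrote, a UAV barrier between consecutive dispatches would be needed in a real command list.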

RenderDoc Event Browser showing 4 compute passes (one after each composite sub-pass), each with 10 dispatches generating mip levels 1 through 10

RenderDoc stepping through the 10 compute dispatches of a single mipmap generation pass, showing progressive downsampling from 960x540 (mip 1) down to 1x1 (mip 10)

D3D12RenderSystem::CreateUAV

The engine provides a generic static factory for UAV creation that encapsulates bindless index allocation and descriptor heap writes in a single call:

static uint32_t CreateUAV(
    ID3D12Resource* resource,
    const D3D12_UNORDERED_ACCESS_VIEW_DESC& desc
);

The caller builds the D3D12_UNORDERED_ACCESS_VIEW_DESC (format, dimension, mip slice) and receives a bindless index back. This keeps D3D12RenderSystem generic while letting each resource type (texture, buffer) construct its own descriptor. The same pattern applies to the existing TransitionResource and UAVBarrier helpers.

D12Texture Persistent UAV Indices

A key architectural decision was making per-mip UAV descriptors persistent members of D12Texture rather than allocating them per-frame in the MipmapGenerator.

The initial implementation allocated temporary UAV indices for each dispatch, freed them after the loop, and relied on the GPU reading them before they were overwritten. This failed because CreateUnorderedAccessView writes directly to the GPU-visible descriptor heap (CPU-side, immediate), but the GPU reads the descriptors later during command list execution. Freeing and reallocating an index within the same frame caused the descriptor to be overwritten before the GPU could read it.
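
The failure mode can be reproduced in a toy model. The sketch below is not engine code: ToyDescriptorHeap and TagsSeenByFirstPass are hypothetical, the heap stores plain integer tags instead of real descriptors, and the LIFO free list is an assumption about how an index allocator might recycle slots. It only illustrates the timing problem: writes land immediately, reads happen later.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy descriptor heap: each slot holds an integer tag standing in for a UAV descriptor.
class ToyDescriptorHeap
{
public:
    explicit ToyDescriptorHeap(uint32_t slotCount) : m_slots(slotCount, -1)
    {
        // Fill the free list so that slot 0 is handed out first.
        for (uint32_t i = slotCount; i > 0; --i)
            m_freeList.push_back(i - 1);
    }

    // Models CreateUnorderedAccessView: the slot is written immediately on the CPU.
    uint32_t Allocate(int descriptorTag)
    {
        assert(!m_freeList.empty());
        const uint32_t index = m_freeList.back();
        m_freeList.pop_back();
        m_slots[index] = descriptorTag;
        return index;
    }

    void Free(uint32_t index) { m_freeList.push_back(index); }

    int Read(uint32_t index) const { return m_slots[index]; }

private:
    std::vector<int>      m_slots;
    std::vector<uint32_t> m_freeList;
};

// Records two mip generation passes, then "executes" the first one.
// Tags are pass * 100 + mip, so the first pass expects to see 1, 2, 3, ...
std::vector<int> TagsSeenByFirstPass(uint32_t mipCount, bool freeAfterLoop)
{
    ToyDescriptorHeap heap(mipCount * 2);
    std::vector<uint32_t> firstPassIndices;

    for (int pass = 0; pass < 2; ++pass)
    {
        std::vector<uint32_t> indices;
        for (uint32_t mip = 1; mip <= mipCount; ++mip)
            indices.push_back(heap.Allocate(pass * 100 + static_cast<int>(mip)));

        if (pass == 0)
            firstPassIndices = indices;

        if (freeAfterLoop)
            for (uint32_t index : indices)
                heap.Free(index);
    }

    // Deferred read: the recorded indices are resolved only now, after both passes were recorded.
    std::vector<int> seen;
    for (uint32_t index : firstPassIndices)
        seen.push_back(heap.Read(index));
    return seen;
}
```

With freeAfterLoop set to true, the second pass recycles the first pass's slots, so the first pass ends up reading the second pass's tags. With it set to false, every recorded index still points at the descriptor that was written for it.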

The solution: D12Texture lazily creates per-mip UAV descriptors on the first GenerateMips() call and stores them permanently. The MipmapGenerator simply reads texture->GetMipUavIndex(mip) with zero allocation overhead. Cleanup happens in the destructor.

flowchart LR
    subgraph D12Texture["D12Texture"]
        MUI["m_mipUavIndices: [0] [1] [2] ... [10]"]
        GM["GenerateMips()"]
        CMUD["createMipUavDescriptors(), Lazy init, called once"]
        FREE["~D12Texture(), freeMipUavDescriptors()"]
    end

    subgraph MipmapGenerator["MipmapGenerator (stateless)"]
        LOOP["Per-mip dispatch loop"]
        GET["texture->GetMipUavIndex(mip)"]
    end

    GM -->|"first call"| CMUD
    CMUD -->|"D3D12RenderSystem::CreateUAV()"| MUI
    GM --> LOOP
    LOOP --> GET
    GET --> MUI
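
The ownership pattern in the diagram can be sketched in plain C++ as follows. This is a simplified stand-in, not the D12Texture source: MipUavCache is a hypothetical class, and the allocate and release callbacks take the place of D3D12RenderSystem::CreateUAV and the matching free call so the example stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-mip UAV bookkeeping owned by a texture.
class MipUavCache
{
public:
    using AllocFn   = std::function<uint32_t(uint32_t mip)>;
    using ReleaseFn = std::function<void(uint32_t index)>;

    MipUavCache(uint32_t mipLevels, AllocFn alloc, ReleaseFn release)
        : m_mipLevels(mipLevels), m_alloc(std::move(alloc)), m_release(std::move(release))
    {
    }

    // The cache owns its indices, so copying would free them twice.
    MipUavCache(const MipUavCache&)            = delete;
    MipUavCache& operator=(const MipUavCache&) = delete;

    // Indices are released exactly once, when the owner is destroyed.
    ~MipUavCache()
    {
        for (uint32_t index : m_indices)
            m_release(index);
    }

    // Lazy init: all descriptors are created on the first request and then reused.
    uint32_t GetMipUavIndex(uint32_t mip)
    {
        assert(mip < m_mipLevels);
        if (m_indices.empty())
        {
            m_indices.reserve(m_mipLevels);
            for (uint32_t i = 0; i < m_mipLevels; ++i)
                m_indices.push_back(m_alloc(i));
        }
        return m_indices[mip];
    }

private:
    uint32_t              m_mipLevels;
    AllocFn               m_alloc;
    ReleaseFn             m_release;
    std::vector<uint32_t> m_indices;
};
```

After the first lookup, every later call is a plain vector read, which is where the zero per-frame allocation cost comes from.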

Bloom Algorithm

The bloom effect uses a 7-level tile atlas approach ported from Complementary Reimagined. Instead of the traditional ping-pong downsampling/upsampling chain, all 7 LOD levels are generated in a single pass and packed into a tile atlas stored in colortex3 (RGBA8).

Tile Atlas Layout

Each LOD level occupies a non-overlapping region of the atlas in normalized UV space. The layout is designed for 1920x1080 and automatically scales for lower resolutions via GetBloomRescale():

+---------------------------------------------+
| LOD 2 (1/4 res)                             |
| offset: (0.0, 0.0)                          |
+---------------+--------------+--------------+
| LOD 3 (1/8)   | LOD 4 (1/16) | LOD 5 (1/32) |
| (0.0, 0.26)   | (0.135,0.26) | (0.2075,0.26)|
|               +--------+-----+--+-----------+
|               | LOD 6  | LOD 7  | LOD 8     |
|               | (1/64) | (1/128)| (1/256)   |
+---------------+--------+--------+-----------+

Gaussian Blur with Hardware Mipmaps

The BloomTile() function samples colortex0 at the matching hardware mip level using a 7x7 Gaussian kernel (Pascal’s triangle row 6, weights: 1, 6, 15, 20, 15, 6, 1). The hardware mipmap provides pre-filtered downsampling, and the Gaussian kernel adds additional spatial blur on top:

float3 BloomTile(float lod, float2 offset, float2 scaledCoord)
{
    float3 bloom = float3(0.0, 0.0, 0.0); // accumulator for the weighted taps
    float  scale = exp2(lod);             // screen pixels per mip texel, per axis
    // ... tile boundary check ...

    // 7x7 taps; bloomWeight holds the row-6 binomial weights {1, 6, 15, 20, 15, 6, 1}
    for (int i = -3; i <= 3; i++)
        for (int j = -3; j <= 3; j++)
        {
            float  wg         = bloomWeight[i + 3] * bloomWeight[j + 3];
            float2 bloomCoord = (scaledCoord - offset + float2(i,j) * pixelSize) * scale;
            bloom += colortex0.SampleLevel(sampler0, bloomCoord, lod).rgb * wg;
        }
    bloom /= 4096.0; // 64 * 64, the sum of all 49 weight products

    return pow(max(bloom / 128.0, 0.0), 0.25); // Gamma encode
}

At mip level lod, each texel covers exp2(lod) original pixels along each axis. A kernel offset of 1 screen pixel multiplied by scale therefore equals exactly 1 mip texel, giving a proper 7-texel Gaussian blur at each resolution. Because the 7 taps sit on pre-filtered texels, they span roughly 7 * exp2(lod) screen pixels per axis, a wider effective blur than either the mip chain or the kernel would produce alone.
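
The kernel arithmetic is easy to check on the CPU. The sketch below uses hypothetical helper names: it rebuilds the row-6 binomial weights, confirms the 4096 normalization constant, and computes the per-axis footprint of the 7 taps in screen pixels for an integer LOD.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Row 6 of Pascal's triangle: C(6, k) for k = 0..6.
std::array<uint32_t, 7> BinomialRow6()
{
    std::array<uint32_t, 7> row{};
    row[0] = 1;
    for (uint32_t k = 1; k <= 6; ++k)
        row[k] = row[k - 1] * (6 - k + 1) / k; // C(6,k) = C(6,k-1) * (6-k+1) / k
    return row;
}

// Sum of all 49 weight products w[i] * w[j], which is the value bloom is divided by.
uint32_t SeparableWeightSum(const std::array<uint32_t, 7>& weights)
{
    uint32_t sum = 0;
    for (uint32_t wi : weights)
        for (uint32_t wj : weights)
            sum += wi * wj;
    return sum;
}

// Seven taps spaced one mip texel apart, each texel covering 2^lod screen pixels per axis.
uint32_t KernelFootprintInPixels(uint32_t lod)
{
    assert(lod < 28); // keep the shift well inside 32 bits
    return 7u << lod;
}
```

At LOD 2 the kernel spans 28 screen pixels per axis, and at LOD 8 it spans 1792, which is why the coarsest tiles contribute the widest glow.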

Gamma Encoding for HDR in RGBA8

The bloom atlas (colortex3) uses RGBA8 format to save bandwidth, but the scene data is HDR. A gamma encoding scheme preserves the dynamic range:

Stage	Formula	Purpose
Encode (composite4)	`pow(x / 128.0, 0.25)`	Compress HDR to [0,1] for RGBA8
Decode (composite5)	`x^4 * 128.0`	Restore original HDR range

The 4th-root encoding allocates more precision to darker values, where the eye is most sensitive. The /128 normalization sets the ceiling: scene values up to 128 map into [0,1], and anything brighter clips when written to RGBA8.
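
A CPU-side round trip shows how the encoding behaves once 8-bit quantization is included. This is a sketch for one channel with hypothetical function names; the rounding step is an assumption about how an RGBA8 UNORM write quantizes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Encode: pow(x / 128, 0.25), then quantize to one 8-bit UNORM channel.
uint8_t EncodeBloom(float hdr)
{
    float encoded = std::pow(std::max(hdr / 128.0f, 0.0f), 0.25f);
    encoded = std::min(encoded, 1.0f); // values above 128 clip
    return static_cast<uint8_t>(std::lround(encoded * 255.0f));
}

// Decode: x^4 * 128 on the stored 8-bit value.
float DecodeBloom(uint8_t stored)
{
    const float x = static_cast<float>(stored) / 255.0f;
    return x * x * x * x * 128.0f;
}
```

An input of 1.0 is stored as code 76 and decodes to roughly 1.01, about a 1% error. Near the top of the range a single code step is worth about 2.0 HDR units, which shows how the precision is concentrated in the darker values.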

Bloom Pass Pipeline

Composite 4: Bloom Generation

This pass reads the HDR scene from colortex0 (with hardware mipmaps) and writes the 7-LOD tile atlas to colortex3. Higher LOD levels receive reduced weights to prevent over-blur at extreme downsampling:

blur += BloomTile(2.0, float2(0.0, 0.0),       scaledCoord);        // 1.0x
blur += BloomTile(3.0, float2(0.0, 0.26),       scaledCoord);        // 1.0x
blur += BloomTile(4.0, float2(0.135, 0.26),     scaledCoord);        // 1.0x
blur += BloomTile(5.0, float2(0.2075, 0.26),    scaledCoord) * 0.8;  // reduced
blur += BloomTile(6.0, float2(0.135, 0.3325),   scaledCoord) * 0.8;
blur += BloomTile(7.0, float2(0.160625, 0.3325), scaledCoord) * 0.6;
blur += BloomTile(8.0, float2(0.1784375,0.3325), scaledCoord) * 0.4;
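
The offsets above can be sanity-checked for overlap on the CPU. The sketch assumes each tile spans exp2(-lod) of the atlas per axis in normalized UV, which follows from BloomTile() scaling (scaledCoord - offset) by exp2(lod); the struct and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

struct BloomTileRect
{
    float lod;     // mip level sampled for this tile
    float offsetX; // tile origin in normalized atlas UV
    float offsetY;
};

// Offsets copied from the composite4 calls above.
const std::array<BloomTileRect, 7> kBloomTiles = {{
    {2.0f, 0.0f,       0.0f},
    {3.0f, 0.0f,       0.26f},
    {4.0f, 0.135f,     0.26f},
    {5.0f, 0.2075f,    0.26f},
    {6.0f, 0.135f,     0.3325f},
    {7.0f, 0.160625f,  0.3325f},
    {8.0f, 0.1784375f, 0.3325f},
}};

// Assumption: a tile for LOD n covers 2^-n of the atlas along each axis.
float TileExtent(float lod)
{
    return std::exp2(-lod);
}

// Axis-aligned rectangle intersection test in normalized UV.
bool TilesOverlap(const BloomTileRect& a, const BloomTileRect& b)
{
    const float ea = TileExtent(a.lod);
    const float eb = TileExtent(b.lod);
    return a.offsetX < b.offsetX + eb && b.offsetX < a.offsetX + ea &&
           a.offsetY < b.offsetY + eb && b.offsetY < a.offsetY + ea;
}

// Checks every pair of tiles once.
bool AtlasHasOverlap(const std::array<BloomTileRect, 7>& tiles)
{
    for (std::size_t i = 0; i < tiles.size(); ++i)
        for (std::size_t j = i + 1; j < tiles.size(); ++j)
            if (TilesOverlap(tiles[i], tiles[j]))
                return true;
    return false;
}
```

Under that assumption the seven tiles do not overlap, and each offset works out to the previous tile's offset plus its extent plus a 0.01 gap (for example 0.135 + 0.0625 + 0.01 = 0.2075), which leaves room for the kernel taps at tile edges.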

Composite 5: Bloom Application + Tonemapping

This pass reads both the HDR scene (colortex0) and the bloom atlas (colortex3), applies bloom before tonemapping, then outputs the final LDR result:

1. Read the HDR scene color.
2. DoBloom() averages all 7 LOD tiles (* 0.14, approximately 1/7) and blends with lerp(color, blur, bloomStrength).
3. Apply Lottes 2016 tonemapping (HDR to LDR).
4. Adjust color saturation and vibrance.
5. Output to colortex0 (LDR).

The bloom strength includes a darkness boost (+ 0.2 * darknessFactor) that intensifies the glow in dark scenes, matching how the human eye perceives bloom more strongly in low-light conditions.
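
The blend step can be sketched for a single color channel as follows. ApplyBloom is a hypothetical stand-in for DoBloom(), the tiles are assumed to be already decoded back to HDR, and adding the darkness boost directly to the lerp factor is an assumption based on the description above.

```cpp
#include <array>
#include <cassert>

// Standard linear interpolation, matching HLSL lerp(a, b, t).
float Lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

// One-channel sketch: average the 7 decoded tiles, then blend with the scene color.
float ApplyBloom(float sceneColor,
                 const std::array<float, 7>& decodedTiles,
                 float bloomStrength,
                 float darknessFactor)
{
    float blur = 0.0f;
    for (float tile : decodedTiles)
        blur += tile;
    blur *= 0.14f; // approximately 1/7

    // Assumed form of the darkness boost: a direct addition to the lerp factor.
    const float strength = bloomStrength + 0.2f * darknessFactor;
    return Lerp(sceneColor, blur, strength);
}
```

With a strength of 0 the scene color passes through unchanged, and a larger darknessFactor pulls the result further toward the blurred value.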

Bloom Configuration

All bloom parameters are exposed in settings.hlsl as compile-time defines with slider ranges, following the Iris/OptiFine shader options convention:

Parameter	Default	Range	Purpose
`BLOOM_ENABLED`	1	-1, 1	Enable or disable bloom entirely
`BLOOM_STRENGTH`	0.52	0.027 to 10.00	Bloom intensity (lerp factor)

The colortex0MipmapEnabled directive in rt_formats.hlsl controls whether the engine generates hardware mipmaps for the scene texture. Disabling it falls back to mip-0-only sampling, which still produces bloom but with lower quality downsampling.

Final Results

Scene with bloom disabled, showing the raw HDR tonemapped output without any glow

Bloom at 32% strength, adding subtle glow to bright surfaces like water reflections and sky

Bloom at 52% strength (default), producing a more pronounced glow that enhances the atmospheric feel

Design Philosophy

Across the bloom and mipmap generation systems, several principles guided the architecture:

Texture Owns Its Resources: Per-mip UAV descriptors live on D12Texture as persistent members, created lazily on first use and freed in the destructor. The MipmapGenerator remains a stateless utility with zero allocation overhead per frame. This follows the same ownership pattern as the existing SRV bindless index on D12Resource.

Generic Engine API, Specific Usage: D3D12RenderSystem::CreateUAV accepts a raw D3D12_UNORDERED_ACCESS_VIEW_DESC without assuming texture dimensions or mip slices. D3D12RenderSystem::UAVBarrier wraps the synchronization primitive. Each resource type builds its own descriptor and calls the generic API. No texture-specific knowledge leaks into the render system layer.

ShaderBundle-Driven Configuration: Mipmap generation is opt-in per render target through a const bool colortexNMipmapEnabled directive in the shader source. The engine parses this during bundle loading and configures the render target accordingly. Artists and shader developers control the feature without touching C++ code.

Compute Over Copy: Mipmap generation uses a dedicated compute shader rather than CPU-side downsampling; D3D12 has no built-in equivalent of D3D11's GenerateMips, so the engine has to supply its own pass. The SM6.6 bindless RWTexture2D approach avoids the SRV/UAV state conflict that plagues mixed-access patterns in DX12, and the 2x2 box filter matches a hardware bilinear sample taken at the center of each 2x2 footprint, which holds whenever the source dimensions are even.
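
The box filter and bilinear comparison can be verified with a small CPU reference. This sketch assumes even source dimensions, texel centers at half-integer coordinates, and clamp addressing; Image and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct Image
{
    int width;
    int height;
    std::vector<float> texels; // row-major, one channel
};

// Clamp-addressed texel fetch.
float Texel(const Image& img, int x, int y)
{
    x = std::min(std::max(x, 0), img.width - 1);
    y = std::min(std::max(y, 0), img.height - 1);
    return img.texels[static_cast<std::size_t>(y) * img.width + x];
}

// Reference bilinear filter with texel centers at (i + 0.5) / size.
float BilinearSample(const Image& img, float u, float v)
{
    const float x  = u * img.width - 0.5f;
    const float y  = v * img.height - 0.5f;
    const int   x0 = static_cast<int>(std::floor(x));
    const int   y0 = static_cast<int>(std::floor(y));
    const float fx = x - x0;
    const float fy = y - y0;

    const float top    = Texel(img, x0, y0) * (1.0f - fx) + Texel(img, x0 + 1, y0) * fx;
    const float bottom = Texel(img, x0, y0 + 1) * (1.0f - fx) + Texel(img, x0 + 1, y0 + 1) * fx;
    return top * (1.0f - fy) + bottom * fy;
}

// 2x2 box filter, as in the mipmap compute shader.
float BoxDownsample(const Image& src, int dstX, int dstY)
{
    const int sx = dstX * 2;
    const int sy = dstY * 2;
    return (Texel(src, sx, sy) + Texel(src, sx + 1, sy) +
            Texel(src, sx, sy + 1) + Texel(src, sx + 1, sy + 1)) * 0.25f;
}

// Bilinear sample taken at the center of the destination texel.
float BilinearAtDstCenter(const Image& src, int dstX, int dstY)
{
    const float dstWidth  = static_cast<float>(src.width / 2);
    const float dstHeight = static_cast<float>(src.height / 2);
    return BilinearSample(src, (dstX + 0.5f) / dstWidth, (dstY + 0.5f) / dstHeight);
}
```

For a destination texel center, the sample point lands exactly between four source texels, so all four bilinear weights become 0.25 and the two results agree.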