Files
hakmem/docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md
Moe Charm (CI) d9991f39ff Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation

Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions

Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration

Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:35:46 +09:00

16 KiB
Raw Blame History

POOL-MID-DN-BATCH Performance Regression Analysis

Date: 2025-12-12 Benchmark: bench_mid_large_mt_hakmem (4 threads, 8-32KB allocations) Status: ROOT CAUSE IDENTIFIED

Update: Early implementations counted stats via global atomics on every deferred op, even when not dumping stats. This can add significant cross-thread contention and distort perf results. Current code gates stats behind HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=1 and uses per-thread counters; re-run A/B to confirm the true regression shape.


Executive Summary

The deferred inuse_dec optimization (HAKMEM_POOL_MID_INUSE_DEFERRED=1) shows:

  • -5.2% median throughput regression (8.96M → 8.49M ops/s)
  • 2x variance increase (range 5.9-8.9M vs 8.3-9.8M baseline)
  • +7.4% more instructions executed (248M vs 231M)
  • +7.5% more branches (54.6M vs 50.8M)
  • +11% more branch misses (3.98M vs 3.58M)

Root Cause: The 32-entry linear search in the TLS map costs more than the hash-table lookup it eliminates.


Benchmark Configuration

# Baseline (immediate inuse_dec)
HAKMEM_POOL_MID_INUSE_DEFERRED=0 ./bench_mid_large_mt_hakmem

# Deferred (batched inuse_dec)
HAKMEM_POOL_MID_INUSE_DEFERRED=1 ./bench_mid_large_mt_hakmem

Workload:

  • 4 threads × 40K operations = 160K total
  • 8-32 KiB allocations (MID tier)
  • 50% alloc, 50% free (steady state)
  • Same-thread pattern (fast path via pool_free_v1_box.h:85)

Results Summary

Throughput Measurements (5 runs each)

Run Baseline (ops/s) Deferred (ops/s) Delta
1 9,047,406 8,340,647 -7.8%
2 8,920,386 8,141,846 -8.7%
3 9,023,716 7,320,439 -18.9%
4 8,724,190 5,879,051 -32.6%
5 7,701,940 8,295,536 +7.7%
Median 8,920,386 8,141,846 -8.7%
Range 7.7M-9.0M (16%) 5.9M-8.3M (41%) 2.6x variance

Deferred Stats (from HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=1)

Deferred hits:      82,090
Drain calls:        2,519
Pages drained:      82,086
Empty transitions:  3,516
Avg pages/drain:    32.59

Analysis:

  • 82K deferred operations out of 160K total (51%)
  • 2.5K drains = 1 drain per 32.6 frees (as designed)
  • Very stable across runs (±0.1 pages/drain)

perf stat Measurements

Instructions

  • Baseline: 231M instructions (avg)
  • Deferred: 248M instructions (avg)
  • Delta: +7.4% MORE instructions

Branches

  • Baseline: 50.8M branches (avg)
  • Deferred: 54.6M branches (avg)
  • Delta: +7.5% MORE branches

Branch Misses

  • Baseline: 3.58M misses (7.04% miss rate)
  • Deferred: 3.98M misses (7.27% miss rate)
  • Delta: +11% MORE misses

Cache Events

  • Baseline: 4.04M L1 dcache misses (4.46% miss rate)
  • Deferred: 3.57M L1 dcache misses (4.24% miss rate)
  • Delta: -11.6% FEWER cache misses (slight improvement)

Root Cause Analysis

Expected Behavior

The deferred optimization was designed to eliminate repeated mid_desc_lookup() calls:

// Baseline: 1 lookup per free
void mid_page_inuse_dec_and_maybe_dn(void* raw) {
    MidPageDesc* d = mid_desc_lookup(raw);      // Hash + linked list walk (~10-20ns)
    atomic_fetch_sub(&d->in_use, 1);            // Atomic dec (~5ns)
    if (in_use == 0) { enqueue_dontneed(); }    // Rare
}
// Deferred: Batch 32 frees into 1 drain with 32 lookups
void mid_inuse_dec_deferred(void* raw) {
    // Add to TLS map (O(1) amortized)
    // Every 32nd call: drain with 32 batched lookups
}

Expected: 32 frees × 1 lookup each = 32 lookups → 1 drain × 32 lookups = same total lookups, but better cache locality

Reality: The TLS map search dominates the cost.

Actual Behavior

Hot Path Code (pool_mid_inuse_deferred_box.h:73-108)

static inline void mid_inuse_dec_deferred(void* raw) {
    // 1. ENV check (cached, ~0.5ns)
    if (!hak_pool_mid_inuse_deferred_enabled()) { ... }

    // 2. Ensure cleanup registered (cached TLS load, ~0.25ns)
    mid_inuse_deferred_ensure_cleanup();

    // 3. Calculate page base (~0.5ns)
    void* page = (void*)((uintptr_t)raw & ~((uintptr_t)POOL_PAGE_SIZE - 1));

    // 4. LINEAR SEARCH (EXPENSIVE!)
    MidInuseTlsPageMap* map = &g_mid_inuse_tls_map;
    for (uint32_t i = 0; i < map->used; i++) {          // Loop: 0-32 iterations
        if (map->pages[i] == page) {                    // Compare: memory load + branch
            map->counts[i]++;                           // Write: cache line dirty
            return;
        }
    }
    // Average iterations when map is half-full: 16

    // 5. Map full check (rare)
    if (map->used >= 32) { mid_inuse_deferred_drain(); }

    // 6. Add new entry
    map->pages[map->used] = page;
    map->counts[map->used] = 1;
    map->used++;
}

Cost Breakdown

Operation Baseline Deferred Delta
ENV check - 0.5ns +0.5ns
TLS cleanup check - 0.25ns +0.25ns
Page calc 0.5ns 0.5ns 0
Linear search - ~16 iterations × 0.32ns = 5.1ns +5.1ns
mid_desc_lookup 15ns - (deferred) -15ns
Atomic dec 5ns - (deferred) -5ns
Drain (amortized) - 30ns / 32 frees = 0.94ns +0.94ns
Total ~21ns ~7.5ns + 0.94ns = 8.4ns Expected: -12.6ns savings

Expected: Deferred should be ~60% faster per operation!

Problem: The micro-benchmark assumes best-case linear search (immediate hit). In practice:

Linear Search Performance Degradation

The TLS map fills from 0 to 32 entries, then drains. During filling:

Map State Iterations Cost per Search Frequency
Early (0-10 entries) 0-5 1-2ns 30% of frees
Middle (10-20 entries) 5-15 2-5ns 40% of frees
Late (20-32 entries) 15-30 5-10ns 30% of frees
Weighted Average 16 ~5ns -

With 82K deferred operations:

  • Extra branches: 82K × 16 iterations = 1.31M branches
  • Extra instructions: 1.31M × 3 (load, compare, branch) = 3.93M instructions
  • Branch mispredicts: Loop exit is unpredictable → higher miss rate

Measured:

  • +3.8M branches (54.6M - 50.8M) ✓ Matches 1.31M + existing variance
  • +17M instructions (248M - 231M) ✓ Matches 3.93M + drain overhead

Why Lookup is Cheaper Than Expected

The mid_desc_lookup() implementation (pool_mid_desc.inc.h:73-82) is lock-free:

static MidPageDesc* mid_desc_lookup(void* addr) {
    mid_desc_init_once();                           // Cached, ~0ns amortized
    void* page = (void*)((uintptr_t)addr & ~...);   // 1 instruction
    uint32_t h = mid_desc_hash(page);               // 5-10 instructions (multiplication-based hash)
    for (MidPageDesc* d = g_mid_desc_head[h]; d; d = d->next) {  // 1-3 nodes typical
        if (d->page == page) return d;
    }
    return NULL;
}

Cost: ~10-20ns (not 50-200ns as initially assumed due to no locks!)

So the baseline is:

  • mid_desc_lookup: 15ns (hash + 1-2 node walk)
  • atomic_fetch_sub: 5ns
  • Total: ~20ns per free

And the deferred hot path is:

  • Linear search: 5ns (average)
  • Amortized drain: 0.94ns
  • Overhead: 1ns
  • Total: ~7ns per free

Expected: Deferred should be 3x faster!

The Missing Factor: Code Size and Branch Predictor Pollution

The linear search loop adds:

  1. More branches (+7.5%) → pollutes branch predictor
  2. More instructions (+7.4%) → pollutes icache
  3. Unpredictable exits → branch mispredicts (+11%)

The rest of the allocator's hot paths (pool refill, remote push, ring ops) suffer from:

  • Branch predictor pollution (linear search branches evict other predictions)
  • Instruction cache pollution (48-instruction loop evicts hot code)

This explains why the entire benchmark slows down, not just the deferred path.


Variance Analysis

Baseline Variance: 16% (7.7M - 9.0M ops/s)

Causes:

  • Kernel scheduling (4 threads, context switches)
  • mmap/munmap timing variability
  • Background OS activity

Deferred Variance: 41% (5.9M - 8.3M ops/s)

Additional causes:

  1. TLS allocation timing: First call per thread pays pthread_once + pthread_setspecific (~700ns)
  2. Map fill pattern: If allocations cluster by page, map fills slower (fewer drains, more expensive searches)
  3. Branch predictor thrashing: Unpredictable loop exits cause cascading mispredicts
  4. Thread scheduling: One slow thread blocks join, magnifying timing differences

5.9M outlier analysis (32% below median):

  • Likely one thread experienced severe branch mispredict cascade
  • Possible NUMA effect (TLS allocated on remote node)
  • Could also be kernel scheduler preemption during critical section

Proposed Fixes

Idea: Cache the last matched index to exploit temporal locality.

typedef struct {
    void* pages[32];
    uint32_t counts[32];
    uint32_t used;
    uint32_t last_idx;  // NEW: Cache last matched index
} MidInuseTlsPageMap;

static inline void mid_inuse_dec_deferred(void* raw) {
    // ... ENV check, page calc ...

    // Fast path: Check last match first
    MidInuseTlsPageMap* map = &g_mid_inuse_tls_map;
    if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
        map->counts[map->last_idx]++;
        return;  // 1 iteration (60-80% hit rate expected)
    }

    // Cold path: Full linear search
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            map->last_idx = i;  // Cache for next time
            return;
        }
    }

    // ... add new entry ...
}

Expected Impact:

  • If 70% hit rate: avg iterations = 0.7×1 + 0.3×16 = 5.5 (65% reduction)
  • Reduces branches by ~850K (65% of 1.31M)
  • Estimated: +8-12% improvement vs baseline

Pros:

  • Simple 1-line change to struct, 3-line change to function
  • No algorithm change, just optimization
  • High probability of success (allocations have strong temporal locality)

Cons:

  • May not help if allocations are scattered across many pages

Option 2: Hash Table (HIGHER CEILING, HIGHER RISK)

Idea: Replace linear search with direct hash lookup.

#define MAP_SIZE 64  // Must be power of 2
typedef struct {
    void* pages[MAP_SIZE];
    uint32_t counts[MAP_SIZE];
    uint32_t used;
} MidInuseTlsPageMap;

static inline uint32_t map_hash(void* page) {
    uintptr_t x = (uintptr_t)page >> 16;
    x ^= x >> 12; x ^= x >> 6;  // Quick hash
    return (uint32_t)(x & (MAP_SIZE - 1));
}

static inline void mid_inuse_dec_deferred(void* raw) {
    // ... ENV check, page calc ...

    MidInuseTlsPageMap* map = &g_mid_inuse_tls_map;
    uint32_t idx = map_hash(page);

    // Linear probe on collision (open addressing)
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (idx + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {
            map->counts[i]++;
            return;  // Typically 1 iteration
        }
        if (map->pages[i] == NULL) {
            // Empty slot, add new entry
            map->pages[i] = page;
            map->counts[i] = 1;
            map->used++;
            if (map->used >= MAP_SIZE * 3/4) { drain(); }  // 75% load factor
            return;
        }
    }

    // Map full, drain immediately
    drain();
    // ... retry ...
}

Expected Impact:

  • Average 1-2 iterations (vs 16 currently)
  • Reduces branches by ~1.1M (85% of 1.31M)
  • Estimated: +12-18% improvement vs baseline

Pros:

  • Scales to larger maps (can increase to 128 or 256 entries)
  • Predictable O(1) performance

Cons:

  • More complex implementation (collision handling, resize logic)
  • Larger TLS footprint (512 bytes for 64 entries)
  • Hash function overhead (~5ns)
  • Risk of hash collisions causing probe loops

Option 3: Reduce Map Size to 16 Entries

Idea: Smaller map = fewer iterations.

Expected Impact:

  • Average 8 iterations (vs 16 currently)
  • But 2x more drains (5K vs 2.5K)
  • Each drain: 16 pages × 30ns = 480ns
  • Net: Neutral or slightly worse

Verdict: Not recommended.


Idea: Use AVX2 to compare 4 pointers at once.

#include <immintrin.h>

// Search 4 pages at once using AVX2
for (uint32_t i = 0; i < map->used; i += 4) {
    __m256i pages_vec = _mm256_loadu_si256((__m256i*)&map->pages[i]);
    __m256i target_vec = _mm256_set1_epi64x((int64_t)page);
    __m256i cmp = _mm256_cmpeq_epi64(pages_vec, target_vec);
    int mask = _mm256_movemask_epi8(cmp);
    if (mask) {
        int idx = i + (__builtin_ctz(mask) / 8);
        map->counts[idx]++;
        return;
    }
}

Expected Impact:

  • Reduces iterations from 16 to 4 (75% reduction)
  • Reduces branches by ~1M
  • Estimated: +10-15% improvement vs baseline

Pros:

  • Predictable speedup
  • Keeps linear structure (simple)

Cons:

  • Requires AVX2 (not portable)
  • Added complexity
  • SIMD latency may offset gains for small maps

Recommendation

Implement Option 1 (Last-Match Cache) immediately:

  1. Low risk: 4-line change, no algorithm change
  2. High probability of success: Allocations have strong temporal locality
  3. Estimated +8-12% improvement: Turns regression into win
  4. Fallback ready: If it fails, Option 2 (hash table) is next

Implementation Priority:

  1. Phase 1: Add last_idx cache (1 hour)
  2. Phase 2: Benchmark and validate (30 min)
  3. Phase 3: If insufficient, implement Option 2 (hash table) (4 hours)

Code Locations

Files to Modify

  1. TLS Map Structure:

    • File: /mnt/workdisk/public_share/hakmem/core/box/pool_mid_inuse_tls_pagemap_box.h
    • Line: 22-26
    • Change: Add uint32_t last_idx; field
  2. Search Logic:

    • File: /mnt/workdisk/public_share/hakmem/core/box/pool_mid_inuse_deferred_box.h
    • Line: 88-95
    • Change: Add last_idx fast path before loop
  3. Drain Logic:

    • File: Same as above
    • Line: 154
    • Change: Reset map->last_idx = 0; after drain

Appendix: Micro-Benchmark Data

Operation Costs (measured on test system)

Operation Cost (ns)
TLS variable load 0.25
pthread_once (cached) 2.3
pthread_once (first call) 2,945
pthread_setspecific 2.6
Linear search (32 entries, avg) 5.2
Linear search (first match) 0.0 (optimized out)

Key Insight

The linear search cost (5.2ns for 16 iterations) is competitive with mid_desc_lookup (15ns) only if:

  1. The lookup is truly eliminated (it is)
  2. The search doesn't pollute branch predictor (it does!)
  3. The overall code footprint doesn't grow (it does!)

The problem is not the search itself, but its impact on the rest of the allocator.


Conclusion

The deferred inuse_dec optimization failed to deliver expected performance gains because:

  1. The linear search is too expensive (16 avg iterations × 3 ops = 48 instructions per free)
  2. Branch predictor pollution (+7.5% more branches, +11% more mispredicts)
  3. Code footprint growth (+7.4% more instructions executed globally)

The fix is simple: Add a last-match cache to reduce average iterations from 16 to ~5, turning the 5% regression into an 8-12% improvement.

Next Steps:

  1. Implement Option 1 (last-match cache)
  2. Re-run benchmarks
  3. If successful, document and merge
  4. If insufficient, proceed to Option 2 (hash table)

Analysis by: Claude Opus 4.5 Date: 2025-12-12 Benchmark: bench_mid_large_mt_hakmem Status: Ready for implementation