Phase POOL-MID-DN-BATCH: Deferred inuse_dec Design

Goal

  • Completely eliminate mid_desc_lookup* calls from the hak_pool_free_v1_fast_impl hot path
  • Target: +2-4% on the Mixed median, plus tail/variance reduction (matching the trend in the cache A/B below)

Background

A/B Benchmark Results (2025-12-12)

Metric               Baseline       Cache ON       Improvement
Median throughput    8.72M ops/s    8.93M ops/s    +2.3%
Worst-case           6.46M ops/s    7.52M ops/s    +16.5%
CV (variance)        13.3%          7.2%           -46%

Insight: the cache improves stability more than raw speed. Deferred batching should do even better, since it removes the lookup from the hot path entirely.

Box Theory Design

L0: MidInuseDeferredBox

// Hot API (lookup/atomic/lock PROHIBITED)
static inline void mid_inuse_dec_deferred(void* raw);

// Cold API (ONLY lookup boundary)
static inline void mid_inuse_deferred_drain(void);

L1: MidInuseTlsPageMapBox

// TLS fixed-size map (32 or 64 entries)
// Single responsibility: "bundle page→dec_count"
typedef struct {
    void* pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
} MidInuseTlsPageMap;

static __thread MidInuseTlsPageMap g_mid_inuse_tls_map;
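
With the 64-entry variant the map is 64×8 + 64×4 + 4 bytes ≈ 776 bytes per thread after padding, small enough to stay hot in TLS. A compile-time guard (a sketch, assuming the 64-entry variant):

_Static_assert(sizeof(MidInuseTlsPageMap) <= 1024,
               "TLS page map should stay within ~1 KiB per thread");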

Algorithm

mid_inuse_dec_deferred(raw) - HOT

static inline void mid_inuse_dec_deferred(void* raw) {
    if (!hak_pool_mid_inuse_deferred_enabled()) {
        mid_page_inuse_dec_and_maybe_dn(raw);  // Fallback: immediate dec
        return;
    }

    void* page = (void*)((uintptr_t)raw & ~((uintptr_t)POOL_PAGE_SIZE - 1));

    // Find or insert in TLS map (linear kind; see MAP_KIND knob below)
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        if (g_mid_inuse_tls_map.pages[i] == page) {
            g_mid_inuse_tls_map.counts[i]++;
            STAT_INC(mid_inuse_deferred_hit);
            return;
        }
    }

    // New page entry
    if (g_mid_inuse_tls_map.used >= MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_deferred_drain();  // Flush when full (resets used to 0)
    }

    uint32_t idx = g_mid_inuse_tls_map.used++;
    g_mid_inuse_tls_map.pages[idx] = page;
    g_mid_inuse_tls_map.counts[idx] = 1;
    STAT_INC(mid_inuse_deferred_hit);
}

mid_inuse_deferred_drain() - COLD (only lookup boundary)

static inline void mid_inuse_deferred_drain(void) {
    STAT_INC(drain_calls);

    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        void* page = g_mid_inuse_tls_map.pages[i];
        uint32_t n = g_mid_inuse_tls_map.counts[i];

        // ONLY lookup happens here (one per unique page, batched)
        MidPageDesc* d = mid_desc_lookup(page);
        if (d) {
            uint64_t old = atomic_fetch_sub(&d->in_use, n);
            STAT_INC(pages_drained);
            STAT_ADD(decs_drained, n);

            // Empty transition: in_use dropped exactly to zero.
            // (old < n would indicate a counting bug; see Safety Analysis.)
            if (old == n) {
                STAT_INC(empty_transitions);
                // pending_dn logic (existing)
                if (d->pending_dn == 0) {
                    d->pending_dn = 1;
                    hak_batch_add_page(page);
                }
            }
        }
    }

    g_mid_inuse_tls_map.used = 0;  // Clear map
}

Drain Boundaries (Critical)

DO NOT drain in hot path. Drain only at these cold/rare points:

  1. TLS map full - Inside mid_inuse_dec_deferred() (once per overflow)
  2. Refill/slow boundary - Add one call in the pool alloc refill or the slow free tail (sketched below)
  3. Thread exit - If thread cleanup exists (optional)
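
A minimal sketch of boundary (2), assuming a refill-style slow-path entry point (both function names below are illustrative, not the real ones):

// Hypothetical placement: drain where the work is already expensive,
// so the batched lookups are amortized against the slow path.
static void* hak_pool_alloc_refill_slow(size_t size) {
    mid_inuse_deferred_drain();              // boundary (2): refill/slow
    return hak_pool_refill_and_alloc(size);  // existing refill logic (assumed)
}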

ENV Gate

// HAKMEM_POOL_MID_INUSE_DEFERRED=1 (default 0)
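// Requires <stdlib.h> for getenv(). The unsynchronized first-call race on
// g is benign: every thread computes and stores the same value.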
static inline int hak_pool_mid_inuse_deferred_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_POOL_MID_INUSE_DEFERRED");
        g = (e && *e == '1') ? 1 : 0;
    }
    return g;
}

Related knobs:

  • HAKMEM_POOL_MID_INUSE_MAP_KIND=linear|hash (default linear)
    • TLS page-map implementation used by the hot path (hash variant sketched below).
  • HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0/1 (default 0)
    • Enables debug counters + exit dump. Keep OFF for perf runs.
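
For the hash kind, a minimal open-addressing sketch of what the slot lookup could look like (an assumption for illustration; the real box layout may differ). It assumes MID_INUSE_TLS_MAP_SIZE is a power of two and that the map is drained before it fills completely:

// Hypothetical hash variant: power-of-two table, linear probing.
static inline uint32_t mid_inuse_map_slot(void* page) {
    uintptr_t pn = (uintptr_t)page / POOL_PAGE_SIZE;  // drop aligned zero bits
    uint32_t  i  = (uint32_t)(pn * 2654435761u) & (MID_INUSE_TLS_MAP_SIZE - 1u);
    while (g_mid_inuse_tls_map.pages[i] != NULL &&
           g_mid_inuse_tls_map.pages[i] != page) {
        i = (i + 1u) & (MID_INUSE_TLS_MAP_SIZE - 1u);  // probe next slot
    }
    return i;  // slot holding the page, or the first free slot
}

Drain would then walk the whole table and skip NULL slots instead of iterating 0..used. At 32-64 entries the linear scan fits in a few cache lines, so linear is the sensible default; hash should only pay off if the map ever grows.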

Implementation Patches (Order)

Step  File                                  Description
1     pool_mid_inuse_deferred_env_box.h     ENV gate
2     pool_mid_inuse_tls_pagemap_box.h      TLS map box
3     pool_mid_inuse_deferred_box.h         Deferred API (dec + drain)
4     pool_free_v1_box.h                    Replace tail with deferred (ENV ON only; sketched below)
5     pool_mid_inuse_deferred_stats_box.h   Counters
6     A/B benchmark                         Validate
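
Step 4 is the only hot-path edit. Its intended shape (the wrapper below is illustrative; the real change lands in the tail of hak_pool_free_v1_fast_impl):

// Tail of the fast free path, after the header check has passed.
static inline void pool_free_v1_tail(void* raw) {
    // Before: mid_page_inuse_dec_and_maybe_dn(raw);  // lookup on every free
    mid_inuse_dec_deferred(raw);  // After (ENV ON): no lookup on this path
}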

Stats Counters

typedef struct {
    _Atomic uint64_t mid_inuse_deferred_hit;  // deferred dec calls (hot)
    _Atomic uint64_t drain_calls;             // drain invocations (cold)
    _Atomic uint64_t pages_drained;           // unique pages processed
    _Atomic uint64_t decs_drained;            // total decrements applied
    _Atomic uint64_t empty_transitions;       // pages that hit <=0
} MidInuseDeferredStats;
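
The STAT_INC/STAT_ADD macros used in the code above are assumed to take the struct field name and to cost nothing on perf builds. One possible shape, runtime-gated by HAKMEM_POOL_MID_INUSE_DEFERRED_STATS (an assumption; the real macros belong to pool_mid_inuse_deferred_stats_box.h):

#include <stdatomic.h>
#include <stdint.h>

extern MidInuseDeferredStats g_mid_inuse_deferred_stats;
extern int g_mid_inuse_deferred_stats_on;  // cached env check (assumed)

#define STAT_ADD(field, n)                                                  \
    do {                                                                    \
        if (__builtin_expect(g_mid_inuse_deferred_stats_on, 0))             \
            atomic_fetch_add_explicit(&g_mid_inuse_deferred_stats.field,    \
                                      (uint64_t)(n), memory_order_relaxed); \
    } while (0)
#define STAT_INC(field) STAT_ADD(field, 1)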

Goal: With fastsplit ON + deferred ON:

  • fast path lookup = 0
  • drain calls = rare (low frequency)

Safety Analysis

Concern          Analysis
Race condition   dec delayed → in_use appears larger → DONTNEED delayed (safe direction; debug check below)
Double free      No change (header check still in place)
Early release    Impossible (the dec is only ever delayed, never applied early)
Memory pressure  DONTNEED slightly delayed; acceptable
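
The "safe direction" row can be backed by a cheap debug invariant right after the atomic_fetch_sub in the drain loop (helper name and placement assumed):

#include <assert.h>

// Every deferred dec corresponds to an allocation that completed earlier,
// so in_use can never underflow when the batch is applied.
static inline void mid_inuse_assert_no_underflow(uint64_t old, uint32_t n) {
    assert(old >= n);
    (void)old; (void)n;  // keep -Wunused quiet in NDEBUG builds
}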

Acceptance Gates

Workload                            Metric      Criteria
Mixed (MIXED_TINYV3_C7_SAFE)        Median      No regression
Mixed                               CV          Clear reduction (matches the cache trend)
C6-heavy (C6_HEAVY_LEGACY_POOLV1)   Throughput  <2% regression, ideally +2%
pending_dn                          Timing      May fire later, must never fire earlier

Expected Result

After this phase, pool free hot path becomes:

header check → TLS push → deferred bookkeeping (O(1), no lookup)

This is very close to mimalloc's O(1) fast free design.
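
Putting the pieces together, an illustrative shape of the post-phase fast path (names other than mid_inuse_dec_deferred are hypothetical stand-ins for the existing code):

static inline void pool_free_fast_sketch(void* raw) {
    if (!pool_header_check(raw)) {   // existing header check (assumed)
        pool_free_slow(raw);         // hypothetical slow path
        return;
    }
    pool_tls_freelist_push(raw);     // existing TLS push (assumed)
    mid_inuse_dec_deferred(raw);     // O(1) deferred bookkeeping, no lookup
}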

Files to Modify

  • core/box/pool_mid_inuse_deferred_env_box.h (NEW)
  • core/box/pool_mid_inuse_tls_pagemap_box.h (NEW)
  • core/box/pool_mid_inuse_deferred_box.h (NEW)
  • core/box/pool_free_v1_box.h (MODIFY - add deferred call)
  • core/box/pool_mid_inuse_deferred_stats_box.h (NEW)