# Phase POOL-MID-DN-BATCH: Deferred inuse_dec Design

## Goal

- Eliminate `mid_desc_lookup*` from the `hak_pool_free_v1_fast_impl` hot path completely
- Target: mixed median +2-4%, plus tail/variance reduction (as seen in the cache A/B)

## Background

### A/B Benchmark Results (2025-12-12)

| Metric | Baseline | Cache ON | Improvement |
|--------|----------|----------|-------------|
| Median throughput | 8.72M ops/s | 8.93M ops/s | +2.3% |
| Worst case | 6.46M ops/s | 7.52M ops/s | **+16.5%** |
| CV (variance) | 13.3% | 7.2% | **-46%** |

**Insight**: The cache improves stability more than raw speed. The deferred approach should do better still, because it removes the lookup from the hot path entirely.

## Box Theory Design

### L0: MidInuseDeferredBox

```c
// Hot API (lookup/atomic/lock PROHIBITED)
static inline void mid_inuse_dec_deferred(void* raw);

// Cold API (the ONLY lookup boundary)
static inline void mid_inuse_deferred_drain(void);
```

### L1: MidInuseTlsPageMapBox

```c
// TLS fixed-size map (32 or 64 entries).
// Single responsibility: "bundle page → dec_count".
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
} MidInuseTlsPageMap;

static __thread MidInuseTlsPageMap g_mid_inuse_tls_map;
```

## Algorithm

### mid_inuse_dec_deferred(raw) - HOT

```c
static inline void mid_inuse_dec_deferred(void* raw) {
    if (!hak_pool_mid_inuse_deferred_enabled()) {
        mid_page_inuse_dec_and_maybe_dn(raw);  // fallback: existing immediate dec
        return;
    }
    // Assumes POOL_PAGE_SIZE is a power of two.
    void* page = (void*)((uintptr_t)raw & ~(POOL_PAGE_SIZE - 1));

    // Find or insert in the TLS map
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        if (g_mid_inuse_tls_map.pages[i] == page) {
            g_mid_inuse_tls_map.counts[i]++;
            STAT_INC(mid_inuse_deferred_hit);
            return;
        }
    }

    // New page entry
    if (g_mid_inuse_tls_map.used >= MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_deferred_drain();  // flush when full
    }
    uint32_t idx = g_mid_inuse_tls_map.used++;
    g_mid_inuse_tls_map.pages[idx]  = page;
    g_mid_inuse_tls_map.counts[idx] = 1;
    STAT_INC(mid_inuse_deferred_hit);
}
```

### mid_inuse_deferred_drain() - COLD (the only lookup boundary)

```c
static inline void mid_inuse_deferred_drain(void) {
    STAT_INC(mid_inuse_deferred_drain_calls);
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        void*    page = g_mid_inuse_tls_map.pages[i];
        uint32_t n    = g_mid_inuse_tls_map.counts[i];

        // The ONLY lookup happens here (batched)
        MidPageDesc* d = mid_desc_lookup(page);
        if (d) {
            uint64_t old = atomic_fetch_sub(&d->in_use, n);
            STAT_INC(mid_inuse_deferred_pages_drained);
            STAT_ADD(mid_inuse_deferred_decs_drained, n);

            // Empty-transition check (existing logic): in_use reached 0
            if (old == n) {
                STAT_INC(mid_inuse_deferred_empty_transitions);
                // pending_dn logic (existing)
                if (d->pending_dn == 0) {
                    d->pending_dn = 1;
                    hak_batch_add_page(page);
                }
            }
        }
    }
    g_mid_inuse_tls_map.used = 0;  // clear the map
}
```

## Drain Boundaries (Critical)

**DO NOT drain in the hot path.** Drain only at these cold/rare points (sketch below):

1. **TLS map full** - inside `mid_inuse_dec_deferred()` (once per overflow)
2. **Refill/slow boundary** - add one call in the pool alloc refill or the slow free tail
3. **Thread exit** - if thread cleanup exists (optional)
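A minimal sketch of boundaries (2) and (3), assuming pthreads. `pool_mid_alloc_refill()` and the `*_drain_key*` names are hypothetical stand-ins; only `mid_inuse_deferred_drain()` itself comes from this design.

```c
#include <pthread.h>
#include <stddef.h>

/* (2) Refill/slow boundary: piggyback one drain call on an already-cold
 * path. pool_mid_alloc_refill() is a hypothetical stand-in for the
 * existing pool alloc refill. */
static void* pool_mid_alloc_refill_with_drain(size_t size) {
    mid_inuse_deferred_drain();          /* batched lookups happen here */
    return pool_mid_alloc_refill(size);  /* existing refill (hypothetical name) */
}

/* (3) Thread exit: flush a dying thread's buffered decrements via a
 * pthread TLS destructor. The destructor only fires if the key holds a
 * non-NULL value, so each thread arms it once (e.g. on first deferred dec). */
static pthread_key_t  g_mid_inuse_drain_key;
static pthread_once_t g_mid_inuse_drain_once = PTHREAD_ONCE_INIT;

static void mid_inuse_drain_dtor(void* arm) {
    (void)arm;
    mid_inuse_deferred_drain();
}

static void mid_inuse_drain_key_init(void) {
    pthread_key_create(&g_mid_inuse_drain_key, mid_inuse_drain_dtor);
}

static void mid_inuse_drain_arm_thread(void) {
    pthread_once(&g_mid_inuse_drain_once, mid_inuse_drain_key_init);
    pthread_setspecific(g_mid_inuse_drain_key, (void*)1);
}
```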
## ENV Gate

```c
// HAKMEM_POOL_MID_INUSE_DEFERRED=1 (default 0)
static inline int hak_pool_mid_inuse_deferred_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_POOL_MID_INUSE_DEFERRED");
        g = (e && *e == '1') ? 1 : 0;
    }
    return g;
}
```

Related knobs:

- `HAKMEM_POOL_MID_INUSE_MAP_KIND=linear|hash` (default `linear`) - TLS page-map implementation used by the hot path.
- `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0/1` (default `0`) - enables debug counters + exit dump. Keep OFF for perf runs.

## Implementation Patches (Order)

| Step | File | Description |
|------|------|-------------|
| 1 | `pool_mid_inuse_deferred_env_box.h` | ENV gate |
| 2 | `pool_mid_inuse_tls_pagemap_box.h` | TLS map box |
| 3 | `pool_mid_inuse_deferred_box.h` | Deferred API (dec + drain) |
| 4 | `pool_free_v1_box.h` | Replace tail with deferred (ENV ON only; sketch at the end of this doc) |
| 5 | `pool_mid_inuse_deferred_stats_box.h` | Counters |
| 6 | A/B benchmark | Validate |

## Stats Counters

```c
typedef struct {
    _Atomic uint64_t hit;                // deferred dec calls (hot)
    _Atomic uint64_t drain_calls;        // drain invocations (cold)
    _Atomic uint64_t pages_drained;      // unique pages processed
    _Atomic uint64_t decs_drained;       // total decrements applied
    _Atomic uint64_t empty_transitions;  // pages whose in_use reached 0
} MidInuseDeferredStats;
```

**Goal**: with fastsplit ON + deferred ON:

- fast-path lookups = 0
- drain calls = rare (low frequency)

## Safety Analysis

| Concern | Analysis |
|---------|----------|
| Race condition | dec is delayed → `in_use` appears larger → DONTNEED is delayed (the safe direction) |
| Double free | No change (header check still in place) |
| Early release | Impossible (dec is only delayed, never advanced) |
| Memory pressure | DONTNEED is slightly delayed; acceptable |

## Acceptance Gates

| Workload | Metric | Criteria |
|----------|--------|----------|
| Mixed (MIXED_TINYV3_C7_SAFE) | Median | No regression |
| Mixed | CV | Clear reduction (matching the cache trend) |
| C6-heavy (C6_HEAVY_LEGACY_POOLV1) | Throughput | <2% regression, ideally +2% |
| pending_dn | Timing | May fire later than today, never earlier |

## Expected Result

After this phase, the pool free hot path becomes:

```
header check → TLS push → deferred bookkeeping (O(1), no lookup)
```

This is very close to mimalloc's O(1) fast-free design.

## Files to Modify

- `core/box/pool_mid_inuse_deferred_env_box.h` (NEW)
- `core/box/pool_mid_inuse_tls_pagemap_box.h` (NEW)
- `core/box/pool_mid_inuse_deferred_box.h` (NEW)
- `core/box/pool_free_v1_box.h` (MODIFY - add deferred call)
- `core/box/pool_mid_inuse_deferred_stats_box.h` (NEW)
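For step 4 in the patch table, a minimal sketch of the tail replacement in `pool_free_v1_box.h`. The header-check and TLS-push helpers are hypothetical stand-ins for the existing fast-free body; only the tail changes.

```c
/* Hypothetical shape of the step-4 patch. pool_free_v1_header_ok() and
 * pool_free_v1_push_tls() stand in for the existing fast-free body. */
static inline int hak_pool_free_v1_fast_sketch(void* raw) {
    if (!pool_free_v1_header_ok(raw))  /* existing double-free guard */
        return 0;
    pool_free_v1_push_tls(raw);        /* existing TLS freelist push */

    /* Old tail: mid_page_inuse_dec_and_maybe_dn(raw), a per-free lookup.
     * New tail: O(1) deferred bookkeeping. mid_inuse_dec_deferred()
     * falls back to the old tail internally when the ENV gate is OFF,
     * so no separate branch is needed here. */
    mid_inuse_dec_deferred(raw);
    return 1;
}
```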
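And a minimal sketch of the step-5 stats exit dump behind `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=1`, assuming a global `g_mid_inuse_deferred_stats` instance that the `STAT_*` macros update; the global and function names are assumptions, the fields come from the Stats Counters section above.

```c
#include <inttypes.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical exit dump: prints the counters once at process exit. */
static MidInuseDeferredStats g_mid_inuse_deferred_stats;

static void mid_inuse_deferred_stats_dump(void) {
    MidInuseDeferredStats* s = &g_mid_inuse_deferred_stats;
    fprintf(stderr,
            "mid_inuse_deferred: hit=%" PRIu64 " drains=%" PRIu64
            " pages=%" PRIu64 " decs=%" PRIu64 " empty=%" PRIu64 "\n",
            atomic_load(&s->hit), atomic_load(&s->drain_calls),
            atomic_load(&s->pages_drained), atomic_load(&s->decs_drained),
            atomic_load(&s->empty_transitions));
}

static void mid_inuse_deferred_stats_init(void) {
    const char* e = getenv("HAKMEM_POOL_MID_INUSE_DEFERRED_STATS");
    if (e && *e == '1') {
        atexit(mid_inuse_deferred_stats_dump);  /* register the dump once */
    }
}
```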