# Phase POOL-MID-DN-BATCH: Deferred inuse_dec Design
## Goal

- Eliminate `mid_desc_lookup*` from the `hak_pool_free_v1_fast_impl` hot path completely
- Target: Mixed median +2-4%, tail/variance reduction (as seen in cache A/B)
## Background

### A/B Benchmark Results (2025-12-12)
| Metric | Baseline | Cache ON | Improvement |
|---|---|---|---|
| Median throughput | 8.72M ops/s | 8.93M ops/s | +2.3% |
| Worst-case | 6.46M ops/s | 7.52M ops/s | +16.5% |
| CV (variance) | 13.3% | 7.2% | -46% |
**Insight**: The cache improves stability more than raw speed. Deferred should do even better, because it eliminates the lookup from the hot path entirely instead of just making it cheaper.
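The deltas in the table can be re-derived from the rounded throughput figures. A standalone sanity check (the table was presumably computed from unrounded data, so recomputation from the rounded values lands within ~0.2 points):

```c
#include <assert.h>
#include <math.h>

/* Percent change from a baseline to a cache-ON measurement.
 * Table rows: median 8.72 -> 8.93, worst 6.46 -> 7.52, CV 13.3 -> 7.2. */
static double pct_change(double base, double on) {
    return (on - base) / base * 100.0;
}
```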
## Box Theory Design

### L0: MidInuseDeferredBox
```c
// Hot API (lookup/atomic/lock PROHIBITED)
static inline void mid_inuse_dec_deferred(void* raw);

// Cold API (the ONLY lookup boundary)
static inline void mid_inuse_deferred_drain(void);
```
### L1: MidInuseTlsPageMapBox
```c
// TLS fixed-size map (32 or 64 entries)
// Single responsibility: "bundle page -> dec_count"
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
} MidInuseTlsPageMap;

static __thread MidInuseTlsPageMap g_mid_inuse_tls_map;
```
## Algorithm

### mid_inuse_dec_deferred(raw) - HOT
```c
static inline void mid_inuse_dec_deferred(void* raw) {
    if (!hak_pool_mid_inuse_deferred_enabled()) {
        mid_page_inuse_dec_and_maybe_dn(raw);  // Fallback: immediate dec
        return;
    }
    // Round down to page base (cast keeps the mask pointer-width)
    void* page = (void*)((uintptr_t)raw & ~((uintptr_t)POOL_PAGE_SIZE - 1));

    // Find or insert in TLS map
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        if (g_mid_inuse_tls_map.pages[i] == page) {
            g_mid_inuse_tls_map.counts[i]++;
            STAT_INC(mid_inuse_deferred_hit);
            return;
        }
    }

    // New page entry
    if (g_mid_inuse_tls_map.used >= MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_deferred_drain();  // Flush when full
    }
    uint32_t idx = g_mid_inuse_tls_map.used++;
    g_mid_inuse_tls_map.pages[idx]  = page;
    g_mid_inuse_tls_map.counts[idx] = 1;
    STAT_INC(mid_inuse_deferred_hit);
}
```
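A stripped-down, standalone model of the hot-path map logic (hypothetical `DEMO_*` names, no allocator dependencies) is useful for sanity-checking the page masking and find-or-insert behavior:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real constants and TLS state. */
#define DEMO_PAGE_SIZE 4096u
#define DEMO_MAP_SIZE  32

typedef struct {
    uintptr_t pages[DEMO_MAP_SIZE];
    uint32_t  counts[DEMO_MAP_SIZE];
    uint32_t  used;
} DemoMap;

/* Same masking as the hot path: round the raw pointer down to its page base. */
static uintptr_t demo_page_base(uintptr_t raw) {
    return raw & ~((uintptr_t)DEMO_PAGE_SIZE - 1);
}

/* Find-or-insert; returns 1 if the map was full (the real code drains here). */
static int demo_dec_deferred(DemoMap* m, uintptr_t raw) {
    uintptr_t page = demo_page_base(raw);
    for (uint32_t i = 0; i < m->used; i++) {
        if (m->pages[i] == page) { m->counts[i]++; return 0; }
    }
    if (m->used >= DEMO_MAP_SIZE) return 1;
    m->pages[m->used]  = page;
    m->counts[m->used] = 1;
    m->used++;
    return 0;
}
```

Two frees on the same page take one map slot with a count of 2, which is exactly the "bundle page -> dec_count" responsibility of the L1 box.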
### mid_inuse_deferred_drain() - COLD (the only lookup boundary)
```c
static inline void mid_inuse_deferred_drain(void) {
    STAT_INC(mid_inuse_deferred_drain_calls);
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        void*    page = g_mid_inuse_tls_map.pages[i];
        uint32_t n    = g_mid_inuse_tls_map.counts[i];

        // The ONLY lookup happens here (batched: once per distinct page)
        MidPageDesc* d = mid_desc_lookup(page);
        if (d) {
            uint64_t old = atomic_fetch_sub(&d->in_use, n);
            STAT_ADD(mid_inuse_deferred_pages_drained, n);
            // Check for empty transition (existing logic), i.e. old == n
            if (old >= n && old - n == 0) {
                STAT_INC(mid_inuse_deferred_empty_transitions);
                // pending_dn logic (existing)
                if (d->pending_dn == 0) {
                    d->pending_dn = 1;
                    hak_batch_add_page(page);
                }
            }
        }
    }
    g_mid_inuse_tls_map.used = 0;  // Clear map
}
```
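To make the batching payoff concrete, here is a standalone model (stub `demo_lookup`, hypothetical names) showing that N deferred decrements cost one lookup per distinct page rather than one per free:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGES 4

typedef struct { uint64_t in_use; int pending_dn; } DemoDesc;

static DemoDesc g_descs[DEMO_PAGES];
static int g_lookup_calls;

/* Stub for mid_desc_lookup(): invoked once per *distinct* page in the batch. */
static DemoDesc* demo_lookup(int page_id) {
    g_lookup_calls++;
    return &g_descs[page_id];
}

/* Drain a batch of (page_id, count) pairs, mirroring the drain loop. */
static int demo_drain(const int* pages, const uint32_t* counts, int n) {
    int empty_transitions = 0;
    for (int i = 0; i < n; i++) {
        DemoDesc* d  = demo_lookup(pages[i]);
        uint64_t old = d->in_use;
        d->in_use   -= counts[i];
        if (old == counts[i]) {            /* i.e. old >= n && old - n == 0 */
            empty_transitions++;
            if (!d->pending_dn) d->pending_dn = 1;
        }
    }
    return empty_transitions;
}
```

Ten frees spread over two pages trigger exactly two lookups, and the page that reaches zero is flagged `pending_dn` once.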
## Drain Boundaries (Critical)

**Do not drain in the hot path.** Drain only at these cold/rare points:
- **TLS map full** - inside `mid_inuse_dec_deferred()` (once per overflow)
- **Refill/slow boundary** - add one call in pool alloc refill or the slow free tail
- **Thread exit** - if thread cleanup exists (optional)
## ENV Gate
```c
// HAKMEM_POOL_MID_INUSE_DEFERRED=1 (default 0)
static inline int hak_pool_mid_inuse_deferred_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_POOL_MID_INUSE_DEFERRED");
        g = (e && *e == '1') ? 1 : 0;
    }
    return g;
}
```
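Note that the gate caches its answer on first call, so flipping the variable afterwards has no effect within the process. A standalone demo of the same pattern (renamed `DEMO_*` to avoid clashing with the real gate):

```c
#include <assert.h>
#include <stdlib.h>

/* Same lazy-cached pattern as the real gate, demo-only variable name. */
static int demo_deferred_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("DEMO_POOL_MID_INUSE_DEFERRED");
        g = (e && *e == '1') ? 1 : 0;
    }
    return g;
}
```

For A/B runs this means the variable must be set before process start (or at least before the first free), not toggled mid-run.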
Related knobs:
- `HAKMEM_POOL_MID_INUSE_MAP_KIND=linear|hash` (default `linear`) - TLS page-map implementation used by the hot path.
- `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0/1` (default `0`) - enables debug counters + exit dump. Keep OFF for perf runs.
## Implementation Patches (Order)

| Step | File | Description |
|---|---|---|
| 1 | `pool_mid_inuse_deferred_env_box.h` | ENV gate |
| 2 | `pool_mid_inuse_tls_pagemap_box.h` | TLS map box |
| 3 | `pool_mid_inuse_deferred_box.h` | Deferred API (dec + drain) |
| 4 | `pool_free_v1_box.h` | Replace tail with deferred (ENV ON only) |
| 5 | `pool_mid_inuse_deferred_stats_box.h` | Counters |
| 6 | A/B benchmark | Validate |
## Stats Counters
```c
typedef struct {
    _Atomic uint64_t mid_inuse_deferred_hit; // deferred dec calls (hot)
    _Atomic uint64_t drain_calls;            // drain invocations (cold)
    _Atomic uint64_t pages_drained;          // unique pages processed
    _Atomic uint64_t decs_drained;           // total decrements applied
    _Atomic uint64_t empty_transitions;      // pages that hit <= 0
} MidInuseDeferredStats;
```
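The `STAT_INC`/`STAT_ADD` macros used above are not defined in this document; one plausible shape (a sketch under that assumption, with demo names) uses relaxed atomic adds, which are cheap because the counters are only read at exit:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t drain_calls;
    _Atomic uint64_t decs_drained;
} DemoStats;

static DemoStats g_demo_stats;

/* Relaxed ordering suffices for counters only aggregated at exit dump. */
#define DEMO_STAT_INC(f) \
    atomic_fetch_add_explicit(&g_demo_stats.f, 1, memory_order_relaxed)
#define DEMO_STAT_ADD(f, n) \
    atomic_fetch_add_explicit(&g_demo_stats.f, (uint64_t)(n), memory_order_relaxed)
```

With `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0` the real macros presumably compile to nothing, which is why the doc says to keep stats OFF for perf runs.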
**Goal**: with fastsplit ON + deferred ON:
- fast-path lookups = 0
- drain calls = rare (low frequency)
## Safety Analysis
| Concern | Analysis |
|---|---|
| Race condition | dec delayed → in_use appears larger → DONTNEED delayed (safe direction) |
| Double free | No change (header check still in place) |
| Early release | Impossible (dec is delayed, not advanced) |
| Memory pressure | Slightly delayed DONTNEED, acceptable |
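The "safe direction" claims can be restated as one invariant: with decrements deferred, the descriptor's `in_use` is always >= the true live count, so `DONTNEED` can only fire late, never early. A tiny model checking that invariant (hypothetical names, not allocator code):

```c
#include <assert.h>
#include <stdint.h>

/* Model: 'observed' is the descriptor counter (dec deferred), 'truth' is
 * the real live-object count (dec applied immediately on each free). */
typedef struct { int64_t observed, truth, deferred; } DemoPage;

static void demo_free(DemoPage* p)  { p->truth--; p->deferred++; }        /* hot */
static void demo_drain(DemoPage* p) { p->observed -= p->deferred; p->deferred = 0; }

/* observed may lag high, never low: a page never looks empty while live. */
static int demo_invariant(const DemoPage* p) {
    return p->observed >= p->truth;
}
```

The invariant holds at every interleaving point because `demo_free` only widens the gap and `demo_drain` closes it exactly; this is the formal version of the "safe direction" row above.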
## Acceptance Gates
| Workload | Metric | Criteria |
|---|---|---|
| Mixed (MIXED_TINYV3_C7_SAFE) | Median | No regression |
| Mixed | CV | Clear reduction (matches cache trend) |
| C6-heavy (C6_HEAVY_LEGACY_POOLV1) | Throughput | <2% regression, ideally +2% |
| pending_dn | Timing | Later is OK; earlier is not |
## Expected Result

After this phase, the pool free hot path becomes:
header check → TLS push → deferred bookkeeping (O(1), no lookup)
This is very close to mimalloc's O(1) fast free design.
## Files to Modify

- `core/box/pool_mid_inuse_deferred_env_box.h` (NEW)
- `core/box/pool_mid_inuse_tls_pagemap_box.h` (NEW)
- `core/box/pool_mid_inuse_deferred_box.h` (NEW)
- `core/box/pool_free_v1_box.h` (MODIFY - add deferred call)
- `core/box/pool_mid_inuse_deferred_stats_box.h` (NEW)