POOL-MID-DN-BATCH: Last-Match Cache Implementation
Date: 2025-12-13
Phase: POOL-MID-DN-BATCH optimization
Status: Implemented but insufficient for full regression fix
Problem Statement
The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root cause analysis revealed:
- Linear search overhead: Average 16 iterations in 32-entry TLS map
- Instruction count: +7.4% increase on hot path
- Hot path cost: Linear search exceeded the savings from eliminating mid_desc_lookup
Solution: Last-Match Cache
Added a last_idx field to exploit temporal locality - the assumption that consecutive frees often target the same page.
Implementation
1. Structure Change (pool_mid_inuse_tls_pagemap_box.h)
typedef struct {
void* pages[MID_INUSE_TLS_MAP_SIZE]; // Page base addresses
uint32_t counts[MID_INUSE_TLS_MAP_SIZE]; // Pending dec count per page
uint32_t used; // Number of active entries
uint32_t last_idx; // NEW: Cache last hit index
} MidInuseTlsPageMap;
2. Lookup Logic (pool_mid_inuse_deferred_box.h)
Before:
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
if (map->pages[i] == page) {
map->counts[i]++;
return;
}
}
After:
// Check last match first (O(1) fast path)
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
map->counts[map->last_idx]++;
return; // Early exit on cache hit
}
// Fallback to linear search
for (uint32_t i = 0; i < map->used; i++) {
if (map->pages[i] == page) {
map->counts[i]++;
map->last_idx = i; // Update cache
return;
}
}
3. Cache Maintenance
- On new entry: map->last_idx = idx; (new page likely to be reused)
- On drain: map->last_idx = 0; (reset for next batch)
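Putting the lookup, new-entry, and drain pieces together, the full defer path looks roughly like this. A self-contained sketch: `map_defer_dec` and `map_drain` are illustrative names, and the real drain would apply each pending count to its page descriptor rather than just resetting the map.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MID_INUSE_TLS_MAP_SIZE 32

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   /* page base addresses */
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  /* pending dec count per page */
    uint32_t used;                            /* number of active entries */
    uint32_t last_idx;                        /* last-match cache */
} MidInuseTlsPageMap;

static void map_drain(MidInuseTlsPageMap* map) {
    /* Real code applies each pending count to its page descriptor here. */
    map->used = 0;
    map->last_idx = 0;   /* reset for next batch */
}

static void map_defer_dec(MidInuseTlsPageMap* map, void* page) {
    /* O(1) fast path: last-match cache */
    if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
        map->counts[map->last_idx]++;
        return;
    }
    /* Fallback: linear search */
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            map->last_idx = i;          /* update cache on hit */
            return;
        }
    }
    /* Not found: drain if full, then insert a new entry */
    if (map->used == MID_INUSE_TLS_MAP_SIZE) map_drain(map);
    uint32_t idx = map->used++;
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx = idx;                /* new page likely reused next */
}
```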
Benchmark Results
Test Configuration
- Benchmark: bench_mid_large_mt_hakmem
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: Random
Performance Data
| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|---|---|---|---|
| Median throughput | 9.08M ops/s | 8.38M ops/s | -7.6% |
| Mean throughput | 9.04M ops/s | 8.25M ops/s | -8.7% |
| Min throughput | 7.81M ops/s | 7.34M ops/s | -6.0% |
| Max throughput | 9.71M ops/s | 8.77M ops/s | -9.7% |
| Variance | 300B | 207B | -31% (improvement) |
| Std Dev | 548K | 455K | -17% |
Raw Results
Baseline (10 runs):
8,720,875 9,147,207 9,709,755 8,708,904 9,541,168
9,322,187 9,005,728 8,994,402 7,808,414 9,459,910
Deferred with Last-Match Cache (20 runs):
8,323,016 7,963,325 8,578,296 8,313,354 8,314,545
7,445,113 7,518,391 8,610,739 8,770,947 7,338,433
8,668,194 7,797,795 7,882,001 8,442,375 8,564,862
7,950,541 8,552,224 8,548,635 8,636,063 8,742,399
Analysis
What Worked
- Variance reduction: -31% improvement in variance confirms that the deferred approach provides more stable performance
- Cache mechanism: The last_idx optimization is correctly implemented and should help in workloads with better temporal locality
Why Regression Persists
Access Pattern Mismatch:
- Expected: 60-80% cache hit rate (consecutive frees from same page)
- Reality: bench_mid_large_mt uses random access across 2048 slots
- Result: Poor temporal locality → low cache hit rate → linear search dominates
Cost Breakdown:
Original (no deferred):
mid_desc_lookup: ~10 cycles
atomic operations: ~5 cycles
Total per free: ~15 cycles
Deferred (with last-match cache):
last_idx check: ~2 cycles (on miss)
linear search: ~32 cycles (avg 16 iterations × 2 ops)
Total per free: ~34 cycles (2.3× slower)
Expected with 70% hit rate:
70% hits: ~2 cycles
30% searches: ~10 cycles
Total per free: ~4.4 cycles (~3.4× faster than the ~15-cycle baseline)
The cache hit rate for this benchmark is likely <30%, making it slower than the baseline.
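The break-even point implied by this model can be checked directly. Using the full-miss cost of ~34 cycles (failed last_idx check plus a complete linear search), the deferred path only beats the ~15-cycle baseline once the hit rate clears roughly 59%, which is consistent with a sub-30% hit rate losing badly. A model sketch using the estimated cycle counts above, not measurements:

```c
#include <assert.h>

/* Cycle-cost model from the breakdown above (estimates, not measurements):
 * baseline free ~15 cycles; cache hit ~2 cycles; full miss ~34 cycles
 * (failed last_idx check plus 32-cycle linear search). */
static double deferred_expected_cycles(double hit_rate) {
    const double hit_cost  = 2.0;
    const double miss_cost = 34.0;
    return hit_cost * hit_rate + miss_cost * (1.0 - hit_rate);
}

/* Hit rate at which the deferred path matches the baseline:
 * solve hit_cost*h + miss_cost*(1-h) = baseline for h. */
static double breakeven_hit_rate(void) {
    const double baseline = 15.0, hit_cost = 2.0, miss_cost = 34.0;
    return (miss_cost - baseline) / (miss_cost - hit_cost);  /* 19/32 ~ 0.594 */
}
```

Note the 70% scenario above assumes misses resolve in ~10 cycles (short searches under good locality); with the full 34-cycle miss cost the required hit rate is higher still.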
Conclusion
Success Criteria (Original)
- [✗] No regression: median deferred >= median baseline (Failed: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (Success: -31%)
- [✗] No outliers: all runs within 20% of median (Failed: still has variance)
Deliverables
- [✓] last_idx field added to MidInuseTlsPageMap
- [✓] Fast-path check before linear search
- [✓] Cache update on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)
Next Steps
The last-match cache is necessary but insufficient. Additional optimizations needed:
Option A: Hash-Based Lookup
Replace linear search with simple hash:
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
- Pro: O(1) expected lookup
- Con: Requires handling collisions
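A minimal sketch of what Option A could look like, with linear probing to handle the collisions noted above. Illustrative names only, not the current box API; the `>> 16` shift assumes 64 KiB page alignment, and MAP_SIZE must remain a power of two for the mask to be valid.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32   /* must be a power of two */
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

typedef struct {
    void*    pages[MAP_SIZE];   /* NULL = empty slot */
    uint32_t counts[MAP_SIZE];
    uint32_t used;
} MidInuseHashMap;

/* Returns 0 on success, -1 if the table is full (caller should drain). */
static int hash_defer_dec(MidInuseHashMap* map, void* page) {
    uint32_t idx = (uint32_t)MAP_HASH(page);
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (idx + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {      /* existing entry: O(1) expected */
            map->counts[i]++;
            return 0;
        }
        if (map->pages[i] == NULL) {      /* empty slot: insert here */
            map->pages[i]  = page;
            map->counts[i] = 1;
            map->used++;
            return 0;
        }
    }
    return -1;  /* full: drain, then retry */
}
```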
Option B: Reduce Map Size
Use 8 or 16 entries instead of 32:
- Pro: Fewer iterations on search
- Con: More frequent drains (overhead moves to drain)
Option C: Better Drain Boundaries
Drain more frequently at natural boundaries:
- After N allocations (not just on map full)
- At refill/slow path transitions
- Pro: Keeps map small, searches fast
- Con: More drain calls (must benchmark)
Option D: MRU (Most Recently Used) Ordering
Keep recently used entries at front of array:
- Pro: Common pages found faster
- Con: Array reordering overhead
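Option D can be sketched as a swap-to-front on hit, which caps the reordering overhead at one swap per lookup instead of shifting the whole array. Illustrative only, not the current box code; note it would interact with the last_idx cache if both were kept.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32

typedef struct {
    void*    pages[MAP_SIZE];
    uint32_t counts[MAP_SIZE];
    uint32_t used;
} MidInuseMruMap;

/* Returns 1 on hit, 0 on insert, -1 if full (caller should drain). */
static int mru_defer_dec(MidInuseMruMap* map, void* page) {
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            if (i > 0) {   /* swap the hit entry to the front: O(1) */
                void*    tp = map->pages[0];
                uint32_t tc = map->counts[0];
                map->pages[0]  = map->pages[i];
                map->counts[0] = map->counts[i];
                map->pages[i]  = tp;
                map->counts[i] = tc;
            }
            return 1;
        }
    }
    if (map->used < MAP_SIZE) {   /* miss: append a new entry */
        map->pages[map->used]  = page;
        map->counts[map->used] = 1;
        map->used++;
        return 0;
    }
    return -1;
}
```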
Recommendation
Try Option A (hash-based) first as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.
Related Documents
- POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md - Original design
- POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md - Root cause analysis
Commit
commit 6c849fd02
Author: ...
Date: 2025-12-13
POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead