Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update

Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation

Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions

Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration

Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
# POOL-MID-DN-BATCH: Last-Match Cache Implementation

**Date**: 2025-12-13
**Phase**: POOL-MID-DN-BATCH optimization
**Status**: Implemented but insufficient for full regression fix
## Problem Statement

The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root cause analysis revealed:

- **Linear search overhead**: 16 iterations on average in the 32-entry TLS map
- **Instruction count**: +7.4% increase on the hot path
- **Hot path cost**: the linear search exceeded the savings from eliminating mid_desc_lookup
## Solution: Last-Match Cache

Added a `last_idx` field to exploit temporal locality, i.e. the assumption that consecutive frees often target the same page.

### Implementation
#### 1. Structure Change (`pool_mid_inuse_tls_pagemap_box.h`)

```c
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // NEW: Cache last hit index
} MidInuseTlsPageMap;
```
#### 2. Lookup Logic (`pool_mid_inuse_deferred_box.h`)

**Before**:
```c
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        return;
    }
}
```
**After**:
```c
// Check last match first (O(1) fast path)
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
    map->counts[map->last_idx]++;
    return;  // Early exit on cache hit
}

// Fall back to linear search
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        map->last_idx = i;  // Update cache
        return;
    }
}
```
#### 3. Cache Maintenance

- **On new entry**: `map->last_idx = idx;` (a new page is likely to be reused)
- **On drain**: `map->last_idx = 0;` (reset for the next batch)
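Assembled from the fragments above, the record path might look like the sketch below. `mid_inuse_record_dec` and `mid_inuse_drain` are illustrative names, and the drain body is stubbed out (the real allocator would apply the pending decrements to each page descriptor).

```c
#include <stdint.h>
#include <stddef.h>

#define MID_INUSE_TLS_MAP_SIZE 32

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // Cache of the last hit index
} MidInuseTlsPageMap;

// Stubbed drain: the real version would flush each pending count to its
// page descriptor. Here it only empties the map.
static void mid_inuse_drain(MidInuseTlsPageMap* map) {
    map->used = 0;
    map->last_idx = 0;  // Reset the cache for the next batch
}

// Record one deferred inuse_dec for `page`, combining the last-match
// fast path, the linear-search fallback, and new-entry insertion.
static void mid_inuse_record_dec(MidInuseTlsPageMap* map, void* page) {
    // O(1) fast path: consecutive frees often hit the same page
    if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
        map->counts[map->last_idx]++;
        return;
    }
    // Fallback: linear search, updating the cache on a hit
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            map->last_idx = i;
            return;
        }
    }
    // Map full: drain before inserting
    if (map->used == MID_INUSE_TLS_MAP_SIZE) mid_inuse_drain(map);
    // New entry: cache its index, since a fresh page is likely to be reused
    uint32_t idx = map->used++;
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx    = idx;
}
```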
## Benchmark Results

### Test Configuration

- Benchmark: `bench_mid_large_mt_hakmem`
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: Random
### Performance Data

| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|--------|----------------------|-------------------------------|--------|
| **Median throughput** | 9.08M ops/s | 8.38M ops/s | **-7.6%** |
| **Mean throughput** | 9.04M ops/s | 8.25M ops/s | -8.7% |
| **Min throughput** | 7.81M ops/s | 7.34M ops/s | -6.0% |
| **Max throughput** | 9.71M ops/s | 8.77M ops/s | -9.7% |
| **Variance** | 300B | 207B | **-31%** (improvement) |
| **Std Dev** | 548K | 455K | -17% |
### Raw Results

**Baseline (10 runs)**:
```
8,720,875  9,147,207  9,709,755  8,708,904  9,541,168
9,322,187  9,005,728  8,994,402  7,808,414  9,459,910
```

**Deferred with Last-Match Cache (20 runs)**:
```
8,323,016  7,963,325  8,578,296  8,313,354  8,314,545
7,445,113  7,518,391  8,610,739  8,770,947  7,338,433
8,668,194  7,797,795  7,882,001  8,442,375  8,564,862
7,950,541  8,552,224  8,548,635  8,636,063  8,742,399
```
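As a sanity check, the baseline summary statistics in the table above can be recomputed from the 10 raw runs; a small standalone sketch, not part of the allocator:

```c
#include <stdlib.h>

// Baseline throughput samples (ops/s) from the 10 runs above
static double g_runs[10] = {
    8720875, 9147207, 9709755, 8708904, 9541168,
    9322187, 9005728, 8994402, 7808414, 9459910,
};

static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

// Recompute mean and median of the baseline runs
static void summarize(double* mean, double* median) {
    double sum = 0;
    for (int i = 0; i < 10; i++) sum += g_runs[i];
    *mean = sum / 10;
    qsort(g_runs, 10, sizeof g_runs[0], cmp_double);
    *median = (g_runs[4] + g_runs[5]) / 2;  // even count: average the middle pair
}
```

This reproduces the table's 9.04M ops/s mean and 9.08M ops/s median.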
## Analysis

### What Worked

- **Variance reduction**: the -31% improvement in variance confirms that the deferred approach provides more stable performance
- **Cache mechanism**: the last_idx optimization is correctly implemented and should help in workloads with better temporal locality
### Why Regression Persists

**Access Pattern Mismatch**:
- Expected: 60-80% cache hit rate (consecutive frees from the same page)
- Reality: bench_mid_large_mt uses random access across 2048 slots
- Result: poor temporal locality → low cache hit rate → linear search dominates
**Cost Breakdown**:
```
Original (no deferred):
  mid_desc_lookup:    ~10 cycles
  atomic operations:   ~5 cycles
  Total per free:     ~15 cycles

Deferred (with last-match cache):
  last_idx check:      ~2 cycles (on miss)
  linear search:      ~32 cycles (avg 16 iterations × 2 ops)
  Total per free:     ~34 cycles (2.3× slower)

Expected with 70% hit rate:
  70% hits:            ~2 cycles each
  30% searches:       ~10 cycles each
  Total per free:    ~4.4 cycles (~3.4× faster than the 15-cycle baseline)
```

The cache hit rate for this benchmark is likely below 30%, making the deferred path slower than the baseline.
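Under the assumed costs above (~2-cycle hit, ~32-cycle full search, ~15-cycle baseline), one can solve for the hit rate at which the deferred path breaks even with the baseline; a quick model sketch:

```c
// Break-even cache hit rate for the cost model above:
//   expected cycles = hit_rate * HIT_COST + (1 - hit_rate) * MISS_COST
// Setting this equal to the ~15-cycle baseline and solving for hit_rate.
static const double HIT_COST  = 2.0;   // last_idx check on a hit
static const double MISS_COST = 32.0;  // full linear search on a miss
static const double BASELINE  = 15.0;  // mid_desc_lookup + atomics

// Expected per-free cost of the deferred path at a given hit rate
static double deferred_cost(double hit_rate) {
    return hit_rate * HIT_COST + (1.0 - hit_rate) * MISS_COST;
}

// hit_rate * HIT + (1 - hit_rate) * MISS = BASELINE
//   => hit_rate = (MISS - BASELINE) / (MISS - HIT)
static double break_even_hit_rate(void) {
    return (MISS_COST - BASELINE) / (MISS_COST - HIT_COST);
}
```

With these assumed costs the break-even hit rate is about 57% (17/30), so a sub-30% hit rate, as suspected for this benchmark, cannot win.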
## Conclusion

### Success Criteria (Original)

- [✗] No regression: median deferred >= median baseline (**Failed**: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (**Success**: -31%)
- [✗] No outliers: all runs within 20% of median (**Failed**: still has variance)

### Deliverables

- [✓] last_idx field added to MidInuseTlsPageMap
- [✓] Fast-path check before linear search
- [✓] Cache update on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)
## Next Steps

The last-match cache is necessary but insufficient on its own. Additional optimizations are needed:
### Option A: Hash-Based Lookup

Replace the linear search with a simple hash:

```c
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
```

- Pro: O(1) expected lookup
- Con: requires handling collisions
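One way Option A could handle collisions is open addressing with linear probing. A hedged sketch: `map_hash_record_dec` is an illustrative name, and the `>> 16` shift assumes 64 KiB-aligned page bases, which would need to be verified against the actual mid-page layout.

```c
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32  // must stay a power of two for the mask below
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

typedef struct {
    void*    pages[MAP_SIZE];   // NULL marks an empty slot
    uint32_t counts[MAP_SIZE];
} HashPageMap;

// Record one deferred dec for `page` using linear probing.
// Returns 0 on success, -1 if the table is full (caller should drain).
static int map_hash_record_dec(HashPageMap* map, void* page) {
    uint32_t home = MAP_HASH(page);
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (home + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {    // existing entry: O(1) expected
            map->counts[i]++;
            return 0;
        }
        if (map->pages[i] == NULL) {    // empty slot: claim it
            map->pages[i]  = page;
            map->counts[i] = 1;
            return 0;
        }
    }
    return -1;  // table full: drain and retry
}
```

A drain would still walk all slots, so the batch-apply cost is unchanged; only the per-free lookup gets cheaper.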
### Option B: Reduce Map Size

Use 8 or 16 entries instead of 32:

- Pro: fewer iterations per search
- Con: more frequent drains (overhead moves to the drain path)
### Option C: Better Drain Boundaries

Drain more frequently at natural boundaries:

- After N allocations (not just when the map is full)
- At refill/slow-path transitions
- Pro: keeps the map small and searches fast
- Con: more drain calls (must be benchmarked)
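The first boundary above (drain every N operations) could be as simple as a per-thread counter checked on each record; a minimal sketch, with `DRAIN_INTERVAL` as a hypothetical tuning knob:

```c
#include <stdint.h>

#define DRAIN_INTERVAL 256  // hypothetical: drain after this many records

typedef struct {
    uint32_t pending_ops;   // records since the last drain
} DrainPolicy;

// Called once per recorded dec, independent of map occupancy.
// Returns 1 when the caller should drain the TLS map now.
static int drain_policy_tick(DrainPolicy* p) {
    if (++p->pending_ops >= DRAIN_INTERVAL) {
        p->pending_ops = 0;
        return 1;
    }
    return 0;
}
```

The interval caps how long any inuse_dec stays deferred, which also bounds how stale the page counters can get.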
### Option D: MRU (Most Recently Used) Ordering

Keep recently used entries at the front of the array:

- Pro: common pages are found faster
- Con: array reordering overhead
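Option D could be approximated with a single swap per hit (the classic transpose heuristic) rather than a full reorder, keeping the "Con" overhead to two element swaps. A hedged sketch; `map_mru_record_dec` is an illustrative name:

```c
#include <stdint.h>

#define MAP_SIZE 32

typedef struct {
    void*    pages[MAP_SIZE];
    uint32_t counts[MAP_SIZE];
    uint32_t used;
} MruPageMap;

// Record one deferred dec; on a hit, swap the entry one slot toward the
// front so frequently freed pages terminate the linear search early.
static void map_mru_record_dec(MruPageMap* map, void* page) {
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            if (i > 0) {  // transpose: one swap, not a full move-to-front
                void*    tp = map->pages[i - 1];
                uint32_t tc = map->counts[i - 1];
                map->pages[i - 1]  = map->pages[i];
                map->counts[i - 1] = map->counts[i];
                map->pages[i]  = tp;
                map->counts[i] = tc;
            }
            return;
        }
    }
    // New entry at the back (assumes the caller drains when full)
    map->pages[map->used]  = page;
    map->counts[map->used] = 1;
    map->used++;
}
```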
### Recommendation

Try **Option A (hash-based)** first, as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.
## Related Documents

- [POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md](./POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md) - Original design
- [POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md](./POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md) - Root cause analysis
## Commit

```
commit 6c849fd02
Author: ...
Date: 2025-12-13

POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
```