# POOL-MID-DN-BATCH: Last-Match Cache Implementation

**Date**: 2025-12-13
**Phase**: POOL-MID-DN-BATCH optimization
**Status**: Implemented but insufficient for full regression fix

## Problem Statement

The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root cause analysis revealed:

- **Linear search overhead**: Average 16 iterations in the 32-entry TLS map
- **Instruction count**: +7.4% increase on the hot path
- **Hot path cost**: Linear search exceeded the savings from eliminating mid_desc_lookup

## Solution: Last-Match Cache

Added a `last_idx` field to exploit temporal locality: the assumption that consecutive frees often target the same page.

### Implementation

#### 1. Structure Change (`pool_mid_inuse_tls_pagemap_box.h`)

```c
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // NEW: Cache last hit index
} MidInuseTlsPageMap;
```

#### 2. Lookup Logic (`pool_mid_inuse_deferred_box.h`)

**Before**:

```c
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        return;
    }
}
```

**After**:

```c
// Check last match first (O(1) fast path)
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
    map->counts[map->last_idx]++;
    return;  // Early exit on cache hit
}

// Fallback to linear search
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        map->last_idx = i;  // Update cache
        return;
    }
}
```

#### 3. Cache Maintenance

- **On new entry**: `map->last_idx = idx;` (new page likely to be reused)
- **On drain**: `map->last_idx = 0;` (reset for next batch)
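Putting the three pieces together, here is a minimal sketch of the full record path with the cache-maintenance rules applied. Only the struct and the lookup snippets above come from the actual code; the function names (`mid_inuse_map_record`, `mid_inuse_map_drain`) and the map-full handling are assumptions for illustration.

```c
#include <stdint.h>

#define MID_INUSE_TLS_MAP_SIZE 32

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // Cache of last hit index
} MidInuseTlsPageMap;

/* Hypothetical drain hook: flushes pending decrements to the shared page
 * descriptors and empties the map. The real drain path is not shown in
 * this note, so the body is a stub. */
static void mid_inuse_map_drain(MidInuseTlsPageMap* map)
{
    /* ... apply map->counts[i] to each page's inuse counter ... */
    map->used = 0;
    map->last_idx = 0;   // "On drain" rule: reset for the next batch
}

/* Record one deferred inuse_dec for `page` (illustrative name). */
static void mid_inuse_map_record(MidInuseTlsPageMap* map, void* page)
{
    // O(1) fast path: re-hit the last matched entry
    if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
        map->counts[map->last_idx]++;
        return;
    }

    // Fallback: linear search, updating the cache on a hit
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            map->last_idx = i;
            return;
        }
    }

    // Map full: drain, then start a fresh batch
    if (map->used == MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_map_drain(map);
    }

    // New entry: the new page is likely to be hit again, so cache its index
    uint32_t idx = map->used++;
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx    = idx;   // "On new entry" rule
}
```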
## Benchmark Results

### Test Configuration

- Benchmark: `bench_mid_large_mt_hakmem`
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: Random

### Performance Data

| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|--------|----------------------|--------------------------------|--------|
| **Median throughput** | 9.08M ops/s | 8.38M ops/s | **-7.6%** |
| **Mean throughput** | 9.04M ops/s | 8.25M ops/s | -8.7% |
| **Min throughput** | 7.81M ops/s | 7.34M ops/s | -6.0% |
| **Max throughput** | 9.71M ops/s | 8.77M ops/s | -9.7% |
| **Variance** | 300B | 207B | **-31%** (improvement) |
| **Std Dev** | 548K | 455K | -17% |

### Raw Results

**Baseline (10 runs)**:

```
8,720,875  9,147,207  9,709,755  8,708,904  9,541,168
9,322,187  9,005,728  8,994,402  7,808,414  9,459,910
```

**Deferred with Last-Match Cache (20 runs)**:

```
8,323,016  7,963,325  8,578,296  8,313,354  8,314,545
7,445,113  7,518,391  8,610,739  8,770,947  7,338,433
8,668,194  7,797,795  7,882,001  8,442,375  8,564,862
7,950,541  8,552,224  8,548,635  8,636,063  8,742,399
```

## Analysis

### What Worked

- **Variance reduction**: The -31% drop in variance confirms that the deferred approach provides more stable performance
- **Cache mechanism**: The last_idx optimization is correctly implemented and should help in workloads with better temporal locality

### Why Regression Persists

**Access Pattern Mismatch**:

- Expected: 60-80% cache hit rate (consecutive frees from same page)
- Reality: bench_mid_large_mt uses random access across 2048 slots
- Result: Poor temporal locality → low cache hit rate → linear search dominates

**Cost Breakdown**:

```
Original (no deferred):
  mid_desc_lookup:     ~10 cycles
  atomic operations:    ~5 cycles
  Total per free:      ~15 cycles

Deferred (with last-match cache):
  last_idx check:       ~2 cycles (on miss)
  linear search:       ~32 cycles (avg 16 iterations × 2 ops)
  Total per free:      ~34 cycles (2.3× slower)

Expected with 70% hit rate:
  70% hits:             ~2 cycles
  30% searches:        ~10 cycles
  Total per free:     ~4.4 cycles (2.9× faster)
```

The cache hit rate for this benchmark is likely <30%, making the deferred path slower than the baseline.

## Conclusion

### Success Criteria (Original)

- [✗] No regression: median deferred >= median baseline (**Failed**: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (**Success**: -31%)
- [✗] No outliers: all runs within 20% of median (**Failed**: still has variance)

### Deliverables

- [✓] last_idx field added to MidInuseTlsPageMap
- [✓] Fast-path check before linear search
- [✓] Cache update on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)

## Next Steps

The last-match cache is necessary but insufficient. Additional optimizations are needed:

### Option A: Hash-Based Lookup

Replace the linear search with a simple hash (a minimal sketch appears in the appendix below):

```c
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
```

- Pro: O(1) expected lookup
- Con: Requires handling collisions

### Option B: Reduce Map Size

Use 8 or 16 entries instead of 32:

- Pro: Fewer iterations on search
- Con: More frequent drains (overhead moves to drain)

### Option C: Better Drain Boundaries

Drain more frequently at natural boundaries:

- After N allocations (not just on map full)
- At refill/slow path transitions
- Pro: Keeps map small, searches fast
- Con: More drain calls (must benchmark)

### Option D: MRU (Most Recently Used) Ordering

Keep recently used entries at the front of the array:

- Pro: Common pages found faster
- Con: Array reordering overhead

### Recommendation

Try **Option A (hash-based)** first, as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.

## Related Documents

- [POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md](./POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md) - Original design
- [POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md](./POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md) - Root cause analysis

## Commit

```
commit 6c849fd02
Author: ...
Date:   2025-12-13

    POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
```
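## Appendix: Option A Sketch

For reference, a minimal sketch of what the Option A hash-based lookup could look like, built around the `MAP_HASH` macro from the Next Steps section. Open addressing with linear probing is only one possible collision strategy; the `MidInuseHashMap` type and `hash_map_record` function are hypothetical names for illustration, not existing code.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE 32   // must stay a power of two for the mask below
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

typedef struct {
    void*    pages[MAP_SIZE];    // NULL means the slot is empty
    uint32_t counts[MAP_SIZE];   // Pending dec count per page
} MidInuseHashMap;

/* Record one deferred dec for `page`. Returns 1 on success, 0 if the
 * probe sequence is full (caller would drain and retry). */
static int hash_map_record(MidInuseHashMap* map, void* page)
{
    uint32_t home = (uint32_t)MAP_HASH(page);

    // Open addressing with linear probing to resolve collisions
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (home + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {      // existing entry: O(1) expected
            map->counts[i]++;
            return 1;
        }
        if (map->pages[i] == NULL) {      // empty slot: claim it
            map->pages[i]  = page;
            map->counts[i] = 1;
            return 1;
        }
    }
    return 0;   // table full along this probe path
}
```

The trade-off Option A makes is visible here: the expected cost is one or two probes instead of an average 16-iteration scan, but a full table still forces a drain-and-retry path, so drain frequency would need to be benchmarked alongside the lookup change.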