# POOL-MID-DN-BATCH: Last-Match Cache Implementation
**Date**: 2025-12-13
**Phase**: POOL-MID-DN-BATCH optimization
**Status**: Implemented but insufficient for full regression fix
## Problem Statement
The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root cause analysis revealed:
- **Linear search overhead**: an average of 16 iterations through the 32-entry TLS map
- **Instruction count**: +7.4% increase on the hot path
- **Hot path cost**: the linear search exceeded the savings from eliminating `mid_desc_lookup`
## Solution: Last-Match Cache
Added a `last_idx` field to exploit temporal locality, i.e. the assumption that consecutive frees often target the same page.
### Implementation
#### 1. Structure Change (`pool_mid_inuse_tls_pagemap_box.h`)
```c
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // NEW: cache of last hit index
} MidInuseTlsPageMap;
```
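
For orientation, a minimal sketch of how such a map would typically be instantiated per thread (the variable name and initializer here are illustrative; the actual definition lives in the box header):

```c
// Hypothetical per-thread instance; zero-initialized, so used == 0 and
// last_idx == 0 on first use. The real definition is in
// pool_mid_inuse_tls_pagemap_box.h.
static __thread MidInuseTlsPageMap g_mid_inuse_tls_map = {0};
```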
#### 2. Lookup Logic (`pool_mid_inuse_deferred_box.h`)
**Before**:
```c
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        return;
    }
}
```
**After**:
```c
// Check last match first (O(1) fast path). The bounds check also guards
// against a stale index after a drain resets the map.
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
    map->counts[map->last_idx]++;
    return;  // Early exit on cache hit
}
// Fallback to linear search
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        map->last_idx = i;  // Update cache for the next free
        return;
    }
}
```
#### 3. Cache Maintenance
- **On new entry**: `map->last_idx = idx;` (a newly inserted page is likely to be hit again)
- **On drain**: `map->last_idx = 0;` (reset for the next batch; both points are sketched below)
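
A minimal sketch of those two maintenance points, using hypothetical helper names (`map_insert`, `map_drain`) and an assumed drain-callback shape; the real code lives in `pool_mid_inuse_deferred_box.h`:

```c
// New entry: record the page and seed the cache with its index.
static void map_insert(MidInuseTlsPageMap* map, void* page) {
    uint32_t idx = map->used++;   // caller drains first if the map is full
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx    = idx;       // a new page is likely to be hit next
}

// Drain: flush every pending decrement, then reset the map and the cache.
static void map_drain(MidInuseTlsPageMap* map,
                      void (*apply_dec)(void* page, uint32_t n)) {
    for (uint32_t i = 0; i < map->used; i++)
        apply_dec(map->pages[i], map->counts[i]);
    map->used     = 0;
    map->last_idx = 0;            // any old index would now be out of range
}
```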
## Benchmark Results
### Test Configuration
- Benchmark: `bench_mid_large_mt_hakmem`
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: Random
### Performance Data
| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|--------|----------------------|-------------------------------|--------|
| **Median throughput** | 9.08M ops/s | 8.38M ops/s | **-7.6%** |
| **Mean throughput** | 9.04M ops/s | 8.25M ops/s | -8.7% |
| **Min throughput** | 7.81M ops/s | 7.34M ops/s | -6.0% |
| **Max throughput** | 9.71M ops/s | 8.77M ops/s | -9.7% |
| **Variance** | 300B (ops/s)² | 207B (ops/s)² | **-31%** (improvement) |
| **Std Dev** | 548K ops/s | 455K ops/s | -17% |
### Raw Results
**Baseline (10 runs)**:
```
8,720,875 9,147,207 9,709,755 8,708,904 9,541,168
9,322,187 9,005,728 8,994,402 7,808,414 9,459,910
```
**Deferred with Last-Match Cache (20 runs)**:
```
8,323,016 7,963,325 8,578,296 8,313,354 8,314,545
7,445,113 7,518,391 8,610,739 8,770,947 7,338,433
8,668,194 7,797,795 7,882,001 8,442,375 8,564,862
7,950,541 8,552,224 8,548,635 8,636,063 8,742,399
```
## Analysis
### What Worked
- **Variance reduction**: -31% improvement in variance confirms that the deferred approach provides more stable performance
- **Cache mechanism**: the `last_idx` optimization is correctly implemented and should help in workloads with better temporal locality
### Why Regression Persists
**Access Pattern Mismatch**:
- Expected: 60-80% cache hit rate (consecutive frees from same page)
- Reality: bench_mid_large_mt uses random access across 2048 slots
- Result: Poor temporal locality → low cache hit rate → linear search dominates
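
One way to confirm the low-hit-rate hypothesis directly would be a debug-only hit/miss counter around the fast path; a minimal sketch (the gate and counter names are hypothetical, not part of the current code):

```c
#include <stdint.h>

// Hypothetical debug-only counters; compiled out in normal builds.
#ifdef MID_INUSE_CACHE_STATS
static __thread uint64_t g_lm_hits, g_lm_misses;
#  define LM_HIT()  ((void)g_lm_hits++)
#  define LM_MISS() ((void)g_lm_misses++)
#else
#  define LM_HIT()  ((void)0)
#  define LM_MISS() ((void)0)
#endif
// Usage: call LM_HIT() inside the last_idx branch and LM_MISS() where the
// lookup falls through to the linear search, then report
// g_lm_hits / (g_lm_hits + g_lm_misses) at drain time or thread exit.
```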
**Cost Breakdown**:
```
Original (no deferred):
  mid_desc_lookup:     ~10 cycles
  atomic operations:    ~5 cycles
  Total per free:      ~15 cycles

Deferred (with last-match cache):
  last_idx check:       ~2 cycles (paid even on a miss)
  linear search:       ~32 cycles (avg 16 iterations × 2 ops)
  Total per free:      ~34 cycles (~2.3× slower than baseline)

Expected with 70% hit rate:
  70% hits:      ~2 cycles each  → 0.7 × 2  = 1.4
  30% searches: ~10 cycles each  → 0.3 × 10 = 3.0  (short search assumed)
  Total per free: ~4.4 cycles (~3.4× faster than the ~15-cycle baseline)
```
The cache hit rate for this benchmark is likely <30%, making it slower than the baseline.
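
A useful sanity check under the cost model above: with a ~2-cycle hit and a ~34-cycle miss path, the break-even hit rate p against the ~15-cycle baseline solves 2p + 34(1 − p) = 15, giving p ≈ 0.59. The cache would need to hit on roughly 60% of frees just to match the baseline, which a random pattern over 2048 slots cannot plausibly deliver.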
## Conclusion
### Success Criteria (Original)
- [✗] No regression: median deferred >= median baseline (**Failed**: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (**Success**: -31%)
- [✗] No outliers: all runs within 20% of median (**Failed**: wide run-to-run spread remains)
### Deliverables
- [✓] last_idx field added to MidInuseTlsPageMap
- [✓] Fast-path check before linear search
- [✓] Cache update on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)
## Next Steps
The last-match cache is necessary but not sufficient on its own. Additional optimizations are needed:
### Option A: Hash-Based Lookup
Replace linear search with simple hash:
```c
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
```
- Pro: O(1) expected lookup (assumes `MAP_SIZE` is a power of two so the mask works)
- Con: Requires handling collisions (see the probing sketch below)
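
A minimal sketch of what collision handling could look like with open addressing (linear probing); everything here beyond `MidInuseTlsPageMap` is illustrative, not existing code:

```c
#include <stdint.h>   /* uintptr_t */
#include <stddef.h>   /* NULL */

// Assumes MAP_SIZE == MID_INUSE_TLS_MAP_SIZE and a power-of-two size;
// pages[] doubles as the hash table, with NULL marking an empty slot.
#define MAP_SIZE  MID_INUSE_TLS_MAP_SIZE
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

static uint32_t map_find_or_insert(MidInuseTlsPageMap* map, void* page) {
    uint32_t idx = MAP_HASH(page);
    // Linear probing; terminates because the map is drained before it fills.
    while (map->pages[idx] != NULL && map->pages[idx] != page)
        idx = (idx + 1) & (MAP_SIZE - 1);
    if (map->pages[idx] == NULL) {   // empty slot: claim it for this page
        map->pages[idx]  = page;
        map->counts[idx] = 0;
        map->used++;                 // still drives the drain threshold
    }
    return idx;                      // caller then does map->counts[idx]++
}
```

One consequence: entries become sparse, so drain must scan all `MAP_SIZE` slots (resetting `pages[i]` to NULL) rather than the first `used` entries, and the map must be drained strictly before it fills or the probe loop cannot terminate.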
### Option B: Reduce Map Size
Use 8 or 16 entries instead of 32:
- Pro: Fewer iterations on search
- Con: More frequent drains (overhead moves to drain)
### Option C: Better Drain Boundaries
Drain more frequently at natural boundaries:
- After N allocations (not just on map full)
- At refill/slow path transitions
- Pro: Keeps map small, searches fast
- Con: More drain calls (must benchmark)
### Option D: MRU (Most Recently Used) Ordering
Keep recently used entries at front of array:
- Pro: Common pages found faster
- Con: Array reordering overhead
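
For concreteness, the reordering could look like this (hypothetical helper, not existing code): on a linear-search hit at index `i`, swap the entry with slot 0 so frequently freed pages are found in the first iterations.

```c
// Move-to-front on a linear-search hit at index i (illustrative only).
static void map_move_to_front(MidInuseTlsPageMap* map, uint32_t i) {
    if (i == 0) return;
    void*    tp = map->pages[0];
    uint32_t tc = map->counts[0];
    map->pages[0]  = map->pages[i];  map->pages[i]  = tp;
    map->counts[0] = map->counts[i]; map->counts[i] = tc;
    map->last_idx = 0;   // the hit entry now lives at the front
}
```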
### Recommendation
Try **Option A (hash-based)** first as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.
## Related Documents
- [POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md](./POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md) - Original design
- [POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md](./POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md) - Root cause analysis
## Commit
```
commit 6c849fd02
Author: ...
Date: 2025-12-13
POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
```