# POOL-MID-DN-BATCH: Last-Match Cache Implementation
**Date**: 2025-12-13

**Phase**: POOL-MID-DN-BATCH optimization

**Status**: Implemented but insufficient for full regression fix
## Problem Statement

The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root-cause analysis revealed:

- **Linear search overhead**: an average of 16 iterations in the 32-entry TLS map
- **Instruction count**: +7.4% increase on the hot path
- **Hot-path cost**: the linear search outweighed the savings from eliminating `mid_desc_lookup`
## Solution: Last-Match Cache

Added a `last_idx` field to exploit temporal locality: the assumption that consecutive frees often target the same page.
### Implementation

#### 1. Structure Change (`pool_mid_inuse_tls_pagemap_box.h`)
```c
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // NEW: cache of the last hit index
} MidInuseTlsPageMap;
```

#### 2. Lookup Logic (`pool_mid_inuse_deferred_box.h`)
**Before**:

```c
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        return;
    }
}
```

**After**:
```c
// Check the last match first (O(1) fast path)
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
    map->counts[map->last_idx]++;
    return;  // Early exit on cache hit
}

// Fall back to linear search
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        map->last_idx = i;  // Update the cache
        return;
    }
}
```

#### 3. Cache Maintenance
- **On new entry**: `map->last_idx = idx;` (a new page is likely to be reused)
- **On drain**: `map->last_idx = 0;` (reset for the next batch)
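A minimal sketch of how these two maintenance rules can fit together. The flush hook `apply_pending_dec` and the helper names are illustrative, not the actual implementation in `pool_mid_inuse_deferred_box.h`:

```c
#include <stdint.h>
#include <stddef.h>

#define MID_INUSE_TLS_MAP_SIZE 32

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
    uint32_t last_idx;
} MidInuseTlsPageMap;

/* Hypothetical flush hook: applies `count` pending inuse decrements to `page`.
 * The real code would update the page descriptor here. */
static void apply_pending_dec(void* page, uint32_t count) {
    (void)page; (void)count;
}

/* New entry: append and point the cache at it
 * (a new page is likely to be reused immediately). */
static void map_insert(MidInuseTlsPageMap* map, void* page) {
    uint32_t idx = map->used++;
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx    = idx;
}

/* Drain: flush every pending count, then reset the map and the cache. */
static void map_drain(MidInuseTlsPageMap* map) {
    for (uint32_t i = 0; i < map->used; i++) {
        apply_pending_dec(map->pages[i], map->counts[i]);
        map->counts[i] = 0;
    }
    map->used     = 0;
    map->last_idx = 0;  /* reset for the next batch */
}
```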
## Benchmark Results

### Test Configuration

- Benchmark: `bench_mid_large_mt_hakmem`
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: random
### Performance Data

| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|--------|----------------------|--------------------------------|--------|
| **Median throughput** | 9.08M ops/s | 8.38M ops/s | **-7.6%** |
| **Mean throughput** | 9.04M ops/s | 8.25M ops/s | -8.7% |
| **Min throughput** | 7.81M ops/s | 7.34M ops/s | -6.0% |
| **Max throughput** | 9.71M ops/s | 8.77M ops/s | -9.7% |
| **Variance** | 300B | 207B | **-31%** (improvement) |
| **Std Dev** | 548K | 455K | -17% |
### Raw Results

**Baseline (10 runs)**:

```
8,720,875 9,147,207 9,709,755 8,708,904 9,541,168
9,322,187 9,005,728 8,994,402 7,808,414 9,459,910
```

**Deferred with Last-Match Cache (20 runs)**:

```
8,323,016 7,963,325 8,578,296 8,313,354 8,314,545
7,445,113 7,518,391 8,610,739 8,770,947 7,338,433
8,668,194 7,797,795 7,882,001 8,442,375 8,564,862
7,950,541 8,552,224 8,548,635 8,636,063 8,742,399
```
## Analysis

### What Worked

- **Variance reduction**: the -31% drop in variance confirms that the deferred approach delivers more stable performance
- **Cache mechanism**: the `last_idx` optimization is correctly implemented and should help in workloads with stronger temporal locality
### Why Regression Persists

**Access Pattern Mismatch**:

- Expected: 60-80% cache hit rate (consecutive frees from the same page)
- Reality: `bench_mid_large_mt` uses random access across 2048 slots
- Result: poor temporal locality → low cache hit rate → linear search dominates
**Cost Breakdown**:

```
Original (no deferred):
  mid_desc_lookup:    ~10 cycles
  atomic operations:   ~5 cycles
  Total per free:     ~15 cycles

Deferred (with last-match cache):
  last_idx check:      ~2 cycles (on miss)
  linear search:      ~32 cycles (avg 16 iterations × 2 ops)
  Total per free:     ~34 cycles (2.3× slower)

Expected with 70% hit rate:
  70% hits:            ~2 cycles
  30% searches:       ~10 cycles
  Total per free:     ~4.4 cycles (2.9× faster)
```
The cache hit rate for this benchmark is likely <30%, making it slower than the baseline.
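This claim can be checked with a small expected-cost model. The cycle figures are taken from the breakdown above; the break-even calculation is arithmetic implied by those figures, not a new measurement:

```c
/* Expected per-free cost of the cached lookup, in cycles, given a
 * cache hit rate `p`, a hit cost, and an average miss (search) cost. */
static double expected_cost(double p, double hit_cycles, double miss_cycles) {
    return p * hit_cycles + (1.0 - p) * miss_cycles;
}

/* Hit rate at which the cached lookup matches a fixed baseline cost:
 * solve p*hit + (1-p)*miss == baseline for p. */
static double break_even_hit_rate(double baseline, double hit_cycles,
                                  double miss_cycles) {
    return (miss_cycles - baseline) / (miss_cycles - hit_cycles);
}
```

With the document's own estimates, `expected_cost(0.7, 2, 10)` reproduces the ~4.4-cycle figure, and `break_even_hit_rate(15, 2, 34)` puts break-even against the ~15-cycle baseline at roughly a 59% hit rate — well above what a random access pattern delivers.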
## Conclusion

### Success Criteria (Original)

- [✗] No regression: median deferred >= median baseline (**Failed**: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (**Success**: -31%)
- [✗] No outliers: all runs within 20% of the median (**Failed**: outliers remain)
### Deliverables

- [✓] `last_idx` field added to `MidInuseTlsPageMap`
- [✓] Fast-path check before the linear search
- [✓] Cache updated on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)

## Next Steps

The last-match cache is necessary but insufficient on its own. Additional optimizations are needed:
### Option A: Hash-Based Lookup

Replace the linear search with a simple hash:

```c
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
```

- Pro: O(1) expected lookup
- Con: requires collision handling
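One way the hash variant could look, sketched under the assumption of open addressing with linear probing over the same fixed-size arrays (the map type and `map_record` name are illustrative, not the actual implementation):

```c
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32  /* must be a power of two for the mask to work */
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

typedef struct {
    void*    pages[MAP_SIZE];   /* NULL marks an empty slot */
    uint32_t counts[MAP_SIZE];
} MidInuseHashMap;

/* Record one pending dec for `page`.
 * Returns 1 on success, 0 when the map is full (caller drains and retries). */
static int map_record(MidInuseHashMap* map, void* page) {
    uint32_t start = MAP_HASH(page);
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (start + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {   /* existing entry: O(1) expected */
            map->counts[i]++;
            return 1;
        }
        if (map->pages[i] == NULL) {   /* empty slot: claim it */
            map->pages[i]  = page;
            map->counts[i] = 1;
            return 1;
        }
    }
    return 0;  /* no free slot found */
}
```

Probing keeps collision handling branch-light; drain would reset `pages[]` to NULL rather than just zeroing `used`.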
### Option B: Reduce Map Size

Use 8 or 16 entries instead of 32:

- Pro: fewer iterations per search
- Con: more frequent drains (the overhead moves to the drain path)
### Option C: Better Drain Boundaries

Drain more frequently at natural boundaries:

- After N allocations (not just when the map is full)
- At refill/slow-path transitions
- Pro: keeps the map small and searches fast
- Con: more drain calls (must be benchmarked)
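The "after N allocations" trigger can be sketched as a simple per-map counter. `MID_DRAIN_INTERVAL` and the type/function names are hypothetical placeholders for whatever threshold benchmarking settles on:

```c
#include <stdint.h>

#define MID_DRAIN_INTERVAL 128  /* hypothetical: drain every N recorded ops */

typedef struct {
    uint32_t ops_since_drain;
} DrainClock;

/* Tick once per recorded operation; returns 1 when a drain is due
 * and resets the counter so the next interval starts fresh. */
static int drain_due(DrainClock* c) {
    if (++c->ops_since_drain >= MID_DRAIN_INTERVAL) {
        c->ops_since_drain = 0;
        return 1;
    }
    return 0;
}
```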
### Option D: MRU (Most Recently Used) Ordering

Keep recently used entries at the front of the array:

- Pro: common pages are found faster
- Con: array-reordering overhead
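A cheap variant of this idea is one-step promotion (swap a hit entry with its predecessor) rather than a full move-to-front, which bounds the reordering cost to one swap per hit. A sketch, with illustrative names:

```c
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32

typedef struct {
    void*    pages[MAP_SIZE];
    uint32_t counts[MAP_SIZE];
    uint32_t used;
} MruMap;

/* After a hit at index `i`, swap the entry one slot toward the front so
 * hot pages drift into the first iterations of the next linear search. */
static void mru_promote(MruMap* map, uint32_t i) {
    if (i == 0) return;              /* already at the front */
    uint32_t j = i - 1;
    void* tp = map->pages[i];
    map->pages[i] = map->pages[j];
    map->pages[j] = tp;
    uint32_t tc = map->counts[i];
    map->counts[i] = map->counts[j];
    map->counts[j] = tc;
}
```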
### Recommendation

Try **Option A (hash-based)** first, as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.
## Related Documents

- [POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md](./POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md) - original design
- [POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md](./POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md) - root-cause analysis
## Commit

```
commit 6c849fd02
Author: ...
Date: 2025-12-13

POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
```