# POOL-MID-DN-BATCH: Last-Match Cache Implementation

**Date**: 2025-12-13
**Phase**: POOL-MID-DN-BATCH optimization
**Status**: Implemented but insufficient for full regression fix

## Problem Statement

The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root cause analysis revealed:

- **Linear search overhead**: Average 16 iterations in the 32-entry TLS map
- **Instruction count**: +7.4% increase on the hot path
- **Hot path cost**: Linear search exceeded the savings from eliminating mid_desc_lookup

## Solution: Last-Match Cache

Added a `last_idx` field to exploit temporal locality: the assumption that consecutive frees often target the same page.

### Implementation

#### 1. Structure Change (`pool_mid_inuse_tls_pagemap_box.h`)

```c
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // NEW: Cache last hit index
} MidInuseTlsPageMap;
```

#### 2. Lookup Logic (`pool_mid_inuse_deferred_box.h`)

**Before**:

```c
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        return;
    }
}
```

**After**:

```c
// Check last match first (O(1) fast path)
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
    map->counts[map->last_idx]++;
    return;  // Early exit on cache hit
}

// Fallback to linear search
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        map->last_idx = i;  // Update cache
        return;
    }
}
```

#### 3. Cache Maintenance

- **On new entry**: `map->last_idx = idx;` (new page likely to be reused)
- **On drain**: `map->last_idx = 0;` (reset for next batch)
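Putting the three pieces together, here is a minimal sketch of the full record path with the cache-maintenance rules applied. Only the struct and the lookup snippets above come from the actual code; the function names (`mid_inuse_map_record`, `mid_inuse_map_drain`) and the map-full handling are assumptions for illustration.

```c
#include <stdint.h>

#define MID_INUSE_TLS_MAP_SIZE 32

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // Cache of last hit index
} MidInuseTlsPageMap;

/* Hypothetical drain hook: flushes pending decrements to the shared page
 * descriptors and empties the map. The real drain path is not shown in
 * this note, so the body is a stub. */
static void mid_inuse_map_drain(MidInuseTlsPageMap* map)
{
    /* ... apply map->counts[i] to each page's inuse counter ... */
    map->used = 0;
    map->last_idx = 0;   // "On drain" rule: reset for the next batch
}

/* Record one deferred inuse_dec for `page` (illustrative name). */
static void mid_inuse_map_record(MidInuseTlsPageMap* map, void* page)
{
    // O(1) fast path: re-hit the last matched entry
    if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
        map->counts[map->last_idx]++;
        return;
    }

    // Fallback: linear search, updating the cache on a hit
    for (uint32_t i = 0; i < map->used; i++) {
        if (map->pages[i] == page) {
            map->counts[i]++;
            map->last_idx = i;
            return;
        }
    }

    // Map full: drain, then start a fresh batch
    if (map->used == MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_map_drain(map);
    }

    // New entry: the new page is likely to be hit again, so cache its index
    uint32_t idx = map->used++;
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx    = idx;   // "On new entry" rule
}
```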
## Benchmark Results

### Test Configuration

- Benchmark: `bench_mid_large_mt_hakmem`
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: Random

### Performance Data

| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|--------|----------------------|--------------------------------|--------|
| **Median throughput** | 9.08M ops/s | 8.38M ops/s | **-7.6%** |
| **Mean throughput** | 9.04M ops/s | 8.25M ops/s | -8.7% |
| **Min throughput** | 7.81M ops/s | 7.34M ops/s | -6.0% |
| **Max throughput** | 9.71M ops/s | 8.77M ops/s | -9.7% |
| **Variance** | 300B | 207B | **-31%** (improvement) |
| **Std Dev** | 548K | 455K | -17% |

### Raw Results

**Baseline (10 runs)**:

```
8,720,875  9,147,207  9,709,755  8,708,904  9,541,168
9,322,187  9,005,728  8,994,402  7,808,414  9,459,910
```

**Deferred with Last-Match Cache (20 runs)**:

```
8,323,016  7,963,325  8,578,296  8,313,354  8,314,545
7,445,113  7,518,391  8,610,739  8,770,947  7,338,433
8,668,194  7,797,795  7,882,001  8,442,375  8,564,862
7,950,541  8,552,224  8,548,635  8,636,063  8,742,399
```

## Analysis

### What Worked

- **Variance reduction**: The -31% drop in variance confirms that the deferred approach provides more stable performance
- **Cache mechanism**: The last_idx optimization is correctly implemented and should help in workloads with better temporal locality

### Why Regression Persists

**Access Pattern Mismatch**:

- Expected: 60-80% cache hit rate (consecutive frees from same page)
- Reality: bench_mid_large_mt uses random access across 2048 slots
- Result: Poor temporal locality → low cache hit rate → linear search dominates

**Cost Breakdown**:

```
Original (no deferred):
  mid_desc_lookup:     ~10 cycles
  atomic operations:    ~5 cycles
  Total per free:      ~15 cycles

Deferred (with last-match cache):
  last_idx check:       ~2 cycles (on miss)
  linear search:       ~32 cycles (avg 16 iterations × 2 ops)
  Total per free:      ~34 cycles (2.3× slower)

Expected with 70% hit rate:
  70% hits:             ~2 cycles
  30% searches:        ~10 cycles
  Total per free:     ~4.4 cycles (2.9× faster)
```

The cache hit rate for this benchmark is likely <30%, making the deferred path slower than the baseline.

## Conclusion

### Success Criteria (Original)

- [✗] No regression: median deferred >= median baseline (**Failed**: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (**Success**: -31%)
- [✗] No outliers: all runs within 20% of median (**Failed**: still has variance)

### Deliverables

- [✓] last_idx field added to MidInuseTlsPageMap
- [✓] Fast-path check before linear search
- [✓] Cache update on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)

## Next Steps

The last-match cache is necessary but insufficient. Additional optimizations are needed:

### Option A: Hash-Based Lookup

Replace the linear search with a simple hash (a minimal sketch appears in the appendix below):

```c
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
```

- Pro: O(1) expected lookup
- Con: Requires handling collisions

### Option B: Reduce Map Size

Use 8 or 16 entries instead of 32:

- Pro: Fewer iterations on search
- Con: More frequent drains (overhead moves to drain)

### Option C: Better Drain Boundaries

Drain more frequently at natural boundaries:

- After N allocations (not just on map full)
- At refill/slow path transitions
- Pro: Keeps map small, searches fast
- Con: More drain calls (must benchmark)

### Option D: MRU (Most Recently Used) Ordering

Keep recently used entries at the front of the array:

- Pro: Common pages found faster
- Con: Array reordering overhead

### Recommendation

Try **Option A (hash-based)** first, as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.

## Related Documents

- [POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md](./POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md) - Original design
- [POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md](./POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md) - Root cause analysis

## Commit

```
commit 6c849fd02
Author: ...
Date:   2025-12-13

    POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
```
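## Appendix: Option A Sketch

For reference, a minimal sketch of what the Option A hash-based lookup could look like, built around the `MAP_HASH` macro from the Next Steps section. Open addressing with linear probing is only one possible collision strategy; the `MidInuseHashMap` type and `hash_map_record` function are hypothetical names for illustration, not existing code.

```c
#include <stddef.h>
#include <stdint.h>

#define MAP_SIZE 32   // must stay a power of two for the mask below
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

typedef struct {
    void*    pages[MAP_SIZE];    // NULL means the slot is empty
    uint32_t counts[MAP_SIZE];   // Pending dec count per page
} MidInuseHashMap;

/* Record one deferred dec for `page`. Returns 1 on success, 0 if the
 * probe sequence is full (caller would drain and retry). */
static int hash_map_record(MidInuseHashMap* map, void* page)
{
    uint32_t home = (uint32_t)MAP_HASH(page);

    // Open addressing with linear probing to resolve collisions
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (home + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {      // existing entry: O(1) expected
            map->counts[i]++;
            return 1;
        }
        if (map->pages[i] == NULL) {      // empty slot: claim it
            map->pages[i]  = page;
            map->counts[i] = 1;
            return 1;
        }
    }
    return 0;   // table full along this probe path
}
```

The trade-off Option A makes is visible here: the expected cost is one or two probes instead of an average 16-iteration scan, but a full table still forces a drain-and-retry path, so drain frequency would need to be benchmarked alongside the lookup change.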