# POOL-MID-DN-BATCH: Last-Match Cache Implementation
**Date**: 2025-12-13

**Phase**: POOL-MID-DN-BATCH optimization

**Status**: Implemented but insufficient for full regression fix
## Problem Statement

The POOL-MID-DN-BATCH deferred inuse_dec implementation showed a -5% performance regression instead of the expected +2-4% improvement. Root-cause analysis revealed:

- **Linear search overhead**: an average of 16 iterations in the 32-entry TLS map
- **Instruction count**: +7.4% increase on the hot path
- **Hot-path cost**: the linear search outweighed the savings from eliminating `mid_desc_lookup`
## Solution: Last-Match Cache

Added a `last_idx` field to exploit temporal locality: the assumption that consecutive frees often target the same page.
### Implementation

#### 1. Structure Change (`pool_mid_inuse_tls_pagemap_box.h`)
```c
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];   // Page base addresses
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];  // Pending dec count per page
    uint32_t used;                            // Number of active entries
    uint32_t last_idx;                        // NEW: cache of the last hit index
} MidInuseTlsPageMap;
```

#### 2. Lookup Logic (`pool_mid_inuse_deferred_box.h`)
**Before**:

```c
// Linear search only
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        return;
    }
}
```

**After**:
```c
// Check the last match first (O(1) fast path)
if (map->last_idx < map->used && map->pages[map->last_idx] == page) {
    map->counts[map->last_idx]++;
    return;  // Early exit on cache hit
}

// Fall back to linear search
for (uint32_t i = 0; i < map->used; i++) {
    if (map->pages[i] == page) {
        map->counts[i]++;
        map->last_idx = i;  // Update the cache
        return;
    }
}
```

#### 3. Cache Maintenance
- **On new entry**: `map->last_idx = idx;` (a new page is likely to be reused)
- **On drain**: `map->last_idx = 0;` (reset for the next batch)
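A minimal sketch of how these two maintenance rules can fit together. The flush hook `apply_pending_dec` and the helper names are illustrative, not the actual implementation in `pool_mid_inuse_deferred_box.h`:

```c
#include <stdint.h>
#include <stddef.h>

#define MID_INUSE_TLS_MAP_SIZE 32

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
    uint32_t last_idx;
} MidInuseTlsPageMap;

/* Hypothetical flush hook: applies `count` pending inuse decrements to `page`.
 * The real code would update the page descriptor here. */
static void apply_pending_dec(void* page, uint32_t count) {
    (void)page; (void)count;
}

/* New entry: append and point the cache at it
 * (a new page is likely to be reused immediately). */
static void map_insert(MidInuseTlsPageMap* map, void* page) {
    uint32_t idx = map->used++;
    map->pages[idx]  = page;
    map->counts[idx] = 1;
    map->last_idx    = idx;
}

/* Drain: flush every pending count, then reset the map and the cache. */
static void map_drain(MidInuseTlsPageMap* map) {
    for (uint32_t i = 0; i < map->used; i++) {
        apply_pending_dec(map->pages[i], map->counts[i]);
        map->counts[i] = 0;
    }
    map->used     = 0;
    map->last_idx = 0;  /* reset for the next batch */
}
```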
## Benchmark Results

### Test Configuration

- Benchmark: `bench_mid_large_mt_hakmem`
- Threads: 4
- Cycles: 40,000 per thread
- Working set: 2048 slots
- Size range: 8-32 KiB
- Access pattern: random
### Performance Data

| Metric | Baseline (DEFERRED=0) | Deferred w/ Cache (DEFERRED=1) | Change |
|--------|----------------------|--------------------------------|--------|
| **Median throughput** | 9.08M ops/s | 8.38M ops/s | **-7.6%** |
| **Mean throughput** | 9.04M ops/s | 8.25M ops/s | -8.7% |
| **Min throughput** | 7.81M ops/s | 7.34M ops/s | -6.0% |
| **Max throughput** | 9.71M ops/s | 8.77M ops/s | -9.7% |
| **Variance** | 300B | 207B | **-31%** (improvement) |
| **Std Dev** | 548K | 455K | -17% |
### Raw Results

**Baseline (10 runs)**:

```
8,720,875 9,147,207 9,709,755 8,708,904 9,541,168
9,322,187 9,005,728 8,994,402 7,808,414 9,459,910
```

**Deferred with Last-Match Cache (20 runs)**:

```
8,323,016 7,963,325 8,578,296 8,313,354 8,314,545
7,445,113 7,518,391 8,610,739 8,770,947 7,338,433
8,668,194 7,797,795 7,882,001 8,442,375 8,564,862
7,950,541 8,552,224 8,548,635 8,636,063 8,742,399
```
## Analysis

### What Worked

- **Variance reduction**: the -31% drop in variance confirms that the deferred approach delivers more stable performance
- **Cache mechanism**: the `last_idx` optimization is correctly implemented and should help in workloads with stronger temporal locality
### Why Regression Persists

**Access Pattern Mismatch**:

- Expected: 60-80% cache hit rate (consecutive frees from the same page)
- Reality: `bench_mid_large_mt` uses random access across 2048 slots
- Result: poor temporal locality → low cache hit rate → linear search dominates
**Cost Breakdown**:

```
Original (no deferred):
  mid_desc_lookup:    ~10 cycles
  atomic operations:   ~5 cycles
  Total per free:     ~15 cycles

Deferred (with last-match cache):
  last_idx check:      ~2 cycles (on miss)
  linear search:      ~32 cycles (avg 16 iterations × 2 ops)
  Total per free:     ~34 cycles (2.3× slower)

Expected with 70% hit rate:
  70% hits:            ~2 cycles
  30% searches:       ~10 cycles
  Total per free:     ~4.4 cycles (2.9× faster)
```
The cache hit rate for this benchmark is likely <30%, making it slower than the baseline.
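This claim can be checked with a small expected-cost model. The cycle figures are taken from the breakdown above; the break-even calculation is arithmetic implied by those figures, not a new measurement:

```c
/* Expected per-free cost of the cached lookup, in cycles, given a
 * cache hit rate `p`, a hit cost, and an average miss (search) cost. */
static double expected_cost(double p, double hit_cycles, double miss_cycles) {
    return p * hit_cycles + (1.0 - p) * miss_cycles;
}

/* Hit rate at which the cached lookup matches a fixed baseline cost:
 * solve p*hit + (1-p)*miss == baseline for p. */
static double break_even_hit_rate(double baseline, double hit_cycles,
                                  double miss_cycles) {
    return (miss_cycles - baseline) / (miss_cycles - hit_cycles);
}
```

With the document's own estimates, `expected_cost(0.7, 2, 10)` reproduces the ~4.4-cycle figure, and `break_even_hit_rate(15, 2, 34)` puts break-even against the ~15-cycle baseline at roughly a 59% hit rate — well above what a random access pattern delivers.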
## Conclusion

### Success Criteria (Original)

- [✗] No regression: median deferred >= median baseline (**Failed**: -7.6%)
- [✓] Stability: deferred variance <= baseline variance (**Success**: -31%)
- [✗] No outliers: all runs within 20% of the median (**Failed**: outliers remain)
### Deliverables

- [✓] `last_idx` field added to `MidInuseTlsPageMap`
- [✓] Fast-path check before the linear search
- [✓] Cache updated on hits and new entries
- [✓] Cache reset on drain
- [✓] Build succeeds
- [✓] Committed to git (commit 6c849fd02)

## Next Steps

The last-match cache is necessary but insufficient on its own. Additional optimizations are needed:
### Option A: Hash-Based Lookup

Replace the linear search with a simple hash:

```c
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))
```

- Pro: O(1) expected lookup
- Con: requires collision handling
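One way the hash variant could look, sketched under the assumption of open addressing with linear probing over the same fixed-size arrays (the map type and `map_record` name are illustrative, not the actual implementation):

```c
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32  /* must be a power of two for the mask to work */
#define MAP_HASH(page) (((uintptr_t)(page) >> 16) & (MAP_SIZE - 1))

typedef struct {
    void*    pages[MAP_SIZE];   /* NULL marks an empty slot */
    uint32_t counts[MAP_SIZE];
} MidInuseHashMap;

/* Record one pending dec for `page`.
 * Returns 1 on success, 0 when the map is full (caller drains and retries). */
static int map_record(MidInuseHashMap* map, void* page) {
    uint32_t start = MAP_HASH(page);
    for (uint32_t probe = 0; probe < MAP_SIZE; probe++) {
        uint32_t i = (start + probe) & (MAP_SIZE - 1);
        if (map->pages[i] == page) {   /* existing entry: O(1) expected */
            map->counts[i]++;
            return 1;
        }
        if (map->pages[i] == NULL) {   /* empty slot: claim it */
            map->pages[i]  = page;
            map->counts[i] = 1;
            return 1;
        }
    }
    return 0;  /* no free slot found */
}
```

Probing keeps collision handling branch-light; drain would reset `pages[]` to NULL rather than just zeroing `used`.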
### Option B: Reduce Map Size

Use 8 or 16 entries instead of 32:

- Pro: fewer iterations per search
- Con: more frequent drains (the overhead moves to the drain path)
### Option C: Better Drain Boundaries

Drain more frequently at natural boundaries:

- After N allocations (not just when the map is full)
- At refill/slow-path transitions
- Pro: keeps the map small and searches fast
- Con: more drain calls (must be benchmarked)
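The "after N allocations" trigger can be sketched as a simple per-map counter. `MID_DRAIN_INTERVAL` and the type/function names are hypothetical placeholders for whatever threshold benchmarking settles on:

```c
#include <stdint.h>

#define MID_DRAIN_INTERVAL 128  /* hypothetical: drain every N recorded ops */

typedef struct {
    uint32_t ops_since_drain;
} DrainClock;

/* Tick once per recorded operation; returns 1 when a drain is due
 * and resets the counter so the next interval starts fresh. */
static int drain_due(DrainClock* c) {
    if (++c->ops_since_drain >= MID_DRAIN_INTERVAL) {
        c->ops_since_drain = 0;
        return 1;
    }
    return 0;
}
```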
### Option D: MRU (Most Recently Used) Ordering

Keep recently used entries at the front of the array:

- Pro: common pages are found faster
- Con: array-reordering overhead
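A cheap variant of this idea is one-step promotion (swap a hit entry with its predecessor) rather than a full move-to-front, which bounds the reordering cost to one swap per hit. A sketch, with illustrative names:

```c
#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE 32

typedef struct {
    void*    pages[MAP_SIZE];
    uint32_t counts[MAP_SIZE];
    uint32_t used;
} MruMap;

/* After a hit at index `i`, swap the entry one slot toward the front so
 * hot pages drift into the first iterations of the next linear search. */
static void mru_promote(MruMap* map, uint32_t i) {
    if (i == 0) return;              /* already at the front */
    uint32_t j = i - 1;
    void* tp = map->pages[i];
    map->pages[i] = map->pages[j];
    map->pages[j] = tp;
    uint32_t tc = map->counts[i];
    map->counts[i] = map->counts[j];
    map->counts[j] = tc;
}
```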
### Recommendation

Try **Option A (hash-based)** first, as it has the best theoretical performance and aligns with the "O(1) like mimalloc" design goal.
## Related Documents

- [POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md](./POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md) - original design
- [POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md](./POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md) - root-cause analysis
## Commit

```
commit 6c849fd02
Author: ...
Date: 2025-12-13

POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
```