# Tiny Pool Optimization Roadmap

Quick reference for implementing mimalloc-style optimizations in hakmem's Tiny Pool.

**Current Performance**: 83 ns/op for 8-64B allocations
**Target Performance**: 30-50 ns/op (realistic with optimizations)
**Gap to mimalloc**: still 2-3.5x slower (fundamental architecture difference)

---
## Quick Wins (10-20 ns improvement)

### 1. Lookup Table Size Classification
**Effort**: 30 minutes | **Gain**: 3-5 ns

Replace the sequential if-chain with a table lookup:

```c
// Before
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)  return 0;
    if (size <= 16) return 1;
    if (size <= 32) return 2;
    // ... sequential if-chain
}

// After (entries must match the if-chain boundaries:
// index 8 maps to class 0, index 16 to class 1, etc.)
static const uint8_t g_size_to_class[65] = {
    0,                  // 0
    0,0,0,0,0,0,0,0,    // 1-8   -> class 0 (size <= 8)
    1,1,1,1,1,1,1,1,    // 9-16  -> class 1 (size <= 16)
    2,2,2,2,2,2,2,2,    // 17-24 -> class 2
    2,2,2,2,2,2,2,2,    // 25-32 -> class 2 (size <= 32)
    // ... continue with class 3 up to 64
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    return likely(size <= 64) ? g_size_to_class[size] : -1;
}
```

**Implementation**:
1. File: `hakmem_tiny.h`
2. Add the static const array after line 36 (after `g_tiny_blocks_per_slab`)
3. Update `hak_tiny_size_to_class()` to use the table
4. Add `__builtin_expect()` for the fast path
---
### 2. Remove Statistics from Critical Path
**Effort**: 1 hour | **Gain**: 10-15 ns

Move the sampled counter updates out of the hot path into lazy per-thread accumulation:

```c
// Before (hot path)
void* p = mag->items[--mag->top].ptr;
t_tiny_rng ^= t_tiny_rng << 13;  // xorshift sampling: ~3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << g_tiny_count_sample_exp) - 1u)) == 0u)
    g_tiny_pool.alloc_count[class_idx]++;
return p;

// After (hot path)
void* p = mag->items[--mag->top].ptr;
// Stats update deferred (see lazy counter update below)
return p;

// New: lazy counter accumulation (cold path)
static __thread uint32_t g_tls_alloc_counter[TINY_NUM_CLASSES];

static void hak_tiny_lazy_counter_update(int class_idx) {
    if (++g_tls_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += g_tls_alloc_counter[class_idx];
        g_tls_alloc_counter[class_idx] = 0;
    }
}
```

**Implementation**:
1. File: `hakmem_tiny.c`
2. Remove the sampled xorshift code from `hak_tiny_alloc()` (lines 656-659)
3. Replace it with a simple per-thread counter
4. Call `hak_tiny_lazy_counter_update()` in the slow path only
5. Update `hak_tiny_get_stats()` to account for unflushed lazy counters
---
### 3. Inline Fast Path
**Effort**: 1 hour | **Gain**: 5-10 ns

Create a separate inlined fast-path function:

```c
// New file: hakmem_tiny_alloc_fast.h
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Magazine fast path only (no TLS active slab, no locks)
    if (size > TINY_MAX_SIZE) return NULL;

    int class_idx = g_size_to_class[size];
    TinyTLSMag* mag = &g_tls_mags[class_idx];

    if (likely(mag->top > 0)) {
        return mag->items[--mag->top].ptr;
    }

    // Fall through to the slow path
    extern void* hak_tiny_alloc_slow(size_t size);
    return hak_tiny_alloc_slow(size);
}
```

**Implementation**:
1. Create: `hakmem_tiny_alloc_fast.h`
2. Move the pure magazine fast path here
3. Declare it `__attribute__((always_inline))`
4. Include it from `hakmem.h` for the public API
5. Keep `hakmem_tiny.c` for the slow path
---
## Medium Effort (2-5 ns improvement each)

### 4. Combine TLS Reads into Single Structure
**Effort**: 2 hours | **Gain**: 2-3 ns

```c
// Before
TinyTLSMag* mag = &g_tls_mags[class_idx];
TinySlab* slab_a = g_tls_active_slab_a[class_idx];
TinySlab* slab_b = g_tls_active_slab_b[class_idx];
// 3 separate TLS reads + prefetch misses

// After
typedef struct {
    TinyTLSMag mag;    // All magazine data
    TinySlab* slab_a;  // Pointers
    TinySlab* slab_b;
} TinyTLSCache;
static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

TinyTLSCache* cache = &g_tls_cache[class_idx];  // 1 TLS read
// All 3 data structures prefetched together
```

**Benefits**:
- Single TLS read instead of 3
- All related data on the same cache line
- Better prefetcher behavior
---
### 5. Hardware Prefetching Hints
**Effort**: 30 minutes | **Gain**: 1-2 ns (cumulative)

```c
// In hot path loop (e.g., bench_tiny_mt.c)
void* p = mag->items[--mag->top].ptr;
if (mag->top > 0) {
    // Prefetch the block the *next* pop will return
    __builtin_prefetch(mag->items[mag->top - 1].ptr, 0, 3);
    // 0 = read, 3 = high temporal locality (keep in L1)
}
use_allocation(p);
```
---
### 6. Branchless Fallback Logic
**Effort**: 1.5 hours | **Gain**: 10-15 ns

Restructure the fallback chain so the compiler can use conditional moves instead of unpredictable branches:

```c
// Before (2+ branches, high misprediction rate)
if (mag->top > 0) {
    return mag->items[--mag->top].ptr;
}
if (slab_a && slab_a->free_count > 0) {
    // ... allocation from slab
}
// Fall through to slow path

// After (single exit; the compiler can often lower this to cmov)
void* p = NULL;
if (mag->top > 0) {
    p = mag->items[--mag->top].ptr;
}
// If still NULL, the slab_a handler gets a chance
if (!p && slab_a && slab_a->free_count > 0) {
    // ... allocation from slab
    p = result;  // result of the slab allocation
}
return p != NULL ? p : hak_tiny_alloc_slow(size);
```
---
## Advanced Optimizations (2-5 ns improvement)

### 7. Code Layout (Hot/Cold Separation)
**Effort**: 2 hours | **Gain**: 2-5 ns

Use section attributes to place the fast path in a hot text section:

```c
// In hakmem_tiny_alloc_fast.h
__attribute__((section(".text.hot")))
__attribute__((aligned(64)))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast path only
}

// In hakmem_tiny.c (non-static: the inlined fast path calls it)
__attribute__((section(".text.cold")))
void* hak_tiny_alloc_slow(size_t size) {
    // Slow path: locks, scanning, etc.
}
```

**Benefits**:
- Hot path packed contiguously in the instruction cache
- Fewer I-cache misses
- Better CPU prefetching
---
## Implementation Priority (Time Investment vs Gain)

| Priority | Optimization | Effort | Gain | Ratio |
|----------|--------------|--------|------|-------|
| **P0** | Lookup table classification | 30 min | 3-5 ns | 10x ROI |
| **P1** | Remove stats overhead | 1 hr | 10-15 ns | 15x ROI |
| **P2** | Inline fast path | 1 hr | 5-10 ns | 7x ROI |
| P3 | Branch elimination | 1.5 hr | 10-15 ns | 7x ROI |
| P4 | Combined TLS reads | 2 hr | 2-3 ns | 1.5x ROI |
| P5 | Code layout | 2 hr | 2-5 ns | 2x ROI |
| P6 | Prefetching hints | 30 min | 1-2 ns | 3x ROI |
---
## Testing Strategy

After each optimization:

```bash
# Rebuild
make clean && make -j4

# Single-threaded benchmark
./bench_tiny --iterations=1000000 --threads=1
# Should show improvement in latency_ns

# Multi-threaded verification
./bench_tiny --iterations=100000 --threads=4
# Should maintain thread-safety and hit rate

# Compare against baseline
./docs/bench_compare.sh baseline optimized
```
---
## Expected Results Timeline

**Phase 1** (P0+P1, ~2 hours):
- Current: 83 ns/op
- Expected: 65-70 ns/op
- Gain: ~18% improvement

**Phase 2** (P0+P1+P2, ~3 hours):
- Expected: 55-65 ns/op
- Gain: ~25-30% improvement

**Phase 3** (P0+P1+P2+P3, ~4.5 hours):
- Expected: 45-55 ns/op
- Gain: ~40-45% improvement
- Still 3-4x slower than mimalloc (fundamental difference)

**Why still slower than mimalloc (14 ns)?**

Irreducible architectural gaps:
1. Bitmap lookup [5 ns] vs free-list head pop [1 ns]
2. Magazine validation [3-5 ns] vs implicit ownership [0 ns]
3. Thread ownership tracking [2-3 ns] vs per-thread pages [0 ns]

**Total irreducible gap: 10-13 ns**
---
## Code Files to Modify

| File | Change | Priority |
|------|--------|----------|
| `hakmem_tiny.h` | Add size_to_class LUT | P0 |
| `hakmem_tiny.c` | Remove stats from hot path | P1 |
| `hakmem_tiny_alloc_fast.h` | NEW - inlined fast path | P2 |
| `hakmem_tiny.c` | Branchless fallback | P3 |
| `hakmem_tiny.h` | Combined TLS structure | P4 |
| `bench_tiny.c` | Add prefetch hints | P6 |
| `Makefile` | Hot/cold sections (`-ffunction-sections`) | P5 |
---
## Rollback Plan

Each optimization is independent and can be reverted on its own:

```bash
# If an optimization causes a regression:
git diff <file>       # See changes
git checkout <file>   # Revert the single file
./bench_tiny          # Re-verify
```

All optimizations preserve semantics; there are no behavior changes.
---
## Success Criteria

- [ ] P0 implementation: 80-82 ns/op (no regression)
- [ ] P1 implementation: 68-72 ns/op (+15% from current)
- [ ] P2 implementation: 60-65 ns/op (+22% from current)
- [ ] P3 implementation: 50-55 ns/op (+35% from current)
- [ ] All changes compile with `-O3 -march=native`
- [ ] All benchmarks pass thread-safety verification
- [ ] No regressions on medium/large allocations (L2 pool)
|
**Last Updated**: 2025-10-26
**Status**: Ready for implementation
**Owner**: [Team]
**Target**: Implement P0-P2 in next sprint
|