# Tiny Pool Optimization Strategy

**Date**: 2025-10-26
**Goal**: Reduce 5.9x performance gap with mimalloc for small allocations (8-64 bytes)
**Current**: 83 ns/op (hakmem) vs 14 ns/op (mimalloc)
**Target**: 50-55 ns/op (35-40% improvement, ~3.5x gap remaining)
**Status**: Ready for implementation

---

## Executive Summary
### The 5.9x Gap: Root Cause Analysis

Based on comprehensive mimalloc analysis (see ANALYSIS_SUMMARY.md), the performance gap stems from **architectural differences**, not bugs:

| Component | mimalloc | hakmem | Impact |
|-----------|----------|--------|--------|
| **Primary data structure** | LIFO free list (intrusive) | Bitmap + magazine | +20 ns |
| **State location** | Thread-local only | Thread-local + global | +10 ns |
| **Cache validation** | Implicit (per-thread pages) | Explicit (ownership tracking) | +5 ns |
| **Statistics overhead** | Batched/deferred | Per-allocation sampled | +10 ns |
| **Control flow** | 1 branch | 3-4 branches | +5 ns |

**Total measured gap**: 69 ns (83 - 14)

### Why We Can't Match mimalloc's 14 ns

**Irreducible architectural gaps** (10-13 ns total; contrasted in the sketch after this list):

1. **Bitmap lookup** [5 ns]: Find-first-set + bit extraction vs single pointer read
2. **Magazine validation** [3-5 ns]: Ownership tracking for diagnostics vs implicit ownership
3. **Statistics integration** [2-3 ns]: Per-class stats require bookkeeping vs atomic counters

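For reference, here is a schematic contrast of the two hot paths; the types and field names below are illustrative, not hakmem's or mimalloc's actual structures. The intrusive free list pops a block with a single pointer load, while the bitmap path needs a bit scan, a bit clear, and address arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

// mimalloc-style pop: the free block itself stores the next pointer,
// so allocation is one load plus one store on thread-local state.
typedef struct Block { struct Block* next; } Block;

static inline void* freelist_pop(Block** head) {
    Block* b = *head;
    if (b) *head = b->next;
    return b;
}

// Bitmap-style pop (hakmem-like sketch): find a free bit, clear it,
// then compute the block address from the bit index.
typedef struct {
    uint64_t bitmap;    // 1 = free slot (real slabs use an array of words)
    char*    base;      // start of the slab payload
    uint32_t blk_size;  // block size for this size class
} MiniSlab;

static inline void* bitmap_pop(MiniSlab* s) {
    if (s->bitmap == 0) return NULL;
    int idx = __builtin_ctzll(s->bitmap);        // find-first-set
    s->bitmap &= s->bitmap - 1;                  // clear the lowest set bit
    return s->base + (size_t)idx * s->blk_size;  // bit index -> address
}
```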
### What We Can Realistically Achieve

**Addressable overhead** (30-35 ns):

- P0: Lookup table classification → +3-5 ns
- P1: Remove stats from hot path → +10-15 ns
- P2: Inline fast path → +5-10 ns
- P3: Branch elimination → +10-15 ns

**Expected result**: 83 ns → 50-55 ns (35-40% improvement)

---

## Strategic Approach: Three-Phase Implementation

### Phase I: Quick Wins (P0 + P1) - 90 minutes

**Target**: 83 ns → 65-70 ns (~20% improvement)
**ROI**: Highest impact per time invested

**Why Start Here**:

- P0 and P1 are independent (no code conflicts)
- Combined gain: 13-20 ns
- Low risk (simple, localized changes)
- Immediate validation possible
### Phase II: Fast Path Optimization (P2) - 60 minutes

**Target**: 65-70 ns → 55-60 ns (~30% cumulative improvement)
**ROI**: High impact, moderate complexity

**Why Second**:

- Depends on P1 completion (stats removed from hot path)
- Creates foundation for P3 (branch elimination)
- More complex but well-documented in roadmap

### Phase III: Advanced Optimization (P3) - 90 minutes

**Target**: 55-60 ns → 50-55 ns (~40% cumulative improvement)
**ROI**: Moderate, requires careful testing

**Why Last**:

- Most complex (branchless logic)
- Highest risk of subtle bugs
- Marginal improvement (diminishing returns)

---

## Detailed Implementation Plan
### P0: Lookup Table Size Classification

**File**: `hakmem_tiny.h`
**Effort**: 30 minutes
**Gain**: +3-5 ns
**Risk**: Very Low

#### Current Implementation (If-Chain)

```c
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)    return 0;
    if (size <= 16)   return 1;
    if (size <= 32)   return 2;
    if (size <= 64)   return 3;
    if (size <= 128)  return 4;
    if (size <= 256)  return 5;
    if (size <= 512)  return 6;
    if (size <= 1024) return 7;
    return -1;
}
// Branches: 8 (worst case), avg 4 mispredictions
// Cost: 5-8 ns
```
#### Target Implementation (LUT)

```c
// Add after line 36 in hakmem_tiny.h (after g_tiny_blocks_per_slab)
static const uint8_t g_tiny_size_to_class[1025] = {
    // 0-8: class 0
    0,0,0,0,0,0,0,0,0,
    // 9-16: class 1
    1,1,1,1,1,1,1,1,
    // 17-32: class 2
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    // 33-64: class 3
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    // 65-128: class 4
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    // 129-256: class 5
    // ... (continue pattern for all 1025 entries; generator sketch below)
    // 513-1024: class 7
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    // Fast path: direct table lookup
    if (__builtin_expect(size <= 1024, 1)) {
        return g_tiny_size_to_class[size];
    }
    // Slow path: out of Tiny Pool range
    return -1;
}
// Branches: 1 (predictable)
// Cost: 0.5-1 ns (L1 cache hit)
```
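Hand-writing 1,025 entries is error-prone. A throwaway generator along the lines of the sketch below can emit the initializer for pasting (or `#include`-ing) into `hakmem_tiny.h`; the class boundaries are taken from the if-chain above, and `gen_tiny_lut.c` is a hypothetical file name.

```c
/* gen_tiny_lut.c - throwaway generator for g_tiny_size_to_class.
 * Build and run once, e.g. `cc gen_tiny_lut.c -o gen_tiny_lut && ./gen_tiny_lut`,
 * then paste the output into hakmem_tiny.h. */
#include <stdio.h>

int main(void) {
    const int bounds[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    printf("static const uint8_t g_tiny_size_to_class[1025] = {\n");
    for (int size = 0; size <= 1024; size++) {
        int cls = 0;
        while (size > bounds[cls]) cls++;              // sizes 0..8 -> class 0, etc.
        printf("%d,%c", cls, (size % 32 == 31) ? '\n' : ' ');
    }
    printf("\n};\n");
    return 0;
}
```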
#### Implementation Steps

1. Add `g_tiny_size_to_class[1025]` table to `hakmem_tiny.h` (after line 36)
2. Replace `hak_tiny_size_to_class()` with `hak_tiny_size_to_class_fast()`
3. Update all call sites in `hakmem_tiny.c`
4. Compile and verify correctness
#### Testing

```bash
# Build and verify correctness
make clean && make -j4

# Unit test: verify classification accuracy
./test_tiny_size_class   # (create if needed; sketch below)

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 78-80ns
```
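A minimal sketch of the `test_tiny_size_class` harness referenced above. It assumes the legacy if-chain is temporarily kept under a reference name (`hak_tiny_size_to_class_ref()` here is an illustrative name, not an existing function) so the LUT can be cross-checked before the old classifier is removed.

```c
/* test_tiny_size_class.c - sketch; assumes hakmem_tiny.h exposes both the
 * LUT-based classifier and a renamed copy of the legacy if-chain. */
#include <stdio.h>
#include "hakmem_tiny.h"

int main(void) {
    for (size_t size = 0; size <= 1024; size++) {
        int lut = hak_tiny_size_to_class_fast(size);
        int ref = hak_tiny_size_to_class_ref(size);   /* legacy if-chain (illustrative name) */
        if (lut != ref) {
            fprintf(stderr, "mismatch at size %zu: lut=%d ref=%d\n", size, lut, ref);
            return 1;
        }
    }
    if (hak_tiny_size_to_class_fast(1025) != -1) {    /* out-of-range guard */
        fprintf(stderr, "size 1025 should be rejected\n");
        return 1;
    }
    printf("test_tiny_size_class: all 1025 entries OK\n");
    return 0;
}
```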
#### Rollback

```bash
git diff hakmem_tiny.h       # Verify changes
git checkout hakmem_tiny.h   # If regression detected
```

---

### P1: Remove Statistics from Critical Path

**File**: `hakmem_tiny.c`
**Effort**: 60 minutes
**Gain**: +10-15 ns
**Risk**: Low (statistics semantics preserved)
#### Current Implementation (Per-Allocation Sampling)

```c
// hakmem_tiny.c:656-659 (hot path)
void* p = mag->items[--mag->top].ptr;

// Sampled counter update (XOR-based PRNG)
t_tiny_rng ^= t_tiny_rng << 13;   // 3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << g_tiny_count_sample_exp) - 1u)) == 0u) {
    g_tiny_pool.alloc_count[class_idx]++;   // Atomic increment (contention!)
}

return p;
// Total stats overhead: 10-15 ns
```
#### Target Implementation (Batched Lazy Update)

```c
// New TLS counter (add to hakmem_tiny.c TLS section)
static __thread uint64_t t_tiny_alloc_counter[TINY_NUM_CLASSES] = {0};

// Hot path (remove all stats code)
void* p = mag->items[--mag->top].ptr;
// NO STATS HERE!
return p;
// Stats overhead: 0 ns

// Cold path: lazy accumulation (called during magazine refill).
// `refilled` is the number of blocks just handed to the magazine, so
// allocations are accounted for in batches instead of one at a time.
static void hak_tiny_lazy_counter_update(int class_idx, uint32_t refilled) {
    t_tiny_alloc_counter[class_idx] += refilled;
    // Flush to the shared counter once ~100 allocations have accumulated
    if (t_tiny_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += t_tiny_alloc_counter[class_idx];
        t_tiny_alloc_counter[class_idx] = 0;
    }
}

// Call from slow path (magazine refill function)
void hak_tiny_refill_magazine(...) {
    // ... existing refill logic ...
    hak_tiny_lazy_counter_update(class_idx, refilled);  // refilled = blocks just added
}
```
#### Implementation Steps

1. Add `t_tiny_alloc_counter[TINY_NUM_CLASSES]` TLS array
2. Remove XOR PRNG code from `hak_tiny_alloc()` hot path (lines 656-659)
3. Add `hak_tiny_lazy_counter_update()` function
4. Call lazy update in slow path (magazine refill, slab allocation), passing the number of blocks refilled
5. Update `hak_tiny_get_stats()` to flush pending TLS counters (sketched below)

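One possible shape for step 5 (a sketch; the actual `hak_tiny_get_stats()` signature is not shown in this document): drain the calling thread's pending TLS counts into the shared counters before reading them. Counts still pending in other threads' TLS arrays stay invisible until those threads hit their next refill, which is the accepted imprecision of the batched scheme.

```c
// Sketch: flush this thread's pending counts into the shared pool counters.
// Call at the top of hak_tiny_get_stats() (and optionally at thread exit).
static void hak_tiny_flush_tls_counters(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (t_tiny_alloc_counter[c] != 0) {
            g_tiny_pool.alloc_count[c] += t_tiny_alloc_counter[c];
            t_tiny_alloc_counter[c] = 0;
        }
    }
}
```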
#### Statistics Accuracy Trade-off

- **Before**: Sampled (1/16 allocations counted)
- **After**: Batched (accumulated every 100 allocations)
- **Impact**: Both are approximations; batching is more accurate and faster
#### Testing

```bash
# Build
make clean && make -j4

# Functional test: verify stats still work
./test_mf2   # Should show non-zero alloc_count

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 78-80ns → 63-70ns
```

#### Rollback

```bash
git diff hakmem_tiny.c
git checkout hakmem_tiny.c   # If stats broken
```

---
### P2: Inline Fast Path

**Files**: `hakmem_tiny_alloc_fast.h` (new), `hakmem.h`, `hakmem_tiny.c`
**Effort**: 60 minutes
**Gain**: +5-10 ns
**Risk**: Moderate (function splitting, ABI considerations)

#### Current Implementation (Unified Function)

```c
// hakmem_tiny.c: Single function for fast + slow path
void* hak_tiny_alloc(size_t size) {
    // Size classification
    int class_idx = hak_tiny_size_to_class(size);

    // Magazine check
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;  // Fast path
    }

    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (slow path with locks)
    // ...
}
// Problem: Compiler can't inline due to size/complexity
// Call overhead: 5-10 ns
```
#### Target Implementation (Split Fast/Slow)

```c
// New file: hakmem_tiny_alloc_fast.h
#ifndef HAKMEM_TINY_ALLOC_FAST_H
#define HAKMEM_TINY_ALLOC_FAST_H

#include "hakmem_tiny.h"

// External slow path declaration
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);

// Inlined fast path (magazine-only)
__attribute__((always_inline))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast bounds check
    if (__builtin_expect(size > TINY_MAX_SIZE, 0)) {
        return NULL;  // Not Tiny Pool range
    }

    // O(1) size classification (uses P0 LUT)
    int class_idx = g_tiny_size_to_class[size];

    // TLS magazine check (fast path only)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        return mag->items[--mag->top].ptr;
    }

    // Fall through to slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

#endif
```

```c
// hakmem_tiny.c: Slow path only
void* hak_tiny_alloc_slow(size_t size, int class_idx) {
    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (locks)
    // ...
}
```

```c
// hakmem.h: Update public API
#include "hakmem_tiny_alloc_fast.h"

// Use inlined fast path for tiny allocations
static inline void* hak_alloc_at(size_t size, void* site) {
    if (size <= TINY_MAX_SIZE) {
        return hak_tiny_alloc_hot(size);
    }
    // ... L2/L2.5 pools ...
}
```
#### Benefits

- **Zero call overhead**: Compiler inlines directly into caller
- **Better register allocation**: Hot path uses minimal stack
- **Improved branch prediction**: Fast path separate from cold code
#### Implementation Steps

1. Create `hakmem_tiny_alloc_fast.h`
2. Move magazine-only logic to `hak_tiny_alloc_hot()`
3. Rename `hak_tiny_alloc()` → `hak_tiny_alloc_slow()`
4. Update `hakmem.h` to include the new header
5. Update all call sites (or keep a thin compatibility wrapper; see the sketch below)

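If updating every call site in one pass feels risky, one option (an editor's suggestion, not part of the original plan) is to keep the old symbol as a thin wrapper in `hakmem_tiny.c` while the inlined path is rolled out:

```c
// Optional compatibility wrapper (sketch): preserves the hak_tiny_alloc()
// symbol for callers not yet switched to the inlined fast path.
// Requires hakmem_tiny.c to include hakmem_tiny_alloc_fast.h.
void* hak_tiny_alloc(size_t size) {
    return hak_tiny_alloc_hot(size);
}
```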
#### Testing

```bash
# Build
make clean && make -j4

# Functional test: verify all paths work
./test_mf2
./test_mf2_warmup

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 63-70ns → 55-65ns
```

#### Rollback

```bash
git rm hakmem_tiny_alloc_fast.h
git checkout hakmem.h hakmem_tiny.c
```

---
## Testing Strategy

### Phase-by-Phase Validation

#### After P0 (Lookup Table)

```bash
# Correctness
./test_tiny_size_class   # Verify all 1025 entries correct

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 83ns → 78-80ns

# No regressions on other sizes
./bench_allocators_hakmem --scenario json   # L2.5 64KB
./bench_allocators_hakmem --scenario mir    # L2.5 256KB
```

#### After P1 (Remove Stats)

```bash
# Correctness: stats still work
./test_mf2   # hak_tiny_get_stats() should show non-zero counts

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 78-80ns → 63-70ns

# Multi-threaded: verify TLS counters work
./bench_tiny_mt --iterations=100000 --threads=4
# Should see reasonable alloc_count
```
#### After P2 (Inline Fast Path)

```bash
# Correctness: all paths work
./test_mf2
./test_mf2_warmup

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 63-70ns → 55-65ns

# Code inspection: verify inlining
objdump -d bench_tiny | grep -A 20 'hak_alloc_at'
# Should show inline expansion, not call instruction
```

### Comprehensive Validation

```bash
# After all P0-P2 complete
make clean && make -j4

# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1     # Single-threaded
./bench_tiny_mt --iterations=100000 --threads=4   # Multi-threaded
./bench_allocators_hakmem --scenario json         # L2.5 64KB
./bench_allocators_hakmem --scenario mir          # L2.5 256KB

# Git commit
git add .
git commit -m "Tiny Pool optimization P0-P2: 83ns → 55-65ns

- P0: Lookup table size classification (+3-5ns)
- P1: Remove statistics from hot path (+10-15ns)
- P2: Inline fast path (+5-10ns)

Cumulative improvement: ~30-35%
Remaining gap to mimalloc (14ns): 3.5-4x (irreducible)"
```

---
## Risk Mitigation

### Low-Risk Approach

1. **Incremental changes**: One optimization at a time
2. **Validation at each step**: Benchmark + test after each of P0, P1, P2
3. **Git commits**: Separate commit for each optimization
4. **Rollback ready**: `git checkout` if a regression is detected

### Potential Issues and Mitigation

#### Issue 1: LUT Size (1 KB)

- **Risk**: L1 cache pressure
- **Mitigation**: 1 KB fits in L1 (32 KB typical), negligible impact
- **Fallback**: Use a smaller LUT (65 entries, round up to power-of-2); see the sketch below

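On that fallback: a 65-entry table at 16-byte granularity cannot separate class 0 (≤8 B) from class 1 (≤16 B), so a table-free variant of the same round-up-to-power-of-two idea is sketched here instead (an editor's sketch, not existing hakmem code):

```c
// Sketch: classify by rounding the size up to the next power of two and
// reading its bit width; matches the if-chain for sizes 0..1024.
static inline int hak_tiny_size_to_class_clz(size_t size) {
    if (size > 1024) return -1;
    size_t v = (size <= 8) ? 8 : size;   // sizes 0..8 all map to class 0
    // 64 - clz(v - 1) == ceil(log2(v)); subtract 3 because class 0 is 8 bytes
    return 64 - __builtin_clzll((unsigned long long)(v - 1)) - 3;
}
```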
#### Issue 2: Inlining Bloat

- **Risk**: Code size increase from inlining
- **Mitigation**: Only inline the magazine path (~10 instructions)
- **Fallback**: Use `__attribute__((hot))` instead of `always_inline`; see the sketch below

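A sketch of the attribute placement if bloat does show up (GCC/Clang syntax): keep the slow path out of line and cold so callers never pull it in, and demote the hot path from `always_inline` to `hot` if the compiler's own inlining judgment is preferred.

```c
// Slow path: never inlined, kept out of the hot instruction-cache footprint
__attribute__((noinline, cold))
void* hak_tiny_alloc_slow(size_t size, int class_idx);

// Fast path fallback: drop always_inline, just bias layout toward hot code
__attribute__((hot))
static inline void* hak_tiny_alloc_hot(size_t size);  /* body as in the P2 header */
```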
#### Issue 3: Stats Accuracy

- **Risk**: Batched counters less accurate than sampled
- **Mitigation**: Actually MORE accurate (no sampling variance)
- **Fallback**: Keep TLS counters but flush more frequently; see the sketch below

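Making the flush threshold a tunable keeps "flush more frequently" a one-line change; `TINY_STATS_FLUSH_THRESHOLD` is a name invented here for illustration.

```c
// Sketch: tunable flush threshold for the batched counters from P1.
#ifndef TINY_STATS_FLUSH_THRESHOLD
#define TINY_STATS_FLUSH_THRESHOLD 100   /* default matches the P1 sketch */
#endif

// Inside hak_tiny_lazy_counter_update():
//   if (t_tiny_alloc_counter[class_idx] >= TINY_STATS_FLUSH_THRESHOLD) { ... }
```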
---

## Success Criteria

### Performance Targets

| Optimization | Current | Target | Achieved? |
|--------------|---------|--------|-----------|
| **Baseline** | 83 ns/op | - | ✓ (established) |
| **P0 Complete** | 83 ns | 78-80 ns | [ ] |
| **P1 Complete** | 78-80 ns | 63-70 ns | [ ] |
| **P2 Complete** | 63-70 ns | 55-65 ns | [ ] |
| **P0-P2 Combined** | 83 ns | **50-55 ns** | [ ] |

**Stretch Goal**: 50 ns/op (~40% improvement, 3.5x gap to mimalloc)
### Functional Requirements

- [ ] All existing tests pass
- [ ] No regressions on L2/L2.5 pools
- [ ] Statistics still functional (batched counters work)
- [ ] Multi-threaded safety maintained
- [ ] Zero hard page faults (memory reuse preserved)

### Code Quality

- [ ] Clean compilation with `-Wall -Wextra`
- [ ] No compiler warnings
- [ ] Documented trade-offs (stats batching, LUT size)
- [ ] Git commit messages reference this strategy doc
---

## Timeline and Effort

### Phase I: Quick Wins (P0 + P1)

- **P0 Implementation**: 30 minutes
- **P0 Testing**: 15 minutes
- **P1 Implementation**: 60 minutes
- **P1 Testing**: 15 minutes
- **Total Phase I**: **2 hours**

### Phase II: Fast Path (P2)

- **P2 Implementation**: 60 minutes
- **P2 Testing**: 15 minutes
- **Total Phase II**: **1.25 hours**

### Phase III: Validation

- **Comprehensive testing**: 30 minutes
- **Documentation update**: 15 minutes
- **Git commit**: 15 minutes
- **Total Phase III**: **1 hour**

### Grand Total: 4.25 hours

**Realistic estimate with contingency**: **5-6 hours**

---
## Beyond P0-P2: Future Optimization (Optional)

### P3: Branch Elimination (Not Included in This Strategy)

**Effort**: 90 minutes
**Gain**: +10-15 ns
**Risk**: High (complex, subtle bugs possible)

**Why Deferred**:

- Diminishing returns (30% improvement already achieved)
- Complexity vs gain trade-off unfavorable
- Can be addressed in a future iteration

### Alternative: NEXT_STEPS.md Approach

After P0-P2 are complete, consider the NEXT_STEPS.md optimizations:

- MPSC opportunistic drain during the alloc slow path
- Immediate full→free slab promotion after drain
- Adaptive magazine capacity per site

**These may yield better ROI than P3.**

---
## Comparison to Alternatives

### Alternative 1: Implement NEXT_STEPS.md First

**Pros**: Novel features (adaptive magazines, ELO learning)
**Cons**: Higher complexity, uncertain performance gain
**Why Not**: The mimalloc analysis shows stats overhead is 10-15 ns; addressing the known bottleneck first is safer

### Alternative 2: Copy mimalloc's Free List Architecture

**Pros**: Would match mimalloc's 14 ns performance
**Cons**: Abandons the bitmap approach, loses diagnostics/ownership tracking
**Why Not**: Violates hakmem's research goals (flexible architecture)

### Alternative 3: Do Nothing

**Pros**: Zero effort
**Cons**: The 5.9x gap remains, nothing is learned
**Why Not**: P0-P2 are low-risk, high-ROI quick wins
## Conclusion

This strategy prioritizes **high-impact, low-risk optimizations** to achieve a **30-40% performance improvement** in 4-6 hours of work.

**Key Principles**:

1. **Incremental validation**: Test after each step
2. **Focus on hot path**: Remove overhead from the critical path
3. **Preserve semantics**: No behavior changes, only optimization
4. **Accept trade-offs**: 50-55 ns is excellent; chasing 14 ns abandons research goals

**Next Steps**:

1. Review and approve this strategy
2. Implement P0 (Lookup Table) - 30 minutes
3. Validate P0 - 15 minutes
4. Implement P1 (Remove Stats) - 60 minutes
5. Validate P1 - 15 minutes
6. Implement P2 (Inline Fast Path) - 60 minutes
7. Validate P2 - 15 minutes
8. Comprehensive testing and documentation - 1 hour

**Expected Outcome**: Tiny Pool performance improves from 83 ns to 50-55 ns, closing roughly 40% of the gap with mimalloc while preserving hakmem's research architecture.

---

**Last Updated**: 2025-10-26
**Status**: Ready for implementation
**Approval**: Pending user confirmation
**Implementation Start**: After approval