# HAKMEM Tiny Allocator Refactoring - Executive Summary

## Problem Statement

**Current performance**: 23.6M ops/s (Random Mixed 256B benchmark)
**System malloc**: 92.6M ops/s (baseline)
**Performance gap**: **3.9x slower**

**Root cause**: `tiny_alloc_fast()` compiles to **2624 lines of assembly** (it should be ~20-50), causing:

- **11.6x more L1 cache misses** than System malloc (1.98 misses/op vs 0.17)
- **Instruction-cache thrashing** from 11 overlapping frontend layers
- **Branch-prediction failures** from 26 conditional compilation paths plus 38 runtime checks

## Architecture Analysis

### Current Bloat Inventory

**Frontend layers in `tiny_alloc_fast()`** (11 total):

1. FastCache (C0-C3 array stack)
2. SFC (Super Front Cache, all classes)
3. Front C23 (ultra-simple C2/C3)
4. Unified Cache (tcache-style, all classes)
5. Ring Cache (C2/C3/C5 array cache)
6. UltraHot (C2-C5 magazine)
7. HeapV2 (C0-C3 magazine)
8. Class5 Hotpath (dedicated 256B path)
9. TLS SLL (generic freelist)
10. Front-Direct (experimental bypass)
11. Legacy refill path

**Problem**: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!
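The layered-dispatch pattern described above can be sketched as follows. This is an illustrative model, not HAKMEM code (`layer_t`, `layer_pop`, and `layered_alloc` are hypothetical names): each frontend layer contributes at least one branch plus a cold fallthrough, and in the real allocator the 11 checks are distinct inlined code paths rather than a loop, which is why the compiled fast path balloons far beyond a single tcache-style pop.

```c
#include <stddef.h>

/* One bounded array cache, standing in for any of the 11 frontend layers. */
typedef struct {
    void *slots[8];
    int   top;
} layer_t;

/* Pop from one layer; NULL on miss. */
static void *layer_pop(layer_t *l) {
    return (l->top > 0) ? l->slots[--l->top] : NULL;
}

/* Try each layer in order: every miss costs another taken branch.
 * In the real allocator these are 11 separately inlined checks,
 * not a loop, so the assembly and branch count explode. */
static void *layered_alloc(layer_t *layers, int n) {
    for (int i = 0; i < n; i++) {
        void *p = layer_pop(&layers[i]);
        if (p)
            return p;
    }
    return NULL; /* fall through to the slow refill path */
}
```

With only the last layer populated, every allocation pays a miss in each preceding layer before hitting, which models the redundant C2/C3 layers competing for the same requests.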
### File Size Issues

- `hakmem_tiny.c`: **2228 lines** (should be ~300-500)
- `tiny_alloc_fast.inc.h`: **885 lines** (should be ~50-100)
- `core/front/` directory: **2127 lines** total (11 experimental layers)

## Solution: 3-Phase Refactoring

### Phase 1: Remove Dead Features (1 day, ZERO risk)

**Target**: 4 features proven harmful or redundant

| Feature | Lines | Status | Evidence |
|---------|-------|--------|----------|
| UltraHot | ~150 | Disabled by default | A/B test: +12.9% when OFF |
| HeapV2 | ~120 | Disabled by default | Redundant with Ring Cache |
| Front C23 | ~80 | Opt-in only | Redundant with Ring Cache |
| Class5 Hotpath | ~150 | Disabled by default | Special case, unnecessary |

**Expected results**:
- Assembly: 2624 → 1000-1200 lines (-60%)
- Performance: 23.6M → 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: **ZERO** (all four are disabled by default and proven harmful)

### Phase 2: Simplify to 2-Layer Architecture (2-3 days)

**Current**: 11 layers (chaotic)
**Target**: 2 layers (clean)

```
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
  ↓ miss
Layer 1: TLS SLL (unlimited overflow)
  ↓ miss
Backend: SuperSlab (refill source)
```

**Tasks**:
1. A/B test: Ring Cache vs Unified Cache → pick the winner
2. A/B test: FastCache vs SFC → consolidate into the winner
3. A/B test: Front-Direct vs Legacy → pick one refill path
4. Extract the ultra-fast path into `tiny_alloc_ultra.inc.h` (~50 lines)

**Expected results**:
- Assembly: 1000-1200 → 150-200 lines (-90% from baseline)
- Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
- Time: 2-3 days
- Risk: LOW (A/B tests guard against regressions)

### Phase 3: Split Monolithic Files (2-3 days)

**Current**: `hakmem_tiny.c` (2228 lines, unmaintainable)
**Target**: 7 focused modules (~300-500 lines each) plus an inline fast-path header

```
hakmem_tiny.c          (300-400 lines) - Public API
tiny_state.c           (200-300 lines) - Global state
tiny_tls.c             (300-400 lines) - TLS operations
tiny_superslab.c       (400-500 lines) - SuperSlab backend
tiny_registry.c        (200-300 lines) - Slab registry
tiny_lifecycle.c       (200-300 lines) - Init/shutdown
tiny_stats.c           (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines)  - FAST PATH (inline)
```

**Expected results**:
- Maintainability: much improved (clear dependencies)
- Performance: no change (structural refactor only)
- Time: 2-3 days
- Risk: MEDIUM (requires careful dependency management)

## Performance Projections

### Baseline (Current)
- **Performance**: 23.6M ops/s
- **Assembly**: 2624 lines
- **L1 misses**: 1.98 misses/op
- **Gap to System malloc**: 3.9x slower

### After Phase 1 (Quick Win)
- **Performance**: 40-50M ops/s (+70-110%)
- **Assembly**: 1000-1200 lines (-60%)
- **L1 misses**: 0.8-1.2 misses/op (-40%)
- **Gap to System malloc**: 1.9-2.3x slower

### After Phase 2 (Architecture Fix)
- **Performance**: 70-90M ops/s (+200-280%)
- **Assembly**: 150-200 lines (-92%)
- **L1 misses**: 0.3-0.5 misses/op (-75%)
- **Gap to System malloc**: 1.0-1.3x slower

### Target (System malloc Parity)
- **Performance**: 92.6M ops/s (System malloc baseline)
- **Assembly**: 50-100 lines (tcache equivalent)
- **L1 misses**: 0.17 misses/op (System malloc level)
- **Gap**: **CLOSED**

## Implementation Timeline

### Week 1: Phase 1 (Quick Win)
- **Day 1**: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
- **Day 2**: Test, benchmark, verify (40-50M ops/s expected)

### Week 2: Phase 2 (Architecture)
- **Day 3**: A/B test Ring vs Unified vs SFC (pick the winner)
- **Day 4**: A/B test Front-Direct vs Legacy (pick the winner)
- **Day 5**: Extract `tiny_alloc_ultra.inc.h` (ultra-fast path)

### Week 3: Phase 3 (Code Health)
- **Days 6-7**: Split `hakmem_tiny.c` into 7 modules
- **Day 8**: Test, document, finalize

**Total**: 8 working days (~2 weeks)

## Risk Assessment

### Phase 1 (Zero Risk)
- ✅ All 4 features disabled by default
- ✅ UltraHot proven harmful (+12.9% when OFF)
- ✅ HeapV2/Front C23 redundant (Ring Cache is better)
- ✅ Class5 Hotpath unnecessary (Ring Cache handles C5)

**Worst case**: performance stays the same (very unlikely)
**Expected case**: +70-110% improvement
**Best case**: +150-200% improvement

### Phase 2 (Low Risk)
- ⚠️ A/B tests required before removing features
- ⚠️ Keep the losers as fallbacks during the transition
- ✅ Toggled via ENV flags (easy rollback)

**Worst case**: A/B test shows no winner → keep both temporarily
**Expected case**: +200-280% improvement
**Best case**: +300-350% improvement

### Phase 3 (Medium Risk)
- ⚠️ Circular dependencies in the current code
- ⚠️ Careful extraction needed to avoid breakage
- ✅ Incremental approach (extract one module at a time)

**Worst case**: build breaks → incremental rollback
**Expected case**: no performance change (structural only)
**Best case**: easier maintenance → faster future iterations

## Recommended Action

### Immediate (Week 1)
**Execute Phase 1 immediately** - highest ROI, lowest risk
- Remove the 4 dead/harmful features
- Expected: 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO

### Short-term (Week 2)
**Execute Phase 2** - core architecture fix
- A/B test competing features, keep the winners
- Extract the ultra-fast path
- Expected: 70-90M ops/s (+200-280%)
- Time: 3 days
- Risk: LOW (A/B tests mitigate risk)

### Medium-term (Week 3)
**Execute Phase 3** - code health & maintainability
- Split monolithic files
- Document the architecture
- Expected: no performance change, much easier maintenance
- Time: 2-3 days
- Risk: MEDIUM (careful execution required)

## Key Insights

### Why the Current Architecture Fails

**Root cause**: **feature accumulation disease**
- 26 phases of development, each adding a new layer
- No removal of failed experiments (UltraHot, HeapV2, Front C23)
- Overlapping responsibilities (Ring Cache, Front C23, and UltraHot all target C2/C3)
- **Result**: 11 competing layers → branch explosion → I-cache thrashing

### Why System Malloc Is Faster

**System malloc (glibc tcache)**:
- 1 layer (tcache)
- 3-4 instruction fast path
- ~10-15 bytes of assembly
- Fits entirely in the L1 instruction cache

**HAKMEM today**:
- 11 layers (chaotic)
- 2624-instruction fast path
- ~10KB of assembly
- Thrashes the L1 instruction cache (32KB ≈ ~10K instructions)

**Solution**: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.

## Success Metrics

### Primary Metric: Performance
- **Phase 1 target**: 40-50M ops/s (+70-110%)
- **Phase 2 target**: 70-90M ops/s (+200-280%)
- **Final target**: 92.6M ops/s (System malloc parity)

### Secondary Metrics
- **Assembly size**: 2624 → 150-200 lines (-92%)
- **L1 cache misses**: 1.98 → 0.2-0.4 misses/op (-80%)
- **Code maintainability**: 2228-line monolith → 7 focused modules

### Validation
- Benchmark: `bench_random_mixed_hakmem` (Random Mixed 256B)
- Acceptance: must match or exceed System malloc (92.6M ops/s)

## Conclusion

The HAKMEM Tiny allocator suffers from **architectural bloat** (11 frontend layers), causing a 3.9x performance gap vs System malloc. The solution is aggressive simplification:

1. **Remove 4 dead features** (1 day, +70-110%)
2. **Simplify to 2 layers** (3 days, +200-280%)
3. **Split monolithic files** (3 days, maintainability)

**Total time**: 2 weeks
**Expected outcome**: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
**Risk**: LOW (Phase 1 is ZERO risk; Phase 2 uses A/B tests)
**Recommendation**: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).
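The proposed 2-layer fast path (Unified Cache → TLS SLL, with the SuperSlab backend on a full miss) can be sketched as below. This is a hedged illustration under assumed names: `unified_cache_t`, `tls_sll_t`, and `tiny_alloc_ultra` are hypothetical, not the actual HAKMEM identifiers or layout.

```c
#include <stddef.h>

#define UC_CAP 16 /* bounded, tcache-style per-class capacity (assumed) */

/* Layer 0: one bounded array cache per size class. */
typedef struct {
    void *slots[UC_CAP];
    int   count;
} unified_cache_t;

/* Layer 1: unbounded TLS singly-linked freelist. */
typedef struct tls_node {
    struct tls_node *next;
} tls_node_t;

typedef struct {
    tls_sll_node_placeholder: ;
} tls_sll_placeholder;

static inline void *tiny_alloc_ultra_hit(unified_cache_t *uc) {
    return uc->slots[--uc->count];
}
```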