# HAKMEM Tiny Allocator Refactoring Plan

## Executive Summary

**Problem**: `tiny_alloc_fast()` generates 2624 lines of assembly (a fast path should be ~20-50 lines), causing 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17). Performance: 23.6M ops/s vs System's 92.6M ops/s (3.9x slower).

**Root Cause**: Architectural bloat from the accumulation of experimental features:
- 26 conditional compilation branches in `tiny_alloc_fast.inc.h`
- 38 runtime conditional checks in the allocation path
- 11 overlapping frontend layers (Ring Cache, Unified Cache, HeapV2, UltraHot, FastCache, SFC, etc.)
- 2228-line monolithic `hakmem_tiny.c`
- 885-line `tiny_alloc_fast.inc.h` with excessive inlining

**Impact**: The "smart features" designed to improve performance are causing instruction-cache thrashing and destroying the fast path.

---

## Analysis: Current Architecture Problems

### Problem 1: Too Many Frontend Layers (Bloat Disease)

**Current layers in `tiny_alloc_fast()`** (lines 562-812):

```c
static inline void* tiny_alloc_fast(size_t size) {
    // Layer 0: FastCache (C0-C3 only) - lines 232-244
    if (g_fastcache_enable && class_idx <= 3) { ... }
    // Layer 1: SFC (Super Front Cache) - lines 255-274
    if (sfc_is_enabled) { ... }
    // Layer 2: Front C23 (Ultra-simple C2/C3) - lines 610-617
    if (tiny_front_c23_enabled() && (class_idx == 2 || class_idx == 3)) { ... }
    // Layer 3: Unified Cache (tcache-style) - lines 623-635
    if (unified_cache_enabled()) { ... }
    // Layer 4: Ring Cache (C2/C3/C5 only) - lines 641-659
    if (class_idx == 2 || class_idx == 3) { ... }
    // Layer 5: UltraHot (C2-C5) - lines 669-686
    if (ultra_hot_enabled() && front_prune_ultrahot_enabled()) { ... }
    // Layer 6: HeapV2 (C0-C3) - lines 693-701
    if (tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3) { ... }
    // Layer 7: Class5 hotpath (256B dedicated) - lines 710-732
    if (hot_c5) { ... }
    // Layer 8: TLS SLL (generic) - lines 736-752
    if (g_tls_sll_enable && !s_front_direct_alloc) { ... }
    // Layer 9: Front-Direct refill - lines 759-775
    if (s_front_direct_alloc) { ... }
    // Layer 10: Legacy refill - lines 769-775
    else { ... }
    // Layer 11: Slow path - lines 806-809
    ptr = hak_tiny_alloc_slow(size, class_idx);
}
```

**Problem**: 11 layers with overlapping responsibilities!
- **Redundancy**: Ring Cache (C2/C3), Front C23 (C2/C3), and UltraHot (C2-C5) all target the same classes
- **Branch explosion**: Each layer adds 2-5 conditional branches
- **I-cache thrashing**: 2624 assembly lines cannot fit in the L1 instruction cache (32KB = ~10K instructions)

### Problem 2: Assembly Bloat Analysis

**Expected fast path** (System malloc tcache):

```asm
; ~6 instructions on the hit path
mov  rax, QWORD PTR [tls_cache + class*8]   ; Load head
test rax, rax                               ; Check NULL
je   .miss                                  ; Branch on empty
mov  rdx, QWORD PTR [rax]                   ; Load next
mov  QWORD PTR [tls_cache + class*8], rdx   ; Update head
ret                                         ; Return ptr
.miss:
call tcache_refill                          ; Refill (cold path)
```

**Actual HAKMEM fast path**: 2624 lines of assembly!

**Why?**
1. **Inlining explosion**: Every `__attribute__((always_inline))` layer inlines ALL of its branches
2. **ENV checks**: Multiple `getenv()`-backed flag checks stay inlined in the hot path, even with TLS caching (see the init-time parsing sketch after this list)
3. **Debug code**: Not properly gated with `#if !HAKMEM_BUILD_RELEASE`
4. **Metrics**: Frontend metrics tracking (`front_metrics_*`) adds 50-100 instructions
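Phase 3 below assigns ENV flag parsing to `tiny_state.c` as an init-time-only concern. A minimal sketch of that pattern, assuming a one-shot config struct; the `tiny_env_flag()`/`g_tiny_cfg` names and the specific `HAKMEM_TINY_*` variable names are hypothetical, not the existing HAKMEM API:

```c
// Hypothetical sketch: resolve HAKMEM_TINY_* flags once at init so the hot
// path reads plain globals instead of carrying getenv()-derived checks.
#include <stdlib.h>
#include <stdbool.h>

typedef struct {
    bool unified_cache_enable;
    bool tls_sll_enable;
} tiny_config_t;

static tiny_config_t g_tiny_cfg;  // written once during tiny init

static bool tiny_env_flag(const char* name, bool default_value) {
    const char* v = getenv(name);
    if (!v || !*v) return default_value;
    return v[0] != '0';  // "0" disables, anything else enables
}

void tiny_config_init(void) {
    // Variable names are placeholders following the HAKMEM_TINY_* convention.
    g_tiny_cfg.unified_cache_enable = tiny_env_flag("HAKMEM_TINY_UNIFIED_CACHE", true);
    g_tiny_cfg.tls_sll_enable       = tiny_env_flag("HAKMEM_TINY_TLS_SLL", true);
}
```

The hot path then tests plain globals (or the checks disappear entirely once a feature is deleted), so no `getenv()`-related code is left to inline.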
### Problem 3: File Organization Chaos

**`hakmem_tiny.c`** (2228 lines):
- Lines 1-500: Global state, TLS variables, initialization
- Lines 500-1000: TLS operations (refill, spill, bind)
- Lines 1000-1500: SuperSlab management
- Lines 1500-2000: Registry operations, slab management
- Lines 2000-2228: Statistics, lifecycle, API wrappers

**Problems**:
- No clear separation of concerns
- Mix of hot path (refill) and cold path (init, stats)
- Circular dependencies between files via `#include`

---

## Refactoring Plan: 3-Phase Approach

### Phase 1: Identify and Remove Dead Features (Priority 1, Quick Win)

**Goal**: Remove experimental features that are disabled or have a negative performance impact.

**Actions**:

1. **Audit ENV flags** (1 hour):

   ```bash
   grep -r "getenv.*HAKMEM_TINY" core/ | cut -d: -f2 | sort -u > env_flags.txt
   # Identify which are:
   # - Always disabled (default=0, never used)
   # - Negative performance (A/B test showed regression)
   # - Redundant (overlapping with better features)
   ```

2. **Remove confirmed-dead features** (2 hours):
   - **UltraHot** (Phase 19-4): ENV default OFF, adds 11.7% overhead → DELETE
   - **HeapV2** (Phase 13-A): ENV gated, overlaps with Ring Cache → DELETE
   - **Front C23**: Redundant with Ring Cache → DELETE
   - **FastCache**: Overlaps with SFC → CONSOLIDATE into SFC

3. **Simplify to a 3-layer hierarchy** (result):

   ```
   Layer 0: Unified Cache (tcache-style, all classes C0-C7)
   Layer 1: TLS SLL (unlimited overflow)
   Layer 2: SuperSlab backend (refill source)
   ```

**Expected impact**: -30-40% assembly size, +10-15% performance

---

### Phase 2: Extract Hot Path to Separate File (Priority 1, Critical)

**Goal**: Create an ultra-simple fast path with zero cold code.

**File split**:

```
core/tiny_alloc_fast.inc.h (885 lines)
    ↓
core/tiny_alloc_ultra.inc.h    (50-100 lines, HOT PATH ONLY)
core/tiny_alloc_refill.inc.h   (200-300 lines, refill logic)
core/tiny_alloc_frontend.inc.h (300-400 lines, frontend layers)
core/tiny_alloc_metrics.inc.h  (100-150 lines, debug/stats)
```

**`tiny_alloc_ultra.inc.h`** (NEW, ultra-simple):

```c
// Ultra-fast path: 10-20 instructions, no branches except miss
static inline void* tiny_alloc_ultra(int class_idx) {
    // Layer 0: Unified Cache (single TLS array)
    void* ptr = g_unified_cache[class_idx].pop();
    if (__builtin_expect(ptr != NULL, 1)) {
        // Fast hit: 3-4 instructions
        HAK_RET_ALLOC(class_idx, ptr);
    }

    // Layer 1: TLS SLL (overflow)
    ptr = tls_sll_pop(class_idx);
    if (ptr) {
        HAK_RET_ALLOC(class_idx, ptr);
    }

    // Miss: delegate to refill (cold path, out-of-line)
    return tiny_alloc_refill_slow(class_idx);
}
```

**Expected assembly**:

```asm
tiny_alloc_ultra:                           ; ~15-20 instructions total
    mov  rax, [g_unified_cache + class*8]   ; Load cache head
    test rax, rax                           ; Check NULL
    je   .try_sll                           ; Branch on miss
    mov  rdx, [rax]                         ; Load next
    mov  [g_unified_cache + class*8], rdx   ; Update head
    mov  byte [rax], HEADER_MAGIC | class   ; Write header
    lea  rax, [rax + 1]                     ; USER = BASE + 1
    ret                                     ; Return
.try_sll:
    call tls_sll_pop                        ; Try TLS SLL
    test rax, rax
    jne  .sll_hit
    call tiny_alloc_refill_slow             ; Cold path (out-of-line)
    ret
.sll_hit:
    mov  byte [rax], HEADER_MAGIC | class
    lea  rax, [rax + 1]
    ret
```

**Expected impact**: ~20-30 instructions (from 2624), +200-300% performance
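For context, here is a minimal sketch of the per-class TLS free-list array that `tiny_alloc_ultra()` above assumes behind `g_unified_cache[class_idx].pop()`. The slot layout, the class count, and the `unified_cache_pop()`/`unified_cache_push()` names are illustrative assumptions, not the existing HAKMEM implementation:

```c
// Hypothetical sketch of the "single TLS array" unified cache.
// Each size class keeps an intrusive singly linked list of free blocks;
// the next pointer is stored in the first bytes of the free block itself.
#include <stddef.h>

#define TINY_NUM_CLASSES 8  // C0-C7, per the 3-layer hierarchy above

typedef struct {
    void* head;  // top of the per-class free list (NULL when empty)
} unified_cache_slot_t;

static __thread unified_cache_slot_t g_unified_cache[TINY_NUM_CLASSES];

// Pop one block, or NULL on miss; corresponds to the ".pop()" in the sketch above.
static inline void* unified_cache_pop(int class_idx) {
    void* ptr = g_unified_cache[class_idx].head;
    if (ptr) {
        g_unified_cache[class_idx].head = *(void**)ptr;  // next pointer lives in the block
    }
    return ptr;
}

// Push one block back (used by free and by refill).
static inline void unified_cache_push(int class_idx, void* ptr) {
    *(void**)ptr = g_unified_cache[class_idx].head;
    g_unified_cache[class_idx].head = ptr;
}
```

With one intrusive head pointer per class, the hit path reduces to the load/test/store sequence shown in the expected assembly above.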
---

### Phase 3: Refactor `hakmem_tiny.c` into Modules (Priority 2, Maintainability)

**Goal**: Split the 2228-line monolith into focused, testable modules.

**File structure** (new):

```
core/
├── hakmem_tiny.c          (300-400 lines, main API only)
├── tiny_state.c           (200-300 lines, global state)
├── tiny_tls.c             (300-400 lines, TLS operations)
├── tiny_superslab.c       (400-500 lines, SuperSlab backend)
├── tiny_registry.c        (200-300 lines, slab registry)
├── tiny_lifecycle.c       (200-300 lines, init/shutdown)
├── tiny_stats.c           (200-300 lines, statistics)
└── tiny_alloc_ultra.inc.h (50-100 lines, FAST PATH)
```

**Module responsibilities**:

1. **`hakmem_tiny.c`** (300-400 lines):
   - Public API: `hak_tiny_alloc()`, `hak_tiny_free()`
   - Wrapper functions only
   - Include order: `tiny_alloc_ultra.inc.h` → fast path inline

2. **`tiny_state.c`** (200-300 lines):
   - Global variables: `g_tiny_pool`, `g_tls_sll_head[]`, etc.
   - ENV flag parsing (init-time only)
   - Configuration structures

3. **`tiny_tls.c`** (300-400 lines):
   - TLS operations: `tls_refill()`, `tls_spill()`, `tls_bind()`
   - TLS cache management
   - Adaptive sizing logic

4. **`tiny_superslab.c`** (400-500 lines):
   - SuperSlab allocation: `superslab_refill()`, `superslab_alloc()`
   - Slab metadata management
   - Active block tracking

5. **`tiny_registry.c`** (200-300 lines):
   - Slab registry: `registry_lookup()`, `registry_register()`
   - Hash table operations
   - Owner slab lookup

6. **`tiny_lifecycle.c`** (200-300 lines):
   - Initialization: `hak_tiny_init()`
   - Shutdown: `hak_tiny_shutdown()`
   - Prewarm: `hak_tiny_prewarm_tls_cache()`

7. **`tiny_stats.c`** (200-300 lines):
   - Statistics collection
   - Debug counters
   - Metrics printing

**Benefits**:
- Each file < 500 lines (maintainable)
- Clear dependencies (no circular includes)
- Testable in isolation
- Parallel compilation

---

## Priority Order & Estimated Impact

### Priority 1: Quick Wins (1-2 days)

**Task 1.1**: Remove dead features (2 hours)
- Delete UltraHot, HeapV2, Front C23
- Remove ENV checks for disabled features
- **Impact**: -30% assembly, +10% performance

**Task 1.2**: Extract ultra-fast path (4 hours)
- Create `tiny_alloc_ultra.inc.h` (50 lines)
- Move refill logic to separate file
- **Impact**: -90% assembly (2624 → 200 lines), +150-200% performance

**Task 1.3**: Remove debug code from release builds (2 hours)
- Gate all `fprintf()` with `#if !HAKMEM_BUILD_RELEASE` (see the gating sketch below)
- Remove profiling counters in release
- **Impact**: -10% assembly, +5-10% performance

**Expected total (Priority 1)**: 23.6M → 60-80M ops/s (+150-240%)
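A minimal sketch of the gating Task 1.3 describes, assuming `HAKMEM_BUILD_RELEASE` is the release-build macro this plan already references; the `TINY_DEBUG_LOG`/`TINY_COUNTER_INC` wrapper names are hypothetical:

```c
// Hypothetical release gating: debug logging and profiling counters compile
// to nothing when HAKMEM_BUILD_RELEASE is set, so none of it reaches the fast path.
#include <stdio.h>

#if !HAKMEM_BUILD_RELEASE
  #define TINY_DEBUG_LOG(...)  fprintf(stderr, __VA_ARGS__)
  #define TINY_COUNTER_INC(c)  ((void)(++(c)))
#else
  #define TINY_DEBUG_LOG(...)  ((void)0)
  #define TINY_COUNTER_INC(c)  ((void)0)
#endif
```

The same gating can cover the `front_metrics_*` counters called out in Problem 2.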
---

### Priority 2: Code Health (2-3 days)

**Task 2.1**: Split `hakmem_tiny.c` (1 day)
- Extract modules as described above
- Fix include dependencies
- **Impact**: Maintainability only (no performance change)

**Task 2.2**: Simplify frontend to 2 layers (1 day)
- Unified Cache (Layer 0) + TLS SLL (Layer 1)
- Remove redundant Ring/SFC/FastCache
- **Impact**: -5-10% assembly, +5-10% performance

**Task 2.3**: Documentation (0.5 day)
- Document new architecture in `ARCHITECTURE.md`
- Add performance benchmarks
- **Impact**: Team velocity +20%

---

### Priority 3: Advanced Optimization (3-5 days, optional)

**Task 3.1**: Profile-guided optimization
- Collect PGO data from benchmarks
- Recompile with `-fprofile-use`
- **Impact**: +10-20% performance

**Task 3.2**: Assembly-level tuning
- Hand-optimize critical sections
- Align hot paths to cache lines
- **Impact**: +5-10% performance

---

## Recommended Implementation Order

**Week 1** (Priority 1 - Quick Wins):
1. **Day 1**: Remove dead features + create `tiny_alloc_ultra.inc.h`
2. **Day 2**: Test + benchmark + iterate

**Week 2** (Priority 2 - Code Health):
3. **Day 3-4**: Split `hakmem_tiny.c` into modules
4. **Day 5**: Simplify frontend layers

**Week 3** (Priority 3 - Optional):
5. **Day 6-7**: PGO + assembly tuning

---

## Expected Performance Results

### Current (baseline):
- Performance: 23.6M ops/s
- Assembly: 2624 lines
- L1 misses: 1.98 miss/op

### After Priority 1 (Quick Wins):
- Performance: 60-80M ops/s (+150-240%)
- Assembly: 150-200 lines (-92%)
- L1 misses: 0.4-0.6 miss/op (-70%)

### After Priority 2 (Code Health):
- Performance: 70-90M ops/s (+200-280%)
- Assembly: 100-150 lines (-94%)
- L1 misses: 0.2-0.4 miss/op (-80%)
- Maintainability: Much improved

### Target (System malloc parity):
- Performance: 92.6M ops/s (System malloc baseline)
- Assembly: 50-100 lines (tcache equivalent)
- L1 misses: 0.17 miss/op (System malloc level)

---

## Risk Assessment

### Low Risk:
- Removing disabled features (UltraHot, HeapV2, Front C23)
- Extracting the fast path to a separate file
- Gating debug code with `#if !HAKMEM_BUILD_RELEASE`

### Medium Risk:
- Simplifying the frontend from 11 layers → 2 layers
- **Mitigation**: Keep Ring Cache as a fallback during the transition
- **A/B test**: Toggle via `HAKMEM_TINY_UNIFIED_ONLY=1`

### High Risk:
- Splitting `hakmem_tiny.c` (circular dependencies)
- **Mitigation**: Incremental extraction, one module at a time
- **Test**: Ensure all benchmarks pass after each extraction

---

## Conclusion

The current architecture suffers from **feature accumulation disease**: 11 experimental frontend layers competing in the same allocation path, creating massive instruction bloat (2624 lines of assembly).

The solution is aggressive simplification:
1. **Remove dead/redundant features** (11 layers → 2 layers)
2. **Extract the ultra-fast path** (2624 asm lines → 100-150 lines)
3. **Split the monolithic file** (2228 lines → 7 focused modules)

**Expected outcome**: 3-4x performance improvement (23.6M → 70-90M ops/s), approaching System malloc parity (92.6M ops/s).

**Recommended action**: Start with the Priority 1 tasks (1-2 days), which deliver 80% of the benefit with minimal risk.