# HAKMEM Tiny Allocator Refactoring Plan

## Executive Summary

**Problem**: `tiny_alloc_fast()` generates 2624 lines of assembly (a fast path should be ~20-50 lines), causing 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17). Performance: 23.6M ops/s vs System's 92.6M ops/s (3.9x slower).

**Root Cause**: Architectural bloat from the accumulation of experimental features:
- 26 conditional compilation branches in `tiny_alloc_fast.inc.h`
- 38 runtime conditional checks in the allocation path
- 11 overlapping frontend layers (Ring Cache, Unified Cache, HeapV2, UltraHot, FastCache, SFC, etc.)
- 2228-line monolithic `hakmem_tiny.c`
- 885-line `tiny_alloc_fast.inc.h` with excessive inlining

**Impact**: The "smart features" designed to improve performance are causing instruction-cache thrashing and destroying the fast path.

---

## Analysis: Current Architecture Problems

### Problem 1: Too Many Frontend Layers (Bloat Disease)

**Current layers in `tiny_alloc_fast()`** (lines 562-812):

```c
static inline void* tiny_alloc_fast(size_t size) {
    // Layer 0: FastCache (C0-C3 only) - lines 232-244
    if (g_fastcache_enable && class_idx <= 3) { ... }
    // Layer 1: SFC (Super Front Cache) - lines 255-274
    if (sfc_is_enabled) { ... }
    // Layer 2: Front C23 (Ultra-simple C2/C3) - lines 610-617
    if (tiny_front_c23_enabled() && (class_idx == 2 || class_idx == 3)) { ... }
    // Layer 3: Unified Cache (tcache-style) - lines 623-635
    if (unified_cache_enabled()) { ... }
    // Layer 4: Ring Cache (C2/C3/C5 only) - lines 641-659
    if (class_idx == 2 || class_idx == 3) { ... }
    // Layer 5: UltraHot (C2-C5) - lines 669-686
    if (ultra_hot_enabled() && front_prune_ultrahot_enabled()) { ... }
    // Layer 6: HeapV2 (C0-C3) - lines 693-701
    if (tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3) { ... }
    // Layer 7: Class5 hotpath (256B dedicated) - lines 710-732
    if (hot_c5) { ... }
    // Layer 8: TLS SLL (generic) - lines 736-752
    if (g_tls_sll_enable && !s_front_direct_alloc) { ... }
    // Layer 9: Front-Direct refill - lines 759-775
    if (s_front_direct_alloc) { ... }
    // Layer 10: Legacy refill - lines 769-775
    else { ... }
    // Layer 11: Slow path - lines 806-809
    ptr = hak_tiny_alloc_slow(size, class_idx);
}
```

**Problem**: 11 layers with overlapping responsibilities!
- **Redundancy**: Ring Cache (C2/C3), Front C23 (C2/C3), and UltraHot (C2-C5) all target the same classes
- **Branch explosion**: Each layer adds 2-5 conditional branches
- **I-cache thrashing**: 2624 assembly lines cannot fit in the L1 instruction cache (32KB = ~10K instructions)

### Problem 2: Assembly Bloat Analysis

**Expected fast path** (System malloc tcache):

```asm
; ~6 instructions on the hit path
mov  rax, QWORD PTR [tls_cache + class*8]   ; Load head
test rax, rax                               ; Check NULL
je   .miss                                  ; Branch on empty
mov  rdx, QWORD PTR [rax]                   ; Load next
mov  QWORD PTR [tls_cache + class*8], rdx   ; Update head
ret                                         ; Return ptr
.miss:
call tcache_refill                          ; Refill (cold path)
```

**Actual HAKMEM fast path**: 2624 lines of assembly!

**Why?**
1. **Inlining explosion**: Every `__attribute__((always_inline))` layer inlines ALL of its branches
2. **ENV checks**: Multiple `getenv()`-backed flag checks stay inlined in the hot path, even with TLS caching (see the init-time parsing sketch after this list)
3. **Debug code**: Not properly gated with `#if !HAKMEM_BUILD_RELEASE`
4. **Metrics**: Frontend metrics tracking (`front_metrics_*`) adds 50-100 instructions
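Phase 3 below assigns ENV flag parsing to `tiny_state.c` as an init-time-only concern. A minimal sketch of that pattern, assuming a one-shot config struct; the `tiny_env_flag()`/`g_tiny_cfg` names and the specific `HAKMEM_TINY_*` variable names are hypothetical, not the existing HAKMEM API:

```c
// Hypothetical sketch: resolve HAKMEM_TINY_* flags once at init so the hot
// path reads plain globals instead of carrying getenv()-derived checks.
#include <stdlib.h>
#include <stdbool.h>

typedef struct {
    bool unified_cache_enable;
    bool tls_sll_enable;
} tiny_config_t;

static tiny_config_t g_tiny_cfg;  // written once during tiny init

static bool tiny_env_flag(const char* name, bool default_value) {
    const char* v = getenv(name);
    if (!v || !*v) return default_value;
    return v[0] != '0';  // "0" disables, anything else enables
}

void tiny_config_init(void) {
    // Variable names are placeholders following the HAKMEM_TINY_* convention.
    g_tiny_cfg.unified_cache_enable = tiny_env_flag("HAKMEM_TINY_UNIFIED_CACHE", true);
    g_tiny_cfg.tls_sll_enable       = tiny_env_flag("HAKMEM_TINY_TLS_SLL", true);
}
```

The hot path then tests plain globals (or the checks disappear entirely once a feature is deleted), so no `getenv()`-related code is left to inline.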
### Problem 3: File Organization Chaos

**`hakmem_tiny.c`** (2228 lines):
- Lines 1-500: Global state, TLS variables, initialization
- Lines 500-1000: TLS operations (refill, spill, bind)
- Lines 1000-1500: SuperSlab management
- Lines 1500-2000: Registry operations, slab management
- Lines 2000-2228: Statistics, lifecycle, API wrappers

**Problems**:
- No clear separation of concerns
- Mix of hot path (refill) and cold path (init, stats)
- Circular dependencies between files via `#include`

---

## Refactoring Plan: 3-Phase Approach

### Phase 1: Identify and Remove Dead Features (Priority 1, Quick Win)

**Goal**: Remove experimental features that are disabled or have a negative performance impact.

**Actions**:

1. **Audit ENV flags** (1 hour):

   ```bash
   grep -r "getenv.*HAKMEM_TINY" core/ | cut -d: -f2 | sort -u > env_flags.txt
   # Identify which are:
   # - Always disabled (default=0, never used)
   # - Negative performance (A/B test showed regression)
   # - Redundant (overlapping with better features)
   ```

2. **Remove confirmed-dead features** (2 hours):
   - **UltraHot** (Phase 19-4): ENV default OFF, adds 11.7% overhead → DELETE
   - **HeapV2** (Phase 13-A): ENV gated, overlaps with Ring Cache → DELETE
   - **Front C23**: Redundant with Ring Cache → DELETE
   - **FastCache**: Overlaps with SFC → CONSOLIDATE into SFC

3. **Simplify to a 3-layer hierarchy** (result):

   ```
   Layer 0: Unified Cache (tcache-style, all classes C0-C7)
   Layer 1: TLS SLL (unlimited overflow)
   Layer 2: SuperSlab backend (refill source)
   ```

**Expected impact**: -30-40% assembly size, +10-15% performance

---

### Phase 2: Extract Hot Path to Separate File (Priority 1, Critical)

**Goal**: Create an ultra-simple fast path with zero cold code.

**File split**:

```
core/tiny_alloc_fast.inc.h (885 lines)
    ↓
core/tiny_alloc_ultra.inc.h    (50-100 lines, HOT PATH ONLY)
core/tiny_alloc_refill.inc.h   (200-300 lines, refill logic)
core/tiny_alloc_frontend.inc.h (300-400 lines, frontend layers)
core/tiny_alloc_metrics.inc.h  (100-150 lines, debug/stats)
```

**`tiny_alloc_ultra.inc.h`** (NEW, ultra-simple):

```c
// Ultra-fast path: 10-20 instructions, no branches except miss
static inline void* tiny_alloc_ultra(int class_idx) {
    // Layer 0: Unified Cache (single TLS array)
    void* ptr = g_unified_cache[class_idx].pop();
    if (__builtin_expect(ptr != NULL, 1)) {
        // Fast hit: 3-4 instructions
        HAK_RET_ALLOC(class_idx, ptr);
    }

    // Layer 1: TLS SLL (overflow)
    ptr = tls_sll_pop(class_idx);
    if (ptr) {
        HAK_RET_ALLOC(class_idx, ptr);
    }

    // Miss: delegate to refill (cold path, out-of-line)
    return tiny_alloc_refill_slow(class_idx);
}
```

**Expected assembly**:

```asm
tiny_alloc_ultra:                           ; ~15-20 instructions total
    mov  rax, [g_unified_cache + class*8]   ; Load cache head
    test rax, rax                           ; Check NULL
    je   .try_sll                           ; Branch on miss
    mov  rdx, [rax]                         ; Load next
    mov  [g_unified_cache + class*8], rdx   ; Update head
    mov  byte [rax], HEADER_MAGIC | class   ; Write header
    lea  rax, [rax + 1]                     ; USER = BASE + 1
    ret                                     ; Return
.try_sll:
    call tls_sll_pop                        ; Try TLS SLL
    test rax, rax
    jne  .sll_hit
    call tiny_alloc_refill_slow             ; Cold path (out-of-line)
    ret
.sll_hit:
    mov  byte [rax], HEADER_MAGIC | class
    lea  rax, [rax + 1]
    ret
```

**Expected impact**: ~20-30 instructions (from 2624), +200-300% performance
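For context, here is a minimal sketch of the per-class TLS free-list array that `tiny_alloc_ultra()` above assumes behind `g_unified_cache[class_idx].pop()`. The slot layout, the class count, and the `unified_cache_pop()`/`unified_cache_push()` names are illustrative assumptions, not the existing HAKMEM implementation:

```c
// Hypothetical sketch of the "single TLS array" unified cache.
// Each size class keeps an intrusive singly linked list of free blocks;
// the next pointer is stored in the first bytes of the free block itself.
#include <stddef.h>

#define TINY_NUM_CLASSES 8  // C0-C7, per the 3-layer hierarchy above

typedef struct {
    void* head;  // top of the per-class free list (NULL when empty)
} unified_cache_slot_t;

static __thread unified_cache_slot_t g_unified_cache[TINY_NUM_CLASSES];

// Pop one block, or NULL on miss; corresponds to the ".pop()" in the sketch above.
static inline void* unified_cache_pop(int class_idx) {
    void* ptr = g_unified_cache[class_idx].head;
    if (ptr) {
        g_unified_cache[class_idx].head = *(void**)ptr;  // next pointer lives in the block
    }
    return ptr;
}

// Push one block back (used by free and by refill).
static inline void unified_cache_push(int class_idx, void* ptr) {
    *(void**)ptr = g_unified_cache[class_idx].head;
    g_unified_cache[class_idx].head = ptr;
}
```

With one intrusive head pointer per class, the hit path reduces to the load/test/store sequence shown in the expected assembly above.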
---

### Phase 3: Refactor `hakmem_tiny.c` into Modules (Priority 2, Maintainability)

**Goal**: Split the 2228-line monolith into focused, testable modules.

**File structure** (new):

```
core/
├── hakmem_tiny.c          (300-400 lines, main API only)
├── tiny_state.c           (200-300 lines, global state)
├── tiny_tls.c             (300-400 lines, TLS operations)
├── tiny_superslab.c       (400-500 lines, SuperSlab backend)
├── tiny_registry.c        (200-300 lines, slab registry)
├── tiny_lifecycle.c       (200-300 lines, init/shutdown)
├── tiny_stats.c           (200-300 lines, statistics)
└── tiny_alloc_ultra.inc.h (50-100 lines, FAST PATH)
```

**Module responsibilities**:

1. **`hakmem_tiny.c`** (300-400 lines):
   - Public API: `hak_tiny_alloc()`, `hak_tiny_free()`
   - Wrapper functions only
   - Include order: `tiny_alloc_ultra.inc.h` → fast path inline

2. **`tiny_state.c`** (200-300 lines):
   - Global variables: `g_tiny_pool`, `g_tls_sll_head[]`, etc.
   - ENV flag parsing (init-time only)
   - Configuration structures

3. **`tiny_tls.c`** (300-400 lines):
   - TLS operations: `tls_refill()`, `tls_spill()`, `tls_bind()`
   - TLS cache management
   - Adaptive sizing logic

4. **`tiny_superslab.c`** (400-500 lines):
   - SuperSlab allocation: `superslab_refill()`, `superslab_alloc()`
   - Slab metadata management
   - Active block tracking

5. **`tiny_registry.c`** (200-300 lines):
   - Slab registry: `registry_lookup()`, `registry_register()`
   - Hash table operations
   - Owner slab lookup

6. **`tiny_lifecycle.c`** (200-300 lines):
   - Initialization: `hak_tiny_init()`
   - Shutdown: `hak_tiny_shutdown()`
   - Prewarm: `hak_tiny_prewarm_tls_cache()`

7. **`tiny_stats.c`** (200-300 lines):
   - Statistics collection
   - Debug counters
   - Metrics printing

**Benefits**:
- Each file < 500 lines (maintainable)
- Clear dependencies (no circular includes)
- Testable in isolation
- Parallel compilation

---

## Priority Order & Estimated Impact

### Priority 1: Quick Wins (1-2 days)

**Task 1.1**: Remove dead features (2 hours)
- Delete UltraHot, HeapV2, Front C23
- Remove ENV checks for disabled features
- **Impact**: -30% assembly, +10% performance

**Task 1.2**: Extract ultra-fast path (4 hours)
- Create `tiny_alloc_ultra.inc.h` (50 lines)
- Move refill logic to separate file
- **Impact**: -90% assembly (2624 → 200 lines), +150-200% performance

**Task 1.3**: Remove debug code from release builds (2 hours)
- Gate all `fprintf()` with `#if !HAKMEM_BUILD_RELEASE` (see the gating sketch below)
- Remove profiling counters in release
- **Impact**: -10% assembly, +5-10% performance

**Expected total (Priority 1)**: 23.6M → 60-80M ops/s (+150-240%)
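A minimal sketch of the gating Task 1.3 describes, assuming `HAKMEM_BUILD_RELEASE` is the release-build macro this plan already references; the `TINY_DEBUG_LOG`/`TINY_COUNTER_INC` wrapper names are hypothetical:

```c
// Hypothetical release gating: debug logging and profiling counters compile
// to nothing when HAKMEM_BUILD_RELEASE is set, so none of it reaches the fast path.
#include <stdio.h>

#if !HAKMEM_BUILD_RELEASE
  #define TINY_DEBUG_LOG(...)  fprintf(stderr, __VA_ARGS__)
  #define TINY_COUNTER_INC(c)  ((void)(++(c)))
#else
  #define TINY_DEBUG_LOG(...)  ((void)0)
  #define TINY_COUNTER_INC(c)  ((void)0)
#endif
```

The same gating can cover the `front_metrics_*` counters called out in Problem 2.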
---

### Priority 2: Code Health (2-3 days)

**Task 2.1**: Split `hakmem_tiny.c` (1 day)
- Extract modules as described above
- Fix include dependencies
- **Impact**: Maintainability only (no performance change)

**Task 2.2**: Simplify frontend to 2 layers (1 day)
- Unified Cache (Layer 0) + TLS SLL (Layer 1)
- Remove redundant Ring/SFC/FastCache
- **Impact**: -5-10% assembly, +5-10% performance

**Task 2.3**: Documentation (0.5 day)
- Document new architecture in `ARCHITECTURE.md`
- Add performance benchmarks
- **Impact**: Team velocity +20%

---

### Priority 3: Advanced Optimization (3-5 days, optional)

**Task 3.1**: Profile-guided optimization
- Collect PGO data from benchmarks
- Recompile with `-fprofile-use`
- **Impact**: +10-20% performance

**Task 3.2**: Assembly-level tuning
- Hand-optimize critical sections
- Align hot paths to cache lines
- **Impact**: +5-10% performance

---

## Recommended Implementation Order

**Week 1** (Priority 1 - Quick Wins):
1. **Day 1**: Remove dead features + create `tiny_alloc_ultra.inc.h`
2. **Day 2**: Test + benchmark + iterate

**Week 2** (Priority 2 - Code Health):
3. **Day 3-4**: Split `hakmem_tiny.c` into modules
4. **Day 5**: Simplify frontend layers

**Week 3** (Priority 3 - Optional):
5. **Day 6-7**: PGO + assembly tuning

---

## Expected Performance Results

### Current (baseline):
- Performance: 23.6M ops/s
- Assembly: 2624 lines
- L1 misses: 1.98 miss/op

### After Priority 1 (Quick Wins):
- Performance: 60-80M ops/s (+150-240%)
- Assembly: 150-200 lines (-92%)
- L1 misses: 0.4-0.6 miss/op (-70%)

### After Priority 2 (Code Health):
- Performance: 70-90M ops/s (+200-280%)
- Assembly: 100-150 lines (-94%)
- L1 misses: 0.2-0.4 miss/op (-80%)
- Maintainability: Much improved

### Target (System malloc parity):
- Performance: 92.6M ops/s (System malloc baseline)
- Assembly: 50-100 lines (tcache equivalent)
- L1 misses: 0.17 miss/op (System malloc level)

---

## Risk Assessment

### Low Risk:
- Removing disabled features (UltraHot, HeapV2, Front C23)
- Extracting the fast path to a separate file
- Gating debug code with `#if !HAKMEM_BUILD_RELEASE`

### Medium Risk:
- Simplifying the frontend from 11 layers → 2 layers
- **Mitigation**: Keep Ring Cache as a fallback during the transition
- **A/B test**: Toggle via `HAKMEM_TINY_UNIFIED_ONLY=1`

### High Risk:
- Splitting `hakmem_tiny.c` (circular dependencies)
- **Mitigation**: Incremental extraction, one module at a time
- **Test**: Ensure all benchmarks pass after each extraction

---

## Conclusion

The current architecture suffers from **feature accumulation disease**: 11 experimental frontend layers competing in the same allocation path, creating massive instruction bloat (2624 lines of assembly).

The solution is aggressive simplification:
1. **Remove dead/redundant features** (11 layers → 2 layers)
2. **Extract the ultra-fast path** (2624 asm lines → 100-150 lines)
3. **Split the monolithic file** (2228 lines → 7 focused modules)

**Expected outcome**: 3-4x performance improvement (23.6M → 70-90M ops/s), approaching System malloc parity (92.6M ops/s).

**Recommended action**: Start with the Priority 1 tasks (1-2 days), which deliver 80% of the benefit with minimal risk.