# HAKMEM Tiny Allocator Refactoring Plan
## Executive Summary
**Problem**: `tiny_alloc_fast()` generates 2624 lines of assembly (should be ~20-50 lines for a fast path), causing 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17). Performance: 23.6M ops/s vs System's 92.6M ops/s (3.9x slower).
**Root Cause**: Architectural bloat from accumulation of experimental features:
- 26 conditional compilation branches in `tiny_alloc_fast.inc.h`
- 38 runtime conditional checks in allocation path
- 11 overlapping frontend layers (Ring Cache, Unified Cache, HeapV2, UltraHot, FastCache, SFC, etc.)
- 2228-line monolithic `hakmem_tiny.c`
- 885-line `tiny_alloc_fast.inc.h` with excessive inlining
**Impact**: The "smart features" added to improve performance instead cause instruction-cache thrashing that destroys the fast path.
---
## Analysis: Current Architecture Problems
### Problem 1: Too Many Frontend Layers (Bloat Disease)
**Current layers in `tiny_alloc_fast()`** (lines 562-812):
```c
static inline void* tiny_alloc_fast(size_t size) {
    // Layer 0: FastCache (C0-C3 only) - lines 232-244
    if (g_fastcache_enable && class_idx <= 3) { ... }
    // Layer 1: SFC (Super Front Cache) - lines 255-274
    if (sfc_is_enabled) { ... }
    // Layer 2: Front C23 (Ultra-simple C2/C3) - lines 610-617
    if (tiny_front_c23_enabled() && (class_idx == 2 || class_idx == 3)) { ... }
    // Layer 3: Unified Cache (tcache-style) - lines 623-635
    if (unified_cache_enabled()) { ... }
    // Layer 4: Ring Cache (C2/C3/C5 only) - lines 641-659
    if (class_idx == 2 || class_idx == 3) { ... }
    // Layer 5: UltraHot (C2-C5) - lines 669-686
    if (ultra_hot_enabled() && front_prune_ultrahot_enabled()) { ... }
    // Layer 6: HeapV2 (C0-C3) - lines 693-701
    if (tiny_heap_v2_enabled() && front_prune_heapv2_enabled() && class_idx <= 3) { ... }
    // Layer 7: Class5 hotpath (256B dedicated) - lines 710-732
    if (hot_c5) { ... }
    // Layer 8: TLS SLL (generic) - lines 736-752
    if (g_tls_sll_enable && !s_front_direct_alloc) { ... }
    // Layer 9: Front-Direct refill - lines 759-775
    if (s_front_direct_alloc) { ... }
    // Layer 10: Legacy refill - lines 769-775
    else { ... }
    // Layer 11: Slow path - lines 806-809
    ptr = hak_tiny_alloc_slow(size, class_idx);
}
```
**Problem**: 11 layers with overlapping responsibilities!
- **Redundancy**: Ring Cache (C2/C3), Front C23 (C2/C3), and UltraHot (C2-C5) all target the same classes
- **Branch explosion**: Each layer adds 2-5 conditional branches
- **I-cache thrashing**: a single fast path that expands to 2624 assembly lines consumes a large share of the 32KB L1 instruction cache (~10K instructions), evicting the rest of the hot code
### Problem 2: Assembly Bloat Analysis
**Expected fast path** (System malloc tcache):
```asm
; Hit path: ~6 instructions
    mov  rax, QWORD PTR [tls_cache + class*8]  ; Load head
    test rax, rax                              ; Check NULL
    je   .miss                                 ; Branch on empty
    mov  rdx, QWORD PTR [rax]                  ; Load next
    mov  QWORD PTR [tls_cache + class*8], rdx  ; Update head
    ret                                        ; Return ptr
.miss:
    call tcache_refill                         ; Refill (cold path)
```
**Actual HAKMEM fast path**: 2624 lines of assembly!
**Why?**
1. **Inlining explosion**: Every `__attribute__((always_inline))` layer inlines ALL branches
2. **ENV checks**: Multiple `getenv()` calls inlined (even with TLS caching); see the sketch after this list
3. **Debug code**: Not gated properly with `#if !HAKMEM_BUILD_RELEASE`
4. **Metrics**: Frontend metrics tracking (`front_metrics_*`) adds 50-100 instructions
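To illustrate point 2 above, a minimal sketch of the difference between a per-call ENV check and an init-time cached flag; `g_fastcache_enable` appears in the current code, while the `HAKMEM_TINY_FASTCACHE` flag name and helper names are illustrative assumptions:
```c
#include <stdlib.h>

/* Anti-pattern (sketch): every allocation re-checks the flag, pulling getenv()
 * call sites and extra branches into the inlined fast path. */
static inline int fastcache_enabled_slow(void) {
    const char* e = getenv("HAKMEM_TINY_FASTCACHE");  /* illustrative flag name */
    return e && e[0] == '1';
}

/* Preferred (sketch): parse once at init time, then the fast path reads a
 * plain global with a single well-predicted test. */
static int g_fastcache_enable = 0;

static void tiny_env_flags_init(void) {
    const char* e = getenv("HAKMEM_TINY_FASTCACHE");
    g_fastcache_enable = (e && e[0] == '1');
}
```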
### Problem 3: File Organization Chaos
**`hakmem_tiny.c`** (2228 lines):
- Lines 1-500: Global state, TLS variables, initialization
- Lines 500-1000: TLS operations (refill, spill, bind)
- Lines 1000-1500: SuperSlab management
- Lines 1500-2000: Registry operations, slab management
- Lines 2000-2228: Statistics, lifecycle, API wrappers
**Problems**:
- No clear separation of concerns
- Mix of hot path (refill) and cold path (init, stats)
- Circular dependencies between files via `#include`
---
## Refactoring Plan: 3-Phase Approach
### Phase 1: Identify and Remove Dead Features (Priority 1, Quick Win)
**Goal**: Remove experimental features that are disabled or have negative performance impact.
**Actions**:
1. **Audit ENV flags** (1 hour):
```bash
grep -r "getenv.*HAKMEM_TINY" core/ | cut -d: -f2 | sort -u > env_flags.txt
# Identify which are:
# - Always disabled (default=0, never used)
# - Negative performance (A/B test showed regression)
# - Redundant (overlapping with better features)
```
2. **Remove confirmed-dead features** (2 hours):
- **UltraHot** (Phase 19-4): ENV default OFF, adds 11.7% overhead → DELETE
- **HeapV2** (Phase 13-A): ENV gated, overlaps with Ring Cache → DELETE
- **Front C23**: Redundant with Ring Cache → DELETE
- **FastCache**: Overlaps with SFC → CONSOLIDATE into SFC
3. **Simplify to 3-layer hierarchy** (result below; see the Unified Cache sketch after the impact estimate):
```
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
Layer 1: TLS SLL (unlimited overflow)
Layer 2: SuperSlab backend (refill source)
```
**Expected impact**: -30-40% assembly size, +10-15% performance
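For reference, a minimal sketch of the Layer 0 state under this plan: a per-thread, per-class free-list head array in the tcache style. `g_unified_cache` is named in the Phase 2 sketch below; the count array, `TINY_NUM_CLASSES`, and the helper names are illustrative assumptions:
```c
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* C0-C7; illustrative constant */

/* Per-thread Unified Cache: one singly linked free list per size class.
 * The first word of each cached block stores the "next" pointer, so a hit
 * is a single load/compare/store sequence. */
static __thread void*    g_unified_cache[TINY_NUM_CLASSES];        /* list heads */
static __thread uint16_t g_unified_cache_count[TINY_NUM_CLASSES];  /* for spill limits */

static inline void* unified_cache_pop(int class_idx) {
    void* ptr = g_unified_cache[class_idx];
    if (ptr) {
        g_unified_cache[class_idx] = *(void**)ptr;  /* unlink head */
        g_unified_cache_count[class_idx]--;
    }
    return ptr;
}

static inline void unified_cache_push(int class_idx, void* ptr) {
    *(void**)ptr = g_unified_cache[class_idx];      /* store old head in the block */
    g_unified_cache[class_idx] = ptr;
    g_unified_cache_count[class_idx]++;
}
```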
---
### Phase 2: Extract Hot Path to Separate File (Priority 1, Critical)
**Goal**: Create ultra-simple fast path with zero cold code.
**File split**:
```
core/tiny_alloc_fast.inc.h (885 lines) → split into:
  ├── core/tiny_alloc_ultra.inc.h     (50-100 lines, HOT PATH ONLY)
  ├── core/tiny_alloc_refill.inc.h    (200-300 lines, refill logic)
  ├── core/tiny_alloc_frontend.inc.h  (300-400 lines, frontend layers)
  └── core/tiny_alloc_metrics.inc.h   (100-150 lines, debug/stats)
```
**`tiny_alloc_ultra.inc.h`** (NEW, ultra-simple):
```c
// Ultra-fast path: 10-20 instructions, no branches except miss
static inline void* tiny_alloc_ultra(int class_idx) {
    // Layer 0: Unified Cache (single TLS array)
    void* ptr = unified_cache_pop(class_idx);
    if (__builtin_expect(ptr != NULL, 1)) {
        // Fast hit: 3-4 instructions
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // Layer 1: TLS SLL (overflow)
    ptr = tls_sll_pop(class_idx);
    if (ptr) {
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // Miss: delegate to refill (cold path, kept out-of-line; see the sketch below)
    return tiny_alloc_refill_slow(class_idx);
}
```
**Expected assembly**:
```asm
tiny_alloc_ultra:
    ; ~15-20 instructions total
    mov  rax, [g_unified_cache + class*8]  ; Load cache head
    test rax, rax                          ; Check NULL
    je   .try_sll                          ; Branch on miss
    mov  rdx, [rax]                        ; Load next
    mov  [g_unified_cache + class*8], rdx  ; Update head
    mov  byte [rax], HEADER_MAGIC | class  ; Write header
    lea  rax, [rax + 1]                    ; USER = BASE + 1
    ret                                    ; Return
.try_sll:
    call tls_sll_pop                       ; Try TLS SLL
    test rax, rax
    jne  .sll_hit
    call tiny_alloc_refill_slow            ; Cold path (out-of-line)
    ret
.sll_hit:
    mov  byte [rax], HEADER_MAGIC | class
    lea  rax, [rax + 1]
    ret
```
**Expected impact**: ~20-30 instructions (from 2624), +200-300% performance
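To keep the miss path out of the fast path's instruction footprint, `tiny_alloc_refill_slow()` would live in `tiny_alloc_refill.inc.h` and be explicitly marked non-inlinable and cold. A minimal sketch under that assumption; `tls_refill()` is taken from the module plan below, but its exact signature and the `tiny_class_to_size()` helper are assumptions:
```c
// tiny_alloc_refill.inc.h (sketch): never inlined, placed in a cold section so
// it does not share I-cache lines with tiny_alloc_ultra().
__attribute__((noinline, cold))
static void* tiny_alloc_refill_slow(int class_idx) {
    // 1. Refill the Unified Cache from the TLS SLL / SuperSlab backend
    //    (existing refill logic moves here).
    if (tls_refill(class_idx)) {
        void* ptr = unified_cache_pop(class_idx);
        if (ptr) {
            HAK_RET_ALLOC(class_idx, ptr);
        }
    }
    // 2. Backend exhausted for this class: fall back to the full slow path.
    return hak_tiny_alloc_slow(tiny_class_to_size(class_idx), class_idx);
}
```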
---
### Phase 3: Refactor `hakmem_tiny.c` into Modules (Priority 2, Maintainability)
**Goal**: Split 2228-line monolith into focused, testable modules.
**File structure** (new):
```
core/
├── hakmem_tiny.c (300-400 lines, main API only)
├── tiny_state.c (200-300 lines, global state)
├── tiny_tls.c (300-400 lines, TLS operations)
├── tiny_superslab.c (400-500 lines, SuperSlab backend)
├── tiny_registry.c (200-300 lines, slab registry)
├── tiny_lifecycle.c (200-300 lines, init/shutdown)
├── tiny_stats.c (200-300 lines, statistics)
└── tiny_alloc_ultra.inc.h (50-100 lines, FAST PATH)
```
**Module responsibilities**:
1. **`hakmem_tiny.c`** (300-400 lines):
- Public API: `hak_tiny_alloc()`, `hak_tiny_free()`
- Wrapper functions only
- Include order: `tiny_alloc_ultra.inc.h` → fast path inline (see the wrapper sketch after this list)
2. **`tiny_state.c`** (200-300 lines):
- Global variables: `g_tiny_pool`, `g_tls_sll_head[]`, etc.
- ENV flag parsing (init-time only)
- Configuration structures
3. **`tiny_tls.c`** (300-400 lines):
- TLS operations: `tls_refill()`, `tls_spill()`, `tls_bind()`
- TLS cache management
- Adaptive sizing logic
4. **`tiny_superslab.c`** (400-500 lines):
- SuperSlab allocation: `superslab_refill()`, `superslab_alloc()`
- Slab metadata management
- Active block tracking
5. **`tiny_registry.c`** (200-300 lines):
- Slab registry: `registry_lookup()`, `registry_register()`
- Hash table operations
- Owner slab lookup
6. **`tiny_lifecycle.c`** (200-300 lines):
- Initialization: `hak_tiny_init()`
- Shutdown: `hak_tiny_shutdown()`
- Prewarm: `hak_tiny_prewarm_tls_cache()`
7. **`tiny_stats.c`** (200-300 lines):
- Statistics collection
- Debug counters
- Metrics printing
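As a rough illustration of module 1, the trimmed `hakmem_tiny.c` reduces to thin wrappers around the inlined fast path. A minimal sketch; `tiny_size_to_class()` is an assumed name for the existing size-to-class mapping:
```c
/* hakmem_tiny.c (sketch): public API wrappers only. */
#include <stddef.h>
#include "tiny_alloc_ultra.inc.h"   /* ultra-fast path, inlined here */

void* hak_tiny_alloc(size_t size) {
    int class_idx = tiny_size_to_class(size);  /* assumed helper: size -> C0..C7, or -1 */
    if (class_idx < 0) {
        return NULL;  /* not a tiny size; caller routes to another allocator */
    }
    return tiny_alloc_ultra(class_idx);
}
```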
**Benefits**:
- Each file < 500 lines (maintainable)
- Clear dependencies (no circular includes)
- Testable in isolation
- Parallel compilation
---
## Priority Order & Estimated Impact
### Priority 1: Quick Wins (1-2 days)
**Task 1.1**: Remove dead features (2 hours)
- Delete UltraHot, HeapV2, Front C23
- Remove ENV checks for disabled features
- **Impact**: -30% assembly, +10% performance
**Task 1.2**: Extract ultra-fast path (4 hours)
- Create `tiny_alloc_ultra.inc.h` (50 lines)
- Move refill logic to separate file
- **Impact**: -90% assembly (2624 → 200 lines), +150-200% performance
**Task 1.3**: Remove debug code from release builds (2 hours)
- Gate all `fprintf()` with `#if !HAKMEM_BUILD_RELEASE` (pattern sketched below)
- Remove profiling counters in release
- **Impact**: -10% assembly, +5-10% performance
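A minimal sketch of the gating pattern; `HAK_DBG_LOG` is an illustrative macro name, not an existing identifier, chosen so call sites stay readable while compiling to nothing in release builds:
```c
#include <stdio.h>

#if !HAKMEM_BUILD_RELEASE
  /* Debug builds: forward to stderr. */
  #define HAK_DBG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
  /* Release builds: expands to nothing, so no fprintf calls or format
   * strings remain in the hot path. */
  #define HAK_DBG_LOG(...) ((void)0)
#endif

/* Hypothetical call site: */
static void refill_debug_trace(int class_idx) {
    HAK_DBG_LOG("[tiny] refill class=%d\n", class_idx);
}
```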
**Expected total (Priority 1)**: 23.6M → 60-80M ops/s (+150-240%)
---
### Priority 2: Code Health (2-3 days)
**Task 2.1**: Split `hakmem_tiny.c` (1 day)
- Extract modules as described above
- Fix include dependencies
- **Impact**: Maintainability only (no performance change)
**Task 2.2**: Simplify frontend to 2 layers (1 day)
- Unified Cache (Layer 0) + TLS SLL (Layer 1)
- Remove redundant Ring/SFC/FastCache
- **Impact**: -5-10% assembly, +5-10% performance
**Task 2.3**: Documentation (0.5 day)
- Document new architecture in `ARCHITECTURE.md`
- Add performance benchmarks
- **Impact**: Team velocity +20%
---
### Priority 3: Advanced Optimization (3-5 days, optional)
**Task 3.1**: Profile-guided optimization
- Collect PGO data from benchmarks
- Recompile with `-fprofile-use`
- **Impact**: +10-20% performance
**Task 3.2**: Assembly-level tuning
- Hand-optimize critical sections
- Align hot paths to cache lines (see the attribute sketch below)
- **Impact**: +5-10% performance
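One way to express the cache-line alignment in C is an attribute on the out-of-line entry point (a sketch; GCC/Clang also accept a build-wide `-falign-functions=64`):
```c
#include <stddef.h>

/* Ask the compiler to keep the exported fast-path entry hot and start it on a
 * 64-byte boundary, so the hot code begins at a fresh I-cache line.
 * (tiny_alloc_ultra itself is inlined; the attribute goes on its caller.) */
__attribute__((hot, aligned(64)))
void* hak_tiny_alloc(size_t size);
```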
---
## Recommended Implementation Order
**Week 1** (Priority 1 - Quick Wins):
1. **Day 1**: Remove dead features + create `tiny_alloc_ultra.inc.h`
2. **Day 2**: Test + benchmark + iterate
**Week 2** (Priority 2 - Code Health):
3. **Day 3-4**: Split `hakmem_tiny.c` into modules
4. **Day 5**: Simplify frontend layers
**Week 3** (Priority 3 - Optional):
5. **Day 6-7**: PGO + assembly tuning
---
## Expected Performance Results
### Current (baseline):
- Performance: 23.6M ops/s
- Assembly: 2624 lines
- L1 misses: 1.98 miss/op
### After Priority 1 (Quick Wins):
- Performance: 60-80M ops/s (+150-240%)
- Assembly: 150-200 lines (-92%)
- L1 misses: 0.4-0.6 miss/op (-70%)
### After Priority 2 (Code Health):
- Performance: 70-90M ops/s (+200-280%)
- Assembly: 100-150 lines (-94%)
- L1 misses: 0.2-0.4 miss/op (-80%)
- Maintainability: Much improved
### Target (System malloc parity):
- Performance: 92.6M ops/s (System malloc baseline)
- Assembly: 50-100 lines (tcache equivalent)
- L1 misses: 0.17 miss/op (System malloc level)
---
## Risk Assessment
### Low Risk:
- Removing disabled features (UltraHot, HeapV2, Front C23)
- Extracting fast path to separate file
- Gating debug code with `#if !HAKMEM_BUILD_RELEASE`
### Medium Risk:
- Simplifying frontend from 11 layers → 2 layers
- **Mitigation**: Keep Ring Cache as fallback during transition
- **A/B test**: Toggle via `HAKMEM_TINY_UNIFIED_ONLY=1`
### High Risk:
- Splitting `hakmem_tiny.c` (circular dependencies)
- **Mitigation**: Incremental extraction, one module at a time
- **Test**: Ensure all benchmarks pass after each extraction
---
## Conclusion
The current architecture suffers from **feature accumulation disease**: 11 experimental frontend layers competing in the same allocation path, creating massive instruction bloat (2624 lines of assembly). The solution is aggressive simplification:
1. **Remove dead/redundant features** (11 layers → 2 layers)
2. **Extract ultra-fast path** (2624 asm lines → 100-150 lines)
3. **Split monolithic file** (2228 lines → 7 focused modules)
**Expected outcome**: 3-4x performance improvement (23.6M → 70-90M ops/s), approaching System malloc parity (92.6M ops/s).
**Recommended action**: Start with Priority 1 tasks (1-2 days), which deliver 80% of the benefit with minimal risk.