# HAKMEM Tiny Allocator Refactoring - Executive Summary

## Problem Statement

**Current performance**: 23.6M ops/s (Random Mixed 256B benchmark)
**System malloc**: 92.6M ops/s (baseline)
**Performance gap**: **3.9x slower**

**Root cause**: `tiny_alloc_fast()` compiles to **2624 lines of assembly** (it should be ~20-50), causing:

- **11.6x more L1 cache misses** than System malloc (1.98 misses/op vs 0.17)
- **Instruction-cache thrashing** from 11 overlapping frontend layers
- **Branch-prediction failures** from 26 conditional compilation paths plus 38 runtime checks

## Architecture Analysis

### Current Bloat Inventory

**Frontend layers in `tiny_alloc_fast()`** (11 total):

1. FastCache (C0-C3 array stack)
2. SFC (Super Front Cache, all classes)
3. Front C23 (ultra-simple C2/C3)
4. Unified Cache (tcache-style, all classes)
5. Ring Cache (C2/C3/C5 array cache)
6. UltraHot (C2-C5 magazine)
7. HeapV2 (C0-C3 magazine)
8. Class5 Hotpath (dedicated 256B path)
9. TLS SLL (generic freelist)
10. Front-Direct (experimental bypass)
11. Legacy refill path

**Problem**: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!
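The layered-dispatch pattern described above can be sketched as follows. This is an illustrative model, not HAKMEM code (`layer_t`, `layer_pop`, and `layered_alloc` are hypothetical names): each frontend layer contributes at least one branch plus a cold fallthrough, and in the real allocator the 11 checks are distinct inlined code paths rather than a loop, which is why the compiled fast path balloons far beyond a single tcache-style pop.

```c
#include <stddef.h>

/* One bounded array cache, standing in for any of the 11 frontend layers. */
typedef struct {
    void *slots[8];
    int   top;
} layer_t;

/* Pop from one layer; NULL on miss. */
static void *layer_pop(layer_t *l) {
    return (l->top > 0) ? l->slots[--l->top] : NULL;
}

/* Try each layer in order: every miss costs another taken branch.
 * In the real allocator these are 11 separately inlined checks,
 * not a loop, so the assembly and branch count explode. */
static void *layered_alloc(layer_t *layers, int n) {
    for (int i = 0; i < n; i++) {
        void *p = layer_pop(&layers[i]);
        if (p)
            return p;
    }
    return NULL; /* fall through to the slow refill path */
}
```

With only the last layer populated, every allocation pays a miss in each preceding layer before hitting, which models the redundant C2/C3 layers competing for the same requests.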
### File Size Issues

- `hakmem_tiny.c`: **2228 lines** (should be ~300-500)
- `tiny_alloc_fast.inc.h`: **885 lines** (should be ~50-100)
- `core/front/` directory: **2127 lines** total (11 experimental layers)

## Solution: 3-Phase Refactoring

### Phase 1: Remove Dead Features (1 day, ZERO risk)

**Target**: 4 features proven harmful or redundant

| Feature | Lines | Status | Evidence |
|---------|-------|--------|----------|
| UltraHot | ~150 | Disabled by default | A/B test: +12.9% when OFF |
| HeapV2 | ~120 | Disabled by default | Redundant with Ring Cache |
| Front C23 | ~80 | Opt-in only | Redundant with Ring Cache |
| Class5 Hotpath | ~150 | Disabled by default | Special case, unnecessary |

**Expected results**:
- Assembly: 2624 → 1000-1200 lines (-60%)
- Performance: 23.6M → 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: **ZERO** (all four are disabled by default and proven harmful)

### Phase 2: Simplify to 2-Layer Architecture (2-3 days)

**Current**: 11 layers (chaotic)
**Target**: 2 layers (clean)

```
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
  ↓ miss
Layer 1: TLS SLL (unlimited overflow)
  ↓ miss
Backend: SuperSlab (refill source)
```

**Tasks**:
1. A/B test: Ring Cache vs Unified Cache → pick the winner
2. A/B test: FastCache vs SFC → consolidate into the winner
3. A/B test: Front-Direct vs Legacy → pick one refill path
4. Extract the ultra-fast path into `tiny_alloc_ultra.inc.h` (~50 lines)

**Expected results**:
- Assembly: 1000-1200 → 150-200 lines (-90% from baseline)
- Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
- Time: 2-3 days
- Risk: LOW (A/B tests guard against regressions)

### Phase 3: Split Monolithic Files (2-3 days)

**Current**: `hakmem_tiny.c` (2228 lines, unmaintainable)
**Target**: 7 focused modules (~300-500 lines each) plus an inline fast-path header

```
hakmem_tiny.c          (300-400 lines) - Public API
tiny_state.c           (200-300 lines) - Global state
tiny_tls.c             (300-400 lines) - TLS operations
tiny_superslab.c       (400-500 lines) - SuperSlab backend
tiny_registry.c        (200-300 lines) - Slab registry
tiny_lifecycle.c       (200-300 lines) - Init/shutdown
tiny_stats.c           (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines)  - FAST PATH (inline)
```

**Expected results**:
- Maintainability: much improved (clear dependencies)
- Performance: no change (structural refactor only)
- Time: 2-3 days
- Risk: MEDIUM (requires careful dependency management)

## Performance Projections

### Baseline (Current)
- **Performance**: 23.6M ops/s
- **Assembly**: 2624 lines
- **L1 misses**: 1.98 misses/op
- **Gap to System malloc**: 3.9x slower

### After Phase 1 (Quick Win)
- **Performance**: 40-50M ops/s (+70-110%)
- **Assembly**: 1000-1200 lines (-60%)
- **L1 misses**: 0.8-1.2 misses/op (-40%)
- **Gap to System malloc**: 1.9-2.3x slower

### After Phase 2 (Architecture Fix)
- **Performance**: 70-90M ops/s (+200-280%)
- **Assembly**: 150-200 lines (-92%)
- **L1 misses**: 0.3-0.5 misses/op (-75%)
- **Gap to System malloc**: 1.0-1.3x slower

### Target (System malloc Parity)
- **Performance**: 92.6M ops/s (System malloc baseline)
- **Assembly**: 50-100 lines (tcache equivalent)
- **L1 misses**: 0.17 misses/op (System malloc level)
- **Gap**: **CLOSED**

## Implementation Timeline

### Week 1: Phase 1 (Quick Win)
- **Day 1**: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
- **Day 2**: Test, benchmark, verify (40-50M ops/s expected)

### Week 2: Phase 2 (Architecture)
- **Day 3**: A/B test Ring vs Unified vs SFC (pick the winner)
- **Day 4**: A/B test Front-Direct vs Legacy (pick the winner)
- **Day 5**: Extract `tiny_alloc_ultra.inc.h` (ultra-fast path)

### Week 3: Phase 3 (Code Health)
- **Days 6-7**: Split `hakmem_tiny.c` into 7 modules
- **Day 8**: Test, document, finalize

**Total**: 8 working days (~2 weeks)

## Risk Assessment

### Phase 1 (Zero Risk)
- ✅ All 4 features disabled by default
- ✅ UltraHot proven harmful (+12.9% when OFF)
- ✅ HeapV2/Front C23 redundant (Ring Cache is better)
- ✅ Class5 Hotpath unnecessary (Ring Cache handles C5)

**Worst case**: performance stays the same (very unlikely)
**Expected case**: +70-110% improvement
**Best case**: +150-200% improvement

### Phase 2 (Low Risk)
- ⚠️ A/B tests required before removing features
- ⚠️ Keep the losers as fallbacks during the transition
- ✅ Toggled via ENV flags (easy rollback)

**Worst case**: A/B test shows no winner → keep both temporarily
**Expected case**: +200-280% improvement
**Best case**: +300-350% improvement

### Phase 3 (Medium Risk)
- ⚠️ Circular dependencies in the current code
- ⚠️ Careful extraction needed to avoid breakage
- ✅ Incremental approach (extract one module at a time)

**Worst case**: build breaks → incremental rollback
**Expected case**: no performance change (structural only)
**Best case**: easier maintenance → faster future iterations

## Recommended Action

### Immediate (Week 1)
**Execute Phase 1 immediately** - highest ROI, lowest risk
- Remove the 4 dead/harmful features
- Expected: 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO

### Short-term (Week 2)
**Execute Phase 2** - core architecture fix
- A/B test competing features, keep the winners
- Extract the ultra-fast path
- Expected: 70-90M ops/s (+200-280%)
- Time: 3 days
- Risk: LOW (A/B tests mitigate risk)

### Medium-term (Week 3)
**Execute Phase 3** - code health & maintainability
- Split monolithic files
- Document the architecture
- Expected: no performance change, much easier maintenance
- Time: 2-3 days
- Risk: MEDIUM (careful execution required)

## Key Insights

### Why the Current Architecture Fails

**Root cause**: **feature accumulation disease**
- 26 phases of development, each adding a new layer
- No removal of failed experiments (UltraHot, HeapV2, Front C23)
- Overlapping responsibilities (Ring Cache, Front C23, and UltraHot all target C2/C3)
- **Result**: 11 competing layers → branch explosion → I-cache thrashing

### Why System Malloc Is Faster

**System malloc (glibc tcache)**:
- 1 layer (tcache)
- 3-4 instruction fast path
- ~10-15 bytes of assembly
- Fits entirely in the L1 instruction cache

**HAKMEM today**:
- 11 layers (chaotic)
- 2624-instruction fast path
- ~10KB of assembly
- Thrashes the L1 instruction cache (32KB ≈ ~10K instructions)

**Solution**: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.

## Success Metrics

### Primary Metric: Performance
- **Phase 1 target**: 40-50M ops/s (+70-110%)
- **Phase 2 target**: 70-90M ops/s (+200-280%)
- **Final target**: 92.6M ops/s (System malloc parity)

### Secondary Metrics
- **Assembly size**: 2624 → 150-200 lines (-92%)
- **L1 cache misses**: 1.98 → 0.2-0.4 misses/op (-80%)
- **Code maintainability**: 2228-line monolith → 7 focused modules

### Validation
- Benchmark: `bench_random_mixed_hakmem` (Random Mixed 256B)
- Acceptance: must match or exceed System malloc (92.6M ops/s)

## Conclusion

The HAKMEM Tiny allocator suffers from **architectural bloat** (11 frontend layers), causing a 3.9x performance gap vs System malloc. The solution is aggressive simplification:

1. **Remove 4 dead features** (1 day, +70-110%)
2. **Simplify to 2 layers** (3 days, +200-280%)
3. **Split monolithic files** (3 days, maintainability)

**Total time**: 2 weeks
**Expected outcome**: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
**Risk**: LOW (Phase 1 is ZERO risk; Phase 2 uses A/B tests)
**Recommendation**: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).
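The proposed 2-layer fast path (Unified Cache → TLS SLL, with the SuperSlab backend on a full miss) can be sketched as below. This is a hedged illustration under assumed names: `unified_cache_t`, `tls_sll_t`, and `tiny_alloc_ultra` are hypothetical, not the actual HAKMEM identifiers or layout.

```c
#include <stddef.h>

#define UC_CAP 16 /* bounded, tcache-style per-class capacity (assumed) */

/* Layer 0: one bounded array cache per size class. */
typedef struct {
    void *slots[UC_CAP];
    int   count;
} unified_cache_t;

/* Layer 1: unbounded TLS singly-linked freelist. */
typedef struct tls_node {
    struct tls_node *next;
} tls_node_t;

typedef struct {
    tls_sll_node_placeholder: ;
} tls_sll_placeholder;

static inline void *tiny_alloc_ultra_hit(unified_cache_t *uc) {
    return uc->slots[--uc->count];
}
```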