# HAKMEM Tiny Allocator Refactoring - Executive Summary

## Problem Statement

**Current Performance**: 23.6M ops/s (Random Mixed 256B benchmark)
**System malloc**: 92.6M ops/s (baseline)
**Performance gap**: **3.9x slower**

**Root Cause**: `tiny_alloc_fast()` generates **2624 lines of assembly** (should be ~20-50 lines), causing:

- **11.6x more L1 cache misses** than System malloc (1.98 miss/op vs 0.17)
- **Instruction cache thrashing** from 11 overlapping frontend layers
- **Branch prediction failures** from 26 conditional compilation paths + 38 runtime checks

## Architecture Analysis

### Current Bloat Inventory

**Frontend Layers in `tiny_alloc_fast()`** (11 total):

1. FastCache (C0-C3 array stack)
2. SFC (Super Front Cache, all classes)
3. Front C23 (ultra-simple C2/C3)
4. Unified Cache (tcache-style, all classes)
5. Ring Cache (C2/C3/C5 array cache)
6. UltraHot (C2-C5 magazine)
7. HeapV2 (C0-C3 magazine)
8. Class5 Hotpath (256B dedicated path)
9. TLS SLL (generic freelist)
10. Front-Direct (experimental bypass)
11. Legacy refill path

**Problem**: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!

### File Size Issues

- `hakmem_tiny.c`: **2228 lines** (should be ~300-500)
- `tiny_alloc_fast.inc.h`: **885 lines** (should be ~50-100)
- `core/front/` directory: **2127 lines** total (11 experimental layers)

## Solution: 3-Phase Refactoring

### Phase 1: Remove Dead Features (1 day, ZERO risk)

**Target**: 4 features proven harmful or redundant

| Feature | Lines | Status | Evidence |
|---------|-------|--------|----------|
| UltraHot | ~150 | Disabled by default | A/B test: +12.9% when OFF |
| HeapV2 | ~120 | Disabled by default | Redundant with Ring Cache |
| Front C23 | ~80 | Opt-in only | Redundant with Ring Cache |
| Class5 Hotpath | ~150 | Disabled by default | Special case, unnecessary |

**Expected Results**:

- Assembly: 2624 → 1000-1200 lines (-60%)
- Performance: 23.6M → 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: **ZERO** (all four are disabled by default and either proven harmful or redundant)

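To make the cost of these "disabled" layers concrete, here is a minimal sketch (all names, flags, and helpers are hypothetical stand-ins, not the real HAKMEM code) of why Phase 1 deletes the layers rather than leaving them switched off: each one keeps a flag load, a branch, and a call site resident in the hot path even when it never fires.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the experimental layer pops and their flags. */
static bool ultrahot_on, heapv2_on, front_c23_on, class5_on;
static void *ultrahot_pop(int cls)      { (void)cls; return NULL; }
static void *heapv2_pop(int cls)        { (void)cls; return NULL; }
static void *front_c23_pop(int cls)     { (void)cls; return NULL; }
static void *class5_pop(void)           { return NULL; }
static void *unified_cache_pop(int cls) { (void)cls; return NULL; }
static void *slow_refill(int cls)       { (void)cls; return NULL; }

/* Before Phase 1: every "disabled" layer still leaves a flag check, a branch,
 * and a call site in the hottest function, so the I-cache cost remains. */
void *tiny_alloc_fast_before(int cls) {
    void *p;
    if (ultrahot_on  && (p = ultrahot_pop(cls)))  return p;
    if (heapv2_on    && (p = heapv2_pop(cls)))    return p;
    if (front_c23_on && (p = front_c23_pop(cls))) return p;
    if (class5_on && cls == 5 && (p = class5_pop())) return p;
    if ((p = unified_cache_pop(cls)))             return p;
    return slow_refill(cls);
}

/* After Phase 1: the four dead layers are deleted, not merely switched off. */
void *tiny_alloc_fast_after(int cls) {
    void *p = unified_cache_pop(cls);
    return p ? p : slow_refill(cls);
}
```
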
### Phase 2: Simplify to 2-Layer Architecture (2-3 days)

**Current**: 11 layers (chaotic)
**Target**: 2 layers (clean)

```
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
  ↓ miss
Layer 1: TLS SLL (unlimited overflow)
  ↓ miss
Layer 2: SuperSlab backend (refill source)
```

**Tasks**:

1. A/B test: Ring Cache vs Unified Cache → pick winner
2. A/B test: FastCache vs SFC → consolidate into winner
3. A/B test: Front-Direct vs Legacy → pick one refill path
4. Extract ultra-fast path to `tiny_alloc_ultra.inc.h` (~50 lines; sketched below)

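The following is a minimal sketch of what the two-layer front end, roughly the ~50-line `tiny_alloc_ultra.inc.h` from task 4, could look like. All names, capacities, and the backend hook are illustrative assumptions rather than the existing HAKMEM API; the SuperSlab refill is left as an extern declaration.

```c
#include <stddef.h>

#define TINY_NUM_CLASSES  8
#define UNIFIED_CACHE_CAP 32

/* Hypothetical per-thread state: Layer 0 is a bounded array cache,
 * Layer 1 is an unbounded singly linked freelist (TLS SLL). */
typedef struct {
    void    *slots[TINY_NUM_CLASSES][UNIFIED_CACHE_CAP];
    unsigned count[TINY_NUM_CLASSES];
    void    *sll_head[TINY_NUM_CLASSES];
} tiny_tls_cache_t;

static __thread tiny_tls_cache_t g_tls;

/* Layer 2 stand-in: slow-path refill from the SuperSlab backend. */
void *tiny_refill_from_superslab(int cls);

static inline void *tiny_alloc_ultra(int cls) {
    tiny_tls_cache_t *t = &g_tls;

    /* Layer 0: tcache-style pop (a load, a compare, a store). */
    unsigned n = t->count[cls];
    if (n != 0) {
        t->count[cls] = n - 1;
        return t->slots[cls][n - 1];
    }

    /* Layer 1: pop the TLS singly linked list (overflow storage). */
    void *p = t->sll_head[cls];
    if (p != NULL) {
        t->sll_head[cls] = *(void **)p;  /* next pointer lives in the free block */
        return p;
    }

    /* Miss: refill from the backend. */
    return tiny_refill_from_superslab(cls);
}
```

Free would mirror this sketch: push into the array while there is room, spill to the SLL otherwise, and return batches to the SuperSlab backend when the SLL grows too large.
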
**Expected Results**:

- Assembly: 1000-1200 → 150-200 lines (-90% from baseline)
- Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
- Time: 2-3 days
- Risk: LOW (A/B tests ensure no regression)

### Phase 3: Split Monolithic Files (2-3 days)

**Current**: `hakmem_tiny.c` (2228 lines, unmaintainable)
**Target**: 7 focused modules (~200-500 lines each) plus an inline fast-path header

```
hakmem_tiny.c          (300-400 lines) - Public API
tiny_state.c           (200-300 lines) - Global state
tiny_tls.c             (300-400 lines) - TLS operations
tiny_superslab.c       (400-500 lines) - SuperSlab backend
tiny_registry.c        (200-300 lines) - Slab registry
tiny_lifecycle.c       (200-300 lines) - Init/shutdown
tiny_stats.c           (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines)  - FAST PATH (inline)
```

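To illustrate what "clear dependencies" could mean in practice, here is a hypothetical interface for one of the extracted modules. The names and signatures are placeholders, not the real functions; the point is that each module exposes a small surface and keeps its state definition private to its own `.c` file.

```c
/* Hypothetical tiny_tls.h: the kind of narrow interface each split-out
 * module could expose (placeholder names, not the real HAKMEM symbols). */
#ifndef TINY_TLS_H
#define TINY_TLS_H

#include <stddef.h>

/* Per-thread cache state; the struct definition stays inside tiny_tls.c. */
typedef struct tiny_tls_cache tiny_tls_cache_t;

tiny_tls_cache_t *tiny_tls_get(void);                        /* lazy TLS init        */
void  *tiny_tls_pop(tiny_tls_cache_t *t, int cls);           /* cached block or NULL */
void   tiny_tls_push(tiny_tls_cache_t *t, int cls, void *p); /* cache a freed block  */
size_t tiny_tls_flush(tiny_tls_cache_t *t);                  /* return blocks to the backend */

#endif /* TINY_TLS_H */
```
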
**Expected Results**:

- Maintainability: Much improved (clear dependencies)
- Performance: No change (structural refactor only)
- Time: 2-3 days
- Risk: MEDIUM (needs careful dependency management)

## Performance Projections

### Baseline (Current)

- **Performance**: 23.6M ops/s
- **Assembly**: 2624 lines
- **L1 misses**: 1.98 miss/op
- **Gap to System malloc**: 3.9x slower

### After Phase 1 (Quick Win)

- **Performance**: 40-50M ops/s (+70-110%)
- **Assembly**: 1000-1200 lines (-60%)
- **L1 misses**: 0.8-1.2 miss/op (-40%)
- **Gap to System malloc**: 1.9-2.3x slower

### After Phase 2 (Architecture Fix)

- **Performance**: 70-90M ops/s (+200-280%)
- **Assembly**: 150-200 lines (-92%)
- **L1 misses**: 0.3-0.5 miss/op (-75%)
- **Gap to System malloc**: 1.0-1.3x slower

### Target (System malloc parity)

- **Performance**: 92.6M ops/s (System malloc baseline)
- **Assembly**: 50-100 lines (tcache equivalent)
- **L1 misses**: 0.17 miss/op (System malloc level)
- **Gap**: **CLOSED**

## Implementation Timeline

### Week 1: Phase 1 (Quick Win)

- **Day 1**: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
- **Day 2**: Test, benchmark, verify (40-50M ops/s expected)

### Week 2: Phase 2 (Architecture)

- **Day 3**: A/B test Ring vs Unified vs SFC (pick winner)
- **Day 4**: A/B test Front-Direct vs Legacy (pick winner)
- **Day 5**: Extract `tiny_alloc_ultra.inc.h` (ultra-fast path)

### Week 3: Phase 3 (Code Health)

- **Days 6-7**: Split `hakmem_tiny.c` into 7 modules
- **Day 8**: Test, document, finalize

**Total**: 8 working days (about 2 weeks)

## Risk Assessment

### Phase 1 (Zero Risk)

- ✅ All 4 features disabled by default
- ✅ UltraHot proven harmful (+12.9% when OFF)
- ✅ HeapV2/Front C23 redundant (Ring Cache is better)
- ✅ Class5 Hotpath unnecessary (Ring Cache handles C5)

**Worst case**: Performance stays the same (very unlikely)
**Expected case**: +70-110% improvement
**Best case**: +150-200% improvement

### Phase 2 (Low Risk)

- ⚠️ A/B tests required before removing features
- ⚠️ Keep losers as fallback during transition
- ✅ Toggle via ENV flags (easy rollback; see the sketch below)

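One way the ENV-flag rollback could be wired is sketched here; the flag and function names are invented for illustration, not existing HAKMEM switches. The flag is read once at init so the hot path only tests a cached boolean.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of an ENV-driven A/B toggle (hypothetical flag name). */
bool hak_env_flag(const char *name, bool default_on) {
    const char *v = getenv(name);
    if (v == NULL || *v == '\0') return default_on;
    return !(strcmp(v, "0") == 0 || strcmp(v, "off") == 0);
}

bool g_use_unified_cache;

void tiny_front_init(void) {
    /* e.g. HAKMEM_TINY_UNIFIED=0 would roll back to the Ring Cache path */
    g_use_unified_cache = hak_env_flag("HAKMEM_TINY_UNIFIED", true);
}
```
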
**Worst case**: A/B test shows no winner → keep both temporarily
**Expected case**: +200-280% improvement
**Best case**: +300-350% improvement

### Phase 3 (Medium Risk)

- ⚠️ Circular dependencies in current code
- ⚠️ Needs careful extraction to avoid breakage
- ✅ Incremental approach (extract one module at a time)

**Worst case**: Build breaks → incremental rollback
**Expected case**: No performance change (structural only)
**Best case**: Easier maintenance → faster future iterations

## Recommended Action

### Immediate (Week 1)

**Execute Phase 1 immediately** - Highest ROI, lowest risk

- Remove 4 dead/harmful features
- Expected: 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO

### Short-term (Week 2)

**Execute Phase 2** - Core architecture fix

- A/B test competing features, keep winners
- Extract ultra-fast path
- Expected: 70-90M ops/s (+200-280%)
- Time: 3 days
- Risk: LOW (A/B tests mitigate risk)

### Medium-term (Week 3)

**Execute Phase 3** - Code health & maintainability

- Split monolithic files
- Document architecture
- Expected: No performance change, much easier maintenance
- Time: 2-3 days
- Risk: MEDIUM (careful execution required)

## Key Insights

### Why Current Architecture Fails

**Root Cause**: **Feature Accumulation Disease**

- 26 phases of development, each adding a new layer
- No removal of failed experiments (UltraHot, HeapV2, Front C23)
- Overlapping responsibilities (Ring, Front C23, UltraHot all target C2/C3)
- **Result**: 11 layers competing → branch explosion → I-cache thrashing

### Why System Malloc is Faster

**System malloc (glibc tcache)**:

- 1 layer (tcache)
- 3-4 instruction fast path
- ~10-15 bytes of assembly
- Fits entirely in the L1 instruction cache

**HAKMEM current**:

- 11 layers (chaotic)
- 2624-instruction fast path
- ~10KB of assembly
- Thrashes the L1 instruction cache (32KB ≈ 10K instructions)

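The shape of that fast path is roughly the following, a simplified sketch of the tcache idea (not glibc's actual source); this is the code size the simplified Layer 0 is meant to match. The slow path is left as a declaration.

```c
#include <stddef.h>

/* Simplified tcache-style fast path: one size-class index, one head load,
 * one pointer chase, one store. A handful of instructions that stay
 * resident in the L1 instruction cache. */
typedef struct tcache_entry { struct tcache_entry *next; } tcache_entry;

typedef struct {
    unsigned short counts[64];
    tcache_entry  *entries[64];
} tcache_t;

static __thread tcache_t tcache;

void *slow_path_alloc(size_t size);  /* everything else lives off the hot path */

static inline void *fast_alloc(size_t size, unsigned idx) {
    /* idx is the size-class index (computed from size by the caller). */
    tcache_entry *e = tcache.entries[idx];
    if (e != NULL) {
        tcache.entries[idx] = e->next;   /* pop the head of the per-class freelist */
        tcache.counts[idx]--;
        return e;
    }
    return slow_path_alloc(size);
}
```
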
**Solution**: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.

## Success Metrics

### Primary Metric: Performance

- **Phase 1 target**: 40-50M ops/s (+70-110%)
- **Phase 2 target**: 70-90M ops/s (+200-280%)
- **Final target**: 92.6M ops/s (System malloc parity)

### Secondary Metrics

- **Assembly size**: 2624 → 150-200 lines (-92%)
- **L1 cache misses**: 1.98 → 0.2-0.4 miss/op (-80%)
- **Code maintainability**: 2228-line monolith → 7 focused modules

### Validation

- Benchmark: `bench_random_mixed_hakmem` (Random Mixed 256B)
- Acceptance: Must match or exceed System malloc (92.6M ops/s)

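For context, the acceptance workload is essentially a random mix of 256B allocations and frees. A self-contained sketch of that kind of loop follows; the slot count, operation count, RNG, and timing are assumptions for illustration, not the real `bench_random_mixed_hakmem` harness.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS 1024          /* live-object working set (assumption) */
#define OPS   10000000UL    /* total alloc/free operations          */
#define SIZE  256           /* fixed 256B allocation size           */

int main(void) {
    void *slot[SLOTS] = {0};
    unsigned r = 12345;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < OPS; i++) {
        r = r * 1664525u + 1013904223u;   /* cheap LCG for a pseudo-random slot */
        unsigned k = r % SLOTS;
        if (slot[k]) { free(slot[k]); slot[k] = NULL; }
        else         { slot[k] = malloc(SIZE); }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    for (unsigned k = 0; k < SLOTS; k++) free(slot[k]);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1fM ops/s\n", OPS / sec / 1e6);
    return 0;
}
```
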
## Conclusion

The HAKMEM Tiny allocator suffers from **architectural bloat** (11 frontend layers), causing a 3.9x performance gap vs System malloc. The solution is aggressive simplification:

1. **Remove 4 dead features** (1 day, +70-110%)
2. **Simplify to 2 layers** (3 days, +200-280%)
3. **Split monolithic files** (3 days, maintainability)

**Total time**: 2 weeks
**Expected outcome**: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
**Risk**: LOW (Phase 1 is ZERO risk, Phase 2 uses A/B tests)

**Recommendation**: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).