# HAKMEM Tiny Allocator Refactoring - Executive Summary

## Problem Statement

**Current Performance**: 23.6M ops/s (Random Mixed 256B benchmark)
**System malloc**: 92.6M ops/s (baseline)
**Performance gap**: **3.9x slower**

**Root Cause**: `tiny_alloc_fast()` generates **2624 lines of assembly** (should be ~20-50 lines), causing:

- **11.6x more L1 cache misses** than System malloc (1.98 miss/op vs 0.17)
- **Instruction cache thrashing** from 11 overlapping frontend layers
- **Branch prediction failures** from 26 conditional compilation paths + 38 runtime checks

## Architecture Analysis

### Current Bloat Inventory

**Frontend Layers in `tiny_alloc_fast()`** (11 total):

1. FastCache (C0-C3 array stack)
2. SFC (Super Front Cache, all classes)
3. Front C23 (ultra-simple C2/C3)
4. Unified Cache (tcache-style, all classes)
5. Ring Cache (C2/C3/C5 array cache)
6. UltraHot (C2-C5 magazine)
7. HeapV2 (C0-C3 magazine)
8. Class5 Hotpath (256B dedicated path)
9. TLS SLL (generic freelist)
10. Front-Direct (experimental bypass)
11. Legacy refill path

**Problem**: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!

### File Size Issues

- `hakmem_tiny.c`: **2228 lines** (should be ~300-500)
- `tiny_alloc_fast.inc.h`: **885 lines** (should be ~50-100)
- `core/front/` directory: **2127 lines** total (11 experimental layers)

## Solution: 3-Phase Refactoring

### Phase 1: Remove Dead Features (1 day, ZERO risk)

**Target**: 4 features proven harmful or redundant

| Feature | Lines | Status | Evidence |
|---------|-------|--------|----------|
| UltraHot | ~150 | Disabled by default | A/B test: +12.9% when OFF |
| HeapV2 | ~120 | Disabled by default | Redundant with Ring Cache |
| Front C23 | ~80 | Opt-in only | Redundant with Ring Cache |
| Class5 Hotpath | ~150 | Disabled by default | Special case, unnecessary |

**Expected Results**:

- Assembly: 2624 → 1000-1200 lines (-60%)
- Performance: 23.6M → 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: **ZERO** (all four are disabled by default and either proven harmful or redundant)

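To make the cost of these "disabled" layers concrete, here is a minimal sketch (all names, flags, and helpers are hypothetical stand-ins, not the real HAKMEM code) of why Phase 1 deletes the layers rather than leaving them switched off: each one keeps a flag load, a branch, and a call site resident in the hot path even when it never fires.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the experimental layer pops and their flags. */
static bool ultrahot_on, heapv2_on, front_c23_on, class5_on;
static void *ultrahot_pop(int cls)      { (void)cls; return NULL; }
static void *heapv2_pop(int cls)        { (void)cls; return NULL; }
static void *front_c23_pop(int cls)     { (void)cls; return NULL; }
static void *class5_pop(void)           { return NULL; }
static void *unified_cache_pop(int cls) { (void)cls; return NULL; }
static void *slow_refill(int cls)       { (void)cls; return NULL; }

/* Before Phase 1: every "disabled" layer still leaves a flag check, a branch,
 * and a call site in the hottest function, so the I-cache cost remains. */
void *tiny_alloc_fast_before(int cls) {
    void *p;
    if (ultrahot_on  && (p = ultrahot_pop(cls)))  return p;
    if (heapv2_on    && (p = heapv2_pop(cls)))    return p;
    if (front_c23_on && (p = front_c23_pop(cls))) return p;
    if (class5_on && cls == 5 && (p = class5_pop())) return p;
    if ((p = unified_cache_pop(cls)))             return p;
    return slow_refill(cls);
}

/* After Phase 1: the four dead layers are deleted, not merely switched off. */
void *tiny_alloc_fast_after(int cls) {
    void *p = unified_cache_pop(cls);
    return p ? p : slow_refill(cls);
}
```
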
### Phase 2: Simplify to 2-Layer Architecture (2-3 days)

**Current**: 11 layers (chaotic)
**Target**: 2 layers (clean)

```
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
  ↓ miss
Layer 1: TLS SLL (unlimited overflow)
  ↓ miss
Layer 2: SuperSlab backend (refill source)
```

**Tasks**:

1. A/B test: Ring Cache vs Unified Cache → pick winner
2. A/B test: FastCache vs SFC → consolidate into winner
3. A/B test: Front-Direct vs Legacy → pick one refill path
4. Extract ultra-fast path to `tiny_alloc_ultra.inc.h` (~50 lines; sketched below)

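The following is a minimal sketch of what the two-layer front end, roughly the ~50-line `tiny_alloc_ultra.inc.h` from task 4, could look like. All names, capacities, and the backend hook are illustrative assumptions rather than the existing HAKMEM API; the SuperSlab refill is left as an extern declaration.

```c
#include <stddef.h>

#define TINY_NUM_CLASSES  8
#define UNIFIED_CACHE_CAP 32

/* Hypothetical per-thread state: Layer 0 is a bounded array cache,
 * Layer 1 is an unbounded singly linked freelist (TLS SLL). */
typedef struct {
    void    *slots[TINY_NUM_CLASSES][UNIFIED_CACHE_CAP];
    unsigned count[TINY_NUM_CLASSES];
    void    *sll_head[TINY_NUM_CLASSES];
} tiny_tls_cache_t;

static __thread tiny_tls_cache_t g_tls;

/* Layer 2 stand-in: slow-path refill from the SuperSlab backend. */
void *tiny_refill_from_superslab(int cls);

static inline void *tiny_alloc_ultra(int cls) {
    tiny_tls_cache_t *t = &g_tls;

    /* Layer 0: tcache-style pop (a load, a compare, a store). */
    unsigned n = t->count[cls];
    if (n != 0) {
        t->count[cls] = n - 1;
        return t->slots[cls][n - 1];
    }

    /* Layer 1: pop the TLS singly linked list (overflow storage). */
    void *p = t->sll_head[cls];
    if (p != NULL) {
        t->sll_head[cls] = *(void **)p;  /* next pointer lives in the free block */
        return p;
    }

    /* Miss: refill from the backend. */
    return tiny_refill_from_superslab(cls);
}
```

Free would mirror this sketch: push into the array while there is room, spill to the SLL otherwise, and return batches to the SuperSlab backend when the SLL grows too large.
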
**Expected Results**:

- Assembly: 1000-1200 → 150-200 lines (-90% from baseline)
- Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
- Time: 2-3 days
- Risk: LOW (A/B tests ensure no regression)

### Phase 3: Split Monolithic Files (2-3 days)

**Current**: `hakmem_tiny.c` (2228 lines, unmaintainable)
**Target**: 7 focused modules (~200-500 lines each) plus an inline fast-path header

```
hakmem_tiny.c          (300-400 lines) - Public API
tiny_state.c           (200-300 lines) - Global state
tiny_tls.c             (300-400 lines) - TLS operations
tiny_superslab.c       (400-500 lines) - SuperSlab backend
tiny_registry.c        (200-300 lines) - Slab registry
tiny_lifecycle.c       (200-300 lines) - Init/shutdown
tiny_stats.c           (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines)  - FAST PATH (inline)
```

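To illustrate what "clear dependencies" could mean in practice, here is a hypothetical interface for one of the extracted modules. The names and signatures are placeholders, not the real functions; the point is that each module exposes a small surface and keeps its state definition private to its own `.c` file.

```c
/* Hypothetical tiny_tls.h: the kind of narrow interface each split-out
 * module could expose (placeholder names, not the real HAKMEM symbols). */
#ifndef TINY_TLS_H
#define TINY_TLS_H

#include <stddef.h>

/* Per-thread cache state; the struct definition stays inside tiny_tls.c. */
typedef struct tiny_tls_cache tiny_tls_cache_t;

tiny_tls_cache_t *tiny_tls_get(void);                        /* lazy TLS init        */
void  *tiny_tls_pop(tiny_tls_cache_t *t, int cls);           /* cached block or NULL */
void   tiny_tls_push(tiny_tls_cache_t *t, int cls, void *p); /* cache a freed block  */
size_t tiny_tls_flush(tiny_tls_cache_t *t);                  /* return blocks to the backend */

#endif /* TINY_TLS_H */
```
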
**Expected Results**:

- Maintainability: Much improved (clear dependencies)
- Performance: No change (structural refactor only)
- Time: 2-3 days
- Risk: MEDIUM (needs careful dependency management)

## Performance Projections

### Baseline (Current)

- **Performance**: 23.6M ops/s
- **Assembly**: 2624 lines
- **L1 misses**: 1.98 miss/op
- **Gap to System malloc**: 3.9x slower

### After Phase 1 (Quick Win)

- **Performance**: 40-50M ops/s (+70-110%)
- **Assembly**: 1000-1200 lines (-60%)
- **L1 misses**: 0.8-1.2 miss/op (-40%)
- **Gap to System malloc**: 1.9-2.3x slower

### After Phase 2 (Architecture Fix)

- **Performance**: 70-90M ops/s (+200-280%)
- **Assembly**: 150-200 lines (-92%)
- **L1 misses**: 0.3-0.5 miss/op (-75%)
- **Gap to System malloc**: 1.0-1.3x slower

### Target (System malloc parity)

- **Performance**: 92.6M ops/s (System malloc baseline)
- **Assembly**: 50-100 lines (tcache equivalent)
- **L1 misses**: 0.17 miss/op (System malloc level)
- **Gap**: **CLOSED**

## Implementation Timeline

### Week 1: Phase 1 (Quick Win)

- **Day 1**: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
- **Day 2**: Test, benchmark, verify (40-50M ops/s expected)

### Week 2: Phase 2 (Architecture)

- **Day 3**: A/B test Ring vs Unified vs SFC (pick winner)
- **Day 4**: A/B test Front-Direct vs Legacy (pick winner)
- **Day 5**: Extract `tiny_alloc_ultra.inc.h` (ultra-fast path)

### Week 3: Phase 3 (Code Health)

- **Days 6-7**: Split `hakmem_tiny.c` into 7 modules
- **Day 8**: Test, document, finalize

**Total**: 8 working days (about 2 weeks)

## Risk Assessment

### Phase 1 (Zero Risk)

- ✅ All 4 features disabled by default
- ✅ UltraHot proven harmful (+12.9% when OFF)
- ✅ HeapV2/Front C23 redundant (Ring Cache is better)
- ✅ Class5 Hotpath unnecessary (Ring Cache handles C5)

**Worst case**: Performance stays the same (very unlikely)
**Expected case**: +70-110% improvement
**Best case**: +150-200% improvement

### Phase 2 (Low Risk)

- ⚠️ A/B tests required before removing features
- ⚠️ Keep losers as fallback during transition
- ✅ Toggle via ENV flags (easy rollback; see the sketch below)

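One way the ENV-flag rollback could be wired is sketched here; the flag and function names are invented for illustration, not existing HAKMEM switches. The flag is read once at init so the hot path only tests a cached boolean.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of an ENV-driven A/B toggle (hypothetical flag name). */
bool hak_env_flag(const char *name, bool default_on) {
    const char *v = getenv(name);
    if (v == NULL || *v == '\0') return default_on;
    return !(strcmp(v, "0") == 0 || strcmp(v, "off") == 0);
}

bool g_use_unified_cache;

void tiny_front_init(void) {
    /* e.g. HAKMEM_TINY_UNIFIED=0 would roll back to the Ring Cache path */
    g_use_unified_cache = hak_env_flag("HAKMEM_TINY_UNIFIED", true);
}
```
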
**Worst case**: A/B test shows no winner → keep both temporarily
**Expected case**: +200-280% improvement
**Best case**: +300-350% improvement

### Phase 3 (Medium Risk)

- ⚠️ Circular dependencies in current code
- ⚠️ Needs careful extraction to avoid breakage
- ✅ Incremental approach (extract one module at a time)

**Worst case**: Build breaks → incremental rollback
**Expected case**: No performance change (structural only)
**Best case**: Easier maintenance → faster future iterations

## Recommended Action

### Immediate (Week 1)

**Execute Phase 1 immediately** - Highest ROI, lowest risk

- Remove 4 dead/harmful features
- Expected: 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO

### Short-term (Week 2)

**Execute Phase 2** - Core architecture fix

- A/B test competing features, keep winners
- Extract ultra-fast path
- Expected: 70-90M ops/s (+200-280%)
- Time: 3 days
- Risk: LOW (A/B tests mitigate risk)

### Medium-term (Week 3)

**Execute Phase 3** - Code health & maintainability

- Split monolithic files
- Document architecture
- Expected: No performance change, much easier maintenance
- Time: 2-3 days
- Risk: MEDIUM (careful execution required)

## Key Insights

### Why Current Architecture Fails

**Root Cause**: **Feature Accumulation Disease**

- 26 phases of development, each adding a new layer
- No removal of failed experiments (UltraHot, HeapV2, Front C23)
- Overlapping responsibilities (Ring, Front C23, UltraHot all target C2/C3)
- **Result**: 11 layers competing → branch explosion → I-cache thrashing

### Why System Malloc is Faster

**System malloc (glibc tcache)**:

- 1 layer (tcache)
- 3-4 instruction fast path
- ~10-15 bytes of assembly
- Fits entirely in the L1 instruction cache

**HAKMEM current**:

- 11 layers (chaotic)
- 2624-instruction fast path
- ~10KB of assembly
- Thrashes the L1 instruction cache (32KB ≈ 10K instructions)

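The shape of that fast path is roughly the following, a simplified sketch of the tcache idea (not glibc's actual source); this is the code size the simplified Layer 0 is meant to match. The slow path is left as a declaration.

```c
#include <stddef.h>

/* Simplified tcache-style fast path: one size-class index, one head load,
 * one pointer chase, one store. A handful of instructions that stay
 * resident in the L1 instruction cache. */
typedef struct tcache_entry { struct tcache_entry *next; } tcache_entry;

typedef struct {
    unsigned short counts[64];
    tcache_entry  *entries[64];
} tcache_t;

static __thread tcache_t tcache;

void *slow_path_alloc(size_t size);  /* everything else lives off the hot path */

static inline void *fast_alloc(size_t size, unsigned idx) {
    /* idx is the size-class index (computed from size by the caller). */
    tcache_entry *e = tcache.entries[idx];
    if (e != NULL) {
        tcache.entries[idx] = e->next;   /* pop the head of the per-class freelist */
        tcache.counts[idx]--;
        return e;
    }
    return slow_path_alloc(size);
}
```
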
**Solution**: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.

## Success Metrics

### Primary Metric: Performance

- **Phase 1 target**: 40-50M ops/s (+70-110%)
- **Phase 2 target**: 70-90M ops/s (+200-280%)
- **Final target**: 92.6M ops/s (System malloc parity)

### Secondary Metrics

- **Assembly size**: 2624 → 150-200 lines (-92%)
- **L1 cache misses**: 1.98 → 0.2-0.4 miss/op (-80%)
- **Code maintainability**: 2228-line monolith → 7 focused modules

### Validation

- Benchmark: `bench_random_mixed_hakmem` (Random Mixed 256B)
- Acceptance: Must match or exceed System malloc (92.6M ops/s)

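For context, the acceptance workload is essentially a random mix of 256B allocations and frees. A self-contained sketch of that kind of loop follows; the slot count, operation count, RNG, and timing are assumptions for illustration, not the real `bench_random_mixed_hakmem` harness.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SLOTS 1024          /* live-object working set (assumption) */
#define OPS   10000000UL    /* total alloc/free operations          */
#define SIZE  256           /* fixed 256B allocation size           */

int main(void) {
    void *slot[SLOTS] = {0};
    unsigned r = 12345;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned long i = 0; i < OPS; i++) {
        r = r * 1664525u + 1013904223u;   /* cheap LCG for a pseudo-random slot */
        unsigned k = r % SLOTS;
        if (slot[k]) { free(slot[k]); slot[k] = NULL; }
        else         { slot[k] = malloc(SIZE); }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    for (unsigned k = 0; k < SLOTS; k++) free(slot[k]);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1fM ops/s\n", OPS / sec / 1e6);
    return 0;
}
```
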
## Conclusion

The HAKMEM Tiny allocator suffers from **architectural bloat** (11 frontend layers), causing a 3.9x performance gap vs System malloc. The solution is aggressive simplification:

1. **Remove 4 dead features** (1 day, +70-110%)
2. **Simplify to 2 layers** (3 days, +200-280%)
3. **Split monolithic files** (3 days, maintainability)

**Total time**: 2 weeks
**Expected outcome**: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
**Risk**: LOW (Phase 1 is ZERO risk, Phase 2 uses A/B tests)

**Recommendation**: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).