HAKMEM Tiny Allocator Refactoring - Executive Summary
Problem Statement
- Current performance: 23.6M ops/s (Random Mixed 256B benchmark)
- System malloc: 92.6M ops/s (baseline)
- Performance gap: 3.9x slower
Root Cause: tiny_alloc_fast() generates 2624 lines of assembly (should be ~20-50 lines), causing:
- 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17)
- Instruction cache thrashing from 11 overlapping frontend layers
- Branch prediction failures from 26 conditional compilation paths + 38 runtime checks
Architecture Analysis
Current Bloat Inventory
Frontend Layers in tiny_alloc_fast() (11 total):
- FastCache (C0-C3 array stack)
- SFC (Super Front Cache, all classes)
- Front C23 (Ultra-simple C2/C3)
- Unified Cache (tcache-style, all classes)
- Ring Cache (C2/C3/C5 array cache)
- UltraHot (C2-C5 magazine)
- HeapV2 (C0-C3 magazine)
- Class5 Hotpath (256B dedicated path)
- TLS SLL (generic freelist)
- Front-Direct (experimental bypass)
- Legacy refill path
Problem: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!
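To illustrate the redundancy cost, here is a minimal sketch (all names and structures invented for illustration, not the real HAKMEM code) of what a layered fast path looks like when three caches all cover the same size classes: every miss in one layer adds branches and code footprint before the next layer is even probed.

```c
#include <stddef.h>

/* Toy model of three redundant C2/C3 frontend caches (hypothetical names). */
typedef struct { void *slots[8]; int top; } toy_cache_t;

static toy_cache_t ring_c23, front_c23, ultrahot_c23; /* all cover C2/C3! */

static void toy_push(toy_cache_t *c, void *p) { c->slots[c->top++] = p; }

static void *toy_pop(toy_cache_t *c) {
    return c->top > 0 ? c->slots[--c->top] : NULL;
}

/* Layered fast path: three redundant probes before giving up.
 * Each probe is a load + compare + branch the hot path must carry. */
void *layered_alloc_fast(void) {
    void *p;
    if ((p = toy_pop(&ring_c23)))     return p;
    if ((p = toy_pop(&front_c23)))    return p;
    if ((p = toy_pop(&ultrahot_c23))) return p;
    return NULL; /* fall through to slow refill */
}
```

With 11 real layers instead of 3 toy ones, this probe chain is what inflates tiny_alloc_fast() to thousands of assembly lines.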
File Size Issues
- hakmem_tiny.c: 2228 lines (should be ~300-500)
- tiny_alloc_fast.inc.h: 885 lines (should be ~50-100)
- core/front/ directory: 2127 lines total (11 experimental layers)
Solution: 3-Phase Refactoring
Phase 1: Remove Dead Features (1 day, ZERO risk)
Target: 4 features proven harmful or redundant
| Feature | Lines | Status | Evidence |
|---|---|---|---|
| UltraHot | ~150 | Disabled by default | A/B test: +12.9% when OFF |
| HeapV2 | ~120 | Disabled by default | Redundant with Ring Cache |
| Front C23 | ~80 | Opt-in only | Redundant with Ring Cache |
| Class5 Hotpath | ~150 | Disabled by default | Special case, unnecessary |
Expected Results:
- Assembly: 2624 → 1000-1200 lines (-60%)
- Performance: 23.6M → 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO (all disabled & proven harmful)
Phase 2: Simplify to 2-Layer Architecture (2-3 days)
Current: 11 frontend layers (chaotic)
Target: 2 frontend layers (clean), with the SuperSlab backend remaining as the refill source:
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
↓ miss
Layer 1: TLS SLL (unlimited overflow)
↓ miss
Layer 2: SuperSlab backend (refill source)
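A minimal sketch of what the target 2-layer fast path could look like (structure and names are illustrative assumptions, not the final API): Layer 0 is a bounded per-class array cache, Layer 1 an unbounded TLS singly-linked freelist, and a miss on both falls through to the out-of-line backend refill.

```c
#include <stddef.h>

#define UC_CAP 16 /* per-class capacity of the array cache (illustrative) */

typedef struct { void *slots[UC_CAP]; int count; } unified_cache_t;
typedef struct sll_node { struct sll_node *next; } sll_node_t;

static unified_cache_t uc[8];   /* Layer 0: one array cache per class C0-C7 */
static sll_node_t *tls_sll[8];  /* Layer 1: unbounded overflow freelist */

void *tiny_alloc_fast_sketch(int cls) {
    unified_cache_t *c = &uc[cls];
    if (c->count > 0)                  /* Layer 0: tcache-style array pop */
        return c->slots[--c->count];
    sll_node_t *n = tls_sll[cls];
    if (n) {                           /* Layer 1: SLL head pop */
        tls_sll[cls] = n->next;
        return n;
    }
    return NULL;                       /* miss: SuperSlab refill (slow path) */
}

void tiny_free_fast_sketch(int cls, void *p) {
    unified_cache_t *c = &uc[cls];
    if (c->count < UC_CAP) {           /* Layer 0 has room: array push */
        c->slots[c->count++] = p;
        return;
    }
    sll_node_t *n = (sll_node_t *)p;   /* overflow into Layer 1 */
    n->next = tls_sll[cls];
    tls_sll[cls] = n;
}
```

The common case is a single bounds check plus an array pop, which is why this shape can approach the tcache-equivalent instruction count cited below.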
Tasks:
- A/B test: Ring Cache vs Unified Cache → pick winner
- A/B test: FastCache vs SFC → consolidate into winner
- A/B test: Front-Direct vs Legacy → pick one refill path
- Extract ultra-fast path to tiny_alloc_ultra.inc.h (50 lines)
Expected Results:
- Assembly: 1000-1200 → 150-200 lines (-90% from baseline)
- Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
- Time: 2-3 days
- Risk: LOW (A/B tests ensure no regression)
Phase 3: Split Monolithic Files (2-3 days)
Current: hakmem_tiny.c (2228 lines, unmaintainable)
Target: 7 focused modules (~300-500 lines each)
hakmem_tiny.c (300-400 lines) - Public API
tiny_state.c (200-300 lines) - Global state
tiny_tls.c (300-400 lines) - TLS operations
tiny_superslab.c (400-500 lines) - SuperSlab backend
tiny_registry.c (200-300 lines) - Slab registry
tiny_lifecycle.c (200-300 lines) - Init/shutdown
tiny_stats.c (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines) - FAST PATH (inline)
Expected Results:
- Maintainability: Much improved (clear dependencies)
- Performance: No change (structural refactor only)
- Time: 2-3 days
- Risk: MEDIUM (need careful dependency management)
Performance Projections
Baseline (Current)
- Performance: 23.6M ops/s
- Assembly: 2624 lines
- L1 misses: 1.98 miss/op
- Gap to System malloc: 3.9x slower
After Phase 1 (Quick Win)
- Performance: 40-50M ops/s (+70-110%)
- Assembly: 1000-1200 lines (-60%)
- L1 misses: 0.8-1.2 miss/op (-40%)
- Gap to System malloc: 1.9-2.3x slower
After Phase 2 (Architecture Fix)
- Performance: 70-90M ops/s (+200-280%)
- Assembly: 150-200 lines (-92%)
- L1 misses: 0.3-0.5 miss/op (-75%)
- Gap to System malloc: 1.0-1.3x slower
Target (System malloc parity)
- Performance: 92.6M ops/s (System malloc baseline)
- Assembly: 50-100 lines (tcache equivalent)
- L1 misses: 0.17 miss/op (System malloc level)
- Gap: CLOSED
Implementation Timeline
Week 1: Phase 1 (Quick Win)
- Day 1: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
- Day 2: Test, benchmark, verify (40-50M ops/s expected)
Week 2: Phase 2 (Architecture)
- Day 3: A/B test Ring vs Unified vs SFC (pick winner)
- Day 4: A/B test Front-Direct vs Legacy (pick winner)
- Day 5: Extract tiny_alloc_ultra.inc.h (ultra-fast path)
Week 3: Phase 3 (Code Health)
- Day 6-7: Split hakmem_tiny.c into 7 modules
- Day 8: Test, document, finalize
Total: 8 days (2 weeks)
Risk Assessment
Phase 1 (Zero Risk)
- ✅ All 4 features disabled by default
- ✅ UltraHot proven harmful (+12.9% when OFF)
- ✅ HeapV2/Front C23 redundant (Ring Cache is better)
- ✅ Class5 Hotpath unnecessary (Ring Cache handles C5)
Worst case: Performance stays the same (very unlikely)
Expected case: +70-110% improvement
Best case: +150-200% improvement
Phase 2 (Low Risk)
- ⚠️ A/B tests required before removing features
- ⚠️ Keep losers as fallback during transition
- ✅ Toggle via ENV flags (easy rollback)
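The ENV-flag rollback mentioned above could be as simple as the following sketch (the helper name and the "0 disables" convention are assumptions for illustration): each layer removed in Phase 2 keeps a kill switch so it can be re-enabled without a rebuild.

```c
#include <stdlib.h>
#include <string.h>

/* Read a boolean feature flag from the environment once.
 * Unset or empty -> deflt; "0" -> disabled; anything else -> enabled. */
static int hakmem_env_enabled(const char *name, int deflt) {
    const char *v = getenv(name);
    if (!v || !*v) return deflt;
    return strcmp(v, "0") != 0;
}
```

Usage during the transition might look like `if (hakmem_env_enabled("HAKMEM_RING_CACHE", 1)) { ... }`, so an A/B run only needs to flip the variable between benchmark invocations.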
Worst case: A/B test shows no winner → keep both temporarily
Expected case: +200-280% improvement
Best case: +300-350% improvement
Phase 3 (Medium Risk)
- ⚠️ Circular dependencies in current code
- ⚠️ Need careful extraction to avoid breakage
- ✅ Incremental approach (extract one module at a time)
Worst case: Build breaks → incremental rollback
Expected case: No performance change (structural only)
Best case: Easier maintenance → faster future iterations
Recommended Action
Immediate (Week 1)
Execute Phase 1 immediately - Highest ROI, lowest risk
- Remove 4 dead/harmful features
- Expected: 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO
Short-term (Week 2)
Execute Phase 2 - Core architecture fix
- A/B test competing features, keep winners
- Extract ultra-fast path
- Expected: 70-90M ops/s (+200-280%)
- Time: 3 days
- Risk: LOW (A/B tests mitigate risk)
Medium-term (Week 3)
Execute Phase 3 - Code health & maintainability
- Split monolithic files
- Document architecture
- Expected: No performance change, much easier maintenance
- Time: 2-3 days
- Risk: MEDIUM (careful execution required)
Key Insights
Why Current Architecture Fails
Root Cause: Feature Accumulation Disease
- 26 phases of development, each adding a new layer
- No removal of failed experiments (UltraHot, HeapV2, Front C23)
- Overlapping responsibilities (Ring, Front C23, UltraHot all target C2/C3)
- Result: 11 layers competing → branch explosion → I-cache thrashing
Why System Malloc is Faster
System malloc (glibc tcache):
- 1 layer (tcache)
- 3-4 instructions fast path
- ~10-15 bytes assembly
- Fits entirely in L1 instruction cache
HAKMEM current:
- 11 layers (chaotic)
- 2624 instructions fast path
- ~10KB assembly
- Thrashes L1 instruction cache (32KB = ~10K instructions)
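The tcache comparison above can be made concrete with a sketch (a simplified model, not glibc's actual code): the entire fast path is one freelist-head pop per size class, which compiles down to roughly a load, a test, a store, and a return.

```c
#include <stddef.h>

/* Simplified single-layer, tcache-style freelist (illustrative model). */
typedef struct tc_entry { struct tc_entry *next; } tc_entry_t;

static tc_entry_t *tcache_bins[64]; /* one freelist head per size class */

void *tcache_get_sketch(int bin) {
    tc_entry_t *e = tcache_bins[bin];
    if (e == NULL) return NULL;      /* miss: fall through to slow path */
    tcache_bins[bin] = e->next;      /* pop head */
    return e;
}

void tcache_put_sketch(int bin, void *p) {
    tc_entry_t *e = (tc_entry_t *)p;
    e->next = tcache_bins[bin];      /* push head */
    tcache_bins[bin] = e;
}
```

One layer means one branch on the hot path; there is simply no probe chain left to thrash the instruction cache.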
Solution: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.
Success Metrics
Primary Metric: Performance
- Phase 1 target: 40-50M ops/s (+70-110%)
- Phase 2 target: 70-90M ops/s (+200-280%)
- Final target: 92.6M ops/s (System malloc parity)
Secondary Metrics
- Assembly size: 2624 → 150-200 lines (-92%)
- L1 cache misses: 1.98 → 0.2-0.4 miss/op (-80%)
- Code maintainability: 2228-line monolith → 7 focused modules
Validation
- Benchmark: bench_random_mixed_hakmem (Random Mixed 256B)
- Acceptance: Must match or exceed System malloc (92.6M ops/s)
Conclusion
The HAKMEM Tiny allocator suffers from architectural bloat (11 frontend layers) causing 3.9x performance gap vs System malloc. The solution is aggressive simplification:
- Remove 4 dead features (1 day, +70-110%)
- Simplify to 2 layers (3 days, +200-280%)
- Split monolithic files (3 days, maintainability)
Total time: 2 weeks
Expected outcome: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
Risk: LOW (Phase 1 is ZERO risk, Phase 2 uses A/B tests)
Recommendation: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).