hakmem/docs/analysis/REFACTOR_EXECUTIVE_SUMMARY.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV handling (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV handling
- core/tiny_refill.h: BG remote parameters replaced with fixed values
- core/hakmem_tiny_slow.inc: removed BG references

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- deleted 328 markdown files (old reports and duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previous run 52.8M; stable)
- no functional impact from the ENV cleanup
- some debug output remains (to be addressed in the next phase)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00


HAKMEM Tiny Allocator Refactoring - Executive Summary

Problem Statement

  • Current performance: 23.6M ops/s (Random Mixed 256B benchmark)
  • System malloc: 92.6M ops/s (baseline)
  • Performance gap: 3.9x slower

Root Cause: tiny_alloc_fast() generates 2624 lines of assembly (should be ~20-50 lines), causing:

  • 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17)
  • Instruction cache thrashing from 11 overlapping frontend layers
  • Branch prediction failures from 26 conditional compilation paths + 38 runtime checks

Architecture Analysis

Current Bloat Inventory

Frontend Layers in tiny_alloc_fast() (11 total):

  1. FastCache (C0-C3 array stack)
  2. SFC (Super Front Cache, all classes)
  3. Front C23 (Ultra-simple C2/C3)
  4. Unified Cache (tcache-style, all classes)
  5. Ring Cache (C2/C3/C5 array cache)
  6. UltraHot (C2-C5 magazine)
  7. HeapV2 (C0-C3 magazine)
  8. Class5 Hotpath (256B dedicated path)
  9. TLS SLL (generic freelist)
  10. Front-Direct (experimental bypass)
  11. Legacy refill path

Problem: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!

File Size Issues

  • hakmem_tiny.c: 2228 lines (should be ~300-500)
  • tiny_alloc_fast.inc.h: 885 lines (should be ~50-100)
  • core/front/ directory: 2127 lines total (11 experimental layers)

Solution: 3-Phase Refactoring

Phase 1: Remove Dead Features (1 day, ZERO risk)

Target: 4 features proven harmful or redundant

Feature          Lines  Status               Evidence
UltraHot         ~150   Disabled by default  A/B test: +12.9% when OFF
HeapV2           ~120   Disabled by default  Redundant with Ring Cache
Front C23        ~80    Opt-in only          Redundant with Ring Cache
Class5 Hotpath   ~150   Disabled by default  Special case, unnecessary

Expected Results:

  • Assembly: 2624 → 1000-1200 lines (-60%)
  • Performance: 23.6M → 40-50M ops/s (+70-110%)
  • Time: 1 day
  • Risk: ZERO (all disabled & proven harmful)

Phase 2: Simplify to 2-Layer Architecture (2-3 days)

Current: 11 layers (chaotic)
Target: 2 layers (clean)

Layer 0: Unified Cache (tcache-style, all classes C0-C7)
         ↓ miss
Layer 1: TLS SLL (unlimited overflow)
         ↓ miss
Backend: SuperSlab (refill source, not counted as a frontend layer)

Tasks:

  1. A/B test: Ring Cache vs Unified Cache → pick winner
  2. A/B test: FastCache vs SFC → consolidate into winner
  3. A/B test: Front-Direct vs Legacy → pick one refill path
  4. Extract ultra-fast path to tiny_alloc_ultra.inc.h (50 lines)

Expected Results:

  • Assembly: 1000-1200 → 150-200 lines (-92% from baseline)
  • Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
  • Time: 2-3 days
  • Risk: LOW (A/B tests ensure no regression)

Phase 3: Split Monolithic Files (2-3 days)

Current: hakmem_tiny.c (2228 lines, unmaintainable)

Target: 7 focused modules (~300-500 lines each)

hakmem_tiny.c          (300-400 lines) - Public API
tiny_state.c           (200-300 lines) - Global state
tiny_tls.c             (300-400 lines) - TLS operations
tiny_superslab.c       (400-500 lines) - SuperSlab backend
tiny_registry.c        (200-300 lines) - Slab registry
tiny_lifecycle.c       (200-300 lines) - Init/shutdown
tiny_stats.c           (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines)  - FAST PATH (inline)

Expected Results:

  • Maintainability: Much improved (clear dependencies)
  • Performance: No change (structural refactor only)
  • Time: 2-3 days
  • Risk: MEDIUM (need careful dependency management)

Performance Projections

Baseline (Current)

  • Performance: 23.6M ops/s
  • Assembly: 2624 lines
  • L1 misses: 1.98 miss/op
  • Gap to System malloc: 3.9x slower

After Phase 1 (Quick Win)

  • Performance: 40-50M ops/s (+70-110%)
  • Assembly: 1000-1200 lines (-60%)
  • L1 misses: 0.8-1.2 miss/op (-40%)
  • Gap to System malloc: 1.9-2.3x slower

After Phase 2 (Architecture Fix)

  • Performance: 70-90M ops/s (+200-280%)
  • Assembly: 150-200 lines (-92%)
  • L1 misses: 0.3-0.5 miss/op (-75%)
  • Gap to System malloc: 1.0-1.3x slower

Target (System malloc parity)

  • Performance: 92.6M ops/s (System malloc baseline)
  • Assembly: 50-100 lines (tcache equivalent)
  • L1 misses: 0.17 miss/op (System malloc level)
  • Gap: CLOSED

Implementation Timeline

Week 1: Phase 1 (Quick Win)

  • Day 1: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
  • Day 2: Test, benchmark, verify (40-50M ops/s expected)

Week 2: Phase 2 (Architecture)

  • Day 3: A/B test Ring vs Unified vs SFC (pick winner)
  • Day 4: A/B test Front-Direct vs Legacy (pick winner)
  • Day 5: Extract tiny_alloc_ultra.inc.h (ultra-fast path)

Week 3: Phase 3 (Code Health)

  • Day 6-7: Split hakmem_tiny.c into 7 modules
  • Day 8: Test, document, finalize

Total: 8 working days (roughly 2 weeks)

Risk Assessment

Phase 1 (Zero Risk)

  • All 4 features disabled by default
  • UltraHot proven harmful (+12.9% when OFF)
  • HeapV2/Front C23 redundant (Ring Cache is better)
  • Class5 Hotpath unnecessary (Ring Cache handles C5)

  • Worst case: performance stays the same (very unlikely)
  • Expected case: +70-110% improvement
  • Best case: +150-200% improvement

Phase 2 (Low Risk)

  • ⚠️ A/B tests required before removing features
  • ⚠️ Keep losers as fallback during transition
  • Toggle via ENV flags (easy rollback)

  • Worst case: A/B test shows no winner → keep both temporarily
  • Expected case: +200-280% improvement
  • Best case: +300-350% improvement

Phase 3 (Medium Risk)

  • ⚠️ Circular dependencies in current code
  • ⚠️ Need careful extraction to avoid breakage
  • Incremental approach (extract one module at a time)

  • Worst case: build breaks → incremental rollback
  • Expected case: no performance change (structural only)
  • Best case: easier maintenance → faster future iterations

Recommendations

Immediate (Week 1)

Execute Phase 1 immediately - Highest ROI, lowest risk

  • Remove 4 dead/harmful features
  • Expected: 40-50M ops/s (+70-110%)
  • Time: 1 day
  • Risk: ZERO

Short-term (Week 2)

Execute Phase 2 - Core architecture fix

  • A/B test competing features, keep winners
  • Extract ultra-fast path
  • Expected: 70-90M ops/s (+200-280%)
  • Time: 3 days
  • Risk: LOW (A/B tests mitigate risk)

Medium-term (Week 3)

Execute Phase 3 - Code health & maintainability

  • Split monolithic files
  • Document architecture
  • Expected: No performance change, much easier maintenance
  • Time: 2-3 days
  • Risk: MEDIUM (careful execution required)

Key Insights

Why Current Architecture Fails

Root Cause: Feature Accumulation Disease

  • 26 phases of development, each adding a new layer
  • No removal of failed experiments (UltraHot, HeapV2, Front C23)
  • Overlapping responsibilities (Ring, Front C23, UltraHot all target C2/C3)
  • Result: 11 layers competing → branch explosion → I-cache thrashing

Why System Malloc is Faster

System malloc (glibc tcache):

  • 1 layer (tcache)
  • 3-4 instructions fast path
  • ~10-15 bytes assembly
  • Fits entirely in L1 instruction cache

HAKMEM current:

  • 11 layers (chaotic)
  • 2624 instructions fast path
  • ~10KB assembly
  • Thrashes L1 instruction cache (32KB = ~10K instructions)

Solution: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.

Success Metrics

Primary Metric: Performance

  • Phase 1 target: 40-50M ops/s (+70-110%)
  • Phase 2 target: 70-90M ops/s (+200-280%)
  • Final target: 92.6M ops/s (System malloc parity)

Secondary Metrics

  • Assembly size: 2624 → 150-200 lines (-92%)
  • L1 cache misses: 1.98 → 0.2-0.4 miss/op (-80%)
  • Code maintainability: 2228-line monolith → 7 focused modules

Validation

  • Benchmark: bench_random_mixed_hakmem (Random Mixed 256B)
  • Acceptance: Must match or exceed System malloc (92.6M ops/s)

Conclusion

The HAKMEM Tiny allocator suffers from architectural bloat (11 frontend layers), causing a 3.9x performance gap vs System malloc. The solution is aggressive simplification:

  1. Remove 4 dead features (1 day, +70-110%)
  2. Simplify to 2 layers (3 days, +200-280%)
  3. Split monolithic files (3 days, maintainability)

  • Total time: 2 weeks
  • Expected outcome: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
  • Risk: LOW (Phase 1 is ZERO risk, Phase 2 uses A/B tests)

Recommendation: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).