HAKMEM Tiny Allocator Refactoring - Executive Summary
Problem Statement
- Current performance: 23.6M ops/s (Random Mixed 256B benchmark)
- System malloc: 92.6M ops/s (baseline)
- Performance gap: 3.9x slower
Root Cause: tiny_alloc_fast() generates 2624 lines of assembly (should be ~20-50 lines), causing:
- 11.6x more L1 cache misses than System malloc (1.98 miss/op vs 0.17)
- Instruction cache thrashing from 11 overlapping frontend layers
- Branch prediction failures from 26 conditional compilation paths + 38 runtime checks
Architecture Analysis
Current Bloat Inventory
Frontend Layers in tiny_alloc_fast() (11 total):
- FastCache (C0-C3 array stack)
- SFC (Super Front Cache, all classes)
- Front C23 (Ultra-simple C2/C3)
- Unified Cache (tcache-style, all classes)
- Ring Cache (C2/C3/C5 array cache)
- UltraHot (C2-C5 magazine)
- HeapV2 (C0-C3 magazine)
- Class5 Hotpath (256B dedicated path)
- TLS SLL (generic freelist)
- Front-Direct (experimental bypass)
- Legacy refill path
Problem: Massive redundancy - Ring Cache, Front C23, and UltraHot all target C2/C3!
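To illustrate the redundancy cost, here is a minimal sketch (all names and structures invented for illustration, not the real HAKMEM code) of what a layered fast path looks like when three caches all cover the same size classes: every miss in one layer adds branches and code footprint before the next layer is even probed.

```c
#include <stddef.h>

/* Toy model of three redundant C2/C3 frontend caches (hypothetical names). */
typedef struct { void *slots[8]; int top; } toy_cache_t;

static toy_cache_t ring_c23, front_c23, ultrahot_c23; /* all cover C2/C3! */

static void toy_push(toy_cache_t *c, void *p) { c->slots[c->top++] = p; }

static void *toy_pop(toy_cache_t *c) {
    return c->top > 0 ? c->slots[--c->top] : NULL;
}

/* Layered fast path: three redundant probes before giving up.
 * Each probe is a load + compare + branch the hot path must carry. */
void *layered_alloc_fast(void) {
    void *p;
    if ((p = toy_pop(&ring_c23)))     return p;
    if ((p = toy_pop(&front_c23)))    return p;
    if ((p = toy_pop(&ultrahot_c23))) return p;
    return NULL; /* fall through to slow refill */
}
```

With 11 real layers instead of 3 toy ones, this probe chain is what inflates tiny_alloc_fast() to thousands of assembly lines.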
File Size Issues
- hakmem_tiny.c: 2228 lines (should be ~300-500)
- tiny_alloc_fast.inc.h: 885 lines (should be ~50-100)
- core/front/ directory: 2127 lines total (11 experimental layers)
Solution: 3-Phase Refactoring
Phase 1: Remove Dead Features (1 day, ZERO risk)
Target: 4 features proven harmful or redundant
| Feature | Lines | Status | Evidence |
|---|---|---|---|
| UltraHot | ~150 | Disabled by default | A/B test: +12.9% when OFF |
| HeapV2 | ~120 | Disabled by default | Redundant with Ring Cache |
| Front C23 | ~80 | Opt-in only | Redundant with Ring Cache |
| Class5 Hotpath | ~150 | Disabled by default | Special case, unnecessary |
Expected Results:
- Assembly: 2624 → 1000-1200 lines (-60%)
- Performance: 23.6M → 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO (all disabled & proven harmful)
Phase 2: Simplify to 2-Layer Architecture (2-3 days)
Current: 11 frontend layers (chaotic)
Target: 2 frontend layers (clean), with the SuperSlab backend remaining as the refill source:
Layer 0: Unified Cache (tcache-style, all classes C0-C7)
↓ miss
Layer 1: TLS SLL (unlimited overflow)
↓ miss
Layer 2: SuperSlab backend (refill source)
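A minimal sketch of what the target 2-layer fast path could look like (structure and names are illustrative assumptions, not the final API): Layer 0 is a bounded per-class array cache, Layer 1 an unbounded TLS singly-linked freelist, and a miss on both falls through to the out-of-line backend refill.

```c
#include <stddef.h>

#define UC_CAP 16 /* per-class capacity of the array cache (illustrative) */

typedef struct { void *slots[UC_CAP]; int count; } unified_cache_t;
typedef struct sll_node { struct sll_node *next; } sll_node_t;

static unified_cache_t uc[8];   /* Layer 0: one array cache per class C0-C7 */
static sll_node_t *tls_sll[8];  /* Layer 1: unbounded overflow freelist */

void *tiny_alloc_fast_sketch(int cls) {
    unified_cache_t *c = &uc[cls];
    if (c->count > 0)                  /* Layer 0: tcache-style array pop */
        return c->slots[--c->count];
    sll_node_t *n = tls_sll[cls];
    if (n) {                           /* Layer 1: SLL head pop */
        tls_sll[cls] = n->next;
        return n;
    }
    return NULL;                       /* miss: SuperSlab refill (slow path) */
}

void tiny_free_fast_sketch(int cls, void *p) {
    unified_cache_t *c = &uc[cls];
    if (c->count < UC_CAP) {           /* Layer 0 has room: array push */
        c->slots[c->count++] = p;
        return;
    }
    sll_node_t *n = (sll_node_t *)p;   /* overflow into Layer 1 */
    n->next = tls_sll[cls];
    tls_sll[cls] = n;
}
```

The common case is a single bounds check plus an array pop, which is why this shape can approach the tcache-equivalent instruction count cited below.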
Tasks:
- A/B test: Ring Cache vs Unified Cache → pick winner
- A/B test: FastCache vs SFC → consolidate into winner
- A/B test: Front-Direct vs Legacy → pick one refill path
- Extract ultra-fast path to tiny_alloc_ultra.inc.h (50 lines)
Expected Results:
- Assembly: 1000-1200 → 150-200 lines (-90% from baseline)
- Performance: 40-50M → 70-90M ops/s (+200-280% from baseline)
- Time: 2-3 days
- Risk: LOW (A/B tests ensure no regression)
Phase 3: Split Monolithic Files (2-3 days)
Current: hakmem_tiny.c (2228 lines, unmaintainable)
Target: 7 focused modules (~300-500 lines each)
hakmem_tiny.c (300-400 lines) - Public API
tiny_state.c (200-300 lines) - Global state
tiny_tls.c (300-400 lines) - TLS operations
tiny_superslab.c (400-500 lines) - SuperSlab backend
tiny_registry.c (200-300 lines) - Slab registry
tiny_lifecycle.c (200-300 lines) - Init/shutdown
tiny_stats.c (200-300 lines) - Statistics
tiny_alloc_ultra.inc.h (50-100 lines) - FAST PATH (inline)
Expected Results:
- Maintainability: Much improved (clear dependencies)
- Performance: No change (structural refactor only)
- Time: 2-3 days
- Risk: MEDIUM (need careful dependency management)
Performance Projections
Baseline (Current)
- Performance: 23.6M ops/s
- Assembly: 2624 lines
- L1 misses: 1.98 miss/op
- Gap to System malloc: 3.9x slower
After Phase 1 (Quick Win)
- Performance: 40-50M ops/s (+70-110%)
- Assembly: 1000-1200 lines (-60%)
- L1 misses: 0.8-1.2 miss/op (-40%)
- Gap to System malloc: 1.9-2.3x slower
After Phase 2 (Architecture Fix)
- Performance: 70-90M ops/s (+200-280%)
- Assembly: 150-200 lines (-92%)
- L1 misses: 0.3-0.5 miss/op (-75%)
- Gap to System malloc: 1.0-1.3x slower
Target (System malloc parity)
- Performance: 92.6M ops/s (System malloc baseline)
- Assembly: 50-100 lines (tcache equivalent)
- L1 misses: 0.17 miss/op (System malloc level)
- Gap: CLOSED
Implementation Timeline
Week 1: Phase 1 (Quick Win)
- Day 1: Remove UltraHot, HeapV2, Front C23, Class5 Hotpath
- Day 2: Test, benchmark, verify (40-50M ops/s expected)
Week 2: Phase 2 (Architecture)
- Day 3: A/B test Ring vs Unified vs SFC (pick winner)
- Day 4: A/B test Front-Direct vs Legacy (pick winner)
- Day 5: Extract tiny_alloc_ultra.inc.h (ultra-fast path)
Week 3: Phase 3 (Code Health)
- Day 6-7: Split hakmem_tiny.c into 7 modules
- Day 8: Test, document, finalize
Total: 8 days (2 weeks)
Risk Assessment
Phase 1 (Zero Risk)
- ✅ All 4 features disabled by default
- ✅ UltraHot proven harmful (+12.9% when OFF)
- ✅ HeapV2/Front C23 redundant (Ring Cache is better)
- ✅ Class5 Hotpath unnecessary (Ring Cache handles C5)
Worst case: Performance stays the same (very unlikely)
Expected case: +70-110% improvement
Best case: +150-200% improvement
Phase 2 (Low Risk)
- ⚠️ A/B tests required before removing features
- ⚠️ Keep losers as fallback during transition
- ✅ Toggle via ENV flags (easy rollback)
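The ENV-flag rollback mentioned above could be as simple as the following sketch (the helper name and the "0 disables" convention are assumptions for illustration): each layer removed in Phase 2 keeps a kill switch so it can be re-enabled without a rebuild.

```c
#include <stdlib.h>
#include <string.h>

/* Read a boolean feature flag from the environment once.
 * Unset or empty -> deflt; "0" -> disabled; anything else -> enabled. */
static int hakmem_env_enabled(const char *name, int deflt) {
    const char *v = getenv(name);
    if (!v || !*v) return deflt;
    return strcmp(v, "0") != 0;
}
```

Usage during the transition might look like `if (hakmem_env_enabled("HAKMEM_RING_CACHE", 1)) { ... }`, so an A/B run only needs to flip the variable between benchmark invocations.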
Worst case: A/B test shows no winner → keep both temporarily
Expected case: +200-280% improvement
Best case: +300-350% improvement
Phase 3 (Medium Risk)
- ⚠️ Circular dependencies in current code
- ⚠️ Need careful extraction to avoid breakage
- ✅ Incremental approach (extract one module at a time)
Worst case: Build breaks → incremental rollback
Expected case: No performance change (structural only)
Best case: Easier maintenance → faster future iterations
Recommended Action
Immediate (Week 1)
Execute Phase 1 immediately - Highest ROI, lowest risk
- Remove 4 dead/harmful features
- Expected: 40-50M ops/s (+70-110%)
- Time: 1 day
- Risk: ZERO
Short-term (Week 2)
Execute Phase 2 - Core architecture fix
- A/B test competing features, keep winners
- Extract ultra-fast path
- Expected: 70-90M ops/s (+200-280%)
- Time: 3 days
- Risk: LOW (A/B tests mitigate risk)
Medium-term (Week 3)
Execute Phase 3 - Code health & maintainability
- Split monolithic files
- Document architecture
- Expected: No performance change, much easier maintenance
- Time: 2-3 days
- Risk: MEDIUM (careful execution required)
Key Insights
Why Current Architecture Fails
Root Cause: Feature Accumulation Disease
- 26 phases of development, each adding a new layer
- No removal of failed experiments (UltraHot, HeapV2, Front C23)
- Overlapping responsibilities (Ring, Front C23, UltraHot all target C2/C3)
- Result: 11 layers competing → branch explosion → I-cache thrashing
Why System Malloc is Faster
System malloc (glibc tcache):
- 1 layer (tcache)
- 3-4 instructions fast path
- ~10-15 bytes assembly
- Fits entirely in L1 instruction cache
HAKMEM current:
- 11 layers (chaotic)
- 2624 instructions fast path
- ~10KB assembly
- Thrashes L1 instruction cache (32KB = ~10K instructions)
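The tcache comparison above can be made concrete with a sketch (a simplified model, not glibc's actual code): the entire fast path is one freelist-head pop per size class, which compiles down to roughly a load, a test, a store, and a return.

```c
#include <stddef.h>

/* Simplified single-layer, tcache-style freelist (illustrative model). */
typedef struct tc_entry { struct tc_entry *next; } tc_entry_t;

static tc_entry_t *tcache_bins[64]; /* one freelist head per size class */

void *tcache_get_sketch(int bin) {
    tc_entry_t *e = tcache_bins[bin];
    if (e == NULL) return NULL;      /* miss: fall through to slow path */
    tcache_bins[bin] = e->next;      /* pop head */
    return e;
}

void tcache_put_sketch(int bin, void *p) {
    tc_entry_t *e = (tc_entry_t *)p;
    e->next = tcache_bins[bin];      /* push head */
    tcache_bins[bin] = e;
}
```

One layer means one branch on the hot path; there is simply no probe chain left to thrash the instruction cache.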
Solution: Simplify to 2 layers (Unified Cache + TLS SLL), achieving tcache-equivalent simplicity.
Success Metrics
Primary Metric: Performance
- Phase 1 target: 40-50M ops/s (+70-110%)
- Phase 2 target: 70-90M ops/s (+200-280%)
- Final target: 92.6M ops/s (System malloc parity)
Secondary Metrics
- Assembly size: 2624 → 150-200 lines (-92%)
- L1 cache misses: 1.98 → 0.2-0.4 miss/op (-80%)
- Code maintainability: 2228-line monolith → 7 focused modules
Validation
- Benchmark: bench_random_mixed_hakmem (Random Mixed 256B)
- Acceptance: Must match or exceed System malloc (92.6M ops/s)
Conclusion
The HAKMEM Tiny allocator suffers from architectural bloat (11 frontend layers) causing 3.9x performance gap vs System malloc. The solution is aggressive simplification:
- Remove 4 dead features (1 day, +70-110%)
- Simplify to 2 layers (3 days, +200-280%)
- Split monolithic files (3 days, maintainability)
Total time: 2 weeks
Expected outcome: 23.6M → 70-90M ops/s, approaching System malloc parity (92.6M ops/s)
Risk: LOW (Phase 1 is ZERO risk, Phase 2 uses A/B tests)
Recommendation: Start Phase 1 immediately (highest ROI, lowest risk, 1 day).