Update CURRENT_TASK.md with Phase 5 roadmap: - Goal: +10-26% improvement (57.2M → 63-72M ops/s) - Strategy: Fix allocation gap + Config Box + Mid MT optimization - Duration: 12 days / 2 weeks Phase 5 Steps: 1. Mid MT Verification (2 days) 2. Allocation Gap Elimination (3 days) - Priority 1 3. Mid/Large Config Box (3 days) 4. Mid Registry Pre-allocation (2 days) 5. Documentation & Benchmark (2 days) Critical Issue Found: - 1KB-8KB allocations fall through to mmap() when ACE disabled - Impact: 1000-5000x slower than O(1) allocation - Fix: Route through existing Mid MT allocator Phase 4 Complete: - Result: 53.3M → 57.2M ops/s (+7.3%) - PGO deferred to final optimization phase 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
8.3 KiB
Current Task: Phase 5 - Mid/Large Allocation Optimization
Date: 2025-11-29 Goal: Mid/Large allocation gap elimination + Config Box application Strategy: Fix allocation gap (1KB-8KB) + Compile-time config + Mid MT optimization Expected Gain: +10-26% (57.2M → 63-72M ops/s)
Phase 5 Overview: 5-Step Approach
Step 1: Mid MT Verification (Pending)
- Duration: 2 days
- Risk: Low
- Goal: Verify Mid MT allocator handles 1KB-8KB range efficiently
Deliverables:
- Benchmark Mid MT performance for 1KB-8KB sizes
- Identify any gaps or inefficiencies
- Document current Mid MT behavior
Step 2: Allocation Gap Elimination (Pending)
- Duration: 3 days
- Risk: Medium
- Target: +5-15% improvement
- Goal: Route 1KB-8KB allocations through Mid MT instead of mmap fallback
Critical Issue:
- File:
core/box/hak_alloc_api.inc.h:171-216 - Problem: When ACE disabled, 1KB-8KB falls through to mmap()
- Impact: 1000-5000x slower than O(1) allocation
Deliverables:
- Fix routing logic in
hak_alloc_api.inc.h - Route all >1KB allocations through Mid MT
- Benchmark improvement
- Completion report
Step 3: Mid/Large Config Box (Pending)
- Duration: 3 days
- Risk: Low
- Target: +2-4% improvement
- Goal: Apply Phase 4 Config Box pattern to Mid/Large feature gates
Runtime ENV Checks to Eliminate:
HAKMEM_SMALLMID_ENABLE(SmallMid allocator gate)HAKMEM_POOL_TLS(Pool allocator gate)HAKMEM_BIGCACHE(BigCache gate)HAKMEM_ACE(ACE allocator gate)- 4+ other feature checks in hot path
Deliverables:
core/box/mid_large_config_box.h- Reuse Phase 4 pattern- Replace 5-8 runtime checks with compile-time macros
- Build flag:
HAKMEM_MID_LARGE_PGO=1 - Benchmark improvement
- Completion report
Step 4: Mid Registry Pre-allocation (Pending)
- Duration: 2 days
- Risk: Low
- Target: Eliminate lock contention in MT workloads
- Goal: Pre-allocate Mid MT registry at init instead of lazy allocation
Deliverables:
- Modify
hakmem_mid_mt.cinit to pre-allocate registry - Remove registry lock from hot path
- Benchmark MT workload improvement
- Completion report
Step 5: Documentation & Final Benchmark (Pending)
- Duration: 2 days
- Risk: Low
- Goal: Document Phase 5 results, prepare for Phase 6
Deliverables:
- Phase 5 completion report
- Full benchmark suite comparison
- Update CURRENT_TASK.md for Phase 6
- Git commit & documentation
Phase 5 Success Criteria
bench_random_mixed (ws=256):
- Phase 4 result: 57.2M ops/s (Hot/Cold Box, no PGO)
- Phase 5.1 (Gap fix): 60-65M ops/s (+5-15%)
- Phase 5.2 (Config Box): 62-68M ops/s (+2-4% cumulative)
- Phase 5.3 (Registry): 63-70M ops/s (MT improvement)
- Phase 5 target: 63-72M ops/s ✓ (+10-26% cumulative)
Allocation Gap Impact:
- 1KB-8KB allocations: mmap() → Mid MT (1000-5000x faster)
Current Status: Phase 5 Ready to Start
Phase 4 Complete ✅:
- Step 1: PGO Workflow Box (+6.25%)
- Step 2: Hot/Cold Path Box (+7.3%)
- Step 3: Front Config Box (+2.7-4.9%)
- Result: 53.3M → 57.2M ops/s (+7.3%, without PGO)
Phase 5 Next Actions:
- Step 1: Verify Mid MT for 1KB range (2 days)
- Step 2: Eliminate allocation gap (3 days)
- Step 3: Apply Config Box pattern (3 days)
- Step 4: Pre-allocate Mid registry (2 days)
- Step 5: Documentation & benchmarks (2 days)
Total Duration: 12 days / 2 weeks
Previous: Phase 4 - Tiny Front Optimization ✅ COMPLETE
Date: 2025-11-29 Goal: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s) Strategy: Box化 + PGO + Hot/Cold separation Result: 53.3M → 57.2M ops/s (+7.3%, without PGO)
Phase 4 Overview: 3-Step Approach
Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)
- Duration:
1-2 daysCompleted: 2025-11-29 - Risk: Low
- Target: 56.8M → 60-62M ops/s
- Actual: 57.0M → 60.6M ops/s (+6.25%) ✓
Deliverables:
- ✅
scripts/box/pgo_tiny_profile_box.sh- Profile collection automation - ✅
scripts/box/pgo_tiny_profile_config.sh- Workload configuration - ✅ Makefile targets:
pgo-tiny-profile,pgo-tiny-collect,pgo-tiny-build,pgo-tiny-full - ✅ Makefile help target updated with PGO instructions
- ✅ Benchmark comparison (before/after PGO)
- ✅ Completion report:
PHASE4_STEP1_COMPLETE.md
Step 2: Hot/Cold Path Box ✅ COMPLETE (+7.3%)
- Duration:
3-5 daysCompleted: 2025-11-29 - Risk: Medium
- Target: 60-62M → 68-75M ops/s (cumulative +15-25%)
- Actual: 53.3M → 57.2M ops/s (+7.3%, without PGO) ✓
Deliverables:
- ✅
core/box/tiny_front_hot_box.h- Ultra-fast path (1 branch, range check removed) - ✅
core/box/tiny_front_cold_box.h- Slow path (noinline, cold) - ✅ Refactored
malloc_tiny_fast()to use Hot/Cold boxes - ⏸️ PGO re-optimization (temporarily disabled due to build issues)
- ✅ Completion report:
PHASE4_STEP2_COMPLETE.md
Note: PGO temporarily disabled (build issues). Expected +13-15% with PGO re-enabled.
Step 3: Front Config Box ✅ COMPLETE (+2.7-4.9%)
- Duration:
2-3 daysCompleted: 2025-11-29 - Risk: Low
- Target: 68-75M → 73-83M ops/s (cumulative +20-33%)
- Actual: 50.3M → 52.8M ops/s (+2.7-4.9%, limited scope) ✓
Deliverables:
- ✅
core/box/tiny_front_config_box.h- Compile-time config management - ✅ Replace runtime checks with
TINY_FRONT_*_ENABLEDmacros (2 call sites) - ✅ Build flag:
HAKMEM_TINY_FRONT_PGO=1 - ⏸️ Final PGO optimization (PGO still disabled due to build issues)
- ✅ Completion report:
PHASE4_STEP3_COMPLETE.md
Note: Achieved +2.7-4.9% (below +5-8% target) due to limited scope (1 function, 2 call sites). Full target achievable by expanding to all config functions (6+ remaining).
Success Criteria
bench_random_mixed (ws=256):
- Phase 3 baseline: 56.8M ops/s
- Phase 4.1 (PGO): 60-62M ops/s
- Phase 4.2 (Hot/Cold): 68-75M ops/s
- Phase 4.3 (Config): 73-83M ops/s ✓ (vs mimalloc 107M = 68-77%)
bench_tiny_hot (64B):
- Phase 3 baseline: 81.0M ops/s
- Phase 4.3 target: 100-115M ops/s ✓ (vs system 156M = 64-74%)
Current Status: All 3 Steps Complete ✅ → Next: PGO Fix or Expand Config Box
Completed (Step 1):
- ✅ PGO Profile Collection Box implemented (+6.25% improvement with PGO)
- ✅ Makefile workflow automation (
make pgo-tiny-full) - ✅ Help target updated for discoverability
- ✅ Completion report:
PHASE4_STEP1_COMPLETE.md
Completed (Step 2):
- ✅ Tiny Front Hot Path Box (1 branch, range check removed)
- ✅ Tiny Front Cold Path Box (noinline, cold attributes)
- ✅ Refactored
malloc_tiny_fast()with Hot/Cold separation - ✅ Benchmark: +7.3% improvement (53.3 → 57.2 M ops/s, without PGO)
- ✅ Completion report:
PHASE4_STEP2_COMPLETE.md
Completed (Step 3):
- ✅ Front Config Box (compile-time config, dead code elimination)
- ✅ Build flag:
HAKMEM_TINY_FRONT_PGO=1 - ✅ Config macros:
TINY_FRONT_*_ENABLED(2 call sites updated) - ✅ Benchmark: +2.7-4.9% improvement (50.3 → 52.8 M ops/s)
- ✅ Completion report:
PHASE4_STEP3_COMPLETE.md
Next Actions (Choose One):
- Option A: Expand Config Box - Replace 6+ remaining config functions (+2-3% more expected)
- Option B: Fix PGO - Resolve build issues, re-enable PGO workflow (+6% expected from Step 1)
- Option C: Mark Phase 4 Complete - Move to next phase or final optimization
Design Reference: docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md (already complete)
Notes from ChatGPT Analysis
Real bottleneck:
- NOT front_gate_v2 alone
- BUT
tiny_alloc_fast()overall complexity (15-20 branches)
Branch explosion sources:
- ultra_slim_mode_enabled() gate
- hak_tiny_size_to_class range check
- tiny_sizeclass_hist_hit (profile)
- HeapV2 enabled/disabled
- FastCache enabled/disabled
- SFC enabled/disabled + hit/miss
- TLS SLL enabled/disabled + per-class branches
- Multiple env gates in refill path
Pool/Tiny boundary: Negligible overhead (0.1-0.2% in bench)
memset/page fault: Already optimized (TRUST_MMAP_ZERO=1)
Updated: 2025-11-29 Phase: 4 (Tiny Front Optimization) Previous: Phase 3 (mincore removal, +10.7%)