Files
hakmem/CURRENT_TASK.md
Moe Charm (CI) 3cc7b675df docs: Start Phase 5 - Mid/Large Allocation Optimization
Update CURRENT_TASK.md with Phase 5 roadmap:
- Goal: +10-26% improvement (57.2M → 63-72M ops/s)
- Strategy: Fix allocation gap + Config Box + Mid MT optimization
- Duration: 12 days / 2 weeks

Phase 5 Steps:
1. Mid MT Verification (2 days)
2. Allocation Gap Elimination (3 days) - Priority 1
3. Mid/Large Config Box (3 days)
4. Mid Registry Pre-allocation (2 days)
5. Documentation & Benchmark (2 days)

Critical Issue Found:
- 1KB-8KB allocations fall through to mmap() when ACE disabled
- Impact: 1000-5000x slower than O(1) allocation
- Fix: Route through existing Mid MT allocator

Phase 4 Complete:
- Result: 53.3M → 57.2M ops/s (+7.3%)
- PGO deferred to final optimization phase

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 12:30:29 +09:00

8.3 KiB

Current Task: Phase 5 - Mid/Large Allocation Optimization

Date: 2025-11-29 Goal: Mid/Large allocation gap elimination + Config Box application Strategy: Fix allocation gap (1KB-8KB) + Compile-time config + Mid MT optimization Expected Gain: +10-26% (57.2M → 63-72M ops/s)


Phase 5 Overview: 5-Step Approach

Step 1: Mid MT Verification (Pending)

  • Duration: 2 days
  • Risk: Low
  • Goal: Verify Mid MT allocator handles 1KB-8KB range efficiently

Deliverables:

  1. Benchmark Mid MT performance for 1KB-8KB sizes
  2. Identify any gaps or inefficiencies
  3. Document current Mid MT behavior

Step 2: Allocation Gap Elimination (Pending)

  • Duration: 3 days
  • Risk: Medium
  • Target: +5-15% improvement
  • Goal: Route 1KB-8KB allocations through Mid MT instead of mmap fallback

Critical Issue:

  • File: core/box/hak_alloc_api.inc.h:171-216
  • Problem: When ACE disabled, 1KB-8KB falls through to mmap()
  • Impact: 1000-5000x slower than O(1) allocation

Deliverables:

  1. Fix routing logic in hak_alloc_api.inc.h
  2. Route all >1KB allocations through Mid MT
  3. Benchmark improvement
  4. Completion report

Step 3: Mid/Large Config Box (Pending)

  • Duration: 3 days
  • Risk: Low
  • Target: +2-4% improvement
  • Goal: Apply Phase 4 Config Box pattern to Mid/Large feature gates

Runtime ENV Checks to Eliminate:

  • HAKMEM_SMALLMID_ENABLE (SmallMid allocator gate)
  • HAKMEM_POOL_TLS (Pool allocator gate)
  • HAKMEM_BIGCACHE (BigCache gate)
  • HAKMEM_ACE (ACE allocator gate)
  • 4+ other feature checks in hot path

Deliverables:

  1. core/box/mid_large_config_box.h - Reuse Phase 4 pattern
  2. Replace 5-8 runtime checks with compile-time macros
  3. Build flag: HAKMEM_MID_LARGE_PGO=1
  4. Benchmark improvement
  5. Completion report

Step 4: Mid Registry Pre-allocation (Pending)

  • Duration: 2 days
  • Risk: Low
  • Target: Eliminate lock contention in MT workloads
  • Goal: Pre-allocate Mid MT registry at init instead of lazy allocation

Deliverables:

  1. Modify hakmem_mid_mt.c init to pre-allocate registry
  2. Remove registry lock from hot path
  3. Benchmark MT workload improvement
  4. Completion report

Step 5: Documentation & Final Benchmark (Pending)

  • Duration: 2 days
  • Risk: Low
  • Goal: Document Phase 5 results, prepare for Phase 6

Deliverables:

  1. Phase 5 completion report
  2. Full benchmark suite comparison
  3. Update CURRENT_TASK.md for Phase 6
  4. Git commit & documentation

Phase 5 Success Criteria

bench_random_mixed (ws=256):

  • Phase 4 result: 57.2M ops/s (Hot/Cold Box, no PGO)
  • Phase 5.1 (Gap fix): 60-65M ops/s (+5-15%)
  • Phase 5.2 (Config Box): 62-68M ops/s (+2-4% cumulative)
  • Phase 5.3 (Registry): 63-70M ops/s (MT improvement)
  • Phase 5 target: 63-72M ops/s ✓ (+10-26% cumulative)

Allocation Gap Impact:

  • 1KB-8KB allocations: mmap() → Mid MT (1000-5000x faster)

Current Status: Phase 5 Ready to Start

Phase 4 Complete :

  • Step 1: PGO Workflow Box (+6.25%)
  • Step 2: Hot/Cold Path Box (+7.3%)
  • Step 3: Front Config Box (+2.7-4.9%)
  • Result: 53.3M → 57.2M ops/s (+7.3%, without PGO)

Phase 5 Next Actions:

  1. Step 1: Verify Mid MT for 1KB range (2 days)
  2. Step 2: Eliminate allocation gap (3 days)
  3. Step 3: Apply Config Box pattern (3 days)
  4. Step 4: Pre-allocate Mid registry (2 days)
  5. Step 5: Documentation & benchmarks (2 days)

Total Duration: 12 days / 2 weeks



Previous: Phase 4 - Tiny Front Optimization COMPLETE

Date: 2025-11-29 Goal: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s) Strategy: Box化 + PGO + Hot/Cold separation Result: 53.3M → 57.2M ops/s (+7.3%, without PGO)


Phase 4 Overview: 3-Step Approach

Step 1: PGO Workflow Box COMPLETE (+6.25%)

  • Duration: 1-2 days Completed: 2025-11-29
  • Risk: Low
  • Target: 56.8M → 60-62M ops/s
  • Actual: 57.0M → 60.6M ops/s (+6.25%)

Deliverables:

  1. scripts/box/pgo_tiny_profile_box.sh - Profile collection automation
  2. scripts/box/pgo_tiny_profile_config.sh - Workload configuration
  3. Makefile targets: pgo-tiny-profile, pgo-tiny-collect, pgo-tiny-build, pgo-tiny-full
  4. Makefile help target updated with PGO instructions
  5. Benchmark comparison (before/after PGO)
  6. Completion report: PHASE4_STEP1_COMPLETE.md

Step 2: Hot/Cold Path Box COMPLETE (+7.3%)

  • Duration: 3-5 days Completed: 2025-11-29
  • Risk: Medium
  • Target: 60-62M → 68-75M ops/s (cumulative +15-25%)
  • Actual: 53.3M → 57.2M ops/s (+7.3%, without PGO)

Deliverables:

  1. core/box/tiny_front_hot_box.h - Ultra-fast path (1 branch, range check removed)
  2. core/box/tiny_front_cold_box.h - Slow path (noinline, cold)
  3. Refactored malloc_tiny_fast() to use Hot/Cold boxes
  4. ⏸️ PGO re-optimization (temporarily disabled due to build issues)
  5. Completion report: PHASE4_STEP2_COMPLETE.md

Note: PGO temporarily disabled (build issues). Expected +13-15% with PGO re-enabled.


Step 3: Front Config Box COMPLETE (+2.7-4.9%)

  • Duration: 2-3 days Completed: 2025-11-29
  • Risk: Low
  • Target: 68-75M → 73-83M ops/s (cumulative +20-33%)
  • Actual: 50.3M → 52.8M ops/s (+2.7-4.9%, limited scope)

Deliverables:

  1. core/box/tiny_front_config_box.h - Compile-time config management
  2. Replace runtime checks with TINY_FRONT_*_ENABLED macros (2 call sites)
  3. Build flag: HAKMEM_TINY_FRONT_PGO=1
  4. ⏸️ Final PGO optimization (PGO still disabled due to build issues)
  5. Completion report: PHASE4_STEP3_COMPLETE.md

Note: Achieved +2.7-4.9% (below +5-8% target) due to limited scope (1 function, 2 call sites). Full target achievable by expanding to all config functions (6+ remaining).


Success Criteria

bench_random_mixed (ws=256):

  • Phase 3 baseline: 56.8M ops/s
  • Phase 4.1 (PGO): 60-62M ops/s
  • Phase 4.2 (Hot/Cold): 68-75M ops/s
  • Phase 4.3 (Config): 73-83M ops/s ✓ (vs mimalloc 107M = 68-77%)

bench_tiny_hot (64B):

  • Phase 3 baseline: 81.0M ops/s
  • Phase 4.3 target: 100-115M ops/s ✓ (vs system 156M = 64-74%)

Current Status: All 3 Steps Complete → Next: PGO Fix or Expand Config Box

Completed (Step 1):

  1. PGO Profile Collection Box implemented (+6.25% improvement with PGO)
  2. Makefile workflow automation (make pgo-tiny-full)
  3. Help target updated for discoverability
  4. Completion report: PHASE4_STEP1_COMPLETE.md

Completed (Step 2):

  1. Tiny Front Hot Path Box (1 branch, range check removed)
  2. Tiny Front Cold Path Box (noinline, cold attributes)
  3. Refactored malloc_tiny_fast() with Hot/Cold separation
  4. Benchmark: +7.3% improvement (53.3 → 57.2 M ops/s, without PGO)
  5. Completion report: PHASE4_STEP2_COMPLETE.md

Completed (Step 3):

  1. Front Config Box (compile-time config, dead code elimination)
  2. Build flag: HAKMEM_TINY_FRONT_PGO=1
  3. Config macros: TINY_FRONT_*_ENABLED (2 call sites updated)
  4. Benchmark: +2.7-4.9% improvement (50.3 → 52.8 M ops/s)
  5. Completion report: PHASE4_STEP3_COMPLETE.md

Next Actions (Choose One):

  • Option A: Expand Config Box - Replace 6+ remaining config functions (+2-3% more expected)
  • Option B: Fix PGO - Resolve build issues, re-enable PGO workflow (+6% expected from Step 1)
  • Option C: Mark Phase 4 Complete - Move to next phase or final optimization

Design Reference: docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md (already complete)


Notes from ChatGPT Analysis

Real bottleneck:

  • NOT front_gate_v2 alone
  • BUT tiny_alloc_fast() overall complexity (15-20 branches)

Branch explosion sources:

  1. ultra_slim_mode_enabled() gate
  2. hak_tiny_size_to_class range check
  3. tiny_sizeclass_hist_hit (profile)
  4. HeapV2 enabled/disabled
  5. FastCache enabled/disabled
  6. SFC enabled/disabled + hit/miss
  7. TLS SLL enabled/disabled + per-class branches
  8. Multiple env gates in refill path

Pool/Tiny boundary: Negligible overhead (0.1-0.2% in bench)

memset/page fault: Already optimized (TRUST_MMAP_ZERO=1)


Updated: 2025-11-29 Phase: 4 (Tiny Front Optimization) Previous: Phase 3 (mincore removal, +10.7%)