# Current Task: Phase 5 - Mid/Large Allocation Optimization **Date**: 2025-11-29 **Goal**: Mid/Large allocation gap elimination + Config Box application **Strategy**: Fix allocation gap (1KB-8KB) + Compile-time config + Mid MT optimization **Expected Gain**: +10-26% (57.2M → 63-72M ops/s) --- ## Phase 5 Overview: 5-Step Approach ### Step 1: Mid MT Verification (Pending) - **Duration**: 2 days - **Risk**: Low - **Goal**: Verify Mid MT allocator handles 1KB-8KB range efficiently **Deliverables**: 1. Benchmark Mid MT performance for 1KB-8KB sizes 2. Identify any gaps or inefficiencies 3. Document current Mid MT behavior --- ### Step 2: Allocation Gap Elimination (Pending) - **Duration**: 3 days - **Risk**: Medium - **Target**: +5-15% improvement - **Goal**: Route 1KB-8KB allocations through Mid MT instead of mmap fallback **Critical Issue**: - **File**: `core/box/hak_alloc_api.inc.h:171-216` - **Problem**: When ACE disabled, 1KB-8KB falls through to mmap() - **Impact**: 1000-5000x slower than O(1) allocation **Deliverables**: 1. Fix routing logic in `hak_alloc_api.inc.h` 2. Route all >1KB allocations through Mid MT 3. Benchmark improvement 4. Completion report --- ### Step 3: Mid/Large Config Box (Pending) - **Duration**: 3 days - **Risk**: Low - **Target**: +2-4% improvement - **Goal**: Apply Phase 4 Config Box pattern to Mid/Large feature gates **Runtime ENV Checks to Eliminate**: - `HAKMEM_SMALLMID_ENABLE` (SmallMid allocator gate) - `HAKMEM_POOL_TLS` (Pool allocator gate) - `HAKMEM_BIGCACHE` (BigCache gate) - `HAKMEM_ACE` (ACE allocator gate) - 4+ other feature checks in hot path **Deliverables**: 1. `core/box/mid_large_config_box.h` - Reuse Phase 4 pattern 2. Replace 5-8 runtime checks with compile-time macros 3. Build flag: `HAKMEM_MID_LARGE_PGO=1` 4. Benchmark improvement 5. Completion report --- ### Step 4: Mid Registry Pre-allocation (Pending) - **Duration**: 2 days - **Risk**: Low - **Target**: Eliminate lock contention in MT workloads - **Goal**: Pre-allocate Mid MT registry at init instead of lazy allocation **Deliverables**: 1. Modify `hakmem_mid_mt.c` init to pre-allocate registry 2. Remove registry lock from hot path 3. Benchmark MT workload improvement 4. Completion report --- ### Step 5: Documentation & Final Benchmark (Pending) - **Duration**: 2 days - **Risk**: Low - **Goal**: Document Phase 5 results, prepare for Phase 6 **Deliverables**: 1. Phase 5 completion report 2. Full benchmark suite comparison 3. Update CURRENT_TASK.md for Phase 6 4. Git commit & documentation --- ## Phase 5 Success Criteria **bench_random_mixed (ws=256)**: - Phase 4 result: 57.2M ops/s (Hot/Cold Box, no PGO) - Phase 5.1 (Gap fix): 60-65M ops/s (+5-15%) - Phase 5.2 (Config Box): 62-68M ops/s (+2-4% cumulative) - Phase 5.3 (Registry): 63-70M ops/s (MT improvement) - **Phase 5 target**: **63-72M ops/s** ✓ (+10-26% cumulative) **Allocation Gap Impact**: - 1KB-8KB allocations: mmap() → Mid MT (1000-5000x faster) --- ## Current Status: Phase 5 Ready to Start **Phase 4 Complete** ✅: - Step 1: PGO Workflow Box (+6.25%) - Step 2: Hot/Cold Path Box (+7.3%) - Step 3: Front Config Box (+2.7-4.9%) - **Result**: 53.3M → 57.2M ops/s (+7.3%, without PGO) **Phase 5 Next Actions**: 1. **Step 1**: Verify Mid MT for 1KB range (2 days) 2. **Step 2**: Eliminate allocation gap (3 days) 3. **Step 3**: Apply Config Box pattern (3 days) 4. **Step 4**: Pre-allocate Mid registry (2 days) 5. **Step 5**: Documentation & benchmarks (2 days) **Total Duration**: 12 days / 2 weeks --- --- # Previous: Phase 4 - Tiny Front Optimization ✅ COMPLETE **Date**: 2025-11-29 **Goal**: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s) **Strategy**: Box化 + PGO + Hot/Cold separation **Result**: 53.3M → 57.2M ops/s (+7.3%, without PGO) --- ## Phase 4 Overview: 3-Step Approach ### Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%) - **Duration**: ~~1-2 days~~ **Completed: 2025-11-29** - **Risk**: Low - **Target**: 56.8M → 60-62M ops/s - **Actual**: **57.0M → 60.6M ops/s (+6.25%)** ✓ **Deliverables**: 1. ✅ `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation 2. ✅ `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration 3. ✅ Makefile targets: `pgo-tiny-profile`, `pgo-tiny-collect`, `pgo-tiny-build`, `pgo-tiny-full` 4. ✅ Makefile help target updated with PGO instructions 5. ✅ Benchmark comparison (before/after PGO) 6. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md` --- ### Step 2: Hot/Cold Path Box ✅ COMPLETE (+7.3%) - **Duration**: ~~3-5 days~~ **Completed: 2025-11-29** - **Risk**: Medium - **Target**: 60-62M → 68-75M ops/s (cumulative +15-25%) - **Actual**: **53.3M → 57.2M ops/s (+7.3%, without PGO)** ✓ **Deliverables**: 1. ✅ `core/box/tiny_front_hot_box.h` - Ultra-fast path (1 branch, range check removed) 2. ✅ `core/box/tiny_front_cold_box.h` - Slow path (noinline, cold) 3. ✅ Refactored `malloc_tiny_fast()` to use Hot/Cold boxes 4. ⏸️ PGO re-optimization (temporarily disabled due to build issues) 5. ✅ Completion report: `PHASE4_STEP2_COMPLETE.md` **Note**: PGO temporarily disabled (build issues). Expected +13-15% with PGO re-enabled. --- ### Step 3: Front Config Box ✅ COMPLETE (+2.7-4.9%) - **Duration**: ~~2-3 days~~ **Completed: 2025-11-29** - **Risk**: Low - **Target**: 68-75M → 73-83M ops/s (cumulative +20-33%) - **Actual**: **50.3M → 52.8M ops/s (+2.7-4.9%, limited scope)** ✓ **Deliverables**: 1. ✅ `core/box/tiny_front_config_box.h` - Compile-time config management 2. ✅ Replace runtime checks with `TINY_FRONT_*_ENABLED` macros (2 call sites) 3. ✅ Build flag: `HAKMEM_TINY_FRONT_PGO=1` 4. ⏸️ Final PGO optimization (PGO still disabled due to build issues) 5. ✅ Completion report: `PHASE4_STEP3_COMPLETE.md` **Note**: Achieved +2.7-4.9% (below +5-8% target) due to limited scope (1 function, 2 call sites). Full target achievable by expanding to all config functions (6+ remaining). --- ## Success Criteria **bench_random_mixed (ws=256)**: - Phase 3 baseline: 56.8M ops/s - Phase 4.1 (PGO): 60-62M ops/s - Phase 4.2 (Hot/Cold): 68-75M ops/s - Phase 4.3 (Config): **73-83M ops/s** ✓ (vs mimalloc 107M = 68-77%) **bench_tiny_hot (64B)**: - Phase 3 baseline: 81.0M ops/s - Phase 4.3 target: **100-115M ops/s** ✓ (vs system 156M = 64-74%) --- ## Current Status: All 3 Steps Complete ✅ → Next: PGO Fix or Expand Config Box **Completed (Step 1)**: 1. ✅ PGO Profile Collection Box implemented (+6.25% improvement with PGO) 2. ✅ Makefile workflow automation (`make pgo-tiny-full`) 3. ✅ Help target updated for discoverability 4. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md` **Completed (Step 2)**: 1. ✅ Tiny Front Hot Path Box (1 branch, range check removed) 2. ✅ Tiny Front Cold Path Box (noinline, cold attributes) 3. ✅ Refactored `malloc_tiny_fast()` with Hot/Cold separation 4. ✅ Benchmark: **+7.3% improvement** (53.3 → 57.2 M ops/s, without PGO) 5. ✅ Completion report: `PHASE4_STEP2_COMPLETE.md` **Completed (Step 3)**: 1. ✅ Front Config Box (compile-time config, dead code elimination) 2. ✅ Build flag: `HAKMEM_TINY_FRONT_PGO=1` 3. ✅ Config macros: `TINY_FRONT_*_ENABLED` (2 call sites updated) 4. ✅ Benchmark: **+2.7-4.9% improvement** (50.3 → 52.8 M ops/s) 5. ✅ Completion report: `PHASE4_STEP3_COMPLETE.md` **Next Actions (Choose One)**: - **Option A: Expand Config Box** - Replace 6+ remaining config functions (+2-3% more expected) - **Option B: Fix PGO** - Resolve build issues, re-enable PGO workflow (+6% expected from Step 1) - **Option C: Mark Phase 4 Complete** - Move to next phase or final optimization **Design Reference**: `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` (already complete) --- ## Notes from ChatGPT Analysis **Real bottleneck**: - NOT front_gate_v2 alone - BUT `tiny_alloc_fast()` overall complexity (15-20 branches) **Branch explosion sources**: 1. ultra_slim_mode_enabled() gate 2. hak_tiny_size_to_class range check 3. tiny_sizeclass_hist_hit (profile) 4. HeapV2 enabled/disabled 5. FastCache enabled/disabled 6. SFC enabled/disabled + hit/miss 7. TLS SLL enabled/disabled + per-class branches 8. Multiple env gates in refill path **Pool/Tiny boundary**: Negligible overhead (0.1-0.2% in bench) **memset/page fault**: Already optimized (TRUST_MMAP_ZERO=1) --- Updated: 2025-11-29 Phase: 4 (Tiny Front Optimization) Previous: Phase 3 (mincore removal, +10.7%)