# Current Task: Phase 4 - Tiny Front Optimization

**Date**: 2025-11-29
**Goal**: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s)
**Strategy**: Box-ization + PGO + Hot/Cold separation

---

## Phase 4 Overview: 3-Step Approach

### Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)

- **Duration**: ~~1-2 days~~ **Completed: 2025-11-29**
- **Risk**: Low
- **Target**: 56.8M → 60-62M ops/s
- **Actual**: **57.0M → 60.6M ops/s (+6.25%)** ✓

**Deliverables**:

1. ✅ `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
2. ✅ `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
3. ✅ Makefile targets: `pgo-tiny-profile`, `pgo-tiny-collect`, `pgo-tiny-build`, `pgo-tiny-full`
4. ✅ Makefile help target updated with PGO instructions
5. ✅ Benchmark comparison (before/after PGO)
6. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md`

---

### Step 2: Hot/Cold Path Box (Expected: +10-15%)

- **Duration**: 3-5 days
- **Risk**: Medium
- **Target**: 60-62M → 68-75M ops/s (cumulative +15-25%)

**Deliverables**:

1. `core/box/tiny_front_hot_box.h` - Ultra-fast path (5-7 branches max)
2. `core/box/tiny_front_cold_box.h` - Slow path (noinline, cold)
3. Refactor `tiny_alloc_fast()` to use Hot/Cold boxes
4. PGO re-optimization with new structure

---

### Step 3: Front Config Box (Expected: +5-8%)

- **Duration**: 2-3 days
- **Risk**: Low
- **Target**: 68-75M → 73-83M ops/s (cumulative +20-33%)

**Deliverables**:

1. `core/box/tiny_front_config_box.h` - Compile-time config management
2. Replace runtime checks with `TINY_FRONT_*_ENABLED` macros
3. Build flag: `HAKMEM_TINY_FRONT_PGO=1`
4. Final PGO optimization + full benchmark suite

---

## Success Criteria

**bench_random_mixed (ws=256)**:

- Phase 3 baseline: 56.8M ops/s
- Phase 4.1 (PGO): 60-62M ops/s
- Phase 4.2 (Hot/Cold): 68-75M ops/s
- Phase 4.3 (Config): **73-83M ops/s** ✓ (vs mimalloc 107M = 68-77%)

**bench_tiny_hot (64B)**:

- Phase 3 baseline: 81.0M ops/s
- Phase 4.3 target: **100-115M ops/s** ✓ (vs system 156M = 64-74%)

---

## Current Status: Step 1 Complete ✅ → Ready for Step 2

**Completed**:

1. ✅ PGO Profile Collection Box implemented (+6.25% improvement)
2. ✅ Makefile workflow automation (`make pgo-tiny-full`)
3. ✅ Help target updated for discoverability
4. ✅ Completion report written

**Next Actions (Step 2)**:

1. Implement Tiny Front Hot Path Box (5-7 branches max)
2. Implement Tiny Front Cold Path Box (noinline, cold)
3. Refactor `tiny_alloc_fast()` to use Hot/Cold separation
4. Re-run PGO optimization with new structure
5. Benchmark: Target 68-75M ops/s (+10-15% over Step 1)

**Design Reference**: `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` (already complete)

---

## Notes from ChatGPT Analysis

**Real bottleneck**:

- NOT `front_gate_v2` alone
- BUT the overall complexity of `tiny_alloc_fast()` (15-20 branches)

**Branch explosion sources**:

1. `ultra_slim_mode_enabled()` gate
2. `hak_tiny_size_to_class` range check
3. `tiny_sizeclass_hist_hit` (profile)
4. HeapV2 enabled/disabled
5. FastCache enabled/disabled
6. SFC enabled/disabled + hit/miss
7. TLS SLL enabled/disabled + per-class branches
8. Multiple env gates in refill path

**Pool/Tiny boundary**: Negligible overhead (0.1-0.2% in bench)

**memset/page fault**: Already optimized (`TRUST_MMAP_ZERO=1`)

---

Updated: 2025-11-29
Phase: 4 (Tiny Front Optimization)
Previous: Phase 3 (mincore removal, +10.7%)
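
---

## Appendix: Hot/Cold Split Sketch (Illustrative)

The Hot/Cold separation planned for Step 2 and the compile-time gating planned for Step 3 can be illustrated with a minimal sketch. This is not the actual `tiny_alloc_fast()` code. Every `sketch_*` / `SKETCH_*` name is hypothetical, the 8-byte size-class layout and refill count are assumptions, and the attributes and `__builtin_expect` assume GCC or Clang. The idea is that the hot function keeps a small, fixed number of branches and stays inlinable, while the refill logic lives in a `noinline, cold` function so the compiler places it away from the fast path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical compile-time gate, in the spirit of TINY_FRONT_*_ENABLED:
 * the check is resolved by the preprocessor, not by a runtime branch. */
#ifndef SKETCH_TLS_CACHE_ENABLED
#define SKETCH_TLS_CACHE_ENABLED 1
#endif

#define SKETCH_NUM_CLASSES  8   /* assumed: 8..64 bytes in 8-byte steps */
#define SKETCH_MAX_TINY     64
#define SKETCH_REFILL_COUNT 16  /* assumed blocks carved per refill */

typedef struct sketch_node { struct sketch_node *next; } sketch_node;

/* Per-thread singly linked free list, one head per size class. */
static _Thread_local sketch_node *sketch_tls_head[SKETCH_NUM_CLASSES];

/* Maps a size in 1..64 to a class index in 0..7. */
static inline size_t sketch_size_to_class(size_t size) {
    return (size + 7) / 8 - 1;
}

/* Cold path: rarely taken, kept out of line. Carves a slab into blocks,
 * returns the first block and caches the rest on the thread-local list. */
__attribute__((noinline, cold))
static void *sketch_alloc_cold(size_t cls) {
    size_t block = (cls + 1) * 8;
    char *slab = malloc(block * SKETCH_REFILL_COUNT);
    if (!slab) return NULL;
#if SKETCH_TLS_CACHE_ENABLED
    for (size_t i = 1; i < SKETCH_REFILL_COUNT; i++) {
        sketch_node *n = (sketch_node *)(slab + i * block);
        n->next = sketch_tls_head[cls];
        sketch_tls_head[cls] = n;
    }
#endif
    return slab;
}

/* Hot path: one range check and one cache-hit check, then fall back. */
static inline void *sketch_alloc_hot(size_t size) {
    if (__builtin_expect(size == 0 || size > SKETCH_MAX_TINY, 0))
        return NULL; /* a real allocator would route to another pool */
    size_t cls = sketch_size_to_class(size);
#if SKETCH_TLS_CACHE_ENABLED
    sketch_node *n = sketch_tls_head[cls];
    if (__builtin_expect(n != NULL, 1)) {
        sketch_tls_head[cls] = n->next; /* pop from the free list */
        return n;
    }
#endif
    return sketch_alloc_cold(cls);
}

/* Returns a block to the thread-local list of its size class. */
static inline void sketch_free_hot(void *p, size_t size) {
    assert(p != NULL && size >= 1 && size <= SKETCH_MAX_TINY);
    size_t cls = sketch_size_to_class(size);
    sketch_node *n = (sketch_node *)p;
    n->next = sketch_tls_head[cls];
    sketch_tls_head[cls] = n;
}
```

In this sketch, a freed block is the next one handed out for the same size class, and the cold function runs only when a class list is empty. Slabs are never returned to the system and, with `SKETCH_TLS_CACHE_ENABLED=0`, each allocation takes the cold path, so it is a shape illustration only, not a usable allocator.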