Implemented automated Profile-Guided Optimization workflow using the Box pattern:

Performance Improvement:
- Baseline: 57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build: Build optimized binaries
   - pgo-tiny-full: Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Boxification: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Current Task: Phase 4 - Tiny Front Optimization
Date: 2025-11-29
Goal: 2x improvement in Tiny allocation throughput (56.8M → 110M+ ops/s)
Strategy: Boxification + PGO + Hot/Cold separation
Phase 4 Overview: 3-Step Approach
Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)
- Duration: 1-2 days (completed 2025-11-29)
- Risk: Low
- Target: 56.8M → 60-62M ops/s
- Actual: 57.0M → 60.6M ops/s (+6.25%) ✓
Deliverables:
- ✅ scripts/box/pgo_tiny_profile_box.sh - Profile collection automation
- ✅ scripts/box/pgo_tiny_profile_config.sh - Workload configuration
- ✅ Makefile targets: pgo-tiny-profile, pgo-tiny-collect, pgo-tiny-build, pgo-tiny-full
- ✅ Makefile help target updated with PGO instructions
- ✅ Benchmark comparison (before/after PGO)
- ✅ Completion report: PHASE4_STEP1_COMPLETE.md
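The profile workloads are described as deterministic (fixed seed 42) random-mixed runs over a working set of 128/256/512 slots. The following is a minimal sketch of what such a workload loop could look like, not the actual benchmark source. The function names, the xorshift generator, and the 16-128 byte size range are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Small self-contained PRNG so the sequence does not depend on the libc rand(). */
static uint32_t sketch_xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical random-mixed workload: each step picks a slot; an occupied
 * slot is freed, an empty slot receives a new allocation of 16..128 bytes.
 * Returns the number of successful allocations. */
static uint64_t sketch_random_mixed(uint32_t seed, size_t slots, uint64_t ops) {
    assert(slots > 0);
    void **ws = calloc(slots, sizeof *ws);
    if (ws == NULL) return 0;

    uint32_t state = seed ? seed : 1;   /* xorshift must not start at 0 */
    uint64_t allocs = 0;
    for (uint64_t i = 0; i < ops; i++) {
        uint32_t r = sketch_xorshift32(&state);
        size_t idx = r % slots;
        if (ws[idx] != NULL) {
            free(ws[idx]);
            ws[idx] = NULL;
        } else {
            size_t size = 16 + ((r >> 16) % 113);   /* assumed size range */
            ws[idx] = malloc(size);
            if (ws[idx] != NULL) allocs++;
        }
    }
    for (size_t i = 0; i < slots; i++) free(ws[i]);   /* drain the working set */
    free(ws);
    return allocs;
}
```

Because the seed is fixed, two runs with the same arguments follow the same alloc/free sequence, which is what makes the collected profile reproducible.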
Step 2: Hot/Cold Path Box (Expected: +10-15%)
- Duration: 3-5 days
- Risk: Medium
- Target: 60-62M → 68-75M ops/s (cumulative +15-25%)
Deliverables:
- core/box/tiny_front_hot_box.h - Ultra-fast path (5-7 branches max)
- core/box/tiny_front_cold_box.h - Slow path (noinline, cold)
- Refactor tiny_alloc_fast() to use Hot/Cold boxes
- PGO re-optimization with new structure
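As an illustration of the Hot/Cold split planned for this step, here is a minimal sketch using GCC/Clang attributes. It is not the contents of tiny_front_hot_box.h or tiny_front_cold_box.h: every sketch_ name, the 16-byte class spacing, and the malloc-backed refill are assumptions chosen to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

enum { SKETCH_NUM_CLASSES = 8, SKETCH_MAX_TINY = 128, SKETCH_REFILL_COUNT = 16 };

typedef struct sketch_node { struct sketch_node *next; } sketch_node;

/* Per-thread free list per size class (hypothetical layout). */
static _Thread_local sketch_node *sketch_tls_head[SKETCH_NUM_CLASSES];

/* Maps 1..128 bytes to classes 0..7 in 16-byte steps; -1 means "not tiny". */
static inline int sketch_size_to_class(size_t size) {
    if (size == 0 || size > SKETCH_MAX_TINY) return -1;
    return (int)((size - 1) / 16);
}

/* Cold path: kept out of line so the hot path stays small. Carves a chunk
 * into blocks, pushes all but the first onto the free list, and returns the
 * first. The chunk is never released in this sketch. */
__attribute__((noinline, cold))
static void *sketch_alloc_cold(int cls) {
    assert(cls >= 0 && cls < SKETCH_NUM_CLASSES);
    size_t block = (size_t)(cls + 1) * 16;
    char *chunk = malloc(block * SKETCH_REFILL_COUNT);
    if (chunk == NULL) return NULL;
    for (int i = 1; i < SKETCH_REFILL_COUNT; i++) {
        sketch_node *n = (sketch_node *)(chunk + (size_t)i * block);
        n->next = sketch_tls_head[cls];
        sketch_tls_head[cls] = n;
    }
    return chunk;
}

/* Hot path: one range check, one list-empty check, then a pop. */
static inline void *sketch_alloc_hot(size_t size) {
    int cls = sketch_size_to_class(size);
    if (SKETCH_UNLIKELY(cls < 0)) return NULL;   /* caller would fall back */
    sketch_node *n = sketch_tls_head[cls];
    if (SKETCH_LIKELY(n != NULL)) {
        sketch_tls_head[cls] = n->next;
        return n;
    }
    return sketch_alloc_cold(cls);
}

/* Returns a block to its class free list. */
static inline void sketch_free(void *p, size_t size) {
    int cls = sketch_size_to_class(size);
    if (p == NULL || cls < 0) return;
    sketch_node *n = p;
    n->next = sketch_tls_head[cls];
    sketch_tls_head[cls] = n;
}
```

The point of the split is that the compiler can inline the hot function at call sites while the refill logic, with its extra branches, lives in a separate cold section.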
Step 3: Front Config Box (Expected: +5-8%)
- Duration: 2-3 days
- Risk: Low
- Target: 68-75M → 73-83M ops/s (cumulative +20-33%)
Deliverables:
- core/box/tiny_front_config_box.h - Compile-time config management
- Replace runtime checks with TINY_FRONT_*_ENABLED macros
- Build flag: HAKMEM_TINY_FRONT_PGO=1
- Final PGO optimization + full benchmark suite
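One way the runtime-check replacement could work is sketched below. Only the HAKMEM_TINY_FRONT_PGO flag and the TINY_FRONT_*_ENABLED naming pattern come from the plan above. The two specific macros, the environment variable names, the fixed values, and the sketch_ helpers are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build flag from the plan; defaults to the runtime-configurable build. */
#ifndef HAKMEM_TINY_FRONT_PGO
#define HAKMEM_TINY_FRONT_PGO 0
#endif
static_assert(HAKMEM_TINY_FRONT_PGO == 0 || HAKMEM_TINY_FRONT_PGO == 1,
              "HAKMEM_TINY_FRONT_PGO must be 0 or 1");

/* Hypothetical runtime gate: true only when the variable is exactly "1". */
static int sketch_env_flag(const char *name) {
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

#if HAKMEM_TINY_FRONT_PGO
/* Fixed build: the macros are constants, so disabled branches can be
 * removed at compile time. The values here are illustrative only. */
#define TINY_FRONT_FASTCACHE_ENABLED 1
#define TINY_FRONT_HEAPV2_ENABLED    0
#else
/* Default build: the same macros expand to runtime environment checks. */
#define TINY_FRONT_FASTCACHE_ENABLED sketch_env_flag("SKETCH_TINY_FASTCACHE")
#define TINY_FRONT_HEAPV2_ENABLED    sketch_env_flag("SKETCH_TINY_HEAPV2")
#endif

enum sketch_front {
    SKETCH_FRONT_DEFAULT = 0,
    SKETCH_FRONT_FASTCACHE = 1,
    SKETCH_FRONT_HEAPV2 = 2
};

/* Call sites are written once against the macros and work in both builds. */
static enum sketch_front sketch_pick_front(void) {
    if (TINY_FRONT_FASTCACHE_ENABLED) return SKETCH_FRONT_FASTCACHE;
    if (TINY_FRONT_HEAPV2_ENABLED) return SKETCH_FRONT_HEAPV2;
    return SKETCH_FRONT_DEFAULT;
}
```

Compiling with -DHAKMEM_TINY_FRONT_PGO=1 turns both conditions into constants, which is how a config box of this kind would shrink the branch count on the allocation path.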
Success Criteria
bench_random_mixed (ws=256):
- Phase 3 baseline: 56.8M ops/s
- Phase 4.1 (PGO): 60-62M ops/s
- Phase 4.2 (Hot/Cold): 68-75M ops/s
- Phase 4.3 (Config): 73-83M ops/s ✓ (vs mimalloc 107M = 68-77%)
bench_tiny_hot (64B):
- Phase 3 baseline: 81.0M ops/s
- Phase 4.3 target: 100-115M ops/s ✓ (vs system 156M = 64-74%)
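The "vs mimalloc" and "vs system" percentages are the target throughput divided by the reference allocator's throughput. A small helper (hypothetical name, for illustration only) makes the arithmetic explicit:

```c
#include <assert.h>

/* Throughput expressed as a percentage of a reference allocator's throughput.
 * Both arguments must use the same unit (for example, M ops/s). */
static double sketch_pct_of_reference(double ops, double reference_ops) {
    assert(reference_ops > 0.0);
    return 100.0 * ops / reference_ops;
}
```

For example, sketch_pct_of_reference(73.0, 107.0) is about 68.2 and sketch_pct_of_reference(83.0, 107.0) is about 77.6, which matches the 68-77% range quoted against mimalloc.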
Current Status: Step 1 Complete ✅ → Ready for Step 2
Completed:
- ✅ PGO Profile Collection Box implemented (+6.25% improvement)
- ✅ Makefile workflow automation (make pgo-tiny-full)
- ✅ Help target updated for discoverability
- ✅ Completion report written
Next Actions (Step 2):
- Implement Tiny Front Hot Path Box (5-7 branches max)
- Implement Tiny Front Cold Path Box (noinline, cold)
- Refactor tiny_alloc_fast() to use Hot/Cold separation
- Re-run PGO optimization with new structure
- Benchmark: Target 68-75M ops/s (+10-15% over Step 1)
Design Reference: docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md (already complete)
Notes from ChatGPT Analysis
Real bottleneck:
- NOT front_gate_v2 alone
- BUT the overall complexity of tiny_alloc_fast() (15-20 branches)
Branch explosion sources:
- ultra_slim_mode_enabled() gate
- hak_tiny_size_to_class range check
- tiny_sizeclass_hist_hit (profile)
- HeapV2 enabled/disabled
- FastCache enabled/disabled
- SFC enabled/disabled + hit/miss
- TLS SLL enabled/disabled + per-class branches
- Multiple env gates in refill path
Pool/Tiny boundary: Negligible overhead (0.1-0.2% in bench)
memset/page fault: Already optimized (TRUST_MMAP_ZERO=1)
Updated: 2025-11-29
Phase: 4 (Tiny Front Optimization)
Previous: Phase 3 (mincore removal, +10.7%)