Moe Charm (CI) b51b600e8d Phase 4-Step1: Add PGO workflow automation (+6.25% performance)

Implemented an automated Profile-Guided Optimization (PGO) workflow using the Box pattern.

Performance Improvement:
- Baseline:      57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build:   Build optimized binaries
   - pgo-tiny-full:    Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Box pattern: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths
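
A "random mixed" workload with a fixed seed might look roughly like the sketch below. It is not the project's benchmark: the function names, the xorshift32 generator, and the 8-128 byte size range are assumptions chosen for illustration. Only the working-set sizes (128/256/512 slots) and the seed (42) come from the notes above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: xorshift32 PRNG, so the sequence depends only on the seed. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/*
 * Hypothetical "random mixed" workload: on each step pick a random slot,
 * free whatever it holds, and allocate a new block of 8..128 bytes.
 * Returns a checksum of the (slot, size) sequence so two runs can be compared.
 */
static uint64_t run_random_mixed(size_t slots, size_t iterations, uint32_t seed) {
    assert(slots > 0 && seed != 0);   /* xorshift32 must not start at 0 */
    void **ws = calloc(slots, sizeof *ws);
    if (ws == NULL) return 0;

    uint32_t rng = seed;
    uint64_t checksum = 0;
    for (size_t i = 0; i < iterations; i++) {
        size_t idx  = xorshift32(&rng) % slots;
        size_t size = 8 + (xorshift32(&rng) % 121);   /* 8..128 bytes */
        free(ws[idx]);                                 /* free(NULL) is a no-op */
        ws[idx] = malloc(size);
        checksum = checksum * 31 + idx * 131 + size;
    }
    for (size_t i = 0; i < slots; i++) free(ws[i]);
    free(ws);
    return checksum;
}
```

Calling `run_random_mixed(256, 10000, 42)` twice returns the same checksum, so the instrumented binary sees the same allocation pattern on every profiling run.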

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-29 11:28:38 +09:00


Current Task: Phase 4 - Tiny Front Optimization

Date: 2025-11-29
Goal: Improve tiny allocation throughput roughly 2x (56.8M → 110M+ ops/s)
Strategy: Box pattern + PGO + Hot/Cold separation

Phase 4 Overview: 3-Step Approach

Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)

Duration: ~~1-2 days~~ Completed: 2025-11-29
Risk: Low
Target: 56.8M → 60-62M ops/s
Actual: 57.0M → 60.6M ops/s (+6.25%) ✓

Deliverables:

✅ scripts/box/pgo_tiny_profile_box.sh - Profile collection automation
✅ scripts/box/pgo_tiny_profile_config.sh - Workload configuration
✅ Makefile targets: pgo-tiny-profile, pgo-tiny-collect, pgo-tiny-build, pgo-tiny-full
✅ Makefile help target updated with PGO instructions
✅ Benchmark comparison (before/after PGO)
✅ Completion report: PHASE4_STEP1_COMPLETE.md

Step 2: Hot/Cold Path Box (Expected: +10-15%)

Duration: 3-5 days
Risk: Medium
Target: 60-62M → 68-75M ops/s (cumulative +15-25%)

Deliverables:

core/box/tiny_front_hot_box.h - Ultra-fast path (5-7 branches max)
core/box/tiny_front_cold_box.h - Slow path (noinline, cold)
Refactor tiny_alloc_fast() to use Hot/Cold boxes
PGO re-optimization with new structure
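
The hot/cold split could be sketched as follows. This is a minimal illustration, not the contents of `tiny_front_hot_box.h` or `tiny_front_cold_box.h`: every `sketch_` name, the 8-byte size classes, and the refill count are assumptions. The `noinline`/`cold` attributes and `__builtin_expect` are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_CLASSES 8   /* hypothetical: size classes 8, 16, ..., 64 bytes */
#define SKETCH_REFILL  16  /* hypothetical: blocks carved per refill */

/* Per-thread singly linked free list per size class (illustrative only). */
typedef struct sketch_node { struct sketch_node *next; } sketch_node;
static _Thread_local sketch_node *sketch_free_list[SKETCH_CLASSES];

/* Cold path: kept out of line so the hot path stays small. */
__attribute__((noinline, cold))
static void *sketch_alloc_cold(size_t cls) {
    size_t block = (cls + 1) * 8;
    /* Refill: carve one chunk into blocks and push all but the first on the
     * list. Chunks are never returned to the system in this sketch. */
    char *chunk = malloc(block * SKETCH_REFILL);
    if (chunk == NULL) return NULL;
    for (size_t i = 1; i < SKETCH_REFILL; i++) {
        sketch_node *n = (sketch_node *)(chunk + i * block);
        n->next = sketch_free_list[cls];
        sketch_free_list[cls] = n;
    }
    return chunk;
}

/* Hot path: one range check, one list pop, and a predicted-rare fallback. */
static inline void *sketch_alloc_hot(size_t size) {
    if (__builtin_expect(size == 0 || size > SKETCH_CLASSES * 8, 0))
        return NULL;                   /* outside the tiny range in this sketch */
    size_t cls = (size - 1) / 8;
    sketch_node *head = sketch_free_list[cls];
    if (__builtin_expect(head != NULL, 1)) {
        sketch_free_list[cls] = head->next;
        return head;
    }
    return sketch_alloc_cold(cls);
}
```

Keeping the refill logic in a separate `cold` function lets the compiler lay out the common case as straight-line code, which is also the shape PGO tends to reward.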

Step 3: Front Config Box (Expected: +5-8%)

Duration: 2-3 days
Risk: Low
Target: 68-75M → 73-83M ops/s (cumulative +20-33%)

Deliverables:

core/box/tiny_front_config_box.h - Compile-time config management
Replace runtime checks with TINY_FRONT_*_ENABLED macros
Build flag: HAKMEM_TINY_FRONT_PGO=1
Final PGO optimization + full benchmark suite
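
One way to replace a runtime check with a compile-time macro is sketched below. The build flag `HAKMEM_TINY_FRONT_PGO` and the `TINY_FRONT_*_ENABLED` naming pattern come from the deliverables above. The specific macro `TINY_FRONT_FASTCACHE_ENABLED`, the environment variable, the `sketch_` functions, and the assumption that the flag turns the gate into a constant are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Runtime gate: reads a hypothetical environment variable once and caches it. */
static int sketch_fastcache_runtime_enabled(void) {
    static int cached = -1;                 /* -1 = not read yet */
    if (cached < 0) {
        const char *v = getenv("SKETCH_TINY_FASTCACHE");   /* hypothetical name */
        cached = (v != NULL && strcmp(v, "0") != 0) ? 1 : 0;
    }
    return cached;
}

/*
 * Assumption: with HAKMEM_TINY_FRONT_PGO=1 the gate becomes a constant, so the
 * compiler can delete the disabled branch; otherwise it stays a runtime check.
 */
#if defined(HAKMEM_TINY_FRONT_PGO) && HAKMEM_TINY_FRONT_PGO
#  define TINY_FRONT_FASTCACHE_ENABLED 1
#else
#  define TINY_FRONT_FASTCACHE_ENABLED sketch_fastcache_runtime_enabled()
#endif

/* Returns 1 when the (stand-in) fast cache path is taken, 0 otherwise. */
static int sketch_front_path(void) {
    if (TINY_FRONT_FASTCACHE_ENABLED) {
        return 1;   /* fast cache path */
    }
    return 0;       /* fallback path */
}
```

Building with `-DHAKMEM_TINY_FRONT_PGO=1` would make `sketch_front_path()` collapse to a constant return, while a default build keeps the cached environment lookup.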

Success Criteria

bench_random_mixed (ws=256):

Phase 3 baseline: 56.8M ops/s
Phase 4.1 (PGO): 60-62M ops/s
Phase 4.2 (Hot/Cold): 68-75M ops/s
Phase 4.3 (Config): 73-83M ops/s ✓ (vs mimalloc 107M = 68-77%)

bench_tiny_hot (64B):

Phase 3 baseline: 81.0M ops/s
Phase 4.3 target: 100-115M ops/s ✓ (vs system 156M = 64-74%)
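
The "vs mimalloc" and "vs system" figures are the target throughput expressed as a percentage of the reference allocator's throughput. A small helper makes the arithmetic explicit; the function name is hypothetical, and the inputs are the numbers listed above.

```c
#include <assert.h>

/* Throughput of `ours` as a percentage of `reference` (both in M ops/s). */
static double percent_of(double ours_mops, double reference_mops) {
    assert(reference_mops > 0.0);
    return 100.0 * ours_mops / reference_mops;
}
```

For bench_random_mixed, `percent_of(73.0, 107.0)` is about 68.2 and `percent_of(83.0, 107.0)` is about 77.6. For bench_tiny_hot, `percent_of(100.0, 156.0)` is about 64.1 and `percent_of(115.0, 156.0)` is about 73.7.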

Current Status: Step 1 Complete ✅ → Ready for Step 2

Completed:

✅ PGO Profile Collection Box implemented (+6.25% improvement)
✅ Makefile workflow automation (make pgo-tiny-full)
✅ Help target updated for discoverability
✅ Completion report written

Next Actions (Step 2):

Implement Tiny Front Hot Path Box (5-7 branches max)
Implement Tiny Front Cold Path Box (noinline, cold)
Refactor tiny_alloc_fast() to use Hot/Cold separation
Re-run PGO optimization with new structure
Benchmark: Target 68-75M ops/s (+10-15% over Step 1)

Design Reference: docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md (already complete)

Notes from ChatGPT Analysis

Real bottleneck:

Not front_gate_v2 alone, but the overall complexity of tiny_alloc_fast() (15-20 branches)

Branch explosion sources:

ultra_slim_mode_enabled() gate
hak_tiny_size_to_class range check
tiny_sizeclass_hist_hit (profile)
HeapV2 enabled/disabled
FastCache enabled/disabled
SFC enabled/disabled + hit/miss
TLS SLL enabled/disabled + per-class branches
Multiple env gates in refill path

Pool/Tiny boundary: Negligible overhead (0.1-0.2% in bench)

memset/page fault: Already optimized (TRUST_MMAP_ZERO=1)

Updated: 2025-11-29
Phase: 4 (Tiny Front Optimization)
Previous: Phase 3 (mincore removal, +10.7%)
