Phase 4-Step1: PGO Workflow - COMPLETE ✓
Date: 2025-11-29
Status: ✅ Complete
Performance Gain: +6.25% (57.0 → 60.6 M ops/s)
Summary
Phase 4-Step1 implemented a fully automated PGO (Profile-Guided Optimization) workflow for the HAKMEM Tiny Front using the Box pattern. The implementation achieved a +6.25% performance improvement (within the expected +5-10% range) with zero source-code changes: the gain comes entirely from compiler optimization.
Implementation
Box 1: PGO Profile Collection Box
Purpose: Automated, reproducible profile data collection
Contract: Execute representative workloads → Generate .gcda files
Components:
- scripts/box/pgo_tiny_profile_config.sh - Workload configuration
- scripts/box/pgo_tiny_profile_box.sh - Profile collection automation
- Makefile PGO targets - Workflow orchestration
Design Principles:
- ✅ Deterministic: Fixed seeds (42) for reproducibility
- ✅ Representative: 5 workloads covering diverse allocation patterns
- ✅ Automated: Single command (make pgo-tiny-full) for complete workflow
- ✅ Safe: Validation checks, error detection, timeout protection
- ✅ Observable: Clear progress reporting, .gcda file verification
Workload Design
The PGO profile collection uses 5 representative workloads to capture diverse allocation patterns:
| Workload | Purpose | Key Characteristics |
|---|---|---|
| bench_random_mixed 5M 256 42 | Common case | Medium working set, balanced cache pressure |
| bench_random_mixed 5M 128 42 | Hot path bias | Smaller working set, higher TLS cache hit rate |
| bench_random_mixed 5M 512 42 | Cold path bias | Larger working set, more SuperSlab allocations |
| bench_tiny_hot 16 100 60000 | Class 0 intensive | Smallest size class (16B) |
| bench_tiny_hot 64 100 60000 | Class 3 intensive | Common small object size (64B) |
Coverage: The workloads exercise:
- Hot TLS SLL pop path (high-frequency allocations)
- Cold refill path (SuperSlab allocations)
- Multiple size classes (0, 3, and mixed)
- Varied cache pressure scenarios
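As an illustration of how the five workloads above could be captured in a configuration script, here is a minimal sketch. It is not the contents of pgo_tiny_profile_config.sh: the function names `pgo_workloads` and `pgo_print_commands` and the `BIN_DIR` variable are assumptions made for this example, and the arguments are copied verbatim from the table (including the "5M" shorthand, which the real script may spell differently).

```shell
# Hypothetical sketch; the real pgo_tiny_profile_config.sh may be organized differently.

# Print one workload per line: benchmark binary followed by its arguments.
# Arguments are copied verbatim from the workload table ("5M" included).
pgo_workloads() {
  cat <<'EOF'
bench_random_mixed 5M 256 42
bench_random_mixed 5M 128 42
bench_random_mixed 5M 512 42
bench_tiny_hot 16 100 60000
bench_tiny_hot 64 100 60000
EOF
}

# Print the command line for each workload without executing it.
# BIN_DIR is an assumed variable naming where the instrumented binaries live.
pgo_print_commands() {
  pgo_workloads | while read -r bin args; do
    echo "${BIN_DIR:-.}/$bin $args"
  done
}
```

Keeping the list in one place means the collection script only has to iterate over it, so adding or removing a workload does not touch the execution logic.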
Makefile Targets
# Step 1: Build instrumented binaries (-fprofile-generate)
make pgo-tiny-profile
# Step 2: Collect profile data (run workloads → .gcda files)
make pgo-tiny-collect
# Step 3: Build optimized binaries (-fprofile-use)
make pgo-tiny-build
# Full workflow: profile → collect → build → test
make pgo-tiny-full
Default Goal: The Makefile help target now includes PGO instructions (lines 18-23)
Performance Results
Baseline (No PGO)
Run 1: 57.04 M ops/s
Run 2: 57.14 M ops/s
Run 3: 56.95 M ops/s
Average: 57.04 M ops/s
PGO-Optimized
Run 1: 60.49 M ops/s
Run 2: 60.68 M ops/s
Run 3: 60.66 M ops/s
Average: 60.61 M ops/s
Improvement
Absolute: +3.57 M ops/s
Relative: +6.25%
Expected: +5-10% ✓
Verification: Latest test (after Makefile fix) confirmed 60.75 M ops/s - consistent with expected performance.
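The averages and the relative gain can be recomputed from the six run values listed above. The following is a small sketch using awk; the helper names `mean` and `pct_gain` are made up for this example and are not part of the project's scripts.

```shell
# Hypothetical helpers for re-deriving the reported numbers.

# Print the arithmetic mean of the arguments, rounded to two decimals.
mean() {
  printf '%s\n' "$@" | awk '{ s += $1; n++ } END { if (n) printf "%.2f\n", s / n }'
}

# Print the relative gain in percent between two sets of runs.
# $1: space-separated baseline runs, $2: space-separated optimized runs.
# The gain is computed from the unrounded means.
pct_gain() {
  awk -v b="$1" -v o="$2" 'BEGIN {
    nb = split(b, B, " "); no = split(o, O, " ")
    if (nb == 0 || no == 0) exit 1
    for (i = 1; i <= nb; i++) sb += B[i]
    for (i = 1; i <= no; i++) so += O[i]
    mb = sb / nb; mo = so / no
    printf "%.2f\n", (mo - mb) / mb * 100
  }'
}

# Usage with the run values from this report:
#   mean 57.04 57.14 56.95
#   mean 60.49 60.68 60.66
#   pct_gain "57.04 57.14 56.95" "60.49 60.68 60.66"
```

Computing the percentage from the unrounded means matters here: dividing the rounded figures (3.57 / 57.04) gives a slightly higher value than the +6.25% reported above.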
Technical Details
Profile Data Collection
The pgo_tiny_profile_box.sh script implements a robust collection workflow:
1. Binary Validation
   - Checks binaries exist and are executable
   - Auto-fixes permissions if needed
2. Profile Cleanup
   - Removes old .gcda files to prevent stale data
   - Reports cleanup statistics
3. Workload Execution
   - Runs each workload with 30s timeout
   - Detects timeouts and failures
   - Fails fast on errors
4. Profile Verification
   - Confirms .gcda files were generated
   - Reports profile file count and locations
   - Detects missing -fprofile-generate flag
Output: 33 .gcda files (confirmed in latest run)
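The cleanup, execution, and verification steps above can be sketched roughly as follows. This is not the actual pgo_tiny_profile_box.sh: the function names are invented for illustration, and the sketch assumes the GNU coreutils `timeout` command is available (it exits with status 124 when the time limit expires).

```shell
# Hypothetical sketch of the collection steps; names are illustrative only.

# Count .gcda profile files under a directory (default: current directory).
count_gcda() {
  find "${1:-.}" -type f -name '*.gcda' | wc -l | tr -d ' '
}

# Remove stale profile data and report how many files were deleted.
clean_gcda() {
  gcda_dir=${1:-.}
  gcda_n=$(count_gcda "$gcda_dir")
  find "$gcda_dir" -type f -name '*.gcda' -exec rm -f {} +
  echo "removed $gcda_n stale .gcda file(s)"
}

# Run one workload under a time limit (seconds), reporting timeouts and failures.
# Returns the workload's exit status so the caller can fail fast.
run_workload() {
  wl_limit=$1; shift
  wl_rc=0
  timeout "$wl_limit" "$@" || wl_rc=$?
  if [ "$wl_rc" -eq 124 ]; then
    echo "TIMEOUT after ${wl_limit}s: $*" >&2
  elif [ "$wl_rc" -ne 0 ]; then
    echo "FAILED (exit $wl_rc): $*" >&2
  fi
  return "$wl_rc"
}

# Fail if no profile data was produced, which usually points at a build
# that did not use -fprofile-generate.
verify_profile() {
  vp_n=$(count_gcda "${1:-.}")
  if [ "$vp_n" -eq 0 ]; then
    echo "no .gcda files found; was the binary built with -fprofile-generate?" >&2
    return 1
  fi
  echo "found $vp_n .gcda file(s)"
}
```

A caller would typically chain these as `clean_gcda . && run_workload 30 ./some_benchmark args && verify_profile .`, stopping at the first failing step.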
Compiler Flags
# Profile Generation (Step 1)
PROFILE_GEN_FLAGS = -fprofile-generate -flto
# Profile Use (Step 3)
PROFILE_USE_FLAGS = -fprofile-use -flto
LTO: Link-Time Optimization is enabled for both phases to maximize PGO effectiveness.
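To make the two-phase use of these flags concrete, here is a minimal sketch that selects the flag set for a given phase. The `pgo_flags` function and the file names in the usage comments are hypothetical; only the flag values come from the Makefile variables above.

```shell
# Hypothetical helper mirroring PROFILE_GEN_FLAGS and PROFILE_USE_FLAGS.
pgo_flags() {
  case "$1" in
    generate) echo "-fprofile-generate -flto" ;;  # Step 1: instrumented build
    use)      echo "-fprofile-use -flto" ;;       # Step 3: optimized build
    *)        echo "unknown PGO phase: $1" >&2; return 1 ;;
  esac
}

# Manual equivalent of the workflow (bench.c and the output names are placeholders):
#   cc -O2 $(pgo_flags generate) -o bench_instrumented bench.c
#   ./bench_instrumented          # running it writes the .gcda files
#   cc -O2 $(pgo_flags use) -o bench_optimized bench.c
```

The second compile must see the .gcda files produced by the instrumented run, which is why the collection step has to sit between the two builds.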
Workflow Fix (2025-11-29)
Issue: Initial implementation had pgo-tiny-build calling the profile collection script, causing:
- Duplicate script execution
- Unclear separation of concerns
- Skipped pgo-tiny-collect in the dependency chain
Fix: Cleaned up the workflow:
# Before (broken):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build # Missing collect!
# After (correct):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
Result: Each target now has a single responsibility:
- pgo-tiny-profile: Build only
- pgo-tiny-collect: Collect only
- pgo-tiny-build: Build only
- pgo-tiny-full: Orchestrate all steps
Help Target Update
The Makefile help target (lines 8-37) now includes:
Benchmarking (PGO-optimized, +6% faster):
make pgo-tiny-full - Full PGO workflow (~5-10 min)
= Profile + Optimize + Test
make pgo-tiny-profile - Step 1: Build profile binaries
make pgo-tiny-collect - Step 2: Collect profile data
make pgo-tiny-build - Step 3: Build optimized
Phase 4 Performance:
Baseline: 57.0 M ops/s
PGO-optimized: 60.6 M ops/s (+6.25%)
TIP: For best performance, use 'make pgo-tiny-full'
This ensures developers won't forget how to use PGO builds.
Artifacts
New Files
- scripts/box/pgo_tiny_profile_config.sh - Workload definitions
- scripts/box/pgo_tiny_profile_box.sh - Collection automation
- PHASE4_STEP1_COMPLETE.md - This completion report
Modified Files
- Makefile (lines 8-37) - Help target with PGO instructions
- Makefile (lines 1305-1356) - PGO workflow targets
Documentation
- CURRENT_TASK.md - Phase 4 roadmap
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Box design
Box Pattern Compliance
- ✅ Single Responsibility: Profile collection is a separate Box
- ✅ Clear Contract: Workloads → .gcda files → Optimized binaries
- ✅ Observable: Progress reporting, error detection, summary statistics
- ✅ Safe: Validation, timeouts, fail-fast on errors
- ✅ Testable: Deterministic seeds for reproducibility
Next Steps
Phase 4-Step2: Hot/Cold Path Box
- Target: +10-15% improvement (60.6 → 70.0 M ops/s)
- Approach: Separate hot (inline, likely) and cold (noinline, unlikely) paths
- Design: Already specified in PHASE4_TINY_FRONT_BOX_DESIGN.md
Phase 4-Step3: Front Config Box
- Target: +5-8% improvement (70.0 → 76.0 M ops/s)
- Approach: Compile-time config optimization
- Design: Already specified in design doc
Overall Phase 4 Target: 73-83 M ops/s (vs current 60.6 M ops/s)
Lessons Learned
- PGO is high ROI: +6.25% with zero code changes, ~30 minutes of work
- Representative workloads matter: 5 diverse workloads > 1 simple workload
- Automation is critical: Manual PGO workflows are error-prone
- Box pattern scales: Profile collection fits the Box pattern naturally
- Help targets prevent forgetting: Make workflows discoverable
Conclusion
Phase 4-Step1 successfully implemented PGO using the Box pattern, achieving a +6.25% performance improvement (57.0 → 60.6 M ops/s) with:
- ✅ Fully automated workflow (make pgo-tiny-full)
- ✅ Reproducible results (deterministic seeds)
- ✅ Clear documentation (help target, design doc)
- ✅ Robust error handling (validation, timeouts)
- ✅ Within expected range (+5-10%)
Status: Ready to proceed to Phase 4-Step2 (Hot/Cold Path Box)
Signed: Claude (2025-11-29)
Commit: TBD (pending git commit)