hakmem/PHASE4_STEP1_COMPLETE.md
Moe Charm (CI) b51b600e8d Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
Implemented automated Profile-Guided Optimization workflow using Box pattern:

Performance Improvement:
- Baseline:      57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build:   Build optimized binaries
   - pgo-tiny-full:    Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Boxed: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00


Phase 4-Step1: PGO Workflow - COMPLETE ✓

Date: 2025-11-29
Status: Complete
Performance Gain: +6.25% (57.0 → 60.6 M ops/s)


Summary

Phase 4-Step1 implemented a fully automated PGO (Profile-Guided Optimization) workflow for the HAKMEM Tiny Front using the Box pattern. The implementation achieved a +6.25% performance improvement (within the expected +5-10% range) with zero source-code changes; the gain comes entirely from compiler optimization.


Implementation

Box 1: PGO Profile Collection Box

Purpose: Automated, reproducible profile data collection
Contract: Execute representative workloads → Generate .gcda files

Components:

  1. scripts/box/pgo_tiny_profile_config.sh - Workload configuration
  2. scripts/box/pgo_tiny_profile_box.sh - Profile collection automation
  3. Makefile PGO targets - Workflow orchestration

Design Principles:

  • Deterministic: Fixed seeds (42) for reproducibility
  • Representative: 5 workloads covering diverse allocation patterns
  • Automated: Single command (make pgo-tiny-full) for complete workflow
  • Safe: Validation checks, error detection, timeout protection
  • Observable: Clear progress reporting, .gcda file verification

Workload Design

The PGO profile collection uses 5 representative workloads to capture diverse allocation patterns:

| Workload | Purpose | Key Characteristics |
|---|---|---|
| bench_random_mixed 5M 256 42 | Common case | Medium working set, balanced cache pressure |
| bench_random_mixed 5M 128 42 | Hot path bias | Smaller working set, higher TLS cache hit rate |
| bench_random_mixed 5M 512 42 | Cold path bias | Larger working set, more SuperSlab allocations |
| bench_tiny_hot 16 100 60000 | Class 0 intensive | Smallest size class (16B) |
| bench_tiny_hot 64 100 60000 | Class 3 intensive | Common small object size (64B) |

Coverage: The workloads exercise:

  • Hot TLS SLL pop path (high-frequency allocations)
  • Cold refill path (SuperSlab allocations)
  • Multiple size classes (0, 3, and mixed)
  • Varied cache pressure scenarios
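The workload list above could be expressed as a small shell configuration along the following lines. This is only a sketch: the function names `pgo_workloads` and `list_workloads` are hypothetical, and the real scripts/box/pgo_tiny_profile_config.sh may be laid out differently. The command lines are copied verbatim from the table, including the "5M" shorthand, which the real script may spell out as a full integer.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a workload list; not the actual pgo_tiny_profile_config.sh.
# Each line is "<binary> <args...>", copied as written in the workload table.
pgo_workloads() {
  cat <<'EOF'
bench_random_mixed 5M 256 42
bench_random_mixed 5M 128 42
bench_random_mixed 5M 512 42
bench_tiny_hot 16 100 60000
bench_tiny_hot 64 100 60000
EOF
}

# Print one numbered line per workload so a collection script can log its plan.
list_workloads() {
  total=$(pgo_workloads | wc -l)
  i=0
  pgo_workloads | while IFS= read -r w; do
    i=$((i + 1))
    echo "[$i/$((total))] $w"
  done
}
```

Keeping the list in one place means the collection script only has to iterate over it, and adding a sixth workload is a one-line change.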

Makefile Targets

# Step 1: Build instrumented binaries (-fprofile-generate)
make pgo-tiny-profile

# Step 2: Collect profile data (run workloads → .gcda files)
make pgo-tiny-collect

# Step 3: Build optimized binaries (-fprofile-use)
make pgo-tiny-build

# Full workflow: profile → collect → build → test
make pgo-tiny-full

Default Goal: The Makefile help target now includes PGO instructions (lines 18-23)


Performance Results

Baseline (No PGO)

Run 1: 57.04 M ops/s
Run 2: 57.14 M ops/s
Run 3: 56.95 M ops/s
Average: 57.04 M ops/s

PGO-Optimized

Run 1: 60.49 M ops/s
Run 2: 60.68 M ops/s
Run 3: 60.66 M ops/s
Average: 60.61 M ops/s

Improvement

Absolute: +3.57 M ops/s
Relative: +6.25%
Expected: +5-10% ✓

Verification: A later run, taken after the Makefile workflow fix, measured 60.75 M ops/s, consistent with the PGO-optimized average above.
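The averages and the relative gain can be re-derived from the six run values listed above. The helper names `mean` and `gain_pct` are illustrative only and are not part of the HAKMEM scripts.

```shell
#!/usr/bin/env bash
# Recompute the reported averages and relative gain from the individual runs.

# Arithmetic mean of all arguments, rounded to two decimals.
mean() {
  printf '%s\n' "$@" | awk '{ s += $1 } END { printf "%.2f", s / NR }'
}

# Relative gain of $2 over $1, in percent, rounded to two decimals.
gain_pct() {
  awk -v b="$1" -v o="$2" 'BEGIN { printf "%.2f", (o - b) / b * 100 }'
}

base=$(mean 57.04 57.14 56.95)
pgo=$(mean 60.49 60.68 60.66)
echo "baseline=$base pgo=$pgo gain=$(gain_pct "$base" "$pgo")%"
# prints: baseline=57.04 pgo=60.61 gain=6.26%
```

The computed 6.26% agrees with the reported +6.25% up to rounding of the last digit.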


Technical Details

Profile Data Collection

The pgo_tiny_profile_box.sh script implements a robust collection workflow:

  1. Binary Validation

    • Checks binaries exist and are executable
    • Auto-fixes permissions if needed

  2. Profile Cleanup

    • Removes old .gcda files to prevent stale data
    • Reports cleanup statistics

  3. Workload Execution

    • Runs each workload with 30s timeout
    • Detects timeouts and failures
    • Fails fast on errors

  4. Profile Verification

    • Confirms .gcda files were generated
    • Reports profile file count and locations
    • Detects missing -fprofile-generate flag

Output: 33 .gcda files (confirmed in latest run)
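The four collection steps can be sketched as small shell functions. This is an illustration under stated assumptions, not the contents of pgo_tiny_profile_box.sh: the function names, messages, and the `WORKLOAD_TIMEOUT` variable are hypothetical, and the sketch relies on the coreutils `timeout` command, which exits with status 124 when the time limit expires.

```shell
#!/usr/bin/env bash
# Sketch only: names and messages are hypothetical, not taken from pgo_tiny_profile_box.sh.
WORKLOAD_TIMEOUT="${WORKLOAD_TIMEOUT:-30}"   # seconds per workload, as stated in the report

# Step 1: ensure a binary exists; set the executable bit if it is missing.
validate_binary() {
  local bin="$1"
  [ -f "$bin" ] || { echo "ERROR: missing binary: $bin" >&2; return 1; }
  [ -x "$bin" ] || chmod +x "$bin"
}

# Step 2: delete stale profile data and report how many files were removed.
clean_profiles() {
  local dir="$1" count
  count=$(find "$dir" -name '*.gcda' -type f | wc -l)
  find "$dir" -name '*.gcda' -type f -delete
  echo "removed $((count)) stale .gcda file(s)"
}

# Step 3: run one workload under a timeout and report timeouts and failures.
run_workload() {
  local rc=0
  timeout "$WORKLOAD_TIMEOUT" "$@" >/dev/null || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "ERROR: timed out after ${WORKLOAD_TIMEOUT}s: $*" >&2
  elif [ "$rc" -ne 0 ]; then
    echo "ERROR: exit code $rc: $*" >&2
  fi
  return "$rc"
}

# Step 4: fail if no .gcda files appeared (often a missing -fprofile-generate).
verify_profiles() {
  local dir="$1" count
  count=$(find "$dir" -name '*.gcda' -type f | wc -l)
  echo "found $((count)) .gcda file(s)"
  [ "$((count))" -gt 0 ]
}
```

A driver would call `validate_binary` and `clean_profiles` once, then `run_workload` for each workload and stop at the first non-zero status, and finally `verify_profiles` on the build directory.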

Compiler Flags

# Profile Generation (Step 1)
PROFILE_GEN_FLAGS = -fprofile-generate -flto

# Profile Use (Step 3)
PROFILE_USE_FLAGS = -fprofile-use -flto

LTO: Link-Time Optimization is enabled for both phases to maximize PGO effectiveness.
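To make the flag usage concrete, the three phases can be written as plain compiler invocations. The sketch below assumes GCC, a single hypothetical source file `bench.c`, and an `-O2` optimization level; none of these come from the HAKMEM Makefile, which builds many objects with its own flags. It prints the commands by default (`DRY_RUN=1`) rather than running them.

```shell
#!/usr/bin/env bash
# Sketch: the three PGO phases as plain GCC invocations.
# CC, SRC, OUT, DRY_RUN, and -O2 are assumptions for illustration only.
CC="${CC:-gcc}"
SRC="${SRC:-bench.c}"
OUT="${OUT:-bench}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands without executing them

PROFILE_GEN_FLAGS="-fprofile-generate -flto"
PROFILE_USE_FLAGS="-fprofile-use -flto"

# Echo a command, and execute it unless DRY_RUN is 1.
run() {
  echo "+ $*"
  [ "$DRY_RUN" = "1" ] || "$@"
}

pgo_phases() {
  # The flag variables are left unquoted on purpose so they split into separate arguments.
  run "$CC" -O2 $PROFILE_GEN_FLAGS -o "$OUT" "$SRC"   # phase 1: instrumented build
  run "./$OUT"                                        # phase 2: run; .gcda files are written at exit
  run "$CC" -O2 $PROFILE_USE_FLAGS -o "$OUT" "$SRC"   # phase 3: rebuild using the profile
}
```

Calling `pgo_phases` with `DRY_RUN=0` would execute the commands; phase 3 must be compiled from the same sources and in the same directory layout as phase 1 so that the compiler can match the .gcda files to the objects.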


Workflow Fix (2025-11-29)

Issue: In the initial implementation, pgo-tiny-build itself invoked the profile collection script, which caused:

  • Duplicate script execution
  • Unclear separation of concerns
  • Skipped pgo-tiny-collect in dependency chain

Fix: Cleaned up the workflow:

# Before (broken):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build  # Missing collect!

# After (correct):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build

Result: Each target now has a single responsibility:

  • pgo-tiny-profile: Build instrumented binaries only
  • pgo-tiny-collect: Collect profile data only
  • pgo-tiny-build: Build optimized binaries only
  • pgo-tiny-full: Orchestrate all steps
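The ordering that the corrected dependency chain enforces can also be expressed as an explicit fail-fast loop, which is handy when driving the steps from a CI script instead of `make pgo-tiny-full`. This is a sketch: `run_steps` and the `MAKE_CMD` override are hypothetical names, not part of the repository.

```shell
#!/usr/bin/env bash
# Sketch: run the PGO targets in order and stop at the first failure.
# run_steps and MAKE_CMD are hypothetical; MAKE_CMD exists so the loop can be tested with a stub.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> $step"
    "${MAKE_CMD:-make}" "$step" || { echo "FAILED: $step" >&2; return 1; }
  done
}

# Intended use (not executed here):
#   run_steps pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
```

Because each target has a single responsibility, a failure message names exactly one phase (build, collect, or rebuild) rather than a combined step.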

Help Target Update

The Makefile help target (lines 8-37) now includes:

Benchmarking (PGO-optimized, +6% faster):
  make pgo-tiny-full                - Full PGO workflow (~5-10 min)
                                      = Profile + Optimize + Test
  make pgo-tiny-profile             - Step 1: Build profile binaries
  make pgo-tiny-collect             - Step 2: Collect profile data
  make pgo-tiny-build               - Step 3: Build optimized

Phase 4 Performance:
  Baseline:      57.0 M ops/s
  PGO-optimized: 60.6 M ops/s (+6.25%)

TIP: For best performance, use 'make pgo-tiny-full'

This keeps the PGO workflow discoverable from the help output, so developers do not have to remember the individual targets.


Artifacts

New Files

  • scripts/box/pgo_tiny_profile_config.sh - Workload definitions
  • scripts/box/pgo_tiny_profile_box.sh - Collection automation
  • PHASE4_STEP1_COMPLETE.md - This completion report

Modified Files

  • Makefile (lines 8-37) - Help target with PGO instructions
  • Makefile (lines 1305-1356) - PGO workflow targets

Documentation

  • CURRENT_TASK.md - Phase 4 roadmap
  • docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Box design

Box Pattern Compliance

  • Single Responsibility: Profile collection is a separate Box
  • Clear Contract: Workloads → .gcda files → Optimized binaries
  • Observable: Progress reporting, error detection, summary statistics
  • Safe: Validation, timeouts, fail-fast on errors
  • Testable: Deterministic seeds for reproducibility


Next Steps

Phase 4-Step2: Hot/Cold Path Box

  • Target: +10-15% improvement (60.6 → 70.0 M ops/s)
  • Approach: Separate hot (inline, likely) and cold (noinline, unlikely) paths
  • Design: Already specified in PHASE4_TINY_FRONT_BOX_DESIGN.md

Phase 4-Step3: Front Config Box

  • Target: +5-8% improvement (70.0 → 76.0 M ops/s)
  • Approach: Compile-time config optimization
  • Design: Already specified in design doc

Overall Phase 4 Target: 73-83 M ops/s (vs current 60.6 M ops/s)


Lessons Learned

  1. PGO is high ROI: +6.25% with zero code changes, ~30 minutes of work
  2. Representative workloads matter: 5 diverse workloads > 1 simple workload
  3. Automation is critical: Manual PGO workflows are error-prone
  4. Box pattern scales: Profile collection fits the Box pattern naturally
  5. Help targets prevent forgetting: Make workflows discoverable

Conclusion

Phase 4-Step1 successfully implemented PGO using the Box pattern, achieving a +6.25% performance improvement (57.0 → 60.6 M ops/s). Key outcomes:

  • Fully automated workflow (make pgo-tiny-full)
  • Reproducible results (deterministic seeds)
  • Clear documentation (help target, design doc)
  • Robust error handling (validation, timeouts)
  • Within expected range (+5-10%)

Status: Ready to proceed to Phase 4-Step2 (Hot/Cold Path Box)


Signed: Claude (2025-11-29)
Commit: TBD (pending git commit)