Implemented automated Profile-Guided Optimization workflow using Box pattern:

Performance Improvement:
- Baseline: 57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build: Build optimized binaries
   - pgo-tiny-full: Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Boxified: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Phase 4-Step1: PGO Workflow - COMPLETE ✓

**Date**: 2025-11-29
**Status**: ✅ Complete
**Performance Gain**: +6.25% (57.0 → 60.6 M ops/s)

---

## Summary

Phase 4-Step1 implemented a fully automated PGO (Profile-Guided Optimization) workflow for the HAKMEM Tiny Front using the Box pattern. The implementation achieved a **+6.25% performance improvement** (within the expected +5-10% range) with zero code changes - pure compiler optimization.

---

## Implementation

### Box 1: PGO Profile Collection Box

**Purpose**: Automated, reproducible profile data collection
**Contract**: Execute representative workloads → Generate .gcda files

**Components**:
1. `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
2. `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
3. Makefile PGO targets - Workflow orchestration

**Design Principles**:
- ✅ **Deterministic**: Fixed seeds (42) for reproducibility
- ✅ **Representative**: 5 workloads covering diverse allocation patterns
- ✅ **Automated**: Single command (`make pgo-tiny-full`) for complete workflow
- ✅ **Safe**: Validation checks, error detection, timeout protection
- ✅ **Observable**: Clear progress reporting, .gcda file verification

---

## Workload Design

The PGO profile collection uses **5 representative workloads** to capture diverse allocation patterns:

| Workload | Purpose | Key Characteristics |
|----------|---------|---------------------|
| `bench_random_mixed 5M 256 42` | Common case | Medium working set, balanced cache pressure |
| `bench_random_mixed 5M 128 42` | Hot path bias | Smaller working set, higher TLS cache hit rate |
| `bench_random_mixed 5M 512 42` | Cold path bias | Larger working set, more SuperSlab allocations |
| `bench_tiny_hot 16 100 60000` | Class 0 intensive | Smallest size class (16B) |
| `bench_tiny_hot 64 100 60000` | Class 3 intensive | Common small object size (64B) |

**Coverage**: The workloads exercise:
- Hot TLS SLL pop path (high-frequency allocations)
- Cold refill path (SuperSlab allocations)
- Multiple size classes (0, 3, and mixed)
- Varied cache pressure scenarios
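
As a rough illustration of how the five workloads might be declared in a configuration script, here is a minimal sketch. The variable and function names (`PGO_SEED`, `PGO_TIMEOUT_SECS`, `pgo_workloads`) are assumptions made for this example and are not taken from `pgo_tiny_profile_config.sh`; the argument strings are copied verbatim from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical workload configuration sketch; all names here are illustrative.

PGO_SEED=42           # fixed seed so every collection run sees the same inputs
PGO_TIMEOUT_SECS=30   # per-workload timeout used by the collection step

# Emit one workload per line in the form "<binary> <args...>".
pgo_workloads() {
  cat <<EOF
bench_random_mixed 5M 256 ${PGO_SEED}
bench_random_mixed 5M 128 ${PGO_SEED}
bench_random_mixed 5M 512 ${PGO_SEED}
bench_tiny_hot 16 100 60000
bench_tiny_hot 64 100 60000
EOF
}
```

Keeping the list in one function means a collection script can iterate over it with `pgo_workloads | while read -r line; do ...; done` without knowing the individual workloads.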

---

## Makefile Targets

```makefile
# Step 1: Build instrumented binaries (-fprofile-generate)
make pgo-tiny-profile

# Step 2: Collect profile data (run workloads → .gcda files)
make pgo-tiny-collect

# Step 3: Build optimized binaries (-fprofile-use)
make pgo-tiny-build

# Full workflow: profile → collect → build → test
make pgo-tiny-full
```

**Default Goal**: The Makefile help target now includes PGO instructions (lines 18-23)

---

## Performance Results

### Baseline (No PGO)
```
Run 1: 57.04 M ops/s
Run 2: 57.14 M ops/s
Run 3: 56.95 M ops/s
Average: 57.04 M ops/s
```

### PGO-Optimized
```
Run 1: 60.49 M ops/s
Run 2: 60.68 M ops/s
Run 3: 60.66 M ops/s
Average: 60.61 M ops/s
```

### Improvement
```
Absolute: +3.57 M ops/s
Relative: +6.25%
Expected: +5-10% ✓
```

**Verification**: Latest test (after Makefile fix) confirmed **60.75 M ops/s** - consistent with expected performance.
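
The averages and the relative gain can be re-derived from the six individual runs. The helper below is a small sketch for checking the arithmetic; `pgo_gain_summary` is a name invented for this example and is not part of the repository.

```bash
#!/usr/bin/env bash
# Recompute mean throughput and PGO gain from the per-run numbers.
# $1: space-separated baseline runs, $2: space-separated PGO runs (M ops/s).
pgo_gain_summary() {
  awk -v base="$1" -v opt="$2" 'BEGIN {
    nb = split(base, b, " ")
    no = split(opt, o, " ")
    for (i = 1; i <= nb; i++) sb += b[i]
    for (i = 1; i <= no; i++) so += o[i]
    mb = sb / nb          # baseline mean
    mo = so / no          # PGO mean
    printf "baseline=%.2f optimized=%.2f absolute=+%.2f relative=+%.2f%%\n",
           mb, mo, mo - mb, (mo - mb) / mb * 100
  }'
}

pgo_gain_summary "57.04 57.14 56.95" "60.49 60.68 60.66"
# prints: baseline=57.04 optimized=60.61 absolute=+3.57 relative=+6.25%
```

The relative gain is computed against the baseline mean, which is why it lands at +6.25% rather than the slightly higher figure obtained from the rounded 57.0 and 60.6 values.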

---

## Technical Details

### Profile Data Collection

The `pgo_tiny_profile_box.sh` script implements a robust collection workflow:

1. **Binary Validation**
   - Checks binaries exist and are executable
   - Auto-fixes permissions if needed

2. **Profile Cleanup**
   - Removes old .gcda files to prevent stale data
   - Reports cleanup statistics

3. **Workload Execution**
   - Runs each workload with 30s timeout
   - Detects timeouts and failures
   - Fails fast on errors

4. **Profile Verification**
   - Confirms .gcda files were generated
   - Reports profile file count and locations
   - Detects missing -fprofile-generate flag

**Output**: 33 .gcda files (confirmed in latest run)
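
The four steps above could be sketched roughly as follows. This is not the contents of `pgo_tiny_profile_box.sh`: the function names, messages, and return codes are assumptions for illustration. The only external behavior relied on is that GNU coreutils `timeout` exits with status 124 when the time limit is hit.

```bash
#!/usr/bin/env bash
# Illustrative sketch of the collection steps; all function names are hypothetical.

# Step 1: the binary must exist; add the execute bit if it is missing.
validate_binary() {
  local bin="$1"
  [ -f "$bin" ] || { echo "missing binary: $bin" >&2; return 1; }
  [ -x "$bin" ] || chmod +x "$bin"
}

# Step 2: delete stale .gcda files under a directory and report the count.
clean_profiles() {
  local dir="$1" removed
  removed=$(find "$dir" -type f -name '*.gcda' | wc -l | tr -d ' ')
  find "$dir" -type f -name '*.gcda' -exec rm -f {} +
  echo "removed ${removed} stale .gcda file(s)"
}

# Step 3: run one workload under a timeout and classify the outcome.
# Usage: run_workload <seconds> <command> [args...]
run_workload() {
  local secs="$1" status=0
  shift
  timeout "$secs" "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "TIMEOUT after ${secs}s: $*" >&2
  elif [ "$status" -ne 0 ]; then
    echo "FAILED (exit ${status}): $*" >&2
  fi
  return "$status"
}

# Step 4: confirm that profile data was written.
verify_profiles() {
  local dir="$1" count
  count=$(find "$dir" -type f -name '*.gcda' | wc -l | tr -d ' ')
  if [ "$count" -eq 0 ]; then
    echo "no .gcda files found; was -fprofile-generate used?" >&2
    return 1
  fi
  echo "found ${count} .gcda file(s)"
}
```

A driver would call these in order and stop at the first non-zero return, for example `run_workload 30 ./bench_tiny_hot 16 100 60000 || exit 1`, which gives the fail-fast behavior described in step 3.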

### Compiler Flags

```makefile
# Profile Generation (Step 1)
PROFILE_GEN_FLAGS = -fprofile-generate -flto

# Profile Use (Step 3)
PROFILE_USE_FLAGS = -fprofile-use -flto
```

**LTO**: Link-Time Optimization is enabled for both phases to maximize PGO effectiveness.
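
To make the two phases concrete, the sketch below prints the compile command each phase would use for a single source file. It is a dry-run helper only: `pgo_cc_command`, the `-O2` level, the `gcc` default, and the file names are assumptions for this example, while the `-fprofile-generate`, `-fprofile-use`, and `-flto` flags come from the Makefile variables above.

```bash
#!/usr/bin/env bash
# Print (do not run) the compile command for a PGO phase.
# Usage: pgo_cc_command <generate|use> <source> <output>
pgo_cc_command() {
  local phase="$1" src="$2" out="$3" flags
  case "$phase" in
    generate) flags="-fprofile-generate -flto" ;;  # instrumented build (Step 1)
    use)      flags="-fprofile-use -flto" ;;       # optimized build (Step 3)
    *) echo "unknown phase: $phase" >&2; return 2 ;;
  esac
  # -O2 and the gcc default are assumptions, not values from the project Makefile.
  echo "${CC:-gcc} -O2 ${flags} -o ${out} ${src}"
}

pgo_cc_command generate bench.c bench_instrumented
pgo_cc_command use bench.c bench_pgo
```

Running the instrumented binary between the two builds is what writes the `.gcda` files; the sketch assumes both builds happen in the same build directory so the second build can locate them.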

---

## Workflow Fix (2025-11-29)

**Issue**: Initial implementation had `pgo-tiny-build` calling the profile collection script, causing:
- Duplicate script execution
- Unclear separation of concerns
- Skipped `pgo-tiny-collect` in dependency chain

**Fix**: Cleaned up the workflow:
```makefile
# Before (broken):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build  # Missing collect!

# After (correct):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
```

**Result**: Each target now has a single responsibility:
- `pgo-tiny-profile`: Build instrumented binaries only
- `pgo-tiny-collect`: Collect profile data only
- `pgo-tiny-build`: Build optimized binaries only
- `pgo-tiny-full`: Orchestrate all steps
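
The ordering that the corrected dependency chain enforces can also be expressed as a small shell wrapper. This is a sketch under assumptions: `pgo_full` and its optional runner argument are invented for this example (the runner exists only so the ordering can be exercised without a real build), and the final test step of `pgo-tiny-full` is left out.

```bash
#!/usr/bin/env bash
# Run the three PGO targets in order and stop at the first failure.
# $1 (optional): command used to run a target; defaults to `make`.
pgo_full() {
  local runner="${1:-make}" target
  for target in pgo-tiny-profile pgo-tiny-collect pgo-tiny-build; do
    echo "==> ${target}"
    "$runner" "$target" || { echo "step failed: ${target}" >&2; return 1; }
  done
}
```

Calling `pgo_full` with no arguments runs `make pgo-tiny-profile`, `make pgo-tiny-collect`, and `make pgo-tiny-build` in sequence, so the optimized build never starts unless collection succeeded.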

---

## Help Target Update

The Makefile `help` target (lines 8-37) now includes:

```
Benchmarking (PGO-optimized, +6% faster):
  make pgo-tiny-full     - Full PGO workflow (~5-10 min)
                           = Profile + Optimize + Test
  make pgo-tiny-profile  - Step 1: Build profile binaries
  make pgo-tiny-collect  - Step 2: Collect profile data
  make pgo-tiny-build    - Step 3: Build optimized

Phase 4 Performance:
  Baseline:       57.0 M ops/s
  PGO-optimized:  60.6 M ops/s (+6.25%)

TIP: For best performance, use 'make pgo-tiny-full'
```

This ensures developers won't forget how to use PGO builds.

---

## Artifacts

### New Files
- `scripts/box/pgo_tiny_profile_config.sh` - Workload definitions
- `scripts/box/pgo_tiny_profile_box.sh` - Collection automation
- `PHASE4_STEP1_COMPLETE.md` - This completion report

### Modified Files
- `Makefile` (lines 8-37) - Help target with PGO instructions
- `Makefile` (lines 1305-1356) - PGO workflow targets

### Documentation
- `CURRENT_TASK.md` - Phase 4 roadmap
- `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` - Complete Box design

---

## Box Pattern Compliance

- ✅ **Single Responsibility**: Profile collection is a separate Box
- ✅ **Clear Contract**: Workloads → .gcda files → Optimized binaries
- ✅ **Observable**: Progress reporting, error detection, summary statistics
- ✅ **Safe**: Validation, timeouts, fail-fast on errors
- ✅ **Testable**: Deterministic seeds for reproducibility

---

## Next Steps

### Phase 4-Step2: Hot/Cold Path Box
- **Target**: +10-15% improvement (60.6 → 70.0 M ops/s)
- **Approach**: Separate hot (inline, likely) and cold (noinline, unlikely) paths
- **Design**: Already specified in `PHASE4_TINY_FRONT_BOX_DESIGN.md`

### Phase 4-Step3: Front Config Box
- **Target**: +5-8% improvement (70.0 → 76.0 M ops/s)
- **Approach**: Compile-time config optimization
- **Design**: Already specified in design doc

**Overall Phase 4 Target**: 73-83 M ops/s (vs current 60.6 M ops/s)

---

## Lessons Learned

1. **PGO is high ROI**: +6.25% with zero code changes, ~30 minutes of work
2. **Representative workloads matter**: 5 diverse workloads > 1 simple workload
3. **Automation is critical**: Manual PGO workflows are error-prone
4. **Box pattern scales**: Profile collection fits the Box pattern naturally
5. **Help targets prevent forgetting**: Make workflows discoverable

---

## Conclusion

Phase 4-Step1 successfully implemented PGO optimization using the Box pattern, achieving **+6.25% performance improvement** (57.0 → 60.6 M ops/s) with:
- ✅ Fully automated workflow (`make pgo-tiny-full`)
- ✅ Reproducible results (deterministic seeds)
- ✅ Clear documentation (help target, design doc)
- ✅ Robust error handling (validation, timeouts)
- ✅ Within expected range (+5-10%)

**Status**: Ready to proceed to Phase 4-Step2 (Hot/Cold Path Box)

---

**Signed**: Claude (2025-11-29)
**Commit**: TBD (pending git commit)