Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
Implemented an automated Profile-Guided Optimization (PGO) workflow using the Box pattern.

Performance Improvement:
- Baseline: 57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within the expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build: Build optimized binaries
   - pgo-tiny-full: Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Boxification: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
110
CURRENT_TASK.md
Normal file
@@ -0,0 +1,110 @@
# Current Task: Phase 4 - Tiny Front Optimization

**Date**: 2025-11-29
**Goal**: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s)
**Strategy**: Boxification + PGO + Hot/Cold separation

---

## Phase 4 Overview: 3-Step Approach

### Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)
- **Duration**: ~~1-2 days~~ **Completed: 2025-11-29**
- **Risk**: Low
- **Target**: 56.8M → 60-62M ops/s
- **Actual**: **57.0M → 60.6M ops/s (+6.25%)** ✓

**Deliverables**:
1. ✅ `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
2. ✅ `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
3. ✅ Makefile targets: `pgo-tiny-profile`, `pgo-tiny-collect`, `pgo-tiny-build`, `pgo-tiny-full`
4. ✅ Makefile help target updated with PGO instructions
5. ✅ Benchmark comparison (before/after PGO)
6. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md`

---

### Step 2: Hot/Cold Path Box (Expected: +10-15%)
- **Duration**: 3-5 days
- **Risk**: Medium
- **Target**: 60-62M → 68-75M ops/s (cumulative +15-25%)

**Deliverables**:
1. `core/box/tiny_front_hot_box.h` - Ultra-fast path (5-7 branches max)
2. `core/box/tiny_front_cold_box.h` - Slow path (noinline, cold)
3. Refactor `tiny_alloc_fast()` to use Hot/Cold boxes
4. PGO re-optimization with new structure

---

### Step 3: Front Config Box (Expected: +5-8%)
- **Duration**: 2-3 days
- **Risk**: Low
- **Target**: 68-75M → 73-83M ops/s (cumulative +20-33%)

**Deliverables**:
1. `core/box/tiny_front_config_box.h` - Compile-time config management
2. Replace runtime checks with `TINY_FRONT_*_ENABLED` macros
3. Build flag: `HAKMEM_TINY_FRONT_PGO=1`
4. Final PGO optimization + full benchmark suite

---

## Success Criteria

**bench_random_mixed (ws=256)**:
- Phase 3 baseline: 56.8M ops/s
- Phase 4.1 (PGO): 60-62M ops/s
- Phase 4.2 (Hot/Cold): 68-75M ops/s
- Phase 4.3 (Config): **73-83M ops/s** ✓ (vs mimalloc 107M = 68-77%)

**bench_tiny_hot (64B)**:
- Phase 3 baseline: 81.0M ops/s
- Phase 4.3 target: **100-115M ops/s** ✓ (vs system 156M = 64-74%)
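
The percentages quoted against mimalloc and the system allocator are plain throughput ratios. As a minimal sketch for rechecking them when new numbers arrive (the helper name `pct_of` is hypothetical and not part of the repository):

```bash
# pct_of VALUE REFERENCE: print VALUE as a percentage of REFERENCE (one decimal).
# Hypothetical helper, only for sanity-checking the targets listed above.
pct_of() {
  awk -v v="$1" -v r="$2" 'BEGIN { printf "%.1f\n", v / r * 100 }'
}

pct_of 73 107    # low end of the Phase 4.3 target vs mimalloc
pct_of 115 156   # high end of the bench_tiny_hot target vs system malloc
```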

---

## Current Status: Step 1 Complete ✅ → Ready for Step 2

**Completed**:
1. ✅ PGO Profile Collection Box implemented (+6.25% improvement)
2. ✅ Makefile workflow automation (`make pgo-tiny-full`)
3. ✅ Help target updated for discoverability
4. ✅ Completion report written

**Next Actions (Step 2)**:
1. Implement Tiny Front Hot Path Box (5-7 branches max)
2. Implement Tiny Front Cold Path Box (noinline, cold)
3. Refactor `tiny_alloc_fast()` to use Hot/Cold separation
4. Re-run PGO optimization with new structure
5. Benchmark: Target 68-75M ops/s (+10-15% over Step 1)

**Design Reference**: `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` (already complete)

---

## Notes from ChatGPT Analysis

**Real bottleneck**:
- NOT front_gate_v2 alone
- BUT `tiny_alloc_fast()` overall complexity (15-20 branches)

**Branch explosion sources**:
1. ultra_slim_mode_enabled() gate
2. hak_tiny_size_to_class range check
3. tiny_sizeclass_hist_hit (profile)
4. HeapV2 enabled/disabled
5. FastCache enabled/disabled
6. SFC enabled/disabled + hit/miss
7. TLS SLL enabled/disabled + per-class branches
8. Multiple env gates in refill path

**Pool/Tiny boundary**: Negligible overhead (0.1-0.2% in bench)

**memset/page fault**: Already optimized (TRUST_MMAP_ZERO=1)

---

Updated: 2025-11-29
Phase: 4 (Tiny Front Optimization)
Previous: Phase 3 (mincore removal, +10.7%)
89
Makefile
@@ -1,6 +1,40 @@
# Makefile for hakmem PoC

CC = gcc
# Default target: Show help
.DEFAULT_GOAL := help

.PHONY: help
help:
	@echo "========================================="
	@echo "HAKMEM Build Targets"
	@echo "========================================="
	@echo ""
	@echo "Development (Fast builds):"
	@echo "  make bench_random_mixed_hakmem  - Quick build (~1-2 min)"
	@echo "  make bench_tiny_hot_hakmem      - Quick build"
	@echo "  make test_hakmem                - Quick test build"
	@echo ""
	@echo "Benchmarking (PGO-optimized, +6% faster):"
	@echo "  make pgo-tiny-full              - Full PGO workflow (~5-10 min)"
	@echo "                                    = Profile + Optimize + Test"
	@echo "  make pgo-tiny-profile           - Step 1: Build profile binaries"
	@echo "  make pgo-tiny-collect           - Step 2: Collect profile data"
	@echo "  make pgo-tiny-build             - Step 3: Build optimized"
	@echo ""
	@echo "Comparison:"
	@echo "  make bench-comparison           - Compare hakmem vs system vs mimalloc"
	@echo "  make bench-pool-tls             - Pool TLS benchmark"
	@echo ""
	@echo "Cleanup:"
	@echo "  make clean                      - Clean build artifacts"
	@echo ""
	@echo "Phase 4 Performance:"
	@echo "  Baseline:      57.0 M ops/s"
	@echo "  PGO-optimized: 60.6 M ops/s (+6.25%)"
	@echo ""
	@echo "TIP: For best performance, use 'make pgo-tiny-full'"
	@echo "========================================="
CXX = g++

# Directory structure (2025-11-01 reorganization)
@@ -1262,3 +1296,58 @@ test_simple_e1: test_simple_e1.o $(HAKMEM_OBJS)

test_simple_e1.o: test_simple_e1.c
	$(CC) $(CFLAGS) -c -o $@ $<

# ========================================
# Phase 4: PGO (Profile-Guided Optimization) Targets
# ========================================
# Phase 4-Step1: PGO Profile Build
# Builds binaries with -fprofile-generate for profiling
.PHONY: pgo-tiny-profile
pgo-tiny-profile:
	@echo "========================================="
	@echo "Phase 4: Building PGO Profile Binaries"
	@echo "========================================="
	$(MAKE) clean
	$(MAKE) PROFILE_GEN=1 bench_random_mixed_hakmem bench_tiny_hot_hakmem
	@echo ""
	@echo "✓ PGO profile binaries built"
	@echo "Next: Run 'make pgo-tiny-collect' to collect profile data"
	@echo ""

# Phase 4-Step1: PGO Profile Collection
# Executes representative workloads to generate .gcda files
.PHONY: pgo-tiny-collect
pgo-tiny-collect:
	@echo "========================================="
	@echo "Phase 4: Collecting PGO Profile Data"
	@echo "========================================="
	./scripts/box/pgo_tiny_profile_box.sh

# Phase 4-Step1: PGO Optimized Build
# Builds binaries with -fprofile-use for optimization
.PHONY: pgo-tiny-build
pgo-tiny-build:
	@echo "========================================="
	@echo "Phase 4: Building PGO-Optimized Binaries"
	@echo "========================================="
	@echo "Building optimized binaries..."
	$(MAKE) clean
	$(MAKE) PROFILE_USE=1 bench_random_mixed_hakmem bench_tiny_hot_hakmem
	@echo ""
	@echo "✓ PGO-optimized binaries built"
	@echo "Next: Run './bench_random_mixed_hakmem 1000000 256 42' to test"
	@echo ""

# Phase 4-Step1: Full PGO Workflow
# Complete workflow: profile → collect → build → test
.PHONY: pgo-tiny-full
pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
	@echo "========================================="
	@echo "Phase 4: PGO Full Workflow Complete"
	@echo "========================================="
	@echo "Testing PGO-optimized binary..."
	@echo ""
	./bench_random_mixed_hakmem 1000000 256 42
	@echo ""
	@echo "✓ PGO optimization complete!"
	@echo ""

259
PHASE4_STEP1_COMPLETE.md
Normal file
@@ -0,0 +1,259 @@
# Phase 4-Step1: PGO Workflow - COMPLETE ✓

**Date**: 2025-11-29
**Status**: ✅ Complete
**Performance Gain**: +6.25% (57.0 → 60.6 M ops/s)

---

## Summary

Phase 4-Step1 implemented a fully automated PGO (Profile-Guided Optimization) workflow for the HAKMEM Tiny Front using the Box pattern. The implementation achieved a **+6.25% performance improvement** (within the expected +5-10% range) with no source code changes; the gain comes purely from compiler optimization.

---

## Implementation

### Box 1: PGO Profile Collection Box

**Purpose**: Automated, reproducible profile data collection
**Contract**: Execute representative workloads → Generate .gcda files

**Components**:
1. `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
2. `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
3. Makefile PGO targets - Workflow orchestration

**Design Principles**:
- ✅ **Deterministic**: Fixed seeds (42) for reproducibility
- ✅ **Representative**: 5 workloads covering diverse allocation patterns
- ✅ **Automated**: Single command (`make pgo-tiny-full`) for complete workflow
- ✅ **Safe**: Validation checks, error detection, timeout protection
- ✅ **Observable**: Clear progress reporting, .gcda file verification

---

## Workload Design

The PGO profile collection uses **5 representative workloads** to capture diverse allocation patterns:

| Workload | Purpose | Key Characteristics |
|----------|---------|---------------------|
| `bench_random_mixed 5M 256 42` | Common case | Medium working set, balanced cache pressure |
| `bench_random_mixed 5M 128 42` | Hot path bias | Smaller working set, higher TLS cache hit rate |
| `bench_random_mixed 5M 512 42` | Cold path bias | Larger working set, more SuperSlab allocations |
| `bench_tiny_hot 16 100 60000` | Class 0 intensive | Smallest size class (16B) |
| `bench_tiny_hot 64 100 60000` | Class 3 intensive | Common small object size (64B) |

**Coverage**: The workloads exercise:
- Hot TLS SLL pop path (high-frequency allocations)
- Cold refill path (SuperSlab allocations)
- Multiple size classes (0, 3, and mixed)
- Varied cache pressure scenarios

---

## Makefile Targets

```bash
# Step 1: Build instrumented binaries (-fprofile-generate)
make pgo-tiny-profile

# Step 2: Collect profile data (run workloads → .gcda files)
make pgo-tiny-collect

# Step 3: Build optimized binaries (-fprofile-use)
make pgo-tiny-build

# Full workflow: profile → collect → build → test
make pgo-tiny-full
```

**Default Goal**: `help` is now the Makefile's default goal, and its output includes the PGO instructions (Makefile lines 18-23).

---

## Performance Results

### Baseline (No PGO)
```
Run 1:   57.04 M ops/s
Run 2:   57.14 M ops/s
Run 3:   56.95 M ops/s
Average: 57.04 M ops/s
```

### PGO-Optimized
```
Run 1:   60.49 M ops/s
Run 2:   60.68 M ops/s
Run 3:   60.66 M ops/s
Average: 60.61 M ops/s
```

### Improvement
```
Absolute: +3.57 M ops/s
Relative: +6.25%
Expected: +5-10% ✓
```

**Verification**: Latest test (after Makefile fix) confirmed **60.75 M ops/s** - consistent with expected performance.
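
The averages and the relative gain above can be recomputed from the six run results. The following is a minimal sketch using `awk`; the helper names `mean` and `gain_pct` are hypothetical and not part of the repository. Note that the gain is computed from the unrounded means, which is what yields +6.25%.

```bash
# mean N1 N2 ...: arithmetic mean of the arguments, printed with 4 decimals.
# The awk program has only a BEGIN block, so the arguments are never opened as files.
mean() {
  awk 'BEGIN { for (i = 1; i < ARGC; i++) s += ARGV[i]; printf "%.4f\n", s / (ARGC - 1) }' "$@"
}

# gain_pct BASELINE OPTIMIZED: relative improvement in percent (2 decimals).
gain_pct() {
  awk -v b="$1" -v p="$2" 'BEGIN { printf "%.2f\n", (p - b) / b * 100 }'
}

base=$(mean 57.04 57.14 56.95)   # baseline runs from this report
pgo=$(mean 60.49 60.68 60.66)    # PGO-optimized runs from this report
echo "baseline=$base pgo=$pgo gain=$(gain_pct "$base" "$pgo")%"
```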

---

## Technical Details

### Profile Data Collection

The `pgo_tiny_profile_box.sh` script implements a robust collection workflow:

1. **Binary Validation**
   - Checks binaries exist and are executable
   - Auto-fixes permissions if needed

2. **Profile Cleanup**
   - Removes old .gcda files to prevent stale data
   - Reports cleanup statistics

3. **Workload Execution**
   - Runs each workload with 30s timeout
   - Detects timeouts and failures
   - Fails fast on errors

4. **Profile Verification**
   - Confirms .gcda files were generated
   - Reports profile file count and locations
   - Detects missing -fprofile-generate flag

**Output**: 33 .gcda files (confirmed in latest run)
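
The workload-execution step can be sketched as follows. This is an illustration, not the contents of `pgo_tiny_profile_box.sh`: the function name `run_workload` is hypothetical, and the sketch assumes GNU coreutils `timeout`, which exits with status 124 when the time limit is reached.

```bash
# run_workload LIMIT CMD [ARG...]: run one workload under a time limit.
# Returns 124 on timeout, the command's own status on failure, 0 on success.
run_workload() {
  local limit=$1 rc=0
  shift
  timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "TIMEOUT after $limit: $*" >&2
    return 124
  elif [ "$rc" -ne 0 ]; then
    echo "FAILED (rc=$rc): $*" >&2
    return "$rc"
  fi
  echo "OK: $*"
}

# Example call matching the 30s limit described above (not executed here):
#   run_workload 30s ./bench_random_mixed_hakmem 5000000 256 42 || exit 1
```

Returning the status instead of exiting lets the caller decide whether to fail fast, which is the behavior the report describes.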

### Compiler Flags

```makefile
# Profile Generation (Step 1)
PROFILE_GEN_FLAGS = -fprofile-generate -flto

# Profile Use (Step 3)
PROFILE_USE_FLAGS = -fprofile-use -flto
```

**LTO**: Link-Time Optimization is enabled for both phases to maximize PGO effectiveness.
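
To try the two phases by hand outside the Makefile, the flag sets can be selected per phase. The sketch below is an assumption-laden illustration: `pgo_cflags` is a hypothetical helper and `demo.c` is a placeholder file name; only the GCC flags themselves come from this report.

```bash
# pgo_cflags gen|use: print the compiler flags for one PGO phase.
pgo_cflags() {
  case "$1" in
    gen) echo "-fprofile-generate -flto" ;;  # Step 1: instrumented build
    use) echo "-fprofile-use -flto" ;;       # Step 3: optimized build
    *)   echo "usage: pgo_cflags gen|use" >&2; return 2 ;;
  esac
}

# Manual three-step flow (commented out; demo.c is a placeholder):
#   gcc -O2 $(pgo_cflags gen) demo.c -o demo   # build instrumented binary
#   ./demo                                     # run workload, writes .gcda
#   gcc -O2 $(pgo_cflags use) demo.c -o demo   # rebuild using the profile
```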

---

## Workflow Fix (2025-11-29)

**Issue**: Initial implementation had `pgo-tiny-build` calling the profile collection script, causing:
- Duplicate script execution
- Unclear separation of concerns
- Skipped `pgo-tiny-collect` in dependency chain

**Fix**: Cleaned up the workflow:
```makefile
# Before (broken):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build  # Missing collect!

# After (correct):
pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
```

**Result**: Each target now has a single responsibility:
- `pgo-tiny-profile`: Build only
- `pgo-tiny-collect`: Collect only
- `pgo-tiny-build`: Build only
- `pgo-tiny-full`: Orchestrate all steps
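
The corrected chain relies on the three targets running strictly one after another. GNU make builds prerequisites left to right in a serial run, but may build them concurrently under `-j`, so a parallel invocation of `pgo-tiny-full` is not guaranteed to keep this order. A minimal sketch of an explicitly sequential driver, assuming a hypothetical helper named `run_pgo_steps` that is not part of the repository:

```bash
# run_pgo_steps [MAKE_CMD]: run the three PGO targets strictly in order and
# stop at the first failure. MAKE_CMD defaults to "make"; passing another
# command (for example "echo") gives a dry run.
run_pgo_steps() {
  local make_cmd=${1:-make} step
  for step in pgo-tiny-profile pgo-tiny-collect pgo-tiny-build; do
    "$make_cmd" "$step" || { echo "step failed: $step" >&2; return 1; }
  done
}

run_pgo_steps echo   # dry run: prints the targets in execution order
```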

---

## Help Target Update

The Makefile `help` target (lines 8-37) now includes:

```
Benchmarking (PGO-optimized, +6% faster):
  make pgo-tiny-full      - Full PGO workflow (~5-10 min)
                            = Profile + Optimize + Test
  make pgo-tiny-profile   - Step 1: Build profile binaries
  make pgo-tiny-collect   - Step 2: Collect profile data
  make pgo-tiny-build     - Step 3: Build optimized

Phase 4 Performance:
  Baseline:      57.0 M ops/s
  PGO-optimized: 60.6 M ops/s (+6.25%)

TIP: For best performance, use 'make pgo-tiny-full'
```

Because `help` is the default goal, running a bare `make` shows these instructions, so the PGO workflow stays discoverable.

---

## Artifacts

### New Files
- `scripts/box/pgo_tiny_profile_config.sh` - Workload definitions
- `scripts/box/pgo_tiny_profile_box.sh` - Collection automation
- `PHASE4_STEP1_COMPLETE.md` - This completion report

### Modified Files
- `Makefile` (lines 8-37) - Help target with PGO instructions
- `Makefile` (lines 1305-1356) - PGO workflow targets

### Documentation
- `CURRENT_TASK.md` - Phase 4 roadmap
- `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` - Complete Box design

---

## Box Pattern Compliance

- ✅ **Single Responsibility**: Profile collection is a separate Box
- ✅ **Clear Contract**: Workloads → .gcda files → Optimized binaries
- ✅ **Observable**: Progress reporting, error detection, summary statistics
- ✅ **Safe**: Validation, timeouts, fail-fast on errors
- ✅ **Testable**: Deterministic seeds for reproducibility

---

## Next Steps

### Phase 4-Step2: Hot/Cold Path Box
- **Target**: +10-15% improvement (60.6 → 70.0 M ops/s)
- **Approach**: Separate hot (inline, likely) and cold (noinline, unlikely) paths
- **Design**: Already specified in `PHASE4_TINY_FRONT_BOX_DESIGN.md`

### Phase 4-Step3: Front Config Box
- **Target**: +5-8% improvement (70.0 → 76.0 M ops/s)
- **Approach**: Compile-time config optimization
- **Design**: Already specified in design doc

**Overall Phase 4 Target**: 73-83 M ops/s (vs current 60.6 M ops/s)

---

## Lessons Learned

1. **PGO is high ROI**: +6.25% with zero code changes, ~30 minutes of work
2. **Representative workloads matter**: 5 diverse workloads > 1 simple workload
3. **Automation is critical**: Manual PGO workflows are error-prone
4. **Box pattern scales**: Profile collection fits the Box pattern naturally
5. **Help targets prevent forgetting**: Make workflows discoverable

---

## Conclusion

Phase 4-Step1 successfully implemented PGO optimization using the Box pattern, achieving **+6.25% performance improvement** (57.0 → 60.6 M ops/s) with:
- ✅ Fully automated workflow (`make pgo-tiny-full`)
- ✅ Reproducible results (deterministic seeds)
- ✅ Clear documentation (help target, design doc)
- ✅ Robust error handling (validation, timeouts)
- ✅ Within expected range (+5-10%)

**Status**: Ready to proceed to Phase 4-Step2 (Hot/Cold Path Box)

---

**Signed**: Claude (2025-11-29)
**Commit**: TBD (pending git commit)
724
docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md
Normal file
@@ -0,0 +1,724 @@
# Phase 4: Tiny Front Optimization - Box Design Document

**Date**: 2025-11-29
**Author**: Claude Code
**Goal**: 2x throughput improvement via Boxification + PGO + Hot/Cold separation

---

## Design Philosophy

### Boxification Principles
1. **Single Responsibility**: One Box = one clearly defined responsibility
2. **Clear Contracts**: Inputs, outputs, and guarantees are stated explicitly
3. **Macro-Based Pointers**: Type safety, null checks, a unified API
4. **Testability**: Each Box can be tested independently
5. **Incremental**: Implemented and verified step by step

### Pointer Safety Strategy
**All pointer operations are abstracted behind macros**:
- Uniform null checks
- Safe type casts
- Assertions in debug builds
- Optimized away in release builds

---

## Box 1: PGO Profile Collection Box

### Responsibility
Standardize and automate PGO profile collection for the Tiny Front.

### File Layout
```
scripts/box/pgo_tiny_profile_box.sh     - Main script
scripts/box/pgo_tiny_profile_config.sh  - Configuration (workload definitions)
```

### Contract

**Input**:
- Built binaries with `-fprofile-generate -flto`
  - `bench_random_mixed_hakmem`
  - `bench_tiny_hot_hakmem`

**Output**:
- `.gcda` profile data files
- Profile summary report

**Guarantees**:
- Deterministic execution (fixed seed)
- Representative workload coverage
- Error detection & reporting

### Implementation

#### scripts/box/pgo_tiny_profile_box.sh
```bash
#!/bin/bash
# Box: PGO Profile Collection
# Contract: Execute representative Tiny workloads for PGO
# Usage: ./scripts/box/pgo_tiny_profile_box.sh

set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"

echo "========================================="
echo "Box: PGO Profile Collection (Tiny Front)"
echo "========================================="

# Validate binaries exist
for bin in "${PGO_BINARIES[@]}"; do
    if [[ ! -x "$bin" ]]; then
        echo "ERROR: Binary not found or not executable: $bin"
        exit 1
    fi
done

# Clean old profile data
echo "[PGO_BOX] Cleaning old .gcda files..."
find . -name "*.gcda" -delete

# Execute workloads
echo "[PGO_BOX] Executing representative workloads..."

for workload in "${PGO_WORKLOADS[@]}"; do
    echo "[PGO_BOX] Running: $workload"
    eval "$workload"
done

# Verify profile data generated
GCDA_COUNT=$(find . -name "*.gcda" | wc -l)
if [[ $GCDA_COUNT -eq 0 ]]; then
    echo "ERROR: No .gcda files generated!"
    exit 1
fi

echo "[PGO_BOX] Profile collection complete"
echo "[PGO_BOX] Generated $GCDA_COUNT .gcda files"
echo "========================================="
```
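
The verification step at the end of the script can be factored into a reusable check. A minimal sketch, assuming hypothetical helper names `count_gcda` and `require_gcda` that do not exist in the repository:

```bash
# count_gcda DIR: print the number of .gcda files under DIR.
count_gcda() {
  find "$1" -type f -name '*.gcda' | wc -l | tr -d ' '
}

# require_gcda DIR: fail with a hint when no profile data was produced,
# which usually means the binaries were not built with -fprofile-generate.
require_gcda() {
  local n
  n=$(count_gcda "$1")
  if [ "$n" -eq 0 ]; then
    echo "ERROR: no .gcda files under $1 (built with -fprofile-generate?)" >&2
    return 1
  fi
  echo "found $n .gcda files"
}
```

Stripping the spaces from the `wc -l` output keeps the count usable in string comparisons on systems where `wc` pads its result.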

#### scripts/box/pgo_tiny_profile_config.sh
```bash
#!/bin/bash
# Box: PGO Profile Configuration
# Purpose: Define representative workloads for Tiny Front

# Binaries to profile
PGO_BINARIES=(
    "./bench_random_mixed_hakmem"
    "./bench_tiny_hot_hakmem"
)

# Representative workloads (deterministic seeds)
PGO_WORKLOADS=(
    # Random mixed: Common case (medium working set)
    "./bench_random_mixed_hakmem 5000000 256 42"

    # Random mixed: Smaller working set (higher cache hit)
    "./bench_random_mixed_hakmem 5000000 128 42"

    # Random mixed: Larger working set (more diverse)
    "./bench_random_mixed_hakmem 5000000 512 42"

    # Tiny hot path: 16B allocations
    "./bench_tiny_hot_hakmem 16 100 60000"

    # Tiny hot path: 64B allocations
    "./bench_tiny_hot_hakmem 64 100 60000"
)
```
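
Each workload is stored as a single string and the collection script runs it through `eval`. Since the entries are plain `command arg arg` lines, one alternative is to split the string on whitespace and execute it directly. The sketch below shows that idea under stated assumptions: `run_workload_line` is a hypothetical helper, and it does not support arguments that contain spaces.

```bash
# run_workload_line "CMD ARG...": split one workload string on whitespace and
# execute it without eval. Pathname expansion is disabled during the split so
# that characters such as '*' in an argument are passed through unchanged.
run_workload_line() {
  set -f        # disable globbing while the string is split
  set -- $1     # intentional word splitting into this function's parameters
  set +f
  "$@"
}

run_workload_line "echo 5000000 256 42"   # stand-in for a benchmark command
```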

### Makefile Integration
```makefile
# PGO Tiny Profile Build
pgo-tiny-profile:
	@echo "Building PGO profile binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-generate -flto" \
	        LDFLAGS+="-fprofile-generate -flto" \
	        HAKMEM_BUILD_RELEASE=1 \
	        HAKMEM_TINY_FRONT_PGO=1 \
	        bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Tiny Profile Collection (runs the workloads, writes .gcda files)
pgo-tiny-collect:
	@echo "Collecting PGO profile data..."
	./scripts/box/pgo_tiny_profile_box.sh

# PGO Tiny Optimized Build
pgo-tiny-build:
	@echo "Building PGO-optimized binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-use -flto" \
	        LDFLAGS+="-fprofile-use -flto" \
	        HAKMEM_BUILD_RELEASE=1 \
	        HAKMEM_TINY_FRONT_PGO=1 \
	        bench_random_mixed_hakmem

# PGO Full Workflow: profile → collect → build → test
pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
	@echo "PGO optimization complete!"
	@echo "Testing optimized binary..."
	./bench_random_mixed_hakmem 1000000 256 42
```

---

## Box 2: Tiny Front Hot Path Box

### Responsibility
Ultra-fast allocation path (minimal branch count, `always_inline`)

### File Layout
```
core/box/tiny_front_hot_box.h         - Hot path implementation
core/box/tiny_front_hot_box_macros.h  - Pointer safety macros
```

### Contract

**Preconditions**:
- `class_idx` validated (0-7)
- TLS initialized
- Not in slow path mode

**Guarantees**:
- Maximum 5-7 branches
- Always inlined
- Null-safe pointer operations
- PGO-optimized

**Performance**:
- Hit case: ~20-30 cycles
- Miss case: → Cold Box (~100-200 cycles)

### Pointer Safety Macros

#### core/box/tiny_front_hot_box_macros.h
```c
#ifndef TINY_FRONT_HOT_BOX_MACROS_H
#define TINY_FRONT_HOT_BOX_MACROS_H

#include <stdint.h>
#include <stddef.h>

// ========== Pointer Type Definitions ==========

// Named pointer aliases that document intent. All three are plain void*,
// so the compiler does not distinguish between them.
typedef void* TinyHotPtr;    // User-facing allocation pointer
typedef void* TinySLLNode;   // SLL node pointer
typedef void* TinySlabBase;  // Slab base pointer

// ========== Pointer Safety Macros ==========

#if HAKMEM_BUILD_RELEASE
// Release: No overhead
#define TINY_HOT_PTR_CHECK(ptr)       (ptr)
#define TINY_HOT_PTR_CAST(type, ptr)  ((type)(ptr))
#define TINY_HOT_PTR_NULL             NULL
#define TINY_HOT_PTR_IS_NULL(ptr)     ((ptr) == NULL)
#define TINY_HOT_PTR_IS_VALID(ptr)    ((ptr) != NULL)
#else
// Debug: Assertions enabled
// Note: ({ ... }) is a GNU statement expression (GCC/Clang extension).
#include <assert.h>
#define TINY_HOT_PTR_CHECK(ptr) \
    ({ void* _p = (ptr); \
       assert(((uintptr_t)_p & 0x7) == 0 && "Pointer not 8-byte aligned"); \
       _p; })
#define TINY_HOT_PTR_CAST(type, ptr) \
    ((type)TINY_HOT_PTR_CHECK(ptr))
#define TINY_HOT_PTR_NULL             NULL
#define TINY_HOT_PTR_IS_NULL(ptr)     ((ptr) == NULL)
#define TINY_HOT_PTR_IS_VALID(ptr) \
    ({ void* _p = (ptr); \
       _p != NULL && ((uintptr_t)_p & 0x7) == 0; })
#endif

// ========== SLL Operations Macros ==========

// Read next pointer from SLL node
#define TINY_HOT_SLL_NEXT(node) \
    TINY_HOT_PTR_CAST(TinySLLNode, tiny_nextptr_get(node))

// Write next pointer to SLL node
#define TINY_HOT_SLL_SET_NEXT(node, next) \
    tiny_nextptr_set((node), TINY_HOT_PTR_CHECK(next))

// Pop from TLS SLL (class-specific)
#define TINY_HOT_SLL_POP(class_idx) \
    TINY_HOT_PTR_CAST(TinyHotPtr, tls_sll_pop_inline(class_idx))

// Push to TLS SLL (class-specific)
#define TINY_HOT_SLL_PUSH(class_idx, ptr) \
    tls_sll_push_inline((class_idx), TINY_HOT_PTR_CHECK(ptr))

// ========== Likely/Unlikely Hints ==========

#define TINY_HOT_LIKELY(x)    __builtin_expect(!!(x), 1)
#define TINY_HOT_UNLIKELY(x)  __builtin_expect(!!(x), 0)

// ========== Branch Prediction Hints ==========

// Expected: SLL hit (80-90% of allocations)
#define TINY_HOT_EXPECT_HIT(ptr) \
    TINY_HOT_LIKELY(TINY_HOT_PTR_IS_VALID(ptr))

// Expected: SLL miss (10-20% of allocations)
#define TINY_HOT_EXPECT_MISS(ptr) \
    TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))

#endif // TINY_FRONT_HOT_BOX_MACROS_H
```

### Implementation

#### core/box/tiny_front_hot_box.h
```c
#ifndef TINY_FRONT_HOT_BOX_H
#define TINY_FRONT_HOT_BOX_H

#include "tiny_front_hot_box_macros.h"
#include "../tiny_next_ptr_box.h"  // tiny_nextptr_get/set
#include "../tls_sll_box.h"        // TLS SLL operations

// Forward declaration for cold path
void* tiny_front_cold_refill(int class_idx)
    __attribute__((noinline, cold));

// ========== Box: Tiny Front Hot Path ==========
// Contract: Ultra-fast allocation with 5-7 branches max
// Precondition: class_idx validated (0-7), TLS initialized
// Performance: ~20-30 cycles (hit), ~100-200 cycles (miss → cold)
// Optimization: always_inline + PGO + branch hints

__attribute__((always_inline))
static inline TinyHotPtr tiny_front_hot_alloc(int class_idx)
{
    // Branch 1: TLS SLL pop (expected: 80-90% hit)
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    // Branch 2: Check if hit (optimized by PGO)
    if (TINY_HOT_EXPECT_HIT(ptr)) {
        // Fast path exit: ~20-30 cycles total
        return ptr;
    }

    // Branch 3: Miss → Cold path refill (10-20% of allocations)
    return TINY_HOT_PTR_CAST(TinyHotPtr, tiny_front_cold_refill(class_idx));
}

// ========== Box: Tiny Front Hot Free ==========
// Contract: Ultra-fast free with 3-5 branches max
// Precondition: ptr is valid Tiny allocation
// Performance: ~15-25 cycles

__attribute__((always_inline))
static inline void tiny_front_hot_free(TinyHotPtr ptr, int class_idx)
{
    // Branch 1: Null check (expected: rare)
    if (TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))) {
        return;
    }

    // Branch 2: Push to TLS SLL (expected: always succeeds)
    TINY_HOT_SLL_PUSH(class_idx, ptr);

    // Fast path exit: ~15-25 cycles total
}

#endif // TINY_FRONT_HOT_BOX_H
```

---

## Box 3: Tiny Front Cold Path Box

### Responsibility
Low-frequency allocation/free slow path (`noinline`, `cold` attributes)

### File Layout
```
core/box/tiny_front_cold_box.h  - Cold path implementation
```

### Contract

**Called When**:
- TLS SLL miss (refill needed)
- Slow allocation path (debug, large size, etc.)

**Guarantees**:
- I-cache separated from hot path
- Heavy operations allowed
- Can call into ACE, learning, diagnostics

**Optimization**:
- `noinline` → Not inlined into hot path
- `cold` → Compiler puts in cold section

### Implementation
|
||||
|
||||
#### core/box/tiny_front_cold_box.h
|
||||
```c
#ifndef TINY_FRONT_COLD_BOX_H
#define TINY_FRONT_COLD_BOX_H

#include <stddef.h>
#include "tiny_front_hot_box_macros.h"

// Forward declaration: tiny_front_cold_refill() calls this function
// before its definition appears further down in this header.
void* tiny_front_cold_slow_alloc(size_t size, int class_idx);

// ========== Box: Tiny Front Cold Refill ==========
// Contract: Refill TLS SLL when empty
// Called: 10-20% of allocations (SLL miss)
// Performance: ~100-200 cycles (acceptable for miss case)
// Optimization: noinline, cold → separated from hot path

__attribute__((noinline, cold))
void* tiny_front_cold_refill(int class_idx)
{
    // Heavy refill logic
    // - May allocate new SuperSlab
    // - May trigger ACE learning
    // - May call into diagnostics

    // Call existing refill logic
    tiny_fast_refill_and_take(class_idx);

    // After refill, try pop again
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    if (TINY_HOT_PTR_IS_VALID(ptr)) {
        return ptr;
    }

    // Refill failed → slow allocation
    return tiny_front_cold_slow_alloc(0, class_idx);  // size=0 (unknown)
}

// ========== Box: Tiny Front Cold Slow Alloc ==========
// Contract: Slowest allocation path (debug, diagnostics, ACE)
// Called: Rare (refill failure, special modes)
// Performance: ~500-1000+ cycles (acceptable for rare case)

__attribute__((noinline, cold))
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
{
    // Debug/diagnostic/ACE learning hooks
    // - Allocation site tracking
    // - Size class profiling
    // - Memory pressure monitoring

    // Call legacy slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

// ========== Box: Tiny Front Cold Drain ==========
// Contract: Drain remote frees (batched, low frequency)
// Called: Background or on threshold
// Optimization: noinline, cold

__attribute__((noinline, cold))
void tiny_front_cold_drain_remote(int class_idx)
{
    // Drain remote free lists into TLS SLL
    // - Batch processing for efficiency
    // - May trigger ACE rebalancing

    tiny_remote_drain_to_sll(class_idx);
}

#endif // TINY_FRONT_COLD_BOX_H
```

---

## Box 4: Tiny Front Config Box

### Responsibility
Centralized management of Tiny Front configuration (compile-time/runtime switching)

### File Layout
```
core/box/tiny_front_config_box.h - Configuration management
core/hakmem_build_flags.h - Build flag definitions (existing)
```

### Contract

**Compile-time Mode (PGO builds)**:
- `HAKMEM_TINY_FRONT_PGO=1`
- All runtime checks → compile-time constants
- Unused branches eliminated by compiler

**Runtime Mode (normal builds)**:
- Backward compatible
- ENV variable checks as before
- Full feature set available

### Implementation

#### core/box/tiny_front_config_box.h
```c
#ifndef TINY_FRONT_CONFIG_BOX_H
#define TINY_FRONT_CONFIG_BOX_H

#include <stdio.h>  // fprintf, used by tiny_front_config_report()

// ========== Build Flag Definitions ==========
// Location: core/hakmem_build_flags.h

#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

// ========== PGO Mode: Fixed Configuration ==========

#if HAKMEM_TINY_FRONT_PGO
// PGO build: Fix configuration for profiling/optimization
// All runtime checks become compile-time constants

#define TINY_FRONT_ULTRA_SLIM_ENABLED    0
#define TINY_FRONT_HEAP_V2_ENABLED       0
#define TINY_FRONT_SFC_ENABLED           1
#define TINY_FRONT_FASTCACHE_ENABLED     0
#define TINY_FRONT_UNIFIED_GATE_ENABLED  1
#define TINY_FRONT_METRICS_ENABLED       0
#define TINY_FRONT_DIAG_ENABLED          0

// Optimization: Constant folding eliminates dead branches
// Example:
//   if (TINY_FRONT_HEAP_V2_ENABLED) { ... }
//   → Compiler eliminates entire block (0 is constant false)

#else
// Normal build: Runtime configuration (backward compatible)
// Checks ENV variables or config state

#define TINY_FRONT_ULTRA_SLIM_ENABLED    ultra_slim_mode_enabled()
#define TINY_FRONT_HEAP_V2_ENABLED       tiny_heap_v2_enabled()
#define TINY_FRONT_SFC_ENABLED           sfc_cascade_enabled()
#define TINY_FRONT_FASTCACHE_ENABLED     tiny_fastcache_enabled()
#define TINY_FRONT_UNIFIED_GATE_ENABLED  front_gate_unified_enabled()
#define TINY_FRONT_METRICS_ENABLED       tiny_metrics_enabled()
#define TINY_FRONT_DIAG_ENABLED          tiny_diag_enabled()

#endif // HAKMEM_TINY_FRONT_PGO

// ========== Configuration Helpers ==========

// Check if running in PGO-optimized build
static inline int tiny_front_is_pgo_build(void)
{
    return HAKMEM_TINY_FRONT_PGO;
}

// Get effective configuration (for diagnostics)
static inline void tiny_front_config_report(void)
{
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[TINY_FRONT_CONFIG]\n");
    fprintf(stderr, "  PGO Build: %d\n", HAKMEM_TINY_FRONT_PGO);
    fprintf(stderr, "  Ultra SLIM: %d\n", TINY_FRONT_ULTRA_SLIM_ENABLED);
    fprintf(stderr, "  Heap V2: %d\n", TINY_FRONT_HEAP_V2_ENABLED);
    fprintf(stderr, "  SFC: %d\n", TINY_FRONT_SFC_ENABLED);
    fprintf(stderr, "  FastCache: %d\n", TINY_FRONT_FASTCACHE_ENABLED);
    fprintf(stderr, "  Unified Gate: %d\n", TINY_FRONT_UNIFIED_GATE_ENABLED);
#endif
}

#endif // TINY_FRONT_CONFIG_BOX_H
```

#### Update to core/hakmem_build_flags.h
```c
// Add around line 190:

// HAKMEM_TINY_FRONT_PGO:
//   0 = Normal build with runtime configuration (default)
//   1 = PGO-optimized build with compile-time configuration
//       Eliminates runtime branches for maximum performance.
//       Use with: make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-build
#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif
```

---

## Integration: Refactor tiny_alloc_fast()

### Before (one complex function, 15-20 branches)
```c
void* tiny_alloc_fast(size_t size) {
    // Ultra SLIM check
    if (ultra_slim_mode_enabled()) { ... }

    // Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // Metrics
    if (tiny_metrics_enabled()) { tiny_sizeclass_hist_hit(class_idx); }

    // Heap V2 check
    if (tiny_heap_v2_enabled()) { ... }

    // FastCache check
    if (tiny_fastcache_enabled()) { ... }

    // SFC cascade
    if (sfc_cascade_enabled()) { ... }

    // TLS SLL pop
    void* ptr = tls_sll_pop(class_idx);
    if (ptr) return ptr;

    // Refill logic (complex)
    ...
}
```

### After (Box-based, only 3-5 branches)
```c
// Include new boxes
#include "core/box/tiny_front_config_box.h"
#include "core/box/tiny_front_hot_box.h"
#include "core/box/tiny_front_cold_box.h"

void* tiny_alloc_fast(size_t size) {
    // Branch 1: Ultra SLIM mode check (compile-time constant in PGO)
    if (TINY_FRONT_ULTRA_SLIM_ENABLED) {
        return tiny_ultra_slim_alloc(size);  // Separate path
    }

    // Branch 2: Size to class (always needed)
    int class_idx = hak_tiny_size_to_class(size);

    // Branch 3: Hot path (inlined, 2-3 branches inside)
    return tiny_front_hot_alloc(class_idx);

    // Total branches in PGO build: 2-3
    // (Ultra SLIM = 0 → eliminated, hot_alloc inlined)
}
```

**Effective branch count after PGO optimization**: **only 2-3 branches**!

---

## Testing Strategy

### Step 1: PGO Workflow Test
```bash
# Build profile version
make pgo-tiny-profile

# Collect profiles (automated)
./scripts/box/pgo_tiny_profile_box.sh

# Build optimized version
make pgo-tiny-build

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42
./bench_tiny_hot_hakmem

# Expected: +5-10% improvement
```

### Step 2: Hot/Cold Separation Test
```bash
# Build with hot/cold boxes
make clean
make HAKMEM_TINY_FRONT_PGO=1 bench_random_mixed_hakmem

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42

# Expected: +10-15% improvement (cumulative +15-25%)
```

### Step 3: Config Box Test
```bash
# PGO build (compile-time config)
make clean
make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-full

# Normal build (runtime config)
make clean
make bench_random_mixed_hakmem

# Both should work, PGO should be faster
```

### Regression Testing
```bash
# Ensure backward compatibility
HAKMEM_TINY_ULTRA_SLIM=1 ./bench_random_mixed_hakmem 100000 256
HAKMEM_TINY_HEAP_V2=1 ./bench_random_mixed_hakmem 100000 256

# All existing ENV vars should work in normal builds
```

---

## Performance Expectations

### Branch Reduction
- **Before**: 15-20 branches in `tiny_alloc_fast()`
- **After (PGO)**: 2-3 branches (most eliminated by compiler)
- **Gain**: ~40-60% reduction in branch misses

### Instruction Count
- **Before**: ~167M instructions (1M ops)
- **After**: ~120-140M instructions
- **Gain**: ~16-28% reduction

### Throughput
- **Phase 3**: 56.8M ops/s
- **Phase 4.1 (PGO)**: 60-62M ops/s (+5-10%)
- **Phase 4.2 (Hot/Cold)**: 68-75M ops/s (+10-15%)
- **Phase 4.3 (Config)**: 73-83M ops/s (+5-8%)

**Total Improvement**: +28-46% over the Phase 3 baseline → **moving toward the 2x goal**

---

## Implementation Schedule

### Week 1: PGO Workflow
- Day 1-2: PGO scripts + Makefile
- Day 3: Profile collection + benchmarking
- Day 4: Documentation + review

### Week 2: Hot/Cold Separation
- Day 1-2: Hot Box + macros
- Day 3-4: Cold Box + refactor
- Day 5: Testing + PGO re-optimization

### Week 3: Config Box + Polish
- Day 1-2: Config Box implementation
- Day 3: Integration testing
- Day 4-5: Final benchmarks + documentation

---

## Success Criteria

✅ **Code Quality**:
- All pointer operations use macros
- Clear contracts in each Box
- Zero regression in existing features

✅ **Performance**:
- bench_random_mixed: 73-83M ops/s (vs 56.8M baseline)
- bench_tiny_hot: 100-115M ops/s (vs 81M baseline)
- No regression in other benchmarks

✅ **Maintainability**:
- Hot/Cold separation clear
- PGO workflow documented
- Backward compatible

---

Generated: 2025-11-29
Phase: 4 Design
Next: Implementation (Week 1 start)

101
scripts/box/pgo_tiny_profile_box.sh
Executable file
@ -0,0 +1,101 @@
#!/bin/bash
# Box: PGO Profile Collection (Tiny Front)
# Contract: Execute representative Tiny workloads for PGO
# Usage: ./scripts/box/pgo_tiny_profile_box.sh
#
# Input: Built binaries with -fprofile-generate -flto
# Output: .gcda profile data files
# Guarantees: Deterministic execution, error detection, summary report

set -e  # Fail fast on errors

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"

echo "========================================="
echo "Box: PGO Profile Collection (Tiny Front)"
echo "========================================="
echo "Date: $(date)"
echo "Workloads: $PGO_WORKLOAD_COUNT"
echo "Binaries: $PGO_BINARY_COUNT"
echo ""

# Validate binaries exist and are executable
echo "[PGO_BOX] Validating binaries..."
for bin in "${PGO_BINARIES[@]}"; do
    if [[ ! -f "$bin" ]]; then
        echo "ERROR: Binary not found: $bin"
        exit 1
    fi
    if [[ ! -x "$bin" ]]; then
        # Recoverable: try to fix the permission before giving up
        echo "WARN: Binary not executable: $bin (attempting chmod +x)"
        chmod +x "$bin" || exit 1
        echo "  Fixed: Made $bin executable"
    fi
    echo "  ✓ $bin"
done
echo ""

# Clean old profile data
echo "[PGO_BOX] Cleaning old .gcda files..."
GCDA_OLD_COUNT=$(find . -name "*.gcda" 2>/dev/null | wc -l)
if [[ $GCDA_OLD_COUNT -gt 0 ]]; then
    find . -name "*.gcda" -delete
    echo "  Removed $GCDA_OLD_COUNT old .gcda files"
else
    echo "  No old .gcda files found"
fi
echo ""

# Execute workloads
echo "[PGO_BOX] Executing representative workloads..."
echo "==========================================="
WORKLOAD_NUM=0
for workload in "${PGO_WORKLOADS[@]}"; do
    WORKLOAD_NUM=$((WORKLOAD_NUM + 1))
    echo ""
    echo "[$WORKLOAD_NUM/$PGO_WORKLOAD_COUNT] Running: $workload"
    echo "-------------------------------------------"

    # Execute with timeout (30s per workload)
    # $workload is intentionally unquoted so it splits into command + arguments
    if timeout 30 $workload; then
        echo "  ✓ Success"
    else
        EXIT_CODE=$?
        if [[ $EXIT_CODE -eq 124 ]]; then
            echo "  ✗ TIMEOUT (30s exceeded)"
        else
            echo "  ✗ FAILED (exit code: $EXIT_CODE)"
        fi
        echo "ERROR: Workload failed: $workload"
        exit 1
    fi
done
echo ""
echo "==========================================="

# Verify profile data generated
echo "[PGO_BOX] Verifying profile data..."
GCDA_COUNT=$(find . -name "*.gcda" 2>/dev/null | wc -l)
if [[ $GCDA_COUNT -eq 0 ]]; then
    echo "ERROR: No .gcda files generated!"
    echo "  This usually means binaries were not built with -fprofile-generate"
    exit 1
fi

echo "  ✓ Generated $GCDA_COUNT .gcda files"
echo ""

# Summary
echo "========================================="
echo "PGO Profile Collection: SUCCESS"
echo "========================================="
echo "Profile files: $GCDA_COUNT .gcda files"
echo "Next step: make pgo-tiny-build"
echo ""
echo "Profile locations:"
find . -name "*.gcda" | head -5
if [[ $GCDA_COUNT -gt 5 ]]; then
    echo "  ... and $((GCDA_COUNT - 5)) more"
fi
echo "========================================="
43
scripts/box/pgo_tiny_profile_config.sh
Executable file
@ -0,0 +1,43 @@
#!/bin/bash
# Box: PGO Profile Configuration
# Purpose: Define representative workloads for Tiny Front
# Contract: Provides workload definitions for PGO profile collection

# Binaries to profile
PGO_BINARIES=(
    "./bench_random_mixed_hakmem"
    "./bench_tiny_hot_hakmem"
)

# Representative workloads (deterministic seeds for reproducibility)
# Design: Cover diverse allocation patterns for optimal PGO data
PGO_WORKLOADS=(
    # Random mixed: Common case (medium working set)
    # - Most representative of general allocation patterns
    # - 256 slots = moderate cache pressure
    "./bench_random_mixed_hakmem 5000000 256 42"

    # Random mixed: Smaller working set (higher cache hit)
    # - Exercises hot TLS SLL path heavily
    # - 128 slots = higher hit rate
    "./bench_random_mixed_hakmem 5000000 128 42"

    # Random mixed: Larger working set (more diverse)
    # - Exercises refill and cold paths more
    # - 512 slots = more SuperSlab allocations
    "./bench_random_mixed_hakmem 5000000 512 42"

    # Tiny hot path: 16B allocations
    # - Class 0 (smallest) intensive
    # - High allocation frequency
    "./bench_tiny_hot_hakmem 16 100 60000"

    # Tiny hot path: 64B allocations
    # - Class 3 (common size) intensive
    # - Typical small object pattern
    "./bench_tiny_hot_hakmem 64 100 60000"
)

# Configuration summary
PGO_WORKLOAD_COUNT=${#PGO_WORKLOADS[@]}
PGO_BINARY_COUNT=${#PGO_BINARIES[@]}