From b51b600e8d626601e1deca96d044cd10674c7cb1 Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)"
Date: Sat, 29 Nov 2025 11:28:38 +0900
Subject: [PATCH] Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implemented automated Profile-Guided Optimization workflow using Box pattern:

Performance Improvement:
- Baseline: 57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build: Build optimized binaries
   - pgo-tiny-full: Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Boxification: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
---
 CURRENT_TASK.md                             | 110 +++
 Makefile                                    |  89 +++
 PHASE4_STEP1_COMPLETE.md                    | 259 +++++++
 docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md | 724 ++++++++++++++++++++
 scripts/box/pgo_tiny_profile_box.sh         | 101 +++
 scripts/box/pgo_tiny_profile_config.sh      |  43 ++
 6 files changed, 1326 insertions(+)
 create mode 100644 CURRENT_TASK.md
 create mode 100644 PHASE4_STEP1_COMPLETE.md
 create mode 100644 docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md
 create mode 100755 scripts/box/pgo_tiny_profile_box.sh
 create mode 100755 scripts/box/pgo_tiny_profile_config.sh

diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
new file mode 100644
index 00000000..8440c175
--- /dev/null
+++ b/CURRENT_TASK.md
@@ -0,0 +1,110 @@
+# Current Task: Phase 4 - Tiny Front Optimization
+
+**Date**: 2025-11-29
+**Goal**: Tiny allocation throughput 2x improvement (56.8M → 110M+ ops/s)
+**Strategy**: Boxification + PGO + Hot/Cold separation
+
+---
+
+## Phase 4 Overview: 3-Step Approach
+
+### Step 1: PGO Workflow Box ✅ COMPLETE (+6.25%)
+- **Duration**: ~~1-2 days~~ **Completed: 2025-11-29**
+- **Risk**: Low
+- **Target**: 56.8M → 60-62M ops/s
+- **Actual**: **57.0M → 60.6M ops/s (+6.25%)** ✓
+
+**Deliverables**:
+1. ✅ `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
+2. ✅ `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
+3. ✅ Makefile targets: `pgo-tiny-profile`, `pgo-tiny-collect`, `pgo-tiny-build`, `pgo-tiny-full`
+4. ✅ Makefile help target updated with PGO instructions
+5. ✅ Benchmark comparison (before/after PGO)
+6. ✅ Completion report: `PHASE4_STEP1_COMPLETE.md`
+
+---
+
+### Step 2: Hot/Cold Path Box (Expected: +10-15%)
+- **Duration**: 3-5 days
+- **Risk**: Medium
+- **Target**: 60-62M → 68-75M ops/s (cumulative +15-25%)
+
+**Deliverables**:
+1. `core/box/tiny_front_hot_box.h` - Ultra-fast path (5-7 branches max)
+2. `core/box/tiny_front_cold_box.h` - Slow path (noinline, cold)
+3. Refactor `tiny_alloc_fast()` to use Hot/Cold boxes
+4. PGO re-optimization with new structure
+
+---
+
+### Step 3: Front Config Box (Expected: +5-8%)
+- **Duration**: 2-3 days
+- **Risk**: Low
+- **Target**: 68-75M → 73-83M ops/s (cumulative +20-33%)
+
+**Deliverables**:
+1. `core/box/tiny_front_config_box.h` - Compile-time config management
+2. Replace runtime checks with `TINY_FRONT_*_ENABLED` macros
+3. Build flag: `HAKMEM_TINY_FRONT_PGO=1`
+4. Final PGO optimization + full benchmark suite
+
+---
+
+## Success Criteria
+
+**bench_random_mixed (ws=256)**:
+- Phase 3 baseline: 56.8M ops/s
+- Phase 4.1 (PGO): 60-62M ops/s
+- Phase 4.2 (Hot/Cold): 68-75M ops/s
+- Phase 4.3 (Config): **73-83M ops/s** ✓ (vs mimalloc 107M = 68-77%)
+
+**bench_tiny_hot (64B)**:
+- Phase 3 baseline: 81.0M ops/s
+- Phase 4.3 target: **100-115M ops/s** ✓ (vs system 156M = 64-74%)
+
+---
+
+## Current Status: Step 1 Complete ✅ → Ready for Step 2
+
+**Completed**:
+1. ✅ PGO Profile Collection Box implemented (+6.25% improvement)
+2. ✅ Makefile workflow automation (`make pgo-tiny-full`)
+3. ✅ Help target updated for discoverability
+4. ✅ Completion report written
+
+**Next Actions (Step 2)**:
+1. Implement Tiny Front Hot Path Box (5-7 branches max)
+2. Implement Tiny Front Cold Path Box (noinline, cold)
+3. Refactor `tiny_alloc_fast()` to use Hot/Cold separation
+4. Re-run PGO optimization with new structure
+5. Benchmark: Target 68-75M ops/s (+10-15% over Step 1)
+
+**Design Reference**: `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` (already complete)
+
+---
+
+## Notes from ChatGPT Analysis
+
+**Real bottleneck**:
+- NOT front_gate_v2 alone
+- BUT `tiny_alloc_fast()` overall complexity (15-20 branches)
+
+**Branch explosion sources**:
+1. ultra_slim_mode_enabled() gate
+2. hak_tiny_size_to_class range check
+3. tiny_sizeclass_hist_hit (profile)
+4. HeapV2 enabled/disabled
+5. FastCache enabled/disabled
+6. SFC enabled/disabled + hit/miss
+7. TLS SLL enabled/disabled + per-class branches
+8. Multiple env gates in refill path
+
+**Pool/Tiny boundary**: Negligible overhead (0.1-0.2% in bench)
+
+**memset/page fault**: Already optimized (TRUST_MMAP_ZERO=1)
+
+---
+
+Updated: 2025-11-29
+Phase: 4 (Tiny Front Optimization)
+Previous: Phase 3 (mincore removal, +10.7%)
diff --git a/Makefile b/Makefile
index 1a7c7ef8..5df40094 100644
--- a/Makefile
+++ b/Makefile
@@ -1,6 +1,40 @@
 # Makefile for hakmem PoC
 
 CC = gcc
+# Default target: Show help
+.DEFAULT_GOAL := help
+
+.PHONY: help
+help:
+	@echo "========================================="
+	@echo "HAKMEM Build Targets"
+	@echo "========================================="
+	@echo ""
+	@echo "Development (Fast builds):"
+	@echo " make bench_random_mixed_hakmem - Quick build (~1-2 min)"
+	@echo " make bench_tiny_hot_hakmem - Quick build"
+	@echo " make test_hakmem - Quick test build"
+	@echo ""
+	@echo "Benchmarking (PGO-optimized, +6% faster):"
+	@echo " make pgo-tiny-full - Full PGO workflow (~5-10 min)"
+	@echo " = Profile + Optimize + Test"
+	@echo " make pgo-tiny-profile - Step 1: Build profile binaries"
+	@echo " make pgo-tiny-collect - Step 2: Collect profile data"
+	@echo " make pgo-tiny-build - Step 3: Build optimized"
+	@echo ""
+	@echo "Comparison:"
+	@echo " make bench-comparison - Compare hakmem vs system vs mimalloc"
+	@echo " make bench-pool-tls - Pool TLS benchmark"
+	@echo ""
+	@echo "Cleanup:"
+	@echo " make clean - Clean build artifacts"
+	@echo ""
+	@echo "Phase 4 Performance:"
+	@echo " Baseline: 57.0 M ops/s"
+	@echo " PGO-optimized: 60.6 M ops/s (+6.25%)"
+	@echo ""
+	@echo "TIP: For best performance, use 'make pgo-tiny-full'"
+	@echo "========================================="
 CXX = g++
 
 # Directory structure (2025-11-01 reorganization)
@@ -1262,3 +1296,58 @@ test_simple_e1: test_simple_e1.o $(HAKMEM_OBJS)
 
 test_simple_e1.o: test_simple_e1.c
 	$(CC) $(CFLAGS) -c -o $@ $<
+
+# ========================================
+# Phase 4: PGO (Profile-Guided Optimization) Targets
+# ========================================
+# Phase 4-Step1: PGO Profile Build
+# Builds binaries with -fprofile-generate for profiling
+.PHONY: pgo-tiny-profile
+pgo-tiny-profile:
+	@echo "========================================="
+	@echo "Phase 4: Building PGO Profile Binaries"
+	@echo "========================================="
+	$(MAKE) clean
+	$(MAKE) PROFILE_GEN=1 bench_random_mixed_hakmem bench_tiny_hot_hakmem
+	@echo ""
+	@echo "✓ PGO profile binaries built"
+	@echo "Next: Run 'make pgo-tiny-collect' to collect profile data"
+	@echo ""
+
+# Phase 4-Step1: PGO Profile Collection
+# Executes representative workloads to generate .gcda files
+.PHONY: pgo-tiny-collect
+pgo-tiny-collect:
+	@echo "========================================="
+	@echo "Phase 4: Collecting PGO Profile Data"
+	@echo "========================================="
+	./scripts/box/pgo_tiny_profile_box.sh
+
+# Phase 4-Step1: PGO Optimized Build
+# Builds binaries with -fprofile-use for optimization
+.PHONY: pgo-tiny-build
+pgo-tiny-build:
+	@echo "========================================="
+	@echo "Phase 4: Building PGO-Optimized Binaries"
+	@echo "========================================="
+	@echo "Building optimized binaries..."
+	$(MAKE) clean
+	$(MAKE) PROFILE_USE=1 bench_random_mixed_hakmem bench_tiny_hot_hakmem
+	@echo ""
+	@echo "✓ PGO-optimized binaries built"
+	@echo "Next: Run './bench_random_mixed_hakmem 1000000 256 42' to test"
+	@echo ""
+
+# Phase 4-Step1: Full PGO Workflow
+# Complete workflow: profile → collect → build → test
+.PHONY: pgo-tiny-full
+pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
+	@echo "========================================="
+	@echo "Phase 4: PGO Full Workflow Complete"
+	@echo "========================================="
+	@echo "Testing PGO-optimized binary..."
+	@echo ""
+	./bench_random_mixed_hakmem 1000000 256 42
+	@echo ""
+	@echo "✓ PGO optimization complete!"
+	@echo ""
diff --git a/PHASE4_STEP1_COMPLETE.md b/PHASE4_STEP1_COMPLETE.md
new file mode 100644
index 00000000..fd382591
--- /dev/null
+++ b/PHASE4_STEP1_COMPLETE.md
@@ -0,0 +1,259 @@
+# Phase 4-Step1: PGO Workflow - COMPLETE ✓
+
+**Date**: 2025-11-29
+**Status**: ✅ Complete
+**Performance Gain**: +6.25% (57.0 → 60.6 M ops/s)
+
+---
+
+## Summary
+
+Phase 4-Step1 implemented a fully automated PGO (Profile-Guided Optimization) workflow for the HAKMEM Tiny Front using the Box pattern. The implementation achieved a **+6.25% performance improvement** (within the expected +5-10% range) with zero code changes - pure compiler optimization.
+
+---
+
+## Implementation
+
+### Box 1: PGO Profile Collection Box
+
+**Purpose**: Automated, reproducible profile data collection
+**Contract**: Execute representative workloads → Generate .gcda files
+
+**Components**:
+1. `scripts/box/pgo_tiny_profile_config.sh` - Workload configuration
+2. `scripts/box/pgo_tiny_profile_box.sh` - Profile collection automation
+3. Makefile PGO targets - Workflow orchestration
+
+**Design Principles**:
+- ✅ **Deterministic**: Fixed seeds (42) for reproducibility
+- ✅ **Representative**: 5 workloads covering diverse allocation patterns
+- ✅ **Automated**: Single command (`make pgo-tiny-full`) for complete workflow
+- ✅ **Safe**: Validation checks, error detection, timeout protection
+- ✅ **Observable**: Clear progress reporting, .gcda file verification
+
+---
+
+## Workload Design
+
+The PGO profile collection uses **5 representative workloads** to capture diverse allocation patterns:
+
+| Workload | Purpose | Key Characteristics |
+|----------|---------|---------------------|
+| `bench_random_mixed 5M 256 42` | Common case | Medium working set, balanced cache pressure |
+| `bench_random_mixed 5M 128 42` | Hot path bias | Smaller working set, higher TLS cache hit rate |
+| `bench_random_mixed 5M 512 42` | Cold path bias | Larger working set, more SuperSlab allocations |
+| `bench_tiny_hot 16 100 60000` | Class 0 intensive | Smallest size class (16B) |
+| `bench_tiny_hot 64 100 60000` | Class 3 intensive | Common small object size (64B) |
+
+**Coverage**: The workloads exercise:
+- Hot TLS SLL pop path (high-frequency allocations)
+- Cold refill path (SuperSlab allocations)
+- Multiple size classes (0, 3, and mixed)
+- Varied cache pressure scenarios
+
+---
+
+## Makefile Targets
+
+```makefile
+# Step 1: Build instrumented binaries (-fprofile-generate)
+make pgo-tiny-profile
+
+# Step 2: Collect profile data (run workloads → .gcda files)
+make pgo-tiny-collect
+
+# Step 3: Build optimized binaries (-fprofile-use)
+make pgo-tiny-build
+
+# Full workflow: profile → collect → build → test
+make pgo-tiny-full
+```
+
+**Default Goal**: The Makefile help target now includes PGO instructions (lines 18-23)
+
+---
+
+## Performance Results
+
+### Baseline (No PGO)
+```
+Run 1: 57.04 M ops/s
+Run 2: 57.14 M ops/s
+Run 3: 56.95 M ops/s
+Average: 57.04 M ops/s
+```
+
+### PGO-Optimized
+```
+Run 1: 60.49 M ops/s
+Run 2: 60.68 M ops/s
+Run 3: 60.66 M ops/s
+Average: 60.61 M ops/s
+```
+
+### Improvement
+```
+Absolute: +3.57 M ops/s
+Relative: +6.25%
+Expected: +5-10% ✓
+```
+
+**Verification**: Latest test (after Makefile fix) confirmed **60.75 M ops/s** - consistent with expected performance.
+
+---
+
+## Technical Details
+
+### Profile Data Collection
+
+The `pgo_tiny_profile_box.sh` script implements a robust collection workflow:
+
+1. **Binary Validation**
+   - Checks binaries exist and are executable
+   - Auto-fixes permissions if needed
+
+2. **Profile Cleanup**
+   - Removes old .gcda files to prevent stale data
+   - Reports cleanup statistics
+
+3. **Workload Execution**
+   - Runs each workload with 30s timeout
+   - Detects timeouts and failures
+   - Fails fast on errors
+
+4. **Profile Verification**
+   - Confirms .gcda files were generated
+   - Reports profile file count and locations
+   - Detects missing -fprofile-generate flag
+
+**Output**: 33 .gcda files (confirmed in latest run)
+
+### Compiler Flags
+
+```makefile
+# Profile Generation (Step 1)
+PROFILE_GEN_FLAGS = -fprofile-generate -flto
+
+# Profile Use (Step 3)
+PROFILE_USE_FLAGS = -fprofile-use -flto
+```
+
+**LTO**: Link-Time Optimization is enabled for both phases to maximize PGO effectiveness.
+
+---
+
+## Workflow Fix (2025-11-29)
+
+**Issue**: Initial implementation had `pgo-tiny-build` calling the profile collection script, causing:
+- Duplicate script execution
+- Unclear separation of concerns
+- Skipped `pgo-tiny-collect` in dependency chain
+
+**Fix**: Cleaned up the workflow:
+```makefile
+# Before (broken):
+pgo-tiny-full: pgo-tiny-profile pgo-tiny-build # Missing collect!
+
+# After (correct):
+pgo-tiny-full: pgo-tiny-profile pgo-tiny-collect pgo-tiny-build
+```
+
+**Result**: Each target now has a single responsibility:
+- `pgo-tiny-profile`: Build only
+- `pgo-tiny-collect`: Collect only
+- `pgo-tiny-build`: Build only
+- `pgo-tiny-full`: Orchestrate all steps
+
+---
+
+## Help Target Update
+
+The Makefile `help` target (lines 8-37) now includes:
+
+```
+Benchmarking (PGO-optimized, +6% faster):
+ make pgo-tiny-full - Full PGO workflow (~5-10 min)
+ = Profile + Optimize + Test
+ make pgo-tiny-profile - Step 1: Build profile binaries
+ make pgo-tiny-collect - Step 2: Collect profile data
+ make pgo-tiny-build - Step 3: Build optimized
+
+Phase 4 Performance:
+ Baseline: 57.0 M ops/s
+ PGO-optimized: 60.6 M ops/s (+6.25%)
+
+TIP: For best performance, use 'make pgo-tiny-full'
+```
+
+This ensures developers won't forget how to use PGO builds.
+
+---
+
+## Artifacts
+
+### New Files
+- `scripts/box/pgo_tiny_profile_config.sh` - Workload definitions
+- `scripts/box/pgo_tiny_profile_box.sh` - Collection automation
+- `PHASE4_STEP1_COMPLETE.md` - This completion report
+
+### Modified Files
+- `Makefile` (lines 8-37) - Help target with PGO instructions
+- `Makefile` (lines 1305-1356) - PGO workflow targets
+
+### Documentation
+- `CURRENT_TASK.md` - Phase 4 roadmap
+- `docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md` - Complete Box design
+
+---
+
+## Box Pattern Compliance
+
+✅ **Single Responsibility**: Profile collection is a separate Box
+✅ **Clear Contract**: Workloads → .gcda files → Optimized binaries
+✅ **Observable**: Progress reporting, error detection, summary statistics
+✅ **Safe**: Validation, timeouts, fail-fast on errors
+✅ **Testable**: Deterministic seeds for reproducibility
+
+---
+
+## Next Steps
+
+### Phase 4-Step2: Hot/Cold Path Box
+- **Target**: +10-15% improvement (60.6 → 70.0 M ops/s)
+- **Approach**: Separate hot (inline, likely) and cold (noinline, unlikely) paths
+- **Design**: Already specified in `PHASE4_TINY_FRONT_BOX_DESIGN.md`
+
+### Phase 4-Step3: Front Config Box
+- **Target**: +5-8% improvement (70.0 → 76.0 M ops/s)
+- **Approach**: Compile-time config optimization
+- **Design**: Already specified in design doc
+
+**Overall Phase 4 Target**: 73-83 M ops/s (vs current 60.6 M ops/s)
+
+---
+
+## Lessons Learned
+
+1. **PGO is high ROI**: +6.25% with zero code changes, ~30 minutes of work
+2. **Representative workloads matter**: 5 diverse workloads > 1 simple workload
+3. **Automation is critical**: Manual PGO workflows are error-prone
+4. **Box pattern scales**: Profile collection fits the Box pattern naturally
+5. **Help targets prevent forgetting**: Make workflows discoverable
+
+---
+
+## Conclusion
+
+Phase 4-Step1 successfully implemented PGO optimization using the Box pattern, achieving **+6.25% performance improvement** (57.0 → 60.6 M ops/s) with:
+- ✅ Fully automated workflow (`make pgo-tiny-full`)
+- ✅ Reproducible results (deterministic seeds)
+- ✅ Clear documentation (help target, design doc)
+- ✅ Robust error handling (validation, timeouts)
+- ✅ Within expected range (+5-10%)
+
+**Status**: Ready to proceed to Phase 4-Step2 (Hot/Cold Path Box)
+
+---
+
+**Signed**: Claude (2025-11-29)
+**Commit**: TBD (pending git commit)
diff --git a/docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md b/docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md
new file mode 100644
index 00000000..88fe9f3a
--- /dev/null
+++ b/docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md
@@ -0,0 +1,724 @@
+# Phase 4: Tiny Front Optimization - Box Design Document
+
+**Date**: 2025-11-29
+**Author**: Claude Code
+**Goal**: 2x throughput improvement via Boxification + PGO + Hot/Cold separation
+
+---
+
+## Design Philosophy
+
+### Boxification Principles
+1. **Single Responsibility**: 1 Box = 1 clear responsibility
+2. **Clear Contracts**: State inputs, outputs, and guarantees explicitly
+3. **Macro-Based Pointers**: Type safety, null checks, unified API
+4. **Testability**: Each Box can be tested independently
+5. **Incremental**: Implement and verify in stages
+
+### Pointer Safety Strategy
+**Abstract all pointer operations behind macros**:
+- Unified null checks
+- Safe type casts
+- Assertions in debug builds
+- Optimized away in release builds
+
+---
+
+## Box 1: PGO Profile Collection Box
+
+### Responsibility
+Standardize and automate PGO profile collection for the Tiny Front
+
+### File Layout
+```
+scripts/box/pgo_tiny_profile_box.sh - Main script
+scripts/box/pgo_tiny_profile_config.sh - Configuration (workload definitions)
+```
+
+### Contract
+
+**Input**:
+- Built binaries with `-fprofile-generate -flto`
+- `bench_random_mixed_hakmem`
+- `bench_tiny_hot_hakmem`
+
+**Output**:
+- `.gcda` profile data files
+- Profile summary report
+
+**Guarantees**:
+- Deterministic execution (fixed seed)
+- Representative workload coverage
+- Error detection & reporting
+
+### Implementation
+
+#### scripts/box/pgo_tiny_profile_box.sh
+```bash
+#!/bin/bash
+# Box: PGO Profile Collection
+# Contract: Execute representative Tiny workloads for PGO
+# Usage: ./scripts/box/pgo_tiny_profile_box.sh
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"
+
+echo "========================================="
+echo "Box: PGO Profile Collection (Tiny Front)"
+echo "========================================="
+
+# Validate binaries exist
+for bin in "${PGO_BINARIES[@]}"; do
+    if [[ ! -x "$bin" ]]; then
+        echo "ERROR: Binary not found or not executable: $bin"
+        exit 1
+    fi
+done
+
+# Clean old profile data
+echo "[PGO_BOX] Cleaning old .gcda files..."
+find . -name "*.gcda" -delete
+
+# Execute workloads
+echo "[PGO_BOX] Executing representative workloads..."
+
+for workload in "${PGO_WORKLOADS[@]}"; do
+    echo "[PGO_BOX] Running: $workload"
+    eval "$workload"
+done
+
+# Verify profile data generated
+GCDA_COUNT=$(find . -name "*.gcda" | wc -l)
+if [[ $GCDA_COUNT -eq 0 ]]; then
+    echo "ERROR: No .gcda files generated!"
+    exit 1
+fi
+
+echo "[PGO_BOX] Profile collection complete"
+echo "[PGO_BOX] Generated $GCDA_COUNT .gcda files"
+echo "========================================="
+```
+
+#### scripts/box/pgo_tiny_profile_config.sh
+```bash
+#!/bin/bash
+# Box: PGO Profile Configuration
+# Purpose: Define representative workloads for Tiny Front
+
+# Binaries to profile
+PGO_BINARIES=(
+    "./bench_random_mixed_hakmem"
+    "./bench_tiny_hot_hakmem"
+)
+
+# Representative workloads (deterministic seeds)
+PGO_WORKLOADS=(
+    # Random mixed: Common case (medium working set)
+    "./bench_random_mixed_hakmem 5000000 256 42"
+
+    # Random mixed: Smaller working set (higher cache hit)
+    "./bench_random_mixed_hakmem 5000000 128 42"
+
+    # Random mixed: Larger working set (more diverse)
+    "./bench_random_mixed_hakmem 5000000 512 42"
+
+    # Tiny hot path: 16B allocations
+    "./bench_tiny_hot_hakmem 16 100 60000"
+
+    # Tiny hot path: 64B allocations
+    "./bench_tiny_hot_hakmem 64 100 60000"
+)
+```
+
+### Makefile Integration
+```makefile
+# PGO Tiny Profile Build
+pgo-tiny-profile:
+	@echo "Building PGO profile binaries..."
+	$(MAKE) clean
+	$(MAKE) CFLAGS+="-fprofile-generate -flto" \
+		LDFLAGS+="-fprofile-generate -flto" \
+		HAKMEM_BUILD_RELEASE=1 \
+		HAKMEM_TINY_FRONT_PGO=1 \
+		bench_random_mixed_hakmem bench_tiny_hot_hakmem
+
+# PGO Tiny Optimized Build
+pgo-tiny-build:
+	@echo "Collecting PGO profile data..."
+	./scripts/box/pgo_tiny_profile_box.sh
+	@echo "Building PGO-optimized binaries..."
+	$(MAKE) clean
+	$(MAKE) CFLAGS+="-fprofile-use -flto" \
+		LDFLAGS+="-fprofile-use -flto" \
+		HAKMEM_BUILD_RELEASE=1 \
+		HAKMEM_TINY_FRONT_PGO=1 \
+		bench_random_mixed_hakmem
+
+# PGO Full Workflow
+pgo-tiny-full: pgo-tiny-profile pgo-tiny-build
+	@echo "PGO optimization complete!"
+	@echo "Testing optimized binary..."
+	./bench_random_mixed_hakmem 1000000 256 42
+```
+
+---
+
+## Box 2: Tiny Front Hot Path Box
+
+### Responsibility
+Ultra-fast allocation path (minimal branch count, always_inline)
+
+### File Layout
+```
+core/box/tiny_front_hot_box.h - Hot path implementation
+core/box/tiny_front_hot_box_macros.h - Pointer safety macros
+```
+
+### Contract
+
+**Preconditions**:
+- `class_idx` validated (0-7)
+- TLS initialized
+- Not in slow path mode
+
+**Guarantees**:
+- Maximum 5-7 branches
+- Always inlined
+- Null-safe pointer operations
+- PGO-optimized
+
+**Performance**:
+- Hit case: ~20-30 cycles
+- Miss case: → Cold Box (~100-200 cycles)
+
+### Pointer Safety Macros
+
+#### core/box/tiny_front_hot_box_macros.h
+```c
+#ifndef TINY_FRONT_HOT_BOX_MACROS_H
+#define TINY_FRONT_HOT_BOX_MACROS_H
+
+#include <stddef.h>
+#include <stdint.h>
+
+// ========== Pointer Type Definitions ==========
+
+// Opaque pointer types for type safety
+typedef void* TinyHotPtr; // User-facing allocation pointer
+typedef void* TinySLLNode; // SLL node pointer
+typedef void* TinySlabBase; // Slab base pointer
+
+// ========== Pointer Safety Macros ==========
+
+#if HAKMEM_BUILD_RELEASE
+    // Release: No overhead
+    #define TINY_HOT_PTR_CHECK(ptr) (ptr)
+    #define TINY_HOT_PTR_CAST(type, ptr) ((type)(ptr))
+    #define TINY_HOT_PTR_NULL NULL
+    #define TINY_HOT_PTR_IS_NULL(ptr) ((ptr) == NULL)
+    #define TINY_HOT_PTR_IS_VALID(ptr) ((ptr) != NULL)
+#else
+    // Debug: Assertions enabled
+    #include <assert.h>
+    #define TINY_HOT_PTR_CHECK(ptr) \
+        ({ void* _p = (ptr); \
+           assert(((uintptr_t)_p & 0x7) == 0 && "Pointer not 8-byte aligned"); \
+           _p; })
+    #define TINY_HOT_PTR_CAST(type, ptr) \
+        ((type)TINY_HOT_PTR_CHECK(ptr))
+    #define TINY_HOT_PTR_NULL NULL
+    #define TINY_HOT_PTR_IS_NULL(ptr) ((ptr) == NULL)
+    #define TINY_HOT_PTR_IS_VALID(ptr) \
+        ({ void* _p = (ptr); \
+           _p != NULL && ((uintptr_t)_p & 0x7) == 0; })
+#endif
+
+// ========== SLL Operations Macros ==========
+
+// Read next pointer from SLL node
+#define TINY_HOT_SLL_NEXT(node) \
+    TINY_HOT_PTR_CAST(TinySLLNode, tiny_nextptr_get(node))
+
+// Write next pointer to SLL node
+#define TINY_HOT_SLL_SET_NEXT(node, next) \
+    tiny_nextptr_set((node), TINY_HOT_PTR_CHECK(next))
+
+// Pop from TLS SLL (class-specific)
+#define TINY_HOT_SLL_POP(class_idx) \
+    TINY_HOT_PTR_CAST(TinyHotPtr, tls_sll_pop_inline(class_idx))
+
+// Push to TLS SLL (class-specific)
+#define TINY_HOT_SLL_PUSH(class_idx, ptr) \
+    tls_sll_push_inline((class_idx), TINY_HOT_PTR_CHECK(ptr))
+
+// ========== Likely/Unlikely Hints ==========
+
+#define TINY_HOT_LIKELY(x) __builtin_expect(!!(x), 1)
+#define TINY_HOT_UNLIKELY(x) __builtin_expect(!!(x), 0)
+
+// ========== Branch Prediction Hints ==========
+
+// Expected: SLL hit (80-90% of allocations)
+#define TINY_HOT_EXPECT_HIT(ptr) \
+    TINY_HOT_LIKELY(TINY_HOT_PTR_IS_VALID(ptr))
+
+// Expected: SLL miss (10-20% of allocations)
+#define TINY_HOT_EXPECT_MISS(ptr) \
+    TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))
+
+#endif // TINY_FRONT_HOT_BOX_MACROS_H
+```
+
+### Implementation
+
+#### core/box/tiny_front_hot_box.h
+```c
+#ifndef TINY_FRONT_HOT_BOX_H
+#define TINY_FRONT_HOT_BOX_H
+
+#include "tiny_front_hot_box_macros.h"
+#include "../tiny_next_ptr_box.h" // tiny_nextptr_get/set
+#include "../tls_sll_box.h" // TLS SLL operations
+
+// Forward declaration for cold path
+void* tiny_front_cold_refill(int class_idx)
+    __attribute__((noinline, cold));
+
+// ========== Box: Tiny Front Hot Path ==========
+// Contract: Ultra-fast allocation with 5-7 branches max
+// Precondition: class_idx validated (0-7), TLS initialized
+// Performance: ~20-30 cycles (hit), ~100-200 cycles (miss → cold)
+// Optimization: always_inline + PGO + branch hints
+
+__attribute__((always_inline))
+static inline TinyHotPtr tiny_front_hot_alloc(int class_idx)
+{
+    // Branch 1: TLS SLL pop (expected: 80-90% hit)
+    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);
+
+    // Branch 2: Check if hit (optimized by PGO)
+    if (TINY_HOT_EXPECT_HIT(ptr)) {
+        // Fast path exit: ~20-30 cycles total
+        return ptr;
+    }
+
+    // Branch 3: Miss → Cold path refill (10-20% of allocations)
+    return TINY_HOT_PTR_CAST(TinyHotPtr, tiny_front_cold_refill(class_idx));
+}
+
+// ========== Box: Tiny Front Hot Free ==========
+// Contract: Ultra-fast free with 3-5 branches max
+// Precondition: ptr is valid Tiny allocation
+// Performance: ~15-25 cycles
+
+__attribute__((always_inline))
+static inline void tiny_front_hot_free(TinyHotPtr ptr, int class_idx)
+{
+    // Branch 1: Null check (expected: rare)
+    if (TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))) {
+        return;
+    }
+
+    // Branch 2: Push to TLS SLL (expected: always succeeds)
+    TINY_HOT_SLL_PUSH(class_idx, ptr);
+
+    // Fast path exit: ~15-25 cycles total
+}
+
+#endif // TINY_FRONT_HOT_BOX_H
+```
+
+---
+
+## Box 3: Tiny Front Cold Path Box
+
+### Responsibility
+Low-frequency allocation/free slow path (noinline, cold attributes)
+
+### File Layout
+```
+core/box/tiny_front_cold_box.h - Cold path implementation
+```
+
+### Contract
+
+**Called When**:
+- TLS SLL miss (refill needed)
+- Slow allocation path (debug, large size, etc.)
+
+**Guarantees**:
+- I-cache separated from hot path
+- Heavy operations allowed
+- Can call into ACE, learning, diagnostics
+
+**Optimization**:
+- `noinline` → Not inlined into hot path
+- `cold` → Compiler puts in cold section
+
+### Implementation
+
+#### core/box/tiny_front_cold_box.h
+```c
+#ifndef TINY_FRONT_COLD_BOX_H
+#define TINY_FRONT_COLD_BOX_H
+
+#include <stddef.h>
+#include "tiny_front_hot_box_macros.h"
+
+// ========== Box: Tiny Front Cold Refill ==========
+// Contract: Refill TLS SLL when empty
+// Called: 10-20% of allocations (SLL miss)
+// Performance: ~100-200 cycles (acceptable for miss case)
+// Optimization: noinline, cold → separated from hot path
+
+__attribute__((noinline, cold))
+void* tiny_front_cold_refill(int class_idx)
+{
+    // Heavy refill logic
+    // - May allocate new SuperSlab
+    // - May trigger ACE learning
+    // - May call into diagnostics
+
+    // Call existing refill logic
+    tiny_fast_refill_and_take(class_idx);
+
+    // After refill, try pop again
+    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);
+
+    if (TINY_HOT_PTR_IS_VALID(ptr)) {
+        return ptr;
+    }
+
+    // Refill failed → slow allocation
+    return tiny_front_cold_slow_alloc(0, class_idx); // size=0 (unknown)
+}
+
+// ========== Box: Tiny Front Cold Slow Alloc ==========
+// Contract: Slowest allocation path (debug, diagnostics, ACE)
+// Called: Rare (refill failure, special modes)
+// Performance: ~500-1000+ cycles (acceptable for rare case)
+
+__attribute__((noinline, cold))
+void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
+{
+    // Debug/diagnostic/ACE learning hooks
+    // - Allocation site tracking
+    // - Size class profiling
+    // - Memory pressure monitoring
+
+    // Call legacy slow path
+    return hak_tiny_alloc_slow(size, class_idx);
+}
+
+// ========== Box: Tiny Front Cold Drain ==========
+// Contract: Drain remote frees (batched, low frequency)
+// Called: Background or on threshold
+// Optimization: noinline, cold
+
+__attribute__((noinline, cold))
+void tiny_front_cold_drain_remote(int class_idx)
+{
+    // Drain remote free lists into TLS SLL
+    // - Batch processing for efficiency
+    // - May trigger ACE rebalancing
+
+    tiny_remote_drain_to_sll(class_idx);
+}
+
+#endif // TINY_FRONT_COLD_BOX_H
+```
+
+---
+
+## Box 4: Tiny Front Config Box
+
+### Responsibility
+Centralized management of Tiny Front configuration (compile-time/runtime switching)
+
+### File Layout
+```
+core/box/tiny_front_config_box.h - Configuration management
+core/hakmem_build_flags.h - Build flag definitions (existing)
+```
+
+### Contract
+
+**Compile-time Mode (PGO builds)**:
+- `HAKMEM_TINY_FRONT_PGO=1`
+- All runtime checks → compile-time constants
+- Unused branches eliminated by compiler
+
+**Runtime Mode (normal builds)**:
+- Backward compatible
+- ENV variable checks as before
+- Full feature set available
+
+### Implementation
+
+#### core/box/tiny_front_config_box.h
+```c
+#ifndef TINY_FRONT_CONFIG_BOX_H
+#define TINY_FRONT_CONFIG_BOX_H
+
+// ========== Build Flag Definitions ==========
+// Location: core/hakmem_build_flags.h
+
+#ifndef HAKMEM_TINY_FRONT_PGO
+# define HAKMEM_TINY_FRONT_PGO 0
+#endif
+
+// ========== PGO Mode: Fixed Configuration ==========
+
+#if HAKMEM_TINY_FRONT_PGO
+    // PGO build: Fix configuration for profiling/optimization
+    // All runtime checks become compile-time constants
+
+    #define TINY_FRONT_ULTRA_SLIM_ENABLED 0
+    #define TINY_FRONT_HEAP_V2_ENABLED 0
+    #define TINY_FRONT_SFC_ENABLED 1
+    #define TINY_FRONT_FASTCACHE_ENABLED 0
+    #define TINY_FRONT_UNIFIED_GATE_ENABLED 1
+    #define TINY_FRONT_METRICS_ENABLED 0
+    #define TINY_FRONT_DIAG_ENABLED 0
+
+    // Optimization: Constant folding eliminates dead branches
+    // Example:
+    //   if (TINY_FRONT_HEAP_V2_ENABLED) { ... }
+    //   → Compiler eliminates entire block (0 is constant false)
+
+#else
+    // Normal build: Runtime configuration (backward compatible)
+    // Checks ENV variables or config state
+
+    #define TINY_FRONT_ULTRA_SLIM_ENABLED ultra_slim_mode_enabled()
+    #define TINY_FRONT_HEAP_V2_ENABLED tiny_heap_v2_enabled()
+    #define TINY_FRONT_SFC_ENABLED sfc_cascade_enabled()
+    #define TINY_FRONT_FASTCACHE_ENABLED tiny_fastcache_enabled()
+    #define TINY_FRONT_UNIFIED_GATE_ENABLED front_gate_unified_enabled()
+    #define TINY_FRONT_METRICS_ENABLED tiny_metrics_enabled()
+    #define TINY_FRONT_DIAG_ENABLED tiny_diag_enabled()
+
+#endif // HAKMEM_TINY_FRONT_PGO
+
+// ========== Configuration Helpers ==========
+
+// Check if running in PGO-optimized build
+static inline int tiny_front_is_pgo_build(void)
+{
+    return HAKMEM_TINY_FRONT_PGO;
+}
+
+// Get effective configuration (for diagnostics)
+static inline void tiny_front_config_report(void)
+{
+#if !HAKMEM_BUILD_RELEASE
+    fprintf(stderr, "[TINY_FRONT_CONFIG]\n");
+    fprintf(stderr, " PGO Build: %d\n", HAKMEM_TINY_FRONT_PGO);
+    fprintf(stderr, " Ultra SLIM: %d\n", TINY_FRONT_ULTRA_SLIM_ENABLED);
+    fprintf(stderr, " Heap V2: %d\n", TINY_FRONT_HEAP_V2_ENABLED);
+    fprintf(stderr, " SFC: %d\n", TINY_FRONT_SFC_ENABLED);
+    fprintf(stderr, " FastCache: %d\n", TINY_FRONT_FASTCACHE_ENABLED);
+    fprintf(stderr, " Unified Gate: %d\n", TINY_FRONT_UNIFIED_GATE_ENABLED);
+#endif
+}
+
+#endif // TINY_FRONT_CONFIG_BOX_H
+```
+
+#### Update to core/hakmem_build_flags.h
+```c
+// Add around line 190:
+
+// HAKMEM_TINY_FRONT_PGO:
+//   0 = Normal build with runtime configuration (default)
+//   1 = PGO-optimized build with compile-time configuration
+//   Eliminates runtime branches for maximum performance.
+// Use with: make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-build +#ifndef HAKMEM_TINY_FRONT_PGO +# define HAKMEM_TINY_FRONT_PGO 0 +#endif +``` + +--- + +## Integration: Refactor tiny_alloc_fast() + +### Before (one complex function, 15-20 branches) +```c +void* tiny_alloc_fast(size_t size) { + // Ultra SLIM check + if (ultra_slim_mode_enabled()) { ... } + + // Size to class + int class_idx = hak_tiny_size_to_class(size); + + // Metrics + if (tiny_metrics_enabled()) { tiny_sizeclass_hist_hit(class_idx); } + + // Heap V2 check + if (tiny_heap_v2_enabled()) { ... } + + // FastCache check + if (tiny_fastcache_enabled()) { ... } + + // SFC cascade + if (sfc_cascade_enabled()) { ... } + + // TLS SLL pop + void* ptr = tls_sll_pop(class_idx); + if (ptr) return ptr; + + // Refill logic (complex) + ... +} +``` + +### After (boxified, only 3-5 branches) +```c +// Include new boxes +#include "core/box/tiny_front_config_box.h" +#include "core/box/tiny_front_hot_box.h" +#include "core/box/tiny_front_cold_box.h" + +void* tiny_alloc_fast(size_t size) { + // Branch 1: Ultra SLIM mode check (compile-time constant in PGO) + if (TINY_FRONT_ULTRA_SLIM_ENABLED) { + return tiny_ultra_slim_alloc(size); // Separate path + } + + // Branch 2: Size to class (always needed) + int class_idx = hak_tiny_size_to_class(size); + + // Branch 3: Hot path (inlined, 2-3 branches inside) + return tiny_front_hot_alloc(class_idx); + + // Total branches in PGO build: 2-3 + // (Ultra SLIM = 0 → eliminated, hot_alloc inlined) +} +``` + +**Effective branch count after PGO optimization**: **only 2-3 branches**! 
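The constant-folding idea behind the Config Box can be illustrated with a minimal sketch. Every name below (`DEMO_FRONT_PGO`, `DEMO_SLIM_ENABLED`, `demo_slim_enabled_runtime`, `demo_alloc_path`, `demo_check`, and the `DEMO_ULTRA_SLIM` variable) is hypothetical and chosen only for illustration; none of them is part of hakmem, and the real boxes may be structured differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical build flag for this sketch; plays the role of HAKMEM_TINY_FRONT_PGO. */
#ifndef DEMO_FRONT_PGO
#  define DEMO_FRONT_PGO 1
#endif

/* Stand-in for a runtime ENV lookup (assumed variable name DEMO_ULTRA_SLIM). */
static inline int demo_slim_enabled_runtime(void)
{
    const char *v = getenv("DEMO_ULTRA_SLIM");
    return v != NULL && v[0] == '1';
}

#if DEMO_FRONT_PGO
   /* Fixed configuration: the condition is a literal 0. */
#  define DEMO_SLIM_ENABLED 0
#else
   /* Normal build: the condition is evaluated on every call. */
#  define DEMO_SLIM_ENABLED demo_slim_enabled_runtime()
#endif

/* Returns 1 when the slim path is taken, 2 when the hot path is taken. */
static inline int demo_alloc_path(void)
{
    if (DEMO_SLIM_ENABLED) {
        return 1;  /* unreachable when DEMO_FRONT_PGO == 1, so the compiler can drop it */
    }
    return 2;
}

/* Sanity check: in the fixed-configuration build only the hot path is reachable. */
static inline void demo_check(void)
{
#if DEMO_FRONT_PGO
    assert(demo_alloc_path() == 2);
#endif
}
```

Compiling the same file with `-DDEMO_FRONT_PGO=0` switches the call site back to the runtime check, which corresponds to the backward-compatible mode described in the Contract.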
+ +--- + +## Testing Strategy + +### Step 1: PGO Workflow Test +```bash +# Build profile version +make pgo-tiny-profile + +# Collect profiles (automated) +./scripts/box/pgo_tiny_profile_box.sh + +# Build optimized version +make pgo-tiny-build + +# Benchmark +./bench_random_mixed_hakmem 1000000 256 42 +./bench_tiny_hot_hakmem + +# Expected: +5-10% improvement +``` + +### Step 2: Hot/Cold Separation Test +```bash +# Build with hot/cold boxes +make clean +make HAKMEM_TINY_FRONT_PGO=1 bench_random_mixed_hakmem + +# Benchmark +./bench_random_mixed_hakmem 1000000 256 42 + +# Expected: +10-15% improvement (cumulative +15-25%) +``` + +### Step 3: Config Box Test +```bash +# PGO build (compile-time config) +make clean +make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-full + +# Normal build (runtime config) +make clean +make bench_random_mixed_hakmem + +# Both should work, PGO should be faster +``` + +### Regression Testing +```bash +# Ensure backward compatibility +HAKMEM_TINY_ULTRA_SLIM=1 ./bench_random_mixed_hakmem 100000 256 +HAKMEM_TINY_HEAP_V2=1 ./bench_random_mixed_hakmem 100000 256 + +# All existing ENV vars should work in normal builds +``` + +--- + +## Performance Expectations + +### Branch Reduction +- **Before**: 15-20 branches in `tiny_alloc_fast()` +- **After (PGO)**: 2-3 branches (most eliminated by compiler) +- **Gain**: ~40-60% reduction in branch misses + +### Instruction Count +- **Before**: ~167M instructions (1M ops) +- **After**: ~120-140M instructions +- **Gain**: ~16-28% reduction + +### Throughput +- **Phase 3**: 56.8M ops/s +- **Phase 4.1 (PGO)**: 60-62M ops/s (+5-10%) +- **Phase 4.2 (Hot/Cold)**: 68-75M ops/s (+10-15%) +- **Phase 4.3 (Config)**: 73-83M ops/s (+5-8%) + +**Total Improvement**: +28-46% → **approaching 2x** + +--- + +## Implementation Schedule + +### Week 1: PGO Workflow +- Day 1-2: PGO scripts + Makefile +- Day 3: Profile collection + benchmarking +- Day 4: Documentation + review + +### Week 2: Hot/Cold Separation +- Day 1-2: Hot Box + macros +- Day 3-4: 
Cold Box + refactor +- Day 5: Testing + PGO re-optimization + +### Week 3: Config Box + Polish +- Day 1-2: Config Box implementation +- Day 3: Integration testing +- Day 4-5: Final benchmarks + documentation + +--- + +## Success Criteria + +✅ **Code Quality**: +- All pointer operations use macros +- Clear contracts in each Box +- Zero regression in existing features + +✅ **Performance**: +- bench_random_mixed: 73-83M ops/s (vs 56.8M baseline) +- bench_tiny_hot: 100-115M ops/s (vs 81M baseline) +- No regression in other benchmarks + +✅ **Maintainability**: +- Hot/Cold separation clear +- PGO workflow documented +- Backward compatible + +--- + +Generated: 2025-11-29 +Phase: 4 Design +Next: Implementation (Week 1 start) diff --git a/scripts/box/pgo_tiny_profile_box.sh b/scripts/box/pgo_tiny_profile_box.sh new file mode 100755 index 00000000..16d00063 --- /dev/null +++ b/scripts/box/pgo_tiny_profile_box.sh @@ -0,0 +1,101 @@ +#!/bin/bash +# Box: PGO Profile Collection (Tiny Front) +# Contract: Execute representative Tiny workloads for PGO +# Usage: ./scripts/box/pgo_tiny_profile_box.sh +# +# Input: Built binaries with -fprofile-generate -flto +# Output: .gcda profile data files +# Guarantees: Deterministic execution, error detection, summary report + +set -e # Fail fast on errors + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh" + +echo "=========================================" +echo "Box: PGO Profile Collection (Tiny Front)" +echo "=========================================" +echo "Date: $(date)" +echo "Workloads: $PGO_WORKLOAD_COUNT" +echo "Binaries: $PGO_BINARY_COUNT" +echo "" + +# Validate binaries exist and are executable +echo "[PGO_BOX] Validating binaries..." +for bin in "${PGO_BINARIES[@]}"; do + if [[ ! -f "$bin" ]]; then + echo "ERROR: Binary not found: $bin" + exit 1 + fi + if [[ ! 
-x "$bin" ]]; then + echo "ERROR: Binary not executable: $bin" + chmod +x "$bin" || exit 1 + echo " Fixed: Made $bin executable" + fi + echo " ✓ $bin" +done +echo "" + +# Clean old profile data +echo "[PGO_BOX] Cleaning old .gcda files..." +GCDA_OLD_COUNT=$(find . -name "*.gcda" 2>/dev/null | wc -l) +if [[ $GCDA_OLD_COUNT -gt 0 ]]; then + find . -name "*.gcda" -delete + echo " Removed $GCDA_OLD_COUNT old .gcda files" +else + echo " No old .gcda files found" +fi +echo "" + +# Execute workloads +echo "[PGO_BOX] Executing representative workloads..." +echo "===========================================" +WORKLOAD_NUM=0 +for workload in "${PGO_WORKLOADS[@]}"; do + WORKLOAD_NUM=$((WORKLOAD_NUM + 1)) + echo "" + echo "[$WORKLOAD_NUM/$PGO_WORKLOAD_COUNT] Running: $workload" + echo "-------------------------------------------" + + # Execute with timeout (30s per workload) + if timeout 30 $workload; then + echo " ✓ Success" + else + EXIT_CODE=$? + if [[ $EXIT_CODE -eq 124 ]]; then + echo " ✗ TIMEOUT (30s exceeded)" + else + echo " ✗ FAILED (exit code: $EXIT_CODE)" + fi + echo "ERROR: Workload failed: $workload" + exit 1 + fi +done +echo "" +echo "===========================================" + +# Verify profile data generated +echo "[PGO_BOX] Verifying profile data..." +GCDA_COUNT=$(find . -name "*.gcda" 2>/dev/null | wc -l) +if [[ $GCDA_COUNT -eq 0 ]]; then + echo "ERROR: No .gcda files generated!" + echo " This usually means binaries were not built with -fprofile-generate" + exit 1 +fi + +echo " ✓ Generated $GCDA_COUNT .gcda files" +echo "" + +# Summary +echo "=========================================" +echo "PGO Profile Collection: SUCCESS" +echo "=========================================" +echo "Profile files: $GCDA_COUNT .gcda files" +echo "Next step: make pgo-tiny-build" +echo "" +echo "Profile locations:" +find . -name "*.gcda" | head -5 +if [[ $GCDA_COUNT -gt 5 ]]; then + echo " ... 
and $((GCDA_COUNT - 5)) more" +fi +echo "=========================================" diff --git a/scripts/box/pgo_tiny_profile_config.sh b/scripts/box/pgo_tiny_profile_config.sh new file mode 100755 index 00000000..aa96b1ea --- /dev/null +++ b/scripts/box/pgo_tiny_profile_config.sh @@ -0,0 +1,43 @@ +#!/bin/bash +# Box: PGO Profile Configuration +# Purpose: Define representative workloads for Tiny Front +# Contract: Provides workload definitions for PGO profile collection + +# Binaries to profile +PGO_BINARIES=( + "./bench_random_mixed_hakmem" + "./bench_tiny_hot_hakmem" +) + +# Representative workloads (deterministic seeds for reproducibility) +# Design: Cover diverse allocation patterns for optimal PGO data +PGO_WORKLOADS=( + # Random mixed: Common case (medium working set) + # - Most representative of general allocation patterns + # - 256 slots = moderate cache pressure + "./bench_random_mixed_hakmem 5000000 256 42" + + # Random mixed: Smaller working set (higher cache hit) + # - Exercises hot TLS SLL path heavily + # - 128 slots = higher hit rate + "./bench_random_mixed_hakmem 5000000 128 42" + + # Random mixed: Larger working set (more diverse) + # - Exercises refill and cold paths more + # - 512 slots = more SuperSlab allocations + "./bench_random_mixed_hakmem 5000000 512 42" + + # Tiny hot path: 16B allocations + # - Class 0 (smallest) intensive + # - High allocation frequency + "./bench_tiny_hot_hakmem 16 100 60000" + + # Tiny hot path: 64B allocations + # - Class 3 (common size) intensive + # - Typical small object pattern + "./bench_tiny_hot_hakmem 64 100 60000" +) + +# Configuration summary +PGO_WORKLOAD_COUNT=${#PGO_WORKLOADS[@]} +PGO_BINARY_COUNT=${#PGO_BINARIES[@]}