# Phase 4: Tiny Front Optimization - Box Design Document

**Date**: 2025-11-29
**Author**: Claude Code
**Goal**: 2x throughput improvement via Boxification + PGO + Hot/Cold separation

---

## Design Philosophy

### Box Design Principles

1. **Single Responsibility**: one Box = one clearly defined responsibility
2. **Clear Contracts**: inputs, outputs, and guarantees are stated explicitly
3. **Macro-Based Pointers**: type safety, null checks, a unified API
4. **Testability**: each Box can be tested independently
5. **Incremental**: implement and verify in stages

### Pointer Safety Strategy

**All pointer operations are abstracted behind macros**:
- Uniform null checks
- Safe type casts
- Assertions in debug builds
- Zero overhead in release builds

---

## Box 1: PGO Profile Collection Box

### Responsibility

Standardize and automate PGO profile collection for the Tiny Front.

### File Layout

```
scripts/box/pgo_tiny_profile_box.sh       - main script
scripts/box/pgo_tiny_profile_config.sh    - configuration (workload definitions)
```

### Contract

**Input**:
- Built binaries with `-fprofile-generate -flto`
  - `bench_random_mixed_hakmem`
  - `bench_tiny_hot_hakmem`

**Output**:
- `.gcda` profile data files
- Profile summary report

**Guarantees**:
- Deterministic execution (fixed seed)
- Representative workload coverage
- Error detection & reporting

### Implementation

#### scripts/box/pgo_tiny_profile_box.sh

```bash
#!/bin/bash
# Box: PGO Profile Collection
# Contract: Execute representative Tiny workloads for PGO
# Usage: ./scripts/box/pgo_tiny_profile_box.sh
#        (run from the directory that contains the benchmark binaries;
#         the workloads use ./relative paths)

set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"

echo "========================================="
echo "Box: PGO Profile Collection (Tiny Front)"
echo "========================================="

# Validate binaries exist
for bin in "${PGO_BINARIES[@]}"; do
    if [[ ! -x "$bin" ]]; then
        echo "ERROR: Binary not found or not executable: $bin"
        exit 1
    fi
done

# Clean old profile data
echo "[PGO_BOX] Cleaning old .gcda files..."
find . -name "*.gcda" -delete

# Execute workloads (set -e aborts on the first failing workload)
echo "[PGO_BOX] Executing representative workloads..."
for workload in "${PGO_WORKLOADS[@]}"; do
    echo "[PGO_BOX] Running: $workload"
    eval "$workload"
done

# Verify profile data generated
GCDA_COUNT=$(find . -name "*.gcda" | wc -l)
if [[ $GCDA_COUNT -eq 0 ]]; then
    echo "ERROR: No .gcda files generated!"
    exit 1
fi

echo "[PGO_BOX] Profile collection complete"
echo "[PGO_BOX] Generated $GCDA_COUNT .gcda files"
echo "========================================="
```

#### scripts/box/pgo_tiny_profile_config.sh

```bash
#!/bin/bash
# Box: PGO Profile Configuration
# Purpose: Define representative workloads for Tiny Front

# Binaries to profile
PGO_BINARIES=(
    "./bench_random_mixed_hakmem"
    "./bench_tiny_hot_hakmem"
)

# Representative workloads (deterministic seeds)
PGO_WORKLOADS=(
    # Random mixed: Common case (medium working set)
    "./bench_random_mixed_hakmem 5000000 256 42"

    # Random mixed: Smaller working set (higher cache hit)
    "./bench_random_mixed_hakmem 5000000 128 42"

    # Random mixed: Larger working set (more diverse)
    "./bench_random_mixed_hakmem 5000000 512 42"

    # Tiny hot path: 16B allocations
    "./bench_tiny_hot_hakmem 16 100 60000"

    # Tiny hot path: 64B allocations
    "./bench_tiny_hot_hakmem 64 100 60000"
)
```

### Makefile Integration

```makefile
.PHONY: pgo-tiny-profile pgo-tiny-build pgo-tiny-full

# PGO Tiny Profile Build
pgo-tiny-profile:
	@echo "Building PGO profile binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-generate -flto" \
	        LDFLAGS+="-fprofile-generate -flto" \
	        HAKMEM_BUILD_RELEASE=1 \
	        HAKMEM_TINY_FRONT_PGO=1 \
	        bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Tiny Optimized Build
# NOTE: assumes `clean` leaves *.gcda files in place; if it deletes them,
#       the profile collected below is lost before -fprofile-use runs.
# Both benchmarks are rebuilt here because `clean` removes the profiling
# binaries and the test steps run both.
pgo-tiny-build:
	@echo "Collecting PGO profile data..."
	./scripts/box/pgo_tiny_profile_box.sh
	@echo "Building PGO-optimized binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-use -flto" \
	        LDFLAGS+="-fprofile-use -flto" \
	        HAKMEM_BUILD_RELEASE=1 \
	        HAKMEM_TINY_FRONT_PGO=1 \
	        bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Full Workflow
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build
	@echo "PGO optimization complete!"
	@echo "Testing optimized binary..."
	./bench_random_mixed_hakmem 1000000 256 42
```

---

## Box 2: Tiny Front Hot Path Box

### Responsibility

Ultra-fast allocation path (minimal branch count, `always_inline`).

### File Layout

```
core/box/tiny_front_hot_box.h         - Hot path implementation
core/box/tiny_front_hot_box_macros.h  - Pointer safety macros
```

### Contract

**Preconditions**:
- `class_idx` validated (0-7)
- TLS initialized
- Not in slow path mode

**Guarantees**:
- Maximum 5-7 branches
- Always inlined
- Null-safe pointer operations
- PGO-optimized

**Performance**:
- Hit case: ~20-30 cycles
- Miss case: → Cold Box (~100-200 cycles)

### Pointer Safety Macros

#### core/box/tiny_front_hot_box_macros.h

```c
#ifndef TINY_FRONT_HOT_BOX_MACROS_H
#define TINY_FRONT_HOT_BOX_MACROS_H

#include <stdint.h>  // uintptr_t
#include <stddef.h>  // NULL

// ========== Pointer Type Definitions ==========
// Pointer aliases that document intent. All three are plain void*, so the
// compiler does not distinguish them; the checks come from the macros below.
typedef void* TinyHotPtr;    // User-facing allocation pointer
typedef void* TinySLLNode;   // SLL node pointer
typedef void* TinySlabBase;  // Slab base pointer

// ========== Pointer Safety Macros ==========

#if HAKMEM_BUILD_RELEASE
    // Release: No overhead
    #define TINY_HOT_PTR_CHECK(ptr) (ptr)
    #define TINY_HOT_PTR_CAST(type, ptr) ((type)(ptr))
    #define TINY_HOT_PTR_NULL NULL
    #define TINY_HOT_PTR_IS_NULL(ptr) ((ptr) == NULL)
    #define TINY_HOT_PTR_IS_VALID(ptr) ((ptr) != NULL)
#else
    // Debug: Assertions enabled
    // (uses GNU statement expressions, so GCC/Clang only)
    #include <assert.h>
    // NULL passes the alignment check; use TINY_HOT_PTR_IS_VALID to reject it.
    #define TINY_HOT_PTR_CHECK(ptr) \
        ({ void* _p = (ptr); \
           assert(((uintptr_t)_p & 0x7) == 0 && "Pointer not 8-byte aligned"); \
           _p; })
    #define TINY_HOT_PTR_CAST(type, ptr) \
        ((type)TINY_HOT_PTR_CHECK(ptr))
    #define TINY_HOT_PTR_NULL NULL
    #define TINY_HOT_PTR_IS_NULL(ptr) ((ptr) == NULL)
    #define TINY_HOT_PTR_IS_VALID(ptr) \
        ({ void* _p = (ptr); \
           _p != NULL && ((uintptr_t)_p & 0x7) == 0; })
#endif

// ========== SLL Operations Macros ==========

// Read next pointer from SLL node
#define TINY_HOT_SLL_NEXT(node) \
    TINY_HOT_PTR_CAST(TinySLLNode, tiny_nextptr_get(node))

// Write next pointer to SLL node
#define TINY_HOT_SLL_SET_NEXT(node, next) \
    tiny_nextptr_set((node), TINY_HOT_PTR_CHECK(next))

// Pop from TLS SLL (class-specific)
#define TINY_HOT_SLL_POP(class_idx) \
    TINY_HOT_PTR_CAST(TinyHotPtr, tls_sll_pop_inline(class_idx))

// Push to TLS SLL (class-specific)
#define TINY_HOT_SLL_PUSH(class_idx, ptr) \
    tls_sll_push_inline((class_idx), TINY_HOT_PTR_CHECK(ptr))

// ========== Likely/Unlikely Hints ==========

#define TINY_HOT_LIKELY(x)   __builtin_expect(!!(x), 1)
#define TINY_HOT_UNLIKELY(x) __builtin_expect(!!(x), 0)

// ========== Branch Prediction Hints ==========

// Expected: SLL hit (80-90% of allocations)
#define TINY_HOT_EXPECT_HIT(ptr) \
    TINY_HOT_LIKELY(TINY_HOT_PTR_IS_VALID(ptr))

// Expected: SLL miss (10-20% of allocations)
#define TINY_HOT_EXPECT_MISS(ptr) \
    TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))

#endif // TINY_FRONT_HOT_BOX_MACROS_H
```

### Implementation

#### core/box/tiny_front_hot_box.h

```c
#ifndef TINY_FRONT_HOT_BOX_H
#define TINY_FRONT_HOT_BOX_H

#include "tiny_front_hot_box_macros.h"
#include "../tiny_next_ptr_box.h"  // tiny_nextptr_get/set
#include "../tls_sll_box.h"        // TLS SLL operations

// Forward declaration for cold path
void* tiny_front_cold_refill(int class_idx) __attribute__((noinline, cold));

// ========== Box: Tiny Front Hot Path ==========
// Contract: Ultra-fast allocation with 5-7 branches max
// Precondition: class_idx validated (0-7), TLS initialized
// Performance: ~20-30 cycles (hit), ~100-200 cycles (miss → cold)
// Optimization: always_inline + PGO + branch hints

__attribute__((always_inline))
static inline TinyHotPtr tiny_front_hot_alloc(int class_idx) {
    // Branch 1: TLS SLL pop (expected: 80-90% hit)
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    // Branch 2: Check if hit (optimized by PGO)
    if (TINY_HOT_EXPECT_HIT(ptr)) {
        // Fast path exit: ~20-30 cycles total
        return ptr;
    }

    // Branch 3: Miss → Cold path refill (10-20% of allocations)
    return TINY_HOT_PTR_CAST(TinyHotPtr, tiny_front_cold_refill(class_idx));
}

// ========== Box: Tiny Front Hot Free ==========
// Contract: Ultra-fast free with 3-5 branches max
// Precondition: ptr is valid Tiny allocation
// Performance: ~15-25 cycles

__attribute__((always_inline))
static inline void tiny_front_hot_free(TinyHotPtr ptr, int class_idx) {
    // Branch 1: Null check (expected: rare)
    if (TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))) {
        return;
    }

    // Branch 2: Push to TLS SLL (expected: always succeeds)
    TINY_HOT_SLL_PUSH(class_idx, ptr);

    // Fast path exit: ~15-25 cycles total
}

#endif // TINY_FRONT_HOT_BOX_H
```

---

## Box 3: Tiny Front Cold Path Box

### Responsibility

Low-frequency allocation/free slow path (`noinline`, `cold` attributes).

### File Layout

```
core/box/tiny_front_cold_box.h  - Cold path implementation
```

### Contract

**Called When**:
- TLS SLL miss (refill needed)
- Slow allocation path (debug, large size, etc.)

**Guarantees**:
- I-cache separated from hot path
- Heavy operations allowed
- Can call into ACE, learning, diagnostics

**Optimization**:
- `noinline` → Not inlined into hot path
- `cold` → Compiler puts in cold section

### Implementation

#### core/box/tiny_front_cold_box.h

```c
#ifndef TINY_FRONT_COLD_BOX_H
#define TINY_FRONT_COLD_BOX_H

#include <stddef.h>  // size_t
#include "tiny_front_hot_box_macros.h"
#include "../tls_sll_box.h"  // tls_sll_pop_inline (used by TINY_HOT_SLL_POP)

// Note: the functions below are non-inline definitions with external linkage,
// so this header must be included from exactly one translation unit.

// Forward declaration: tiny_front_cold_refill() calls this before its definition
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
    __attribute__((noinline, cold));

// ========== Box: Tiny Front Cold Refill ==========
// Contract: Refill TLS SLL when empty
// Called: 10-20% of allocations (SLL miss)
// Performance: ~100-200 cycles (acceptable for miss case)
// Optimization: noinline, cold → separated from hot path

__attribute__((noinline, cold))
void* tiny_front_cold_refill(int class_idx) {
    // Heavy refill logic
    // - May allocate new SuperSlab
    // - May trigger ACE learning
    // - May call into diagnostics

    // Call existing refill logic
    tiny_fast_refill_and_take(class_idx);

    // After refill, try pop again
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);
    if (TINY_HOT_PTR_IS_VALID(ptr)) {
        return ptr;
    }

    // Refill failed → slow allocation
    return tiny_front_cold_slow_alloc(0, class_idx);  // size=0 (unknown)
}

// ========== Box: Tiny Front Cold Slow Alloc ==========
// Contract: Slowest allocation path (debug, diagnostics, ACE)
// Called: Rare (refill failure, special modes)
// Performance: ~500-1000+ cycles (acceptable for rare case)

__attribute__((noinline, cold))
void* tiny_front_cold_slow_alloc(size_t size, int class_idx) {
    // Debug/diagnostic/ACE learning hooks
    // - Allocation site tracking
    // - Size class profiling
    // - Memory pressure monitoring

    // Call legacy slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

// ========== Box: Tiny Front Cold Drain ==========
// Contract: Drain remote frees (batched, low frequency)
// Called: Background or on threshold
// Optimization: noinline, cold

__attribute__((noinline, cold))
void tiny_front_cold_drain_remote(int class_idx) {
    // Drain remote free lists into TLS SLL
    // - Batch processing for efficiency
    // - May trigger ACE rebalancing
    tiny_remote_drain_to_sll(class_idx);
}

#endif // TINY_FRONT_COLD_BOX_H
```

---

## Box 4: Tiny Front Config Box

### Responsibility

Centralized management of Tiny Front configuration (compile-time / runtime switching).

### File Layout

```
core/box/tiny_front_config_box.h  - Configuration management
core/hakmem_build_flags.h         - Build flag definitions (existing)
```

### Contract

**Compile-time Mode (PGO builds)**:
- `HAKMEM_TINY_FRONT_PGO=1`
- All runtime checks → compile-time constants
- Unused branches eliminated by compiler

**Runtime Mode (normal builds)**:
- Backward compatible
- ENV variable checks as before
- Full feature set available

### Implementation

#### core/box/tiny_front_config_box.h

```c
#ifndef TINY_FRONT_CONFIG_BOX_H
#define TINY_FRONT_CONFIG_BOX_H

#include <stdio.h>  // fprintf (tiny_front_config_report)

// ========== Build Flag Definitions ==========
// Location: core/hakmem_build_flags.h

#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

// ========== PGO Mode: Fixed Configuration ==========

#if HAKMEM_TINY_FRONT_PGO
    // PGO build: Fix configuration for profiling/optimization
    // All runtime checks become compile-time constants

    #define TINY_FRONT_ULTRA_SLIM_ENABLED     0
    #define TINY_FRONT_HEAP_V2_ENABLED        0
    #define TINY_FRONT_SFC_ENABLED            1
    #define TINY_FRONT_FASTCACHE_ENABLED      0
    #define TINY_FRONT_UNIFIED_GATE_ENABLED   1
    #define TINY_FRONT_METRICS_ENABLED        0
    #define TINY_FRONT_DIAG_ENABLED           0

    // Optimization: Constant folding eliminates dead branches
    // Example:
    //   if (TINY_FRONT_HEAP_V2_ENABLED) { ... }
    //   → Compiler eliminates entire block (0 is constant false)

#else
    // Normal build: Runtime configuration (backward compatible)
    // Checks ENV variables or config state

    #define TINY_FRONT_ULTRA_SLIM_ENABLED     ultra_slim_mode_enabled()
    #define TINY_FRONT_HEAP_V2_ENABLED        tiny_heap_v2_enabled()
    #define TINY_FRONT_SFC_ENABLED            sfc_cascade_enabled()
    #define TINY_FRONT_FASTCACHE_ENABLED      tiny_fastcache_enabled()
    #define TINY_FRONT_UNIFIED_GATE_ENABLED   front_gate_unified_enabled()
    #define TINY_FRONT_METRICS_ENABLED        tiny_metrics_enabled()
    #define TINY_FRONT_DIAG_ENABLED           tiny_diag_enabled()

#endif // HAKMEM_TINY_FRONT_PGO

// ========== Configuration Helpers ==========

// Check if running in PGO-optimized build
static inline int tiny_front_is_pgo_build(void) {
    return HAKMEM_TINY_FRONT_PGO;
}

// Get effective configuration (for diagnostics; no-op in release builds)
static inline void tiny_front_config_report(void) {
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[TINY_FRONT_CONFIG]\n");
    fprintf(stderr, "  PGO Build: %d\n", HAKMEM_TINY_FRONT_PGO);
    fprintf(stderr, "  Ultra SLIM: %d\n", TINY_FRONT_ULTRA_SLIM_ENABLED);
    fprintf(stderr, "  Heap V2: %d\n", TINY_FRONT_HEAP_V2_ENABLED);
    fprintf(stderr, "  SFC: %d\n", TINY_FRONT_SFC_ENABLED);
    fprintf(stderr, "  FastCache: %d\n", TINY_FRONT_FASTCACHE_ENABLED);
    fprintf(stderr, "  Unified Gate: %d\n", TINY_FRONT_UNIFIED_GATE_ENABLED);
#endif
}

#endif // TINY_FRONT_CONFIG_BOX_H
```

#### Update to core/hakmem_build_flags.h

```c
// Add around line 190:

// HAKMEM_TINY_FRONT_PGO:
// 0 = Normal build with runtime configuration (default)
// 1 = PGO-optimized build with compile-time configuration
//     Eliminates runtime branches for maximum performance.
//     Use with: make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-build
#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif
```

---

## Integration: Refactor tiny_alloc_fast()

### Before (one complex function, 15-20 branches)

```c
void* tiny_alloc_fast(size_t size) {
    // Ultra SLIM check
    if (ultra_slim_mode_enabled()) { ... }

    // Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // Metrics
    if (tiny_metrics_enabled()) {
        tiny_sizeclass_hist_hit(class_idx);
    }

    // Heap V2 check
    if (tiny_heap_v2_enabled()) { ... }

    // FastCache check
    if (tiny_fastcache_enabled()) { ... }

    // SFC cascade
    if (sfc_cascade_enabled()) { ... }

    // TLS SLL pop
    void* ptr = tls_sll_pop(class_idx);
    if (ptr) return ptr;

    // Refill logic (complex)
    ...
}
```

### After (Boxified, only 3-5 branches)

```c
// Include new boxes
#include "core/box/tiny_front_config_box.h"
#include "core/box/tiny_front_hot_box.h"
#include "core/box/tiny_front_cold_box.h"

void* tiny_alloc_fast(size_t size) {
    // Branch 1: Ultra SLIM mode check (compile-time constant in PGO)
    if (TINY_FRONT_ULTRA_SLIM_ENABLED) {
        return tiny_ultra_slim_alloc(size);  // Separate path
    }

    // Branch 2: Size to class (always needed)
    int class_idx = hak_tiny_size_to_class(size);

    // Branch 3: Hot path (inlined, 2-3 branches inside)
    return tiny_front_hot_alloc(class_idx);

    // Total branches in PGO build: 2-3
    // (Ultra SLIM = 0 → eliminated, hot_alloc inlined)
}
```

**Effective branch count after PGO optimization**: **only 2-3 branches**.
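
The branch-count argument above can be tried out in isolation. The following is a minimal sketch, not hakmem code: every `demo_`/`DEMO_` name is hypothetical, `DEMO_FRONT_PGO` stands in for `HAKMEM_TINY_FRONT_PGO`, the free lists are plain globals rather than TLS, and `malloc` stands in for the real refill path. It assumes GCC or Clang for the attributes and `__builtin_expect`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical switch, standing in for HAKMEM_TINY_FRONT_PGO. */
#ifndef DEMO_FRONT_PGO
#  define DEMO_FRONT_PGO 0
#endif

/* Hypothetical runtime check, standing in for ultra_slim_mode_enabled(). */
static int demo_ultra_slim_enabled(void) {
    const char *env = getenv("DEMO_ULTRA_SLIM");
    return env != NULL && env[0] == '1';
}

#if DEMO_FRONT_PGO
#  define DEMO_ULTRA_SLIM_ENABLED 0   /* constant: the mode branch folds away */
#else
#  define DEMO_ULTRA_SLIM_ENABLED demo_ultra_slim_enabled()
#endif

#define DEMO_NUM_CLASSES 8            /* block sizes 8, 16, ..., 1024 bytes */

/* One free list per size class; a real allocator would keep this per thread. */
static void *demo_free_list[DEMO_NUM_CLASSES];
static unsigned demo_cold_calls;      /* counts misses, for the self-check */

/* Smallest class whose block size fits `size` (caller ensures 1..1024). */
static int demo_size_to_class(size_t size) {
    int class_idx = 0;
    while (((size_t)8 << class_idx) < size) class_idx++;
    return class_idx;
}

/* Cold path: kept out of line so the hot path stays small. */
__attribute__((noinline, cold))
static void *demo_cold_refill(int class_idx) {
    demo_cold_calls++;
    return malloc((size_t)8 << class_idx);
}

/* Separate mode path: bypasses the free lists but still hands out
 * class-sized blocks, so a later demo_hot_free() stays valid. */
__attribute__((noinline))
static void *demo_slim_alloc(size_t size) {
    return malloc((size_t)8 << demo_size_to_class(size));
}

/* Hot path: one predictable branch on the free-list head. */
__attribute__((always_inline))
static inline void *demo_hot_alloc(int class_idx) {
    void *head = demo_free_list[class_idx];
    if (__builtin_expect(head != NULL, 1)) {
        demo_free_list[class_idx] = *(void **)head;  /* pop */
        return head;
    }
    return demo_cold_refill(class_idx);
}

static inline void demo_hot_free(void *ptr, int class_idx) {
    if (ptr == NULL) return;
    *(void **)ptr = demo_free_list[class_idx];       /* push */
    demo_free_list[class_idx] = ptr;
}

void *demo_alloc(size_t size) {
    if (DEMO_ULTRA_SLIM_ENABLED) {
        return demo_slim_alloc(size);
    }
    return demo_hot_alloc(demo_size_to_class(size));
}

/* Returns 1 if a miss followed by free + alloc reuses the same block.
 * Assumes DEMO_ULTRA_SLIM is not set in the environment. */
int demo_self_check(void) {
    unsigned before = demo_cold_calls;
    void *first = demo_alloc(64);                    /* empty list: miss */
    assert(first != NULL);
    demo_hot_free(first, demo_size_to_class(64));
    void *second = demo_alloc(64);                   /* hit: popped block */
    int ok = (second == first) && (demo_cold_calls - before == 1);
    free(second);
    return ok;
}
```

Calling `demo_self_check()` from a `main` exercises one miss and one hit. Building with `-DDEMO_FRONT_PGO=1` turns `DEMO_ULTRA_SLIM_ENABLED` into the constant `0`, so an optimizing compiler can drop the mode check and `demo_alloc` reduces to the class lookup plus the single hit/miss branch.
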

---

## Testing Strategy

### Step 1: PGO Workflow Test

```bash
# Build profile version
make pgo-tiny-profile

# Collect profiles (automated; pgo-tiny-build below also runs this script)
./scripts/box/pgo_tiny_profile_box.sh

# Build optimized version
make pgo-tiny-build

# Benchmark (tiny-hot arguments taken from the PGO workload config)
./bench_random_mixed_hakmem 1000000 256 42
./bench_tiny_hot_hakmem 64 100 60000

# Expected: +5-10% improvement
```

### Step 2: Hot/Cold Separation Test

```bash
# Build with hot/cold boxes
make clean
make HAKMEM_TINY_FRONT_PGO=1 bench_random_mixed_hakmem

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42

# Expected: +10-15% improvement (cumulative +15-25%)
```

### Step 3: Config Box Test

```bash
# PGO build (compile-time config)
make clean
make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-full

# Normal build (runtime config)
make clean
make bench_random_mixed_hakmem

# Both should work, PGO should be faster
```

### Regression Testing

```bash
# Ensure backward compatibility
HAKMEM_TINY_ULTRA_SLIM=1 ./bench_random_mixed_hakmem 100000 256
HAKMEM_TINY_HEAP_V2=1 ./bench_random_mixed_hakmem 100000 256

# All existing ENV vars should work in normal builds
```

---

## Performance Expectations

### Branch Reduction

- **Before**: 15-20 branches in `tiny_alloc_fast()`
- **After (PGO)**: 2-3 branches (most eliminated by compiler)
- **Gain**: ~40-60% reduction in branch misses

### Instruction Count

- **Before**: ~167M instructions (1M ops)
- **After**: ~120-140M instructions
- **Gain**: ~16-28% reduction

### Throughput

- **Phase 3**: 56.8M ops/s
- **Phase 4.1 (PGO)**: 60-62M ops/s (+5-10%)
- **Phase 4.2 (Hot/Cold)**: 68-75M ops/s (+10-15%)
- **Phase 4.3 (Config)**: 73-83M ops/s (+5-8%)

**Total Improvement**: +28-46% over Phase 3 (about 1.29-1.46x), which moves toward the 2x goal but does not reach it.

---

## Implementation Schedule

### Week 1: PGO Workflow

- Day 1-2: PGO scripts + Makefile
- Day 3: Profile collection + benchmarking
- Day 4: Documentation + review

### Week 2: Hot/Cold Separation

- Day 1-2: Hot Box + macros
- Day 3-4: Cold Box + refactor
- Day 5: Testing + PGO re-optimization

### Week 3: Config Box + Polish

- Day 1-2: Config Box implementation
- Day 3: Integration testing
- Day 4-5: Final benchmarks + documentation

---

## Success Criteria

✅ **Code Quality**:
- All pointer operations use macros
- Clear contracts in each Box
- Zero regression in existing features

✅ **Performance**:
- bench_random_mixed: 73-83M ops/s (vs 56.8M baseline)
- bench_tiny_hot: 100-115M ops/s (vs 81M baseline)
- No regression in other benchmarks

✅ **Maintainability**:
- Hot/Cold separation clear
- PGO workflow documented
- Backward compatible

---

Generated: 2025-11-29
Phase: 4 Design
Next: Implementation (Week 1 start)