hakmem/docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md
Moe Charm (CI) b51b600e8d Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
Implemented automated Profile-Guided Optimization workflow using Box pattern:

Performance Improvement:
- Baseline:      57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build:   Build optimized binaries
   - pgo-tiny-full:    Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Box modularization (Box化): Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00


Phase 4: Tiny Front Optimization - Box Design Document

Date: 2025-11-29
Author: Claude Code
Goal: 2x throughput improvement via Box modularization + PGO + Hot/Cold separation


Design Philosophy

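Box Design Principles

  1. Single Responsibility: one Box = one clearly defined responsibility
  2. Clear Contracts: state inputs, outputs, and guarantees explicitly
  3. Macro-Based Pointers: type safety, null checks, unified API
  4. Testability: each Box can be tested independently
  5. Incremental: implement and verify in stages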

Pointer Safety Strategy

All pointer operations are abstracted behind macros:

  • Unified null checks
  • Safe type casts
  • Assertions in debug builds
  • Optimized away in release builds

Box 1: PGO Profile Collection Box

Responsibility

Standardize and automate PGO profile collection for the Tiny Front.

File Layout

scripts/box/pgo_tiny_profile_box.sh     - Main script
scripts/box/pgo_tiny_profile_config.sh  - Configuration (workload definitions)

Contract

Input:

  • Built binaries with -fprofile-generate -flto
  • bench_random_mixed_hakmem
  • bench_tiny_hot_hakmem

Output:

  • .gcda profile data files
  • Profile summary report

Guarantees:

  • Deterministic execution (fixed seed)
  • Representative workload coverage
  • Error detection & reporting

Implementation

scripts/box/pgo_tiny_profile_box.sh

#!/bin/bash
# Box: PGO Profile Collection
# Contract: Execute representative Tiny workloads for PGO
# Usage: ./scripts/box/pgo_tiny_profile_box.sh

set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"

echo "========================================="
echo "Box: PGO Profile Collection (Tiny Front)"
echo "========================================="

# Validate binaries exist
for bin in "${PGO_BINARIES[@]}"; do
    if [[ ! -x "$bin" ]]; then
        echo "ERROR: Binary not found or not executable: $bin"
        exit 1
    fi
done

# Clean old profile data
echo "[PGO_BOX] Cleaning old .gcda files..."
find . -name "*.gcda" -delete

# Execute workloads
echo "[PGO_BOX] Executing representative workloads..."

for workload in "${PGO_WORKLOADS[@]}"; do
    echo "[PGO_BOX] Running: $workload"
    eval "$workload"
done

# Verify profile data generated
GCDA_COUNT=$(find . -name "*.gcda" | wc -l)
if [[ $GCDA_COUNT -eq 0 ]]; then
    echo "ERROR: No .gcda files generated!"
    exit 1
fi

echo "[PGO_BOX] Profile collection complete"
echo "[PGO_BOX] Generated $GCDA_COUNT .gcda files"
echo "========================================="

scripts/box/pgo_tiny_profile_config.sh

#!/bin/bash
# Box: PGO Profile Configuration
# Purpose: Define representative workloads for Tiny Front

# Binaries to profile
PGO_BINARIES=(
    "./bench_random_mixed_hakmem"
    "./bench_tiny_hot_hakmem"
)

# Representative workloads (deterministic seeds)
PGO_WORKLOADS=(
    # Random mixed: Common case (medium working set)
    "./bench_random_mixed_hakmem 5000000 256 42"

    # Random mixed: Smaller working set (higher cache hit)
    "./bench_random_mixed_hakmem 5000000 128 42"

    # Random mixed: Larger working set (more diverse)
    "./bench_random_mixed_hakmem 5000000 512 42"

    # Tiny hot path: 16B allocations
    "./bench_tiny_hot_hakmem 16 100 60000"

    # Tiny hot path: 64B allocations
    "./bench_tiny_hot_hakmem 64 100 60000"
)
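
The determinism guarantee rests on each workload replaying the same operation sequence for a given seed. The sketch below illustrates that idea in C. It assumes the three arguments of bench_random_mixed_hakmem are iteration count, working-set size in slots, and seed, which the source implies but does not spell out. The names demo_xorshift32 and demo_random_mixed are hypothetical, and the real benchmark may use a different generator and size distribution.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Small deterministic PRNG (xorshift32). The state must never be zero. */
static uint32_t demo_xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical random-mixed workload: each step picks a slot and either
 * frees it (if occupied) or allocates 16..128 bytes into it (if empty).
 * Returns the total number of bytes requested, which serves as a checksum
 * of the operation sequence (assuming every malloc succeeds). */
static uint64_t demo_random_mixed(long cycles, size_t ws, uint32_t seed)
{
    if (ws == 0) return 0;
    void **slots = calloc(ws, sizeof *slots);
    if (slots == NULL) return 0;

    uint32_t rng = seed ? seed : 1u;   /* avoid the all-zero state */
    uint64_t bytes_requested = 0;

    for (long i = 0; i < cycles; i++) {
        uint32_t r = demo_xorshift32(&rng);
        size_t idx = r % ws;
        if (slots[idx] != NULL) {
            free(slots[idx]);
            slots[idx] = NULL;
        } else {
            size_t size = 16 + (r >> 16) % 113;   /* 16..128 bytes */
            slots[idx] = malloc(size);
            bytes_requested += size;
        }
    }

    /* Release whatever is still live at the end of the run. */
    for (size_t i = 0; i < ws; i++) free(slots[i]);
    free(slots);
    return bytes_requested;
}
```

Two calls such as demo_random_mixed(100000, 256, 42) should return the same checksum, which is the property that lets repeated profile runs exercise the same paths.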

Makefile Integration

# PGO Tiny Profile Build
pgo-tiny-profile:
	@echo "Building PGO profile binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-generate -flto" \
	        LDFLAGS+="-fprofile-generate -flto" \
	        HAKMEM_BUILD_RELEASE=1 \
	        HAKMEM_TINY_FRONT_PGO=1 \
	        bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Tiny Optimized Build
# Note: `clean` must leave *.gcda files in place, otherwise the collected profile is lost.
pgo-tiny-build:
	@echo "Collecting PGO profile data..."
	./scripts/box/pgo_tiny_profile_box.sh
	@echo "Building PGO-optimized binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-use -flto" \
	        LDFLAGS+="-fprofile-use -flto" \
	        HAKMEM_BUILD_RELEASE=1 \
	        HAKMEM_TINY_FRONT_PGO=1 \
	        bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Full Workflow
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build
	@echo "PGO optimization complete!"
	@echo "Testing optimized binary..."
	./bench_random_mixed_hakmem 1000000 256 42

Box 2: Tiny Front Hot Path Box

Responsibility

Ultra-fast allocation path (minimal branch count, always_inline)

File Layout

core/box/tiny_front_hot_box.h          - Hot path implementation
core/box/tiny_front_hot_box_macros.h   - Pointer safety macros

Contract

Preconditions:

  • class_idx validated (0-7)
  • TLS initialized
  • Not in slow path mode

Guarantees:

  • Maximum 5-7 branches
  • Always inlined
  • Null-safe pointer operations
  • PGO-optimized

Performance:

  • Hit case: ~20-30 cycles
  • Miss case: → Cold Box (~100-200 cycles)

Pointer Safety Macros

core/box/tiny_front_hot_box_macros.h

#ifndef TINY_FRONT_HOT_BOX_MACROS_H
#define TINY_FRONT_HOT_BOX_MACROS_H

#include <stdint.h>
#include <stddef.h>

// ========== Pointer Type Definitions ==========

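// Pointer aliases for readability. All three are plain void*, so the compiler
// does not stop one from being passed where another is expected.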
typedef void* TinyHotPtr;       // User-facing allocation pointer
typedef void* TinySLLNode;      // SLL node pointer
typedef void* TinySlabBase;     // Slab base pointer

// ========== Pointer Safety Macros ==========

#if HAKMEM_BUILD_RELEASE
  // Release: No overhead
  #define TINY_HOT_PTR_CHECK(ptr)           (ptr)
  #define TINY_HOT_PTR_CAST(type, ptr)      ((type)(ptr))
  #define TINY_HOT_PTR_NULL                 NULL
  #define TINY_HOT_PTR_IS_NULL(ptr)         ((ptr) == NULL)
  #define TINY_HOT_PTR_IS_VALID(ptr)        ((ptr) != NULL)
#else
  // Debug: Assertions enabled
  #include <assert.h>
  #define TINY_HOT_PTR_CHECK(ptr) \
      ({ void* _p = (ptr); \
         assert(((uintptr_t)_p & 0x7) == 0 && "Pointer not 8-byte aligned"); \
         _p; })
  #define TINY_HOT_PTR_CAST(type, ptr) \
      ((type)TINY_HOT_PTR_CHECK(ptr))
  #define TINY_HOT_PTR_NULL                 NULL
  #define TINY_HOT_PTR_IS_NULL(ptr)         ((ptr) == NULL)
  #define TINY_HOT_PTR_IS_VALID(ptr) \
      ({ void* _p = (ptr); \
         _p != NULL && ((uintptr_t)_p & 0x7) == 0; })
#endif

// ========== SLL Operations Macros ==========

// Read next pointer from SLL node
#define TINY_HOT_SLL_NEXT(node) \
    TINY_HOT_PTR_CAST(TinySLLNode, tiny_nextptr_get(node))

// Write next pointer to SLL node
#define TINY_HOT_SLL_SET_NEXT(node, next) \
    tiny_nextptr_set((node), TINY_HOT_PTR_CHECK(next))

// Pop from TLS SLL (class-specific)
#define TINY_HOT_SLL_POP(class_idx) \
    TINY_HOT_PTR_CAST(TinyHotPtr, tls_sll_pop_inline(class_idx))

// Push to TLS SLL (class-specific)
#define TINY_HOT_SLL_PUSH(class_idx, ptr) \
    tls_sll_push_inline((class_idx), TINY_HOT_PTR_CHECK(ptr))

// ========== Likely/Unlikely Hints ==========

#define TINY_HOT_LIKELY(x)       __builtin_expect(!!(x), 1)
#define TINY_HOT_UNLIKELY(x)     __builtin_expect(!!(x), 0)

// ========== Branch Prediction Hints ==========

// Expected: SLL hit (80-90% of allocations)
#define TINY_HOT_EXPECT_HIT(ptr) \
    TINY_HOT_LIKELY(TINY_HOT_PTR_IS_VALID(ptr))

// Expected: SLL miss (10-20% of allocations)
#define TINY_HOT_EXPECT_MISS(ptr) \
    TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))

#endif // TINY_FRONT_HOT_BOX_MACROS_H
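
Because TinyHotPtr, TinySLLNode, and TinySlabBase are all typedefs of void*, mixing them up compiles silently. If compile-time separation is wanted, one possible alternative (not part of this design) is to wrap each pointer in its own single-member struct. The sketch below uses hypothetical Demo* names. It assumes the compiler optimizes the wrappers away, which is typical for small structs passed by value but is not verified here.

```c
#include <stddef.h>
#include <stdint.h>

/* Distinct wrapper types: passing a DemoSLLNode where a DemoHotPtr is
 * expected is a compile error, unlike with void* typedefs. */
typedef struct { void *p; } DemoHotPtr;
typedef struct { void *p; } DemoSLLNode;

/* Wrap a raw pointer as a user-facing allocation handle. */
static inline DemoHotPtr demo_hot_ptr(void *raw)
{
    return (DemoHotPtr){ raw };
}

/* Same validity rule as the debug TINY_HOT_PTR_IS_VALID: non-null and 8-byte aligned. */
static inline int demo_hot_ptr_is_valid(DemoHotPtr h)
{
    return h.p != NULL && ((uintptr_t)h.p & 0x7) == 0;
}

/* Conversions between handle kinds must be written out explicitly. */
static inline DemoSLLNode demo_hot_to_node(DemoHotPtr h)
{
    return (DemoSLLNode){ h.p };
}
```

The trade-off is extra conversion helpers at every boundary, so the void* aliases may still be the pragmatic choice for the hot path.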

Implementation

core/box/tiny_front_hot_box.h

#ifndef TINY_FRONT_HOT_BOX_H
#define TINY_FRONT_HOT_BOX_H

#include "tiny_front_hot_box_macros.h"
#include "../tiny_next_ptr_box.h"  // tiny_nextptr_get/set
#include "../tls_sll_box.h"        // TLS SLL operations

// Forward declaration for cold path
void* tiny_front_cold_refill(int class_idx)
    __attribute__((noinline, cold));

// ========== Box: Tiny Front Hot Path ==========
// Contract: Ultra-fast allocation with 5-7 branches max
// Precondition: class_idx validated (0-7), TLS initialized
// Performance: ~20-30 cycles (hit), ~100-200 cycles (miss → cold)
// Optimization: always_inline + PGO + branch hints

__attribute__((always_inline))
static inline TinyHotPtr tiny_front_hot_alloc(int class_idx)
{
    // Branch 1: TLS SLL pop (expected: 80-90% hit)
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    // Branch 2: Check if hit (optimized by PGO)
    if (TINY_HOT_EXPECT_HIT(ptr)) {
        // Fast path exit: ~20-30 cycles total
        return ptr;
    }

    // Branch 3: Miss → Cold path refill (10-20% of allocations)
    return TINY_HOT_PTR_CAST(TinyHotPtr, tiny_front_cold_refill(class_idx));
}

// ========== Box: Tiny Front Hot Free ==========
// Contract: Ultra-fast free with 3-5 branches max
// Precondition: ptr is valid Tiny allocation
// Performance: ~15-25 cycles

__attribute__((always_inline))
static inline void tiny_front_hot_free(TinyHotPtr ptr, int class_idx)
{
    // Branch 1: Null check (expected: rare)
    if (TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))) {
        return;
    }

    // Branch 2: Push to TLS SLL (expected: always succeeds)
    TINY_HOT_SLL_PUSH(class_idx, ptr);

    // Fast path exit: ~15-25 cycles total
}

#endif // TINY_FRONT_HOT_BOX_H

Box 3: Tiny Front Cold Path Box

Responsibility

Low-frequency allocation/free slow path (noinline, cold attributes)

File Layout

core/box/tiny_front_cold_box.h         - Cold path implementation

Contract

Called When:

  • TLS SLL miss (refill needed)
  • Slow allocation path (debug, large size, etc.)

Guarantees:

  • I-cache separated from hot path
  • Heavy operations allowed
  • Can call into ACE, learning, diagnostics

Optimization:

  • noinline → Not inlined into hot path
  • cold → Compiler puts in cold section

Implementation

core/box/tiny_front_cold_box.h

#ifndef TINY_FRONT_COLD_BOX_H
#define TINY_FRONT_COLD_BOX_H

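#include <stddef.h>
#include "tiny_front_hot_box_macros.h"
#include "../tls_sll_box.h"        // tls_sll_pop_inline (used by TINY_HOT_SLL_POP)

// Note: the functions below have external linkage. Include this header from
// exactly one .c file, or move the bodies into a .c file, to avoid
// duplicate-definition link errors.

// Forward declaration: tiny_front_cold_refill() calls this before its definition
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
    __attribute__((noinline, cold));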

// ========== Box: Tiny Front Cold Refill ==========
// Contract: Refill TLS SLL when empty
// Called: 10-20% of allocations (SLL miss)
// Performance: ~100-200 cycles (acceptable for miss case)
// Optimization: noinline, cold → separated from hot path

__attribute__((noinline, cold))
void* tiny_front_cold_refill(int class_idx)
{
    // Heavy refill logic
    // - May allocate new SuperSlab
    // - May trigger ACE learning
    // - May call into diagnostics

    // Call existing refill logic
    tiny_fast_refill_and_take(class_idx);

    // After refill, try pop again
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    if (TINY_HOT_PTR_IS_VALID(ptr)) {
        return ptr;
    }

    // Refill failed → slow allocation
    return tiny_front_cold_slow_alloc(0, class_idx);  // size=0 (unknown)
}

// ========== Box: Tiny Front Cold Slow Alloc ==========
// Contract: Slowest allocation path (debug, diagnostics, ACE)
// Called: Rare (refill failure, special modes)
// Performance: ~500-1000+ cycles (acceptable for rare case)

__attribute__((noinline, cold))
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
{
    // Debug/diagnostic/ACE learning hooks
    // - Allocation site tracking
    // - Size class profiling
    // - Memory pressure monitoring

    // Call legacy slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

// ========== Box: Tiny Front Cold Drain ==========
// Contract: Drain remote frees (batched, low frequency)
// Called: Background or on threshold
// Optimization: noinline, cold

__attribute__((noinline, cold))
void tiny_front_cold_drain_remote(int class_idx)
{
    // Drain remote free lists into TLS SLL
    // - Batch processing for efficiency
    // - May trigger ACE rebalancing

    tiny_remote_drain_to_sll(class_idx);
}

#endif // TINY_FRONT_COLD_BOX_H
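
The hot and cold headers above depend on project internals (tls_sll_pop_inline, tiny_fast_refill_and_take, and so on) that are not shown here. To make the split easier to try in isolation, here is a minimal self-contained sketch of the same pattern. Everything prefixed demo_ or DEMO_ is hypothetical: the per-thread free list, the power-of-two class sizes, and the malloc-backed refill are stand-ins, not hakmem's actual data structures.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NUM_CLASSES  8    /* matches the class_idx range 0-7 */
#define DEMO_REFILL_COUNT 16   /* blocks carved per refill (arbitrary) */

typedef struct DemoNode { struct DemoNode *next; } DemoNode;

/* Per-thread singly linked free list for each size class. */
static _Thread_local DemoNode *demo_sll[DEMO_NUM_CLASSES];
static _Thread_local unsigned long demo_refills;

/* Assumed class sizes for the demo: 8, 16, 32, ... bytes. */
static size_t demo_class_size(int class_idx)
{
    return (size_t)8 << class_idx;
}

/* Cold path: kept out of line so the hot path stays small.
 * Carves one chunk into blocks, keeps block 0 for the caller and
 * pushes the rest. Chunks are never returned to malloc in this demo. */
__attribute__((noinline, cold))
static void *demo_cold_refill(int class_idx)
{
    size_t sz = demo_class_size(class_idx);
    char *chunk = malloc(sz * DEMO_REFILL_COUNT);
    if (chunk == NULL) return NULL;
    demo_refills++;
    for (int i = 1; i < DEMO_REFILL_COUNT; i++) {
        DemoNode *n = (DemoNode *)(chunk + (size_t)i * sz);
        n->next = demo_sll[class_idx];
        demo_sll[class_idx] = n;
    }
    return chunk;
}

/* Hot path: one predicted branch on the hit case. */
__attribute__((always_inline))
static inline void *demo_hot_alloc(int class_idx)
{
    DemoNode *n = demo_sll[class_idx];
    if (__builtin_expect(n != NULL, 1)) {
        demo_sll[class_idx] = n->next;
        return n;
    }
    return demo_cold_refill(class_idx);
}

/* Hot free: null check, then push onto the per-thread list. */
__attribute__((always_inline))
static inline void demo_hot_free(void *ptr, int class_idx)
{
    if (__builtin_expect(ptr == NULL, 0)) return;
    DemoNode *n = ptr;
    n->next = demo_sll[class_idx];
    demo_sll[class_idx] = n;
}
```

The first demo_hot_alloc on an empty class takes the cold path once; subsequent allocations and any free/alloc pair on the same class stay on the hot path until the list is exhausted.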

Box 4: Tiny Front Config Box

Responsibility

Centralized management of Tiny Front configuration (compile-time/runtime switching)

File Layout

core/box/tiny_front_config_box.h       - Configuration management
core/hakmem_build_flags.h              - Build flag definitions (existing)

Contract

Compile-time Mode (PGO builds):

  • HAKMEM_TINY_FRONT_PGO=1
  • All runtime checks → compile-time constants
  • Unused branches eliminated by compiler

Runtime Mode (normal builds):

  • Backward compatible
  • ENV variable checks as before
  • Full feature set available

Implementation

core/box/tiny_front_config_box.h

#ifndef TINY_FRONT_CONFIG_BOX_H
#define TINY_FRONT_CONFIG_BOX_H

#include <stdio.h>  // fprintf, stderr (used by tiny_front_config_report)

// ========== Build Flag Definitions ==========
// Location: core/hakmem_build_flags.h

#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

// ========== PGO Mode: Fixed Configuration ==========

#if HAKMEM_TINY_FRONT_PGO
  // PGO build: Fix configuration for profiling/optimization
  // All runtime checks become compile-time constants

  #define TINY_FRONT_ULTRA_SLIM_ENABLED    0
  #define TINY_FRONT_HEAP_V2_ENABLED       0
  #define TINY_FRONT_SFC_ENABLED           1
  #define TINY_FRONT_FASTCACHE_ENABLED     0
  #define TINY_FRONT_UNIFIED_GATE_ENABLED  1
  #define TINY_FRONT_METRICS_ENABLED       0
  #define TINY_FRONT_DIAG_ENABLED          0

  // Optimization: Constant folding eliminates dead branches
  // Example:
  //   if (TINY_FRONT_HEAP_V2_ENABLED) { ... }
  //   → Compiler eliminates entire block (0 is constant false)

#else
  // Normal build: Runtime configuration (backward compatible)
  // Checks ENV variables or config state

  #define TINY_FRONT_ULTRA_SLIM_ENABLED    ultra_slim_mode_enabled()
  #define TINY_FRONT_HEAP_V2_ENABLED       tiny_heap_v2_enabled()
  #define TINY_FRONT_SFC_ENABLED           sfc_cascade_enabled()
  #define TINY_FRONT_FASTCACHE_ENABLED     tiny_fastcache_enabled()
  #define TINY_FRONT_UNIFIED_GATE_ENABLED  front_gate_unified_enabled()
  #define TINY_FRONT_METRICS_ENABLED       tiny_metrics_enabled()
  #define TINY_FRONT_DIAG_ENABLED          tiny_diag_enabled()

#endif // HAKMEM_TINY_FRONT_PGO

// ========== Configuration Helpers ==========

// Check if running in PGO-optimized build
static inline int tiny_front_is_pgo_build(void)
{
    return HAKMEM_TINY_FRONT_PGO;
}

// Get effective configuration (for diagnostics)
static inline void tiny_front_config_report(void)
{
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[TINY_FRONT_CONFIG]\n");
    fprintf(stderr, "  PGO Build: %d\n", HAKMEM_TINY_FRONT_PGO);
    fprintf(stderr, "  Ultra SLIM: %d\n", TINY_FRONT_ULTRA_SLIM_ENABLED);
    fprintf(stderr, "  Heap V2: %d\n", TINY_FRONT_HEAP_V2_ENABLED);
    fprintf(stderr, "  SFC: %d\n", TINY_FRONT_SFC_ENABLED);
    fprintf(stderr, "  FastCache: %d\n", TINY_FRONT_FASTCACHE_ENABLED);
    fprintf(stderr, "  Unified Gate: %d\n", TINY_FRONT_UNIFIED_GATE_ENABLED);
#endif
}

#endif // TINY_FRONT_CONFIG_BOX_H
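
To see the compile-time versus runtime switch in isolation, the following sketch reduces the pattern to a single flag. It is an illustration under assumptions: DEMO_FRONT_PGO, demo_heap_v2_enabled, and demo_pick_path are hypothetical names, and the rule that a value starting with '1' enables the feature is a guess at how the HAKMEM_TINY_HEAP_V2 environment variable is parsed. The cached lookup is also not synchronized across threads, which a real implementation would need to consider.

```c
#include <stdlib.h>

#ifndef DEMO_FRONT_PGO
#  define DEMO_FRONT_PGO 0     /* default: runtime configuration */
#endif

/* Runtime check: read the environment once, then reuse the cached answer. */
static inline int demo_heap_v2_enabled(void)
{
    static int cached = -1;    /* -1 = not yet read */
    if (cached < 0) {
        const char *v = getenv("HAKMEM_TINY_HEAP_V2");
        cached = (v != NULL && v[0] == '1');
    }
    return cached;
}

#if DEMO_FRONT_PGO
   /* Fixed build: the flag is a literal 0, so the guarded branch is dead code. */
#  define DEMO_HEAP_V2_ENABLED 0
#else
   /* Normal build: the flag expands to the runtime check. */
#  define DEMO_HEAP_V2_ENABLED demo_heap_v2_enabled()
#endif

/* Returns 2 when the Heap V2 path is selected, 1 for the default path. */
static inline int demo_pick_path(void)
{
    if (DEMO_HEAP_V2_ENABLED) {
        return 2;
    }
    return 1;
}
```

Compiling with -DDEMO_FRONT_PGO=1 turns the condition into `if (0)`, which lets the compiler drop the branch; the default build keeps the environment-driven behavior.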

Update to core/hakmem_build_flags.h

// Add around line 190:

// HAKMEM_TINY_FRONT_PGO:
//   0 = Normal build with runtime configuration (default)
//   1 = PGO-optimized build with compile-time configuration
//       Eliminates runtime branches for maximum performance.
//       Use with: make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-build
#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

Integration: Refactor tiny_alloc_fast()

Before (one complex function, 15-20 branches)

void* tiny_alloc_fast(size_t size) {
    // Ultra SLIM check
    if (ultra_slim_mode_enabled()) { ... }

    // Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // Metrics
    if (tiny_metrics_enabled()) { tiny_sizeclass_hist_hit(class_idx); }

    // Heap V2 check
    if (tiny_heap_v2_enabled()) { ... }

    // FastCache check
    if (tiny_fastcache_enabled()) { ... }

    // SFC cascade
    if (sfc_cascade_enabled()) { ... }

    // TLS SLL pop
    void* ptr = tls_sll_pop(class_idx);
    if (ptr) return ptr;

    // Refill logic (complex)
    ...
}

After (Box-based, only 3-5 branches)

// Include new boxes
#include "core/box/tiny_front_config_box.h"
#include "core/box/tiny_front_hot_box.h"
#include "core/box/tiny_front_cold_box.h"

void* tiny_alloc_fast(size_t size) {
    // Branch 1: Ultra SLIM mode check (compile-time constant in PGO)
    if (TINY_FRONT_ULTRA_SLIM_ENABLED) {
        return tiny_ultra_slim_alloc(size);  // Separate path
    }

    // Branch 2: Size to class (always needed)
    int class_idx = hak_tiny_size_to_class(size);

    // Branch 3: Hot path (inlined, 2-3 branches inside)
    return tiny_front_hot_alloc(class_idx);

    // Total branches in PGO build: 2-3
    // (Ultra SLIM = 0 → eliminated, hot_alloc inlined)
}

Effective branch count after PGO optimization: only 2-3 branches


Testing Strategy

Step 1: PGO Workflow Test

# Build profile version
make pgo-tiny-profile

# Collect profiles (automated)
./scripts/box/pgo_tiny_profile_box.sh

# Build optimized version
make pgo-tiny-build

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42
./bench_tiny_hot_hakmem

# Expected: +5-10% improvement

Step 2: Hot/Cold Separation Test

# Build with hot/cold boxes
make clean
make HAKMEM_TINY_FRONT_PGO=1 bench_random_mixed_hakmem

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42

# Expected: +10-15% improvement (cumulative +15-25%)

Step 3: Config Box Test

# PGO build (compile-time config)
make clean
make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-full

# Normal build (runtime config)
make clean
make bench_random_mixed_hakmem

# Both should work, PGO should be faster

Regression Testing

# Ensure backward compatibility
HAKMEM_TINY_ULTRA_SLIM=1 ./bench_random_mixed_hakmem 100000 256
HAKMEM_TINY_HEAP_V2=1 ./bench_random_mixed_hakmem 100000 256

# All existing ENV vars should work in normal builds

Performance Expectations

Branch Reduction

  • Before: 15-20 branches in tiny_alloc_fast()
  • After (PGO): 2-3 branches (most eliminated by compiler)
  • Gain: ~40-60% reduction in branch misses

Instruction Count

  • Before: ~167M instructions (1M ops)
  • After: ~120-140M instructions
  • Gain: ~16-28% reduction

Throughput

  • Phase 3: 56.8M ops/s
  • Phase 4.1 (PGO): 60-62M ops/s (+5-10%)
  • Phase 4.2 (Hot/Cold): 68-75M ops/s (+10-15%)
  • Phase 4.3 (Config): 73-83M ops/s (+5-8%)

Total Improvement: +28-46% over the Phase 3 baseline (1.28-1.46x), progress toward the 2x goal
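
The total follows directly from the Phase 3 baseline and the Phase 4.3 range. The helper below (demo_gain_percent is a hypothetical name) reproduces the arithmetic so the figures can be rechecked when new measurements come in.

```c
/* Relative gain of `value` over `baseline`, in percent.
 * Caller must pass a nonzero baseline. */
static double demo_gain_percent(double baseline, double value)
{
    return (value / baseline - 1.0) * 100.0;
}
```

With the numbers above, demo_gain_percent(56.8, 73.0) is about 28.5 and demo_gain_percent(56.8, 83.0) is about 46.1, matching the +28-46% range.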


Implementation Schedule

Week 1: PGO Workflow

  • Day 1-2: PGO scripts + Makefile
  • Day 3: Profile collection + benchmarking
  • Day 4: Documentation + review

Week 2: Hot/Cold Separation

  • Day 1-2: Hot Box + macros
  • Day 3-4: Cold Box + refactor
  • Day 5: Testing + PGO re-optimization

Week 3: Config Box + Polish

  • Day 1-2: Config Box implementation
  • Day 3: Integration testing
  • Day 4-5: Final benchmarks + documentation

Success Criteria

Code Quality:

  • All pointer operations use macros
  • Clear contracts in each Box
  • Zero regression in existing features

Performance:

  • bench_random_mixed: 73-83M ops/s (vs 56.8M baseline)
  • bench_tiny_hot: 100-115M ops/s (vs 81M baseline)
  • No regression in other benchmarks

Maintainability:

  • Hot/Cold separation clear
  • PGO workflow documented
  • Backward compatible

Generated: 2025-11-29
Phase: 4 Design
Next: Implementation (Week 1 start)