hakmem/docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md
Moe Charm (CI) b51b600e8d Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
Implemented automated Profile-Guided Optimization workflow using Box pattern:

Performance Improvement:
- Baseline:      57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build:   Build optimized binaries
   - pgo-tiny-full:    Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Box modularization: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00


# Phase 4: Tiny Front Optimization - Box Design Document
**Date**: 2025-11-29
**Author**: Claude Code
**Goal**: 2x throughput improvement via Box modularization + PGO + Hot/Cold separation

---
## Design Philosophy
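### Box Principles
1. **Single Responsibility**: 1 Box = 1 clearly defined responsibility
2. **Clear Contracts**: Inputs, outputs, and guarantees are stated explicitly
3. **Macro-Based Pointers**: Type safety, null checks, unified API
4. **Testability**: Each Box can be tested independently
5. **Incremental**: Implement and verify in stages
### Pointer Safety Strategy
**All pointer operations are abstracted behind macros**:
- Unified null checks
- Safe type casts
- Assertions in debug builds
- Optimized away in release builds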
---
## Box 1: PGO Profile Collection Box
### Responsibility
Standardize and automate PGO profile collection for the Tiny Front
### File Layout
```
scripts/box/pgo_tiny_profile_box.sh - Main script
scripts/box/pgo_tiny_profile_config.sh - Configuration (workload definitions)
```
### Contract
**Input**:
- Built binaries with `-fprofile-generate -flto`
- `bench_random_mixed_hakmem`
- `bench_tiny_hot_hakmem`
**Output**:
- `.gcda` profile data files
- Profile summary report
**Guarantees**:
- Deterministic execution (fixed seed)
- Representative workload coverage
- Error detection & reporting
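The determinism guarantee rests on the benchmarks deriving their access pattern from the seed alone. The sketch below illustrates that property with a small xorshift generator; `demo_workload_trace` and its parameters are hypothetical stand-ins and are not taken from `bench_random_mixed_hakmem`.

```c
#include <stdint.h>
#include <stddef.h>

/* xorshift32 PRNG: the same non-zero seed yields the same sequence on every run. */
static uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical stand-in for a benchmark loop: picks one slot of a working set
 * of `ws` slots per iteration and folds the choice into a checksum. */
uint64_t demo_workload_trace(uint32_t seed, size_t iterations, uint32_t ws)
{
    uint32_t state = seed ? seed : 1u;   /* xorshift state must be non-zero */
    uint64_t checksum = 0;
    for (size_t i = 0; i < iterations; i++) {
        uint32_t slot = xorshift32(&state) % ws;
        checksum = checksum * 31u + slot;
    }
    return checksum;
}
```

Two runs with seed 42 produce the same checksum, so under this assumption the instrumented binary exercises the same code paths each time the profile is collected.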
### Implementation
#### scripts/box/pgo_tiny_profile_box.sh
```bash
#!/bin/bash
# Box: PGO Profile Collection
# Contract: Execute representative Tiny workloads for PGO
# Usage: ./scripts/box/pgo_tiny_profile_box.sh
set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"

echo "========================================="
echo "Box: PGO Profile Collection (Tiny Front)"
echo "========================================="

# Validate binaries exist
for bin in "${PGO_BINARIES[@]}"; do
    if [[ ! -x "$bin" ]]; then
        echo "ERROR: Binary not found or not executable: $bin"
        exit 1
    fi
done

# Clean old profile data
echo "[PGO_BOX] Cleaning old .gcda files..."
find . -name "*.gcda" -delete

# Execute workloads
echo "[PGO_BOX] Executing representative workloads..."
for workload in "${PGO_WORKLOADS[@]}"; do
    echo "[PGO_BOX] Running: $workload"
    eval "$workload"
done

# Verify profile data generated
GCDA_COUNT=$(find . -name "*.gcda" | wc -l)
if [[ $GCDA_COUNT -eq 0 ]]; then
    echo "ERROR: No .gcda files generated!"
    exit 1
fi

echo "[PGO_BOX] Profile collection complete"
echo "[PGO_BOX] Generated $GCDA_COUNT .gcda files"
echo "========================================="
```
#### scripts/box/pgo_tiny_profile_config.sh
```bash
#!/bin/bash
# Box: PGO Profile Configuration
# Purpose: Define representative workloads for Tiny Front
# Binaries to profile
PGO_BINARIES=(
    "./bench_random_mixed_hakmem"
    "./bench_tiny_hot_hakmem"
)

# Representative workloads (deterministic seeds)
PGO_WORKLOADS=(
    # Random mixed: Common case (medium working set)
    "./bench_random_mixed_hakmem 5000000 256 42"
    # Random mixed: Smaller working set (higher cache hit)
    "./bench_random_mixed_hakmem 5000000 128 42"
    # Random mixed: Larger working set (more diverse)
    "./bench_random_mixed_hakmem 5000000 512 42"
    # Tiny hot path: 16B allocations
    "./bench_tiny_hot_hakmem 16 100 60000"
    # Tiny hot path: 64B allocations
    "./bench_tiny_hot_hakmem 64 100 60000"
)
```
### Makefile Integration
```makefile
# PGO Tiny Profile Build
pgo-tiny-profile:
	@echo "Building PGO profile binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-generate -flto" \
		LDFLAGS+="-fprofile-generate -flto" \
		HAKMEM_BUILD_RELEASE=1 \
		HAKMEM_TINY_FRONT_PGO=1 \
		bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Tiny Optimized Build
# NOTE: this assumes `clean` does not delete the .gcda files collected above;
# otherwise -fprofile-use has no profile data to read.
pgo-tiny-build:
	@echo "Collecting PGO profile data..."
	./scripts/box/pgo_tiny_profile_box.sh
	@echo "Building PGO-optimized binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-use -flto" \
		LDFLAGS+="-fprofile-use -flto" \
		HAKMEM_BUILD_RELEASE=1 \
		HAKMEM_TINY_FRONT_PGO=1 \
		bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Full Workflow
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build
	@echo "PGO optimization complete!"
	@echo "Testing optimized binary..."
	./bench_random_mixed_hakmem 1000000 256 42
```
---
## Box 2: Tiny Front Hot Path Box
### Responsibility
Ultra-fast allocation path (minimal branch count, always_inline)
### File Layout
```
core/box/tiny_front_hot_box.h - Hot path implementation
core/box/tiny_front_hot_box_macros.h - Pointer safety macros
```
### Contract
**Preconditions**:
- `class_idx` validated (0-7)
- TLS initialized
- Not in slow path mode
**Guarantees**:
- Maximum 5-7 branches
- Always inlined
- Null-safe pointer operations
- PGO-optimized
**Performance**:
- Hit case: ~20-30 cycles
- Miss case: → Cold Box (~100-200 cycles)
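The hit and miss figures combine into an expected per-allocation cost once a hit rate is fixed. The sketch below works through that arithmetic; the function name is illustrative, and the midpoints used in the usage note (25 cycles per hit, 150 per miss) are assumptions taken from the middle of the ranges above, not measurements.

```c
/* Expected cost per allocation:
 *   hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles */
double expected_alloc_cycles(double hit_rate, double hit_cycles, double miss_cycles)
{
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

With a 90% hit rate this gives 0.9 × 25 + 0.1 × 150 = 37.5 cycles; at 80% it rises to 50 cycles, which is why keeping the miss path rare matters as much as shortening the hit path.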
### Pointer Safety Macros
#### core/box/tiny_front_hot_box_macros.h
```c
#ifndef TINY_FRONT_HOT_BOX_MACROS_H
#define TINY_FRONT_HOT_BOX_MACROS_H
#include <stdint.h>
#include <stddef.h>
// ========== Pointer Type Definitions ==========
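// Pointer aliases for readability. All three are plain void*, so the compiler
// does not distinguish them; the actual checks come from the macros below.
typedef void* TinyHotPtr; // User-facing allocation pointer
typedef void* TinySLLNode; // SLL node pointer
typedef void* TinySlabBase; // Slab base pointer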
// ========== Pointer Safety Macros ==========
#if HAKMEM_BUILD_RELEASE
// Release: No overhead
#define TINY_HOT_PTR_CHECK(ptr) (ptr)
#define TINY_HOT_PTR_CAST(type, ptr) ((type)(ptr))
#define TINY_HOT_PTR_NULL NULL
#define TINY_HOT_PTR_IS_NULL(ptr) ((ptr) == NULL)
#define TINY_HOT_PTR_IS_VALID(ptr) ((ptr) != NULL)
#else
// Debug: Assertions enabled
#include <assert.h>
// Note: ({ ... }) is a GNU statement expression (GCC/Clang extension)
#define TINY_HOT_PTR_CHECK(ptr) \
    ({ void* _p = (ptr); \
       assert(((uintptr_t)_p & 0x7) == 0 && "Pointer not 8-byte aligned"); \
       _p; })
#define TINY_HOT_PTR_CAST(type, ptr) \
    ((type)TINY_HOT_PTR_CHECK(ptr))
#define TINY_HOT_PTR_NULL NULL
#define TINY_HOT_PTR_IS_NULL(ptr) ((ptr) == NULL)
#define TINY_HOT_PTR_IS_VALID(ptr) \
    ({ void* _p = (ptr); \
       _p != NULL && ((uintptr_t)_p & 0x7) == 0; })
#endif
// ========== SLL Operations Macros ==========
// Read next pointer from SLL node
#define TINY_HOT_SLL_NEXT(node) \
TINY_HOT_PTR_CAST(TinySLLNode, tiny_nextptr_get(node))
// Write next pointer to SLL node
#define TINY_HOT_SLL_SET_NEXT(node, next) \
tiny_nextptr_set((node), TINY_HOT_PTR_CHECK(next))
// Pop from TLS SLL (class-specific)
#define TINY_HOT_SLL_POP(class_idx) \
TINY_HOT_PTR_CAST(TinyHotPtr, tls_sll_pop_inline(class_idx))
// Push to TLS SLL (class-specific)
#define TINY_HOT_SLL_PUSH(class_idx, ptr) \
tls_sll_push_inline((class_idx), TINY_HOT_PTR_CHECK(ptr))
// ========== Likely/Unlikely Hints ==========
#define TINY_HOT_LIKELY(x) __builtin_expect(!!(x), 1)
#define TINY_HOT_UNLIKELY(x) __builtin_expect(!!(x), 0)
// ========== Branch Prediction Hints ==========
// Expected: SLL hit (80-90% of allocations)
#define TINY_HOT_EXPECT_HIT(ptr) \
TINY_HOT_LIKELY(TINY_HOT_PTR_IS_VALID(ptr))
// Expected: SLL miss (10-20% of allocations)
#define TINY_HOT_EXPECT_MISS(ptr) \
TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))
#endif // TINY_FRONT_HOT_BOX_MACROS_H
```
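The SLL macros wrap `tiny_nextptr_get`/`tiny_nextptr_set` and `tls_sll_pop_inline`/`tls_sll_push_inline`, whose definitions are not shown in this document. As a rough mental model only, an intrusive free list stores the next pointer inside the free block itself. The `Toy*` names below are hypothetical and the real helpers may differ (for example in where the next pointer lives or how the list is made thread-local).

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for one per-class free list (a real one would be thread-local). */
typedef struct {
    void  *head;
    size_t count;
} ToySLL;

/* The next pointer is stored in the first bytes of the free block. */
static inline void *toy_next_get(const void *node)
{
    void *next;
    memcpy(&next, node, sizeof next);
    return next;
}

static inline void toy_next_set(void *node, void *next)
{
    memcpy(node, &next, sizeof next);
}

/* Push: the freed block becomes the new head (LIFO). */
void toy_sll_push(ToySLL *list, void *block)
{
    toy_next_set(block, list->head);
    list->head = block;
    list->count++;
}

/* Pop: returns NULL on an empty list, which is the "miss" case. */
void *toy_sll_pop(ToySLL *list)
{
    void *block = list->head;
    if (block == NULL) {
        return NULL;
    }
    list->head = toy_next_get(block);
    list->count--;
    return block;
}
```

Because the list is LIFO, the most recently freed block is handed out first, which tends to keep it warm in cache.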
### Implementation
#### core/box/tiny_front_hot_box.h
```c
#ifndef TINY_FRONT_HOT_BOX_H
#define TINY_FRONT_HOT_BOX_H
#include "tiny_front_hot_box_macros.h"
#include "../tiny_next_ptr_box.h" // tiny_nextptr_get/set
#include "../tls_sll_box.h" // TLS SLL operations
// Forward declaration for cold path
void* tiny_front_cold_refill(int class_idx)
__attribute__((noinline, cold));
// ========== Box: Tiny Front Hot Path ==========
// Contract: Ultra-fast allocation with 5-7 branches max
// Precondition: class_idx validated (0-7), TLS initialized
// Performance: ~20-30 cycles (hit), ~100-200 cycles (miss → cold)
// Optimization: always_inline + PGO + branch hints
__attribute__((always_inline))
static inline TinyHotPtr tiny_front_hot_alloc(int class_idx)
{
    // Branch 1: TLS SLL pop (expected: 80-90% hit)
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    // Branch 2: Check if hit (optimized by PGO)
    if (TINY_HOT_EXPECT_HIT(ptr)) {
        // Fast path exit: ~20-30 cycles total
        return ptr;
    }

    // Branch 3: Miss → Cold path refill (10-20% of allocations)
    return TINY_HOT_PTR_CAST(TinyHotPtr, tiny_front_cold_refill(class_idx));
}
// ========== Box: Tiny Front Hot Free ==========
// Contract: Ultra-fast free with 3-5 branches max
// Precondition: ptr is valid Tiny allocation
// Performance: ~15-25 cycles
__attribute__((always_inline))
static inline void tiny_front_hot_free(TinyHotPtr ptr, int class_idx)
{
    // Branch 1: Null check (expected: rare)
    if (TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))) {
        return;
    }

    // Branch 2: Push to TLS SLL (expected: always succeeds)
    TINY_HOT_SLL_PUSH(class_idx, ptr);
    // Fast path exit: ~15-25 cycles total
}
#endif // TINY_FRONT_HOT_BOX_H
```
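The hot/cold split can be tried in isolation without the rest of hakmem. The following is a minimal sketch, assuming a small array-backed cache in place of the TLS SLL and a static arena in place of the SuperSlab backend; every `demo_*` name is hypothetical.

```c
#include <stddef.h>

#define DEMO_BATCH       4
#define DEMO_ARENA_SLOTS 32
#define DEMO_BLOCK_SIZE  64

static unsigned char demo_arena[DEMO_ARENA_SLOTS][DEMO_BLOCK_SIZE];
static size_t demo_arena_used;          /* blocks taken from the arena so far */
static void  *demo_cache[DEMO_BATCH];   /* small LIFO cache (stand-in for the TLS SLL) */
static size_t demo_cache_len;
static int    demo_refill_calls;        /* how often the cold path ran */

/* Cold path: kept out of line so the hot path stays small. */
__attribute__((noinline, cold))
static void *demo_cold_refill(void)
{
    demo_refill_calls++;
    while (demo_cache_len < DEMO_BATCH && demo_arena_used < DEMO_ARENA_SLOTS) {
        demo_cache[demo_cache_len++] = demo_arena[demo_arena_used++];
    }
    return demo_cache_len ? demo_cache[--demo_cache_len] : NULL;
}

/* Hot path: one predictable branch, then return or fall into the cold path. */
__attribute__((always_inline))
static inline void *demo_hot_alloc(void)
{
    if (__builtin_expect(demo_cache_len > 0, 1)) {
        return demo_cache[--demo_cache_len];
    }
    return demo_cold_refill();
}

void *demo_alloc(void)        { return demo_hot_alloc(); }
int   demo_refill_count(void) { return demo_refill_calls; }
```

In this toy, the first allocation takes the cold path and refills a batch of four; the next three are served by the hot path alone, so the refill counter only advances on every fourth call.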
---
## Box 3: Tiny Front Cold Path Box
### Responsibility
Low-frequency allocation/free slow path (noinline, cold attributes)
### File Layout
```
core/box/tiny_front_cold_box.h - Cold path implementation
```
### Contract
**Called When**:
- TLS SLL miss (refill needed)
- Slow allocation path (debug, large size, etc.)
**Guarantees**:
- I-cache separated from hot path
- Heavy operations allowed
- Can call into ACE, learning, diagnostics
**Optimization**:
- `noinline` → Not inlined into hot path
- `cold` → Compiler puts in cold section
### Implementation
#### core/box/tiny_front_cold_box.h
```c
#ifndef TINY_FRONT_COLD_BOX_H
#define TINY_FRONT_COLD_BOX_H
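#include <stddef.h>
#include "tiny_front_hot_box_macros.h"
// Forward declaration: tiny_front_cold_refill() calls this before its definition below
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
    __attribute__((noinline, cold));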
// ========== Box: Tiny Front Cold Refill ==========
// Contract: Refill TLS SLL when empty
// Called: 10-20% of allocations (SLL miss)
// Performance: ~100-200 cycles (acceptable for miss case)
// Optimization: noinline, cold → separated from hot path
__attribute__((noinline, cold))
void* tiny_front_cold_refill(int class_idx)
{
    // Heavy refill logic
    // - May allocate new SuperSlab
    // - May trigger ACE learning
    // - May call into diagnostics

    // Call existing refill logic
    tiny_fast_refill_and_take(class_idx);

    // After refill, try pop again
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);
    if (TINY_HOT_PTR_IS_VALID(ptr)) {
        return ptr;
    }

    // Refill failed → slow allocation
    return tiny_front_cold_slow_alloc(0, class_idx); // size=0 (unknown)
}
// ========== Box: Tiny Front Cold Slow Alloc ==========
// Contract: Slowest allocation path (debug, diagnostics, ACE)
// Called: Rare (refill failure, special modes)
// Performance: ~500-1000+ cycles (acceptable for rare case)
__attribute__((noinline, cold))
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
{
    // Debug/diagnostic/ACE learning hooks
    // - Allocation site tracking
    // - Size class profiling
    // - Memory pressure monitoring

    // Call legacy slow path
    return hak_tiny_alloc_slow(size, class_idx);
}
// ========== Box: Tiny Front Cold Drain ==========
// Contract: Drain remote frees (batched, low frequency)
// Called: Background or on threshold
// Optimization: noinline, cold
__attribute__((noinline, cold))
void tiny_front_cold_drain_remote(int class_idx)
{
    // Drain remote free lists into TLS SLL
    // - Batch processing for efficiency
    // - May trigger ACE rebalancing
    tiny_remote_drain_to_sll(class_idx);
}
#endif // TINY_FRONT_COLD_BOX_H
```
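`tiny_remote_drain_to_sll` is not defined in this document. One common way to implement a batched drain is to detach the whole remote list with a single atomic exchange and then splice it into the owner's local list. The sketch below shows that technique under the assumption of a lock-free remote stack; the `Demo*` names are hypothetical and the real hakmem drain may work differently.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct DemoNode { struct DemoNode *next; } DemoNode;

/* Remote list: other threads push freed blocks here with a CAS loop. */
static _Atomic(DemoNode *) demo_remote_head;
/* Local list: touched only by the owning thread, so no atomics are needed. */
static DemoNode *demo_local_head;

void demo_remote_free(DemoNode *node)
{
    DemoNode *old = atomic_load_explicit(&demo_remote_head, memory_order_relaxed);
    do {
        node->next = old;   /* `old` is refreshed by a failed CAS */
    } while (!atomic_compare_exchange_weak_explicit(
                 &demo_remote_head, &old, node,
                 memory_order_release, memory_order_relaxed));
}

/* Drain: take the entire remote list in one exchange, then splice it locally.
 * Returns the number of blocks moved. */
size_t demo_drain_remote(void)
{
    DemoNode *node = atomic_exchange_explicit(&demo_remote_head, (DemoNode *)NULL,
                                              memory_order_acquire);
    size_t moved = 0;
    while (node != NULL) {
        DemoNode *next = node->next;
        node->next = demo_local_head;
        demo_local_head = node;
        node = next;
        moved++;
    }
    return moved;
}

DemoNode *demo_local_pop(void)
{
    DemoNode *node = demo_local_head;
    if (node != NULL) {
        demo_local_head = node->next;
    }
    return node;
}
```

The owner pays for one atomic operation per drain rather than one per block, which is what makes running this on a threshold or in the background cheap.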
---
## Box 4: Tiny Front Config Box
### Responsibility
Centralized management of Tiny Front configuration (compile-time/runtime switching)
### File Layout
```
core/box/tiny_front_config_box.h - Configuration management
core/hakmem_build_flags.h - Build flag definitions (existing)
```
### Contract
**Compile-time Mode (PGO builds)**:
- `HAKMEM_TINY_FRONT_PGO=1`
- All runtime checks → compile-time constants
- Unused branches eliminated by compiler
**Runtime Mode (normal builds)**:
- Backward compatible
- ENV variable checks as before
- Full feature set available
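In runtime mode each feature check ultimately consults an environment variable. The bodies of `ultra_slim_mode_enabled()` and its siblings are not shown here, so the following is only a sketch of one plausible shape: parse the variable once and cache the answer. The parsing rule (unset, empty, or `0` means off) and the `demo_*` names are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Parse an ENV toggle: unset, empty, or "0" means off; anything else means on. */
int demo_env_flag(const char *name)
{
    const char *value = getenv(name);
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Hypothetical cached accessor: getenv() runs only on the first call.
 * Not synchronized; a real implementation would need to consider threads. */
int demo_ultra_slim_enabled(void)
{
    static int cached = -1;   /* -1 = not read yet */
    if (cached < 0) {
        cached = demo_env_flag("HAKMEM_TINY_ULTRA_SLIM");
    }
    return cached;
}
```

Even with caching, every call still costs a load and a branch on the hot path, which is the overhead the compile-time mode is meant to remove.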
### Implementation
#### core/box/tiny_front_config_box.h
```c
#ifndef TINY_FRONT_CONFIG_BOX_H
#define TINY_FRONT_CONFIG_BOX_H
#include <stdio.h> // fprintf (used by tiny_front_config_report)
// ========== Build Flag Definitions ==========
// Location: core/hakmem_build_flags.h
#ifndef HAKMEM_TINY_FRONT_PGO
# define HAKMEM_TINY_FRONT_PGO 0
#endif
// ========== PGO Mode: Fixed Configuration ==========
#if HAKMEM_TINY_FRONT_PGO
// PGO build: Fix configuration for profiling/optimization
// All runtime checks become compile-time constants
#define TINY_FRONT_ULTRA_SLIM_ENABLED 0
#define TINY_FRONT_HEAP_V2_ENABLED 0
#define TINY_FRONT_SFC_ENABLED 1
#define TINY_FRONT_FASTCACHE_ENABLED 0
#define TINY_FRONT_UNIFIED_GATE_ENABLED 1
#define TINY_FRONT_METRICS_ENABLED 0
#define TINY_FRONT_DIAG_ENABLED 0
// Optimization: Constant folding eliminates dead branches
// Example:
// if (TINY_FRONT_HEAP_V2_ENABLED) { ... }
// → Compiler eliminates entire block (0 is constant false)
#else
// Normal build: Runtime configuration (backward compatible)
// Checks ENV variables or config state
#define TINY_FRONT_ULTRA_SLIM_ENABLED ultra_slim_mode_enabled()
#define TINY_FRONT_HEAP_V2_ENABLED tiny_heap_v2_enabled()
#define TINY_FRONT_SFC_ENABLED sfc_cascade_enabled()
#define TINY_FRONT_FASTCACHE_ENABLED tiny_fastcache_enabled()
#define TINY_FRONT_UNIFIED_GATE_ENABLED front_gate_unified_enabled()
#define TINY_FRONT_METRICS_ENABLED tiny_metrics_enabled()
#define TINY_FRONT_DIAG_ENABLED tiny_diag_enabled()
#endif // HAKMEM_TINY_FRONT_PGO
// ========== Configuration Helpers ==========
// Check if running in PGO-optimized build
static inline int tiny_front_is_pgo_build(void)
{
    return HAKMEM_TINY_FRONT_PGO;
}

// Get effective configuration (for diagnostics)
static inline void tiny_front_config_report(void)
{
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[TINY_FRONT_CONFIG]\n");
    fprintf(stderr, " PGO Build: %d\n", HAKMEM_TINY_FRONT_PGO);
    fprintf(stderr, " Ultra SLIM: %d\n", TINY_FRONT_ULTRA_SLIM_ENABLED);
    fprintf(stderr, " Heap V2: %d\n", TINY_FRONT_HEAP_V2_ENABLED);
    fprintf(stderr, " SFC: %d\n", TINY_FRONT_SFC_ENABLED);
    fprintf(stderr, " FastCache: %d\n", TINY_FRONT_FASTCACHE_ENABLED);
    fprintf(stderr, " Unified Gate: %d\n", TINY_FRONT_UNIFIED_GATE_ENABLED);
#endif
}
#endif // TINY_FRONT_CONFIG_BOX_H
```
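The effect of switching a feature macro from a function call to a literal can be reproduced in a few lines. In this sketch `DEMO_PGO` stands in for `HAKMEM_TINY_FRONT_PGO`, and the runtime flag is a plain global rather than an ENV lookup; both are assumptions made to keep the example self-contained.

```c
/* Assumption for this sketch: DEMO_PGO plays the role of HAKMEM_TINY_FRONT_PGO. */
#ifndef DEMO_PGO
#  define DEMO_PGO 1
#endif

/* Runtime toggle, consulted only when DEMO_PGO is 0. */
int demo_heap_v2_runtime_flag = 0;
int demo_heap_v2_enabled(void) { return demo_heap_v2_runtime_flag; }

#if DEMO_PGO
#  define DEMO_HEAP_V2_ENABLED 0                        /* compile-time constant */
#else
#  define DEMO_HEAP_V2_ENABLED demo_heap_v2_enabled()   /* runtime check */
#endif

enum { DEMO_PATH_DEFAULT = 0, DEMO_PATH_HEAP_V2 = 1 };

int demo_select_path(void)
{
    if (DEMO_HEAP_V2_ENABLED) {   /* `if (0)` in the fixed build: dead code */
        return DEMO_PATH_HEAP_V2;
    }
    return DEMO_PATH_DEFAULT;
}
```

With `DEMO_PGO` set to 1 the function returns `DEMO_PATH_DEFAULT` even if the runtime flag is set, because the check no longer exists in the compiled code. That is also the trade-off: ENV overrides are silently ignored in the fixed-configuration build.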
#### Update to core/hakmem_build_flags.h
```c
// Add around line 190:
// HAKMEM_TINY_FRONT_PGO:
// 0 = Normal build with runtime configuration (default)
// 1 = PGO-optimized build with compile-time configuration
// Eliminates runtime branches for maximum performance.
// Use with: make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-build
#ifndef HAKMEM_TINY_FRONT_PGO
# define HAKMEM_TINY_FRONT_PGO 0
#endif
```
---
## Integration: Refactor tiny_alloc_fast()
### Before (one complex function, 15-20 branches)
```c
void* tiny_alloc_fast(size_t size) {
    // Ultra SLIM check
    if (ultra_slim_mode_enabled()) { ... }
    // Size to class
    int class_idx = hak_tiny_size_to_class(size);
    // Metrics
    if (tiny_metrics_enabled()) { tiny_sizeclass_hist_hit(class_idx); }
    // Heap V2 check
    if (tiny_heap_v2_enabled()) { ... }
    // FastCache check
    if (tiny_fastcache_enabled()) { ... }
    // SFC cascade
    if (sfc_cascade_enabled()) { ... }
    // TLS SLL pop
    void* ptr = tls_sll_pop(class_idx);
    if (ptr) return ptr;
    // Refill logic (complex)
    ...
}
```
### After (Box-based, only 3-5 branches)
```c
// Include new boxes
#include "core/box/tiny_front_config_box.h"
#include "core/box/tiny_front_hot_box.h"
#include "core/box/tiny_front_cold_box.h"
void* tiny_alloc_fast(size_t size) {
    // Branch 1: Ultra SLIM mode check (compile-time constant in PGO)
    if (TINY_FRONT_ULTRA_SLIM_ENABLED) {
        return tiny_ultra_slim_alloc(size); // Separate path
    }

    // Branch 2: Size to class (always needed)
    int class_idx = hak_tiny_size_to_class(size);

    // Branch 3: Hot path (inlined, 2-3 branches inside)
    return tiny_front_hot_alloc(class_idx);

    // Total branches in PGO build: 2-3
    // (Ultra SLIM = 0 → eliminated, hot_alloc inlined)
}
```
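The size-to-class step is the one piece of work that survives in every build. `hak_tiny_size_to_class` is not defined in this document; the sketch below assumes eight power-of-two classes from 8 B to 1024 B purely for illustration (the document only states that `class_idx` ranges over 0-7 and that 16 B and 64 B are Tiny sizes). The function returns -1 when the size falls outside the assumed Tiny range.

```c
#include <stddef.h>

#define DEMO_NUM_CLASSES 8
#define DEMO_MIN_SIZE    8u   /* assumed smallest class; not confirmed by the source */

/* Hypothetical mapping: class i serves sizes up to (8 << i) bytes. */
int demo_size_to_class(size_t size)
{
    size_t class_size = DEMO_MIN_SIZE;
    for (int idx = 0; idx < DEMO_NUM_CLASSES; idx++) {
        if (size <= class_size) {
            return idx;
        }
        class_size <<= 1;
    }
    return -1;   /* larger than the biggest assumed Tiny class */
}
```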
**Effective branch count after PGO optimization**: **only 2-3 branches**

---
## Testing Strategy
### Step 1: PGO Workflow Test
```bash
# Build profile version
make pgo-tiny-profile
# Collect profiles (automated)
./scripts/box/pgo_tiny_profile_box.sh
# Build optimized version
make pgo-tiny-build
# Benchmark
./bench_random_mixed_hakmem 1000000 256 42
./bench_tiny_hot_hakmem
# Expected: +5-10% improvement
```
### Step 2: Hot/Cold Separation Test
```bash
# Build with hot/cold boxes
make clean
make HAKMEM_TINY_FRONT_PGO=1 bench_random_mixed_hakmem
# Benchmark
./bench_random_mixed_hakmem 1000000 256 42
# Expected: +10-15% improvement (cumulative +15-25%)
```
### Step 3: Config Box Test
```bash
# PGO build (compile-time config)
make clean
make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-full
# Normal build (runtime config)
make clean
make bench_random_mixed_hakmem
# Both should work, PGO should be faster
```
### Regression Testing
```bash
# Ensure backward compatibility
HAKMEM_TINY_ULTRA_SLIM=1 ./bench_random_mixed_hakmem 100000 256
HAKMEM_TINY_HEAP_V2=1 ./bench_random_mixed_hakmem 100000 256
# All existing ENV vars should work in normal builds
```
---
## Performance Expectations
### Branch Reduction
- **Before**: 15-20 branches in `tiny_alloc_fast()`
- **After (PGO)**: 2-3 branches (most eliminated by compiler)
- **Gain**: ~40-60% reduction in branch misses
### Instruction Count
- **Before**: ~167M instructions (1M ops)
- **After**: ~120-140M instructions
- **Gain**: ~16-28% reduction
### Throughput
- **Phase 3**: 56.8M ops/s
- **Phase 4.1 (PGO)**: 60-62M ops/s (+5-10%)
- **Phase 4.2 (Hot/Cold)**: 68-75M ops/s (+10-15%)
- **Phase 4.3 (Config)**: 73-83M ops/s (+5-8%)
**Total Improvement**: +28-46% over the Phase 3 baseline, a step toward the 2x goal

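The percentages in this section follow from the stated endpoints. The helpers below (names are illustrative) reproduce the arithmetic: 73-83M ops/s against the 56.8M ops/s Phase 3 figure gives roughly +28.5% to +46.1%, and 120-140M instructions against 167M gives roughly a 16.2% to 28.1% reduction. For comparison, a full 2x would correspond to 113.6M ops/s.

```c
/* Relative gain of `value` over `baseline`, in percent. */
double percent_gain(double baseline, double value)
{
    return (value - baseline) / baseline * 100.0;
}

/* Relative reduction from `before` to `after`, in percent. */
double percent_reduction(double before, double after)
{
    return (before - after) / before * 100.0;
}

/* Throughput:   percent_gain(56.8, 73.0)        is about 28.5
 *               percent_gain(56.8, 83.0)        is about 46.1
 * Instructions: percent_reduction(167.0, 140.0) is about 16.2
 *               percent_reduction(167.0, 120.0) is about 28.1 */
```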
---
## Implementation Schedule
### Week 1: PGO Workflow
- Day 1-2: PGO scripts + Makefile
- Day 3: Profile collection + benchmarking
- Day 4: Documentation + review
### Week 2: Hot/Cold Separation
- Day 1-2: Hot Box + macros
- Day 3-4: Cold Box + refactor
- Day 5: Testing + PGO re-optimization
### Week 3: Config Box + Polish
- Day 1-2: Config Box implementation
- Day 3: Integration testing
- Day 4-5: Final benchmarks + documentation
---
## Success Criteria
**Code Quality**:
- All pointer operations use macros
- Clear contracts in each Box
- Zero regression in existing features
**Performance**:
- bench_random_mixed: 73-83M ops/s (vs 56.8M baseline)
- bench_tiny_hot: 100-115M ops/s (vs 81M baseline)
- No regression in other benchmarks
**Maintainability**:
- Hot/Cold separation clear
- PGO workflow documented
- Backward compatible
---
Generated: 2025-11-29
Phase: 4 Design
Next: Implementation (Week 1 start)