# Phase 4: Tiny Front Optimization - Box Design Document

**Date**: 2025-11-29
**Author**: Claude Code
**Goal**: 2x throughput improvement via Boxification + PGO + Hot/Cold separation

---

## Design Philosophy

### Box Design Principles
1. **Single Responsibility**: one Box = one clearly defined responsibility
2. **Clear Contracts**: inputs, outputs, and guarantees are stated explicitly
3. **Macro-Based Pointers**: type safety, null checks, and a unified API
4. **Testability**: each Box can be tested independently
5. **Incremental**: implemented and verified in stages

### Pointer Safety Strategy
**All pointer operations are abstracted behind macros**:
- Unified null checks
- Safe type casts
- Assertions in debug builds
- Optimized away in release builds

---

## Box 1: PGO Profile Collection Box

### Responsibility
Standardize and automate PGO profile collection for the Tiny Front.

### File Layout
```
scripts/box/pgo_tiny_profile_box.sh       - main script
scripts/box/pgo_tiny_profile_config.sh    - configuration (workload definitions)
```

### Contract

**Input**:
- Built binaries with `-fprofile-generate -flto`
  - `bench_random_mixed_hakmem`
  - `bench_tiny_hot_hakmem`

**Output**:
- `.gcda` profile data files
- Profile summary report

**Guarantees**:
- Deterministic execution (fixed seed)
- Representative workload coverage
- Error detection & reporting

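The fixed-seed guarantee can be illustrated with a minimal sketch. It assumes a xorshift32 generator, and the names `demo_xorshift32` and `demo_workload_checksum` are hypothetical; how the real benchmarks consume their seed argument is not shown in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: one step of Marsaglia's xorshift32 generator. */
static uint32_t demo_xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical workload: sums n pseudo-random sizes in [1, max_size].
 * The result depends only on (seed, n, max_size), so repeated runs with
 * the same arguments exercise the same sequence of sizes. */
static uint64_t demo_workload_checksum(uint32_t seed, size_t n, uint32_t max_size) {
    uint32_t state = seed ? seed : 1u; /* xorshift state must be nonzero */
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += (uint64_t)(demo_xorshift32(&state) % max_size) + 1u;
    }
    return sum;
}
```

Calling `demo_workload_checksum(42u, 1000, 256u)` twice returns the same value, which is the property that keeps profiles collected on different runs comparable.
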
### Implementation

#### scripts/box/pgo_tiny_profile_box.sh
```bash
#!/bin/bash
# Box: PGO Profile Collection
# Contract: Execute representative Tiny workloads for PGO
# Usage: ./scripts/box/pgo_tiny_profile_box.sh
# Run from the repository root: binary paths are relative to the current directory.

set -e

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "${SCRIPT_DIR}/pgo_tiny_profile_config.sh"

echo "========================================="
echo "Box: PGO Profile Collection (Tiny Front)"
echo "========================================="

# Validate binaries exist
for bin in "${PGO_BINARIES[@]}"; do
    if [[ ! -x "$bin" ]]; then
        echo "ERROR: Binary not found or not executable: $bin" >&2
        exit 1
    fi
done

# Clean old profile data
echo "[PGO_BOX] Cleaning old .gcda files..."
find . -name "*.gcda" -delete

# Execute workloads
echo "[PGO_BOX] Executing representative workloads..."

for workload in "${PGO_WORKLOADS[@]}"; do
    echo "[PGO_BOX] Running: $workload"
    eval "$workload"
done

# Verify profile data generated
GCDA_COUNT=$(find . -name "*.gcda" | wc -l)
if [[ $GCDA_COUNT -eq 0 ]]; then
    echo "ERROR: No .gcda files generated!" >&2
    exit 1
fi

echo "[PGO_BOX] Profile collection complete"
echo "[PGO_BOX] Generated $GCDA_COUNT .gcda files"
echo "========================================="
```

#### scripts/box/pgo_tiny_profile_config.sh
```bash
#!/bin/bash
# Box: PGO Profile Configuration
# Purpose: Define representative workloads for Tiny Front

# Binaries to profile
PGO_BINARIES=(
    "./bench_random_mixed_hakmem"
    "./bench_tiny_hot_hakmem"
)

# Representative workloads (deterministic seeds)
PGO_WORKLOADS=(
    # Random mixed: Common case (medium working set)
    "./bench_random_mixed_hakmem 5000000 256 42"

    # Random mixed: Smaller working set (higher cache hit)
    "./bench_random_mixed_hakmem 5000000 128 42"

    # Random mixed: Larger working set (more diverse)
    "./bench_random_mixed_hakmem 5000000 512 42"

    # Tiny hot path: 16B allocations
    "./bench_tiny_hot_hakmem 16 100 60000"

    # Tiny hot path: 64B allocations
    "./bench_tiny_hot_hakmem 64 100 60000"
)
```

### Makefile Integration
```makefile
# PGO Tiny Profile Build
pgo-tiny-profile:
	@echo "Building PGO profile binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-generate -flto" \
		LDFLAGS+="-fprofile-generate -flto" \
		HAKMEM_BUILD_RELEASE=1 \
		HAKMEM_TINY_FRONT_PGO=1 \
		bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Tiny Optimized Build
# NOTE: assumes `clean` leaves *.gcda files in place; if it deletes them,
# the profile collected by the script is lost before -fprofile-use runs.
pgo-tiny-build:
	@echo "Collecting PGO profile data..."
	./scripts/box/pgo_tiny_profile_box.sh
	@echo "Building PGO-optimized binaries..."
	$(MAKE) clean
	$(MAKE) CFLAGS+="-fprofile-use -flto" \
		LDFLAGS+="-fprofile-use -flto" \
		HAKMEM_BUILD_RELEASE=1 \
		HAKMEM_TINY_FRONT_PGO=1 \
		bench_random_mixed_hakmem bench_tiny_hot_hakmem

# PGO Full Workflow
pgo-tiny-full: pgo-tiny-profile pgo-tiny-build
	@echo "PGO optimization complete!"
	@echo "Testing optimized binary..."
	./bench_random_mixed_hakmem 1000000 256 42
```

---

## Box 2: Tiny Front Hot Path Box

### Responsibility
Ultra-fast allocation path (minimal branch count, `always_inline`).

### File Layout
```
core/box/tiny_front_hot_box.h           - Hot path implementation
core/box/tiny_front_hot_box_macros.h    - Pointer safety macros
```

### Contract

**Preconditions**:
- `class_idx` validated (0-7)
- TLS initialized
- Not in slow path mode

**Guarantees**:
- Maximum 5-7 branches
- Always inlined
- Null-safe pointer operations
- PGO-optimized

**Performance**:
- Hit case: ~20-30 cycles
- Miss case: → Cold Box (~100-200 cycles)

### Pointer Safety Macros

#### core/box/tiny_front_hot_box_macros.h
```c
#ifndef TINY_FRONT_HOT_BOX_MACROS_H
#define TINY_FRONT_HOT_BOX_MACROS_H

#include <stdint.h>
#include <stddef.h>

// ========== Pointer Type Definitions ==========

// Named pointer aliases that document intent.
// Note: all three are plain void*, so the compiler does not tell them apart.
typedef void* TinyHotPtr;      // User-facing allocation pointer
typedef void* TinySLLNode;     // SLL node pointer
typedef void* TinySlabBase;    // Slab base pointer

// ========== Pointer Safety Macros ==========

#if HAKMEM_BUILD_RELEASE
    // Release: No overhead
    #define TINY_HOT_PTR_CHECK(ptr)          (ptr)
    #define TINY_HOT_PTR_CAST(type, ptr)     ((type)(ptr))
    #define TINY_HOT_PTR_NULL                NULL
    #define TINY_HOT_PTR_IS_NULL(ptr)        ((ptr) == NULL)
    #define TINY_HOT_PTR_IS_VALID(ptr)       ((ptr) != NULL)
#else
    // Debug: Assertions enabled (relies on GNU statement expressions)
    #include <assert.h>
    #define TINY_HOT_PTR_CHECK(ptr) \
        ({ void* _p = (ptr); \
           assert(((uintptr_t)_p & 0x7) == 0 && "Pointer not 8-byte aligned"); \
           _p; })
    #define TINY_HOT_PTR_CAST(type, ptr) \
        ((type)TINY_HOT_PTR_CHECK(ptr))
    #define TINY_HOT_PTR_NULL                NULL
    #define TINY_HOT_PTR_IS_NULL(ptr)        ((ptr) == NULL)
    #define TINY_HOT_PTR_IS_VALID(ptr) \
        ({ void* _p = (ptr); \
           _p != NULL && ((uintptr_t)_p & 0x7) == 0; })
#endif

// ========== SLL Operations Macros ==========

// Read next pointer from SLL node
#define TINY_HOT_SLL_NEXT(node) \
    TINY_HOT_PTR_CAST(TinySLLNode, tiny_nextptr_get(node))

// Write next pointer to SLL node
#define TINY_HOT_SLL_SET_NEXT(node, next) \
    tiny_nextptr_set((node), TINY_HOT_PTR_CHECK(next))

// Pop from TLS SLL (class-specific)
#define TINY_HOT_SLL_POP(class_idx) \
    TINY_HOT_PTR_CAST(TinyHotPtr, tls_sll_pop_inline(class_idx))

// Push to TLS SLL (class-specific)
#define TINY_HOT_SLL_PUSH(class_idx, ptr) \
    tls_sll_push_inline((class_idx), TINY_HOT_PTR_CHECK(ptr))

// ========== Likely/Unlikely Hints ==========

#define TINY_HOT_LIKELY(x)      __builtin_expect(!!(x), 1)
#define TINY_HOT_UNLIKELY(x)    __builtin_expect(!!(x), 0)

// ========== Branch Prediction Hints ==========

// Expected: SLL hit (80-90% of allocations)
#define TINY_HOT_EXPECT_HIT(ptr) \
    TINY_HOT_LIKELY(TINY_HOT_PTR_IS_VALID(ptr))

// Expected: SLL miss (10-20% of allocations)
#define TINY_HOT_EXPECT_MISS(ptr) \
    TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))

#endif // TINY_FRONT_HOT_BOX_MACROS_H
```

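The debug/release split used by these macros can be tried in isolation. The following is a minimal sketch, assuming GCC or Clang (for statement expressions and `__builtin_expect`); every `DEMO_*` / `demo_*` name is hypothetical and not part of the hakmem sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build switch standing in for HAKMEM_BUILD_RELEASE. */
#ifndef DEMO_BUILD_RELEASE
#define DEMO_BUILD_RELEASE 0
#endif

#define DEMO_LIKELY(x)   __builtin_expect(!!(x), 1)
#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Pure predicate: true when the address is a multiple of 8 (NULL included). */
static inline int demo_ptr_is_aligned8(const void *p) {
    return ((uintptr_t)p & 0x7u) == 0;
}

#if DEMO_BUILD_RELEASE
/* Release: the check disappears and the pointer passes through. */
#define DEMO_PTR_CHECK(p) (p)
#else
/* Debug: evaluate p once, assert alignment, then yield the value. */
#define DEMO_PTR_CHECK(p) \
    ({ void *_p = (p); assert(demo_ptr_is_aligned8(_p)); _p; })
#endif

/* Returns 1 for a non-NULL, 8-byte-aligned pointer, otherwise 0. */
static inline int demo_ptr_is_valid(const void *p) {
    return DEMO_LIKELY(p != NULL) && demo_ptr_is_aligned8(p);
}
```

Building with `-DDEMO_BUILD_RELEASE=1` turns `DEMO_PTR_CHECK` into a plain pass-through, so the alignment assertion costs nothing in release binaries.
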
### Implementation

#### core/box/tiny_front_hot_box.h
```c
#ifndef TINY_FRONT_HOT_BOX_H
#define TINY_FRONT_HOT_BOX_H

#include "tiny_front_hot_box_macros.h"
#include "../tiny_next_ptr_box.h"  // tiny_nextptr_get/set
#include "../tls_sll_box.h"        // TLS SLL operations

// Forward declaration for cold path
void* tiny_front_cold_refill(int class_idx)
    __attribute__((noinline, cold));

// ========== Box: Tiny Front Hot Path ==========
// Contract: Ultra-fast allocation with 5-7 branches max
// Precondition: class_idx validated (0-7), TLS initialized
// Performance: ~20-30 cycles (hit), ~100-200 cycles (miss → cold)
// Optimization: always_inline + PGO + branch hints

__attribute__((always_inline))
static inline TinyHotPtr tiny_front_hot_alloc(int class_idx)
{
    // Branch 1: TLS SLL pop (expected: 80-90% hit)
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    // Branch 2: Check if hit (optimized by PGO)
    if (TINY_HOT_EXPECT_HIT(ptr)) {
        // Fast path exit: ~20-30 cycles total
        return ptr;
    }

    // Branch 3: Miss → Cold path refill (10-20% of allocations)
    return TINY_HOT_PTR_CAST(TinyHotPtr, tiny_front_cold_refill(class_idx));
}

// ========== Box: Tiny Front Hot Free ==========
// Contract: Ultra-fast free with 3-5 branches max
// Precondition: ptr is valid Tiny allocation
// Performance: ~15-25 cycles

__attribute__((always_inline))
static inline void tiny_front_hot_free(TinyHotPtr ptr, int class_idx)
{
    // Branch 1: Null check (expected: rare)
    if (TINY_HOT_UNLIKELY(TINY_HOT_PTR_IS_NULL(ptr))) {
        return;
    }

    // Branch 2: Push to TLS SLL (expected: always succeeds)
    TINY_HOT_SLL_PUSH(class_idx, ptr);

    // Fast path exit: ~15-25 cycles total
}

#endif // TINY_FRONT_HOT_BOX_H
```

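The hot/cold split can be exercised without the rest of the allocator. The sketch below is a toy model under stated assumptions: a thread-local singly linked free list per class, a fixed 64-byte block size, and `malloc` as the backing store. All `demo_*` names and constants are hypothetical, and the refilled chunks are never returned to the system. After the first allocation triggers one refill, later allocations and frees for that class stay on the inlined path until the list runs dry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_CLASSES      8   /* mirrors the 0-7 class range in the contract */
#define DEMO_BLOCK_SIZE   64  /* hypothetical fixed block size */
#define DEMO_REFILL_COUNT 4   /* blocks carved per refill; arbitrary */

/* A free block stores the next pointer in its first word. */
typedef struct DemoNode { struct DemoNode *next; } DemoNode;

static _Thread_local DemoNode *demo_sll_head[DEMO_CLASSES];
static _Thread_local unsigned demo_refill_calls; /* observation only */

/* Cold path: kept out of line so the hot path stays small. */
__attribute__((noinline, cold))
static void *demo_cold_refill(int class_idx) {
    demo_refill_calls++;
    char *chunk = malloc((size_t)DEMO_BLOCK_SIZE * DEMO_REFILL_COUNT);
    if (chunk == NULL) return NULL;
    /* Block 0 goes to the caller; the rest are pushed onto the list. */
    for (int i = 1; i < DEMO_REFILL_COUNT; i++) {
        DemoNode *n = (DemoNode *)(chunk + (size_t)i * DEMO_BLOCK_SIZE);
        n->next = demo_sll_head[class_idx];
        demo_sll_head[class_idx] = n;
    }
    return chunk;
}

/* Hot path: one predictable branch on the list head. */
__attribute__((always_inline))
static inline void *demo_hot_alloc(int class_idx) {
    DemoNode *n = demo_sll_head[class_idx];
    if (__builtin_expect(n != NULL, 1)) {
        demo_sll_head[class_idx] = n->next;
        return n;
    }
    return demo_cold_refill(class_idx);
}

/* Hot free: NULL is ignored, anything else is pushed back onto the list. */
__attribute__((always_inline))
static inline void demo_hot_free(void *p, int class_idx) {
    if (__builtin_expect(p == NULL, 0)) return;
    DemoNode *n = p;
    n->next = demo_sll_head[class_idx];
    demo_sll_head[class_idx] = n;
}
```
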
---

## Box 3: Tiny Front Cold Path Box

### Responsibility
Low-frequency allocation/free slow path (`noinline`, `cold` attributes).

### File Layout
```
core/box/tiny_front_cold_box.h    - Cold path implementation
```

### Contract

**Called When**:
- TLS SLL miss (refill needed)
- Slow allocation path (debug, large size, etc.)

**Guarantees**:
- I-cache separated from hot path
- Heavy operations allowed
- Can call into ACE, learning, diagnostics

**Optimization**:
- `noinline` → Not inlined into hot path
- `cold` → Compiler puts in cold section

### Implementation

#### core/box/tiny_front_cold_box.h
```c
#ifndef TINY_FRONT_COLD_BOX_H
#define TINY_FRONT_COLD_BOX_H

#include <stddef.h>
#include "tiny_front_hot_box_macros.h"

// NOTE: the functions below are external (non-static) definitions, so this
// header must be included from exactly one .c file to avoid duplicate symbols.

// Forward declaration: used by tiny_front_cold_refill() before its definition
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
    __attribute__((noinline, cold));

// ========== Box: Tiny Front Cold Refill ==========
// Contract: Refill TLS SLL when empty
// Called: 10-20% of allocations (SLL miss)
// Performance: ~100-200 cycles (acceptable for miss case)
// Optimization: noinline, cold → separated from hot path

__attribute__((noinline, cold))
void* tiny_front_cold_refill(int class_idx)
{
    // Heavy refill logic
    // - May allocate new SuperSlab
    // - May trigger ACE learning
    // - May call into diagnostics

    // Call existing refill logic
    tiny_fast_refill_and_take(class_idx);

    // After refill, try pop again
    TinyHotPtr ptr = TINY_HOT_SLL_POP(class_idx);

    if (TINY_HOT_PTR_IS_VALID(ptr)) {
        return ptr;
    }

    // Refill failed → slow allocation
    return tiny_front_cold_slow_alloc(0, class_idx);  // size=0 (unknown)
}

// ========== Box: Tiny Front Cold Slow Alloc ==========
// Contract: Slowest allocation path (debug, diagnostics, ACE)
// Called: Rare (refill failure, special modes)
// Performance: ~500-1000+ cycles (acceptable for rare case)

__attribute__((noinline, cold))
void* tiny_front_cold_slow_alloc(size_t size, int class_idx)
{
    // Debug/diagnostic/ACE learning hooks
    // - Allocation site tracking
    // - Size class profiling
    // - Memory pressure monitoring

    // Call legacy slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

// ========== Box: Tiny Front Cold Drain ==========
// Contract: Drain remote frees (batched, low frequency)
// Called: Background or on threshold
// Optimization: noinline, cold

__attribute__((noinline, cold))
void tiny_front_cold_drain_remote(int class_idx)
{
    // Drain remote free lists into TLS SLL
    // - Batch processing for efficiency
    // - May trigger ACE rebalancing

    tiny_remote_drain_to_sll(class_idx);
}

#endif // TINY_FRONT_COLD_BOX_H
```

---

## Box 4: Tiny Front Config Box

### Responsibility
Centralized management of Tiny Front configuration (compile-time/runtime switching).

### File Layout
```
core/box/tiny_front_config_box.h    - Configuration management
core/hakmem_build_flags.h           - Build flag definitions (existing)
```

### Contract

**Compile-time Mode (PGO builds)**:
- `HAKMEM_TINY_FRONT_PGO=1`
- All runtime checks → compile-time constants
- Unused branches eliminated by compiler

**Runtime Mode (normal builds)**:
- Backward compatible
- ENV variable checks as before
- Full feature set available

### Implementation

#### core/box/tiny_front_config_box.h
```c
#ifndef TINY_FRONT_CONFIG_BOX_H
#define TINY_FRONT_CONFIG_BOX_H

#include <stdio.h>  // fprintf (used by tiny_front_config_report)

// ========== Build Flag Definitions ==========
// Location: core/hakmem_build_flags.h

#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

// ========== PGO Mode: Fixed Configuration ==========

#if HAKMEM_TINY_FRONT_PGO
    // PGO build: Fix configuration for profiling/optimization
    // All runtime checks become compile-time constants

    #define TINY_FRONT_ULTRA_SLIM_ENABLED     0
    #define TINY_FRONT_HEAP_V2_ENABLED        0
    #define TINY_FRONT_SFC_ENABLED            1
    #define TINY_FRONT_FASTCACHE_ENABLED      0
    #define TINY_FRONT_UNIFIED_GATE_ENABLED   1
    #define TINY_FRONT_METRICS_ENABLED        0
    #define TINY_FRONT_DIAG_ENABLED           0

    // Optimization: Constant folding eliminates dead branches
    // Example:
    //   if (TINY_FRONT_HEAP_V2_ENABLED) { ... }
    //   → Compiler eliminates entire block (0 is constant false)

#else
    // Normal build: Runtime configuration (backward compatible)
    // Checks ENV variables or config state

    #define TINY_FRONT_ULTRA_SLIM_ENABLED     ultra_slim_mode_enabled()
    #define TINY_FRONT_HEAP_V2_ENABLED        tiny_heap_v2_enabled()
    #define TINY_FRONT_SFC_ENABLED            sfc_cascade_enabled()
    #define TINY_FRONT_FASTCACHE_ENABLED      tiny_fastcache_enabled()
    #define TINY_FRONT_UNIFIED_GATE_ENABLED   front_gate_unified_enabled()
    #define TINY_FRONT_METRICS_ENABLED        tiny_metrics_enabled()
    #define TINY_FRONT_DIAG_ENABLED           tiny_diag_enabled()

#endif // HAKMEM_TINY_FRONT_PGO

// ========== Configuration Helpers ==========

// Check if running in PGO-optimized build
static inline int tiny_front_is_pgo_build(void)
{
    return HAKMEM_TINY_FRONT_PGO;
}

// Get effective configuration (for diagnostics)
static inline void tiny_front_config_report(void)
{
#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[TINY_FRONT_CONFIG]\n");
    fprintf(stderr, "  PGO Build:        %d\n", HAKMEM_TINY_FRONT_PGO);
    fprintf(stderr, "  Ultra SLIM:       %d\n", TINY_FRONT_ULTRA_SLIM_ENABLED);
    fprintf(stderr, "  Heap V2:          %d\n", TINY_FRONT_HEAP_V2_ENABLED);
    fprintf(stderr, "  SFC:              %d\n", TINY_FRONT_SFC_ENABLED);
    fprintf(stderr, "  FastCache:        %d\n", TINY_FRONT_FASTCACHE_ENABLED);
    fprintf(stderr, "  Unified Gate:     %d\n", TINY_FRONT_UNIFIED_GATE_ENABLED);
#endif
}

#endif // TINY_FRONT_CONFIG_BOX_H
```

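The compile-time versus runtime switch can be seen in a reduced form. This is a minimal sketch, not the project's configuration code: `DEMO_FRONT_PGO`, the `DEMO_HEAP_V2` environment variable, and the `demo_*` functions are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical build flag standing in for HAKMEM_TINY_FRONT_PGO. */
#ifndef DEMO_FRONT_PGO
#define DEMO_FRONT_PGO 0
#endif

/* Runtime lookup: reads the environment once and caches the answer. */
__attribute__((unused))
static int demo_heap_v2_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("DEMO_HEAP_V2");
        cached = (v != NULL && v[0] == '1');
    }
    return cached;
}

#if DEMO_FRONT_PGO
/* Fixed configuration: the condition below becomes the constant 0. */
#define DEMO_HEAP_V2_ENABLED 0
#else
/* Runtime configuration: the condition below is a function call. */
#define DEMO_HEAP_V2_ENABLED demo_heap_v2_enabled()
#endif

/* Returns 2 when routed to the optional path, 1 for the default path. */
static int demo_pick_path(void) {
    if (DEMO_HEAP_V2_ENABLED) {
        return 2; /* dead code when DEMO_FRONT_PGO=1 */
    }
    return 1;
}
```

With `-DDEMO_FRONT_PGO=1` the `if` condition is a literal `0`, so an optimizing compiler can drop the branch and the call entirely; without the flag, the same source keeps honoring the environment variable.
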
#### Update to core/hakmem_build_flags.h
```c
// Add around line 190:

// HAKMEM_TINY_FRONT_PGO:
// 0 = Normal build with runtime configuration (default)
// 1 = PGO-optimized build with compile-time configuration
//     Eliminates runtime branches for maximum performance.
//     Use with: make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-build
#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif
```

---

## Integration: Refactor tiny_alloc_fast()

### Before (one complex function, 15-20 branches)
```c
void* tiny_alloc_fast(size_t size) {
    // Ultra SLIM check
    if (ultra_slim_mode_enabled()) { ... }

    // Size to class
    int class_idx = hak_tiny_size_to_class(size);

    // Metrics
    if (tiny_metrics_enabled()) { tiny_sizeclass_hist_hit(class_idx); }

    // Heap V2 check
    if (tiny_heap_v2_enabled()) { ... }

    // FastCache check
    if (tiny_fastcache_enabled()) { ... }

    // SFC cascade
    if (sfc_cascade_enabled()) { ... }

    // TLS SLL pop
    void* ptr = tls_sll_pop(class_idx);
    if (ptr) return ptr;

    // Refill logic (complex)
    ...
}
```

### After (Boxified, only 3-5 branches)
```c
// Include new boxes
#include "core/box/tiny_front_config_box.h"
#include "core/box/tiny_front_hot_box.h"
#include "core/box/tiny_front_cold_box.h"

void* tiny_alloc_fast(size_t size) {
    // Branch 1: Ultra SLIM mode check (compile-time constant in PGO)
    if (TINY_FRONT_ULTRA_SLIM_ENABLED) {
        return tiny_ultra_slim_alloc(size);  // Separate path
    }

    // Branch 2: Size to class (always needed)
    int class_idx = hak_tiny_size_to_class(size);

    // Branch 3: Hot path (inlined, 2-3 branches inside)
    // Total branches in PGO build: 2-3
    // (Ultra SLIM = 0 → eliminated, hot_alloc inlined)
    return tiny_front_hot_alloc(class_idx);
}
```

**Effective branch count after PGO optimization**: **only 2-3 branches**.

---

## Testing Strategy

### Step 1: PGO Workflow Test
```bash
# Build profile version
make pgo-tiny-profile

# Collect profiles (automated; pgo-tiny-build below also runs this script)
./scripts/box/pgo_tiny_profile_box.sh

# Build optimized version
make pgo-tiny-build

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42
./bench_tiny_hot_hakmem

# Expected: +5-10% improvement
```

### Step 2: Hot/Cold Separation Test
```bash
# Build with hot/cold boxes
make clean
make HAKMEM_TINY_FRONT_PGO=1 bench_random_mixed_hakmem

# Benchmark
./bench_random_mixed_hakmem 1000000 256 42

# Expected: +10-15% improvement (cumulative +15-25%)
```

### Step 3: Config Box Test
```bash
# PGO build (compile-time config)
make clean
make HAKMEM_TINY_FRONT_PGO=1 pgo-tiny-full

# Normal build (runtime config)
make clean
make bench_random_mixed_hakmem

# Both should work, PGO should be faster
```

### Regression Testing
```bash
# Ensure backward compatibility
HAKMEM_TINY_ULTRA_SLIM=1 ./bench_random_mixed_hakmem 100000 256
HAKMEM_TINY_HEAP_V2=1 ./bench_random_mixed_hakmem 100000 256

# All existing ENV vars should work in normal builds
```

---

## Performance Expectations

### Branch Reduction
- **Before**: 15-20 branches in `tiny_alloc_fast()`
- **After (PGO)**: 2-3 branches (most eliminated by compiler)
- **Gain**: ~40-60% reduction in branch misses

### Instruction Count
- **Before**: ~167M instructions (1M ops)
- **After**: ~120-140M instructions
- **Gain**: ~16-28% reduction

### Throughput
- **Phase 3**: 56.8M ops/s
- **Phase 4.1 (PGO)**: 60-62M ops/s (+5-10%)
- **Phase 4.2 (Hot/Cold)**: 68-75M ops/s (+10-15%)
- **Phase 4.3 (Config)**: 73-83M ops/s (+5-8%)

**Total Improvement**: +28-46% over the Phase 3 baseline (about 1.3-1.5x), **moving toward the 2x goal**

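The per-phase percentages can be sanity-checked by compounding them from the 56.8M ops/s baseline. The helper below is a hypothetical illustration (`demo_compound` is not a project function). Applying the lower bounds (+5%, +10%, +5%) gives about 68.9M ops/s and the upper bounds (+10%, +15%, +8%) about 77.6M ops/s, so the listed 73-83M ops/s target sits above what the percentages alone compound to and should be read as an estimate.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply a baseline by a chain of fractional gains (0.05 means +5%). */
static double demo_compound(double baseline, const double *gains, size_t n) {
    double value = baseline;
    for (size_t i = 0; i < n; i++) {
        value *= 1.0 + gains[i];
    }
    return value;
}
```
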
---

## Implementation Schedule

### Week 1: PGO Workflow
- Day 1-2: PGO scripts + Makefile
- Day 3: Profile collection + benchmarking
- Day 4: Documentation + review

### Week 2: Hot/Cold Separation
- Day 1-2: Hot Box + macros
- Day 3-4: Cold Box + refactor
- Day 5: Testing + PGO re-optimization

### Week 3: Config Box + Polish
- Day 1-2: Config Box implementation
- Day 3: Integration testing
- Day 4-5: Final benchmarks + documentation

---

## Success Criteria

✅ **Code Quality**:
- All pointer operations use macros
- Clear contracts in each Box
- Zero regression in existing features

✅ **Performance**:
- bench_random_mixed: 73-83M ops/s (vs 56.8M baseline)
- bench_tiny_hot: 100-115M ops/s (vs 81M baseline)
- No regression in other benchmarks

✅ **Maintainability**:
- Hot/Cold separation clear
- PGO workflow documented
- Backward compatible

---

Generated: 2025-11-29
Phase: 4 Design
Next: Implementation (Week 1 start)