Phase 4 D3: Alloc Gate Specialization Design Memo
Purpose
Specialize the routing branches of tiny_alloc_gate_fast() for the MIXED workload (LEGACY-first path).
Background:
- Phase 3 完了: +8.93% cumulative gain (37.5M → 51M ops/s)
- Perf analysis: tiny_alloc_gate_fast at 12.75% self (HIGH priority)
- MIXED workload: 99% of traffic takes the LEGACY route (MID_V3 OFF)
- Current state: every route (LEGACY/ULTRA/MID/V7) is dispatched through a switch, so the branch misprediction cost is high
Observations
Current State (Phase 3 Baseline)
- tiny_alloc_gate_fast: 12.75% self + children overhead
- MIXED workload characteristics:
  - 99% of traffic takes the LEGACY route (HAKMEM_MID_V3_ENABLED=0)
  - Uniform branching across C0-C7 → prediction misses
- Existing optimization (Phase 2 B3):
  - LEGACY-first branching is already implemented inside malloc_tiny_fast_for_class()
  - However, the routing policy branch in tiny_alloc_gate_fast() has not been optimized
Bottleneck Analysis
Current flow (core/box/tiny_alloc_gate_box.h:139-217):
void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return NULL;
    TinyRoutePolicy route = tiny_route_get(class_idx); // ← Policy lookup
    // Branching on route policy (uniform dispatch, poor prediction)
    if (route == ROUTE_POOL_ONLY) return NULL;
    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (route == ROUTE_TINY_ONLY) {
        // Hot path (99% of Mixed traffic)
        return user_ptr;
    }
    // ROUTE_TINY_FIRST: fallback allowed
    return user_ptr;
}
Problem:
- Policy lookup overhead: tiny_route_get() is called on every allocation
- Branch on route == ROUTE_POOL_ONLY: rare but evaluated every time
- Branch on route == ROUTE_TINY_ONLY vs ROUTE_TINY_FIRST: the Mixed default is TINY_FIRST (almost no difference in behavior)
- Total cost: ~2-3 branches + 1 policy lookup per allocation
Expected savings:
- Eliminate policy lookup for known-LEGACY workloads
- Convert policy branches to LIKELY-hinted checks
- Reduce instruction count by 5-10 per allocation
Implementation Approach
Strategy: LEGACY-first with Static Route Assumption
Pattern: Similar to Phase 2 B3 (routing shape optimization)
- Reference: core/front/malloc_tiny_fast.h:262-278
- Proven approach: LIKELY hint + cold helper for rare routes
- Expected branch prediction improvement: 75% miss rate → <5% miss rate
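L0: Env (reversible)
- HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0, OFF)
- Opt-in enables the specialized path (be cautious about enabling it permanently)
- Rollback: setting ENV=0 immediately restores the existing path
L1: SpecializedGateBox (boundary: 1 location)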
Optimized Gate Structure
Before (current):
void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return NULL;
    TinyRoutePolicy route = tiny_route_get(class_idx);
    if (route == ROUTE_POOL_ONLY) return NULL;
    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (route == ROUTE_TINY_ONLY) return user_ptr;
    return user_ptr; // ROUTE_TINY_FIRST
}
After (optimized for MIXED):
void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }
    // Phase 4 D3: LEGACY-first gate specialization (ENV gated)
    if (TINY_HOT_LIKELY(alloc_gate_shape_enabled())) {
        // MIXED fast path: Avoid tiny_route_get() overhead.
        // NOTE: We do NOT assume TINY_ONLY vs TINY_FIRST; both return user_ptr.
        // Safety: still honor POOL_ONLY if configured via HAKMEM_TINY_PROFILE (e.g., "hot"/"off").
        if (__builtin_expect(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY, 0)) {
            return NULL;
        }
        // Direct to malloc_tiny_fast_for_class, skip tiny_route_get()
        return malloc_tiny_fast_for_class(size, class_idx);
    }
    // Original path (backward compatible)
    TinyRoutePolicy route = tiny_route_get(class_idx);
    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) return NULL;
    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (TINY_HOT_LIKELY(route == ROUTE_TINY_ONLY)) return user_ptr;
    return user_ptr;
}
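To make the intended equivalence concrete, the following is a minimal, self-contained model of the two gate shapes. It is only a sketch: the enum values, the `TINY_HOT_LIKELY`/`TINY_HOT_UNLIKELY` definitions, the stubbed `malloc_tiny_fast_for_class()`, `g_shape_enabled`, `g_route_get_calls`, and `gate_model()` are all hypothetical stand-ins, not the allocator's real definitions. It assumes GCC or Clang for `__builtin_expect`.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the allocator's real types and globals. */
typedef enum { ROUTE_TINY_ONLY, ROUTE_TINY_FIRST, ROUTE_POOL_ONLY } TinyRoutePolicy;
enum { TINY_NUM_CLASSES = 8 };

/* Assumed macro shape; the real TINY_HOT_LIKELY may be defined differently. */
#define TINY_HOT_LIKELY(x)   __builtin_expect(!!(x), 1)
#define TINY_HOT_UNLIKELY(x) __builtin_expect(!!(x), 0)

static TinyRoutePolicy g_tiny_route[TINY_NUM_CLASSES]; /* zero-init: all TINY_ONLY */
static int  g_shape_enabled;            /* models alloc_gate_shape_enabled() */
static int  g_route_get_calls;          /* counts policy lookups */
static char g_slab[TINY_NUM_CLASSES];   /* fake backing storage for the stub */

static TinyRoutePolicy tiny_route_get(int class_idx) {
    g_route_get_calls++;
    return g_tiny_route[class_idx & 7];
}

/* Stub: returns a distinct non-NULL address per class. */
static void* malloc_tiny_fast_for_class(size_t size, int class_idx) {
    (void)size;
    return &g_slab[class_idx];
}

/* Model of the gate after class resolution (class_idx is passed in directly). */
static void* gate_model(size_t size, int class_idx) {
    if (TINY_HOT_UNLIKELY(class_idx < 0 || class_idx >= TINY_NUM_CLASSES)) return NULL;
    if (TINY_HOT_LIKELY(g_shape_enabled)) {
        /* Specialized shape: direct table read, no tiny_route_get() call. */
        if (TINY_HOT_UNLIKELY(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY)) return NULL;
        return malloc_tiny_fast_for_class(size, class_idx);
    }
    /* Original shape: policy lookup on every call. */
    TinyRoutePolicy route = tiny_route_get(class_idx);
    if (TINY_HOT_UNLIKELY(route == ROUTE_POOL_ONLY)) return NULL;
    return malloc_tiny_fast_for_class(size, class_idx);
}
```

In this model both shapes return the same pointer for TINY_ONLY/TINY_FIRST classes and NULL for POOL_ONLY classes; the only observable difference is that the specialized shape never increments `g_route_get_calls`.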
Branch Prediction Impact
Current (uniform branching):
- Policy lookup: Always executed (1 function call overhead)
- route == ROUTE_POOL_ONLY: 0% hit rate (but checked every time)
- route == ROUTE_TINY_ONLY: 99% hit rate (but no LIKELY hint)
- Total overhead: ~10-15 cycles per allocation
Optimized (LEGACY-first):
- ENV check: cached after the first call (< 1 cycle amortized)
- Direct path: Skip tiny_route_get() (and its release logging branch) (save ~5-7 cycles)
- LIKELY hint: the compiler places the fast path on the fall-through, keeping the rare branch out of the hot path
- Total savings: ~8-12 cycles per allocation
Expected gain: +1-2% on MIXED (conservative estimate based on 12.75% self%)
Implementation Instructions
File 1: core/box/tiny_alloc_gate_shape_env_box.h (new)
Role: ENV gate for alloc gate shape optimization
API:
// ENV gate: HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0)
#include <stdlib.h>  // getenv
static inline int alloc_gate_shape_enabled(void) {
    static int g_enable = -1;  // Lazy init sentinel
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_ALLOC_GATE_SHAPE");
        g_enable = (e && *e && *e != '0') ? 1 : 0;  // unset, empty, or leading '0' → OFF
    }
    return g_enable;
}
Integration: Header-only, single-responsibility (ENV caching only)
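The truthiness rule used by the gate inspects only the first character of the variable, which has a few non-obvious consequences. The sketch below factors the same expression into a hypothetical helper, `gate_flag_from_string()` (not part of the codebase), so the rule can be exercised without touching the process environment.

```c
#include <stddef.h>

/* Hypothetical helper: the same expression as the gate's getenv() check,
 * taking the string directly instead of reading the environment. */
static int gate_flag_from_string(const char* e) {
    /* NULL (unset), "" (empty), or a leading '0' → 0; anything else → 1. */
    return (e && *e && *e != '0') ? 1 : 0;
}
```

Under this rule `"1"` and `"yes"` enable the gate, while unset, `""`, `"0"`, and `"01"` leave it off. Note that `"off"` or `"false"` would also enable it, since only a leading `'0'` is treated as OFF.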
File 2: Modify core/box/tiny_alloc_gate_box.h (existing)
Location: tiny_alloc_gate_fast() function (lines 139-217)
Changes:
- Include new ENV box header:
  #include "tiny_alloc_gate_shape_env_box.h"
- Add LEGACY-first fast path before existing route dispatch:
  // Phase 4 D3: LEGACY-first gate specialization
  if (TINY_HOT_LIKELY(alloc_gate_shape_enabled())) {
      // Skip tiny_route_get() for MIXED; still honor POOL_ONLY (as in the "After" listing)
      if (__builtin_expect(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY, 0)) return NULL;
      return malloc_tiny_fast_for_class(size, class_idx);
  }
- Add LIKELY hints to existing branches (backward compatible path):
  if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) return NULL;
  void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
  if (TINY_HOT_LIKELY(route == ROUTE_TINY_ONLY)) return user_ptr;
Safety:
- ENV gate ensures opt-in behavior
- Fallback path unchanged (existing validation/diagnostics preserved)
- No algorithmic changes (only branch shape optimization)
A/B Test
Test Configuration
Workload: Mixed (10-run, 20M iters, ws=400)
- Baseline: HAKMEM_ALLOC_GATE_SHAPE=0
- Optimized: HAKMEM_ALLOC_GATE_SHAPE=1
Commands:
# Baseline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=0 \
./bench_random_mixed_hakmem 20000000 400 1
# Optimized
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=1 \
./bench_random_mixed_hakmem 20000000 400 1
Results (2025-12-13, Release, 10-run)
- Baseline (HAKMEM_ALLOC_GATE_SHAPE=0): Mean 47.55M ops/s, Median 48.08M
- Optimized (HAKMEM_ALLOC_GATE_SHAPE=1): Mean 47.82M ops/s, Median 47.84M
- Δ (Mean): +0.56% (Median -0.5%) → NEUTRAL
- Behavior check: with HAKMEM_ALLOC_GATE_SHAPE=1, the [REL_C7_ROUTE] log emitted via tiny_route_get() disappears (bypass confirmed)
Success Criteria
GO: Mean gain >= +1.0%, Median >= +0.0%
- Promote to default in MIXED_TINYV3_C7_SAFE preset
- Add to core/bench_profile.h: bench_setenv_default("HAKMEM_ALLOC_GATE_SHAPE", "1");
- Document in docs/analysis/ENV_PROFILE_PRESETS.md
NEUTRAL: -1.0% < gain < +1.0%
- Freeze as research box (default OFF)
- Document findings, keep implementation for future study
NO-GO: Mean gain <= -1.0%
- Freeze and archive (default OFF, do not pursue)
- Document regression cause, learn for future optimizations
Decision (this change): NEUTRAL (kept as a research box, default OFF)
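The criteria above can be expressed as a small classifier. This is a sketch only: `pct_delta()`, `ab_classify()`, and the `AbVerdict` enum are hypothetical names, and the memo does not say how to treat a mean gain of at least +1.0% paired with a negative median, so the sketch assumes that case falls back to NEUTRAL.

```c
/* Hypothetical verdict type for the GO / NEUTRAL / NO-GO criteria. */
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } AbVerdict;

/* Relative change of `optimized` over `baseline`, in percent. */
static double pct_delta(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}

static AbVerdict ab_classify(double mean_gain_pct, double median_gain_pct) {
    if (mean_gain_pct <= -1.0) return VERDICT_NO_GO;
    if (mean_gain_pct >= 1.0 && median_gain_pct >= 0.0) return VERDICT_GO;
    /* Assumption: everything else, including mean >= +1.0% with a
     * negative median, is treated as NEUTRAL. */
    return VERDICT_NEUTRAL;
}
```

Fed the rounded figures from the results (means 47.55M and 47.82M, medians 48.08M and 47.84M), `pct_delta()` gives roughly +0.57% for the mean and roughly -0.5% for the median, which classifies as NEUTRAL, in line with the recorded decision.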
Expected Value
Performance Gain Estimation
Target: tiny_alloc_gate_fast at 12.75% self
Optimization:
- Eliminate policy lookup: ~5-7 cycles saved
- Add LIKELY hints: ~3-5 cycles saved (branch prediction)
- Total savings: ~8-12 cycles per allocation
Calculation:
- Baseline: ~100 cycles per allocation (estimated)
- Savings: 8-12 cycles (8-12% of allocation cost)
- tiny_alloc_gate_fast contribution: 12.75% self
- Expected gain: 12.75% × (8-12%) = +1.0-1.5% (conservative)
Realistic range: +1-2% on MIXED workload
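The estimate above can be reproduced with a one-line function. This is a sketch of the memo's formula as written, not a validated performance model; `expected_gain_pct()` and its parameter names are hypothetical.

```c
/* Hypothetical helper reproducing the memo's estimate:
 * gain (%) = self share (%) × (cycles saved / cycles per allocation). */
static double expected_gain_pct(double self_pct,
                                double cycles_saved,
                                double cycles_per_alloc) {
    return self_pct * (cycles_saved / cycles_per_alloc);
}
```

With the memo's inputs, `expected_gain_pct(12.75, 8.0, 100.0)` is about 1.02 and `expected_gain_pct(12.75, 12.0, 100.0)` is about 1.53, which matches the +1.0-1.5% range once rounded. The ~100 cycles per allocation figure is itself an estimate, so the result inherits that uncertainty.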
Risk Assessment
Risk Level: LOW
Why:
- Only branch shape optimization (no algorithmic change)
- ENV gate allows instant rollback
- Fallback path unchanged (safety preserved)
- Pattern proven by Phase 2 B3 (+2.89% success)
Failure modes:
- Policy lookup cost lower than expected → minimal/no gain
- ENV check overhead outweighs savings → slight regression
- Both cases: Rollback with HAKMEM_ALLOC_GATE_SHAPE=0
Non-Goals
NOT in scope:
- Route algorithm change (only branch shape)
- Learner integration (optional for future)
- C6-heavy workload optimization (already uses MID_V3 ON)
- Policy snapshot bypass (already done in Phase 3 C3)
Reference Patterns
Similar Optimizations
Phase 2 B3 (Routing shape optimization):
- File: core/front/malloc_tiny_fast.h:262-278
- Pattern: if (TINY_HOT_LIKELY(route_kind == SMALL_ROUTE_LEGACY))
- Result: +2.89% on MIXED, +9.13% on C6-heavy
- Lesson: LIKELY hints + cold helpers are highly effective
Phase 3 D1 (Free route cache):
- File: core/box/tiny_free_route_cache_env_box.h
- Pattern: ENV gate with lazy init (-1 sentinel)
- Result: +2.19% on MIXED (promoted to default)
- Lesson: ENV gates with cached values have minimal overhead
Phase 3 C3 (Static routing):
- File: core/box/tiny_static_route_box.h
- Pattern: Static route table (bypass policy snapshot)
- Result: +2.20% on MIXED
- Lesson: Eliminating dynamic lookups pays off
Code Examples
B3 Pattern (LIKELY-first branching):
if (TINY_HOT_LIKELY(env_cfg->alloc_route_shape)) {
    if (TINY_HOT_LIKELY(route_kind == SMALL_ROUTE_LEGACY)) {
        // Hot path: LEGACY fast (99% traffic)
        void* ptr = tiny_hot_alloc_fast(class_idx);
        if (TINY_HOT_LIKELY(ptr != NULL)) return ptr;
        return tiny_cold_refill_and_alloc(class_idx);
    }
    // Rare routes: cold helper
    return tiny_alloc_route_cold(route_kind, class_idx, size);
}
D1 Pattern (ENV gate with lazy init):
static inline int tiny_free_static_route_enabled(void) {
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_FREE_STATIC_ROUTE");
        g_enable = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enable;
}
Integration Plan
Step 1: Implementation
- Create core/box/tiny_alloc_gate_shape_env_box.h
  - Single function: alloc_gate_shape_enabled()
  - Lazy init with -1 sentinel
  - Return cached ENV value
- Modify core/box/tiny_alloc_gate_box.h
  - Add include for new ENV box
  - Insert LEGACY-first fast path (ENV gated)
  - Add LIKELY hints to existing branches
  - Preserve all validation/diagnostic logic
Step 2: A/B Testing
- Build with optimization:
  make clean && make -j8 CFLAGS="-O3 -flto"
- Run 10-run baseline:
  for i in {1..10}; do
    HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=0 \
      ./bench_random_mixed_hakmem 20000000 400 1
  done
- Run 10-run optimized:
  for i in {1..10}; do
    HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=1 \
      ./bench_random_mixed_hakmem 20000000 400 1
  done
- Calculate statistics:
  - Mean, Median, StdDev for both configurations
  - Compare against success criteria
  - Document results in PHASE4_D3_ALLOC_GATE_AB_TEST_RESULTS.md
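For the statistics step, a minimal sketch of computing mean, median, and sample standard deviation over the per-run throughput values might look like the following. `RunStats`, `run_stats()`, and `sqrt_newton()` are hypothetical names. The 64-sample cap is an arbitrary choice for the sketch, and the hand-rolled square root only avoids a libm link dependency; `sqrt()` from `<math.h>` is the usual choice.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct { double mean, median, stddev; } RunStats;

/* qsort comparator for doubles (ascending). */
static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Newton iteration for sqrt(x), x >= 0; stands in for sqrt() from <math.h>. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double g = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 64; i++) g = 0.5 * (g + x / g);
    return g;
}

/* Mean, median, and sample (n-1) standard deviation of up to 64 samples.
 * Returns all zeros when n is 0 or exceeds the cap. */
static RunStats run_stats(const double* samples, size_t n) {
    RunStats s = {0.0, 0.0, 0.0};
    double sorted[64];
    if (n == 0 || n > 64) return s;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) { sorted[i] = samples[i]; sum += samples[i]; }
    s.mean = sum / (double)n;

    qsort(sorted, n, sizeof sorted[0], cmp_double);
    s.median = (n % 2) ? sorted[n / 2]
                       : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);

    if (n > 1) {
        double ss = 0.0;
        for (size_t i = 0; i < n; i++) {
            double d = samples[i] - s.mean;
            ss += d * d;
        }
        s.stddev = sqrt_newton(ss / (double)(n - 1));
    }
    return s;
}
```

Calling `run_stats()` once on the ten baseline values and once on the ten optimized values yields the figures needed for the success criteria; the standard deviation helps judge whether a sub-1% mean difference is distinguishable from run-to-run noise.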
Step 3: Decision & Promotion
If GO:
- Update core/bench_profile.h (MIXED_TINYV3_C7_SAFE preset)
- Update docs/analysis/ENV_PROFILE_PRESETS.md
- Update CURRENT_TASK.md (Phase 4 D3 complete)
- Commit with message: "Phase 4 D3: Alloc Gate Specialization (+X.X%)"
If NEUTRAL or NO-GO:
- Document results in PHASE4_D3_ALLOC_GATE_AB_TEST_RESULTS.md
- Keep ENV gate at default OFF
- Archive as research box
- Move to next candidate optimization
Validation Checklist
Pre-implementation:
- Design document reviewed and approved
- Integration points identified (2 files)
- Reference patterns studied (B3, D1, C3)
- ENV gate strategy confirmed (opt-in, default OFF)
Implementation:
- tiny_alloc_gate_shape_env_box.h created
- tiny_alloc_gate_box.h modified (LEGACY-first path added)
- LIKELY hints added to fallback branches
- Clean compilation (no new warnings)
- Health check passes: scripts/verify_health_profiles.sh
A/B Testing:
- Baseline 10-run completed (SHAPE=0)
- Optimized 10-run completed (SHAPE=1)
- Statistics calculated (mean, median, stddev)
- Results documented
- Success criteria evaluated (GO/NEUTRAL/NO-GO)
Promotion (if GO):
- bench_profile.h updated (default SHAPE=1)
- ENV_PROFILE_PRESETS.md updated
- CURRENT_TASK.md updated (Phase 4 complete)
- Commit created with clear message
- Cumulative gain updated (+X.X% total)
Phase 4 D3 Status: DESIGN COMPLETE
Next Step: Implementation → A/B Test → Decision
Expected Outcome: +1-2% gain (conservative), LOW risk
Rollback Plan: HAKMEM_ALLOC_GATE_SHAPE=0 (instant revert)