# Phase 4 D3: Alloc Gate Specialization Design Memo

## Goal

Specialize the routing branches of `tiny_alloc_gate_fast()` for MIXED workloads (LEGACY-first path).

**Background**:
- Phase 3 complete: +8.93% cumulative gain (37.5M → 51M ops/s)
- Perf analysis: `tiny_alloc_gate_fast` at **12.75% self** (HIGH priority)
- MIXED workload: 99% of allocations take the LEGACY route (MID_V3 OFF)
- Current state: every route (LEGACY/ULTRA/MID/V7) is dispatched through a switch → high misprediction cost

## Observations

### Current State (Phase 3 Baseline)

- `tiny_alloc_gate_fast`: 12.75% self + children overhead
- MIXED workload characteristics:
  - 99% LEGACY route (`HAKMEM_MID_V3_ENABLED=0`)
  - Uniform branching across C0-C7 → prediction misses
- Existing optimization (Phase 2 B3):
  - LEGACY-first branching is already implemented inside `malloc_tiny_fast_for_class()`
  - The routing policy branches in `tiny_alloc_gate_fast()` are not yet optimized

### Bottleneck Analysis

**Current flow** (`core/box/tiny_alloc_gate_box.h:139-217`):
```c
void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return NULL;

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Policy lookup

    // Branching on route policy (uniform dispatch, poor prediction)
    if (route == ROUTE_POOL_ONLY) return NULL;

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (route == ROUTE_TINY_ONLY) {
        // Hot path (99% of Mixed traffic)
        return user_ptr;
    }

    // ROUTE_TINY_FIRST: fallback allowed
    return user_ptr;
}
```

**Problem**:
- Policy lookup overhead: `tiny_route_get()` is called on every allocation
- Branch on `route == ROUTE_POOL_ONLY`: rare, but evaluated every time
- Branch on `route == ROUTE_TINY_ONLY` vs `ROUTE_TINY_FIRST`: the Mixed default is TINY_FIRST (almost no behavioral difference)
- Total cost: ~2-3 branches + 1 policy lookup per allocation

**Expected savings**:
- Eliminate policy lookup for known-LEGACY workloads
- Convert policy branches to LIKELY-hinted checks
- Reduce instruction count by 5-10 per allocation

## Implementation Approach

### Strategy: LEGACY-first with Static Route Assumption

**Pattern**: Similar to Phase 2 B3 (routing shape optimization)
- Reference:
`core/front/malloc_tiny_fast.h:262-278`
- Proven approach: LIKELY hint + cold helper for rare routes
- Expected branch prediction improvement: 75% miss rate → <5% miss rate

### L0: Env (reversible)

- `HAKMEM_ALLOC_GATE_SHAPE=0/1` (default: 0, OFF)
- Opt-in enables the specialized path (be cautious about making it always-on)
- Rollback: set ENV=0 to return to the existing path immediately

### L1: SpecializedGateBox (boundary: one place)

#### Optimized Gate Structure

**Before** (current):
```c
void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return NULL;

    TinyRoutePolicy route = tiny_route_get(class_idx);
    if (route == ROUTE_POOL_ONLY) return NULL;

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (route == ROUTE_TINY_ONLY) return user_ptr;
    return user_ptr;  // ROUTE_TINY_FIRST
}
```

**After** (optimized for MIXED):
```c
void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    // Phase 4 D3: LEGACY-first gate specialization (ENV gated)
    if (TINY_HOT_LIKELY(alloc_gate_shape_enabled())) {
        // MIXED fast path: Avoid tiny_route_get() overhead.
        // NOTE: We do NOT assume TINY_ONLY vs TINY_FIRST; both return user_ptr.
        // Safety: still honor POOL_ONLY if configured via HAKMEM_TINY_PROFILE (e.g., "hot"/"off").
        if (__builtin_expect(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY, 0)) {
            return NULL;
        }
        // Direct to malloc_tiny_fast_for_class, skip tiny_route_get()
        return malloc_tiny_fast_for_class(size, class_idx);
    }

    // Original path (backward compatible)
    TinyRoutePolicy route = tiny_route_get(class_idx);
    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) return NULL;

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (TINY_HOT_LIKELY(route == ROUTE_TINY_ONLY)) return user_ptr;
    return user_ptr;
}
```

#### Branch Prediction Impact

**Current** (uniform branching):
- Policy lookup: Always executed (1 function call overhead)
- `route == ROUTE_POOL_ONLY`: 0% hit rate (but checked every time)
- `route == ROUTE_TINY_ONLY`: 99% hit rate (but no LIKELY hint)
- Total overhead: ~10-15 cycles per allocation

**Optimized** (LEGACY-first):
- ENV check: 99% cached (< 1 cycle amortized)
- Direct path: Skip `tiny_route_get()` and its release logging branch (save ~5-7 cycles)
- LIKELY hint: CPU predictor trained to expect fast path
- Total savings: ~8-12 cycles per allocation

**Expected gain**: +1-2% on MIXED (conservative estimate based on 12.75% self)

### Implementation Instructions

#### File 1: `core/box/tiny_alloc_gate_shape_env_box.h` (new)

**Role**: ENV gate for alloc gate shape optimization

**API**:
```c
// ENV gate: HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0)
static inline int alloc_gate_shape_enabled(void) {
    static int g_enable = -1;  // Lazy init sentinel
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_ALLOC_GATE_SHAPE");
        g_enable = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enable;
}
```

**Integration**: Header-only, single-responsibility (ENV caching only)

#### File 2: Modify `core/box/tiny_alloc_gate_box.h` (existing)

**Location**: `tiny_alloc_gate_fast()` function (lines 139-217)

**Changes**:

1. Include new ENV box header:
   ```c
   #include "tiny_alloc_gate_shape_env_box.h"
   ```

2. Add LEGACY-first fast path before existing route dispatch:
   ```c
   // Phase 4 D3: LEGACY-first gate specialization
   if (TINY_HOT_LIKELY(alloc_gate_shape_enabled())) {
       // Skip tiny_route_get() for MIXED, but still honor POOL_ONLY
       // (same check as the "After" listing; no TINY_ONLY vs TINY_FIRST assumption)
       if (__builtin_expect(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY, 0)) {
           return NULL;
       }
       return malloc_tiny_fast_for_class(size, class_idx);
   }
   ```

3. Add LIKELY hints to existing branches (backward compatible path):
   ```c
   if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) return NULL;
   void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
   if (TINY_HOT_LIKELY(route == ROUTE_TINY_ONLY)) return user_ptr;
   ```

**Safety**:
- ENV gate ensures opt-in behavior
- Fallback path unchanged (existing validation/diagnostics preserved)
- No algorithmic changes (only branch shape optimization)

## A/B Test

### Test Configuration

**Workload**: Mixed (10-run, 20M iters, ws=400)
- Baseline: `HAKMEM_ALLOC_GATE_SHAPE=0`
- Optimized: `HAKMEM_ALLOC_GATE_SHAPE=1`

**Commands**:
```bash
# Baseline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=0 \
  ./bench_random_mixed_hakmem 20000000 400 1

# Optimized
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=1 \
  ./bench_random_mixed_hakmem 20000000 400 1
```

### Results (2025-12-13, Release, 10-run)

- Baseline (`HAKMEM_ALLOC_GATE_SHAPE=0`): Mean **47.55M** ops/s, Median **48.08M**
- Optimized (`HAKMEM_ALLOC_GATE_SHAPE=1`): Mean **47.82M** ops/s, Median **47.84M**
- Δ (Mean): **+0.56%** (Median -0.5%) → **NEUTRAL**
- Behavior check: with `HAKMEM_ALLOC_GATE_SHAPE=1`, the `[REL_C7_ROUTE]` log emitted via `tiny_route_get()` disappears (bypass confirmed)

### Success Criteria

**GO**: Mean gain >= +1.0%, Median >= +0.0%
- **Promote to default** in `MIXED_TINYV3_C7_SAFE` preset
- Add to `core/bench_profile.h`: `bench_setenv_default("HAKMEM_ALLOC_GATE_SHAPE", "1");`
- Document in `docs/analysis/ENV_PROFILE_PRESETS.md`

**NEUTRAL**: -1.0% < gain < +1.0%
- **Freeze as research box** (default OFF)
- Document findings, keep implementation for future study

**NO-GO**: Mean gain <= -1.0%
- **Freeze and archive** (default OFF, do not pursue)
- Document
regression cause, learn for future optimizations

**Decision (this change)**: **NEUTRAL** (kept as a default-OFF research box)

## Expected Value

### Performance Gain Estimation

**Target**: `tiny_alloc_gate_fast` at 12.75% self

**Optimization**:
- Eliminate policy lookup: ~5-7 cycles saved
- Add LIKELY hints: ~3-5 cycles saved (branch prediction)
- Total savings: ~8-12 cycles per allocation

**Calculation**:
- Baseline: ~100 cycles per allocation (estimated)
- Savings: 8-12 cycles (8-12% of allocation cost)
- `tiny_alloc_gate_fast` contribution: 12.75% self
- Expected gain: 12.75% × (8-12%) = **+1.0-1.5%** (conservative)

**Realistic range**: +1-2% on MIXED workload

### Risk Assessment

**Risk Level**: LOW

**Why**:
- Only branch shape optimization (no algorithmic change)
- ENV gate allows instant rollback
- Fallback path unchanged (safety preserved)
- Pattern proven by Phase 2 B3 (+2.89% success)

**Failure modes**:
- Policy lookup cost lower than expected → minimal/no gain
- ENV check overhead outweighs savings → slight regression
- Both cases: Rollback with `HAKMEM_ALLOC_GATE_SHAPE=0`

## Non-goals

**NOT in scope**:
- Route algorithm change (only branch shape)
- Learner integration (optional for future)
- C6-heavy workload optimization (already uses MID_V3 ON)
- Policy snapshot bypass (already done in Phase 3 C3)

## Reference Patterns

### Similar Optimizations

**Phase 2 B3** (Routing shape optimization):
- File: `core/front/malloc_tiny_fast.h:262-278`
- Pattern: `if (TINY_HOT_LIKELY(route_kind == SMALL_ROUTE_LEGACY))`
- Result: **+2.89% on MIXED**, +9.13% on C6-heavy
- Lesson: LIKELY hints + cold helpers are highly effective

**Phase 3 D1** (Free route cache):
- File: `core/box/tiny_free_route_cache_env_box.h`
- Pattern: ENV gate with lazy init (-1 sentinel)
- Result: **+2.19% on MIXED** (promoted to default)
- Lesson: ENV gates with cached values have minimal overhead

**Phase 3 C3** (Static routing):
- File: `core/box/tiny_static_route_box.h`
- Pattern: Static route table (bypass policy snapshot)
- Result: **+2.20% on MIXED**
- Lesson: Eliminating dynamic lookups pays off

### Code Examples

**B3 Pattern** (LIKELY-first branching):
```c
if (TINY_HOT_LIKELY(env_cfg->alloc_route_shape)) {
    if (TINY_HOT_LIKELY(route_kind == SMALL_ROUTE_LEGACY)) {
        // Hot path: LEGACY fast (99% traffic)
        void* ptr = tiny_hot_alloc_fast(class_idx);
        if (TINY_HOT_LIKELY(ptr != NULL)) return ptr;
        return tiny_cold_refill_and_alloc(class_idx);
    }
    // Rare routes: cold helper
    return tiny_alloc_route_cold(route_kind, class_idx, size);
}
```

**D1 Pattern** (ENV gate with lazy init):
```c
static inline int tiny_free_static_route_enabled(void) {
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_FREE_STATIC_ROUTE");
        g_enable = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enable;
}
```

## Integration Plan

### Step 1: Implementation

1. Create `core/box/tiny_alloc_gate_shape_env_box.h`
   - Single function: `alloc_gate_shape_enabled()`
   - Lazy init with -1 sentinel
   - Return cached ENV value

2. Modify `core/box/tiny_alloc_gate_box.h`
   - Add include for new ENV box
   - Insert LEGACY-first fast path (ENV gated)
   - Add LIKELY hints to existing branches
   - Preserve all validation/diagnostic logic

### Step 2: A/B Testing

1. Build with optimization:
   ```bash
   make clean && make -j8 CFLAGS="-O3 -flto"
   ```

2. Run 10-run baseline:
   ```bash
   for i in {1..10}; do
     HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=0 \
       ./bench_random_mixed_hakmem 20000000 400 1
   done
   ```

3. Run 10-run optimized:
   ```bash
   for i in {1..10}; do
     HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=1 \
       ./bench_random_mixed_hakmem 20000000 400 1
   done
   ```

4. Calculate statistics:
   - Mean, Median, StdDev for both configurations
   - Compare against success criteria
   - Document results in `PHASE4_D3_ALLOC_GATE_AB_TEST_RESULTS.md`

### Step 3: Decision & Promotion

**If GO**:
1. Update `core/bench_profile.h` (MIXED_TINYV3_C7_SAFE preset)
2. Update `docs/analysis/ENV_PROFILE_PRESETS.md`
3. Update `CURRENT_TASK.md` (Phase 4 D3 complete)
4. Commit with message: "Phase 4 D3: Alloc Gate Specialization (+X.X%)"

**If NEUTRAL or NO-GO**:
1. Document results in `PHASE4_D3_ALLOC_GATE_AB_TEST_RESULTS.md`
2. Keep ENV gate at default OFF
3. Archive as research box
4. Move to next candidate optimization

## Validation Checklist

**Pre-implementation**:
- [ ] Design document reviewed and approved
- [ ] Integration points identified (2 files)
- [ ] Reference patterns studied (B3, D1, C3)
- [ ] ENV gate strategy confirmed (opt-in, default OFF)

**Implementation**:
- [ ] `tiny_alloc_gate_shape_env_box.h` created
- [ ] `tiny_alloc_gate_box.h` modified (LEGACY-first path added)
- [ ] LIKELY hints added to fallback branches
- [ ] Clean compilation (no new warnings)
- [ ] Health check passes: `scripts/verify_health_profiles.sh`

**A/B Testing**:
- [ ] Baseline 10-run completed (SHAPE=0)
- [ ] Optimized 10-run completed (SHAPE=1)
- [ ] Statistics calculated (mean, median, stddev)
- [ ] Results documented
- [ ] Success criteria evaluated (GO/NEUTRAL/NO-GO)

**Promotion** (if GO):
- [ ] `bench_profile.h` updated (default SHAPE=1)
- [ ] `ENV_PROFILE_PRESETS.md` updated
- [ ] `CURRENT_TASK.md` updated (Phase 4 complete)
- [ ] Commit created with clear message
- [ ] Cumulative gain updated (+X.X% total)

---

**Phase 4 D3 Status**: DESIGN COMPLETE
**Next Step**: Implementation → A/B Test → Decision
**Expected Outcome**: +1-2% gain (conservative), LOW risk
**Rollback Plan**: `HAKMEM_ALLOC_GATE_SHAPE=0` (instant revert)
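
The GO / NEUTRAL / NO-GO thresholds from the Success Criteria section, together with the "Calculate statistics" step, can be sketched as a small helper. This is a minimal illustration and not code from the HAKMEM tree: every name here (`ab_mean`, `ab_median`, `ab_gain_pct`, `ab_classify`, `AbVerdict`, `AB_MAX_RUNS`) is hypothetical. The memo does not say what happens when the mean gain is >= +1.0% but the median gain is negative; the sketch assumes that case is NEUTRAL.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define AB_MAX_RUNS 64  /* hypothetical upper bound on runs per configuration */

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } AbVerdict;

/* qsort comparator for doubles (ascending). */
static int ab_cmp_double(const void* a, const void* b) {
    double x = *(const double*)a;
    double y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n throughput samples (ops/s). */
static double ab_mean(const double* v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of n samples; sorts a local copy so the input stays untouched. */
static double ab_median(const double* v, size_t n) {
    assert(n > 0 && n <= AB_MAX_RUNS);
    double tmp[AB_MAX_RUNS];
    memcpy(tmp, v, n * sizeof tmp[0]);
    qsort(tmp, n, sizeof tmp[0], ab_cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Relative gain of the optimized run over the baseline, in percent. */
static double ab_gain_pct(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Apply the memo's thresholds:
 *   NO-GO:   mean gain <= -1.0%
 *   GO:      mean gain >= +1.0% and median gain >= 0.0%
 *   NEUTRAL: everything else (including the unspecified
 *            "mean >= +1.0% but median < 0%" case; assumption). */
static AbVerdict ab_classify(double mean_gain_pct, double median_gain_pct) {
    if (mean_gain_pct <= -1.0) return AB_NO_GO;
    if (mean_gain_pct >= 1.0 && median_gain_pct >= 0.0) return AB_GO;
    return AB_NEUTRAL;
}
```

Feeding in the figures reported under Results (mean 47.55M → 47.82M, median 48.08M → 47.84M) gives a mean gain between -1.0% and +1.0%, so `ab_classify` returns `AB_NEUTRAL`, which matches the decision recorded in this memo.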