hakmem/docs/analysis/PHASE4_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md
2025-12-14 00:26:57 +09:00

Phase 4 D3: Alloc Gate Specialization Design Memo

Purpose

Specialize the routing branches in tiny_alloc_gate_fast() for the MIXED workload (LEGACY-first path).

Background:

  • Phase 3 完了: +8.93% cumulative gain (37.5M → 51M ops/s)
  • Perf analysis: tiny_alloc_gate_fast at 12.75% self (HIGH priority)
  • MIXED workload: 99% of allocations take the LEGACY route (MID_V3 OFF)
  • Current state: all routes (LEGACY/ULTRA/MID/V7) are dispatched through a switch → high branch misprediction cost

Observations

Current State (Phase 3 Baseline)

  • tiny_alloc_gate_fast: 12.75% self + children overhead
  • MIXED workload characteristics:
    • 99% of allocations take the LEGACY route (HAKMEM_MID_V3_ENABLED=0)
    • Uniform branching across C0-C7 → prediction misses
  • Existing optimization (Phase 2 B3):
    • LEGACY-first branching is already implemented inside malloc_tiny_fast_for_class()
    • However, the routing policy branches in tiny_alloc_gate_fast() are not yet optimized

Bottleneck Analysis

Current flow (core/box/tiny_alloc_gate_box.h:139-217):

void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return NULL;

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Policy lookup

    // Branching on route policy (uniform dispatch, poor prediction)
    if (route == ROUTE_POOL_ONLY) return NULL;

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (route == ROUTE_TINY_ONLY) {
        // Hot path (99% of Mixed traffic)
        return user_ptr;
    }

    // ROUTE_TINY_FIRST: fallback allowed
    return user_ptr;
}

Problem:

  • Policy lookup overhead: tiny_route_get() call every allocation
  • Branch on route == ROUTE_POOL_ONLY: rare but evaluated every time
  • Branch on route == ROUTE_TINY_ONLY vs ROUTE_TINY_FIRST: the Mixed default is TINY_FIRST (almost no behavioral difference)
  • Total cost: ~2-3 branches + 1 policy lookup per allocation

Expected savings:

  • Eliminate policy lookup for known-LEGACY workloads
  • Convert policy branches to LIKELY-hinted checks
  • Reduce instruction count by 5-10 per allocation

Implementation Approach

Strategy: LEGACY-first with Static Route Assumption

Pattern: Similar to Phase 2 B3 (routing shape optimization)

  • Reference: core/front/malloc_tiny_fast.h:262-278
  • Proven approach: LIKELY hint + cold helper for rare routes
  • Expected branch prediction improvement: 75% miss rate → <5% miss rate

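L0: Env (reversible)

  • HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0, OFF)
  • The specialized path is enabled by opt-in (be cautious about enabling it permanently)
  • Rollback: setting ENV=0 immediately restores the existing path

L1: SpecializedGateBox (boundary: one location)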

Optimized Gate Structure

Before (current):

void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) return NULL;

    TinyRoutePolicy route = tiny_route_get(class_idx);

    if (route == ROUTE_POOL_ONLY) return NULL;

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (route == ROUTE_TINY_ONLY) return user_ptr;

    return user_ptr;  // ROUTE_TINY_FIRST
}

After (optimized for MIXED):

void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    // Phase 4 D3: LEGACY-first gate specialization (ENV gated)
    if (TINY_HOT_LIKELY(alloc_gate_shape_enabled())) {
        // MIXED fast path: Avoid tiny_route_get() overhead.
        // NOTE: We do NOT assume TINY_ONLY vs TINY_FIRST; both return user_ptr.
        // Safety: still honor POOL_ONLY if configured via HAKMEM_TINY_PROFILE (e.g., "hot"/"off").
        if (__builtin_expect(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY, 0)) {
            return NULL;
        }
        // Direct to malloc_tiny_fast_for_class, skip tiny_route_get()
        return malloc_tiny_fast_for_class(size, class_idx);
    }

    // Original path (backward compatible)
    TinyRoutePolicy route = tiny_route_get(class_idx);

    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) return NULL;

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (TINY_HOT_LIKELY(route == ROUTE_TINY_ONLY)) return user_ptr;

    return user_ptr;
}
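
The two paths through the optimized gate can be exercised in isolation with a small mock. This is only a sketch under stated assumptions: every `sk_`/`SK_`/`SKETCH_` name below is a hypothetical stand-in (the real `hak_tiny_size_to_class()`, `tiny_route_get()`, `g_tiny_route` and `malloc_tiny_fast_for_class()` are not reproduced here), and the 16-byte class sizing is invented purely for illustration. It shows that the specialized path skips the policy lookup while still honoring POOL_ONLY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real hakmem definitions. */
#define SKETCH_NUM_CLASSES 8
#define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

typedef enum { SK_ROUTE_TINY_ONLY, SK_ROUTE_TINY_FIRST, SK_ROUTE_POOL_ONLY } SkRoute;

static SkRoute sk_route_table[SKETCH_NUM_CLASSES]; /* zero-init: TINY_ONLY */
static int sk_gate_shape = 0;    /* stands in for alloc_gate_shape_enabled() */
static int sk_route_lookups = 0; /* counts calls to the policy lookup */

/* Stand-in for hak_tiny_size_to_class(): assumed 16-byte classes, -1 if out of range. */
static int sk_size_to_class(size_t size) {
    if (size == 0 || size > 16u * SKETCH_NUM_CLASSES) return -1;
    return (int)((size - 1) / 16);
}

/* Stand-in for tiny_route_get(): the lookup the specialized path skips. */
static SkRoute sk_route_get(int class_idx) {
    assert(class_idx >= 0 && class_idx < SKETCH_NUM_CLASSES);
    sk_route_lookups++;
    return sk_route_table[class_idx];
}

/* Stand-in for malloc_tiny_fast_for_class(); simply defers to malloc(). */
static void* sk_alloc_for_class(size_t size, int class_idx) {
    (void)class_idx;
    return malloc(size);
}

static void* sk_alloc_gate(size_t size) {
    int class_idx = sk_size_to_class(size);
    if (SKETCH_UNLIKELY(class_idx < 0 || class_idx >= SKETCH_NUM_CLASSES)) return NULL;

    if (SKETCH_LIKELY(sk_gate_shape)) {
        /* Specialized path: read the table directly, still honor POOL_ONLY. */
        if (SKETCH_UNLIKELY(sk_route_table[class_idx] == SK_ROUTE_POOL_ONLY)) return NULL;
        return sk_alloc_for_class(size, class_idx);
    }

    /* Original path: go through the policy lookup. */
    SkRoute route = sk_route_get(class_idx);
    if (SKETCH_UNLIKELY(route == SK_ROUTE_POOL_ONLY)) return NULL;
    return sk_alloc_for_class(size, class_idx);
}
```

With `sk_gate_shape = 1` the lookup counter stays unchanged across allocations, which mirrors the bypass the design aims for; whether that translates into a measurable gain depends on how cheap the real lookup already is.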

Branch Prediction Impact

Current (uniform branching):

  • Policy lookup: Always executed (1 function call overhead)
  • route == ROUTE_POOL_ONLY: 0% hit rate (but checked every time)
  • route == ROUTE_TINY_ONLY: 99% hit rate (but no LIKELY hint)
  • Total overhead: ~10-15 cycles per allocation

Optimized (LEGACY-first):

  • ENV check: 99% cached (< 1 cycle amortized)
  • Direct path: Skip tiny_route_get() (and its release logging branch) (save ~5-7 cycles)
  • LIKELY hint: CPU predictor trained to expect fast path
  • Total savings: ~8-12 cycles per allocation

Expected gain: +1-2% on MIXED (conservative estimate based on 12.75% self%)

Implementation Instructions

File 1: core/box/tiny_alloc_gate_shape_env_box.h (new)

Role: ENV gate for alloc gate shape optimization

API:

// ENV gate: HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0)
static inline int alloc_gate_shape_enabled(void) {
    static int g_enable = -1;  // Lazy init sentinel
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_ALLOC_GATE_SHAPE");
        g_enable = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enable;
}

Integration: Header-only, single-responsibility (ENV caching only)
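
The truthiness rule used by the gate inspects only the first character of the variable. The sketch below isolates that rule in a helper so it can be checked without touching the process environment; `sketch_env_flag_is_on` is a hypothetical name, not part of the codebase.

```c
#include <stddef.h>

/* Hypothetical helper: same expression as the ENV gate above, separated from
 * getenv() so the parsing rule can be tested on plain strings. */
static int sketch_env_flag_is_on(const char* e) {
    /* Unset (NULL), empty, or starting with '0' means OFF; anything else is ON. */
    return (e && *e && *e != '0') ? 1 : 0;
}
```

One consequence worth knowing when scripting A/B runs: a value such as "off" or "false" turns the gate ON, because only a leading '0' is treated as disabled.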

File 2: Modify core/box/tiny_alloc_gate_box.h (existing)

Location: tiny_alloc_gate_fast() function (lines 139-217)

Changes:

  1. Include new ENV box header:

    #include "tiny_alloc_gate_shape_env_box.h"
    
  2. Add LEGACY-first fast path before existing route dispatch:

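    // Phase 4 D3: LEGACY-first gate specialization (ENV gated)
    if (TINY_HOT_LIKELY(alloc_gate_shape_enabled())) {
        // Skip tiny_route_get() for MIXED; TINY_ONLY and TINY_FIRST both return user_ptr.
        // Safety: still honor POOL_ONLY (same check as the optimized gate above).
        if (__builtin_expect(g_tiny_route[class_idx & 7] == ROUTE_POOL_ONLY, 0)) {
            return NULL;
        }
        return malloc_tiny_fast_for_class(size, class_idx);
    }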
    
  3. Add LIKELY hints to existing branches (backward compatible path):

    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) return NULL;
    
    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    
    if (TINY_HOT_LIKELY(route == ROUTE_TINY_ONLY)) return user_ptr;
    

Safety:

  • ENV gate ensures opt-in behavior
  • Fallback path unchanged (existing validation/diagnostics preserved)
  • No algorithmic changes (only branch shape optimization)

A/B Test

Test Configuration

Workload: Mixed (10-run, 20M iters, ws=400)

  • Baseline: HAKMEM_ALLOC_GATE_SHAPE=0
  • Optimized: HAKMEM_ALLOC_GATE_SHAPE=1

Commands:

# Baseline
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=0 \
  ./bench_random_mixed_hakmem 20000000 400 1

# Optimized
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=1 \
  ./bench_random_mixed_hakmem 20000000 400 1

Results (2025-12-13, Release, 10-run)

  • Baseline (HAKMEM_ALLOC_GATE_SHAPE=0): Mean 47.55M ops/s, Median 48.08M
  • Optimized (HAKMEM_ALLOC_GATE_SHAPE=1): Mean 47.82M ops/s, Median 47.84M
  • Δ: Mean +0.56% (Median -0.5%) → NEUTRAL
  • Behavior check: with HAKMEM_ALLOC_GATE_SHAPE=1, the [REL_C7_ROUTE] log emitted via tiny_route_get() disappears (bypass confirmed)

Success Criteria

GO: Mean gain >= +1.0%, Median >= +0.0%

  • Promote to default in MIXED_TINYV3_C7_SAFE preset
  • Add to core/bench_profile.h: bench_setenv_default("HAKMEM_ALLOC_GATE_SHAPE", "1");
  • Document in docs/analysis/ENV_PROFILE_PRESETS.md

NEUTRAL: -1.0% < gain < +1.0%

  • Freeze as research box (default OFF)
  • Document findings, keep implementation for future study

NO-GO: Mean gain <= -1.0%

  • Freeze and archive (default OFF, do not pursue)
  • Document regression cause, learn for future optimizations

Decision (this change): NEUTRAL (kept as a research box with default OFF)
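
The success criteria can be written down as a small classifier, which makes the thresholds unambiguous when evaluating a run. This is a minimal sketch with hypothetical names (`SkVerdict`, `sk_delta_pct`, `sk_classify`); the criteria do not say what happens when the mean gain reaches +1.0% but the median is negative, so the sketch assumes that case falls back to NEUTRAL.

```c
#include <assert.h>

typedef enum { SK_VERDICT_GO, SK_VERDICT_NEUTRAL, SK_VERDICT_NO_GO } SkVerdict;

/* Relative change of `optimized` over `baseline`, in percent. */
static double sk_delta_pct(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Thresholds taken from the Success Criteria section. */
static SkVerdict sk_classify(double mean_delta_pct, double median_delta_pct) {
    if (mean_delta_pct <= -1.0) return SK_VERDICT_NO_GO;
    if (mean_delta_pct >= 1.0 && median_delta_pct >= 0.0) return SK_VERDICT_GO;
    /* Assumption: mean >= +1.0% with a negative median is treated as NEUTRAL. */
    return SK_VERDICT_NEUTRAL;
}
```

Feeding in the reported figures (mean 47.55M → 47.82M, median 48.08M → 47.84M) gives a mean delta below +1.0% and a slightly negative median delta, which this sketch classifies as NEUTRAL, matching the decision recorded above.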

Expected Gain

Performance Gain Estimation

Target: tiny_alloc_gate_fast at 12.75% self

Optimization:

  • Eliminate policy lookup: ~5-7 cycles saved
  • Add LIKELY hints: ~3-5 cycles saved (branch prediction)
  • Total savings: ~8-12 cycles per allocation

Calculation:

  • Baseline: ~100 cycles per allocation (estimated)
  • Savings: 8-12 cycles (8-12% of allocation cost)
  • tiny_alloc_gate_fast contribution: 12.75% self
  • Expected gain: 12.75% × (8-12%) = +1.0-1.5% (conservative)
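
The arithmetic in the calculation above can be reproduced with a one-line helper. This is only a sketch of the memo's own back-of-the-envelope model, not a validated performance model; `sk_expected_gain_pct` is a hypothetical name, and the ~100 cycles per allocation is the estimate quoted above.

```c
#include <assert.h>

/* Expected gain in percent: the function's self share multiplied by the
 * fraction of per-allocation cycles the change is assumed to remove. */
static double sk_expected_gain_pct(double self_pct, double saved_cycles,
                                   double cycles_per_alloc) {
    assert(cycles_per_alloc > 0.0);
    return self_pct * (saved_cycles / cycles_per_alloc);
}
```

For example, `sk_expected_gain_pct(12.75, 8.0, 100.0)` and `sk_expected_gain_pct(12.75, 12.0, 100.0)` bracket the range quoted in the calculation, roughly +1.0% to +1.5%.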

Realistic range: +1-2% on MIXED workload

Risk Assessment

Risk Level: LOW

Why:

  • Only branch shape optimization (no algorithmic change)
  • ENV gate allows instant rollback
  • Fallback path unchanged (safety preserved)
  • Pattern proven by Phase 2 B3 (+2.89% success)

Failure modes:

  • Policy lookup cost lower than expected → minimal/no gain
  • ENV check overhead outweighs savings → slight regression
  • Both cases: Rollback with HAKMEM_ALLOC_GATE_SHAPE=0

Non-Goals

NOT in scope:

  • Route algorithm change (only branch shape)
  • Learner integration (optional for future)
  • C6-heavy workload optimization (already uses MID_V3 ON)
  • Policy snapshot bypass (already done in Phase 3 C3)

Reference Patterns

Similar Optimizations

Phase 2 B3 (Routing shape optimization):

  • File: core/front/malloc_tiny_fast.h:262-278
  • Pattern: if (TINY_HOT_LIKELY(route_kind == SMALL_ROUTE_LEGACY))
  • Result: +2.89% on MIXED, +9.13% on C6-heavy
  • Lesson: LIKELY hints + cold helpers are highly effective

Phase 3 D1 (Free route cache):

  • File: core/box/tiny_free_route_cache_env_box.h
  • Pattern: ENV gate with lazy init (-1 sentinel)
  • Result: +2.19% on MIXED (promoted to default)
  • Lesson: ENV gates with cached values have minimal overhead

Phase 3 C3 (Static routing):

  • File: core/box/tiny_static_route_box.h
  • Pattern: Static route table (bypass policy snapshot)
  • Result: +2.20% on MIXED
  • Lesson: Eliminating dynamic lookups pays off

Code Examples

B3 Pattern (LIKELY-first branching):

if (TINY_HOT_LIKELY(env_cfg->alloc_route_shape)) {
    if (TINY_HOT_LIKELY(route_kind == SMALL_ROUTE_LEGACY)) {
        // Hot path: LEGACY fast (99% traffic)
        void* ptr = tiny_hot_alloc_fast(class_idx);
        if (TINY_HOT_LIKELY(ptr != NULL)) return ptr;
        return tiny_cold_refill_and_alloc(class_idx);
    }
    // Rare routes: cold helper
    return tiny_alloc_route_cold(route_kind, class_idx, size);
}

D1 Pattern (ENV gate with lazy init):

static inline int tiny_free_static_route_enabled(void) {
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        const char* e = getenv("HAKMEM_FREE_STATIC_ROUTE");
        g_enable = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_enable;
}

Integration Plan

Step 1: Implementation

  1. Create core/box/tiny_alloc_gate_shape_env_box.h

    • Single function: alloc_gate_shape_enabled()
    • Lazy init with -1 sentinel
    • Return cached ENV value
  2. Modify core/box/tiny_alloc_gate_box.h

    • Add include for new ENV box
    • Insert LEGACY-first fast path (ENV gated)
    • Add LIKELY hints to existing branches
    • Preserve all validation/diagnostic logic

Step 2: A/B Testing

  1. Build with optimization:

    make clean && make -j8 CFLAGS="-O3 -flto"
    
  2. Run 10-run baseline:

    for i in {1..10}; do
        HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=0 \
          ./bench_random_mixed_hakmem 20000000 400 1
    done
    
  3. Run 10-run optimized:

    for i in {1..10}; do
        HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_ALLOC_GATE_SHAPE=1 \
          ./bench_random_mixed_hakmem 20000000 400 1
    done
    
  4. Calculate statistics:

    • Mean, Median, StdDev for both configurations
    • Compare against success criteria
    • Document results in PHASE4_D3_ALLOC_GATE_AB_TEST_RESULTS.md
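
The statistics step can be sketched with a few helpers, assuming the ops/s figure of each run has already been collected into an array. All names here are hypothetical. The sketch returns the sample variance rather than StdDev to stay free of libm; StdDev is its square root.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int sk_cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean; requires n >= 1. */
static double sk_mean(const double* v, size_t n) {
    assert(n >= 1);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Sorts v in place, then returns the median; requires n >= 1. */
static double sk_median(double* v, size_t n) {
    assert(n >= 1);
    qsort(v, n, sizeof v[0], sk_cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Sample variance (n - 1 denominator); requires n >= 2. */
static double sk_sample_variance(const double* v, size_t n) {
    assert(n >= 2);
    double m = sk_mean(v, n), acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += (v[i] - m) * (v[i] - m);
    return acc / (double)(n - 1);
}
```

Computing these for both the SHAPE=0 and SHAPE=1 runs yields the mean and median deltas that the success criteria are evaluated against; reporting the spread alongside them helps judge whether a sub-1% difference is within run-to-run noise.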

Step 3: Decision & Promotion

If GO:

  1. Update core/bench_profile.h (MIXED_TINYV3_C7_SAFE preset)
  2. Update docs/analysis/ENV_PROFILE_PRESETS.md
  3. Update CURRENT_TASK.md (Phase 4 D3 complete)
  4. Commit with message: "Phase 4 D3: Alloc Gate Specialization (+X.X%)"

If NEUTRAL or NO-GO:

  1. Document results in PHASE4_D3_ALLOC_GATE_AB_TEST_RESULTS.md
  2. Keep ENV gate at default OFF
  3. Archive as research box
  4. Move to next candidate optimization

Validation Checklist

Pre-implementation:

  • Design document reviewed and approved
  • Integration points identified (2 files)
  • Reference patterns studied (B3, D1, C3)
  • ENV gate strategy confirmed (opt-in, default OFF)

Implementation:

  • tiny_alloc_gate_shape_env_box.h created
  • tiny_alloc_gate_box.h modified (LEGACY-first path added)
  • LIKELY hints added to fallback branches
  • Clean compilation (no new warnings)
  • Health check passes: scripts/verify_health_profiles.sh

A/B Testing:

  • Baseline 10-run completed (SHAPE=0)
  • Optimized 10-run completed (SHAPE=1)
  • Statistics calculated (mean, median, stddev)
  • Results documented
  • Success criteria evaluated (GO/NEUTRAL/NO-GO)

Promotion (if GO):

  • bench_profile.h updated (default SHAPE=1)
  • ENV_PROFILE_PRESETS.md updated
  • CURRENT_TASK.md updated (Phase 4 complete)
  • Commit created with clear message
  • Cumulative gain updated (+X.X% total)

Phase 4 D3 Status: DESIGN COMPLETE
Next Step: Implementation → A/B Test → Decision
Expected Outcome: +1-2% gain (conservative), LOW risk
Rollback Plan: HAKMEM_ALLOC_GATE_SHAPE=0 (instant revert)