hakmem/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_3_REVISED_INSTRUCTIONS.md
2025-12-15 12:50:16 +09:00


Phase 19-3: ENV Snapshot Consolidation — Revised Implementation Instructions

0. Design Corrections (from initial analysis)

Key mistakes in initial design:

  1. Wrong: Create new TLS ctx
     Right: Use existing HakmemEnvSnapshot* from core/box/hakmem_env_snapshot_box.h

  2. Wrong: Option A with static __thread g_init (doesn't respect hakmem_env_snapshot_refresh_from_env())
     Right: Pass const HakmemEnvSnapshot* env down the call stack (refresh works automatically)

  3. Wrong: Modify core/wrapper.c
     Right: Modify core/box/hak_wrappers.inc.h (actual integration point)

  4. Wrong: Keep __builtin_expect(hakmem_env_snapshot_enabled(), 0)
     Right: Remove UNLIKELY hint (Phase 19-1 trap: snapshot is now ON by default, so the hint is backwards)


1. Strategy: Simplest Path (3 micro-phases)

Phase 19-3a: Remove UNLIKELY hint from ENV checks

Problem: __builtin_expect(hakmem_env_snapshot_enabled(), 0) appears in:

  • core/front/malloc_tiny_fast.h:236 (C7 ULTRA alloc)
  • core/front/malloc_tiny_fast.h:403 (Front V3 free hotcold)
  • core/front/malloc_tiny_fast.h:624 (C7 ULTRA free)
  • core/front/malloc_tiny_fast.h:830 (C7 ULTRA free larson)
  • core/front/malloc_tiny_fast.h:910 (Front V3 free larson)

Issue: Snapshot is now ON in presets → UNLIKELY hint is backwards (same trap as Phase 19-1 NO-GO)

Fix: Replace __builtin_expect(hakmem_env_snapshot_enabled(), 0) with hakmem_env_snapshot_enabled()

Expected: +0-2% from correct branch prediction
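
As a minimal sketch of the 19-3a change, the following self-contained C models the condition before and after. `snapshot_enabled_stub()` and both `route_*` functions are hypothetical stand-ins, not hakmem code, and `__builtin_expect` assumes GCC or Clang. The only point is that dropping the hint leaves the result unchanged; it just stops telling the compiler that a default-on branch is cold.

```c
#include <stdbool.h>

/* Hypothetical stand-in for hakmem_env_snapshot_enabled();
 * assumed to return true because the snapshot is ON in presets. */
static bool snapshot_enabled_stub(void) { return true; }

/* Before: the hint marks the snapshot branch as unlikely, so the compiler
 * may lay it out as the cold path even though it is normally taken. */
static int route_with_unlikely_hint(void) {
    if (__builtin_expect(snapshot_enabled_stub(), 0)) {
        return 1; /* snapshot path */
    }
    return 0;     /* legacy path */
}

/* After: same condition, no hint; block layout is left to the compiler. */
static int route_without_hint(void) {
    if (snapshot_enabled_stub()) {
        return 1; /* snapshot path */
    }
    return 0;     /* legacy path */
}
```

Both functions return the same value for any input; any measured difference comes from code layout and branch prediction, which is why this step is validated by A/B benchmark rather than by a functional test.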

A/B Test:

# Before (with UNLIKELY hint)
scripts/run_mixed_10_cleanenv.sh

# After (without hint)
scripts/run_mixed_10_cleanenv.sh

GO Criteria: No regression (±1%)

Status: DONE (GO +4.42%)


Phase 19-3b: Pass snapshot down (hot helper pass-down)

Status: DONE (GO +2.76% mean / +2.57% median)
Results: docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_3B_ENV_SNAPSHOT_PASSDOWN_AB_TEST_RESULTS.md

State before this change: each callee called hakmem_env_snapshot_enabled() independently

  • 5 calls × 2 TLS reads each = 10 TLS reads/op

Implementation (landed):

  • core/front/malloc_tiny_fast.h
    • Capture env once per hot call and pass it down:
      • free_tiny_fast() / free_tiny_fast_hot() capture env once
      • free_tiny_fast_cold(..., env) consumes it
      • tiny_legacy_fallback_free_base_with_env(..., env) consumes it
    • Reuse the same snapshot for alloc route selection:
      • tiny_policy_hot_get_route_with_env(class_idx, env)
  • core/box/tiny_legacy_fallback_box.h: add tiny_legacy_fallback_free_base_with_env(...)
  • core/box/tiny_metadata_cache_hot_box.h: add tiny_policy_hot_get_route_with_env(...)

Optional extension (if chasing the last alloc-side gate): Pass env down from core/box/hak_wrappers.inc.h entry. Keep the invariant: malloc miss must fall through (do not call malloc_cold() directly).

Example (optional):

// core/box/hak_wrappers.inc.h (malloc wrapper)
void* malloc(size_t size) {
    // Entry-point: read ENV snapshot once
    const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

    if (fastlane_direct_enabled()) {
        // NOTE: Keep FastLane safety rule: do not use fast paths before init.
        if (g_initialized) {
            void* ptr = malloc_tiny_fast_with_env(size, env);  // NEW: pass env
            if (ptr != NULL) return ptr;
            // IMPORTANT: malloc miss must fall through to existing wrapper path
            // (do NOT call malloc_cold() directly; it expects lock_depth to be incremented).
        }
    }

    void* ptr = front_fastlane_try_malloc(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // Not handled → continue to existing wrapper path below (wrap_shape / lock_depth / init waits / malloc_cold(...)).
    // (Do not duplicate the full wrapper here; only the env pass-down is new.)
    /* existing wrapper path */
}

// core/box/hak_wrappers.inc.h (free wrapper)
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;

    // Entry-point: read ENV snapshot once
    const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

    if (fastlane_direct_enabled()) {
        // NOTE: Keep FastLane safety rule: do not use fast paths before init.
        if (g_initialized) {
            if (free_tiny_fast_with_env(ptr, env)) return;  // NEW: pass env
            free_cold(ptr, wrapper_env_cfg_fast());  // same cold-path signature as the fallthrough below
            return;
        }
    }

    if (front_fastlane_try_free(ptr)) return;
    free_cold(ptr, wrapper_env_cfg_fast());
}

Propagate env down:

// core/front/malloc_tiny_fast.h (example)
static inline void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env) {
    // ... existing logic ...

    // OLD: if (__builtin_expect(hakmem_env_snapshot_enabled(), 0)) {
    // NEW: if (env) {
    if (env) {
        // Use snapshot from env (NO additional TLS read)
        if (tiny_c7_ultra_enabled_cached(env)) {  // NEW: use cached check
            // C7 ULTRA path
        }
    }

    // ... rest of logic ...
}

Helper functions (optional, if needed):

// core/box/hakmem_env_snapshot_box.h
static inline bool tiny_c7_ultra_enabled_cached(const HakmemEnvSnapshot* env) {
    if (!env) return false;  // NULL env means the snapshot is disabled
    // Read from snapshot (already in cache, no TLS read)
    return env->tiny_c7_ultra_enabled;
}

Expected:

  • TLS reads: 10/op → 2/op (just wrapper entry check)
  • Instructions: -8.0 to -10.0/op
  • Throughput: +3-5%
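
The TLS-read arithmetic above can be sketched as a small counting model. Everything here is hypothetical: the `Model*` type, the counter, and the assumption (taken from this document's own count) that one gate check costs 2 TLS reads. It is not a measurement of the real allocator.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model type; not the real HakmemEnvSnapshot layout. */
typedef struct { bool tiny_c7_ultra_enabled; } ModelEnvSnapshot;

static ModelEnvSnapshot g_model_snapshot = { true };
static int g_model_tls_reads = 0;   /* counts modeled TLS reads */

/* Models hakmem_env_snapshot_enabled(): charged 2 TLS reads per call. */
static bool model_snapshot_enabled(void) {
    g_model_tls_reads += 2;
    return true;
}

/* Old shape: every callee re-checks the gate on its own. */
static void model_callee_old(void) {
    if (model_snapshot_enabled()) { (void)g_model_snapshot.tiny_c7_ultra_enabled; }
}

/* New shape: the callee only dereferences the pointer it was handed. */
static void model_callee_with_env(const ModelEnvSnapshot* env) {
    if (env) { (void)env->tiny_c7_ultra_enabled; }
}

/* One operation, old shape: 5 independent checks. Returns modeled reads. */
static int model_op_old(void) {
    g_model_tls_reads = 0;
    for (int i = 0; i < 5; i++) model_callee_old();
    return g_model_tls_reads;
}

/* One operation, new shape: one check at entry, then pass-down. */
static int model_op_passdown(void) {
    g_model_tls_reads = 0;
    const ModelEnvSnapshot* env =
        model_snapshot_enabled() ? &g_model_snapshot : NULL;
    for (int i = 0; i < 5; i++) model_callee_with_env(env);
    return g_model_tls_reads;
}
```

In this model `model_op_old()` counts 10 reads and `model_op_passdown()` counts 2, matching the 10/op → 2/op expectation; the real savings depend on how the compiler inlines and hoists the TLS accesses.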

ENV Gate (optional, if conservative rollout needed):

// core/box/env_snapshot_consolidation_env_box.h (NEW, if needed)
#include <stdatomic.h>
#include <stdbool.h>

extern _Atomic int g_env_snapshot_consolidation_enabled;

// Hot-path gate check: one relaxed atomic load, no ordering required.
static inline bool env_snapshot_consolidation_enabled(void) {
    return atomic_load_explicit(&g_env_snapshot_consolidation_enabled, memory_order_relaxed) != 0;
}

Alternative (no new ENV gate):

  • Just always pass env down when HAKMEM_FASTLANE_DIRECT=1
  • Rely on existing FASTLANE_DIRECT gate for rollback
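
If the optional gate is adopted instead, its initialization might look roughly like the sketch below. The `*_sketch` names, the "anything not starting with 0 means on" parsing rule, and the default-on value are all assumptions for illustration; only the variable name HAKMEM_ENV_SNAPSHOT_CONSOLIDATION comes from this document.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical gate storage; default-on is an assumption. */
static _Atomic int g_gate_sketch = 1;

/* Parse a "0"/"1"-style value; NULL or empty keeps the supplied default. */
static int gate_parse_sketch(const char* value, int default_on) {
    if (value == NULL || value[0] == '\0') return default_on;
    return (value[0] != '0') ? 1 : 0;
}

/* Meant to run at init (and again on refresh), never on the hot path. */
static void gate_init_from_env_sketch(void) {
    const char* v = getenv("HAKMEM_ENV_SNAPSHOT_CONSOLIDATION");
    atomic_store_explicit(&g_gate_sketch, gate_parse_sketch(v, 1),
                          memory_order_relaxed);
}

/* Hot-path read: a single relaxed atomic load. */
static inline bool gate_enabled_sketch(void) {
    return atomic_load_explicit(&g_gate_sketch, memory_order_relaxed) != 0;
}
```

Keeping `getenv()` out of the hot path is the reason for the separate init function; the gate itself then costs one load per operation, which is worth weighing against simply reusing the existing FASTLANE_DIRECT gate.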

Phase 19-3c: Propagate to all callees (if 19-3b is GO)

Targets:

  • tiny_legacy_fallback_box.h:28 (ENV snapshot check)
  • tiny_metadata_cache_hot_box.h:64 (metadata cache check)
  • Other helper functions that currently call hakmem_env_snapshot_enabled()

Expected cumulative: +5-8%


2. Box Theory Compliance

Boundary:

  • L0 (ENV gate): Optional HAKMEM_ENV_SNAPSHOT_CONSOLIDATION=0/1 OR reuse HAKMEM_FASTLANE_DIRECT=1
  • L1 (Hot inline): const HakmemEnvSnapshot* env parameter (NULL-safe)
  • L2 (Fallback): If env == NULL, use old path (call hakmem_env_snapshot_enabled())

Rollback:

  • Runtime: HAKMEM_ENV_SNAPSHOT_CONSOLIDATION=0 (or HAKMEM_FASTLANE_DIRECT=0)
  • Compile-time: #if guards on env parameter propagation
  • NULL-safe: if (!env) { /* old path */ }
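
The compile-time and NULL-safe rollback layers could combine as in this sketch. `SKETCH_ENV_PASSDOWN`, the `Sketch*` type, and `sketch_legacy_c7_ultra_enabled()` are hypothetical; the document does not name the actual guard macro.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical compile-time switch; the real guard name is not specified. */
#ifndef SKETCH_ENV_PASSDOWN
#define SKETCH_ENV_PASSDOWN 1
#endif

/* Stand-in for HakmemEnvSnapshot with a single flag. */
typedef struct { bool tiny_c7_ultra_enabled; } SketchEnvSnapshot;

/* Stand-in for the pre-19-3b lookup each callee used to perform. */
static bool sketch_legacy_c7_ultra_enabled(void) { return false; }

static bool sketch_c7_ultra_enabled(const SketchEnvSnapshot* env) {
#if SKETCH_ENV_PASSDOWN
    if (env) {
        return env->tiny_c7_ultra_enabled;   /* L1: passed-down snapshot */
    }
#else
    (void)env;                               /* pass-down compiled out */
#endif
    return sketch_legacy_c7_ultra_enabled(); /* L2: old path when env == NULL */
}
```

With this shape, passing NULL from the wrapper (runtime rollback) and building with the guard set to 0 (compile-time rollback) both land on the same legacy path.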

Observability:

  • Perf stat: TLS read count reduction
  • hakmem_env_snapshot_enabled() samples should drop from 7% to <1%

Refresh compatibility:

  • Works: Wrapper reads fresh snapshot on each operation
  • bench_profile putenv() + refresh works (no static cache)
  • No version tracking needed
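
A small model of why the pass-down design stays refresh-compatible while a static cache does not. The names are hypothetical, and a plain `static` is used in place of `static __thread` to keep the sketch portable; the staleness problem is the same.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for HakmemEnvSnapshot with a single flag. */
typedef struct { bool tiny_c7_ultra_enabled; } RefreshSketchSnapshot;

static RefreshSketchSnapshot g_refresh_sketch_snapshot = { false };

/* Models hakmem_env_snapshot_refresh_from_env(): rewrites the snapshot. */
static void refresh_sketch(bool new_value) {
    g_refresh_sketch_snapshot.tiny_c7_ultra_enabled = new_value;
}

/* Rejected shape: caches the flag on first use and never looks again. */
static bool flag_via_static_cache(void) {
    static bool initialized = false;
    static bool cached = false;
    if (!initialized) {
        cached = g_refresh_sketch_snapshot.tiny_c7_ultra_enabled;
        initialized = true;
    }
    return cached;
}

/* Adopted shape: the wrapper fetches the snapshot pointer per operation... */
static const RefreshSketchSnapshot* wrapper_read_env(void) {
    return &g_refresh_sketch_snapshot;
}

/* ...and callees read through it, so a refresh shows up on the next call. */
static bool flag_via_passdown(const RefreshSketchSnapshot* env) {
    return env ? env->tiny_c7_ultra_enabled : false;
}
```

After `refresh_sketch(true)`, the cached variant keeps returning its first-seen value while the pass-down variant reflects the update, which is the behavior the bench_profile putenv() + refresh flow relies on.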

3. Implementation Steps

Step 1: Phase 19-3a (UNLIKELY hint removal)

Files to modify:

# Find all instances
grep -n "__builtin_expect(hakmem_env_snapshot_enabled(), 0)" core/front/malloc_tiny_fast.h

# Replace with
# hakmem_env_snapshot_enabled()

Lines to change:

  • malloc_tiny_fast.h:236
  • malloc_tiny_fast.h:403
  • malloc_tiny_fast.h:624
  • malloc_tiny_fast.h:830
  • malloc_tiny_fast.h:910

A/B Test:

# Baseline (before)
scripts/run_mixed_10_cleanenv.sh

# Optimized (after)
scripts/run_mixed_10_cleanenv.sh

GO Criteria: No regression (±1%)


Step 2: Phase 19-3b (Pass env from wrapper)

2.1 Update wrappers (core/box/hak_wrappers.inc.h)

Find malloc wrapper (around line 674):

void* malloc(size_t size) {
    // ADD: Read snapshot once at entry
    const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

    // ... rest of existing logic ...
    // Replace malloc_tiny_fast(size) with malloc_tiny_fast_with_env(size, env)
}

Find free wrapper (around line 188):

void free(void* ptr) {
    // ADD: Read snapshot once at entry
    const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;

    // ... rest of existing logic ...
    // Replace free_tiny_fast(ptr) with free_tiny_fast_with_env(ptr, env)
}

2.2 Add _with_env variants (core/front/malloc_tiny_fast.h)

Option A: Rename existing functions (breaking change, requires full propagation)

Option B: Add new _with_env variants, keep old functions as wrappers (safer)

Recommended: Option B (incremental migration)

// malloc_tiny_fast.h (add new variant)
static inline void* malloc_tiny_fast_with_env(size_t size, const HakmemEnvSnapshot* env) {
    // Replace hakmem_env_snapshot_enabled() checks with (env && ...)
    // ... existing logic ...
}

// Keep old function as wrapper (for gradual migration)
static inline void* malloc_tiny_fast(size_t size) {
    const HakmemEnvSnapshot* env = hakmem_env_snapshot_enabled() ? hakmem_env_snapshot() : NULL;
    return malloc_tiny_fast_with_env(size, env);
}

2.3 Update ENV checks inside _with_env

Replace:

if (__builtin_expect(hakmem_env_snapshot_enabled(), 0)) {
    // C7 ULTRA check
}

With:

if (env) {
    // Use env snapshot (NO TLS read)
    if (env->tiny_c7_ultra_enabled) {
        // C7 ULTRA path
    }
}

2.4 Build & Verify

make clean && make -j bench_random_mixed_hakmem
# Should compile without errors

2.5 A/B Test

# Baseline (19-3a only)
scripts/run_mixed_10_cleanenv.sh

# Optimized (19-3b with env passing)
scripts/run_mixed_10_cleanenv.sh

GO Criteria: +3% minimum


Step 3: Phase 19-3c (Propagate to helpers)

If 19-3b is GO, propagate env to:

  • tiny_legacy_fallback_box.h
  • tiny_metadata_cache_hot_box.h
  • Other helper functions

Expected cumulative: +5-8%


4. Safety Checklist

  • NULL-safe: All if (env) checks handle NULL correctly
  • Refresh works: Wrapper reads fresh snapshot each call
  • Rollback: Can disable via ENV or compile flag
  • No static cache: No version tracking needed
  • Existing code paths preserved: Old functions still work

5. Expected Performance (Cumulative)

| Phase | TLS reads/op | Instructions/op | Throughput |
|---|---|---|---|
| Baseline (19-1b) | 10 | 169.45 | 52.06M ops/s |
| 19-3a (hint fix) | 10 | ~169 | +0-2% |
| 19-3b (env pass) | 2 | ~159-161 | +3-5% |
| 19-3c (helpers) | 2 | ~157-159 | +5-8% |

Target after Phase 19-3c:

  • Throughput: 54.7-56.2M ops/s (vs 52.06M baseline)
  • Instructions/op: 157-159 (vs 169.45 baseline)
  • Instruction gap to libc (135.92 instructions/op): +15-17% (vs +24.6% before 19-3)
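
The target figures above follow from simple ratios, sketched here so they can be re-derived when the baseline moves. The helper names are hypothetical; the constants are the ones quoted in this section.

```c
#include <stdio.h>

/* Throughput after a relative gain, e.g. gain_pct = 5.0 for +5%. */
static double throughput_after_gain(double base_mops, double gain_pct) {
    return base_mops * (1.0 + gain_pct / 100.0);
}

/* How far `ours` is above `reference`, in percent. */
static double excess_pct(double ours, double reference) {
    return (ours / reference - 1.0) * 100.0;
}

/* Prints the projected ranges from the baseline numbers in this document. */
static void print_phase19_3_targets(void) {
    const double base_mops = 52.06;   /* baseline throughput, M ops/s */
    const double libc_insn = 135.92;  /* libc instructions/op */

    printf("throughput target: %.1f-%.1f M ops/s\n",
           throughput_after_gain(base_mops, 5.0),
           throughput_after_gain(base_mops, 8.0));
    printf("instruction gap to libc at 157-159/op: +%.1f%% to +%.1f%%\n",
           excess_pct(157.0, libc_insn),
           excess_pct(159.0, libc_insn));
}
```

The instruction gap is computed as hakmem instructions/op over libc instructions/op, so a lower number is better; the throughput gap to libc is a separate ratio and moves independently.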

6. Perf Validation

Before Phase 19-3:

perf stat -e cycles,instructions -- ./bench_random_mixed_hakmem 200000000 400 1
# Instructions: ~169.45/op

After Phase 19-3c:

perf stat -e cycles,instructions -- ./bench_random_mixed_hakmem 200000000 400 1
# Instructions: ~157-159/op (target: at least -10.0/op vs. the 169.45 baseline)

Perf record validation:

perf record -g -- ./bench_random_mixed_hakmem 50000000 400 1
perf report --stdio --no-children | grep hakmem_env_snapshot_enabled
# Should show <1% samples (down from 7%)

7. Risk Assessment

Phase 19-3a: LOW

  • Simple search-replace
  • No algorithmic changes
  • Worst case: ±0% (no-op)

Phase 19-3b: MEDIUM

  • API signature changes (add env parameter)
  • Mechanical changes (low semantic risk)
  • Rollback: Keep old functions as wrappers

Phase 19-3c: MEDIUM

  • Wider propagation
  • More functions to update
  • Same rollback strategy

8. Timeline

  • Phase 19-3a: 1-2 hours (search-replace + A/B)
  • Phase 19-3b: 4-6 hours (wrapper + _with_env variants + A/B)
  • Phase 19-3c: 3-4 hours (helper propagation + A/B)

Total: 8-12 hours (1-2 days part-time)


9. Next Steps After This Phase

If Phase 19-3 achieves +5-8%:

  • Current: 52.06M ops/s
  • After 19-3: ~54.7-56.2M ops/s
  • Gap to libc (79.72M ops/s): libc still ~42-46% faster

Remaining candidates (from Phase 19-1 Design):

  • Candidate C: Stats removal (+3-5%, already done in BENCH_MINIMAL)
  • Candidate D: Header inline (+2-3%)
  • Candidate E: Route fast path (+2-3%)

Or: Re-profile with perf record to find next hot path (self% ≥ 5%)