hakmem/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md

# Phase 19-2: FASTLANE_DIRECT Promotion + Rebaseline (Next Instructions)

## 0. Status (where we are)

- Phase 19-1b (FASTLANE_DIRECT) is **GO**: throughput **+5.88%** with **-15.23% instr/op** and **-19.36% branches/op**.
- Safety hardening completed:
  - `!g_initialized` → direct path is skipped (fail-fast, same rule as Front FastLane).
  - malloc miss no longer calls `malloc_cold()` directly; it falls through to the normal wrapper path (preserves `g_hakmem_lock_depth` invariants).
  - ENV cache is a single global `_Atomic` so `bench_profile` refresh affects wrappers.

## 1. Promotion policy (Box Theory)

- Keep rollback simple:
  - `HAKMEM_FASTLANE_DIRECT=0` → disable (fallback to Phase 6 FastLane wrapper path).
  - `HAKMEM_FASTLANE_DIRECT=1` → enable (direct `malloc_tiny_fast()` / `free_tiny_fast()` first).
- Promotion level:
  - **Preset promotion** (recommended): set `HAKMEM_FASTLANE_DIRECT=1` in `MIXED_TINYV3_C7_SAFE` and `C6_HEAVY_LEGACY_POOLV1` presets.
  - Keep **ENV default = 0** (opt-in) until real-world/LD_PRELOAD validation is done.
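The opt-in rule above can be sketched as a default-with-override lookup. The commit notes for this phase describe the clean-env script using the `${HAKMEM_FASTLANE_DIRECT:-1}` pattern; the helper name `fastlane_direct_effective` below is hypothetical and only illustrates how an explicit value from the caller takes precedence over a preset default.

```sh
# Hypothetical helper (not part of hakmem): resolve the effective
# FASTLANE_DIRECT setting. An explicit value set by the caller wins;
# otherwise the preset default passed as $1 is used
# (1 for promoted presets, 0 for the global opt-in default).
fastlane_direct_effective() {
  preset_default="$1"
  printf '%s\n' "${HAKMEM_FASTLANE_DIRECT:-$preset_default}"
}

# Example: rollback stays a one-variable change.
#   HAKMEM_FASTLANE_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
```

Note that `:-` treats an empty value the same as an unset one, so an empty `HAKMEM_FASTLANE_DIRECT` falls back to the preset default in this sketch.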

## 2. Required verification (same-binary A/B)

### 2.1 Mixed (10-run, clean env)

Baseline:
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
```

Optimized:
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
```

GO/NO-GO:
- GO: mean **≥ +1.0%**
- NEUTRAL: mean strictly within **±1.0%** → keep as preset-only (do not flip global default)
- NO-GO: mean **≤ -1.0%** → revert preset promotion
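The thresholds above can be applied mechanically once the two means are known. The following is a minimal sketch, assuming the means are read off the run output by hand; `ab_verdict` is a hypothetical helper name, and treating exactly ±1.0% as GO / NO-GO is an assumption about the boundary.

```sh
# Hypothetical helper: classify a same-binary A/B result.
# $1 = baseline mean (e.g. M ops/s), $2 = optimized mean (same unit).
ab_verdict() {
  awk -v base="$1" -v opt="$2" 'BEGIN {
    delta = (opt - base) / base * 100      # percent change vs baseline
    if (delta >= 1.0)       verdict = "GO"
    else if (delta <= -1.0) verdict = "NO-GO"
    else                    verdict = "NEUTRAL"
    printf "%s %.2f\n", verdict, delta
  }'
}

# Example with the Phase 19-1b Mixed means (49.17M vs 52.06M ops/s):
#   ab_verdict 49.17 52.06
```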

### 2.2 C6-heavy (5-run)

```sh
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=0 ./bench_mid_large_mt_hakmem 1 1000000 400 1
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=1 ./bench_mid_large_mt_hakmem 1 1000000 400 1
```
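Each command above is a single invocation, so the "5-run" in the heading presumably means repeating each side five times and comparing means. A minimal sketch of a repeat helper follows; `run_n` is a hypothetical name, and collecting the throughput figures from the output is left manual because the benchmark's output format is not specified here.

```sh
# Hypothetical helper: run a command N times, one after another.
# $1 = repeat count, remaining arguments = the command to run.
run_n() {
  n="$1"; shift
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" || return 1     # stop early if any run fails
    i=$((i + 1))
  done
}

# Example (env sets the variables for the benchmark process only):
#   run_n 5 env HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=0 \
#     ./bench_mid_large_mt_hakmem 1 1000000 400 1
#   run_n 5 env HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=1 \
#     ./bench_mid_large_mt_hakmem 1 1000000 400 1
```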

## 3. Perf stat capture (root-cause guardrails)

Run the same command for both sides of the A/B (`HAKMEM_FASTLANE_DIRECT=0` and `HAKMEM_FASTLANE_DIRECT=1`):
```sh
perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses,iTLB-load-misses,dTLB-load-misses -- \
  ./bench_random_mixed_hakmem 200000000 400 1
```

Checklist:
- `instructions/op` and `branches/op` must decrease (expected; Phase 19-1b measured -15.23% and -19.36%)
- iTLB/dTLB misses may worsen; accept only if throughput still improves
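To compare the per-op figures in the checklist, raw counter totals need to be divided by the operation count. The sketch below assumes `perf stat` is run with its `-x,` CSV output (counter value in field 1, event name in field 3) and that one "op" corresponds to one benchmark iteration (the `200000000` argument); `perf_per_op` is a hypothetical helper name.

```sh
# Hypothetical helper: turn `perf stat -x,` CSV (on stdin) into per-op values.
# $1 = operation count (assumed equal to the benchmark's iteration argument).
perf_per_op() {
  awk -F, -v ops="$1" '
    $1 ~ /^[0-9]+$/ {                 # skips "<not counted>", comments, blanks
      printf "%s %.2f\n", $3, $1 / ops
    }'
}

# Example: write the counters to a file, then normalise them.
#   perf stat -x, -o direct1.csv -e instructions,branches -- \
#     ./bench_random_mixed_hakmem 200000000 400 1
#   perf_per_op 200000000 < direct1.csv
```

Running this once per A/B side gives two small tables that can be compared line by line.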

## 4. Next target selection (after promotion)

After Phase 19-2 is stable, re-run `perf record` on Mixed and choose the next box by **self% ≥ 5%**:
- If `unified_cache_push/pop` rises: focus on **UnifiedCache data-path** (touch fewer cache lines).
- If `tiny_header_finalize_alloc` rises: focus on **header finalize path** (but treat as high NO-GO risk; prior header work was often NEUTRAL).
- If ENV checks reappear in hot path: consider **Phase 19-3 (ENV check consolidation)**, but keep it in a separate research box.
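The self% ≥ 5% cut can be sketched as a filter over text output from `perf report`. This assumes a layout where the first column is the overhead percentage and the last column is the symbol name (as with `perf report --stdio --no-children` on C symbols); the exact columns depend on the perf version and options, and `hot_symbols` is a hypothetical helper name.

```sh
# Hypothetical helper: keep report rows whose self% meets a threshold.
# $1 = minimum percentage (defaults to 5). Reads report text on stdin.
hot_symbols() {
  awk -v min="${1:-5}" '
    # Data rows start with a percentage such as "23.97%";
    # header and comment rows start with "#" and are ignored.
    $1 ~ /^[0-9.]+%$/ && ($1 + 0) >= min { print $1, $NF }'
}

# Example:
#   perf report --stdio --no-children | hot_symbols 5
```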

Commit message: Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct), GO (+5.88%)

## Phase 17 v2: FORCE_LIBC Gap Validation Fix

Critical bug fix: the Phase 17 v1 measurement was broken.

- Problem: `HAKMEM_FORCE_LIBC_ALLOC=1` only took effect after FastLane, so the same-binary A/B was effectively "hakmem vs hakmem" (a spurious +0.39% measurement).
- Fix: added an early bypass on `g_force_libc_alloc==1` at `core/box/hak_wrappers.inc.h:171` and `:645`, so the wrappers go straight to `__libc_malloc` / `__libc_free` before anything else.

Result (correct same-binary A/B measurement):
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)

Gap breakdown:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)

Conclusion: Case A confirmed (allocator dominant, NOT layout). The Case B verdict from Phase 17 v1 was wrong.

Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)

---

## Phase 19: FastLane Instruction Reduction Analysis

Goal: reduce the instruction gap to libc (libc executes -35% instructions and -56% branches).

perf stat analysis (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)

Hot path (perf report):
- `front_fastlane_try_free`: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- Wrapper overhead: ~55% of all cycles

Reduction candidates:
- A: Remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate ENV snapshots (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)

Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md

---

## Phase 19-1b: FastLane Direct: GO (+5.88%)

Strategy: bypass the wrapper layer and call the core allocator directly.
- `free()` → `free_tiny_fast()` (not `free_tiny_fast_hot()`)
- `malloc()` → `malloc_tiny_fast()`

Why Phase 19-1 was NO-GO (-3.81%):
1. `__builtin_expect(fastlane_direct_enabled(), 0)` was counterproductive (it made the A/B unfair).
2. `free_tiny_fast_hot()` was the wrong choice (`free_tiny_fast()` is the winning path).

Phase 19-1b fixes:
1. Removed `__builtin_expect()`.
2. Call `free_tiny_fast()` directly.

Result (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- Delta: +5.88% (clears the +5% GO threshold)

perf stat (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)

Trade-offs (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs

Implementation:
1. ENV gate: `core/box/fastlane_direct_env_box.{h,c}`
  - `HAKMEM_FASTLANE_DIRECT=0/1` (default: 0, opt-in)
  - Single `_Atomic` global (fixes the wrapper cache problem)
2. Wrapper changes: `core/box/hak_wrappers.inc.h`
  - malloc: direct call to `malloc_tiny_fast()` when FASTLANE_DIRECT=1
  - free: direct call to `free_tiny_fast()` when FASTLANE_DIRECT=1
  - Safety: the direct path is not used while `!g_initialized`; the fallback is kept
3. Preset promotion: `core/bench_profile.h:88`
  - `bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")`
  - Comment: +5.88% proven on Mixed, 10-run
4. cleanenv update: `scripts/run_mixed_10_cleanenv.sh:22`
  - `HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}`
  - Promoted in the same way as Phase 9/10

Verdict: GO. Adopted on the mainline; preset promotion complete.

Rollback: `HAKMEM_FASTLANE_DIRECT=0` returns to the existing FastLane path.

Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md

---

## Cumulative Performance

- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- Cumulative gain: ~+30% from baseline

Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- Gap: +53.2% (was +62.7% before Phase 19-1b)

Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-15 11:28:40 +09:00
