Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)

## Phase 17 v2: FORCE_LIBC Gap Validation Fix

**Critical bug fix**: the Phase 17 v1 measurement was broken.

**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 only took effect after FastLane, so the same-binary A/B was effectively "hakmem vs hakmem" (a false +0.39% reading).

**Fix**: added an early bypass for g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171 and :645, so the wrappers go straight to __libc_malloc/__libc_free before anything else runs.

**Result**: a correct same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)

**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)

**Conclusion**: Case A confirmed (allocator dominant, NOT layout). The Case B verdict from Phase 17 v1 was wrong.

Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)

---

## Phase 19: FastLane Instruction Reduction Analysis

**Goal**: reduce the instruction gap to libc (libc runs 35% fewer instructions and 56% fewer branches per op)

**perf stat analysis** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)

**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**

**Reduction candidates**:
- A: Remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate ENV snapshots (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)

Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md

---

## Phase 19-1b: FastLane Direct (GO, +5.88%)

**Strategy**: bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()

**Why Phase 19-1 was NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired: the hint marks the direct path as unlikely, which made the A/B unfair
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)

**Fixes in Phase 19-1b**:
1. Removed __builtin_expect()
2. Call free_tiny_fast() directly

**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)

**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)

**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs

**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
   - HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
   - Single _Atomic global (fixes the wrapper cache problem)
2. **Wrapper changes**: core/box/hak_wrappers.inc.h
   - malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
   - free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1
   - Safety: the direct path is not used while !g_initialized; the fallback is kept
3. **Preset promotion**: core/bench_profile.h:88
   - bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
   - Comment: +5.88% proven on Mixed, 10-run
4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
   - HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
   - Promoted the same way as Phase 9/10

**Verdict**: GO. Adopted on the mainline; preset promotion complete.

**Rollback**: set HAKMEM_FASTLANE_DIRECT=0 to return to the existing FastLane path.

Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md

---

## Cumulative Performance

- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**

Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.2%** (was +62.7% before Phase 19-1b)

Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
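The early-bypass idea behind the Phase 17 v2 fix can be sketched roughly as below. This is a minimal illustration, not the code in core/box/hak_wrappers.inc.h: only `g_force_libc_alloc` is a name taken from the commit message, while `wrapper_malloc`, `wrapper_free`, the `*_stub` functions and the hit counters are hypothetical stand-ins (the real wrappers call `__libc_malloc`/`__libc_free`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Flag named in the commit message; assumed here to hold 0 or 1.
 * Everything else in this sketch is a hypothetical stand-in. */
static int g_force_libc_alloc = 0;

/* Counters so a caller can observe which path served a request. */
static int g_libc_path_hits = 0;
static int g_fastlane_path_hits = 0;

/* Stand-ins for __libc_malloc/__libc_free. */
static void *libc_alloc_stub(size_t size) { g_libc_path_hits++; return malloc(size); }
static void libc_free_stub(void *p) { g_libc_path_hits++; free(p); }

/* Stand-ins for the FastLane front end. */
static void *fastlane_alloc_stub(size_t size) { g_fastlane_path_hits++; return malloc(size); }
static void fastlane_free_stub(void *p) { g_fastlane_path_hits++; free(p); }

/* The FORCE_LIBC check is the first thing the wrapper does, so no
 * allocator front-end code runs when the flag is set. */
static void *wrapper_malloc(size_t size) {
    assert(g_force_libc_alloc == 0 || g_force_libc_alloc == 1);
    if (g_force_libc_alloc == 1) {
        return libc_alloc_stub(size);   /* early bypass */
    }
    return fastlane_alloc_stub(size);
}

static void wrapper_free(void *p) {
    if (g_force_libc_alloc == 1) {
        libc_free_stub(p);              /* early bypass */
        return;
    }
    fastlane_free_stub(p);
}
```

With the check placed first, a FORCE_LIBC=1 run never touches the front-end stubs, which is the property the same-binary A/B depends on.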
2025-12-15 11:28:40 +09:00
parent bc2c5ded76
commit ec87025da6
14 changed files with 1213 additions and 60 deletions
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -1,5 +1,117 @@
 # Mainline Task (Current)
 ## Update Notes (2025-12-15 Phase 19-1b FASTLANE-DIRECT-1B)
 ### Phase 19-1b FASTLANE-DIRECT-1B: FastLane Direct (Revised) — ✅ GO (+5.88%)
 **Result**: The revised version of Phase 19-1 succeeded. Removing __builtin_expect() and calling free_tiny_fast() directly raised throughput by **+5.88%**.
 **A/B Test Results**:
 - Baseline: 49.17M ops/s (FASTLANE_DIRECT=0)
 - Optimized: 52.06M ops/s (FASTLANE_DIRECT=1)
 - Delta: **+5.88%** (GO; clears the +5% target)
 **perf stat Analysis** (200M ops):
 - Instructions: **-15.23%** (199.90 → 169.45/op, -30.45)
 - Branches: **-19.36%** (51.49 → 41.52/op, -9.97)
 - Cycles: **-5.07%** (88.88 → 84.37/op)
 - I-cache misses: -11.79% (Good)
 - iTLB misses: +41.46% (Bad, but overall gain wins)
 - dTLB misses: +29.15% (Bad, but overall gain wins)
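The percentages above follow directly from the before/after figures. A small sketch of that arithmetic, using only the numbers quoted in this section (the helper names `pct_change`, `print_delta` and `print_phase19_1b_deltas` are made up for illustration):

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Prints one row: metric, before, after, absolute and relative delta. */
static void print_delta(const char *name, double before, double after) {
    printf("%-18s %8.2f -> %8.2f  (%+.2f, %+.2f%%)\n",
           name, before, after, after - before, pct_change(before, after));
}

/* Figures copied from the A/B and perf stat results above. */
static void print_phase19_1b_deltas(void) {
    print_delta("throughput (M/s)", 49.17, 52.06);
    print_delta("instructions/op", 199.90, 169.45);
    print_delta("branches/op", 51.49, 41.52);
    print_delta("cycles/op", 88.88, 84.37);
}
```

Calling `print_phase19_1b_deltas()` prints one row per metric, which makes it easy to re-check the table if the measurements are rerun.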
 **Root causes identified**:
 1. Why Phase 19-1 was NO-GO: `__builtin_expect(fastlane_direct_enabled(), 0)` backfired
 2. `free_tiny_fast()` beats `free_tiny_fast_hot()` (it is the unified cache winner)
 3. With both fixed, wrapper overhead drops, which yields the large instruction/branch reduction
 **Changes**:
 - File: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h`
 - malloc: `__builtin_expect(fastlane_direct_enabled(), 0)` → `fastlane_direct_enabled()`
 - free: `free_tiny_fast_hot()` → `free_tiny_fast()` (switched to the winning path)
 - Safety: while `!g_initialized`, the direct path is not used and the call falls back to the existing path (same fail-fast as FastLane)
 - Safety: a malloc miss does not call `malloc_cold()` directly; it falls through to the existing wrapper path (preserves the lock_depth assumptions)
 - ENV cache: consolidated into a single global so that `fastlane_direct_env_refresh_from_env()` updates the same `_Atomic` that the wrapper reads
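The gate described in the list above might look roughly like the following. This is a sketch under stated assumptions, not the contents of fastlane_direct_env_box.{h,c}: the function names `fastlane_direct_enabled` and `fastlane_direct_env_refresh_from_env`, the flag `g_initialized` and the variable `HAKMEM_FASTLANE_DIRECT` come from the notes, but their bodies, the global `g_fastlane_direct`, the tri-state encoding and `wrapper_pick_route` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* One process-wide cached value: -1 = not read yet, 0 = off, 1 = on.
 * The refresh function and the wrapper both use this same object
 * (the name and the encoding are assumptions). */
static _Atomic int g_fastlane_direct = -1;

/* Stand-in for hakmem's init flag (name from the notes, semantics assumed). */
static int g_initialized = 0;

/* Re-read HAKMEM_FASTLANE_DIRECT and publish the result. */
static void fastlane_direct_env_refresh_from_env(void) {
    const char *e = getenv("HAKMEM_FASTLANE_DIRECT");
    int on = (e != NULL && e[0] == '1') ? 1 : 0;   /* unset means off (opt-in) */
    atomic_store_explicit(&g_fastlane_direct, on, memory_order_relaxed);
}

/* Cheap query used on the hot path; reads the env only on first use. */
static int fastlane_direct_enabled(void) {
    int v = atomic_load_explicit(&g_fastlane_direct, memory_order_relaxed);
    if (v < 0) {
        fastlane_direct_env_refresh_from_env();
        v = atomic_load_explicit(&g_fastlane_direct, memory_order_relaxed);
    }
    assert(v == 0 || v == 1);
    return v;
}

enum route { ROUTE_DIRECT, ROUTE_LEGACY };

/* Which path a wrapper call would take. The condition carries no
 * __builtin_expect hint, so neither side of the A/B is marked unlikely. */
static enum route wrapper_pick_route(void) {
    if (g_initialized && fastlane_direct_enabled()) {
        return ROUTE_DIRECT;
    }
    return ROUTE_LEGACY;   /* covers !g_initialized and FASTLANE_DIRECT=0 */
}
```

Keeping a single atomic means a refresh is visible to every wrapper call, instead of each translation unit holding its own stale copy.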
 **Next**: Phase 19-1b is adopted on the mainline. Run with ENV `HAKMEM_FASTLANE_DIRECT=1`.
 ---
 ## Previous Task (Phase 19 FASTLANE-INSTRUCTION-REDUCTION-1)
 ### Phase 19 FASTLANE-INSTRUCTION-REDUCTION-1: FastLane Instruction Reduction v1 — 📊 ANALYSIS COMPLETE
 Result: perf stat/record analysis identified **the essence of the gap to libc**. The design document is complete.
 - Design: `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md`
 - perf data: saved (perf_stat_hakmem.txt, perf_stat_libc.txt, perf.data.phase19_hakmem)
 ### Gap Analysis (200M ops baseline)
 **Per-operation overhead** (hakmem vs libc):
 - Instructions/op: **209.09 vs 135.92** (+73.17, **+53.8%**)
 - Branches/op: **52.33 vs 22.93** (+29.40, **+128.2%**)
 - Cycles/op: **96.48 vs 54.69** (+41.79, +76.4%)
 - Throughput: **44.88M vs 77.62M ops/s** (+73.0% gap)
 **Critical finding**: hakmem executes **73 extra instructions** and **29 extra branches** per op. This accounts for the entire throughput gap.
 ### Hot Path Breakdown (perf report)
 Top wrapper overhead (~55% of cycles in total):
 - `front_fastlane_try_free`: **23.97%**
 - `malloc`: **23.84%**
 - `free`: **6.82%**
 The wrapper layer consumes more than half of all cycles (double validation, ENV checks, class mask checks, and so on).
 ### Reduction Candidates (in priority order)
 1. **Candidate A: Remove the FastLane Wrapper Layer** (highest ROI)
   - Impact: **-17.5 instructions/op, -6.0 branches/op** (+10-15% throughput)
   - Risk: **LOW** (free_tiny_fast_hot already exists)
   - Rationale: removes the double header validation and the ENV checks
 2. **Candidate B: Consolidate ENV Snapshots** (high ROI)
   - Impact: **-10.0 instructions/op, -4.0 branches/op** (+5-8% throughput)
   - Risk: **MEDIUM** (needs ENV invalidation handling)
   - Rationale: merges 3+ ENV checks into one
 3. **Candidate C: Remove Stats Counters** (medium ROI)
   - Impact: **-5.0 instructions/op, -2.5 branches/op** (+3-5% throughput)
   - Risk: **LOW** (compile-time optional)
   - Rationale: removes the atomic increment overhead
 4. **Candidate D: Header Validation Inline** (medium ROI)
   - Impact: **-4.0 instructions/op, -1.5 branches/op** (+2-3% throughput)
   - Risk: **MEDIUM** (assumes the caller has already validated)
   - Rationale: removes the double header load
 5. **Candidate E: Static Route Fast Path** (lower ROI)
   - Impact: **-3.5 instructions/op, -1.5 branches/op** (+2-3% throughput)
   - Risk: **LOW** (route table is static)
   - Rationale: replaces a function call with a bit test
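As an illustration of Candidate E only, replacing a route lookup call with a bit test could look like the sketch below. None of these names come from hakmem: `TINY_CLASS_COUNT`, `g_route_fast_mask`, `route_mask_init` and `route_is_fast` are hypothetical, and the class count of 8 is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: 8 size classes; bit i set means class i takes the fast route. */
enum { TINY_CLASS_COUNT = 8 };
static uint32_t g_route_fast_mask = 0;

/* Build the mask once from a per-class table (stand-in for a static route table). */
static void route_mask_init(const int is_fast[TINY_CLASS_COUNT]) {
    uint32_t mask = 0;
    for (int i = 0; i < TINY_CLASS_COUNT; i++) {
        if (is_fast[i]) {
            mask |= (uint32_t)1 << i;
        }
    }
    g_route_fast_mask = mask;
}

/* Hot-path check: one shift and one AND, with no call into a lookup function. */
static inline int route_is_fast(unsigned class_idx) {
    assert(class_idx < TINY_CLASS_COUNT);
    return (int)((g_route_fast_mask >> class_idx) & 1u);
}
```

The mask is only valid while the route table stays static, which is the condition the LOW risk rating for this candidate relies on.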
 **Combined estimate** (80% efficiency):
 - Instructions/op: 209.09 → **177.09** (gap: +53.8% → +30.3%)
 - Branches/op: 52.33 → **39.93** (gap: +128.2% → +74.1%)
 - Throughput: 44.88M → **54.3M ops/s** (+21%, **within the +15-25% target range**)
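The combined estimate can be reproduced by summing the per-candidate savings, scaling by the 80% efficiency factor, and comparing against the libc per-op counts from the gap analysis. A sketch of that calculation (the helper names are invented here; the numbers are the ones listed above):

```c
#include <assert.h>
#include <stddef.h>

enum { N_CANDIDATES = 5 };

/* Per-candidate savings for Candidates A..E, as listed above. */
static const double k_inst_savings[N_CANDIDATES]   = {17.5, 10.0, 5.0, 4.0, 3.5};
static const double k_branch_savings[N_CANDIDATES] = { 6.0,  4.0, 2.5, 1.5, 1.5};

/* Sum of the individual savings. */
static double total_savings(const double *savings, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += savings[i];
    }
    return sum;
}

/* Projected per-op count if only `efficiency` of the summed savings is realized. */
static double projected_per_op(double baseline, const double *savings,
                               size_t n, double efficiency) {
    assert(efficiency >= 0.0 && efficiency <= 1.0);
    return baseline - efficiency * total_savings(savings, n);
}

/* Remaining gap versus libc, in percent. */
static double gap_vs_libc_pct(double ours, double libc) {
    assert(libc > 0.0);
    return (ours / libc - 1.0) * 100.0;
}
```

For instructions, `projected_per_op(209.09, k_inst_savings, N_CANDIDATES, 0.8)` subtracts 0.8 × 40.0 = 32.0 from the baseline, and `gap_vs_libc_pct` against 135.92 gives the remaining gap; the branch figures follow the same way with 0.8 × 15.5 = 12.4.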
 ### Implementation Plan
 - **Phase 19-1** (P0): Remove the FastLane wrapper (2-3h, +10-15%)
 - **Phase 19-2** (P1): Consolidate ENV snapshots (4-6h, +5-8%)
 - **Phase 19-3** (P2): Stats + Header Inline (2-3h, +3-5%)
 - **Phase 19-4** (P3): Route Fast Path (2-3h, +2-3%)
 ### Next Steps
 1. Start implementing Phase 19-1 (remove the FastLane layer, call free_tiny_fast_hot directly)
 2. Verify the instruction/branch reduction with perf stat
 3. Measure the throughput improvement with a Mixed 10-run
 4. Implement Phases 19-2 through 19-4 in order
 ---
 ## Update Notes (2025-12-15 Phase 18 HOT-TEXT-ISOLATION-1)
 ### Phase 18 HOT-TEXT-ISOLATION-1: Hot Text Isolation v1 — ❌ NO-GO / FROZEN
@@ -17,9 +129,9 @@
 - The hot/cold attributes are not actually applied (incomplete implementation)
 Key findings:
- Reconfirms the Phase 17 conclusion: the bottleneck is **instruction count** and **memory latency**
+- Phase 17 v2 (after the FORCE_LIBC fix): in the same-binary A/B, **libc is +62.7%** (≈1.63×) faster → the main cause of the gap is **allocator work** (not layout alone)
- Code layout optimization cannot break through the 2.30 IPC wall
+- However, `bench_random_mixed_system` is a further **+10.5%** faster than `libc-in-hakmem-binary` → a penalty from the wrapper/text environment also remains
- Next move: go to Phase 18 v2 (BENCH_MINIMAL), which cuts instruction count directly
+- Phase 18 v2 (BENCH_MINIMAL) is a valid direction for trimming the "additive fixed costs", but a reduction of about -5% instructions cannot close a +62% gap
 ## Update Notes (2025-12-14 Phase 6 FRONT-FASTLANE-1)
--- a/8
+++ b/8
@@ -253,12 +253,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)
 # Targets
 TARGET = test_hakmem
-OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o 
core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
+OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o 
core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
 OBJS = $(OBJS_BASE)
 # Shared library
 SHARED_LIB = libhakmem.so
-SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o 
core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o
+SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/box/fastlane_direct_env_box_shared.o core/page_arena_shared.o 
core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o
 # Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
 ifeq ($(POOL_TLS_PHASE1),1)
@@ -285,7 +285,7 @@ endif
 # Benchmark targets
 BENCH_HAKMEM = bench_allocators_hakmem
 BENCH_SYSTEM = bench_allocators_system
-BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o bench_allocators_hakmem.o
+BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o bench_allocators_hakmem.o
 BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
 ifeq ($(POOL_TLS_PHASE1),1)
 BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
@@ -462,7 +462,7 @@ test-box-refactor: box-refactor
 	./larson_hakmem 10 8 128 1024 1 12345 4
 # Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
-TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
+TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
 TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
 ifeq ($(POOL_TLS_PHASE1),1)
 TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
--- a/core/bench_profile.h
+++ b/core/bench_profile.h
@ -14,6 +14,7 @@
 #include "box/tiny_tcache_env_box.h"  // tiny_tcache_env_refresh_from_env (Phase 14 v1)
 #include "box/tiny_unified_lifo_env_box.h"  // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
 #include "box/front_fastlane_alloc_legacy_direct_env_box.h"  // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
 #include "box/fastlane_direct_env_box.h"  // fastlane_direct_env_refresh_from_env (Phase 19-1)
 #endif
 // Set a default value only when the env variable is not already set
@ -84,6 +85,8 @@ static inline void bench_apply_profile(void) {
 	    bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
 	    // Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
 	    bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
 	    // Phase 19-1b: FastLane Direct (wrapper layer bypass, +5.88% proven on Mixed, 10-run)
 	    bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
 	    // Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
 	    bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
 	    // Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
@ -119,6 +122,8 @@ static inline void bench_apply_profile(void) {
 	    bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
 	    // Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
 	    bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
 	    // Phase 19-1b: FastLane Direct (wrapper layer bypass)
 	    bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
 	    // Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
 	    bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
  } else if (strcmp(p, "C6_V7_STUB") == 0) {
@ -196,5 +201,7 @@ static inline void bench_apply_profile(void) {
 	  tiny_unified_lifo_env_refresh_from_env();
 	  // Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
 	  front_fastlane_alloc_legacy_direct_env_refresh_from_env();
 	  // Phase 19-1: Sync FastLane Direct ENV cache after bench_profile putenv defaults.
 	  fastlane_direct_env_refresh_from_env();
 #endif
 	}
--- a/core/box/fastlane_direct_env_box.c
+++ b/core/box/fastlane_direct_env_box.c
@ -0,0 +1,15 @@
 // fastlane_direct_env_box.c - Phase 19-1: FastLane Direct Path ENV Control (implementation)
 #include "fastlane_direct_env_box.h"
 #include <stdlib.h>
 #include <stdatomic.h>
 _Atomic int g_fastlane_direct_enabled = -1;
 // Refresh cached ENV flag from environment variable
 // Called during benchmark ENV reloads to pick up runtime changes
 void fastlane_direct_env_refresh_from_env(void) {
    const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
    int enable = (e && *e && *e != '0') ? 1 : 0;
    atomic_store_explicit(&g_fastlane_direct_enabled, enable, memory_order_relaxed);
 }
--- a/core/box/fastlane_direct_env_box.h
+++ b/core/box/fastlane_direct_env_box.h
@ -0,0 +1,46 @@
 // fastlane_direct_env_box.h - Phase 19-1: FastLane Direct Path ENV Control
 //
 // Goal: Remove wrapper layer overhead (30.79% of cycles) by calling core allocator directly
 // Strategy: Compile-time + runtime gate to bypass front_fastlane_try_*() wrapper
 //
 // Box Theory:
 //   - Boundary: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
 //   - Rollback: ENV=0 reverts to existing FastLane wrapper path
 //   - Observability: perf stat shows instruction/branch reduction
 //
 // Expected Performance:
 //   - Reduction: -17.5 instructions/op, -6.0 branches/op
 //   - Impact: +10-15% throughput (remove 30% wrapper overhead)
 //
 // ENV Variables:
 //   HAKMEM_FASTLANE_DIRECT=0/1  # Enable direct path (default: 0, research box)
 #pragma once
 #include <stdatomic.h>
 #include <stdlib.h>
 // ENV control: cached flag for fastlane_direct_enabled()
 // -1: uninitialized, 0: disabled, 1: enabled
 // NOTE: Must be a single global (not header-static) so bench_profile refresh can
 // update the same cache used by malloc/free wrappers.
 extern _Atomic int g_fastlane_direct_enabled;
 // Runtime check: Is FastLane Direct path enabled?
 // Returns: 1 if enabled, 0 if disabled
 // Hot path: Single atomic load (after first call)
 static inline int fastlane_direct_enabled(void) {
    int val = atomic_load_explicit(&g_fastlane_direct_enabled, memory_order_relaxed);
    if (__builtin_expect(val == -1, 0)) {
        // Cold path: Initialize from ENV
        const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
        int enable = (e && *e && *e != '0') ? 1 : 0;
        atomic_store_explicit(&g_fastlane_direct_enabled, enable, memory_order_relaxed);
        return enable;
    }
    return val;
 }
 // Refresh from ENV: Called during benchmark ENV reloads
 // Allows runtime toggle without recompilation
 void fastlane_direct_env_refresh_from_env(void);
--- a/core/box/hak_wrappers.inc.h
+++ b/core/box/hak_wrappers.inc.h
@ -43,6 +43,7 @@ void* realloc(void* ptr, size_t size) {
 #include "malloc_tiny_direct_env_box.h"  // Phase 5 E5-4: Malloc Tiny direct path ENV gate
 #include "malloc_tiny_direct_stats_box.h"  // Phase 5 E5-4: Malloc Tiny direct path stats
 #include "front_fastlane_box.h"        // Phase 6: Front FastLane (Layer Collapse)
 #include "fastlane_direct_env_box.h"   // Phase 19-1: FastLane Direct Path (remove wrapper layer)
 #include "../hakmem_internal.h"        // AllocHeader helpers for diagnostics
 #include "../hakmem_super_registry.h"  // Superslab lookup for diagnostics
 #include "../superslab/superslab_inline.h"  // slab_index_for, capacity
@ -165,6 +166,14 @@ void* malloc(size_t size) {
 #endif
    // NDEBUG: malloc_count increment disabled - removes 27.55% bottleneck
    // Force libc must override FastLane/hot wrapper paths.
    // NOTE: Use the cached file-scope g_force_libc_alloc to avoid getenv recursion
    // during early startup (before lock_depth is incremented).
    if (__builtin_expect(g_force_libc_alloc == 1, 0)) {
        extern void* __libc_malloc(size_t);
        return __libc_malloc(size);
    }
    // Phase 20-2: BenchFast mode (structural ceiling measurement)
    // WARNING: Bypasses ALL safety checks - benchmark only!
    // IMPORTANT: Do NOT use BenchFast during preallocation/init to avoid recursion.
@ -176,6 +185,28 @@ void* malloc(size_t size) {
        // Fallback to normal path for large allocations
    }
    // Phase 19-1b: FastLane Direct Path (bypass wrapper layer, revised)
    // Strategy: Direct call to malloc_tiny_fast() (remove wrapper overhead; miss falls through)
    // Expected: -17.5 instructions/op, -6.0 branches/op, +10-15% throughput
    // ENV: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
    // Phase 19-1b changes:
    //   1. Removed __builtin_expect() from fastlane_direct_enabled() check (unfair A/B)
    //   2. No change to malloc path (malloc_tiny_fast already optimal)
    if (fastlane_direct_enabled()) {
        // Fail-fast: match Front FastLane rule (FastLane is only safe after init completes).
        if (__builtin_expect(!g_initialized, 0)) {
            // Not safe → fall through to wrapper path (handles init/LD safety).
        } else {
            // Direct path: bypass front_fastlane_try_malloc() wrapper
            void* ptr = malloc_tiny_fast(size);
            if (__builtin_expect(ptr != NULL, 1)) {
                return ptr;  // Success: handled by hot path
            }
            // Not handled → fall through to existing FastLane + wrapper path.
            // This preserves lock_depth/init/LD semantics for Mid/Large allocations.
        }
    }
    // Phase 6: Front FastLane (Layer Collapse)
    // Strategy: Collapse wrapper→gate→policy→route layers into single hot box
    // Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
@ -631,6 +662,38 @@ void free(void* ptr) {
 #endif
    if (!ptr) return;
    // Force libc must override FastLane/hot wrapper paths.
    // NOTE: Use the cached file-scope g_force_libc_alloc (no getenv) to keep
    // this check safe even during early startup/recursion scenarios.
    if (__builtin_expect(g_force_libc_alloc == 1, 0)) {
        extern void __libc_free(void*);
        __libc_free(ptr);
        return;
    }
    // Phase 19-1b: FastLane Direct Path (bypass wrapper layer, revised)
    // Strategy: Direct call to free_tiny_fast() / free_cold() (remove 30% wrapper overhead)
    // Expected: -17.5 instructions/op, -6.0 branches/op, +10-15% throughput
    // ENV: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
    // Phase 19-1b changes:
    //   1. Removed __builtin_expect() from fastlane_direct_enabled() check (unfair A/B)
    //   2. Changed free_tiny_fast_hot() → free_tiny_fast() (use winning path directly)
    if (fastlane_direct_enabled()) {
        // Fail-fast: match Front FastLane rule (FastLane is only safe after init completes).
        if (__builtin_expect(!g_initialized, 0)) {
            // Not safe → fall through to wrapper path (handles init/LD safety).
        } else {
            // Direct path: bypass front_fastlane_try_free() wrapper
            if (free_tiny_fast(ptr)) {
                return;  // Success: handled by hot path
            }
            // Fallback: cold path handles Mid/Large/external pointers
            const wrapper_env_cfg_t* wcfg = wrapper_env_cfg_fast();
            free_cold(ptr, wcfg);
            return;
        }
    }
    // Phase 6: Front FastLane (Layer Collapse) - free path
    // Strategy: Collapse wrapper→gate→classify layers into single hot box
    // Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
--- a/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md
+++ b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md
@ -1,89 +1,75 @@
-# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results
+# Phase 17: FORCE_LIBC Gap Validation v2 — A/B Test Results
-**Date**: 2025-12-15  
+**Date**: 2025-12-16  
-**Verdict**: ✅ **Case B confirmed** — **Layout / I-cache penalty dominates**
+**Verdict**: ✅ **Case A confirmed** — allocator delta dominates (**libc is ~1.63× faster** in same-binary A/B)
 ---
 ## Executive Summary
-Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**:
+Phase 17 exists to avoid the classic “different binary layout/LTO” trap by running a **same-binary A/B**.
- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**.
+**Important correction (v1 invalid):**
- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary.
+`HAKMEM_FORCE_LIBC_ALLOC=1` was previously checked only in late wrapper paths, so the malloc/free hot paths
 could return before FORCE_LIBC was observed. This made the “same-binary libc” measurement effectively still
 use hakmem for the hot path.
-Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency.
+**Fix (v2):**
 Wrappers now bypass directly to `__libc_malloc/__libc_free` when cached `g_force_libc_alloc==1`, *before*
 entering FastLane/hot wrapper logic.
 Result: FORCE_LIBC now reflects real libc behavior in the same binary, and the delta is large.
 ---
 ## Measurement Setup
 Workload:
- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400`
+- Mixed 16–1024B, `WS=400`, `ITERS=20000000`
- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh`
+- Clean ENV via `scripts/run_mixed_10_cleanenv.sh`
-Two comparisons:
+Comparisons:
-1) **Same-binary toggle** (allocator logic delta)
+1) **Same binary**: `bench_random_mixed_hakmem` with `HAKMEM_FORCE_LIBC_ALLOC=0/1`
-2) **System binary** (layout penalty delta)
+2) **System binary**: `bench_random_mixed_system` (reference; different binary)
 ---
-## Results
+## Results (10-run)
 ### 1) Same-binary A/B (allocator delta)
-Binary: `bench_random_mixed_hakmem`  
+Binary: `bench_random_mixed_hakmem`
 Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`
-| Mode | Throughput (ops/s) | Delta |
+| Mode | Mean (ops/s) | Median (ops/s) | Delta |
-|------|---------------------|-------|
+|------|--------------:|---------------:|------:|
-| hakmem (`FORCE_LIBC=0`) | 48.12M | — |
+| hakmem (`FORCE_LIBC=0`) | 48.99M | 49.28M | — |
-| libc  (`FORCE_LIBC=1`) | 48.31M | **+0.39%** |
+| libc  (`FORCE_LIBC=1`) | 79.72M | 80.09M | **+62.7%** |
-Interpretation: allocator logic delta is ~noise-level in this experiment context.
+Interpretation: the allocator delta is **not** noise-level; libc is materially faster on this workload.
-### 2) System binary (layout penalty)
+### 2) System binary (layout/wrapper penalty estimate)
 Binary: `bench_random_mixed_system`
-| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary |
+| Mode | Mean (ops/s) | Median (ops/s) | Delta vs libc-in-hakmem-binary |
-|------|---------------------|--------------------------------|
+|------|--------------:|---------------:|--------------------------------:|
-| system malloc | 83.85M | **+73.57%** |
+| system malloc | 88.06M | 88.35M | **+10.5%** |
-Total observed gap: ~+74% class.
+Interpretation: there is still a non-trivial **“in-hakmem-binary” penalty** (~10%), likely from wrapper/bench
-
+overhead and text footprint, but it is *not* the dominant term versus hakmem’s allocator gap.
 ---
 ## Perf Stat (200M iterations) — Smoking Gun
 | Metric | hakmem binary | system binary | Delta |
 |--------|---------------|---------------|-------|
 | I-cache misses | 153K | 68K | **-55%** |
 | Cycles | 17.9B | 10.2B | **-43%** |
 | Instructions | 41.3B | 21.5B | **-48%** |
 | Binary size | 653K | 21K | **-97%** |
 Interpretation:
 - The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
 - The 30× text footprint difference strongly correlates with the gap.
 ---
 ## Conclusion
-Phase 12’s “system malloc is 1.6× faster” observation was real, but the root cause was misattributed:
+- ✅ Same-binary `FORCE_LIBC` A/B (v2) shows the **dominant gap is allocator work**, not layout alone.
-
+- ✅ There is also a smaller (~10%) penalty attributable to the hakmem-binary wrapper/text environment.
 - ❌ Not primarily allocator algorithm differences
 - ✅ **Text/layout + I-cache locality + instruction footprint**
 This shifts the optimization frontier:
 - Stop chasing more routing/dispatch micro-opt (Phase 14–16 plateau)
 - Focus on **Hot Text Isolation / layout control**
 ---
 ## Next
-Proceed to:
+- Freeze Phase 18 v1 (`--gc-sections`) as NO-GO remains correct.
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
+- Re-evaluate Phase 18 v2 (BENCH_MINIMAL) expectations: -5% instructions is not enough to close a +62% gap.
-
+- Phase 19 should target **structural per-op work reduction** (not dispatch shape), while keeping the FastLane
  boundary and “same-binary A/B” discipline.
--- a/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md
@ -8,6 +8,13 @@
 The goal of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and establish a single source of truth (SSOT) for what the gap actually is (allocator difference or binary difference).
 **Important (v1 pitfall)**:
 If `HAKMEM_FORCE_LIBC_ALLOC=1` is observed only after the malloc/free hot path, the FastLane/hot wrapper returns first,
 and the same-binary A/B breaks: it effectively becomes **hakmem vs hakmem**.
 On 2025-12-16 the `malloc/free` wrappers in this repository were fixed so that, when the cached `g_force_libc_alloc==1`, they go to `__libc_malloc/__libc_free`
 **first** (see `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`).
 ---
 ## 0. Goal (Deliverables)
@ -127,4 +134,3 @@ perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-m
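 - Run every A/B in the **same binary** (so layout/LTO differences cannot cause a misjudgment)
 - Every new optimization must have an ENV gate (revertible) and exactly one boundary
 - When in doubt, prefer "fail fast and fall back" (consistency over speed)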
--- a/docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md
+++ b/docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md
@ -0,0 +1,307 @@
 # Phase 19-1b: FastLane Direct (Revised) A/B Test Results
 **Date**: 2025-12-15
 **Status**: ✅ **GO** (+5.88% throughput)
 **Branch**: master
 **Commit**: (pending)
 ---
 ## Executive Summary
 The revised Phase 19-1 (19-1b) succeeded. We identified why Phase 19-1 was NO-GO (-3.81%), and the fix achieved **+5.88% throughput**.
 **Root causes identified**:
 1. `__builtin_expect(fastlane_direct_enabled(), 0)` made the branch hint counterproductive
 2. `free_tiny_fast()` is the winning path over `free_tiny_fast_hot()` (unified cache winner)
 **Fixes**:
 - Removed `__builtin_expect()` (fair A/B comparison)
 - Changed `free_tiny_fast_hot()` → `free_tiny_fast()` (call the winning path directly)
 ---
 ## A/B Test Results
 ### Throughput (10-run benchmark)
 **Baseline (FASTLANE_DIRECT=0)**:
 - Mean: **49.17M ops/s**
 - StdDev: 407,748 ops/s
 - CV: 0.83%
 **Optimized (FASTLANE_DIRECT=1)**:
 - Mean: **52.06M ops/s**
 - StdDev: 404,146 ops/s
 - CV: 0.78%
 **Delta**: **+5.88%** (GO verdict; clears the +5% target)
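The delta and CV figures above can be re-derived from the reported means and standard deviations. The following is a minimal sketch; the helper names `pct_delta` and `cv_percent` are hypothetical and not part of hakmem.

```c
#include <assert.h>

/* Relative change of `optimized` over `baseline`, in percent.
 * Hypothetical helper for illustration only. */
static double pct_delta(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Coefficient of variation (stddev / mean), in percent.
 * Hypothetical helper for illustration only. */
static double cv_percent(double mean, double stddev) {
    assert(mean > 0.0);
    return stddev / mean * 100.0;
}
```

Calling `pct_delta(49.17e6, 52.06e6)` reproduces the throughput delta, and `cv_percent(49.17e6, 407748.0)` reproduces the baseline CV.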
 ---
 ## perf stat Analysis (200M ops)
 ### Metrics Table
 | Metric                | Baseline        | Optimized       | Delta      | Judgment |
 |-----------------------|-----------------|-----------------|------------|----------|
 | **Throughput**        | 49.17M ops/s    | 52.06M ops/s    | **+5.88%** | **GO**   |
 | Cycles                | 17,775,213,215  | 16,873,451,633  | -5.07%     | Good     |
 | Instructions          | 39,980,185,471  | 33,889,807,627  | **-15.23%** | **Excellent** |
 | L1-icache-load-misses | 111,712         | 98,542          | -11.79%    | Good     |
 | iTLB-load-misses      | 26,039          | 36,835          | +41.46%    | Bad      |
 | dTLB-load-misses      | 59,329          | 76,626          | +29.15%    | Bad      |
 | Branches              | 10,297,849,396  | 8,304,201,436   | **-19.36%** | **Excellent** |
 | Branch-misses         | 232,502,367     | 232,239,642     | -0.11%     | Good     |
 ### Per-Operation Metrics
 | Metric       | Baseline | Optimized | Delta     |
 |--------------|----------|-----------|-----------|
 | Cycles/op    | 88.88    | 84.37     | **-4.51** |
 | Instr/op     | 199.90   | 169.45    | **-30.45** |
 | Branches/op  | 51.49    | 41.52     | **-9.97** |
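The per-operation figures are the raw perf stat counters divided by the 200M operations of the run. A minimal sketch of that normalization follows, assuming the counters are attributed entirely to the benchmark loop (startup cost is ignored); `per_op` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Average number of events per benchmark operation for one perf counter.
 * Hypothetical helper: total counter value divided by the operation count. */
static double per_op(uint64_t counter_total, uint64_t ops) {
    assert(ops > 0);
    return (double)counter_total / (double)ops;
}
```

For example, `per_op(39980185471ULL, 200000000ULL)` gives the baseline instructions per operation, and the same call with the optimized counter gives the other side of the delta.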
 **Key Findings**:
 - **Instructions: -30.45/op** (-15.23%) → the wrapper overhead reduction is effective
 - **Branches: -9.97/op** (-19.36%) → large reduction in branch count
 - **Cycles: -4.51/op** (-5.07%) → overall efficiency improvement
 **Trade-offs**:
 - iTLB/dTLB misses got worse, but the instruction/branch reduction outweighed them
 - Front-end (I-cache) improved; backend (dTLB) regressed
 - Overall throughput is +5.88%, hence the GO verdict
 ---
 ## Root Cause Analysis: Why Phase 19-1 Was NO-GO
 ### Problems in Phase 19-1
 **Phase 19-1 implementation** (`core/box/hak_wrappers.inc.h`, old version):
 ```c
 // malloc()
 if (__builtin_expect(fastlane_direct_enabled(), 0)) {  // ← Problem 1: expect(..., 0)
    void* ptr = malloc_tiny_fast(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // ...
 }
 // free()
 if (__builtin_expect(fastlane_direct_enabled(), 0)) {  // ← Problem 1: expect(..., 0)
    if (free_tiny_fast_hot(ptr)) return;  // ← Problem 2: _hot variant
    // ...
 }
 ```
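 **Core of the problem**:
 1. **`__builtin_expect(..., 0)` is counterproductive**:
   - `fastlane_direct_enabled()` is controlled by an ENV variable, so its value switches between the A and B runs
   - `__builtin_expect(..., 0)` tells the compiler (not the CPU) that this branch is unlikely, which biases code layout against the direct path
   - → With A=0 and B=1 the hint is right for A and wrong for B, so the comparison is not fair
   - → On the B side (FASTLANE_DIRECT=1), branch mispredictions increase
 2. **`free_tiny_fast()` is the winning path over `free_tiny_fast_hot()`**:
   - `free_tiny_fast_hot()`: hot/cold split version (introduced in Phase 7)
   - `free_tiny_fast()`: monolithic version (Phase 6 winner)
   - `free_tiny_fast()` won the Phase 9/10 A/B tests
   - → Choosing `_hot` in Phase 19-1 was a mistake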
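A minimal sketch of the two gate shapes, assuming GCC or Clang (`__builtin_expect` is a compiler builtin). The function names are hypothetical stand-ins for the gate in the `malloc()`/`free()` wrappers. Both gates return the same result for every input; the hint only changes how the compiler may lay out the two arms, which is why it can skew an A/B that flips the flag at runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: parse an ENV-style flag.
 * NULL, "" and anything starting with '0' mean disabled. */
static int env_flag_enabled(const char* value) {
    if (value == NULL) return 0;
    return (*value && *value != '0') ? 1 : 0;
}

/* Hinted gate: the compiler is told the enabled arm is unlikely,
 * so it may place that arm out of the fall-through path. */
static int gate_with_hint(int enabled) {
    if (__builtin_expect(enabled, 0)) {
        return 1;  /* direct path */
    }
    return 0;      /* wrapper path */
}

/* Unhinted gate: no layout bias toward either arm,
 * so enabled=0 and enabled=1 runs are treated alike. */
static int gate_without_hint(int enabled) {
    if (enabled) {
        return 1;  /* direct path */
    }
    return 0;      /* wrapper path */
}
```

The unhinted form is the neutral choice when the condition is a runtime flag that the experiment itself toggles.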
 ### Phase 19-1b Fix
 **Phase 19-1b implementation** (`core/box/hak_wrappers.inc.h`, after the fix):
 ```c
 // malloc()
 if (fastlane_direct_enabled()) {  // ← Fix 1: removed __builtin_expect
    void* ptr = malloc_tiny_fast(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // ...
 }
 // free()
 if (fastlane_direct_enabled()) {  // ← Fix 1: removed __builtin_expect
    if (free_tiny_fast(ptr)) return;  // ← Fix 2: switched to free_tiny_fast()
    // ...
 }
 ```
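 **Effect of the fixes**:
 1. Removed `__builtin_expect()` → the A/B becomes a fair comparison
 2. Direct call to `free_tiny_fast()` → uses the winning path directly
 **Result**: -3.81% → **+5.88%** (a 9.69 percentage-point improvement)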
 ---
 ## Design Intent vs Implementation Gap
 ### Original Design (Phase 19 DESIGN.md)
 **Expected**:
 - Removing the wrapper layer: -17.5 instructions/op, -6.0 branches/op
 - Target: +10-15% throughput
 **Measured (Phase 19-1b)**:
 - Instructions: **-30.45/op** (-15.23%, 1.74× the estimate)
 - Branches: **-9.97/op** (-19.36%, 1.66× the estimate)
 - Throughput: **+5.88%** (about half the estimate, but still GO)
 **Gap analysis**:
 - The instruction/branch reductions exceeded the estimate
 - Throughput, however, is about half the estimate (+5.88% vs +10-15%)
 - Cause: worse iTLB/dTLB misses hold throughput back
 - Conclusion: instruction reduction alone does not improve throughput linearly
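One way to see why a -15.23% instruction count yields only -5.07% cycles is to factor the cycle ratio into an instruction ratio and an IPC ratio, using the perf stat counters from the Appendix. The sketch below is an illustration with hypothetical helper names; it shows the arithmetic only and does not attribute the IPC drop to any specific cause.

```c
#include <assert.h>
#include <stdint.h>

/* Instructions per cycle for one perf stat run. */
static double ipc(uint64_t instructions, uint64_t cycles) {
    assert(cycles > 0);
    return (double)instructions / (double)cycles;
}

/* Cycle ratio (B / A) expressed as:
 *   (instructions_B / instructions_A) / (IPC_B / IPC_A)
 * A lower instruction count helps, a lower IPC gives part of it back. */
static double cycle_ratio(uint64_t instr_a, uint64_t cyc_a,
                          uint64_t instr_b, uint64_t cyc_b) {
    assert(instr_a > 0);
    double instr_ratio = (double)instr_b / (double)instr_a;
    double ipc_ratio = ipc(instr_b, cyc_b) / ipc(instr_a, cyc_a);
    return instr_ratio / ipc_ratio;
}
```

With the baseline and optimized counters, IPC falls from about 2.25 to about 2.01, so the instruction ratio of roughly 0.85 turns into a cycle ratio of roughly 0.95.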
 ---
 ## Lessons Learned
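 ### 1. The `__builtin_expect()` Pitfall
 **Problem**:
 - Using `__builtin_expect(..., 0)` on an ENV-gated path makes the A/B unfair
 - It should not be used on conditions that switch at runtime
 **Recommendations**:
 - Fine for compile-time constants (e.g. `HAKMEM_BUILD_RELEASE`)
 - Do not use it on runtime ENV variables
 - Remove expect hints and verify before running an A/B test
 ### 2. Importance of Variant Selection
 **Lesson**:
 - The choice between `free_tiny_fast_hot()` and `free_tiny_fast()` affects throughput
 - Past A/B results (Phase 9/10) should have been consulted
 - Even in a new optimization, pick the proven winning path
 ### 3. Front-end vs Backend Trade-off
 **Finding**:
 - Instruction/branch reduction (front-end improvement) does not translate directly into throughput
 - dTLB misses (backend regression) hold throughput back
 - Overall balance matters
 **Guidelines going forward**:
 - Analyze front-end and backend separately with perf stat
 - Evaluate trade-offs explicitly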
 ---
 ## Verdict: GO
 **Reasons**:
 1. **Throughput: +5.88%** (exceeds +5% target)
 2. **Instructions: -15.23%** (excellent reduction)
 3. **Branches: -19.36%** (excellent reduction)
 4. **Cycles: -5.07%** (solid improvement)
 5. **I-cache: -11.79%** (front-end improvement)
 **Trade-offs (Acceptable)**:
 - iTLB: +41.46% (front-end cost)
 - dTLB: +29.15% (backend cost)
 - → Overall gain (+5.88%) outweighs these costs
 **Decision**: Adopt Phase 19-1b on the main line. Operate with ENV `HAKMEM_FASTLANE_DIRECT=1`.
 ---
 ## Next Steps
 ### Immediate Actions
 1. ✅ Commit Phase 19-1b changes to master
 2. ✅ Update CURRENT_TASK.md with results
 3. ✅ Archive this report to `docs/analysis/`
 ### Future Optimizations
 **Phase 19-2 candidates** (dTLB miss reduction):
 - TLB prefetch hints
 - Page alignment optimization
 - Working set size reduction
 **Phase 19-3 candidates** (instruction reduction):
 - ENV snapshot consolidation (Candidate B)
 - Stats counter removal (Candidate C)
 - Header validation inline (Candidate D)
 **Target**: Close remaining gap to libc (73 instructions/op → 40-50 instructions/op)
 ---
 ## Appendix: Raw Data
 ### Baseline (FASTLANE_DIRECT=0) 10-run
 ```
 Run 1: 49.70M ops/s
 Run 2: 49.10M ops/s
 Run 3: 48.83M ops/s
 Run 4: 49.24M ops/s
 Run 5: 49.29M ops/s
 Run 6: 48.54M ops/s
 Run 7: 49.77M ops/s
 Run 8: 48.52M ops/s
 Run 9: 49.32M ops/s
 Run 10: 49.37M ops/s
 Mean: 49.17M ops/s
 StdDev: 407,748 ops/s
 CV: 0.83%
 ```
 ### Optimized (FASTLANE_DIRECT=1) 10-run
 ```
 Run 1: 51.44M ops/s
 Run 2: 52.56M ops/s
 Run 3: 51.71M ops/s
 Run 4: 52.30M ops/s
 Run 5: 51.73M ops/s
 Run 6: 51.96M ops/s
 Run 7: 52.48M ops/s
 Run 8: 51.44M ops/s
 Run 9: 51.96M ops/s
 Run 10: 52.46M ops/s
 Mean: 52.06M ops/s
 StdDev: 404,146 ops/s
 CV: 0.78%
 ```
 ### perf stat Baseline (FASTLANE_DIRECT=0)
 ```
 Performance counter stats for 'env -i PATH= HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 ./bench_random_mixed_hakmem 200000000 400 1':
    17,775,213,215      cycles
    39,980,185,471      instructions              #    2.25  insn per cycle
           111,712      L1-icache-load-misses
            26,039      iTLB-load-misses
            59,329      dTLB-load-misses
    10,297,849,396      branches
       232,502,367      branch-misses             #    2.26% of all branches
       4.486849039 seconds time elapsed
 ```
 ### perf stat Optimized (FASTLANE_DIRECT=1)
 ```
 Performance counter stats for 'env -i PATH= HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 ./bench_random_mixed_hakmem 200000000 400 1':
    16,873,451,633      cycles
    33,889,807,627      instructions              #    2.01  insn per cycle
            98,542      L1-icache-load-misses
            36,835      iTLB-load-misses
            76,626      dTLB-load-misses
     8,304,201,436      branches
       232,239,642      branch-misses             #    2.80% of all branches
       4.247212223 seconds time elapsed
 ```
 ---
 **END OF REPORT**
--- a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
+++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
@ -0,0 +1,543 @@
 # Phase 19: FastLane Instruction Reduction - Design Document
 ## 0. Executive Summary
 **Goal**: Reduce instruction/branch count gap between hakmem and libc to close throughput gap
 **Current Gap**: hakmem 44.88M ops/s vs libc 77.62M ops/s (+73.0% advantage for libc)
 **Target**: Reduce instruction gap from +53.8% to <+25%, targeting +15-25% throughput improvement
 **Success Criteria**: Achieve 52-56M ops/s (from current 44.88M ops/s)
 ### Key Findings
 Per-operation overhead comparison (200M ops):
 | Metric | hakmem | libc | Delta | Delta % |
 |--------|--------|------|-------|---------|
 | **Instructions/op** | 209.09 | 135.92 | +73.17 | **+53.8%** |
 | **Branches/op** | 52.33 | 22.93 | +29.40 | **+128.2%** |
 | Cycles/op | 96.48 | 54.69 | +41.79 | +76.4% |
 | Branch-miss % | 2.22% | 2.87% | -0.65% | Better |
 **Critical insight**: hakmem executes **73 extra instructions** and **29 extra branches** per operation vs libc.
 This massive overhead accounts for the entire throughput gap.
 ---
 ## 1. Gap Analysis (Per-Operation Breakdown)
 ### 1.1 Instruction Gap: +73.17 instructions/op (+53.8%)
 This excess comes from multiple layers of overhead:
 - **FastLane wrapper checks**: ENV gates, class mask validation, size checks
 - **Policy snapshot overhead**: TLS reads for routing decisions (3+ reads even with ENV snapshot)
 - **Route determination**: Static route table lookup vs direct path
 - **Multiple ENV gates**: Scattered throughout hot path (DUALHOT, LEGACY_DIRECT, C7_ULTRA, etc.)
 - **Stats counters**: Atomic increments on hot path (FREE_PATH_STAT_INC, ALLOC_GATE_STAT_INC, etc.)
 - **Header validation duplication**: FastLane + free_tiny_fast both validate header
 ### 1.2 Branch Gap: +29.40 branches/op (+128.2%)
 The branch gap is **~2.4x larger** than the instruction gap in relative terms (+128.2% vs +53.8%):
 - **Cascading ENV checks**: Each layer adds 1-2 branches (g_initialized, class_mask, DUALHOT, C7_ULTRA, LEGACY_DIRECT)
 - **Route dispatch**: Static route check + route_kind switch
 - **Early-exit patterns**: Multiple if-checks for ULTRA/DUALHOT/LEGACY paths
 - **Stats gating**: `if (__builtin_expect(...))` patterns around counters
 ### 1.3 Why Cycles/op Gap is Smaller Than Expected
 Despite +76.4% cycle gap, the CPU is achieving 2.17 IPC (hakmem) vs 2.49 IPC (libc).
 This suggests:
 - **Good CPU pipelining**: Branch predictor is working well (2.22% miss rate)
 - **I-cache locality**: Code is reasonably compact despite extra instructions
 - **But**: We're paying for every extra branch in pipeline stalls
 ---
 ## 2. Hot Path Breakdown (perf report)
 Top 10 hot functions (% of cycles):
 | Function | % time | Category | Reduction Target? |
 |----------|--------|----------|-------------------|
 | **front_fastlane_try_free** | 23.97% | Wrapper | ✓ **YES** (remove layer) |
 | **malloc** | 23.84% | Wrapper | ✓ **YES** (remove layer) |
 | main | 22.02% | Benchmark | (baseline) |
 | **free** | 6.82% | Wrapper | ✓ **YES** (remove layer) |
 | unified_cache_push | 4.44% | Core | Optimize later |
 | tiny_header_finalize_alloc | 4.34% | Core | Optimize later |
 | tiny_c7_ultra_alloc | 3.38% | Core | Optimize later |
 | tiny_c7_ultra_free | 2.07% | Core | Optimize later |
 | hakmem_env_snapshot_enabled | 1.22% | ENV | ✓ **YES** (eliminate checks) |
 | hak_super_lookup | 0.98% | Core | Optimize later |
 **Critical observation**: Excluding the benchmark's own `main`, the three hottest functions are **all wrappers**:
 - `front_fastlane_try_free` (23.97%) + `free` (6.82%) = **30.79%** on free wrappers
 - `malloc` (23.84%) on alloc wrapper
 - Combined wrapper overhead: **~54-55%** of all cycles
 ### 2.1 front_fastlane_try_free Annotated Breakdown
 From `perf annotate`, the hot path has these expensive operations:
 **Header validation** (lines 1c786-1c791, ~3% samples):
 ```asm
 movzbl -0x1(%rbp),%ebx          # Load header byte
 mov    %ebx,%eax                # Copy to eax
 and    $0xfffffff0,%eax         # Extract magic (0xA0)
 cmp    $0xa0,%al                # Check magic
 jne    ... (fallback)           # Branch on mismatch
 ```
 **ENV snapshot checks** (lines 1c7ff-1c822, ~7% samples):
 ```asm
 cmpl   $0x1,0x628fa(%rip)       # g_hakmem_env_snapshot_ctor_mode (3.01%)
 mov    0x628ef(%rip),%r15d      # g_hakmem_env_snapshot_gate (1.36%)
 je     ...
 cmp    $0xffffffff,%r15d
 je     ... (init path)
 test   %r15d,%r15d
 jne    ... (snapshot path)
 ```
 **Class routing overhead** (addresses 1c7d1-1c7fb, ~3% samples):
 ```asm
 mov    0x6299c(%rip),%r15d      # g.5.lto_priv.0 (policy gate)
 cmp    $0x1,%r15d
 jne    ... (fallback)
 movzbl 0x6298f(%rip),%eax       # g_mask.3.lto_priv.0
 cmp    $0xff,%al
 je     ... (all-classes path)
 movzbl %al,%r9d
 bt     %r13d,%r9d               # Bit test class mask
 jae    ... (fallback)
 ```
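 The class-mask test in the listing above reduces to a few lines of C. This is a hedged reconstruction from the disassembly only: the function name is hypothetical, and the treatment of out-of-range class indices is an assumption.
 ```c
 #include <stdint.h>
 #include <stdbool.h>
 // Returns true when class_idx is enabled in the rollout mask.
 // 0xFF means "all classes", so the per-class bit test is skipped entirely.
 static inline bool class_mask_allows(uint8_t mask, unsigned class_idx) {
     if (mask == 0xFFu) return true;      // all-classes path
     if (class_idx > 7u) return false;    // assumption: only 8 classes are maskable
     return ((mask >> class_idx) & 1u) != 0;
 }
 ```
 Even in the all-classes case the wrapper still pays for the policy-gate load, the mask load, and two compares before reaching the allocator.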
 **Total overhead**: ~15-20% of cycles in front_fastlane_try_free are spent on:
 - Header validation (repeated later in free_tiny_fast)
 - ENV snapshot probing
 - Policy/route checks
 ---
 ## 3. Reduction Candidates (Prioritized by ROI)
 ### Candidate A: **Eliminate FastLane Wrapper Layer** (Highest ROI)
 **Problem**: front_fastlane_try_free + free wrappers consume 30.79% of cycles
 **Root cause**: Double header validation + ENV checks + class mask checks
 **Proposal**: Direct call to free_tiny_fast() from free() wrapper
 **Implementation**:
 ```c
 // In free() wrapper:
 void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;
    // Phase 19-A: Direct call (no FastLane layer)
    if (free_tiny_fast(ptr)) {
        return;  // Handled
    }
    // Fallback to cold path
    free_cold(ptr);
 }
 ```
 **Reduction estimate**:
 - **Instructions**: -15-20/op (eliminate duplicate header read, ENV checks, class mask checks)
 - **Branches**: -5-7/op (remove FastLane gate checks)
 - **Impact**: ~10-15% throughput improvement (remove 30% wrapper overhead)
 **Risk**: **LOW** (free_tiny_fast already has validation + routing logic)
 ---
 ### Candidate B: **Consolidate ENV Snapshot Checks** (High ROI)
 **Problem**: ENV snapshot state is checked **up to 4 times per operation**:
 1. FastLane entry: `g_initialized` check
 2. Route determination: `hakmem_env_snapshot_enabled()` check
 3. Route-specific: `tiny_c7_ultra_enabled_env()` check
 4. Legacy fallback: Another ENV snapshot check
 **Proposal**: Single ENV snapshot read at entry, pass context down
 **Implementation**:
 ```c
 // Phase 19-B: ENV context struct
 typedef struct {
    bool c7_ultra_enabled;
    bool dualhot_enabled;
    bool legacy_direct_enabled;
    SmallRouteKind route_kind[8];  // Pre-computed routes
 } FastLaneCtx;
 static __thread FastLaneCtx g_fastlane_ctx = {0};
 static __thread int g_fastlane_ctx_init = 0;
 static inline const FastLaneCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_init == 0, 0)) {
        // One-time init per thread
        const HakmemEnvSnapshot* env = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = env->tiny_c7_ultra_enabled;
        // ... populate other fields
        g_fastlane_ctx_init = 1;
    }
    return &g_fastlane_ctx;
 }
 ```
 **Reduction estimate**:
 - **Instructions**: -8-12/op (eliminate redundant TLS reads)
 - **Branches**: -3-5/op (single init check instead of multiple)
 - **Impact**: ~5-8% throughput improvement
 **Risk**: **MEDIUM** (need to handle ENV changes during runtime - use invalidation hook)
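 One possible shape for the invalidation hook is a global generation counter that each thread compares against its cached copy. This is a sketch under stated assumptions, not the existing hakmem_env_snapshot mechanism: `g_env_generation`, `CachedCtx`, and `g_env_c7_ultra` are hypothetical names, and the last one stands in for the real snapshot source.
 ```c
 #include <stdatomic.h>
 #include <stdbool.h>
 // Hypothetical global generation; bumped whenever the ENV state is refreshed.
 static _Atomic unsigned g_env_generation = 1;
 // Stand-in for the real ENV snapshot value.
 static _Atomic bool g_env_c7_ultra = false;
 typedef struct {
     unsigned generation;      // 0 = never populated
     bool c7_ultra_enabled;
 } CachedCtx;
 static __thread CachedCtx t_ctx;
 // Invalidation hook: makes every thread repopulate on its next access.
 static inline void env_ctx_invalidate(void) {
     atomic_fetch_add_explicit(&g_env_generation, 1, memory_order_release);
 }
 // Hot path: one load and one compare; repopulates only after invalidation.
 static inline const CachedCtx* env_ctx_get(void) {
     unsigned gen = atomic_load_explicit(&g_env_generation, memory_order_acquire);
     if (__builtin_expect(t_ctx.generation != gen, 0)) {
         t_ctx.c7_ultra_enabled =
             atomic_load_explicit(&g_env_c7_ultra, memory_order_relaxed);
         t_ctx.generation = gen;
     }
     return &t_ctx;
 }
 ```
 The generation compare replaces the one-time init flag, so the fast path keeps a single predictable branch while still observing runtime ENV changes.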
 ---
 ### Candidate C: **Remove Stats Counters from Hot Path** (Medium ROI)
 **Problem**: Stats counters on hot path add atomic increments:
 - `FRONT_FASTLANE_STAT_INC(free_total)` (every op)
 - `FREE_PATH_STAT_INC(total_calls)` (every op)
 - `ALLOC_GATE_STAT_INC(total_calls)` (every alloc)
 - `tiny_front_free_stat_inc(class_idx)` (every free)
 **Proposal**: Make stats DEBUG-only or sample-based (1-in-N)
 **Implementation**:
 ```c
 // Phase 19-C: Sampling-based stats
 #if !HAKMEM_BUILD_RELEASE
    static __thread uint32_t g_stat_counter = 0;
    if (__builtin_expect((++g_stat_counter & 0xFFF) == 0, 0)) {
        // Sample 1-in-4096 operations
        FRONT_FASTLANE_STAT_INC(free_total);
    }
 #endif
 ```
 **Reduction estimate**:
 - **Instructions**: -4-6/op (remove atomic increments)
 - **Branches**: -2-3/op (remove `if (__builtin_expect(...))` checks)
 - **Impact**: ~3-5% throughput improvement
 **Risk**: **LOW** (stats already compile-time optional)
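 With 1-in-N sampling the recorded count has to be scaled back up when it is reported. A self-contained sketch of that idea, using a plain struct instead of the real stat macros; `SampledStat` and its helpers are hypothetical:
 ```c
 #include <stdint.h>
 #define STAT_SAMPLE_SHIFT 12u                             // 1-in-4096 sampling
 #define STAT_SAMPLE_MASK  ((1u << STAT_SAMPLE_SHIFT) - 1u)
 typedef struct {
     uint32_t tick;      // per-thread operation counter (wraps harmlessly)
     uint64_t samples;   // number of sampled events
 } SampledStat;
 // Call once per operation; only every 4096th call touches `samples`.
 static inline void sampled_stat_hit(SampledStat* s) {
     if (__builtin_expect((++s->tick & STAT_SAMPLE_MASK) == 0, 0)) {
         s->samples++;
     }
 }
 // Estimated event count, scaled back up by the sampling period.
 static inline uint64_t sampled_stat_estimate(const SampledStat* s) {
     return s->samples << STAT_SAMPLE_SHIFT;
 }
 ```
 The estimate can undercount by up to 4095 events, which is negligible at benchmark scale but matters for short test runs.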
 ---
 ### Candidate D: **Inline Header Validation** (Medium ROI)
 **Problem**: Header validation happens twice:
 1. FastLane wrapper: `*((uint8_t*)ptr - 1)` (lines 179-191 in front_fastlane_box.h)
 2. free_tiny_fast: Same check (lines 598-605 in malloc_tiny_fast.h)
 **Proposal**: Trust FastLane validation, remove duplicate check
 **Implementation**:
 ```c
 // Phase 19-D: Add "trusted" variant
 static inline int free_tiny_fast_trusted(void* ptr, int class_idx, void* base) {
    // Skip header validation (caller already validated)
    // Direct to route dispatch
    ...
 }
 // In FastLane:
 uint8_t header = *((uint8_t*)ptr - 1);
 int class_idx = header & 0x0F;
 void* base = tiny_user_to_base_inline(ptr);
 return free_tiny_fast_trusted(ptr, class_idx, base);
 ```
 **Reduction estimate**:
 - **Instructions**: -3-5/op (remove duplicate header load + extract)
 - **Branches**: -1-2/op (remove duplicate magic check)
 - **Impact**: ~2-3% throughput improvement
 **Risk**: **MEDIUM** (need to ensure all callers validate header)
 ---
 ### Candidate E: **Static Route Table Optimization** (Lower ROI)
 **Problem**: Route determination uses TLS lookups + bit tests:
 ```c
 if (tiny_static_route_ready_fast()) {
    route_kind = tiny_static_route_get_kind_fast(class_idx);
 } else {
    route_kind = tiny_policy_hot_get_route(class_idx);
 }
 ```
 **Proposal**: Pre-compute common routes at init, inline direct paths
 **Implementation**:
 ```c
 // Phase 19-E: Route fast path (C0-C3 LEGACY, C7 ULTRA)
 static __thread uint8_t g_route_fastmap = 0;  // bit 0=C0...bit 7=C7, 1=LEGACY
 static inline bool is_legacy_route_fast(int class_idx) {
    return (g_route_fastmap >> class_idx) & 1;
 }
 ```
 **Reduction estimate**:
 - **Instructions**: -3-4/op (replace function call with bit test)
 - **Branches**: -1-2/op (replace nested if with single bit test)
 - **Impact**: ~2-3% throughput improvement
 **Risk**: **LOW** (route table is already static)
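 The fastmap would have to be filled once at init from whatever route table is in effect. A minimal sketch of that step, assuming 8 tiny classes; `RouteKindSketch` is a hypothetical stand-in, since the real `SmallRouteKind` values are not shown here:
 ```c
 #include <stdint.h>
 #include <stdbool.h>
 // Hypothetical route kinds; the real SmallRouteKind enum may differ.
 typedef enum { ROUTE_LEGACY = 0, ROUTE_ULTRA = 1, ROUTE_OTHER = 2 } RouteKindSketch;
 #define TINY_NUM_CLASSES 8
 // Pack "is this class on the LEGACY route?" into one bit per class.
 static inline uint8_t route_fastmap_build(const RouteKindSketch routes[TINY_NUM_CLASSES]) {
     uint8_t map = 0;
     for (int c = 0; c < TINY_NUM_CLASSES; c++) {
         if (routes[c] == ROUTE_LEGACY) map |= (uint8_t)(1u << c);
     }
     return map;
 }
 // Single shift-and-mask lookup on the hot path.
 static inline bool route_fastmap_is_legacy(uint8_t map, int class_idx) {
     return ((map >> class_idx) & 1u) != 0;
 }
 ```
 With C0-C3 on LEGACY and C7 on ULTRA, the built map is `0x0F`, so the lookup for C7 returns false and falls through to the ULTRA path.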
 ---
 ## 4. Combined Impact Estimate
 The combined row sums the per-candidate estimates and scales the total by 80%, a conservative allowance for overlap between candidates:
 | Candidate | Instructions/op | Branches/op | Throughput |
 |-----------|-----------------|-------------|------------|
 | Baseline | 209.09 | 52.33 | 44.88M ops/s |
 | **A: Remove FastLane layer** | -17.5 | -6.0 | +12% |
 | **B: ENV snapshot consolidation** | -10.0 | -4.0 | +6% |
 | **C: Stats removal (Release)** | -5.0 | -2.5 | +4% |
 | **D: Inline header validation** | -4.0 | -1.5 | +2% |
 | **E: Static route fast path** | -3.5 | -1.5 | +2% |
 | **Combined (80% efficiency)** | **-32.0** | **-12.4** | **+21%** |
 **Projected outcome**:
 - Instructions/op: 209.09 → **177.09** (vs libc 135.92, gap reduced from +53.8% to +30.3%)
 - Branches/op: 52.33 → **39.93** (vs libc 22.93, gap reduced from +128.2% to +74.1%)
 - Throughput: 44.88M → **54.3M ops/s** (vs libc 77.62M, gap reduced from +73.0% to +43.0%)
 **Achievement vs Goal**: ✓ Within target range (+21% projected vs +15-25% goal)
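 The combined row and the projected gaps follow from simple arithmetic on the table values. A small sketch that reproduces it, with the 80% overlap factor taken from the table; the struct and function names are illustrative:
 ```c
 #include <stddef.h>
 typedef struct {
     double instr_per_op;     // instructions/op removed
     double branches_per_op;  // branches/op removed
     double throughput_pct;   // throughput gain in percent
 } Reduction;
 // Candidates A-E, in table order.
 static const Reduction CANDIDATES[] = {
     { 17.5, 6.0, 12.0 },
     { 10.0, 4.0,  6.0 },
     {  5.0, 2.5,  4.0 },
     {  4.0, 1.5,  2.0 },
     {  3.5, 1.5,  2.0 },
 };
 #define OVERLAP_EFFICIENCY 0.80
 // Sum all candidates, then scale by the overlap factor.
 static inline Reduction combined_reduction(void) {
     Reduction sum = { 0.0, 0.0, 0.0 };
     for (size_t i = 0; i < sizeof CANDIDATES / sizeof CANDIDATES[0]; i++) {
         sum.instr_per_op    += CANDIDATES[i].instr_per_op;
         sum.branches_per_op += CANDIDATES[i].branches_per_op;
         sum.throughput_pct  += CANDIDATES[i].throughput_pct;
     }
     sum.instr_per_op    *= OVERLAP_EFFICIENCY;
     sum.branches_per_op *= OVERLAP_EFFICIENCY;
     sum.throughput_pct  *= OVERLAP_EFFICIENCY;
     return sum;
 }
 // Remaining gap of hakmem over libc in percent.
 static inline double gap_vs_libc_pct(double hakmem, double libc) {
     return 100.0 * (hakmem / libc - 1.0);
 }
 ```
 The raw sums are 40.0 instructions/op, 15.5 branches/op, and 26% throughput; scaling by 0.80 gives 32.0, 12.4, and 20.8% (rounded to +21% in the table).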
 ---
 ## 5. Implementation Plan
 ### Phase 19-1: Remove FastLane Wrapper Layer (A)
 **Priority**: P0 (highest ROI)
 **Effort**: 2-3 hours
 **Risk**: Low (free_tiny_fast already complete)
 Steps:
 1. Modify `free()` wrapper to directly call `free_tiny_fast(ptr)`
 2. Modify `malloc()` wrapper to directly call `malloc_tiny_fast(size)`
 3. Measure: Expect +10-15% throughput
 4. Fallback: Keep FastLane as compile-time option
 ### Phase 19-2: ENV Snapshot Consolidation (B)
 **Priority**: P1 (high ROI, moderate risk)
 **Effort**: 4-6 hours
 **Risk**: Medium (ENV invalidation needed)
 Steps:
 1. Create `FastLaneCtx` struct with pre-computed ENV state
 2. Add TLS cache with invalidation hook
 3. Replace scattered ENV checks with single context read
 4. Measure: Expect +5-8% throughput on top of Phase 19-1
 5. Fallback: ENV-gate new path (HAKMEM_FASTLANE_ENV_CTX=1)
 ### Phase 19-3: Stats Removal (C) + Header Inline (D)
 **Priority**: P2 (medium ROI, low risk)
 **Effort**: 2-3 hours
 **Risk**: Low (already compile-time optional)
 Steps:
 1. Make stats sample-based (1-in-4096) in non-Release builds and compile them out entirely in Release builds
 2. Add `free_tiny_fast_trusted()` variant (skip header validation)
 3. Measure: Expect +3-5% throughput on top of Phase 19-2
 4. Fallback: Compile-time flags for both features
 ### Phase 19-4: Static Route Fast Path (E)
 **Priority**: P3 (lower ROI, polish)
 **Effort**: 2-3 hours
 **Risk**: Low (route table is static)
 Steps:
 1. Add `g_route_fastmap` TLS cache
 2. Replace function calls with bit tests
 3. Measure: Expect +2-3% throughput on top of Phase 19-3
 4. Fallback: Keep existing path as fallback
 ---
 ## 6. Box Theory Compliance
 ### Boundary Preservation
 - **L0 (ENV)**: Keep existing ENV gates, add new ones for each optimization
 - **L1 (Hot inline)**: free_tiny_fast(), malloc_tiny_fast() remain unchanged
 - **L2 (Cold fallback)**: free_cold(), malloc_cold() remain unchanged
 - **L3 (Stats)**: Make optional via #if guards
 ### Reversibility
 - Each phase is ENV-gated (can revert at runtime)
 - Compile-time fallback preserved (HAKMEM_BUILD_RELEASE controls stats)
 - FastLane layer can be kept as compile-time option for A/B testing
 ### Incremental Rollout
 - Phase 19-1: Remove wrapper (default ON)
 - Phase 19-2: ENV context (default OFF, opt-in for testing)
 - Phase 19-3: Stats/header (default ON in Release, OFF in Debug)
 - Phase 19-4: Route fast path (default ON)
 ---
 ## 7. Validation Checklist
 After each phase:
 - [ ] Run perf stat (compare instructions/branches/cycles per-op)
 - [ ] Run perf record + annotate (verify hot path reduction)
 - [ ] Run benchmark suite (Mixed, C6-heavy, C7-heavy)
 - [ ] Check correctness (Larson, multithreaded, stress tests)
 - [ ] Measure RSS/memory overhead (should be unchanged)
 - [ ] A/B test (ENV toggle to verify reversibility)
 Success criteria:
 - [ ] Throughput improvement matches estimate (±20%)
 - [ ] Instruction count reduction matches estimate (±20%)
 - [ ] Branch count reduction matches estimate (±20%)
 - [ ] No correctness regressions (all tests pass)
 - [ ] No memory overhead increase (RSS unchanged)
 ---
 ## 8. Risk Assessment
 ### High-Risk Areas
 1. **ENV invalidation** (Phase 19-2): Runtime ENV changes could break cached context
   - Mitigation: Use invalidation hooks (existing hakmem_env_snapshot infrastructure)
   - Fallback: Revert to scattered ENV checks
 2. **Header validation trust** (Phase 19-3D): Skipping validation could miss corruption
   - Mitigation: Keep validation in Debug builds, extensive testing
   - Fallback: Compile-time option to keep duplicate checks
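 The "keep validation in Debug builds" mitigation could look roughly like the following. This is only a sketch: the function name is hypothetical, the return value is a placeholder for the real route dispatch, and `HAKMEM_BUILD_RELEASE` is assumed to be 0 or undefined in debug builds.
 ```c
 #include <assert.h>
 #include <stdint.h>
 // Hypothetical trusted entry point. Release builds skip the header check;
 // debug builds re-validate so a caller that forgot to validate is caught early.
 static inline int free_tiny_fast_trusted_sketch(void* ptr, int class_idx) {
 #if !HAKMEM_BUILD_RELEASE
     uint8_t header = *((uint8_t*)ptr - 1);
     assert((header & 0xF0u) == 0xA0u);                 // magic must match
     assert((int)(header & 0x0Fu) == class_idx);        // class must match caller's
 #endif
     (void)ptr;
     return class_idx;   // placeholder for the real route dispatch
 }
 ```
 This keeps the duplicate check out of the Release hot path while preserving a corruption tripwire in every debug and CI run.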
 ### Medium-Risk Areas
 1. **FastLane removal** (Phase 19-1): Could break gradual rollout (class_mask filtering)
   - Mitigation: Keep class_mask filtering in FastLane path only (direct path always falls back safely)
   - Fallback: Keep FastLane as compile-time option
 ### Low-Risk Areas
 1. **Stats removal** (Phase 19-3C): Already compile-time optional
 2. **Route fast path** (Phase 19-4): Route table is static, no runtime changes
 ---
 ## 9. Future Optimization Opportunities (Post-Phase 19)
 After Phase 19 closes the wrapper gap, next targets:
 1. **Unified Cache optimization** (4.44% cycles):
   - Reduce cache miss overhead (refill path)
   - Optimize LIFO vs ring buffer trade-off
 2. **Header finalization** (4.34% cycles):
   - Investigate always_inline for tiny_header_finalize_alloc()
   - Reduce metadata writes (defer to batch update)
 3. **C7 ULTRA optimization** (3.38% + 2.07% = 5.45% cycles):
   - Investigate TLS cache locality
   - Reduce ULTRA push/pop overhead
 4. **Super lookup optimization** (0.98% cycles):
   - Already optimized in Phase 12 (mask-based)
   - Further reduction may require architectural changes
 **Estimated ceiling**: With all optimizations applied, throughput could approach ~65-70M ops/s (vs libc 77.62M)
 **Remaining gap**: Likely due to fundamental architectural differences (thread-local vs global allocator)
 ---
 ## 10. Appendix: Detailed perf Data
 ### 10.1 perf stat Results (200M ops)
 **hakmem (FORCE_LIBC=0)**:
 ```
 Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=0':
    19,296,118,430  cycles
    41,817,886,925  instructions              #  2.17  insn per cycle
    10,466,190,806  branches
       232,592,257  branch-misses             #  2.22% of all branches
         1,660,073  cache-misses
           134,601  L1-icache-load-misses
       4.913685503 seconds time elapsed
 Throughput: 44.88M ops/s
 ```
 **libc (FORCE_LIBC=1)**:
 ```
 Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=1':
    10,937,550,228  cycles
    27,183,469,339  instructions              #  2.49  insn per cycle
     4,586,617,379  branches
       131,515,905  branch-misses             #  2.87% of all branches
           767,370  cache-misses
            64,102  L1-icache-load-misses
       2.835174452 seconds time elapsed
 Throughput: 77.62M ops/s
 ```
 ### 10.2 Top 30 Hot Functions (perf report)
 ```
    23.97%  front_fastlane_try_free.lto_priv.0
    23.84%  malloc
    22.02%  main
     6.82%  free
     4.44%  unified_cache_push.lto_priv.0
     4.34%  tiny_header_finalize_alloc.lto_priv.0
     3.38%  tiny_c7_ultra_alloc.constprop.0
     2.07%  tiny_c7_ultra_free
     1.22%  hakmem_env_snapshot_enabled.lto_priv.0
     0.98%  hak_super_lookup.part.0.lto_priv.4.lto_priv.0
     0.85%  hakmem_env_snapshot.lto_priv.0
     0.82%  hak_pool_free_v1_slow_impl
     0.59%  tiny_front_v3_snapshot_get.lto_priv.0
     0.30%  __memset_avx2_unaligned_erms (libc)
     0.30%  tiny_unified_lifo_enabled.lto_priv.0
     0.28%  hak_free_at.constprop.0
     0.24%  hak_pool_try_alloc.part.0
     0.24%  malloc_cold
     0.16%  hak_pool_try_alloc_v1_impl.part.0
     0.14%  free_cold.constprop.0
     0.13%  mid_inuse_dec_deferred
     0.12%  hak_pool_mid_lookup
     0.12%  do_user_addr_fault (kernel)
     0.11%  handle_pte_fault (kernel)
     0.11%  __mod_memcg_lruvec_state (kernel)
     0.10%  do_anonymous_page (kernel)
     0.09%  classify_ptr
     0.07%  tiny_get_max_size.lto_priv.0
     0.06%  __handle_mm_fault (kernel)
     0.06%  __alloc_pages (kernel)
 ```
 ---
 ## 11. Conclusion
 Phase 19 has **clear, actionable targets** with high ROI:
 1. **Immediate action (Phase 19-1)**: Remove FastLane wrapper layer
   - Expected: +10-15% throughput
   - Risk: Low
   - Effort: 2-3 hours
 2. **Follow-up (Phases 19-2 to 19-4)**: ENV consolidation + stats + route optimization
   - Expected: +6-11% additional throughput
   - Risk: Medium (ENV invalidation)
   - Effort: 8-12 hours
 **Combined target**: +21% throughput (44.88M → 54.3M ops/s)
 **Gap closure**: Reduce instruction gap from +53.8% to +30.3% vs libc
 This narrows the throughput gap to libc from +73.0% to a projected +43.0% while maintaining safety and Box Theory compliance.
--- a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md
@ -0,0 +1,64 @@
 # Phase 19-2: FASTLANE_DIRECT Promotion + Rebaseline (Next Instructions)
 ## 0. Status (where we are)
 - Phase 19-1b (FASTLANE_DIRECT) is **GO**: throughput **+5.88%** with **-15.23% instr/op** and **-19.36% branches/op**.
 - Safety hardening completed:
  - `!g_initialized` → direct path is skipped (fail-fast, same rule as Front FastLane).
  - malloc miss no longer calls `malloc_cold()` directly; it falls through to the normal wrapper path (preserves `g_hakmem_lock_depth` invariants).
  - ENV cache is a single global `_Atomic` so `bench_profile` refresh affects wrappers.
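 The three hardening rules above can be pictured together in one small sketch. Everything below is illustrative: the `_sketch` functions are stand-ins for the real allocator paths, and parsing the variable as a leading `'1'` is an assumption about how `HAKMEM_FASTLANE_DIRECT` is read.
 ```c
 #include <stdatomic.h>
 #include <stddef.h>
 #include <stdlib.h>
 // Cached HAKMEM_FASTLANE_DIRECT: -1 = not read yet, 0 = off, 1 = on.
 // One process-wide atomic (not TLS), so a single refresh is seen by all threads.
 static _Atomic int g_direct_gate = -1;
 static int g_initialized_sketch = 1;          // stand-in for g_initialized
 static inline int direct_gate_read_env(void) {
     const char* v = getenv("HAKMEM_FASTLANE_DIRECT");
     return (v != NULL && v[0] == '1') ? 1 : 0;
 }
 // Re-read the environment, e.g. after a profile preset changed it.
 static inline void direct_gate_refresh(void) {
     atomic_store_explicit(&g_direct_gate, direct_gate_read_env(), memory_order_relaxed);
 }
 static inline int direct_gate_enabled(void) {
     int g = atomic_load_explicit(&g_direct_gate, memory_order_relaxed);
     if (__builtin_expect(g < 0, 0)) {         // first use: lazy init
         g = direct_gate_read_env();
         atomic_store_explicit(&g_direct_gate, g, memory_order_relaxed);
     }
     return g;
 }
 // Stand-ins for the tiny fast path and the regular wrapper path.
 static unsigned char g_slab_sketch[64];
 static void* malloc_tiny_fast_sketch(size_t n) {
     return n <= sizeof g_slab_sketch ? (void*)g_slab_sketch : NULL;
 }
 static void* malloc_wrapper_path_sketch(size_t n) { return malloc(n); }
 static void* malloc_entry_sketch(size_t n) {
     // Rule 1: skip the direct path until the allocator is initialized.
     if (g_initialized_sketch && direct_gate_enabled()) {
         void* p = malloc_tiny_fast_sketch(n);
         if (p != NULL) return p;              // hit
     }
     // Rule 2: on a miss, fall through to the normal wrapper path
     // instead of jumping straight to a cold function.
     return malloc_wrapper_path_sketch(n);
 }
 ```
 Because the gate is a single global, a call to `direct_gate_refresh()` after the profile is applied is enough for the wrappers to pick up the new setting.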
 ## 1. Promotion policy (Box Theory)
 - Keep rollback simple:
  - `HAKMEM_FASTLANE_DIRECT=0` → disable (fallback to Phase 6 FastLane wrapper path).
  - `HAKMEM_FASTLANE_DIRECT=1` → enable (direct `malloc_tiny_fast()` / `free_tiny_fast()` first).
 - Promotion level:
  - **Preset promotion** (recommended): set `HAKMEM_FASTLANE_DIRECT=1` in `MIXED_TINYV3_C7_SAFE` and `C6_HEAVY_LEGACY_POOLV1` presets.
  - Keep **ENV default = 0** (opt-in) until real-world/LD_PRELOAD validation is done.
 ## 2. Required verification (same-binary A/B)
 ### 2.1 Mixed (10-run, clean env)
 Baseline:
 ```sh
 HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
 ```
 Optimized:
 ```sh
 HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
 ```
 GO/NO-GO:
 - GO: mean **≥ +1.0%**
 - NEUTRAL: strictly within **±1.0%** → keep as preset-only (do not flip global default)
 - NO-GO: mean **≤ -1.0%** → revert preset promotion
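 The verdict rule is easy to encode so that every A/B run is judged the same way. A minimal sketch, assuming the delta is computed from the mean ops/s of each arm and that the boundary values ±1.0% count as GO and NO-GO respectively; all names are hypothetical:
 ```c
 #include <stddef.h>
 typedef enum { VERDICT_NO_GO = -1, VERDICT_NEUTRAL = 0, VERDICT_GO = 1 } Verdict;
 // Arithmetic mean of the per-run throughput values.
 static inline double mean_ops(const double* runs, size_t n) {
     double sum = 0.0;
     for (size_t i = 0; i < n; i++) sum += runs[i];
     return n ? sum / (double)n : 0.0;
 }
 // Relative change of the optimized arm over the baseline, in percent.
 static inline double delta_pct(double baseline_mean, double optimized_mean) {
     return 100.0 * (optimized_mean / baseline_mean - 1.0);
 }
 // Apply the thresholds: >= +1.0% GO, <= -1.0% NO-GO, otherwise NEUTRAL.
 static inline Verdict classify_delta(double d) {
     if (d >= 1.0)  return VERDICT_GO;
     if (d <= -1.0) return VERDICT_NO_GO;
     return VERDICT_NEUTRAL;
 }
 ```
 With the Phase 19-1b result reported above, `classify_delta(5.88)` yields `VERDICT_GO`.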
 ### 2.2 C6-heavy (5-run)
 ```sh
 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=0 ./bench_mid_large_mt_hakmem 1 1000000 400 1
 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=1 ./bench_mid_large_mt_hakmem 1 1000000 400 1
 ```
 ## 3. Perf stat capture (root-cause guardrails)
 Run the command below once per A/B arm (`HAKMEM_FASTLANE_DIRECT=0`, then `HAKMEM_FASTLANE_DIRECT=1`):
 ```sh
 perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses,iTLB-load-misses,dTLB-load-misses -- \
  ./bench_random_mixed_hakmem 200000000 400 1
 ```
 Checklist:
 - `instructions/op` and `branches/op` must improve (expected)
 - iTLB/dTLB misses may worsen; accept only if throughput still improves
 ## 4. Next target selection (after promotion)
 After Phase 19-2 is stable, re-run `perf record` on Mixed and choose the next box by **self% ≥ 5%**:
 - If `unified_cache_push/pop` rises: focus on **UnifiedCache data-path** (touch fewer cache lines).
 - If `tiny_header_finalize_alloc` rises: focus on **header finalize path** (but treat as high NO-GO risk; prior header work was often NEUTRAL).
 - If ENV checks reappear in hot path: consider **Phase 19-3 (ENV check consolidation)**, but keep it in a separate research box.
--- a/hakmem.d
+++ b/hakmem.d
@ -178,7 +178,8 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
 core/box/front_fastlane_env_box.h core/box/front_fastlane_stats_box.h \
 core/box/front_fastlane_alloc_legacy_direct_env_box.h \
 core/box/tiny_front_hot_box.h core/box/tiny_front_cold_box.h \
- core/box/smallobject_policy_v7_box.h core/box/../hakmem_internal.h
+ core/box/smallobject_policy_v7_box.h core/box/fastlane_direct_env_box.h \
 core/box/../hakmem_internal.h
 core/hakmem.h:
 core/hakmem_build_flags.h:
 core/hakmem_config.h:
@ -441,4 +442,5 @@ core/box/front_fastlane_alloc_legacy_direct_env_box.h:
 core/box/tiny_front_hot_box.h:
 core/box/tiny_front_cold_box.h:
 core/box/smallobject_policy_v7_box.h:
 core/box/fastlane_direct_env_box.h:
 core/box/../hakmem_internal.h:
--- a/perf.data.phase19_hakmem
+++ b/perf.data.phase19_hakmem
--- a/scripts/run_mixed_10_cleanenv.sh
+++ b/scripts/run_mixed_10_cleanenv.sh
@ -18,6 +18,8 @@ export HAKMEM_MALLOC_TINY_DIRECT=${HAKMEM_MALLOC_TINY_DIRECT:-0}
 export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=${HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT:-0}
 export HAKMEM_FORCE_LIBC_ALLOC=${HAKMEM_FORCE_LIBC_ALLOC:-0}
 export HAKMEM_ENV_SNAPSHOT_SHAPE=${HAKMEM_ENV_SNAPSHOT_SHAPE:-0}
 # NOTE: Phase 19-1b is promoted in presets. Keep cleanenv aligned by default.
 export HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
 # NOTE: Phase 9/10 are promoted (bench_profile defaults to 1). Keep cleanenv aligned by default.
 export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-1}
 export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-1}