Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)

## Phase 17 v2: FORCE_LIBC Gap Validation Fix

**Critical bug fix**: the Phase 17 v1 measurement was broken

**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 only took effect after FastLane, so the
same-binary A/B was effectively "hakmem vs hakmem" (the +0.39% result was a mismeasurement)

**Fix**: added an early bypass on g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171 and :645,
which goes straight to __libc_malloc/__libc_free before anything else
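
The early bypass can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in hak_wrappers.inc.h: `g_force_libc_alloc` is the flag named in this commit, while `wrapper_malloc_sketch`, `libc_alloc_stub`, `fastlane_alloc_stub`, and the `g_fastlane_calls` counter are hypothetical stand-ins (the real bypass calls `__libc_malloc`/`__libc_free`).

```c
#include <stddef.h>
#include <stdlib.h>

/* Flag named in the commit; modeled here as a plain int (assumption). */
static int g_force_libc_alloc = 0;

/* Hypothetical counter so the sketch can show which path ran. */
static int g_fastlane_calls = 0;

/* Stand-in for __libc_malloc (the real code calls glibc directly). */
static void *libc_alloc_stub(size_t n) { return malloc(n); }

/* Stand-in for the FastLane/hakmem allocation path. */
static void *fastlane_alloc_stub(size_t n) {
    g_fastlane_calls++;
    return malloc(n);
}

/* The FORCE_LIBC check comes first, before any FastLane logic,
 * so FORCE_LIBC=1 measures libc rather than "hakmem vs hakmem". */
void *wrapper_malloc_sketch(size_t n) {
    if (g_force_libc_alloc == 1) {
        return libc_alloc_stub(n);
    }
    return fastlane_alloc_stub(n);
}
```

With the flag set, the FastLane stub is never entered, which is the property the v1 measurement lacked.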

**Result**: a correct same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)

**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)

**Conclusion**: Case A confirmed (allocator dominant, NOT layout).
The Case B verdict from Phase 17 v1 was wrong.
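
As a worked check of the gap breakdown, the percentages follow from the throughput figures in the result list. The helper name `pct_gain` is an assumption made for this sketch.

```c
/* Relative gain of `after` over `before`, in percent. */
double pct_gain(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

`pct_gain(48.99, 79.72)` gives about 62.7 (hakmem to libc in the same binary) and `pct_gain(79.72, 88.06)` gives about 10.5 (libc in the hakmem binary to the system binary). The two steps compound, so the system binary is roughly +80% over hakmem.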

Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)

---

## Phase 19: FastLane Instruction Reduction Analysis

**Goal**: reduce the instruction gap to libc (libc: -35% instructions, -56% branches relative to hakmem)

**perf stat analysis** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)

**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**

**Reduction candidates**:
- A: Remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate ENV snapshots (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)

Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md

---

## Phase 19-1b: FastLane Direct — GO (+5.88%)

**Strategy**: bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()

**Why Phase 19-1 was NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired (it made the A/B unfair)
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)

**Phase 19-1b fixes**:
1. Removed __builtin_expect()
2. Call free_tiny_fast() directly
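
To illustrate the first fix: `__builtin_expect(x, 0)` is a GCC/Clang builtin that tells the compiler the condition is expected to be false, so the guarded path may be laid out as the cold side of the branch. The sketch below contrasts the hinted and plain forms. The gate variable and the `route_*` function names are hypothetical; only `__builtin_expect` and the `<stdatomic.h>` calls are real APIs.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the gate read by fastlane_direct_enabled(). */
static _Atomic int g_fastlane_direct_sketch = 0;

static inline int gate_enabled_sketch(void) {
    return atomic_load_explicit(&g_fastlane_direct_sketch,
                                memory_order_relaxed);
}

/* Phase 19-1 style: the hint marks the direct path as unlikely, so it may
 * be treated as cold code even in runs where the gate is switched on. */
int route_with_hint(void) {
    if (__builtin_expect(gate_enabled_sketch(), 0)) return 1; /* direct   */
    return 0;                                                  /* FastLane */
}

/* Phase 19-1b style: no hint, so neither setting is favored by layout. */
int route_plain(void) {
    if (gate_enabled_sketch()) return 1;
    return 0;
}
```

Both functions return the same values for either gate state. The hint changes only code layout and branch weighting, which is why it can skew an A/B in which the "unlikely" side is the one being measured.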

**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)

**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)

**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs

**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
   - HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
   - Single _Atomic global (resolves the wrapper cache problem)

2. **Wrapper changes**: core/box/hak_wrappers.inc.h
   - malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
   - free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1
   - Safety: the direct path is not used while !g_initialized; the fallback is kept

3. **Preset promotion**: core/bench_profile.h:88
   - bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
   - Comment: +5.88% proven on Mixed, 10-run

4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
   - HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
   - Promoted the same way as Phase 9/10
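
The ENV gate in item 1 can be sketched as one atomic global that a refresh function writes and the wrappers read. This is an illustration only: the `_sketch` names and the parsing rule ("1" enables, anything else keeps the default of 0) are assumptions, and the actual contents of fastlane_direct_env_box.{h,c} are not shown in this commit message.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* One global shared by the refresh function and the wrappers
 * (hypothetical name). */
static _Atomic int g_fastlane_direct_gate = 0;

/* Assumed rule: exactly "1" enables; unset or anything else means 0. */
int fastlane_direct_parse_sketch(const char *value) {
    return (value != NULL && value[0] == '1' && value[1] == '\0') ? 1 : 0;
}

/* Re-read HAKMEM_FASTLANE_DIRECT and publish it to the single global. */
void fastlane_direct_refresh_sketch(void) {
    int on = fastlane_direct_parse_sketch(getenv("HAKMEM_FASTLANE_DIRECT"));
    atomic_store_explicit(&g_fastlane_direct_gate, on, memory_order_relaxed);
}

/* Hot-path read: one relaxed load, no getenv() call. */
int fastlane_direct_enabled_sketch(void) {
    return atomic_load_explicit(&g_fastlane_direct_gate,
                                memory_order_relaxed);
}
```

Because the reader and the refresher share the same object, a refresh after the environment changes is visible to the wrapper, rather than each side holding its own cached copy.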

**Verdict**: GO (adopted on the main line; preset promotion complete)

**Rollback**: set HAKMEM_FASTLANE_DIRECT=0 to return to the existing FastLane path

Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md

---

## Cumulative Performance

- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**

Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.1%** (was +62.7% before Phase 19-1b)

Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-15 11:28:40 +09:00
parent bc2c5ded76
commit ec87025da6
14 changed files with 1213 additions and 60 deletions


@ -1,5 +1,117 @@
# Main-line tasks (current)
## Update memo (2025-12-15, Phase 19-1b FASTLANE-DIRECT-1B)
### Phase 19-1b FASTLANE-DIRECT-1B: FastLane Direct (Revised) — ✅ GO (+5.88%)
**Result**: the revised version of Phase 19-1 succeeded. Removing __builtin_expect() and calling free_tiny_fast() directly achieved **+5.88%** throughput.
**A/B Test Results**:
- Baseline: 49.17M ops/s (FASTLANE_DIRECT=0)
- Optimized: 52.06M ops/s (FASTLANE_DIRECT=1)
- Delta: **+5.88%** (GO verdict, clears the +5% target)
**perf stat Analysis** (200M ops):
- Instructions: **-15.23%** (199.90 → 169.45/op, a reduction of 30.45)
- Branches: **-19.36%** (51.49 → 41.52/op, a reduction of 9.97)
- Cycles: **-5.07%** (88.88 → 84.37/op)
- I-cache misses: -11.79% (Good)
- iTLB misses: +41.46% (Bad, but overall gain wins)
- dTLB misses: +29.15% (Bad, but overall gain wins)
**Root cause identified**:
1. Cause of the Phase 19-1 NO-GO: `__builtin_expect(fastlane_direct_enabled(), 0)` backfired
2. `free_tiny_fast()` is the winning path over `free_tiny_fast_hot()` (the unified cache winner)
3. The fix reduces wrapper overhead → a large drop in instructions and branches
**Changes**:
- File: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h`
- malloc: `__builtin_expect(fastlane_direct_enabled(), 0)` → `fastlane_direct_enabled()`
- free: `free_tiny_fast_hot()` → `free_tiny_fast()` (switched to the winning path)
- Safety: while `!g_initialized`, the direct path is not used and the call falls back to the existing path (same fail-fast as FastLane)
- Safety: on a malloc miss, `malloc_cold()` is not called directly; the call drops to the existing wrapper path (this preserves the lock_depth assumption)
- ENV cache: consolidated into a single global so that `fastlane_direct_env_refresh_from_env()` updates the same `_Atomic` that the wrapper reads
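
The two safety rules above can be sketched as a dispatch function. This is a minimal sketch under stated assumptions: `g_initialized` is the flag named in the memo, while `tiny_fast_stub` (standing in for `malloc_tiny_fast()`), `wrapper_path_stub`, `g_direct_enabled`, the call counter, and the 64-byte hit rule are all hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

static int g_initialized = 0;        /* named in the memo; plain int here */
static int g_direct_enabled = 1;     /* stands in for fastlane_direct_enabled() */
static int g_wrapper_path_calls = 0; /* hypothetical counter for the sketch */

/* Stand-in for malloc_tiny_fast(): returns NULL on a miss.
 * Assumption for this sketch only: sizes up to 64 bytes hit. */
static void *tiny_fast_stub(size_t n) {
    return (n <= 64) ? malloc(n) : NULL;
}

/* Stand-in for the existing wrapper path (FastLane and what follows it). */
static void *wrapper_path_stub(size_t n) {
    g_wrapper_path_calls++;
    return malloc(n);
}

void *malloc_direct_sketch(size_t n) {
    /* Rule 1: never take the direct path before initialization. */
    if (g_initialized && g_direct_enabled) {
        void *p = tiny_fast_stub(n);
        if (p != NULL) return p;
        /* Rule 2: on a miss, fall through to the existing wrapper path
         * instead of jumping straight into a cold routine. */
    }
    return wrapper_path_stub(n);
}
```

The direct path is thus purely an early exit: every case it does not handle reaches the same code it would have reached with HAKMEM_FASTLANE_DIRECT=0.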
**Next**: Phase 19-1b is adopted on the main line. Operate with ENV `HAKMEM_FASTLANE_DIRECT=1`.
---
## Previous task (Phase 19 FASTLANE-INSTRUCTION-REDUCTION-1)
### Phase 19 FASTLANE-INSTRUCTION-REDUCTION-1: FastLane Instruction Reduction v1 — 📊 ANALYSIS COMPLETE
Result: perf stat/record analysis identified **the root of the gap to libc**. The design document is complete.
- Design: `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md`
- perf data: saved (perf_stat_hakmem.txt, perf_stat_libc.txt, perf.data.phase19_hakmem)
### Gap Analysis (200M ops baseline)
**Per-operation overhead** (hakmem vs libc):
- Instructions/op: **209.09 vs 135.92** (+73.17, **+53.8%**)
- Branches/op: **52.33 vs 22.93** (+29.40, **+128.2%**)
- Cycles/op: **96.48 vs 54.69** (+41.79, +76.4%)
- Throughput: **44.88M vs 77.62M ops/s** (+73.0% gap)
**Critical finding**: hakmem executes **73 extra instructions** and **29 extra branches** per op. This is the entire cause of the throughput gap.
### Hot Path Breakdown (perf report)
Top wrapper overhead (合計 ~55% of cycles):
- `front_fastlane_try_free`: **23.97%**
- `malloc`: **23.84%**
- `free`: **6.82%**
The wrapper layer consumes more than half of all cycles (double validation, ENV checks, class mask checks, and so on).
### Reduction Candidates (in priority order)
1. **Candidate A: Remove the FastLane Wrapper Layer** (highest ROI)
- Impact: **-17.5 instructions/op, -6.0 branches/op** (+10-15% throughput)
- Risk: **LOW** (free_tiny_fast_hot already exists)
- Reason: eliminates double header validation and ENV checks
2. **Candidate B: Consolidate ENV Snapshots** (high ROI)
- Impact: **-10.0 instructions/op, -4.0 branches/op** (+5-8% throughput)
- Risk: **MEDIUM** (ENV invalidation must be handled)
- Reason: consolidates 3+ ENV checks into one
3. **Candidate C: Remove Stats Counters** (medium ROI)
- Impact: **-5.0 instructions/op, -2.5 branches/op** (+3-5% throughput)
- Risk: **LOW** (compile-time optional)
- Reason: eliminates atomic increment overhead
4. **Candidate D: Header Validation Inline** (medium ROI)
- Impact: **-4.0 instructions/op, -1.5 branches/op** (+2-3% throughput)
- Risk: **MEDIUM** (assumes the caller has already validated)
- Reason: eliminates the double header load
5. **Candidate E: Static Route Fast Path** (lower ROI)
- Impact: **-3.5 instructions/op, -1.5 branches/op** (+2-3% throughput)
- Risk: **LOW** (the route table is static)
- Reason: replaces a function call with a bit test
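
As an illustration of the Candidate E idea, a static route table can be folded into a bitmask so that the route decision becomes a shift and a mask instead of a function call. Everything here is hypothetical: the mask value, the class numbering, and the function name are chosen for the sketch and are not taken from hakmem.

```c
#include <stdint.h>

/* Hypothetical static route table: bit i set means size class i takes the
 * fast route. The value below (classes 0..7) is an illustration only. */
static const uint32_t k_fast_route_mask = 0x000000FFu;

/* A shift-and-mask in place of a call into a route lookup function. */
static inline int route_is_fast_sketch(unsigned cls) {
    return (cls < 32u) && ((k_fast_route_mask >> cls) & 1u);
}
```

Since the mask is a compile-time constant, the compiler can inline the test into the caller, which is where the estimated instruction and branch savings would come from.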
**Combined estimate** (80% efficiency):
- Instructions/op: 209.09 → **177.09** (gap: +53.8% → +30.3%)
- Branches/op: 52.33 → **39.93** (gap: +128.2% → +74.1%)
- Throughput: 44.88M → **54.3M ops/s** (+21%, **meets the +15-25% target**)
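
The combined estimate can be reproduced from the per-candidate figures: sum the savings for A through E, scale by the 80% efficiency factor, and subtract from the baseline. The function and array names below are assumptions made for this worked example.

```c
/* Per-candidate savings (A..E) taken from the candidate list. */
static const double k_inst_saved[5]   = {17.5, 10.0, 5.0, 4.0, 3.5};
static const double k_branch_saved[5] = {6.0, 4.0, 2.5, 1.5, 1.5};

/* Subtract `efficiency` times the summed savings from a per-op baseline. */
double combined_estimate(double baseline, const double saved[5],
                         double efficiency) {
    double total = 0.0;
    for (int i = 0; i < 5; i++) {
        total += saved[i];
    }
    return baseline - efficiency * total;
}

/* 209.09 - 0.8 * 40.0 instructions/op */
double inst_estimate(void) {
    return combined_estimate(209.09, k_inst_saved, 0.8);
}

/* 52.33 - 0.8 * 15.5 branches/op */
double branch_estimate(void) {
    return combined_estimate(52.33, k_branch_saved, 0.8);
}
```

This yields 177.09 instructions/op and 39.93 branches/op. Dividing by the libc figures (135.92 and 22.93) gives the remaining gaps of about +30.3% and +74.1%.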
### Implementation Plan
- **Phase 19-1** (P0): FastLane Wrapper removal (2-3h, +10-15%)
- **Phase 19-2** (P1): ENV Snapshot consolidation (4-6h, +5-8%)
- **Phase 19-3** (P2): Stats + Header Inline (2-3h, +3-5%)
- **Phase 19-4** (P3): Route Fast Path (2-3h, +2-3%)
### Next steps
1. Start the Phase 19-1 implementation (remove the FastLane layer, call free_tiny_fast_hot directly)
2. Verify the instruction/branch reduction with perf stat
3. Measure the throughput improvement with Mixed 10-run
4. Implement Phase 19-2 through 19-4 in order
---
## Update memo (2025-12-15, Phase 18 HOT-TEXT-ISOLATION-1)
### Phase 18 HOT-TEXT-ISOLATION-1: Hot Text Isolation v1 — ❌ NO-GO / FROZEN ### Phase 18 HOT-TEXT-ISOLATION-1: Hot Text Isolation v1 — ❌ NO-GO / FROZEN
@ -17,9 +129,9 @@
- Hot/cold attributes are not actually being applied (incomplete implementation)
Key findings:
- (removed) Reconfirms the Phase 17 conclusion: the bottleneck is **instruction count** and **memory latency**
- (removed) Code layout optimization cannot break through the 2.30 IPC wall
- (removed) Next move: on to Phase 18 v2 (BENCH_MINIMAL), which cuts instruction count directly
- (added) Phase 17 v2 (after the FORCE_LIBC fix): in the same-binary A/B, **libc is +62.7%** (≈1.63×) faster → the main cause of the gap is **allocator work** (not layout alone)
- (added) However, `bench_random_mixed_system` is a further **+10.5%** faster than `libc-in-hakmem-binary` → a wrapper/text-environment penalty also remains
- (added) Phase 18 v2 (BENCH_MINIMAL) is a valid direction for trimming "additive fixed costs", but roughly -5% instructions cannot close a +62% gap
## Update memo (2025-12-14, Phase 6 FRONT-FASTLANE-1)


@ -253,12 +253,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)
# Targets # Targets
TARGET = test_hakmem TARGET = test_hakmem
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o 
hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS = $(OBJS_BASE) OBJS = $(OBJS_BASE)
# Shared library # Shared library
SHARED_LIB = libhakmem.so SHARED_LIB = libhakmem.so
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o 
core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o 
core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/box/fastlane_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o 
hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o
# Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1) # Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
ifeq ($(POOL_TLS_PHASE1),1) ifeq ($(POOL_TLS_PHASE1),1)
@ -285,7 +285,7 @@ endif
# Benchmark targets # Benchmark targets
BENCH_HAKMEM = bench_allocators_hakmem BENCH_HAKMEM = bench_allocators_hakmem
BENCH_SYSTEM = bench_allocators_system BENCH_SYSTEM = bench_allocators_system
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o bench_allocators_hakmem.o BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o 
hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o 
core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o bench_allocators_hakmem.o
BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE) BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1) ifeq ($(POOL_TLS_PHASE1),1)
BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
@ -462,7 +462,7 @@ test-box-refactor: box-refactor
./larson_hakmem 10 8 128 1024 1 12345 4 ./larson_hakmem 10 8 128 1024 1 12345 4
# Phase 4: Tiny Pool benchmarks (properly linked with hakmem) # Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o 
core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o 
hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1)
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o

@ -14,6 +14,7 @@
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
#include "box/fastlane_direct_env_box.h" // fastlane_direct_env_refresh_from_env (Phase 19-1)
#endif
// Set the default value only when the env variable is unset
@ -84,6 +85,8 @@ static inline void bench_apply_profile(void) {
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
// Phase 19-1b: FastLane Direct (wrapper layer bypass, +5.88% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
// Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
// Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
@ -119,6 +122,8 @@ static inline void bench_apply_profile(void) {
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
// Phase 19-1b: FastLane Direct (wrapper layer bypass)
bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
// Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
} else if (strcmp(p, "C6_V7_STUB") == 0) {
@ -196,5 +201,7 @@ static inline void bench_apply_profile(void) {
tiny_unified_lifo_env_refresh_from_env();
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
// Phase 19-1: Sync FastLane Direct ENV cache after bench_profile putenv defaults.
fastlane_direct_env_refresh_from_env();
#endif
}

@ -0,0 +1,15 @@
// fastlane_direct_env_box.c - Phase 19-1: FastLane Direct Path ENV Control (implementation)
#include "fastlane_direct_env_box.h"
#include <stdlib.h>
#include <stdatomic.h>
_Atomic int g_fastlane_direct_enabled = -1;
// Refresh cached ENV flag from environment variable
// Called during benchmark ENV reloads to pick up runtime changes
void fastlane_direct_env_refresh_from_env(void) {
const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
int enable = (e && *e && *e != '0') ? 1 : 0;
atomic_store_explicit(&g_fastlane_direct_enabled, enable, memory_order_relaxed);
}

@ -0,0 +1,46 @@
// fastlane_direct_env_box.h - Phase 19-1: FastLane Direct Path ENV Control
//
// Goal: Remove wrapper layer overhead (30.79% of cycles) by calling core allocator directly
// Strategy: Compile-time + runtime gate to bypass front_fastlane_try_*() wrapper
//
// Box Theory:
// - Boundary: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
// - Rollback: ENV=0 reverts to existing FastLane wrapper path
// - Observability: perf stat shows instruction/branch reduction
//
// Expected Performance:
// - Reduction: -17.5 instructions/op, -6.0 branches/op
// - Impact: +10-15% throughput (remove 30% wrapper overhead)
//
// ENV Variables:
// HAKMEM_FASTLANE_DIRECT=0/1 # Enable direct path (default: 0, research box)
#pragma once
#include <stdatomic.h>
#include <stdlib.h>
// ENV control: cached flag for fastlane_direct_enabled()
// -1: uninitialized, 0: disabled, 1: enabled
// NOTE: Must be a single global (not header-static) so bench_profile refresh can
// update the same cache used by malloc/free wrappers.
extern _Atomic int g_fastlane_direct_enabled;
// Runtime check: Is FastLane Direct path enabled?
// Returns: 1 if enabled, 0 if disabled
// Hot path: Single atomic load (after first call)
static inline int fastlane_direct_enabled(void) {
int val = atomic_load_explicit(&g_fastlane_direct_enabled, memory_order_relaxed);
if (__builtin_expect(val == -1, 0)) {
// Cold path: Initialize from ENV
const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
int enable = (e && *e && *e != '0') ? 1 : 0;
atomic_store_explicit(&g_fastlane_direct_enabled, enable, memory_order_relaxed);
return enable;
}
return val;
}
// Refresh from ENV: Called during benchmark ENV reloads
// Allows runtime toggle without recompilation
void fastlane_direct_env_refresh_from_env(void);

@ -43,6 +43,7 @@ void* realloc(void* ptr, size_t size) {
#include "malloc_tiny_direct_env_box.h" // Phase 5 E5-4: Malloc Tiny direct path ENV gate
#include "malloc_tiny_direct_stats_box.h" // Phase 5 E5-4: Malloc Tiny direct path stats
#include "front_fastlane_box.h" // Phase 6: Front FastLane (Layer Collapse)
#include "fastlane_direct_env_box.h" // Phase 19-1: FastLane Direct Path (remove wrapper layer)
#include "../hakmem_internal.h" // AllocHeader helpers for diagnostics
#include "../hakmem_super_registry.h" // Superslab lookup for diagnostics
#include "../superslab/superslab_inline.h" // slab_index_for, capacity
@ -165,6 +166,14 @@ void* malloc(size_t size) {
#endif
// NDEBUG: malloc_count increment disabled - removes 27.55% bottleneck
// Force libc must override FastLane/hot wrapper paths.
// NOTE: Use the cached file-scope g_force_libc_alloc to avoid getenv recursion
// during early startup (before lock_depth is incremented).
if (__builtin_expect(g_force_libc_alloc == 1, 0)) {
extern void* __libc_malloc(size_t);
return __libc_malloc(size);
}
// Phase 20-2: BenchFast mode (structural ceiling measurement)
// WARNING: Bypasses ALL safety checks - benchmark only!
// IMPORTANT: Do NOT use BenchFast during preallocation/init to avoid recursion.
@ -176,6 +185,28 @@ void* malloc(size_t size) {
// Fallback to normal path for large allocations
}
// Phase 19-1b: FastLane Direct Path (bypass wrapper layer, revised)
// Strategy: Direct call to malloc_tiny_fast() (remove wrapper overhead; miss falls through)
// Expected: -17.5 instructions/op, -6.0 branches/op, +10-15% throughput
// ENV: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
// Phase 19-1b changes:
// 1. Removed __builtin_expect() from fastlane_direct_enabled() check (unfair A/B)
// 2. No change to malloc path (malloc_tiny_fast already optimal)
if (fastlane_direct_enabled()) {
// Fail-fast: match Front FastLane rule (FastLane is only safe after init completes).
if (__builtin_expect(!g_initialized, 0)) {
// Not safe → fall through to wrapper path (handles init/LD safety).
} else {
// Direct path: bypass front_fastlane_try_malloc() wrapper
void* ptr = malloc_tiny_fast(size);
if (__builtin_expect(ptr != NULL, 1)) {
return ptr; // Success: handled by hot path
}
// Not handled → fall through to existing FastLane + wrapper path.
// This preserves lock_depth/init/LD semantics for Mid/Large allocations.
}
}
// Phase 6: Front FastLane (Layer Collapse)
// Strategy: Collapse wrapper→gate→policy→route layers into single hot box
// Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
@ -631,6 +662,38 @@ void free(void* ptr) {
#endif
if (!ptr) return;
// Force libc must override FastLane/hot wrapper paths.
// NOTE: Use the cached file-scope g_force_libc_alloc (no getenv) to keep
// this check safe even during early startup/recursion scenarios.
if (__builtin_expect(g_force_libc_alloc == 1, 0)) {
extern void __libc_free(void*);
__libc_free(ptr);
return;
}
// Phase 19-1b: FastLane Direct Path (bypass wrapper layer, revised)
// Strategy: Direct call to free_tiny_fast() / free_cold() (remove 30% wrapper overhead)
// Expected: -17.5 instructions/op, -6.0 branches/op, +10-15% throughput
// ENV: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
// Phase 19-1b changes:
// 1. Removed __builtin_expect() from fastlane_direct_enabled() check (unfair A/B)
// 2. Changed free_tiny_fast_hot() → free_tiny_fast() (use winning path directly)
if (fastlane_direct_enabled()) {
// Fail-fast: match Front FastLane rule (FastLane is only safe after init completes).
if (__builtin_expect(!g_initialized, 0)) {
// Not safe → fall through to wrapper path (handles init/LD safety).
} else {
// Direct path: bypass front_fastlane_try_free() wrapper
if (free_tiny_fast(ptr)) {
return; // Success: handled by hot path
}
// Fallback: cold path handles Mid/Large/external pointers
const wrapper_env_cfg_t* wcfg = wrapper_env_cfg_fast();
free_cold(ptr, wcfg);
return;
}
}
// Phase 6: Front FastLane (Layer Collapse) - free path
// Strategy: Collapse wrapper→gate→classify layers into single hot box
// Observed: +11.13% on Mixed 10-run (Phase 6 A/B)

@ -1,89 +1,75 @@
# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results # Phase 17: FORCE_LIBC Gap Validation v2 — A/B Test Results
**Date**: 2025-12-15 **Date**: 2025-12-16
**Verdict**: ✅ **Case B confirmed**: **Layout / I-cache penalty dominates** **Verdict**: ✅ **Case A confirmed**: allocator delta dominates (**libc is ~1.63× faster** in same-binary A/B)
---
## Executive Summary
Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**: Phase 17 exists to avoid the classic “different binary layout/LTO” trap by running a **same-binary A/B**.
- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**. **Important correction (v1 invalid):**
- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary. `HAKMEM_FORCE_LIBC_ALLOC=1` was previously checked only in late wrapper paths, so the malloc/free hot paths
could return before FORCE_LIBC was observed. This made the “same-binary libc” measurement effectively still
use hakmem for the hot path.
Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency. **Fix (v2):**
Wrappers now bypass directly to `__libc_malloc/__libc_free` when cached `g_force_libc_alloc==1`, *before*
entering FastLane/hot wrapper logic.
Result: FORCE_LIBC now reflects real libc behavior in the same binary, and the delta is large.
---
## Measurement Setup
Workload:
- `bench_random_mixed_*` (Mixed 16-1024B), working set `WS=400` - Mixed 16-1024B, `WS=400`, `ITERS=20000000`
- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh` - Clean ENV via `scripts/run_mixed_10_cleanenv.sh`
Two comparisons: Comparisons:
1) **Same-binary toggle** (allocator logic delta) 1) **Same binary**: `bench_random_mixed_hakmem` with `HAKMEM_FORCE_LIBC_ALLOC=0/1`
2) **System binary** (layout penalty delta) 2) **System binary**: `bench_random_mixed_system` (reference; different binary)
---
## Results ## Results (10-run)
### 1) Same-binary A/B (allocator delta)
Binary: `bench_random_mixed_hakmem`
Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`
| Mode | Throughput (ops/s) | Delta | | Mode | Mean (ops/s) | Median (ops/s) | Delta |
|------|---------------------|-------| |------|--------------:|---------------:|------:|
| hakmem (`FORCE_LIBC=0`) | 48.12M | — | | hakmem (`FORCE_LIBC=0`) | 48.99M | 49.28M | — |
| libc (`FORCE_LIBC=1`) | 48.31M | **+0.39%** | | libc (`FORCE_LIBC=1`) | 79.72M | 80.09M | **+62.7%** |
Interpretation: allocator logic delta is ~noise-level in this experiment context. Interpretation: the allocator delta is **not** noise-level; libc is materially faster on this workload.
### 2) System binary (layout penalty) ### 2) System binary (layout/wrapper penalty estimate)
Binary: `bench_random_mixed_system`
| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary | | Mode | Mean (ops/s) | Median (ops/s) | Delta vs libc-in-hakmem-binary |
|------|---------------------|--------------------------------| |------|--------------:|---------------:|--------------------------------:|
| system malloc | 83.85M | **+73.57%** | | system malloc | 88.06M | 88.35M | **+10.5%** |
Total observed gap: ~+74% class. Interpretation: there is still a non-trivial **“in-hakmem-binary” penalty** (~10%), likely from wrapper/bench
overhead and text footprint, but it is *not* the dominant term versus hakmem's allocator gap.
---
## Perf Stat (200M iterations) — Smoking Gun
| Metric | hakmem binary | system binary | Delta |
|--------|---------------|---------------|-------|
| I-cache misses | 153K | 68K | **-55%** |
| Cycles | 17.9B | 10.2B | **-43%** |
| Instructions | 41.3B | 21.5B | **-48%** |
| Binary size | 653K | 21K | **-97%** |
Interpretation:
- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
- The 30× text footprint difference strongly correlates with the gap.
---
## Conclusion
Phase 12's “system malloc is 1.6× faster” observation was real, but the root cause was misattributed: - ✅ Same-binary `FORCE_LIBC` A/B (v2) shows the **dominant gap is allocator work**, not layout alone.
- ✅ There is also a smaller (~10%) penalty attributable to the hakmem-binary wrapper/text environment.
- ❌ Not primarily allocator algorithm differences
-**Text/layout + I-cache locality + instruction footprint**
This shifts the optimization frontier:
- Stop chasing more routing/dispatch micro-opt (Phase 14-16 plateau)
- Focus on **Hot Text Isolation / layout control**
---
## Next
Proceed to: - Keeping Phase 18 v1 (`--gc-sections`) frozen as NO-GO remains correct.
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md` - Re-evaluate Phase 18 v2 (BENCH_MINIMAL) expectations: -5% instructions is not enough to close a +62% gap.
- Phase 19 should target **structural per-op work reduction** (not dispatch shape), while keeping the FastLane
boundary and “same-binary A/B” discipline.

@ -8,6 +8,13 @@
The purpose of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and to establish, as the SSOT, what the gap actually consists of (allocator difference or binary difference).
**Important (v1 pitfall)**:
If `HAKMEM_FORCE_LIBC_ALLOC=1` is only observed after the malloc/free hot path, the FastLane/hot wrapper returns first,
and the same-binary A/B breaks because it becomes **effectively hakmem vs hakmem**.
In this repository the `malloc/free` wrappers were fixed on 2025-12-16 so that, when the cached `g_force_libc_alloc==1`, they go straight to `__libc_malloc/__libc_free`
**first**, before anything else (see `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`).
---
## 0. Goal (Deliverables)
@ -127,4 +134,3 @@ perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-m
- Run A/B tests in the **same binary** (do not misjudge because of layout/LTO differences)
- Every new optimization must have an ENV gate (so it can be reverted) and exactly one boundary
- When in doubt, prefer “Fail-Fast with fallback” (consistency over speed)

@ -0,0 +1,307 @@
# Phase 19-1b: FastLane Direct (Revised) A/B Test Results
**Date**: 2025-12-15
**Status**: ✅ **GO** (+5.88% throughput)
**Branch**: master
**Commit**: (pending)
---
## Executive Summary
The revised version of Phase 19-1 (19-1b) succeeded. We identified why Phase 19-1 ended as NO-GO (-3.81%), and the fix achieved **+5.88% throughput**.
**Culprits identified**:
1. `__builtin_expect(fastlane_direct_enabled(), 0)` was working against branch prediction
2. `free_tiny_fast()` is the winning path over `free_tiny_fast_hot()` (unified cache winner)
**Fixes**:
- Removed `__builtin_expect()` (fair A/B comparison)
- Changed `free_tiny_fast_hot()` → `free_tiny_fast()` (call the winning path directly)
---
## A/B Test Results
### Throughput (10-run benchmark)
**Baseline (FASTLANE_DIRECT=0)**:
- Mean: **49.17M ops/s**
- StdDev: 407,748 ops/s
- CV: 0.83%
**Optimized (FASTLANE_DIRECT=1)**:
- Mean: **52.06M ops/s**
- StdDev: 404,146 ops/s
- CV: 0.78%
**Delta**: **+5.88%** (GO verdict, clears the +5% target)
---
## perf stat Analysis (200M ops)
### Metrics Table
| Metric | Baseline | Optimized | Delta | Judgment |
|-----------------------|-----------------|-----------------|------------|----------|
| **Throughput** | 49.17M ops/s | 52.06M ops/s | **+5.88%** | **GO** |
| Cycles | 17,775,213,215 | 16,873,451,633 | -5.07% | Good |
| Instructions | 39,980,185,471 | 33,889,807,627 | **-15.23%** | **Excellent** |
| L1-icache-load-misses | 111,712 | 98,542 | -11.79% | Good |
| iTLB-load-misses | 26,039 | 36,835 | +41.46% | Bad |
| dTLB-load-misses | 59,329 | 76,626 | +29.15% | Bad |
| Branches | 10,297,849,396 | 8,304,201,436 | **-19.36%** | **Excellent** |
| Branch-misses | 232,502,367 | 232,239,642 | -0.11% | Good |
### Per-Operation Metrics
| Metric | Baseline | Optimized | Delta |
|--------------|----------|-----------|-----------|
| Cycles/op | 88.88 | 84.37 | **-4.51** |
| Instr/op | 199.90 | 169.45 | **-30.45** |
| Branches/op | 51.49 | 41.52 | **-9.97** |
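The per-operation figures in the table above can be reproduced from the raw `perf stat` totals by dividing each counter by the 200M operations of the run. The following is a minimal sketch under that assumption; the type and function names (`perf_totals_t`, `per_op`, `pct_delta`) are hypothetical helpers for illustration and are not part of hakmem.

```c
#include <assert.h>
#include <stdint.h>

/* Raw perf stat totals for one run; ops is the benchmark operation count.
 * Hypothetical helper type, not part of the hakmem sources. */
typedef struct {
    uint64_t cycles;
    uint64_t instructions;
    uint64_t branches;
    uint64_t ops;
} perf_totals_t;

/* Counter total divided by the number of operations. */
static double per_op(uint64_t total, uint64_t ops) {
    assert(ops > 0);
    return (double)total / (double)ops;
}

/* Relative change from baseline to optimized, in percent. */
static double pct_delta(double baseline, double optimized) {
    assert(baseline != 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Totals copied from the perf stat output in this report (200M ops each). */
static const perf_totals_t k_baseline = {
    17775213215ULL, 39980185471ULL, 10297849396ULL, 200000000ULL
};
static const perf_totals_t k_optimized = {
    16873451633ULL, 33889807627ULL, 8304201436ULL, 200000000ULL
};
```

For example, `per_op(k_baseline.instructions, k_baseline.ops)` gives the baseline Instr/op value, and `pct_delta()` applied to the two instruction totals gives the relative instruction reduction.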
**Key Findings**:
- **Instructions: -30.45/op** (-15.23%) → the wrapper overhead reduction is effective
- **Branches: -9.97/op** (-19.36%) → large reduction in branch count
- **Cycles: -4.51/op** (-5.07%) → overall efficiency improvement
**Trade-offs**:
- iTLB/dTLB misses got worse, but the effect of the instruction/branch reduction outweighed them
- Front-end (I-cache) improved, backend (dTLB) regressed
- Overall throughput is +5.88%, hence the GO verdict
---
## Root Cause Analysis: Why Phase 19-1 Was NO-GO
### Problems in Phase 19-1
**Phase 19-1 implementation** (`core/box/hak_wrappers.inc.h`, old version):
```c
// malloc()
if (__builtin_expect(fastlane_direct_enabled(), 0)) { // ← Problem 1: expect(...,0)
    void* ptr = malloc_tiny_fast(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // ...
}
// free()
if (__builtin_expect(fastlane_direct_enabled(), 0)) { // ← Problem 1: expect(...,0)
    if (free_tiny_fast_hot(ptr)) return; // ← Problem 2: _hot variant
    // ...
}
```
**The core problems**:
1. **__builtin_expect(..., 0) is counterproductive**:
- `fastlane_direct_enabled()` is controlled by an ENV variable, so it switches dynamically during an A/B test
- `__builtin_expect(..., 0)` tells the compiler that this branch is unlikely
- → The hint is right for A=0 and wrong for B=1, so the comparison is not fair
- → The B side (FASTLANE_DIRECT=1) sees more branch mispredictions
2. **`free_tiny_fast()` is the winning path over `free_tiny_fast_hot()`**:
- `free_tiny_fast_hot()`: hot/cold split version (introduced in Phase 7)
- `free_tiny_fast()`: monolithic version (Phase 6 winner)
- `free_tiny_fast()` won the Phase 9/10 A/B tests
- → Choosing `_hot` in Phase 19-1 was a mistake
### Fixes in Phase 19-1b
**Phase 19-1b implementation** (`core/box/hak_wrappers.inc.h`, after the fix):
```c
// malloc()
if (fastlane_direct_enabled()) { // ← Fix 1: removed __builtin_expect
    void* ptr = malloc_tiny_fast(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // ...
}
// free()
if (fastlane_direct_enabled()) { // ← Fix 1: removed __builtin_expect
    if (free_tiny_fast(ptr)) return; // ← Fix 2: switched to free_tiny_fast()
    // ...
}
```
**Effect of the fixes**:
1. Removing `__builtin_expect()` → the A/B becomes a fair comparison
2. Calling `free_tiny_fast()` directly → uses the winning path directly
**Result**: -3.81% → **+5.88%** (an improvement of 9.69 percentage points)
---
## Design Intent vs Implementation Gap
### Original Design (Phase 19 DESIGN.md)
**Expected**:
- Wrapper layer removal gives -17.5 instructions/op, -6.0 branches/op
- Target: +10-15% throughput
**Measured (Phase 19-1b)**:
- Instructions: **-30.45/op** (-15.23%, 1.74× the estimate)
- Branches: **-9.97/op** (-19.36%, 1.66× the estimate)
- Throughput: **+5.88%** (about half the estimate, but still a GO verdict)
**Gap analysis**:
- The instruction/branch reductions exceeded the estimate
- However, throughput is about half the estimate (+5.88% vs +10-15%)
- Cause: worse iTLB/dTLB misses hold throughput back
- Conclusion: instruction reduction alone does not improve throughput linearly
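One way to see why the throughput gain lags the instruction reduction is to split cycles into instruction count and IPC. The following is a worked check using only the `perf stat` totals in this report (IPC 2.25 baseline, 2.01 optimized), not an additional measurement; the symbols C (cycles) and I (instructions) are introduced here for illustration.

```latex
\[
\frac{C_{\text{opt}}}{C_{\text{base}}}
  = \frac{I_{\text{opt}}}{I_{\text{base}}}
    \cdot \frac{\mathrm{IPC}_{\text{base}}}{\mathrm{IPC}_{\text{opt}}}
  \approx 0.8477 \times \frac{2.249}{2.008}
  \approx 0.949
\]
```

Instructions fell by about 15%, but IPC also fell from about 2.25 to 2.01, so cycles end up only about 5% lower, which is consistent with the measured -5.07%.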
---
## Lessons Learned
### 1. The __builtin_expect() Pitfall
**Problem**:
- Using `__builtin_expect(..., 0)` on an ENV-gated path makes the A/B unfair
- It should not be used on conditions that switch dynamically
**Recommendation**:
- OK for compile-time constants (e.g. `HAKMEM_BUILD_RELEASE`)
- Do not use it on runtime ENV variables
- Remove expect hints and verify before running an A/B test
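The recommendation above can be sketched as follows. This is a minimal illustration, assuming a GCC/Clang toolchain (for `__builtin_expect`); the names `g_demo_direct`, `DEMO_FASTLANE_DIRECT`, `demo_direct_enabled` and `demo_route` are hypothetical and only mirror the shape of the real ENV box. The one-time initialization check keeps its hint because it is cold in both arms of an A/B, while the ENV-gated branch itself carries no hint.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical cached flag: -1 uninitialized, 0 disabled, 1 enabled. */
static _Atomic int g_demo_direct = -1;

/* Same parsing rule as the ENV box in this change: set, non-empty, not "0". */
static int demo_parse_flag(const char* e) {
    return (e && *e && *e != '0') ? 1 : 0;
}

static int demo_direct_enabled(void) {
    int v = atomic_load_explicit(&g_demo_direct, memory_order_relaxed);
    /* Hint is acceptable here: initialization happens once, in A and in B. */
    if (__builtin_expect(v == -1, 0)) {
        v = demo_parse_flag(getenv("DEMO_FASTLANE_DIRECT")); /* hypothetical ENV name */
        atomic_store_explicit(&g_demo_direct, v, memory_order_relaxed);
    }
    return v;
}

/* Returns 1 when the direct path is taken, 0 when the wrapper path is taken. */
static int demo_route(void) {
    /* No __builtin_expect on the ENV-gated branch: either arm may be the
     * hot one depending on the run, so neither is marked unlikely. */
    if (demo_direct_enabled()) {
        return 1; /* direct path */
    }
    return 0;     /* existing wrapper path */
}
```

With this shape, flipping the flag between runs changes only which arm executes, not how the compiler was told to lay out the code for it.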
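### 2. Importance of Variant Selection
**Lesson**:
- The choice between `free_tiny_fast_hot()` and `free_tiny_fast()` affects throughput
- Past A/B results (Phase 9/10) should have been consulted
- Even for a new optimization, pick the known “winning path”
### 3. Front-end vs Backend Trade-off
**Findings**:
- Instruction/branch reduction (front-end improvement) does not translate directly into throughput
- dTLB misses (backend regression) hold throughput back
- The overall balance matters
**Guidelines going forward**:
- Analyze front-end and backend separately with perf stat
- Evaluate trade-offs explicitly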
---
## Verdict: GO
**Reasons**:
1. **Throughput: +5.88%** (exceeds +5% target)
2. **Instructions: -15.23%** (excellent reduction)
3. **Branches: -19.36%** (excellent reduction)
4. **Cycles: -5.07%** (solid improvement)
5. **I-cache: -11.79%** (front-end improvement)
**Trade-offs (Acceptable)**:
- iTLB: +41.46% (front-end cost)
- dTLB: +29.15% (backend cost)
- → Overall gain (+5.88%) outweighs these costs
**Decision**: Adopt Phase 19-1b on the mainline. Operate with ENV `HAKMEM_FASTLANE_DIRECT=1`.
---
## Next Steps
### Immediate Actions
1. ✅ Commit Phase 19-1b changes to master
2. ✅ Update CURRENT_TASK.md with results
3. ✅ Archive this report to `docs/analysis/`
### Future Optimizations
**Phase 19-2 candidates** (dTLB miss reduction):
- TLB prefetch hints
- Page alignment optimization
- Working set size reduction
**Phase 19-3 candidates** (instruction reduction):
- ENV snapshot consolidation (Candidate B)
- Stats counter removal (Candidate C)
- Header validation inline (Candidate D)
**Target**: Close remaining gap to libc (73 instructions/op → 40-50 instructions/op)
---
## Appendix: Raw Data
### Baseline (FASTLANE_DIRECT=0) 10-run
```
Run 1: 49.70M ops/s
Run 2: 49.10M ops/s
Run 3: 48.83M ops/s
Run 4: 49.24M ops/s
Run 5: 49.29M ops/s
Run 6: 48.54M ops/s
Run 7: 49.77M ops/s
Run 8: 48.52M ops/s
Run 9: 49.32M ops/s
Run 10: 49.37M ops/s
Mean: 49.17M ops/s
StdDev: 407,748 ops/s
CV: 0.83%
```
### Optimized (FASTLANE_DIRECT=1) 10-run
```
Run 1: 51.44M ops/s
Run 2: 52.56M ops/s
Run 3: 51.71M ops/s
Run 4: 52.30M ops/s
Run 5: 51.73M ops/s
Run 6: 51.96M ops/s
Run 7: 52.48M ops/s
Run 8: 51.44M ops/s
Run 9: 51.96M ops/s
Run 10: 52.46M ops/s
Mean: 52.06M ops/s
StdDev: 404,146 ops/s
CV: 0.78%
```
### perf stat Baseline (FASTLANE_DIRECT=0)
```
Performance counter stats for 'env -i PATH= HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 ./bench_random_mixed_hakmem 200000000 400 1':
17,775,213,215 cycles
39,980,185,471 instructions # 2.25 insn per cycle
111,712 L1-icache-load-misses
26,039 iTLB-load-misses
59,329 dTLB-load-misses
10,297,849,396 branches
232,502,367 branch-misses # 2.26% of all branches
4.486849039 seconds time elapsed
```
### perf stat Optimized (FASTLANE_DIRECT=1)
```
Performance counter stats for 'env -i PATH= HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 ./bench_random_mixed_hakmem 200000000 400 1':
16,873,451,633 cycles
33,889,807,627 instructions # 2.01 insn per cycle
98,542 L1-icache-load-misses
36,835 iTLB-load-misses
76,626 dTLB-load-misses
8,304,201,436 branches
232,239,642 branch-misses # 2.80% of all branches
4.247212223 seconds time elapsed
```
---
**END OF REPORT**

@ -0,0 +1,543 @@
# Phase 19: FastLane Instruction Reduction - Design Document
## 0. Executive Summary
**Goal**: Reduce instruction/branch count gap between hakmem and libc to close throughput gap
**Current Gap**: hakmem 44.88M ops/s vs libc 77.62M ops/s (+73.0% advantage for libc)
**Target**: Reduce instruction gap from +53.8% to <+25%, targeting +15-25% throughput improvement
**Success Criteria**: Achieve 52-56M ops/s (from current 44.88M ops/s)
### Key Findings
Per-operation overhead comparison (200M ops):
| Metric | hakmem | libc | Delta | Delta % |
|--------|--------|------|-------|---------|
| **Instructions/op** | 209.09 | 135.92 | +73.17 | **+53.8%** |
| **Branches/op** | 52.33 | 22.93 | +29.40 | **+128.2%** |
| Cycles/op | 96.48 | 54.69 | +41.79 | +76.4% |
| Branch-miss % | 2.22% | 2.87% | -0.65% | Better |
**Critical insight**: hakmem executes **73 extra instructions** and **29 extra branches** per operation vs libc.
This overhead is the largest single contributor to the throughput gap; the remainder shows up as lower IPC (2.17 for hakmem vs 2.49 for libc).
---
## 1. Gap Analysis (Per-Operation Breakdown)
### 1.1 Instruction Gap: +73.17 instructions/op (+53.8%)
This excess comes from multiple layers of overhead:
- **FastLane wrapper checks**: ENV gates, class mask validation, size checks
- **Policy snapshot overhead**: TLS reads for routing decisions (3+ reads even with ENV snapshot)
- **Route determination**: Static route table lookup vs direct path
- **Multiple ENV gates**: Scattered throughout hot path (DUALHOT, LEGACY_DIRECT, C7_ULTRA, etc.)
- **Stats counters**: Atomic increments on hot path (FREE_PATH_STAT_INC, ALLOC_GATE_STAT_INC, etc.)
- **Header validation duplication**: FastLane + free_tiny_fast both validate header
### 1.2 Branch Gap: +29.40 branches/op (+128.2%)
The relative branch gap is about **2.4x** the relative instruction gap (+128.2% vs +53.8%):
- **Cascading ENV checks**: Each layer adds 1-2 branches (g_initialized, class_mask, DUALHOT, C7_ULTRA, LEGACY_DIRECT)
- **Route dispatch**: Static route check + route_kind switch
- **Early-exit patterns**: Multiple if-checks for ULTRA/DUALHOT/LEGACY paths
- **Stats gating**: `if (__builtin_expect(...))` patterns around counters
### 1.3 Why the Cycles/op Gap Exceeds the Instruction Gap
The cycle gap (+76.4%) is larger than the instruction gap (+53.8%) because hakmem also sustains a lower IPC: 2.17 (hakmem) vs 2.49 (libc).
This suggests:
- **Branch prediction is not the bottleneck**: hakmem's miss rate (2.22%) is lower than libc's (2.87%)
- **I-cache locality**: Code is reasonably compact despite extra instructions
- **But**: The extra branches still cost pipeline throughput, which shows up as the lower IPC
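The relationship between the instruction gap, IPC and the cycle gap can be checked against the raw counters in Appendix 10.1. The sketch below is illustrative arithmetic only (the function names are not hakmem APIs): it splits the cycle ratio into a "more instructions" factor and a "fewer instructions per cycle" factor.

```c
#include <stdint.h>

/* Illustrative helper (not a hakmem API): instructions per cycle. */
static inline double ipc(uint64_t instructions, uint64_t cycles) {
    return (double)instructions / (double)cycles;
}

/* cycles_a / cycles_b written as (instruction ratio) x (inverse IPC ratio).
 * Algebraically this equals cyc_a / cyc_b; the split only separates
 * "more work" from "slower per instruction". */
static inline double cycle_ratio(uint64_t insn_a, uint64_t cyc_a,
                                 uint64_t insn_b, uint64_t cyc_b) {
    double insn_ratio = (double)insn_a / (double)insn_b;
    double ipc_ratio  = ipc(insn_b, cyc_b) / ipc(insn_a, cyc_a);
    return insn_ratio * ipc_ratio;
}
```

With the hakmem and libc counters, the instruction ratio is about 1.54 and the IPC ratio about 1.15, whose product is about 1.76, matching the +76.4% cycles/op delta.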
---
## 2. Hot Path Breakdown (perf report)
Top 10 hot functions (% of cycles):
| Function | % time | Category | Reduction Target? |
|----------|--------|----------|-------------------|
| **front_fastlane_try_free** | 23.97% | Wrapper | ✓ **YES** (remove layer) |
| **malloc** | 23.84% | Wrapper | ✓ **YES** (remove layer) |
| main | 22.02% | Benchmark | (baseline) |
| **free** | 6.82% | Wrapper | ✓ **YES** (remove layer) |
| unified_cache_push | 4.44% | Core | Optimize later |
| tiny_header_finalize_alloc | 4.34% | Core | Optimize later |
| tiny_c7_ultra_alloc | 3.38% | Core | Optimize later |
| tiny_c7_ultra_free | 2.07% | Core | Optimize later |
| hakmem_env_snapshot_enabled | 1.22% | ENV | ✓ **YES** (eliminate checks) |
| hak_super_lookup | 0.98% | Core | Optimize later |
**Critical observation**: Excluding the benchmark's own `main`, the top three functions are **all wrappers**:
- `front_fastlane_try_free` (23.97%) + `free` (6.82%) = **30.79%** on free wrappers
- `malloc` (23.84%) on alloc wrapper
- Combined wrapper overhead: **~54-55%** of all cycles
### 2.1 front_fastlane_try_free Annotated Breakdown
From `perf annotate`, the hot path has these expensive operations:
**Header validation** (lines 1c786-1c791, ~3% samples):
```asm
movzbl -0x1(%rbp),%ebx # Load header byte
mov %ebx,%eax # Copy to eax
and $0xfffffff0,%eax # Extract magic (0xA0)
cmp $0xa0,%al # Check magic
jne ... (fallback) # Branch on mismatch
```
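For readers less familiar with the assembly, the sequence corresponds roughly to the C below. The 0xA0 magic in the high nibble comes from the annotation, and the low-nibble class index from the Candidate D snippet later in this document; the function name and the `-1` return convention are assumptions made for this sketch.

```c
#include <stdint.h>

/* Sketch only: mirrors the annotated asm. The function name and the
 * -1 "not a tiny block" convention are assumptions. */
static inline int tiny_header_class_or_neg(const void *user_ptr) {
    uint8_t header = *((const uint8_t *)user_ptr - 1);  /* movzbl -0x1(%rbp) */
    if ((header & 0xF0u) != 0xA0u) {                     /* and / cmp / jne   */
        return -1;                                       /* take the fallback */
    }
    return header & 0x0F;                                /* class index       */
}
```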
**ENV snapshot checks** (lines 1c7ff-1c822, ~7% samples):
```asm
cmpl $0x1,0x628fa(%rip) # g_hakmem_env_snapshot_ctor_mode (3.01%)
mov 0x628ef(%rip),%r15d # g_hakmem_env_snapshot_gate (1.36%)
je ...
cmp $0xffffffff,%r15d
je ... (init path)
test %r15d,%r15d
jne ... (snapshot path)
```
**Class routing overhead** (lines 1c7d1-1c7fb, ~3% samples):
```asm
mov 0x6299c(%rip),%r15d # g.5.lto_priv.0 (policy gate)
cmp $0x1,%r15d
jne ... (fallback)
movzbl 0x6298f(%rip),%eax # g_mask.3.lto_priv.0
cmp $0xff,%al
je ... (all-classes path)
movzbl %al,%r9d
bt %r13d,%r9d # Bit test class mask
jae ... (fallback)
```
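The gate sequence in this listing is, in effect, a three-step test. A C rendering might look like the sketch below; the parameter names are assumptions, since the real state lives in the LTO-private globals shown in the listing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the policy gate + class mask test seen in the asm.
 * Parameter names are hypothetical. */
static inline bool fastlane_class_allowed(int gate, uint8_t class_mask,
                                          unsigned class_idx) {
    if (gate != 1) return false;            /* cmp $0x1 / jne (fallback)    */
    if (class_mask == 0xFF) return true;    /* cmp $0xff / je (all classes) */
    if (class_idx > 7) return false;        /* defensive: mask has 8 bits   */
    return (class_mask >> class_idx) & 1u;  /* bt / jae (fallback)          */
}
```

Even on the all-classes path this costs two loads and two compare-and-branch pairs per call, which is the overhead Candidate A aims to remove.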
**Total overhead**: ~15-20% of cycles in front_fastlane_try_free are spent on:
- Header validation (already done again in free_tiny_fast)
- ENV snapshot probing
- Policy/route checks
---
## 3. Reduction Candidates (Prioritized by ROI)
### Candidate A: **Eliminate FastLane Wrapper Layer** (Highest ROI)
**Problem**: front_fastlane_try_free + free wrappers consume 30.79% of cycles
**Root cause**: Double header validation + ENV checks + class mask checks
**Proposal**: Direct call to free_tiny_fast() from free() wrapper
**Implementation**:
```c
// In free() wrapper:
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;
    // Phase 19-A: Direct call (no FastLane layer)
    if (free_tiny_fast(ptr)) {
        return; // Handled
    }
    // Fallback to cold path
    free_cold(ptr);
}
```
**Reduction estimate**:
- **Instructions**: -15-20/op (eliminate duplicate header read, ENV checks, class mask checks)
- **Branches**: -5-7/op (remove FastLane gate checks)
- **Impact**: ~10-15% throughput improvement (remove 30% wrapper overhead)
**Risk**: **LOW** (free_tiny_fast already has validation + routing logic)
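The `malloc()` side can be sketched the same way. The version below is a self-contained illustration, not the hakmem implementation: `tiny_fast_stub`, `wrapper_path_stub`, `g_ready` and the 256-byte cutoff are hypothetical stand-ins, and the two counters exist only to make the control flow observable. It follows the rules stated in the Phase 19-2 notes later in this change: skip the direct path before initialization, and on a miss fall through to the existing wrapper path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins so the sketch compiles on its own. */
static bool   g_ready = true;              /* stands in for g_initialized */
static size_t g_direct_hits, g_wrapper_calls;

static void *tiny_fast_stub(size_t size) {
    /* Assumption: the fast path returns NULL for sizes it does not
     * handle. 256 is an arbitrary cutoff for illustration. */
    return (size > 0 && size <= 256) ? malloc(size) : NULL;
}

static void *wrapper_path_stub(size_t size) {
    g_wrapper_calls++;
    return malloc(size);
}

void *malloc_direct_sketch(size_t size) {
    if (__builtin_expect(g_ready, 1)) {
        void *p = tiny_fast_stub(size);
        if (__builtin_expect(p != NULL, 1)) {
            g_direct_hits++;
            return p;                      /* handled by the direct path */
        }
    }
    return wrapper_path_stub(size);        /* miss: existing wrapper path */
}
```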
---
### Candidate B: **Consolidate ENV Snapshot Checks** (High ROI)
**Problem**: ENV snapshot is checked **3+ times per operation**:
1. FastLane entry: `g_initialized` check
2. Route determination: `hakmem_env_snapshot_enabled()` check
3. Route-specific: `tiny_c7_ultra_enabled_env()` check
4. Legacy fallback: Another ENV snapshot check
**Proposal**: Single ENV snapshot read at entry, pass context down
**Implementation**:
```c
// Phase 19-B: ENV context struct
typedef struct {
    bool c7_ultra_enabled;
    bool dualhot_enabled;
    bool legacy_direct_enabled;
    SmallRouteKind route_kind[8]; // Pre-computed routes
} FastLaneCtx;

static __thread FastLaneCtx g_fastlane_ctx = {0};
static __thread int g_fastlane_ctx_init = 0;

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_init == 0, 0)) {
        // One-time init per thread
        const HakmemEnvSnapshot* env = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = env->tiny_c7_ultra_enabled;
        // ... populate other fields
        g_fastlane_ctx_init = 1;
    }
    return &g_fastlane_ctx;
}
```
**Reduction estimate**:
- **Instructions**: -8-12/op (eliminate redundant TLS reads)
- **Branches**: -3-5/op (single init check instead of multiple)
- **Impact**: ~5-8% throughput improvement
**Risk**: **MEDIUM** (need to handle ENV changes during runtime - use invalidation hook)
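One possible shape for the invalidation hook is a global generation counter that each thread compares against its cached copy. This is a sketch under assumptions: every name below is hypothetical, and it is not hakmem's existing snapshot infrastructure. It replaces the one-time init flag with a generation compare, so the steady-state cost stays at one load and one branch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical process-wide state; not hakmem's real snapshot API. */
static _Atomic unsigned g_env_generation = 1;   /* bumped on ENV refresh */
static _Atomic bool     g_env_c7_ultra   = false;

typedef struct {
    unsigned generation;       /* 0 = never populated */
    bool     c7_ultra_enabled;
} FastLaneCtxSketch;

static _Thread_local FastLaneCtxSketch t_ctx;

/* Invalidation hook: publish the new value, then bump the generation. */
static void env_set_c7_ultra(bool on) {
    atomic_store_explicit(&g_env_c7_ultra, on, memory_order_relaxed);
    atomic_fetch_add_explicit(&g_env_generation, 1, memory_order_release);
}

static inline const FastLaneCtxSketch *fastlane_ctx_get_sketch(void) {
    unsigned gen = atomic_load_explicit(&g_env_generation, memory_order_acquire);
    if (__builtin_expect(t_ctx.generation != gen, 0)) {
        /* Stale or uninitialized: re-populate the thread-local copy. */
        t_ctx.c7_ultra_enabled =
            atomic_load_explicit(&g_env_c7_ultra, memory_order_relaxed);
        t_ctx.generation = gen;
    }
    return &t_ctx;
}
```

The release increment paired with the acquire load ensures that a thread observing the new generation also observes the value written before it.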
---
### Candidate C: **Remove Stats Counters from Hot Path** (Medium ROI)
**Problem**: Stats counters on hot path add atomic increments:
- `FRONT_FASTLANE_STAT_INC(free_total)` (every op)
- `FREE_PATH_STAT_INC(total_calls)` (every op)
- `ALLOC_GATE_STAT_INC(total_calls)` (every alloc)
- `tiny_front_free_stat_inc(class_idx)` (every free)
**Proposal**: Make stats DEBUG-only or sample-based (1-in-N)
**Implementation**:
```c
// Phase 19-C: Sampling-based stats
#if !HAKMEM_BUILD_RELEASE
static __thread uint32_t g_stat_counter = 0;
// In the hot path:
if (__builtin_expect((++g_stat_counter & 0xFFF) == 0, 0)) {
    // Sample 1-in-4096 operations
    FRONT_FASTLANE_STAT_INC(free_total);
}
#endif
```
**Reduction estimate**:
- **Instructions**: -4-6/op (remove atomic increments)
- **Branches**: -2-3/op (remove `if (__builtin_expect(...))` checks)
- **Impact**: ~3-5% throughput improvement
**Risk**: **LOW** (stats already compile-time optional)
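A sampled counter undercounts by the sampling factor, so readers of the stat need to scale it back up. The sketch below shows the pattern as complete functions; the names are illustrative, and `t_sampled_free_total` merely stands in for whatever `FRONT_FASTLANE_STAT_INC(free_total)` updates.

```c
#include <stdint.h>

#define STAT_SAMPLE_SHIFT 12u                        /* 1-in-4096 */
#define STAT_SAMPLE_MASK  ((1u << STAT_SAMPLE_SHIFT) - 1u)

static _Thread_local uint32_t t_op_counter;          /* cheap, non-atomic */
static _Thread_local uint64_t t_sampled_free_total;  /* stand-in for the real stat */

/* Call once per free(); only every 4096th call touches the stat. */
static inline void free_stat_sample(void) {
    if (__builtin_expect((++t_op_counter & STAT_SAMPLE_MASK) == 0, 0)) {
        t_sampled_free_total++;
    }
}

/* Estimated true count: each sample represents 4096 operations. */
static inline uint64_t free_stat_estimate(void) {
    return t_sampled_free_total << STAT_SAMPLE_SHIFT;
}
```

The estimate is exact only at multiples of 4096 calls; between samples it lags by up to 4095 operations per thread.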
---
### Candidate D: **Inline Header Validation** (Medium ROI)
**Problem**: Header validation happens twice:
1. FastLane wrapper: `*((uint8_t*)ptr - 1)` (lines 179-191 in front_fastlane_box.h)
2. free_tiny_fast: Same check (lines 598-605 in malloc_tiny_fast.h)
**Proposal**: Trust FastLane validation, remove duplicate check
**Implementation**:
```c
// Phase 19-D: Add "trusted" variant
static inline int free_tiny_fast_trusted(void* ptr, int class_idx, void* base) {
    // Skip header validation (caller already validated)
    // Direct to route dispatch
    ...
}

// In FastLane:
uint8_t header = *((uint8_t*)ptr - 1);
int class_idx = header & 0x0F;
void* base = tiny_user_to_base_inline(ptr);
return free_tiny_fast_trusted(ptr, class_idx, base);
```
**Reduction estimate**:
- **Instructions**: -3-5/op (remove duplicate header load + extract)
- **Branches**: -1-2/op (remove duplicate magic check)
- **Impact**: ~2-3% throughput improvement
**Risk**: **MEDIUM** (need to ensure all callers validate header)
---
### Candidate E: **Static Route Table Optimization** (Lower ROI)
**Problem**: Route determination uses TLS lookups + bit tests:
```c
if (tiny_static_route_ready_fast()) {
    route_kind = tiny_static_route_get_kind_fast(class_idx);
} else {
    route_kind = tiny_policy_hot_get_route(class_idx);
}
```
**Proposal**: Pre-compute common routes at init, inline direct paths
**Implementation**:
```c
// Phase 19-E: Route fast path (C0-C3 LEGACY, C7 ULTRA)
static __thread uint8_t g_route_fastmap = 0; // bit 0=C0...bit 7=C7, 1=LEGACY

static inline bool is_legacy_route_fast(int class_idx) {
    return (g_route_fastmap >> class_idx) & 1;
}
```
**Reduction estimate**:
- **Instructions**: -3-4/op (replace function call with bit test)
- **Branches**: -1-2/op (replace nested if with single bit test)
- **Impact**: ~2-3% throughput improvement
**Risk**: **LOW** (route table is already static)
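The proposal does not show how the bitmap is populated. One way to do it at init is sketched below; the enum is a hypothetical placeholder for hakmem's `SmallRouteKind`, and the map is passed explicitly rather than kept in TLS so the example stands alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical route kinds; hakmem's SmallRouteKind has its own values. */
typedef enum {
    ROUTE_LEGACY_SKETCH,
    ROUTE_ULTRA_SKETCH,
    ROUTE_OTHER_SKETCH
} RouteKindSketch;

/* Build the bitmap once at init: bit i is set when class i is LEGACY. */
static inline uint8_t route_fastmap_build(const RouteKindSketch kinds[8]) {
    uint8_t map = 0;
    for (unsigned i = 0; i < 8; i++) {
        if (kinds[i] == ROUTE_LEGACY_SKETCH) map |= (uint8_t)(1u << i);
    }
    return map;
}

/* Single shift-and-mask test; out-of-range classes are never LEGACY. */
static inline bool route_is_legacy(uint8_t map, unsigned class_idx) {
    return class_idx < 8 && ((map >> class_idx) & 1u);
}
```

With C0-C3 set to LEGACY and C7 to ULTRA, as in the comment on `g_route_fastmap`, the resulting map is `0x0F`.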
---
## 4. Combined Impact Estimate
The combined row sums the per-candidate estimates and scales the total by 80% to allow for overlap between candidates (a conservative assumption):
| Candidate | Instructions/op | Branches/op | Throughput |
|-----------|-----------------|-------------|------------|
| Baseline | 209.09 | 52.33 | 44.88M ops/s |
| **A: Remove FastLane layer** | -17.5 | -6.0 | +12% |
| **B: ENV snapshot consolidation** | -10.0 | -4.0 | +6% |
| **C: Stats removal (Release)** | -5.0 | -2.5 | +4% |
| **D: Inline header validation** | -4.0 | -1.5 | +2% |
| **E: Static route fast path** | -3.5 | -1.5 | +2% |
| **Combined (80% efficiency)** | **-32.0** | **-12.4** | **+21%** |
**Projected outcome**:
- Instructions/op: 209.09 → **177.09** (vs libc 135.92, gap reduced from +53.8% to +30.3%)
- Branches/op: 52.33 → **39.93** (vs libc 22.93, gap reduced from +128.2% to +74.1%)
- Throughput: 44.88M → **54.3M ops/s** (vs libc 77.62M, gap reduced from +73.0% to +43.0%)
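**Achievement vs Goal**: ✓ Throughput meets the target (+21% is within the +15-25% goal, and 54.3M ops/s is within the 52-56M success range). ✗ The projected instruction gap (+30.3%) does not reach the <+25% target.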
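The combined row and the projected gaps can be reproduced with a few lines of arithmetic. The helpers below are illustrative only (not hakmem APIs) and take the per-candidate estimates from the table as input.

```c
/* Illustrative arithmetic only; names are not hakmem APIs. */

/* Sum the per-candidate reductions and apply the overlap factor. */
static inline double combined_reduction(const double *per_candidate, int n,
                                        double efficiency) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += per_candidate[i];
    return sum * efficiency;
}

/* Remaining gap of `value` over `baseline`, in percent. */
static inline double remaining_gap_pct(double value, double baseline) {
    return (value / baseline - 1.0) * 100.0;
}
```

For instructions, the candidates sum to 40.0/op, so 80% efficiency gives 32.0/op; `remaining_gap_pct(209.09 - 32.0, 135.92)` is then about +30.3%.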
---
## 5. Implementation Plan
### Phase 19-1: Remove FastLane Wrapper Layer (A)
**Priority**: P0 (highest ROI)
**Effort**: 2-3 hours
**Risk**: Low (free_tiny_fast already complete)
Steps:
1. Modify `free()` wrapper to directly call `free_tiny_fast(ptr)`
2. Modify `malloc()` wrapper to directly call `malloc_tiny_fast(size)`
3. Measure: Expect +10-15% throughput
4. Fallback: Keep FastLane as compile-time option
### Phase 19-2: ENV Snapshot Consolidation (B)
**Priority**: P1 (high ROI, moderate risk)
**Effort**: 4-6 hours
**Risk**: Medium (ENV invalidation needed)
Steps:
1. Create `FastLaneCtx` struct with pre-computed ENV state
2. Add TLS cache with invalidation hook
3. Replace scattered ENV checks with single context read
4. Measure: Expect +5-8% throughput on top of Phase 19-1
5. Fallback: ENV-gate new path (HAKMEM_FASTLANE_ENV_CTX=1)
### Phase 19-3: Stats Removal (C) + Header Inline (D)
**Priority**: P2 (medium ROI, low risk)
**Effort**: 2-3 hours
**Risk**: Low for C (stats are already compile-time optional); Medium for D (all callers must validate the header, see Section 8)
Steps:
1. Make stats sample-based (1-in-4096) in Release builds
2. Add `free_tiny_fast_trusted()` variant (skip header validation)
3. Measure: Expect +3-5% throughput on top of Phase 19-2
4. Fallback: Compile-time flags for both features
### Phase 19-4: Static Route Fast Path (E)
**Priority**: P3 (lower ROI, polish)
**Effort**: 2-3 hours
**Risk**: Low (route table is static)
Steps:
1. Add `g_route_fastmap` TLS cache
2. Replace function calls with bit tests
3. Measure: Expect +2-3% throughput on top of Phase 19-3
4. Fallback: Keep existing path as fallback
---
## 6. Box Theory Compliance
### Boundary Preservation
- **L0 (ENV)**: Keep existing ENV gates, add new ones for each optimization
- **L1 (Hot inline)**: free_tiny_fast(), malloc_tiny_fast() remain unchanged
- **L2 (Cold fallback)**: free_cold(), malloc_cold() remain unchanged
- **L3 (Stats)**: Make optional via #if guards
### Reversibility
- Each phase is ENV-gated (can revert at runtime)
- Compile-time fallback preserved (HAKMEM_BUILD_RELEASE controls stats)
- FastLane layer can be kept as compile-time option for A/B testing
### Incremental Rollout
- Phase 19-1: Remove wrapper (default ON)
- Phase 19-2: ENV context (default OFF, opt-in for testing)
- Phase 19-3: Stats/header (default ON in Release, OFF in Debug)
- Phase 19-4: Route fast path (default ON)
---
## 7. Validation Checklist
After each phase:
- [ ] Run perf stat (compare instructions/branches/cycles per-op)
- [ ] Run perf record + annotate (verify hot path reduction)
- [ ] Run benchmark suite (Mixed, C6-heavy, C7-heavy)
- [ ] Check correctness (Larson, multithreaded, stress tests)
- [ ] Measure RSS/memory overhead (should be unchanged)
- [ ] A/B test (ENV toggle to verify reversibility)
Success criteria:
- [ ] Throughput improvement matches estimate (±20%)
- [ ] Instruction count reduction matches estimate (±20%)
- [ ] Branch count reduction matches estimate (±20%)
- [ ] No correctness regressions (all tests pass)
- [ ] No memory overhead increase (RSS unchanged)
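The "matches estimate (±20%)" criteria can be read as a relative tolerance around each estimate. A minimal sketch of that check, with a hypothetical helper name:

```c
#include <stdbool.h>

/* True when `measured` lies within +/- rel_tol (e.g. 0.20) of `estimate`.
 * Illustrative helper; not part of hakmem's tooling. */
static inline bool within_estimate(double measured, double estimate,
                                   double rel_tol) {
    double diff  = measured - estimate;
    double bound = rel_tol * (estimate < 0 ? -estimate : estimate);
    if (diff < 0) diff = -diff;
    return diff <= bound;
}
```

Under this reading, a +12% throughput estimate accepts measured results between +9.6% and +14.4%.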
---
## 8. Risk Assessment
### High-Risk Areas
1. **ENV invalidation** (Phase 19-2): Runtime ENV changes could break cached context
- Mitigation: Use invalidation hooks (existing hakmem_env_snapshot infrastructure)
- Fallback: Revert to scattered ENV checks
2. **Header validation trust** (Phase 19-3D): Skipping validation could miss corruption
- Mitigation: Keep validation in Debug builds, extensive testing
- Fallback: Compile-time option to keep duplicate checks
### Medium-Risk Areas
1. **FastLane removal** (Phase 19-1): Could break gradual rollout (class_mask filtering)
- Mitigation: Keep class_mask filtering in FastLane path only (direct path always falls back safely)
- Fallback: Keep FastLane as compile-time option
### Low-Risk Areas
1. **Stats removal** (Phase 19-3C): Already compile-time optional
2. **Route fast path** (Phase 19-4): Route table is static, no runtime changes
---
## 9. Future Optimization Opportunities (Post-Phase 19)
After Phase 19 closes the wrapper gap, next targets:
1. **Unified Cache optimization** (4.44% cycles):
- Reduce cache miss overhead (refill path)
- Optimize LIFO vs ring buffer trade-off
2. **Header finalization** (4.34% cycles):
- Investigate always_inline for tiny_header_finalize_alloc()
- Reduce metadata writes (defer to batch update)
3. **C7 ULTRA optimization** (3.38% + 2.07% = 5.45% cycles):
- Investigate TLS cache locality
- Reduce ULTRA push/pop overhead
4. **Super lookup optimization** (0.98% cycles):
- Already optimized in Phase 12 (mask-based)
- Further reduction may require architectural changes
**Estimated ceiling**: With all optimizations, could approach ~65-70M ops/s (vs libc 77.62M)
**Remaining gap**: Likely fundamental architectural differences (thread-local vs global allocator)
---
## 10. Appendix: Detailed perf Data
### 10.1 perf stat Results (200M ops)
**hakmem (FORCE_LIBC=0)**:
```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=0':
19,296,118,430 cycles
41,817,886,925 instructions # 2.17 insn per cycle
10,466,190,806 branches
232,592,257 branch-misses # 2.22% of all branches
1,660,073 cache-misses
134,601 L1-icache-load-misses
4.913685503 seconds time elapsed
Throughput: 44.88M ops/s
```
**libc (FORCE_LIBC=1)**:
```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=1':
10,937,550,228 cycles
27,183,469,339 instructions # 2.49 insn per cycle
4,586,617,379 branches
131,515,905 branch-misses # 2.87% of all branches
767,370 cache-misses
64,102 L1-icache-load-misses
2.835174452 seconds time elapsed
Throughput: 77.62M ops/s
```
### 10.2 Top 30 Hot Functions (perf report)
```
23.97% front_fastlane_try_free.lto_priv.0
23.84% malloc
22.02% main
6.82% free
4.44% unified_cache_push.lto_priv.0
4.34% tiny_header_finalize_alloc.lto_priv.0
3.38% tiny_c7_ultra_alloc.constprop.0
2.07% tiny_c7_ultra_free
1.22% hakmem_env_snapshot_enabled.lto_priv.0
0.98% hak_super_lookup.part.0.lto_priv.4.lto_priv.0
0.85% hakmem_env_snapshot.lto_priv.0
0.82% hak_pool_free_v1_slow_impl
0.59% tiny_front_v3_snapshot_get.lto_priv.0
0.30% __memset_avx2_unaligned_erms (libc)
0.30% tiny_unified_lifo_enabled.lto_priv.0
0.28% hak_free_at.constprop.0
0.24% hak_pool_try_alloc.part.0
0.24% malloc_cold
0.16% hak_pool_try_alloc_v1_impl.part.0
0.14% free_cold.constprop.0
0.13% mid_inuse_dec_deferred
0.12% hak_pool_mid_lookup
0.12% do_user_addr_fault (kernel)
0.11% handle_pte_fault (kernel)
0.11% __mod_memcg_lruvec_state (kernel)
0.10% do_anonymous_page (kernel)
0.09% classify_ptr
0.07% tiny_get_max_size.lto_priv.0
0.06% __handle_mm_fault (kernel)
0.06% __alloc_pages (kernel)
```
---
## 11. Conclusion
Phase 19 has **clear, actionable targets** with high ROI:
1. **Immediate action (Phase 19-1)**: Remove FastLane wrapper layer
- Expected: +10-15% throughput
- Risk: Low
- Effort: 2-3 hours
2. **Follow-up (Phases 19-2 to 19-4)**: ENV consolidation + stats + route optimization
- Expected: +6-11% additional throughput
- Risk: Medium (ENV invalidation)
- Effort: 8-12 hours
**Combined target**: +21% throughput (44.88M → 54.3M ops/s)
**Gap closure**: Reduce instruction gap from +53.8% to +30.3% vs libc
This positions hakmem for competitive performance while maintaining safety and Box Theory compliance.


@ -0,0 +1,64 @@
# Phase 19-2: FASTLANE_DIRECT Promotion + Rebaseline (Next Instructions)
## 0. Status (where we are)
- Phase 19-1b (FASTLANE_DIRECT) is **GO**: throughput **+5.88%** with **-15.23% instr/op** and **-19.36% branches/op**.
- Safety hardening completed:
- `!g_initialized` → direct path is skipped (fail-fast, same rule as Front FastLane).
- malloc miss no longer calls `malloc_cold()` directly; it falls through to the normal wrapper path (preserves `g_hakmem_lock_depth` invariants).
- ENV cache is a single global `_Atomic` so `bench_profile` refresh affects wrappers.
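The last point matters because a thread-local cache would not see a refresh performed by another thread. A lazily initialized global atomic might look like the sketch below; the variable and function names are assumptions, and only the `HAKMEM_FASTLANE_DIRECT` variable name comes from this document.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Sketch: -1 = not yet read, 0 = disabled, 1 = enabled. */
static _Atomic int g_fastlane_direct_cache = -1;

/* Only the exact string "1" enables the feature; default is 0. */
static int parse_flag(const char *s) {
    return (s != NULL && s[0] == '1' && s[1] == '\0') ? 1 : 0;
}

static inline int fastlane_direct_enabled_sketch(void) {
    int v = atomic_load_explicit(&g_fastlane_direct_cache, memory_order_relaxed);
    if (__builtin_expect(v < 0, 0)) {
        v = parse_flag(getenv("HAKMEM_FASTLANE_DIRECT"));
        atomic_store_explicit(&g_fastlane_direct_cache, v, memory_order_relaxed);
    }
    return v;
}

/* Called after a profile refresh changes the environment: the next
 * lookup from any thread re-reads the variable. */
static inline void fastlane_direct_cache_reset_sketch(void) {
    atomic_store_explicit(&g_fastlane_direct_cache, -1, memory_order_relaxed);
}
```

Relaxed ordering is enough here because the cached value is self-contained: racing threads may each call `getenv` once, but they all store the same result.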
## 1. Promotion policy (Box Theory)
- Keep rollback simple:
- `HAKMEM_FASTLANE_DIRECT=0` → disable (fallback to Phase 6 FastLane wrapper path).
- `HAKMEM_FASTLANE_DIRECT=1` → enable (direct `malloc_tiny_fast()` / `free_tiny_fast()` first).
- Promotion level:
- **Preset promotion** (recommended): set `HAKMEM_FASTLANE_DIRECT=1` in `MIXED_TINYV3_C7_SAFE` and `C6_HEAVY_LEGACY_POOLV1` presets.
- Keep **ENV default = 0** (opt-in) until real-world/LD_PRELOAD validation is done.
## 2. Required verification (same-binary A/B)
### 2.1 Mixed (10-run, clean env)
Baseline:
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
```
Optimized:
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
```
GO/NO-GO:
- GO: mean **+1.0%** or higher
- NEUTRAL: **±1.0%** → keep as preset-only (do not flip global default)
- NO-GO: **≤ -1.0%** → revert preset promotion
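The thresholds can be expressed as a small classifier over the two run means. This is a sketch with hypothetical names; it resolves the boundary cases by treating exactly +1.0% as GO and exactly -1.0% as NO-GO.

```c
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } Verdict;

/* Classify an A/B result from mean throughput (ops/s) of each arm. */
static inline Verdict classify_ab(double baseline_mean, double optimized_mean) {
    double delta_pct = (optimized_mean / baseline_mean - 1.0) * 100.0;
    if (delta_pct >= 1.0)  return VERDICT_GO;       /* promote        */
    if (delta_pct <= -1.0) return VERDICT_NO_GO;    /* revert preset  */
    return VERDICT_NEUTRAL;                         /* preset-only    */
}
```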
### 2.2 C6-heavy (5-run)
```sh
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=0 ./bench_mid_large_mt_hakmem 1 1000000 400 1
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=1 ./bench_mid_large_mt_hakmem 1 1000000 400 1
```
## 3. Perf stat capture (root-cause guardrails)
Run both A/B with:
```sh
perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses,iTLB-load-misses,dTLB-load-misses -- \
./bench_random_mixed_hakmem 200000000 400 1
```
Checklist:
- `instructions/op` and `branches/op` must improve (expected)
- iTLB/dTLB misses may worsen; accept only if throughput still improves
## 4. Next target selection (after promotion)
After Phase 19-2 is stable, re-run `perf record` on Mixed and choose the next box by **self% ≥ 5%**:
- If `unified_cache_push/pop` rises: focus on **UnifiedCache data-path** (touch fewer cache lines).
- If `tiny_header_finalize_alloc` rises: focus on **header finalize path** (but treat as high NO-GO risk; prior header work was often NEUTRAL).
- If ENV checks reappear in hot path: consider **Phase 19-3 (ENV check consolidation)**, but keep it in a separate research box.


@ -178,7 +178,8 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
 core/box/front_fastlane_env_box.h core/box/front_fastlane_stats_box.h \
 core/box/front_fastlane_alloc_legacy_direct_env_box.h \
 core/box/tiny_front_hot_box.h core/box/tiny_front_cold_box.h \
-core/box/smallobject_policy_v7_box.h core/box/../hakmem_internal.h
+core/box/smallobject_policy_v7_box.h core/box/fastlane_direct_env_box.h \
+core/box/../hakmem_internal.h
 core/hakmem.h:
 core/hakmem_build_flags.h:
 core/hakmem_config.h:
@ -441,4 +442,5 @@ core/box/front_fastlane_alloc_legacy_direct_env_box.h:
 core/box/tiny_front_hot_box.h:
 core/box/tiny_front_cold_box.h:
 core/box/smallobject_policy_v7_box.h:
+core/box/fastlane_direct_env_box.h:
 core/box/../hakmem_internal.h:

BIN
perf.data.phase19_hakmem Normal file

Binary file not shown.


@ -18,6 +18,8 @@ export HAKMEM_MALLOC_TINY_DIRECT=${HAKMEM_MALLOC_TINY_DIRECT:-0}
 export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=${HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT:-0}
 export HAKMEM_FORCE_LIBC_ALLOC=${HAKMEM_FORCE_LIBC_ALLOC:-0}
 export HAKMEM_ENV_SNAPSHOT_SHAPE=${HAKMEM_ENV_SNAPSHOT_SHAPE:-0}
+# NOTE: Phase 19-1b is promoted in presets. Keep cleanenv aligned by default.
+export HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
 # NOTE: Phase 9/10 are promoted (bench_profile defaults to 1). Keep cleanenv aligned by default.
 export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-1}
 export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-1}