Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)
## Phase 17 v2: FORCE_LIBC Gap Validation Fix
**Critical bug fix**: the Phase 17 v1 measurement was broken.
**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 only took effect after the FastLane path,
so the same-binary A/B was effectively "hakmem vs hakmem" (a spurious +0.39% result).
**Fix**: added an early bypass on g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171 and :645,
so the wrappers go straight to __libc_malloc/__libc_free before anything else.
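The ordering problem can be illustrated with a minimal sketch. This is not the hakmem code: `demo_malloc`, `demo_fastlane_alloc`, `demo_libc_alloc`, and the counters are hypothetical names, and plain `malloc` stands in for `__libc_malloc`. The only point is that the force-libc check has to run before any fast-path attempt; otherwise the FORCE_LIBC=1 arm still exercises the allocator under test.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real wrappers live in hak_wrappers.inc.h. */
static int g_force_libc_alloc = 0;  /* mirrors HAKMEM_FORCE_LIBC_ALLOC */
static int g_fastlane_calls = 0;    /* how often the fast path ran */
static int g_libc_calls = 0;        /* how often the libc path ran */

/* Placeholder for the allocator's own fast path. */
static void *demo_fastlane_alloc(size_t size) {
    g_fastlane_calls++;
    return malloc(size);
}

/* Placeholder for __libc_malloc. */
static void *demo_libc_alloc(size_t size) {
    g_libc_calls++;
    return malloc(size);
}

/* Early bypass: the force-libc flag is tested before anything else. */
static void *demo_malloc(size_t size) {
    if (g_force_libc_alloc == 1) {
        return demo_libc_alloc(size);
    }
    return demo_fastlane_alloc(size);
}
```

With the check placed first, flipping the flag changes which path serves every call, which is what a same-binary A/B needs.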
**Result**: a valid same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)
**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)
**Conclusion**: Case A confirmed (allocator dominant, NOT layout).
The Case B verdict from Phase 17 v1 was wrong.
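The two percentages follow directly from the three throughput figures. The sketch below redoes that arithmetic; `pct_faster` and `print_gap_breakdown` are illustrative helper names, not part of hakmem.

```c
#include <stdio.h>

/* Relative gain of `other` over `base`, in percent. */
static double pct_faster(double base_mops, double other_mops) {
    return (other_mops / base_mops - 1.0) * 100.0;
}

/* Prints the two components of the gap using the figures quoted above. */
static void print_gap_breakdown(void) {
    const double hakmem = 48.99;  /* FORCE_LIBC=0, M ops/s */
    const double libc   = 79.72;  /* FORCE_LIBC=1, M ops/s */
    const double system = 88.06;  /* separate system binary, M ops/s */

    printf("allocator gap: %+.1f%%\n", pct_faster(hakmem, libc));
    printf("layout gap:    %+.1f%%\n", pct_faster(libc, system));
}
```

Calling `print_gap_breakdown()` prints +62.7% and +10.5%. The two factors compose multiplicatively, so the system binary is roughly 80% faster than hakmem overall.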
Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)
---
## Phase 19: FastLane Instruction Reduction Analysis
**Goal**: close the instruction gap to libc (libc executes 35% fewer instructions and 56% fewer branches per op)
**perf stat analysis** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)
**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**
**Reduction candidates**:
- A: Remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate ENV snapshots (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)
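The candidates can be combined into a rough projection. The sketch below sums the listed per-candidate savings and applies an efficiency factor; the 80% figure is the assumption used in the design notes in CURRENT_TASK.md, and the helper name is hypothetical.

```c
#include <stddef.h>

/* Per-candidate instruction savings (inst/op) as listed above, A through E. */
static const double k_savings[] = { 17.5, 10.0, 5.0, 4.0, 3.5 };

/* Projected instructions/op if only a fraction `efficiency` of the
 * listed savings is actually realized. */
static double projected_inst_per_op(double baseline, double efficiency) {
    double total = 0.0;
    for (size_t i = 0; i < sizeof k_savings / sizeof k_savings[0]; i++) {
        total += k_savings[i];
    }
    return baseline - efficiency * total;
}
```

With the measured baseline of 209.09 inst/op and an efficiency of 0.8, this gives 177.09 inst/op, which still leaves hakmem about 30% above libc's 135.92.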
Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md
---
## Phase 19-1b: FastLane Direct — GO (+5.88%)
**Strategy**: bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()
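A minimal sketch of this dispatch shape, under stated assumptions: every name below is a hypothetical stand-in, the 128-byte limit is arbitrary, and plain `malloc` replaces the real allocator paths. It shows the direct call being attempted only when the gate is on and initialization has finished, with a miss dropping back to the existing wrapper path.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* All names below are hypothetical stand-ins for illustration. */
static bool g_demo_initialized = false;  /* allocator init finished? */
static bool g_demo_direct = false;       /* mirrors FASTLANE_DIRECT=1 */
static int  g_demo_direct_hits = 0;      /* calls served by the direct path */
static int  g_demo_wrapper_calls = 0;    /* calls served by the old path */

/* Stand-in for a tiny fast path: serves small sizes, NULL on a miss. */
static void *demo_tiny_fast(size_t size) {
    if (size == 0 || size > 128) return NULL;
    return malloc(size);
}

/* Stand-in for the existing, fully checked wrapper path. */
static void *demo_wrapper_path(size_t size) {
    g_demo_wrapper_calls++;
    return malloc(size);
}

static void *demo_malloc_direct(size_t size) {
    /* Direct path only after init and only when the gate is on. */
    if (g_demo_initialized && g_demo_direct) {
        void *p = demo_tiny_fast(size);
        if (p != NULL) {
            g_demo_direct_hits++;
            return p;
        }
        /* Miss: fall through to the existing path. */
    }
    return demo_wrapper_path(size);
}
```

Keeping the fallback inside the same function means a disabled gate or a miss costs one extra branch and otherwise behaves exactly like the old path.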
**Why Phase 19-1 was NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired (it made the A/B unfair)
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)
**Fixes in Phase 19-1b**:
1. Removed __builtin_expect()
2. Call free_tiny_fast() directly
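`__builtin_expect(x, 0)` is a GCC/Clang builtin that tells the compiler `x` is usually 0, so the guarded branch tends to be laid out as the cold one. In an A/B run where the optimized arm has the gate on, every call then takes the branch marked unlikely. The sketch below contrasts the two forms; the function names are hypothetical and the bodies are trivial placeholders.

```c
/* Hypothetical gate; in the source this role is played by
 * fastlane_direct_enabled(). */
static int g_demo_flag = 0;
static int demo_flag_on(void) { return g_demo_flag; }

static int demo_new_path(int x) { return x + 1; }  /* stands in for the direct call */
static int demo_old_path(int x) { return x + 2; }  /* stands in for the old path */

/* Phase 19-1 style: the enabled branch is hinted as unlikely. */
static int demo_dispatch_hinted(int x) {
    if (__builtin_expect(demo_flag_on(), 0)) {
        return demo_new_path(x);
    }
    return demo_old_path(x);
}

/* Phase 19-1b style: no hint, so neither A/B arm is favored. */
static int demo_dispatch_plain(int x) {
    if (demo_flag_on()) {
        return demo_new_path(x);
    }
    return demo_old_path(x);
}
```

Both functions return the same values for the same flag; only the layout hint differs, which is why the effect shows up in throughput rather than in correctness.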
**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)
**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)
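Instructions fell much faster than cycles, which implies a lower IPC after the change. IPC is not reported in the figures above; the sketch below derives it from the rounded per-op values, so treat the result as approximate. The helper names are illustrative.

```c
/* Instructions per cycle from per-operation counters. */
static double ipc(double inst_per_op, double cycles_per_op) {
    return inst_per_op / cycles_per_op;
}

/* Relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

For example, `ipc(199.90, 88.88)` is about 2.25 and `ipc(169.45, 84.37)` is about 2.01: the direct path executes fewer instructions, but each one is slightly more expensive on average.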
**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs
**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
- HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
- Single _Atomic global (fixes the wrapper-side cache problem)
2. **Wrapper changes**: core/box/hak_wrappers.inc.h
- malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
- free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1
- Safety: the direct path is not used while !g_initialized; the fallback is kept
3. **Preset promotion**: core/bench_profile.h:88
- bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
- Comment: +5.88% proven on Mixed, 10-run
4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
- HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
- Promoted in the same way as Phase 9/10
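The ENV gate in item 1 can be sketched as a single atomic cache that both the refresh function and the hot-path query share. This is a sketch under assumptions, not the contents of fastlane_direct_env_box.c: the `demo_*` names and the parsing rule (only a leading `1` enables) are hypothetical.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* One process-wide cache of the gate; -1 means "not read yet". */
static _Atomic int g_demo_direct_gate = -1;

/* Hypothetical parsing rule: only a value starting with '1' enables. */
static int demo_parse_gate(const char *value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Re-reads the environment and publishes the result to the single global. */
static void demo_gate_refresh_from_env(void) {
    int v = demo_parse_gate(getenv("HAKMEM_FASTLANE_DIRECT"));
    atomic_store_explicit(&g_demo_direct_gate, v, memory_order_relaxed);
}

/* Hot-path query: one relaxed load once the cache is populated. */
static int demo_gate_enabled(void) {
    int v = atomic_load_explicit(&g_demo_direct_gate, memory_order_relaxed);
    if (v < 0) {
        demo_gate_refresh_from_env();
        v = atomic_load_explicit(&g_demo_direct_gate, memory_order_relaxed);
    }
    return v;
}
```

Because the refresh writes the same variable the query reads, a preset that sets the variable and then refreshes cannot leave a stale copy behind in the wrapper.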
**Verdict**: GO, adopted on the main line; preset promotion complete
**Rollback**: set HAKMEM_FASTLANE_DIRECT=0 to return to the existing FastLane path
Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md
---
## Cumulative Performance
- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**
Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.1%** (was +62.7% before Phase 19-1b)
Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
118
CURRENT_TASK.md
@@ -1,5 +1,117 @@
# Mainline Task (Current)

## Update Notes (2025-12-15 Phase 19-1b FASTLANE-DIRECT-1B)

### Phase 19-1b FASTLANE-DIRECT-1B: FastLane Direct (Revised) - ✅ GO (+5.88%)

**Result**: The revised version of Phase 19-1 succeeded. Removing `__builtin_expect()` and calling `free_tiny_fast()` directly raised throughput by **+5.88%**.

**A/B Test Results**:
- Baseline: 49.17M ops/s (FASTLANE_DIRECT=0)
- Optimized: 52.06M ops/s (FASTLANE_DIRECT=1)
- Delta: **+5.88%** (GO verdict, clears the +5% target)

**perf stat Analysis** (200M ops):
- Instructions: **-15.23%** (199.90 → 169.45/op, a reduction of 30.45)
- Branches: **-19.36%** (51.49 → 41.52/op, a reduction of 9.97)
- Cycles: **-5.07%** (88.88 → 84.37/op)
- I-cache misses: -11.79% (Good)
- iTLB misses: +41.46% (Bad, but overall gain wins)
- dTLB misses: +29.15% (Bad, but overall gain wins)

**Root causes identified**:
1. Cause of the Phase 19-1 NO-GO: `__builtin_expect(fastlane_direct_enabled(), 0)` backfired
2. `free_tiny_fast()` is the winning path over `free_tiny_fast_hot()` (the unified cache winner)
3. The fix reduces wrapper overhead, which substantially cuts instructions and branches

**Changes**:
- File: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h`
- malloc: `__builtin_expect(fastlane_direct_enabled(), 0)` → `fastlane_direct_enabled()`
- free: `free_tiny_fast_hot()` → `free_tiny_fast()` (switched to the winning path)
- Safety: while `!g_initialized`, the direct path is not used and calls fall back to the existing path (same fail-fast behavior as FastLane)
- Safety: a malloc miss does not call `malloc_cold()` directly but drops to the existing wrapper path (preserving the lock_depth assumption)
- ENV cache: consolidated into a single global so that `fastlane_direct_env_refresh_from_env()` updates the same `_Atomic` that the wrapper reads

**Next**: Phase 19-1b is adopted on the main line. Run with ENV `HAKMEM_FASTLANE_DIRECT=1`.

---

## Previous Task (Phase 19 FASTLANE-INSTRUCTION-REDUCTION-1)

### Phase 19 FASTLANE-INSTRUCTION-REDUCTION-1: FastLane Instruction Reduction v1 - 📊 ANALYSIS COMPLETE

Result: perf stat/record analysis identified **the root of the gap to libc**. The design document is complete.

- Design: `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md`
- perf data: saved (perf_stat_hakmem.txt, perf_stat_libc.txt, perf.data.phase19_hakmem)

### Gap Analysis (200M ops baseline)

**Per-operation overhead** (hakmem vs libc):
- Instructions/op: **209.09 vs 135.92** (+73.17, **+53.8%**)
- Branches/op: **52.33 vs 22.93** (+29.40, **+128.2%**)
- Cycles/op: **96.48 vs 54.69** (+41.79, +76.4%)
- Throughput: **44.88M vs 77.62M ops/s** (+73.0% gap)

**Critical finding**: hakmem executes **73 extra instructions** and **29 extra branches** per op. This accounts for the entire throughput gap.

### Hot Path Breakdown (perf report)

Top wrapper overhead (about 55% of cycles in total):
- `front_fastlane_try_free`: **23.97%**
- `malloc`: **23.84%**
- `free`: **6.82%**

The wrapper layer consumes more than half of all cycles (duplicate validation, ENV checks, class mask checks, and so on).

### Reduction Candidates (in priority order)

1. **Candidate A: Remove the FastLane Wrapper Layer** (highest ROI)
- Impact: **-17.5 instructions/op, -6.0 branches/op** (+10-15% throughput)
- Risk: **LOW** (free_tiny_fast_hot already exists)
- Rationale: eliminates duplicate header validation and ENV checks

2. **Candidate B: Consolidate ENV Snapshots** (high ROI)
- Impact: **-10.0 instructions/op, -4.0 branches/op** (+5-8% throughput)
- Risk: **MEDIUM** (ENV invalidation must be handled)
- Rationale: merges 3+ ENV checks into one

3. **Candidate C: Remove Stats Counters** (medium ROI)
- Impact: **-5.0 instructions/op, -2.5 branches/op** (+3-5% throughput)
- Risk: **LOW** (compile-time optional)
- Rationale: eliminates atomic increment overhead

4. **Candidate D: Header Validation Inline** (medium ROI)
- Impact: **-4.0 instructions/op, -1.5 branches/op** (+2-3% throughput)
- Risk: **MEDIUM** (relies on caller-side validation)
- Rationale: eliminates the duplicate header load

5. **Candidate E: Static Route Fast Path** (lower ROI)
- Impact: **-3.5 instructions/op, -1.5 branches/op** (+2-3% throughput)
- Risk: **LOW** (route table is static)
- Rationale: replaces a function call with a bit test

**Combined estimate** (80% efficiency):
- Instructions/op: 209.09 → **177.09** (gap: +53.8% → +30.3%)
- Branches/op: 52.33 → **39.93** (gap: +128.2% → +74.1%)
- Throughput: 44.88M → **54.3M ops/s** (+21%, **meets the +15-25% target**)

### Implementation Plan

- **Phase 19-1** (P0): Remove the FastLane wrapper (2-3h, +10-15%)
- **Phase 19-2** (P1): Consolidate ENV snapshots (4-6h, +5-8%)
- **Phase 19-3** (P2): Stats + Header Inline (2-3h, +3-5%)
- **Phase 19-4** (P3): Route Fast Path (2-3h, +2-3%)

### Next Steps

1. Start implementing Phase 19-1 (remove the FastLane layer, call free_tiny_fast_hot directly)
2. Verify the instruction/branch reduction with perf stat
3. Measure the throughput improvement with Mixed 10-run
4. Implement Phase 19-2 through 19-4 in order

---

## Update Notes (2025-12-15 Phase 18 HOT-TEXT-ISOLATION-1)

### Phase 18 HOT-TEXT-ISOLATION-1: Hot Text Isolation v1 - ❌ NO-GO / FROZEN
@@ -17,9 +129,9 @@
- The hot/cold attributes are not actually applied (the implementation is incomplete)

Key findings:
- Reconfirms the Phase 17 conclusion: the bottleneck is **instruction count** and **memory latency**
- Phase 17 v2 (after the FORCE_LIBC fix): in the same-binary A/B, **libc is +62.7%** (≈1.63×) faster → the main cause of the gap is **allocator work** (not layout alone)
- Code layout optimization cannot break through the 2.30 IPC wall
- However, `bench_random_mixed_system` is a further **+10.5%** faster than `libc-in-hakmem-binary` → a wrapper/text environment penalty also remains
- Next move: Phase 18 v2 (BENCH_MINIMAL), which cuts instruction count directly
- Phase 18 v2 (BENCH_MINIMAL) is a valid direction for trimming "additive fixed costs", but a reduction of about -5% instructions cannot close a +62% gap

## Update Notes (2025-12-14 Phase 6 FRONT-FASTLANE-1)

8
Makefile
@@ -253,12 +253,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)

# Targets
TARGET = test_hakmem
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS = $(OBJS_BASE)

# Shared library
SHARED_LIB = libhakmem.so
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o 
core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/box/fastlane_direct_env_box_shared.o core/page_arena_shared.o 
core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o

# Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
ifeq ($(POOL_TLS_PHASE1),1)
@@ -285,7 +285,7 @@ endif
# Benchmark targets
BENCH_HAKMEM = bench_allocators_hakmem
BENCH_SYSTEM = bench_allocators_system
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o bench_allocators_hakmem.o
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o bench_allocators_hakmem.o
BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1)
BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
@@ -462,7 +462,7 @@ test-box-refactor: box-refactor
./larson_hakmem 10 8 128 1024 1 12345 4

# Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
|
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
|
||||||
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
|
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
|
||||||
ifeq ($(POOL_TLS_PHASE1),1)
|
ifeq ($(POOL_TLS_PHASE1),1)
|
||||||
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
||||||
|
|||||||
@ -14,6 +14,7 @@
|
|||||||
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
|
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
|
||||||
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
|
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
|
||||||
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
|
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
|
||||||
|
#include "box/fastlane_direct_env_box.h" // fastlane_direct_env_refresh_from_env (Phase 19-1)
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
// Set the default value only when the env var is not already set
|
// Set the default value only when the env var is not already set
|
||||||
@ -84,6 +85,8 @@ static inline void bench_apply_profile(void) {
|
|||||||
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
|
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
|
||||||
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
|
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
|
||||||
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
|
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
|
||||||
|
// Phase 19-1b: FastLane Direct (wrapper layer bypass, +5.88% proven on Mixed, 10-run)
|
||||||
|
bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
|
||||||
// Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
|
// Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
|
||||||
bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
|
bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
|
||||||
// Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
|
// Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
|
||||||
@ -119,6 +122,8 @@ static inline void bench_apply_profile(void) {
|
|||||||
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
|
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
|
||||||
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
|
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
|
||||||
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
|
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
|
||||||
|
// Phase 19-1b: FastLane Direct (wrapper layer bypass)
|
||||||
|
bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
|
||||||
// Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
|
// Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
|
||||||
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
|
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
|
||||||
} else if (strcmp(p, "C6_V7_STUB") == 0) {
|
} else if (strcmp(p, "C6_V7_STUB") == 0) {
|
||||||
@ -196,5 +201,7 @@ static inline void bench_apply_profile(void) {
|
|||||||
tiny_unified_lifo_env_refresh_from_env();
|
tiny_unified_lifo_env_refresh_from_env();
|
||||||
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
|
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
|
||||||
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
|
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
|
||||||
|
// Phase 19-1: Sync FastLane Direct ENV cache after bench_profile putenv defaults.
|
||||||
|
fastlane_direct_env_refresh_from_env();
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
|
|||||||
15
core/box/fastlane_direct_env_box.c
Normal file
@ -0,0 +1,15 @@
// fastlane_direct_env_box.c - Phase 19-1: FastLane Direct Path ENV Control (implementation)

#include "fastlane_direct_env_box.h"
#include <stdlib.h>
#include <stdatomic.h>

_Atomic int g_fastlane_direct_enabled = -1;

// Refresh cached ENV flag from environment variable
// Called during benchmark ENV reloads to pick up runtime changes
void fastlane_direct_env_refresh_from_env(void) {
    const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
    int enable = (e && *e && *e != '0') ? 1 : 0;
    atomic_store_explicit(&g_fastlane_direct_enabled, enable, memory_order_relaxed);
}
46
core/box/fastlane_direct_env_box.h
Normal file
@ -0,0 +1,46 @@
// fastlane_direct_env_box.h - Phase 19-1: FastLane Direct Path ENV Control
//
// Goal: Remove wrapper layer overhead (30.79% of cycles) by calling core allocator directly
// Strategy: Compile-time + runtime gate to bypass front_fastlane_try_*() wrapper
//
// Box Theory:
// - Boundary: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
// - Rollback: ENV=0 reverts to existing FastLane wrapper path
// - Observability: perf stat shows instruction/branch reduction
//
// Expected Performance:
// - Reduction: -17.5 instructions/op, -6.0 branches/op
// - Impact: +10-15% throughput (remove 30% wrapper overhead)
//
// ENV Variables:
// HAKMEM_FASTLANE_DIRECT=0/1 # Enable direct path (default: 0, research box)

#pragma once

#include <stdatomic.h>
#include <stdlib.h>

// ENV control: cached flag for fastlane_direct_enabled()
// -1: uninitialized, 0: disabled, 1: enabled
// NOTE: Must be a single global (not header-static) so bench_profile refresh can
// update the same cache used by malloc/free wrappers.
extern _Atomic int g_fastlane_direct_enabled;

// Runtime check: Is FastLane Direct path enabled?
// Returns: 1 if enabled, 0 if disabled
// Hot path: Single atomic load (after first call)
static inline int fastlane_direct_enabled(void) {
    int val = atomic_load_explicit(&g_fastlane_direct_enabled, memory_order_relaxed);
    if (__builtin_expect(val == -1, 0)) {
        // Cold path: Initialize from ENV
        const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
        int enable = (e && *e && *e != '0') ? 1 : 0;
        atomic_store_explicit(&g_fastlane_direct_enabled, enable, memory_order_relaxed);
        return enable;
    }
    return val;
}

// Refresh from ENV: Called during benchmark ENV reloads
// Allows runtime toggle without recompilation
void fastlane_direct_env_refresh_from_env(void);
@ -43,6 +43,7 @@ void* realloc(void* ptr, size_t size) {
|
|||||||
#include "malloc_tiny_direct_env_box.h" // Phase 5 E5-4: Malloc Tiny direct path ENV gate
|
#include "malloc_tiny_direct_env_box.h" // Phase 5 E5-4: Malloc Tiny direct path ENV gate
|
||||||
#include "malloc_tiny_direct_stats_box.h" // Phase 5 E5-4: Malloc Tiny direct path stats
|
#include "malloc_tiny_direct_stats_box.h" // Phase 5 E5-4: Malloc Tiny direct path stats
|
||||||
#include "front_fastlane_box.h" // Phase 6: Front FastLane (Layer Collapse)
|
#include "front_fastlane_box.h" // Phase 6: Front FastLane (Layer Collapse)
|
||||||
|
#include "fastlane_direct_env_box.h" // Phase 19-1: FastLane Direct Path (remove wrapper layer)
|
||||||
#include "../hakmem_internal.h" // AllocHeader helpers for diagnostics
|
#include "../hakmem_internal.h" // AllocHeader helpers for diagnostics
|
||||||
#include "../hakmem_super_registry.h" // Superslab lookup for diagnostics
|
#include "../hakmem_super_registry.h" // Superslab lookup for diagnostics
|
||||||
#include "../superslab/superslab_inline.h" // slab_index_for, capacity
|
#include "../superslab/superslab_inline.h" // slab_index_for, capacity
|
||||||
@ -165,6 +166,14 @@ void* malloc(size_t size) {
|
|||||||
#endif
|
#endif
|
||||||
// NDEBUG: malloc_count increment disabled - removes 27.55% bottleneck
|
// NDEBUG: malloc_count increment disabled - removes 27.55% bottleneck
|
||||||

    // Force libc must override FastLane/hot wrapper paths.
    // NOTE: Use the cached file-scope g_force_libc_alloc to avoid getenv recursion
    // during early startup (before lock_depth is incremented).
    if (__builtin_expect(g_force_libc_alloc == 1, 0)) {
        extern void* __libc_malloc(size_t);
        return __libc_malloc(size);
    }

// Phase 20-2: BenchFast mode (structural ceiling measurement)
|
// Phase 20-2: BenchFast mode (structural ceiling measurement)
|
||||||
// WARNING: Bypasses ALL safety checks - benchmark only!
|
// WARNING: Bypasses ALL safety checks - benchmark only!
|
||||||
// IMPORTANT: Do NOT use BenchFast during preallocation/init to avoid recursion.
|
// IMPORTANT: Do NOT use BenchFast during preallocation/init to avoid recursion.
|
||||||
@ -176,6 +185,28 @@ void* malloc(size_t size) {
|
|||||||
// Fallback to normal path for large allocations
|
// Fallback to normal path for large allocations
|
||||||
}
|
}
|
||||||

    // Phase 19-1b: FastLane Direct Path (bypass wrapper layer, revised)
    // Strategy: Direct call to malloc_tiny_fast() (remove wrapper overhead; miss falls through)
    // Expected: -17.5 instructions/op, -6.0 branches/op, +10-15% throughput
    // ENV: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
    // Phase 19-1b changes:
    // 1. Removed __builtin_expect() from fastlane_direct_enabled() check (unfair A/B)
    // 2. No change to malloc path (malloc_tiny_fast already optimal)
    if (fastlane_direct_enabled()) {
        // Fail-fast: match Front FastLane rule (FastLane is only safe after init completes).
        if (__builtin_expect(!g_initialized, 0)) {
            // Not safe → fall through to wrapper path (handles init/LD safety).
        } else {
            // Direct path: bypass front_fastlane_try_malloc() wrapper
            void* ptr = malloc_tiny_fast(size);
            if (__builtin_expect(ptr != NULL, 1)) {
                return ptr; // Success: handled by hot path
            }
            // Not handled → fall through to existing FastLane + wrapper path.
            // This preserves lock_depth/init/LD semantics for Mid/Large allocations.
        }
    }

// Phase 6: Front FastLane (Layer Collapse)
|
// Phase 6: Front FastLane (Layer Collapse)
|
||||||
// Strategy: Collapse wrapper→gate→policy→route layers into single hot box
|
// Strategy: Collapse wrapper→gate→policy→route layers into single hot box
|
||||||
// Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
|
// Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
|
||||||
@ -631,6 +662,38 @@ void free(void* ptr) {
|
|||||||
#endif
|
#endif
|
||||||
if (!ptr) return;
|
if (!ptr) return;
|
||||||

    // Force libc must override FastLane/hot wrapper paths.
    // NOTE: Use the cached file-scope g_force_libc_alloc (no getenv) to keep
    // this check safe even during early startup/recursion scenarios.
    if (__builtin_expect(g_force_libc_alloc == 1, 0)) {
        extern void __libc_free(void*);
        __libc_free(ptr);
        return;
    }

    // Phase 19-1b: FastLane Direct Path (bypass wrapper layer, revised)
    // Strategy: Direct call to free_tiny_fast() / free_cold() (remove 30% wrapper overhead)
    // Expected: -17.5 instructions/op, -6.0 branches/op, +10-15% throughput
    // ENV: HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
    // Phase 19-1b changes:
    // 1. Removed __builtin_expect() from fastlane_direct_enabled() check (unfair A/B)
    // 2. Changed free_tiny_fast_hot() → free_tiny_fast() (use winning path directly)
    if (fastlane_direct_enabled()) {
        // Fail-fast: match Front FastLane rule (FastLane is only safe after init completes).
        if (__builtin_expect(!g_initialized, 0)) {
            // Not safe → fall through to wrapper path (handles init/LD safety).
        } else {
            // Direct path: bypass front_fastlane_try_free() wrapper
            if (free_tiny_fast(ptr)) {
                return; // Success: handled by hot path
            }
            // Fallback: cold path handles Mid/Large/external pointers
            const wrapper_env_cfg_t* wcfg = wrapper_env_cfg_fast();
            free_cold(ptr, wcfg);
            return;
        }
    }

// Phase 6: Front FastLane (Layer Collapse) - free path
|
// Phase 6: Front FastLane (Layer Collapse) - free path
|
||||||
// Strategy: Collapse wrapper→gate→classify layers into single hot box
|
// Strategy: Collapse wrapper→gate→classify layers into single hot box
|
||||||
// Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
|
// Observed: +11.13% on Mixed 10-run (Phase 6 A/B)
|
||||||
|
|||||||
@ -1,89 +1,75 @@
|
|||||||
# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results
|
# Phase 17: FORCE_LIBC Gap Validation v2 — A/B Test Results
|
||||||
|
|
||||||
**Date**: 2025-12-15
|
**Date**: 2025-12-16
|
||||||
**Verdict**: ✅ **Case B confirmed** — **Layout / I-cache penalty dominates**
|
**Verdict**: ✅ **Case A confirmed** — allocator delta dominates (**libc is ~1.63× faster** in same-binary A/B)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Executive Summary
|
## Executive Summary
|
||||||
|
|
||||||
Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**:
|
Phase 17 exists to avoid the classic “different binary layout/LTO” trap by running a **same-binary A/B**.
|
||||||
|
|
||||||
- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**.
|
**Important correction (v1 invalid):**
|
||||||
- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary.
|
`HAKMEM_FORCE_LIBC_ALLOC=1` was previously checked only in late wrapper paths, so the malloc/free hot paths
|
||||||
|
could return before FORCE_LIBC was observed. This made the “same-binary libc” measurement effectively still
|
||||||
|
use hakmem for the hot path.
|
||||||
|
|
||||||
Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency.
|
**Fix (v2):**
|
||||||
|
Wrappers now bypass directly to `__libc_malloc/__libc_free` when cached `g_force_libc_alloc==1`, *before*
|
||||||
|
entering FastLane/hot wrapper logic.
|
||||||
|
|
||||||
|
Result: FORCE_LIBC now reflects real libc behavior in the same binary, and the delta is large.
|
||||||
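
The ordering problem described in the correction can be illustrated with a minimal C sketch. It does not use the real wrappers: `fake_libc_malloc`, `fake_fastlane_malloc`, `g_force_libc`, and both `wrapper_*` functions are hypothetical stand-ins that only record which path served a request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real paths; they only record who served. */
enum served_by { SERVED_NONE, SERVED_LIBC, SERVED_FASTLANE };

static int g_force_libc = 0;                 /* cached flag (illustrative name) */
static enum served_by g_last = SERVED_NONE;

static void *fake_libc_malloc(size_t n)     { (void)n; g_last = SERVED_LIBC;     return &g_last; }
static void *fake_fastlane_malloc(size_t n) { (void)n; g_last = SERVED_FASTLANE; return &g_last; }

/* v1 ordering: the hot path returns before the flag is ever looked at. */
static void *wrapper_v1(size_t n) {
    void *p = fake_fastlane_malloc(n);
    if (p) return p;
    if (g_force_libc) return fake_libc_malloc(n);
    return NULL;
}

/* v2 ordering: the flag is checked first, so libc really serves the request. */
static void *wrapper_v2(size_t n) {
    if (g_force_libc) return fake_libc_malloc(n);
    return fake_fastlane_malloc(n);
}

/* Small self-check showing the difference between the two orderings. */
static void check_ordering(void) {
    g_force_libc = 1;
    (void)wrapper_v1(64);
    assert(g_last == SERVED_FASTLANE);  /* flag ignored: "hakmem vs hakmem" */
    (void)wrapper_v2(64);
    assert(g_last == SERVED_LIBC);      /* flag honored */
}
```

With the v1 ordering, both arms of the A/B run the same hot path, which is consistent with the near-zero delta that v1 reported.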
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Measurement Setup
|
## Measurement Setup
|
||||||
|
|
||||||
Workload:
|
Workload:
|
||||||
- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400`
|
- Mixed 16–1024B, `WS=400`, `ITERS=20000000`
|
||||||
- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh`
|
- Clean ENV via `scripts/run_mixed_10_cleanenv.sh`
|
||||||
|
|
||||||
Two comparisons:
|
Comparisons:
|
||||||
1) **Same-binary toggle** (allocator logic delta)
|
1) **Same binary**: `bench_random_mixed_hakmem` with `HAKMEM_FORCE_LIBC_ALLOC=0/1`
|
||||||
2) **System binary** (layout penalty delta)
|
2) **System binary**: `bench_random_mixed_system` (reference; different binary)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Results
|
## Results (10-run)
|
||||||
|
|
||||||
### 1) Same-binary A/B (allocator delta)
|
### 1) Same-binary A/B (allocator delta)
|
||||||
|
|
||||||
Binary: `bench_random_mixed_hakmem`
|
Binary: `bench_random_mixed_hakmem`
|
||||||
Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`
|
|
||||||
|
|
||||||
| Mode | Throughput (ops/s) | Delta |
|
| Mode | Mean (ops/s) | Median (ops/s) | Delta |
|
||||||
|------|---------------------|-------|
|
|------|--------------:|---------------:|------:|
|
||||||
| hakmem (`FORCE_LIBC=0`) | 48.12M | — |
|
| hakmem (`FORCE_LIBC=0`) | 48.99M | 49.28M | — |
|
||||||
| libc (`FORCE_LIBC=1`) | 48.31M | **+0.39%** |
|
| libc (`FORCE_LIBC=1`) | 79.72M | 80.09M | **+62.7%** |
|
||||||
|
|
||||||
Interpretation: allocator logic delta is ~noise-level in this experiment context.
|
Interpretation: the allocator delta is **not** noise-level; libc is materially faster on this workload.
|
||||||
|
|
||||||
### 2) System binary (layout penalty)
|
### 2) System binary (layout/wrapper penalty estimate)
|
||||||
|
|
||||||
Binary: `bench_random_mixed_system`
|
Binary: `bench_random_mixed_system`
|
||||||
|
|
||||||
| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary |
|
| Mode | Mean (ops/s) | Median (ops/s) | Delta vs libc-in-hakmem-binary |
|
||||||
|------|---------------------|--------------------------------|
|
|------|--------------:|---------------:|--------------------------------:|
|
||||||
| system malloc | 83.85M | **+73.57%** |
|
| system malloc | 88.06M | 88.35M | **+10.5%** |
|
||||||
|
|
||||||
Total observed gap: ~+74% class.
|
Interpretation: there is still a non-trivial **“in-hakmem-binary” penalty** (~10%), likely from wrapper/bench
|
||||||
|
overhead and text footprint, but it is *not* the dominant term versus hakmem’s allocator gap.
|
||||||
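
The two deltas can be reproduced from the mean throughputs in the tables. The sketch below is only arithmetic on those published means; the function and constant names are illustrative and not part of the repository.

```c
#include <assert.h>

/* Relative gain of `to` over `from`, in percent. */
static double pct_gain(double from, double to) {
    assert(from > 0.0);
    return (to / from - 1.0) * 100.0;
}

/* Mean throughputs from the v2 tables, in M ops/s. */
static const double HAKMEM_MOPS           = 48.99;  /* FORCE_LIBC=0 */
static const double LIBC_SAME_BINARY_MOPS = 79.72;  /* FORCE_LIBC=1 */
static const double SYSTEM_BINARY_MOPS    = 88.06;  /* bench_random_mixed_system */

/* Allocator term: libc vs hakmem inside the same binary. */
static double allocator_gap_pct(void) {
    return pct_gain(HAKMEM_MOPS, LIBC_SAME_BINARY_MOPS);
}

/* Binary/layout term: system binary vs libc inside the hakmem binary. */
static double layout_gap_pct(void) {
    return pct_gain(LIBC_SAME_BINARY_MOPS, SYSTEM_BINARY_MOPS);
}
```

The two terms compose multiplicatively rather than additively: 1.627 × 1.105 ≈ 1.80, so the end-to-end gap from hakmem to the system binary is roughly +80%.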
---
|
|
||||||
|
|
||||||
## Perf Stat (200M iterations) — Smoking Gun
|
|
||||||
|
|
||||||
| Metric | hakmem binary | system binary | Delta |
|
|
||||||
|--------|---------------|---------------|-------|
|
|
||||||
| I-cache misses | 153K | 68K | **-55%** |
|
|
||||||
| Cycles | 17.9B | 10.2B | **-43%** |
|
|
||||||
| Instructions | 41.3B | 21.5B | **-48%** |
|
|
||||||
| Binary size | 653K | 21K | **-97%** |
|
|
||||||
|
|
||||||
Interpretation:
|
|
||||||
- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
|
|
||||||
- The 30× text footprint difference strongly correlates with the gap.
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Conclusion
|
## Conclusion
|
||||||
|
|
||||||
Phase 12’s “system malloc is 1.6× faster” observation was real, but the root cause was misattributed:
|
- ✅ Same-binary `FORCE_LIBC` A/B (v2) shows the **dominant gap is allocator work**, not layout alone.
|
||||||
|
- ✅ There is also a smaller (~10%) penalty attributable to the hakmem-binary wrapper/text environment.
|
||||||
- ❌ Not primarily allocator algorithm differences
|
|
||||||
- ✅ **Text/layout + I-cache locality + instruction footprint**
|
|
||||||
|
|
||||||
This shifts the optimization frontier:
|
|
||||||
- Stop chasing more routing/dispatch micro-opt (Phase 14–16 plateau)
|
|
||||||
- Focus on **Hot Text Isolation / layout control**
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Next
|
## Next
|
||||||
|
|
||||||
Proceed to:
|
- Freeze Phase 18 v1 (`--gc-sections`) as NO-GO remains correct.
|
||||||
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
|
- Re-evaluate Phase 18 v2 (BENCH_MINIMAL) expectations: -5% instructions is not enough to close a +62% gap.
|
||||||
|
- Phase 19 should target **structural per-op work reduction** (not dispatch shape), while keeping the FastLane
|
||||||
|
boundary and “same-binary A/B” discipline.
|
||||||
|
|||||||
@ -8,6 +8,13 @@
|
|||||||
|
|
||||||
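The goal of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and pin down, as the SSOT, what the gap really is (allocator difference or binary difference).
|
The goal of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and pin down, as the SSOT, what the gap really is (allocator difference or binary difference).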
|
||||||
|
|
||||||
**Important (v1 pitfall)**:
If `HAKMEM_FORCE_LIBC_ALLOC=1` is only observed after the malloc/free hot path, the FastLane/hot wrapper returns first,
and the same-binary A/B breaks because it **effectively becomes hakmem vs hakmem**.

On 2025-12-16 the `malloc/free` wrappers in this repository were fixed so that, when the cached `g_force_libc_alloc==1`, they go
to `__libc_malloc/__libc_free` **first**, before anything else (see `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`).
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 0. Goal (Deliverables)
|
## 0. Goal (Deliverables)
|
||||||
@ -127,4 +134,3 @@ perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-m
|
|||||||
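- Run every A/B in the **same binary** (so layout/LTO differences cannot cause a misjudgment)
|
- Run every A/B in the **same binary** (so layout/LTO differences cannot cause a misjudgment)
|
||||||
- Every new optimization must come with an ENV gate (revertible) and exactly one boundary
|
- Every new optimization must come with an ENV gate (revertible) and exactly one boundary
|
||||||
- When in doubt, prefer “Fail-Fast with fallback” (consistency over speed)
|
- When in doubt, prefer “Fail-Fast with fallback” (consistency over speed)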
|
||||||
|
|
||||||
|
|||||||
@ -0,0 +1,307 @@
# Phase 19-1b: FastLane Direct (Revised) A/B Test Results

**Date**: 2025-12-15
**Status**: ✅ **GO** (+5.88% throughput)
**Branch**: master
**Commit**: (pending)

---

## Executive Summary

The revised version of Phase 19-1 (19-1b) succeeded. The cause of the Phase 19-1 NO-GO (-3.81%) was identified, and the fixes achieved **+5.88% throughput**.

**Culprits identified**:
1. `__builtin_expect(fastlane_direct_enabled(), 0)` was counterproductive: it marked the direct path as unlikely even when the flag was enabled
2. `free_tiny_fast()` is the winning path, not `free_tiny_fast_hot()` (unified cache winner)

**Fixes**:
- Removed `__builtin_expect()` (fair A/B comparison)
- Changed `free_tiny_fast_hot()` → `free_tiny_fast()` (call the winning path directly)


---

## A/B Test Results

### Throughput (10-run benchmark)

**Baseline (FASTLANE_DIRECT=0)**:
- Mean: **49.17M ops/s**
- StdDev: 407,748 ops/s
- CV: 0.83%

**Optimized (FASTLANE_DIRECT=1)**:
- Mean: **52.06M ops/s**
- StdDev: 404,146 ops/s
- CV: 0.78%

**Delta**: **+5.88%** (GO verdict; clears the +5% target)

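
As a sanity check on the summary statistics, the sketch below recomputes the coefficient of variation and expresses the mean difference in units of the per-run standard deviation. It uses only the means and standard deviations listed above; the helper names are illustrative, and this is a rough noise check, not a formal significance test.

```c
#include <assert.h>

/* Coefficient of variation in percent: stddev relative to the mean. */
static double cv_percent(double mean, double stddev) {
    assert(mean > 0.0);
    return stddev / mean * 100.0;
}

/* How many per-run standard deviations separate the two means. */
static double gap_in_stddevs(double mean_a, double mean_b, double stddev) {
    assert(stddev > 0.0);
    return (mean_b - mean_a) / stddev;
}

/* Values from the 10-run results above (ops/s). */
static const double BASE_MEAN = 49.17e6, BASE_SD = 407748.0;
static const double OPT_MEAN  = 52.06e6, OPT_SD  = 404146.0;
```

With these inputs the gap is about seven baseline standard deviations, which suggests the +5.88% is well outside run-to-run noise for this setup.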

---

## perf stat Analysis (200M ops)

### Metrics Table

| Metric                | Baseline       | Optimized      | Delta       | Judgment      |
|-----------------------|----------------|----------------|-------------|---------------|
| **Throughput**        | 49.17M ops/s   | 52.06M ops/s   | **+5.88%**  | **GO**        |
| Cycles                | 17,775,213,215 | 16,873,451,633 | -5.07%      | Good          |
| Instructions          | 39,980,185,471 | 33,889,807,627 | **-15.23%** | **Excellent** |
| L1-icache-load-misses | 111,712        | 98,542         | -11.79%     | Good          |
| iTLB-load-misses      | 26,039         | 36,835         | +41.46%     | Bad           |
| dTLB-load-misses      | 59,329         | 76,626         | +29.15%     | Bad           |
| Branches              | 10,297,849,396 | 8,304,201,436  | **-19.36%** | **Excellent** |
| Branch-misses         | 232,502,367    | 232,239,642    | -0.11%      | Good          |

### Per-Operation Metrics

| Metric      | Baseline | Optimized | Delta      |
|-------------|----------|-----------|------------|
| Cycles/op   | 88.88    | 84.37     | **-4.51**  |
| Instr/op    | 199.90   | 169.45    | **-30.45** |
| Branches/op | 51.49    | 41.52     | **-9.97**  |

**Key Findings**:
- **Instructions: -30.45/op** (-15.23%) → the wrapper overhead reduction is effective
- **Branches: -9.97/op** (-19.36%) → a large reduction in branch count
- **Cycles: -4.51/op** (-5.07%) → an overall efficiency improvement

**Trade-offs**:
- iTLB/dTLB misses got worse, but the instruction/branch reduction outweighed them
- L1 I-cache misses improved, while TLB misses (both iTLB and dTLB) got worse
- Overall, throughput is +5.88%, so the verdict is GO

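
The per-operation table follows directly from the perf stat totals divided by the 200M operations of the run. A minimal sketch of that conversion, with illustrative names and the counter values copied from the metrics table:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a whole-run perf counter into a per-operation figure. */
static double per_op(uint64_t total, uint64_t ops) {
    assert(ops > 0);
    return (double)total / (double)ops;
}

/* Per-op reduction between a baseline and an optimized counter value. */
static double per_op_saved(uint64_t baseline, uint64_t optimized, uint64_t ops) {
    return per_op(baseline, ops) - per_op(optimized, ops);
}

/* perf stat totals from the metrics table (200M ops). */
static const uint64_t OPS             = 200000000ULL;
static const uint64_t CYCLES_BASE     = 17775213215ULL;
static const uint64_t INSTR_BASE      = 39980185471ULL;
static const uint64_t INSTR_OPT       = 33889807627ULL;
static const uint64_t BRANCHES_BASE   = 10297849396ULL;
static const uint64_t BRANCHES_OPT    = 8304201436ULL;
```

For example, `per_op(CYCLES_BASE, OPS)` gives the 88.88 cycles/op baseline, and `per_op_saved(INSTR_BASE, INSTR_OPT, OPS)` gives the 30.45 instructions/op saving.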
|
---
|
||||||
|
|
||||||
|
## Root Cause Analysis: Why Phase 19-1 Was NO-GO

### Problems in Phase 19-1

**Phase 19-1 implementation** (`core/box/hak_wrappers.inc.h`, old version):
```c
// malloc()
if (__builtin_expect(fastlane_direct_enabled(), 0)) {  // ← Problem 1: expect(..., 0)
    void* ptr = malloc_tiny_fast(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // ...
}

// free()
if (__builtin_expect(fastlane_direct_enabled(), 0)) {  // ← Problem 1: expect(..., 0)
    if (free_tiny_fast_hot(ptr)) return;                // ← Problem 2: _hot variant
    // ...
}
```

**Root of the problem**:

1. **`__builtin_expect(..., 0)` backfired**:
   - `fastlane_direct_enabled()` is controlled by an ENV variable, so its value differs between the A and B runs of an A/B test
   - `__builtin_expect(..., 0)` tells the compiler that the branch is unlikely, so the direct path is laid out as the cold side
   - → With A=0 and B=1 the hint matches A and contradicts B, so the comparison is not fair
   - → On the B side (FASTLANE_DIRECT=1), branch mispredictions increased

2. **`free_tiny_fast()` is the winning variant, not `free_tiny_fast_hot()`**:
   - `free_tiny_fast_hot()`: hot/cold split version (introduced in Phase 7)
   - `free_tiny_fast()`: monolithic version (Phase 6 winner)
   - `free_tiny_fast()` won the Phase 9/10 A/B tests
   - → Choosing `_hot` in Phase 19-1 was a mistake

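
To make the first problem concrete, here is a minimal sketch of an ENV-controlled gate whose call site carries no `__builtin_expect`. Every `demo_*` name is hypothetical, and the caching scheme is an assumption for illustration; it is not the actual `fastlane_direct_enabled()` implementation.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = disabled, 1 = enabled. Hypothetical cache, not the hakmem one. */
static _Atomic int g_demo_direct_gate = -1;

/* Interpret the variable's text: only a leading '1' enables the path. */
static int demo_parse_gate(const char* v) {
    return (v && v[0] == '1') ? 1 : 0;
}

/* Re-read the environment variable; intended for benchmark setup code. */
static void demo_direct_gate_refresh(void) {
    atomic_store_explicit(&g_demo_direct_gate,
                          demo_parse_gate(getenv("HAKMEM_FASTLANE_DIRECT")),
                          memory_order_relaxed);
}

/* Runtime gate: returns 0 or 1, lazily initialized on first use. */
static inline int demo_direct_enabled(void) {
    int g = atomic_load_explicit(&g_demo_direct_gate, memory_order_relaxed);
    if (g < 0) {  /* first call only */
        demo_direct_gate_refresh();
        g = atomic_load_explicit(&g_demo_direct_gate, memory_order_relaxed);
    }
    return g;
}

/* Call-site shape: a plain `if`, so the compiler receives no static bias
 * toward either arm of the A/B switch. */
static int demo_pick_path(void) {
    if (demo_direct_enabled()) {
        return 1;  /* the direct path would run here */
    }
    return 0;      /* the wrapper path would run here */
}
```

Because the gate is a runtime value, both arms are treated alike at compile time, and the same binary can serve as A and B.
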
### Fixes in Phase 19-1b

**Phase 19-1b implementation** (`core/box/hak_wrappers.inc.h`, after the fix):
```c
// malloc()
if (fastlane_direct_enabled()) {  // ← Fix 1: __builtin_expect removed
    void* ptr = malloc_tiny_fast(size);
    if (__builtin_expect(ptr != NULL, 1)) return ptr;
    // ...
}

// free()
if (fastlane_direct_enabled()) {       // ← Fix 1: __builtin_expect removed
    if (free_tiny_fast(ptr)) return;   // ← Fix 2: switched to free_tiny_fast()
    // ...
}
```

**Effect of the fixes**:
1. `__builtin_expect()` removed → the A/B comparison is fair
2. Direct call to `free_tiny_fast()` → uses the proven winner directly

**Result**: -3.81% → **+5.88%** (a 9.69-point improvement)

---

## Design Intent vs Implementation Gap

### Original Design (Phase 19 DESIGN.md)

**Expected**:
- Removing the wrapper layer: -17.5 instructions/op, -6.0 branches/op
- Target: +10-15% throughput

**Measured (Phase 19-1b)**:
- Instructions: **-30.45/op** (-15.23%, 1.74x the estimate)
- Branches: **-9.97/op** (-19.36%, 1.66x the estimate)
- Throughput: **+5.88%** (about half the estimate, but still GO)

**Gap analysis**:
- The instruction and branch reductions exceeded the estimate
- Throughput, however, reached only about half the estimate (+5.88% vs +10-15%)
- Suspected cause: the worse iTLB/dTLB misses hold throughput back (IPC also fell from 2.25 to 2.01)
- Conclusion: cutting instructions alone does not improve throughput linearly

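
The non-linear relationship can be read off the raw counters: cycles equal instructions divided by IPC, so a 15.23% instruction cut shrinks to a 5.07% cycle cut once IPC drops from 2.25 to 2.01. A small sketch of that decomposition, with hypothetical helper names:

```c
#include <stdint.h>

/* Instructions per cycle from two raw perf counters. */
static double ipc(uint64_t instructions, uint64_t cycles) {
    return (double)instructions / (double)cycles;
}

/* cycles = instructions / IPC, therefore
 * cycles_opt / cycles_base = (instr_opt / instr_base) * (ipc_base / ipc_opt). */
static double cycle_ratio(uint64_t instr_base, uint64_t cyc_base,
                          uint64_t instr_opt, uint64_t cyc_opt) {
    double instr_ratio = (double)instr_opt / (double)instr_base;
    double ipc_ratio   = ipc(instr_base, cyc_base) / ipc(instr_opt, cyc_opt);
    return instr_ratio * ipc_ratio;
}
```
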
---

## Lessons Learned

### 1. The `__builtin_expect()` pitfall

**Problem**:
- Using `__builtin_expect(..., 0)` on an ENV-gated path makes the A/B comparison unfair
- It should not be used on conditions that are switched at runtime

**Recommendation**:
- Fine for compile-time constants (e.g. `HAKMEM_BUILD_RELEASE`)
- Do not use it on runtime ENV variables
- Remove expect hints before running an A/B test, then verify

### 2. Variant selection matters

**Lesson**:
- The choice between `free_tiny_fast_hot()` and `free_tiny_fast()` affects throughput
- Past A/B results (Phase 9/10) should have been consulted
- Even in a new optimization, pick the variant that has already won

### 3. Front-end vs Backend Trade-off

**Findings**:
- Reducing instructions/branches (front-end improvement) does not translate directly into throughput
- dTLB misses (back-end regression) hold throughput back
- Overall balance matters

**Guidelines going forward**:
- Analyze front-end and back-end separately with perf stat
- Evaluate trade-offs explicitly

---

## Verdict: GO

**Reasons**:
1. **Throughput: +5.88%** (exceeds +5% target)
2. **Instructions: -15.23%** (excellent reduction)
3. **Branches: -19.36%** (excellent reduction)
4. **Cycles: -5.07%** (solid improvement)
5. **I-cache: -11.79%** (front-end improvement)

**Trade-offs (Acceptable)**:
- iTLB: +41.46% (front-end cost)
- dTLB: +29.15% (backend cost)
- → Overall gain (+5.88%) outweighs these costs

**Decision**: Adopt Phase 19-1b on the main line. Operate with ENV `HAKMEM_FASTLANE_DIRECT=1`.

---

## Next Steps

### Immediate Actions

1. ✅ Commit Phase 19-1b changes to master
2. ✅ Update CURRENT_TASK.md with results
3. ✅ Archive this report to `docs/analysis/`

### Future Optimizations

**Phase 19-2 candidates** (dTLB miss reduction):
- TLB prefetch hints
- Page alignment optimization
- Working set size reduction

**Phase 19-3 candidates** (instruction reduction):
- ENV snapshot consolidation (Candidate B)
- Stats counter removal (Candidate C)
- Header validation inline (Candidate D)

**Target**: Close remaining gap to libc (73 instructions/op → 40-50 instructions/op)

---

## Appendix: Raw Data

### Baseline (FASTLANE_DIRECT=0) 10-run

```
Run 1: 49.70M ops/s
Run 2: 49.10M ops/s
Run 3: 48.83M ops/s
Run 4: 49.24M ops/s
Run 5: 49.29M ops/s
Run 6: 48.54M ops/s
Run 7: 49.77M ops/s
Run 8: 48.52M ops/s
Run 9: 49.32M ops/s
Run 10: 49.37M ops/s

Mean: 49.17M ops/s
StdDev: 407,748 ops/s
CV: 0.83%
```

### Optimized (FASTLANE_DIRECT=1) 10-run

```
Run 1: 51.44M ops/s
Run 2: 52.56M ops/s
Run 3: 51.71M ops/s
Run 4: 52.30M ops/s
Run 5: 51.73M ops/s
Run 6: 51.96M ops/s
Run 7: 52.48M ops/s
Run 8: 51.44M ops/s
Run 9: 51.96M ops/s
Run 10: 52.46M ops/s

Mean: 52.06M ops/s
StdDev: 404,146 ops/s
CV: 0.78%
```

### perf stat Baseline (FASTLANE_DIRECT=0)

```
Performance counter stats for 'env -i PATH= HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 ./bench_random_mixed_hakmem 200000000 400 1':

    17,775,213,215      cycles
    39,980,185,471      instructions              # 2.25 insn per cycle
           111,712      L1-icache-load-misses
            26,039      iTLB-load-misses
            59,329      dTLB-load-misses
    10,297,849,396      branches
       232,502,367      branch-misses             # 2.26% of all branches

       4.486849039 seconds time elapsed
```

### perf stat Optimized (FASTLANE_DIRECT=1)

```
Performance counter stats for 'env -i PATH= HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 ./bench_random_mixed_hakmem 200000000 400 1':

    16,873,451,633      cycles
    33,889,807,627      instructions              # 2.01 insn per cycle
            98,542      L1-icache-load-misses
            36,835      iTLB-load-misses
            76,626      dTLB-load-misses
     8,304,201,436      branches
       232,239,642      branch-misses             # 2.80% of all branches

       4.247212223 seconds time elapsed
```

---

**END OF REPORT**

543
docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
Normal file
@ -0,0 +1,543 @@
# Phase 19: FastLane Instruction Reduction - Design Document

## 0. Executive Summary

**Goal**: Reduce the instruction/branch count gap between hakmem and libc to close the throughput gap
**Current Gap**: hakmem 44.88M ops/s vs libc 77.62M ops/s (+73.0% advantage for libc)
**Target**: Reduce instruction gap from +53.8% to <+25%, targeting +15-25% throughput improvement
**Success Criteria**: Achieve 52-56M ops/s (from current 44.88M ops/s)

### Key Findings

Per-operation overhead comparison (200M ops):

| Metric | hakmem | libc | Delta | Delta % |
|--------|--------|------|-------|---------|
| **Instructions/op** | 209.09 | 135.92 | +73.17 | **+53.8%** |
| **Branches/op** | 52.33 | 22.93 | +29.40 | **+128.2%** |
| Cycles/op | 96.48 | 54.69 | +41.79 | +76.4% |
| Branch-miss % | 2.22% | 2.87% | -0.65 pt | Better |

**Critical insight**: hakmem executes **73 extra instructions** and **29 extra branches** per operation vs libc.
This overhead likely accounts for most of the throughput gap.

---

## 1. Gap Analysis (Per-Operation Breakdown)

### 1.1 Instruction Gap: +73.17 instructions/op (+53.8%)

This excess comes from multiple layers of overhead:
- **FastLane wrapper checks**: ENV gates, class mask validation, size checks
- **Policy snapshot overhead**: TLS reads for routing decisions (3+ reads even with ENV snapshot)
- **Route determination**: Static route table lookup vs direct path
- **Multiple ENV gates**: Scattered throughout hot path (DUALHOT, LEGACY_DIRECT, C7_ULTRA, etc.)
- **Stats counters**: Atomic increments on hot path (FREE_PATH_STAT_INC, ALLOC_GATE_STAT_INC, etc.)
- **Header validation duplication**: FastLane + free_tiny_fast both validate header

### 1.2 Branch Gap: +29.40 branches/op (+128.2%)

In relative terms the branch gap is about **2.4x** the instruction gap (+128.2% vs +53.8%):
- **Cascading ENV checks**: Each layer adds 1-2 branches (g_initialized, class_mask, DUALHOT, C7_ULTRA, LEGACY_DIRECT)
- **Route dispatch**: Static route check + route_kind switch
- **Early-exit patterns**: Multiple if-checks for ULTRA/DUALHOT/LEGACY paths
- **Stats gating**: `if (__builtin_expect(...))` patterns around counters

### 1.3 Why Cycles/op Gap is Smaller Than Expected

The cycle gap (+76.4%) is well below the branch gap (+128.2%): the CPU still achieves 2.17 IPC (hakmem) vs 2.49 IPC (libc).
This suggests:
- **Good CPU pipelining**: Branch predictor is working well (2.22% miss rate)
- **I-cache locality**: Code is reasonably compact despite extra instructions
- **But**: We're paying for every extra branch in pipeline stalls

---

## 2. Hot Path Breakdown (perf report)

Top 10 hot functions (% of cycles):

| Function | % time | Category | Reduction Target? |
|----------|--------|----------|-------------------|
| **front_fastlane_try_free** | 23.97% | Wrapper | ✓ **YES** (remove layer) |
| **malloc** | 23.84% | Wrapper | ✓ **YES** (remove layer) |
| main | 22.02% | Benchmark | (baseline) |
| **free** | 6.82% | Wrapper | ✓ **YES** (remove layer) |
| unified_cache_push | 4.44% | Core | Optimize later |
| tiny_header_finalize_alloc | 4.34% | Core | Optimize later |
| tiny_c7_ultra_alloc | 3.38% | Core | Optimize later |
| tiny_c7_ultra_free | 2.07% | Core | Optimize later |
| hakmem_env_snapshot_enabled | 1.22% | ENV | ✓ **YES** (eliminate checks) |
| hak_super_lookup | 0.98% | Core | Optimize later |

**Critical observation**: Excluding the benchmark's own `main`, the top three functions are **all wrappers**:
- `front_fastlane_try_free` (23.97%) + `free` (6.82%) = **30.79%** on free wrappers
- `malloc` (23.84%) on alloc wrapper
- Combined wrapper overhead: **~54-55%** of all cycles

### 2.1 front_fastlane_try_free Annotated Breakdown

From `perf annotate`, the hot path has these expensive operations:

**Header validation** (addresses 1c786-1c791, ~3% samples):
```asm
movzbl -0x1(%rbp),%ebx      # Load header byte
mov    %ebx,%eax            # Copy to eax
and    $0xfffffff0,%eax     # Extract magic (0xA0)
cmp    $0xa0,%al            # Check magic
jne    ... (fallback)       # Branch on mismatch
```

**ENV snapshot checks** (addresses 1c7ff-1c822, ~7% samples):
```asm
cmpl   $0x1,0x628fa(%rip)   # g_hakmem_env_snapshot_ctor_mode (3.01%)
mov    0x628ef(%rip),%r15d  # g_hakmem_env_snapshot_gate (1.36%)
je     ...
cmp    $0xffffffff,%r15d
je     ... (init path)
test   %r15d,%r15d
jne    ... (snapshot path)
```

**Class routing overhead** (addresses 1c7d1-1c7fb, ~3% samples):
```asm
mov    0x6299c(%rip),%r15d  # g.5.lto_priv.0 (policy gate)
cmp    $0x1,%r15d
jne    ... (fallback)
movzbl 0x6298f(%rip),%eax   # g_mask.3.lto_priv.0
cmp    $0xff,%al
je     ... (all-classes path)
movzbl %al,%r9d
bt     %r13d,%r9d           # Bit test class mask
jae    ... (fallback)
```

**Total overhead**: ~15-20% of cycles in front_fastlane_try_free are spent on:
- Header validation (already done again in free_tiny_fast)
- ENV snapshot probing
- Policy/route checks

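
For readers less used to disassembly, the header check above corresponds roughly to the C below. This is a sketch under assumptions: the `DEMO_*` names and `demo_header_class` are hypothetical, the 0xA0 magic is taken from the `cmp $0xa0` instruction, and the low-nibble class index follows the `header & 0x0F` expression used in Candidate D.

```c
#include <stdint.h>

#define DEMO_HDR_MAGIC      0xA0u  /* high nibble, as compared in the disassembly */
#define DEMO_HDR_MAGIC_MASK 0xF0u
#define DEMO_HDR_CLASS_MASK 0x0Fu  /* low nibble = class index */

/* Returns the class index (0..15) if the byte before the user pointer
 * carries the magic, otherwise -1 so the caller can take the fallback path. */
static inline int demo_header_class(const void* user_ptr) {
    uint8_t header = *((const uint8_t*)user_ptr - 1);      /* movzbl -0x1(%rbp),%ebx */
    if ((header & DEMO_HDR_MAGIC_MASK) != DEMO_HDR_MAGIC)  /* and + cmp $0xa0 */
        return -1;                                         /* jne fallback */
    return (int)(header & DEMO_HDR_CLASS_MASK);
}
```
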
---

## 3. Reduction Candidates (Prioritized by ROI)

### Candidate A: **Eliminate FastLane Wrapper Layer** (Highest ROI)

**Problem**: front_fastlane_try_free + free wrappers consume 30.79% of cycles
**Root cause**: Double header validation + ENV checks + class mask checks

**Proposal**: Direct call to free_tiny_fast() from free() wrapper

**Implementation**:
```c
// In free() wrapper:
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;

    // Phase 19-A: Direct call (no FastLane layer)
    if (free_tiny_fast(ptr)) {
        return;  // Handled
    }

    // Fallback to cold path
    free_cold(ptr);
}
```

**Reduction estimate**:
- **Instructions**: -15-20/op (eliminate duplicate header read, ENV checks, class mask checks)
- **Branches**: -5-7/op (remove FastLane gate checks)
- **Impact**: ~10-15% throughput improvement (remove 30% wrapper overhead)

**Risk**: **LOW** (free_tiny_fast already has validation + routing logic)

---

### Candidate B: **Consolidate ENV Snapshot Checks** (High ROI)

**Problem**: ENV snapshot is checked **3+ times per operation**:
1. FastLane entry: `g_initialized` check
2. Route determination: `hakmem_env_snapshot_enabled()` check
3. Route-specific: `tiny_c7_ultra_enabled_env()` check
4. Legacy fallback: Another ENV snapshot check

**Proposal**: Single ENV snapshot read at entry, pass context down

**Implementation**:
```c
// Phase 19-B: ENV context struct
typedef struct {
    bool c7_ultra_enabled;
    bool dualhot_enabled;
    bool legacy_direct_enabled;
    SmallRouteKind route_kind[8];  // Pre-computed routes
} FastLaneCtx;

static __thread FastLaneCtx g_fastlane_ctx = {0};
static __thread int g_fastlane_ctx_init = 0;

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_init == 0, 0)) {
        // One-time init per thread
        const HakmemEnvSnapshot* env = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = env->tiny_c7_ultra_enabled;
        // ... populate other fields
        g_fastlane_ctx_init = 1;
    }
    return &g_fastlane_ctx;
}
```

**Reduction estimate**:
- **Instructions**: -8-12/op (eliminate redundant TLS reads)
- **Branches**: -3-5/op (single init check instead of multiple)
- **Impact**: ~5-8% throughput improvement

**Risk**: **MEDIUM** (need to handle ENV changes during runtime - use invalidation hook)

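
One possible shape for the invalidation hook mentioned under Risk is a generation counter: the refresh path bumps a global generation, and each thread rebuilds its cached context when the generation it last saw is stale. The sketch below is an assumption for illustration only. All `demo_*` and `Demo*` names are hypothetical, and it assumes refreshes happen during setup rather than concurrently with hot-path readers.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the real ENV snapshot. */
typedef struct { bool c7_ultra_enabled; } DemoEnv;
static DemoEnv g_demo_env = { false };

/* Bumped on every refresh. Starts at 1 so that a zero-initialized
 * thread-local context is always stale on first use. */
static _Atomic uint32_t g_demo_env_gen = 1;

typedef struct {
    uint32_t seen_gen;          /* generation this copy was built from */
    bool     c7_ultra_enabled;
} DemoCtx;

static _Thread_local DemoCtx g_demo_ctx;  /* zero-initialized: seen_gen == 0 */

/* Invalidation hook: publish the new settings, then bump the generation. */
static void demo_env_refresh(bool c7_ultra) {
    g_demo_env.c7_ultra_enabled = c7_ultra;
    atomic_fetch_add_explicit(&g_demo_env_gen, 1, memory_order_release);
}

/* Hot path: one atomic load and one compare while the context is current. */
static inline const DemoCtx* demo_ctx_get(void) {
    uint32_t gen = atomic_load_explicit(&g_demo_env_gen, memory_order_acquire);
    if (g_demo_ctx.seen_gen != gen) {  /* stale or first use */
        g_demo_ctx.c7_ultra_enabled = g_demo_env.c7_ultra_enabled;
        g_demo_ctx.seen_gen = gen;
    }
    return &g_demo_ctx;
}
```
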
---

### Candidate C: **Remove Stats Counters from Hot Path** (Medium ROI)

**Problem**: Stats counters on hot path add atomic increments:
- `FRONT_FASTLANE_STAT_INC(free_total)` (every op)
- `FREE_PATH_STAT_INC(total_calls)` (every op)
- `ALLOC_GATE_STAT_INC(total_calls)` (every alloc)
- `tiny_front_free_stat_inc(class_idx)` (every free)

**Proposal**: Make stats DEBUG-only or sample-based (1-in-N)

**Implementation**:
```c
// Phase 19-C: Sampling-based stats (these statements sit inside the free hot path)
#if !HAKMEM_BUILD_RELEASE
static __thread uint32_t g_stat_counter = 0;
if (__builtin_expect((++g_stat_counter & 0xFFF) == 0, 0)) {
    // Sample 1-in-4096 operations
    FRONT_FASTLANE_STAT_INC(free_total);
}
#endif
```

**Reduction estimate**:
- **Instructions**: -4-6/op (remove atomic increments)
- **Branches**: -2-3/op (remove `if (__builtin_expect(...))` checks)
- **Impact**: ~3-5% throughput improvement

**Risk**: **LOW** (stats already compile-time optional)

---

### Candidate D: **Inline Header Validation** (Medium ROI)

**Problem**: Header validation happens twice:
1. FastLane wrapper: `*((uint8_t*)ptr - 1)` (lines 179-191 in front_fastlane_box.h)
2. free_tiny_fast: Same check (lines 598-605 in malloc_tiny_fast.h)

**Proposal**: Trust FastLane validation, remove duplicate check

**Implementation**:
```c
// Phase 19-D: Add "trusted" variant
static inline int free_tiny_fast_trusted(void* ptr, int class_idx, void* base) {
    // Skip header validation (caller already validated)
    // Direct to route dispatch
    ...
}

// In FastLane:
uint8_t header = *((uint8_t*)ptr - 1);
int class_idx = header & 0x0F;
void* base = tiny_user_to_base_inline(ptr);
return free_tiny_fast_trusted(ptr, class_idx, base);
```

**Reduction estimate**:
- **Instructions**: -3-5/op (remove duplicate header load + extract)
- **Branches**: -1-2/op (remove duplicate magic check)
- **Impact**: ~2-3% throughput improvement

**Risk**: **MEDIUM** (need to ensure all callers validate header)

---

### Candidate E: **Static Route Table Optimization** (Lower ROI)

**Problem**: Route determination uses TLS lookups + bit tests:
```c
if (tiny_static_route_ready_fast()) {
    route_kind = tiny_static_route_get_kind_fast(class_idx);
} else {
    route_kind = tiny_policy_hot_get_route(class_idx);
}
```

**Proposal**: Pre-compute common routes at init, inline direct paths

**Implementation**:
```c
// Phase 19-E: Route fast path (C0-C3 LEGACY, C7 ULTRA)
static __thread uint8_t g_route_fastmap = 0;  // bit 0=C0...bit 7=C7, 1=LEGACY

static inline bool is_legacy_route_fast(int class_idx) {
    return (g_route_fastmap >> class_idx) & 1;
}
```

**Reduction estimate**:
- **Instructions**: -3-4/op (replace function call with bit test)
- **Branches**: -1-2/op (replace nested if with single bit test)
- **Impact**: ~2-3% throughput improvement

**Risk**: **LOW** (route table is already static)

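
The proposal leaves open how the bitmap gets filled at init. One way it might be built is sketched below, assuming a hypothetical `DemoRouteKind` enum in place of the real `SmallRouteKind` and a per-class route array as input:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical route kinds; the real SmallRouteKind enum may differ. */
typedef enum {
    DEMO_ROUTE_LEGACY = 0,
    DEMO_ROUTE_ULTRA  = 1,
    DEMO_ROUTE_OTHER  = 2
} DemoRouteKind;

/* Build the bitmap once at init: bit i is set when class i uses the LEGACY route. */
static uint8_t demo_build_route_fastmap(const DemoRouteKind kinds[8]) {
    uint8_t map = 0;
    for (int c = 0; c < 8; c++) {
        if (kinds[c] == DEMO_ROUTE_LEGACY) {
            map |= (uint8_t)(1u << c);
        }
    }
    return map;
}

/* Same bit test as is_legacy_route_fast(), with the map passed explicitly. */
static inline bool demo_is_legacy(uint8_t map, int class_idx) {
    return (map >> class_idx) & 1;
}
```
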
---

## 4. Combined Impact Estimate

Assuming independent reductions (conservative estimate with 80% efficiency due to overlap):

| Candidate | Instructions/op | Branches/op | Throughput |
|-----------|-----------------|-------------|------------|
| Baseline | 209.09 | 52.33 | 44.88M ops/s |
| **A: Remove FastLane layer** | -17.5 | -6.0 | +12% |
| **B: ENV snapshot consolidation** | -10.0 | -4.0 | +6% |
| **C: Stats removal (Release)** | -5.0 | -2.5 | +4% |
| **D: Inline header validation** | -4.0 | -1.5 | +2% |
| **E: Static route fast path** | -3.5 | -1.5 | +2% |
| **Combined (80% efficiency)** | **-32.0** | **-12.4** | **+21%** |

**Projected outcome**:
- Instructions/op: 209.09 → **177.09** (vs libc 135.92, gap reduced from +53.8% to +30.3%)
- Branches/op: 52.33 → **39.93** (vs libc 22.93, gap reduced from +128.2% to +74.1%)
- Throughput: 44.88M → **54.3M ops/s** (vs libc 77.62M, gap reduced from +73.0% to +42.9%)

**Achievement vs Goal**: ✓ Within target range (+21% vs +15-25% goal)

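
The combined row is the sum of the per-candidate estimates scaled by the 0.8 overlap factor, and the projected gaps follow from subtracting that saving from the hakmem baseline. A short sketch of the arithmetic, with hypothetical helper names and the table values hard-coded:

```c
/* Per-candidate estimates from the table above (A..E). */
static const double k_instr[5]  = { 17.5, 10.0, 5.0, 4.0, 3.5 };
static const double k_branch[5] = {  6.0,  4.0, 2.5, 1.5, 1.5 };

/* Sum the individual savings and scale by the overlap efficiency (0.8 here). */
static double combined_saving(const double* per_candidate, int n, double efficiency) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += per_candidate[i];
    }
    return sum * efficiency;
}

/* Remaining gap versus libc, in percent, after subtracting a saving. */
static double gap_pct(double hakmem_per_op, double saving, double libc_per_op) {
    return ((hakmem_per_op - saving) / libc_per_op - 1.0) * 100.0;
}
```
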
---

## 5. Implementation Plan

### Phase 19-1: Remove FastLane Wrapper Layer (A)
**Priority**: P0 (highest ROI)
**Effort**: 2-3 hours
**Risk**: Low (free_tiny_fast already complete)

Steps:
1. Modify `free()` wrapper to directly call `free_tiny_fast(ptr)`
2. Modify `malloc()` wrapper to directly call `malloc_tiny_fast(size)`
3. Measure: Expect +10-15% throughput
4. Fallback: Keep FastLane as compile-time option

### Phase 19-2: ENV Snapshot Consolidation (B)
**Priority**: P1 (high ROI, moderate risk)
**Effort**: 4-6 hours
**Risk**: Medium (ENV invalidation needed)

Steps:
1. Create `FastLaneCtx` struct with pre-computed ENV state
2. Add TLS cache with invalidation hook
3. Replace scattered ENV checks with single context read
4. Measure: Expect +5-8% throughput on top of Phase 19-1
5. Fallback: ENV-gate new path (HAKMEM_FASTLANE_ENV_CTX=1)

### Phase 19-3: Stats Removal (C) + Header Inline (D)
**Priority**: P2 (medium ROI, low risk)
**Effort**: 2-3 hours
**Risk**: Low (already compile-time optional)

Steps:
1. Make stats sample-based (1-in-4096) in Release builds
2. Add `free_tiny_fast_trusted()` variant (skip header validation)
3. Measure: Expect +3-5% throughput on top of Phase 19-2
4. Fallback: Compile-time flags for both features

### Phase 19-4: Static Route Fast Path (E)
**Priority**: P3 (lower ROI, polish)
**Effort**: 2-3 hours
**Risk**: Low (route table is static)

Steps:
1. Add `g_route_fastmap` TLS cache
2. Replace function calls with bit tests
3. Measure: Expect +2-3% throughput on top of Phase 19-3
4. Fallback: Keep existing path as fallback

---

## 6. Box Theory Compliance

### Boundary Preservation
- **L0 (ENV)**: Keep existing ENV gates, add new ones for each optimization
- **L1 (Hot inline)**: free_tiny_fast(), malloc_tiny_fast() remain unchanged
- **L2 (Cold fallback)**: free_cold(), malloc_cold() remain unchanged
- **L3 (Stats)**: Make optional via #if guards

### Reversibility
- Each phase is ENV-gated (can revert at runtime)
- Compile-time fallback preserved (HAKMEM_BUILD_RELEASE controls stats)
- FastLane layer can be kept as compile-time option for A/B testing

### Incremental Rollout
- Phase 19-1: Remove wrapper (default ON)
- Phase 19-2: ENV context (default OFF, opt-in for testing)
- Phase 19-3: Stats/header (default ON in Release, OFF in Debug)
- Phase 19-4: Route fast path (default ON)

---

## 7. Validation Checklist

After each phase:
- [ ] Run perf stat (compare instructions/branches/cycles per-op)
- [ ] Run perf record + annotate (verify hot path reduction)
- [ ] Run benchmark suite (Mixed, C6-heavy, C7-heavy)
- [ ] Check correctness (Larson, multithreaded, stress tests)
- [ ] Measure RSS/memory overhead (should be unchanged)
- [ ] A/B test (ENV toggle to verify reversibility)

Success criteria:
- [ ] Throughput improvement matches estimate (±20%)
- [ ] Instruction count reduction matches estimate (±20%)
- [ ] Branch count reduction matches estimate (±20%)
- [ ] No correctness regressions (all tests pass)
- [ ] No memory overhead increase (RSS unchanged)

---

## 8. Risk Assessment

### High-Risk Areas
1. **ENV invalidation** (Phase 19-2): Runtime ENV changes could break cached context
   - Mitigation: Use invalidation hooks (existing hakmem_env_snapshot infrastructure)
   - Fallback: Revert to scattered ENV checks

2. **Header validation trust** (Phase 19-3D): Skipping validation could miss corruption
   - Mitigation: Keep validation in Debug builds, extensive testing
   - Fallback: Compile-time option to keep duplicate checks

### Medium-Risk Areas
1. **FastLane removal** (Phase 19-1): Could break gradual rollout (class_mask filtering)
   - Mitigation: Keep class_mask filtering in FastLane path only (direct path always falls back safely)
   - Fallback: Keep FastLane as compile-time option

### Low-Risk Areas
1. **Stats removal** (Phase 19-3C): Already compile-time optional
2. **Route fast path** (Phase 19-4): Route table is static, no runtime changes

---

## 9. Future Optimization Opportunities (Post-Phase 19)

After Phase 19 closes the wrapper gap, next targets:

1. **Unified Cache optimization** (4.44% cycles):
   - Reduce cache miss overhead (refill path)
   - Optimize LIFO vs ring buffer trade-off

2. **Header finalization** (4.34% cycles):
   - Investigate always_inline for tiny_header_finalize_alloc()
   - Reduce metadata writes (defer to batch update)

3. **C7 ULTRA optimization** (3.38% + 2.07% = 5.45% cycles):
   - Investigate TLS cache locality
   - Reduce ULTRA push/pop overhead

4. **Super lookup optimization** (0.98% cycles):
   - Already optimized in Phase 12 (mask-based)
   - Further reduction may require architectural changes

**Estimated ceiling**: With all optimizations, could approach ~65-70M ops/s (vs libc 77.62M)
**Remaining gap**: Likely fundamental architectural differences (thread-local vs global allocator)

---

## 10. Appendix: Detailed perf Data

### 10.1 perf stat Results (200M ops)

**hakmem (FORCE_LIBC=0)**:
```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=0':

    19,296,118,430      cycles
    41,817,886,925      instructions              # 2.17 insn per cycle
    10,466,190,806      branches
       232,592,257      branch-misses             # 2.22% of all branches
         1,660,073      cache-misses
           134,601      L1-icache-load-misses

       4.913685503 seconds time elapsed
Throughput: 44.88M ops/s
```

**libc (FORCE_LIBC=1)**:
```
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=1':

    10,937,550,228      cycles
    27,183,469,339      instructions              # 2.49 insn per cycle
     4,586,617,379      branches
       131,515,905      branch-misses             # 2.87% of all branches
           767,370      cache-misses
            64,102      L1-icache-load-misses

       2.835174452 seconds time elapsed
Throughput: 77.62M ops/s
```

### 10.2 Top 30 Hot Functions (perf report)

```
23.97%  front_fastlane_try_free.lto_priv.0
23.84%  malloc
22.02%  main
 6.82%  free
 4.44%  unified_cache_push.lto_priv.0
 4.34%  tiny_header_finalize_alloc.lto_priv.0
 3.38%  tiny_c7_ultra_alloc.constprop.0
 2.07%  tiny_c7_ultra_free
 1.22%  hakmem_env_snapshot_enabled.lto_priv.0
 0.98%  hak_super_lookup.part.0.lto_priv.4.lto_priv.0
 0.85%  hakmem_env_snapshot.lto_priv.0
 0.82%  hak_pool_free_v1_slow_impl
 0.59%  tiny_front_v3_snapshot_get.lto_priv.0
 0.30%  __memset_avx2_unaligned_erms (libc)
 0.30%  tiny_unified_lifo_enabled.lto_priv.0
 0.28%  hak_free_at.constprop.0
 0.24%  hak_pool_try_alloc.part.0
 0.24%  malloc_cold
 0.16%  hak_pool_try_alloc_v1_impl.part.0
 0.14%  free_cold.constprop.0
 0.13%  mid_inuse_dec_deferred
 0.12%  hak_pool_mid_lookup
 0.12%  do_user_addr_fault (kernel)
 0.11%  handle_pte_fault (kernel)
 0.11%  __mod_memcg_lruvec_state (kernel)
 0.10%  do_anonymous_page (kernel)
 0.09%  classify_ptr
 0.07%  tiny_get_max_size.lto_priv.0
 0.06%  __handle_mm_fault (kernel)
 0.06%  __alloc_pages (kernel)
```

---

## 11. Conclusion

Phase 19 has **clear, actionable targets** with high ROI:

1. **Immediate action (Phase 19-1)**: Remove FastLane wrapper layer
   - Expected: +10-15% throughput
   - Risk: Low
   - Effort: 2-3 hours

2. **Follow-up (Phases 19-2 to 19-4)**: ENV consolidation + stats + route optimization
   - Expected: +6-11% additional throughput
   - Risk: Medium (ENV invalidation)
   - Effort: 8-12 hours

**Combined target**: +21% throughput (44.88M → 54.3M ops/s)
**Gap closure**: Reduce instruction gap from +53.8% to +30.3% vs libc

This positions hakmem for competitive performance while maintaining safety and Box Theory compliance.

@@ -0,0 +1,64 @@
# Phase 19-2: FASTLANE_DIRECT Promotion + Rebaseline (Next Instructions)

## 0. Status (where we are)

- Phase 19-1b (FASTLANE_DIRECT) is **GO**: throughput **+5.88%** with **-15.23% instr/op** and **-19.36% branches/op**.
- Safety hardening completed:
  - `!g_initialized` → direct path is skipped (fail-fast, same rule as Front FastLane).
  - malloc miss no longer calls `malloc_cold()` directly; it falls through to the normal wrapper path (preserves `g_hakmem_lock_depth` invariants).
  - ENV cache is a single global `_Atomic` so `bench_profile` refresh affects wrappers.

## 1. Promotion policy (Box Theory)

- Keep rollback simple:
  - `HAKMEM_FASTLANE_DIRECT=0` → disable (fallback to Phase 6 FastLane wrapper path).
  - `HAKMEM_FASTLANE_DIRECT=1` → enable (direct `malloc_tiny_fast()` / `free_tiny_fast()` first).
- Promotion level:
  - **Preset promotion** (recommended): set `HAKMEM_FASTLANE_DIRECT=1` in `MIXED_TINYV3_C7_SAFE` and `C6_HEAVY_LEGACY_POOLV1` presets.
  - Keep **ENV default = 0** (opt-in) until real-world/LD_PRELOAD validation is done.

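The rollback rule above relies on an explicit environment value always winning over a promoted default. A minimal shell sketch of that precedence follows; it uses the same `${VAR:-default}` idiom as the cleanenv script, the function name `resolve_fastlane_direct` is hypothetical, and the real presets are applied inside the allocator rather than in shell.

```sh
# resolve_fastlane_direct (hypothetical helper): print the value that a
# preset-style default of 1 leaves in effect. The subshell keeps the
# caller's environment untouched.
resolve_fastlane_direct() {
  (
    export HAKMEM_FASTLANE_DIRECT="${HAKMEM_FASTLANE_DIRECT:-1}"
    printf '%s\n' "$HAKMEM_FASTLANE_DIRECT"
  )
}
```

With the variable unset the function prints `1`; when the caller sets `HAKMEM_FASTLANE_DIRECT=0` it prints `0`, which is the one-variable rollback described above.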
## 2. Required verification (same-binary A/B)

### 2.1 Mixed (10-run, clean env)

Baseline:
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
```

Optimized:
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FASTLANE_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
```

GO/NO-GO:
- GO: mean **+1.0%** or higher
- NEUTRAL: **±1.0%** → keep as preset-only (do not flip global default)
- NO-GO: **≤ -1.0%** → revert preset promotion

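The GO/NO-GO thresholds can be applied mechanically once both means are known. The sketch below assumes the two mean throughputs have already been extracted from the run output; the function name `ab_verdict` is hypothetical, and it treats exactly +1.0% as GO and exactly -1.0% as NO-GO.

```sh
# ab_verdict BASELINE_MEAN OPTIMIZED_MEAN (hypothetical helper)
# Prints the verdict and the relative change in percent.
ab_verdict() {
  awk -v base="$1" -v opt="$2" 'BEGIN {
    delta = (opt - base) / base * 100   # relative change in percent
    if (delta >= 1.0)       verdict = "GO"
    else if (delta <= -1.0) verdict = "NO-GO"
    else                    verdict = "NEUTRAL"
    printf "%s %+.2f%%\n", verdict, delta
  }'
}
```

For example, `ab_verdict 100 105.88` reports `GO +5.88%`, while a change inside the ±1.0% band reports NEUTRAL and the default stays untouched.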
### 2.2 C6-heavy (5-run)

```sh
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=0 ./bench_mid_large_mt_hakmem 1 1000000 400 1
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 HAKMEM_FASTLANE_DIRECT=1 ./bench_mid_large_mt_hakmem 1 1000000 400 1
```

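For the 5-run comparison, each side needs a mean over its runs. The benchmark's output format is not specified here, so this sketch assumes the throughput of each run has already been collected as one number per line; `mean_of` is a hypothetical helper name.

```sh
# mean_of (hypothetical helper): average the numbers read from stdin,
# one value per line. Prints nothing if no input is given.
mean_of() {
  awk '{ sum += $1; n++ } END { if (n) printf "%.2f\n", sum / n }'
}
```

Feeding it the five per-run numbers for `HAKMEM_FASTLANE_DIRECT=0` and then for `=1` gives the two means to compare.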
## 3. Perf stat capture (root-cause guardrails)

Run both A/B with:
```sh
perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses,iTLB-load-misses,dTLB-load-misses -- \
  ./bench_random_mixed_hakmem 200000000 400 1
```

Checklist:
- `instructions/op` and `branches/op` must improve (expected)
- iTLB/dTLB misses may worsen; accept only if throughput still improves

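`perf stat` reports raw counter totals, so the per-op figures in the checklist have to be derived by dividing by the operation count. The sketch below assumes the first benchmark argument (200000000) is the operation count; `per_op` is a hypothetical helper, and the counter value in the example is back-computed for illustration rather than measured.

```sh
# per_op COUNT OPS (hypothetical helper): normalize a raw perf counter
# total to a per-operation figure with two decimals.
per_op() {
  awk -v count="$1" -v ops="$2" 'BEGIN { printf "%.2f\n", count / ops }'
}

# Illustrative input only: 41818000000 instructions over 200M ops.
per_op 41818000000 200000000   # prints 209.09
```

Applying the same division to `instructions` and `branches` for both sides of the A/B gives directly comparable numbers.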
## 4. Next target selection (after promotion)

After Phase 19-2 is stable, re-run `perf record` on Mixed and choose the next box by **self% ≥ 5%**:
- If `unified_cache_push/pop` rises: focus on **UnifiedCache data-path** (touch fewer cache lines).
- If `tiny_header_finalize_alloc` rises: focus on **header finalize path** (but treat as high NO-GO risk; prior header work was often NEUTRAL).
- If ENV checks reappear in hot path: consider **Phase 19-3 (ENV check consolidation)**, but keep it in a separate research box.

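The self% ≥ 5% cut can be applied with a small filter. This sketch assumes the simplified two-column `PCT% symbol` layout used in this analysis' listings; real `perf report --stdio` output has extra columns, so the field numbers would need adjusting. `hot_symbols` is a hypothetical helper name.

```sh
# hot_symbols [THRESHOLD] (hypothetical helper): read "PCT% symbol" lines
# on stdin and print those at or above THRESHOLD percent (default 5).
hot_symbols() {
  awk -v min="${1:-5}" '
    $1 ~ /^[0-9.]+%$/ {
      pct = $1
      sub(/%$/, "", pct)               # strip the trailing percent sign
      if (pct + 0 >= min) print $1, $2
    }'
}
```

Piping a listing through `hot_symbols` leaves only the candidates worth considering as the next box; passing a different threshold, such as `hot_symbols 1`, widens the net.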
hakmem.d
@@ -178,7 +178,8 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
 core/box/front_fastlane_env_box.h core/box/front_fastlane_stats_box.h \
 core/box/front_fastlane_alloc_legacy_direct_env_box.h \
 core/box/tiny_front_hot_box.h core/box/tiny_front_cold_box.h \
-core/box/smallobject_policy_v7_box.h core/box/../hakmem_internal.h
+core/box/smallobject_policy_v7_box.h core/box/fastlane_direct_env_box.h \
+core/box/../hakmem_internal.h
 core/hakmem.h:
 core/hakmem_build_flags.h:
 core/hakmem_config.h:
@@ -441,4 +442,5 @@ core/box/front_fastlane_alloc_legacy_direct_env_box.h:
 core/box/tiny_front_hot_box.h:
 core/box/tiny_front_cold_box.h:
 core/box/smallobject_policy_v7_box.h:
+core/box/fastlane_direct_env_box.h:
 core/box/../hakmem_internal.h:
BIN perf.data.phase19_hakmem (Normal file)
Binary file not shown.
@@ -18,6 +18,8 @@ export HAKMEM_MALLOC_TINY_DIRECT=${HAKMEM_MALLOC_TINY_DIRECT:-0}
 export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=${HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT:-0}
 export HAKMEM_FORCE_LIBC_ALLOC=${HAKMEM_FORCE_LIBC_ALLOC:-0}
 export HAKMEM_ENV_SNAPSHOT_SHAPE=${HAKMEM_ENV_SNAPSHOT_SHAPE:-0}
+# NOTE: Phase 19-1b is promoted in presets. Keep cleanenv aligned by default.
+export HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
 # NOTE: Phase 9/10 are promoted (bench_profile defaults to 1). Keep cleanenv aligned by default.
 export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-1}
 export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-1}