## Summary Phase 18 v1 attempted layout optimization using section splitting + GC: - `-ffunction-sections -fdata-sections -Wl,--gc-sections` Result: **Catastrophic I-cache regression** - Throughput: -0.87% (48.94M → 48.52M ops/s) - I-cache misses: +91.06% (131K → 250K) - Variance: +80% (σ=0.45M → σ=0.81M) Root cause: Section-based splitting without explicit hot symbol ordering fragments code locality, destroying natural compiler/LTO layout. ## Build Knob Safety Makefile updated to separate concerns: - `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain) - `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO) Both kept as research boxes (default OFF). ## Verdict Freeze Phase 18 v1: - Do NOT use section-based linking without strong ordering strategy - Keep hot/cold attributes as placeholder (currently unused) - Proceed to Phase 18 v2: BENCH_MINIMAL compile-out Expected impact v2: +10-20% via instruction count reduction - GO threshold: +5% minimum, +8% preferred - Only continue if instructions clearly drop ## Files New: - docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md Modified: - Makefile (build knob safety isolation) - CURRENT_TASK.md (Phase 18 v1 verdict) - docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md ## Lessons 1. Layout optimization is extremely fragile without ordering guarantees 2. I-cache is first-order performance factor (IPC=2.30 is memory-bound) 3. Compiler defaults may be better than manual section splitting 4. Next frontier: instruction count reduction (stats/ENV removal) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
86 KiB
本線タスク(現在)
更新メモ(2025-12-15 Phase 18 HOT-TEXT-ISOLATION-1)
Phase 18 HOT-TEXT-ISOLATION-1: Hot Text Isolation v1 — ❌ NO-GO / FROZEN
結果: Mixed 10-run mean -0.87% 回帰、I-cache misses +91.06% 劣化。-ffunction-sections -Wl,--gc-sections による細粒度セクション化が I-cache locality を破壊。hot/cold 属性は実装済みだが未適用のため、デメリットのみが発生。
- A/B 結果:
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md - 指示書:
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md - 設計:
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md - 対処:
HOT_TEXT_ISOLATION=0(default) で rollback
主要原因:
- Section-based linking が自然な compiler locality を破壊
--gc-sectionsのリンク順序変更で I-cache が断片化- Hot/cold 属性が実際には適用されていない(実装の不完全性)
重要な知見:
- Phase 17 の結論を再確認: bottleneck は instruction count と memory latency
- Code layout 最適化では 2.30 IPC の壁を越えられない
- 次の一手: instruction count を直接削る Phase 18 v2 (BENCH_MINIMAL) へ
更新メモ(2025-12-14 Phase 6 FRONT-FASTLANE-1)
Phase 6 FRONT-FASTLANE-1: Front FastLane(Layer Collapse)— ✅ GO / 本線昇格
結果: Mixed 10-run で +11.13%(HAKMEM史上最大級の改善)。Fail-Fast/境界1箇所を維持したまま “入口固定費” を大幅削減。
- A/B 結果:
docs/analysis/PHASE6_FRONT_FASTLANE_1_AB_TEST_RESULTS.md - 実装レポート:
docs/analysis/PHASE6_FRONT_FASTLANE_1_IMPLEMENTATION_REPORT.md - 設計:
docs/analysis/PHASE6_FRONT_FASTLANE_1_DESIGN.md - 指示書(昇格/次):
docs/analysis/PHASE6_FRONT_FASTLANE_NEXT_INSTRUCTIONS.md - 外部回答(記録):
PHASE_ML2_CHATGPT_RESPONSE_FASTLANE.md
運用ルール:
- A/B は 同一バイナリで ENV トグル(削除/追加で別バイナリ比較にしない)
- Mixed 10-run は
scripts/run_mixed_10_cleanenv.sh基準(ENV 漏れ防止)
Phase 6-2 FRONT-FASTLANE-FREE-DEDUP: Front FastLane Free DeDup — ✅ GO / 本線昇格
結果: Mixed 10-run で +5.18%。front_fastlane_try_free() の二重ヘッダ検証を排除し、free 側の固定費をさらに削減。
- A/B 結果:
docs/analysis/PHASE6_FRONT_FASTLANE_2_FREE_DEDUP_AB_TEST_RESULTS.md - 指示書:
docs/analysis/PHASE6_FRONT_FASTLANE_2_FREE_DEDUP_NEXT_INSTRUCTIONS.md - ENV gate:
HAKMEM_FRONT_FASTLANE_FREE_DEDUP=0/1(default: 1, opt-out) - Rollback:
HAKMEM_FRONT_FASTLANE_FREE_DEDUP=0
成功要因:
- 重複検証の完全排除(
front_fastlane_try_free()→free_tiny_fast()直接呼び出し) - free パスの重要性(Mixed では free が約 50%)
- 実行安定性向上(変動係数 0.58%)
累積効果(Phase 6):
- Phase 6-1: +11.13%
- Phase 6-2: +5.18%
- 累積: ベースラインから約 +16-17% の性能向上
Phase 7 FRONT-FASTLANE-FREE-HOTCOLD-ALIGNMENT: FastLane Free Hot/Cold Alignment — ❌ NO-GO / FROZEN
結果: Mixed 10-run mean -2.16% 回帰。Hot/Cold split は wrapper 経由では有効だが、FastLane の超軽量経路では分岐/統計/TLS の固定費が勝ち、monolithic の方が速い。
- A/B 結果:
docs/analysis/PHASE7_FRONT_FASTLANE_FREE_HOTCOLD_1_AB_TEST_RESULTS.md - 指示書(記録):
docs/analysis/PHASE7_FRONT_FASTLANE_FREE_HOTCOLD_1_NEXT_INSTRUCTIONS.md - 対処: Rollback 済み(FastLane free は
free_tiny_fast()維持)
Phase 8 FREE-STATIC-ROUTE-ENV-CACHE-FIX: FREE-STATIC-ROUTE ENV Cache Hardening — ✅ GO / 本線昇格
結果: Mixed 10-run mean +2.61%、標準偏差 -61%。bench_profile の putenv() が main 前の ENV キャッシュ事故に負けて D1 が効かない問題を修正し、既存の勝ち箱(Phase 3 D1)が確実に効く状態を作った(本線品質向上)。
- 指示書(完了):
docs/analysis/PHASE8_FREE_STATIC_ROUTE_ENV_CACHE_FIX_1_NEXT_INSTRUCTIONS.md - 実装 + A/B:
docs/analysis/PHASE8_FREE_STATIC_ROUTE_ENV_CACHE_FIX_1_AB_TEST_RESULTS.md - コミット:
be723ca05
Phase 9 FREE-TINY-FAST MONO DUALHOT: monolithic free_tiny_fast() に C0–C3 direct 移植 — ✅ GO / 本線昇格
結果: Mixed 10-run mean +2.72%、標準偏差 -60.8%。Phase 7 の NO-GO(関数 split)を教訓に、monolithic 内 early-exit で “第2ホット(C0–C3)” を FastLane free にも通した。
- 指示書(完了):
docs/analysis/PHASE9_FREE_TINY_FAST_MONO_DUALHOT_1_NEXT_INSTRUCTIONS.md - 実装 + A/B:
docs/analysis/PHASE9_FREE_TINY_FAST_MONO_DUALHOT_1_AB_TEST_RESULTS.md - コミット:
871034da1 - Rollback:
export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0
Phase 10 FREE-TINY-FAST MONO LEGACY DIRECT: monolithic free_tiny_fast() の LEGACY direct を C4–C7 へ拡張 — ✅ GO / 本線昇格
結果: Mixed 10-run mean +1.89%。nonlegacy_mask(ULTRA/MID/V7)キャッシュにより誤爆を防ぎつつ、Phase 9(C0–C3)で取り切れていない LEGACY 範囲(C4–C7)を direct でカバーした。
- 指示書(完了):
docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md - 実装 + A/B:
docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_AB_TEST_RESULTS.md - コミット:
71b1354d3 - ENV:
HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1(default ON / opt-out) - Rollback:
export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0
Phase 11 ENV Snapshot "maybe-fast" API — ❌ NO-GO / FROZEN(設計ミス)
結果: Mixed 10-run mean -8.35%(51.65M → 47.33M ops/s)。hakmem_env_snapshot_maybe_fast() を inline 関数内で呼ぶことによる固定費が予想外に大きく、大幅な劣化が発生。
根本原因:
maybe_fast()をtiny_legacy_fallback_free_base()(inline)内で呼んだことで、毎回の free でctor_modecheck が走る- 既存設計(関数入口で 1 回だけ
enabled()判定)と異なり、inline helper 内での API 呼び出しは固定費が累積 - コンパイラ最適化が阻害される(unconditional call vs conditional branch)
教訓: ENV gate 最適化は gate 自体を改善すべきで、call site を変更すると逆効果。
- 指示書(完了):
docs/analysis/PHASE11_ENV_SNAPSHOT_MAYBE_FAST_1_NEXT_INSTRUCTIONS.md - 実装 + A/B:
docs/analysis/PHASE11_ENV_SNAPSHOT_MAYBE_FAST_1_AB_TEST_RESULTS.md - コミット:
ad73ca554(NO-GO 記録のみ、実装は完全 rollback) - 状態: FROZEN(ENV snapshot 参照の固定費削減は別アプローチが必要)
Phase 6-10 累積成果(マイルストーン達成)
結果: Mixed 10-run +24.6%(43.04M → 53.62M ops/s)🎉
Phase 6-10 で達成した累積改善:
- Phase 6-1 (FastLane): +11.13%(hakmem 史上最大の単一改善)
- Phase 6-2 (Free DeDup): +5.18%
- Phase 8 (ENV Cache Fix): +2.61%
- Phase 9 (MONO DUALHOT): +2.72%
- Phase 10 (MONO LEGACY DIRECT): +1.89%
- Phase 7 (Hot/Cold Align): -2.16% (NO-GO)
- Phase 11 (ENV maybe-fast): -8.35% (NO-GO)
技術パターン(確立):
- ✅ Wrapper-level consolidation(層の集約)
- ✅ Deduplication(重複削減)
- ✅ Monolithic early-exit(関数 split より有効)
- ❌ Function split for lightweight paths(軽量経路では逆効果)
- ❌ Call-site API changes(inline hot path での helper 呼び出しは累積 overhead)
詳細: docs/analysis/PHASE6_10_CUMULATIVE_RESULTS.md
Phase 12: Strategic Pause — ✅ COMPLETE(衝撃的発見)
Status: 🚨 CRITICAL FINDING - System malloc が hakmem より +63.7% 速い
Pause 実施結果:
-
Baseline 確定(10-run):
- Mean: 51.76M ops/s、Median: 51.74M、Stdev: 0.53M(CV 1.03% ✅)
- 非常に安定した性能
-
Health Check: ✅ PASS(MIXED, C6-HEAVY)
-
Perf Stat:
- Throughput: 52.06M ops/s
- IPC: 2.22(良好)、Branch miss: 2.48%(良好)
- Cache/dTLB miss も少ない(locality 良好)
-
Allocator Comparison(200M iterations):
Allocator Throughput vs hakmem RSS hakmem 52.43M ops/s Baseline 33.8MB jemalloc 48.60M ops/s -7.3% 35.6MB system malloc 85.96M ops/s +63.9% 🚨 N/A
衝撃的発見: System malloc (glibc ptmalloc2) が hakmem の 1.64 倍速い
Gap 原因の仮説(優先度順):
-
Header write overhead(最優先)
- hakmem: 各 allocation で 1-byte header write(400M writes / 200M iters)
- system: user pointer = base(header write なし?)
- Expected ROI: +10-20%
-
Thread cache implementation(高 ROI)
- system: tcache(glibc 2.26+、非常に高速)
- hakmem: TinyUnifiedCache
- Expected ROI: +20-30%
-
Metadata access pattern(中 ROI)
- hakmem: SuperSlab → Slab → Metadata の間接参照
- system: chunk metadata 連続配置
- Expected ROI: +5-10%
-
Classification overhead(低 ROI)
- hakmem: LUT + routing(FastLane で既に最適化)
- Expected ROI: +5%
-
Freelist management
- hakmem: header に埋め込み
- system: chunk 内配置(user data 再利用)
- Expected ROI: +5%
詳細: docs/analysis/PHASE12_STRATEGIC_PAUSE_RESULTS.md
Phase 13: Header Write Elimination v1 — NEUTRAL (+0.78%) ⚠️ RESEARCH BOX
Date: 2025-12-14 Verdict: NEUTRAL (+0.78%) — Frozen as research box (default OFF, manual opt-in)
Target: steady-state の header write tax 削減(最優先仮説)
Strategy (v1):
- C7 freelist がヘッダを壊さない形に寄せ、E5-2(write-once)を C7 にも適用可能にする
- ENV:
HAKMEM_TINY_C7_PRESERVE_HEADER=0/1(default: 0)
Results (4-Point Matrix):
| Case | C7_PRESERVE | WRITE_ONCE | Mean (ops/s) | Delta | Verdict |
|---|---|---|---|---|---|
| A (baseline) | 0 | 0 | 51,490,500 | — | — |
| B (E5-2 only) | 0 | 1 | 52,070,600 | +1.13% | candidate |
| C (C7 preserve) | 1 | 0 | 51,355,200 | -0.26% | NEUTRAL |
| D (Phase 13 v1) | 1 | 1 | 51,891,902 | +0.78% | NEUTRAL |
Key Findings:
-
E5-2 (HAKMEM_TINY_HEADER_WRITE_ONCE=1) は “単発 +1.13%” を観測したが、20-run 再テストで NEUTRAL (+0.54%)
- 参照:
docs/analysis/PHASE5_E5_2_HEADER_WRITE_ONCE_RETEST_AB_TEST_RESULTS.md - 結論: E5-2 は research box 維持(default OFF)
- 参照:
-
C7 preserve header alone: -0.26% (slight regression)
- C7 offset=1 memcpy overhead outweighs benefits
-
Combined (Phase 13 v1): +0.78% (positive but below GO)
- C7 preserve reduces E5-2 gains
Action:
- ✅ Freeze Phase 13 v1 as research box (default OFF)
- ✅ Re-test Phase 5 E5-2 (WRITE_ONCE=1) with dedicated 20-run → NEUTRAL (+0.54%)
- 📋 Document results:
docs/analysis/PHASE13_HEADER_WRITE_ELIMINATION_1_AB_TEST_RESULTS.md
Phase 5 E5-2: Header Write-Once — 再テスト NEUTRAL (+0.54%) ⚪
Date: 2025-12-14 Verdict: ⚪ NEUTRAL (+0.54%) — Research box 維持(default OFF)
Motivation: Phase 13 の 4点マトリクスで E5-2 単体が +1.13% を記録したため、専用 20-run で昇格可否を判定。
Results (20-run):
| Case | WRITE_ONCE | Mean (ops/s) | Median (ops/s) | Delta |
|---|---|---|---|---|
| A (baseline) | 0 | 51,096,839 | 51,127,725 | — |
| B (optimized) | 1 | 51,371,358 | 51,495,811 | +0.54% |
Verdict: NEUTRAL (+0.54%) — GO 閾値 (+1.0%) 未達
考察:
- Phase 13 の +1.13% は 10-run での観測値
- 専用 20-run では +0.54%(より信頼性が高い)
- 旧 E5-2 テスト (+0.45%) と一貫性あり
Action:
- ✅ Research box 維持(default OFF、manual opt-in)
- ENV:
HAKMEM_TINY_HEADER_WRITE_ONCE=0/1(default: 0) - 📋 詳細:
docs/analysis/PHASE5_E5_2_HEADER_WRITE_ONCE_RETEST_AB_TEST_RESULTS.md
Next: Phase 12 Strategic Pause の次の gap 仮説へ進む
Phase 14 v1: Pointer Chase Reduction (tcache-style) — NEUTRAL (+0.20%) ⚠️ RESEARCH BOX
Date: 2025-12-15 Verdict: NEUTRAL (+0.20%) — Frozen as research box (default OFF, manual opt-in)
Target: Reduce pointer-chase overhead with intrusive LIFO tcache layer (inspired by glibc tcache)
Strategy (v1):
- Add intrusive LIFO tcache layer (L1) before existing array-based UnifiedCache
- TLS per-class bins (head pointer + count)
- Intrusive next pointers stored in blocks (via tiny_next_store/load SSOT)
- Cap: 64 blocks per class (default, configurable)
- ENV:
HAKMEM_TINY_TCACHE=0/1(default: 0, OFF)
Results (Mixed 10-run):
| Case | TCACHE | Mean (ops/s) | Median (ops/s) | Delta |
|---|---|---|---|---|
| A (baseline) | 0 | 51,083,379 | 50,955,866 | — |
| B (optimized) | 1 | 51,186,838 | 51,255,986 | +0.20% (mean) / +0.59% (median) |
Key Findings:
- Mean delta: +0.20% (below +1.0% GO threshold → NEUTRAL)
- Median delta: +0.59% (slightly better stability, but still NEUTRAL)
- Expected ROI (+15-25%) not achieved on Mixed workload
- ⚠️ v1 の統合点が “free 側中心” で、alloc ホットパス(
tiny_hot_alloc_fast())が tcache を消費しない- 現状:
unified_cache_push()は tcache に入るが、alloc 側は FIFO(g_unified_cache[].slots)のみ → tcache が実質 sink になりやすい - v1 の A/B は ROI を過小評価する可能性が高い(Phase 14 v2 で通電確認が必要)
- 現状:
Possible Reasons for Lower ROI:
- Workload mismatch: Mixed (16–1024B) spans C0-C7, but tcache benefits may be concentrated in hot classes (C2/C3)
- Existing cache efficiency: UnifiedCache array access may already be well-cached in L1/L2
- Cap too small: Default cap=64 may cause frequent overflow to array cache
- Intrusive next overhead: Writing/reading next pointers may offset pointer-chase reduction
Action:
- ✅ Freeze Phase 14 v1 as research box (default OFF)
- ENV:
HAKMEM_TINY_TCACHE=0/1(default: 0),HAKMEM_TINY_TCACHE_CAP=64 - 📋 Results:
docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md - 📋 Design:
docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_DESIGN.md - 📋 Instructions:
docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_NEXT_INSTRUCTIONS.md - 📋 Next (Phase 14 v2):
docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md(alloc/pop 統合)
Future Work: Consider per-class cap tuning or alternative pointer-chase reduction strategies
Phase 14 v2: Pointer Chase Reduction — Hot Path Integration — NEUTRAL (+0.08%) ⚠️ RESEARCH BOX
Date: 2025-12-15
Verdict: NEUTRAL (+0.08% Mixed) / -0.39% (C7-only) — research box 維持(default OFF)
Motivation: Phase 14 v1 は “alloc 側が tcache を消費していない” 疑義があったため、tiny_front_hot_box の hot alloc/free に tcache を接続して再 A/B を実施。
Results:
| Workload | TCACHE=0 | TCACHE=1 | Delta |
|---|---|---|---|
| Mixed (16–1024B) | 51,287,515 | 51,330,213 | +0.08% |
| C7-only | 80,975,651 | 80,660,283 | -0.39% |
Conclusion:
- v2 で通電は確認したが、Mixed の “本線” 改善にはならず(GO 閾値 +1.0% 未達)
- Phase 14(tcache-style intrusive LIFO)は現状 freeze 維持が妥当
Possible root causes(次に掘るなら):
tiny_next_load/storeの fence/補助処理が TLS-only tcache には重すぎる可能性tiny_tcache_enabled/capの固定費(load/branch)が savings を相殺- Mixed では bin ごとの hit 率が薄い(workload mismatch)
Refs:
- v2 results:
docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_AB_TEST_RESULTS.md - v2 instructions:
docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md
Phase 15 v1: UnifiedCache FIFO→LIFO (Stack) — NEUTRAL (-0.70% Mixed, +0.42% C7) ⚠️ RESEARCH BOX
Date: 2025-12-15 Verdict: NEUTRAL (-0.70% Mixed, +0.42% C7-only) — research box 維持(default OFF)
Motivation: Phase 14(tcache intrusive)が NEUTRAL だったため、intrusive を増やさず、既存 TinyUnifiedCache.slots[] を FIFO ring から LIFO stack に変更して局所性改善を狙った。
Results:
| Workload | LIFO=0 (FIFO) | LIFO=1 (LIFO) | Delta |
|---|---|---|---|
| Mixed (16–1024B) | 52,965,966 | 52,593,948 | -0.70% |
| C7-only (1025–2048B) | 78,010,783 | 78,335,509 | +0.42% |
Conclusion:
- LIFO への変更は期待した効果なし(Mixed で劣化、C7 で微改善だが両方 GO 閾値未達)
- モード判定分岐オーバーヘッド(
tiny_unified_lifo_enabled())が局所性改善を相殺 - 既存 FIFO ring 実装が既に十分最適化されている
Root causes:
- Entry-point mode check overhead (
tiny_unified_lifo_enabled()call) - Minimal LIFO vs FIFO locality delta in practice (cache warming mitigates)
- Existing FIFO ring already well-optimized
Bonus: LTO bug fix for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
Refs:
- A/B results:
docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md - Design:
docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md - Instructions:
docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_NEXT_INSTRUCTIONS.md
Phase 14-15 Summary: Pointer-Chase & Cache-Shape Research ⚠️
Conclusion: 両 Phase とも NEUTRAL(研究箱として凍結)
| Phase | Approach | Mixed Delta | C7 Delta | Verdict |
|---|---|---|---|---|
| 14 v1 | tcache (free-side only) | +0.20% | N/A | NEUTRAL |
| 14 v2 | tcache (alloc+free) | +0.08% | -0.39% | NEUTRAL |
| 15 v1 | FIFO→LIFO (array cache) | -0.70% | +0.42% | NEUTRAL |
教訓:
- Pointer-chase 削減も cache 形状変更も、現状の TLS array cache に対して有意な改善を生まない
- 次の mimalloc gap(約 2.4x)を埋めるには、別次元のアプローチが必要
Phase 16 v1: Front FastLane Alloc LEGACY Direct — ⚠️ NEUTRAL (+0.62%) — research box 維持(default OFF)
Date: 2025-12-15 Verdict: NEUTRAL (+0.62% Mixed, +0.06% C6-heavy) — research box 維持(default OFF)
Motivation:
- Phase 14-15 は freeze(cache-shape/pointer-chase の ROI が薄い)
- free 側は "monolithic early-exit + dedup" が勝ち筋(Phase 9/10/6-2)
- alloc 側も同じ勝ち筋で、LEGACY ルート時の route/policy 固定費を FastLane 入口で削る
Results:
| Workload | ENV=0 (Baseline) | ENV=1 (Direct) | Delta |
|---|---|---|---|
| Mixed (16–1024B) | 47,510,791 | 47,803,890 | +0.62% |
| C6-heavy (257–768B) | 21,134,240 | 21,147,197 | +0.06% |
Critical Issue & Fix:
- Segfault discovered: Initial implementation crashed for C4-C7 during
unified_cache_refill()→tiny_next_read() - Root cause: Refill logic incompatibility for classes C4-C7
- Safety fix: Limited optimization to C0-C3 only (matching existing dualhot pattern)
- Code constraint:
if (... && (unsigned)class_idx <= 3u)added to line 96 offront_fastlane_box.h
Conclusion:
- Optimization overlaps with existing dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) for C0-C3
- Limited scope (C0-C3 only) reduces potential benefit
- Route/policy overhead already minimized by Phase 6 FastLane collapse
- Pattern continues from Phase 14-15: dispatch-layer optimizations showing NEUTRAL results
Root causes of limited benefit:
- Safety constraint: C4-C7 excluded due to refill bug
- Overlap with dualhot: C0-C3 already have direct path when dualhot enabled
- Route overhead not dominant: Phase 6 already collapsed major dispatch costs
Recommendations:
- Freeze as research box (default OFF, no preset promotion)
- Investigate C4-C7 refill issue before expanding scope
- Shift optimization focus away from dispatch layers (Phase 14/15/16 all NEUTRAL)
Refs:
- A/B results:
docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md - Design:
docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md - Instructions:
docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md - ENV:
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1(default: 0, opt-in)
Phase 14-16 Summary: Post-FastLane Research Phases ⚠️
Conclusion: Phase 14-16 全て NEUTRAL(研究箱として凍結)
| Phase | Approach | Mixed Delta | Verdict |
|---|---|---|---|
| 14 v1 | tcache (free-side only) | +0.20% | NEUTRAL |
| 14 v2 | tcache (alloc+free) | +0.08% | NEUTRAL |
| 15 v1 | FIFO→LIFO (array cache) | -0.70% | NEUTRAL |
| 16 v1 | Alloc LEGACY direct | +0.62% | NEUTRAL |
教訓:
- Pointer-chase 削減、cache 形状変更、dispatch early-exit いずれも有意な改善なし
- Phase 6 FastLane collapse (入口固定費削減) 以降、dispatch/routing レイヤの最適化は ROI が薄い
- 次の mimalloc gap(約 2.4x)を埋めるには、cache miss cost / memory layout / backend allocation 等の別次元が必要
Phase 17: FORCE_LIBC Gap Validation(same-binary A/B)✅ COMPLETE (2025-12-15)
目的: 「system malloc が速い」観測の SSOT 化。同一バイナリで hakmem vs libc を A/B し、gap の本体(allocator差 / layout差)を切り分ける。
結果: Case B 確定 — Allocator差 negligible (+0.39%), Layout penalty dominant (+73.57%)
Gap Breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s (mean), 48.12M ops/s (median)
- libc same-binary (FORCE_LIBC=1): 48.31M ops/s (mean), 48.31M ops/s (median)
- Allocator差: +0.39% (libc slightly faster, within noise)
- system binary (21K): 83.85M ops/s (mean), 83.75M ops/s (median)
- Layout penalty: +73.57% (small binary vs large binary 653K)
- Total gap: +74.26% (hakmem → system binary)
Perf Stat Analysis (200M iters, 1-run):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
Root Cause: Binary size (653K vs 21K, 30x difference) causes I-cache thrashing. Code bloat >> algorithmic efficiency.
教訓:
- Phase 12 の「system malloc 1.6x faster」観測は正しかったが、原因は allocator アルゴリズムではなく binary layout
- Same-binary A/B が必須(別バイナリ比較は layout confound で誤判定)
- I-cache efficiency が allocator-heavy workload の first-order factor
Next Direction (Case B 推奨):
- Phase 18: Hot Text Isolation / Layout Control
- Priority 1: Cold code isolation (
__attribute__((cold,noinline))+ separate TU) - Priority 2: Link-order optimization (hot functions contiguous placement)
- Priority 3: PGO (optional, profile-guided layout)
- Target: +10% throughput via I-cache optimization (48.1M → 52.9M ops/s)
- Success metric: I-cache misses -30% (153K → 107K)
- Priority 1: Cold code isolation (
Files:
- Results:
docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md - Instructions:
docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md
Phase 18: Hot Text Isolation / Layout Control — NEXT
目的: Binary layout 最適化で I-cache 効率を改善し、system binary との gap を削減。
戦略:
-
Cold Code Isolation (優先度 1)
- Stats 収集、debug logging、error handlers を別 TU へ移動
__attribute__((cold, noinline))で明示的に cold マーク- 予想効果: I-cache misses -20%
-
Link-Order Optimization (優先度 2)
- Hot functions を連続配置(linker script or link order control)
-ffunction-sections+ custom linker script- 予想効果: I-cache misses -10%
-
Profile-Guided Optimization (優先度 3, optional)
-fprofile-generate+-fprofile-useで実測ベース配置- 予想効果: I-cache misses -10-20%
Build Gate: HOT_TEXT_ISOLATION=0/1(layout A/B 用)
Target:
- v1(TU split / attrs / optional gc-sections): +2% で GO(NEUTRAL が起きやすい想定)
- v2(BENCH_MINIMAL compile-out): +10–20% を狙う(instruction footprint を直接削る)
設計: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md
指示書: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
結果(v1): docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md(❌ NO-GO / I-cache miss 悪化)
実装ゲート(戻せる):
- Makefile knob:
HOT_TEXT_ISOLATION=0/1 - Compile-time:
-DHAKMEM_HOT_TEXT_ISOLATION=0/1
更新メモ(2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot)
Phase 5 E5-3: Candidate Analysis & Strategic Recommendations ⚠️ DEFER (2025-12-14)
Decision: DEFER all E5-3 candidates (E5-3a/b/c). Pivot to E5-4 (Malloc Direct Path, E5-1 pattern replication).
Analysis:
- E5-3a (free_tiny_fast_cold 7.14%): NO-GO (cold path, low frequency despite high self%)
- E5-3b (unified_cache_push 3.39%): MAYBE (already optimized, marginal ROI ~+1.0%)
- E5-3c (hakmem_env_snapshot_enabled 2.97%): NO-GO (E3-4 precedent shows -1.44% regression)
Key Insight: Profiler self% ≠ optimization opportunity
- Self% is time-weighted (samples during execution), not frequency-weighted
- Cold paths appear hot due to expensive operations when hit, not total cost
- E5-2 lesson: 3.35% self% → +0.45% NEUTRAL (branch overhead ≈ savings)
ROI Assessment:
| Candidate | Self% | Frequency | Expected Gain | Risk | Decision |
|---|---|---|---|---|---|
| E5-3a (cold path) | 7.14% | LOW | +0.5% | HIGH | NO-GO |
| E5-3b (push) | 3.39% | HIGH | +1.0% | MEDIUM | DEFER |
| E5-3c (env snapshot) | 2.97% | HIGH | -1.0% | HIGH | NO-GO |
Strategic Pivot: Focus on E5-1 Success Pattern (wrapper-level deduplication)
- E5-1 (Free Tiny Direct): +3.35% (GO) ✅
- Next: E5-4 (Malloc Tiny Direct) - Apply E5-1 pattern to alloc side
- Expected: +2-4% (similar to E5-1, based on malloc wrapper overhead)
Cumulative Status (Phase 5):
- E4-1 (Free Wrapper Snapshot): +3.51% standalone
- E4-2 (Malloc Wrapper Snapshot): +21.83% standalone
- E4 Combined: +6.43% (from baseline with both OFF)
- E5-1 (Free Tiny Direct): +3.35% (from E4 baseline)
- E5-2 (Header Write-Once): +0.45% NEUTRAL (frozen)
- E5-3: DEFER (analysis complete, no implementation/test)
- Total Phase 5: ~+9-10% cumulative (E4+E5-1 promoted, E5-2 frozen, E5-3 deferred)
Implementation (E5-3a research box, NOT TESTED):
- Files created:
core/box/free_cold_shape_env_box.{h,c}(ENV gate, default OFF)core/box/free_cold_shape_stats_box.{h,c}(stats counters)docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md(analysis)
- Files modified:
core/front/malloc_tiny_fast.h(lines 418-437, cold path shape optimization)
- Pattern: Early exit for LEGACY path (skip LARSON check when !use_tiny_heap)
- Status: FROZEN (default OFF, pre-analysis shows NO-GO, not worth A/B testing)
Key Lessons:
- Profiler self% misleads when frequency is low (cold path)
- Micro-optimizations plateau in already-optimized code (E5-2, E5-3b)
- Branch hints are profile-dependent (E3-4 failure, E5-3c risk)
- Wrapper-level deduplication wins (E4-1, E4-2, E5-1 pattern)
Next Steps:
- E5-4 Design: Malloc Tiny Direct Path (E5-1 pattern for alloc)
- Target: malloc() wrapper overhead (~12.95% self% in E4 profile)
- Method: Single size check → direct call to malloc_tiny_fast_for_class()
- Expected: +2-4% (based on E5-1 precedent +3.35%)
- Design doc:
docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_DESIGN.md - Next instructions:
docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
更新メモ(2025-12-14 Phase 5 E5-2 Complete - Header Write-Once)
Phase 5 E5-2: Header Write-Once Optimization ⚪ NEUTRAL (2025-12-14)
Target: tiny_region_id_write_header (3.35% self%)
- Strategy: Write headers ONCE at refill boundary, skip writes in hot allocation path
- Hypothesis: Header writes are redundant for reused blocks (C1-C6 preserve headers)
- Goal: +1-3% by eliminating redundant header writes
A/B Test Results (Mixed, 10-run, 20M iters, ws=400):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median), σ=0.96M
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median), σ=0.48M
- Delta: +0.45% mean, -0.38% median ⚪
Decision: NEUTRAL (within ±1.0% threshold → FREEZE as research box)
- Mean +0.45% < +1.0% GO threshold
- Median -0.38% suggests no consistent benefit
- Action: Keep as research box (default OFF, do not promote to preset)
Why NEUTRAL?:
- Assumption incorrect: Headers are NOT redundant (already written correctly at freelist pop)
- Branch overhead: ENV gate + class check (~4 cycles) ≈ savings (~3-5 cycles)
- Net effect: Marginal benefit offset by branch overhead
Positive Outcome:
- Variance reduced 50%: σ dropped from 0.96M → 0.48M ops/s
- More stable performance (good for profiling/benchmarking)
Health Check: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 41.9M ops/s
- C6_HEAVY_LEGACY_POOLV1: 22.6M ops/s
- All profiles passed, no regressions
Implementation (FROZEN, default OFF):
- ENV gate:
HAKMEM_TINY_HEADER_WRITE_ONCE=0/1(default: 0, research box) - Files created:
core/box/tiny_header_write_once_env_box.h(ENV gate)core/box/tiny_header_write_once_stats_box.h(Stats counters)
- Files modified:
core/box/tiny_header_box.h(addedtiny_header_finalize_alloc())core/front/tiny_unified_cache.c(addedunified_cache_prefill_headers())core/box/tiny_front_hot_box.h(usetiny_header_finalize_alloc())
- Pattern: Prefill headers at refill boundary, skip writes in hot path
Key Lessons:
- Verify assumptions: perf self% doesn't always mean redundancy
- Branch overhead matters: Even "simple" checks can cancel savings
- Variance is valuable: Stability improvement is a secondary win
Cumulative Status (Phase 5):
- E4-1 (Free Wrapper Snapshot): +3.51% standalone
- E4-2 (Malloc Wrapper Snapshot): +21.83% standalone
- E4 Combined: +6.43% (from baseline with both OFF)
- E5-1 (Free Tiny Direct): +3.35% (from E4 baseline)
- E5-2 (Header Write-Once): +0.45% NEUTRAL (frozen as research box)
- Total Phase 5: ~+9-10% cumulative (E4+E5-1 promoted, E5-2 frozen)
Next Steps:
- E5-2: FROZEN as research box (default OFF, do not pursue)
- Profile new baseline (E4-1+E4-2+E5-1 ON) to identify next target
- Design docs:
docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.mddocs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
更新メモ(2025-12-14 Phase 5 E5-1 Complete - Free Tiny Direct Path)
Phase 5 E5-1: Free Tiny Direct Path ✅ GO (2025-12-14)
Target: Wrapper-level Tiny direct path optimization (reduce 29.56% combined free overhead)
- Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates: Redundant header validation + ENV snapshot overhead + cold path route determination
- Goal: Bypass wrapper tax for Tiny allocations (48% of frees in Mixed)
A/B Test Results (Mixed, 10-run, 20M iters, ws=400):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median), σ=0.25M
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median), σ=0.33M
- Delta: +3.35% mean, +3.36% median ✅
Decision: GO (+3.35% >= +1.0% threshold)
- Exceeds conservative estimate (+3-5%) → Achieved +3.35%
- Action: Promote to
MIXED_TINYV3_C7_SAFEpreset (HAKMEM_FREE_TINY_DIRECT=1 default) ✅
Health Check: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 41.9M ops/s
- C6_HEAVY_LEGACY_POOLV1: 21.1M ops/s
- All profiles passed, no regressions
Implementation:
- ENV gate:
HAKMEM_FREE_TINY_DIRECT=0/1(default: 0, preset(MIXED)=1) - Files created:
core/box/free_tiny_direct_env_box.h(ENV gate)core/box/free_tiny_direct_stats_box.h(Stats counters)
- Files modified:
core/box/hak_wrappers.inc.h(lines 593-625, wrapper integration)
- Pattern: Single header check (
(header & 0xF0) == 0xA0) → direct path - Safety: Page boundary guard, magic validation, class bounds check, fail-fast fallback
Why +3.35%?:
- Before (E4 baseline):
- free() wrapper: 21.67% self% (header + ENV snapshot + gate dispatch)
- free_tiny_fast_cold(): 7.89% self% (route determination + policy snapshot)
- Total: 29.56% overhead
- After (E5-1):
- free() wrapper: ~18-20% self% (single header check + direct call)
- Eliminated: ~9-10% overhead (30% reduction of 29.56%)
- Net gain: ~3.5% of total runtime (matches observed +3.35%)
Key Insight: Deduplication beats inlining. E5-1 eliminates redundant checks (header validated twice, ENV snapshot overhead), similar to E4's TLS consolidation pattern. This is the 3rd consecutive success with the "consolidation/deduplication" strategy.
Cumulative Status (Phase 5):
- E4-1 (Free Wrapper Snapshot): +3.51% standalone
- E4-2 (Malloc Wrapper Snapshot): +21.83% standalone
- E4 Combined: +6.43% (from baseline with both OFF)
- E5-1 (Free Tiny Direct): +3.35% (from E4 baseline, session variance)
- Total Phase 5: ~+9-10% cumulative (needs combined E4+E5-1 measurement)
Next Steps:
- ✅ Promote:
HAKMEM_FREE_TINY_DIRECT=1toMIXED_TINYV3_C7_SAFEpreset - ✅ E5-2: NEUTRAL → FREEZE
- ✅ E5-3: DEFER(ROI 低)
- ✅ E5-4: NEUTRAL → FREEZE
- ✅ E6: NO-GO → FREEZE
- ✅ E7: NO-GO(prune による -3%台回帰)→ 差し戻し
- Next: Phase 5 はここで一旦区切り(次は新しい “重複排除” か大きい構造変更を探索)
- Design docs:
docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.mddocs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.mddocs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.mddocs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.mddocs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.mddocs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_AB_TEST_RESULTS.mddocs/analysis/PHASE5_E6_ENV_SNAPSHOT_SHAPE_NEXT_INSTRUCTIONS.mddocs/analysis/PHASE5_E6_ENV_SNAPSHOT_SHAPE_AB_TEST_RESULTS.mddocs/analysis/PHASE5_E7_FROZEN_BOX_PRUNE_NEXT_INSTRUCTIONS.mddocs/analysis/PHASE5_E7_FROZEN_BOX_PRUNE_AB_TEST_RESULTS.mdPHASE_ML2_CHATGPT_QUESTIONNAIRE_FASTLANE.mdPHASE_ML2_CHATGPT_RESPONSE_FASTLANE.mddocs/analysis/PHASE6_FRONT_FASTLANE_1_DESIGN.mddocs/analysis/PHASE6_FRONT_FASTLANE_NEXT_INSTRUCTIONS.md
更新メモ(2025-12-14 Phase 5 E4 Combined Complete - E4-1 + E4-2 Interaction Analysis)
Phase 5 E4 Combined: E4-1 + E4-2 同時有効化 ✅ GO (2025-12-14)
Target: Measure combined effect of both wrapper ENV snapshots (free + malloc)
- Strategy: Enable both HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 and HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1
- Goal: Verify interaction (additive / subadditive / superadditive) and establish new baseline
A/B Test Results (Mixed, 10-run, 20M iters, ws=400):
- Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median), σ=0.38M
- Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median), σ=0.42M
- Delta: +6.43% mean, +6.74% median ✅
Individual vs Combined:
- E4-1 alone (free wrapper): +3.51%
- E4-2 alone (malloc wrapper): +21.83%
- Combined (both): +6.43%
- Interaction: 非加算(“単独” は別セッションの参考値。増分は E4 Combined A/B を正とする)
Analysis - Why Subadditive?:
- Baseline mismatch: E4-1 と E4-2 の “単独” A/B は別セッション(別バイナリ状態)で測られており、前提が一致しない
- E4-1: 45.35M → 46.94M(+3.51%)
- E4-2: 35.74M → 43.54M(+21.83%)
- 足し算期待値は作らず、同一バイナリでの E4 Combined A/B を “正” とする
- Shared Bottlenecks: Both optimizations target TLS read consolidation
- Once TLS access is optimized in one path, benefits in the other path are reduced
- Memory bandwidth / cache line effects are shared resources
- Branch Predictor Saturation: Both paths compete for branch predictor entries
- ENV snapshot checks add branches that compete for same predictor resources
- Combined overhead is non-linear
Health Check: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 42.3M ops/s
- C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s
- All profiles passed, no regressions
Perf Profile (New Baseline: both ON, 20M iters, 47.0M ops/s):
Top Hot Spots (self% >= 2.0%):
- free: 37.56% (wrapper + gate, still dominant)
- tiny_alloc_gate_fast: 13.73% (alloc gate, reduced from 19.50%)
- malloc: 12.95% (wrapper, reduced from 16.13%)
- main: 11.13% (benchmark driver)
- tiny_region_id_write_header: 6.97% (header write cost)
- tiny_c7_ultra_alloc: 4.56% (C7 alloc path)
- hakmem_env_snapshot_enabled: 4.29% (ENV snapshot overhead, visible)
- tiny_get_max_size: 4.24% (size limit check)
Next Phase 5 Candidates (self% >= 5%):
- free (37.56%): Still the largest hot spot, but harder to optimize further
- Already has ENV snapshot, hotcold path, static routing
- Next step: Analyze free path internals (tiny_free_fast structure)
- tiny_region_id_write_header (6.97%): Header write tax
- Phase 1 A3 showed always_inline is NO-GO (-4% on Mixed)
- Alternative: Reduce header writes (selective mode, cached writes)
Key Insight: ENV snapshot pattern は有効だが、複数パスに同時適用したときの増分は足し算にならない。評価は同一バイナリでの E4 Combined A/B(+6.43%)を正とする。
Decision: GO (+6.43% >= +1.0% threshold)
- New baseline: 47.34M ops/s (Mixed, 20M iters, ws=400)
- Both optimizations remain DEFAULT ON in MIXED_TINYV3_C7_SAFE
- Action: Shift focus to next bottleneck (free path internals or header write optimization)
Cumulative Status (Phase 5):
- E4-1 (Free Wrapper Snapshot): +3.51% standalone
- E4-2 (Malloc Wrapper Snapshot): +21.83% standalone (on top of E4-1)
- E4 Combined: +6.43% (from original baseline with both OFF)
- Total Phase 5: +6.43% (on top of Phase 4's +3.9%)
- Overall progress: 35.74M → 47.34M = +32.4% (from Phase 5 start to E4 combined)
Next Steps:
- Profile analysis: Identify E5 candidates (free path, header write, or other hot spots)
- Consider: free() fast path structure optimization (37.56% self% is large target)
- Consider: Header write reduction strategies (6.97% self%)
- Update design docs with subadditive interaction analysis
- Design doc:
docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
更新メモ(2025-12-14 Phase 5 E4-2 Complete - Malloc Gate Optimization)
Phase 5 E4-2: malloc Wrapper ENV Snapshot ✅ GO (2025-12-14)
Target: Consolidate TLS reads in malloc() wrapper to reduce 35.63% combined hot spot
- Strategy: Apply E4-1 success pattern (ENV snapshot consolidation) to malloc() side
- Combined target: malloc (16.13%) + tiny_alloc_gate_fast (19.50%) = 35.63% self%
- Implementation: Single TLS snapshot with packed flags (wrap_shape + front_gate + tiny_max_size_256)
- Reduce: 2+ TLS reads → 1 TLS read, eliminate tiny_get_max_size() function call
Implementation:
- ENV gate:
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0/1(default: 0, research box) - Files:
core/box/malloc_wrapper_env_snapshot_box.{h,c}(new ENV snapshot box) - Integration:
core/box/hak_wrappers.inc.h(lines 174-221, malloc() wrapper) - Optimization: Pre-cache
tiny_max_size() == 256to eliminate function call
A/B Test Results (Mixed, 10-run, 20M iters, ws=400):
- Baseline (SNAPSHOT=0): 35.74M ops/s (mean), 35.75M ops/s (median), σ=0.43M
- Optimized (SNAPSHOT=1): 43.54M ops/s (mean), 43.92M ops/s (median), σ=1.17M
- Delta: +21.83% mean, +22.86% median ✅
Decision: GO (+21.83% >> +1.0% threshold)
- EXCEEDED conservative estimate (+2-4%) → Achieved +21.83%
- 6.2x better than E4-1 (+3.51%) - malloc() has higher ROI than free()
- Action: Promote to default configuration (HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1)
Health Check: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 40.8M ops/s
- C6_HEAVY_LEGACY_POOLV1: 21.8M ops/s
- All profiles passed, no regressions
Why 6.2x better than E4-1?:
- Higher Call Frequency: malloc() called MORE than free() in alloc-heavy workloads
- Function Call Elimination: Pre-caching tiny_max_size()==256 removes function call overhead
- Better Branch Prediction: size <= 256 is highly predictable for tiny allocations
- Larger Target: 35.63% combined self% (malloc + tiny_alloc_gate_fast) vs free's 25.26%
Key Insight: malloc() wrapper optimization has 6.2x higher ROI than free() wrapper. ENV snapshot pattern continues to dominate, with malloc side showing exceptional gains due to function call elimination and higher call frequency.
Cumulative Status (Phase 5):
- E4-1 (Free Wrapper Snapshot): +3.51% (GO)
- E4-2 (Malloc Wrapper Snapshot): +21.83% (GO) ⭐ MAJOR WIN
- Combined estimate: ~+25-27% (to be measured with both enabled)
- Total Phase 5: +21.83% standalone (on top of Phase 4's +3.9%)
Next Steps:
- Measure combined effect (E4-1 + E4-2 both enabled)
- Profile new bottlenecks at 43.54M ops/s baseline
- Update default presets with HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1
- Design doc:
docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_DESIGN.md - Results:
docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_AB_TEST_RESULTS.md
更新メモ(2025-12-14 Phase 5 E4-1 Complete - Free Gate Optimization)
Phase 5 E4-1: Free Wrapper ENV Snapshot ✅ GO (2025-12-14)
Target: Consolidate TLS reads in free() wrapper to reduce 25.26% self% hot spot
- Strategy: Apply E1 success pattern (ENV snapshot consolidation), NOT E3-4 failure pattern
- Implementation: Single TLS snapshot with packed flags (wrap_shape + front_gate + hotcold)
- Reduce: 2 TLS reads → 1 TLS read, 4 branches → 3 branches
Implementation:
- ENV gate:
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1(default: 0, research box) - Files:
core/box/free_wrapper_env_snapshot_box.{h,c}(new ENV snapshot box) - Integration:
core/box/hak_wrappers.inc.h(lines 552-580, free() wrapper)
A/B Test Results (Mixed, 10-run, 20M iters, ws=400):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median), σ=0.34M
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median), σ=0.94M
- Delta: +3.51% mean, +4.07% median ✅
Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5%) → Achieved +3.51%
- Similar to E1 success (+3.92%) - ENV consolidation pattern works
- Action: Promote to
MIXED_TINYV3_C7_SAFEpreset (HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 default)
Health Check: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 42.5M ops/s
- C6_HEAVY_LEGACY_POOLV1: 23.0M ops/s
- All profiles passed, no regressions
Perf Profile (SNAPSHOT=1, 20M iters):
- free(): 25.26% (unchanged in this sample)
- NEW hot spot: hakmem_env_snapshot_enabled: 4.67% (ENV snapshot overhead visible)
- Note: Small sample (65 samples) may not be fully representative
- Overall throughput improved +3.51% despite ENV snapshot overhead cost
Key Insight: ENV consolidation continues to yield strong returns. Free path optimization via TLS reduction proves effective, matching E1's success pattern. The visible ENV snapshot overhead (4.67%) is outweighed by overall path efficiency gains.
Cumulative Status (Phase 5):
- E4-1 (Free Wrapper Snapshot): +3.51% (GO)
- Total Phase 5: ~+3.5% (on top of Phase 4's +3.9%)
Next Steps:
- ✅ Promoted:
MIXED_TINYV3_C7_SAFEでHAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1を default 化(opt-out 可) - ✅ Promoted:
MIXED_TINYV3_C7_SAFEでHAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1を default 化(opt-out 可) - Next: E4-1+E4-2 の累積 A/B を 1 本だけ確認して、新 baseline で perf を取り直す
- Design doc:
docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md - 指示書:
docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.mddocs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.mddocs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md
更新メモ(2025-12-14 Phase 4 E3-4 Complete - ENV Constructor Init)
Phase 4 E3-4: ENV Constructor Init ❌ NO-GO / FROZEN (2025-12-14)
Target: E1 の lazy init check(3.22% self%)を constructor init で排除
- E1 で ENV snapshot を統合したが、
hakmem_env_snapshot_enabled()の lazy check が残っていた - Strategy:
__attribute__((constructor(101)))で main() 前に gate 初期化
Implementation:
- ENV gate:
HAKMEM_ENV_SNAPSHOT_CTOR=0/1(default: 0, research box) core/box/hakmem_env_snapshot_box.c: Constructor function 追加core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check (constructor vs legacy)
A/B Test Results(re-validation) (Mixed, 10-run, 20M iters, ws=400, HAKMEM_ENV_SNAPSHOT=1):
- Baseline (CTOR=0): 47.55M ops/s (mean), 47.46M ops/s (median)
- Optimized (CTOR=1): 46.86M ops/s (mean), 46.97M ops/s (median)
- Delta: -1.44% mean, -1.03% median ❌
Decision: NO-GO / FROZEN
- 初回の +4.75% は再現しない(ノイズ/環境要因の可能性が高い)
- constructor mode は “追加の分岐/ロード” になり、現状の hot path では得にならない
- Action: default OFF のまま freeze(追わない)
- Design doc:
docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
Key Insight: “constructor で初期化” 自体は安全だが、性能面では現状 NO-GO。勝ち箱は E1 に集中する。
Cumulative Status (Phase 4):
- E1 (ENV Snapshot): +3.92% (GO)
- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
- E3-4 (Constructor Init): NO-GO / frozen
- Total Phase 4: ~+3.9%(E1 のみ)
Phase 4 E2: Alloc Per-Class FastPath ⚪ NEUTRAL (2025-12-14)
Target: C0-C3 dedicated fast path for alloc (bypass policy route for small sizes)
- Strategy: Skip policy snapshot + route determination for C0-C3 classes
- Reuse DUALHOT pattern from free path (which achieved +13% for C0-C3)
- Baseline: HAKMEM_ENV_SNAPSHOT=1 enabled (E1 active)
Implementation:
- ENV gate:
HAKMEM_TINY_ALLOC_DUALHOT=0/1(already exists, default: 0) - Integration:
malloc_tiny_fast_for_class()lines 247-259 - C0-C3 check: Direct to LEGACY unified cache when enabled
- Pattern: Probe window lazy init (64-call tolerance for early putenv)
A/B Test Results (Mixed, 10-run, 20M iters, HAKMEM_ENV_SNAPSHOT=1):
- Baseline (DUALHOT=0): 45.40M ops/s (mean), 45.51M ops/s (median), σ=0.38M
- Optimized (DUALHOT=1): 45.30M ops/s (mean), 45.22M ops/s (median), σ=0.49M
- Improvement: -0.21% mean, -0.62% median
Decision: NEUTRAL (-0.21% within ±1.0% noise threshold)
- Action: Keep as research box (default OFF, freeze)
- Reason: C0-C3 fast path adds branch overhead without measurable gain on Mixed
- Unlike FREE path (+13%), ALLOC path doesn't show significant route determination cost
Key Insight:
- Free path benefits from DUALHOT because it skips expensive policy snapshot + route lookup
- Alloc path already has optimized route caching (Phase 3 C3 static routing)
- C0-C3 specialization doesn't provide additional benefit over current routing
- Conclusion: Alloc route optimization has reached diminishing returns
Cumulative Status:
- Phase 4 E1: +3.92% (GO)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)
- Phase 4 E3-4: NO-GO / frozen
Next: Phase 4(close & next target)
- 勝ち箱: E1 を
MIXED_TINYV3_C7_SAFEプリセットへ昇格(opt-out 可) - 研究箱: E3-4/E2 は freeze(default OFF)
- 次の芯は perf で “self% ≥ 5%” の箱から選ぶ
- 次の指示書:
docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
Phase 4 E1: ENV Snapshot Consolidation ✅ COMPLETE (2025-12-14)
Target: Consolidate 3 ENV gate TLS reads → 1 TLS read
tiny_c7_ultra_enabled_env(): 1.28% selftiny_front_v3_enabled(): 1.01% selftiny_metadata_cache_enabled(): 0.97% self- Total ENV overhead: 3.26% self (from perf profile)
Implementation:
- Created
core/box/hakmem_env_snapshot_box.{h,c}(new ENV snapshot box) - Migrated 8 call sites across 3 hot path files to use snapshot
- ENV gate:
HAKMEM_ENV_SNAPSHOT=0/1(default: 0, research box) - Pattern: Similar to
tiny_front_v3_snapshot(proven approach)
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (E1=0): 43.62M ops/s (avg), 43.56M ops/s (median)
- Optimized (E1=1): 45.33M ops/s (avg), 45.31M ops/s (median)
- Improvement: +3.92% avg, +4.01% median
Decision: GO (+3.92% >= +2.5% threshold)
- Exceeded conservative expectation (+1-3%) → Achieved +3.92%
- Action: Keep as research box for now (default OFF)
- Commit:
88717a873
Key Insight: Shifting from shape optimizations (plateaued) to TLS/memory overhead yields strong returns. ENV snapshot consolidation represents new optimization frontier beyond branch prediction tuning.
Phase 4 Perf Profiling Complete ✅ (2025-12-14)
Profile Analysis:
- Baseline: 46.37M ops/s (MIXED_TINYV3_C7_SAFE, 40M iterations, ws=400)
- Samples: 922 samples @ 999Hz, 3.1B cycles
- Analysis doc:
docs/analysis/PHASE4_PERF_PROFILE_ANALYSIS.md
Key Findings Leading to E1:
- ENV Gate Overhead (3.26% combined) → E1 target
- Shape Optimization Plateau (B3 +2.89%, D3 +0.56% NEUTRAL)
- tiny_alloc_gate_fast (15.37% self%) → defer to E2
Phase 4 D3: Alloc Gate Shape(HAKMEM_ALLOC_GATE_SHAPE)
- ✅ 実装完了(ENV gate + alloc gate 分岐形)
- Mixed A/B(10-run, iter=20M, ws=400): Mean +0.56%(Median -0.5%)→ NEUTRAL
- 判定: research box として freeze(default OFF、プリセット昇格しない)
- Lesson: Shape optimizations have plateaued (branch prediction saturated)
Phase 1 Quick Wins: FREE 昇格 + 観測税ゼロ化
- ✅ A1(FREE 昇格):
MIXED_TINYV3_C7_SAFEでHAKMEM_FREE_TINY_FAST_HOTCOLD=1をデフォルト化 - ✅ A2(観測税ゼロ化):
HAKMEM_DEBUG_COUNTERS=0のとき stats を compile-out(観測税ゼロ) - ❌ A3(always_inline header):
tiny_region_id_write_header()always_inline → NO-GO(指示書/結果:docs/analysis/TINY_HEADER_WRITE_ALWAYS_INLINE_A3_DESIGN.md)- A/B Result: Mixed -4.00% (I-cache pressure), C6-heavy +6.00%
- Decision: Freeze as research box (default OFF)
- Commit:
df37baa50
Phase 2: ALLOC 構造修正
- ✅ Patch 1: malloc_tiny_fast_for_class() 抽出(SSOT)
- ✅ Patch 2: tiny_alloc_gate_fast() を *_for_class 呼びに変更
- ✅ Patch 3: DUALHOT 分岐をクラス内へ移動(C0-C3 のみ)
- ✅ Patch 4: Probe window ENV gate 実装
- 結果: Mixed -0.27%(中立)、C6-heavy +1.68%(SSOT 効果)
- Commit:
d0f939c2e
Phase 2 B1 & B3: ルーティング最適化 (2025-12-13)
B1(Header tax 削減 v2): HEADER_MODE=LIGHT → ❌ NO-GO
- Mixed (10-run): 48.89M → 47.65M ops/s (-2.54%, regression)
- Decision: FREEZE (research box, ENV opt-in)
- Rationale: Conditional check overhead outweighs store savings on Mixed
B3(Routing 分岐形最適化): ALLOC_ROUTE_SHAPE=1 → ✅ ADOPT
- Mixed (10-run): 48.41M → 49.80M ops/s (+2.89%, win)
- Strategy: LIKELY on LEGACY (hot), cold helper for rare routes (V7/MID/ULTRA)
- C6-heavy (5-run): 8.97M → 9.79M ops/s (+9.13%, strong win)
- Decision: ADOPT as default in MIXED_TINYV3_C7_SAFE and C6_HEAVY_LEGACY_POOLV1
- Implementation: Already in place (lines 252-267 in malloc_tiny_fast.h), now enabled by default
- Profile updates: Added
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1")to both profiles
現在地: Phase 3 D1/D2 Validation Complete ✅ (2025-12-13)
Summary:
- Phase 3 D1 (Free Route Cache): ✅ ADOPT - PROMOTED TO DEFAULT
- 20-run validation: Mean +2.19%, Median +2.37% (both criteria met)
- Status: Added to MIXED_TINYV3_C7_SAFE preset (HAKMEM_FREE_STATIC_ROUTE=1)
- Phase 3 D2 (Wrapper Env Cache): ❌ NO-GO / FROZEN
- 10-run results: -1.44% regression
- Reason: TLS overhead > benefit in Mixed workload
- Status: Research box frozen (default OFF, do not pursue)
Cumulative gains: B3 +2.89%, B4 +1.47%, C3 +2.20%, D1 +2.19% (promoted) → ~7.6%
Baseline Phase 3 (10-run, 2025-12-13):
- Mean: 46.04M ops/s, Median: 46.04M ops/s, StdDev: 0.14M ops/s
Next:
- Phase 4 D3 指示書:
docs/analysis/PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md
Phase ALLOC-GATE-SSOT-1 + ALLOC-TINY-FAST-DUALHOT-2: COMPLETED
4 Patches Implemented (2025-12-13):
- ✅ Extract malloc_tiny_fast_for_class() with class_idx parameter (SSOT foundation)
- ✅ Update tiny_alloc_gate_fast() to call *_for_class (eliminate duplicate size→class)
- ✅ Reposition DUALHOT branch: only C0-C3 evaluate alloc_dualhot_enabled()
- ✅ Probe window ENV gate (64 calls) for early putenv tolerance
A/B Test Results:
- Mixed (10-run): 48.75M → 48.62M ops/s (-0.27%, neutral within variance)
- Rationale: SSOT overhead reduction offset by branch repositioning cost on aggregate
- C6-heavy (5-run): 23.24M → 23.63M ops/s (+1.68%, SSOT benefit confirmed)
- SSOT effectiveness: Eliminates duplicate hak_tiny_size_to_class() call
Decision: ADOPT SSOT (Patch 1+2) as structural improvement, DUALHOT-2 (Patch 3) as ENV-gated feature (default OFF)
Rationale:
- SSOT is foundational: Establishes single source of truth for size→class lookup
- Enables future optimization: *_for_class path can be specialized further
- No regression: Mixed neutral, C6-heavy shows SSOT benefit (+1.68%)
- DUALHOT-2 maintains clean branch structure: C4-C7 unaffected when OFF
Commit: d0f939c2e
Phase FREE-TINY-FAST-DUALHOT-1: CONFIRMED & READY FOR ADOPTION
Final A/B Verification (2025-12-13):
- Baseline (DUALHOT OFF): 42.08M ops/s (median, 10-run, Mixed)
- Optimized (DUALHOT ON): 47.81M ops/s (median, 10-run, Mixed)
- Improvement: +13.00% ✅
- Health Check: PASS (verify_health_profiles.sh)
- Safety Gate: HAKMEM_TINY_LARSON_FIX=1 disables for compatibility
Strategy: Recognize C0-C3 (48% of frees) as "second hot path"
- Skip policy snapshot + route determination for C0-C3 classes
- Direct inline to
tiny_legacy_fallback_free_base() - Implementation:
core/front/malloc_tiny_fast.hlines 461-477 - Commit:
2b567ac07+b2724e6f5
Promotion Candidate: YES - Ready for MIXED_TINYV3_C7_SAFE default profile
Phase ALLOC-TINY-FAST-DUALHOT-1: RESEARCH BOX ✅ (WIP, -2% regression)
Implementation Attempt:
- ENV gate:
HAKMEM_TINY_ALLOC_DUALHOT=0/1(default OFF) - Early-exit:
malloc_tiny_fast()lines 169-179 - A/B Result: -1.17% to -2.00% regression (10-run Mixed)
Root Cause:
- Unlike FREE path (early return saves policy snapshot), ALLOC path falls through
- Extra branch evaluation on C4-C7 (~50% of traffic) outweighs C0-C3 policy skip
- Requires structural changes (per-class fast paths) to match FREE success
Decision: Freeze as research box (default OFF, retained for future study)
Phase 2 B4: Wrapper Layer Hot/Cold Split ✅ ADOPT
設計メモ: docs/analysis/PHASE2_B4_WRAPPER_SHAPE_1_DESIGN.md
狙い: wrapper 入口の "稀なチェック"(LD mode、jemalloc、診断)を noinline,cold に押し出す
実装完了 ✅
✅ 完全実装:
- ENV gate:
HAKMEM_WRAP_SHAPE=0/1(wrapper_env_box.h/c) - malloc_cold(): noinline,cold ヘルパー実装済み(lines 93-142)
- malloc hot/cold 分割: 実装済み(lines 169-200 で ENV gate チェック)
- free_cold(): noinline,cold ヘルパー実装済み(lines 321-520)
- free hot/cold 分割: 実装済み(lines 550-574 で wrap_shape dispatch)
A/B テスト結果 ✅ GO
Mixed Benchmark (10-run):
- WRAP_SHAPE=0 (default): 34,750,578 ops/s
- WRAP_SHAPE=1 (optimized): 35,262,596 ops/s
- Average gain: +1.47% ✓ (Median: +1.39%)
- Decision: GO ✓ (exceeds +1.0% threshold)
Sanity Check 結果:
- WRAP_SHAPE=0 (default): 34,366,782 ops/s (3-run)
- WRAP_SHAPE=1 (optimized): 34,999,056 ops/s (3-run)
- Delta: +1.84% ✅(malloc + free 完全実装)
C6-heavy: Deferred(pre-existing linker issue in bench_allocators_hakmem, not B4-related)
Decision: ✅ ADOPT as default (Mixed +1.47% >= +1.0% threshold)
- ✅ Done:
MIXED_TINYV3_C7_SAFEプリセットでHAKMEM_WRAP_SHAPE=1を default 化(bench_profile)
Phase 1: Quick Wins(完了)
- ✅ A1(FREE 勝ち箱の本線昇格):
MIXED_TINYV3_C7_SAFEでHAKMEM_FREE_TINY_FAST_HOTCOLD=1を default 化(ADOPT) - ✅ A2(観測税ゼロ化):
HAKMEM_DEBUG_COUNTERS=0のとき stats を compile-out(ADOPT) - ❌ A3(always_inline header): Mixed -4% 回帰のため NO-GO → research box freeze(
docs/analysis/TINY_HEADER_WRITE_ALWAYS_INLINE_A3_DESIGN.md)
Phase 2: Structural Changes(進行中)
- ❌ B1(Header tax 削減 v2):
HAKMEM_TINY_HEADER_MODE=LIGHTは Mixed -2.54% → NO-GO / freeze(docs/analysis/PHASE2_B1_HEADER_TAX_AB_TEST_RESULTS.md) - ✅ B3(Routing 分岐形最適化):
HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1は Mixed +2.89% / C6-heavy +9.13% → ADOPT(プリセット default=1) - ✅ B4(WRAPPER-SHAPE-1):
HAKMEM_WRAP_SHAPE=1は Mixed +1.47% → ADOPT(docs/analysis/PHASE2_B4_WRAPPER_SHAPE_1_DESIGN.md) - (保留)B2: C0–C3 専用 alloc fast path(入口短絡は回帰リスク高。B4 の後に判断)
Phase 3: Cache Locality - Target: +12-22% (57-68M ops/s)
指示書: docs/analysis/PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md
Phase 3 C3: Static Routing ✅ ADOPT
設計メモ: docs/analysis/PHASE3_C3_STATIC_ROUTING_1_DESIGN.md
狙い: policy_snapshot + learner evaluation をバイパスするために、初期化時に静的ルーティングテーブルを構築
実装完了 ✅:
core/box/tiny_static_route_box.h(API header + hot path functions)core/box/tiny_static_route_box.c(initialization + ENV gate + learner interlock)core/front/malloc_tiny_fast.h(lines 249-256) - 統合:tiny_static_route_ready_fast()で分岐core/bench_profile.h(line 77) - MIXED_TINYV3_C7_SAFE プリセットでHAKMEM_TINY_STATIC_ROUTE=1を default 化
A/B テスト結果 ✅ GO:
- Mixed (10-run): 38,910,792 → 39,768,006 ops/s (+2.20% average gain, median +1.98%)
- Decision: ✅ ADOPT (exceeds +1.0% GO threshold)
- Rationale: policy_snapshot is light (L1 cache resident), but atomic+branch overhead makes +2.2% realistic
- Learner Interlock: Static route auto-disables when HAKMEM_SMALL_LEARNER_V7_ENABLED=1 (safe)
Current Cumulative Gain (Phase 2-3):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- Total: ~6.8% (baseline 35.2M → ~39.8M ops/s)
Phase 3 C1: TLS Cache Prefetch 🔬 NEUTRAL / FREEZE
設計メモ: docs/analysis/PHASE3_C1_TLS_PREFETCH_1_DESIGN.md
狙い: malloc ホットパス LEGACY 入口で g_unified_cache[class_idx] を L1 prefetch(数十クロック早期)
実装完了 ✅:
core/front/malloc_tiny_fast.h(lines 264-267, 331-334)- env_cfg->alloc_route_shape=1 の fast path(線264-267)
- env_cfg->alloc_route_shape=0 の fallback path(線331-334)
- ENV gate:
HAKMEM_TINY_PREFETCH=0/1(default 0)
A/B テスト結果 🔬 NEUTRAL:
- Mixed (10-run): 39,335,109 → 39,203,334 ops/s (-0.34% average, median +1.28%)
- Average gain: -0.34%(わずかな回帰、±1.0% 範囲内)
- Median gain: +1.28%(閾値超え)
- Decision: NEUTRAL (研究箱維持、デフォルト OFF)
- 理由: Average で -0.34% なので、prefetch 効果が噪音範囲
- Prefetch は "当たるかどうか" が不確定(TLS access timing dependent)
- ホットパス後(tiny_hot_alloc_fast 直前)での実行では効果限定的
技術考察:
- prefetch が効果を発揮するには、L1 miss が発生する必要がある
- TLS キャッシュは unified_cache_pop() で素早くアクセス(head/tail インデックス)
- 実際のメモリ待ちは slots[] 配列へのアクセス時(prefetch より後)
- 改善案: prefetch をもっと早期(route_kind 決定前)に移動するか、形状を変更
Phase 3 C2: Slab Metadata Cache Optimization 🔬 NEUTRAL / FREEZE
設計メモ: docs/analysis/PHASE3_C2_METADATA_CACHE_1_DESIGN.md
狙い: Free path で metadata access(policy snapshot, slab descriptor)の cache locality を改善
3 Patches 実装完了 ✅:
-
Policy Hot Cache (Patch 1):
- TinyPolicyHot struct: route_kind[8] を TLS にキャッシュ(9 bytes packed)
- policy_snapshot() 呼び出しを削減(~2 memory ops 節約)
- Safety: learner v7 active 時は自動的に disable
- Files:
core/box/tiny_metadata_cache_env_box.h,tiny_metadata_cache_hot_box.{h,c} - Integration:
core/front/malloc_tiny_fast.h(line 256) route selection
-
First Page Inline Cache (Patch 2):
- TinyFirstPageCache struct: current slab page pointer を TLS per-class にキャッシュ
- superslab metadata lookup を回避(1-2 memory ops)
- Fast-path check in
tiny_legacy_fallback_free_base() - Files:
core/front/tiny_first_page_cache.h,tiny_unified_cache.c - Integration:
core/box/tiny_legacy_fallback_box.h(lines 27-36)
-
Bounds Check Compile-out (Patch 3):
- unified_cache capacity を MACRO constant 化(2048 hardcode)
- modulo 演算を compile-time 最適化(
& MASK) - Macros:
TINY_UNIFIED_CACHE_CAPACITY_POW2=11,CAPACITY=2048,MASK=2047 - File:
core/front/tiny_unified_cache.h(lines 35-41)
A/B テスト結果 🔬 NEUTRAL:
- Mixed (10-run):
- Baseline (C2=0): 40,433,519 ops/s (avg), 40,722,094 ops/s (median)
- Optimized (C2=1): 40,252,836 ops/s (avg), 40,291,762 ops/s (median)
- Average gain: -0.45%, Median gain: -1.06%
- Decision: NEUTRAL (within ±1.0% threshold)
- Action: Keep as research box (ENV gate OFF by default)
Rationale:
- Policy hot cache: learner との interlock コストが高い(プローブ時に毎回 check)
- First page cache: 現在の free path は unified_cache push のみ(superslab lookup なし)
- 効果を発揮するには drain path への統合が必要(将来の最適化)
- Bounds check: すでにコンパイラが最適化済み(power-of-2 detection)
Current Cumulative Gain (Phase 2-3):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- C2 (Metadata cache): -0.45%
- D1 (Free route cache): +2.19%(PROMOTED TO DEFAULT)
- Total: ~8.3% (Phase 2-3, C2=NEUTRAL included)
Commit: f059c0ec8
Phase 3 D1: Free Path Route Cache ✅ ADOPT - PROMOTED TO DEFAULT (+2.19%)
設計メモ: docs/analysis/PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md
狙い: Free path の tiny_route_for_class() コストを削減(4.39% self + 24.78% children)
実装完了 ✅:
core/box/tiny_free_route_cache_env_box.h(ENV gate + lazy init)core/front/malloc_tiny_fast.h(lines 373-385, 780-791) - 2箇所で route cache integrationfree_tiny_fast_cold()path: directg_tiny_route_class[]lookuplegacy_fallbackpath: directg_tiny_route_class[]lookup- Fallback safety:
g_tiny_route_snapshot_donecheck before cache use
- ENV gate:
HAKMEM_FREE_STATIC_ROUTE=0/1(default OFF;MIXED_TINYV3_C7_SAFEでは default ON)
A/B テスト結果 ✅ ADOPT:
-
Mixed (10-run, initial):
- Baseline (D1=0): 45,132,610 ops/s (avg), 45,756,040 ops/s (median)
- Optimized (D1=1): 45,610,062 ops/s (avg), 45,402,234 ops/s (median)
- Average gain: +1.06%, Median gain: -0.77%
-
Mixed (20-run, validation / iter=20M, ws=400):
- Baseline(ROUTE=0): Mean 46.30M / Median 46.30M / StdDev 0.10M
- Optimized(ROUTE=1): Mean 47.32M / Median 47.39M / StdDev 0.11M
- Gain: Mean +2.19% ✓ / Median +2.37% ✓
-
Decision: ✅ Promoted to
MIXED_TINYV3_C7_SAFEpreset default -
Rollback:
HAKMEM_FREE_STATIC_ROUTE=0
Rationale:
- Eliminates
tiny_route_for_class()call overhead in free path - Uses existing
g_tiny_route_class[]cache from Phase 3 C3 (Static Routing) - Safe fallback: checks snapshot initialization before cache use
- Minimal code footprint: 2 integration points in malloc_tiny_fast.h
Phase 3 D2: Wrapper Env Cache ❌ NO-GO (-1.44%)
設計メモ: docs/analysis/PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md
狙い: malloc/free wrapper 入口の wrapper_env_cfg() 呼び出しオーバーヘッドを削減
実装完了 ✅:
core/box/wrapper_env_cache_env_box.h(ENV gate: HAKMEM_WRAP_ENV_CACHE)core/box/wrapper_env_cache_box.h(TLS cache: wrapper_env_cfg_fast)core/box/hak_wrappers.inc.h(lines 174, 553) - malloc/free hot paths で wrapper_env_cfg_fast() 使用- Strategy: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)
- ENV gate:
HAKMEM_WRAP_ENV_CACHE=0/1(default OFF)
A/B テスト結果 ❌ NO-GO:
- Mixed (10-run, 20M iters):
- Baseline (D2=0): 46,516,538 ops/s (avg), 46,467,988 ops/s (median)
- Optimized (D2=1): 45,846,933 ops/s (avg), 45,978,185 ops/s (median)
- Average gain: -1.44%, Median gain: -1.05%
- Decision: NO-GO (regression below -1.0% threshold)
- Action: FREEZE as research box (default OFF, regression confirmed)
Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access cost)
- wrapper_env_cfg() is already minimal (pointer return after simple check in g_wrapper_env.inited)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty for wrap_env_cache_enabled() check outweighs any savings
- Lesson: Not all caching helps - simple global access can be faster than TLS cache
Current Cumulative Gain (Phase 2-3):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- D1 (Free route cache): +1.06% (opt-in)
- D2 (Wrapper env cache): -1.44% (NO-GO, frozen)
- Total: ~7.2% (excluding D2, D1 is opt-in ENV)
Commit: 19056282b
Phase 3 C4: MIXED MID_V3 Routing Fix ✅ ADOPT
要点: MIXED_TINYV3_C7_SAFE では HAKMEM_MID_V3_ENABLED=1 が大きく遅くなるため、プリセットのデフォルトを OFF に変更。
変更(プリセット):
core/bench_profile.h:MIXED_TINYV3_C7_SAFEのHAKMEM_MID_V3_ENABLED=0/HAKMEM_MID_V3_CLASSES=0x0docs/analysis/ENV_PROFILE_PRESETS.md: Mixed 本線では MID v3 OFF と明記
A/B(Mixed, ws=400, 20M iters, 10-run):
- Baseline(MID_V3=1): mean ~43.33M ops/s
- Optimized(MID_V3=0): mean ~48.97M ops/s
- Delta: +13% ✅(GO)
理由(観測):
- C6 を MID_V3 にルーティングすると
tiny_alloc_route_cold()→MID 側が “第2ホット” になり、Mixed では instruction / cache コストが支配的になりやすい - Mixed 本線は “全クラス多発” なので、C6 は LEGACY(tiny unified cache) に残した方が速い
ルール:
- Mixed 本線: MID v3 OFF(デフォルト)
- C6-heavy: MID v3 ON(従来通り)
Architectural Insight (Long-term)
Reality check: hakmem 4-5 layer design (wrapper → gate → policy → route → handler) adds 50-100x instruction overhead vs mimalloc's 1-layer TLS buckets.
Maximum realistic without redesign: 65-70M ops/s (still ~1.9x gap)
Future pivot: Consider static-compiled routing + optional learner (not per-call policy)
前フェーズ: Phase POOL-MID-DN-BATCH 完了 ✅(研究箱として freeze 推奨)
Status: Phase POOL-MID-DN-BATCH 完了 ✅ (2025-12-12)
Summary:
- Goal: Eliminate
mid_desc_lookupfrom pool_free_v1 hot path by deferring inuse_dec - Performance: 当初の計測では改善が見えたが、後続解析で「stats の global atomic」が大きな外乱要因だと判明
- Stats OFF + Hash map の再計測では 概ねニュートラル(-1〜-2%程度)
- Strategy: TLS map batching (~32 pages/drain) + thread exit cleanup
- Decision: Default OFF (ENV gate) のまま freeze(opt-in 研究箱)
Key Achievements:
- Hot path: Zero lookups (O(1) TLS map update only)
- Cold path: Batched lookup + atomic subtract (32x reduction in lookup frequency)
- Thread-safe: pthread_key cleanup ensures pending ops drained on thread exit
- Stats:
HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=1のときのみ有効(default OFF)
Deliverables:
core/box/pool_mid_inuse_deferred_env_box.h(ENV gate: HAKMEM_POOL_MID_INUSE_DEFERRED)core/box/pool_mid_inuse_tls_pagemap_box.h(32-entry TLS map)core/box/pool_mid_inuse_deferred_box.h(deferred API + drain logic)core/box/pool_mid_inuse_deferred_stats_box.h(counters + dump)core/box/pool_free_v1_box.h(integration: fast + slow paths)- Benchmark: +2.8% median, within target range (+2-4%)
ENV Control:
HAKMEM_POOL_MID_INUSE_DEFERRED=0 # Default (immediate dec)
HAKMEM_POOL_MID_INUSE_DEFERRED=1 # Enable deferred batching
HAKMEM_POOL_MID_INUSE_MAP_KIND=linear|hash # Default: linear
HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0/1 # Default: 0 (keep OFF for perf)
Health smoke:
- OFF/ON の最小スモークは
scripts/verify_health_profiles.shで実行
Status: Phase MID-V35-HOTPATH-OPT-1 FROZEN ✅
Summary:
- Design: Step 0-3(Geometry SSOT + Header prefill + Hot counts + C6 fastpath)
- C6-heavy (257–768B): +7.3% improvement ✅ (8.75M → 9.39M ops/s, 5-run mean)
- Mixed (16–1024B): -0.2% (誤差範囲, ±2%以内) ✓
- Decision: デフォルトOFF/FROZEN(全3ノブ)、C6-heavy推奨ON、Mixed現状維持
- Key Finding:
- Step 0: L1/L2 geometry mismatch 修正(C6 102→128 slots)
- Step 1-3: refill 境界移動 + 分岐削減 + constant 最適化で +7.3%
- Mixed では MID_V3(C6-only) 固定なため効果微小
Deliverables:
core/box/smallobject_mid_v35_geom_box.h(新規)core/box/mid_v35_hotpath_env_box.h(新規)core/smallobject_mid_v35.c(Step 1-3 統合)core/smallobject_cold_iface_mid_v3.c(Step 0 + Step 1)docs/analysis/ENV_PROFILE_PRESETS.md(更新)
Status: Phase POLICY-FAST-PATH-V2 FROZEN ✅
Summary:
- Mixed (ws=400): -1.6% regression ❌ (目標未達: 大WSで追加分岐コスト>skipメリット)
- C6-heavy (ws=200): +5.4% improvement ✅ (研究箱で有効)
- Decision: デフォルトOFF、FROZEN(C6-heavy/ws<300 研究ベンチのみ推奨)
- Learning: 大WSでは追加分岐が勝ち筋を食う(Mixed非推奨、C6-heavy専用)
Status: Phase 3-GRADUATE FROZEN ✅
TLS-UNIFY-3 Complete:
- C6 intrusive LIFO: Working (intrusive=1 with array fallback)
- Mixed regression identified: policy overhead + TLS contention
- Decision: Research box only (default OFF in mainline)
- Documentation:
docs/analysis/PHASE_3_GRADUATE_FINAL_REPORT.md✅docs/analysis/ENV_PROFILE_PRESETS.md(frozen warning added) ✅
Previous Phase TLS-UNIFY-3 Results:
- Status(Phase TLS-UNIFY-3):
- DESIGN ✅(
docs/analysis/ULTRA_C6_INTRUSIVE_FREELIST_DESIGN_V11B.md) - IMPL ✅(C6 intrusive LIFO を
TinyUltraTlsCtxに導入) - VERIFY ✅(ULTRA ルート上で intrusive 使用をカウンタで実証)
- GRADUATE-1 C6-heavy ✅
- Baseline (C6=MID v3.5): 55.3M ops/s
- ULTRA+array: 57.4M ops/s (+3.79%)
- ULTRA+intrusive: 54.5M ops/s (-1.44%, fallback=0)
- GRADUATE-1 Mixed ❌
- ULTRA+intrusive 約 -14% 回帰(Legacy fallback ≈24%)
- Root cause: 8 クラス競合による TLS キャッシュ奪い合いで ULTRA miss 増加
- DESIGN ✅(
Performance Baselines (Current HEAD - Phase 3-GRADUATE)
Test Environment:
- Date: 2025-12-12
- Build: Release (LTO enabled)
- Kernel: Linux 6.8.0-87-generic
Mixed Workload (MIXED_TINYV3_C7_SAFE):
- Throughput: 51.5M ops/s (1M iter, ws=400)
- IPC: 1.64 instructions/cycle
- L1 cache miss: 8.59% (303,027 / 3,528,555 refs)
- Branch miss: 3.70% (2,206,608 / 59,567,242 branches)
- Cycles: 151.7M, Instructions: 249.2M
Top 3 Functions (perf record, self%):
free: 29.40% (malloc wrapper + gate)main: 26.06% (benchmark driver)tiny_alloc_gate_fast: 19.11% (front gate)
C6-heavy Workload (C6_HEAVY_LEGACY_POOLV1):
- Throughput: 52.7M ops/s (1M iter, ws=200)
- IPC: 1.67 instructions/cycle
- L1 cache miss: 7.46% (257,765 / 3,455,282 refs)
- Branch miss: 3.77% (2,196,159 / 58,209,051 branches)
- Cycles: 151.1M, Instructions: 253.1M
Top 3 Functions (perf record, self%):
free: 31.44%tiny_alloc_gate_fast: 25.88%main: 18.41%
Analysis: Bottleneck Identification
Key Observations:
-
Mixed vs C6-heavy Performance Delta: Minimal (~2.3% difference)
- Mixed (51.5M ops/s) vs C6-heavy (52.7M ops/s)
- Both workloads are performing similarly, indicating hot path is well-optimized
-
Free Path Dominance:
freeaccounts for 29-31% of cycles- Suggests free path still has optimization potential
- C6-heavy shows slightly higher free% (31.44% vs 29.40%)
-
Alloc Path Efficiency:
tiny_alloc_gate_fastis 19-26% of cycles- Higher in C6-heavy (25.88%) due to MID v3/v3.5 usage
- Lower in Mixed (19.11%) suggests LEGACY path is efficient
-
Cache & Branch Efficiency: Both workloads show good metrics
- Cache miss rates: 7-9% (acceptable for mixed-size workloads)
- Branch miss rates: ~3.7% (good prediction)
- No obvious cache/branch bottleneck
-
IPC Analysis: 1.64-1.67 instructions/cycle
- Good for memory-bound allocator workloads
- Suggests memory bandwidth, not compute, is the limiter
Next Phase Decision
Recommendation: Phase POLICY-FAST-PATH-V2 (Policy Optimization)
Rationale:
-
Free path is the bottleneck (29-31% of cycles)
- Current policy snapshot mechanism may have overhead
- Multi-class routing adds branch complexity
-
MID/POOL v3 paths are efficient (only 25.88% in C6-heavy)
- MID v3/v3.5 is well-optimized after v11a-5
- Further segment/retire optimization has limited upside (~5-10% potential)
-
High-ROI target: Policy fast path specialization
- Eliminate policy snapshot in hot paths (C7 ULTRA already has this)
- Optimize class determination with specialized fast paths
- Reduce branch mispredictions in multi-class scenarios
Alternative Options (lower priority):
-
Phase MID-POOL-V3-COLD-OPTIMIZE: Cold path (segment creation, retire logic)
- Lower ROI: Cold path not showing up in top functions
- Estimated gain: 2-5%
-
Phase LEARNER-V2-TUNING: Learner threshold optimization
- Very low ROI: Learner not active in current baselines
- Estimated gain: <1%
Boundary & Rollback Plan
Phase POLICY-FAST-PATH-V2 Scope:
-
Alloc Fast Path Specialization:
- Create per-class specialized alloc gates (no policy snapshot)
- Use static routing for C0-C7 (determined at compile/init time)
- Keep policy snapshot only for dynamic routing (if enabled)
-
Free Fast Path Optimization:
- Reduce classify overhead in
free_tiny_fast() - Optimize pointer classification with LUT expansion
- Consider C6 early-exit (similar to C7 in v11b-1)
- Reduce classify overhead in
-
ENV-based Rollback:
- Add
HAKMEM_POLICY_FAST_PATH_V2=1ENV gate - Default: OFF (use existing policy snapshot mechanism)
- A/B testing: Compare v2 fast path vs current baseline
- Add
Rollback Mechanism:
- ENV gate
HAKMEM_POLICY_FAST_PATH_V2=0reverts to current behavior - No ABI changes, pure performance optimization
- Sanity benchmarks must pass before enabling by default
Success Criteria:
- Mixed workload: +5-10% improvement (target: 54-57M ops/s)
- C6-heavy workload: +3-5% improvement (target: 54-55M ops/s)
- No SEGV/assert failures
- Cache/branch metrics remain stable or improve
References
docs/analysis/PHASE_3_GRADUATE_FINAL_REPORT.md(TLS-UNIFY-3 closure)docs/analysis/ENV_PROFILE_PRESETS.md(C6 ULTRA frozen warning)docs/analysis/ULTRA_C6_INTRUSIVE_FREELIST_DESIGN_V11B.md(Phase TLS-UNIFY-3 design)
Phase TLS-UNIFY-2a: C4-C6 TLS統合 - COMPLETED ✅
変更: C4-C6 ULTRA の TLS を TinyUltraTlsCtx 1 struct に統合。配列マガジン方式維持、C7 は別箱のまま。
A/B テスト結果:
| Workload | v11b-1 (Phase 1) | TLS-UNIFY-2a | 差分 |
|---|---|---|---|
| Mixed 16-1024B | 8.0-8.8 Mop/s | 8.5-9.0 Mop/s | +0~5% |
| MID 257-768B | 8.5-9.0 Mop/s | 8.1-9.0 Mop/s | ±0% |
結果: C4-C6 ULTRA の TLS は TinyUltraTlsCtx 1箱に収束。性能同等以上、SEGV/assert なし ✅
Phase v11b-1: Free Path Optimization - COMPLETED ✅
変更: free_tiny_fast() のシリアルULTRAチェック (C7→C6→C5→C4) を単一switch構造に統合。C7 early-exit追加。
結果 (vs v11a-5):
| Workload | v11a-5 | v11b-1 | 改善 |
|---|---|---|---|
| Mixed 16-1024B | 45.4M | 50.7M | +11.7% |
| C6-heavy | 49.1M | 52.0M | +5.9% |
| C6-heavy + MID v3.5 | 53.1M | 53.6M | +0.9% |
本線プロファイル決定
| Workload | MID v3.5 | 理由 |
|---|---|---|
| Mixed 16-1024B | OFF | LEGACYが最速 (45.4M ops/s) |
| C6-heavy (257-512B) | ON (C6-only) | +8%改善 (53.1M ops/s) |
ENV設定:
MIXED_TINYV3_C7_SAFE:HAKMEM_MID_V35_ENABLED=0C6_HEAVY_LEGACY_POOLV1:HAKMEM_MID_V35_ENABLED=1 HAKMEM_MID_V35_CLASSES=0x40
Phase v11a-5: Hot Path Optimization - COMPLETED
Status: ✅ COMPLETE - 大幅な性能改善達成
変更内容
- Hot path簡素化:
malloc_tiny_fast()を単一switch構造に統合 - C7 ULTRA early-exit: Policy snapshot前にC7 ULTRAをearly-exit(最大ホットパス最適化)
- ENV checks移動: すべてのENVチェックをPolicy initに集約
結果サマリ (vs v11a-4)
| Workload | v11a-4 Baseline | v11a-5 Baseline | 改善 |
|---|---|---|---|
| Mixed 16-1024B | 38.6M | 45.4M | +17.6% |
| C6-heavy (257-512B) | 39.0M | 49.1M | +26% |
| Workload | v11a-4 MID v3.5 | v11a-5 MID v3.5 | 改善 |
|---|---|---|---|
| Mixed 16-1024B | 40.3M | 41.8M | +3.7% |
| C6-heavy (257-512B) | 40.2M | 53.1M | +32% |
v11a-5 内部比較
| Workload | Baseline | MID v3.5 ON | 差分 |
|---|---|---|---|
| Mixed 16-1024B | 45.4M | 41.8M | -8% (LEGACYが速い) |
| C6-heavy (257-512B) | 49.1M | 53.1M | +8.1% |
結論
- Hot path最適化で大幅改善: Baseline +17-26%、MID v3.5 ON +3-32%
- C7 early-exitが効果大: Policy snapshot回避で約10M ops/s向上
- MID v3.5はC6-heavyで有効: C6主体ワークロードで+8%改善
- Mixedワークロードではbaselineが最適: LEGACYパスがシンプルで速い
技術詳細
- C7 ULTRA early-exit:
tiny_c7_ultra_enabled_env()(static cached) で判定 - Policy snapshot: TLSキャッシュ + version check (version mismatch時のみ再初期化)
- Single switch: route_kind[class_idx] で分岐(ULTRA/MID_V35/V7/MID_V3/LEGACY)
Phase v11a-4: MID v3.5 Mixed本線テスト - COMPLETED
Status: ✅ COMPLETE - C6→MID v3.5 採用候補
結果サマリ
| Workload | v3.5 OFF | v3.5 ON | 改善 |
|---|---|---|---|
| C6-heavy (257-512B) | 34.0M | 35.8M | +5.1% |
| Mixed 16-1024B | 38.6M | 40.3M | +4.4% |
結論
Mixed本線で C6→MID v3.5 は採用候補。+4%の改善があり、設計の一貫性(統一セグメント管理)も得られる。
Phase v11a-3: MID v3.5 Activation - COMPLETED
Status: ✅ COMPLETE
Bug Fixes
- Policy infinite loop: CAS で global version を 1 に初期化
- Malloc recursion: segment creation で mmap 直叩きに変更
Tasks Completed (6/6)
- ✅ Add MID_V35 route kind to Policy Box
- ✅ Implement MID v3.5 HotBox alloc/free
- ✅ Wire MID v3.5 into Front Gate
- ✅ Update Makefile and build
- ✅ Run A/B benchmarks
- ✅ Update documentation
Phase v11a-2: MID v3.5 Implementation - COMPLETED
Status: COMPLETE
All 5 tasks of Phase v11a-2 have been successfully implemented.
Implementation Summary
Task 1: SegmentBox_mid_v3 (L2 Physical Layer)
File: core/smallobject_segment_mid_v3.c
Implemented:
- SmallSegment_MID_v3 structure (2MiB segment, 64KiB pages, 32 pages total)
- Per-class free page stacks (LIFO)
- Page metadata management with SmallPageMeta
- RegionIdBox integration for fast pointer classification
- Geometry: Reuses ULTRA geometry (2MiB segments, 64KiB pages)
- Class capacity mapping: C5→170 slots, C6→102 slots, C7→64 slots
Functions:
small_segment_mid_v3_create(): Allocate 2MiB via mmap, initialize metadatasmall_segment_mid_v3_destroy(): Cleanup and unregister from RegionIdBoxsmall_segment_mid_v3_take_page(): Get page from free stack (LIFO)small_segment_mid_v3_release_page(): Return page to free stack- Statistics and validation functions
Task 2: ColdIface_mid_v3 (L2→L1 Boundary)
Files:
core/box/smallobject_cold_iface_mid_v3_box.h(header)core/smallobject_cold_iface_mid_v3.c(implementation)
Implemented:
-
small_cold_mid_v3_refill_page(): Get new page for allocation- Lazy TLS segment allocation
- Free stack page retrieval
- Page metadata initialization
- Returns NULL when no pages available (for v11a-2)
-
small_cold_mid_v3_retire_page(): Return page to free pool- Calculate free hit ratio (basis points: 0-10000)
- Publish stats to StatsBox
- Reset page metadata
- Return to free stack
Task 3: StatsBox_mid_v3 (L2→L3)
File: core/smallobject_stats_mid_v3.c
Implemented:
- Stats collection and history (circular buffer, 1000 events)
small_stats_mid_v3_publish(): Record page retirement statistics- Periodic aggregation (every 100 retires by default)
- Per-class metrics tracking
- Learner notification on eval intervals
- Timestamp tracking (ns resolution)
- Free hit ratio calculation and smoothing
Task 4: Learner v2 Aggregation (L3)
File: core/smallobject_learner_v2.c
Implemented:
- Multi-class allocation tracking (C5-C7)
- Exponential moving average for retire ratios (90% history + 10% new)
small_learner_v2_record_page_stats(): Ingest stats from StatsBox- Per-class retire efficiency tracking
- C5 ratio calculation for routing decisions
- Global and per-class metrics
- Configuration: smoothing factor, evaluation interval, C5 threshold
Metrics tracked:
- Per-class allocations
- Retire count and ratios
- Free hit rate (global and per-class)
- Average page utilization
Task 5: Integration & Sanity Benchmarks
Makefile Updates:
- Added 4 new object files to OBJS_BASE and BENCH_HAKMEM_OBJS_BASE:
core/smallobject_segment_mid_v3.ocore/smallobject_cold_iface_mid_v3.ocore/smallobject_stats_mid_v3.ocore/smallobject_learner_v2.o
Build Results:
- Clean compilation with only minor warnings (unused functions)
- All object files successfully linked
- Benchmark executable built successfully
Sanity Benchmark Results:
./bench_random_mixed_hakmem 100000 400 1
Throughput = 27323121 ops/s [iter=100000 ws=400] time=0.004s
RSS: max_kb=30208
Performance: 27.3M ops/s (baseline maintained, no regression)
Architecture
Layer Structure
L3: Learner v2 (smallobject_learner_v2.c)
↑ (stats aggregation)
L2: StatsBox (smallobject_stats_mid_v3.c)
↑ (publish events)
L2: ColdIface (smallobject_cold_iface_mid_v3.c)
↑ (refill/retire)
L2: SegmentBox (smallobject_segment_mid_v3.c)
↑ (page management)
L1: [Future: Hot path integration]
Data Flow
- Page Refill: ColdIface → SegmentBox (take from free stack)
- Page Retire: ColdIface → StatsBox (publish) → Learner (aggregate)
- Decision: Learner calculates C5 ratio → routing decision (v7 vs MID_v3)
Key Design Decisions
-
No Hot Path Integration: Phase v11a-2 focuses on infrastructure only
- Existing MID v3 routing unchanged
- New code is dormant (linked but not called)
- Ready for future activation
-
ULTRA Geometry Reuse: 2MiB segments, 64KiB pages
- Proven design from C7 ULTRA
- Efficient for C5-C7 range (257-1024B)
- Good balance between fragmentation and overhead
-
Per-Class Free Stacks: Independent page pools per class
- Reduces cross-class interference
- Simplifies page accounting
- Enables per-class statistics
-
Exponential Smoothing: 90% historical + 10% new
- Stable metrics despite workload variation
- React to trends without noise
- Standard industry practice
File Summary
New Files Created (6 total)
core/smallobject_segment_mid_v3.c(280 lines)core/box/smallobject_cold_iface_mid_v3_box.h(30 lines)core/smallobject_cold_iface_mid_v3.c(115 lines)core/smallobject_stats_mid_v3.c(180 lines)core/smallobject_learner_v2.c(270 lines)
Existing Files Modified (4 total)
core/box/smallobject_segment_mid_v3_box.h(added function prototypes)core/box/smallobject_learner_v2_box.h(added stats include, function prototype)Makefile(added 4 new .o files to OBJS_BASE and TINY_BENCH_OBJS_BASE)CURRENT_TASK.md(this file)
Total Lines of Code: ~875 lines (C implementation)
Next Steps (Future Phases)
-
Phase v11a-3: Hot path integration
- Route C5/C6/C7 through MID v3.5
- TLS context caching
- Fast alloc/free implementation
-
Phase v11a-4: Route switching
- Implement C5 ratio threshold logic
- Dynamic switching between MID_v3 and v7
- A/B testing framework
-
Phase v11a-5: Performance optimization
- Inline hot functions
- Prefetching
- Cache-line optimization
Verification Checklist
- All 5 tasks completed
- Clean compilation (warnings only for unused functions)
- Successful linking
- Sanity benchmark passes (27.3M ops/s)
- No performance regression
- Code modular and well-documented
- Headers properly structured
- RegionIdBox integration works
- Stats collection functional
- Learner aggregation operational
Notes
- Not Yet Active: This code is dormant - linked but not called by hot path
- Zero Overhead: No performance impact on existing MID v3 implementation
- Ready for Integration: All infrastructure in place for future hot path activation
- Tested Build: Successfully builds and runs with existing benchmarks
Phase v11a-2 Status: ✅ COMPLETE Date: 2025-12-12 Build Status: ✅ PASSING Performance: ✅ NO REGRESSION (27.3M ops/s baseline maintained)