tomoaki/hakmem

Moe Charm (CI) 2013514f7b Working state before pushing to cyu remote

2025-12-19 03:45:01 +09:00

CURRENT_TASK (Rolling, SSOT)

SSOT (current source of truth)

Performance SSOT: scripts/run_mixed_10_cleanenv.sh (WS=400, RUNS=10, size range forced to 16..1040, *_ONLY forced OFF)
Route verification: scripts/run_mixed_observe_ssot.sh (OBSERVE only; do not use it for throughput comparison)
Build modes: docs/analysis/SSOT_BUILD_MODES.md
External comparison (short run): docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md (same binary under LD_PRELOAD + hakmem_force_libc to isolate causes)

Phase 87-88 (Closed: NO-GO)

Status: ✅ OBSERVE verified + ❌ Phase 88 NO-GO

Phase 87: Inline Slots Verification

Initial Finding (Wrong): Standard binary showed PUSH TOTAL/POP TOTAL = 0

Root Cause: ENV drift (HAKMEM_BENCH_MIN_SIZE/MAX_SIZE leaking in from the environment)
- Fix: scripts/run_mixed_10_cleanenv.sh now pins the size range (MIN=16, MAX=1040)
- It also forces HAKMEM_BENCH_C5_ONLY=0, HAKMEM_BENCH_C6_ONLY=0, HAKMEM_BENCH_C7_ONLY=0

Corrected Finding (OBSERVE binary) - 20M ops Mixed SSOT WS=400:

PUSH TOTAL:   C4=687,564  C5=1,373,605  C6=2,750,862  TOTAL=4,812,031 ✓
POP TOTAL:    C4=687,564  C5=1,373,605  C6=2,750,862  TOTAL=4,812,031 ✓
PUSH FULL:    0 (0.00%)
POP EMPTY:    168 (0.003%)

JUDGMENT: ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89

Phase 88: Batch Drain Optimization

Overflow Analysis:

POP EMPTY rate: 168 / 4,812,031 = 0.003% ← negligible
PUSH FULL rate: 0 / 4,812,031 = 0% ← never occurs
Decision: batching would not move throughput (overflow almost never happens)

Phase 88 Decision: NO-GO (frozen)

Rationale: at a 0.003% overflow rate, layout tax risk > expected gain
Infrastructure: keep the observation telemetry (allows re-verification if WS or capacity changes later)
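
The overflow rates above can be re-derived from the raw counters. The following is a minimal sketch in Python; the helper name `rate_percent` is hypothetical, and the counter values are copied from the Phase 87 OBSERVE output.

```python
# Counter values copied from the Phase 87 OBSERVE run (20M ops, Mixed SSOT, WS=400).
PUSH_TOTAL = 4_812_031
PUSH_FULL = 0
POP_EMPTY = 168


def rate_percent(events: int, total: int) -> float:
    """Return events/total as a percentage (0.0 when total is zero)."""
    return 100.0 * events / total if total else 0.0


pop_empty_rate = rate_percent(POP_EMPTY, PUSH_TOTAL)
push_full_rate = rate_percent(PUSH_FULL, PUSH_TOTAL)

print(f"POP EMPTY rate: {pop_empty_rate:.3f}%")
print(f"PUSH FULL rate: {push_full_rate:.3f}%")
```

If WS or slot capacity changes, the same calculation on fresh counters shows whether the NO-GO premise (overflow close to zero) still holds.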

Artifacts Created:

Telemetry box: core/box/tiny_inline_slots_overflow_stats_box.h/c
Phase 87 results: docs/analysis/PHASE87_OBSERVATION_RESULTS.md
SSOT hardening: scripts/run_mixed_10_cleanenv.sh, scripts/run_mixed_observe_ssot.sh
ENV drift prevention doc: docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md

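Key Learning:

Confirming that a path is actually exercised requires the OBSERVE binary plus total counters.
Keep observation and performance measurement separate (avoids telemetry overhead).
ENV drift (MIN/MAX size, CLASS_ONLY) is the main factor that changes the route.

Follow-up Fix (SSOT hardening):
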
scripts/run_mixed_10_cleanenv.sh now forces HAKMEM_BENCH_MIN_SIZE=16 / HAKMEM_BENCH_MAX_SIZE=1040 and disables HAKMEM_BENCH_C{5,6,7}_ONLY to prevent path drift.
New pre-flight helper: scripts/run_mixed_observe_ssot.sh (Route Banner + OBSERVE, single run).
Overflow stats compile gating fixed (see above).

Phase 89 (Done: Bottleneck Analysis & Optimization Roadmap)

Status: ✅ SSOT Measurement Complete + 3 Optimization Candidates Identified

4-Step SSOT Procedure Completion

Step 1: OBSERVE Binary Preflight

Binary: bench_random_mixed_hakmem_observe (with telemetry enabled)
Inline slots verification: ✓ PUSH TOTAL = 4.81M, POP EMPTY = 0.003% (confirmed active & healthy)
Throughput (with telemetry): 51.52M ops/s

Step 2: Standard 10-run Baseline

Binary: bench_random_mixed_hakmem (clean, no telemetry)
10-run SSOT results: 51.36M ops/s (CV: 0.7%, very stable)
- Range: 50.74M - 51.73M
- Decision: This is baseline for bottleneck analysis

Step 3: FAST PGO 10-run Comparison

Binary: bench_random_mixed_hakmem_minimal_pgo (PGO optimized)
10-run SSOT results: 54.16M ops/s (CV: 1.5%, acceptable)
- Range: 52.89M - 55.13M
- Performance Gap: 54.16M - 51.36M = 2.80M ops/s (+5.45%)
- This represents the optimization ceiling with current PGO profile
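
As a cross-check of the gap quoted in Step 3, here is a small Python sketch. The function names are hypothetical. The coefficient of variation is shown only as a definition (sample standard deviation over mean), since the per-run values are not listed in this document.

```python
import statistics


def relative_gain_percent(baseline: float, treatment: float) -> float:
    """Relative change of treatment over baseline, in percent."""
    return 100.0 * (treatment - baseline) / baseline


def cv_percent(samples) -> float:
    """Coefficient of variation: sample stdev / mean, in percent."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


# 10-run means (M ops/s) from Step 2 and Step 3.
standard_mean = 51.36
fast_pgo_mean = 54.16

gap = relative_gain_percent(standard_mean, fast_pgo_mean)
print(f"FAST PGO vs Standard: {gap:+.2f}%")
```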

Step 4: Results Captured

Git SHA: e4c5f0535 (master branch)
Timestamp: 2025-12-18 23:06:01
System: AMD Ryzen 7 5825U (8 cores / 16 threads), 6.8.0-87-generic kernel
Files: docs/analysis/PHASE89_SSOT_MEASUREMENT.md

Perf Analysis & Top Bottleneck Identification

Profile Run: 40M operations (0.78s), 833 perf samples

Top Functions by CPU Time:

free - 27.40% (hottest)
main - 26.30% (benchmark loop, not optimizable)
malloc - 20.36% (hottest)
malloc.cold - 10.65% (cold path, avoid optimizing)
free.cold - 5.59% (cold path, avoid optimizing)
tiny_region_id_write_header - 2.98% (hot, inlining candidate)

malloc + free combined = 47.76% of CPU time (already Phase 9/10/78-1/80-1 optimized)

Top 3 Optimization Candidates (Ranked by Priority)

Candidate	Priority	Recommendation	Expected Gain	Risk	Effort
tiny_region_id_write_header always_inline	HIGH	PURSUE	+1-2%	LOW	1-2h
malloc/free branch reduction	MEDIUM	DEFER	+2-3%	MEDIUM	20-40h
Cold-path optimization	LOW	AVOID	+1%	HIGH	10-20h

Candidate 1: tiny_region_id_write_header always_inline (2.98% CPU)

Current: Selective inlining from core/region_id_v6.c
Proposal: Force always_inline for hot-path call sites
Layout Impact: MINIMAL (no code bulk, maintains I-cache discipline)
Recommendation: YES - PURSUE
- Estimated timeline: Phase 90
- Implementation: 1-2 lines, add __attribute__((always_inline)) wrapper

Candidate 2: malloc/free branch reduction (47.76% CPU)

Current: Phase 9/10/78-1/80-1/83-1 already optimized
Observation: 56.4M branch-misses (branch prediction pressure)
Proposal: Pre-compute routing tables (like Phase 85 approach)
Risk: Code bloat, potential layout tax regression (Phase 85 was NO-GO)
Recommendation: DEFER
- Wait for workload characteristics that justify complexity
- Gains from this layer have reached their saturation point

Phase 91 (Closed: NEUTRAL / frozen)

Status: ⚪ NEUTRAL (C6 IFL: +0.38% / 10-run) → kept with default OFF

Goal: replace the FIFO in the C6 inline slots with an intrusive LIFO to cut the fixed tax
Results (SSOT 10-run):
- Control (HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0) mean 52.05M
- Treatment (HAKMEM_TINY_C6_INLINE_SLOTS_IFL=1) mean 52.25M
- Δ +0.38% (below the +1.0% GO threshold)
Verdict: frozen (research box)
- No regression, but the ROI is small, so it will not be extended to C5/C4
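
The threshold check behind this verdict can be sketched as follows. This is an illustration in Python with hypothetical names. Only the +1.0% GO threshold is taken from this document, and the finer NEUTRAL vs NO-GO labelling is left to human judgment.

```python
GO_THRESHOLD_PERCENT = 1.0  # GO threshold quoted throughout this document


def delta_percent(control: float, treatment: float) -> float:
    """Relative throughput change of treatment vs control, in percent."""
    return 100.0 * (treatment - control) / control


def meets_go_threshold(control: float, treatment: float,
                       threshold: float = GO_THRESHOLD_PERCENT) -> bool:
    """True when the measured gain reaches the GO threshold."""
    return delta_percent(control, treatment) >= threshold


# Phase 91 (C6 IFL): 10-run means in M ops/s.
ifl_delta = delta_percent(52.05, 52.25)
print(f"C6 IFL delta: {ifl_delta:+.2f}%  GO: {meets_go_threshold(52.05, 52.25)}")
```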

Phase 92 (Planned)

Status: 🔍 next phase being planned

Goal: quickly classify the cause of the tcmalloc performance gap (hakmem: 52M vs tcmalloc: 58M, -12.8%)

Planned:

Case A: small vs large object split test (C6-only vs C7-only)
Case B: Inline Slots vs Unified Cache split test
Case C: LIFO vs FIFO comparison
Case D: Pool size sensitivity test

Duration: 1-2h (short triage)
Output: identify the primary bottleneck → select the next candidate

References:

Triage Plan: docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md

Candidate 3: Cold-path de-duplication (16.24% CPU)

Current: malloc.cold (10.65%) + free.cold (5.59%) explicitly separated
Rationale: Separation improves hot-path I-cache utilization
Recommendation: AVOID
- Aligns with the user's "avoid layout tax" principle
- Optimizing cold paths would ADD code to hot path (violates design)

Key Performance Insights

FAST PGO vs Standard (+5.45%) breakdown:

PGO branch prediction optimization: ~3%
Code layout optimization: ~2%
Inlining decisions: ~0.5%

Conclusion: Standard build limited by branch prediction pressure; further gains require architectural tradeoffs.

Inline Slots Health: Working perfectly - 0.003% overflow rate confirms no bottleneck

References & Artifacts

SSOT Measurement: docs/analysis/PHASE89_SSOT_MEASUREMENT.md
Bottleneck Analysis: docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md
Perf Stats: docs/analysis/PHASE89_PERF_STAT.txt
Scripts: scripts/run_mixed_10_cleanenv.sh, scripts/run_mixed_observe_ssot.sh

Phase 86 (Closed: NO-GO)

Status: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)

A/B Test (10-run SSOT):

Control: 51,750,467 ops/s (CV: 2.26%)
Treatment: 51,881,055 ops/s (CV: 2.32%)
Delta: +0.25% (mean), -0.15% (median)

Summary: Free path legacy mask (mask-only) optimization for LEGACY classes.

Design: Bitset mask + direct call (avoids Phase 85's indirect call problems)
Implementation: Correct (0x7f mask computed, C0-C6 optimized)
Root cause: Competing Phase 9/10 optimizations (+1.89%) already capture most benefit
Conclusion: Free path optimization layer has reached practical ceiling
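
For readers unfamiliar with the bitset approach, the mask arithmetic can be sketched in Python. This assumes bit i stands for class Ci, which is consistent with 0x7f covering C0-C6. The function names are hypothetical and do not come from the hakmem source.

```python
def class_mask(first: int, last: int) -> int:
    """Bitset with one bit set per class index in [first, last]."""
    return sum(1 << c for c in range(first, last + 1))


LEGACY_MASK = class_mask(0, 6)  # C0..C6


def is_legacy_class(class_idx: int, mask: int = LEGACY_MASK) -> bool:
    """Single shift-and-test, standing in for a per-class branch chain."""
    return bool((mask >> class_idx) & 1)


print(hex(LEGACY_MASK), is_legacy_class(6), is_legacy_class(7))
```

The appeal of this form is that membership is one shift and one AND, with no indirect call.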

0) Current source of truth (SSOT)

Current SSOT (Phase 89 capture / Git SHA: e4c5f0535):
- Standard (./bench_random_mixed_hakmem) 10-run mean: 51.36M ops/s (CV ~0.7%)
- FAST PGO minimal (./bench_random_mixed_hakmem_minimal_pgo) 10-run mean: 54.16M ops/s (CV ~1.5% / +5.45% vs Standard)
- OBSERVE (./bench_random_mixed_hakmem_observe): 51.52M ops/s (includes telemetry; not the reference for performance comparison)
- SSOT capture: docs/analysis/PHASE89_SSOT_MEASUREMENT.md
Source of truth for optimization decisions: same-binary A/B (ENV toggle) = scripts/run_mixed_10_cleanenv.sh
Source of truth for mimalloc/tcmalloc references: reference (separate binary / LD_PRELOAD) = docs/analysis/ALLOCATOR_COMPARISON_SSOT.md
Scorecard (source of truth for targets and current values): docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md (Phase 89 SSOT already reflected as the current snapshot)
- Phase 66/68/69 (60M-62M range) are historical (do not compare them directly with the current HEAD; take a rebase first if a comparison is needed)
Next phase (design review): docs/analysis/PHASE90_STRUCTURAL_REVIEW_AND_GAP_TRIAGE_SSOT.md
Mixed 10-run SSOT (harness): scripts/run_mixed_10_cleanenv.sh
- Default BENCH_BIN=./bench_random_mixed_hakmem (Standard)
- For FAST PGO, set BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo explicitly
- Defaults: ITERS=20000000 WS=400, HAKMEM_WARM_POOL_SIZE=16, HAKMEM_TINY_C4_INLINE_SLOTS=1, HAKMEM_TINY_C5_INLINE_SLOTS=1, HAKMEM_TINY_C6_INLINE_SLOTS=1, HAKMEM_TINY_INLINE_SLOTS_FIXED=1, HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1
- Forced OFF by cleanenv (leak prevention): HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0 (Phase 83-1 NO-GO / research)
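
The pinning idea behind the harness can be illustrated with a small Python sketch. The real logic lives in scripts/run_mixed_10_cleanenv.sh; `build_ssot_env` is a hypothetical helper, and treating ITERS and WS as plain environment variables is an assumption based on how the defaults are listed above.

```python
import os

# Defaults listed for the Mixed 10-run SSOT harness.
SSOT_DEFAULTS = {
    "ITERS": "20000000",
    "WS": "400",
    "HAKMEM_WARM_POOL_SIZE": "16",
    "HAKMEM_TINY_C4_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C5_INLINE_SLOTS": "1",
    "HAKMEM_TINY_C6_INLINE_SLOTS": "1",
    "HAKMEM_TINY_INLINE_SLOTS_FIXED": "1",
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH": "1",
}

# Knobs forced OFF so a stray shell export cannot change the route.
SSOT_FORCED_OFF = {
    "HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED": "0",
}


def build_ssot_env(base=None, bench_bin="./bench_random_mixed_hakmem"):
    """Return a copy of the environment with the SSOT knobs pinned."""
    env = dict(os.environ if base is None else base)
    env.update(SSOT_DEFAULTS)    # pinned defaults win over inherited values
    env.update(SSOT_FORCED_OFF)  # research knobs are always overridden
    env["BENCH_BIN"] = bench_bin
    return env


leaky = {"HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED": "1", "WS": "9999"}
pinned = build_ssot_env(base=leaky)
print(pinned["WS"], pinned["HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED"])
```

Overriding rather than merely defaulting is the point: an inherited value must never win, otherwise the ENV drift seen in Phase 87 can recur.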

0a) Drift prevention (minimum SSOT rules)

Always set HAKMEM_PROFILE explicitly for hakmem (when it is unset the route changes and the numbers easily break).
- Recommended: HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE (Speed-first)
Choose the runner by the purpose of the comparison:
- hakmem SSOT (optimization decisions): scripts/run_mixed_10_cleanenv.sh
- allocator reference (short run): scripts/run_allocator_quick_matrix.sh
- allocator reference (minimizes layout differences): scripts/run_allocator_preload_matrix.sh
Keep reproducibility logs (the minimum when chasing a few percent):
- scripts/bench_ssot_capture.sh
- HAKMEM_BENCH_ENV_LOG=1 (records CPU governor/EPP/freq)
- External consultation (paste-in packet): docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md (generated by scripts/make_chatgpt_pro_packet_free_path.sh)

0b) Allocator comparison (reference)

Allocator comparisons (system/jemalloc/mimalloc/tcmalloc) are reference only (separate binaries / LD_PRELOAD → they include layout differences).
- SSOT: docs/analysis/ALLOCATOR_COMPARISON_SSOT.md
- Quick (Random Mixed 10-run): scripts/run_allocator_quick_matrix.sh
  - Important: for hakmem, set HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE explicitly and run via scripts/run_mixed_10_cleanenv.sh (a missing PROFILE corrupts the numbers).
- Same-binary (recommended; minimizes layout differences): scripts/run_allocator_preload_matrix.sh
  - Keep bench_random_mixed_system fixed and swap the allocator via LD_PRELOAD.
  - Note: the route differs from hakmem's linked benchmarks (bench_random_mixed_hakmem*); LD_PRELOAD is a drop-in wrapper, so it is a different path.
- Scenario CSV (small-scale reference): scripts/bench_allocators_compare.sh

1) Not getting lost (routes / observation)

Minimum procedure to prevent "optimizing a path that is never exercised".

Route Banner (eliminates route misidentification): HAKMEM_ROUTE_BANNER=1
- Output: Route assignments (backend route kind) + cache config (unified_cache_enabled / warm_pool_max_per_class)
Refill observation SSOT: docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md
- With WS=400 (Mixed SSOT) misses are negligible → unified_cache_refill() optimization is frozen (zero ROI)

2) Latest conclusions (key points only)

Phase 69 (WarmPool sweep): HAKMEM_WARM_POOL_SIZE=16 is a strong GO (+3.26%) and has been promoted to baseline.
- Design: docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md
- Results: docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md
Phase 70 (observation SSOT): made the statistics visible and established prerequisite gates. Under the WS=400 SSOT, refill is cold.
- SSOT: docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md
Phase 71/73 (why WarmPool=16 wins): the win comes from a slight reduction in instructions/branches (confirmed with perf stat).
- Details: docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md
Phase 72 (ENV knob ROI exhausted): no ENV-only win beyond WarmPool=16 → time to attack through structure (code).
Phase 78-1 (structural): fixed the per-op ENV gate for enabling Inline Slots; GO (+2.31%) in same-binary A/B.
- Results: docs/analysis/PHASE78_1_INLINE_SLOTS_FIXED_MODE_RESULTS.md
Phase 80-1 (structural): converted the Inline Slots if-chain to switch dispatch; GO (+1.65%) in same-binary A/B.
- Results: docs/analysis/PHASE80_INLINE_SLOTS_SWITCH_DISPATCH_1_RESULTS.md
Phase 83-1 (structural): fixed the per-op ENV gate for switch dispatch (applying the Phase 78-1 pattern); NO-GO (+0.32%, branch reduction negligible) in same-binary A/B.
- Results: docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md
- Cause: the lazy-init pattern is already optimized (per-op overhead minimal) → the ROI of fixed mode is tiny

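2a) Next overall direction (design order, SSOT)

Goal: even while "mimalloc/tcmalloc are too strong", aim for +5-10% without breaking Box Theory (single boundary, reversible, minimal instrumentation, fail-fast).

Priority order (taking the core ideas of Google/TCMalloc as reference):

Batch ThreadCache overflow (top priority)
- When the inline slots (C4/C5/C6) fill up, spill the overflow to the colder tier in a batch rather than one object at a time
- Pin the conversion point to a single place (flush/drain)
Batch push/pop on the Central/Shared side (next)
- Batch the hand-off to shared/remote to reduce the number of lock/atomic operations
Memory return / footprint policy (operational axis)
- Turn the Balanced/Lean winning factors (syscall/RSS drift/tail) into an SSOT, and push only as far as speed is not sacrificed

Important: the current stage is deciding the "design core". Implement only after measurement confirms that overflow is frequent enough.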

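2b) Next work (on hold)

Wait until the work the user delegated to another agent (Claude Code) completes. Checks to start afterwards (the two needed at minimum):

Measure the inline slots overflow rate (FULL/overflow counts and ratios for C4/C5/C6)
Quantify the cost of the overflow destination (perf stat / perf report on the functions hit on overflow)

Once both are available, proceed to Phase 86 (Overflow batch design).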

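3) Operating rules (Box Theory + layout tax countermeasures)

Always stack changes as box + single boundary + reversible via ENV (fail-fast, minimal instrumentation).
A/B is in principle an ENV toggle on the same binary (comparisons across binaries mix in layout effects).
SSOT operation (drift prevention): docs/analysis/PHASE75_6_SSOT_POLICY_FAST_PGO_VS_STANDARD.md
"Delete it and it gets faster" is off-limits (link-out / large deletions easily flip sign because of layout tax) → prefer compile-out.
- Diagnostics: scripts/box/layout_tax_forensics_box.sh / docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md
Research box inventory SSOT: docs/analysis/RESEARCH_BOXES_SSOT.md
- Knob list: scripts/list_hakmem_knobs.sh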

5) Handling research boxes (freeze policy)

Phase 79-1 (C2 local cache): HAKMEM_TINY_C2_LOCAL_CACHE=0/1
- Result: +0.57% (NO-GO, below the +1.0% threshold) → research box freeze
- Default OFF in SSOT/cleanenv (scripts/run_mixed_10_cleanenv.sh forces 0)
- No physical deletion (avoids layout tax risk)
- Phase 82 (hardening): fully excluded the C2 local cache from the hot path (even with the environment variable set, the alloc/free hot path never touches it)
  - Record: docs/analysis/PHASE82_C2_LOCAL_CACHE_HOTPATH_EXCLUSION.md
Phase 85 (Free path commit-once, LEGACY-only): HAKMEM_FREE_PATH_COMMIT_ONCE=0/1
- Result: NO-GO (-0.86%) → research box freeze (default OFF)
- Reason: the effect overlaps with Phase 10 (MONO LEGACY DIRECT), and the tax from indirect calls and placement grew on top of that
- Record: docs/analysis/PHASE85_FREE_PATH_COMMIT_ONCE_RESULTS.md

4) Next instructions (Active)

Phase 74 (structural): shorten the UnifiedCache hit-path ✅ P1 (LOCALIZE) frozen

Premises:

Under the WS=400 SSOT, UnifiedCache misses are negligible → refill optimization has zero ROI.
The WarmPool=16 win came from a slight instruction/branch reduction → shortening the hit-path is the direct approach.

Phase 74-1: LOCALIZE (ENV-gated) ✅ Done (NEUTRAL +0.50%)

ENV: HAKMEM_TINY_UC_LOCALIZE=0/1
Runtime branch overhead increased instructions/branches (+0.7%/+0.4%)
Verdict: NEUTRAL (+0.50%)

Phase 74-2: LOCALIZE (compile-time gate) ✅ Done (NEUTRAL -0.87%)

Build flag: HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1 (default 0)
Runtime branch removed → instructions/branches improved (-0.6%/-2.3%) ✓
However, cache-misses rose +86% (register pressure / spill) → throughput -0.87%
Isolation succeeded: LOCALIZE itself wins, but the cache-miss increase cancels it out
Verdict: NEUTRAL (-0.87%) → P1 (LOCALIZE) frozen

Conclusion:

P1 (LOCALIZE) is frozen with default OFF (low ROI from dependency-chain reduction)
Next: proceed to Phase 74-3 (P0: FASTAPI)

Phase 74-3: P0 (FASTAPI) ✅ Done (NEUTRAL +0.32%)

Goal: move the unified_cache_enabled() / lazy-init / stats checks out of the hot loop

Approach:

Add the unified_cache_push_fast() / unified_cache_pop_fast() APIs
Precondition: the caller guarantees "valid/enabled/no-stats"
Fail-fast: fall back to the slow path on any unexpected state (single boundary)
ENV gate: HAKMEM_TINY_UC_FASTAPI=0/1 (default 0, research box)

Results (10-run Mixed SSOT, WS=400):

Throughput: +0.32% (NEUTRAL, below +1.0% GO threshold)
cache-misses: -16.31% (positive signal, insufficient throughput gain)

Verdict: NEUTRAL (+0.32%) → P0 (FASTAPI) frozen

References:

Design: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md
Instructions: docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md
Results (P1/P0): docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md

Phase 75 (structural): Hot-class Inline Slots (P2) ✅ Done (Standard A/B)

Goal: statistical analysis of C4-C7 → decide a targeted optimization strategy

Premises (Phase 74 learnings):

UnifiedCache hit-path optimization has low ROI ← register pressure / cache-miss effects
Next axis: exploit per-class characteristics → branch elimination via TLS-direct inline slots

Phase 75-0: Per-Class Analysis ✅ Done

Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):

Class	Capacity	Occupied	Hits	Pushes	Total Ops	Hit %	% of C4-C7
C6	128	127	2,750,854	2,750,855	5,501,709	100%	57.2%
C5	128	127	1,373,604	1,373,605	2,747,209	100%	28.5%
C4	64	63	687,563	687,564	1,375,127	100%	14.3%
C7	?	?	?	?	?	?	?

Key findings:

C6 dominates: 57.2% of operations (2.75M hits)
100% hit rate for every class (refill inactive in SSOT)
Cache occupancy near-capacity (98-99%)
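
The "% of C4-C7" column follows directly from the hit and push counts. A short Python sketch that recomputes it, using only values from the table (C7 is omitted because its counters are not recorded there):

```python
# (hits, pushes) per class, copied from the Phase 75-0 table.
ops = {
    "C6": (2_750_854, 2_750_855),
    "C5": (1_373_604, 1_373_605),
    "C4": (687_563, 687_564),
}

# Total ops per class = hits + pushes.
totals = {cls: hits + pushes for cls, (hits, pushes) in ops.items()}
grand_total = sum(totals.values())

# Share of each class among the measured C4-C6 operations, in percent.
shares = {cls: 100.0 * t / grand_total for cls, t in totals.items()}

for cls in ("C6", "C5", "C4"):
    print(f"{cls}: {totals[cls]:>9,} ops  {shares[cls]:.2f}%")
```

At two decimals these shares match the 57.17% / 28.55% / 14.29% figures used in the later coverage summaries.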

Phase 75-1: C6-only Inline Slots ✅ Done (GO +2.87%)

Approach: Modular box theory design with single decision point at TLS init

Implementation (5 new boxes + test script):

ENV gate box: HAKMEM_TINY_C6_INLINE_SLOTS=0/1 (lazy-init, default OFF)
TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
Fast-path API: c6_inline_push() / c6_inline_pop() (always_inline, 1-2 cycles)
Integration: Minimal (2 boundary points: alloc/free for C6 class only)
Backward compatible: Legacy code intact, fail-fast to unified_cache
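
To make the behaviour concrete, here is a behavioural model of one per-thread slot ring, written in Python. It is not the C implementation: the class and field names are hypothetical, the FIFO ordering is assumed from the Phase 91 description, and the two counters mirror the PUSH FULL / POP EMPTY telemetry. A failed push or pop is where the caller would fall back to unified_cache.

```python
class InlineSlots:
    """Fixed-capacity FIFO ring modelling one per-thread, per-class slot array."""

    def __init__(self, capacity: int = 128):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0        # index of the next element to pop
        self.count = 0       # number of occupied slots
        self.push_full = 0   # pushes rejected because the ring was full
        self.pop_empty = 0   # pops that found the ring empty

    def push(self, ptr) -> bool:
        """Store ptr; return False (caller falls back) when the ring is full."""
        if self.count == self.capacity:
            self.push_full += 1
            return False
        tail = (self.head + self.count) % self.capacity
        self.slots[tail] = ptr
        self.count += 1
        return True

    def pop(self):
        """Return the oldest ptr, or None (caller falls back) when empty."""
        if self.count == 0:
            self.pop_empty += 1
            return None
        ptr = self.slots[self.head]
        self.head = (self.head + 1) % self.capacity
        self.count -= 1
        return ptr


ring = InlineSlots(capacity=2)
results = [ring.push(0x1000), ring.push(0x2000), ring.push(0x3000)]
print(results, hex(ring.pop()), ring.push_full)
```

With 128 pointer-sized (8-byte) slots the array alone is 1 KB, which matches the per-thread figure quoted above.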

Results (10-run Mixed SSOT, WS=400):

Baseline (C6 inline OFF): 44.24 M ops/s
Treatment (C6 inline ON): 45.51 M ops/s
Delta: +1.27 M ops/s (+2.87%)

Decision: ✅ GO (exceeds +1.0% strict threshold)

Mechanism: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)

References:

Per-class analysis: docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md
Results: docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md

Phase 75-2: C5 Inline Slots ✅ Done (GO +1.10%)

Goal: C5-only isolated measurement (28.5% of C4-C7) for individual contribution

Approach: Replicate C6 pattern with careful isolation

Add C5 ring buffer (128 slots, 1KB TLS)
ENV gate: HAKMEM_TINY_C5_INLINE_SLOTS=0/1 (default OFF)
Test strategy: C5-only (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
Integration: alloc/free boundary points (C5 FIRST, then C6, then unified_cache)

Results (10-run Mixed SSOT, WS=400):

Baseline (C5=OFF, C6=ON): 44.26 M ops/s (σ=0.37)
Treatment (C5=ON, C6=ON): 44.74 M ops/s (σ=0.54)
Delta: +0.49 M ops/s (+1.10%)

Decision: ✅ GO (C5 individual contribution validated)

Cumulative Performance:

Phase 75-1 (C6): +2.87%
Phase 75-2 (C5 isolated): +1.10%
Combined potential: ~+3.97% (if additive)

References:

Implementation details: docs/analysis/PHASE75_2_C5_INLINE_SLOTS_IMPLEMENTATION.md

Phase 75-3: C5+C6 Interaction Test (4-Point Matrix A/B) ✅ Done (STRONG GO +5.41%)

Goal: Comprehensive interaction test + final promotion decision

Approach: 4-point matrix A/B test (single binary, ENV-only configuration)

Point A (C5=0, C6=0): Baseline
Point B (C5=1, C6=0): C5 solo
Point C (C5=0, C6=1): C6 solo
Point D (C5=1, C6=1): C5+C6 combined

Results (10-run per point, Mixed SSOT, WS=400):

Point A (baseline): 42.36 M ops/s
Point B (C5 solo): 43.54 M ops/s (+2.79% vs A)
Point C (C6 solo): 44.25 M ops/s (+4.46% vs A)
Point D (C5+C6): 44.65 M ops/s (+5.41% vs A) [MAIN TARGET]

Additivity Analysis:

Expected additive (B+C-A): 45.43 M ops/s
Actual (D): 44.65 M ops/s
Sub-additivity: 1.72% (near-perfect additivity, minimal negative interaction)
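
The additivity figures can be reproduced with a few lines of Python. This is a sketch with a hypothetical function name. The formula (expected = B + C - A, sub-additivity measured relative to expected) is inferred from the numbers in this section.

```python
def additivity(a: float, b: float, c: float, d: float):
    """Return (expected additive throughput, sub-additivity in percent).

    a: baseline, b and c: single-factor points, d: combined point.
    A positive sub-additivity means D falls short of the additive
    expectation; a negative value means the combination is super-additive.
    """
    expected = b + c - a
    sub_additivity_percent = 100.0 * (expected - d) / expected
    return expected, sub_additivity_percent


# Phase 75-3 matrix (M ops/s): A, B (C5 solo), C (C6 solo), D (C5+C6).
expected, sub = additivity(42.36, 43.54, 44.25, 44.65)
print(f"expected={expected:.2f}M  sub-additivity={sub:.2f}%")
```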

Perf Stat Validation (D vs A):

Instructions: -6.1% (function call elimination confirmed)
Branches: -6.1% (matches instruction reduction)
Cache-misses: -31.5% (improved locality, NOT +86% like Phase 74-2)
Throughput: +5.41% (net positive)

Decision: ✅ STRONG GO (+5.41%)

D vs A: +5.41% >> 3.0% threshold
Sub-additivity: 1.72% << 20% acceptable
Phase 73 thesis validated: instructions/branches DOWN, throughput UP

Promotion Completed:

core/bench_profile.h: Added C5+C6 defaults to bench_apply_mixed_tinyv3_c7_common()
scripts/run_mixed_10_cleanenv.sh: Added C5+C6 ENV defaults
C5+C6 inline slots now promoted to preset defaults for MIXED_TINYV3_C7_SAFE

Phase 75 Complete: C5+C6 inline slots (129-256B) deliver a proven +5.41% gain on the Standard binary (bench_random_mixed_hakmem).

Before updating the FAST PGO baseline (scorecard), re-measure the same-condition A/B (C5/C6 OFF/ON) with BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo.

Phase 75-4 (FAST PGO rebase) ✅ Done

Result: +3.16% (GO) (4-point matrix, after excluding outliers)
Details: docs/analysis/PHASE75_4_FAST_PGO_REBASE_RESULTS.md
Important: compared with the Phase 69 FAST baseline (62.63M), the current FAST PGO baseline appears to be substantially lower (suspected PGO profile staleness / training mismatch / build drift)

Phase 75-5 (PGO regeneration) ✅ Done (NO-GO on hypothesis, code bloat root cause identified)

Goal:

Regenerate PGO training against the current code, including the C5/C6 inline slots, to recover a Phase 69-class FAST baseline.

Results:

The effect of PGO profile regeneration was limited (+0.3% only)
The root cause is code bloat (+13KB, +3.1%), not a PGO profile mismatch
Code bloat triggers layout tax: IPC collapse (-7.22%) and a branch-miss spike (+19.4%) → net -12% regression

Forensics findings (scripts/box/layout_tax_forensics_box.sh):

Text size: +13KB (+3.1%)
IPC: 1.80 → 1.67 (-7.22%)
Branch-misses: +19.4%
Cache-misses: +5.7%

Decision:

FAST PGO is sensitive to code bloat → Track A/B discipline established
Track A: Standard binary で implementation decisions (SSOT for GO/NO-GO)
Track B: FAST PGO で mimalloc ratio tracking (periodic rebase, not single-point decisions)

References:

Detailed results: docs/analysis/PHASE75_5_PGO_REGENERATION_RESULTS.md
Instructions: docs/analysis/PHASE75_5_PGO_REGENERATION_NEXT_INSTRUCTIONS.md

Phase 76 (structural, continued): C4-C7 Remaining Classes ✅ Phase 76-1 Done (GO +1.73%)

Premises (Phase 75 complete):

C5+C6 inline slots: +5.41% proven (Standard), +3.16% (FAST PGO)
Code bloat sensitivity identified → Track A/B discipline established
Remaining C4-C7 coverage: C4 (14.29%), C7 (0%)

Phase 76-0: C7 Statistics Analysis ✅ Done (NO-GO for C7 P2)

Approach: OBSERVE run to measure C7 allocation patterns in Mixed SSOT
Results: C7 = 0% of operations in the Mixed SSOT workload
Decision: NO-GO for C7 P2 optimization → proceed to C4

References:

Results: docs/analysis/PHASE76_0_C7_STATISTICS_ANALYSIS.md

Phase 76-1: C4 Inline Slots ✅ Done (GO +1.73%)

Goal: Complete C4-C6 inline slots trilogy, targeting remaining 14.29% of C4-C7 operations

Implementation (modular box pattern):

ENV gate: HAKMEM_TINY_C4_INLINE_SLOTS=0/1 (default OFF → ON after promotion)
TLS ring: 64 slots, 512B per thread (lighter than C5/C6's 1KB)
Fast-path API: c4_inline_push() / c4_inline_pop() (always_inline)
Integration: C4 FIRST → C5 → C6 → unified_cache (alloc/free cascade)

Results (10-run Mixed SSOT, WS=400):

Baseline (C4=OFF, C5=ON, C6=ON): 52.42 M ops/s
Treatment (C4=ON, C5=ON, C6=ON): 53.33 M ops/s
Delta: +0.91 M ops/s (+1.73%)

Decision: ✅ GO (exceeds +1.0% threshold)

Promotion Completed:

core/bench_profile.h: Added C4 default to bench_apply_mixed_tinyv3_c7_common()
scripts/run_mixed_10_cleanenv.sh: Added HAKMEM_TINY_C4_INLINE_SLOTS=1 default
C4 inline slots now promoted to preset defaults alongside C5+C6

Coverage Summary (C4-C7 complete):

C6: 57.17% (Phase 75-1, +2.87%)
C5: 28.55% (Phase 75-2, +1.10%)
C4: 14.29% (Phase 76-1, +1.73%)
C7: 0.00% (Phase 76-0, NO-GO)
Combined C4-C6: 100% of C4-C7 operations

Estimated Cumulative Gain: +7-8% (C4+C5+C6 combined, assumes near-perfect additivity like Phase 75-3)

References:

Results: docs/analysis/PHASE76_1_C4_INLINE_SLOTS_RESULTS.md
C4 box files: core/box/tiny_c4_inline_slots_*.h, core/front/tiny_c4_inline_slots.h, core/tiny_c4_inline_slots.c

Phase 76-2: C4+C5+C6 Comprehensive 4-Point Matrix ✅ Done (STRONG GO +7.05%, super-additive)

Goal: Validate cumulative C4+C5+C6 interaction and establish SSOT baseline for next optimization axis

Results (4-point matrix, 10-run each):

Point A (all OFF): 49.48 M ops/s (baseline)
Point B (C4 only): 49.44 M ops/s (-0.08%, context-dependent regression)
Point C (C5+C6 only): 52.27 M ops/s (+5.63% vs A)
Point D (all ON): 52.97 M ops/s (+7.05% vs A) ✅ STRONG GO

Critical Discovery:

C4 shows -0.08% regression in isolation (C5/C6 OFF)
C4 shows +1.27% gain in context (with C5+C6 ON)
Super-additivity: Actual D (+7.05%) exceeds expected additive (+5.56%)
Implication: Per-class optimizations are context-dependent, not independently additive

Sub-additivity Analysis:

Expected additive: 52.23 M ops/s (B + C - A)
Actual (D): 52.97 M ops/s
Sub-additivity: -1.42% (negative, i.e. actual exceeds the additive expectation: super-additive) ✓

Decision: ✅ STRONG GO

D vs A: +7.05% >> +3.0% threshold
Super-additive behavior confirms synergistic gains
C4+C5+C6 locked to SSOT defaults

References:

Detailed results: docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md

🟩 Done: C4-C7 Inline Slots Optimization Stack

Per-class Coverage Summary (Final):

C6 (57.17%): +2.87% (Phase 75-1)
C5 (28.55%): +1.10% (Phase 75-2)
C4 (14.29%): +1.27% in context (Phase 76-1/76-2)
C7 (0.00%): NO-GO (Phase 76-0)
Combined C4-C6: +7.05% (Phase 76-2 super-additive)

Status: ✅ C4-C7 Optimization Complete (100% coverage, SSOT locked)

🟥 Next Active (Phase 77+)

Options:

Option A: FAST PGO Periodic Tracking (Track B discipline)

Regenerate PGO profile with C4+C5+C6=ON if code bloat accumulates
Monitor mimalloc ratio progress (secondary metric)
Not a decision point per se, but periodic maintenance

Option B: Phase 77 (Alternative Optimization Axis)

Explore beyond per-class inline slots
Candidates:
- Allocation fast-path optimization (call elimination)
- Metadata/page lookup (table optimization)
- C3/C2 class strategies
- Warm pool tuning (beyond Phase 69's WarmPool=16)

Recommendation: proceed to Option B (Phase 77+)

C4-C7 optimizations are exhausted and locked
Ready to explore new optimization axes
Baseline is now +7.05% stronger than Phase 75-3

References:

Full C4-C7 analysis: docs/analysis/PHASE76_2_C4C5C6_MATRIX_RESULTS.md
Phase 75-3 reference (C5+C6): docs/analysis/PHASE75_3_C5_C6_INTERACTION_RESULTS.md

5) Archive

Detailed log: CURRENT_TASK_ARCHIVE_20251210.md
Pre-cleanup snapshot: docs/analysis/CURRENT_TASK_ARCHIVE.md
