hakmem/docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md

# Bench Reproducibility SSOT (the bare minimum to stop results from flip-flopping)

Goal: eliminate the most painful problem in development that chases a few percent: **benchmarks that do not reproduce**.

Note: for which build to use when, `docs/analysis/SSOT_BUILD_MODES.md` is authoritative.

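## 1) Conclusion first (common causes)

Even on the same machine, results routinely move by 5–15% when any of the following changes.

- **CPU power/thermal** (governor / EPP / turbo)
- **`HAKMEM_PROFILE` not set** (the route changes)
- **Bench size range leaking in** (`HAKMEM_BENCH_MIN_SIZE/MAX_SIZE` changes the class distribution)
- **Leaked exports** (ENV values from earlier runs are still set)
- **Comparing different binaries** (layout tax: text placement changes)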
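
The "leaked exports" cause can be checked mechanically before a run. The following is a minimal sketch, not part of the repository: `find_leaked_vars` and its allow-list are hypothetical names introduced here, and it only assumes that the relevant variables share the `HAKMEM_` prefix seen in this document.

```python
import os


# Hypothetical helper: the prefix and the allow-list are illustrative assumptions.
def find_leaked_vars(environ, prefix="HAKMEM_", allowed=("HAKMEM_PROFILE",)):
    """Return prefixed variables in `environ` that were not explicitly allowed."""
    return {
        name: value
        for name, value in sorted(environ.items())
        if name.startswith(prefix) and name not in allowed
    }


if __name__ == "__main__":
    # Report anything a previous `export` may have left behind in this shell.
    for name, value in find_leaked_vars(os.environ).items():
        print(f"leftover export: {name}={value}")
```

An empty result does not prove the run is clean (CPU state and binary layout still matter), but a non-empty one is a cheap early warning.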

## 2) SSOT (the authority for optimization decisions)

- Runner: `scripts/run_mixed_10_cleanenv.sh`
- Required:
  - Set `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` explicitly
  - `RUNS=10` (averages out noise)
  - `WS=400` (SSOT)
  - The size range is fixed on the SSOT side (the runner enforces it):
    - `HAKMEM_BENCH_MIN_SIZE=16`
    - `HAKMEM_BENCH_MAX_SIZE=1040`
- Optional (for isolating causes):
  - `HAKMEM_BENCH_ENV_LOG=1` (logs CPU governor/EPP/freq)
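
As an illustration of the settings above, the runner can be launched with an environment built from scratch rather than inherited from the shell. This is a sketch under assumptions: it assumes the runner reads `RUNS` and `WS` from the environment, and `ssot_env` / `run_ssot` are hypothetical names. The size range is deliberately not set here, since the runner enforces it.

```python
import subprocess

RUNNER = "scripts/run_mixed_10_cleanenv.sh"  # runner path named in this document


def ssot_env(path="/usr/bin:/bin"):
    """Build the SSOT environment from nothing, so stale exports cannot leak in."""
    return {
        "PATH": path,
        "HAKMEM_PROFILE": "MIXED_TINYV3_C7_SAFE",  # always explicit
        "RUNS": "10",                              # assumption: read from the environment
        "WS": "400",                               # assumption: read from the environment
        "HAKMEM_BENCH_ENV_LOG": "1",               # optional, for isolating causes
    }


def run_ssot(runner=RUNNER):
    """Run the SSOT runner; `env=` replaces the inherited environment entirely."""
    return subprocess.run(
        [runner], env=ssot_env(), capture_output=True, text=True, check=True
    )
```

Passing `env=` to `subprocess.run` means the child sees only the listed variables, which is the same idea as the cleanenv runner applied one level up.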

## 3) reference (the authority for cross-allocator comparison)

Allocator comparisons include layout tax, so treat them as **reference** only.
To make the comparison fairer, measure with the same binary:

- Same-binary runner: `scripts/run_allocator_preload_matrix.sh`
  - Keeps `bench_random_mixed_system` fixed and swaps `LD_PRELOAD`
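
The same-binary idea can be sketched as a job list in which only `LD_PRELOAD` differs between entries. This is not the actual `run_allocator_preload_matrix.sh`: `preload_jobs` is a hypothetical helper, and the library paths in the usage comment are placeholders that depend on the machine.

```python
def preload_jobs(binary, libs, base_env=None):
    """Return (name, argv, env) jobs that differ only in LD_PRELOAD.

    `libs` maps an allocator name to a shared-library path; a falsy path
    means "no preload", i.e. the allocator already linked into the binary.
    """
    base = dict(base_env or {"PATH": "/usr/bin:/bin"})
    jobs = []
    for name, lib_path in libs.items():
        env = dict(base)  # copy, so jobs never share mutable state
        if lib_path:
            env["LD_PRELOAD"] = lib_path
        jobs.append((name, [binary], env))
    return jobs


# Usage (paths are hypothetical placeholders):
# jobs = preload_jobs("./bench_random_mixed_system", {
#     "system": None,
#     "mimalloc": "/path/to/libmimalloc.so",
# })
```

Because the binary's text layout is identical across jobs, differences between entries are more likely to come from the allocator than from code placement.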

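## 4) Operating rules that stop the flip-flopping (the minimum ritual)

1. Always run SSOT through cleanenv:
   - `scripts/run_mixed_10_cleanenv.sh`
   - `SSOT_MIN_SIZE/SSOT_MAX_SIZE` can override the range explicitly (unaffected by leaked exports)
2. Keep an environment log every time:
   - `HAKMEM_BENCH_ENV_LOG=1`
3. Save results to a file (so they can be traced later):
   - Use `scripts/bench_ssot_capture.sh` (saves git sha / env / bench output together)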
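
To show what "saved together" might look like, here is a minimal sketch that writes the git sha, the environment, and the raw bench output into one JSON file. It is not the implementation of `scripts/bench_ssot_capture.sh`; `capture_run`, the file naming, and the JSON layout are all assumptions made for illustration.

```python
import json
import time
from pathlib import Path


def capture_run(out_dir, git_sha, env, bench_output):
    """Write one self-describing record per run and return its path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    record = {
        "git_sha": git_sha,
        "env": {key: env[key] for key in sorted(env)},  # sorted for stable diffs
        "bench_output": bench_output,                   # raw text, not parsed
    }
    stamp = time.strftime("%Y%m%dT%H%M%S")
    path = out / f"ssot_{stamp}_{git_sha[:12]}.json"
    path.write_text(json.dumps(record, indent=2) + "\n", encoding="utf-8")
    return path
```

In practice the sha could come from `git rev-parse HEAD`; it is passed in as a parameter here so the function has no dependency on the working directory.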

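## 5) Important note (AMD pstate epp)

On `amd-pstate-epp` systems, benchmarks can skew toward the slow side if the following settings are left in place:
- governor=`powersave`
- energy_perf_preference=`power`

First, compare only runs whose `HAKMEM_BENCH_ENV_LOG=1` output shows the **same** conditions.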
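
The "same conditions" check can also be done outside the bench log by reading the cpufreq files directly. The sketch below assumes the standard Linux sysfs files `scaling_governor` and `energy_performance_preference`; it does not parse the `HAKMEM_BENCH_ENV_LOG=1` output, whose format is not described here, and the function names are hypothetical.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs file names; the EPP file may be absent on some drivers.
FIELDS = {
    "governor": "scaling_governor",
    "epp": "energy_performance_preference",
}


def read_power_state(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return {'governor': ..., 'epp': ...}; a missing file yields None."""
    cpufreq = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    state = {}
    for key, filename in FIELDS.items():
        try:
            state[key] = (cpufreq / filename).read_text().strip()
        except OSError:
            state[key] = None
    return state


def same_conditions(a, b):
    """Two snapshots are comparable only if fully known and identical."""
    return a == b and None not in a.values()
```

Taking a snapshot before each run and refusing to compare runs where `same_conditions` is false keeps a `powersave`/`power` run from being compared against a `performance` one.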

## 6) External review (paste packet)

For the "compress the code and paste it" use case, use packet generation to cut down the manual work each time:

- Generator script: `scripts/make_chatgpt_pro_packet_free_path.sh`
- Output (snapshot): `docs/analysis/FREE_PATH_REVIEW_PACKET_CHATGPT.md`