Files
hakmem/docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md

153 lines
5.1 KiB
Markdown
Raw Normal View History

Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated) Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
# Phase 5 E5: Post E4-Combined Next Instructions次の指示書
## Status2025-12-14 / E5-2 FREEZE 反映)
Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated) Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
- BaselineMixed, 20M iters, ws=400: **47.34M ops/s**E4-1+E4-2 ON
- Hot spotsself%:
- `free`: **37.56%**
- `tiny_alloc_gate_fast`: **13.73%**
- `malloc`: **12.95%**
- `tiny_region_id_write_header`: **6.97%**
- `hakmem_env_snapshot_enabled`: **4.29%**
- `tiny_get_max_size`: **4.24%**
狙い: “形” 最適化は一段落。次は **free 内部****ヘッダ書き込み**、そして **ENV snapshot gate の常時コスト**を削る。
Update:
- E5-1Free Tiny Direct Path✅ GO+3.35% mean / +3.36% median
- E5-2Header write-once⚪ NEUTRAL → FREEZE
- E5-4Malloc Tiny Direct⚪ NEUTRAL → FREEZE
- E6ENV snapshot branch-shape fix❌ NO-GO → FREEZE-1.71%
- E7Frozen box prune / baseline diet❌ NO-GO-3%台)→ 差し戻し
- 次の芯: “削る/分岐形” ではなく、再び **重複排除(境界の一本化)****大きい構造変更** を探す
Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated) Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
---
## Step 0: Baseline 固定Mixed
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 \
HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 \
./bench_random_mixed_hakmem 20000000 400 1
```
以後の A/B は必ず同一バイナリで:
- A: `E5_* = 0`
- B: `E5_* = 1`
---
## Step 1: perf で “free の中身” を割る(必須)
```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -- \
./bench_random_mixed_hakmem 20000000 400 1
perf report --stdio --no-children
```
次に `free` だけを掘る:
```sh
perf report --stdio --no-children --symbol free
```
目的:
- `free` の中で **真に重い行/分岐**を特定し、E5-1 の境界(箱の切り方)を決める。
---
## E5-1優先A: free() 内部の “Tiny 直通” を一本化
### 仮説
`free` は依然トップだが、wrapper での “tiny 判定→tiny free” がまだ重い(チェック/分岐/再判定が残っている)。
### 方針(箱理論)
- **L0 SplitBox**: `header_magic` / `class_idx` が valid なときだけ Tiny 直通fail-fast
- **L1 HotBox**: Tiny の same-thread TLS push だけ(副作用ゼロ)
- **L1 ColdBox**: 既存の fallbackpool/mid/large/invalid header
### 実装ルール
- 境界は 1 箇所(`free()` wrapper の先頭分岐で確定)
- `ENV gate`: `HAKMEM_FREE_TINY_DIRECT=0/1`default 0 / preset(MIXED)=1
Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated) Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
- 可視化はカウンタのみ(`direct_hit`, `direct_miss`, `invalid_header`
### GO/NO-GO
- Mixed 10-run mean:
- GO: **+1.0% 以上**
- ±1.0%: NEUTRALfreeze
- -1.0% 以下: NO-GOfreeze
---
## E5-2: Header write-once⚪ NEUTRAL → FROZEN
結論:
- E5-2 は **NEUTRAL**branch overhead ≈ savingsなので **freeze**
- 以後は追わず、次は E5-4 を優先する。
参照:
- Design: `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md`
- Results: `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md`
Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated) Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
### 仮説
`tiny_region_id_write_header` は “正しいが高頻度”。
ブロックは同一クラス内で再利用されるので、ヘッダは **初回だけ**書けば足りる。
### 方針(箱理論)
- **HeaderPrefillBox**cold/refill 境界)で “ブロック生成時” に header をセット
- alloc hot path は `base+1` 返却のみheader write をしない)
### 安全ゲート
- `ENV gate`: `HAKMEM_TINY_HEADER_PREFILL=0/1`default 0
- Fail-fast:
- “prefill された slab” だけ skip を許可
- prefill 未完のブロックは従来 `tiny_region_id_write_header()` にフォールバック
### A/B
- Mixed 10-run + health profiles
- 期待: +1〜3%(ヘッダ書き込み + 関連分岐の削減)
---
## E5-4次の芯: Malloc Tiny DirectE5-1 の alloc 側複製)
指示書:
- `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`
---
## E5-3DEFER: `hakmem_env_snapshot_enabled()` の分岐形を “enabled 前提” に寄せる
Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated) Combined A/B Test Results (10-run Mixed): - Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median) - Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median) - Improvement: +6.43% mean, +6.74% median Interaction Analysis: - E4-1 alone: +3.51% (measured in separate session) - E4-2 alone: +21.83% (measured in separate session) - Combined: +6.43% (measured in same binary) - Pattern: SUBADDITIVE (overlapping bottlenecks) Key Finding: Single-binary incremental gain is the accurate metric - E4-1 and E4-2 target overlapping TLS/branch resources - Individual measurements were from different baselines/sessions - Combined measurement (same binary, both flags) shows true progress Phase 5 Total Progress: - Original baseline (session start): 35.74M ops/s - Combined optimized: 47.34M ops/s - Total gain: +32.4% (cross-session, reference only) - Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF) New Baseline Perf Profile (47.0M ops/s): - free: 37.56% self% (still top hotspot) - tiny_alloc_gate_fast: 13.73% (reduced from 19.50%) - malloc: 12.95% (reduced from 16.13%) - tiny_region_id_write_header: 6.97% (header write tax) - hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible) Health Check: PASS - MIXED_TINYV3_C7_SAFE: 42.3M ops/s - C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s Phase 5 E5 Candidates (from perf profile): - E5-1: free() path internals (37.56% self%) - E5-2: Header write reduction (6.97% self%) - E5-3: ENV snapshot overhead (4.29% self%) Deliverables: - docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md - docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md - CURRENT_TASK.md (E4 combined complete, E5 candidates) - docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer) - perf.data.e4combined (perf profile data) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
### 背景
`MIXED_TINYV3_C7_SAFE` では `HAKMEM_ENV_SNAPSHOT=1` が常用になったため、
現状の `if (__builtin_expect(hakmem_env_snapshot_enabled(), 0))` は **hint が逆**になり得る。
### 方針
同じ意味で分岐形だけ変える(箱の外形最適化):
- `if (__builtin_expect(!hakmem_env_snapshot_enabled(), 0)) { legacy; } else { snapshot; }`
- もしくは `*_cold()` に legacy を追い出すnoinline,cold
### ENV / 戻せる
- `ENV gate`: `HAKMEM_ENV_SNAPSHOT_SHAPE=0/1`default 0
- まず `malloc_tiny_fast.h` の 5 箇所と、`tiny_legacy_fallback_box.h` / `tiny_metadata_cache_hot_box.h` を対象にする
### GO/NO-GO
- Mixed 10-run mean で **+1.0% 以上**なら採用候補
- 期待: +0.5〜2.0%mispredict 回避)
---
## Step 2: 健康診断(必須)
```sh
scripts/verify_health_profiles.sh
```
---
## Step 3: 昇格(勝ち箱のみ)
- `core/bench_profile.h``MIXED_TINYV3_C7_SAFE` に default 化opt-out 可能)
- `docs/analysis/ENV_PROFILE_PRESETS.md` に A/B と rollback を追記
- `CURRENT_TASK.md` を更新(結果と “次の芯” を 1 行で)