hakmem/docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md

# Phase 5 E5: Post E4-Combined Next Instructions

## Status (2025-12-14, reflects the E5-2 FREEZE)

- Baseline (Mixed, 20M iters, ws=400): **47.34M ops/s** (E4-1+E4-2 ON)
- Hot spots (self%):
  - `free`: **37.56%**
  - `tiny_alloc_gate_fast`: **13.73%**
  - `malloc`: **12.95%**
  - `tiny_region_id_write_header`: **6.97%**
  - `hakmem_env_snapshot_enabled`: **4.29%**
  - `tiny_get_max_size`: **4.24%**

Goal: "shape" optimization has run its course. Next, cut the **internals of `free`**, the **header write**, and the **always-on cost of the ENV snapshot gate**.

Update:
- E5-1 (Free Tiny Direct Path) ✅ GO (+3.35% mean / +3.36% median)
- E5-2 (Header write-once) ⚪ NEUTRAL → FREEZE
- E5-4 (Malloc Tiny Direct) ⚪ NEUTRAL → FREEZE
- E6 (ENV snapshot branch-shape fix) ❌ NO-GO → FREEZE (-1.71%)
- E7 (Frozen box prune / baseline diet) ❌ NO-GO (around -3%) → reverted
- Next core target: instead of more "pruning / branch shape" work, look again for **deduplication (unifying boundaries)** or a **large structural change**

---

## Step 0: Pin the baseline (Mixed)

```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=1 \
  HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=1 \
  ./bench_random_mixed_hakmem 20000000 400 1
```

From here on, always run A/B on the same binary:
- A: `E5_* = 0`
- B: `E5_* = 1`

---

## Step 1: Break down the inside of `free` with perf (required)

```sh
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -- \
  ./bench_random_mixed_hakmem 20000000 400 1
perf report --stdio --no-children
```

Then drill into `free` only:
```sh
perf report --stdio --no-children --symbols=free
```

Purpose:
- Identify the **truly heavy lines/branches** inside `free` and decide the E5-1 boundary (how to cut the boxes).

---

## E5-1 (Priority A): unify the "Tiny direct" path inside free()

### Hypothesis
`free` is still the top hot spot, and the "tiny check → tiny free" sequence in the wrapper is still heavy (checks, branches, and re-checks remain).

### Policy (Box Theory)
- **L0 SplitBox**: take the Tiny direct path only when `header_magic` / `class_idx` are valid (fail-fast)
- **L1 HotBox**: only the Tiny same-thread TLS push (zero side effects)
- **L1 ColdBox**: the existing fallback (pool/mid/large/invalid header)

### Implementation rules
- One boundary only (decided at the first branch of the `free()` wrapper)
- `ENV gate`: `HAKMEM_FREE_TINY_DIRECT=0/1` (default 0 / preset(MIXED)=1)
- Visibility through counters only (`direct_hit`, `direct_miss`, `invalid_header`)
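
The split above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not hakmem's implementation: the 1-byte header layout (high nibble as magic, low nibble as class index), every `sketch_*` name, the TLS stack capacity, and the stand-in cold path are hypothetical. Only the gate name `HAKMEM_FREE_TINY_DIRECT` and the three counter names come from the rules above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* ASSUMPTION: 1-byte header right before the user pointer,
 * high nibble = magic, low nibble = class index. Not hakmem's real layout. */
#define SKETCH_MAGIC       0xA0u
#define SKETCH_NUM_CLASSES 8u
#define SKETCH_TLS_CAP     64u

/* Counters named in the rules above; they are the only visibility. */
typedef struct {
    unsigned long direct_hit;
    unsigned long direct_miss;
    unsigned long invalid_header;
} sketch_free_stats;

sketch_free_stats sketch_stats;

/* ENV gate value, default 0 (off). */
int sketch_free_tiny_direct = 0;

/* Read HAKMEM_FREE_TINY_DIRECT once at startup instead of on every free. */
void sketch_gate_init(void) {
    const char *v = getenv("HAKMEM_FREE_TINY_DIRECT");
    sketch_free_tiny_direct = (v != NULL && v[0] == '1');
}

/* Hypothetical per-thread stacks of freed blocks, one per class. */
static _Thread_local void *sketch_tls_stack[SKETCH_NUM_CLASSES][SKETCH_TLS_CAP];
static _Thread_local unsigned sketch_tls_count[SKETCH_NUM_CLASSES];

/* L1 ColdBox stand-in for the existing fallback (pool/mid/large/invalid header).
 * Returns 0 to signal "not handled by the direct path". */
__attribute__((noinline, cold))
int sketch_free_cold(void *ptr) {
    (void)ptr;
    return 0;
}

/* L1 HotBox: same-thread TLS push only. Returns 1 on success, 0 if full. */
static inline int sketch_tls_push(unsigned cls, void *base) {
    unsigned n = sketch_tls_count[cls];
    if (n >= SKETCH_TLS_CAP) return 0;
    sketch_tls_stack[cls][n] = base;
    sketch_tls_count[cls] = n + 1;
    return 1;
}

/* L0 SplitBox: the single boundary. Returns 1 if the direct path took it. */
int sketch_free(void *ptr) {
    if (ptr == NULL) return 0;
    if (!sketch_free_tiny_direct) return sketch_free_cold(ptr);

    uint8_t *base = (uint8_t *)ptr - 1;
    uint8_t hdr = *base;
    unsigned cls = hdr & 0x0Fu;

    /* Fail-fast: anything that does not look like a valid Tiny header goes cold. */
    if ((hdr & 0xF0u) != SKETCH_MAGIC || cls >= SKETCH_NUM_CLASSES) {
        sketch_stats.invalid_header++;
        return sketch_free_cold(ptr);
    }
    if (sketch_tls_push(cls, base)) {
        sketch_stats.direct_hit++;
        return 1;
    }
    sketch_stats.direct_miss++;
    return sketch_free_cold(ptr);
}
```

The point of the shape is that the Tiny decision is made exactly once, at the top of `sketch_free`, and every other case funnels into one cold function.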

### GO/NO-GO
- Mixed 10-run mean:
  - GO: **+1.0% or more**
  - within ±1.0%: NEUTRAL (freeze)
  - -1.0% or below: NO-GO (freeze)
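
As a small worked sketch of the thresholds above, the helpers below compute a run mean, the relative change between the A and B means, and the resulting verdict. The function and enum names are hypothetical, and treating exactly +1.0% as GO and exactly -1.0% as NO-GO is an assumption about the boundary cases.

```c
#include <stddef.h>

typedef enum {
    VERDICT_NO_GO   = -1,
    VERDICT_NEUTRAL =  0,
    VERDICT_GO      =  1
} ab_verdict;

/* Arithmetic mean of n throughput samples (for example 10 runs in M ops/s). */
double ab_mean(const double *runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += runs[i];
    return n ? sum / (double)n : 0.0;
}

/* Relative change of B against A, in percent. */
double ab_delta_pct(double mean_a, double mean_b) {
    return (mean_b - mean_a) / mean_a * 100.0;
}

/* Map a percent delta to the GO / NEUTRAL / NO-GO rule. */
ab_verdict ab_classify(double delta_pct) {
    if (delta_pct >= 1.0)  return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

For example, `ab_delta_pct(100.0, 101.5)` is about +1.5, which `ab_classify` maps to `VERDICT_GO`, while a delta of -1.71 maps to `VERDICT_NO_GO`.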

---

## E5-2: Header write-once (⚪ NEUTRAL → FROZEN)

Conclusion:
- E5-2 is **NEUTRAL** (branch overhead ≈ savings), so it is **frozen**.
- Do not pursue it further; prioritize E5-4 next.

References:
- Design: `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md`
- Results: `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md`

### Hypothesis
`tiny_region_id_write_header` is "correct but called very often".
Blocks are reused within the same class, so writing the header **only the first time** is enough.

### Policy (Box Theory)
- **HeaderPrefillBox** (cold/refill boundary) sets the header at block creation time
- The alloc hot path only returns `base+1` (no header write)

### Safety gate
- `ENV gate`: `HAKMEM_TINY_HEADER_PREFILL=0/1` (default 0)
- Fail-fast:
  - Allow the skip only for slabs that have been prefilled
  - Blocks whose prefill is incomplete fall back to the existing `tiny_region_id_write_header()`
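
A minimal sketch of the prefill idea and its safety gate, under assumptions: the slab struct, the `prefilled` flag, the header encoding, and all `sketch_*` names are hypothetical, and `sketch_write_header` merely stands in for `tiny_region_id_write_header()`. The global `sketch_header_prefill_enabled` stands in for the `HAKMEM_TINY_HEADER_PREFILL` gate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab descriptor; hakmem's real metadata will differ. */
typedef struct {
    uint8_t *mem;        /* slab storage */
    size_t   block_size; /* bytes per block, header byte included */
    size_t   nblocks;    /* number of blocks in the slab */
    uint8_t  class_idx;  /* size class served by this slab */
    int      prefilled;  /* 1 once every block header has been written */
} sketch_slab;

/* Stand-in for HAKMEM_TINY_HEADER_PREFILL (default 0). */
int sketch_header_prefill_enabled = 0;

/* Counts header writes so the effect of the gate is observable. */
unsigned long sketch_header_writes = 0;

/* Stand-in for tiny_region_id_write_header(): assumed 1-byte header. */
static inline void sketch_write_header(uint8_t *base, uint8_t class_idx) {
    *base = (uint8_t)(0xA0u | (class_idx & 0x0Fu));
    sketch_header_writes++;
}

/* HeaderPrefillBox: runs on the cold/refill boundary, once per slab refill. */
void sketch_slab_refill(sketch_slab *s) {
    s->prefilled = 0;
    if (!sketch_header_prefill_enabled) return;
    for (size_t i = 0; i < s->nblocks; i++)
        sketch_write_header(s->mem + i * s->block_size, s->class_idx);
    s->prefilled = 1;
}

/* Alloc hot path: skip the write only for prefilled slabs, else fall back. */
void *sketch_alloc_block(sketch_slab *s, size_t idx) {
    uint8_t *base = s->mem + idx * s->block_size;
    if (!s->prefilled)
        sketch_write_header(base, s->class_idx);  /* fail-fast fallback */
    return base + 1;
}
```

Note that the hot path still carries one branch on `prefilled`; whether removing the write outweighs that branch is exactly what the A/B run has to decide.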

### A/B
- Mixed 10-run + health profiles
- Expected: +1 to 3% (fewer header writes and related branches)

---

## E5-4 (next core target at the time; now NEUTRAL → FREEZE): Malloc Tiny Direct (the alloc-side counterpart of E5-1)

Instructions:
- `docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md`

---

## E5-3 (DEFER): shift the branch shape of `hakmem_env_snapshot_enabled()` toward "enabled by default"

### Background
`HAKMEM_ENV_SNAPSHOT=1` is now the norm under `MIXED_TINYV3_C7_SAFE`, so
the current `if (__builtin_expect(hakmem_env_snapshot_enabled(), 0))` can carry a **reversed hint**.

### Policy
Keep the meaning and change only the branch shape (an outer-shape optimization of the box):
- `if (__builtin_expect(!hakmem_env_snapshot_enabled(), 0)) { legacy; } else { snapshot; }`
- or push legacy out into `*_cold()` (noinline,cold)
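
The two shapes can be compared side by side in a small sketch. Everything here is illustrative: the `sketch_*` names and the toy return values are hypothetical, and `sketch_snapshot_enabled()` only stands in for `hakmem_env_snapshot_enabled()`. Both functions compute the same result; only the hint and the placement of the legacy code differ. Whether the second shape is actually faster has to be measured.

```c
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Stand-in for the snapshot gate; 1 mirrors the MIXED preset assumption. */
int sketch_env_snapshot_on = 1;

int sketch_snapshot_enabled(void) {
    return sketch_env_snapshot_on;
}

/* Legacy path pushed out of line so it does not occupy the hot code. */
__attribute__((noinline, cold))
int sketch_legacy_path(int x) {
    return x + 1000;
}

/* Snapshot path, expected to be the common case. */
static inline int sketch_snapshot_path(int x) {
    return x + 1;
}

/* Current shape: the hint says "snapshot is unlikely". */
int sketch_shape_current(int x) {
    if (SKETCH_UNLIKELY(sketch_snapshot_enabled()))
        return sketch_snapshot_path(x);
    return sketch_legacy_path(x);
}

/* Proposed shape: the hint says "legacy is unlikely"; snapshot falls through. */
int sketch_shape_proposed(int x) {
    if (SKETCH_UNLIKELY(!sketch_snapshot_enabled()))
        return sketch_legacy_path(x);
    return sketch_snapshot_path(x);
}
```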

### ENV / rollback
- `ENV gate`: `HAKMEM_ENV_SNAPSHOT_SHAPE=0/1` (default 0)
- Start with the 5 sites in `malloc_tiny_fast.h`, plus `tiny_legacy_fallback_box.h` / `tiny_metadata_cache_hot_box.h`

### GO/NO-GO
- Candidate for adoption if the Mixed 10-run mean gains **+1.0% or more**
- Expected: +0.5 to 2.0% (avoiding mispredicts)

---

## Step 2: Health check (required)

```sh
scripts/verify_health_profiles.sh
```

---

## Step 3: Promotion (winning boxes only)

- Make it the default in `MIXED_TINYV3_C7_SAFE` in `core/bench_profile.h` (opt-out possible)
- Add the A/B results and rollback steps to `docs/analysis/ENV_PROFILE_PRESETS.md`
- Update `CURRENT_TASK.md` (results and the "next core target" in one line)
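
One way to make a promoted gate default-on while keeping opt-out is to set the preset value only when the variable is not already present in the environment. The sketch below assumes that behaviour; `sketch_setenv_default` and `sketch_apply_mixed_preset` are hypothetical names, not the functions in `core/bench_profile.h`. It relies on POSIX `setenv()` with `overwrite = 0`.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <stdlib.h>

/* Hypothetical helper: apply a preset default only when the variable is unset.
 * With overwrite = 0, an existing value (for example an exported "0") wins. */
int sketch_setenv_default(const char *name, const char *value) {
    return setenv(name, value, 0);
}

/* Hypothetical preset step for a promoted box: default ON, opt-out possible. */
void sketch_apply_mixed_preset(void) {
    sketch_setenv_default("HAKMEM_FREE_TINY_DIRECT", "1");
}
```

Under this assumption, a rollback is just `HAKMEM_FREE_TINY_DIRECT=0` in the environment before the process starts, and the A arm of an A/B run must set the gate explicitly to `0` rather than leave it unset.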

---

## Change history (commit messages from the blame view)

### Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated)

2025-12-14 05:36:57 +09:00

Combined A/B Test Results (10-run Mixed):
- Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median)
- Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median)
- Improvement: +6.43% mean, +6.74% median

Interaction Analysis:
- E4-1 alone: +3.51% (measured in separate session)
- E4-2 alone: +21.83% (measured in separate session)
- Combined: +6.43% (measured in same binary)
- Pattern: SUBADDITIVE (overlapping bottlenecks)

Key Finding: Single-binary incremental gain is the accurate metric
- E4-1 and E4-2 target overlapping TLS/branch resources
- Individual measurements were from different baselines/sessions
- Combined measurement (same binary, both flags) shows true progress

Phase 5 Total Progress:
- Original baseline (session start): 35.74M ops/s
- Combined optimized: 47.34M ops/s
- Total gain: +32.4% (cross-session, reference only)
- Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF)

New Baseline Perf Profile (47.0M ops/s):
- free: 37.56% self% (still top hotspot)
- tiny_alloc_gate_fast: 13.73% (reduced from 19.50%)
- malloc: 12.95% (reduced from 16.13%)
- tiny_region_id_write_header: 6.97% (header write tax)
- hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible)

Health Check: PASS
- MIXED_TINYV3_C7_SAFE: 42.3M ops/s
- C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s

Phase 5 E5 Candidates (from perf profile):
- E5-1: free() path internals (37.56% self%)
- E5-2: Header write reduction (6.97% self%)
- E5-3: ENV snapshot overhead (4.29% self%)

Deliverables:
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
- CURRENT_TASK.md (E4 combined complete, E5 candidates)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer)
- perf.data.e4combined (perf profile data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

### Phase 5 E5-1: Promote to preset + next target instructions

2025-12-14 05:59:43 +09:00

E5-1 Promotion:
- Added HAKMEM_FREE_TINY_DIRECT=1 to MIXED_TINYV3_C7_SAFE preset
- Updated ENV_PROFILE_PRESETS.md with rollback instructions
- Rollback: HAKMEM_FREE_TINY_DIRECT=0

A/B Test Clarification:
- Documented bench_setenv_default vs export ENV=0 interaction
- bench_setenv_default only sets if ENV is unset
- To force OFF in A/B: use value that differs from default

Next Target Selection (E5-2 vs E5-3):
- E5-2: Header write reduction (tiny_region_id_write_header)
- E5-3: ENV snapshot gate shape optimization
- Decision requires fresh perf profile on new baseline

Deliverables:
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md (updated)
- docs/analysis/ENV_PROFILE_PRESETS.md (E5-1 added)
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md (clarified)
- CURRENT_TASK.md (progress links)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (progress links)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

### Phase 5 E5-3: Candidate Analysis (All DEFERRED) + E5-4 Instructions

2025-12-14 06:44:04 +09:00

E5-3 Analysis Results:
- free_tiny_fast_cold (7.14%): DEFER - cold path, low ROI
- unified_cache_push (3.39%): DEFER - already optimized
- hakmem_env_snapshot_enabled (2.97%): DEFER - low headroom

Key Insight: perf self% is time-weighted, not frequency-weighted.
Cold paths appear hot but have low total impact.

Next: E5-4 (Malloc Tiny Direct Path)
- Apply E5-1 winning pattern to malloc side
- Target: tiny_alloc_gate_fast() gate tax elimination
- ENV gate: HAKMEM_MALLOC_TINY_DIRECT=0/1

Files added:
- docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
- docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- core/box/free_cold_shape_env_box.{h,c} (research box, not tested)
- core/box/free_cold_shape_stats_box.{h,c} (research box, not tested)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

### Phase 5: freeze E5-4 malloc tiny direct (neutral)

2025-12-14 06:59:35 +09:00

### Phase 5: freeze E6 env snapshot shape (no-go)

2025-12-14 07:18:59 +09:00

### Phase 5: E7 prune no-go (keep frozen boxes); add clean-env runner

2025-12-14 08:11:20 +09:00