Phase 14 v1: Pointer-Chase Reduction (tcache) NEUTRAL (+0.20%)

Implementation:
- Intrusive LIFO tcache layer (L1) before UnifiedCache
- TLS per-class bins (head pointer + count)
- Intrusive next pointers (via tiny_next_store/load SSOT)
- Cap: 64 blocks per class (default)
- ENV: HAKMEM_TINY_TCACHE=0/1 (default: 0, OFF)

A/B Test Results (Mixed 10-run):
- Baseline (TCACHE=0): 51,083,379 ops/s
- Optimized (TCACHE=1): 51,186,838 ops/s
- Mean delta: +0.20% (below +1.0% GO threshold)
- Median delta: +0.59%

Verdict: NEUTRAL
- Freeze as research box (default OFF)

Root Cause (v1 wiring incomplete):
- Free side pushes to tcache via unified_cache_push()
- Alloc hot path (tiny_hot_alloc_fast) doesn't consume tcache
- tcache becomes "sink" without alloc-side pop → ROI not measurable

Files:
- Created: core/box/tiny_tcache_{env_box,box}.h, tiny_tcache_env_box.c
- Modified: core/front/tiny_unified_cache.h (integration)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (build integration)
- Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
- v2 Instructions: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md

Next: Phase 14 v2 (connect tcache to tiny_front_hot_box alloc/free hot path)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 01:28:50 +09:00
parent 0b306f72f4
commit f8fb05bc13
11 changed files with 729 additions and 9 deletions
--- a/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
+++ b/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
@@ -0,0 +1,184 @@
+# Phase 14 v1: Pointer-Chase Reduction (tcache-style) A/B Test Results
+
+**Date:** 2025-12-15
+**Benchmark:** Mixed (16–1024B) 10-run cleanenv
+**Target:** Reduce pointer-chase overhead with intrusive LIFO tcache layer
+**Expected ROI:** +15-25% (design estimate)
+**GO Threshold:** +1.0% mean improvement
+
+---
+
+## 1. Implementation Summary
+
+Phase 14 v1 adds an intrusive LIFO tcache layer (L1) in front of the existing array-based UnifiedCache, inspired by the glibc tcache pattern.
+
+**Key Components:**
+- `core/box/tiny_tcache_env_box.{h,c}` - L0 ENV gate (HAKMEM_TINY_TCACHE=0/1, default 0)
+- `core/box/tiny_tcache_box.h` - L1 intrusive LIFO cache (TLS per-class bins)
+- `core/front/tiny_unified_cache.h` - Integration (try tcache first, fall through to array cache)
+- `core/bench_profile.h` - Refresh sync for bench_profile
+
+**Important Note (wiring completeness):**
+- The v1 tcache pop lives in `unified_cache_pop_or_refill()`, but the current main alloc hot path (`tiny_hot_alloc_fast()`) does not go through `unified_cache_pop_or_refill()`.
+- The free side, however, enters the tcache via `unified_cache_push()`, so with `HAKMEM_TINY_TCACHE=1` the **tcache can become a "sink" and the alloc/pop-side ROI may not be measurable**.
+- A follow-up fix (Phase 14 v2) should connect pop/push in `tiny_front_hot_box`, after which the A/B test should be rerun:
+  - `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`
+
+**Design:**
+- TLS state: 8 TinyTcacheBin structs (head pointer + count per class)
+- Intrusive next pointers stored in blocks (via tiny_next_store/load SSOT)
+- Capacity: 64 blocks per class (default, configurable via HAKMEM_TINY_TCACHE_CAP)
+- LIFO order: the most recently freed block is reused first, which is expected to give better cache locality than the FIFO array cache
+- Two-layer fallback: tcache (fast) → unified_cache (overflow/miss)
+
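The design bullets above can be sketched as a small intrusive LIFO bin. This is a minimal illustration, not the contents of `tiny_tcache_box.h`: the names `SketchBin`, `sketch_push`, and `sketch_pop` are hypothetical, and the real code routes next-pointer access through `tiny_next_store/load`, which is replaced here by a plain `memcpy` at offset 0 of the free block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_CLASSES 8   /* one bin per tiny class (from the design) */
#define SKETCH_CAP         64  /* default cap per class (from the design) */

/* Hypothetical bin layout: head pointer + count, per class, per thread. */
typedef struct {
    void    *head;
    uint16_t count;
} SketchBin;

static _Thread_local SketchBin sketch_bins[SKETCH_NUM_CLASSES];

/* Assumed stand-ins for tiny_next_store/load: the link lives in the block. */
static void sketch_next_store(void *block, void *next) {
    memcpy(block, &next, sizeof next);
}

static void *sketch_next_load(const void *block) {
    void *next;
    memcpy(&next, block, sizeof next);
    return next;
}

/* Returns 1 if cached, 0 on overflow (caller falls through to the array cache). */
static int sketch_push(int cls, void *block) {
    SketchBin *bin = &sketch_bins[cls];
    if (bin->count >= SKETCH_CAP) return 0;
    sketch_next_store(block, bin->head);
    bin->head = block;
    bin->count++;
    return 1;
}

/* Returns the most recently pushed block, or NULL on miss. */
static void *sketch_pop(int cls) {
    SketchBin *bin = &sketch_bins[cls];
    void *block = bin->head;
    if (block == NULL) return NULL;
    bin->head = sketch_next_load(block);
    bin->count--;
    return block;
}
```

Because the link is stored inside the free block itself, the bin needs no side array: a pop touches only the bin head and the block that is about to be handed out.
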
+**ENV Control:**
+```bash
+export HAKMEM_TINY_TCACHE=0  # Baseline (tcache disabled)
+export HAKMEM_TINY_TCACHE=1  # Optimized (tcache enabled)
+export HAKMEM_TINY_TCACHE_CAP=64  # Capacity per class (default: 64)
+```
+
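One way such an ENV gate could be read is sketched below. The function names are hypothetical, and the handling of unset or malformed values (falling back to OFF and to a cap of 64) as well as the upper bound on the cap are assumptions; the source only states the defaults.

```c
#include <stdlib.h>

/* Hypothetical parser: only the exact string "1" enables the gate. */
static int sketch_parse_enabled(const char *value) {
    return value != NULL && value[0] == '1' && value[1] == '\0';
}

/* Hypothetical parser: a positive decimal integer, otherwise the default 64. */
static unsigned sketch_parse_cap(const char *value) {
    const unsigned default_cap = 64;
    if (value == NULL || *value == '\0') return default_cap;
    char *end = NULL;
    long parsed = strtol(value, &end, 10);
    /* Assumed policy: reject trailing junk and out-of-range values. */
    if (*end != '\0' || parsed <= 0 || parsed > 65535) return default_cap;
    return (unsigned)parsed;
}

/* Reads the environment once and caches the answer (not refresh-aware). */
static int sketch_tcache_enabled(void) {
    static int cached = -1;
    if (cached < 0) cached = sketch_parse_enabled(getenv("HAKMEM_TINY_TCACHE"));
    return cached;
}
```

A cached gate like this keeps `getenv` off the hot path, at the cost of not seeing later changes to the environment unless the cached value is explicitly refreshed.
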
+---
+
+## 2. A/B Test Results (Mixed 10-run)
+
+### Baseline (TCACHE=0):
+```
+Run 1:  51,264,525 ops/s
+Run 2:  50,950,925 ops/s
+Run 3:  51,500,295 ops/s
+Run 4:  51,698,050 ops/s
+Run 5:  50,396,686 ops/s
+Run 6:  50,960,807 ops/s
+Run 7:  50,616,179 ops/s
+Run 8:  51,817,424 ops/s
+Run 9:  50,762,958 ops/s
+Run 10: 50,865,941 ops/s
+```
+**Mean:** 51,083,379 ops/s
+**Median:** 50,955,866 ops/s
+
+### Optimized (TCACHE=1):
+```
+Run 1:  51,555,414 ops/s
+Run 2:  51,389,988 ops/s
+Run 3:  50,795,917 ops/s
+Run 4:  51,880,520 ops/s
+Run 5:  50,574,457 ops/s
+Run 6:  50,627,901 ops/s
+Run 7:  51,233,081 ops/s
+Run 8:  51,278,890 ops/s
+Run 9:  50,761,326 ops/s
+Run 10: 51,770,890 ops/s
+```
+**Mean:** 51,186,838 ops/s
+**Median:** 51,255,986 ops/s
+
+### Delta:
+- **Mean delta:** +0.20% (103,459 ops/s improvement)
+- **Median delta:** +0.59% (300,120 ops/s improvement)
+
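The mean, median, and delta figures above can be rechecked from the ten per-run values with a short helper. This is a sketch for verifying the arithmetic, not the script that produced the report; all function names are hypothetical.

```c
#include <stdlib.h>

/* Ascending comparison for qsort over unsigned long long values. */
static int sketch_cmp_u64(const void *a, const void *b) {
    unsigned long long x = *(const unsigned long long *)a;
    unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n throughput samples (ops/s). */
static double sketch_mean(const unsigned long long *runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += (double)runs[i];
    return sum / (double)n;
}

/* Sorts runs in place; for even n, averages the two middle values. */
static double sketch_median(unsigned long long *runs, size_t n) {
    qsort(runs, n, sizeof runs[0], sketch_cmp_u64);
    if (n % 2) return (double)runs[n / 2];
    return ((double)runs[n / 2 - 1] + (double)runs[n / 2]) / 2.0;
}

/* Relative change of optimized over baseline, in percent. */
static double sketch_delta_pct(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}
```

Applied to the two run lists above, the mean delta is 103,459 / 51,083,379, or about +0.20%, and the median delta is 300,120 / 50,955,866, or about +0.59%.
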
+---
+
+## 3. Verdict: NEUTRAL
+
+**Result:** +0.20% mean improvement (below +1.0% GO threshold)
+
+**Analysis:**
+- Phase 14 v1 shows minimal performance impact on Mixed workload
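+- Median delta (+0.59%) is higher than the mean delta, but both sit well inside the run-to-run spread (baseline runs range from 50.40M to 51.82M ops/s, roughly 2.8% of the mean), so neither is distinguishable from noise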
+- Both deltas are below the +1.0% GO threshold → NEUTRAL classification
+- Expected ROI (+15-25%) was not achieved
+
+**Possible Reasons for Lower-than-Expected ROI:**
+1. **Workload Mismatch:** Mixed workload (16–1024B) spans multiple classes (C0-C7), but tcache benefits may be concentrated in hot classes (C2/C3: 128B/256B). Mid/large classes (C5-C7) may not benefit as much.
+2. **Cache Locality vs Array Access:** While tcache reduces pointer-chasing, the existing UnifiedCache array access may already be well-cached in L1/L2, limiting improvement.
+3. **Cap Too Small:** Default cap=64 may be too small for high-churn workloads, causing frequent overflow to array cache.
+4. **Intrusive Next Overhead:** Writing/reading next pointers may add overhead that offsets the pointer-chase reduction.
+5. **Incomplete hot-path coverage (v1):** Only the free side pushes into the tcache while the alloc side never consumes it, so hits may be invisible (Phase 14 v2 needs to confirm the path is actually live).
+
+**Comparison to Smoke Test:**
+- Smoke test (single run): +2.4% (51.88M vs 50.68M ops/s)
+- Formal 10-run: +0.20% mean, +0.59% median
+- Variance across runs suggests smoke test was an outlier
+
+---
+
+## 4. Recommendation: Freeze as Research Box
+
+**Decision:** Freeze Phase 14 v1 as research box (default OFF)
+
+**Rationale:**
+- NEUTRAL result (+0.20%) does not justify promotion to default
+- No measurable harm (close to baseline), suitable for research/experimentation
+- Future work may explore:
+  - Per-class cap tuning (hot classes get larger caps)
+  - Workload-specific profiling (C2/C3-heavy vs C5-C7-heavy)
+  - Alternative intrusive next pointer strategies
+
+**Next Steps:**
+1. Commit Phase 14 v1 implementation with NEUTRAL verdict
+2. Update CURRENT_TASK.md to freeze as research box
+3. Keep ENV gate (HAKMEM_TINY_TCACHE=0 default) for future experimentation
+4. Consider alternative approaches for pointer-chase reduction (e.g., deeper pipeline optimization, better prefetching)
+
+---
+
+## 5. Raw Data
+
+### Baseline (TCACHE=0):
+```
+51264525
+50950925
+51500295
+51698050
+50396686
+50960807
+50616179
+51817424
+50762958
+50865941
+```
+Mean: 51,083,379 ops/s
+Median: 50,955,866 ops/s
+
+### Optimized (TCACHE=1):
+```
+51555414
+51389988
+50795917
+51880520
+50574457
+50627901
+51233081
+51278890
+50761326
+51770890
+```
+Mean: 51,186,838 ops/s
+Median: 51,255,986 ops/s
+
+---
+
+## 6. Files Modified
+
+### Created:
+- `core/box/tiny_tcache_env_box.h` - L0 ENV gate
+- `core/box/tiny_tcache_env_box.c` - ENV init/refresh implementation
+- `core/box/tiny_tcache_box.h` - L1 intrusive LIFO cache
+
+### Modified:
+- `core/front/tiny_unified_cache.h` - Integration (try tcache first)
+- `core/bench_profile.h` - Refresh sync
+- `Makefile` - Build system integration
+- `scripts/run_mixed_10_cleanenv.sh` - ENV leak prevention (already updated)
+
+---
+
+## 7. Conclusion
+
+Phase 14 v1 (Pointer-Chase Reduction via tcache-style intrusive LIFO) achieved **+0.20% mean improvement** on Mixed 10-run benchmark, which is **NEUTRAL** (below +1.0% GO threshold).
+
+**Final Status:** Freeze as research box (HAKMEM_TINY_TCACHE=0 default, OFF)
+
+**Future Work:** Consider per-class cap tuning or alternative pointer-chase reduction strategies.
--- a/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_DESIGN.md
+++ b/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_DESIGN.md
@@ -147,3 +147,12 @@ GO/NO-GO:
 - When the tcache hit rate is high, this avoids array accesses and the FIFO's reuse of older blocks
 - The shortest single step toward the tcache-like behavior behind "system malloc is fast"

+---
+
+## Update (2025-12-15)
+
+With only the v1 integration point (`core/front/tiny_unified_cache.h`), the current main alloc hot path (`tiny_hot_alloc_fast()`) never consumes the tcache,
+so with `HAKMEM_TINY_TCACHE=1` the tcache tends to become a "sink".
+
+Next, connect pop/push to the hot path (`core/box/tiny_front_hot_box.h`) and rerun the A/B test with the tcache actually wired in (Phase 14 v2):
+- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`
--- a/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_NEXT_INSTRUCTIONS.md
@@ -109,3 +109,12 @@ When GO:
 When NO-GO/NEUTRAL:
 - Freeze as research box (keep default OFF)

+---
+
+## Update (2025-12-15)
+
+With only the v1 integration point (`core/front/tiny_unified_cache.h`), the current main alloc hot path (`tiny_hot_alloc_fast()`) never consumes the tcache,
+so with `HAKMEM_TINY_TCACHE=1` it tends to become a "sink".
+
+Next, connect pop/push to the hot path (`core/box/tiny_front_hot_box.h`) and rerun the A/B test with the tcache actually wired in (Phase 14 v2):
+- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`
--- a/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md
+++ b/docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md
@@ -0,0 +1,131 @@
+# Phase 14 v2: Pointer-Chase Reduction - Hot Path Integration Next Instructions (Tiny tcache intrusive LIFO)
+
+## Status
+
+- Phase 14 v1 (tcache L1 added) was **NEUTRAL** on Mixed 10-run (+0.20% mean / +0.59% median)
+  - Results: `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md`
+  - Implementation: `core/box/tiny_tcache_env_box.{h,c}` / `core/box/tiny_tcache_box.h` / `core/front/tiny_unified_cache.h`
+- However, v1 as it stands **only feeds the tcache on the free side (`unified_cache_push()`); the alloc side (`tiny_hot_alloc_fast()`) never consumes it**, so:
+  - the tcache is effectively a "sink", and ROI cannot be measured correctly
+  - the "tcache-style" premise (symmetric push/pop) is broken
+
+Phase 14 v2 connects the tcache to the **actual tiny front hot path** and redoes the A/B test properly.
+
+---
+
+## 0. Goal (GO criteria)
+
+On Mixed 10-run (clean env):
+- **GO**: mean +1.0% or more
+- **NO-GO**: mean -1.0% or less (immediate rollback / freeze)
+- **NEUTRAL**: within ±1.0% (research box freeze)
+
+Additional gate (required):
+- With `HAKMEM_TINY_TCACHE=1`, **tcache pops must actually occur** (a count of 0 means the design is not wired in)
+
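The thresholds and the extra gate above can be expressed as one small decision function. This is a sketch only: the enum and function names are hypothetical, and reporting a zero pop count as a separate `SKETCH_NOT_WIRED` outcome is an assumption about how the gate would be surfaced.

```c
/* Hypothetical verdict values; NOT_WIRED models the "pop count is 0" gate. */
typedef enum {
    SKETCH_GO,
    SKETCH_NEUTRAL,
    SKETCH_NO_GO,
    SKETCH_NOT_WIRED
} SketchVerdict;

/* mean_delta_pct: mean throughput change in percent (TCACHE=1 vs TCACHE=0).
 * tcache_pop_hits: pops observed with HAKMEM_TINY_TCACHE=1. */
static SketchVerdict sketch_classify(double mean_delta_pct,
                                     unsigned long long tcache_pop_hits) {
    if (tcache_pop_hits == 0) return SKETCH_NOT_WIRED; /* result not meaningful */
    if (mean_delta_pct >= 1.0) return SKETCH_GO;
    if (mean_delta_pct <= -1.0) return SKETCH_NO_GO;
    return SKETCH_NEUTRAL;
}
```

Checking the pop count first means that a run like v1, where the alloc side never consumed the tcache, would be flagged as unwired instead of being recorded as a NEUTRAL measurement.
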
+---
+
+## 1. Box diagram (single boundary)
+
+```
+L0: tiny_tcache_env_box        (ENV gate / refresh / rollback)
+  ↓
+L1: tiny_tcache_box            (intrusive LIFO: push/pop, cap)
+  ↓
+L2: tiny_front_hot_box         (hot alloc/free: tcache → unified_cache(FIFO))
+  ↓
+L3: cold/refill                (unified_cache_refill → SuperSlab)
+```
+
+The boundary is fixed at a single point: **"tcache miss/overflow → existing UnifiedCache"**.
+
+---
+
+## 2. Implementation patch order (stack small patches)
+
+### Patch 1: Connect tcache pop to hot alloc (required)
+
+Target:
+- `core/box/tiny_front_hot_box.h`
+
+Changes:
+- At the top of `tiny_hot_alloc_fast(int class_idx)`:
+  - try `tiny_tcache_try_pop(class_idx)`
+  - on HIT, finalize with `tiny_header_finalize_alloc(base, class_idx)` and return immediately
+  - on MISS, fall back to the existing FIFO (`cache->slots[head]`)
+
+Requirements:
+- Keep the diff minimal so the hot path does not grow when tcache is OFF (default)
+- Strictly follow "when in doubt, fall back" (Fail-Fast)
+
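The control flow that Patch 1 describes might look roughly like the sketch below. All `Demo*` types and `demo_*` functions are hypothetical stand-ins, not the real `tiny_front_hot_box.h` code; the header finalize step (`tiny_header_finalize_alloc`) is omitted, and the ring size is an arbitrary assumption.

```c
#include <stddef.h>

#define DEMO_RING_SIZE 8  /* assumed power of two so index masking works */

/* Hypothetical stand-ins for the two layers. */
typedef struct { void *head; unsigned count; } DemoTcacheBin;
typedef struct { void *slots[DEMO_RING_SIZE]; unsigned head, tail; } DemoFifoCache;

static int demo_tcache_enabled = 0;           /* mirrors the default-OFF gate */
static unsigned long demo_pop_hit, demo_pop_miss;

/* L1: intrusive LIFO pop; the next pointer is stored at offset 0 of the block. */
static void *demo_tcache_try_pop(DemoTcacheBin *bin) {
    void *block = bin->head;
    if (block == NULL) return NULL;
    bin->head = *(void **)block;
    bin->count--;
    return block;
}

/* Existing layer: FIFO ring pop; NULL means the caller must refill. */
static void *demo_fifo_pop(DemoFifoCache *c) {
    if (c->head == c->tail) return NULL;
    void *block = c->slots[c->head];
    c->head = (c->head + 1) & (DEMO_RING_SIZE - 1);
    return block;
}

/* Try the tcache first; on miss, or when the gate is OFF, use the FIFO. */
static void *demo_hot_alloc_fast(DemoTcacheBin *bin, DemoFifoCache *fifo) {
    if (demo_tcache_enabled) {
        void *block = demo_tcache_try_pop(bin);
        if (block != NULL) { demo_pop_hit++; return block; }
        demo_pop_miss++;
    }
    return demo_fifo_pop(fifo);
}
```

With the gate OFF the added cost on this path is a single branch, which is in the spirit of the "minimal diff when OFF" requirement.
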
+### Patch 2: Connect tcache push to hot free (recommended)
+
+Target:
+- `core/box/tiny_front_hot_box.h`
+
+Changes:
+- At the top of `tiny_hot_free_fast(int class_idx, void* base)`:
+  - try `tiny_tcache_try_push(class_idx, base)`
+  - on SUCCESS, `return 1`
+  - go to the existing FIFO only on overflow / disabled
+
+Aim:
+- Make the tcache effective on "direct push" paths as well, not only those going through `unified_cache_push()`
+
+### Patch 3: Visibility (minimal, TLS)
+
+Candidate target:
+- `core/box/tiny_tcache_box.h` (TLS counters)
+
+Add (for debug / research):
+- `tcache_pop_hit/miss`
+- `tcache_push_hit/overflow`
+- Allow a "one-shot dump" to be emitted exactly once (ENV opt-in)
+
+Prohibited:
+- Do not put atomic statistics on the hot path (lesson from Phase 12 / POOL-DN-BATCH)
+
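A possible shape for the Patch 3 counters is sketched below: plain thread-local integers (no atomics) plus a dump that fires at most once per thread. Every name here is hypothetical, including the opt-in variable `SKETCH_TCACHE_STATS`, which the source does not name.

```c
#include <stdio.h>
#include <stdlib.h>

/* Plain thread-local counters: increments are ordinary stores, no atomics. */
typedef struct {
    unsigned long pop_hit, pop_miss, push_hit, push_overflow;
} StatSketch;

static _Thread_local StatSketch stat_sketch;

/* Hypothetical opt-in variable; not a real HAKMEM setting. */
#define STAT_SKETCH_ENV "SKETCH_TCACHE_STATS"

/* Returns 1 when the opt-in variable starts with '1'. */
static int stat_sketch_opted_in(void) {
    const char *value = getenv(STAT_SKETCH_ENV);
    return value != NULL && value[0] == '1';
}

/* Writes one summary line at most once per thread; returns 1 if it wrote. */
static int stat_sketch_dump_once(FILE *out, int opted_in) {
    static _Thread_local int dumped = 0;
    if (!opted_in || dumped) return 0;
    dumped = 1;
    fprintf(out,
            "tcache pop_hit=%lu pop_miss=%lu push_hit=%lu push_overflow=%lu\n",
            stat_sketch.pop_hit, stat_sketch.pop_miss,
            stat_sketch.push_hit, stat_sketch.push_overflow);
    return 1;
}
```

Since each thread only touches its own counters, the hot path pays for a non-atomic increment and nothing else; a nonzero `pop_hit` in the dump is what the "tcache pops must actually occur" gate would look for.
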
+---
+
+## 3. A/B test (same binary)
+
+Baseline:
+```sh
+HAKMEM_TINY_TCACHE=0 scripts/run_mixed_10_cleanenv.sh
+```
+
+Optimized:
+```sh
+HAKMEM_TINY_TCACHE=1 scripts/run_mixed_10_cleanenv.sh
+```
+
+Additional (check whether the effect is class-dependent):
+```sh
+HAKMEM_BENCH_C7_ONLY=1 HAKMEM_TINY_TCACHE=0 scripts/run_mixed_10_cleanenv.sh
+HAKMEM_BENCH_C7_ONLY=1 HAKMEM_TINY_TCACHE=1 scripts/run_mixed_10_cleanenv.sh
+```
+
+Cap exploration (research, only when needed):
+```sh
+HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=32  scripts/run_mixed_10_cleanenv.sh
+HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=64  scripts/run_mixed_10_cleanenv.sh
+HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=128 scripts/run_mixed_10_cleanenv.sh
+```
+
+---
+
+## 4. Health check (required)
+
+```sh
+scripts/verify_health_profiles.sh
+```
+
+---
+
+## 5. Verdict and handling
+
+- GO: promotion into `bench_profile` starts with **MIXED_TINYV3_C7_SAFE only** (staged)
+- NEUTRAL/NO-GO: freeze Phase 14 v2 as a research box (keep default OFF)
+- Rollback:
+  - `export HAKMEM_TINY_TCACHE=0`
+