Phase 14 v1: Pointer-Chase Reduction (tcache) NEUTRAL (+0.20%)

Implementation:
- Intrusive LIFO tcache layer (L1) before UnifiedCache
- TLS per-class bins (head pointer + count)
- Intrusive next pointers (via tiny_next_store/load SSOT)
- Cap: 64 blocks per class (default)
- ENV: HAKMEM_TINY_TCACHE=0/1 (default: 0, OFF)

A/B Test Results (Mixed 10-run):
- Baseline (TCACHE=0): 51,083,379 ops/s
- Optimized (TCACHE=1): 51,186,838 ops/s
- Mean delta: +0.20% (below +1.0% GO threshold)
- Median delta: +0.59%

Verdict: NEUTRAL - Freeze as research box (default OFF)

Root Cause (v1 wiring incomplete):
- Free side pushes to tcache via unified_cache_push()
- Alloc hot path (tiny_hot_alloc_fast) doesn't consume tcache
- tcache becomes "sink" without alloc-side pop → ROI not measurable

Files:
- Created: core/box/tiny_tcache_{env_box,box}.h, tiny_tcache_env_box.c
- Modified: core/front/tiny_unified_cache.h (integration)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (build integration)
- Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
- v2 Instructions: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md

Next: Phase 14 v2 (connect tcache to tiny_front_hot_box alloc/free hot path)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
Moe Charm (CI)
2025-12-15 01:28:50 +09:00
parent 0b306f72f4
commit f8fb05bc13
11 changed files with 729 additions and 9 deletions

@ -0,0 +1,184 @@
# Phase 14 v1: Pointer-Chase Reduction (tcache-style) A/B Test Results
**Date:** 2025-12-15
**Benchmark:** Mixed (16-1024B) 10-run cleanenv
**Target:** Reduce pointer-chase overhead with intrusive LIFO tcache layer
**Expected ROI:** +15-25% (design estimate)
**GO Threshold:** +1.0% mean improvement
---
## 1. Implementation Summary
Phase 14 v1 adds an intrusive LIFO tcache layer (L1) in front of the existing array-based UnifiedCache, inspired by the glibc tcache pattern.
**Key Components:**
- `core/box/tiny_tcache_env_box.{h,c}` - L0 ENV gate (HAKMEM_TINY_TCACHE=0/1, default 0)
- `core/box/tiny_tcache_box.h` - L1 intrusive LIFO cache (TLS per-class bins)
- `core/front/tiny_unified_cache.h` - Integration (try tcache first, fall through to array cache)
- `core/bench_profile.h` - Refresh sync for bench_profile
**Important Note (wiring completeness):**
- The v1 tcache pop lives in `unified_cache_pop_or_refill()`, but the current main alloc hot path (`tiny_hot_alloc_fast()`) does not go through `unified_cache_pop_or_refill()`.
- The free side, by contrast, enters the tcache via `unified_cache_push()`, so with `HAKMEM_TINY_TCACHE=1` **the tcache may become a "sink" and the alloc/pop-side ROI cannot be measured**.
- Recommended follow-up (Phase 14 v2): connect pop/push in `tiny_front_hot_box` and rerun the A/B test:
- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`
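
The wiring gap noted above can be illustrated with a counter-only model. This is a sketch under stated assumptions: every `Sketch*` / `sketch_*` name is hypothetical, no real blocks are handled, and the only facts taken from this document are the default cap of 64 and the asymmetry between the free path (tcache-aware) and the hot alloc path (array cache only).

```c
#include <assert.h>

#define SKETCH_CAP 64 /* default cap per class, from the design notes */

/* Hypothetical counters, for illustration only. */
typedef struct {
    int  tcache_count;    /* blocks currently parked in the tcache bin */
    long tcache_push_ok;  /* frees accepted by the tcache */
    long tcache_overflow; /* frees that fell through to the array cache */
    long tcache_pop_hit;  /* allocs served from the tcache */
    long array_pop;       /* allocs served by the array cache */
} SketchWiringStats;

/* v1 wiring: free goes through the tcache-aware push path ... */
static inline void sketch_v1_free(SketchWiringStats *s) {
    if (s->tcache_count < SKETCH_CAP) {
        s->tcache_count++;
        s->tcache_push_ok++;
    } else {
        s->tcache_overflow++;
    }
}

/* ... but the hot alloc path reads the array cache directly. */
static inline void sketch_v1_alloc(SketchWiringStats *s) {
    s->array_pop++;
}

/* Run `pairs` alloc/free pairs and return the resulting counters. */
static inline SketchWiringStats sketch_v1_run(long pairs) {
    SketchWiringStats s = {0};
    for (long i = 0; i < pairs; i++) {
        sketch_v1_alloc(&s);
        sketch_v1_free(&s);
    }
    assert(s.tcache_pop_hit == 0); /* nothing ever pops in this model */
    return s;
}
```

In this model the bin fills once to the cap and then stays full: every later free overflows to the array cache and no alloc ever hits the tcache, which is the "sink" behavior described above.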
**Design:**
- TLS state: 8 TinyTcacheBin structs (head pointer + count per class)
- Intrusive next pointers stored in blocks (via tiny_next_store/load SSOT)
- Capacity: 64 blocks per class (default, configurable via HAKMEM_TINY_TCACHE_CAP)
- LIFO order: better cache locality than FIFO array cache
- Two-layer fallback: tcache (fast) → unified_cache (overflow/miss)
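
A minimal sketch of the design points above, assuming a layout like the one described (head pointer + count per class, next pointer stored inside the free block). All `Sketch*` / `sketch_*` names are hypothetical; the real `tiny_next_store/load` helpers and `TinyTcacheBin` layout are not shown in this document and may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_CLASSES 8  /* C0..C7 */
#define SKETCH_TCACHE_CAP  64 /* default cap per class */

/* One bin per size class: head of the intrusive list plus a block count. */
typedef struct {
    void    *head;
    uint16_t count;
} SketchTcacheBin;

static _Thread_local SketchTcacheBin g_sketch_bins[SKETCH_NUM_CLASSES];

/* Hypothetical stand-ins for tiny_next_store/load: the next pointer lives
 * in the first bytes of the free block. memcpy avoids alignment and
 * aliasing assumptions. */
static inline void sketch_next_store(void *block, void *next) {
    memcpy(block, &next, sizeof next);
}
static inline void *sketch_next_load(const void *block) {
    void *next;
    memcpy(&next, block, sizeof next);
    return next;
}

/* Push: returns 1 on success, 0 when the bin is full (caller falls back). */
static inline int sketch_tcache_push(int cls, void *block) {
    assert(cls >= 0 && cls < SKETCH_NUM_CLASSES);
    SketchTcacheBin *bin = &g_sketch_bins[cls];
    if (bin->count >= SKETCH_TCACHE_CAP) return 0;
    sketch_next_store(block, bin->head);
    bin->head = block;
    bin->count++;
    return 1;
}

/* Pop: returns the most recently pushed block, or NULL on miss. */
static inline void *sketch_tcache_pop(int cls) {
    assert(cls >= 0 && cls < SKETCH_NUM_CLASSES);
    SketchTcacheBin *bin = &g_sketch_bins[cls];
    void *block = bin->head;
    if (block == NULL) return NULL;
    bin->head = sketch_next_load(block);
    bin->count--;
    return block;
}
```

Pushing `a` then `b` and popping twice yields `b` then `a`: the most recently freed block is reused first, which is the LIFO property the design relies on.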
**ENV Control:**
```bash
export HAKMEM_TINY_TCACHE=0 # Baseline (tcache disabled)
export HAKMEM_TINY_TCACHE=1 # Optimized (tcache enabled)
export HAKMEM_TINY_TCACHE_CAP=64 # Capacity per class (default: 64)
```
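
How such an ENV gate might be read is sketched below. Only the variable names `HAKMEM_TINY_TCACHE` and `HAKMEM_TINY_TCACHE_CAP` and the defaults (OFF, 64) come from this document; the parsing rules, the upper clamp of 65535, and all `sketch_*` names are assumptions, not the actual `tiny_tcache_env_box` implementation.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_DEFAULT_CAP 64

/* Pure parsing helpers, kept separate from getenv() so they are easy to test. */
static inline int sketch_parse_enabled(const char *value) {
    /* Unset, or anything other than exactly "1", keeps the default (OFF). */
    return value != NULL && value[0] == '1' && value[1] == '\0';
}

static inline int sketch_parse_cap(const char *value) {
    if (value == NULL || value[0] == '\0') return SKETCH_DEFAULT_CAP;
    char *end = NULL;
    long cap = strtol(value, &end, 10);
    /* Reject trailing garbage and out-of-range values (clamp is an assumption). */
    if (*end != '\0' || cap <= 0 || cap > 65535) return SKETCH_DEFAULT_CAP;
    return (int)cap;
}

static inline int sketch_tcache_enabled(void) {
    return sketch_parse_enabled(getenv("HAKMEM_TINY_TCACHE"));
}

static inline int sketch_tcache_cap(void) {
    int cap = sketch_parse_cap(getenv("HAKMEM_TINY_TCACHE_CAP"));
    assert(cap > 0); /* the parser never returns a non-positive cap */
    return cap;
}
```

A real gate would normally cache the parsed values once (and re-read them on refresh) rather than call `getenv()` on the hot path.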
---
## 2. A/B Test Results (Mixed 10-run)
### Baseline (TCACHE=0):
```
Run 1: 51,264,525 ops/s
Run 2: 50,950,925 ops/s
Run 3: 51,500,295 ops/s
Run 4: 51,698,050 ops/s
Run 5: 50,396,686 ops/s
Run 6: 50,960,807 ops/s
Run 7: 50,616,179 ops/s
Run 8: 51,817,424 ops/s
Run 9: 50,762,958 ops/s
Run 10: 50,865,941 ops/s
```
**Mean:** 51,083,379 ops/s
**Median:** 50,955,866 ops/s
### Optimized (TCACHE=1):
```
Run 1: 51,555,414 ops/s
Run 2: 51,389,988 ops/s
Run 3: 50,795,917 ops/s
Run 4: 51,880,520 ops/s
Run 5: 50,574,457 ops/s
Run 6: 50,627,901 ops/s
Run 7: 51,233,081 ops/s
Run 8: 51,278,890 ops/s
Run 9: 50,761,326 ops/s
Run 10: 51,770,890 ops/s
```
**Mean:** 51,186,838 ops/s
**Median:** 51,255,986 ops/s
### Delta:
- **Mean delta:** +0.20% (103,459 ops/s improvement)
- **Median delta:** +0.59% (300,120 ops/s improvement)
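
The deltas above follow directly from the per-run numbers. The sketch below recomputes them; the helper names are hypothetical and the arrays are copied from the run lists in this section.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

static const uint64_t sketch_baseline[10] = {
    51264525, 50950925, 51500295, 51698050, 50396686,
    50960807, 50616179, 51817424, 50762958, 50865941};
static const uint64_t sketch_optimized[10] = {
    51555414, 51389988, 50795917, 51880520, 50574457,
    50627901, 51233081, 51278890, 50761326, 51770890};

static int sketch_cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

static inline double sketch_mean(const uint64_t *v, size_t n) {
    assert(n > 0);
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return (double)sum / (double)n;
}

/* Median of up to 64 samples; sorts a private copy of the input. */
static inline double sketch_median(const uint64_t *v, size_t n) {
    uint64_t tmp[64];
    assert(n > 0 && n <= 64);
    for (size_t i = 0; i < n; i++) tmp[i] = v[i];
    qsort(tmp, n, sizeof tmp[0], sketch_cmp_u64);
    if (n % 2) return (double)tmp[n / 2];
    return ((double)tmp[n / 2 - 1] + (double)tmp[n / 2]) / 2.0;
}

/* Relative change of `opt` against `base`, in percent. */
static inline double sketch_delta_pct(double base, double opt) {
    return (opt - base) / base * 100.0;
}
```

With these inputs the optimized median is 51,255,985.5 before rounding, which the table above reports as 51,255,986.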
---
## 3. Verdict: NEUTRAL
**Result:** +0.20% mean improvement (below +1.0% GO threshold)
**Analysis:**
- Phase 14 v1 shows minimal performance impact on Mixed workload
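- Median delta (+0.59%) is slightly higher than the mean delta, but both sit well inside the run-to-run spread (baseline runs range from 50.40M to 51.82M ops/s, about 2.8% of the mean), so neither indicates a measurable change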
- Both deltas are below the +1.0% GO threshold → NEUTRAL classification
- Expected ROI (+15-25%) was not achieved
**Possible Reasons for Lower-than-Expected ROI:**
1. **Workload Mismatch:** Mixed workload (16-1024B) spans multiple classes (C0-C7), but tcache benefits may be concentrated in hot classes (C2/C3: 128B/256B). Mid/large classes (C5-C7) may not benefit as much.
2. **Cache Locality vs Array Access:** While tcache reduces pointer-chasing, the existing UnifiedCache array access may already be well-cached in L1/L2, limiting improvement.
3. **Cap Too Small:** Default cap=64 may be too small for high-churn workloads, causing frequent overflow to array cache.
4. **Intrusive Next Overhead:** Writing/reading next pointers may add overhead that offsets the pointer-chase reduction.
5. **Incomplete hot-path coverage (v1):** Only the free side feeds the tcache while the alloc side never consumes it, so hits may be invisible (Phase 14 v2 must confirm that the path is actually live).
**Comparison to Smoke Test:**
- Smoke test (single run): +2.4% (51.88M vs 50.68M ops/s)
- Formal 10-run: +0.20% mean, +0.59% median
- Variance across runs suggests smoke test was an outlier
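
One way to sanity-check the outlier reading is to compare the smoke-test delta with the spread of a single configuration. The sketch below is illustrative only: the helper name is hypothetical, the array is the baseline run list from section 2, and max-minus-min over the mean is just one simple noise measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const uint64_t sketch_baseline_runs[10] = {
    51264525, 50950925, 51500295, 51698050, 50396686,
    50960807, 50616179, 51817424, 50762958, 50865941};

/* Relative spread (max - min) / mean, in percent. */
static inline double sketch_spread_pct(const uint64_t *v, size_t n) {
    assert(n > 0);
    uint64_t lo = v[0], hi = v[0], sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] < lo) lo = v[i];
        if (v[i] > hi) hi = v[i];
        sum += v[i];
    }
    return (double)(hi - lo) / ((double)sum / (double)n) * 100.0;
}
```

For the baseline runs this gives about 2.8%, so a single-run difference of +2.4% is within what two runs of the same configuration can show.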
---
## 4. Recommendation: Freeze as Research Box
**Decision:** Freeze Phase 14 v1 as research box (default OFF)
**Rationale:**
- NEUTRAL result (+0.20%) does not justify promotion to default
- No measurable harm (close to baseline), suitable for research/experimentation
- Future work may explore:
- Per-class cap tuning (hot classes get larger caps)
- Workload-specific profiling (C2/C3-heavy vs C5-C7-heavy)
- Alternative intrusive next pointer strategies
**Next Steps:**
1. Commit Phase 14 v1 implementation with NEUTRAL verdict
2. Update CURRENT_TASK.md to freeze as research box
3. Keep ENV gate (HAKMEM_TINY_TCACHE=0 default) for future experimentation
4. Consider alternative approaches for pointer-chase reduction (e.g., deeper pipeline optimization, better prefetching)
---
## 5. Raw Data
### Baseline (TCACHE=0):
```
51264525
50950925
51500295
51698050
50396686
50960807
50616179
51817424
50762958
50865941
```
Mean: 51,083,379 ops/s
Median: 50,955,866 ops/s
### Optimized (TCACHE=1):
```
51555414
51389988
50795917
51880520
50574457
50627901
51233081
51278890
50761326
51770890
```
Mean: 51,186,838 ops/s
Median: 51,255,986 ops/s
---
## 6. Files Modified
### Created:
- `core/box/tiny_tcache_env_box.h` - L0 ENV gate
- `core/box/tiny_tcache_env_box.c` - ENV init/refresh implementation
- `core/box/tiny_tcache_box.h` - L1 intrusive LIFO cache
### Modified:
- `core/front/tiny_unified_cache.h` - Integration (try tcache first)
- `core/bench_profile.h` - Refresh sync
- `Makefile` - Build system integration
- `scripts/run_mixed_10_cleanenv.sh` - ENV leak prevention (already updated)
---
## 7. Conclusion
Phase 14 v1 (Pointer-Chase Reduction via tcache-style intrusive LIFO) achieved **+0.20% mean improvement** on Mixed 10-run benchmark, which is **NEUTRAL** (below +1.0% GO threshold).
**Final Status:** Freeze as research box (HAKMEM_TINY_TCACHE=0 default, OFF)
**Future Work:** Consider per-class cap tuning or alternative pointer-chase reduction strategies.

@ -147,3 +147,12 @@ GO/NO-GO:
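- When the tcache hit rate is high, array access and the FIFO's reuse of old blocks can be avoided
- The gap to "system malloc is fast": the shortest step is to move toward tcache-like behavior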
---
## Update (2025-12-15)
With only the v1 integration point (`core/front/tiny_unified_cache.h`), the current main alloc hot path (`tiny_hot_alloc_fast()`) does not consume the tcache, so
with `HAKMEM_TINY_TCACHE=1` the tcache tends to become a "sink".
Next, connect pop/push to the hot path (`core/box/tiny_front_hot_box.h`) and rerun the A/B test with the path live (Phase 14 v2):
- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`

@ -109,3 +109,12 @@ On GO:
On NO-GO/NEUTRAL:
- Freeze as research box (keep default OFF)
---
## Update (2025-12-15)
With only the v1 integration point (`core/front/tiny_unified_cache.h`), the current main alloc hot path (`tiny_hot_alloc_fast()`) does not consume the tcache, so
with `HAKMEM_TINY_TCACHE=1` the tcache tends to become a "sink".
Next, connect pop/push to the hot path (`core/box/tiny_front_hot_box.h`) and rerun the A/B test with the path live (Phase 14 v2):
- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`

@ -0,0 +1,131 @@
# Phase 14 v2: Pointer-Chase Reduction - Hot Path Integration Next Instructions (Tiny tcache intrusive LIFO)
## Status
- Phase 14 v1 (tcache L1 added) was **NEUTRAL** on Mixed 10-run (+0.20% mean / +0.59% median)
- Results: `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md`
- Implementation: `core/box/tiny_tcache_env_box.{h,c}` / `core/box/tiny_tcache_box.h` / `core/front/tiny_unified_cache.h`
- However, the current v1 **only feeds the tcache on the free side (`unified_cache_push()`); the alloc side (`tiny_hot_alloc_fast()`) never consumes it**, so:
- the tcache is effectively a sink and the ROI cannot be measured correctly
- the "tcache-style" premise (symmetric push/pop) is broken
Phase 14 v2 connects the tcache to the **actual hot path of the tiny front** and redoes the A/B test properly.
---
## 0. Goal (GO conditions)
On Mixed 10-run (clean env):
- **GO**: mean +1.0% or better
- **NO-GO**: mean -1.0% or worse (immediate rollback / freeze)
- **NEUTRAL**: within ±1.0% (freeze as research box)
Additional gate (required):
- With `HAKMEM_TINY_TCACHE=1`, **tcache pops must actually occur** (a count of 0 means the design is not wired up)
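
The GO conditions and the additional gate can be expressed as one small decision function. This is a sketch: the enum, the function name, and the way the pop-hit count is passed in are hypothetical, and the boundary values (exactly +1.0% or -1.0%) are assigned to GO and NO-GO respectively, which is one reading of the thresholds above.

```c
#include <assert.h>

typedef enum {
    SKETCH_GO,
    SKETCH_NEUTRAL,
    SKETCH_NO_GO,
    SKETCH_NOT_WIRED /* tcache enabled but never popped: result is not valid */
} SketchVerdict;

static inline SketchVerdict sketch_verdict(double mean_delta_pct,
                                           int tcache_enabled,
                                           unsigned long tcache_pop_hits) {
    assert(mean_delta_pct == mean_delta_pct); /* reject NaN input */
    /* Additional gate first: without pops the delta says nothing about tcache. */
    if (tcache_enabled && tcache_pop_hits == 0) return SKETCH_NOT_WIRED;
    if (mean_delta_pct >= 1.0) return SKETCH_GO;
    if (mean_delta_pct <= -1.0) return SKETCH_NO_GO;
    return SKETCH_NEUTRAL;
}
```

Checking the gate before the thresholds means a run like v1, where the alloc side never pops, is flagged as not wired instead of being classified by its delta.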
---
## 1. Box diagram (single boundary)
```
L0: tiny_tcache_env_box (ENV gate / refresh / rollback)
L1: tiny_tcache_box (intrusive LIFO: push/pop, cap)
L2: tiny_front_hot_box (hot alloc/free: tcache → unified_cache(FIFO))
L3: cold/refill (unified_cache_refill → SuperSlab)
```
The boundary is fixed at a single point: **"tcache miss/overflow → existing UnifiedCache"**.
---
## 2. Patch order (small, stacked steps)
### Patch 1: Connect tcache pop to hot alloc (required)
Target:
- `core/box/tiny_front_hot_box.h`
Change:
- At the top of `tiny_hot_alloc_fast(int class_idx)`:
- try `tiny_tcache_try_pop(class_idx)`
- on HIT, return immediately via `tiny_header_finalize_alloc(base, class_idx)`
- on MISS, fall back to the existing FIFO (`cache->slots[head]`)
Requirements:
- With tcache OFF (default), keep the diff minimal so the hot path does not grow
- Strictly follow "when in doubt, fall back" (Fail-Fast)
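
The shape of Patch 1 might look like the sketch below. It is a single-class, non-TLS model with hypothetical names throughout: the real `tiny_hot_alloc_fast()`, `tiny_tcache_try_pop()` and `tiny_header_finalize_alloc()` signatures are only known from the bullet list above, and header finalization is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FIFO_SLOTS 8u /* power of two so the index can be masked */

typedef struct { void *head; unsigned count; } SketchBin;
typedef struct { void *slots[SKETCH_FIFO_SLOTS]; unsigned head, tail; } SketchFifo;

static int        sketch_tcache_on; /* stand-in for the ENV gate (default 0, OFF) */
static SketchBin  sketch_bin;       /* one class only; the real bins are TLS per class */
static SketchFifo sketch_fifo;      /* stand-in for the array-based UnifiedCache */

/* Hypothetical counterpart of tiny_tcache_try_pop(): NULL means MISS. */
static inline void *sketch_tcache_try_pop(void) {
    void *base = sketch_bin.head;
    if (base == NULL) return NULL;
    assert(sketch_bin.count > 0);
    memcpy(&sketch_bin.head, base, sizeof(void *)); /* load intrusive next */
    sketch_bin.count--;
    return base;
}

/* Existing path: take the oldest entry; NULL stands for "go to refill". */
static inline void *sketch_fifo_pop(void) {
    if (sketch_fifo.head == sketch_fifo.tail) return NULL;
    void *base = sketch_fifo.slots[sketch_fifo.head];
    sketch_fifo.head = (sketch_fifo.head + 1u) & (SKETCH_FIFO_SLOTS - 1u);
    return base;
}

static inline void *sketch_hot_alloc_fast(void) {
    if (sketch_tcache_on) {            /* gate OFF adds only this one branch */
        void *base = sketch_tcache_try_pop();
        if (base != NULL) return base; /* HIT: real code would finalize the header */
    }
    return sketch_fifo_pop();          /* MISS or disabled: unchanged FIFO path */
}
```

Keeping the tcache attempt behind a single gate check is one way to meet the "minimal diff when OFF" requirement; any uncertainty in the pop resolves to the existing FIFO path.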
### Patch 2: Connect tcache push to hot free (recommended)
Target:
- `core/box/tiny_front_hot_box.h`
Change:
- At the top of `tiny_hot_free_fast(int class_idx, void* base)`:
- try `tiny_tcache_try_push(class_idx, base)`
- on SUCCESS, `return 1`
- go to the existing FIFO only on overflow / disabled
Aim:
- Make the tcache effective on "direct push" paths as well, not only via `unified_cache_push()`
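
A matching sketch for Patch 2, again single-class and non-TLS with hypothetical names. The return convention (1 = handled on the fast path) follows the `return 1` bullet above; the adjustable `sketch_cap` variable is an assumption standing in for `HAKMEM_TINY_TCACHE_CAP`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_FIFO_SLOTS 8u /* power of two so the index can be masked */

typedef struct { void *head; unsigned count; } SketchBin;
typedef struct { void *slots[SKETCH_FIFO_SLOTS]; unsigned head, tail; } SketchFifo;

static int        sketch_tcache_on;  /* stand-in for the ENV gate (default 0, OFF) */
static unsigned   sketch_cap = 64;   /* stand-in for the per-class cap */
static SketchBin  sketch_bin;        /* one class only; the real bins are TLS per class */
static SketchFifo sketch_fifo;       /* stand-in for the array-based UnifiedCache */

/* Hypothetical counterpart of tiny_tcache_try_push(): 0 means overflow. */
static inline int sketch_tcache_try_push(void *base) {
    if (sketch_bin.count >= sketch_cap) return 0;
    memcpy(base, &sketch_bin.head, sizeof(void *)); /* store intrusive next */
    sketch_bin.head = base;
    sketch_bin.count++;
    return 1;
}

/* Existing path: append to the ring; 0 stands for "take the slow path". */
static inline int sketch_fifo_push(void *base) {
    unsigned next = (sketch_fifo.tail + 1u) & (SKETCH_FIFO_SLOTS - 1u);
    if (next == sketch_fifo.head) return 0; /* ring is full */
    sketch_fifo.slots[sketch_fifo.tail] = base;
    sketch_fifo.tail = next;
    return 1;
}

static inline int sketch_hot_free_fast(void *base) {
    assert(base != NULL);
    if (sketch_tcache_on && sketch_tcache_try_push(base)) return 1;
    return sketch_fifo_push(base); /* overflow or disabled: unchanged FIFO path */
}
```

With push and pop both wired into the hot path, the tcache stops being a sink: blocks freed here are the first candidates for the next allocation of the same class.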
### Patch 3: Visibility (minimal, TLS)
Candidate target:
- `core/box/tiny_tcache_box.h` (TLS counters)
Add (debug / research only):
- `tcache_pop_hit/miss`
- `tcache_push_hit/overflow`
- A "one-shot dump" that prints exactly once (ENV opt-in)
Prohibited:
- No atomic statistics on the hot path (lesson from Phase 12 / POOL-DN-BATCH)
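
The counters in Patch 3 could be kept as plain thread-local integers, as sketched below. The struct, the function names and the output format are hypothetical, and the opt-in flag is passed as a parameter because the ENV variable that would control it is not named in this document.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t pop_hit, pop_miss;
    uint64_t push_hit, push_overflow;
} SketchTcacheStats;

/* Plain TLS counters: no atomics, so the hot path pays only an increment. */
static _Thread_local SketchTcacheStats t_sketch_stats;
static _Thread_local int t_sketch_dumped;

static inline void sketch_note_pop(int hit) {
    if (hit) t_sketch_stats.pop_hit++;
    else     t_sketch_stats.pop_miss++;
}

static inline void sketch_note_push(int ok) {
    if (ok) t_sketch_stats.push_hit++;
    else    t_sketch_stats.push_overflow++;
}

/* Prints this thread's counters at most once. Returns 1 if it printed. */
static inline int sketch_stats_dump_once(FILE *out, int opted_in) {
    assert(out != NULL);
    if (!opted_in || t_sketch_dumped) return 0;
    t_sketch_dumped = 1;
    fprintf(out, "tcache pop hit=%llu miss=%llu push hit=%llu overflow=%llu\n",
            (unsigned long long)t_sketch_stats.pop_hit,
            (unsigned long long)t_sketch_stats.pop_miss,
            (unsigned long long)t_sketch_stats.push_hit,
            (unsigned long long)t_sketch_stats.push_overflow);
    return 1;
}
```

Because the counters are per thread, a multi-threaded run needs one dump per thread or a cold-path aggregation step; either keeps atomics off the hot path. A non-zero `pop_hit` is also what the additional gate in section 0 asks for.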
---
## 3. A/B test (same binary)
Baseline:
```sh
HAKMEM_TINY_TCACHE=0 scripts/run_mixed_10_cleanenv.sh
```
Optimized:
```sh
HAKMEM_TINY_TCACHE=1 scripts/run_mixed_10_cleanenv.sh
```
Additional (check whether the effect is class-dependent):
```sh
HAKMEM_BENCH_C7_ONLY=1 HAKMEM_TINY_TCACHE=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_BENCH_C7_ONLY=1 HAKMEM_TINY_TCACHE=1 scripts/run_mixed_10_cleanenv.sh
```
Cap exploration (research, only when needed):
```sh
HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=32 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=64 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=128 scripts/run_mixed_10_cleanenv.sh
```
---
## 4. Health check (required)
```sh
scripts/verify_health_profiles.sh
```
---
## 5. Verdict and handling
- GO: promotion into `bench_profile` starts from **MIXED_TINYV3_C7_SAFE only** (staged rollout)
- NEUTRAL/NO-GO: freeze Phase 14 v2 as a research box (keep default OFF)
- Rollback:
- `export HAKMEM_TINY_TCACHE=0`