Phase 14 v1: Pointer-Chase Reduction (tcache) NEUTRAL (+0.20%)
Implementation:
- Intrusive LIFO tcache layer (L1) before UnifiedCache
- TLS per-class bins (head pointer + count)
- Intrusive next pointers (via tiny_next_store/load SSOT)
- Cap: 64 blocks per class (default)
- ENV: HAKMEM_TINY_TCACHE=0/1 (default: 0, OFF)
A/B Test Results (Mixed 10-run):
- Baseline (TCACHE=0): 51,083,379 ops/s
- Optimized (TCACHE=1): 51,186,838 ops/s
- Mean delta: +0.20% (below +1.0% GO threshold)
- Median delta: +0.59%
Verdict: NEUTRAL - Freeze as research box (default OFF)
Root Cause (v1 wiring incomplete):
- Free side pushes to tcache via unified_cache_push()
- Alloc hot path (tiny_hot_alloc_fast) doesn't consume tcache
- tcache becomes a "sink" without an alloc-side pop → ROI not measurable
Files:
- Created: core/box/tiny_tcache_{env_box,box}.h, tiny_tcache_env_box.c
- Modified: core/front/tiny_unified_cache.h (integration)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (build integration)
- Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
- v2 Instructions: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md
Next: Phase 14 v2 (connect tcache to tiny_front_hot_box alloc/free hot path)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@ -0,0 +1,184 @@
# Phase 14 v1: Pointer-Chase Reduction (tcache-style) A/B Test Results

**Date:** 2025-12-15
**Benchmark:** Mixed (16–1024B) 10-run cleanenv
**Target:** Reduce pointer-chase overhead with intrusive LIFO tcache layer
**Expected ROI:** +15-25% (design estimate)
**GO Threshold:** +1.0% mean improvement

---

## 1. Implementation Summary

Phase 14 v1 adds an intrusive LIFO tcache layer (L1) in front of the existing array-based UnifiedCache, inspired by the glibc tcache pattern.

**Key Components:**
- `core/box/tiny_tcache_env_box.{h,c}` - L0 ENV gate (HAKMEM_TINY_TCACHE=0/1, default 0)
- `core/box/tiny_tcache_box.h` - L1 intrusive LIFO cache (TLS per-class bins)
- `core/front/tiny_unified_cache.h` - Integration (try tcache first, fall through to array cache)
- `core/bench_profile.h` - Refresh sync for bench_profile

**Important Note (wiring completeness):**
- The v1 tcache pop lives in `unified_cache_pop_or_refill()`, but the current main alloc hot path (`tiny_hot_alloc_fast()`) does not go through `unified_cache_pop_or_refill()`.
- The free side, on the other hand, enters tcache via `unified_cache_push()`, so with `HAKMEM_TINY_TCACHE=1` **tcache may become a "sink" and the ROI of the alloc/pop side cannot be measured**.
- A follow-up fix (Phase 14 v2) should connect pop/push to `tiny_front_hot_box`, after which the A/B test should be re-run:
  - `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`

**Design:**
- TLS state: 8 TinyTcacheBin structs (head pointer + count per class)
- Intrusive next pointers stored in the blocks themselves (via tiny_next_store/load SSOT)
- Capacity: 64 blocks per class (default, configurable via HAKMEM_TINY_TCACHE_CAP)
- LIFO order: expected to give better cache locality than the FIFO array cache, since the most recently freed block is reused first
- Two-layer fallback: tcache (fast) → unified_cache (overflow/miss)

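The design above can be illustrated with a minimal, self-contained C sketch of an intrusive LIFO bin. This is not the HAKMEM implementation: every `sketch_`/`Sketch` name is hypothetical, the next-pointer helpers only stand in for `tiny_next_store/load`, and the cap is hard-coded to the documented default of 64 rather than read from `HAKMEM_TINY_TCACHE_CAP`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_CLASSES 8   /* the design lists 8 per-class bins */
#define SKETCH_DEFAULT_CAP 64  /* documented default cap per class */

/* Hypothetical bin layout: head pointer + count, one per size class. */
typedef struct {
    void    *head;
    uint16_t count;
} SketchTcacheBin;

/* One set of bins per thread (C11 thread-local storage). */
static _Thread_local SketchTcacheBin sketch_bins[SKETCH_NUM_CLASSES];

/* Stand-ins for tiny_next_store/load: the next pointer lives in the first
 * bytes of the free block. memcpy avoids alignment and aliasing assumptions. */
static inline void sketch_next_store(void *block, void *next) {
    memcpy(block, &next, sizeof next);
}

static inline void *sketch_next_load(const void *block) {
    void *next;
    memcpy(&next, block, sizeof next);
    return next;
}

/* Push: returns 1 on success, 0 on overflow (caller falls through to the
 * array cache). */
static inline int sketch_tcache_try_push(int class_idx, void *block) {
    SketchTcacheBin *bin = &sketch_bins[class_idx];
    if (bin->count >= SKETCH_DEFAULT_CAP) return 0;
    sketch_next_store(block, bin->head);
    bin->head = block;
    bin->count++;
    return 1;
}

/* Pop: returns the most recently pushed block, or NULL on a miss. */
static inline void *sketch_tcache_try_pop(int class_idx) {
    SketchTcacheBin *bin = &sketch_bins[class_idx];
    void *block = bin->head;
    if (block == NULL) return NULL;
    bin->head = sketch_next_load(block);
    bin->count--;
    return block;
}
```

Both operations touch only the bin head and the block being pushed or popped, which is the property the LIFO layer relies on. A block must be at least `sizeof(void *)` bytes for the intrusive pointer to fit.
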
**ENV Control:**
```bash
export HAKMEM_TINY_TCACHE=0      # Baseline (tcache disabled)
export HAKMEM_TINY_TCACHE=1      # Optimized (tcache enabled)
export HAKMEM_TINY_TCACHE_CAP=64 # Capacity per class (default: 64)
```

---

## 2. A/B Test Results (Mixed 10-run)

### Baseline (TCACHE=0):
```
Run 1: 51,264,525 ops/s
Run 2: 50,950,925 ops/s
Run 3: 51,500,295 ops/s
Run 4: 51,698,050 ops/s
Run 5: 50,396,686 ops/s
Run 6: 50,960,807 ops/s
Run 7: 50,616,179 ops/s
Run 8: 51,817,424 ops/s
Run 9: 50,762,958 ops/s
Run 10: 50,865,941 ops/s
```
**Mean:** 51,083,379 ops/s
**Median:** 50,955,866 ops/s

### Optimized (TCACHE=1):
```
Run 1: 51,555,414 ops/s
Run 2: 51,389,988 ops/s
Run 3: 50,795,917 ops/s
Run 4: 51,880,520 ops/s
Run 5: 50,574,457 ops/s
Run 6: 50,627,901 ops/s
Run 7: 51,233,081 ops/s
Run 8: 51,278,890 ops/s
Run 9: 50,761,326 ops/s
Run 10: 51,770,890 ops/s
```
**Mean:** 51,186,838 ops/s
**Median:** 51,255,986 ops/s

### Delta:
- **Mean delta:** +0.20% (103,459 ops/s improvement)
- **Median delta:** +0.59% (300,120 ops/s improvement)

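The deltas follow directly from the per-run numbers. As a cross-check, the following C sketch recomputes mean, median, and percentage delta from the ten runs of each configuration. The helper names are illustrative only and are not part of HAKMEM or its benchmark scripts.

```c
#include <stdlib.h>
#include <string.h>

/* Per-run throughput (ops/s), copied from the two result tables. */
static const double sketch_baseline[10] = {
    51264525, 50950925, 51500295, 51698050, 50396686,
    50960807, 50616179, 51817424, 50762958, 50865941
};
static const double sketch_optimized[10] = {
    51555414, 51389988, 50795917, 51880520, 50574457,
    50627901, 51233081, 51278890, 50761326, 51770890
};

/* qsort comparator for ascending doubles. */
static int sketch_cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean; returns 0.0 for an empty input. */
static double sketch_mean(const double *v, size_t n) {
    double sum = 0.0;
    if (n == 0) return 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of a sorted copy; for even n, the average of the two middle values.
 * Returns 0.0 for an empty input or on allocation failure. */
static double sketch_median(const double *v, size_t n) {
    double *tmp;
    double med;
    if (n == 0) return 0.0;
    tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return 0.0;
    memcpy(tmp, v, n * sizeof *tmp);
    qsort(tmp, n, sizeof *tmp, sketch_cmp_double);
    med = (n % 2) ? tmp[n / 2] : (tmp[n / 2 - 1] + tmp[n / 2]) / 2.0;
    free(tmp);
    return med;
}

/* Relative change of opt over base, in percent. */
static double sketch_delta_pct(double base, double opt) {
    return (opt - base) / base * 100.0;
}
```

Calling `sketch_delta_pct(sketch_mean(sketch_baseline, 10), sketch_mean(sketch_optimized, 10))` gives roughly +0.20, and the same call with `sketch_median` gives roughly +0.59, matching the figures reported here.
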
---

## 3. Verdict: NEUTRAL

**Result:** +0.20% mean improvement (below +1.0% GO threshold)

**Analysis:**
- Phase 14 v1 shows minimal performance impact on the Mixed workload
- Median delta (+0.59%) is higher than the mean delta, but the min-to-max spread within each configuration is about 2.6-2.8%, so both deltas sit well inside run-to-run noise
- Both deltas are below the +1.0% GO threshold → NEUTRAL classification
- Expected ROI (+15-25%) was not achieved

**Possible Reasons for Lower-than-Expected ROI:**
1. **Workload Mismatch:** Mixed workload (16–1024B) spans multiple classes (C0-C7), but tcache benefits may be concentrated in hot classes (C2/C3: 128B/256B). Mid/large classes (C5-C7) may not benefit as much.
2. **Cache Locality vs Array Access:** While tcache reduces pointer-chasing, the existing UnifiedCache array access may already be well-cached in L1/L2, limiting improvement.
3. **Cap Too Small:** Default cap=64 may be too small for high-churn workloads, causing frequent overflow to array cache.
4. **Intrusive Next Overhead:** Writing/reading next pointers may add overhead that offsets the pointer-chase reduction.
5. **Incomplete hot-path coverage (v1):** Only the free side feeds tcache while the alloc side never consumes it, so hits may be "invisible" (Phase 14 v2 must confirm that the path is actually live).

**Comparison to Smoke Test:**
- Smoke test (single run): +2.4% (51.88M vs 50.68M ops/s)
- Formal 10-run: +0.20% mean, +0.59% median
- Run-to-run variance suggests the smoke-test result was an outlier

---

## 4. Recommendation: Freeze as Research Box

**Decision:** Freeze Phase 14 v1 as research box (default OFF)

**Rationale:**
- NEUTRAL result (+0.20%) does not justify promotion to default
- No measurable harm (close to baseline), suitable for research/experimentation
- Future work may explore:
  - Per-class cap tuning (hot classes get larger caps)
  - Workload-specific profiling (C2/C3-heavy vs C5-C7-heavy)
  - Alternative intrusive next pointer strategies

**Next Steps:**
1. Commit Phase 14 v1 implementation with NEUTRAL verdict
2. Update CURRENT_TASK.md to freeze as research box
3. Keep ENV gate (HAKMEM_TINY_TCACHE=0 default) for future experimentation
4. Consider alternative approaches for pointer-chase reduction (e.g., deeper pipeline optimization, better prefetching)

---

## 5. Raw Data

### Baseline (TCACHE=0):
```
51264525
50950925
51500295
51698050
50396686
50960807
50616179
51817424
50762958
50865941
```
Mean: 51,083,379 ops/s
Median: 50,955,866 ops/s

### Optimized (TCACHE=1):
```
51555414
51389988
50795917
51880520
50574457
50627901
51233081
51278890
50761326
51770890
```
Mean: 51,186,838 ops/s
Median: 51,255,986 ops/s

---

## 6. Files Modified

### Created:
- `core/box/tiny_tcache_env_box.h` - L0 ENV gate
- `core/box/tiny_tcache_env_box.c` - ENV init/refresh implementation
- `core/box/tiny_tcache_box.h` - L1 intrusive LIFO cache

### Modified:
- `core/front/tiny_unified_cache.h` - Integration (try tcache first)
- `core/bench_profile.h` - Refresh sync
- `Makefile` - Build system integration
- `scripts/run_mixed_10_cleanenv.sh` - ENV leak prevention (already updated)

---

## 7. Conclusion

Phase 14 v1 (Pointer-Chase Reduction via tcache-style intrusive LIFO) achieved **+0.20% mean improvement** on the Mixed 10-run benchmark, which is **NEUTRAL** (below the +1.0% GO threshold).

**Final Status:** Freeze as research box (HAKMEM_TINY_TCACHE=0 default, OFF)

**Future Work:** Consider per-class cap tuning or alternative pointer-chase reduction strategies.

@ -147,3 +147,12 @@ GO/NO-GO:
- When the tcache hit rate is high, array accesses and the FIFO's reuse of older blocks can be avoided
- The shortest move toward closing the "system malloc is faster" gap (tcache-like behavior)

---

## Update (2025-12-15)

With only the v1 integration point (`core/front/tiny_unified_cache.h`), the current main alloc hot path (`tiny_hot_alloc_fast()`) does not consume tcache, so
with `HAKMEM_TINY_TCACHE=1` tcache tends to become a "sink".

Next, connect pop/push to the hot path (`core/box/tiny_front_hot_box.h`) and re-run the A/B test with tcache actually live (Phase 14 v2):
- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`

@ -109,3 +109,12 @@ On GO:
On NO-GO/NEUTRAL:
- research box freeze (keep default OFF)

---

## Update (2025-12-15)

With only the v1 integration point (`core/front/tiny_unified_cache.h`), the current main alloc hot path (`tiny_hot_alloc_fast()`) does not consume tcache, so
with `HAKMEM_TINY_TCACHE=1` it tends to become a "sink".

Next, connect pop/push to the hot path (`core/box/tiny_front_hot_box.h`) and re-run the A/B test with tcache actually live (Phase 14 v2):
- `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md`

@ -0,0 +1,131 @@
# Phase 14 v2: Pointer-Chase Reduction - Hot Path Integration Next Instructions (Tiny tcache intrusive LIFO)

## Status

- Phase 14 v1 (added the tcache L1 layer) was **NEUTRAL** on Mixed 10-run (+0.20% mean / +0.59% median)
  - Results: `docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md`
  - Implementation: `core/box/tiny_tcache_env_box.{h,c}` / `core/box/tiny_tcache_box.h` / `core/front/tiny_unified_cache.h`
- However, the current v1 **only feeds tcache on the free side (`unified_cache_push()`), and the alloc side (`tiny_hot_alloc_fast()`) never consumes it**, so:
  - tcache is effectively a sink, and ROI cannot be measured correctly
  - the "tcache-style" premise (symmetric push/pop) does not hold

Phase 14 v2 connects tcache to the **actual hot path of the tiny front** and redoes the A/B test properly.

---

## 0. Goal (GO criteria)

On Mixed 10-run (clean env):
- **GO**: mean +1.0% or more
- **NO-GO**: mean -1.0% or less (immediate rollback / freeze)
- **NEUTRAL**: within ±1.0% (research box freeze)

Additional gate (required):
- With `HAKMEM_TINY_TCACHE=1`, **tcache pops must actually occur** (a count of 0 means the design is not wired in)

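The criteria above, including the additional gate, can be written down as a small decision function. This is only a sketch of the rule as stated: the enum, the function name, and the idea of passing a pop-hit count as an argument are assumptions for illustration, not an existing HAKMEM interface.

```c
/* Possible outcomes of a Phase 14 v2 A/B run (hypothetical names). */
typedef enum {
    SKETCH_GO,         /* mean delta >= +1.0% */
    SKETCH_NEUTRAL,    /* mean delta strictly between -1.0% and +1.0% */
    SKETCH_NO_GO,      /* mean delta <= -1.0% */
    SKETCH_NOT_WIRED   /* tcache enabled but no pop ever hit: result is void */
} SketchVerdict;

/* mean_delta_pct: (optimized mean - baseline mean) / baseline mean * 100.
 * tcache_pop_hits: pop hits observed in the HAKMEM_TINY_TCACHE=1 run. */
static SketchVerdict sketch_classify(double mean_delta_pct,
                                     unsigned long tcache_pop_hits) {
    /* Additional gate first: without any pops the A/B measures nothing. */
    if (tcache_pop_hits == 0) return SKETCH_NOT_WIRED;
    if (mean_delta_pct >= 1.0) return SKETCH_GO;
    if (mean_delta_pct <= -1.0) return SKETCH_NO_GO;
    return SKETCH_NEUTRAL;
}
```

Checking the gate before the threshold keeps an unwired run from being recorded as NEUTRAL, which is how the v1 result was originally classified.
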
---

## 1. Box Diagram (single boundary)

```
L0: tiny_tcache_env_box (ENV gate / refresh / rollback)
  ↓
L1: tiny_tcache_box (intrusive LIFO: push/pop, cap)
  ↓
L2: tiny_front_hot_box (hot alloc/free: tcache → unified_cache(FIFO))
  ↓
L3: cold/refill (unified_cache_refill → SuperSlab)
```

The boundary is fixed at exactly one point: **"tcache miss/overflow → existing UnifiedCache"**.

---

## 2. Implementation Patch Order (small, incremental steps)

### Patch 1: Connect tcache pop to hot alloc (required)

Target:
- `core/box/tiny_front_hot_box.h`

Change:
- At the top of `tiny_hot_alloc_fast(int class_idx)`:
  - try `tiny_tcache_try_pop(class_idx)`
  - on HIT, return immediately via `tiny_header_finalize_alloc(base, class_idx)`
  - on MISS, fall back to the existing FIFO (`cache->slots[head]`)

Requirements:
- Keep the diff minimal so the hot path does not bloat when tcache is OFF (default)
- Strictly follow "when in doubt, fall back" (Fail-Fast)

### Patch 2: Connect tcache push to hot free (recommended)

Target:
- `core/box/tiny_front_hot_box.h`

Change:
- At the top of `tiny_hot_free_fast(int class_idx, void* base)`:
  - try `tiny_tcache_try_push(class_idx, base)`
  - on SUCCESS, `return 1`
  - go to the existing FIFO only on overflow / disabled

Aim:
- Make tcache effective on "direct push" paths that bypass `unified_cache_push()` as well

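Patches 1 and 2 together give the hot path a symmetric "tcache first, FIFO second" shape. The sketch below shows that control flow for a single size class. It is a simplified illustration under stated assumptions, not the contents of `tiny_front_hot_box.h`: all `hp_` names are hypothetical, the FIFO is a toy ring buffer standing in for UnifiedCache, the enable flag stands in for the ENV gate, and the header finalization step is omitted.

```c
#include <stddef.h>
#include <string.h>

#define HP_CAP   64u  /* tcache cap (documented default) */
#define HP_SLOTS 8u   /* toy FIFO size; a power of two so index wrap is safe */

static _Thread_local int      hp_tcache_enabled; /* stand-in for the ENV gate */
static _Thread_local void    *hp_tc_head;        /* intrusive LIFO head */
static _Thread_local unsigned hp_tc_count;

static _Thread_local void    *hp_fifo[HP_SLOTS];
static _Thread_local unsigned hp_fifo_head;      /* free-running read index */
static _Thread_local unsigned hp_fifo_tail;      /* free-running write index */

/* tcache pop: NULL when disabled or empty. */
static void *hp_tcache_try_pop(void) {
    void *b = hp_tc_head;
    if (!hp_tcache_enabled || b == NULL) return NULL;
    memcpy(&hp_tc_head, b, sizeof hp_tc_head);   /* head = block->next */
    hp_tc_count--;
    return b;
}

/* tcache push: 0 when disabled or at cap. */
static int hp_tcache_try_push(void *b) {
    if (!hp_tcache_enabled || hp_tc_count >= HP_CAP) return 0;
    memcpy(b, &hp_tc_head, sizeof hp_tc_head);   /* block->next = head */
    hp_tc_head = b;
    hp_tc_count++;
    return 1;
}

/* Hot alloc (Patch 1 shape): tcache hit returns at once; a miss falls back
 * to the FIFO; NULL means the caller must take the cold refill path. */
static void *hp_hot_alloc_fast(void) {
    void *b = hp_tcache_try_pop();
    if (b != NULL) return b;
    if (hp_fifo_head == hp_fifo_tail) return NULL;
    return hp_fifo[hp_fifo_head++ % HP_SLOTS];
}

/* Hot free (Patch 2 shape): tcache push first; the FIFO is used only on
 * overflow or when tcache is disabled. Returns 0 if both layers are full. */
static int hp_hot_free_fast(void *b) {
    if (hp_tcache_try_push(b)) return 1;
    if (hp_fifo_tail - hp_fifo_head == HP_SLOTS) return 0;
    hp_fifo[hp_fifo_tail++ % HP_SLOTS] = b;
    return 1;
}
```

With the flag off, both functions reduce to the FIFO path plus one predictable branch, which is the "minimal diff when OFF" requirement from Patch 1. With the flag on, a block freed by `hp_hot_free_fast` is the next one returned by `hp_hot_alloc_fast`, so tcache no longer acts as a sink.
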
### Patch 3: Visibility (minimal, TLS)

Candidate target:
- `core/box/tiny_tcache_box.h` (TLS counters)

Additions (for debug / research):
- `tcache_pop_hit/miss`
- `tcache_push_hit/overflow`
- A "one-shot dump" that is emitted exactly once (ENV opt-in)

Prohibited:
- Do not put atomic statistics on the hot path (lesson from Phase 12 / POOL-DN-BATCH)

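One way to meet these constraints is to keep the counters in thread-local storage and update them with plain increments, so no atomic instruction appears on the hot path. The sketch below assumes that approach. The struct, the function names, and the output format are hypothetical, and the ENV opt-in check is left to the caller because the variable name is not specified here.

```c
#include <stdint.h>
#include <stdio.h>

/* Per-thread counters matching the four events listed above. */
typedef struct {
    uint64_t pop_hit;
    uint64_t pop_miss;
    uint64_t push_hit;
    uint64_t push_overflow;
} SketchTcacheStats;

/* Thread-local, so plain (non-atomic) increments are race-free. */
static _Thread_local SketchTcacheStats sketch_stats;
static _Thread_local int sketch_stats_dumped;

/* Record the outcome of one tcache pop attempt. */
static inline void sketch_count_pop(int hit) {
    if (hit) sketch_stats.pop_hit++;
    else     sketch_stats.pop_miss++;
}

/* Record the outcome of one tcache push attempt. */
static inline void sketch_count_push(int ok) {
    if (ok) sketch_stats.push_hit++;
    else    sketch_stats.push_overflow++;
}

/* One-shot dump for the calling thread: prints at most once and returns 1
 * only on the call that actually printed. */
static int sketch_stats_dump_once(FILE *out) {
    if (sketch_stats_dumped) return 0;
    sketch_stats_dumped = 1;
    fprintf(out,
            "tcache pop hit=%llu miss=%llu push hit=%llu overflow=%llu\n",
            (unsigned long long)sketch_stats.pop_hit,
            (unsigned long long)sketch_stats.pop_miss,
            (unsigned long long)sketch_stats.push_hit,
            (unsigned long long)sketch_stats.push_overflow);
    return 1;
}
```

A `pop_hit` of zero after a `HAKMEM_TINY_TCACHE=1` run is exactly the "not wired in" condition from the additional gate in section 0. The counters are per thread, so a multi-threaded benchmark would need each thread to dump or aggregate its own values at exit.
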
---

## 3. A/B Test (same binary)

Baseline:
```sh
HAKMEM_TINY_TCACHE=0 scripts/run_mixed_10_cleanenv.sh
```

Optimized:
```sh
HAKMEM_TINY_TCACHE=1 scripts/run_mixed_10_cleanenv.sh
```

Additional (check whether the effect is class-dependent):
```sh
HAKMEM_BENCH_C7_ONLY=1 HAKMEM_TINY_TCACHE=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_BENCH_C7_ONLY=1 HAKMEM_TINY_TCACHE=1 scripts/run_mixed_10_cleanenv.sh
```

Cap sweep (research, only when needed):
```sh
HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=32 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=64 scripts/run_mixed_10_cleanenv.sh
HAKMEM_TINY_TCACHE=1 HAKMEM_TINY_TCACHE_CAP=128 scripts/run_mixed_10_cleanenv.sh
```

---

## 4. Health Check (required)

```sh
scripts/verify_health_profiles.sh
```

---

## 5. Verdict and Handling

- GO: promotion into `bench_profile` starts with **MIXED_TINYV3_C7_SAFE only** (staged rollout)
- NEUTRAL/NO-GO: freeze Phase 14 v2 as a research box (keep default OFF)
- Rollback:
  - `export HAKMEM_TINY_TCACHE=0`
