Moe Charm (CI) 0546454168 WIP: Add TLS SLL validation and SuperSlab registry fallback
ChatGPT's diagnostic changes to address TLS_SLL_HDR_RESET issue.
Current status: Partial mitigation, but root cause remains.

Changes Applied:
1. SuperSlab Registry Fallback (hakmem_super_registry.h)
   - Added legacy table probe when hash map lookup misses
   - Prevents NULL returns for valid SuperSlabs during initialization
   - Status: Works, but may hide underlying registration issues

2. TLS SLL Push Validation (tls_sll_box.h)
   - Reject push if SuperSlab lookup returns NULL
   - Reject push if class_idx mismatch detected
   - Added [TLS_SLL_PUSH_NO_SS] diagnostic message
   - Status: Prevents list corruption (defensive)

3. SuperSlab Allocation Class Fix (superslab_allocate.c)
   - Pass actual class_idx to sp_internal_allocate_superslab
   - Prevents dummy class=8 causing OOB access
   - Status: Root cause fix for allocation path

4. Debug Output Additions
   - First 256 push/pop operations traced
   - First 4 mismatches logged with details
   - SuperSlab registration state logged
   - Status: Diagnostic tool (not a fix)

5. TLS Hint Box Removed
   - Deleted ss_tls_hint_box.{c,h} (Phase 1 optimization)
   - Simplified to focus on stability first
   - Status: Can be re-added after the root cause is fixed

Current Problem (REMAINS UNSOLVED):
- [TLS_SLL_HDR_RESET] still occurs after ~60 seconds of sh8bench
- Pointer is 16 bytes offset from expected (class 1 → class 2 boundary)
- hak_super_lookup returns NULL for that pointer
- Suggests: Use-After-Free, Double-Free, or pointer arithmetic error

Root Cause Analysis:
- Pattern: Pointer offset by +16 (one class 1 stride)
- Timing: Cumulative problem (appears after 60s, not immediately)
- Location: Header corruption detected during TLS SLL pop

Remaining Issues:
⚠️ Registry fallback is defensive (may hide registration bugs)
⚠️ Push validation prevents symptoms but not root cause
⚠️ 16-byte pointer offset source unidentified

Next Steps for Investigation:
1. Full pointer arithmetic audit (Magazine ⇔ TLS SLL paths)
2. Enhanced logging at HDR_RESET point:
   - Expected vs actual pointer value
   - Pointer provenance (where it came from)
   - Allocation trace for that block
3. Verify Headerless flag is OFF throughout build
4. Check for double-offset application in conversions

Technical Assessment:
- 60% root cause fixes (allocation class, validation)
- 40% defensive mitigation (registry fallback, push rejection)

Performance Impact:
- Registry fallback: +10-30 cycles on cold path (negligible)
- Push validation: +5-10 cycles per push (acceptable)
- Overall: < 2% performance impact estimated

Related Issues:
- Phase 1 TLS Hint Box removed temporarily
- Phase 2 Headerless blocked until stability achieved

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:42:28 +09:00

Benchmarks Docs

This document defines the rules for running, saving, and naming benchmarks.

Storage location and naming

  • Sweep results: docs/benchmarks/<YYYY-MM-DD>_SWEEP_NOTES.md
  • Large raw logs: docs/benchmarks/<YYYY-MM-DD>/<label>_T<threads>.log
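
The naming convention above can be applied mechanically. The following is a minimal sketch, assuming two hypothetical helper functions (`sweep_notes_path` and `raw_log_path` are not part of the repository) that only assemble the path strings:

```shell
# Hypothetical helpers (not in the repo): build paths that follow the
# docs/benchmarks naming convention.

# sweep_notes_path [DATE]
#   -> docs/benchmarks/<YYYY-MM-DD>_SWEEP_NOTES.md (DATE defaults to today)
sweep_notes_path() {
  local day="${1:-$(date +%F)}"
  printf 'docs/benchmarks/%s_SWEEP_NOTES.md\n' "$day"
}

# raw_log_path LABEL THREADS [DATE]
#   -> docs/benchmarks/<YYYY-MM-DD>/<label>_T<threads>.log
raw_log_path() {
  local label="$1" threads="$2" day="${3:-$(date +%F)}"
  printf 'docs/benchmarks/%s/%s_T%s.log\n' "$day" "$label" "$threads"
}
```

A caller would typically create the dated directory first, for example `mkdir -p "$(dirname "$(raw_log_path mid 4)")"`, and then redirect the benchmark output to that path.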

Basic sweep

# 1) Quick pass over the representative Tiny/Mid/Large/Big ranges (1-2 s each)
scripts/prof_sweep.sh -d 2 -t 1,4 -s 8

# 2) Detailed sweep focused on the Mid band (example: 2-32KB, 1s, 1T/4T)
scripts/prof_sweep.sh -d 1 -t 1,4 -s 7 -m 2048 -M 32768

Representative scenarios (manual)

# 13-15KB, 1T (DYN1 A/B)
LD_PRELOAD=$(readlink -f ./libhakmem.so) HAKMEM_MID_DYN1=0     mimalloc-bench/bench/larson/larson 1 13000 15000 10000 1 12345 1
LD_PRELOAD=$(readlink -f ./libhakmem.so) HAKMEM_MID_DYN1=14336 mimalloc-bench/bench/larson/larson 1 13000 15000 10000 1 12345 1

# Allow L1 inside the wrapper
HAKMEM_WRAP_L2=1 HAKMEM_WRAP_L25=1 ...

Scripts (log saving and safe execution)

  • scripts/save_prof_sweep.sh: saves results automatically into a date-time folder (with an external timeout)
  • scripts/run_bench_suite.sh: small suite covering system/mimalloc/hakmem (with an external timeout)
  • scripts/ab_sweep_mid.sh: Mid-band A/B (CAP × min_bundle × threads, with an external timeout)
  • scripts/ab_fast_mid.sh: A/B of the Mid fast-return settings (trylock probes × ring return div, short runs)
  • scripts/ab_rcap_probe_drain.sh: A/B of RING_CAP × PROBES × DRAIN_MAX × TLS_LO_MAX for Mid (short runs, includes a rebuild)
  • scripts/run_larson.sh: reproducible larson runs (burst/loop presets, thread-count option, prints the tail of the log)
  • scripts/kill_bench.sh: force-stops leftover processes (TERM→KILL)
  • scripts/head_to_head_large.sh: Large (64KB-1MB) 10s head-to-head (system/mimalloc/hakmem). Saves the P1/P2 profiles in one go
  • scripts/ab_l25_tc.sh: A/B of RUN_FACTOR × TC_SPILL for L2.5 (remote, HDR=2), 10s. Saves logs automatically
  • scripts/bench_large_profiles.sh: saves the representative Large 10s profiles (P1 best / P2+TC best)
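
The TERM→KILL behavior listed for scripts/kill_bench.sh follows a common pattern: ask politely first, then force. The sketch below illustrates that pattern only; `stop_gracefully` is a hypothetical function and is not taken from the actual script, whose contents are not shown here.

```shell
# Hypothetical sketch of a TERM -> KILL escalation (not the real kill_bench.sh).
# stop_gracefully PID
#   Sends SIGTERM, polls once per second for up to KILL_GRACE seconds
#   (default 2), then sends SIGKILL if the process is still present.
stop_gracefully() {
  local pid="$1" grace="${KILL_GRACE:-2}" waited=0
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$grace" ]; then
      kill -KILL "$pid" 2>/dev/null            # grace period expired
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

Polling with `kill -0` keeps the sketch portable. A real script would more likely select targets with `pkill -f`, as the troubleshooting notes do.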

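Common environment variables:

  • RUNTIME (seconds): measurement time (default 1)
  • BENCH_TIMEOUT (seconds): wall-clock timeout. Defaults to RUNTIME+3 when unset
  • KILL_GRACE (seconds): grace period between SIGTERM and SIGKILL (default 2)
  • For Mid: HAKMEM_POOL_MIN_BUNDLE (recommended: 4), HAKMEM_SHARD_MIX=1 (stronger shard distribution)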
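
The defaults described above can be resolved in a few lines. This is a minimal sketch, assuming GNU coreutils `timeout` (whose `-k` option sends SIGKILL after a further delay). The function names are hypothetical, and the repository scripts may implement this differently.

```shell
# Hypothetical helpers illustrating the documented defaults.

# resolve_bench_timeout: prints BENCH_TIMEOUT, or RUNTIME+3 when it is unset
# (RUNTIME itself defaults to 1).
resolve_bench_timeout() {
  local runtime="${RUNTIME:-1}"
  printf '%s\n' "${BENCH_TIMEOUT:-$((runtime + 3))}"
}

# run_with_timeout CMD [ARGS...]
#   Runs CMD under a wall-clock timeout. SIGTERM is sent at the deadline and
#   SIGKILL follows KILL_GRACE seconds later (default 2) if CMD is still alive.
run_with_timeout() {
  timeout -k "${KILL_GRACE:-2}s" "$(resolve_bench_timeout)s" "$@"
}
```

With GNU `timeout`, an exit status of 124 means the deadline was reached and the command was terminated by the initial signal, so a caller can distinguish a hang from an ordinary benchmark failure.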

Examples:

BENCH_TIMEOUT=6 scripts/save_prof_sweep.sh -d 1 -t 1,4 -s 8
RUNTIME=1 THREADS=1,4 BENCH_TIMEOUT=6 scripts/run_bench_suite.sh

# Mid fast A/B (10 s, 1T/4T)
RUNTIME=10 THREADS=1,4 PROBES=2,3 RETURNS=2,3 scripts/ab_fast_mid.sh

# Mid ring / probe / drain / LIFO cap A/B (2 s, 1T/4T)
RUNTIME=2 THREADS=1,4 RCAPS=8,16 PROBES=2,3 DRAINS=32,64 LOMAX=256,512 \
  scripts/ab_rcap_probe_drain.sh

# Head-to-head (Tiny/Mid, system vs mimalloc vs hakmem)
export HAKMEM_HDR_LIGHT=1 HAKMEM_POOL_TLS_RING=1 HAKMEM_SHARD_MIX=1 \
       HAKMEM_TRYLOCK_PROBES=3 HAKMEM_RING_RETURN_DIV=3
OUT=docs/benchmarks/$(date +%Y%m%d_%H%M%S)_HEAD2HEAD && mkdir -p "$OUT"
scripts/run_larson.sh -d 10 -p burst -m 8 -M 64    | tee "$OUT/tiny_burst.log"
scripts/run_larson.sh -d 10 -p burst -m 2048 -M 32768 | tee "$OUT/mid_burst.log"

Timing measurement (Debug Timing)

Visualizes hot spots per measurement category (output goes to stderr). A debug build is recommended.

Mid 4T, 10s:


## Large (64KB-1MB) benchmark tuning (10s)

Recommended profiles (as of now):
- P1 best (alloc-priority)
  - `HAKMEM_L25_PREF=remote HAKMEM_L25_RUN_FACTOR=4 HAKMEM_HDR_LIGHT=1 HAKMEM_SHARD_MIX=1`
  - Rough figure: ~102k ops/s (4T, timing ON)
- P2+TC best (free-priority; headerless, page descriptor, TC)
  - `HAKMEM_L25_PREF=remote HAKMEM_L25_RUN_FACTOR=4 HAKMEM_HDR_LIGHT=2 HAKMEM_L25_TC_SPILL=16 HAKMEM_SHARD_MIX=1`
  - Rough figure: ~99k ops/s (4T, timing ON). Favorable for free-heavy patterns

Example run (head-to-head, saved):

./scripts/head_to_head_large.sh # saved to docs/benchmarks/_HEAD2HEAD_LARGE


Parameter A/B (RUN_FACTOR × TC_SPILL):

RUNTIME=10 THREADS=4 ./scripts/ab_l25_tc.sh # saved to docs/benchmarks/_L25_TC_AB


Notes:
- An absolute path is recommended for `LD_PRELOAD` (`readlink -f ./libhakmem.so`)
- Timing (`HAKMEM_TIMING=1`) slows the run down, so re-check the final comparison with timing OFF as well
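
The absolute-path note can be enforced before a run starts. A minimal sketch, assuming a hypothetical `resolve_preload` helper (not part of the repository) that fails early when the library is missing instead of letting the loader silently skip it:

```shell
# Hypothetical helper: resolve a library to an absolute path for LD_PRELOAD.
# resolve_preload LIB
#   Prints the absolute path of LIB, or returns 1 if LIB does not exist.
resolve_preload() {
  local lib="$1" abs
  [ -f "$lib" ] || { echo "missing library: $lib" >&2; return 1; }
  abs="$(readlink -f "$lib")" || return 1
  printf '%s\n' "$abs"
}
```

A run would then look like `LD_PRELOAD="$(resolve_preload ./libhakmem.so)" mimalloc-bench/bench/larson/larson ...`, with the benchmark arguments left as in the examples above.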

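## Troubleshooting (hangs / zombies / runaway processes)

- Add a timeout (hang prevention)
  - Wrap every long run in `timeout ${BENCH_TIMEOUT:-$((RUNTIME+3))}s`
  - In this repo, `scripts/head_to_head_large.sh` and `scripts/ab_l25_tc.sh` already support the timeout
- Zombies: check, find the parent, clean up
  - Check: `ps -eo pid,ppid,stat,etime,cmd | awk '$3 ~ /Z/ {print}'`
  - Find the parent: `pstree -sp <PPID>` (if unavailable, `ps -p <PPID> -o pid,ppid,cmd`)
  - Clean up: a zombie cannot be killed. Terminate or restart the parent process properly (tmux session, shell, resident tools, and so on)
  - Example: `kill -HUP <PPID>`; if that has no effect, close the session and reconnect
- Stop leftover processes in bulk (benchmarks)
  - Stop larson: `pkill -f 'mimalloc-bench/bench/larson/larson'` (worst case: `pkill -9 -f ...`)
- Typical case (this environment)
  - There have been cases where many `<defunct>` entries for `notify_wrapper.` were left behind. The parent is often the codex launcher or a shell
  - After a long session, refreshing tmux/the shell before running A/B gives more stable results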
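
When many zombies accumulate, grouping them by parent shows which process needs to be restarted. A minimal sketch, assuming a hypothetical `zombie_parents` filter that reads `PID PPID STAT` rows on standard input:

```shell
# Hypothetical filter: count zombie processes per parent PID.
# Input rows:  PID PPID STAT [anything else]
# Output rows: <ppid> <zombie count>, sorted by parent PID.
zombie_parents() {
  awk '$3 ~ /^Z/ { n[$2]++ } END { for (p in n) print p, n[p] }' | sort -n
}

# Live usage (the trailing "=" on each column suppresses the ps header):
#   ps -eo pid=,ppid=,stat= | zombie_parents
```

Each output row names a parent that can then be inspected with `pstree -sp <PPID>` as described in the list above.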
Mid 4T, 10s (Debug Timing):

make -j4 debug
HAKMEM_TIMING=1 HAKMEM_POOL_TLS_RING=1 HAKMEM_TRYLOCK_PROBES=3 HAKMEM_TLS_LO_MAX=256 \
  LD_PRELOAD=$(readlink -f ./libhakmem.so) mimalloc-bench/bench/larson/larson 10 2048 32768 10000 1 12345 4

Large 4T, 10s, L2.5:

make -j4 debug
HAKMEM_TIMING=1 HAKMEM_WRAP_L25=1 HAKMEM_POOL_TLS_RING=1 HAKMEM_TRYLOCK_PROBES=3 HAKMEM_TLS_LO_MAX=256 \
  LD_PRELOAD=$(readlink -f ./libhakmem.so) mimalloc-bench/bench/larson/larson 10 65536 1048576 10000 1 12345 4

Main categories (excerpt):

  • Mid(L2): pool_lock, pool_refill, pool_tc_drain, pool_tls_ring_pop, pool_tls_lifo_pop, pool_remote_push, pool_alloc_tls_page
  • L2.5: l25_lock, l25_refill, l25_tls_ring_pop, l25_tls_lifo_pop, l25_remote_push, l25_alloc_tls_page, l25_shard_steal