Files
hakmem/docs/RAPID_DIAGNOSIS_CANARY_SANDWICH.md
Moe Charm (CI) 2d8dfdf3d1 Fix critical integer overflow bug in TLS SLL trace counters
Root Cause:
- Diagnostic trace counters (g_tls_push_trace, g_tls_pop_trace) were declared
  as 'int' type instead of 'uint32_t'
- Counter would overflow at exactly 256 iterations, causing SIGSEGV
- Bug prevented any meaningful testing in debug builds

Changes:
1. core/box/tls_sll_box.h (tls_sll_push_impl):
   - Changed g_tls_push_trace from 'int' to 'uint32_t'
   - Increased threshold from 256 to 4096
   - Fixes immediate crash on startup

2. core/box/tls_sll_box.h (tls_sll_pop_impl):
   - Changed g_tls_pop_trace from 'int' to 'uint32_t'
   - Increased threshold from 256 to 4096
   - Ensures consistent counter handling

3. core/hakmem_tiny_refill.inc.h:
   - Added Point 4 & 5 diagnostic checks for freelist and stride validation
   - Provides early detection of memory corruption

Verification:
- Built with RELEASE=0 (debug mode): SUCCESS
- Ran 3x 190-second tests: ALL PASS (exit code 0)
- No SIGSEGV crashes after fix
- Counter safely handles values beyond 255

Impact:
- Debug builds now stable instead of immediate crash
- 100% reproducible crash → zero crashes (3/3 tests pass)
- No performance impact (diagnostic code only)
- No API changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 10:38:19 +09:00

300 lines
8.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# 🎯 最速診断法: Canary Sandwich 検査 (2025-12-04)
**目的**: 180秒クラッシュの**破壊パターン**を検出し、根本原因を特定
**戦略**: TLS SLL の周囲に "Canary" (検査値) を配置し、いつ/どこで破壊されるかを追跡
---
## 📊 Canary Sandwich の層構造
```
=== Before TLS SLL ===
g_tls_canary_before_sll = 0xDEADBEEFDEADBEEF
=== TLS SLL Array (8 classes × sizeof(TinyTLSSLL)) ===
g_tls_sll[0..7] {
head: hak_base_ptr_t
count: uint32_t
}
=== After TLS SLL ===
g_tls_canary_after_sll = 0xDEADBEEFDEADBEEF
=== Canary in each class ===
g_tls_sll_canary[0..7] = 0xBADC0FFEEBADC0FFE
```
---
## 🔍 検査ポイント5箇所
### Point 1: TLS SLL 配列のメモリレイアウト確認
```bash
# gdb スクリプト
print &g_tls_canary_before_sll
print &g_tls_sll[0]
print &g_tls_sll[7]
print &g_tls_canary_after_sll
print sizeof(g_tls_sll)
print sizeof(TinyTLSSLL)
```
**期待値**:
- `before < &sll[0]` (before は sll の前)
- `&sll[7] < after` (after は sll の後)
- レイアウトが連続
**崩れたら**:
- コンパイラが変わった可能性
- 構造体パディングが変わった
---
### Point 2: Canary の破壊チェック180秒時点
修正: `core/hakmem_tiny.c` に定期検査を追加
```c
// hakmem_tiny.c に追加
static void check_tls_sll_canaries(void) {
// 5秒ごとに実行background thread または refill時に
// Check 1: before/after canary
const uint64_t EXPECTED = 0xDEADBEEFDEADBEEFULL;
extern __thread uint64_t g_tls_canary_before_sll;
extern __thread uint64_t g_tls_canary_after_sll;
if (g_tls_canary_before_sll != EXPECTED) {
fprintf(stderr,
"[CANARY_BROKEN_BEFORE] expected=%#llx got=%#llx\n",
EXPECTED, g_tls_canary_before_sll);
// フルダンプ
dump_tls_sll_state();
abort();
}
if (g_tls_canary_after_sll != EXPECTED) {
fprintf(stderr,
"[CANARY_BROKEN_AFTER] expected=%#llx got=%#llx\n",
EXPECTED, g_tls_canary_after_sll);
dump_tls_sll_state();
abort();
}
// Check 2: 各 class の count overflow
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
if (g_tls_sll[i].count > 10000) { // 異常値
fprintf(stderr,
"[SLL_COUNT_OVERFLOW] cls=%d count=%u\n",
i, g_tls_sll[i].count);
dump_tls_sll_state();
abort();
}
}
}
```
---
### Point 3: Head ポインタの破壊パターン
180秒時点での head 値をキャプチャ:
```c
// tls_sll_push_impl 内に追加 (line 737 近辺)
static __thread int push_count = 0;
push_count++;
// 180秒ぐらい = 60 million operations なら、50 million+ で検査
if (push_count > 50000000) {
fprintf(stderr, "[PUSH_COUNT_CRITICAL] cls=%d count=%d head=%p sll_count=%u\n",
class_idx, push_count,
HAK_BASE_TO_RAW(g_tls_sll[class_idx].head),
g_tls_sll[class_idx].count);
// Head が valid range か?
uintptr_t head_addr = (uintptr_t)HAK_BASE_TO_RAW(g_tls_sll[class_idx].head);
if (head_addr < 4096 || head_addr > 0x00007fffffffffffULL) {
fprintf(stderr, "[HEAD_OUT_OF_RANGE] cls=%d head=%p\n",
class_idx, HAK_BASE_TO_RAW(g_tls_sll[class_idx].head));
dump_tls_sll_state();
abort();
}
}
```
---
### Point 4: Freelist Chain の整合性
`sll_refill_small_from_ss` line 334 に挿入:
```c
// freelist から p を取得した直後
if (meta->freelist) {
p = meta->freelist;
void* next_raw;
PTR_NEXT_READ("refill_check", class_idx, p, 0, next_raw);
// next が妥当なメモリアドレス か?
uintptr_t next_addr = (uintptr_t)next_raw;
// Check 4a: NULL が妥当 (freelist の終端)
if (next_raw == NULL) {
// OK
}
// Check 4b: 有効なアドレス範囲
else if (next_addr >= 4096 && next_addr <= 0x00007fffffffffffULL) {
// OK - 有効そう
}
// Check 4c: SuperSlab に属しているか確認
else {
SuperSlab* ss_check = hak_super_lookup(next_raw);
if (!ss_check || ss_check->magic != SUPERSLAB_MAGIC) {
fprintf(stderr,
"[FREELIST_NEXT_INVALID] cls=%d p=%p next=%p from_ss=%p\n",
class_idx, p, next_raw, ss_check);
dump_tls_sll_state();
abort();
}
}
meta->freelist = next_raw;
// ...
}
```
---
### Point 5: Stride 計算の確認
`sll_refill_small_from_ss` line 355 に挿入:
```c
// Carve path での addr 計算
else if (meta->carved < meta->capacity) {
uint8_t* base = tls->slab_base ? tls->slab_base :
tiny_slab_base_for_geometry(tls->ss, tls->slab_idx);
if (!base) break;
uint16_t idx = meta->carved;
if (idx >= meta->capacity) break;
const size_t stride = tiny_stride_for_class(class_idx);
// Check 5: stride が 0 でないか、overflow していないか
if (stride == 0 || stride > 100000) {
fprintf(stderr,
"[STRIDE_INVALID] cls=%d stride=%zu idx=%u cap=%u\n",
class_idx, stride, idx, meta->capacity);
dump_tls_sll_state();
abort();
}
uint8_t* addr = base + ((size_t)idx * stride);
// Check 5b: addr がベースのメモリ範囲内か
// (簡易チェック: base より後ろ、かつ reasonable offset)
uintptr_t base_addr = (uintptr_t)base;
uintptr_t addr_addr = (uintptr_t)addr;
size_t max_offset = (size_t)meta->capacity * stride;
if (addr_addr < base_addr || (addr_addr - base_addr) > max_offset) {
fprintf(stderr,
"[ADDR_OUT_OF_BOUNDS] cls=%d base=%p addr=%p offset=%zu max=%zu\n",
class_idx, base, addr, (addr_addr - base_addr), max_offset);
dump_tls_sll_state();
abort();
}
meta->carved++;
meta->used++;
// ...
p = addr;
}
```
---
## 🛠️ 実装手順
### ステップ 1: Canary 初期化hakmem_tiny.c
既にあるはずだが、確認:
```bash
grep -n "g_tls_canary_before_sll\|g_tls_canary_after_sll" /mnt/workdisk/public_share/hakmem/core/hakmem_tiny.c | head -20
```
**存在なら**: Point 4-5 の挿入に進む
**存在なし**: Canary の初期化を追加
---
### ステップ 2: Point 4-5 検査の挿入
ファイル: `core/hakmem_tiny_refill.inc.h`
- Line 334 (freelist 処理) に Point 4 挿入
- Line 355 (carve 処理) に Point 5 挿入
---
### ステップ 3: ビルド & テスト
```bash
cd /mnt/workdisk/public_share/hakmem
make clean && make RELEASE=0 # デバッグモード有効
# テスト
timeout 190 env LD_PRELOAD=./libhakmem.so \
./mimalloc-bench/out/bench/sh8bench 2>&1 | tail -50
```
**期待される出力**:
- 180秒前後で **制御されたabort()** が発生
- `[CANARY_*]`, `[FREELIST_*]`, `[ADDR_*]` のいずれかのログが表示
- そのログから破壊パターンが明確になる
---
## 📋 Canary Sandwich の利点
1. **最速診断**: クラッシュの 1 秒前に検知 → 詳細ログ
2. **破壊パターン特定**: どの構造体が壊れたか明白
3. **根本原因推定**: パターンから原因が逆算可能
4. **非侵襲的**: 既存コードを大きく変更しない
---
## 🎯 予想される結果
| 破壊パターン | 原因 | 修正 |
|-----------|------|------|
| `CANARY_BROKEN_BEFORE` | g_tls_sll の手前がバッファオーバーフロー | Stack corruption / グローバル領域破壊 |
| `CANARY_BROKEN_AFTER` | g_tls_sll の後ろが上書き | グローバル領域破壊 |
| `FREELIST_NEXT_INVALID` | freelist の next が破壊 | Double-free / heap corruption |
| `ADDR_OUT_OF_BOUNDS` | carve 計算がオーバーフロー | Integer overflow / stride 計算エラー |
| `HEAD_OUT_OF_RANGE` | TLS head が破壊 | TLS SLL push 側のバグ |
---
## ✅ 実装チェックリスト
- [ ] Point 1: メモリレイアウト確認gdb
- [ ] Point 2: Canary 破壊チェック実装
- [ ] Point 3: Head ポインタ破壊検査実装
- [ ] Point 4: Freelist chain 整合性検査実装
- [ ] Point 5: Stride 計算検査実装
- [ ] ビルド成功
- [ ] 180秒テスト実行 → ログ解析
---
**作成日**: 2025-12-04
**方法**: Canary Sandwich - 5層の防御で破壊を検出