# Ultra-Deep Analysis Summary: Root Cause Found

**Date**: 2025-11-04
**Status**: 🎯 **ROOT CAUSE IDENTIFIED**

---

## TL;DR

**The Bug**: Fix #1 and Fix #2 drain slabs WITHOUT checking ownership, causing concurrent modification of `meta->freelist` when multiple threads operate on the same SuperSlab.

**The Fix**: Remove Fix #1 and Fix #2, and reorder the sticky/hot/bench/mmap_gate paths to claim ownership BEFORE draining.

**Confidence**: 🟢 **95%** - Explains all symptoms: crashes at 0x6261, timing-dependent failures, partial improvements from Fix #3.

---

## The Race Condition

### What Fix #1 and Fix #2 Do (WRONG)

```c
// Fix #1 (superslab_refill) and Fix #2 (hak_tiny_alloc_superslab)
for (int i = 0; i < tls_cap; i++) {  // Loop through ALL slabs
    if (remote_heads[i] != 0) {
        ss_remote_drain_to_freelist(ss, i);  // ❌ NO ownership check!
    }
}
```

**Problem**: Drains ALL slabs in the SuperSlab, including slabs **owned by other threads**.

### The Race

| Thread A (owns slab 5) | Thread B (Fix #2, no ownership) |
|------------------------|----------------------------------|
| `ptr = meta->freelist` | Loops through all slabs, i=5 |
| `meta->freelist = *(void**)ptr` | Calls `ss_remote_drain_to_freelist(ss, 5)` |
| (allocating from freelist) | `node_next = meta->freelist` ← **RACE!** |
| | `meta->freelist = node` ← **Overwrites A's update!** |

**Result**: Freelist corruption, crash at `fault_addr=0x6261` (truncated pointer).
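
The interleaving in the table can be replayed deterministically on a single thread. The sketch below is a minimal model, not hakmem code: `Block`, `SlabModel`, `replay_race`, and `freelist_contains` are hypothetical names, and the ordering shown is only one schedule consistent with the table (thread B reads the head, thread A pops, thread B writes back its stale view).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a free block stores the next pointer in its first word. */
typedef struct Block { struct Block *next; } Block;
typedef struct { Block *freelist; } SlabModel;

/* Replays one racy schedule on a single thread so the outcome is repeatable.
 * Returns the block that "thread A" hands to its caller. */
static Block *replay_race(SlabModel *meta, Block *remote_node) {
    assert(meta->freelist != NULL);

    /* Thread B (drain without ownership) reads the current head... */
    Block *b_seen_head = meta->freelist;

    /* ...thread A (the owner) pops that same head in between... */
    Block *a_ptr = meta->freelist;
    meta->freelist = a_ptr->next;

    /* ...and thread B finishes with its stale head, overwriting A's update. */
    remote_node->next = b_seen_head;
    meta->freelist = remote_node;

    return a_ptr;
}

/* Returns 1 if `target` is still reachable from the freelist head. */
static int freelist_contains(const SlabModel *meta, const Block *target) {
    for (const Block *b = meta->freelist; b != NULL; b = b->next) {
        if (b == target) return 1;
    }
    return 0;
}
```

Starting from a freelist `b1 -> b2` and a remote node `r`, the replay leaves the list as `r -> b1 -> b2` while `b1` has also been returned to thread A, so the same block is both allocated and free. Once A's caller writes into it, the embedded next pointer holds user data, which is one plausible way to end up dereferencing a small bogus address.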

---

## Why Fix #3 is Correct

```c
// Fix #3 (Mailbox path in tiny_refill.h)
tiny_tls_bind_slab(tls, mss, midx);  // Bind to TLS
ss_owner_cas(m, tiny_self_u32());    // ✅ CLAIM OWNERSHIP FIRST

// NOW safe to drain - we're the owner
if (remote_heads[midx] != 0) {
    ss_remote_drain_to_freelist(mss, midx);  // ✅ Safe: we own it
}
```

**Key difference**: Claims ownership (`owner_tid = self`) BEFORE draining.
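
As an illustration of what "claiming ownership" can look like with C11 atomics, here is a minimal sketch. It assumes an owner field where 0 means "unowned"; `SlabOwner` and `slab_try_claim` are hypothetical names, and the snippet above does not show whether the real `ss_owner_cas` reports success or failure, so the return value here is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-slab metadata; 0 means "no owner". */
typedef struct {
    _Atomic uint32_t owner_tid;
} SlabOwner;

/* Try to become the owner of the slab.
 * Returns true if the slab was unowned (and is now ours) or was already ours;
 * returns false if a different thread owns it. */
static bool slab_try_claim(SlabOwner *m, uint32_t self) {
    assert(self != 0);  /* 0 is reserved for "unowned" in this sketch */
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &m->owner_tid, &expected, self,
            memory_order_acq_rel, memory_order_acquire)) {
        return true;
    }
    /* On failure, `expected` now holds the current owner's id. */
    return expected == self;
}
```

A caller would drain only when `slab_try_claim` returns true. A second thread that loses the compare-and-swap sees false and leaves `meta->freelist` alone, which is the property Fix #1 and Fix #2 lack.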

---

## All Unsafe Call Sites

| Location | Path | Risk | Solution |
|----------|------|------|----------|
| `hakmem_tiny_free.inc:620` | **Fix #1** | 🔴 HIGH | ❌ DELETE |
| `hakmem_tiny_free.inc:756` | **Fix #2** | 🔴 HIGH | ❌ DELETE |
| `tiny_refill.h:47` | Sticky | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_refill.h:65` | Hot slot | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_refill.h:80` | Bench | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_mmap_gate.h:57` | mmap_gate | 🟡 MEDIUM | ✅ Reorder: ownership → drain |
| `tiny_refill.h:105` | **Fix #3** | ✅ SAFE | ✅ Keep as-is |

---

## The Fix (3 Steps)

### Step 1: Remove Fix #1 (Priority: HIGH)

**File**: `core/hakmem_tiny_free.inc`
**Lines**: 615-621

Comment out this block:
```c
// UNSAFE: Drains all slabs without ownership check
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i);  // ❌ DELETE
    }
}
```

### Step 2: Remove Fix #2 (Priority: HIGH)

**File**: `core/hakmem_tiny_free.inc`
**Lines**: 729-767 (entire block)

Comment out the entire Fix #2 block (about 40 lines, starting with "BUGFIX: Drain ALL slabs...").

### Step 3: Fix Refill Paths (Priority: MEDIUM)

**Files**: `core/tiny_refill.h`, `core/tiny_mmap_gate.h`

**Pattern** (apply to sticky/hot/bench/mmap_gate):
```c
// BEFORE (WRONG):
if (!m->freelist && has_remote) ss_remote_drain_to_freelist(ss, idx);  // ❌ Drain first
if (m->freelist) {
    tiny_tls_bind_slab(tls, ss, idx);  // ← Ownership after
    ss_owner_cas(m, self);
    return ss;
}

// AFTER (CORRECT):
tiny_tls_bind_slab(tls, ss, idx);  // ✅ Ownership first
ss_owner_cas(m, self);
if (!m->freelist && has_remote) {
    ss_remote_drain_to_freelist(ss, idx);  // ← Drain after
}
if (m->freelist) {
    return ss;
}
```
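
To make the ordering requirement concrete, the sketch below models a remote-free stack that any thread may push to and a drain that refuses to run unless the caller already owns the slab. This is a simplified model under assumptions, not the hakmem implementation: `RemoteSlab`, `remote_push`, and `drain_if_owner` are hypothetical names, and the real `ss_remote_drain_to_freelist` may be structured differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node { struct Node *next; } Node;

/* Hypothetical slab model; field names are illustrative only. */
typedef struct {
    _Atomic uint32_t owner_tid;   /* 0 = unowned */
    _Atomic(Node *) remote_head;  /* blocks freed by non-owner threads */
    Node *freelist;               /* plain pointer: owner-only access */
} RemoteSlab;

/* Any thread may push a freed block onto the remote stack (lock-free). */
static void remote_push(RemoteSlab *s, Node *n) {
    Node *old = atomic_load_explicit(&s->remote_head, memory_order_relaxed);
    do {
        n->next = old;  /* `old` is refreshed by a failed compare-exchange */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->remote_head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Drain the remote stack into the freelist, but only for the owner.
 * Returns false and leaves everything untouched for any other caller. */
static bool drain_if_owner(RemoteSlab *s, uint32_t self) {
    assert(self != 0);
    if (atomic_load_explicit(&s->owner_tid, memory_order_acquire) != self) {
        return false;
    }
    /* Detach the whole remote chain in one atomic step. */
    Node *chain = atomic_exchange_explicit(&s->remote_head, (Node *)NULL,
                                           memory_order_acquire);
    /* The freelist is not atomic, so this splice is safe only because
     * exactly one thread (the owner) can reach this point. */
    while (chain != NULL) {
        Node *next = chain->next;
        chain->next = s->freelist;
        s->freelist = chain;
        chain = next;
    }
    return true;
}
```

In this model the remote stack tolerates concurrent pushes, but the splice into `freelist` uses plain loads and stores. That is why the claim has to happen before the drain: ownership is the only thing serializing access to the freelist.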

---

## Test Plan

### Test 1: Remove Fix #1 and Fix #2 Only

```bash
# Apply Step 1 and Step 2 (comment out Fix #1 and Fix #2)
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10
```

**Expected**:
- ✅ **If crashes stop**: Fix #1/#2 were the main culprits (DONE!)
- ⚠️ **If crashes continue**: Need Step 3 (refill path fixes)

### Test 2: Apply All Fixes (Step 1-3)

```bash
# Apply all fixes
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20
```

**Expected**: NO crashes, stable for 20+ seconds.

---

## Why This Explains Everything

1. **Crashes at `fault_addr=0x6261`**: Freelist corruption from concurrent writes
2. **Timing-dependent**: Race depends on thread scheduling
3. **Improvement from 500 → 4012 events**: Fix #3 reduced races, but Fix #1/#2 still race
4. **Guard mode vs repro mode**: Different timing → different race frequency

---

## Detailed Documentation

- **Full Analysis**: `/mnt/workdisk/public_share/hakmem/ULTRATHINK_ANALYSIS.md`
- **Implementation Guide**: `/mnt/workdisk/public_share/hakmem/FIX_IMPLEMENTATION_GUIDE.md`
- **This Summary**: `/mnt/workdisk/public_share/hakmem/ULTRATHINK_SUMMARY.md`

---

## Next Action

1. Apply **Step 1 and Step 2** (remove Fix #1 and Fix #2)
2. Rebuild and test (repro mode, 30 threads, 10 seconds)
3. If crashes persist, apply **Step 3** (fix refill paths)
4. Report results

**Estimated time**: 15 minutes to apply fixes + 5 minutes testing = 20 minutes total.

---

**END OF SUMMARY**