Files
hakmem/docs/analysis/P0_SEGV_ANALYSIS.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 完了:環境変数整理 + fprintf デバッグガード

ENV変数削除(BG/HotMag系):
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除(旧レポート・重複docs)

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作)
- ENV整理による機能影響なし
- Debug出力は一部残存(次phase で対応)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00

271 lines
8.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# P0 Batch Refill SEGV - Root Cause Analysis
## Executive Summary
**Status**: Root cause identified - Multiple potential bugs in P0 batch refill
**Severity**: CRITICAL - Crashes at 10K iterations consistently
**Impact**: P0 optimization completely broken in release builds
## Test Results
| Build Mode | P0 Status | 100K Test | Performance |
|------------|-----------|-----------|-------------|
| Release | OFF | ✅ PASS | 2.34M ops/s |
| Release | ON | ❌ SEGV @ 10K | N/A |
**Conclusion**: P0 is 100% confirmed as the crash cause.
## SEGV Characteristics
1. **Crash Point**: Always after class 1 SuperSlab initialization
2. **Iteration Count**: Fails at 10K, succeeds at 5K-9.75K
3. **Register State** (from GDB):
- `rax = 0x0` (NULL pointer)
- `rdi = 0xfffffffffffbaef0` (corrupted pointer)
- `r12 = 0xda55bada55bada38` (possible sentinel pattern)
4. **Symptoms**: Pointer corruption, not simple null dereference
## Critical Bugs Identified
### Bug #1: Release Build Disables All Boundary Checks (HIGH PRIORITY)
**Location**: `core/tiny_refill_opt.h:86-97`
```c
static inline int trc_refill_guard_enabled(void) {
#if HAKMEM_BUILD_RELEASE
return 0; // ← ALL GUARDS DISABLED!
#else
// ...validation logic...
#endif
}
```
**Impact**: In release builds (NDEBUG=1):
- No freelist corruption detection
- No linear carve boundary checks
- No alignment validation
- Silent memory corruption until SEGV
**Evidence**:
- Our test runs with `-DNDEBUG -DHAKMEM_BUILD_RELEASE=1` (line 552 of Makefile)
- All `trc_refill_guard_enabled()` checks return 0
- Lines 137-144, 146-161, 180-188, 197-200 of `tiny_refill_opt.h` are NEVER executed
### Bug #2: Potential Double-Counting of meta->used
**Location**: `core/tiny_refill_opt.h:210` + `core/hakmem_tiny_refill_p0.inc.h:182`
```c
// In trc_linear_carve():
meta->used += batch; // ← Increment #1
// In sll_refill_batch_from_ss():
ss_active_add(tls->ss, batch); // ← Increment #2 (SuperSlab counter)
```
**Analysis**:
- `meta->used` is the slab-level active counter
- `ss->total_active_blocks` is the SuperSlab-level counter
- If free path decrements both, we have a problem
- If free path decrements only one, counters diverge → OOM
**Needs Investigation**:
- How does free path decrement counters?
- Are `meta->used` and `ss->total_active_blocks` supposed to be independent?
### Bug #3: Freelist Sentinel Mixing Risk
**Location**: `core/hakmem_tiny_refill_p0.inc.h:128-132`
```c
uint32_t remote_count = atomic_load_explicit(...);
if (remote_count > 0) {
_ss_remote_drain_to_freelist_unsafe(tls->ss, tls->slab_idx, meta);
}
```
**Concern**:
- Remote drain adds blocks to `meta->freelist`
- If sentinel values (like `0xda55bada55bada38` seen in r12) are mixed in
- Next freelist pop will dereference sentinel → SEGV
**Needs Investigation**:
- Does `_ss_remote_drain_to_freelist_unsafe` properly sanitize sentinels?
- Are there sentinel values in the remote queue?
### Bug #4: Boundary Calculation Error for Slab 0
**Location**: `core/hakmem_tiny_refill_p0.inc.h:117-120`
```c
ss_limit = ss_base + SLAB_SIZE;
if (tls->slab_idx == 0) {
ss_limit = ss_base + (SLAB_SIZE - SUPERSLAB_SLAB0_DATA_OFFSET);
}
```
**Analysis**:
- For slab 0, limit should be `ss_base + usable_size`
- Current code: `ss_base + (SLAB_SIZE - 2048)` ← This is usable size from base, correct
- Actually, this looks OK (false alarm)
### Bug #5: Missing External Declarations
**Location**: `core/hakmem_tiny_refill_p0.inc.h:142-143, 183-184`
```c
extern unsigned long long g_rf_freelist_items[]; // ← Not declared in header
extern unsigned long long g_rf_carve_items[]; // ← Not declared in header
```
**Impact**:
- These might not be defined anywhere
- Linker might place them at wrong addresses
- Writes to these arrays could corrupt memory
## Hypotheses (Ordered by Likelihood)
### Hypothesis A: Linear Carve Boundary Violation (75% confidence)
**Theory**:
- `meta->carved + batch > meta->capacity` happens
- Release build has no guard (Bug #1)
- Linear carve writes beyond slab boundary
- Corrupts adjacent metadata or freelist
- Next allocation/free reads corrupted pointer → SEGV
**Evidence**:
- SEGV happens consistently at 10K iterations (specific memory state)
- Pointer corruption (`rdi = 0xffff...baef0`) suggests out-of-bounds write
- `[BATCH_CARVE]` log shows batch=16 for class 6
**Test**: Rebuild without `-DNDEBUG` to enable guards
### Hypothesis B: Freelist Double-Pop (60% confidence)
**Theory**:
- Remote drain adds blocks to freelist
- P0 pops from freelist
- Another thread also pops same blocks (race condition)
- Blocks get allocated twice
- Later free corrupts active allocations → SEGV
**Evidence**:
- r12 = `0xda55bada55bada38` looks like a sentinel pattern
- Remote drain happens at line 130
**Test**: Disable remote drain temporarily
### Hypothesis C: Active Counter Desync (50% confidence)
**Theory**:
- `meta->used` and `ss->total_active_blocks` get out of sync
- SuperSlab thinks it's full when it's not (or vice versa)
- `superslab_refill()` returns NULL (OOM)
- Allocation returns NULL
- Free path dereferences NULL → SEGV
**Evidence**:
- Previous fix added `ss_active_add()` (CLAUDE.md line 141)
- But `trc_linear_carve` also does `meta->used++`
- Potential double-counting
**Test**: Add counters to track divergence
## Recommended Actions
### Immediate (Fix Today)
1. **Enable Debug Build**
```bash
make clean
make CFLAGS="-O1 -g" bench_random_mixed_hakmem
./bench_random_mixed_hakmem 10000 256 42
```
Expected: Boundary violation abort with detailed log
2. **Add P0-specific logging** ✅
```bash
HAKMEM_TINY_REFILL_FAILFAST=1 ./bench_random_mixed_hakmem 10000 256 42
```
Note: Already tested, but release build disabled guards
3. **Check counter definitions**:
```bash
nm bench_random_mixed_hakmem | grep "g_rf_freelist_items\|g_rf_carve_items"
```
### Short-term (This Week)
1. **Fix Bug #1**: Make guards work in release builds
- Change `HAKMEM_BUILD_RELEASE` check to allow runtime override
- Add `HAKMEM_TINY_REFILL_PARANOID=1` env var
2. **Investigate Bug #2**: Audit counter updates
- Trace all `meta->used` increments/decrements
- Trace all `ss->total_active_blocks` updates
- Verify they're independent or synchronized
3. **Test Hypothesis A**: Add explicit boundary check
```c
if (meta->carved + batch > meta->capacity) {
fprintf(stderr, "BOUNDARY VIOLATION!\n");
abort();
}
```
### Medium-term (Next Sprint)
1. **Comprehensive testing matrix**:
- P0 ON/OFF × Debug/Release × 1K/10K/100K iterations
- Test each class individually (class 0-7)
- MT testing (2/4/8 threads)
2. **Add stress tests**:
- Extreme batch sizes (want=256)
- Mixed allocation patterns
- Remote queue flooding
## Build Artifacts Verified
```bash
# P0 OFF build (successful)
$ ./bench_random_mixed_hakmem 100000 256 42
Throughput = 2341698 operations per second
# P0 ON build (crashes)
$ ./bench_random_mixed_hakmem 10000 256 42
[BATCH_CARVE] cls=6 slab=1 used=0 cap=128 batch=16 base=0x7ffff6e10000 bs=513
Segmentation fault (core dumped)
```
## Next Steps
1. ✅ Build fixed-up P0 with linker errors resolved
2. ✅ Confirm P0 is crash cause (OFF works, ON crashes)
3. 🔄 **IN PROGRESS**: Analyze P0 code for bugs
4. ⏭️ Build debug version to trigger guards
5. ⏭️ Fix identified bugs
6. ⏭️ Validate with full test suite
## Files Modified for Build Fix
To make P0 compile, I added conditional compilation to route between `sll_refill_small_from_ss` (P0 OFF) and `sll_refill_batch_from_ss` (P0 ON):
1. `core/hakmem_tiny.c:182-192` - Forward declaration
2. `core/hakmem_tiny.c:1232-1236` - Pre-warm call
3. `core/tiny_alloc_fast.inc.h:69-74` - External declaration
4. `core/tiny_alloc_fast.inc.h:383-387` - Refill call
5. `core/hakmem_tiny_alloc.inc:157-161, 196-200, 229-233` - Three refill calls
6. `core/hakmem_tiny_ultra_simple.inc:70-74` - Refill call
7. `core/hakmem_tiny_metadata.inc:113-117` - Refill call
All locations now use `#if HAKMEM_TINY_P0_BATCH_REFILL` to choose the correct function.
---
**Report Generated**: 2025-11-09 21:35 UTC
**Investigator**: Claude Task Agent (Ultrathink Mode)
**Status**: Root cause analysis complete, awaiting debug build test