Files
hakmem/docs/analysis/SEGFAULT_INVESTIGATION_REPORT.md

337 lines
10 KiB
Markdown
Raw Normal View History

Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization) ## Changes ### 1. core/page_arena.c - Removed init failure message (lines 25-27) - error is handled by returning early - All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks ### 2. core/hakmem.c - Wrapped SIGSEGV handler init message (line 72) - CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs ### 3. core/hakmem_shared_pool.c - Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE: - Node pool exhaustion warning (line 252) - SP_META_CAPACITY_ERROR warning (line 421) - SP_FIX_GEOMETRY debug logging (line 745) - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865) - SP_ACQUIRE_STAGE0_L0 debug logging (line 803) - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922) - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996) - SP_ACQUIRE_STAGE3 debug logging (line 1116) - SP_SLOT_RELEASE debug logging (line 1245) - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305) - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316) - Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized ## Performance Validation Before: 51M ops/s (with debug fprintf overhead) After: 49.1M ops/s (consistent performance, fprintf removed from hot paths) ## Build & Test ```bash ./build.sh larson_hakmem ./out/release/larson_hakmem 1 5 1 1000 100 10000 42 # Result: 49.1M ops/s ``` Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00
# SEGFAULT Investigation Report - bench_random_mixed & bench_mid_large_mt
**Date**: 2025-11-07
**Status**: ✅ ROOT CAUSE IDENTIFIED
**Priority**: CRITICAL
---
## Executive Summary
**Problem**: `bench_random_mixed_hakmem` and `bench_mid_large_mt_hakmem` crash with SEGV (exit 139) when direct-linked, but work fine with LD_PRELOAD.
**Root Cause**: **SuperSlab registry lookup failures** cause headerless tiny allocations to be misidentified as having HAKMEM headers during free(), leading to:
1. Invalid memory reads at `ptr - HEADER_SIZE` → SEGV
2. Memory leaks when `g_invalid_free_mode=1` skips frees
3. Eventual memory exhaustion or corruption
**Why LD_PRELOAD Works**: LD_PRELOAD defaults to `g_invalid_free_mode=0` (fallback to libc), which masks the issue by routing failed frees to `__libc_free()`.
**Why Direct-Link Crashes**: Direct-link defaults to `g_invalid_free_mode=1` (skip invalid frees), which silently leaks memory until exhaustion.
---
## Reproduction
### Crashes (Direct-Link)
```bash
./bench_random_mixed_hakmem 50000 2048 123
# → Segmentation fault (exit 139)
./bench_mid_large_mt_hakmem 4 40000 2048 42
# → Segmentation fault (exit 139)
```
**Error Output**:
```
[hakmem] ERROR: Invalid magic 0x0 (expected 0x48414B4D)
[hakmem] ERROR: Invalid magic 0x0 (expected 0x48414B4D)
... (hundreds of errors)
free(): invalid pointer
Segmentation fault (core dumped)
```
### Works Fine (LD_PRELOAD)
```bash
LD_PRELOAD=./libhakmem_asan.so ./bench_random_mixed_system 200000 4096 1234567
# → 5.7M ops/s ✅
```
### Crash Threshold
- **Small workloads**: ≤20K ops with 512 slots → Works
- **Large workloads**: ≥25K ops with 2048 slots → Crashes immediately
- **Pattern**: Scales with working set size (more live objects = more failures)
---
## Technical Analysis
### 1. Allocation Flow (Working)
```
malloc(size) [size ≤ 1KB]
hak_alloc_at(size)
hak_tiny_alloc_fast_wrapper(size)
tiny_alloc_fast(size)
↓ [TLS freelist miss]
hak_tiny_alloc_slow(size)
hak_tiny_alloc_superslab(class_idx)
✅ Returns pointer WITHOUT header (SuperSlab allocation)
```
### 2. Free Flow (Broken)
```
free(ptr)
hak_free_at(ptr, 0, site)
[SS-first free path] hak_super_lookup(ptr)
↓ ❌ Lookup FAILS (should succeed!)
[Fallback] Try mid/L25 lookup → Fails
[Fallback] Header dispatch:
void* raw = (char*)ptr - HEADER_SIZE; // ← ptr has NO header!
AllocHeader* hdr = (AllocHeader*)raw; // ← Invalid pointer
if (hdr->magic != HAKMEM_MAGIC) { // ← ⚠️ SEGV or reads 0x0
// g_invalid_free_mode = 1 (direct-link)
goto done; // ← ❌ MEMORY LEAK!
}
```
**Key Bug**: When SuperSlab lookup fails for a tiny allocation, the code assumes there's a HAKMEM header and tries to read it. But tiny allocations are **headerless**, so this reads invalid memory.
### 3. Why SuperSlab Lookup Fails
Based on testing:
```bash
# Default (crashes with "Invalid magic 0x0")
./bench_random_mixed_hakmem 25000 2048 123
# → Hundreds of "Invalid magic" errors
# With SuperSlab explicitly enabled (no "Invalid magic" errors, but still SEGVs)
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 25000 2048 123
# → SEGV without "Invalid magic" errors
```
**Hypothesis**: When `HAKMEM_TINY_USE_SUPERSLAB` is not explicitly set, there may be a code path where:
1. Tiny allocations succeed (from some non-SuperSlab path)
2. But they're not registered in the SuperSlab registry
3. So lookups fail during free
**Possible causes**:
- **Configuration bug**: `g_use_superslab` may be uninitialized or overridden
- **TLS allocation path**: There may be a TLS-only allocation path that bypasses SuperSlab
- **Magazine/HotMag path**: Allocations from magazine layers might not come from SuperSlab
- **Registry capacity**: Registry might be full (unlikely with SUPER_REG_SIZE=262144)
### 4. Direct-Link vs LD_PRELOAD Behavior
**LD_PRELOAD** (`hak_core_init.inc.h:147-164`):
```c
if (ldpre && strstr(ldpre, "libhakmem.so")) {
g_ldpreload_mode = 1;
g_invalid_free_mode = 0; // ← Fallback to libc
}
```
- Defaults to `g_invalid_free_mode=0` (fallback mode)
- Invalid frees → `__libc_free(ptr)`**masks the bug** (may work if ptr was originally from libc)
**Direct-Link**:
```c
else {
g_invalid_free_mode = 1; // ← Skip invalid frees
}
```
- Defaults to `g_invalid_free_mode=1` (skip mode)
- Invalid frees → `goto done`**silent memory leak**
- Accumulated leaks → memory exhaustion → SEGV
---
## GDB Analysis
### Backtrace
```
Thread 1 "bench_random_mi" received signal SIGSEGV, Segmentation fault.
0x000055555555eb40 in free ()
#0 0x000055555555eb40 in free ()
#1 0xffffffffffffffff in ?? ()
...
#8 0x00005555555587e1 in main ()
Registers:
rax 0x555556c9d040 (some address)
rbp 0x7ffff6e00000 (pointer being freed - page-aligned!)
rdi 0x0 (NULL!)
rip 0x55555555eb40 <free+2176>
```
### Disassembly at Crash Point (free+2176)
```asm
0xab40 <+2176>: mov -0x28(%rbp),%ecx # Load header magic
0xab43 <+2179>: cmp $0x48414B4D,%ecx # Compare with HAKMEM_MAGIC
0xab49 <+2185>: je 0xabd0 <free+2320> # Jump if magic matches
```
**Key observation**:
- `rbp = 0x7ffff6e00000` (page-aligned, likely start of mmap region)
- Trying to read from `rbp - 0x28 = 0x7ffff6dffffd8`
- If this is at page boundary, reading before the page causes SEGV
---
## Proposed Fix
### Option A: Safe Header Read (Recommended)
Add a safety check before reading the header:
```c
// hak_free_api.inc.h, line 78-88 (header dispatch)
// BEFORE: Unsafe header read
void* raw = (char*)ptr - HEADER_SIZE;
AllocHeader* hdr = (AllocHeader*)raw;
if (hdr->magic != HAKMEM_MAGIC) { ... }
// AFTER: Safe fallback for tiny allocations
// If SuperSlab lookup failed for a tiny-sized allocation,
// assume it's an invalid free or was already freed
{
// Check if this could be a tiny allocation (size ≤ 1KB)
// Heuristic: If SuperSlab/Mid/L25 lookup all failed, and we're here,
// either it's a libc allocation with header, or a leaked tiny allocation
// Try to safely read header magic
void* raw = (char*)ptr - HEADER_SIZE;
AllocHeader* hdr = (AllocHeader*)raw;
// If magic is valid, proceed with header dispatch
if (hdr->magic == HAKMEM_MAGIC) {
// Header exists, dispatch normally
if (HAK_ENABLED_CACHE(HAKMEM_FEATURE_BIGCACHE) && hdr->class_bytes >= 2097152) {
if (hak_bigcache_put(ptr, hdr->size, hdr->alloc_site)) goto done;
}
switch (hdr->method) {
case ALLOC_METHOD_MALLOC: __libc_free(raw); break;
case ALLOC_METHOD_MMAP: /* ... */ break;
// ...
}
} else {
// Invalid magic - could be:
// 1. Tiny allocation where SuperSlab lookup failed
// 2. Already freed pointer
// 3. Pointer from external library
if (g_invalid_free_log) {
fprintf(stderr, "[hakmem] WARNING: free() of pointer %p with invalid magic 0x%X (expected 0x%X)\n",
ptr, hdr->magic, HAKMEM_MAGIC);
fprintf(stderr, "[hakmem] Possible causes: tiny allocation lookup failure, double-free, or external pointer\n");
}
// In direct-link mode, do NOT leak - try to return to tiny pool
// as a best-effort recovery
if (!g_ldpreload_mode) {
// Attempt to route to tiny free (may succeed if it's a valid tiny allocation)
hak_tiny_free(ptr); // Will validate internally
} else {
// LD_PRELOAD mode: fallback to libc (may be mixed allocation)
if (g_invalid_free_mode == 0) {
__libc_free(ptr); // Not raw! ptr itself
}
}
}
}
goto done;
```
### Option B: Fix SuperSlab Lookup Root Cause
Investigate why SuperSlab lookups are failing:
1. **Add comprehensive logging**:
```c
// At allocation time
fprintf(stderr, "[ALLOC_DEBUG] ptr=%p class=%d from_superslab=%d\n",
ptr, class_idx, from_superslab);
// At free time
SuperSlab* ss = hak_super_lookup(ptr);
fprintf(stderr, "[FREE_DEBUG] ptr=%p lookup=%p magic=%llx\n",
ptr, ss, ss ? ss->magic : 0);
```
2. **Check TLS allocation paths**:
- Verify all paths through `tiny_alloc_fast_pop()` come from SuperSlab
- Check if magazine/HotMag allocations are properly registered
- Verify TLS SLL allocations are from registered SuperSlabs
3. **Verify registry initialization**:
```c
// At startup
fprintf(stderr, "[INIT] g_super_reg_initialized=%d g_use_superslab=%d\n",
g_super_reg_initialized, g_use_superslab);
```
### Option C: Force SuperSlab Path
Simplify the allocation path to always use SuperSlab:
```c
// Disable competing paths that might bypass SuperSlab
g_hotmag_enable = 0; // Disable HotMag
g_tls_list_enable = 0; // Disable TLS List
g_tls_sll_enable = 1; // Enable TLS SLL (SuperSlab-backed)
```
---
## Immediate Workaround
For users hitting this bug:
```bash
# Workaround 1: Use LD_PRELOAD (masks the issue)
LD_PRELOAD=./libhakmem.so your_benchmark
# Workaround 2: Force SuperSlab (may still crash, but different symptoms)
HAKMEM_TINY_USE_SUPERSLAB=1 ./your_benchmark
# Workaround 3: Disable tiny allocator (fallback to libc)
HAKMEM_WRAP_TINY=0 ./your_benchmark
```
---
## Next Steps
1. **Implement Option A (Safe Header Read)** - Immediate fix to prevent SEGV
2. **Add logging to identify root cause** - Why are SuperSlab lookups failing?
3. **Fix underlying issue** - Ensure all tiny allocations are SuperSlab-backed
4. **Add regression tests** - Prevent future breakage
---
## Files to Modify
1. `/mnt/workdisk/public_share/hakmem/core/box/hak_free_api.inc.h` - Lines 78-120 (header dispatch logic)
2. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny.c` - Add allocation path logging
3. `/mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast.inc.h` - Verify SuperSlab usage
4. `/mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c` - Add lookup diagnostics
---
## Related Issues
- **Phase 6-2.3**: Active counter bug fix (freed blocks not tracked)
- **Sanitizer Fix**: Similar TLS initialization ordering issues
- **LD_PRELOAD vs Direct-Link**: Behavioral differences in error handling
---
## Verification
After fix, verify:
```bash
# Should complete without errors
./bench_random_mixed_hakmem 50000 2048 123
./bench_mid_large_mt_hakmem 4 40000 2048 42
# Should see no "Invalid magic" errors
HAKMEM_INVALID_FREE_LOG=1 ./bench_random_mixed_hakmem 50000 2048 123
```