hakmem/PHASE15_BUG_ROOT_CAUSE_FINAL.md
Moe Charm (CI) 880ea511c8 Phase 15: Root cause analysis - Page-aligned Tiny allocations SuperSlab registry lookup failure
ROOT CAUSE IDENTIFIED:
- Page-aligned Tiny allocations (user_ptr at 0x...000) can occur mathematically
- Box FrontGate correctly classifies as MIDCAND (can't read header at ptr-1)
- MIDCAND routing → SuperSlab registry lookup returns NULL
- ExternalGuard → __libc_free() → crash

EVIDENCE:
- Phase 14: Same test passes (5.69M ops/s)
- Phase 15: Crashes at ExternalGuard delegation
- System malloc: Never returns page-aligned pointers for 16-1040B (ptr IS hakmem)

RECOMMENDED FIX:
- Option 1: Fix SuperSlab registry lookup (primary)
- Option 3: Improve ExternalGuard safety (backup)

NEXT: Add SuperSlab registry debug logging to track allocation/registration timing
2025-11-15 23:07:56 +09:00

# Phase 15 Bug - Root Cause Analysis (FINAL)
**Date**: 2025-11-15
**Status**: ROOT CAUSE IDENTIFIED ✅
## Summary
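Page-aligned Tiny allocations (user pointers ending in `0x...000`) miss the SuperSlab registry lookup, fall through to ExternalGuard, and are handed to `__libc_free()`, which aborts with `free(): invalid pointer`.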
## Evidence
### Phase 14 vs Phase 15 Behavior
| Phase | Test Result | Behavior |
|-------|-------------|----------|
| Phase 14 | ✅ PASS (5.69M ops/s) | No crash with same test |
| Phase 15 | ❌ CRASH | ExternalGuard → `__libc_free()` failure |
### Crash Pattern
```
[ExternalGuard] ptr=0x706c21a00000 offset_in_page=0x0 (page-aligned!)
[ExternalGuard] hak_super_lookup(ptr) = (nil) ← SuperSlab registry: NOT FOUND
[ExternalGuard] FrontGate classification: domain=MIDCAND
[ExternalGuard] ptr=0x706c21a00000 delegated to __libc_free
free(): invalid pointer ← CRASH
```
## Root Cause
### 1. Page-Aligned Tiny Allocations Exist
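**Proof** (mathematical):
- Block stride = user_size + 1 (with 1-byte header)
- Example: 257B stride (class 5)
- Carved pointer: `base + (carved_index × 257)`
- User pointer: `carved_ptr + 1`
- For page-aligned user_ptr (assuming `base` itself is page-aligned): `(n × 257) mod 4096 == 4095`
- Since gcd(257, 4096) = 1, **a solution exists**: the smallest is n = 255, because 255 × 257 = 65535 = 16 × 4096 - 1 (reachable once a slab holds at least 256 blocks)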
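The congruence above can be checked by brute force. The following is a minimal sketch, assuming a 4096-byte page and a page-aligned slab base; `first_page_aligned_index` is a hypothetical helper written for this note, not a hakmem function.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Smallest carve index n for which the block at (base + n * stride) has its
 * user pointer (block + 1) on a page boundary, assuming base is page-aligned.
 * Searching one full period of DEMO_PAGE_SIZE indices is sufficient because
 * (n * stride) mod DEMO_PAGE_SIZE repeats after that.
 * Returns -1 when no index satisfies the congruence (e.g. even strides). */
static long first_page_aligned_index(uint32_t stride)
{
    for (uint32_t n = 0; n < DEMO_PAGE_SIZE; n++) {
        if ((n * stride) % DEMO_PAGE_SIZE == DEMO_PAGE_SIZE - 1)
            return (long)n;
    }
    return -1;
}
```

For the 257B stride this returns 255. An even stride such as 256 has no solution, since every multiple of it is even and 4095 is odd.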
**Allocation flow**:
```c
// hakmem_tiny.c:160-163
#define HAK_RET_ALLOC(cls, base_ptr) do { \
    *(uint8_t*)(base_ptr) = HEADER_MAGIC | ((cls) & HEADER_CLASS_MASK); \
    return (void*)((uint8_t*)(base_ptr) + 1); /* returns user_ptr */ \
} while(0)
```
If `base_ptr = 0x...FFF`, then `user_ptr = 0x...000` (PAGE-ALIGNED!).
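To make the boundary case concrete, the sketch below places a block base on the last byte of a page and applies the same header-then-offset step. The `DEMO_HEADER_*` values are placeholders (the real `HEADER_MAGIC` and `HEADER_CLASS_MASK` are not shown in this document), and `demo_ret_alloc` is a function-style stand-in for the macro.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE         4096u
#define DEMO_HEADER_MAGIC      0xA0u  /* placeholder, not the real HEADER_MAGIC */
#define DEMO_HEADER_CLASS_MASK 0x0Fu  /* placeholder, not the real mask */

/* Function-style stand-in for HAK_RET_ALLOC: write the 1-byte header at
 * base_ptr, then hand out the byte after it as the user pointer. */
static void *demo_ret_alloc(unsigned cls, void *base_ptr)
{
    *(uint8_t *)base_ptr =
        (uint8_t)(DEMO_HEADER_MAGIC | (cls & DEMO_HEADER_CLASS_MASK));
    return (uint8_t *)base_ptr + 1;
}

/* Returns 1 when a base at page offset 0xFFF yields a page-aligned user
 * pointer whose header byte lives on the previous page, -1 on allocation
 * failure, 0 otherwise. */
static int demo_header_straddles_page(void)
{
    uint8_t *region = aligned_alloc(DEMO_PAGE_SIZE, 2 * DEMO_PAGE_SIZE);
    if (region == NULL)
        return -1;

    uint8_t *base = region + DEMO_PAGE_SIZE - 1;      /* offset 0xFFF */
    uint8_t *user = demo_ret_alloc(5, base);

    int user_aligned = ((uintptr_t)user & (DEMO_PAGE_SIZE - 1)) == 0;
    int header_ok    = user[-1] == (DEMO_HEADER_MAGIC | 5u);
    int other_page   = ((uintptr_t)(user - 1) / DEMO_PAGE_SIZE)
                     != ((uintptr_t)user / DEMO_PAGE_SIZE);

    free(region);
    return user_aligned && header_ok && other_page;
}
```

Here both pages belong to the same mapping, so reading `user[-1]` is safe. For an arbitrary pointer passed to `free()`, a classifier has no such guarantee.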
### 2. Box FrontGate Classifies as MIDCAND (Correct by Design)
**front_gate_v2.h:52-59**:
```c
// CRITICAL: Same-page guard (header must be in same page as ptr)
uintptr_t offset_in_page = (uintptr_t)ptr & 0xFFF;
if (offset_in_page == 0) {
    // Page-aligned pointer → no header in same page → must be MIDCAND
    result.domain = FG_DOMAIN_MIDCAND;
    return result;
}
```
**Reason**: Reading header at `ptr-1` would cross page boundary (unsafe).
**Result**: Page-aligned Tiny allocations → classified as MIDCAND ✅
### 3. MIDCAND Routing → SuperSlab Registry Lookup FAILS
**hak_free_api.inc.h** MIDCAND path:
1. Mid registry lookup → NULL (not Mid allocation)
2. L25 registry lookup → NULL (not L25 allocation)
3. **SuperSlab registry lookup** → **NULL** ❌ (BUG!)
4. ExternalGuard → `__libc_free()` → crash
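The fall-through order in the list above can be sketched as a chain of lookups. This is an illustration only: `route_midcand`, the `route_t` values, and the callback type are hypothetical names, and the real `hak_free_api.inc.h` code is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routing outcomes for a MIDCAND pointer. */
typedef enum {
    ROUTE_MID,
    ROUTE_L25,
    ROUTE_SUPERSLAB,
    ROUTE_EXTERNAL   /* nothing claimed the pointer: ExternalGuard takes over */
} route_t;

/* Hypothetical lookup signature: non-NULL means "this registry owns ptr". */
typedef void *(*lookup_fn)(void *ptr);

/* Try each registry in the documented order; the first hit wins. */
static route_t route_midcand(void *ptr, lookup_fn mid, lookup_fn l25,
                             lookup_fn super)
{
    if (mid(ptr) != NULL)   return ROUTE_MID;
    if (l25(ptr) != NULL)   return ROUTE_L25;
    if (super(ptr) != NULL) return ROUTE_SUPERSLAB;
    return ROUTE_EXTERNAL;
}

/* Stub lookups used to exercise the chain. */
static void *lookup_miss(void *ptr) { (void)ptr; return NULL; }
static void *lookup_hit(void *ptr)  { return ptr; }
```

With three misses the pointer reaches `ROUTE_EXTERNAL`, which corresponds to step 4. A SuperSlab hit in the third position would have stopped the chain before ExternalGuard.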
**Why SuperSlab lookup fails**:
**Theory A**: Pointer is NOT from hakmem
- **REJECTED**: A test against system malloc produced no page-aligned pointers for 16-1040B requests, so the crashing pointer must come from hakmem
**Theory B**: SuperSlab is not registered
- **LIKELY**: Race condition, registry exhaustion, or allocation before registration
**Theory C**: Registry lookup bug
- **POSSIBLE**: Hash collision, probe limit, or alignment mismatch
### 4. Why Phase 14 Works but Phase 15 Doesn't
**Phase 14**: Old classification system (no Box FrontGate/ExternalGuard)
- Uses different routing logic
- May have headerless path for page-aligned pointers
- Different SuperSlab lookup implementation?
**Phase 15**: New Box architecture
- Box FrontGate → classifies page-aligned as MIDCAND
- Box routing → SuperSlab lookup
- Box ExternalGuard → delegates to `__libc_free()` → **CRASH**
## Fix Options
### Option 1: Fix SuperSlab Registry Lookup ✅ **RECOMMENDED**
**Issue**: `hak_super_lookup(0x706c21a00000)` returns NULL for valid hakmem allocation.
**Root cause options**:
1. SuperSlab not registered (allocation race)
2. Registry full/hash collision
3. Lookup alignment mismatch
**Investigation needed**:
- Add debug logging to `hak_super_register()` / `hak_super_lookup()`
- Check if SuperSlab exists for this pointer
- Verify registration happens before user pointer is returned
**Fix**: Ensure all SuperSlabs are properly registered before returning user pointers.
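One possible shape for the debug logging is sketched below as a toy registry. Everything here is an assumption made for illustration: the 2 MiB region size, the 64-slot linear-probe table, the log format, and the `demo_*` names. The real `hak_super_register()` / `hak_super_lookup()` signatures are not shown in this document. The sketch also assumes a 64-bit `uintptr_t`.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_SS_SIZE   ((uintptr_t)1 << 21)  /* assumed 2 MiB region size */
#define DEMO_REG_SLOTS 64u                   /* assumed table size */

static uintptr_t demo_registry[DEMO_REG_SLOTS];  /* 0 marks an empty slot */

/* Mask an address down to the start of its (assumed) aligned region. */
static uintptr_t demo_ss_base(uintptr_t addr)
{
    return addr & ~(DEMO_SS_SIZE - 1);
}

static unsigned demo_slot(uintptr_t base, unsigned probe)
{
    return (unsigned)(((base / DEMO_SS_SIZE) + probe) % DEMO_REG_SLOTS);
}

/* Register a region base and log where it landed. Returns 0 if the table is full. */
static int demo_super_register(uintptr_t base)
{
    assert(base != 0 && demo_ss_base(base) == base);
    for (unsigned i = 0; i < DEMO_REG_SLOTS; i++) {
        unsigned slot = demo_slot(base, i);
        if (demo_registry[slot] == 0 || demo_registry[slot] == base) {
            demo_registry[slot] = base;
            fprintf(stderr, "[SS-REG] register base=0x%" PRIxPTR
                    " slot=%u probes=%u\n", base, slot, i + 1);
            return 1;
        }
    }
    fprintf(stderr, "[SS-REG] register FAILED (full) base=0x%" PRIxPTR "\n", base);
    return 0;
}

/* Look up the region owning addr; log every miss with the masked base so a
 * mismatch between the registered base and the masked base is visible. */
static int demo_super_lookup(uintptr_t addr)
{
    uintptr_t base = demo_ss_base(addr);
    for (unsigned i = 0; i < DEMO_REG_SLOTS; i++) {
        unsigned slot = demo_slot(base, i);
        if (demo_registry[slot] == 0)
            break;                      /* end of probe chain */
        if (demo_registry[slot] == base)
            return 1;
    }
    fprintf(stderr, "[SS-REG] lookup MISS addr=0x%" PRIxPTR
            " masked_base=0x%" PRIxPTR "\n", addr, base);
    return 0;
}
```

Logging the masked base on a miss helps separate "never registered" from "registered under a different base". Under the assumed 2 MiB alignment, the crash pointer `0x706c21a00000` masks to itself, while `ptr-1` (where its header would sit) masks to `0x706c21800000`. Whether hakmem's real region size and masking behave this way has to be confirmed against the source.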
### Option 2: Add Page-Aligned Special Path in FrontGate ❌ NOT RECOMMENDED
**Idea**: Classify page-aligned Tiny pointers as TINY instead of MIDCAND.
**Problems**:
- Can't read header at `ptr-1` (page boundary violation)
- Would need alternative classification (size class lookup?)
- Violates Box FG design (1-byte header only)
### Option 3: Fix ExternalGuard Fallback ⚠️ WORKAROUND
**Idea**: ExternalGuard should NOT delegate unknown pointers to `__libc_free()`.
**Change**:
```c
// Before (BUG):
if (!is_mapped) return 0;  // Delegate to __libc_free (crashes!)

// After (FIX):
if (!is_mapped) {
    // Unknown pointer - log and return success (leak vs crash tradeoff)
    fprintf(stderr, "[ExternalGuard] WARNING: Unknown pointer %p (ignored)\n", ptr);
    return 1;  // Claim handled (prevent __libc_free crash)
}
```
**Cons**: Genuinely external pointers are never freed (memory leak), and the underlying registry bug stays hidden.
## Next Steps
1. **Add SuperSlab Registry Debug Logging**
- Log all `hak_super_register()` calls
- Log all `hak_super_lookup()` failures
- Track when `0x706c21a00000` is allocated and registered
2. **Verify Registration Timing**
- Ensure SuperSlab is registered BEFORE user pointers are returned
- Check for race conditions in allocation path
3. **Implement Fix Option 1**
- Fix SuperSlab registry lookup
- Verify with the 100K-iteration test
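A verification loop might look like the sketch below. It is not the project's actual test, which this document does not show. `count_page_aligned` is a hypothetical helper that allocates and frees in batches through whatever `malloc`/`free` is linked in (hakmem when preloaded) and counts the page-aligned pointers it sees. The batch size of 1024 is an assumption chosen so that carve index 255 can be reached while earlier blocks are still live.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { DEMO_BATCH = 1024 };  /* assumed batch size, large enough to pass index 255 */

/* Allocate `iterations` blocks of `size` bytes in batches, free them all, and
 * return how many user pointers were page-aligned. Returns -1 if malloc fails.
 * Every page-aligned pointer exercises the MIDCAND free path described above,
 * so the run completing without a crash is the actual pass criterion. */
static long count_page_aligned(size_t size, long iterations)
{
    void *batch[DEMO_BATCH];
    long hits = 0;

    for (long done = 0; done < iterations; done += DEMO_BATCH) {
        long n = iterations - done < DEMO_BATCH ? iterations - done : DEMO_BATCH;
        long got = 0;

        for (; got < n; got++) {
            batch[got] = malloc(size);
            if (batch[got] == NULL)
                break;
            if (((uintptr_t)batch[got] & 0xFFF) == 0)
                hits++;
        }
        for (long j = 0; j < got; j++)
            free(batch[j]);
        if (got < n)
            return -1;  /* allocation failure: everything already freed */
    }
    return hits;
}
```

Calling `count_page_aligned(256, 100000)` covers the 257B-stride class from the proof. The returned count depends on the allocator in use, so it is better logged than compared against a fixed number.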
## Conclusion
**Primary Bug**: SuperSlab registry lookup fails for page-aligned Tiny allocations.
**Secondary Bug**: ExternalGuard unconditionally delegates to `__libc_free()` (should handle unknown pointers safely).
**Recommended Fix**: Fix SuperSlab registry (Option 1) + improve ExternalGuard safety (Option 3 as backup).