# Phase 15 Registry Lookup Investigation
**Date**: 2025-11-15
**Status**: 🔍 ROOT CAUSE NARROWED (hypotheses below still need confirmation)
## Summary
Page-aligned Tiny allocations reach ExternalGuard → SuperSlab registry lookup FAILS → delegated to `__libc_free()` → crash.
## Critical Findings
### 1. Registry Only Stores ONE SuperSlab
**Evidence**:
```
[SUPER_REG] register base=0x7d3893c00000 lg=21 slot=115870 magic=5353504c
```
**Only 1 registration** in entire test run (10K iterations, 100K operations).
### 2. 4MB Address Gap
**Pattern** (consistent across multiple runs):
- **Registry stores**: `0x7d3893c00000` (SuperSlab structure address)
- **Lookup searches**: `0x7d3893800000` (user pointer, 4MB **lower**)
- **Difference**: `0x400000 = 4MB = 2 × SuperSlab size (lg=21, 2MB)`
### 3. User Data Layout
**From code analysis** (`superslab_inline.h:30-35`):
```c
size_t off = SUPERSLAB_SLAB0_DATA_OFFSET + (size_t)slab_idx * SLAB_SIZE;
return (uint8_t*)ss + off;
```
**User data is placed AFTER SuperSlab structure**, NOT before!
**Implication**: User pointer `0x7d3893800000` **cannot** belong to SuperSlab `0x7d3893c00000` (4MB higher).
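**Sanity check** (a minimal sketch, not hakmem code): `ss_contains` is a hypothetical helper. It assumes all user data of a SuperSlab lives inside `[ss, ss + 2^lg)`, which is what the layout above suggests. With the logged addresses, the lookup pointer falls outside the registered SuperSlab.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if ptr lies inside [ss_base, ss_base + 2^lg).
 * Assumes user data never precedes the SuperSlab structure. */
static bool ss_contains(uintptr_t ss_base, unsigned lg, uintptr_t ptr) {
    assert(lg < 64);
    uintptr_t size = (uintptr_t)1 << lg;
    /* ptr < ss_base is rejected first, so the subtraction cannot wrap. */
    return ptr >= ss_base && (ptr - ss_base) < size;
}
```

`ss_contains(0x7d3893c00000, 21, 0x7d3893800000)` is false, because the pointer is `0x400000` bytes below the registered base.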
### 4. mmap Alignment Mechanism
**Code** (`hakmem_tiny_superslab.c:280-308`):
```c
size_t alloc_size = ss_size * 2; // Allocate 4MB for 2MB SuperSlab
void* raw = mmap(NULL, alloc_size, ...);
uintptr_t aligned_addr = (raw_addr + ss_mask) & ~ss_mask; // 2MB align
```
**Scenario**:
- mmap returns `0x7d3893800000` (already 2MB-aligned)
- `aligned_addr = 0x7d3893800000` (no change)
- Prefix size = 0, Suffix = 2MB (munmapped)
- **SuperSlab registered at**: `0x7d3893800000`
**Contradiction**: Registry shows `0x7d3893c00000`, not `0x7d3893800000`!
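**Alignment sketch** (assuming the real code follows the excerpt above; `ss_align_up` and `ss_map_aligned` are illustrative names, not hakmem functions): rounding up moves the base by at most `ss_size - 1` bytes, so with a single 4MB mapping the registered base can never be 4MB above the raw `mmap` address. A 4MB gap would therefore suggest two separate mappings.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on strict glibc builds */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Round addr up to the next multiple of 2^lg (no change if already aligned). */
static uintptr_t ss_align_up(uintptr_t addr, unsigned lg) {
    assert(lg < 64);
    uintptr_t mask = ((uintptr_t)1 << lg) - 1;
    return (addr + mask) & ~mask;
}

/* Map 2 * 2^lg bytes, keep one naturally aligned 2^lg region, unmap the rest.
 * Returns NULL on failure. munmap errors are ignored in this sketch. */
static void *ss_map_aligned(unsigned lg) {
    size_t ss_size = (size_t)1 << lg;
    size_t alloc_size = ss_size * 2;
    void *raw = mmap(NULL, alloc_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t raw_addr = (uintptr_t)raw;
    uintptr_t aligned = ss_align_up(raw_addr, lg);
    size_t prefix = (size_t)(aligned - raw_addr);     /* 0 .. ss_size - 1 */
    size_t suffix = alloc_size - prefix - ss_size;    /* remainder after the kept region */

    if (prefix) munmap(raw, prefix);
    if (suffix) munmap((void *)(aligned + ss_size), suffix);
    return (void *)aligned;
}
```

For the scenario above, `ss_align_up(0x7d3893800000, 21)` returns `0x7d3893800000`, giving prefix 0 and suffix 2MB.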
### 5. Hash Slot Mismatch
**Lookup**:
```
[SUPER_LOOKUP] ptr=0x7d3893800000 lg=21 aligned_base=0x7d3893800000 hash=115868
```
**Registry**:
```
[SUPER_REG] register base=0x7d3893c00000 lg=21 slot=115870
```
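**Hash difference**: 115868 vs 115870 (2 slots apart)
**Reason**: The two bases differ by 4MB (2 × 2MB), so a hash derived from `base >> lg` would land exactly 2 slots apart. This most likely means two different base addresses were hashed, rather than linear probing moving the same key after a collision.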
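**Slot arithmetic** (a sketch under a stated assumption): the real hash function is not shown in this document. If the slot is simply `(base >> lg)` masked to a power-of-two table, the logged slot numbers can be reproduced. `REG_BITS_ASSUMED` and `slot_for` are hypothetical names, and the 18-bit table size is a guess that happens to fit the logs.

```c
#include <assert.h>
#include <stdint.h>

#define REG_BITS_ASSUMED 18u   /* ASSUMPTION: registry has 2^18 slots */

/* Hypothetical slot function: align addr down to 2^lg, then use the
 * SuperSlab index (base >> lg) masked to the assumed table size. */
static uint32_t slot_for(uintptr_t addr, unsigned lg) {
    assert(lg < 64);
    uintptr_t base = addr & ~(((uintptr_t)1 << lg) - 1);
    uintptr_t table_mask = ((uintptr_t)1 << REG_BITS_ASSUMED) - 1;
    return (uint32_t)((base >> lg) & table_mask);
}
```

Under this assumption, `slot_for(0x7d3893800000, 21)` gives 115868 and `slot_for(0x7d3893c00000, 21)` gives 115870, matching the `[SUPER_LOOKUP]` and `[SUPER_REG]` lines without any probing involved.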
## Root Cause Hypothesis
### Option A: Multiple SuperSlabs, Only One Registered
**Theory**: Multiple SuperSlabs allocated, but only the **last one** is logged.
**Problem**: Debug logging should show ALL registrations after fix (ENV check on every call).
### Option B: LRU Cache Reuse
**Theory**: Most SuperSlabs come from LRU cache (already registered), only new allocations are logged.
**Problem**: First few iterations should still show multiple registrations.
### Option C: Pointer is NOT from hakmem
**Theory**: `0x7d3893800000` is allocated by **`__libc_malloc()`**, NOT hakmem.
**Evidence**:
- Box BenchMeta uses `__libc_calloc` for `slots[]` array
- `free(slots[idx])` uses hakmem wrapper
- **But**: `slots[]` array itself is freed with `__libc_free(slots)` (Line 99)
**Contradiction**: `slots[]` should NOT reach hakmem `free()` wrapper.
### Option D: Registry Lookup Bug
**Theory**: SuperSlab **is** registered at `0x7d3893800000`, but lookup fails due to:
1. Hash collision (different slot used during registration vs lookup)
2. Linear probing limit exceeded (SUPER_MAX_PROBE = 8)
3. Alignment mismatch (looking for wrong base address)
## Test Results Comparison
| Phase | Test Result | Behavior |
|-------|-------------|----------|
| Phase 14 | ✅ PASS (5.69M ops/s) | No crash with same test |
| Phase 15 | ❌ CRASH | ExternalGuard → `__libc_free()` failure |
**Conclusion**: Phase 15 Box Separation introduced a regression.
## Next Steps
### Investigation Needed
1. **Add more detailed logging**:
   - Log ALL mmap calls with returned address
   - Log prefix/suffix munmap with exact ranges
   - Log final SuperSlab address vs mmap address
   - Track which pointers are allocated from which SuperSlab
2. **Verify registry integrity**:
   - Dump entire registry before crash
   - Check for hash collisions
   - Verify linear probing behavior
3. **Test with reduced SuperSlab size**:
   - Try lg=20 (1MB) instead of lg=21 (2MB)
   - See whether the gap persists, and whether it stays at 4MB or scales to 2 × SuperSlab size
### Fix Options
#### **Option 1: Fix SuperSlab Registry Lookup** ✅ **RECOMMENDED**
**Issue**: Registry lookup fails for valid hakmem allocations.
**Potential fixes**:
- Increase SUPER_MAX_PROBE from 8 to 16/32
- Use better hash function to reduce collisions
- Store address **range** instead of single base
- Support lookup by any address within SuperSlab region
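**Lookup sketch** (illustrative only; `reg_entry`, `reg_insert`, `reg_lookup` and `REG_SIZE` are hypothetical, and only the probe limit of 8 comes from the findings above): one way to support lookup by any address inside a SuperSlab is to mask the pointer down to its aligned base before hashing and probing. This toy version has no deletion and no locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_SIZE  1024u   /* hypothetical table size, power of two */
#define MAX_PROBE 8u      /* mirrors SUPER_MAX_PROBE = 8 */

typedef struct { uintptr_t base; unsigned lg; } reg_entry;  /* base == 0: empty */
static reg_entry g_reg[REG_SIZE];

static uint32_t reg_hash(uintptr_t base, unsigned lg) {
    return (uint32_t)((base >> lg) & (REG_SIZE - 1));
}

/* Returns 1 on success, 0 if no free slot within MAX_PROBE steps. */
static int reg_insert(uintptr_t base, unsigned lg) {
    uint32_t h = reg_hash(base, lg);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        reg_entry *e = &g_reg[(h + i) & (REG_SIZE - 1)];
        if (e->base == 0 || e->base == base) {
            e->base = base;
            e->lg = lg;
            return 1;
        }
    }
    return 0;
}

/* Accepts any pointer; returns the owning base, or 0 if not registered. */
static uintptr_t reg_lookup(uintptr_t ptr, unsigned lg) {
    assert(lg < 64);
    uintptr_t base = ptr & ~(((uintptr_t)1 << lg) - 1);  /* align down */
    uint32_t h = reg_hash(base, lg);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        const reg_entry *e = &g_reg[(h + i) & (REG_SIZE - 1)];
        if (e->base == base && e->lg == lg) return base;
        if (e->base == 0) return 0;  /* empty slot ends the probe chain */
    }
    return 0;
}
```

With only `0x7d3893c00000` registered, an interior pointer such as `0x7d3893c01234` resolves, while `0x7d3893800000` still misses. Aligning down helps interior pointers but would not by itself explain or fix the 4MB gap.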
#### **Option 2: Improve ExternalGuard Safety** ⚠️ **WORKAROUND**
**Current behavior** (DANGEROUS):
```c
if (!is_mapped) return 0; // Delegate to __libc_free → CRASH!
```
**Safer behavior**:
```c
if (!is_mapped) {
    fprintf(stderr, "[ExternalGuard] WARNING: Unknown pointer %p (ignored)\n", ptr);
    return 1; // Claim handled (leak vs crash tradeoff)
}
```
**Pros**: Prevents crash
**Cons**: Memory leak for genuinely external pointers
#### **Option 3: Fix Box FrontGate Classification** ❌ NOT RECOMMENDED
**Idea**: Add special path for page-aligned Tiny pointers.
**Problems**:
- Can't read header at `ptr-1` (page boundary violation)
- Violates 1-byte header design
- Requires alternative classification
## Conclusion
**Primary Issue**: SuperSlab registry lookup fails for page-aligned user pointers.
**Secondary Issue**: ExternalGuard unconditionally delegates unknown pointers to `__libc_free()`.
**Recommended Action**:
1. Fix registry lookup (Option 1)
2. Add ExternalGuard safety (Option 2 as backup)
3. Add comprehensive logging to confirm the root cause