hakmem/docs/status/PHASE15_REGISTRY_LOOKUP_INVESTIGATION.md


# Phase 15 Registry Lookup Investigation
**Date**: 2025-11-15
**Status**: 🔍 ROOT CAUSE IDENTIFIED
## Summary
Page-aligned Tiny allocations reach ExternalGuard → SuperSlab registry lookup FAILS → delegated to `__libc_free()` → crash.
## Critical Findings
### 1. Registry Only Stores ONE SuperSlab
**Evidence**:
```
[SUPER_REG] register base=0x7d3893c00000 lg=21 slot=115870 magic=5353504c
```
**Only 1 registration** in entire test run (10K iterations, 100K operations).
### 2. 4MB Address Gap
**Pattern** (consistent across multiple runs):
- **Registry stores**: `0x7d3893c00000` (SuperSlab structure address)
- **Lookup searches**: `0x7d3893800000` (user pointer, 4MB **lower**)
- **Difference**: `0x400000 = 4MB = 2 × SuperSlab size (lg=21, 2MB)`
### 3. User Data Layout
**From code analysis** (`superslab_inline.h:30-35`):
```c
size_t off = SUPERSLAB_SLAB0_DATA_OFFSET + (size_t)slab_idx * SLAB_SIZE;
return (uint8_t*)ss + off;
```
**User data is placed AFTER SuperSlab structure**, NOT before!
**Implication**: User pointer `0x7d3893800000` **cannot** belong to SuperSlab `0x7d3893c00000` (4MB higher).
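
The containment argument can be checked directly. The following is a minimal sketch, assuming a SuperSlab occupies the half-open range `[base, base + 2^lg)` and that user data never precedes the structure; the helper name `ptr_in_superslab` is hypothetical and does not appear in hakmem.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of hakmem): returns true if ptr lies
 * inside the 2^lg-byte region that starts at base. Assumes user data is
 * placed at or after the SuperSlab structure, never before it. */
static bool ptr_in_superslab(uint64_t base, unsigned lg, uint64_t ptr)
{
    uint64_t size = UINT64_C(1) << lg;
    /* Check ptr >= base first so the subtraction cannot wrap around. */
    return ptr >= base && (ptr - base) < size;
}
```

With the logged values, `ptr_in_superslab(0x7d3893c00000, 21, 0x7d3893800000)` is false because the pointer sits `0x400000` bytes below the registered base.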
### 4. mmap Alignment Mechanism
**Code** (`hakmem_tiny_superslab.c:280-308`):
```c
size_t alloc_size = ss_size * 2; // Allocate 4MB for 2MB SuperSlab
void* raw = mmap(NULL, alloc_size, ...);
uintptr_t aligned_addr = (raw_addr + ss_mask) & ~ss_mask; // 2MB align
```
**Scenario**:
- mmap returns `0x7d3893800000` (already 2MB-aligned)
- `aligned_addr = 0x7d3893800000` (no change)
- Prefix size = 0, Suffix = 2MB (munmapped)
- **SuperSlab registered at**: `0x7d3893800000`
**Contradiction**: Registry shows `0x7d3893c00000`, not `0x7d3893800000`!
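
To reason about which addresses this mechanism can produce, here is a self-contained sketch of the over-allocate-and-trim pattern. It is an illustration only, assuming anonymous private mappings and a power-of-two size; the function name `map_aligned` and the mmap flags are assumptions, not a copy of `hakmem_tiny_superslab.c`.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Sketch: map 2x the requested size, keep the size-aligned window,
 * and unmap the unused prefix and suffix. Returns NULL on failure.
 * size must be a power of two (e.g. 1 << 21 for a 2MB SuperSlab). */
static void *map_aligned(size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    size_t mask = size - 1;
    size_t alloc_size = size * 2;

    void *raw = mmap(NULL, alloc_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return NULL;

    uintptr_t raw_addr = (uintptr_t)raw;
    uintptr_t aligned = (raw_addr + mask) & ~(uintptr_t)mask;

    /* prefix + suffix always equals size; either one may be zero. */
    size_t prefix = aligned - raw_addr;
    size_t suffix = alloc_size - prefix - size;
    if (prefix)
        munmap(raw, prefix);
    if (suffix)
        munmap((void *)(aligned + size), suffix);
    return (void *)aligned;
}
```

Under this formula the aligned base is always less than `size` bytes above the raw address, so a mapping that starts at `0x7d3893800000` cannot yield a 2MB SuperSlab based at `0x7d3893c00000`. If the real code matches the excerpt, the two addresses would have to come from different `mmap` calls.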
### 5. Hash Slot Mismatch
**Lookup**:
```
[SUPER_LOOKUP] ptr=0x7d3893800000 lg=21 aligned_base=0x7d3893800000 hash=115868
```
**Registry**:
```
[SUPER_REG] register base=0x7d3893c00000 lg=21 slot=115870
```
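**Hash difference**: 115868 vs 115870 (2 slots apart)
**Reason**: The two bases differ by 4MB, i.e. two 2MB SuperSlab units, and both logged values match `(base >> 21) & 0x3FFFF` for their respective addresses. The offset therefore appears to come from the different base addresses rather than from a collision resolved by linear probing.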
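
The logged slot numbers can be checked against a simple shift-and-mask hash. This is a sketch under an assumption: the hash form `(base >> lg) & (table_size - 1)` and the table size of 2^18 slots were chosen because they reproduce the logged values, not because they were read from the hakmem source. The name `slot_for` is hypothetical.

```c
#include <stdint.h>

#define REG_SIZE (UINT32_C(1) << 18)  /* assumed table size, power of two */

/* Hypothetical hash: index the table by SuperSlab number (base >> lg). */
static uint32_t slot_for(uint64_t base, unsigned lg)
{
    return (uint32_t)(base >> lg) & (REG_SIZE - 1);
}
```

With `lg = 21`, `slot_for(0x7d3893800000, 21)` evaluates to 115868 and `slot_for(0x7d3893c00000, 21)` to 115870, matching the `[SUPER_LOOKUP]` and `[SUPER_REG]` lines. Two bases that are two SuperSlab units apart land two slots apart without any probing.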
## Root Cause Hypothesis
### Option A: Multiple SuperSlabs, Only One Registered
**Theory**: Multiple SuperSlabs allocated, but only the **last one** is logged.
**Problem**: Debug logging should show ALL registrations after fix (ENV check on every call).
### Option B: LRU Cache Reuse
**Theory**: Most SuperSlabs come from LRU cache (already registered), only new allocations are logged.
**Problem**: First few iterations should still show multiple registrations.
### Option C: Pointer is NOT from hakmem
**Theory**: `0x7d3893800000` is allocated by **`__libc_malloc()`**, NOT hakmem.
**Evidence**:
- Box BenchMeta uses `__libc_calloc` for `slots[]` array
- `free(slots[idx])` uses hakmem wrapper
- **But**: `slots[]` array itself is freed with `__libc_free(slots)` (Line 99)
**Contradiction**: `slots[]` should NOT reach hakmem `free()` wrapper.
### Option D: Registry Lookup Bug
**Theory**: SuperSlab **is** registered at `0x7d3893800000`, but lookup fails due to:
1. Hash collision (different slot used during registration vs lookup)
2. Linear probing limit exceeded (SUPER_MAX_PROBE = 8)
3. Alignment mismatch (looking for wrong base address)
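
The three failure modes above can be reproduced in a small model of an open-addressing registry. This is a hedged sketch, not the hakmem implementation: the hash, the table size, the entry layout, and the names `reg_insert` and `reg_lookup` are all assumptions; only the probe limit of 8 comes from the text (`SUPER_MAX_PROBE = 8`).

```c
#include <stdint.h>

#define REG_SIZE  (UINT32_C(1) << 18)  /* assumed table size */
#define MAX_PROBE 8                    /* mirrors SUPER_MAX_PROBE = 8 */

typedef struct { uint64_t base; unsigned lg; } reg_entry;
static reg_entry g_reg[REG_SIZE];      /* base == 0 marks an empty slot */

/* Assumed hash: SuperSlab number modulo the table size. */
static uint32_t reg_hash(uint64_t base, unsigned lg)
{
    return (uint32_t)(base >> lg) & (REG_SIZE - 1);
}

/* Returns the slot used, or -1 if all MAX_PROBE candidate slots are taken. */
static int reg_insert(uint64_t base, unsigned lg)
{
    uint32_t h = reg_hash(base, lg);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        uint32_t s = (h + i) & (REG_SIZE - 1);
        if (g_reg[s].base == 0) {
            g_reg[s].base = base;
            g_reg[s].lg = lg;
            return (int)s;
        }
    }
    return -1;  /* probe limit exceeded: the SuperSlab is never registered */
}

/* Masks ptr down to its candidate base, then probes. Returns slot or -1. */
static int reg_lookup(uint64_t ptr, unsigned lg)
{
    uint64_t base = ptr & ~((UINT64_C(1) << lg) - 1);
    uint32_t h = reg_hash(base, lg);
    for (uint32_t i = 0; i < MAX_PROBE; i++) {
        uint32_t s = (h + i) & (REG_SIZE - 1);
        if (g_reg[s].base == base && g_reg[s].lg == lg)
            return (int)s;
        if (g_reg[s].base == 0)
            return -1;  /* reached an empty slot: base was never stored */
    }
    return -1;
}
```

In this model a lookup fails in two ways: the masked base was never inserted (an alignment or base mismatch), or the insert itself was dropped because eight neighbouring slots were already occupied. Checking the return value of the registration call would distinguish the second case.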
## Test Results Comparison
| Phase | Test Result | Behavior |
|-------|-------------|----------|
| Phase 14 | ✅ PASS (5.69M ops/s) | No crash with same test |
| Phase 15 | ❌ CRASH | ExternalGuard → `__libc_free()` failure |
**Conclusion**: Phase 15 Box Separation introduced a regression.
## Next Steps
### Investigation Needed
1. **Add more detailed logging**:
   - Log ALL mmap calls with returned address
   - Log prefix/suffix munmap with exact ranges
   - Log final SuperSlab address vs mmap address
   - Track which pointers are allocated from which SuperSlab
2. **Verify registry integrity**:
   - Dump entire registry before crash
   - Check for hash collisions
   - Verify linear probing behavior
3. **Test with reduced SuperSlab size**:
   - Try lg=20 (1MB) instead of lg=21 (2MB)
   - See if a gap of 2 × SuperSlab size still occurs (2MB at lg=20, versus 4MB at lg=21)
### Fix Options
#### **Option 1: Fix SuperSlab Registry Lookup** ✅ **RECOMMENDED**
**Issue**: Registry lookup fails for valid hakmem allocations.
**Potential fixes**:
- Increase SUPER_MAX_PROBE from 8 to 16/32
- Use better hash function to reduce collisions
- Store address **range** instead of single base
- Support lookup by any address within SuperSlab region
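
As one way to picture the "store a range, look up by any interior address" idea, here is a minimal sketch that keeps `[start, end)` ranges in a sorted array and finds the owner by binary search. It is an illustration under stated assumptions: the `ss_range` type and `range_find` function are hypothetical, the array is assumed sorted and non-overlapping, and the sketch ignores the concurrency requirements a real registry would have.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range entry: covers addresses in [start, end). */
typedef struct { uint64_t start, end; } ss_range;

/* ranges must be sorted by start and must not overlap.
 * Returns the index of the range containing ptr, or -1 if none does. */
static int range_find(const ss_range *ranges, size_t n, uint64_t ptr)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ptr < ranges[mid].start)
            hi = mid;            /* ptr is below this range */
        else if (ptr >= ranges[mid].end)
            lo = mid + 1;        /* ptr is above this range */
        else
            return (int)mid;     /* start <= ptr < end */
    }
    return -1;
}
```

The design trade-off is that a range lookup does not depend on the pointer masking to exactly the registered base, at the cost of an O(log n) search and a structure that is harder to update without locking than a hash slot.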
#### **Option 2: Improve ExternalGuard Safety** ⚠️ **WORKAROUND**
**Current behavior** (DANGEROUS):
```c
if (!is_mapped) return 0; // Delegate to __libc_free → CRASH!
```
**Safer behavior**:
```c
if (!is_mapped) {
    fprintf(stderr, "[ExternalGuard] WARNING: Unknown pointer %p (ignored)\n", ptr);
    return 1; // Claim handled (leak vs crash tradeoff)
}
```
**Pros**: Prevents crash
**Cons**: Memory leak for genuinely external pointers
#### **Option 3: Fix Box FrontGate Classification** ❌ NOT RECOMMENDED
**Idea**: Add special path for page-aligned Tiny pointers.
**Problems**:
- Can't read header at `ptr-1` (page boundary violation)
- Violates 1-byte header design
- Requires alternative classification
## Conclusion
**Primary Issue**: SuperSlab registry lookup fails for page-aligned user pointers.
**Secondary Issue**: ExternalGuard unconditionally delegates unknown pointers to `__libc_free()`.
**Recommended Action**:
1. Fix registry lookup (Option 1)
2. Add ExternalGuard safety (Option 2 as backup)
3. Comprehensive logging to confirm root cause