hakmem/docs/status/PHASE15_REGISTRY_LOOKUP_INVESTIGATION.md

Phase 15 Registry Lookup Investigation

Date: 2025-11-15
Status: 🔍 ROOT CAUSE IDENTIFIED

Summary

Page-aligned Tiny allocations reach ExternalGuard → SuperSlab registry lookup FAILS → delegated to __libc_free() → crash.

Critical Findings

1. Registry Only Stores ONE SuperSlab

Evidence:

[SUPER_REG] register base=0x7d3893c00000 lg=21 slot=115870 magic=5353504c

Only 1 registration in entire test run (10K iterations, 100K operations).

2. 4MB Address Gap

Pattern (consistent across multiple runs):

  • Registry stores: 0x7d3893c00000 (SuperSlab structure address)
  • Lookup searches: 0x7d3893800000 (user pointer, 4MB lower)
  • Difference: 0x400000 = 4MB = 2 × SuperSlab size (lg=21, 2MB)

3. User Data Layout

From code analysis (superslab_inline.h:30-35):

size_t off = SUPERSLAB_SLAB0_DATA_OFFSET + (size_t)slab_idx * SLAB_SIZE;
return (uint8_t*)ss + off;

User data is placed AFTER SuperSlab structure, NOT before!

Implication: User pointer 0x7d3893800000 cannot belong to SuperSlab 0x7d3893c00000 (4MB higher).
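The containment argument can be checked mechanically. The following is a minimal sketch, assuming a SuperSlab occupies exactly the range [base, base + 2^lg); the helper name `ss_contains` is hypothetical and is not taken from the hakmem sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from hakmem): true if ptr lies inside the
 * SuperSlab range [base, base + 2^lg). User data sits at base + offset,
 * so any pointer below base can never pass this check. */
static bool ss_contains(uintptr_t base, unsigned lg, uintptr_t ptr) {
    assert(lg < sizeof(uintptr_t) * 8);
    size_t size = (size_t)1 << lg;
    return ptr >= base && (ptr - base) < size;
}
```

With the logged values, `ss_contains(0x7d3893c00000, 21, 0x7d3893800000)` is false, which matches the implication above.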

4. mmap Alignment Mechanism

Code (hakmem_tiny_superslab.c:280-308):

size_t alloc_size = ss_size * 2;  // Allocate 4MB for 2MB SuperSlab
void* raw = mmap(NULL, alloc_size, ...);
uintptr_t aligned_addr = (raw_addr + ss_mask) & ~ss_mask;  // 2MB align

Scenario:

  • mmap returns 0x7d3893800000 (already 2MB-aligned)
  • aligned_addr = 0x7d3893800000 (no change)
  • Prefix size = 0, Suffix = 2MB (munmapped)
  • SuperSlab registered at: 0x7d3893800000

Contradiction: Registry shows 0x7d3893c00000, not 0x7d3893800000!

5. Hash Slot Mismatch

Lookup:

[SUPER_LOOKUP] ptr=0x7d3893800000 lg=21 aligned_base=0x7d3893800000 hash=115868

Registry:

[SUPER_REG] register base=0x7d3893c00000 lg=21 slot=115870

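Hash difference: 115868 vs 115870 (2 slots apart).

Reason: Originally attributed to linear probing landing in a different slot after a collision. Note, however, that the two bases differ by 0x400000, which is exactly 2 units of 2MB (lg=21). Any hash derived from base >> lg would therefore place them 2 slots apart even with no collision, so the slot gap may simply be another view of the 4MB address gap in finding 2 rather than independent evidence of probing.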
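The slot numbers can be reproduced with simple arithmetic. The sketch below assumes a shift-and-mask hash over a power-of-two table; both the hash form and the table size of 131072 are assumptions chosen because they match the two log lines, not facts confirmed from the hakmem source. The name `slot_for` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed hash (not confirmed as hakmem's): drop the low lg bits of the
 * base address, then mask into a power-of-two table. */
static uint32_t slot_for(uintptr_t base, unsigned lg, uint32_t table_size) {
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return (uint32_t)((base >> lg) & (table_size - 1));
}
```

Under these assumptions, `slot_for(0x7d3893800000, 21, 131072)` gives 115868 and `slot_for(0x7d3893c00000, 21, 131072)` gives 115870, the same values as in the logs. If the real hash has this form, the logged `slot=115870` would be the home slot of the registered base and not the result of probing.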

Root Cause Hypothesis

Option A: Multiple SuperSlabs, Only One Registered

Theory: Multiple SuperSlabs allocated, but only the last one is logged.

Problem: Debug logging should show ALL registrations after fix (ENV check on every call).

Option B: LRU Cache Reuse

Theory: Most SuperSlabs come from LRU cache (already registered), only new allocations are logged.

Problem: First few iterations should still show multiple registrations.

Option C: Pointer is NOT from hakmem

Theory: 0x7d3893800000 is allocated by __libc_malloc(), NOT hakmem.

Evidence:

  • Box BenchMeta uses __libc_calloc for slots[] array
  • free(slots[idx]) uses hakmem wrapper
  • But: slots[] array itself is freed with __libc_free(slots) (Line 99)

Contradiction: slots[] should NOT reach hakmem free() wrapper.

Option D: Registry Lookup Bug

Theory: SuperSlab is registered at 0x7d3893800000, but lookup fails due to:

  1. Hash collision (different slot used during registration vs lookup)
  2. Linear probing limit exceeded (SUPER_MAX_PROBE = 8)
  3. Alignment mismatch (looking for wrong base address)
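To make the three failure modes concrete, here is a toy registry with linear probing and a probe limit. It is a sketch under assumptions: the table size, entry layout, hash, and function names are all hypothetical, and only the probe limit of 8 mirrors SUPER_MAX_PROBE from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_SIZE  16u   /* toy table size, power of two (assumption) */
#define MAX_PROBE 8u    /* mirrors SUPER_MAX_PROBE = 8 */

typedef struct { uintptr_t base; } reg_entry;   /* base == 0 means empty */

static unsigned reg_hash(uintptr_t base, unsigned lg) {
    return (unsigned)((base >> lg) & (REG_SIZE - 1));
}

/* Returns the slot used, or -1 if no free slot within MAX_PROBE steps. */
static int reg_insert(reg_entry *t, uintptr_t base, unsigned lg) {
    assert(t != NULL);
    unsigned h = reg_hash(base, lg);
    for (unsigned i = 0; i < MAX_PROBE; i++) {
        unsigned s = (h + i) & (REG_SIZE - 1);
        if (t[s].base == 0) { t[s].base = base; return (int)s; }
    }
    return -1;   /* failure mode 2: probe limit exceeded */
}

/* Returns the slot of the SuperSlab covering ptr, or -1 if not found. */
static int reg_lookup(const reg_entry *t, uintptr_t ptr, unsigned lg) {
    assert(t != NULL);
    /* Failure mode 3 would show up here if this base differs from the
     * base that was registered. */
    uintptr_t base = ptr & ~(((uintptr_t)1 << lg) - 1);
    unsigned h = reg_hash(base, lg);
    for (unsigned i = 0; i < MAX_PROBE; i++) {
        unsigned s = (h + i) & (REG_SIZE - 1);
        if (t[s].base == base) return (int)s;
        if (t[s].base == 0) return -1;   /* empty slot ends the probe chain */
    }
    return -1;
}
```

In this toy model, registering 0x7d3893c00000 and looking up 0x7d3893800000 fails because the masked base differs, not because of probing. A lookup for any pointer inside the registered 2MB range succeeds.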

Test Results Comparison

| Phase | Test Result | Behavior |
|---|---|---|
| Phase 14 | PASS (5.69M ops/s) | No crash with same test |
| Phase 15 | CRASH | ExternalGuard → __libc_free() failure |

Conclusion: Phase 15 Box Separation introduced regression.

Next Steps

Investigation Needed

  1. Add more detailed logging:

    • Log ALL mmap calls with returned address
    • Log prefix/suffix munmap with exact ranges
    • Log final SuperSlab address vs mmap address
    • Track which pointers are allocated from which SuperSlab
  2. Verify registry integrity:

    • Dump entire registry before crash
    • Check for hash collisions
    • Verify linear probing behavior
  3. Test with reduced SuperSlab size:

    • Try lg=20 (1MB) instead of lg=21 (2MB)
    • See whether the gap scales with SuperSlab size (2MB, i.e. 2 × 1MB) or stays at 4MB
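As one possible shape for the mmap logging in step 1, the helper below formats a single line with the raw mapping, the aligned base, and the trimmed prefix and suffix sizes. It is only a sketch: the `[SS_MMAP]` tag, the field names, and the function `format_mmap_log` are invented for illustration and do not exist in hakmem.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical log formatter; tag and field names are invented here.
 * Writes one line into buf and returns snprintf's result. */
static int format_mmap_log(char *buf, size_t cap, uintptr_t raw,
                           size_t alloc_size, uintptr_t aligned,
                           size_t ss_size) {
    assert(buf != NULL && cap > 0);
    size_t prefix = (size_t)(aligned - raw);          /* bytes unmapped before */
    size_t suffix = alloc_size - prefix - ss_size;    /* bytes unmapped after */
    return snprintf(buf, cap,
                    "[SS_MMAP] raw=0x%" PRIxPTR " len=0x%zx aligned=0x%" PRIxPTR
                    " prefix=0x%zx suffix=0x%zx",
                    raw, alloc_size, aligned, prefix, suffix);
}
```

Printing this line next to each `[SUPER_REG]` entry would show directly whether the registered base matches the aligned address produced by the mapping step.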

Fix Options

Option 1: Fix Registry Lookup

Issue: Registry lookup fails for valid hakmem allocations.

Potential fixes:

  • Increase SUPER_MAX_PROBE from 8 to 16/32
  • Use better hash function to reduce collisions
  • Store address range instead of single base
  • Support lookup by any address within SuperSlab region
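The last two bullets (storing a range and resolving any interior address) could look like the sketch below. It is an illustration under assumptions, not a proposed patch: the `ss_range` type and `range_lookup` function are hypothetical, and it assumes the ranges are kept sorted by start address and never overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range entry covering [start, end). */
typedef struct { uintptr_t start; uintptr_t end; } ss_range;

/* Binary search over ranges sorted by start and non-overlapping.
 * Returns the index of the range containing ptr, or -1 if none does. */
static int range_lookup(const ss_range *r, size_t n, uintptr_t ptr) {
    assert(r != NULL || n == 0);
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ptr < r[mid].start)      hi = mid;       /* search left half */
        else if (ptr >= r[mid].end)  lo = mid + 1;   /* search right half */
        else                         return (int)mid;
    }
    return -1;
}
```

A range table does not depend on the lookup side reconstructing the same base as the registration side, which is the weak point suspected in Option D. The tradeoff is a logarithmic search and the need to keep the table sorted under concurrent updates.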

Option 2: Improve ExternalGuard Safety ⚠️ WORKAROUND

Current behavior (DANGEROUS):

if (!is_mapped) return 0;  // Delegate to __libc_free → CRASH!

Safer behavior:

if (!is_mapped) {
    fprintf(stderr, "[ExternalGuard] WARNING: Unknown pointer %p (ignored)\n", ptr);
    return 1;  // Claim handled (leak vs crash tradeoff)
}

Pros: Prevents crash
Cons: Memory leak for genuinely external pointers

Option 3: Special Path for Page-Aligned Tiny Pointers

Idea: Add special path for page-aligned Tiny pointers.

Problems:

  • Can't read header at ptr-1 (page boundary violation)
  • Violates 1-byte header design
  • Requires alternative classification

Conclusion

Primary Issue: SuperSlab registry lookup fails for page-aligned user pointers.

Secondary Issue: ExternalGuard unconditionally delegates unknown pointers to __libc_free().

Recommended Action:

  1. Fix registry lookup (Option 1)
  2. Add ExternalGuard safety (Option 2 as backup)
  3. Comprehensive logging to confirm root cause