# Phase E3-2 SEGV Root Cause Analysis

**Status**: 🔴 **CRITICAL BUG IDENTIFIED**
**Date**: 2025-11-12
**Affected**: Phase E3-1 + E3-2 implementation
**Symptom**: SEGV at ~14K iterations on `bench_random_mixed_hakmem` with a 512-slot working set

---

## Executive Summary

**Root Cause**: Phase E3-1 removed the registry lookup, which was **essential** for correctly handling **Class 7 (1KB headerless)** allocations. Without it, the header-based fast free path cannot distinguish Class 7 from other classes, leading to memory corruption and SEGV.

**Severity**: **Critical** - Production blocker
**Impact**: All benchmarks with mixed allocation sizes (16-1024B) crash
**Fix Complexity**: **Medium** - Requires a design decision on Class 7 handling

---

## Investigation Timeline

### Phase 1: Hypothesis Testing - Box TLS-SLL as Verification Layer

**Hypothesis**: Box TLS-SLL acts as a verification layer, masking underlying bugs in Direct TLS push

**Test**: Reverted Phase E3-2 to use Box TLS-SLL for all builds

```c
// Removed E3-2 conditional, always use Box TLS-SLL
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
    return 0;
}
```

**Result**: ❌ **DISPROVEN** - SEGV still occurs at the same iteration (~14K)

**Conclusion**: The bug exists independently of Box TLS-SLL vs Direct TLS push

---

### Phase 2: Understanding the Benchmark

**Critical Discovery**: The "512" parameter is the **working set size**, NOT the allocation size!

```c
// bench_random_mixed.c:58
size_t sz = 16u + (r & 0x3FFu);  // 16..1040 bytes (MIXED SIZES!)
```

**Allocation Range**: 16-1024B nominal (the mask `0x3FF` actually yields 16..1039)

**Class Distribution**:
- Class 0 (8B)
- Class 1 (16B)
- Class 2 (32B)
- Class 3 (64B)
- Class 4 (128B)
- Class 5 (256B)
- Class 6 (512B)
- **Class 7 (1024B)** ← HEADERLESS!

**Impact**: Class 7 blocks ARE being allocated and freed, but the header-based fast free path doesn't know how to handle them!

---

### Phase 3: GDB Analysis - Crash Location

**Crash Details**:

```
Thread 1 "bench_random_mi" received signal SIGSEGV, Segmentation fault.
0x000055555557367b in hak_tiny_alloc_fast_wrapper ()

rax 0x33333333333335c1  # User data interpreted as pointer!
rbp 0x82e
r12

# Crash at:
1f67b: mov (%r12),%rax  # Reading next pointer from corrupted location
```

**Pattern**: `rax=0x33333333...` is user data (likely from the allocation fill pattern `((unsigned char*)p)[0] = (unsigned char)r;`)

**Interpretation**: A block containing user data is being treated as a TLS SLL node. The allocator tries to read its "next" pointer and reads user data instead.

---

### Phase 4: Class 7 Header Analysis

**Allocation Path** (`tiny_region_id_write_header`, lines 53-54):

```c
if (__builtin_expect(class_idx == 7, 0)) {
    return base;  // NO HEADER WRITTEN! Returns base directly
}
```

**Free Path** (`tiny_free_fast_v2.inc.h`):

```c
// Line 93: Read class_idx from header
int class_idx = tiny_region_id_read_header(ptr);

// Line 101-104: Check if invalid
if (__builtin_expect(class_idx < 0, 0)) {
    return 0;  // Route to slow path
}

// Line 129: Calculate base
void* base = (char*)ptr - 1;
```

**Critical Issue**: For Class 7:
1. Allocation returns `base` (no header)
2. User receives `ptr = base` (NOT `base+1` like other classes)
3. Free receives `ptr = base`
4. Header read at `ptr-1` finds **garbage** (user data or a previous allocation's data)
5. If the garbage happens to match the magic (0xa0-0xa7), the free path extracts a **wrong class_idx**!

---

## Root Cause: Missing Registry Lookup

### Phase E3-1 Removed Essential Safety Check

**Removed Code** (`tiny_free_fast_v2.inc.h`, lines 54-56 comment):

```c
// Phase E3-1: Remove registry lookup (50-100 cycles overhead)
// Reason: Phase E1 added headers to C7, making this check redundant
```

**WRONG ASSUMPTION**: The comment claims "Phase E1 added headers to C7", but this is **FALSE**!

**Truth**: Phase E1 did NOT add headers to C7.
Looking at `tiny_region_id_write_header`:

```c
if (__builtin_expect(class_idx == 7, 0)) {
    return base;  // Special-case class 7 (1024B blocks): return full block without header
}
```

### What Registry Lookup Did

**Front Gate Classifier** (`core/box/front_gate_classifier.c`, lines 198-199):

```c
// Step 2: Registry lookup for Tiny (header or headerless)
result = registry_lookup(ptr);
```

**Registry Lookup Logic** (lines 118-154):

```c
struct SuperSlab* ss = hak_super_lookup(ptr);
if (!ss) return result;  // Not in Tiny registry

result.class_idx = ss->size_class;

// Only class 7 (1KB) is headerless
if (ss->size_class == 7) {
    result.kind = PTR_KIND_TINY_HEADERLESS;
} else {
    result.kind = PTR_KIND_TINY_HEADER;
}
```

**What It Did**:
1. Looked up the pointer in the SuperSlab registry (50-100 cycles)
2. Retrieved the correct `class_idx` from SuperSlab metadata (NOT from the header)
3. Correctly identified Class 7 as headerless
4. Routed Class 7 to the slow path (which handles headerless blocks correctly)

**Evidence**: Commit `a97005f50` message: "Front Gate: registry-first classification (no ptr-1 deref); ... Verified: bench_fixed_size_hakmem 200000 1024 128 passes (Debug/Release), no SEGV."

This commit shows that the registry-first approach was **necessary** for 1024B (Class 7) allocations to work!

---

## Bug Scenario Walkthrough

### Scenario A: Class 7 Block Lifecycle (Current Broken Code)

1. **Allocation**:
   ```c
   // User requests 1024B → Class 7
   void* base = /* carved from slab */;
   return base;  // NO HEADER! ptr == base
   ```

2. **User Writes Data**:
   ```c
   ((unsigned char*)ptr)[0] = 0x33;  // Fill pattern
   ((unsigned char*)ptr)[1] = 0x33;
   // ...
   ```

3. **Free Attempt**:
   ```c
   // tiny_free_fast_v2.inc.h
   int class_idx = tiny_region_id_read_header(ptr);
   // Reads ptr-1, finds 0x33 or garbage
   // If garbage is in the 0xa0-0xa7 range → false positive!
   // Extracts wrong class_idx (e.g., 0xa3 → class 3)

   // WRONG class detected!
   void* base = (char*)ptr - 1;  // base is now WRONG!

   // Push to WRONG class TLS SLL
   tls_sll_push(WRONG_class_idx, WRONG_base, ...);
   ```

4. **Later Allocation**:
   ```c
   // Allocate from WRONG class
   void* base = tls_sll_pop(class_3);
   // Gets corrupted pointer (offset by -1, wrong alignment)
   // Tries to read next pointer:
   //   mov (%r12), %rax   (r12 has corrupted address)
   // SEGV! Reading from invalid memory
   ```

### Scenario B: Class 7 with Safe Header Read (Why it doesn't always crash immediately)

Most of the time, `ptr-1` for Class 7 doesn't hold a valid magic byte:

```c
int class_idx = tiny_region_id_read_header(ptr);
// ptr-1 has garbage (not 0xa0-0xa7)
// Returns -1

if (class_idx < 0) {
    return 0;  // Route to slow path → WORKS!
}
```

**Why runs with a working set of 128/256 succeed but 512 fails**:
- **Smaller working sets**: Class 7 allocations are rare (only ~1% of allocations in the 16-1024 range)
- **Probability**: With 128/256 working set slots, fewer Class 7 blocks exist
- **512 working set**: More Class 7 blocks → higher probability of a false positive header match
- **Crash at 14K iterations**: Eventually, a Class 7 block's `ptr-1` contains garbage that matches the 0xa0-0xa7 magic → corruption starts

---

## Phase E3-2 Additional Bug (Direct TLS Push)

**Code** (`tiny_free_fast_v2.inc.h`, lines 131-142, Phase E3-2):

```c
#if HAKMEM_BUILD_RELEASE
    // Direct inline push (next pointer at base+1 due to header)
    *(void**)((uint8_t*)base + 1) = g_tls_sll_head[class_idx];
    g_tls_sll_head[class_idx] = base;
    g_tls_sll_count[class_idx]++;
#else
    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        return 0;
    }
#endif
```

**Bugs**:
1. **No Class 7 check**: Bypasses Box TLS-SLL's C7 rejection (lines 86-88 in `tls_sll_box.h`)
2. **Wrong next pointer offset**: Uses `base+1` for all classes, but Class 7 should use `base+0`
3. **No capacity check**: Box TLS-SLL checks capacity before push; Direct push does not

**Impact**: Phase E3-2 makes the problem worse, but the root cause (missing registry lookup) exists in both E3-1 and E3-2.
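The three gaps listed above can be closed without going back through Box TLS-SLL. The following is a minimal sketch under stated assumptions, not the hakmem implementation: `SKETCH_NUM_CLASSES`, `SKETCH_C7`, `SKETCH_CAP`, `sketch_head`, `sketch_count`, and `sketch_direct_push` are hypothetical stand-ins for the real TLS arrays and capacity constant.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_CLASSES 8   /* assumption: classes C0..C7 */
#define SKETCH_C7          7   /* the headerless class in this report */
#define SKETCH_CAP         4u  /* assumption: tiny capacity, for illustration only */

/* Hypothetical per-thread list heads and counts (stand-ins for g_tls_sll_*). */
static _Thread_local void*    sketch_head[SKETCH_NUM_CLASSES];
static _Thread_local uint32_t sketch_count[SKETCH_NUM_CLASSES];

/* Returns 1 if the block was pushed, 0 if the caller must take the slow path. */
static int sketch_direct_push(int class_idx, void* base) {
    if (base == NULL) return 0;
    if (class_idx < 0 || class_idx >= SKETCH_NUM_CLASSES) return 0;
    if (class_idx == SKETCH_C7) return 0;                 /* bug 1: reject headerless C7 */
    if (sketch_count[class_idx] >= SKETCH_CAP) return 0;  /* bug 3: capacity check */

    /* Bug 2 is avoided because only header classes reach this point, so the
     * next pointer always lives at base+1. memcpy sidesteps the unaligned store. */
    void* next = sketch_head[class_idx];
    memcpy((uint8_t*)base + 1, &next, sizeof next);
    sketch_head[class_idx] = base;
    sketch_count[class_idx]++;
    return 1;
}
```

A caller would treat a return value of 0 exactly like a failed `tls_sll_push` and fall through to the slow path. This guard does nothing about a header false positive that decodes to C0-C6; it only stops the direct push from making the Class 7 case worse.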
---

## Why Phase 7 Succeeded

**Key Difference**: Phase 7 likely had registry lookup OR properly routed Class 7 to slow path

**Evidence Needed**: Check Phase 7 commit history for:

```bash
git log --all --oneline --grep="Phase 7\|Hybrid mincore" | head -5
# Results:
# 18da2c826 Phase D: Debug-only strict header validation
# 50fd70242 Phase A-C: Debug guards + Ultra-Fast Free prioritization
# dde490f84 Phase 7: header-aware TLS front caches and FG gating
# ...
```

Checking commit `dde490f84`:

```bash
git show dde490f84:core/tiny_free_fast_v2.inc.h | grep -A 10 "registry\|class.*7"
```

**Hypothesis**: Phase 7 likely had one of:
- Registry lookup before header read
- Explicit Class 7 slow path routing
- Front Gate Box integration (which does registry lookup)

---

## Fix Options

### Option A: Restore Registry Lookup (Conservative, Safe)

**Approach**: Restore registry lookup before header read for Class 7 detection

**Implementation**:

```c
// tiny_free_fast_v2.inc.h
static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // PHASE E3-FIX: Registry lookup for Class 7 detection
    // Cost: 50-100 cycles (hash lookup)
    // Benefit: Correct handling of headerless Class 7
    extern struct SuperSlab* hak_super_lookup(void* ptr);
    struct SuperSlab* ss = hak_super_lookup(ptr);

    if (ss && ss->size_class == 7) {
        // Class 7 (headerless) → route to slow path
        return 0;
    }

    // Continue with header-based fast path for C0-C6
    int class_idx = tiny_region_id_read_header(ptr);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
        return 0;
    }

    // ... rest of fast path
}
```

**Pros**:
- ✅ 100% correct Class 7 handling
- ✅ No assumptions about header presence
- ✅ Proven to work (commit `a97005f50`)

**Cons**:
- ❌ 50-100 cycle overhead for ALL frees
- ❌ Defeats the purpose of Phase E3-1 optimization

**Performance Impact**: -10-20% (registry lookup overhead)

---

### Option B: Remove Class 7 from Fast Path (Selective Optimization)

**Approach**: Accept that Class 7 cannot use fast path; optimize only C0-C6

**Implementation**:

```c
static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // 1. Try header read
    int class_idx = tiny_region_id_read_header(ptr);

    // 2. If header invalid → slow path
    if (class_idx < 0) {
        return 0;  // Could be C7, Pool TLS, or invalid
    }

    // 3. CRITICAL: Reject Class 7 (should never have valid header)
    if (class_idx == 7) {
        // Defense in depth: C7 should never reach here
        // If it does, it's a bug (header written when it shouldn't be)
        return 0;
    }

    // 4. Bounds check
    if (class_idx >= TINY_NUM_CLASSES) {
        return 0;
    }

    // 5. Capacity check
    uint32_t cap = (uint32_t)TINY_TLS_MAG_CAP;
    if (g_tls_sll_count[class_idx] >= cap) {
        return 0;
    }

    // 6. Calculate base (valid for C0-C6 only)
    void* base = (char*)ptr - 1;

    // 7. Push to TLS SLL (C0-C6 only)
    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        return 0;
    }

    return 1;
}
```

**Pros**:
- ✅ Fast path for C0-C6 (90-95% of allocations)
- ✅ No registry lookup overhead
- ✅ Explicit C7 rejection (defense in depth)

**Cons**:
- ⚠️ Class 7 always uses slow path (~5% of allocations)
- ⚠️ Relies on header read returning -1 for C7 (probabilistic safety)

**Performance**:
- **Expected**: 30-50M ops/s for C0-C6 (Phase 7 target)
- **Class 7**: 1-2M ops/s (slow path)
- **Mixed workload**: ~28-45M ops/s (weighted average)

**Risk**: If Class 7's `ptr-1` happens to contain valid magic (garbage match), corruption still occurs. Needs additional safety check.
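The risk can be reproduced in isolation. The following is a minimal sketch, assuming the header byte is encoded as magic `0xa0` OR'ed with a 3-bit class index (consistent with the 0xa0-0xa7 range cited in this report). `sketch_read_header`, `sketch_decode_after_neighbor_byte`, and the `SKETCH_*` constants are hypothetical and are not the real `tiny_region_id_read_header`.

```c
#include <stdint.h>

#define SKETCH_MAGIC      0xa0u  /* assumption: header marker in the upper bits */
#define SKETCH_MAGIC_MASK 0xf8u  /* accepts exactly 0xa0..0xa7 */
#define SKETCH_CLASS_MASK 0x07u  /* low 3 bits carry the class index */

/* Decode the class from the byte just before ptr; -1 if it is not a header. */
static int sketch_read_header(const void* ptr) {
    uint8_t h = *((const uint8_t*)ptr - 1);
    if ((h & SKETCH_MAGIC_MASK) != SKETCH_MAGIC) return -1;
    return (int)(h & SKETCH_CLASS_MASK);
}

/* Simulate a headerless block: the byte before the user pointer belongs to
 * the neighbouring block, so its value is whatever that block's user wrote. */
static int sketch_decode_after_neighbor_byte(uint8_t neighbor_last_byte) {
    uint8_t slab[1 + 16];
    slab[0] = neighbor_last_byte;  /* last byte of the previous block */
    const void* ptr = &slab[1];    /* headerless: user pointer == base */
    return sketch_read_header(ptr);
}
```

Under this encoding, a neighbour byte of `0x33` decodes to -1 and the free falls back to the slow path, while `0xa3` decodes to class 3 and passes Option B's `class_idx == 7` rejection untouched. Eight of the 256 possible byte values match the magic, which is why the check is only probabilistic.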
---

### Option C: Add Headers to Class 7 (Architectural Change)

**Approach**: Modify Class 7 to have a 1-byte header like other classes

**Implementation**:

```c
// tiny_region_id_write_header
static inline void* tiny_region_id_write_header(void* base, int class_idx) {
    if (!base) return base;

    // REMOVE special case for Class 7
    // Write header for ALL classes (C0-C7)
    uint8_t* header_ptr = (uint8_t*)base;
    *header_ptr = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);

    void* user = header_ptr + 1;
    return user;  // Return base+1 for ALL classes
}
```

**Changes Required**:
1. Allocation: Class 7 returns `base+1` (not `base`)
2. Free: Class 7 uses `ptr-1` as base (same as C0-C6)
3. TLS SLL: Class 7 can use TLS SLL (next at `base+1`)
4. Slab layout: Class 7 stride becomes 1025B (1024B user + 1B header)

**Pros**:
- ✅ Uniform handling for ALL classes
- ✅ No special cases
- ✅ Fast path works for 100% of allocations
- ✅ 59-70M ops/s achievable (Phase 7 target)

**Cons**:
- ❌ Breaking change (ABI incompatible with existing C7 allocations)
- ❌ 0.1% memory overhead for Class 7
- ❌ Stride 1025B → alignment issues (not power-of-2)
- ❌ May require slab layout adjustments

**Risk**: **High** - Requires extensive testing and validation

---

### Option D: Hybrid - Registry Lookup Only for Ambiguous Cases (Optimized)

**Approach**: Use header first; only call registry if header might be a false positive

**Implementation**:

```c
static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // 1. Try header read
    int class_idx = tiny_region_id_read_header(ptr);

    // 2. If clearly invalid → slow path
    if (class_idx < 0) {
        return 0;
    }

    // 3. Bounds check
    if (class_idx >= TINY_NUM_CLASSES) {
        return 0;
    }

    // 4. HYBRID: For Class 7, double-check with registry
    // Reason: C7 should never have header, so if we see class_idx=7,
    // it's either a bug OR we need registry to confirm
    if (class_idx == 7) {
        // Registry lookup to confirm
        extern struct SuperSlab* hak_super_lookup(void* ptr);
        struct SuperSlab* ss = hak_super_lookup(ptr);

        if (!ss || ss->size_class != 7) {
            // False positive - not actually C7
            return 0;
        }

        // Confirmed C7 → slow path (headerless)
        return 0;
    }

    // 5. Fast path for C0-C6
    void* base = (char*)ptr - 1;

    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        return 0;
    }

    return 1;
}
```

**Pros**:
- ✅ Fast path for C0-C6 (no registry lookup)
- ✅ Registry lookup only for rare C7 cases (~5%)
- ✅ Correct handling whenever the header decodes as class 7

**Cons**:
- ⚠️ C7 still uses slow path
- ⚠️ Complex logic (two classification paths)

**Performance**:
- **C0-C6**: 30-50M ops/s (no overhead)
- **C7**: 1-2M ops/s (registry + slow path)
- **Mixed**: ~28-45M ops/s

---

## Recommendation

### SHORT TERM (Immediate Fix): **Option B + Option D Hybrid**

**Rationale**:
1. Minimal code change
2. Preserves fast path for 90-95% of allocations
3. Adds defense-in-depth for Class 7
4. Low risk

**Implementation Priority**:
1. Add explicit Class 7 rejection (Option B, step 3)
2. Add registry double-check for Class 7 (Option D, step 4)
3. Test thoroughly with `bench_random_mixed_hakmem`

**Expected Outcome**: 28-45M ops/s on mixed workloads (vs current 8-9M with crashes)

---

### LONG TERM (Architecture): **Option C - Add Headers to Class 7**

**Rationale**:
1. Eliminates all special cases
2. Achieves full Phase 7 performance (59-70M ops/s)
3. Simplifies codebase
4. Future-proof

**Requirements**:
1. Design slab layout with 1025B stride
2. Update all Class 7 allocation paths
3. Extensive testing (regression suite)
4. Document breaking change

**Timeline**: 1-2 weeks (design + implementation + testing)

---

## Verification Plan

### Test Matrix

| Test Case | Iterations | Working Set | Expected Result |
|-----------|------------|-------------|-----------------|
| Fixed 128B | 200K | 128 | ✅ Pass |
| Fixed 256B | 200K | 128 | ✅ Pass |
| Fixed 512B | 200K | 128 | ✅ Pass |
| Fixed 1024B | 200K | 128 | ✅ Pass (C7) |
| **Mixed 16-1024B** | **200K** | **128** | ✅ **Pass** |
| **Mixed 16-1024B** | **200K** | **512** | ✅ **Pass** |
| **Mixed 16-1024B** | **200K** | **8192** | ✅ **Pass** |

### Performance Targets

| Benchmark | Current (Broken) | After Fix (Option B/D) | Target (Option C) |
|-----------|------------------|------------------------|-------------------|
| 128B fixed | 9.52M ops/s | 30-40M ops/s | 50-70M ops/s |
| 256B fixed | 8.30M ops/s | 30-40M ops/s | 50-70M ops/s |
| 512B mixed | ❌ SEGV | 28-45M ops/s | 59-70M ops/s |
| 1024B fixed | ❌ SEGV | 1-2M ops/s | 50-70M ops/s |

---

## References

- **Commit a97005f50**: "Front Gate: registry-first classification (no ptr-1 deref); ... Verified: bench_fixed_size_hakmem 200000 1024 128 passes"
- **Phase 7 Documentation**: `CLAUDE.md` lines 105-140
- **Box TLS-SLL Design**: `core/box/tls_sll_box.h` lines 84-88 (C7 rejection)
- **Front Gate Classifier**: `core/box/front_gate_classifier.c` lines 148-154 (registry lookup)
- **Class 7 Special Case**: `core/tiny_region_id.h` lines 49-55 (no header)

---

## Appendix: Phase E3 Goals vs Reality

### Phase E3 Goals

**E3-1**: Remove registry lookup overhead (50-100 cycles)
- **Assumption**: "Phase E1 added headers to C7, making registry check redundant"
- **Reality**: ❌ FALSE - C7 never had headers

**E3-2**: Remove Box TLS-SLL overhead (validation, double-free checks)
- **Assumption**: "Header validation is sufficient, Box TLS-SLL is just extra safety"
- **Reality**: ⚠️ PARTIAL - Box TLS-SLL C7 rejection was important

### Phase E3 Reality Check

**Performance Gain**: +15-36% (128B: 8.25M→9.52M, 256B: 6.11M→8.30M)
**Stability Loss**: ❌ CRITICAL - Crashes on mixed workloads

**Verdict**: Phase E3 optimizations were based on **incorrect assumptions** about Class 7 header presence. The 15-36% gain is **not worth** the production crashes.

**Action**: Revert E3-1 registry removal, keep E3-2 Direct TLS push but add C7 check.

---

## End of Report