hakmem/docs/analysis/PHASE_E3_SEGV_ROOT_CAUSE_REPORT.md

Phase E3-2 SEGV Root Cause Analysis

Status: 🔴 CRITICAL BUG IDENTIFIED
Date: 2025-11-12
Affected: Phase E3-1 + E3-2 implementation
Symptom: SEGV at ~14K iterations on bench_random_mixed_hakmem with a working set of 512 slots


Executive Summary

Root Cause: Phase E3-1 removed registry lookup, which was essential for correctly handling Class 7 (1KB headerless) allocations. Without registry lookup, the header-based fast free path cannot distinguish Class 7 from other classes, leading to memory corruption and SEGV.

Severity: Critical - Production blocker
Impact: All benchmarks with mixed allocation sizes (16-1024B) crash
Fix Complexity: Medium - Requires design decision on Class 7 handling


Investigation Timeline

Phase 1: Hypothesis Testing - Box TLS-SLL as Verification Layer

Hypothesis: Box TLS-SLL acts as a verification layer, masking underlying bugs in Direct TLS push

Test: Reverted Phase E3-2 to use Box TLS-SLL for all builds

// Removed E3-2 conditional, always use Box TLS-SLL
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
    return 0;
}

Result: DISPROVEN - SEGV still occurs at the same iteration (~14K)
Conclusion: The bug exists independently of the choice between Box TLS-SLL and Direct TLS push


Phase 2: Understanding the Benchmark

Critical Discovery: The "512" parameter is the working set size (number of live slots), NOT the allocation size!

// bench_random_mixed.c:58
size_t sz = 16u + (r & 0x3FFu); // 16..1039 bytes (mixed sizes: 0x3FF = 1023)

Allocation Range: 16-1039B by the expression above (treated as 16-1024B in the rest of this report)
Class Distribution:

  • Class 0 (8B)
  • Class 1 (16B)
  • Class 2 (32B)
  • Class 3 (64B)
  • Class 4 (128B)
  • Class 5 (256B)
  • Class 6 (512B)
  • Class 7 (1024B) ← HEADERLESS!

Impact: Class 7 blocks ARE being allocated and freed, but the header-based fast free path doesn't know how to handle them!
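To get a feel for how often the benchmark touches each class, the size expression can be enumerated directly. The sketch below is not taken from hakmem: `size_to_class_sketch` is a hypothetical mapping that assumes class i owns a block of `8 << i` bytes, that C0-C6 give up one byte to the header, and that C7 keeps all 1024 bytes. The real size-to-class table may differ.

```c
#include <stddef.h>

enum { SKETCH_CLASSES = 8, SKETCH_OVERSIZE = 8 };

/* ASSUMPTION: class i has a block of (8 << i) bytes; C0-C6 lose one byte
 * to the header, C7 is headerless and keeps the full 1024 bytes. */
int size_to_class_sketch(size_t sz) {
    for (int c = 0; c < SKETCH_CLASSES; c++) {
        size_t block = (size_t)8 << c;
        size_t usable = (c == 7) ? block : block - 1;
        if (sz <= usable) return c;
    }
    return SKETCH_OVERSIZE; /* larger than any Tiny class */
}

/* Count how many of the 1024 possible values of (r & 0x3FF) land in each
 * class. counts[8] collects requests larger than Class 7. */
void class_histogram_sketch(unsigned counts[SKETCH_CLASSES + 1]) {
    for (int i = 0; i <= SKETCH_CLASSES; i++) counts[i] = 0;
    for (unsigned low = 0; low <= 0x3FFu; low++) {
        size_t sz = 16u + low; /* same expression as the benchmark */
        counts[size_to_class_sketch(sz)]++;
    }
}
```

Under these assumptions, 513 of the 1024 possible sizes (512..1024) map to Class 7 and 15 (1025..1039) exceed it, so Class 7 may be far more common in this benchmark than the rough 1-5% figures used elsewhere in this report suggest. The actual share depends on the allocator's real class table.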


Phase 3: GDB Analysis - Crash Location

Crash Details:

Thread 1 "bench_random_mi" received signal SIGSEGV, Segmentation fault.
0x000055555557367b in hak_tiny_alloc_fast_wrapper ()

rax            0x33333333333335c1  # User data interpreted as pointer!
rbp            0x82e
r12            <corrupted pointer>

# Crash at:
1f67b:  mov    (%r12),%rax    # Reading next pointer from corrupted location

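Pattern: rax=0x33333333333335c1 is not a canonical x86-64 address, so it cannot be a real pointer. The repeated 0x33 bytes look like user data, likely written by the benchmark's fill statement ((unsigned char*)p)[0] = (unsigned char)r;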

Interpretation: A block containing user data is being treated as a TLS SLL node, and the allocator is trying to read its "next" pointer, but it's reading garbage user data instead.
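The interpretation above can be reproduced in isolation. This is a minimal sketch, assuming a free-list node stores its next pointer in the first word of the block and that the application filled the block with 0x33. All function names are hypothetical and not part of hakmem.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical free-list node read: the first word of a free block is "next". */
uintptr_t sketch_read_next(const void *block) {
    uintptr_t next;
    memcpy(&next, block, sizeof next); /* memcpy avoids alignment problems */
    return next;
}

/* Simulate a live block the application filled with 0x33, then (wrongly)
 * treat it as a free-list node and read its "next" field. */
uintptr_t sketch_next_from_user_data(void) {
    unsigned char block[64];
    memset(block, 0x33, sizeof block);
    return sketch_read_next(block);
}

/* Canonical check for 48-bit x86-64 virtual addresses:
 * bits 63..47 must be all zeros or all ones. */
int sketch_is_canonical(uint64_t addr) {
    uint64_t top = addr >> 47;
    return top == 0 || top == 0x1FFFF;
}
```

Passing the rax value from the trace to `sketch_is_canonical` returns 0, while the crashing instruction address 0x000055555557367b passes the check. That is consistent with user data having been loaded where a pointer was expected.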


Phase 4: Class 7 Header Analysis

Allocation Path (tiny_region_id_write_header, lines 53-54):

if (__builtin_expect(class_idx == 7, 0)) {
    return base;  // NO HEADER WRITTEN! Returns base directly
}

Free Path (tiny_free_fast_v2.inc.h):

// Line 93: Read class_idx from header
int class_idx = tiny_region_id_read_header(ptr);

// Line 101-104: Check if invalid
if (__builtin_expect(class_idx < 0, 0)) {
    return 0;  // Route to slow path
}

// Line 129: Calculate base
void* base = (char*)ptr - 1;

Critical Issue: For Class 7:

  1. Allocation returns base (no header)
  2. User receives ptr = base (NOT base+1 like other classes)
  3. Free receives ptr = base
  4. Header read at ptr-1 finds garbage (user data or previous allocation's data)
  5. If garbage happens to match magic (0xa0-0xa7), it extracts a wrong class_idx!
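The five steps above can be sketched as a small self-contained model. The constants and function names below are hypothetical stand-ins that mirror the 0xa0-0xa7 magic range described in this report; they are not the real `tiny_region_id_read_header`.

```c
#include <stdint.h>

/* Hypothetical constants mirroring the 0xa0-0xa7 magic range. */
#define SKETCH_HEADER_MAGIC      0xA0u
#define SKETCH_HEADER_CLASS_MASK 0x0Fu
#define SKETCH_NUM_CLASSES       8

/* Sketch of a header read: inspect the byte just before the user pointer. */
int sketch_read_header(const void *ptr) {
    uint8_t h = *((const uint8_t *)ptr - 1);
    if ((h & 0xF0u) != SKETCH_HEADER_MAGIC) return -1; /* no magic: slow path */
    int cls = (int)(h & SKETCH_HEADER_CLASS_MASK);
    return (cls < SKETCH_NUM_CLASSES) ? cls : -1;
}

/* Lay out two adjacent headerless 1024B blocks and classify the second one.
 * prev_last_byte is whatever the neighbour's owner stored in its last byte. */
int sketch_classify_headerless(uint8_t prev_last_byte) {
    static uint8_t slab[2048];
    slab[1023] = prev_last_byte; /* last byte of block 0 */
    void *ptr = &slab[1024];     /* block 1: ptr == base, no header written */
    return sketch_read_header(ptr);
}
```

With a neighbouring byte of 0x33 the read returns -1 and the free falls through to the slow path. With 0xA3 it returns 3, so a 1024B block would be handed to the class 3 free list.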

Root Cause: Missing Registry Lookup

Phase E3-1 Removed Essential Safety Check

Removed Code (tiny_free_fast_v2.inc.h, comment at lines 54-56):

// Phase E3-1: Remove registry lookup (50-100 cycles overhead)
// Reason: Phase E1 added headers to C7, making this check redundant

WRONG ASSUMPTION: The comment claims "Phase E1 added headers to C7", but this is FALSE!

Truth: Phase E1 did NOT add headers to C7. Looking at tiny_region_id_write_header:

if (__builtin_expect(class_idx == 7, 0)) {
    return base;  // Special-case class 7 (1024B blocks): return full block without header
}

What Registry Lookup Did

Front Gate Classifier (core/box/front_gate_classifier.c, lines 198-199):

// Step 2: Registry lookup for Tiny (header or headerless)
result = registry_lookup(ptr);

Registry Lookup Logic (lines 118-154):

struct SuperSlab* ss = hak_super_lookup(ptr);
if (!ss) return result;  // Not in Tiny registry

result.class_idx = ss->size_class;

// Only class 7 (1KB) is headerless
if (ss->size_class == 7) {
    result.kind = PTR_KIND_TINY_HEADERLESS;
} else {
    result.kind = PTR_KIND_TINY_HEADER;
}

What It Did:

  1. Looked up pointer in SuperSlab registry (50-100 cycles)
  2. Retrieved correct class_idx from SuperSlab metadata (NOT from header)
  3. Correctly identified Class 7 as headerless
  4. Routed Class 7 to slow path (which handles headerless correctly)
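The following toy model illustrates why registry-first classification is immune to garbage at ptr-1. It is only a sketch: `ToySlab`, `toy_lookup`, and `toy_classify` are hypothetical stand-ins, and a linear scan replaces whatever lookup structure `hak_super_lookup` really uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SuperSlab registry; not the real hakmem types. */
typedef struct { uintptr_t start; size_t len; int size_class; } ToySlab;
typedef enum { TOY_UNKNOWN, TOY_TINY_HEADER, TOY_TINY_HEADERLESS } ToyKind;
typedef struct { ToyKind kind; int class_idx; } ToyResult;

/* Linear scan over address ranges stands in for the real registry lookup. */
const ToySlab *toy_lookup(const ToySlab *reg, size_t n, const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (size_t i = 0; i < n; i++)
        if (p >= reg[i].start && p - reg[i].start < reg[i].len) return &reg[i];
    return NULL;
}

/* The class comes from slab metadata; ptr-1 is never dereferenced. */
ToyResult toy_classify(const ToySlab *reg, size_t n, const void *ptr) {
    ToyResult r = { TOY_UNKNOWN, -1 };
    const ToySlab *ss = toy_lookup(reg, n, ptr);
    if (!ss) return r; /* not a Tiny pointer */
    r.class_idx = ss->size_class;
    r.kind = (ss->size_class == 7) ? TOY_TINY_HEADERLESS : TOY_TINY_HEADER;
    return r;
}
```

Because the answer depends only on which slab owns the address, the contents of neighbouring user data cannot change the classification. The price is the lookup itself on every free.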

Evidence: Commit a97005f50 message: "Front Gate: registry-first classification (no ptr-1 deref); ... Verified: bench_fixed_size_hakmem 200000 1024 128 passes (Debug/Release), no SEGV."

This commit shows that the registry-first approach was necessary for 1024B (Class 7) allocations to work.


Bug Scenario Walkthrough

Scenario A: Class 7 Block Lifecycle (Current Broken Code)

  1. Allocation:

    // User requests 1024B → Class 7
    void* base = /* carved from slab */;
    return base;  // NO HEADER! ptr == base
    
  2. User Writes Data:

    ((unsigned char*)ptr)[0] = 0x33;  // Fill pattern
    ((unsigned char*)ptr)[1] = 0x33;
    // ...
    
  3. Free Attempt:

    // tiny_free_fast_v2.inc.h
    int class_idx = tiny_region_id_read_header(ptr);
    // Reads ptr-1, finds 0x33 or garbage
    // If garbage is 0xa0-0xa7 range → false positive!
    // Extracts wrong class_idx (e.g., 0xa3 → class 3)
    
    // WRONG class detected!
    void* base = (char*)ptr - 1;  // base is now WRONG!
    
    // Push to WRONG class TLS SLL
    tls_sll_push(WRONG_class_idx, WRONG_base, ...);
    
  4. Later Allocation:

    // Allocate from WRONG class
    void* base = tls_sll_pop(class_3);
    // Gets corrupted pointer (offset by -1, wrong alignment)
    // Tries to read next pointer
    // mov (%r12),%rax  ← r12 holds the corrupted address
    // SEGV! Reading from invalid memory
    

Scenario B: Class 7 with Safe Header Read (Why it doesn't always crash immediately)

Most of the time, ptr-1 for Class 7 doesn't have valid magic:

int class_idx = tiny_region_id_read_header(ptr);
// ptr-1 has garbage (not 0xa0-0xa7)
// Returns -1

if (class_idx < 0) {
    return 0;  // Route to slow path → WORKS!
}

Why runs with a working set of 128 or 256 succeed but the 512 run fails:

  • Smaller working sets: Class 7 allocations are rare (only ~1% of allocations in 16-1024 range)
  • Probability: With 128/256 working set slots, fewer Class 7 blocks exist
  • 512 working set: More Class 7 blocks → higher probability of false positive header match
  • Crash at 14K iterations: Eventually, a Class 7 block's ptr-1 contains garbage that matches 0xa0-0xa7 magic → corruption starts
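The probabilistic argument above can be made concrete with a rough model. The sketch assumes the byte at ptr-1 is uniformly random and independent on every headerless free, which is a simplification: the real byte depends on what the benchmark wrote into the neighbouring block. The function name is hypothetical.

```c
/* Probability that at least one of n headerless frees sees a byte in
 * 0xa0..0xa7 at ptr-1, ASSUMING that byte is uniformly random and
 * independent per free (8 matching values out of 256). */
double sketch_p_any_false_positive(unsigned n) {
    double p_miss_all = 1.0;
    for (unsigned i = 0; i < n; i++)
        p_miss_all *= 248.0 / 256.0; /* this free sees no magic byte */
    return 1.0 - p_miss_all;
}
```

Under this model a single free has a 1/32 chance of a false match, and the chance of at least one match grows quickly with the number of Class 7 frees. The bytes in a real run are not uniform, so the model only shows that a larger population of Class 7 blocks makes an eventual collision more likely; it does not predict the observed 14K iteration count.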

Phase E3-2 Additional Bug (Direct TLS Push)

Code (tiny_free_fast_v2.inc.h, lines 131-142, Phase E3-2):

#if HAKMEM_BUILD_RELEASE
    // Direct inline push (next pointer at base+1 due to header)
    *(void**)((uint8_t*)base + 1) = g_tls_sll_head[class_idx];
    g_tls_sll_head[class_idx] = base;
    g_tls_sll_count[class_idx]++;
#else
    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        return 0;
    }
#endif

Bugs:

  1. No Class 7 check: Bypasses Box TLS-SLL's C7 rejection (lines 86-88 in tls_sll_box.h)
  2. Wrong next pointer offset: Uses base+1 for all classes, but Class 7 should use base+0
  3. No capacity check: Box TLS-SLL checks capacity before push; Direct push does not

Impact: Phase E3-2 makes the problem worse, but the root cause (missing registry lookup) exists in both E3-1 and E3-2.


Why Phase 7 Succeeded

Key Difference: Phase 7 likely had registry lookup OR properly routed Class 7 to slow path

Evidence Needed: Check Phase 7 commit history for:

git log --all --oneline --grep="Phase 7\|Hybrid mincore" | head -5
# Results:
# 18da2c826 Phase D: Debug-only strict header validation
# 50fd70242 Phase A-C: Debug guards + Ultra-Fast Free prioritization
# dde490f84 Phase 7: header-aware TLS front caches and FG gating
# ...

Checking commit dde490f84:

git show dde490f84:core/tiny_free_fast_v2.inc.h | grep -A 10 "registry\|class.*7"

Hypothesis: Phase 7 likely had one of:

  • Registry lookup before header read
  • Explicit Class 7 slow path routing
  • Front Gate Box integration (which does registry lookup)

Fix Options

Option A: Restore Registry Lookup (Conservative, Safe)

Approach: Restore registry lookup before header read for Class 7 detection

Implementation:

// tiny_free_fast_v2.inc.h
static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // PHASE E3-FIX: Registry lookup for Class 7 detection
    // Cost: 50-100 cycles (hash lookup)
    // Benefit: Correct handling of headerless Class 7
    extern struct SuperSlab* hak_super_lookup(void* ptr);
    struct SuperSlab* ss = hak_super_lookup(ptr);

    if (ss && ss->size_class == 7) {
        // Class 7 (headerless) → route to slow path
        return 0;
    }

    // Continue with header-based fast path for C0-C6
    int class_idx = tiny_region_id_read_header(ptr);
    if (class_idx < 0 || class_idx >= TINY_NUM_CLASSES) {
        return 0;
    }

    // ... rest of fast path
}

Pros:

  • 100% correct Class 7 handling
  • No assumptions about header presence
  • Proven to work (commit a97005f50)

Cons:

  • 50-100 cycle overhead for ALL frees
  • Defeats the purpose of Phase E3-1 optimization

Performance Impact: an estimated 10-20% slowdown from the registry lookup overhead


Option B: Remove Class 7 from Fast Path (Selective Optimization)

Approach: Accept that Class 7 cannot use fast path; optimize only C0-C6

Implementation:

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // 1. Try header read
    int class_idx = tiny_region_id_read_header(ptr);

    // 2. If header invalid → slow path
    if (class_idx < 0) {
        return 0;  // Could be C7, Pool TLS, or invalid
    }

    // 3. CRITICAL: Reject Class 7 (should never have valid header)
    if (class_idx == 7) {
        // Defense in depth: C7 should never reach here
        // If it does, it's a bug (header written when it shouldn't be)
        return 0;
    }

    // 4. Bounds check
    if (class_idx >= TINY_NUM_CLASSES) {
        return 0;
    }

    // 5. Capacity check
    uint32_t cap = (uint32_t)TINY_TLS_MAG_CAP;
    if (g_tls_sll_count[class_idx] >= cap) {
        return 0;
    }

    // 6. Calculate base (valid for C0-C6 only)
    void* base = (char*)ptr - 1;

    // 7. Push to TLS SLL (C0-C6 only)
    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        return 0;
    }

    return 1;
}

Pros:

  • Fast path for C0-C6 (90-95% of allocations)
  • No registry lookup overhead
  • Explicit C7 rejection (defense in depth)

Cons:

  • ⚠️ Class 7 always uses slow path (~5% of allocations)
  • ⚠️ Relies on header read returning -1 for C7 (probabilistic safety)

Performance:

  • Expected: 30-50M ops/s for C0-C6 (Phase 7 target)
  • Class 7: 1-2M ops/s (slow path)
  • Mixed workload: ~28-45M ops/s (weighted average)
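The mixed-workload figure above is what a direct weighted average of the two rates gives. When operations at different speeds are interleaved, their times add, so a harmonic combination is usually the better estimate. The sketch below is a back-of-the-envelope aid with a hypothetical function name, taking the per-class rates quoted above at face value.

```c
/* Blended throughput (M ops/s) when a fraction slow_frac of operations runs
 * at slow_mops and the rest at fast_mops. Times per operation add, so the
 * rates combine harmonically rather than arithmetically. */
double sketch_blended_mops(double slow_frac, double fast_mops, double slow_mops) {
    double time_per_op = (1.0 - slow_frac) / fast_mops + slow_frac / slow_mops;
    return 1.0 / time_per_op;
}
```

With 5% of operations at 1-2M ops/s and the rest at 30-50M ops/s, this gives roughly 12-23M ops/s, noticeably below the 28-45M estimate. If the slow path really is that slow, the mixed-workload target may need to be revisited or measured directly.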

Risk: If Class 7's ptr-1 happens to contain valid magic (garbage match), corruption still occurs. Needs additional safety check.


Option C: Add Headers to Class 7 (Architectural Change)

Approach: Modify Class 7 to have 1-byte header like other classes

Implementation:

// tiny_region_id_write_header
static inline void* tiny_region_id_write_header(void* base, int class_idx) {
    if (!base) return base;

    // REMOVE special case for Class 7
    // Write header for ALL classes (C0-C7)
    uint8_t* header_ptr = (uint8_t*)base;
    *header_ptr = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);

    void* user = header_ptr + 1;
    return user;  // Return base+1 for ALL classes
}

Changes Required:

  1. Allocation: Class 7 returns base+1 (not base)
  2. Free: Class 7 uses ptr-1 as base (same as C0-C6)
  3. TLS SLL: Class 7 can use TLS SLL (next at base+1)
  4. Slab layout: Class 7 stride becomes 1025B (1024B user + 1B header)
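The layout consequences of a 1025B stride can be checked with simple arithmetic. The helpers below are hypothetical, and the 64 KiB slab size in the usage note is an assumption made for illustration; the report does not state the real slab size.

```c
#include <stddef.h>

/* Blocks that fit in a slab of slab_bytes at the given stride. */
size_t sketch_blocks_per_slab(size_t slab_bytes, size_t stride) {
    return slab_bytes / stride;
}

/* Bytes left unused at the end of the slab. */
size_t sketch_slab_tail_waste(size_t slab_bytes, size_t stride) {
    return slab_bytes % stride;
}

/* Offset of block i's base (the header byte) from the start of the slab. */
size_t sketch_base_offset(size_t i, size_t stride) {
    return i * stride;
}
```

For an assumed 64 KiB slab, a 1024B stride yields 64 blocks with no tail waste, while a 1025B stride yields 63 blocks and leaves 961 bytes unused. Block bases also stop being 8-byte aligned from the second block onward (offset 1025), which is the alignment concern raised for this option.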

Pros:

  • Uniform handling for ALL classes
  • No special cases
  • Fast path works for 100% of allocations
  • 59-70M ops/s achievable (Phase 7 target)

Cons:

  • Breaking change (ABI incompatible with existing C7 allocations)
  • 0.1% memory overhead for Class 7
  • Stride 1025B → alignment issues (not power-of-2)
  • May require slab layout adjustments

Risk: High - Requires extensive testing and validation


Option D: Hybrid - Registry Lookup Only for Ambiguous Cases (Optimized)

Approach: Use header first; only call registry if header might be false positive

Implementation:

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // 1. Try header read
    int class_idx = tiny_region_id_read_header(ptr);

    // 2. If clearly invalid → slow path
    if (class_idx < 0) {
        return 0;
    }

    // 3. Bounds check
    if (class_idx >= TINY_NUM_CLASSES) {
        return 0;
    }

    // 4. HYBRID: For Class 7, double-check with registry
    //    Reason: C7 should never have header, so if we see class_idx=7,
    //    it's either a bug OR we need registry to confirm
    if (class_idx == 7) {
        // Registry lookup to confirm
        extern struct SuperSlab* hak_super_lookup(void* ptr);
        struct SuperSlab* ss = hak_super_lookup(ptr);

        if (!ss || ss->size_class != 7) {
            // False positive - not actually C7
            return 0;
        }

        // Confirmed C7 → slow path (headerless)
        return 0;
    }

    // 5. Fast path for C0-C6
    void* base = (char*)ptr - 1;

    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        return 0;
    }

    return 1;
}

Pros:

  • Fast path for C0-C6 (no registry lookup)
  • Registry lookup only for rare C7 cases (~5%)
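  • Correct whenever the header read fails or yields class 7; a headerless C7 block whose ptr-1 byte aliases a C0-C6 header (0xa0-0xa6) still reaches the fast path, as in Option B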

Cons:

  • ⚠️ C7 still uses slow path
  • ⚠️ Complex logic (two classification paths)

Performance:

  • C0-C6: 30-50M ops/s (no overhead)
  • C7: 1-2M ops/s (registry + slow path)
  • Mixed: ~28-45M ops/s

Recommendation

SHORT TERM (Immediate Fix): Option B + Option D Hybrid

Rationale:

  1. Minimal code change
  2. Preserves fast path for 90-95% of allocations
  3. Adds defense-in-depth for Class 7
  4. Low risk

Implementation Priority:

  1. Add explicit Class 7 rejection (Option B, step 3)
  2. Add registry double-check for Class 7 (Option D, step 4)
  3. Test thoroughly with bench_random_mixed_hakmem

Expected Outcome: 28-45M ops/s on mixed workloads (vs current 8-9M with crashes)


LONG TERM (Architecture): Option C - Add Headers to Class 7

Rationale:

  1. Eliminates all special cases
  2. Achieves full Phase 7 performance (59-70M ops/s)
  3. Simplifies codebase
  4. Future-proof

Requirements:

  1. Design slab layout with 1025B stride
  2. Update all Class 7 allocation paths
  3. Extensive testing (regression suite)
  4. Document breaking change

Timeline: 1-2 weeks (design + implementation + testing)


Verification Plan

Test Matrix

| Test Case | Iterations | Working Set | Expected Result |
|---|---|---|---|
| Fixed 128B | 200K | 128 | Pass |
| Fixed 256B | 200K | 128 | Pass |
| Fixed 512B | 200K | 128 | Pass |
| Fixed 1024B | 200K | 128 | Pass (C7) |
| Mixed 16-1024B | 200K | 128 | Pass |
| Mixed 16-1024B | 200K | 512 | Pass |
| Mixed 16-1024B | 200K | 8192 | Pass |

Performance Targets

| Benchmark | Current (Broken) | After Fix (Option B/D) | Target (Option C) |
|---|---|---|---|
| 128B fixed | 9.52M ops/s | 30-40M ops/s | 50-70M ops/s |
| 256B fixed | 8.30M ops/s | 30-40M ops/s | 50-70M ops/s |
| 512B mixed | SEGV | 28-45M ops/s | 59-70M ops/s |
| 1024B fixed | SEGV | 1-2M ops/s | 50-70M ops/s |

References

  • Commit a97005f50: "Front Gate: registry-first classification (no ptr-1 deref); ... Verified: bench_fixed_size_hakmem 200000 1024 128 passes"
  • Phase 7 Documentation: CLAUDE.md lines 105-140
  • Box TLS-SLL Design: core/box/tls_sll_box.h lines 84-88 (C7 rejection)
  • Front Gate Classifier: core/box/front_gate_classifier.c lines 148-154 (registry lookup)
  • Class 7 Special Case: core/tiny_region_id.h lines 49-55 (no header)

Appendix: Phase E3 Goals vs Reality

Phase E3 Goals

E3-1: Remove registry lookup overhead (50-100 cycles)

  • Assumption: "Phase E1 added headers to C7, making registry check redundant"
  • Reality: FALSE - C7 never had headers

E3-2: Remove Box TLS-SLL overhead (validation, double-free checks)

  • Assumption: "Header validation is sufficient, Box TLS-SLL is just extra safety"
  • Reality: ⚠️ PARTIAL - Box TLS-SLL C7 rejection was important

Phase E3 Reality Check

Performance Gain: +15-36% (128B: 8.25M→9.52M, 256B: 6.11M→8.30M)
Stability Loss: CRITICAL - Crashes on mixed workloads

Verdict: Phase E3 optimizations were based on incorrect assumptions about Class 7 header presence. The 15-36% gain is not worth the production crashes.

Action: Revert E3-1 registry removal, keep E3-2 Direct TLS push but add C7 check.


End of Report