Files
hakmem/docs/analysis/P0_ROOT_CAUSE_FOUND.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 完了:環境変数整理 + fprintf デバッグガード

ENV変数削除(BG/HotMag系):
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除(旧レポート・重複docs)

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作)
- ENV整理による機能影響なし
- Debug出力は一部残存(次phase で対応)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00

4.2 KiB

P0 SEGV Root Cause - CONFIRMED

Executive Summary

Status: ROOT CAUSE IDENTIFIED Bug Type: Incorrect alignment validation in splice function Severity: FALSE POSITIVE causing abort Real Issue: Guard logic error, not P0 carving logic

The Smoking Gun

[BATCH_CARVE] cls=6 slab=1 used=0 cap=128 batch=16 base=0x7efa77010000 bs=513
[TRC_GUARD] failfast=1 env=1 mode=release
[LINEAR_CARVE] base=0x7efa77010000 carved=0 batch=16 cursor=0x7efa77010000
[SPLICE_TO_SLL] cls=6 head=0x7efa77010000 tail=0x7efa77011e0f count=16
[SPLICE_CORRUPT] Chain head 0x7efa77010000 misaligned (blk=513 offset=478)!

Analysis

What Happened

  1. Class 6 allocation (512B + 1B header = 513B blocks)
  2. Slab base: 0x7efa77010000 (page-aligned, typical for mmap)
  3. Linear carve: Correctly starts at base + 0 (carved=0)
  4. Alignment check: 0x7efa77010000 % 513 = 478FALSE POSITIVE!

The Bug in the Guard

Location: core/tiny_refill_opt.h:70

// WRONG: Checks absolute address alignment
if (((uintptr_t)c->head % blk) != 0) {
    fprintf(stderr, "[SPLICE_CORRUPT] Chain head %p misaligned (blk=%zu offset=%zu)!\n",
            c->head, blk, (uintptr_t)c->head % blk);
    abort();
}

Problem:

  • Checks address % block_size
  • But slab base is page-aligned (4096), not block-size aligned (513)
  • For class 6: 0x...10000 % 513 = 478 (always!)

Why This is a False Positive

Blocks don't need absolute alignment! They only need:

  1. Correct stride spacing (513 bytes apart)
  2. Valid offset from slab base (offset % stride == 0)

Example:

  • Base: 0x...10000
  • Block 0: 0x...10000 (offset 0, valid)
  • Block 1: 0x...10201 (offset 513, valid)
  • Block 2: 0x...10402 (offset 1026, valid)

All blocks are correctly spaced by 513 bytes, even though base % 513 ≠ 0.

Why Did SEGV Happen Without Guards?

Theory: The splice function writes *(void**)c->tail = *sll_head (line 79).

If c->tail is misaligned (offset 478), writing a pointer might:

  1. Cross a cache line boundary (performance hit)
  2. Cross a page boundary (potential SEGV if next page unmapped)

Hypothesis: Later in the benchmark, when:

  • TLS SLL grows large
  • tail pointer happens to be near page boundary
  • Write crosses into unmapped page → SEGV

The Fix

// CORRECT: Check offset from slab base, not absolute address
// Note: We don't have ss_base in splice, so validate in carve instead
static inline uint32_t trc_linear_carve(...) {
    // After computing cursor:
    size_t offset = cursor - base;
    if (offset % stride != 0) {
        fprintf(stderr, "[LINEAR_CARVE] Misalignment! offset=%zu stride=%zu\n", offset, stride);
        abort();
    }
    // ... rest of function
}

Option B: Remove Alignment Check (Quick Fix)

The alignment check in splice is overly strict. Blocks are guaranteed aligned by the carve logic (line 193):

uint8_t* cursor = base + ((size_t)meta->carved * stride);  // Always aligned!

Why This Explains the Original SEGV

  1. Without guards: splice proceeds with "misaligned" pointer
  2. Most writes succeed: Memory is mapped, just not cache-aligned
  3. Rare case: tail pointer near 4096-byte page boundary
  4. Write crosses boundary: *(void**)tail = sll_head spans two pages
  5. Second page unmapped: SEGV at random iteration (10K in our case)

This is a classic Heisenbug:

  • Depends on exact memory layout
  • Only triggers when slab base address ends in specific value
  • Non-deterministic iteration count (5K-10K range)

Immediate (Today):

  1. Remove the incorrect alignment check from splice
  2. ⏭️ Test P0 again - should work now!
  3. ⏭️ Add correct validation in carve function

Future (Next Sprint):

  1. Ensure slab bases are block-size aligned at allocation time
    • This eliminates the whole issue
    • Requires changes to tiny_slab_base_for() or mmap logic

Files to Modify

  1. core/tiny_refill_opt.h:66-76 - Remove bad alignment check
  2. core/tiny_refill_opt.h:190-200 - Add correct offset check in carve

Analysis By: Claude Task Agent (Ultrathink) Date: 2025-11-09 21:40 UTC Status: Root cause confirmed, fix ready to apply