hakmem/BITMAP_FIX_FAILURE_ANALYSIS.md
Moe Charm (CI) 1010a961fb Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate P0 batch carve, and continue reducing branch-miss as planned.
2025-11-09 18:55:50 +09:00

Bitmap Fix Failure Analysis

Executive Summary

Status: REGRESSION - Bitmap fix made stability WORSE

  • Before (Task Agent's active_slabs fix): 95% (19/20)
  • After (My bitmap fix): 80% (16/20)
  • Regression: -15 percentage points (3 additional failures, 4 in total)

Problem Statement

User's Critical Requirement

"メモリーライブラリーなんて 5でもクラッシュおこったらつかえない"

"A memory library with even 5% crash rate is UNUSABLE"

Target: 100% stability (50+ runs with 0 failures)

Current: 80% stability (UNACCEPTABLE and WORSE than before)

Error Symptoms

4T Crash Pattern

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=4
  prev_ss=0x7da378400000
  active=32
  bitmap=0xffffffff
  errno=12

free(): invalid pointer

Key Observations:

  1. Class 4 consistently fails
  2. bitmap=0xffffffff (all 32 slabs occupied)
  3. active=32 (matches bitmap)
  4. No expansion messages printed (expansion code NOT triggered!)

Code Analysis

My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)

SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    uint32_t full_bitmap = (1U << chunk_cap) - 1;  // intended: 32 slabs → 0xFFFFFFFF, but 1U << 32 is undefined behavior in C

    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);

        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }

        current_chunk = head->current_chunk;
        tls->ss = current_chunk;

        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx, current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}

Critical Issue: Expansion Message NOT Printed!

The error output shows:

  • TLS cache adaptation messages
  • OOM error from superslab_allocate()
  • NO expansion messages ("SuperSlab chunk exhausted...")

This means the expansion code (lines 182-210) is NOT being executed!

Hypothesis

Why Expansion Not Triggered?

Option 1: current_chunk is NULL

  • If current_chunk is NULL, we skip the entire if block (line 166)
  • Continue to normal refill logic without expansion

Option 2: slab_bitmap != full_bitmap is TRUE (unexpected)

  • If bitmap doesn't match expected full value, we think there are free slabs
  • Don't trigger expansion
  • But later code finds no free slabs → OOM

Option 3: Execution reaches expansion but crashes before printing

  • Race condition between check and expansion
  • Another thread modifies state between line 174 and line 182

Option 4: Wrong code path entirely

  • Error comes from mid_simple_refill path (line 264)
  • Which bypasses my expansion code
  • Calls superslab_allocate() directly → OOM
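
Option 2 can be checked against the mask computation itself. If ss_slabs_capacity() returns 32, as the bitmap=0xffffffff and active=32 values in the error output suggest, then (1U << chunk_cap) shifts a 32-bit value by its full width, which is undefined behavior in C. On x86 the shift instruction masks the count to 5 bits, so the expression may evaluate to 0 instead of 0xFFFFFFFF, and the != comparison would then report free slabs on a full chunk. The sketch below shows one way to build the mask safely. slab_full_mask and slab_bitmap_is_full are hypothetical helper names, not functions from the hakmem source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with the low `cap` bits set.
 * cap == 32 is handled explicitly because shifting a 32-bit
 * value by 32 is undefined behavior in C. */
static inline uint32_t slab_full_mask(int cap) {
    assert(cap >= 0 && cap <= 32);
    if (cap >= 32) {
        return UINT32_MAX;
    }
    return ((uint32_t)1 << cap) - 1u;
}

/* Hypothetical helper: a chunk is exhausted when every slab bit
 * below `cap` is set. Bits at or above `cap` are ignored. */
static inline int slab_bitmap_is_full(uint32_t bitmap, int cap) {
    uint32_t full = slab_full_mask(cap);
    return (bitmap & full) == full;
}
```

With this helper, a 32-slab chunk whose bitmap is 0xffffffff is reported as full, which is the case that should trigger expansion.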

Mid-Simple Refill Path (MOST LIKELY)

// Line 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // ← Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // This path would log at line 269, but the observed error comes from line 492
        return NULL;
    }
}

Problem: Class 4 triggers mid_simple_refill (class_idx >= 4), which:

  1. Checks active_slabs < tls_cap (non-atomic, race condition)
  2. If exhausted, calls superslab_allocate() directly
  3. Does NOT use the dynamic expansion mechanism
  4. Returns NULL on OOM

Investigation Tasks

Task 1: Add Debug Logging

Add logging to determine execution path:

  1. Entry point logging:
fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
        class_idx, (void*)current_chunk, (void*)tls->ss);
  2. Bitmap check logging:
fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
        current_chunk->slab_bitmap, full_bitmap, chunk_cap,
        (current_chunk->slab_bitmap == full_bitmap));
  3. Mid-simple path logging:
fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
        class_idx, tiny_mid_refill_simple_enabled(),
        (void*)tls->ss,
        tls->ss ? tls->ss->active_slabs : -1,
        tls->ss ? ss_slabs_capacity(tls->ss) : -1);
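
Per-call fprintf output can flood stderr and shift thread timing in a 4T run, so a lighter complement is to count how often each path is taken and dump the totals once. The following is a minimal sketch under that assumption. The enum values, path_hit, path_hits, and path_dump are hypothetical names chosen for illustration and do not exist in hakmem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical labels for the refill paths discussed in this analysis. */
enum refill_path {
    PATH_BITMAP_HAS_FREE,  /* bitmap check found free slabs */
    PATH_EXPAND,           /* expansion branch entered */
    PATH_MID_SIMPLE,       /* mid_simple_refill branch entered */
    PATH_FRESH_ALLOC,      /* direct superslab_allocate() call */
    PATH_COUNT
};

/* Static storage is zero-initialized, so all counters start at 0. */
static _Atomic unsigned long g_path_hits[PATH_COUNT];

/* Record one pass through path `p`; relaxed ordering is enough for a counter. */
static inline void path_hit(enum refill_path p) {
    assert(p >= 0 && p < PATH_COUNT);
    atomic_fetch_add_explicit(&g_path_hits[p], 1UL, memory_order_relaxed);
}

/* Read the current count for path `p`. */
static inline unsigned long path_hits(enum refill_path p) {
    return atomic_load_explicit(&g_path_hits[p], memory_order_relaxed);
}

/* Print all counters once, for example at process exit. */
static inline void path_dump(FILE *out) {
    static const char *names[PATH_COUNT] = {
        "bitmap_has_free", "expand", "mid_simple", "fresh_alloc"
    };
    for (int i = 0; i < PATH_COUNT; i++) {
        fprintf(out, "[DEBUG] path %-16s hits=%lu\n",
                names[i], path_hits((enum refill_path)i));
    }
}
```

If the expand counter stays at 0 while mid_simple and fresh_alloc grow, that would support the mid-simple bypass hypothesis without relying on message ordering.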

Task 2: Fix Mid-Simple Refill Path

Two options:

Option A: Disable mid_simple_refill for testing

// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {

Option B: Add expansion to mid_simple_refill

// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded
if (tls->ss && tls->ss->active_slabs >= tls_cap) {
    // Try to expand current SuperSlab instead of allocating new one
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}
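
The expand-then-retry shape of Option B can be exercised in isolation with a toy model. The sketch below is single-threaded and uses its own ToyChunk and ToyHead types, which are hypothetical stand-ins and not the real SuperSlab or SuperSlabHead layouts. It only illustrates the control flow: look for a free slab, expand when the chunk is full or missing, then retry on the new chunk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_SLABS_PER_CHUNK 32

typedef struct ToyChunk {
    uint32_t slab_bitmap;        /* bit i set means slab i is in use */
    struct ToyChunk *next;
} ToyChunk;

typedef struct ToyHead {
    ToyChunk *first_chunk;
    ToyChunk *current_chunk;
    int total_chunks;
} ToyHead;

/* Return the lowest free slab index, or -1 if the chunk is full. */
static inline int toy_find_free_slab(const ToyChunk *c) {
    for (int i = 0; i < TOY_SLABS_PER_CHUNK; i++) {
        if (!(c->slab_bitmap & ((uint32_t)1 << i))) {
            return i;
        }
    }
    return -1;
}

/* Append a fresh, empty chunk and make it current. Returns -1 on OOM. */
static inline int toy_expand(ToyHead *h) {
    ToyChunk *c = calloc(1, sizeof *c);
    if (!c) {
        return -1;
    }
    if (h->current_chunk) {
        h->current_chunk->next = c;
    } else {
        h->first_chunk = c;
    }
    h->current_chunk = c;
    h->total_chunks++;
    return 0;
}

/* Claim one slab, expanding when the current chunk is missing or full.
 * Returns 0 and fills out_chunk/out_idx, or -1 only on real OOM. */
static inline int toy_refill(ToyHead *h, ToyChunk **out_chunk, int *out_idx) {
    assert(h && out_chunk && out_idx);
    if (!h->current_chunk && toy_expand(h) < 0) {
        return -1;               /* the NULL case expands instead of being skipped */
    }
    int idx = toy_find_free_slab(h->current_chunk);
    if (idx < 0) {
        if (toy_expand(h) < 0) {
            return -1;
        }
        idx = toy_find_free_slab(h->current_chunk);
        if (idx < 0) {
            return -1;
        }
    }
    h->current_chunk->slab_bitmap |= (uint32_t)1 << idx;
    *out_chunk = h->current_chunk;
    *out_idx = idx;
    return 0;
}
```

In this model the 33rd refill lands in slab 0 of a second chunk instead of returning NULL, which is the behavior the success criteria ask of the real path.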

Task 3: Fix Bitmap Logic Inconsistency

The post-expansion verification at line 202 still reads active_slabs (non-atomic), even though the bitmap was meant to be the MT-safe source of truth:

// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {

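// AFTER (consistent with bitmap approach):
// Read the capacity only after the NULL check, and avoid 1U << 32 (undefined in C).
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (new_cap >= 32) ? 0xFFFFFFFFu : ((1U << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {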

Root Cause Hypothesis

Most Likely: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion

Evidence:

  1. Error is for class 4 (triggers mid_simple_refill)
  2. No expansion messages printed (expansion code not reached)
  3. OOM error from superslab_allocate() at line 480 (not mid_simple's line 269)
  4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow

Why Task Agent's fix was better:

  • Checked active_slabs < chunk_cap at line 172 (BEFORE mid_simple_refill)
  • Even though non-atomic, it caught most exhaustion cases
  • Triggered expansion before mid_simple_refill could bypass it

Why my fix is worse:

  • Uses bitmap check which might not match mid_simple's active_slabs check
  • Race condition: bitmap might show "not full" but active_slabs shows "full"
  • Mid_simple sees "full" (via active_slabs), bypasses expansion, calls allocate() → OOM
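
One way to rule out disagreement between the two checks is to stop maintaining two separate views and derive the active count from the bitmap. The sketch below assumes the bitmap is the source of truth and that no bits at or above the capacity are ever set. active_from_bitmap and chunk_has_free_slab are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Count set bits, clearing the lowest set bit on each iteration. */
static inline int active_from_bitmap(uint32_t bitmap) {
    int n = 0;
    while (bitmap) {
        bitmap &= bitmap - 1u;
        n++;
    }
    return n;
}

/* Both the bitmap path and the mid-simple path could ask this one question. */
static inline int chunk_has_free_slab(uint32_t bitmap, int cap) {
    assert(cap >= 0 && cap <= 32);
    return active_from_bitmap(bitmap) < cap;
}
```

Because both answers come from a single load of the same word, the "bitmap says not full, active_slabs says full" case cannot arise within one check.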

Short-term (Quick Fix):

  1. Disable mid_simple_refill for classes 4-7 to force normal path
  2. Verify expansion works on normal path
  3. If successful, this proves mid_simple is the culprit

Long-term (Proper Fix):

  1. Add expansion mechanism to mid_simple_refill path
  2. Use consistent bitmap checks across all paths
  3. Remove dependency on non-atomic active_slabs for exhaustion detection
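
For the third item, exhaustion detection and slab claiming can be folded into a single atomic step so that no thread acts on a stale count. The following is a minimal sketch using C11 atomics, assuming the bitmap could be declared _Atomic uint32_t. Whether the real slab_bitmap field is atomic is not stated in this analysis, and claim_free_slab is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomically claim the lowest free slab in [0, cap).
 * Returns the claimed index, or -1 if every slab is taken. */
static inline int claim_free_slab(_Atomic uint32_t *bitmap, int cap) {
    assert(bitmap && cap >= 0 && cap <= 32);
    uint32_t old = atomic_load_explicit(bitmap, memory_order_relaxed);
    for (;;) {
        int idx = -1;
        for (int i = 0; i < cap; i++) {
            if (!(old & ((uint32_t)1 << i))) {
                idx = i;
                break;
            }
        }
        if (idx < 0) {
            return -1;           /* full: the caller should expand here */
        }
        uint32_t desired = old | ((uint32_t)1 << idx);
        /* On failure `old` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(bitmap, &old, desired,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
            return idx;
        }
    }
}
```

A return value of -1 is then the only exhaustion signal, and every path that sees it can call the same expansion routine.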

Success Criteria

  • 4T test: 50/50 runs pass (100% stability)
  • Expansion messages appear when SuperSlab exhausted
  • No "superslab_refill returned NULL (OOM)" errors
  • Performance maintained (> 900K ops/s on 4T)
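
The 50/50 criterion is easier to enforce when the repeated runs are scripted. The sketch below counts failed runs of an arbitrary shell command using the standard system() call. count_failed_runs is a hypothetical helper, and the benchmark command line is left to the caller because it is not given in this analysis.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Run `cmd` through the shell `runs` times and return how many runs failed.
 * Any nonzero status from system() counts as a failure, which includes a
 * nonzero exit code and termination by a signal such as SIGSEGV or SIGABRT. */
static inline int count_failed_runs(const char *cmd, int runs) {
    assert(cmd != NULL && runs >= 0);
    int failed = 0;
    for (int i = 0; i < runs; i++) {
        int status = system(cmd);
        if (status != 0) {
            failed++;
            fprintf(stderr, "run %d/%d failed (raw status %d)\n",
                    i + 1, runs, status);
        }
    }
    return failed;
}
```

Calling it with the 4T benchmark command and runs set to 50 should return 0 before the fix is considered stable.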

Next Steps

  1. Immediate: Add debug logging to identify execution path
  2. Test: Disable mid_simple_refill and verify expansion works
  3. Fix: Add expansion to mid_simple path OR use bitmap consistently
  4. Verify: Run 50+ tests to achieve 100% stability

Generated: 2025-11-08

Investigator: Claude Code (Sonnet 4.5)

Critical: User requirement is 100% stability, no tolerance for failures