hakmem/docs/analysis/BITMAP_FIX_FAILURE_ANALYSIS.md


Bitmap Fix Failure Analysis

Executive Summary

Status: REGRESSION - Bitmap fix made stability WORSE

  • Before (Task Agent's active_slabs fix): 95% pass rate (19/20 runs)
  • After (my bitmap fix): 80% pass rate (16/20 runs)
  • Regression: -15 percentage points (3 additional failures: 1/20 → 4/20)

Problem Statement

User's Critical Requirement

"A memory library is unusable if it crashes even 5% of the time." (translated from Japanese)

Target: 100% stability (50+ runs with 0 failures)

Current: 80% stability (UNACCEPTABLE and WORSE than before)

Error Symptoms

4T Crash Pattern

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=4
  prev_ss=0x7da378400000
  active=32
  bitmap=0xffffffff
  errno=12

free(): invalid pointer

Key Observations:

  1. Class 4 consistently fails
  2. bitmap=0xffffffff (all 32 slabs occupied)
  3. active=32 (matches bitmap)
  4. No expansion messages printed (expansion code NOT triggered!)

Code Analysis

My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)

SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    uint32_t full_bitmap = (1U << chunk_cap) - 1;  // intended: 32 slabs → 0xFFFFFFFF, but 1U << 32 is undefined behavior in C

    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);

        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }

        current_chunk = head->current_chunk;
        tls->ss = current_chunk;

        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx, current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}
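One detail in the snippet above may matter for the hypotheses that follow: when chunk_cap is 32, the expression (1U << chunk_cap) shifts a 32-bit value by its full width, which the C standard leaves undefined. One plausible outcome on x86 is that the shift count wraps to 0, so full_bitmap becomes 0 rather than 0xFFFFFFFF and the "!= full_bitmap" test succeeds even for a completely full chunk. Whether that actually happens in this build has not been verified. The helper below is a minimal sketch of a mask computation that never shifts by 32; the names full_slab_mask and slab_bitmap_is_full are hypothetical and do not come from the HAKMEM source.

```c
#include <stdint.h>

/* Hypothetical helper (not from the HAKMEM source): returns a mask with the
 * low `cap` bits set, for 0 <= cap <= 32, without ever shifting by 32. */
static inline uint32_t full_slab_mask(int cap)
{
    if (cap <= 0)
        return 0u;
    if (cap >= 32)
        return UINT32_MAX;
    return (UINT32_C(1) << cap) - 1u;
}

/* True when every one of the `cap` slabs is marked used in `bitmap`.
 * A chunk with cap <= 0 has no slabs to hand out and is reported as full. */
static inline int slab_bitmap_is_full(uint32_t bitmap, int cap)
{
    uint32_t mask = full_slab_mask(cap);
    return (bitmap & mask) == mask;
}
```

With this shape, the exhaustion test in the snippet would read slab_bitmap_is_full(current_chunk->slab_bitmap, chunk_cap), and the crash log's bitmap=0xffffffff with 32 slabs would be classified as full.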

Critical Issue: Expansion Message NOT Printed!

The error output shows:

  • TLS cache adaptation messages
  • OOM error from superslab_allocate()
  • NO expansion messages ("SuperSlab chunk exhausted...")

This means the expansion branch (lines 182-210) is NOT being executed!

Hypothesis

Why Is Expansion Not Triggered?

Option 1: current_chunk is NULL

  • If current_chunk is NULL, we skip the entire if block (line 166)
  • Continue to normal refill logic without expansion

Option 2: slab_bitmap != full_bitmap is TRUE (unexpected)

  • If bitmap doesn't match expected full value, we think there are free slabs
  • Don't trigger expansion
  • But later code finds no free slabs → OOM

Option 3: Execution reaches expansion but crashes before printing

  • Race condition between check and expansion
  • Another thread modifies state between line 174 and line 182

Option 4: Wrong code path entirely

  • Error comes from mid_simple_refill path (line 264)
  • Which bypasses my expansion code
  • Calls superslab_allocate() directly → OOM

Mid-Simple Refill Path (MOST LIKELY)

// Line 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // ← Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // A failure here would be reported at line 269, but the observed error is at line 492
        return NULL;
    }
}

Problem: Class 4 triggers mid_simple_refill (class_idx >= 4), which:

  1. Checks active_slabs < tls_cap (non-atomic, race condition)
  2. If exhausted, calls superslab_allocate() directly
  3. Does NOT use the dynamic expansion mechanism
  4. Returns NULL on OOM
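Item 1 in the list above is a check-then-act sequence: the counter is read, and the slab is taken later, so two threads can both see "not full" and compete for the same slab. A common way to close that window is to fold the test and the update into a single compare-and-swap on the bitmap. The sketch below uses C11 atomics on a hypothetical DemoChunk type; the real SuperSlab layout is not shown in this document, and its fields may not be declared atomic.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-chunk state (not the real SuperSlab). */
typedef struct {
    _Atomic uint32_t slab_bitmap;   /* bit i set => slab i is in use */
    int capacity;                   /* number of slabs, expected 1..32 */
} DemoChunk;

/* Try to claim one free slab. Returns its index, or -1 if the chunk is full.
 * The "is there a free bit" test and the "set the bit" update happen in one
 * compare-exchange, so two threads cannot both claim the same slab. */
static inline int demo_claim_slab(DemoChunk *c)
{
    if (c->capacity <= 0)
        return -1;
    uint32_t mask = (c->capacity >= 32) ? UINT32_MAX
                                        : ((UINT32_C(1) << c->capacity) - 1u);
    uint32_t old = atomic_load_explicit(&c->slab_bitmap, memory_order_acquire);
    for (;;) {
        uint32_t free_bits = ~old & mask;
        if (free_bits == 0)
            return -1;                      /* exhausted: caller should expand */
        /* Isolate the lowest free bit, then find its index portably. */
        uint32_t bit = free_bits & (~free_bits + 1u);
        int idx = 0;
        while ((bit >> idx) != 1u)
            idx++;
        if (atomic_compare_exchange_weak_explicit(&c->slab_bitmap, &old,
                                                  old | bit,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire))
            return idx;
        /* CAS failed: `old` now holds the current bitmap, so just retry. */
    }
}
```

A return value of -1 is the point at which a caller would route into the expansion mechanism instead of calling superslab_allocate() directly.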

Investigation Tasks

Task 1: Add Debug Logging

Add logging to determine execution path:

  1. Entry point logging:
fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
        class_idx, (void*)current_chunk, (void*)tls->ss);
  2. Bitmap check logging:
fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
        current_chunk->slab_bitmap, full_bitmap, chunk_cap,
        (current_chunk->slab_bitmap == full_bitmap));
  3. Mid-simple path logging:
fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
        class_idx, tiny_mid_refill_simple_enabled(),
        (void*)tls->ss,
        tls->ss ? tls->ss->active_slabs : -1,
        tls->ss ? ss_slabs_capacity(tls->ss) : -1);
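Because the suspected failure involves a race, fprintf calls on the refill path may shift thread timing enough to hide or change the crash. A lighter complement to the log lines above is a set of per-path hit counters that are bumped with a relaxed atomic add and dumped once at exit. The sketch below is an assumption about how that could look; the enum values, g_path_hits, and the function names are all hypothetical and not part of HAKMEM.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical labels for the branches discussed in this analysis. */
enum refill_path {
    PATH_NO_CHUNK,        /* current_chunk was NULL */
    PATH_BITMAP_HAS_FREE, /* bitmap check said "free slabs available" */
    PATH_EXPAND,          /* bitmap check said "full", expansion attempted */
    PATH_MID_SIMPLE,      /* mid_simple_refill branch taken */
    PATH_COUNT
};

/* Static storage, so every counter starts at zero. */
static _Atomic unsigned long g_path_hits[PATH_COUNT];

/* Record one pass through a branch; relaxed ordering is enough for counting. */
static inline void path_hit(enum refill_path p)
{
    atomic_fetch_add_explicit(&g_path_hits[p], 1UL, memory_order_relaxed);
}

static inline unsigned long path_hits(enum refill_path p)
{
    return atomic_load_explicit(&g_path_hits[p], memory_order_relaxed);
}

/* Print all counters, e.g. from an atexit() handler or a crash handler. */
static inline void path_dump(FILE *out)
{
    static const char *const names[PATH_COUNT] = {
        "no_chunk", "bitmap_has_free", "expand", "mid_simple"
    };
    for (int i = 0; i < PATH_COUNT; i++)
        fprintf(out, "[PATH] %s=%lu\n", names[i],
                path_hits((enum refill_path)i));
}
```

If a failing run shows a non-zero mid_simple count and a zero expand count for class 4, that would support the mid-simple hypothesis; a non-zero bitmap_has_free count on a full chunk would instead point at the bitmap comparison.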

Task 2: Fix Mid-Simple Refill Path

Two options:

Option A: Disable mid_simple_refill for testing

// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {

Option B: Add expansion to mid_simple_refill

// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded
if (tls->ss && tls->ss->active_slabs >= ss_slabs_capacity(tls->ss)) {  // tls_cap is out of scope here, so recompute
    // Try to expand current SuperSlab instead of allocating new one
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}

Task 3: Fix Bitmap Logic Inconsistency

The post-expansion verification at line 202 uses active_slabs (non-atomic), which contradicts my own rationale that the bitmap should be used for MT-safety:

// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {

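// AFTER (consistent with bitmap approach; NULL is checked before the capacity
// lookup, and the mask avoids the undefined 1U << 32 when capacity is 32):
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (new_cap >= 32) ? UINT32_MAX : ((1U << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {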

Root Cause Hypothesis

Most Likely: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion

Evidence:

  1. Error is for class 4 (triggers mid_simple_refill)
  2. No expansion messages printed (expansion code not reached)
  3. OOM error from superslab_allocate() at line 480 (not mid_simple's line 269)
  4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow

Why Task Agent's fix was better:

  • Checked active_slabs < chunk_cap at line 172 (BEFORE mid_simple_refill)
  • Even though non-atomic, it caught most exhaustion cases
  • Triggered expansion before mid_simple_refill could bypass it

Why my fix is worse:

  • Uses bitmap check which might not match mid_simple's active_slabs check
  • Race condition: bitmap might show "not full" but active_slabs shows "full"
  • Mid_simple sees "full" (via active_slabs), bypasses expansion, calls allocate() → OOM

Short-term (Quick Fix):

  1. Disable mid_simple_refill for classes 4-7 to force the normal path
  2. Verify expansion works on normal path
  3. If successful, this proves mid_simple is the culprit

Long-term (Proper Fix):

  1. Add expansion mechanism to mid_simple_refill path
  2. Use consistent bitmap checks across all paths
  3. Remove dependency on non-atomic active_slabs for exhaustion detection

Success Criteria

  • 4T test: 50/50 runs pass (100% stability)
  • Expansion messages appear when SuperSlab exhausted
  • No "superslab_refill returned NULL (OOM)" errors
  • Performance maintained (> 900K ops/s on 4T)
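It may help to quantify what a clean 50/50 result can and cannot show. Assuming runs are independent and each fails with the same probability p (an idealization; real crash rates may depend on load and timing), the largest p still compatible with n clean runs at 95% confidence follows from the probability that all n runs pass:

```latex
\[
P(\text{all } n \text{ runs pass}) = (1-p)^n
\quad\Longrightarrow\quad
p_{95} = 1 - 0.05^{1/n}
\]
\[
n = 50:\quad p_{95} = 1 - 0.05^{1/50} \approx 0.058,
\qquad
p_{95} \le 0.01 \;\Rightarrow\; n \ge \frac{\ln 0.05}{\ln 0.99} \approx 298.1
\]
```

Under these assumptions, 50 clean runs only bound the crash rate below roughly 5.8%, which does not yet rule out the 5% rate the user called unusable; about 299 clean runs would be needed to bound it below 1%.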

Next Steps

  1. Immediate: Add debug logging to identify execution path
  2. Test: Disable mid_simple_refill and verify expansion works
  3. Fix: Add expansion to mid_simple path OR use bitmap consistently
  4. Verify: Run 50+ tests to achieve 100% stability
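Step 4 can be automated with a small driver that launches the test binary repeatedly and counts runs that crash or exit non-zero. The following is a minimal POSIX sketch; count_failed_runs is a hypothetical name, and the command line of the 4T test is not given in this document, so the caller has to supply it.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the NULL-terminated command `argv` `runs` times, one after another.
 * Returns the number of failed runs (non-zero exit status, death by signal,
 * or failure to launch), or -1 if fork() itself fails. */
int count_failed_runs(char *const argv[], int runs)
{
    int failures = 0;
    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);             /* exec failed: counts as a failed run */
        }
        int status = 0;
        if (waitpid(pid, &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}
```

A caller would build an argv array for the 4T benchmark, call count_failed_runs(argv, 50), and treat any result other than 0 as a failed stability check. A crash such as "free(): invalid pointer" terminates the child by signal, so it is counted through the !WIFEXITED branch.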

Generated: 2025-11-08

Investigator: Claude Code (Sonnet 4.5)

Critical: User requirement is 100% stability, no tolerance for failures