# Bitmap Fix Failure Analysis

## Executive Summary

**Status**: ❌ REGRESSION - Bitmap fix made stability WORSE

- Before (Task Agent's active_slabs fix): 95% (19/20)
- After (My bitmap fix): 80% (16/20)
- **Regression**: -15 percentage points (3 additional failures: 4 failed runs instead of 1)

## Problem Statement

### User's Critical Requirement

> "メモリーライブラリーなんて 5%でもクラッシュおこったらつかえない"
>
> "A memory library is UNUSABLE if it crashes even 5% of the time"

**Target**: 100% stability (50+ runs with 0 failures)

**Current**: 80% stability (UNACCEPTABLE and WORSE than before)

## Error Symptoms

### 4T Crash Pattern

```
[DEBUG] superslab_refill returned NULL (OOM) detail: class=4 prev_ss=0x7da378400000 active=32 bitmap=0xffffffff errno=12
free(): invalid pointer
```

**Key Observations**:

1. Class 4 consistently fails
2. bitmap=0xffffffff (all 32 slabs occupied)
3. active=32 (matches bitmap)
4. No expansion messages printed (expansion code NOT triggered!)

## Code Analysis

### My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)

```c
SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    // Intended: 32 slabs → 0xFFFFFFFF. Caution: when chunk_cap == 32, `1U << 32`
    // is undefined behavior in C, so full_bitmap is not guaranteed to be 0xFFFFFFFF.
    uint32_t full_bitmap = (1U << chunk_cap) - 1;

    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);
        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }
        current_chunk = head->current_chunk;
        tls->ss = current_chunk;

        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx,
                    current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}
```

### Critical Issue: Expansion Message NOT Printed!

The error output shows:

- ✅ TLS cache adaptation messages
- ✅ OOM error from superslab_allocate()
- ❌ **NO expansion messages** ("SuperSlab chunk exhausted...")

**This means the expansion code (lines 182-210) is NOT being executed!**

## Hypothesis

### Why Expansion Not Triggered?

**Option 1**: `current_chunk` is NULL

- If `current_chunk` is NULL, we skip the entire if block (line 166)
- Execution continues to the normal refill logic without expansion

**Option 2**: `slab_bitmap != full_bitmap` is TRUE (unexpected)

- If the bitmap doesn't match the expected full value, we think there are free slabs
- Expansion is not triggered
- But later code finds no free slabs → OOM
- One way this can happen: if `chunk_cap` is 32, `1U << chunk_cap` shifts a 32-bit value by its full width, which is undefined behavior in C, so `full_bitmap` may not be `0xFFFFFFFF` and a completely full bitmap would compare as "not full"

**Option 3**: Execution reaches expansion but crashes before printing

- Race condition between check and expansion
- Another thread modifies state between line 174 and line 182

**Option 4**: Wrong code path entirely

- Error comes from the mid_simple_refill path (line 264)
- Which bypasses my expansion code
- Calls `superslab_allocate()` directly → OOM

### Mid-Simple Refill Path (MOST LIKELY)

```c
// Line 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // ← Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // This path logs at line 269, but the observed error comes from line 492 instead
        return NULL;
    }
}
```

**Problem**: Class 4 triggers mid_simple_refill (class_idx >= 4), which:

1. Checks `active_slabs < tls_cap` (non-atomic, race condition)
2. If exhausted, calls `superslab_allocate()` directly
3. Does NOT use the dynamic expansion mechanism
4. Returns NULL on OOM

## Investigation Tasks

### Task 1: Add Debug Logging

Add logging to determine the execution path:

1. **Entry point logging**:
   ```c
   fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
           class_idx, (void*)current_chunk, (void*)tls->ss);
   ```

2. **Bitmap check logging**:
   ```c
   fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
           current_chunk->slab_bitmap, full_bitmap, chunk_cap,
           (current_chunk->slab_bitmap == full_bitmap));
   ```

3. **Mid-simple path logging**:
   ```c
   fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
           class_idx, tiny_mid_refill_simple_enabled(), (void*)tls->ss,
           tls->ss ? tls->ss->active_slabs : -1,
           tls->ss ? ss_slabs_capacity(tls->ss) : -1);
   ```

### Task 2: Fix Mid-Simple Refill Path

Two options:

**Option A: Disable mid_simple_refill for testing**

```c
// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
```

**Option B: Add expansion to mid_simple_refill**

```c
// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded
// (capacity is recomputed here because tls_cap is scoped to the earlier if block)
if (tls->ss && tls->ss->active_slabs >= ss_slabs_capacity(tls->ss)) {
    // Try to expand current SuperSlab instead of allocating new one
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}
```

### Task 3: Fix Bitmap Logic Inconsistency

The verification at line 202 uses `active_slabs` (non-atomic), but I said the bitmap should be used for MT-safety:

```c
// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {

// AFTER (consistent with bitmap approach; NULL is handled before the capacity
// lookup, and the 64-bit shift stays well defined when the capacity is 32):
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (uint32_t)((1ULL << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {
```

## Root Cause Hypothesis

**Most Likely**: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion

**Evidence**:

1. Error is for class 4 (triggers mid_simple_refill)
2. No expansion messages printed (expansion code not reached)
3. OOM error from `superslab_allocate()` at line 480 (not mid_simple's line 269)
4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow

**Why Task Agent's fix was better**:

- Checked `active_slabs < chunk_cap` at line 172 (BEFORE mid_simple_refill)
- Even though non-atomic, it caught most exhaustion cases
- Triggered expansion before mid_simple_refill could bypass it

**Why my fix is worse**:

- Uses a bitmap check which might not match mid_simple's active_slabs check
- Race condition: bitmap might show "not full" while active_slabs shows "full"
- Mid_simple sees "full" (via active_slabs), bypasses expansion, calls allocate() → OOM

## Recommended Fix

**Short-term (Quick Fix)**:

1. Disable mid_simple_refill for classes 4-7 to force the normal path
2. Verify expansion works on the normal path
3. If successful, this proves mid_simple is the culprit

**Long-term (Proper Fix)**:

1. Add the expansion mechanism to the mid_simple_refill path
2. Use consistent bitmap checks across all paths
3. Remove the dependency on non-atomic active_slabs for exhaustion detection

## Success Criteria

- 4T test: 50/50 runs pass (100% stability)
- Expansion messages appear when a SuperSlab is exhausted
- No "superslab_refill returned NULL (OOM)" errors
- Performance maintained (> 900K ops/s on 4T)

## Next Steps

1. **Immediate**: Add debug logging to identify the execution path
2. **Test**: Disable mid_simple_refill and verify expansion works
3. **Fix**: Add expansion to the mid_simple path OR use the bitmap consistently
4. **Verify**: Run 50+ tests to achieve 100% stability

---

**Generated**: 2025-11-08

**Investigator**: Claude Code (Sonnet 4.5)

**Critical**: User requirement is 100% stability, no tolerance for failures