# Bitmap Fix Failure Analysis
## Executive Summary
**Status**: ❌ REGRESSION - Bitmap fix made stability WORSE
- Before (Task Agent's active_slabs fix): 95% (19/20)
- After (My bitmap fix): 80% (16/20)
- **Regression**: -15 percentage points (3 additional failures, 4 of 20 runs failing in total)
## Problem Statement
### User's Critical Requirement
> "メモリーライブラリーなんて 5でもクラッシュおこったらつかえない"
>
> "A memory library with even 5% crash rate is UNUSABLE"
**Target**: 100% stability (50+ runs with 0 failures)
**Current**: 80% stability (UNACCEPTABLE and WORSE than before)
## Error Symptoms
### 4T Crash Pattern
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4
prev_ss=0x7da378400000
active=32
bitmap=0xffffffff
errno=12
free(): invalid pointer
```
**Key Observations**:
1. Class 4 consistently fails
2. bitmap=0xffffffff (all 32 slabs occupied)
3. active=32 (matches bitmap)
4. No expansion messages printed (expansion code NOT triggered!)
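
Observations 2 and 3 can be checked mechanically: if each set bit in the bitmap stands for one occupied slab, the number of set bits should equal `active`. The sketch below assumes that one-bit-per-slab encoding; `popcount32` and `slab_state_consistent` are hypothetical helper names, not hakmem functions.

```c
#include <assert.h>
#include <stdint.h>

/* Portable population count: number of set bits in a 32-bit word. */
static int popcount32(uint32_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Returns 1 when the counter agrees with the bitmap, 0 otherwise.
 * Hypothetical helper: assumes one bit per occupied slab. */
static int slab_state_consistent(uint32_t slab_bitmap, int active_slabs) {
    return popcount32(slab_bitmap) == active_slabs;
}
```

For the logged values, `slab_state_consistent(0xffffffff, 32)` holds, which matches the reading that the chunk was genuinely full rather than mis-counted.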
## Code Analysis
### My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)
```c
SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    uint32_t full_bitmap = (1U << chunk_cap) - 1;  // e.g., 32 slabs → 0xFFFFFFFF
    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);
        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }
        current_chunk = head->current_chunk;
        tls->ss = current_chunk;
        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx, current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}
```
### Critical Issue: Expansion Message NOT Printed!
The error output shows:
- ✅ TLS cache adaptation messages
- ✅ OOM error from superslab_allocate()
- ❌ **NO expansion messages** ("SuperSlab chunk exhausted...")
**This means the expansion code (lines 182-210) is NOT being executed!**
## Hypothesis
### Why Expansion Not Triggered?
**Option 1**: `current_chunk` is NULL
- If `current_chunk` is NULL, we skip the entire if block (line 166)
- Continue to normal refill logic without expansion
**Option 2**: `slab_bitmap != full_bitmap` is TRUE (unexpected)
- If bitmap doesn't match expected full value, we think there are free slabs
- Don't trigger expansion
- But later code finds no free slabs → OOM
**Option 3**: Execution reaches expansion but crashes before printing
- Race condition between check and expansion
- Another thread modifies state between line 174 and line 182
**Option 4**: Wrong code path entirely
- Error comes from mid_simple_refill path (line 264)
- Which bypasses my expansion code
- Calls `superslab_allocate()` directly → OOM
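
One concrete way Option 2 could arise, assuming `ss_slabs_capacity()` can return 32 and `slab_bitmap` is a 32-bit field (the logged `active=32` and `bitmap=0xffffffff` suggest both): in C, shifting a 32-bit value by 32 is undefined behavior, so `(1U << chunk_cap) - 1` is not guaranteed to produce `0xFFFFFFFF`. On x86 a runtime shift count is typically masked to 5 bits, in which case the expression would evaluate to 0 and a full bitmap would never compare equal to `full_bitmap`. A minimal sketch of a well-defined mask, using the hypothetical helper name `full_slab_mask`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: mask with the low `cap` bits set.
 * Valid for 0 <= cap <= 32. */
static uint32_t full_slab_mask(int cap) {
    /* Shift in 64 bits so cap == 32 is well defined, then truncate. */
    return (uint32_t)((UINT64_C(1) << cap) - 1);
}
```

Whether this is what actually happens in the failing runs is unverified; the bitmap check logging in Task 1 prints `full_bitmap` and would confirm or rule it out.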
### Mid-Simple Refill Path (MOST LIKELY)
```c
// Lines 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // A failure here is logged at line 269, but the observed error comes from line 492
        return NULL;
    }
}
```
**Problem**: Class 4 triggers mid_simple_refill (class_idx >= 4), which:
1. Checks `active_slabs < tls_cap` (non-atomic, race condition)
2. If exhausted, calls `superslab_allocate()` directly
3. Does NOT use the dynamic expansion mechanism
4. Returns NULL on OOM
## Investigation Tasks
### Task 1: Add Debug Logging
Add logging to determine execution path:
1. **Entry point logging**:
```c
fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
class_idx, (void*)current_chunk, (void*)tls->ss);
```
2. **Bitmap check logging**:
```c
fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
current_chunk->slab_bitmap, full_bitmap, chunk_cap,
(current_chunk->slab_bitmap == full_bitmap));
```
3. **Mid-simple path logging**:
```c
fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
class_idx, tiny_mid_refill_simple_enabled(),
(void*)tls->ss,
tls->ss ? tls->ss->active_slabs : -1,
tls->ss ? ss_slabs_capacity(tls->ss) : -1);
```
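
Because `fprintf` to a shared stream is serialized across threads, heavy logging may change thread timing and hide the race being investigated. A lighter complement to the logging above is a set of per-path hit counters that are dumped once at exit. This is a sketch only: the enum labels and function names are hypothetical and would need to be placed at the matching branches of `superslab_refill`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical path labels for superslab_refill; not part of hakmem. */
enum refill_path {
    PATH_CHUNK_HAS_FREE,  /* bitmap check found a free slab */
    PATH_EXPAND,          /* expansion branch taken */
    PATH_MID_SIMPLE,      /* mid_simple_refill branch taken */
    PATH_FRESH_ALLOC,     /* direct superslab_allocate() call */
    PATH_COUNT
};

/* Static storage, so every counter starts at zero. */
static atomic_ulong g_refill_path_hits[PATH_COUNT];

/* Record one pass through a path; relaxed ordering is enough for statistics. */
static void refill_path_hit(enum refill_path p) {
    atomic_fetch_add_explicit(&g_refill_path_hits[p], 1, memory_order_relaxed);
}

/* Read a counter, for example from an atexit() dump. */
static unsigned long refill_path_count(enum refill_path p) {
    return atomic_load_explicit(&g_refill_path_hits[p], memory_order_relaxed);
}
```

If `PATH_EXPAND` stays at zero while `PATH_MID_SIMPLE` and `PATH_FRESH_ALLOC` grow in failing runs, that would support the mid-simple hypothesis without relying on message ordering.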
### Task 2: Fix Mid-Simple Refill Path
Two options:
**Option A: Disable mid_simple_refill for testing**
```c
// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
```
**Option B: Add expansion to mid_simple_refill**
```c
// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded.
// tls_cap is declared inside the earlier if-block in the excerpt above,
// so the capacity is queried directly here.
if (tls->ss && tls->ss->active_slabs >= ss_slabs_capacity(tls->ss)) {
    // Try to expand current SuperSlab instead of allocating new one
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}
```
### Task 3: Fix Bitmap Logic Inconsistency
The verification at line 202 still uses the non-atomic `active_slabs`, even though the fix is supposed to rely on the bitmap for MT-safety:
```c
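// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
// AFTER (consistent with bitmap approach):
// Query the capacity only when current_chunk is non-NULL, and shift in 64 bits
// because 1U << 32 is undefined behavior in C when the capacity is 32.
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (uint32_t)((1ULL << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {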
```
## Root Cause Hypothesis
**Most Likely**: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion
**Evidence**:
1. Error is for class 4 (triggers mid_simple_refill)
2. No expansion messages printed (expansion code not reached)
3. OOM error from `superslab_allocate()` at line 480 (not mid_simple's line 269)
4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow
**Why Task Agent's fix was better**:
- Checked `active_slabs < chunk_cap` at line 172 (BEFORE mid_simple_refill)
- Even though non-atomic, it caught most exhaustion cases
- Triggered expansion before mid_simple_refill could bypass it
**Why my fix is worse**:
- Uses bitmap check which might not match mid_simple's active_slabs check
- Race condition: bitmap might show "not full" but active_slabs shows "full"
- Mid_simple sees "full" (via active_slabs), bypasses expansion, calls allocate() → OOM
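
The mismatch described above can be reproduced deterministically in miniature. The sketch assumes that claiming a slab updates the counter and the bitmap in two separate steps, counter first; that ordering is an assumption made for illustration, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical miniature of a chunk: two fields describing the same state. */
struct mini_chunk {
    uint32_t slab_bitmap;  /* one bit per occupied slab */
    int active_slabs;      /* count of occupied slabs */
};

/* Step 1 of a claim: bump the counter (ordering is an assumption). */
static void claim_step_counter(struct mini_chunk* c) {
    c->active_slabs++;
}

/* Step 2 of a claim: publish the bit for slab `idx`. */
static void claim_step_bitmap(struct mini_chunk* c, int idx) {
    c->slab_bitmap |= (uint32_t)1 << idx;
}

/* Exhaustion as the mid_simple path sees it. */
static int full_by_counter(const struct mini_chunk* c, int cap) {
    return c->active_slabs >= cap;
}

/* Exhaustion as the bitmap check sees it (valid for cap <= 32). */
static int full_by_bitmap(const struct mini_chunk* c, int cap) {
    uint32_t full = (uint32_t)((UINT64_C(1) << cap) - 1);
    return c->slab_bitmap == full;
}
```

Starting from 31 occupied slabs, a reader that runs between `claim_step_counter` and `claim_step_bitmap` gets "full" from the counter and "not full" from the bitmap, which is the window in which the two paths can disagree.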
## Recommended Fix
**Short-term (Quick Fix)**:
1. Disable mid_simple_refill for classes 4-7 to force the normal path
2. Verify expansion works on normal path
3. If successful, this proves mid_simple is the culprit
**Long-term (Proper Fix)**:
1. Add expansion mechanism to mid_simple_refill path
2. Use consistent bitmap checks across all paths
3. Remove dependency on non-atomic active_slabs for exhaustion detection
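
One possible shape for items 2 and 3, sketched under the assumption that the bitmap can be made a C11 atomic and treated as the single source of truth: a slab is claimed with a compare-and-swap loop, and "no zero bit left" is the only exhaustion signal. `claim_free_slab` is a hypothetical name, and the existing `superslab_find_free_slab()` may already work differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Try to claim one free slab in an atomic bitmap (bit set = occupied).
 * Returns the claimed index, or -1 when all `cap` slabs are taken.
 * Hypothetical sketch; assumes 1 <= cap <= 32. */
static int claim_free_slab(_Atomic uint32_t* bitmap, int cap) {
    uint32_t cur = atomic_load_explicit(bitmap, memory_order_acquire);
    for (;;) {
        int idx = -1;
        for (int i = 0; i < cap; i++) {
            if (!(cur & ((uint32_t)1 << i))) {
                idx = i;  /* lowest free slab in this snapshot */
                break;
            }
        }
        if (idx < 0) {
            return -1;  /* exhausted: the caller should expand */
        }
        uint32_t next = cur | ((uint32_t)1 << idx);
        /* On failure, `cur` is refreshed with the latest value and we retry. */
        if (atomic_compare_exchange_weak_explicit(bitmap, &cur, next,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire)) {
            return idx;
        }
    }
}
```

With this shape both the normal path and the mid_simple path would see the same exhaustion condition, because neither consults a separately updated counter.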
## Success Criteria
- 4T test: 50/50 runs pass (100% stability)
- Expansion messages appear when SuperSlab exhausted
- No "superslab_refill returned NULL (OOM)" errors
- Performance maintained (> 900K ops/s on 4T)
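
As a rough guide to what the 50-run criterion can and cannot show, assume the runs are independent with a fixed per-run failure probability $p$ (an idealization). The chance of a clean streak is then:

$$
P(\text{50 consecutive passes}) = (1 - p)^{50}
$$

For $p = 0.20$ (the current 80% stability) this is about $1.4 \times 10^{-5}$, and for $p = 0.05$ it is about $0.077$, so a clean streak makes a failure rate of 5% or higher unlikely. For $p = 0.01$ it is still about $0.61$, so 50 passes alone cannot distinguish 100% stability from a rare residual failure; more runs tighten that bound.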
## Next Steps
1. **Immediate**: Add debug logging to identify execution path
2. **Test**: Disable mid_simple_refill and verify expansion works
3. **Fix**: Add expansion to mid_simple path OR use bitmap consistently
4. **Verify**: Run 50+ tests to achieve 100% stability
---
**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Critical**: User requirement is 100% stability, no tolerance for failures