hakmem/docs/analysis/BITMAP_FIX_FAILURE_ANALYSIS.md
# Bitmap Fix Failure Analysis
## Executive Summary
**Status**: ❌ REGRESSION - Bitmap fix made stability WORSE
- Before (Task Agent's active_slabs fix): 95% (19/20)
- After (My bitmap fix): 80% (16/20)
- **Regression**: -15 percentage points (3 additional failures: 1/20 → 4/20)
## Problem Statement
### User's Critical Requirement
> "メモリーライブラリーなんて 5%でもクラッシュおこったらつかえない"
>
> "A memory library with even a 5% crash rate is UNUSABLE"

**Target**: 100% stability (50+ runs with 0 failures)

**Current**: 80% stability (UNACCEPTABLE and WORSE than before)
## Error Symptoms
### 4T Crash Pattern
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4
prev_ss=0x7da378400000
active=32
bitmap=0xffffffff
errno=12
free(): invalid pointer
```
**Key Observations**:
1. Class 4 consistently fails
2. bitmap=0xffffffff (all 32 slabs occupied)
3. active=32 (matches bitmap)
4. No expansion messages printed (expansion code NOT triggered!)
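
Observations 2 and 3 can be cross-checked mechanically: if each bit of `slab_bitmap` marks one occupied slab, the number of set bits should equal `active_slabs`. The helper below is a minimal sketch under that assumption; `count_used_slabs` is a hypothetical name, not a hakmem function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of hakmem): number of occupied slabs
 * recorded in a 32-bit slab bitmap, assuming one bit per slab. */
int count_used_slabs(uint32_t bitmap) {
    int used = 0;
    while (bitmap != 0) {
        bitmap &= bitmap - 1;   /* clear the lowest set bit */
        used++;
    }
    return used;
}
```

For the crash output above, `count_used_slabs(0xffffffff)` returns 32, which agrees with `active=32`. A mismatch between the two values in other logs would point at a bookkeeping race between the bitmap and the counter.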
## Code Analysis
### My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)
```c
SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    uint32_t full_bitmap = (1U << chunk_cap) - 1;  // e.g., 32 slabs → 0xFFFFFFFF
    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);
        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }
        current_chunk = head->current_chunk;
        tls->ss = current_chunk;
        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx, current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}
```
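
One detail in the snippet above is worth ruling out, assuming `ss_slabs_capacity()` can return 32 as the inline comment suggests. In C, shifting a 32-bit `unsigned` by 32 is undefined behavior. On x86 the shift instruction masks the count to 5 bits, so a runtime `1U << 32` often evaluates to 1 and the computed mask becomes 0 instead of 0xFFFFFFFF. In that case `slab_bitmap != full_bitmap` would be true for a completely full chunk, which would be consistent with Option 2 below. This is a possibility to check, not a confirmed cause. A defined-behavior way to build the mask might look like this (`full_slab_mask` is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of hakmem): returns a mask with the low
 * `cap` bits set, for 0 <= cap <= 32, without ever shifting by 32. */
uint32_t full_slab_mask(int cap) {
    assert(cap >= 0 && cap <= 32);
    if (cap >= 32) {
        return UINT32_MAX;            /* all 32 slabs: 0xFFFFFFFF */
    }
    return (UINT32_C(1) << cap) - 1;  /* cap in [0, 31]: shift is well defined */
}
```

With this helper, `full_slab_mask(32)` is 0xFFFFFFFF and `full_slab_mask(16)` is 0x0000FFFF, so the comparison against `slab_bitmap` behaves the same for every capacity.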
### Critical Issue: Expansion Message NOT Printed!
The error output shows:
- ✅ TLS cache adaptation messages
- ✅ OOM error from superslab_allocate()
- ❌ **NO expansion messages** ("SuperSlab chunk exhausted...")

**This means the expansion code (lines 182-210) is NOT being executed!**
## Hypothesis
### Why Expansion Not Triggered?
**Option 1**: `current_chunk` is NULL
- If `current_chunk` is NULL, we skip the entire if block (line 166)
- Continue to normal refill logic without expansion

**Option 2**: `slab_bitmap != full_bitmap` is TRUE (unexpected)
- If bitmap doesn't match expected full value, we think there are free slabs
- Don't trigger expansion
- But later code finds no free slabs → OOM

**Option 3**: Execution reaches expansion but crashes before printing
- Race condition between check and expansion
- Another thread modifies state between line 174 and line 182

**Option 4**: Wrong code path entirely
- Error comes from mid_simple_refill path (line 264)
- Which bypasses my expansion code
- Calls `superslab_allocate()` directly → OOM
### Mid-Simple Refill Path (MOST LIKELY)
```c
// Line 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // ← Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // This prints to line 269, but we see error at line 492 instead
        return NULL;
    }
}
```
**Problem**: Class 4 triggers mid_simple_refill (class_idx >= 4), which:
1. Checks `active_slabs < tls_cap` (non-atomic, race condition)
2. If exhausted, calls `superslab_allocate()` directly
3. Does NOT use the dynamic expansion mechanism
4. Returns NULL on OOM
## Investigation Tasks
### Task 1: Add Debug Logging
Add logging to determine execution path:
1. **Entry point logging**:
   ```c
   fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
           class_idx, (void*)current_chunk, (void*)tls->ss);
   ```
2. **Bitmap check logging**:
   ```c
   fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
           current_chunk->slab_bitmap, full_bitmap, chunk_cap,
           (current_chunk->slab_bitmap == full_bitmap));
   ```
3. **Mid-simple path logging**:
   ```c
   fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
           class_idx, tiny_mid_refill_simple_enabled(),
           (void*)tls->ss,
           tls->ss ? tls->ss->active_slabs : -1,
           tls->ss ? ss_slabs_capacity(tls->ss) : -1);
   ```
### Task 2: Fix Mid-Simple Refill Path
Two options:

**Option A: Disable mid_simple_refill for testing**
```c
// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
```
**Option B: Add expansion to mid_simple_refill**
```c
// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded
// (capacity is re-read here because tls_cap is scoped to the earlier if block)
if (tls->ss && tls->ss->active_slabs >= ss_slabs_capacity(tls->ss)) {
    // Try to expand the SuperSlabHead instead of allocating a fresh SuperSlab
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}
```
### Task 3: Fix Bitmap Logic Inconsistency
Line 202 verification uses `active_slabs` (non-atomic), but I said bitmap should be used for MT-safety:
```c
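// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
// AFTER (consistent with bitmap approach; NULL is checked before the capacity is read,
// and cap == 32 is special-cased because 1U << 32 is undefined behavior in C):
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (new_cap >= 32) ? 0xFFFFFFFFu : ((1U << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {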
```
## Root Cause Hypothesis
**Most Likely**: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion

**Evidence**:
1. Error is for class 4 (triggers mid_simple_refill)
2. No expansion messages printed (expansion code not reached)
3. OOM error from `superslab_allocate()` at line 480 (not mid_simple's line 269)
4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow

**Why Task Agent's fix was better**:
- Checked `active_slabs < chunk_cap` at line 172 (BEFORE mid_simple_refill)
- Even though non-atomic, it caught most exhaustion cases
- Triggered expansion before mid_simple_refill could bypass it

**Why my fix is worse**:
- Uses bitmap check which might not match mid_simple's active_slabs check
- Race condition: bitmap might show "not full" but active_slabs shows "full"
- Mid_simple sees "full" (via active_slabs), bypasses expansion, calls allocate() → OOM
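
The hypothesized interaction between the two checks can be written down as a small decision model. This is a toy sketch only: the enum, the function name, and the boolean inputs are hypothetical and simply encode the ordering described in this analysis (bitmap-based expansion check first, then the mid-simple path for `class_idx >= 4`); it does not reproduce the real control flow of `superslab_refill`.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: which branch a refill would take under the hypothesis above. */
typedef enum {
    PATH_EXPAND,          /* bitmap check fires, expand_superslab_head() runs */
    PATH_REUSE_CURRENT,   /* mid_simple finds room in tls->ss */
    PATH_DIRECT_ALLOCATE, /* mid_simple calls superslab_allocate() directly */
    PATH_NORMAL_REFILL    /* falls through to the regular refill logic */
} RefillPath;

RefillPath model_refill_path(int class_idx, bool mid_simple_enabled,
                             bool bitmap_full, bool active_full) {
    assert(class_idx >= 0);
    if (bitmap_full) {
        return PATH_EXPAND;  /* the expansion check runs before mid_simple */
    }
    if (class_idx >= 4 && mid_simple_enabled) {
        /* mid_simple trusts active_slabs, not the bitmap */
        return active_full ? PATH_DIRECT_ALLOCATE : PATH_REUSE_CURRENT;
    }
    return PATH_NORMAL_REFILL;
}
```

In this model the failing combination is `model_refill_path(4, true, false, true)`: the bitmap reports free slabs, `active_slabs` reports a full chunk, and the result is `PATH_DIRECT_ALLOCATE` with no expansion attempt.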
## Recommended Fix
**Short-term (Quick Fix)**:
1. Disable mid_simple_refill for class 4-7 to force normal path
2. Verify expansion works on normal path
3. If successful, this proves mid_simple is the culprit

**Long-term (Proper Fix)**:
1. Add expansion mechanism to mid_simple_refill path
2. Use consistent bitmap checks across all paths
3. Remove dependency on non-atomic active_slabs for exhaustion detection
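
As an illustration of items 2 and 3, exhaustion detection and slab claiming can be folded into one atomic operation on the bitmap, so no path has to consult a separate counter. The sketch below uses C11 atomics and assumes the bitmap could be declared `_Atomic uint32_t`; whether hakmem's `slab_bitmap` is or can be atomic is not established here, and `claim_free_slab` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sketch: atomically claim the lowest free slab bit.
 * Returns the claimed index, or -1 if all `cap` slabs are taken. */
int claim_free_slab(_Atomic uint32_t *bitmap, int cap) {
    assert(cap >= 1 && cap <= 32);
    uint32_t old = atomic_load_explicit(bitmap, memory_order_acquire);
    for (;;) {
        int idx = -1;
        for (int i = 0; i < cap; i++) {
            if (!(old & (UINT32_C(1) << i))) {
                idx = i;
                break;
            }
        }
        if (idx < 0) {
            return -1;  /* full: the caller should expand here */
        }
        uint32_t desired = old | (UINT32_C(1) << idx);
        /* On failure `old` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(bitmap, &old, desired,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire)) {
            return idx;
        }
    }
}
```

A return value of -1 is the single "exhausted" signal in this design, so both the normal path and a mid-simple style path could react to it by calling the expansion routine instead of allocating a fresh SuperSlab directly.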
## Success Criteria
- 4T test: 50/50 runs pass (100% stability)
- Expansion messages appear when SuperSlab exhausted
- No "superslab_refill returned NULL (OOM)" errors
- Performance maintained (> 900K ops/s on 4T)
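
The first criterion can be checked with a small POSIX harness that runs the benchmark repeatedly and counts every run that does not exit cleanly, including runs killed by a signal such as SIGSEGV or SIGABRT. This is a sketch under assumptions: `count_failed_runs` is a hypothetical helper, and the benchmark command line is left to the caller because it is not specified in this document.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical harness: run argv[0] `runs` times and count the runs that
 * did not exit with status 0 (including death by signal).
 * Returns the failure count, or -1 if fork()/waitpid() itself fails. */
int count_failed_runs(char *const argv[], int runs) {
    assert(argv != NULL && argv[0] != NULL && runs >= 0);
    int failures = 0;
    for (int i = 0; i < runs; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            return -1;
        }
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);  /* exec failed: counted as a failed run */
        }
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            return -1;
        }
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            failures++;
        }
    }
    return failures;
}
```

Calling it with the 4T benchmark command and `runs = 50` gives a pass only when the return value is 0, which matches the 50/50 target.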
## Next Steps
1. **Immediate**: Add debug logging to identify execution path
2. **Test**: Disable mid_simple_refill and verify expansion works
3. **Fix**: Add expansion to mid_simple path OR use bitmap consistently
4. **Verify**: Run 50+ tests to achieve 100% stability
---
**Generated**: 2025-11-08

**Investigator**: Claude Code (Sonnet 4.5)

**Critical**: User requirement is 100% stability, no tolerance for failures