## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After: 49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Bitmap Fix Failure Analysis

## Executive Summary

**Status**: ❌ REGRESSION - Bitmap fix made stability WORSE
- Before (Task Agent's active_slabs fix): 95% (19/20)
- After (My bitmap fix): 80% (16/20)
- **Regression**: -15 percentage points (3 additional failures: 1 → 4 out of 20 runs)

## Problem Statement

### User's Critical Requirement
> "メモリーライブラリーなんて 5%でもクラッシュおこったらつかえない"
>
> "A memory library with even a 5% crash rate is UNUSABLE"

**Target**: 100% stability (50+ runs with 0 failures)
**Current**: 80% stability (UNACCEPTABLE and WORSE than before)

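As a rough guide to what "50+ runs with 0 failures" can and cannot show, assume (as a simplification, not something measured here) that each run crashes independently with the same probability $p$. The chance of seeing zero failures in $n$ runs is then $(1-p)^n$, and the largest $p$ still consistent with a clean result at confidence level $1-\alpha$ follows from:

$$
(1 - p)^n = \alpha \quad\Longrightarrow\quad p_{\max} = 1 - \alpha^{1/n}
$$

For $n = 50$ and $\alpha = 0.05$ this gives $p_{\max} = 1 - 0.05^{1/50} \approx 0.058$. Under these assumptions, 50 clean runs alone do not exclude a crash rate near 5%. Excluding rates above 1% at the same confidence needs roughly 299 clean runs, since $n \ge \ln 0.05 / \ln 0.99 \approx 298.1$.
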
## Error Symptoms

### 4T Crash Pattern
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4
prev_ss=0x7da378400000
active=32
bitmap=0xffffffff
errno=12

free(): invalid pointer
```

**Key Observations**:
1. Class 4 consistently fails
2. bitmap=0xffffffff (all 32 slabs occupied)
3. active=32 (matches bitmap)
4. No expansion messages printed (expansion code NOT triggered!)

## Code Analysis

### My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)

```c
SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    uint32_t full_bitmap = (1U << chunk_cap) - 1;  // e.g., 32 slabs → 0xFFFFFFFF

    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);

        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }

        current_chunk = head->current_chunk;
        tls->ss = current_chunk;

        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx, current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}
```

### Critical Issue: Expansion Message NOT Printed!

The error output shows:
- ✅ TLS cache adaptation messages
- ✅ OOM error from superslab_allocate()
- ❌ **NO expansion messages** ("SuperSlab chunk exhausted...")

**This means the expansion code (lines 182-210) is NOT being executed!**

## Hypothesis

### Why Expansion Not Triggered?

**Option 1**: `current_chunk` is NULL
- If `current_chunk` is NULL, we skip the entire if block (line 166)
- Continue to normal refill logic without expansion

**Option 2**: `slab_bitmap != full_bitmap` is TRUE (unexpected)
- If bitmap doesn't match expected full value, we think there are free slabs
- Don't trigger expansion
- But later code finds no free slabs → OOM

**Option 3**: Execution reaches expansion but crashes before printing
- Race condition between check and expansion
- Another thread modifies state between line 174 and line 182

**Option 4**: Wrong code path entirely
- Error comes from mid_simple_refill path (line 264)
- Which bypasses my expansion code
- Calls `superslab_allocate()` directly → OOM

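One detail bears on Option 2. In C, shifting a 32-bit unsigned value by 32 is undefined behavior, so `(1U << chunk_cap) - 1` is not guaranteed to yield 0xFFFFFFFF when `chunk_cap` is 32 (the value the crash log suggests). On x86, for instance, the hardware masks the shift count to 5 bits, so a runtime `1U << 32` commonly evaluates to 1 and the mask becomes 0. In that case `slab_bitmap != full_bitmap` would be true for a fully occupied chunk and the expansion branch would be skipped. Whether this actually happens in this build has not been verified. The sketch below shows one way to compute the mask safely; `slab_full_mask` and `slab_bitmap_is_full` are hypothetical helper names, not existing HAKMEM functions.

```c
#include <assert.h>
#include <stdint.h>

/* Returns a mask with the low `cap` bits set, for 0 <= cap <= 32.
 * cap == 32 is handled explicitly because `1U << 32` is undefined
 * behavior for a 32-bit unsigned operand. */
static inline uint32_t slab_full_mask(int cap) {
    assert(cap >= 0 && cap <= 32);
    return (cap >= 32) ? UINT32_MAX : ((UINT32_C(1) << cap) - 1u);
}

/* Returns nonzero when every slab covered by `cap` is marked in use. */
static inline int slab_bitmap_is_full(uint32_t bitmap, int cap) {
    uint32_t mask = slab_full_mask(cap);
    return (bitmap & mask) == mask;
}
```

With a helper like this, the check in the bitmap fix could read `if (!slab_bitmap_is_full(current_chunk->slab_bitmap, chunk_cap))`, which behaves the same for capacities below 32 and stays well-defined at 32.
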
### Mid-Simple Refill Path (MOST LIKELY)

```c
// Lines 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // ← Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // A failure here would print from line 269, but we see the error at line 492 instead
        return NULL;
    }
}
```

**Problem**: Class 4 triggers mid_simple_refill (class_idx >= 4), which:
1. Checks `active_slabs < tls_cap` (non-atomic, race condition)
2. If exhausted, calls `superslab_allocate()` directly
3. Does NOT use the dynamic expansion mechanism
4. Returns NULL on OOM

## Investigation Tasks

### Task 1: Add Debug Logging

Add logging to determine execution path:

1. **Entry point logging**:
   ```c
   fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
           class_idx, (void*)current_chunk, (void*)tls->ss);
   ```

2. **Bitmap check logging**:
   ```c
   fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
           current_chunk->slab_bitmap, full_bitmap, chunk_cap,
           (current_chunk->slab_bitmap == full_bitmap));
   ```

3. **Mid-simple path logging**:
   ```c
   fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
           class_idx, tiny_mid_refill_simple_enabled(),
           (void*)tls->ss,
           tls->ss ? tls->ss->active_slabs : -1,
           tls->ss ? ss_slabs_capacity(tls->ss) : -1);
   ```

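Because `fprintf` inside an allocator hot path can shift thread timing enough to hide or move a race, a lighter complement to the logging above is a set of per-path hit counters that are dumped once at exit. The following is a minimal sketch using C11 atomics. The path names and the `path_hit`/`path_hits`/`path_dump` helpers are hypothetical and chosen only to mirror the branches discussed in this analysis.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical identifiers for the branches of interest in superslab_refill. */
enum refill_path {
    PATH_BITMAP_HAS_FREE,   /* bitmap check found free slabs */
    PATH_EXPAND,            /* expansion branch entered */
    PATH_MID_SIMPLE,        /* mid_simple_refill branch entered */
    PATH_MID_SIMPLE_ALLOC,  /* mid_simple fell through to a fresh allocation */
    PATH_COUNT
};

/* Zero-initialized at program start; shared by all threads. */
static _Atomic unsigned long g_path_hits[PATH_COUNT];

/* Record one visit to a path; relaxed ordering is enough for statistics. */
static inline void path_hit(enum refill_path p) {
    assert(p >= 0 && p < PATH_COUNT);
    atomic_fetch_add_explicit(&g_path_hits[p], 1UL, memory_order_relaxed);
}

/* Read the current hit count for a path. */
static inline unsigned long path_hits(enum refill_path p) {
    return atomic_load_explicit(&g_path_hits[p], memory_order_relaxed);
}

/* Print all counters, e.g. from an atexit() handler or after a failed run. */
static inline void path_dump(FILE* out) {
    static const char* const names[PATH_COUNT] = {
        "bitmap_has_free", "expand", "mid_simple", "mid_simple_alloc"
    };
    for (int i = 0; i < PATH_COUNT; i++)
        fprintf(out, "[DEBUG] path %-18s hits=%lu\n",
                names[i], path_hits((enum refill_path)i));
}
```

If `PATH_EXPAND` stays at zero while `PATH_MID_SIMPLE_ALLOC` grows in a failing run, that would support Option 4 without relying on message ordering in stderr.
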
### Task 2: Fix Mid-Simple Refill Path

Two options:

**Option A: Disable mid_simple_refill for testing**
```c
// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
```

**Option B: Add expansion to mid_simple_refill**
```c
// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded
if (tls->ss && tls->ss->active_slabs >= tls_cap) {
    // Try to expand current SuperSlab instead of allocating new one
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}
```

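The control flow that Option B aims for (try the current chunk, expand once when it is full, then retry on the new chunk) can be modelled in isolation. The sketch below is a single-threaded toy under stated assumptions: `MockChunk`, `MockHead`, `mock_expand`, `mock_find_free`, and `mock_refill` are hypothetical stand-ins and do not reproduce the real semantics of `expand_superslab_head()` or `superslab_find_free_slab()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MOCK_CAP 32  /* slabs per chunk in this toy model */

typedef struct MockChunk {
    uint32_t slab_bitmap;      /* bit i set = slab i in use */
    struct MockChunk* next;    /* next chunk in the chain */
} MockChunk;

typedef struct {
    MockChunk* first;          /* oldest chunk */
    MockChunk* current;        /* chunk new slabs are taken from */
} MockHead;

/* Append a fresh, empty chunk. Returns 0 on success, -1 on allocation failure. */
static inline int mock_expand(MockHead* head) {
    MockChunk* c = calloc(1, sizeof *c);
    if (!c) return -1;
    if (head->current) head->current->next = c;
    else head->first = c;
    head->current = c;
    return 0;
}

/* Index of the lowest free slab in the chunk, or -1 if all are in use. */
static inline int mock_find_free(const MockChunk* c) {
    for (int i = 0; i < MOCK_CAP; i++)
        if (!(c->slab_bitmap & (UINT32_C(1) << i))) return i;
    return -1;
}

/* Claim one slab: use the current chunk if it has room, otherwise expand
 * once and retry on the new chunk. Returns the slab index, or -1 on OOM.
 * On success *out points at the chunk that owns the slab. */
static inline int mock_refill(MockHead* head, MockChunk** out) {
    int idx = head->current ? mock_find_free(head->current) : -1;
    if (idx < 0) {
        if (mock_expand(head) < 0) return -1;  /* genuine OOM */
        idx = mock_find_free(head->current);   /* fresh chunk: always slot 0 */
    }
    head->current->slab_bitmap |= UINT32_C(1) << idx;
    *out = head->current;
    return idx;
}
```

The point of the model is that the OOM return is reachable only when expansion itself fails, never merely because the current chunk is full. The real path would additionally need the thread-safety guarantees that this toy ignores.
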
### Task 3: Fix Bitmap Logic Inconsistency

The verification at line 202 uses `active_slabs` (non-atomic), even though I said the bitmap should be used for MT-safety:

```c
// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {

// AFTER (consistent with bitmap approach):
// NULL is checked before querying the capacity, and cap == 32 is
// special-cased because 1U << 32 is undefined behavior in C.
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (new_cap >= 32) ? 0xFFFFFFFFu : ((1U << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {
```

## Root Cause Hypothesis

**Most Likely**: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion

**Evidence**:
1. Error is for class 4 (triggers mid_simple_refill)
2. No expansion messages printed (expansion code not reached)
3. OOM error from `superslab_allocate()` at line 480 (not mid_simple's line 269)
4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow

**Why Task Agent's fix was better**:
- Checked `active_slabs < chunk_cap` at line 172 (BEFORE mid_simple_refill)
- Even though non-atomic, it caught most exhaustion cases
- Triggered expansion before mid_simple_refill could bypass it

**Why my fix is worse**:
- Uses a bitmap check which might not match mid_simple's active_slabs check
- Race condition: bitmap might show "not full" while active_slabs shows "full"
- Mid_simple sees "full" (via active_slabs), bypasses expansion, calls `superslab_allocate()` → OOM

## Recommended Fix

**Short-term (Quick Fix)**:
1. Disable mid_simple_refill for classes 4-7 to force the normal path
2. Verify expansion works on the normal path
3. If successful, this points to mid_simple as the culprit

**Long-term (Proper Fix)**:
1. Add expansion mechanism to mid_simple_refill path
2. Use consistent bitmap checks across all paths
3. Remove dependency on non-atomic active_slabs for exhaustion detection

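For long-term items 2 and 3, one possible shape is to make the bitmap the single source of truth: slabs are claimed with a compare-and-swap on the bitmap, and any "active" count is derived from one atomic snapshot instead of a separate counter. The sketch below assumes the bitmap can be declared `_Atomic` and that capacity is at most 32. `DemoSlabSet` and the `demo_*` functions are hypothetical names, not the existing SuperSlab layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint32_t slab_bitmap;  /* bit i set = slab i in use */
    int capacity;                  /* number of slabs, 1..32 */
} DemoSlabSet;

/* Atomically claim the lowest free slab. Returns its index, or -1 if full. */
static inline int demo_claim_slab(DemoSlabSet* s) {
    assert(s->capacity >= 1 && s->capacity <= 32);
    uint32_t mask = (s->capacity >= 32) ? UINT32_MAX
                                        : ((UINT32_C(1) << s->capacity) - 1u);
    uint32_t old = atomic_load_explicit(&s->slab_bitmap, memory_order_acquire);
    for (;;) {
        uint32_t free_bits = ~old & mask;
        if (free_bits == 0) return -1;            /* exhausted: caller expands */
        int idx = 0;
        while (!(free_bits & (UINT32_C(1) << idx))) idx++;
        uint32_t desired = old | (UINT32_C(1) << idx);
        /* On failure `old` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(&s->slab_bitmap, &old, desired,
                                                  memory_order_acq_rel,
                                                  memory_order_acquire))
            return idx;
    }
}

/* Atomically mark a slab as free again. */
static inline void demo_release_slab(DemoSlabSet* s, int idx) {
    assert(idx >= 0 && idx < s->capacity);
    atomic_fetch_and_explicit(&s->slab_bitmap, ~(UINT32_C(1) << idx),
                              memory_order_release);
}

/* Number of slabs in use, counted from a single snapshot of the bitmap. */
static inline int demo_active_slabs(DemoSlabSet* s) {
    uint32_t snap = atomic_load_explicit(&s->slab_bitmap, memory_order_acquire);
    int n = 0;
    while (snap) { snap &= snap - 1; n++; }       /* clear lowest set bit */
    return n;
}
```

With this shape, "full" has exactly one definition (claim returns -1), so two code paths cannot disagree about exhaustion the way the bitmap check and the `active_slabs` check can today.
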
## Success Criteria

- 4T test: 50/50 runs pass (100% stability)
- Expansion messages appear when SuperSlab exhausted
- No "superslab_refill returned NULL (OOM)" errors
- Performance maintained (> 900K ops/s on 4T)

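The first criterion can be checked mechanically. The following is a small sketch of a repeat-run counter for POSIX systems, assuming the benchmark signals failure through a nonzero exit status or death by signal (as `free(): invalid pointer` followed by an abort would). `count_failed_runs` is a hypothetical helper, and the 4T command line itself is not given in this document, so it is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Runs `cmd` through the shell `runs` times and returns how many runs did
 * not exit cleanly (nonzero exit status, killed by a signal, or the command
 * could not be launched). Each failure is reported on stderr. */
static inline int count_failed_runs(const char* cmd, int runs) {
    assert(cmd != NULL && runs >= 0);
    int failures = 0;
    for (int i = 0; i < runs; i++) {
        int status = system(cmd);
        if (status != -1 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            continue;  /* clean exit */
        failures++;
        if (status == -1)
            fprintf(stderr, "run %d: could not launch command\n", i + 1);
        else if (WIFSIGNALED(status))
            fprintf(stderr, "run %d: killed by signal %d\n", i + 1, WTERMSIG(status));
        else if (WIFEXITED(status))
            fprintf(stderr, "run %d: exit status %d\n", i + 1, WEXITSTATUS(status));
    }
    return failures;
}
```

A caller would pass the 4T benchmark command and 50 (or more) as arguments and treat any nonzero return value as a failed criterion.
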
## Next Steps

1. **Immediate**: Add debug logging to identify execution path
2. **Test**: Disable mid_simple_refill and verify expansion works
3. **Fix**: Add expansion to mid_simple path OR use bitmap consistently
4. **Verify**: Run 50+ tests to achieve 100% stability

---

**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Critical**: User requirement is 100% stability, no tolerance for failures