# Bitmap Fix Failure Analysis

## Executive Summary

**Status**: ❌ REGRESSION - Bitmap fix made stability WORSE
- Before (Task Agent's active_slabs fix): 95% (19/20)
- After (My bitmap fix): 80% (16/20)
- **Regression**: -15 percentage points (4 failures in 20 runs, up from 1)

## Problem Statement

### User's Critical Requirement

> "メモリーライブラリーなんて 5%でもクラッシュおこったらつかえない"
>
> "A memory library is unusable if it crashes even 5% of the time"

**Target**: 100% stability (50+ runs with 0 failures)
**Current**: 80% stability (UNACCEPTABLE and WORSE than before)

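As a rough check on how much a clean batch of runs can show: if each run crashes independently with probability p, the chance that n runs all pass is (1 - p)^n. At p = 0.05 this is about 0.08 for 50 runs and about 0.36 for 20 runs, so a 20-run sample separates 95% from 100% only weakly. The sketch below computes this value; the independence assumption and the name `prob_all_pass` are illustrative and not part of HAKMEM.

```c
#include <assert.h>

/* Probability that `runs` independent runs all pass when each run
 * fails with probability `p` (assumption: runs are independent). */
static double prob_all_pass(double p, int runs) {
    assert(p >= 0.0 && p <= 1.0 && runs >= 0);
    double q = 1.0;
    for (int i = 0; i < runs; i++) {
        q *= (1.0 - p);   /* one more run that must not crash */
    }
    return q;
}
```

For example, `prob_all_pass(0.05, 50)` gives the chance that a library with a true 5% crash rate would still pass the 50-run target.
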
## Error Symptoms

### 4T Crash Pattern
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4
prev_ss=0x7da378400000
active=32
bitmap=0xffffffff
errno=12

free(): invalid pointer
```

**Key Observations**:
1. Class 4 consistently fails
2. bitmap=0xffffffff (all 32 slabs occupied)
3. active=32 (matches bitmap)
4. No expansion messages printed (expansion code NOT triggered!)

## Code Analysis

### My Bitmap Fix (tiny_superslab_alloc.inc.h:165-210)

```c
SuperSlab* current_chunk = head->current_chunk;
if (current_chunk) {
    // Check if current chunk has available slabs
    int chunk_cap = ss_slabs_capacity(current_chunk);
    uint32_t full_bitmap = (1U << chunk_cap) - 1;  // e.g., 32 slabs → 0xFFFFFFFF

    if (current_chunk->slab_bitmap != full_bitmap) {
        // Has free slabs, update tls->ss
        if (tls->ss != current_chunk) {
            tls->ss = current_chunk;
        }
    } else {
        // Exhausted, expand!
        fprintf(stderr, "[HAKMEM] SuperSlab chunk exhausted for class %d (active=%d cap=%d bitmap=0x%08x), expanding...\n",
                class_idx, current_chunk->active_slabs, chunk_cap, current_chunk->slab_bitmap);

        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d (system OOM)\n", class_idx);
            return NULL;
        }

        current_chunk = head->current_chunk;
        tls->ss = current_chunk;

        // Verify new chunk has free slabs
        if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {
            fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d (active=%d cap=%d)\n",
                    class_idx, current_chunk ? current_chunk->active_slabs : -1,
                    current_chunk ? ss_slabs_capacity(current_chunk) : -1);
            return NULL;
        }
    }
}
```

### Critical Issue: Expansion Message NOT Printed!

The error output shows:
- ✅ TLS cache adaptation messages
- ✅ OOM error from superslab_allocate()
- ❌ **NO expansion messages** ("SuperSlab chunk exhausted...")

**This means the expansion code (lines 182-210) is NOT being executed!**

## Hypothesis

### Why Is Expansion Not Triggered?

**Option 1**: `current_chunk` is NULL
- If `current_chunk` is NULL, the entire if block (line 166) is skipped
- Execution continues to the normal refill logic without expansion

**Option 2**: `slab_bitmap != full_bitmap` is TRUE (unexpected)
- If the bitmap doesn't match the expected full value, the code assumes there are free slabs
- Expansion is not triggered
- Later code finds no free slabs → OOM

**Option 3**: Execution reaches expansion but crashes before printing
- Race condition between check and expansion
- Another thread modifies state between line 174 and line 182

**Option 4**: Wrong code path entirely
- Error comes from the mid_simple_refill path (line 264)
- This path bypasses my expansion code
- It calls `superslab_allocate()` directly → OOM

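For Option 2, one mechanism worth ruling out is the mask computation itself. In C, shifting a 32-bit `unsigned int` by 32 is undefined behavior, so `(1U << chunk_cap) - 1` is not guaranteed to yield `0xFFFFFFFF` when `chunk_cap` is 32. On common x86 builds the shift count can wrap and the result can be 0, which would make `slab_bitmap != full_bitmap` true for a full chunk and skip the expansion branch. Whether that happens here is unconfirmed; the bitmap check logging in Task 1 would show it. A minimal sketch of a width-safe mask, using hypothetical helper names (`full_mask_for_capacity`, `chunk_is_full`) that are not part of HAKMEM:

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `cap` bits set, for 0 <= cap <= 32.
 * The cap == 32 case is handled without shifting by the full width,
 * because (1U << 32) is undefined for a 32-bit unsigned int. */
static uint32_t full_mask_for_capacity(int cap) {
    assert(cap >= 0 && cap <= 32);
    if (cap == 32) {
        return UINT32_MAX;
    }
    return ((uint32_t)1 << cap) - 1u;
}

/* Returns 1 when every slab in a chunk of `cap` slabs is marked in use. */
static int chunk_is_full(uint32_t slab_bitmap, int cap) {
    uint32_t full = full_mask_for_capacity(cap);
    return (slab_bitmap & full) == full;
}
```

With the values from the crash log (`bitmap=0xffffffff`, 32 slabs), `chunk_is_full(0xffffffff, 32)` reports a full chunk regardless of how the compiler treats an over-wide shift.
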
### Mid-Simple Refill Path (MOST LIKELY)

```c
// Line 246-281
if (class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
    if (tls->ss) {
        int tls_cap = ss_slabs_capacity(tls->ss);
        if (tls->ss->active_slabs < tls_cap) {  // ← Uses non-atomic active_slabs!
            // ... try to find free slab
        }
    }
    // Otherwise allocate a fresh SuperSlab
    SuperSlab* ssn = superslab_allocate((uint8_t)class_idx);  // ← Direct allocation!
    if (!ssn) {
        // A failure here would be reported at line 269, but the observed error is at line 492
        return NULL;
    }
}
```

**Problem**: Class 4 triggers mid_simple_refill (class_idx >= 4), which:
1. Checks `active_slabs < tls_cap` (non-atomic, race condition)
2. If exhausted, calls `superslab_allocate()` directly
3. Does NOT use the dynamic expansion mechanism
4. Returns NULL on OOM

## Investigation Tasks

### Task 1: Add Debug Logging

Add logging to determine execution path:

1. **Entry point logging**:
   ```c
   fprintf(stderr, "[DEBUG] superslab_refill ENTER: class=%d current_chunk=%p tls->ss=%p\n",
           class_idx, (void*)current_chunk, (void*)tls->ss);
   ```

2. **Bitmap check logging**:
   ```c
   fprintf(stderr, "[DEBUG] bitmap check: bitmap=0x%08x full_bitmap=0x%08x chunk_cap=%d match=%d\n",
           current_chunk->slab_bitmap, full_bitmap, chunk_cap,
           (current_chunk->slab_bitmap == full_bitmap));
   ```

3. **Mid-simple path logging**:
   ```c
   fprintf(stderr, "[DEBUG] mid_simple_refill: class=%d enabled=%d tls->ss=%p active=%d cap=%d\n",
           class_idx, tiny_mid_refill_simple_enabled(),
           (void*)tls->ss,
           tls->ss ? tls->ss->active_slabs : -1,
           tls->ss ? ss_slabs_capacity(tls->ss) : -1);
   ```

### Task 2: Fix Mid-Simple Refill Path

Two options:

**Option A: Disable mid_simple_refill for testing**
```c
// Line 249: Force disable
if (0 && class_idx >= 4 && tiny_mid_refill_simple_enabled()) {
```

**Option B: Add expansion to mid_simple_refill**
```c
// Line 262: Before allocating new SuperSlab
// Check if current tls->ss is exhausted and can be expanded
if (tls->ss && tls->ss->active_slabs >= tls_cap) {
    // Try to expand current SuperSlab instead of allocating new one
    SuperSlabHead* head = superslab_lookup_head(class_idx);
    if (head && expand_superslab_head(head) == 0) {
        tls->ss = head->current_chunk;  // Point to new chunk
        // Retry initialization with new chunk
        int free_idx = superslab_find_free_slab(tls->ss);
        if (free_idx >= 0) {
            // ... use new chunk
        }
    }
}
```

### Task 3: Fix Bitmap Logic Inconsistency

The verification at line 202 uses the non-atomic `active_slabs`, even though the premise of my fix is that the bitmap should be used for MT-safety:

```c
// BEFORE (inconsistent):
if (!current_chunk || current_chunk->active_slabs >= ss_slabs_capacity(current_chunk)) {

// AFTER (consistent with bitmap approach):
// NULL is checked before the capacity lookup, and a capacity of 32 is
// special-cased because (1U << 32) is undefined for a 32-bit unsigned int.
int new_cap = current_chunk ? ss_slabs_capacity(current_chunk) : 0;
uint32_t new_full_bitmap = (new_cap >= 32) ? 0xFFFFFFFFu : ((1U << new_cap) - 1);
if (!current_chunk || current_chunk->slab_bitmap == new_full_bitmap) {
```

## Root Cause Hypothesis

**Most Likely**: Mid-simple refill path (class_idx >= 4) bypasses dynamic expansion

**Evidence**:
1. Error is for class 4 (triggers mid_simple_refill)
2. No expansion messages printed (expansion code not reached)
3. OOM error from `superslab_allocate()` at line 480 (not mid_simple's line 269)
4. Task Agent's fix worked better (95%) because it checked active_slabs earlier in the flow

**Why Task Agent's fix was better**:
- Checked `active_slabs < chunk_cap` at line 172 (BEFORE mid_simple_refill)
- Even though non-atomic, it caught most exhaustion cases
- Triggered expansion before mid_simple_refill could bypass it

**Why my fix is worse**:
- Uses a bitmap check, which might not agree with mid_simple's active_slabs check
- Race condition: the bitmap might show "not full" while active_slabs shows "full"
- Mid_simple sees "full" (via active_slabs), bypasses expansion, calls `superslab_allocate()` → OOM

## Recommended Fix

**Short-term (Quick Fix)**:
1. Disable mid_simple_refill for class 4-7 to force normal path
2. Verify expansion works on normal path
3. If successful, this proves mid_simple is the culprit

**Long-term (Proper Fix)**:
1. Add expansion mechanism to mid_simple_refill path
2. Use consistent bitmap checks across all paths
3. Remove dependency on non-atomic active_slabs for exhaustion detection

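One way to drop the dependency on a separate `active_slabs` counter is to decide exhaustion from the same atomic word that records the claim. The following is a minimal sketch using C11 atomics, assuming a 32-bit bitmap where a set bit means "slab in use". `DemoChunk` and `demo_claim_free_slab` are hypothetical stand-ins, not HAKMEM types or functions, and the real `SuperSlab` layout may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for a chunk: bit i set means slab i is in use. */
typedef struct {
    _Atomic uint32_t slab_bitmap;
    int capacity;                 /* number of slabs, 1..32 */
} DemoChunk;

/* Try to claim one free slab. Returns its index, or -1 when the chunk is
 * full, in which case the caller would take the expansion path. */
static int demo_claim_free_slab(DemoChunk *c) {
    assert(c->capacity >= 1 && c->capacity <= 32);
    uint32_t full = (c->capacity == 32) ? UINT32_MAX
                                        : (((uint32_t)1 << c->capacity) - 1u);
    uint32_t old = atomic_load_explicit(&c->slab_bitmap, memory_order_relaxed);
    for (;;) {
        uint32_t free_bits = ~old & full;
        if (free_bits == 0) {
            return -1;                       /* no free slab left */
        }
        int idx = 0;
        while (((free_bits >> idx) & 1u) == 0) {
            idx++;                           /* lowest free bit */
        }
        uint32_t desired = old | ((uint32_t)1 << idx);
        /* On failure, `old` is refreshed with the current bitmap and the
         * loop retries, so two threads cannot claim the same slab. */
        if (atomic_compare_exchange_weak_explicit(&c->slab_bitmap, &old, desired,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
            return idx;
        }
    }
}
```

Because the full/not-full decision and the claim happen on one word, there is no window in which a bitmap and a counter can disagree.
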
## Success Criteria

- 4T test: 50/50 runs pass (100% stability)
- Expansion messages appear when SuperSlab exhausted
- No "superslab_refill returned NULL (OOM)" errors
- Performance maintained (> 900K ops/s on 4T)

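The 50-run criterion can be checked mechanically. The sketch below assumes a POSIX system and counts runs that do not exit with status 0, so an abort such as `free(): invalid pointer` counts as a failure. The function name `count_failed_runs` is illustrative, and the benchmark command line is not given in this report, so it must be supplied by the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run `cmd` through the shell `runs` times and count the runs that did
 * not exit with status 0. Crashes and signals count as failures. */
static int count_failed_runs(const char *cmd, int runs) {
    assert(cmd != NULL && runs > 0);
    int failed = 0;
    for (int i = 0; i < runs; i++) {
        int status = system(cmd);
        if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            failed++;
        }
    }
    return failed;
}
```

The stability criterion is met when `count_failed_runs(cmd, 50)` returns 0 for the 4T test command.
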
## Next Steps

1. **Immediate**: Add debug logging to identify execution path
2. **Test**: Disable mid_simple_refill and verify expansion works
3. **Fix**: Add expansion to mid_simple path OR use bitmap consistently
4. **Verify**: Run 50+ tests to achieve 100% stability

---

**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Critical**: User requirement is 100% stability, no tolerance for failures