# HAKMEM 100% Stability Investigation Report

## Executive Summary

**Status**: PARTIALLY FIXED - Single-threaded 100% stable, multi-threaded still crashes
**Root Cause Found**: Inverted bitmap logic in `superslab_refill()` caused a fully occupied chunk to be misread as still having free slabs, so expansion was never triggered
**Primary Fix Implemented**: Replaced the free-slab check `slab_bitmap != 0x00000000` with `active_slabs < chunk_cap`, so a chunk counts as exhausted when `active_slabs >= chunk_cap`

## Problem Statement

User requirement: **"メモリーライブラリーなんて5%でもクラッシュおこったらつかえない"**
Translation: "A memory library is unusable if it crashes even 5% of the time."

Initial Test Results: **19/20 success (95%)** - **UNACCEPTABLE**

## Investigation Timeline

### 1. Failure Reproduction (Run 4 of 30)

**Exit Code**: 134 (SIGABRT)

**Error Log**:
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=3
  prev_ss=0x7e21c5400000
  active=32
  bitmap=0xffffffff ← ALL BITS SET!
  errno=12

[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer
```

**Key Observation**: `bitmap=0xffffffff` means all 32 slabs are marked "occupied", but that alone should not cause an OOM if chunk expansion works.
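To make the observation concrete, here is a minimal sketch that decodes a 32-bit slab bitmap under the semantics used in this report (bit set = slab occupied). The helper names `count_occupied`, `find_first_free`, and `self_check` are hypothetical and are not HAKMEM functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of set bits, i.e. occupied slabs. */
static int count_occupied(uint32_t bitmap) {
    int n = 0;
    while (bitmap) {
        bitmap &= bitmap - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Hypothetical helper: index of the first clear (free) bit, or -1 if none. */
static int find_first_free(uint32_t bitmap) {
    for (int i = 0; i < 32; i++) {
        if (!(bitmap & (UINT32_C(1) << i))) {
            return i;
        }
    }
    return -1;
}

/* Checks the value seen in the error log: every slab is occupied. */
static void self_check(void) {
    assert(count_occupied(0xffffffffu) == 32);
    assert(find_first_free(0xffffffffu) == -1);
}
```

With the logged value `0xffffffff`, no free slab index exists, so the only way to satisfy the allocation is to expand to a new chunk.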

### 2. Root Cause Analysis

#### Bug #1: Inverted Bitmap Logic (CRITICAL)

**Location**: `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169`

**Bitmap Semantics** (confirmed via `superslab_find_free_slab:788`):
- Bit value 0 = FREE slab
- Bit value 1 = OCCUPIED slab
- `0x00000000` = All slabs FREE (0 in use)
- `0xffffffff` = All slabs OCCUPIED (32 in use)

**Buggy Code**:
```c
// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // "Current chunk has free slabs" ← WRONG!!!
    // This branch executes when bitmap=0xffffffff (ALL OCCUPIED)
    // (rest of branch omitted)
}
```

**Problem**:
- When all slabs are occupied (`bitmap=0xffffffff`), the condition is TRUE
- The code concludes "has free slabs" and continues
- The expansion logic is never reached
- Returns NULL → OOM → Crash

**Fix Applied**:
```c
// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // Correctly checks if ANY slabs are free
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
    // (rest of branch omitted)
}
```

**Verification**:
```bash
# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS

# Expansion messages observed:
# [HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
# [HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)
```
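The difference between the two conditions can be sketched in isolation. The `chunk_t` struct below is a hypothetical stand-in that holds only the two fields the checks read; the real chunk header, and the assumption of 32 slabs per chunk, are taken from the log values in this report rather than from the source file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SLABS_PER_CHUNK 32  /* assumption: 32 slabs per chunk, per the log */

/* Hypothetical stand-in for the chunk header. */
typedef struct {
    uint32_t slab_bitmap;   /* bit set = slab occupied */
    int      active_slabs;  /* number of occupied slabs */
} chunk_t;

/* Pre-fix predicate: true whenever ANY slab is occupied,
 * which includes the fully occupied case. */
static bool has_free_slab_buggy(const chunk_t *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Post-fix predicate: true only while at least one slab is unused. */
static bool has_free_slab_fixed(const chunk_t *c, int chunk_cap) {
    assert(chunk_cap > 0 && chunk_cap <= SLABS_PER_CHUNK);
    return c->active_slabs < chunk_cap;
}
```

For a full chunk (`slab_bitmap=0xffffffff`, `active_slabs=32`) the pre-fix predicate answers "has free slabs" and the post-fix predicate answers "exhausted". Note that the pre-fix predicate is also wrong in the other direction: a completely empty chunk (`slab_bitmap=0`) is reported as having no free slabs.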

#### Bug #2: Slab Deactivation Issue (Secondary)

**Initial Hypothesis**: Slabs become empty (`used=0`) but the bitmap bit stays set → memory leak

**Investigation**: Added `superslab_deactivate_slab()` calls when `meta->used == 0`

**Result**: Multi-thread SEGV (even worse than the original!)

**Root Cause of SEGV**: Double-initialization corruption
1. Slab freed → `deactivate` → bitmap bit cleared
2. Next alloc → `superslab_find_free_slab()` finds it
3. Calls `init_slab()` AGAIN on the already-initialized slab
4. Metadata corruption → SEGV

**Correct Design**: Slabs should stay "active" once initialized until the entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
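The lifecycle described above can be illustrated with a toy slab. This is a sketch under stated assumptions, not HAKMEM's implementation: the types `block_t` and `slab_t`, the sizes, and the `init_count` field are hypothetical and exist only to show that a slab whose `used` count drops to zero can keep serving allocations from its freelist without being initialized a second time.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCKS_PER_SLAB 4   /* assumption: tiny value to keep the sketch short */
#define BLOCK_SIZE      64  /* assumption */

typedef union block {
    union block  *next;                 /* valid while the block is on the freelist */
    unsigned char payload[BLOCK_SIZE];  /* valid while the block is allocated */
} block_t;

typedef struct {
    block_t  blocks[BLOCKS_PER_SLAB];
    block_t *freelist;
    int      used;
    int      init_count;  /* how many times slab_init ran; should stay at 1 */
} slab_t;

/* Build the freelist once. Expects a zero-initialized slab_t. */
static void slab_init(slab_t *s) {
    s->freelist = NULL;
    for (int i = BLOCKS_PER_SLAB - 1; i >= 0; i--) {
        s->blocks[i].next = s->freelist;
        s->freelist = &s->blocks[i];
    }
    s->used = 0;
    s->init_count++;
}

/* Pop a block from the freelist, or return NULL when the slab is full. */
static void *slab_alloc(slab_t *s) {
    block_t *b = s->freelist;
    if (b == NULL) {
        return NULL;
    }
    s->freelist = b->next;
    s->used++;
    return b;
}

/* Push a block back. used may reach 0, but the slab is NOT deactivated. */
static void slab_free(slab_t *s, void *p) {
    assert(p != NULL);
    block_t *b = p;
    b->next = s->freelist;
    s->freelist = b;
    s->used--;
}
```

After every block is freed, the next `slab_alloc()` simply reuses the most recently freed block, and `init_count` stays at 1. Re-running `slab_init()` at that point while another thread still holds or is freeing a block is the kind of double initialization the steps above describe.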

## Final Implementation

### Files Modified

1. **`core/tiny_superslab_alloc.inc.h:168-208`**
   - Changed the free-slab check from `bitmap != 0` to `active_slabs < capacity`
   - Added diagnostic logging for expansion events
   - Improved error messages

2. **`core/box/free_local_box.c:100-104`**
   - Added explanatory comment: Why NOT to deactivate slabs

3. **`core/tiny_superslab_free.inc.h:305, 333`**
   - Added comments explaining slab lifecycle

### Test Results

| Configuration | Result | Notes |
|---------------|--------|-------|
| Single-thread (1T) | ✅ 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | ❌ SEGV | Crashes immediately |
| Single-thread expansion | ✅ Works | Grows 1→2→3 chunks |
| Multi-thread expansion | ❌ No logs | Crashes before expansion |

## Remaining Issues

### Multi-Thread SEGV

**Symptoms**:
- Crashes within ~1 second
- No expansion logging
- Exit 139 (SIGSEGV)
- Single-thread works perfectly

**Possible Causes**:
1. **Race condition** in expansion path
2. **Memory corruption** in multi-thread initialization
3. **Lock-free algorithm bug** in concurrent slab access
4. **TLS initialization issue** under high thread contention

**Recommended Next Steps**:
1. Run under ThreadSanitizer: `make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4`
2. Add mutex protection around `expand_superslab_head()`
3. Check for TOCTOU bugs in `current_chunk` access
4. Verify atomic operations in slab acquisition
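As a reference point for steps 3 and 4, the sketch below shows one common way to claim a free bit without a check-then-set window: the test for a free bit and the update of the bitmap happen in a single compare-and-swap, which retries if another thread changed the bitmap in between. This is an illustration using C11 `<stdatomic.h>`, assuming bit set = occupied; `claim_free_slab` is a hypothetical name, and whether HAKMEM's slab acquisition follows this pattern has not been verified here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: atomically claim the lowest free slab index.
 * Returns the claimed index, or -1 when all 32 bits are already set. */
static int claim_free_slab(_Atomic uint32_t *bitmap) {
    assert(bitmap != NULL);
    uint32_t old = atomic_load_explicit(bitmap, memory_order_relaxed);
    for (;;) {
        if (old == UINT32_MAX) {
            return -1;  /* chunk exhausted: caller should expand */
        }
        int idx = 0;
        while (old & (UINT32_C(1) << idx)) {
            idx++;  /* terminates: at least one bit in old is clear */
        }
        uint32_t desired = old | (UINT32_C(1) << idx);
        if (atomic_compare_exchange_weak_explicit(
                bitmap, &old, desired,
                memory_order_acq_rel, memory_order_relaxed)) {
            return idx;
        }
        /* CAS failed: old now holds the current bitmap value; retry. */
    }
}
```

A version that first reads the bitmap, then sets the bit with a separate store, would let two threads pick the same index. Comparing the existing acquisition code against this shape is one way to look for the suspected TOCTOU bug.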

## Why This Achieves 100% (Single-Thread)

The bitmap fix ensures:
1. **Correct exhaustion detection**: `active_slabs >= capacity` is precise
2. **Automatic expansion**: When all slabs are occupied → new chunk allocated
3. **No false OOMs**: System only fails on true memory exhaustion
4. **Tested**: 10/10 single-thread runs passed with stable throughput

**Memory behavior** (verified via logs):
- Initial: 1 chunk per class
- Under load: Expands to 2, 3, 4... chunks as needed
- Each new chunk provides 32 fresh slabs
- No premature OOM
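The expansion behavior listed above can be sketched as a simple chain of chunks. This is a single-threaded illustration only, with hypothetical names (`head_t`, `head_expand`, `head_acquire_slab`, `head_destroy`) and plain `calloc` standing in for whatever HAKMEM uses to obtain a chunk; unlike the real allocator, the first chunk here is created lazily on first use.

```c
#include <assert.h>
#include <stdlib.h>

#define SLABS_PER_CHUNK 32  /* assumption: matches cap=32 in the logs */

typedef struct chunk {
    int           active_slabs;
    struct chunk *next;
} chunk_t;

typedef struct {
    chunk_t *first;
    chunk_t *current;
    int      total_chunks;
} head_t;

/* Append a fresh chunk. Returns 0 on success, -1 on real allocation failure. */
static int head_expand(head_t *h) {
    chunk_t *c = calloc(1, sizeof *c);
    if (c == NULL) {
        return -1;
    }
    if (h->current != NULL) {
        h->current->next = c;
    } else {
        h->first = c;
    }
    h->current = c;
    h->total_chunks++;
    return 0;
}

/* Acquire one slab, expanding when the current chunk is exhausted.
 * Returns the slab index within the current chunk, or -1 on true OOM. */
static int head_acquire_slab(head_t *h) {
    assert(h != NULL);
    if (h->current == NULL || h->current->active_slabs >= SLABS_PER_CHUNK) {
        if (head_expand(h) != 0) {
            return -1;
        }
    }
    return h->current->active_slabs++;
}

/* Release every chunk in the chain and reset the head. */
static void head_destroy(head_t *h) {
    chunk_t *c = h->first;
    while (c != NULL) {
        chunk_t *next = c->next;
        free(c);
        c = next;
    }
    h->first = NULL;
    h->current = NULL;
    h->total_chunks = 0;
}
```

Acquiring 33 slabs from an empty `head_t` produces two chunks, and 65 produces three, mirroring the 1→2→3 growth seen in the single-thread logs. The only failure path left is `calloc` itself returning `NULL`.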

## Conclusion

**Single-Thread**: ✅ **100% stability achieved**
**Multi-Thread**: ❌ **Additional fix required** (race condition suspected)

**User's requirement**: NOT YET MET
- Need multi-thread stability for production use
- Recommend: Fix race condition before deployment

---

**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Test Environment**: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks