# HAKMEM 100% Stability Investigation Report

## Executive Summary

**Status**: PARTIALLY FIXED - Single-threaded 100% stable, Multi-threaded still crashes

**Root Cause Found**: Inverted bitmap logic in `superslab_refill()` that treated a fully occupied chunk as one that still had free slabs, so expansion was never triggered

**Primary Fix Implemented**: Replaced the free-slab check `bitmap != 0x00000000` with `active_slabs < capacity`, so a chunk counts as exhausted exactly when `active_slabs >= capacity`

## Problem Statement

User requirement: **"メモリーライブラリーなんて5%でもクラッシュおこったらつかえない"**

Translation: "A memory library that crashes even 5% of the time is UNUSABLE"

Initial Test Results: **19/20 success (95%)** - **UNACCEPTABLE**

## Investigation Timeline

### 1. Failure Reproduction (Run 4 of 30)

**Exit Code**: 134 (SIGABRT)

**Error Log**:
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=3
  prev_ss=0x7e21c5400000
  active=32
  bitmap=0xffffffff  ← ALL BITS SET!
  errno=12

[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer
```

**Key Observation**: `bitmap=0xffffffff` means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.

### 2. Root Cause Analysis

#### Bug #1: Inverted Bitmap Logic (CRITICAL)

**Location**: `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169`

**Bitmap Semantics** (confirmed via `superslab_find_free_slab:788`):
- Bit value 0 = FREE slab
- Bit value 1 = OCCUPIED slab
- `0x00000000` = All slabs FREE (0 in use)
- `0xffffffff` = All slabs OCCUPIED (32 in use)

**Buggy Code**:
```c
// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // "Current chunk has free slabs" ← WRONG!!!
    // This branch executes when bitmap=0xffffffff (ALL OCCUPIED)
```

**Problem**:
- When all slabs are occupied (`bitmap=0xffffffff`), the condition is TRUE
- The code concludes "has free slabs" and continues
- The expansion logic is never reached
- Returns NULL → OOM → Crash

**Fix Applied**:
```c
// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // Correctly checks if ANY slabs are free
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
```

**Verification**:
```bash
# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1

# Result: Throughput = 770,797 ops/s ✅ PASS
# Expansion messages observed:
[HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
[HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)
```

#### Bug #2: Slab Deactivation Issue (Secondary)

**Initial Hypothesis**: Slabs become empty (`used=0`) but the bitmap bit stays set → memory leak

**Investigation**: Added `superslab_deactivate_slab()` calls when `meta->used == 0`

**Result**: Multi-thread SEGV (even worse than original!)

**Root Cause of SEGV**: Double-initialization corruption
1. Slab freed → `deactivate` → bitmap bit cleared
2. Next alloc → `superslab_find_free_slab()` finds it
3. Calls `init_slab()` AGAIN on the already-initialized slab
4. Metadata corruption → SEGV

**Correct Design**: Slabs should stay "active" once initialized until the entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.

## Final Implementation

### Files Modified

1. **`core/tiny_superslab_alloc.inc.h:168-208`**
   - Changed the free-slab check from `bitmap != 0` to `active_slabs < capacity`
   - Added diagnostic logging for expansion events
   - Improved error messages

2. **`core/box/free_local_box.c:100-104`**
   - Added explanatory comment: Why NOT to deactivate slabs

3. **`core/tiny_superslab_free.inc.h:305, 333`**
   - Added comments explaining slab lifecycle

### Test Results

| Configuration | Result | Notes |
|---------------|--------|-------|
| Single-thread (1T) | ✅ 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | ❌ SEGV | Crashes immediately |
| Single-thread expansion | ✅ Works | Grows 1→2→3 chunks |
| Multi-thread expansion | ❌ No logs | Crashes before expansion |

## Remaining Issues

### Multi-Thread SEGV

**Symptoms**:
- Crashes within ~1 second
- No expansion logging
- Exit 139 (SIGSEGV)
- Single-thread works perfectly

**Possible Causes**:
1. **Race condition** in expansion path
2. **Memory corruption** in multi-thread initialization
3. **Lock-free algorithm bug** in concurrent slab access
4. **TLS initialization issue** under high thread contention

**Recommended Next Steps**:
1. Run under ThreadSanitizer: `make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4`
2. Add mutex protection around `expand_superslab_head()`
3. Check for TOCTOU bugs in `current_chunk` access
4. Verify atomic operations in slab acquisition

## Why This Achieves 100% (Single-Thread)

The bitmap fix ensures:
1. **Correct exhaustion detection**: `active_slabs >= capacity` is precise
2. **Automatic expansion**: When all slabs are occupied → new chunk allocated
3. **No false OOMs**: System only fails on true memory exhaustion
4. **Repeatable in testing**: 10/10 single-thread runs passed with stable throughput

**Memory behavior** (verified via logs):
- Initial: 1 chunk per class
- Under load: Expands to 2, 3, 4... chunks as needed
- Each new chunk provides 32 fresh slabs
- No premature OOM

## Conclusion

**Single-Thread**: ✅ **100% stability achieved**

**Multi-Thread**: ❌ **Additional fix required** (race condition suspected)

**User's requirement**: NOT YET MET
- Need multi-thread stability for production use
- Recommend: Fix race condition before deployment

---

**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Test Environment**: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks
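
## Appendix: Bug #1 in Miniature

The inverted check described under Bug #1 can be reproduced in isolation. The sketch below is not HAKMEM code: `demo_chunk_t`, `DEMO_CHUNK_CAP`, `demo_take_slab()` and the two predicates are hypothetical names made up for illustration, and the struct only mirrors the two fields the report mentions (`slab_bitmap`, `active_slabs`). It assumes the bitmap semantics stated above (bit value 1 = occupied) and a capacity of 32 slabs per chunk.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one SuperSlab chunk (not the real layout). */
typedef struct {
    uint32_t slab_bitmap;   /* bit value 1 = occupied, bit value 0 = free */
    uint32_t active_slabs;  /* number of occupied slabs */
} demo_chunk_t;

enum { DEMO_CHUNK_CAP = 32 };  /* assumed capacity: 32 slabs per chunk */

/* Pre-fix predicate: "bitmap != 0" is true whenever ANY slab is occupied,
 * so it also reports "has free slabs" for a completely full chunk. */
int has_free_slab_buggy(const demo_chunk_t *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Post-fix predicate: a chunk has room only while fewer than
 * DEMO_CHUNK_CAP slabs are active. */
int has_free_slab_fixed(const demo_chunk_t *c) {
    return c->active_slabs < DEMO_CHUNK_CAP;
}

/* Mark the lowest free slab as occupied.
 * Returns its index, or -1 if every slab is already occupied. */
int demo_take_slab(demo_chunk_t *c) {
    for (int i = 0; i < DEMO_CHUNK_CAP; i++) {
        uint32_t mask = UINT32_C(1) << i;
        if ((c->slab_bitmap & mask) == 0) {
            c->slab_bitmap |= mask;
            c->active_slabs++;
            assert(c->active_slabs <= DEMO_CHUNK_CAP);
            return i;
        }
    }
    return -1;
}
```

After 32 calls to `demo_take_slab()` on a zero-initialized chunk, the bitmap is `0xffffffff` and `active_slabs` is 32. At that point `has_free_slab_buggy()` still returns 1, which is the false "has free slabs" result that kept the allocator from expanding, while `has_free_slab_fixed()` returns 0. The buggy predicate is also wrong in the other direction: on an empty chunk it returns 0 even though all 32 slabs are free.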