hakmem/DEBUG_100PCT_STABILITY.md
Moe Charm (CI) 1010a961fb Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.
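
The stride rule described above can be sketched in C as follows. This is a minimal illustration, not the HAKMEM code: the helper names, the 64 KiB usable size, and the 64-byte block size for class 3 are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration; the real values live in HAKMEM. */
#define TINY_HEADER_BYTES    1u  /* 1-byte class header, per the commit message */
#define TINY_HEADERLESS_CLS  7   /* class7 (1024B) stays headerless */

/* Effective stride: block size plus header for classes 0..6,
 * bare block size for class 7. */
static size_t tiny_stride(int class_idx, size_t block_size) {
    return (class_idx == TINY_HEADERLESS_CLS)
               ? block_size
               : block_size + TINY_HEADER_BYTES;
}

/* Slab capacity derived from the stride, not from the bare block size. */
static size_t slab_capacity(size_t usable_bytes, int class_idx,
                            size_t block_size) {
    return usable_bytes / tiny_stride(class_idx, block_size);
}

/* Debug-style bound check: carving block `index` (0-based) must end
 * inside the slab's usable bytes. Returns 1 if in bounds, else 0. */
static int carve_in_bounds(size_t usable_bytes, int class_idx,
                           size_t block_size, size_t index) {
    size_t stride = tiny_stride(class_idx, block_size);
    return (index + 1) * stride <= usable_bytes;
}
```

With these assumed numbers, a 65536-byte slab of 64-byte blocks holds 1008 blocks at stride 65, whereas a capacity computed from the bare size (1024) would let the carve run past the usable space.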

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If a crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate the P0 batch carve, then continue reducing branch misses as planned.
2025-11-09 18:55:50 +09:00


# HAKMEM 100% Stability Investigation Report
## Executive Summary
- **Status**: PARTIALLY FIXED. Single-threaded runs are 100% stable; multi-threaded runs still crash.
- **Root Cause Found**: Inverted bitmap logic in `superslab_refill()` made a fully occupied chunk look as if it still had free slabs, so expansion was skipped and the refill returned NULL.
- **Primary Fix Implemented**: Replaced the free-slab test `slab_bitmap != 0x00000000` with `active_slabs < chunk_cap`, so a chunk counts as exhausted when `active_slabs >= chunk_cap`.
## Problem Statement
User requirement: **"A memory library is unusable if it crashes even 5% of the time."** (translated from the original Japanese)
Initial Test Results: **19/20 success (95%)** - **UNACCEPTABLE**
## Investigation Timeline
### 1. Failure Reproduction (Run 4 of 30)
**Exit Code**: 134 (SIGABRT)
**Error Log**:
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=3
prev_ss=0x7e21c5400000
active=32
bitmap=0xffffffff ← ALL BITS SET!
errno=12
[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer
```
**Key Observation**: `bitmap=0xffffffff` means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.
### 2. Root Cause Analysis
#### Bug #1: Inverted Bitmap Logic (CRITICAL)
**Location**: `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169`
**Bitmap Semantics** (confirmed via `superslab_find_free_slab:788`):
- Bit value 0 = FREE slab
- Bit value 1 = OCCUPIED slab
- `0x00000000` = All slabs FREE (0 in use)
- `0xffffffff` = All slabs OCCUPIED (32 in use)
**Buggy Code**:
```c
// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // Intended meaning: "current chunk has free slabs" ← WRONG!
    // The test is true whenever ANY slab is occupied, including
    // bitmap=0xffffffff (ALL OCCUPIED), so expansion is never reached.
    ...
}
```
**Problem**:
- When all slabs occupied (`bitmap=0xffffffff`), condition is TRUE
- Code thinks "has free slabs" and continues
- Never reaches expansion logic
- Returns NULL → OOM → Crash
**Fix Applied**:
```c
// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // True only while at least one slab in the chunk is still free.
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered
    ...
}
```
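
To make the difference concrete, the sketch below compares the buggy predicate with two correct ones. The function names are hypothetical and exist only for this illustration. The bitmap-based variant is an alternative that matches the documented bit semantics; it is not what the fix uses, which is the counter-based check.

```c
#include <assert.h>
#include <stdint.h>

/* Buggy predicate: true whenever ANY slab is occupied,
 * including the case where all 32 are occupied. */
static int has_free_slabs_buggy(uint32_t slab_bitmap) {
    return slab_bitmap != 0x00000000u;
}

/* Bitmap-based alternative: at least one bit is still 0 (FREE). */
static int has_free_slabs_bitmap(uint32_t slab_bitmap) {
    return slab_bitmap != 0xffffffffu;
}

/* Counter-based check, mirroring the applied fix. */
static int has_free_slabs_count(uint32_t active_slabs, uint32_t chunk_cap) {
    return active_slabs < chunk_cap;
}
```

Note that the buggy predicate is wrong in both directions: it reports free slabs for a full chunk and reports none for a completely empty one.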
**Verification**:
```bash
# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS
# Expansion messages observed:
#   [HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
#   [HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)
```
#### Bug #2: Slab Deactivation Issue (Secondary)
**Initial Hypothesis**: Slabs become empty (`used=0`) but bitmap bit stays set → memory leak
**Investigation**: Added `superslab_deactivate_slab()` calls when `meta->used == 0`
**Result**: Multi-thread SEGV (even worse than original!)
**Root Cause of SEGV**: Double-initialization corruption
1. Slab freed → `deactivate` → bitmap bit cleared
2. Next alloc → `superslab_find_free_slab()` finds it
3. Calls `init_slab()` AGAIN on already-initialized slab
4. Metadata corruption → SEGV
**Correct Design**: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
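
The lifecycle above can be illustrated with a toy slab. This is a sketch under assumed names (`ToySlab`, `toy_alloc`, `toy_free`), not HAKMEM's data structures: the slab is initialized once, and when every block has been freed it stays active while the freelist serves the next allocation.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCKS 4

/* A free block stores the freelist link in its own storage. */
typedef struct ToyBlock { struct ToyBlock *next; } ToyBlock;

typedef struct {
    ToyBlock  storage[TOY_BLOCKS];
    ToyBlock *freelist;    /* blocks returned by toy_free() */
    unsigned  carved;      /* blocks handed out by linear carve so far */
    unsigned  used;        /* blocks currently allocated */
    unsigned  init_count;  /* how many times toy_init() ran (should stay 1) */
} ToySlab;

static void toy_init(ToySlab *s) {
    s->freelist = NULL;
    s->carved = 0;
    s->used = 0;
    s->init_count++;
}

static void *toy_alloc(ToySlab *s) {
    if (s->freelist) {                 /* reuse a freed block first */
        ToyBlock *b = s->freelist;
        s->freelist = b->next;
        s->used++;
        return b;
    }
    if (s->carved < TOY_BLOCKS) {      /* otherwise carve linearly */
        s->used++;
        return &s->storage[s->carved++];
    }
    return NULL;                       /* slab is full */
}

static void toy_free(ToySlab *s, void *p) {
    ToyBlock *b = p;
    b->next = s->freelist;
    s->freelist = b;
    s->used--;
    /* Deliberately no "deactivate" when used reaches 0: the slab stays
     * active and the freelist handles reuse, so it is never re-initialized. */
}
```

Re-running `toy_init()` on a slab whose freelist is still populated would discard those links, which is the toy analogue of the double-initialization corruption described above.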
## Final Implementation
### Files Modified
1. **`core/tiny_superslab_alloc.inc.h:168-208`**
   - Changed exhaustion check from `bitmap != 0` to `active_slabs < capacity`
   - Added diagnostic logging for expansion events
   - Improved error messages
2. **`core/box/free_local_box.c:100-104`**
   - Added explanatory comment: Why NOT to deactivate slabs
3. **`core/tiny_superslab_free.inc.h:305, 333`**
   - Added comments explaining slab lifecycle
### Test Results
| Configuration | Result | Notes |
|---------------|--------|-------|
| Single-thread (1T) | ✅ 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | ❌ SEGV | Crashes immediately |
| Single-thread expansion | ✅ Works | Grows 1→2→3 chunks |
| Multi-thread expansion | ❌ No logs | Crashes before expansion |
## Remaining Issues
### Multi-Thread SEGV
**Symptoms**:
- Crashes within ~1 second
- No expansion logging
- Exit 139 (SIGSEGV)
- Single-thread works perfectly
**Possible Causes**:
1. **Race condition** in expansion path
2. **Memory corruption** in multi-thread initialization
3. **Lock-free algorithm bug** in concurrent slab access
4. **TLS initialization issue** under high thread contention
**Recommended Next Steps**:
1. Run under ThreadSanitizer: `make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4`
2. Add mutex protection around `expand_superslab_head()`
3. Check for TOCTOU bugs in `current_chunk` access
4. Verify atomic operations in slab acquisition
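
For step 4, the sketch below shows one shape an atomic slab acquisition could take, using a C11 compare-and-swap loop over a 32-bit bitmap. It assumes the bitmap is an `_Atomic uint32_t` with the semantics documented earlier (1 = OCCUPIED); `acquire_slab_bit` is a hypothetical name, and HAKMEM's actual acquisition path may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically claim the lowest FREE (0) bit.
 * Returns the claimed slab index, or -1 when all 32 bits are set. */
static int acquire_slab_bit(_Atomic uint32_t *bitmap) {
    uint32_t old = atomic_load_explicit(bitmap, memory_order_relaxed);
    for (;;) {
        if (old == 0xffffffffu) {
            return -1;                      /* chunk exhausted: caller expands */
        }
        int idx = 0;
        while (old & (UINT32_C(1) << idx)) {
            idx++;                          /* a 0 bit exists, so idx <= 31 */
        }
        uint32_t desired = old | (UINT32_C(1) << idx);
        if (atomic_compare_exchange_weak_explicit(
                bitmap, &old, desired,
                memory_order_acq_rel, memory_order_relaxed)) {
            return idx;                     /* this thread owns slab idx */
        }
        /* CAS failed: `old` now holds the current bitmap value; retry. */
    }
}
```

Because the check and the claim happen in a single compare-and-swap, two threads cannot both observe the same bit as free and then both take it, which is the TOCTOU pattern step 3 asks about.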
## Why This Achieves 100% (Single-Thread)
The bitmap fix ensures:
1. **Correct exhaustion detection**: `active_slabs >= capacity` is precise
2. **Automatic expansion**: When all slabs occupied → new chunk allocated
3. **No false OOMs**: System only fails on true memory exhaustion
4. **Verified by repeated runs**: 10/10 single-thread runs passed at a stable ~770K ops/s
**Memory behavior** (verified via logs):
- Initial: 1 chunk per class
- Under load: Expands to 2, 3, 4... chunks as needed
- Each new chunk provides 32 fresh slabs
- No premature OOM
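
The growth pattern above can be modeled in a few lines. This is a single-threaded toy with hypothetical names (`ToyHead`, `toy_take_slab`), intended only to show the counter-based exhaustion check driving expansion; it does not model freeing or real memory.

```c
#include <assert.h>
#include <stdint.h>

#define SLABS_PER_CHUNK 32u  /* from the report: 32 slabs per chunk */

typedef struct {
    uint32_t chunks;        /* number of chunks for this class */
    uint32_t active_slabs;  /* active slabs in the newest chunk */
} ToyHead;

/* Take one slab; when the current chunk is exhausted, add a fresh chunk
 * first. Returns the chunk count after the call. */
static uint32_t toy_take_slab(ToyHead *h) {
    if (h->active_slabs >= SLABS_PER_CHUNK) {
        h->chunks++;            /* expansion: new chunk with 32 fresh slabs */
        h->active_slabs = 0;
    }
    h->active_slabs++;
    return h->chunks;
}
```

Starting from one chunk, the 33rd slab request triggers the first expansion and leaves one active slab in the new chunk, which is consistent with the `2 chunks now (bitmap=0x00000001)` log line in the verification output.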
## Conclusion
**Single-Thread**: ✅ **100% stability achieved**
**Multi-Thread**: ❌ **Additional fix required** (race condition suspected)
**User's requirement**: NOT YET MET
- Need multi-thread stability for production use
- Recommend: Fix race condition before deployment
---
**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Test Environment**: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks