hakmem/docs/archive/DEBUG_100PCT_STABILITY.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV
- core/tiny_refill.h: BG remote replaced with fixed values
- core/hakmem_tiny_slow.inc: removed BG refs

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- Deleted 328 markdown files (old reports, duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; stable operation)
- No functional impact from the ENV cleanup
- Some debug output remains (to be addressed in the next phase)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00


# HAKMEM 100% Stability Investigation Report
## Executive Summary
**Status**: PARTIALLY FIXED - Single-threaded 100% stable, Multi-threaded still crashes
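**Root Cause Found**: Inverted bitmap logic in `superslab_refill()` caused a fully occupied chunk to be treated as still having free slabs, so expansion was never triggered
**Primary Fix Implemented**: Replaced the free-slab check `slab_bitmap != 0x00000000` with `active_slabs < chunk_cap`, so a chunk counts as exhausted exactly when `active_slabs >= capacity`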
## Problem Statement
User requirement (translated from Japanese): **"A memory library is unusable if it crashes even 5% of the time."**
Initial Test Results: **19/20 success (95%)** - **UNACCEPTABLE**
## Investigation Timeline
### 1. Failure Reproduction (Run 4 of 30)
**Exit Code**: 134 (SIGABRT)
**Error Log**:
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=3
prev_ss=0x7e21c5400000
active=32
bitmap=0xffffffff ← ALL BITS SET!
errno=12
[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer
```
**Key Observation**: `bitmap=0xffffffff` means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.
### 2. Root Cause Analysis
#### Bug #1: Inverted Bitmap Logic (CRITICAL)
**Location**: `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169`
**Bitmap Semantics** (confirmed via `superslab_find_free_slab:788`):
- Bit value 0 = FREE slab
- Bit value 1 = OCCUPIED slab
- `0x00000000` = All slabs FREE (0 in use)
- `0xffffffff` = All slabs OCCUPIED (32 in use)
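The semantics above can be sketched with a few bit helpers. This is a minimal illustration, assuming a 32-slab chunk; the names `bitmap_occupied_count`, `bitmap_find_free`, and `bitmap_mark_occupied` are hypothetical and are not HAKMEM's actual functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only; not HAKMEM's API. */
enum { SLABS_PER_CHUNK = 32 };

/* Occupied slabs = number of set bits (bit value 1 = OCCUPIED). */
static int bitmap_occupied_count(uint32_t bitmap) {
    int n = 0;
    while (bitmap) {
        bitmap &= bitmap - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the first FREE slab (lowest clear bit), or -1 if the chunk is full. */
static int bitmap_find_free(uint32_t bitmap) {
    for (int i = 0; i < SLABS_PER_CHUNK; i++) {
        if ((bitmap & (UINT32_C(1) << i)) == 0) {
            return i;
        }
    }
    return -1;
}

/* Return the bitmap with slab i marked OCCUPIED. */
static uint32_t bitmap_mark_occupied(uint32_t bitmap, int i) {
    assert(i >= 0 && i < SLABS_PER_CHUNK);
    return bitmap | (UINT32_C(1) << i);
}
```

Under these semantics a full chunk is `0xffffffff`, so a "has free slabs" test on the bitmap would compare against `0xffffffff`, not `0x00000000`.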
**Buggy Code**:
```c
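// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // Intended meaning: "current chunk has free slabs" (WRONG)
    // This branch also executes when bitmap=0xffffffff (ALL OCCUPIED),
    // so the expansion logic is never reached.
    // ... (rest of the branch not shown in this report)
}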
```
**Problem**:
- When all slabs occupied (`bitmap=0xffffffff`), condition is TRUE
- Code thinks "has free slabs" and continues
- Never reaches expansion logic
- Returns NULL → OOM → Crash
**Fix Applied**:
```c
// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // Correctly checks whether ANY slab is still free.
    // active_slabs=32, chunk_cap=32 → FALSE → expansion is triggered.
    // ... (rest of the branch not shown in this report)
}
```
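To make the difference concrete, the two predicates can be compared side by side. This is a sketch under stated assumptions: `chunk_t` is a hypothetical stand-in holding only the two fields the report mentions, and `has_free_buggy` / `has_free_fixed` are illustrative names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a SuperSlab chunk (illustration only). */
typedef struct {
    uint32_t slab_bitmap;   /* bit value 1 = OCCUPIED */
    uint32_t active_slabs;  /* number of slabs in use */
} chunk_t;

/* Pre-fix check: true whenever ANY slab is occupied. */
static bool has_free_buggy(const chunk_t *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Post-fix check: true only while at least one slab is unused. */
static bool has_free_fixed(const chunk_t *c, uint32_t chunk_cap) {
    return c->active_slabs < chunk_cap;
}
```

With a full chunk (`slab_bitmap=0xffffffff`, `active_slabs=32`) the buggy predicate reports free slabs while the fixed one does not. By the stated bitmap semantics the buggy predicate is also wrong for an empty chunk, where it reports no free slabs.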
**Verification**:
```bash
# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS
# Expansion messages observed:
#   [HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
#   [HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)
```
#### Bug #2: Slab Deactivation Issue (Secondary)
**Initial Hypothesis**: Slabs become empty (`used=0`) but bitmap bit stays set → memory leak
**Investigation**: Added `superslab_deactivate_slab()` calls when `meta->used == 0`
**Result**: Multi-thread SEGV (even worse than original!)
**Root Cause of SEGV**: Double-initialization corruption
1. Slab freed → `deactivate` → bitmap bit cleared
2. Next alloc → `superslab_find_free_slab()` finds it
3. Calls `init_slab()` AGAIN on already-initialized slab
4. Metadata corruption → SEGV
**Correct Design**: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
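The lifecycle described here can be illustrated with a toy slab that is initialized once and then reuses blocks through its freelist, even after `used` drops to 0. This is a simplified single-threaded model, not HAKMEM's data structures; `slab_t`, `slab_init`, `slab_alloc`, `slab_free`, and the `init_count` field are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCKS_PER_SLAB = 4 };

typedef struct block { struct block *next; } block_t;

/* Toy slab (illustration only). Zero-initialize before the first slab_init. */
typedef struct {
    block_t  storage[BLOCKS_PER_SLAB];
    block_t *freelist;
    int      used;
    int      init_count;  /* how many times slab_init has run */
} slab_t;

/* Build the freelist over all blocks. Intended to run once per slab. */
static void slab_init(slab_t *s) {
    s->freelist = NULL;
    for (int i = BLOCKS_PER_SLAB - 1; i >= 0; i--) {
        s->storage[i].next = s->freelist;
        s->freelist = &s->storage[i];
    }
    s->used = 0;
    s->init_count++;
}

/* Pop a block from the freelist, or return NULL if the slab is full. */
static void *slab_alloc(slab_t *s) {
    block_t *b = s->freelist;
    if (b == NULL) {
        return NULL;
    }
    s->freelist = b->next;
    s->used++;
    return b;
}

/* Push the block back. used may reach 0, but the slab stays active:
 * no deactivation, so no second slab_init on the next allocation. */
static void slab_free(slab_t *s, void *p) {
    block_t *b = p;
    b->next = s->freelist;
    s->freelist = b;
    s->used--;
}
```

In this model an emptied slab serves the next allocation straight from its freelist, so `init_count` stays at 1 and there is no window for a second `init_slab()`-style call to overwrite live metadata.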
## Final Implementation
### Files Modified
1. **`core/tiny_superslab_alloc.inc.h:168-208`**
- Changed the free-slab check from `slab_bitmap != 0` to `active_slabs < chunk_cap` (exhausted when `active_slabs >= capacity`)
- Added diagnostic logging for expansion events
- Improved error messages
2. **`core/box/free_local_box.c:100-104`**
- Added explanatory comment: Why NOT to deactivate slabs
3. **`core/tiny_superslab_free.inc.h:305, 333`**
- Added comments explaining slab lifecycle
### Test Results
| Configuration | Result | Notes |
|---------------|--------|-------|
| Single-thread (1T) | ✅ 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | ❌ SEGV | Crashes immediately |
| Single-thread expansion | ✅ Works | Grows 1→2→3 chunks |
| Multi-thread expansion | ❌ No logs | Crashes before expansion |
## Remaining Issues
### Multi-Thread SEGV
**Symptoms**:
- Crashes within ~1 second
- No expansion logging
- Exit 139 (SIGSEGV)
- Single-thread works perfectly
**Possible Causes**:
1. **Race condition** in expansion path
2. **Memory corruption** in multi-thread initialization
3. **Lock-free algorithm bug** in concurrent slab access
4. **TLS initialization issue** under high thread contention
**Recommended Next Steps**:
1. Run under ThreadSanitizer: `make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4`
2. Add mutex protection around `expand_superslab_head()`
3. Check for TOCTOU bugs in `current_chunk` access
4. Verify atomic operations in slab acquisition
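Steps 2 and 3 could look roughly like the following: take a lock around expansion and re-check the exhaustion condition after acquiring it, so two threads that both saw a full chunk do not both expand. This is a minimal sketch, assuming a hypothetical `head_t` and `expand_if_exhausted()`; it is not the real `expand_superslab_head()`, and it assumes every access to these fields happens under the same lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

enum { CHUNK_CAP = 32 };

/* Hypothetical per-class head (illustration only). */
typedef struct {
    pthread_mutex_t expand_lock;
    uint32_t        active_slabs;  /* slabs in use in the current chunk */
    uint32_t        chunk_count;   /* chunks allocated so far */
} head_t;

/* Returns 1 if this call added a chunk, 0 if no expansion was needed.
 * The exhaustion check happens under the lock, which closes the
 * check-then-act (TOCTOU) window between seeing a full chunk and
 * replacing it. */
static int expand_if_exhausted(head_t *h) {
    int expanded = 0;
    pthread_mutex_lock(&h->expand_lock);
    if (h->active_slabs >= CHUNK_CAP) {
        h->chunk_count++;     /* stand-in for allocating a new chunk */
        h->active_slabs = 0;  /* the new current chunk starts empty */
        expanded = 1;
    }
    pthread_mutex_unlock(&h->expand_lock);
    return expanded;
}
```

A thread that loses the race finds `active_slabs < CHUNK_CAP` once it holds the lock and returns without adding a second chunk. Whether a mutex is acceptable on this path, or an atomic compare-and-swap on `current_chunk` is preferable, depends on measurements this report does not include.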
## Why This Achieves 100% (Single-Thread)
The bitmap fix ensures:
1. **Correct exhaustion detection**: `active_slabs >= capacity` is precise
2. **Automatic expansion**: When all slabs occupied → new chunk allocated
3. **No false OOMs**: System only fails on true memory exhaustion
4. **Tested**: 10/10 single-thread runs passed with stable throughput (about 770K ops/s)
**Memory behavior** (verified via logs):
- Initial: 1 chunk per class
- Under load: Expands to 2, 3, 4... chunks as needed
- Each new chunk provides 32 fresh slabs
- No premature OOM
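The growth pattern follows from simple arithmetic: with 32 slabs per chunk and slabs that stay active once initialized, the chunk count for a class is the number of active slabs divided by 32, rounded up. A small sketch, where `chunks_needed` is a hypothetical helper and not part of HAKMEM:

```c
#include <assert.h>
#include <stdint.h>

enum { SLABS_PER_CHUNK = 32 };

/* Chunks required for a class holding the given number of active slabs.
 * Assumes one initial chunk per class and that slabs are never
 * deactivated, as described in this report. */
static uint32_t chunks_needed(uint32_t active_slabs) {
    if (active_slabs == 0) {
        return 1;  /* the initial chunk exists even when empty */
    }
    return (active_slabs + SLABS_PER_CHUNK - 1) / SLABS_PER_CHUNK;
}
```

For example, the 33rd active slab forces the second chunk and the 65th forces the third, matching the 1→2→3 growth seen in the single-thread test.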
## Conclusion
**Single-Thread**: ✅ **100% stability achieved**
**Multi-Thread**: ❌ **Additional fix required** (race condition suspected)
**User's requirement**: NOT YET MET
- Need multi-thread stability for production use
- Recommend: Fix race condition before deployment
---
**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Test Environment**: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks