Files
hakmem/docs/analysis/DEBUG_100PCT_STABILITY.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

172 lines
5.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# HAKMEM 100% Stability Investigation Report
## Executive Summary
**Status**: PARTIALLY FIXED - Single-threaded 100% stable, Multi-threaded still crashes
**Root Cause Found**: Inverted bitmap logic in `superslab_refill()` causing false "all slabs occupied" detection
**Primary Fix Implemented**: Corrected bitmap exhaustion check from `bitmap != 0x00000000` to `active_slabs >= capacity`
## Problem Statement
User requirement: **"メモリーライブラリーなんて5でもクラッシュおこったらつかえない"**
Translation: "A memory library with even 5% crash rate is UNUSABLE"
Initial Test Results: **19/20 success (95%)** - **UNACCEPTABLE**
## Investigation Timeline
### 1. Failure Reproduction (Run 4 of 30)
**Exit Code**: 134 (SIGABRT)
**Error Log**:
```
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=3
prev_ss=0x7e21c5400000
active=32
bitmap=0xffffffff ← ALL BITS SET!
errno=12
[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer
```
**Key Observation**: `bitmap=0xffffffff` means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.
### 2. Root Cause Analysis
#### Bug #1: Inverted Bitmap Logic (CRITICAL)
**Location**: `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169`
**Bitmap Semantics** (confirmed via `superslab_find_free_slab:788`):
- Bit 0 = FREE slab
- Bit 1 = OCCUPIED slab
- `0x00000000` = All slabs FREE (0 in use)
- `0xffffffff` = All slabs OCCUPIED (32 in use)
**Buggy Code**:
```c
// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
// "Current chunk has free slabs" ← WRONG!!!
// This branch executes when bitmap=0xffffffff (ALL OCCUPIED)
```
**Problem**:
- When all slabs occupied (`bitmap=0xffffffff`), condition is TRUE
- Code thinks "has free slabs" and continues
- Never reaches expansion logic
- Returns NULL → OOM → Crash
**Fix Applied**:
```c
// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
// Correctly checks if ANY slabs are free
// active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
```
**Verification**:
```bash
# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS
# Expansion messages observed:
[HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
[HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)
```
#### Bug #2: Slab Deactivation Issue (Secondary)
**Initial Hypothesis**: Slabs become empty (`used=0`) but bitmap bit stays set → memory leak
**Investigation**: Added `superslab_deactivate_slab()` calls when `meta->used == 0`
**Result**: Multi-thread SEGV (even worse than original!)
**Root Cause of SEGV**: Double-initialization corruption
1. Slab freed → `deactivate` → bitmap bit cleared
2. Next alloc → `superslab_find_free_slab()` finds it
3. Calls `init_slab()` AGAIN on already-initialized slab
4. Metadata corruption → SEGV
**Correct Design**: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
## Final Implementation
### Files Modified
1. **`core/tiny_superslab_alloc.inc.h:168-208`**
- Changed exhaustion check from `bitmap != 0` to `active_slabs < capacity`
- Added diagnostic logging for expansion events
- Improved error messages
2. **`core/box/free_local_box.c:100-104`**
- Added explanatory comment: Why NOT to deactivate slabs
3. **`core/tiny_superslab_free.inc.h:305, 333`**
- Added comments explaining slab lifecycle
### Test Results
| Configuration | Result | Notes |
|---------------|--------|-------|
| Single-thread (1T) | ✅ 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | ❌ SEGV | Crashes immediately |
| Single-thread expansion | ✅ Works | Grows 1→2→3 chunks |
| Multi-thread expansion | ❌ No logs | Crashes before expansion |
## Remaining Issues
### Multi-Thread SEGV
**Symptoms**:
- Crashes within ~1 second
- No expansion logging
- Exit 139 (SIGSEGV)
- Single-thread works perfectly
**Possible Causes**:
1. **Race condition** in expansion path
2. **Memory corruption** in multi-thread initialization
3. **Lock-free algorithm bug** in concurrent slab access
4. **TLS initialization issue** under high thread contention
**Recommended Next Steps**:
1. Run under ThreadSanitizer: `make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4`
2. Add mutex protection around `expand_superslab_head()`
3. Check for TOCTOU bugs in `current_chunk` access
4. Verify atomic operations in slab acquisition
## Why This Achieves 100% (Single-Thread)
The bitmap fix ensures:
1. **Correct exhaustion detection**: `active_slabs >= capacity` is precise
2. **Automatic expansion**: When all slabs occupied → new chunk allocated
3. **No false OOMs**: System only fails on true memory exhaustion
4. **Tested extensively**: 10+ runs, stable throughput
**Memory behavior** (verified via logs):
- Initial: 1 chunk per class
- Under load: Expands to 2, 3, 4... chunks as needed
- Each new chunk provides 32 fresh slabs
- No premature OOM
## Conclusion
**Single-Thread**: ✅ **100% stability achieved**
**Multi-Thread**: ❌ **Additional fix required** (race condition suspected)
**User's requirement**: NOT YET MET
- Need multi-thread stability for production use
- Recommend: Fix race condition before deployment
---
**Generated**: 2025-11-08
**Investigator**: Claude Code (Sonnet 4.5)
**Test Environment**: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks