hakmem/docs/analysis/DEBUG_100PCT_STABILITY.md


HAKMEM 100% Stability Investigation Report

Executive Summary

Status: PARTIALLY FIXED. Single-threaded runs are 100% stable; multi-threaded runs still crash.

Root Cause Found: Inverted bitmap logic in superslab_refill() failed to detect the "all slabs occupied" state, so a full chunk was never expanded.

Primary Fix Implemented: Replaced the free-slab check bitmap != 0x00000000 with active_slabs < capacity, so a chunk counts as exhausted exactly when active_slabs >= capacity.

Problem Statement

User requirement (original, Japanese): "メモリーライブラリーなんて5でもクラッシュおこったらつかえない"

Translation: "A memory library with even 5% crash rate is UNUSABLE"

Initial Test Results: 19/20 success (95%) - UNACCEPTABLE

Investigation Timeline

1. Failure Reproduction (Run 4 of 30)

Exit Code: 134 (SIGABRT)

Error Log:

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=3
  prev_ss=0x7e21c5400000
  active=32
  bitmap=0xffffffff  ← ALL BITS SET!
  errno=12

[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer

Key Observation: bitmap=0xffffffff means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.

2. Root Cause Analysis

Bug #1: Inverted Bitmap Logic (CRITICAL)

Location: /mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169

Bitmap Semantics (confirmed via superslab_find_free_slab:788):

  • Bit value 0 = FREE slab
  • Bit value 1 = OCCUPIED slab
  • 0x00000000 = All slabs FREE (0 in use)
  • 0xffffffff = All slabs OCCUPIED (32 in use)
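The bitmap convention above can be sketched in a few lines of C. This is a minimal illustration, not HAKMEM's code: the helper names `bitmap_occupied_count` and `bitmap_first_free` and the constant `SLABS_PER_CHUNK` are assumptions made for the example. Only the "1 = occupied, 0 = free" convention and the 32-slab width come from the report.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of HAKMEM's API.
 * Convention from the report: bit value 1 = occupied, bit value 0 = free. */
enum { SLABS_PER_CHUNK = 32 };

/* Number of occupied slabs = number of set bits in the bitmap. */
int bitmap_occupied_count(uint32_t bitmap) {
    int n = 0;
    while (bitmap != 0) {
        bitmap &= bitmap - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the first free slab (first zero bit), or -1 if all 32 are occupied. */
int bitmap_first_free(uint32_t bitmap) {
    for (int i = 0; i < SLABS_PER_CHUNK; i++) {
        if ((bitmap & (UINT32_C(1) << i)) == 0) {
            return i;
        }
    }
    return -1;
}
```

With this convention, a bitmap of 0xffffffff yields 32 occupied slabs and no free index, which matches the state shown in the error log.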

Buggy Code:

// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // "Current chunk has free slabs" ← WRONG!!!
    // This branch executes when bitmap=0xffffffff (ALL OCCUPIED)
    /* ... rest of branch elided in this excerpt ... */
}

Problem:

  • When all slabs occupied (bitmap=0xffffffff), condition is TRUE
  • Code thinks "has free slabs" and continues
  • Never reaches expansion logic
  • Returns NULL → OOM → Crash

Fix Applied:

// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // Correctly checks if ANY slabs are free
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
    /* ... rest of branch elided in this excerpt ... */
}
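The difference between the two checks can be sketched as a pair of predicates. The struct `chunk_view_t` and both function names are hypothetical stand-ins for this example. Only the field names `slab_bitmap` and `active_slabs` and the two conditions come from the snippets above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the chunk header; only the fields named in the report. */
typedef struct {
    uint32_t slab_bitmap;   /* bit value 1 = occupied, 0 = free */
    uint32_t active_slabs;  /* number of slabs currently handed out */
} chunk_view_t;

/* Pre-fix predicate: true whenever ANY slab is occupied,
 * including the case where all of them are. */
bool has_free_slabs_buggy(const chunk_view_t *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Post-fix predicate: true only while at least one slab is still unused. */
bool has_free_slabs_fixed(const chunk_view_t *c, uint32_t chunk_cap) {
    return c->active_slabs < chunk_cap;
}
```

For a full chunk (bitmap=0xffffffff, active_slabs=32, chunk_cap=32) the pre-fix predicate returns true and the post-fix predicate returns false. The pre-fix predicate is also wrong in the opposite direction: for a completely empty chunk (bitmap=0) it reports no free slabs.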

Verification:

# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS

# Expansion messages observed:
[HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
[HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)

Bug #2: Slab Deactivation Issue (Secondary)

Initial Hypothesis: Slabs become empty (used=0) but bitmap bit stays set → memory leak

Investigation: Added superslab_deactivate_slab() calls when meta->used == 0

Result: Multi-thread SEGV (even worse than original!)

Root Cause of SEGV: Double-initialization corruption

  1. Slab freed → deactivate → bitmap bit cleared
  2. Next alloc → superslab_find_free_slab() finds it
  3. Calls init_slab() AGAIN on already-initialized slab
  4. Metadata corruption → SEGV

Correct Design: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
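The lifecycle rule above (initialize once, let the freelist handle reuse) can be sketched as follows. This is an illustration under stated assumptions, not the HAKMEM implementation: the type `slab_meta_sketch_t`, its fields other than `used`, and all three function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab metadata; every name here except `used` is an assumption. */
typedef struct {
    void    *freelist;     /* singly linked list of freed blocks */
    uint16_t used;         /* blocks currently handed out */
    uint16_t carved;       /* blocks carved from the slab so far */
    uint16_t capacity;     /* total blocks the slab can hold */
    uint8_t  initialized;  /* set once, never cleared while the chunk lives */
} slab_meta_sketch_t;

/* Initialize only on first use. Re-running the body on a live slab would
 * wipe the freelist and counters, which is the corruption described above. */
void slab_init_once(slab_meta_sketch_t *m, uint16_t capacity) {
    if (m->initialized) {
        return;
    }
    m->freelist = NULL;
    m->used = 0;
    m->carved = 0;
    m->capacity = capacity;
    m->initialized = 1;
}

/* Free pushes the block onto the freelist. The slab stays active
 * even when `used` drops to 0. */
void slab_free_block(slab_meta_sketch_t *m, void *block) {
    *(void **)block = m->freelist;
    m->freelist = block;
    m->used--;
}

/* Alloc prefers the freelist, then carves a fresh block from the slab. */
void *slab_alloc_block(slab_meta_sketch_t *m, unsigned char *base, size_t block_size) {
    if (m->freelist != NULL) {
        void *b = m->freelist;
        m->freelist = *(void **)b;
        m->used++;
        return b;
    }
    if (m->carved < m->capacity) {
        void *b = base + (size_t)m->carved * block_size;
        m->carved++;
        m->used++;
        return b;
    }
    return NULL;  /* slab is full */
}
```

In this sketch a slab whose `used` count returns to 0 still serves the next allocation from its freelist, so nothing needs to clear the bitmap bit or call the initializer again.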

Final Implementation

Files Modified

  1. core/tiny_superslab_alloc.inc.h:168-208

    • Changed exhaustion check from bitmap != 0 to active_slabs < capacity
    • Added diagnostic logging for expansion events
    • Improved error messages
  2. core/box/free_local_box.c:100-104

    • Added explanatory comment: Why NOT to deactivate slabs
  3. core/tiny_superslab_free.inc.h:305, 333

    • Added comments explaining slab lifecycle

Test Results

| Configuration | Result | Notes |
|---|---|---|
| Single-thread (1T) | 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | SEGV | Crashes immediately |
| Single-thread expansion | Works | Grows 1→2→3 chunks |
| Multi-thread expansion | No logs | Crashes before expansion |

Remaining Issues

Multi-Thread SEGV

Symptoms:

  • Crashes within ~1 second
  • No expansion logging
  • Exit 139 (SIGSEGV)
  • Single-thread works perfectly
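The exit codes quoted in this report (134 earlier, 139 here) follow the common shell convention of reporting 128 plus the signal number when a process is killed by a signal. A small sketch of that decoding, with hypothetical function names, might look like this:

```c
#include <signal.h>

/* Hypothetical helper: recover the signal number from a shell-style exit
 * status. Assumes the "128 + signal" convention; a program that exits
 * normally with a status above 128 would be misread. */
int signal_from_shell_status(int status) {
    return (status > 128) ? status - 128 : 0;
}

/* Name the two signals that appear in this report. */
const char *signal_label_sketch(int sig) {
    switch (sig) {
    case SIGABRT: return "SIGABRT";
    case SIGSEGV: return "SIGSEGV";
    default:      return "other";
    }
}
```

A test harness that loops over benchmark runs could use this kind of decoding to separate aborts (such as the free(): invalid pointer failure) from segmentation faults when counting failures.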

Possible Causes:

  1. Race condition in expansion path
  2. Memory corruption in multi-thread initialization
  3. Lock-free algorithm bug in concurrent slab access
  4. TLS initialization issue under high thread contention

Recommended Next Steps:

  1. Run under ThreadSanitizer: make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4
  2. Add mutex protection around expand_superslab_head()
  3. Check for TOCTOU bugs in current_chunk access
  4. Verify atomic operations in slab acquisition
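Steps 2 and 3 can be sketched together: take a lock around expansion and re-check the exhaustion condition after acquiring it, so two threads that both saw a full chunk do not both expand. This is a deliberately simplified sketch, not HAKMEM's code. The types `chunk_sketch_t` and `head_sketch_t` and the function `head_get_chunk_with_room` are hypothetical, and the sketch takes the lock on every call, which a real fast path would avoid.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical types; names and layout are assumptions for illustration. */
typedef struct chunk_sketch {
    uint32_t active_slabs;
    struct chunk_sketch *next;
} chunk_sketch_t;

typedef struct {
    pthread_mutex_t expand_lock;
    chunk_sketch_t *current_chunk;
    uint32_t chunk_cap;
    uint32_t total_chunks;
} head_sketch_t;

/* Returns a chunk with at least one free slab, expanding if needed.
 * Returns NULL only when the new chunk cannot be allocated. */
chunk_sketch_t *head_get_chunk_with_room(head_sketch_t *h) {
    pthread_mutex_lock(&h->expand_lock);
    chunk_sketch_t *c = h->current_chunk;
    /* Checked under the lock: another thread may already have expanded,
     * in which case current_chunk has room and no new chunk is needed. */
    if (c == NULL || c->active_slabs >= h->chunk_cap) {
        chunk_sketch_t *fresh = calloc(1, sizeof *fresh);
        if (fresh != NULL) {
            fresh->next = c;            /* keep the exhausted chunk reachable */
            h->current_chunk = fresh;
            h->total_chunks++;
        }
        c = fresh;
    }
    pthread_mutex_unlock(&h->expand_lock);
    return c;
}
```

Any code that reads `current_chunk` or `active_slabs` outside the lock would still need atomic loads and its own re-validation. The sketch only shows the check-then-expand step being made indivisible.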

Why This Achieves 100% (Single-Thread)

The bitmap fix ensures:

  1. Correct exhaustion detection: a chunk is treated as exhausted exactly when active_slabs >= capacity
  2. Automatic expansion: When all slabs occupied → new chunk allocated
  3. No false OOMs: System only fails on true memory exhaustion
  4. Tested extensively: 10+ runs, stable throughput

Memory behavior (verified via logs):

  • Initial: 1 chunk per class
  • Under load: Expands to 2, 3, 4... chunks as needed
  • Each new chunk provides 32 fresh slabs
  • No premature OOM
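The growth pattern described above can be modeled with a small counter-based sketch. It assumes 32 slabs per chunk, as stated in the report, and slabs that are never deactivated. The type `class_state_t` and the function `take_slab` are hypothetical names for the example.

```c
#include <stdint.h>

enum { SLABS_PER_CHUNK_SKETCH = 32 };  /* from the report: 32 slabs per chunk */

/* Hypothetical per-class state: chunk count and occupancy of the newest chunk. */
typedef struct {
    uint32_t chunks;
    uint32_t active_in_current;
} class_state_t;

/* Take one slab. When the current chunk is exhausted, add a chunk first,
 * mirroring the "exhausted, expanding..." log lines. */
void take_slab(class_state_t *s) {
    if (s->chunks == 0 || s->active_in_current >= SLABS_PER_CHUNK_SKETCH) {
        s->chunks++;
        s->active_in_current = 0;
    }
    s->active_in_current++;
}
```

Starting from one chunk, the 33rd slab request triggers the first expansion and the 65th triggers the second, which corresponds to the 1→2→3 growth seen in the single-thread logs.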

Conclusion

Single-Thread: 100% stability achieved

Multi-Thread: Additional fix required (race condition suspected)

User's requirement: NOT YET MET

  • Need multi-thread stability for production use
  • Recommend: Fix race condition before deployment

Generated: 2025-11-08
Investigator: Claude Code (Sonnet 4.5)
Test Environment: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks