hakmem/DEBUG_100PCT_STABILITY.md
Moe Charm (CI) 1010a961fb Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).
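
A minimal sketch of the stride rule described above. The constant names, the 64 KiB usable size, and the pairing of class 3 with a 64-byte block are assumptions for illustration, not values taken from the HAKMEM source. It shows why capacity must be derived from the header-aware stride and what a debug-only bound check could look like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration; the real HAKMEM values may differ. */
#define TINY_HEADER_BYTES   1u  /* 1-byte class-index header (HEADER_CLASSIDX=1) */
#define TINY_HEADERLESS_CLS 7u  /* class 7 (1024B) stays headerless */

/* Effective stride: block size plus header for classes 0..6, bare size for class 7. */
static size_t tiny_stride(unsigned class_idx, size_t block_size) {
    return block_size + (class_idx < TINY_HEADERLESS_CLS ? TINY_HEADER_BYTES : 0u);
}

/* Capacity derived from the stride, not from the bare block size. */
static size_t slab_capacity(size_t usable_bytes, unsigned class_idx, size_t block_size) {
    assert(block_size > 0);
    return usable_bytes / tiny_stride(class_idx, block_size);
}

/* Debug-style bound check: does carving block number `index` stay inside the slab? */
static int carve_in_bounds(size_t usable_bytes, unsigned class_idx,
                           size_t block_size, size_t index) {
    size_t stride = tiny_stride(class_idx, block_size);
    return (index + 1) * stride <= usable_bytes;
}
```

Under these assumptions, a 65536-byte slab of 64-byte blocks holds 1008 blocks at stride 65, whereas a capacity computed from the bare size would be 1024, so carving the last 16 blocks would run past the usable space.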

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If any crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate P0 batch carve, and continue reducing branch misses as planned.
2025-11-09 18:55:50 +09:00


HAKMEM 100% Stability Investigation Report

Executive Summary

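Status: PARTIALLY FIXED. Single-threaded runs are 100% stable; multi-threaded runs still crash.

Root Cause Found: Inverted bitmap logic in superslab_refill() treated a fully occupied chunk as if it still had free slabs, so expansion never ran and refill returned NULL (a false OOM).

Primary Fix Implemented: Replaced the free-slab check bitmap != 0x00000000 with active_slabs < capacity, so a chunk counts as exhausted exactly when active_slabs >= capacity.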

Problem Statement

User requirement: "メモリーライブラリーなんて5でもクラッシュおこったらつかえない"

Translation: "A memory library that crashes even 5% of the time is unusable."

Initial Test Results: 19/20 success (95%) - UNACCEPTABLE

Investigation Timeline

1. Failure Reproduction (Run 4 of 30)

Exit Code: 134 (SIGABRT)

Error Log:

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=3
  prev_ss=0x7e21c5400000
  active=32
  bitmap=0xffffffff  ← ALL BITS SET!
  errno=12

[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer

Key Observation: bitmap=0xffffffff means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.

2. Root Cause Analysis

Bug #1: Inverted Bitmap Logic (CRITICAL)

Location: /mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169

Bitmap Semantics (confirmed via superslab_find_free_slab:788):

  • Bit value 0 = FREE slab
  • Bit value 1 = OCCUPIED slab
  • 0x00000000 = All slabs FREE (0 in use)
  • 0xffffffff = All slabs OCCUPIED (32 in use)
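
These semantics can be sketched with a few helpers. The function names below are hypothetical and are not the actual superslab_find_free_slab() implementation; the only assumption carried over from the report is a 32-bit bitmap where a set bit means "occupied".

```c
#include <assert.h>
#include <stdint.h>

#define SLABS_PER_CHUNK 32  /* one bit per slab in a 32-bit bitmap */

/* Number of occupied slabs = number of 1 bits. */
static int bitmap_occupied(uint32_t bitmap) {
    int n = 0;
    while (bitmap) {
        bitmap &= bitmap - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the lowest free slab (lowest 0 bit), or -1 if every slab is occupied. */
static int bitmap_find_free(uint32_t bitmap) {
    for (int i = 0; i < SLABS_PER_CHUNK; i++) {
        if ((bitmap & (UINT32_C(1) << i)) == 0) return i;
    }
    return -1;
}

/* Mark slab `idx` as occupied by setting its bit. */
static uint32_t bitmap_mark_occupied(uint32_t bitmap, int idx) {
    assert(idx >= 0 && idx < SLABS_PER_CHUNK);
    return bitmap | (UINT32_C(1) << idx);
}
```

With this encoding, a "has free slabs" test expressed on the bitmap alone would be bitmap != 0xffffffff, not bitmap != 0x00000000.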

Buggy Code:

// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // Intended meaning: "current chunk has free slabs" ← WRONG!!!
    // This branch also executes when bitmap=0xffffffff (ALL OCCUPIED)
    // ... (branch body elided in this excerpt)
}

Problem:

  • The condition is TRUE whenever at least one slab is occupied, including when all 32 are (bitmap=0xffffffff)
  • The code concludes the chunk "has free slabs" and continues
  • The expansion logic is never reached
  • superslab_refill() returns NULL → OOM → crash

Fix Applied:

// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // True only while at least one slab is still free
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
    // ... (branch body elided in this excerpt)
}
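
To make the difference concrete, the two predicates can be compared side by side. This is a sketch over a reduced, hypothetical chunk_view_t struct (the real chunk type has more fields); only the field names slab_bitmap and active_slabs come from the excerpts above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced view of a chunk for illustration only. */
typedef struct {
    uint32_t slab_bitmap;   /* 1 bit = occupied, 0 bit = free */
    uint32_t active_slabs;  /* number of occupied slabs */
} chunk_view_t;

/* Pre-fix predicate: true whenever at least one slab is occupied. */
static int has_free_slabs_buggy(const chunk_view_t *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Post-fix predicate: true only while some slab is still unused. */
static int has_free_slabs_fixed(const chunk_view_t *c, uint32_t chunk_cap) {
    assert(c->active_slabs <= chunk_cap);  /* invariant assumed by this sketch */
    return c->active_slabs < chunk_cap;
}
```

For a full chunk (bitmap=0xffffffff, active=32) the old predicate answers "has free slabs" and the new one answers "exhausted". For a completely empty chunk the old predicate is wrong in the other direction and reports no free slabs.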

Verification:

# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS

# Expansion messages observed:
[HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
[HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)

Bug #2: Slab Deactivation Issue (Secondary)

Initial Hypothesis: Slabs become empty (used=0) but bitmap bit stays set → memory leak

Investigation: Added superslab_deactivate_slab() calls when meta->used == 0

Result: Multi-thread SEGV (even worse than original!)

Root Cause of SEGV: Double-initialization corruption

  1. Slab freed → deactivate → bitmap bit cleared
  2. Next alloc → superslab_find_free_slab() finds it
  3. Calls init_slab() AGAIN on already-initialized slab
  4. Metadata corruption → SEGV

Correct Design: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
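
The lifecycle above can be sketched as follows. All names here (slab_meta_t, slab_init_once, and so on) are hypothetical and the struct is far simpler than the real slab metadata; init_count exists only to make re-initialization observable. The point is that a slab whose used count drops to 0 stays initialized, and the next allocation is served from its freelist instead of triggering a second init.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified slab metadata for illustration only. */
typedef struct {
    void    *freelist;     /* singly linked list of freed blocks */
    uint16_t used;         /* blocks currently handed out */
    uint8_t  initialized;  /* set once; not cleared while the chunk lives */
    uint32_t init_count;   /* illustration only: counts real initializations */
} slab_meta_t;

/* Initialize a slab exactly once; later calls leave the metadata untouched. */
static void slab_init_once(slab_meta_t *m) {
    if (m->initialized) return;
    m->freelist = NULL;
    m->used = 0;
    m->initialized = 1;
    m->init_count++;
}

/* Push a block onto the freelist. The slab stays active even if used hits 0. */
static void slab_free_block(slab_meta_t *m, void *block) {
    assert(m->used > 0);
    *(void **)block = m->freelist;
    m->freelist = block;
    m->used--;
}

/* Pop a block from the freelist, or return NULL if it is empty. */
static void *slab_alloc_block(slab_meta_t *m) {
    void *b = m->freelist;
    if (b == NULL) return NULL;
    m->freelist = *(void **)b;
    m->used++;
    return b;
}
```

In this sketch, calling slab_init_once() again after the slab has emptied is a no-op, so the freelist built up by earlier frees is preserved rather than overwritten.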

Final Implementation

Files Modified

  1. core/tiny_superslab_alloc.inc.h:168-208

    • Changed the free-slab check from bitmap != 0 to active_slabs < capacity (exhausted when active_slabs >= capacity)
    • Added diagnostic logging for expansion events
    • Improved error messages
  2. core/box/free_local_box.c:100-104

    • Added explanatory comment: Why NOT to deactivate slabs
  3. core/tiny_superslab_free.inc.h:305, 333

    • Added comments explaining slab lifecycle

Test Results

| Configuration | Result | Notes |
|---|---|---|
| Single-thread (1T) | 100% (10/10) | 770K ops/s |
| Multi-thread (4T) | SEGV | Crashes immediately |
| Single-thread expansion | Works | Grows 1→2→3 chunks |
| Multi-thread expansion | No logs | Crashes before expansion |

Remaining Issues

Multi-Thread SEGV

Symptoms:

  • Crashes within ~1 second
  • No expansion logging
  • Exit 139 (SIGSEGV)
  • Single-thread works perfectly

Possible Causes:

  1. Race condition in expansion path
  2. Memory corruption in multi-thread initialization
  3. Lock-free algorithm bug in concurrent slab access
  4. TLS initialization issue under high thread contention

Recommended Next Steps:

  1. Run under ThreadSanitizer: make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4
  2. Add mutex protection around expand_superslab_head()
  3. Check for TOCTOU bugs in current_chunk access
  4. Verify atomic operations in slab acquisition
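
As a starting point for steps 2 and 3, one possible shape for a mutex-protected expansion is sketched below. The head_view_t struct and expand_if_exhausted() are hypothetical stand-ins, not the real SuperSlabHead or expand_superslab_head(), and whether a lock is the right fix is still unverified. The sketch re-checks the exhaustion condition while holding the lock, so a thread that lost the race does not expand a second time based on a stale read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, reduced view of a per-class head for illustration only. */
typedef struct {
    pthread_mutex_t expand_lock;
    uint32_t chunk_count;   /* chunks allocated so far */
    uint32_t active_slabs;  /* occupied slabs in the current chunk */
    uint32_t chunk_cap;     /* slabs per chunk */
} head_view_t;

/* Returns 1 if this call expanded, 0 if the current chunk already had room. */
static int expand_if_exhausted(head_view_t *h) {
    int expanded = 0;
    int rc = pthread_mutex_lock(&h->expand_lock);
    assert(rc == 0);
    (void)rc;

    /* Check under the lock: another thread may have expanded already. */
    if (h->active_slabs >= h->chunk_cap) {
        h->chunk_count++;
        h->active_slabs = 0;  /* sketch: the new current chunk starts empty */
        expanded = 1;
    }

    rc = pthread_mutex_unlock(&h->expand_lock);
    assert(rc == 0);
    (void)rc;
    return expanded;
}
```

Because the condition is read only while the lock is held, two threads that both observe an exhausted chunk produce one new chunk instead of two. A lock-free allocation fast path would still need its own atomic reads of the current chunk.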

Why This Achieves 100% (Single-Thread)

The bitmap fix ensures:

  1. Correct exhaustion detection: active_slabs >= capacity is precise
  2. Automatic expansion: When all slabs occupied → new chunk allocated
  3. No false OOMs: System only fails on true memory exhaustion
  4. Tested repeatedly: 10+ single-thread runs with stable throughput

Memory behavior (verified via logs):

  • Initial: 1 chunk per class
  • Under load: Expands to 2, 3, 4... chunks as needed
  • Each new chunk provides 32 fresh slabs
  • No premature OOM

Conclusion

Single-Thread: 100% stability achieved

Multi-Thread: Additional fix required (race condition suspected)

User's requirement: NOT YET MET

  • Need multi-thread stability for production use
  • Recommend: Fix race condition before deployment

Generated: 2025-11-08

Investigator: Claude Code (Sonnet 4.5)

Test Environment: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks