
Moe Charm (CI) 1010a961fb Tiny: fix header/stride mismatch and harden refill paths

- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.
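
A minimal sketch of the stride rule described above, not the actual hakmem code. The names `tiny_effective_stride`, `slab_capacity`, `carve_offset`, and the two enum constants are hypothetical; only the rule itself (a 1-byte header for classes 0..6, class 7 headerless, and a debug-only bound check) comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constants for illustration only. */
enum { TINY_HEADER_BYTES = 1, TINY_HEADERLESS_CLASS = 7 };

/* Effective stride: block size plus the 1-byte header for classes 0..6;
 * class 7 stays headerless. */
static size_t tiny_effective_stride(int class_idx, size_t block_size) {
    return (class_idx == TINY_HEADERLESS_CLASS)
               ? block_size
               : block_size + TINY_HEADER_BYTES;
}

/* Number of blocks that fit when every block occupies one full stride. */
static size_t slab_capacity(size_t usable_bytes, int class_idx, size_t block_size) {
    return usable_bytes / tiny_effective_stride(class_idx, block_size);
}

/* Byte offset of block `index` inside the slab. The assert is compiled out
 * when NDEBUG is defined, mirroring a debug-only fail-fast check. */
static size_t carve_offset(size_t usable_bytes, size_t index,
                           int class_idx, size_t block_size) {
    size_t stride = tiny_effective_stride(class_idx, block_size);
    assert((index + 1) * stride <= usable_bytes);
    return index * stride;
}
```

With an assumed 65536 usable bytes and an assumed 64-byte block in a headered class, the capacity is 65536 / 65 = 1008 blocks, not the 1024 that the bare block size would suggest. That gap is the kind of overrun the commit message describes.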

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate P0 batch carve, and continue reducing branch-miss as planned.

2025-11-09 18:55:50 +09:00


HAKMEM 100% Stability Investigation Report

Executive Summary

Status: PARTIALLY FIXED - Single-threaded 100% stable, multi-threaded still crashes

Root Cause Found: Inverted bitmap logic in superslab_refill() treated a fully occupied chunk (bitmap=0xffffffff) as still having free slabs, so the expansion path never ran

Primary Fix Implemented: Replaced the free-slab check bitmap != 0x00000000 with active_slabs < capacity, so a chunk counts as exhausted when active_slabs >= capacity

Problem Statement

User requirement: "メモリーライブラリーなんて5％でもクラッシュおこったらつかえない"

Translation: "A memory library is unusable if it crashes even 5% of the time"

Initial Test Results: 19/20 success (95%) - UNACCEPTABLE

Investigation Timeline

1. Failure Reproduction (Run 4 of 30)

Exit Code: 134 (SIGABRT)

Error Log:

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=3
  prev_ss=0x7e21c5400000
  active=32
  bitmap=0xffffffff  ← ALL BITS SET!
  errno=12

[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer

Key Observation: bitmap=0xffffffff means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.

2. Root Cause Analysis

Bug #1: Inverted Bitmap Logic (CRITICAL)

Location: /mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169

Bitmap Semantics (confirmed via superslab_find_free_slab:788):

Bit value 0 = FREE slab
Bit value 1 = OCCUPIED slab
0x00000000 = All slabs FREE (0 in use)
0xffffffff = All slabs OCCUPIED (32 in use)
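
The semantics above can be sketched as a small free-slab search. This is an illustrative model, not the body of superslab_find_free_slab(); the names `find_free_slab`, `mark_occupied`, and `SLABS_PER_CHUNK` are hypothetical, and the 32-slab width is taken from the 32-bit bitmap shown in the log.

```c
#include <assert.h>
#include <stdint.h>

enum { SLABS_PER_CHUNK = 32 };  /* hypothetical name; one bit per slab */

/* Returns the index of the first free slab (bit value 0),
 * or -1 when every bit is set (all slabs occupied). */
static int find_free_slab(uint32_t bitmap) {
    for (int i = 0; i < SLABS_PER_CHUNK; i++) {
        if ((bitmap & (UINT32_C(1) << i)) == 0) {
            return i;
        }
    }
    return -1;
}

/* Marks slab `idx` as occupied by setting its bit to 1. */
static uint32_t mark_occupied(uint32_t bitmap, int idx) {
    assert(idx >= 0 && idx < SLABS_PER_CHUNK);
    return bitmap | (UINT32_C(1) << idx);
}
```

Under this model, a freshly expanded chunk whose first slab has just been taken has bitmap 0x00000001, and 0xffffffff is the only value for which the search fails.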

Buggy Code:

// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // Intended meaning: "current chunk has free slabs" ← WRONG!!!
    // This branch also executes when bitmap=0xffffffff (ALL OCCUPIED)
    // (rest of the branch is not shown in this report)
}

Problem:

When all slabs occupied (bitmap=0xffffffff), condition is TRUE
Code thinks "has free slabs" and continues
Never reaches expansion logic
Returns NULL → OOM → Crash

Fix Applied:

// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // Correctly checks whether ANY slab is still free
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
    // (rest of the branch is not shown in this report)
}
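
To make the difference concrete, the two conditions can be compared side by side on a simplified model. The struct `ExhaustModel` and the function names are hypothetical; only the two predicates themselves are taken from the snippets above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the chunk fields used by the check. */
typedef struct {
    uint32_t slab_bitmap;   /* bit value 1 = occupied */
    uint32_t active_slabs;  /* number of slabs in use */
} ExhaustModel;

/* Condition before the fix: true for any non-zero bitmap,
 * including the fully occupied 0xffffffff. */
static int has_free_slab_buggy(const ExhaustModel *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Condition after the fix: true only while capacity remains. */
static int has_free_slab_fixed(const ExhaustModel *c, uint32_t chunk_cap) {
    return c->active_slabs < chunk_cap;
}

/* Small self-check showing where the two predicates disagree. */
static void predicate_self_check(void) {
    ExhaustModel full = { 0xffffffffu, 32 };
    assert(has_free_slab_buggy(&full));        /* wrongly reports free slabs */
    assert(!has_free_slab_fixed(&full, 32));   /* correctly reports exhaustion */
}
```

The old condition is also wrong in the other direction: for a completely empty chunk (bitmap 0x00000000) it reports no free slabs, while the count-based check reports that capacity remains.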

Verification:

# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS

# Expansion messages observed:
[HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
[HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)

Bug #2: Slab Deactivation Issue (Secondary)

Initial Hypothesis: Slabs become empty (used=0) but bitmap bit stays set → memory leak

Investigation: Added superslab_deactivate_slab() calls when meta->used == 0

Result: Multi-thread SEGV (even worse than original!)

Root Cause of SEGV: Double-initialization corruption

Slab freed → deactivate → bitmap bit cleared
Next alloc → superslab_find_free_slab() finds it
Calls init_slab() AGAIN on already-initialized slab
Metadata corruption → SEGV

Correct Design: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
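
The lifecycle can be illustrated with a toy slab model in which freed blocks return to a freelist and the slab is never re-initialized, even when its used count drops to zero. Everything here (`SlabModel`, `slab_init`, `slab_alloc`, `slab_free`, the 4-block size) is hypothetical. The idempotent guard in `slab_init` is an illustrative addition, not something the report says was implemented.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCKS_PER_SLAB = 4 };  /* tiny on purpose, for illustration */

typedef struct {
    int initialized;
    int used;                        /* blocks currently handed out */
    int carved;                      /* blocks produced by linear carve so far */
    int freelist[BLOCKS_PER_SLAB];   /* stack of freed block indices */
    int free_count;
} SlabModel;

/* Initializes the slab once; a second call is a no-op so live
 * metadata is never reset (illustrative guard, an assumption). */
static void slab_init(SlabModel *s) {
    if (s->initialized) {
        return;
    }
    s->used = 0;
    s->carved = 0;
    s->free_count = 0;
    s->initialized = 1;
}

/* Prefers freed blocks, then falls back to linear carve. Returns -1 when full. */
static int slab_alloc(SlabModel *s) {
    assert(s->initialized);
    if (s->free_count > 0) {
        s->used++;
        return s->freelist[--s->free_count];
    }
    if (s->carved < BLOCKS_PER_SLAB) {
        s->used++;
        return s->carved++;
    }
    return -1;
}

/* Returns a block to the freelist. The slab stays active even if used hits 0. */
static void slab_free(SlabModel *s, int block) {
    assert(s->initialized && s->used > 0);
    s->freelist[s->free_count++] = block;
    s->used--;
}
```

In this model a slab that becomes empty keeps its carve position and freelist, so the next allocation reuses a freed block instead of restarting from block 0 over memory that may still be referenced.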

Final Implementation

Files Modified

core/tiny_superslab_alloc.inc.h:168-208
- Changed exhaustion check from bitmap != 0 to active_slabs < capacity
- Added diagnostic logging for expansion events
- Improved error messages
core/box/free_local_box.c:100-104
- Added explanatory comment: Why NOT to deactivate slabs
core/tiny_superslab_free.inc.h:305, 333
- Added comments explaining slab lifecycle

Test Results

Configuration	Result	Notes
Single-thread (1T)	✅ 100% (10/10)	770K ops/s
Multi-thread (4T)	❌ SEGV	Crashes immediately
Single-thread expansion	✅ Works	Grows 1→2→3 chunks
Multi-thread expansion	❌ No logs	Crashes before expansion

Remaining Issues

Multi-Thread SEGV

Symptoms:

Crashes within ~1 second
No expansion logging
Exit 139 (SIGSEGV)
Single-thread works perfectly

Possible Causes:

Race condition in expansion path
Memory corruption in multi-thread initialization
Lock-free algorithm bug in concurrent slab access
TLS initialization issue under high thread contention

Recommended Next Steps:

Run under ThreadSanitizer: make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4
Add mutex protection around expand_superslab_head()
Check for TOCTOU bugs in current_chunk access
Verify atomic operations in slab acquisition
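
As one way to picture the mutex suggestion, the sketch below serializes expansion and re-checks exhaustion after taking the lock, so two threads that both observe a full chunk do not both expand. It is a simplified model under assumptions, not the real expand_superslab_head(): `ExpChunk`, `ExpHead`, `head_init`, `expand_if_exhausted`, and `EXP_CHUNK_CAP` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

enum { EXP_CHUNK_CAP = 32 };  /* slabs per chunk, as in the report's logs */

typedef struct ExpChunk {
    unsigned active_slabs;
    struct ExpChunk *next;
} ExpChunk;

typedef struct {
    ExpChunk *current_chunk;
    unsigned total_chunks;
    pthread_mutex_t expand_lock;
} ExpHead;

static void head_init(ExpHead *h) {
    h->current_chunk = NULL;
    h->total_chunks = 0;
    pthread_mutex_init(&h->expand_lock, NULL);
}

/* Adds a chunk only if the current one is missing or exhausted.
 * Returns 0 on success (including "nothing to do"), -1 if allocation fails. */
static int expand_if_exhausted(ExpHead *h) {
    int rc = 0;
    assert(h != NULL);
    pthread_mutex_lock(&h->expand_lock);
    /* Re-check under the lock: another thread may have expanded already. */
    if (h->current_chunk == NULL ||
        h->current_chunk->active_slabs >= EXP_CHUNK_CAP) {
        ExpChunk *fresh = calloc(1, sizeof *fresh);
        if (fresh == NULL) {
            rc = -1;
        } else {
            fresh->next = h->current_chunk;
            h->current_chunk = fresh;
            h->total_chunks++;
        }
    }
    pthread_mutex_unlock(&h->expand_lock);
    return rc;
}
```

This only serializes the expansion itself. Threads that read current_chunk outside the lock would still need atomic loads or the same lock, which is where the TOCTOU check listed above comes in.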

Why This Achieves 100% (Single-Thread)

The bitmap fix ensures:

Correct exhaustion detection: active_slabs >= capacity is precise
Automatic expansion: When all slabs occupied → new chunk allocated
No false OOMs: System only fails on true memory exhaustion
Tested repeatedly: 10+ single-thread runs with stable throughput (10/10 passed in the recorded series, about 770K ops/s)

Memory behavior (verified via logs):

Initial: 1 chunk per class
Under load: Expands to 2, 3, 4... chunks as needed
Each new chunk provides 32 fresh slabs
No premature OOM

Conclusion

Single-Thread: ✅ 100% stability achieved

Multi-Thread: ❌ Additional fix required (race condition suspected)

User's requirement: NOT YET MET

Need multi-thread stability for production use
Recommend: Fix race condition before deployment

Generated: 2025-11-08

Investigator: Claude Code (Sonnet 4.5)

Test Environment: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks
