
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)

Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV
- core/tiny_refill.h: BG remote changed to fixed values
- core/hakmem_tiny_slow.inc: removed BG refs

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- Deleted 328 markdown files (old reports, duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; runs stably ✅)
- No functional impact from the ENV cleanup
- Some debug output remains (to be handled in the next phase)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:45:26 +09:00


HAKMEM 100% Stability Investigation Report

Executive Summary

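Status: PARTIALLY FIXED. Single-threaded runs are 100% stable; multi-threaded runs still crash.

Root Cause Found: Inverted bitmap logic in superslab_refill(). A fully occupied chunk was treated as still having free slabs, so expansion was never triggered and the refill returned NULL (false OOM).

Primary Fix Implemented: Replaced the free-slab check bitmap != 0x00000000 with active_slabs < capacity (equivalently, a chunk is exhausted when active_slabs >= capacity).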

Problem Statement

User requirement: "メモリーライブラリーなんて5％でもクラッシュおこったらつかえない"

Translation: "A memory library with even a 5% crash rate is UNUSABLE"

Initial Test Results: 19/20 success (95%) - UNACCEPTABLE

Investigation Timeline

1. Failure Reproduction (Run 4 of 30)

Exit Code: 134 (SIGABRT)

Error Log:

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=3
  prev_ss=0x7e21c5400000
  active=32
  bitmap=0xffffffff  ← ALL BITS SET!
  errno=12

[HAKMEM] OOM: Unexpected allocation path for size=50, returning NULL
free(): invalid pointer

Key Observation: bitmap=0xffffffff means all 32 slabs appear "occupied", but this shouldn't cause OOM if expansion works.

2. Root Cause Analysis

Bug #1: Inverted Bitmap Logic (CRITICAL)

Location: /mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:169

Bitmap Semantics (confirmed via superslab_find_free_slab:788):

Bit value 0 = FREE slab
Bit value 1 = OCCUPIED slab
0x00000000 = All slabs FREE (0 in use)
0xffffffff = All slabs OCCUPIED (32 in use)
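
To make these semantics concrete, here is a minimal sketch of helpers operating on such a 32-bit bitmap. The helper names and the SLABS_PER_CHUNK constant are hypothetical and chosen for illustration only; this is not the code of superslab_find_free_slab().

```c
#include <assert.h>
#include <stdint.h>

#define SLABS_PER_CHUNK 32 /* assumption: 32 slabs per chunk, as in the report */

/* Count occupied slabs: one set bit per occupied slab. */
static int bitmap_occupied_count(uint32_t bitmap) {
    int n = 0;
    while (bitmap != 0) {
        bitmap &= bitmap - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Index of the lowest FREE slab (first 0 bit), or -1 if all are occupied. */
static int bitmap_find_free(uint32_t bitmap) {
    for (int i = 0; i < SLABS_PER_CHUNK; i++) {
        if ((bitmap & (UINT32_C(1) << i)) == 0) {
            return i;
        }
    }
    return -1;
}

/* Return the bitmap with slab idx marked OCCUPIED (bit set to 1). */
static uint32_t bitmap_mark_occupied(uint32_t bitmap, int idx) {
    assert(idx >= 0 && idx < SLABS_PER_CHUNK);
    return bitmap | (UINT32_C(1) << idx);
}
```

Under these semantics a free-slab search looks for a 0 bit, so a test of the form "bitmap != 0" answers "is anything occupied?" rather than "is anything free?".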

Buggy Code:

// Line 169 (BEFORE FIX)
if (current_chunk->slab_bitmap != 0x00000000) {
    // "Current chunk has free slabs" ← WRONG!!!
    // This branch executes when bitmap=0xffffffff (ALL OCCUPIED)

Problem:

When all slabs are occupied (bitmap=0xffffffff), the condition is TRUE
The code concludes "has free slabs" and continues
The expansion logic is never reached
Returns NULL → OOM → Crash

Fix Applied:

// Line 172 (AFTER FIX)
if (current_chunk->active_slabs < chunk_cap) {
    // Correctly checks if ANY slabs are free
    // active_slabs=32, chunk_cap=32 → FALSE → expansion triggered!
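
The difference between the two conditions can be checked in isolation. The sketch below uses a hypothetical, reduced chunk_view_t struct (only the two fields that appear in the snippets above) and wraps each condition in a small predicate; it is an illustration, not the actual superslab_refill() code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced view of a chunk: only the fields used by the check. */
typedef struct {
    uint32_t slab_bitmap;  /* bit value 1 = occupied, 0 = free */
    uint32_t active_slabs; /* number of occupied slabs */
} chunk_view_t;

/* Condition before the fix: true whenever ANY slab is occupied. */
static bool has_free_slabs_buggy(const chunk_view_t *c) {
    return c->slab_bitmap != 0x00000000u;
}

/* Condition after the fix: true only while capacity remains. */
static bool has_free_slabs_fixed(const chunk_view_t *c, uint32_t chunk_cap) {
    return c->active_slabs < chunk_cap;
}
```

For a full chunk (bitmap=0xffffffff, active_slabs=32) the old predicate reports free slabs while the new one does not, which is what lets control reach the expansion path.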

Verification:

# Single-thread test with fix
./larson_hakmem 1 1 128 1024 1 12345 1
# Result: Throughput = 770,797 ops/s ✅ PASS

# Expansion messages observed:
[HAKMEM] SuperSlab chunk exhausted for class 4 (active=32 cap=32 bitmap=0xffffffff), expanding...
[HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now (bitmap=0x00000001)

Bug #2: Slab Deactivation Issue (Secondary)

Initial Hypothesis: Slabs become empty (used=0) but bitmap bit stays set → memory leak

Investigation: Added superslab_deactivate_slab() calls when meta->used == 0

Result: Multi-thread SEGV (even worse than original!)

Root Cause of SEGV: Double-initialization corruption

Slab freed → deactivate → bitmap bit cleared
Next alloc → superslab_find_free_slab() finds it
Calls init_slab() AGAIN on already-initialized slab
Metadata corruption → SEGV

Correct Design: Slabs should stay "active" once initialized until entire SuperSlab chunk is freed. The freelist mechanism handles block reuse.
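
The lifecycle described above can be illustrated with a toy model. Everything in this sketch (toy_slab_t, BLOCKS_PER_SLAB, the index-stack freelist, the init_count field) is a hypothetical simplification and not HAKMEM's actual slab metadata; it only shows that a slab whose used count drops to zero can serve the next allocation from its freelist without being initialized a second time.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCKS_PER_SLAB 4 /* tiny value, for illustration only */

typedef struct {
    int      free_list[BLOCKS_PER_SLAB]; /* stack of free block indices */
    int      free_top;                   /* number of entries on the stack */
    uint16_t used;                       /* blocks currently handed out */
    uint32_t init_count;                 /* how often init ran (demo only) */
} toy_slab_t;

/* One-time initialization: every block starts on the freelist. */
static void toy_slab_init(toy_slab_t *s) {
    s->free_top = 0;
    for (int i = BLOCKS_PER_SLAB - 1; i >= 0; i--) {
        s->free_list[s->free_top++] = i;
    }
    s->used = 0;
    s->init_count++;
}

/* Pop a block index from the freelist, or -1 if the slab is full. */
static int toy_slab_alloc(toy_slab_t *s) {
    if (s->free_top == 0) {
        return -1;
    }
    s->used++;
    return s->free_list[--s->free_top];
}

/* Push the block back. The slab stays active even when used reaches 0. */
static void toy_slab_free(toy_slab_t *s, int block) {
    assert(s->free_top < BLOCKS_PER_SLAB && s->used > 0);
    s->free_list[s->free_top++] = block;
    s->used--;
    /* Deliberately no deactivation here: see "Correct Design" above. */
}
```

In this model, allocating, freeing, and allocating again returns the same block while init_count stays at 1, so no second initialization can overwrite live metadata.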

Final Implementation

Files Modified

core/tiny_superslab_alloc.inc.h:168-208
- Changed exhaustion check from bitmap != 0 to active_slabs < capacity
- Added diagnostic logging for expansion events
- Improved error messages
core/box/free_local_box.c:100-104
- Added explanatory comment: Why NOT to deactivate slabs
core/tiny_superslab_free.inc.h:305, 333
- Added comments explaining slab lifecycle

Test Results

Configuration	Result	Notes
Single-thread (1T)	✅ 100% (10/10)	770K ops/s
Multi-thread (4T)	❌ SEGV	Crashes immediately
Single-thread expansion	✅ Works	Grows 1→2→3 chunks
Multi-thread expansion	❌ No logs	Crashes before expansion

Remaining Issues

Multi-Thread SEGV

Symptoms:

Crashes within ~1 second
No expansion logging
Exit 139 (SIGSEGV)
Single-thread works perfectly

Possible Causes:

Race condition in expansion path
Memory corruption in multi-thread initialization
Lock-free algorithm bug in concurrent slab access
TLS initialization issue under high thread contention

Recommended Next Steps:

Run under ThreadSanitizer: make larson_hakmem_tsan && ./larson_hakmem_tsan 10 8 128 1024 1 12345 4
Add mutex protection around expand_superslab_head()
Check for TOCTOU bugs in current_chunk access
Verify atomic operations in slab acquisition
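
As one possible shape for the mutex and TOCTOU suggestions, the sketch below serializes expansion with a pthread mutex and re-checks the current chunk under the lock, so that two threads that both observed the same exhausted chunk add only one new chunk. All types and functions here (toy_head_t, toy_expand_if_current, and so on) are hypothetical; the real expand_superslab_head() signature is not shown in this report, and whether a mutex is the right fix has not been verified.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct toy_chunk {
    uint32_t active_slabs;
    struct toy_chunk *next;
} toy_chunk_t;

typedef struct {
    pthread_mutex_t expand_lock;
    toy_chunk_t    *current_chunk;
    uint32_t        total_chunks;
} toy_head_t;

/* Create the head with one initial chunk. Returns 0 on success, -1 on error. */
static int toy_head_init(toy_head_t *h) {
    if (pthread_mutex_init(&h->expand_lock, NULL) != 0) {
        return -1;
    }
    h->current_chunk = calloc(1, sizeof *h->current_chunk);
    h->total_chunks = (h->current_chunk != NULL) ? 1u : 0u;
    return (h->current_chunk != NULL) ? 0 : -1;
}

/* Read the current chunk under the lock (avoids an unsynchronized read). */
static toy_chunk_t *toy_head_current(toy_head_t *h) {
    pthread_mutex_lock(&h->expand_lock);
    toy_chunk_t *c = h->current_chunk;
    pthread_mutex_unlock(&h->expand_lock);
    return c;
}

/* Expand only if `seen` is still the current chunk. If another thread has
 * already expanded, do nothing. Returns 0 on success, -1 on allocation failure. */
static int toy_expand_if_current(toy_head_t *h, toy_chunk_t *seen) {
    int rc = 0;
    pthread_mutex_lock(&h->expand_lock);
    if (h->current_chunk == seen) { /* re-check under the lock */
        toy_chunk_t *c = calloc(1, sizeof *c);
        if (c == NULL) {
            rc = -1;
        } else {
            c->next = h->current_chunk;
            h->current_chunk = c;
            h->total_chunks++;
        }
    }
    pthread_mutex_unlock(&h->expand_lock);
    return rc;
}
```

A caller would read the chunk with toy_head_current(), find it exhausted, and pass that same pointer to toy_expand_if_current(); a second caller holding the stale pointer then becomes a no-op instead of a duplicate expansion.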

Why This Achieves 100% (Single-Thread)

The bitmap fix ensures:

Correct exhaustion detection: active_slabs >= capacity is precise
Automatic expansion: When all slabs occupied → new chunk allocated
No false OOMs: System only fails on true memory exhaustion
Tested: 10/10 single-thread runs passed with stable throughput (~770K ops/s)

Memory behavior (verified via logs):

Initial: 1 chunk per class
Under load: Expands to 2, 3, 4... chunks as needed
Each new chunk provides 32 fresh slabs
No premature OOM
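
The growth pattern can be reproduced with a small single-threaded model. The names below (grow_head_t, grow_acquire_slab, CHUNK_CAP) are hypothetical and the sketch is not the HAKMEM implementation; unlike the report's setup, it creates the first chunk lazily on first use. It applies the fixed check (active_slabs >= capacity means exhausted) and allocates a new chunk only at that point.

```c
#include <stdint.h>
#include <stdlib.h>

#define CHUNK_CAP 32u /* assumption: 32 slabs per chunk, as in the report */

typedef struct grow_chunk {
    uint32_t slab_bitmap;  /* bit value 1 = occupied */
    uint32_t active_slabs; /* number of occupied slabs */
    struct grow_chunk *next;
} grow_chunk_t;

typedef struct {
    grow_chunk_t *current_chunk;
    uint32_t      total_chunks;
} grow_head_t;

/* Acquire one slab from the current chunk, expanding when it is exhausted.
 * Returns the slab index within the current chunk, or -1 on true OOM. */
static int grow_acquire_slab(grow_head_t *h) {
    grow_chunk_t *c = h->current_chunk;
    if (c == NULL || c->active_slabs >= CHUNK_CAP) {
        grow_chunk_t *n = calloc(1, sizeof *n);
        if (n == NULL) {
            return -1; /* genuine memory exhaustion */
        }
        n->next = c;
        h->current_chunk = c = n;
        h->total_chunks++;
    }
    for (uint32_t i = 0; i < CHUNK_CAP; i++) {
        uint32_t bit = UINT32_C(1) << i;
        if ((c->slab_bitmap & bit) == 0) {
            c->slab_bitmap |= bit;
            c->active_slabs++;
            return (int)i;
        }
    }
    return -1; /* not reached while bitmap and active_slabs agree */
}
```

Starting from a zero-initialized grow_head_t, the first 32 acquisitions fill one chunk (bitmap=0xffffffff) and the 33rd creates a second chunk whose bitmap becomes 0x00000001, matching the expansion log shown in the Verification section.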

Conclusion

Single-Thread: ✅ 100% stability achieved

Multi-Thread: ❌ Additional fix required (race condition suspected)

User's requirement: NOT YET MET

Need multi-thread stability for production use
Recommend: Fix race condition before deployment

Generated: 2025-11-08

Investigator: Claude Code (Sonnet 4.5)

Test Environment: 4T Larson benchmark, 10 actors, 8 ops/iter, 128B blocks, 1024 chunks
