Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)

Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV
- core/tiny_refill.h: BG remote replaced with fixed values
- core/hakmem_tiny_slow.inc: removed BG refs

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

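Documentation cleanup:
- Deleted 328 markdown files (old reports, duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; runs stably ✅)
- No functional impact from the ENV cleanup
- Some debug output remains (to be handled in the next phase)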

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:45:26 +09:00

Ultra-Deep Analysis Summary: Root Cause Found

Date: 2025-11-04
Status: 🎯 ROOT CAUSE IDENTIFIED

TL;DR

The Bug: Fix #1 and Fix #2 drain slabs WITHOUT checking ownership, causing concurrent modification of meta->freelist when multiple threads operate on the same SuperSlab.

The Fix: Remove Fix #1 and Fix #2, reorder sticky/hot/bench paths to claim ownership BEFORE draining.

Confidence: 🟢 95% - Explains all symptoms: crashes at 0x6261, timing-dependent failures, partial improvements from Fix #3.

The Race Condition

What Fix #1 and Fix #2 Do (WRONG)

// Fix #1 (superslab_refill) and Fix #2 (hak_tiny_alloc_superslab)
for (int i = 0; i < tls_cap; i++) {  // Loop through ALL slabs
    if (remote_heads[i] != 0) {
        ss_remote_drain_to_freelist(ss, i);  // ❌ NO ownership check!
    }
}

Problem: Drains ALL slabs in the SuperSlab, including slabs owned by other threads.

The Race

Thread A (owns slab 5)	Thread B (Fix #2, no ownership)
`ptr = meta->freelist`	Loops through all slabs, i=5
`meta->freelist = *(void**)ptr`	Calls `ss_remote_drain_to_freelist(ss, 5)`
(allocating from freelist)	`node_next = meta->freelist` ← RACE!
	`meta->freelist = node` ← Overwrites A's update!

Result: Freelist corruption, crash at fault_addr=0x6261 (truncated pointer).
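
The lost update can be replayed deterministically in a single thread. The sketch below uses hypothetical stand-in types (`Node`, `SlabMeta`), not the real hakmem structures, and walks through one interleaving that matches the table: Thread B reads the freelist head before Thread A's pop lands, then publishes its node afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal stand-ins; the real hakmem structures differ. */
typedef struct Node { struct Node *next; } Node;
typedef struct { Node *freelist; } SlabMeta;

/* Replays one racy interleaving step by step.
 * Returns 1 if the block already handed to Thread A is still reachable
 * from the freelist afterwards (i.e. the freelist is corrupted). */
int replay_lost_update(void) {
    Node a, b, r;                 /* a, b: free blocks; r: remotely freed block */
    a.next = &b;
    b.next = NULL;
    r.next = NULL;
    SlabMeta meta = { &a };

    Node *ptr = meta.freelist;    /* A: ptr = meta->freelist (will allocate a) */

    Node *node = &r;
    node->next = meta.freelist;   /* B: node_next = meta->freelist (soon stale) */

    meta.freelist = ptr->next;    /* A: pop, head is now b; a belongs to the caller */
    assert(meta.freelist == &b);

    meta.freelist = node;         /* B: publish, overwriting A's update */

    /* The list is now r -> a -> b even though a is in use. */
    for (Node *p = meta.freelist; p != NULL; p = p->next) {
        if (p == ptr) {
            return 1;
        }
    }
    return 0;
}
```

Once the caller writes user data into block `a`, the next allocation that walks past `r` interprets that data as a `next` pointer, which would be consistent with a fault at a small, non-pointer address.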

Why Fix #3 is Correct

// Fix #3 (Mailbox path in tiny_refill.h)
tiny_tls_bind_slab(tls, mss, midx);     // Bind to TLS
ss_owner_cas(m, tiny_self_u32());       // ✅ CLAIM OWNERSHIP FIRST

// NOW safe to drain - we're the owner
if (remote_heads[midx] != 0) {
    ss_remote_drain_to_freelist(mss, midx);  // ✅ Safe: we own it
}

Key difference: Claims ownership (owner_tid = self) BEFORE draining.
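
As an illustration of what "claiming ownership" could look like, here is a minimal C11 sketch. The names `SlabOwner` and `slab_try_claim` are hypothetical, and the convention that `owner_tid == 0` means "unowned" is an assumption; the real `ss_owner_cas` may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ownership word; 0 is assumed to mean "unowned". */
typedef struct {
    _Atomic uint32_t owner_tid;
} SlabOwner;

/* Tries to claim the slab for thread `self`.
 * Returns true if `self` is the owner afterwards, false if another
 * thread already owns the slab. */
bool slab_try_claim(SlabOwner *m, uint32_t self) {
    assert(self != 0);            /* 0 is reserved for "unowned" */
    uint32_t expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &m->owner_tid, &expected, self,
            memory_order_acq_rel, memory_order_acquire)) {
        return true;              /* we just took ownership */
    }
    return expected == self;      /* true only if we already owned it */
}
```

A caller would drain only when the claim returns true. Draining after a failed claim reintroduces the race described above.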

All Unsafe Call Sites

Location	Fix	Risk	Solution
`hakmem_tiny_free.inc:620`	Fix #1	🔴 HIGH	❌ DELETE
`hakmem_tiny_free.inc:756`	Fix #2	🔴 HIGH	❌ DELETE
`tiny_refill.h:47`	Sticky	🟡 MEDIUM	✅ Reorder: ownership → drain
`tiny_refill.h:65`	Hot slot	🟡 MEDIUM	✅ Reorder: ownership → drain
`tiny_refill.h:80`	Bench	🟡 MEDIUM	✅ Reorder: ownership → drain
`tiny_mmap_gate.h:57`	mmap_gate	🟡 MEDIUM	✅ Reorder: ownership → drain
`tiny_refill.h:105`	Fix #3	✅ SAFE	✅ Keep as-is

The Fix (3 Steps)

Step 1: Remove Fix #1 (Priority: HIGH)

File: core/hakmem_tiny_free.inc
Lines: 615-621

Comment out this block:

// UNSAFE: Drains all slabs without ownership check
for (int i = 0; i < tls_cap; i++) {
    int has_remote = (atomic_load_explicit(&tls->ss->remote_heads[i], memory_order_acquire) != 0);
    if (has_remote) {
        ss_remote_drain_to_freelist(tls->ss, i);  // ❌ DELETE
    }
}

Step 2: Remove Fix #2 (Priority: HIGH)

File: core/hakmem_tiny_free.inc
Lines: 729-767 (entire block)

Comment out the entire Fix #2 block (about 40 lines, starting with "BUGFIX: Drain ALL slabs...").

Step 3: Fix Refill Paths (Priority: MEDIUM)

Files: core/tiny_refill.h, core/tiny_mmap_gate.h

Pattern (apply to sticky/hot/bench/mmap_gate):

// BEFORE (WRONG):
if (!m->freelist && has_remote) ss_remote_drain_to_freelist(ss, idx);  // ❌ Drain first
if (m->freelist) {
    tiny_tls_bind_slab(tls, ss, idx);    // ← Ownership after
    ss_owner_cas(m, self);
    return ss;
}

// AFTER (CORRECT):
tiny_tls_bind_slab(tls, ss, idx);        // ✅ Ownership first
ss_owner_cas(m, self);
if (!m->freelist && has_remote) {
    ss_remote_drain_to_freelist(ss, idx);  // ← Drain after
}
if (m->freelist) {
    return ss;
}
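
The ownership-first order can be sketched end to end with C11 atomics. Everything here is illustrative: `Slab`, `slab_remote_push`, and `slab_adopt_and_drain` are hypothetical names, the remote list is assumed to be a lock-free stack, and `owner_tid == 0` is assumed to mean "unowned". Unlike the pattern above, this sketch also checks the result of the claim and leaves the slab untouched when another thread owns it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node { struct Node *next; } Node;

/* Hypothetical slab: remote_head is shared, freelist is owner-only. */
typedef struct {
    _Atomic uint32_t owner_tid;   /* 0 = unowned (assumption) */
    _Atomic(Node *)  remote_head; /* any thread may push here */
    Node            *freelist;    /* only the owner may touch this */
} Slab;

/* Cross-thread free: lock-free push onto the remote stack. */
void slab_remote_push(Slab *s, Node *n) {
    Node *old = atomic_load_explicit(&s->remote_head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->remote_head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Claim first, then drain. Returns true if `self` owns the slab and
 * its freelist is non-empty afterwards. */
bool slab_adopt_and_drain(Slab *s, uint32_t self) {
    assert(self != 0);
    uint32_t expected = 0;
    bool claimed = atomic_compare_exchange_strong_explicit(
        &s->owner_tid, &expected, self,
        memory_order_acq_rel, memory_order_acquire);
    if (!claimed && expected != self) {
        return false;             /* owned by someone else: hands off */
    }

    if (s->freelist == NULL) {
        /* Detach the whole remote stack in one step, then splice it
         * into the freelist. Safe because we are the owner. */
        Node *n = atomic_exchange_explicit(&s->remote_head, (Node *)NULL,
                                           memory_order_acquire);
        while (n != NULL) {
            Node *next = n->next;
            n->next = s->freelist;
            s->freelist = n;
            n = next;
        }
    }
    return s->freelist != NULL;
}
```

Detaching the remote stack with a single exchange keeps concurrent remote frees safe: later pushes land on a fresh stack and are picked up by the next drain.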

Test Plan

Test 1: Remove Fix #1 and Fix #2 Only

# Apply Step 1 and Step 2 (comment out Fix #1 and Fix #2)
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 10

Expected:

✅ If crashes stop: Fix #1/#2 were the main culprits (DONE!)
⚠️ If crashes continue: Need Step 3 (refill path fixes)

Test 2: Apply All Fixes (Step 1-3)

# Apply all fixes
make clean && make -s larson_hakmem
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh repro 30 20
HAKMEM_TINY_SS_ADOPT=1 scripts/run_larson_claude.sh guard 30 20

Expected: NO crashes, stable for 20+ seconds.

Why This Explains Everything

Crashes at fault_addr=0x6261: Freelist corruption from concurrent writes
Timing-dependent: Race depends on thread scheduling
Improvement from 500 → 4012 events: Fix #3 reduced races, but Fix #1/#2 still race
Guard mode vs repro mode: Different timing → different race frequency

Detailed Documentation

Full Analysis: /mnt/workdisk/public_share/hakmem/ULTRATHINK_ANALYSIS.md
Implementation Guide: /mnt/workdisk/public_share/hakmem/FIX_IMPLEMENTATION_GUIDE.md
This Summary: /mnt/workdisk/public_share/hakmem/ULTRATHINK_SUMMARY.md

Next Action

1. Apply Step 1 and Step 2 (remove Fix #1 and Fix #2)
2. Rebuild and test (repro mode, 30 threads, 10 seconds)
3. If crashes persist, apply Step 3 (fix refill paths)
4. Report results

Estimated time: 15 minutes to apply fixes + 5 minutes testing = 20 minutes total.

END OF SUMMARY
