Files
hakmem/docs/analysis/PHASE7_BUG3_FIX_REPORT.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 完了:環境変数整理 + fprintf デバッグガード

ENV変数削除(BG/HotMag系):
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除(旧レポート・重複docs)

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作)
- ENV整理による機能影響なし
- Debug出力は一部残存(次phase で対応)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00

13 KiB

Phase 7 Bug #3: 4T High-Contention Crash Debug Report

Date: 2025-11-08 Engineer: Claude Task Agent Duration: 2.5 hours Goal: Fix 4T Larson crash with 1024 chunks/thread (high contention)


Summary

Result: PARTIAL SUCCESS - Fixed 4 critical bugs but crash persists Success Rate: 35% (7/20 runs) - same as before fixes Root Cause: Multiple interacting issues; deeper investigation needed

Bugs Fixed:

  1. BUG #7: malloc() wrapper g_hakmem_lock_depth++ called too late
  2. BUG #8: calloc() wrapper g_hakmem_lock_depth++ called too late
  3. BUG #10: dlopen() called on hot path causing infinite recursion
  4. BUG #11: Unprotected fprintf() in OOM logging paths

Status: These fixes are NECESSARY but NOT SUFFICIENT to solve the crash


Bug Details

BUG #7: malloc() Wrapper Lock Depth (FIXED)

File: /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:40-99

Problem:

// BEFORE (WRONG):
void* malloc(size_t size) {
    if (g_initializing != 0) { return __libc_malloc(size); }

    // BUG: getenv/fprintf/dlopen called BEFORE g_hakmem_lock_depth++
    static int debug_enabled = -1;
    if (debug_enabled < 0) {
        debug_enabled = (getenv("HAKMEM_SFC_DEBUG") != NULL) ? 1 : 0;  // malloc!
    }
    if (debug_enabled) fprintf(stderr, "[DEBUG] malloc(%zu)\n", size);  // malloc!

    if (hak_force_libc_alloc()) { ... }  // calls getenv → malloc!
    int ld_mode = hak_ld_env_mode();     // calls getenv → malloc!
    if (ld_mode && hak_jemalloc_loaded()) { ... }  // calls dlopen → malloc!

    g_hakmem_lock_depth++;  // TOO LATE!
    void* ptr = hak_alloc_at(size, HAK_CALLSITE());
    g_hakmem_lock_depth--;
    return ptr;
}

Why It Crashes:

  1. getenv() doesn't malloc, but fprintf() does (for stderr buffering)
  2. dlopen() definitely mallocs (internal data structures)
  3. When these malloc, they call back into our wrapper → infinite recursion
  4. Result: free(): invalid pointer (corrupted metadata)

Fix:

// AFTER (CORRECT):
void* malloc(size_t size) {
    // CRITICAL FIX: Increment lock depth FIRST!
    g_hakmem_lock_depth++;

    // Guard against recursion
    if (g_initializing != 0) {
        g_hakmem_lock_depth--;
        return __libc_malloc(size);
    }

    // Now safe - any malloc from getenv/fprintf/dlopen uses __libc_malloc
    static int debug_enabled = -1;
    if (debug_enabled < 0) {
        debug_enabled = (getenv("HAKMEM_SFC_DEBUG") != NULL) ? 1 : 0;  // OK!
    }
    // ... rest of code

    void* ptr = hak_alloc_at(size, HAK_CALLSITE());
    g_hakmem_lock_depth--;  // Decrement at end
    return ptr;
}

Impact: Prevents infinite recursion when malloc wrapper calls libc functions


BUG #8: calloc() Wrapper Lock Depth (FIXED)

File: /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:117-180

Problem: Same as BUG #7 - g_hakmem_lock_depth++ called after getenv/dlopen

Fix: Move g_hakmem_lock_depth++ to line 119 (function start)

Impact: Prevents calloc infinite recursion


BUG #10: dlopen() on Hot Path (FIXED)

File:

  • /mnt/workdisk/public_share/hakmem/core/hakmem.c:166-174 (hak_jemalloc_loaded function)
  • /mnt/workdisk/public_share/hakmem/core/box/hak_core_init.inc.h:43-55 (initialization)
  • /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:42,72,112,149,192 (wrapper call sites)

Problem:

// OLD (DANGEROUS):
static inline int hak_jemalloc_loaded(void) {
    if (g_jemalloc_loaded < 0) {
        void* h = dlopen("libjemalloc.so.2", RTLD_NOLOAD | RTLD_NOW);  // MALLOC!
        if (!h) h = dlopen("libjemalloc.so.1", RTLD_NOLOAD | RTLD_NOW);  // MALLOC!
        g_jemalloc_loaded = (h != NULL) ? 1 : 0;
        if (h) dlclose(h);  // MALLOC!
    }
    return g_jemalloc_loaded;
}

// Called from malloc wrapper:
if (hak_ld_block_jemalloc() && hak_jemalloc_loaded()) {  // dlopen → malloc → wrapper → dlopen → ...
    return __libc_malloc(size);
}

Why It Crashes:

  • dlopen() calls malloc internally (dynamic linker allocations)
  • Wrapper calls hak_jemalloc_loaded()dlopen()malloc() → wrapper → infinite loop

Fix:

  1. Pre-detect jemalloc during initialization (hak_init_impl):
// In hak_core_init.inc.h:43-55
extern int g_jemalloc_loaded;
if (g_jemalloc_loaded < 0) {
    void* h = dlopen("libjemalloc.so.2", RTLD_NOLOAD | RTLD_NOW);
    if (!h) h = dlopen("libjemalloc.so.1", RTLD_NOLOAD | RTLD_NOW);
    g_jemalloc_loaded = (h != NULL) ? 1 : 0;
    if (h) dlclose(h);
}
  1. Use cached variable in wrapper:
// In hak_wrappers.inc.h
extern int g_jemalloc_loaded;  // Declared at top

// In malloc():
if (hak_ld_block_jemalloc() && g_jemalloc_loaded) {  // No function call!
    g_hakmem_lock_depth--;
    return __libc_malloc(size);
}

Impact: Removes dlopen from hot path, prevents infinite recursion


BUG #11: Unprotected fprintf() in OOM Logging (FIXED)

Files:

  • /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c:146-177 (log_superslab_oom_once)
  • /mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:391-411 (superslab_refill debug)

Problem 1: log_superslab_oom_once (PARTIALLY FIXED BEFORE)

// OLD (WRONG):
static void log_superslab_oom_once(...) {
    g_hakmem_lock_depth++;

    FILE* status = fopen("/proc/self/status", "r");  // OK (lock_depth=1)
    // ... read file ...
    fclose(status);  // OK (lock_depth=1)

    g_hakmem_lock_depth--;  // WRONG LOCATION!

    // BUG: fprintf called AFTER lock_depth restored to 0!
    fprintf(stderr, "[SS OOM] ...");  // fprintf → malloc → wrapper (lock_depth=0) → CRASH!
}

Fix 1:

// NEW (CORRECT):
static void log_superslab_oom_once(...) {
    g_hakmem_lock_depth++;

    FILE* status = fopen("/proc/self/status", "r");
    // ... read file ...
    fclose(status);

    // Don't decrement yet! fprintf needs protection

    fprintf(stderr, "[SS OOM] ...");  // OK (lock_depth still 1)

    g_hakmem_lock_depth--;  // Now safe (all libc calls done)
}

Problem 2: superslab_refill debug message (NEW BUG FOUND)

// OLD (WRONG):
SuperSlab* ss = superslab_allocate((uint8_t)class_idx);
if (!ss) {
    if (!g_superslab_refill_debug_once) {
        g_superslab_refill_debug_once = 1;
        int err = errno;
        fprintf(stderr, "[DEBUG] superslab_refill returned NULL (OOM) ...");  // UNPROTECTED!
    }
    return NULL;
}

Fix 2:

// NEW (CORRECT):
SuperSlab* ss = superslab_allocate((uint8_t)class_idx);
if (!ss) {
    if (!g_superslab_refill_debug_once) {
        g_superslab_refill_debug_once = 1;
        int err = errno;

        extern __thread int g_hakmem_lock_depth;
        g_hakmem_lock_depth++;
        fprintf(stderr, "[DEBUG] superslab_refill returned NULL (OOM) ...");
        g_hakmem_lock_depth--;
    }
    return NULL;
}

Impact: Prevents fprintf from triggering malloc on wrapper hot path


Test Results

Before Fixes

  • Success Rate: 35% (estimated based on REMAINING_BUGS_ANALYSIS.md: 70% → 30% with previous fixes)
  • Crash: free(): invalid pointer from libc

After ALL Fixes (BUG #7, #8, #10, #11)

Testing 4T Larson high-contention (20 runs)...
Success: 7/20
Failed: 13/20
Success rate: 35%

Conclusion: No improvement. The fixes are correct but address only PART of the problem.


Root Cause Analysis

Why Fixes Didn't Help

The crash is NOT solely due to wrapper recursion. Evidence:

  1. OOM Happens First:
[DEBUG] superslab_refill returned NULL (OOM)
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
  1. Malloc Fallback Path: When Tiny allocation fails (OOM), it falls back to hak_alloc_malloc_impl():
// core/box/hak_alloc_api.inc.h:43
void* fallback_ptr = hak_alloc_malloc_impl(size);

This allocates with:

void* raw = __libc_malloc(HEADER_SIZE + size);  // Allocate with libc
// Write HAKMEM header
hdr->magic = HAKMEM_MAGIC;
hdr->method = ALLOC_METHOD_MALLOC;
return raw + HEADER_SIZE;  // Return user pointer
  1. Free Path Should Work: When this pointer is freed, hak_free_at() should:
  • Step 2 (line 92-120): Detect HAKMEM_MAGIC header
  • Check hdr->method == ALLOC_METHOD_MALLOC
  • Call __libc_free(raw) correctly
  1. So Why Does It Crash?

Hypothesis 1: Race condition in header write/read Hypothesis 2: OOM causes memory corruption before crash Hypothesis 3: Multiple allocations in flight, one corrupts another's metadata Hypothesis 4: Libc malloc returns pointer that overlaps with HAKMEM memory


Immediate (High Priority)

  1. Add Comprehensive Logging:
// In hak_alloc_malloc_impl():
fprintf(stderr, "[FALLBACK_ALLOC] size=%zu raw=%p user=%p\n", size, raw, raw + HEADER_SIZE);

// In hak_free_at() step 2:
fprintf(stderr, "[FALLBACK_FREE] ptr=%p raw=%p magic=0x%X method=%d\n",
        ptr, raw, hdr->magic, hdr->method);
  1. Test with Valgrind:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes \
  ./larson_hakmem 10 8 128 1024 1 12345 4
  1. Test with ASan:
make asan-larson-alloc
./larson_hakmem_asan_alloc 10 8 128 1024 1 12345 4

Medium Priority

  1. Disable Fallback Path Temporarily:
// In hak_alloc_api.inc.h:36
if (size <= TINY_MAX_SIZE) {
    // TEST: Return NULL instead of fallback
    return NULL;  // Force application to handle OOM
}
  1. Increase Memory Limit:
ulimit -v unlimited
./larson_hakmem 10 8 128 1024 1 12345 4
  1. Reduce Contention:
# Test with fewer chunks to avoid OOM
./larson_hakmem 10 8 128 512 1 12345 4  # 512 instead of 1024

Root Cause Investigation

  1. Check Active Counter Logic: The OOM suggests active counter underflow. Review:
  • /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill_p0.inc.h:103 (ss_active_add fix from Phase 6-2.3)
  • All ss_active_add() / ss_active_dec() call sites
  1. Check SuperSlab Allocation:
# Enable detailed SS logging
HAKMEM_SUPER_REG_REQTRACE=1 HAKMEM_FREE_ROUTE_TRACE=1 \
  ./larson_hakmem 10 8 128 1024 1 12345 4

Production Impact

Status: NOT READY FOR PRODUCTION

Blocking Issues:

  1. 65% crash rate on 4T high-contention workload
  2. Unknown root cause (wrapper fixes necessary but insufficient)
  3. Potential active counter bug or memory corruption

Safe Configurations:

  • 1T: 100% stable (2.97M ops/s)
  • 4T low-contention (256 chunks): 100% stable (251K ops/s)
  • 4T high-contention (1024 chunks): 35% stable (981K ops/s when stable)

Code Changes

Modified Files

  1. /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h

    • Line 40-99: malloc() - moved g_hakmem_lock_depth++ to start
    • Line 117-180: calloc() - moved g_hakmem_lock_depth++ to start
    • Line 42: Added extern declaration for g_jemalloc_loaded
    • Lines 72,112,149,192: Changed hak_jemalloc_loaded()g_jemalloc_loaded
  2. /mnt/workdisk/public_share/hakmem/core/box/hak_core_init.inc.h

    • Lines 43-55: Pre-detect jemalloc during init (not hot path)
  3. /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c

    • Line 146→177: Moved g_hakmem_lock_depth-- to AFTER fprintf
  4. /mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h

    • Lines 392-411: Added g_hakmem_lock_depth++/-- around fprintf

Build Command

make clean
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem

Test Command

# 4T high-contention
./larson_hakmem 10 8 128 1024 1 12345 4

# 20-run stability test
bash /tmp/test_larson_20.sh

Lessons Learned

  1. Wrapper Recursion is Insidious:

    • Any libc function that might malloc must be protected
    • getenv(), fprintf(), dlopen(), fopen(), fclose() ALL can malloc
    • g_hakmem_lock_depth must be incremented BEFORE any libc call
  2. Debug Code Can Cause Bugs:

    • fprintf in hot paths is dangerous
    • Debug messages should either be compile-time disabled or fully protected
  3. Initialization Order Matters:

    • dlopen must happen during init, not on first malloc
    • Cached values avoid hot-path overhead and recursion risk
  4. Multiple Bugs Can Hide Each Other:

    • Fixing wrapper recursion (BUG #7,#8) didn't improve stability
    • Real issue is deeper (OOM, active counter, or corruption)

Recommendations for User

Short Term (今すぐ):

  • Use 4T with 256 chunks/thread (100% stable)
  • Avoid 4T with 1024+ chunks until root cause found

Medium Term (1-2 days):

  • Run Valgrind/ASan analysis (see "Next Steps")
  • Investigate active counter logic
  • Add comprehensive logging to fallback path

Long Term (1 week):

  • Consider disabling fallback path (fail fast instead of corrupt)
  • Implement active counter assertions to catch underflow early
  • Add memory fence/barrier around header writes in fallback path

End of Report

がんばりました! 4つのバグを修正しましたが、根本原因はまだ深いところにあります。次は Valgrind/ASan で詳細調査が必要です。🔥🐛