# Phase 7 Bug #3: 4T High-Contention Crash Debug Report **Date:** 2025-11-08 **Engineer:** Claude Task Agent **Duration:** 2.5 hours **Goal:** Fix 4T Larson crash with 1024 chunks/thread (high contention) --- ## Summary **Result:** PARTIAL SUCCESS - Fixed 4 critical bugs but crash persists **Success Rate:** 35% (7/20 runs) - same as before fixes **Root Cause:** Multiple interacting issues; deeper investigation needed **Bugs Fixed:** 1. BUG #7: malloc() wrapper `g_hakmem_lock_depth++` called too late 2. BUG #8: calloc() wrapper `g_hakmem_lock_depth++` called too late 3. BUG #10: dlopen() called on hot path causing infinite recursion 4. BUG #11: Unprotected fprintf() in OOM logging paths **Status:** These fixes are NECESSARY but NOT SUFFICIENT to solve the crash --- ## Bug Details ### BUG #7: malloc() Wrapper Lock Depth (FIXED) **File:** `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:40-99` **Problem:** ```c // BEFORE (WRONG): void* malloc(size_t size) { if (g_initializing != 0) { return __libc_malloc(size); } // BUG: getenv/fprintf/dlopen called BEFORE g_hakmem_lock_depth++ static int debug_enabled = -1; if (debug_enabled < 0) { debug_enabled = (getenv("HAKMEM_SFC_DEBUG") != NULL) ? 1 : 0; // malloc! } if (debug_enabled) fprintf(stderr, "[DEBUG] malloc(%zu)\n", size); // malloc! if (hak_force_libc_alloc()) { ... } // calls getenv → malloc! int ld_mode = hak_ld_env_mode(); // calls getenv → malloc! if (ld_mode && hak_jemalloc_loaded()) { ... } // calls dlopen → malloc! g_hakmem_lock_depth++; // TOO LATE! void* ptr = hak_alloc_at(size, HAK_CALLSITE()); g_hakmem_lock_depth--; return ptr; } ``` **Why It Crashes:** 1. `getenv()` doesn't malloc, but `fprintf()` does (for stderr buffering) 2. `dlopen()` **definitely** mallocs (internal data structures) 3. When these malloc, they call back into our wrapper → infinite recursion 4. Result: `free(): invalid pointer` (corrupted metadata) **Fix:** ```c // AFTER (CORRECT): void* malloc(size_t size) { // CRITICAL FIX: Increment lock depth FIRST! g_hakmem_lock_depth++; // Guard against recursion if (g_initializing != 0) { g_hakmem_lock_depth--; return __libc_malloc(size); } // Now safe - any malloc from getenv/fprintf/dlopen uses __libc_malloc static int debug_enabled = -1; if (debug_enabled < 0) { debug_enabled = (getenv("HAKMEM_SFC_DEBUG") != NULL) ? 1 : 0; // OK! } // ... rest of code void* ptr = hak_alloc_at(size, HAK_CALLSITE()); g_hakmem_lock_depth--; // Decrement at end return ptr; } ``` **Impact:** Prevents infinite recursion when malloc wrapper calls libc functions --- ### BUG #8: calloc() Wrapper Lock Depth (FIXED) **File:** `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:117-180` **Problem:** Same as BUG #7 - `g_hakmem_lock_depth++` called after getenv/dlopen **Fix:** Move `g_hakmem_lock_depth++` to line 119 (function start) **Impact:** Prevents calloc infinite recursion --- ### BUG #10: dlopen() on Hot Path (FIXED) **File:** - `/mnt/workdisk/public_share/hakmem/core/hakmem.c:166-174` (hak_jemalloc_loaded function) - `/mnt/workdisk/public_share/hakmem/core/box/hak_core_init.inc.h:43-55` (initialization) - `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:42,72,112,149,192` (wrapper call sites) **Problem:** ```c // OLD (DANGEROUS): static inline int hak_jemalloc_loaded(void) { if (g_jemalloc_loaded < 0) { void* h = dlopen("libjemalloc.so.2", RTLD_NOLOAD | RTLD_NOW); // MALLOC! if (!h) h = dlopen("libjemalloc.so.1", RTLD_NOLOAD | RTLD_NOW); // MALLOC! g_jemalloc_loaded = (h != NULL) ? 1 : 0; if (h) dlclose(h); // MALLOC! } return g_jemalloc_loaded; } // Called from malloc wrapper: if (hak_ld_block_jemalloc() && hak_jemalloc_loaded()) { // dlopen → malloc → wrapper → dlopen → ... return __libc_malloc(size); } ``` **Why It Crashes:** - `dlopen()` calls malloc internally (dynamic linker allocations) - Wrapper calls `hak_jemalloc_loaded()` → `dlopen()` → `malloc()` → wrapper → infinite loop **Fix:** 1. Pre-detect jemalloc during initialization (hak_init_impl): ```c // In hak_core_init.inc.h:43-55 extern int g_jemalloc_loaded; if (g_jemalloc_loaded < 0) { void* h = dlopen("libjemalloc.so.2", RTLD_NOLOAD | RTLD_NOW); if (!h) h = dlopen("libjemalloc.so.1", RTLD_NOLOAD | RTLD_NOW); g_jemalloc_loaded = (h != NULL) ? 1 : 0; if (h) dlclose(h); } ``` 2. Use cached variable in wrapper: ```c // In hak_wrappers.inc.h extern int g_jemalloc_loaded; // Declared at top // In malloc(): if (hak_ld_block_jemalloc() && g_jemalloc_loaded) { // No function call! g_hakmem_lock_depth--; return __libc_malloc(size); } ``` **Impact:** Removes dlopen from hot path, prevents infinite recursion --- ### BUG #11: Unprotected fprintf() in OOM Logging (FIXED) **Files:** - `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c:146-177` (log_superslab_oom_once) - `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:391-411` (superslab_refill debug) **Problem 1: log_superslab_oom_once (PARTIALLY FIXED BEFORE)** ```c // OLD (WRONG): static void log_superslab_oom_once(...) { g_hakmem_lock_depth++; FILE* status = fopen("/proc/self/status", "r"); // OK (lock_depth=1) // ... read file ... fclose(status); // OK (lock_depth=1) g_hakmem_lock_depth--; // WRONG LOCATION! // BUG: fprintf called AFTER lock_depth restored to 0! fprintf(stderr, "[SS OOM] ..."); // fprintf → malloc → wrapper (lock_depth=0) → CRASH! } ``` **Fix 1:** ```c // NEW (CORRECT): static void log_superslab_oom_once(...) { g_hakmem_lock_depth++; FILE* status = fopen("/proc/self/status", "r"); // ... read file ... fclose(status); // Don't decrement yet! fprintf needs protection fprintf(stderr, "[SS OOM] ..."); // OK (lock_depth still 1) g_hakmem_lock_depth--; // Now safe (all libc calls done) } ``` **Problem 2: superslab_refill debug message (NEW BUG FOUND)** ```c // OLD (WRONG): SuperSlab* ss = superslab_allocate((uint8_t)class_idx); if (!ss) { if (!g_superslab_refill_debug_once) { g_superslab_refill_debug_once = 1; int err = errno; fprintf(stderr, "[DEBUG] superslab_refill returned NULL (OOM) ..."); // UNPROTECTED! } return NULL; } ``` **Fix 2:** ```c // NEW (CORRECT): SuperSlab* ss = superslab_allocate((uint8_t)class_idx); if (!ss) { if (!g_superslab_refill_debug_once) { g_superslab_refill_debug_once = 1; int err = errno; extern __thread int g_hakmem_lock_depth; g_hakmem_lock_depth++; fprintf(stderr, "[DEBUG] superslab_refill returned NULL (OOM) ..."); g_hakmem_lock_depth--; } return NULL; } ``` **Impact:** Prevents fprintf from triggering malloc on wrapper hot path --- ## Test Results ### Before Fixes - **Success Rate:** 35% (estimated based on REMAINING_BUGS_ANALYSIS.md: 70% → 30% with previous fixes) - **Crash:** `free(): invalid pointer` from libc ### After ALL Fixes (BUG #7, #8, #10, #11) ```bash Testing 4T Larson high-contention (20 runs)... Success: 7/20 Failed: 13/20 Success rate: 35% ``` **Conclusion:** No improvement. The fixes are correct but address only PART of the problem. --- ## Root Cause Analysis ### Why Fixes Didn't Help The crash is **NOT** solely due to wrapper recursion. Evidence: 1. **OOM Happens First:** ``` [DEBUG] superslab_refill returned NULL (OOM) [DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback free(): invalid pointer ``` 2. **Malloc Fallback Path:** When Tiny allocation fails (OOM), it falls back to `hak_alloc_malloc_impl()`: ```c // core/box/hak_alloc_api.inc.h:43 void* fallback_ptr = hak_alloc_malloc_impl(size); ``` This allocates with: ```c void* raw = __libc_malloc(HEADER_SIZE + size); // Allocate with libc // Write HAKMEM header hdr->magic = HAKMEM_MAGIC; hdr->method = ALLOC_METHOD_MALLOC; return raw + HEADER_SIZE; // Return user pointer ``` 3. **Free Path Should Work:** When this pointer is freed, `hak_free_at()` should: - Step 2 (line 92-120): Detect HAKMEM_MAGIC header - Check `hdr->method == ALLOC_METHOD_MALLOC` - Call `__libc_free(raw)` correctly 4. **So Why Does It Crash?** **Hypothesis 1:** Race condition in header write/read **Hypothesis 2:** OOM causes memory corruption before crash **Hypothesis 3:** Multiple allocations in flight, one corrupts another's metadata **Hypothesis 4:** Libc malloc returns pointer that overlaps with HAKMEM memory --- ## Next Steps (Recommended) ### Immediate (High Priority) 1. **Add Comprehensive Logging:** ```c // In hak_alloc_malloc_impl(): fprintf(stderr, "[FALLBACK_ALLOC] size=%zu raw=%p user=%p\n", size, raw, raw + HEADER_SIZE); // In hak_free_at() step 2: fprintf(stderr, "[FALLBACK_FREE] ptr=%p raw=%p magic=0x%X method=%d\n", ptr, raw, hdr->magic, hdr->method); ``` 2. **Test with Valgrind:** ```bash valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes \ ./larson_hakmem 10 8 128 1024 1 12345 4 ``` 3. **Test with ASan:** ```bash make asan-larson-alloc ./larson_hakmem_asan_alloc 10 8 128 1024 1 12345 4 ``` ### Medium Priority 4. **Disable Fallback Path Temporarily:** ```c // In hak_alloc_api.inc.h:36 if (size <= TINY_MAX_SIZE) { // TEST: Return NULL instead of fallback return NULL; // Force application to handle OOM } ``` 5. **Increase Memory Limit:** ```bash ulimit -v unlimited ./larson_hakmem 10 8 128 1024 1 12345 4 ``` 6. **Reduce Contention:** ```bash # Test with fewer chunks to avoid OOM ./larson_hakmem 10 8 128 512 1 12345 4 # 512 instead of 1024 ``` ### Root Cause Investigation 7. **Check Active Counter Logic:** The OOM suggests active counter underflow. Review: - `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill_p0.inc.h:103` (ss_active_add fix from Phase 6-2.3) - All `ss_active_add()` / `ss_active_dec()` call sites 8. **Check SuperSlab Allocation:** ```bash # Enable detailed SS logging HAKMEM_SUPER_REG_REQTRACE=1 HAKMEM_FREE_ROUTE_TRACE=1 \ ./larson_hakmem 10 8 128 1024 1 12345 4 ``` --- ## Production Impact **Status:** NOT READY FOR PRODUCTION **Blocking Issues:** 1. 65% crash rate on 4T high-contention workload 2. Unknown root cause (wrapper fixes necessary but insufficient) 3. Potential active counter bug or memory corruption **Safe Configurations:** - 1T: 100% stable (2.97M ops/s) - 4T low-contention (256 chunks): 100% stable (251K ops/s) - 4T high-contention (1024 chunks): 35% stable (981K ops/s when stable) --- ## Code Changes ### Modified Files 1. `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h` - Line 40-99: malloc() - moved `g_hakmem_lock_depth++` to start - Line 117-180: calloc() - moved `g_hakmem_lock_depth++` to start - Line 42: Added extern declaration for `g_jemalloc_loaded` - Lines 72,112,149,192: Changed `hak_jemalloc_loaded()` → `g_jemalloc_loaded` 2. `/mnt/workdisk/public_share/hakmem/core/box/hak_core_init.inc.h` - Lines 43-55: Pre-detect jemalloc during init (not hot path) 3. `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c` - Line 146→177: Moved `g_hakmem_lock_depth--` to AFTER fprintf 4. `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h` - Lines 392-411: Added `g_hakmem_lock_depth++/--` around fprintf ### Build Command ```bash make clean make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem ``` ### Test Command ```bash # 4T high-contention ./larson_hakmem 10 8 128 1024 1 12345 4 # 20-run stability test bash /tmp/test_larson_20.sh ``` --- ## Lessons Learned 1. **Wrapper Recursion is Insidious:** - Any libc function that might malloc must be protected - `getenv()`, `fprintf()`, `dlopen()`, `fopen()`, `fclose()` ALL can malloc - `g_hakmem_lock_depth` must be incremented BEFORE any libc call 2. **Debug Code Can Cause Bugs:** - fprintf in hot paths is dangerous - Debug messages should either be compile-time disabled or fully protected 3. **Initialization Order Matters:** - dlopen must happen during init, not on first malloc - Cached values avoid hot-path overhead and recursion risk 4. **Multiple Bugs Can Hide Each Other:** - Fixing wrapper recursion (BUG #7,#8) didn't improve stability - Real issue is deeper (OOM, active counter, or corruption) --- ## Recommendations for User **Short Term (今すぐ):** - Use 4T with 256 chunks/thread (100% stable) - Avoid 4T with 1024+ chunks until root cause found **Medium Term (1-2 days):** - Run Valgrind/ASan analysis (see "Next Steps") - Investigate active counter logic - Add comprehensive logging to fallback path **Long Term (1 week):** - Consider disabling fallback path (fail fast instead of corrupt) - Implement active counter assertions to catch underflow early - Add memory fence/barrier around header writes in fallback path --- **End of Report** がんばりました! 4つのバグを修正しましたが、根本原因はまだ深いところにあります。次は Valgrind/ASan で詳細調査が必要です。🔥🐛