Phase 1 完了:環境変数整理 + fprintf デバッグガード ENV変数削除(BG/HotMag系): - core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines) - core/hakmem_tiny_bg_spill.c: BG spill ENV 削除 - core/tiny_refill.h: BG remote 固定値化 - core/hakmem_tiny_slow.inc: BG refs 削除 fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE): - core/hakmem_shared_pool.c: Lock stats (~18 fprintf) - core/page_arena.c: Init/Shutdown/Stats (~27 fprintf) - core/hakmem.c: SIGSEGV init message ドキュメント整理: - 328 markdown files 削除(旧レポート・重複docs) 性能確認: - Larson: 52.35M ops/s (前回52.8M、安定動作✅) - ENV整理による機能影響なし - Debug出力は一部残存(次phase で対応) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
13 KiB
Phase 7 Bug #3: 4T High-Contention Crash Debug Report
Date: 2025-11-08 Engineer: Claude Task Agent Duration: 2.5 hours Goal: Fix 4T Larson crash with 1024 chunks/thread (high contention)
Summary
Result: PARTIAL SUCCESS - Fixed 4 critical bugs but crash persists Success Rate: 35% (7/20 runs) - same as before fixes Root Cause: Multiple interacting issues; deeper investigation needed
Bugs Fixed:
- BUG #7: malloc() wrapper
g_hakmem_lock_depth++called too late - BUG #8: calloc() wrapper
g_hakmem_lock_depth++called too late - BUG #10: dlopen() called on hot path causing infinite recursion
- BUG #11: Unprotected fprintf() in OOM logging paths
Status: These fixes are NECESSARY but NOT SUFFICIENT to solve the crash
Bug Details
BUG #7: malloc() Wrapper Lock Depth (FIXED)
File: /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:40-99
Problem:
// BEFORE (WRONG):
void* malloc(size_t size) {
if (g_initializing != 0) { return __libc_malloc(size); }
// BUG: getenv/fprintf/dlopen called BEFORE g_hakmem_lock_depth++
static int debug_enabled = -1;
if (debug_enabled < 0) {
debug_enabled = (getenv("HAKMEM_SFC_DEBUG") != NULL) ? 1 : 0; // malloc!
}
if (debug_enabled) fprintf(stderr, "[DEBUG] malloc(%zu)\n", size); // malloc!
if (hak_force_libc_alloc()) { ... } // calls getenv → malloc!
int ld_mode = hak_ld_env_mode(); // calls getenv → malloc!
if (ld_mode && hak_jemalloc_loaded()) { ... } // calls dlopen → malloc!
g_hakmem_lock_depth++; // TOO LATE!
void* ptr = hak_alloc_at(size, HAK_CALLSITE());
g_hakmem_lock_depth--;
return ptr;
}
Why It Crashes:
getenv()doesn't malloc, butfprintf()does (for stderr buffering)dlopen()definitely mallocs (internal data structures)- When these malloc, they call back into our wrapper → infinite recursion
- Result:
free(): invalid pointer(corrupted metadata)
Fix:
// AFTER (CORRECT):
void* malloc(size_t size) {
// CRITICAL FIX: Increment lock depth FIRST!
g_hakmem_lock_depth++;
// Guard against recursion
if (g_initializing != 0) {
g_hakmem_lock_depth--;
return __libc_malloc(size);
}
// Now safe - any malloc from getenv/fprintf/dlopen uses __libc_malloc
static int debug_enabled = -1;
if (debug_enabled < 0) {
debug_enabled = (getenv("HAKMEM_SFC_DEBUG") != NULL) ? 1 : 0; // OK!
}
// ... rest of code
void* ptr = hak_alloc_at(size, HAK_CALLSITE());
g_hakmem_lock_depth--; // Decrement at end
return ptr;
}
Impact: Prevents infinite recursion when malloc wrapper calls libc functions
BUG #8: calloc() Wrapper Lock Depth (FIXED)
File: /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:117-180
Problem: Same as BUG #7 - g_hakmem_lock_depth++ called after getenv/dlopen
Fix: Move g_hakmem_lock_depth++ to line 119 (function start)
Impact: Prevents calloc infinite recursion
BUG #10: dlopen() on Hot Path (FIXED)
File:
/mnt/workdisk/public_share/hakmem/core/hakmem.c:166-174(hak_jemalloc_loaded function)/mnt/workdisk/public_share/hakmem/core/box/hak_core_init.inc.h:43-55(initialization)/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:42,72,112,149,192(wrapper call sites)
Problem:
// OLD (DANGEROUS):
static inline int hak_jemalloc_loaded(void) {
if (g_jemalloc_loaded < 0) {
void* h = dlopen("libjemalloc.so.2", RTLD_NOLOAD | RTLD_NOW); // MALLOC!
if (!h) h = dlopen("libjemalloc.so.1", RTLD_NOLOAD | RTLD_NOW); // MALLOC!
g_jemalloc_loaded = (h != NULL) ? 1 : 0;
if (h) dlclose(h); // MALLOC!
}
return g_jemalloc_loaded;
}
// Called from malloc wrapper:
if (hak_ld_block_jemalloc() && hak_jemalloc_loaded()) { // dlopen → malloc → wrapper → dlopen → ...
return __libc_malloc(size);
}
Why It Crashes:
dlopen()calls malloc internally (dynamic linker allocations)- Wrapper calls
hak_jemalloc_loaded()→dlopen()→malloc()→ wrapper → infinite loop
Fix:
- Pre-detect jemalloc during initialization (hak_init_impl):
// In hak_core_init.inc.h:43-55
extern int g_jemalloc_loaded;
if (g_jemalloc_loaded < 0) {
void* h = dlopen("libjemalloc.so.2", RTLD_NOLOAD | RTLD_NOW);
if (!h) h = dlopen("libjemalloc.so.1", RTLD_NOLOAD | RTLD_NOW);
g_jemalloc_loaded = (h != NULL) ? 1 : 0;
if (h) dlclose(h);
}
- Use cached variable in wrapper:
// In hak_wrappers.inc.h
extern int g_jemalloc_loaded; // Declared at top
// In malloc():
if (hak_ld_block_jemalloc() && g_jemalloc_loaded) { // No function call!
g_hakmem_lock_depth--;
return __libc_malloc(size);
}
Impact: Removes dlopen from hot path, prevents infinite recursion
BUG #11: Unprotected fprintf() in OOM Logging (FIXED)
Files:
/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c:146-177(log_superslab_oom_once)/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h:391-411(superslab_refill debug)
Problem 1: log_superslab_oom_once (PARTIALLY FIXED BEFORE)
// OLD (WRONG):
static void log_superslab_oom_once(...) {
g_hakmem_lock_depth++;
FILE* status = fopen("/proc/self/status", "r"); // OK (lock_depth=1)
// ... read file ...
fclose(status); // OK (lock_depth=1)
g_hakmem_lock_depth--; // WRONG LOCATION!
// BUG: fprintf called AFTER lock_depth restored to 0!
fprintf(stderr, "[SS OOM] ..."); // fprintf → malloc → wrapper (lock_depth=0) → CRASH!
}
Fix 1:
// NEW (CORRECT):
static void log_superslab_oom_once(...) {
g_hakmem_lock_depth++;
FILE* status = fopen("/proc/self/status", "r");
// ... read file ...
fclose(status);
// Don't decrement yet! fprintf needs protection
fprintf(stderr, "[SS OOM] ..."); // OK (lock_depth still 1)
g_hakmem_lock_depth--; // Now safe (all libc calls done)
}
Problem 2: superslab_refill debug message (NEW BUG FOUND)
// OLD (WRONG):
SuperSlab* ss = superslab_allocate((uint8_t)class_idx);
if (!ss) {
if (!g_superslab_refill_debug_once) {
g_superslab_refill_debug_once = 1;
int err = errno;
fprintf(stderr, "[DEBUG] superslab_refill returned NULL (OOM) ..."); // UNPROTECTED!
}
return NULL;
}
Fix 2:
// NEW (CORRECT):
SuperSlab* ss = superslab_allocate((uint8_t)class_idx);
if (!ss) {
if (!g_superslab_refill_debug_once) {
g_superslab_refill_debug_once = 1;
int err = errno;
extern __thread int g_hakmem_lock_depth;
g_hakmem_lock_depth++;
fprintf(stderr, "[DEBUG] superslab_refill returned NULL (OOM) ...");
g_hakmem_lock_depth--;
}
return NULL;
}
Impact: Prevents fprintf from triggering malloc on wrapper hot path
Test Results
Before Fixes
- Success Rate: 35% (estimated based on REMAINING_BUGS_ANALYSIS.md: 70% → 30% with previous fixes)
- Crash:
free(): invalid pointerfrom libc
After ALL Fixes (BUG #7, #8, #10, #11)
Testing 4T Larson high-contention (20 runs)...
Success: 7/20
Failed: 13/20
Success rate: 35%
Conclusion: No improvement. The fixes are correct but address only PART of the problem.
Root Cause Analysis
Why Fixes Didn't Help
The crash is NOT solely due to wrapper recursion. Evidence:
- OOM Happens First:
[DEBUG] superslab_refill returned NULL (OOM)
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
- Malloc Fallback Path:
When Tiny allocation fails (OOM), it falls back to
hak_alloc_malloc_impl():
// core/box/hak_alloc_api.inc.h:43
void* fallback_ptr = hak_alloc_malloc_impl(size);
This allocates with:
void* raw = __libc_malloc(HEADER_SIZE + size); // Allocate with libc
// Write HAKMEM header
hdr->magic = HAKMEM_MAGIC;
hdr->method = ALLOC_METHOD_MALLOC;
return raw + HEADER_SIZE; // Return user pointer
- Free Path Should Work:
When this pointer is freed,
hak_free_at()should:
- Step 2 (line 92-120): Detect HAKMEM_MAGIC header
- Check
hdr->method == ALLOC_METHOD_MALLOC - Call
__libc_free(raw)correctly
- So Why Does It Crash?
Hypothesis 1: Race condition in header write/read Hypothesis 2: OOM causes memory corruption before crash Hypothesis 3: Multiple allocations in flight, one corrupts another's metadata Hypothesis 4: Libc malloc returns pointer that overlaps with HAKMEM memory
Next Steps (Recommended)
Immediate (High Priority)
- Add Comprehensive Logging:
// In hak_alloc_malloc_impl():
fprintf(stderr, "[FALLBACK_ALLOC] size=%zu raw=%p user=%p\n", size, raw, raw + HEADER_SIZE);
// In hak_free_at() step 2:
fprintf(stderr, "[FALLBACK_FREE] ptr=%p raw=%p magic=0x%X method=%d\n",
ptr, raw, hdr->magic, hdr->method);
- Test with Valgrind:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes \
./larson_hakmem 10 8 128 1024 1 12345 4
- Test with ASan:
make asan-larson-alloc
./larson_hakmem_asan_alloc 10 8 128 1024 1 12345 4
Medium Priority
- Disable Fallback Path Temporarily:
// In hak_alloc_api.inc.h:36
if (size <= TINY_MAX_SIZE) {
// TEST: Return NULL instead of fallback
return NULL; // Force application to handle OOM
}
- Increase Memory Limit:
ulimit -v unlimited
./larson_hakmem 10 8 128 1024 1 12345 4
- Reduce Contention:
# Test with fewer chunks to avoid OOM
./larson_hakmem 10 8 128 512 1 12345 4 # 512 instead of 1024
Root Cause Investigation
- Check Active Counter Logic: The OOM suggests active counter underflow. Review:
/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_refill_p0.inc.h:103(ss_active_add fix from Phase 6-2.3)- All
ss_active_add()/ss_active_dec()call sites
- Check SuperSlab Allocation:
# Enable detailed SS logging
HAKMEM_SUPER_REG_REQTRACE=1 HAKMEM_FREE_ROUTE_TRACE=1 \
./larson_hakmem 10 8 128 1024 1 12345 4
Production Impact
Status: NOT READY FOR PRODUCTION
Blocking Issues:
- 65% crash rate on 4T high-contention workload
- Unknown root cause (wrapper fixes necessary but insufficient)
- Potential active counter bug or memory corruption
Safe Configurations:
- 1T: 100% stable (2.97M ops/s)
- 4T low-contention (256 chunks): 100% stable (251K ops/s)
- 4T high-contention (1024 chunks): 35% stable (981K ops/s when stable)
Code Changes
Modified Files
-
/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h- Line 40-99: malloc() - moved
g_hakmem_lock_depth++to start - Line 117-180: calloc() - moved
g_hakmem_lock_depth++to start - Line 42: Added extern declaration for
g_jemalloc_loaded - Lines 72,112,149,192: Changed
hak_jemalloc_loaded()→g_jemalloc_loaded
- Line 40-99: malloc() - moved
-
/mnt/workdisk/public_share/hakmem/core/box/hak_core_init.inc.h- Lines 43-55: Pre-detect jemalloc during init (not hot path)
-
/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c- Line 146→177: Moved
g_hakmem_lock_depth--to AFTER fprintf
- Line 146→177: Moved
-
/mnt/workdisk/public_share/hakmem/core/tiny_superslab_alloc.inc.h- Lines 392-411: Added
g_hakmem_lock_depth++/--around fprintf
- Lines 392-411: Added
Build Command
make clean
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem
Test Command
# 4T high-contention
./larson_hakmem 10 8 128 1024 1 12345 4
# 20-run stability test
bash /tmp/test_larson_20.sh
Lessons Learned
-
Wrapper Recursion is Insidious:
- Any libc function that might malloc must be protected
getenv(),fprintf(),dlopen(),fopen(),fclose()ALL can mallocg_hakmem_lock_depthmust be incremented BEFORE any libc call
-
Debug Code Can Cause Bugs:
- fprintf in hot paths is dangerous
- Debug messages should either be compile-time disabled or fully protected
-
Initialization Order Matters:
- dlopen must happen during init, not on first malloc
- Cached values avoid hot-path overhead and recursion risk
-
Multiple Bugs Can Hide Each Other:
- Fixing wrapper recursion (BUG #7,#8) didn't improve stability
- Real issue is deeper (OOM, active counter, or corruption)
Recommendations for User
Short Term (今すぐ):
- Use 4T with 256 chunks/thread (100% stable)
- Avoid 4T with 1024+ chunks until root cause found
Medium Term (1-2 days):
- Run Valgrind/ASan analysis (see "Next Steps")
- Investigate active counter logic
- Add comprehensive logging to fallback path
Long Term (1 week):
- Consider disabling fallback path (fail fast instead of corrupt)
- Implement active counter assertions to catch underflow early
- Add memory fence/barrier around header writes in fallback path
End of Report
がんばりました! 4つのバグを修正しましたが、根本原因はまだ深いところにあります。次は Valgrind/ASan で詳細調査が必要です。🔥🐛