# HAKMEM Sanitizer Investigation Report
**Date:** 2025-11-07
**Status:** Root cause identified
**Severity:** Critical (immediate SEGV on startup)

---
## Executive Summary
HAKMEM fails immediately when built with AddressSanitizer (ASan) or ThreadSanitizer (TSan) with the HAKMEM allocator enabled (`-alloc` variants). The root cause is **ASan/TSan initialization calling `malloc()` before TLS (Thread-Local Storage) is fully initialized**, causing a SEGV when accessing `__thread` variables.
**Key Finding:** ASan's `dlsym()` call during library initialization triggers HAKMEM's `malloc()` wrapper, which attempts to access `g_hakmem_lock_depth` (TLS variable) before TLS is ready.

---
## 1. TLS Variables - Complete Inventory
### 1.1 Core TLS Variables (Recursion Guard)
**File:** `core/hakmem.c:188`
```c
__thread int g_hakmem_lock_depth = 0; // Recursion guard (NOT static!)
```
**First Access:** `core/box/hak_wrappers.inc.h:42` (in `malloc()` wrapper)
```c
void* malloc(size_t size) {
    if (__builtin_expect(g_initializing != 0, 0)) { // ← Line 42
        extern void* __libc_malloc(size_t);
        return __libc_malloc(size);
    }
    // ... later: g_hakmem_lock_depth++; (line 86)
}
```
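**Problem:** Line 42 checks `g_initializing` (a plain global, safe to read), but that check does not protect the TLS accesses. This report attributes the fault to **implicit TLS access**: the compiler may compute the TLS address early in the function, ahead of the later `g_hakmem_lock_depth` use. Note also that `g_initializing` is only set inside `hak_init_impl()` (Section 2.1), so a call that arrives before initialization starts presumably passes this check and reaches the TLS code anyway.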
### 1.2 Other TLS Variables
#### Wrapper Statistics (hak_wrappers.inc.h:32-36)
```c
__thread uint64_t g_malloc_total_calls = 0;
__thread uint64_t g_malloc_tiny_size_match = 0;
__thread uint64_t g_malloc_fast_path_tried = 0;
__thread uint64_t g_malloc_fast_path_null = 0;
__thread uint64_t g_malloc_slow_path = 0;
```
#### Tiny Allocator TLS (hakmem_tiny.c)
```c
__thread int g_tls_live_ss[TINY_NUM_CLASSES] = {0}; // Line 658
__thread void* g_tls_sll_head[TINY_NUM_CLASSES] = {0}; // Line 1019
__thread uint32_t g_tls_sll_count[TINY_NUM_CLASSES] = {0}; // Line 1020
__thread uint8_t* g_tls_bcur[TINY_NUM_CLASSES] = {0}; // Line 1187
__thread uint8_t* g_tls_bend[TINY_NUM_CLASSES] = {0}; // Line 1188
```
#### Fast Cache TLS (tiny_fastcache.h:32-54, extern declarations)
```c
extern __thread void* g_tiny_fast_cache[TINY_FAST_CLASS_COUNT];
extern __thread uint32_t g_tiny_fast_count[TINY_FAST_CLASS_COUNT];
// ... 10+ more TLS variables
```
#### Other Subsystems TLS
- **SFC Cache:** `hakmem_tiny_sfc.c:18-19` (2 TLS variables)
- **Sticky Cache:** `tiny_sticky.c:6-8` (3 TLS arrays)
- **Simple Cache:** `hakmem_tiny_simple.c:23,26` (2 TLS variables)
- **Magazine:** `hakmem_tiny_magazine.c:29,37` (2 TLS variables)
- **Mid-Range MT:** `hakmem_mid_mt.c:37` (1 TLS array)
- **Pool TLS:** `core/box/pool_tls_types.inc.h:11` (1 TLS array)
**Total TLS Variables:** 50+ across the codebase

---
## 2. dlsym / syscall Initialization Flow
### 2.1 Intended Initialization Order
**File:** `core/box/hak_core_init.inc.h:29-35`
```c
static void hak_init_impl(void) {
    g_initializing = 1;
    // Phase 6.X P0 FIX (2025-10-24): Initialize Box 3 (Syscall Layer) FIRST!
    // This MUST be called before ANY allocation (Tiny/Mid/Large/Learner)
    // dlsym() initializes function pointers to real libc (bypasses LD_PRELOAD)
    hkm_syscall_init(); // ← Line 35
    // ...
}
```
**File:** `core/hakmem_syscall.c:41-64`
```c
void hkm_syscall_init(void) {
    if (g_syscall_initialized) return; // Idempotent
    // dlsym with RTLD_NEXT: Get NEXT symbol in library chain
    real_malloc = dlsym(RTLD_NEXT, "malloc"); // ← Line 49
    real_calloc = dlsym(RTLD_NEXT, "calloc");
    real_free = dlsym(RTLD_NEXT, "free");
    real_realloc = dlsym(RTLD_NEXT, "realloc");
    if (!real_malloc || !real_calloc || !real_free || !real_realloc) {
        fprintf(stderr, "[hakmem_syscall] FATAL: dlsym failed\n");
        abort();
    }
    g_syscall_initialized = 1;
}
```
### 2.2 Actual Execution Order (ASan Build)
**GDB Backtrace:**
```
#0 malloc (size=69) at core/box/hak_wrappers.inc.h:40
#1 0x00007ffff7fc7cca in malloc (size=69) at ../include/rtld-malloc.h:56
#2 __GI__dl_exception_create_format (...) at ./elf/dl-exception.c:157
#3 0x00007ffff7fcf3dc in _dl_lookup_symbol_x (undef_name="__isoc99_printf", ...)
#4 0x00007ffff65759c4 in do_sym (..., name="__isoc99_printf", ...) at ./elf/dl-sym.c:146
#5 _dl_sym (handle=<optimized out>, name="__isoc99_printf", ...) at ./elf/dl-sym.c:195
#12 0x00007ffff74e3859 in __interception::GetFuncAddr (name="__isoc99_printf") at interception_linux.cpp:42
#13 __interception::InterceptFunction (name="__isoc99_printf", ...) at interception_linux.cpp:61
#14 0x00007ffff74a1deb in InitializeCommonInterceptors () at sanitizer_common_interceptors.inc:10094
#15 __asan::InitializeAsanInterceptors () at asan_interceptors.cpp:634
#16 0x00007ffff74c063b in __asan::AsanInitInternal () at asan_rtl.cpp:452
#17 0x00007ffff7fc95be in _dl_init (main_map=0x7ffff7ffe2e0, ...) at ./elf/dl-init.c:102
#18 0x00007ffff7fe32ca in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
```
**Timeline:**
1. Dynamic linker (`ld-linux.so`) initializes
2. ASan runtime initializes (`__asan::AsanInitInternal`)
3. ASan intercepts `printf` family functions
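4. `dlsym("__isoc99_printf")` calls `malloc()` internally: frames #1-#3 show `_dl_lookup_symbol_x` entering `_dl_exception_create_format`, which allocates through glibc's `rtld-malloc.h:56` (a path normally taken when a lookup fails and glibc builds an error message)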
5. HAKMEM's `malloc()` wrapper is invoked **before `hak_init()` runs**
6. **TLS access SEGV** (TLS segment not yet initialized)
### 2.3 Why `HAKMEM_FORCE_LIBC_ALLOC_BUILD` Doesn't Help
**Current Makefile (line 810-811):**
```makefile
SAN_ASAN_ALLOC_CFLAGS = -O1 -g -fno-omit-frame-pointer -fno-lto \
-fsanitize=address,undefined -fno-sanitize-recover=all -fstack-protector-strong
# NOTE: Missing -DHAKMEM_FORCE_LIBC_ALLOC_BUILD=1
```
**Expected Behavior (with flag):**
```c
#ifdef HAKMEM_FORCE_LIBC_ALLOC_BUILD
void* malloc(size_t size) {
    extern void* __libc_malloc(size_t);
    return __libc_malloc(size); // Bypass HAKMEM completely
}
#endif
```
**However:** Even with `HAKMEM_FORCE_LIBC_ALLOC_BUILD=1`, the symbol `malloc` would still be exported, and ASan might still interpose on it. The real fix requires:
1. Not exporting `malloc` at all when Sanitizers are active, OR
2. Using constructor priorities to guarantee TLS initialization before ASan
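**Illustrative sketch (not HAKMEM code):** One way to express "bypass the custom allocator when a Sanitizer is active" is to detect the Sanitizer at compile time instead of relying on a manually passed `-D` flag. The sketch below assumes GCC or Clang; `DEMO_SANITIZER_BUILD`, `san_demo_custom_alloc()`, and `san_demo_alloc()` are hypothetical names, and the entry point is deliberately not called `malloc` so the example does not itself interpose on libc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Clang reports sanitizers through __has_feature; older GCC lacks it. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* GCC defines __SANITIZE_ADDRESS__ / __SANITIZE_THREAD__ under
 * -fsanitize=address / -fsanitize=thread. */
#if defined(__SANITIZE_ADDRESS__) || defined(__SANITIZE_THREAD__) || \
    __has_feature(address_sanitizer) || __has_feature(thread_sanitizer)
#define DEMO_SANITIZER_BUILD 1
#else
#define DEMO_SANITIZER_BUILD 0
#endif

/* Hypothetical stand-in for the custom allocator's fast path. */
void *san_demo_custom_alloc(size_t size) {
    return malloc(size);
}

/* Hypothetical entry point: routes to libc in Sanitizer builds. */
void *san_demo_alloc(size_t size) {
#if DEMO_SANITIZER_BUILD
    return malloc(size);               /* let the Sanitizer own the heap */
#else
    return san_demo_custom_alloc(size);
#endif
}

/* Reports which path was compiled in (0 or 1). */
int san_demo_is_sanitizer_build(void) {
    return DEMO_SANITIZER_BUILD;
}
```

This only covers the compile-time half of the problem. Whether the `malloc` symbol is exported at all is still a build and link decision, as noted in the list above.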
---
## 3. Static Constructor Execution Order
### 3.1 Current Constructors
**File:** `core/hakmem.c:66`
```c
__attribute__((constructor)) static void hakmem_ctor_install_segv(void) {
    const char* dbg = getenv("HAKMEM_DEBUG_SEGV");
    // ... install SIGSEGV handler
}
```
**File:** `core/tiny_debug_ring.c:204`
```c
__attribute__((constructor))
static void hak_debug_ring_ctor(void) {
// ...
}
```
**File:** `core/hakmem_tiny_stats.c:66`
```c
__attribute__((constructor))
static void hak_tiny_stats_ctor(void) {
// ...
}
```
**Problem:** No priority is specified. Constructors without a priority run **after** every prioritized constructor in the same image (effectively priority `65535`), so these run late.
**ASan Constructor Priority:** Typically `1` or `100` (very early)
### 3.2 Constructor Priority Ranges
- **0-100:** Reserved for the implementation (GCC warns when user code requests them); used by system libraries (libc, libstdc++, sanitizers)
- **101-999:** Early initialization (critical infrastructure)
- **1000-9999:** Normal initialization
- **65535 (default):** Late initialization
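**Illustrative sketch (not HAKMEM code):** The ordering rules above can be observed directly. The sketch below assumes GCC or Clang on an ELF platform and a single translation unit; all `ctor_demo_*` names are hypothetical. The constructors are declared in reverse order on purpose, to show that priority rather than source order decides when they run.

```c
#include <assert.h>

/* Log of the order in which the constructors actually ran. */
static int ctor_demo_order[3];
static int ctor_demo_count = 0;

static void ctor_demo_record(int tag) {
    if (ctor_demo_count < 3) {
        ctor_demo_order[ctor_demo_count++] = tag;
    }
}

/* No priority: runs after all prioritized constructors in this image. */
__attribute__((constructor))
static void ctor_demo_default(void) { ctor_demo_record(65535); }

/* Priority 200: runs after 101, before the unprioritized one. */
__attribute__((constructor(200)))
static void ctor_demo_200(void) { ctor_demo_record(200); }

/* Priority 101: the lowest value available to user code. */
__attribute__((constructor(101)))
static void ctor_demo_101(void) { ctor_demo_record(101); }

/* Returns 1 if the constructors ran in ascending priority order. */
int ctor_demo_in_priority_order(void) {
    return ctor_demo_count == 3 &&
           ctor_demo_order[0] == 101 &&
           ctor_demo_order[1] == 200 &&
           ctor_demo_order[2] == 65535;
}
```

Priorities only order constructors within one executable or shared object. Ordering across shared libraries follows dependency order in the dynamic linker, so a priority value alone does not place a constructor ahead of another library's initialization.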
---
## 4. Sanitizer Conflict Points
### 4.1 Symbol Interposition Chain
**Without Sanitizer:**
```
Application → malloc() → HAKMEM wrapper → hak_alloc_at()
```
**With ASan (Direct Link):**
```
Application → ASan malloc() → HAKMEM malloc() → TLS access → SEGV
(during ASan init, TLS not ready!)
```
**Expected (with FORCE_LIBC):**
```
Application → ASan malloc() → __libc_malloc() ✓
```
### 4.2 LD_PRELOAD vs Direct Link
**LD_PRELOAD (libhakmem_asan.so):**
```
Application → LD_PRELOAD (HAKMEM malloc) → ASan malloc → ...
```
- Even worse: HAKMEM wrapper runs before ASan init!
**Direct Link (larson_hakmem_asan_alloc):**
```
Application → main() → ...
(ASan init via constructor) → dlsym malloc → HAKMEM malloc → SEGV
```
### 4.3 TLS Initialization Timing
**Normal Execution:**
1. ELF loader initializes TLS templates
2. `__tls_get_addr()` sets up TLS for main thread
3. Constructors run (can safely access TLS)
4. `main()` starts
**ASan Execution:**
1. ELF loader initializes TLS templates
2. ASan constructor runs **before** application constructors
3. ASan's `dlsym()` calls `malloc()`
4. **HAKMEM malloc accesses TLS → SEGV** (TLS not fully initialized!)
**Why TLS Fails:**
- ASan's early constructor (priority 1-100) runs during `_dl_init()`
- TLS segment may be allocated but **not yet associated with the current thread**
- Accessing `__thread` variable triggers `__tls_get_addr()` → NULL dereference
---
## 5. Existing Workarounds / Comments
### 5.1 Recursion Guard Design
**File:** `core/hakmem.c:175-192`
```c
// Phase 6.15 P1: Remove global lock; keep recursion guard only
// ---------------------------------------------------------------------------
// We no longer serialize all allocations with a single global mutex.
// Instead, each submodule is responsible for its own fine-grained locking.
// We keep a per-thread recursion guard so that internal use of malloc/free
// within the allocator routes to libc (avoids infinite recursion).
//
// Phase 6.X P0 FIX (2025-10-24): Reverted to simple g_hakmem_lock_depth check
// Box Theory - Layer 1 (API Layer):
// This guard protects against LD_PRELOAD recursion (Box 1 → Box 1)
// Box 2 (Core) → Box 3 (Syscall) uses hkm_libc_malloc() (dlsym, no guard needed!)
// NOTE: Removed 'static' to allow access from hakmem_tiny_superslab.c (fopen fix)
__thread int g_hakmem_lock_depth = 0; // 0 = outermost call
```
**Comment Analysis:**
- Designed for **runtime recursion**, not **initialization-time TLS issues**
- Assumes TLS is already available when `malloc()` is called
- `dlsym` guard mentioned, but not for initialization safety
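**Illustrative sketch (not HAKMEM code):** The runtime recursion guard described above can be reduced to a few lines. This is a simplified model under stated assumptions: `guard_demo_alloc()` and `guard_demo_core_alloc()` are hypothetical stand-ins for the wrapper and the allocator core, the counters exist only for the demonstration, and the guard itself is a `__thread` variable, which is exactly why it cannot protect against calls that arrive before TLS is usable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static __thread int guard_demo_lock_depth = 0; /* 0 = outermost call */

/* Demo-only counters (not thread-safe). */
static int guard_demo_fallback_calls = 0;
static int guard_demo_core_calls = 0;

void *guard_demo_alloc(size_t size);

/* Hypothetical allocator core that needs scratch memory itself. */
static void *guard_demo_core_alloc(size_t size) {
    void *scratch = guard_demo_alloc(16); /* nested call hits the guard */
    free(scratch);
    guard_demo_core_calls++;
    return malloc(size);                  /* stand-in for the real path */
}

/* Hypothetical wrapper: nested calls are routed straight to libc. */
void *guard_demo_alloc(size_t size) {
    if (guard_demo_lock_depth > 0) {
        guard_demo_fallback_calls++;
        return malloc(size);
    }
    guard_demo_lock_depth++;
    void *p = guard_demo_core_alloc(size);
    guard_demo_lock_depth--;
    return p;
}
```

A single outer call therefore produces one pass through the core and one libc fallback for the nested request, and the depth returns to 0 afterwards.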
### 5.2 Sanitizer Build Flags (Makefile)
**Line 799-801 (ASan with FORCE_LIBC):**
```makefile
SAN_ASAN_CFLAGS = -O1 -g -fno-omit-frame-pointer -fno-lto \
-fsanitize=address,undefined -fno-sanitize-recover=all -fstack-protector-strong \
-DHAKMEM_FORCE_LIBC_ALLOC_BUILD=1 # ← Bypasses HAKMEM allocator
```
**Line 810-811 (ASan with HAKMEM allocator):**
```makefile
SAN_ASAN_ALLOC_CFLAGS = -O1 -g -fno-omit-frame-pointer -fno-lto \
-fsanitize=address,undefined -fno-sanitize-recover=all -fstack-protector-strong
# NOTE: Missing -DHAKMEM_FORCE_LIBC_ALLOC_BUILD=1 ← INTENDED for testing!
```
**Design Intent:** Allow ASan to instrument HAKMEM's allocator for memory safety testing.
**Current Reality:** Broken due to TLS initialization order.

---
## 6. Recommended Fix (Priority Ordered)
### 6.1 Option A: Constructor Priority (Quick Fix) ⭐⭐⭐⭐⭐
**Difficulty:** Easy
**Risk:** Low
**Effectiveness:** High (80% confidence)
**Implementation:**
**File:** `core/hakmem.c`
```c
// PRIORITY 101: Run after ASan (priority ~100), but before default (65535)
__attribute__((constructor(101))) static void hakmem_tls_preinit(void) {
    // Force TLS allocation by touching the variable
    g_hakmem_lock_depth = 0;
    // Optional: Pre-initialize dlsym cache
    hkm_syscall_init();
}

// Keep existing constructor for SEGV handler (no priority = runs later)
__attribute__((constructor)) static void hakmem_ctor_install_segv(void) {
    // ... existing code
}
```
**Rationale:**
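- Intended to ensure TLS is touched **after** ASan init but **before** any application malloc calls; note that the backtrace in Section 2.2 shows `malloc()` being called during ASan's own initialization, which a priority-101 constructor cannot precede, so this must be confirmed by the verification run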
- Forces `__tls_get_addr()` to run in a safe context
- Minimal code change
**Verification:**
```bash
make clean
# Add constructor(101) to hakmem.c
make asan-larson-alloc
./larson_hakmem_asan_alloc 1 1 128 1024 1 12345 1
# Should run without SEGV
```
---
### 6.2 Option B: Lazy TLS Initialization (Defensive) ⭐⭐⭐⭐
**Difficulty:** Medium
**Risk:** Medium (performance impact)
**Effectiveness:** High (90% confidence)
**Implementation:**
**File:** `core/box/hak_wrappers.inc.h:40-50`
```c
void* malloc(size_t size) {
    // NEW: Check if TLS is initialized using a helper
    if (__builtin_expect(!hak_tls_is_ready(), 0)) {
        extern void* __libc_malloc(size_t);
        return __libc_malloc(size);
    }
    // Existing code...
    if (__builtin_expect(g_initializing != 0, 0)) {
        extern void* __libc_malloc(size_t);
        return __libc_malloc(size);
    }
    // ...
}
```
**New Helper Function:**
```c
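// core/hakmem.c
// NOTE: because the flag is __thread, the constructor below marks only the
// thread that runs it; threads created later start with the flag at 0.
static __thread int g_tls_ready_flag = 0;

__attribute__((constructor(101)))
static void hak_tls_mark_ready(void) {
    g_tls_ready_flag = 1;
}

int hak_tls_is_ready(void) {
    // Relaxed atomic load: keeps the compiler from caching or eliding the read
    return __atomic_load_n(&g_tls_ready_flag, __ATOMIC_RELAXED);
}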
```
**Pros:**
- Safe even if constructor priorities fail
- Explicit TLS readiness check
- Falls back to libc if TLS not ready
**Cons:**
- Extra branch on malloc hot path (1-2 cycles)
- Requires touching another TLS variable (`g_tls_ready_flag`)
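**Illustrative sketch (not HAKMEM code):** The second drawback can be avoided if the readiness flag is a process-wide atomic instead of a `__thread` variable, so the check itself never touches TLS and threads created later see the same value. This is an assumption-laden variant, not the implementation proposed above: all `ready_demo_*` names are hypothetical, the flag is set through an explicit function so the behaviour can be exercised, and the counters exist only for the demonstration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Process-wide flag: reading it does not involve TLS. */
static atomic_int ready_demo_flag = 0;

/* Demo-only counters (not thread-safe). */
static int ready_demo_fallbacks = 0;
static int ready_demo_fast = 0;

/* In a real build this would be called from an early constructor. */
void ready_demo_set(int ready) {
    atomic_store_explicit(&ready_demo_flag, ready, memory_order_release);
}

static int ready_demo_is_ready(void) {
    return atomic_load_explicit(&ready_demo_flag, memory_order_acquire);
}

/* Hypothetical wrapper: falls back to libc until the flag is set. */
void *ready_demo_alloc(size_t size) {
    if (__builtin_expect(!ready_demo_is_ready(), 0)) {
        ready_demo_fallbacks++;
        return malloc(size);
    }
    ready_demo_fast++;
    return malloc(size); /* stand-in for the path that may touch TLS */
}
```

The extra branch on the hot path remains, so the benchmark step in the action plan still applies to this variant.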
---
### 6.3 Option C: Weak Symbol Aliasing (Advanced) ⭐⭐⭐
**Difficulty:** Hard
**Risk:** High (portability, build system complexity)
**Effectiveness:** Medium (70% confidence)
**Implementation:**
**File:** `core/box/hak_wrappers.inc.h`
```c
// Weak alias: Allow ASan to override if needed
__attribute__((weak))
void* malloc(size_t size) {
// ... HAKMEM implementation
}
// Strong symbol for internal use
void* hak_malloc_internal(size_t size) {
// ... same implementation
}
```
**Pros:**
- Allows ASan to fully control malloc symbol
- HAKMEM can still use internal allocation
**Cons:**
- Complex build interactions
- May not work with all linker configurations
- Debugging becomes harder (symbol resolution issues)
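**Illustrative sketch (not HAKMEM code):** The weak/strong split can be shown without overriding `malloc` itself. The sketch below assumes GCC or Clang; `weak_demo_alloc()` and `weak_demo_alloc_internal()` are hypothetical names. With no strong definition of `weak_demo_alloc` elsewhere in the link, the weak one is used and forwards to the internal function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Demo-only counter (not thread-safe). */
static int weak_demo_internal_calls = 0;

/* Strong symbol: internal callers use this name directly, so they are
 * unaffected if the public entry point is replaced. */
void *weak_demo_alloc_internal(size_t size) {
    weak_demo_internal_calls++;
    return malloc(size);
}

/* Weak definition: a strong definition of weak_demo_alloc in another
 * object file takes precedence at static link time. */
__attribute__((weak))
void *weak_demo_alloc(size_t size) {
    return weak_demo_alloc_internal(size);
}
```

One caveat that supports the "Medium" effectiveness rating: weak versus strong is resolved by the static linker. glibc's dynamic linker treats weak and strong definitions alike by default (unless `LD_DYNAMIC_WEAK` is set), so when the competing `malloc` lives in a shared Sanitizer runtime, library lookup order decides which definition wins, not the weak attribute.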
---
### 6.4 Option D: Disable Wrappers for Sanitizer Builds (Pragmatic) ⭐⭐⭐⭐⭐
**Difficulty:** Easy
**Risk:** Low
**Effectiveness:** 100% (but limited scope)
**Implementation:**
**File:** `Makefile:810-811`
```makefile
# OLD (broken):
SAN_ASAN_ALLOC_CFLAGS = -O1 -g -fno-omit-frame-pointer -fno-lto \
-fsanitize=address,undefined -fno-sanitize-recover=all -fstack-protector-strong
# NEW (fixed):
SAN_ASAN_ALLOC_CFLAGS = -O1 -g -fno-omit-frame-pointer -fno-lto \
-fsanitize=address,undefined -fno-sanitize-recover=all -fstack-protector-strong \
-DHAKMEM_FORCE_LIBC_ALLOC_BUILD=1 # ← Bypass HAKMEM allocator
```
**Rationale:**
- Sanitizer builds should focus on **application logic bugs**, not allocator bugs
- HAKMEM allocator can be tested separately without Sanitizers
- Eliminates all TLS/constructor issues
**Pros:**
- Immediate fix (1-line change)
- Zero risk
- Sanitizers work as intended
**Cons:**
- Cannot test HAKMEM allocator with Sanitizers
- Defeats purpose of `-alloc` variants
**Recommended Naming:**
```bash
# Current (misleading):
larson_hakmem_asan_alloc # Implies HAKMEM allocator is used
# Better naming:
larson_hakmem_asan_libc # Clarifies libc malloc is used
larson_hakmem_asan_nalloc # "no allocator" (HAKMEM disabled)
```
---
## 7. Recommended Action Plan
### Phase 1: Immediate Fix (1 day) ✅
1. **Add `-DHAKMEM_FORCE_LIBC_ALLOC_BUILD=1` to SAN_*_ALLOC_CFLAGS** (Makefile:810, 823)
2. Rename binaries for clarity:
- `larson_hakmem_asan_alloc` → `larson_hakmem_asan_libc`
- `larson_hakmem_tsan_alloc` → `larson_hakmem_tsan_libc`
3. Verify all Sanitizer builds work correctly
### Phase 2: Constructor Priority Fix (2-3 days)
1. Add `__attribute__((constructor(101)))` to `hakmem_tls_preinit()`
2. Test with ASan/TSan/UBSan (allocator enabled)
3. Document constructor priority ranges in `ARCHITECTURE.md`
### Phase 3: Defensive TLS Check (1 week, optional)
1. Implement `hak_tls_is_ready()` helper
2. Add early exit in `malloc()` wrapper
3. Benchmark performance impact (should be < 1%)
### Phase 4: Documentation (ongoing)
1. Update `CLAUDE.md` with Sanitizer findings
2. Add "Sanitizer Compatibility" section to README
3. Document TLS variable inventory
---
## 8. Testing Matrix
| Build Type | Allocator | Sanitizer | Expected Result | Actual Result |
|------------|-----------|-----------|-----------------|---------------|
| `asan-larson` | libc | ASan+UBSan | ✅ Pass | ✅ Pass |
| `tsan-larson` | libc | TSan | ✅ Pass | ✅ Pass |
| `asan-larson-alloc` | HAKMEM | ASan+UBSan | ✅ Pass | ❌ SEGV (TLS) |
| `tsan-larson-alloc` | HAKMEM | TSan | ✅ Pass | ❌ SEGV (TLS) |
| `asan-shared-alloc` | HAKMEM | ASan+UBSan | ✅ Pass | ❌ SEGV (TLS) |
| `tsan-shared-alloc` | HAKMEM | TSan | ✅ Pass | ❌ SEGV (TLS) |
**Target:** All ✅ after Phase 1 (libc) + Phase 2 (constructor priority)

---
## 9. References
### 9.1 Related Code Files
- `core/hakmem.c:188` - TLS recursion guard
- `core/box/hak_wrappers.inc.h:40` - malloc wrapper entry point
- `core/box/hak_core_init.inc.h:29` - Initialization flow
- `core/hakmem_syscall.c:41` - dlsym initialization
- `Makefile:799-824` - Sanitizer build flags
### 9.2 External Documentation
- [GCC Constructor/Destructor Attributes](https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-constructor-function-attribute)
- [ASan Initialization Order](https://github.com/google/sanitizers/wiki/AddressSanitizerInitializationOrderFiasco)
- [ELF TLS Specification](https://www.akkadia.org/drepper/tls.pdf)
- [glibc rtld-malloc.h](https://sourceware.org/git/?p=glibc.git;a=blob;f=include/rtld-malloc.h)
---
## 10. Conclusion
The HAKMEM Sanitizer crash is a **classic initialization order problem** exacerbated by ASan's aggressive use of `malloc()` during `dlsym()` resolution. The immediate fix is trivial (enable `HAKMEM_FORCE_LIBC_ALLOC_BUILD`), but enabling Sanitizer instrumentation of HAKMEM itself requires careful constructor priority management.
**Recommended Path:** Implement Phase 1 (immediate) + Phase 2 (robust) for full Sanitizer support with allocator instrumentation enabled.

---
**Report Author:** Claude Code (Sonnet 4.5)
**Investigation Date:** 2025-11-07
**Last Updated:** 2025-11-07