Phase 6-3 Tiny Fast Path: -20% Regression Root Cause Analysis (Ultrathink)
Status: Root cause identified
Severity: Critical - Performance regression + Out-of-Memory crash
Date: 2025-11-05
Executive Summary
Phase 6-3 attempted to implement a "System tcache-style" 3-4 instruction fast path for Tiny allocations (<=128B), targeting 70-80% of System malloc performance. Instead, it caused a -20% regression (4.19M → 3.35M ops/s) and crashes due to Out-of-Memory (OOM).
Root Cause: Fast Path implementation creates a double-layered allocation path with catastrophic OOM failure in superslab_refill(), causing:
- Every Fast Path attempt to fail and fall back to the existing Tiny path
- Additional overhead from failed Fast Path checks (~15-20% slowdown)
- Memory leak leading to OOM crash (43,658 allocations, 0 frees, 45 GB leaked)
Impact:
- Before (Phase 6-2.2): 4.19M ops/s (Box Refactor baseline)
- After (Phase 6-3): 3.35M ops/s (-20% regression)
- OOM crash:
mmap failed: err=12 (ENOMEM) bytes=45778731008 (45 GB)
1. Root Cause Discovery
1.1 Double-Layered Allocation Path (Primary Cause)
Phase 6-3 adds Fast Path on TOP of existing Box Refactor path:
Before (Phase 6-2.2 - 4.19M ops/s):
malloc() → hkm_custom_malloc() → hak_tiny_alloc() [Box Refactor]
↓
Success (4.19M ops/s)
After (Phase 6-3 - 3.35M ops/s):
malloc() → hkm_custom_malloc() → hak_alloc_at()
↓
tiny_fast_alloc() [Fast Path]
↓
g_tiny_fast_cache[cls] == NULL (always!)
↓
tiny_fast_refill(cls)
↓
hak_tiny_alloc_slow(size, cls)
↓
hak_tiny_alloc_superslab(cls)
↓
superslab_refill() → NULL (OOM!)
↓
Fast Path returns NULL
↓
hak_tiny_alloc() [Box Refactor fallback]
↓
ALSO FAILS (OOM) → benchmark crash
Overhead introduced:
- tiny_fast_alloc() initialization check
- tiny_fast_refill() call (complex multi-layer refill chain)
- superslab_refill() OOM failure
- Fallback to existing Box Refactor path
- Box Refactor path ALSO fails due to same OOM
Result: ~20% overhead from failed Fast Path + eventual OOM crash
1.2 SuperSlab OOM Failure (Secondary Cause)
Fast Path refill chain triggers SuperSlab OOM:
[DEBUG] superslab_refill NULL detail: class=2 prev_ss=(nil) active=0
bitmap=0x00000000 prev_meta=(nil) used=0 cap=0 slab_idx=0
reused_freelist=0 free_idx=-2 errno=12
[SS OOM] mmap failed: err=12 ss_size=1048576 alloc_size=2097152
alloc=43658 freed=0 bytes=45778731008
RLIMIT_AS(cur=inf max=inf) VmSize=134332460 kB VmRSS=3583744 kB
Critical Evidence:
- 43,658 allocations
- 0 frees (!!)
- 45 GB allocated before crash
This is a massive memory leak - freed blocks are not being returned to SuperSlab freelist.
Connection to FAST_CAP_0 Issue:
This is the SAME bug documented in FAST_CAP_0_SEGV_ROOT_CAUSE_ANALYSIS.md:
- When TLS List mode is active (g_tls_list_enable=1), freed blocks go to TLS List cache
- These blocks NEVER get merged back into SuperSlab freelist
- Allocation path tries to allocate from freelist, which contains stale pointers
- Eventually runs out of memory (OOM)
1.3 Why Statistics Don't Appear
User reported: HAKMEM_TINY_FAST_STATS=1 shows no output.
Reasons:
- No shutdown hook registered:
  - tiny_fast_print_stats() exists in tiny_fastcache.c:118
  - But it's NEVER called (no atexit() registration)
- Thread-local counters lost:
  - g_tiny_fast_refill_count and g_tiny_fast_drain_count are __thread variables
  - When threads exit, these are lost
  - No aggregation or reporting mechanism
- Early crash:
  - OOM crash occurs before statistics can be printed
  - Benchmark terminates abnormally
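The first two causes could be addressed along the following lines. This is a minimal sketch, not the HAKMEM implementation: every `demo_*` name is hypothetical, and it assumes a GCC/Clang toolchain with `__thread` and C11 atomics. Each thread folds its thread-local counters into process-wide atomic totals before it exits, and an `atexit()` hook prints the totals. It does not help with the third cause, because abnormal termination skips `atexit()` handlers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Per-thread counters, mirroring the __thread counters described above.
 * All demo_* names are hypothetical. */
static __thread uint64_t demo_tls_refills;
static __thread uint64_t demo_tls_drains;

/* Process-wide totals that survive thread exit. */
static _Atomic uint64_t demo_total_refills;
static _Atomic uint64_t demo_total_drains;

void demo_count_refill(void) { demo_tls_refills++; }
void demo_count_drain(void)  { demo_tls_drains++; }

/* Fold the calling thread's counters into the totals and reset them.
 * Intended to be called at the end of each thread's work. */
void demo_stats_flush(void) {
    atomic_fetch_add(&demo_total_refills, demo_tls_refills);
    atomic_fetch_add(&demo_total_drains, demo_tls_drains);
    demo_tls_refills = 0;
    demo_tls_drains = 0;
}

/* Read the aggregated totals (flushed threads only). */
void demo_stats_totals(uint64_t *refills, uint64_t *drains) {
    assert(refills != NULL && drains != NULL);
    *refills = atomic_load(&demo_total_refills);
    *drains = atomic_load(&demo_total_drains);
}

/* atexit() handler: include the exiting thread, then report. */
static void demo_stats_print(void) {
    uint64_t refills, drains;
    demo_stats_flush();
    demo_stats_totals(&refills, &drains);
    fprintf(stderr, "[demo stats] refills=%llu drains=%llu\n",
            (unsigned long long)refills, (unsigned long long)drains);
}

/* Register the report only when HAKMEM_TINY_FAST_STATS starts with '1'.
 * Returns 1 if the hook was registered, 0 otherwise. */
int demo_stats_init(void) {
    const char *env = getenv("HAKMEM_TINY_FAST_STATS");
    if (env == NULL || env[0] != '1') return 0;
    return atexit(demo_stats_print) == 0;
}
```

Calling `demo_stats_init()` once during allocator initialization and `demo_stats_flush()` at the end of each worker thread would be enough for the totals to reach the report on a normal exit.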
1.4 Larson Benchmark Special Handling
Larson uses a custom malloc shim that bypasses one layer of Fast Path:
File: bench_larson_hakmem_shim.c
void* hkm_custom_malloc(size_t sz) {
    if (s_tiny_pref && sz <= 1024) {
        // Bypass wrappers: go straight to Tiny
        void* ptr = hak_tiny_alloc(sz); // ← Calls Box Refactor directly
        if (ptr == NULL) {
            return hak_alloc_at(sz, HAK_CALLSITE()); // ← Fast Path HERE
        }
        return ptr;
    }
    return hak_alloc_at(sz, HAK_CALLSITE()); // ← Fast Path HERE too
}
Environment Variables:
- HAKMEM_LARSON_TINY_ONLY=1 → calls hak_tiny_alloc() directly (bypasses Fast Path in malloc())
- HAKMEM_LARSON_TINY_ONLY=0 → calls hak_alloc_at() (hits Fast Path)
Impact:
- Fast Path in malloc() (lines 1294-1309) is NEVER EXECUTED by Larson
- Fast Path in hak_alloc_at() (lines 682-697) IS executed
- This creates a single-layered Fast Path, but still fails due to OOM
2. Build Configuration Conflicts
2.1 Conflicting Build Flags
Makefile (lines 54-77):
# Box Refactor: ON by default (4.19M ops/s baseline)
BOX_REFACTOR_DEFAULT ?= 1
ifeq ($(BOX_REFACTOR_DEFAULT),1)
CFLAGS += -DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1
endif
# Fast Path: ON by default (Phase 6-3 experiment)
TINY_FAST_PATH_DEFAULT ?= 1
ifeq ($(TINY_FAST_PATH_DEFAULT),1)
CFLAGS += -DHAKMEM_TINY_FAST_PATH=1
endif
Both flags are active simultaneously! This creates the double-layered path.
2.2 Code Path Analysis
File: core/hakmem.c:hak_alloc_at()
// Lines 682-697: Phase 6-3 Fast Path
#ifdef HAKMEM_TINY_FAST_PATH
if (size <= TINY_FAST_THRESHOLD) {
    void* ptr = tiny_fast_alloc(size);
    if (ptr) return ptr;
    // Fall through to slow path on failure
}
#endif

// Lines 704-740: Phase 6-1.7 Box Refactor Path (existing)
if (size <= TINY_MAX_SIZE) {
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
    tiny_ptr = hak_tiny_alloc_fast_wrapper(size); // Box Refactor
#else
    tiny_ptr = hak_tiny_alloc(size); // Standard path
#endif
    if (tiny_ptr) return tiny_ptr;
}
Flow:
- Fast Path check (ALWAYS fails due to OOM)
- Box Refactor path check (also fails due to same OOM)
- Both paths try to allocate from SuperSlab
- SuperSlab is exhausted → crash
3. hak_tiny_alloc_slow() Investigation
3.1 Function Location
$ grep -r "hak_tiny_alloc_slow" core/
core/hakmem_tiny.c:197:void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(...);
core/hakmem_tiny_slow.inc:7:void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(...)
core/tiny_fastcache.c:25:extern void* hak_tiny_alloc_slow(size_t size, int class_idx);
Definition: core/hakmem_tiny_slow.inc (included by hakmem_tiny.c)
Export condition:
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(size_t size, int class_idx);
#else
static void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(size_t size, int class_idx);
#endif
Since HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 is active, this function is exported and accessible from tiny_fastcache.c.
3.2 Implementation Analysis
File: core/hakmem_tiny_slow.inc
void* hak_tiny_alloc_slow(size_t size, int class_idx) {
    // Try HotMag refill
    if (g_hotmag_enable && class_idx <= 3) {
        void* ptr = hotmag_pop(class_idx);
        if (ptr) return ptr;
    }
    // Try TLS list refill
    if (g_tls_list_enable) {
        void* ptr = tls_list_pop(&g_tls_lists[class_idx]);
        if (ptr) return ptr;
        // Try refilling TLS list from slab
        if (tls_refill_from_tls_slab(...) > 0) {
            void* ptr = tls_list_pop(...);
            if (ptr) return ptr;
        }
    }
    // Final fallback: allocate from superslab
    void* ss_ptr = hak_tiny_alloc_superslab(class_idx); // ← OOM HERE!
    return ss_ptr;
}
Problem: This is a complex multi-tier refill chain:
- HotMag tier (optional)
- TLS List tier (optional)
- TLS Slab tier (optional)
- SuperSlab tier (final fallback)
When all tiers fail → returns NULL → Fast Path fails → Box Refactor also fails → OOM crash
4. Why Fast Path is Always Empty
4.1 TLS Cache Never Refills
File: core/tiny_fastcache.c:tiny_fast_refill()
void* tiny_fast_refill(int class_idx) {
    int refilled = 0;
    size_t size = class_sizes[class_idx];
    // Batch allocation: try to get multiple blocks at once
    for (int i = 0; i < TINY_FAST_REFILL_BATCH; i++) {
        void* ptr = hak_tiny_alloc_slow(size, class_idx); // ← OOM!
        if (!ptr) break; // Failed on FIRST iteration
        // Push to fast cache (never reached)
        if (g_tiny_fast_count[class_idx] < TINY_FAST_CACHE_CAP) {
            *(void**)ptr = g_tiny_fast_cache[class_idx];
            g_tiny_fast_cache[class_idx] = ptr;
            g_tiny_fast_count[class_idx]++;
            refilled++;
        }
    }
    // Pop one for caller
    void* result = g_tiny_fast_cache[class_idx]; // ← Still NULL!
    return result; // Returns NULL
}
Flow:
- Tries to allocate 16 blocks via hak_tiny_alloc_slow()
- First allocation fails (OOM) → loop breaks immediately
- g_tiny_fast_cache[class_idx] remains NULL
- Returns NULL to caller
Result: Fast Path cache is ALWAYS empty, so EVERY allocation hits slow path.
5. Detailed Regression Mechanism
5.1 Instruction Count Comparison
Phase 6-2.2 (Box Refactor - 4.19M ops/s):
malloc() → hkm_custom_malloc()
↓ (5 instructions)
hak_tiny_alloc()
↓ (10-15 instructions, Box Refactor fast path)
Success
Phase 6-3 (Fast Path + Box Refactor - 3.35M ops/s):
malloc() → hkm_custom_malloc()
↓ (5 instructions)
hak_alloc_at()
↓ (3-4 instructions: Fast Path check)
tiny_fast_alloc()
↓ (1-2 instructions: cache check)
g_tiny_fast_cache[cls] == NULL
↓ (function call)
tiny_fast_refill()
↓ (30-40 instructions: loop + size mapping)
hak_tiny_alloc_slow()
↓ (50-100 instructions: multi-tier refill chain)
hak_tiny_alloc_superslab()
↓ (100+ instructions)
superslab_refill() → NULL (OOM)
↓ (return path)
tiny_fast_refill returns NULL
↓ (return path)
tiny_fast_alloc returns NULL
↓ (fallback to Box Refactor)
hak_tiny_alloc()
↓ (10-15 instructions)
ALSO FAILS (OOM) → crash
Added overhead:
- ~200-300 instructions per allocation (failed Fast Path attempt)
- Multiple function calls (7 levels deep)
- Branch mispredictions (Fast Path always fails)
Estimated slowdown: 15-25% from instruction overhead + branch misprediction
5.2 Why -20% Exactly?
Calculation:
Baseline (Phase 6-2.2): 4.19M ops/s = 238 ns/op
Regression (Phase 6-3): 3.35M ops/s = 298 ns/op
Added overhead: 298 - 238 = 60 ns/op
Percentage: 60 / 238 = 25.2% slowdown
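Actual regression: -20% (throughput)
Why not -25%?
The two figures describe the same measurement in different units. A 25% increase in time per operation means throughput falls to 1 / 1.25 = 0.80 of baseline, which is the observed -20%. No additional effect is needed to reconcile them. The measured 3.35M ops/s itself still carries caveats:
- Some allocations still succeed before OOM crash
- Benchmark may be terminating early, inflating ops/s
- Measurement noise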
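The relationship between the two percentages can be checked directly. The helpers below are an illustrative sketch (the function names are not from the codebase); they convert throughput to time per operation and compute both views of the same regression.

```c
#include <assert.h>

/* Convert throughput in million ops/s to nanoseconds per operation:
 * 1e9 ns/s divided by (mops * 1e6 ops/s). */
double ns_per_op(double mops) {
    assert(mops > 0.0);
    return 1000.0 / mops;
}

/* Relative throughput change, e.g. about -0.20 for a 20% regression. */
double throughput_change(double before_mops, double after_mops) {
    assert(before_mops > 0.0);
    return (after_mops - before_mops) / before_mops;
}

/* Relative change in time per operation for the same two measurements. */
double latency_change(double before_mops, double after_mops) {
    double before_ns = ns_per_op(before_mops);
    double after_ns = ns_per_op(after_mops);
    return (after_ns - before_ns) / before_ns;
}
```

With the figures above, `throughput_change(4.19, 3.35)` is about -0.20 and `latency_change(4.19, 3.35)` is about +0.25.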
6. Priority-Ranked Fix Proposals
Fix #1: Disable Fast Path (IMMEDIATE - 1 minute)
Impact: Restores 4.19M ops/s baseline
Risk: None (reverts to known-good state)
Effort: Trivial
Implementation:
make clean
make BOX_REFACTOR_DEFAULT=1 TINY_FAST_PATH_DEFAULT=0 larson_hakmem
./larson_hakmem 10 8 128 1024 1 12345 4
Expected result: 4.19M ops/s (baseline restored)
Fix #2: Integrate Fast Path with Box Refactor (SHORT-TERM - 2-4 hours)
Impact: Potentially achieves Fast Path goals WITHOUT regression
Risk: Low (leverages existing Box Refactor infrastructure)
Effort: Moderate
Approach:
- Change tiny_fast_refill() to call hak_tiny_alloc() instead of hak_tiny_alloc_slow()
  - Leverages existing Box Refactor path (known to work at 4.19M ops/s)
  - Avoids OOM issue by using proven allocation path
- Remove Fast Path from hak_alloc_at()
  - Keep Fast Path ONLY in malloc() wrapper
  - Prevents double-layered path
- Simplify refill logic

void* tiny_fast_refill(int class_idx) {
    size_t size = class_sizes[class_idx];
    // Batch allocation via Box Refactor path
    for (int i = 0; i < TINY_FAST_REFILL_BATCH; i++) {
        void* ptr = hak_tiny_alloc(size); // ← Use Box Refactor!
        if (!ptr) break;
        // Push to fast cache
        *(void**)ptr = g_tiny_fast_cache[class_idx];
        g_tiny_fast_cache[class_idx] = ptr;
        g_tiny_fast_count[class_idx]++;
    }
    // Pop one for caller
    void* result = g_tiny_fast_cache[class_idx];
    if (result) {
        g_tiny_fast_cache[class_idx] = *(void**)result;
        g_tiny_fast_count[class_idx]--;
    }
    return result;
}
Expected outcome:
- Fast Path cache actually fills (using Box Refactor backend)
- Subsequent allocations hit 3-4 instruction fast path
- Target: 5.0-6.0M ops/s (20-40% improvement over baseline)
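The intended steady state can be illustrated with a self-contained toy model. Everything in it is an assumption made for the sketch: the `toy_*` names, the single 64-byte size class, the batch size of 16, the cap of 64, and the static arena standing in for `hak_tiny_alloc()`. It only demonstrates the control flow: one refill through a working backend, then cache hits until the batch is consumed.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE   64    /* one size class only */
#define TOY_REFILL_BATCH 16
#define TOY_CACHE_CAP    64
#define TOY_ARENA_BLOCKS 256

/* A block is either linked into the cache or used as payload. */
typedef union toy_block {
    union toy_block *next;
    unsigned char payload[TOY_BLOCK_SIZE];
} toy_block;

/* Stand-in backend: hands out blocks from a static arena, never reuses. */
static toy_block toy_arena[TOY_ARENA_BLOCKS];
static size_t toy_arena_used;
static unsigned toy_backend_calls;

void *toy_backend_alloc(void) {
    toy_backend_calls++;
    if (toy_arena_used == TOY_ARENA_BLOCKS) return NULL;  /* exhausted */
    return &toy_arena[toy_arena_used++];
}

/* Per-thread singly linked cache of free blocks. */
static __thread toy_block *toy_cache_head;
static __thread unsigned toy_cache_count;

/* Slow path: pull up to one batch from the backend, then pop one block. */
void *toy_fast_refill(void) {
    for (int i = 0; i < TOY_REFILL_BATCH; i++) {
        if (toy_cache_count >= TOY_CACHE_CAP) break;
        toy_block *b = toy_backend_alloc();
        if (b == NULL) break;          /* backend failed: stop early */
        b->next = toy_cache_head;      /* push */
        toy_cache_head = b;
        toy_cache_count++;
    }
    toy_block *result = toy_cache_head;
    if (result != NULL) {              /* pop one for the caller */
        toy_cache_head = result->next;
        toy_cache_count--;
    }
    return result;
}

/* Fast path: pop from the cache, refill only when it is empty. */
void *toy_fast_alloc(void) {
    toy_block *b = toy_cache_head;
    if (b != NULL) {
        toy_cache_head = b->next;
        toy_cache_count--;
        return b;
    }
    return toy_fast_refill();
}

/* Return a block to the cache; reports 0 when the cache is full. */
int toy_fast_free(void *p) {
    assert(p != NULL);
    if (toy_cache_count >= TOY_CACHE_CAP) return 0;
    toy_block *b = p;
    b->next = toy_cache_head;
    toy_cache_head = b;
    toy_cache_count++;
    return 1;
}
```

In this model the first `toy_fast_alloc()` call makes 16 backend calls and the next 15 make none. With a backend that fails on its first call, as in section 4.1, the refill loop breaks immediately and every allocation takes the slow path.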
Fix #3: Fix SuperSlab OOM Root Cause (LONG-TERM - 1-2 weeks)
Impact: Eliminates OOM crashes permanently
Risk: High (requires deep understanding of TLS List / SuperSlab interaction)
Effort: High
Problem (from FAST_CAP_0 analysis):
- When g_tls_list_enable=1, freed blocks go to TLS List cache
- These blocks NEVER merge back into SuperSlab freelist
- Allocation path tries to allocate from freelist → stale pointers → crash
Solution:
- Add TLS List → SuperSlab drain path
  - When TLS List spills, return blocks to SuperSlab freelist
  - Ensure proper synchronization (lock-free or per-class mutex)
- Fix remote free handling
  - Ensure cross-thread frees properly update remote_heads[]
  - Add drain points in allocation path
- Add memory leak detection
  - Track allocated vs freed bytes per class
  - Warn when imbalance exceeds threshold
Reference: FAST_CAP_0_SEGV_ROOT_CAUSE_ANALYSIS.md (lines 87-99)
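The leak-detection item could start as simple per-class byte accounting. The sketch below is hypothetical throughout: the `demo_leak_*` names, the class count of 8, and the idea of passing the threshold as a parameter are assumptions, not existing HAKMEM code. It tracks allocated and freed bytes per class with relaxed atomics and flags a class whose outstanding bytes exceed a threshold, which is the kind of imbalance visible in the `alloc=43658 freed=0` log line.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8   /* assumed class count for the sketch */

typedef struct {
    _Atomic uint64_t alloc_bytes;
    _Atomic uint64_t freed_bytes;
} demo_leak_counter;

static demo_leak_counter demo_leak[DEMO_NUM_CLASSES];

/* Record an allocation of `bytes` in size class `cls`. */
void demo_leak_on_alloc(int cls, size_t bytes) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    atomic_fetch_add_explicit(&demo_leak[cls].alloc_bytes, bytes,
                              memory_order_relaxed);
}

/* Record that `bytes` in size class `cls` were returned. */
void demo_leak_on_free(int cls, size_t bytes) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    atomic_fetch_add_explicit(&demo_leak[cls].freed_bytes, bytes,
                              memory_order_relaxed);
}

/* Bytes currently outstanding for a class (allocated minus freed). */
uint64_t demo_leak_outstanding(int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    uint64_t a = atomic_load_explicit(&demo_leak[cls].alloc_bytes,
                                      memory_order_relaxed);
    uint64_t f = atomic_load_explicit(&demo_leak[cls].freed_bytes,
                                      memory_order_relaxed);
    /* The two relaxed loads are not one snapshot; clamp at zero. */
    return a >= f ? a - f : 0;
}

/* Nonzero when the imbalance for a class exceeds the threshold. */
int demo_leak_suspect(int cls, uint64_t threshold_bytes) {
    return demo_leak_outstanding(cls) > threshold_bytes;
}
```

A periodic check, or one placed on the refill path, could call `demo_leak_suspect()` and print a warning well before `mmap` starts failing.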
7. Recommended Action Plan
Phase 1: Immediate Recovery (5 minutes)
- Disable Fast Path (Fix #1)
- Verify 4.19M ops/s baseline restored
- Confirm no OOM crashes
Phase 2: Quick Win (2-4 hours)
- Implement Fix #2 (Integrate Fast Path with Box Refactor)
  - Change tiny_fast_refill() to use hak_tiny_alloc()
  - Remove Fast Path from hak_alloc_at() (keep only in malloc())
  - Run A/B test: baseline vs integrated Fast Path
  - Success criteria: >4.5M ops/s (>7% improvement over baseline)
Phase 3: Root Cause Fix (1-2 weeks, OPTIONAL)
- Implement Fix #3 (Fix SuperSlab OOM)
- Only if Fix #2 still shows OOM issues
- Requires deep architectural changes
- High risk, high reward
8. Test Plan
Test 1: Baseline Recovery
make clean
make BOX_REFACTOR_DEFAULT=1 TINY_FAST_PATH_DEFAULT=0 larson_hakmem
./larson_hakmem 10 8 128 1024 1 12345 4
Expected: 4.19M ops/s, no crashes
Test 2: Integrated Fast Path
# After implementing Fix #2
make clean
make BOX_REFACTOR_DEFAULT=1 TINY_FAST_PATH_DEFAULT=1 larson_hakmem
./larson_hakmem 10 8 128 1024 1 12345 4
Expected: >4.5M ops/s, no crashes, stats show refills working
Test 3: Fast Path Statistics
HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 10 8 128 1024 1 12345 4
Expected: Stats output at end (requires adding atexit() hook)
9. Key Takeaways
- Fast Path was never active - OOM prevented cache refills
- Double-layered allocation - Fast Path + Box Refactor created overhead
- 45 GB memory leak - Freed blocks not returning to SuperSlab
- Same bug as FAST_CAP_0 - TLS List / SuperSlab disconnect
- Easy fix available - Use Box Refactor as Fast Path backend
Confidence in Fix #2: 80% (leverages proven Box Refactor infrastructure)
10. References
- FAST_CAP_0_SEGV_ROOT_CAUSE_ANALYSIS.md - Same OOM root cause
- core/hakmem.c:682-740 - Double-layered allocation path
- core/tiny_fastcache.c:41-84 - Failed refill implementation
- bench_larson_hakmem_shim.c:8-25 - Larson special handling
- Makefile:54-77 - Build flag conflicts

Analysis completed: 2025-11-05
Next step: Implement Fix #1 (disable Fast Path) for immediate recovery