Phase 6-3 Tiny Fast Path: -20% Regression Root Cause Analysis (Ultrathink)

Status: Root cause identified
Severity: Critical - Performance regression + Out-of-Memory crash
Date: 2025-11-05


Executive Summary

Phase 6-3 attempted to implement a "System tcache-style" 3-4 instruction fast path for Tiny allocations (<=128B), targeting 70-80% of System malloc performance. Instead, it caused a -20% regression (4.19M → 3.35M ops/s) and crashes due to Out-of-Memory (OOM).

Root Cause: The Fast Path implementation creates a double-layered allocation path that ends in a catastrophic OOM failure in superslab_refill(), causing:

  1. Every Fast Path attempt to fail and fall back to the existing Tiny path
  2. Additional overhead from failed Fast Path checks (~15-20% slowdown)
  3. A memory leak leading to an OOM crash (43,658 allocations, 0 frees, 45 GB leaked)

Impact:

  • Before (Phase 6-2.2): 4.19M ops/s (Box Refactor baseline)
  • After (Phase 6-3): 3.35M ops/s (-20% regression)
  • OOM crash: mmap failed: err=12 (ENOMEM) bytes=45778731008 (45 GB)

1. Root Cause Discovery

1.1 Double-Layered Allocation Path (Primary Cause)

Phase 6-3 adds the Fast Path on top of the existing Box Refactor path:

Before (Phase 6-2.2 - 4.19M ops/s):

malloc() → hkm_custom_malloc() → hak_tiny_alloc() [Box Refactor]
                                     ↓
                                   Success (4.19M ops/s)

After (Phase 6-3 - 3.35M ops/s):

malloc() → hkm_custom_malloc() → hak_alloc_at()
                                     ↓
                          tiny_fast_alloc() [Fast Path]
                                     ↓
                          g_tiny_fast_cache[cls] == NULL (always!)
                                     ↓
                          tiny_fast_refill(cls)
                                     ↓
                          hak_tiny_alloc_slow(size, cls)
                                     ↓
                          hak_tiny_alloc_superslab(cls)
                                     ↓
                          superslab_refill() → NULL (OOM!)
                                     ↓
                          Fast Path returns NULL
                                     ↓
                          hak_tiny_alloc() [Box Refactor fallback]
                                     ↓
                          ALSO FAILS (OOM) → benchmark crash

Overhead introduced:

  1. tiny_fast_alloc() initialization check
  2. tiny_fast_refill() call (complex multi-layer refill chain)
  3. superslab_refill() OOM failure
  4. Fallback to existing Box Refactor path
  5. Box Refactor path ALSO fails due to same OOM

Result: ~20% overhead from failed Fast Path + eventual OOM crash


1.2 SuperSlab OOM Failure (Secondary Cause)

Fast Path refill chain triggers SuperSlab OOM:

[DEBUG] superslab_refill NULL detail: class=2 prev_ss=(nil) active=0
        bitmap=0x00000000 prev_meta=(nil) used=0 cap=0 slab_idx=0
        reused_freelist=0 free_idx=-2 errno=12

[SS OOM] mmap failed: err=12 ss_size=1048576 alloc_size=2097152
         alloc=43658 freed=0 bytes=45778731008
         RLIMIT_AS(cur=inf max=inf) VmSize=134332460 kB VmRSS=3583744 kB

Critical Evidence:

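  • 43,658 SuperSlab allocations (alloc=43658)
  • 0 frees (!!)
  • 45 GB allocated before crash (43,658 × 1,048,576 B = 45,778,731,008 B, matching ss_size and bytes in the log)

This is a massive memory leak: freed blocks are not being returned to the SuperSlab freelist.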

Connection to FAST_CAP_0 Issue: This is the SAME bug documented in FAST_CAP_0_SEGV_ROOT_CAUSE_ANALYSIS.md:

  • When TLS List mode is active (g_tls_list_enable=1), freed blocks go to TLS List cache
  • These blocks NEVER get merged back into SuperSlab freelist
  • Allocation path tries to allocate from freelist, which contains stale pointers
  • Eventually runs out of memory (OOM)
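
The disconnect described in the bullets above can be illustrated with a deliberately simplified model. The sketch below is not HAKMEM's data layout: `toy_heap`, `toy_alloc`, `toy_free` and `POOL_BLOCKS` are hypothetical names, and the model assumes the allocation path consults only the backing freelist. With `drain_on_alloc` left at 0, freed blocks pile up in the thread-local list and the pool runs dry even though every block was freed.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_BLOCKS 8  /* toy backing pool capacity (assumption) */

typedef struct block { struct block *next; } block;

typedef struct {
    block storage[POOL_BLOCKS];
    block *freelist;     /* backing freelist (stand-in for the SuperSlab freelist) */
    block *tls_list;     /* thread-local cache that receives freed blocks */
    int carved;          /* blocks handed out from fresh storage so far */
    int drain_on_alloc;  /* 1 = splice the cache back when the freelist is empty */
} toy_heap;

void *toy_alloc(toy_heap *h) {
    if (!h->freelist && h->drain_on_alloc) {
        /* Drain path: return cached blocks to the backing freelist. */
        h->freelist = h->tls_list;
        h->tls_list = NULL;
    }
    if (h->freelist) {
        block *b = h->freelist;
        h->freelist = b->next;
        return b;
    }
    if (h->carved < POOL_BLOCKS) {
        return &h->storage[h->carved++];
    }
    return NULL;  /* pool exhausted: the toy equivalent of ENOMEM */
}

void toy_free(toy_heap *h, void *p) {
    /* Freed blocks go to the thread-local list, never directly to the freelist. */
    block *b = p;
    b->next = h->tls_list;
    h->tls_list = b;
}

/* Runs alloc/free pairs and returns how many succeeded before exhaustion. */
int toy_churn(toy_heap *h, int pairs) {
    int ok = 0;
    for (int i = 0; i < pairs; i++) {
        void *p = toy_alloc(h);
        if (!p) break;
        toy_free(h, p);
        ok++;
    }
    return ok;
}
```

Calling `toy_churn` on a zero-initialized heap stops after `POOL_BLOCKS` pairs, while the same call with `drain_on_alloc = 1` keeps reusing a single block. The real fix would also need synchronization, which this single-threaded model ignores.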

1.3 Why Statistics Don't Appear

User reported: HAKMEM_TINY_FAST_STATS=1 shows no output.

Reasons:

  1. No shutdown hook registered:

    • tiny_fast_print_stats() exists in tiny_fastcache.c:118
    • But it's NEVER called (no atexit() registration)
  2. Thread-local counters lost:

    • g_tiny_fast_refill_count and g_tiny_fast_drain_count are __thread variables
    • When threads exit, these are lost
    • No aggregation or reporting mechanism
  3. Early crash:

    • OOM crash occurs before statistics can be printed
    • Benchmark terminates abnormally
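
A minimal sketch of how the first two gaps could be closed, assuming per-thread counters are folded into process-wide totals and a hook is registered with `atexit()`. All function and variable names here are hypothetical; only `HAKMEM_TINY_FAST_STATS` comes from the report, and the real `tiny_fast_print_stats()` may format its output differently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Per-thread counters, standing in for the __thread counters named above. */
static __thread unsigned long t_refill_count;
static __thread unsigned long t_drain_count;

/* Process-wide totals (hypothetical names); these survive thread exit. */
static atomic_ulong g_total_refills;
static atomic_ulong g_total_drains;

void stats_count_refill(void) { t_refill_count++; }
void stats_count_drain(void)  { t_drain_count++; }

/* Folds the calling thread's counters into the totals and resets them.
   A worker thread would call this just before it exits. */
void stats_flush_thread(void) {
    atomic_fetch_add(&g_total_refills, t_refill_count);
    atomic_fetch_add(&g_total_drains, t_drain_count);
    t_refill_count = 0;
    t_drain_count = 0;
}

unsigned long stats_total_refills(void) { return atomic_load(&g_total_refills); }
unsigned long stats_total_drains(void)  { return atomic_load(&g_total_drains); }

/* Exit hook: flushes the thread that called exit(), then prints the totals
   only when HAKMEM_TINY_FAST_STATS=1 is set in the environment. */
void stats_print_at_exit(void) {
    stats_flush_thread();
    const char *env = getenv("HAKMEM_TINY_FAST_STATS");
    if (env && env[0] == '1') {
        fprintf(stderr, "[TINY_FAST] refills=%lu drains=%lu\n",
                stats_total_refills(), stats_total_drains());
    }
}

/* Registers the hook; returns 0 on success, as atexit() does. */
int stats_install(void) {
    return atexit(stats_print_at_exit);
}
```

This does not help with the third reason: `atexit()` handlers run on a normal `exit()` or return from `main`, not when the process is killed by a signal or calls `abort()`, so an OOM crash would still lose the output unless the totals are also dumped from the failure path.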

1.4 Larson Benchmark Special Handling

Larson uses a custom malloc shim that bypasses one layer of the Fast Path:

File: bench_larson_hakmem_shim.c

void* hkm_custom_malloc(size_t sz) {
    if (s_tiny_pref && sz <= 1024) {
        // Bypass wrappers: go straight to Tiny
        void* ptr = hak_tiny_alloc(sz);  // ← Calls Box Refactor directly
        if (ptr == NULL) {
            return hak_alloc_at(sz, HAK_CALLSITE());  // ← Fast Path HERE
        }
        return ptr;
    }
    return hak_alloc_at(sz, HAK_CALLSITE());  // ← Fast Path HERE too
}

Environment Variables:

  • HAKMEM_LARSON_TINY_ONLY=1 → calls hak_tiny_alloc() directly (bypasses Fast Path in malloc())
  • HAKMEM_LARSON_TINY_ONLY=0 → calls hak_alloc_at() (hits Fast Path)

Impact:

  • Fast Path in malloc() (lines 1294-1309) is NEVER EXECUTED by Larson
  • Fast Path in hak_alloc_at() (lines 682-697) IS executed
  • This creates a single-layered Fast Path, but still fails due to OOM

2. Build Configuration Conflicts

2.1 Conflicting Build Flags

Makefile (lines 54-77):

# Box Refactor: ON by default (4.19M ops/s baseline)
BOX_REFACTOR_DEFAULT ?= 1
ifeq ($(BOX_REFACTOR_DEFAULT),1)
CFLAGS += -DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1
endif

# Fast Path: ON by default (Phase 6-3 experiment)
TINY_FAST_PATH_DEFAULT ?= 1
ifeq ($(TINY_FAST_PATH_DEFAULT),1)
CFLAGS += -DHAKMEM_TINY_FAST_PATH=1
endif

Both flags are active simultaneously! This creates the double-layered path.
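
One way to make the stacked configuration visible at runtime is a small helper that reports which paths were compiled in. This is only a sketch: `tiny_compiled_paths()` and the `TINY_PATH_*` flags are hypothetical, and the two macros are defined locally so the fragment compiles on its own (in the real build they arrive through CFLAGS).

```c
#include <assert.h>

/* In the real build these come from the Makefile via -D flags; they are
   defined here only so that the sketch compiles standalone. */
#ifndef HAKMEM_TINY_PHASE6_BOX_REFACTOR
#define HAKMEM_TINY_PHASE6_BOX_REFACTOR 1
#endif
#ifndef HAKMEM_TINY_FAST_PATH
#define HAKMEM_TINY_FAST_PATH 1
#endif

/* Hypothetical bit flags describing which tiny paths were compiled in. */
enum { TINY_PATH_BOX_REFACTOR = 1, TINY_PATH_FAST = 2 };

/* Returns a bitmask of the tiny allocation paths present in this build. */
int tiny_compiled_paths(void) {
    int paths = 0;
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
    paths |= TINY_PATH_BOX_REFACTOR;
#endif
#ifdef HAKMEM_TINY_FAST_PATH
    paths |= TINY_PATH_FAST;
#endif
    return paths;
}

/* Returns 1 when both paths are compiled in, i.e. the stacked configuration. */
int tiny_paths_are_stacked(void) {
    return tiny_compiled_paths() == (TINY_PATH_BOX_REFACTOR | TINY_PATH_FAST);
}
```

A benchmark harness could log this value at startup so that A/B runs record which configuration was actually measured.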


2.2 Code Path Analysis

File: core/hakmem.c:hak_alloc_at()

// Lines 682-697: Phase 6-3 Fast Path
#ifdef HAKMEM_TINY_FAST_PATH
    if (size <= TINY_FAST_THRESHOLD) {
        void* ptr = tiny_fast_alloc(size);
        if (ptr) return ptr;
        // Fall through to slow path on failure
    }
#endif

// Lines 704-740: Phase 6-1.7 Box Refactor Path (existing)
    if (size <= TINY_MAX_SIZE) {
        #ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
            tiny_ptr = hak_tiny_alloc_fast_wrapper(size);  // Box Refactor
        #else
            tiny_ptr = hak_tiny_alloc(size);  // Standard path
        #endif
        if (tiny_ptr) return tiny_ptr;
    }

Flow:

  1. Fast Path check (ALWAYS fails due to OOM)
  2. Box Refactor path check (also fails due to same OOM)
  3. Both paths try to allocate from SuperSlab
  4. SuperSlab is exhausted → crash

3. hak_tiny_alloc_slow() Investigation

3.1 Function Location

$ grep -r "hak_tiny_alloc_slow" core/
core/hakmem_tiny.c:197:void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(...);
core/hakmem_tiny_slow.inc:7:void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(...)
core/tiny_fastcache.c:25:extern void* hak_tiny_alloc_slow(size_t size, int class_idx);

Definition: core/hakmem_tiny_slow.inc (included by hakmem_tiny.c)

Export condition:

#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(size_t size, int class_idx);
#else
static void* __attribute__((cold, noinline)) hak_tiny_alloc_slow(size_t size, int class_idx);
#endif

Since HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 is active, this function is exported and accessible from tiny_fastcache.c.


3.2 Implementation Analysis

File: core/hakmem_tiny_slow.inc

void* hak_tiny_alloc_slow(size_t size, int class_idx) {
    // Try HotMag refill
    if (g_hotmag_enable && class_idx <= 3) {
        void* ptr = hotmag_pop(class_idx);
        if (ptr) return ptr;
    }

    // Try TLS list refill
    if (g_tls_list_enable) {
        void* ptr = tls_list_pop(&g_tls_lists[class_idx]);
        if (ptr) return ptr;
        // Try refilling TLS list from slab
        if (tls_refill_from_tls_slab(...) > 0) {
            void* ptr = tls_list_pop(...);
            if (ptr) return ptr;
        }
    }

    // Final fallback: allocate from superslab
    void* ss_ptr = hak_tiny_alloc_superslab(class_idx);  // ← OOM HERE!
    return ss_ptr;
}

Problem: This is a complex multi-tier refill chain:

  1. HotMag tier (optional)
  2. TLS List tier (optional)
  3. TLS Slab tier (optional)
  4. SuperSlab tier (final fallback)

When all tiers fail → returns NULL → Fast Path fails → Box Refactor also fails → OOM crash


4. Why Fast Path is Always Empty

4.1 TLS Cache Never Refills

File: core/tiny_fastcache.c:tiny_fast_refill()

void* tiny_fast_refill(int class_idx) {
    int refilled = 0;
    size_t size = class_sizes[class_idx];

    // Batch allocation: try to get multiple blocks at once
    for (int i = 0; i < TINY_FAST_REFILL_BATCH; i++) {
        void* ptr = hak_tiny_alloc_slow(size, class_idx);  // ← OOM!
        if (!ptr) break;  // Failed on FIRST iteration

        // Push to fast cache (never reached)
        if (g_tiny_fast_count[class_idx] < TINY_FAST_CACHE_CAP) {
            *(void**)ptr = g_tiny_fast_cache[class_idx];
            g_tiny_fast_cache[class_idx] = ptr;
            g_tiny_fast_count[class_idx]++;
            refilled++;
        }
    }

    // Pop one for caller
    void* result = g_tiny_fast_cache[class_idx];  // ← Still NULL!
    return result;  // Returns NULL
}

Flow:

  1. Tries to allocate 16 blocks via hak_tiny_alloc_slow()
  2. First allocation fails (OOM) → loop breaks immediately
  3. g_tiny_fast_cache[class_idx] remains NULL
  4. Returns NULL to caller

Result: Fast Path cache is ALWAYS empty, so EVERY allocation hits slow path.
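
The effect of a failing backend on a batch-refilled cache can be reproduced in isolation. The sketch below is a standalone model, not the code in tiny_fastcache.c: `fast_cache`, `cache_alloc`, `cache_refill` and both backends are hypothetical, the batch size of 16 is taken from the flow above, and `malloc()` stands in for a working backend.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define REFILL_BATCH 16  /* batch size, matching the "16 blocks" in the flow above */
#define BLOCK_SIZE   64  /* arbitrary demo block size, large enough for a pointer */

typedef void *(*backend_fn)(size_t);

typedef struct {
    void *head;         /* intrusive singly linked list of cached blocks */
    int count;          /* number of cached blocks */
    long hits;          /* allocations served straight from the cache */
    long refills;       /* number of refill attempts */
    backend_fn backend; /* slow-path allocator used by refill */
} fast_cache;

void *backend_always_fails(size_t size) { (void)size; return NULL; }
void *backend_malloc(size_t size) { return malloc(size); }

/* Pulls up to REFILL_BATCH blocks from the backend, then pops one for the caller. */
void *cache_refill(fast_cache *c) {
    c->refills++;
    for (int i = 0; i < REFILL_BATCH; i++) {
        void *p = c->backend(BLOCK_SIZE);
        if (!p) break;            /* backend failure ends the batch early */
        *(void **)p = c->head;    /* next pointer is stored inside the block */
        c->head = p;
        c->count++;
    }
    void *result = c->head;
    if (result) {
        c->head = *(void **)result;
        c->count--;
    }
    return result;
}

/* Fast path: pop from the cache, falling back to refill when it is empty. */
void *cache_alloc(fast_cache *c) {
    void *p = c->head;
    if (p) {
        c->head = *(void **)p;
        c->count--;
        c->hits++;
        return p;
    }
    return cache_refill(c);
}

/* Runs n allocations and returns the number of refill attempts.
   Blocks are intentionally not freed; this is a counting demo only. */
long run_allocs(fast_cache *c, int n) {
    for (int i = 0; i < n; i++) {
        (void)cache_alloc(c);
    }
    return c->refills;
}
```

With `backend_always_fails`, every one of 160 allocations triggers a refill and none is a cache hit, which is the pattern described above. With `backend_malloc`, the same 160 allocations need 10 refills and 150 are served from the cache.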


5. Detailed Regression Mechanism

5.1 Instruction Count Comparison

Phase 6-2.2 (Box Refactor - 4.19M ops/s):

malloc() → hkm_custom_malloc()
  ↓ (5 instructions)
hak_tiny_alloc()
  ↓ (10-15 instructions, Box Refactor fast path)
Success

Phase 6-3 (Fast Path + Box Refactor - 3.35M ops/s):

malloc() → hkm_custom_malloc()
  ↓ (5 instructions)
hak_alloc_at()
  ↓ (3-4 instructions: Fast Path check)
tiny_fast_alloc()
  ↓ (1-2 instructions: cache check)
g_tiny_fast_cache[cls] == NULL
  ↓ (function call)
tiny_fast_refill()
  ↓ (30-40 instructions: loop + size mapping)
hak_tiny_alloc_slow()
  ↓ (50-100 instructions: multi-tier refill chain)
hak_tiny_alloc_superslab()
  ↓ (100+ instructions)
superslab_refill() → NULL (OOM)
  ↓ (return path)
tiny_fast_refill returns NULL
  ↓ (return path)
tiny_fast_alloc returns NULL
  ↓ (fallback to Box Refactor)
hak_tiny_alloc()
  ↓ (10-15 instructions)
ALSO FAILS (OOM) → crash

Added overhead:

  • ~200-300 instructions per allocation (failed Fast Path attempt)
  • Multiple function calls (7 levels deep)
  • Branch mispredictions (Fast Path always fails)

Estimated slowdown: 15-25% from instruction overhead + branch misprediction


5.2 Why -20% Exactly?

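Calculation:

Baseline (Phase 6-2.2): 4.19M ops/s = 238.7 ns/op
Regression (Phase 6-3): 3.35M ops/s = 298.5 ns/op

Added overhead: 298.5 - 238.7 = 59.8 ns/op
Time per op: 59.8 / 238.7 = +25.1%
Throughput: (3.35 - 4.19) / 4.19 = -20.0%

Why -20% and not -25%?

  • The two figures describe the same measurement: a 25% increase in time per operation is a 20% drop in throughput (1 / 1.25 = 0.80)
  • Allocations that succeed before the OOM crash, early benchmark termination, and measurement noise can still skew the absolute ops/s numbers, but they are not needed to reconcile the two percentages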
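
The relationship between the throughput figures and the per-operation times can be checked with a few lines of C. The helper names below are hypothetical; the inputs are the two measurements quoted above.

```c
#include <assert.h>

/* Converts throughput in millions of ops per second to nanoseconds per op. */
double ns_per_op(double mops) {
    return 1000.0 / mops;
}

/* Relative throughput change in percent (negative means slower). */
double throughput_change_pct(double before_mops, double after_mops) {
    return (after_mops - before_mops) / before_mops * 100.0;
}

/* Relative change in time per operation in percent (positive means slower). */
double latency_change_pct(double before_mops, double after_mops) {
    double before_ns = ns_per_op(before_mops);
    double after_ns = ns_per_op(after_mops);
    return (after_ns - before_ns) / before_ns * 100.0;
}
```

For 4.19M and 3.35M ops/s these return about -20.0% and +25.1% respectively, so a 20% throughput loss and a 25% longer time per operation are two views of the same change.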

6. Priority-Ranked Fix Proposals

Fix #1: Disable Fast Path (IMMEDIATE - 1 minute)

Impact: Restores 4.19M ops/s baseline
Risk: None (reverts to known-good state)
Effort: Trivial

Implementation:

make clean
make BOX_REFACTOR_DEFAULT=1 TINY_FAST_PATH_DEFAULT=0 larson_hakmem
./larson_hakmem 10 8 128 1024 1 12345 4

Expected result: 4.19M ops/s (baseline restored)


Fix #2: Integrate Fast Path with Box Refactor (SHORT-TERM - 2-4 hours)

Impact: Potentially achieves Fast Path goals WITHOUT regression
Risk: Low (leverages existing Box Refactor infrastructure)
Effort: Moderate

Approach:

  1. Change tiny_fast_refill() to call hak_tiny_alloc() instead of hak_tiny_alloc_slow()

    • Leverages existing Box Refactor path (known to work at 4.19M ops/s)
    • Avoids OOM issue by using proven allocation path
  2. Remove Fast Path from hak_alloc_at()

    • Keep Fast Path ONLY in malloc() wrapper
    • Prevents double-layered path
  3. Simplify refill logic

    void* tiny_fast_refill(int class_idx) {
        size_t size = class_sizes[class_idx];
    
        // Batch allocation via Box Refactor path; stop early once the cache is full
        for (int i = 0; i < TINY_FAST_REFILL_BATCH &&
                        g_tiny_fast_count[class_idx] < TINY_FAST_CACHE_CAP; i++) {
            void* ptr = hak_tiny_alloc(size);  // ← Use Box Refactor!
            if (!ptr) break;
    
            // Push to fast cache (the next pointer is stored inside the block)
            *(void**)ptr = g_tiny_fast_cache[class_idx];
            g_tiny_fast_cache[class_idx] = ptr;
            g_tiny_fast_count[class_idx]++;
        }
    
        // Pop one for caller (NULL only if the cache is still empty)
        void* result = g_tiny_fast_cache[class_idx];
        if (result) {
            g_tiny_fast_cache[class_idx] = *(void**)result;
            g_tiny_fast_count[class_idx]--;
        }
        return result;
    }
    

Expected outcome:

  • Fast Path cache actually fills (using Box Refactor backend)
  • Subsequent allocations hit 3-4 instruction fast path
  • Target: 5.0-6.0M ops/s (20-40% improvement over baseline)

Fix #3: Fix SuperSlab OOM Root Cause (LONG-TERM - 1-2 weeks)

Impact: Eliminates OOM crashes permanently
Risk: High (requires deep understanding of TLS List / SuperSlab interaction)
Effort: High

Problem (from FAST_CAP_0 analysis):

  • When g_tls_list_enable=1, freed blocks go to TLS List cache
  • These blocks NEVER merge back into SuperSlab freelist
  • Allocation path tries to allocate from freelist → stale pointers → crash

Solution:

  1. Add TLS List → SuperSlab drain path

    • When TLS List spills, return blocks to SuperSlab freelist
    • Ensure proper synchronization (lock-free or per-class mutex)
  2. Fix remote free handling

    • Ensure cross-thread frees properly update remote_heads[]
    • Add drain points in allocation path
  3. Add memory leak detection

    • Track allocated vs freed bytes per class
    • Warn when imbalance exceeds threshold

Reference: FAST_CAP_0_SEGV_ROOT_CAUSE_ANALYSIS.md (lines 87-99)
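
The leak detection step (item 3 above) could start from per-class byte counters. The following is a sketch under stated assumptions: `NUM_CLASSES`, `LEAK_WARN_BYTES` and all function names are hypothetical, a 64-bit counter is used so that totals like the 45 GB in the log do not overflow, and relaxed atomics are assumed to be sufficient because the counters are only used for diagnostics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NUM_CLASSES 8                          /* assumed number of tiny size classes */
#define LEAK_WARN_BYTES ((uint64_t)1 << 30)    /* hypothetical threshold: 1 GiB */

static _Atomic uint64_t g_alloc_bytes[NUM_CLASSES];
static _Atomic uint64_t g_freed_bytes[NUM_CLASSES];

/* Records bytes obtained from the backing allocator for one size class. */
void leak_track_alloc(int cls, uint64_t bytes) {
    assert(cls >= 0 && cls < NUM_CLASSES);
    atomic_fetch_add_explicit(&g_alloc_bytes[cls], bytes, memory_order_relaxed);
}

/* Records bytes returned to the backing allocator for one size class. */
void leak_track_free(int cls, uint64_t bytes) {
    assert(cls >= 0 && cls < NUM_CLASSES);
    atomic_fetch_add_explicit(&g_freed_bytes[cls], bytes, memory_order_relaxed);
}

/* Bytes allocated but not yet freed; clamped at 0 if the counters race. */
uint64_t leak_outstanding(int cls) {
    assert(cls >= 0 && cls < NUM_CLASSES);
    uint64_t a = atomic_load_explicit(&g_alloc_bytes[cls], memory_order_relaxed);
    uint64_t f = atomic_load_explicit(&g_freed_bytes[cls], memory_order_relaxed);
    return a >= f ? a - f : 0;
}

/* Returns 1 when the imbalance for a class exceeds the warning threshold. */
int leak_suspected(int cls) {
    return leak_outstanding(cls) > LEAK_WARN_BYTES;
}
```

Replaying the numbers from the OOM log (43,658 allocations of 1,048,576 bytes in class 2, no frees) leaves 45,778,731,008 bytes outstanding, far above the threshold, so a check like this would have flagged the imbalance long before mmap failed.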


7. Recommended Action Plan

Phase 1: Immediate Recovery (5 minutes)

  1. Disable Fast Path (Fix #1)
    • Verify 4.19M ops/s baseline restored
    • Confirm no OOM crashes

Phase 2: Quick Win (2-4 hours)

  1. Implement Fix #2 (Integrate Fast Path with Box Refactor)
    • Change tiny_fast_refill() to use hak_tiny_alloc()
    • Remove Fast Path from hak_alloc_at() (keep only in malloc())
    • Run A/B test: baseline vs integrated Fast Path
    • Success criteria: >4.5M ops/s (>7% improvement over baseline)

Phase 3: Root Cause Fix (1-2 weeks, OPTIONAL)

  1. Implement Fix #3 (Fix SuperSlab OOM)
    • Only if Fix #2 still shows OOM issues
    • Requires deep architectural changes
    • High risk, high reward

8. Test Plan

Test 1: Baseline Recovery

make clean
make BOX_REFACTOR_DEFAULT=1 TINY_FAST_PATH_DEFAULT=0 larson_hakmem
./larson_hakmem 10 8 128 1024 1 12345 4

Expected: 4.19M ops/s, no crashes

Test 2: Integrated Fast Path

# After implementing Fix #2
make clean
make BOX_REFACTOR_DEFAULT=1 TINY_FAST_PATH_DEFAULT=1 larson_hakmem
./larson_hakmem 10 8 128 1024 1 12345 4

Expected: >4.5M ops/s, no crashes, stats show refills working

Test 3: Fast Path Statistics

HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 10 8 128 1024 1 12345 4

Expected: Stats output at end (requires adding atexit() hook)


9. Key Takeaways

  1. Fast Path was never active - OOM prevented cache refills
  2. Double-layered allocation - Fast Path + Box Refactor created overhead
  3. 45 GB memory leak - Freed blocks not returning to SuperSlab
  4. Same bug as FAST_CAP_0 - TLS List / SuperSlab disconnect
  5. Easy fix available - Use Box Refactor as Fast Path backend

Confidence in Fix #2: 80% (leverages proven Box Refactor infrastructure)


10. References

  • FAST_CAP_0_SEGV_ROOT_CAUSE_ANALYSIS.md - Same OOM root cause
  • core/hakmem.c:682-740 - Double-layered allocation path
  • core/tiny_fastcache.c:41-84 - Failed refill implementation
  • bench_larson_hakmem_shim.c:8-25 - Larson special handling
  • Makefile:54-77 - Build flag conflicts

Analysis completed: 2025-11-05
Next step: Implement Fix #1 (disable Fast Path) for immediate recovery