Phase 9 LRU Architecture Issue - Root Cause Analysis

Date: 2025-11-14
Discovery: Task B-1 investigation
Impact: CRITICAL - Phase 9 lazy deallocation completely non-functional


Executive Summary

Phase 9's LRU cache for SuperSlab reuse is architecturally unreachable during normal operation: the TLS SLL fast path never decrements meta->used, so the meta->used == 0 condition that feeds the cache can never occur.

Result:

  • LRU cache never populated (0% utilization)
  • SuperSlabs never reused (100% mmap/munmap churn)
  • Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
  • Performance impact: -94% regression (9.38M → 563K ops/s)

Root Cause Chain

1. Free Path Architecture

Fast Path (95-99% of frees):

// core/tiny_free_fast_v2.inc.h (abridged)
void hak_tiny_free_fast_v2(void* ptr) {
    // class_idx and base are derived from ptr before this point
    tls_sll_push(class_idx, base);  // ← does NOT decrement meta->used
}

Slow Path (1-5% of frees):

// core/tiny_superslab_free.inc.h (abridged)
void tiny_free_local_box(/* ... */) {
    meta->used--;  // ← the ONLY place meta->used is decremented
}
}

2. The Accounting Gap

Physical reality: blocks freed to the TLS SLL (available for reuse)
Slab accounting: blocks still counted as "used" (meta->used unchanged)

Consequence: Slabs never appear empty → SuperSlabs never freed → LRU never used

3. Empty Detection Code Path

// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
    shared_pool_release_slab(ss, slab_idx);  // ← NEVER REACHED
}

// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
    superslab_free(ss);  // ← NEVER REACHED
}

// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
    int lru_cached = hak_ss_lru_push(ss);  // ← NEVER CALLED
}

4. Experimental Evidence

Test: bench_random_mixed_hakmem 200000 4096 1234567

Observations:

export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1

# Results (200K iterations):
[LRU_POP] class=X (miss): 877 times  ← LRU lookup attempts
[LRU_PUSH]: 0 times                   ← NEVER populated
[SS_FREE]: 0 times                    ← NEVER called
[SS_EMPTY]: 0 times                   ← meta->used never reached 0

Syscall Impact:

mmap:    3,241 calls (27.4% time)
munmap:  3,214 calls (47.4% time)
Total:   6,455 syscalls (74.8% time) ← Should be ~100 with LRU working
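
These counts can be reproduced with strace's syscall summary (the invocation mirrors the test above; the binary path may differ per build):

strace -c -e trace=mmap,munmap ./bench_random_mixed_hakmem 200000 4096 1234567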

Why This Happens

TLS SLL Design Rationale

Purpose: ultra-fast free path (3-5 instructions)
Tradeoff: no slab accounting updates

Lifecycle:

  1. Block allocated from slab: meta->used++
  2. Block freed to TLS SLL: meta->used UNCHANGED
  3. Block reallocated from TLS SLL: meta->used UNCHANGED
  4. Cycle repeats infinitely

Drain Behavior:

  • bench_random_mixed drain phase frees all blocks
  • But TLS SLL cleanup (hakmem_tiny_lifecycle.inc:162-170) drains to tls_list, NOT back to slabs
  • meta->used never decremented
  • Slabs never reported as empty

Benchmark Characteristics

bench_random_mixed.c:

  • Working set: 4,096 slots (random alloc/free)
  • Size range: 16-1040 bytes
  • Pattern: Blocks cycle through TLS SLL
  • Never reaches meta->used == 0 during main loop
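
A minimal sketch of that alloc/free pattern (illustrative, written against plain libc malloc/free; not the actual benchmark source):

#include <stdlib.h>

#define SLOTS 4096

int main(void) {
    void* slot[SLOTS] = {0};
    srand(1234567);

    for (long i = 0; i < 200000; i++) {
        int    idx  = rand() % SLOTS;
        size_t size = 16 + (size_t)(rand() % 1025);  // 16-1040 bytes
        free(slot[idx]);              // free(NULL) is a no-op on first touch
        slot[idx] = malloc(size);     // under hakmem: usually recycled via TLS SLL
    }

    // Drain phase: with hakmem, these frees still park blocks in the TLS SLL,
    // so meta->used never reaches 0 and no slab is ever reported empty.
    for (int i = 0; i < SLOTS; i++) free(slot[i]);
    return 0;
}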

Impact Analysis

Performance Regression

Metric         Phase 11 (before)   Current (after SEGV fix)   Change
Throughput     9.38M ops/s         563K ops/s                 -94%
mmap calls     ~800-900            3,241                      +260-305%
munmap calls   ~800-900            3,214                      +257-302%
LRU hits       expected high       0                          -100%

Root Causes:

  1. Primary (74.8% time): LRU not working → mmap/munmap churn
  2. Secondary (11.0% time): mincore() SEGV fix overhead

Design Validity

Phase 9 LRU Implementation: Functionally Correct

  • hak_ss_lru_push(): Works as designed
  • hak_ss_lru_pop(): Works as designed
  • Cache eviction: Works as designed

Phase 9 Architecture: Fundamentally Incompatible with TLS SLL fast path


Solution Options

Option A: Decrement meta->used in Fast Path

Approach: Modify tls_sll_push() to decrement meta->used

Problem:

  • Requires SuperSlab lookup (expensive)
  • Defeats fast path purpose (3-5 instructions → 50+ instructions)
  • Cache misses, branch mispredicts

Verdict: Not viable
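
For scale, here is roughly what the fast path would have to do under Option A; every identifier below is hypothetical, not a real hakmem API:

// Hypothetical Option A fast path (names illustrative only).
void tls_sll_push_with_accounting(int class_idx, void* base) {
    // ... the existing 3-5 instruction SLL push ...

    // The added accounting that defeats the fast path:
    SuperSlab* ss   = super_registry_lookup(base);   // registry walk, likely cache miss
    SlabMeta*  meta = slab_meta_for(ss, base);       // index math, second cache miss
    meta->used--;                                    // may race with other threads -> atomics
    if (meta->used == 0) {
        // empty-slab release logic would now sit on the hottest path
    }
}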


Option B: Periodic TLS SLL Drain ✅ (Recommended)

Approach:

  • Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
  • Decrement meta->used via tiny_free_local_box()
  • Allow slab empty detection

Implementation:

static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};

void tls_sll_push(int class_idx, void* base) {
    // Fast path: push to SLL
    // ... existing code ...

    // Periodic drain
    if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
        tls_sll_drain_to_slabs(class_idx);
        g_tls_sll_drain_counter[class_idx] = 0;
    }
}
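
One possible shape for the drain helper, reusing tiny_free_local_box() from the slow path above (a sketch, assuming a tls_sll_pop() primitive; argument lists are illustrative):

// Sketch: route cached blocks back through the accounting slow path.
static void tls_sll_drain_to_slabs(int class_idx) {
    void* base;
    while ((base = tls_sll_pop(class_idx)) != NULL) {
        // Decrements meta->used, enabling empty detection and LRU push.
        tiny_free_local_box(base, class_idx);
    }
}

A production version would likely drain only part of the list so the next allocations still hit a warm cache.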

Benefits:

  • Fast path stays fast (99.9% of frees)
  • Slow path drain (0.1% of frees) updates meta->used
  • Enables slab empty detection
  • LRU cache becomes functional

Expected Impact:

  • mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
  • Throughput: 563K → 8-10M ops/s (+1,300-1,700%)

Option C: Separate Accounting ⚠️

Approach: Track "logical used" (includes TLS SLL) vs "physical used"
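
For concreteness, the metadata might be split like this (illustrative only):

// Hypothetical dual-counter metadata for Option C.
typedef struct {
    uint32_t         used_physical;  // blocks handed out to the application
    _Atomic uint32_t used_logical;   // additionally counts blocks parked in TLS SLLs
} SlabMetaDualCount;
// Every SLL push/pop must atomically update used_logical from arbitrary
// threads, and the two counters can still drift out of sync.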

Problem:

  • Complex, error-prone
  • Atomic operations required (slow)
  • Hard to maintain consistency

Verdict: Not recommended


Option D: Accept Current Behavior

Approach: LRU cache only for shutdown/cleanup, not runtime

Problem:

  • Defeats Phase 9 purpose (lazy deallocation)
  • Leaves 74.8% syscall overhead unfixed
  • Performance remains -94% regressed

Verdict: Not acceptable


Recommendation

Implement Option B: Periodic TLS SLL Drain

Phase 12 Design

  1. Add drain trigger in tls_sll_push()

    • Every 1,024 frees (tunable via ENV)
    • Drain TLS SLL → slab freelist
    • Decrement meta->used properly
  2. Enable slab empty detection

    • meta->used == 0 now reachable
    • shared_pool_release_slab() called
    • superslab_free() → hak_ss_lru_push() called
  3. LRU cache becomes functional

    • SuperSlabs reused from cache
    • mmap/munmap reduced by 96-97%
    • Syscall overhead: 74.8% → ~5%

Expected Performance

Current:  563K ops/s (0.63% of System malloc)
After:    8-10M ops/s (9-11% of System malloc)
Gain:     +1,300-1,700%
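
Sanity check: 8M / 563K ≈ 14.2× (+1,320%) and 10M / 563K ≈ 17.8× (+1,680%), consistent with the +1,300-1,700% range.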

Remaining gap to System malloc (90M ops/s):

  • Still need +800-1,000% additional optimization
  • Focus areas: Front cache hit rate, branch prediction, cache locality

Action Items

  1. [URGENT] Implement TLS SLL periodic drain (Option B)
  2. [HIGH] Add ENV tuning: HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024 (see the sketch after this list)
  3. [HIGH] Re-measure with strace -c (expect -96% mmap/munmap)
  4. [MEDIUM] Fix prewarm crash (separate investigation)
  5. [MEDIUM] Document architectural tradeoff in design docs
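
For item 2, one way the interval could be read at startup; only the variable name HAKMEM_TLS_SLL_DRAIN_INTERVAL comes from this document, the rest is a sketch:

#include <stdint.h>
#include <stdlib.h>

// Sketch: resolve the drain interval once, defaulting to 1024.
static uint32_t tls_sll_drain_interval(void) {
    static uint32_t cached;                  // process-wide, computed lazily
    if (cached == 0) {
        const char* s = getenv("HAKMEM_TLS_SLL_DRAIN_INTERVAL");
        long v = s ? strtol(s, NULL, 10) : 0;
        cached = (v > 0) ? (uint32_t)v : 1024;
    }
    return cached;
}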

Lessons Learned

  1. Fast path optimizations can disable architectural features

    • TLS SLL fast path → LRU cache unreachable
    • Need periodic cleanup to restore functionality
  2. Accounting consistency is critical

    • meta->used must reflect true state
    • Buffering (TLS SLL) creates accounting gap
  3. Integration testing needed

    • Phase 9 LRU tested in isolation: Works
    • Phase 9 LRU + TLS SLL integration: Broken
    • Need end-to-end benchmarks
  4. Performance monitoring essential

    • LRU hit rate = 0% should have triggered alert
    • Syscall count regression should have been caught earlier

Files Involved

  • /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h - Fast path (no meta->used update)
  • /mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h - Slow path (meta->used--)
  • /mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c - Empty detection
  • /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c - superslab_free()
  • /mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c - LRU cache implementation

Conclusion

Phase 9 LRU cache is functionally correct but architecturally unreachable due to TLS SLL fast path not updating meta->used.

Fix: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.

Expected Impact: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)