Phase 9 LRU Architecture Issue - Root Cause Analysis

Date: 2025-11-14
Discovery: Task B-1 investigation
Impact: CRITICAL - Phase 9 lazy deallocation completely non-functional


Executive Summary

Phase 9's LRU cache for SuperSlab reuse is architecturally unreachable during normal operation: the TLS SLL fast path never decrements meta->used, so the meta->used == 0 condition that feeds the cache can never occur.

Result:

  • LRU cache never populated (0% utilization)
  • SuperSlabs never reused (100% mmap/munmap churn)
  • Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
  • Performance impact: -94% regression (9.38M → 563K ops/s)

Root Cause Chain

1. Free Path Architecture

Fast Path (95-99% of frees):

// core/tiny_free_fast_v2.inc.h (abridged)
void hak_tiny_free_fast_v2(void* ptr) {
    // class_idx and base are derived from ptr before this point
    tls_sll_push(class_idx, base);  // ← does NOT decrement meta->used
}

Slow Path (1-5% of frees):

// core/tiny_superslab_free.inc.h (abridged)
void tiny_free_local_box(/* ... */) {
    meta->used--;  // ← the ONLY place meta->used is decremented
}
}

2. The Accounting Gap

Physical reality: blocks freed to the TLS SLL (available for reuse)
Slab accounting: blocks still counted as "used" (meta->used unchanged)

Consequence: Slabs never appear empty → SuperSlabs never freed → LRU never used

3. Empty Detection Code Path

// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
    shared_pool_release_slab(ss, slab_idx);  // ← NEVER REACHED
}

// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
    superslab_free(ss);  // ← NEVER REACHED
}

// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
    int lru_cached = hak_ss_lru_push(ss);  // ← NEVER CALLED
}

4. Experimental Evidence

Test: bench_random_mixed_hakmem 200000 4096 1234567

Observations:

export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1

# Results (200K iterations):
[LRU_POP] class=X (miss): 877 times  ← LRU lookup attempts
[LRU_PUSH]: 0 times                   ← NEVER populated
[SS_FREE]: 0 times                    ← NEVER called
[SS_EMPTY]: 0 times                   ← meta->used never reached 0

Syscall Impact:

mmap:    3,241 calls (27.4% time)
munmap:  3,214 calls (47.4% time)
Total:   6,455 syscalls (74.8% time) ← Should be ~100 with LRU working
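
These counts can be reproduced with strace's syscall summary (the invocation mirrors the test above; the binary path may differ per build):

strace -c -e trace=mmap,munmap ./bench_random_mixed_hakmem 200000 4096 1234567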

Why This Happens

TLS SLL Design Rationale

Purpose: ultra-fast free path (3-5 instructions)
Tradeoff: no slab accounting updates

Lifecycle:

  1. Block allocated from slab: meta->used++
  2. Block freed to TLS SLL: meta->used UNCHANGED
  3. Block reallocated from TLS SLL: meta->used UNCHANGED
  4. Cycle repeats infinitely

Drain Behavior:

  • bench_random_mixed drain phase frees all blocks
  • But TLS SLL cleanup (hakmem_tiny_lifecycle.inc:162-170) drains to tls_list, NOT back to slabs
  • meta->used never decremented
  • Slabs never reported as empty

Benchmark Characteristics

bench_random_mixed.c:

  • Working set: 4,096 slots (random alloc/free)
  • Size range: 16-1040 bytes
  • Pattern: Blocks cycle through TLS SLL
  • Never reaches meta->used == 0 during main loop
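
A minimal sketch of that alloc/free pattern (illustrative, written against plain libc malloc/free; not the actual benchmark source):

#include <stdlib.h>

#define SLOTS 4096

int main(void) {
    void* slot[SLOTS] = {0};
    srand(1234567);

    for (long i = 0; i < 200000; i++) {
        int    idx  = rand() % SLOTS;
        size_t size = 16 + (size_t)(rand() % 1025);  // 16-1040 bytes
        free(slot[idx]);              // free(NULL) is a no-op on first touch
        slot[idx] = malloc(size);     // under hakmem: usually recycled via TLS SLL
    }

    // Drain phase: with hakmem, these frees still park blocks in the TLS SLL,
    // so meta->used never reaches 0 and no slab is ever reported empty.
    for (int i = 0; i < SLOTS; i++) free(slot[i]);
    return 0;
}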

Impact Analysis

Performance Regression

Metric         Phase 11 (before)   Current (after SEGV fix)   Change
Throughput     9.38M ops/s         563K ops/s                 -94%
mmap calls     ~800-900            3,241                      +260-305%
munmap calls   ~800-900            3,214                      +257-302%
LRU hits       expected high       0                          -100%

Root Causes:

  1. Primary (74.8% time): LRU not working → mmap/munmap churn
  2. Secondary (11.0% time): mincore() SEGV fix overhead

Design Validity

Phase 9 LRU Implementation: Functionally Correct

  • hak_ss_lru_push(): Works as designed
  • hak_ss_lru_pop(): Works as designed
  • Cache eviction: Works as designed

Phase 9 Architecture: Fundamentally Incompatible with TLS SLL fast path


Solution Options

Option A: Decrement meta->used in Fast Path

Approach: Modify tls_sll_push() to decrement meta->used

Problem:

  • Requires SuperSlab lookup (expensive)
  • Defeats fast path purpose (3-5 instructions → 50+ instructions)
  • Cache misses, branch mispredicts

Verdict: Not viable
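
For scale, here is roughly what the fast path would have to do under Option A; every identifier below is hypothetical, not a real hakmem API:

// Hypothetical Option A fast path (names illustrative only).
void tls_sll_push_with_accounting(int class_idx, void* base) {
    // ... the existing 3-5 instruction SLL push ...

    // The added accounting that defeats the fast path:
    SuperSlab* ss   = super_registry_lookup(base);   // registry walk, likely cache miss
    SlabMeta*  meta = slab_meta_for(ss, base);       // index math, second cache miss
    meta->used--;                                    // may race with other threads -> atomics
    if (meta->used == 0) {
        // empty-slab release logic would now sit on the hottest path
    }
}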


Option B: Periodic TLS SLL Drain ✅ (Recommended)

Approach:

  • Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
  • Decrement meta->used via tiny_free_local_box()
  • Allow slab empty detection

Implementation:

static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};

void tls_sll_push(int class_idx, void* base) {
    // Fast path: push to SLL
    // ... existing code ...

    // Periodic drain
    if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
        tls_sll_drain_to_slabs(class_idx);
        g_tls_sll_drain_counter[class_idx] = 0;
    }
}
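
One possible shape for the drain helper, reusing tiny_free_local_box() from the slow path above (a sketch, assuming a tls_sll_pop() primitive; argument lists are illustrative):

// Sketch: route cached blocks back through the accounting slow path.
static void tls_sll_drain_to_slabs(int class_idx) {
    void* base;
    while ((base = tls_sll_pop(class_idx)) != NULL) {
        // Decrements meta->used, enabling empty detection and LRU push.
        tiny_free_local_box(base, class_idx);
    }
}

A production version would likely drain only part of the list so the next allocations still hit a warm cache.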

Benefits:

  • Fast path stays fast (99.9% of frees)
  • Slow path drain (0.1% of frees) updates meta->used
  • Enables slab empty detection
  • LRU cache becomes functional

Expected Impact:

  • mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
  • Throughput: 563K → 8-10M ops/s (+1,300-1,700%)

Option C: Separate Accounting ⚠️

Approach: Track "logical used" (includes TLS SLL) vs "physical used"
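
For concreteness, the metadata might be split like this (illustrative only):

// Hypothetical dual-counter metadata for Option C.
typedef struct {
    uint32_t         used_physical;  // blocks handed out to the application
    _Atomic uint32_t used_logical;   // additionally counts blocks parked in TLS SLLs
} SlabMetaDualCount;
// Every SLL push/pop must atomically update used_logical from arbitrary
// threads, and the two counters can still drift out of sync.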

Problem:

  • Complex, error-prone
  • Atomic operations required (slow)
  • Hard to maintain consistency

Verdict: Not recommended


Option D: Accept Current Behavior

Approach: LRU cache only for shutdown/cleanup, not runtime

Problem:

  • Defeats Phase 9 purpose (lazy deallocation)
  • Leaves 74.8% syscall overhead unfixed
  • Performance remains -94% regressed

Verdict: Not acceptable


Recommendation

Implement Option B: Periodic TLS SLL Drain

Phase 12 Design

  1. Add drain trigger in tls_sll_push()

    • Every 1,024 frees (tunable via ENV)
    • Drain TLS SLL → slab freelist
    • Decrement meta->used properly
  2. Enable slab empty detection

    • meta->used == 0 now reachable
    • shared_pool_release_slab() called
    • superslab_free() → hak_ss_lru_push() called
  3. LRU cache becomes functional

    • SuperSlabs reused from cache
    • mmap/munmap reduced by 96-97%
    • Syscall overhead: 74.8% → ~5%

Expected Performance

Current:  563K ops/s (0.63% of System malloc)
After:    8-10M ops/s (9-11% of System malloc)
Gain:     +1,300-1,700%
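
Sanity check: 8M / 563K ≈ 14.2× (+1,320%) and 10M / 563K ≈ 17.8× (+1,680%), consistent with the +1,300-1,700% range.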

Remaining gap to System malloc (90M ops/s):

  • Still need +800-1,000% additional optimization
  • Focus areas: Front cache hit rate, branch prediction, cache locality

Action Items

  1. [URGENT] Implement TLS SLL periodic drain (Option B)
  2. [HIGH] Add ENV tuning: HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024 (see the sketch after this list)
  3. [HIGH] Re-measure with strace -c (expect -96% mmap/munmap)
  4. [MEDIUM] Fix prewarm crash (separate investigation)
  5. [MEDIUM] Document architectural tradeoff in design docs
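
For item 2, one way the interval could be read at startup; only the variable name HAKMEM_TLS_SLL_DRAIN_INTERVAL comes from this document, the rest is a sketch:

#include <stdint.h>
#include <stdlib.h>

// Sketch: resolve the drain interval once, defaulting to 1024.
static uint32_t tls_sll_drain_interval(void) {
    static uint32_t cached;                  // process-wide, computed lazily
    if (cached == 0) {
        const char* s = getenv("HAKMEM_TLS_SLL_DRAIN_INTERVAL");
        long v = s ? strtol(s, NULL, 10) : 0;
        cached = (v > 0) ? (uint32_t)v : 1024;
    }
    return cached;
}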

Lessons Learned

  1. Fast path optimizations can disable architectural features

    • TLS SLL fast path → LRU cache unreachable
    • Need periodic cleanup to restore functionality
  2. Accounting consistency is critical

    • meta->used must reflect true state
    • Buffering (TLS SLL) creates accounting gap
  3. Integration testing needed

    • Phase 9 LRU tested in isolation: Works
    • Phase 9 LRU + TLS SLL integration: Broken
    • Need end-to-end benchmarks
  4. Performance monitoring essential

    • LRU hit rate = 0% should have triggered alert
    • Syscall count regression should have been caught earlier

Files Involved

  • /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h - Fast path (no meta->used update)
  • /mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h - Slow path (meta->used--)
  • /mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c - Empty detection
  • /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c - superslab_free()
  • /mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c - LRU cache implementation

Conclusion

Phase 9 LRU cache is functionally correct but architecturally unreachable due to TLS SLL fast path not updating meta->used.

Fix: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.

Expected Impact: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)