Phase 9 LRU Architecture Issue - Root Cause Analysis
Date: 2025-11-14
Discovery: Task B-1 Investigation
Impact: ❌ CRITICAL - Phase 9 Lazy Deallocation completely non-functional
Executive Summary
Phase 9 LRU cache for SuperSlab reuse is architecturally unreachable during normal operation due to TLS SLL fast path preventing meta->used == 0 condition.
Result:
- LRU cache never populated (0% utilization)
- SuperSlabs never reused (100% mmap/munmap churn)
- Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
- Performance impact: -94% regression (9.38M → 563K ops/s)
Root Cause Chain
1. Free Path Architecture
Fast Path (95-99% of frees):
// core/tiny_free_fast_v2.inc.h
hak_tiny_free_fast_v2(ptr) {
tls_sll_push(class_idx, base); // ← Does NOT decrement meta->used
}
Slow Path (1-5% of frees):
// core/tiny_superslab_free.inc.h
tiny_free_local_box() {
meta->used--; // ← ONLY here is meta->used decremented
}
2. The Accounting Gap
Physical Reality: Blocks freed to TLS SLL (available for reuse)
Slab Accounting: Blocks still counted as "used" (meta->used unchanged)
Consequence: Slabs never appear empty → SuperSlabs never freed → LRU never used
3. Empty Detection Code Path
// core/tiny_superslab_free.inc.h:211 (local free)
if (meta->used == 0) {
shared_pool_release_slab(ss, slab_idx); // ← NEVER REACHED
}
// core/hakmem_shared_pool.c:298
if (ss->active_slabs == 0) {
superslab_free(ss); // ← NEVER REACHED
}
// core/hakmem_tiny_superslab.c:1016
void superslab_free(SuperSlab* ss) {
int lru_cached = hak_ss_lru_push(ss); // ← NEVER CALLED
}
4. Experimental Evidence
Test: bench_random_mixed_hakmem 200000 4096 1234567
Observations:
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_FREE_DEBUG=1
# Results (200K iterations):
[LRU_POP] class=X (miss): 877 times ← LRU lookup attempts
[LRU_PUSH]: 0 times ← NEVER populated
[SS_FREE]: 0 times ← NEVER called
[SS_EMPTY]: 0 times ← meta->used never reached 0
Syscall Impact:
mmap: 3,241 calls (27.4% time)
munmap: 3,214 calls (47.4% time)
Total: 6,455 syscalls (74.8% time) ← Should be ~100 with LRU working
Why This Happens
TLS SLL Design Rationale
Purpose: Ultra-fast free path (3-5 instructions)
Tradeoff: No slab accounting updates
Lifecycle:
- Block allocated from slab: meta->used++
- Block freed to TLS SLL: meta->used UNCHANGED
- Block reallocated from TLS SLL: meta->used UNCHANGED
- Cycle repeats indefinitely
Drain Behavior:
- bench_random_mixed drain phase frees all blocks
- But TLS SLL cleanup (hakmem_tiny_lifecycle.inc:162-170) drains to tls_list, NOT back to slabs
- meta->used never decremented
- Slabs never reported as empty
Benchmark Characteristics
bench_random_mixed.c:
- Working set: 4,096 slots (random alloc/free)
- Size range: 16-1040 bytes
- Pattern: Blocks cycle through TLS SLL
- Never reaches meta->used == 0 during main loop
Impact Analysis
Performance Regression
| Metric | Phase 11 (Before) | Current (After SEGV Fix) | Change |
|---|---|---|---|
| Throughput | 9.38M ops/s | 563K ops/s | -94% |
| mmap calls | ~800-900 | 3,241 | +260-305% |
| munmap calls | ~800-900 | 3,214 | +257-302% |
| LRU hits | Expected high | 0 | -100% |
Root Causes:
- Primary (74.8% time): LRU not working → mmap/munmap churn
- Secondary (11.0% time): mincore() SEGV fix overhead
Design Validity
Phase 9 LRU Implementation: ✅ Functionally Correct
- hak_ss_lru_push(): Works as designed
- hak_ss_lru_pop(): Works as designed
- Cache eviction: Works as designed
Phase 9 Architecture: ❌ Fundamentally Incompatible with TLS SLL fast path
Solution Options
Option A: Decrement meta->used in Fast Path ❌
Approach: Modify tls_sll_push() to decrement meta->used
Problem:
- Requires SuperSlab lookup (expensive)
- Defeats fast path purpose (3-5 instructions → 50+ instructions)
- Cache misses, branch mispredicts
Verdict: Not viable
Option B: Periodic TLS SLL Drain to Slabs ✅ RECOMMENDED
Approach:
- Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
- Decrement meta->used via tiny_free_local_box()
- Allow slab empty detection
Implementation:
static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};
void tls_sll_push(int class_idx, void* base) {
// Fast path: push to SLL
// ... existing code ...
// Periodic drain
if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
tls_sll_drain_to_slabs(class_idx);
g_tls_sll_drain_counter[class_idx] = 0;
}
}
Benefits:
- Fast path stays fast (99.9% of frees)
- Slow path drain (0.1% of frees) updates meta->used
- Enables slab empty detection
- LRU cache becomes functional
Expected Impact:
- mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
- Throughput: 563K → 8-10M ops/s (+1,300-1,700%)
Option C: Separate Accounting ⚠️
Approach: Track "logical used" (includes TLS SLL) vs "physical used"
Problem:
- Complex, error-prone
- Atomic operations required (slow)
- Hard to maintain consistency
Verdict: Not recommended
Option D: Accept Current Behavior ❌
Approach: LRU cache only for shutdown/cleanup, not runtime
Problem:
- Defeats Phase 9 purpose (lazy deallocation)
- Leaves 74.8% syscall overhead unfixed
- Performance remains -94% regressed
Verdict: Not acceptable
Recommendation
Implement Option B: Periodic TLS SLL Drain
Phase 12 Design
1. Add drain trigger in tls_sll_push()
   - Every 1,024 frees (tunable via ENV)
   - Drain TLS SLL → slab freelist
   - Decrement meta->used properly
2. Enable slab empty detection
   - meta->used == 0 now reachable
   - shared_pool_release_slab() called
   - superslab_free() → hak_ss_lru_push() called
3. LRU cache becomes functional
   - SuperSlabs reused from cache
   - mmap/munmap reduced by 96-97%
   - Syscall overhead: 74.8% → ~5%
Expected Performance
Current: 563K ops/s (0.63% of System malloc)
After: 8-10M ops/s (9-11% of System malloc)
Gain: +1,300-1,700%
Remaining gap to System malloc (90M ops/s):
- Still need +800-1,000% additional optimization
- Focus areas: Front cache hit rate, branch prediction, cache locality
Action Items
- [URGENT] Implement TLS SLL periodic drain (Option B)
- [HIGH] Add ENV tuning: HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024
- [HIGH] Re-measure with strace -c (expect -96% mmap/munmap)
- [MEDIUM] Fix prewarm crash (separate investigation)
- [MEDIUM] Document architectural tradeoff in design docs
Lessons Learned
1. Fast path optimizations can disable architectural features
   - TLS SLL fast path → LRU cache unreachable
   - Need periodic cleanup to restore functionality
2. Accounting consistency is critical
   - meta->used must reflect true state
   - Buffering (TLS SLL) creates an accounting gap
3. Integration testing needed
   - Phase 9 LRU tested in isolation: ✅ Works
   - Phase 9 LRU + TLS SLL integration: ❌ Broken
   - Need end-to-end benchmarks
4. Performance monitoring essential
   - LRU hit rate = 0% should have triggered an alert
   - Syscall count regression should have been caught earlier
Files Involved
- /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h - Fast path (no meta->used update)
- /mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h - Slow path (meta->used--)
- /mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c - Empty detection
- /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c - superslab_free()
- /mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c - LRU cache implementation
Conclusion
Phase 9 LRU cache is functionally correct but architecturally unreachable due to TLS SLL fast path not updating meta->used.
Fix: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.
Expected Impact: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)