# TLS Freelist Cache Decision Tree

**Context**: Phase 6.12.1 Step 2 showed a +42% regression in single-threaded workloads
**Question**: Should we keep TLS for its multi-threaded benefit?

---

## Decision Tree

```
START: TLS showed +42% single-threaded regression
│
├─ Question 1: Is this regression ONLY due to TLS?
│  │
│  ├─ NO → Re-measure TLS in isolation (revert Slab Registry)
│  │  └─ Action: Revert Step 2 (Slab Registry), keep Step 1 (SlabTag removal)
│  │     └─ Re-run benchmarks → Measure true TLS overhead
│  │        ├─ If still +40% → Continue to Question 2
│  │        └─ If <10% → TLS is OK, problem was Slab Registry
│  │
│  └─ YES (TLS overhead confirmed) → Continue to Question 2
│
├─ Question 2: What is the TLS overhead in cycles?
│  │
│  ├─ Measured: 3,116 ns = ~9,000 cycles @ 3 GHz
│  │
│  └─ Analysis: This is TOO HIGH for just a TLS cache lookup
│     └─ Root-cause candidates:
│        ├─ Slab Registry hash overhead (likely culprit)
│        ├─ TLS cache miss rate (cache too small or bad eviction)
│        └─ Indirect-call overhead (function pointer for free routing)
│           └─ Action: Profile with perf to isolate
│
├─ Question 3: Run mimalloc-bench multi-threaded tests (Phase 6.13)
│  │
│  ├─ Test: larson 4-thread, larson 16-thread
│  │
│  └─ Results analysis:
│     │
│     ├─ Scenario A: 4-thread benefit > 20% AND 16-thread benefit > 40%
│     │  └─ Decision: ✅ KEEP TLS
│     │     └─ Rationale: Multi-threaded benefit outweighs single-threaded cost
│     │        └─ Next: Phase 6.14 (expand benchmarks)
│     │
│     ├─ Scenario B: 4-thread benefit 10-20% OR 16-thread benefit 20-40%
│     │  └─ Decision: ⚠️ MAKE CONDITIONAL
│     │     └─ Implementation: Compile-time flag HAKMEM_MULTITHREAD
│     │        ├─ Usage: make HAKMEM_MULTITHREAD=1 (for multi-threaded apps)
│     │        │         make HAKMEM_MULTITHREAD=0 (for single-threaded apps)
│     │        └─ Next: Phase 6.14 (expand benchmarks)
│     │
│     └─ Scenario C: 4-thread benefit < 10% (unlikely)
│        └─ Decision: ❌ REVERT TLS
│           └─ Rationale: Site Rules already reduce contention, TLS adds no value
│              └─ Next: Phase 6.16 (fix Tiny Pool instead)
│
└─ Question 4: hakmem-specific factors
   │
   ├─ Site Rules already reduce lock contention by ~60%
   │  └─ TLS benefit is LESS for hakmem than for mimalloc/jemalloc
   │     └─ Expected TLS benefit: 100-300 cycles (vs 245-720 for mimalloc)
   │
   └─ Break-even analysis:
      ├─ TLS overhead: 20-40 cycles (best case)
      ├─ TLS benefit: 100-300 cycles (@ 70% cache hit rate, 4+ threads)
      └─ Break-even: 2-4 threads (depends on contention level)
         └─ Conclusion: TLS should WIN at 4+ threads, but the margin is smaller
```

---

## Quantitative Analysis

### Single-Threaded Overhead (Measured)

| Metric | Value | Source |
|--------|-------|--------|
| Before TLS | 7,355 ns/op | Phase 6.12.1 Step 1 |
| After TLS | 10,471 ns/op | Phase 6.12.1 Step 2 |
| Regression | +3,116 ns/op (+42.4%) | Calculated |
| Cycles (@ 3 GHz) | ~9,000 cycles | Estimated |

**Analysis**: 9,000 cycles is far too high for a TLS cache lookup (expected: 20-40 cycles). Likely cause: **Slab Registry hash overhead** (the Step 2 change).

**Action**: Re-measure TLS in isolation (revert the Slab Registry, keep only TLS).

---

### Multi-Threaded Benefit (Estimated)

| Metric | mimalloc (baseline) | hakmem (with Site Rules) |
|--------|---------------------|--------------------------|
| Lock contention cost | 350-800 cycles | 140-320 cycles (-60%) |
| TLS cache hit rate | 70-90% | 70-90% (same) |
| Cycles saved per hit | 245-720 cycles | 100-300 cycles |
| TLS overhead | 20-40 cycles | 20-40 cycles (same) |
| **Net benefit** | 205-680 cycles | 60-260 cycles |

**Break-even point**:
- mimalloc: 2 threads (TLS overhead 20-40c < benefit 205-680c)
- hakmem: 2-4 threads (TLS overhead 20-40c < benefit 60-260c)

**Conclusion**: TLS is LESS valuable for hakmem, but still beneficial at 4+ threads.
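To make the fast path in the analysis above concrete, here is a minimal sketch of a per-thread freelist cache of the kind being weighed. All names (`tls_cache_t`, `tls_cache_pop`, the size-class layout) are illustrative assumptions for this sketch, not the actual hakmem implementation; the point is that a hit costs only a TLS access plus a pointer pop, which is where the 20-40 cycle estimate comes from.

```c
#include <stddef.h>
#include <stdlib.h>
#include <assert.h>

#define TLS_NUM_CLASSES 8   /* hypothetical power-of-two classes: 16B..2048B */

typedef struct tls_node { struct tls_node *next; } tls_node_t;

typedef struct {
    tls_node_t *head[TLS_NUM_CLASSES];  /* per-class singly linked freelists */
} tls_cache_t;

/* One cache per thread: no locks, no atomics on the hot path. */
static __thread tls_cache_t tls_cache;

/* Map a request size to a class index (16, 32, 64, ... bytes). */
static int tls_class_of(size_t size) {
    int c = 0;
    size_t cap = 16;
    while (cap < size && c < TLS_NUM_CLASSES - 1) { cap <<= 1; c++; }
    return c;
}

/* Fast path: pop from the calling thread's freelist.
   On a hit this is one TLS base-register access plus two pointer moves. */
void *tls_cache_pop(size_t size) {
    int c = tls_class_of(size);
    tls_node_t *n = tls_cache.head[c];
    if (!n) return NULL;           /* miss: caller falls to the slow path */
    tls_cache.head[c] = n->next;
    return n;
}

/* Free fast path: push the block back onto this thread's freelist. */
void tls_cache_push(void *ptr, size_t size) {
    tls_node_t *n = (tls_node_t *)ptr;
    int c = tls_class_of(size);
    n->next = tls_cache.head[c];
    tls_cache.head[c] = n;
}
```

A real allocator additionally needs a capacity bound and an eviction/flush path back to the shared pool; the miss-rate and eviction concerns listed under Question 2 live exactly there.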
---

## Recommendation Flow

### Step 1: Re-measure TLS overhead in isolation (2 hours)

```bash
# Revert Slab Registry (Step 2), keep SlabTag removal (Step 1)
git diff HEAD~1 hakmem_tiny.c         # Review Step 2 changes
git revert <step-2-commit-sha>        # Revert Step 2 only

# Re-run benchmarks
make bench_allocators_hakmem
./bench_allocators_hakmem

# Expected result:
# - If overhead drops to <10%: TLS is OK, problem was Slab Registry
# - If overhead remains ~40%: TLS itself is the problem
```

**Decision**:
- If overhead < 10% → Keep TLS, skip Step 2 (Slab Registry)
- If overhead > 20% → Proceed to Step 2 (multi-threaded validation)

---

### Step 2: Multi-threaded validation (3-5 hours)

```bash
# Phase 6.13: mimalloc-bench integration
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh

# Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make shared

# Run multi-threaded benchmarks
cd /tmp/mimalloc-bench
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```

**Decision**:
- If 4-thread benefit > 20% → ✅ KEEP TLS (Scenario A)
- If 4-thread benefit 10-20% → ⚠️ CONDITIONAL TLS (Scenario B)
- If 4-thread benefit < 10% → ❌ REVERT TLS (Scenario C)

---

### Step 3: Implementation choice

#### ✅ Scenario A: Keep TLS (expected)

```c
// No changes needed
// Continue to Phase 6.14 (expand benchmarks)
```

#### ⚠️ Scenario B: Conditional TLS

```c
// Add a compile-time flag
// hakmem.h:
#ifdef HAKMEM_MULTITHREAD
#define TLS_CACHE_ENABLED 1
#else
#define TLS_CACHE_ENABLED 0
#endif

// hakmem_tiny.c:
#if TLS_CACHE_ENABLED
__thread struct tls_cache_t tls_cache;
#endif

void* hak_tiny_alloc(size_t size) {
#if TLS_CACHE_ENABLED
    // TLS fast path
    void* ptr = tls_cache_lookup(size);
    if (ptr) return ptr;
#endif
    // Slow path (always available)
    return hak_tiny_alloc_slow(size);
}
```

**Usage**:

```bash
# Single-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=0"

# Multi-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=1"
```

#### ❌ Scenario C: Revert TLS

```bash
# Revert Phase 6.12.1 Step 2 completely
git revert <step-2-commit-sha>

# Continue to Phase 6.16 (fix Tiny Pool via Option B)
```

---

## Expected Outcome

### Most Likely: Scenario A (Keep TLS)

**Evidence**:
1. TLS is standard practice in mimalloc/jemalloc/tcmalloc
2. Multi-threaded workloads are common (web servers, databases)
3. Some single-threaded overhead is expected (20-40 cycles)
4. The 9,000-cycle regression is likely due to the Slab Registry, not TLS

**Next steps**:
1. Re-measure TLS in isolation (confirm <10% overhead)
2. Run mimalloc-bench (validate >20% multi-threaded benefit)
3. Keep TLS, proceed to Phase 6.14

---

### Alternative: Scenario B (Conditional TLS)

**If**:
- Multi-threaded benefit is marginal (10-20% at 4 threads)
- Single-threaded regression remains (even after the Slab Registry revert)

**Then**:
- Implement the compile-time flag HAKMEM_MULTITHREAD
- Provide two build configurations
- Document the trade-offs (single- vs multi-threaded)

---

## Timeline

| Step | Duration | Action |
|------|----------|--------|
| 1 | 2 hours | Re-measure TLS in isolation |
| 2 | 3-5 hours | mimalloc-bench multi-threaded validation |
| 3 | 1 hour | Analyze results + make decision |
| 4a | 0 hours | Keep TLS (no changes) |
| 4b | 4 hours | Implement conditional TLS (if needed) |
| 4c | 1 hour | Revert TLS (if needed) |

**Total**: 6-13 hours (depending on the outcome)

---

## Risk Mitigation

### Risk 1: Re-measurement shows TLS overhead is still high (>20%)

**Mitigation**: Profile with perf to identify the root cause

```bash
perf record -g ./bench_allocators_hakmem
perf report --stdio | grep -E 'tls|FS'
```

### Risk 2: Multi-threaded benchmarks show no TLS benefit

**Mitigation**: Check whether Site Rules are so effective that contention is already eliminated

```bash
# Disable Site Rules temporarily
export HAKMEM_SITE_RULES=0
./bench_allocators_hakmem

# If performance drops → Site Rules are effective
# If performance stays the same → Site Rules are not helping
```

### Risk 3: Conditional TLS adds maintenance burden

**Mitigation**: Use runtime detection instead (check the thread count)

```c
// Runtime detection (auto-enable TLS if threads > 1)
static int tls_enabled = 0;

void hak_init(void) {
    int num_threads = get_num_threads();  // placeholder: not yet implemented
    tls_enabled = (num_threads > 1);
}
```

**Trade-off**: runtime overhead (one branch per allocation) vs compile-time maintenance burden

---

**End of Decision Tree**

This document provides a structured decision-making process for the TLS Freelist Cache question. The recommended approach: re-measure TLS overhead in isolation (Step 1), validate the multi-threaded benefit with mimalloc-bench (Step 2), then make a data-driven decision (Step 3).
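One caveat on the Risk 3 mitigation: an init-time thread count misses threads created after `hak_init()`. A hedged alternative sketch, using names invented here (`hak_detect_multithread`, `hak_tls_enabled`), flips a flag the first time a second thread enters the allocator. It assumes `pthread_t` converts to a distinct integer per thread, which holds on glibc/Linux but is not guaranteed by POSIX.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static atomic_uintptr_t first_thread_id = 0;  /* id of the first caller seen */
static atomic_int tls_wanted = 0;             /* set once a 2nd thread shows up */

/* Call on each allocation; the common single-threaded path costs one
   relaxed load plus a compare. */
static void hak_detect_multithread(void) {
    uintptr_t self = (uintptr_t)pthread_self();  /* glibc: pthread_t is integral */
    uintptr_t seen = atomic_load_explicit(&first_thread_id, memory_order_relaxed);
    if (seen == 0) {
        /* Publish the first caller's id. If two threads race here, the
           loser sees the winner's id on its next call and sets the flag. */
        atomic_compare_exchange_strong(&first_thread_id, &seen, self);
    } else if (seen != self) {
        atomic_store_explicit(&tls_wanted, 1, memory_order_relaxed);
    }
}

int hak_tls_enabled(void) {
    return atomic_load_explicit(&tls_wanted, memory_order_relaxed);
}

/* Helper for exercising the detector from a second thread. */
static void *detect_worker(void *arg) {
    (void)arg;
    hak_detect_multithread();
    return NULL;
}
```

This keeps a single build configuration while only paying the TLS bookkeeping once a second thread actually appears, at the cost of the per-allocation branch noted in the trade-off above.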