TLS Freelist Cache Decision Tree
Context: Phase 6.12.1 Step 2 showed a +42% regression in single-threaded workloads.
Question: Should we keep TLS for its multi-threaded benefit?
Decision Tree
START: TLS showed +42% single-threaded regression
│
├─ Question 1: Is this regression ONLY due to TLS?
│ │
│ ├─ NO → Re-measure TLS in isolation (revert Slab Registry)
│ │ └─ Action: Revert Step 2 (Slab Registry), keep Step 1 (SlabTag removal)
│ │ └─ Re-run benchmarks → Measure true TLS overhead
│ │ └─ If still +40% → Continue to Question 2
│ │ └─ If <10% → TLS is OK, problem was Slab Registry
│ │
│ └─ YES (TLS overhead confirmed) → Continue to Question 2
│
├─ Question 2: What is TLS overhead in cycles?
│ │
│ ├─ Measure: 3,116 ns = ~9,000 cycles @ 3GHz
│ │
│ └─ Analysis: This is TOO HIGH for just TLS cache lookup
│ └─ Root cause candidates:
│ ├─ Slab Registry hash overhead (likely culprit)
│ ├─ TLS cache miss rate (cache too small or bad eviction)
│ └─ Indirect call overhead (function pointer for free routing)
│ └─ Action: Profile with perf to isolate
│
├─ Question 3: Run mimalloc-bench multi-threaded tests (Phase 6.13)
│ │
│ ├─ Test: larson 4-thread, larson 16-thread
│ │
│ └─ Results analysis:
│ │
│ ├─ Scenario A: 4-thread benefit > 20% AND 16-thread benefit > 40%
│ │ └─ Decision: ✅ KEEP TLS
│ │ └─ Rationale: Multi-threaded benefit outweighs single-threaded cost
│ │ └─ Next: Phase 6.14 (expand benchmarks)
│ │
│ ├─ Scenario B: 4-thread benefit 10-20% OR 16-thread benefit 20-40%
│ │ └─ Decision: ⚠️ MAKE CONDITIONAL
│ │ └─ Implementation: Compile-time flag HAKMEM_MULTITHREAD
│ │ └─ Usage: make CFLAGS="-DHAKMEM_MULTITHREAD=1" (for multi-threaded apps)
│ │ make CFLAGS="-DHAKMEM_MULTITHREAD=0" (for single-threaded apps)
│ │ └─ Next: Phase 6.14 (expand benchmarks)
│ │
│ └─ Scenario C: 4-thread benefit < 10% (unlikely)
│ └─ Decision: ❌ REVERT TLS
│ └─ Rationale: Site Rules already reduce contention, TLS adds no value
│ └─ Next: Phase 6.16 (fix Tiny Pool instead)
│
└─ Question 4: hakmem-specific factors
│
├─ Site Rules already reduce lock contention by ~60%
│ └─ TLS benefit is LESS for hakmem than mimalloc/jemalloc
│ └─ Expected TLS benefit: 100-300 cycles (vs 245-720 for mimalloc)
│
└─ Break-even analysis:
├─ TLS overhead: 20-40 cycles (best case)
├─ TLS benefit: 100-300 cycles (@ 70% cache hit rate, 4+ threads)
└─ Break-even: 2-4 threads (depends on contention level)
└─ Conclusion: TLS should WIN at 4+ threads, but margin is smaller
Quantitative Analysis
Single-Threaded Overhead (Measured)
| Metric | Value | Source |
|---|---|---|
| Before TLS | 7,355 ns/op | Phase 6.12.1 Step 1 |
| After TLS | 10,471 ns/op | Phase 6.12.1 Step 2 |
| Regression | +3,116 ns/op (+42.4%) | Calculated |
| Cycles (@ 3 GHz) | ~9,000 cycles | Estimated |
Analysis: 9,000 cycles is TOO HIGH for TLS cache lookup (expected: 20-40 cycles). Likely cause: Slab Registry hash overhead (Step 2 change).
Action: Re-measure TLS in isolation (revert Slab Registry, keep only TLS).
Multi-Threaded Benefit (Estimated)
| Metric | mimalloc (baseline) | hakmem (with Site Rules) |
|---|---|---|
| Lock contention cost | 350-800 cycles | 140-320 cycles (-60%) |
| TLS cache hit rate | 70-90% | 70-90% (same) |
| Cycles saved per hit | 245-720 cycles | 100-300 cycles |
| TLS overhead | 20-40 cycles | 20-40 cycles (same) |
| Net benefit | 205-680 cycles | 60-260 cycles |
Break-even point:
- mimalloc: 2 threads (TLS overhead 20-40c < benefit 205-680c)
- hakmem: 2-4 threads (TLS overhead 20-40c < benefit 60-260c)
Conclusion: TLS is LESS valuable for hakmem, but still beneficial at 4+ threads.
Recommendation Flow
Step 1: Re-measure TLS overhead in isolation (2 hours)
# Revert Slab Registry (Step 2), keep SlabTag removal (Step 1)
git diff HEAD~1 hakmem_tiny.c # Review Step 2 changes
git revert <commit-hash> # Revert Step 2 only
# Re-run benchmarks
make bench_allocators_hakmem
./bench_allocators_hakmem
# Expected result:
# - If overhead drops to <10%: TLS is OK, problem was Slab Registry
# - If overhead remains ~40%: TLS itself is the problem
Decision:
- If overhead < 10% → Keep TLS, skip Step 2 (Slab Registry)
- If overhead > 20% → Proceed to Step 2 (multi-threaded validation)
Step 2: Multi-threaded validation (3-5 hours)
# Phase 6.13: mimalloc-bench integration
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
# Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make shared
# Run multi-threaded benchmarks
cd /tmp/mimalloc-bench
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 16 1000 10000
Decision:
- If 4-thread benefit > 20% → ✅ KEEP TLS (Scenario A)
- If 4-thread benefit 10-20% → ⚠️ CONDITIONAL TLS (Scenario B)
- If 4-thread benefit < 10% → ❌ REVERT TLS (Scenario C)
Step 3: Implementation choice
✅ Scenario A: Keep TLS (expected)
// No changes needed
// Continue to Phase 6.14 (expand benchmarks)
⚠️ Scenario B: Conditional TLS
// Add compile-time flag
// hakmem.h:
#if HAKMEM_MULTITHREAD  // #if, not #ifdef: -DHAKMEM_MULTITHREAD=0 must disable it
#define TLS_CACHE_ENABLED 1
#else
#define TLS_CACHE_ENABLED 0
#endif
// hakmem_tiny.c:
#if TLS_CACHE_ENABLED
__thread struct tls_cache_t tls_cache;
#endif
void* hak_tiny_alloc(size_t size) {
#if TLS_CACHE_ENABLED
    // TLS fast path
    void* ptr = tls_cache_lookup(size);
    if (ptr) return ptr;
#endif
    // Slow path (always available)
    return hak_tiny_alloc_slow(size);
}
Usage:
# Single-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=0"
# Multi-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=1"
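For reference, the fast path that `tls_cache_lookup` provides can be sketched as a per-thread, per-size-class freelist. This is an illustrative sketch only, not hakmem's actual implementation; the size-class layout and the `tls_cache_push` name are assumptions:

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES 8
#define CLASS_SHIFT 4   /* 16-byte size classes: 16, 32, ..., 128 */

/* One singly linked freelist per size class, per thread (no locks). */
static __thread void *tls_free[NUM_CLASSES];

static inline int size_class(size_t size) {
    size_t cls = (size - 1) >> CLASS_SHIFT;
    return cls < NUM_CLASSES ? (int)cls : -1;  /* -1: not cacheable */
}

/* Alloc fast path: pop the head of the thread-local freelist. */
static void *tls_cache_lookup(size_t size) {
    int cls = size_class(size);
    if (cls < 0 || !tls_free[cls]) return NULL;  /* miss -> slow path */
    void *ptr = tls_free[cls];
    tls_free[cls] = *(void **)ptr;  /* next pointer lives in the free block */
    return ptr;
}

/* Free fast path: push the block back onto the caller's list. */
static void tls_cache_push(void *ptr, size_t size) {
    int cls = size_class(size);
    if (cls < 0) { free(ptr); return; }  /* too large: fall back */
    *(void **)ptr = tls_free[cls];
    tls_free[cls] = ptr;
}
```

The hit path is a load, a compare, and two pointer moves, which is where the 20-40 cycle best-case estimate comes from.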
❌ Scenario C: Revert TLS
# Revert Phase 6.12.1 Step 2 completely
git revert <commit-hash>
# Continue to Phase 6.16 (fix Tiny Pool via Option B)
Expected Outcome
Most Likely: Scenario A (Keep TLS)
Evidence:
- TLS is standard practice in mimalloc/jemalloc/tcmalloc
- Multi-threaded workloads are common (web servers, databases)
- Single-threaded overhead is expected (20-40 cycles)
- The 9,000-cycle regression is likely due to Slab Registry, not TLS
Next steps:
- Re-measure TLS in isolation (confirm <10% overhead)
- Run mimalloc-bench (validate >20% multi-threaded benefit)
- Keep TLS, proceed to Phase 6.14
Alternative: Scenario B (Conditional TLS)
If:
- Multi-threaded benefit is marginal (10-20% at 4 threads)
- Single-threaded regression remains (even after Slab Registry revert)
Then:
- Implement compile-time flag HAKMEM_MULTITHREAD
- Provide two build configurations
- Document trade-offs (single vs multi-threaded)
Timeline
| Step | Duration | Action |
|---|---|---|
| 1 | 2 hours | Re-measure TLS in isolation |
| 2 | 3-5 hours | mimalloc-bench multi-threaded validation |
| 3 | 1 hour | Analyze results + make decision |
| 4a | 0 hours | Keep TLS (no changes) |
| 4b | 4 hours | Implement conditional TLS (if needed) |
| 4c | 1 hour | Revert TLS (if needed) |
Total: 6-12 hours (depending on outcome)
Risk Mitigation
Risk 1: Re-measurement shows TLS overhead is still high (>20%)
Mitigation: Profile with perf to identify root cause
perf record -g ./bench_allocators_hakmem
perf report --stdio | grep -E 'tls|FS'
Risk 2: Multi-threaded benchmarks show no TLS benefit
Mitigation: Check if Site Rules are too effective (already eliminated contention)
# Disable Site Rules temporarily
export HAKMEM_SITE_RULES=0
./bench_allocators_hakmem
# If performance drops → Site Rules are effective
# If performance same → Site Rules not helping
Risk 3: Conditional TLS adds maintenance burden
Mitigation: Use runtime detection instead (check thread count)
// Runtime detection (auto-enable TLS if threads > 1)
static int tls_enabled = 0;

void hak_init(void) {
    int num_threads = get_num_threads();
    tls_enabled = (num_threads > 1);
}
Trade-off: Runtime overhead (branch per allocation) vs compile-time maintenance
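A minimal sketch of that runtime-gated allocation path, with stub helpers standing in for hakmem's real routines (the stubs and the gate placement are assumptions, not the actual code):

```c
#include <stddef.h>
#include <stdlib.h>

/* tls_enabled is set once in hak_init(), so the extra branch per
   allocation is highly predictable after warm-up. */
static int tls_enabled = 0;

/* Stand-ins for the real hakmem routines, for illustration only. */
static void *tls_cache_lookup(size_t size) { (void)size; return NULL; }
static void *hak_tiny_alloc_slow(size_t size) { return malloc(size); }

void *hak_tiny_alloc(size_t size) {
    if (tls_enabled) {                   /* the one extra branch per op */
        void *ptr = tls_cache_lookup(size);
        if (ptr) return ptr;             /* TLS hit: lock-free return */
    }
    return hak_tiny_alloc_slow(size);    /* shared slow path */
}
```

This keeps a single binary for both build targets, at the cost of one well-predicted branch instead of a compile-time `#if`.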
End of Decision Tree
This document provides a structured decision-making process for the TLS Freelist Cache question. The recommended approach is to re-measure TLS overhead in isolation (Step 1), then validate multi-threaded benefit with mimalloc-bench (Step 2), and make a data-driven decision (Step 3).