# TLS Freelist Cache Decision Tree
**Context**: Phase 6.12.1 Step 2 showed a +42% regression in single-threaded workloads
**Question**: Should we keep TLS for its multi-threaded benefit?

---

## Decision Tree

```
START: TLS showed +42% single-threaded regression
│
├─ Question 1: Is this regression ONLY due to TLS?
│  │
│  ├─ NO → Re-measure TLS in isolation (revert Slab Registry)
│  │      └─ Action: Revert Step 2 (Slab Registry), keep Step 1 (SlabTag removal)
│  │         └─ Re-run benchmarks → Measure true TLS overhead
│  │            └─ If still +40% → Continue to Question 2
│  │            └─ If <10% → TLS is OK, problem was Slab Registry
│  │
│  └─ YES (TLS overhead confirmed) → Continue to Question 2
│
├─ Question 2: What is TLS overhead in cycles?
│  │
│  ├─ Measure: 3,116 ns = ~9,000 cycles @ 3GHz
│  │
│  └─ Analysis: This is TOO HIGH for just TLS cache lookup
│     └─ Root cause candidates:
│        ├─ Slab Registry hash overhead (likely culprit)
│        ├─ TLS cache miss rate (cache too small or bad eviction)
│        └─ Indirect call overhead (function pointer for free routing)
│     └─ Action: Profile with perf to isolate
│
├─ Question 3: Run mimalloc-bench multi-threaded tests (Phase 6.13)
│  │
│  ├─ Test: larson 4-thread, larson 16-thread
│  │
│  └─ Results analysis:
│     │
│     ├─ Scenario A: 4-thread benefit > 20% AND 16-thread benefit > 40%
│     │  └─ Decision: ✅ KEEP TLS
│     │     └─ Rationale: Multi-threaded benefit outweighs single-threaded cost
│     │        └─ Next: Phase 6.14 (expand benchmarks)
│     │
│     ├─ Scenario B: 4-thread benefit 10-20% OR 16-thread benefit 20-40%
│     │  └─ Decision: ⚠️ MAKE CONDITIONAL
│     │     └─ Implementation: Compile-time flag HAKMEM_MULTITHREAD
│     │        └─ Usage: make CFLAGS="-DHAKMEM_MULTITHREAD=1" (for multi-threaded apps)
│     │           make CFLAGS="-DHAKMEM_MULTITHREAD=0" (for single-threaded apps)
│     │        └─ Next: Phase 6.14 (expand benchmarks)
│     │
│     └─ Scenario C: 4-thread benefit < 10% (unlikely)
│        └─ Decision: ❌ REVERT TLS
│           └─ Rationale: Site Rules already reduce contention, TLS adds no value
│              └─ Next: Phase 6.16 (fix Tiny Pool instead)
│
└─ Question 4: hakmem-specific factors
   │
   ├─ Site Rules already reduce lock contention by ~60%
   │  └─ TLS benefit is LESS for hakmem than mimalloc/jemalloc
   │     └─ Expected TLS benefit: 100-300 cycles (vs 245-720 for mimalloc)
   │
   └─ Break-even analysis:
      ├─ TLS overhead: 20-40 cycles (best case)
      ├─ TLS benefit: 100-300 cycles (@ 70-90% cache hit rate, 4+ threads)
      └─ Break-even: 2-4 threads (depends on contention level)
         └─ Conclusion: TLS should WIN at 4+ threads, but margin is smaller
```

---

## Quantitative Analysis

### Single-Threaded Overhead (Measured)

| Metric | Value | Source |
|--------|-------|--------|
| Before TLS | 7,355 ns/op | Phase 6.12.1 Step 1 |
| After TLS | 10,471 ns/op | Phase 6.12.1 Step 2 |
| Regression | +3,116 ns/op (+42.4%) | Calculated |
| Cycles (@ 3GHz) | ~9,000 cycles | Estimated |

**Analysis**: 9,000 cycles is TOO HIGH for TLS cache lookup (expected: 20-40 cycles). Likely cause: **Slab Registry hash overhead** (Step 2 change).

**Action**: Re-measure TLS in isolation (revert Slab Registry, keep only TLS).
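
The table's arithmetic can be reproduced with a few lines of C. This is a minimal sketch: the function names are illustrative and not part of hakmem, and the 3 GHz clock is the same assumption the table makes.

```c
#include <stdio.h>

/* Convert a latency in nanoseconds to CPU cycles at a given clock (GHz).
 * 1 GHz is one cycle per nanosecond, so cycles = ns * GHz. */
double ns_to_cycles(double ns, double ghz) {
    return ns * ghz;
}

/* Relative change from `before` to `after`, in percent. */
double regression_pct(double before, double after) {
    return (after - before) / before * 100.0;
}

/* Print the derived rows of the table from its two measured rows. */
void print_regression_summary(void) {
    const double before_ns = 7355.0;   /* Phase 6.12.1 Step 1 */
    const double after_ns  = 10471.0;  /* Phase 6.12.1 Step 2 */
    const double delta_ns  = after_ns - before_ns;

    printf("delta:      %.0f ns/op\n", delta_ns);
    printf("regression: %.1f%%\n", regression_pct(before_ns, after_ns));
    printf("cycles:     %.0f @ 3 GHz\n", ns_to_cycles(delta_ns, 3.0));
}
```

Calling `print_regression_summary()` gives a delta of 3116 ns/op, a regression of 42.4%, and 9348 cycles, which the table rounds to ~9,000.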

---

### Multi-Threaded Benefit (Estimated)

| Metric | mimalloc (baseline) | hakmem (with Site Rules) |
|--------|---------------------|--------------------------|
| Lock contention cost | 350-800 cycles | 140-320 cycles (-60%) |
| TLS cache hit rate | 70-90% | 70-90% (same) |
| Cycles saved (contention cost × hit rate) | 245-720 cycles | 100-300 cycles |
| TLS overhead | 20-40 cycles | 20-40 cycles (same) |
| **Net benefit** | 205-680 cycles | 60-260 cycles |

**Break-even point**:
- mimalloc: 2 threads (TLS overhead 20-40c < savings 245-720c, net 205-680c)
- hakmem: 2-4 threads (TLS overhead 20-40c < savings 100-300c, net 60-260c)

**Conclusion**: TLS is LESS valuable for hakmem, but still beneficial at 4+ threads.
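
The estimate above appears to follow a simple model: cycles saved are the lock contention cost multiplied by the TLS hit rate, and the net benefit subtracts the worst-case TLS overhead of 40 cycles. The sketch below reproduces that model so the inputs can be varied. The type and function names are hypothetical, and the model itself is an inference from the table's numbers, not a measured result.

```c
#include <stdio.h>

/* A low-to-high range, used for both cycle counts and hit rates. */
typedef struct { double lo, hi; } range_t;

/* Expected cycles saved per allocation: the lock contention cost that a
 * TLS hit avoids, weighted by the hit rate (low with low, high with high). */
range_t tls_saved(range_t contention, range_t hit_rate) {
    range_t r = { contention.lo * hit_rate.lo, contention.hi * hit_rate.hi };
    return r;
}

/* Net benefit, charging the worst-case TLS overhead at both ends,
 * which is how the table's 205-680 figure comes out. */
range_t tls_net(range_t saved, double overhead_hi) {
    range_t r = { saved.lo - overhead_hi, saved.hi - overhead_hi };
    return r;
}

/* Print saved and net cycles for one allocator's contention range. */
void print_model(const char *name, range_t contention) {
    const range_t hit_rate = { 0.70, 0.90 };
    range_t saved = tls_saved(contention, hit_rate);
    range_t net = tls_net(saved, 40.0);
    printf("%-8s saved %.0f-%.0f, net %.0f-%.0f cycles\n",
           name, saved.lo, saved.hi, net.lo, net.hi);
}
```

With a contention range of 350-800 cycles this yields 245-720 saved and 205-680 net, matching the mimalloc column. With 140-320 cycles it yields 98-288 saved, which the hakmem column rounds to 100-300 before subtracting the overhead.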

---

## Recommendation Flow

### Step 1: Re-measure TLS overhead in isolation (2 hours)
```bash
# Revert Slab Registry (Step 2), keep SlabTag removal (Step 1)
git diff HEAD~1 hakmem_tiny.c  # Review Step 2 changes
git revert <commit-hash>       # Revert Step 2 only

# Re-run benchmarks
make bench_allocators_hakmem
./bench_allocators_hakmem

# Expected result:
# - If overhead drops to <10%: TLS is OK, problem was Slab Registry
# - If overhead remains ~40%: TLS itself is the problem
```

**Decision**:
- If overhead < 10% → Keep TLS; do not reapply the Slab Registry change (Phase 6.12.1 Step 2)
- If overhead > 20% → Proceed to Step 2 below (multi-threaded validation)

---

### Step 2: Multi-threaded validation (3-5 hours)
```bash
# Phase 6.13: mimalloc-bench integration
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh

# Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make shared

# Run multi-threaded benchmarks
cd /tmp/mimalloc-bench
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```

**Decision**:
- If 4-thread benefit > 20% → ✅ KEEP TLS (Scenario A)
- If 4-thread benefit 10-20% → ⚠️ CONDITIONAL TLS (Scenario B)
- If 4-thread benefit < 10% → ❌ REVERT TLS (Scenario C)
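
To keep the decision mechanical once the larson numbers are in, the thresholds can be encoded as a small function. This sketch uses the fuller thresholds from the decision tree (Scenario A requires both the 4-thread and the 16-thread condition). The names are hypothetical, and treating every combination the tree does not list as Scenario B is an assumption of this sketch.

```c
typedef enum {
    SCENARIO_A_KEEP,
    SCENARIO_B_CONDITIONAL,
    SCENARIO_C_REVERT
} tls_scenario_t;

/* Classify measured multi-threaded benefit (percent improvement with TLS).
 * C: 4-thread benefit below 10%.
 * A: 4-thread benefit above 20% AND 16-thread benefit above 40%.
 * B: everything else (assumption: unlisted combinations fall here). */
tls_scenario_t classify_tls_benefit(double benefit_4t_pct,
                                    double benefit_16t_pct) {
    if (benefit_4t_pct < 10.0) {
        return SCENARIO_C_REVERT;
    }
    if (benefit_4t_pct > 20.0 && benefit_16t_pct > 40.0) {
        return SCENARIO_A_KEEP;
    }
    return SCENARIO_B_CONDITIONAL;
}

/* Human-readable label for logs or benchmark reports. */
const char *tls_scenario_name(tls_scenario_t s) {
    switch (s) {
    case SCENARIO_A_KEEP:        return "A: keep TLS";
    case SCENARIO_B_CONDITIONAL: return "B: conditional TLS";
    case SCENARIO_C_REVERT:      return "C: revert TLS";
    }
    return "unknown";
}
```

For example, `classify_tls_benefit(25.0, 30.0)` lands in Scenario B under this sketch, because the 16-thread result misses the 40% bar even though the 4-thread result clears 20%.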

---

### Step 3: Implementation choice

#### ✅ Scenario A: Keep TLS (expected)
```c
// No changes needed
// Continue to Phase 6.14 (expand benchmarks)
```

#### ⚠️ Scenario B: Conditional TLS
```c
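// Add compile-time flag
// hakmem.h:
// Use #if, not #ifdef: -DHAKMEM_MULTITHREAD=0 still defines the macro,
// so #ifdef would enable TLS in single-threaded builds too.
#if defined(HAKMEM_MULTITHREAD) && HAKMEM_MULTITHREAD
  #define TLS_CACHE_ENABLED 1
#else
  #define TLS_CACHE_ENABLED 0
#endif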

// hakmem_tiny.c:
#if TLS_CACHE_ENABLED
  __thread struct tls_cache_t tls_cache;
#endif

void* hak_tiny_alloc(size_t size) {
  #if TLS_CACHE_ENABLED
    // TLS fast path
    void* ptr = tls_cache_lookup(size);
    if (ptr) return ptr;
  #endif

  // Slow path (always available)
  return hak_tiny_alloc_slow(size);
}
```

**Usage**:
```bash
# Single-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=0"

# Multi-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=1"
```
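
The snippet above calls `tls_cache_lookup` and declares `struct tls_cache_t` without defining either. The following is one possible shape for them, offered as a sketch only: the size classes, the per-class capacity, the array-based LIFO layout, and the `tls_cache_store` helper are all assumptions made for illustration, not hakmem's actual implementation.

```c
#include <stddef.h>

#define TLS_NUM_CLASSES 8   /* hypothetical: 8, 16, ..., 1024-byte classes */
#define TLS_CLASS_CAP   16  /* hypothetical: max cached blocks per class */

/* Per-thread cache: one small LIFO stack of free blocks per size class. */
struct tls_cache_t {
    void  *slots[TLS_NUM_CLASSES][TLS_CLASS_CAP];
    size_t count[TLS_NUM_CLASSES];
};

static __thread struct tls_cache_t tls_cache;

/* Map a request size to a class index, or -1 if it exceeds the largest class. */
int tls_size_class(size_t size) {
    size_t class_size = 8;
    for (int i = 0; i < TLS_NUM_CLASSES; i++, class_size <<= 1) {
        if (size <= class_size) return i;
    }
    return -1;
}

/* Fast path: pop a cached block for this size, or NULL on a miss. */
void *tls_cache_lookup(size_t size) {
    int c = tls_size_class(size);
    if (c < 0 || tls_cache.count[c] == 0) return NULL;
    return tls_cache.slots[c][--tls_cache.count[c]];
}

/* Free path: stash a block for reuse by this thread.
 * Returns 1 on success, 0 if the class is full or the size is too large,
 * in which case the caller would fall back to the shared free path. */
int tls_cache_store(void *ptr, size_t size) {
    int c = tls_size_class(size);
    if (c < 0 || tls_cache.count[c] == TLS_CLASS_CAP) return 0;
    tls_cache.slots[c][tls_cache.count[c]++] = ptr;
    return 1;
}
```

Neither path takes a lock, which is where the multi-threaded benefit would come from. The cost is the size-class computation and the thread-local access on every call, which single-threaded builds pay without any contention to avoid.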

#### ❌ Scenario C: Revert TLS
```bash
# Revert Phase 6.12.1 Step 2 completely
git revert <commit-hash>

# Continue to Phase 6.16 (fix Tiny Pool via Option B)
```

---

## Expected Outcome

### Most Likely: Scenario A (Keep TLS)

**Evidence**:
1. TLS is standard practice in mimalloc/jemalloc/tcmalloc
2. Multi-threaded workloads are common (web servers, databases)
3. A small single-threaded overhead is expected (20-40 cycles)
4. The 9,000-cycle regression is likely due to Slab Registry, not TLS

**Next steps**:
1. Re-measure TLS in isolation (confirm <10% overhead)
2. Run mimalloc-bench (validate >20% multi-threaded benefit)
3. Keep TLS, proceed to Phase 6.14

---

### Alternative: Scenario B (Conditional TLS)

**If**:
- Multi-threaded benefit is marginal (10-20% at 4 threads)
- Single-threaded regression remains (even after Slab Registry revert)

**Then**:
- Implement compile-time flag HAKMEM_MULTITHREAD
- Provide two build configurations
- Document trade-offs (single vs multi-threaded)

---

## Timeline

| Step | Duration | Action |
|------|----------|--------|
| 1 | 2 hours | Re-measure TLS in isolation |
| 2 | 3-5 hours | mimalloc-bench multi-threaded validation |
| 3 | 1 hour | Analyze results + make decision |
| 4a | 0 hours | Keep TLS (no changes) |
| 4b | 4 hours | Implement conditional TLS (if needed) |
| 4c | 1 hour | Revert TLS (if needed) |

**Total**: 6-12 hours (6-8 hours for Steps 1-3, plus 0-4 hours for Step 4 depending on outcome)

---

## Risk Mitigation

### Risk 1: Re-measurement shows TLS overhead is still high (>20%)
**Mitigation**: Profile with perf to identify root cause
```bash
perf record -g ./bench_allocators_hakmem
perf report --stdio | grep -E 'tls|FS'
```

### Risk 2: Multi-threaded benchmarks show no TLS benefit
**Mitigation**: Check if Site Rules are too effective (already eliminated contention)
```bash
# Disable Site Rules temporarily
export HAKMEM_SITE_RULES=0
./bench_allocators_hakmem

# If performance drops → Site Rules are effective
# If performance is unchanged → Site Rules are not helping
```

### Risk 3: Conditional TLS adds maintenance burden
**Mitigation**: Use runtime detection instead (check thread count)
```c
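// Runtime detection (auto-enable TLS if threads > 1)
// get_num_threads() is a placeholder that hakmem would need to provide.
// Caveat: most processes have one thread at init time, so the flag must
// be re-evaluated when threads are created later.
extern int get_num_threads(void);

static int tls_enabled = 0;

void hak_init(void) {
  int num_threads = get_num_threads();
  tls_enabled = (num_threads > 1);
}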
```

**Trade-off**: Runtime overhead (branch per allocation) vs compile-time maintenance
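
The "branch per allocation" cost can be made concrete. The sketch below assumes a hypothetical hook, `hak_note_thread_created`, that hakmem would call whenever it learns a new thread has started (for example from a `pthread_create` wrapper). That hook, the enum, and `hak_choose_path` are illustrative names, not existing hakmem APIs.

```c
#include <stdatomic.h>

/* 0 until a second thread is known to exist. */
static atomic_int tls_enabled = 0;

/* Hypothetical hook: called when a new thread is created, before it
 * starts allocating. Once set, the flag is never cleared. */
void hak_note_thread_created(void) {
    atomic_store_explicit(&tls_enabled, 1, memory_order_relaxed);
}

typedef enum { PATH_SLOW = 0, PATH_TLS_FAST = 1 } hak_path_t;

/* The per-allocation branch that the trade-off refers to: one relaxed
 * atomic load and one conditional on every allocation. */
hak_path_t hak_choose_path(void) {
    return atomic_load_explicit(&tls_enabled, memory_order_relaxed)
               ? PATH_TLS_FAST
               : PATH_SLOW;
}
```

In a single-threaded process the flag stays 0 and the branch is highly predictable, so the runtime cost is likely small. Whether it is small enough compared with a compile-time switch is something the Step 1 benchmark would have to confirm.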

---

**End of Decision Tree**

This document provides a structured decision-making process for the TLS Freelist Cache question. The recommended approach is to re-measure TLS overhead in isolation (Step 1), then validate multi-threaded benefit with mimalloc-bench (Step 2), and make a data-driven decision (Step 3).