# TLS Freelist Cache Decision Tree
**Context**: Phase 6.12.1 Step 2 showed +42% regression in single-threaded workloads
**Question**: Should we keep TLS for multi-threaded benefit?
---
## Decision Tree
```
START: TLS showed +42% single-threaded regression
├─ Question 1: Is this regression ONLY due to TLS?
│ │
│ ├─ NO → Re-measure TLS in isolation (revert Slab Registry)
│ │ └─ Action: Revert Step 2 (Slab Registry), keep Step 1 (SlabTag removal)
│ │ └─ Re-run benchmarks → Measure true TLS overhead
│ │ └─ If still +40% → Continue to Question 2
│ │ └─ If <10% → TLS is OK, problem was Slab Registry
│ │
│ └─ YES (TLS overhead confirmed) → Continue to Question 2
├─ Question 2: What is TLS overhead in cycles?
│ │
│ ├─ Measure: 3,116 ns = ~9,000 cycles @ 3GHz
│ │
│ └─ Analysis: This is TOO HIGH for just TLS cache lookup
│ └─ Root cause candidates:
│ ├─ Slab Registry hash overhead (likely culprit)
│ ├─ TLS cache miss rate (cache too small or bad eviction)
│ └─ Indirect call overhead (function pointer for free routing)
│ └─ Action: Profile with perf to isolate
├─ Question 3: Run mimalloc-bench multi-threaded tests (Phase 6.13)
│ │
│ ├─ Test: larson 4-thread, larson 16-thread
│ │
│ └─ Results analysis:
│ │
│ ├─ Scenario A: 4-thread benefit > 20% AND 16-thread benefit > 40%
│ │ └─ Decision: ✅ KEEP TLS
│ │ └─ Rationale: Multi-threaded benefit outweighs single-threaded cost
│ │ └─ Next: Phase 6.14 (expand benchmarks)
│ │
│ ├─ Scenario B: 4-thread benefit 10-20% OR 16-thread benefit 20-40%
│ │ └─ Decision: ⚠️ MAKE CONDITIONAL
│ │ └─ Implementation: Compile-time flag HAKMEM_MULTITHREAD
│ │ └─ Usage: make HAKMEM_MULTITHREAD=1 (for multi-threaded apps)
│ │ make HAKMEM_MULTITHREAD=0 (for single-threaded apps)
│ │ └─ Next: Phase 6.14 (expand benchmarks)
│ │
│ └─ Scenario C: 4-thread benefit < 10% (unlikely)
│ └─ Decision: ❌ REVERT TLS
│ └─ Rationale: Site Rules already reduce contention, TLS adds no value
│ └─ Next: Phase 6.16 (fix Tiny Pool instead)
└─ Question 4: hakmem-specific factors
├─ Site Rules already reduce lock contention by ~60%
│ └─ TLS benefit is LESS for hakmem than mimalloc/jemalloc
│ └─ Expected TLS benefit: 100-300 cycles (vs 245-720 for mimalloc)
└─ Break-even analysis:
├─ TLS overhead: 20-40 cycles (best case)
├─ TLS benefit: 100-300 cycles (@ 70% cache hit rate, 4+ threads)
└─ Break-even: 2-4 threads (depends on contention level)
└─ Conclusion: TLS should WIN at 4+ threads, but margin is smaller
```
---
## Quantitative Analysis
### Single-Threaded Overhead (Measured)
| Metric | Value | Source |
|--------|-------|--------|
| Before TLS | 7,355 ns/op | Phase 6.12.1 Step 1 |
| After TLS | 10,471 ns/op | Phase 6.12.1 Step 2 |
| Regression | +3,116 ns/op (+42.4%) | Calculated |
| Cycles (@ 3GHz) | ~9,000 cycles | Estimated |
**Analysis**: 9,000 cycles is TOO HIGH for TLS cache lookup (expected: 20-40 cycles). Likely cause: **Slab Registry hash overhead** (Step 2 change).
**Action**: Re-measure TLS in isolation (revert Slab Registry, keep only TLS).
---
### Multi-Threaded Benefit (Estimated)
| Metric | mimalloc (baseline) | hakmem (with Site Rules) |
|--------|---------------------|--------------------------|
| Lock contention cost | 350-800 cycles | 140-320 cycles (-60%) |
| TLS cache hit rate | 70-90% | 70-90% (same) |
| Cycles saved per hit | 245-720 cycles | 100-300 cycles |
| TLS overhead | 20-40 cycles | 20-40 cycles (same) |
| **Net benefit** | 205-680 cycles | 60-260 cycles |
**Break-even point**:
- mimalloc: 2 threads (TLS overhead 20-40c < benefit 205-680c)
- hakmem: 2-4 threads (TLS overhead 20-40c < benefit 60-260c)
**Conclusion**: TLS is LESS valuable for hakmem, but still beneficial at 4+ threads.
---
## Recommendation Flow
### Step 1: Re-measure TLS overhead in isolation (2 hours)
```bash
# Revert Slab Registry (Step 2), keep SlabTag removal (Step 1)
git diff HEAD~1 hakmem_tiny.c # Review Step 2 changes
git revert <commit-hash> # Revert Step 2 only
# Re-run benchmarks
make bench_allocators_hakmem
./bench_allocators_hakmem
# Expected result:
# - If overhead drops to <10%: TLS is OK, problem was Slab Registry
# - If overhead remains ~40%: TLS itself is the problem
```
**Decision**:
- If overhead < 10% → Keep TLS, skip Step 2 (Slab Registry)
- If overhead > 20% → Proceed to Step 2 (multi-threaded validation)
---
### Step 2: Multi-threaded validation (3-5 hours)
```bash
# Phase 6.13: mimalloc-bench integration
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
# Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make shared
# Run multi-threaded benchmarks
cd /tmp/mimalloc-bench
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```
**Decision**:
- If 4-thread benefit > 20% → ✅ KEEP TLS (Scenario A)
- If 4-thread benefit 10-20% → ⚠️ CONDITIONAL TLS (Scenario B)
- If 4-thread benefit < 10% → ❌ REVERT TLS (Scenario C)
---
### Step 3: Implementation choice
#### ✅ Scenario A: Keep TLS (expected)
```c
// No changes needed
// Continue to Phase 6.14 (expand benchmarks)
```
#### ⚠️ Scenario B: Conditional TLS
```c
// Add compile-time flag
// hakmem.h:
// Use #if, not #ifdef: with #ifdef, "-DHAKMEM_MULTITHREAD=0" would still
// define the macro and enable the cache. An undefined macro evaluates to 0
// in #if, so omitting the flag also disables it.
#if HAKMEM_MULTITHREAD
#define TLS_CACHE_ENABLED 1
#else
#define TLS_CACHE_ENABLED 0
#endif
// hakmem_tiny.c:
#if TLS_CACHE_ENABLED
__thread struct tls_cache_t tls_cache;
#endif
void* hak_tiny_alloc(size_t size) {
#if TLS_CACHE_ENABLED
    // TLS fast path
    void* ptr = tls_cache_lookup(size);
    if (ptr) return ptr;
#endif
    // Slow path (always available)
    return hak_tiny_alloc_slow(size);
}
```
**Usage**:
```bash
# Single-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=0"
# Multi-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=1"
```
#### ❌ Scenario C: Revert TLS
```bash
# Revert Phase 6.12.1 Step 2 completely
git revert <commit-hash>
# Continue to Phase 6.16 (fix Tiny Pool via Option B)
```
---
## Expected Outcome
### Most Likely: Scenario A (Keep TLS)
**Evidence**:
1. TLS is standard practice in mimalloc/jemalloc/tcmalloc
2. Multi-threaded workloads are common (web servers, databases)
3. Single-threaded overhead is expected (20-40 cycles)
4. The 9,000-cycle regression is likely due to Slab Registry, not TLS
**Next steps**:
1. Re-measure TLS in isolation (confirm <10% overhead)
2. Run mimalloc-bench (validate >20% multi-threaded benefit)
3. Keep TLS, proceed to Phase 6.14
---
### Alternative: Scenario B (Conditional TLS)
**If**:
- Multi-threaded benefit is marginal (10-20% at 4 threads)
- Single-threaded regression remains (even after Slab Registry revert)
**Then**:
- Implement compile-time flag HAKMEM_MULTITHREAD
- Provide two build configurations
- Document trade-offs (single vs multi-threaded)
---
## Timeline
| Step | Duration | Action |
|------|----------|--------|
| 1 | 2 hours | Re-measure TLS in isolation |
| 2 | 3-5 hours | mimalloc-bench multi-threaded validation |
| 3 | 1 hour | Analyze results + make decision |
| 4a | 0 hours | Keep TLS (no changes) |
| 4b | 4 hours | Implement conditional TLS (if needed) |
| 4c | 1 hour | Revert TLS (if needed) |
**Total**: 6-12 hours (depends on outcome)
---
## Risk Mitigation
### Risk 1: Re-measurement shows TLS overhead is still high (>20%)
**Mitigation**: Profile with perf to identify root cause
```bash
perf record -g ./bench_allocators_hakmem
perf report --stdio | grep -E 'tls|FS'
```
### Risk 2: Multi-threaded benchmarks show no TLS benefit
**Mitigation**: Check if Site Rules are too effective (already eliminated contention)
```bash
# Disable Site Rules temporarily
export HAKMEM_SITE_RULES=0
./bench_allocators_hakmem
# If performance drops → Site Rules are effective
# If performance same → Site Rules not helping
```
### Risk 3: Conditional TLS adds maintenance burden
**Mitigation**: Use runtime detection instead (check thread count)
```c
// Runtime detection (auto-enable TLS if threads > 1)
static int tls_enabled = 0;

void hak_init(void) {
    // get_num_threads() is a placeholder: e.g. count threads registered
    // with the allocator, or flip the flag on the first cross-thread free.
    int num_threads = get_num_threads();
    tls_enabled = (num_threads > 1);
}
```
**Trade-off**: Runtime overhead (branch per allocation) vs compile-time maintenance
---
**End of Decision Tree**
This document provides a structured decision-making process for the TLS Freelist Cache question. The recommended approach is to re-measure TLS overhead in isolation (Step 1), then validate multi-threaded benefit with mimalloc-bench (Step 2), and make a data-driven decision (Step 3).