# TLS Freelist Cache Decision Tree

**Context**: Phase 6.12.1 Step 2 showed +42% regression in single-threaded workloads

**Question**: Should we keep TLS for multi-threaded benefit?

---

## Decision Tree

```
START: TLS showed +42% single-threaded regression
│
├─ Question 1: Is this regression ONLY due to TLS?
│   │
│   ├─ NO → Re-measure TLS in isolation (revert Slab Registry)
│   │   └─ Action: Revert Step 2 (Slab Registry), keep Step 1 (SlabTag removal)
│   │       └─ Re-run benchmarks → Measure true TLS overhead
│   │           └─ If still +40% → Continue to Question 2
│   │           └─ If <10% → TLS is OK, problem was Slab Registry
│   │
│   └─ YES (TLS overhead confirmed) → Continue to Question 2
│
├─ Question 2: What is TLS overhead in cycles?
│   │
│   ├─ Measure: 3,116 ns = ~9,000 cycles @ 3GHz
│   │
│   └─ Analysis: This is TOO HIGH for just TLS cache lookup
│       └─ Root cause candidates:
│           ├─ Slab Registry hash overhead (likely culprit)
│           ├─ TLS cache miss rate (cache too small or bad eviction)
│           └─ Indirect call overhead (function pointer for free routing)
│       └─ Action: Profile with perf to isolate
│
├─ Question 3: Run mimalloc-bench multi-threaded tests (Phase 6.13)
│   │
│   ├─ Test: larson 4-thread, larson 16-thread
│   │
│   └─ Results analysis:
│       │
│       ├─ Scenario A: 4-thread benefit > 20% AND 16-thread benefit > 40%
│       │   └─ Decision: ✅ KEEP TLS
│       │       └─ Rationale: Multi-threaded benefit outweighs single-threaded cost
│       │       └─ Next: Phase 6.14 (expand benchmarks)
│       │
│       ├─ Scenario B: 4-thread benefit 10-20% OR 16-thread benefit 20-40%
│       │   └─ Decision: ⚠️ MAKE CONDITIONAL
│       │       └─ Implementation: Compile-time flag HAKMEM_MULTITHREAD
│       │       └─ Usage: make HAKMEM_MULTITHREAD=1 (for multi-threaded apps)
│       │                 make HAKMEM_MULTITHREAD=0 (for single-threaded apps)
│       │       └─ Next: Phase 6.14 (expand benchmarks)
│       │
│       └─ Scenario C: 4-thread benefit < 10% (unlikely)
│           └─ Decision: ❌ REVERT TLS
│               └─ Rationale: Site Rules already reduce contention, TLS adds no value
│               └─ Next: Phase 6.16 (fix Tiny Pool instead)
│
└─ Question 4: hakmem-specific factors
    │
    ├─ Site Rules already reduce lock contention by ~60%
    │   └─ TLS benefit is LESS for hakmem than mimalloc/jemalloc
    │       └─ Expected TLS benefit: 100-300 cycles (vs 245-720 for mimalloc)
    │
    └─ Break-even analysis:
        ├─ TLS overhead: 20-40 cycles (best case)
        ├─ TLS benefit: 100-300 cycles (@ 70% cache hit rate, 4+ threads)
        └─ Break-even: 2-4 threads (depends on contention level)
            └─ Conclusion: TLS should WIN at 4+ threads, but margin is smaller
```

---

## Quantitative Analysis

### Single-Threaded Overhead (Measured)

| Metric | Value | Source |
|--------|-------|--------|
| Before TLS | 7,355 ns/op | Phase 6.12.1 Step 1 |
| After TLS | 10,471 ns/op | Phase 6.12.1 Step 2 |
| Regression | +3,116 ns/op (+42.4%) | Calculated |
| Cycles (@ 3GHz) | ~9,000 cycles | Estimated |

**Analysis**: 9,000 cycles is TOO HIGH for TLS cache lookup (expected: 20-40 cycles). Likely cause: **Slab Registry hash overhead** (Step 2 change).

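
The figures in the table follow from two small calculations. A minimal sketch, assuming the 3 GHz clock the table uses; the helper names `regression_percent` and `ns_to_cycles` are illustrative and not part of hakmem:

```c
#include <assert.h>

/* Relative regression, in percent, between two per-operation latencies. */
double regression_percent(double before_ns, double after_ns) {
    assert(before_ns > 0.0);
    return (after_ns - before_ns) / before_ns * 100.0;
}

/* Convert nanoseconds to CPU cycles at a given clock frequency in GHz.
   1 GHz is one cycle per nanosecond, so cycles = ns * GHz. */
double ns_to_cycles(double ns, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return ns * clock_ghz;
}
```

With the measured values, `regression_percent(7355, 10471)` evaluates to roughly 42.4 and `ns_to_cycles(3116, 3.0)` to 9,348, which the table rounds to ~9,000 cycles.
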
**Action**: Re-measure TLS in isolation (revert Slab Registry, keep only TLS).

---

### Multi-Threaded Benefit (Estimated)

| Metric | mimalloc (baseline) | hakmem (with Site Rules) |
|--------|---------------------|--------------------------|
| Lock contention cost | 350-800 cycles | 140-320 cycles (-60%) |
| TLS cache hit rate | 70-90% | 70-90% (same) |
| Cycles saved per hit | 245-720 cycles | 100-300 cycles |
| TLS overhead | 20-40 cycles | 20-40 cycles (same) |
| **Net benefit** | 205-680 cycles | 60-260 cycles |

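
The estimates in this table appear to follow a simple model: cycles saved is the contention cost times the cache hit rate (low end with 70%, high end with 90%), and the net benefit subtracts the worst-case 40-cycle TLS overhead from both ends. The sketch below reproduces the numbers under that reading. The type and function names are hypothetical, and the pairing of range ends is an inference from the table rather than a documented formula.

```c
#include <assert.h>

/* Closed interval of cycle counts, e.g. 350-800 cycles. */
typedef struct {
    double lo;
    double hi;
} cycle_range;

/* Scale a contention range by a fractional reduction
   (0.60 for the ~60% attributed to Site Rules). */
cycle_range reduce_contention(cycle_range contention, double reduction) {
    assert(reduction >= 0.0 && reduction <= 1.0);
    cycle_range out = { contention.lo * (1.0 - reduction),
                        contention.hi * (1.0 - reduction) };
    return out;
}

/* Cycles saved: low cost paired with the low hit rate,
   high cost paired with the high hit rate. */
cycle_range cycles_saved(cycle_range contention, double hit_lo, double hit_hi) {
    assert(hit_lo >= 0.0 && hit_lo <= hit_hi && hit_hi <= 1.0);
    cycle_range saved = { contention.lo * hit_lo, contention.hi * hit_hi };
    return saved;
}

/* Net benefit: both ends minus the worst-case TLS overhead. */
cycle_range net_benefit(cycle_range saved, double worst_overhead) {
    cycle_range net = { saved.lo - worst_overhead, saved.hi - worst_overhead };
    return net;
}
```

Starting from mimalloc's 350-800 cycles, this yields 245-720 cycles saved and 205-680 net. Applying the 60% reduction gives 140-320 cycles of contention and 98-288 cycles saved for hakmem, which the table rounds to 100-300.
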
**Break-even point**:
- mimalloc: 2 threads (net benefit 205-680c after 20-40c TLS overhead)
- hakmem: 2-4 threads (net benefit 60-260c after 20-40c TLS overhead)

**Conclusion**: TLS is LESS valuable for hakmem, but still beneficial at 4+ threads.

---

## Recommendation Flow

### Step 1: Re-measure TLS overhead in isolation (2 hours)
```bash
# Revert Slab Registry (Step 2), keep SlabTag removal (Step 1)
git diff HEAD~1 hakmem_tiny.c    # Review Step 2 changes
git revert <commit-hash>         # Revert Step 2 only

# Re-run benchmarks
make bench_allocators_hakmem
./bench_allocators_hakmem

# Expected result:
# - If overhead drops to <10%: TLS is OK, problem was Slab Registry
# - If overhead remains ~40%: TLS itself is the problem
```

**Decision**:
- If overhead < 10% → Keep TLS and leave out the Slab Registry change (Phase 6.12.1 Step 2)
- If overhead > 20% → Proceed to Step 2 below (multi-threaded validation)

---

### Step 2: Multi-threaded validation (3-5 hours)
```bash
# Phase 6.13: mimalloc-bench integration
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh

# Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make shared

# Run multi-threaded benchmarks
cd /tmp/mimalloc-bench
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 16 1000 10000
```

**Decision**:
- If 4-thread benefit > 20% → ✅ KEEP TLS (Scenario A)
- If 4-thread benefit 10-20% → ⚠️ CONDITIONAL TLS (Scenario B)
- If 4-thread benefit < 10% → ❌ REVERT TLS (Scenario C)

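
The thresholds above can be written as a small classifier. This is a sketch, not hakmem code: the enum and function names are made up for illustration, and it assumes the boundary values (exactly 10% and exactly 20%) fall into Scenario B, which the list implies but does not state. It uses only the 4-thread figure, as this decision list does; the 16-thread criteria from the decision tree are not modeled.

```c
#include <assert.h>

typedef enum {
    SCENARIO_A_KEEP_TLS,
    SCENARIO_B_CONDITIONAL_TLS,
    SCENARIO_C_REVERT_TLS
} tls_scenario;

/* Map the measured 4-thread improvement (in percent, relative to the
   build without TLS) to one of the three scenarios. */
tls_scenario classify_tls_benefit(double benefit_4t_percent) {
    /* Reject NaN: a NaN never compares equal to itself. */
    assert(benefit_4t_percent == benefit_4t_percent);

    if (benefit_4t_percent > 20.0) {
        return SCENARIO_A_KEEP_TLS;
    }
    if (benefit_4t_percent >= 10.0) {
        return SCENARIO_B_CONDITIONAL_TLS;
    }
    return SCENARIO_C_REVERT_TLS;
}
```
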
---

### Step 3: Implementation choice

#### ✅ Scenario A: Keep TLS (expected)
```c
// No changes needed
// Continue to Phase 6.14 (expand benchmarks)
```

#### ⚠️ Scenario B: Conditional TLS
```c
// Add compile-time flag
// hakmem.h:
// Test the flag's value, not just whether it is defined: a plain #ifdef
// would also enable TLS for builds that pass -DHAKMEM_MULTITHREAD=0.
#if defined(HAKMEM_MULTITHREAD) && HAKMEM_MULTITHREAD
#define TLS_CACHE_ENABLED 1
#else
#define TLS_CACHE_ENABLED 0
#endif

// hakmem_tiny.c:
#include <stddef.h>  // size_t

#if TLS_CACHE_ENABLED
__thread struct tls_cache_t tls_cache;
#endif

void* hak_tiny_alloc(size_t size) {
#if TLS_CACHE_ENABLED
    // TLS fast path
    void* ptr = tls_cache_lookup(size);
    if (ptr) return ptr;
#endif

    // Slow path (always available)
    return hak_tiny_alloc_slow(size);
}
```

**Usage**:
```bash
# Single-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=0"

# Multi-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=1"
```

#### ❌ Scenario C: Revert TLS
```bash
# Revert Phase 6.12.1 Step 2 completely
git revert <commit-hash>

# Continue to Phase 6.16 (fix Tiny Pool via Option B)
```

---

## Expected Outcome

### Most Likely: Scenario A (Keep TLS)

**Evidence**:
1. TLS is standard practice in mimalloc/jemalloc/tcmalloc
2. Multi-threaded workloads are common (web servers, databases)
3. Single-threaded overhead from TLS alone is expected to be small (20-40 cycles)
4. The 9,000-cycle regression is likely due to Slab Registry, not TLS

**Next steps**:
1. Re-measure TLS in isolation (confirm <10% overhead)
2. Run mimalloc-bench (validate >20% multi-threaded benefit)
3. Keep TLS, proceed to Phase 6.14

---

### Alternative: Scenario B (Conditional TLS)

**If**:
- Multi-threaded benefit is marginal (10-20% at 4 threads)
- Single-threaded regression remains (even after Slab Registry revert)

**Then**:
- Implement compile-time flag HAKMEM_MULTITHREAD
- Provide two build configurations
- Document trade-offs (single vs multi-threaded)

---

## Timeline

| Step | Duration | Action |
|------|----------|--------|
| 1 | 2 hours | Re-measure TLS in isolation |
| 2 | 3-5 hours | mimalloc-bench multi-threaded validation |
| 3 | 1 hour | Analyze results + make decision |
| 4a | 0 hours | Keep TLS (no changes) |
| 4b | 4 hours | Implement conditional TLS (if needed) |
| 4c | 1 hour | Revert TLS (if needed) |

**Total**: 6-12 hours (6-8 hours for Steps 1-3, plus 0-4 hours depending on outcome)

---

## Risk Mitigation

### Risk 1: Re-measurement shows TLS overhead is still high (>20%)
**Mitigation**: Profile with perf to identify root cause
```bash
perf record -g ./bench_allocators_hakmem
perf report --stdio | grep -E 'tls|FS'
```

### Risk 2: Multi-threaded benchmarks show no TLS benefit
**Mitigation**: Check if Site Rules are too effective (already eliminated contention)
```bash
# Disable Site Rules temporarily
export HAKMEM_SITE_RULES=0
./bench_allocators_hakmem

# If performance drops → Site Rules are effective
# If performance is unchanged → Site Rules are not helping
```
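
### Risk 3: Conditional TLS adds maintenance burden
**Mitigation**: Use runtime detection instead (check thread count)
```c
// Runtime detection (auto-enable TLS if threads > 1)
static int tls_enabled = 0;

// Hypothetical helper, not defined in this document:
// returns the number of threads currently in use.
int get_num_threads(void);

// The thread count is sampled once here; threads created
// after hak_init() do not change tls_enabled.
void hak_init(void) {
    int num_threads = get_num_threads();
    tls_enabled = (num_threads > 1);
}
```
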
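
With runtime detection, the allocation path has to test the flag on every call. A minimal sketch of that gate follows. The one-slot cache and the `malloc`-backed slow path are stand-ins invented for this example so that it compiles on its own; only `tls_enabled`, `tls_cache_lookup`, `hak_tiny_alloc`, and `hak_tiny_alloc_slow` are names taken from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Set once during initialization, as in hak_init() above. */
static int tls_enabled = 0;

/* Stand-in for the TLS cache: one cached block per thread (illustrative only). */
static __thread void *tls_slot = NULL;

/* Stand-in for the TLS fast path: hand out the cached block, if any. */
static void *tls_cache_lookup(size_t size) {
    (void)size;
    void *p = tls_slot;
    tls_slot = NULL;
    return p;
}

/* Stand-in for the shared slow path. */
static void *hak_tiny_alloc_slow(size_t size) {
    return malloc(size);
}

void *hak_tiny_alloc(size_t size) {
    assert(size > 0);
    if (tls_enabled) {  /* the extra branch paid on every allocation */
        void *ptr = tls_cache_lookup(size);
        if (ptr) return ptr;
    }
    return hak_tiny_alloc_slow(size);
}
```

When `tls_enabled` is 0, the cache is never consulted and the only added cost is the flag test itself.
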
**Trade-off**: Runtime overhead (branch per allocation) vs compile-time maintenance

---

**End of Decision Tree**

This document provides a structured decision-making process for the TLS Freelist Cache question. The recommended approach is to re-measure TLS overhead in isolation (Step 1), then validate multi-threaded benefit with mimalloc-bench (Step 2), and make a data-driven decision (Step 3).