TLS Freelist Cache Decision Tree
Context: Phase 6.12.1 Step 2 showed a +42% regression in single-threaded workloads.
Question: Should we keep TLS for its multi-threaded benefit?
Decision Tree
START: TLS showed +42% single-threaded regression
│
├─ Question 1: Is this regression ONLY due to TLS?
│ │
│ ├─ NO → Re-measure TLS in isolation (revert Slab Registry)
│ │ └─ Action: Revert Step 2 (Slab Registry), keep Step 1 (SlabTag removal)
│ │ └─ Re-run benchmarks → Measure true TLS overhead
│ │ └─ If still +40% → Continue to Question 2
│ │ └─ If <10% → TLS is OK, problem was Slab Registry
│ │
│ └─ YES (TLS overhead confirmed) → Continue to Question 2
│
├─ Question 2: What is TLS overhead in cycles?
│ │
│ ├─ Measure: 3,116 ns = ~9,000 cycles @ 3GHz
│ │
│ └─ Analysis: This is TOO HIGH for just TLS cache lookup
│ └─ Root cause candidates:
│ ├─ Slab Registry hash overhead (likely culprit)
│ ├─ TLS cache miss rate (cache too small or bad eviction)
│ └─ Indirect call overhead (function pointer for free routing)
│ └─ Action: Profile with perf to isolate
│
├─ Question 3: Run mimalloc-bench multi-threaded tests (Phase 6.13)
│ │
│ ├─ Test: larson 4-thread, larson 16-thread
│ │
│ └─ Results analysis:
│ │
│ ├─ Scenario A: 4-thread benefit > 20% AND 16-thread benefit > 40%
│ │ └─ Decision: ✅ KEEP TLS
│ │ └─ Rationale: Multi-threaded benefit outweighs single-threaded cost
│ │ └─ Next: Phase 6.14 (expand benchmarks)
│ │
│ ├─ Scenario B: 4-thread benefit 10-20% OR 16-thread benefit 20-40%
│ │ └─ Decision: ⚠️ MAKE CONDITIONAL
│ │ └─ Implementation: Compile-time flag HAKMEM_MULTITHREAD
│ │ └─ Usage: make CFLAGS="-DHAKMEM_MULTITHREAD=1" (for multi-threaded apps)
│ │ make CFLAGS="-DHAKMEM_MULTITHREAD=0" (for single-threaded apps)
│ │ └─ Next: Phase 6.14 (expand benchmarks)
│ │
│ └─ Scenario C: 4-thread benefit < 10% (unlikely)
│ └─ Decision: ❌ REVERT TLS
│ └─ Rationale: Site Rules already reduce contention, TLS adds no value
│ └─ Next: Phase 6.16 (fix Tiny Pool instead)
│
└─ Question 4: hakmem-specific factors
│
├─ Site Rules already reduce lock contention by ~60%
│ └─ TLS benefit is LESS for hakmem than mimalloc/jemalloc
│ └─ Expected TLS benefit: 100-300 cycles (vs 245-720 for mimalloc)
│
└─ Break-even analysis:
├─ TLS overhead: 20-40 cycles (best case)
├─ TLS benefit: 100-300 cycles (@ 70% cache hit rate, 4+ threads)
└─ Break-even: 2-4 threads (depends on contention level)
└─ Conclusion: TLS should WIN at 4+ threads, but margin is smaller
Quantitative Analysis
Single-Threaded Overhead (Measured)
| Metric | Value | Source |
|---|---|---|
| Before TLS | 7,355 ns/op | Phase 6.12.1 Step 1 |
| After TLS | 10,471 ns/op | Phase 6.12.1 Step 2 |
| Regression | +3,116 ns/op (+42.4%) | Calculated |
| Cycles (@ 3 GHz) | ~9,000 cycles | Estimated |
Analysis: 9,000 cycles is TOO HIGH for TLS cache lookup (expected: 20-40 cycles). Likely cause: Slab Registry hash overhead (Step 2 change).
Action: Re-measure TLS in isolation (revert Slab Registry, keep only TLS).
Multi-Threaded Benefit (Estimated)
| Metric | mimalloc (baseline) | hakmem (with Site Rules) |
|---|---|---|
| Lock contention cost | 350-800 cycles | 140-320 cycles (-60%) |
| TLS cache hit rate | 70-90% | 70-90% (same) |
| Cycles saved per hit | 245-720 cycles | 100-300 cycles |
| TLS overhead | 20-40 cycles | 20-40 cycles (same) |
| Net benefit | 205-680 cycles | 60-260 cycles |
Break-even point:
- mimalloc: 2 threads (TLS overhead 20-40c < benefit 205-680c)
- hakmem: 2-4 threads (TLS overhead 20-40c < benefit 60-260c)
Conclusion: TLS is LESS valuable for hakmem, but still beneficial at 4+ threads.
Recommendation Flow
Step 1: Re-measure TLS overhead in isolation (2 hours)
# Revert Slab Registry (Step 2), keep SlabTag removal (Step 1)
git diff HEAD~1 hakmem_tiny.c # Review Step 2 changes
git revert <commit-hash> # Revert Step 2 only
# Re-run benchmarks
make bench_allocators_hakmem
./bench_allocators_hakmem
# Expected result:
# - If overhead drops to <10%: TLS is OK, problem was Slab Registry
# - If overhead remains ~40%: TLS itself is the problem
Decision:
- If overhead < 10% → Keep TLS, skip Step 2 (Slab Registry)
- If overhead > 20% → Proceed to Step 2 (multi-threaded validation)
Step 2: Multi-threaded validation (3-5 hours)
# Phase 6.13: mimalloc-bench integration
cd /tmp
git clone https://github.com/daanx/mimalloc-bench.git
cd mimalloc-bench
./build-all.sh
# Build hakmem.so
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make shared
# Run multi-threaded benchmarks
cd /tmp/mimalloc-bench
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 4 1000 10000
LD_PRELOAD=/path/to/libhakmem.so ./out/bench/larson/larson 16 1000 10000
Decision:
- If 4-thread benefit > 20% → ✅ KEEP TLS (Scenario A)
- If 4-thread benefit 10-20% → ⚠️ CONDITIONAL TLS (Scenario B)
- If 4-thread benefit < 10% → ❌ REVERT TLS (Scenario C)
Step 3: Implementation choice
✅ Scenario A: Keep TLS (expected)
// No changes needed
// Continue to Phase 6.14 (expand benchmarks)
⚠️ Scenario B: Conditional TLS
// Add compile-time flag
// hakmem.h:
#if HAKMEM_MULTITHREAD  // #if, not #ifdef: -DHAKMEM_MULTITHREAD=0 must disable it
#define TLS_CACHE_ENABLED 1
#else
#define TLS_CACHE_ENABLED 0
#endif
// hakmem_tiny.c:
#if TLS_CACHE_ENABLED
__thread struct tls_cache_t tls_cache;
#endif
void* hak_tiny_alloc(size_t size) {
#if TLS_CACHE_ENABLED
    // TLS fast path
    void* ptr = tls_cache_lookup(size);
    if (ptr) return ptr;
#endif
    // Slow path (always available)
    return hak_tiny_alloc_slow(size);
}
Usage:
# Single-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=0"
# Multi-threaded builds
make CFLAGS="-DHAKMEM_MULTITHREAD=1"
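For reference, the fast path that `tls_cache_lookup` provides can be sketched as a per-thread, per-size-class freelist. This is an illustrative sketch only, not hakmem's actual implementation; the size-class layout and the `tls_cache_push` name are assumptions:

```c
#include <stddef.h>
#include <stdlib.h>

#define NUM_CLASSES 8
#define CLASS_SHIFT 4   /* 16-byte size classes: 16, 32, ..., 128 */

/* One singly linked freelist per size class, per thread (no locks). */
static __thread void *tls_free[NUM_CLASSES];

static inline int size_class(size_t size) {
    size_t cls = (size - 1) >> CLASS_SHIFT;
    return cls < NUM_CLASSES ? (int)cls : -1;  /* -1: not cacheable */
}

/* Alloc fast path: pop the head of the thread-local freelist. */
static void *tls_cache_lookup(size_t size) {
    int cls = size_class(size);
    if (cls < 0 || !tls_free[cls]) return NULL;  /* miss -> slow path */
    void *ptr = tls_free[cls];
    tls_free[cls] = *(void **)ptr;  /* next pointer lives in the free block */
    return ptr;
}

/* Free fast path: push the block back onto the caller's list. */
static void tls_cache_push(void *ptr, size_t size) {
    int cls = size_class(size);
    if (cls < 0) { free(ptr); return; }  /* too large: fall back */
    *(void **)ptr = tls_free[cls];
    tls_free[cls] = ptr;
}
```

The hit path is a load, a compare, and two pointer moves, which is where the 20-40 cycle best-case estimate comes from.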
❌ Scenario C: Revert TLS
# Revert Phase 6.12.1 Step 2 completely
git revert <commit-hash>
# Continue to Phase 6.16 (fix Tiny Pool via Option B)
Expected Outcome
Most Likely: Scenario A (Keep TLS)
Evidence:
- TLS is standard practice in mimalloc/jemalloc/tcmalloc
- Multi-threaded workloads are common (web servers, databases)
- Single-threaded overhead is expected (20-40 cycles)
- The 9,000-cycle regression is likely due to Slab Registry, not TLS
Next steps:
- Re-measure TLS in isolation (confirm <10% overhead)
- Run mimalloc-bench (validate >20% multi-threaded benefit)
- Keep TLS, proceed to Phase 6.14
Alternative: Scenario B (Conditional TLS)
If:
- Multi-threaded benefit is marginal (10-20% at 4 threads)
- Single-threaded regression remains (even after Slab Registry revert)
Then:
- Implement compile-time flag HAKMEM_MULTITHREAD
- Provide two build configurations
- Document trade-offs (single vs multi-threaded)
Timeline
| Step | Duration | Action |
|---|---|---|
| 1 | 2 hours | Re-measure TLS in isolation |
| 2 | 3-5 hours | mimalloc-bench multi-threaded validation |
| 3 | 1 hour | Analyze results + make decision |
| 4a | 0 hours | Keep TLS (no changes) |
| 4b | 4 hours | Implement conditional TLS (if needed) |
| 4c | 1 hour | Revert TLS (if needed) |
Total: 6-12 hours (depending on outcome)
Risk Mitigation
Risk 1: Re-measurement shows TLS overhead is still high (>20%)
Mitigation: Profile with perf to identify root cause
perf record -g ./bench_allocators_hakmem
perf report --stdio | grep -E 'tls|FS'
Risk 2: Multi-threaded benchmarks show no TLS benefit
Mitigation: Check if Site Rules are too effective (already eliminated contention)
# Disable Site Rules temporarily
export HAKMEM_SITE_RULES=0
./bench_allocators_hakmem
# If performance drops → Site Rules are effective
# If performance same → Site Rules not helping
Risk 3: Conditional TLS adds maintenance burden
Mitigation: Use runtime detection instead (check thread count)
// Runtime detection (auto-enable TLS if threads > 1)
static int tls_enabled = 0;

void hak_init(void) {
    int num_threads = get_num_threads();
    tls_enabled = (num_threads > 1);
}
Trade-off: Runtime overhead (branch per allocation) vs compile-time maintenance
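A minimal sketch of that runtime-gated allocation path, with stub helpers standing in for hakmem's real routines (the stubs and the gate placement are assumptions, not the actual code):

```c
#include <stddef.h>
#include <stdlib.h>

/* tls_enabled is set once in hak_init(), so the extra branch per
   allocation is highly predictable after warm-up. */
static int tls_enabled = 0;

/* Stand-ins for the real hakmem routines, for illustration only. */
static void *tls_cache_lookup(size_t size) { (void)size; return NULL; }
static void *hak_tiny_alloc_slow(size_t size) { return malloc(size); }

void *hak_tiny_alloc(size_t size) {
    if (tls_enabled) {                   /* the one extra branch per op */
        void *ptr = tls_cache_lookup(size);
        if (ptr) return ptr;             /* TLS hit: lock-free return */
    }
    return hak_tiny_alloc_slow(size);    /* shared slow path */
}
```

This keeps a single binary for both build targets, at the cost of one well-predicted branch instead of a compile-time `#if`.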
End of Decision Tree
This document provides a structured decision-making process for the TLS Freelist Cache question. The recommended approach is to re-measure TLS overhead in isolation (Step 1), then validate multi-threaded benefit with mimalloc-bench (Step 2), and make a data-driven decision (Step 3).