Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

2025-11-05 12:31:14 +09:00

4.4 KiB

Phase 6.15: Multi-threaded Safety - Quick Summary

Full Plan: PHASE_6.15_PLAN.md

🎯 3-Step Approach

Step 1: Documentation Updates (1 hour)

Fix PHASE_6.14_COMPLETION_REPORT.md (67.9M measurement issue)
Update CURRENT_TASK.md (Phase 6.15 status)
Create this plan document ✅

Step 2: P0 Safety Lock (2-3 hours)

Goal: Correctness first (no scaling expected; 4T should only match 1T)

Implementation (30 min):

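#include <pthread.h>
#include <stddef.h>   // size_t
#include <stdint.h>   // uintptr_t

// Global lock serializing every allocator entry point
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

// Wrap the allocation path; hak_free_at gets the same lock/unlock wrapper
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    pthread_mutex_lock(&g_hakmem_lock);
    void* ptr = hak_alloc_at_internal(size, site_id);
    pthread_mutex_unlock(&g_hakmem_lock);
    return ptr;
}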
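For the testing step, a minimal self-contained sketch of the same lock pattern with a multi-threaded smoke test may help. Only `g_hakmem_lock`, `hak_alloc_at`, and `hak_alloc_at_internal` come from the plan; the `hak_free_at` signature, `hak_free_at_internal`, `g_live_blocks`, and `run_lock_smoke_test` are hypothetical, and `malloc`/`free` stand in for the real internals.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t g_live_blocks = 0;  /* shared state the lock protects (hypothetical) */

/* Stand-ins for the real, non-thread-safe internals (assumption). */
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
    (void)site_id;
    void* p = malloc(size);
    if (p) g_live_blocks++;
    return p;
}

static void hak_free_at_internal(void* ptr) {
    if (!ptr) return;
    g_live_blocks--;
    free(ptr);
}

void* hak_alloc_at(size_t size, uintptr_t site_id) {
    pthread_mutex_lock(&g_hakmem_lock);
    void* ptr = hak_alloc_at_internal(size, site_id);
    pthread_mutex_unlock(&g_hakmem_lock);
    return ptr;
}

/* Assumed signature; the plan only names the function. */
void hak_free_at(void* ptr) {
    pthread_mutex_lock(&g_hakmem_lock);
    hak_free_at_internal(ptr);
    pthread_mutex_unlock(&g_hakmem_lock);
}

static void* worker(void* arg) {
    uintptr_t site = (uintptr_t)arg;
    for (int i = 0; i < 10000; i++) {
        void* p = hak_alloc_at(64, site);
        hak_free_at(p);
    }
    return NULL;
}

/* Runs nthreads workers and returns the live-block count (0 if balanced). */
size_t run_lock_smoke_test(int nthreads) {
    pthread_t threads[16];
    assert(nthreads > 0 && nthreads <= 16);
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, (void*)(uintptr_t)i);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(threads[i], NULL);
    }
    return g_live_blocks;  /* safe to read: all workers have been joined */
}
```

Without the lock, the unsynchronized `g_live_blocks` updates would be a data race that Helgrind should report; with it, the counter returns to zero after every run.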

Testing (1.5 hours):

larson 1T/4T benchmark
Helgrind race detection (expect: 0 errors)
Stability test (10 consecutive runs)

Expected Results:

1T: 13-15M ops/sec (lock overhead 0-15%)
4T: 13-15M ops/sec (same as 1T, safe but no scalability)
Critical: Zero crashes, zero data races ✅

Step 3: TLS Performance (8-10 hours)

P1: Tiny Pool TLS (2 hours)

static __thread TinySlab* tls_tiny_cache[8];  // Per-thread cache

Expected: 4T = 12-15M ops/sec (+264-355%)
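A minimal sketch of what the P1 fast path could look like, assuming an intrusive free list per size class. Only the `tls_tiny_cache[8]` idea comes from the plan; `TinyBlock` (used here in place of `TinySlab`), `tiny_alloc`, `tiny_free`, the class sizes, and the hit counters are hypothetical, and `malloc` stands in for the shared slab refill.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_CLASSES 8

/* Intrusive free-list node stored inside each free block (hypothetical). */
typedef struct TinyBlock { struct TinyBlock* next; } TinyBlock;

static __thread TinyBlock* tls_tiny_cache[TINY_CLASSES];  /* per-thread cache */
static __thread unsigned long tls_hits, tls_misses;

static pthread_mutex_t g_tiny_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed class sizes: 8, 16, ..., 1024 bytes. */
static size_t class_size(int cls) { return (size_t)8 << cls; }

void* tiny_alloc(int cls) {
    assert(cls >= 0 && cls < TINY_CLASSES);
    TinyBlock* b = tls_tiny_cache[cls];
    if (b) {                       /* fast path: no lock taken */
        tls_tiny_cache[cls] = b->next;
        tls_hits++;
        return b;
    }
    tls_misses++;                  /* slow path: go to the shared pool */
    pthread_mutex_lock(&g_tiny_lock);
    void* p = malloc(class_size(cls));  /* stand-in for the shared refill */
    pthread_mutex_unlock(&g_tiny_lock);
    return p;
}

void tiny_free(void* p, int cls) {
    assert(cls >= 0 && cls < TINY_CLASSES);
    if (!p) return;
    TinyBlock* b = (TinyBlock*)p;  /* push onto this thread's list */
    b->next = tls_tiny_cache[cls];
    tls_tiny_cache[cls] = b;
}

/* Fraction of allocations on this thread served from the TLS cache. */
double tiny_tls_hit_rate(void) {
    unsigned long total = tls_hits + tls_misses;
    return total ? (double)tls_hits / (double)total : 0.0;
}
```

The sketch never returns cached blocks to the shared pool, so it leaks on thread exit and lets a block freed on one thread stay there; a real implementation would need eviction and a remote-free policy.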

P2: L2 Pool TLS (3 hours)

static __thread L2Block* tls_l2_cache[5];

Expected: 4T = 15-18M ops/sec

P3: L2.5 Pool TLS Expansion (3 hours)

Existing: hakmem_l25_pool.c:26 already has TLS declaration
Missing: Refill/eviction logic
Expected: 4T = 18-22M ops/sec (+445-567%)
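One way the missing refill/eviction logic could be structured is sketched below, assuming a single block size, batch refill from a locked shared list, and eviction of half the cache once a cap is exceeded. Every name and constant here (`TLS_CAP`, `REFILL_BATCH`, `pool_alloc`, `pool_free`, and so on) is hypothetical and not taken from `hakmem_l25_pool.c`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define TLS_CAP 8        /* max blocks cached per thread (assumption) */
#define REFILL_BATCH 4   /* blocks pulled in per refill (assumption) */

typedef struct Block { struct Block* next; } Block;

static pthread_mutex_t g_pool_lock = PTHREAD_MUTEX_INITIALIZER;
static Block* g_pool_head = NULL;   /* shared free list */
static size_t g_pool_count = 0;

static __thread Block* tls_head;    /* per-thread free list */
static __thread size_t tls_count;

static void tls_push(Block* b) { b->next = tls_head; tls_head = b; tls_count++; }

/* Refill: take a batch from the shared list, then top up with fresh blocks. */
static void tls_refill(size_t block_size) {
    pthread_mutex_lock(&g_pool_lock);
    while (tls_count < REFILL_BATCH && g_pool_head) {
        Block* b = g_pool_head;
        g_pool_head = b->next;
        g_pool_count--;
        tls_push(b);
    }
    pthread_mutex_unlock(&g_pool_lock);
    while (tls_count < REFILL_BATCH) {
        Block* b = malloc(block_size);  /* stand-in for carving a new page */
        if (!b) break;
        tls_push(b);
    }
}

/* Eviction: return blocks to the shared list until half the cap remains. */
static void tls_evict(void) {
    pthread_mutex_lock(&g_pool_lock);
    while (tls_count > TLS_CAP / 2) {
        Block* b = tls_head;
        tls_head = b->next;
        tls_count--;
        b->next = g_pool_head;
        g_pool_head = b;
        g_pool_count++;
    }
    pthread_mutex_unlock(&g_pool_lock);
}

void* pool_alloc(size_t block_size) {
    assert(block_size >= sizeof(Block));
    if (!tls_head) tls_refill(block_size);
    Block* b = tls_head;
    if (!b) return NULL;
    tls_head = b->next;
    tls_count--;
    return b;
}

void pool_free(void* p) {
    if (!p) return;
    tls_push((Block*)p);
    if (tls_count > TLS_CAP) tls_evict();
}

size_t pool_tls_count(void) { return tls_count; }

size_t pool_global_count(void) {
    pthread_mutex_lock(&g_pool_lock);
    size_t n = g_pool_count;
    pthread_mutex_unlock(&g_pool_lock);
    return n;
}
```

Batching keeps the lock off the common path: a thread touches `g_pool_lock` once per `REFILL_BATCH` allocations or once per overflow, not once per call.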

📊 Performance Expectations

Phase	1-thread	4-thread	vs Baseline (3.3M)	Notes
Before	15.1M	3.3M	baseline (-78%)	UNSAFE
P0 (Lock)	13-15M	13-15M	+294-355%	Safe, no scaling
P0+P1 (Tiny TLS)	13-15M	12-15M	+264-355%	95%+ TLS hit
P0+P1+P2 (L2 TLS)	13-15M	15-18M	+355-445%	90%+ TLS hit
P0+P1+P2+P3 (All)	13-15M	18-22M	+445-567%	Full TLS
Phase 6.13 Actual	17.8M	15.9M	+381% ✅	PROVEN

Validation: Phase 6.13 already proved TLS achieves 15.9M ops/sec at 4 threads ✅
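The "vs Baseline" column appears to be the relative gain over the 3.3M 4-thread baseline. A small helper, with the hypothetical name `gain_pct`, reproduces the arithmetic under that assumption:

```c
#include <assert.h>

/* Relative change of `mops` against `baseline_mops`, in percent.
   Example: 13M vs 3.3M is (13 / 3.3 - 1) * 100, roughly +294%. */
double gain_pct(double mops, double baseline_mops) {
    assert(baseline_mops > 0.0);
    return (mops / baseline_mops - 1.0) * 100.0;
}
```

The same formula gives the -78% in the "Before" row: 3.3M at 4T against 15.1M at 1T.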

🎯 Success Criteria

Minimum (P0 only)

✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
✅ Zero race conditions (Helgrind)
✅ 10/10 stability runs

Target (P0+P1+P2)

✅ 4T ≥ 15M ops/sec (+355%)
✅ TLS hit rate ≥ 90%
✅ No 1T regression (≤15%)

Stretch (All Phases)

✅ 4T ≥ 18M ops/sec (+445%)
✅ 16T ≥ 11.6M ops/sec (match the system allocator)

📋 Quick Checklist

Step 1: Documentation ✅

Fix PHASE_6.14_COMPLETION_REPORT.md
Update CURRENT_TASK.md
Verify with grep commands

Step 2: P0 Safety Lock

Add pthread.h + global lock
Wrap hak_alloc_at/hak_free_at
Test: larson 1T/4T
Test: Helgrind (expect: 0 errors)
Test: 10 stability runs
Document: PHASE_6.15_P0_RESULTS.md

Step 3: TLS Performance

P1: Tiny Pool TLS (2h) → 12-15M ops/sec
P2: L2 Pool TLS (3h) → 15-18M ops/sec
P3: L2.5 Pool TLS (3h) → 18-22M ops/sec
Final: Validation + completion report

⏱️ Timeline

Day 1: Step 1 (1h) + Step 2 (2-3h)
Day 2: P1 Tiny TLS (2h)
Day 3-4: P2 L2 TLS (3h)
Day 5: P3 L2.5 TLS (3h)
Day 6: Final validation (1h)

Total: 12-13 hours over 6 days

🔗 Key References

Full Plan: PHASE_6.15_PLAN.md - Detailed implementation guide
Thread Safety Analysis: THREAD_SAFETY_SOLUTION.md - Option A/B/C comparison
TLS Validation: PHASE_6.13_INITIAL_RESULTS.md - Proof that TLS works (+147%)
Current Issue: PHASE_6.14_COMPLETION_REPORT.md - 4-thread collapse discovery

💡 Key Insights

67.9M was a measurement error - Actual 1T performance is 15.1M
The 4-thread collapse is a thread-safety bug, not a scaling limit - Throughput drops to 3.3M at 4T against 15.1M at 1T
TLS is proven to work - Phase 6.13 achieved 15.9M at 4T (+381%)
P0 is the safety net - Even if the TLS work fails, we still have a working thread-safe allocator
Gradual rollout is key - P0 → P1 → P2 → P3 (each validated independently)

Status: ✅ Ready to Execute
Next Action: Start with Step 1 (Documentation, 1 hour)
