Phase 6.15: Multi-threaded Safety - Quick Summary
Full Plan: PHASE_6.15_PLAN.md
🎯 3-Step Approach
Step 1: Documentation Updates (1 hour)
- Fix PHASE_6.14_COMPLETION_REPORT.md (67.9M measurement issue)
- Update CURRENT_TASK.md (Phase 6.15 status)
- Create this plan document ✅
Step 2: P0 Safety Lock (2-3 hours)
Goal: Correctness first (no scaling beyond 1T throughput expected)
Implementation (30 min):
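    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>

    // Add global lock
    static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

    // Wrap malloc/free: every call runs under the global lock.
    // hak_free_at gets the same lock/unlock wrapping.
    void* hak_alloc_at(size_t size, uintptr_t site_id) {
        pthread_mutex_lock(&g_hakmem_lock);
        void* ptr = hak_alloc_at_internal(size, site_id);
        pthread_mutex_unlock(&g_hakmem_lock);
        return ptr;
    }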
Testing (1.5 hours):
- larson 1T/4T benchmark
- Helgrind race detection (expect: 0 errors)
- Stability test (10 consecutive runs)
Expected Results:
- 1T: 13-15M ops/sec (lock overhead 0-15%)
- 4T: 13-15M ops/sec (same as 1T, safe but no scalability)
- Critical: Zero crashes, zero data races ✅
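The Helgrind and stability checks need a multi-threaded workload that only stays correct if the lock really serializes every call. The following is a minimal sketch of such a harness, not the project's actual test. All `demo_*` names, `g_live_blocks`, and `run_stress` are hypothetical, and plain `malloc`/`free` stand in for the internal allocator.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the allocator's global lock. */
static pthread_mutex_t g_demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Deliberately non-atomic bookkeeping: only correct while callers hold the lock. */
static long g_live_blocks = 0;

static void* demo_alloc_internal(size_t size) {
    g_live_blocks++;
    return malloc(size);
}

static void demo_free_internal(void* p) {
    g_live_blocks--;
    free(p);
}

/* Locked wrappers, following the same pattern as the P0 wrapper. */
static void* demo_alloc(size_t size) {
    pthread_mutex_lock(&g_demo_lock);
    void* p = demo_alloc_internal(size);
    pthread_mutex_unlock(&g_demo_lock);
    return p;
}

static void demo_free(void* p) {
    pthread_mutex_lock(&g_demo_lock);
    demo_free_internal(p);
    pthread_mutex_unlock(&g_demo_lock);
}

/* Each worker performs `iters` alloc/free pairs of varying small sizes. */
static void* demo_worker(void* arg) {
    int iters = *(int*)arg;
    for (int i = 0; i < iters; i++) {
        void* p = demo_alloc(16 + (size_t)(i % 8) * 16);
        assert(p != NULL);
        demo_free(p);
    }
    return NULL;
}

/* Runs up to 64 workers and returns the live-block count afterwards.
 * A result of 0 means the unsynchronized counter stayed consistent. */
long run_stress(int nthreads, int iters) {
    pthread_t tids[64];
    if (nthreads > 64) nthreads = 64;
    g_live_blocks = 0;
    for (int t = 0; t < nthreads; t++) {
        int rc = pthread_create(&tids[t], NULL, demo_worker, &iters);
        assert(rc == 0);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    return g_live_blocks;
}
```

A small `main` that calls `run_stress(4, 100000)` can be run under `valgrind --tool=helgrind` for the race check, or looped ten times for the stability check. Removing the lock from the wrappers should make Helgrind report races on `g_live_blocks`.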
Step 3: TLS Performance (8-10 hours)
P1: Tiny Pool TLS (2 hours)
static __thread TinySlab* tls_tiny_cache[8]; // Per-thread cache
Expected: 4T = 12-15M ops/sec (+264-355%)
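To illustrate how a per-thread cache like `tls_tiny_cache` can bypass the global lock, here is a minimal sketch under stated assumptions. `DemoBlock`, the `demo_*` functions, and the 16-byte class spacing are all hypothetical, and `malloc` stands in for the shared pool. The real `TinySlab` layout is not shown in this document.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NUM_CLASSES 8  /* mirrors the [8] in tls_tiny_cache */

/* A freed block is reused as a singly linked free-list node. */
typedef struct DemoBlock { struct DemoBlock* next; } DemoBlock;

/* Per-thread free lists and counters: accessed without any locking. */
static __thread DemoBlock* tls_demo_cache[DEMO_NUM_CLASSES];
static __thread unsigned long tls_demo_hits;
static __thread unsigned long tls_demo_misses;

static pthread_mutex_t g_demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Assumed size classes: 16, 32, ..., 128 bytes. */
static size_t demo_class_size(int cls) { return (size_t)(cls + 1) * 16; }

/* Slow path: the shared pool (here plain malloc) is reached under the lock. */
static void* demo_slow_alloc(int cls) {
    pthread_mutex_lock(&g_demo_lock);
    void* p = malloc(demo_class_size(cls));
    pthread_mutex_unlock(&g_demo_lock);
    return p;
}

/* Fast path: pop from the calling thread's list, no lock taken. */
void* demo_tiny_alloc(int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    DemoBlock* b = tls_demo_cache[cls];
    if (b != NULL) {
        tls_demo_cache[cls] = b->next;
        tls_demo_hits++;
        return b;
    }
    tls_demo_misses++;
    return demo_slow_alloc(cls);
}

/* Free pushes the block onto the calling thread's list. */
void demo_tiny_free(void* p, int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    DemoBlock* b = p;
    b->next = tls_demo_cache[cls];
    tls_demo_cache[cls] = b;
}

unsigned long demo_tls_hits(void)   { return tls_demo_hits; }
unsigned long demo_tls_misses(void) { return tls_demo_misses; }
```

The hit rate for a thread is `hits / (hits + misses)`. This sketch never returns blocks to a shared pool and does not handle blocks freed by a different thread than the one that allocated them, both of which a real implementation would have to address.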
P2: L2 Pool TLS (3 hours)
static __thread L2Block* tls_l2_cache[5];
Expected: 4T = 15-18M ops/sec
P3: L2.5 Pool TLS Expansion (3 hours)
Existing: hakmem_l25_pool.c:26 already has TLS declaration
Missing: Refill/eviction logic
Expected: 4T = 18-22M ops/sec (+445-567%)
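The missing refill/eviction logic could take roughly the following shape. This is a sketch under assumptions, not the contents of `hakmem_l25_pool.c`. The capacity, batch size, block size, and every `demo_*` name are hypothetical, and a single size class is shown for brevity.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_TLS_CAP      32  /* assumed per-thread cache capacity */
#define DEMO_REFILL_BATCH 8   /* assumed number of blocks moved per refill */
#define DEMO_BLOCK_SIZE   64  /* assumed block size for this one class */

typedef struct DemoBlock { struct DemoBlock* next; } DemoBlock;

/* Shared pool: only touched while holding g_pool_lock. */
static pthread_mutex_t g_pool_lock = PTHREAD_MUTEX_INITIALIZER;
static DemoBlock* g_pool_head = NULL;
static size_t g_pool_count = 0;

/* Per-thread cache: touched without locking. */
static __thread DemoBlock* tls_head = NULL;
static __thread size_t tls_count = 0;

/* Refill: one lock acquisition moves a whole batch into the TLS cache. */
static void demo_refill(void) {
    pthread_mutex_lock(&g_pool_lock);
    for (int i = 0; i < DEMO_REFILL_BATCH; i++) {
        DemoBlock* b = g_pool_head;
        if (b != NULL) {
            g_pool_head = b->next;
            g_pool_count--;
        } else {
            b = malloc(DEMO_BLOCK_SIZE);  /* pool empty: get fresh memory */
            if (b == NULL) break;
        }
        b->next = tls_head;
        tls_head = b;
        tls_count++;
    }
    pthread_mutex_unlock(&g_pool_lock);
}

/* Eviction: hand half of a full cache back to the shared pool. */
static void demo_evict(void) {
    pthread_mutex_lock(&g_pool_lock);
    while (tls_count > DEMO_TLS_CAP / 2) {
        DemoBlock* b = tls_head;
        tls_head = b->next;
        tls_count--;
        b->next = g_pool_head;
        g_pool_head = b;
        g_pool_count++;
    }
    pthread_mutex_unlock(&g_pool_lock);
}

void* demo_l25_alloc(void) {
    if (tls_head == NULL) demo_refill();
    DemoBlock* b = tls_head;
    if (b == NULL) return NULL;  /* refill failed: out of memory */
    tls_head = b->next;
    tls_count--;
    return b;
}

void demo_l25_free(void* p) {
    assert(p != NULL);
    DemoBlock* b = p;
    b->next = tls_head;
    tls_head = b;
    tls_count++;
    if (tls_count >= DEMO_TLS_CAP) demo_evict();
}

size_t demo_tls_count(void) { return tls_count; }

size_t demo_pool_count(void) {
    pthread_mutex_lock(&g_pool_lock);
    size_t n = g_pool_count;
    pthread_mutex_unlock(&g_pool_lock);
    return n;
}
```

Batching is the design point: the lock is taken once per `DEMO_REFILL_BATCH` allocations on the miss path and once per eviction, rather than on every call. Evicting only half the cache leaves room for further frees without immediately forcing another eviction.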
📊 Performance Expectations
| Phase | 1-thread | 4-thread | vs Baseline (3.3M) | Notes |
|---|---|---|---|---|
| Before | 15.1M | 3.3M | baseline (-78%) | UNSAFE |
| P0 (Lock) | 13-15M | 13-15M | +294-355% | Safe, no scaling |
| P0+P1 (Tiny TLS) | 13-15M | 12-15M | +264-355% | 95%+ TLS hit |
| P0+P1+P2 (L2 TLS) | 13-15M | 15-18M | +355-445% | 90%+ TLS hit |
| P0+P1+P2+P3 (All) | 13-15M | 18-22M | +445-567% | Full TLS |
| Phase 6.13 Actual | 17.8M | 15.9M | +381% ✅ | PROVEN |
Validation: Phase 6.13 already proved TLS achieves 15.9M ops/sec at 4 threads ✅
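The "vs Baseline" column is the relative change against the 3.3M ops/sec 4-thread baseline. As a quick way to recheck the projected ranges, the arithmetic can be sketched as below. The function name `pct_vs_baseline` is hypothetical.

```c
#include <assert.h>

/* Relative change of `ops_m` against `baseline_m`, in percent.
 * Both arguments are throughputs in M ops/sec. */
double pct_vs_baseline(double ops_m, double baseline_m) {
    assert(baseline_m > 0.0);
    return (ops_m / baseline_m - 1.0) * 100.0;
}
```

For example, `pct_vs_baseline(13.0, 3.3)` is about +293.9 and `pct_vs_baseline(22.0, 3.3)` is about +566.7, which round to the +294% and +567% shown in the table. The same function gives about -78.1 for `pct_vs_baseline(3.3, 15.1)`, the 4T collapse relative to 1T.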
🎯 Success Criteria
Minimum (P0 only)
- ✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
- ✅ Zero race conditions (Helgrind)
- ✅ 10/10 stability runs
Target (P0+P1+P2)
- ✅ 4T ≥ 15M ops/sec (+355%)
- ✅ TLS hit rate ≥ 90%
- ✅ 1T regression ≤ 15%
Stretch (All Phases)
- ✅ 4T ≥ 18M ops/sec (+445%)
- ✅ 16T ≥ 11.6M ops/sec (match the system allocator)
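The three tiers above can be checked mechanically once benchmark numbers are in. The sketch below encodes the thresholds as listed. The `DemoResult` struct and `demo_classify` are hypothetical, and it assumes the 1T regression is measured against the 15.1M ops/sec "Before" figure, which this document does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { TIER_FAIL = 0, TIER_MINIMUM, TIER_TARGET, TIER_STRETCH } DemoTier;

typedef struct {
    double ops_1t_m;       /* measured 1T throughput, M ops/sec */
    double ops_4t_m;       /* measured 4T throughput, M ops/sec */
    double ops_16t_m;      /* measured 16T throughput, M ops/sec */
    double tls_hit_rate;   /* fraction in 0.0 .. 1.0 */
    int helgrind_errors;   /* races reported by Helgrind */
    int stable_runs;       /* successful runs out of 10 */
} DemoResult;

/* Assumption: regression is relative to the 15.1M "Before" 1T figure. */
#define DEMO_BASELINE_1T_M 15.1

/* Returns the highest tier whose criteria, and all lower tiers', are met. */
DemoTier demo_classify(const DemoResult* r) {
    assert(r != NULL);

    bool minimum = r->ops_4t_m >= 13.0
                && r->helgrind_errors == 0
                && r->stable_runs == 10;
    if (!minimum) return TIER_FAIL;

    double regression = (DEMO_BASELINE_1T_M - r->ops_1t_m) / DEMO_BASELINE_1T_M;
    bool target = r->ops_4t_m >= 15.0
               && r->tls_hit_rate >= 0.90
               && regression <= 0.15;
    if (!target) return TIER_MINIMUM;

    bool stretch = r->ops_4t_m >= 18.0 && r->ops_16t_m >= 11.6;
    return stretch ? TIER_STRETCH : TIER_TARGET;
}
```

Treating the tiers as cumulative is a design choice of this sketch: a run with high throughput but a Helgrind error is classified as a failure, in line with the "correctness first" goal of Step 2.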
📋 Quick Checklist
Step 1: Documentation ✅
- Fix PHASE_6.14_COMPLETION_REPORT.md
- Update CURRENT_TASK.md
- Verify with grep commands
Step 2: P0 Safety Lock
- Add pthread.h + global lock
- Wrap hak_alloc_at/hak_free_at
- Test: larson 1T/4T
- Test: Helgrind (expect: 0 errors)
- Test: 10 stability runs
- Document: PHASE_6.15_P0_RESULTS.md
Step 3: TLS Performance
- P1: Tiny Pool TLS (2h) → 12-15M ops/sec
- P2: L2 Pool TLS (3h) → 15-18M ops/sec
- P3: L2.5 Pool TLS (3h) → 18-22M ops/sec
- Final: Validation + completion report
⏱️ Timeline
- Day 1: Step 1 (1h) + Step 2 (2-3h)
- Day 2: P1 Tiny TLS (2h)
- Day 3-4: P2 L2 TLS (3h)
- Day 5: P3 L2.5 TLS (3h)
- Day 6: Final validation (1h)
Total: 12-13 hours over 6 days
🔗 Key References
- Full Plan: PHASE_6.15_PLAN.md - Detailed implementation guide
- Thread Safety Analysis: THREAD_SAFETY_SOLUTION.md - Option A/B/C comparison
- TLS Validation: PHASE_6.13_INITIAL_RESULTS.md - Proof that TLS works (+147%)
- Current Issue: PHASE_6.14_COMPLETION_REPORT.md - 4-thread collapse discovery
💡 Key Insights
- 67.9M was measurement error - Actual performance is 15.1M (1T)
- 4-thread collapse is a correctness bug, not a scaling limit - The allocator is not thread-safe (3.3M at 4T vs the 15.1M 1T baseline)
- TLS is proven to work - Phase 6.13 achieved 15.9M at 4T (+381%)
- P0 is the safety net - Even if TLS fails, we still have a working thread-safe allocator
- Gradual rollout is key - P0 → P1 → P2 → P3 (each validated independently)
Status: ✅ Ready to Execute
Next Action: Start with Step 1 (Documentation, 1 hour)