# Phase 6.15: Multi-threaded Safety - Quick Summary

**Full Plan**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)

---

## 🎯 **3-Step Approach**

### **Step 1: Documentation Updates** (1 hour)
- Fix PHASE_6.14_COMPLETION_REPORT.md (67.9M measurement issue)
- Update CURRENT_TASK.md (Phase 6.15 status)
- Create this plan document ✅

### **Step 2: P0 Safety Lock** (2-3 hours)
**Goal**: Correctness first (no performance improvement expected)

**Implementation** (30 min):
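```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

// Existing unlocked implementation, defined elsewhere
void* hak_alloc_at_internal(size_t size, uintptr_t site_id);

// Add global lock: one process-wide mutex serializes every allocator call
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

// Wrap malloc/free: take the lock, delegate, release
// (hak_free_at gets the same treatment)
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    pthread_mutex_lock(&g_hakmem_lock);
    void* ptr = hak_alloc_at_internal(size, site_id);
    pthread_mutex_unlock(&g_hakmem_lock);
    return ptr;
}
```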
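
The wrapper above covers allocation only; the free path mutates the same allocator state and needs the same lock. The following is a minimal, self-contained sketch of that idea, not hakmem code: `toy_alloc`, `toy_free`, the `*_internal` helpers, and `run_stress` are hypothetical names, and the "allocator" is a toy fixed-size free list chosen only because it would corrupt itself without the mutex. A call such as `run_stress(4, 10000)` mimics the 4T case and should report zero live blocks when every allocation was matched by a free.

```c
#include <pthread.h>
#include <stddef.h>

#define TOY_BLOCKS      64
#define TOY_MAX_THREADS 16

// Hypothetical stand-in for the unlocked allocator: a LIFO free list of
// fixed-size blocks. Nothing below is thread-safe on its own.
typedef struct ToyBlock { struct ToyBlock* next; char payload[56]; } ToyBlock;

static ToyBlock  g_toy_storage[TOY_BLOCKS];
static ToyBlock* g_toy_free_list = NULL;
static int       g_toy_ready = 0;
static int       g_toy_live  = 0;   // blocks currently handed out

static pthread_mutex_t g_toy_lock = PTHREAD_MUTEX_INITIALIZER;

static void* toy_alloc_internal(void) {
    if (!g_toy_ready) {             // lazy init; callers already hold the lock
        for (int i = 0; i < TOY_BLOCKS; i++) {
            g_toy_storage[i].next = g_toy_free_list;
            g_toy_free_list = &g_toy_storage[i];
        }
        g_toy_ready = 1;
    }
    ToyBlock* b = g_toy_free_list;
    if (b == NULL) return NULL;     // pool exhausted
    g_toy_free_list = b->next;
    g_toy_live++;
    return b;
}

static void toy_free_internal(void* p) {
    ToyBlock* b = (ToyBlock*)p;
    b->next = g_toy_free_list;
    g_toy_free_list = b;
    g_toy_live--;
}

// Locked wrappers: both paths take the same global mutex.
void* toy_alloc(void) {
    pthread_mutex_lock(&g_toy_lock);
    void* p = toy_alloc_internal();
    pthread_mutex_unlock(&g_toy_lock);
    return p;
}

void toy_free(void* p) {
    if (p == NULL) return;
    pthread_mutex_lock(&g_toy_lock);
    toy_free_internal(p);
    pthread_mutex_unlock(&g_toy_lock);
}

static void* toy_worker(void* arg) {
    int iters = *(int*)arg;
    for (int i = 0; i < iters; i++) {
        toy_free(toy_alloc());      // alloc/free pair, as in a churn benchmark
    }
    return NULL;
}

// Runs up to TOY_MAX_THREADS workers and returns the number of blocks still
// live afterwards (0 when balanced), or -1 if a thread could not be created.
int run_stress(int nthreads, int iters) {
    pthread_t tids[TOY_MAX_THREADS];
    int started = 0;
    if (nthreads > TOY_MAX_THREADS) nthreads = TOY_MAX_THREADS;
    for (int t = 0; t < nthreads; t++) {
        if (pthread_create(&tids[t], NULL, toy_worker, &iters) != 0) break;
        started++;
    }
    for (int t = 0; t < started; t++) pthread_join(tids[t], NULL);
    if (started < nthreads) return -1;
    pthread_mutex_lock(&g_toy_lock);
    int live = g_toy_live;
    pthread_mutex_unlock(&g_toy_lock);
    return live;
}
```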

**Testing** (1.5 hours):
1. larson 1T/4T benchmark
2. Helgrind race detection (expect: 0 errors)
3. Stability test (10 consecutive runs)

**Expected Results**:
- 1T: 13-15M ops/sec (lock overhead 0-15%)
- 4T: 13-15M ops/sec (same as 1T, safe but no scalability)
- **Critical**: Zero crashes, zero data races ✅

---

### **Step 3: TLS Performance** (8-10 hours)

#### **P1: Tiny Pool TLS** (2 hours)
```c
static __thread TinySlab* tls_tiny_cache[8]; // Per-thread cache
```
**Expected**: 4T = 12-15M ops/sec (+264-355%)

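
A per-thread cache removes the lock from the common path: a free pushes the block onto the calling thread's own list, and an allocation pops from it, so only a cache miss reaches shared state. A minimal sketch under stated assumptions follows. `TinyNode`, `tiny_alloc_sketch`, `tiny_free_sketch`, and the size-class formula are hypothetical (the source shows only the `tls_tiny_cache[8]` declaration), and the slow path simply calls `malloc` under a mutex as a stand-in for the P0 locked path.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_CLASSES 8                               // same count as the 8-entry array above
#define TINY_BLOCK_SIZE(c) ((size_t)16 << (c))       // assumption: 16B .. 2KiB classes

// A free block stores the link to the next free block in its own memory.
typedef struct TinyNode { struct TinyNode* next; } TinyNode;

// One free list per size class, private to each thread.
static __thread TinyNode* tls_free_list[TINY_CLASSES];

// Stand-in for the P0 global lock guarding the shared allocator.
static pthread_mutex_t g_slow_lock = PTHREAD_MUTEX_INITIALIZER;

void* tiny_alloc_sketch(int cls) {
    if (cls < 0 || cls >= TINY_CLASSES) return NULL;

    // Fast path: pop from this thread's list, no lock taken.
    TinyNode* n = tls_free_list[cls];
    if (n != NULL) {
        tls_free_list[cls] = n->next;
        return n;
    }

    // Slow path: TLS miss, fall back to the locked allocator.
    pthread_mutex_lock(&g_slow_lock);
    void* p = malloc(TINY_BLOCK_SIZE(cls));
    pthread_mutex_unlock(&g_slow_lock);
    return p;
}

void tiny_free_sketch(void* p, int cls) {
    if (p == NULL || cls < 0 || cls >= TINY_CLASSES) return;

    // Push onto the freeing thread's list; no lock, no eviction in this sketch.
    TinyNode* n = (TinyNode*)p;
    n->next = tls_free_list[cls];
    tls_free_list[cls] = n;
}
```

Because the list is LIFO, freeing a block and allocating again from the same thread and class returns the same pointer without touching the mutex. This sketch never hands blocks back to a shared pool, so a thread that only frees would grow its cache without bound; a real design needs a cap.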
#### **P2: L2 Pool TLS** (3 hours)
```c
static __thread L2Block* tls_l2_cache[5]; // Per-thread cache
```
**Expected**: 4T = 15-18M ops/sec (+355-445%)

#### **P3: L2.5 Pool TLS Expansion** (3 hours)
**Existing**: `hakmem_l25_pool.c:26` already has a TLS declaration
**Missing**: Refill/eviction logic
**Expected**: 4T = 18-22M ops/sec (+445-567%)

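
One common shape for refill/eviction logic is a pair of batch operations: on a TLS miss, pull several blocks from the shared pool under a single lock acquisition; when the TLS list grows past a cap, push a batch back. Batching spreads the cost of the lock over many operations. The sketch below illustrates that general pattern for a single size class. It is not taken from `hakmem_l25_pool.c`: every name (`sketch_alloc`, `sketch_free`, `SKETCH_REFILL_BATCH`, `SKETCH_TLS_CAP`) and both constants are assumptions, and `malloc` stands in for carving blocks out of the pool's backing memory.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_REFILL_BATCH 8    // blocks moved per refill (assumed value)
#define SKETCH_TLS_CAP      32   // evict when the TLS list exceeds this (assumed value)

typedef struct SketchNode { struct SketchNode* next; } SketchNode;

static __thread SketchNode* tls_head;    // per-thread free list
static __thread int         tls_count;

static SketchNode*     g_shared_head;    // shared free list, guarded by g_shared_lock
static int             g_shared_count;
static pthread_mutex_t g_shared_lock = PTHREAD_MUTEX_INITIALIZER;

// Refill: one lock acquisition moves a whole batch into the TLS list.
// Blocks come from the shared list first, then from malloc.
static void sketch_refill(size_t block_size) {
    pthread_mutex_lock(&g_shared_lock);
    for (int i = 0; i < SKETCH_REFILL_BATCH; i++) {
        SketchNode* n = g_shared_head;
        if (n != NULL) {
            g_shared_head = n->next;
            g_shared_count--;
        } else {
            n = (SketchNode*)malloc(block_size);
            if (n == NULL) break;
        }
        n->next = tls_head;
        tls_head = n;
        tls_count++;
    }
    pthread_mutex_unlock(&g_shared_lock);
}

// Evict: return blocks to the shared list until the TLS list is half the cap.
static void sketch_evict(void) {
    pthread_mutex_lock(&g_shared_lock);
    while (tls_count > SKETCH_TLS_CAP / 2) {
        SketchNode* n = tls_head;
        tls_head = n->next;
        tls_count--;
        n->next = g_shared_head;
        g_shared_head = n;
        g_shared_count++;
    }
    pthread_mutex_unlock(&g_shared_lock);
}

// block_size must be the same on every call and at least sizeof(SketchNode).
void* sketch_alloc(size_t block_size) {
    if (block_size < sizeof(SketchNode)) return NULL;
    if (tls_head == NULL) sketch_refill(block_size);
    SketchNode* n = tls_head;
    if (n == NULL) return NULL;          // refill failed (out of memory)
    tls_head = n->next;
    tls_count--;
    return n;
}

void sketch_free(void* p) {
    if (p == NULL) return;
    SketchNode* n = (SketchNode*)p;
    n->next = tls_head;
    tls_head = n;
    tls_count++;
    if (tls_count > SKETCH_TLS_CAP) sketch_evict();
}

// Introspection helpers for tests.
int sketch_tls_count(void) { return tls_count; }

int sketch_shared_count(void) {
    pthread_mutex_lock(&g_shared_lock);
    int c = g_shared_count;
    pthread_mutex_unlock(&g_shared_lock);
    return c;
}
```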

---

## 📊 **Performance Expectations**

| Phase | 1-thread (ops/sec) | 4-thread (ops/sec) | 4T vs Baseline (3.3M) | Notes |
|-------|----------|----------|-------------------|-------|
| **Before** | 15.1M | **3.3M** | baseline (-78% vs 1T) | UNSAFE |
| **P0 (Lock)** | 13-15M | 13-15M | +294-355% | Safe, no scaling |
| **P0+P1 (Tiny TLS)** | 13-15M | 12-15M | +264-355% | 95%+ TLS hit |
| **P0+P1+P2 (L2 TLS)** | 13-15M | 15-18M | +355-445% | 90%+ TLS hit |
| **P0+P1+P2+P3 (All)** | 13-15M | 18-22M | +445-567% | Full TLS |
| **Phase 6.13 Actual** | 17.8M | **15.9M** | **+381%** ✅ | **PROVEN** |

**Validation**: Phase 6.13 already proved TLS achieves **15.9M ops/sec** at 4 threads ✅

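
The "vs Baseline" column is the percent change of the 4-thread figure relative to the 3.3M ops/sec baseline. A small helper makes the arithmetic explicit (`pct_vs_baseline` is a hypothetical name used only for illustration); for example, 15.9M against 3.3M works out to about +381.8%, which the table lists as +381%.

```c
// Percent change of `value` relative to `baseline`, both in M ops/sec.
// Example: pct_vs_baseline(13.0, 3.3) is roughly 293.9, listed as +294% in the table.
double pct_vs_baseline(double value, double baseline) {
    return (value - baseline) / baseline * 100.0;
}
```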

---

## 🎯 **Success Criteria**

### **Minimum (P0 only)**
- ✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
- ✅ Zero race conditions (Helgrind)
- ✅ 10/10 stability runs

### **Target (P0+P1+P2)**
- ✅ 4T ≥ 15M ops/sec (+355%)
- ✅ TLS hit rate ≥ 90%
- ✅ No 1T regression (≤15%)

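
The TLS hit-rate criterion needs a counter on both the fast path and the fallback path. One low-overhead way to get it, sketched here with hypothetical names (`tls_stat_record`, `tls_stat_hit_rate_percent`), is to keep the counters per thread so that measuring does not reintroduce shared writes. Aggregation across threads is left out of this sketch.

```c
#include <stdint.h>

// Per-thread counters: incrementing them never touches shared memory.
static __thread uint64_t t_tls_hits;
static __thread uint64_t t_tls_misses;

// Call with hit != 0 when the TLS cache served the request,
// and hit == 0 when the allocator fell back to the locked path.
void tls_stat_record(int hit) {
    if (hit) {
        t_tls_hits++;
    } else {
        t_tls_misses++;
    }
}

// Hit rate of the calling thread in percent; 0.0 before any request.
double tls_stat_hit_rate_percent(void) {
    uint64_t total = t_tls_hits + t_tls_misses;
    if (total == 0) return 0.0;
    return 100.0 * (double)t_tls_hits / (double)total;
}
```

A thread that records 9 hits and 1 miss reports 90%, which is the threshold in the Target criteria.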
### **Stretch (All Phases)**
- ✅ 4T ≥ 18M ops/sec (+445%)
- ✅ 16T ≥ 11.6M ops/sec (match system)

---

## 📋 **Quick Checklist**

**Step 1**: Documentation ✅
- [ ] Fix PHASE_6.14_COMPLETION_REPORT.md
- [ ] Update CURRENT_TASK.md
- [ ] Verify with grep commands

**Step 2**: P0 Safety Lock
- [ ] Add pthread.h + global lock
- [ ] Wrap hak_alloc_at/hak_free_at
- [ ] Test: larson 1T/4T
- [ ] Test: Helgrind (expect: 0 errors)
- [ ] Test: 10 stability runs
- [ ] Document: PHASE_6.15_P0_RESULTS.md

**Step 3**: TLS Performance
- [ ] P1: Tiny Pool TLS (2h) → 12-15M ops/sec
- [ ] P2: L2 Pool TLS (3h) → 15-18M ops/sec
- [ ] P3: L2.5 Pool TLS (3h) → 18-22M ops/sec
- [ ] Final: Validation + completion report

---

## ⏱️ **Timeline**

**Day 1**: Step 1 (1h) + Step 2 (2-3h)
**Day 2**: P1 Tiny TLS (2h)
**Day 3-4**: P2 L2 TLS (3h)
**Day 5**: P3 L2.5 TLS (3h)
**Day 6**: Final validation (1h)

**Total**: 12-13 hours over 6 days

---

## 🔗 **Key References**

- **Full Plan**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - Detailed implementation guide
- **Thread Safety Analysis**: [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Option A/B/C comparison
- **TLS Validation**: [PHASE_6.13_INITIAL_RESULTS.md](PHASE_6.13_INITIAL_RESULTS.md) - Proof that TLS works (+147%)
- **Current Issue**: [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - 4-thread collapse discovery

---

## 💡 **Key Insights**

1. **67.9M was a measurement error** - Actual performance is 15.1M (1T)
2. **4-thread collapse is NOT a feature** - It's complete thread-unsafety (3.3M at 4T vs 15.1M at 1T)
3. **TLS is proven to work** - Phase 6.13 achieved 15.9M at 4T (+381%)
4. **P0 is a safety net** - Even if TLS fails, we have a working thread-safe allocator
5. **Gradual rollout is key** - P0 → P1 → P2 → P3 (each validated independently)

---

**Status**: ✅ **Ready to Execute**
**Next Action**: Start with Step 1 (Documentation, 1 hour)