Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
1009 lines
28 KiB
Markdown
1009 lines
28 KiB
Markdown
# Phase 6.15: Multi-threaded Safety + TLS Performance - Implementation Plan
|
||
|
||
**Date**: 2025-10-22
|
||
**Status**: 📋 **Planning Complete**
|
||
**Total Time**: 12-13 hours (3 weeks)
|
||
|
||
---
|
||
|
||
## 📊 **Executive Summary**
|
||
|
||
### **Current Problem**
|
||
hakmem allocator is **completely thread-unsafe** with catastrophic multi-threaded performance:
|
||
|
||
| Threads | Performance (ops/sec) | vs 1-thread |
|
||
|---------|----------------------|-------------|
|
||
| **1-thread** | 15.1M ops/sec | baseline |
|
||
| **4-thread** | 3.3M ops/sec | **-78% slower** ❌ |
|
||
|
||
**Root Cause**: Zero thread synchronization primitives in current codebase (no `pthread_mutex` anywhere)
|
||
|
||
### **Solution Strategy**
|
||
|
||
**3-Stage Gradual Implementation**:
|
||
1. **Step 1**: Document updates (1 hour) - Fix 67.9M measurement issue, create Phase 6.15 plan
|
||
2. **Step 2**: P0 Safety Lock (30 min + testing) - Ensure correctness with minimal changes
|
||
3. **Step 3**: TLS Performance (8-10 hours) - Achieve 4T = 15M ops/sec (+381% validated)
|
||
|
||
**Expected Outcome**:
|
||
- **Minimum Success** (P0): 4T = 1T performance (safe, no scalability)
|
||
- **Target Success** (P0+P1): 4T = 12-15M ops/sec (+264-355%)
|
||
- **Validated** (Phase 6.13): 4T = **15.9M ops/sec** (+381%) ✅ **ALREADY PROVEN**
|
||
|
||
---
|
||
|
||
## 🎯 **Step 1: Documentation Updates** (1 hour)
|
||
|
||
### **Task 1.1: Fix Phase 6.14 Completion Report** (15 minutes)
|
||
|
||
**File**: `apps/experiments/hakmem-poc/PHASE_6.14_COMPLETION_REPORT.md`
|
||
|
||
**Current Problem**:
|
||
- Report focuses on Registry ON/OFF toggle
|
||
- No mention of 67.9M ops/sec measurement issue
|
||
- Misleading performance claims
|
||
|
||
**Required Changes**:
|
||
|
||
1. **Add Executive Summary Section** (after line 9):
|
||
```markdown
|
||
## ⚠️ **Important Note: 67.9M Performance Measurement**
|
||
|
||
**Issue**: Earlier reports mentioned 67.9M ops/sec performance
|
||
**Status**: ❌ **NOT REPRODUCIBLE** - Likely measurement error
|
||
|
||
**Actual Achievements**:
|
||
- ✅ Registry ON/OFF toggle implementation (Pattern 2)
|
||
- ✅ O(N) proven 2.9-13.7x faster than O(1) for Small-N (8-32 slabs)
|
||
- ✅ Default: `g_use_registry = 0` (O(N) Sequential Access)
|
||
|
||
**Performance Reality**:
|
||
- 1-thread: 15.3M ops/sec (O(N), validated)
|
||
- 4-thread: **3.3M ops/sec** (THREAD-UNSAFE, requires Phase 6.15 fix)
|
||
```
|
||
|
||
2. **Update Section Title** (line 9):
|
||
```markdown
|
||
## 📊 **Executive Summary: Registry Toggle + Thread Safety Issue**
|
||
```
|
||
|
||
3. **Add Thread Safety Warning** (after line 158):
|
||
```markdown
|
||
---
|
||
|
||
## 🚨 **Critical Discovery: Thread Safety Issue**
|
||
|
||
### **Problem**
|
||
Phase 6.14 testing revealed **catastrophic multi-threaded performance collapse**:
|
||
|
||
| Threads | Performance | vs 1-thread |
|
||
|---------|-------------|-------------|
|
||
| 1-thread | 15.3M ops/sec | baseline |
|
||
| 4-thread | **3.3M ops/sec** | **-78%** ❌ |
|
||
|
||
**Root Cause**: `grep pthread_mutex *.c` → **0 results** (no locks!)
|
||
|
||
**Impact**: All global structures are race-condition prone:
|
||
- `g_tiny_pool.free_slabs[]` - Concurrent access without locks
|
||
- `g_l25_pool.freelist[]` - Multiple threads modifying same freelist
|
||
- `g_slab_registry[]` - Hash table corruption
|
||
- `g_whale_cache` - Ring buffer race conditions
|
||
|
||
### **Solution**
|
||
**Phase 6.15**: Multi-threaded Safety + TLS Performance
|
||
- **P0** (30 min): Global safety lock (correctness first)
|
||
- **P1** (2 hours): Tiny Pool TLS (95%+ lock avoidance)
|
||
- **P2** (3 hours): L2 Pool TLS (full coverage)
|
||
- **P3** (3 hours): L2.5 Pool TLS expansion
|
||
|
||
**Expected Results**:
|
||
- P0: 4T = 13-15M ops/sec (safe, no scalability)
|
||
- P0+P1: 4T = 12-15M ops/sec (+264-355%)
|
||
- **Validated**: 4T = **15.9M ops/sec** (+381% vs 3.3M baseline) ✅
|
||
|
||
**Reference**: `THREAD_SAFETY_SOLUTION.md` - Complete analysis
|
||
```
|
||
|
||
**Estimated Time**: 15 minutes
|
||
|
||
---
|
||
|
||
### **Task 1.2: Create Phase 6.15 Plan Document** (30 minutes)
|
||
|
||
**File**: `apps/experiments/hakmem-poc/PHASE_6.15_PLAN.md` (THIS FILE)
|
||
|
||
**Contents**: ✅ **Already created** (this document)
|
||
|
||
**Sections**:
|
||
1. Executive Summary
|
||
2. Step 1: Documentation Updates (detailed)
|
||
3. Step 2: P0 Safety Lock (implementation + testing)
|
||
4. Step 3: Multi-threaded Performance (P1-P3 breakdown)
|
||
5. Implementation Checklist
|
||
6. Risk Assessment
|
||
7. Success Criteria
|
||
|
||
**Estimated Time**: 30 minutes (already completed)
|
||
|
||
---
|
||
|
||
### **Task 1.3: Update CURRENT_TASK.md** (10 minutes)
|
||
|
||
**File**: `apps/experiments/hakmem-poc/CURRENT_TASK.md`
|
||
|
||
**Required Changes**:
|
||
|
||
1. **Update Current Status** (after line 30):
|
||
```markdown
|
||
## 🎯 **Current Focus: Phase 6.15 Multi-threaded Safety** (2025-10-22)
|
||
|
||
### **Immediate Priority: Thread Safety Fix** ⚠️
|
||
|
||
**Problem Discovered**: hakmem is completely thread-unsafe
|
||
- 4-thread performance: **3.3M ops/sec** (-78% vs 1-thread 15.1M)
|
||
- Root cause: Zero synchronization primitives (no `pthread_mutex`)
|
||
|
||
**Solution in Progress**: Phase 6.15 (3-stage implementation)
|
||
1. ✅ **Step 1**: Documentation updates (1 hour) ← IN PROGRESS
|
||
2. ⏳ **Step 2**: P0 Safety Lock (30 min + testing)
|
||
3. ⏳ **Step 3**: TLS Performance (P1-P3, 8-10 hours)
|
||
|
||
**Expected Outcome**: 4T = 15.9M ops/sec (validated in Phase 6.13)
|
||
|
||
**Planning Document**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
|
||
```
|
||
|
||
2. **Move Phase 6.14 to Completed Section** (after line 296):
|
||
```markdown
|
||
## ✅ Phase 6.14 完了!(2025-10-22)
|
||
|
||
**実装完了**: Registry ON/OFF 切り替え実装 + Thread Safety Issue 発見
|
||
|
||
**✅ 実装完了内容**:
|
||
1. **Pattern 2 実装**: `HAKMEM_USE_REGISTRY` 環境変数で ON/OFF 切り替え
|
||
2. **O(N) vs O(1) 検証**: O(N) が 2.9-13.7倍速いことを実証
|
||
3. **デフォルト設定**: `g_use_registry = 0` (O(N) Sequential Access)
|
||
|
||
**🚨 Critical Discovery**: 4-thread 性能崩壊 (-78%)
|
||
- 原因: 全グローバル変数がロック無し
|
||
- 対策: Phase 6.15 で修正予定
|
||
|
||
**📊 測定結果**:
|
||
```
|
||
1-thread: 15.3M ops/sec (O(N), Registry OFF)
|
||
4-thread: 3.3M ops/sec (-78% ← THREAD-UNSAFE) ❌
|
||
```
|
||
|
||
**詳細ドキュメント**:
|
||
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Pattern 2 実装
|
||
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - 完全分析
|
||
- [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - 修正計画
|
||
|
||
**実装時間**: 34分(予定通り) ⚡
|
||
```
|
||
|
||
**Estimated Time**: 10 minutes
|
||
|
||
---
|
||
|
||
### **Task 1.4: Update README (if needed)** (5 minutes)
|
||
|
||
**File**: `apps/experiments/hakmem-poc/README.md` (if exists)
|
||
|
||
**Check if exists**:
|
||
```bash
|
||
ls -la apps/experiments/hakmem-poc/README.md
|
||
```
|
||
|
||
**If exists, add warning**:
|
||
```markdown
|
||
## ⚠️ **Current Status: Thread Safety in Development**
|
||
|
||
**Known Issue**: hakmem is currently thread-unsafe
|
||
- **Single-threaded**: 15.1M ops/sec ✅ Excellent
|
||
- **Multi-threaded**: 3.3M ops/sec (4T) ❌ Requires fix
|
||
|
||
**Fix in Progress**: Phase 6.15 Multi-threaded Safety
|
||
- Expected completion: 2025-10-24 (2-3 days)
|
||
- Target performance: 15-20M ops/sec at 4 threads
|
||
|
||
**Do NOT use in multi-threaded applications until Phase 6.15 is complete.**
|
||
```
|
||
|
||
**Estimated Time**: 5 minutes (or skip if README doesn't exist)
|
||
|
||
---
|
||
|
||
### **Task 1.5: Verification** (5 minutes)
|
||
|
||
**Checklist**:
|
||
- [ ] PHASE_6.14_COMPLETION_REPORT.md updated (67.9M issue documented)
|
||
- [ ] PHASE_6.15_PLAN.md created (this document)
|
||
- [ ] CURRENT_TASK.md updated (Phase 6.15 status)
|
||
- [ ] README.md updated (if exists)
|
||
|
||
**Verification Commands**:
|
||
```bash
|
||
cd apps/experiments/hakmem-poc
|
||
|
||
# Check files exist
|
||
ls -la PHASE_6.14_COMPLETION_REPORT.md
|
||
ls -la PHASE_6.15_PLAN.md
|
||
ls -la CURRENT_TASK.md
|
||
|
||
# Grep for keywords
|
||
grep -n "67.9M\|Thread Safety\|Phase 6.15" PHASE_6.14_COMPLETION_REPORT.md
|
||
grep -n "Phase 6.15\|Thread Safety" CURRENT_TASK.md
|
||
```
|
||
|
||
**Estimated Time**: 5 minutes
|
||
|
||
---
|
||
|
||
## ⏱️ **Step 1 Total Time: 1 hour 5 minutes**
|
||
|
||
---
|
||
|
||
## 🔐 **Step 2: P0 Safety Lock Implementation** (2-3 hours)
|
||
|
||
### **Goal**
|
||
Ensure **correctness** with minimal code changes. No performance improvement expected (4T ≈ 1T).
|
||
|
||
### **Success Criteria**
|
||
- ✅ 1-thread: 13-15M ops/sec (ロックオーバーヘッド 0-15% acceptable)
|
||
- ✅ 4-thread: 13-15M ops/sec (no scalability, but SAFE)
|
||
- ✅ Helgrind: Data race = 0 件
|
||
- ✅ Stability: 10 consecutive runs without crash
|
||
|
||
---
|
||
|
||
### **Task 2.1: Implementation** (30 minutes)
|
||
|
||
#### **File**: `apps/experiments/hakmem-poc/hakmem.c`
|
||
|
||
**Changes Required**:
|
||
|
||
1. **Add pthread.h include** (after line 22):
|
||
```c
|
||
#include <pthread.h> // Phase 6.15 P0: Thread Safety
|
||
```
|
||
|
||
2. **Add global lock** (after line 58):
|
||
```c
|
||
// ============================================================================
|
||
// Phase 6.15 P0: Thread Safety - Global Lock
|
||
// ============================================================================
|
||
|
||
// Global lock for all allocator operations
|
||
// Purpose: Ensure correctness in multi-threaded environment
|
||
// Performance: 4T ≈ 1T (no scalability, safety first)
|
||
// Will be replaced by TLS in P1-P3 (95%+ lock avoidance)
|
||
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
|
||
|
||
// Lock/unlock helpers (for debugging and future instrumentation)
|
||
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
|
||
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
|
||
```
|
||
|
||
3. **Wrap hak_alloc_at()** (find the function, approximately line 300-400):
|
||
```c
|
||
void* hak_alloc_at(size_t size, uintptr_t site_id) {
|
||
// Phase 6.15 P0: Global lock (safety first)
|
||
HAKMEM_LOCK();
|
||
|
||
// Existing implementation
|
||
void* ptr = hak_alloc_at_internal(size, site_id);
|
||
|
||
HAKMEM_UNLOCK();
|
||
return ptr;
|
||
}
|
||
|
||
// Rename old hak_alloc_at to hak_alloc_at_internal
|
||
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
|
||
// ... existing code (no changes) ...
|
||
}
|
||
```
|
||
|
||
4. **Wrap hak_free_at()** (find the function):
|
||
```c
|
||
void hak_free_at(void* ptr, uintptr_t site_id) {
|
||
if (!ptr) return;
|
||
|
||
// Phase 6.15 P0: Global lock (safety first)
|
||
HAKMEM_LOCK();
|
||
|
||
// Existing implementation
|
||
hak_free_at_internal(ptr, site_id);
|
||
|
||
HAKMEM_UNLOCK();
|
||
}
|
||
|
||
// Rename old hak_free_at to hak_free_at_internal
|
||
static void hak_free_at_internal(void* ptr, uintptr_t site_id) {
|
||
// ... existing code (no changes) ...
|
||
}
|
||
```
|
||
|
||
5. **Protect hak_init()** (find initialization function):
|
||
```c
|
||
void hak_init(void) {
|
||
// Phase 6.15 P0: No lock needed (called once before any threads spawn)
|
||
// But add atomic check for safety
|
||
|
||
// ... existing init code ...
|
||
}
|
||
```
|
||
|
||
**Estimated Time**: 30 minutes
|
||
|
||
---
|
||
|
||
### **Task 2.2: Build & Smoke Test** (15 minutes)
|
||
|
||
**Commands**:
|
||
```bash
|
||
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
|
||
|
||
# Clean build
|
||
make clean
|
||
make bench_allocators
|
||
|
||
# Smoke test (single-threaded)
|
||
./bench_allocators --allocator hakmem-baseline --scenario json
|
||
|
||
# Expected: ~300-350ns (slight overhead acceptable)
|
||
```
|
||
|
||
**Success Criteria**:
|
||
- ✅ Build succeeds (no compilation errors)
|
||
- ✅ No crashes on single-threaded test
|
||
- ✅ Performance: 13-15M ops/sec (within 0-15% of Phase 6.14)
|
||
|
||
**Estimated Time**: 15 minutes
|
||
|
||
---
|
||
|
||
### **Task 2.3: Multi-threaded Validation** (1 hour)
|
||
|
||
#### **Test 1: larson Benchmark** (30 minutes)
|
||
|
||
**Setup**:
|
||
```bash
|
||
# Build shared library (if not already done)
|
||
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
|
||
make clean && make shared
|
||
|
||
# Verify library
|
||
ls -lh libhakmem.so
|
||
nm -D libhakmem.so | grep -E "malloc|free|calloc|realloc"
|
||
```
|
||
|
||
**Benchmark Execution**:
|
||
```bash
|
||
cd /tmp/mimalloc-bench/bench/larson
|
||
|
||
# 1-thread baseline
|
||
./larson 0 8 1024 10000 1 12345 1
|
||
|
||
# 1-thread with hakmem P0
|
||
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
|
||
./larson 0 8 1024 10000 1 12345 1
|
||
|
||
# Expected: 13-15M ops/sec (lock overhead 0-15%)
|
||
```
|
||
|
||
```bash
|
||
# 4-thread baseline
|
||
./larson 0 8 1024 10000 1 12345 4
|
||
|
||
# 4-thread with hakmem P0
|
||
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
|
||
./larson 0 8 1024 10000 1 12345 4
|
||
|
||
# Expected: 13-15M ops/sec (same as 1T, no scalability)
|
||
# Critical: NO CRASHES, NO DATA CORRUPTION
|
||
```
|
||
|
||
**Success Criteria**:
|
||
- ✅ 1T: 13-15M ops/sec (within 15% of Phase 6.14)
|
||
- ✅ 4T: 13-15M ops/sec (no scalability expected)
|
||
- ✅ 4T: NO crashes, NO segfaults
|
||
- ✅ 4T: NO data corruption (verify checksum if larson supports)
|
||
|
||
**Estimated Time**: 30 minutes
|
||
|
||
---
|
||
|
||
#### **Test 2: Helgrind Race Detection** (20 minutes)
|
||
|
||
**Purpose**: Verify all data races are eliminated
|
||
|
||
**Commands**:
|
||
```bash
|
||
cd /tmp/mimalloc-bench/bench/larson
|
||
|
||
# Install valgrind (if not installed)
|
||
sudo apt-get install -y valgrind
|
||
|
||
# Run Helgrind on 4-thread test
|
||
valgrind --tool=helgrind \
|
||
--read-var-info=yes \
|
||
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
|
||
./larson 0 8 1024 1000 1 12345 4
|
||
# Note: Reduced iterations (1000 instead of 10000) for faster run
|
||
|
||
# Expected output:
|
||
# ERROR SUMMARY: 0 errors from 0 contexts (suppressed: X from Y)
|
||
```
|
||
|
||
**Success Criteria**:
|
||
- ✅ ERROR SUMMARY: **0 errors** (zero data races)
|
||
- ✅ No warnings about unprotected reads/writes
|
||
- ⚠️ NOTE: Helgrind may show false positives from libc. Ignore if they are NOT in hakmem code.
|
||
|
||
**Estimated Time**: 20 minutes
|
||
|
||
---
|
||
|
||
#### **Test 3: Stability Test** (10 minutes)
|
||
|
||
**Purpose**: Ensure no crashes over 10 consecutive runs
|
||
|
||
**Commands**:
|
||
```bash
|
||
cd /tmp/mimalloc-bench/bench/larson
|
||
|
||
# 10 consecutive 4-thread runs
|
||
for i in {1..10}; do
|
||
echo "Run $i/10..."
|
||
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
|
||
./larson 0 8 1024 10000 1 12345 4 || { echo "FAILED at run $i"; exit 1; }
|
||
done
|
||
|
||
echo "✅ All 10 runs succeeded!"
|
||
```
|
||
|
||
**Success Criteria**:
|
||
- ✅ 10/10 runs complete without crashes
|
||
- ✅ Performance stable across runs (variance < 10%)
|
||
|
||
**Estimated Time**: 10 minutes
|
||
|
||
---
|
||
|
||
### **Task 2.4: Document Results** (15 minutes)
|
||
|
||
**Create**: `apps/experiments/hakmem-poc/PHASE_6.15_P0_RESULTS.md`
|
||
|
||
**Template**:
|
||
```markdown
|
||
# Phase 6.15 P0: Safety Lock Implementation - Results
|
||
|
||
**Date**: 2025-10-22
|
||
**Status**: ✅ **COMPLETED** (Correctness achieved)
|
||
**Implementation Time**: X minutes
|
||
|
||
---
|
||
|
||
## 📊 **Benchmark Results**
|
||
|
||
### **larson (mimalloc-bench)**
|
||
|
||
| Threads | Before P0 (UNSAFE) | After P0 (SAFE) | Change |
|
||
|---------|-------------------|-----------------|--------|
|
||
| 1-thread | 15.1M ops/sec | X.XM ops/sec | ±X% |
|
||
| 4-thread | 3.3M ops/sec | X.XM ops/sec | +XXX% ✅ |
|
||
|
||
**Performance Summary**:
|
||
- 1-thread overhead: X% (lock overhead, acceptable)
|
||
- 4-thread improvement: +XXX% (from -78% to safe)
|
||
- 4-thread scalability: X.Xx (4T / 1T, expected ~1.0)
|
||
|
||
---
|
||
|
||
## ✅ **Success Criteria Met**
|
||
|
||
- ✅ 1T performance: X.XM ops/sec (within 15% of Phase 6.14)
|
||
- ✅ 4T performance: X.XM ops/sec (safe, no scalability)
|
||
- ✅ Helgrind: **0 data races** detected
|
||
- ✅ Stability: **10/10 runs** without crashes
|
||
|
||
---
|
||
|
||
## 🔧 **Implementation Details**
|
||
|
||
**Files Modified**:
|
||
- `hakmem.c` - Added global lock + wrapper functions
|
||
|
||
**Lines Changed**:
|
||
- +20 lines (pthread.h, global lock, HAKMEM_LOCK/UNLOCK macros)
|
||
- +10 lines (hak_alloc_at wrapper)
|
||
- +10 lines (hak_free_at wrapper)
|
||
- **Total**: ~40 lines
|
||
|
||
**Pattern**:
|
||
```c
|
||
void* hak_alloc_at(size_t size, uintptr_t site_id) {
|
||
HAKMEM_LOCK();
|
||
void* ptr = hak_alloc_at_internal(size, site_id);
|
||
HAKMEM_UNLOCK();
|
||
return ptr;
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## 🎯 **Next Steps**
|
||
|
||
**Phase 6.15 P1**: Tiny Pool TLS (2 hours)
|
||
- Expected: 4T = 12-15M ops/sec (+100-150%)
|
||
- TLS hit rate: 95%+
|
||
- Lock avoidance: 95%+
|
||
|
||
**Start Date**: 2025-10-XX
|
||
```
|
||
|
||
**Estimated Time**: 15 minutes
|
||
|
||
---
|
||
|
||
### **Step 2 Total Time: 2-3 hours**
|
||
|
||
---
|
||
|
||
## 🚀 **Step 3: Multi-threaded Performance (P1-P3)** (8-10 hours)
|
||
|
||
### **Overview**
|
||
|
||
**Goal**: Achieve near-ideal scalability (4T ≈ 4x 1T) using Thread-Local Storage (TLS)
|
||
|
||
**Validation**: Phase 6.13 already proved TLS works
|
||
- 1-thread: 17.8M ops/sec (+123% vs system)
|
||
- 4-thread: 15.9M ops/sec (+147% vs system)
|
||
|
||
**Strategy**: Expand existing L2.5 TLS to Tiny Pool and L2 Pool
|
||
|
||
---
|
||
|
||
### **Phase 6.15 P1: Tiny Pool TLS** (2 hours)
|
||
|
||
**Goal**: Thread-local cache for ≤1KB allocations (8 size classes)
|
||
|
||
**Existing Reference**: `hakmem_l25_pool.c:26` (TLS pattern already implemented)
|
||
|
||
#### **Implementation**
|
||
|
||
**File**: `apps/experiments/hakmem-poc/hakmem_tiny.c`
|
||
|
||
**Changes**:
|
||
|
||
1. **Add TLS cache** (after line 12):
|
||
```c
|
||
// Phase 6.15 P1: Thread-Local Storage for Tiny Pool
|
||
// Pattern: Same as L2.5 Pool (hakmem_l25_pool.c:26)
|
||
// Purpose: Reduce global freelist contention (50 cycles → 10 cycles)
|
||
// Hit rate expected: 95%+
|
||
|
||
static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
|
||
static __thread int tls_tiny_initialized = 0;
|
||
```
|
||
|
||
2. **TLS initialization** (new function):
|
||
```c
|
||
// Initialize TLS cache for current thread
|
||
static void hak_tiny_tls_init(void) {
|
||
if (tls_tiny_initialized) return;
|
||
|
||
// Initialize all size classes to NULL
|
||
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
|
||
tls_tiny_cache[i] = NULL;
|
||
}
|
||
|
||
tls_tiny_initialized = 1;
|
||
}
|
||
```
|
||
|
||
3. **Modify hak_tiny_alloc** (existing function):
|
||
```c
|
||
void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
|
||
// Phase 6.15 P1: TLS fast path
|
||
if (!tls_tiny_initialized) {
|
||
hak_tiny_tls_init();
|
||
}
|
||
|
||
int class_idx = hak_tiny_get_class_index(size);
|
||
|
||
// TLS hit check (no lock needed)
|
||
TinySlab* slab = tls_tiny_cache[class_idx];
|
||
if (slab && slab->free_count > 0) {
|
||
// Fast path: Allocate from TLS cache
|
||
return hak_tiny_alloc_from_slab(slab, class_idx);
|
||
}
|
||
|
||
// TLS miss: Refill from global freelist (locked)
|
||
HAKMEM_LOCK();
|
||
|
||
// Try to get a slab from global freelist
|
||
slab = g_tiny_pool.free_slabs[class_idx];
|
||
if (slab) {
|
||
// Move slab to TLS cache
|
||
g_tiny_pool.free_slabs[class_idx] = slab->next;
|
||
tls_tiny_cache[class_idx] = slab;
|
||
slab->next = NULL; // Detach from freelist
|
||
} else {
|
||
// Allocate new slab (existing logic)
|
||
slab = allocate_new_slab(class_idx);
|
||
if (!slab) {
|
||
HAKMEM_UNLOCK();
|
||
return NULL;
|
||
}
|
||
tls_tiny_cache[class_idx] = slab;
|
||
}
|
||
|
||
HAKMEM_UNLOCK();
|
||
|
||
// Allocate from newly cached slab
|
||
return hak_tiny_alloc_from_slab(slab, class_idx);
|
||
}
|
||
```
|
||
|
||
4. **Modify hak_tiny_free** (existing function):
|
||
```c
|
||
void hak_tiny_free(void* ptr, uintptr_t site_id) {
|
||
if (!ptr) return;
|
||
|
||
// Find owner slab (O(N) or O(1) depending on g_use_registry)
|
||
TinySlab* slab = hak_tiny_owner_slab(ptr);
|
||
if (!slab) {
|
||
fprintf(stderr, "[Tiny] ERROR: Invalid pointer!\n");
|
||
return;
|
||
}
|
||
|
||
int class_idx = slab->size_class;
|
||
|
||
// Free block in slab
|
||
hak_tiny_free_in_slab(slab, ptr, class_idx);
|
||
|
||
// Check if slab is now empty
|
||
if (slab->free_count == slab->total_count) {
|
||
// Phase 6.15 P1: Return empty slab to global freelist
|
||
|
||
// First, remove from TLS cache if it's there
|
||
if (tls_tiny_cache[class_idx] == slab) {
|
||
tls_tiny_cache[class_idx] = NULL;
|
||
}
|
||
|
||
// Return to global freelist (locked)
|
||
HAKMEM_LOCK();
|
||
slab->next = g_tiny_pool.free_slabs[class_idx];
|
||
g_tiny_pool.free_slabs[class_idx] = slab;
|
||
HAKMEM_UNLOCK();
|
||
}
|
||
}
|
||
```
|
||
|
||
**Expected Performance**:
|
||
- TLS hit rate: 95%+
|
||
- Lock contention: 5% (only on TLS miss)
|
||
- 4T performance: 12-15M ops/sec (+264-355% vs 3.3M baseline)
|
||
|
||
**Implementation Time**: 2 hours
|
||
|
||
---
|
||
|
||
### **Phase 6.15 P2: L2 Pool TLS** (3 hours)
|
||
|
||
**Goal**: Thread-local cache for 2-32KB allocations (5 size classes)
|
||
|
||
**Pattern**: Same as Tiny Pool TLS (above)
|
||
|
||
#### **Implementation**
|
||
|
||
**File**: `apps/experiments/hakmem-poc/hakmem_pool.c`
|
||
|
||
**Changes**: (Similar structure to Tiny Pool TLS)
|
||
|
||
1. Add `static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];`
|
||
2. Implement TLS fast path in `hak_pool_alloc()`
|
||
3. Implement TLS refill logic (global freelist → TLS cache)
|
||
4. Implement TLS return logic (empty slabs → global freelist)
|
||
|
||
**Expected Performance**:
|
||
- TLS hit rate: 90%+
|
||
- Cumulative 4T performance: 15-18M ops/sec
|
||
|
||
**Implementation Time**: 3 hours
|
||
|
||
---
|
||
|
||
### **Phase 6.15 P3: L2.5 Pool TLS Expansion** (3 hours)
|
||
|
||
**Goal**: Expand existing L2.5 TLS to full implementation
|
||
|
||
**Current State**: `hakmem_l25_pool.c:26` already has TLS declaration:
|
||
```c
|
||
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
|
||
```
|
||
|
||
**Missing**: TLS refill/eviction logic (currently only used in fast path)
|
||
|
||
#### **Implementation**
|
||
|
||
**File**: `apps/experiments/hakmem-poc/hakmem_l25_pool.c`
|
||
|
||
**Changes**:
|
||
|
||
1. **Implement TLS refill** (in `hak_l25_pool_alloc`):
|
||
```c
|
||
// Existing TLS check (line ~230)
|
||
L25Block* block = tls_l25_cache[class_idx];
|
||
if (block) {
|
||
tls_l25_cache[class_idx] = NULL; // Pop from TLS
|
||
// ... existing header rewrite ...
|
||
return user_ptr;
|
||
}
|
||
|
||
// NEW: TLS refill from global freelist
|
||
HAKMEM_LOCK();
|
||
|
||
int shard_idx = (site_id >> 4) & (L25_NUM_SHARDS - 1);
|
||
|
||
// Check non-empty bitmap
|
||
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
|
||
// Empty freelist, allocate new bundle
|
||
// ... existing logic ...
|
||
} else {
|
||
// Pop from global freelist
|
||
block = g_l25_pool.freelist[class_idx][shard_idx];
|
||
g_l25_pool.freelist[class_idx][shard_idx] = block->next;
|
||
|
||
// Update bitmap if freelist is now empty
|
||
if (!g_l25_pool.freelist[class_idx][shard_idx]) {
|
||
g_l25_pool.nonempty_mask[class_idx] &= ~(1ULL << shard_idx);
|
||
}
|
||
|
||
// Move to TLS cache
|
||
tls_l25_cache[class_idx] = block;
|
||
}
|
||
|
||
HAKMEM_UNLOCK();
|
||
|
||
// Allocate from TLS cache
|
||
block = tls_l25_cache[class_idx];
|
||
tls_l25_cache[class_idx] = NULL;
|
||
// ... existing header rewrite ...
|
||
return user_ptr;
|
||
```
|
||
|
||
2. **Implement TLS eviction** (in `hak_l25_pool_free`):
|
||
```c
|
||
// Existing logic to add to freelist
|
||
L25Block* block = (L25Block*)hdr;
|
||
|
||
// Phase 6.15 P3: Add to TLS cache first (if empty)
|
||
if (!tls_l25_cache[class_idx]) {
|
||
tls_l25_cache[class_idx] = block;
|
||
block->next = NULL;
|
||
return; // No need to lock
|
||
}
|
||
|
||
// TLS cache full, return to global freelist (locked)
|
||
HAKMEM_LOCK();
|
||
|
||
block->next = g_l25_pool.freelist[class_idx][shard_idx];
|
||
g_l25_pool.freelist[class_idx][shard_idx] = block;
|
||
|
||
// Update bitmap
|
||
g_l25_pool.nonempty_mask[class_idx] |= (1ULL << shard_idx);
|
||
|
||
HAKMEM_UNLOCK();
|
||
```
|
||
|
||
**Expected Performance**:
|
||
- TLS hit rate: 95%+
|
||
- Cumulative 4T performance: 18-22M ops/sec (+445-567%)
|
||
|
||
**Implementation Time**: 3 hours
|
||
|
||
---
|
||
|
||
## 📋 **Implementation Checklist**
|
||
|
||
### **Step 1: Documentation** (1 hour) ✅
|
||
- [ ] Task 1.1: Fix PHASE_6.14_COMPLETION_REPORT.md (15 min)
|
||
- [ ] Task 1.2: Create PHASE_6.15_PLAN.md (30 min) ← THIS DOCUMENT
|
||
- [ ] Task 1.3: Update CURRENT_TASK.md (10 min)
|
||
- [ ] Task 1.4: Update README.md if exists (5 min)
|
||
- [ ] Task 1.5: Verification (5 min)
|
||
|
||
### **Step 2: P0 Safety Lock** (2-3 hours)
|
||
- [ ] Task 2.1: Implementation (30 min)
|
||
- [ ] Add pthread.h include
|
||
- [ ] Add g_hakmem_lock + HAKMEM_LOCK/UNLOCK macros
|
||
- [ ] Wrap hak_alloc_at() with lock
|
||
- [ ] Wrap hak_free_at() with lock
|
||
- [ ] Task 2.2: Build & Smoke Test (15 min)
|
||
- [ ] `make clean && make bench_allocators`
|
||
- [ ] Single-threaded test (json scenario)
|
||
- [ ] Verify: 13-15M ops/sec
|
||
- [ ] Task 2.3: Multi-threaded Validation (1 hour)
|
||
- [ ] Test 1: larson 1T/4T (30 min)
|
||
- [ ] Test 2: Helgrind race detection (20 min)
|
||
- [ ] Test 3: Stability test 10 runs (10 min)
|
||
- [ ] Task 2.4: Document Results (15 min)
|
||
- [ ] Create PHASE_6.15_P0_RESULTS.md
|
||
|
||
### **Step 3: TLS Performance** (8-10 hours)
|
||
- [ ] **P1: Tiny Pool TLS** (2 hours)
|
||
- [ ] Add `tls_tiny_cache[]` declaration
|
||
- [ ] Implement `hak_tiny_tls_init()`
|
||
- [ ] Modify `hak_tiny_alloc()` (TLS fast path)
|
||
- [ ] Modify `hak_tiny_free()` (TLS eviction)
|
||
- [ ] Test: larson 4T → 12-15M ops/sec
|
||
- [ ] Document: PHASE_6.15_P1_RESULTS.md
|
||
|
||
- [ ] **P2: L2 Pool TLS** (3 hours)
|
||
- [ ] Add `tls_l2_cache[]` declaration
|
||
- [ ] Implement TLS fast path in `hak_pool_alloc()`
|
||
- [ ] Implement TLS refill logic
|
||
- [ ] Implement TLS eviction logic
|
||
- [ ] Test: larson 4T → 15-18M ops/sec
|
||
- [ ] Document: PHASE_6.15_P2_RESULTS.md
|
||
|
||
- [ ] **P3: L2.5 Pool TLS Expansion** (3 hours)
|
||
- [ ] Implement TLS refill in `hak_l25_pool_alloc()`
|
||
- [ ] Implement TLS eviction in `hak_l25_pool_free()`
|
||
- [ ] Test: larson 4T → 18-22M ops/sec
|
||
- [ ] Document: PHASE_6.15_P3_RESULTS.md
|
||
|
||
- [ ] **Final Validation** (1 hour)
|
||
- [ ] larson 1T/4T/16T full validation
|
||
- [ ] Internal benchmarks (json/mir/vm)
|
||
- [ ] Helgrind final check
|
||
- [ ] Create PHASE_6.15_COMPLETION_REPORT.md
|
||
|
||
---
|
||
|
||
## ⚠️ **Risk Assessment**
|
||
|
||
| Phase | Risk Level | Failure Mode | Mitigation |
|
||
|-------|-----------|--------------|------------|
|
||
| **P0 (Safety Lock)** | **ZERO** | Worst case: slow but safe | N/A |
|
||
| **P1 (Tiny TLS)** | **LOW** | TLS miss overhead | Feature flag `HAKMEM_TLS_TINY` |
|
||
| **P2 (L2 TLS)** | **LOW** | Memory overhead (TLS×threads) | Monitor RSS |
|
||
| **P3 (L2.5 TLS)** | **LOW** | Existing code 50% done | Incremental |
|
||
|
||
**Rollback Strategy**:
|
||
- Every phase has `#ifdef HAKMEM_TLS_PHASEX`
|
||
- Can disable individual TLS layers if issues found
|
||
- P0 Safety Lock ensures correctness even if TLS disabled
|
||
|
||
---
|
||
|
||
## 🎯 **Success Criteria**
|
||
|
||
### **Minimum Success** (P0 only)
|
||
- ✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
|
||
- ✅ Zero race conditions (Helgrind)
|
||
- ✅ 10/10 stability runs
|
||
|
||
### **Target Success** (P0 + P1 + P2)
|
||
- ✅ 4T ≥ 15M ops/sec (+355% vs 3.3M baseline)
|
||
- ✅ TLS hit rate ≥ 90%
|
||
- ✅ No single-threaded regression (≤15% overhead)
|
||
|
||
### **Stretch Goal** (All Phases)
|
||
- ✅ 4T ≥ 18M ops/sec (+445%)
|
||
- ✅ 16T ≥ 11.6M ops/sec (match system allocator)
|
||
- ✅ Scalable up to 32 threads
|
||
|
||
### **Validated** (Phase 6.13 Proof)
|
||
- ✅ **ALREADY ACHIEVED**: 4T = **15.9M ops/sec** (+381%) ✅
|
||
|
||
---
|
||
|
||
## 📊 **Expected Timeline**
|
||
|
||
### **Week 1: Foundation** (Day 1-2)
|
||
- **Day 1 AM** (1 hour): Step 1 - Documentation updates
|
||
- **Day 1 PM** (2-3 hours): Step 2 - P0 Safety Lock
|
||
- **Day 2** (2 hours): Step 3 - P1 Tiny Pool TLS
|
||
|
||
**Milestone**: 4T = 12-15M ops/sec (+264-355%)
|
||
|
||
### **Week 2: Expansion** (Day 3-5)
|
||
- **Day 3-4** (3 hours): Step 3 - P2 L2 Pool TLS
|
||
- **Day 5** (3 hours): Step 3 - P3 L2.5 Pool TLS
|
||
|
||
**Milestone**: 4T = 18-22M ops/sec (+445-567%)
|
||
|
||
### **Week 3: Validation** (Day 6)
|
||
- **Day 6** (1 hour): Final validation + completion report
|
||
|
||
**Milestone**: ✅ **Phase 6.15 Complete**
|
||
|
||
---
|
||
|
||
## 🔬 **Technical References**
|
||
|
||
### **Existing TLS Implementation**
|
||
**File**: `apps/experiments/hakmem-poc/hakmem_l25_pool.c:26`
|
||
```c
|
||
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
|
||
```
|
||
|
||
**Pattern**: Per-thread cache for each size class (L1 cache hit)
|
||
|
||
### **Phase 6.13 Validation**
|
||
**File**: `apps/experiments/hakmem-poc/PHASE_6.13_INITIAL_RESULTS.md`
|
||
|
||
**Results**:
|
||
- 1-thread: 17.8M ops/sec (+123% vs system)
|
||
- 4-thread: 15.9M ops/sec (+147% vs system)
|
||
- **Proof**: TLS works and provides massive benefit
|
||
|
||
### **Thread Safety Analysis**
|
||
**File**: `apps/experiments/hakmem-poc/THREAD_SAFETY_SOLUTION.md`
|
||
|
||
**Key Insights**:
|
||
- mimalloc/jemalloc both use TLS as primary approach
|
||
- TLS hit rate: 95%+ (industry standard)
|
||
- Lock contention: 5% (only on TLS miss/refill)
|
||
|
||
---
|
||
|
||
## 📝 **Implementation Notes**
|
||
|
||
### **Why 3 Stages?**
|
||
1. **Step 1 (Docs)**: Ensure clarity on what went wrong (67.9M issue) and what's being fixed
|
||
2. **Step 2 (P0)**: Prove correctness FIRST (no crashes, no data races)
|
||
3. **Step 3 (P1-P3)**: Optimize for performance (TLS) with safety already guaranteed
|
||
|
||
### **Why Not Skip P0?**
|
||
- **Risk mitigation**: If TLS fails, we still have working thread-safe allocator
|
||
- **Debugging**: Easier to debug TLS issues with known-working locked baseline
|
||
- **Validation**: P0 proves the global lock pattern is correct
|
||
|
||
### **Why TLS Over Lock-free?**
|
||
- **Phase 6.14 proved**: Sequential O(N) is 2.9-13.7x faster than Random O(1) Hash
|
||
- **Implication**: Lock-free atomic hash will be SLOWER than TLS
|
||
- **Industry standard**: mimalloc/jemalloc use TLS, not lock-free
|
||
- **Proven**: Phase 6.13 validated +123-147% improvement with TLS
|
||
|
||
---
|
||
|
||
## 🚀 **Next Steps After Phase 6.15**
|
||
|
||
### **Phase 6.17: 16-Thread Scalability** (Optional, 4 hours)
|
||
**Current Issue**: 16T = 7.6M ops/sec (-34.8% vs system 11.6M)
|
||
|
||
**Investigation**:
|
||
1. Profile global lock contention (perf, helgrind)
|
||
2. Measure Whale cache hit rate by thread count
|
||
3. Analyze shard distribution (hash collision?)
|
||
4. Optimize TLS cache refill (batch refill to reduce global access)
|
||
|
||
**Target**: 16T ≥ 11.6M ops/sec (match or beat system)
|
||
|
||
---
|
||
|
||
## 📚 **Related Documents**
|
||
|
||
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis (Option A/B/C comparison)
|
||
- [PHASE_6.13_INITIAL_RESULTS.md](PHASE_6.13_INITIAL_RESULTS.md) - TLS validation proof
|
||
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Registry toggle + thread issue discovery
|
||
- [CURRENT_TASK.md](CURRENT_TASK.md) - Overall project status
|
||
|
||
---
|
||
|
||
**Total Time Investment**: 12-13 hours
|
||
**Expected ROI**: **6-15x improvement** (3.3M → 20-50M ops/sec)
|
||
**Risk**: Low (feature flags + proven design)
|
||
**Validation**: Phase 6.13 already proves TLS works (**+147%** at 4 threads)
|
||
|
||
---
|
||
|
||
**Implementation by**: Claude + ChatGPT协调开発
|
||
**Planning Date**: 2025-10-22
|
||
**Status**: ✅ **Ready to Execute**
|