hakmem/docs/design/PHASE_6.15_PLAN.md

# Phase 6.15: Multi-threaded Safety + TLS Performance - Implementation Plan

**Date**: 2025-10-22
**Status**: 📋 **Planning Complete**
**Total Time**: 12-13 hours (3 weeks)

---

## 📊 **Executive Summary**

### **Current Problem**
The hakmem allocator is **completely thread-unsafe**, and its multi-threaded performance collapses:

| Threads | Performance (ops/sec) | vs 1-thread |
|---------|----------------------|-------------|
| **1-thread** | 15.1M ops/sec | baseline |
| **4-thread** | 3.3M ops/sec | **-78%** ❌ |

**Root Cause**: Zero thread synchronization primitives in current codebase (no `pthread_mutex` anywhere)

### **Solution Strategy**

**3-Stage Gradual Implementation**:
1. **Step 1**: Document updates (1 hour) - Fix 67.9M measurement issue, create Phase 6.15 plan
2. **Step 2**: P0 Safety Lock (30 min + testing) - Ensure correctness with minimal changes
3. **Step 3**: TLS Performance (8-10 hours) - Achieve 4T ≥ 15M ops/sec (Phase 6.13 already measured 15.9M, +381% vs the 3.3M baseline)

**Expected Outcome**:
- **Minimum Success** (P0): 4T = 1T performance (safe, no scalability)
- **Target Success** (P0+P1): 4T = 12-15M ops/sec (+264-355%)
- **Validated** (Phase 6.13): 4T = **15.9M ops/sec** (+381%) ✅ **ALREADY PROVEN**
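
The percentages in this summary all come from the same relative-change formula applied to the 3.3M ops/sec 4-thread baseline (or the 15.1M 1-thread baseline for the slowdown). A minimal sketch of that arithmetic follows; `pct_change` is a hypothetical helper name used only for illustration and is not part of hakmem.

```c
#include <assert.h>

// Relative change of `measured` against `baseline`, in percent.
// Negative values mean a slowdown, positive values a speedup.
double pct_change(double baseline, double measured) {
    assert(baseline > 0.0);
    return (measured - baseline) / baseline * 100.0;
}
```

With the figures above, `pct_change(15.1, 3.3)` is about -78.1, `pct_change(3.3, 12.0)` about 263.6, `pct_change(3.3, 15.0)` about 354.5, and `pct_change(3.3, 15.9)` about 381.8, which this plan quotes as -78%, +264%, +355%, and +381%.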

---

## 🎯 **Step 1: Documentation Updates** (1 hour)

### **Task 1.1: Fix Phase 6.14 Completion Report** (15 minutes)

**File**: `apps/experiments/hakmem-poc/PHASE_6.14_COMPLETION_REPORT.md`

**Current Problem**:
- Report focuses on Registry ON/OFF toggle
- No mention of 67.9M ops/sec measurement issue
- Misleading performance claims

**Required Changes**:

1. **Add Executive Summary Section** (after line 9):
```markdown
## ⚠️ **Important Note: 67.9M Performance Measurement**

**Issue**: Earlier reports mentioned 67.9M ops/sec performance
**Status**: ❌ **NOT REPRODUCIBLE** - Likely measurement error

**Actual Achievements**:
- ✅ Registry ON/OFF toggle implementation (Pattern 2)
- ✅ O(N) proven 2.9-13.7x faster than O(1) for Small-N (8-32 slabs)
- ✅ Default: `g_use_registry = 0` (O(N) Sequential Access)

**Performance Reality**:
- 1-thread: 15.3M ops/sec (O(N), validated)
- 4-thread: **3.3M ops/sec** (THREAD-UNSAFE, requires Phase 6.15 fix)
```

2. **Update Section Title** (line 9):
```markdown
## 📊 **Executive Summary: Registry Toggle + Thread Safety Issue**
```

3. **Add Thread Safety Warning** (after line 158):
```markdown
---

## 🚨 **Critical Discovery: Thread Safety Issue**

### **Problem**
Phase 6.14 testing revealed **catastrophic multi-threaded performance collapse**:

| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.3M ops/sec | baseline |
| 4-thread | **3.3M ops/sec** | **-78%** ❌ |

**Root Cause**: `grep pthread_mutex *.c` → **0 results** (no locks!)

**Impact**: All global structures are race-condition prone:
- `g_tiny_pool.free_slabs[]` - Concurrent access without locks
- `g_l25_pool.freelist[]` - Multiple threads modifying same freelist
- `g_slab_registry[]` - Hash table corruption
- `g_whale_cache` - Ring buffer race conditions

### **Solution**
**Phase 6.15**: Multi-threaded Safety + TLS Performance
- **P0** (30 min): Global safety lock (correctness first)
- **P1** (2 hours): Tiny Pool TLS (95%+ lock avoidance)
- **P2** (3 hours): L2 Pool TLS (full coverage)
- **P3** (3 hours): L2.5 Pool TLS expansion

**Expected Results**:
- P0: 4T = 13-15M ops/sec (safe, no scalability)
- P0+P1: 4T = 12-15M ops/sec (+264-355%)
- **Validated**: 4T = **15.9M ops/sec** (+381% vs 3.3M baseline) ✅

**Reference**: `THREAD_SAFETY_SOLUTION.md` - Complete analysis
```

**Estimated Time**: 15 minutes

---

### **Task 1.2: Create Phase 6.15 Plan Document** (30 minutes)

**File**: `apps/experiments/hakmem-poc/PHASE_6.15_PLAN.md` (THIS FILE)

**Contents**: ✅ **Already created** (this document)

**Sections**:
1. Executive Summary
2. Step 1: Documentation Updates (detailed)
3. Step 2: P0 Safety Lock (implementation + testing)
4. Step 3: Multi-threaded Performance (P1-P3 breakdown)
5. Implementation Checklist
6. Risk Assessment
7. Success Criteria

**Estimated Time**: 30 minutes (already completed)

---

### **Task 1.3: Update CURRENT_TASK.md** (10 minutes)

**File**: `apps/experiments/hakmem-poc/CURRENT_TASK.md`

**Required Changes**:

1. **Update Current Status** (after line 30):
```markdown
## 🎯 **Current Focus: Phase 6.15 Multi-threaded Safety** (2025-10-22)

### **Immediate Priority: Thread Safety Fix** ⚠️

**Problem Discovered**: hakmem is completely thread-unsafe
- 4-thread performance: **3.3M ops/sec** (-78% vs 1-thread 15.1M)
- Root cause: Zero synchronization primitives (no `pthread_mutex`)

**Solution in Progress**: Phase 6.15 (3-stage implementation)
1. ✅ **Step 1**: Documentation updates (1 hour) ← IN PROGRESS
2. ⏳ **Step 2**: P0 Safety Lock (30 min + testing)
3. ⏳ **Step 3**: TLS Performance (P1-P3, 8-10 hours)

**Expected Outcome**: 4T = 15.9M ops/sec (validated in Phase 6.13)

**Planning Document**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
```

2. **Move Phase 6.14 to Completed Section** (after line 296):
```markdown
## ✅ Phase 6.14 Complete! (2025-10-22)

**Implemented**: Registry ON/OFF toggle + discovery of the thread safety issue

**✅ What was implemented**:
1. **Pattern 2**: ON/OFF toggle via the `HAKMEM_USE_REGISTRY` environment variable
2. **O(N) vs O(1) verification**: Demonstrated that O(N) is 2.9-13.7x faster
3. **Default setting**: `g_use_registry = 0` (O(N) Sequential Access)

**🚨 Critical Discovery**: 4-thread performance collapse (-78%)
- Cause: no global variable is protected by a lock
- Fix: planned for Phase 6.15

**📊 Measurements**:
~~~
1-thread: 15.3M ops/sec (O(N), Registry OFF)
4-thread:  3.3M ops/sec (-78% ← THREAD-UNSAFE) ❌
~~~

**Detailed documents**:
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Pattern 2 implementation
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis
- [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - Fix plan

**Implementation time**: 34 minutes (as planned) ⚡
```

**Estimated Time**: 10 minutes

---

### **Task 1.4: Update README (if needed)** (5 minutes)

**File**: `apps/experiments/hakmem-poc/README.md` (if exists)

**Check if exists**:
```bash
ls -la apps/experiments/hakmem-poc/README.md
```

**If exists, add warning**:
```markdown
## ⚠️ **Current Status: Thread Safety in Development**

**Known Issue**: hakmem is currently thread-unsafe
- **Single-threaded**: 15.1M ops/sec ✅ Excellent
- **Multi-threaded**: 3.3M ops/sec (4T) ❌ Requires fix

**Fix in Progress**: Phase 6.15 Multi-threaded Safety
- Expected completion: 2025-10-24 (2-3 days)
- Target performance: 15-20M ops/sec at 4 threads

**Do NOT use in multi-threaded applications until Phase 6.15 is complete.**
```

**Estimated Time**: 5 minutes (or skip if README doesn't exist)

---

### **Task 1.5: Verification** (5 minutes)

**Checklist**:
- [ ] PHASE_6.14_COMPLETION_REPORT.md updated (67.9M issue documented)
- [ ] PHASE_6.15_PLAN.md created (this document)
- [ ] CURRENT_TASK.md updated (Phase 6.15 status)
- [ ] README.md updated (if exists)

**Verification Commands**:
```bash
cd apps/experiments/hakmem-poc

# Check files exist
ls -la PHASE_6.14_COMPLETION_REPORT.md
ls -la PHASE_6.15_PLAN.md
ls -la CURRENT_TASK.md

# Grep for keywords
grep -n "67.9M\|Thread Safety\|Phase 6.15" PHASE_6.14_COMPLETION_REPORT.md
grep -n "Phase 6.15\|Thread Safety" CURRENT_TASK.md
```

**Estimated Time**: 5 minutes

---

## ⏱️ **Step 1 Total Time: 1 hour 5 minutes**

---

## 🔐 **Step 2: P0 Safety Lock Implementation** (2-3 hours)

### **Goal**
Ensure **correctness** with minimal code changes. No performance improvement expected (4T ≈ 1T).

### **Success Criteria**
- ✅ 1-thread: 13-15M ops/sec (lock overhead of 0-15% is acceptable)
- ✅ 4-thread: 13-15M ops/sec (no scalability, but SAFE)
- ✅ Helgrind: 0 data races
- ✅ Stability: 10 consecutive runs without crash

---

### **Task 2.1: Implementation** (30 minutes)

#### **File**: `apps/experiments/hakmem-poc/hakmem.c`

**Changes Required**:

1. **Add pthread.h include** (after line 22):
```c
#include <pthread.h>  // Phase 6.15 P0: Thread Safety
```

2. **Add global lock** (after line 58):
```c
// ============================================================================
// Phase 6.15 P0: Thread Safety - Global Lock
// ============================================================================

// Global lock for all allocator operations
// Purpose: Ensure correctness in multi-threaded environment
// Performance: 4T ≈ 1T (no scalability, safety first)
// Will be replaced by TLS in P1-P3 (95%+ lock avoidance)
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

// Lock/unlock helpers (for debugging and future instrumentation)
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
```

3. **Wrap hak_alloc_at()** (find the function, approximately line 300-400):
```c
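// Rename the old hak_alloc_at to hak_alloc_at_internal. It must be defined
// (or forward-declared) before the wrapper that calls it.
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
    // ... existing code (no changes) ...
}

void* hak_alloc_at(size_t size, uintptr_t site_id) {
    // Phase 6.15 P0: Global lock (safety first).
    // g_hakmem_lock is not recursive: internal code paths must call the
    // *_internal variants, never this public wrapper, or they will deadlock.
    HAKMEM_LOCK();

    // Existing implementation
    void* ptr = hak_alloc_at_internal(size, site_id);

    HAKMEM_UNLOCK();
    return ptr;
}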
```

4. **Wrap hak_free_at()** (find the function):
```c
// Rename the old hak_free_at to hak_free_at_internal. It must be defined
// (or forward-declared) before the wrapper that calls it.
static void hak_free_at_internal(void* ptr, uintptr_t site_id) {
    // ... existing code (no changes) ...
}

void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    hak_free_at_internal(ptr, site_id);

    HAKMEM_UNLOCK();
}
```

5. **Protect hak_init()** (find initialization function):
```c
void hak_init(void) {
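    // Phase 6.15 P0: No lock needed if this is called once before any threads spawn.
    // If that cannot be guaranteed, guard the body with a one-time check
    // (for example pthread_once) so concurrent first calls stay safe.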

    // ... existing init code ...
}
```
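
If `hak_init()` cannot be guaranteed to run before any thread starts, one option is to run its body through `pthread_once` and keep every shared update behind the global lock. The sketch below illustrates that combination on a stand-in counter. All `demo_*` names are hypothetical and exist only for this example; the real `hak_init()` and `hak_alloc_at()` bodies are not shown in this plan.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static int demo_init_calls = 0;     // how many times the init body ran
static size_t demo_live_bytes = 0;  // shared state, guarded by demo_lock

// Stand-in for the body of hak_init(); pthread_once runs it exactly once.
static void demo_init_body(void) {
    demo_init_calls++;
}

// Stand-in for hak_alloc_at(): lazy one-time init, then a locked update.
void demo_account_alloc(size_t size) {
    pthread_once(&demo_once, demo_init_body);
    int rc = pthread_mutex_lock(&demo_lock);
    assert(rc == 0);
    demo_live_bytes += size;
    rc = pthread_mutex_unlock(&demo_lock);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is defined
}

int demo_init_call_count(void) { return demo_init_calls; }

size_t demo_live_byte_count(void) {
    pthread_mutex_lock(&demo_lock);
    size_t n = demo_live_bytes;
    pthread_mutex_unlock(&demo_lock);
    return n;
}

static void* demo_worker(void* arg) {
    int iters = *(int*)arg;
    for (int i = 0; i < iters; i++) demo_account_alloc(8);
    return NULL;
}

// Runs `nthreads` workers (1..16), each accounting `iters` 8-byte allocations.
// Returns 0 on success, -1 if the arguments are invalid or a thread failed to start.
int demo_run_threads(int nthreads, int iters) {
    pthread_t tids[16];
    int started = 0, status = 0;
    if (nthreads < 1 || nthreads > 16 || iters < 0) return -1;
    for (; started < nthreads; started++) {
        if (pthread_create(&tids[started], NULL, demo_worker, &iters) != 0) {
            status = -1;
            break;
        }
    }
    for (int i = 0; i < started; i++) pthread_join(tids[i], NULL);
    return status;
}
```

Calling `demo_run_threads(4, 1000)` should leave the init counter at 1 and the byte counter at exactly 4 × 1000 × 8, which is the property the P0 lock is meant to restore for the real allocator state.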

**Estimated Time**: 30 minutes

---

### **Task 2.2: Build & Smoke Test** (15 minutes)

**Commands**:
```bash
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Clean build
make clean
make bench_allocators

# Smoke test (single-threaded)
./bench_allocators --allocator hakmem-baseline --scenario json

# Expected: ~300-350ns (slight overhead acceptable)
```

**Success Criteria**:
- ✅ Build succeeds (no compilation errors)
- ✅ No crashes on single-threaded test
- ✅ Performance: 13-15M ops/sec (within 0-15% of Phase 6.14)

**Estimated Time**: 15 minutes

---

### **Task 2.3: Multi-threaded Validation** (1 hour)

#### **Test 1: larson Benchmark** (30 minutes)

**Setup**:
```bash
# Build shared library (if not already done)
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make clean && make shared

# Verify library
ls -lh libhakmem.so
nm -D libhakmem.so | grep -E "malloc|free|calloc|realloc"
```

**Benchmark Execution**:
```bash
cd /tmp/mimalloc-bench/bench/larson

# 1-thread baseline
./larson 0 8 1024 10000 1 12345 1

# 1-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 1

# Expected: 13-15M ops/sec (lock overhead 0-15%)
```

```bash
# 4-thread baseline
./larson 0 8 1024 10000 1 12345 4

# 4-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4

# Expected: 13-15M ops/sec (same as 1T, no scalability)
# Critical: NO CRASHES, NO DATA CORRUPTION
```

**Success Criteria**:
- ✅ 1T: 13-15M ops/sec (within 15% of Phase 6.14)
- ✅ 4T: 13-15M ops/sec (no scalability expected)
- ✅ 4T: NO crashes, NO segfaults
- ✅ 4T: NO data corruption (verify the checksum if larson supports one)

**Estimated Time**: 30 minutes

---

#### **Test 2: Helgrind Race Detection** (20 minutes)

**Purpose**: Verify all data races are eliminated

**Commands**:
```bash
cd /tmp/mimalloc-bench/bench/larson

# Install valgrind (if not installed)
sudo apt-get install -y valgrind

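# Run Helgrind on 4-thread test
# LD_PRELOAD must be set before the valgrind command; placed after the
# options, valgrind would treat it as the program to run.
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
valgrind --tool=helgrind \
  --read-var-info=yes \
  ./larson 0 8 1024 1000 1 12345 4
# Note: Reduced iterations (1000 instead of 10000) for faster run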

# Expected output:
# ERROR SUMMARY: 0 errors from 0 contexts (suppressed: X from Y)
```

**Success Criteria**:
- ✅ ERROR SUMMARY: **0 errors** (zero data races)
- ✅ No warnings about unprotected reads/writes
- ⚠️ NOTE: Helgrind may show false positives from libc. Ignore if they are NOT in hakmem code.

**Estimated Time**: 20 minutes

---

#### **Test 3: Stability Test** (10 minutes)

**Purpose**: Ensure no crashes over 10 consecutive runs

**Commands**:
```bash
cd /tmp/mimalloc-bench/bench/larson

# 10 consecutive 4-thread runs
for i in {1..10}; do
  echo "Run $i/10..."
  LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4 || { echo "FAILED at run $i"; exit 1; }
done

echo "✅ All 10 runs succeeded!"
```

**Success Criteria**:
- ✅ 10/10 runs complete without crashes
- ✅ Performance stable across runs (variance < 10%)
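
The "variance < 10%" criterion does not say which statistic is meant. One possible reading, used here purely as an assumption, is the relative spread `(max - min) / mean` of the ops/sec figures from the 10 runs. The helpers below (`demo_relative_spread`, `demo_runs_are_stable`) are hypothetical names for a sketch of that check.

```c
#include <assert.h>
#include <stddef.h>

// Relative spread of n throughput samples: (max - min) / mean.
double demo_relative_spread(const double* samples, size_t n) {
    assert(samples != NULL && n > 0);
    double min = samples[0], max = samples[0], sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < min) min = samples[i];
        if (samples[i] > max) max = samples[i];
        sum += samples[i];
    }
    double mean = sum / (double)n;
    assert(mean > 0.0);
    return (max - min) / mean;
}

// Returns 1 when the spread is below the 10% threshold from the criteria above.
int demo_runs_are_stable(const double* samples, size_t n) {
    return demo_relative_spread(samples, n) < 0.10;
}
```

If a stricter definition is preferred (for example the coefficient of variation), only `demo_relative_spread` would need to change.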

**Estimated Time**: 10 minutes

---

### **Task 2.4: Document Results** (15 minutes)

**Create**: `apps/experiments/hakmem-poc/PHASE_6.15_P0_RESULTS.md`

**Template**:
```markdown
# Phase 6.15 P0: Safety Lock Implementation - Results

**Date**: 2025-10-22
**Status**: ✅ **COMPLETED** (Correctness achieved)
**Implementation Time**: X minutes

---

## 📊 **Benchmark Results**

### **larson (mimalloc-bench)**

| Threads | Before P0 (UNSAFE) | After P0 (SAFE) | Change |
|---------|-------------------|-----------------|--------|
| 1-thread | 15.1M ops/sec | X.XM ops/sec | ±X% |
| 4-thread | 3.3M ops/sec | X.XM ops/sec | +XXX% ✅ |

**Performance Summary**:
- 1-thread overhead: X% (lock overhead, acceptable)
- 4-thread improvement: +XXX% (from -78% to safe)
- 4-thread scalability: X.Xx (4T / 1T, expected ~1.0)

---

## ✅ **Success Criteria Met**

- ✅ 1T performance: X.XM ops/sec (within 15% of Phase 6.14)
- ✅ 4T performance: X.XM ops/sec (safe, no scalability)
- ✅ Helgrind: **0 data races** detected
- ✅ Stability: **10/10 runs** without crashes

---

## 🔧 **Implementation Details**

**Files Modified**:
- `hakmem.c` - Added global lock + wrapper functions

**Lines Changed**:
- +20 lines (pthread.h, global lock, HAKMEM_LOCK/UNLOCK macros)
- +10 lines (hak_alloc_at wrapper)
- +10 lines (hak_free_at wrapper)
- **Total**: ~40 lines

**Pattern**:
~~~c
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
}
~~~

---

## 🎯 **Next Steps**

**Phase 6.15 P1**: Tiny Pool TLS (2 hours)
- Expected: 4T = 12-15M ops/sec (+264-355% vs 3.3M baseline)
- TLS hit rate: 95%+
- Lock avoidance: 95%+

**Start Date**: 2025-10-XX
```

**Estimated Time**: 15 minutes

---

### **Step 2 Total Time: 2-3 hours**

---

## 🚀 **Step 3: Multi-threaded Performance (P1-P3)** (8-10 hours)

### **Overview**

**Goal**: Achieve near-ideal scalability (4T ≈ 4x 1T) using Thread-Local Storage (TLS)

**Validation**: Phase 6.13 already proved TLS works
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)

**Strategy**: Expand existing L2.5 TLS to Tiny Pool and L2 Pool

---

### **Phase 6.15 P1: Tiny Pool TLS** (2 hours)

**Goal**: Thread-local cache for ≤1KB allocations (8 size classes)

**Existing Reference**: `hakmem_l25_pool.c:26` (TLS pattern already implemented)

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_tiny.c`

**Changes**:

1. **Add TLS cache** (after line 12):
```c
// Phase 6.15 P1: Thread-Local Storage for Tiny Pool
// Pattern: Same as L2.5 Pool (hakmem_l25_pool.c:26)
// Purpose: Reduce global freelist contention (50 cycles → 10 cycles)
// Hit rate expected: 95%+

static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
static __thread int tls_tiny_initialized = 0;
```

2. **TLS initialization** (new function):
```c
// Initialize TLS cache for current thread
static void hak_tiny_tls_init(void) {
    if (tls_tiny_initialized) return;

    // Initialize all size classes to NULL
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        tls_tiny_cache[i] = NULL;
    }

    tls_tiny_initialized = 1;
}
```

3. **Modify hak_tiny_alloc** (existing function):
```c
void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
    // Phase 6.15 P1: TLS fast path
    if (!tls_tiny_initialized) {
        hak_tiny_tls_init();
    }

    int class_idx = hak_tiny_get_class_index(size);

    // TLS hit check (no lock needed)
    TinySlab* slab = tls_tiny_cache[class_idx];
    if (slab && slab->free_count > 0) {
        // Fast path: Allocate from TLS cache
        return hak_tiny_alloc_from_slab(slab, class_idx);
    }

    // TLS miss: Refill from global freelist (locked)
    HAKMEM_LOCK();

    // Try to get a slab from global freelist
    slab = g_tiny_pool.free_slabs[class_idx];
    if (slab) {
        // Move slab to TLS cache
        g_tiny_pool.free_slabs[class_idx] = slab->next;
        tls_tiny_cache[class_idx] = slab;
        slab->next = NULL;  // Detach from freelist
    } else {
        // Allocate new slab (existing logic)
        slab = allocate_new_slab(class_idx);
        if (!slab) {
            HAKMEM_UNLOCK();
            return NULL;
        }
        tls_tiny_cache[class_idx] = slab;
    }

    HAKMEM_UNLOCK();

    // Allocate from newly cached slab
    return hak_tiny_alloc_from_slab(slab, class_idx);
}
```

4. **Modify hak_tiny_free** (existing function):
```c
void hak_tiny_free(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Find owner slab (O(N) or O(1) depending on g_use_registry)
    TinySlab* slab = hak_tiny_owner_slab(ptr);
    if (!slab) {
        fprintf(stderr, "[Tiny] ERROR: Invalid pointer!\n");
        return;
    }

    int class_idx = slab->size_class;

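    // Free block in slab
    // NOTE: this runs without the lock, so it is only safe when the slab
    // belongs to the calling thread. A free issued by a different thread
    // would race with the owner and needs a locked or deferred path.
    hak_tiny_free_in_slab(slab, ptr, class_idx);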

    // Check if slab is now empty
    if (slab->free_count == slab->total_count) {
        // Phase 6.15 P1: Return empty slab to global freelist

        // First, remove from TLS cache if it's there
        if (tls_tiny_cache[class_idx] == slab) {
            tls_tiny_cache[class_idx] = NULL;
        }

        // Return to global freelist (locked)
        HAKMEM_LOCK();
        slab->next = g_tiny_pool.free_slabs[class_idx];
        g_tiny_pool.free_slabs[class_idx] = slab;
        HAKMEM_UNLOCK();
    }
}
```
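
Because a block can be freed by a thread other than the one that allocated it, the unlocked update of a slab is only safe for the thread that currently holds the slab in its TLS cache. One common way to handle this, shown here as a sketch and not as hakmem's actual design, is to record an owner per slab and defer frees from other threads to a separately locked counter that the owner drains later. `DemoSlab` and every `demo_*` function are hypothetical; the sketch also assumes the slab is handed to other threads only after `demo_slab_adopt` has completed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoSlab {
    pthread_t owner;        // thread whose TLS cache holds this slab
    int free_count;         // touched only by the owner thread
    int remote_free_count;  // frees from other threads, guarded by demo_remote_lock
} DemoSlab;

static pthread_mutex_t demo_remote_lock = PTHREAD_MUTEX_INITIALIZER;

// Called by the thread that moves the slab into its TLS cache.
void demo_slab_adopt(DemoSlab* slab, int free_count) {
    slab->owner = pthread_self();
    slab->free_count = free_count;
    slab->remote_free_count = 0;
}

// Frees one block. Returns true when the lock-free owner path was taken.
bool demo_slab_free_block(DemoSlab* slab) {
    if (pthread_equal(slab->owner, pthread_self())) {
        slab->free_count++;      // owner-only field, no lock needed
        return true;
    }
    pthread_mutex_lock(&demo_remote_lock);
    slab->remote_free_count++;   // deferred: the owner reclaims it later
    pthread_mutex_unlock(&demo_remote_lock);
    return false;
}

// Owner folds deferred remote frees back into its private counter.
void demo_slab_drain_remote(DemoSlab* slab) {
    assert(pthread_equal(slab->owner, pthread_self()));
    pthread_mutex_lock(&demo_remote_lock);
    slab->free_count += slab->remote_free_count;
    slab->remote_free_count = 0;
    pthread_mutex_unlock(&demo_remote_lock);
}

static void* demo_remote_free_thread(void* arg) {
    // Non-NULL result means the owner fast path was (wrongly) taken.
    return demo_slab_free_block((DemoSlab*)arg) ? arg : NULL;
}

// Frees one block from a freshly created thread.
// Returns true if that thread used the deferred (remote) path.
bool demo_free_from_other_thread(DemoSlab* slab) {
    pthread_t tid;
    void* result = NULL;
    if (pthread_create(&tid, NULL, demo_remote_free_thread, slab) != 0) return false;
    pthread_join(tid, &result);
    return result == NULL;
}
```

The owner would typically call `demo_slab_drain_remote` on a TLS miss, before falling back to the global freelist, so that remotely freed blocks become usable again without a lock on the fast path.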

**Expected Performance**:
- TLS hit rate: 95%+
- Lock contention: 5% (only on TLS miss)
- 4T performance: 12-15M ops/sec (+264-355% vs 3.3M baseline)
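
The hit-rate expectation is easier to verify if the allocator counts TLS hits and misses per thread. The following is a minimal sketch of such instrumentation, assuming the same GCC/Clang `__thread` storage class used in the declarations above; the `demo_*` names are hypothetical and not part of the existing code.

```c
#include <assert.h>
#include <stdint.h>

// Per-thread counters: no lock is needed because each thread only
// touches its own copy.
static __thread uint64_t tls_demo_hits = 0;
static __thread uint64_t tls_demo_misses = 0;

// Record the outcome of one allocation: nonzero for a TLS hit, 0 for a miss.
void demo_tls_record(int hit) {
    if (hit) tls_demo_hits++;
    else tls_demo_misses++;
}

// Hit rate of the calling thread in [0, 1]; 0 when nothing was recorded yet.
double demo_tls_hit_rate(void) {
    uint64_t total = tls_demo_hits + tls_demo_misses;
    assert(total >= tls_demo_hits);  // guards against counter wrap-around
    return total ? (double)tls_demo_hits / (double)total : 0.0;
}
```

In `hak_tiny_alloc`, the fast-path return would record a hit and the refill branch a miss; each thread can then report its own rate at exit.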

**Implementation Time**: 2 hours

---

### **Phase 6.15 P2: L2 Pool TLS** (3 hours)

**Goal**: Thread-local cache for 2-32KB allocations (5 size classes)

**Pattern**: Same as Tiny Pool TLS (above)

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_pool.c`

**Changes**: (Similar structure to Tiny Pool TLS)

1. Add `static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];`
2. Implement TLS fast path in `hak_pool_alloc()`
3. Implement TLS refill logic (global freelist → TLS cache)
4. Implement TLS return logic (empty slabs → global freelist)
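
The four steps above can be sketched on a generic block freelist: a small per-thread stack per size class, with the global freelist touched only under a lock when the stack is empty (refill) or full (return). This is an illustration under assumptions, not the `hakmem_pool.c` implementation: `DemoBlock`, `DEMO_NUM_CLASSES`, `DEMO_TLS_CAP`, and all `demo_*` functions are hypothetical, and the capacity of 4 is arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define DEMO_NUM_CLASSES 5  // mirrors the "5 size classes" goal above
#define DEMO_TLS_CAP 4      // assumed per-thread cache capacity per class

typedef struct DemoBlock { struct DemoBlock* next; } DemoBlock;

// Global freelists, guarded by g_demo_lock.
static DemoBlock* g_demo_freelist[DEMO_NUM_CLASSES];
static pthread_mutex_t g_demo_lock = PTHREAD_MUTEX_INITIALIZER;

// Per-thread caches: accessed without any lock.
static __thread DemoBlock* tls_demo_cache[DEMO_NUM_CLASSES];
static __thread int tls_demo_count[DEMO_NUM_CLASSES];

int demo_tls_cached(int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    return tls_demo_count[cls];
}

// Fast path: pop from the TLS cache. Slow path: pop one block from the
// global freelist under the lock. Returns NULL when both are empty, in
// which case the caller would carve a new block.
DemoBlock* demo_pool_alloc(int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    DemoBlock* b = tls_demo_cache[cls];
    if (b) {
        tls_demo_cache[cls] = b->next;
        tls_demo_count[cls]--;
        return b;
    }
    pthread_mutex_lock(&g_demo_lock);
    b = g_demo_freelist[cls];
    if (b) g_demo_freelist[cls] = b->next;
    pthread_mutex_unlock(&g_demo_lock);
    return b;
}

// Fast path: push onto the TLS cache while it has room. Slow path:
// return the block to the global freelist under the lock.
void demo_pool_free(int cls, DemoBlock* b) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES && b != NULL);
    if (tls_demo_count[cls] < DEMO_TLS_CAP) {
        b->next = tls_demo_cache[cls];
        tls_demo_cache[cls] = b;
        tls_demo_count[cls]++;
        return;
    }
    pthread_mutex_lock(&g_demo_lock);
    b->next = g_demo_freelist[cls];
    g_demo_freelist[cls] = b;
    pthread_mutex_unlock(&g_demo_lock);
}
```

Bounding the per-thread cache keeps one thread from hoarding blocks that other threads could use, which is the memory-overhead concern listed for P2 in the risk table.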

**Expected Performance**:
- TLS hit rate: 90%+
- Cumulative 4T performance: 15-18M ops/sec

**Implementation Time**: 3 hours

---

### **Phase 6.15 P3: L2.5 Pool TLS Expansion** (3 hours)

**Goal**: Expand existing L2.5 TLS to full implementation

**Current State**: `hakmem_l25_pool.c:26` already has TLS declaration:
```c
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
```

**Missing**: TLS refill/eviction logic (the TLS cache is currently only consulted on the allocation fast path)

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_l25_pool.c`

**Changes**:

1. **Implement TLS refill** (in `hak_l25_pool_alloc`):
```c
// Existing TLS check (line ~230)
L25Block* block = tls_l25_cache[class_idx];
if (block) {
    tls_l25_cache[class_idx] = NULL;  // Pop from TLS
    // ... existing header rewrite ...
    return user_ptr;
}

// NEW: TLS refill from global freelist
HAKMEM_LOCK();

int shard_idx = (site_id >> 4) & (L25_NUM_SHARDS - 1);

// Check non-empty bitmap
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
    // Empty freelist, allocate new bundle
    // ... existing logic ...
} else {
    // Pop from global freelist
    block = g_l25_pool.freelist[class_idx][shard_idx];
    g_l25_pool.freelist[class_idx][shard_idx] = block->next;

    // Update bitmap if freelist is now empty
    if (!g_l25_pool.freelist[class_idx][shard_idx]) {
        g_l25_pool.nonempty_mask[class_idx] &= ~(1ULL << shard_idx);
    }

    // Move to TLS cache
    tls_l25_cache[class_idx] = block;
}

HAKMEM_UNLOCK();

// Allocate from TLS cache
block = tls_l25_cache[class_idx];
tls_l25_cache[class_idx] = NULL;
// ... existing header rewrite ...
return user_ptr;
```

2. **Implement TLS eviction** (in `hak_l25_pool_free`):
```c
// Existing logic to add to freelist
L25Block* block = (L25Block*)hdr;

// Phase 6.15 P3: Add to TLS cache first (if empty)
if (!tls_l25_cache[class_idx]) {
    tls_l25_cache[class_idx] = block;
    block->next = NULL;
    return;  // No need to lock
}

// TLS cache full, return to global freelist (locked)
HAKMEM_LOCK();

block->next = g_l25_pool.freelist[class_idx][shard_idx];
g_l25_pool.freelist[class_idx][shard_idx] = block;

// Update bitmap
g_l25_pool.nonempty_mask[class_idx] |= (1ULL << shard_idx);

HAKMEM_UNLOCK();
```
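
Both snippets rely on two small pieces of bit arithmetic: deriving a shard index from `site_id`, and keeping one bit per shard in `nonempty_mask`. The sketch below isolates them. The shard count of 64 is an assumption made for illustration (the real `L25_NUM_SHARDS` value is not given in this plan); the masking trick only works when the count is a power of two, and a `uint64_t` mask limits it to at most 64. The `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_SHARDS 64  // assumed value; must be a power of two and <= 64

// Same expression as in the snippets above: drop the low 4 bits of the
// site id, then mask the result down to a shard index.
int demo_shard_index(uintptr_t site_id) {
    return (int)((site_id >> 4) & (DEMO_NUM_SHARDS - 1));
}

// Mark a shard's freelist as non-empty.
void demo_mask_set(uint64_t* mask, int shard) {
    assert(shard >= 0 && shard < DEMO_NUM_SHARDS);
    *mask |= (1ULL << shard);
}

// Mark a shard's freelist as empty.
void demo_mask_clear(uint64_t* mask, int shard) {
    assert(shard >= 0 && shard < DEMO_NUM_SHARDS);
    *mask &= ~(1ULL << shard);
}

// Returns 1 if the shard's freelist is marked non-empty, else 0.
int demo_mask_test(uint64_t mask, int shard) {
    assert(shard >= 0 && shard < DEMO_NUM_SHARDS);
    return (int)((mask >> shard) & 1ULL);
}
```

Site ids that differ only in their low 4 bits map to the same shard, so `demo_shard_index(0x1230)` and `demo_shard_index(0x1234)` both yield 35.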

**Expected Performance**:
- TLS hit rate: 95%+
- Cumulative 4T performance: 18-22M ops/sec (+445-567%)

**Implementation Time**: 3 hours

---

## 📋 **Implementation Checklist**

### **Step 1: Documentation** (1 hour) ✅
- [ ] Task 1.1: Fix PHASE_6.14_COMPLETION_REPORT.md (15 min)
- [ ] Task 1.2: Create PHASE_6.15_PLAN.md (30 min) ← THIS DOCUMENT
- [ ] Task 1.3: Update CURRENT_TASK.md (10 min)
- [ ] Task 1.4: Update README.md if exists (5 min)
- [ ] Task 1.5: Verification (5 min)

### **Step 2: P0 Safety Lock** (2-3 hours)
- [ ] Task 2.1: Implementation (30 min)
  - [ ] Add pthread.h include
  - [ ] Add g_hakmem_lock + HAKMEM_LOCK/UNLOCK macros
  - [ ] Wrap hak_alloc_at() with lock
  - [ ] Wrap hak_free_at() with lock
- [ ] Task 2.2: Build & Smoke Test (15 min)
  - [ ] `make clean && make bench_allocators`
  - [ ] Single-threaded test (json scenario)
  - [ ] Verify: 13-15M ops/sec
- [ ] Task 2.3: Multi-threaded Validation (1 hour)
  - [ ] Test 1: larson 1T/4T (30 min)
  - [ ] Test 2: Helgrind race detection (20 min)
  - [ ] Test 3: Stability test 10 runs (10 min)
- [ ] Task 2.4: Document Results (15 min)
  - [ ] Create PHASE_6.15_P0_RESULTS.md

### **Step 3: TLS Performance** (8-10 hours)
- [ ] **P1: Tiny Pool TLS** (2 hours)
  - [ ] Add `tls_tiny_cache[]` declaration
  - [ ] Implement `hak_tiny_tls_init()`
  - [ ] Modify `hak_tiny_alloc()` (TLS fast path)
  - [ ] Modify `hak_tiny_free()` (TLS eviction)
  - [ ] Test: larson 4T → 12-15M ops/sec
  - [ ] Document: PHASE_6.15_P1_RESULTS.md

- [ ] **P2: L2 Pool TLS** (3 hours)
  - [ ] Add `tls_l2_cache[]` declaration
  - [ ] Implement TLS fast path in `hak_pool_alloc()`
  - [ ] Implement TLS refill logic
  - [ ] Implement TLS eviction logic
  - [ ] Test: larson 4T → 15-18M ops/sec
  - [ ] Document: PHASE_6.15_P2_RESULTS.md

- [ ] **P3: L2.5 Pool TLS Expansion** (3 hours)
  - [ ] Implement TLS refill in `hak_l25_pool_alloc()`
  - [ ] Implement TLS eviction in `hak_l25_pool_free()`
  - [ ] Test: larson 4T → 18-22M ops/sec
  - [ ] Document: PHASE_6.15_P3_RESULTS.md

- [ ] **Final Validation** (1 hour)
  - [ ] larson 1T/4T/16T full validation
  - [ ] Internal benchmarks (json/mir/vm)
  - [ ] Helgrind final check
  - [ ] Create PHASE_6.15_COMPLETION_REPORT.md

---

## ⚠️ **Risk Assessment**

| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0 (Safety Lock)** | **ZERO** | Worst case: slow but safe | N/A |
| **P1 (Tiny TLS)** | **LOW** | TLS miss overhead | Feature flag `HAKMEM_TLS_TINY` |
| **P2 (L2 TLS)** | **LOW** | Memory overhead (TLS×threads) | Monitor RSS |
| **P3 (L2.5 TLS)** | **LOW** | Existing code 50% done | Incremental |

**Rollback Strategy**:
- Every phase has `#ifdef HAKMEM_TLS_PHASEX`
- Can disable individual TLS layers if issues found
- P0 Safety Lock ensures correctness even if TLS disabled
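
A sketch of how such a flag could gate a TLS path while keeping the locked path as the fallback is shown below. It assumes `HAKMEM_TLS_TINY` (the flag named in the risk table) is a compile-time macro passed with `-DHAKMEM_TLS_TINY`; the plan does not state whether the flag is compile-time or runtime, and the `demo_*` names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

#ifdef HAKMEM_TLS_TINY
// TLS path: one counter per thread, no lock needed.
static __thread long tls_demo_allocs = 0;
#else
// Fallback path: one shared counter behind a lock (the P0 pattern).
static pthread_mutex_t demo_flag_lock = PTHREAD_MUTEX_INITIALIZER;
static long demo_global_allocs = 0;
#endif

// Returns 1 when the TLS path was compiled in, 0 for the locked fallback.
int demo_tls_tiny_enabled(void) {
#ifdef HAKMEM_TLS_TINY
    return 1;
#else
    return 0;
#endif
}

// Counts one allocation and returns the count visible to the caller:
// per-thread with the flag, process-wide without it.
long demo_count_alloc(void) {
#ifdef HAKMEM_TLS_TINY
    return ++tls_demo_allocs;
#else
    int rc = pthread_mutex_lock(&demo_flag_lock);
    assert(rc == 0);
    long n = ++demo_global_allocs;
    rc = pthread_mutex_unlock(&demo_flag_lock);
    assert(rc == 0);
    (void)rc;  // silence unused-variable warnings when NDEBUG is defined
    return n;
#endif
}
```

Building without the macro yields the locked behavior, so disabling a TLS layer is a rebuild rather than a code change.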

---

## 🎯 **Success Criteria**

### **Minimum Success** (P0 only)
- ✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
- ✅ Zero race conditions (Helgrind)
- ✅ 10/10 stability runs

### **Target Success** (P0 + P1 + P2)
- ✅ 4T ≥ 15M ops/sec (+355% vs 3.3M baseline)
- ✅ TLS hit rate ≥ 90%
- ✅ No single-threaded regression (≤15% overhead)

### **Stretch Goal** (All Phases)
- ✅ 4T ≥ 18M ops/sec (+445%)
- ✅ 16T ≥ 11.6M ops/sec (match system allocator)
- ✅ Scalable up to 32 threads

### **Validated** (Phase 6.13 Proof)
- ✅ **ALREADY ACHIEVED**: 4T = **15.9M ops/sec** (+381%) ✅

---

## 📊 **Expected Timeline**

### **Week 1: Foundation** (Day 1-2)
- **Day 1 AM** (1 hour): Step 1 - Documentation updates
- **Day 1 PM** (2-3 hours): Step 2 - P0 Safety Lock
- **Day 2** (2 hours): Step 3 - P1 Tiny Pool TLS

**Milestone**: 4T = 12-15M ops/sec (+264-355%)

### **Week 2: Expansion** (Day 3-5)
- **Day 3-4** (3 hours): Step 3 - P2 L2 Pool TLS
- **Day 5** (3 hours): Step 3 - P3 L2.5 Pool TLS

**Milestone**: 4T = 18-22M ops/sec (+445-567%)

### **Week 3: Validation** (Day 6)
- **Day 6** (1 hour): Final validation + completion report

**Milestone**: ✅ **Phase 6.15 Complete**

---

## 🔬 **Technical References**

### **Existing TLS Implementation**
**File**: `apps/experiments/hakmem-poc/hakmem_l25_pool.c:26`
```c
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
```

**Pattern**: Per-thread cache for each size class (L1 cache hit)

### **Phase 6.13 Validation**
**File**: `apps/experiments/hakmem-poc/PHASE_6.13_INITIAL_RESULTS.md`

**Results**:
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)
- **Proof**: TLS works and provides massive benefit

### **Thread Safety Analysis**
**File**: `apps/experiments/hakmem-poc/THREAD_SAFETY_SOLUTION.md`

**Key Insights**:
- mimalloc/jemalloc both use TLS as primary approach
- TLS hit rate: 95%+ (industry standard)
- Lock contention: 5% (only on TLS miss/refill)

---

## 📝 **Implementation Notes**

### **Why 3 Stages?**
1. **Step 1 (Docs)**: Ensure clarity on what went wrong (67.9M issue) and what's being fixed
2. **Step 2 (P0)**: Prove correctness FIRST (no crashes, no data races)
3. **Step 3 (P1-P3)**: Optimize for performance (TLS) with safety already guaranteed

### **Why Not Skip P0?**
- **Risk mitigation**: If TLS fails, we still have working thread-safe allocator
- **Debugging**: Easier to debug TLS issues with known-working locked baseline
- **Validation**: P0 proves the global lock pattern is correct

### **Why TLS Over Lock-free?**
- **Phase 6.14 proved**: Sequential O(N) is 2.9-13.7x faster than Random O(1) Hash
- **Implication**: Lock-free atomic hash will be SLOWER than TLS
- **Industry standard**: mimalloc/jemalloc use TLS, not lock-free
- **Proven**: Phase 6.13 validated a +123-147% improvement over the system allocator with TLS

---

## 🚀 **Next Steps After Phase 6.15**

### **Phase 6.17: 16-Thread Scalability** (Optional, 4 hours)
**Current Issue**: 16T = 7.6M ops/sec (-34.8% vs system 11.6M)

**Investigation**:
1. Profile global lock contention (perf, helgrind)
2. Measure Whale cache hit rate by thread count
3. Analyze shard distribution (hash collision?)
4. Optimize TLS cache refill (batch refill to reduce global access)
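
Item 4 (batch refill) can be illustrated as follows: instead of taking the global lock once per block, a TLS miss moves several blocks into the thread's cache under a single lock acquisition. This is a sketch of the general technique, not a measured optimization for hakmem; `DemoNode` and the `demo_batch_*` functions are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

typedef struct DemoNode { struct DemoNode* next; } DemoNode;

static DemoNode* g_demo_batch_list = NULL;  // global freelist, guarded by the lock
static pthread_mutex_t g_demo_batch_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread DemoNode* tls_demo_batch_cache = NULL;  // per-thread, lock-free

// Adds one node to the global freelist.
void demo_batch_global_push(DemoNode* n) {
    assert(n != NULL);
    pthread_mutex_lock(&g_demo_batch_lock);
    n->next = g_demo_batch_list;
    g_demo_batch_list = n;
    pthread_mutex_unlock(&g_demo_batch_lock);
}

// Moves up to max_batch nodes from the global list into the calling
// thread's cache with one lock acquisition. Returns the number moved.
int demo_batch_refill(int max_batch) {
    assert(max_batch > 0);
    int moved = 0;
    pthread_mutex_lock(&g_demo_batch_lock);
    while (moved < max_batch && g_demo_batch_list) {
        DemoNode* n = g_demo_batch_list;
        g_demo_batch_list = n->next;
        n->next = tls_demo_batch_cache;
        tls_demo_batch_cache = n;
        moved++;
    }
    pthread_mutex_unlock(&g_demo_batch_lock);
    return moved;
}

// Pops from the TLS cache, refilling in a batch on a miss.
// Returns NULL when both the cache and the global list are empty.
DemoNode* demo_batch_alloc(int max_batch) {
    if (!tls_demo_batch_cache && demo_batch_refill(max_batch) == 0) return NULL;
    DemoNode* n = tls_demo_batch_cache;
    tls_demo_batch_cache = n->next;
    return n;
}
```

With a batch size of N, the global lock is taken roughly once per N allocations on the miss path, at the cost of up to N blocks parked in each thread's cache.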

**Target**: 16T ≥ 11.6M ops/sec (match or beat system)

---

## 📚 **Related Documents**

- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis (Option A/B/C comparison)
- [PHASE_6.13_INITIAL_RESULTS.md](PHASE_6.13_INITIAL_RESULTS.md) - TLS validation proof
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Registry toggle + thread issue discovery
- [CURRENT_TASK.md](CURRENT_TASK.md) - Overall project status

---

**Total Time Investment**: 12-13 hours
**Expected ROI**: **6-15x improvement** (3.3M → 20-50M ops/sec)
**Risk**: Low (feature flags + proven design)
**Validation**: Phase 6.13 already proves TLS works (**+147%** vs the system allocator at 4 threads)

---

**Implementation by**: Claude + ChatGPT (collaborative development)
**Planning Date**: 2025-10-22
**Status**: ✅ **Ready to Execute**