Phase 6.15: Multi-threaded Safety + TLS Performance - Implementation Plan

**Date**: 2025-10-22
**Status**: 📋 Planning Complete
**Total Time**: 12-13 hours (3 weeks)


📊 Executive Summary

Current Problem

The hakmem allocator is completely thread-unsafe, with catastrophic multi-threaded performance:

| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.1M ops/sec | baseline |
| 4-thread | 3.3M ops/sec | -78% |

Root Cause: Zero thread synchronization primitives in the current codebase (no `pthread_mutex` anywhere)
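
To make the failure mode concrete: every pool performs an unsynchronized freelist push on free. A minimal sketch of the lost-update race, where `g_free_head` stands in for `g_tiny_pool.free_slabs[class_idx]`:

```c
#include <stddef.h>

typedef struct Slab { struct Slab* next; } Slab;
static Slab* g_free_head = NULL;  // stand-in for g_tiny_pool.free_slabs[class_idx]

// Two threads running this concurrently can both read the same head in (1),
// then both write in (2): one pushed slab is silently lost, and the list
// can be left pointing at duplicated nodes.
void push_slab(Slab* slab) {
    slab->next = g_free_head;   // (1) read head
    g_free_head = slab;         // (2) write head - races with (1) in another thread
}
```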

Solution Strategy

3-Stage Gradual Implementation:

  1. Step 1: Document updates (1 hour) - Fix 67.9M measurement issue, create Phase 6.15 plan
  2. Step 2: P0 Safety Lock (30 min + testing) - Ensure correctness with minimal changes
  3. Step 3: TLS Performance (8-10 hours) - Achieve 4T = 15.9M ops/sec (+381%, validated in Phase 6.13)

Expected Outcome:

  • Minimum Success (P0): 4T = 1T performance (safe, no scalability)
  • Target Success (P0+P1): 4T = 12-15M ops/sec (+264-355%)
  • Validated (Phase 6.13): 4T = 15.9M ops/sec (+381%) ALREADY PROVEN

🎯 Step 1: Documentation Updates (1 hour)

Task 1.1: Fix Phase 6.14 Completion Report (15 minutes)

File: apps/experiments/hakmem-poc/PHASE_6.14_COMPLETION_REPORT.md

Current Problem:

  • Report focuses on Registry ON/OFF toggle
  • No mention of 67.9M ops/sec measurement issue
  • Misleading performance claims

Required Changes:

  1. Add Executive Summary Section (after line 9):
## ⚠️ **Important Note: 67.9M Performance Measurement**

**Issue**: Earlier reports mentioned 67.9M ops/sec performance
**Status**: ❌ **NOT REPRODUCIBLE** - Likely measurement error

**Actual Achievements**:
- ✅ Registry ON/OFF toggle implementation (Pattern 2)
- ✅ O(N) proven 2.9-13.7x faster than O(1) for Small-N (8-32 slabs)
- ✅ Default: `g_use_registry = 0` (O(N) Sequential Access)

**Performance Reality**:
- 1-thread: 15.3M ops/sec (O(N), validated)
- 4-thread: **3.3M ops/sec** (THREAD-UNSAFE, requires Phase 6.15 fix)
  2. Update Section Title (line 9):
## 📊 **Executive Summary: Registry Toggle + Thread Safety Issue**
  3. Add Thread Safety Warning (after line 158):
---

## 🚨 **Critical Discovery: Thread Safety Issue**

### **Problem**
Phase 6.14 testing revealed **catastrophic multi-threaded performance collapse**:

| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.3M ops/sec | baseline |
| 4-thread | **3.3M ops/sec** | **-78%** ❌ |

**Root Cause**: `grep pthread_mutex *.c` → **0 results** (no locks!)

**Impact**: All global structures are race-condition prone:
- `g_tiny_pool.free_slabs[]` - Concurrent access without locks
- `g_l25_pool.freelist[]` - Multiple threads modifying same freelist
- `g_slab_registry[]` - Hash table corruption
- `g_whale_cache` - Ring buffer race conditions

### **Solution**
**Phase 6.15**: Multi-threaded Safety + TLS Performance
- **P0** (30 min): Global safety lock (correctness first)
- **P1** (2 hours): Tiny Pool TLS (95%+ lock avoidance)
- **P2** (3 hours): L2 Pool TLS (full coverage)
- **P3** (3 hours): L2.5 Pool TLS expansion

**Expected Results**:
- P0: 4T = 13-15M ops/sec (safe, no scalability)
- P0+P1: 4T = 12-15M ops/sec (+264-355%)
- **Validated**: 4T = **15.9M ops/sec** (+381% vs 3.3M baseline) ✅

**Reference**: `THREAD_SAFETY_SOLUTION.md` - Complete analysis

Estimated Time: 15 minutes


Task 1.2: Create Phase 6.15 Plan Document (30 minutes)

File: apps/experiments/hakmem-poc/PHASE_6.15_PLAN.md (THIS FILE)

Contents: Already created (this document)

Sections:

  1. Executive Summary
  2. Step 1: Documentation Updates (detailed)
  3. Step 2: P0 Safety Lock (implementation + testing)
  4. Step 3: Multi-threaded Performance (P1-P3 breakdown)
  5. Implementation Checklist
  6. Risk Assessment
  7. Success Criteria

Estimated Time: 30 minutes (already completed)


Task 1.3: Update CURRENT_TASK.md (10 minutes)

File: apps/experiments/hakmem-poc/CURRENT_TASK.md

Required Changes:

  1. Update Current Status (after line 30):
## 🎯 **Current Focus: Phase 6.15 Multi-threaded Safety** (2025-10-22)

### **Immediate Priority: Thread Safety Fix** ⚠️

**Problem Discovered**: hakmem is completely thread-unsafe
- 4-thread performance: **3.3M ops/sec** (-78% vs 1-thread 15.1M)
- Root cause: Zero synchronization primitives (no `pthread_mutex`)

**Solution in Progress**: Phase 6.15 (3-stage implementation)
1. **Step 1**: Documentation updates (1 hour) ← IN PROGRESS
2. **Step 2**: P0 Safety Lock (30 min + testing)
3. **Step 3**: TLS Performance (P1-P3, 8-10 hours)

**Expected Outcome**: 4T = 15.9M ops/sec (validated in Phase 6.13)

**Planning Document**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
  2. Move Phase 6.14 to Completed Section (after line 296):
## ✅ Phase 6.14 Complete (2025-10-22)

**Implementation Complete**: Registry ON/OFF toggle + Thread Safety Issue discovered

**✅ Completed Work**:
1. **Pattern 2 implementation**: ON/OFF toggle via the `HAKMEM_USE_REGISTRY` environment variable
2. **O(N) vs O(1) validation**: Demonstrated O(N) is 2.9-13.7x faster
3. **Default setting**: `g_use_registry = 0` (O(N) Sequential Access)

**🚨 Critical Discovery**: 4-thread performance collapse (-78%)
- Cause: no locking on any global structure
- Fix: planned for Phase 6.15

**📊 Measured Results**:

1-thread: 15.3M ops/sec (O(N), Registry OFF)
4-thread: 3.3M ops/sec (-78% ← THREAD-UNSAFE)

**Detailed Documentation**:
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Pattern 2 implementation
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis
- [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - Fix plan

**Implementation Time**: 34 minutes (as planned)

Estimated Time: 10 minutes


Task 1.4: Update README (if needed) (5 minutes)

File: apps/experiments/hakmem-poc/README.md (if exists)

Check if exists:

ls -la apps/experiments/hakmem-poc/README.md

If exists, add warning:

## ⚠️ **Current Status: Thread Safety in Development**

**Known Issue**: hakmem is currently thread-unsafe
- **Single-threaded**: 15.1M ops/sec ✅ Excellent
- **Multi-threaded**: 3.3M ops/sec (4T) ❌ Requires fix

**Fix in Progress**: Phase 6.15 Multi-threaded Safety
- Expected completion: 2025-10-24 (2-3 days)
- Target performance: 15-20M ops/sec at 4 threads

**Do NOT use in multi-threaded applications until Phase 6.15 is complete.**

Estimated Time: 5 minutes (or skip if README doesn't exist)


Task 1.5: Verification (5 minutes)

Checklist:

  • PHASE_6.14_COMPLETION_REPORT.md updated (67.9M issue documented)
  • PHASE_6.15_PLAN.md created (this document)
  • CURRENT_TASK.md updated (Phase 6.15 status)
  • README.md updated (if exists)

Verification Commands:

cd apps/experiments/hakmem-poc

# Check files exist
ls -la PHASE_6.14_COMPLETION_REPORT.md
ls -la PHASE_6.15_PLAN.md
ls -la CURRENT_TASK.md

# Grep for keywords
grep -n "67.9M\|Thread Safety\|Phase 6.15" PHASE_6.14_COMPLETION_REPORT.md
grep -n "Phase 6.15\|Thread Safety" CURRENT_TASK.md

Estimated Time: 5 minutes


⏱️ Step 1 Total Time: 1 hour 5 minutes


🔐 Step 2: P0 Safety Lock Implementation (2-3 hours)

Goal

Ensure correctness with minimal code changes. No performance improvement expected (4T ≈ 1T).

Success Criteria

  • 1-thread: 13-15M ops/sec (lock overhead of 0-15% is acceptable)
  • 4-thread: 13-15M ops/sec (no scalability, but SAFE)
  • Helgrind: 0 data races
  • Stability: 10 consecutive runs without crash

Task 2.1: Implementation (30 minutes)

File: apps/experiments/hakmem-poc/hakmem.c

Changes Required:

  1. Add pthread.h include (after line 22):
#include <pthread.h>  // Phase 6.15 P0: Thread Safety
  2. Add global lock (after line 58):
// ============================================================================
// Phase 6.15 P0: Thread Safety - Global Lock
// ============================================================================

// Global lock for all allocator operations
// Purpose: Ensure correctness in multi-threaded environment
// Performance: 4T ≈ 1T (no scalability, safety first)
// Will be replaced by TLS in P1-P3 (95%+ lock avoidance)
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

// Lock/unlock helpers (for debugging and future instrumentation)
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
  3. Wrap hak_alloc_at() (find the function, approximately line 300-400):
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    void* ptr = hak_alloc_at_internal(size, site_id);

    HAKMEM_UNLOCK();
    return ptr;
}

// Rename old hak_alloc_at to hak_alloc_at_internal
// (forward-declare it above the wrapper so the call compiles)
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
    // ... existing code (no changes) ...
}
  4. Wrap hak_free_at() (find the function):
void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    hak_free_at_internal(ptr, site_id);

    HAKMEM_UNLOCK();
}

// Rename old hak_free_at to hak_free_at_internal
static void hak_free_at_internal(void* ptr, uintptr_t site_id) {
    // ... existing code (no changes) ...
}
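
One caveat on this wrapper pattern: `PTHREAD_MUTEX_INITIALIZER` yields a non-recursive mutex, so if any internal path ever re-enters the public API (e.g., a helper calling `hak_alloc_at()` on the same thread), the thread deadlocks on its own lock. If re-entrancy is possible, a recursive mutex is a safe hedge; a sketch, with an illustrative init function:

```c
#include <pthread.h>

// Hypothetical alternative to PTHREAD_MUTEX_INITIALIZER: a recursive mutex,
// so a thread that re-enters hak_alloc_at()/hak_free_at() while already
// holding g_hakmem_lock does not self-deadlock.
static pthread_mutex_t g_hakmem_lock;
static pthread_once_t  g_hakmem_lock_once = PTHREAD_ONCE_INIT;

static void hakmem_lock_init(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&g_hakmem_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
// Call pthread_once(&g_hakmem_lock_once, hakmem_lock_init) before first lock.
```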
  5. Protect hak_init() (find initialization function):
void hak_init(void) {
    // Phase 6.15 P0: No lock needed (called once before any threads spawn)
    // But add atomic check for safety

    // ... existing init code ...
}
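
A sketch of the atomic check mentioned above, assuming C11 `<stdatomic.h>` is available; it makes a repeated or concurrent `hak_init()` call harmless:

```c
#include <stdatomic.h>

static atomic_int g_hakmem_initialized = 0;

void hak_init(void) {
    int expected = 0;
    // Only the first caller wins the CAS and runs initialization;
    // later or concurrent callers return immediately.
    if (!atomic_compare_exchange_strong(&g_hakmem_initialized, &expected, 1)) {
        return;
    }
    // ... existing init code ...
}
```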

Estimated Time: 30 minutes


Task 2.2: Build & Smoke Test (15 minutes)

Commands:

cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Clean build
make clean
make bench_allocators

# Smoke test (single-threaded)
./bench_allocators --allocator hakmem-baseline --scenario json

# Expected: ~300-350ns (slight overhead acceptable)

Success Criteria:

  • Build succeeds (no compilation errors)
  • No crashes on single-threaded test
  • Performance: 13-15M ops/sec (within 0-15% of Phase 6.14)

Estimated Time: 15 minutes


Task 2.3: Multi-threaded Validation (1 hour)

Test 1: larson Benchmark (30 minutes)

Setup:

# Build shared library (if not already done)
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make clean && make shared

# Verify library
ls -lh libhakmem.so
nm -D libhakmem.so | grep -E "malloc|free|calloc|realloc"

Benchmark Execution:

cd /tmp/mimalloc-bench/bench/larson

# 1-thread baseline
./larson 0 8 1024 10000 1 12345 1

# 1-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 1

# Expected: 13-15M ops/sec (lock overhead 0-15%)
# 4-thread baseline
./larson 0 8 1024 10000 1 12345 4

# 4-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4

# Expected: 13-15M ops/sec (same as 1T, no scalability)
# Critical: NO CRASHES, NO DATA CORRUPTION

Success Criteria:

  • 1T: 13-15M ops/sec (within 15% of Phase 6.14)
  • 4T: 13-15M ops/sec (no scalability expected)
  • 4T: NO crashes, NO segfaults
  • 4T: NO data corruption (verify checksum if larson supports)

Estimated Time: 30 minutes


Test 2: Helgrind Race Detection (20 minutes)

Purpose: Verify all data races are eliminated

Commands:

cd /tmp/mimalloc-bench/bench/larson

# Install valgrind (if not installed)
sudo apt-get install -y valgrind

# Run Helgrind on 4-thread test
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
valgrind --tool=helgrind \
  --read-var-info=yes \
  ./larson 0 8 1024 1000 1 12345 4
  # Note: LD_PRELOAD is set in the environment (valgrind would otherwise
  # treat it as the program to run). Iterations reduced (1000 instead of
  # 10000) for a faster run.

# Expected output:
# ERROR SUMMARY: 0 errors from 0 contexts (suppressed: X from Y)

Success Criteria:

  • ERROR SUMMARY: 0 errors (zero data races)
  • No warnings about unprotected reads/writes
  • ⚠️ NOTE: Helgrind may show false positives from libc. Ignore if they are NOT in hakmem code.

Estimated Time: 20 minutes


Test 3: Stability Test (10 minutes)

Purpose: Ensure no crashes over 10 consecutive runs

Commands:

cd /tmp/mimalloc-bench/bench/larson

# 10 consecutive 4-thread runs
for i in {1..10}; do
  echo "Run $i/10..."
  LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4 || { echo "FAILED at run $i"; exit 1; }
done

echo "✅ All 10 runs succeeded!"

Success Criteria:

  • 10/10 runs complete without crashes
  • Performance stable across runs (variance < 10%)

Estimated Time: 10 minutes


Task 2.4: Document Results (15 minutes)

Create: apps/experiments/hakmem-poc/PHASE_6.15_P0_RESULTS.md

Template:

# Phase 6.15 P0: Safety Lock Implementation - Results

**Date**: 2025-10-22
**Status**: ✅ **COMPLETED** (Correctness achieved)
**Implementation Time**: X minutes

---

## 📊 **Benchmark Results**

### **larson (mimalloc-bench)**

| Threads | Before P0 (UNSAFE) | After P0 (SAFE) | Change |
|---------|-------------------|-----------------|--------|
| 1-thread | 15.1M ops/sec | X.XM ops/sec | ±X% |
| 4-thread | 3.3M ops/sec | X.XM ops/sec | +XXX% ✅ |

**Performance Summary**:
- 1-thread overhead: X% (lock overhead, acceptable)
- 4-thread improvement: +XXX% (from -78% to safe)
- 4-thread scalability: X.Xx (4T / 1T, expected ~1.0)

---

## ✅ **Success Criteria Met**

- ✅ 1T performance: X.XM ops/sec (within 15% of Phase 6.14)
- ✅ 4T performance: X.XM ops/sec (safe, no scalability)
- ✅ Helgrind: **0 data races** detected
- ✅ Stability: **10/10 runs** without crashes

---

## 🔧 **Implementation Details**

**Files Modified**:
- `hakmem.c` - Added global lock + wrapper functions

**Lines Changed**:
- +20 lines (pthread.h, global lock, HAKMEM_LOCK/UNLOCK macros)
- +10 lines (hak_alloc_at wrapper)
- +10 lines (hak_free_at wrapper)
- **Total**: ~40 lines

**Pattern**:
```c
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
}
```

🎯 Next Steps

Phase 6.15 P1: Tiny Pool TLS (2 hours)

  • Expected: 4T = 12-15M ops/sec (+100-150%)
  • TLS hit rate: 95%+
  • Lock avoidance: 95%+

Start Date: 2025-10-XX


**Estimated Time**: 15 minutes

---

### **Step 2 Total Time: 2-3 hours**

---

## 🚀 **Step 3: Multi-threaded Performance (P1-P3)** (8-10 hours)

### **Overview**

**Goal**: Achieve near-ideal scalability (4T ≈ 4x 1T) using Thread-Local Storage (TLS)

**Validation**: Phase 6.13 already proved TLS works
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)

**Strategy**: Expand existing L2.5 TLS to Tiny Pool and L2 Pool

---

### **Phase 6.15 P1: Tiny Pool TLS** (2 hours)

**Goal**: Thread-local cache for ≤1KB allocations (8 size classes)

**Existing Reference**: `hakmem_l25_pool.c:26` (TLS pattern already implemented)
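
For context, `hak_tiny_get_class_index()` (used in the code below) maps a request size to one of the 8 classes. A hedged sketch, assuming power-of-two classes from 8B to 1KB; the real boundaries live in `hakmem_tiny.c`:

```c
#include <stddef.h>

#define TINY_NUM_CLASSES 8

// Assumed classes: 8, 16, 32, 64, 128, 256, 512, 1024 bytes.
static inline int hak_tiny_get_class_index(size_t size) {
    int idx = 0;
    size_t cap = 8;
    while (cap < size && idx < TINY_NUM_CLASSES - 1) {
        cap <<= 1;
        idx++;
    }
    return idx;  // size <= 1KB by the caller's routing, so idx is in range
}
```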

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_tiny.c`

**Changes**:

1. **Add TLS cache** (after line 12):
```c
// Phase 6.15 P1: Thread-Local Storage for Tiny Pool
// Pattern: Same as L2.5 Pool (hakmem_l25_pool.c:26)
// Purpose: Reduce global freelist contention (50 cycles → 10 cycles)
// Hit rate expected: 95%+

static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
static __thread int tls_tiny_initialized = 0;
```
  2. **TLS initialization** (new function):
// Initialize TLS cache for current thread
static void hak_tiny_tls_init(void) {
    if (tls_tiny_initialized) return;

    // Initialize all size classes to NULL
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        tls_tiny_cache[i] = NULL;
    }

    tls_tiny_initialized = 1;
}
  3. **Modify hak_tiny_alloc** (existing function):
void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
    // Phase 6.15 P1: TLS fast path
    if (!tls_tiny_initialized) {
        hak_tiny_tls_init();
    }

    int class_idx = hak_tiny_get_class_index(size);

    // TLS hit check (no lock needed)
    TinySlab* slab = tls_tiny_cache[class_idx];
    if (slab && slab->free_count > 0) {
        // Fast path: Allocate from TLS cache
        return hak_tiny_alloc_from_slab(slab, class_idx);
    }

    // TLS miss: Refill from global freelist (locked)
    HAKMEM_LOCK();

    // Try to get a slab from global freelist
    slab = g_tiny_pool.free_slabs[class_idx];
    if (slab) {
        // Move slab to TLS cache
        g_tiny_pool.free_slabs[class_idx] = slab->next;
        tls_tiny_cache[class_idx] = slab;
        slab->next = NULL;  // Detach from freelist
    } else {
        // Allocate new slab (existing logic)
        slab = allocate_new_slab(class_idx);
        if (!slab) {
            HAKMEM_UNLOCK();
            return NULL;
        }
        tls_tiny_cache[class_idx] = slab;
    }

    HAKMEM_UNLOCK();

    // Allocate from newly cached slab
    return hak_tiny_alloc_from_slab(slab, class_idx);
}
  4. **Modify hak_tiny_free** (existing function):
void hak_tiny_free(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Find owner slab (O(N) or O(1) depending on g_use_registry)
    TinySlab* slab = hak_tiny_owner_slab(ptr);
    if (!slab) {
        fprintf(stderr, "[Tiny] ERROR: Invalid pointer!\n");
        return;
    }

    int class_idx = slab->size_class;

    // Free block in slab
    hak_tiny_free_in_slab(slab, ptr, class_idx);

    // Check if slab is now empty
    if (slab->free_count == slab->total_count) {
        // Phase 6.15 P1: Return empty slab to global freelist

        // First, remove from TLS cache if it's there
        if (tls_tiny_cache[class_idx] == slab) {
            tls_tiny_cache[class_idx] = NULL;
        }

        // Return to global freelist (locked)
        HAKMEM_LOCK();
        slab->next = g_tiny_pool.free_slabs[class_idx];
        g_tiny_pool.free_slabs[class_idx] = slab;
        HAKMEM_UNLOCK();
    }
}
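
The helpers `hak_tiny_alloc_from_slab()` / `hak_tiny_free_in_slab()` are assumed above. A sketch of what the lock-free fast path relies on, assuming an intrusive per-slab free list; the real `TinySlab` layout lives in `hakmem_tiny.c`, and the `_sketch` type below is illustrative:

```c
// Assumed minimal shape (the real TinySlab also carries size_class,
// total_count, next, ...).
typedef struct TinySlab_sketch {
    void* free_list;   // intrusive list of free blocks (assumption)
    int   free_count;
} TinySlab_sketch;

// No lock needed: the slab is owned by this thread's TLS cache.
static inline void* tiny_alloc_from_slab(TinySlab_sketch* slab) {
    void* block = slab->free_list;      // pop head; caller checked free_count > 0
    slab->free_list = *(void**)block;   // next pointer stored inside the free block
    slab->free_count--;
    return block;
}

static inline void tiny_free_in_slab(TinySlab_sketch* slab, void* ptr) {
    *(void**)ptr = slab->free_list;     // push block back onto the slab's list
    slab->free_list = ptr;
    slab->free_count++;
}
```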

Expected Performance:

  • TLS hit rate: 95%+
  • Lock contention: 5% (only on TLS miss)
  • 4T performance: 12-15M ops/sec (+264-355% vs 3.3M baseline)
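
The 95%+ hit-rate claim is directly measurable. A sketch of per-thread counters (names illustrative): bump the hit counter on the TLS fast path, the miss counter on the locked refill path, and report at thread exit:

```c
#include <stdint.h>
#include <stdio.h>

static __thread uint64_t tls_tiny_hits = 0;    // TLS fast-path allocations
static __thread uint64_t tls_tiny_misses = 0;  // locked refills

static void hak_tiny_report_hit_rate(void) {
    uint64_t total = tls_tiny_hits + tls_tiny_misses;
    if (total == 0) return;
    fprintf(stderr, "[Tiny TLS] hit rate: %.1f%% (%llu hits / %llu allocs)\n",
            100.0 * (double)tls_tiny_hits / (double)total,
            (unsigned long long)tls_tiny_hits,
            (unsigned long long)total);
}
```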

Implementation Time: 2 hours


Phase 6.15 P2: L2 Pool TLS (3 hours)

Goal: Thread-local cache for 2-32KB allocations (5 size classes)

Pattern: Same as Tiny Pool TLS (above)

Implementation

File: apps/experiments/hakmem-poc/hakmem_pool.c

Changes (same structure as the Tiny Pool TLS; a hedged sketch follows this list):

  1. Add static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];
  2. Implement TLS fast path in hak_pool_alloc()
  3. Implement TLS refill logic (global freelist → TLS cache)
  4. Implement TLS return logic (empty slabs → global freelist)
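
A hedged sketch of the alloc side, transposing the Tiny Pool pattern onto the names above; `g_l2_pool` and `hak_pool_class_index()` are illustrative stand-ins, and header rewriting plus upstream fallback are omitted:

```c
#include <stddef.h>
#include <stdint.h>

#define L2_NUM_CLASSES 5

typedef struct L2Block { struct L2Block* next; } L2Block;  // assumed link field

static struct { L2Block* freelist[L2_NUM_CLASSES]; } g_l2_pool;   // stand-in
static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES] = {NULL};

#ifndef HAKMEM_LOCK                      // real versions come from P0 (hakmem.c)
#define HAKMEM_LOCK()   ((void)0)
#define HAKMEM_UNLOCK() ((void)0)
#endif

extern int hak_pool_class_index(size_t size);  // illustrative helper

void* hak_pool_alloc(size_t size, uintptr_t site_id) {
    (void)site_id;
    int class_idx = hak_pool_class_index(size);

    // Fast path: pop from this thread's cache, no lock taken.
    L2Block* block = tls_l2_cache[class_idx];
    if (block) {
        tls_l2_cache[class_idx] = block->next;
        return block;
    }

    // Slow path: refill from the global freelist under the P0 lock.
    HAKMEM_LOCK();
    block = g_l2_pool.freelist[class_idx];
    if (block) {
        g_l2_pool.freelist[class_idx] = block->next;
    }
    HAKMEM_UNLOCK();
    return block;  // NULL falls through to upstream allocation (omitted)
}
```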

Expected Performance:

  • TLS hit rate: 90%+
  • Cumulative 4T performance: 15-18M ops/sec

Implementation Time: 3 hours


Phase 6.15 P3: L2.5 Pool TLS Expansion (3 hours)

Goal: Expand existing L2.5 TLS to full implementation

Current State: hakmem_l25_pool.c:26 already has TLS declaration:

__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};

Missing: TLS refill/eviction logic (currently only used in fast path)

Implementation

File: apps/experiments/hakmem-poc/hakmem_l25_pool.c

Changes:

  1. Implement TLS refill (in hak_l25_pool_alloc):
// Existing TLS check (line ~230)
L25Block* block = tls_l25_cache[class_idx];
if (block) {
    tls_l25_cache[class_idx] = NULL;  // Pop from TLS
    // ... existing header rewrite ...
    return user_ptr;
}

// NEW: TLS refill from global freelist
HAKMEM_LOCK();

int shard_idx = (site_id >> 4) & (L25_NUM_SHARDS - 1);

// Check non-empty bitmap
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
    // Empty freelist, allocate new bundle
    // ... existing logic ...
} else {
    // Pop from global freelist
    block = g_l25_pool.freelist[class_idx][shard_idx];
    g_l25_pool.freelist[class_idx][shard_idx] = block->next;

    // Update bitmap if freelist is now empty
    if (!g_l25_pool.freelist[class_idx][shard_idx]) {
        g_l25_pool.nonempty_mask[class_idx] &= ~(1ULL << shard_idx);
    }

    // Move to TLS cache
    tls_l25_cache[class_idx] = block;
}

HAKMEM_UNLOCK();

// Allocate from TLS cache
block = tls_l25_cache[class_idx];
tls_l25_cache[class_idx] = NULL;
// ... existing header rewrite ...
return user_ptr;
  2. Implement TLS eviction (in hak_l25_pool_free):
// Existing logic to add to freelist
L25Block* block = (L25Block*)hdr;

// Phase 6.15 P3: Add to TLS cache first (if empty)
if (!tls_l25_cache[class_idx]) {
    tls_l25_cache[class_idx] = block;
    block->next = NULL;
    return;  // No need to lock
}

// TLS cache full, return to global freelist (locked)
HAKMEM_LOCK();

block->next = g_l25_pool.freelist[class_idx][shard_idx];
g_l25_pool.freelist[class_idx][shard_idx] = block;

// Update bitmap
g_l25_pool.nonempty_mask[class_idx] |= (1ULL << shard_idx);

HAKMEM_UNLOCK();

Expected Performance:

  • TLS hit rate: 95%+
  • Cumulative 4T performance: 18-22M ops/sec (+445-567%)

Implementation Time: 3 hours


📋 Implementation Checklist

Step 1: Documentation (1 hour)

  • Task 1.1: Fix PHASE_6.14_COMPLETION_REPORT.md (15 min)
  • Task 1.2: Create PHASE_6.15_PLAN.md (30 min) ← THIS DOCUMENT
  • Task 1.3: Update CURRENT_TASK.md (10 min)
  • Task 1.4: Update README.md if exists (5 min)
  • Task 1.5: Verification (5 min)

Step 2: P0 Safety Lock (2-3 hours)

  • Task 2.1: Implementation (30 min)
    • Add pthread.h include
    • Add g_hakmem_lock + HAKMEM_LOCK/UNLOCK macros
    • Wrap hak_alloc_at() with lock
    • Wrap hak_free_at() with lock
  • Task 2.2: Build & Smoke Test (15 min)
    • make clean && make bench_allocators
    • Single-threaded test (json scenario)
    • Verify: 13-15M ops/sec
  • Task 2.3: Multi-threaded Validation (1 hour)
    • Test 1: larson 1T/4T (30 min)
    • Test 2: Helgrind race detection (20 min)
    • Test 3: Stability test 10 runs (10 min)
  • Task 2.4: Document Results (15 min)
    • Create PHASE_6.15_P0_RESULTS.md

Step 3: TLS Performance (8-10 hours)

  • P1: Tiny Pool TLS (2 hours)

    • Add tls_tiny_cache[] declaration
    • Implement hak_tiny_tls_init()
    • Modify hak_tiny_alloc() (TLS fast path)
    • Modify hak_tiny_free() (TLS eviction)
    • Test: larson 4T → 12-15M ops/sec
    • Document: PHASE_6.15_P1_RESULTS.md
  • P2: L2 Pool TLS (3 hours)

    • Add tls_l2_cache[] declaration
    • Implement TLS fast path in hak_pool_alloc()
    • Implement TLS refill logic
    • Implement TLS eviction logic
    • Test: larson 4T → 15-18M ops/sec
    • Document: PHASE_6.15_P2_RESULTS.md
  • P3: L2.5 Pool TLS Expansion (3 hours)

    • Implement TLS refill in hak_l25_pool_alloc()
    • Implement TLS eviction in hak_l25_pool_free()
    • Test: larson 4T → 18-22M ops/sec
    • Document: PHASE_6.15_P3_RESULTS.md
  • Final Validation (1 hour)

    • larson 1T/4T/16T full validation
    • Internal benchmarks (json/mir/vm)
    • Helgrind final check
    • Create PHASE_6.15_COMPLETION_REPORT.md

⚠️ Risk Assessment

| Phase | Risk Level | Failure Mode | Mitigation |
|-------|------------|--------------|------------|
| P0 (Safety Lock) | ZERO | Worst case: slow but safe | N/A |
| P1 (Tiny TLS) | LOW | TLS miss overhead | Feature flag `HAKMEM_TLS_TINY` |
| P2 (L2 TLS) | LOW | Memory overhead (TLS × threads) | Monitor RSS |
| P3 (L2.5 TLS) | LOW | Existing code 50% done | Incremental |

Rollback Strategy:

  • Every phase is guarded by `#ifdef HAKMEM_TLS_PHASEX` (see the sketch below)
  • Can disable individual TLS layers if issues found
  • P0 Safety Lock ensures correctness even if TLS disabled
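
A sketch of the flag pattern, assuming compile-time guards as described; `hak_tiny_alloc_tls()` and `hak_tiny_alloc_locked()` are illustrative names for the two paths:

```c
#include <stddef.h>
#include <stdint.h>

extern void* hak_tiny_alloc_tls(size_t size, uintptr_t site_id);     // P1 path
extern void* hak_tiny_alloc_locked(size_t size, uintptr_t site_id);  // P0 path

#ifndef HAKMEM_LOCK                      // real versions come from P0 (hakmem.c)
#define HAKMEM_LOCK()   ((void)0)
#define HAKMEM_UNLOCK() ((void)0)
#endif

void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
#ifdef HAKMEM_TLS_TINY
    // P1 enabled: TLS fast path + locked refill (Step 3).
    return hak_tiny_alloc_tls(size, site_id);
#else
    // P1 disabled: fall back to the known-correct P0 global-lock path.
    HAKMEM_LOCK();
    void* ptr = hak_tiny_alloc_locked(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
#endif
}
```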

🎯 Success Criteria

Minimum Success (P0 only)

  • 4T ≥ 13M ops/sec (safe, from 3.3M)
  • Zero race conditions (Helgrind)
  • 10/10 stability runs

Target Success (P0 + P1 + P2)

  • 4T ≥ 15M ops/sec (+355% vs 3.3M baseline)
  • TLS hit rate ≥ 90%
  • No single-threaded regression (≤15% overhead)

Stretch Goal (All Phases)

  • 4T ≥ 18M ops/sec (+445%)
  • 16T ≥ 11.6M ops/sec (match system allocator)
  • Scalable up to 32 threads

Validated (Phase 6.13 Proof)

  • ALREADY ACHIEVED: 4T = 15.9M ops/sec (+381%)

📊 Expected Timeline

Week 1: Foundation (Day 1-2)

  • Day 1 AM (1 hour): Step 1 - Documentation updates
  • Day 1 PM (2-3 hours): Step 2 - P0 Safety Lock
  • Day 2 (2 hours): Step 3 - P1 Tiny Pool TLS

Milestone: 4T = 12-15M ops/sec (+264-355%)

Week 2: Expansion (Day 3-5)

  • Day 3-4 (3 hours): Step 3 - P2 L2 Pool TLS
  • Day 5 (3 hours): Step 3 - P3 L2.5 Pool TLS

Milestone: 4T = 18-22M ops/sec (+445-567%)

Week 3: Validation (Day 6)

  • Day 6 (1 hour): Final validation + completion report

Milestone: Phase 6.15 Complete


🔬 Technical References

Existing TLS Implementation

File: apps/experiments/hakmem-poc/hakmem_l25_pool.c:26

__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};

Pattern: Per-thread cache for each size class (L1 cache hit)

Phase 6.13 Validation

File: apps/experiments/hakmem-poc/PHASE_6.13_INITIAL_RESULTS.md

Results:

  • 1-thread: 17.8M ops/sec (+123% vs system)
  • 4-thread: 15.9M ops/sec (+147% vs system)
  • Proof: TLS works and provides massive benefit

Thread Safety Analysis

File: apps/experiments/hakmem-poc/THREAD_SAFETY_SOLUTION.md

Key Insights:

  • mimalloc/jemalloc both use TLS as primary approach
  • TLS hit rate: 95%+ (industry standard)
  • Lock contention: 5% (only on TLS miss/refill)

📝 Implementation Notes

Why 3 Stages?

  1. Step 1 (Docs): Ensure clarity on what went wrong (67.9M issue) and what's being fixed
  2. Step 2 (P0): Prove correctness FIRST (no crashes, no data races)
  3. Step 3 (P1-P3): Optimize for performance (TLS) with safety already guaranteed

Why Not Skip P0?

  • Risk mitigation: If TLS fails, we still have working thread-safe allocator
  • Debugging: Easier to debug TLS issues with known-working locked baseline
  • Validation: P0 proves the global lock pattern is correct

Why TLS Over Lock-free?

  • Phase 6.14 proved: Sequential O(N) is 2.9-13.7x faster than Random O(1) Hash
  • Implication: Lock-free atomic hash will be SLOWER than TLS
  • Industry standard: mimalloc/jemalloc use TLS, not lock-free
  • Proven: Phase 6.13 validated +123-147% improvement with TLS

🚀 Next Steps After Phase 6.15

Phase 6.17: 16-Thread Scalability (Optional, 4 hours)

Current Issue: 16T = 7.6M ops/sec (-34.8% vs system 11.6M)

Investigation:

  1. Profile global lock contention (perf, helgrind)
  2. Measure Whale cache hit rate by thread count
  3. Analyze shard distribution (hash collision?)
  4. Optimize TLS cache refill (batch refill to reduce global access; see the sketch below)

Target: 16T ≥ 11.6M ops/sec (match or beat system)
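
For investigation item 4 above, a hedged sketch of batch refill: move several slabs from the global freelist into the TLS cache under a single lock acquisition, so lock traffic drops by the batch factor (batch size and the chain shape are assumptions):

```c
#define TINY_REFILL_BATCH 4   // assumed batch size; tune by measurement
#define TINY_NUM_CLASSES  8

typedef struct TinySlab { struct TinySlab* next; /* ... */ } TinySlab;
static struct { TinySlab* free_slabs[TINY_NUM_CLASSES]; } g_tiny_pool;  // stand-in

#ifndef HAKMEM_LOCK                      // real versions come from P0 (hakmem.c)
#define HAKMEM_LOCK()   ((void)0)
#define HAKMEM_UNLOCK() ((void)0)
#endif

// Pop up to TINY_REFILL_BATCH slabs with ONE lock acquisition instead of
// one lock per TLS miss; the caller keeps the returned chain in its TLS cache.
static TinySlab* hak_tiny_refill_batch(int class_idx) {
    HAKMEM_LOCK();
    TinySlab* head = NULL;
    for (int i = 0; i < TINY_REFILL_BATCH; i++) {
        TinySlab* s = g_tiny_pool.free_slabs[class_idx];
        if (!s) break;
        g_tiny_pool.free_slabs[class_idx] = s->next;
        s->next = head;   // build a private chain for this thread
        head = s;
    }
    HAKMEM_UNLOCK();
    return head;
}
```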



**Total Time Investment**: 12-13 hours
**Expected ROI**: 6-15x improvement (3.3M → 20-50M ops/sec)
**Risk**: Low (feature flags + proven design)
**Validation**: Phase 6.13 already proves TLS works (+147% at 4 threads)


**Implementation by**: Claude + ChatGPT (coordinated development)
**Planning Date**: 2025-10-22
**Status**: Ready to Execute