Phase 6.15: Multi-threaded Safety + TLS Performance - Implementation Plan

**Date**: 2025-10-22
**Status**: 📋 Planning Complete
**Total Time**: 12-13 hours (3 weeks)


📊 Executive Summary

Current Problem

The hakmem allocator is completely thread-unsafe, with catastrophic multi-threaded performance:

| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.1M ops/sec | baseline |
| 4-thread | 3.3M ops/sec | -78% |

Root Cause: Zero thread synchronization primitives in the current codebase (no `pthread_mutex` anywhere)
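
To make the failure mode concrete: every pool performs an unsynchronized freelist push on free. A minimal sketch of the lost-update race, where `g_free_head` stands in for `g_tiny_pool.free_slabs[class_idx]`:

```c
#include <stddef.h>

typedef struct Slab { struct Slab* next; } Slab;
static Slab* g_free_head = NULL;  // stand-in for g_tiny_pool.free_slabs[class_idx]

// Two threads running this concurrently can both read the same head in (1),
// then both write in (2): one pushed slab is silently lost, and the list
// can be left pointing at duplicated nodes.
void push_slab(Slab* slab) {
    slab->next = g_free_head;   // (1) read head
    g_free_head = slab;         // (2) write head - races with (1) in another thread
}
```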

Solution Strategy

3-Stage Gradual Implementation:

  1. Step 1: Document updates (1 hour) - Fix 67.9M measurement issue, create Phase 6.15 plan
  2. Step 2: P0 Safety Lock (30 min + testing) - Ensure correctness with minimal changes
  3. Step 3: TLS Performance (8-10 hours) - Achieve 4T = 15.9M ops/sec (+381%, validated in Phase 6.13)

Expected Outcome:

  • Minimum Success (P0): 4T = 1T performance (safe, no scalability)
  • Target Success (P0+P1): 4T = 12-15M ops/sec (+264-355%)
  • Validated (Phase 6.13): 4T = 15.9M ops/sec (+381%) ALREADY PROVEN

🎯 Step 1: Documentation Updates (1 hour)

Task 1.1: Fix Phase 6.14 Completion Report (15 minutes)

File: apps/experiments/hakmem-poc/PHASE_6.14_COMPLETION_REPORT.md

Current Problem:

  • Report focuses on Registry ON/OFF toggle
  • No mention of 67.9M ops/sec measurement issue
  • Misleading performance claims

Required Changes:

  1. Add Executive Summary Section (after line 9):
## ⚠️ **Important Note: 67.9M Performance Measurement**

**Issue**: Earlier reports mentioned 67.9M ops/sec performance
**Status**: ❌ **NOT REPRODUCIBLE** - Likely measurement error

**Actual Achievements**:
- ✅ Registry ON/OFF toggle implementation (Pattern 2)
- ✅ O(N) proven 2.9-13.7x faster than O(1) for Small-N (8-32 slabs)
- ✅ Default: `g_use_registry = 0` (O(N) Sequential Access)

**Performance Reality**:
- 1-thread: 15.3M ops/sec (O(N), validated)
- 4-thread: **3.3M ops/sec** (THREAD-UNSAFE, requires Phase 6.15 fix)
  2. Update Section Title (line 9):
## 📊 **Executive Summary: Registry Toggle + Thread Safety Issue**
  3. Add Thread Safety Warning (after line 158):
---

## 🚨 **Critical Discovery: Thread Safety Issue**

### **Problem**
Phase 6.14 testing revealed **catastrophic multi-threaded performance collapse**:

| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.3M ops/sec | baseline |
| 4-thread | **3.3M ops/sec** | **-78%** ❌ |

**Root Cause**: `grep pthread_mutex *.c` → **0 results** (no locks!)

**Impact**: All global structures are race-condition prone:
- `g_tiny_pool.free_slabs[]` - Concurrent access without locks
- `g_l25_pool.freelist[]` - Multiple threads modifying same freelist
- `g_slab_registry[]` - Hash table corruption
- `g_whale_cache` - Ring buffer race conditions

### **Solution**
**Phase 6.15**: Multi-threaded Safety + TLS Performance
- **P0** (30 min): Global safety lock (correctness first)
- **P1** (2 hours): Tiny Pool TLS (95%+ lock avoidance)
- **P2** (3 hours): L2 Pool TLS (full coverage)
- **P3** (3 hours): L2.5 Pool TLS expansion

**Expected Results**:
- P0: 4T = 13-15M ops/sec (safe, no scalability)
- P0+P1: 4T = 12-15M ops/sec (+264-355%)
- **Validated**: 4T = **15.9M ops/sec** (+381% vs 3.3M baseline) ✅

**Reference**: `THREAD_SAFETY_SOLUTION.md` - Complete analysis

Estimated Time: 15 minutes


Task 1.2: Create Phase 6.15 Plan Document (30 minutes)

File: apps/experiments/hakmem-poc/PHASE_6.15_PLAN.md (THIS FILE)

Contents: Already created (this document)

Sections:

  1. Executive Summary
  2. Step 1: Documentation Updates (detailed)
  3. Step 2: P0 Safety Lock (implementation + testing)
  4. Step 3: Multi-threaded Performance (P1-P3 breakdown)
  5. Implementation Checklist
  6. Risk Assessment
  7. Success Criteria

Estimated Time: 30 minutes (already completed)


Task 1.3: Update CURRENT_TASK.md (10 minutes)

File: apps/experiments/hakmem-poc/CURRENT_TASK.md

Required Changes:

  1. Update Current Status (after line 30):
## 🎯 **Current Focus: Phase 6.15 Multi-threaded Safety** (2025-10-22)

### **Immediate Priority: Thread Safety Fix** ⚠️

**Problem Discovered**: hakmem is completely thread-unsafe
- 4-thread performance: **3.3M ops/sec** (-78% vs 1-thread 15.1M)
- Root cause: Zero synchronization primitives (no `pthread_mutex`)

**Solution in Progress**: Phase 6.15 (3-stage implementation)
1. **Step 1**: Documentation updates (1 hour) ← IN PROGRESS
2. **Step 2**: P0 Safety Lock (30 min + testing)
3. **Step 3**: TLS Performance (P1-P3, 8-10 hours)

**Expected Outcome**: 4T = 15.9M ops/sec (validated in Phase 6.13)

**Planning Document**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
  2. Move Phase 6.14 to Completed Section (after line 296):
## ✅ Phase 6.14 Complete (2025-10-22)

**Implementation Complete**: Registry ON/OFF toggle + Thread Safety Issue discovered

**✅ Completed Work**:
1. **Pattern 2 implementation**: ON/OFF toggle via the `HAKMEM_USE_REGISTRY` environment variable
2. **O(N) vs O(1) validation**: Demonstrated O(N) is 2.9-13.7x faster
3. **Default setting**: `g_use_registry = 0` (O(N) Sequential Access)

**🚨 Critical Discovery**: 4-thread performance collapse (-78%)
- Cause: no locking on any global structure
- Fix: planned for Phase 6.15

**📊 Measured Results**:

1-thread: 15.3M ops/sec (O(N), Registry OFF)
4-thread: 3.3M ops/sec (-78% ← THREAD-UNSAFE)

**Detailed Documentation**:
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Pattern 2 implementation
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis
- [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - Fix plan

**Implementation Time**: 34 minutes (as planned)

Estimated Time: 10 minutes


Task 1.4: Update README (if needed) (5 minutes)

File: apps/experiments/hakmem-poc/README.md (if exists)

Check if exists:

ls -la apps/experiments/hakmem-poc/README.md

If exists, add warning:

## ⚠️ **Current Status: Thread Safety in Development**

**Known Issue**: hakmem is currently thread-unsafe
- **Single-threaded**: 15.1M ops/sec ✅ Excellent
- **Multi-threaded**: 3.3M ops/sec (4T) ❌ Requires fix

**Fix in Progress**: Phase 6.15 Multi-threaded Safety
- Expected completion: 2025-10-24 (2-3 days)
- Target performance: 15-20M ops/sec at 4 threads

**Do NOT use in multi-threaded applications until Phase 6.15 is complete.**

Estimated Time: 5 minutes (or skip if README doesn't exist)


Task 1.5: Verification (5 minutes)

Checklist:

  • PHASE_6.14_COMPLETION_REPORT.md updated (67.9M issue documented)
  • PHASE_6.15_PLAN.md created (this document)
  • CURRENT_TASK.md updated (Phase 6.15 status)
  • README.md updated (if exists)

Verification Commands:

cd apps/experiments/hakmem-poc

# Check files exist
ls -la PHASE_6.14_COMPLETION_REPORT.md
ls -la PHASE_6.15_PLAN.md
ls -la CURRENT_TASK.md

# Grep for keywords
grep -n "67.9M\|Thread Safety\|Phase 6.15" PHASE_6.14_COMPLETION_REPORT.md
grep -n "Phase 6.15\|Thread Safety" CURRENT_TASK.md

Estimated Time: 5 minutes


⏱️ Step 1 Total Time: 1 hour 5 minutes


🔐 Step 2: P0 Safety Lock Implementation (2-3 hours)

Goal

Ensure correctness with minimal code changes. No performance improvement expected (4T ≈ 1T).

Success Criteria

  • 1-thread: 13-15M ops/sec (lock overhead of 0-15% is acceptable)
  • 4-thread: 13-15M ops/sec (no scalability, but SAFE)
  • Helgrind: 0 data races
  • Stability: 10 consecutive runs without crash

Task 2.1: Implementation (30 minutes)

File: apps/experiments/hakmem-poc/hakmem.c

Changes Required:

  1. Add pthread.h include (after line 22):
#include <pthread.h>  // Phase 6.15 P0: Thread Safety
  2. Add global lock (after line 58):
// ============================================================================
// Phase 6.15 P0: Thread Safety - Global Lock
// ============================================================================

// Global lock for all allocator operations
// Purpose: Ensure correctness in multi-threaded environment
// Performance: 4T ≈ 1T (no scalability, safety first)
// Will be replaced by TLS in P1-P3 (95%+ lock avoidance)
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;

// Lock/unlock helpers (for debugging and future instrumentation)
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
  3. Wrap hak_alloc_at() (find the function, approximately line 300-400):
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    void* ptr = hak_alloc_at_internal(size, site_id);

    HAKMEM_UNLOCK();
    return ptr;
}

// Rename old hak_alloc_at to hak_alloc_at_internal
// (forward-declare it above the wrapper so the call compiles)
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
    // ... existing code (no changes) ...
}
  4. Wrap hak_free_at() (find the function):
void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Phase 6.15 P0: Global lock (safety first)
    HAKMEM_LOCK();

    // Existing implementation
    hak_free_at_internal(ptr, site_id);

    HAKMEM_UNLOCK();
}

// Rename old hak_free_at to hak_free_at_internal
static void hak_free_at_internal(void* ptr, uintptr_t site_id) {
    // ... existing code (no changes) ...
}
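
One caveat on this wrapper pattern: `PTHREAD_MUTEX_INITIALIZER` yields a non-recursive mutex, so if any internal path ever re-enters the public API (e.g., a helper calling `hak_alloc_at()` on the same thread), the thread deadlocks on its own lock. If re-entrancy is possible, a recursive mutex is a safe hedge; a sketch, with an illustrative init function:

```c
#include <pthread.h>

// Hypothetical alternative to PTHREAD_MUTEX_INITIALIZER: a recursive mutex,
// so a thread that re-enters hak_alloc_at()/hak_free_at() while already
// holding g_hakmem_lock does not self-deadlock.
static pthread_mutex_t g_hakmem_lock;
static pthread_once_t  g_hakmem_lock_once = PTHREAD_ONCE_INIT;

static void hakmem_lock_init(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&g_hakmem_lock, &attr);
    pthread_mutexattr_destroy(&attr);
}
// Call pthread_once(&g_hakmem_lock_once, hakmem_lock_init) before first lock.
```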
  5. Protect hak_init() (find initialization function):
void hak_init(void) {
    // Phase 6.15 P0: No lock needed (called once before any threads spawn)
    // But add atomic check for safety

    // ... existing init code ...
}
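
A sketch of the atomic check mentioned above, assuming C11 `<stdatomic.h>` is available; it makes a repeated or concurrent `hak_init()` call harmless:

```c
#include <stdatomic.h>

static atomic_int g_hakmem_initialized = 0;

void hak_init(void) {
    int expected = 0;
    // Only the first caller wins the CAS and runs initialization;
    // later or concurrent callers return immediately.
    if (!atomic_compare_exchange_strong(&g_hakmem_initialized, &expected, 1)) {
        return;
    }
    // ... existing init code ...
}
```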

Estimated Time: 30 minutes


Task 2.2: Build & Smoke Test (15 minutes)

Commands:

cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc

# Clean build
make clean
make bench_allocators

# Smoke test (single-threaded)
./bench_allocators --allocator hakmem-baseline --scenario json

# Expected: ~300-350ns (slight overhead acceptable)

Success Criteria:

  • Build succeeds (no compilation errors)
  • No crashes on single-threaded test
  • Performance: 13-15M ops/sec (within 0-15% of Phase 6.14)

Estimated Time: 15 minutes


Task 2.3: Multi-threaded Validation (1 hour)

Test 1: larson Benchmark (30 minutes)

Setup:

# Build shared library (if not already done)
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make clean && make shared

# Verify library
ls -lh libhakmem.so
nm -D libhakmem.so | grep -E "malloc|free|calloc|realloc"

Benchmark Execution:

cd /tmp/mimalloc-bench/bench/larson

# 1-thread baseline
./larson 0 8 1024 10000 1 12345 1

# 1-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 1

# Expected: 13-15M ops/sec (lock overhead 0-15%)
# 4-thread baseline
./larson 0 8 1024 10000 1 12345 4

# 4-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4

# Expected: 13-15M ops/sec (same as 1T, no scalability)
# Critical: NO CRASHES, NO DATA CORRUPTION

Success Criteria:

  • 1T: 13-15M ops/sec (within 15% of Phase 6.14)
  • 4T: 13-15M ops/sec (no scalability expected)
  • 4T: NO crashes, NO segfaults
  • 4T: NO data corruption (verify checksum if larson supports)

Estimated Time: 30 minutes


Test 2: Helgrind Race Detection (20 minutes)

Purpose: Verify all data races are eliminated

Commands:

cd /tmp/mimalloc-bench/bench/larson

# Install valgrind (if not installed)
sudo apt-get install -y valgrind

# Run Helgrind on 4-thread test
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
valgrind --tool=helgrind \
  --read-var-info=yes \
  ./larson 0 8 1024 1000 1 12345 4
  # Note: LD_PRELOAD is set in the environment (valgrind would otherwise
  # treat it as the program to run). Iterations reduced (1000 instead of
  # 10000) for a faster run.

# Expected output:
# ERROR SUMMARY: 0 errors from 0 contexts (suppressed: X from Y)

Success Criteria:

  • ERROR SUMMARY: 0 errors (zero data races)
  • No warnings about unprotected reads/writes
  • ⚠️ NOTE: Helgrind may show false positives from libc. Ignore if they are NOT in hakmem code.

Estimated Time: 20 minutes


Test 3: Stability Test (10 minutes)

Purpose: Ensure no crashes over 10 consecutive runs

Commands:

cd /tmp/mimalloc-bench/bench/larson

# 10 consecutive 4-thread runs
for i in {1..10}; do
  echo "Run $i/10..."
  LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4 || { echo "FAILED at run $i"; exit 1; }
done

echo "✅ All 10 runs succeeded!"

Success Criteria:

  • 10/10 runs complete without crashes
  • Performance stable across runs (variance < 10%)

Estimated Time: 10 minutes


Task 2.4: Document Results (15 minutes)

Create: apps/experiments/hakmem-poc/PHASE_6.15_P0_RESULTS.md

Template:

# Phase 6.15 P0: Safety Lock Implementation - Results

**Date**: 2025-10-22
**Status**: ✅ **COMPLETED** (Correctness achieved)
**Implementation Time**: X minutes

---

## 📊 **Benchmark Results**

### **larson (mimalloc-bench)**

| Threads | Before P0 (UNSAFE) | After P0 (SAFE) | Change |
|---------|-------------------|-----------------|--------|
| 1-thread | 15.1M ops/sec | X.XM ops/sec | ±X% |
| 4-thread | 3.3M ops/sec | X.XM ops/sec | +XXX% ✅ |

**Performance Summary**:
- 1-thread overhead: X% (lock overhead, acceptable)
- 4-thread improvement: +XXX% (from -78% to safe)
- 4-thread scalability: X.Xx (4T / 1T, expected ~1.0)

---

## ✅ **Success Criteria Met**

- ✅ 1T performance: X.XM ops/sec (within 15% of Phase 6.14)
- ✅ 4T performance: X.XM ops/sec (safe, no scalability)
- ✅ Helgrind: **0 data races** detected
- ✅ Stability: **10/10 runs** without crashes

---

## 🔧 **Implementation Details**

**Files Modified**:
- `hakmem.c` - Added global lock + wrapper functions

**Lines Changed**:
- +20 lines (pthread.h, global lock, HAKMEM_LOCK/UNLOCK macros)
- +10 lines (hak_alloc_at wrapper)
- +10 lines (hak_free_at wrapper)
- **Total**: ~40 lines

**Pattern**:
```c
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
}
```

🎯 Next Steps

Phase 6.15 P1: Tiny Pool TLS (2 hours)

  • Expected: 4T = 12-15M ops/sec (+100-150%)
  • TLS hit rate: 95%+
  • Lock avoidance: 95%+

Start Date: 2025-10-XX


**Estimated Time**: 15 minutes

---

### **Step 2 Total Time: 2-3 hours**

---

## 🚀 **Step 3: Multi-threaded Performance (P1-P3)** (8-10 hours)

### **Overview**

**Goal**: Achieve near-ideal scalability (4T ≈ 4x 1T) using Thread-Local Storage (TLS)

**Validation**: Phase 6.13 already proved TLS works
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)

**Strategy**: Expand existing L2.5 TLS to Tiny Pool and L2 Pool

---

### **Phase 6.15 P1: Tiny Pool TLS** (2 hours)

**Goal**: Thread-local cache for ≤1KB allocations (8 size classes)

**Existing Reference**: `hakmem_l25_pool.c:26` (TLS pattern already implemented)
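
For context, `hak_tiny_get_class_index()` (used in the code below) maps a request size to one of the 8 classes. A hedged sketch, assuming power-of-two classes from 8B to 1KB; the real boundaries live in `hakmem_tiny.c`:

```c
#include <stddef.h>

#define TINY_NUM_CLASSES 8

// Assumed classes: 8, 16, 32, 64, 128, 256, 512, 1024 bytes.
static inline int hak_tiny_get_class_index(size_t size) {
    int idx = 0;
    size_t cap = 8;
    while (cap < size && idx < TINY_NUM_CLASSES - 1) {
        cap <<= 1;
        idx++;
    }
    return idx;  // size <= 1KB by the caller's routing, so idx is in range
}
```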

#### **Implementation**

**File**: `apps/experiments/hakmem-poc/hakmem_tiny.c`

**Changes**:

1. **Add TLS cache** (after line 12):
```c
// Phase 6.15 P1: Thread-Local Storage for Tiny Pool
// Pattern: Same as L2.5 Pool (hakmem_l25_pool.c:26)
// Purpose: Reduce global freelist contention (50 cycles → 10 cycles)
// Hit rate expected: 95%+

static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
static __thread int tls_tiny_initialized = 0;
```
  2. **TLS initialization** (new function):
// Initialize TLS cache for current thread
static void hak_tiny_tls_init(void) {
    if (tls_tiny_initialized) return;

    // Initialize all size classes to NULL
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        tls_tiny_cache[i] = NULL;
    }

    tls_tiny_initialized = 1;
}
  3. **Modify hak_tiny_alloc** (existing function):
void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
    // Phase 6.15 P1: TLS fast path
    if (!tls_tiny_initialized) {
        hak_tiny_tls_init();
    }

    int class_idx = hak_tiny_get_class_index(size);

    // TLS hit check (no lock needed)
    TinySlab* slab = tls_tiny_cache[class_idx];
    if (slab && slab->free_count > 0) {
        // Fast path: Allocate from TLS cache
        return hak_tiny_alloc_from_slab(slab, class_idx);
    }

    // TLS miss: Refill from global freelist (locked)
    HAKMEM_LOCK();

    // Try to get a slab from global freelist
    slab = g_tiny_pool.free_slabs[class_idx];
    if (slab) {
        // Move slab to TLS cache
        g_tiny_pool.free_slabs[class_idx] = slab->next;
        tls_tiny_cache[class_idx] = slab;
        slab->next = NULL;  // Detach from freelist
    } else {
        // Allocate new slab (existing logic)
        slab = allocate_new_slab(class_idx);
        if (!slab) {
            HAKMEM_UNLOCK();
            return NULL;
        }
        tls_tiny_cache[class_idx] = slab;
    }

    HAKMEM_UNLOCK();

    // Allocate from newly cached slab
    return hak_tiny_alloc_from_slab(slab, class_idx);
}
  4. **Modify hak_tiny_free** (existing function):
void hak_tiny_free(void* ptr, uintptr_t site_id) {
    if (!ptr) return;

    // Find owner slab (O(N) or O(1) depending on g_use_registry)
    TinySlab* slab = hak_tiny_owner_slab(ptr);
    if (!slab) {
        fprintf(stderr, "[Tiny] ERROR: Invalid pointer!\n");
        return;
    }

    int class_idx = slab->size_class;

    // Free block in slab
    hak_tiny_free_in_slab(slab, ptr, class_idx);

    // Check if slab is now empty
    if (slab->free_count == slab->total_count) {
        // Phase 6.15 P1: Return empty slab to global freelist

        // First, remove from TLS cache if it's there
        if (tls_tiny_cache[class_idx] == slab) {
            tls_tiny_cache[class_idx] = NULL;
        }

        // Return to global freelist (locked)
        HAKMEM_LOCK();
        slab->next = g_tiny_pool.free_slabs[class_idx];
        g_tiny_pool.free_slabs[class_idx] = slab;
        HAKMEM_UNLOCK();
    }
}
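
The helpers `hak_tiny_alloc_from_slab()` / `hak_tiny_free_in_slab()` are assumed above. A sketch of what the lock-free fast path relies on, assuming an intrusive per-slab free list; the real `TinySlab` layout lives in `hakmem_tiny.c`, and the `_sketch` type below is illustrative:

```c
// Assumed minimal shape (the real TinySlab also carries size_class,
// total_count, next, ...).
typedef struct TinySlab_sketch {
    void* free_list;   // intrusive list of free blocks (assumption)
    int   free_count;
} TinySlab_sketch;

// No lock needed: the slab is owned by this thread's TLS cache.
static inline void* tiny_alloc_from_slab(TinySlab_sketch* slab) {
    void* block = slab->free_list;      // pop head; caller checked free_count > 0
    slab->free_list = *(void**)block;   // next pointer stored inside the free block
    slab->free_count--;
    return block;
}

static inline void tiny_free_in_slab(TinySlab_sketch* slab, void* ptr) {
    *(void**)ptr = slab->free_list;     // push block back onto the slab's list
    slab->free_list = ptr;
    slab->free_count++;
}
```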

Expected Performance:

  • TLS hit rate: 95%+
  • Lock contention: 5% (only on TLS miss)
  • 4T performance: 12-15M ops/sec (+264-355% vs 3.3M baseline)
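
The 95%+ hit-rate claim is directly measurable. A sketch of per-thread counters (names illustrative): bump the hit counter on the TLS fast path, the miss counter on the locked refill path, and report at thread exit:

```c
#include <stdint.h>
#include <stdio.h>

static __thread uint64_t tls_tiny_hits = 0;    // TLS fast-path allocations
static __thread uint64_t tls_tiny_misses = 0;  // locked refills

static void hak_tiny_report_hit_rate(void) {
    uint64_t total = tls_tiny_hits + tls_tiny_misses;
    if (total == 0) return;
    fprintf(stderr, "[Tiny TLS] hit rate: %.1f%% (%llu hits / %llu allocs)\n",
            100.0 * (double)tls_tiny_hits / (double)total,
            (unsigned long long)tls_tiny_hits,
            (unsigned long long)total);
}
```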

Implementation Time: 2 hours


Phase 6.15 P2: L2 Pool TLS (3 hours)

Goal: Thread-local cache for 2-32KB allocations (5 size classes)

Pattern: Same as Tiny Pool TLS (above)

Implementation

File: apps/experiments/hakmem-poc/hakmem_pool.c

Changes (same structure as the Tiny Pool TLS; a hedged sketch follows this list):

  1. Add static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];
  2. Implement TLS fast path in hak_pool_alloc()
  3. Implement TLS refill logic (global freelist → TLS cache)
  4. Implement TLS return logic (empty slabs → global freelist)
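
A hedged sketch of the alloc side, transposing the Tiny Pool pattern onto the names above; `g_l2_pool` and `hak_pool_class_index()` are illustrative stand-ins, and header rewriting plus upstream fallback are omitted:

```c
#include <stddef.h>
#include <stdint.h>

#define L2_NUM_CLASSES 5

typedef struct L2Block { struct L2Block* next; } L2Block;  // assumed link field

static struct { L2Block* freelist[L2_NUM_CLASSES]; } g_l2_pool;   // stand-in
static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES] = {NULL};

#ifndef HAKMEM_LOCK                      // real versions come from P0 (hakmem.c)
#define HAKMEM_LOCK()   ((void)0)
#define HAKMEM_UNLOCK() ((void)0)
#endif

extern int hak_pool_class_index(size_t size);  // illustrative helper

void* hak_pool_alloc(size_t size, uintptr_t site_id) {
    (void)site_id;
    int class_idx = hak_pool_class_index(size);

    // Fast path: pop from this thread's cache, no lock taken.
    L2Block* block = tls_l2_cache[class_idx];
    if (block) {
        tls_l2_cache[class_idx] = block->next;
        return block;
    }

    // Slow path: refill from the global freelist under the P0 lock.
    HAKMEM_LOCK();
    block = g_l2_pool.freelist[class_idx];
    if (block) {
        g_l2_pool.freelist[class_idx] = block->next;
    }
    HAKMEM_UNLOCK();
    return block;  // NULL falls through to upstream allocation (omitted)
}
```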

Expected Performance:

  • TLS hit rate: 90%+
  • Cumulative 4T performance: 15-18M ops/sec

Implementation Time: 3 hours


Phase 6.15 P3: L2.5 Pool TLS Expansion (3 hours)

Goal: Expand existing L2.5 TLS to full implementation

Current State: hakmem_l25_pool.c:26 already has TLS declaration:

__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};

Missing: TLS refill/eviction logic (currently only used in fast path)

Implementation

File: apps/experiments/hakmem-poc/hakmem_l25_pool.c

Changes:

  1. Implement TLS refill (in hak_l25_pool_alloc):
// Existing TLS check (line ~230)
L25Block* block = tls_l25_cache[class_idx];
if (block) {
    tls_l25_cache[class_idx] = NULL;  // Pop from TLS
    // ... existing header rewrite ...
    return user_ptr;
}

// NEW: TLS refill from global freelist
HAKMEM_LOCK();

int shard_idx = (site_id >> 4) & (L25_NUM_SHARDS - 1);

// Check non-empty bitmap
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
    // Empty freelist, allocate new bundle
    // ... existing logic ...
} else {
    // Pop from global freelist
    block = g_l25_pool.freelist[class_idx][shard_idx];
    g_l25_pool.freelist[class_idx][shard_idx] = block->next;

    // Update bitmap if freelist is now empty
    if (!g_l25_pool.freelist[class_idx][shard_idx]) {
        g_l25_pool.nonempty_mask[class_idx] &= ~(1ULL << shard_idx);
    }

    // Move to TLS cache
    tls_l25_cache[class_idx] = block;
}

HAKMEM_UNLOCK();

// Allocate from TLS cache
block = tls_l25_cache[class_idx];
tls_l25_cache[class_idx] = NULL;
// ... existing header rewrite ...
return user_ptr;
  2. Implement TLS eviction (in hak_l25_pool_free):
// Existing logic to add to freelist
L25Block* block = (L25Block*)hdr;

// Phase 6.15 P3: Add to TLS cache first (if empty)
if (!tls_l25_cache[class_idx]) {
    tls_l25_cache[class_idx] = block;
    block->next = NULL;
    return;  // No need to lock
}

// TLS cache full, return to global freelist (locked)
HAKMEM_LOCK();

block->next = g_l25_pool.freelist[class_idx][shard_idx];
g_l25_pool.freelist[class_idx][shard_idx] = block;

// Update bitmap
g_l25_pool.nonempty_mask[class_idx] |= (1ULL << shard_idx);

HAKMEM_UNLOCK();

Expected Performance:

  • TLS hit rate: 95%+
  • Cumulative 4T performance: 18-22M ops/sec (+445-567%)

Implementation Time: 3 hours


📋 Implementation Checklist

Step 1: Documentation (1 hour)

  • Task 1.1: Fix PHASE_6.14_COMPLETION_REPORT.md (15 min)
  • Task 1.2: Create PHASE_6.15_PLAN.md (30 min) ← THIS DOCUMENT
  • Task 1.3: Update CURRENT_TASK.md (10 min)
  • Task 1.4: Update README.md if exists (5 min)
  • Task 1.5: Verification (5 min)

Step 2: P0 Safety Lock (2-3 hours)

  • Task 2.1: Implementation (30 min)
    • Add pthread.h include
    • Add g_hakmem_lock + HAKMEM_LOCK/UNLOCK macros
    • Wrap hak_alloc_at() with lock
    • Wrap hak_free_at() with lock
  • Task 2.2: Build & Smoke Test (15 min)
    • make clean && make bench_allocators
    • Single-threaded test (json scenario)
    • Verify: 13-15M ops/sec
  • Task 2.3: Multi-threaded Validation (1 hour)
    • Test 1: larson 1T/4T (30 min)
    • Test 2: Helgrind race detection (20 min)
    • Test 3: Stability test 10 runs (10 min)
  • Task 2.4: Document Results (15 min)
    • Create PHASE_6.15_P0_RESULTS.md

Step 3: TLS Performance (8-10 hours)

  • P1: Tiny Pool TLS (2 hours)

    • Add tls_tiny_cache[] declaration
    • Implement hak_tiny_tls_init()
    • Modify hak_tiny_alloc() (TLS fast path)
    • Modify hak_tiny_free() (TLS eviction)
    • Test: larson 4T → 12-15M ops/sec
    • Document: PHASE_6.15_P1_RESULTS.md
  • P2: L2 Pool TLS (3 hours)

    • Add tls_l2_cache[] declaration
    • Implement TLS fast path in hak_pool_alloc()
    • Implement TLS refill logic
    • Implement TLS eviction logic
    • Test: larson 4T → 15-18M ops/sec
    • Document: PHASE_6.15_P2_RESULTS.md
  • P3: L2.5 Pool TLS Expansion (3 hours)

    • Implement TLS refill in hak_l25_pool_alloc()
    • Implement TLS eviction in hak_l25_pool_free()
    • Test: larson 4T → 18-22M ops/sec
    • Document: PHASE_6.15_P3_RESULTS.md
  • Final Validation (1 hour)

    • larson 1T/4T/16T full validation
    • Internal benchmarks (json/mir/vm)
    • Helgrind final check
    • Create PHASE_6.15_COMPLETION_REPORT.md

⚠️ Risk Assessment

| Phase | Risk Level | Failure Mode | Mitigation |
|-------|------------|--------------|------------|
| P0 (Safety Lock) | ZERO | Worst case: slow but safe | N/A |
| P1 (Tiny TLS) | LOW | TLS miss overhead | Feature flag `HAKMEM_TLS_TINY` |
| P2 (L2 TLS) | LOW | Memory overhead (TLS × threads) | Monitor RSS |
| P3 (L2.5 TLS) | LOW | Existing code 50% done | Incremental |

Rollback Strategy:

  • Every phase is guarded by `#ifdef HAKMEM_TLS_PHASEX` (see the sketch below)
  • Can disable individual TLS layers if issues found
  • P0 Safety Lock ensures correctness even if TLS disabled
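
A sketch of the flag pattern, assuming compile-time guards as described; `hak_tiny_alloc_tls()` and `hak_tiny_alloc_locked()` are illustrative names for the two paths:

```c
#include <stddef.h>
#include <stdint.h>

extern void* hak_tiny_alloc_tls(size_t size, uintptr_t site_id);     // P1 path
extern void* hak_tiny_alloc_locked(size_t size, uintptr_t site_id);  // P0 path

#ifndef HAKMEM_LOCK                      // real versions come from P0 (hakmem.c)
#define HAKMEM_LOCK()   ((void)0)
#define HAKMEM_UNLOCK() ((void)0)
#endif

void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
#ifdef HAKMEM_TLS_TINY
    // P1 enabled: TLS fast path + locked refill (Step 3).
    return hak_tiny_alloc_tls(size, site_id);
#else
    // P1 disabled: fall back to the known-correct P0 global-lock path.
    HAKMEM_LOCK();
    void* ptr = hak_tiny_alloc_locked(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
#endif
}
```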

🎯 Success Criteria

Minimum Success (P0 only)

  • 4T ≥ 13M ops/sec (safe, from 3.3M)
  • Zero race conditions (Helgrind)
  • 10/10 stability runs

Target Success (P0 + P1 + P2)

  • 4T ≥ 15M ops/sec (+355% vs 3.3M baseline)
  • TLS hit rate ≥ 90%
  • No single-threaded regression (≤15% overhead)

Stretch Goal (All Phases)

  • 4T ≥ 18M ops/sec (+445%)
  • 16T ≥ 11.6M ops/sec (match system allocator)
  • Scalable up to 32 threads

Validated (Phase 6.13 Proof)

  • ALREADY ACHIEVED: 4T = 15.9M ops/sec (+381%)

📊 Expected Timeline

Week 1: Foundation (Day 1-2)

  • Day 1 AM (1 hour): Step 1 - Documentation updates
  • Day 1 PM (2-3 hours): Step 2 - P0 Safety Lock
  • Day 2 (2 hours): Step 3 - P1 Tiny Pool TLS

Milestone: 4T = 12-15M ops/sec (+264-355%)

Week 2: Expansion (Day 3-5)

  • Day 3-4 (3 hours): Step 3 - P2 L2 Pool TLS
  • Day 5 (3 hours): Step 3 - P3 L2.5 Pool TLS

Milestone: 4T = 18-22M ops/sec (+445-567%)

Week 3: Validation (Day 6)

  • Day 6 (1 hour): Final validation + completion report

Milestone: Phase 6.15 Complete


🔬 Technical References

Existing TLS Implementation

File: apps/experiments/hakmem-poc/hakmem_l25_pool.c:26

__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};

Pattern: Per-thread cache for each size class (L1 cache hit)

Phase 6.13 Validation

File: apps/experiments/hakmem-poc/PHASE_6.13_INITIAL_RESULTS.md

Results:

  • 1-thread: 17.8M ops/sec (+123% vs system)
  • 4-thread: 15.9M ops/sec (+147% vs system)
  • Proof: TLS works and provides massive benefit

Thread Safety Analysis

File: apps/experiments/hakmem-poc/THREAD_SAFETY_SOLUTION.md

Key Insights:

  • mimalloc/jemalloc both use TLS as primary approach
  • TLS hit rate: 95%+ (industry standard)
  • Lock contention: 5% (only on TLS miss/refill)

📝 Implementation Notes

Why 3 Stages?

  1. Step 1 (Docs): Ensure clarity on what went wrong (67.9M issue) and what's being fixed
  2. Step 2 (P0): Prove correctness FIRST (no crashes, no data races)
  3. Step 3 (P1-P3): Optimize for performance (TLS) with safety already guaranteed

Why Not Skip P0?

  • Risk mitigation: If TLS fails, we still have working thread-safe allocator
  • Debugging: Easier to debug TLS issues with known-working locked baseline
  • Validation: P0 proves the global lock pattern is correct

Why TLS Over Lock-free?

  • Phase 6.14 proved: Sequential O(N) is 2.9-13.7x faster than Random O(1) Hash
  • Implication: Lock-free atomic hash will be SLOWER than TLS
  • Industry standard: mimalloc/jemalloc use TLS, not lock-free
  • Proven: Phase 6.13 validated +123-147% improvement with TLS

🚀 Next Steps After Phase 6.15

Phase 6.17: 16-Thread Scalability (Optional, 4 hours)

Current Issue: 16T = 7.6M ops/sec (-34.8% vs system 11.6M)

Investigation:

  1. Profile global lock contention (perf, helgrind)
  2. Measure Whale cache hit rate by thread count
  3. Analyze shard distribution (hash collision?)
  4. Optimize TLS cache refill (batch refill to reduce global access; see the sketch below)

Target: 16T ≥ 11.6M ops/sec (match or beat system)
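
For investigation item 4 above, a hedged sketch of batch refill: move several slabs from the global freelist into the TLS cache under a single lock acquisition, so lock traffic drops by the batch factor (batch size and the chain shape are assumptions):

```c
#define TINY_REFILL_BATCH 4   // assumed batch size; tune by measurement
#define TINY_NUM_CLASSES  8

typedef struct TinySlab { struct TinySlab* next; /* ... */ } TinySlab;
static struct { TinySlab* free_slabs[TINY_NUM_CLASSES]; } g_tiny_pool;  // stand-in

#ifndef HAKMEM_LOCK                      // real versions come from P0 (hakmem.c)
#define HAKMEM_LOCK()   ((void)0)
#define HAKMEM_UNLOCK() ((void)0)
#endif

// Pop up to TINY_REFILL_BATCH slabs with ONE lock acquisition instead of
// one lock per TLS miss; the caller keeps the returned chain in its TLS cache.
static TinySlab* hak_tiny_refill_batch(int class_idx) {
    HAKMEM_LOCK();
    TinySlab* head = NULL;
    for (int i = 0; i < TINY_REFILL_BATCH; i++) {
        TinySlab* s = g_tiny_pool.free_slabs[class_idx];
        if (!s) break;
        g_tiny_pool.free_slabs[class_idx] = s->next;
        s->next = head;   // build a private chain for this thread
        head = s;
    }
    HAKMEM_UNLOCK();
    return head;
}
```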



**Total Time Investment**: 12-13 hours
**Expected ROI**: 6-15x improvement (3.3M → 20-50M ops/sec)
**Risk**: Low (feature flags + proven design)
**Validation**: Phase 6.13 already proves TLS works (+147% at 4 threads)


**Implementation by**: Claude + ChatGPT (coordinated development)
**Planning Date**: 2025-10-22
**Status**: Ready to Execute