Phase 6.15: Multi-threaded Safety + TLS Performance - Implementation Plan
Date: 2025-10-22 Status: 📋 Planning Complete Total Time: 12-13 hours (3 weeks)
📊 Executive Summary
Current Problem
The hakmem allocator is completely thread-unsafe, with catastrophic multi-threaded performance:
| Threads | Performance (ops/sec) | vs 1-thread |
|---|---|---|
| 1-thread | 15.1M ops/sec | baseline |
| 4-thread | 3.3M ops/sec | -78% slower ❌ |
Root Cause: Zero thread synchronization primitives in current codebase (no pthread_mutex anywhere)
Solution Strategy
3-Stage Gradual Implementation:
- Step 1: Document updates (1 hour) - Fix 67.9M measurement issue, create Phase 6.15 plan
- Step 2: P0 Safety Lock (30 min + testing) - Ensure correctness with minimal changes
- Step 3: TLS Performance (8-10 hours) - Achieve 4T = 15.9M ops/sec (+381%, validated in Phase 6.13)
Expected Outcome:
- Minimum Success (P0): 4T = 1T performance (safe, no scalability)
- Target Success (P0+P1): 4T = 12-15M ops/sec (+264-355%)
- Validated (Phase 6.13): 4T = 15.9M ops/sec (+381%) ✅ ALREADY PROVEN
🎯 Step 1: Documentation Updates (1 hour)
Task 1.1: Fix Phase 6.14 Completion Report (15 minutes)
File: apps/experiments/hakmem-poc/PHASE_6.14_COMPLETION_REPORT.md
Current Problem:
- Report focuses on Registry ON/OFF toggle
- No mention of 67.9M ops/sec measurement issue
- Misleading performance claims
Required Changes:
- Add Executive Summary Section (after line 9):
## ⚠️ **Important Note: 67.9M Performance Measurement**
**Issue**: Earlier reports mentioned 67.9M ops/sec performance
**Status**: ❌ **NOT REPRODUCIBLE** - Likely measurement error
**Actual Achievements**:
- ✅ Registry ON/OFF toggle implementation (Pattern 2)
- ✅ O(N) proven 2.9-13.7x faster than O(1) for Small-N (8-32 slabs)
- ✅ Default: `g_use_registry = 0` (O(N) Sequential Access)
**Performance Reality**:
- 1-thread: 15.3M ops/sec (O(N), validated)
- 4-thread: **3.3M ops/sec** (THREAD-UNSAFE, requires Phase 6.15 fix)
- Update Section Title (line 9):
## 📊 **Executive Summary: Registry Toggle + Thread Safety Issue**
- Add Thread Safety Warning (after line 158):
---
## 🚨 **Critical Discovery: Thread Safety Issue**
### **Problem**
Phase 6.14 testing revealed **catastrophic multi-threaded performance collapse**:
| Threads | Performance | vs 1-thread |
|---------|-------------|-------------|
| 1-thread | 15.3M ops/sec | baseline |
| 4-thread | **3.3M ops/sec** | **-78%** ❌ |
**Root Cause**: `grep pthread_mutex *.c` → **0 results** (no locks!)
**Impact**: All global structures are race-condition prone:
- `g_tiny_pool.free_slabs[]` - Concurrent access without locks
- `g_l25_pool.freelist[]` - Multiple threads modifying same freelist
- `g_slab_registry[]` - Hash table corruption
- `g_whale_cache` - Ring buffer race conditions
### **Solution**
**Phase 6.15**: Multi-threaded Safety + TLS Performance
- **P0** (30 min): Global safety lock (correctness first)
- **P1** (2 hours): Tiny Pool TLS (95%+ lock avoidance)
- **P2** (3 hours): L2 Pool TLS (full coverage)
- **P3** (3 hours): L2.5 Pool TLS expansion
**Expected Results**:
- P0: 4T = 13-15M ops/sec (safe, no scalability)
- P0+P1: 4T = 12-15M ops/sec (+264-355%)
- **Validated**: 4T = **15.9M ops/sec** (+381% vs 3.3M baseline) ✅
**Reference**: `THREAD_SAFETY_SOLUTION.md` - Complete analysis
Estimated Time: 15 minutes
Task 1.2: Create Phase 6.15 Plan Document (30 minutes)
File: apps/experiments/hakmem-poc/PHASE_6.15_PLAN.md (THIS FILE)
Contents: ✅ Already created (this document)
Sections:
- Executive Summary
- Step 1: Documentation Updates (detailed)
- Step 2: P0 Safety Lock (implementation + testing)
- Step 3: Multi-threaded Performance (P1-P3 breakdown)
- Implementation Checklist
- Risk Assessment
- Success Criteria
Estimated Time: 30 minutes (already completed)
Task 1.3: Update CURRENT_TASK.md (10 minutes)
File: apps/experiments/hakmem-poc/CURRENT_TASK.md
Required Changes:
- Update Current Status (after line 30):
## 🎯 **Current Focus: Phase 6.15 Multi-threaded Safety** (2025-10-22)
### **Immediate Priority: Thread Safety Fix** ⚠️
**Problem Discovered**: hakmem is completely thread-unsafe
- 4-thread performance: **3.3M ops/sec** (-78% vs 1-thread 15.1M)
- Root cause: Zero synchronization primitives (no `pthread_mutex`)
**Solution in Progress**: Phase 6.15 (3-stage implementation)
1. ✅ **Step 1**: Documentation updates (1 hour) ← IN PROGRESS
2. ⏳ **Step 2**: P0 Safety Lock (30 min + testing)
3. ⏳ **Step 3**: TLS Performance (P1-P3, 8-10 hours)
**Expected Outcome**: 4T = 15.9M ops/sec (validated in Phase 6.13)
**Planning Document**: [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md)
- Move Phase 6.14 to Completed Section (after line 296):
## ✅ Phase 6.14 Complete! (2025-10-22)
**Implemented**: Registry ON/OFF toggle + Thread Safety Issue discovered
**✅ Completed work**:
1. **Pattern 2 implementation**: `HAKMEM_USE_REGISTRY` environment variable toggles ON/OFF
2. **O(N) vs O(1) verification**: Demonstrated O(N) is 2.9-13.7x faster
3. **Default setting**: `g_use_registry = 0` (O(N) Sequential Access)
**🚨 Critical Discovery**: 4-thread performance collapse (-78%)
- Cause: no locks on any global structure
- Countermeasure: fix planned for Phase 6.15
**📊 Measurements**:
1-thread: 15.3M ops/sec (O(N), Registry OFF)
4-thread: 3.3M ops/sec (-78% ← THREAD-UNSAFE) ❌
**Detailed documents**:
- [PHASE_6.14_COMPLETION_REPORT.md](PHASE_6.14_COMPLETION_REPORT.md) - Pattern 2 implementation
- [THREAD_SAFETY_SOLUTION.md](THREAD_SAFETY_SOLUTION.md) - Complete analysis
- [PHASE_6.15_PLAN.md](PHASE_6.15_PLAN.md) - Fix plan
**Implementation time**: 34 minutes (as planned) ⚡
Estimated Time: 10 minutes
Task 1.4: Update README (if needed) (5 minutes)
File: apps/experiments/hakmem-poc/README.md (if exists)
Check if exists:
ls -la apps/experiments/hakmem-poc/README.md
If exists, add warning:
## ⚠️ **Current Status: Thread Safety in Development**
**Known Issue**: hakmem is currently thread-unsafe
- **Single-threaded**: 15.1M ops/sec ✅ Excellent
- **Multi-threaded**: 3.3M ops/sec (4T) ❌ Requires fix
**Fix in Progress**: Phase 6.15 Multi-threaded Safety
- Expected completion: 2025-10-24 (2-3 days)
- Target performance: 15-20M ops/sec at 4 threads
**Do NOT use in multi-threaded applications until Phase 6.15 is complete.**
Estimated Time: 5 minutes (or skip if README doesn't exist)
Task 1.5: Verification (5 minutes)
Checklist:
- PHASE_6.14_COMPLETION_REPORT.md updated (67.9M issue documented)
- PHASE_6.15_PLAN.md created (this document)
- CURRENT_TASK.md updated (Phase 6.15 status)
- README.md updated (if exists)
Verification Commands:
cd apps/experiments/hakmem-poc
# Check files exist
ls -la PHASE_6.14_COMPLETION_REPORT.md
ls -la PHASE_6.15_PLAN.md
ls -la CURRENT_TASK.md
# Grep for keywords
grep -n "67.9M\|Thread Safety\|Phase 6.15" PHASE_6.14_COMPLETION_REPORT.md
grep -n "Phase 6.15\|Thread Safety" CURRENT_TASK.md
Estimated Time: 5 minutes
⏱️ Step 1 Total Time: 1 hour 5 minutes
🔐 Step 2: P0 Safety Lock Implementation (2-3 hours)
Goal
Ensure correctness with minimal code changes. No performance improvement expected (4T ≈ 1T).
Success Criteria
- ✅ 1-thread: 13-15M ops/sec (0-15% lock overhead acceptable)
- ✅ 4-thread: 13-15M ops/sec (no scalability, but SAFE)
- ✅ Helgrind: 0 data races
- ✅ Stability: 10 consecutive runs without crash
Task 2.1: Implementation (30 minutes)
File: apps/experiments/hakmem-poc/hakmem.c
Changes Required:
- Add pthread.h include (after line 22):
#include <pthread.h> // Phase 6.15 P0: Thread Safety
- Add global lock (after line 58):
// ============================================================================
// Phase 6.15 P0: Thread Safety - Global Lock
// ============================================================================
// Global lock for all allocator operations
// Purpose: Ensure correctness in multi-threaded environment
// Performance: 4T ≈ 1T (no scalability, safety first)
// Will be replaced by TLS in P1-P3 (95%+ lock avoidance)
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
// Lock/unlock helpers (for debugging and future instrumentation)
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)
- Wrap hak_alloc_at() (find the function, approximately line 300-400):
void* hak_alloc_at(size_t size, uintptr_t site_id) {
// Phase 6.15 P0: Global lock (safety first)
HAKMEM_LOCK();
// Existing implementation
void* ptr = hak_alloc_at_internal(size, site_id);
HAKMEM_UNLOCK();
return ptr;
}
// Rename old hak_alloc_at to hak_alloc_at_internal
static void* hak_alloc_at_internal(size_t size, uintptr_t site_id) {
// ... existing code (no changes) ...
}
- Wrap hak_free_at() (find the function):
void hak_free_at(void* ptr, uintptr_t site_id) {
if (!ptr) return;
// Phase 6.15 P0: Global lock (safety first)
HAKMEM_LOCK();
// Existing implementation
hak_free_at_internal(ptr, site_id);
HAKMEM_UNLOCK();
}
// Rename old hak_free_at to hak_free_at_internal
static void hak_free_at_internal(void* ptr, uintptr_t site_id) {
// ... existing code (no changes) ...
}
- Protect hak_init() (find initialization function):
void hak_init(void) {
// Phase 6.15 P0: No lock needed (called once before any threads spawn)
// But add atomic check for safety
// ... existing init code ...
}
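The "atomic check for safety" noted in the comment above can be sketched with `pthread_once`, which guarantees one-time initialization even if multiple threads race into `hak_init()`. The names `g_init_guard` and `hak_init_body` below are hypothetical, not the actual hakmem identifiers:

```c
#include <pthread.h>

// Hypothetical sketch: making hak_init() safe to call from any thread.
// pthread_once runs the init body exactly once, even under concurrent calls.
static pthread_once_t g_init_guard = PTHREAD_ONCE_INIT;
static int g_initialized = 0;

static void hak_init_body(void) {
    // ... existing one-time initialization would go here ...
    g_initialized = 1;
}

void hak_init(void) {
    pthread_once(&g_init_guard, hak_init_body);
}
```

Repeated calls to `hak_init()` are then harmless, so callers need no external coordination.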
Estimated Time: 30 minutes
Task 2.2: Build & Smoke Test (15 minutes)
Commands:
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
# Clean build
make clean
make bench_allocators
# Smoke test (single-threaded)
./bench_allocators --allocator hakmem-baseline --scenario json
# Expected: ~300-350ns (slight overhead acceptable)
Success Criteria:
- ✅ Build succeeds (no compilation errors)
- ✅ No crashes on single-threaded test
- ✅ Performance: 13-15M ops/sec (within 0-15% of Phase 6.14)
Estimated Time: 15 minutes
Task 2.3: Multi-threaded Validation (1 hour)
Test 1: larson Benchmark (30 minutes)
Setup:
# Build shared library (if not already done)
cd /home/tomoaki/git/hakorune-selfhost/apps/experiments/hakmem-poc
make clean && make shared
# Verify library
ls -lh libhakmem.so
nm -D libhakmem.so | grep -E "malloc|free|calloc|realloc"
Benchmark Execution:
cd /tmp/mimalloc-bench/bench/larson
# 1-thread baseline
./larson 0 8 1024 10000 1 12345 1
# 1-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 1
# Expected: 13-15M ops/sec (lock overhead 0-15%)
# 4-thread baseline
./larson 0 8 1024 10000 1 12345 4
# 4-thread with hakmem P0
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4
# Expected: 13-15M ops/sec (same as 1T, no scalability)
# Critical: NO CRASHES, NO DATA CORRUPTION
Success Criteria:
- ✅ 1T: 13-15M ops/sec (within 15% of Phase 6.14)
- ✅ 4T: 13-15M ops/sec (no scalability expected)
- ✅ 4T: NO crashes, NO segfaults
- ✅ 4T: NO data corruption (verify checksum if larson supports)
Estimated Time: 30 minutes
Test 2: Helgrind Race Detection (20 minutes)
Purpose: Verify all data races are eliminated
Commands:
cd /tmp/mimalloc-bench/bench/larson
# Install valgrind (if not installed)
sudo apt-get install -y valgrind
# Run Helgrind on 4-thread test
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
valgrind --tool=helgrind \
--read-var-info=yes \
./larson 0 8 1024 1000 1 12345 4
# Note: Reduced iterations (1000 instead of 10000) for faster run
# Expected output:
# ERROR SUMMARY: 0 errors from 0 contexts (suppressed: X from Y)
Success Criteria:
- ✅ ERROR SUMMARY: 0 errors (zero data races)
- ✅ No warnings about unprotected reads/writes
- ⚠️ NOTE: Helgrind may show false positives from libc; ignore reports that do not originate in hakmem code.
Estimated Time: 20 minutes
Test 3: Stability Test (10 minutes)
Purpose: Ensure no crashes over 10 consecutive runs
Commands:
cd /tmp/mimalloc-bench/bench/larson
# 10 consecutive 4-thread runs
for i in {1..10}; do
echo "Run $i/10..."
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4 || { echo "FAILED at run $i"; exit 1; }
done
echo "✅ All 10 runs succeeded!"
Success Criteria:
- ✅ 10/10 runs complete without crashes
- ✅ Performance stable across runs (variance < 10%)
Estimated Time: 10 minutes
Task 2.4: Document Results (15 minutes)
Create: apps/experiments/hakmem-poc/PHASE_6.15_P0_RESULTS.md
Template:
# Phase 6.15 P0: Safety Lock Implementation - Results
**Date**: 2025-10-22
**Status**: ✅ **COMPLETED** (Correctness achieved)
**Implementation Time**: X minutes
---
## 📊 **Benchmark Results**
### **larson (mimalloc-bench)**
| Threads | Before P0 (UNSAFE) | After P0 (SAFE) | Change |
|---------|-------------------|-----------------|--------|
| 1-thread | 15.1M ops/sec | X.XM ops/sec | ±X% |
| 4-thread | 3.3M ops/sec | X.XM ops/sec | +XXX% ✅ |
**Performance Summary**:
- 1-thread overhead: X% (lock overhead, acceptable)
- 4-thread improvement: +XXX% (from -78% to safe)
- 4-thread scalability: X.Xx (4T / 1T, expected ~1.0)
---
## ✅ **Success Criteria Met**
- ✅ 1T performance: X.XM ops/sec (within 15% of Phase 6.14)
- ✅ 4T performance: X.XM ops/sec (safe, no scalability)
- ✅ Helgrind: **0 data races** detected
- ✅ Stability: **10/10 runs** without crashes
---
## 🔧 **Implementation Details**
**Files Modified**:
- `hakmem.c` - Added global lock + wrapper functions
**Lines Changed**:
- +20 lines (pthread.h, global lock, HAKMEM_LOCK/UNLOCK macros)
- +10 lines (hak_alloc_at wrapper)
- +10 lines (hak_free_at wrapper)
- **Total**: ~40 lines
**Pattern**:
```c
void* hak_alloc_at(size_t size, uintptr_t site_id) {
HAKMEM_LOCK();
void* ptr = hak_alloc_at_internal(size, site_id);
HAKMEM_UNLOCK();
return ptr;
}
```
🎯 Next Steps
Phase 6.15 P1: Tiny Pool TLS (2 hours)
- Expected: 4T = 12-15M ops/sec (+100-150%)
- TLS hit rate: 95%+
- Lock avoidance: 95%+
Start Date: 2025-10-XX
**Estimated Time**: 15 minutes
---
### **Step 2 Total Time: 2-3 hours**
---
## 🚀 **Step 3: Multi-threaded Performance (P1-P3)** (8-10 hours)
### **Overview**
**Goal**: Achieve near-ideal scalability (4T ≈ 4x 1T) using Thread-Local Storage (TLS)
**Validation**: Phase 6.13 already proved TLS works
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)
**Strategy**: Expand existing L2.5 TLS to Tiny Pool and L2 Pool
---
### **Phase 6.15 P1: Tiny Pool TLS** (2 hours)
**Goal**: Thread-local cache for ≤1KB allocations (8 size classes)
**Existing Reference**: `hakmem_l25_pool.c:26` (TLS pattern already implemented)
#### **Implementation**
**File**: `apps/experiments/hakmem-poc/hakmem_tiny.c`
**Changes**:
1. **Add TLS cache** (after line 12):
```c
// Phase 6.15 P1: Thread-Local Storage for Tiny Pool
// Pattern: Same as L2.5 Pool (hakmem_l25_pool.c:26)
// Purpose: Reduce global freelist contention (50 cycles → 10 cycles)
// Hit rate expected: 95%+
static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};
static __thread int tls_tiny_initialized = 0;
```
- TLS initialization (new function):
// Initialize TLS cache for current thread
static void hak_tiny_tls_init(void) {
if (tls_tiny_initialized) return;
// Initialize all size classes to NULL
for (int i = 0; i < TINY_NUM_CLASSES; i++) {
tls_tiny_cache[i] = NULL;
}
tls_tiny_initialized = 1;
}
- Modify hak_tiny_alloc (existing function):
void* hak_tiny_alloc(size_t size, uintptr_t site_id) {
// Phase 6.15 P1: TLS fast path
if (!tls_tiny_initialized) {
hak_tiny_tls_init();
}
int class_idx = hak_tiny_get_class_index(size);
// TLS hit check (no lock needed)
TinySlab* slab = tls_tiny_cache[class_idx];
if (slab && slab->free_count > 0) {
// Fast path: Allocate from TLS cache
return hak_tiny_alloc_from_slab(slab, class_idx);
}
// TLS miss: Refill from global freelist (locked)
HAKMEM_LOCK();
// Try to get a slab from global freelist
slab = g_tiny_pool.free_slabs[class_idx];
if (slab) {
// Move slab to TLS cache
g_tiny_pool.free_slabs[class_idx] = slab->next;
tls_tiny_cache[class_idx] = slab;
slab->next = NULL; // Detach from freelist
} else {
// Allocate new slab (existing logic)
slab = allocate_new_slab(class_idx);
if (!slab) {
HAKMEM_UNLOCK();
return NULL;
}
tls_tiny_cache[class_idx] = slab;
}
HAKMEM_UNLOCK();
// Allocate from newly cached slab
return hak_tiny_alloc_from_slab(slab, class_idx);
}
- Modify hak_tiny_free (existing function):
void hak_tiny_free(void* ptr, uintptr_t site_id) {
if (!ptr) return;
// Find owner slab (O(N) or O(1) depending on g_use_registry)
TinySlab* slab = hak_tiny_owner_slab(ptr);
if (!slab) {
fprintf(stderr, "[Tiny] ERROR: Invalid pointer!\n");
return;
}
int class_idx = slab->size_class;
// Free block in slab
hak_tiny_free_in_slab(slab, ptr, class_idx);
// Check if slab is now empty
if (slab->free_count == slab->total_count) {
// Phase 6.15 P1: Return empty slab to global freelist
// First, remove from TLS cache if it's there
if (tls_tiny_cache[class_idx] == slab) {
tls_tiny_cache[class_idx] = NULL;
}
// Return to global freelist (locked)
HAKMEM_LOCK();
slab->next = g_tiny_pool.free_slabs[class_idx];
g_tiny_pool.free_slabs[class_idx] = slab;
HAKMEM_UNLOCK();
}
}
Expected Performance:
- TLS hit rate: 95%+
- Lock contention: 5% (only on TLS miss)
- 4T performance: 12-15M ops/sec (+264-355% vs 3.3M baseline)
Implementation Time: 2 hours
Phase 6.15 P2: L2 Pool TLS (3 hours)
Goal: Thread-local cache for 2-32KB allocations (5 size classes)
Pattern: Same as Tiny Pool TLS (above)
Implementation
File: apps/experiments/hakmem-poc/hakmem_pool.c
Changes: (Similar structure to Tiny Pool TLS)
- Add `static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];` declaration
- Implement TLS fast path in `hak_pool_alloc()`
- Implement TLS refill logic (global freelist → TLS cache)
- Implement TLS return logic (empty slabs → global freelist)
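The change list above can be sketched as a toy version of the TLS fast path. `L2Block`'s layout, the one-slot TLS cache, and the `l2_pop`/`l2_push` names are simplifications for illustration, not the actual hakmem_pool.c API:

```c
#include <pthread.h>
#include <stddef.h>

// Toy sketch of the P2 TLS fast path (names hypothetical).
#define L2_NUM_CLASSES 5

typedef struct L2Block { struct L2Block* next; } L2Block;

static L2Block* g_l2_freelist[L2_NUM_CLASSES];
static pthread_mutex_t g_l2_lock = PTHREAD_MUTEX_INITIALIZER;

// Per-thread one-slot cache: a hit avoids the global lock entirely.
static __thread L2Block* tls_l2_cache[L2_NUM_CLASSES];

L2Block* l2_pop(int class_idx) {
    // Fast path: TLS hit, no lock.
    L2Block* b = tls_l2_cache[class_idx];
    if (b) {
        tls_l2_cache[class_idx] = NULL;
        return b;
    }
    // Slow path: refill from the shared freelist under the lock.
    pthread_mutex_lock(&g_l2_lock);
    b = g_l2_freelist[class_idx];
    if (b) g_l2_freelist[class_idx] = b->next;
    pthread_mutex_unlock(&g_l2_lock);
    return b; // May be NULL: caller falls back to allocating a new block.
}

void l2_push(int class_idx, L2Block* b) {
    // Prefer the TLS slot; overflow goes back to the global list.
    if (!tls_l2_cache[class_idx]) {
        tls_l2_cache[class_idx] = b;
        return;
    }
    pthread_mutex_lock(&g_l2_lock);
    b->next = g_l2_freelist[class_idx];
    g_l2_freelist[class_idx] = b;
    pthread_mutex_unlock(&g_l2_lock);
}
```

A real implementation would cache more than one block per class, but the lock-avoidance structure is the same.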
Expected Performance:
- TLS hit rate: 90%+
- Cumulative 4T performance: 15-18M ops/sec
Implementation Time: 3 hours
Phase 6.15 P3: L2.5 Pool TLS Expansion (3 hours)
Goal: Expand existing L2.5 TLS to full implementation
Current State: hakmem_l25_pool.c:26 already has TLS declaration:
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
Missing: TLS refill/eviction logic (currently only used in fast path)
Implementation
File: apps/experiments/hakmem-poc/hakmem_l25_pool.c
Changes:
- Implement TLS refill (in `hak_l25_pool_alloc`):
// Existing TLS check (line ~230)
L25Block* block = tls_l25_cache[class_idx];
if (block) {
tls_l25_cache[class_idx] = NULL; // Pop from TLS
// ... existing header rewrite ...
return user_ptr;
}
// NEW: TLS refill from global freelist
HAKMEM_LOCK();
int shard_idx = (site_id >> 4) & (L25_NUM_SHARDS - 1);
// Check non-empty bitmap
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
// Empty freelist, allocate new bundle
// ... existing logic ...
} else {
// Pop from global freelist
block = g_l25_pool.freelist[class_idx][shard_idx];
g_l25_pool.freelist[class_idx][shard_idx] = block->next;
// Update bitmap if freelist is now empty
if (!g_l25_pool.freelist[class_idx][shard_idx]) {
g_l25_pool.nonempty_mask[class_idx] &= ~(1ULL << shard_idx);
}
// Move to TLS cache
tls_l25_cache[class_idx] = block;
}
HAKMEM_UNLOCK();
// Allocate from TLS cache
block = tls_l25_cache[class_idx];
tls_l25_cache[class_idx] = NULL;
// ... existing header rewrite ...
return user_ptr;
- Implement TLS eviction (in `hak_l25_pool_free`):
// Existing logic to add to freelist
L25Block* block = (L25Block*)hdr;
// Phase 6.15 P3: Add to TLS cache first (if empty)
if (!tls_l25_cache[class_idx]) {
tls_l25_cache[class_idx] = block;
block->next = NULL;
return; // No need to lock
}
// TLS cache full, return to global freelist (locked)
HAKMEM_LOCK();
block->next = g_l25_pool.freelist[class_idx][shard_idx];
g_l25_pool.freelist[class_idx][shard_idx] = block;
// Update bitmap
g_l25_pool.nonempty_mask[class_idx] |= (1ULL << shard_idx);
HAKMEM_UNLOCK();
Expected Performance:
- TLS hit rate: 95%+
- Cumulative 4T performance: 18-22M ops/sec (+445-567%)
Implementation Time: 3 hours
📋 Implementation Checklist
Step 1: Documentation (1 hour) ✅
- Task 1.1: Fix PHASE_6.14_COMPLETION_REPORT.md (15 min)
- Task 1.2: Create PHASE_6.15_PLAN.md (30 min) ← THIS DOCUMENT
- Task 1.3: Update CURRENT_TASK.md (10 min)
- Task 1.4: Update README.md if exists (5 min)
- Task 1.5: Verification (5 min)
Step 2: P0 Safety Lock (2-3 hours)
- Task 2.1: Implementation (30 min)
- Add pthread.h include
- Add g_hakmem_lock + HAKMEM_LOCK/UNLOCK macros
- Wrap hak_alloc_at() with lock
- Wrap hak_free_at() with lock
- Task 2.2: Build & Smoke Test (15 min)
- `make clean && make bench_allocators`
- Single-threaded test (json scenario)
- Verify: 13-15M ops/sec
- Task 2.3: Multi-threaded Validation (1 hour)
- Test 1: larson 1T/4T (30 min)
- Test 2: Helgrind race detection (20 min)
- Test 3: Stability test 10 runs (10 min)
- Task 2.4: Document Results (15 min)
- Create PHASE_6.15_P0_RESULTS.md
Step 3: TLS Performance (8-10 hours)
- P1: Tiny Pool TLS (2 hours)
  - Add `tls_tiny_cache[]` declaration
  - Implement `hak_tiny_tls_init()`
  - Modify `hak_tiny_alloc()` (TLS fast path)
  - Modify `hak_tiny_free()` (TLS eviction)
  - Test: larson 4T → 12-15M ops/sec
  - Document: PHASE_6.15_P1_RESULTS.md
- P2: L2 Pool TLS (3 hours)
  - Add `tls_l2_cache[]` declaration
  - Implement TLS fast path in `hak_pool_alloc()`
  - Implement TLS refill logic
  - Implement TLS eviction logic
  - Test: larson 4T → 15-18M ops/sec
  - Document: PHASE_6.15_P2_RESULTS.md
- P3: L2.5 Pool TLS Expansion (3 hours)
  - Implement TLS refill in `hak_l25_pool_alloc()`
  - Implement TLS eviction in `hak_l25_pool_free()`
  - Test: larson 4T → 18-22M ops/sec
  - Document: PHASE_6.15_P3_RESULTS.md
- Final Validation (1 hour)
- larson 1T/4T/16T full validation
- Internal benchmarks (json/mir/vm)
- Helgrind final check
- Create PHASE_6.15_COMPLETION_REPORT.md
⚠️ Risk Assessment
| Phase | Risk Level | Failure Mode | Mitigation |
|---|---|---|---|
| P0 (Safety Lock) | ZERO | Worst case: slow but safe | N/A |
| P1 (Tiny TLS) | LOW | TLS miss overhead | Feature flag HAKMEM_TLS_TINY |
| P2 (L2 TLS) | LOW | Memory overhead (TLS×threads) | Monitor RSS |
| P3 (L2.5 TLS) | LOW | Existing code 50% done | Incremental |
Rollback Strategy:
- Every phase has `#ifdef HAKMEM_TLS_PHASEX`
- Can disable individual TLS layers if issues found
- P0 Safety Lock ensures correctness even if TLS disabled
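A minimal sketch of what such a per-phase feature flag could look like at compile time. `HAKMEM_TLS_TINY` follows the flag naming used earlier in this plan, but the exact gating mechanism shown here is an assumption:

```c
// Hypothetical compile-time gate for the P1 TLS layer. Building with
// -DHAKMEM_TLS_TINY=0 falls back to the P0 locked path unchanged.
#ifndef HAKMEM_TLS_TINY
#define HAKMEM_TLS_TINY 1  // TLS enabled by default once P1 lands
#endif

int tiny_alloc_uses_tls(void) {
#if HAKMEM_TLS_TINY
    return 1;  // TLS fast path compiled in
#else
    return 0;  // P0 global-lock path only
#endif
}
```

Because the fallback is the already-validated P0 lock, disabling a layer never sacrifices correctness, only speed.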
🎯 Success Criteria
Minimum Success (P0 only)
- ✅ 4T ≥ 13M ops/sec (safe, from 3.3M)
- ✅ Zero race conditions (Helgrind)
- ✅ 10/10 stability runs
Target Success (P0 + P1 + P2)
- ✅ 4T ≥ 15M ops/sec (+355% vs 3.3M baseline)
- ✅ TLS hit rate ≥ 90%
- ✅ No single-threaded regression (≤15% overhead)
Stretch Goal (All Phases)
- ✅ 4T ≥ 18M ops/sec (+445%)
- ✅ 16T ≥ 11.6M ops/sec (match system allocator)
- ✅ Scalable up to 32 threads
Validated (Phase 6.13 Proof)
- ✅ ALREADY ACHIEVED: 4T = 15.9M ops/sec (+381%) ✅
📊 Expected Timeline
Week 1: Foundation (Day 1-2)
- Day 1 AM (1 hour): Step 1 - Documentation updates
- Day 1 PM (2-3 hours): Step 2 - P0 Safety Lock
- Day 2 (2 hours): Step 3 - P1 Tiny Pool TLS
Milestone: 4T = 12-15M ops/sec (+264-355%)
Week 2: Expansion (Day 3-5)
- Day 3-4 (3 hours): Step 3 - P2 L2 Pool TLS
- Day 5 (3 hours): Step 3 - P3 L2.5 Pool TLS
Milestone: 4T = 18-22M ops/sec (+445-567%)
Week 3: Validation (Day 6)
- Day 6 (1 hour): Final validation + completion report
Milestone: ✅ Phase 6.15 Complete
🔬 Technical References
Existing TLS Implementation
File: apps/experiments/hakmem-poc/hakmem_l25_pool.c:26
__thread L25Block* tls_l25_cache[L25_NUM_CLASSES] = {NULL};
Pattern: Per-thread cache for each size class (L1 cache hit)
Phase 6.13 Validation
File: apps/experiments/hakmem-poc/PHASE_6.13_INITIAL_RESULTS.md
Results:
- 1-thread: 17.8M ops/sec (+123% vs system)
- 4-thread: 15.9M ops/sec (+147% vs system)
- Proof: TLS works and provides massive benefit
Thread Safety Analysis
File: apps/experiments/hakmem-poc/THREAD_SAFETY_SOLUTION.md
Key Insights:
- mimalloc/jemalloc both use TLS as primary approach
- TLS hit rate: 95%+ (industry standard)
- Lock contention: 5% (only on TLS miss/refill)
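One way to verify the claimed 95%+ hit rate during P1-P3 work is per-thread counters bumped on the fast and slow paths. The counters and helper below are hypothetical instrumentation, not existing hakmem code:

```c
// Hypothetical per-thread instrumentation for the TLS hit-rate claim.
// tls_hits would be incremented on the lock-free fast path,
// tls_misses on the locked refill path.
static __thread unsigned long tls_hits, tls_misses;

double tls_hit_rate(void) {
    unsigned long total = tls_hits + tls_misses;
    return total ? (double)tls_hits / (double)total : 0.0;
}
```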
📝 Implementation Notes
Why 3 Stages?
- Step 1 (Docs): Ensure clarity on what went wrong (67.9M issue) and what's being fixed
- Step 2 (P0): Prove correctness FIRST (no crashes, no data races)
- Step 3 (P1-P3): Optimize for performance (TLS) with safety already guaranteed
Why Not Skip P0?
- Risk mitigation: If TLS fails, we still have working thread-safe allocator
- Debugging: Easier to debug TLS issues with known-working locked baseline
- Validation: P0 proves the global lock pattern is correct
Why TLS Over Lock-free?
- Phase 6.14 proved: Sequential O(N) is 2.9-13.7x faster than Random O(1) Hash
- Implication: Lock-free atomic hash will be SLOWER than TLS
- Industry standard: mimalloc/jemalloc use TLS, not lock-free
- Proven: Phase 6.13 validated +123-147% improvement with TLS
🚀 Next Steps After Phase 6.15
Phase 6.17: 16-Thread Scalability (Optional, 4 hours)
Current Issue: 16T = 7.6M ops/sec (-34.8% vs system 11.6M)
Investigation:
- Profile global lock contention (perf, helgrind)
- Measure Whale cache hit rate by thread count
- Analyze shard distribution (hash collision?)
- Optimize TLS cache refill (batch refill to reduce global access)
Target: 16T ≥ 11.6M ops/sec (match or beat system)
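The batch-refill idea in the last bullet can be illustrated with a toy freelist: one lock acquisition moves up to a batch of blocks into the TLS list, so the per-allocation lock cost drops by roughly the batch size. `REFILL_BATCH`, `Block`, and `alloc_block` are hypothetical names; the real refill would operate on hakmem's slab structures:

```c
#include <pthread.h>
#include <stddef.h>

#define REFILL_BATCH 8

typedef struct Block { struct Block* next; } Block;

static Block* g_freelist;
static pthread_mutex_t g_lock = PTHREAD_MUTEX_INITIALIZER;
static __thread Block* tls_list;

Block* alloc_block(void) {
    if (!tls_list) {
        // Slow path: grab a whole batch under a single lock acquisition.
        pthread_mutex_lock(&g_lock);
        for (int i = 0; i < REFILL_BATCH && g_freelist; i++) {
            Block* b = g_freelist;
            g_freelist = b->next;
            b->next = tls_list;  // push onto the thread-local list
            tls_list = b;
        }
        pthread_mutex_unlock(&g_lock);
    }
    // Fast path: pop from the thread-local list, no lock.
    Block* b = tls_list;
    if (b) tls_list = b->next;
    return b; // NULL means the global list was empty too
}
```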
📚 Related Documents
- THREAD_SAFETY_SOLUTION.md - Complete analysis (Option A/B/C comparison)
- PHASE_6.13_INITIAL_RESULTS.md - TLS validation proof
- PHASE_6.14_COMPLETION_REPORT.md - Registry toggle + thread issue discovery
- CURRENT_TASK.md - Overall project status
Total Time Investment: 12-13 hours Expected ROI: 6-15x improvement (3.3M → 20-50M ops/sec) Risk: Low (feature flags + proven design) Validation: Phase 6.13 already proves TLS works (+147% at 4 threads)
Implementation by: Claude + ChatGPT (coordinated development) Planning Date: 2025-10-22 Status: ✅ Ready to Execute