hakmem/docs/archive/PHASE_6.15_QUICK_REF.md

Phase 6.15: Quick Reference Card

Full Details: See PHASE_6.15_PLAN.md (1008 lines)


📊 The Problem

Current State: hakmem is THREAD-UNSAFE

1-thread:  15.1M ops/sec ✅ Excellent
4-thread:   3.3M ops/sec ❌ -78% collapse!

Root Cause: the codebase has no locking at all (grep pthread_mutex *.c → 0 results)

🎯 The Solution (3 Steps)

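| Step | What            | Time  | Expected Result                |
|------|-----------------|-------|--------------------------------|
| 1    | Fix docs        | 1h    | Clarity on 67.9M issue         |
| 2    | P0 Safety Lock  | 2-3h  | 4T = 13-15M (safe, no scaling) |
| 3    | TLS Performance | 8-10h | 4T = 15-20M (+381% proven)     |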

📋 Step-by-Step Execution

Day 1 Morning: Step 1 (1 hour)

cd apps/experiments/hakmem-poc

# 1. Edit PHASE_6.14_COMPLETION_REPORT.md
# Add section explaining 67.9M measurement issue
# Add thread safety warning

# 2. Edit CURRENT_TASK.md
# Move Phase 6.14 to completed
# Add Phase 6.15 as current focus

# 3. Verify
grep "67.9M\|Thread Safety" PHASE_6.14_COMPLETION_REPORT.md
grep "Phase 6.15" CURRENT_TASK.md

Day 1 Afternoon: Step 2 - P0 Safety Lock (2-3 hours)

Implementation (30 min)

File: hakmem.c

// After line 22: Add pthread.h
#include <pthread.h>

// After line 58: Add global lock
static pthread_mutex_t g_hakmem_lock = PTHREAD_MUTEX_INITIALIZER;
#define HAKMEM_LOCK() pthread_mutex_lock(&g_hakmem_lock)
#define HAKMEM_UNLOCK() pthread_mutex_unlock(&g_hakmem_lock)

// Rename the existing hak_alloc_at to hak_alloc_at_internal (find ~line 300-400),
// then add this wrapper so every allocation runs under the global lock
void* hak_alloc_at(size_t size, uintptr_t site_id) {
    HAKMEM_LOCK();
    void* ptr = hak_alloc_at_internal(size, site_id);
    HAKMEM_UNLOCK();
    return ptr;
}

// Same for free: rename the existing hak_free_at to hak_free_at_internal
void hak_free_at(void* ptr, uintptr_t site_id) {
    if (!ptr) return;  // freeing NULL is a no-op, so skip the lock
    HAKMEM_LOCK();
    hak_free_at_internal(ptr, site_id);
    HAKMEM_UNLOCK();
}
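
The wrapper pattern above can be tried in isolation. The following is a minimal sketch, assuming a toy "pool" that is only a shared counter; every `demo_*` name is hypothetical and none of it is hakmem code. It shows only the shape of the P0 change: non-thread-safe internals, public wrappers that hold one global mutex.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for allocator state: a plain, non-atomic counter. */
enum { DEMO_POOL_BLOCKS = 64 };
static int g_demo_free_blocks = DEMO_POOL_BLOCKS;
static pthread_mutex_t g_demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Internals: read-modify-write on shared state, unsafe without a lock. */
static int demo_alloc_internal(void) {
    if (g_demo_free_blocks == 0) return 0;  /* pool exhausted */
    g_demo_free_blocks--;
    return 1;
}
static void demo_free_internal(void) { g_demo_free_blocks++; }

/* Public wrappers: all access to shared state happens under the lock. */
int demo_alloc(void) {
    pthread_mutex_lock(&g_demo_lock);
    int ok = demo_alloc_internal();
    pthread_mutex_unlock(&g_demo_lock);
    return ok;
}
void demo_free(void) {
    pthread_mutex_lock(&g_demo_lock);
    demo_free_internal();
    pthread_mutex_unlock(&g_demo_lock);
}

static void *demo_worker(void *arg) {
    long iters = (long)arg;
    for (long i = 0; i < iters; i++) {
        if (demo_alloc()) demo_free();  /* balanced alloc/free pair */
    }
    return NULL;
}

/* Runs up to 16 workers and returns the free-block count afterwards. */
int demo_run(int nthreads, long iters) {
    pthread_t tids[16];
    int started = 0;
    if (nthreads > 16) nthreads = 16;
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, demo_worker, (void *)iters) == 0)
            started++;
    }
    for (int i = 0; i < started; i++) pthread_join(tids[i], NULL);
    return g_demo_free_blocks;
}
```

Because every alloc is paired with a free and both run under the lock, `demo_run(4, 100000)` should return the initial block count. Removing the lock calls turns the counter updates into a data race, which is the kind of defect Helgrind is expected to report.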

Testing (1.5 hours)

# Build
make clean && make shared

# Test 1: larson 1T/4T (30 min)
cd /tmp/mimalloc-bench/bench/larson

# 1-thread
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 1
# Expected: 13-15M ops/sec

# 4-thread
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4
# Expected: 13-15M ops/sec (same as 1T, no crashes!)

# Test 2: Helgrind (20 min)
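# LD_PRELOAD must come before valgrind; placed after it, valgrind
# would treat the assignment as the program to run
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  valgrind --tool=helgrind \
  ./larson 0 8 1024 1000 1 12345 4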
# Expected: ERROR SUMMARY: 0 errors

# Test 3: Stability (10 min)
for i in {1..10}; do
  LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  ./larson 0 8 1024 10000 1 12345 4 || exit 1
done
# Expected: 10/10 runs succeed

Documentation (15 min)

Create PHASE_6.15_P0_RESULTS.md with benchmark results.


Day 2: Step 3 - P1 Tiny Pool TLS (2 hours)

File: hakmem_tiny.c

Pattern (copy from hakmem_l25_pool.c:26):

// Add TLS cache
static __thread TinySlab* tls_tiny_cache[TINY_NUM_CLASSES] = {NULL};

// TLS fast path in hak_tiny_alloc()
TinySlab* slab = tls_tiny_cache[class_idx];
if (slab && slab->free_count > 0) {
    // Fast path: no lock needed
    return alloc_from_slab(slab, class_idx);
}

// TLS miss: refill from global (locked)
HAKMEM_LOCK();
// ... refill logic ...
HAKMEM_UNLOCK();

Test: larson 4T → expect 12-15M ops/sec
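
The fast-path/refill split can be sketched as a self-contained example. This is a simplified illustration under stated assumptions, not the hakmem_tiny.c implementation: it caches individual blocks rather than slabs, and `DemoBlock`, `DEMO_BATCH`, and all `demo_*` functions are hypothetical names. Like the document, it uses the GCC/Clang `__thread` storage class.

```c
#include <pthread.h>
#include <stddef.h>

enum { DEMO_CLASSES = 4, DEMO_BATCH = 8 };  /* hypothetical sizes */

typedef struct DemoBlock { struct DemoBlock *next; } DemoBlock;

/* Shared free lists, one per size class, guarded by a single lock. */
static DemoBlock *g_global_free[DEMO_CLASSES];
static pthread_mutex_t g_global_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread caches and counters: accessed without any lock. */
static __thread DemoBlock *tls_cache[DEMO_CLASSES];
static __thread unsigned long tls_hits, tls_misses;

/* Seed one class of the global list with caller-provided storage. */
void demo_seed(int cls, DemoBlock *blocks, size_t n) {
    pthread_mutex_lock(&g_global_lock);
    for (size_t i = 0; i < n; i++) {
        blocks[i].next = g_global_free[cls];
        g_global_free[cls] = &blocks[i];
    }
    pthread_mutex_unlock(&g_global_lock);
}

void *demo_tls_alloc(int cls) {
    DemoBlock *b = tls_cache[cls];
    if (b) {                        /* fast path: TLS hit, no lock */
        tls_cache[cls] = b->next;
        tls_hits++;
        return b;
    }
    tls_misses++;
    /* Slow path: move up to DEMO_BATCH blocks into the TLS cache. */
    pthread_mutex_lock(&g_global_lock);
    for (int i = 0; i < DEMO_BATCH && g_global_free[cls]; i++) {
        DemoBlock *g = g_global_free[cls];
        g_global_free[cls] = g->next;
        g->next = tls_cache[cls];
        tls_cache[cls] = g;
    }
    pthread_mutex_unlock(&g_global_lock);
    b = tls_cache[cls];
    if (b) tls_cache[cls] = b->next;
    return b;                       /* NULL if the global list was empty */
}

unsigned long demo_tls_hits(void)   { return tls_hits; }
unsigned long demo_tls_misses(void) { return tls_misses; }
```

In this sketch the lock is taken once per refill, so with a batch of 8 roughly one allocation in eight pays for the mutex. The batch size is the knob that trades TLS hit rate against memory held idle in per-thread caches.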


Day 3-4: P2 L2 Pool TLS (3 hours)

File: hakmem_pool.c

Same pattern as Tiny Pool (above)

Test: larson 4T → expect 15-18M ops/sec


Day 5: P3 L2.5 Pool TLS (3 hours)

File: hakmem_l25_pool.c

Existing: Line 26 already has __thread L25Block* tls_l25_cache[5];

Add: Refill/eviction logic in alloc/free functions

Test: larson 4T → expect 18-22M ops/sec
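
One possible shape for the eviction half of that logic is sketched below. It assumes a per-thread list with a fixed capacity that returns half of its blocks to a shared list when the cap is exceeded; `EvBlock`, `L25_DEMO_CAP`, and the `ev_*` functions are hypothetical and do not reflect the actual layout of hakmem_l25_pool.c.

```c
#include <pthread.h>
#include <stddef.h>

enum { L25_DEMO_CAP = 4 };  /* hypothetical per-thread capacity */

typedef struct EvBlock { struct EvBlock *next; } EvBlock;

/* Shared list and its length, guarded by one lock. */
static EvBlock *g_ev_global;
static size_t g_ev_global_count;
static pthread_mutex_t g_ev_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread list and its length: no lock needed. */
static __thread EvBlock *tls_ev_head;
static __thread size_t tls_ev_count;

/* Free path: push onto the TLS list; once it exceeds the cap, hand
 * half of the cached blocks back to the shared list in one locked batch. */
void ev_free(EvBlock *b) {
    b->next = tls_ev_head;
    tls_ev_head = b;
    tls_ev_count++;
    if (tls_ev_count <= L25_DEMO_CAP) return;  /* fast path: no lock */

    size_t to_evict = tls_ev_count / 2;
    pthread_mutex_lock(&g_ev_lock);
    while (to_evict-- > 0) {
        EvBlock *e = tls_ev_head;
        tls_ev_head = e->next;
        tls_ev_count--;
        e->next = g_ev_global;
        g_ev_global = e;
        g_ev_global_count++;
    }
    pthread_mutex_unlock(&g_ev_lock);
}

size_t ev_tls_count(void) { return tls_ev_count; }

size_t ev_global_count(void) {
    pthread_mutex_lock(&g_ev_lock);
    size_t n = g_ev_global_count;
    pthread_mutex_unlock(&g_ev_lock);
    return n;
}
```

Evicting half rather than everything keeps some blocks local for the next allocation, so a thread that alternates between freeing and allocating does not take the lock on every call.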


📊 Performance Roadmap

Before P0:  1T = 15.1M  4T = 3.3M  (-78%) ← UNSAFE
After P0:   1T = 13-15M 4T = 13-15M (+294-355%) ← SAFE, no scaling
After P1:   1T = 13-15M 4T = 12-15M (+264-355%) ← 95% TLS hit
After P2:   1T = 13-15M 4T = 15-18M (+355-445%) ← 90% TLS hit
After P3:   1T = 13-15M 4T = 18-22M (+445-567%) ← Full TLS

Phase 6.13 Validation:
            1T = 17.8M  4T = 15.9M (+381%) ✅ PROVEN
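
The percentages in the roadmap appear to be relative changes: the 4T gains are measured against the unsafe 3.3M baseline, and the -78% collapse against the 15.1M 1T figure. A minimal sketch of that arithmetic follows; the function name `pct_change` is illustrative, not part of hakmem.

```c
/* Relative change of `value` against `base`, in percent.
 * Positive means faster than the baseline, negative means slower. */
double pct_change(double base, double value) {
    return (value - base) / base * 100.0;
}
```

For example, `pct_change(15.1, 3.3)` is about -78.1 and `pct_change(3.3, 13.0)` is about 293.9, which match the -78% and +294% entries once rounded.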

Success Criteria

P0 (Minimum):

  • 4T ≥ 13M ops/sec
  • Helgrind: 0 data races
  • 10/10 stability runs

P0+P1+P2 (Target):

  • 4T ≥ 15M ops/sec
  • TLS hit rate ≥ 90%
  • 1T regression ≤ 15%

All Phases (Stretch):

  • 4T ≥ 18M ops/sec
  • 16T ≥ 11.6M ops/sec
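
These thresholds can be checked mechanically after each benchmark run. The sketch below is one way to encode the P0 and target criteria. The `PhaseResult` struct and both function names are hypothetical, and the 1T regression is assumed to be measured against a caller-supplied baseline such as the pre-P0 15.1M figure.

```c
#include <stdbool.h>

typedef struct {
    double ops_1t_m;      /* 1-thread throughput, M ops/sec */
    double ops_4t_m;      /* 4-thread throughput, M ops/sec */
    int helgrind_races;   /* data races reported by Helgrind */
    int stable_runs;      /* successful runs out of 10 */
    double tls_hit_rate;  /* fraction in [0.0, 1.0] */
} PhaseResult;

/* P0 (minimum): 4T >= 13M, no races, 10/10 stability runs. */
bool meets_p0(const PhaseResult *r) {
    return r->ops_4t_m >= 13.0 && r->helgrind_races == 0 &&
           r->stable_runs == 10;
}

/* P0+P1+P2 (target): 4T >= 15M, TLS hit rate >= 90%,
 * and 1T no more than 15% below the given baseline. */
bool meets_target(const PhaseResult *r, double baseline_1t_m) {
    double regression = (baseline_1t_m - r->ops_1t_m) / baseline_1t_m;
    return r->ops_4t_m >= 15.0 && r->tls_hit_rate >= 0.90 &&
           regression <= 0.15;
}
```

With a 15.1M baseline, a 1T result of 13.0M is a regression of about 13.9% and passes, while 12.0M is about 20.5% and fails.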

🚨 Critical Findings

  1. 67.9M ops/sec = Measurement Error

    • Actual: 15.1M (1T), 3.3M (4T)
    • Fix: Update Phase 6.14 report

  2. 4-thread collapse = Thread-unsafe

    • NOT a feature, NOT expected
    • Zero pthread_mutex in codebase
    • Fix: P0 global lock (30 min)

  3. TLS is validated (+381%)

    • Phase 6.13 proved 4T = 15.9M ops/sec
    • NOT the cause of Phase 6.11.5 regression
    • Real culprit: Slab Registry (Phase 6.12.1)

📁 Document Map

PHASE_6.15_PLAN.md           - Full implementation guide (1008 lines)
PHASE_6.15_SUMMARY.md        - Executive summary (152 lines)
PHASE_6.15_QUICK_REF.md      - Quick reference card (YOU ARE HERE)

THREAD_SAFETY_SOLUTION.md    - Complete analysis (Option A/B/C)
PHASE_6.13_INITIAL_RESULTS.md - TLS validation proof
PHASE_6.14_COMPLETION_REPORT.md - Thread issue discovery

🔧 Common Commands

# Build hakmem
cd apps/experiments/hakmem-poc
make clean && make shared

# larson benchmark (4-thread)
cd /tmp/mimalloc-bench/bench/larson
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
./larson 0 8 1024 10000 1 12345 4

# Helgrind race detection
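# LD_PRELOAD must come before valgrind; placed after it, valgrind
# would treat the assignment as the program to run
LD_PRELOAD=~/git/hakorune-selfhost/apps/experiments/hakmem-poc/libhakmem.so \
  valgrind --tool=helgrind \
  ./larson 0 8 1024 1000 1 12345 4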

# Check pthread usage
grep -n "pthread" apps/experiments/hakmem-poc/*.c

📞 Need Help?


Status: Ready to execute
Total Time: 12-13 hours (6 days)
Expected ROI: 6-15x improvement (3.3M → 20-50M ops/sec)