# Phase 7 Full Benchmark Suite Execution Plan
**Date**: 2025-11-08
**Phase**: 7-1.3 (HEADER_CLASSIDX=1 optimization)
**Current Status**: Partial results available (Larson 1T: 2.63M ops/s, bench_random_mixed 128B: 17.7M ops/s)
**Goal**: Comprehensive performance evaluation across ALL benchmark patterns
---
## Executive Summary
### Available Benchmarks (5 categories)
1. **Larson** - Multi-threaded stress test (8-128B, mimalloc-bench derived)
2. **Random Mixed** - Single-threaded random allocation (16-8192B)
3. **Mid-Large MT** - Multi-threaded mid-size (8-32KB)
4. **VM Mixed** - Large allocations (512KB-2MB, L2.5/L2 test)
5. **Tiny Hot** - Hot path micro-benchmark (8-64B, LIFO)
### Current Build Status (Phase 7 = HEADER_CLASSIDX=1)
All benchmarks were built with HEADER_CLASSIDX=1 on 2025-11-07/08:
- `larson_hakmem` (2025-11-08 11:48)
- `bench_random_mixed_hakmem` (2025-11-08 11:48)
- `bench_mid_large_mt_hakmem` (2025-11-07 18:42)
- `bench_tiny_hot_hakmem` (2025-11-07 18:03)
- `bench_vm_mixed_hakmem` (2025-11-07 18:03)
**Note**: The Makefile enables `HAKMEM_TINY_HEADER_CLASSIDX=1` permanently (lines 99-100).
---
## Execution Plan
### Phase 1: Verify Build Status (5 minutes)
**Verify HEADER_CLASSIDX=1 is enabled:**
```bash
# Check Makefile flag
grep "HAKMEM_TINY_HEADER_CLASSIDX" Makefile
# Verify all binaries are up-to-date
make -n bench_random_mixed_hakmem bench_tiny_hot_hakmem \
        bench_mid_large_mt_hakmem bench_vm_mixed_hakmem \
        larson_hakmem
```
**If rebuild needed:**
```bash
# Clean rebuild with HEADER_CLASSIDX=1 (already default)
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system bench_random_mixed_mi \
        bench_tiny_hot_hakmem bench_tiny_hot_system bench_tiny_hot_mi \
        bench_mid_large_mt_hakmem bench_mid_large_mt_system bench_mid_large_mt_mi \
        bench_vm_mixed_hakmem bench_vm_mixed_system \
        larson_hakmem larson_system larson_mi
```
**Time**: ~3-5 minutes (if rebuild needed)
---
### Phase 2: Quick Sanity Test (2 minutes)
**Test each benchmark runs successfully:**
```bash
# Larson (1T, 1 second)
./larson_hakmem 1 8 128 1024 1 12345 1
# Random Mixed (small run)
./bench_random_mixed_hakmem 1000 128 1234567
# Mid-Large MT (2 threads, small)
./bench_mid_large_mt_hakmem 2 1000 2048 42
# VM Mixed (small)
./bench_vm_mixed_hakmem 100 256 424242
# Tiny Hot (small)
./bench_tiny_hot_hakmem 32 10 1000
```
**Expected**: All benchmarks run without SEGV/crashes.
---
### Phase 3: Full Benchmark Suite Execution
#### Option A: Automated Suite Runner (RECOMMENDED) ⭐
**Use existing bench_suite_matrix.sh:**
```bash
# This runs ALL benchmarks (random_mixed, mid_large_mt, vm_mixed, tiny_hot)
# across system/mimalloc/HAKMEM variants
./scripts/bench_suite_matrix.sh
```
**Output**:
- CSV: `bench_results/suite/<timestamp>/results.csv`
- Raw logs: `bench_results/suite/<timestamp>/raw/*.out`
**Time**: ~15-20 minutes
**Coverage**:
- Random Mixed: 2 cycle counts × 2 working-set sizes × 3 variants = 12 runs
- Mid-Large MT: 2 threads × 3 variants = 6 runs
- VM Mixed: 2 cycles × 2 variants = 4 runs (system + hakmem only)
- Tiny Hot: 2 sizes × 3 variants = 6 runs
**Total**: 28 benchmark runs
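Once the suite finishes, the run count above can be sanity-checked against the CSV; a small sketch, assuming `results.csv` carries a single header row:

```shell
# Count data rows in a suite results.csv (assumes one header line).
count_runs() { tail -n +2 "$1" | wc -l; }

# Usage against the latest suite output -- expect 28:
#   latest=$(ls -td bench_results/suite/* | head -1)
#   count_runs "${latest}/results.csv"
```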
---
#### Option B: Individual Benchmark Scripts (Detailed Analysis)
If you need more control or want to run A/B tests with environment variables:
##### 3.1 Larson Benchmark (Multi-threaded Stress)
**Basic run (1T, 4T, 8T):**
```bash
# 1 thread, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 1
# 4 threads, 10 seconds (CRITICAL: test multi-thread stability)
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 4
# 8 threads, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 8
```
**A/B test with environment variables:**
```bash
# Use automated script (includes PGO)
./scripts/bench_larson_1t_ab.sh
```
**Output**: `bench_results/larson_ab/<timestamp>/results.csv`
**Time**: ~20-30 minutes (includes PGO build)
**Key Metrics**:
- Throughput (ops/s)
- Stability (4T should not crash - see Phase 6-2.3 active counter fix)
---
##### 3.2 Random Mixed (Single-threaded, Mixed Sizes)
**Basic run:**
```bash
# 400K cycles, 8192B working set
HAKMEM_WRAP_TINY=1 ./bench_random_mixed_hakmem 400000 8192 1234567
./bench_random_mixed_system 400000 8192 1234567
./bench_random_mixed_mi 400000 8192 1234567
```
**A/B test with environment variables:**
```bash
# Runs 5 repetitions, median calculation
./scripts/bench_random_mixed_ab.sh
```
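The script's median-of-5 step can also be reproduced by hand when eyeballing raw logs; a minimal sketch, assuming the per-rep throughput values arrive one per line on stdin:

```shell
# Median of newline-separated numeric values on stdin.
median() {
  sort -n | awk '{ v[NR] = $1 } END {
    if (NR % 2) print v[(NR + 1) / 2]
    else        print (v[NR/2] + v[NR/2 + 1]) / 2
  }'
}

# Example: median of 5 per-rep ops/s readings
printf '59.1\n57.8\n60.4\n58.9\n59.6\n' | median   # -> 59.1
```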
**Output**: `bench_results/random_mixed_ab/<timestamp>/results.csv`
**Time**: ~15-20 minutes (5 reps × multiple configs)
**Key Metrics**:
- Throughput (ops/s) across different working set sizes
- SPECIALIZE_MASK impact (0 vs 0x0F)
- FAST_CAP impact (8 vs 16 vs 32)
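The two sweeps above can be scripted directly. Note the env var names `HAKMEM_SPECIALIZE_MASK` and `HAKMEM_FAST_CAP` are assumptions here — confirm them against `bench_random_mixed_ab.sh` before running:

```shell
# Generate the 2x3 sweep command lines (hypothetical variable names).
gen_sweep() {
  for mask in 0 0x0F; do
    for cap in 8 16 32; do
      echo "HAKMEM_SPECIALIZE_MASK=$mask HAKMEM_FAST_CAP=$cap" \
           "./bench_random_mixed_hakmem 400000 8192 1234567"
    done
  done
}
gen_sweep            # inspect first; pipe into `sh` to actually run
```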
---
##### 3.3 Mid-Large MT (Multi-threaded, 8-32KB)
**Basic run:**
```bash
# 4 threads, 40K cycles, 2KB working set
HAKMEM_WRAP_TINY=1 ./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_mid_large_mt_system 4 40000 2048 42
./bench_mid_large_mt_mi 4 40000 2048 42
```
**A/B test:**
```bash
./scripts/bench_mid_large_mt_ab.sh
```
**Output**: `bench_results/mid_large_mt_ab/<timestamp>/results.csv`
**Time**: ~10-15 minutes
**Key Metrics**:
- Multi-threaded performance (2T vs 4T)
- HAKMEM's SuperSlab efficiency (expected: strong performance here)
**Note**: Previous suite results showed a HAKMEM weakness here (suite/20251107: 2.1M vs system 8.7M ops/s).
This is unexpected given the Mid-Large benchmark success (+108% on 2025-11-02);
investigate whether it is a regression or simply a different test pattern.
---
##### 3.4 VM Mixed (Large Allocations, 512KB-2MB)
**Basic run:**
```bash
# 20K cycles, 256 working set
HAKMEM_BIGCACHE_L25=1 HAKMEM_WRAP_TINY=1 ./bench_vm_mixed_hakmem 20000 256 424242
./bench_vm_mixed_system 20000 256 424242
```
**Time**: ~5 minutes
**Key Metrics**:
- L2.5 cache effectiveness (BIGCACHE_L25=1 vs 0)
- Large allocation performance
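A minimal on/off A/B sketch for the L2.5 layer, assuming `HAKMEM_BIGCACHE_L25` fully gates it as in the basic run above:

```shell
# Emit the two A/B command lines (L2.5 off vs on).
gen_l25_ab() {
  for l25 in 0 1; do
    echo "HAKMEM_BIGCACHE_L25=$l25 HAKMEM_WRAP_TINY=1" \
         "./bench_vm_mixed_hakmem 20000 256 424242"
  done
}
gen_l25_ab           # pipe into `sh` to run both configurations
```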
---
##### 3.5 Tiny Hot (Hot Path Micro-benchmark)
**Basic run:**
```bash
# 32B, 100 batch, 60K cycles
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 32 100 60000
./bench_tiny_hot_system 32 100 60000
./bench_tiny_hot_mi 32 100 60000
# 64B
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 64 100 60000
./bench_tiny_hot_system 64 100 60000
./bench_tiny_hot_mi 64 100 60000
```
**Time**: ~5 minutes
**Key Metrics**:
- Hot path efficiency (direct TLS cache access)
- Expected weakness (Phase 6 analysis: -60% vs system)
---
### Phase 4: Analysis and Comparison
#### 4.1 Extract Results from Suite Run
```bash
# Get latest suite results
latest=$(ls -td bench_results/suite/* | head -1)
cat ${latest}/results.csv
# Quick comparison
awk -F, 'NR>1 {
  if ($2=="hakmem") { hak[$1]+=$4; hn[$1]++ }
  if ($2=="system") { sys[$1]+=$4; sn[$1]++ }
  if ($2=="mi")     { mi[$1]+=$4;  mn[$1]++ }
} END {
  for (b in hak) {
    h = hak[b]/hn[b]
    printf "%s: HAKMEM=%.2fM", b, h/1e6
    if (sn[b]) { s = sys[b]/sn[b]; printf " system=%.2fM (vs_sys=%+.1f%%)", s/1e6, (h/s-1)*100 }
    if (mn[b]) { m = mi[b]/mn[b];  printf " mi=%.2fM (vs_mi=%+.1f%%)", m/1e6, (h/m-1)*100 }
    printf "\n"
  }
}' ${latest}/results.csv
```
#### 4.2 Key Comparisons
**Phase 7 vs System malloc:**
```bash
# Extract HAKMEM vs system for each benchmark
awk -F, 'NR>1 && ($2=="hakmem" || $2=="system") {
key=$1 "," $3
if ($2=="hakmem") h[key]=$4
if ($2=="system") s[key]=$4
} END {
for (k in h) {
if (s[k]) {
pct = (h[k]/s[k] - 1) * 100
printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, s[k]/1e6, pct
}
}
}' ${latest}/results.csv | sort
```
**Phase 7 vs mimalloc:**
```bash
# Similar for mimalloc comparison
awk -F, 'NR>1 && ($2=="hakmem" || $2=="mi") {
key=$1 "," $3
if ($2=="hakmem") h[key]=$4
if ($2=="mi") m[key]=$4
} END {
for (k in h) {
if (m[k]) {
pct = (h[k]/m[k] - 1) * 100
printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, m[k]/1e6, pct
}
}
}' ${latest}/results.csv | sort
```
#### 4.3 Generate Summary Report
```bash
# Create comprehensive summary (delimiter left unquoted so $(date) and ${latest} expand)
cat > PHASE7_RESULTS_SUMMARY.md << REPORT
# Phase 7 Benchmark Results Summary
## Test Configuration
- Phase: 7-1.3 (HEADER_CLASSIDX=1)
- Date: $(date +%Y-%m-%d)
- Suite: $(basename ${latest})
## Overall Results
### Random Mixed (16-8192B, single-threaded)
[Insert results here]
### Mid-Large MT (8-32KB, multi-threaded)
[Insert results here]
### VM Mixed (512KB-2MB, large allocations)
[Insert results here]
### Tiny Hot (8-64B, hot path micro)
[Insert results here]
### Larson (8-128B, multi-threaded stress)
[Insert results here]
## Analysis
### Strengths
[Areas where HAKMEM outperforms]
### Weaknesses
[Areas where HAKMEM underperforms]
### Comparison with Previous Phases
[Phase 6 vs Phase 7 delta]
## Bottleneck Identification
[Performance profiling with perf]
REPORT
```
---
### Phase 5: Performance Profiling (Optional, if bottlenecks found)
**Profile hot paths with perf:**
```bash
# Profile random_mixed (if slow)
perf record -g --call-graph dwarf -- \
./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_random_mixed_phase7.txt
# Profile larson 1T
perf record -g --call-graph dwarf -- \
./larson_hakmem 10 8 128 1024 1 12345 1
perf report --stdio > perf_larson_1t_phase7.txt
```
**Compare with Phase 6:**
```bash
# If a Phase 6 profile was saved (e.g. as perf.data.phase6), compare directly:
perf record -o perf.data.phase7 -g --call-graph dwarf -- \
  ./larson_hakmem 10 8 128 1024 1 12345 1
perf diff perf.data.phase6 perf.data.phase7 > perf_diff_phase6_vs_7.txt
```
---
## Expected Results & Analysis Strategy
### Baseline Expectations (from Phase 6 analysis)
#### Strong Areas (Expected +50% to +171% vs System)
1. **Mid-Large (8-32KB)**: HAKMEM's SuperSlab should dominate
- Expected: +100% to +150% vs system
- Phase 7 improvement target: Maintain or improve
2. **Large Allocations (VM Mixed)**: L2.5 layer efficiency
- Expected: Competitive or slight win vs system
#### Weak Areas (Expected -50% to -70% vs System)
1. **Tiny (≤128B)**: Structural weakness identified in Phase 6
- Expected: -40% to -60% vs system
- Phase 7 HEADER_CLASSIDX may help: +10-20% improvement
2. **Random Mixed**: Magazine layer overhead
- Expected: -20% to -50% vs system
- Phase 7 target: Reduce gap
3. **Larson Multi-thread**: Contention issues
- Expected: Variable (1T: ok, 4T+: risk of crashes)
- Phase 7 critical: Verify 4T stability (active counter fix)
### What to Look For
#### Phase 7 Improvements (HEADER_CLASSIDX=1)
- **Tiny allocations**: +10-30% improvement (fewer header loads)
- **Random mixed**: +15-25% improvement (class_idx in header)
- **Cache efficiency**: Better locality (1-byte header vs 2-byte)
#### Red Flags
- **Mid-Large regression**: Should NOT regress (HEADER_CLASSIDX doesn't affect mid-large path)
- **4T+ crashes in Larson**: Active counter bug should be fixed (Phase 6-2.3)
- **Severe regression (>20%)**: Investigate immediately
#### Bottleneck Identification
If Phase 7 results are disappointing:
1. **Run perf** on slow benchmarks
2. **Compare with Phase 6** perf profiles (if available)
3. **Check hot paths**:
- `tiny_alloc_fast()` - Should be 3-4 instructions
- `tiny_free_fast()` - Should be fast header check
- `superslab_refill()` - Should use P0 ctz optimization
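To check the instruction-count claims, the disassembly can be measured directly; a sketch that reads `objdump -d` output on stdin (with AGGRESSIVE_INLINE=1 the symbol may be inlined away and not appear):

```shell
# Count instructions in one function of an objdump -d listing (stdin).
count_insns() {
  awk -v sym="<$1>:" '
    index($0, sym) { on = 1; next }   # header line of the target function
    on && NF == 0  { exit }           # blank line ends its listing
    on             { n++ }
    END            { print n + 0 }'
}

# Usage: objdump -d ./bench_random_mixed_hakmem | count_insns tiny_alloc_fast
```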
---
## Time Estimates
### Minimal Run (Option A: Suite Script Only)
- Build verification: 2 min
- Sanity test: 2 min
- Suite execution: 15-20 min
- Quick analysis: 5 min
- **Total: ~25-30 minutes**
### Comprehensive Run (Option B: All Individual Scripts)
- Build verification: 2 min
- Sanity test: 2 min
- Larson A/B: 25 min
- Random Mixed A/B: 20 min
- Mid-Large MT A/B: 15 min
- VM Mixed: 5 min
- Tiny Hot: 5 min
- Analysis & report: 15 min
- **Total: ~90 minutes (1.5 hours)**
### With Performance Profiling
- Add: ~20-30 min per benchmark
- **Total: ~2-3 hours**
---
## Recommended Execution Order
### Quick Assessment (30 minutes)
1. ✅ Verify build status
2. ✅ Run suite script (bench_suite_matrix.sh)
3. ✅ Generate quick comparison
4. 🔍 Identify major wins/losses
5. 📝 Decide if deep dive needed
### Deep Analysis (if needed, +60 minutes)
1. 🔬 Run individual A/B scripts for problem areas
2. 📊 Profile with perf
3. 📝 Compare with Phase 6 baseline
4. 💡 Generate actionable insights
---
## Output Organization
```
bench_results/
├── suite/
│   └── <timestamp>/
│       ├── results.csv         # All benchmarks, all variants
│       └── raw/*.out           # Raw logs
├── random_mixed_ab/
│   └── <timestamp>/
│       ├── results.csv         # A/B test results
│       └── raw/*.txt           # Per-run data
├── larson_ab/
│   └── <timestamp>/
│       ├── results.csv
│       └── raw/*.out
├── mid_large_mt_ab/
│   └── <timestamp>/
│       ├── results.csv
│       └── raw/*.out
└── ...
# Analysis reports
PHASE7_RESULTS_SUMMARY.md # High-level summary
PHASE7_DETAILED_ANALYSIS.md # Deep dive (if needed)
perf_*.txt # Performance profiles
```
---
## Next Steps After Benchmark
### If Phase 7 Shows Strong Results (+30-50% overall)
1. ✅ Commit and document improvements
2. 🎯 Focus on remaining weak areas (Tiny allocations)
3. 📢 Prepare performance summary for stakeholders
### If Phase 7 Shows Modest Results (+10-20% overall)
1. 🔍 Identify specific bottlenecks (perf profiling)
2. 🧪 Test individual optimizations in isolation
3. 📊 Compare with Phase 6 to ensure no regressions
### If Phase 7 Shows Regressions (any area -10% or worse)
1. 🚨 Immediate investigation
2. 🔄 Bisect to find regression point
3. 🧪 Consider reverting HEADER_CLASSIDX if severe
---
## Quick Reference Commands
```bash
# Full suite (automated)
./scripts/bench_suite_matrix.sh
# Individual benchmarks (quick test)
./larson_hakmem 1 8 128 1024 1 12345 1
./bench_random_mixed_hakmem 400000 8192 1234567
./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_vm_mixed_hakmem 20000 256 424242
./bench_tiny_hot_hakmem 32 100 60000
# A/B tests (environment variable sweeps)
./scripts/bench_larson_1t_ab.sh
./scripts/bench_random_mixed_ab.sh
./scripts/bench_mid_large_mt_ab.sh
# Latest results
ls -td bench_results/suite/* | head -1
cat $(ls -td bench_results/suite/* | head -1)/results.csv
# Performance profiling
perf record -g --call-graph dwarf -- ./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_output.txt
```
---
## Key Success Metrics
### Primary Goal: Overall Improvement
- **Target**: +20-30% average throughput vs Phase 6
- **Minimum**: No regressions in mid-large (HAKMEM's strength)
### Secondary Goals:
1. **Stability**: 4T+ Larson runs without crashes
2. **Tiny improvement**: narrow the gap to -40% to -50% vs system (from -60%)
3. **Random mixed improvement**: narrow the gap to -10% to -20% vs system (from -30% or worse)
### Stretch Goals:
1. **Mid-large dominance**: Maintain +100% vs system
2. **Overall parity**: Match or beat system malloc on average
3. **Consistency**: No severe outliers (no single test <50% of system)
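The consistency goal can be checked mechanically from the pairwise output of section 4.2; this sketch assumes that output's `key: H vs S (+P%)` line format:

```shell
# Flag lines whose trailing "(+P%)" field is worse than -50% vs system.
flag_outliers() {
  awk '{ f = $NF; gsub(/[()%]/, "", f); if (f + 0 < -50) print "OUTLIER:", $0 }'
}

# Usage: pipe the section 4.2 comparison output through flag_outliers
```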
---
**Document Version**: 1.0
**Created**: 2025-11-08
**Author**: Claude (Task Agent)
**Status**: Ready for execution