# HAKMEM Performance Regression Investigation Report
**Date**: 2025-11-22
**Investigation**: When did HAKMEM achieve 20M ops/s, and what caused the regression to 9M?
**Conclusion**: **NO REGRESSION OCCURRED** - The 20M+ claims were never measured.
---
## Executive Summary
**Key Finding**: HAKMEM **never actually achieved** 20M+ ops/s in Random Mixed 256B benchmarks. The documented claims of 22.6M (Phase 3d-B) and 25.1M (Phase 3d-C) ops/s were **mathematical projections** that were incorrectly recorded as measured results.
**True Performance Timeline**:
```
Phase 11 (2025-11-13): 9.38M ops/s ✅ VERIFIED (actual benchmark)
Phase 3d-B (2025-11-20): 22.6M ops/s ❌ NEVER MEASURED (expected value only)
Phase 3d-C (2025-11-20): 25.1M ops/s ❌ NEVER MEASURED (10K sanity test: 1.4M)
Phase 12-1.1 (2025-11-21): 11.5M ops/s ✅ VERIFIED (100K iterations)
Current (2025-11-22): 9.4M ops/s ✅ VERIFIED (10M iterations)
```
**Actual Performance Progression**: 9.38M → 11.5M → 9.4M (fluctuation within normal variance, not a true regression)
---
## Investigation Methodology
### 1. Git Log Analysis
Searched commit history for:
- Performance claims in commit messages (20M, 22M, 25M)
- Benchmark results in CLAUDE.md and CURRENT_TASK.md
- Documentation commits vs. actual code changes (sample queries are sketched after this list)
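The report does not record the exact queries used, but searches along these lines reproduce the evidence below (a sketch; the documentation file names are the ones referenced throughout this report):
```bash
# Find commits whose messages mention the disputed throughput figures
git log --all --oneline -i --grep='22\.6M' --grep='25\.1M' --grep='ops/s'

# Trace how the performance numbers recorded in CLAUDE.md and CURRENT_TASK.md evolved
git log -p -- CLAUDE.md CURRENT_TASK.md
```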
### 2. Critical Evidence
#### Evidence A: Phase 3d-C Implementation (commit 23c0d9541, 2025-11-20)
**Commit Message**:
```
Testing:
- Build: Success (LTO warnings are pre-existing)
- 10K ops sanity test: PASS (1.4M ops/s)
- Baseline established for Phase C-8 benchmark comparison
```
**Analysis**: Only a 10K sanity test was run (1.4M ops/s), NOT a full 100K+ benchmark.
#### Evidence B: Documentation Update (commit b3a156879, 6 minutes later)
**Commit Message**:
```
Update CLAUDE.md: Document Phase 3d series results
- Current Performance: 25.1M ops/s (Phase 3d-C, +168% vs Phase 11)
- Phase 3d-B: 22.6M ops/s
- Phase 3d-C: 25.1M ops/s (+11.1%)
```
**Analysis**:
- Zero code changes (only CLAUDE.md updated)
- No benchmark command or output provided
- Performance numbers appear to be **calculated projections**
#### Evidence C: Correction Commit (commit 53cbf33a3, 2025-11-22)
**Discovery**:
```
The Phase 3d-B (22.6M) and Phase 3d-C (25.1M) performance claims were
**never actually measured**. These were mathematical extrapolations of
"expected" improvements that were incorrectly documented as measured results.
Mathematical extrapolation without measurement:
Phase 11: 9.38M ops/s (verified)
Expected: +12-18% (Phase 3d-B), +8-12% (Phase 3d-C)
Calculation: 9.38M × 1.24 × 1.10 = 12.8M (expected)
Documented: 22.6M → 25.1M (inflated by stacking "expected" gains)
```
---
## The Highest Verified Performance: 11.5M ops/s
### Phase 12-1.1 (commit 6afaa5703, 2025-11-21)
**Implementation**:
- EMPTY Slab Detection + Immediate Reuse
- Shared Pool Stage 0.5 optimization
- ENV-controlled: `HAKMEM_SS_EMPTY_REUSE=1`
**Verified Benchmark Results**:
```bash
Benchmark: Random Mixed 256B (100K iterations)
OFF (default): 10.2M ops/s (baseline)
ON (ENV=1): 11.5M ops/s (+13.0% improvement) ✅
```
**Analysis**: This is the **highest verified performance** in the git history for Random Mixed 256B workload.
---
## Other High-Performance Claims (Verified)
### Phase 26 (commit 5b36c1c90, 2025-11-17) - 12.79M ops/s
**Implementation**: Front Gate Unification (3-layer overhead reduction)
**Verified Results**:
| Configuration | Run 1 | Run 2 | Run 3 | Average |
|---------------|-------|-------|-------|---------|
| Phase 26 OFF | 11.21M | 11.02M | 11.76M | 11.33M ops/s |
| Phase 26 ON | 13.21M | 12.55M | 12.62M | **12.79M ops/s** ✅ |
**Improvement**: +12.9% (actual measurement with 3 runs)
### Phase 19 & 20-1 (commit 982fbec65, 2025-11-16) - 16.2M ops/s
**Implementation**: Frontend optimization + TLS cache prewarm
**Verified Results**:
```
Phase 19 (HeapV2 only): 11.4M ops/s (+12.9%)
Phase 20-1 (Prewarm ON): 16.2M ops/s (+3.3% additional)
Total improvement: +16.2% vs original baseline
```
**Note**: This 16.2M is an **actual measurement**, but it was taken at 500K iterations (a different workload scale).
---
## Why 20M+ Was Never Achieved
### 1. Mathematical Inflation
**Phase 3d-B Calculation**:
```
Baseline: 9.38M ops/s (Phase 11)
Expected: +12-18% improvement
Math: 9.38M × 1.15 = 10.8M (realistic)
Documented: 22.6M (2.1x inflated!)
```
**Phase 3d-C Calculation**:
```
From Phase 3d-B: 22.6M (already inflated)
Expected: +8-12% improvement
Math: 22.6M × 1.10 = 24.9M
Documented: 25.1M (stacked inflation!)
```
### 2. No Full Benchmark Execution
Phase 3d-C commit log shows:
- 10K ops sanity test: 1.4M ops/s (not representative)
- No 100K+ full benchmark run
- "Baseline established" but never actually measured
### 3. Confusion Between Expected vs Measured
Documentation mixed:
- **Expected gains** (design projections: "+12-18%")
- **Measured results** (actual benchmarks)
- The expected gains were documented with checkmarks (✅) as if measured
---
## Current Performance Status (2025-11-22)
### Verified Measurement
```bash
Command: ./bench_random_mixed_hakmem 10000000 256 42
Benchmark: Random Mixed 256B, 10M iterations
HAKMEM: 9.4M ops/s ✅ VERIFIED
System malloc: 89.0M ops/s
Performance: 10.6% of system malloc (9.5x slower)
```
### Why 9.4M Instead of 11.5M?
**Possible Factors**:
1. **Different measurement scales**: 11.5M was 100K iterations, 9.4M is 10M iterations
2. **ENV configuration**: Phase 12-1.1's 11.5M required `HAKMEM_SS_EMPTY_REUSE=1` ENV flag
3. **Workload variance**: Random seed, allocation patterns affect results
4. **Bug fixes**: Recent C7 corruption fixes (2025-11-21 to 2025-11-22) may have added overhead
**Important**: The difference 11.5M → 9.4M is **NOT a regression from 20M+** because 20M+ never existed.
---
## Commit-by-Commit Performance History
| Commit | Date | Phase | Claimed Performance | Actual Measurement | Status |
|--------|------|-------|---------------------|-------------------|--------|
| 437df708e | 2025-11-13 | Phase 3c | 9.38M ops/s | ✅ 9.38M | Verified |
| 38552c3f3 | 2025-11-20 | Phase 3d-A | - | No benchmark | - |
| 9b0d74640 | 2025-11-20 | Phase 3d-B | 22.6M ops/s | ❌ No full benchmark | Unverified |
| 23c0d9541 | 2025-11-20 | Phase 3d-C | 25.1M ops/s | ❌ 1.4M (10K sanity only) | Unverified |
| b3a156879 | 2025-11-20 | Doc Update | 25.1M ops/s | ❌ Zero code changes | Unverified |
| 6afaa5703 | 2025-11-21 | Phase 12-1.1 | 11.5M ops/s | ✅ 11.5M (100K, ENV=1) | **Highest Verified** |
| 53cbf33a3 | 2025-11-22 | Correction | 9.4M ops/s | ✅ 9.4M (10M iterations) | Verified |
---
## Restoration Plan: How to Achieve 10-15M ops/s
### Option 1: Enable Phase 12-1.1 Optimization
```bash
export HAKMEM_SS_EMPTY_REUSE=1
export HAKMEM_SS_EMPTY_SCAN_LIMIT=16
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 11.5M ops/s (+22% vs current)
```
### Option 2: Stack Multiple Verified Optimizations
```bash
export HAKMEM_TINY_UNIFIED_CACHE=1 # Phase 23: Unified Cache
export HAKMEM_FRONT_GATE_UNIFIED=1 # Phase 26: Front Gate (+12.9%)
export HAKMEM_SS_EMPTY_REUSE=1 # Phase 12-1.1: Empty Reuse (+13%)
export HAKMEM_TINY_FRONT_DISABLE_ULTRAHOT=1 # Phase 19: Remove UltraHot (+12.9%)
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 12-15M ops/s (cumulative optimizations)
```
### Option 3: Research Phase 3d-B/C Implementations
**Goal**: Actually measure the TLS Cache Merge (Phase 3d-B) and Hot/Cold Split (Phase 3d-C) improvements
**Steps** (a command-level sketch follows this list):
1. Checkout commit `9b0d74640` (Phase 3d-B)
2. Run full benchmark (100K-10M iterations)
3. Measure actual improvement vs Phase 11 baseline
4. Repeat for commit `23c0d9541` (Phase 3d-C)
5. Document true measurements in CLAUDE.md
**Expected**: +10-18% improvement (if design hypothesis is correct)
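A minimal command sketch of these steps, assuming `build.sh` and the `bench_random_mixed_hakmem` target shown elsewhere in this report already exist at those commits:
```bash
# Phase 3d-B: build and benchmark at the implementation commit
git checkout 9b0d74640
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42   # 100K iterations, 256B, seed 42

# Phase 3d-C: repeat at the follow-up commit
git checkout 23c0d9541
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42

# Compare both results against the Phase 11 baseline (9.38M ops/s), then document them in CLAUDE.md
```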
---
## Lessons Learned
### 1. Always Run Actual Benchmarks
- **Never document performance numbers without running full benchmarks**
- Sanity tests (10K ops) are NOT representative
- Full benchmarks (100K-10M iterations) required for valid claims
### 2. Distinguish Expected vs Measured
- **Expected**: "+12-18% improvement" (design projection)
- **Measured**: "11.5M ops/s (+13.0%)" (actual benchmark result)
- Never use checkmarks (✅) for expected values
### 3. Save Benchmark Evidence
For each performance claim, document:
```bash
# Command
./bench_random_mixed_hakmem 100000 256 42
# Output
Throughput: 11.5M ops/s
Iterations: 100000
Seed: 42
ENV: HAKMEM_SS_EMPTY_REUSE=1
```
### 4. Multiple Runs for Variance
- Single run: Unreliable (variance ±5-10%)
- 3 runs: Minimum for claiming improvement
- 5+ runs: Best practice for publication (see the loop sketch below)
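A minimal sketch, reusing the benchmark invocation from this report; report the average of the runs, not the best one:
```bash
# Three back-to-back runs of the same configuration
for run in 1 2 3; do
  ./out/release/bench_random_mixed_hakmem 100000 256 42
done
```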
### 5. Version Control Documentation
- Git log should show: Code changes → Benchmark run → Documentation update
- Documentation-only commits that introduce new performance numbers (like b3a156879) are red flags; a quick check is sketched below
- Commits should be atomic: Implementation + Verification + Documentation
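One quick check for this pattern, shown here against the documentation-only commit identified above:
```bash
# A commit that introduces a performance claim should touch source files, not only docs
git show --stat b3a156879   # per this report: only CLAUDE.md changed, zero code changes
```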
---
## Conclusion
**Primary Question**: When did HAKMEM achieve 20M ops/s?
**Answer**: **Never**. The 20M+ claims (22.6M, 25.1M) were mathematical projections incorrectly documented as measurements.
**Secondary Question**: What caused the regression from 20M to 9M?
**Answer**: **No regression occurred**. Current performance (9.4M) is consistent with verified historical measurements.
**Highest Verified Performance**: 11.5M ops/s (Phase 12-1.1, ENV-gated, 100K iterations)
**Path Forward**:
1. Enable verified optimizations (Phase 12-1.1, Phase 23, Phase 26) → 12-15M expected
2. Measure Phase 3d-B/C implementations properly → +10-18% additional expected
3. Pursue Phase 20-2 BenchFast mode → Understand structural ceiling
**Recommendation**: Update CLAUDE.md to clearly mark all unverified claims and establish a benchmark verification protocol for future performance claims.
---
## Appendix: Complete Verified Performance Timeline
```
Date | Commit | Phase | Performance | Verification | Notes
-----------|-----------|------------|-------------|--------------|------------------
2025-11-13 | 437df708e | Phase 3c | 9.38M | ✅ Verified | Baseline
2025-11-16 | 982fbec65 | Phase 19 | 11.4M | ✅ Verified | HeapV2 only
2025-11-16 | 982fbec65 | Phase 20-1 | 16.2M | ✅ Verified | 500K iter (different scale)
2025-11-17 | 5b36c1c90 | Phase 26 | 12.79M | ✅ Verified | 3-run average
2025-11-20 | 23c0d9541 | Phase 3d-C | 25.1M | ❌ Unverified| 10K sanity only
2025-11-21 | 6afaa5703 | Phase 12 | 11.5M | ✅ Verified | ENV=1, 100K iter
2025-11-22 | 53cbf33a3 | Current | 9.4M | ✅ Verified | 10M iterations
```
**True Peak**: 16.2M ops/s (Phase 20-1, 500K iterations) or 12.79M ops/s (Phase 26, 100K iterations)
**Current Status**: 9.4M ops/s (10M iterations, most rigorous test)
The variation (9.4M to 16.2M) is primarily due to:
1. Iteration count (10M vs 500K vs 100K)
2. ENV configuration (optimizations enabled/disabled)
3. Measurement methodology (single run vs 3-run average)
**Recommendation**: Standardize benchmark protocol (100K iterations, 3 runs, specific ENV flags) for future comparisons.
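A candidate protocol script, sketched under the parameters recommended above (100K iterations, seed 42, three runs) and using the verified ENV flags listed in the Restoration Plan; treat the flag selection as an assumption to be pinned down per comparison:
```bash
#!/bin/sh
# Standardized Random Mixed 256B benchmark protocol (sketch)
export HAKMEM_SS_EMPTY_REUSE=1       # Phase 12-1.1 (verified +13%)
export HAKMEM_FRONT_GATE_UNIFIED=1   # Phase 26 (verified +12.9%)

ITER=100000; SIZE=256; SEED=42
echo "Commit: $(git rev-parse HEAD)"
for run in 1 2 3; do
  echo "Run $run:"
  ./out/release/bench_random_mixed_hakmem "$ITER" "$SIZE" "$SEED"
done
# Save the command, ENV flags, commit hash, and all three outputs alongside any claim
```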