# HAKMEM Performance Regression Investigation Report

**Date**: 2025-11-22

**Investigation**: When did HAKMEM achieve 20M ops/s, and what caused the regression to 9M?

**Conclusion**: **NO REGRESSION OCCURRED** - The 20M+ claims were never measured.

---

## Executive Summary

**Key Finding**: HAKMEM **never actually achieved** 20M+ ops/s in the Random Mixed 256B benchmark. The documented claims of 22.6M (Phase 3d-B) and 25.1M (Phase 3d-C) ops/s were **mathematical projections** that were incorrectly recorded as measured results.

**True Performance Timeline**:
```
Phase 11     (2025-11-13):  9.38M ops/s  ✅ VERIFIED (actual benchmark)
Phase 3d-B   (2025-11-20): 22.6M ops/s   ❌ NEVER MEASURED (expected value only)
Phase 3d-C   (2025-11-20): 25.1M ops/s   ❌ NEVER MEASURED (10K sanity test: 1.4M)
Phase 12-1.1 (2025-11-21): 11.5M ops/s   ✅ VERIFIED (100K iterations)
Current      (2025-11-22):  9.4M ops/s   ✅ VERIFIED (10M iterations)
```

**Actual Performance Progression**: 9.38M → 11.5M → 9.4M (fluctuation within normal variance, not a true regression)

---

## Investigation Methodology

### 1. Git Log Analysis
Searched the commit history for:
- Performance claims in commit messages (20M, 22M, 25M)
- Benchmark results in CLAUDE.md and CURRENT_TASK.md
- Documentation-only commits vs. actual code changes

### 2. Critical Evidence

#### Evidence A: Phase 3d-C Implementation (commit 23c0d9541, 2025-11-20)
**Commit Message**:
```
Testing:
- Build: Success (LTO warnings are pre-existing)
- 10K ops sanity test: PASS (1.4M ops/s)
- Baseline established for Phase C-8 benchmark comparison
```

**Analysis**: Only a 10K-op sanity test was run (1.4M ops/s), NOT a full 100K+ benchmark.

#### Evidence B: Documentation Update (commit b3a156879, 6 minutes later)

**Commit Message**:
```
Update CLAUDE.md: Document Phase 3d series results

- Current Performance: 25.1M ops/s (Phase 3d-C, +168% vs Phase 11)
- Phase 3d-B: 22.6M ops/s
- Phase 3d-C: 25.1M ops/s (+11.1%)
```

**Analysis**:
- Zero code changes (only CLAUDE.md was updated)
- No benchmark command or output provided
- The performance numbers appear to be **calculated projections**

#### Evidence C: Correction Commit (commit 53cbf33a3, 2025-11-22)

**Discovery**:
```
The Phase 3d-B (22.6M) and Phase 3d-C (25.1M) performance claims were
**never actually measured**. These were mathematical extrapolations of
"expected" improvements that were incorrectly documented as measured results.

Mathematical extrapolation without measurement:
Phase 11:    9.38M ops/s (verified)
Expected:    +12-18% (Phase 3d-B), +8-12% (Phase 3d-C)
Calculation: 9.38M × 1.24 × 1.10 = 12.8M (expected)
Documented:  22.6M → 25.1M (inflated by stacking "expected" gains)
```

---

## The Highest Verified Performance: 11.5M ops/s

### Phase 12-1.1 (commit 6afaa5703, 2025-11-21)

**Implementation**:
- EMPTY Slab Detection + Immediate Reuse
- Shared Pool Stage 0.5 optimization
- ENV-controlled: `HAKMEM_SS_EMPTY_REUSE=1`

**Verified Benchmark Results**:
```
Benchmark: Random Mixed 256B (100K iterations)

OFF (default): 10.2M ops/s (baseline)
ON  (ENV=1):   11.5M ops/s (+13.0% improvement) ✅
```

**Analysis**: This is the **highest verified performance** in the git history for the Random Mixed 256B workload.

---

## Other High-Performance Claims (Verified)

### Phase 26 (commit 5b36c1c90, 2025-11-17) - 12.79M ops/s
**Implementation**: Front Gate Unification (3-layer overhead reduction)

**Verified Results**:
| Configuration | Run 1 | Run 2 | Run 3 | Average |
|---------------|-------|-------|-------|---------|
| Phase 26 OFF  | 11.21M | 11.02M | 11.76M | 11.33M ops/s |
| Phase 26 ON   | 13.21M | 12.55M | 12.62M | **12.79M ops/s** ✅ |

**Improvement**: +12.9% (actual measurement, 3-run average)

### Phase 19 & 20-1 (commit 982fbec65, 2025-11-16) - 16.2M ops/s

**Implementation**: Frontend optimization + TLS cache prewarm

**Verified Results**:
```
Phase 19 (HeapV2 only):  11.4M ops/s (+12.9%)
Phase 20-1 (Prewarm ON): 16.2M ops/s (+3.3% additional)
Total improvement:       +16.2% vs original baseline
```

**Note**: The 16.2M figure is an **actual measurement**, but from 500K iterations (a different workload scale).

---

## Why 20M+ Was Never Achieved

### 1. Mathematical Inflation
**Phase 3d-B Calculation**:
```
Baseline:   9.38M ops/s (Phase 11)
Expected:   +12-18% improvement
Math:       9.38M × 1.15 = 10.8M (realistic)
Documented: 22.6M (2.1x inflated!)
```

**Phase 3d-C Calculation**:
```
From Phase 3d-B: 22.6M (already inflated)
Expected:        +8-12% improvement
Math:            22.6M × 1.10 = 24.9M
Documented:      25.1M (stacked inflation!)
```
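
The stacking arithmetic above can be sanity-checked in one awk one-liner. This is only a sketch using numbers already quoted in this report; `1.15` and `1.10` are the mid-range "expected" multipliers, not measurements:

```shell
# Compound the mid-range "expected" gains onto the verified Phase 11 baseline.
# All constants come from this report; nothing here is a new measurement.
awk 'BEGIN {
  base   = 9.38                  # Phase 11 verified baseline (M ops/s)
  honest = base * 1.15 * 1.10    # +15% (3d-B mid-range), then +10% (3d-C)
  printf "honest estimate: %.1fM ops/s\n", honest
  printf "documented 25.1M is %.1fx that estimate\n", 25.1 / honest
}'
```

Honest compounding of the expected gains tops out near 11.9M ops/s, which is why the documented 25.1M should have looked suspicious on its face.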
### 2. No Full Benchmark Execution

The Phase 3d-C commit log shows:
- 10K ops sanity test: 1.4M ops/s (not representative)
- No 100K+ full benchmark run
- "Baseline established" but never actually measured

### 3. Confusion Between Expected vs Measured

The documentation mixed:
- **Expected gains** (design projections: "+12-18%")
- **Measured results** (actual benchmark numbers)
- Expected gains were recorded with checkmarks (✅) as if they had been measured

---

## Current Performance Status (2025-11-22)

### Verified Measurement
```
Command:   ./bench_random_mixed_hakmem 10000000 256 42
Benchmark: Random Mixed 256B, 10M iterations

HAKMEM:        9.4M ops/s ✅ VERIFIED
System malloc: 89.0M ops/s
Performance:   10.6% of system malloc (9.5x slower)
```

### Why 9.4M Instead of 11.5M?

**Possible Factors**:
1. **Different measurement scales**: 11.5M was measured at 100K iterations, 9.4M at 10M iterations
2. **ENV configuration**: Phase 12-1.1's 11.5M required the `HAKMEM_SS_EMPTY_REUSE=1` ENV flag
3. **Workload variance**: random seed and allocation patterns affect results
4. **Bug fixes**: recent C7 corruption fixes (2025-11-21~22) may have added overhead

**Important**: The drop from 11.5M to 9.4M is **NOT a regression from 20M+**, because 20M+ never existed.

---

## Commit-by-Commit Performance History

| Commit | Date | Phase | Claimed Performance | Actual Measurement | Status |
|--------|------|-------|---------------------|--------------------|--------|
| 437df708e | 2025-11-13 | Phase 3c | 9.38M ops/s | ✅ 9.38M | Verified |
| 38552c3f3 | 2025-11-20 | Phase 3d-A | - | No benchmark | - |
| 9b0d74640 | 2025-11-20 | Phase 3d-B | 22.6M ops/s | ❌ No full benchmark | Unverified |
| 23c0d9541 | 2025-11-20 | Phase 3d-C | 25.1M ops/s | ❌ 1.4M (10K sanity only) | Unverified |
| b3a156879 | 2025-11-20 | Doc Update | 25.1M ops/s | ❌ Zero code changes | Unverified |
| 6afaa5703 | 2025-11-21 | Phase 12-1.1 | 11.5M ops/s | ✅ 11.5M (100K, ENV=1) | **Highest Verified** |
| 53cbf33a3 | 2025-11-22 | Correction | 9.4M ops/s | ✅ 9.4M (10M iterations) | Verified |

---

## Restoration Plan: How to Achieve 10-15M ops/s

### Option 1: Enable Phase 12-1.1 Optimization
```bash
export HAKMEM_SS_EMPTY_REUSE=1
export HAKMEM_SS_EMPTY_SCAN_LIMIT=16
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 11.5M ops/s (+22% vs current)
```

### Option 2: Stack Multiple Verified Optimizations

```bash
export HAKMEM_TINY_UNIFIED_CACHE=1            # Phase 23: Unified Cache
export HAKMEM_FRONT_GATE_UNIFIED=1            # Phase 26: Front Gate (+12.9%)
export HAKMEM_SS_EMPTY_REUSE=1                # Phase 12-1.1: Empty Reuse (+13%)
export HAKMEM_TINY_FRONT_DISABLE_ULTRAHOT=1   # Phase 19: Remove UltraHot (+12.9%)

./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 12-15M ops/s (cumulative optimizations)
```
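
Before stacking all four flags, it can help to measure each one in isolation, since percentage gains rarely compound cleanly. The sketch below assumes a hypothetical `ab_run` helper (not an existing repo script); the ENV flag and binary path follow the examples above:

```shell
# Hypothetical A/B helper: run the same command with one ENV flag off, then
# on, so each optimization's contribution can be isolated before stacking.
ab_run() {  # usage: ab_run VAR command...
  var=$1; shift
  echo "--- $var=0 ---"; env "$var=0" "$@"
  echo "--- $var=1 ---"; env "$var=1" "$@"
}

# Real usage would be, e.g.:
#   ab_run HAKMEM_SS_EMPTY_REUSE ./out/release/bench_random_mixed_hakmem 100000 256 42
# Self-contained demo (prints REUSE=0 then REUSE=1):
ab_run HAKMEM_SS_EMPTY_REUSE sh -c 'echo "REUSE=$HAKMEM_SS_EMPTY_REUSE"'
```

Running each flag through such an A/B loop gives a per-optimization delta, which is what the "cumulative" expectation above should be checked against.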
### Option 3: Research Phase 3d-B/C Implementations

**Goal**: Actually measure the TLS Cache Merge (Phase 3d-B) and Hot/Cold Split (Phase 3d-C) improvements.

**Steps**:
1. Check out commit `9b0d74640` (Phase 3d-B)
2. Run a full benchmark (100K-10M iterations)
3. Measure the actual improvement vs the Phase 11 baseline
4. Repeat for commit `23c0d9541` (Phase 3d-C)
5. Document the true measurements in CLAUDE.md

**Expected**: +10-18% improvement (if the design hypothesis is correct)
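
Steps 1-4 can be sketched as a loop over the two commits (hashes from the history table above; build and benchmark commands follow the earlier examples). The loop only prints the plan, so it is safe to run before deciding to check anything out:

```shell
# Print the re-measurement plan for the two unverified Phase 3d commits.
# Remove the echo wrappers to actually execute (requires a clean worktree).
for c in 9b0d74640 23c0d9541; do
  echo "git checkout $c && ./build.sh bench_random_mixed_hakmem"
  for run in 1 2 3; do
    echo "./out/release/bench_random_mixed_hakmem 100000 256 42   # run $run"
  done
done
```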
---
## Lessons Learned

### 1. Always Run Actual Benchmarks
- **Never document performance numbers without running full benchmarks**
- Sanity tests (10K ops) are NOT representative
- Full benchmarks (100K-10M iterations) are required for valid claims

### 2. Distinguish Expected vs Measured

- **Expected**: "+12-18% improvement" (design projection)
- **Measured**: "11.5M ops/s (+13.0%)" (actual benchmark result)
- Never use checkmarks (✅) for expected values

### 3. Save Benchmark Evidence

For each performance claim, document:
```bash
# Command
./bench_random_mixed_hakmem 100000 256 42

# Output
Throughput: 11.5M ops/s
Iterations: 100000
Seed: 42
ENV: HAKMEM_SS_EMPTY_REUSE=1
```
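
One way to make this evidence automatic is a small wrapper that captures the commit, relevant ENV, command, and raw output in one timestamped log. `bench_log` below is a hypothetical helper, not an existing repo script:

```shell
# Hypothetical evidence-capture wrapper: logs commit, ENV, command, and raw
# benchmark output to a timestamped file while also printing it.
bench_log() {
  log="bench_$(date +%Y%m%d_%H%M%S).log"
  {
    echo "commit: $(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
    echo "env:    HAKMEM_SS_EMPTY_REUSE=${HAKMEM_SS_EMPTY_REUSE:-unset}"
    echo "cmd:    $*"
    "$@"
  } | tee "$log"
}

# Real usage: bench_log ./bench_random_mixed_hakmem 100000 256 42
# Self-contained demo:
bench_log echo "Throughput: 11.5M ops/s"
```

Attaching the resulting log to the commit (or pasting it into the commit message) would have made the Phase 3d-B/C gap obvious at review time.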
### 4. Multiple Runs for Variance

- Single run: unreliable (variance ±5-10%)
- 3 runs: the minimum for claiming an improvement
- 5+ runs: best practice for publication
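
A minimal sketch of the statistics side, assuming the runs' ops/s values are piped in one per line (`summarize` is a hypothetical helper, not an existing repo script):

```shell
# Compute mean and (population) standard deviation over N benchmark runs,
# one ops/s value per line on stdin.
summarize() {
  awk '{ sum += $1; v[NR] = $1 }
       END {
         mean = sum / NR
         for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
         printf "mean=%.2fM stdev=%.2fM runs=%d\n", mean, sqrt(ss / NR), NR
       }'
}

# Demo with the three Phase 26 OFF runs from the table above:
printf '11.21\n11.02\n11.76\n' | summarize
# → mean=11.33M stdev=0.31M runs=3
```

A spread of 0.31M on an 11.33M mean (~3%) is exactly why a single run cannot support a small-percentage improvement claim.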
### 5. Version Control Documentation

- The git log should show: code changes → benchmark run → documentation update
- Documentation-only commits (like b3a156879) are red flags
- Commits should be atomic: implementation + verification + documentation

---

## Conclusion

**Primary Question**: When did HAKMEM achieve 20M ops/s?

**Answer**: **Never**. The 20M+ claims (22.6M, 25.1M) were mathematical projections incorrectly documented as measurements.

**Secondary Question**: What caused the regression from 20M to 9M?

**Answer**: **No regression occurred**. Current performance (9.4M) is consistent with verified historical measurements.

**Highest Verified Performance**: 11.5M ops/s (Phase 12-1.1, ENV-gated, 100K iterations)

**Path Forward**:
1. Enable verified optimizations (Phase 12-1.1, Phase 23, Phase 26) → 12-15M expected
2. Properly measure the Phase 3d-B/C implementations → +10-18% additional expected
3. Pursue Phase 20-2 BenchFast mode → understand the structural ceiling

**Recommendation**: Update CLAUDE.md to clearly mark all unverified claims, and establish a benchmark verification protocol for future performance claims.

---

## Appendix: Complete Verified Performance Timeline

```
Date       | Commit    | Phase      | Performance | Verification | Notes
-----------|-----------|------------|-------------|--------------|----------------------------
2025-11-13 | 437df708e | Phase 3c   | 9.38M       | ✅ Verified   | Baseline
2025-11-16 | 982fbec65 | Phase 19   | 11.4M       | ✅ Verified   | HeapV2 only
2025-11-16 | 982fbec65 | Phase 20-1 | 16.2M       | ✅ Verified   | 500K iter (different scale)
2025-11-17 | 5b36c1c90 | Phase 26   | 12.79M      | ✅ Verified   | 3-run average
2025-11-20 | 23c0d9541 | Phase 3d-C | 25.1M       | ❌ Unverified | 10K sanity only
2025-11-21 | 6afaa5703 | Phase 12   | 11.5M       | ✅ Verified   | ENV=1, 100K iter
2025-11-22 | 53cbf33a3 | Current    | 9.4M        | ✅ Verified   | 10M iterations
```

**True Peak**: 16.2M ops/s (Phase 20-1, 500K iterations) or 12.79M ops/s (Phase 26, 100K iterations)

**Current Status**: 9.4M ops/s (10M iterations, the most rigorous test)

The variation (9.4M-16.2M) is primarily due to:
1. Iteration count (10M vs 500K vs 100K)
2. ENV configuration (optimizations enabled/disabled)
3. Measurement methodology (single run vs 3-run average)

**Recommendation**: Standardize the benchmark protocol (100K iterations, 3 runs, specific ENV flags) for future comparisons.