File: hakmem/docs/analysis/PERFORMANCE_REGRESSION_INVESTIGATION_REPORT.md
Commit 67fb15f35f by Moe Charm (CI): Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf compiled in)
After:  49.1M ops/s (within normal run-to-run variance; fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

# HAKMEM Performance Regression Investigation Report
**Date**: 2025-11-22
**Investigation**: When did HAKMEM achieve 20M ops/s, and what caused the regression to 9M?
**Conclusion**: **NO REGRESSION OCCURRED** - The 20M+ claims were never measured.
---
## Executive Summary
**Key Finding**: HAKMEM **never actually achieved** 20M+ ops/s in Random Mixed 256B benchmarks. The documented claims of 22.6M (Phase 3d-B) and 25.1M (Phase 3d-C) ops/s were **mathematical projections** that were incorrectly recorded as measured results.
**True Performance Timeline**:
```
Phase 11 (2025-11-13): 9.38M ops/s ✅ VERIFIED (actual benchmark)
Phase 3d-B (2025-11-20): 22.6M ops/s ❌ NEVER MEASURED (expected value only)
Phase 3d-C (2025-11-20): 25.1M ops/s ❌ NEVER MEASURED (10K sanity test: 1.4M)
Phase 12-1.1 (2025-11-21): 11.5M ops/s ✅ VERIFIED (100K iterations)
Current (2025-11-22): 9.4M ops/s ✅ VERIFIED (10M iterations)
```
**Actual Performance Progression**: 9.38M → 11.5M → 9.4M (fluctuation within normal variance, not a true regression)
---
## Investigation Methodology
### 1. Git Log Analysis
Searched commit history for:
- Performance claims in commit messages (20M, 22M, 25M)
- Benchmark results in CLAUDE.md and CURRENT_TASK.md
- Documentation commits vs. actual code changes
### 2. Critical Evidence
#### Evidence A: Phase 3d-C Implementation (commit 23c0d9541, 2025-11-20)
**Commit Message**:
```
Testing:
- Build: Success (LTO warnings are pre-existing)
- 10K ops sanity test: PASS (1.4M ops/s)
- Baseline established for Phase C-8 benchmark comparison
```
**Analysis**: Only a 10K sanity test was run (1.4M ops/s), NOT a full 100K+ benchmark.
#### Evidence B: Documentation Update (commit b3a156879, 6 minutes later)
**Commit Message**:
```
Update CLAUDE.md: Document Phase 3d series results
- Current Performance: 25.1M ops/s (Phase 3d-C, +168% vs Phase 11)
- Phase 3d-B: 22.6M ops/s
- Phase 3d-C: 25.1M ops/s (+11.1%)
```
**Analysis**:
- Zero code changes (only CLAUDE.md updated)
- No benchmark command or output provided
- Performance numbers appear to be **calculated projections**
#### Evidence C: Correction Commit (commit 53cbf33a3, 2025-11-22)
**Discovery**:
```
The Phase 3d-B (22.6M) and Phase 3d-C (25.1M) performance claims were
**never actually measured**. These were mathematical extrapolations of
"expected" improvements that were incorrectly documented as measured results.
Mathematical extrapolation without measurement:
Phase 11: 9.38M ops/s (verified)
Expected: +12-18% (Phase 3d-B), +8-12% (Phase 3d-C)
Calculation: 9.38M × 1.24 × 1.10 = 12.8M (expected)
Documented: 22.6M → 25.1M (inflated by stacking "expected" gains)
```
---
## The Highest Verified Performance: 11.5M ops/s
### Phase 12-1.1 (commit 6afaa5703, 2025-11-21)
**Implementation**:
- EMPTY Slab Detection + Immediate Reuse
- Shared Pool Stage 0.5 optimization
- ENV-controlled: `HAKMEM_SS_EMPTY_REUSE=1`
**Verified Benchmark Results**:
```bash
Benchmark: Random Mixed 256B (100K iterations)
OFF (default): 10.2M ops/s (baseline)
ON (ENV=1): 11.5M ops/s (+13.0% improvement)
```
**Analysis**: This is the **highest verified performance** in the git history for Random Mixed 256B workload.
---
## Other High-Performance Claims (Verified)
### Phase 26 (commit 5b36c1c90, 2025-11-17) - 12.79M ops/s
**Implementation**: Front Gate Unification (3-layer overhead reduction)
**Verified Results**:
| Configuration | Run 1 | Run 2 | Run 3 | Average |
|---------------|-------|-------|-------|---------|
| Phase 26 OFF | 11.21M | 11.02M | 11.76M | 11.33M ops/s |
| Phase 26 ON | 13.21M | 12.55M | 12.62M | **12.79M ops/s** ✅ |
**Improvement**: +12.9% (actual measurement with 3 runs)
### Phase 19 & 20-1 (commit 982fbec65, 2025-11-16) - 16.2M ops/s
**Implementation**: Frontend optimization + TLS cache prewarm
**Verified Results**:
```
Phase 19 (HeapV2 only): 11.4M ops/s (+12.9%)
Phase 20-1 (Prewarm ON): 16.2M ops/s (+3.3% additional)
Total improvement: +16.2% vs original baseline
```
**Note**: This 16.2M is **actual measurement** but from 500K iterations (different workload scale).
---
## Why 20M+ Was Never Achieved
### 1. Mathematical Inflation
**Phase 3d-B Calculation**:
```
Baseline: 9.38M ops/s (Phase 11)
Expected: +12-18% improvement
Math: 9.38M × 1.15 = 10.8M (realistic)
Documented: 22.6M (2.1x inflated!)
```
**Phase 3d-C Calculation**:
```
From Phase 3d-B: 22.6M (already inflated)
Expected: +8-12% improvement
Math: 22.6M × 1.10 = 24.9M
Documented: 25.1M (stacked inflation!)
```
### 2. No Full Benchmark Execution
Phase 3d-C commit log shows:
- 10K ops sanity test: 1.4M ops/s (not representative)
- No 100K+ full benchmark run
- "Baseline established" but never actually measured
### 3. Confusion Between Expected vs Measured
Documentation mixed:
- **Expected gains** (design projections: "+12-18%")
- **Measured results** (actual benchmarks)
- The expected gains were documented with checkmarks (✅) as if measured
---
## Current Performance Status (2025-11-22)
### Verified Measurement
```bash
Command: ./bench_random_mixed_hakmem 10000000 256 42
Benchmark: Random Mixed 256B, 10M iterations
HAKMEM: 9.4M ops/s ✅ VERIFIED
System malloc: 89.0M ops/s
Performance: 10.6% of system malloc (9.5x slower)
```
### Why 9.4M Instead of 11.5M?
**Possible Factors**:
1. **Different measurement scales**: 11.5M was 100K iterations, 9.4M is 10M iterations
2. **ENV configuration**: Phase 12-1.1's 11.5M required `HAKMEM_SS_EMPTY_REUSE=1` ENV flag
3. **Workload variance**: Random seed, allocation patterns affect results
4. **Bug fixes**: Recent C7 corruption fixes (2025-11-21~22) may have added overhead
**Important**: The drop from 11.5M to 9.4M is **NOT a regression from 20M+**, because 20M+ was never achieved.
---
## Commit-by-Commit Performance History
| Commit | Date | Phase | Claimed Performance | Actual Measurement | Status |
|--------|------|-------|---------------------|-------------------|--------|
| 437df708e | 2025-11-13 | Phase 3c | 9.38M ops/s | ✅ 9.38M | Verified |
| 38552c3f3 | 2025-11-20 | Phase 3d-A | - | No benchmark | - |
| 9b0d74640 | 2025-11-20 | Phase 3d-B | 22.6M ops/s | ❌ No full benchmark | Unverified |
| 23c0d9541 | 2025-11-20 | Phase 3d-C | 25.1M ops/s | ❌ 1.4M (10K sanity only) | Unverified |
| b3a156879 | 2025-11-20 | Doc Update | 25.1M ops/s | ❌ Zero code changes | Unverified |
| 6afaa5703 | 2025-11-21 | Phase 12-1.1 | 11.5M ops/s | ✅ 11.5M (100K, ENV=1) | **Highest Verified** |
| 53cbf33a3 | 2025-11-22 | Correction | 9.4M ops/s | ✅ 9.4M (10M iterations) | Verified |
---
## Restoration Plan: How to Achieve 10-15M ops/s
### Option 1: Enable Phase 12-1.1 Optimization
```bash
export HAKMEM_SS_EMPTY_REUSE=1
export HAKMEM_SS_EMPTY_SCAN_LIMIT=16
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 11.5M ops/s (+22% vs current)
```
### Option 2: Stack Multiple Verified Optimizations
```bash
export HAKMEM_TINY_UNIFIED_CACHE=1 # Phase 23: Unified Cache
export HAKMEM_FRONT_GATE_UNIFIED=1 # Phase 26: Front Gate (+12.9%)
export HAKMEM_SS_EMPTY_REUSE=1 # Phase 12-1.1: Empty Reuse (+13%)
export HAKMEM_TINY_FRONT_DISABLE_ULTRAHOT=1 # Phase 19: Remove UltraHot (+12.9%)
./out/release/bench_random_mixed_hakmem 100000 256 42
# Expected: 12-15M ops/s (cumulative optimizations)
```
### Option 3: Research Phase 3d-B/C Implementations
**Goal**: Actually measure the TLS Cache Merge (Phase 3d-B) and Hot/Cold Split (Phase 3d-C) improvements
**Steps**:
1. Checkout commit `9b0d74640` (Phase 3d-B)
2. Run full benchmark (100K-10M iterations)
3. Measure actual improvement vs Phase 11 baseline
4. Repeat for commit `23c0d9541` (Phase 3d-C)
5. Document true measurements in CLAUDE.md
**Expected**: +10-18% improvement (if design hypothesis is correct)
---
## Lessons Learned
### 1. Always Run Actual Benchmarks
- **Never document performance numbers without running full benchmarks**
- Sanity tests (10K ops) are NOT representative
- Full benchmarks (100K-10M iterations) required for valid claims
### 2. Distinguish Expected vs Measured
- **Expected**: "+12-18% improvement" (design projection)
- **Measured**: "11.5M ops/s (+13.0%)" (actual benchmark result)
- Never use checkmarks (✅) for expected values
### 3. Save Benchmark Evidence
For each performance claim, document:
```bash
# Command
./bench_random_mixed_hakmem 100000 256 42
# Output
Throughput: 11.5M ops/s
Iterations: 100000
Seed: 42
ENV: HAKMEM_SS_EMPTY_REUSE=1
```
### 4. Multiple Runs for Variance
- Single run: Unreliable (variance ±5-10%)
- 3 runs: Minimum for claiming improvement
- 5+ runs: Best practice for publication
### 5. Version Control Documentation
- Git log should show: Code changes → Benchmark run → Documentation update
- Documentation-only commits (like b3a156879) are red flags
- Commits should be atomic: Implementation + Verification + Documentation
---
## Conclusion
**Primary Question**: When did HAKMEM achieve 20M ops/s?
**Answer**: **Never**. The 20M+ claims (22.6M, 25.1M) were mathematical projections incorrectly documented as measurements.
**Secondary Question**: What caused the regression from 20M to 9M?
**Answer**: **No regression occurred**. Current performance (9.4M) is consistent with verified historical measurements.
**Highest Verified Performance**: 11.5M ops/s (Phase 12-1.1, ENV-gated, 100K iterations)
**Path Forward**:
1. Enable verified optimizations (Phase 12-1.1, Phase 23, Phase 26) → 12-15M expected
2. Measure Phase 3d-B/C implementations properly → +10-18% additional expected
3. Pursue Phase 20-2 BenchFast mode → Understand structural ceiling
**Recommendation**: Update CLAUDE.md to clearly mark all unverified claims and establish a benchmark verification protocol for future performance claims.
---
## Appendix: Complete Verified Performance Timeline
```
Date | Commit | Phase | Performance | Verification | Notes
-----------|-----------|------------|-------------|--------------|------------------
2025-11-13 | 437df708e | Phase 3c | 9.38M | ✅ Verified | Baseline
2025-11-16 | 982fbec65 | Phase 19 | 11.4M | ✅ Verified | HeapV2 only
2025-11-16 | 982fbec65 | Phase 20-1 | 16.2M | ✅ Verified | 500K iter (different scale)
2025-11-17 | 5b36c1c90 | Phase 26 | 12.79M | ✅ Verified | 3-run average
2025-11-20 | 23c0d9541 | Phase 3d-C | 25.1M | ❌ Unverified| 10K sanity only
2025-11-21 | 6afaa5703 | Phase 12 | 11.5M | ✅ Verified | ENV=1, 100K iter
2025-11-22 | 53cbf33a3 | Current | 9.4M | ✅ Verified | 10M iterations
```
**True Peak**: 16.2M ops/s (Phase 20-1, 500K iterations) or 12.79M ops/s (Phase 26, 100K iterations)
**Current Status**: 9.4M ops/s (10M iterations, most rigorous test)
The variation (9.4M to 16.2M) is primarily due to:
1. Iteration count (10M vs 500K vs 100K)
2. ENV configuration (optimizations enabled/disabled)
3. Measurement methodology (single run vs 3-run average)
**Recommendation**: Standardize benchmark protocol (100K iterations, 3 runs, specific ENV flags) for future comparisons.