Phase 7 Full Benchmark Suite Execution Plan
Date: 2025-11-08
Phase: 7-1.3 (HEADER_CLASSIDX=1 optimization)
Current Status: Partial results available (Larson 1T: 2.63M ops/s, bench_random_mixed 128B: 17.7M ops/s)
Goal: Comprehensive performance evaluation across ALL benchmark patterns
Executive Summary
Available Benchmarks (5 categories)
- Larson - Multi-threaded stress test (8-128B, mimalloc-bench derived)
- Random Mixed - Single-threaded random allocation (16-8192B)
- Mid-Large MT - Multi-threaded mid-size (8-32KB)
- VM Mixed - Large allocations (512KB-2MB, L2.5/L2 test)
- Tiny Hot - Hot path micro-benchmark (8-64B, LIFO)
Current Build Status (Phase 7 = HEADER_CLASSIDX=1)
All benchmarks were built with HEADER_CLASSIDX=1 on 2025-11-07/08:
- ✅ larson_hakmem (2025-11-08 11:48)
- ✅ bench_random_mixed_hakmem (2025-11-08 11:48)
- ✅ bench_mid_large_mt_hakmem (2025-11-07 18:42)
- ✅ bench_tiny_hot_hakmem (2025-11-07 18:03)
- ✅ bench_vm_mixed_hakmem (2025-11-07 18:03)
Note: the Makefile has HAKMEM_TINY_HEADER_CLASSIDX=1 permanently enabled (lines 99-100).
Execution Plan
Phase 1: Verify Build Status (5 minutes)
Verify HEADER_CLASSIDX=1 is enabled:
# Check Makefile flag
grep "HAKMEM_TINY_HEADER_CLASSIDX" Makefile
# Verify all binaries are up-to-date
make -n bench_random_mixed_hakmem bench_tiny_hot_hakmem \
bench_mid_large_mt_hakmem bench_vm_mixed_hakmem \
larson_hakmem
If rebuild needed:
# Clean rebuild with HEADER_CLASSIDX=1 (already default)
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system bench_random_mixed_mi \
bench_tiny_hot_hakmem bench_tiny_hot_system bench_tiny_hot_mi \
bench_mid_large_mt_hakmem bench_mid_large_mt_system bench_mid_large_mt_mi \
bench_vm_mixed_hakmem bench_vm_mixed_system \
larson_hakmem larson_system larson_mi
Time: ~3-5 minutes (if rebuild needed)
Phase 2: Quick Sanity Test (2 minutes)
Test each benchmark runs successfully:
# Larson (1T, 1 second)
./larson_hakmem 1 8 128 1024 1 12345 1
# Random Mixed (small run)
./bench_random_mixed_hakmem 1000 128 1234567
# Mid-Large MT (2 threads, small)
./bench_mid_large_mt_hakmem 2 1000 2048 42
# VM Mixed (small)
./bench_vm_mixed_hakmem 100 256 424242
# Tiny Hot (small)
./bench_tiny_hot_hakmem 32 10 1000
Expected: All benchmarks run without SEGV/crashes.
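The sanity pass above can be wrapped in a small helper so a non-zero exit is reported instead of silently ignored. A minimal sketch; the benchmark binaries and their arguments are taken from the commands above and are assumptions about the local build:

```shell
# Report each command's exit status; FAIL lines indicate a crashed benchmark.
run_check() {
  "$@" >/dev/null 2>&1
  status=$?
  if [ "$status" -ne 0 ]; then
    echo "FAIL ($status): $*"
  else
    echo "OK: $*"
  fi
  return "$status"
}

# Stand-in demonstration; replace `true` with the sanity commands,
# e.g. run_check ./larson_hakmem 1 8 128 1024 1 12345 1
run_check true
```
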
Phase 3: Full Benchmark Suite Execution
Option A: Automated Suite Runner (RECOMMENDED) ⭐
Use existing bench_suite_matrix.sh:
# This runs ALL benchmarks (random_mixed, mid_large_mt, vm_mixed, tiny_hot)
# across system/mimalloc/HAKMEM variants
./scripts/bench_suite_matrix.sh
Output:
- CSV: bench_results/suite/<timestamp>/results.csv
- Raw logs: bench_results/suite/<timestamp>/raw/*.out
Time: ~15-20 minutes
Coverage:
- Random Mixed: 2 cycles × 2 ws × 3 variants = 12 runs
- Mid-Large MT: 2 threads × 3 variants = 6 runs
- VM Mixed: 2 cycles × 2 variants = 4 runs (system + hakmem only)
- Tiny Hot: 2 sizes × 3 variants = 6 runs
Total: 28 benchmark runs
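The run count can be double-checked with shell arithmetic matching the per-category products above:

```shell
# 12 (random_mixed) + 6 (mid_large_mt) + 4 (vm_mixed) + 6 (tiny_hot) = 28
echo $((2*2*3 + 2*3 + 2*2 + 2*3))
```
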
Option B: Individual Benchmark Scripts (Detailed Analysis)
If you need more control or want to run A/B tests with environment variables:
3.1 Larson Benchmark (Multi-threaded Stress)
Basic run (1T, 4T, 8T):
# 1 thread, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 1
# 4 threads, 10 seconds (CRITICAL: test multi-thread stability)
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 4
# 8 threads, 10 seconds
HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 8
A/B test with environment variables:
# Use automated script (includes PGO)
./scripts/bench_larson_1t_ab.sh
Output: bench_results/larson_ab/<timestamp>/results.csv
Time: ~20-30 minutes (includes PGO build)
Key Metrics:
- Throughput (ops/s)
- Stability (4T should not crash - see Phase 6-2.3 active counter fix)
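The 1T/4T/8T runs above can be expressed as a loop over the thread count (last argument). A sketch only; the argument meanings are assumptions based on the mimalloc-bench larson convention:

```shell
# Print the sweep commands rather than executing them, so the sketch is
# runnable without the larson_hakmem binary present.
for t in 1 4 8; do
  echo "would run: HAKMEM_WRAP_TINY=1 ./larson_hakmem 10 8 128 1024 1 12345 $t"
done
```
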
3.2 Random Mixed (Single-threaded, Mixed Sizes)
Basic run:
# 400K cycles, 8192B working set
HAKMEM_WRAP_TINY=1 ./bench_random_mixed_hakmem 400000 8192 1234567
./bench_random_mixed_system 400000 8192 1234567
./bench_random_mixed_mi 400000 8192 1234567
A/B test with environment variables:
# Runs 5 repetitions, median calculation
./scripts/bench_random_mixed_ab.sh
Output: bench_results/random_mixed_ab/<timestamp>/results.csv
Time: ~15-20 minutes (5 reps × multiple configs)
Key Metrics:
- Throughput (ops/s) across different working set sizes
- SPECIALIZE_MASK impact (0 vs 0x0F)
- FAST_CAP impact (8 vs 16 vs 32)
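The SPECIALIZE_MASK and FAST_CAP sweep can be skeletonized as nested loops. The exact environment variable spellings are not given here and are hypothetical; check scripts/bench_random_mixed_ab.sh for the real names:

```shell
# Enumerate the 2x3 configuration grid named in the key metrics above.
for mask in 0 0x0F; do
  for cap in 8 16 32; do
    echo "config: mask=$mask cap=$cap"
  done
done
```
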
3.3 Mid-Large MT (Multi-threaded, 8-32KB)
Basic run:
# 4 threads, 40K cycles, 2KB working set
HAKMEM_WRAP_TINY=1 ./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_mid_large_mt_system 4 40000 2048 42
./bench_mid_large_mt_mi 4 40000 2048 42
A/B test:
./scripts/bench_mid_large_mt_ab.sh
Output: bench_results/mid_large_mt_ab/<timestamp>/results.csv
Time: ~10-15 minutes
Key Metrics:
- Multi-threaded performance (2T vs 4T)
- HAKMEM's SuperSlab efficiency (expected: strong performance here)
Note: previous results showed a HAKMEM weakness here (suite/20251107: 2.1M vs system 8.7M ops/s). This is unexpected given the Mid-Large benchmark's +108% win on 2025-11-02, so investigate whether it is a regression or simply a different test pattern.
3.4 VM Mixed (Large Allocations, 512KB-2MB)
Basic run:
# 20K cycles, 256 working set
HAKMEM_BIGCACHE_L25=1 HAKMEM_WRAP_TINY=1 ./bench_vm_mixed_hakmem 20000 256 424242
./bench_vm_mixed_system 20000 256 424242
Time: ~5 minutes
Key Metrics:
- L2.5 cache effectiveness (BIGCACHE_L25=1 vs 0)
- Large allocation performance
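The BIGCACHE_L25=1 vs 0 comparison is a simple on/off A/B; a sketch using the command above (printed rather than executed so it runs without the binary):

```shell
# Enumerate both L2.5 cache settings for the same workload and seed.
for l25 in 0 1; do
  echo "would run: HAKMEM_BIGCACHE_L25=$l25 HAKMEM_WRAP_TINY=1 ./bench_vm_mixed_hakmem 20000 256 424242"
done
```
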
3.5 Tiny Hot (Hot Path Micro-benchmark)
Basic run:
# 32B, 100 batch, 60K cycles
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 32 100 60000
./bench_tiny_hot_system 32 100 60000
./bench_tiny_hot_mi 32 100 60000
# 64B
HAKMEM_WRAP_TINY=1 ./bench_tiny_hot_hakmem 64 100 60000
./bench_tiny_hot_system 64 100 60000
./bench_tiny_hot_mi 64 100 60000
Time: ~5 minutes
Key Metrics:
- Hot path efficiency (direct TLS cache access)
- Expected weakness (Phase 6 analysis: -60% vs system)
Phase 4: Analysis and Comparison
4.1 Extract Results from Suite Run
# Get latest suite results
latest=$(ls -td bench_results/suite/* | head -1)
cat ${latest}/results.csv
# Quick comparison. Note: `system` is a reserved awk function name, so the
# sums use hsum/ssum/msum, with per-variant run counters for correct averages.
awk -F, 'NR>1 {
  if ($2=="hakmem") { hsum[$1]+=$4; hc[$1]++ }
  if ($2=="system") { ssum[$1]+=$4; sc[$1]++ }
  if ($2=="mi")     { msum[$1]+=$4; mc[$1]++ }
} END {
  for (b in hsum) {
    h=hsum[b]/hc[b]
    s=ssum[b]/sc[b]
    m=(b in mc) ? msum[b]/mc[b] : 0   # VM Mixed has no mi variant
    printf "%s: HAKMEM=%.2fM system=%.2fM mi=%.2fM (vs_sys=%+.1f%%, vs_mi=%+.1f%%)\n",
      b, h/1e6, s/1e6, m/1e6, (h/s-1)*100, (m ? (h/m-1)*100 : 0)
  }
}' ${latest}/results.csv
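The comparison logic can be smoke-tested on a fabricated CSV before pointing it at real results. The column layout (benchmark,variant,param,ops) is an assumption inferred from the awk fields above; confirm it against an actual results.csv header:

```shell
# Two fabricated rows: HAKMEM at 17.7M ops/s vs system at 22M ops/s.
awk -F, 'NR>1 {
  if ($2=="hakmem") h[$1]+=$4
  if ($2=="system") s[$1]+=$4
} END {
  for (b in h) printf "%s: %+.1f%% vs system\n", b, (h[b]/s[b]-1)*100
}' << 'EOF'
bench,variant,param,ops
random_mixed,hakmem,ws8192,17700000
random_mixed,system,ws8192,22000000
EOF
```

This prints a single relative-throughput line (17.7/22 - 1 = -19.5%), mirroring the vs_sys column of the full comparison.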
4.2 Key Comparisons
Phase 7 vs System malloc:
# Extract HAKMEM vs system for each benchmark
awk -F, 'NR>1 && ($2=="hakmem" || $2=="system") {
key=$1 "," $3
if ($2=="hakmem") h[key]=$4
if ($2=="system") s[key]=$4
} END {
for (k in h) {
if (s[k]) {
pct = (h[k]/s[k] - 1) * 100
printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, s[k]/1e6, pct
}
}
}' ${latest}/results.csv | sort
Phase 7 vs mimalloc:
# Similar for mimalloc comparison
awk -F, 'NR>1 && ($2=="hakmem" || $2=="mi") {
key=$1 "," $3
if ($2=="hakmem") h[key]=$4
if ($2=="mi") m[key]=$4
} END {
for (k in h) {
if (m[k]) {
pct = (h[k]/m[k] - 1) * 100
printf "%s: %.2fM vs %.2fM (%+.1f%%)\n", k, h[k]/1e6, m[k]/1e6, pct
}
}
}' ${latest}/results.csv | sort
4.3 Generate Summary Report
# Create comprehensive summary (delimiter left unquoted so $(date),
# $(basename ...) and ${latest} expand when the template is written)
cat > PHASE7_RESULTS_SUMMARY.md << REPORT
# Phase 7 Benchmark Results Summary
## Test Configuration
- Phase: 7-1.3 (HEADER_CLASSIDX=1)
- Date: $(date +%Y-%m-%d)
- Suite: $(basename ${latest})
## Overall Results
### Random Mixed (16-8192B, single-threaded)
[Insert results here]
### Mid-Large MT (8-32KB, multi-threaded)
[Insert results here]
### VM Mixed (512KB-2MB, large allocations)
[Insert results here]
### Tiny Hot (8-64B, hot path micro)
[Insert results here]
### Larson (8-128B, multi-threaded stress)
[Insert results here]
## Analysis
### Strengths
[Areas where HAKMEM outperforms]
### Weaknesses
[Areas where HAKMEM underperforms]
### Comparison with Previous Phases
[Phase 6 vs Phase 7 delta]
## Bottleneck Identification
[Performance profiling with perf]
REPORT
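Heredoc quoting decides whether $(...) and ${...} inside the report template expand: an unquoted delimiter expands them, a quoted one emits them literally. A quick demonstration:

```shell
# Unquoted delimiter: the command substitution runs.
cat << EOF
expanded: $(printf demo)
EOF
# Quoted delimiter: the text is written verbatim.
cat << 'EOF'
literal: $(printf demo)
EOF
```
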
Phase 5: Performance Profiling (Optional, if bottlenecks found)
Profile hot paths with perf:
# Profile random_mixed (if slow)
perf record -g --call-graph dwarf -- \
./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_random_mixed_phase7.txt
# Profile larson 1T
perf record -g --call-graph dwarf -- \
./larson_hakmem 10 8 128 1024 1 12345 1
perf report --stdio > perf_larson_1t_phase7.txt
Compare with Phase 6:
# If you have Phase 6 binaries saved, run side-by-side
# and compare perf reports
Expected Results & Analysis Strategy
Baseline Expectations (from Phase 6 analysis)
Strong Areas (Expected +50% to +171% vs System)
- Mid-Large (8-32KB): HAKMEM's SuperSlab should dominate
  - Expected: +100% to +150% vs system
  - Phase 7 improvement target: Maintain or improve
- Large Allocations (VM Mixed): L2.5 layer efficiency
  - Expected: Competitive or slight win vs system
Weak Areas (Expected -50% to -70% vs System)
- Tiny (≤128B): Structural weakness identified in Phase 6
  - Expected: -40% to -60% vs system
  - Phase 7 HEADER_CLASSIDX may help: +10-20% improvement
- Random Mixed: Magazine layer overhead
  - Expected: -20% to -50% vs system
  - Phase 7 target: Reduce gap
- Larson Multi-thread: Contention issues
  - Expected: Variable (1T: ok, 4T+: risk of crashes)
  - Phase 7 critical: Verify 4T stability (active counter fix)
What to Look For
Phase 7 Improvements (HEADER_CLASSIDX=1)
- Tiny allocations: +10-30% improvement (fewer header loads)
- Random mixed: +15-25% improvement (class_idx in header)
- Cache efficiency: Better locality (1-byte header vs 2-byte)
Red Flags
- Mid-Large regression: Should NOT regress (HEADER_CLASSIDX doesn't affect mid-large path)
- 4T+ crashes in Larson: Active counter bug should be fixed (Phase 6-2.3)
- Severe regression (>20%): Investigate immediately
Bottleneck Identification
If Phase 7 results are disappointing:
- Run perf on slow benchmarks
- Compare with Phase 6 perf profiles (if available)
- Check hot paths:
  - tiny_alloc_fast() - should be 3-4 instructions
  - tiny_free_fast() - should be a fast header check
  - superslab_refill() - should use the P0 ctz optimization
Time Estimates
Minimal Run (Option A: Suite Script Only)
- Build verification: 2 min
- Sanity test: 2 min
- Suite execution: 15-20 min
- Quick analysis: 5 min
- Total: ~25-30 minutes
Comprehensive Run (Option B: All Individual Scripts)
- Build verification: 2 min
- Sanity test: 2 min
- Larson A/B: 25 min
- Random Mixed A/B: 20 min
- Mid-Large MT A/B: 15 min
- VM Mixed: 5 min
- Tiny Hot: 5 min
- Analysis & report: 15 min
- Total: ~90 minutes (1.5 hours)
With Performance Profiling
- Add: ~20-30 min per benchmark
- Total: ~2-3 hours
Recommended Execution Order
Quick Assessment (30 minutes)
- ✅ Verify build status
- ✅ Run suite script (bench_suite_matrix.sh)
- ✅ Generate quick comparison
- 🔍 Identify major wins/losses
- 📝 Decide if deep dive needed
Deep Analysis (if needed, +60 minutes)
- 🔬 Run individual A/B scripts for problem areas
- 📊 Profile with perf
- 📝 Compare with Phase 6 baseline
- 💡 Generate actionable insights
Output Organization
bench_results/
├── suite/
│ └── <timestamp>/
│ ├── results.csv # All benchmarks, all variants
│ └── raw/*.out # Raw logs
├── random_mixed_ab/
│ └── <timestamp>/
│ ├── results.csv # A/B test results
│ └── raw/*.txt # Per-run data
├── larson_ab/
│ └── <timestamp>/
│ ├── results.csv
│ └── raw/*.out
├── mid_large_mt_ab/
│ └── <timestamp>/
│ ├── results.csv
│ └── raw/*.out
└── ...
# Analysis reports
PHASE7_RESULTS_SUMMARY.md # High-level summary
PHASE7_DETAILED_ANALYSIS.md # Deep dive (if needed)
perf_*.txt # Performance profiles
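The "newest <timestamp> directory" pattern used throughout this layout can be demonstrated on a throwaway tree (mtimes set explicitly so the `-t` sort is deterministic):

```shell
# Build a fake bench_results/suite tree, then pick the most recent run dir.
tmp=$(mktemp -d)
mkdir -p "$tmp/suite/20251107_1200" "$tmp/suite/20251108_0900"
touch -t 202511071200 "$tmp/suite/20251107_1200"
touch -t 202511080900 "$tmp/suite/20251108_0900"
latest=$(ls -td "$tmp"/suite/*/ | head -1)
echo "latest suite: $(basename "$latest")"
rm -rf "$tmp"
```
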
Next Steps After Benchmark
If Phase 7 Shows Strong Results (+30-50% overall)
- ✅ Commit and document improvements
- 🎯 Focus on remaining weak areas (Tiny allocations)
- 📢 Prepare performance summary for stakeholders
If Phase 7 Shows Modest Results (+10-20% overall)
- 🔍 Identify specific bottlenecks (perf profiling)
- 🧪 Test individual optimizations in isolation
- 📊 Compare with Phase 6 to ensure no regressions
If Phase 7 Shows Regressions (any area -10% or worse)
- 🚨 Immediate investigation
- 🔄 Bisect to find regression point
- 🧪 Consider reverting HEADER_CLASSIDX if severe
Quick Reference Commands
# Full suite (automated)
./scripts/bench_suite_matrix.sh
# Individual benchmarks (quick test)
./larson_hakmem 1 8 128 1024 1 12345 1
./bench_random_mixed_hakmem 400000 8192 1234567
./bench_mid_large_mt_hakmem 4 40000 2048 42
./bench_vm_mixed_hakmem 20000 256 424242
./bench_tiny_hot_hakmem 32 100 60000
# A/B tests (environment variable sweeps)
./scripts/bench_larson_1t_ab.sh
./scripts/bench_random_mixed_ab.sh
./scripts/bench_mid_large_mt_ab.sh
# Latest results
ls -td bench_results/suite/* | head -1
cat $(ls -td bench_results/suite/* | head -1)/results.csv
# Performance profiling
perf record -g --call-graph dwarf -- ./bench_random_mixed_hakmem 400000 8192 1234567
perf report --stdio > perf_output.txt
Key Success Metrics
Primary Goal: Overall Improvement
- Target: +20-30% average throughput vs Phase 6
- Minimum: No regressions in mid-large (HAKMEM's strength)
Secondary Goals:
- Stability: 4T+ Larson runs without crashes
- Tiny improvement: -40% to -50% vs system (from -60%)
- Random mixed improvement: -10% to -20% vs system (from -30%+)
Stretch Goals:
- Mid-large dominance: Maintain +100% vs system
- Overall parity: Match or beat system malloc on average
- Consistency: No severe outliers (no single test <50% of system)
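The consistency goal (no single test below 50% of system) can be checked mechanically against the results CSV. A sketch; the column layout (benchmark,variant,param,ops) is an assumption and the sample rows below are fabricated:

```shell
# Flag any benchmark where HAKMEM falls below half of system throughput.
awk -F, 'NR>1 {
  if ($2=="hakmem") h[$1]=$4
  if ($2=="system") s[$1]=$4
} END {
  for (b in h) if (s[b] && h[b] < 0.5*s[b])
    printf "OUTLIER %s: %.0f%% of system\n", b, h[b]/s[b]*100
}' << 'EOF'
bench,variant,param,ops
tiny_hot,hakmem,32,4000000
tiny_hot,system,32,10000000
larson,hakmem,1,9000000
larson,system,1,10000000
EOF
```

Only tiny_hot is flagged here (40% of system); larson at 90% passes the threshold.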
Document Version: 1.0
Created: 2025-11-08
Author: Claude (Task Agent)
Status: Ready for execution