Phase 6-A Benchmark Results

  • Date: 2025-11-29
  • Change: Disable SuperSlab lookup debug validation in RELEASE builds
  • File: core/tiny_region_id.h:199-239
  • Guard: #if !HAKMEM_BUILD_RELEASE around the hak_super_lookup() call
  • Reason: perf profiling showed a 15.84% CPU cost on the allocation hot path (debug-only validation)


Executive Summary

Phase 6-A implementation successfully removes debug validation overhead in release builds, but the measured performance impact is significantly smaller than predicted:

  • Expected: +12-15% (random_mixed), +8-10% (mid_mt_gap)
  • Actual (best 3 of 5): +1.67% (random_mixed), +1.33% (mid_mt_gap)
  • Actual (excluding warmup): +4.07% (random_mixed), +1.97% (mid_mt_gap)

Recommendation: HOLD on commit. Investigate discrepancy between perf analysis (15.84% CPU) and benchmark results (~1-4% improvement).


Benchmark Configuration

Build Configurations

Baseline (Before Phase 6-A)

make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)

Phase 6-A (After)

git stash pop  # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)

Benchmark Parameters

  • Iterations: 1,000,000 operations per run
  • Working Set: 256 blocks
  • Seed: 42 (reproducible)
  • Runs: 5 per configuration
  • Suppression: 2>/dev/null to exclude debug output noise

Raw Results

bench_random_mixed (Tiny workload, 16B-1KB)

Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)

Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s
Average: 52.29 M ops/s
Stdev: 1.86 M ops/s

Phase 6-A (Release build, SuperSlab lookup DISABLED)

Run 1: 39.11 M ops/s  ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s
Average: 51.04 M ops/s (all runs)
Stdev: 6.80 M ops/s (high due to outlier)
Average (excl. Run 1): 54.02 M ops/s

Outlier Analysis: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.


bench_mid_mt_gap (Mid MT workload, 1KB-8KB)

Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)

Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s
Average: 40.22 M ops/s
Stdev: 1.65 M ops/s

Phase 6-A (Release build, SuperSlab lookup DISABLED)

Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s
Average: 40.80 M ops/s
Stdev: 1.38 M ops/s

Variance Analysis: Both baseline and Phase 6-A show similar variance (~3-4 M ops/s spread), suggesting measurement noise is inherent to this benchmark.


Statistical Analysis

Comparison 1: All Runs (Conservative)

Benchmark      Baseline   Phase 6-A   Absolute   Relative   Expected   Result
random_mixed   52.29 M    51.04 M     -1.25 M    -2.39%     +12-15%    FAIL
mid_mt_gap     40.22 M    40.80 M     +0.59 M    +1.46%     +8-10%     FAIL

Comparison 2: Excluding First Run (Warmup Correction)

Benchmark      Baseline   Phase 6-A   Absolute   Relative   Expected   Result
random_mixed   51.91 M    54.02 M     +2.11 M    +4.07%     +12-15%    ⚠️ PARTIAL
mid_mt_gap     39.85 M    40.63 M     +0.78 M    +1.97%     +8-10%     FAIL

Comparison 3: Best 3 of 5 (Peak Performance)

Benchmark      Baseline   Phase 6-A   Absolute   Relative   Expected   Result
random_mixed   53.54 M    54.43 M     +0.89 M    +1.67%     +12-15%    FAIL
mid_mt_gap     41.06 M    41.60 M     +0.54 M    +1.33%     +8-10%     FAIL

Performance Summary

Overall Results (Best 3 of 5 method)

  • random_mixed: 53.54 → 54.43 M ops/s (+1.67%)
  • mid_mt_gap: 41.06 → 41.60 M ops/s (+1.33%)

vs Predictions

  • random_mixed: Expected +12-15%, Actual +1.67% → FAIL (roughly 7-9x smaller than expected)
  • mid_mt_gap: Expected +8-10%, Actual +1.33% → FAIL (roughly 6-7x smaller than expected)

Interpretation

Phase 6-A shows small improvements that sit within measurement noise under most comparison methods:

  • Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
  • Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
  • All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)

Depending on the comparison method, the improvements are roughly 3x to 9x smaller than the perf analysis predicted.


Root Cause Analysis

Why the Discrepancy?

The perf profile showed hak_super_lookup() consuming 15.84% of CPU time, yet removing it yields only ~1-4% improvement. Possible explanations:

1. Compiler Optimization (Most Likely)

The compiler may already be optimizing away the hak_super_lookup() call in release builds:

  • Dead code elimination: the result of hak_super_lookup() is only used for debug logging, so the call itself can be dropped
  • Inlining + constant propagation: with LTO, the compiler sees the result is unused
  • Evidence: the Phase 6-A guard has minimal impact, suggesting the code was already "free"

Action: Examine assembly output to verify if hak_super_lookup() is present in baseline build

2. Perf Sampling Bias

The perf profile may have been captured during a different workload phase:

  • Different allocation patterns (class distribution)
  • Different cache states (cold vs. hot)
  • Different thread counts (single vs. multi-threaded)

Action: Re-run perf on the exact benchmark workload to verify 15.84% claim

3. Measurement Noise

The benchmarks show high variance:

  • random_mixed: 1.86 M stdev (3.6% of mean)
  • mid_mt_gap: 1.65 M stdev (4.1% of mean)

The measured improvements (+1-4%) are within 1-2 standard deviations of noise.

Action: Run longer benchmarks (10M+ operations) to reduce noise

4. Lookup Already Cache-Friendly

The SuperSlab registry lookup may be highly cache-efficient in these workloads:

  • Small working set (256 blocks) fits in L1/L2 cache
  • Registry entries for active SuperSlabs are hot
  • Cost is much lower than perf's 15.84% suggests

Action: Benchmark with larger working sets (4KB+) to stress cache

5. Wrong Hot Path

The perf profile showed 15.84% CPU in hak_super_lookup(), but this may not be on the allocation hot path that these benchmarks exercise:

  • The call is in tiny_region_id_write_header() (allocation)
  • Benchmarks mix alloc+free, free path may dominate
  • Perf may have sampled during a malloc-heavy phase

Action: Isolate allocation-only benchmark (no frees) to verify


Recommendations

Immediate Actions

  1. HOLD on committing Phase 6-A until investigation completes

    • Current results don't justify the change
    • Risk: code churn without measurable benefit
  2. Verify Compiler Behavior

    # Generate assembly for the baseline build (-x c forces GCC to compile
    # the header as a C translation unit instead of precompiling it)
    gcc -S -x c -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h
    
    # Check if hak_super_lookup appears
    grep "hak_super_lookup" baseline.s
    
    # Caveat: static inline functions this TU never calls are not emitted,
    # so also inspect the assembly of a benchmark TU that includes the header
    # If absent: compiler already eliminated it (explains minimal improvement)
    # If present: something else is going on
    
  3. Re-run Perf on Benchmark Workload

    # Build baseline without Phase 6-A
    git stash
    make clean && make bench_random_mixed_hakmem
    
    # Profile the exact benchmark
    perf record -g ./bench_random_mixed_hakmem 10000000 256 42
    perf report --stdio | grep -A20 "hak_super_lookup"
    
    # Verify if 15.84% claim holds for this workload
    
  4. Longer Benchmark Runs

    # 100M operations to reduce noise
    for i in 1 2 3 4 5; do
        ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
    done
    

Long-Term Considerations

If investigation reveals:

Scenario A: Compiler Already Optimized

  • Decision: Commit Phase 6-A for code cleanliness (no harm, no foul)
  • Rationale: Explicitly documents debug-only code, prevents future confusion
  • Benefit: Future-proof if compiler behavior changes

Scenario B: Perf Was Wrong

  • Decision: Discard Phase 6-A, update perf methodology
  • Rationale: The 15.84% CPU claim was based on flawed profiling
  • Action: Document correct perf sampling procedure

Scenario C: Benchmark Doesn't Stress Hot Path

  • Decision: Commit Phase 6-A, improve benchmark coverage
  • Rationale: Real workloads may show the expected gains
  • Action: Add allocation-heavy benchmark (e.g., 90% malloc, 10% free)

Scenario D: Measurement Noise Dominates

  • Decision: Commit Phase 6-A if longer runs show >5% improvement
  • Rationale: Noise can hide real improvements
  • Action: Use mimalloc-bench suite for more stable measurements

Next Steps

Phase 6-B: Conditional Path Forward

Option 1: Investigate First (Recommended)

  1. Run assembly analysis (1 hour)
  2. Re-run perf on benchmark (2 hours)
  3. Run longer benchmarks (4 hours)
  4. Make data-driven decision

Option 2: Commit Anyway

  • Rationale: Code is cleaner, no measurable harm
  • Risk: Future confusion if optimization isn't actually needed

Option 3: Discard Phase 6-A

  • Rationale: No measurable benefit, not worth the churn
  • Risk: Miss real optimization if measurement was flawed

Appendix: Full Benchmark Output

Baseline - bench_random_mixed

=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput =  53806309 ops/s [iter=1000000 ws=256] time=0.019s
Run 2: Throughput =  53246568 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput =  53563123 ops/s [iter=1000000 ws=256] time=0.019s
Run 4: Throughput =  49409566 ops/s [iter=1000000 ws=256] time=0.020s
Run 5: Throughput =  51412515 ops/s [iter=1000000 ws=256] time=0.019s

Phase 6-A - bench_random_mixed

=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput =  39111201 ops/s [iter=1000000 ws=256] time=0.026s
Run 2: Throughput =  53296242 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput =  56279982 ops/s [iter=1000000 ws=256] time=0.018s
Run 4: Throughput =  52790754 ops/s [iter=1000000 ws=256] time=0.019s
Run 5: Throughput =  53715992 ops/s [iter=1000000 ws=256] time=0.019s

Baseline - bench_mid_mt_gap

=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.

Phase 6-A - bench_mid_mt_gap

=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.

Conclusion

Phase 6-A successfully implements the intended optimization (disabling SuperSlab lookup in release builds), but the measured performance impact (+1-4%) is 8-10x smaller than the expected +12-15% based on perf analysis.

Critical Question: Why does removing code that perf claims costs 15.84% CPU only yield 1-4% improvement?

Most Likely Answer: The compiler was already optimizing away the hak_super_lookup() call in release builds through dead code elimination, since its result is only used for debug assertions.

Recommended Action: Investigate before committing. If the compiler was already optimizing, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.