Phase 6-A Benchmark Results

  • Date: 2025-11-29
  • Change: Disable SuperSlab lookup debug validation in RELEASE builds
  • File: core/tiny_region_id.h:199-239
  • Guard: #if !HAKMEM_BUILD_RELEASE around the hak_super_lookup() call
  • Reason: perf profiling showed a 15.84% CPU cost on the allocation hot path (debug-only validation)


Executive Summary

Phase 6-A implementation successfully removes debug validation overhead in release builds, but the measured performance impact is significantly smaller than predicted:

  • Expected: +12-15% (random_mixed), +8-10% (mid_mt_gap)
  • Actual (best 3 of 5): +1.67% (random_mixed), +1.33% (mid_mt_gap)
  • Actual (excluding warmup): +4.07% (random_mixed), +1.97% (mid_mt_gap)

Recommendation: HOLD on commit. Investigate discrepancy between perf analysis (15.84% CPU) and benchmark results (~1-4% improvement).


Benchmark Configuration

Build Configurations

Baseline (Before Phase 6-A)

make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)

Phase 6-A (After)

git stash pop  # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)

Benchmark Parameters

  • Iterations: 1,000,000 operations per run
  • Working Set: 256 blocks
  • Seed: 42 (reproducible)
  • Runs: 5 per configuration
  • Suppression: 2>/dev/null to exclude debug output noise

Raw Results

bench_random_mixed (Tiny workload, 16B-1KB)

Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)

Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s
Average: 52.29 M ops/s
Stdev: 1.86 M ops/s

Phase 6-A (Release build, SuperSlab lookup DISABLED)

Run 1: 39.11 M ops/s  ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s
Average: 51.04 M ops/s (all runs)
Stdev: 6.80 M ops/s (high due to outlier)
Average (excl. Run 1): 54.02 M ops/s

Outlier Analysis: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.


bench_mid_mt_gap (Mid MT workload, 1KB-8KB)

Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)

Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s
Average: 40.22 M ops/s
Stdev: 1.65 M ops/s

Phase 6-A (Release build, SuperSlab lookup DISABLED)

Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s
Average: 40.80 M ops/s
Stdev: 1.38 M ops/s

Variance Analysis: Both baseline and Phase 6-A show similar variance (~3-4 M ops/s spread), suggesting measurement noise is inherent to this benchmark.


Statistical Analysis

Comparison 1: All Runs (Conservative)

Benchmark      Baseline   Phase 6-A   Absolute   Relative   Expected   Result
random_mixed   52.29 M    51.04 M     -1.25 M    -2.39%     +12-15%    FAIL
mid_mt_gap     40.22 M    40.80 M     +0.59 M    +1.46%     +8-10%     FAIL

Comparison 2: Excluding First Run (Warmup Correction)

Benchmark      Baseline   Phase 6-A   Absolute   Relative   Expected   Result
random_mixed   51.91 M    54.02 M     +2.11 M    +4.07%     +12-15%    ⚠️ PARTIAL
mid_mt_gap     39.85 M    40.63 M     +0.78 M    +1.97%     +8-10%     FAIL

Comparison 3: Best 3 of 5 (Peak Performance)

Benchmark      Baseline   Phase 6-A   Absolute   Relative   Expected   Result
random_mixed   53.54 M    54.43 M     +0.89 M    +1.67%     +12-15%    FAIL
mid_mt_gap     41.06 M    41.60 M     +0.54 M    +1.33%     +8-10%     FAIL

Performance Summary

Overall Results (Best 3 of 5 method)

  • random_mixed: 53.54 → 54.43 M ops/s (+1.67%)
  • mid_mt_gap: 41.06 → 41.60 M ops/s (+1.33%)

vs Predictions

  • random_mixed: Expected +12-15%, Actual +1.67% → FAIL (roughly 7-9x smaller than expected)
  • mid_mt_gap: Expected +8-10%, Actual +1.33% → FAIL (roughly 6-7x smaller than expected)

Interpretation

Phase 6-A shows small improvements that sit within measurement noise under most comparison methods:

  • Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
  • Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
  • All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)

Depending on the comparison method, the improvements are roughly 3x to 9x smaller than the perf analysis predicted.


Root Cause Analysis

Why the Discrepancy?

The perf profile showed hak_super_lookup() consuming 15.84% of CPU time, yet removing it yields only ~1-4% improvement. Possible explanations:

1. Compiler Optimization (Most Likely)

The compiler may already be optimizing away the hak_super_lookup() call in release builds:

  • Dead code elimination: the result of hak_super_lookup() is only used for debug logging, so the call itself can be dropped
  • Inlining + constant propagation: with LTO, the compiler sees the result is unused
  • Evidence: the Phase 6-A guard has minimal impact, suggesting the code was already "free"

Action: Examine assembly output to verify if hak_super_lookup() is present in baseline build

2. Perf Sampling Bias

The perf profile may have been captured during a different workload phase:

  • Different allocation patterns (class distribution)
  • Different cache states (cold vs. hot)
  • Different thread counts (single vs. multi-threaded)

Action: Re-run perf on the exact benchmark workload to verify 15.84% claim

3. Measurement Noise

The benchmarks show high variance:

  • random_mixed: 1.86 M stdev (3.6% of mean)
  • mid_mt_gap: 1.65 M stdev (4.1% of mean)

The measured improvements (+1-4%) are within 1-2 standard deviations of noise.

Action: Run longer benchmarks (10M+ operations) to reduce noise

4. Lookup Already Cache-Friendly

The SuperSlab registry lookup may be highly cache-efficient in these workloads:

  • Small working set (256 blocks) fits in L1/L2 cache
  • Registry entries for active SuperSlabs are hot
  • Cost is much lower than perf's 15.84% suggests

Action: Benchmark with larger working sets (4KB+) to stress cache

5. Wrong Hot Path

The perf profile showed 15.84% CPU in hak_super_lookup(), but this may not be on the allocation hot path that these benchmarks exercise:

  • The call is in tiny_region_id_write_header() (allocation)
  • Benchmarks mix alloc+free, free path may dominate
  • Perf may have sampled during a malloc-heavy phase

Action: Isolate allocation-only benchmark (no frees) to verify


Recommendations

Immediate Actions

  1. HOLD on committing Phase 6-A until investigation completes

    • Current results don't justify the change
    • Risk: code churn without measurable benefit
  2. Verify Compiler Behavior

    # Generate assembly for the baseline build (-x c forces GCC to compile
    # the header as a C translation unit instead of precompiling it)
    gcc -S -x c -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h
    
    # Check if hak_super_lookup appears
    grep "hak_super_lookup" baseline.s
    
    # Caveat: static inline functions this TU never calls are not emitted,
    # so also inspect the assembly of a benchmark TU that includes the header
    # If absent: compiler already eliminated it (explains minimal improvement)
    # If present: something else is going on
    
  3. Re-run Perf on Benchmark Workload

    # Build baseline without Phase 6-A
    git stash
    make clean && make bench_random_mixed_hakmem
    
    # Profile the exact benchmark
    perf record -g ./bench_random_mixed_hakmem 10000000 256 42
    perf report --stdio | grep -A20 "hak_super_lookup"
    
    # Verify if 15.84% claim holds for this workload
    
  4. Longer Benchmark Runs

    # 100M operations to reduce noise
    for i in 1 2 3 4 5; do
        ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
    done
    

Long-Term Considerations

If investigation reveals:

Scenario A: Compiler Already Optimized

  • Decision: Commit Phase 6-A for code cleanliness (no harm, no foul)
  • Rationale: Explicitly documents debug-only code, prevents future confusion
  • Benefit: Future-proof if compiler behavior changes

Scenario B: Perf Was Wrong

  • Decision: Discard Phase 6-A, update perf methodology
  • Rationale: The 15.84% CPU claim was based on flawed profiling
  • Action: Document correct perf sampling procedure

Scenario C: Benchmark Doesn't Stress Hot Path

  • Decision: Commit Phase 6-A, improve benchmark coverage
  • Rationale: Real workloads may show the expected gains
  • Action: Add allocation-heavy benchmark (e.g., 90% malloc, 10% free)

Scenario D: Measurement Noise Dominates

  • Decision: Commit Phase 6-A if longer runs show >5% improvement
  • Rationale: Noise can hide real improvements
  • Action: Use mimalloc-bench suite for more stable measurements

Next Steps

Phase 6-B: Conditional Path Forward

Option 1: Investigate First (Recommended)

  1. Run assembly analysis (1 hour)
  2. Re-run perf on benchmark (2 hours)
  3. Run longer benchmarks (4 hours)
  4. Make data-driven decision

Option 2: Commit Anyway

  • Rationale: Code is cleaner, no measurable harm
  • Risk: Future confusion if optimization isn't actually needed

Option 3: Discard Phase 6-A

  • Rationale: No measurable benefit, not worth the churn
  • Risk: Miss real optimization if measurement was flawed

Appendix: Full Benchmark Output

Baseline - bench_random_mixed

=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput =  53806309 ops/s [iter=1000000 ws=256] time=0.019s
Run 2: Throughput =  53246568 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput =  53563123 ops/s [iter=1000000 ws=256] time=0.019s
Run 4: Throughput =  49409566 ops/s [iter=1000000 ws=256] time=0.020s
Run 5: Throughput =  51412515 ops/s [iter=1000000 ws=256] time=0.019s

Phase 6-A - bench_random_mixed

=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput =  39111201 ops/s [iter=1000000 ws=256] time=0.026s
Run 2: Throughput =  53296242 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput =  56279982 ops/s [iter=1000000 ws=256] time=0.018s
Run 4: Throughput =  52790754 ops/s [iter=1000000 ws=256] time=0.019s
Run 5: Throughput =  53715992 ops/s [iter=1000000 ws=256] time=0.019s

Baseline - bench_mid_mt_gap

=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.

Phase 6-A - bench_mid_mt_gap

=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.

Conclusion

Phase 6-A successfully implements the intended optimization (disabling SuperSlab lookup in release builds), but the measured performance impact (+1-4%) is 8-10x smaller than the expected +12-15% based on perf analysis.

Critical Question: Why does removing code that perf claims costs 15.84% CPU only yield 1-4% improvement?

Most Likely Answer: The compiler was already optimizing away the hak_super_lookup() call in release builds through dead code elimination, since its result is only used for debug assertions.

Recommended Action: Investigate before committing. If the compiler was already optimizing, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.