Phase 6-A Benchmark Results
Date: 2025-11-29
Change: Disable SuperSlab lookup debug validation in RELEASE builds
File: core/tiny_region_id.h:199-239
Guard: #if !HAKMEM_BUILD_RELEASE around hak_super_lookup() call
Reason: perf profiling showed 15.84% CPU cost on allocation hot path (debug-only validation)
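For context, here is a minimal sketch of the guarded call, assuming a simplified signature for hak_super_lookup() and an abbreviated tiny_region_id_write_header(); the real code in core/tiny_region_id.h differs.

```c
#include <assert.h>

/* Stand-in declaration; the real hak_super_lookup() lives in the hakmem
 * headers and returns the owning SuperSlab (if any) for an address. */
void *hak_super_lookup(const void *addr);

static inline void tiny_region_id_write_header(void *block)
{
#if !HAKMEM_BUILD_RELEASE
    /* Debug-only validation: confirm the block is backed by a registered
     * SuperSlab. With -DHAKMEM_BUILD_RELEASE=1 this branch compiles out,
     * which is the entire content of the Phase 6-A change. */
    assert(hak_super_lookup(block) != NULL);
#endif
    /* ... the actual header write continues here ... */
    (void)block;
}
```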
Executive Summary
Phase 6-A implementation successfully removes debug validation overhead in release builds, but the measured performance impact is significantly smaller than predicted:
- Expected: +12-15% (random_mixed), +8-10% (mid_mt_gap)
- Actual (best 3 of 5): +1.67% (random_mixed), +1.33% (mid_mt_gap)
- Actual (excluding warmup): +4.07% (random_mixed), +1.97% (mid_mt_gap)
Recommendation: HOLD on commit. Investigate discrepancy between perf analysis (15.84% CPU) and benchmark results (~1-4% improvement).
Benchmark Configuration
Build Configurations
Baseline (Before Phase 6-A)
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup ALWAYS enabled (no guard in code yet)
Phase 6-A (After)
git stash pop # Restore Phase 6-A changes
make clean
make EXTRA_CFLAGS="-g -O3" bench_random_mixed_hakmem bench_mid_mt_gap_hakmem
# Note: Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default
# Result: SuperSlab lookup DISABLED (guarded by #if !HAKMEM_BUILD_RELEASE)
Benchmark Parameters
- Iterations: 1,000,000 operations per run
- Working Set: 256 blocks
- Seed: 42 (reproducible)
- Runs: 5 per configuration
- Suppression: 2>/dev/null to exclude debug output noise (a rough sketch of the measured loop follows this list)
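These parameters correspond roughly to a loop of the following shape. This is a hedged reconstruction for orientation only, not the actual bench_random_mixed source; the size range, RNG, and slot-replacement policy are assumptions.

```c
#include <stdlib.h>

#define ITERS 1000000   /* Iterations per run                 */
#define WS    256       /* Working set: number of live blocks */
#define SEED  42        /* Fixed seed for reproducibility     */

int main(void)
{
    void *slots[WS] = {0};
    srand(SEED);
    for (long i = 0; i < ITERS; i++) {
        int idx = rand() % WS;                     /* pick a random slot              */
        size_t sz = 16 + (size_t)(rand() % 1008);  /* assumed 16B-1KB size range      */
        free(slots[idx]);                          /* free whatever occupied the slot */
        slots[idx] = malloc(sz);                   /* ...and replace it               */
    }
    for (int i = 0; i < WS; i++)
        free(slots[i]);
    return 0;
}
```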
Raw Results
bench_random_mixed (Tiny workload, 16B-1KB)
Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
Run 1: 53.81 M ops/s
Run 2: 53.25 M ops/s
Run 3: 53.56 M ops/s
Run 4: 49.41 M ops/s
Run 5: 51.41 M ops/s
Average: 52.29 M ops/s
Stdev: 1.86 M ops/s
Phase 6-A (Release build, SuperSlab lookup DISABLED)
Run 1: 39.11 M ops/s ⚠️ OUTLIER (warmup)
Run 2: 53.30 M ops/s
Run 3: 56.28 M ops/s
Run 4: 52.79 M ops/s
Run 5: 53.72 M ops/s
Average: 51.04 M ops/s (all runs)
Stdev: 6.80 M ops/s (high due to outlier)
Average (excl. Run 1): 54.02 M ops/s
Outlier Analysis: Run 1 is 27.6% slower than the average of runs 2-5, indicating a warmup/cache-cold issue.
bench_mid_mt_gap (Mid MT workload, 1KB-8KB)
Baseline (Before Phase 6-A, SuperSlab lookup ALWAYS enabled)
Run 1: 41.70 M ops/s
Run 2: 37.39 M ops/s
Run 3: 40.91 M ops/s
Run 4: 40.53 M ops/s
Run 5: 40.56 M ops/s
Average: 40.22 M ops/s
Stdev: 1.65 M ops/s
Phase 6-A (Release build, SuperSlab lookup DISABLED)
Run 1: 41.49 M ops/s
Run 2: 41.81 M ops/s
Run 3: 41.51 M ops/s
Run 4: 38.43 M ops/s
Run 5: 40.78 M ops/s
Average: 40.80 M ops/s
Stdev: 1.38 M ops/s
Variance Analysis: Both baseline and Phase 6-A show similar variance (~3-4 M ops/s spread), suggesting measurement noise is inherent to this benchmark.
Statistical Analysis
Comparison 1: All Runs (Conservative)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|---|---|---|---|---|---|---|
| random_mixed | 52.29 M | 51.04 M | -1.25 M | -2.39% | +12-15% | ❌ FAIL |
| mid_mt_gap | 40.22 M | 40.80 M | +0.59 M | +1.46% | +8-10% | ❌ FAIL |
Comparison 2: Excluding First Run (Warmup Correction)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|---|---|---|---|---|---|---|
| random_mixed | 51.91 M | 54.02 M | +2.11 M | +4.07% | +12-15% | ⚠️ PARTIAL |
| mid_mt_gap | 39.85 M | 40.63 M | +0.78 M | +1.97% | +8-10% | ❌ FAIL |
Comparison 3: Best 3 of 5 (Peak Performance)
| Benchmark | Baseline | Phase 6-A | Absolute | Relative | Expected | Result |
|---|---|---|---|---|---|---|
| random_mixed | 53.54 M | 54.43 M | +0.89 M | +1.67% | +12-15% | ❌ FAIL |
| mid_mt_gap | 41.06 M | 41.60 M | +0.54 M | +1.33% | +8-10% | ❌ FAIL |
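The aggregates in the three comparisons can be reproduced from the raw per-run numbers in the appendix using a mean and a sample standard deviation (n-1 denominator). A minimal helper, with the random_mixed baseline runs hard-coded as an example:

```c
#include <math.h>
#include <stdio.h>

/* Raw random_mixed baseline runs (M ops/s), copied from the appendix. */
static const double runs[5] = {53.81, 53.25, 53.56, 49.41, 51.41};

static double mean(const double *v, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += v[i];
    return s / n;
}

static double sample_stdev(const double *v, int n)
{
    double m = mean(v, n), s = 0.0;
    for (int i = 0; i < n; i++) s += (v[i] - m) * (v[i] - m);
    return sqrt(s / (n - 1));
}

int main(void)
{
    printf("all runs:    mean=%.2f stdev=%.2f\n", mean(runs, 5), sample_stdev(runs, 5));
    printf("excl. run 1: mean=%.2f\n", mean(runs + 1, 4));
    /* "Best 3 of 5" averages the three largest runs (here 53.81, 53.56, 53.25 -> 53.54). */
    return 0;
}
```

Compile with -lm; the output matches the 52.29 / 1.86 / 51.91 figures reported above.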
Performance Summary
Overall Results (Best 3 of 5 method)
- random_mixed: 53.54 → 54.43 M ops/s (+1.67%)
- mid_mt_gap: 41.06 → 41.60 M ops/s (+1.33%)
vs Predictions
- random_mixed: Expected +12-15%, Actual +1.67% → FAIL (roughly 7-9x smaller than expected)
- mid_mt_gap: Expected +8-10%, Actual +1.33% → FAIL (6-7x smaller than expected)
Interpretation
Phase 6-A shows statistically measurable but practically negligible performance improvements:
- Excluding warmup: +4.07% (random_mixed), +1.97% (mid_mt_gap)
- Best 3 of 5: +1.67% (random_mixed), +1.33% (mid_mt_gap)
- All runs: -2.39% (random_mixed), +1.46% (mid_mt_gap)
The improvements are roughly 7-9x smaller than the perf analysis predicted.
Root Cause Analysis
Why the Discrepancy?
The perf profile showed hak_super_lookup() consuming 15.84% of CPU time, yet removing it yields only ~1-4% improvement. Possible explanations:
1. Compiler Optimization (Most Likely)
The compiler may already be optimizing away the hak_super_lookup() call in release builds:
- Dead Store Elimination: The result of hak_super_lookup() is only used for debug logging
- Inlining + Constant Propagation: With LTO, the compiler sees the result is unused
- Evidence: The Phase 6-A guard has minimal impact, suggesting the code was already "free"
Action: Examine the assembly output to verify whether hak_super_lookup() is present in the baseline build (a minimal illustration of this effect follows this list)
2. Perf Sampling Bias
The perf profile may have been captured during a different workload phase:
- Different allocation patterns (class distribution)
- Different cache states (cold vs. hot)
- Different thread counts (single vs. multi-threaded)
Action: Re-run perf on the exact benchmark workload to verify the 15.84% claim
3. Measurement Noise
The benchmarks show high variance:
- random_mixed: 1.86 M stdev (3.6% of mean)
- mid_mt_gap: 1.65 M stdev (4.1% of mean)
The measured improvements (+1-4%) are within 1-2 standard deviations of noise.
Action: Run longer benchmarks (10M+ operations) to reduce noise
4. Lookup Already Cache-Friendly
The SuperSlab registry lookup may be highly cache-efficient in these workloads:
- Small working set (256 blocks) fits in L1/L2 cache
- Registry entries for active SuperSlabs are hot
- Cost is much lower than perf's 15.84% suggests
Action: Benchmark with larger working sets (4KB+) to stress cache
5. Wrong Hot Path
The perf profile showed 15.84% CPU in hak_super_lookup(), but this may not be on the allocation hot path that these benchmarks exercise:
- The call is in tiny_region_id_write_header() (allocation)
- Benchmarks mix alloc+free, so the free path may dominate
- Perf may have sampled during a malloc-heavy phase
Action: Isolate allocation-only benchmark (no frees) to verify
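To illustrate explanation #1: when the lookup is visible to the optimizer (inlined from a header or via LTO) and its result only feeds debug checks, a release build can already delete the call. The stand-alone analogue below is not the hakmem code; the names are illustrative. Compiled with -O2 -DNDEBUG, the lookup typically disappears from the assembly.

```c
#include <assert.h>

/* Stand-in for hak_super_lookup(): side-effect free and visible to the
 * inliner, so the optimizer may drop it when its result goes unused. */
static inline const void *lookup(const void *p)
{
    return p;  /* pretend it found the owning region */
}

void write_header(void *block)
{
    const void *ss = lookup(block);  /* with -DNDEBUG, assert() expands to ((void)0), */
    assert(ss != NULL);              /* so 'ss' becomes a dead value and the inlined  */
    (void)ss;                        /* call is eliminated as dead code               */
    /* ... real header-writing work would follow here ... */
}
```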
Recommendations
Immediate Actions
- HOLD on committing Phase 6-A until investigation completes
  - Current results don't justify the change
  - Risk: code churn without measurable benefit
- Verify Compiler Behavior
  # Generate assembly for baseline build
  gcc -S -DHAKMEM_BUILD_RELEASE=1 -O3 -o baseline.s core/tiny_region_id.h
  # Check if hak_super_lookup appears
  grep "hak_super_lookup" baseline.s
  # If absent: compiler already eliminated it (explains minimal improvement)
  # If present: something else is going on
- Re-run Perf on Benchmark Workload
  # Build baseline without Phase 6-A
  git stash
  make clean && make bench_random_mixed_hakmem
  # Profile the exact benchmark
  perf record -g ./bench_random_mixed_hakmem 10000000 256 42
  perf report --stdio | grep -A20 "hak_super_lookup"
  # Verify if the 15.84% claim holds for this workload
- Longer Benchmark Runs
  # 100M operations to reduce noise
  for i in 1 2 3 4 5; do
    ./bench_random_mixed_hakmem 100000000 256 42 2>/dev/null
  done
Long-Term Considerations
If investigation reveals:
Scenario A: Compiler Already Optimized
- Decision: Commit Phase 6-A for code cleanliness (no harm, no foul)
- Rationale: Explicitly documents debug-only code, prevents future confusion
- Benefit: Future-proof if compiler behavior changes
Scenario B: Perf Was Wrong
- Decision: Discard Phase 6-A, update perf methodology
- Rationale: The 15.84% CPU claim was based on flawed profiling
- Action: Document correct perf sampling procedure
Scenario C: Benchmark Doesn't Stress Hot Path
- Decision: Commit Phase 6-A, improve benchmark coverage
- Rationale: Real workloads may show the expected gains
- Action: Add allocation-heavy benchmark (e.g., 90% malloc, 10% free); a sketch follows these scenarios
Scenario D: Measurement Noise Dominates
- Decision: Commit Phase 6-A if longer runs show >5% improvement
- Rationale: Noise can hide real improvements
- Action: Use mimalloc-bench suite for more stable measurements
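One possible shape for the allocation-heavy benchmark suggested in Scenario C, assuming the same 16B-1KB size range and fixed seed as the existing benches. This is an illustrative sketch, not part of the current suite; ITERS, WS, and the 90/10 split are placeholders.

```c
#include <stdlib.h>

#define ITERS 10000000L  /* long run to reduce noise              */
#define WS    4096       /* larger pool so allocations accumulate */

int main(void)
{
    void *slots[WS] = {0};
    size_t live = 0;
    srand(42);
    for (long i = 0; i < ITERS; i++) {
        if (live < WS && (live == 0 || rand() % 10 != 0)) {
            /* ~90% of operations: allocate into the next free slot */
            slots[live++] = malloc(16 + (size_t)(rand() % 1008));
        } else {
            /* ~10% of operations: free a random live block */
            size_t idx = (size_t)rand() % live;
            free(slots[idx]);
            slots[idx] = slots[--live];  /* keep the live set contiguous */
        }
    }
    while (live)
        free(slots[--live]);
    return 0;
}
```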
Next Steps
Phase 6-B: Conditional Path Forward
Option 1: Investigate First (Recommended)
- Run assembly analysis (1 hour)
- Re-run perf on benchmark (2 hours)
- Run longer benchmarks (4 hours)
- Make data-driven decision
Option 2: Commit Anyway
- Rationale: Code is cleaner, no measurable harm
- Risk: Future confusion if optimization isn't actually needed
Option 3: Discard Phase 6-A
- Rationale: No measurable benefit, not worth the churn
- Risk: Miss real optimization if measurement was flawed
Appendix: Full Benchmark Output
Baseline - bench_random_mixed
=== Baseline: bench_random_mixed (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 53806309 ops/s [iter=1000000 ws=256] time=0.019s
Run 2: Throughput = 53246568 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 53563123 ops/s [iter=1000000 ws=256] time=0.019s
Run 4: Throughput = 49409566 ops/s [iter=1000000 ws=256] time=0.020s
Run 5: Throughput = 51412515 ops/s [iter=1000000 ws=256] time=0.019s
Phase 6-A - bench_random_mixed
=== Phase 6-A: bench_random_mixed (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 39111201 ops/s [iter=1000000 ws=256] time=0.026s
Run 2: Throughput = 53296242 ops/s [iter=1000000 ws=256] time=0.019s
Run 3: Throughput = 56279982 ops/s [iter=1000000 ws=256] time=0.018s
Run 4: Throughput = 52790754 ops/s [iter=1000000 ws=256] time=0.019s
Run 5: Throughput = 53715992 ops/s [iter=1000000 ws=256] time=0.019s
Baseline - bench_mid_mt_gap
=== Baseline: bench_mid_mt_gap (Before Phase 6-A, SuperSlab lookup ALWAYS enabled) ===
Run 1: Throughput = 41.70 M operations per second, relative time: 0.023979 s.
Run 2: Throughput = 37.39 M operations per second, relative time: 0.026745 s.
Run 3: Throughput = 40.91 M operations per second, relative time: 0.024445 s.
Run 4: Throughput = 40.53 M operations per second, relative time: 0.024671 s.
Run 5: Throughput = 40.56 M operations per second, relative time: 0.024657 s.
Phase 6-A - bench_mid_mt_gap
=== Phase 6-A: bench_mid_mt_gap (Release build, SuperSlab lookup DISABLED) ===
Run 1: Throughput = 41.49 M operations per second, relative time: 0.024103 s.
Run 2: Throughput = 41.81 M operations per second, relative time: 0.023917 s.
Run 3: Throughput = 41.51 M operations per second, relative time: 0.024089 s.
Run 4: Throughput = 38.43 M operations per second, relative time: 0.026019 s.
Run 5: Throughput = 40.78 M operations per second, relative time: 0.024524 s.
Conclusion
Phase 6-A successfully implements the intended optimization (disabling the SuperSlab lookup in release builds), but the measured performance impact (+1-4%) falls well short of the expected +12-15% from the perf analysis.
Critical Question: Why does removing code that perf claims costs 15.84% CPU only yield 1-4% improvement?
Most Likely Answer: The compiler was already optimizing away the hak_super_lookup() call in release builds through dead code elimination, since its result is only used for debug assertions.
Recommended Action: Investigate before committing. If the compiler was already optimizing, Phase 6-A is still valuable for code clarity and future-proofing, but the performance claim needs correction.