HAKMEM Allocator - Phase 2 Performance Analysis
Quick Summary
| Metric | Phase 1 | Phase 2 | Change |
|---|---|---|---|
| Throughput | 72M ops/s | 79.8M ops/s | +10.8% ✓ |
| Cycles | 78.6M | 72.2M | -8.1% ✓ |
| Instructions | 167M | 153M | -8.4% ✓ |
| Branches | 36M | 23M | -36% ✓ |
| Branch Misses | 921K (2.56%) | 1.02M (4.43%) | +73% ✗ |
| L3 Cache Misses | 173K (9.28%) | 216K (10.28%) | +25% ✗ |
| dTLB Misses | N/A | 41 (0.01%) | Excellent! ✓ |
Top 5 Hotspots (Phase 2, 628 samples)
1. malloc() - 36.51% CPU time
   - Function overhead (prologue/epilogue): ~18%
   - Lock operations: 5.05%
   - Initialization checks: ~15%
2. main() - 30.51% CPU time
   - Benchmark loop overhead (not allocator)
3. free() - 19.66% CPU time
   - Lock operations: 3.29%
   - Cached variable checks: ~15%
   - Function overhead: ~10%
4. clear_page_erms (kernel) - 9.31% CPU time
   - Page fault handling
5. irqentry_exit_to_user_mode (kernel) - 5.33% CPU time
   - Kernel exit overhead
Phase 3 Optimization Targets (Ranked by Impact)
🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)
Target: Reduce malloc/free from 56% → ~33% CPU time
- Inline hot paths to eliminate function call overhead (see the sketch after this list)
- Remove stats counters from production builds
- Cache initialization state in TLS
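A minimal sketch of what the inlined fast path could look like, assuming a per-thread free-list bin and a cached init flag; the identifiers (hak_tls_bin, hak_malloc_slow, HAK_LIKELY) are illustrative placeholders, not the existing HAKMEM symbols:

```c
#include <stddef.h>

#define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)

/* Per-thread singly linked free list for one size class (one bin shown). */
typedef struct hak_tls_bin {
    void *head;   /* next free block, or NULL when empty      */
    int   ready;  /* cached copy of the global init/enable state */
} hak_tls_bin;

static __thread hak_tls_bin tls_bin;

void *hak_malloc_slow(size_t size);   /* out-of-line init / refill path */

/* Fast path: one branch on the cached state, one on list emptiness. */
static inline void *hak_malloc_fast(size_t size)
{
    if (HAK_LIKELY(tls_bin.ready && tls_bin.head != NULL)) {
        void *blk = tls_bin.head;
        tls_bin.head = *(void **)blk;  /* pop: next pointer stored in block */
        return blk;
    }
    return hak_malloc_slow(size);      /* cold path handles init and refill */
}
```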
🔥 Priority 2: Branch Optimization (Expected: +3-5%)
Target: Reduce branch misses from 1.02M → <700K
- Apply Profile-Guided Optimization (PGO)
- Add LIKELY/UNLIKELY hints
- Reduce branches in fast path from ~15 to 5-7 (see the sketch below)
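PGO itself happens at build time (e.g. GCC's -fprofile-generate / -fprofile-use); the source-level part is hint macros plus replacing compare chains with computation. A sketch assuming 16-byte size classes up to 1 KB; all names here are placeholders, not HAKMEM code:

```c
#include <stddef.h>

#define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Map a request size to a size-class index without a branch cascade:
 * classes are 16-byte steps up to 1024 bytes in this sketch. */
static inline unsigned hak_size_class(size_t size)
{
    return (unsigned)((size + 15) >> 4);   /* 1..64 for 1..1024 bytes */
}

void *hak_alloc_from_class(unsigned cls); /* placeholder bin allocator  */
void *hak_malloc_fallback(size_t size);   /* placeholder for other sizes */

void *hak_malloc_small(size_t size)
{
    if (HAK_UNLIKELY(size == 0 || size > 1024))
        return hak_malloc_fallback(size);  /* rare sizes leave the fast path */
    return hak_alloc_from_class(hak_size_class(size));
}
```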
🔥 Priority 3: Cache Optimization (Expected: +2-4%)
Target: Reduce L3 misses from 216K → <180K
- Align hot structures to cache lines (see the sketch below)
- Add prefetching in allocation path
- Compact metadata structures
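A sketch of the intended layout and prefetch pattern, assuming a compact per-thread bin structure; the field names and the prefetch point are illustrative, not the current HAKMEM structures:

```c
#include <stddef.h>
#include <stdint.h>

/* Keep the per-operation fields together on a single 64-byte line. */
struct hak_hot_bin {
    void    *head;        /* next free block              */
    uint32_t count;       /* blocks remaining in the bin  */
    uint32_t size_class;  /* size class served by the bin */
} __attribute__((aligned(64)));

static inline void *hak_pop_with_prefetch(struct hak_hot_bin *bin)
{
    void *blk = bin->head;
    if (blk != NULL) {
        void *next = *(void **)blk;
        bin->head = next;
        __builtin_prefetch(next, 1, 3); /* warm the line for the next alloc */
    }
    return blk;
}
```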
🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)
- Cache g_initialized/g_enable checks in TLS (see the sketch below)
- Use constructor attributes more aggressively
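A sketch of folding both global checks into one thread-local flag populated after a constructor runs; g_initialized, g_enable, and hak_ready are illustrative names, and the sketch assumes g_enable does not change after startup:

```c
#include <stdbool.h>

static bool g_initialized;           /* written once at startup           */
static bool g_enable = true;         /* assumed fixed after bring-up      */
static __thread int tls_state = -1;  /* -1 unknown, 0 disabled, 1 ready   */

__attribute__((constructor))
static void hak_startup(void)
{
    /* ... allocator bring-up elided ... */
    g_initialized = true;
}

static inline bool hak_ready(void)
{
    if (tls_state >= 0)              /* common case: a single TLS read    */
        return tls_state != 0;
    tls_state = (g_initialized && g_enable) ? 1 : 0;
    return tls_state != 0;
}
```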
🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)
- Move stats to TLS, aggregate periodically (see the sketch below)
- Eliminate atomic ops from fast path
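A sketch of moving a counter into TLS and flushing it to a global atomic only periodically; the counter names and the 1024-operation flush interval are assumptions, not the existing HAKMEM stats:

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t g_alloc_total;     /* global, touched rarely    */
static __thread uint64_t tls_alloc_count;  /* hot-path counter, no lock */

#define HAK_STATS_FLUSH_EVERY 1024

static inline void hak_count_alloc(void)
{
    if (++tls_alloc_count % HAK_STATS_FLUSH_EVERY == 0) {
        /* One relaxed atomic add per 1024 allocations instead of one per
         * call; a partial batch at thread exit would need a final flush. */
        atomic_fetch_add_explicit(&g_alloc_total, HAK_STATS_FLUSH_EVERY,
                                  memory_order_relaxed);
    }
}
```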
🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)
- Reduce TLS reads/writes from ~10 to ~4 per operation (see the sketch below)
- Cache TLS values in registers
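A sketch of taking the thread-local state's address once per call and working through that local pointer, so the compiler emits a single TLS address computation per operation; the structure and field names are illustrative:

```c
#include <stddef.h>

struct hak_tls_state {
    void   *bins[64];   /* per-size-class free lists          */
    size_t  in_use;     /* blocks currently handed out        */
};

static __thread struct hak_tls_state tls_alloc_state;

void *hak_refill(struct hak_tls_state *st, unsigned cls); /* placeholder */

void *hak_malloc_tls(unsigned cls)    /* cls assumed already validated */
{
    struct hak_tls_state *st = &tls_alloc_state;  /* one TLS access */
    void *blk = st->bins[cls];
    if (blk != NULL) {
        st->bins[cls] = *(void **)blk;
        st->in_use++;
        return blk;
    }
    return hak_refill(st, cls);       /* pass the cached pointer onward */
}
```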
Expected Phase 3 Results
Target Throughput: 87-95M ops/s (+9-19% improvement)
| Metric | Phase 2 | Phase 3 Target | Change |
|---|---|---|---|
| Throughput | 79.8M ops/s | 87-95M ops/s | +9-19% |
| malloc CPU | 36.51% | ~22% | -40% |
| free CPU | 19.66% | ~11% | -44% |
| Branch misses | 4.43% | <3% | -32% |
| L3 cache misses | 10.28% | <8% | -22% |
Key Insights
✅ What Worked in Phase 2
- SuperSlab size increase (64KB → 512KB): Dramatically reduced branches (-36%)
- Amortized initialization: memset overhead dropped from 6.41% → 1.77%
- Virtual memory optimization: TLB miss rate is excellent (0.01%)
❌ What Needs Work
- Branch prediction: Miss rate rose from 2.56% to 4.43% despite 36% fewer branches
- Cache pressure: Larger SuperSlabs increased L3 misses by 25%
- Function overhead: malloc/free together dominate CPU time (56%)
🤔 Surprising Findings
1. Cross-calling pattern: malloc/free call each other 8-12% of the time
   - Thread-local cache flushing
   - Deferred release operations
   - May benefit from batching (see the sketch after this list)
2. Kernel overhead increased: clear_page_erms went from 2.23% → 9.31%
   - May need a page pre-faulting strategy
3. Main loop visible: 30.51% CPU time
   - Benchmark overhead, not allocator
   - Real allocator overhead is ~56% (malloc + free)
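A minimal sketch of the batching idea from the first finding above: remote frees are parked in a small per-thread buffer and handed back in one pass, so malloc and free stop calling into each other on every operation. hak_release_batch and the 32-entry buffer are assumptions, not existing HAKMEM code:

```c
#include <stddef.h>

#define HAK_DEFER_CAP 32

static __thread void  *tls_deferred[HAK_DEFER_CAP];
static __thread size_t tls_deferred_n;

void hak_release_batch(void **blocks, size_t n);  /* assumed slow-path hook */

void hak_free_deferred(void *blk)
{
    tls_deferred[tls_deferred_n++] = blk;
    if (tls_deferred_n == HAK_DEFER_CAP) {        /* flush once per 32 frees */
        hak_release_batch(tls_deferred, tls_deferred_n);
        tls_deferred_n = 0;
    }
}
```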
Files Generated
- perf_phase2_stats.txt - perf stat -d output
- perf_phase2_symbols.txt - Symbol-level hotspots
- perf_phase2_callgraph.txt - Call graph analysis
- perf_phase2_detailed.txt - Detailed counter breakdown
- perf_malloc_annotate.txt - Assembly annotation for malloc()
- perf_free_annotate.txt - Assembly annotation for free()
- perf_analysis_summary.txt - Detailed comparison with Phase 1
- phase3_recommendations.txt - Complete optimization roadmap
How to Use This Data
For Quick Reference
```bash
cat perf_phase2_stats.txt    # See overall metrics
cat perf_phase2_symbols.txt  # See top functions
```
For Deep Analysis
```bash
cat perf_malloc_annotate.txt   # See assembly-level hotspots in malloc
cat perf_free_annotate.txt     # See assembly-level hotspots in free
cat perf_analysis_summary.txt  # See Phase 1 vs Phase 2 comparison
```
For Planning Phase 3
```bash
cat phase3_recommendations.txt  # See ranked optimization opportunities
```
To Re-run Analysis
```bash
# Quick stat
perf stat -d ./bench_random_mixed_hakmem 1000000 256 42

# Detailed profiling
perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
perf report --stdio --no-children --sort symbol
```
Next Steps
- Week 1: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
- Week 2: Apply PGO + branch hints (Expected: +3-5%)
- Week 3: Cache line alignment + prefetching (Expected: +2-4%)
- Week 4: TLS optimization + polish (Expected: +1-3%)
Total Expected: +14-22% improvement → Target: 91-97M ops/s
Generated: 2025-11-28
Phase: 2 → 3 transition
Baseline: 72M ops/s → Current: 79.8M ops/s → Target: 87-95M ops/s