hakmem/PHASE2_PERF_ANALYSIS.md
Moe Charm (CI) 5582cbc22c Refactor: Unified allocation macros + header validation
1. Archive unused backend files (ss_legacy/unified_backend_box.c/h)
   - These files were not linked in the build
   - Moved to archive/ to reduce confusion

2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
   - Replaces superslab_return_block() function
   - Consistent with existing HAK_RET_ALLOC pattern
   - Single source of truth for header writing
   - Defined in hakmem_tiny_superslab_internal.h

3. Added header validation on TLS SLL push
   - Detects blocks pushed without proper header
   - Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
   - Always on in debug builds
   - Logs first 10 violations with backtraces

Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design

Note: Larson bug still reproduces - header corruption occurs
before push validation can catch it.

2025-11-29 05:37:24 +09:00


HAKMEM Allocator - Phase 2 Performance Analysis

Quick Summary

| Metric          | Phase 1       | Phase 2        | Change         |
|-----------------|---------------|----------------|----------------|
| Throughput      | 72M ops/s     | 79.8M ops/s    | +10.8%         |
| Cycles          | 78.6M         | 72.2M          | -8.1% ✓        |
| Instructions    | 167M          | 153M           | -8.4% ✓        |
| Branches        | 36M           | 23M            | -36%           |
| Branch misses   | 921K (2.56%)  | 1.02M (4.43%)  | +73% (rate) ✗  |
| L3 cache misses | 173K (9.28%)  | 216K (10.28%)  | +25% (count) ✗ |
| dTLB misses     | N/A           | 41 (0.01%)     | Excellent!     |

Top 5 Hotspots (Phase 2, 628 samples)

  1. malloc() - 36.51% CPU time

    • Function overhead (prologue/epilogue): ~18%
    • Lock operations: 5.05%
    • Initialization checks: ~15%
  2. main() - 30.51% CPU time

    • Benchmark loop overhead (not allocator)
  3. free() - 19.66% CPU time

    • Lock operations: 3.29%
    • Cached variable checks: ~15%
    • Function overhead: ~10%
  4. clear_page_erms (kernel) - 9.31% CPU time

    • Page fault handling
  5. irqentry_exit_to_user_mode (kernel) - 5.33% CPU time

    • Kernel exit overhead

Phase 3 Optimization Targets (Ranked by Impact)

🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)

Target: Reduce malloc/free from 56% → ~33% CPU time

  • Inline hot paths to eliminate function call overhead
  • Remove stats counters from production builds
  • Cache initialization state in TLS
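
As a rough illustration of the TLS-cached initialization idea, a minimal C sketch (only `g_initialized` comes from this doc; the other names and the fallback to `malloc` are hypothetical stand-ins):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

static _Bool g_initialized;          /* shared flag, named in this doc */
static __thread _Bool tls_init_seen; /* per-thread cached copy of the check */

static void slow_init(void) { g_initialized = true; }

/* Fast path reads only thread-local state; the shared flag is consulted
 * at most once per thread, so the common call has no shared-memory branch. */
static inline void *hak_malloc_fast(size_t n) {
    if (__builtin_expect(!tls_init_seen, 0)) { /* cold: first call on thread */
        if (!g_initialized)
            slow_init();
        tls_init_seen = true;
    }
    return malloc(n); /* stand-in for the real inlined allocation path */
}
```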

🔥 Priority 2: Branch Optimization (Expected: +3-5%)

Target: Reduce branch misses from 1.02M → <700K

  • Apply Profile-Guided Optimization (PGO)
  • Add LIKELY/UNLIKELY hints
  • Reduce branches in fast path from ~15 to 5-7
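
A sketch of the proposed LIKELY/UNLIKELY hints, built on the gcc/clang `__builtin_expect` intrinsic (the `is_tiny` helper and the 256-byte threshold are illustrative, not from the codebase):

```c
#include <stddef.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Bias the size check toward the common small-allocation case, so the
 * compiler lays out the hot path fall-through and the cold path out of line. */
static inline int is_tiny(size_t n) {
    if (LIKELY(n <= 256))  /* common case: predictor assumes taken */
        return 1;
    return 0;              /* rare large sizes take the cold branch */
}
```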

🔥 Priority 3: Cache Optimization (Expected: +2-4%)

Target: Reduce L3 misses from 216K → <180K

  • Align hot structures to cache lines
  • Add prefetching in allocation path
  • Compact metadata structures
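
A minimal sketch of cache-line alignment plus prefetching on an intrusive free list (the `hot_cache` structure is hypothetical; `__builtin_prefetch` is the gcc/clang intrinsic):

```c
#include <stdalign.h>
#include <stddef.h>

/* Align the hot per-thread metadata to a 64-byte cache line so it never
 * shares (or straddles) a line with unrelated data. */
struct hot_cache {
    alignas(64) void *free_list; /* intrusive list: block's first word = next */
    size_t count;
};

static inline void *pop_with_prefetch(struct hot_cache *c) {
    void *blk = c->free_list;
    if (blk) {
        void *next = *(void **)blk;
        __builtin_prefetch(next, 0, 3); /* warm the successor before it's needed */
        c->free_list = next;
        c->count--;
    }
    return blk;
}
```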

🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)

  • Cache g_initialized/g_enable checks in TLS
  • Use constructor attributes more aggressively

🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)

  • Move stats to TLS, aggregate periodically
  • Eliminate atomic ops from fast path
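
The TLS-stats idea can be sketched as follows (counter names and the flush interval are illustrative): count in a plain thread-local variable on the fast path, and fold into the shared atomic total only periodically, so the hot path performs no atomic operations.

```c
#include <stdatomic.h>

enum { FLUSH_EVERY = 1024 };                /* amortization interval (example) */

static _Atomic unsigned long g_alloc_total; /* shared total, touched rarely */
static __thread unsigned long tls_allocs;   /* hot counter, no atomics */

static inline void count_alloc(void) {
    if (++tls_allocs == FLUSH_EVERY) {      /* one atomic add per 1024 ops */
        atomic_fetch_add_explicit(&g_alloc_total, tls_allocs,
                                  memory_order_relaxed);
        tls_allocs = 0;
    }
}
```

One design note: counters flushed this way lag by up to FLUSH_EVERY events per thread, which is usually acceptable for diagnostic stats.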

🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)

  • Reduce TLS reads/writes from ~10 to ~4 per operation
  • Cache TLS values in registers
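
A minimal sketch of the single-read/single-write TLS pattern (the `tls_head` list is hypothetical): load the thread-local value into a local once, operate on the local, and store it back once, rather than touching TLS on every step.

```c
#include <stddef.h>

static __thread void **tls_head; /* intrusive free list head (illustrative) */

static inline void *pop_batched(void) {
    void **head = tls_head;      /* single TLS read; compiler keeps it in a register */
    void *blk = NULL;
    if (head) {
        blk = (void *)head;
        head = (void **)*head;   /* advance via the block's embedded next pointer */
    }
    tls_head = head;             /* single TLS write */
    return blk;
}
```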

Expected Phase 3 Results

Target Throughput: 87-95M ops/s (+9-19% improvement)

| Metric          | Phase 2     | Phase 3 Target | Change |
|-----------------|-------------|----------------|--------|
| Throughput      | 79.8M ops/s | 87-95M ops/s   | +9-19% |
| malloc CPU      | 36.51%      | ~22%           | -40%   |
| free CPU        | 19.66%      | ~11%           | -44%   |
| Branch misses   | 4.43%       | <3%            | -32%   |
| L3 cache misses | 10.28%      | <8%            | -22%   |

Key Insights

What Worked in Phase 2

  1. SuperSlab size increase (64KB → 512KB): Dramatically reduced branches (-36%)
  2. Amortized initialization: memset overhead dropped from 6.41% → 1.77%
  3. Virtual memory optimization: TLB miss rate is excellent (0.01%)

What Needs Work

  1. Branch prediction: Miss rate jumped from 2.56% to 4.43% despite 36% fewer branches
  2. Cache pressure: Larger SuperSlabs increased L3 misses
  3. Function overhead: malloc/free dominate CPU time (56%)

🤔 Surprising Findings

  1. Cross-calling pattern: malloc/free call each other 8-12% of the time

    • Thread-local cache flushing
    • Deferred release operations
    • May benefit from batching
  2. Kernel overhead increased: clear_page_erms went from 2.23% → 9.31%

    • May need page pre-faulting strategy
  3. Main loop visible: 30.51% CPU time

    • Benchmark overhead, not allocator
    • Real allocator overhead is ~56% (malloc + free)
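
The batching idea from finding 1 can be sketched like this (a hypothetical deferred-free queue; `free` stands in for the real release path): queue blocks in a small thread-local array and release them in one flush, so free() stops re-entering allocation paths one block at a time.

```c
#include <stddef.h>
#include <stdlib.h>

enum { BATCH = 32 };                 /* flush granularity (example value) */

static __thread void *pending[BATCH];
static __thread int npending;

static inline void deferred_free(void *p) {
    pending[npending++] = p;
    if (npending == BATCH) {         /* one release pass per 32 frees */
        for (int i = 0; i < BATCH; i++)
            free(pending[i]);        /* stand-in for the real release op */
        npending = 0;
    }
}
```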

Files Generated

  • perf_phase2_stats.txt - perf stat -d output
  • perf_phase2_symbols.txt - Symbol-level hotspots
  • perf_phase2_callgraph.txt - Call graph analysis
  • perf_phase2_detailed.txt - Detailed counter breakdown
  • perf_malloc_annotate.txt - Assembly annotation for malloc()
  • perf_free_annotate.txt - Assembly annotation for free()
  • perf_analysis_summary.txt - Detailed comparison with Phase 1
  • phase3_recommendations.txt - Complete optimization roadmap

How to Use This Data

For Quick Reference

cat perf_phase2_stats.txt        # See overall metrics
cat perf_phase2_symbols.txt      # See top functions

For Deep Analysis

cat perf_malloc_annotate.txt     # See assembly-level hotspots in malloc
cat perf_free_annotate.txt       # See assembly-level hotspots in free
cat perf_analysis_summary.txt    # See Phase 1 vs Phase 2 comparison

For Planning Phase 3

cat phase3_recommendations.txt   # See ranked optimization opportunities

To Re-run Analysis

# Quick stat
perf stat -d ./bench_random_mixed_hakmem 1000000 256 42

# Detailed profiling
perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
perf report --stdio --no-children --sort symbol

Next Steps

  1. Week 1: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
  2. Week 2: Apply PGO + branch hints (Expected: +3-5%)
  3. Week 3: Cache line alignment + prefetching (Expected: +2-4%)
  4. Week 4: TLS optimization + polish (Expected: +1-3%)

Total Expected: +14-22% improvement → 91-97M ops/s (sum of the weekly estimates; the per-metric table above targets a more conservative 87-95M ops/s)


Generated: 2025-11-28 Phase: 2 → 3 transition Baseline: 72M ops/s → Current: 79.8M ops/s → Target: 87-95M ops/s