

HAKMEM Tiny Pool - Performance Analysis Index

Date: 2025-10-26
Session: Post-getenv Fix Analysis
Status: Analysis Complete - Optimization Recommended


Quick Navigation

For Immediate Action

For Detailed Review

Raw Performance Data

  • perf_post_getenv.data - Perf recording (1 GB)
  • perf_post_getenv_report.txt - Top functions report
  • perf_post_getenv_annotate.txt - Annotated assembly

Executive Summary

Achievement

  • Eliminated getenv bottleneck: 43.96% CPU → 0%
  • Performance improvement: +86% to +173% (60 → 120-164 M ops/sec)
  • Now FASTER than glibc: +15% to +57%
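
The source does not show the getenv fix itself. A common way to remove a per-call getenv() cost is to read the variable once and cache the result, and the sketch below assumes that approach. HAKMEM_WRAP_TINY is the variable from the benchmark command in this document; g_wrap_tiny_cached, g_getenv_calls, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cached flag: -1 = not read yet, 0 = disabled, 1 = enabled. */
static int g_wrap_tiny_cached = -1;

/* Counts getenv() calls so the effect of caching is observable (sketch only). */
static int g_getenv_calls = 0;

/* Slow path: consult the environment once and remember the answer. */
static int wrap_tiny_read_env(void) {
    const char *v = getenv("HAKMEM_WRAP_TINY");
    g_getenv_calls++;
    g_wrap_tiny_cached = (v != NULL && v[0] == '1') ? 1 : 0;
    assert(g_wrap_tiny_cached == 0 || g_wrap_tiny_cached == 1);
    return g_wrap_tiny_cached;
}

/* Fast path: a single integer compare on every call after the first. */
static inline int wrap_tiny_enabled(void) {
    int cached = g_wrap_tiny_cached;
    if (cached >= 0) return cached;
    return wrap_tiny_read_env();
}
```

The sketch is not synchronized. Two threads racing on the first call could both reach getenv(), so a production version would more likely read the variable during allocator startup or use an atomic flag.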

Current Status

  • New #1 Bottleneck: hak_tiny_alloc (22.75% CPU)
  • Verdict: Worth optimizing (2.27x the 10% threshold)
  • Next Target: Reduce hak_tiny_alloc to ~10% CPU

Recommendation

OPTIMIZE NEXT BOTTLENECK - Clear path to 180-250 M ops/sec (+70% to +140% vs glibc)


File Descriptions

Analysis Documents

PERF_POST_GETENV_ANALYSIS.md (11 KB)

Purpose: Comprehensive post-getenv performance analysis

Contains:

  • Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
  • Q2: Top 5 hotspots ranking
  • Q3: Optimization worthiness assessment
  • Q4: Root cause analysis and proposed fixes
  • Before/after comparison table
  • Final recommendation with justification

Key Finding: hak_tiny_alloc at 22.75% is 2.27x the 10% threshold → Optimize!

OPTIMIZATION_NEXT_STEPS.md (7 KB)

Purpose: Actionable implementation guide

Contains:

  • Root cause breakdown from perf annotate
  • 4-phase optimization strategy (prioritized)
  • Implementation plan with time estimates
  • Success criteria and validation commands
  • Risk assessment
  • Code examples and snippets

Start Here: If you're ready to implement optimizations

PERF_SUMMARY.txt (2.6 KB)

Purpose: Quick reference card

Contains:

  • Performance journey (4 phases)
  • Optimization roadmap
  • Key metrics comparison
  • Next steps recommendation

Use Case: Quick briefing or status check

BOTTLENECK_COMPARISON.txt (4.4 KB)

Purpose: Side-by-side before/after analysis

Contains:

  • Top 10 CPU consumers comparison
  • Critical observations (4 key insights)
  • Performance trajectory visualization
  • Decision matrix (6 criteria)
  • Next bottleneck recommendation

Use Case: Understanding impact of getenv fix


Key Metrics at a Glance

| Metric | Before (getenv bug) | After (fixed) | Change |
|---|---|---|---|
| Performance | 60 M ops/sec | 120-164 M ops/sec | +86-173% |
| vs glibc | 43% slower | 15-57% faster | HUGE WIN |
| Top bottleneck | getenv (43.96%) | hak_tiny_alloc (22.75%) | Different |
| Allocator CPU | ~69% | ~51% | -18 percentage points |
| Wasted CPU | 44% (getenv) | 0% | -44 percentage points |

Top 5 Current Bottlenecks

| Rank | Function | CPU (Self) | Status | Action |
|---|---|---|---|---|
| 1 | hak_tiny_alloc | 22.75% | ⚠ HIGH | OPTIMIZE |
| 2 | __random | 14.00% | INFO | Benchmark overhead |
| 3 | mid_desc_lookup | 12.55% | ⚠ MED | Consider optimizing |
| 4 | hak_tiny_owner_slab | 9.09% | ✓ OK | Below threshold |
| 5 | hak_free_at | 11.08% | INFO | Children time |

Primary Target: hak_tiny_alloc (22.75%) - 2.27x the 10% threshold
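
The threshold rule used throughout this index is simple arithmetic: divide a function's self CPU share by the 10% threshold. A minimal sketch, with hypothetical helper names, makes the 2.27x figure reproducible. It only compares percentages; deciding that an entry such as __random is benchmark overhead rather than allocator work remains a manual judgment.

```c
#include <assert.h>

/* Threshold from the text: above 10% self CPU, a function is worth optimizing. */
#define HOTSPOT_THRESHOLD_PCT 10.0

/* How many times the threshold a hotspot consumes (22.75 gives about 2.27). */
static double threshold_ratio(double self_cpu_pct) {
    assert(self_cpu_pct >= 0.0);
    return self_cpu_pct / HOTSPOT_THRESHOLD_PCT;
}

/* Returns 1 if the hotspot exceeds the threshold, 0 otherwise. */
static int worth_optimizing(double self_cpu_pct) {
    return threshold_ratio(self_cpu_pct) > 1.0;
}
```

With the numbers from the table, hak_tiny_alloc (22.75%) and mid_desc_lookup (12.55%) exceed the threshold, while hak_tiny_owner_slab (9.09%) does not.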


Optimization Roadmap

Phase 7.2.5: Eliminate getenv ✓ COMPLETE

  • Status: Done
  • Impact: -43.96% CPU, +86-173% throughput
  • Achievement: 60 → 120-164 M ops/sec

Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT

  • Target: 22.75% → ~10% CPU
  • Method: Inline fast path, reduce stack, cache TLS
  • Expected: +50-70% throughput (→ 180-220 M ops/sec)
  • Effort: 2-4 hours

Phase 7.2.7: Optimize mid_desc_lookup (Optional)

  • Target: 12.55% → ~6% CPU
  • Method: Smaller hash table, prefetching
  • Expected: +10-20% additional throughput
  • Effort: 1-2 hours
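
The source gives only the method names for this phase ("smaller hash table, prefetching"), not the layout of the real mid_desc_lookup. The sketch below is therefore an illustration under assumptions: a 256-bucket open-addressed table keyed by page address, 64 KiB pages, and a GCC/Clang `__builtin_prefetch` hint that the caller can issue before it needs the descriptor. All type, macro, and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MID_TABLE_BITS 8                       /* 256 buckets, assumed size */
#define MID_TABLE_SIZE (1u << MID_TABLE_BITS)
#define MID_PAGE_SHIFT 16                      /* assumes 64 KiB pages */

typedef struct {
    uintptr_t page;      /* page base address, 0 marks an empty bucket */
    int       class_id;  /* payload for the sketch */
} mid_desc_t;

static mid_desc_t g_mid_table[MID_TABLE_SIZE];

static inline uintptr_t mid_page_of(uintptr_t addr) {
    return addr & ~(((uintptr_t)1 << MID_PAGE_SHIFT) - 1);
}

/* Multiplicative hash: keep the top MID_TABLE_BITS bits of the product. */
static inline uint32_t mid_hash(uintptr_t page) {
    uint64_t key = (uint64_t)page >> MID_PAGE_SHIFT;
    return (uint32_t)((key * 0x9E3779B97F4A7C15ull) >> (64 - MID_TABLE_BITS));
}

/* Hint: start loading the home bucket before the lookup result is needed. */
static inline void mid_desc_prefetch(uintptr_t addr) {
    __builtin_prefetch(&g_mid_table[mid_hash(mid_page_of(addr))], 0, 3);
}

/* Insert or update a descriptor using linear probing. Returns -1 if full. */
static int mid_desc_insert(uintptr_t page, int class_id) {
    assert(page != 0 && page == mid_page_of(page));
    uint32_t i = mid_hash(page);
    for (uint32_t n = 0; n < MID_TABLE_SIZE; n++) {
        if (g_mid_table[i].page == 0 || g_mid_table[i].page == page) {
            g_mid_table[i].page = page;
            g_mid_table[i].class_id = class_id;
            return 0;
        }
        i = (i + 1) & (MID_TABLE_SIZE - 1);
    }
    return -1;
}

/* Look up the descriptor for any address inside a registered page. */
static const mid_desc_t *mid_desc_lookup_sketch(uintptr_t addr) {
    uintptr_t page = mid_page_of(addr);
    uint32_t i = mid_hash(page);
    for (uint32_t n = 0; n < MID_TABLE_SIZE; n++) {
        if (g_mid_table[i].page == 0) return NULL;      /* empty: not present */
        if (g_mid_table[i].page == page) return &g_mid_table[i];
        i = (i + 1) & (MID_TABLE_SIZE - 1);
    }
    return NULL;
}
```

At 16 bytes per bucket on a 64-bit target, this table occupies 4 KiB. Whether a prefetch helps at that size would need to be measured with perf, since the table may already stay cache-resident.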

Phase 7.2.8: Ship It!

  • Condition: All bottlenecks <10%
  • Expected Performance: 200-250 M ops/sec (roughly 1.9-2.4x glibc, given that 164 M ops/sec is 57% faster than glibc)
  • Action: Set g_wrap_tiny_enabled = 1 by default

Root Cause: hak_tiny_alloc (22.75% CPU)

Hotspot Breakdown

  1. Heavy stack usage (10.5% CPU)

    • 88 bytes allocated
    • Multiple stack reads/writes
    • Register spilling
  2. Repeated global reads (7.2% CPU)

    • g_tiny_initialized (3.52%)
    • g_wrap_tiny_enabled (0.28%)
    • Should cache in TLS
  3. Complex control flow (5.0% CPU)

    • Size validation branches
    • Magazine refill in main path
    • Should separate fast/slow paths

Hottest Instructions (from perf annotate)

3.71%:  push %r14                        Register pressure
3.52%:  mov g_tiny_initialized,%r14d     Global read
3.53%:  mov 0x1c(%rsp),%ebp             Stack read
3.33%:  cmpq $0x80,0x10(%rsp)           Size check
3.06%:  mov %rbp,0x38(%rsp)             Stack write

Proposed Solution

1. Inline Fast Path (Priority: HIGH)

Impact: -5 to -7% CPU
Effort: 2-3 hours

Create inline hak_tiny_alloc_fast():

  • Quick size validation
  • Direct TLS magazine access
  • Fast path for magazine hit (common case)
  • Delegate to slow path only for refill
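
The four points above can be sketched as follows. Only the name hak_tiny_alloc_fast() comes from this document. The magazine layout, its capacity, the single 128-byte block size, the malloc-based refill, and hak_tiny_alloc_slow() are assumptions made so the example compiles on its own. The 128-byte limit mirrors the `cmpq $0x80` size check in the perf annotate output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_MAX_SIZE 128   /* assumed tiny limit, matching the 0x80 size check */
#define MAG_CAPACITY  16    /* hypothetical magazine depth */

/* Hypothetical per-thread magazine: a small stack of ready-to-use blocks. */
typedef struct {
    void *slots[MAG_CAPACITY];
    int   count;
} tiny_magazine_t;

static _Thread_local tiny_magazine_t t_mag;
static int g_slow_path_calls = 0;   /* instrumentation for the sketch only */

/* Slow path: refill the magazine, then hand out one block. Kept out of line
 * so its locals and loop do not bloat the fast path. */
static __attribute__((noinline)) void *hak_tiny_alloc_slow(size_t size) {
    assert(size > 0 && size <= TINY_MAX_SIZE);
    g_slow_path_calls++;
    while (t_mag.count < MAG_CAPACITY) {
        void *p = malloc(TINY_MAX_SIZE);   /* stand-in for a real slab refill */
        if (p == NULL) break;
        t_mag.slots[t_mag.count++] = p;
    }
    return t_mag.count > 0 ? t_mag.slots[--t_mag.count] : NULL;
}

/* Fast path: one size check, one TLS access, one pop on a magazine hit. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE) return NULL;   /* not a tiny request */
    tiny_magazine_t *mag = &t_mag;
    if (mag->count > 0) return mag->slots[--mag->count];  /* common case */
    return hak_tiny_alloc_slow(size);
}
```

In this sketch the first request triggers one refill and the next fifteen are served from the magazine without leaving the inline function. `__attribute__((noinline))` is a GCC/Clang extension.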

2. Reduce Stack Usage (Priority: MEDIUM)

Impact: -3 to -4% CPU
Effort: 1-2 hours

Reduce from 88 → <32 bytes:

  • Fewer local variables
  • Pass in registers where possible
  • Move rarely-used locals to slow path
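
One way to apply the last point is to move a large, rarely needed local into a separate non-inlined function, so the hot function's frame no longer has to reserve space for it. The sketch below assumes the size classes listed later in this document (16B, 32B, 64B, 128B). The class indices, the diagnostic record, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical diagnostic record, needed only when a request is rejected. */
typedef struct {
    char   message[64];
    size_t requested;
} tiny_diag_t;

static tiny_diag_t g_last_diag;

/* Cold path: the large local lives here, so its stack space is reserved
 * only when this function is actually called. */
static __attribute__((noinline)) int tiny_size_class_reject(size_t size) {
    assert(size == 0 || size > 128);
    tiny_diag_t diag;
    memset(&diag, 0, sizeof diag);
    strcpy(diag.message, "size outside tiny range");
    diag.requested = size;
    g_last_diag = diag;
    return -1;
}

/* Hot path: a handful of scalars that the compiler can usually keep in
 * registers. Maps 1..16 to 0, 17..32 to 1, 33..64 to 2, 65..128 to 3. */
static inline int tiny_size_class(size_t size) {
    if (size == 0 || size > 128) return tiny_size_class_reject(size);
    int cls = 0;
    for (size_t cap = 16; cap < size; cap <<= 1) cls++;
    return cls;
}
```

Whether this brings the real frame from 88 bytes to under 32 depends on the compiler and on which locals are moved, so the result should be checked in the disassembly.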

3. Cache Globals in TLS (Priority: LOW)

Impact: -2 to -3% CPU
Effort: 1 hour

Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:

  • Read once on TLS init
  • Avoid repeated global reads (3.8% CPU saved)
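
A minimal sketch of the caching idea, assuming both globals are plain ints that stop changing once initialization completes. g_tiny_initialized and g_wrap_tiny_enabled are the names from this document; the packed flag word, its bit values, and the helper functions are hypothetical.

```c
#include <assert.h>

/* Globals named in the text. The initial values here are placeholders. */
static int g_tiny_initialized  = 1;
static int g_wrap_tiny_enabled = 1;

#define TLS_FLAG_VALID 0x1u   /* snapshot taken, which implies initialized */
#define TLS_FLAG_WRAP  0x2u   /* g_wrap_tiny_enabled was nonzero at snapshot */

static _Thread_local unsigned t_tiny_flags;   /* zero until this thread snapshots */
static int g_snapshot_count = 0;              /* instrumentation for the sketch only */

/* Cold path: read both globals once per thread and pack them into one word. */
static __attribute__((noinline)) unsigned tiny_flags_snapshot(void) {
    if (!g_tiny_initialized) return 0;        /* never cache a pre-init state */
    unsigned f = TLS_FLAG_VALID;
    if (g_wrap_tiny_enabled) f |= TLS_FLAG_WRAP;
    t_tiny_flags = f;
    g_snapshot_count++;
    assert(t_tiny_flags & TLS_FLAG_VALID);
    return f;
}

/* Hot path: one thread-local load replaces two global loads. */
static inline int tiny_ready(void) {
    unsigned f = t_tiny_flags;
    if (!(f & TLS_FLAG_VALID)) f = tiny_flags_snapshot();
    return (f & TLS_FLAG_WRAP) != 0;
}
```

The snapshot goes stale if either global changes after a thread has cached it. Whether a thread-local load is actually cheaper than the global load also depends on the TLS access model, so this step should be validated with a fresh perf run.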

Total Expected: -10 to -14% CPU (the sum of the three estimates above), taking hak_tiny_alloc from 22.75% to roughly 10%


Success Criteria

After optimization, verify:

  • hak_tiny_alloc CPU: 22.75% → <12%
  • Total throughput: 120-164 M → 180-250 M ops/sec
  • Faster than glibc: +70% to +140% (vs current +15-57%)
  • No correctness regressions
  • No new bottleneck >15%
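
The checklist can be turned into a single pass/fail check. The thresholds come from the list above. The struct, its field names, and the function are hypothetical. The glibc throughput has to be supplied by the caller, because this document gives it only as relative percentages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one benchmark plus perf run. */
typedef struct {
    double alloc_cpu_pct;      /* hak_tiny_alloc self CPU */
    double throughput_mops;    /* lowest measured throughput, M ops/sec */
    double glibc_mops;         /* glibc baseline on the same test, M ops/sec */
    double max_other_cpu_pct;  /* largest self CPU among the other functions */
    int    correctness_ok;     /* 1 if the correctness tests passed */
} perf_result_t;

/* Returns 1 only if every success criterion holds. */
static int meets_success_criteria(const perf_result_t *r) {
    assert(r != NULL && r->glibc_mops > 0.0);
    double gain_vs_glibc_pct = (r->throughput_mops / r->glibc_mops - 1.0) * 100.0;
    return r->alloc_cpu_pct < 12.0          /* 22.75% -> <12% */
        && r->throughput_mops >= 180.0      /* lower end of 180-250 M ops/sec */
        && gain_vs_glibc_pct >= 70.0        /* +70% or better vs glibc */
        && r->correctness_ok                /* no correctness regressions */
        && r->max_other_cpu_pct <= 15.0;    /* no new bottleneck >15% */
}
```

The current state (22.75% CPU at 120 M ops/sec) fails the check for any realistic baseline; the glibc values passed in the usage checks are placeholders, not measurements.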

Files to Review/Modify

Source Code

  • /home/tomoaki/git/hakmem/hakmem_pool.c - Main implementation
  • /home/tomoaki/git/hakmem/hakmem_pool.h - Add inline fast path

Performance Data

  • /home/tomoaki/git/hakmem/perf_post_getenv.data - Current perf recording
  • /home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt - Assembly hotspots

Benchmarks

  • /home/tomoaki/git/hakmem/bench_comprehensive_hakmem - Test binary
  • Run with: HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

Timeline

Completed (Today)

  • Collect fresh perf data post-getenv fix
  • Identify new #1 bottleneck (hak_tiny_alloc)
  • Analyze root causes via perf annotate
  • Compare before/after getenv fix
  • Make optimization recommendation
  • Create implementation guide

Next Session (2-4 hours)

  • Implement inline fast path
  • Reduce stack usage
  • Benchmark and validate
  • Collect new perf data
  • Assess if further optimization needed

Future (Optional, 1-2 hours)

  • Optimize mid_desc_lookup (12.55%)
  • Final validation
  • Enable tiny pool by default
  • Ship it!

Questions?

Q: Should we stop optimizing and ship now?
A: No. hak_tiny_alloc at 22.75% is 2.27x the 10% threshold. This is a clear optimization opportunity with high ROI (50-70% gain for 2-4 hours of work).

Q: What if optimization doesn't work?
A: Low risk. We can always revert. Current performance (120-164 M ops/sec) already beats glibc, so a revert leaves us no worse off than today.

Q: How do we know when to stop?
A: When the top bottleneck falls below 10%, or when effort exceeds returns. Currently at 22.75%, so not there yet.

Q: What about the other bottlenecks?
A: mid_desc_lookup (12.55%) is a secondary target if time permits. hak_tiny_owner_slab (9.09%) is below the 10% threshold and acceptable.


Additional Resources

Previous Analysis (For Context)

  • PERF_ANALYSIS_RESULTS.md - Original analysis that identified getenv bug
  • perf_report.txt - Old data (with getenv bug)
  • perf_annotate_*.txt - Old annotations

Benchmark Results

See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:

  • Per-test throughput breakdown
  • Size class performance (16B, 32B, 64B, 128B)
  • Comparison with glibc baseline

Contact

Project: HAKMEM Memory Allocator
Repository: /home/tomoaki/git/hakmem
Analysis Date: 2025-10-26
Analyst: Claude Code (Anthropic)


Last Updated: 2025-10-26 09:08 JST
Status: Ready for Phase 7.2.6 Implementation