


HAKMEM Tiny Pool - Performance Analysis Index

Date: 2025-10-26
Session: Post-getenv Fix Analysis
Status: Analysis Complete - Optimization Recommended

For Immediate Action

OPTIMIZATION_NEXT_STEPS.md - Implementation guide for next optimization
PERF_SUMMARY.txt - One-page executive summary

For Detailed Review

PERF_POST_GETENV_ANALYSIS.md - Complete analysis with Q&A
BOTTLENECK_COMPARISON.txt - Before/after comparison

Raw Performance Data

perf_post_getenv.data - Perf recording (1 GB)
perf_post_getenv_report.txt - Top functions report
perf_post_getenv_annotate.txt - Annotated assembly

Executive Summary

Achievement

Eliminated getenv bottleneck: 43.96% CPU → 0%
Performance improvement: +86% to +173% (60 → 120-164 M ops/sec)
Now FASTER than glibc: +15% to +57%

Current Status

New #1 Bottleneck: hak_tiny_alloc (22.75% CPU)
Verdict: Worth optimizing (2.27x the 10% threshold)
Next Target: Reduce hak_tiny_alloc to ~10% CPU

Recommendation

OPTIMIZE NEXT BOTTLENECK - Clear path to 180-250 M ops/sec (about 1.7-2.4x glibc, i.e. +70% to +140%)

File Descriptions

Analysis Documents

PERF_POST_GETENV_ANALYSIS.md (11 KB)

Purpose: Comprehensive post-getenv performance analysis
Contains:

Q1: NEW #1 Bottleneck identification (hak_tiny_alloc 22.75%)
Q2: Top 5 hotspots ranking
Q3: Optimization worthiness assessment
Q4: Root cause analysis and proposed fixes
Before/after comparison table
Final recommendation with justification

Key Finding: hak_tiny_alloc at 22.75% is 2.27x the 10% threshold → Optimize!

OPTIMIZATION_NEXT_STEPS.md (7 KB)

Purpose: Actionable implementation guide
Contains:

Root cause breakdown from perf annotate
4-phase optimization strategy (prioritized)
Implementation plan with time estimates
Success criteria and validation commands
Risk assessment
Code examples and snippets

Start Here: If you're ready to implement optimizations

PERF_SUMMARY.txt (2.6 KB)

Purpose: Quick reference card
Contains:

Performance journey (4 phases)
Optimization roadmap
Key metrics comparison
Next steps recommendation

Use Case: Quick briefing or status check

BOTTLENECK_COMPARISON.txt (4.4 KB)

Purpose: Side-by-side before/after analysis
Contains:

Top 10 CPU consumers comparison
Critical observations (4 key insights)
Performance trajectory visualization
Decision matrix (6 criteria)
Next bottleneck recommendation

Use Case: Understanding impact of getenv fix

Key Metrics at a Glance

Metric	Before (getenv bug)	After (fixed)	Change
Performance	60 M ops/sec	120-164 M ops/sec	+86-173%
vs glibc	43% slower	15-57% faster	HUGE WIN
Top bottleneck	getenv 43.96%	hak_tiny_alloc 22.75%	Different
Allocator CPU	~69%	~51%	-18 points
Wasted CPU	44% (getenv)	0%	-44 points
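
The percentages in this table can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: `pct_change`, `implied_baseline`, and `print_key_metrics` are hypothetical helpers, and the inputs are the rounded endpoints from the table. From those endpoints the throughput change works out to +100% to +173% (the +86% lower bound presumably comes from per-test data not reproduced in this index), and the "15-57% faster" row implies a glibc baseline of about 104 M ops/sec.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change from `before` to `after`, in percent. */
double pct_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}

/* Baseline implied by "value is lead_pct percent faster than the baseline". */
double implied_baseline(double value, double lead_pct) {
    return value / (1.0 + lead_pct / 100.0);
}

/* Print the derived figures for the rounded endpoints in the table. */
void print_key_metrics(void) {
    printf("60 -> 120 M ops/sec: %+.0f%%\n", pct_change(60.0, 120.0));
    printf("60 -> 164 M ops/sec: %+.0f%%\n", pct_change(60.0, 164.0));
    printf("implied glibc baseline: %.1f to %.1f M ops/sec\n",
           implied_baseline(120.0, 15.0), implied_baseline(164.0, 57.0));
}
```

Calling `print_key_metrics()` from a small `main` is enough to reproduce the figures.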

Top 5 Current Bottlenecks

Rank	Function	CPU (Self)	Status	Action
1	hak_tiny_alloc	22.75%	⚠ HIGH	OPTIMIZE
2	__random	14.00%	ℹ INFO	Benchmark overhead
3	mid_desc_lookup	12.55%	⚠ MED	Consider optimizing
4	hak_tiny_owner_slab	9.09%	✓ OK	Below threshold
5	hak_free_at	11.08% (incl. children)	ℹ INFO	Children time

Primary Target: hak_tiny_alloc (22.75%) - 2.27x the 10% threshold
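
The selection rule used above (optimize the hottest allocator-owned function whose self CPU exceeds 10%) can be written down as a small sketch. The `struct hotspot` type, `TOP_SELF`, `threshold_ratio`, and `pick_target` are hypothetical names for illustration; the data are the self-time rows of the table, with `__random` flagged as benchmark overhead and `hak_free_at` left out because its figure is children time.

```c
#include <assert.h>
#include <stddef.h>

/* Threshold used throughout this analysis: optimize above 10% self CPU. */
#define HOT_THRESHOLD_PCT 10.0

struct hotspot {            /* hypothetical record type, for illustration */
    const char *name;
    double self_pct;        /* self CPU share from the perf report */
    int is_benchmark;       /* 1 if the cost belongs to the benchmark harness */
};

/* Self-time rows from the table above. */
const struct hotspot TOP_SELF[] = {
    {"hak_tiny_alloc",      22.75, 0},
    {"__random",            14.00, 1},
    {"mid_desc_lookup",     12.55, 0},
    {"hak_tiny_owner_slab",  9.09, 0},
};

/* How many times over the threshold a function is (22.75% -> 2.275x). */
double threshold_ratio(double self_pct) {
    return self_pct / HOT_THRESHOLD_PCT;
}

/* Index of the hottest non-benchmark entry above the threshold, or -1. */
int pick_target(const struct hotspot *h, size_t n) {
    assert(h != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (h[i].is_benchmark || h[i].self_pct <= HOT_THRESHOLD_PCT)
            continue;
        if (best < 0 || h[i].self_pct > h[(size_t)best].self_pct)
            best = (int)i;
    }
    return best;
}
```

With the rows above, `pick_target(TOP_SELF, 4)` selects `hak_tiny_alloc`, and the same rule returns -1 once every allocator-owned entry is at or below 10%, which is the stop condition used later in this index.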

Optimization Roadmap

Phase 7.2.5: Eliminate getenv ✓ COMPLETE

Status: Done
Impact: -43.96% CPU, +86-173% throughput
Achievement: 60 → 120-164 M ops/sec
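
This index does not show how the getenv fix was implemented. One common way to get this effect, sketched here under that caveat, is to read the environment once and serve every later query from a cached integer. `wrap_tiny_enabled`, `wrap_tiny_load`, `g_wrap_tiny_cached`, and `g_env_lookups` are hypothetical names, and treating a leading `1` as "enabled" is an assumption about how `HAKMEM_WRAP_TINY` is parsed.

```c
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = disabled, 1 = enabled. */
static int g_wrap_tiny_cached = -1;
/* Illustration only: counts how often the environment is consulted. */
static int g_env_lookups = 0;

/* Cold path: the only place that touches getenv(). */
static int wrap_tiny_load(void) {
    assert(g_wrap_tiny_cached < 0);
    const char *v = getenv("HAKMEM_WRAP_TINY");
    g_env_lookups++;
    /* Assumed parsing rule: a value starting with '1' enables the pool. */
    g_wrap_tiny_cached = (v != NULL && v[0] == '1') ? 1 : 0;
    return g_wrap_tiny_cached;
}

/* Hot path: a single integer compare once the value is cached. */
int wrap_tiny_enabled(void) {
    return g_wrap_tiny_cached >= 0 ? g_wrap_tiny_cached : wrap_tiny_load();
}
```

The sketch is not thread-safe as written. A real allocator would more likely do the read in a one-time initializer (for example via `pthread_once` or a constructor) so that the hot path never races on the cached value.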

Phase 7.2.6: Optimize hak_tiny_alloc ← NEXT

Target: 22.75% → ~10% CPU
Method: Inline fast path, reduce stack, cache TLS
Expected: +50-70% throughput (→ 180-220 M ops/sec)
Effort: 2-4 hours

Phase 7.2.7: Optimize mid_desc_lookup (Optional)

Target: 12.55% → ~6% CPU
Method: Smaller hash table, prefetching
Expected: +10-20% additional throughput
Effort: 1-2 hours

Phase 7.2.8: Ship It!

Condition: All bottlenecks <10%
Expected Performance: 200-250 M ops/sec (about 1.9-2.4x glibc)
Status: Enable g_wrap_tiny_enabled = 1 by default

Root Cause: hak_tiny_alloc (22.75% CPU)

Hotspot Breakdown

Heavy stack usage (10.5% CPU)
- 88 bytes allocated
- Multiple stack reads/writes
- Register spilling
Repeated global reads (7.2% CPU)
- g_tiny_initialized (3.52%)
- g_wrap_tiny_enabled (0.28%)
- Should cache in TLS
Complex control flow (5.0% CPU)
- Size validation branches
- Magazine refill in main path
- Should separate fast/slow paths

Hottest Instructions (from perf annotate)

3.71%:  push %r14                       ← Register pressure
3.52%:  mov g_tiny_initialized,%r14d    ← Global read
3.53%:  mov 0x1c(%rsp),%ebp            ← Stack read
3.33%:  cmpq $0x80,0x10(%rsp)          ← Size check
3.06%:  mov %rbp,0x38(%rsp)            ← Stack write

Proposed Solution

1. Inline Fast Path (Priority: HIGH)

Impact: -5 to -7% CPU
Effort: 2-3 hours

Create inline hak_tiny_alloc_fast():

Quick size validation
Direct TLS magazine access
Fast path for magazine hit (common case)
Delegate to slow path only for refill
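
A minimal sketch of the four points above, not the actual HAKMEM code. Everything except the name `hak_tiny_alloc_fast` is an assumption made for illustration: the `tiny_magazine` layout, `t_mags`, `MAG_CAPACITY`, the 16-byte size classes, and the use of `malloc` as a stand-in for the real slab refill. The 128-byte limit is taken from the `cmpq $0x80` size check in the annotate output. `__attribute__((noinline))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_MAX_SIZE 128u   /* from the cmpq $0x80 size check */
#define TINY_CLASSES  8u     /* assumed: one class per 16 bytes */
#define MAG_CAPACITY  32u    /* assumed magazine depth */

typedef struct {
    void    *slots[MAG_CAPACITY];
    unsigned count;
} tiny_magazine;

/* One magazine per size class, per thread. */
static _Thread_local tiny_magazine t_mags[TINY_CLASSES];

/* Slow path: refill half a magazine, then serve one block from it.
 * Kept out of line so its locals do not inflate the fast path's frame. */
__attribute__((noinline))
static void *tiny_alloc_slow(unsigned cls) {
    assert(cls < TINY_CLASSES);
    tiny_magazine *m = &t_mags[cls];
    size_t block = ((size_t)cls + 1u) * 16u;
    while (m->count < MAG_CAPACITY / 2u) {
        void *p = malloc(block);          /* stand-in for the slab refill */
        if (p == NULL)
            break;
        m->slots[m->count++] = p;
    }
    return m->count ? m->slots[--m->count] : NULL;
}

/* Fast path: size check, class lookup, magazine pop. */
static inline void *hak_tiny_alloc_fast(size_t size) {
    if (size == 0 || size > TINY_MAX_SIZE)
        return NULL;                       /* not a tiny request */
    unsigned cls = (unsigned)((size - 1u) >> 4);
    tiny_magazine *m = &t_mags[cls];
    if (m->count > 0)
        return m->slots[--m->count];       /* magazine hit (common case) */
    return tiny_alloc_slow(cls);           /* refill only on a miss */
}
```

In this shape the common case touches one TLS structure and takes two predictable branches. Returning NULL for non-tiny sizes is a placeholder; the real allocator would fall through to its larger pools.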

2. Reduce Stack Usage (Priority: MEDIUM)

Impact: -3 to -4% CPU
Effort: 1-2 hours

Reduce from 88 → <32 bytes:

Fewer local variables
Pass in registers where possible
Move rarely-used locals to slow path
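
The third point, moving rarely used locals to the slow path, can be illustrated with a toy allocator. This is a single-threaded sketch under stated assumptions, not HAKMEM code: `alloc64`, `alloc64_slow`, `t_cache`, `g_arena`, and `BATCH` are all hypothetical. The scratch array lives only in the out-of-line cold function, so the hot function needs almost no stack of its own. `__attribute__((noinline, cold))` is a GCC/Clang extension.

```c
#include <assert.h>
#include <stddef.h>

#define BATCH      16
#define BLOCK_SIZE 64

/* Toy state, single-threaded for brevity. */
static void  *t_cache[BATCH];
static size_t t_cached;
static char   g_arena[BATCH * 4 * BLOCK_SIZE];   /* 64 blocks in total */
static size_t g_arena_used;

/* Cold path: owns the scratch array and the bookkeeping locals. */
__attribute__((noinline, cold))
static void *alloc64_slow(void) {
    assert(t_cached == 0);               /* only called on a cache miss */
    void  *batch[BATCH];                 /* scratch space stays out of the hot frame */
    size_t got = 0;
    while (got < BATCH && g_arena_used + BLOCK_SIZE <= sizeof g_arena) {
        batch[got++] = &g_arena[g_arena_used];
        g_arena_used += BLOCK_SIZE;
    }
    if (got == 0)
        return NULL;                     /* arena exhausted */
    for (size_t i = 0; i + 1 < got; i++) /* cache all but the last block */
        t_cache[t_cached++] = batch[i];
    return batch[got - 1];
}

/* Hot path: one counter check and one array read, no large locals. */
static inline void *alloc64(void) {
    if (t_cached > 0)
        return t_cache[--t_cached];
    return alloc64_slow();
}
```

Whether the split actually shrinks the frame depends on the compiler and flags, so it is worth confirming with `perf annotate` or GCC's `-fstack-usage` output rather than assuming it.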

3. Cache Globals in TLS (Priority: LOW)

Impact: -2 to -3% CPU
Effort: 1 hour

Cache g_tiny_initialized and g_wrap_tiny_enabled in TLS:

Read once on TLS init
Avoid repeated global reads (3.8% CPU measured: 3.52% + 0.28%)
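
A minimal sketch of the idea, assuming both globals are plain `int` flags (their real types are not shown in this index). `tiny_tls_flags`, `t_flags`, `tiny_flags_load`, and `tiny_ready` are hypothetical names. The two flags are folded into one per-thread boolean, and the result is latched only once initialization has completed, so a thread that asks early does not cache "not ready" forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Process-wide flags named in the perf annotate output (types assumed). */
int g_tiny_initialized  = 0;
int g_wrap_tiny_enabled = 0;

typedef struct {
    bool loaded;       /* true once the globals have been latched */
    bool tiny_ready;   /* g_tiny_initialized && g_wrap_tiny_enabled */
} tiny_tls_flags;

static _Thread_local tiny_tls_flags t_flags;

/* Cold path: read both globals and latch them if init has finished. */
__attribute__((noinline))
static bool tiny_flags_load(void) {
    assert(!t_flags.loaded);
    t_flags.tiny_ready = g_tiny_initialized && g_wrap_tiny_enabled;
    t_flags.loaded = (g_tiny_initialized != 0);
    return t_flags.tiny_ready;
}

/* Hot path: one TLS read instead of two global reads. */
static inline bool tiny_ready(void) {
    return t_flags.loaded ? t_flags.tiny_ready : tiny_flags_load();
}
```

Two caveats apply. A thread that has latched the flags no longer sees later changes to `g_wrap_tiny_enabled`, and a TLS read is not automatically cheaper than a global read (in a shared library it can cost more, depending on the TLS model), so the gain should be confirmed with a fresh perf run.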

Total Expected: -10 to -14% CPU, the sum of the three estimates above (22.75% → ~10%)

Success Criteria

After optimization, verify:

hak_tiny_alloc CPU: 22.75% → <12%
Total throughput: 120-164 M → 180-250 M ops/sec
Faster than glibc: +70% to +140% (vs current +15-57%)
No correctness regressions
No new bottleneck >15%

Files to Review/Modify

Source Code

/home/tomoaki/git/hakmem/hakmem_pool.c - Main implementation
/home/tomoaki/git/hakmem/hakmem_pool.h - Add inline fast path

Performance Data

/home/tomoaki/git/hakmem/perf_post_getenv.data - Current perf recording
/home/tomoaki/git/hakmem/perf_post_getenv_annotate.txt - Assembly hotspots

Benchmarks

/home/tomoaki/git/hakmem/bench_comprehensive_hakmem - Test binary
Run with: HAKMEM_WRAP_TINY=1 ./bench_comprehensive_hakmem

Timeline

Completed (Today)

Collect fresh perf data post-getenv fix
Identify new #1 bottleneck (hak_tiny_alloc)
Analyze root causes via perf annotate
Compare before/after getenv fix
Make optimization recommendation
Create implementation guide

Next Session (2-4 hours)

Implement inline fast path
Reduce stack usage
Benchmark and validate
Collect new perf data
Assess if further optimization needed

Future (Optional, 1-2 hours)

Optimize mid_desc_lookup (12.55%)
Final validation
Enable tiny pool by default
Ship it!

Questions?

Q: Should we stop optimizing and ship now?
A: No. hak_tiny_alloc at 22.75% is 2.27x the 10% threshold, a clear optimization opportunity with high ROI (50-70% gain for 2-4 hours of work).

Q: What if optimization doesn't work?
A: Low risk. We can always revert. Current performance (120-164 M ops/sec) already beats glibc, so a failed attempt leaves us no worse off.

Q: How do we know when to stop?
A: When the top bottleneck falls below 10%, or when effort exceeds returns. It is currently at 22.75%, so we are not there yet.

Q: What about the other bottlenecks?
A: mid_desc_lookup (12.55%) is a secondary target if time permits. hak_tiny_owner_slab (9.09%) is below the 10% threshold and acceptable.

Additional Resources

Previous Analysis (For Context)

PERF_ANALYSIS_RESULTS.md - Original analysis that identified getenv bug
perf_report.txt - Old data (with getenv bug)
perf_annotate_*.txt - Old annotations

Benchmark Results

See PERF_POST_GETENV_ANALYSIS.md section "Supporting Data" for:

Per-test throughput breakdown
Size class performance (16B, 32B, 64B, 128B)
Comparison with glibc baseline

Contact

Project: HAKMEM Memory Allocator
Repository: /home/tomoaki/git/hakmem
Analysis Date: 2025-10-26
Analyst: Claude Code (Anthropic)

Last Updated: 2025-10-26 09:08 JST
Status: Ready for Phase 7.2.6 Implementation
