Files
hakmem/docs/analysis/TINY_PERF_PROFILE_STEP1.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 完了:環境変数整理 + fprintf デバッグガード

ENV変数削除(BG/HotMag系):
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除(旧レポート・重複docs)

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作)
- ENV整理による機能影響なし
- Debug出力は一部残存(次phase で対応)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00

9.0 KiB
Raw Blame History

Tiny Allocator: Perf Profile Step 1

Date: 2025-11-14 Workload: bench_random_mixed_hakmem 500K iterations, 256B blocks Throughput: 8.31M ops/s (9.3x slower than System malloc)


Perf Profiling Results

Configuration

perf record -F 999 -g -- ./out/release/bench_random_mixed_hakmem 500000 256 42
perf report --stdio --no-children

Samples: 90 samples, 285M cycles


Top 10 Functions (Overall)

Rank Overhead Function Location Notes
1 5.57% __pte_offset_map_lock kernel Page table management
2 4.82% main user Benchmark loop (mmap/munmap)
3 4.52% tiny_alloc_fast user Alloc hot path
4 4.20% _raw_spin_trylock kernel Kernel spinlock
5 3.95% do_syscall_64 kernel Syscall handler
6 3.65% classify_ptr user Free path (pointer classification)
7 3.11% __mem_cgroup_charge kernel Memory cgroup
8 2.89% free user Free wrapper
9 2.86% do_vmi_align_munmap kernel munmap handling
10 1.84% __alloc_pages kernel Page allocation

User-Space Hot Paths Analysis

Alloc Path (Total: ~5.9%)

tiny_alloc_fast                    4.52%  ← Main alloc fast path
  ├─ hak_free_at.part.0            3.18%  (called from alloc?)
  └─ hak_tiny_alloc_fast_wrapper   1.34%  ← Wrapper overhead

hak_tiny_alloc_fast_wrapper        1.35%  (standalone)

Total alloc overhead:              ~5.86%

Free Path (Total: ~8.0%)

classify_ptr                       3.65%  ← Pointer classification (region lookup)
free                               2.89%  ← Free wrapper
  ├─ main                          1.49%
  └─ malloc                        1.40%

hak_free_at.part.0                 1.43%  ← Free implementation

Total free overhead:               ~7.97%

Total User-Space Hot Path

Alloc:  5.86%
Free:   7.97%
Total:  13.83%  ← User-space allocation overhead

Kernel overhead: 86.17% (initialization, syscalls, page faults)


Key Findings

1. ss_refill_fc_fill は Top 10 に不在

Interpretation: Front cache (FC) hit rate が高い

  • Refill pathss_refill_fc_fillがボトルネックになっていない
  • Most allocations served from TLS cache (fast path)

2. Alloc vs Free Balance

Alloc path: 5.86%  (tiny_alloc_fast dominant)
Free path:  7.97%  (classify_ptr + free wrapper)

Free path is 36% more expensive than alloc path!

Potential optimization target: classify_ptr (3.65%)

  • Pointer region lookup for routing (Tiny vs Pool vs ACE)
  • Currently uses mincore/registry lookup

3. Kernel Overhead Dominates (86%)

Breakdown:

  • Initialization: page faults, memset, pthread_once (~40-50%)
  • Syscalls: mmap, munmap from benchmark setup (~20-30%)
  • Memory management: page table ops, cgroup, etc. (~10-20%)

Impact: User-space optimization が直接性能に反映されにくい

  • 500K iterations でも初期化の影響が大きい
  • Real workload では user-space overhead の比率が高くなる可能性

4. Front Cache Efficiency

Evidence:

  • ss_refill_fc_fill not in top 10 → FC hit rate high
  • tiny_alloc_fast only 4.52% → Fast path is efficient

Implication: Front cache tuning の効果は限定的かもしれない

  • Current FC parameters already near-optimal for this workload
  • Drain interval tuning の方が効果的な可能性

Next Steps (Following User Plan)

Step 1: Perf Profile Complete

Conclusion:

  • Alloc hot path: tiny_alloc_fast (4.52%)
  • Free hot path: classify_ptr (3.65%) + free (2.89%)
  • ss_refill_fc_fill: Not in top 10 (FC hit rate high)
  • Kernel overhead: 86% (initialization + syscalls)

Step 2: Drain Interval A/B Testing

Target: Find optimal TLS_SLL_DRAIN interval

Test Matrix:

# Current default: 1024
export HAKMEM_TINY_SLL_DRAIN_INTERVAL=512
export HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024  # baseline
export HAKMEM_TINY_SLL_DRAIN_INTERVAL=2048

Metrics to Compare:

  • Throughput (ops/s) - primary metric
  • Syscalls (strace -c) - mmap/munmap/mincore count
  • CPU overhead - user vs kernel time

Expected Impact:

  • Lower interval (512): More frequent drain → less memory, potentially more overhead
  • Higher interval (2048): Less frequent drain → more memory, potentially better throughput

Workload Sizes: 128B, 256B (hot classes)

Step 3: Front Cache Tuning (if needed)

ENV Variables:

HAKMEM_TINY_FAST_CAP        # FC capacity per class
HAKMEM_TINY_REFILL_COUNT_HOT  # Refill batch size for hot classes
HAKMEM_TINY_REFILL_COUNT_MID  # Refill batch size for mid classes

Metrics:

  • FC hit/miss stats (g_front_fc_hit/miss or FRONT_STATS)
  • Throughput impact

Step 4: ss_refill_fc_fill Optimization (if needed)

Only if:

  • Step 2/3 improvements are minimal
  • Deeper profiling shows ss_refill_fc_fill as bottleneck

Potential optimizations:

  • Remote drain trigger frequency
  • Header restore efficiency
  • Batch processing in refill

Detailed Call Graphs

tiny_alloc_fast (4.52%)

tiny_alloc_fast (4.52%)
├─ Called from hak_free_at.part.0 (3.18%)  ← Recursive call?
│  └─ 0
└─ hak_tiny_alloc_fast_wrapper (1.34%)     ← Direct call

Note: Recursive call from free path is unexpected - may indicate:

  • Allocation during free (e.g., metadata growth)
  • Stack trace artifact from perf sampling

classify_ptr (3.65%)

classify_ptr (3.65%)
└─ main

Function: Determine allocation source (Tiny vs Pool vs ACE)

  • Uses mincore/registry lookup
  • Called on every free operation
  • Optimization opportunity: Cache classification results in pointer header/metadata

free (2.89%)

free (2.89%)
├─ main (1.49%)    ← Direct free calls from benchmark
└─ malloc (1.40%)  ← Free from realloc path?

Profiling Limitations

1. Short-Lived Workload

Iterations: 500K
Runtime:    60ms
Samples:    90 samples

Impact: Initialization dominates, hot path underrepresented

Solution: Profile longer workloads (5M-10M iterations) or steady-state benchmarks

2. Perf Sampling Frequency

-F 999 (999 Hz sampling)

Impact: May miss very fast functions (< 1ms)

Solution: Use higher frequency (-F 9999) or event-based sampling

3. Compiler Optimizations

-O3 -flto (Link-Time Optimization)

Impact: Inlining may hide function overhead

Solution: Check annotated assembly (perf annotate) for inlined functions


Recommendations

Immediate Actions (Step 2)

  1. Drain Interval A/B Testing (ENV-only, no code changes)

    • Test: 512 / 1024 / 2048
    • Workloads: 128B, 256B
    • Metrics: Throughput + syscalls
  2. Choose Default based on:

    • Best throughput for common sizes (128-256B)
    • Acceptable memory overhead
    • Syscall count reduction

Conditional Actions (Step 3)

If Step 2 improvements < 10%:

  • Front cache tuning (FAST_CAP / REFILL_COUNT)
  • Measure FC hit/miss stats

Future Optimizations (Step 4+)

If classify_ptr remains hot (after Step 2/3):

  • Cache classification in pointer metadata
  • Use header bits to encode region type
  • Reduce mincore/registry lookups

If kernel overhead remains > 80%:

  • Consider longer-running benchmarks
  • Focus on real workload profiling
  • Optimize initialization path separately

Appendix: Raw Perf Data

Command Used

perf record -F 999 -g -o perf_tiny_256b_long.data \
  -- ./out/release/bench_random_mixed_hakmem 500000 256 42

perf report -i perf_tiny_256b_long.data --stdio --no-children

Sample Output

Samples: 90  of event 'cycles:P'
Event count (approx.): 285,508,084

Overhead  Command          Shared Object              Symbol
     5.57%  bench_random_mi  [kernel.kallsyms]          [k] __pte_offset_map_lock
     4.82%  bench_random_mi  bench_random_mixed_hakmem  [.] main
     4.52%  bench_random_mi  bench_random_mixed_hakmem  [.] tiny_alloc_fast
     4.20%  bench_random_mi  [kernel.kallsyms]          [k] _raw_spin_trylock
     3.95%  bench_random_mi  [kernel.kallsyms]          [k] do_syscall_64
     3.65%  bench_random_mi  bench_random_mixed_hakmem  [.] classify_ptr
     3.11%  bench_random_mi  [kernel.kallsyms]          [k] __mem_cgroup_charge
     2.89%  bench_random_mi  bench_random_mixed_hakmem  [.] free

Conclusion

Step 1 Complete

Hot Spot Summary:

  • Alloc: tiny_alloc_fast (4.52%) - already efficient
  • Free: classify_ptr (3.65%) + free (2.89%) - potential optimization
  • Refill: ss_refill_fc_fill - not in top 10 (high FC hit rate)

Kernel overhead: 86% (initialization + syscalls dominate short workload)

Recommended Next Step: Step 2 - Drain Interval A/B Testing

  • ENV-only tuning, no code changes
  • Quick validation of performance impact
  • Data-driven default selection

Expected Impact: +5-15% throughput improvement (conservative estimate)