CPU Hotpath Overview (bench profile)

Context

  • Build/profile: HAKMEM_PROFILE=bench, HAKMEM_TINY_PROFILE=full, HAKMEM_WARM_TLS_BIND_C7=2.
  • Workloads sampled:
    • 16-1024B (./bench_random_mixed_hakmem 1000000 256 42)
    • 129-1024B (HAKMEM_BENCH_MIN_SIZE=129 HAKMEM_BENCH_MAX_SIZE=1024 ./bench_random_mixed_hakmem 1000000 256 42)
  • Target: identify userspace hot spots to guide C7 flattening work.

Sampling attempt (perf)

  • perf record -g -e cycles:u and perf record -g -e cpu-clock:u both fall back to page-faults on this host (likely because of the perf_event_paranoid setting; not confirmed). The captures show:
    • ~97% of page-fault samples in __memset_avx2_unaligned_erms during warmup/zeroing.
    • Callers were tiny_tls_sll_drain.part.0.constprop.0 and adaptive_sizing_init (warmup path).
  • No steady-state cycle profile was available without elevated perf permissions. The perf data was removed after inspection (rm perf.data) to keep the tree clean.
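
To record which restriction level was in effect for a capture, the setting can be read directly. The following is a minimal sketch in C, assuming a Linux host that exposes the value at /proc/sys/kernel/perf_event_paranoid; the helper name read_perf_paranoid is hypothetical and not part of hakmem.

```c
#include <stdio.h>

/* Hypothetical helper (not part of hakmem): read the integer stored in a
 * sysctl-style file such as /proc/sys/kernel/perf_event_paranoid.
 * Returns 0 and stores the value in *level on success, -1 on failure. */
static int read_perf_paranoid(const char *path, int *level)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;                      /* file missing or unreadable */
    int ok = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

Calling read_perf_paranoid("/proc/sys/kernel/perf_event_paranoid", &lvl) and logging lvl next to each capture documents the setting. It only records the value; it does not by itself confirm why perf fell back to page-faults on this host.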

What we can infer despite the limitation

  • Warmup zeroing dominates page-fault samples; steady-state alloc/free is not represented.
  • Hot candidates for the next pass (from previous code inspection and bench intuition):
    • tiny_alloc_fast / malloc_tiny_fast (C7 fast path)
    • hak_tiny_free_fast_v2
    • tiny_unified_cache hit path helpers
    • tls_sll_pop_impl / tiny_tls_sll_drain

Next measurement options

  • If perf cycles are still blocked:
    • Use perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ... to get aggregate IPC per workload.
    • Add temporary userspace counters (Box-guarded) around C7 alloc/free hot sections to estimate per-op cycles.
    • Run perf with elevated permissions or lower perf_event_paranoid if available.
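
The temporary-counter option could look roughly like the sketch below. Every name in it (HOT_COUNTERS_ENABLED, hot_counter_t, HOT_SECTION_BEGIN/END, hot_ticks_per_op) is hypothetical; the actual Box guard and the C7 function bodies in hakmem are not shown in this note and may differ. On x86-64 it reads the TSC, which ticks at a constant reference rate and therefore only approximates core cycles; elsewhere it falls back to nanoseconds.

```c
/* Request POSIX declarations (clock_gettime) when built with strict -std=c11. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdint.h>
#include <time.h>

/* Hypothetical compile-time guard; build with -DHOT_COUNTERS_ENABLED=0
 * to compile the instrumentation out entirely. */
#ifndef HOT_COUNTERS_ENABLED
#define HOT_COUNTERS_ENABLED 1
#endif

typedef struct {
    uint64_t ops;    /* number of measured sections */
    uint64_t ticks;  /* accumulated ticks (TSC on x86-64, ns elsewhere) */
} hot_counter_t;

/* Read a cheap, increasing tick source. */
static inline uint64_t hot_ticks_now(void)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_ia32_rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

#if HOT_COUNTERS_ENABLED
#define HOT_SECTION_BEGIN() uint64_t hot_t0_ = hot_ticks_now()
#define HOT_SECTION_END(c)                          \
    do {                                            \
        (c)->ticks += hot_ticks_now() - hot_t0_;    \
        (c)->ops++;                                 \
    } while (0)
#else
#define HOT_SECTION_BEGIN() ((void)0)
#define HOT_SECTION_END(c)  ((void)(c))
#endif

/* Average ticks per measured section; 0.0 if nothing was measured. */
static inline double hot_ticks_per_op(const hot_counter_t *c)
{
    return c->ops ? (double)c->ticks / (double)c->ops : 0.0;
}
```

In use, HOT_SECTION_BEGIN(); would be placed at the top of the alloc or free hot section and HOT_SECTION_END(&counter); just before it returns, with one counter per thread to avoid shared writes. Reading the tick source has a cost of its own that is not negligible next to a short fast path, so the result is better treated as a relative estimate between variants than as an absolute per-op cycle count.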

Aggregate perf stat snapshot (bench profile)

  • Env: HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2
  • Workloads (Release, 3× perf stat):
    • 16-1024B: cycles≈109.8M, instructions≈233.2M → IPC≈2.12, branches≈49.4M, branch-miss≈2.89%
    • 129-1024B: cycles≈109.6M, instructions≈230.3M → IPC≈2.10, branches≈48.8M, branch-miss≈2.90%
    • 16-1024B with HAKMEM_TINY_C7_HOT=1 (UC hit-only + flat TLS→UC→cold):
      • cycles≈111.8M, instructions≈242.1M → IPC≈2.16, branches≈52.0M, branch-miss≈2.75%
      • RSS≈7.1MB; throughput ≈47.4-47.6M ops/s (hot=1) vs ≈47.2M ops/s (hot=0) on the same run set.
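
The IPC column and the hot=1 versus hot=0 comparison follow from simple ratios of the aggregate counts. A small sketch in C, with hypothetical helper names, makes the arithmetic explicit:

```c
/* Instructions per cycle from aggregate perf stat counts
 * (both arguments in the same unit, e.g. millions). */
static double ipc(double instructions, double cycles)
{
    return instructions / cycles;
}

/* Relative change of `after` versus `before`, in percent. */
static double pct_delta(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

Applied to the 16-1024B rows above, ipc(233.2, 109.8) gives about 2.12, and moving from hot=0 to hot=1 works out to roughly +1.8% cycles (pct_delta(109.8, 111.8)), +3.8% instructions (pct_delta(233.2, 242.1)), and +5.3% branches (pct_delta(49.4, 52.0)). These are derived only from the rounded figures in this snapshot, so the last digit should not be over-read.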

Action items flowing from this note

  • Proceed with design notes for C7 alloc/free flattening and UC hit simplification based on code structure.
  • Keep warmup zeroing out of the steady-state loop when profiling (consider HAKMEM_BENCH_FAST_MODE for future captures).

Conclusion (current state)

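  • Keep HAKMEM_TINY_C7_HOT as an experimental flag and leave it OFF by default. Turning it ON improves branch-miss only slightly, and ops/s is equal to slightly lower.
  • For now, treat the current "safe and reasonably fast" path as the baseline; further flattening will be considered separately if a need arises.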