CPU Hotpath Overview (bench profile)
====================================

Context
-------
- Build/profile: `HAKMEM_PROFILE=bench`, `HAKMEM_TINY_PROFILE=full`, `HAKMEM_WARM_TLS_BIND_C7=2`.
- Workloads sampled:
  - 16-1024B (`./bench_random_mixed_hakmem 1000000 256 42`)
  - 129-1024B (`HAKMEM_BENCH_MIN_SIZE=129 HAKMEM_BENCH_MAX_SIZE=1024 ./bench_random_mixed_hakmem 1000000 256 42`)
- Target: identify userspace hot spots to guide C7 flattening work.

Sampling attempt (perf)
-----------------------
- `perf record -g -e cycles:u` and `perf record -g -e cpu-clock:u` both fall back to `page-faults` on this host (likely due to the `perf_event_paranoid` setting). The captures show:
  - ~97% of page-fault samples in `__memset_avx2_unaligned_erms` during warmup/zeroing.
  - Callers were `tiny_tls_sll_drain.part.0.constprop.0` and `adaptive_sizing_init` (warmup path).
- No steady-state cycle profile was available without elevated perf permissions. `perf.data` was removed after inspection (`rm perf.data`) to keep the tree clean.
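
The `perf_event_paranoid` suspicion above can be checked before the next capture by reading the sysctl file directly. The following is a minimal sketch, assuming a Linux host that exposes `/proc/sys/kernel/perf_event_paranoid`; the helper names `read_int_file` and `report_perf_event_paranoid` are hypothetical and not part of hakmem. Higher values restrict unprivileged perf use more strongly; the exact thresholds depend on the kernel and distribution, so the sketch only reports the number and does not interpret it.

```c
#include <stdio.h>

/* Hypothetical helper, not part of hakmem.
 * Reads a single integer from a sysctl-style file such as
 * /proc/sys/kernel/perf_event_paranoid.
 * Returns 0 and stores the value in *out on success, -1 otherwise. */
static int read_int_file(const char *path, int *out) {
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1; /* missing file or no permission */
    }
    int parsed = (fscanf(f, "%d", out) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/* Prints the current setting, or a note if it cannot be read. */
static void report_perf_event_paranoid(void) {
    int level;
    if (read_int_file("/proc/sys/kernel/perf_event_paranoid", &level) == 0) {
        printf("perf_event_paranoid = %d\n", level);
    } else {
        printf("perf_event_paranoid not readable on this system\n");
    }
}
```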

What we can infer despite the limitation
----------------------------------------
- Warmup zeroing dominates page-fault samples; steady-state alloc/free is not represented.
- Hot candidates for the next pass (from previous code inspection and bench intuition):
  - `tiny_alloc_fast` / `malloc_tiny_fast` (C7 fast path)
  - `hak_tiny_free_fast_v2`
  - `tiny_unified_cache` hit path helpers
  - `tls_sll_pop_impl` / `tiny_tls_sll_drain`

Next measurement options
------------------------
- If perf cycles are still blocked:
  - Use `perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ...` to get aggregate IPC per workload.
  - Add temporary userspace counters (Box-guarded) around C7 alloc/free hot sections to estimate per-op cycles.
- Run perf with elevated permissions or lower `perf_event_paranoid` if available.
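
The temporary userspace counter option could look roughly like the sketch below. It is an illustration only: `hot_counter_t`, `hot_ticks_now`, `hot_counter_add`, and `hot_counter_ticks_per_op` are hypothetical names, not existing hakmem symbols, and the Box guard is left out. The sketch assumes GCC or Clang. On x86 it reads the TSC via `__rdtsc()`, which counts reference ticks, not core cycles, so the result is an estimate that also includes the cost of the two timestamp reads. On other targets it falls back to nanoseconds from `clock_gettime`. A hot section would be bracketed with two `hot_ticks_now()` calls and the pair passed to `hot_counter_add()`.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#else
#include <time.h>
#endif

/* Hypothetical per-section counter; one instance per measured hot section. */
typedef struct {
    uint64_t ticks; /* accumulated ticks spent inside the section */
    uint64_t ops;   /* number of measured operations */
} hot_counter_t;

/* Current timestamp: TSC reference ticks on x86, nanoseconds elsewhere. */
static inline uint64_t hot_ticks_now(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Record one operation that ran from `start` to `end`. */
static inline void hot_counter_add(hot_counter_t *c, uint64_t start, uint64_t end) {
    c->ticks += end - start;
    c->ops += 1;
}

/* Average ticks per recorded operation; 0.0 if nothing was recorded. */
static inline double hot_counter_ticks_per_op(const hot_counter_t *c) {
    return c->ops ? (double)c->ticks / (double)c->ops : 0.0;
}
```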

Aggregate perf stat snapshot (bench profile)
--------------------------------------------
- Env: `HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2`
- Workloads (Release, 3× perf stat):
  - 16-1024B: cycles≈109.8M, instructions≈233.2M → IPC≈2.12, branches≈49.4M, branch-miss≈2.89%
  - 129-1024B: cycles≈109.6M, instructions≈230.3M → IPC≈2.10, branches≈48.8M, branch-miss≈2.90%
  - 16-1024B with `HAKMEM_TINY_C7_HOT=1` (UC hit-only + flat TLS→UC→cold):
    - cycles≈111.8M, instructions≈242.1M → IPC≈2.16, branches≈52.0M, branch-miss≈2.75%
    - RSS≈7.1MB; throughput ≈47.4-47.6M ops/s (hot=1) vs ≈47.2M (hot=0) on the same run set.
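
The IPC figures in this snapshot are instructions divided by cycles, and the same arithmetic shows what the hot flag costs: for the 16-1024B workload, instructions rise by about 3.8% and cycles by about 1.8% with `HAKMEM_TINY_C7_HOT=1`. The sketch below reproduces these numbers from the rounded values listed above, so its results match the note only to within rounding. The function names `ipc`, `pct_change`, and `print_hot_flag_delta` are hypothetical, introduced for illustration.

```c
#include <stdio.h>

/* Instructions retired per cycle; 0.0 if cycles is not positive. */
static double ipc(double cycles, double instructions) {
    return cycles > 0.0 ? instructions / cycles : 0.0;
}

/* Relative change of `after` versus `before`, in percent. */
static double pct_change(double before, double after) {
    return before != 0.0 ? (after - before) / before * 100.0 : 0.0;
}

/* Prints IPC for hot=0 and hot=1 plus the relative cost of the flag. */
static void print_hot_flag_delta(void) {
    /* Values in millions, copied from the 16-1024B rows of the snapshot. */
    const double cyc_off = 109.8, ins_off = 233.2; /* hot=0 */
    const double cyc_on = 111.8, ins_on = 242.1;   /* hot=1 */

    printf("IPC hot=0: %.2f\n", ipc(cyc_off, ins_off));
    printf("IPC hot=1: %.2f\n", ipc(cyc_on, ins_on));
    printf("cycles: %+.1f%%, instructions: %+.1f%%\n",
           pct_change(cyc_off, cyc_on), pct_change(ins_off, ins_on));
}
```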

Action items flowing from this note
-----------------------------------
- Proceed with design notes for C7 alloc/free flattening and UC hit simplification based on code structure.
- Keep warmup zeroing out of the steady-state loop when profiling (consider `HAKMEM_BENCH_FAST_MODE` for future captures).

Conclusion (current state)
--------------------------
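- Keep `HAKMEM_TINY_C7_HOT` as an experimental flag and continue to run with it OFF by default. Turning it ON improves branch-miss only slightly, and ops/s stays about the same or drops slightly.
- For now, treat the current "safe and reasonably fast" path as the baseline; further flattening will be considered separately if the need arises.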