CPU Hotpath Overview (bench profile)
====================================
Context
-------
- Build/profile: `HAKMEM_PROFILE=bench`, `HAKMEM_TINY_PROFILE=full`, `HAKMEM_WARM_TLS_BIND_C7=2`.
- Workloads sampled:
  - 16–1024B (`./bench_random_mixed_hakmem 1000000 256 42`)
  - 129–1024B (`HAKMEM_BENCH_MIN_SIZE=129 HAKMEM_BENCH_MAX_SIZE=1024 ./bench_random_mixed_hakmem 1000000 256 42`)
- Target: identify userspace hot spots to guide C7 flattening work.
Sampling attempt (perf)
-----------------------
- `perf record -g -e cycles:u` and `perf record -g -e cpu-clock:u` both fall back to `page-faults` on this host (likely due to the `perf_event_paranoid` setting). The captures show:
  - ~97% of page-fault samples in `__memset_avx2_unaligned_erms` during warmup/zeroing.
  - Callers were `tiny_tls_sll_drain.part.0.constprop.0` and `adaptive_sizing_init` (warmup path).
- No steady-state cycle profile was available without elevated perf permissions. `perf.data` was removed after inspection (`rm perf.data`) to keep the tree clean.
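
The `perf_event_paranoid` hypothesis above can be checked directly before requesting elevated permissions. The following is a minimal sketch, assuming a Linux host; `perf_paranoid_report` is a hypothetical helper (not part of hakmem), and the level descriptions follow the mainline kernel sysctl documentation. Values above 2 come from distro patches and are only flagged here, not interpreted. Called without arguments it reads `/proc/sys/kernel/perf_event_paranoid`. Note that level 2 normally still permits user-space events such as `cycles:u`, so a reading of 2 might point to a different cause (for example, a virtualized host without a hardware PMU).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of hakmem): print the perf_event_paranoid level
# and a coarse description of what unprivileged users may profile.
perf_paranoid_report() {
    # Optional $1 overrides the sysctl path (useful for testing).
    local path="${1:-/proc/sys/kernel/perf_event_paranoid}"
    local level
    level="$(cat "$path" 2>/dev/null)" || { echo "unreadable: $path"; return 1; }
    case "$level" in
        ''|*[!0-9-]*) echo "unexpected value: $level"; return 1 ;;
    esac
    if   [ "$level" -le -1 ]; then echo "level=$level: unrestricted"
    elif [ "$level" -le 1 ];  then echo "level=$level: kernel and user profiling allowed"
    elif [ "$level" -eq 2 ];  then echo "level=$level: user-space profiling only"
    else echo "level=$level: above mainline range (distro-specific, may block unprivileged perf)"
    fi
}
```
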
What we can infer despite the limitation
----------------------------------------
- Warmup zeroing dominates the page-fault samples; steady-state alloc/free is not represented.
- Hot candidates for the next pass (from previous code inspection and bench intuition):
  - `tiny_alloc_fast` / `malloc_tiny_fast` (C7 fast path)
  - `hak_tiny_free_fast_v2`
  - `tiny_unified_cache` hit path helpers
  - `tls_sll_pop_impl` / `tiny_tls_sll_drain`
Next measurement options
------------------------
- If perf cycles are still blocked:
  - Use `perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ...` to get aggregate IPC per workload.
  - Add temporary userspace counters (Box-guarded) around C7 alloc/free hot sections to estimate per-op cycles.
- Run perf with elevated permissions or lower `perf_event_paranoid` if available.
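
The aggregate `perf stat` option can be scripted so that IPC and the branch-miss rate come out of the raw counters directly. This is a sketch under stated assumptions: `perf_csv_summary` is a hypothetical helper, and it relies only on the first three fields of `perf stat -x,` output (counter value, unit, event name). The capture command in the comment uses the standard `-x`, `-o`, and `-r` options of `perf stat`; the file name `stat.csv` is arbitrary.

```bash
#!/usr/bin/env bash
# Capture (run manually; requires perf access):
#   perf stat -x, -o stat.csv -e cycles,instructions,branches,branch-misses -r 5 -- \
#       ./bench_random_mixed_hakmem 1000000 256 42
#
# Hypothetical helper: summarize a `perf stat -x,` CSV file given as $1.
perf_csv_summary() {
    awk -F, '
        # Match on the event-name prefix so suffixes such as ":u" are accepted.
        $3 ~ /^cycles/        { cyc = $1 }
        $3 ~ /^instructions/  { ins = $1 }
        $3 ~ /^branches/      { br  = $1 }
        $3 ~ /^branch-misses/ { bm  = $1 }
        END {
            # "+ 0" forces numeric comparison, so "<not supported>" is skipped.
            if (cyc + 0 > 0) printf "IPC=%.2f\n", ins / cyc
            if (br  + 0 > 0) printf "branch-miss=%.2f%%\n", 100 * bm / br
        }' "$1"
}
```
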
Aggregate perf stat snapshot (bench profile)
--------------------------------------------
- Env: `HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2`
- Workloads (Release, 3× perf stat):
  - 16–1024B: cycles≈109.8M, instructions≈233.2M → IPC≈2.12, branches≈49.4M, branch-miss≈2.89%
  - 129–1024B: cycles≈109.6M, instructions≈230.3M → IPC≈2.10, branches≈48.8M, branch-miss≈2.90%
  - 16–1024B with `HAKMEM_TINY_C7_HOT=1` (UC hit-only + flat TLS→UC→cold):
    - cycles≈111.8M, instructions≈242.1M → IPC≈2.16, branches≈52.0M, branch-miss≈2.75%
    - RSS≈7.1MB; throughput ≈47.4–47.6M ops/s (hot=1) vs ≈47.2M (hot=0) on the same run set.
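
To compare the hot=0 and hot=1 rows of the 16–1024B workload, the relative change per counter can be computed from the figures above. A minimal sketch; `pct_delta` is a hypothetical helper, and the inputs are the rounded values reported in this note (in millions), so the results are approximate.

```bash
#!/usr/bin/env bash
# Hypothetical helper: signed percentage change from baseline $1 to new value $2.
pct_delta() {
    awk -v b="$1" -v n="$2" 'BEGIN { printf "%+.1f%%\n", 100 * (n - b) / b }'
}

pct_delta 109.8 111.8   # cycles, hot=0 -> hot=1: prints +1.8%
pct_delta 233.2 242.1   # instructions:           prints +3.8%
pct_delta 49.4  52.0    # branches:               prints +5.3%
pct_delta 47.2  47.6    # ops/s, upper bound:     prints +0.8%
```
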
Action items flowing from this note
-----------------------------------
- Proceed with design notes for C7 alloc/free flattening and UC hit simplification based on code structure.
- Keep warmup zeroing out of the steady-state loop when profiling (consider `HAKMEM_BENCH_FAST_MODE` for future captures).
Conclusion (current state)
--------------------------
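- `HAKMEM_TINY_C7_HOT` stays in the tree as an experimental flag and remains OFF by default. Turning it ON improves branch-miss only slightly, and ops/s is equal to slightly lower.
- For now, the current path ("safe + reasonably fast") is the baseline; further flattening will be considered separately if the need arises.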