CPU Hotpath Overview (bench profile)
Context
- Build/profile: `HAKMEM_PROFILE=bench`, `HAKMEM_TINY_PROFILE=full`, `HAKMEM_WARM_TLS_BIND_C7=2`.
- Workloads sampled:
  - 16–1024B (`./bench_random_mixed_hakmem 1000000 256 42`)
  - 129–1024B (`HAKMEM_BENCH_MIN_SIZE=129 HAKMEM_BENCH_MAX_SIZE=1024 ./bench_random_mixed_hakmem 1000000 256 42`)
- Target: identify user‑space hot spots to guide C7 flattening work.
Sampling attempt (perf)
- `perf record -g -e cycles:u` and `perf record -g -e cpu-clock:u` both fall back to `page-faults` on this host (likely `perf_event_paranoid`). The captures show:
  - ~97% of page-fault samples in `__memset_avx2_unaligned_erms` during warmup/zeroing.
  - Callers were `tiny_tls_sll_drain.part.0.constprop.0` and `adaptive_sizing_init` (warmup path).
- No steady‑state cycle profile was available without elevated perf permissions.
- `perf.data` was removed after inspection (`rm perf.data`) to keep the tree clean.
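Since the fallback to `page-faults` is only suspected to come from `perf_event_paranoid`, one quick check before the next capture is to read that sysctl. The following is a minimal C sketch, assuming a Linux host that exposes `/proc/sys/kernel/perf_event_paranoid`; the helper names `parse_paranoid_level` and `read_paranoid_level` are hypothetical and not part of HAKMEM.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse the text content of perf_event_paranoid.
 * Returns 0 and stores the level on success, -1 if no integer was found. */
static int parse_paranoid_level(const char *text, long *level) {
    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (end == text || errno != 0) {
        return -1;
    }
    *level = v;
    return 0;
}

/* Hypothetical helper: read the level from a sysctl file such as
 * /proc/sys/kernel/perf_event_paranoid (path assumed, Linux only). */
static int read_paranoid_level(const char *path, long *level) {
    char buf[32];
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        return -1;
    }
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    return line != NULL ? parse_paranoid_level(buf, level) : -1;
}
```

Higher values are more restrictive for unprivileged users. The exact thresholds depend on the kernel and distribution, so the value is a hint for why `cycles:u` was refused rather than a full diagnosis.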
What we can infer despite the limitation
- Warmup zeroing dominates page‑fault samples; steady‑state alloc/free is not represented.
- Hot candidates for the next pass (from previous code inspection and bench intuition):
  - `tiny_alloc_fast` / `malloc_tiny_fast` (C7 fast path)
  - `hak_tiny_free_fast_v2`
  - `tiny_unified_cache` hit path helpers
  - `tls_sll_pop_impl` / `tiny_tls_sll_drain`
Next measurement options
- If perf cycles are still blocked:
  - Use `perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ...` to get aggregate IPC per workload.
  - Add temporary user‑space counters (Box‑guarded) around C7 alloc/free hot sections to estimate per‑op cycles.
- Run perf with elevated permissions or lower `perf_event_paranoid` if available.
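The temporary user‑space counters mentioned above could look roughly like the sketch below. It is an illustration under stated assumptions, not HAKMEM code: the names `hot_ticks`, `hot_counter_t`, `hot_counter_add`, and `hot_counter_avg` are hypothetical, the Box guard is left out, and on x86 the tick source is the TSC (reference ticks, which only approximate core cycles). On other targets it falls back to nanoseconds from `timespec_get`.

```c
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* x86: read the time stamp counter (reference ticks, not exact core cycles). */
static inline uint64_t hot_ticks(void) { return __rdtsc(); }
#else
/* Fallback: wall-clock nanoseconds via C11 timespec_get. */
static inline uint64_t hot_ticks(void) {
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Hypothetical accumulator for one hot section (e.g. the C7 alloc path). */
typedef struct {
    uint64_t ticks; /* total ticks spent inside the section */
    uint64_t ops;   /* number of measured operations */
} hot_counter_t;

/* Record one measured operation given its start and end tick values. */
static inline void hot_counter_add(hot_counter_t *c, uint64_t start, uint64_t end) {
    c->ticks += end - start;
    c->ops += 1;
}

/* Average ticks per operation; 0.0 when nothing was recorded. */
static inline double hot_counter_avg(const hot_counter_t *c) {
    return c->ops ? (double)c->ticks / (double)c->ops : 0.0;
}
```

A call site would take `hot_ticks()` before and after the section and pass both values to `hot_counter_add`. Reading the tick source twice per operation adds overhead comparable to a short fast path, so the averages are better read as relative numbers between variants than as absolute per‑op costs. In a multithreaded run each thread would need its own counter.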
Aggregate perf stat snapshot (bench profile)
- Env: `HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2`
- Workloads (Release, 3× perf stat):
  - 16–1024B: cycles≈109.8M, instructions≈233.2M → IPC≈2.12, branches≈49.4M, branch-miss≈2.89%
  - 129–1024B: cycles≈109.6M, instructions≈230.3M → IPC≈2.10, branches≈48.8M, branch-miss≈2.90%
  - 16–1024B with `HAKMEM_TINY_C7_HOT=1` (UC hit-only + flat TLS→UC→cold):
    - cycles≈111.8M, instructions≈242.1M → IPC≈2.16, branches≈52.0M, branch-miss≈2.75%
- RSS≈7.1MB; throughput ≈47.4–47.6M ops/s (hot=1) vs ≈47.2M (hot=0) on the same runset.
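The IPC figures above are simply instructions divided by cycles, and the hot=0 and hot=1 rows can be compared as relative deltas. The sketch below shows that arithmetic using the rounded 16–1024B values from this snapshot; the type and function names (`perf_row_t`, `perf_ipc`, `perf_delta_pct`, `perf_compare`) are hypothetical helpers for illustration only.

```c
#include <stdio.h>

/* One perf stat row in raw event counts. */
typedef struct {
    double cycles;
    double instructions;
    double branches;
} perf_row_t;

/* Instructions per cycle for one row. */
static double perf_ipc(const perf_row_t *r) {
    return r->instructions / r->cycles;
}

/* Relative change of value against base, in percent. */
static double perf_delta_pct(double base, double value) {
    return (value - base) / base * 100.0;
}

/* Print IPC for both rows and the relative deltas of hot against base. */
static void perf_compare(const perf_row_t *base, const perf_row_t *hot) {
    printf("IPC base=%.2f hot=%.2f\n", perf_ipc(base), perf_ipc(hot));
    printf("cycles %+.1f%%, instructions %+.1f%%, branches %+.1f%%\n",
           perf_delta_pct(base->cycles, hot->cycles),
           perf_delta_pct(base->instructions, hot->instructions),
           perf_delta_pct(base->branches, hot->branches));
}
```

With the rounded inputs (109.8M/233.2M/49.4M for hot=0 and 111.8M/242.1M/52.0M for hot=1), hot=1 executes about 3.8% more instructions and about 5.3% more branches for about 1.8% more cycles. The higher IPC therefore reflects more work retired per cycle, not fewer cycles overall.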
Action items flowing from this note
- Proceed with design notes for C7 alloc/free flattening and UC hit simplification based on code structure.
- Keep warmup zeroing out of the steady‑state loop when profiling (consider `HAKMEM_BENCH_FAST_MODE` for future captures).
Conclusion (current state)
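- `HAKMEM_TINY_C7_HOT` stays as an experimental flag and remains OFF by default. Turning it ON improves branch-miss only slightly, and ops/s is equal to slightly lower.
- For now, the current "safe and reasonably fast" path is the baseline; further flattening will be considered separately if a need for it emerges.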