CPU Hotpath Overview (bench profile)

Context

Build/profile: HAKMEM_PROFILE=bench, HAKMEM_TINY_PROFILE=full, HAKMEM_WARM_TLS_BIND_C7=2.
Workloads sampled:
- 16–1024B (./bench_random_mixed_hakmem 1000000 256 42)
- 129–1024B (HAKMEM_BENCH_MIN_SIZE=129 HAKMEM_BENCH_MAX_SIZE=1024 ./bench_random_mixed_hakmem 1000000 256 42)
Target: identify user‑space hot spots to guide C7 flattening work.

perf record -g -e cycles:u and perf record -g -e cpu-clock:u both fell back to sampling page-faults on this host (likely a perf_event_paranoid restriction). The captures show:
- ~97% of page-fault samples in __memset_avx2_unaligned_erms during warmup/zeroing.
- Callers were tiny_tls_sll_drain.part.0.constprop.0 and adaptive_sizing_init (warmup path).
No steady‑state cycle profile could be captured without elevated perf permissions. perf.data was removed after inspection (rm perf.data) to keep the tree clean.
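
To confirm whether perf_event_paranoid is really the cause, the level can be read directly from procfs. The sketch below is a minimal helper and not part of HAKMEM; the function names are hypothetical. The interpretation assumes the upstream kernel sysctl semantics, where levels up to 2 still permit user-space-only profiling of one's own processes, and levels 3 and above exist only on downstream-patched kernels (for example Debian or Ubuntu) that block unprivileged perf_event_open entirely.

```c
#include <stdio.h>

/* Hypothetical helper: read /proc/sys/kernel/perf_event_paranoid into *level.
 * Returns 0 on success, -1 if the file is missing or cannot be parsed. */
static int read_perf_event_paranoid(int *level) {
    FILE *f = fopen("/proc/sys/kernel/perf_event_paranoid", "r");
    if (!f) return -1;
    int parsed = (fscanf(f, "%d", level) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}

/* Assumption (upstream sysctl semantics): at level <= 2 an unprivileged
 * user can still sample user-space events such as cycles:u on their own
 * processes. Levels >= 3 come from downstream patches and block it. */
static int user_cycles_likely_allowed(int level) {
    return level <= 2;
}

/* Print a one-line diagnosis to stderr. */
static void report_perf_event_paranoid(void) {
    int level = 0;
    if (read_perf_event_paranoid(&level) != 0) {
        fprintf(stderr, "perf_event_paranoid: not readable on this host\n");
        return;
    }
    fprintf(stderr, "perf_event_paranoid=%d: cycles:u %s\n", level,
            user_cycles_likely_allowed(level)
                ? "should be permitted; look for another cause"
                : "is likely blocked for unprivileged users");
}
```

If the reported level is 2 or lower, the fallback probably has a different cause, such as a host or VM that does not expose a hardware PMU.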

Warmup zeroing dominates page‑fault samples; steady‑state alloc/free is not represented.
Hot candidates for the next pass (from previous code inspection and bench intuition):
- tiny_alloc_fast / malloc_tiny_fast (C7 fast path)
- hak_tiny_free_fast_v2
- tiny_unified_cache hit path helpers
- tls_sll_pop_impl / tiny_tls_sll_drain

If perf cycles are still blocked:
- Use perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ... to get aggregate IPC per workload.
- Add temporary user‑space counters (Box‑guarded) around C7 alloc/free hot sections to estimate per‑op cycles.
- Run perf with elevated permissions or lower perf_event_paranoid if available.
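
The temporary-counter option above could look roughly like the following. This is a sketch under stated assumptions, not existing HAKMEM code: the HOT_COUNTERS_ENABLED guard, the hot_counter_t type, and the HOT_BEGIN/HOT_END macros are all hypothetical names. On x86-64 it reads the TSC, which counts reference ticks rather than core cycles; elsewhere it falls back to monotonic nanoseconds.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical compile-time guard; build with -DHOT_COUNTERS_ENABLED=0
 * to compile the probes out entirely. */
#ifndef HOT_COUNTERS_ENABLED
#define HOT_COUNTERS_ENABLED 1
#endif

/* One counter per measured section. In a multi-threaded allocator this
 * would need to be thread-local to avoid shared-cache-line traffic. */
typedef struct {
    uint64_t ticks; /* accumulated ticks spent inside the section */
    uint64_t ops;   /* number of times the section was measured */
} hot_counter_t;

/* Tick source: TSC on x86-64, monotonic nanoseconds otherwise. */
static inline uint64_t hot_now(void) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    return __builtin_ia32_rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

#if HOT_COUNTERS_ENABLED
/* HOT_BEGIN and HOT_END must appear in the same scope. */
#define HOT_BEGIN() uint64_t hot_t0_ = hot_now()
#define HOT_END(c) \
    do { (c)->ticks += hot_now() - hot_t0_; (c)->ops++; } while (0)
#else
#define HOT_BEGIN() do { } while (0)
#define HOT_END(c)  do { (void)(c); } while (0)
#endif

/* Average ticks per measured operation; 0.0 if nothing was recorded. */
static inline double hot_ticks_per_op(const hot_counter_t *c) {
    return c->ops ? (double)c->ticks / (double)c->ops : 0.0;
}

/* Dump one counter to stderr, e.g. at bench exit. */
static inline void hot_report(const char *name, const hot_counter_t *c) {
    fprintf(stderr, "%s: ops=%llu ticks/op=%.1f\n", name,
            (unsigned long long)c->ops, hot_ticks_per_op(c));
}
```

In use, HOT_BEGIN() would go at the top of the section under test (for instance the body of the C7 alloc fast path) and HOT_END(&counter) just before it returns, with hot_report called once at the end of the bench. Reading the timer twice per operation costs on the order of the fast path itself, so the result is better treated as a relative number between variants than as an absolute per-op cycle count.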

Proceed with design notes for C7 alloc/free flattening and UC hit simplification based on code structure.
Keep warmup zeroing out of the steady‑state loop when profiling (consider HAKMEM_BENCH_FAST_MODE for future captures).

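HAKMEM_TINY_C7_HOT stays in the tree as an experimental flag and remains OFF by default. Turning it ON improves branch-misses only slightly, and ops/s is unchanged to slightly lower.
For now, the current "safe and reasonably fast" path is the baseline; further flattening will be considered separately if a need for it emerges.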