hakmem/docs/analysis/PF_STATUS_V4_202502.md
Moe Charm (CI) 2a13478dc7 Optimize C6 heavy and C7 ultra performance analysis with refined design refinements
- Update environment profile presets and visibility analysis
- Enhance small object and tiny segment v4 box implementations
- Refine C7 ultra and C6 heavy allocation strategies
- Add comprehensive performance metrics and design documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 22:57:26 +09:00

# PF/OS Baseline
# BASELINE-LOCK (Mixed 16-1024B v3 vs v4, Release)
- Common command settings (ws=400, iters=1M):
```
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
HAKMEM_BENCH_MIN_SIZE=16
HAKMEM_BENCH_MAX_SIZE=1024
```
- v3 primary configuration (C7-only v3, v4/segment all OFF, fast classify v3 ON):
  - `HAKMEM_SMALL_HEAP_V3_ENABLED=1 HAKMEM_SMALL_HEAP_V3_CLASSES=0x80 HAKMEM_SMALL_HEAP_V4_ENABLED=0 HAKMEM_SMALL_HEAP_V4_CLASSES=0 HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED=0 HAKMEM_SMALL_SEGMENT_V4_ENABLED=0`
  - Throughput: **33.7-33.9M ops/s** (2 runs, no segv/assert)
- v4 forced (C7+C6 v4 + fast classify v4, v3 OFF, segment OFF):
  - `HAKMEM_SMALL_HEAP_V3_ENABLED=0 HAKMEM_SMALL_HEAP_V3_CLASSES=0 HAKMEM_SMALL_HEAP_V4_ENABLED=1 HAKMEM_SMALL_HEAP_V4_CLASSES=0xC0 HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED=1`
  - Throughput: **32.0-32.5M ops/s**
- C7-only v4 (C6 v1, v3 OFF, fast classify v4 ON):
  - `HAKMEM_SMALL_HEAP_V4_CLASSES=0x80 HAKMEM_SMALL_HEAP_V3_ENABLED=0`
  - Throughput: **≈33.0M ops/s**
- Decision: the primary configuration for the current Mixed workload is the v3 configuration above. The v4 family stays opt-in as a research box.
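
The `*_CLASSES` values above pair `0x80` with "C7-only" and `0xC0` with "C7+C6", which suggests a bitmask where bit *i* selects class C*i*. The sketch below decodes a mask under that assumption. The helper name `classes_from_mask` is hypothetical and not part of hakmem.

```python
def classes_from_mask(mask):
    """Return the class names selected by a class bitmask.

    Assumption: bit i of the mask selects size class Ci. This is inferred
    from the 0x80 = C7 and 0xC0 = C7+C6 pairings in the notes, not from
    the allocator source.
    """
    return [f"C{i}" for i in range(mask.bit_length()) if mask & (1 << i)]


for mask in (0x80, 0xC0, 0x00):
    print(f"{mask:#04x} -> {classes_from_mask(mask)}")
```

Decoding the mask before a run is a cheap way to confirm that the v3 and v4 masks do not claim the same class at the same time.
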
# PF/OS Baseline (PF2, small-object v4 state)
- Command (Release, v4: C7+C6 forced to v4, v3 OFF):
```
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_BENCH_MIN_SIZE=16 \
HAKMEM_BENCH_MAX_SIZE=1024 \
HAKMEM_SS_OS_STATS=1 \
HAKMEM_SMALL_HEAP_V4_ENABLED=1 \
HAKMEM_SMALL_HEAP_V4_CLASSES=0xC0 \
HAKMEM_SMALL_HEAP_V3_ENABLED=0 \
perf stat -e cycles,instructions,task-clock,page-faults \
./bench_random_mixed_hakmem 1000000 400 1
```
- Results (environment: release build, ws=400, iters=1M):
  - Throughput: **31,779,973 ops/s** (time=0.031s)
  - perf stat: cycles=205,322,023 / instructions=385,092,104 / task-clock=51.40ms / page-faults=6,702
  - `[SS_OS_STATS]`: alloc=2 free=4 madvise=2 madvise_enomem=0 madvise_disabled=0 mmap_total=2
- Observations:
  - These are the pf/OS reference values with v4 (C7+C6) forced. Throughput is lower than the v3 reference (~40M), but the pf numbers and OS stats are fixed here as the starting point for PF2.
  - In future A/B runs that wire in SmallSegmentBox_v4, the metric is how far page-faults and SS_OS_STATS can be lowered from these values.
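
To make later A/B runs comparable against this baseline, it can help to reduce the raw `perf stat` counters to per-iteration figures. The sketch below does that with the PF2 numbers. The helper `derive_metrics` is hypothetical, and it assumes the first benchmark argument (`1000000`) is the iteration count, as `iters=1M` suggests. Because `perf stat` measures the whole process, startup work is included in every figure.

```python
def derive_metrics(cycles, instructions, page_faults, iters):
    """Reduce whole-process perf stat counters to comparable ratios."""
    return {
        "ipc": instructions / cycles,                   # instructions per cycle
        "cycles_per_iter": cycles / iters,              # includes process startup
        "faults_per_kiter": page_faults / iters * 1000  # page faults per 1000 iterations
    }


# Counters copied from the PF2 run (v4 forced on C7+C6).
pf2 = derive_metrics(
    cycles=205_322_023,
    instructions=385_092_104,
    page_faults=6_702,
    iters=1_000_000,
)

for name, value in pf2.items():
    print(f"{name:17s} {value:.3f}")
```
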
## PF3: smallsegment_v4 gate A/B (C7+C6 v4 forced)
- Command (Release, v4: C7+C6, v3 OFF):
```
# Settings shared by both runs
export HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
export HAKMEM_BENCH_MIN_SIZE=16
export HAKMEM_BENCH_MAX_SIZE=1024
export HAKMEM_SMALL_HEAP_V4_ENABLED=1
export HAKMEM_SMALL_HEAP_V4_CLASSES=0xC0
export HAKMEM_SMALL_HEAP_V3_ENABLED=0
export HAKMEM_SMALL_HEAP_V3_CLASSES=0
# A: segment gate OFF (the variable must precede perf, not the benchmark binary)
HAKMEM_SMALL_SEGMENT_V4_ENABLED=0 perf stat -e cycles,instructions,task-clock,page-faults \
  ./bench_random_mixed_hakmem 1000000 400 1
# B: segment gate ON
HAKMEM_SMALL_SEGMENT_V4_ENABLED=1 perf stat -e cycles,instructions,task-clock,page-faults \
  ./bench_random_mixed_hakmem 1000000 400 1
```
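- Results (ws=400, iters=1M):
  - OFF: Throughput **28,890,266 ops/s**, page-faults=6,744, task-clock=54.84ms
  - ON : Throughput **28,849,781 ops/s**, page-faults=6,746, task-clock=61.49ms
- Observations:
  - Going through the smallsegment_v4 gate leaves pf/ops essentially unchanged (the current implementation is only a thin layer over the Tiny v1 lease).
  - The "entry point via Segment" now exists, so from PF4 onward a dedicated mmap/segment split will be implemented and the A/B repeated.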
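
The "essentially unchanged" reading can be checked by expressing each ON figure as a percentage change relative to OFF. The following is a minimal sketch using the numbers from the results above. `pct_change` is a hypothetical helper, not part of the benchmark tooling.

```python
def pct_change(off, on):
    """Relative change of the ON run versus the OFF run, in percent."""
    return (on - off) / off * 100.0


# (OFF, ON) pairs copied from the PF3 results.
pf3 = {
    "throughput_ops_s": (28_890_266, 28_849_781),
    "page_faults": (6_744, 6_746),
    "task_clock_ms": (54.84, 61.49),
}

for name, (off, on) in pf3.items():
    print(f"{name:17s} OFF={off:>12,} ON={on:>12,} delta={pct_change(off, on):+.2f}%")
```

Throughput and page faults move by well under 1%, while task-clock differs by roughly 12%. Since `perf stat` covers the whole process and each side is a single run, that gap may be startup cost or run-to-run noise. Repeating each side several times would be needed to tell the two apart.
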
## DEBUG perf (cycles:u, -O0/-g, v4=C7+C6)
- Build:
```
make clean
CFLAGS='-O0 -g' USE_LTO=0 OPT_LEVEL=0 NATIVE=0 make bench_random_mixed_hakmem -j4
```
- Command:
```
HAKMEM_PROFILE=DEBUG_TINY_FRONT_PERF \
HAKMEM_BENCH_MIN_SIZE=16 \
HAKMEM_BENCH_MAX_SIZE=1024 \
HAKMEM_SMALL_HEAP_V4_ENABLED=1 \
HAKMEM_SMALL_HEAP_V4_CLASSES=0xC0 \
HAKMEM_SMALL_HEAP_V3_ENABLED=0 \
perf record -F 5000 --call-graph dwarf -e cycles:u \
-o perf.data.pf_v4 ./bench_random_mixed_hakmem 1000000 400 1
```
- Throughput: **15,173,790 ops/s** (DEBUG, ws=400, iters=1M, v4=C7+C6)
- Top self% (perf report --stdio):
  - free 14.37% (3.39% within small_heap_free_fast_v4)
  - tiny_alloc_gate_fast 13.33%
  - main 12.93%
  - malloc 7.09%
  - ss_map_lookup 4.97% / hak_super_registry_init + memset combined ~4.5%
  - small_heap_alloc_fast_v4 2.23%
  - hak_tiny_size_to_class 2.21% / tiny_route_get 2.34% / front_gate_unified_enabled 2.36% / tiny_route_is_heap_kind 2.09%
  - xorshift32 2.08%
- Notes:
  - Even with v4 forced, gate/classify/ss_map_lookup remain prominent. PF3 checks whether they drop on their own, together with pf, once the Segment/OS side is in place.
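
To put a number on "gate/classify/ss_map_lookup remain prominent", the self% entries can be summed by role. The grouping below is an assumption based only on the function names (for example, treating `main` and `xorshift32` as benchmark harness), so it is a rough sketch rather than a definitive attribution. The approximate `hak_super_registry_init + memset` figure is left out.

```python
# self% values copied from the perf report listing above.
SELF_PCT = {
    "free": 14.37,
    "tiny_alloc_gate_fast": 13.33,
    "main": 12.93,
    "malloc": 7.09,
    "ss_map_lookup": 4.97,
    "small_heap_alloc_fast_v4": 2.23,
    "hak_tiny_size_to_class": 2.21,
    "tiny_route_get": 2.34,
    "front_gate_unified_enabled": 2.36,
    "tiny_route_is_heap_kind": 2.09,
    "xorshift32": 2.08,
}

# Hypothetical grouping by name; not taken from the hakmem source.
GROUPS = {
    "gate/route": ["tiny_alloc_gate_fast", "tiny_route_get",
                   "front_gate_unified_enabled", "tiny_route_is_heap_kind"],
    "classify": ["hak_tiny_size_to_class"],
    "ss_map_lookup": ["ss_map_lookup"],
    "harness": ["main", "xorshift32"],
}

totals = {group: sum(SELF_PCT[fn] for fn in fns) for group, fns in GROUPS.items()}
front_total = totals["gate/route"] + totals["classify"] + totals["ss_map_lookup"]

for group, pct in totals.items():
    print(f"{group:14s} {pct:6.2f}%")
print(f"{'front total':14s} {front_total:6.2f}%")
```

Under this grouping, the front-end share can be tracked as a single figure across PF3 and later runs, alongside the page-fault count.
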