Fix: Add alloc_gate_stats_box.o to BENCH_HAKMEM_OBJS_BASE; Document PERF-ULTRA-REBASE-4 findings

Phase PERF-ULTRA-REBASE-4 confirmed:
- dispatcher (25.48%) and alloc gate (21.13%) already heavily optimized via snapshot
- New bottleneck: C7 ULTRA refill path (tiny_c7_ultra_page_of at 1.78%)
- Recommendation: Next optimize C7 ULTRA refill for +1-2% overall gain
Moe Charm (CI)
2025-12-11 21:36:58 +09:00
parent 0f15adae4e
commit 9fb2240319
2 changed files with 95 additions and 1 deletions

@@ -863,3 +863,97 @@ C7 ULTRA alloc: with the optimization inside tiny_c7_ultra.c, both self% and throughput are nearly
**Details**: see `docs/analysis/ALLOC_GATE_ANALYSIS.md`
---
## Phase PERF-ULTRA-REBASE-4: Re-measurement and confirmation (2025-12-11)
**Goal**: Having confirmed that the dispatcher and alloc gate are already optimized, capture a fresh perf profile.
**Measurement conditions**:
- ENV: all OFF (default); baseline without stats
- Workload: Mixed 16-1024B, 10M iter, ws=8192
- perf record: cycles:u, -F 5000, dwarf call-graph
### Hot path analysis (self%, 1K samples)
| Rank | Function/Path | self% | Change |
|------|---------------|-------|--------|
| **#1** | **free** | **25.48%** | 0.74% vs REBASE-3 |
| **#2** | **malloc** | **21.13%** | 0% (unchanged) |
| **#3** | **tiny_c7_ultra_alloc** | **7.66%** | ±0% (unchanged) |
| #4 | tiny_c7_ultra_free | 3.50% | 0.6% (optimization effect) |
| #5 | so_free | 2.47% | (newly visible) |
| #6 | so_alloc_fast | 2.39% | (newly visible) |
| **#7** | **tiny_c7_ultra_page_of** | **1.78%** | **NEW: refill path** |
| #8 | so_alloc | 1.21% | (newly visible) |
| #9 | classify_ptr | 1.15% | (newly visible) |
### Statistics (Mixed, 1M iter, ws=400)
**Alloc Gate Stats**:
```
total=542,019 calls
size2class=0 calls ✅ (fully eliminated)
route_calls=0 calls ✅ (fully eliminated)
env_checks=275,089 (structural cost)
class distribution: C7=50.8%, C6=25.3%, C5=12.7%, C4=6.4%, C2-C3=4.8%
```
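Counters like the ones reported above could be kept in a small struct that does nothing unless stats are enabled. The following is a minimal sketch under that assumption, not the actual `alloc_gate_stats_box` code; every name ending in `_sketch` is hypothetical and only the field names mirror the report.

```c
#include <stdint.h>

/* Hypothetical counter layout; the fields mirror the report above,
 * not the real alloc_gate_stats_box implementation. */
typedef struct {
    uint64_t total;
    uint64_t size2class;
    uint64_t route_calls;
    uint64_t env_checks;
    uint64_t per_class[8];   /* assumed classes C0..C7 */
} gate_stats_sketch_t;

static gate_stats_sketch_t g_gate_stats_sketch;
static int g_gate_stats_enabled_sketch = 0;  /* off by default, as in the baseline run */

/* Record one allocation; returns immediately when stats are disabled. */
static inline void gate_stat_alloc_sketch(unsigned cls) {
    if (!g_gate_stats_enabled_sketch) return;
    g_gate_stats_sketch.total++;
    if (cls < 8) g_gate_stats_sketch.per_class[cls]++;
}

/* Share of one class in percent, as used for a class distribution line. */
static double gate_stat_class_pct_sketch(unsigned cls) {
    if (cls >= 8 || g_gate_stats_sketch.total == 0) return 0.0;
    return 100.0 * (double)g_gate_stats_sketch.per_class[cls]
                 / (double)g_gate_stats_sketch.total;
}
```

With stats left disabled, the hot path pays only for one predictable branch, which is why the baseline profile is taken with stats off.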
**Free Dispatcher Stats**:
```
total=8,081 calls
tiny=0, mid=8,081, large=0 (all on the mid path)
ultra=0 (ULTRA bypasses the free dispatcher)
tiny_legacy=7, pool=0, v6=0
route_calls=267,954 (mostly called from the alloc side)
env_checks=9 (initialization only)
```
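An `env_checks` count that stays in the single digits is what one would expect when environment variables are read once and then served from a cached snapshot. Here is a minimal sketch of that pattern, assuming a single-threaded initialization. The variable name `HAKMEM_SKETCH_C7_ULTRA` and all `_sketch` identifiers are made up for illustration and are not taken from the hakmem sources.

```c
#include <stdlib.h>

/* Hypothetical snapshot of environment-derived flags. */
typedef struct {
    int initialized;
    int c7_ultra_enabled;
} env_snapshot_sketch_t;

static env_snapshot_sketch_t g_env_snapshot_sketch;
static unsigned g_getenv_calls_sketch;  /* how often the slow path actually ran */

/* Slow path: consult the process environment. */
static int env_flag_sketch(const char *name) {
    const char *v = getenv(name);
    g_getenv_calls_sketch++;
    return v != NULL && v[0] == '1';
}

/* Fast path: after the first call this is a load and a branch.
 * Not thread-safe as written; a real allocator would need a
 * once-style guard or initialization before threads start. */
static inline int c7_ultra_enabled_sketch(void) {
    if (!g_env_snapshot_sketch.initialized) {
        g_env_snapshot_sketch.c7_ultra_enabled =
            env_flag_sketch("HAKMEM_SKETCH_C7_ULTRA");
        g_env_snapshot_sketch.initialized = 1;
    }
    return g_env_snapshot_sketch.c7_ultra_enabled;
}
```

However many times the flag is queried, `getenv` runs only on the first query.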
### Analysis
**Confirmed**:
1. **Dispatcher (25.48%) is already optimized**
   - env_checks is only 9 (initialization only); the 267,954 route_calls are attributed mostly to the alloc side
   - The 25% is function call cost (architecture level)
2. **Alloc Gate (21.13%) is already optimized**
   - size_to_class = 0 calls (LUT)
   - route_for_class = 0 calls (ULTRA enabled)
   - env_checks = 275K is the C7 ULTRA enable check (unavoidable)
3. **New bottlenecks**:
   - C7 ULTRA refill (tiny_c7_ultra_page_of) is newly visible at 1.78%
   - so_alloc/so_free total ~5%
   - classify_ptr is at 1.15%
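Item 2 credits a lookup table (LUT) for removing `size_to_class` calls from the hot path. The sketch below shows one way such a table can work. It assumes power-of-two classes where class `i` holds blocks of `8 << i` bytes (so C7 = 1024 B); the real hakmem class sizes may differ, and all `_sketch` names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_SIZE 1024u
#define SKETCH_LUT_LEN  (SKETCH_MAX_SIZE / 8u + 1u)  /* one slot per 8-byte step */

static uint8_t g_size2class_lut_sketch[SKETCH_LUT_LEN];

/* Fill the table once at startup.
 * Assumption: class i serves blocks of (8 << i) bytes. */
static void size2class_lut_init_sketch(void) {
    for (uint32_t slot = 0; slot < SKETCH_LUT_LEN; slot++) {
        uint32_t size = slot * 8u;
        uint8_t cls = 0;
        while ((8u << cls) < size) cls++;
        g_size2class_lut_sketch[slot] = cls;
    }
}

/* Hot path: one add, one shift, one table load.
 * Returns -1 for sizes the table does not cover. */
static inline int size2class_sketch(size_t size) {
    if (size == 0 || size > SKETCH_MAX_SIZE) return -1;
    return g_size2class_lut_sketch[(size + 7u) >> 3];
}
```

After `size2class_lut_init_sketch()` has run, a 512-byte request maps to class 6 and a 513-byte request to class 7, with no loop or function call on the allocation path.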
### Throughput
- **Mixed 16-1024B**: 39.5M ops/s (iters=1M, ws=400)
- **Comparison**: REBASE-3's 30.6M ops/s (iters=10M, ws=8192) was measured on a different workload, so the two figures are not directly comparable.
### Next phase candidates
**Option A: C7 ULTRA refill optimization**
- tiny_c7_ultra_page_of is at 1.78%
- Optimize the refill path (segment learning / page lookup)
- Expected: +1-2% overall from trimming the refill path
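For context on what a page lookup on the refill path can cost, the sketch below resolves a pointer to its page index with pure address arithmetic. It is an illustration only: the segment and page sizes (2 MiB and 64 KiB) and the size-aligned segment layout are assumptions, and this is not the actual `tiny_c7_ultra_page_of` implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SEG_SHIFT  21u  /* assumed 2 MiB segment, aligned to its own size */
#define SKETCH_PAGE_SHIFT 16u  /* assumed 64 KiB page */
#define SKETCH_SEG_SIZE   ((uintptr_t)1 << SKETCH_SEG_SHIFT)

/* Base address of the segment containing p: clear the low offset bits. */
static inline uintptr_t seg_base_sketch(const void *p) {
    return (uintptr_t)p & ~(SKETCH_SEG_SIZE - 1);
}

/* Index of the page containing p within its segment. */
static inline size_t page_index_sketch(const void *p) {
    return (size_t)(((uintptr_t)p - seg_base_sketch(p)) >> SKETCH_PAGE_SHIFT);
}
```

If the real lookup walks a registry or touches segment metadata instead, replacing part of that work with mask-and-shift arithmetic of this kind is one possible direction for Option A.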
**Option B: Architecture-level optimization**
- free dispatcher (25%) + malloc dispatcher (21%) = 46%
- At present this is the call cost of the C API (malloc/free)
- Example: redesign the entire hot path around an inlined dispatcher
- Risk: large-scale design change
**Option C: Reduce the so_alloc/so_free family (~5%)**
- Optimize the v3 backend
- Reduce classify_ptr (1.15%)
- Expected: 1-2M ops/s
**Recommendation**: Start with Option A (C7 ULTRA refill). The 46% spent in the dispatcher/gate is an architecturally necessary cost and, weighing difficulty against payoff, should be accepted for now.
### Conclusion
- **dispatcher + gate**: 46% total → already optimized (ENV/route snapshotting complete)
- **C7 ULTRA internals**: alloc 7.66% + free 3.50% + refill 1.78% = 12.94%
- **Next target**: start reducing from the C7 ULTRA refill path (1.78%)

@@ -250,7 +250,7 @@ endif
# Benchmark targets
BENCH_HAKMEM = bench_allocators_hakmem
BENCH_SYSTEM = bench_allocators_system
-BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o bench_allocators_hakmem.o
+BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o bench_allocators_hakmem.o
BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1)
BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o