Fix: Add alloc_gate_stats_box.o to BENCH_HAKMEM_OBJS_BASE; Document PERF-ULTRA-REBASE-4 findings
Phase PERF-ULTRA-REBASE-4 confirmed:
- dispatcher (25.48%) and alloc gate (21.13%) are already heavily optimized via ENV/route snapshots
- New bottleneck: C7 ULTRA refill path (tiny_c7_ultra_page_of at 1.78%)
- Recommendation: optimize the C7 ULTRA refill path next (expected +1-2% overall)
@@ -863,3 +863,97 @@ C7 ULTRA alloc: with the optimizations inside tiny_c7_ultra.c, both self% and throughput are nearly
**Details**: see `docs/analysis/ALLOC_GATE_ANALYSIS.md`

---
## Phase PERF-ULTRA-REBASE-4: Re-measurement and confirmation (2025-12-11)

**Goal**: Having confirmed that the dispatcher and alloc gate are already optimized, capture a fresh perf profile to see what actually dominates now.

**Measurement conditions**:

- ENV: all OFF (defaults; baseline with stats disabled)
- Workload: Mixed 16-1024B, 10M iterations, ws=8192
- perf record: `-e cycles:u -F 5000 --call-graph dwarf`

### Hot-path analysis (self%, 1K samples)

| Rank | Function/path | self% | Change |
|------|---------------|-------|--------|
| **#1** | **free** | **25.48%** | −0.74% vs REBASE-3 |
| **#2** | **malloc** | **21.13%** | ±0% (unchanged) |
| **#3** | **tiny_c7_ultra_alloc** | **7.66%** | ±0% (unchanged) |
| #4 | tiny_c7_ultra_free | 3.50% | −0.6% (optimization effect) |
| #5 | so_free | 2.47% | (newly visible) |
| #6 | so_alloc_fast | 2.39% | (newly visible) |
| **#7** | **tiny_c7_ultra_page_of** | **1.78%** | **NEW: refill path** |
| #8 | so_alloc | 1.21% | (newly visible) |
| #9 | classify_ptr | 1.15% | (newly visible) |

### Statistics (Mixed 1M iter, ws=400)

**Alloc Gate Stats**:

```
total=542,019 calls
size2class=0 calls ✅ (fully eliminated)
route_calls=0 calls ✅ (fully eliminated)
env_checks=275,089 (structural cost)
class distribution: C7=50.8%, C6=25.3%, C5=12.7%, C4=6.4%, C2-C3=4.8%
```
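
The counters above could come from a small stats box along the lines of the sketch below. It is not the real `core/box/alloc_gate_stats_box.o` source: `gate_stats_t`, `GATE_STAT_INC`, `gate_stats_on_alloc`, `gate_stats_dump`, and the 8-class layout are hypothetical names chosen for illustration, and C11 relaxed atomics are an assumption about how cross-thread counting might be done.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define GATE_NUM_CLASSES 8 /* hypothetical: classes C0..C7 */

/* Hypothetical counter block. Relaxed atomics make each event a single
 * uncontended add; counters are not ordered relative to each other. */
typedef struct {
    atomic_uint_fast64_t total;
    atomic_uint_fast64_t size2class;
    atomic_uint_fast64_t route_calls;
    atomic_uint_fast64_t env_checks;
    atomic_uint_fast64_t per_class[GATE_NUM_CLASSES];
} gate_stats_t;

/* Static storage: zero-initialized, which is a valid atomic state in C11. */
static gate_stats_t g_gate_stats;

/* Bump one named counter, e.g. GATE_STAT_INC(env_checks). */
#define GATE_STAT_INC(field) \
    atomic_fetch_add_explicit(&g_gate_stats.field, 1, memory_order_relaxed)

/* Called once per allocation that passes through the gate. */
static inline void gate_stats_on_alloc(unsigned cls) {
    assert(cls < GATE_NUM_CLASSES);
    GATE_STAT_INC(total);
    GATE_STAT_INC(per_class[cls]);
}

static uint64_t gate_load(atomic_uint_fast64_t *c) {
    return (uint64_t)atomic_load_explicit(c, memory_order_relaxed);
}

/* Share of gated allocations that landed in class `cls`, in percent. */
static double gate_stats_class_pct(unsigned cls) {
    uint64_t total = gate_load(&g_gate_stats.total);
    if (total == 0 || cls >= GATE_NUM_CLASSES) return 0.0;
    return 100.0 * (double)gate_load(&g_gate_stats.per_class[cls]) / (double)total;
}

/* Prints a summary in roughly the shape of the stats shown above. */
static void gate_stats_dump(FILE *out) {
    fprintf(out, "total=%llu size2class=%llu route_calls=%llu env_checks=%llu\n",
            (unsigned long long)gate_load(&g_gate_stats.total),
            (unsigned long long)gate_load(&g_gate_stats.size2class),
            (unsigned long long)gate_load(&g_gate_stats.route_calls),
            (unsigned long long)gate_load(&g_gate_stats.env_checks));
    for (unsigned c = 0; c < GATE_NUM_CLASSES; c++)
        fprintf(out, "C%u=%.1f%%%s", c, gate_stats_class_pct(c),
                c + 1 < GATE_NUM_CLASSES ? ", " : "\n");
}
```

Relaxed ordering is sufficient here because the counters are only read for reporting; none of them is used to synchronize other memory.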

**Free Dispatcher Stats**:

```
total=8,081 calls
tiny=0, mid=8,081, large=0 (all on the mid path)
ultra=0 (ULTRA bypasses the free dispatcher)
tiny_legacy=7, pool=0, v6=0
route_calls=267,954 (mostly called from the alloc side)
env_checks=9 (initialization only)
```
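
`ultra=0` is what you would see if the ULTRA free path claims its pointers before the dispatcher runs. The sketch below illustrates only that ordering; `free_entry`, `c7_ultra_owns`, `c7_ultra_free`, `classify_route`, the static arena standing in for the C7 ULTRA segment, and the counter names are all hypothetical, not hakmem's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters owned by the generic free dispatcher. */
typedef struct { uint64_t total, tiny, mid, large, ultra; } free_dispatch_stats_t;
static free_dispatch_stats_t g_free_stats;
static uint64_t g_ultra_frees; /* counted by the ULTRA path itself */

/* A static buffer stands in for the (hypothetical) C7 ULTRA segment. */
static unsigned char g_c7_arena[1 << 16];

static bool c7_ultra_owns(const void *p) {
    uintptr_t a = (uintptr_t)p, lo = (uintptr_t)g_c7_arena;
    return a >= lo && a - lo < sizeof g_c7_arena;
}

static void c7_ultra_free(void *p) { (void)p; g_ultra_frees++; }

typedef enum { ROUTE_TINY, ROUTE_MID, ROUTE_LARGE } free_route_t;

/* Placeholder classifier: a real one would inspect segment/page metadata. */
static free_route_t classify_route(const void *p) { (void)p; return ROUTE_MID; }

/* Generic dispatcher: count, classify, route (actual routing elided). */
static void free_dispatch(void *p) {
    g_free_stats.total++;
    switch (classify_route(p)) {
    case ROUTE_TINY:  g_free_stats.tiny++;  break;
    case ROUTE_MID:   g_free_stats.mid++;   break;
    case ROUTE_LARGE: g_free_stats.large++; break;
    }
    /* g_free_stats.ultra is never bumped: ULTRA frees never reach here. */
}

/* Entry point: the ULTRA ownership check runs first and returns early. */
static void free_entry(void *p) {
    if (p == NULL) return;
    if (c7_ultra_owns(p)) { c7_ultra_free(p); return; }
    free_dispatch(p);
}
```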

### Analysis

**Findings**:

1. **Dispatcher (25.48%) is already optimized**
   - route_for_class is called only 9 times (at initialization)
   - The 25% is the cost of the function call itself (architecture level)

2. **Alloc Gate (21.13%) is already optimized**
   - size_to_class = 0 calls (LUT)
   - route_for_class = 0 calls (ULTRA enabled)
   - env_checks = 275K are the C7 ULTRA enable checks (unavoidable)

3. **New bottlenecks**:
   - C7 ULTRA refill (tiny_c7_ultra_page_of) is newly visible at 1.78%
   - so_alloc/so_free together account for ~5%
   - classify_ptr is at 1.15%
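
Items 1 and 2 rely on two common techniques: a size-to-class lookup table instead of per-call arithmetic, and a per-class route table that reads the environment once and is afterwards consulted with a plain load. A minimal sketch follows. It assumes 8 classes at 128-byte granularity up to 1024 B, a hypothetical `HAKMEM_C7_ULTRA` variable that disables ULTRA when set to `0`, and single-threaded initialization; hakmem's real class boundaries, variable names, and init mechanism may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TINY_MAX_SIZE    1024u /* hypothetical tiny ceiling */
#define TINY_NUM_CLASSES 8     /* hypothetical: C0..C7, 128 B apart */
#define LUT_SHIFT        4     /* 16-byte buckets */

typedef enum { ROUTE_LEGACY = 0, ROUTE_ULTRA = 1 } tiny_route_t;

static uint8_t g_size_lut[(TINY_MAX_SIZE >> LUT_SHIFT) + 1];
static uint8_t g_route_snapshot[TINY_NUM_CLASSES];

/* Builds both tables once. `ultra_env` is the raw variable value or NULL. */
static void gate_snapshot_init_from(const char *ultra_env) {
    for (size_t i = 0; i < sizeof g_size_lut; i++) {
        size_t bucket_max = i << LUT_SHIFT; /* largest size mapped to bucket i */
        g_size_lut[i] = (uint8_t)(bucket_max == 0 ? 0 : (bucket_max - 1) / 128);
    }
    /* Hypothetical policy: ULTRA for C7 unless the variable is exactly "0". */
    int ultra_on = !(ultra_env != NULL && strcmp(ultra_env, "0") == 0);
    for (unsigned c = 0; c < TINY_NUM_CLASSES; c++)
        g_route_snapshot[c] = ROUTE_LEGACY;
    g_route_snapshot[TINY_NUM_CLASSES - 1] = ultra_on ? ROUTE_ULTRA : ROUTE_LEGACY;
}

static void gate_snapshot_init(void) {
    gate_snapshot_init_from(getenv("HAKMEM_C7_ULTRA")); /* hypothetical name */
}

/* Hot path: round up to a 16-byte bucket, then one byte load. */
static inline unsigned size_to_class_lut(size_t size) {
    assert(size <= TINY_MAX_SIZE);
    return g_size_lut[(size + (1u << LUT_SHIFT) - 1) >> LUT_SHIFT];
}

/* Hot path: the route was decided at init time; no env or policy lookup. */
static inline tiny_route_t route_for_class_snapshot(unsigned cls) {
    assert(cls < TINY_NUM_CLASSES);
    return (tiny_route_t)g_route_snapshot[cls];
}
```

Because every class boundary in this sketch is a multiple of 16, indexing the table by `ceil(size / 16)` is exact, and the hot path reduces to one add, one shift, and one byte load.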

### Throughput

- **Mixed 16-1024B**: 39.5M ops/s (iters=1M, ws=400)
- **Comparison**: REBASE-3's 30.6M ops/s (iters=10M, ws=8192) comes from a different workload, so the two numbers are not directly comparable

### Next-phase candidates

**Option A: C7 ULTRA refill optimization**

- tiny_c7_ultra_page_of is at 1.78%
- Optimize the refill path (segment learning / page lookup)
- Expected: 1-2% overall from trimming the refill path
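
One way a page lookup like `tiny_c7_ultra_page_of` can be made cheap is to align the segment and its pages to a power of two, so that finding a pointer's page costs a subtraction, a compare, and a shift. Whether the C7 ULTRA segment allows this depends on its actual layout, which this document does not describe; the 64 KiB page size, `ultra_segment_t`, and both functions below are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ULTRA_PAGE_SHIFT 16u /* hypothetical: 64 KiB pages */
#define ULTRA_PAGE_SIZE  ((size_t)1 << ULTRA_PAGE_SHIFT)

/* Hypothetical segment descriptor: `base` is ULTRA_PAGE_SIZE-aligned and the
 * segment consists of `npages` contiguous pages. */
typedef struct {
    uintptr_t base;
    size_t    npages;
} ultra_segment_t;

/* Page index of p inside seg, or -1 if p is outside it. If p < base, the
 * unsigned subtraction wraps to a huge value, so one compare covers both
 * bounds. */
static inline long ultra_page_index(const ultra_segment_t *seg, const void *p) {
    uintptr_t off = (uintptr_t)p - seg->base;
    if (off >= (uintptr_t)(seg->npages * ULTRA_PAGE_SIZE)) return -1;
    return (long)(off >> ULTRA_PAGE_SHIFT);
}

/* Start of the page containing p: clear the low ULTRA_PAGE_SHIFT bits. */
static inline void *ultra_page_base(const void *p) {
    return (void *)((uintptr_t)p & ~(uintptr_t)(ULTRA_PAGE_SIZE - 1));
}
```

The mask in `ultra_page_base` is only correct if the segment itself is allocated with page-size alignment, for example via C11 `aligned_alloc` or an aligned `mmap` reservation.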

**Option B: Architecture-level optimization**

- free dispatcher (25%) + malloc dispatcher (21%) = 46%
- What remains is the call cost of the C API entry points (malloc/free)
- Example: redesign the whole hot path around an inlined dispatcher
- Risk: large-scale design change
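
A common shape for an inlined dispatcher is a `static inline` fast path that handles the dominant case in a couple of branches and falls back to an out-of-line slow path, so the compiler can inline the hot part into every caller. The sketch below shows that split on a hypothetical per-thread free list with a single 1024-byte block size; `fast_alloc`, `fast_free`, `slow_alloc`, and the sized-free interface are assumptions, and `__builtin_expect` / `noinline` are GCC/Clang extensions guarded by `__GNUC__`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__)
#define LIKELY(x) __builtin_expect(!!(x), 1)
#define NOINLINE  __attribute__((noinline))
#else
#define LIKELY(x) (x)
#define NOINLINE
#endif

#define FAST_MAX 1024u /* hypothetical: sizes up to this share one block size */

typedef struct node { struct node *next; } node_t;

/* Hypothetical per-thread free list of FAST_MAX-byte blocks. */
static _Thread_local node_t *tl_freelist;
static _Thread_local unsigned long tl_slow_calls;

/* Cold path, kept out of line so the inlined fast path stays small. */
static NOINLINE void *slow_alloc(size_t size) {
    tl_slow_calls++;
    return malloc(size <= FAST_MAX ? FAST_MAX : size);
}

/* Hot path: one compare, one load, one store when the list is non-empty. */
static inline void *fast_alloc(size_t size) {
    node_t *n = tl_freelist;
    if (LIKELY(size <= FAST_MAX && n != NULL)) {
        tl_freelist = n->next; /* pop */
        return n;
    }
    return slow_alloc(size);
}

/* Sized free: small blocks go back on the list, large ones to the system. */
static inline void fast_free(void *p, size_t size) {
    if (p == NULL) return;
    if (LIKELY(size <= FAST_MAX)) {
        node_t *n = p;
        n->next = tl_freelist; /* push */
        tl_freelist = n;
        return;
    }
    free(p);
}
```

The sized `fast_free` avoids any pointer classification only because the caller supplies the size; a drop-in replacement for plain `free(p)` would still need an ownership lookup (the role `classify_ptr` plays) on that path.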

**Option C: Reduce the so_alloc/so_free family (~5%)**

- Optimize the v3 backend
- Reduce classify_ptr (1.15%)
- Expected: +1-2M ops/s

**Recommendation**: Start with Option A (C7 ULTRA refill). The ~46% spent in the dispatcher/gate is an architecturally necessary cost; weighing difficulty against payoff, it should be accepted as-is for now.

### Conclusion

- **dispatcher + gate**: 46.6% combined (25.48% + 21.13%) → already optimized (ENV/route snapshotting complete)
- **C7 ULTRA internals**: alloc 7.66% + free 3.50% + refill 1.78% = 12.94%
- **Next target**: start the reduction work with the C7 ULTRA refill path (1.78%)
Makefile
@@ -250,7 +250,7 @@ endif
 # Benchmark targets
 BENCH_HAKMEM = bench_allocators_hakmem
 BENCH_SYSTEM = bench_allocators_system
-BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o bench_allocators_hakmem.o
+BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o bench_allocators_hakmem.o
 BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
 ifeq ($(POOL_TLS_PHASE1),1)
 BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o