From 9fb2240319e4ceab3b8e98c48f1fc00538c56d94 Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)"
Date: Thu, 11 Dec 2025 21:36:58 +0900
Subject: [PATCH] Fix: Add alloc_gate_stats_box.o to BENCH_HAKMEM_OBJS_BASE; Document PERF-ULTRA-REBASE-4 findings

Phase PERF-ULTRA-REBASE-4 confirmed:
- dispatcher (25.48%) and alloc gate (21.13%) already heavily optimized via snapshot
- New bottleneck: C7 ULTRA refill path (tiny_c7_ultra_page_of at 1.78%)
- Recommendation: Next optimize C7 ULTRA refill for +1-2% overall gain
---
 CURRENT_TASK.md | 94 +++++++++++++++++++++++++++++++++++++++++++++++++
 Makefile        |  2 +-
 2 files changed, 95 insertions(+), 1 deletion(-)

diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index a117d520..f68f2016 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -863,3 +863,97 @@ C7 ULTRA alloc: optimizing inside tiny_c7_ultra.c left both self% and throughput almost
 
 **Details**: see `docs/analysis/ALLOC_GATE_ANALYSIS.md`
 
+---
+
+## Phase PERF-ULTRA-REBASE-4: Re-measurement and Confirmation (2025-12-11)
+
+**Goal**: After confirming that the dispatcher and alloc gate are already optimized, capture a fresh perf profile.
+
+**Measurement conditions**:
+- ENV: all OFF (defaults; baseline without stats)
+- Workload: Mixed 16-1024B, 10M iter, ws=8192
+- perf record: cycles:u, -F 5000, dwarf call-graph
+
+### Hot path analysis (self%, 1K samples)
+
+| Rank | Function/path | self% | Change |
+|------|----------|-------|------|
+| **#1** | **free** | **25.48%** | −0.74% vs REBASE-3 |
+| **#2** | **malloc** | **21.13%** | −0% (unchanged) |
+| **#3** | **tiny_c7_ultra_alloc** | **7.66%** | ±0% (unchanged) |
+| #4 | tiny_c7_ultra_free | 3.50% | −0.6% (optimization effect) |
+| #5 | so_free | 2.47% | (newly visible) |
+| #6 | so_alloc_fast | 2.39% | (newly visible) |
+| **#7** | **tiny_c7_ultra_page_of** | **1.78%** | **NEW: refill path** |
+| #8 | so_alloc | 1.21% | (newly visible) |
+| #9 | classify_ptr | 1.15% | (newly visible) |
+
+### Statistics (Mixed 1M iter, ws=400)
+
+**Alloc Gate Stats**:
+```
+total=542,019 calls
+size2class=0 calls ✅ (fully eliminated)
+route_calls=0 calls ✅ (fully eliminated)
+env_checks=275,089 (structural cost)
+class distribution: C7=50.8%, C6=25.3%, C5=12.7%, C4=6.4%, C2-C3=4.8%
+```
+
+**Free Dispatcher Stats**:
+```
+total=8,081 calls
+tiny=0, mid=8,081, large=0 (all on the mid path)
+ultra=0 (ULTRA bypasses the free dispatcher)
+tiny_legacy=7, pool=0, v6=0
+route_calls=267,954 (mostly called from the alloc side)
+env_checks=9 (initialization only)
+```
+
+### Analysis
+
+**Confirmed points**:
+1. **Dispatcher (25.48%) is already optimized**
+   - route_for_class is called only 9 times (at initialization)
+   - The 25% is function-call cost (architecture level)
+
+2. **Alloc Gate (21.13%) is already optimized**
+   - size_to_class = 0 calls (LUT)
+   - route_for_class = 0 calls (ULTRA enabled)
+   - env_checks = 275K is the C7 ULTRA enable check (unavoidable)
+
+3. **New bottlenecks**:
+   - C7 ULTRA refill (tiny_c7_ultra_page_of) is newly visible at 1.78%
+   - so_alloc/so_free total ~5%
+   - classify_ptr is at 1.15%
+
+### Throughput
+
+- **Mixed 16-1024B**: 39.5M ops/s (iters=1M, ws=400)
+- **Comparison**: not directly comparable with REBASE-3's 30.6M ops/s (iters=10M, ws=8192), which used a different workload
+
+### Candidates for the next phase
+
+**Option A: Optimize C7 ULTRA refill**
+- tiny_c7_ultra_page_of is at 1.78%
+- Optimize the refill path for segment learning / page lookup
+- Expected: an overall gain of 1-2% from trimming the refill path
+
+**Option B: Architecture-level optimization**
+- free dispatcher (25%) + malloc dispatcher (21%) = 46%
+- Currently this is the call cost of the C API (malloc/free)
+- Example: redesign the entire hot path around an inlined dispatcher
+- Risk: large-scale design change
+
+**Option C: Reduce the so_alloc/so_free family (~5%)**
+- Optimize the v3 backend
+- Reduce classify_ptr (1.15%)
+- Expected: 1-2M ops/s
+
+**Recommendation**: Start with Option A (C7 ULTRA refill). The 46% in dispatcher/gate is an architecturally necessary cost and, weighing difficulty against payoff, should be accepted for now.
+
+### Conclusion
+
+- **dispatcher + gate**: 46% total → already optimized (ENV/route snapshotting complete)
+- **C7 ULTRA internals**: alloc 7.66% + free 3.50% + refill 1.78% = 12.94%
+- **Next target**: start reductions with the C7 ULTRA refill path (1.78%)
+
diff --git a/Makefile b/Makefile
index 052c635f..14e0aec6 100644
--- a/Makefile
+++ b/Makefile
@@ -250,7 +250,7 @@ endif
 # Benchmark targets
 BENCH_HAKMEM = bench_allocators_hakmem
 BENCH_SYSTEM = bench_allocators_system
-BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o
hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o 
core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o bench_allocators_hakmem.o
+BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o
core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o bench_allocators_hakmem.o
 BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
 ifeq ($(POOL_TLS_PHASE1),1)
 BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o