diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index dbaac699..fa9f10ac 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -342,6 +342,144 @@ Cumulative improvement achieved in Phase 6-10:
 - Neither pointer-chase reduction nor cache-shape changes yield a significant improvement over the current TLS array cache
 - Closing the remaining mimalloc gap (about 2.4x) requires an approach on a different axis

+---
+
+### Phase 16 v1: Front FastLane Alloc LEGACY Direct - ⚠️ NEUTRAL (+0.62%) - kept as research box (default OFF)
+
+**Date**: 2025-12-15
+**Verdict**: **NEUTRAL (+0.62% Mixed, +0.06% C6-heavy)** - kept as research box (default OFF)
+
+**Motivation**:
+- Phase 14-15 are frozen (cache-shape/pointer-chase work has thin ROI)
+- On the free side, "monolithic early-exit + dedup" is the winning pattern (Phase 9/10/6-2)
+- Apply the same winning pattern on the alloc side: cut the fixed route/policy cost of the LEGACY route at the FastLane entry
+
+**Results**:
+| Workload | ENV=0 (Baseline) | ENV=1 (Direct) | Delta |
+|---------|----------|----------|-------|
+| Mixed (16–1024B) | 47,510,791 | 47,803,890 | **+0.62%** |
+| C6-heavy (257–768B) | 21,134,240 | 21,147,197 | **+0.06%** |
+
+**Critical Issue & Fix**:
+- **Segfault discovered**: Initial implementation crashed for C4-C7 during `unified_cache_refill()` → `tiny_next_read()`
+- **Root cause**: Refill logic incompatibility for classes C4-C7
+- **Safety fix**: Limited optimization to C0-C3 only (matching existing dualhot pattern)
+- Code constraint: `if (... && (unsigned)class_idx <= 3u)` added to line 96 of `front_fastlane_box.h`
+
+**Conclusion**:
+- Optimization overlaps with existing dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) for C0-C3
+- Limited scope (C0-C3 only) reduces potential benefit
+- Route/policy overhead already minimized by Phase 6 FastLane collapse
+- Pattern continues from Phase 14-15: dispatch-layer optimizations showing NEUTRAL results
+
+**Root causes of limited benefit**:
+1. Safety constraint: C4-C7 excluded due to refill bug
+2. Overlap with dualhot: C0-C3 already have direct path when dualhot enabled
+3. Route overhead not dominant: Phase 6 already collapsed major dispatch costs
+
+**Recommendations**:
+- **Freeze as research box** (default OFF, no preset promotion)
+- **Investigate C4-C7 refill issue** before expanding scope
+- **Shift optimization focus** away from dispatch layers (Phase 14/15/16 all NEUTRAL)
+
+**Refs**:
+- A/B results: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md`
+- Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`
+- Instructions: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md`
+- ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)
+
+---
+
+### Phase 14-16 Summary: Post-FastLane Research Phases ⚠️
+
+**Conclusion**: Phase 14-16 are all NEUTRAL (frozen as research boxes)
+
+| Phase | Approach | Mixed Delta | Verdict |
+|-------|----------|-------------|---------|
+| 14 v1 | tcache (free-side only) | +0.20% | NEUTRAL |
+| 14 v2 | tcache (alloc+free) | +0.08% | NEUTRAL |
+| 15 v1 | FIFO→LIFO (array cache) | -0.70% | NEUTRAL |
+| 16 v1 | Alloc LEGACY direct | **+0.62%** | **NEUTRAL** |
+
+**Lessons**:
+- Pointer-chase reduction, cache-shape changes, and dispatch early-exit all failed to produce a significant improvement
+- Since the Phase 6 FastLane collapse (entry fixed-cost reduction), optimizing the dispatch/routing layer has thin ROI
+- Closing the remaining mimalloc gap (about 2.4x) requires a different axis such as cache miss cost, memory layout, or backend allocation
+
+---
+
+### Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) ✅ COMPLETE (2025-12-15)
+
+**Goal**: Turn the "system malloc is faster" observation into a single source of truth (SSOT). A/B `hakmem` vs `libc` in the **same binary** to separate what the gap actually consists of (allocator difference vs. layout difference).
+
+**Result**: **Case B confirmed**: allocator difference negligible (+0.39%), layout penalty dominant (+73.57%)
+
+**Gap Breakdown** (Mixed, 20M iters, ws=400):
+- hakmem (FORCE_LIBC=0): 48.12M ops/s (mean), 48.12M ops/s (median)
+- libc same-binary (FORCE_LIBC=1): 48.31M ops/s (mean), 48.31M ops/s (median)
+- **Allocator difference**: **+0.39%** (libc slightly faster, within noise)
+- system binary (21K): 83.85M ops/s (mean), 83.75M ops/s (median)
+- **Layout penalty**: **+73.57%** (small binary vs large binary 653K)
+- **Total gap**: **+74.26%** (hakmem → system binary)
+
+**Perf Stat Analysis** (200M iters, 1-run):
+- I-cache misses: 153K (hakmem) → 68K (system) = **-55%** (smoking gun)
+- Cycles: 17.9B → 10.2B = -43%
+- Instructions: 41.3B → 21.5B = -48%
+
+**Root Cause**: Binary size (653K vs 21K, 30x difference) causes I-cache thrashing. Code bloat >> algorithmic efficiency.
+
+**Lessons**:
+- The Phase 12 observation "system malloc 1.6x faster" was correct, but the cause is **binary layout**, not the allocator algorithm
+- Same-binary A/B is mandatory (cross-binary comparisons are confounded by layout and lead to wrong verdicts)
+- I-cache efficiency is a first-order factor for allocator-heavy workloads
+
+**Next Direction** (recommended for Case B):
+- **Phase 18: Hot Text Isolation / Layout Control**
+  - Priority 1: Cold code isolation (`__attribute__((cold,noinline))` + separate TU)
+  - Priority 2: Link-order optimization (hot functions contiguous placement)
+  - Priority 3: PGO (optional, profile-guided layout)
+  - Target: +10% throughput via I-cache optimization (48.1M → 52.9M ops/s)
+  - Success metric: I-cache misses -30% (153K → 107K)
+
+**Files**:
+- Results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
+- Instructions: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md`
+
+---
+
+### Phase 18: Hot Text Isolation / Layout Control - NEXT
+
+**Goal**: Improve I-cache efficiency through binary layout optimization and shrink the gap to the system binary.
+
+**Strategy**:
+1. **Cold Code Isolation** (Priority 1)
+   - Move stats collection, debug logging, and error handlers into a separate TU
+   - Mark them explicitly as cold with `__attribute__((cold, noinline))`
+   - Expected effect: I-cache misses -20%
+
+2. **Link-Order Optimization** (Priority 2)
+   - Place hot functions contiguously (linker script or link order control)
+   - `-ffunction-sections` + custom linker script
+   - Expected effect: I-cache misses -10%
+
+3. **Profile-Guided Optimization** (Priority 3, optional)
+   - Profile-based placement using `-fprofile-generate` + `-fprofile-use`
+   - Expected effect: I-cache misses -10-20%
+
+**Build Gate**: `HOT_TEXT_ISOLATION=0/1` (for layout A/B)
+
+**Target**:
+- v1 (TU split / attrs / optional gc-sections): **GO at +2%** (NEUTRAL is expected to be likely)
+- v2 (BENCH_MINIMAL compile-out): aim for **+10–20%** (directly cuts the instruction footprint)
+
+**Design**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
+**Instructions**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
+
+Implementation gates (reversible):
+- Makefile knob: `HOT_TEXT_ISOLATION=0/1`
+- Compile-time: `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
+
 ## Update notes (2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot)

 ### Phase 5 E5-3: Candidate Analysis & Strategic Recommendations ⚠️ DEFER (2025-12-14)
diff --git a/Makefile b/Makefile
index c19d7548..fc68d352 100644
--- a/Makefile
+++ b/Makefile
@@ -218,12 +218,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)

 # Targets
 TARGET = test_hakmem
-OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o
core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o 
core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o +OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o 
core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o OBJS = $(OBJS_BASE) # Shared library SHARED_LIB = libhakmem.so -SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o 
superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o 
hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o +SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o 
core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o 
hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o # Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1) ifeq ($(POOL_TLS_PHASE1),1) @@ -427,7 +427,7 @@ test-box-refactor: box-refactor ./larson_hakmem 10 8 128 1024 1 12345 4 # Phase 4: Tiny Pool benchmarks (properly linked with hakmem) -TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o 
core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o +TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o 
core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o 
hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE) ifeq ($(POOL_TLS_PHASE1),1) TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o diff --git a/core/bench_profile.h b/core/bench_profile.h index 8c22c960..4fc28986 100644 --- a/core/bench_profile.h +++ b/core/bench_profile.h @@ -13,6 +13,7 @@ #include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1) #include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1) #include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1) +#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // 
front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1) #endif // env が未設定のときだけ既定値を入れる @@ -193,5 +194,7 @@ static inline void bench_apply_profile(void) { tiny_tcache_env_refresh_from_env(); // Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults. tiny_unified_lifo_env_refresh_from_env(); + // Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults. + front_fastlane_alloc_legacy_direct_env_refresh_from_env(); #endif } diff --git a/core/box/front_fastlane_alloc_legacy_direct_env_box.c b/core/box/front_fastlane_alloc_legacy_direct_env_box.c new file mode 100644 index 00000000..1273744b --- /dev/null +++ b/core/box/front_fastlane_alloc_legacy_direct_env_box.c @@ -0,0 +1,63 @@ +// ============================================================================ +// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0) - Implementation +// ============================================================================ + +#include "front_fastlane_alloc_legacy_direct_env_box.h" +#include +#include +#include +#include + +// ============================================================================ +// Global State +// ============================================================================ + +_Atomic int g_front_fastlane_alloc_legacy_direct_enabled = -1; + +// ============================================================================ +// Init (Cold Path) +// ============================================================================ + +int front_fastlane_alloc_legacy_direct_env_init(void) { + const char* env = getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT"); + int enabled = 0; // default: OFF (opt-in) + + if (env && (env[0] == '1' || strcmp(env, "true") == 0 || strcmp(env, "TRUE") == 0)) { + enabled = 1; + } + + // Cache result + atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, enabled, memory_order_relaxed); + + // Log once (stderr for immediate visibility) + if (enabled) { + const char 
msg[] = "[FRONT_FASTLANE_ALLOC_LEGACY_DIRECT] enabled\n"; + ssize_t w = write(2, msg, sizeof(msg) - 1); + (void)w; + } + + return enabled; +} + +// ============================================================================ +// Hot Path (LTO Fallback) +// ============================================================================ + +// LTO fallback: Non-inline version for cases where LTO can't inline +int front_fastlane_alloc_legacy_direct_enabled(void) { + int val = atomic_load_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, memory_order_relaxed); + if (__builtin_expect(val == -1, 0)) { + val = front_fastlane_alloc_legacy_direct_env_init(); + } + return val; +} + +// ============================================================================ +// Refresh (Cold Path, called from bench_profile) +// ============================================================================ + +void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void) { + // Reset to uninitialized state (-1) + // Next call to front_fastlane_alloc_legacy_direct_enabled() will re-read ENV + atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, -1, memory_order_relaxed); +} diff --git a/core/box/front_fastlane_alloc_legacy_direct_env_box.h b/core/box/front_fastlane_alloc_legacy_direct_env_box.h new file mode 100644 index 00000000..84a8b6e7 --- /dev/null +++ b/core/box/front_fastlane_alloc_legacy_direct_env_box.h @@ -0,0 +1,63 @@ +// ============================================================================ +// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0) +// ============================================================================ +// +// Purpose: ENV gate for FastLane alloc LEGACY direct path +// +// Design: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md +// Instructions: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md +// +// Strategy: +// - alloc 側の route/policy 固定費を削減 +// - FastLane 入口で 
LEGACY を直行(hot → cold → fallback) +// - free 側(Phase 9/10)の勝ち筋を alloc にも適用 +// +// ENV: +// HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1 (default: 0, opt-in) +// +// API: +// front_fastlane_alloc_legacy_direct_enabled() -> int +// front_fastlane_alloc_legacy_direct_env_refresh_from_env() +// +// Box Theory: +// - L0: This file (ENV gate, reversible) +// - L1: front_fastlane_box.h (LEGACY direct early-exit) +// - L2: malloc_tiny_fast_for_class (existing fallback) +// +// Safety: +// - ENV-gated (default OFF, opt-in) +// - Reversible (ENV toggle) +// - Fail-Fast (direct条件を満たさない場合は既存経路) +// +// ============================================================================ + +#ifndef FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H +#define FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H + +#include + +// ============================================================================ +// Global State (L0) +// ============================================================================ + +// Cached state: -1 (uninitialized), 0 (disabled), 1 (enabled) +extern _Atomic int g_front_fastlane_alloc_legacy_direct_enabled; + +// ============================================================================ +// Hot API (L0) +// ============================================================================ + +// Check if FastLane alloc LEGACY direct is enabled +// Returns: 1 if enabled, 0 if disabled +// Note: Implementation in .c file (non-inline for LTO compatibility) +extern int front_fastlane_alloc_legacy_direct_enabled(void); + +// ============================================================================ +// Cold API (L2) +// ============================================================================ + +// Refresh ENV cache (called from bench_profile after putenv) +// Pattern: Same as Phase 8/13/14/15 +extern void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void); + +#endif // FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H diff --git a/core/box/front_fastlane_box.h 
b/core/box/front_fastlane_box.h index 36bc96b0..46223e65 100644 --- a/core/box/front_fastlane_box.h +++ b/core/box/front_fastlane_box.h @@ -42,6 +42,11 @@ #include "front_fastlane_stats_box.h" #include "../hakmem_tiny.h" // hak_tiny_size_to_class, tiny_get_max_size #include "../front/malloc_tiny_fast.h" // malloc_tiny_fast_for_class +#include "front_fastlane_alloc_legacy_direct_env_box.h" // Phase 16 v1: LEGACY direct +#include "tiny_static_route_box.h" // tiny_static_route_ready_fast, tiny_static_route_get_kind_fast +#include "tiny_front_hot_box.h" // tiny_hot_alloc_fast +#include "tiny_front_cold_box.h" // tiny_cold_refill_and_alloc +#include "smallobject_policy_v7_box.h" // SMALL_ROUTE_LEGACY // FastLane is only safe after global init completes. // Before init, wrappers must handle recursion guards + syscall init. @@ -85,6 +90,34 @@ static inline void* front_fastlane_try_malloc(size_t size) { return NULL; // Class not enabled → fallback } + // Phase 16 v1: LEGACY direct path (early-exit optimization) + // Try direct allocation for LEGACY routes only (skip route/policy overhead) + // TEMPORARY SAFETY: Limit to C0-C3 (match dualhot pattern) until refill issue debugged + if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) { + // Condition 1: Static route must be ready (Learner interlock check) + // Condition 2: Route must be LEGACY (only when this can be determined with certainty) + if (tiny_static_route_ready_fast() && + tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY) { + + // Hot path: Try UnifiedCache first + void* ptr = tiny_hot_alloc_fast(class_idx); + if (__builtin_expect(ptr != NULL, 1)) { + FRONT_FASTLANE_STAT_INC(malloc_hit); + return ptr; // Success (cache hit) + } + + // Cold path: Refill UnifiedCache and retry + ptr = tiny_cold_refill_and_alloc(class_idx); + if (__builtin_expect(ptr != NULL, 1)) { + FRONT_FASTLANE_STAT_INC(malloc_hit); + return ptr; // Success (after refill) + } + + // 
Fallback: Direct path failed → use 
existing route (safety) + // This handles edge cases (Learner transition, policy changes, etc.) + } + } + // Call existing hot handler (no duplication) // This is the winning path from E5-4 / Phase 4 E2 void* ptr = malloc_tiny_fast_for_class(size, class_idx); diff --git a/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md new file mode 100644 index 00000000..91abb3b7 --- /dev/null +++ b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md @@ -0,0 +1,208 @@ +# Phase 16: Front FastLane Alloc LEGACY Direct v1 — A/B Test Results + +**Date**: 2025-12-15 +**Status**: NEUTRAL (+0.62%) + +--- + +## Executive Summary + +Phase 16 v1 attempted to reduce alloc-side fixed costs by adding a LEGACY direct path to FastLane entry point, bypassing route/policy overhead for LEGACY allocations. The optimization mirrored the free-side winning pattern (Phase 9/10). + +**Result**: +0.62% on Mixed (NEUTRAL), below +1.0% GO threshold. + +**Critical Issue Discovered**: Initial implementation caused segmentation fault for classes C4-C7. Root cause: `unified_cache_refill()` incompatibility. **Safety fix applied**: Limited optimization to C0-C3 only (matching existing dualhot pattern). + +**Verdict**: NEUTRAL — freeze as research box (default OFF). 
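The verdict rule used in this summary (GO at +1.0% or better on the Mixed mean, NO-GO at -1.0% or worse, NEUTRAL in between) can be sketched as a small helper. This is only an illustration under stated assumptions: the names `ab_mean`, `ab_pct_delta`, `ab_classify`, and `ab_verdict` are hypothetical and do not appear in the hakmem tree, and the threshold is passed in rather than hard-coded.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical verdict type for an A/B comparison (not part of hakmem). */
typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_verdict;

/* Arithmetic mean of n per-run throughputs (ops/s). */
double ab_mean(const double *runs, size_t n) {
    assert(runs != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return sum / (double)n;
}

/* Relative change of optimized vs. baseline, in percent. */
double ab_pct_delta(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* GO if delta >= +threshold, NO-GO if delta <= -threshold, else NEUTRAL. */
ab_verdict ab_classify(double pct_delta, double threshold_pct) {
    if (pct_delta >= threshold_pct) return AB_GO;
    if (pct_delta <= -threshold_pct) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Feeding it the reported Mixed means (47,510,791 and 47,803,890 ops/s) with a 1.0 threshold yields a delta of roughly +0.62% and the NEUTRAL verdict.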
+ +--- + +## A/B Test Results + +### Mixed (16-1024B, 10-run clean env) + +**Baseline** (ENV=0): +- Mean: 47,510,791 ops/s +- Median: 47,606,360 ops/s +- Runs: 48151673, 47596179, 47735208, 47903499, 46674576, 47977105, 47236265, 47481537, 46735322, 47616542 + +**Optimized** (ENV=1): +- Mean: 47,803,890 ops/s +- Median: 47,901,551 ops/s +- Runs: 47401229, 47908200, 48158776, 48126240, 47477867, 47894902, 47644796, 48191059, 47930512, 47305320 + +**Delta**: +- Mean: **+0.62%** +- Median: **+0.62%** + +**Verdict**: NEUTRAL (below +1.0% GO threshold) + +--- + +### C6-heavy Regression Check (5-run) + +**Baseline** (ENV=0): +- Mean: 21,134,240 ops/s +- Median: 21,186,983 ops/s +- Runs: 21186983, 21327420, 20807950, 21112023, 21236823 + +**Optimized** (ENV=1): +- Mean: 21,147,197 ops/s +- Median: 21,139,301 ops/s +- Runs: 21358869, 21209299, 20992077, 21139301, 21036438 + +**Delta**: +- Mean: **+0.06%** +- Median: **-0.23%** + +**Verdict**: PASS (no significant regression) + +--- + +## Implementation Summary + +### Files Modified + +1. **`core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`** (new) + - L0 ENV gate for LEGACY direct feature + - ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in) + - API: `front_fastlane_alloc_legacy_direct_enabled()`, `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` + +2. **`core/box/front_fastlane_box.h`** + - Added LEGACY direct early-exit in `front_fastlane_try_malloc()` (lines 93-119) + - **SAFETY CONSTRAINT**: Limited to C0-C3 only due to refill incompatibility for C4-C7 + - Direct conditions: ENV enabled + static route ready + LEGACY route confirmed + - Direct path: `tiny_hot_alloc_fast()` → `tiny_cold_refill_and_alloc()` → fallback to `malloc_tiny_fast_for_class()` + +3. **`core/bench_profile.h`** + - Added `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` to refresh sync group + +4. 
**`Makefile`** + - Added `front_fastlane_alloc_legacy_direct_env_box.o` to OBJS_BASE, SHARED_OBJS, TINY_BENCH_OBJS_BASE + +--- + +## Critical Bug & Fix + +### Issue: Segmentation Fault (Exit Code 139) + +**Symptom**: Benchmark crashed with ENV=1 during larger workloads (20M iterations). + +**Root Cause**: +- Crash occurred in `unified_cache_refill()` → `tiny_next_read()` (intrusive pointer read) +- Initial implementation attempted to use direct path for ALL classes (C0-C7) +- Classes C4-C7 triggered incompatibility with `unified_cache_refill()` logic +- Existing dualhot code (Phase ALLOC-TINY-FAST-DUALHOT-2) only operates on C0-C3 + +**Backtrace**: +``` +#0 0x0000555555564d89 in tiny_next_read.lto_priv.5.lto_priv () +#1 0x00007ffff7b00318 in ?? () +#2 0x0000555555557f29 in unified_cache_refill () +``` + +**Fix Applied**: +- Limited LEGACY direct path to C0-C3 only (line 96 of front_fastlane_box.h) +- Added safety comment explaining constraint +- Matches existing proven pattern from dualhot implementation + +**Code Change**: +```c +// Before (CRASHED): +if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled(), 0)) { + +// After (SAFE): +if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) { +``` + +--- + +## Analysis + +### Why +0.62% is Below Threshold + +1. **Limited Scope**: Optimization only applies to C0-C3 due to safety constraint + - C4-C7 continue using full route/policy path + - Mixed benchmark uses all size classes (16-1024B = C0-C5 primarily) + +2. **Existing Optimizations**: dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) already optimizes C0-C3 + - LEGACY direct overlaps with dualhot coverage + - Marginal benefit when dualhot is disabled, but default config has dualhot enabled in some profiles + +3. 
**Route Overhead Not Dominant**: After Phase 6 FastLane collapse, route/policy fixed costs are already minimized + - Phase 14-15 (cache shape) also showed NEUTRAL results + - Suggests current bottleneck is not in dispatch layers + +### Root Cause of Limited Benefit + +The optimization targets the same problem space as existing dualhot but with different enablement conditions: +- **dualhot**: Always enabled for C0-C3, no route check +- **LEGACY direct**: ENV-gated, requires static route confirmation + +When both are active, LEGACY direct provides minimal incremental value. + +--- + +## Recommendations + +1. **Freeze as Research Box** (default OFF) + - ENV remains opt-in: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0` + - No preset promotion + - Keep code for potential future use if dualhot is disabled + +2. **Investigate C4-C7 Refill Issue** + - Root cause: Why does `unified_cache_refill()` fail for C4-C7 in this path? + - Possible causes: + - LIFO mode interaction (Phase 15) + - Cache state assumptions in refill logic + - Intrusive pointer corruption + - **Action**: Debug under controlled conditions before expanding to C4-C7 + +3. **Shift Focus Away from Dispatch Layers** + - Phase 14, 15, 16 all showed NEUTRAL results + - Phase 6 FastLane already collapsed major dispatch overhead + - **Next direction**: Investigate cache miss costs, memory layout, or backend allocation + +4. 
**Consider Dualhot/LEGACY Direct Consolidation** + - If LEGACY direct is kept, evaluate merging with dualhot logic + - Avoid code duplication and overlap + +--- + +## Comparison with Recent Phases + +| Phase | Target | Delta (Mixed) | Verdict | +|-------|--------|---------------|---------| +| Phase 10 | Free LEGACY direct | +1.89% | **GO** | +| Phase 13 v1 | C7 preserve header | -0.40% | NEUTRAL (freeze) | +| Phase 14 v1 | tcache intrusive | +0.20% | NEUTRAL (freeze) | +| Phase 14 v2 | tcache hot integration | +0.08% | NEUTRAL (freeze) | +| Phase 15 v1 | UnifiedCache FIFO→LIFO | -0.70% | NEUTRAL (freeze) | +| **Phase 16 v1** | **Alloc LEGACY direct** | **+0.62%** | **NEUTRAL (freeze)** | + +**Pattern**: Post-Phase-10 optimizations consistently show NEUTRAL results. Major gains came from earlier phases (FastLane collapse +11.13%, Free DeDup +5.18%, etc.). Current bottleneck likely not in dispatch/routing layers. + +--- + +## Files Changed + +- `core/box/front_fastlane_alloc_legacy_direct_env_box.h` (new) +- `core/box/front_fastlane_alloc_legacy_direct_env_box.c` (new) +- `core/box/front_fastlane_box.h` (modified) +- `core/bench_profile.h` (modified) +- `Makefile` (modified) + +--- + +## ENV Variables + +- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in) + +--- + +## Next Steps + +1. **Freeze Phase 16** with default OFF +2. **Commit with verdict**: "Phase 16 v1: NEUTRAL (+0.62%), research box" +3. **Update CURRENT_TASK.md** with Phase 16 summary +4. 
**Shift optimization focus** based on profiling/analysis (away from dispatch layers) diff --git a/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md new file mode 100644 index 00000000..be4fa761 --- /dev/null +++ b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md @@ -0,0 +1,133 @@ +# Phase 16: Front FastLane Alloc LEGACY Direct v1 (turning the alloc-side “second-stage hot” path into a monolithic early-exit) + +**Date**: 2025-12-15 +**Status**: DESIGN (Phase 16 kickoff) + +--- + +## 0. Executive Summary (one page) + +The Phase 14-15 (pointer-chase / cache-shape) line of work was **NEUTRAL** and is frozen. +The next step is not “cache shape”; it returns to **cutting the fixed cost of instruction count and branches**. + +Since Phase 6 consolidated it into FastLane, the current `malloc()` almost always resolves to: + +``` +malloc() → front_fastlane_try_malloc(size) → malloc_tiny_fast_for_class(size, class_idx) +``` + +That is the usual call chain. + +However, `malloc_tiny_fast_for_class()` pays fixed costs **even on the LEGACY route**: +the ULTRA/C7 early branches, route_kind determination, ENV cfg reads, dispatch shape, and so on. +The free side (Phase 9/10/6-2) won by moving toward “monolithic early-exit”, +so applying the same winning pattern to the alloc side and **taking LEGACY directly at the FastLane entry** is the high-ROI move. 
+ +Phase 16 keeps Box Theory intact and adds a single “LEGACY direct” path to FastLane alloc: + +- **On hit**: `tiny_hot_alloc_fast(class_idx)` → return immediately (route/policy is never touched) +- **On miss**: `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary) +- **When uncertain**: fall back to the existing `malloc_tiny_fast_for_class()` (single boundary) + +--- + +## 1. Current state (why) + +- Phase 6 (Front FastLane) collapsed wrapper→gate→policy→route and substantially reduced the entry fixed cost. +- As a result, the remaining alloc-side cost tends to concentrate in the **branches / ENV reads / route determination inside `malloc_tiny_fast_for_class()`**. +- Changing “the shape of UnifiedCache” in Phase 14/15 did not move Mixed → for now, **cache shape is not the dominant factor**. + +Phase 16 therefore **cuts route/policy fixed costs** without changing the cache internals. + +--- + +## 2. 
Proposal (Phase 16 v1) + +### 2.1 Boxes to add (Box Theory) + +``` +L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / rollback) + ↓ +L1: front_fastlane_try_malloc() (LEGACY direct early-exit) + ↓ +L2: malloc_tiny_fast_for_class() (existing: route/policy/ULTRA/MID/V7) + ↓ +L3: tiny_front_hot_box / tiny_front_cold_box (existing: unified cache / refill) +``` + +**There is exactly one boundary**: +- “direct conditions not met / direct failed” → drop to `malloc_tiny_fast_for_class()`. + +### 2.2 ENV + +- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in) + +Start as opt-in and A/B it. +If the verdict is GO, consider preset promotion (staged, starting with MIXED only). + +### 2.3 Direct conditions (Fail-Fast) + +alloc direct is limited to cases where the route **“can be determined with certainty”**: + +Required conditions (recommended): +- FastLane is enabled (existing) +- `size <= tiny_get_max_size()` (existing) +- `class_idx` is valid (existing) +- The class is included in `front_fastlane_class_mask` (existing) +- `tiny_static_route_ready_fast()` is true (do not use direct when it is false, e.g. due to the Learner interlock) +- `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY confirmed) + +Then: +- `tiny_hot_alloc_fast(class_idx)` → return on hit +- On miss, call `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary) +- Only if that still returns NULL, fall back to `malloc_tiny_fast_for_class()` (safety first) + +--- + +## 3. Observability (minimal) + +Always-on logging is prohibited in Release builds. +If needed, expose the following only under `HAKMEM_DEBUG_COUNTERS=1`: + +- `front_fastlane_alloc_legacy_direct_hit` +- `front_fastlane_alloc_legacy_direct_miss` +- `front_fastlane_alloc_legacy_direct_fallback` + +(Keep atomics confined to the stats box. Do not place atomics on the hot side.) + +--- + +## 4. 
A/B measurement (same binary) + +GO/NO-GO (Mixed 10-run, clean env): +- GO: mean +1.0% or better +- NO-GO: mean -1.0% or worse (immediate rollback / freeze) +- NEUTRAL: within ±1.0% (research box freeze) + +Targets: +- `scripts/run_mixed_10_cleanenv.sh` +- Additionally, C6-heavy 5-run (confirm no regression) + +--- + +## 5. 
Risks and mitigations + +### Risk 1: The “route is LEGACY” assumption breaks and allocations are misrouted + +Mitigation: +- Make `tiny_static_route_ready_fast()` a required condition (it is expected to be false while the Learner is enabled) +- Always check route_kind (do not rely on the mask alone) +- On failure, always fall back to the existing path + +### Risk 2: The direct path covers too little to show an effect + +Mitigation: +- First visualize the “LEGACY ratio” of Mixed via stats (debug counters only) +- If it does not help, freeze (same treatment as Phase 14/15) + +### Risk 3: The added branch backfires (a repeat of Phase 11) + +Mitigation: +- Evaluate the direct condition **only once, inside FastLane** (do not add call-site helpers) +- When the direct condition is false, call the existing `malloc_tiny_fast_for_class()` unchanged + diff --git a/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md new file mode 100644 index 00000000..ea74a7b7 --- /dev/null +++ b/docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md @@ -0,0 +1,124 @@ +# Phase 16: Front FastLane Alloc LEGACY Direct v1 — Next Instructions + +Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md` + +--- + +## 0. Status / Why now + +- Phase 14-15 (tcache / FIFO→LIFO) were **NEUTRAL** → freeze (default OFF) +- The next target is not “cache shape” but **reducing alloc-side route/policy fixed costs** +- On the free side, the “monolithic early-exit + dedup” pattern of Phase 9/10/6-2 is the winner → apply the same pattern to the alloc side + +--- + +## 1. 
GO criteria + +Mixed 10-run (clean env): +- **GO**: mean +1.0% or better +- **NO-GO**: mean -1.0% or worse (immediate rollback / freeze) +- **NEUTRAL**: within ±1.0% → research box freeze + +Additional gate (required): +- In an environment where `tiny_static_route_ready_fast()` is true, LEGACY direct is actually being taken (ideally confirmed with debug counters) + +--- + +## 2. Box diagram (single boundary) + +``` +L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / refresh) + ↓ +L1: front_fastlane_box.h (early-exit inside try_malloc) + ↓ +L2: malloc_tiny_fast_for_class() (existing path) +``` + +Keep the boundary fixed at exactly one place: **“direct conditions not met / direct returned NULL → malloc_tiny_fast_for_class”**. + +--- + +## 3. 
Patch order (stack small changes) + +### Patch 1: L0 ENV gate box (reversible) + +New: +- `core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}` + +ENV: +- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0) + +API (example): +- `front_fastlane_alloc_legacy_direct_enabled() -> int` +- `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` + +Requirements: +- Do not call `getenv()` on the hot path (use the cached value) +- Provide a refresh so that `bench_profile` can sync after `putenv()` (Phase 8/13/14/15 pattern) + +### Patch 2: Integration point (exactly one path in FastLane alloc) + +Target: +- `core/box/front_fastlane_box.h` + +Change: +- After the class mask check in `front_fastlane_try_malloc()`, add the following “direct path” + +Direct conditions (Fail-Fast): +1. `front_fastlane_alloc_legacy_direct_enabled() == 1` +2. `tiny_static_route_ready_fast()` is true (direct is forbidden when it is false, e.g. due to the Learner interlock) +3. 
`tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY confirmed) + +Direct body: +- `void* p = tiny_hot_alloc_fast(class_idx);` +- `if (p) return p;` +- `p = tiny_cold_refill_and_alloc(class_idx);` +- `if (p) return p;` +- Only on failure, fall back to `malloc_tiny_fast_for_class(size, class_idx)` (safe side) + +Notes: +- Prioritize “do not add call-site helpers” (lesson from Phase 11) +- Take the direct path for **LEGACY only** (leave ULTRA/MID/V7 to the existing path) + +### Patch 3: bench_profile sync (prevent ENV leakage) + +Target: +- `core/bench_profile.h` + +Change: +- Add `front_fastlane_alloc_legacy_direct_env_refresh_from_env();` to the refresh group under `#ifdef USE_HAKMEM` + +--- + +## 4. A/B (same binary) + +Baseline: +```sh +HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 scripts/run_mixed_10_cleanenv.sh +``` + +Optimized: +```sh +HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 scripts/run_mixed_10_cleanenv.sh +``` + +Additional (regression detection): +```sh +HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1 +HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1 +``` + +--- + +## 5. Health check + +```sh +scripts/verify_health_profiles.sh +``` + +--- + +## 6. 
Rollback + +- `export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0` + diff --git a/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md new file mode 100644 index 00000000..f29c446b --- /dev/null +++ b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md @@ -0,0 +1,89 @@ +# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results + +**Date**: 2025-12-15 +**Verdict**: ✅ **Case B confirmed** — **Layout / I-cache penalty dominates** + +--- + +## Executive Summary + +Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**: + +- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**. +- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary. + +Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency. + +--- + +## Measurement Setup + +Workload: +- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400` +- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh` + +Two comparisons: +1) **Same-binary toggle** (allocator logic delta) +2) **System binary** (layout penalty delta) + +--- + +## Results + +### 1) Same-binary A/B (allocator delta) + +Binary: `bench_random_mixed_hakmem` +Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1` + +| Mode | Throughput (ops/s) | Delta | +|------|---------------------|-------| +| hakmem (`FORCE_LIBC=0`) | 48.12M | — | +| libc (`FORCE_LIBC=1`) | 48.31M | **+0.39%** | + +Interpretation: allocator logic delta is ~noise-level in this experiment context. 
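To make the same-binary toggle concrete, here is a minimal sketch of how one binary can route allocations to either libc or an in-binary allocator based on a cached environment flag. It is an assumption-laden illustration, not hakmem's implementation: the variable `DEMO_FORCE_LIBC_ALLOC`, the `demo_*` functions, and the bump arena standing in for the allocator under test are all hypothetical. The cached `-1/0/1` state and the refresh call mirror the ENV gate pattern used elsewhere in this diff.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Cached gate: -1 = uninitialized, 0 = in-binary allocator, 1 = force libc. */
static _Atomic int g_demo_force_libc = -1;

/* Cold path: read the (hypothetical) environment variable once and cache it. */
static int demo_force_libc_init(void) {
    const char *e = getenv("DEMO_FORCE_LIBC_ALLOC");
    int on = (e != NULL && e[0] == '1') ? 1 : 0;
    atomic_store_explicit(&g_demo_force_libc, on, memory_order_relaxed);
    return on;
}

/* Hot query: one relaxed load; getenv() is only reached on first use. */
static int demo_force_libc(void) {
    int v = atomic_load_explicit(&g_demo_force_libc, memory_order_relaxed);
    if (v == -1) {
        v = demo_force_libc_init();
    }
    return v;
}

/* Reset the cache so the next query re-reads the environment. */
void demo_force_libc_refresh(void) {
    atomic_store_explicit(&g_demo_force_libc, -1, memory_order_relaxed);
}

/* Stand-in for the allocator under test: a bump allocator over a static arena. */
static _Alignas(16) unsigned char g_demo_arena[4096];
static size_t g_demo_arena_used;

static void *demo_arena_alloc(size_t n) {
    if (n > sizeof(g_demo_arena)) return NULL;
    size_t need = (n + 15u) & ~(size_t)15u;   /* round up to 16 bytes */
    if (need > sizeof(g_demo_arena) - g_demo_arena_used) return NULL;
    void *p = g_demo_arena + g_demo_arena_used;
    g_demo_arena_used += need;
    return p;
}

/* Returns 1 if p points into the demo arena, 0 otherwise. */
int demo_is_arena_ptr(const void *p) {
    uintptr_t a = (uintptr_t)p;
    uintptr_t lo = (uintptr_t)g_demo_arena;
    return a >= lo && a < lo + sizeof(g_demo_arena);
}

/* Single entry point: both allocators live in the same binary and text layout. */
void *demo_alloc(size_t n) {
    if (demo_force_libc()) {
        return malloc(n);
    }
    return demo_arena_alloc(n);
}
```

Because both branches are compiled into one binary, flipping the flag changes only which allocator runs, not the text layout, which is what lets the +0.39% figure be read as an allocator-logic delta.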
+ +### 2) System binary (layout penalty) + +Binary: `bench_random_mixed_system` + +| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary | +|------|---------------------|--------------------------------| +| system malloc | 83.85M | **+73.57%** | + +Total observed gap: roughly +74%. + +--- + +## Perf Stat (200M iterations) — Smoking Gun + +| Metric | hakmem binary | system binary | Delta | +|--------|---------------|---------------|-------| +| I-cache misses | 153K | 68K | **-55%** | +| Cycles | 17.9B | 10.2B | **-43%** | +| Instructions | 41.3B | 21.5B | **-48%** | +| Binary size | 653K | 21K | **-97%** | + +Interpretation: +- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**. +- The ~30× text footprint difference (653K vs 21K) is consistent with the gap. + +--- + +## Conclusion + +Phase 12’s “system malloc is 1.6× faster” observation was real, but the root cause was misattributed: + +- ❌ Not primarily allocator algorithm differences +- ✅ **Text/layout + I-cache locality + instruction footprint** + +This shifts the optimization frontier: +- Stop chasing more routing/dispatch micro-opt (Phase 14–16 plateau) +- Focus on **Hot Text Isolation / layout control** + +--- + +## Next + +Proceed to: +- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md` + diff --git a/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md new file mode 100644 index 00000000..738c31ca --- /dev/null +++ b/docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md @@ -0,0 +1,130 @@ +# Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) Next Instructions + +## Status (prerequisites) + +- Phase 14–16 are **NEUTRAL / research box freeze** (the dispatch / cache-shape / pointer-chase line has plateaued) +- Phase 16 v1 (FastLane alloc LEGACY direct) is **NEUTRAL (+0.62%)** and **limited to C0–C3** (C4–C7 are restricted for safety because of a 
segfault) +- Phase 12 observed that “system malloc is faster than hakmem”, but **cross-binary comparisons are easily distorted by layout/LTO 
differences** + +The goal of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and establish, as the SSOT, what the gap actually consists of (allocator difference or binary difference). + +--- + +## 0. Goals (Deliverables) + +1) **Same-binary A/B**: using `bench_random_mixed_hakmem` +- A: `HAKMEM_FORCE_LIBC_ALLOC=0` (hakmem) +- B: `HAKMEM_FORCE_LIBC_ALLOC=1` (libc) + +2) **Gap decomposition against a separate binary** (optional) +- Also measure `bench_random_mixed_system` (small binary) and compare it with `libc-in-hakmem-binary` to estimate the **layout penalty** + +3) **Decide the next main battleground** (a direction decision, not GO/NO-GO) + +--- + +## 1. Procedure (reproducibility first) + +### 1.1 Build (pinned to the same commit) + +```sh +make -j bench_random_mixed_hakmem bench_random_mixed_system +``` + +### 1.2 Clean ENV (pin the Phase 14–16 research knobs) + +Recommended: use `scripts/run_mixed_10_cleanenv.sh` (prevents ENV leakage). + +Additionally, set the following explicitly (to make sure Phase 16 is OFF): + +```sh +export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 +``` + +### 1.3 Same-binary A/B (the main experiment) + +**A: hakmem** + +```sh +HAKMEM_FORCE_LIBC_ALLOC=0 scripts/run_mixed_10_cleanenv.sh +``` + +**B: libc (same binary)** + +```sh +HAKMEM_FORCE_LIBC_ALLOC=1 scripts/run_mixed_10_cleanenv.sh +``` + +Record: +- mean / median / stdev (10-run) +- Min/Max + +### 1.4 Optional: system binary baseline (layout penalty estimate) + +```sh +for i in $(seq 1 10); do + echo "=== Run ${i}/10 (system bin) ===" + ./bench_random_mixed_system "${ITERS:-20000000}" "${WS:-400}" 1 2>&1 | rg "Throughput" || true +done +``` + +Interpretation: +- `system bin` is much faster than `FORCE_LIBC` → the **layout/text size penalty** dominates +- `FORCE_LIBC` is much faster than `hakmem` → the **allocator logic difference** dominates + +--- + +## 2. 
Decision (direction branches) + +### Case A: `FORCE_LIBC` is faster than hakmem by **+20% or more** + +Conclusion: the bulk of the gap lies in allocator logic (instruction count / fixed costs). + +Next core work (recommended): +- **Phase 18: Free FastPath Gate Consolidation** + - Consolidate the ENV gate / TLS probe inside `free_tiny_fast()` into a single check at the FastLane entry + - Goal: cut per-call gate fixed costs while keeping the “monolithic early-exit” winning pattern + - Box boundary: a single point, `front_fastlane_try_free()` → `free_tiny_fast_with_snapshot()` + - Reversible: `HAKMEM_FREE_TINY_FAST_SNAPSHOT=0/1` + +### Case B: `FORCE_LIBC` is **within ±5%** of hakmem + +Conclusion: the allocator difference is small, and Phase 12’s “system malloc 1.6x” is most likely due to other factors (binary differences / measurement setup). + +Next core work (recommended): +- **Phase 18: Hot Text Isolation / Layout Control** + - Move cold code out via `__attribute__((cold,noinline))` plus a separate TU + - If possible, stabilize the I-cache with link-order control (fixed ordering of hot functions) + - A/B in the same binary with `HAKMEM_LAYOUT_MODE=0/1` (switching sections/attributes only) + +### Case C: `FORCE_LIBC` is faster than hakmem, but the gap to `system bin` is also large + +Conclusion: **both** an allocator difference and a layout penalty exist. + +Next core work: +- First reduce the **layout penalty** (Phase 18 Hot Text Isolation) +- Then move on to **gate consolidation** (Phase 19) + +--- + +## 3. 
Observability (minimal) + +- Save the raw throughput of the 10 runs (the `scripts/run_mixed_10_cleanenv.sh` output log is sufficient) +- Additionally, take a single `perf stat` (200M iters, 1-run): + +```sh +perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \ + env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FORCE_LIBC_ALLOC=0 \ + ./bench_random_mixed_hakmem 200000000 400 1 +``` + +Take one more run with the same command and `HAKMEM_FORCE_LIBC_ALLOC=1`. + +--- + +## 4. Key rules (Box Theory) + +- Run A/B in the **same binary** (avoid misjudgment caused by layout/LTO differences) +- Every new optimization must have an ENV gate (reversible) and a single boundary +- When in doubt, prefer “Fail-Fast and fall back” (consistency over speed) + diff --git a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md new file mode 100644 index 00000000..2bf9ce7b --- /dev/null +++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md @@ -0,0 +1,135 @@ +# Phase 18: Hot Text Isolation v1 — Design + +## 0. 
Context (from Phase 17) + +Phase 17 established **Case B**: +- Same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator delta is negligible**. +- The large gap appears vs the tiny `bench_random_mixed_system` binary. + +Signal: +- I-cache misses / instructions / cycles are far worse in the hakmem-linked binary. +- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap. + +Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md` + +--- + +## 1. Goal + +Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms. + +Primary success metric: +- Mixed (16–1024B) throughput improvement, with accompanying reductions in: + - `iTLB/icache misses` (or “I-cache misses” counter used in Phase 17) + - total instructions executed per 200M iters + +--- + +## 2. Non-goals + +- No allocator algorithm redesign. +- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes). +- No “delete code = faster” experiments (Phase 17 showed layout dominates; deletions confound results). + +--- + +## 3. Box Theory framing + +This is a “build/layout box”: +- **Box**: HotTextIsolationBox (compile-time layout controls + annotations) +- **Boundary**: build flag / TU split (no runtime overhead) +- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1` +- **Observability**: perf stat + binary size (no always-on logs) + +--- + +## 4. Design: v1 tactics (low-risk) + +### 4.1 Hot/Cold attributes SSOT + +Introduce a single header defining attributes: +- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally `.text.hak_hot`) +- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally `.text.hak_cold`) + +Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`. + +Why: +- Makes “what is hot/cold” explicit and consistent (SSOT). +- Lets us annotate a small set of functions without scattering ad-hoc attributes. 
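The attribute SSOT described in 4.1 could look roughly like the sketch below. The macro names `HAK_HOT_FN` / `HAK_COLD_FN` and the `HAKMEM_HOT_TEXT_ISOLATION` switch come from the design text; everything else (the `demo_*` functions, the four-entry cache, the omission of the optional `.text.hak_hot` / `.text.hak_cold` sections) is a hypothetical illustration rather than the planned header.

```c
#include <assert.h>
#include <stddef.h>

/* Build knob: attributes are no-ops unless isolation is explicitly enabled. */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && defined(__GNUC__)
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

/* Hypothetical four-entry cache used only to show a hot/cold split. */
static int    g_demo_cache[4];
static size_t g_demo_count;

/* Cold: rarely-run refill, kept out of line so it does not bloat the hot path. */
HAK_COLD_FN void demo_refill(void) {
    for (size_t i = 0; i < 4; i++) {
        g_demo_cache[i] = (int)(i + 1);   /* fills 1, 2, 3, 4 */
    }
    g_demo_count = 4;
}

/* Hot: straight-line "try → return"; calls the cold helper only when empty. */
HAK_HOT_FN int demo_pop(void) {
    if (g_demo_count == 0) {
        demo_refill();
    }
    assert(g_demo_count > 0);
    return g_demo_cache[--g_demo_count];
}
```

Building the same file with `-DHAKMEM_HOT_TEXT_ISOLATION=1` and `=0` gives two variants whose behavior is identical and which differ only in code placement hints, which is the kind of A/B the rollback knob is meant to support.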
+ +### 4.2 Translation-unit split for wrappers + +Move wrapper definitions out of `core/hakmem.c` into a dedicated TU: +- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h` + +Why: +- Prevents wrapper text from being interleaved with unrelated code in the same TU. +- Improves the linker’s ability to cluster hot code. +- Enables future link-order experiments (symbol ordering files) without touching allocator logic. + +### 4.3 Cold code isolation + +Ensure rarely-hit helpers stay cold/out-of-line: +- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging) +- “slow fallback” paths (`malloc_cold`, `free_cold`) + +Principle: +- Hot path must remain a straight-line “try → return” shape. +- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers. + +### 4.4 Optional: section GC for bench builds + +For bench binaries only: +- add `-ffunction-sections -fdata-sections` +- link with `-Wl,--gc-sections` + +Why: +- Drops truly-unused text and reduces overall text pressure. +- Helps the linker keep hot text denser. + +This is optional because it is toolchain-sensitive; measure before promoting. + +--- + +## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out + +Phase 17 shows the hakmem-linked binary executes ~2x instructions vs the tiny system binary. If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build. + +Concept: +- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob) +- In this mode: + - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.) 
+ - ENV gates become compile-time (no TLS/env probing in hot path) + - Hot counters/stats macros compile out completely + +Why this still fits Box Theory: +- It is a **build box** (reversible by knob), not an algorithm rewrite +- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact +- Observability shifts to `perf stat` (no always-on logging) + +Expected impact: +- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%). + +## 5. Risks / mitigations + +### Risk A: layout tweaks regress throughput + +Mitigation: +- A/B using the same workload + perf stat counters (Phase 17 set). +- If regression: keep as research-only (build knob default OFF). + +### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions) + +Mitigation: +- Keep v1 minimal (TU split + attributes first). +- Only enable `--gc-sections` if it’s stable in the current toolchain. + +--- + +## 6. Expected impact + +Conservative: +- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses. + +Stretch goal: +- Bring “hakmem-linked + FORCE_LIBC” closer to `bench_random_mixed_system` ceiling by minimizing wrapper text working-set. diff --git a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md new file mode 100644 index 00000000..c2ac5be3 --- /dev/null +++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md @@ -0,0 +1,165 @@ +# Phase 18: Hot Text Isolation v1 — Next Instructions + +## Status + +- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**. +- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**. + +Refs: +- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md` +- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md` + +--- + +## 0. 
Goal / Success Criteria + +Primary (v1 is expected to be “low risk, modest effect”): +- GO at Mixed (16–1024B) throughput **+2%** or better (a realistic bar for layout work) + +Secondary (must move in the right direction): +- I-cache misses reduced (target: **-10%** or better) +- Total instructions reduced (target: **-5%** or better) + +If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once. + +--- + +## 1. Patch Plan (small, reversible) + +### Patch 1: Hot/Cold attribute SSOT (L0 Box) + +Add: +- `core/box/hot_text_attrs_box.h` + +Defines: +- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`) + +Usage: +- annotate only a short, high-impact list first: + - wrappers: `malloc/free/calloc/realloc` + - FastLane entry helpers (if non-inline) + - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics + +Rollback: build knob off. + +### Patch 2: Wrapper TU split (L1 Box boundary) + +Move wrapper definitions out of `core/hakmem.c`: +- new: `core/hak_wrappers_box.c` + - `#include "box/hak_wrappers.inc.h"` +- remove wrapper include from `core/hakmem.c` + +Rationale: +- Prevents wrapper text from being interleaved with unrelated code in one TU. +- Sets up link-order clustering. + +Rollback: restore include in `core/hakmem.c` and drop new TU. + +### Patch 3 (optional): bench-only section GC + +Makefile knob: +- `HOT_TEXT_ISOLATION=0/1` + +When `=1`, add for bench builds: +- `-DHAKMEM_HOT_TEXT_ISOLATION=1` +- `-ffunction-sections -fdata-sections` +- `LDFLAGS += -Wl,--gc-sections` + +Notes: +- Keep it bench-only first (do not touch shared lib build until proven stable). +- If toolchain rejects `--gc-sections` or results are unstable → skip this patch. + +--- + +## 2. 
A/B Procedure (required) + +### 2.1 Baseline build (OFF) + +```sh +make clean +make -j bench_random_mixed_hakmem bench_random_mixed_system +ls -lh bench_random_mixed_hakmem bench_random_mixed_system +scripts/run_mixed_10_cleanenv.sh +``` + +Perf stat (1 run, 200M iters): + +```sh +perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \ + env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \ + ./bench_random_mixed_hakmem 200000000 400 1 +``` + +### 2.2 Optimized build (ON) + +```sh +make clean +make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system +ls -lh bench_random_mixed_hakmem bench_random_mixed_system +scripts/run_mixed_10_cleanenv.sh +``` + +Perf stat (same command): + +```sh +perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \ + env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \ + ./bench_random_mixed_hakmem 200000000 400 1 +``` + +### 2.3 System ceiling check (optional) + +```sh +./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true +``` + +--- + +## 3. GO/NO-GO Decision + +- **GO**: Mixed 10-run mean **+2%** or better and no health regressions +- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering) +- **NO-GO**: **-2%** or worse → rollback and freeze + +Health profiles: + +```sh +scripts/verify_health_profiles.sh +``` + +--- + +## 4. Reporting (required artifacts) + +Create: +- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md` + - throughput A/B (10-run) + - binary sizes + - perf stat table (cycles/instructions/I-cache) + - conclusion (GO/NEUTRAL/NO-GO) + +Update: +- `CURRENT_TASK.md` (Phase 18 status + next) + +--- + +## 5. Notes / guardrails + +- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine). 
+- Avoid “delete code” experiments; only isolate/cold/cluster. +- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers. + +--- + +## 6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL) + +To directly cut the “instructions 2x” seen in Phase 17, layout alone is probably not enough; it is likely necessary to **compile out the ENV/stats/debug fixed costs mixed into the hot path**. + +Next move (bench-only binary / reversible): + +- With `HAKMEM_BENCH_MINIMAL=1` (Makefile knob): + - Pin the “normally ON” paths of FastLane / wrappers and turn ENV gates into compile-time constants + - Compile out hot counters completely + - Observe with `perf stat` only (no always-on logging) + +Expected: +10–20% (if instruction footprint really is dominant, this is where the numbers should move substantially) diff --git a/scripts/run_mixed_10_cleanenv.sh b/scripts/run_mixed_10_cleanenv.sh index a727252a..d8809ab0 100755 --- a/scripts/run_mixed_10_cleanenv.sh +++ b/scripts/run_mixed_10_cleanenv.sh @@ -15,9 +15,12 @@ export HAKMEM_TINY_C7_PRESERVE_HEADER=${HAKMEM_TINY_C7_PRESERVE_HEADER:-0} export HAKMEM_TINY_TCACHE=${HAKMEM_TINY_TCACHE:-0} export HAKMEM_TINY_TCACHE_CAP=${HAKMEM_TINY_TCACHE_CAP:-64} export HAKMEM_MALLOC_TINY_DIRECT=${HAKMEM_MALLOC_TINY_DIRECT:-0} +export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=${HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT:-0} +export HAKMEM_FORCE_LIBC_ALLOC=${HAKMEM_FORCE_LIBC_ALLOC:-0} export HAKMEM_ENV_SNAPSHOT_SHAPE=${HAKMEM_ENV_SNAPSHOT_SHAPE:-0} -export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-0} -export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-0} +# NOTE: Phase 9/10 are promoted (bench_profile defaults to 1). Keep cleanenv aligned by default. 
+export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-1} +export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-1} for i in $(seq 1 "${runs}"); do echo "=== Run ${i}/${runs} ==="