Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic difference vs binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed**: allocator difference negligible, layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
138
CURRENT_TASK.md
@ -342,6 +342,144 @@ Cumulative improvements achieved in Phase 6-10:
- Neither pointer-chase reduction nor cache-shape changes yield a significant improvement over the current TLS array cache
- Closing the next mimalloc gap (about 2.4x) requires a different class of approach

---

### Phase 16 v1: Front FastLane Alloc LEGACY Direct - ⚠️ NEUTRAL (+0.62%) - kept as research box (default OFF)

**Date**: 2025-12-15
**Verdict**: **NEUTRAL (+0.62% Mixed, +0.06% C6-heavy)**, kept as research box (default OFF)

**Motivation**:
- Phase 14-15 are frozen (cache-shape / pointer-chase work had thin ROI)
- On the free side, "monolithic early-exit + dedup" was the winning pattern (Phase 9/10/6-2)
- Apply the same pattern to the alloc side: cut the fixed route/policy cost of the LEGACY route at the FastLane entry

**Results**:
| Workload | ENV=0 (Baseline, ops/s) | ENV=1 (Direct, ops/s) | Delta |
|---------|----------|----------|-------|
| Mixed (16–1024B) | 47,510,791 | 47,803,890 | **+0.62%** |
| C6-heavy (257–768B) | 21,134,240 | 21,147,197 | **+0.06%** |
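
The Delta column can be reproduced from the two throughput columns. The sketch below is illustrative only: the function names are hypothetical, and the `>= 1.0` comparison is an assumption about how the "+1.0% GO threshold" is applied. The finer NEUTRAL / NO-GO split is not modeled.

```c
#include <stdio.h>

/* Relative throughput change of `candidate` over `baseline`, in percent. */
double delta_percent(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
}

/* Assumed GO rule: at least +1.0% on the workload under test.
 * Anything below is reported as "below GO threshold". */
int is_go(double delta_pct) {
    return delta_pct >= 1.0;
}

/* Prints one row of an A/B summary (hypothetical helper). */
void report(const char *name, double baseline, double candidate) {
    double d = delta_percent(baseline, candidate);
    printf("%-10s %+.2f%% %s\n", name, d,
           is_go(d) ? "GO" : "below GO threshold");
}
```

Calling `report("Mixed", 47510791.0, 47803890.0)` and `report("C6-heavy", 21134240.0, 21147197.0)` yields deltas that round to the table's +0.62% and +0.06%.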

**Critical Issue & Fix**:
- **Segfault discovered**: Initial implementation crashed for C4-C7 during `unified_cache_refill()` → `tiny_next_read()`
- **Root cause**: Refill logic incompatibility for classes C4-C7
- **Safety fix**: Limited optimization to C0-C3 only (matching existing dualhot pattern)
  - Code constraint: `if (... && (unsigned)class_idx <= 3u)` added to line 96 of `front_fastlane_box.h`
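
A minimal sketch of the class guard described above. The function name and the `env_enabled` parameter are hypothetical; only the `(unsigned)class_idx <= 3u` comparison comes from the note. The rest of the real condition is elided in the source and is not reconstructed here.

```c
/* Hypothetical stand-in for the guard: the LEGACY direct path is taken
 * only when the ENV gate is on and the size class is C0-C3. */
int legacy_direct_allowed(int env_enabled, int class_idx) {
    /* The unsigned cast folds the "class_idx >= 0" check into the upper
     * bound test: a negative index converts to a large unsigned value
     * and is rejected by the same comparison. */
    return env_enabled && (unsigned)class_idx <= 3u;
}
```

With this shape, C4-C7 (and any invalid negative index) fall through to the existing route, which is the behavior the safety fix relies on.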

**Conclusion**:
- Optimization overlaps with existing dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) for C0-C3
- Limited scope (C0-C3 only) reduces potential benefit
- Route/policy overhead already minimized by Phase 6 FastLane collapse
- Pattern continues from Phase 14-15: dispatch-layer optimizations showing NEUTRAL results

**Root causes of limited benefit**:
1. Safety constraint: C4-C7 excluded due to refill bug
2. Overlap with dualhot: C0-C3 already have direct path when dualhot enabled
3. Route overhead not dominant: Phase 6 already collapsed major dispatch costs

**Recommendations**:
- **Freeze as research box** (default OFF, no preset promotion)
- **Investigate C4-C7 refill issue** before expanding scope
- **Shift optimization focus** away from dispatch layers (Phase 14/15/16 all NEUTRAL)

**Refs**:
- A/B results: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md`
- Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`
- Instructions: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md`
- ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)
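
The ENV gate follows a read-once, cache-in-a-global pattern. The sketch below mirrors the parse rule and the `-1` "not yet read" sentinel from the ENV box in this commit. All `demo_*` names are hypothetical, and the lazy hot-path accessor is an assumption, since the real accessor is not shown here.

```c
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* -1 = not yet read, 0 = OFF, 1 = ON (hypothetical demo variable). */
static _Atomic int g_demo_enabled = -1;

/* Parse rule: unset => OFF (opt-in); a leading '1', "true" or "TRUE" => ON. */
int demo_parse_enabled(const char *env) {
    if (env && (env[0] == '1' || strcmp(env, "true") == 0 ||
                strcmp(env, "TRUE") == 0)) {
        return 1;
    }
    return 0;
}

/* Cold path: read the environment and cache the result. */
int demo_env_init(void) {
    int enabled = demo_parse_enabled(
        getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT"));
    atomic_store_explicit(&g_demo_enabled, enabled, memory_order_relaxed);
    return enabled;
}

/* Assumed hot-path accessor: a single relaxed load once the value is cached. */
int demo_env_enabled(void) {
    int v = atomic_load_explicit(&g_demo_enabled, memory_order_relaxed);
    if (v < 0) v = demo_env_init();
    return v;
}
```

Because the rule only inspects the first character for `'1'`, a value such as `10` also enables the gate. A refresh hook like the one called from `bench_apply_profile()` would simply re-run the init step after `putenv` defaults are applied.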

---

### Phase 14-16 Summary: Post-FastLane Research Phases ⚠️

**Conclusion**: Phase 14-16 all NEUTRAL (frozen as research boxes)

| Phase | Approach | Mixed Delta | Verdict |
|-------|----------|-------------|---------|
| 14 v1 | tcache (free-side only) | +0.20% | NEUTRAL |
| 14 v2 | tcache (alloc+free) | +0.08% | NEUTRAL |
| 15 v1 | FIFO→LIFO (array cache) | -0.70% | NEUTRAL |
| 16 v1 | Alloc LEGACY direct | **+0.62%** | **NEUTRAL** |

**Lessons**:
- None of pointer-chase reduction, cache-shape changes, or dispatch early-exit produced a significant improvement
- Since the Phase 6 FastLane collapse (entry fixed-cost reduction), optimizing the dispatch/routing layer has thin ROI
- Closing the next mimalloc gap (about 2.4x) needs a different dimension: cache miss cost, memory layout, backend allocation, etc.

---

### Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) ✅ COMPLETE (2025-12-15)

**Purpose**: Establish a single source of truth (SSOT) for the "system malloc is faster" observation. A/B `hakmem` vs `libc` in the **same binary** to separate the source of the gap (allocator difference vs layout difference).

**Result**: **Case B confirmed**: allocator difference negligible (+0.39%), layout penalty dominant (+73.57%)

**Gap Breakdown** (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s (mean), 48.12M ops/s (median)
- libc same-binary (FORCE_LIBC=1): 48.31M ops/s (mean), 48.31M ops/s (median)
  - **Allocator difference**: **+0.39%** (libc slightly faster, within noise)
- system binary (21K): 83.85M ops/s (mean), 83.75M ops/s (median)
  - **Layout penalty**: **+73.57%** (small binary vs large binary 653K)
  - **Total gap**: **+74.26%** (hakmem → system binary)
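
The three percentages above are ratios of the mean throughputs, and they compose multiplicatively rather than additively. A small sketch of the arithmetic, with hypothetical function names:

```c
#include <stdio.h>

/* Speedup of `fast` over `slow`, in percent. */
double gap_percent(double slow, double fast) {
    return (fast / slow - 1.0) * 100.0;
}

/* Prints the three-way breakdown for throughputs given in M ops/s. */
void print_gap_breakdown(double hakmem, double libc_same_bin,
                         double system_bin) {
    printf("allocator diff: %+.2f%%\n", gap_percent(hakmem, libc_same_bin));
    printf("layout penalty: %+.2f%%\n", gap_percent(libc_same_bin, system_bin));
    printf("total gap:      %+.2f%%\n", gap_percent(hakmem, system_bin));
}
```

Feeding in the rounded means (48.12, 48.31, 83.85) reproduces +0.39% and +73.57%, and gives a total within about 0.01 points of the reported +74.26%. The small difference presumably comes from the reported figure using unrounded means. The identity `(1 + allocator) * (1 + layout) = 1 + total` holds exactly for the ratios.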

**Perf Stat Analysis** (200M iters, 1-run):
- I-cache misses: 153K (hakmem) → 68K (system) = **-55%** (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%

**Root Cause**: Binary size (653K vs 21K, 30x difference) causes I-cache thrashing. Code bloat >> algorithmic efficiency.

**Lessons**:
- The Phase 12 observation that "system malloc is 1.6x faster" was correct, but the cause is **binary layout**, not the allocator algorithm
- Same-binary A/B is mandatory (cross-binary comparisons are confounded by layout and lead to wrong conclusions)
- I-cache efficiency is a first-order factor for allocator-heavy workloads

**Next Direction** (recommended for Case B):
- **Phase 18: Hot Text Isolation / Layout Control**
  - Priority 1: Cold code isolation (`__attribute__((cold,noinline))` + separate TU)
  - Priority 2: Link-order optimization (hot functions contiguous placement)
  - Priority 3: PGO (optional, profile-guided layout)
- Target: +10% throughput via I-cache optimization (48.1M → 52.9M ops/s)
- Success metric: I-cache misses -30% (153K → 107K)

**Files**:
- Results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
- Instructions: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md`

---

### Phase 18: Hot Text Isolation / Layout Control - NEXT

**Purpose**: Improve I-cache efficiency through binary layout optimization and narrow the gap to the system binary.

**Strategy**:
1. **Cold Code Isolation** (priority 1)
   - Move stats collection, debug logging, and error handlers into separate TUs
   - Mark them explicitly cold with `__attribute__((cold, noinline))`
   - Expected effect: I-cache misses -20%

2. **Link-Order Optimization** (priority 2)
   - Place hot functions contiguously (linker script or link order control)
   - `-ffunction-sections` + custom linker script
   - Expected effect: I-cache misses -10%

3. **Profile-Guided Optimization** (priority 3, optional)
   - Measurement-driven placement with `-fprofile-generate` + `-fprofile-use`
   - Expected effect: I-cache misses -10-20%
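
As an illustration of the first strategy item, the sketch below shows one way a cold marker could be gated on the compile-time flag. `DEMO_COLD`, `DEMO_UNLIKELY`, and the `demo_*` functions are hypothetical names, not the project's. The attribute itself is the GCC/Clang `cold, noinline` pair named above; on many targets it groups such functions in a separate text subsection.

```c
#include <stdio.h>

/* Default the flag to ON for this sketch; a build would pass
 * -DHAKMEM_HOT_TEXT_ISOLATION=0/1 instead. */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 1
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define DEMO_COLD __attribute__((cold, noinline))
#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define DEMO_COLD
#define DEMO_UNLIKELY(x) (x)
#endif

unsigned long g_demo_slow_calls;

/* Cold: stats and logging live here, out of the hot function body. */
DEMO_COLD void demo_alloc_slow_report(unsigned long size) {
    g_demo_slow_calls++;
    fprintf(stderr, "[demo] slow path for size=%lu\n", size);
}

/* Hot: only a compare and a rarely taken call remain inline. */
int demo_alloc_fast(unsigned long size) {
    if (DEMO_UNLIKELY(size > 1024)) {
        demo_alloc_slow_report(size);
        return 0;
    }
    return 1;
}
```

With the flag set to 0 the macros expand to nothing, so the same source builds both sides of a layout A/B.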

**Build Gate**: `HOT_TEXT_ISOLATION=0/1` (for layout A/B)

**Target**:
- v1 (TU split / attrs / optional gc-sections): **GO at +2%** (NEUTRAL is expected to be likely)
- v2 (BENCH_MINIMAL compile-out): aim for **+10–20%** (cuts the instruction footprint directly)
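
The v2 idea of compiling stats out of the hot path can be sketched as below. The macro `DEMO_BENCH_MINIMAL` and the other names are hypothetical; the real switch used by Phase 18 v2 is not given in this document.

```c
/* Hypothetical switch: build with -DDEMO_BENCH_MINIMAL=1 to compile
 * stats out of the hot path. */
#ifndef DEMO_BENCH_MINIMAL
#define DEMO_BENCH_MINIMAL 0
#endif

#if DEMO_BENCH_MINIMAL
#define DEMO_STAT_INC(counter) ((void)0)
#else
#define DEMO_STAT_INC(counter) ((counter)++)
#endif

unsigned long g_demo_alloc_calls;

/* With DEMO_BENCH_MINIMAL=1 the increment disappears at preprocessing
 * time, so no load/add/store for the counter is emitted here. */
int demo_alloc(unsigned long size) {
    DEMO_STAT_INC(g_demo_alloc_calls);
    return size != 0;
}
```

Unlike a runtime ENV check, this removes the instructions themselves, which is why v2 targets instruction count rather than placement alone.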

**Design**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
**Instructions**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`

Implementation gates (reversible):
- Makefile knob: `HOT_TEXT_ISOLATION=0/1`
- Compile-time: `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`

## Update notes (2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot)

### Phase 5 E5-3: Candidate Analysis & Strategic Recommendations ⚠️ DEFER (2025-12-14)

6
Makefile
@ -218,12 +218,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)

# Targets
TARGET = test_hakmem
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS = $(OBJS_BASE)

# Shared library
SHARED_LIB = libhakmem.so
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o 
core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o 
core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o

# Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
ifeq ($(POOL_TLS_PHASE1),1)
@ -427,7 +427,7 @@ test-box-refactor: box-refactor
	./larson_hakmem 10 8 128 1024 1 12345 4

# Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/page_arena.o 
core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1)
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o

core/bench_profile.h
@ -13,6 +13,7 @@
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
#endif

// Only set defaults when the env var is unset
@ -193,5 +194,7 @@ static inline void bench_apply_profile(void) {
    tiny_tcache_env_refresh_from_env();
    // Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
    tiny_unified_lifo_env_refresh_from_env();
    // Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
    front_fastlane_alloc_legacy_direct_env_refresh_from_env();
#endif
}

63
core/box/front_fastlane_alloc_legacy_direct_env_box.c
Normal file
@@ -0,0 +1,63 @@
// ============================================================================
// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0) - Implementation
// ============================================================================

#include "front_fastlane_alloc_legacy_direct_env_box.h"
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

// ============================================================================
// Global State
// ============================================================================

_Atomic int g_front_fastlane_alloc_legacy_direct_enabled = -1;

// ============================================================================
// Init (Cold Path)
// ============================================================================

int front_fastlane_alloc_legacy_direct_env_init(void) {
    const char* env = getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT");
    int enabled = 0;  // default: OFF (opt-in)

    if (env && (env[0] == '1' || strcmp(env, "true") == 0 || strcmp(env, "TRUE") == 0)) {
        enabled = 1;
    }

    // Cache result
    atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, enabled, memory_order_relaxed);

    // Log once (stderr for immediate visibility)
    if (enabled) {
        const char msg[] = "[FRONT_FASTLANE_ALLOC_LEGACY_DIRECT] enabled\n";
        ssize_t w = write(2, msg, sizeof(msg) - 1);
        (void)w;
    }

    return enabled;
}

// ============================================================================
// Hot Path (LTO Fallback)
// ============================================================================

// LTO fallback: Non-inline version for cases where LTO can't inline
int front_fastlane_alloc_legacy_direct_enabled(void) {
    int val = atomic_load_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, memory_order_relaxed);
    if (__builtin_expect(val == -1, 0)) {
        val = front_fastlane_alloc_legacy_direct_env_init();
    }
    return val;
}

// ============================================================================
// Refresh (Cold Path, called from bench_profile)
// ============================================================================

void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void) {
    // Reset to uninitialized state (-1)
    // Next call to front_fastlane_alloc_legacy_direct_enabled() will re-read ENV
    atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, -1, memory_order_relaxed);
}
63
core/box/front_fastlane_alloc_legacy_direct_env_box.h
Normal file
@@ -0,0 +1,63 @@
// ============================================================================
// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0)
// ============================================================================
//
// Purpose: ENV gate for FastLane alloc LEGACY direct path
//
// Design: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md
// Instructions: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
//
// Strategy:
//   - Reduce the fixed route/policy cost on the alloc side
//   - Send LEGACY allocations straight through at the FastLane entry (hot → cold → fallback)
//   - Apply the winning free-side pattern (Phase 9/10) to alloc as well
//
// ENV:
//   HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1 (default: 0, opt-in)
//
// API:
//   front_fastlane_alloc_legacy_direct_enabled() -> int
//   front_fastlane_alloc_legacy_direct_env_refresh_from_env()
//
// Box Theory:
//   - L0: This file (ENV gate, reversible)
//   - L1: front_fastlane_box.h (LEGACY direct early-exit)
//   - L2: malloc_tiny_fast_for_class (existing fallback)
//
// Safety:
//   - ENV-gated (default OFF, opt-in)
//   - Reversible (ENV toggle)
//   - Fail-Fast (falls back to the existing path when the direct conditions are not met)
//
// ============================================================================

#ifndef FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
#define FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H

#include <stdatomic.h>

// ============================================================================
// Global State (L0)
// ============================================================================

// Cached state: -1 (uninitialized), 0 (disabled), 1 (enabled)
extern _Atomic int g_front_fastlane_alloc_legacy_direct_enabled;

// ============================================================================
// Hot API (L0)
// ============================================================================

// Check if FastLane alloc LEGACY direct is enabled
// Returns: 1 if enabled, 0 if disabled
// Note: Implementation in .c file (non-inline for LTO compatibility)
extern int front_fastlane_alloc_legacy_direct_enabled(void);

// ============================================================================
// Cold API (L2)
// ============================================================================

// Refresh ENV cache (called from bench_profile after putenv)
// Pattern: Same as Phase 8/13/14/15
extern void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void);

#endif // FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
@@ -42,6 +42,11 @@
#include "front_fastlane_stats_box.h"
#include "../hakmem_tiny.h" // hak_tiny_size_to_class, tiny_get_max_size
#include "../front/malloc_tiny_fast.h" // malloc_tiny_fast_for_class
#include "front_fastlane_alloc_legacy_direct_env_box.h" // Phase 16 v1: LEGACY direct
#include "tiny_static_route_box.h" // tiny_static_route_ready_fast, tiny_static_route_get_kind_fast
#include "tiny_front_hot_box.h" // tiny_hot_alloc_fast
#include "tiny_front_cold_box.h" // tiny_cold_refill_and_alloc
#include "smallobject_policy_v7_box.h" // SMALL_ROUTE_LEGACY

// FastLane is only safe after global init completes.
// Before init, wrappers must handle recursion guards + syscall init.
@@ -85,6 +90,34 @@ static inline void* front_fastlane_try_malloc(size_t size) {
        return NULL;  // Class not enabled → fallback
    }

    // Phase 16 v1: LEGACY direct path (early-exit optimization)
    // Try direct allocation for LEGACY routes only (skip route/policy overhead)
    // TEMPORARY SAFETY: Limit to C0-C3 (match dualhot pattern) until refill issue debugged
    if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
        // Condition 1: Static route must be ready (Learner interlock check)
        // Condition 2: Route must be LEGACY (only when this can be asserted with certainty)
        if (tiny_static_route_ready_fast() &&
            tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY) {

            // Hot path: Try UnifiedCache first
            void* ptr = tiny_hot_alloc_fast(class_idx);
            if (__builtin_expect(ptr != NULL, 1)) {
                FRONT_FASTLANE_STAT_INC(malloc_hit);
                return ptr;  // Success (cache hit)
            }

            // Cold path: Refill UnifiedCache and retry
            ptr = tiny_cold_refill_and_alloc(class_idx);
            if (__builtin_expect(ptr != NULL, 1)) {
                FRONT_FASTLANE_STAT_INC(malloc_hit);
                return ptr;  // Success (after refill)
            }

            // Fallback: Direct path failed → use existing route (safety)
            // This handles edge cases (Learner transition, policy changes, etc.)
        }
    }

    // Call existing hot handler (no duplication)
    // This is the winning path from E5-4 / Phase 4 E2
    void* ptr = malloc_tiny_fast_for_class(size, class_idx);
@@ -0,0 +1,208 @@
# Phase 16: Front FastLane Alloc LEGACY Direct v1 — A/B Test Results

**Date**: 2025-12-15
**Status**: NEUTRAL (+0.62%)

---

## Executive Summary

Phase 16 v1 attempted to reduce alloc-side fixed costs by adding a LEGACY direct path to FastLane entry point, bypassing route/policy overhead for LEGACY allocations. The optimization mirrored the free-side winning pattern (Phase 9/10).

**Result**: +0.62% on Mixed (NEUTRAL), below +1.0% GO threshold.

**Critical Issue Discovered**: Initial implementation caused segmentation fault for classes C4-C7. Root cause: `unified_cache_refill()` incompatibility. **Safety fix applied**: Limited optimization to C0-C3 only (matching existing dualhot pattern).

**Verdict**: NEUTRAL — freeze as research box (default OFF).

---

## A/B Test Results

### Mixed (16-1024B, 10-run clean env)

**Baseline** (ENV=0):
- Mean: 47,510,791 ops/s
- Median: 47,606,360 ops/s
- Runs: 48151673, 47596179, 47735208, 47903499, 46674576, 47977105, 47236265, 47481537, 46735322, 47616542

**Optimized** (ENV=1):
- Mean: 47,803,890 ops/s
- Median: 47,901,551 ops/s
- Runs: 47401229, 47908200, 48158776, 48126240, 47477867, 47894902, 47644796, 48191059, 47930512, 47305320

**Delta**:
- Mean: **+0.62%**
- Median: **+0.62%**

**Verdict**: NEUTRAL (below +1.0% GO threshold)
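
The delta and verdict above can be re-derived from the raw runs. The following is a minimal sketch, not project code: the `ab_*` names are hypothetical, and the thresholds (GO at +1.0% or more, NO-GO at -1.0% or less, NEUTRAL in between) are the ones stated in the Phase 16 design document.

```c
#include <stddef.h>

/* Verdict categories used in the Phase 16 A/B gate. */
typedef enum { VERDICT_NO_GO, VERDICT_NEUTRAL, VERDICT_GO } ab_verdict_t;

/* Mixed 10-run samples (ops/s), copied from the lists above. */
const double k_baseline_runs[10] = {
    48151673, 47596179, 47735208, 47903499, 46674576,
    47977105, 47236265, 47481537, 46735322, 47616542};
const double k_optimized_runs[10] = {
    47401229, 47908200, 48158776, 48126240, 47477867,
    47894902, 47644796, 48191059, 47930512, 47305320};

/* Arithmetic mean of n throughput samples. */
double ab_mean(const double *runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += runs[i];
    return n ? sum / (double)n : 0.0;
}

/* Relative change of optimized vs baseline, in percent. */
double ab_delta_pct(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}

/* GO at +1.0% or more, NO-GO at -1.0% or less, NEUTRAL otherwise. */
ab_verdict_t ab_classify(double delta_pct) {
    if (delta_pct >= 1.0) return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

Feeding the two run lists through `ab_mean()` and `ab_delta_pct()` gives a mean delta of about +0.62%, which `ab_classify()` maps to `VERDICT_NEUTRAL`.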

---

### C6-heavy Regression Check (5-run)

**Baseline** (ENV=0):
- Mean: 21,134,240 ops/s
- Median: 21,186,983 ops/s
- Runs: 21186983, 21327420, 20807950, 21112023, 21236823

**Optimized** (ENV=1):
- Mean: 21,147,197 ops/s
- Median: 21,139,301 ops/s
- Runs: 21358869, 21209299, 20992077, 21139301, 21036438

**Delta**:
- Mean: **+0.06%**
- Median: **-0.23%**

**Verdict**: PASS (no significant regression)

---

## Implementation Summary

### Files Modified

1. **`core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`** (new)
   - L0 ENV gate for LEGACY direct feature
   - ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)
   - API: `front_fastlane_alloc_legacy_direct_enabled()`, `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`

2. **`core/box/front_fastlane_box.h`**
   - Added LEGACY direct early-exit in `front_fastlane_try_malloc()` (lines 93-119)
   - **SAFETY CONSTRAINT**: Limited to C0-C3 only due to refill incompatibility for C4-C7
   - Direct conditions: ENV enabled + static route ready + LEGACY route confirmed
   - Direct path: `tiny_hot_alloc_fast()` → `tiny_cold_refill_and_alloc()` → fallback to `malloc_tiny_fast_for_class()`

3. **`core/bench_profile.h`**
   - Added `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` to refresh sync group

4. **`Makefile`**
   - Added `front_fastlane_alloc_legacy_direct_env_box.o` to OBJS_BASE, SHARED_OBJS, TINY_BENCH_OBJS_BASE

---

## Critical Bug & Fix

### Issue: Segmentation Fault (Exit Code 139)

**Symptom**: Benchmark crashed with ENV=1 during larger workloads (20M iterations).

**Root Cause**:
- Crash occurred in `unified_cache_refill()` → `tiny_next_read()` (intrusive pointer read)
- Initial implementation attempted to use direct path for ALL classes (C0-C7)
- Classes C4-C7 triggered incompatibility with `unified_cache_refill()` logic
- Existing dualhot code (Phase ALLOC-TINY-FAST-DUALHOT-2) only operates on C0-C3

**Backtrace**:
```
#0 0x0000555555564d89 in tiny_next_read.lto_priv.5.lto_priv ()
#1 0x00007ffff7b00318 in ?? ()
#2 0x0000555555557f29 in unified_cache_refill ()
```

**Fix Applied**:
- Limited LEGACY direct path to C0-C3 only (line 96 of front_fastlane_box.h)
- Added safety comment explaining constraint
- Matches existing proven pattern from dualhot implementation

**Code Change**:
```c
// Before (CRASHED):
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled(), 0)) {

// After (SAFE):
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
```

---

## Analysis

### Why +0.62% is Below Threshold

1. **Limited Scope**: Optimization only applies to C0-C3 due to safety constraint
   - C4-C7 continue using full route/policy path
   - Mixed benchmark uses all size classes (16-1024B = C0-C5 primarily)

2. **Existing Optimizations**: dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) already optimizes C0-C3
   - LEGACY direct overlaps with dualhot coverage
   - Marginal benefit when dualhot is disabled, but default config has dualhot enabled in some profiles

3. **Route Overhead Not Dominant**: After Phase 6 FastLane collapse, route/policy fixed costs are already minimized
   - Phase 14-15 (cache shape) also showed NEUTRAL results
   - Suggests current bottleneck is not in dispatch layers

### Root Cause of Limited Benefit

The optimization targets the same problem space as existing dualhot but with different enablement conditions:
- **dualhot**: Always enabled for C0-C3, no route check
- **LEGACY direct**: ENV-gated, requires static route confirmation

When both are active, LEGACY direct provides minimal incremental value.

---

## Recommendations

1. **Freeze as Research Box** (default OFF)
   - ENV remains opt-in: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
   - No preset promotion
   - Keep code for potential future use if dualhot is disabled

2. **Investigate C4-C7 Refill Issue**
   - Root cause: Why does `unified_cache_refill()` fail for C4-C7 in this path?
   - Possible causes:
     - LIFO mode interaction (Phase 15)
     - Cache state assumptions in refill logic
     - Intrusive pointer corruption
   - **Action**: Debug under controlled conditions before expanding to C4-C7

3. **Shift Focus Away from Dispatch Layers**
   - Phase 14, 15, 16 all showed NEUTRAL results
   - Phase 6 FastLane already collapsed major dispatch overhead
   - **Next direction**: Investigate cache miss costs, memory layout, or backend allocation

4. **Consider Dualhot/LEGACY Direct Consolidation**
   - If LEGACY direct is kept, evaluate merging with dualhot logic
   - Avoid code duplication and overlap

---

## Comparison with Recent Phases

| Phase | Target | Delta (Mixed) | Verdict |
|-------|--------|---------------|---------|
| Phase 10 | Free LEGACY direct | +1.89% | **GO** |
| Phase 13 v1 | C7 preserve header | -0.40% | NEUTRAL (freeze) |
| Phase 14 v1 | tcache intrusive | +0.20% | NEUTRAL (freeze) |
| Phase 14 v2 | tcache hot integration | +0.08% | NEUTRAL (freeze) |
| Phase 15 v1 | UnifiedCache FIFO→LIFO | -0.70% | NEUTRAL (freeze) |
| **Phase 16 v1** | **Alloc LEGACY direct** | **+0.62%** | **NEUTRAL (freeze)** |

**Pattern**: Post-Phase-10 optimizations consistently show NEUTRAL results. Major gains came from earlier phases (FastLane collapse +11.13%, Free DeDup +5.18%, etc.). Current bottleneck likely not in dispatch/routing layers.

---

## Files Changed

- `core/box/front_fastlane_alloc_legacy_direct_env_box.h` (new)
- `core/box/front_fastlane_alloc_legacy_direct_env_box.c` (new)
- `core/box/front_fastlane_box.h` (modified)
- `core/bench_profile.h` (modified)
- `Makefile` (modified)

---

## ENV Variables

- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)

---

## Next Steps

1. **Freeze Phase 16** with default OFF
2. **Commit with verdict**: "Phase 16 v1: NEUTRAL (+0.62%), research box"
3. **Update CURRENT_TASK.md** with Phase 16 summary
4. **Shift optimization focus** based on profiling/analysis (away from dispatch layers)
@@ -0,0 +1,133 @@
# Phase 16: Front FastLane Alloc LEGACY Direct v1 (turning the alloc-side "second-stage hot" path into a monolithic early-exit)

**Date**: 2025-12-15
**Status**: DESIGN (Phase 16 kickoff)

---

## 0. Executive Summary (one page)

The Phase 14-15 line of work (pointer-chase / cache-shape) came out **NEUTRAL** and is frozen.
The next step moves away from "cache shape" and back to **cutting the fixed cost in instruction count and branches**.

Since Phase 6 consolidated `malloc()` into FastLane, the call path is almost always:

```
malloc() → front_fastlane_try_malloc(size) → malloc_tiny_fast_for_class(size, class_idx)
```

However, **even on the LEGACY route**, `malloc_tiny_fast_for_class()` pays fixed costs such as
the ULTRA/C7 early branches, route_kind resolution, ENV cfg reads, and the dispatch shape.
The free side (Phase 9/10/6-2) won by moving to a "monolithic early-exit",
so applying the same pattern to alloc and **sending LEGACY straight through at the FastLane entry** is expected to have high ROI.

Phase 16 keeps Box Theory intact and adds a single "LEGACY direct" path to FastLane alloc:

- **On hit**: `tiny_hot_alloc_fast(class_idx)` → return immediately (route/policy is never touched)
- **On miss**: `tiny_cold_refill_and_alloc(class_idx)` (the existing cold boundary)
- **When uncertain**: fall back to the existing `malloc_tiny_fast_for_class()` (single boundary)

---

## 1. Current state (why)

- Phase 6 (Front FastLane) collapsed wrapper→gate→policy→route and cut the entry fixed cost substantially.
- As a result, the remaining alloc-side cost tends to concentrate in **the branches, ENV reads, and route resolution inside `malloc_tiny_fast_for_class()`**.
- Changing the "shape of UnifiedCache" in Phase 14/15 did not move Mixed, so **cache shape is not the dominant factor** right now.

Phase 16 therefore **cuts the fixed route/policy cost** without changing the cache internals.

---

## 2. Proposal (Phase 16 v1)

### 2.1 Boxes to add (Box Theory)

```
L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / rollback)
    ↓
L1: front_fastlane_try_malloc() (LEGACY direct early-exit)
    ↓
L2: malloc_tiny_fast_for_class() (existing: route/policy/ULTRA/MID/V7)
    ↓
L3: tiny_front_hot_box / tiny_front_cold_box (existing: unified cache / refill)
```

**There is a single boundary**:
- "direct conditions not met / direct failed" → drop to `malloc_tiny_fast_for_class()`.

### 2.2 ENV

- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)

Start as opt-in and A/B it.
On GO, consider promoting it to a preset (incrementally, starting with MIXED only).

### 2.3 Direct conditions (Fail-Fast)

The alloc direct path is restricted to cases **"where the route can be asserted with certainty"**:

Required conditions (recommended):
- FastLane is enabled (existing)
- `size <= tiny_get_max_size()` (existing)
- `class_idx` is valid (existing)
- The class is included in `front_fastlane_class_mask` (existing)
- `tiny_static_route_ready_fast()` is true (do not use direct when it is false, e.g. due to the Learner interlock)
- `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)

Given those:
- `tiny_hot_alloc_fast(class_idx)` → return on hit
- On miss, call `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
- Only if that still returns NULL, fall back to `malloc_tiny_fast_for_class()` (safety first)

---

## 3. Visibility (minimal)

Always-on logging in Release builds is prohibited.
If needed, expose the following only under `HAKMEM_DEBUG_COUNTERS=1`:

- `front_fastlane_alloc_legacy_direct_hit`
- `front_fastlane_alloc_legacy_direct_miss`
- `front_fastlane_alloc_legacy_direct_fallback`

(Confine atomics to the stats box; do not place atomics on the hot side.)

---

## 4. A/B measurement (same binary)

GO/NO-GO (Mixed 10-run, clean env):
- GO: mean +1.0% or more
- NO-GO: mean -1.0% or less (immediate rollback / freeze)
- NEUTRAL: within ±1.0% (freeze as research box)

Targets:
- `scripts/run_mixed_10_cleanenv.sh`
- Additionally, C6-heavy 5-run (to confirm no regression)

---

## 5. Risks and mitigations

### Risk 1: the "asserted LEGACY" assumption breaks and allocations are misrouted

Mitigations:
- Make `tiny_static_route_ready_fast()` a required condition (it is expected to be false while the Learner is enabled)
- Always check route_kind (do not rely on the mask alone)
- On failure, always fall back to the existing path

### Risk 2: the direct path covers too little to show an effect

Mitigations:
- First make the "LEGACY ratio" in Mixed visible through stats (debug counters only)
- If it does not help, freeze (same treatment as Phase 14/15)

### Risk 3: the added branch backfires (a repeat of Phase 11)

Mitigations:
- Evaluate the direct check **exactly once, inside FastLane** (do not add call-site helpers)
- When the direct check is false, call the existing `malloc_tiny_fast_for_class()` unchanged
@@ -0,0 +1,124 @@
# Phase 16: Front FastLane Alloc LEGACY Direct v1 — Next Instructions

Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`

---

## 0. Status / Why now

- Phase 14-15 (tcache / FIFO→LIFO) were **NEUTRAL** → frozen (default OFF)
- The next target is not "cache shape" but **reducing the fixed route/policy cost on the alloc side**
- On the free side, the "monolithic early-exit + dedup" of Phase 9/10/6-2 was the winning pattern → apply the same pattern to the alloc side

---

## 1. GO criteria

Mixed 10-run (clean env):
- **GO**: mean +1.0% or more
- **NO-GO**: mean -1.0% or less (immediate rollback / freeze)
- **NEUTRAL**: within ±1.0% → freeze as research box

Additional gate (required):
- In an environment where `tiny_static_route_ready_fast()` is true, the LEGACY direct path is actually taken (even better if this can be confirmed with debug counters)

---

## 2. Box diagram (single boundary)

```
L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / refresh)
    ↓
L1: front_fastlane_box.h (early-exit inside try_malloc)
    ↓
L2: malloc_tiny_fast_for_class() (existing path)
```

The boundary is fixed at a single point: **"direct conditions not met / direct returned NULL → malloc_tiny_fast_for_class"**.

---

## 3. Patch order (small increments)

### Patch 1: L0 ENV gate box (reversible)

New:
- `core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`

ENV:
- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0)

API (example):
- `front_fastlane_alloc_legacy_direct_enabled() -> int`
- `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`

Requirements:
- No `getenv()` on the hot path (cached)
- Provide a refresh function to stay in sync with `putenv()` in `bench_profile` (Phase 8/13/14/15 pattern)
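
The two requirements of Patch 1 (a cached gate with no `getenv()` on the hot path, plus a refresh hook) can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the project's file: the `demo_gate_*` names and the `DEMO_GATE` variable are hypothetical, and only the leading-`'1'` check is modelled.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Cached state: -1 = not read yet, 0 = disabled, 1 = enabled. */
static _Atomic int g_demo_gate = -1;

/* Cold path: read the (hypothetical) DEMO_GATE variable and cache the result. */
int demo_gate_init(void) {
    const char *env = getenv("DEMO_GATE");
    int enabled = (env != NULL && env[0] == '1') ? 1 : 0;
    atomic_store_explicit(&g_demo_gate, enabled, memory_order_relaxed);
    return enabled;
}

/* Hot path: one relaxed load; getenv() only runs while the cache is -1. */
int demo_gate_enabled(void) {
    int v = atomic_load_explicit(&g_demo_gate, memory_order_relaxed);
    if (v == -1) {
        v = demo_gate_init();
    }
    return v;
}

/* Refresh: drop the cache so the next query re-reads the environment. */
void demo_gate_refresh_from_env(void) {
    atomic_store_explicit(&g_demo_gate, -1, memory_order_relaxed);
}
```

The refresh hook matters because a value set after the first query (for example by a benchmark profile calling `putenv()` or `setenv()`) stays invisible until the cache is reset; calling `demo_gate_refresh_from_env()` afterwards makes the next `demo_gate_enabled()` pick it up.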

### Patch 2: Integration point (exactly one, in FastLane alloc)

Target:
- `core/box/front_fastlane_box.h`

Change:
- After the class mask check in `front_fastlane_try_malloc()`, add the following "direct path"

Direct conditions (Fail-Fast):
1. `front_fastlane_alloc_legacy_direct_enabled() == 1`
2. `tiny_static_route_ready_fast()` is true (direct is prohibited when it is false, e.g. due to the Learner interlock)
3. `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)

Direct body:
- `void* p = tiny_hot_alloc_fast(class_idx);`
- `if (p) return p;`
- `p = tiny_cold_refill_and_alloc(class_idx);`
- `if (p) return p;`
- Only on failure, fall back to `malloc_tiny_fast_for_class(size, class_idx)` (safe side)

Notes:
- Prioritize "do not add call-site helpers" (lesson from Phase 11)
- Only **LEGACY** goes direct (ULTRA/MID/V7 are left to the existing path)

### Patch 3: bench_profile sync (prevent ENV leakage)

Target:
- `core/bench_profile.h`

Change:
- Add `front_fastlane_alloc_legacy_direct_env_refresh_from_env();` to the refresh group under `#ifdef USE_HAKMEM`

---

## 4. A/B (same binary)

Baseline:
```sh
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
```

Optimized:
```sh
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
```

Additional (regression detection):
```sh
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
```

---

## 5. Health check

```sh
scripts/verify_health_profiles.sh
```

---

## 6. Rollback

- `export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
@@ -0,0 +1,89 @@
# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results

**Date**: 2025-12-15
**Verdict**: ✅ **Case B confirmed** — **Layout / I-cache penalty dominates**

---

## Executive Summary

Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**:

- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**.
- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary.

Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency.

---

## Measurement Setup

Workload:
- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400`
- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh`

Two comparisons:
1) **Same-binary toggle** (allocator logic delta)
2) **System binary** (layout penalty delta)

---

## Results

### 1) Same-binary A/B (allocator delta)

Binary: `bench_random_mixed_hakmem`
Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`

| Mode | Throughput (ops/s) | Delta |
|------|---------------------|-------|
| hakmem (`FORCE_LIBC=0`) | 48.12M | — |
| libc (`FORCE_LIBC=1`) | 48.31M | **+0.39%** |

Interpretation: the allocator logic delta is at noise level in this experiment.

### 2) System binary (layout penalty)

Binary: `bench_random_mixed_system`

| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary |
|------|---------------------|--------------------------------|
| system malloc | 83.85M | **+73.57%** |

Total observed gap: roughly +74% (system binary vs hakmem).
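
The two tables split the total gap into an allocator part and a layout part. A minimal sketch of that arithmetic follows; the `gap_*` names are hypothetical, and the inputs are the rounded throughputs from the tables, so results agree with the document only to about two decimals.

```c
/* Relative change of b versus a, in percent. */
double gap_pct(double a, double b) {
    return (b - a) / a * 100.0;
}

typedef struct {
    double allocator_pct; /* libc vs hakmem, same binary */
    double layout_pct;    /* system binary vs libc-in-hakmem-binary */
    double total_pct;     /* system binary vs hakmem */
} gap_breakdown_t;

/* Decompose the observed gap from three throughput numbers (any common unit). */
gap_breakdown_t gap_decompose(double hakmem, double libc_same_bin, double system_bin) {
    gap_breakdown_t g;
    g.allocator_pct = gap_pct(hakmem, libc_same_bin);
    g.layout_pct = gap_pct(libc_same_bin, system_bin);
    g.total_pct = gap_pct(hakmem, system_bin);
    return g;
}
```

With `gap_decompose(48.12, 48.31, 83.85)` the allocator part comes out near +0.39% and the layout part near +73.57%. The parts compound multiplicatively rather than add: (1 + 0.0039) × (1 + 0.7357) ≈ 1.7425, which is the total of about +74%.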

---

## Perf Stat (200M iterations) — Smoking Gun

| Metric | hakmem binary | system binary | Delta |
|--------|---------------|---------------|-------|
| I-cache misses | 153K | 68K | **-55%** |
| Cycles | 17.9B | 10.2B | **-43%** |
| Instructions | 41.3B | 21.5B | **-48%** |
| Binary size | 653K | 21K | **-97%** |

Interpretation:
- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
- The 30× text footprint difference strongly correlates with the gap.
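
To make the "half the instructions" remark concrete, the whole-run counters can be normalized per benchmark iteration. This is only a sketch: the `perf_*` names are hypothetical, it assumes both columns come from 200M-iteration runs as the section title states, and the inputs are the rounded table values.

```c
/* Average count of a whole-run perf counter per benchmark iteration. */
double perf_per_iter(double counter_total, double iters) {
    return counter_total / iters;
}

/* Ratio of one counter between two binaries (hakmem / system). */
double perf_ratio(double hakmem_total, double system_total) {
    return hakmem_total / system_total;
}
```

For example, `perf_per_iter(41.3e9, 200e6)` is about 206.5 instructions per iteration for the hakmem binary against about 107.5 for the system binary, and `perf_ratio(41.3e9, 21.5e9)` is about 1.92.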

---

## Conclusion

Phase 12’s “system malloc is 1.6× faster” observation was real, but the root cause was misattributed:

- ❌ Not primarily allocator algorithm differences
- ✅ **Text/layout + I-cache locality + instruction footprint**

This shifts the optimization frontier:
- Stop chasing more routing/dispatch micro-opt (Phase 14–16 plateau)
- Focus on **Hot Text Isolation / layout control**

---

## Next

Proceed to:
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
@@ -0,0 +1,130 @@
# Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) Next Instructions

## Status (premises)

- Phase 14–16 ended **NEUTRAL / frozen as research boxes** (the dispatch / cache-shape / pointer-chase line has plateaued)
- Phase 16 v1 (FastLane alloc LEGACY direct) was **NEUTRAL (+0.62%)** and is **limited to C0–C3** (C4–C7 segfaulted, so they are excluded for safety)
- Phase 12 observed that "system malloc is faster than hakmem", but **cross-binary comparisons are easily skewed by layout/LTO differences**

The purpose of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and to establish a single source of truth (SSOT) for what the gap actually is (allocator difference or binary difference).

---

## 0. Goals (Deliverables)

1) **Same-binary A/B**: using `bench_random_mixed_hakmem`
   - A: `HAKMEM_FORCE_LIBC_ALLOC=0` (hakmem)
   - B: `HAKMEM_FORCE_LIBC_ALLOC=1` (libc)

2) **Gap decomposition against a separate binary** (optional)
   - Also measure `bench_random_mixed_system` (a small binary) and compare it with `libc-in-hakmem-binary` to estimate the **layout penalty**

3) **Decide the next main battleground** (a direction decision, not a GO/NO-GO)

---

## 1. Procedure (reproducibility first)

### 1.1 Build (pinned to the same commit)

```sh
make -j bench_random_mixed_hakmem bench_random_mixed_system
```

### 1.2 Clean ENV (pin the Phase 14–16 research knobs)

Recommended: use `scripts/run_mixed_10_cleanenv.sh` (prevents ENV leakage).

Additionally, set the following explicitly (to make sure Phase 16 is OFF):

```sh
export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0
```

### 1.3 Same-binary A/B (the main event)

**A: hakmem**

```sh
HAKMEM_FORCE_LIBC_ALLOC=0 scripts/run_mixed_10_cleanenv.sh
```

**B: libc (same binary)**

```sh
HAKMEM_FORCE_LIBC_ALLOC=1 scripts/run_mixed_10_cleanenv.sh
```

Record:
- mean / median / stdev (10-run)
- Min/Max

### 1.4 Optional: system binary baseline (layout penalty estimate)

```sh
for i in $(seq 1 10); do
  echo "=== Run ${i}/10 (system bin) ==="
  ./bench_random_mixed_system "${ITERS:-20000000}" "${WS:-400}" 1 2>&1 | rg "Throughput" || true
done
```

Interpretation:
- `system bin` is much faster than `FORCE_LIBC` → the **layout/text size penalty** dominates
- `FORCE_LIBC` is much faster than `hakmem` → the **allocator logic difference** dominates

---

## 2. Decision (direction branches)

### Case A: `FORCE_LIBC` is **+20% or more** faster than hakmem

Conclusion: the bulk of the gap lies in allocator logic (instruction count / fixed cost).

Next core task (recommended):
- **Phase 18: Free FastPath Gate Consolidation**
  - Consolidate the ENV gate / TLS probe inside `free_tiny_fast()` into a single check at the FastLane entry
  - Goal: cut the per-call gate fixed cost while keeping the winning "monolithic early-exit" pattern
  - Box boundary: a single point, `front_fastlane_try_free()` → `free_tiny_fast_with_snapshot()`
  - Reversible: `HAKMEM_FREE_TINY_FAST_SNAPSHOT=0/1`

### Case B: `FORCE_LIBC` is **within ±5%** of hakmem

Conclusion: the allocator difference is small, and Phase 12's "system malloc 1.6x" most likely has another cause (binary difference / measurement setup).

Next core task (recommended):
- **Phase 18: Hot Text Isolation / Layout Control**
  - Move cold code out via `__attribute__((cold,noinline))` plus separate translation units
  - If possible, stabilize the I-cache through link order (fixing the order of hot functions)
  - A/B in the same binary with `HAKMEM_LAYOUT_MODE=0/1` (switching only sections/attributes)
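
The hot/cold split named in Case B can be sketched in a few lines. The example below is illustrative only and assumes GCC or Clang; the `demo_*` names and the toy slot cache are hypothetical stand-ins, not hakmem code, and the separate-translation-unit step is not shown.

```c
#include <stddef.h>

#define DEMO_SLOTS 8

/* Toy stand-in for a per-class cache: a small stack of slot ids. */
static int g_demo_slots[DEMO_SLOTS];
static size_t g_demo_top = 0;   /* number of cached entries */
static int g_demo_refills = 0;  /* how often the cold path ran */

/* Cold path: out of line and marked cold, so the compiler may place it
 * away from hot code (typically in a .text.unlikely section). */
__attribute__((cold, noinline))
static int demo_alloc_slow(void) {
    g_demo_refills++;
    for (size_t i = 0; i < DEMO_SLOTS; i++) {
        g_demo_slots[i] = (int)i + 1;
    }
    g_demo_top = DEMO_SLOTS;
    return g_demo_slots[--g_demo_top];
}

/* Hot path: a compare, a decrement and a load; the miss branch is hinted unlikely. */
int demo_alloc(void) {
    if (__builtin_expect(g_demo_top == 0, 0)) {
        return demo_alloc_slow();
    }
    return g_demo_slots[--g_demo_top];
}

/* Accessor used to observe how rarely the cold path is taken. */
int demo_refill_count(void) {
    return g_demo_refills;
}
```

The design point is that `demo_alloc()` stays small enough to sit densely with other hot functions, while the refill body no longer shares cache lines with it. Whether this moves throughput in the real binary is exactly what the proposed `HAKMEM_LAYOUT_MODE=0/1` A/B would have to show.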

### Case C: `FORCE_LIBC` is faster than hakmem, but there is also a large gap to `system bin`

Conclusion: **both** an allocator difference and a layout penalty are present.

Next core task:
- First cut the **layout penalty** (Phase 18 Hot Text Isolation)
- Then move on to **gate consolidation** (Phase 19)
|
||||
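The three cases can be written down as a small classifier over the three mean throughputs. This is a sketch under stated assumptions: the text gives +20% for Case A and ±5% for Case B but no number for a "large" gap to the system binary, so `SYSTEM_GAP_LARGE` is an assumed cutoff, and results that fit none of the cases are reported as undecided. All names are hypothetical.

```c
typedef enum { GAP_CASE_A, GAP_CASE_B, GAP_CASE_C, GAP_UNDECIDED } gap_case_t;

/* Assumed cutoff for "large gap to the system binary"; the text gives no number. */
#define SYSTEM_GAP_LARGE 0.20

/* Inputs are mean throughputs in the same unit; all must be > 0. */
static gap_case_t classify_gap(double hakmem, double force_libc, double system_bin) {
    double libc_vs_hakmem = force_libc / hakmem - 1.0;
    double system_vs_libc = system_bin / force_libc - 1.0;

    if (libc_vs_hakmem >= -0.05 && libc_vs_hakmem <= 0.05)
        return GAP_CASE_B;      /* allocator delta is small */
    if (libc_vs_hakmem > 0.05 && system_vs_libc >= SYSTEM_GAP_LARGE)
        return GAP_CASE_C;      /* allocator delta and layout penalty together */
    if (libc_vs_hakmem >= 0.20)
        return GAP_CASE_A;      /* allocator logic dominates */
    return GAP_UNDECIDED;       /* outside the three documented cases */
}
```

With the Phase 17 means (48.12M, 48.31M, and 83.85M ops/s) the same-binary delta is about +0.4%, so the function returns `GAP_CASE_B`, matching the confirmed verdict.
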
---

## 3. Visualization (minimal)

- Save the raw throughput of the 10 runs (the `scripts/run_mixed_10_cleanenv.sh` output log is enough)
- Additionally, take a single `perf stat` (200M iters, 1 run):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FORCE_LIBC_ALLOC=0 \
  ./bench_random_mixed_hakmem 200000000 400 1
```

Take one more run with the same command and `HAKMEM_FORCE_LIBC_ALLOC=1`.

---

## 4. Key rules (Box Theory)

- Run the A/B on the **same binary** (so layout/LTO differences do not cause a misjudgment)
- Every new optimization must have an ENV gate (reversible) and a single boundary
- When in doubt, prefer “Fail-Fast with fallback” (consistency over speed)

135
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md
Normal file
@@ -0,0 +1,135 @@
# Phase 18: Hot Text Isolation v1 — Design

## 0. Context (from Phase 17)

Phase 17 established **Case B**:
- Same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows that the **allocator delta is negligible**.
- The large gap appears only against the tiny `bench_random_mixed_system` binary.

Signal:
- I-cache misses, instructions, and cycles are far worse in the hakmem-linked binary (Phase 17 perf stat, 200M iters: 153K vs 68K I-cache misses, 41.3B vs 21.5B instructions, 17.9B vs 10.2B cycles).
- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap.

Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`

---

## 1. Goal

Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.

Primary success metric:
- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
  - `iTLB/icache misses` (or the “I-cache misses” counter used in Phase 17)
  - total instructions executed per 200M iters

---

## 2. Non-goals

- No allocator algorithm redesign.
- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
- No “delete code = faster” experiments (Phase 17 showed layout dominates; deletions confound results).

---

## 3. Box Theory framing

This is a “build/layout box”:
- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
- **Boundary**: build flag / TU split (no runtime overhead)
- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
- **Observability**: perf stat + binary size (no always-on logs)

---

## 4. Design: v1 tactics (low-risk)

### 4.1 Hot/Cold attributes SSOT

Introduce a single header defining attributes:
- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally placement in a `.text.hak_hot` section)
- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally placement in a `.text.hak_cold` section)

Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`.

Why:
- Makes “what is hot/cold” explicit and consistent (SSOT, a single source of truth).
- Lets us annotate a small set of functions without scattering ad-hoc attributes.

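A minimal sketch of what such a header could look like, assuming a GCC- or Clang-compatible compiler. Only the macro names `HAK_HOT_FN`, `HAK_COLD_FN`, and `HAKMEM_HOT_TEXT_ISOLATION` come from the design; the header contents and the two `demo_*` functions are illustrative assumptions, not the actual `hot_text_attrs_box.h`.

```c
/* Sketch of a hot/cold attribute header (contents assumed, not the real file). */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0   /* default OFF: macros expand to nothing */
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

/* Hypothetical annotated pair: a cold helper and the hot function that calls it. */
HAK_COLD_FN static int demo_slow_path(int x) {
    return x * 2;
}

HAK_HOT_FN static int demo_fast_path(int x) {
    return x > 0 ? x + 1 : demo_slow_path(x);
}
```

Because both macros expand to nothing when the knob is 0, the OFF build contains no attribute changes, which keeps the rollback a pure build-flag flip.
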
### 4.2 Translation-unit split for wrappers

Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:
- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`

Why:
- Prevents wrapper text from being interleaved with unrelated code in the same TU.
- Improves the linker’s ability to cluster hot code.
- Enables future link-order experiments (symbol ordering files) without touching allocator logic.

### 4.3 Cold code isolation

Ensure rarely-hit helpers stay cold/out-of-line:
- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
- “slow fallback” paths (`malloc_cold`, `free_cold`)

Principle:
- Hot path must remain a straight-line “try → return” shape.
- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.

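The “try → return” shape can be illustrated with a toy allocator. This is a sketch only: the bump arena, the 64-byte size limit, and every `demo_*` / `DEMO_*` name are assumptions made for the example and do not reflect hakmem's wrappers.

```c
#include <stddef.h>
#include <stdlib.h>

#if defined(__GNUC__) || defined(__clang__)
#define DEMO_COLD        __attribute__((cold, noinline))
#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define DEMO_COLD
#define DEMO_UNLIKELY(x) (x)
#endif

static unsigned long demo_fallback_count;

/* Hypothetical fast path: serves small sizes from a static bump arena. */
static _Alignas(16) unsigned char demo_arena[4096];
static size_t demo_arena_used;

static inline void *demo_try_alloc(size_t size) {
    size_t need = (size + 15u) & ~(size_t)15u;   /* keep 16-byte alignment */
    if (size == 0 || size > 64 || demo_arena_used + need > sizeof demo_arena)
        return NULL;
    void *p = demo_arena + demo_arena_used;
    demo_arena_used += need;
    return p;
}

/* Cold: bookkeeping and the slow fallback stay out of line. */
DEMO_COLD static void *demo_malloc_cold(size_t size) {
    demo_fallback_count++;
    return malloc(size);
}

/* Hot: one attempt, one unlikely branch, then return. */
static void *demo_malloc(size_t size) {
    void *p = demo_try_alloc(size);
    if (DEMO_UNLIKELY(p == NULL))
        return demo_malloc_cold(size);
    return p;
}
```

The hot function contains only the attempt and a single unlikely branch; the counter update and the libc call live in the `noinline` cold function, so they cannot be inlined back into the hot text.
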
### 4.4 Optional: section GC for bench builds

For bench binaries only:
- add `-ffunction-sections -fdata-sections`
- link with `-Wl,--gc-sections`

Why:
- Drops truly-unused text and reduces overall text pressure.
- Helps the linker keep hot text denser.

This is optional because it is toolchain-sensitive; measure before promoting.

---

## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out

Phase 17 shows the hakmem-linked binary executes about 2x the instructions of the tiny system binary (41.3B vs 21.5B). If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.

Concept:
- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
- In this mode:
  - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
  - ENV gates become compile-time (no TLS/env probing in hot path)
  - Hot counters/stats macros compile out completely

Why this still fits Box Theory:
- It is a **build box** (reversible by knob), not an algorithm rewrite
- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
- Observability shifts to `perf stat` (no always-on logging)

Expected impact:
- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).

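A minimal sketch of how an ENV gate and a stats macro could collapse to constants under the build knob. Only `HAKMEM_BENCH_MINIMAL` comes from the text; the `DEMO_*` macros, the `demo_*` functions, and the `HAKMEM_DEMO_FASTLANE` variable are hypothetical and exist only for this illustration.

```c
#include <stdlib.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench build: the promoted default is a constant, so the branch folds away. */
#define DEMO_GATE_FASTLANE()   1
#define DEMO_STAT_INC(counter) ((void)0)
#else
/* Normal build: read the ENV once and cache the result. */
static int demo_gate_fastlane_cached = -1;

static int demo_gate_fastlane(void) {
    if (demo_gate_fastlane_cached < 0) {
        const char *v = getenv("HAKMEM_DEMO_FASTLANE");   /* hypothetical variable */
        demo_gate_fastlane_cached = (v == NULL || v[0] != '0');
    }
    return demo_gate_fastlane_cached;
}
#define DEMO_GATE_FASTLANE()   demo_gate_fastlane()
#define DEMO_STAT_INC(counter) ((counter)++)
#endif

static unsigned long demo_alloc_calls;

/* Returns 1 for the fast lane, 0 for the fallback route. */
static int demo_alloc_route(void) {
    DEMO_STAT_INC(demo_alloc_calls);
    return DEMO_GATE_FASTLANE() ? 1 : 0;
}
```

Building with `-DHAKMEM_BENCH_MINIMAL=1` removes both the cached-gate load and the counter increment from `demo_alloc_route()`, which is the kind of per-call fixed cost this extension targets.
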
## 5. Risks / mitigations

### Risk A: layout tweaks regress throughput

Mitigation:
- A/B using the same workload + perf stat counters (Phase 17 set).
- If regression: keep as research-only (build knob default OFF).

### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)

Mitigation:
- Keep v1 minimal (TU split + attributes first).
- Only enable `--gc-sections` if it’s stable in the current toolchain.

---

## 6. Expected impact

Conservative:
- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.

Stretch goal:
- Bring “hakmem-linked + FORCE_LIBC” closer to `bench_random_mixed_system` ceiling by minimizing wrapper text working-set.

165
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
Normal file
@@ -0,0 +1,165 @@
# Phase 18: Hot Text Isolation v1 — Next Instructions

## Status

- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**.
- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**.

Refs:
- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`

---

## 0. Goal / Success Criteria

Primary (v1 is expected to be “low risk, modest effect”):
- GO at a Mixed (16–1024B) throughput gain of **+2%** or more (a realistic bar for layout work)

Secondary (must move in the right direction):
- I-cache misses reduced (guideline: **-10%** or better)
- Total instructions reduced (guideline: **-5%** or better)

If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.

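The secondary criteria reduce to two relative-change checks on the perf counters. A minimal sketch, assuming the counter values from the OFF and ON builds are already available as numbers; `rel_change` and `counters_improved` are hypothetical helper names.

```c
/* Relative change from `before` to `after`; -0.10 means a 10% reduction. */
static double rel_change(double before, double after) {
    return (after - before) / before;
}

/* Secondary criteria: I-cache misses down by >= 10%, instructions down by >= 5%. */
static int counters_improved(double icache_before, double icache_after,
                             double instr_before, double instr_after) {
    return rel_change(icache_before, icache_after) <= -0.10 &&
           rel_change(instr_before, instr_after) <= -0.05;
}
```

As a sanity check of the arithmetic, the Phase 17 hakmem-vs-system counters (153K to 68K I-cache misses, 41.3B to 21.5B instructions) give about -55% and -48%, so both guidelines would be met by a wide margin.
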
---

## 1. Patch Plan (small, reversible)

### Patch 1: Hot/Cold attribute SSOT (L0 Box)

Add:
- `core/box/hot_text_attrs_box.h`

Defines:
- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`)

Usage:
- annotate only a short, high-impact list first:
  - wrappers: `malloc/free/calloc/realloc`
  - FastLane entry helpers (if non-inline)
  - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics

Rollback: build knob off.

### Patch 2: Wrapper TU split (L1 Box boundary)

Move wrapper definitions out of `core/hakmem.c`:
- new: `core/hak_wrappers_box.c`
  - `#include "box/hak_wrappers.inc.h"`
- remove wrapper include from `core/hakmem.c`

Rationale:
- Prevents wrapper text from being interleaved with unrelated code in one TU.
- Sets up link-order clustering.

Rollback: restore include in `core/hakmem.c` and drop new TU.

### Patch 3 (optional): bench-only section GC

Makefile knob:
- `HOT_TEXT_ISOLATION=0/1`

When `=1`, add for bench builds:
- `-DHAKMEM_HOT_TEXT_ISOLATION=1`
- `-ffunction-sections -fdata-sections`
- `LDFLAGS += -Wl,--gc-sections`

Notes:
- Keep it bench-only first (do not touch shared lib build until proven stable).
- If toolchain rejects `--gc-sections` or results are unstable → skip this patch.

---

## 2. A/B Procedure (required)

### 2.1 Baseline build (OFF)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (1 run, 200M iters):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.2 Optimized build (ON)

```sh
make clean
make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (same command):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.3 System ceiling check (optional)

```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
```

---

## 3. GO/NO-GO Decision

- **GO**: Mixed 10-run mean gain of **+2%** or more and no health regressions
- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
- **NO-GO**: **-2%** or worse → rollback and freeze

Health profiles:

```sh
scripts/verify_health_profiles.sh
```

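The decision rule can be expressed as a small function over the two 10-run means and the health result. This is a sketch with one labeled assumption: the text does not say what happens when throughput improves but the health profiles regress, so this version treats any health regression as NO-GO. `verdict_t` and `decide` are hypothetical names.

```c
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* baseline_mean / optimized_mean: 10-run mean throughput of the OFF / ON build.
 * health_ok: nonzero when the health profile script reported no regressions. */
static verdict_t decide(double baseline_mean, double optimized_mean, int health_ok) {
    double gain = optimized_mean / baseline_mean - 1.0;

    /* Assumption: a health regression is a NO-GO regardless of throughput. */
    if (gain <= -0.02 || !health_ok)
        return VERDICT_NO_GO;
    if (gain >= 0.02)
        return VERDICT_GO;
    return VERDICT_NEUTRAL;
}
```

For example, a move from 48.12M to 48.31M ops/s (+0.39%) falls inside the ±2% band and yields `VERDICT_NEUTRAL`.
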
---

## 4. Reporting (required artifacts)

Create:
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
  - throughput A/B (10-run)
  - binary sizes
  - perf stat table (cycles/instructions/I-cache)
  - conclusion (GO/NEUTRAL/NO-GO)

Update:
- `CURRENT_TASK.md` (Phase 18 status + next)

---

## 5. Notes / guardrails

- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine).
- Avoid “delete code” experiments; only isolate/cold/cluster.
- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.

---

## 6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL)

To cut the Phase 17 “instructions 2x” directly, layout work alone is probably not enough; it will most likely be necessary to **compile out the fixed costs of the ENV/stats/debug code mixed into the hot path**.

Next move (bench-only binary, reversible):

- With `HAKMEM_BENCH_MINIMAL=1` (Makefile knob):
  - Pin the “always-ON” paths of FastLane / wrappers and turn ENV gates into compile-time constants
  - Compile out hot counters completely
  - Observe via `perf stat` only (no always-on logging)

Expected: +10–20% (if instruction footprint really is dominant, this is where the numbers should move substantially)

@@ -15,9 +15,12 @@ export HAKMEM_TINY_C7_PRESERVE_HEADER=${HAKMEM_TINY_C7_PRESERVE_HEADER:-0}
 export HAKMEM_TINY_TCACHE=${HAKMEM_TINY_TCACHE:-0}
 export HAKMEM_TINY_TCACHE_CAP=${HAKMEM_TINY_TCACHE_CAP:-64}
 export HAKMEM_MALLOC_TINY_DIRECT=${HAKMEM_MALLOC_TINY_DIRECT:-0}
+export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=${HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT:-0}
+export HAKMEM_FORCE_LIBC_ALLOC=${HAKMEM_FORCE_LIBC_ALLOC:-0}
 export HAKMEM_ENV_SNAPSHOT_SHAPE=${HAKMEM_ENV_SNAPSHOT_SHAPE:-0}
-export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-0}
-export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-0}
+# NOTE: Phase 9/10 are promoted (bench_profile defaults to 1). Keep cleanenv aligned by default.
+export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-1}
+export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-1}
 
 for i in $(seq 1 "${runs}"); do
   echo "=== Run ${i}/${runs} ==="