Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator-logic differences from the binary-layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed** — Allocator difference negligible, layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
138
CURRENT_TASK.md
@@ -342,6 +342,144 @@ Cumulative improvements achieved in Phase 6-10:
- Neither pointer-chase reduction nor cache-shape changes produce a significant improvement over the current TLS array cache
- Closing the remaining mimalloc gap (about 2.4x) requires an approach along a different dimension

---

### Phase 16 v1: Front FastLane Alloc LEGACY Direct — ⚠️ NEUTRAL (+0.62%) — kept as research box (default OFF)

**Date**: 2025-12-15
**Verdict**: **NEUTRAL (+0.62% Mixed, +0.06% C6-heavy)** — kept as research box (default OFF)

**Motivation**:
- Phases 14-15 are frozen (cache-shape/pointer-chase work has thin ROI)
- On the free side, "monolithic early-exit + dedup" was the winning pattern (Phase 9/10/6-2)
- Apply the same pattern on the alloc side: cut the fixed route/policy cost of the LEGACY route at the FastLane entry

**Results**:
| Workload | ENV=0 (Baseline, ops/s) | ENV=1 (Direct, ops/s) | Delta |
|---------|----------|----------|-------|
| Mixed (16–1024B) | 47,510,791 | 47,803,890 | **+0.62%** |
| C6-heavy (257–768B) | 21,134,240 | 21,147,197 | **+0.06%** |

**Critical Issue & Fix**:
- **Segfault discovered**: The initial implementation crashed for C4-C7 in `unified_cache_refill()` → `tiny_next_read()`
- **Root cause**: Refill logic incompatibility for classes C4-C7
- **Safety fix**: Limited the optimization to C0-C3 only (matching the existing dualhot pattern)
- Code constraint: `if (... && (unsigned)class_idx <= 3u)` added at line 96 of `front_fastlane_box.h`

**Conclusion**:
- The optimization overlaps with the existing dualhot path (Phase ALLOC-TINY-FAST-DUALHOT-2) for C0-C3
- The limited scope (C0-C3 only) reduces the potential benefit
- Route/policy overhead was already minimized by the Phase 6 FastLane collapse
- This continues the pattern from Phases 14-15: dispatch-layer optimizations yield NEUTRAL results

**Root causes of limited benefit**:
1. Safety constraint: C4-C7 are excluded because of the refill bug
2. Overlap with dualhot: C0-C3 already have a direct path when dualhot is enabled
3. Route overhead is not dominant: Phase 6 already collapsed the major dispatch costs

**Recommendations**:
- **Freeze as research box** (default OFF, no preset promotion)
- **Investigate C4-C7 refill issue** before expanding scope
- **Shift optimization focus** away from dispatch layers (Phase 14/15/16 all NEUTRAL)

**Refs**:
- A/B results: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_AB_TEST_RESULTS.md`
- Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`
- Instructions: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md`
- ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)

---

### Phase 14-16 Summary: Post-FastLane Research Phases ⚠️

**Conclusion**: Phases 14-16 were all NEUTRAL (frozen as research boxes)

| Phase | Approach | Mixed Delta | Verdict |
|-------|----------|-------------|---------|
| 14 v1 | tcache (free-side only) | +0.20% | NEUTRAL |
| 14 v2 | tcache (alloc+free) | +0.08% | NEUTRAL |
| 15 v1 | FIFO→LIFO (array cache) | -0.70% | NEUTRAL |
| 16 v1 | Alloc LEGACY direct | **+0.62%** | **NEUTRAL** |

**Lessons**:
- Pointer-chase reduction, cache-shape changes, and dispatch early-exit all failed to produce a significant improvement
- Since the Phase 6 FastLane collapse (entry fixed-cost reduction), optimizations in the dispatch/routing layer have thin ROI
- Closing the remaining mimalloc gap (about 2.4x) requires a different dimension: cache-miss cost, memory layout, backend allocation, and so on

---

### Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) ✅ COMPLETE (2025-12-15)

**Purpose**: Establish an SSOT for the "system malloc is faster" observation. A/B `hakmem` vs `libc` in the **same binary** to separate what the gap actually consists of (allocator difference vs layout difference).

**Result**: **Case B confirmed** — Allocator difference negligible (+0.39%), layout penalty dominant (+73.57%)

**Gap Breakdown** (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s (mean), 48.12M ops/s (median)
- libc same-binary (FORCE_LIBC=1): 48.31M ops/s (mean), 48.31M ops/s (median)
- **Allocator difference**: **+0.39%** (libc slightly faster, within noise)
- system binary (21K): 83.85M ops/s (mean), 83.75M ops/s (median)
- **Layout penalty**: **+73.57%** (small binary vs large binary 653K)
- **Total gap**: **+74.26%** (hakmem → system binary)

**Perf Stat Analysis** (200M iters, 1-run):
- I-cache misses: 153K (hakmem) → 68K (system) = **-55%** (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%

**Root Cause**: Binary size (653K vs 21K, 30x difference) causes I-cache thrashing. Code bloat >> algorithmic efficiency.

**Lessons**:
- The Phase 12 observation that "system malloc is 1.6x faster" was correct, but the cause is **binary layout**, not the allocator algorithm
- Same-binary A/B is mandatory (cross-binary comparison leads to wrong conclusions because of the layout confound)
- I-cache efficiency is a first-order factor for allocator-heavy workloads

**Next Direction** (recommended for Case B):
- **Phase 18: Hot Text Isolation / Layout Control**
  - Priority 1: Cold code isolation (`__attribute__((cold,noinline))` + separate TU)
  - Priority 2: Link-order optimization (hot functions contiguous placement)
  - Priority 3: PGO (optional, profile-guided layout)
  - Target: +10% throughput via I-cache optimization (48.1M → 52.9M ops/s)
  - Success metric: I-cache misses -30% (153K → 107K)

**Files**:
- Results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
- Instructions: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md`

---

### Phase 18: Hot Text Isolation / Layout Control — NEXT

**Purpose**: Improve I-cache efficiency through binary layout optimization and narrow the gap to the system binary.

**Strategy**:
1. **Cold Code Isolation** (priority 1)
   - Move stats collection, debug logging, and error handlers into separate TUs
   - Mark them explicitly as cold with `__attribute__((cold, noinline))`
   - Expected effect: I-cache misses -20%

2. **Link-Order Optimization** (priority 2)
   - Place hot functions contiguously (linker script or link order control)
   - `-ffunction-sections` + custom linker script
   - Expected effect: I-cache misses -10%

3. **Profile-Guided Optimization** (priority 3, optional)
   - Measurement-based placement with `-fprofile-generate` + `-fprofile-use`
   - Expected effect: I-cache misses -10-20%

**Build Gate**: `HOT_TEXT_ISOLATION=0/1` (for layout A/B)

**Target**:
- v1 (TU split / attrs / optional gc-sections): **GO at +2%** (NEUTRAL is expected to be likely)
- v2 (BENCH_MINIMAL compile-out): aim for **+10–20%** (cuts the instruction footprint directly)

**Design**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
**Instructions**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`

Implementation gates (reversible):
- Makefile knob: `HOT_TEXT_ISOLATION=0/1`
- Compile-time: `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`

## Update notes (2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot)

### Phase 5 E5-3: Candidate Analysis & Strategic Recommendations ⚠️ DEFER (2025-12-14)
6
Makefile
@@ -218,12 +218,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)

# Targets
TARGET = test_hakmem
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o 
core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
OBJS = $(OBJS_BASE)

# Shared library
SHARED_LIB = libhakmem.so
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o 
core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o core/box/ss_allocation_box_shared.o superslab_stats_shared.o superslab_cache_shared.o superslab_ace_shared.o superslab_slab_shared.o superslab_backend_shared.o core/superslab_head_stub_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/front_gate_classifier_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/ss_hot_prewarm_box_shared.o core/box/front_metrics_box_shared.o core/box/bench_fast_box_shared.o core/box/ss_addr_map_box_shared.o core/box/ss_pt_impl_shared.o core/box/slab_recycling_box_shared.o core/box/pagefault_telemetry_box_shared.o core/box/tiny_sizeclass_hist_box_shared.o core/box/tiny_env_box_shared.o core/box/tiny_route_box_shared.o core/box/free_front_v3_env_box_shared.o core/box/free_path_stats_box_shared.o core/box/free_dispatch_stats_box_shared.o core/box/alloc_gate_stats_box_shared.o core/box/tiny_page_box_shared.o core/box/tiny_class_policy_box_shared.o core/box/tiny_class_stats_box_shared.o core/box/tiny_policy_learner_box_shared.o core/box/ss_budget_box_shared.o core/box/tiny_mem_stats_box_shared.o core/box/wrapper_env_box_shared.o core/box/free_wrapper_env_snapshot_box_shared.o core/box/malloc_wrapper_env_snapshot_box_shared.o core/box/madvise_guard_box_shared.o core/box/libm_reloc_guard_box_shared.o core/box/hakmem_env_snapshot_box_shared.o core/box/tiny_c7_preserve_header_env_box_shared.o core/box/tiny_tcache_env_box_shared.o core/box/tiny_unified_lifo_env_box_shared.o core/box/front_fastlane_alloc_legacy_direct_env_box_shared.o core/page_arena_shared.o core/front/tiny_unified_cache_shared.o 
core/tiny_alloc_fast_push_shared.o core/tiny_c7_ultra_segment_shared.o core/tiny_c7_ultra_shared.o core/link_stubs_shared.o core/tiny_failfast_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_super_registry_shared.o hakmem_shared_pool_shared.o hakmem_shared_pool_acquire_shared.o hakmem_shared_pool_release_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o core/box/super_reg_box_shared.o core/box/shared_pool_box_shared.o core/box/remote_side_box_shared.o core/tiny_destructors_shared.o

# Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
ifeq ($(POOL_TLS_PHASE1),1)
@@ -427,7 +427,7 @@ test-box-refactor: box-refactor
./larson_hakmem 10 8 128 1024 1 12345 4

# Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/page_arena.o 
core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o 
core/box/front_fastlane_alloc_legacy_direct_env_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o
|
||||||
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
|
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
|
||||||
ifeq ($(POOL_TLS_PHASE1),1)
|
ifeq ($(POOL_TLS_PHASE1),1)
|
||||||
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
||||||
|
|||||||
@ -13,6 +13,7 @@
|
|||||||
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
|
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
|
||||||
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
|
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
|
||||||
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
|
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
|
||||||
|
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
// Set defaults only when the env var is not already set
|
// Set defaults only when the env var is not already set
|
||||||
@ -193,5 +194,7 @@ static inline void bench_apply_profile(void) {
|
|||||||
tiny_tcache_env_refresh_from_env();
|
tiny_tcache_env_refresh_from_env();
|
||||||
// Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
|
// Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
|
||||||
tiny_unified_lifo_env_refresh_from_env();
|
tiny_unified_lifo_env_refresh_from_env();
|
||||||
|
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
|
||||||
|
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
|
|||||||
63
core/box/front_fastlane_alloc_legacy_direct_env_box.c
Normal file
@ -0,0 +1,63 @@
|
|||||||
|
// ============================================================================
|
||||||
|
// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0) - Implementation
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
#include "front_fastlane_alloc_legacy_direct_env_box.h"
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <string.h>
|
||||||
|
#include <stdio.h>
|
||||||
|
#include <unistd.h>
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Global State
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
_Atomic int g_front_fastlane_alloc_legacy_direct_enabled = -1;
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Init (Cold Path)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
int front_fastlane_alloc_legacy_direct_env_init(void) {
|
||||||
|
const char* env = getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT");
|
||||||
|
int enabled = 0; // default: OFF (opt-in)
|
||||||
|
|
||||||
|
if (env && (env[0] == '1' || strcmp(env, "true") == 0 || strcmp(env, "TRUE") == 0)) {
|
||||||
|
enabled = 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Cache result
|
||||||
|
atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, enabled, memory_order_relaxed);
|
||||||
|
|
||||||
|
// Log once (stderr for immediate visibility)
|
||||||
|
if (enabled) {
|
||||||
|
const char msg[] = "[FRONT_FASTLANE_ALLOC_LEGACY_DIRECT] enabled\n";
|
||||||
|
ssize_t w = write(2, msg, sizeof(msg) - 1);
|
||||||
|
(void)w;
|
||||||
|
}
|
||||||
|
|
||||||
|
return enabled;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Hot Path (LTO Fallback)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// LTO fallback: Non-inline version for cases where LTO can't inline
|
||||||
|
int front_fastlane_alloc_legacy_direct_enabled(void) {
|
||||||
|
int val = atomic_load_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, memory_order_relaxed);
|
||||||
|
if (__builtin_expect(val == -1, 0)) {
|
||||||
|
val = front_fastlane_alloc_legacy_direct_env_init();
|
||||||
|
}
|
||||||
|
return val;
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Refresh (Cold Path, called from bench_profile)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void) {
|
||||||
|
// Reset to uninitialized state (-1)
|
||||||
|
// Next call to front_fastlane_alloc_legacy_direct_enabled() will re-read ENV
|
||||||
|
atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, -1, memory_order_relaxed);
|
||||||
|
}
|
||||||
63
core/box/front_fastlane_alloc_legacy_direct_env_box.h
Normal file
@ -0,0 +1,63 @@
|
|||||||
|
// ============================================================================
|
||||||
|
// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0)
|
||||||
|
// ============================================================================
|
||||||
|
//
|
||||||
|
// Purpose: ENV gate for FastLane alloc LEGACY direct path
|
||||||
|
//
|
||||||
|
// Design: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md
|
||||||
|
// Instructions: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
|
||||||
|
//
|
||||||
|
// Strategy:
|
||||||
|
// - Reduce alloc-side route/policy fixed costs
|
||||||
|
// - Go straight to LEGACY at the FastLane entry (hot → cold → fallback)
|
||||||
|
// - Apply the free-side winning pattern (Phase 9/10) to alloc as well
|
||||||
|
//
|
||||||
|
// ENV:
|
||||||
|
// HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1 (default: 0, opt-in)
|
||||||
|
//
|
||||||
|
// API:
|
||||||
|
// front_fastlane_alloc_legacy_direct_enabled() -> int
|
||||||
|
// front_fastlane_alloc_legacy_direct_env_refresh_from_env()
|
||||||
|
//
|
||||||
|
// Box Theory:
|
||||||
|
// - L0: This file (ENV gate, reversible)
|
||||||
|
// - L1: front_fastlane_box.h (LEGACY direct early-exit)
|
||||||
|
// - L2: malloc_tiny_fast_for_class (existing fallback)
|
||||||
|
//
|
||||||
|
// Safety:
|
||||||
|
// - ENV-gated (default OFF, opt-in)
|
||||||
|
// - Reversible (ENV toggle)
|
||||||
|
// - Fail-Fast (use the existing path when the direct conditions are not met)
|
||||||
|
//
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
#ifndef FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
|
||||||
|
#define FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
|
||||||
|
|
||||||
|
#include <stdatomic.h>
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Global State (L0)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Cached state: -1 (uninitialized), 0 (disabled), 1 (enabled)
|
||||||
|
extern _Atomic int g_front_fastlane_alloc_legacy_direct_enabled;
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Hot API (L0)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Check if FastLane alloc LEGACY direct is enabled
|
||||||
|
// Returns: 1 if enabled, 0 if disabled
|
||||||
|
// Note: Implementation in .c file (non-inline for LTO compatibility)
|
||||||
|
extern int front_fastlane_alloc_legacy_direct_enabled(void);
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Cold API (L2)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Refresh ENV cache (called from bench_profile after putenv)
|
||||||
|
// Pattern: Same as Phase 8/13/14/15
|
||||||
|
extern void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void);
|
||||||
|
|
||||||
|
#endif // FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
|
||||||
@ -42,6 +42,11 @@
|
|||||||
#include "front_fastlane_stats_box.h"
|
#include "front_fastlane_stats_box.h"
|
||||||
#include "../hakmem_tiny.h" // hak_tiny_size_to_class, tiny_get_max_size
|
#include "../hakmem_tiny.h" // hak_tiny_size_to_class, tiny_get_max_size
|
||||||
#include "../front/malloc_tiny_fast.h" // malloc_tiny_fast_for_class
|
#include "../front/malloc_tiny_fast.h" // malloc_tiny_fast_for_class
|
||||||
|
#include "front_fastlane_alloc_legacy_direct_env_box.h" // Phase 16 v1: LEGACY direct
|
||||||
|
#include "tiny_static_route_box.h" // tiny_static_route_ready_fast, tiny_static_route_get_kind_fast
|
||||||
|
#include "tiny_front_hot_box.h" // tiny_hot_alloc_fast
|
||||||
|
#include "tiny_front_cold_box.h" // tiny_cold_refill_and_alloc
|
||||||
|
#include "smallobject_policy_v7_box.h" // SMALL_ROUTE_LEGACY
|
||||||
|
|
||||||
// FastLane is only safe after global init completes.
|
// FastLane is only safe after global init completes.
|
||||||
// Before init, wrappers must handle recursion guards + syscall init.
|
// Before init, wrappers must handle recursion guards + syscall init.
|
||||||
@ -85,6 +90,34 @@ static inline void* front_fastlane_try_malloc(size_t size) {
|
|||||||
return NULL; // Class not enabled → fallback
|
return NULL; // Class not enabled → fallback
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Phase 16 v1: LEGACY direct path (early-exit optimization)
|
||||||
|
// Try direct allocation for LEGACY routes only (skip route/policy overhead)
|
||||||
|
// TEMPORARY SAFETY: Limit to C0-C3 (match dualhot pattern) until refill issue debugged
|
||||||
|
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
|
||||||
|
// Condition 1: Static route must be ready (Learner interlock check)
|
||||||
|
// Condition 2: Route must be LEGACY (only when this can be asserted with certainty)
|
||||||
|
if (tiny_static_route_ready_fast() &&
|
||||||
|
tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY) {
|
||||||
|
|
||||||
|
// Hot path: Try UnifiedCache first
|
||||||
|
void* ptr = tiny_hot_alloc_fast(class_idx);
|
||||||
|
if (__builtin_expect(ptr != NULL, 1)) {
|
||||||
|
FRONT_FASTLANE_STAT_INC(malloc_hit);
|
||||||
|
return ptr; // Success (cache hit)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Cold path: Refill UnifiedCache and retry
|
||||||
|
ptr = tiny_cold_refill_and_alloc(class_idx);
|
||||||
|
if (__builtin_expect(ptr != NULL, 1)) {
|
||||||
|
FRONT_FASTLANE_STAT_INC(malloc_hit);
|
||||||
|
return ptr; // Success (after refill)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Fallback: Direct path failed → use existing route (safety)
|
||||||
|
// This handles edge cases (Learner transition, policy changes, etc.)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// Call existing hot handler (no duplication)
|
// Call existing hot handler (no duplication)
|
||||||
// This is the winning path from E5-4 / Phase 4 E2
|
// This is the winning path from E5-4 / Phase 4 E2
|
||||||
void* ptr = malloc_tiny_fast_for_class(size, class_idx);
|
void* ptr = malloc_tiny_fast_for_class(size, class_idx);
|
||||||
|
|||||||
@ -0,0 +1,208 @@
|
|||||||
|
# Phase 16: Front FastLane Alloc LEGACY Direct v1 — A/B Test Results
|
||||||
|
|
||||||
|
**Date**: 2025-12-15
|
||||||
|
**Status**: NEUTRAL (+0.62%)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Phase 16 v1 attempted to reduce alloc-side fixed costs by adding a LEGACY direct path to the FastLane entry point, bypassing route/policy overhead for LEGACY allocations. The optimization mirrors the free-side winning pattern (Phase 9/10).
|
||||||
|
|
||||||
|
**Result**: +0.62% on Mixed (NEUTRAL), below +1.0% GO threshold.
|
||||||
|
|
||||||
|
**Critical Issue Discovered**: Initial implementation caused segmentation fault for classes C4-C7. Root cause: `unified_cache_refill()` incompatibility. **Safety fix applied**: Limited optimization to C0-C3 only (matching existing dualhot pattern).
|
||||||
|
|
||||||
|
**Verdict**: NEUTRAL — freeze as research box (default OFF).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## A/B Test Results
|
||||||
|
|
||||||
|
### Mixed (16-1024B, 10-run clean env)
|
||||||
|
|
||||||
|
**Baseline** (ENV=0):
|
||||||
|
- Mean: 47,510,791 ops/s
|
||||||
|
- Median: 47,606,360 ops/s
|
||||||
|
- Runs: 48151673, 47596179, 47735208, 47903499, 46674576, 47977105, 47236265, 47481537, 46735322, 47616542
|
||||||
|
|
||||||
|
**Optimized** (ENV=1):
|
||||||
|
- Mean: 47,803,890 ops/s
|
||||||
|
- Median: 47,901,551 ops/s
|
||||||
|
- Runs: 47401229, 47908200, 48158776, 48126240, 47477867, 47894902, 47644796, 48191059, 47930512, 47305320
|
||||||
|
|
||||||
|
**Delta**:
|
||||||
|
- Mean: **+0.62%**
|
||||||
|
- Median: **+0.62%**
|
||||||
|
|
||||||
|
**Verdict**: NEUTRAL (below +1.0% GO threshold)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### C6-heavy Regression Check (5-run)
|
||||||
|
|
||||||
|
**Baseline** (ENV=0):
|
||||||
|
- Mean: 21,134,240 ops/s
|
||||||
|
- Median: 21,186,983 ops/s
|
||||||
|
- Runs: 21186983, 21327420, 20807950, 21112023, 21236823
|
||||||
|
|
||||||
|
**Optimized** (ENV=1):
|
||||||
|
- Mean: 21,147,197 ops/s
|
||||||
|
- Median: 21,139,301 ops/s
|
||||||
|
- Runs: 21358869, 21209299, 20992077, 21139301, 21036438
|
||||||
|
|
||||||
|
**Delta**:
|
||||||
|
- Mean: **+0.06%**
|
||||||
|
- Median: **-0.23%**
|
||||||
|
|
||||||
|
**Verdict**: PASS (no significant regression)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Implementation Summary
|
||||||
|
|
||||||
|
### Files Modified
|
||||||
|
|
||||||
|
1. **`core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`** (new)
|
||||||
|
- L0 ENV gate for LEGACY direct feature
|
||||||
|
- ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)
|
||||||
|
- API: `front_fastlane_alloc_legacy_direct_enabled()`, `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`
|
||||||
|
|
||||||
|
2. **`core/box/front_fastlane_box.h`**
|
||||||
|
- Added LEGACY direct early-exit in `front_fastlane_try_malloc()` (lines 93-119)
|
||||||
|
- **SAFETY CONSTRAINT**: Limited to C0-C3 only due to refill incompatibility for C4-C7
|
||||||
|
- Direct conditions: ENV enabled + static route ready + LEGACY route confirmed
|
||||||
|
- Direct path: `tiny_hot_alloc_fast()` → `tiny_cold_refill_and_alloc()` → fallback to `malloc_tiny_fast_for_class()`
|
||||||
|
|
||||||
|
3. **`core/bench_profile.h`**
|
||||||
|
- Added `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` to refresh sync group
|
||||||
|
|
||||||
|
4. **`Makefile`**
|
||||||
|
- Added `front_fastlane_alloc_legacy_direct_env_box.o` to OBJS_BASE, SHARED_OBJS, TINY_BENCH_OBJS_BASE
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Critical Bug & Fix
|
||||||
|
|
||||||
|
### Issue: Segmentation Fault (Exit Code 139)
|
||||||
|
|
||||||
|
**Symptom**: Benchmark crashed with ENV=1 during larger workloads (20M iterations).
|
||||||
|
|
||||||
|
**Root Cause**:
|
||||||
|
- Crash occurred in `unified_cache_refill()` → `tiny_next_read()` (intrusive pointer read)
|
||||||
|
- Initial implementation attempted to use direct path for ALL classes (C0-C7)
|
||||||
|
- Classes C4-C7 triggered incompatibility with `unified_cache_refill()` logic
|
||||||
|
- Existing dualhot code (Phase ALLOC-TINY-FAST-DUALHOT-2) only operates on C0-C3
|
||||||
|
|
||||||
|
**Backtrace**:
|
||||||
|
```
|
||||||
|
#0 0x0000555555564d89 in tiny_next_read.lto_priv.5.lto_priv ()
|
||||||
|
#1 0x00007ffff7b00318 in ?? ()
|
||||||
|
#2 0x0000555555557f29 in unified_cache_refill ()
|
||||||
|
```
|
||||||
|
|
||||||
|
**Fix Applied**:
|
||||||
|
- Limited LEGACY direct path to C0-C3 only (line 96 of front_fastlane_box.h)
|
||||||
|
- Added safety comment explaining constraint
|
||||||
|
- Matches existing proven pattern from dualhot implementation
|
||||||
|
|
||||||
|
**Code Change**:
|
||||||
|
```c
|
||||||
|
// Before (CRASHED):
|
||||||
|
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled(), 0)) {
|
||||||
|
|
||||||
|
// After (SAFE):
|
||||||
|
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Analysis
|
||||||
|
|
||||||
|
### Why +0.62% is Below Threshold
|
||||||
|
|
||||||
|
1. **Limited Scope**: Optimization only applies to C0-C3 due to safety constraint
|
||||||
|
- C4-C7 continue using full route/policy path
|
||||||
|
- Mixed benchmark uses all size classes (16-1024B = C0-C5 primarily)
|
||||||
|
|
||||||
|
2. **Existing Optimizations**: dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) already optimizes C0-C3
|
||||||
|
- LEGACY direct overlaps with dualhot coverage
|
||||||
|
- Marginal benefit when dualhot is disabled, but default config has dualhot enabled in some profiles
|
||||||
|
|
||||||
|
3. **Route Overhead Not Dominant**: After Phase 6 FastLane collapse, route/policy fixed costs are already minimized
|
||||||
|
- Phase 14-15 (cache shape) also showed NEUTRAL results
|
||||||
|
- Suggests current bottleneck is not in dispatch layers
|
||||||
|
|
||||||
|
### Root Cause of Limited Benefit
|
||||||
|
|
||||||
|
The optimization targets the same problem space as existing dualhot but with different enablement conditions:
|
||||||
|
- **dualhot**: Always enabled for C0-C3, no route check
|
||||||
|
- **LEGACY direct**: ENV-gated, requires static route confirmation
|
||||||
|
|
||||||
|
When both are active, LEGACY direct provides minimal incremental value.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Recommendations
|
||||||
|
|
||||||
|
1. **Freeze as Research Box** (default OFF)
|
||||||
|
- ENV remains opt-in: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
|
||||||
|
- No preset promotion
|
||||||
|
- Keep code for potential future use if dualhot is disabled
|
||||||
|
|
||||||
|
2. **Investigate C4-C7 Refill Issue**
|
||||||
|
- Root cause: Why does `unified_cache_refill()` fail for C4-C7 in this path?
|
||||||
|
- Possible causes:
|
||||||
|
- LIFO mode interaction (Phase 15)
|
||||||
|
- Cache state assumptions in refill logic
|
||||||
|
- Intrusive pointer corruption
|
||||||
|
- **Action**: Debug under controlled conditions before expanding to C4-C7
|
||||||
|
|
||||||
|
3. **Shift Focus Away from Dispatch Layers**
|
||||||
|
- Phase 14, 15, 16 all showed NEUTRAL results
|
||||||
|
- Phase 6 FastLane already collapsed major dispatch overhead
|
||||||
|
- **Next direction**: Investigate cache miss costs, memory layout, or backend allocation
|
||||||
|
|
||||||
|
4. **Consider Dualhot/LEGACY Direct Consolidation**
|
||||||
|
- If LEGACY direct is kept, evaluate merging with dualhot logic
|
||||||
|
- Avoid code duplication and overlap
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Comparison with Recent Phases
|
||||||
|
|
||||||
|
| Phase | Target | Delta (Mixed) | Verdict |
|
||||||
|
|-------|--------|---------------|---------|
|
||||||
|
| Phase 10 | Free LEGACY direct | +1.89% | **GO** |
|
||||||
|
| Phase 13 v1 | C7 preserve header | -0.40% | NEUTRAL (freeze) |
|
||||||
|
| Phase 14 v1 | tcache intrusive | +0.20% | NEUTRAL (freeze) |
|
||||||
|
| Phase 14 v2 | tcache hot integration | +0.08% | NEUTRAL (freeze) |
|
||||||
|
| Phase 15 v1 | UnifiedCache FIFO→LIFO | -0.70% | NEUTRAL (freeze) |
|
||||||
|
| **Phase 16 v1** | **Alloc LEGACY direct** | **+0.62%** | **NEUTRAL (freeze)** |
|
||||||
|
|
||||||
|
**Pattern**: Post-Phase-10 optimizations consistently show NEUTRAL results. Major gains came from earlier phases (FastLane collapse +11.13%, Free DeDup +5.18%, etc.). Current bottleneck likely not in dispatch/routing layers.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Files Changed
|
||||||
|
|
||||||
|
- `core/box/front_fastlane_alloc_legacy_direct_env_box.h` (new)
|
||||||
|
- `core/box/front_fastlane_alloc_legacy_direct_env_box.c` (new)
|
||||||
|
- `core/box/front_fastlane_box.h` (modified)
|
||||||
|
- `core/bench_profile.h` (modified)
|
||||||
|
- `Makefile` (modified)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ENV Variables
|
||||||
|
|
||||||
|
- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Next Steps
|
||||||
|
|
||||||
|
1. **Freeze Phase 16** with default OFF
|
||||||
|
2. **Commit with verdict**: "Phase 16 v1: NEUTRAL (+0.62%), research box"
|
||||||
|
3. **Update CURRENT_TASK.md** with Phase 16 summary
|
||||||
|
4. **Shift optimization focus** based on profiling/analysis (away from dispatch layers)
|
||||||
@ -0,0 +1,133 @@
|
|||||||
|
# Phase 16: Front FastLane Alloc LEGACY Direct v1 (turning the alloc-side “second-stage hot path” into a monolithic early-exit)
|
||||||
|
|
||||||
|
**Date**: 2025-12-15
|
||||||
|
**Status**: DESIGN (Phase 16 kickoff)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 0. Executive Summary (one page)
|
||||||
|
|
||||||
|
The Phase 14-15 line of work (pointer-chase / cache-shape) came out **NEUTRAL** and is frozen.
|
||||||
|
Next, instead of “cache shape”, we go back to **cutting the fixed cost of instruction count and branches**.
|
||||||
|
|
||||||
|
Since Phase 6, `malloc()` has been consolidated into FastLane, and almost every call follows this path:
|
||||||
|
|
||||||
|
```
|
||||||
|
malloc() → front_fastlane_try_malloc(size) → malloc_tiny_fast_for_class(size, class_idx)
|
||||||
|
```
|
||||||
|
|
||||||
|
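However, **even on the LEGACY route**, `malloc_tiny_fast_for_class()` still pays
|
||||||
|
fixed costs such as the ULTRA/C7 early branches, route_kind resolution, ENV cfg reads, and the dispatch shape.
|
||||||
|
The free side (Phase 9/10/6-2) won by moving toward a “monolithic early-exit”,
|
||||||
|
so following the same winning pattern on the alloc side and sending **LEGACY straight through at the FastLane entry** should have a high ROI.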
|
||||||
|
|
||||||
|
Phase 16 keeps Box Theory intact and adds a single “LEGACY direct” path to the FastLane alloc:
|
||||||
|
|
||||||
|
- **On hit**: `tiny_hot_alloc_fast(class_idx)` → return immediately (route/policy is never touched)
|
||||||
|
- **On miss**: `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
|
||||||
|
- **When uncertain**: fall back to the existing `malloc_tiny_fast_for_class()` (single boundary)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Current State (why)
|
||||||
|
|
||||||
|
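- Phase 6 (Front FastLane) collapsed wrapper→gate→policy→route and removed most of the entry fixed cost.
|
||||||
|
- As a result, the remaining alloc-side cost tends to concentrate in **the branches, ENV reads, and route resolution inside `malloc_tiny_fast_for_class()`**.
|
||||||
|
- Changing the “UnifiedCache shape” in Phase 14/15 did not move Mixed → at present, **cache shape is not the dominant factor**.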
|
||||||
|
|
||||||
|
Phase 16 therefore leaves the cache internals unchanged and **cuts the route/policy fixed cost**.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Proposal (Phase 16 v1)
|
||||||
|
|
||||||
|
### 2.1 Boxes to Add (Box Theory)
|
||||||
|
|
||||||
|
```
|
||||||
|
L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / rollback)
|
||||||
|
↓
|
||||||
|
L1: front_fastlane_try_malloc() (LEGACY direct early-exit)
|
||||||
|
↓
|
||||||
|
L2: malloc_tiny_fast_for_class() (existing: route/policy/ULTRA/MID/V7)
|
||||||
|
↓
|
||||||
|
L3: tiny_front_hot_box / tiny_front_cold_box (existing: unified cache / refill)
|
||||||
|
```
|
||||||
|
|
||||||
|
**There is exactly one boundary**:
|
||||||
|
- “Direct conditions not met / direct failed” → fall through to `malloc_tiny_fast_for_class()`.
|
||||||
|
|
||||||
|
### 2.2 ENV
|
||||||
|
|
||||||
|
- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)
|
||||||
|
|
||||||
|
Initially opt-in, for A/B testing.
|
||||||
|
On GO, consider promotion to presets (staged, starting with MIXED only).
|
||||||
|
|
||||||
|
### 2.3 Direct Conditions (Fail-Fast)
|
||||||
|
|
||||||
|
Alloc direct is restricted to cases **“where the route can be asserted with certainty”**:
|
||||||
|
|
||||||
|
Required conditions (recommended):
|
||||||
|
- FastLane is enabled (existing)
|
||||||
|
- `size <= tiny_get_max_size()` (existing)
|
||||||
|
- `class_idx` is valid (existing)
|
||||||
|
- The class is included in `front_fastlane_class_mask` (existing)
|
||||||
|
- `tiny_static_route_ready_fast()` is true (do not use direct when it is false, e.g. because of the Learner interlock)
|
||||||
|
- `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)
|
||||||
|
|
||||||
|
Then:
|
||||||
|
- `tiny_hot_alloc_fast(class_idx)` → return on hit
|
||||||
|
- On miss, call `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
|
||||||
|
- Only if that still returns NULL, fall back to `malloc_tiny_fast_for_class()` (safety first)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Observability (minimal)
|
||||||
|
|
||||||
|
Always-on logging is forbidden in Release builds.
|
||||||
|
If needed, expose the following only under `HAKMEM_DEBUG_COUNTERS=1`:
|
||||||
|
|
||||||
|
- `front_fastlane_alloc_legacy_direct_hit`
|
||||||
|
- `front_fastlane_alloc_legacy_direct_miss`
|
||||||
|
- `front_fastlane_alloc_legacy_direct_fallback`
|
||||||
|
|
||||||
|
(Confine atomics to the stats box; do not place atomics on the hot side.)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. A/B Measurement (same binary)
|
||||||
|
|
||||||
|
GO/NO-GO (Mixed 10-run, clean env):
|
||||||
|
- GO: mean +1.0% or more
|
||||||
|
- NO-GO: mean -1.0% or less (immediate rollback / freeze)
|
||||||
|
- NEUTRAL: within ±1.0% (research box freeze)
|
||||||
|
|
||||||
|
Targets:
|
||||||
|
- `scripts/run_mixed_10_cleanenv.sh`
|
||||||
|
- Additionally, a C6-heavy 5-run (to confirm no regression)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Risks and Mitigations
|
||||||
|
|
||||||
|
### Risk 1: The “asserted LEGACY” assumption breaks and allocations are misrouted
|
||||||
|
|
||||||
|
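Mitigations:
|
||||||
|
- Make `tiny_static_route_ready_fast()` a required condition (it is expected to be false while the Learner is enabled)
|
||||||
|
- Always check route_kind (do not rely on the mask alone)
|
||||||
|
- On failure, always fall back to the existing path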
|
||||||
|
|
||||||
|
### Risk 2: The direct path covers too little to have a measurable effect
|
||||||
|
|
||||||
|
Mitigations:
|
||||||
|
- First make the “LEGACY ratio” in Mixed visible through stats (debug counters only)
|
||||||
|
- If it does not help, freeze (same treatment as Phase 14/15)
|
||||||
|
|
||||||
|
### Risk 3: The added branch backfires (a repeat of Phase 11)
|
||||||
|
|
||||||
|
Mitigations:
|
||||||
|
- Evaluate the direct condition **only once, inside FastLane** (do not add call-site helpers)
|
||||||
|
- When the direct condition is false, call the existing `malloc_tiny_fast_for_class()` unchanged
|
||||||
|
|
||||||
@ -0,0 +1,124 @@
|
|||||||
|
# Phase 16: Front FastLane Alloc LEGACY Direct v1 — Next Instructions
|
||||||
|
|
||||||
|
Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 0. Status / Why now
|
||||||
|
|
||||||
|
- Phase 14-15 (tcache / FIFO→LIFO) were **NEUTRAL** → frozen (default OFF)
|
||||||
|
- The next target is not “cache shape” but **reducing alloc-side route/policy fixed costs**
|
||||||
|
- On the free side, the “monolithic early-exit + dedup” of Phase 9/10/6-2 was the winning pattern → apply the same pattern to the alloc side
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. GO Criteria
|
||||||
|
|
||||||
|
Mixed 10-run (clean env):
|
||||||
|
- **GO**: mean +1.0% or more
|
||||||
|
- **NO-GO**: mean -1.0% or less (immediate rollback / freeze)
|
||||||
|
- **NEUTRAL**: within ±1.0% → research box freeze
|
||||||
|
|
||||||
|
Additional gate (required):
|
||||||
|
- In an environment where `tiny_static_route_ready_fast()` is true, LEGACY direct is actually being taken (even better if this can be confirmed with debug counters)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Box Diagram (single boundary)
|
||||||
|
|
||||||
|
```
|
||||||
|
L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / refresh)
|
||||||
|
↓
|
||||||
|
L1: front_fastlane_box.h (early-exit inside try_malloc)
|
||||||
|
↓
|
||||||
|
L2: malloc_tiny_fast_for_class() (existing path)
|
||||||
|
```
|
||||||
|
|
||||||
|
The boundary is fixed at exactly one place: **“direct conditions not met / direct returns NULL → malloc_tiny_fast_for_class”**.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Patch Order (small increments)
|
||||||
|
|
||||||
|
### Patch 1: L0 ENV gate box (reversible)
|
||||||
|
|
||||||
|
New:
|
||||||
|
- `core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`
|
||||||
|
|
||||||
|
ENV:
|
||||||
|
- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0)
|
||||||
|
|
||||||
|
API (example):
|
||||||
|
- `front_fastlane_alloc_legacy_direct_enabled() -> int`
|
||||||
|
- `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`
|
||||||
|
|
||||||
|
Requirements:
|
||||||
|
- No `getenv()` on the hot path (cached)
|
||||||
|
- Provide a refresh function to stay in sync with `bench_profile`'s `putenv()` (Phase 8/13/14/15 pattern)
|
||||||
|
|
||||||
|
### Patch 2: Integration Point (a single path in FastLane alloc)
|
||||||
|
|
||||||
|
Target:
|
||||||
|
- `core/box/front_fastlane_box.h`
|
||||||
|
|
||||||
|
Change:
|
||||||
|
- After the class mask check in `front_fastlane_try_malloc()`, add the following “direct path”
|
||||||
|
|
||||||
|
Direct conditions (Fail-Fast):
|
||||||
|
1. `front_fastlane_alloc_legacy_direct_enabled() == 1`
|
||||||
|
2. `tiny_static_route_ready_fast()` is true (direct is forbidden when it is false, e.g. because of the Learner interlock)
|
||||||
|
3. `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)
|
||||||
|
|
||||||
|
Direct body:
|
||||||
|
- `void* p = tiny_hot_alloc_fast(class_idx);`
|
||||||
|
- `if (p) return p;`
|
||||||
|
- `p = tiny_cold_refill_and_alloc(class_idx);`
|
||||||
|
- `if (p) return p;`
|
||||||
|
- Only on failure, fall back to `malloc_tiny_fast_for_class(size, class_idx)` (safe side)
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
- Prioritize “do not add call-site helpers” (lesson from Phase 11)
|
||||||
|
- Only **LEGACY** goes direct (ULTRA/MID/V7 are left to the existing path)
|
||||||
|
|
||||||
|
### Patch 3: bench_profile Sync (prevent missed ENV updates)
|
||||||
|
|
||||||
|
Target:
|
||||||
|
- `core/bench_profile.h`
|
||||||
|
|
||||||
|
Change:
|
||||||
|
- Add `front_fastlane_alloc_legacy_direct_env_refresh_from_env();` to the refresh group under `#ifdef USE_HAKMEM`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. A/B (same binary)
|
||||||
|
|
||||||
|
Baseline:
|
||||||
|
```sh
|
||||||
|
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Optimized:
|
||||||
|
```sh
|
||||||
|
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Additional (regression detection):
|
||||||
|
```sh
|
||||||
|
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
|
||||||
|
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Health Check
|
||||||
|
|
||||||
|
```sh
|
||||||
|
scripts/verify_health_profiles.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Rollback
|
||||||
|
|
||||||
|
- `export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
|
||||||
|
|
||||||
@ -0,0 +1,89 @@
|
|||||||
|
# Phase 17: FORCE_LIBC Gap Validation v1 — A/B Test Results
|
||||||
|
|
||||||
|
**Date**: 2025-12-15
|
||||||
|
**Verdict**: ✅ **Case B confirmed** — **Layout / I-cache penalty dominates**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Phase 17 validated the “system malloc is faster than hakmem” observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**:
|
||||||
|
|
||||||
|
- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**.
|
||||||
|
- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary.
|
||||||
|
|
||||||
|
Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Measurement Setup
|
||||||
|
|
||||||
|
Workload:
|
||||||
|
- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400`
|
||||||
|
- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh`
|
||||||
|
|
||||||
|
Two comparisons:
|
||||||
|
1) **Same-binary toggle** (allocator logic delta)
|
||||||
|
2) **System binary** (layout penalty delta)
|
||||||
|
|
||||||
|
---
## Results

### 1) Same-binary A/B (allocator delta)

Binary: `bench_random_mixed_hakmem`
Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`

| Mode | Throughput (ops/s) | Delta |
|------|---------------------|-------|
| hakmem (`FORCE_LIBC=0`) | 48.12M | baseline |
| libc (`FORCE_LIBC=1`) | 48.31M | **+0.39%** |

Interpretation: the allocator logic delta is at noise level in this experiment.

### 2) System binary (layout penalty)

Binary: `bench_random_mixed_system`

| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary |
|------|---------------------|--------------------------------|
| system malloc | 83.85M | **+73.57%** |

Total observed gap: roughly +74% (system binary vs hakmem in the hakmem binary).
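
The deltas in the two tables follow directly from the throughput figures. A minimal sketch of that arithmetic, assuming `awk` is available; `pct_delta` is a hypothetical helper name used only for illustration, not something in the repository. The first call reproduces the same-binary delta and the second the system-binary delta.

```sh
# pct_delta BASE NEW: print the relative change of NEW over BASE as a signed percentage.
# Hypothetical helper for illustration only.
pct_delta() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%+.2f%%\n", (new - base) / base * 100 }'
}

pct_delta 48.12 48.31   # libc vs hakmem, same binary
pct_delta 48.31 83.85   # system binary vs libc-in-hakmem-binary
```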

---

## Perf Stat (200M iterations): Smoking Gun

| Metric | hakmem binary | system binary | Delta (system vs hakmem) |
|--------|---------------|---------------|--------------------------|
| I-cache misses | 153K | 68K | **-55%** |
| Cycles | 17.9B | 10.2B | **-43%** |
| Instructions | 41.3B | 21.5B | **-48%** |
| Binary size | 653K | 21K | **-97%** |

Interpretation:

- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
- The 30× text footprint difference strongly correlates with the gap.
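
Dividing the counters by the 200M iterations gives a per-iteration view of the same table. This is only a rough sketch: `perf stat` counts the whole process, startup included, so the figures are averages rather than exact hot-path costs, and `per_iter` is a hypothetical helper name.

```sh
# per_iter TOTAL ITERS: average number of events per benchmark iteration.
# Hypothetical helper for illustration only.
per_iter() {
  awk -v total="$1" -v iters="$2" 'BEGIN { printf "%.1f\n", total / iters }'
}

ITERS=200000000
echo "instructions/iter: hakmem=$(per_iter 41300000000 "$ITERS") system=$(per_iter 21500000000 "$ITERS")"
echo "cycles/iter:       hakmem=$(per_iter 17900000000 "$ITERS") system=$(per_iter 10200000000 "$ITERS")"
```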

---

## Conclusion

Phase 12's "system malloc is 1.6× faster" observation was real, but the root cause was misattributed:

- ❌ Not primarily allocator algorithm differences
- ✅ **Text/layout + I-cache locality + instruction footprint**

This shifts the optimization frontier:

- Stop chasing further routing/dispatch micro-optimizations (Phase 14–16 plateau)
- Focus on **Hot Text Isolation / layout control**

---

## Next

Proceed to:

- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`

@@ -0,0 +1,130 @@
# Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) - Next Instructions

## Status (Premises)

- Phases 14–16 are **NEUTRAL / research box freeze** (the dispatch / cache-shape / pointer-chase line of work has plateaued).
- Phase 16 v1 (FastLane alloc LEGACY direct) is **NEUTRAL (+0.62%)** and **limited to C0–C3** (C4–C7 segfault, so they are excluded for safety).
- Phase 12 observed that "system malloc is faster than hakmem", but **cross-binary comparisons are easily distorted by layout/LTO differences**.

The purpose of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and to record, as the SSOT (single source of truth), what the gap actually consists of: an allocator difference or a binary difference.

---
## 0. Goals (Deliverables)

1) **Same-binary A/B** using `bench_random_mixed_hakmem`:
   - A: `HAKMEM_FORCE_LIBC_ALLOC=0` (hakmem)
   - B: `HAKMEM_FORCE_LIBC_ALLOC=1` (libc)

2) **Gap decomposition against a separate binary** (optional)
   - Also measure `bench_random_mixed_system` (the small binary) and compare it with `libc-in-hakmem-binary` to estimate the **layout penalty**.

3) **Decide the next main battleground** (a direction decision, not a GO/NO-GO).

---
## 1. Procedure (reproducibility first)

### 1.1 Build (pinned to a single commit)

```sh
make -j bench_random_mixed_hakmem bench_random_mixed_system
```

### 1.2 Clean ENV (pin the Phase 14–16 research knobs)

Recommended: use `scripts/run_mixed_10_cleanenv.sh` (prevents ENV leakage).

Additionally, set the following explicitly so that Phase 16 is definitely OFF:

```sh
export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0
```

### 1.3 Same-binary A/B (the main measurement)

**A: hakmem**

```sh
HAKMEM_FORCE_LIBC_ALLOC=0 scripts/run_mixed_10_cleanenv.sh
```

**B: libc (same binary)**

```sh
HAKMEM_FORCE_LIBC_ALLOC=1 scripts/run_mixed_10_cleanenv.sh
```

Record:

- mean / median / stdev (10-run)
- Min/Max
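
The statistics listed above can be computed from the ten throughput values with standard tools. A minimal sketch, assuming one throughput number per line on stdin; the exact format of the benchmark's `Throughput` line is not shown here, so extracting the number from the log is left out. `summarize_runs` is a hypothetical helper, and the stdev it prints is the sample standard deviation (n-1 denominator).

```sh
# summarize_runs: read one number per line on stdin, print n/mean/median/stdev/min/max.
# Hypothetical helper for illustration only.
summarize_runs() {
  sort -n | awk '
    NF { v[++n] = $1; sum += $1 }          # input is already sorted ascending
    END {
      if (n == 0) { print "n=0"; exit 1 }
      mean = sum / n
      for (i = 1; i <= n; i++) ss += (v[i] - mean) ^ 2
      stdev = (n > 1) ? sqrt(ss / (n - 1)) : 0
      median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      printf "n=%d mean=%.2f median=%.2f stdev=%.2f min=%.2f max=%.2f\n", n, mean, median, stdev, v[1], v[n]
    }'
}

# Dummy values (not measurements), just to show the output shape.
printf '%s\n' 5 1 4 2 3 | summarize_runs
```

Feeding it the ten values from run A and then from run B gives the two summary lines to compare.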

### 1.4 Optional: system binary baseline (layout penalty estimate)

```sh
for i in $(seq 1 10); do
  echo "=== Run ${i}/10 (system bin) ==="
  ./bench_random_mixed_system "${ITERS:-20000000}" "${WS:-400}" 1 2>&1 | rg "Throughput" || true
done
```

Interpretation:

- `system bin` is much faster than `FORCE_LIBC` → the **layout/text size penalty** dominates
- `FORCE_LIBC` is much faster than `hakmem` → the **allocator logic difference** dominates

---
## 2. Decision (direction branches)

### Case A: `FORCE_LIBC` is **+20% or more** faster than hakmem

Conclusion: the bulk of the gap is on the allocator logic side (instruction count / fixed costs).

Next core (recommended):

- **Phase 18: Free FastPath Gate Consolidation**
  - Consolidate the ENV gate / TLS probe inside `free_tiny_fast()` into a single check at the FastLane entry
  - Goal: cut the per-call gate fixed cost while keeping the winning "monolithic early-exit" pattern
  - Box boundary: a single point, `front_fastlane_try_free()` → `free_tiny_fast_with_snapshot()`
  - Reversible: `HAKMEM_FREE_TINY_FAST_SNAPSHOT=0/1`

### Case B: `FORCE_LIBC` is within **±5%** of hakmem

Conclusion: the allocator difference is small, and Phase 12's "system malloc 1.6x" most likely comes from another factor (binary differences / measurement setup).

Next core (recommended):

- **Phase 18: Hot Text Isolation / Layout Control**
  - Banish cold code to a separate TU with `__attribute__((cold,noinline))`
  - If possible, stabilize the I-cache via link order (fixing the order of hot functions)
  - A/B within the same binary using `HAKMEM_LAYOUT_MODE=0/1` (switching only sections/attributes)

### Case C: `FORCE_LIBC` is faster than hakmem, but the gap to `system bin` is also large

Conclusion: there is **both** an allocator difference and a layout penalty.

Next core:

- First cut the **layout penalty** (Phase 18 Hot Text Isolation)
- Then move on to **gate consolidation** (Phase 19)
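
The three cases can be expressed as a small decision helper. This is a sketch under stated assumptions, not the project's rule: the text only fixes the ±5% band (Case B) and the +20% bar (Case A), so treating a system-binary gap of 20% or more as "large" for Case C, checking Case C before Case A, and the `undecided` fallback are all assumptions made here. `classify_gap` is a hypothetical name. With the Phase 17 figures (48.12M, 48.31M, 83.85M) it prints `Case B`.

```sh
# classify_gap HAKMEM LIBC SYSTEM: throughput of hakmem, FORCE_LIBC (same binary), system binary.
# Thresholds beyond the documented +/-5% and +20% are assumptions for illustration.
classify_gap() {
  awk -v h="$1" -v l="$2" -v s="$3" 'BEGIN {
    d1 = (l - h) / h * 100        # libc vs hakmem, same binary
    d2 = (s - l) / l * 100        # system binary vs libc-in-hakmem-binary
    if (d1 >= -5 && d1 <= 5)      print "Case B"
    else if (d1 > 5 && d2 >= 20)  print "Case C"
    else if (d1 >= 20)            print "Case A"
    else                          print "undecided"
  }'
}

classify_gap 48.12 48.31 83.85   # Phase 17 measurements
```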

---

## 3. Visualization (minimal)

- Save the raw throughput of the 10 runs (the `scripts/run_mixed_10_cleanenv.sh` output log is sufficient)
- Additionally, take a single `perf stat` run (200M iters, 1 run):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FORCE_LIBC_ALLOC=0 \
  ./bench_random_mixed_hakmem 200000000 400 1
```

Take one more run with the same command and `HAKMEM_FORCE_LIBC_ALLOC=1`.
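
The event list in the command above has no instruction-cache counter, although I-cache behaviour is one of the signals this phase looks at. A hedged sketch that adds one: `L1-icache-load-misses` and `iTLB-load-misses` are generic perf event aliases, but whether they are available depends on the CPU and kernel, so check `perf list` first. `perf_ab_cmd` is a hypothetical helper that only prints the command line for each side of the A/B so it can be reviewed before running.

```sh
# Events from the command above plus two instruction-side counters (availability is an assumption).
PERF_EVENTS="cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults,L1-icache-load-misses,iTLB-load-misses"

# perf_ab_cmd FORCE_LIBC: print the perf stat command for one side of the A/B (0 = hakmem, 1 = libc).
perf_ab_cmd() {
  printf 'perf stat -e %s -- env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FORCE_LIBC_ALLOC=%s ./bench_random_mixed_hakmem 200000000 400 1\n' \
    "$PERF_EVENTS" "$1"
}

for mode in 0 1; do
  perf_ab_cmd "$mode"   # review the line, then run it with: eval "$(perf_ab_cmd "$mode")"
done
```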

---

## 4. Important Rules (Box Theory)

- Run A/B in the **same binary** (so that layout/LTO differences do not cause a misjudgment)
- Every new optimization must have an ENV gate (reversible) and a single boundary point
- When in doubt, prefer "Fail-Fast, then fallback" (consistency over speed)

docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md (Normal file, +135)
@@ -0,0 +1,135 @@
# Phase 18: Hot Text Isolation v1 - Design

## 0. Context (from Phase 17)

Phase 17 established **Case B**:

- The same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` toggle shows that the **allocator delta is negligible**.
- The large gap appears only against the tiny `bench_random_mixed_system` binary.

Signal:

- I-cache misses, instructions, and cycles are all far worse in the hakmem-linked binary.
- Binary size (`~653K` for hakmem vs `~21K` for system) correlates with the throughput gap.

Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`

---
## 1. Goal

Reduce the **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.

Primary success metric:

- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
  - `iTLB/icache misses` (or the "I-cache misses" counter used in Phase 17)
  - total instructions executed per 200M iters

---
## 2. Non-goals

- No allocator algorithm redesign.
- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
- No "delete code = faster" experiments (Phase 17 showed layout dominates; deletions confound results).

---
## 3. Box Theory framing

This is a "build/layout box":

- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
- **Boundary**: build flag / TU split (no runtime overhead)
- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
- **Observability**: perf stat + binary size (no always-on logs)

---
## 4. Design: v1 tactics (low-risk)

### 4.1 Hot/Cold attributes SSOT

Introduce a single header defining attributes:

- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally `.text.hak_hot`)
- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally `.text.hak_cold`)

Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`.

Why:

- Makes "what is hot/cold" explicit and consistent (SSOT).
- Lets us annotate a small set of functions without scattering ad-hoc attributes.
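
One possible shape for that header, written as a shell snippet so it can be generated and compile-checked in one go. Everything here is a sketch under assumptions: the header name follows the Phase 18 patch plan (`core/box/hot_text_attrs_box.h`), but the include guard, the macro bodies, the omission of the optional `.text.hak_hot` / `.text.hak_cold` sections, and the helper names `hot_text_attrs_sketch` / `check_attrs` are all illustrative. The `hot`, `cold`, and `noinline` attributes themselves are standard GCC/Clang function attributes.

```sh
# Print a hypothetical hot_text_attrs_box.h (illustrative macro bodies, not the real file).
hot_text_attrs_sketch() {
  cat <<'EOF'
#ifndef HAK_HOT_TEXT_ATTRS_BOX_H
#define HAK_HOT_TEXT_ATTRS_BOX_H

#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

#endif
EOF
}

# Compile a tiny TU with the knob OFF and ON, if a C compiler is available.
check_attrs() {
  command -v cc >/dev/null 2>&1 || { echo "no cc, skipping"; return 0; }
  tmp=$(mktemp -d) || return 1
  hot_text_attrs_sketch > "$tmp/hot_text_attrs_box.h"
  printf '%s\n' '#include "hot_text_attrs_box.h"' \
    'HAK_COLD_FN int slow_path(int x) { return x + 1; }' \
    'HAK_HOT_FN int fast_path(int x) { return x > 0 ? x : slow_path(x); }' > "$tmp/t.c"
  for v in 0 1; do
    cc -c -DHAKMEM_HOT_TEXT_ISOLATION="$v" -o "$tmp/t$v.o" "$tmp/t.c" || { rm -rf "$tmp"; return 1; }
  done
  rm -rf "$tmp"
  echo "ok"
}

check_attrs || echo "compile check failed"
```

Keeping the OFF branch as empty macros is what makes the knob a pure rollback: annotated functions compile exactly as before when `HAKMEM_HOT_TEXT_ISOLATION=0`.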

### 4.2 Translation-unit split for wrappers

Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:

- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`

Why:

- Prevents wrapper text from being interleaved with unrelated code in the same TU.
- Improves the linker's ability to cluster hot code.
- Enables future link-order experiments (symbol ordering files) without touching allocator logic.

### 4.3 Cold code isolation

Ensure rarely-hit helpers stay cold/out-of-line:

- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
- "slow fallback" paths (`malloc_cold`, `free_cold`)

Principle:

- Hot path must remain a straight-line "try → return" shape.
- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.

### 4.4 Optional: section GC for bench builds

For bench binaries only:

- add `-ffunction-sections -fdata-sections`
- link with `-Wl,--gc-sections`

Why:

- Drops truly-unused text and reduces overall text pressure.
- Helps the linker keep hot text denser.

This is optional because it is toolchain-sensitive; measure before promoting.

---
## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out

Phase 17 shows that the hakmem-linked binary executes about 2x the instructions of the tiny system binary. If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.

Concept:

- Introduce a `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
- In this mode:
  - "promoted defaults" are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
  - ENV gates become compile-time (no TLS/env probing in hot path)
  - Hot counters/stats macros compile out completely

Why this still fits Box Theory:

- It is a **build box** (reversible by knob), not an algorithm rewrite
- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
- Observability shifts to `perf stat` (no always-on logging)

Expected impact:

- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).

## 5. Risks / mitigations

### Risk A: layout tweaks regress throughput

Mitigation:

- A/B using the same workload + perf stat counters (Phase 17 set).
- If regression: keep as research-only (build knob default OFF).

### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)

Mitigation:

- Keep v1 minimal (TU split + attributes first).
- Only enable `--gc-sections` if it's stable in the current toolchain.

---
## 6. Expected impact

Conservative:

- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.

Stretch goal:

- Bring "hakmem-linked + FORCE_LIBC" closer to the `bench_random_mixed_system` ceiling by minimizing the wrapper text working set.

docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md (Normal file, +165)
@@ -0,0 +1,165 @@
# Phase 18: Hot Text Isolation v1 - Next Instructions

## Status

- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**.
- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**.

Refs:

- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`

---
## 0. Goal / Success Criteria

Primary (v1 is expected to be "low risk, small effect"):

- Mixed (16–1024B) throughput of **+2%** or more is a GO (a realistic bar for layout work)

Secondary (must move in the right direction):

- I-cache misses reduced (guideline: **-10%** or better)
- Total instructions reduced (guideline: **-5%** or better)

If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.

---
## 1. Patch Plan (small, reversible)

### Patch 1: Hot/Cold attribute SSOT (L0 Box)

Add:

- `core/box/hot_text_attrs_box.h`

Defines:

- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`)

Usage:

- annotate only a short, high-impact list first:
  - wrappers: `malloc/free/calloc/realloc`
  - FastLane entry helpers (if non-inline)
  - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics

Rollback: build knob off.

### Patch 2: Wrapper TU split (L1 Box boundary)

Move wrapper definitions out of `core/hakmem.c`:

- new: `core/hak_wrappers_box.c`
  - `#include "box/hak_wrappers.inc.h"`
- remove wrapper include from `core/hakmem.c`

Rationale:

- Prevents wrapper text from being interleaved with unrelated code in one TU.
- Sets up link-order clustering.

Rollback: restore include in `core/hakmem.c` and drop new TU.

### Patch 3 (optional): bench-only section GC

Makefile knob:

- `HOT_TEXT_ISOLATION=0/1`

When `=1`, add for bench builds:

- `-DHAKMEM_HOT_TEXT_ISOLATION=1`
- `-ffunction-sections -fdata-sections`
- `LDFLAGS += -Wl,--gc-sections`

Notes:

- Keep it bench-only first (do not touch shared lib build until proven stable).
- If the toolchain rejects `--gc-sections` or results are unstable → skip this patch.
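
Whether the toolchain accepts these flags can be probed before wiring them into the Makefile. A minimal sketch, assuming a `cc`-compatible driver (overridable via `CC`); `probe_gc_sections` is a hypothetical helper and is not part of the repository. It only checks that a trivial program links with the flags, which says nothing about result stability, so the A/B measurement is still required.

```sh
# probe_gc_sections: print "supported", "rejected", or "unknown" (no compiler found).
# Hypothetical helper for illustration only.
probe_gc_sections() {
  cc_bin=${CC:-cc}
  command -v "$cc_bin" >/dev/null 2>&1 || { echo "unknown"; return 0; }
  tmp=$(mktemp -d) || return 1
  printf 'int main(void) { return 0; }\n' > "$tmp/p.c"
  if "$cc_bin" -ffunction-sections -fdata-sections -Wl,--gc-sections \
       -o "$tmp/p" "$tmp/p.c" >/dev/null 2>&1; then
    echo "supported"
  else
    echo "rejected"
  fi
  rm -rf "$tmp"
}

probe_gc_sections
```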

---

## 2. A/B Procedure (required)

### 2.1 Baseline build (OFF)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (1 run, 200M iters):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.2 Optimized build (ON)

```sh
make clean
make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (same command):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.3 System ceiling check (optional)

```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
```

---
## 3. GO/NO-GO Decision

- **GO**: Mixed 10-run mean of **+2%** or more and no health regressions
- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
- **NO-GO**: **-2%** or worse → rollback and freeze

Health profiles:

```sh
scripts/verify_health_profiles.sh
```
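
The throughput side of this decision is simple threshold arithmetic. A minimal sketch, assuming the two inputs are the 10-run means of the OFF and ON builds; `ab_verdict` is a hypothetical helper, and it does not model the health-profile condition, which still has to be checked separately with `scripts/verify_health_profiles.sh`. The sample calls use illustrative values, not measurements.

```sh
# ab_verdict BASE NEW: classify the mean throughput change against the +/-2% thresholds.
# Hypothetical helper for illustration only; health regressions are not modeled.
ab_verdict() {
  awk -v base="$1" -v new="$2" 'BEGIN {
    d = (new - base) / base * 100
    if (d >= 2)       v = "GO"
    else if (d <= -2) v = "NO-GO"
    else              v = "NEUTRAL"
    printf "%s (%+.2f%%)\n", v, d
  }'
}

ab_verdict 100 103   # illustrative values
ab_verdict 100 101
ab_verdict 100 97
```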

---

## 4. Reporting (required artifacts)

Create:

- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
  - throughput A/B (10-run)
  - binary sizes
  - perf stat table (cycles/instructions/I-cache)
  - conclusion (GO/NEUTRAL/NO-GO)

Update:

- `CURRENT_TASK.md` (Phase 18 status + next)

---
## 5. Notes / guardrails

- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine).
- Avoid "delete code" experiments; only isolate/cold/cluster.
- Keep the hot path clean: allocations, logging, and TLS-heavy helpers belong in cold code only.

---
## 6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL)

To directly cut the "2x instructions" seen in Phase 17, layout work alone is probably not enough; it will most likely be necessary to **compile out the fixed costs of the ENV/stats/debug code mixed into the hot path**.

Next move (bench-only binary, reversible):

- With `HAKMEM_BENCH_MINIMAL=1` (Makefile knob):
  - Pin the "always-ON" paths of FastLane / wrappers and turn the ENV gates into compile-time constants
  - Compile out the hot counters completely
  - Observe only via `perf stat` (no always-on logging)

Expected: +10–20% (if instruction footprint really dominates, this is where the numbers should move significantly)

@@ -15,9 +15,12 @@ export HAKMEM_TINY_C7_PRESERVE_HEADER=${HAKMEM_TINY_C7_PRESERVE_HEADER:-0}
 export HAKMEM_TINY_TCACHE=${HAKMEM_TINY_TCACHE:-0}
 export HAKMEM_TINY_TCACHE_CAP=${HAKMEM_TINY_TCACHE_CAP:-64}
 export HAKMEM_MALLOC_TINY_DIRECT=${HAKMEM_MALLOC_TINY_DIRECT:-0}
+export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=${HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT:-0}
+export HAKMEM_FORCE_LIBC_ALLOC=${HAKMEM_FORCE_LIBC_ALLOC:-0}
 export HAKMEM_ENV_SNAPSHOT_SHAPE=${HAKMEM_ENV_SNAPSHOT_SHAPE:-0}
-export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-0}
-export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-0}
+# NOTE: Phase 9/10 are promoted (bench_profile defaults to 1). Keep cleanenv aligned by default.
+export HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=${HAKMEM_FREE_TINY_FAST_MONO_DUALHOT:-1}
+export HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=${HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT:-1}
 
 for i in $(seq 1 "${runs}"); do
   echo "=== Run ${i}/${runs} ==="