Working state before pushing to cyu remote

Moe Charm (CI)
2025-12-19 03:45:01 +09:00
parent e4c5f05355
commit 2013514f7b
28 changed files with 1968 additions and 43 deletions

@@ -1,5 +1,193 @@
# CURRENT_TASK (Rolling, SSOT)
## SSOT (current source of truth)
- **Performance SSOT**: `scripts/run_mixed_10_cleanenv.sh` (WS=400, RUNS=10, size range forced to 16..1040, `*_ONLY` forced OFF)
- **Route verification**: `scripts/run_mixed_observe_ssot.sh` (OBSERVE only; not for throughput comparison)
- **Build modes**: `docs/analysis/SSOT_BUILD_MODES.md`
- **External comparison (short runs)**: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md` (LD_PRELOAD with the same binary + `hakmem_force_libc` to isolate the cause)
## Phase 87-88 (closed): NO-GO
**Status**: ✅ **OBSERVE verified** + ❌ **Phase 88 NO-GO**
### Phase 87: Inline Slots Verification
**Initial Finding (Wrong)**: Standard binary showed PUSH TOTAL/POP TOTAL = 0
- **Root Cause**: ENV drift (`HAKMEM_BENCH_MIN_SIZE/MAX_SIZE` were not pinned)
- Fix: `scripts/run_mixed_10_cleanenv.sh` now pins the size range (MIN=16, MAX=1040)
- Also forces `HAKMEM_BENCH_C5_ONLY=0`, `HAKMEM_BENCH_C6_ONLY=0`, `HAKMEM_BENCH_C7_ONLY=0`
**Corrected Finding (OBSERVE binary)** - 20M ops Mixed SSOT WS=400:
```
PUSH TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
POP TOTAL: C4=687,564 C5=1,373,605 C6=2,750,862 TOTAL=4,812,031 ✓
PUSH FULL: 0 (0.00%)
POP EMPTY: 168 (0.003%)
JUDGMENT: ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89
```
### Phase 88: Batch Drain Optimization
**Overflow Analysis**:
- POP EMPTY rate: 168 / 4,812,031 = **0.003%** ← negligible
- PUSH FULL rate: 0 / 4,812,031 = **0%** ← never occurs
- **Decision**: Batching would not change throughput (overflow almost never occurs)
**Phase 88 Decision**: **NO-GO (frozen)**
- Rationale: at a 0.003% overflow rate, the layout tax risk outweighs the expected gain
- Infrastructure: keep the observation telemetry (so the result can be re-verified if WS or capacity changes later)
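The overflow arithmetic above can be reproduced from the OBSERVE counters. The following is a minimal sketch, not the project's telemetry code: the `SlotCounters` struct, the function names, and the caller-chosen threshold are all hypothetical, and only the counter values in the usage note come from the Phase 87 output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the OBSERVE counters (field names are illustrative). */
typedef struct {
    uint64_t push_by_class[3]; /* C4, C5, C6 */
    uint64_t push_full;        /* pushes rejected because the slots were full */
    uint64_t pop_empty;        /* pops that found the slots empty */
} SlotCounters;

/* Sum of the per-class push counters. */
static uint64_t push_total(const SlotCounters *c) {
    assert(c != NULL);
    return c->push_by_class[0] + c->push_by_class[1] + c->push_by_class[2];
}

/* Event count as a percentage of total operations; 0 when there was no traffic. */
static double event_rate_percent(uint64_t events, uint64_t total) {
    if (total == 0) return 0.0;
    return 100.0 * (double)events / (double)total;
}

/* Returns 1 when either overflow rate reaches the caller-chosen threshold.
 * The threshold is an assumption; the notes above do not define one. */
static int batching_worth_measuring(const SlotCounters *c, double min_rate_percent) {
    uint64_t total = push_total(c);
    return event_rate_percent(c->push_full, total) >= min_rate_percent ||
           event_rate_percent(c->pop_empty, total) >= min_rate_percent;
}
```

With the Phase 87 counters (C4=687,564, C5=1,373,605, C6=2,750,862, PUSH FULL=0, POP EMPTY=168), `push_total` gives 4,812,031 and the POP EMPTY rate comes out at roughly 0.0035%, far below any plausible batching threshold.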
**Artifacts Created**:
- Telemetry box: `core/box/tiny_inline_slots_overflow_stats_box.h/c`
- Phase 87 results: `docs/analysis/PHASE87_OBSERVATION_RESULTS.md`
- SSOT hardening: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
- ENV drift prevention doc: `docs/analysis/BENCH_REPRODUCIBILITY_SSOT.md`
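**Key Learning**:
- Confirming that a path is actually taken requires the **OBSERVE binary + total counters**
- Keep observation and performance measurement separate (avoids telemetry overhead)
- ENV drift (MIN/MAX size, CLASS_ONLY) is the main factor that changes the route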
**Follow-up Fix (SSOT hardening)**:
- `scripts/run_mixed_10_cleanenv.sh` now forces `HAKMEM_BENCH_MIN_SIZE=16` / `HAKMEM_BENCH_MAX_SIZE=1040` and disables `HAKMEM_BENCH_C{5,6,7}_ONLY` to prevent path drift.
- New pre-flight helper: `scripts/run_mixed_observe_ssot.sh` (Route Banner + OBSERVE, single run).
- Overflow stats compile gating fixed (see above).
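The pinning idea can also be expressed inside a benchmark harness. This is a sketch only: the real fix lives in the shell script, and `pin_bench_env` and `env_equals` are hypothetical helpers. The variable names and values are the ones listed above; `setenv` is the POSIX call.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Overwrite every variable the clean-env script pins, so values inherited
 * from the caller's shell cannot change the route.
 * Returns 0 on success, -1 if any setenv call fails. */
static int pin_bench_env(void) {
    static const char *const pinned[][2] = {
        {"HAKMEM_BENCH_MIN_SIZE", "16"},
        {"HAKMEM_BENCH_MAX_SIZE", "1040"},
        {"HAKMEM_BENCH_C5_ONLY", "0"},
        {"HAKMEM_BENCH_C6_ONLY", "0"},
        {"HAKMEM_BENCH_C7_ONLY", "0"},
    };
    for (size_t i = 0; i < sizeof pinned / sizeof pinned[0]; i++) {
        if (setenv(pinned[i][0], pinned[i][1], 1) != 0) return -1;
    }
    return 0;
}

/* Returns 1 when the variable is set and equals the expected string. */
static int env_equals(const char *name, const char *expected) {
    assert(name != NULL && expected != NULL);
    const char *actual = getenv(name);
    return actual != NULL && strcmp(actual, expected) == 0;
}
```

Passing `1` as the third argument to `setenv` is the important part: it overwrites a stale value instead of keeping it, which is the drift case described in Phase 87.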
---
## Phase 89 (complete): Bottleneck Analysis & Optimization Roadmap
**Status**: ✅ **SSOT Measurement Complete** + **3 Optimization Candidates Identified**
### 4-Step SSOT Procedure Completion
**Step 1: OBSERVE Binary Preflight**
- Binary: `bench_random_mixed_hakmem_observe` (with telemetry enabled)
- Inline slots verification: ✓ PUSH TOTAL = 4.81M, POP EMPTY = 0.003% (confirmed active & healthy)
- Throughput (with telemetry): 51.52M ops/s
**Step 2: Standard 10-run Baseline**
- Binary: `bench_random_mixed_hakmem` (clean, no telemetry)
- 10-run SSOT results: **51.36M ops/s** (CV: 0.7%, very stable)
- Range: 50.74M - 51.73M
- **Decision**: This is baseline for bottleneck analysis
**Step 3: FAST PGO 10-run Comparison**
- Binary: `bench_random_mixed_hakmem_minimal_pgo` (PGO optimized)
- 10-run SSOT results: **54.16M ops/s** (CV: 1.5%, acceptable)
- Range: 52.89M - 55.13M
- **Performance Gap**: 54.16M - 51.36M = **2.80M ops/s (+5.45%)**
- This represents the optimization ceiling with current PGO profile
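The gap is a relative gain over the Standard mean. A minimal sketch of that arithmetic follows, together with the +1.0% GO threshold this file applies to A/B results; the function names are hypothetical.

```c
#include <assert.h>

/* Relative gain of treatment over baseline, in percent. */
static double relative_gain_percent(double baseline, double treatment) {
    assert(baseline > 0.0);
    return 100.0 * (treatment - baseline) / baseline;
}

/* GO when the measured gain reaches the threshold (both in percent). */
static int meets_go_threshold(double gain_percent, double threshold_percent) {
    return gain_percent >= threshold_percent;
}
```

For example, `relative_gain_percent(51.36, 54.16)` evaluates to about 5.45, matching the figure above, while a +0.38% A/B result fails `meets_go_threshold(0.38, 1.0)`.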
**Step 4: Results Captured**
- Git SHA: e4c5f0535 (master branch)
- Timestamp: 2025-12-18 23:06:01
- System: AMD Ryzen 7 5825U, 16 logical CPUs (8 cores / 16 threads), 6.8.0-87-generic kernel
- Files: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
### Perf Analysis & Top Bottleneck Identification
**Profile Run**: 40M operations (0.78s), 833 perf samples
**Top Functions by CPU Time**:
1. **free** - 27.40% (hottest)
2. main - 26.30% (benchmark loop, not optimizable)
3. **malloc** - 20.36% (hot)
4. malloc.cold - 10.65% (cold path, avoid optimizing)
5. free.cold - 5.59% (cold path, avoid optimizing)
6. **tiny_region_id_write_header** - 2.98% (hot, inlining candidate)
**malloc + free combined = 47.76% of CPU time** (already Phase 9/10/78-1/80-1 optimized)
### Top 3 Optimization Candidates (Ranked by Priority)
| Candidate | Priority | Recommendation | Expected Gain | Risk | Effort |
|-----------|----------|-----------------|----------------|------|--------|
| **tiny_region_id_write_header always_inline** | **HIGH** | **PURSUE** | +1-2% | LOW | 1-2h |
| malloc/free branch reduction | MEDIUM | DEFER | +2-3% | MEDIUM | 20-40h |
| Cold-path optimization | LOW | **AVOID** | +1% | HIGH | 10-20h |
**Candidate 1: tiny_region_id_write_header always_inline (2.98% CPU)**
- Current: Selective inlining from `core/region_id_v6.c`
- Proposal: Force `always_inline` for hot-path call sites
- **Layout Impact**: MINIMAL (no code bulk, maintains I-cache discipline)
- **Recommendation**: YES - PURSUE
- Estimated timeline: Phase 90
- Implementation: 1-2 lines, add `__attribute__((always_inline))` wrapper
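What a forced-inline wrapper looks like can be sketched as follows. This assumes GCC or Clang; the real signature of `tiny_region_id_write_header` is not shown in these notes, so `demo_write_header` and `demo_alloc_finish` are hypothetical stand-ins with an invented one-byte header layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a header write: stores a 1-byte class tag at the
 * block base and returns the user pointer just past it. The attribute asks
 * GCC/Clang to expand the body at every call site. */
static inline __attribute__((always_inline))
void *demo_write_header(void *base, uint8_t class_idx) {
    uint8_t *p = (uint8_t *)base;
    *p = class_idx;
    return p + 1;
}

/* Hot-path caller: with always_inline, no call instruction is emitted here. */
static void *demo_alloc_finish(void *base, uint8_t class_idx) {
    assert(base != NULL);
    return demo_write_header(base, class_idx);
}
```

Whether this pays off depends on code size at the call sites, so the +1-2% estimate above would still need a same-binary A/B run to confirm.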
**Candidate 2: malloc/free branch reduction (47.76% CPU)**
- Current: Phase 9/10/78-1/80-1/83-1 already optimized
- Observation: 56.4M branch-misses (branch prediction pressure)
- Proposal: Pre-compute routing tables (like Phase 85 approach)
- **Risk**: Code bloat, potential layout tax regression (Phase 85 was NO-GO)
- **Recommendation**: DEFER
- Wait for workload characteristics that justify complexity
  - Gains from the current approach appear to have saturated
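For reference, a pre-computed routing table in its simplest form looks like the sketch below. It is illustrative only and is not the Phase 85 implementation, which these notes record as NO-GO; the class count, route names, and mask encoding are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_ROUTE_LEGACY = 0, DEMO_ROUTE_INLINE_SLOTS = 1 };
enum { DEMO_NUM_CLASSES = 8 }; /* assumed class count */

static uint8_t demo_route_table[DEMO_NUM_CLASSES];

/* Fill the table once at init: bit i of the mask set means class i
 * is served by inline slots, otherwise by the legacy path. */
static void demo_route_table_init(uint32_t inline_slots_mask) {
    for (unsigned c = 0; c < DEMO_NUM_CLASSES; c++) {
        demo_route_table[c] = ((inline_slots_mask >> c) & 1u)
                                  ? DEMO_ROUTE_INLINE_SLOTS
                                  : DEMO_ROUTE_LEGACY;
    }
}

/* Hot path: one indexed load replaces a chain of per-class branches. */
static inline uint8_t demo_route_for_class(unsigned class_idx) {
    assert(class_idx < DEMO_NUM_CLASSES);
    return demo_route_table[class_idx];
}
```

The trade-off named above still applies: the table removes branches but adds a data dependency and extra code, which is where the layout tax risk comes from.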
---
## Phase 91 (closed: NEUTRAL / frozen)
**Status**: ⚪ **NEUTRAL** (C6 IFL: +0.38% / 10-run) → kept with default OFF
- Goal: replace the C6 inline slots FIFO with an intrusive LIFO to cut the fixed tax
- Results (SSOT 10-run):
  - Control (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0`): mean 52.05M
  - Treatment (`HAKMEM_TINY_C6_INLINE_SLOTS_IFL=1`): mean 52.25M
  - Δ **+0.38%** (below the +1.0% GO threshold)
- Verdict: **frozen (research box)**
- No regression, but the ROI is too small to extend to C5/C4
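An intrusive LIFO keeps the link inside the freed block itself, so no separate ring indices are needed. The sketch below shows the general technique only; it is not the code in `core/tiny_c6_inline_slots_ifl.c`, and the `DemoLifo` type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread stack of free blocks; the next pointer lives in the block. */
typedef struct DemoLifo {
    void  *head;
    size_t count;
} DemoLifo;

/* Push: store the old head in the first word of the freed block.
 * The block must be at least sizeof(void *) bytes and pointer-aligned. */
static void demo_lifo_push(DemoLifo *l, void *block) {
    assert(l != NULL && block != NULL);
    *(void **)block = l->head;
    l->head = block;
    l->count++;
}

/* Pop: return the most recently pushed block, or NULL when empty. */
static void *demo_lifo_pop(DemoLifo *l) {
    assert(l != NULL);
    void *block = l->head;
    if (block == NULL) return NULL;
    l->head = *(void **)block;
    l->count--;
    return block;
}
```

The most recently freed block is reused first, which tends to be cache-warm. That is the expected source of gain, and it was small in the measurement above.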
---
## Phase 92 (planned start)
**Status**: 🔍 **Planning the next phase**
**Goal**: quickly classify the cause of the tcmalloc performance gap (hakmem: 52M vs tcmalloc: 58M, -12.8%)
**Planned steps**:
1. Case A: small vs large object split test (C6-only vs C7-only)
2. Case B: Inline Slots vs Unified Cache split test
3. Case C: LIFO vs FIFO comparison
4. Case D: pool size sensitivity test
**Duration**: 1-2h (short triage)
**Output**: identify the primary bottleneck → select the next candidate
**References**:
- Triage Plan: `docs/analysis/PHASE92_TCMALLOC_GAP_TRIAGE_SSOT.md`
---
**Candidate 3: Cold-path de-duplication (16.24% CPU)**
- Current: malloc.cold (10.65%) + free.cold (5.59%) explicitly separated
- Rationale: Separation improves hot-path I-cache utilization
- **Recommendation**: AVOID
  - Aligns with the user's "layout tax avoidance" principle
- Optimizing cold paths would ADD code to hot path (violates design)
### Key Performance Insights
**FAST PGO vs Standard (+5.45%) breakdown**:
- PGO branch prediction optimization: ~3%
- Code layout optimization: ~2%
- Inlining decisions: ~0.5%
**Conclusion**: Standard build limited by branch prediction pressure; further gains require architectural tradeoffs.
**Inline Slots Health**: Working perfectly - 0.003% overflow rate confirms no bottleneck
### References & Artifacts
- SSOT Measurement: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- Bottleneck Analysis: `docs/analysis/PHASE89_BOTTLENECK_ANALYSIS.md`
- Perf Stats: `docs/analysis/PHASE89_PERF_STAT.txt`
- Scripts: `scripts/run_mixed_10_cleanenv.sh`, `scripts/run_mixed_observe_ssot.sh`
---
## Phase 86 (closed): NO-GO
**Status**: ❌ NO-GO (+0.25% improvement, threshold: +1.0%)
@@ -19,16 +207,16 @@
## 0) Current source of truth (SSOT)
- **Source of truth for performance comparison**: FAST PGO build (`make pgo-fast-full`, `bench_random_mixed_hakmem_minimal_pgo`), **WarmPool=16**
- Phase 75 (C5/C6 inline slots) has been promoted to presets
- Phase 75-4 ran a FAST PGO rebase and confirmed **C5+C6=ON at +3.16% (GO)** (however, the **FAST PGO baseline itself is suspected to have regressed substantially since Phase 69** → PGO regeneration is needed in Phase 75-5)
- **Source of truth for safety/compatibility**: Standard build (`make bench_random_mixed_hakmem`)
- **Source of truth for observation**: OBSERVE build (`make perf_observe`)
- **Scorecard (targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
- **FAST baseline (SSOT)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` is authoritative (Phase 69: 62.63M ops/s = 51.77% of mimalloc)
- **Phase 75 measurement (Standard)**: `bench_random_mixed_hakmem`, **A/B +5.41%** confirmed (Phase 75-3 4-point matrix)
- **Phase 75 measurement (FAST PGO)**: `bench_random_mixed_hakmem_minimal_pgo`, **A/B +3.16%** confirmed (Phase 75-4 4-point matrix)
- Next target: **M2 = 55%** (judge the gap against the FAST baseline)
- **Current SSOT (Phase 89 capture / Git SHA: e4c5f0535)**:
  - Standard (`./bench_random_mixed_hakmem`) 10-run mean: **51.36M ops/s** (CV ~0.7%)
  - FAST PGO minimal (`./bench_random_mixed_hakmem_minimal_pgo`) 10-run mean: **54.16M ops/s** (CV ~1.5% / +5.45% vs Standard)
  - OBSERVE (`./bench_random_mixed_hakmem_observe`): 51.52M ops/s (includes telemetry; not a source of truth for performance comparison)
  - SSOT capture: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
- **Source of truth for optimization decisions**: same-binary A/B (ENV toggle), `scripts/run_mixed_10_cleanenv.sh`
- **Source of truth for mimalloc/tcmalloc references**: reference runs (separate binary / LD_PRELOAD), `docs/analysis/ALLOCATOR_COMPARISON_SSOT.md`
- **Scorecard (source of truth for targets / current values)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md` (already reflects the Phase 89 SSOT as the current snapshot)
- Phase 66/68/69 (60M to 62M range) are **historical** (do not compare them directly against the current HEAD; take a rebase first if a comparison is needed)
- **Next phase (design review)**: `docs/analysis/PHASE90_STRUCTURAL_REVIEW_AND_GAP_TRIAGE_SSOT.md`
- **Mixed 10-run SSOT (harness)**: `scripts/run_mixed_10_cleanenv.sh`
  - Default is `BENCH_BIN=./bench_random_mixed_hakmem` (Standard)
  - For FAST PGO, set `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo` explicitly
@@ -86,6 +274,32 @@
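- Results: `docs/analysis/PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md`
- Cause: the lazy-init pattern is already optimized (per-op overhead is minimal) → the ROI of fixed mode is negligible
## 2a) Next major direction (design order, SSOT)
Goal: even when "mimalloc/tcmalloc are too strong", aim for **+5-10%** without breaking Box Theory (one boundary, reversible, minimal visibility, fail-fast).
Priority order (taking cues from the core of Google's TCMalloc):
1. **Batch the ThreadCache overflow (top priority)**
   - When the inline slots (C4/C5/C6) fill up, move the overflow to the cold side in batches instead of one block at a time
   - Keep the conversion point fixed in one place (flush/drain)
2. **Batch push/pop on the Central/Shared side (second priority)**
   - Batch the merge into shared/remote to reduce the number of lock/atomic operations
3. **Memory return / footprint policy (operational axis)**
   - Establish an SSOT for where Balanced/Lean win (syscall / RSS drift / tail) while pushing only as far as speed does not regress
Important: this is currently the stage of deciding the "core of the design". Implement only after measurement confirms that **overflow frequency is high enough**.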
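The first priority, draining a full ring in batches through a single conversion point, can be sketched as follows. This is a design illustration under assumptions, not hakmem code: the ring capacity, batch size, and the `demo_spill_fn` callback are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_RING_CAP = 8, DEMO_DRAIN_BATCH = 4 }; /* assumed sizes */

typedef struct {
    void  *slots[DEMO_RING_CAP];
    size_t head;  /* index of the oldest entry */
    size_t count; /* number of live entries */
} DemoRing;

typedef struct { size_t calls; size_t blocks; } DemoSpillStats;
typedef void (*demo_spill_fn)(void **blocks, size_t n, void *ctx);

/* Example spill target: only counts how often and how much was spilled. */
static void demo_count_spill(void **blocks, size_t n, void *ctx) {
    (void)blocks;
    DemoSpillStats *s = (DemoSpillStats *)ctx;
    s->calls++;
    s->blocks += n;
}

/* Single conversion point: hand up to DEMO_DRAIN_BATCH of the oldest
 * entries to the cold side in one call. Returns the number drained. */
static size_t demo_ring_drain(DemoRing *r, demo_spill_fn spill, void *ctx) {
    void *batch[DEMO_DRAIN_BATCH];
    size_t n = r->count < DEMO_DRAIN_BATCH ? r->count : DEMO_DRAIN_BATCH;
    for (size_t i = 0; i < n; i++) {
        batch[i] = r->slots[(r->head + i) % DEMO_RING_CAP];
    }
    r->head = (r->head + n) % DEMO_RING_CAP;
    r->count -= n;
    if (n > 0) spill(batch, n, ctx);
    return n;
}

/* Push: on a full ring, drain one batch first, then store the block. */
static void demo_ring_push(DemoRing *r, void *block, demo_spill_fn spill, void *ctx) {
    assert(r != NULL && spill != NULL);
    if (r->count == DEMO_RING_CAP) demo_ring_drain(r, spill, ctx);
    r->slots[(r->head + r->count) % DEMO_RING_CAP] = block;
    r->count++;
}
```

Pushing nine blocks into the eight-slot ring triggers exactly one spill call carrying four blocks, instead of one cold-path call per overflowing block. The benefit scales with how often the ring actually fills, which is why the measurement precondition above matters.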
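## 2b) Next work (on hold)
Wait until the processing that the user delegated to another agent (Claude Code) completes.
Checks to start after completion (the two needed at minimum):
- **Measure the inline slots overflow rate** (C4/C5/C6 FULL/overflow counts and ratios)
- **Quantify the cost of the overflow destination** (perf stat / perf report for the functions reached on overflow)
Once both are available, proceed to Phase 86 (Overflow batch design).
## 3) Operating rules (Box Theory + layout tax mitigation)
- Every change must be stacked as **a box + one boundary + revertible via ENV** (fail-fast, minimal visibility)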
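One common shape for the "revertible via ENV" part is a lazily read gate. The sketch below is an assumption about the pattern, not the project's env box code: `demo_parse_gate` and `demo_ifl_enabled` are hypothetical, the default-OFF parsing rule is assumed, and the cache is not thread-safe. Only the variable name `HAKMEM_TINY_C6_INLINE_SLOTS_IFL` comes from these notes.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = off, 1 = on (not thread-safe in this sketch) */
static int demo_gate_cached = -1;

/* Enabled only when the value starts with '1'; unset means OFF. */
static int demo_parse_gate(const char *value) {
    return value != NULL && value[0] == '1';
}

/* Read the environment once, then answer from the cached value so the
 * per-operation cost is a single load and compare. */
static int demo_ifl_enabled(void) {
    if (demo_gate_cached < 0) {
        demo_gate_cached = demo_parse_gate(getenv("HAKMEM_TINY_C6_INLINE_SLOTS_IFL"));
    }
    assert(demo_gate_cached == 0 || demo_gate_cached == 1);
    return demo_gate_cached;
}
```

Because the value is cached on first use, changing the variable later in the same process has no effect. That is the behavior wanted for same-binary A/B runs, where each run starts a fresh process.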

@@ -232,6 +232,17 @@ CFLAGS += -DHAKMEM_TINY_CLASS5_FIXED_REFILL=1
CFLAGS_SHARED += -DHAKMEM_TINY_CLASS5_FIXED_REFILL=1
endif
# Phase 91: C6 Intrusive LIFO Inline Slots (Per-class LIFO transformation)
# Purpose: Replace FIFO ring with intrusive LIFO to reduce per-operation metadata overhead
# Enable: make BOX_TINY_C6_INLINE_SLOTS_IFL=1
# Expected: +1-2% throughput improvement (C6 only, 57% coverage)
# Default: ON at build time (research box; reversible at runtime via ENV gate HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0)
BOX_TINY_C6_INLINE_SLOTS_IFL ?= 1
ifeq ($(BOX_TINY_C6_INLINE_SLOTS_IFL),1)
CFLAGS += -DHAKMEM_BOX_TINY_C6_INLINE_SLOTS_IFL=1
CFLAGS_SHARED += -DHAKMEM_BOX_TINY_C6_INLINE_SLOTS_IFL=1
endif
# Phase 3 (2025-11-29): mincore removed entirely
# - mincore() syscall overhead eliminated (was +10.3% with DISABLE flag)
# - Phase 1b/2 registry-based validation provides sufficient safety
@@ -253,7 +264,7 @@ LDFLAGS += $(EXTRA_LDFLAGS)
# Targets
TARGET = test_hakmem
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o core/box/ss_release_policy_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o 
core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/box/tiny_header_hotfull_env_box.o core/box/tiny_inline_slots_fixed_mode_box.o core/box/tiny_inline_slots_switch_dispatch_fixed_box.o core/box/free_path_commit_once_fixed_box.o core/box/free_path_legacy_mask_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/tiny_c6_inline_slots.o core/tiny_c5_inline_slots.o core/tiny_c2_local_cache.o core/tiny_c3_inline_slots.o core/tiny_c4_inline_slots.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o core/box/small_policy_snapshot_tls_box.o
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o core/box/ss_release_policy_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o 
core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/box/tiny_header_hotfull_env_box.o core/box/tiny_inline_slots_fixed_mode_box.o core/box/tiny_inline_slots_switch_dispatch_fixed_box.o core/box/free_path_commit_once_fixed_box.o core/box/free_path_legacy_mask_box.o core/box/tiny_inline_slots_overflow_stats_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/tiny_c6_inline_slots.o core/tiny_c6_inline_slots_ifl.o core/tiny_c5_inline_slots.o core/tiny_c2_local_cache.o core/tiny_c3_inline_slots.o core/tiny_c4_inline_slots.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o core/box/small_policy_snapshot_tls_box.o
OBJS = $(OBJS_BASE)
# Shared library
@@ -287,7 +298,7 @@ endif
# Benchmark targets
BENCH_HAKMEM = bench_allocators_hakmem
BENCH_SYSTEM = bench_allocators_system
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o core/box/ss_release_policy_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o 
core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/fastlane_direct_env_box.o core/box/tiny_inline_slots_fixed_mode_box.o core/box/tiny_inline_slots_switch_dispatch_fixed_box.o core/box/free_path_commit_once_fixed_box.o core/box/free_path_legacy_mask_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/tiny_c6_inline_slots.o core/tiny_c5_inline_slots.o core/tiny_c2_local_cache.o core/tiny_c3_inline_slots.o core/tiny_c4_inline_slots.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o core/box/small_policy_snapshot_tls_box.o bench_allocators_hakmem.o
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o core/box/ss_release_policy_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o 
core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/fastlane_direct_env_box.o core/box/tiny_inline_slots_fixed_mode_box.o core/box/tiny_inline_slots_switch_dispatch_fixed_box.o core/box/free_path_commit_once_fixed_box.o core/box/free_path_legacy_mask_box.o core/box/tiny_inline_slots_overflow_stats_box.o core/page_arena.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/tiny_c6_inline_slots.o core/tiny_c6_inline_slots_ifl.o core/tiny_c5_inline_slots.o core/tiny_c2_local_cache.o core/tiny_c3_inline_slots.o core/tiny_c4_inline_slots.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o core/box/small_policy_snapshot_tls_box.o bench_allocators_hakmem.o
BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1)
BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
@@ -464,7 +475,7 @@ test-box-refactor: box-refactor
./larson_hakmem 10 8 128 1024 1 12345 4
# Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o core/box/ss_release_policy_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o 
core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/box/tiny_header_hotfull_env_box.o core/box/tiny_inline_slots_fixed_mode_box.o core/box/tiny_inline_slots_switch_dispatch_fixed_box.o core/box/free_path_commit_once_fixed_box.o core/box/free_path_legacy_mask_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/tiny_c6_inline_slots.o core/tiny_c5_inline_slots.o core/tiny_c2_local_cache.o core/tiny_c3_inline_slots.o core/tiny_c4_inline_slots.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o core/box/small_policy_snapshot_tls_box.o
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o core/box/ss_release_policy_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/ss_pt_impl.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/free_front_v3_env_box.o core/box/free_path_stats_box.o core/box/free_dispatch_stats_box.o core/box/free_cold_shape_env_box.o core/box/free_cold_shape_stats_box.o core/box/alloc_gate_stats_box.o core/box/tiny_c6_ultra_free_box.o core/box/tiny_c5_ultra_free_box.o core/box/tiny_c4_ultra_free_box.o core/box/tiny_ultra_tls_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/tiny_static_route_box.o core/box/tiny_metadata_cache_hot_box.o core/box/wrapper_env_box.o core/box/free_wrapper_env_snapshot_box.o core/box/malloc_wrapper_env_snapshot_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/box/tiny_free_route_cache_env_box.o core/box/hakmem_env_snapshot_box.o core/box/tiny_c7_preserve_header_env_box.o core/box/tiny_tcache_env_box.o 
core/box/tiny_unified_lifo_env_box.o core/box/front_fastlane_alloc_legacy_direct_env_box.o core/box/fastlane_direct_env_box.o core/box/tiny_header_hotfull_env_box.o core/box/tiny_inline_slots_fixed_mode_box.o core/box/tiny_inline_slots_switch_dispatch_fixed_box.o core/box/free_path_commit_once_fixed_box.o core/box/free_path_legacy_mask_box.o core/box/tiny_inline_slots_overflow_stats_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/tiny_c7_ultra_segment.o core/tiny_c7_ultra.o core/tiny_c6_inline_slots.o core/tiny_c6_inline_slots_ifl.o core/tiny_c5_inline_slots.o core/tiny_c2_local_cache.o core/tiny_c3_inline_slots.o core/tiny_c4_inline_slots.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o core/smallobject_hotbox_v5.o core/smallsegment_v5.o core/smallobject_cold_iface_v5.o core/smallsegment_v6.o core/smallobject_cold_iface_v6.o core/smallobject_core_v6.o core/region_id_v6.o core/smallsegment_v7.o core/smallobject_cold_iface_v7.o core/mid_hotbox_v3.o core/smallobject_policy_v7.o core/smallobject_segment_mid_v3.o core/smallobject_cold_iface_mid_v3.o core/smallobject_stats_mid_v3.o core/smallobject_learner_v2.o core/smallobject_mid_v35.o core/box/small_policy_snapshot_tls_box.o
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
ifeq ($(POOL_TLS_PHASE1),1)
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
@ -714,14 +725,23 @@ pgo-fast-build:
@echo "========================================="
@echo "Phase 66: Building PGO-Optimized Binary (FAST minimal)"
@echo "========================================="
@if [ -x bench_random_mixed_hakmem ]; then mv bench_random_mixed_hakmem bench_random_mixed_hakmem.standard_saved; fi
$(MAKE) clean
$(MAKE) PROFILE_USE=1 bench_random_mixed_hakmem EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1'
mv bench_random_mixed_hakmem bench_random_mixed_hakmem_minimal_pgo
@if [ -x bench_random_mixed_hakmem.standard_saved ]; then mv bench_random_mixed_hakmem.standard_saved bench_random_mixed_hakmem; fi
@echo ""
@echo "✓ PGO-optimized FAST minimal binary built: bench_random_mixed_hakmem_minimal_pgo"
@echo "Next: BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh"
@echo ""
pgo-fast-bin: pgo-fast-build
# Convenience alias (SSOT runner expects this name to be buildable).
# Usage: make bench_random_mixed_hakmem_minimal_pgo
.PHONY: bench_random_mixed_hakmem_minimal_pgo
bench_random_mixed_hakmem_minimal_pgo: pgo-fast-build
pgo-fast-full: pgo-fast-profile pgo-fast-collect pgo-fast-build
@echo "========================================="
@echo "Phase 66: PGO Full Workflow Complete (FAST minimal)"
@ -734,9 +754,11 @@ pgo-fast-full: pgo-fast-profile pgo-fast-collect pgo-fast-build
# Purpose: FAST build with compile-time fixed front config (phase 47 A/B test)
.PHONY: bench_random_mixed_hakmem_fast_pgo
bench_random_mixed_hakmem_fast_pgo:
@if [ -x bench_random_mixed_hakmem ]; then mv bench_random_mixed_hakmem bench_random_mixed_hakmem.standard_saved; fi
$(MAKE) clean
$(MAKE) bench_random_mixed_hakmem EXTRA_CFLAGS='-DHAKMEM_BENCH_MINIMAL=1 -DHAKMEM_TINY_FRONT_PGO=1'
mv bench_random_mixed_hakmem bench_random_mixed_hakmem_fast_pgo
@if [ -x bench_random_mixed_hakmem.standard_saved ]; then mv bench_random_mixed_hakmem.standard_saved bench_random_mixed_hakmem; fi
# Phase 35-B: OBSERVE target (enables diagnostic counters for behavior observation)
# Usage: make bench_random_mixed_hakmem_observe
@ -744,9 +766,11 @@ bench_random_mixed_hakmem_fast_pgo:
# Purpose: Behavior observation & debugging (OBSERVE build)
.PHONY: bench_random_mixed_hakmem_observe
bench_random_mixed_hakmem_observe:
@if [ -x bench_random_mixed_hakmem ]; then mv bench_random_mixed_hakmem bench_random_mixed_hakmem.standard_saved; fi
$(MAKE) clean
$(MAKE) bench_random_mixed_hakmem EXTRA_CFLAGS='-DHAKMEM_TINY_CLASS_STATS_COMPILED=1 -DHAKMEM_TINY_FREE_STATS_COMPILED=1 -DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1 -DHAKMEM_TINY_FREE_TRACE_COMPILED=1'
$(MAKE) bench_random_mixed_hakmem EXTRA_CFLAGS='-DHAKMEM_TINY_CLASS_STATS_COMPILED=1 -DHAKMEM_TINY_FREE_STATS_COMPILED=1 -DHAKMEM_UNIFIED_CACHE_STATS_COMPILED=1 -DHAKMEM_TINY_FREE_TRACE_COMPILED=1 -DHAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED=1'
mv bench_random_mixed_hakmem bench_random_mixed_hakmem_observe
@if [ -x bench_random_mixed_hakmem.standard_saved ]; then mv bench_random_mixed_hakmem.standard_saved bench_random_mixed_hakmem; fi
# Phase 38: Automated perf workflow targets
# Usage: make perf_fast - Build FAST binary and run 10-run benchmark

@ -28,6 +28,7 @@
#include "core/box/ss_stats_box.h"
#include "core/box/warm_pool_rel_counters_box.h"
#include "core/box/tiny_mem_stats_box.h"
#include "core/box/tiny_inline_slots_overflow_stats_box.h"
// Box BenchMeta: Benchmark metadata management (bypass hakmem wrapper)
// Phase 15: Separate BenchMeta (slots array) from CoreAlloc (user workload)
@ -423,5 +424,10 @@ int main(int argc, char** argv){
#endif
#endif
// Phase 87: Print overflow statistics
#ifdef USE_HAKMEM
tiny_inline_slots_overflow_report_stats();
#endif
return 0;
}

@ -19,6 +19,7 @@
#include "box/tiny_inline_slots_fixed_mode_box.h" // tiny_inline_slots_fixed_mode_refresh_from_env (Phase 78-1)
#include "box/free_path_commit_once_fixed_box.h" // free_path_commit_once_refresh_from_env (Phase 85)
#include "box/free_path_legacy_mask_box.h" // free_path_legacy_mask_refresh_from_env (Phase 86)
#include "box/tiny_c6_inline_slots_ifl_env_box.h" // tiny_c6_inline_slots_ifl_refresh_from_env (Phase 91)
#endif
// Set the default value only when the env var is not already set
@ -241,5 +242,7 @@ static inline void bench_apply_profile(void) {
free_path_commit_once_refresh_from_env();
// Phase 86: Optionally use legacy mask for early exit (no indirect calls, just bit test).
free_path_legacy_mask_refresh_from_env();
// Phase 91: C6 intrusive LIFO inline slots (per-class LIFO transformation).
tiny_c6_inline_slots_ifl_refresh_from_env();
#endif
}
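
The "default only when unset" rule noted in the profile header can be sketched with POSIX `setenv` and `overwrite=0`. This is a minimal illustration, not the project's code: `demo_env_default`, `demo_env_is`, and the variable name `DEMO_HAKMEM_FLAG` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes setenv() under strict -std modes */
#endif
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: install `value` only if `name` is not already set.
 * setenv() with overwrite=0 leaves an existing value untouched. */
static int demo_env_default(const char* name, const char* value) {
    return setenv(name, value, 0);
}

/* Hypothetical helper: true when `name` is set and equals `expected`. */
static int demo_env_is(const char* name, const char* expected) {
    const char* v = getenv(name);
    return v != NULL && strcmp(v, expected) == 0;
}
```

With this shape, a value exported by the caller (for example from a runner script) always wins over the profile default, which is one way to keep benchmark settings explicit.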

@ -0,0 +1,47 @@
// tiny_c6_inline_slots_ifl_env_box.h - Phase 91: C6 Intrusive LIFO Inline Slots ENV Gate
//
// Goal: Runtime ENV gate for C6-only intrusive LIFO inline slots optimization
// Scope: C6 class only (FIFO ring → intrusive LIFO transformation)
// Default: OFF (research box, ENV=0)
//
// ENV Variables:
// HAKMEM_TINY_C6_INLINE_SLOTS_IFL=0/1 (default: 0, OFF)
// HAKMEM_TINY_C6_IFL_STRICT=0/1 (LARSON_FIX safety check)
//
// Design:
// - Extern refresh function called from bench_profile.h (fixed mode pattern)
// - Thread-safe initialization via refresh_all_env_caches()
// - Fail-fast on LARSON_FIX + IFL conflict
//
// Phase 91: C6-only intrusive LIFO (replaces FIFO ring)
// Phase 91+: C5, C4 expansion if C6 GO
#ifndef HAK_BOX_TINY_C6_INLINE_SLOTS_IFL_ENV_BOX_H
#define HAK_BOX_TINY_C6_INLINE_SLOTS_IFL_ENV_BOX_H
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "../hakmem_build_flags.h"
// ============================================================================
// ENV Gate: C6 Intrusive LIFO Inline Slots
// ============================================================================
extern uint8_t g_tiny_c6_inline_slots_ifl_enabled;
extern uint8_t g_tiny_c6_inline_slots_ifl_strict;
// Refresh ENV variables (called from bench_profile.h::refresh_all_env_caches)
void tiny_c6_inline_slots_ifl_refresh_from_env(void);
// Check if C6 inline slots IFL are enabled (cached by refresh function)
static inline int tiny_c6_inline_slots_ifl_enabled(void) {
return g_tiny_c6_inline_slots_ifl_enabled;
}
// Fast-path version (identical to the accessor above; kept for naming consistency with the other boxes)
static inline int tiny_c6_inline_slots_ifl_enabled_fast(void) {
return g_tiny_c6_inline_slots_ifl_enabled;
}
#endif // HAK_BOX_TINY_C6_INLINE_SLOTS_IFL_ENV_BOX_H
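
The header above only declares `tiny_c6_inline_slots_ifl_refresh_from_env()`; its definition is not part of this diff. A minimal sketch of the cached-gate pattern, assuming the flag is treated as ON only for the exact string "1", might look like the following. All `demo_*` names are hypothetical and the parsing rule is an assumption.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cached gate, refreshed once and then read on the hot path. */
static uint8_t demo_ifl_enabled;

/* Assumed parsing rule: only the exact string "1" enables the gate;
 * unset, empty, "0", or anything else leaves it OFF. */
static uint8_t demo_parse_flag01(const char* v) {
    return (uint8_t)(v != NULL && v[0] == '1' && v[1] == '\0');
}

/* Reads the ENV variable named in the header comment and caches the result. */
static void demo_ifl_refresh_from_env(void) {
    demo_ifl_enabled = demo_parse_flag01(getenv("HAKMEM_TINY_C6_INLINE_SLOTS_IFL"));
}

/* Hot-path check: a single byte load, no getenv() call. */
static inline int demo_ifl_enabled_fast(void) {
    return demo_ifl_enabled;
}
```

Caching the parsed value keeps `getenv` off the allocation and free paths; the cost is that a change to the environment only takes effect at the next refresh.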

@ -0,0 +1,85 @@
// tiny_c6_inline_slots_ifl_tls_box.h - Phase 91: C6 Intrusive LIFO TLS State & Wrappers
//
// Goal: Thread-local state for C6 intrusive LIFO inline slots + inline push/pop wrappers
// Scope: Per-thread LIFO head pointer, count, enabled flag
// Integration: Thin wrapper over tiny_c6_intrusive_freelist_box.h (c6_ifl_*)
//
// TLS State:
// - head: LIFO stack pointer (intrusive, embedded next in freed objects)
// - count: Current entries (drain triggered at count > 128)
// - enabled: Cached flag from tiny_c6_inline_slots_ifl_env_box.h
//
// Phase 91: C6-only IFL implementation
// Phase 91+: C5, C4 expansion via similar pattern
#ifndef HAK_BOX_TINY_C6_INLINE_SLOTS_IFL_TLS_BOX_H
#define HAK_BOX_TINY_C6_INLINE_SLOTS_IFL_TLS_BOX_H
#include <stdbool.h>
#include <stdint.h>
#include "../tiny_nextptr.h"
#include "tiny_c6_intrusive_freelist_box.h"
// ============================================================================
// TLS State Structure
// ============================================================================
struct TinyC6InlineSlotsIFL {
void* head; // LIFO stack pointer (intrusive next embedded)
uint16_t count; // Current entry count
uint8_t enabled; // Cached flag from ENV gate
};
// ============================================================================
// TLS Variable (defined in core/tiny_c6_inline_slots_ifl.c)
// ============================================================================
extern __thread struct TinyC6InlineSlotsIFL g_tiny_c6_inline_slots_ifl;
// ============================================================================
// Fast-Path Inline Accessors
// ============================================================================
// Push object to C6 LIFO (intrusive)
// Returns: true if push succeeded, false if disabled
static inline bool tiny_c6_inline_slots_ifl_push_fast(void* ptr) {
if (!g_tiny_c6_inline_slots_ifl.enabled) {
return false;
}
// Push to intrusive LIFO head (delegates to c6_ifl_push)
c6_ifl_push(&g_tiny_c6_inline_slots_ifl.head, ptr);
g_tiny_c6_inline_slots_ifl.count++;
// Overflow: count > 128 triggers drain (handled by caller)
return true;
}
// Pop object from C6 LIFO (intrusive)
// Returns: pointer to freed object, or NULL if empty/disabled
static inline void* tiny_c6_inline_slots_ifl_pop_fast(void) {
if (!g_tiny_c6_inline_slots_ifl.enabled || g_tiny_c6_inline_slots_ifl.count == 0) {
return NULL;
}
// Pop from intrusive LIFO head (delegates to c6_ifl_pop)
void* ptr = c6_ifl_pop(&g_tiny_c6_inline_slots_ifl.head);
if (ptr != NULL) {
g_tiny_c6_inline_slots_ifl.count--;
}
return ptr;
}
// Check availability
static inline bool tiny_c6_inline_slots_ifl_available(void) {
return g_tiny_c6_inline_slots_ifl.enabled && g_tiny_c6_inline_slots_ifl.count > 0;
}
// ============================================================================
// Overflow Handler (declared, defined in core/tiny_c6_inline_slots_ifl.c)
// ============================================================================
void tiny_c6_inline_slots_ifl_drain_to_unified(void);
#endif // HAK_BOX_TINY_C6_INLINE_SLOTS_IFL_TLS_BOX_H
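
The wrappers above delegate to `c6_ifl_push` / `c6_ifl_pop`, which are not shown in this diff. As a rough sketch of what an intrusive LIFO does, assuming the next pointer is stored at offset 0 of each freed block (the real layout comes from `tiny_nextptr.h` and may differ), consider the following. The `demo_*` names and the threshold macro are hypothetical; the value 128 mirrors the header comment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_IFL_DRAIN_THRESHOLD 128u  /* mirrors "count > 128" in the comment */

typedef struct {
    void*    head;   /* most recently freed block, or NULL when empty */
    uint16_t count;  /* number of blocks currently linked */
} demo_ifl_t;

/* Push: store the old head inside the freed block, then make it the head.
 * memcpy avoids alignment and aliasing assumptions about the block. */
static inline void demo_ifl_push(demo_ifl_t* s, void* block) {
    memcpy(block, &s->head, sizeof(void*));
    s->head = block;
    s->count++;
}

/* Pop: return the head and restore the next pointer saved inside it. */
static inline void* demo_ifl_pop(demo_ifl_t* s) {
    void* block = s->head;
    if (block == NULL) return NULL;
    memcpy(&s->head, block, sizeof(void*));
    s->count--;
    return block;
}

/* Caller-side check for the drain condition described in the header. */
static inline int demo_ifl_needs_drain(const demo_ifl_t* s) {
    return s->count > DEMO_IFL_DRAIN_THRESHOLD;
}
```

Because the links live inside the freed objects, the per-thread state stays at one pointer plus a counter, and the most recently freed block is the first one reused.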

@ -44,6 +44,8 @@
#include "tiny_inline_slots_fixed_mode_box.h" // Phase 78-1: Optional fixed-mode gating
#include "tiny_inline_slots_switch_dispatch_box.h" // Phase 80-1: Switch dispatch for C4/C5/C6
#include "tiny_inline_slots_switch_dispatch_fixed_box.h" // Phase 83-1: Switch dispatch fixed mode
#include "tiny_c6_inline_slots_ifl_env_box.h" // Phase 91: C6 intrusive LIFO inline slots ENV gate
#include "tiny_c6_inline_slots_ifl_tls_box.h" // Phase 91: C6 intrusive LIFO inline slots TLS state
// ============================================================================
// Branch Prediction Macros (Pointer Safety - Prediction Hints)
@ -156,6 +158,19 @@ static inline void* tiny_hot_alloc_fast(int class_idx) {
}
break;
case 6:
// Phase 91: C6 Intrusive LIFO Inline Slots (check BEFORE FIFO)
if (tiny_c6_inline_slots_ifl_enabled_fast()) {
void* base = tiny_c6_inline_slots_ifl_pop_fast();
if (TINY_HOT_LIKELY(base != NULL)) {
TINY_HOT_METRICS_HIT(class_idx);
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_header_finalize_alloc(base, class_idx);
#else
return base;
#endif
}
}
// Phase 75-1: C6 Inline Slots (FIFO - fallback)
if (tiny_c6_inline_slots_enabled_fast()) {
void* base = c6_inline_pop(c6_inline_tls());
if (TINY_HOT_LIKELY(base != NULL)) {
@ -222,6 +237,21 @@ static inline void* tiny_hot_alloc_fast(int class_idx) {
// C5 inline miss → fall through to C6/unified cache
}
// Phase 91: C6 Intrusive LIFO Inline Slots early-exit (ENV gated)
// Try C6 IFL THIRD (before C6 FIFO and unified cache) for class 6
if (class_idx == 6 && tiny_c6_inline_slots_ifl_enabled_fast()) {
void* base = tiny_c6_inline_slots_ifl_pop_fast();
if (TINY_HOT_LIKELY(base != NULL)) {
TINY_HOT_METRICS_HIT(class_idx);
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_header_finalize_alloc(base, class_idx);
#else
return base;
#endif
}
// C6 IFL miss → fall through to C6 FIFO
}
// Phase 75-1: C6 Inline Slots early-exit (ENV gated)
// Try C6 inline slots THIRD (before unified cache) for class 6
if (class_idx == 6 && tiny_c6_inline_slots_enabled_fast()) {

@ -0,0 +1,153 @@
// tiny_inline_slots_overflow_stats_box.c - Phase 87: Inline Slots Overflow Telemetry
//
// Measures how often the inline-slot rings overflow and fall back to the unified_cache/legacy paths.
#include "tiny_inline_slots_overflow_stats_box.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdatomic.h>
// ============================================================================
// Global State
// ============================================================================
TinyInlineSlotsOverflowStats g_inline_slots_overflow_stats = {
.c3_push_full = 0,
.c4_push_full = 0,
.c5_push_full = 0,
.c6_push_full = 0,
.c3_pop_empty = 0,
.c4_pop_empty = 0,
.c5_pop_empty = 0,
.c6_pop_empty = 0,
.overflow_to_unified_cache = 0,
.overflow_to_legacy = 0,
};
// ============================================================================
// Refresh from ENV (called by bench_profile)
// ============================================================================
void tiny_inline_slots_overflow_refresh_from_env(void) {
// Placeholder for future ENV gating if needed
// Currently always enabled in observation builds (controlled by compile flag)
}
// ============================================================================
// Reporting
// ============================================================================
void tiny_inline_slots_overflow_report_stats(void) {
// Phase 87b: Legacy fallback counter
uint64_t legacy_fallback_calls = atomic_load(&g_inline_slots_overflow_stats.legacy_fallback_calls);
// Total push attempts (all classes)
uint64_t c3_push_total = atomic_load(&g_inline_slots_overflow_stats.c3_push_total);
uint64_t c4_push_total = atomic_load(&g_inline_slots_overflow_stats.c4_push_total);
uint64_t c5_push_total = atomic_load(&g_inline_slots_overflow_stats.c5_push_total);
uint64_t c6_push_total = atomic_load(&g_inline_slots_overflow_stats.c6_push_total);
// Total pop attempts (all classes)
uint64_t c3_pop_total = atomic_load(&g_inline_slots_overflow_stats.c3_pop_total);
uint64_t c4_pop_total = atomic_load(&g_inline_slots_overflow_stats.c4_pop_total);
uint64_t c5_pop_total = atomic_load(&g_inline_slots_overflow_stats.c5_pop_total);
uint64_t c6_pop_total = atomic_load(&g_inline_slots_overflow_stats.c6_pop_total);
// Overflow counts (ring full/empty)
uint64_t c3_push_full = atomic_load(&g_inline_slots_overflow_stats.c3_push_full);
uint64_t c4_push_full = atomic_load(&g_inline_slots_overflow_stats.c4_push_full);
uint64_t c5_push_full = atomic_load(&g_inline_slots_overflow_stats.c5_push_full);
uint64_t c6_push_full = atomic_load(&g_inline_slots_overflow_stats.c6_push_full);
uint64_t c3_pop_empty = atomic_load(&g_inline_slots_overflow_stats.c3_pop_empty);
uint64_t c4_pop_empty = atomic_load(&g_inline_slots_overflow_stats.c4_pop_empty);
uint64_t c5_pop_empty = atomic_load(&g_inline_slots_overflow_stats.c5_pop_empty);
uint64_t c6_pop_empty = atomic_load(&g_inline_slots_overflow_stats.c6_pop_empty);
uint64_t overflow_to_uc = atomic_load(&g_inline_slots_overflow_stats.overflow_to_unified_cache);
uint64_t overflow_to_legacy = atomic_load(&g_inline_slots_overflow_stats.overflow_to_legacy);
// Totals
uint64_t total_push_total = c3_push_total + c4_push_total + c5_push_total + c6_push_total;
uint64_t total_pop_total = c3_pop_total + c4_pop_total + c5_pop_total + c6_pop_total;
uint64_t total_push_full = c3_push_full + c4_push_full + c5_push_full + c6_push_full;
uint64_t total_pop_empty = c3_pop_empty + c4_pop_empty + c5_pop_empty + c6_pop_empty;
uint64_t total_overflow = overflow_to_uc + overflow_to_legacy;
fprintf(stderr, "\n");
fprintf(stderr, "=== PHASE 87: INLINE SLOTS OVERFLOW STATS ===\n");
fprintf(stderr, "\n");
fprintf(stderr, "PUSH TOTAL (Free Path Attempts - Verify inline slots called):\n");
fprintf(stderr, " C3: %10llu\n", (unsigned long long)c3_push_total);
fprintf(stderr, " C4: %10llu\n", (unsigned long long)c4_push_total);
fprintf(stderr, " C5: %10llu\n", (unsigned long long)c5_push_total);
fprintf(stderr, " C6: %10llu\n", (unsigned long long)c6_push_total);
fprintf(stderr, " TOTAL: %6llu\n", (unsigned long long)total_push_total);
fprintf(stderr, "\n");
fprintf(stderr, "PUSH FULL (Free Path Ring Overflow):\n");
fprintf(stderr, " C3: %10llu", (unsigned long long)c3_push_full);
if (c3_push_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c3_push_full / c3_push_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " C4: %10llu", (unsigned long long)c4_push_full);
if (c4_push_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c4_push_full / c4_push_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " C5: %10llu", (unsigned long long)c5_push_full);
if (c5_push_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c5_push_full / c5_push_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " C6: %10llu", (unsigned long long)c6_push_full);
if (c6_push_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c6_push_full / c6_push_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " TOTAL: %6llu", (unsigned long long)total_push_full);
if (total_push_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * total_push_full / total_push_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, "\n");
fprintf(stderr, "POP TOTAL (Alloc Path Attempts - Verify inline slots called):\n");
fprintf(stderr, " C3: %10llu\n", (unsigned long long)c3_pop_total);
fprintf(stderr, " C4: %10llu\n", (unsigned long long)c4_pop_total);
fprintf(stderr, " C5: %10llu\n", (unsigned long long)c5_pop_total);
fprintf(stderr, " C6: %10llu\n", (unsigned long long)c6_pop_total);
fprintf(stderr, " TOTAL: %6llu\n", (unsigned long long)total_pop_total);
fprintf(stderr, "\n");
fprintf(stderr, "POP EMPTY (Alloc Path Ring Underflow):\n");
fprintf(stderr, " C3: %10llu", (unsigned long long)c3_pop_empty);
if (c3_pop_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c3_pop_empty / c3_pop_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " C4: %10llu", (unsigned long long)c4_pop_empty);
if (c4_pop_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c4_pop_empty / c4_pop_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " C5: %10llu", (unsigned long long)c5_pop_empty);
if (c5_pop_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c5_pop_empty / c5_pop_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " C6: %10llu", (unsigned long long)c6_pop_empty);
if (c6_pop_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * c6_pop_empty / c6_pop_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, " TOTAL: %6llu", (unsigned long long)total_pop_empty);
if (total_pop_total > 0) fprintf(stderr, " (%.2f%%)\n", 100.0 * total_pop_empty / total_pop_total);
else fprintf(stderr, " (N/A)\n");
fprintf(stderr, "\n");
fprintf(stderr, "OVERFLOW DESTINATIONS:\n");
fprintf(stderr, " Unified Cache: %10llu\n", (unsigned long long)overflow_to_uc);
fprintf(stderr, " Legacy Fallback: %7llu\n", (unsigned long long)overflow_to_legacy);
fprintf(stderr, " TOTAL: %14llu\n", (unsigned long long)total_overflow);
fprintf(stderr, "\n");
fprintf(stderr, "=== PHASE 87b: CALL PATH VERIFICATION ===\n");
fprintf(stderr, "\n");
fprintf(stderr, "LEGACY FALLBACK CALLS (Free path route verification):\n");
fprintf(stderr, " tiny_legacy_fallback_free_base_with_env: %llu\n", (unsigned long long)legacy_fallback_calls);
fprintf(stderr, "\n");
fprintf(stderr, "JUDGMENT:\n");
if (legacy_fallback_calls == 0) {
fprintf(stderr, " ⚠️ [A] LEGACY fallback NOT used → Alternate free path (not expected)\n");
} else if (total_push_total == 0 && total_pop_total == 0) {
fprintf(stderr, " ⚠️ [B] LEGACY used, but C4/C5/C6 INLINE SLOTS DISABLED → enable=OFF\n");
} else if (total_push_total > 0 || total_pop_total > 0) {
fprintf(stderr, " ✓ [C] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE → Ready for Phase 88/89\n");
fprintf(stderr, " Push activity: %llu, Pop activity: %llu\n",
(unsigned long long)total_push_total, (unsigned long long)total_pop_total);
}
fprintf(stderr, "\n");
fprintf(stderr, "===========================================\n");
fprintf(stderr, "\n");
fflush(stderr);
}
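
The report above repeats the same "rate or N/A" computation for every class. A small helper, sketched here under the assumption that a negative return value is an acceptable "no attempts" marker, captures that arithmetic in one place. `demo_rate_percent` is a hypothetical name, not part of the commit.

```c
#include <stdint.h>

/* Hypothetical helper: events as a percentage of attempts.
 * Returns a negative value when there were no attempts, which a caller
 * could map to the "(N/A)" output used in the report. */
static double demo_rate_percent(uint64_t events, uint64_t attempts) {
    if (attempts == 0) return -1.0;
    return 100.0 * (double)events / (double)attempts;
}
```

For the Phase 87 numbers recorded in this commit's notes (168 empty pops out of 4,812,031), the rate is about 0.0035%. With a `%.2f` format a value that small is displayed as 0.00, so a third decimal place may be worth considering if the distinction between "rare" and "never" matters.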

@ -0,0 +1,155 @@
// tiny_inline_slots_overflow_stats_box.h - Phase 87: Inline Slots Overflow Telemetry
//
// Purpose: Measure overflow frequency for C3/C4/C5/C6 inline slots to determine
// if batch drain (Phase 88) is worth implementing.
//
// Metrics:
// - push_full: When free path TLS ring is FULL, must fall back to unified_cache/legacy
// - pop_empty: When alloc path TLS ring is EMPTY, must fetch from unified_cache/SuperSlab
// - overflow_to_uc: Fallback to unified_cache (before legacy path)
// - overflow_to_legacy: Final fallback when unified_cache also full
//
// Usage:
// - Compile-time: Only enabled in observation builds (not RELEASE) unless explicitly enabled.
// - Call tiny_inline_slots_overflow_report_stats() on exit to print summary
//
// Compile gate:
// - HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED=0/1 (default 0)
#ifndef HAK_BOX_TINY_INLINE_SLOTS_OVERFLOW_STATS_BOX_H
#define HAK_BOX_TINY_INLINE_SLOTS_OVERFLOW_STATS_BOX_H
#include <stdint.h>
#include <stdatomic.h>
// ============================================================================
// Global Counters (per-class overflow tracking)
// ============================================================================
typedef struct {
// C3/C4/C5/C6 push attempts (free path: total attempts)
_Atomic uint64_t c3_push_total;
_Atomic uint64_t c4_push_total;
_Atomic uint64_t c5_push_total;
_Atomic uint64_t c6_push_total;
// C3/C4/C5/C6 push_full (free path: TLS ring FULL)
_Atomic uint64_t c3_push_full;
_Atomic uint64_t c4_push_full;
_Atomic uint64_t c5_push_full;
_Atomic uint64_t c6_push_full;
// C3/C4/C5/C6 pop attempts (alloc path: total attempts)
_Atomic uint64_t c3_pop_total;
_Atomic uint64_t c4_pop_total;
_Atomic uint64_t c5_pop_total;
_Atomic uint64_t c6_pop_total;
// C3/C4/C5/C6 pop_empty (alloc path: TLS ring EMPTY)
_Atomic uint64_t c3_pop_empty;
_Atomic uint64_t c4_pop_empty;
_Atomic uint64_t c5_pop_empty;
_Atomic uint64_t c6_pop_empty;
// Overflow destinations
_Atomic uint64_t overflow_to_unified_cache; // fallback when inline ring full
_Atomic uint64_t overflow_to_legacy; // fallback when unified_cache also full
// Phase 87b: Legacy fallback counter (verify actual call paths)
_Atomic uint64_t legacy_fallback_calls; // total calls to tiny_legacy_fallback_free_base_with_env
} TinyInlineSlotsOverflowStats;
extern TinyInlineSlotsOverflowStats g_inline_slots_overflow_stats;
// ============================================================================
// Refresh from ENV (at init time)
// ============================================================================
void tiny_inline_slots_overflow_refresh_from_env(void);
// ============================================================================
// Reporting
// ============================================================================
void tiny_inline_slots_overflow_report_stats(void);
// ============================================================================
// Fast-path APIs (inlined, minimal overhead when disabled)
// ============================================================================
__attribute__((always_inline))
static inline int tiny_inline_slots_overflow_enabled(void) {
// Compile-time control (header-only hot-path helpers).
// Default is OFF in release; enable for OBSERVE/research builds as needed.
#if !HAKMEM_BUILD_RELEASE || HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED
return 1;
#else
return 0;
#endif
}
__attribute__((always_inline))
static inline void tiny_inline_slots_count_push_total(int class_idx) {
if (__builtin_expect(!tiny_inline_slots_overflow_enabled(), 1)) return;
switch (class_idx) {
case 3: atomic_fetch_add(&g_inline_slots_overflow_stats.c3_push_total, 1); break;
case 4: atomic_fetch_add(&g_inline_slots_overflow_stats.c4_push_total, 1); break;
case 5: atomic_fetch_add(&g_inline_slots_overflow_stats.c5_push_total, 1); break;
case 6: atomic_fetch_add(&g_inline_slots_overflow_stats.c6_push_total, 1); break;
default: break;
}
}
__attribute__((always_inline))
static inline void tiny_inline_slots_count_push_full(int class_idx) {
if (__builtin_expect(!tiny_inline_slots_overflow_enabled(), 1)) return;
switch (class_idx) {
case 3: atomic_fetch_add(&g_inline_slots_overflow_stats.c3_push_full, 1); break;
case 4: atomic_fetch_add(&g_inline_slots_overflow_stats.c4_push_full, 1); break;
case 5: atomic_fetch_add(&g_inline_slots_overflow_stats.c5_push_full, 1); break;
case 6: atomic_fetch_add(&g_inline_slots_overflow_stats.c6_push_full, 1); break;
default: break;
}
}
__attribute__((always_inline))
static inline void tiny_inline_slots_count_pop_total(int class_idx) {
if (__builtin_expect(!tiny_inline_slots_overflow_enabled(), 1)) return;
switch (class_idx) {
case 3: atomic_fetch_add(&g_inline_slots_overflow_stats.c3_pop_total, 1); break;
case 4: atomic_fetch_add(&g_inline_slots_overflow_stats.c4_pop_total, 1); break;
case 5: atomic_fetch_add(&g_inline_slots_overflow_stats.c5_pop_total, 1); break;
case 6: atomic_fetch_add(&g_inline_slots_overflow_stats.c6_pop_total, 1); break;
default: break;
}
}
__attribute__((always_inline))
static inline void tiny_inline_slots_count_pop_empty(int class_idx) {
if (__builtin_expect(!tiny_inline_slots_overflow_enabled(), 1)) return;
switch (class_idx) {
case 3: atomic_fetch_add(&g_inline_slots_overflow_stats.c3_pop_empty, 1); break;
case 4: atomic_fetch_add(&g_inline_slots_overflow_stats.c4_pop_empty, 1); break;
case 5: atomic_fetch_add(&g_inline_slots_overflow_stats.c5_pop_empty, 1); break;
case 6: atomic_fetch_add(&g_inline_slots_overflow_stats.c6_pop_empty, 1); break;
default: break;
}
}
__attribute__((always_inline))
static inline void tiny_inline_slots_count_overflow_to_uc(void) {
if (__builtin_expect(!tiny_inline_slots_overflow_enabled(), 1)) return;
atomic_fetch_add(&g_inline_slots_overflow_stats.overflow_to_unified_cache, 1);
}
__attribute__((always_inline))
static inline void tiny_inline_slots_count_overflow_to_legacy(void) {
if (__builtin_expect(!tiny_inline_slots_overflow_enabled(), 1)) return;
atomic_fetch_add(&g_inline_slots_overflow_stats.overflow_to_legacy, 1);
}
#endif // HAK_BOX_TINY_INLINE_SLOTS_OVERFLOW_STATS_BOX_H
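
As an alternative shape for the per-class counters above, the sketch below indexes an array instead of switching on the class, and uses relaxed atomic increments. This is an illustration under stated assumptions, not the commit's implementation: `DEMO_STATS_COMPILED` and all `demo_*` names are hypothetical, and relaxed ordering is assumed to be sufficient because the counters are only read for a summary at exit.

```c
#include <stdatomic.h>
#include <stdint.h>

#ifndef DEMO_STATS_COMPILED
#define DEMO_STATS_COMPILED 1  /* hypothetical compile gate; 0 removes the counting */
#endif

enum { DEMO_CLASS_MIN = 3, DEMO_CLASS_MAX = 6 };

/* One counter per tracked class (C3..C6); zero-initialized at file scope. */
static _Atomic uint64_t demo_push_total[DEMO_CLASS_MAX - DEMO_CLASS_MIN + 1];

static inline void demo_count_push_total(int class_idx) {
#if DEMO_STATS_COMPILED
    if (class_idx < DEMO_CLASS_MIN || class_idx > DEMO_CLASS_MAX) return;
    /* Relaxed is enough for a pure statistics counter: no other memory
     * access is ordered against it. */
    atomic_fetch_add_explicit(&demo_push_total[class_idx - DEMO_CLASS_MIN],
                              1, memory_order_relaxed);
#else
    (void)class_idx;  /* compiled out: no code on the hot path */
#endif
}

static inline uint64_t demo_read_push_total(int class_idx) {
    if (class_idx < DEMO_CLASS_MIN || class_idx > DEMO_CLASS_MAX) return 0;
    return atomic_load_explicit(&demo_push_total[class_idx - DEMO_CLASS_MIN],
                                memory_order_relaxed);
}
```

Plain `atomic_fetch_add` defaults to sequentially consistent ordering, which is stronger than a telemetry counter needs. Whether the difference is measurable depends on the target architecture and compiler, so it would need to be checked with the project's own benchmark runner.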

@ -25,6 +25,9 @@
#include "tiny_inline_slots_fixed_mode_box.h" // Phase 78-1: Optional fixed-mode gating
#include "tiny_inline_slots_switch_dispatch_box.h" // Phase 80-1: Switch dispatch for C4/C5/C6
#include "tiny_inline_slots_switch_dispatch_fixed_box.h" // Phase 83-1: Switch dispatch fixed mode
#include "tiny_inline_slots_overflow_stats_box.h" // Phase 87b: Legacy fallback counter
#include "tiny_c6_inline_slots_ifl_env_box.h" // Phase 91: C6 intrusive LIFO inline slots ENV gate
#include "tiny_c6_inline_slots_ifl_tls_box.h" // Phase 91: C6 intrusive LIFO inline slots TLS state
// Purpose: Encapsulate legacy free logic (shared by multiple paths)
// Called by: malloc_tiny_fast.h (free path) + tiny_c6_ultra_free_box.c (C6 fallback)
@ -36,6 +39,9 @@
//
__attribute__((always_inline))
static inline void tiny_legacy_fallback_free_base_with_env(void* base, uint32_t class_idx, const HakmemEnvSnapshot* env) {
// Phase 87b: Count legacy fallback calls for verification
atomic_fetch_add(&g_inline_slots_overflow_stats.legacy_fallback_calls, 1);
// Phase 80-1: Switch dispatch for C4/C5/C6 (branch reduction optimization)
// Phase 83-1: Per-op branch removed via fixed-mode caching
// C2/C3 excluded (NO-GO from Phase 77-1/79-1)
@ -65,6 +71,17 @@ static inline void tiny_legacy_fallback_free_base_with_env(void* base, uint32_t
}
break;
case 6:
// Phase 91: C6 Intrusive LIFO Inline Slots (check BEFORE FIFO)
if (tiny_c6_inline_slots_ifl_enabled_fast()) {
if (tiny_c6_inline_slots_ifl_push_fast(base)) {
FREE_PATH_STAT_INC(legacy_fallback);
if (__builtin_expect(free_path_stats_enabled(), 0)) {
g_free_path_stats.legacy_by_class[class_idx]++;
}
return;
}
}
// Phase 75-1: C6 Inline Slots (FIFO - fallback)
if (tiny_c6_inline_slots_enabled_fast()) {
if (c6_inline_push(c6_inline_tls(), base)) {
FREE_PATH_STAT_INC(legacy_fallback);
@ -126,6 +143,20 @@ static inline void tiny_legacy_fallback_free_base_with_env(void* base, uint32_t
// FULL → fall through to C6/unified cache
}
// Phase 91: C6 Intrusive LIFO Inline Slots early-exit (ENV gated)
// Try C6 IFL THIRD (before C6 FIFO and unified cache) for class 6
if (class_idx == 6 && tiny_c6_inline_slots_ifl_enabled_fast()) {
if (tiny_c6_inline_slots_ifl_push_fast(base)) {
// Success: pushed to C6 IFL
FREE_PATH_STAT_INC(legacy_fallback);
if (__builtin_expect(free_path_stats_enabled(), 0)) {
g_free_path_stats.legacy_by_class[class_idx]++;
}
return;
}
// Push declined (per-thread IFL flag is off; the IFL has no "full" state) → fall through to C6 FIFO
}
// Phase 75-1: C6 Inline Slots early-exit (ENV gated)
// Try C6 inline slots THIRD (before unified cache) for class 6
if (class_idx == 6 && tiny_c6_inline_slots_enabled_fast()) {

@ -26,6 +26,7 @@
#include "../box/tiny_c3_inline_slots_tls_box.h"
#include "../box/tiny_c3_inline_slots_env_box.h"
#include "../box/tiny_inline_slots_fixed_mode_box.h"
#include "../box/tiny_inline_slots_overflow_stats_box.h"
// ============================================================================
// C3 Inline Slots: Fast-Path Push/Pop (Always-Inline)
@ -42,8 +43,11 @@ static inline TinyC3InlineSlots* c3_inline_tls(void) {
// Returns: 1 if success, 0 if full (caller must fall back to unified_cache)
__attribute__((always_inline))
static inline int c3_inline_push(TinyC3InlineSlots* slots, void* ptr) {
tiny_inline_slots_count_push_total(3); // Phase 87: Telemetry (all attempts)
// Check if ring is full
if (__builtin_expect(c3_inline_full(slots), 0)) {
tiny_inline_slots_count_push_full(3); // Phase 87: Telemetry (overflow)
return 0; // Full, caller must use unified_cache
}
@ -58,8 +62,11 @@ static inline int c3_inline_push(TinyC3InlineSlots* slots, void* ptr) {
// Returns: non-NULL if success, NULL if empty (caller must fall back to unified_cache)
__attribute__((always_inline))
static inline void* c3_inline_pop(TinyC3InlineSlots* slots) {
tiny_inline_slots_count_pop_total(3); // Phase 87: Telemetry (all attempts)
// Check if ring is empty
if (__builtin_expect(c3_inline_empty(slots), 0)) {
tiny_inline_slots_count_pop_empty(3); // Phase 87: Telemetry (underflow)
return NULL; // Empty, caller must use unified_cache
}

@ -25,6 +25,7 @@
#include "../box/tiny_c4_inline_slots_env_box.h"
#include "../box/tiny_c4_inline_slots_tls_box.h"
#include "../box/tiny_inline_slots_fixed_mode_box.h"
#include "../box/tiny_inline_slots_overflow_stats_box.h"
// ============================================================================
// Fast-Path API (always_inline for zero branch overhead)
@ -35,8 +36,11 @@
// Precondition: ptr is valid BASE pointer for C4 class
__attribute__((always_inline))
static inline int c4_inline_push(TinyC4InlineSlots* slots, void* ptr) {
tiny_inline_slots_count_push_total(4); // Phase 87: Telemetry (all attempts)
// Full check (single branch, unlikely to be taken in steady state)
if (__builtin_expect(c4_inline_full(slots), 0)) {
tiny_inline_slots_count_push_full(4); // Phase 87: Telemetry (overflow)
return 0; // Full, caller must fallback
}
@ -52,8 +56,11 @@ static inline int c4_inline_push(TinyC4InlineSlots* slots, void* ptr) {
// Precondition: slots is initialized and enabled
__attribute__((always_inline))
static inline void* c4_inline_pop(TinyC4InlineSlots* slots) {
tiny_inline_slots_count_pop_total(4); // Phase 87: Telemetry (all attempts)
// Empty check (single branch, likely NOT taken in steady state)
if (__builtin_expect(c4_inline_empty(slots), 0)) {
tiny_inline_slots_count_pop_empty(4); // Phase 87: Telemetry (underflow)
return NULL; // Empty, caller must fallback
}


@ -25,6 +25,7 @@
#include "../box/tiny_c5_inline_slots_env_box.h"
#include "../box/tiny_c5_inline_slots_tls_box.h"
#include "../box/tiny_inline_slots_fixed_mode_box.h"
#include "../box/tiny_inline_slots_overflow_stats_box.h"
// ============================================================================
// Fast-Path API (always_inline for zero branch overhead)
@ -35,8 +36,11 @@
// Precondition: ptr is valid BASE pointer for C5 class
__attribute__((always_inline))
static inline int c5_inline_push(TinyC5InlineSlots* slots, void* ptr) {
tiny_inline_slots_count_push_total(5); // Phase 87: Telemetry (all attempts)
// Full check (single branch, unlikely to be taken in steady state)
if (__builtin_expect(c5_inline_full(slots), 0)) {
tiny_inline_slots_count_push_full(5); // Phase 87: Telemetry (overflow)
return 0; // Full, caller must fallback
}
@ -52,8 +56,11 @@ static inline int c5_inline_push(TinyC5InlineSlots* slots, void* ptr) {
// Precondition: slots is initialized and enabled
__attribute__((always_inline))
static inline void* c5_inline_pop(TinyC5InlineSlots* slots) {
tiny_inline_slots_count_pop_total(5); // Phase 87: Telemetry (all attempts)
// Empty check (single branch, likely NOT taken in steady state)
if (__builtin_expect(c5_inline_empty(slots), 0)) {
tiny_inline_slots_count_pop_empty(5); // Phase 87: Telemetry (underflow)
return NULL; // Empty, caller must fallback
}


@ -25,6 +25,7 @@
#include "../box/tiny_c6_inline_slots_env_box.h"
#include "../box/tiny_c6_inline_slots_tls_box.h"
#include "../box/tiny_inline_slots_fixed_mode_box.h"
#include "../box/tiny_inline_slots_overflow_stats_box.h"
// ============================================================================
// Fast-Path API (always_inline for zero branch overhead)
@ -35,8 +36,11 @@
// Precondition: ptr is valid BASE pointer for C6 class
__attribute__((always_inline))
static inline int c6_inline_push(TinyC6InlineSlots* slots, void* ptr) {
tiny_inline_slots_count_push_total(6); // Phase 87: Telemetry (all attempts)
// Full check (single branch, unlikely to be taken in steady state)
if (__builtin_expect(c6_inline_full(slots), 0)) {
tiny_inline_slots_count_push_full(6); // Phase 87: Telemetry (overflow)
return 0; // Full, caller must fallback
}
@ -52,8 +56,11 @@ static inline int c6_inline_push(TinyC6InlineSlots* slots, void* ptr) {
// Precondition: slots is initialized and enabled
__attribute__((always_inline))
static inline void* c6_inline_pop(TinyC6InlineSlots* slots) {
tiny_inline_slots_count_pop_total(6); // Phase 87: Telemetry (all attempts)
// Empty check (single branch, likely NOT taken in steady state)
if (__builtin_expect(c6_inline_empty(slots), 0)) {
tiny_inline_slots_count_pop_empty(6); // Phase 87: Telemetry (underflow)
return NULL; // Empty, caller must fallback
}


@ -382,6 +382,19 @@
# define HAKMEM_UNIFIED_CACHE_STATS_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 87: Inline Slots Overflow/Traffic Telemetry (Compile gate)
// ------------------------------------------------------------
// Inline Slots Overflow Stats: Compile gate (default OFF = compile-out)
// Set to 1 for OBSERVE/research builds that need:
// - per-class push/pop totals (to prove the path is actually exercised)
// - overflow/underflow counts (FULL/EMPTY)
//
// IMPORTANT: This must be a compile-time flag because the hot-path helpers are header-only.
#ifndef HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED
# define HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 29: Pool Hotbox v2 Stats Prune (Compile-out telemetry atomics)
// ------------------------------------------------------------


@ -0,0 +1,101 @@
// tiny_c6_inline_slots_ifl.c - Phase 91: C6 Intrusive LIFO Inline Slots Implementation
//
// Goal: TLS variable definition, ENV refresh, overflow handler
// Scope: Per-thread LIFO state, initialization, drain to unified_cache
#include <stdlib.h>
#include <stdio.h>
#include "box/tiny_c6_inline_slots_ifl_env_box.h"
#include "box/tiny_c6_inline_slots_ifl_tls_box.h"
#include "box/tiny_unified_lifo_box.h"
// ============================================================================
// Global State (set by refresh function)
// ============================================================================
uint8_t g_tiny_c6_inline_slots_ifl_enabled = 0;
uint8_t g_tiny_c6_inline_slots_ifl_strict = 0;
// ============================================================================
// TLS Variable Definition
// ============================================================================
// TLS instance (one per thread)
// Zero-initialized by default (head=NULL, count=0, enabled=0)
__thread struct TinyC6InlineSlotsIFL g_tiny_c6_inline_slots_ifl = {
.head = NULL,
.count = 0,
.enabled = 0,
};
// ============================================================================
// ENV Refresh (called from bench_profile.h::refresh_all_env_caches)
// ============================================================================
void tiny_c6_inline_slots_ifl_refresh_from_env(void) {
// 1. Read master ENV gate
const char* env_val = getenv("HAKMEM_TINY_C6_INLINE_SLOTS_IFL");
int requested = (env_val && *env_val && *env_val != '0') ? 1 : 0;
if (!requested) {
g_tiny_c6_inline_slots_ifl_enabled = 0;
return;
}
// 2. Fail-fast: LARSON_FIX incompatible
// Intrusive LIFO uses next pointer in freed object header,
// cannot coexist with owner_tid validation in header
const char* larson_env = getenv("HAKMEM_TINY_LARSON_FIX");
int larson_fix_enabled = (larson_env && *larson_env && *larson_env != '0') ? 1 : 0;
if (larson_fix_enabled) {
#if !HAKMEM_BUILD_RELEASE
fprintf(stderr, "[C6-IFL] FAIL-FAST: HAKMEM_TINY_LARSON_FIX=1 incompatible with intrusive LIFO, disabling\n");
fflush(stderr);
#endif
g_tiny_c6_inline_slots_ifl_enabled = 0;
g_tiny_c6_inline_slots_ifl_strict = 1;
return;
}
// 3. Read strict mode (diagnostic, not enforced)
const char* strict_env = getenv("HAKMEM_TINY_C6_IFL_STRICT");
g_tiny_c6_inline_slots_ifl_strict = (strict_env && *strict_env && *strict_env != '0') ? 1 : 0;
// 4. Enable IFL for this thread
g_tiny_c6_inline_slots_ifl_enabled = 1;
g_tiny_c6_inline_slots_ifl.enabled = 1;
#if !HAKMEM_BUILD_RELEASE
fprintf(stderr, "[C6-IFL] Initialized: enabled=1, strict=%d\n",
g_tiny_c6_inline_slots_ifl_strict);
fflush(stderr);
#endif
}
// ============================================================================
// Overflow Handler: Drain LIFO to Unified Cache
// ============================================================================
void tiny_c6_inline_slots_ifl_drain_to_unified(void) {
// Drain all entries from LIFO head to unified_cache
// Called when count > 128 (overflow condition)
while (g_tiny_c6_inline_slots_ifl.count > 0) {
void* ptr = tiny_c6_inline_slots_ifl_pop_fast();
if (ptr == NULL) {
break; // Should not happen if count tracking is correct
}
// Push to unified_cache LIFO for C6
int success = unified_cache_try_push_lifo(6, ptr);
if (!success) {
// Unified cache is full; this should be rare
// For now, we leak the pointer (FIXME: proper fallback)
#if !HAKMEM_BUILD_RELEASE
fprintf(stderr, "[C6-IFL-DRAIN] WARNING: unified_cache full, dropping pointer %p\n", ptr);
fflush(stderr);
#endif
}
}
}


@ -2,12 +2,15 @@
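Goal: eliminate the hardest problem in "squeeze out a few percent" development: **benchmarks that do not reproduce**.
Supplement: `docs/analysis/SSOT_BUILD_MODES.md` is the source of truth for which build mode to use.
## 1) Conclusion first (common causes)
Even on the same machine, results routinely move by 5-15% when any of the following changes.
- **CPU power/thermal** (governor / EPP / turbo)
- **HAKMEM_PROFILE not set** (the route changes)
- **Benchmark size range leaking in** (`HAKMEM_BENCH_MIN_SIZE/MAX_SIZE` changes the class distribution)
- **Leftover exports** (ENV from an earlier run persists)
- **Comparing different binaries** (layout tax: text placement changes)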
@ -18,6 +21,9 @@
- Set `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` explicitly
- `RUNS=10` (averages out noise)
- `WS=400` (SSOT)
- The size range is fixed on the SSOT side (the runner enforces it):
- `HAKMEM_BENCH_MIN_SIZE=16`
- `HAKMEM_BENCH_MAX_SIZE=1040`
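- Optional (for triage):
- `HAKMEM_BENCH_ENV_LOG=1` (logs CPU governor/EPP/freq)
@ -33,6 +39,7 @@ Allocator comparisons have layout tax mixed in, so treat them as **reference** only.
1. Always run SSOT through cleanenv:
- `scripts/run_mixed_10_cleanenv.sh`
- `SSOT_MIN_SIZE/SSOT_MAX_SIZE` can override the range explicitly (unaffected by leftover exports)
2. Record the environment log on every run:
- `HAKMEM_BENCH_ENV_LOG=1`
3. Save results to a file (so they can be traced later):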


@ -11,36 +11,27 @@
Comparisons against mimalloc are done with the **FAST build** (Standard includes a fixed tax, so the comparison would not be fair).
## Current snapshot (2025-12-18, Phase 69 PGO + WarmPool=16; current baseline)
## Current snapshot (2025-12-18, Phase 89 SSOT capture; current baseline)
Measurement conditions (source of truth for reproduction):
- Mixed: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
- 10-run mean/median
- Git: master (Phase 68 PGO, seed/WS diversified profile)
- **Baseline binary**: `bench_random_mixed_hakmem_minimal_pgo` (Phase 68 upgraded)
- **Stability**: Phase 66: 3 iterations, +3.0% mean, variance <±1% | Phase 68: 10-run, +1.19% vs Phase 66 (GO)
**The "current source of truth" for this scorecard is the Phase 89 SSOT capture.**
- SSOT capture: `docs/analysis/PHASE89_SSOT_MEASUREMENT.md` (Git SHA: `e4c5f0535`)
- Mixed SSOT runner: `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400`)
- Profile: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
- Most common ways to break SSOT: `HAKMEM_PROFILE` not set / `MIN_SIZE/MAX_SIZE` leaking in (the route changes)
Note:
- Phase 75 introduced C5/C6 inline slots and promoted them into presets. Phase 75 A/B results were recorded on the Standard binary (`./bench_random_mixed_hakmem`).
- FAST PGO SSOT baselines/ratios should only be updated after re-running A/B with `BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo`.
### hakmem SSOT baselines (Phase 89)
### hakmem Build Variants (same binary layout)
| Build | Mean (M ops/s) | Median (M ops/s) | vs mimalloc | Notes |
|-------|----------------|------------------|-------------|------|
| FAST v3 | 58.478 | 58.876 | 48.34% | Old baseline (Phase 59b rebase). The performance reference has moved on to Phase 66 PGO |
| FAST v3 + PGO | 59.80 | 60.25 | 49.41% | Phase 47: NEUTRAL (+0.27% mean, +1.02% median, research box) |
| **FAST v3 + PGO (Phase 66)** | **60.89** | **61.35** | **50.32%** | **GO: +3.0% mean (verified 3 times, stable within ±1%)**. Phase 66 PGO initial baseline |
| **FAST v3 + PGO (Phase 68)** | **61.614** | **61.924** | **50.93%** | **GO: +1.19% vs Phase 66** ✓ (seed/WS diversification) |
| **FAST v3 + PGO (Phase 69)** | **62.63** | **63.38** | **51.77%** | **Strong GO: +3.26% vs Phase 68** ✓✓✓ (Warm Pool Size=16, ENV-only) → **promoted to new FAST baseline** ✓ |
| FAST v3 + PGO + Phase 75 (C5+C6 ON) [Point D] | **55.51** | - | **45.70%** | Phase 75-4 FAST PGO rebase (C5+C6 inline slots): +3.16% vs Point A ✓ **[REBASE URGENT]** |
| Standard | 53.50 | - | 44.21% | Safety/compatibility reference (measured before Phase 48, needs rebase) |
| OBSERVE | TBD | - | - | Diagnostic counters ON |
| Build | Mean (M ops/s) | Median (M ops/s) | Notes |
|-------|----------------|------------------|------|
| Standard | **51.36** | - | SSOT baseline (no telemetry; the reference for optimization decisions) |
| FAST PGO minimal | **54.16** | - | SSOT ceiling (`bench_random_mixed_hakmem_minimal_pgo`). **+5.45%** vs Standard |
| OBSERVE | 51.52 | - | For route verification (includes telemetry). Not a reference for performance comparison |
Notes:
- Phase 66/68/69 (60M to 62M range) are **results reached on past commits (historical)**. Do not compare them directly with the SSOT baseline at the current HEAD (take a rebase first if a comparison is needed).
- Phase 63: `make bench_random_mixed_hakmem_fast_fixed` (`HAKMEM_FAST_PROFILE_FIXED=1`) is a research build (not listed in SSOT unless it reaches GO). Results are in `docs/analysis/PHASE63_FAST_PROFILE_FIXED_BUILD_RESULTS.md`
**FAST vs Standard delta: +10.6%** (Standard was measured before Phase 48; ratio adjusted for the mimalloc baseline change)
**FAST vs Standard delta (Phase 89): +5.45%**
**Phase 59b Notes:**
- **Profile Change**: Switched from `MIXED_TINYV3_C7_BALANCED` to `MIXED_TINYV3_C7_SAFE` (Speed-first) as canonical default
@ -92,7 +83,7 @@ scripts/bench_allocators_compare.sh --scenario mixed --iterations 50
Results (2025-12-18, mixed, iterations=50):
| allocator | ops/sec (M) | vs mimalloc (Phase 69 ref) | vs system | soft_pf | RSS (MB) |
| allocator | ops/sec (M) | vs mimalloc (reference) | vs system | soft_pf | RSS (MB) |
|----------|--------------|----------------------------|-----------|---------|----------|
| tcmalloc (LD_PRELOAD) | 34.56 | 28.6% | 11.2x | 3,842 | 21.5 |
| jemalloc (LD_PRELOAD) | 24.33 | 20.1% | 7.9x | 143 | 3.8 |
@ -114,16 +105,16 @@ scripts/bench_allocators_compare.sh --scenario mixed --iterations 50
Recommended milestones (Mixed 16-1024B, FAST build):
| Milestone | Target | Current (2025-12-18, corrected) | Status |
| Milestone | Target | Current (Phase 89 SSOT) | Status |
|-----------|--------|-----------------------------------|--------|
| M1 | **50%** of mimalloc | 44.46% | 🟡 **Not reached** (measured after the PROFILE fix) |
| M2 | **55%** of mimalloc | 44.46% | 🔴 **Not reached** (Gap: -10.54pp)|
| M1 | **50%** of mimalloc | 43.39% | 🟡 **Not reached** |
| M2 | **55%** of mimalloc | 43.39% | 🔴 **Not reached** (Gap: -11.61pp)|
| M3 | **60%** of mimalloc | - | 🔴 Not reached (structural changes required)|
| M4 | **65-70%** of mimalloc | - | 🔴 Not reached (structural changes required)|
**Current status:** hakmem (FAST PGO) (2025-12-18) = 55.53M ops/s = 44.46% of mimalloc (Random Mixed, WS=400, ITERS=20M, 10-run)
**Current status (SSOT):** hakmem (FAST PGO minimal) = **54.16M ops/s** = **43.39%** of mimalloc (Random Mixed, WS=400, ITERS=20M, 10-run)
⚠️ **Important**: The Phase 69 baseline (62.63M = 51.77%) may reflect old measurement conditions. The new baseline after making PROFILE explicit is 44.46% (M1 not reached)
⚠️ **Important**: Phase 66/68/69 (60M to 62M range) are results reached on past commits (historical). Compare against the current HEAD only after taking a rebase as described in `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
**Phase 68 PGO promotion (Phase 66 → Phase 68 upgrade):**
- Phase 66 baseline: 60.89M ops/s = 50.32% (+3.0% mean, 3-run stable)


@ -0,0 +1,128 @@
# Phase 87: Inline Slots Overflow Observation - Infrastructure Setup (COMPLETE)
## Phase 87-1: Telemetry Box Created ✓
### Files Added
1. **core/box/tiny_inline_slots_overflow_stats_box.h**
- Global counter structure: `TinyInlineSlotsOverflowStats`
- Counters: C3/C4/C5/C6 push_full, pop_empty, overflow_to_uc, overflow_to_legacy
- Fast-path inline API with `__builtin_expect()` for zero-cost when disabled
- Enabled via compile-time gate:
- `HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED=0/1` (default 0)
- Non-RELEASE builds can also enable it (depending on build flags)
2. **core/box/tiny_inline_slots_overflow_stats_box.c**
- Global state initialization
- Refresh function placeholder
- Report function for final statistics output
### Makefile Integration
- Added `core/box/tiny_inline_slots_overflow_stats_box.o` to:
- OBJS_BASE
- BENCH_HAKMEM_OBJS_BASE
- TINY_BENCH_OBJS_BASE
- OBSERVE build enables telemetry explicitly:
- `make bench_random_mixed_hakmem_observe` adds `-DHAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED=1`
### Build Status
✓ Successfully compiled (no errors, no warnings in new code)
✓ Binary ready: `bench_random_mixed_hakmem`
---
## Next: Phase 87-2 - Counter Integration Points
To enable overflow measurement, counters must be injected at:
### Free Path (Push FULL)
- Location: `core/front/tiny_c6_inline_slots.h:37` (c6_inline_push)
- Trigger: When ring is FULL, return 0
- Counter: `tiny_inline_slots_count_push_full(6)`
- Similar for C3 (`core/front/tiny_c3_inline_slots.h`), C4, C5
### Alloc Path (Pop EMPTY)
- Location: `core/front/tiny_c6_inline_slots.h:54` (c6_inline_pop)
- Trigger: When ring is EMPTY, return NULL
- Counter: `tiny_inline_slots_count_pop_empty(6)`
- Similar for C3, C4, C5
### Fallback Destinations (Unified Cache)
- Location: `core/front/tiny_unified_cache.h:177-216` (unified_cache_push)
- Trigger: When unified cache is FULL, return 0
- Counter: `tiny_inline_slots_count_overflow_to_uc()`
- Also: when unified_cache_push returns 0, legacy path gets called
- Counter: `tiny_inline_slots_count_overflow_to_legacy()`
---
## Testing Plan (Phase 87-2)
### Observation Conditions
- **Profile**: MIXED_TINYV3_C7_SAFE
- **Working Set**: WS=400 (default inline slots conditions)
- **Iterations**: 20M (ITERS=20000000)
- **Runs**: single-run OBSERVE preflight (SSOT throughput runs remain Standard/FAST)
### Expected Output
Debug build will print statistics:
```
=== PHASE 87: INLINE SLOTS OVERFLOW STATS ===
PUSH FULL (Free Path Ring Overflow):
C3: ...
C4: ...
C5: ...
C6: ...
POP EMPTY (Alloc Path Ring Underflow):
C3: ...
C4: ...
C5: ...
C6: ...
```
Note: `OVERFLOW DESTINATIONS` counters are optional and may remain 0 unless explicitly instrumented at fallback call sites.
### GO/NO-GO Decision Logic
**GO for Phase 88** if:
- `(push_full + pop_empty) / (20M * number of runs) ≥ 0.1%` (the OBSERVE preflight above is a single run)
- Indicates sufficient overflow frequency to warrant batch optimization
**NO-GO for Phase 88** if:
- Overflow rate < 0.1%
- Suggests overhead reduction ROI is minimal
- Consider alternative optimization layers
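
The decision rule above can be sketched in C. This is a minimal illustration under stated assumptions, not code from the repository: the names `sketch_overflow_rate_pct` and `sketch_phase88_go` are hypothetical, and the denominator is assumed to be the total number of operations observed across all runs.

```c
#include <stdint.h>

/* Hypothetical helper (not part of hakmem): overflow events as a
 * percentage of all observed operations. Returns 0 when nothing ran. */
static double sketch_overflow_rate_pct(uint64_t push_full,
                                       uint64_t pop_empty,
                                       uint64_t total_ops) {
    if (total_ops == 0) {
        return 0.0;
    }
    return 100.0 * (double)(push_full + pop_empty) / (double)total_ops;
}

/* Returns 1 (GO) when the rate reaches the 0.1% threshold from the
 * rule above, and 0 (NO-GO) otherwise. */
static int sketch_phase88_go(uint64_t push_full,
                             uint64_t pop_empty,
                             uint64_t total_ops) {
    return sketch_overflow_rate_pct(push_full, pop_empty, total_ops) >= 0.1;
}
```

Keeping the rate calculation separate from the threshold test lets the same helper feed a report line and the GO/NO-GO judgment.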
---
## Architecture Notes
- Counters use `_Atomic` for thread-safety (single increment per operation)
- Zero overhead in RELEASE builds (compile-time constant folding)
- Reporting happens on exit (calls `tiny_inline_slots_overflow_report_stats()`)
- Call point: Should add to bench program exit sequence
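
A compile-gated atomic counter of the kind described in these notes might look like the following. This is a sketch only: `SKETCH_STATS_COMPILED`, `g_sketch_push_full`, and both helper names are stand-ins invented for illustration, and the real box header may be organized differently.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: a stand-in gate for this sketch. The gate named in this
 * document is HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED. */
#ifndef SKETCH_STATS_COMPILED
#  define SKETCH_STATS_COMPILED 0
#endif

#define SKETCH_NUM_CLASSES 8

/* One counter per size class (hypothetical layout), zero-initialized. */
static _Atomic uint64_t g_sketch_push_full[SKETCH_NUM_CLASSES];

/* Hot-path helper: the increment is compiled out when the gate is 0. */
static inline void sketch_count_push_full(int class_idx) {
#if SKETCH_STATS_COMPILED
    atomic_fetch_add_explicit(&g_sketch_push_full[class_idx], 1,
                              memory_order_relaxed);
#else
    (void)class_idx;
#endif
}

/* Report-side read; always available so callers need no #if. */
static inline uint64_t sketch_read_push_full(int class_idx) {
    return atomic_load_explicit(&g_sketch_push_full[class_idx],
                                memory_order_relaxed);
}
```

Relaxed ordering is used because the counters are only read for a final report and do not synchronize other data. Building with `-DSKETCH_STATS_COMPILED=1` turns the increments on.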
---
## Files Status
| File | Status |
|------|--------|
| tiny_inline_slots_overflow_stats_box.h | Created |
| tiny_inline_slots_overflow_stats_box.c | Created |
| Makefile | Updated (object files added) |
| C3/C4/C5/C6 inline slots | Pending counter integration |
| Observation binary build | Pending debug build |
---
## Ready for Phase 87-2
Next action: Inject counters into inline slots and run the single-run OBSERVE observation.


@ -0,0 +1,102 @@
# Phase 87: Inline Slots Overflow Observation Results
## Objective
Measure inline slots overflow frequency (C3/C4/C5/C6) to determine if Phase 88 (batch drain optimization) is worth implementing.
## Observation Setup
- **Workload**: Mixed SSOT (WS=400, 16-1024B allocation sizes)
- **Operations**: 20,000,000 random alloc/free operations
- **Runs**: single-run observation (OBSERVE binary)
- **Configuration**:
- Route assignments: LEGACY for all C0-C7
- Inline slots: C4/C5/C6 enabled (Phase 75/76), fixed mode ON (Phase 78), switch dispatch ON (Phase 80)
## Critical Fix (measurement correctness)
An earlier observation run reported `PUSH TOTAL/POP TOTAL = 0` for all classes.
That was **not** valid evidence that inline slots were unused.
Root cause was **telemetry compile gating**:
- `tiny_inline_slots_overflow_enabled()` is a header-only hot-path check.
- The original implementation relied on a `#define` inside `tiny_inline_slots_overflow_stats_box.c`,
which does not apply to other translation units.
- Fix: introduce `HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED` in `core/hakmem_build_flags.h` and make the enabled check depend on it.
- OBSERVE build now enables it via Makefile: `bench_random_mixed_hakmem_observe` adds `-DHAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED=1`.
## Verified Result: inline slots **are** being called (WS=400 SSOT)
### Total Operation Counts (Verification)
```
PUSH TOTAL (Free Path Attempts):
C4: 687,564
C5: 1,373,605
C6: 2,750,862
TOTAL (C4-C6): 4,812,031
POP TOTAL (Alloc Path Attempts):
C4: 687,564
C5: 1,373,605
C6: 2,750,862
TOTAL (C4-C6): 4,812,031
```
This confirms:
- ✅ `tiny_legacy_fallback_free_base_with_env()` is being executed (LEGACY fallback path).
- ✅ C4/C5/C6 inline slots push/pop are active in the LEGACY fallback/hot alloc paths.
## Overflow / Underflow Rates (WS=400 SSOT)
```
PUSH FULL (Free Path Ring Overflow):
TOTAL: 0 (0.00%)
POP EMPTY (Alloc Path Ring Underflow):
TOTAL: 168 (0.003%)
```
Interpretation:
- WS=400 SSOT is a **near-perfect steady state** for C4/C5/C6 inline slots.
- Overflow batching ROI is effectively zero: `push_full=0`, `pop_empty≈0.003%`.
## Phase 88 ROI Decision: **NO-GO**
### Recommendation
**DO NOT IMPLEMENT Phase 88 (Batch Drain Optimization)**
### Rationale
1. **Overflow is essentially absent**: `push_full=0`, `pop_empty≈0.003%`.
2. **Batch drain overhead would dominate**: any additional logic is far more likely to incur layout/branch tax than to save work.
3. **This is already the desirable state**: inline slots are sized correctly for WS=400 SSOT.
### Cost-Benefit Analysis
- **Implementation Cost**: high (batch logic, tests, ongoing maintenance)
- **Benefit Under SSOT**: ~0% (overflow frequency too low)
- **Risk**: layout tax / regression in a hot-path-heavy code region
### Alternative Path (If overflow work is desired)
Use a research workload that intentionally produces misses/overflow (e.g. larger WS), and re-run this observation.
Do not use WS=400 SSOT for that validation.
## Implementation Artifacts
### Files Created
- `core/box/tiny_inline_slots_overflow_stats_box.h` - Telemetry box header
- `core/box/tiny_inline_slots_overflow_stats_box.c` - Telemetry implementation
- `core/front/tiny_c{3,4,5,6}_inline_slots.h` - Updated with total counter calls
### Telemetry Infrastructure
- Atomic counters for thread-safe measurement
- Compile-time enabled (always in observation builds)
- Zero overhead when disabled (compiled out when `HAKMEM_INLINE_SLOTS_OVERFLOW_STATS_COMPILED=0`)
- Percentage calculations for overflow rates
## Conclusion
**Phase 87 observation (with fixed telemetry gating) confirms that inline slots are active and overflow is negligible for WS=400 SSOT.**
Phase 88 is therefore correctly frozen as NO-GO for SSOT performance work.
### Score: NO-GO ✗
- Expected Improvement: ~0% (overflow extremely rare)
- Actual Improvement: N/A (measurement-only)
- Implementation Burden: High (new code path, batch logic)
- Recommendation: Archive Phase 88 until a workload shows measurable inline slots overflow


@ -0,0 +1,186 @@
# Phase 89: Bottleneck Analysis & Next Optimization Candidates
**Date**: 2025-12-18
**SSOT Baseline (Standard)**: 51.36M ops/s
**SSOT Optimized (FAST PGO)**: 54.16M ops/s (+5.45%)
---
## Perf Profile Summary
**Profile Run**: 40M operations (0.78s), 833 samples
**Top 50 Functions by CPU Time**:
| Rank | Function | CPU Time | Type | Notes |
|------|----------|----------|------|-------|
| 1 | **free** | 27.40% | **HOTTEST** | Free path (malloc_tiny_fast main handler) |
| 2 | main | 26.30% | Loop | Benchmark loop structure (not optimizable) |
| 3 | **malloc** | 20.36% | **HOTTEST** | Alloc path (malloc_tiny_fast main handler) |
| 4 | malloc.cold | 10.65% | Cold path | Rarely executed alloc fallback |
| 5 | free.cold | 5.59% | Cold path | Rarely executed free fallback |
| 6 | **tiny_region_id_write_header** | 2.98% | **HOT** | Region metadata write (inlined candidate) |
| 7-50 | Various | ~5% | Minor | Page faults, memset, init (one-time/rare) |
---
## Key Observations
### CPU Time Breakdown:
- **malloc + free combined**: 47.76% (27.40% + 20.36%)
- This is the core allocation/deallocation hot path
- Current architecture: `malloc_tiny_fast.h` with inline slots (C4-C7) already optimized
- **tiny_region_id_write_header**: 2.98%
- Called during every free for C4-C7 classes
- Currently NOT inlined to all call sites (selective inlining only)
- Potential optimization: Force always_inline for hot paths
- **malloc.cold / free.cold**: 10.65% + 5.59% = 16.24%
- Cold paths (fallback routes)
- Should NOT be optimized (violates layout tax principle)
- Adding code to optimize cold paths increases code bloat
### Inline Slots Status (from OBSERVE):
- C4/C5/C6 inline slots ARE active during measurement
- PUSH TOTAL: 4.81M ops (C4-C6 combined, equal to POP TOTAL)
- Overflow rate: 0.003% (negligible)
- **Conclusion**: Inline slots are working perfectly, not a bottleneck
---
## Top 3 Optimization Candidates
### Candidate 1: tiny_region_id_write_header Inlining (2.98% CPU)
**Current Implementation**:
- Located in: `core/region_id_v6.c`
- Called from: `malloc_tiny_fast.h` during free path
- Current inlining: Selective (only some call sites)
**Opportunity**:
- Force `always_inline` on hot-path call sites to eliminate function call overhead
- Estimated savings: 1-2% CPU time (small gain, low risk)
- **Layout Impact**: MINIMAL (only modifying call site, not adding code bulk)
**Risk Assessment**:
- LOW: Function is already optimized, only changing inline strategy
- No new branches or code paths
- I-cache pressure: minimal (function body is ~30-50 cycles)
**Recommendation**: **YES - PURSUE**
- Implement: Add `__attribute__((always_inline))` to hot-path wrapper
- Target: Free path only (malloc path is lower frequency)
- Expected gain: +1-2% throughput
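
The forced-inlining idea can be sketched as below. This is illustrative only: the one-byte header layout, `SKETCH_HEADER_MAGIC`, and `sketch_write_header` are hypothetical, since the real `tiny_region_id_write_header` is not shown in this document. `__attribute__((always_inline))` is a GCC/Clang extension.

```c
#include <stdint.h>

/* Hypothetical header layout for this sketch: one byte holding a magic
 * nibble plus the class index. */
#define SKETCH_HEADER_MAGIC 0xA0u

/* always_inline asks the compiler to inline the body at every call
 * site, which removes call/return overhead on the hot path. */
__attribute__((always_inline))
static inline void *sketch_write_header(void *base, unsigned class_idx) {
    uint8_t *p = (uint8_t *)base;
    p[0] = (uint8_t)(SKETCH_HEADER_MAGIC | (class_idx & 0x0Fu));
    return p + 1; /* user pointer starts right after the 1-byte header */
}
```

Whether this helps in practice still has to be confirmed with the SSOT runner, because forcing inlining grows every call site and can shift code layout.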
---
### Candidate 2: malloc/free Hot-Path Branch Reduction (47.76% CPU)
**Current Implementation**:
- Located in: `core/front/malloc_tiny_fast.h` (Phase 9/10/80-1 optimized)
- Already using: Fixed inline slots, switch dispatch, per-op policy snapshots
- Branches: 1-3 per operation (policy check, class route, handler dispatch)
**Opportunity**:
- Profile shows **56.4M branch-misses** at roughly 1.75 insn/cycle
- This indicates branch prediction pressure, not a simple optimization
- Further reduction requires: Per-thread pre-computed routing tables or elimination of policy snapshot checks
**Analysis**:
- Phase 9/10/78-1/80-1/83-1 have already eliminated most low-hanging branches
- Remaining optimization would require structural change (pre-compute all routing at init time)
- **Risk**: Code bloat from pre-computed tables, potential layout tax regression
**Recommendation**: **DEFERRED TO PHASE 90+**
- Requires architectural change (similar to Phase 85's approach, which was NO-GO)
- Wait for overflow/workload characteristics that justify the complexity
- Current gains are saturated
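
For reference, the "pre-computed routing table" idea mentioned above could look roughly like this. Everything here is an assumption made for illustration: the route enum, the bitmask input, and the function names are hypothetical and do not reflect the actual policy snapshot code.

```c
#include <stdint.h>

#define SKETCH_NUM_CLASSES 8

/* Hypothetical route kinds; the real policy enum is not shown here. */
typedef enum {
    SKETCH_ROUTE_LEGACY = 0,
    SKETCH_ROUTE_INLINE_SLOTS = 1
} sketch_route_t;

/* Per-thread table, filled once at init time. */
static _Thread_local uint8_t t_sketch_route[SKETCH_NUM_CLASSES];

/* Init time: resolve every class once. Bit i of inline_mask is set when
 * class i should use inline slots (an assumption for this sketch). */
static void sketch_route_init(uint32_t inline_mask) {
    for (int i = 0; i < SKETCH_NUM_CLASSES; i++) {
        t_sketch_route[i] = ((inline_mask >> i) & 1u)
                                ? SKETCH_ROUTE_INLINE_SLOTS
                                : SKETCH_ROUTE_LEGACY;
    }
}

/* Hot path: a single table load stands in for per-operation checks. */
static inline sketch_route_t sketch_route_for(unsigned class_idx) {
    return (sketch_route_t)t_sketch_route[class_idx & (SKETCH_NUM_CLASSES - 1)];
}
```

The table trades a few bytes of thread-local storage for fewer data-dependent branches, which is exactly the layout and code-size tradeoff flagged as a risk above.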
---
### Candidate 3: Cold-Path De-duplication (malloc.cold/free.cold = 16.24% CPU)
**Current Implementation**:
- malloc.cold: 10.65% (fallback alloc path)
- free.cold: 5.59% (fallback free path)
**Opportunity**: NONE (Intentional Design)
**Rationale**:
- Cold paths are EXPLICITLY separate to avoid code bloat in hot path
- Separating code improves I-cache utilization for hot path
- Optimizing cold path would ADD code to hot path (violating layout tax principle)
- Cold paths are rarely executed in SSOT workload
**Recommendation**: **NO - DO NOT PURSUE**
- Aligns with user's emphasis on "avoiding layout tax"
- Cold paths are correctly placed
- Optimization here would hurt hot-path performance
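
The hot/cold separation described above is commonly expressed with GCC/Clang function attributes. The following is a minimal sketch with hypothetical names (`sketch_cache_t`, `sketch_alloc`, `sketch_alloc_cold`), not the repository's actual `malloc.cold` split.

```c
#include <stddef.h>

/* Hypothetical tiny cache for this sketch. */
typedef struct {
    void *slots[4];
    int count;
} sketch_cache_t;

/* Counts slow-path entries, for illustration only. */
static int g_sketch_cold_calls;

/* cold + noinline keep the fallback out of line, so its body is not
 * duplicated into hot callers. */
__attribute__((cold, noinline))
static void *sketch_alloc_cold(sketch_cache_t *c) {
    (void)c;
    g_sketch_cold_calls++;
    return NULL; /* a real fallback would refill or use another tier */
}

/* Hot path: one rarely-taken branch that jumps to the cold function. */
static inline void *sketch_alloc(sketch_cache_t *c) {
    if (__builtin_expect(c->count == 0, 0)) {
        return sketch_alloc_cold(c);
    }
    return c->slots[--c->count];
}
```

The hot function stays small because the fallback body lives elsewhere, which matches the recommendation to leave the cold paths as they are.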
---
## Performance Ceiling Analysis
**FAST PGO vs Standard: 5.45% delta**
This gap represents:
1. **PGO branch prediction optimizations** (~3%)
- PGO reorders frequently-taken paths
- Improves branch prediction hit rate
2. **Code layout optimizations** (~2%)
- Hottest functions placed contiguously
- Reduces I-cache misses
3. **Inlining decisions** (~0.5%)
- PGO optimizes inlining thresholds
- Fewer expensive calls in hot path
**Implication for Standard Build**:
- Standard build is fundamentally limited by branch prediction pressure
- Further gains require: (a) reducing branches, or (b) making branches more predictable
- Both options require careful architectural tradeoffs
---
## Recommended Strategy for Phase 90+
### Immediate (Quick Win):
1. **Phase 90: tiny_region_id_write_header always_inline**
- Effort: 1-2 lines of code
- Expected gain: +1-2%
- Risk: LOW
### Medium-term (Structural):
2. **Phase 91: Hot-path routing pre-computation (optional)**
- Only if overflow rate increases or workload changes
- Risk: MEDIUM (code bloat, layout tax)
- Expected gain: +2-3% (speculative)
3. **Phase 92: Allocator comparison sweep**
- Use FAST PGO as comparison baseline (+5.45%)
- Verify gap closure as individual optimizations accumulate
### Deferred:
- Avoid cold-path optimization (maintains I-cache discipline)
- Do NOT pursue redundant branch elimination (saturation point reached)
---
## Summary Table
| Candidate | Priority | Effort | Risk | Expected Gain | Recommendation |
|-----------|----------|--------|------|----------------|-----------------|
| tiny_region_id_write_header inlining | HIGH | 1-2h | LOW | +1-2% | **PURSUE** |
| malloc/free branch reduction | MED | 20-40h | MEDIUM | +2-3% | DEFER |
| cold-path optimization | LOW | 10-20h | HIGH | +1% | **AVOID** |
---
## Layout Tax Adherence Check
✓ Candidate 1 (header inlining): No code bulk, maintains I-cache discipline
✓ Candidate 2 deferred: Avoids adding branches to hot path
✓ Candidate 3 avoided: Maintains cold-path separation principle
**Conclusion**: All recommendations align with the user's "avoid layout tax" principle.


@ -0,0 +1,141 @@
# Phase 89 SSOT Measurement Capture
**Timestamp**: 2025-12-18 23:06:01
**Git SHA**: e4c5f0535
**Branch**: master
---
## Step 1: OBSERVE Binary (Telemetry Verification)
**Binary**: `./bench_random_mixed_hakmem_observe`
**Profile**: `MIXED_TINYV3_C7_SAFE`
**Iterations**: 20,000,000
**Working Set**: 400
**Inline Slots Overflow Stats (Preflight Verification)**:
- PUSH TOTAL: 4,812,031 ops (C4+C5+C6 verified active)
- POP TOTAL: 4,812,031 ops
- PUSH FULL: 0 (0.00%)
- POP EMPTY: 168 (0.003%)
- LEGACY FALLBACK CALLS: 5,327,294
- Judgment: ✓ \[C\] LEGACY used AND C4/C5/C6 INLINE SLOTS ACTIVE
- Throughput (with telemetry): **51.52M ops/s**
---
## Step 2: Standard Build (Clean Performance Baseline)
**Binary**: `./bench_random_mixed_hakmem`
**Build Flags**: RELEASE, no telemetry, standard optimization
**Profile**: `MIXED_TINYV3_C7_SAFE`
**Iterations**: 20,000,000
**Working Set**: 400
**Runs**: 10
**10-Run Results**:
| Run | Throughput | Status |
|-----|-----------|--------|
| 1 | 51.15M | OK |
| 2 | 51.44M | OK |
| 3 | 51.61M | OK |
| 4 | 51.73M | Peak |
| 5 | 50.74M | Low |
| 6 | 51.34M | OK |
| 7 | 50.74M | Low |
| 8 | 51.37M | OK |
| 9 | 51.39M | OK |
| 10 | 51.31M | OK |
**Statistics**:
- **Mean**: 51.36M ops/s
- **Min**: 50.74M ops/s
- **Max**: 51.73M ops/s
- **Range**: 0.99M ops/s
- **CV**: ~0.7%
---
## Step 3: FAST PGO Build (Optimized Performance Tracking)
**Binary**: `./bench_random_mixed_hakmem_minimal_pgo`
**Build Flags**: RELEASE, PGO optimized, BENCH_MINIMAL=1
**Profile**: `MIXED_TINYV3_C7_SAFE`
**Iterations**: 20,000,000
**Working Set**: 400
**Runs**: 10
**10-Run Results**:
| Run | Throughput | Status |
|-----|-----------|--------|
| 1 | 55.13M | Peak |
| 2 | 54.73M | High |
| 3 | 53.81M | OK |
| 4 | 54.60M | High |
| 5 | 55.02M | Peak |
| 6 | 52.89M | Low |
| 7 | 53.61M | OK |
| 8 | 53.53M | OK |
| 9 | 55.08M | Peak |
| 10 | 53.51M | OK |
**Statistics**:
- **Mean**: 54.16M ops/s
- **Min**: 52.89M ops/s
- **Max**: 55.13M ops/s
- **Range**: 2.24M ops/s
- **CV**: ~1.5%
---
## Performance Delta Analysis
**Standard vs FAST PGO**:
- Delta: 54.16M - 51.36M = **2.80M ops/s**
- Percentage Gain: (2.80M / 51.36M) × 100 = **5.45%**
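
The percentage above follows from the two reported means. As a small worked sketch (the helper name `sketch_gain_pct` is hypothetical):

```c
/* Relative gain of `optimized` over `baseline`, in percent.
 * Both arguments use the same unit (here: M ops/s). */
static double sketch_gain_pct(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}
```

Calling `sketch_gain_pct(51.36, 54.16)` gives roughly 5.45, matching the figure above.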
**Interpretation**:
- FAST PGO is 5.45% faster than Standard build
- This represents the optimization ceiling with current profile-guided configuration
- SSOT baseline for bottleneck analysis: **Standard 51.36M ops/s**
---
## Environment Configuration (SSOT Locked)
**Key ENV variables** (forced in `scripts/run_mixed_10_cleanenv.sh`):
- `HAKMEM_BENCH_MIN_SIZE=16` - SSOT: prevent size drift
- `HAKMEM_BENCH_MAX_SIZE=1040` - SSOT: prevent class filtering
- `HAKMEM_BENCH_C5_ONLY=0` - SSOT: no single-class mode
- `HAKMEM_BENCH_C6_ONLY=0` - SSOT: no single-class mode
- `HAKMEM_BENCH_C7_ONLY=0` - SSOT: no single-class mode
- `HAKMEM_WARM_POOL_SIZE=16` - Phase 69 winner
- `HAKMEM_TINY_C4_INLINE_SLOTS=1` - Phase 76-1 promoted
- `HAKMEM_TINY_C5_INLINE_SLOTS=1` - Phase 75-2 promoted
- `HAKMEM_TINY_C6_INLINE_SLOTS=1` - Phase 75-1 promoted
- `HAKMEM_TINY_INLINE_SLOTS_FIXED=1` - Phase 78-1 promoted
- `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=1` - Phase 80-1 promoted
- `HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=0` - Phase 83-1 NO-GO
- `HAKMEM_FASTLANE_DIRECT=1` - Phase 19-1b promoted
- `HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=1` - Phase 9/10 promoted
- `HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=1` - Phase 10 promoted
- `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` - default route
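When the benchmark binary is invoked directly instead of through the SSOT runner, leaked exports are not overwritten. The sketch below shows one way to detect that; it is a hypothetical helper (`check_ssot_env` does not exist in the repository) and it covers only the five force-exported bench variables from the list above.

```bash
# Hypothetical drift check: compare the current environment against the
# forced SSOT values and report every mismatch on stderr.
check_ssot_env() {
  local rc=0 pair name want got
  for pair in \
    HAKMEM_BENCH_MIN_SIZE=16 \
    HAKMEM_BENCH_MAX_SIZE=1040 \
    HAKMEM_BENCH_C5_ONLY=0 \
    HAKMEM_BENCH_C6_ONLY=0 \
    HAKMEM_BENCH_C7_ONLY=0
  do
    name=${pair%%=*}          # text before the first '='
    want=${pair#*=}           # text after the first '='
    got=${!name-unset}        # indirect lookup; "unset" if the variable is missing
    if [[ "${got}" != "${want}" ]]; then
      echo "ENV drift: ${name}=${got} (SSOT expects ${want})" >&2
      rc=1
    fi
  done
  return "${rc}"
}
```

A caller could run `check_ssot_env || exit 1` before a manual benchmark invocation.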
---
## System Configuration
- **CPU**: AMD Ryzen 7 5825U with Radeon Graphics
- **Logical CPUs**: 16 (8 cores / 16 threads)
- **Memory**: 13166508 kB MemTotal (about 12.6 GiB)
- **Kernel**: 6.8.0-87-generic
---
## Next Steps (Phase 89 Step 5)
**Objective**: Identify top 3 bottleneck candidates using perf measurement
- Run `perf top` during Mixed SSOT execution
- Analyze top 50 functions by CPU time
- Filter to high-frequency code paths (avoid 0.001% optimizations)
- Prepare recommendations for Phase 90+
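The filtering step can also be done non-interactively on a saved report. The sketch below assumes text in the style of `perf report --stdio --no-children` (overhead percentage in the first column, symbol in the last, sorted by overhead); `filter_hot_symbols` is a hypothetical helper, and the thresholds are illustrative rather than values taken from this document.

```bash
# Hypothetical helper: keep the first N symbols whose overhead is >= MIN_PCT.
# Input on stdin: text in the style of `perf report --stdio --no-children`.
filter_hot_symbols() {
  local min_pct=${1:-1.0} top_n=${2:-3}
  awk -v min="${min_pct}" -v n="${top_n}" '
    /^#/ || NF == 0 { next }               # skip header comments and blank lines
    $1 ~ /^[0-9.]+%$/ {
      pct = $1; sub(/%/, "", pct)
      if (pct + 0 >= min + 0 && shown < n) { print $1, $NF; shown++ }
    }'
}
```

For example, `perf report --stdio --no-children | filter_hot_symbols 1.0 3` would list up to three symbols at or above 1% overhead.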

View File

@ -0,0 +1,145 @@
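# Phase 90: Structural Review & Gap Triage (SSOT for turning the mimalloc/tcmalloc gap into design decisions)
Goal: before arguing about whether layout tax is to blame, reproduce **where the gap comes from** with the same ritual every time, and use that to choose the next structural proposal (Phase 91+).
Prerequisites:
- SSOT runner (source of truth for performance): `scripts/run_mixed_10_cleanenv.sh` (`ITERS=20000000 WS=400 RUNS=10`)
- OBSERVE runner (source of truth for routing): `scripts/run_mixed_observe_ssot.sh` (includes telemetry; not used for performance comparison)
- Current SSOT (Phase 89): `docs/analysis/PHASE89_SSOT_MEASUREMENT.md`
Non-goals:
- Long soak runs (5 min / 30 min / 60 min) are out of scope for Phase 90.
- One-line micro-optimizations are out of scope for Phase 90 (it only produces input for Phase 91+).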
---
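## Box Theory Rules (Phase 90 edition)
1. **One boundary**: the measurement entry point is fixed in scripts (no hand-typed commands).
2. **Reversible**: prefer comparisons via ENV toggles on the same binary, or via LD_PRELOAD on the same binary.
3. **Visible**: first use OBSERVE to confirm the path is actually exercised, then take the numbers with SSOT.
4. **Fail-fast**: SSOT violations such as an unset `HAKMEM_PROFILE` are immediate errors (enforced by the scripts).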
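Rule 4 can be sketched as a guard that runs before any measurement. This is an illustrative helper only (`require_profile` is a hypothetical name), not the check the repository scripts actually perform.

```bash
# Hypothetical fail-fast guard: refuse to continue when the profile is unset.
require_profile() {
  if [[ -z "${HAKMEM_PROFILE:-}" ]]; then
    echo "SSOT violation: HAKMEM_PROFILE is not set" >&2
    return 1
  fi
  echo "profile=${HAKMEM_PROFILE}"
}
```

Placed at the top of a runner as `require_profile || exit 1`, it stops the run before any numbers are produced.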
---
## Step 0: SSOT Preflight (route check, not performance)
Goal: rule out optimizations that are never actually exercised.
```bash
make bench_random_mixed_hakmem_observe
HAKMEM_ROUTE_BANNER=1 ./scripts/run_mixed_observe_ssot.sh | tee /tmp/phase90_observe_preflight.log
```
Pass criteria:
- `Route assignments` match expectations (with the Mixed SSOT defaults, most classes tend to be `LEGACY`)
- `Inline Slots Overflow Stats` shows **PUSH/POP TOTAL > 0** (C4/C5/C6 inline slots are live)
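The second criterion can be checked mechanically against the saved preflight log. The sketch below assumes the `PUSH TOTAL: ... TOTAL=4,812,031` line format shown in the Phase 87 OBSERVE output; `inline_slots_active` is a hypothetical helper name.

```bash
# Hypothetical check: succeed only if both the "PUSH TOTAL" and "POP TOTAL"
# lines on stdin report a TOTAL greater than zero.
inline_slots_active() {
  awk '
    /(PUSH|POP) TOTAL:/ {
      kind = ($0 ~ /PUSH TOTAL:/) ? "PUSH" : "POP"
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^TOTAL=/) {
          v = $i; sub(/^TOTAL=/, "", v); gsub(/,/, "", v)   # strip thousands separators
          if (v + 0 > 0) seen[kind] = 1
        }
      }
    }
    END { exit !(seen["PUSH"] && seen["POP"]) }'
}
```

Usage: `inline_slots_active < /tmp/phase90_observe_preflight.log && echo "inline slots active"`.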
---
## Step 1: hakmem SSOT baseline (Standard / FAST PGO)
Goal: pin down the current numbers under the same conditions as Phase 89 (with CV).
```bash
make bench_random_mixed_hakmem
./scripts/run_mixed_10_cleanenv.sh | tee /tmp/phase90_hakmem_standard_10run.log
make pgo-fast-full
BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo ./scripts/run_mixed_10_cleanenv.sh | tee /tmp/phase90_hakmem_fastpgo_10run.log
```
Record (required for SSOT):
- `git rev-parse HEAD`
- `Mean/Median/CV`
- `HAKMEM_PROFILE`
---
## Step 2: allocator reference (short runs, no soak)
Goal: pin down in numbers where the strong external allocators stand (as a reference only).
```bash
make bench_random_mixed_system bench_random_mixed_mi
RUNS=10 scripts/run_allocator_quick_matrix.sh | tee /tmp/phase90_allocator_quick_matrix.log
```
Notes:
- This is a **reference** (separate binaries and LD_PRELOAD runs are mixed).
- SSOT optimization decisions must always be made with the same ritual as Step 1.
---
## Step 3: same-binary matrix (minimize layout differences, expose design differences)
Goal: determine whether "hakmem is slow" comes from layout/benchmark differences or from algorithm/fixed costs.
```bash
make bench_random_mixed_system shared
RUNS=10 scripts/run_allocator_preload_matrix.sh | tee /tmp/phase90_allocator_preload_matrix.log
```
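How to read the results:
- The numbers **need not match** `bench_random_mixed_hakmem*` (linked SSOT), because the path is different.
- What matters here is the relative difference through the same entry point (malloc/free).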
---
## Step 4: perf stat (pin down the "shape of the gap" with identical counters)
Goal: reduce "fast/slow" to whether the loss is in instructions, branches, or memory.
### hakmem (linked)
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses \
  ./bench_random_mixed_hakmem 20000000 400 1 2>&1 | tee /tmp/phase90_perfstat_hakmem_linked.txt
```
### system binary + LD_PRELOAD (tcmalloc/jemalloc/mimalloc)
```bash
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,iTLB-load-misses,dTLB-load-misses \
  env LD_PRELOAD="$TCMALLOC_SO" ./bench_random_mixed_system 20000000 400 1 2>&1 | tee /tmp/phase90_perfstat_tcmalloc_preload.txt
```
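To compare the two runs on derived metrics rather than raw counts, the counters can be reduced to IPC and branch-miss rate. This is a sketch assuming the plain `perf stat -x,` CSV layout (`value,unit,event,...`, with no `-I` or `-A` prefix columns); `perf_ratios` is a hypothetical helper name.

```bash
# Hypothetical helper: derive IPC and branch-miss rate from `perf stat -x,`
# CSV lines on stdin (field 1 = counter value, field 3 = event name).
perf_ratios() {
  awk -F, '
    $3 ~ /^cycles/        { cycles = $1 }
    $3 ~ /^instructions/  { insns = $1 }
    $3 ~ /^branches/      { branches = $1 }
    $3 ~ /^branch-misses/ { misses = $1 }
    END {
      # "+ 0" forces numeric comparison so "<not counted>" is treated as 0
      if (cycles + 0 > 0)   printf "ipc=%.2f\n", insns / cycles
      if (branches + 0 > 0) printf "branch_miss_rate=%.2f%%\n", 100 * misses / branches
    }'
}
```

Because `perf stat` writes its counters to stderr, one possible invocation is `perf stat -x, -e cycles,instructions,branches,branch-misses ./bench_random_mixed_hakmem 20000000 400 1 2>&1 >/dev/null | perf_ratios`.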
---
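## Phase 90 "Design Decision" Output (input to Phase 91)
Phase 90 ends here. Which of the following to adopt is decided from **the differences observed in Steps 1-4**.
### A) Losing on fixed costs (instructions/branches) (most common pattern)
Aim:
- Evict per-op "rituals" (route/policy/env/gate) from the hot path
- Move toward **commit-once / fixed mode** wherever possible (in a form that avoids layout tax)
Next-phase candidate:
- Phase 91: redefine the "hot path contract" (make "which boxes are never touched" part of the SSOT)
### B) Losing on memory (cache/TLB)
Aim:
- Revisit TLS structure size/placement, ptr→meta reachability, and write ordering (dependency chains)
Next-phase candidate:
- Phase 91: TLS struct packing / hot fields co-location (small and reversible)
### C) The gap is small with the same binary (LD_PRELOAD)
Aim:
- The linked SSOT side has a heavy entry/layout/box chain (or the benchmarks differ)
Next-phase candidate:
- Phase 91: align the linked SSOT entry point with the drop-in path (so the comparison means the same thing)
---
## GO/NO-GO (Phase 90)
The deliverable of Phase 90 is turning measurement and design decisions into an SSOT.
- **GO**: Steps 0-4 are reproducible (the logs are complete and the shape of the gap can be explained)
- **NO-GO**: results are broken by an unset `HAKMEM_PROFILE`, ENV leakage, or similar (fix the SSOT ritual first)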

View File

@ -0,0 +1,157 @@
# Phase 92: tcmalloc Gap Triage SSOT
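## Purpose
Classify, **in a short time**, the cause of the performance gap against tcmalloc detected in Phase 89 (hakmem: 52M vs tcmalloc: 58M).
---
## Known Facts (carried over from Phase 89)
- **hakmem baseline**: 51.36M ops/s (SSOT standard)
- **tcmalloc**: around 58M ops/s (reference value)
- **Gap**: -12.8% (hakmem is slower)
---
## Phase 92 Triage Flow (as little as 1-2 h)
### 1. **Case A: small objects (C4-C6) vs large objects (C7+)**
**Question**: is tcmalloc's advantage specific to small sizes, or is it strong on large sizes?
**Procedure**: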
```bash
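# NOTE: run_mixed_10_cleanenv.sh force-exports HAKMEM_BENCH_C5/C6/C7_ONLY=0,
# so these overrides only take effect if that forcing is relaxed for this triage.
# C6 only (Small, 16-256B)
HAKMEM_BENCH_C6_ONLY=1 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
# C7 only (Large, 1024B+)
HAKMEM_BENCH_C7_ONLY=1 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh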
```
**Verdict**:
- C6 > 52M, C7 < 45M → **the problem is large allocations (C7)**
- C6 < 50M, C7 < 45M → **the problem is evenly distributed**
- C6 > 52M, C7 > 48M → **the problem is elsewhere (memory efficiency?)**
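The three rules above leave some combinations uncovered (for example C6 between 50M and 52M). A minimal sketch of the classification, assuming throughput is given in M ops/s; `judge_case_a` is a hypothetical helper, and the "inconclusive" fallback is an addition for the uncovered combinations, not part of the original criteria.

```bash
# Hypothetical helper: classify Case A from C6-only and C7-only throughput (M ops/s).
judge_case_a() {
  awk -v c6="$1" -v c7="$2" 'BEGIN {
    if      (c6 > 52 && c7 < 45) print "large-alloc (C7) is the problem"
    else if (c6 < 50 && c7 < 45) print "evenly distributed"
    else if (c6 > 52 && c7 > 48) print "elsewhere (memory efficiency?)"
    else                         print "inconclusive"   # not covered by the rules above
  }'
}
```

For instance, `judge_case_a 53.1 44.2` reports the C7 case.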
---
### 2. **Case B: Unified Cache vs Inline Slots**
**Question**: does tcmalloc's advantage come from cache management or from inline optimization?
**Procedure**:
```bash
# All inline slots disabled
HAKMEM_TINY_C6_INLINE_SLOTS=0 HAKMEM_TINY_C5_INLINE_SLOTS=0 \
HAKMEM_TINY_C4_INLINE_SLOTS=0 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
# Unified Cache only (all inline slots OFF)
HAKMEM_UNIFIED_CACHE_ONLY=1 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
```
**Verdict**:
- `-inline > 50M` → **inline slots overhead**
- `-inline < 48M` → **the unified cache itself is slow**
---
### 3. **Case C: fragmentation / reuse efficiency**
**Question**: is the gap due to LIFO vs FIFO, or to an advantage in tcmalloc's reuse strategy?
**Procedure**:
```bash
# LIFO enabled (Phase 15)
HAKMEM_TINY_UNIFIED_LIFO=1 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
# FIFO (default)
RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
```
**Verdict**:
- LIFO > +1% → **FIFO is a suspect**
- LIFO = FIFO ± 0.5% → **LIFO/FIFO is neutral**
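The comparison is a relative difference of the LIFO run against the FIFO (default) run. A minimal sketch, where `judge_case_c` is a hypothetical helper and the "inconclusive" band covers results that fall between the two stated thresholds:

```bash
# Hypothetical helper: compare LIFO vs FIFO throughput (M ops/s) using the
# +1% and +/-0.5% thresholds from the verdict list.
judge_case_c() {
  awk -v lifo="$1" -v fifo="$2" 'BEGIN {
    if (fifo + 0 <= 0) { print "FIFO value must be positive" > "/dev/stderr"; exit 1 }
    d = (lifo - fifo) / fifo * 100
    if      (d > 1.0)               verdict = "FIFO is a suspect"
    else if (d >= -0.5 && d <= 0.5) verdict = "neutral"
    else                            verdict = "inconclusive"   # between the stated bands
    printf "%+.2f%% %s\n", d, verdict
  }'
}
```

For example, `judge_case_c 52.6 52.0` prints `+1.15% FIFO is a suspect`.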
---
### 4. **Case D: page size / pool size**
**Question**: do the memory layout or warm pool size differ between tcmalloc and hakmem?
**Procedure**:
```bash
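# Large pool (more reserved up front, less fragmentation)
HAKMEM_WARM_POOL_SIZE=100000 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
# Small pool (less reserved, re-check efficiency); note this is still above the SSOT default of 16
HAKMEM_WARM_POOL_SIZE=1000 RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
# Default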
RUNS=3 ./scripts/run_mixed_10_cleanenv.sh
```
**Verdict**:
- pool big > baseline → **pool shortage (too many acquisitions)**
- pool small < baseline → **pool shortage (not enough memory)**
- pool default = baseline → **pool size is neutral**
---
## Estimated Measurement Time
| Case | Runs | Time per run | Total |
|--------|--------|----------|------|
| A (C6/C7) | 2×3=6 | 2 min | 12 min |
| B (inline) | 2×3=6 | 2 min | 12 min |
| C (LIFO) | 2×3=6 | 2 min | 12 min |
| D (pool) | 3×3=9 | 2 min | 18 min |
| **Total** | - | - | **54 min** |
---
## Decision Matrix
| Case | Result | Verdict | Next action |
|--------|------|------|-------------|
| A | C6 > 52M, C7 low | C7 is the limiter | Phase 93: C7 optimization |
| B | -inline > 50M | Turn inline OFF in stages | Phase 94: Inline review |
| C | LIFO > +1% | LIFO recommended | Phase 92b: LIFO rollout |
| D | pool_big > +2% | Acquisition is heavy | Phase 95: Pool tuning |
---
## Record Format
Record the results in PHASE92_TCMALLOC_GAP_RESULTS.txt using the following format:
```
=== Phase 92 Triage Results ===
Baseline (51.36M): [ENTER CONTROL VALUE]
Case A (C6 vs C7):
C6-only: [VALUE] ops/s
C7-only: [VALUE] ops/s
Verdict: [CONCLUSION]
Case B (Inline vs Unified):
No-inline: [VALUE] ops/s
Unified-only: [VALUE] ops/s
Verdict: [CONCLUSION]
Case C (LIFO vs FIFO):
LIFO: [VALUE] ops/s
FIFO: [VALUE] ops/s
Verdict: [CONCLUSION]
Case D (Pool sizing):
Pool-big: [VALUE] ops/s
Pool-small: [VALUE] ops/s
Pool-default: [VALUE] ops/s
Verdict: [CONCLUSION]
=== FINAL VERDICT ===
Primary bottleneck: [A|B|C|D|MIXED]
Next phase: Phase 9x [recommendation]
```

View File

@ -0,0 +1,100 @@
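# SSOT Build Modes: Role Definitions for Standard / FAST / OBSERVE
## Purpose
For benchmark measurement, separate the **build mode** from the **measurement mode**,
and make explicit what is measured in each phase.
---
## The Three Modes
### 1. **Standard Build** (`-DNDEBUG`)
- **Role**: production-equivalent, maximum optimization
- **Used for**: full SSOT from Phase 89 onward (A/B tests, GO/NO-GO decisions)
- **Script**: `scripts/run_mixed_10_cleanenv.sh`
- **Output**: Throughput (final score)
- **Characteristics**: LTO, -O3, frame pointer omitted, statistically stable (CV < 2%)
### 2. **FAST Build** (`HAKMEM_BENCH_FAST_MODE=1`)
- **Role**: extract maximum performance (PGO, cache optimization)
- **Used for**: confirming the performance ceiling, validating the design upper bound
- **Script**: `scripts/run_mixed_fast_pgo_ssot.sh` (to be created)
- **Output**: Throughput (ceiling reference)
- **Characteristics**: Profile-Guided Optimization, aggressive inlining
### 3. **OBSERVE Build**
- **Role**: route verification, flow dump
- **Used for**: detecting ENV drift, validating the configuration
- **Script**: `scripts/run_mixed_observe_ssot.sh`
- **Output**: detailed statistics (inline slots activity, unified cache hit/miss, legacy fallback calls)
- **Characteristics**: metrics collection, diagnostic information
---
## SSOT Measurement Procedure (standard pattern)
### Flow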
```
1. OBSERVE (diagnosis)
→ Confirm the route is correct (check for "LEGACY used AND C6 INLINE SLOTS ACTIVE")
→ Detect ENV configuration drift
2. Standard SSOT (control + treatment)
→ IFL=0 (control) 10-run
→ IFL=1 (treatment) 10-run
→ Decide whether the difference is statistically significant
3. if NO-GO → confirm the ceiling with the FAST build
→ Separate "is the design correct?" from "is the implementation correct?"
```
---
## Environment Management per Mode
### Standard
```bash
HAKMEM_BENCH_MIN_SIZE=16 HAKMEM_BENCH_MAX_SIZE=1040
HAKMEM_BENCH_C5_ONLY=0 HAKMEM_BENCH_C6_ONLY=0 HAKMEM_BENCH_C7_ONLY=0
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
```
### FAST (future)
```bash
HAKMEM_BENCH_FAST_MODE=1
HAKMEM_PROFILE=MIXED_TINYV3_C7_FAST_PGO  # profile still to be defined
```
### OBSERVE
```bash
# Standard + diagnostic metrics
HAKMEM_UNIFIED_CACHE_STATS_COMPILED=1
HAKMEM_INLINE_SLOTS_OVERFLOW_STATS=1
```
---
## GO/NO-GO Criteria
| Metric | Threshold | Verdict |
|------|------|------|
| Improvement | ≥ +1.0% | GO |
| CV (coefficient of variation) | < 3% | Statistically stable |
| Regression | < -1.0% | NO-GO (serious) |
| Observed score | baseline × 1.018 or higher | strong GO |
---
## Reference: Phase 91 (C6 IFL) Example
**OBSERVE result**:
- Route check: ✓ LEGACY used AND inline slots active
- Score: 51.47M ops/s
**Standard SSOT result**:
- Control (IFL=0): 52.05M ops/s, CV 1.2%
- Treatment (IFL=1): 52.25M ops/s, CV 1.5%
- Improvement: +0.38%
- Verdict: NEUTRAL (target not met) → NO-GO
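The criteria table can be applied mechanically to a control/treatment pair. This is a minimal sketch under the assumption that the +1.0%, -1.0% and ×1.018 thresholds are all measured against the control mean; `go_no_go` is a hypothetical helper, and CV is not checked here.

```bash
# Hypothetical helper: apply the GO/NO-GO thresholds to control and treatment
# means (M ops/s). CV must still be checked separately.
go_no_go() {
  awk -v ctl="$1" -v trt="$2" 'BEGIN {
    if (ctl + 0 <= 0) { print "control must be positive" > "/dev/stderr"; exit 1 }
    d = (trt - ctl) / ctl * 100
    if      (d >= 1.8) v = "STRONG-GO"
    else if (d >= 1.0) v = "GO"
    else if (d < -1.0) v = "NO-GO (regression)"
    else               v = "NEUTRAL"
    printf "%+.2f%% %s\n", d, v
  }'
}

go_no_go 52.05 52.25   # Phase 91 C6 IFL example
```

With the Phase 91 numbers this prints `+0.38% NEUTRAL`, which the project treats as NO-GO because the target was not met.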

View File

@ -122,6 +122,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
core/box/../front/../box/../front/../box/tiny_c4_inline_slots_env_box.h \
core/box/../front/../box/../front/../box/../hakmem_build_flags.h \
core/box/../front/../box/../front/../box/tiny_c5_inline_slots_env_box.h \
core/box/../front/../box/../front/../box/tiny_inline_slots_overflow_stats_box.h \
core/box/../front/../box/tiny_c5_inline_slots_env_box.h \
core/box/../front/../box/../front/tiny_c5_inline_slots.h \
core/box/../front/../box/../front/../box/tiny_c5_inline_slots_env_box.h \
@ -142,6 +143,9 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
core/box/../front/../box/tiny_inline_slots_fixed_mode_box.h \
core/box/../front/../box/tiny_inline_slots_switch_dispatch_box.h \
core/box/../front/../box/tiny_inline_slots_switch_dispatch_fixed_box.h \
core/box/../front/../box/tiny_c6_inline_slots_ifl_env_box.h \
core/box/../front/../box/tiny_c6_inline_slots_ifl_tls_box.h \
core/box/../front/../box/tiny_c6_intrusive_freelist_box.h \
core/box/../front/../box/tiny_front_cold_box.h \
core/box/../front/../box/tiny_layout_box.h \
core/box/../front/../box/tiny_hotheap_v2_box.h \
@ -184,6 +188,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
core/box/../front/../box/tiny_metadata_cache_env_box.h \
core/box/../front/../box/hakmem_env_snapshot_box.h \
core/box/../front/../box/tiny_unified_cache_fastapi_env_box.h \
core/box/../front/../box/tiny_inline_slots_overflow_stats_box.h \
core/box/../front/../box/tiny_ptr_convert_box.h \
core/box/../front/../box/tiny_front_stats_box.h \
core/box/../front/../box/free_path_stats_box.h \
@ -415,6 +420,7 @@ core/box/../front/../box/../front/../box/tiny_c3_inline_slots_env_box.h:
core/box/../front/../box/../front/../box/tiny_c4_inline_slots_env_box.h:
core/box/../front/../box/../front/../box/../hakmem_build_flags.h:
core/box/../front/../box/../front/../box/tiny_c5_inline_slots_env_box.h:
core/box/../front/../box/../front/../box/tiny_inline_slots_overflow_stats_box.h:
core/box/../front/../box/tiny_c5_inline_slots_env_box.h:
core/box/../front/../box/../front/tiny_c5_inline_slots.h:
core/box/../front/../box/../front/../box/tiny_c5_inline_slots_env_box.h:
@ -435,6 +441,9 @@ core/box/../front/../box/../front/../box/tiny_c3_inline_slots_env_box.h:
core/box/../front/../box/tiny_inline_slots_fixed_mode_box.h:
core/box/../front/../box/tiny_inline_slots_switch_dispatch_box.h:
core/box/../front/../box/tiny_inline_slots_switch_dispatch_fixed_box.h:
core/box/../front/../box/tiny_c6_inline_slots_ifl_env_box.h:
core/box/../front/../box/tiny_c6_inline_slots_ifl_tls_box.h:
core/box/../front/../box/tiny_c6_intrusive_freelist_box.h:
core/box/../front/../box/tiny_front_cold_box.h:
core/box/../front/../box/tiny_layout_box.h:
core/box/../front/../box/tiny_hotheap_v2_box.h:
@ -477,6 +486,7 @@ core/box/../front/../box/tiny_front_hot_box.h:
core/box/../front/../box/tiny_metadata_cache_env_box.h:
core/box/../front/../box/hakmem_env_snapshot_box.h:
core/box/../front/../box/tiny_unified_cache_fastapi_env_box.h:
core/box/../front/../box/tiny_inline_slots_overflow_stats_box.h:
core/box/../front/../box/tiny_ptr_convert_box.h:
core/box/../front/../box/tiny_front_stats_box.h:
core/box/../front/../box/free_path_stats_box.h:

View File

@ -10,6 +10,22 @@ ws=${WS:-400}
runs=${RUNS:-10}
bin=${BENCH_BIN:-./bench_random_mixed_hakmem}
# SSOT header: bin sha / profile / iters / ws / runs
echo "[SSOT-HEADER] bin=$(sha256sum "${bin}" | cut -c1-8) profile=${profile} iters=${iters} ws=${ws} runs=${runs}"
# Bench size range SSOT (bench_random_mixed.c reads these).
# IMPORTANT: we FORCE these to avoid leaked exports causing "wrong classes exercised"
# (e.g. only <=256B => C4/C5/C6 inline-slots never invoked).
ssot_min_size=${SSOT_MIN_SIZE:-16}
ssot_max_size=${SSOT_MAX_SIZE:-1040} # matches bench default (16..1040 ≒ 16..1024)
export HAKMEM_BENCH_MIN_SIZE="${ssot_min_size}"
export HAKMEM_BENCH_MAX_SIZE="${ssot_max_size}"
# Disable fixed-size bench modes (must be forced to avoid leaks).
export HAKMEM_BENCH_C5_ONLY=0
export HAKMEM_BENCH_C6_ONLY=0
export HAKMEM_BENCH_C7_ONLY=0
# Keep profiles reproducible even if user exported env vars.
case "${profile}" in
MIXED_TINYV3_C7_BALANCED)
@ -53,6 +69,11 @@ export HAKMEM_TINY_INLINE_SLOTS_FIXED=${HAKMEM_TINY_INLINE_SLOTS_FIXED:-1}
# NOTE: Phase 80-1 winner (Switch dispatch for inline slots, removes if-chain comparisons)
export HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=${HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH:-1}
if [[ "${HAKMEM_BENCH_HEADER_LOG:-1}" == "1" ]]; then
sha="$(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
echo "[SSOT] sha=${sha} bin=${bin} profile=${profile} iters=${iters} ws=${ws} runs=${runs} size=${ssot_min_size}..${ssot_max_size}" >&2
fi
if [[ "${HAKMEM_BENCH_ENV_LOG:-0}" == "1" ]]; then
if [[ -x ./scripts/bench_env_banner.sh ]]; then
./scripts/bench_env_banner.sh >&2 || true

View File

@ -0,0 +1,47 @@
#!/usr/bin/env bash
set -euo pipefail
# Single-run OBSERVE helper for "is the path actually executed?" checks.
#
# This script is intentionally NOT a throughput SSOT runner.
# It is a pre-flight: verify route/banner + per-class counters + stats are non-zero.
#
# Usage:
# ./scripts/run_mixed_observe_ssot.sh
# WS=400 ITERS=20000000 ./scripts/run_mixed_observe_ssot.sh
#
# Requires: `make bench_random_mixed_hakmem_observe`
profile=${HAKMEM_PROFILE:-MIXED_TINYV3_C7_SAFE}
iters=${ITERS:-20000000}
ws=${WS:-400}
bin=${BENCH_BIN:-./bench_random_mixed_hakmem_observe}
# SSOT header: bin sha / profile / iters / ws
echo "[SSOT-HEADER] bin=$(sha256sum "${bin}" | cut -c1-8) profile=${profile} iters=${iters} ws=${ws} mode=OBSERVE"
# Force the same size range as SSOT to avoid class distribution drift.
export HAKMEM_BENCH_MIN_SIZE=${SSOT_MIN_SIZE:-16}
export HAKMEM_BENCH_MAX_SIZE=${SSOT_MAX_SIZE:-1040}
export HAKMEM_BENCH_C5_ONLY=0
export HAKMEM_BENCH_C6_ONLY=0
export HAKMEM_BENCH_C7_ONLY=0
# One-shot route configuration banner (Phase 70-1).
export HAKMEM_ROUTE_BANNER=1
# Keep cleanenv defaults aligned with the main runner for knobs that affect control flow.
export HAKMEM_WARM_POOL_SIZE=${HAKMEM_WARM_POOL_SIZE:-16}
export HAKMEM_TINY_C4_INLINE_SLOTS=${HAKMEM_TINY_C4_INLINE_SLOTS:-1}
export HAKMEM_TINY_C5_INLINE_SLOTS=${HAKMEM_TINY_C5_INLINE_SLOTS:-1}
export HAKMEM_TINY_C6_INLINE_SLOTS=${HAKMEM_TINY_C6_INLINE_SLOTS:-1}
export HAKMEM_TINY_INLINE_SLOTS_FIXED=${HAKMEM_TINY_INLINE_SLOTS_FIXED:-1}
export HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH=${HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH:-1}
export HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED=${HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH_FIXED:-0}
if [[ "${HAKMEM_BENCH_HEADER_LOG:-1}" == "1" ]]; then
sha="$(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
echo "[OBSERVE] sha=${sha} bin=${bin} profile=${profile} iters=${iters} ws=${ws} size=${HAKMEM_BENCH_MIN_SIZE}..${HAKMEM_BENCH_MAX_SIZE}" >&2
fi
HAKMEM_PROFILE="${profile}" "${bin}" "${iters}" "${ws}" 1