Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate the allocator logic delta from the binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed**: allocator delta negligible, layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@ -0,0 +1,208 @@
# Phase 16: Front FastLane Alloc LEGACY Direct v1 - A/B Test Results

**Date**: 2025-12-15
**Status**: NEUTRAL (+0.62%)

---

## Executive Summary

Phase 16 v1 attempted to reduce alloc-side fixed costs by adding a LEGACY direct path to the FastLane entry point, bypassing route/policy overhead for LEGACY allocations. The optimization mirrored the free-side winning pattern (Phase 9/10).

**Result**: +0.62% on Mixed (NEUTRAL), below the +1.0% GO threshold.

**Critical Issue Discovered**: The initial implementation caused a segmentation fault for classes C4-C7. Root cause: `unified_cache_refill()` incompatibility. **Safety fix applied**: the optimization is limited to C0-C3 only (matching the existing dualhot pattern).

**Verdict**: NEUTRAL; freeze as research box (default OFF).

---

## A/B Test Results

### Mixed (16-1024B, 10-run clean env)

**Baseline** (ENV=0):
- Mean: 47,510,791 ops/s
- Median: 47,606,360 ops/s
- Runs: 48151673, 47596179, 47735208, 47903499, 46674576, 47977105, 47236265, 47481537, 46735322, 47616542

**Optimized** (ENV=1):
- Mean: 47,803,890 ops/s
- Median: 47,901,551 ops/s
- Runs: 47401229, 47908200, 48158776, 48126240, 47477867, 47894902, 47644796, 48191059, 47930512, 47305320

**Delta**:
- Mean: **+0.62%**
- Median: **+0.62%**
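The deltas above are the relative change between the two configurations. As a minimal sketch, assuming the ±1.0% GO/NO-GO thresholds from the Phase 16 design, the calculation could look like this in C. The helper names `mean_ops`, `delta_percent`, and `classify_delta` are hypothetical and are not hakmem APIs.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n throughput samples (ops/s). */
static double mean_ops(const double *runs, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return sum / (double)n;
}

/* Relative change of `optimized` versus `baseline`, in percent. */
static double delta_percent(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

typedef enum {
    VERDICT_NO_GO   = -1,
    VERDICT_NEUTRAL = 0,
    VERDICT_GO      = 1
} verdict_t;

/* GO at +1.0% or more, NO-GO at -1.0% or less, NEUTRAL in between. */
static verdict_t classify_delta(double delta_pct) {
    if (delta_pct >= 1.0)  return VERDICT_GO;
    if (delta_pct <= -1.0) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

Feeding the 10 baseline and 10 optimized runs listed above into `mean_ops` and `delta_percent` gives roughly +0.62%, which `classify_delta` maps to NEUTRAL.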

**Verdict**: NEUTRAL (below +1.0% GO threshold)

---

### C6-heavy Regression Check (5-run)

**Baseline** (ENV=0):
- Mean: 21,134,240 ops/s
- Median: 21,186,983 ops/s
- Runs: 21186983, 21327420, 20807950, 21112023, 21236823

**Optimized** (ENV=1):
- Mean: 21,147,197 ops/s
- Median: 21,139,301 ops/s
- Runs: 21358869, 21209299, 20992077, 21139301, 21036438

**Delta**:
- Mean: **+0.06%**
- Median: **-0.23%**

**Verdict**: PASS (no significant regression)

---

## Implementation Summary

### Files Modified

1. **`core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`** (new)
   - L0 ENV gate for LEGACY direct feature
   - ENV: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)
   - API: `front_fastlane_alloc_legacy_direct_enabled()`, `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`

2. **`core/box/front_fastlane_box.h`**
   - Added LEGACY direct early-exit in `front_fastlane_try_malloc()` (lines 93-119)
   - **SAFETY CONSTRAINT**: Limited to C0-C3 only due to refill incompatibility for C4-C7
   - Direct conditions: ENV enabled + static route ready + LEGACY route confirmed
   - Direct path: `tiny_hot_alloc_fast()` → `tiny_cold_refill_and_alloc()` → fallback to `malloc_tiny_fast_for_class()`

3. **`core/bench_profile.h`**
   - Added `front_fastlane_alloc_legacy_direct_env_refresh_from_env()` to refresh sync group

4. **`Makefile`**
   - Added `front_fastlane_alloc_legacy_direct_env_box.o` to OBJS_BASE, SHARED_OBJS, TINY_BENCH_OBJS_BASE

---

## Critical Bug & Fix

### Issue: Segmentation Fault (Exit Code 139)

**Symptom**: Benchmark crashed with ENV=1 during larger workloads (20M iterations).

**Root Cause**:
- Crash occurred in `unified_cache_refill()` → `tiny_next_read()` (intrusive pointer read)
- Initial implementation attempted to use direct path for ALL classes (C0-C7)
- Classes C4-C7 triggered incompatibility with `unified_cache_refill()` logic
- Existing dualhot code (Phase ALLOC-TINY-FAST-DUALHOT-2) only operates on C0-C3

**Backtrace**:
```
#0 0x0000555555564d89 in tiny_next_read.lto_priv.5.lto_priv ()
#1 0x00007ffff7b00318 in ?? ()
#2 0x0000555555557f29 in unified_cache_refill ()
```

**Fix Applied**:
- Limited LEGACY direct path to C0-C3 only (line 96 of front_fastlane_box.h)
- Added safety comment explaining constraint
- Matches existing proven pattern from dualhot implementation

**Code Change**:
```c
// Before (CRASHED):
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled(), 0)) {

// After (SAFE):
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
```

---

## Analysis

### Why +0.62% is Below Threshold

1. **Limited Scope**: Optimization only applies to C0-C3 due to safety constraint
   - C4-C7 continue using full route/policy path
   - Mixed benchmark uses all size classes (16-1024B = C0-C5 primarily)

2. **Existing Optimizations**: dualhot (Phase ALLOC-TINY-FAST-DUALHOT-2) already optimizes C0-C3
   - LEGACY direct overlaps with dualhot coverage
   - Marginal benefit when dualhot is disabled, but default config has dualhot enabled in some profiles

3. **Route Overhead Not Dominant**: After Phase 6 FastLane collapse, route/policy fixed costs are already minimized
   - Phase 14-15 (cache shape) also showed NEUTRAL results
   - Suggests current bottleneck is not in dispatch layers

### Root Cause of Limited Benefit

The optimization targets the same problem space as existing dualhot but with different enablement conditions:
- **dualhot**: Always enabled for C0-C3, no route check
- **LEGACY direct**: ENV-gated, requires static route confirmation

When both are active, LEGACY direct provides minimal incremental value.

---

## Recommendations

1. **Freeze as Research Box** (default OFF)
   - ENV remains opt-in: `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
   - No preset promotion
   - Keep code for potential future use if dualhot is disabled

2. **Investigate C4-C7 Refill Issue**
   - Root cause: Why does `unified_cache_refill()` fail for C4-C7 in this path?
   - Possible causes:
     - LIFO mode interaction (Phase 15)
     - Cache state assumptions in refill logic
     - Intrusive pointer corruption
   - **Action**: Debug under controlled conditions before expanding to C4-C7

3. **Shift Focus Away from Dispatch Layers**
   - Phase 14, 15, 16 all showed NEUTRAL results
   - Phase 6 FastLane already collapsed major dispatch overhead
   - **Next direction**: Investigate cache miss costs, memory layout, or backend allocation

4. **Consider Dualhot/LEGACY Direct Consolidation**
   - If LEGACY direct is kept, evaluate merging with dualhot logic
   - Avoid code duplication and overlap

---

## Comparison with Recent Phases

| Phase | Target | Delta (Mixed) | Verdict |
|-------|--------|---------------|---------|
| Phase 10 | Free LEGACY direct | +1.89% | **GO** |
| Phase 13 v1 | C7 preserve header | -0.40% | NEUTRAL (freeze) |
| Phase 14 v1 | tcache intrusive | +0.20% | NEUTRAL (freeze) |
| Phase 14 v2 | tcache hot integration | +0.08% | NEUTRAL (freeze) |
| Phase 15 v1 | UnifiedCache FIFO→LIFO | -0.70% | NEUTRAL (freeze) |
| **Phase 16 v1** | **Alloc LEGACY direct** | **+0.62%** | **NEUTRAL (freeze)** |

**Pattern**: Post-Phase-10 optimizations consistently show NEUTRAL results. Major gains came from earlier phases (FastLane collapse +11.13%, Free DeDup +5.18%, etc.). Current bottleneck likely not in dispatch/routing layers.

---

## Files Changed

- `core/box/front_fastlane_alloc_legacy_direct_env_box.h` (new)
- `core/box/front_fastlane_alloc_legacy_direct_env_box.c` (new)
- `core/box/front_fastlane_box.h` (modified)
- `core/bench_profile.h` (modified)
- `Makefile` (modified)

---

## ENV Variables

- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default: 0, opt-in)

---

## Next Steps

1. **Freeze Phase 16** with default OFF
2. **Commit with verdict**: "Phase 16 v1: NEUTRAL (+0.62%), research box"
3. **Update CURRENT_TASK.md** with Phase 16 summary
4. **Shift optimization focus** based on profiling/analysis (away from dispatch layers)
@ -0,0 +1,133 @@
# Phase 16: Front FastLane Alloc LEGACY Direct v1 (turning the alloc-side "second-stage hot" path into a monolithic early-exit)

**Date**: 2025-12-15
**Status**: DESIGN (Phase 16 kickoff)

---

## 0. Executive Summary (one page)

The Phase 14-15 line of work (pointer-chase / cache-shape) came out **NEUTRAL** and is frozen.
The next step returns to **cutting fixed costs in instruction count and branches** rather than tuning cache shape.

Since Phase 6 consolidated `malloc()` into FastLane, the call path is almost always:

```
malloc() → front_fastlane_try_malloc(size) → malloc_tiny_fast_for_class(size, class_idx)
```

However, **even on the LEGACY route**, `malloc_tiny_fast_for_class()` still pays fixed costs such as the ULTRA/C7 early branches, route_kind resolution, ENV config reads, and dispatch shape.
The free side (Phase 9/10/6-2) won by moving toward a "monolithic early-exit", so applying the same winning pattern on the alloc side, **sending LEGACY straight through at the FastLane entry**, is expected to have a high ROI.

Phase 16 keeps Box Theory intact and adds a single "LEGACY direct" path to FastLane alloc:

- **On hit**: `tiny_hot_alloc_fast(class_idx)` → return immediately (route/policy is never touched)
- **On miss**: `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
- **When uncertain**: fall back to the existing `malloc_tiny_fast_for_class()` (single boundary)

---

## 1. Current State (why)

- Phase 6 (Front FastLane) collapsed wrapper→gate→policy→route and substantially reduced the fixed cost at the entry.
- As a result, the remaining alloc-side cost tends to concentrate in **the branches, ENV reads, and route resolution inside `malloc_tiny_fast_for_class()`**.
- Changing the UnifiedCache shape in Phase 14/15 did not move Mixed, so **cache shape is not currently dominant**.

Phase 16 therefore **cuts route/policy fixed costs** without changing cache internals.

---

## 2. Proposal (Phase 16 v1)

### 2.1 Boxes to Add (Box Theory)

```
L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / rollback)
    ↓
L1: front_fastlane_try_malloc() (LEGACY direct early-exit)
    ↓
L2: malloc_tiny_fast_for_class() (existing: route/policy/ULTRA/MID/V7)
    ↓
L3: tiny_front_hot_box / tiny_front_cold_box (existing: unified cache / refill)
```

**Single boundary**:
- Direct conditions not met, or the direct path fails → drop to `malloc_tiny_fast_for_class()`.

### 2.2 ENV

- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0, opt-in)

Start as opt-in for A/B testing.
If the result is GO, consider promoting it to a preset (staged, starting with MIXED only).

### 2.3 Direct Conditions (Fail-Fast)

The alloc direct path is restricted to cases where the route **can be asserted with certainty**:

Required conditions (recommended):
- FastLane is enabled (existing)
- `size <= tiny_get_max_size()` (existing)
- `class_idx` is valid (existing)
- The class is included in `front_fastlane_class_mask` (existing)
- `tiny_static_route_ready_fast()` is true (do not use the direct path when it is false, e.g. due to the Learner interlock)
- `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)

Then:
- `tiny_hot_alloc_fast(class_idx)` → return on hit
- On miss, call `tiny_cold_refill_and_alloc(class_idx)` (existing cold boundary)
- Only if that also returns NULL, fall back to `malloc_tiny_fast_for_class()` (safety first)

---

## 3. Observability (minimal)

Always-on logging is forbidden in Release builds.
If needed, expose the following only under `HAKMEM_DEBUG_COUNTERS=1`:

- `front_fastlane_alloc_legacy_direct_hit`
- `front_fastlane_alloc_legacy_direct_miss`
- `front_fastlane_alloc_legacy_direct_fallback`

(Keep atomics confined to the stats box; do not place atomics on the hot side.)

---

## 4. A/B Measurement (same binary)

GO/NO-GO (Mixed 10-run, clean env):
- GO: mean of +1.0% or more
- NO-GO: mean of -1.0% or less (immediate rollback / freeze)
- NEUTRAL: within ±1.0% (freeze as research box)

Targets:
- `scripts/run_mixed_10_cleanenv.sh`
- Additionally, a C6-heavy 5-run (to confirm no regression)

---

## 5. Risks and Mitigations

### Risk 1: The "asserted LEGACY" assumption breaks and allocations are misrouted

Mitigation:
- Make `tiny_static_route_ready_fast()` a required condition (it is expected to be false when the Learner is enabled)
- Always check route_kind (do not rely on the mask alone)
- On failure, always fall back to the existing path

### Risk 2: The direct path covers too little to show an effect

Mitigation:
- First make the LEGACY ratio in Mixed visible via stats (debug counters only)
- If it has no effect, freeze it (same treatment as Phase 14/15)

### Risk 3: The added branch backfires (a repeat of Phase 11)

Mitigation:
- Evaluate the direct condition **only once, inside FastLane** (do not add call-site helpers)
- When the direct condition is false, call the existing `malloc_tiny_fast_for_class()` unchanged
@ -0,0 +1,124 @@
# Phase 16: Front FastLane Alloc LEGACY Direct v1 - Next Instructions

Design: `docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md`

---

## 0. Status / Why now

- Phase 14-15 (tcache / FIFO→LIFO) were **NEUTRAL** → frozen (default OFF)
- The next target is not cache shape but **reducing alloc-side route/policy fixed costs**
- On the free side, the "monolithic early-exit + dedup" pattern from Phase 9/10/6-2 was the winner → apply the same pattern to the alloc side

---

## 1. GO Criteria

Mixed 10-run (clean env):
- **GO**: mean of +1.0% or more
- **NO-GO**: mean of -1.0% or less (immediate rollback / freeze)
- **NEUTRAL**: within ±1.0% → freeze as research box

Additional gate (required):
- In an environment where `tiny_static_route_ready_fast()` is true, the LEGACY direct path is actually taken (ideally confirmed via debug counters)

---

## 2. Box Diagram (single boundary)

```
L0: front_fastlane_alloc_legacy_direct_env_box (ENV gate / refresh)
    ↓
L1: front_fastlane_box.h (early-exit inside try_malloc)
    ↓
L2: malloc_tiny_fast_for_class() (existing path)
```

The boundary is fixed at a single point: **direct conditions not met, or direct returns NULL → `malloc_tiny_fast_for_class`**.

---

## 3. Patch Order (small increments)

### Patch 1: L0 ENV gate box (reversible)

New:
- `core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c}`

ENV:
- `HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1` (default 0)

API (example):
- `front_fastlane_alloc_legacy_direct_enabled() -> int`
- `front_fastlane_alloc_legacy_direct_env_refresh_from_env()`

Requirements:
- No `getenv()` on the hot path (use a cached value)
- Provide a refresh function to stay in sync with `putenv()` calls in `bench_profile` (Phase 8/13/14/15 pattern)
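The two requirements above (a cached gate plus an explicit refresh) can be sketched as follows. This is an illustrative C sketch, not the actual hakmem source: the function names follow the API listed above, while the cache variable `g_legacy_direct_cached`, the "first character is 1" parsing rule, and the lack of thread-safety handling are assumptions made for brevity.

```c
#include <stdlib.h>

/* Cached gate value: -1 = not yet read, 0 = off, 1 = on (assumed encoding). */
static int g_legacy_direct_cached = -1;

/* Re-read the environment variable and update the cache.
 * Intended for cold code, e.g. right after a profile has called putenv(). */
void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void) {
    const char *v = getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT");
    g_legacy_direct_cached = (v != NULL && v[0] == '1') ? 1 : 0;
}

/* Gate query: getenv() runs only on the very first call;
 * afterwards this is a load and a compare. */
int front_fastlane_alloc_legacy_direct_enabled(void) {
    if (g_legacy_direct_cached < 0) {
        front_fastlane_alloc_legacy_direct_env_refresh_from_env();
    }
    return g_legacy_direct_cached;
}
```

Because the value is cached, a later change to the environment is not visible until the refresh function runs, which is why the refresh hook is needed for `bench_profile`.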

### Patch 2: Integration point (a single path in FastLane alloc)

Target:
- `core/box/front_fastlane_box.h`

Change:
- In `front_fastlane_try_malloc()`, add the following direct path after the class mask check

Direct conditions (Fail-Fast):
1. `front_fastlane_alloc_legacy_direct_enabled() == 1`
2. `tiny_static_route_ready_fast()` is true (the direct path is forbidden when it is false, e.g. due to the Learner interlock)
3. `tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY` (LEGACY asserted)

Direct path body:
- `void* p = tiny_hot_alloc_fast(class_idx);`
- `if (p) return p;`
- `p = tiny_cold_refill_and_alloc(class_idx);`
- `if (p) return p;`
- Only on failure, fall back to `malloc_tiny_fast_for_class(size, class_idx)` (the safe side)

Notes:
- Prioritize not adding call-site helpers (lesson from Phase 11)
- Only **LEGACY** takes the direct path (ULTRA/MID/V7 are left to the existing path)
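Put together, the conditions and the direct path body above form one early-exit with a single fallback boundary. The C sketch below shows that control flow only. Every function body, the `g_*` variables, the `small_route_kind_t` type, and the wrapper name `try_malloc_legacy_direct` are hypothetical stand-ins so the example compiles on its own; the argument types are assumed, and the C0-C3 class limit described in the Phase 16 results is not included.

```c
#include <stddef.h>

/* ---- Hypothetical stand-ins; the real functions live in hakmem. ---- */
typedef enum { SMALL_ROUTE_LEGACY = 0, SMALL_ROUTE_OTHER = 1 } small_route_kind_t;

static int  g_direct_enabled  = 1;  /* stands in for the cached ENV gate */
static int  g_route_ready     = 1;  /* stands in for static-route readiness */
static int  g_hot_has_block   = 1;  /* fake hot cache: hit or miss */
static int  g_cold_can_refill = 1;  /* fake refill: success or failure */
static char g_hot_slot[64], g_cold_slot[64], g_legacy_slot[64];

static int front_fastlane_alloc_legacy_direct_enabled(void) { return g_direct_enabled; }
static int tiny_static_route_ready_fast(void) { return g_route_ready; }
static small_route_kind_t tiny_static_route_get_kind_fast(int class_idx) {
    /* Arbitrary rule for the demo: low classes are LEGACY. */
    return class_idx <= 3 ? SMALL_ROUTE_LEGACY : SMALL_ROUTE_OTHER;
}
static void *tiny_hot_alloc_fast(int class_idx) {
    (void)class_idx;
    return g_hot_has_block ? (void *)g_hot_slot : NULL;
}
static void *tiny_cold_refill_and_alloc(int class_idx) {
    (void)class_idx;
    return g_cold_can_refill ? (void *)g_cold_slot : NULL;
}
static void *malloc_tiny_fast_for_class(size_t size, int class_idx) {
    (void)size; (void)class_idx;
    return g_legacy_slot;  /* existing full route/policy path */
}

/* Direct path: all three conditions must hold; any miss or failure
 * falls through to the existing path at exactly one place. */
static void *try_malloc_legacy_direct(size_t size, int class_idx) {
    if (front_fastlane_alloc_legacy_direct_enabled() &&
        tiny_static_route_ready_fast() &&
        tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY) {
        void *p = tiny_hot_alloc_fast(class_idx);
        if (p) return p;                        /* hot hit */
        p = tiny_cold_refill_and_alloc(class_idx);
        if (p) return p;                        /* refilled on miss */
    }
    return malloc_tiny_fast_for_class(size, class_idx);  /* single boundary */
}
```

Keeping the fallback as the last statement means a false condition and a failed direct attempt share the same exit, which matches the "single boundary" rule.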

### Patch 3: bench_profile sync (prevent ENV leakage)

Target:
- `core/bench_profile.h`

Change:
- Add `front_fastlane_alloc_legacy_direct_env_refresh_from_env();` to the refresh group under `#ifdef USE_HAKMEM`

---

## 4. A/B (same binary)

Baseline:
```sh
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 scripts/run_mixed_10_cleanenv.sh
```

Optimized:
```sh
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 scripts/run_mixed_10_cleanenv.sh
```

Additional (regression detection):
```sh
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=1 HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1 ./bench_mid_large_mt_hakmem 1 20000000 400 1
```

---

## 5. Health Check

```sh
scripts/verify_health_profiles.sh
```

---

## 6. Rollback

- `export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0`
@ -0,0 +1,89 @@
# Phase 17: FORCE_LIBC Gap Validation v1 - A/B Test Results

**Date**: 2025-12-15
**Verdict**: ✅ **Case B confirmed**: **Layout / I-cache penalty dominates**

---

## Executive Summary

Phase 17 validated the "system malloc is faster than hakmem" observation while avoiding the classic layout/LTO trap by running a **same-binary A/B**:

- Same binary (`bench_random_mixed_hakmem`) with `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator logic delta is negligible**.
- The large performance gap appears only when comparing to the tiny `bench_random_mixed_system` binary.

Conclusion: The dominant gap is **binary text size + layout → I-cache thrash + instruction footprint**, not allocator algorithm efficiency.

---

## Measurement Setup

Workload:
- `bench_random_mixed_*` (Mixed 16–1024B), working set `WS=400`
- Clean ENV baseline via `scripts/run_mixed_10_cleanenv.sh`

Two comparisons:
1) **Same-binary toggle** (allocator logic delta)
2) **System binary** (layout penalty delta)

---

## Results

### 1) Same-binary A/B (allocator delta)

Binary: `bench_random_mixed_hakmem`
Toggle: `HAKMEM_FORCE_LIBC_ALLOC=0/1`

| Mode | Throughput (ops/s) | Delta |
|------|---------------------|-------|
| hakmem (`FORCE_LIBC=0`) | 48.12M | - |
| libc (`FORCE_LIBC=1`) | 48.31M | **+0.39%** |

Interpretation: the allocator logic delta is at noise level in this experiment.

### 2) System binary (layout penalty)

Binary: `bench_random_mixed_system`

| Mode | Throughput (ops/s) | Delta vs libc-in-hakmem-binary |
|------|---------------------|--------------------------------|
| system malloc | 83.85M | **+73.57%** |

Total observed gap (system binary vs hakmem): roughly +74%.

---

## Perf Stat (200M iterations) - Smoking Gun

| Metric | hakmem binary | system binary | Delta |
|--------|---------------|---------------|-------|
| I-cache misses | 153K | 68K | **-55%** |
| Cycles | 17.9B | 10.2B | **-43%** |
| Instructions | 41.3B | 21.5B | **-48%** |
| Binary size | 653K | 21K | **-97%** |

Interpretation:
- The system binary executes roughly **half the instructions**, with **far fewer I-cache misses**.
- The 30× text footprint difference strongly correlates with the gap.
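The interpretation above can be made concrete by normalizing the counters per benchmark iteration. The following is a minimal C sketch using the rounded values from the table; the `perf_run_t` type and the helper names are hypothetical. With these inputs it yields roughly 206 vs 108 instructions and 90 vs 51 cycles per iteration, while instructions per cycle stay close (about 2.3 vs 2.1), which is consistent with instruction count being the main driver of the cycle gap.

```c
#include <assert.h>

/* Counters for one perf stat run (rounded values from the table above). */
typedef struct {
    double cycles;        /* total cycles */
    double instructions;  /* total retired instructions */
    double iterations;    /* benchmark iterations (200M in Phase 17) */
} perf_run_t;

/* Instructions executed per benchmark iteration. */
static double instr_per_iter(const perf_run_t *r) {
    assert(r->iterations > 0.0);
    return r->instructions / r->iterations;
}

/* Cycles spent per benchmark iteration. */
static double cycles_per_iter(const perf_run_t *r) {
    assert(r->iterations > 0.0);
    return r->cycles / r->iterations;
}

/* Instructions per cycle (IPC). */
static double ipc(const perf_run_t *r) {
    assert(r->cycles > 0.0);
    return r->instructions / r->cycles;
}
```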

---

## Conclusion

Phase 12's "system malloc is 1.6× faster" observation was real, but the root cause was misattributed:

- ❌ Not primarily allocator algorithm differences
- ✅ **Text/layout + I-cache locality + instruction footprint**

This shifts the optimization frontier:

- Stop chasing more routing/dispatch micro-opt (Phase 14–16 plateau)
- Focus on **Hot Text Isolation / layout control**

---

## Next

Proceed to:
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
@ -0,0 +1,130 @@
# Phase 17: FORCE_LIBC Gap Validation (same-binary A/B) - Next Instructions

## Status (premises)

- Phase 14–16 are **NEUTRAL / frozen as research boxes** (the dispatch / cache-shape / pointer-chase line has plateaued)
- Phase 16 v1 (FastLane alloc LEGACY direct) is **NEUTRAL (+0.62%)** and **limited to C0–C3** (C4–C7 are excluded for safety because of a segfault)
- Phase 12 observed that "system malloc is faster than hakmem", but **cross-binary comparisons are easily skewed by layout/LTO differences**

The goal of Phase 17 is to A/B `hakmem` vs `libc malloc` **within the same binary** and record, as the single source of truth (SSOT), what the gap actually consists of (allocator delta or binary delta).

---

## 0. Goals (Deliverables)

1) **Same-binary A/B**: using `bench_random_mixed_hakmem`
   - A: `HAKMEM_FORCE_LIBC_ALLOC=0` (hakmem)
   - B: `HAKMEM_FORCE_LIBC_ALLOC=1` (libc)

2) **Gap decomposition against a separate binary** (optional)
   - Also measure `bench_random_mixed_system` (a small binary) and compare it with `libc-in-hakmem-binary` to estimate the **layout penalty**

3) **Decide the next main battleground** (a direction decision, not a GO/NO-GO)

---

## 1. Procedure (reproducibility first)

### 1.1 Build (pinned to the same commit)

```sh
make -j bench_random_mixed_hakmem bench_random_mixed_system
```

### 1.2 Clean ENV (pin the Phase 14–16 research knobs)

Recommended: use `scripts/run_mixed_10_cleanenv.sh` (prevents ENV leakage).

Additionally, set the following explicitly (to make sure Phase 16 is OFF):

```sh
export HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0
```

### 1.3 Same-binary A/B (the main event)

**A: hakmem**

```sh
HAKMEM_FORCE_LIBC_ALLOC=0 scripts/run_mixed_10_cleanenv.sh
```

**B: libc (same binary)**

```sh
HAKMEM_FORCE_LIBC_ALLOC=1 scripts/run_mixed_10_cleanenv.sh
```

Record:
- mean / median / stdev (10-run)
- Min/Max

### 1.4 Optional: system binary baseline (layout penalty estimate)

```sh
for i in $(seq 1 10); do
  echo "=== Run ${i}/10 (system bin) ==="
  ./bench_random_mixed_system "${ITERS:-20000000}" "${WS:-400}" 1 2>&1 | rg "Throughput" || true
done
```

Interpretation:
- `system bin` is much faster than `FORCE_LIBC` → the **layout/text size penalty** dominates
- `FORCE_LIBC` is much faster than `hakmem` → the **allocator logic delta** dominates

---

## 2. Decision (direction branches)

### Case A: `FORCE_LIBC` is **+20% or more** faster than hakmem

Conclusion: the bulk of the gap lies in allocator logic (instruction count / fixed costs).

Next core task (recommended):
- **Phase 18: Free FastPath Gate Consolidation**
  - Consolidate the ENV gate / TLS probe inside `free_tiny_fast()` into a single check at the FastLane entry
  - Goal: cut per-call gate fixed costs while keeping the winning "monolithic early-exit" pattern
  - Box boundary: a single point, `front_fastlane_try_free()` → `free_tiny_fast_with_snapshot()`
  - Reversible: `HAKMEM_FREE_TINY_FAST_SNAPSHOT=0/1`

### Case B: `FORCE_LIBC` is within **±5%** of hakmem

Conclusion: the allocator delta is small, and Phase 12's "system malloc 1.6x" is most likely due to other factors (binary differences / measurement setup).

Next core task (recommended):
- **Phase 18: Hot Text Isolation / Layout Control**
  - Move cold code out via `__attribute__((cold,noinline))` plus a separate TU
  - If possible, stabilize the I-cache via link order (pinning the order of hot functions)
  - Run the A/B in the same binary with `HAKMEM_LAYOUT_MODE=0/1` (toggling only sections/attributes)

### Case C: `FORCE_LIBC` is faster than hakmem, but there is also a large gap to `system bin`

Conclusion: **both** an allocator delta and a layout penalty are present.

Next core task:
- First reduce the **layout penalty** (Phase 18 Hot Text Isolation)
- Then move on to **gate consolidation** (Phase 19)
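The three cases above amount to a small decision rule. The sketch below encodes it in C under stated assumptions: the +20% and ±5% thresholds come from the text, while the `large_gap_pct` parameter (what counts as a large gap to `system bin`), the `GAP_CASE_UNCLASSIFIED` result for inputs the text does not cover, and all type and function names are hypothetical. With the Phase 17 measurements (48.12M, 48.31M, and 83.85M ops/s) the rule lands in Case B.

```c
#include <assert.h>

typedef enum {
    GAP_CASE_A,            /* allocator logic dominates */
    GAP_CASE_B,            /* allocator delta is small; other factors likely */
    GAP_CASE_C,            /* allocator delta and layout penalty both present */
    GAP_CASE_UNCLASSIFIED  /* not covered by the three cases */
} gap_case_t;

/* How much faster `a` is than `b`, in percent (negative if slower). */
static double pct_faster(double a, double b) {
    assert(b > 0.0);
    return (a - b) / b * 100.0;
}

/* Inputs are throughputs in ops/s for: hakmem (FORCE_LIBC=0),
 * libc in the same binary (FORCE_LIBC=1), and the separate system binary. */
static gap_case_t classify_gap(double hakmem_ops, double libc_ops,
                               double system_ops, double large_gap_pct) {
    double libc_vs_hakmem = pct_faster(libc_ops, hakmem_ops);
    double system_vs_libc = pct_faster(system_ops, libc_ops);

    if (libc_vs_hakmem >= 20.0) {
        return GAP_CASE_A;
    }
    if (libc_vs_hakmem >= -5.0 && libc_vs_hakmem <= 5.0) {
        return GAP_CASE_B;
    }
    if (libc_vs_hakmem > 5.0 && system_vs_libc >= large_gap_pct) {
        return GAP_CASE_C;
    }
    return GAP_CASE_UNCLASSIFIED;
}
```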

---

## 3. Observability (minimal)

- Save the raw throughput of the 10 runs (the output log of `scripts/run_mixed_10_cleanenv.sh` is sufficient)
- Additionally, take a single `perf stat` run (200M iters, 1-run):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_FORCE_LIBC_ALLOC=0 \
  ./bench_random_mixed_hakmem 200000000 400 1
```

Take one more run with the same command and `HAKMEM_FORCE_LIBC_ALLOC=1`.

---

## 4. Key Rules (Box Theory)

- Run A/B in the **same binary** (so that layout/LTO differences do not cause misjudgment)
- Every new optimization must have an ENV gate (reversible) and a single boundary
- When in doubt, prefer "Fail-Fast with fallback" (consistency over speed)
135
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md
Normal file
@ -0,0 +1,135 @@
# Phase 18: Hot Text Isolation v1 - Design

## 0. Context (from Phase 17)

Phase 17 established **Case B**:
- Same-binary `HAKMEM_FORCE_LIBC_ALLOC=0/1` shows **allocator delta is negligible**.
- The large gap appears vs the tiny `bench_random_mixed_system` binary.

Signal:
- I-cache misses / instructions / cycles are far worse in the hakmem-linked binary.
- Binary size (`~653K`) vs system (`~21K`) correlates with the throughput gap.

Ref: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`

---

## 1. Goal

Reduce **hot-path instruction footprint** and improve **I-cache locality** in the hakmem-linked binary, without changing allocator algorithms.

Primary success metric:
- Mixed (16–1024B) throughput improvement, with accompanying reductions in:
  - `iTLB/icache misses` (or "I-cache misses" counter used in Phase 17)
  - total instructions executed per 200M iters

---

## 2. Non-goals

- No allocator algorithm redesign.
- No behavioral changes to safety/Fail-Fast semantics (only layout/placement changes).
- No "delete code = faster" experiments (Phase 17 showed layout dominates; deletions confound results).

---

## 3. Box Theory framing

This is a "build/layout box":
- **Box**: HotTextIsolationBox (compile-time layout controls + annotations)
- **Boundary**: build flag / TU split (no runtime overhead)
- **Rollback**: single Makefile knob (`HOT_TEXT_ISOLATION=0/1`) or `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
- **Observability**: perf stat + binary size (no always-on logs)

---

## 4. Design: v1 tactics (low-risk)

### 4.1 Hot/Cold attributes SSOT

Introduce a single header defining attributes:
- `HAK_HOT_FN` → `__attribute__((hot))` (and optionally `.text.hak_hot`)
- `HAK_COLD_FN` → `__attribute__((cold,noinline))` (and optionally `.text.hak_cold`)

Activated only when `HAKMEM_HOT_TEXT_ISOLATION=1`.

Why:
- Makes "what is hot/cold" explicit and consistent (SSOT).
- Lets us annotate a small set of functions without scattering ad-hoc attributes.
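A header along these lines could provide the two macros. This is a hedged sketch for GCC/Clang only: the macro names and the `HAKMEM_HOT_TEXT_ISOLATION` switch come from the text above, while the fallback-to-empty behavior and the two `example_*` functions are assumptions added for illustration. The optional section placement is left out because section naming is platform-specific.

```c
/* Default to OFF so the annotations are opt-in at build time. */
#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0
#endif

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
/* hot: optimize aggressively and group with other hot code.
 * cold + noinline: keep out of line and away from hot text. */
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
/* Knob off (or unknown compiler): annotations expand to nothing. */
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

/* Hypothetical example functions, not hakmem code. */
HAK_COLD_FN static int example_slow_fallback(int x) {
    return x * 2 + 1;  /* placeholder for rare, diagnostic-heavy work */
}

HAK_HOT_FN static int example_fast_path(int x) {
    /* Straight-line "try, then return" shape; the rare case is out of line. */
    if (__builtin_expect(x >= 0, 1)) {
        return x + 1;
    }
    return example_slow_fallback(x);
}
```

Building the same file with and without `-DHAKMEM_HOT_TEXT_ISOLATION=1` should give identical behavior and differ only in code placement, which is what makes the knob suitable for a same-source A/B.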
|
||||
|
||||
### 4.2 Translation-unit split for wrappers
|
||||
|
||||
Move wrapper definitions out of `core/hakmem.c` into a dedicated TU:
|
||||
- `core/hak_wrappers_box.c` includes `core/box/hak_wrappers.inc.h`
|
||||
|
||||
Why:
|
||||
- Prevents wrapper text from being interleaved with unrelated code in the same TU.
|
||||
- Improves the linker’s ability to cluster hot code.
|
||||
- Enables future link-order experiments (symbol ordering files) without touching allocator logic.
|
||||
|
||||
### 4.3 Cold code isolation
|
||||
|
||||
Ensure rarely-hit helpers stay cold/out-of-line:
|
||||
- wrapper diagnostics (`wrapper_record_fallback`, ptr trace dumps, verbose logging)
|
||||
- “slow fallback” paths (`malloc_cold`, `free_cold`)
|
||||
|
||||
Principle:
|
||||
- Hot path must remain a straight-line “try → return” shape.
|
||||
- Anything that allocates/logs/diagnoses is cold and must not be inlined into hot wrappers.
|
||||
|
||||
### 4.4 Optional: section GC for bench builds
|
||||
|
||||
For bench binaries only:
|
||||
- add `-ffunction-sections -fdata-sections`
|
||||
- link with `-Wl,--gc-sections`
|
||||
|
||||
Why:
|
||||
- Drops truly-unused text and reduces overall text pressure.
|
||||
- Helps the linker keep hot text denser.
|
||||
|
||||
This is optional because it is toolchain-sensitive; measure before promoting.
|
||||
|
||||
---

## 7. v2 Extension (if v1 is NEUTRAL): BENCH_MINIMAL compile-out

Phase 17 shows the hakmem-linked binary executes roughly 2x as many instructions as the tiny system binary (41.3B vs 21.5B over 200M iters). If v1 (TU split/attributes) is NEUTRAL, the next likely lever is **not placement-only**, but **removing per-call fixed costs** from the hot path by compiling them out in a bench-only build.

Concept:
- Introduce `HAKMEM_BENCH_MINIMAL=1` build mode (Makefile knob)
- In this mode:
  - “promoted defaults” are treated as compile-time constants (FastLane ON, snapshots ON, etc.)
  - ENV gates become compile-time (no TLS/env probing in hot path)
  - Hot counters/stats macros compile out completely
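
One way the ENV-gate item could look in C is sketched below. Only the `HAKMEM_BENCH_MINIMAL` knob comes from the text above; `demo_fastlane_enabled`, `demo_pick_path`, and the `HAKMEM_DEMO_FASTLANE` variable name are hypothetical. With the knob set, the gate folds to a constant, so the compiler can drop both the probe and the dead branch.

```c
#include <assert.h>
#include <stdlib.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default: runtime-configurable build */
#endif

static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

#if HAKMEM_BENCH_MINIMAL
/* Bench build: the promoted default is a compile-time constant. */
static inline int demo_fastlane_enabled(void) { return 1; }
#else
/* Normal build: probe the environment once, then reuse the cached value.
 * Not thread-safe; a real gate would need TLS or atomics. */
static inline int demo_fastlane_enabled(void)
{
    static int cached = -1;                              /* -1 = not probed yet */
    if (cached < 0) {
        const char *v = getenv("HAKMEM_DEMO_FASTLANE");  /* hypothetical name */
        cached = (v == NULL || v[0] != '0');             /* default ON */
    }
    return cached;
}
#endif

/* Caller: in the bench build this branch is resolved at compile time. */
int demo_pick_path(void)
{
    return demo_fastlane_enabled() ? 1 /* FastLane */ : 0 /* legacy */;
}
```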

Why this still fits Box Theory:
- It is a **build box** (reversible by knob), not an algorithm rewrite
- Boundaries remain: hot path stays Fail-Fast; cold fallback remains intact
- Observability shifts to `perf stat` (no always-on logging)

Expected impact:
- If instruction footprint is truly dominant, this is the first place to see **double-digit gains** (+10–20%).

## 5. Risks / mitigations

### Risk A: layout tweaks regress throughput

Mitigation:
- A/B using the same workload + perf stat counters (Phase 17 set).
- If regression: keep as research-only (build knob default OFF).

### Risk B: Toolchain sensitivity (ld vs lld, LTO interactions)

Mitigation:
- Keep v1 minimal (TU split + attributes first).
- Only enable `--gc-sections` if it’s stable in the current toolchain.

---

## 6. Expected impact

Conservative:
- +3–10% throughput improvement on Mixed by reducing instruction footprint and I-cache misses.

Stretch goal:
- Bring “hakmem-linked + FORCE_LIBC” closer to the `bench_random_mixed_system` ceiling by minimizing the wrapper text working set.

165
docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
Normal file
@ -0,0 +1,165 @@
# Phase 18: Hot Text Isolation v1 — Next Instructions

## Status

- Phase 17 confirms **Case B**: allocator logic delta is negligible; gap is **layout/I-cache**.
- Next: reduce instruction footprint + improve I-cache locality via **Hot Text Isolation**.

Refs:
- Phase 17 results: `docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md`
- Phase 18 design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`

---

## 0. Goal / Success Criteria

Primary (v1 is expected to be low-risk with a modest effect):
- GO if Mixed (16–1024B) throughput improves by **+2%** or more (a realistic bar for layout work)

Secondary (must move in the right direction):
- I-cache misses reduced (rule of thumb: **-10%** or better)
- Total instructions reduced (rule of thumb: **-5%** or better)

If throughput is NEUTRAL but counters improve significantly, keep as research box and iterate once.
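
The secondary criteria reduce to relative deltas between baseline and optimized counters. A small C sketch of that arithmetic follows, assuming the counter values are copied by hand from `perf stat` output. `pct_change` and `secondary_ok` are hypothetical helper names; the -10% and -5% cutoffs are the rule-of-thumb values listed above.

```c
#include <assert.h>
#include <stdbool.h>

/* Relative change from baseline to optimized, in percent (negative = reduced). */
static double pct_change(double baseline, double optimized)
{
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Secondary criteria: I-cache misses down by >= 10%, instructions down by >= 5%. */
static bool secondary_ok(double icache_base, double icache_opt,
                         double insn_base, double insn_opt)
{
    return pct_change(icache_base, icache_opt) <= -10.0 &&
           pct_change(insn_base, insn_opt) <= -5.0;
}
```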

---

## 1. Patch Plan (small, reversible)

### Patch 1: Hot/Cold attribute SSOT (L0 Box)

Add:
- `core/box/hot_text_attrs_box.h`

Defines:
- `HAK_HOT_FN`, `HAK_COLD_FN` (no-op when `HAKMEM_HOT_TEXT_ISOLATION=0`)

Usage:
- annotate only a short, high-impact list first:
  - wrappers: `malloc/free/calloc/realloc`
  - FastLane entry helpers (if non-inline)
  - cold helpers: `malloc_cold/free_cold`, wrapper diagnostics

Rollback: turn the build knob off.
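
A sketch of what `core/box/hot_text_attrs_box.h` might contain, assuming GCC or Clang. Only the macro names `HAK_HOT_FN` / `HAK_COLD_FN` and the `HAKMEM_HOT_TEXT_ISOLATION` knob come from the plan above; the attribute choices, the `demo_*` functions, and the 128-byte class width are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

#ifndef HAKMEM_HOT_TEXT_ISOLATION
#define HAKMEM_HOT_TEXT_ISOLATION 0   /* default OFF: macros expand to nothing */
#endif

static_assert(HAKMEM_HOT_TEXT_ISOLATION == 0 || HAKMEM_HOT_TEXT_ISOLATION == 1,
              "HAKMEM_HOT_TEXT_ISOLATION must be 0 or 1");

#if HAKMEM_HOT_TEXT_ISOLATION && (defined(__GNUC__) || defined(__clang__))
#define HAK_HOT_FN  __attribute__((hot))
#define HAK_COLD_FN __attribute__((cold, noinline))
#else
#define HAK_HOT_FN
#define HAK_COLD_FN
#endif

static size_t demo_fallback_count;    /* hypothetical diagnostic counter */

/* Cold helper: diagnostics only; stays out of line when the knob is ON. */
HAK_COLD_FN static void demo_record_fallback(void)
{
    demo_fallback_count++;
}

/* Hot helper: maps a request size to a hypothetical 128-byte size class. */
HAK_HOT_FN static int demo_size_class(size_t size)
{
    if (size == 0 || size > 1024) {
        demo_record_fallback();
        return -1;                    /* caller takes the slow path */
    }
    return (int)((size - 1) / 128);
}
```

Keeping the attributes behind one header means rollback is a single `-D` change rather than an edit at each annotated function.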

### Patch 2: Wrapper TU split (L1 Box boundary)

Move wrapper definitions out of `core/hakmem.c`:
- new: `core/hak_wrappers_box.c`
  - `#include "box/hak_wrappers.inc.h"`
- remove wrapper include from `core/hakmem.c`

Rationale:
- Prevents wrapper text from being interleaved with unrelated code in one TU.
- Sets up link-order clustering.

Rollback: restore include in `core/hakmem.c` and drop new TU.

### Patch 3 (optional): bench-only section GC

Makefile knob:
- `HOT_TEXT_ISOLATION=0/1`

When `=1`, add for bench builds:
- `-DHAKMEM_HOT_TEXT_ISOLATION=1`
- `-ffunction-sections -fdata-sections`
- `LDFLAGS += -Wl,--gc-sections`

Notes:
- Keep it bench-only first (do not touch shared lib build until proven stable).
- If toolchain rejects `--gc-sections` or results are unstable → skip this patch.

---

## 2. A/B Procedure (required)

### 2.1 Baseline build (OFF)

```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (1 run, 200M iters):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.2 Optimized build (ON)

```sh
make clean
make -j HOT_TEXT_ISOLATION=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```

Perf stat (same command):

```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1
```

### 2.3 System ceiling check (optional)

```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput" || true
```

---

## 3. GO/NO-GO Decision

- **GO**: Mixed 10-run mean improves by **+2%** or more, with no health regressions
- **NEUTRAL**: within ±2% → keep as research box, iterate once (more cold isolation or better clustering)
- **NO-GO**: **-2%** or worse → rollback and freeze
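
The decision rule can be written down as a small C helper. This is a sketch under assumptions: `classify` and `verdict_t` are hypothetical names, throughput is taken as the 10-run mean in M ops/s, and a result of +2% or more with a health regression is treated as NEUTRAL here because the list above does not say how to classify that case.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Classify an A/B result from baseline and optimized mean throughput. */
static verdict_t classify(double base_mops, double opt_mops, bool health_ok)
{
    assert(base_mops > 0.0);
    double delta_pct = (opt_mops - base_mops) / base_mops * 100.0;

    if (delta_pct <= -2.0)
        return VERDICT_NO_GO;          /* rollback and freeze */
    if (delta_pct >= 2.0 && health_ok)
        return VERDICT_GO;             /* promote */
    return VERDICT_NEUTRAL;            /* keep as research box */
}
```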

Health profiles:

```sh
scripts/verify_health_profiles.sh
```

---

## 4. Reporting (required artifacts)

Create:
- `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
  - throughput A/B (10-run)
  - binary sizes
  - perf stat table (cycles/instructions/I-cache)
  - conclusion (GO/NEUTRAL/NO-GO)

Update:
- `CURRENT_TASK.md` (Phase 18 status + next)

---

## 5. Notes / guardrails

- This phase intentionally compares **different binaries** (layout is the target), but keep the environment clean (`env -i`, fixed profile, same machine).
- Avoid “delete code” experiments; only isolate/cold/cluster.
- Keep “cold” truly cold: no allocations, no logging, no TLS-heavy helpers.

---

## 6. If v1 is NEUTRAL: proceed immediately to Phase 18 v2 (BENCH_MINIMAL)

To directly cut the “instructions 2x” gap from Phase 17, layout work alone is probably not enough; it will likely be necessary to **compile out the fixed ENV/stats/debug costs mixed into the hot path**.

Next move (bench-only binary, reversible):

- With `HAKMEM_BENCH_MINIMAL=1` (Makefile knob):
  - Pin the “always-ON” FastLane / wrapper paths and turn ENV gates into compile-time constants
  - Compile out hot counters completely
- Observe via `perf stat` only (no always-on logging)

Expected: +10–20% (if instruction footprint really is the dominant cost, this is where throughput should move substantially)
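
Compiling out hot counters could be done with a stats macro along these lines. This is a sketch: `DEMO_STAT_INC`, `demo_alloc_hits`, and `demo_alloc_step` are hypothetical names, and only the `HAKMEM_BENCH_MINIMAL` knob comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0        /* default: counters stay enabled */
#endif

static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

#if HAKMEM_BENCH_MINIMAL
/* Bench build: expands to a no-op, so no load/add/store on the hot path. */
#define DEMO_STAT_INC(counter) ((void)0)
#else
#define DEMO_STAT_INC(counter) ((void)((counter)++))
#endif

static uint64_t demo_alloc_hits;      /* hypothetical hot counter */

/* Stand-in for a hot-path step that bumps a counter on every call. */
static int demo_alloc_step(int x)
{
    DEMO_STAT_INC(demo_alloc_hits);
    return x + 1;
}
```

In the bench build the counter variable is never touched, which is why observation has to move to `perf stat`.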