diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 7cae4f45..f0683415 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -469,38 +469,63 @@ Cumulative improvements achieved in Phases 6-10:
 
 ---
 
-### Phase 18: Hot Text Isolation / Layout Control - NEXT
+### Phase 18: Hot Text Isolation - IN PROGRESS
 
-**Goal**: Improve I-cache efficiency through binary layout optimization and reduce the gap to the system binary.
+**Goal**: Reduce the gap to the system binary (+74.26%) through binary optimization. Phase 17 showed that the layout penalty dominates, so the work is split into a two-stage strategy.
 
 **Strategy**:
-1. **Cold Code Isolation** (priority 1)
-   - Move stats collection, debug logging, and error handlers into a separate TU
-   - Mark them explicitly cold with `__attribute__((cold, noinline))`
-   - Expected effect: I-cache misses -20%
 
-2. **Link-Order Optimization** (priority 2)
-   - Place hot functions contiguously (linker script or link-order control)
-   - `-ffunction-sections` + custom linker script
-   - Expected effect: I-cache misses -10%
+#### Phase 18 v1: Layout optimization (section-based) - ❌ NO-GO (2025-12-15)
 
-3. **Profile-Guided Optimization** (priority 3, optional)
-   - Measurement-based placement with `-fprofile-generate` + `-fprofile-use`
-   - Expected effect: I-cache misses -10-20%
+**Attempt**: Improve I-cache behavior with `-ffunction-sections -fdata-sections -Wl,--gc-sections`
+**Result**:
+- Throughput: -0.87% (48.94M → 48.52M ops/s)
+- I-cache misses: **+91.06%** (131K → 250K) ← smoking gun
+- Variance: +80%
 
-**Build Gate**: `HOT_TEXT_ISOLATION=0/1` (for layout A/B testing)
+**Cause**: Section splitting without explicit hot symbol ordering destroyed code locality
+**Lesson**: Layout tweaks are fragile. Without an ordering strategy they are harmful.
 
-**Target**:
-- v1 (TU split / attrs / optional gc-sections): **GO at +2%** (NEUTRAL is the likely outcome)
-- v2 (BENCH_MINIMAL compile-out): aim for **+10-20%** (cuts the instruction footprint directly)
+**Decision**: Freeze v1 (safely isolated in the Makefile)
+- `HOT_TEXT_ISOLATION=1` → attributes only (safe, no effect)
+- `HOT_TEXT_GC_SECTIONS=1` → section splitting (NO-GO, disabled)
 
-**Design**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
-**Instructions**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
-**Result (v1)**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md` (❌ NO-GO / I-cache misses got worse)
+**Files**:
+- Design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
+- Instructions: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
+- Results: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
 
-Implementation gates (reversible):
-- Makefile knob: `HOT_TEXT_ISOLATION=0/1`
-- Compile-time: `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`
+#### Phase 18 v2: BENCH_MINIMAL (instruction removal) - NEXT
+
+**Strategy**: Remove instruction footprint at compile time
+- Stats collection: FRONT_FASTLANE_STAT_INC → no-op
+- ENV checks: runtime lookup → constant
+- Debug logging: removed via conditional compilation
+
+**Expected effect**:
+- Instructions: -30-40%
+- Throughput: +10-20%
+
+**GO criteria** (STRICT):
+- Throughput: **+5% minimum** (+8% recommended)
+- Instructions: **-15% minimum** ← smoking gun for success
+- I-cache: improves automatically (follows the instruction reduction)
+
+If the instruction reduction is smaller than 15%: abandon (the allocator is not the bottleneck)
+
+**Build Gate**: `BENCH_MINIMAL=0/1` (production safe, opt-in)
+
+**Files**:
+- Design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`
+- Instructions: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md`
+- Implementation: next step
+
+**Implementation plan**:
+1. Add a BENCH_MINIMAL knob to the Makefile
+2. Make the stats macros conditional
+3. Turn ENV checks into constants
+4. Wrap debug logging
+5. Judge the A/B test against the +5% throughput / -15% instructions criteria
 
 ## Update notes (2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot)
 
diff --git a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md
new file mode 100644
index 00000000..b1fde2bb
--- /dev/null
+++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md
@@ -0,0 +1,274 @@
+# Phase 18 v2: Hot Text Isolation - BENCH_MINIMAL Design
+
+## Context (Phase 18 v1 failed)
+
+Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).
+
+**Result**: CATASTROPHIC I-cache regression (+91%).
+
+**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.
+
+**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.
+
+---
+
+## 0. Goal (High Impact)
+
+Reduce **instruction footprint** per allocation/free by removing non-essential code paths at compile-time:
+- Stats collection (counter increments on every operation)
+- Environment variable checks (TLS lookups)
+- Debug logging (conditional output)
+
+**Expected impact**:
+- Instruction count: -30-40% (benchmark workload)
+- I-cache misses: automatic improvement from smaller working set
+- Throughput: +10-20% (Phase 17 measured the system binary at -48% instructions vs hakmem; the throughput gap should close in proportion to the instructions removed)
+
+**GO Criteria** (strict):
+- Throughput: +5% minimum, +8% preferred
+- Instructions: -15% minimum (clear, measurable reduction)
+- I-cache: should improve proportionally to instruction reduction
+
+If instructions do not drop meaningfully, abandon this phase.
+
+---
+
+## 1. Strategy: BENCH_MINIMAL Build Mode
+
+### 1.1 What to Remove
+
+**Category A: Stats collection** (highest frequency, low value for bench)
+```c
+FRONT_FASTLANE_STAT_INC(malloc_total);        // ← 1 per malloc
+FRONT_FASTLANE_STAT_INC(malloc_hit);          // ← 1 per malloc
+FRONT_FASTLANE_STAT_INC(malloc_fallback_*);   // ← conditional
+free_tiny_fast_mono_stat_inc_hit();           // ← 1 per free
+```
+
+**Category B: Environment variable checks** (per-operation cost unless short-circuited)
+```c
+if (hakmem_env_snapshot_enabled()) { ... }    // TLS + branch
+if (front_fastlane_enabled()) { ... }         // TLS + branch
+```
+
+**Category C: Debug logging / verbose output** (rarely needed in bench)
+```c
+hakmem_tls_trace_alloc();                     // log path
+hakmem_diag_record_fallback();
+```
+
+### 1.2 Compile-Time Gate
+
+New build mode:
+```
+HAKMEM_BENCH_MINIMAL=0/1  (default 0, opt-in)
+```
+
+Activated via Makefile knob:
+```makefile
+BENCH_MINIMAL ?= 0
+ifeq ($(BENCH_MINIMAL),1)
+  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
+  # ... applies to both bench binaries and shared lib
+endif
+```
+
+**Important**: This is **not** a research-only knob. It's a supported build mode that only disables instrumentation; it stays opt-in and is intended for bench builds (see Notes).
+
+### 1.3 Safeguards
+
+1. **Conditional compilation markers**:
+   - Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
+   - Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
+   - Ensure the code is still correct when BENCH_MINIMAL is OFF (normal operation)
+
+2. **No behavioral changes**:
+   - Fast paths must remain identical (stats are just instrumentation)
+   - Slow paths (fallback, cold) can be simplified (less instrumentation)
+
+3. **Rollback**: Single build knob, reversible
+
+---
+
+## 2. Implementation Plan
+
+### Phase 2.1: Stats removal (Priority 1)
+
+**File**: `core/box/front_fastlane_stats_box.h`
+
+```c
+#if HAKMEM_BENCH_MINIMAL
+// Bench: stats are compiled out
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    do { (void)0; } while(0)  // ← becomes no-op
+#else
+// Normal: relaxed atomic counter increment
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
+#endif
+```
+
+**File**: `core/front/malloc_tiny_fast.h`
+
+Search for all `free_tiny_fast_mono_stat_*` calls and wrap:
+```c
+#if !HAKMEM_BENCH_MINIMAL
+free_tiny_fast_mono_stat_inc_hit();
+#endif
+```
+
+**Expected savings**: ~20-30 instructions/op (counter increments + atomic read-modify-write cost)
+
+### Phase 2.2: Environment variable check removal (Priority 1)
+
+**File**: `core/box/hakmem_env_snapshot_box.h`
+
+```c
+#if !HAKMEM_BENCH_MINIMAL
+// Normal: Check ENV at runtime
+// (static inline so the header can be included from several TUs)
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return g_env_snapshot_enabled;
+}
+#else
+// Bench: Always enable snapshot (one-time cost)
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return 1;  // ← compile-time constant, branch folded by the optimizer
+}
+#endif
+```
+
+**File**: `core/bench_profile.h`
+
+Similar pattern for profile-related checks.
+
+**Expected savings**: ~5-10 instructions/op (TLS lookups + branches)
+
+### Phase 2.3: Debug logging removal (Priority 2, if needed)
+
+**File**: trace/logging functions
+
+Conditional compilation for verbose paths (already often guarded by `HAKMEM_BUILD_DEBUG`).
+
+**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)
+
+---
+
+## 3. Risks / Mitigations
+
+### Risk A: Benchmark no longer representative
+
+If we remove instrumentation, does the bench still measure the allocator?
+
+**Mitigation**: BENCH_MINIMAL disables stats (instrumentation), not core logic.
+- Fast paths remain identical.
+- Only instrumentation overhead is removed.
+- This is similar to how production binaries disable debug tracing.
+
+### Risk B: Build regression
+
+Conditional compilation could hide bugs.
+
+**Mitigation**:
+- Ensure `BENCH_MINIMAL=0` (default) is always tested first.
+- Test both modes in CI if available.
+- Manually verify that the code compiles in both modes.
+
+### Risk C: Instruction count doesn't drop
+
+If removing stats + ENV checks doesn't clearly drop instructions, it likely means:
+- Compiler/LTO already optimizes these away in RELEASE builds
+- The overhead is elsewhere (memory access patterns, branch misprediction)
+
+**Mitigation**: Run perf stat with `BENCH_MINIMAL=1` to confirm instruction reduction. If the reduction is smaller than 10% (well short of the -15% GO threshold), reconsider strategy.
+
+---
+
+## 4. Expected Impact
+
+**Conservative** (stats only):
+- Throughput: +5-8%
+- Instructions: -20-30%
+- I-cache: improvement follows from instruction reduction
+
+**Ambitious** (stats + ENV + debug):
+- Throughput: +10-20%
+- Instructions: -30-40%
+- I-cache: -20-30% (proportional to instruction reduction)
+
+**System binary ceiling**: 85M ops/s
+- Baseline: 48M ops/s
+- Target: roughly 50-55M ops/s (via +5-15%)
+- Ambitious +10-20% would give about 53-58M ops/s; 60-68M ops/s would require throughput to scale fully with a 20-30% instruction reduction (48M / 0.8 to 48M / 0.7)
+
+---
+
+## 5. Box Theory Compliance
+
+**Box**: BenchMinimalBox
+- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
+- **Scope**: Instrumentation removal only (no algorithm changes)
+- **Rollback**: Single build knob (default OFF, backward compatible)
+- **Observability**: Perf stat before/after (instructions, I-cache, throughput)
+
+---
+
+## 6. Success Criteria
+
+### GO (Proceed):
+- Throughput: **+5% or more**
+- Instructions: **-15% or more** (smoking gun for success)
+- I-cache: should improve proportionally
+
+### NEUTRAL (keep as research box):
+- Throughput: within ±3% (noise)
+- Instructions: -5% to -15% (marginal)
+
+### NO-GO (freeze):
+- Throughput: < -2% or no improvement
+- Instructions: reduction smaller than 5%, i.e. delta > -5% (failed optimization objective)
+
+---
+
+## 7. Files to Modify
+
+1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
+   - Already created in Phase 18 v1
+
+2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
+   - Update macro to be no-op when BENCH_MINIMAL=1
+
+3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
+   - Wrap free_tiny_fast_mono_stat_* calls
+
+4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
+   - Make hakmem_env_snapshot_enabled() constant when BENCH_MINIMAL=1
+
+5. **`core/bench_profile.h`** (profile checks)
+   - Simplify profile-related checks
+
+6. **`Makefile`**
+   - Add `BENCH_MINIMAL ?= 0` knob
+   - Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled
+
+---
+
+## 8. Notes
+
+- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
+- **Bench-only**: This mode is intended for benchmark binary builds, not shipped libraries.
+- **Phase 17 learning**: The system binary executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves -30% to -40%, we should see +10-20% throughput.
+
+---
+
+## 9. Next Steps After Phase 18 v2
+
+### If GO (+5%+):
+- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
+- Update CURRENT_TASK with Phase 18 v2 victory
+- Prepare Phase 19 (next optimization frontier)
+
+### If NEUTRAL:
+- Investigate why instructions didn't drop (compiler already optimized?)
+- Consider aggressive options: SIMD, allocation batching, lock-free
+
+### If NO-GO:
+- Acknowledge that Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
+- Shift focus to broader architectural changes
diff --git a/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md
new file mode 100644
index 00000000..62ed97b1
--- /dev/null
+++ b/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md
@@ -0,0 +1,315 @@
+# Phase 18 v2: Hot Text Isolation - BENCH_MINIMAL Implementation Instructions
+
+## Status
+
+- Phase 18 v1 (layout + sections) → NO-GO (I-cache regression)
+- Phase 18 v2 (instruction removal) → Next phase
+- Strategy: Remove stats/ENV overhead at compile-time
+- Expected: +10-20% throughput, -30-40% instructions
+
+Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`
+
+---
+
+## 0. Goal / Success Criteria
+
+**GO threshold** (STRICT):
+- Throughput: **+5% minimum** (+8% preferred)
+- Instructions: **-15% minimum** (clear proof of concept)
+- I-cache: automatic improvement from smaller footprint
+
+**NEUTRAL**:
+- Throughput: within ±3% (marginal)
+- Instructions: -5% to -15% (incomplete optimization)
+
+**NO-GO**:
+- Throughput: < -2% (regression)
+- Instructions: reduction smaller than 5%, i.e. delta > -5% (failed to reduce overhead)
+
+If instructions do not drop by at least 15%, abandon this phase (allocator is not the bottleneck).
+
+---
+
+## 1. Implementation Steps
+
+### 1.1 Add Makefile knob
+
+**File**: `Makefile`
+
+After Phase 18 v1 section (around line 140), add:
+
+```makefile
+# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
+BENCH_MINIMAL ?= 0
+ifeq ($(BENCH_MINIMAL),1)
+  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
+  CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
+  # Note: Both bench and shared lib will disable instrumentation
+  # Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled)
+endif
+```
+
+**Location**: After `HOT_TEXT_GC_SECTIONS` section (line ~145)
+
+### 1.2 Update front_fastlane_stats_box.h
+
+**File**: `core/box/front_fastlane_stats_box.h`
+
+Find the `FRONT_FASTLANE_STAT_INC` macro (currently defined as atomic increment).
+
+Replace entire macro section:
+
+```c
+// Before:
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    do { \
+        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
+            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
+        } \
+    } while(0)
+
+// After:
+#if HAKMEM_BENCH_MINIMAL
+#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while(0)
+#else
+#define FRONT_FASTLANE_STAT_INC(stat) \
+    do { \
+        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
+            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
+        } \
+    } while(0)
+#endif
+```
+
+**Rationale**: Stats collection becomes no-op in BENCH_MINIMAL, compiled away entirely.
+
+### 1.3 Update malloc_tiny_fast.h free stats
+
+**File**: `core/front/malloc_tiny_fast.h`
+
+Search for `free_tiny_fast_mono_stat_inc_*` calls (probably in the free_tiny_fast function).
+
+Wrap each call:
+
+```c
+// Before:
+free_tiny_fast_mono_stat_inc_hit();
+
+// After:
+#if !HAKMEM_BENCH_MINIMAL
+free_tiny_fast_mono_stat_inc_hit();
+#endif
+```
+
+Search pattern: `free_tiny_fast_mono_stat_` (should find ~5-10 calls)
+
+**Rationale**: Similar to malloc, free stats become optional in BENCH_MINIMAL.
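Step 1.3 repeats the same `#if` guard at every `free_tiny_fast_mono_stat_*` call site. A minimal sketch of an alternative, assuming the free-side stat helpers can be routed through a single macro: gate the macro once, as step 1.2 does for `FRONT_FASTLANE_STAT_INC`, and leave call sites untouched. All `demo_*` / `DEMO_*` names below are hypothetical stand-ins, not hakmem identifiers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Default to the instrumented build, mirroring BENCH_MINIMAL ?= 0. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Reject typos such as -DHAKMEM_BENCH_MINIMAL=2 at compile time. */
static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

/* Hypothetical stats block standing in for the real counters. */
typedef struct {
    atomic_uint_fast64_t free_hit;
} demo_stats_t;

static demo_stats_t g_demo_stats; /* zero-initialized (static storage) */

#if HAKMEM_BENCH_MINIMAL
/* Bench build: the increment disappears from every call site. */
#define DEMO_STAT_INC(field) do { (void)0; } while (0)
#else
/* Normal build: relaxed atomic increment. */
#define DEMO_STAT_INC(field) \
    do { \
        (void)atomic_fetch_add_explicit(&g_demo_stats.field, 1, \
                                        memory_order_relaxed); \
    } while (0)
#endif

/* Hypothetical fast path: the call site needs no #if of its own. */
static inline void demo_free_fast(void *ptr) {
    (void)ptr;               /* real free logic would go here */
    DEMO_STAT_INC(free_hit); /* compiled out when HAKMEM_BENCH_MINIMAL=1 */
}

/* Read the counter back (always 0 in a bench-minimal build). */
static inline uint64_t demo_free_hits(void) {
    return (uint64_t)atomic_load_explicit(&g_demo_stats.free_hit,
                                          memory_order_relaxed);
}
```

With one gated macro, forgetting a guard at a single call site can no longer leave stray instrumentation in the bench build. Whether the existing helpers can be restructured this way is an assumption to check against the real header.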
+
+### 1.4 Update hakmem_env_snapshot_box.h
+
+**File**: `core/box/hakmem_env_snapshot_box.h`
+
+Find function `hakmem_env_snapshot_enabled()`.
+
+Replace implementation:
+
+```c
+// Before:
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
+}
+
+// After:
+#if HAKMEM_BENCH_MINIMAL
+// In bench mode, snapshot is always enabled (one-time cost, compile-away benefit)
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return 1;
+}
+#else
+// Normal mode: runtime check
+static inline bool hakmem_env_snapshot_enabled(void) {
+    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
+}
+#endif
+```
+
+**Rationale**: ENV checks become compile-time constants in BENCH_MINIMAL, enabling better optimization.
+
+### 1.5 (Optional) Update debug logging
+
+**File**: any file with `hakmem_tls_trace_*` or `hakmem_diag_*` calls
+
+Only if needed (lower priority). Wrap verbose logging:
+
+```c
+#if !HAKMEM_BENCH_MINIMAL
+hakmem_tls_trace_alloc(ptr, size);
+#endif
+```
+
+---
+
+## 2. Build & Verify
+
+### 2.1 Baseline build (BENCH_MINIMAL=0)
+
+```sh
+make clean
+make -j bench_random_mixed_hakmem bench_random_mixed_system
+ls -lh bench_random_mixed_hakmem
+```
+
+Expected: ~653K (same as before)
+
+### 2.2 Optimized build (BENCH_MINIMAL=1)
+
+```sh
+make clean
+make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
+ls -lh bench_random_mixed_hakmem
+```
+
+Expected: Similar size (instrumentation call sites are compiled out; no functions or sections are dropped by the linker)
+
+**Note**: Binary size may not change noticeably, so use the perf counters in section 3, not file size, to judge the effect.
+
+### 2.3 Verify compilation
+
+Both builds should complete without errors.
+
+Check for syntax errors in conditional code:
+```sh
+# If you see errors like "undeclared identifier", it means conditional guards are wrong
+```
+
+---
+
+## 3. A/B Test Execution
+
+### 3.1 Baseline run (BENCH_MINIMAL=0)
+
+```sh
+make clean
+make -j bench_random_mixed_hakmem bench_random_mixed_system
+scripts/run_mixed_10_cleanenv.sh
+```
+
+Record:
+- Mean throughput
+- Stdev
+- Min/Max
+
+Run perf stat (includes `L1-icache-load-misses`, which the I-cache criterion needs; drop it if the CPU does not expose this event):
+
+```sh
+perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
+  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
+  ./bench_random_mixed_hakmem 200000000 400 1
+```
+
+Record:
+- cycles
+- instructions
+- L1-icache-load-misses
+- branch-misses %
+
+### 3.2 Optimized run (BENCH_MINIMAL=1)
+
+```sh
+make clean
+make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
+scripts/run_mixed_10_cleanenv.sh
+```
+
+Same recording as baseline.
+
+Perf stat (same command).
+
+### 3.3 Optional: system ceiling check
+
+```sh
+./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput"
+```
+
+(Already measured in Phase 17: ~85M ops/s)
+
+---
+
+## 4. Analysis & GO/NO-GO Decision
+
+### Compute deltas
+
+```
+Delta_throughput = (optimized_mean - baseline_mean) / baseline_mean * 100
+Delta_instructions = (optimized_instructions - baseline_instructions) / baseline_instructions * 100
+Delta_icache = (optimized_icache - baseline_icache) / baseline_icache * 100
+```
+
+### Decision logic
+
+**GO** (if ALL true):
+- Delta_throughput ≥ +5%
+- Delta_instructions ≤ -15%
+- Delta_icache ≤ -10% (should follow instruction reduction)
+
+**NEUTRAL** (if some criteria marginal):
+- Delta_throughput within ±3%
+- Delta_instructions ∈ [-15%, -5%]
+- Variance increase < 50%
+
+**NO-GO** (if any critical criterion is missed):
+- Delta_throughput < -2%
+- Delta_instructions > -5% (failed to reduce overhead)
+- Variance increase > 100%
+
+---
+
+## 5. Reporting (required artifacts)
+
+Create file: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md`
+
+Include:
+
+```markdown
+# Phase 18 v2: BENCH_MINIMAL - A/B Test Results
+
+| Metric | Baseline | Optimized | Delta |
+|--------|----------|-----------|-------|
+| Throughput (mean) | X.XXM | X.XXM | +/-X.XX% |
+| Throughput (σ) | X.XXM | X.XXM | % |
+| Instructions | X.XB | X.XB | +/-X.X% |
+| I-cache misses | XXXK | XXXK | +/-X.X% |
+| Cycles | X.XB | X.XB | +/-X.X% |
+
+## Verdict
+
+GO / NEUTRAL / NO-GO with reasoning
+```
+
+Update: `CURRENT_TASK.md` (Phase 18 v2 status + next)
+
+---
+
+## 6. Important Notes
+
+- Ensure `BENCH_MINIMAL=0` (default) is tested first
+- Both BENCH_MINIMAL=0 and BENCH_MINIMAL=1 must compile successfully
+- If instructions drop by less than 15%, the optimization is incomplete (check whether the compiler already removes this code)
+- This is NOT a research-only knob; it's a valid bench build mode (safe to enable in CI)
+
+---
+
+## 7. If GO: Next Steps
+
+1. **Promote BENCH_MINIMAL=1** as default for bench builds
+2. **Prepare Phase 19** (next optimization frontier)
+   - Possible targets: SIMD prefetch, lock-free structures, allocation batching
+3. **Document learnings** about allocator overhead vs memory latency
+
+---
+
+## 8. If NO-GO: Lessons
+
+- Allocator is memory-bound (IPC=2.30 constant across phases)
+- Instruction count reduction doesn't always yield throughput gains if memory latency dominates
+- Need architectural changes (cache-friendly layout, batching) rather than micro-optimizations
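The delta formulas and decision logic from section 4 of the instructions can be sketched as a small helper, which may reduce sign mistakes when filling in the results table. This is an illustration under stated assumptions, not part of the hakmem tree: the names `delta_pct`, `verdict_t`, and `phase18_v2_verdict` are hypothetical, NO-GO is checked before GO, and any result that is neither GO nor NO-GO is treated as NEUTRAL (the instructions leave both points open).

```c
#include <assert.h>

typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Relative change in percent: (optimized - baseline) / baseline * 100.
 * Negative means the optimized build has the smaller value. */
static double delta_pct(double baseline, double optimized) {
    assert(baseline > 0.0);
    return (optimized - baseline) / baseline * 100.0;
}

/* Apply the section 4 thresholds. All inputs are percentages.
 * Assumption: NO-GO takes priority, and everything that is neither
 * NO-GO nor GO is reported as NEUTRAL. */
static verdict_t phase18_v2_verdict(double d_throughput,
                                    double d_instructions,
                                    double d_icache,
                                    double variance_increase) {
    /* NO-GO: any single critical criterion missed. */
    if (d_throughput < -2.0 || d_instructions > -5.0 ||
        variance_increase > 100.0) {
        return VERDICT_NO_GO;
    }
    /* GO: all three criteria met. */
    if (d_throughput >= 5.0 && d_instructions <= -15.0 &&
        d_icache <= -10.0) {
        return VERDICT_GO;
    }
    return VERDICT_NEUTRAL;
}
```

For example, the Phase 18 v1 throughput numbers (48.94M → 48.52M ops/s) give `delta_pct(48.94, 48.52)` of about -0.86, matching the reported -0.87% up to rounding of the inputs.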