Phase 18 v2: BENCH_MINIMAL design + instructions (instruction removal strategy)

## Phase 18 v2: Next Phase Direction

After the Phase 18 v1 failure (layout optimization caused an I-cache regression),
shift to reducing instruction count through compile-time removal of:

- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput within ±3%, instructions reduced by 5% to 15%
NO-GO: throughput drops by more than 2%, or instructions reduced by less than 5%

Key: If instructions do not drop by at least 15%, the allocator is not the
bottleneck and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal:
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes

3. A/B test with perf stat (must measure instruction reduction)

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees.
Section splitting alone (without symbol ordering, PGO, or linker script)
destroyed code locality and increased I-cache misses by 91%.

Switching to direct instruction removal is safer and more predictable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-15 05:55:22 +09:00
parent b1912d6587
commit ad346f7885
3 changed files with 637 additions and 23 deletions

CURRENT_TASK.md

@@ -469,38 +469,63 @@ Cumulative improvement achieved in Phases 6-10:
---

Before (removed):

### Phase 18: Hot Text Isolation / Layout Control - NEXT
**Goal**: Improve I-cache efficiency through binary layout optimization and reduce the gap to the system binary
**Strategy**:
1. **Cold Code Isolation** (priority 1)
- Move stats collection, debug logging, and error handlers into a separate TU
- Mark them explicitly as cold with `__attribute__((cold, noinline))`
- Expected effect: I-cache misses -20%
2. **Link-Order Optimization** (priority 2)
- Place hot functions contiguously (linker script or link order control)
- `-ffunction-sections` + custom linker script
- Expected effect: I-cache misses -10%
3. **Profile-Guided Optimization** (priority 3, optional)
- Measurement-based placement with `-fprofile-generate` + `-fprofile-use`
- Expected effect: I-cache misses -10-20%
**Build Gate**: `HOT_TEXT_ISOLATION=0/1` (for layout A/B)
**Target**:
- v1 (TU split / attrs / optional gc-sections): **GO at +2%** (NEUTRAL is expected to be the likely outcome)
- v2 (BENCH_MINIMAL compile-out): aim for **+10-20%** (cut the instruction footprint directly)
**Design**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
**Instructions**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
**Result (v1)**: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md` (❌ NO-GO / I-cache misses got worse)
Implementation gates (reversible):
- Makefile knob: `HOT_TEXT_ISOLATION=0/1`
- Compile-time: `-DHAKMEM_HOT_TEXT_ISOLATION=0/1`

After (added):

### Phase 18: Hot Text Isolation - PROGRESS
**Goal**: Reduce the gap to the system binary (+74.26%) through binary optimization. Phase 17 showed that the layout penalty dominates, so this is handled with a two-stage strategy.
**Strategy**:
#### Phase 18 v1: Layout optimization (section-based) - ❌ NO-GO (2025-12-15)
**Attempt**: Improve I-cache behavior with `-ffunction-sections -fdata-sections -Wl,--gc-sections`
**Result**:
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: **+91.06%** (131K → 250K) ← smoking gun
- Variance: +80%
**Cause**: Section splitting without explicit hot symbol ordering destroyed code locality
**Lesson**: Layout tweaks are fragile. Without an ordering strategy they are harmful.
**Decision**: Freeze v1 (safely isolated in the Makefile)
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, no effect)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (NO-GO, disabled)
**Files**:
- Design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_DESIGN.md`
- Instructions: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md`
- Results: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md`
#### Phase 18 v2: BENCH_MINIMAL (instruction removal) - NEXT
**Strategy**: Remove instruction footprint at compile time
- Stats collection: FRONT_FASTLANE_STAT_INC → no-op
- ENV checks: runtime lookup → constant
- Debug logging: removed via conditional compilation
**Expected effect**:
- Instructions: -30-40%
- Throughput: +10-20%
**GO criteria** (STRICT):
- Throughput: **+5% minimum** (+8% preferred)
- Instructions: **-15% minimum** ← smoking gun for success
- I-cache: improves automatically (follows the instruction reduction)
If instructions drop by less than 15%: abandon (the allocator is not the bottleneck)
**Build Gate**: `BENCH_MINIMAL=0/1` (production safe, opt-in)
**Files**:
- Design: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`
- Instructions: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md`
- Implementation: next stage
**Implementation plan**:
1. Add the Makefile BENCH_MINIMAL knob
2. Stats macro conditional
3. ENV checks constant
4. Debug logging wrap
5. A/B test (judge against +5% throughput / -15% instructions)

Unchanged context:

## Update notes (2025-12-14 Phase 5 E5-3 Analysis - Strategic Pivot)

docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md

@@ -0,0 +1,274 @@
# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Design
## Context (Phase 18 v1 failed)
Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).
**Result**: CATASTROPHIC I-cache regression (+91%).
**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.
**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.
---
## 0. Goal (High Impact)
Reduce **instruction footprint** per allocation/free by removing non-essential code paths at compile-time:
- Stats collection (counter increments on every operation)
- Environment variable checks (TLS lookups)
- Debug logging (conditional output)
**Expected impact**:
- Instruction count: -30-40% (benchmark workload)
- I-cache misses: automatic improvement from smaller working set
- Throughput: +10-20% (Phase 17 measured 48% fewer instructions in the system binary; throughput should improve as that gap closes)
**GO Criteria** (strict):
- Throughput: +5% minimum, +8% preferred
- Instructions: -15% minimum (clear, measurable reduction)
- I-cache: should improve proportionally to instruction reduction
If instructions do not drop meaningfully, abandon this phase.
---
## 1. Strategy: BENCH_MINIMAL Build Mode
### 1.1 What to Remove
**Category A: Stats collection** (highest frequency, low value for bench)
```c
FRONT_FASTLANE_STAT_INC(malloc_total); // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_hit); // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_fallback_*); // ← conditional
free_tiny_fast_mono_stat_inc_hit(); // ← 1 per free
```
**Category B: Environment variable checks** (per-operation cost if not short-circuit)
```c
if (hakmem_env_snapshot_enabled()) { ... } // TLS + branch
if (front_fastlane_enabled()) { ... } // TLS + branch
```
**Category C: Debug logging / verbose output** (rarely needed in bench)
```c
hakmem_tls_trace_alloc(); // log path
hakmem_diag_record_fallback();
```
### 1.2 Compile-Time Gate
New build mode:
```
HAKMEM_BENCH_MINIMAL=0/1 (default 0, opt-in)
```
Activated via Makefile knob:
```makefile
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
# ... applies to both bench binaries and shared lib
endif
```
**Important**: This is **not** a research-only knob. It is a regular, opt-in build mode that only disables instrumentation and leaves allocator behavior unchanged.
### 1.3 Safeguards
1. **Conditional compilation markers**:
- Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
- Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
- Ensure the code still compiles and behaves correctly when `HAKMEM_BENCH_MINIMAL=0` (normal operation)
2. **No behavioral changes**:
- Fast paths must remain identical (stats are just instrumentation)
- Slow paths (fallback, cold) can be simplified (less instrumentation)
3. **Rollback**: Single build knob, reversible
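One way to make the safeguards above robust is to give the flag a default and reject unexpected values in a shared header. The following is a minimal sketch, not the project's actual header; `demo_bench_minimal_enabled` is a hypothetical helper added only for illustration.

```c
/* Default to the fully instrumented build when -DHAKMEM_BENCH_MINIMAL is not passed. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Reject anything other than 0 or 1 so a typo in the Makefile fails the build early. */
#if HAKMEM_BENCH_MINIMAL != 0 && HAKMEM_BENCH_MINIMAL != 1
#error "HAKMEM_BENCH_MINIMAL must be 0 or 1"
#endif

/* Hypothetical helper: reports which build mode this translation unit was compiled in. */
static inline int demo_bench_minimal_enabled(void) {
    return HAKMEM_BENCH_MINIMAL;
}
```

With an explicit default, `#if HAKMEM_BENCH_MINIMAL` and `#if !HAKMEM_BENCH_MINIMAL` behave the same whether or not the Makefile passes the flag, and builds using `-Wundef` stay quiet.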
---
## 2. Implementation Plan
### Phase 2.1: Stats removal (Priority 1)
**File**: `core/box/front_fastlane_stats_box.h`
```c
#if HAKMEM_BENCH_MINIMAL
// Bench build: the stats macro becomes a no-op
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { (void)0; } while (0)
#else
// Normal build: relaxed atomic increment
#define FRONT_FASTLANE_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
#endif
```
**File**: `core/front/malloc_tiny_fast.h`
Search for all `free_tiny_fast_mono_stat_*` calls and wrap:
```c
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif
```
**Expected savings**: ~20-30 instructions/op (counter increments + memory sync)
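The compile-out pattern for stats can be tried in isolation with a small stand-alone sketch. All `demo_*` and `DEMO_*` names below are hypothetical stand-ins, since the real `g_stats` layout is not shown in this document; the point is only that the counter stays at zero when the flag is set.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Default to the instrumented build when the flag is not passed via -D. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Hypothetical counter block standing in for the real stats structure. */
typedef struct {
    _Atomic uint64_t malloc_total;
    _Atomic uint64_t malloc_hit;
} demo_stats_t;

static demo_stats_t g_demo_stats;

#if HAKMEM_BENCH_MINIMAL
/* Bench build: expands to an empty statement, so no increment is emitted. */
#define DEMO_STAT_INC(stat) do { (void)0; } while (0)
#else
/* Normal build: one relaxed atomic increment per call site. */
#define DEMO_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_demo_stats.stat, 1, memory_order_relaxed)
#endif

/* Stand-in for a hot path that bumps two counters per call. */
static inline void demo_malloc_fast_path(void) {
    DEMO_STAT_INC(malloc_total);
    DEMO_STAT_INC(malloc_hit);
}

/* Reads the counter back; it never moves in a bench build. */
static inline uint64_t demo_malloc_total(void) {
    return atomic_load_explicit(&g_demo_stats.malloc_total, memory_order_relaxed);
}
```

Compiling the same file with and without `-DHAKMEM_BENCH_MINIMAL=1` and comparing the generated assembly is a quick way to check whether the increments were costing instructions in the first place.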
### Phase 2.2: Environment variable check removal (Priority 1)
**File**: `core/box/hakmem_env_snapshot_box.h`
```c
#if !HAKMEM_BENCH_MINIMAL
// Normal: check the ENV-derived flag at runtime
static inline bool hakmem_env_snapshot_enabled(void) {
    return g_env_snapshot_enabled;
}
#else
// Bench: snapshot is always enabled (one-time cost)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;  // compile-time constant, so the optimizer can drop the branch
}
#endif
```
**File**: `core/bench_profile.h`
Similar pattern for profile-related checks.
**Expected savings**: ~5-10 instructions/op (TLS lookups + branch prediction)
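To illustrate why a constant return value helps, here is a minimal sketch of a caller whose branch can be folded away in the bench build. The names `g_demo_snapshot_enabled`, `demo_snapshot_enabled`, and `demo_pick_path` are assumptions made for this example and do not come from the hakmem sources.

```c
#include <stdatomic.h>
#include <stdbool.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Hypothetical runtime flag, standing in for a value derived from the environment. */
static atomic_bool g_demo_snapshot_enabled;

static inline bool demo_snapshot_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    /* Bench build: constant result, so callers' branches can be folded. */
    return true;
#else
    /* Normal build: one relaxed load per check. */
    return atomic_load_explicit(&g_demo_snapshot_enabled, memory_order_relaxed);
#endif
}

/* Hypothetical caller: returns 1 for the snapshot path, 0 for the legacy path. */
static inline int demo_pick_path(void) {
    if (demo_snapshot_enabled()) {
        return 1;
    }
    return 0;
}
```

In the bench build the optimizer is free to reduce `demo_pick_path` to a constant, which removes both the load and the branch from the hot path.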
### Phase 2.3: Debug logging removal (Priority 2, if needed)
**File**: trace/logging functions
Conditional compilation for verbose paths (already often guarded by `HAKMEM_BUILD_DEBUG`).
**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)
---
## 3. Risks / Mitigations
### Risk A: Benchmark no longer representative
If we remove instrumentation, does the bench still measure the allocator?
**Mitigation**: BENCH_MINIMAL disables stats (instrumentation), not core logic.
- Fast paths remain identical.
- Only instrumentation overhead is removed.
- This is similar to how production binaries disable debug tracing.
### Risk B: Build regression
Conditional compilation could hide bugs.
**Mitigation**:
- Ensure `BENCH_MINIMAL=0` (default) always tested first.
- Test both modes in CI if available.
- Manual verification that code is syntactically correct in both modes.
### Risk C: Instruction count doesn't drop
If removing stats + ENV checks doesn't clearly drop instructions, it means:
- Compiler/LTO already optimizes these away in RELEASE builds
- The overhead is elsewhere (memory access patterns, branch misprediction)
**Mitigation**: Run perf stat with `BENCH_MINIMAL=1` to confirm the instruction reduction. If the reduction is smaller than 10%, reconsider the strategy.
---
## 4. Expected Impact
**Conservative** (stats only):
- Throughput: +5-8%
- Instructions: -20-30%
- I-cache: improvement follows from instruction reduction
**Ambitious** (stats + ENV + debug):
- Throughput: +10-20%
- Instructions: -30-40%
- I-cache: -20-30% (proportional to instruction reduction)
**System binary ceiling**: 85M ops/s
- Baseline: 48M ops/s
- Target: roughly 50-55M ops/s (+5% to +15% over baseline)
- Ambitious case: roughly 53-58M ops/s (+10% to +20% over baseline)
---
## 5. Box Theory Compliance
**Box**: BenchMinimalBox
- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
- **Scope**: Instrumentation removal only (no algorithm changes)
- **Rollback**: Single build knob (default OFF, backward compatible)
- **Observability**: Perf stat before/after (instructions, I-cache, throughput)
---
## 6. Success Criteria
### GO (Proceed):
- Throughput: **+5% or more**
- Instructions: **-15% or more** (smoking gun for success)
- I-cache: should improve proportionally
### NEUTRAL (keep as research box):
- Throughput: ±3% (within noise)
- Instructions: -5% to -15% (marginal)
### NO-GO (freeze):
- Throughput: below -2%, or no improvement
- Instructions: reduced by less than 5% (failed optimization objective)
---
## 7. Files to Modify
1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
- Already created in Phase 18 v1
2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
- Update macro to be no-op when BENCH_MINIMAL=1
3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
- Wrap free_tiny_fast_mono_stat_* calls
4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
- Make hakmem_env_snapshot_enabled() constant when BENCH_MINIMAL=1
5. **`core/bench_profile.h`** (profile checks)
- Simplify profile related checks
6. **`Makefile`**
- Add `BENCH_MINIMAL ?= 0` knob
- Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled
---
## 8. Notes
- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
- **Bench-only**: This mode is intended for benchmark binary builds, not shipped libraries.
- **Phase 17 learning**: The system binary executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves a 30% to 40% reduction, we should see +10-20% throughput.
---
## 9. Next Steps After Phase 18 v2
### If GO (+5%+):
- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
- Update CURRENT_TASK with Phase 18 v2 victory
- Prepare Phase 19 (next optimization frontier)
### If NEUTRAL:
- Investigate why instructions didn't drop (compiler already optimized?)
- Consider aggressive options: SIMD, allocation batching, lock-free
### If NO-GO:
- Acknowledge that Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
- Shift focus to broader architectural changes

docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md

@@ -0,0 +1,315 @@
# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Implementation Instructions
## Status
- Phase 18 v1 (layout + sections) → NO-GO (I-cache regression)
- Phase 18 v2 (instruction removal) → Next phase
- Strategy: Remove stats/ENV overhead at compile-time
- Expected: +10-20% throughput, -30-40% instructions
Ref: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md`
---
## 0. Goal / Success Criteria
**GO threshold** (STRICT):
- Throughput: **+5% minimum** (+8% preferred)
- Instructions: **-15% minimum** (clear proof of concept)
- I-cache: automatic improvement from smaller footprint
**NEUTRAL**:
- Throughput: ±3% (marginal)
- Instructions: -5% to -15% (incomplete optimization)
**NO-GO**:
- Throughput: below -2% (regression)
- Instructions: reduced by less than 5% (failed to reduce overhead)
If instructions do not drop by at least 15%, abandon this phase (the allocator is not the bottleneck).
---
## 1. Implementation Steps
### 1.1 Add Makefile knob
**File**: `Makefile`
After Phase 18 v1 section (around line 140), add:
```makefile
# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
# Note: Both bench and shared lib will disable instrumentation
# Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled)
endif
```
**Location**: After `HOT_TEXT_GC_SECTIONS` section (line ~145)
### 1.2 Update front_fastlane_stats_box.h
**File**: `core/box/front_fastlane_stats_box.h`
Find the `FRONT_FASTLANE_STAT_INC` macro (currently defined as atomic increment).
Replace entire macro section:
```c
// Before:
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)

// After:
#if HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while (0)
#else
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)
#endif
```
**Rationale**: Stats collection becomes no-op in BENCH_MINIMAL, compiled away entirely.
### 1.3 Update malloc_tiny_fast.h free stats
**File**: `core/front/malloc_tiny_fast.h`
Search for `free_tiny_fast_mono_stat_inc_*` calls (probably in free_tiny_fast function).
Wrap each call:
```c
// Before:
free_tiny_fast_mono_stat_inc_hit();
// After:
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif
```
Search pattern: `free_tiny_fast_mono_stat_` (should find ~5-10 calls)
**Rationale**: Similar to malloc, free stats become optional in BENCH_MINIMAL.
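As an alternative to wrapping every call site in `#if`, the stat helpers themselves could become empty inline stubs in the bench build. This is only a sketch of that option under assumed names (`demo_free_stat_inc_hit`, `demo_free_stat_hits`, `demo_free_fast`); the real helper definitions are not shown in this document.

```c
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench build: empty inline stubs; call sites stay untouched and emit no code. */
static inline void demo_free_stat_inc_hit(void) {}
static inline uint64_t demo_free_stat_hits(void) { return 0; }
#else
/* Normal build: hypothetical plain counter (the real storage is not shown here). */
static uint64_t g_demo_free_hits;
static inline void demo_free_stat_inc_hit(void) { g_demo_free_hits++; }
static inline uint64_t demo_free_stat_hits(void) { return g_demo_free_hits; }
#endif

/* Hypothetical call site: no #if is needed here. */
static inline void demo_free_fast(void *p) {
    (void)p;
    demo_free_stat_inc_hit();
}
```

The trade-off is that the gate lives in one place instead of 5-10 call sites, at the cost of relying on the compiler to inline and discard the empty stub.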
### 1.4 Update hakmem_env_snapshot_box.h
**File**: `core/box/hakmem_env_snapshot_box.h`
Find function `hakmem_env_snapshot_enabled()`.
Replace implementation:
```c
// Before:
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}

// After:
#if HAKMEM_BENCH_MINIMAL
// In bench mode, snapshot is always enabled (one-time cost, compile-away benefit)
static inline bool hakmem_env_snapshot_enabled(void) {
    return 1;
}
#else
// Normal mode: runtime check
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}
#endif
```
**Rationale**: ENV checks become compile-time constants in BENCH_MINIMAL, enabling better optimization.
### 1.5 (Optional) Update debug logging
**File**: any file with `hakmem_tls_trace_*` or `hakmem_diag_*` calls
Only if needed (lower priority). Wrap verbose logging:
```c
#if !HAKMEM_BENCH_MINIMAL
hakmem_tls_trace_alloc(ptr, size);
#endif
```
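If many trace calls need the same guard, a single wrapper macro keeps the call sites readable. The sketch below uses hypothetical names (`DEMO_TRACE_ALLOC`, `demo_trace_alloc`, `g_demo_trace_calls`) and a plain `fprintf` in place of the real trace function, whose signature is not confirmed by this document.

```c
#include <stddef.h>
#include <stdio.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Counts emitted trace records (demo only). */
static int g_demo_trace_calls;

/* Hypothetical trace sink standing in for the real logging function. */
static inline void demo_trace_alloc(void *ptr, size_t size) {
    g_demo_trace_calls++;
    fprintf(stderr, "alloc ptr=%p size=%zu\n", ptr, size);
}

#if HAKMEM_BENCH_MINIMAL
/* Bench build: expands to nothing, so the arguments are not even evaluated. */
#define DEMO_TRACE_ALLOC(ptr, size) ((void)0)
#else
/* Normal build: forwards to the trace sink. */
#define DEMO_TRACE_ALLOC(ptr, size) demo_trace_alloc((ptr), (size))
#endif
```

Because the bench variant does not evaluate its arguments, any expression passed to the macro must be free of side effects that the allocator depends on.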
---
## 2. Build & Verify
### 2.1 Baseline build (BENCH_MINIMAL=0)
```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem
```
Expected: ~653K (same as before)
### 2.2 Optimized build (BENCH_MINIMAL=1)
```sh
make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem
```
Expected: roughly the same size. The instrumentation is removed at compile time, so the binary may shrink slightly, but file size is not the success metric.
**Note**: Judge the change by the instruction count from perf stat, not by binary size.
### 2.3 Verify compilation
Both builds should complete without errors.
Check for syntax errors in conditional code:
```sh
# If you see errors like "undeclared identifier", it means conditional guards are wrong
```
---
## 3. A/B Test Execution
### 3.1 Baseline run (BENCH_MINIMAL=0)
```sh
make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```
Record:
- Mean throughput
- Stdev
- Min/Max
Run perf stat:
```sh
perf stat -e cycles,instructions,branches,branch-misses,cache-misses,L1-icache-load-misses,dTLB-load-misses,minor-faults -- \
env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 200000000 400 1
```
Record:
- cycles
- instructions
- L1-icache-load-misses (I-cache misses)
- branch-misses %
### 3.2 Optimized run (BENCH_MINIMAL=1)
```sh
make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh
```
Same recording as baseline.
Perf stat (same command).
### 3.3 Optional: system ceiling check
```sh
./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput"
```
(Already measured in Phase 17: ~85M ops/s)
---
## 4. Analysis & GO/NO-GO Decision
### Compute deltas
```
Delta_throughput = (optimized_mean - baseline_mean) / baseline_mean * 100
Delta_instructions = (optimized_instructions - baseline_instructions) / baseline_instructions * 100
Delta_icache = (optimized_icache - baseline_icache) / baseline_icache * 100
```
### Decision logic
**GO** (if ALL true):
- Delta_throughput ≥ +5%
- Delta_instructions ≤ -15%
- Delta_icache ≤ -10% (should follow instruction reduction)
**NEUTRAL** (if some criteria marginal):
- Delta_throughput within ±3%
- Delta_instructions in [-15%, -5%]
- Variance increase < 50%
**NO-GO** (if any critical missed):
- Delta_throughput < -2%
- Delta_instructions > -5% (failed to reduce overhead)
- Variance increase > 100%
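The delta formulas and thresholds above can be encoded in a few lines to avoid sign mistakes when reading perf output. This is a minimal sketch: `delta_percent`, `classify`, and `verdict_t` are hypothetical names, the variance criteria are left out, and treating every combination not covered by the GO or NO-GO rules as NEUTRAL is an assumption of this example.

```c
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Relative change in percent; negative means the optimized value is smaller. */
static double delta_percent(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}

/* Applies the GO / NO-GO thresholds from the decision logic above.
 * Variance checks are omitted; anything in between is reported as NEUTRAL. */
static verdict_t classify(double d_tput, double d_instr, double d_icache) {
    if (d_tput >= 5.0 && d_instr <= -15.0 && d_icache <= -10.0) {
        return VERDICT_GO;
    }
    if (d_tput < -2.0 || d_instr > -5.0) {
        return VERDICT_NO_GO;
    }
    return VERDICT_NEUTRAL;
}
```

For example, a run with +8% throughput, -20% instructions, and -12% I-cache misses classifies as GO, while +6% throughput with only -2% instructions classifies as NO-GO because the instruction objective was missed.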
---
## 5. Reporting (required artifacts)
Create file: `docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md`
Include:
```markdown
# Phase 18 v2: BENCH_MINIMAL — A/B Test Results
| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Throughput (mean) | X.XXM | X.XXM | +/-X.XX% |
| Throughput (σ) | X.XXM | X.XXM | % |
| Instructions | X.XB | X.XB | +/-X.X% |
| I-cache misses | XXXK | XXXK | +/-X.X% |
| Cycles | X.XB | X.XB | +/-X.X% |
## Verdict
GO / NEUTRAL / NO-GO with reasoning
```
Update: `CURRENT_TASK.md` (Phase 18 v2 status + next)
---
## 6. Important Notes
- Ensure `BENCH_MINIMAL=0` (default) is tested first
- Both BENCH_MINIMAL=0 and BENCH_MINIMAL=1 must compile successfully
- If instructions drop by less than 15%, the optimization is incomplete (check whether the compiler was already removing this code)
- This is NOT a research-only knob; it's a valid bench build mode (safe to enable in CI)
---
## 7. If GO: Next Steps
1. **Promote BENCH_MINIMAL=1** as default for bench builds
2. **Prepare Phase 19** (next optimization frontier)
- Possible targets: SIMD prefetch, lock-free structures, allocation batching
3. **Document learnings** about allocator overhead vs memory latency
---
## 8. If NO-GO: Lessons
- Allocator is memory-bound (IPC=2.30 constant across phases)
- Instruction count reduction doesn't always yield throughput gains if memory latency dominates
- Need architectural changes (cache-friendly layout, batching) rather than micro-optimizations