From 6a6744d0654cb9acca53795103fe2694dd0d3f06 Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)"
Date: Sun, 14 Dec 2025 01:54:21 +0900
Subject: [PATCH] Phase 4 E2: Alloc Per-Class FastPath - NEUTRAL (-0.21%)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DUALHOT=0): 45.40M ops/s (mean), 45.51M ops/s (median)
- Optimized (DUALHOT=1): 45.30M ops/s (mean), 45.22M ops/s (median)
- Improvement: -0.21% mean, -0.62% median

Decision: NEUTRAL (within ±1.0% noise threshold)
Action: FREEZE as research box (default OFF, no promotion)

Key Findings:
- C0-C3 fast path adds branch overhead without measurable benefit
- Unlike FREE path (+13%), ALLOC path already has optimized route caching
- Phase 3 C3 static routing eliminated route lookup overhead
- Additional per-class specialization doesn't reduce existing cost

Root Cause:
- Free DUALHOT skips expensive policy_snapshot() + tiny_route_for_class()
- Alloc DUALHOT adds C0-C3 branch but route already cached (Phase 3 C3)
- Net effect: Branch cost ≈ Route savings → neutral

Conclusion: Alloc route optimization has reached diminishing returns

Cumulative Status:
- Phase 4 E1: +3.92% (GO, research box)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)

Files:
- CURRENT_TASK.md: Updated with E2 results
- docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md: Full A/B test report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5
---
 CURRENT_TASK.md                               |  43 +++++-
 ...LLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md | 134 ++++++++++++++++++
 2 files changed, 172 insertions(+), 5 deletions(-)
 create mode 100644 docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md

diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index a6783600..24d58442 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -1,6 +1,43 @@
 # Main-Line Tasks (Current)
 
-## Update Notes (2025-12-14 Phase 4 E1 Complete - ENV Snapshot Consolidation)
+## Update Notes (2025-12-14 Phase 4 E2 Complete - Alloc Per-Class FastPath)
+
+### Phase 4 E2: Alloc Per-Class FastPath ⚪ NEUTRAL (2025-12-14)
+
+**Target**: C0-C3 dedicated fast path for alloc (bypass policy route for small sizes)
+- Strategy: Skip policy snapshot + route determination for C0-C3 classes
+- Reuse DUALHOT pattern from free path (which achieved +13% for C0-C3)
+- Baseline: HAKMEM_ENV_SNAPSHOT=1 enabled (E1 active)
+
+**Implementation**:
+- ENV gate: `HAKMEM_TINY_ALLOC_DUALHOT=0/1` (already exists, default: 0)
+- Integration: `malloc_tiny_fast_for_class()` lines 247-259
+- C0-C3 check: Direct to LEGACY unified cache when enabled
+- Pattern: Probe window lazy init (64-call tolerance for early putenv)
+
+**A/B Test Results** (Mixed, 10-run, 20M iters, HAKMEM_ENV_SNAPSHOT=1):
+- Baseline (DUALHOT=0): **45.40M ops/s** (mean), 45.51M ops/s (median), σ=0.38M
+- Optimized (DUALHOT=1): **45.30M ops/s** (mean), 45.22M ops/s (median), σ=0.49M
+- **Improvement: -0.21% mean, -0.62% median**
+
+**Decision: NEUTRAL** (-0.21% within ±1.0% noise threshold)
+- Action: Keep as research box (default OFF, freeze)
+- Reason: C0-C3 fast path adds branch overhead without measurable gain on Mixed
+- Unlike FREE path (+13%), ALLOC path doesn't show significant route determination cost
+
+**Key Insight**:
+- Free path benefits from DUALHOT because it skips expensive policy snapshot + route lookup
+- Alloc path already has optimized route caching (Phase 3 C3 static routing)
+- C0-C3 specialization doesn't provide additional benefit over current routing
+- Conclusion: Alloc route optimization has reached diminishing returns
+
+**Cumulative Status**:
+- Phase 4 E1: +3.92% (GO, research box)
+- Phase 4 E2: -0.21% (NEUTRAL, frozen)
+
+### Next: Phase 4 E3 - TBD (consult perf profile or pursue other optimization vectors)
+
+---
 
 ### Phase 4 E1: ENV Snapshot Consolidation ✅ COMPLETE (2025-12-14)
 
@@ -28,10 +65,6 @@
 
 **Key Insight**: Shifting from shape optimizations (plateaued) to TLS/memory overhead yields strong returns. ENV snapshot consolidation represents new optimization frontier beyond branch prediction tuning.
 
-### Next: Phase 4 E2 - Alloc Per-Class Fast Path
-- Instructions (SSOT): `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md`
-- Design notes: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_1_DESIGN.md`
-
 ### Phase 4 Perf Profiling Complete ✅ (2025-12-14)
 
 **Profile Analysis**:
diff --git a/docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md b/docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md
new file mode 100644
index 00000000..35ca5aa0
--- /dev/null
+++ b/docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md
@@ -0,0 +1,134 @@
+# Phase 4 E2: Alloc Per-Class FastPath - A/B Test Results
+
+## Test Configuration
+
+**Date**: 2025-12-14
+**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled, new baseline from E1)
+**Profile**: MIXED_TINYV3_C7_SAFE
+**Workload**: Mixed 16-1024B allocation/free pattern
+**Parameters**: 20M iterations, ws=400, 1 thread
+**Runs**: 10 runs per configuration
+
+## Hypothesis
+
+**Strategy**: Apply DUALHOT pattern to alloc path (C0-C3 dedicated fast path)
+- Free path achieved +13% with C0-C3 DUALHOT (skip policy + route)
+- Alloc path should benefit similarly by bypassing route determination
+- Expected gain: +1-3% (conservative, based on free path success)
+
+**Implementation**:
+- ENV gate: `HAKMEM_TINY_ALLOC_DUALHOT=0/1` (default: 0)
+- Integration point: `malloc_tiny_fast_for_class()` lines 247-259
+- C0-C3 check: Direct to LEGACY unified cache (skip policy snapshot + route lookup)
+
+## Test Results
+
+### Baseline (HAKMEM_TINY_ALLOC_DUALHOT=0)
+
+```
+Run 1: 45,565,309 ops/s
+Run 2: 45,083,995 ops/s
+Run 3: 45,204,517 ops/s
+Run 4: 45,610,342 ops/s
+Run 5: 45,201,090 ops/s
+Run 6: 45,658,791 ops/s
+Run 7: 45,447,571 ops/s
+Run 8: 44,603,716 ops/s
+Run 9: 45,707,506 ops/s
+Run 10: 45,900,099 ops/s
+```
+
+**Statistics**:
+- Mean: **45,398,294 ops/s**
+- Median: **45,506,440 ops/s**
+- StdDev: **379,641 ops/s** (0.84% CV)
+
+### Optimized (HAKMEM_TINY_ALLOC_DUALHOT=1)
+
+```
+Run 1: 45,007,059 ops/s
+Run 2: 46,283,502 ops/s
+Run 3: 44,590,279 ops/s
+Run 4: 45,187,469 ops/s
+Run 5: 45,769,085 ops/s
+Run 6: 44,955,784 ops/s
+Run 7: 45,516,635 ops/s
+Run 8: 44,959,890 ops/s
+Run 9: 45,523,059 ops/s
+Run 10: 45,256,729 ops/s
+```
+
+**Statistics**:
+- Mean: **45,304,949 ops/s**
+- Median: **45,222,099 ops/s**
+- StdDev: **485,566 ops/s** (1.07% CV)
+
+## Performance Delta
+
+- **Mean gain**: **-0.21%** (45.40M → 45.30M ops/s)
+- **Median gain**: **-0.62%** (45.51M → 45.22M ops/s)
+- **Variance increase**: StdDev increased from 0.38M to 0.49M
+
+## Decision
+
+**Result**: ⚪ **NEUTRAL** (-0.21% within ±1.0% noise threshold)
+
+**Criteria**:
+- GO: mean gain >= +1.0%
+- NEUTRAL: mean gain in (-1.0%, +1.0%)
+- NO-GO: mean gain <= -1.0%
+
+**Action**: **FREEZE** as research box (default OFF, no promotion)
+
+## Analysis
+
+### Why NEUTRAL (not GO)?
+
+1. **Minimal performance impact**: -0.21% is within measurement noise
+2. **Free path comparison**: Free DUALHOT achieved +13%, alloc shows -0.21%
+3. **Branch overhead**: C0-C3 check adds branch cost without measurable benefit
+4. **Route caching effectiveness**: Phase 3 C3 static routing already optimized route lookup
+
+### Root Cause Analysis
+
+**Why alloc path differs from free path**:
+
+1. **Free path (DUALHOT=1, +13% gain)**:
+   - Skips expensive `policy_snapshot()` TLS read
+   - Skips `tiny_route_for_class()` lookup
+   - Direct to `tiny_legacy_fallback_free_base()`
+   - Large savings: ~10-15 cycles per C0-C3 free
+
+2. **Alloc path (DUALHOT=1, -0.21% neutral)**:
+   - Route already cached via `tiny_static_route_ready_fast()` (Phase 3 C3)
+   - Policy snapshot lightweight (TLS cache hit)
+   - C0-C3 check adds branch before route lookup
+   - Net effect: Branch cost ≈ Route lookup savings → neutral
+
+### Key Insight
+
+**Alloc route optimization has reached diminishing returns**:
+- Phase 3 C3 (static routing) already eliminated route lookup overhead
+- Additional C0-C3 specialization adds branch without reducing existing cost
+- Unlike free path, alloc path doesn't have "expensive operation to skip"
+
+**Conclusion**: Per-class specialization is effective only when bypassing measurable overhead. Alloc path route caching is already optimal.
+
+## Next Steps
+
+**Freeze E2**:
+- Keep `HAKMEM_TINY_ALLOC_DUALHOT` as research box (default OFF)
+- No promotion to MIXED_TINYV3_C7_SAFE preset
+- Implementation retained for potential future workload-specific use
+
+**Future Optimization Vectors**:
+- Consult perf profile for new hot spots (E1 changed baseline)
+- Consider non-routing optimizations (e.g., TLS layout, memory access patterns)
+- Explore workload-specific specialization (C6-heavy may differ from Mixed)
+
+## References
+
+- Design: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_1_DESIGN.md`
+- Instructions: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md`
+- Phase 4 E1 (ENV Snapshot): +3.92% GO (research box)
+- Phase 3 C3 (Static Routing): +2.20% ADOPT (default ON)
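
Reviewer note (not part of the commit): the statistics and the GO / NEUTRAL / NO-GO verdict in the report can be re-derived from the raw runs. The following is a minimal Python sketch under stated assumptions. The helper names `pct_delta` and `classify` are hypothetical and do not exist in the repository. The samples are copied from the ten runs per configuration listed in the report. The boundary handling assumes that exactly ±1.0% falls into GO or NO-GO, as the `>=` and `<=` criteria state. The reported StdDev values appear to be sample standard deviations, so `statistics.stdev` is used.

```python
import statistics

# Throughput samples in ops/s, copied from the report's ten runs per configuration.
BASELINE = [
    45_565_309, 45_083_995, 45_204_517, 45_610_342, 45_201_090,
    45_658_791, 45_447_571, 44_603_716, 45_707_506, 45_900_099,
]
OPTIMIZED = [
    45_007_059, 46_283_502, 44_590_279, 45_187_469, 45_769_085,
    44_955_784, 45_516_635, 44_959_890, 45_523_059, 45_256_729,
]

NOISE_THRESHOLD_PCT = 1.0  # the report's ±1.0% noise band


def pct_delta(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def classify(mean_gain_pct: float, threshold: float = NOISE_THRESHOLD_PCT) -> str:
    """Map a mean gain to GO / NEUTRAL / NO-GO (hypothetical helper).

    Assumption: the boundaries belong to GO and NO-GO, so NEUTRAL is the
    open interval between -threshold and +threshold.
    """
    if mean_gain_pct >= threshold:
        return "GO"
    if mean_gain_pct <= -threshold:
        return "NO-GO"
    return "NEUTRAL"


if __name__ == "__main__":
    for name, runs in (("baseline", BASELINE), ("optimized", OPTIMIZED)):
        print(
            f"{name}: mean={statistics.mean(runs):,.0f} "
            f"median={statistics.median(runs):,.0f} "
            f"stdev={statistics.stdev(runs):,.0f}"
        )
    mean_gain = pct_delta(statistics.mean(OPTIMIZED), statistics.mean(BASELINE))
    median_gain = pct_delta(statistics.median(OPTIMIZED), statistics.median(BASELINE))
    print(f"mean gain {mean_gain:+.2f}%, median gain {median_gain:+.2f}% "
          f"-> {classify(mean_gain)}")
```

With the report's samples, the mean gain rounds to -0.21% and the verdict is NEUTRAL, which matches the recorded decision. Keeping the classification in one small function makes the treatment of a result that lands exactly on ±1.0% explicit.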