Phase 4 E2: Alloc Per-Class FastPath - NEUTRAL (-0.21%)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DUALHOT=0): 45.40M ops/s (mean), 45.51M ops/s (median)
- Optimized (DUALHOT=1): 45.30M ops/s (mean), 45.22M ops/s (median)
- Improvement: -0.21% mean, -0.62% median

Decision: NEUTRAL (within ±1.0% noise threshold)
Action: FREEZE as research box (default OFF, no promotion)

Key Findings:
- C0-C3 fast path adds branch overhead without measurable benefit
- Unlike FREE path (+13%), ALLOC path already has optimized route caching
- Phase 3 C3 static routing eliminated route lookup overhead
- Additional per-class specialization doesn't reduce existing cost

Root Cause:
- Free DUALHOT skips expensive policy_snapshot() + tiny_route_for_class()
- Alloc DUALHOT adds C0-C3 branch but route already cached (Phase 3 C3)
- Net effect: Branch cost ≈ Route savings → neutral

Conclusion: Alloc route optimization has reached diminishing returns

Cumulative Status:
- Phase 4 E1: +3.92% (GO, research box)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)

Files:
- CURRENT_TASK.md: Updated with E2 results
- docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md: Full A/B test report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-14 01:54:21 +09:00
parent 7f3ff6c7e6
commit 6a6744d065
2 changed files with 172 additions and 5 deletions

CURRENT_TASK.md

@@ -1,6 +1,43 @@
# Mainline Tasks (Current)
-## Update memo (2025-12-14 Phase 4 E1 Complete - ENV Snapshot Consolidation)
+## Update memo (2025-12-14 Phase 4 E2 Complete - Alloc Per-Class FastPath)
### Phase 4 E2: Alloc Per-Class FastPath ⚪ NEUTRAL (2025-12-14)
**Target**: C0-C3 dedicated fast path for alloc (bypass policy route for small sizes)
- Strategy: Skip policy snapshot + route determination for C0-C3 classes
- Reuse DUALHOT pattern from free path (which achieved +13% for C0-C3)
- Baseline: HAKMEM_ENV_SNAPSHOT=1 enabled (E1 active)
**Implementation**:
- ENV gate: `HAKMEM_TINY_ALLOC_DUALHOT=0/1` (already exists, default: 0)
- Integration: `malloc_tiny_fast_for_class()` lines 247-259
- C0-C3 check: Direct to LEGACY unified cache when enabled
- Pattern: Probe window lazy init (64-call tolerance for early putenv)
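The probe-window pattern in the last bullet can be read as a lazily initialized gate that keeps re-reading the environment for its first 64 calls, so that a `putenv` issued shortly after startup is still honored. The following is a minimal sketch under that reading. The function name `alloc_dualhot_enabled`, the `PROBE_WINDOW` constant, the globals, and the exact latching rule are assumptions for illustration and are not taken from the HAKMEM source; only the variable name `HAKMEM_TINY_ALLOC_DUALHOT` and the 64-call figure come from the notes.

```c
#include <stdlib.h>

/* Hypothetical sketch: names and latching rule are assumptions. */
#define PROBE_WINDOW 64  /* the "64-call tolerance" from the notes */

static int g_gate_state = -1;  /* -1 = still probing, 0 = off, 1 = on */
static int g_probe_calls = 0;  /* probes made while the gate is unlatched */

/* Returns 1 if the DUALHOT alloc path should be used for this call. */
static int alloc_dualhot_enabled(void) {
    if (g_gate_state >= 0)
        return g_gate_state;   /* latched: no further getenv() calls */

    const char *v = getenv("HAKMEM_TINY_ALLOC_DUALHOT");
    int on = (v != NULL && v[0] == '1');

    /* Latch as soon as the variable reads "1", or once the window is spent. */
    if (on || ++g_probe_calls >= PROBE_WINDOW)
        g_gate_state = on;
    return on;
}
```

This sketch is single-threaded; a real allocator would likely need thread-local or atomic state for the gate.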
**A/B Test Results** (Mixed, 10-run, 20M iters, HAKMEM_ENV_SNAPSHOT=1):
- Baseline (DUALHOT=0): **45.40M ops/s** (mean), 45.51M ops/s (median), σ=0.38M
- Optimized (DUALHOT=1): **45.30M ops/s** (mean), 45.22M ops/s (median), σ=0.49M
- **Improvement: -0.21% mean, -0.62% median**
**Decision: NEUTRAL** (-0.21% within ±1.0% noise threshold)
- Action: Keep as research box (default OFF, freeze)
- Reason: C0-C3 fast path adds branch overhead without measurable gain on Mixed
- Unlike FREE path (+13%), ALLOC path doesn't show significant route determination cost
**Key Insight**:
- Free path benefits from DUALHOT because it skips expensive policy snapshot + route lookup
- Alloc path already has optimized route caching (Phase 3 C3 static routing)
- C0-C3 specialization doesn't provide additional benefit over current routing
- Conclusion: Alloc route optimization has reached diminishing returns
**Cumulative Status**:
- Phase 4 E1: +3.92% (GO, research box)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)
### Next: Phase 4 E3 - TBD (consult perf profile or pursue other optimization vectors)
---
### Phase 4 E1: ENV Snapshot Consolidation ✅ COMPLETE (2025-12-14)
@@ -28,10 +65,6 @@
**Key Insight**: Shifting from shape optimizations (plateaued) to TLS/memory overhead yields strong returns. ENV snapshot consolidation represents a new optimization frontier beyond branch prediction tuning.
-### Next: Phase 4 E2 - Alloc Per-Class Fast Path
-- Instructions (SSOT): `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md`
-- Design memo: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_1_DESIGN.md`
### Phase 4 Perf Profiling Complete ✅ (2025-12-14)
**Profile Analysis**:

docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md

@@ -0,0 +1,134 @@
# Phase 4 E2: Alloc Per-Class FastPath - A/B Test Results
## Test Configuration
**Date**: 2025-12-14
**Baseline**: HAKMEM_ENV_SNAPSHOT=1 (E1 enabled, new baseline from E1)
**Profile**: MIXED_TINYV3_C7_SAFE
**Workload**: Mixed 16-1024B allocation/free pattern
**Parameters**: 20M iterations, ws=400, 1 thread
**Runs**: 10 runs per configuration
## Hypothesis
**Strategy**: Apply DUALHOT pattern to alloc path (C0-C3 dedicated fast path)
- Free path achieved +13% with C0-C3 DUALHOT (skip policy + route)
- Alloc path should benefit similarly by bypassing route determination
- Expected gain: +1-3% (conservative, based on free path success)
**Implementation**:
- ENV gate: `HAKMEM_TINY_ALLOC_DUALHOT=0/1` (default: 0)
- Integration point: `malloc_tiny_fast_for_class()` lines 247-259
- C0-C3 check: Direct to LEGACY unified cache (skip policy snapshot + route lookup)
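To make the shape of the C0-C3 check concrete, here is a rough sketch of how such a gate could sit in front of an already cached route table. Everything in it is illustrative: the route names, the table contents, the assumption of eight size classes (C0 to C7), and the function name `pick_alloc_route` are hypothetical stand-ins, not the code at `malloc_tiny_fast_for_class()`.

```c
#include <assert.h>

/* Hypothetical sketch: route names, table contents, and function names are
 * illustrative assumptions, not the HAKMEM implementation. */
enum { TINY_CLASS_COUNT = 8 };  /* assumes size classes C0..C7 */

typedef enum { ROUTE_LEGACY, ROUTE_POLICY } alloc_route;

/* Stand-in for a route table that was resolved once and then cached. */
static const alloc_route g_static_route[TINY_CLASS_COUNT] = {
    ROUTE_LEGACY, ROUTE_LEGACY, ROUTE_LEGACY, ROUTE_LEGACY,
    ROUTE_POLICY, ROUTE_POLICY, ROUTE_POLICY, ROUTE_POLICY,
};

static alloc_route pick_alloc_route(int class_idx, int dualhot_enabled) {
    assert(class_idx >= 0 && class_idx < TINY_CLASS_COUNT);

    /* E2 gate: one extra compare-and-branch on every allocation. */
    if (dualhot_enabled && class_idx <= 3)
        return ROUTE_LEGACY;

    /* Default path: a single load from the cached table. */
    return g_static_route[class_idx];
}
```

In this simplified model the gate saves at most one cached table load for C0-C3 and adds a branch for every class, which is one way to picture why the gain can cancel out.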
## Test Results
### Baseline (HAKMEM_TINY_ALLOC_DUALHOT=0)
```
Run 1: 45,565,309 ops/s
Run 2: 45,083,995 ops/s
Run 3: 45,204,517 ops/s
Run 4: 45,610,342 ops/s
Run 5: 45,201,090 ops/s
Run 6: 45,658,791 ops/s
Run 7: 45,447,571 ops/s
Run 8: 44,603,716 ops/s
Run 9: 45,707,506 ops/s
Run 10: 45,900,099 ops/s
```
**Statistics**:
- Mean: **45,398,294 ops/s**
- Median: **45,506,440 ops/s**
- StdDev: **379,641 ops/s** (0.84% CV)
### Optimized (HAKMEM_TINY_ALLOC_DUALHOT=1)
```
Run 1: 45,007,059 ops/s
Run 2: 46,283,502 ops/s
Run 3: 44,590,279 ops/s
Run 4: 45,187,469 ops/s
Run 5: 45,769,085 ops/s
Run 6: 44,955,784 ops/s
Run 7: 45,516,635 ops/s
Run 8: 44,959,890 ops/s
Run 9: 45,523,059 ops/s
Run 10: 45,256,729 ops/s
```
**Statistics**:
- Mean: **45,304,949 ops/s**
- Median: **45,222,099 ops/s**
- StdDev: **485,566 ops/s** (1.07% CV)
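The summary statistics for both configurations can be recomputed from the raw runs with a few lines of C. The sketch below is not part of the benchmark harness; `run_stats`, `summarize_runs`, and `sqrt_newton` are names introduced here for illustration. It uses the sample standard deviation (n - 1 denominator), which is the variant that reproduces the StdDev figures reported above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RUNS 64

typedef struct {
    double mean;
    double median;
    double stddev;  /* sample standard deviation (n - 1 denominator) */
    double cv_pct;  /* coefficient of variation, percent of mean */
} run_stats;

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Newton iteration for sqrt; keeps the sketch free of a libm link step. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 100; i++) r = 0.5 * (r + x / r);
    return r;
}

/* Summarize 2..MAX_RUNS throughput samples (ops/s); input is not modified. */
static run_stats summarize_runs(const double *runs, size_t n) {
    assert(n >= 2 && n <= MAX_RUNS);
    double sorted[MAX_RUNS];
    memcpy(sorted, runs, n * sizeof sorted[0]);
    qsort(sorted, n, sizeof sorted[0], cmp_double);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += sorted[i];
    double mean = sum / (double)n;

    double sq = 0.0;  /* sum of squared deviations from the mean */
    for (size_t i = 0; i < n; i++) {
        double d = sorted[i] - mean;
        sq += d * d;
    }

    run_stats s;
    s.mean = mean;
    s.median = (n % 2) ? sorted[n / 2]
                       : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    s.stddev = sqrt_newton(sq / (double)(n - 1));
    s.cv_pct = 100.0 * s.stddev / mean;
    return s;
}
```

Feeding the ten baseline runs and the ten optimized runs through `summarize_runs` is a quick way to cross-check the mean, median, and CV values in this report.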
## Performance Delta
- **Mean gain**: **-0.21%** (45.40M → 45.30M ops/s)
- **Median gain**: **-0.62%** (45.51M → 45.22M ops/s)
- **Run-to-run spread**: StdDev rose from 0.38M to 0.49M ops/s (CV 0.84% → 1.07%)
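The deltas above are plain relative differences against the baseline. As a small sketch (the helper names `pct_delta` and `delta_in_sigmas` are introduced here, not taken from the test scripts), the same arithmetic can also express the mean shift in units of the baseline standard deviation, which is a convenient way to judge it against run-to-run noise.

```c
#include <assert.h>

/* Relative change of candidate vs. baseline, in percent. */
static double pct_delta(double baseline, double candidate) {
    assert(baseline > 0.0);
    return 100.0 * (candidate - baseline) / baseline;
}

/* Absolute shift expressed as a multiple of the baseline standard deviation. */
static double delta_in_sigmas(double baseline, double candidate,
                              double baseline_stddev) {
    assert(baseline_stddev > 0.0);
    double d = candidate - baseline;
    return (d < 0.0 ? -d : d) / baseline_stddev;
}
```

With the reported means (45,398,294 and 45,304,949 ops/s) and the baseline StdDev of 379,641 ops/s, the shift works out to roughly a quarter of one standard deviation.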
## Decision
**Result**: ⚪ **NEUTRAL** (-0.21% within ±1.0% noise threshold)
**Criteria**:
- GO: mean gain >= +1.0%
- NEUTRAL: mean gain in (-1.0%, +1.0%)
- NO-GO: mean gain <= -1.0%
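The criteria map directly onto a three-way classifier. The sketch below is illustrative only; the enum and function names are hypothetical. It assigns a gain of exactly +1.0% to GO and exactly -1.0% to NO-GO, following the `>=` and `<=` in the list.

```c
#include <assert.h>

typedef enum { AB_NO_GO, AB_NEUTRAL, AB_GO } ab_decision;

/* Classify a mean gain (in percent) against the ±1.0% thresholds. */
static ab_decision classify_gain(double mean_gain_pct) {
    assert(mean_gain_pct == mean_gain_pct);  /* reject NaN input */
    if (mean_gain_pct >= 1.0)  return AB_GO;
    if (mean_gain_pct <= -1.0) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

Applied to this test, `classify_gain(-0.21)` lands in the neutral band, while the +3.92% recorded for Phase 4 E1 would classify as GO.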
**Action**: **FREEZE** as research box (default OFF, no promotion)
## Analysis
### Why NEUTRAL (not GO)?
1. **Minimal performance impact**: the -0.21% mean delta (about 93K ops/s) is roughly a quarter of the baseline StdDev (380K ops/s), well inside measurement noise
2. **Free path comparison**: Free DUALHOT achieved +13%, alloc shows -0.21%
3. **Branch overhead**: C0-C3 check adds branch cost without measurable benefit
4. **Route caching effectiveness**: Phase 3 C3 static routing already optimized route lookup
### Root Cause Analysis
**Why alloc path differs from free path**:
1. **Free path (DUALHOT=1, +13% gain)**:
- Skips expensive `policy_snapshot()` TLS read
- Skips `tiny_route_for_class()` lookup
- Direct to `tiny_legacy_fallback_free_base()`
- Large savings: ~10-15 cycles per C0-C3 free
2. **Alloc path (DUALHOT=1, -0.21% neutral)**:
- Route already cached via `tiny_static_route_ready_fast()` (Phase 3 C3)
- Policy snapshot lightweight (TLS cache hit)
- C0-C3 check adds branch before route lookup
- Net effect: Branch cost ≈ Route lookup savings → neutral
### Key Insight
**Alloc route optimization has reached diminishing returns**:
- Phase 3 C3 (static routing) already eliminated route lookup overhead
- Additional C0-C3 specialization adds branch without reducing existing cost
- Unlike free path, alloc path doesn't have "expensive operation to skip"
**Conclusion**: Per-class specialization is effective only when it bypasses measurable overhead. On the alloc path, route caching has already removed that overhead, so there is little left to bypass.
## Next Steps
**Freeze E2**:
- Keep `HAKMEM_TINY_ALLOC_DUALHOT` as research box (default OFF)
- No promotion to MIXED_TINYV3_C7_SAFE preset
- Implementation retained for potential future workload-specific use
**Future Optimization Vectors**:
- Consult perf profile for new hot spots (E1 changed baseline)
- Consider non-routing optimizations (e.g., TLS layout, memory access patterns)
- Explore workload-specific specialization (C6-heavy may differ from Mixed)
## References
- Design: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_1_DESIGN.md`
- Instructions: `docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_NEXT_INSTRUCTIONS.md`
- Phase 4 E1 (ENV Snapshot): +3.92% GO (research box)
- Phase 3 C3 (Static Routing): +2.20% ADOPT (default ON)