Files
hakmem/docs/analysis/PHASE3_BASELINE_AND_CANDIDATES.md

448 lines
14 KiB
Markdown
Raw Normal View History

# Phase 3: Baseline Establishment & Next Optimization Candidates
**Date**: 2025-12-13
**Status**: BASELINE ESTABLISHED
**Goal**: Identify next micro-optimization targets with +1-5% potential each
---
## Executive Summary
**Baseline Performance (MID_V3=0, MIXED workload)**:
- Mean: 45.78M ops/s
- Median: 46.79M ops/s
- Range: 42.36M - 47.12M ops/s
- StdDev: ~1.75M ops/s (3.8% variance)
**Top Optimization Candidates**:
1. **free() wrapper** (28.95% self%) - HIGH PRIORITY
2. **tiny_alloc_gate_fast()** (12.75% self%) - HIGH PRIORITY
3. **main() benchmark overhead** (12.53% self%) - IGNORE (benchmark artifact)
**Expected Next Gains**: +3-8% cumulative from free path optimizations
---
## Step 0: Baseline Establishment
### Configuration Verification
**Profile**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
**Key Settings** (verified in `/mnt/workdisk/public_share/hakmem/core/bench_profile.h:74`):
```c
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0"); // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
```
**Optimization Flags Enabled**:
- `HAKMEM_FREE_TINY_FAST_HOTCOLD=1` (Phase FREE-TINY-FAST-DUALHOT-1)
- `HAKMEM_WRAP_SHAPE=1` (Phase 2 B4: Hot/Cold wrapper split)
- `HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1` (Phase 2 B3: Route branch optimization)
- `HAKMEM_TINY_STATIC_ROUTE=1` (Phase 3 C3: Static routing, +2.2%)
### Baseline Measurements (5 runs)
| Run | Throughput (M ops/s) | Time (ms) |
|-----|---------------------|-----------|
| 1 | 46.84 | 21 |
| 2 | 46.79 | 21 |
| 3 | 45.77 | 22 |
| 4 | 47.12 | 21 |
| 5 | 42.36 | 24 |
**Statistics**:
- **Mean**: 45.78M ops/s
- **Median**: 46.79M ops/s
- **Min**: 42.36M ops/s
- **Max**: 47.12M ops/s
- **Range**: 4.76M ops/s (11.2%)
- **StdDev**: ~1.75M ops/s (3.8% variance)
**Assessment**: Baseline is stable with ~3.8% variance. Run 5 (42.36M) is an outlier, likely due to system noise. Median (46.79M) is a reliable baseline reference.
**Comparison to Previous**:
- Previous C3 baseline: ~39.8M ops/s (with default settings)
- **Current baseline: 46.79M ops/s**
- **Improvement: +17.5%** (confirms MID_V3=0 + cumulative optimizations working)
---
## Step 1: Perf Profiling Results
### Profiling Setup
**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1
```
**Results**:
- Samples: 30 (cycles:P event)
- Event count: 921,849,973 cycles
- Throughput: 47.37M ops/s (consistent with baseline)
### Top Functions by Self%
| Rank | Symbol | Self% | Samples | Children% | Category |
|------|-------------------------------------|--------|---------|-----------|-----------------|
| 1 | `free` | 28.95% | 3 | 45.20% | **HOT WRAPPER** |
| 2 | `tiny_alloc_gate_fast.lto_priv.0` | 12.75% | 3 | 29.11% | **HOT ALLOC** |
| 3 | `main` | 12.53% | 3 | 21.00% | Benchmark |
| 4 | `malloc` | 12.43% | 3 | 16.71% | Wrapper |
| 5 | `tiny_front_v3_enabled.lto_priv.0` | 7.75% | 2 | 7.85% | Tiny front |
| 6 | `tiny_route_for_class.lto_priv.0` | 4.39% | 2 | 24.78% | Route lookup |
| 7 | `free.cold` | 4.15% | 1 | 4.15% | Cold path |
| 8 | `hak_pool_free` | 4.02% | 1 | 4.02% | Pool free |
### Call Graph Analysis
**free() hot path** (28.95% self, 45.20% children):
```
free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%) ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)
```
**tiny_alloc_gate_fast** (12.75% self, 29.11% children):
```
tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)
```
**Key Insight**: `tiny_route_for_class()` is called from `free()` and consuming 20.38% of total time. This is the **#1 optimization target**.
---
## Step 2: Candidate Prioritization
### HIGH PRIORITY (Expected +3-5% each)
#### 1. **free() wrapper path** (28.95% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524`
**Current Implementation**:
```c
void free(void* ptr) {
if (!ptr) return;
// BenchFast bypass (unlikely, 0)
if (__builtin_expect(bench_fast_enabled(), 0)) { ... }
const wrapper_env_cfg_t* wcfg = wrapper_env_cfg(); // ← Memory load
if (__builtin_expect(wcfg->wrap_shape, 0)) { // ← Branch
if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
int freed;
if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
freed = free_tiny_fast_hot(ptr);
} else {
freed = free_tiny_fast(ptr);
}
if (__builtin_expect(freed, 1)) {
return; // SUCCESS
}
}
return free_cold(ptr, wcfg);
}
// Legacy path...
}
```
**Optimization Opportunities**:
**A. Cache `wrapper_env_cfg()` result** (Expected: +1-2%)
- Currently calls `wrapper_env_cfg()` on every free
- Could cache in TLS or register during init
- Risk: LOW (read-only after init)
**B. Inline `free_tiny_fast_hot()` decision** (Expected: +1-2%)
- Branch `hak_free_tiny_fast_hotcold_enabled()` is runtime env check
- Could be compile-time or init-time cached
- Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)
**C. Reduce branch mispredictions** (Expected: +0.5-1%)
- Reorder branches to put likely path first
- Current: `bench_fast_enabled()` checked first (unlikely=0)
- Optimization: Move Tiny fast path check earlier
- Risk: LOW
**Total Expected Gain: +2.5-5%**
#### 2. **tiny_route_for_class()** (4.39% self%, 24.78% children)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147`
**Current Implementation**:
```c
static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
FREE_DISPATCH_STAT_INC(route_for_class_calls); // Debug stat (RELEASE: noop)
if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
tiny_route_snapshot_init();
}
if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
return TINY_ROUTE_LEGACY;
}
return g_tiny_route_class[ci];
}
```
**Optimization Opportunities**:
**A. Eliminate `g_tiny_route_snapshot_done` check** (Expected: +1-2%)
- Check happens on EVERY call from free path
- Phase 3 C3 already implemented static routing for alloc path
- **Proposal**: Apply same static route cache to free path
- Implementation: Add `tiny_static_route_for_free(ci)` that bypasses snapshot check
- Risk: MEDIUM (need to ensure init ordering)
**B. Remove bounds check `ci >= TINY_NUM_CLASSES`** (Expected: +0.5-1%)
- In free path, `ci` is derived from header (already validated)
- Could add `tiny_route_for_class_unchecked(ci)` variant
- Risk: MEDIUM (need careful caller audit)
**Total Expected Gain: +1.5-3%**
#### 3. **tiny_alloc_gate_fast()** (12.75% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139`
**Current Implementation**:
```c
static inline void* tiny_alloc_gate_fast(size_t size)
{
int class_idx = hak_tiny_size_to_class(size);
if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
return NULL;
}
TinyRoutePolicy route = tiny_route_get(class_idx); // ← Already optimized (Phase 3 C3)
if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
return NULL;
}
void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
return user_ptr;
}
// ROUTE_TINY_FIRST fallback...
}
```
**Optimization Opportunities**:
**A. Specialize for common routes** (Expected: +1-2%)
- MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
- Could create `tiny_alloc_gate_fast_legacy_only()` variant
- Eliminates ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
- Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)
**B. Inline `malloc_tiny_fast_for_class()`** (Expected: +0.5-1%)
- Might not be fully inlined by LTO
- Add `__attribute__((always_inline))` hint
- Risk: LOW
**Total Expected Gain: +1.5-3%**
---
### MEDIUM PRIORITY (Expected +0.5-1% each)
#### 4. **tiny_front_v3_enabled()** (7.75% self%)
- Appears in free path via `free_tiny_fast_hot()`
- Likely a runtime env check that could be cached
- Risk: LOW
- Expected Gain: +0.5-1%
#### 5. **free.cold** (4.15% self%)
- Cold path for free wrapper
- Handles classification and fallback
- Not a hot optimization target (already in slow path)
- Expected Gain: <+0.5%
---
### LOW PRIORITY / IGNORE
#### 6. **main()** (12.53% self%)
- Benchmark overhead (not part of allocator)
- IGNORE
#### 7. **malloc()** (12.43% self%)
- Already optimized in previous phases
- Appears lower than free in profile
- Defer to next round
---
## Step 3: Recommended Next Steps
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
### Phase 3 D1: Free Path Route Cache ✅ GOENV opt-in
**Target**: `tiny_route_for_class()` の呼び出しを free path から削る
**Result**: Mixed 10-run mean **+1.06%**median は負ける回がある)
**Decision**: ✅ GO だが **default 化は 20-run 確認待ち**
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
**ENV Gate**: `HAKMEM_FREE_STATIC_ROUTE=1`default: 0
---
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
### Phase 3 D2: Wrapper Env Cache ❌ NO-GOFROZEN
**Target**: `wrapper_env_cfg()` の呼び出しを wrapper hot path から削る
**Result**: Mixed 10-run mean **-1.44%** regression
**Decision**: ❌ NO-GO研究箱 freeze、default OFF
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1`default: 0
---
### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)
**Target**: `tiny_alloc_gate_fast()` for LEGACY-only route
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours
**Implementation**:
1. Create `tiny_alloc_gate_fast_legacy()` specialized variant
2. Eliminate ROUTE_POOL_ONLY and ROUTE_TINY_FIRST branches
3. Use in MIXED profile where all classes are LEGACY
4. A/B test: BASELINE vs D3
**ENV Gate**: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=1` (default: 0)
---
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
## Expected Cumulative Results更新
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
| Phase | Optimization | Expected Gain | Notes |
|------------|----------------------------------|---------------|-------|
| Baseline | MID_V3=0 + B3+B4+C3 | - | — |
| **D1** | Free route cache | +0〜+2% | mean は勝ち、median 確認待ちdefault OFF |
| **D2** | Wrapper env cache | — | NO-GOfreeze |
| **D3** | Alloc gate specialization | +0〜+2% | perf で 5% 超なら着手 |
**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)
---
## Risk Assessment
| Optimization | Risk Level | Mitigation |
|---------------------|------------|-------------------------------------------------|
| Free route cache | MEDIUM | Ensure init ordering, ENV gate for rollback |
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
| Wrapper env cache | — | NO-GO-1.44% regression |
| Alloc specialization| LOW | Profile-specific, existing static route pattern |
**All optimizations**: Follow ENV gate + A/B test + decision pattern (research box)
---
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
## Post-D1/D2 Status (2025-12-13)
### Phase 3 D1/D2 Validation Complete ✅
1. **D1 (Free Route Cache)**: ✅ ADOPT - PROMOTED TO DEFAULT
- 20-run validation completed
- Results: Mean +2.19%, Median +2.37% (both criteria met)
- Status: Added to MIXED_TINYV3_C7_SAFE preset as default
- Implementation: `HAKMEM_FREE_STATIC_ROUTE=1`
2. **D2 (Wrapper Env Cache)**: ❌ FROZEN
- Results: -1.44% regression
- Status: Research box frozen, default OFF, do not pursue
- Implementation: `HAKMEM_WRAP_ENV_CACHE=1` (opt-in only, not recommended)
### Active Optimizations in MIXED_TINYV3_C7_SAFE
1. **B3**: Routing branch shape (+2.89% proven)
2. **B4**: Wrapper hot/cold split (+1.47% proven)
3. **C3**: Static routing (+2.20% proven)
4. **D1**: Free route cache (+2.19% proven) - NEW
5. **MID_V3**: OFF for Mixed (C6 routing fix, +13% proven)
**Cumulative gain**: ~7.6% (B3 + B4 + C3 + D1, excluding MID_V3 fix)
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
### Next Actions
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
1. **Profile**: Run perf on current baseline to identify next targets
- Requirement: self% ≥5% for Phase 3 D3 consideration
- Target: `tiny_alloc_gate_fast` specialization
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
2. **Optional**: Phase 3 D3 (Alloc gate specialization) - pending perf validation
- Only proceed if perf shows ≥5% self% in alloc gate
- ENV: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=0/1`
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established Summary: - D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median) - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median) - Mean gain: +2.19%, Median gain: +2.37% - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%) - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset - D2 (Wrapper env cache): FROZEN - Previous result: -1.44% regression (TLS overhead > benefit) - Status: Research box (do not pursue further) - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset) - Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13) Cumulative Gains (Phase 2-3): B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19% Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%) MID_V3 fix: +13% (structural change, Mixed OFF by default) Documentation Updates: - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status - CURRENT_TASK.md: Phase 3 complete summary Next: - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%) - Or Phase 4 planning if no more D3-class targets - Current active optimizations: B3, B4, C3, D1, MID_V3 fix Files Changed: - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines) - docs/analysis/*.md (6 files updated with D1/D2 results) - CURRENT_TASK.md (Phase 3 status update) - analyze_d1_results.py (statistical analysis script) - core/bench_profile.h (D1 promoted to default in MIXED preset) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
3. **Phase 4 Planning**: If no more 5%+ targets, prepare Phase 4 roadmap
---
## Appendix: Raw Perf Data
### Full Perf Report (Top 20)
```
# Samples: 30 of event 'cycles:P'
# Event count (approx.): 921849973
46.11% 0.00% 0 [.] 0000000000000000
45.20% 28.95% 3 [.] free
29.11% 12.75% 3 [.] tiny_alloc_gate_fast.lto_priv.0
24.78% 4.39% 2 [.] tiny_route_for_class.lto_priv.0
21.00% 12.53% 3 [.] main
16.71% 12.43% 3 [.] malloc
12.95% 4.27% 1 [.] tiny_region_id_write_header.lto_priv.0
8.66% 4.39% 1 [.] tiny_c7_ultra_free
8.56% 4.28% 1 [.] free_tiny_fast_cold.lto_priv.0
7.85% 7.75% 2 [.] tiny_front_v3_enabled.lto_priv.0
4.27% 0.00% 0 [.] 0x00007ad3a9c2d001
4.23% 0.00% 0 [.] tiny_c7_ultra_enabled_env.lto_priv.0
4.21% 0.00% 0 [.] 0x00007ad3ab960c81
4.20% 0.00% 0 [.] 0x00007ad3ab939401
4.15% 4.15% 1 [.] free.cold
4.15% 0.00% 0 [.] unified_cache_push.lto_priv.0
4.02% 4.02% 1 [.] hak_pool_free
```
### Baseline Run Details
**Run 1**: 46.84M ops/s
```
Throughput = 46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208
```
**Run 2**: 46.79M ops/s
```
Throughput = 46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```
**Run 3**: 45.77M ops/s
```
Throughput = 45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176
```
**Run 4**: 47.12M ops/s
```
Throughput = 47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```
**Run 5**: 42.36M ops/s (outlier)
```
Throughput = 42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080
```
---
## Document History
- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)