hakmem/docs/analysis/PHASE3_BASELINE_AND_CANDIDATES.md
Moe Charm (CI) 50bded8c85 Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established
Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
  - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
  - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
  - Mean gain: +2.19%, Median gain: +2.37%
  - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
  - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset

- D2 (Wrapper env cache): FROZEN
  - Previous result: -1.44% regression (TLS overhead > benefit)
  - Status: Research box (do not pursue further)
  - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)

- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)

Cumulative Gains (Phase 2-3):
  B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
  Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
  MID_V3 fix: +13% (structural change, Mixed OFF by default)

Documentation Updates:
  - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
  - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
  - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
  - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
  - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
  - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
  - CURRENT_TASK.md: Phase 3 complete summary

Next:
  - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
  - Or Phase 4 planning if no more D3-class targets
  - Current active optimizations: B3, B4, C3, D1, MID_V3 fix

Files Changed:
  - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
  - docs/analysis/*.md (6 files updated with D1/D2 results)
  - CURRENT_TASK.md (Phase 3 status update)
  - analyze_d1_results.py (statistical analysis script)
  - core/bench_profile.h (D1 promoted to default in MIXED preset)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00

# Phase 3: Baseline Establishment & Next Optimization Candidates
**Date**: 2025-12-13
**Status**: BASELINE ESTABLISHED
**Goal**: Identify next micro-optimization targets with +1-5% potential each
---
## Executive Summary
**Baseline Performance (MID_V3=0, MIXED workload)**:
- Mean: 45.78M ops/s
- Median: 46.79M ops/s
- Range: 42.36M - 47.12M ops/s
- StdDev: ~1.75M ops/s (3.8% of mean)
**Top Optimization Candidates**:
1. **free() wrapper** (28.95% self%) - HIGH PRIORITY
2. **tiny_alloc_gate_fast()** (12.75% self%) - HIGH PRIORITY
3. **main() benchmark overhead** (12.53% self%) - IGNORE (benchmark artifact)
**Expected Next Gains**: +3-8% cumulative from free path optimizations
---
## Step 0: Baseline Establishment
### Configuration Verification
**Profile**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
**Key Settings** (verified in `/mnt/workdisk/public_share/hakmem/core/bench_profile.h:74`):
```c
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0"); // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
```
**Optimization Flags Enabled**:
- `HAKMEM_FREE_TINY_FAST_HOTCOLD=1` (Phase FREE-TINY-FAST-DUALHOT-1)
- `HAKMEM_WRAP_SHAPE=1` (Phase 2 B4: Hot/Cold wrapper split)
- `HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1` (Phase 2 B3: Route branch optimization)
- `HAKMEM_TINY_STATIC_ROUTE=1` (Phase 3 C3: Static routing, +2.2%)
### Baseline Measurements (5 runs)
| Run | Throughput (M ops/s) | Time (ms) |
|-----|---------------------|-----------|
| 1 | 46.84 | 21 |
| 2 | 46.79 | 21 |
| 3 | 45.77 | 22 |
| 4 | 47.12 | 21 |
| 5 | 42.36 | 24 |
**Statistics**:
- **Mean**: 45.78M ops/s
- **Median**: 46.79M ops/s
- **Min**: 42.36M ops/s
- **Max**: 47.12M ops/s
- **Range**: 4.76M ops/s (11.2%)
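- **StdDev**: ~1.75M ops/s (3.8% of mean)
**Assessment**: Baseline is stable, with a run-to-run standard deviation of ~3.8% of the mean. Run 5 (42.36M) is an outlier, likely due to system noise; the other four runs fall within 45.77M-47.12M. The median (46.79M) is therefore the more reliable baseline reference.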
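The statistics above can be reproduced with a short helper. This is a minimal sketch, not the project's `analyze_d1_results.py`; the names `run_stats_t`, `run_stats`, and `sketch_sqrt` are hypothetical. It assumes the quoted StdDev is the population form (divide by n), which gives about 1.77M for these five runs and is close to the ~1.75M quoted; the sample form (divide by n-1) would give a larger value.

```c
#include <assert.h>
#include <stdlib.h>

/* Summary statistics for a small set of benchmark runs (hypothetical type). */
typedef struct {
    double mean;
    double median;
    double stddev; /* population standard deviation (divide by n) */
    double cv_pct; /* stddev as a percentage of the mean */
} run_stats_t;

/* Newton iteration for sqrt, used so the sketch needs no libm link flag. */
static double sketch_sqrt(double x) {
    if (x <= 0.0) return 0.0;
    double r = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Sorts `v` in place, then fills in the summary. Requires n > 0. */
static run_stats_t run_stats(double* v, size_t n) {
    assert(n > 0);
    run_stats_t s;
    double sum = 0.0, sq = 0.0;
    qsort(v, n, sizeof v[0], cmp_double);
    for (size_t i = 0; i < n; i++) sum += v[i];
    s.mean = sum / (double)n;
    for (size_t i = 0; i < n; i++) sq += (v[i] - s.mean) * (v[i] - s.mean);
    s.stddev = sketch_sqrt(sq / (double)n);
    s.median = (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
    s.cv_pct = 100.0 * s.stddev / s.mean;
    return s;
}
```

Feeding it the five throughput values from the table (in M ops/s) shows why the median is preferred here: a single slow run pulls the mean down by about 1M ops/s while leaving the median untouched.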
**Comparison to Previous**:
- Previous C3 baseline: ~39.8M ops/s (with default settings)
- **Current baseline: 46.79M ops/s**
- **Improvement: +17.5%** (confirms MID_V3=0 + cumulative optimizations working)
---
## Step 1: Perf Profiling Results
### Profiling Setup
**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1
```
**Results**:
- Samples: 30 (cycles:P event)
- Event count: 921,849,973 cycles
- Throughput: 47.37M ops/s (consistent with baseline)
### Top Functions by Self%
| Rank | Symbol | Self% | Samples | Children% | Category |
|------|-------------------------------------|--------|---------|-----------|-----------------|
| 1 | `free` | 28.95% | 3 | 45.20% | **HOT WRAPPER** |
| 2 | `tiny_alloc_gate_fast.lto_priv.0` | 12.75% | 3 | 29.11% | **HOT ALLOC** |
| 3 | `main` | 12.53% | 3 | 21.00% | Benchmark |
| 4 | `malloc` | 12.43% | 3 | 16.71% | Wrapper |
| 5 | `tiny_front_v3_enabled.lto_priv.0` | 7.75% | 2 | 7.85% | Tiny front |
| 6 | `tiny_route_for_class.lto_priv.0` | 4.39% | 2 | 24.78% | Route lookup |
| 7 | `free.cold` | 4.15% | 1 | 4.15% | Cold path |
| 8 | `hak_pool_free` | 4.02% | 1 | 4.02% | Pool free |
### Call Graph Analysis
**free() hot path** (28.95% self, 45.20% children):
```
free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%) ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)
```
**tiny_alloc_gate_fast** (12.75% self, 29.11% children):
```
tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)
```
**Key Insight**: `tiny_route_for_class()` is called from `free()` and accounts for 20.38% of total time in the call graph. This is the **#1 optimization target**.
---
## Step 2: Candidate Prioritization
### HIGH PRIORITY (Expected +1.5-5% each)
#### 1. **free() wrapper path** (28.95% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524`
**Current Implementation**:
```c
void free(void* ptr) {
    if (!ptr) return;
    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }
    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg(); // ← Memory load
    if (__builtin_expect(wcfg->wrap_shape, 0)) {       // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return; // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}
```
**Optimization Opportunities**:
**A. Cache `wrapper_env_cfg()` result** (Expected: +1-2%)
- Currently calls `wrapper_env_cfg()` on every free
- Could cache in TLS or register during init
- Risk: LOW (read-only after init)
**B. Inline `free_tiny_fast_hot()` decision** (Expected: +1-2%)
- The `hak_free_tiny_fast_hotcold_enabled()` branch is a runtime env check
- Could be compile-time or init-time cached
- Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)
**C. Reduce branch mispredictions** (Expected: +0.5-1%)
- Reorder branches to put likely path first
- Current: `bench_fast_enabled()` checked first (unlikely=0)
- Optimization: Move Tiny fast path check earlier
- Risk: LOW
**Total Expected Gain: +2.5-5%**
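Option B (resolving an env-driven decision once instead of on every call) might look like the sketch below. It is illustrative only: the variable name `HAKMEM_SKETCH_HOTCOLD_DEMO` and every `sketch_*` identifier are hypothetical and not taken from the hakmem sources. It uses a plain process-global rather than TLS, since the D2 result recorded in this document attributes its regression to TLS overhead.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical variable name, used only for this sketch. */
#define SKETCH_ENV_NAME "HAKMEM_SKETCH_HOTCOLD_DEMO"

static int g_sketch_hotcold = -1; /* -1 = not resolved yet */

/* Resolve the flag once; intended to be called from allocator init. */
static void sketch_hotcold_init(void) {
    const char* v = getenv(SKETCH_ENV_NAME);
    g_sketch_hotcold = (v != NULL && v[0] == '1') ? 1 : 0;
    assert(g_sketch_hotcold == 0 || g_sketch_hotcold == 1);
}

/* Hot-path query: one global load plus a predictable branch after init. */
static inline int sketch_hotcold_enabled(void) {
    if (__builtin_expect(g_sketch_hotcold < 0, 0)) {
        sketch_hotcold_init(); /* lazy fallback if init was skipped */
    }
    return g_sketch_hotcold;
}
```

After the first call the environment is never consulted again, so later changes to the variable have no effect. That is the property that makes the branch cheap, and also the reason init ordering matters.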
#### 2. **tiny_route_for_class()** (4.39% self%, 24.78% children)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147`
**Current Implementation**:
```c
static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls); // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}
```
**Optimization Opportunities**:
**A. Eliminate `g_tiny_route_snapshot_done` check** (Expected: +1-2%)
- Check happens on EVERY call from free path
- Phase 3 C3 already implemented static routing for alloc path
- **Proposal**: Apply same static route cache to free path
- Implementation: Add `tiny_static_route_for_free(ci)` that bypasses snapshot check
- Risk: MEDIUM (need to ensure init ordering)
**B. Remove bounds check `ci >= TINY_NUM_CLASSES`** (Expected: +0.5-1%)
- In free path, `ci` is derived from header (already validated)
- Could add `tiny_route_for_class_unchecked(ci)` variant
- Risk: MEDIUM (need careful caller audit)
**Total Expected Gain: +1.5-3%**
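Proposals A and B can be combined into a single unchecked lookup, sketched below under stated assumptions. All `sketch_*` and `SKETCH_*` names are hypothetical stand-ins, the class count of 8 is assumed from the C0-C7 naming used elsewhere in this document, and the table contents are arbitrary. The checked variant mirrors the current function; the hot-path variant moves both guards to a debug-only assertion.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the real definitions; values are assumptions. */
#define SKETCH_NUM_CLASSES 8
typedef enum { SKETCH_ROUTE_LEGACY = 0, SKETCH_ROUTE_ULTRA = 1 } sketch_route_t;

static sketch_route_t g_sketch_free_route[SKETCH_NUM_CLASSES];
static int g_sketch_free_route_ready = 0;

/* Build the table once, before the first free() can reach the hot path. */
static void sketch_free_route_init(void) {
    for (int ci = 0; ci < SKETCH_NUM_CLASSES; ci++) {
        g_sketch_free_route[ci] = SKETCH_ROUTE_LEGACY;
    }
    g_sketch_free_route[7] = SKETCH_ROUTE_ULTRA; /* arbitrary example entry */
    g_sketch_free_route_ready = 1;
}

/* Checked lookup: keeps the init guard and the bounds guard. */
static inline sketch_route_t sketch_route_for_class(uint8_t ci) {
    if (__builtin_expect(!g_sketch_free_route_ready, 0)) sketch_free_route_init();
    if (__builtin_expect(ci >= SKETCH_NUM_CLASSES, 0)) return SKETCH_ROUTE_LEGACY;
    return g_sketch_free_route[ci];
}

/* Hot-path lookup: caller guarantees init ran and ci is in range.
 * The assertion is compiled out when NDEBUG is defined. */
static inline sketch_route_t sketch_static_route_for_free(uint8_t ci) {
    assert(g_sketch_free_route_ready && ci < SKETCH_NUM_CLASSES);
    return g_sketch_free_route[ci];
}
```

The MEDIUM risk rating corresponds to the two preconditions in the assertion: the unchecked variant is only safe if initialization is guaranteed to run first and every caller passes a validated class index.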
#### 3. **tiny_alloc_gate_fast()** (12.75% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139`
**Current Implementation**:
```c
static inline void* tiny_alloc_gate_fast(size_t size)
{
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }
    TinyRoutePolicy route = tiny_route_get(class_idx); // ← Already optimized (Phase 3 C3)
    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }
    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }
    // ROUTE_TINY_FIRST fallback...
}
```
**Optimization Opportunities**:
**A. Specialize for common routes** (Expected: +1-2%)
- MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
- Could create `tiny_alloc_gate_fast_legacy_only()` variant
- Eliminates ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
- Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)
**B. Inline `malloc_tiny_fast_for_class()`** (Expected: +0.5-1%)
- Might not be fully inlined by LTO
- Add `__attribute__((always_inline))` hint
- Risk: LOW
**Total Expected Gain: +1.5-3%**
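The route specialization in option A can be illustrated with a toy gate. This is a sketch under assumptions, not the hakmem implementation: every `sk_*` and `SK_*` name is hypothetical, the size-class mapping (power-of-two classes from 8 to 1024 bytes) is invented for the example, and the allocator is a stub that hands out a fixed slot per class.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NUM_CLASSES 8
typedef enum { SK_ROUTE_TINY_ONLY = 0, SK_ROUTE_TINY_FIRST, SK_ROUTE_POOL_ONLY } sk_route_t;

static sk_route_t g_sk_route[SK_NUM_CLASSES]; /* zero-init: all TINY_ONLY */
static char g_sk_slot[SK_NUM_CLASSES][1024];  /* stub backing storage */

/* Invented mapping: class c serves sizes up to (8 << c) bytes. */
static int sk_size_to_class(size_t size) {
    if (size == 0 || size > 1024) return -1;
    int c = 0;
    while (((size_t)8 << c) < size) c++;
    return c;
}

/* Stub allocator: returns the fixed slot for the class. */
static void* sk_tiny_alloc(int ci) {
    assert(ci >= 0 && ci < SK_NUM_CLASSES);
    return g_sk_slot[ci];
}

/* General gate: consults the route on every call. */
static void* sk_alloc_gate(size_t size) {
    int ci = sk_size_to_class(size);
    if (ci < 0) return NULL;
    if (g_sk_route[ci] == SK_ROUTE_POOL_ONLY) return NULL;
    return sk_tiny_alloc(ci); /* TINY_FIRST fallback omitted in this sketch */
}

/* Specialized gate: no route load or route branches.
 * Valid only while sk_all_tiny_only() holds. */
static void* sk_alloc_gate_legacy_only(size_t size) {
    int ci = sk_size_to_class(size);
    if (ci < 0) return NULL;
    return sk_tiny_alloc(ci);
}

/* Init-time check deciding whether the specialized gate may be selected. */
static int sk_all_tiny_only(void) {
    for (int ci = 0; ci < SK_NUM_CLASSES; ci++) {
        if (g_sk_route[ci] != SK_ROUTE_TINY_ONLY) return 0;
    }
    return 1;
}
```

The two gates agree as long as every class is `TINY_ONLY`. Once any class is routed elsewhere they diverge, which is why the specialized variant should be selected once per profile rather than assumed globally.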
---
### MEDIUM PRIORITY (Expected +0.5-1% each)
#### 4. **tiny_front_v3_enabled()** (7.75% self%)
- Appears in free path via `free_tiny_fast_hot()`
- Likely a runtime env check that could be cached
- Risk: LOW
- Expected Gain: +0.5-1%
#### 5. **free.cold** (4.15% self%)
- Cold path for free wrapper
- Handles classification and fallback
- Not a hot optimization target (already in slow path)
- Expected Gain: <+0.5%
---
### LOW PRIORITY / IGNORE
#### 6. **main()** (12.53% self%)
- Benchmark overhead (not part of allocator)
- IGNORE
#### 7. **malloc()** (12.43% self%)
- Already optimized in previous phases
- Appears lower than free in profile
- Defer to next round
---
## Step 3: Recommended Next Steps
### Phase 3 D1: Free Path Route Cache ✅ GO (ENV opt-in)
**Target**: Remove `tiny_route_for_class()` calls from the free path
**Result**: Mixed 10-run mean **+1.06%** (the median loses in some runs)
**Decision**: ✅ GO, but **promotion to default awaits 20-run confirmation**
**ENV Gate**: `HAKMEM_FREE_STATIC_ROUTE=1` (default: 0)
---
### Phase 3 D2: Wrapper Env Cache ❌ NO-GO (FROZEN)
**Target**: Remove `wrapper_env_cfg()` calls from the wrapper hot path
**Result**: Mixed 10-run mean **-1.44%** regression
**Decision**: ❌ NO-GO (research box frozen, default OFF)
**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1` (default: 0)
---
### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)
**Target**: `tiny_alloc_gate_fast()` for LEGACY-only route
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours
**Implementation**:
1. Create `tiny_alloc_gate_fast_legacy()` specialized variant
2. Eliminate ROUTE_POOL_ONLY and ROUTE_TINY_FIRST branches
3. Use in MIXED profile where all classes are LEGACY
4. A/B test: BASELINE vs D3
**ENV Gate**: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=1` (default: 0)
---
## Expected Cumulative Results (updated)
| Phase | Optimization | Expected Gain | Notes |
|------------|----------------------------------|---------------|-------|
| Baseline | MID_V3=0 + B3+B4+C3 | - | - |
| **D1** | Free route cache | +0 to +2% | Mean wins, median pending confirmation (default OFF) |
| **D2** | Wrapper env cache | - | NO-GO (frozen) |
| **D3** | Alloc gate specialization | +0 to +2% | Start if perf shows more than 5% |
**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)
---
## Risk Assessment
| Optimization | Risk Level | Mitigation |
|---------------------|------------|-------------------------------------------------|
| Free route cache | MEDIUM | Ensure init ordering, ENV gate for rollback |
| Wrapper env cache | - | NO-GO (-1.44% regression) |
| Alloc specialization| LOW | Profile-specific, existing static route pattern |
**All optimizations**: Follow ENV gate + A/B test + decision pattern (research box)
---
## Post-D1/D2 Status (2025-12-13)
### Phase 3 D1/D2 Validation Complete ✅
1. **D1 (Free Route Cache)**: ✅ ADOPT - PROMOTED TO DEFAULT
   - 20-run validation completed
   - Results: Mean +2.19%, Median +2.37% (both criteria met)
   - Status: Added to MIXED_TINYV3_C7_SAFE preset as default
   - Implementation: `HAKMEM_FREE_STATIC_ROUTE=1`
2. **D2 (Wrapper Env Cache)**: ❌ FROZEN
   - Results: -1.44% regression
   - Status: Research box frozen, default OFF, do not pursue
   - Implementation: `HAKMEM_WRAP_ENV_CACHE=1` (opt-in only, not recommended)
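The GO/NO-GO rule behind these results can be written down directly. The sketch below uses the thresholds from the commit summary (mean gain of at least +1.0%, median gain of at least +0.0%); the function names `gain_pct` and `ab_is_go` are hypothetical. Applied to the rounded 20-run figures (baseline 46.30M mean and median, optimized 47.32M mean and 47.39M median), it yields gains of roughly +2.2% and +2.35%, which agree with the reported +2.19% and +2.37% up to rounding of the inputs.

```c
#include <assert.h>

/* Percentage gain of `opt` over `base`. Requires base > 0. */
static double gain_pct(double base, double opt) {
    assert(base > 0.0);
    return 100.0 * (opt - base) / base;
}

/* GO when the mean gain is at least +1.0% and the median gain is not negative. */
static int ab_is_go(double base_mean, double opt_mean,
                    double base_median, double opt_median) {
    return gain_pct(base_mean, opt_mean) >= 1.0 &&
           gain_pct(base_median, opt_median) >= 0.0;
}
```

Requiring both conditions guards against a candidate whose mean is lifted by a few fast runs while the typical run gets slower, which is the concern noted for D1 after its initial 10-run test.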
### Active Optimizations in MIXED_TINYV3_C7_SAFE
1. **B3**: Routing branch shape (+2.89% proven)
2. **B4**: Wrapper hot/cold split (+1.47% proven)
3. **C3**: Static routing (+2.20% proven)
4. **D1**: Free route cache (+2.19% proven) - NEW
5. **MID_V3**: OFF for Mixed (C6 routing fix, +13% proven)
**Cumulative gain**: ~7.6% (B3 + B4 + C3 + D1, excluding MID_V3 fix)
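How the individual gains combine depends on whether they are summed or compounded. The sketch below shows both conventions; the function names are hypothetical. With the four figures listed above (2.89, 1.47, 2.20, 2.19), the plain sum is 8.75% and the compounded value is about 9.04%. The ~7.6% figure is consistent with the sum when D1 is counted at its earlier 10-run result of +1.06% instead of +2.19%, which appears to be the conservative reading.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of percentage gains (ignores compounding). */
static double gains_additive_pct(const double* g, size_t n) {
    assert(g != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += g[i];
    return sum;
}

/* Compounded gain: product of (1 + g/100), minus one, as a percentage. */
static double gains_compounded_pct(const double* g, size_t n) {
    assert(g != NULL);
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) factor *= 1.0 + g[i] / 100.0;
    return 100.0 * (factor - 1.0);
}
```

Neither figure is a measurement. Each gain was taken against a different baseline, so an end-to-end A/B run with all flags toggled together is the only direct check of the cumulative effect.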
### Next Actions
1. **Profile**: Run perf on current baseline to identify next targets
   - Requirement: self% ≥5% for Phase 3 D3 consideration
   - Target: `tiny_alloc_gate_fast` specialization
2. **Optional**: Phase 3 D3 (Alloc gate specialization) - pending perf validation
   - Only proceed if perf shows ≥5% self% in alloc gate
   - ENV: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=0/1`
3. **Phase 4 Planning**: If no more 5%+ targets, prepare Phase 4 roadmap
---
## Appendix: Raw Perf Data
### Full Perf Report (Top 20)
```
# Samples: 30 of event 'cycles:P'
# Event count (approx.): 921849973
46.11% 0.00% 0 [.] 0000000000000000
45.20% 28.95% 3 [.] free
29.11% 12.75% 3 [.] tiny_alloc_gate_fast.lto_priv.0
24.78% 4.39% 2 [.] tiny_route_for_class.lto_priv.0
21.00% 12.53% 3 [.] main
16.71% 12.43% 3 [.] malloc
12.95% 4.27% 1 [.] tiny_region_id_write_header.lto_priv.0
8.66% 4.39% 1 [.] tiny_c7_ultra_free
8.56% 4.28% 1 [.] free_tiny_fast_cold.lto_priv.0
7.85% 7.75% 2 [.] tiny_front_v3_enabled.lto_priv.0
4.27% 0.00% 0 [.] 0x00007ad3a9c2d001
4.23% 0.00% 0 [.] tiny_c7_ultra_enabled_env.lto_priv.0
4.21% 0.00% 0 [.] 0x00007ad3ab960c81
4.20% 0.00% 0 [.] 0x00007ad3ab939401
4.15% 4.15% 1 [.] free.cold
4.15% 0.00% 0 [.] unified_cache_push.lto_priv.0
4.02% 4.02% 1 [.] hak_pool_free
```
### Baseline Run Details
**Run 1**: 46.84M ops/s
```
Throughput = 46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208
```
**Run 2**: 46.79M ops/s
```
Throughput = 46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```
**Run 3**: 45.77M ops/s
```
Throughput = 45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176
```
**Run 4**: 47.12M ops/s
```
Throughput = 47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```
**Run 5**: 42.36M ops/s (outlier)
```
Throughput = 42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080
```
---
## Document History
- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)