hakmem/docs/analysis/PHASE3_BASELINE_AND_CANDIDATES.md
2025-12-14 00:05:11 +09:00


# Phase 3: Baseline Establishment & Next Optimization Candidates
**Date**: 2025-12-13
**Status**: BASELINE ESTABLISHED
**Goal**: Identify next micro-optimization targets with +1-5% potential each
---
## Executive Summary
**Baseline Performance (MID_V3=0, MIXED workload)**:
- Mean: 45.78M ops/s
- Median: 46.79M ops/s
- Range: 42.36M - 47.12M ops/s
- StdDev: ~1.75M ops/s (~3.8% of mean)
**Top Optimization Candidates**:
1. **free() wrapper** (28.95% self%) - HIGH PRIORITY
2. **tiny_alloc_gate_fast()** (12.75% self%) - HIGH PRIORITY
3. **main() benchmark overhead** (12.53% self%) - IGNORE (benchmark artifact)
**Expected Next Gains**: +3-8% cumulative from free path optimizations
---
## Step 0: Baseline Establishment
### Configuration Verification
**Profile**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
**Key Settings** (verified in `/mnt/workdisk/public_share/hakmem/core/bench_profile.h:74`):
```c
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0"); // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
```
**Optimization Flags Enabled**:
- `HAKMEM_FREE_TINY_FAST_HOTCOLD=1` (Phase FREE-TINY-FAST-DUALHOT-1)
- `HAKMEM_WRAP_SHAPE=1` (Phase 2 B4: Hot/Cold wrapper split)
- `HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1` (Phase 2 B3: Route branch optimization)
- `HAKMEM_TINY_STATIC_ROUTE=1` (Phase 3 C3: Static routing, +2.2%)
### Baseline Measurements (5 runs)
| Run | Throughput (M ops/s) | Time (ms) |
|-----|---------------------|-----------|
| 1 | 46.84 | 21 |
| 2 | 46.79 | 21 |
| 3 | 45.77 | 22 |
| 4 | 47.12 | 21 |
| 5 | 42.36 | 24 |
**Statistics**:
- **Mean**: 45.78M ops/s
- **Median**: 46.79M ops/s
- **Min**: 42.36M ops/s
- **Max**: 47.12M ops/s
- **Range**: 4.76M ops/s (11.2%)
- **StdDev**: ~1.75M ops/s (~3.8% of mean)
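**Assessment**: The baseline is reasonably stable: StdDev is ~3.8% of the mean, and four of the five runs fall between 45.77M and 47.12M ops/s. Run 5 (42.36M) is an outlier, likely due to system noise, so the median (46.79M) is the more reliable baseline reference.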
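
The statistics above can be reproduced with a small helper. This is a minimal sketch, not part of hakmem: `run_stats_t`, `run_stats()`, and `sqrt_newton()` are hypothetical names, and the standard deviation is the population form (divide by n), which for the five runs comes out at roughly 1.77M, close to the ~1.75M quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RUNS 64  /* hypothetical upper bound for this sketch */

typedef struct {
    double mean, median, min, max;
    double stddev_pop;  /* population standard deviation (divide by n) */
    double cv_pct;      /* stddev_pop as a percentage of the mean */
} run_stats_t;

static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Newton's iteration, so the sketch does not depend on libm. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++) r = 0.5 * (r + x / r);
    return r;
}

static run_stats_t run_stats(const double* v, size_t n) {
    assert(n > 0 && n <= MAX_RUNS);
    double sorted[MAX_RUNS];
    memcpy(sorted, v, n * sizeof *v);
    qsort(sorted, n, sizeof *sorted, cmp_double);

    run_stats_t s;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    s.mean = sum / (double)n;
    s.min = sorted[0];
    s.max = sorted[n - 1];
    s.median = (n % 2) ? sorted[n / 2]
                       : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);

    double ss = 0.0;  /* sum of squared deviations from the mean */
    for (size_t i = 0; i < n; i++) {
        double d = v[i] - s.mean;
        ss += d * d;
    }
    s.stddev_pop = sqrt_newton(ss / (double)n);
    s.cv_pct = 100.0 * s.stddev_pop / s.mean;
    return s;
}
```

Calling `run_stats()` on `{46.84, 46.79, 45.77, 47.12, 42.36}` yields the mean, median, min, and max listed in the statistics.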
**Comparison to Previous**:
- Previous C3 baseline: ~39.8M ops/s (with default settings)
- **Current baseline: 46.79M ops/s**
- **Improvement: +17.5%** (current median vs. previous baseline; confirms that MID_V3=0 and the cumulative optimizations are working)
---
## Step 1: Perf Profiling Results
### Profiling Setup
**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1
```
**Results**:
- Samples: 30 (cycles:P event); each function in the table below is backed by only 1-3 samples, so the percentages are coarse
- Event count: 921,849,973 cycles
- Throughput: 47.37M ops/s (consistent with baseline)
### Top Functions by Self%
| Rank | Symbol | Self% | Samples | Children% | Category |
|------|-------------------------------------|--------|---------|-----------|-----------------|
| 1 | `free` | 28.95% | 3 | 45.20% | **HOT WRAPPER** |
| 2 | `tiny_alloc_gate_fast.lto_priv.0` | 12.75% | 3 | 29.11% | **HOT ALLOC** |
| 3 | `main` | 12.53% | 3 | 21.00% | Benchmark |
| 4 | `malloc` | 12.43% | 3 | 16.71% | Wrapper |
| 5 | `tiny_front_v3_enabled.lto_priv.0` | 7.75% | 2 | 7.85% | Tiny front |
| 6 | `tiny_route_for_class.lto_priv.0` | 4.39% | 2 | 24.78% | Route lookup |
| 7 | `free.cold` | 4.15% | 1 | 4.15% | Cold path |
| 8 | `hak_pool_free` | 4.02% | 1 | 4.02% | Pool free |
### Call Graph Analysis
**free() hot path** (28.95% self, 45.20% children):
```
free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%) ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)
```
**tiny_alloc_gate_fast** (12.75% self, 29.11% children):
```
tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)
```
**Key Insight**: `tiny_route_for_class()` is called from `free()` and accounts for 20.38% of total time, making it the **#1 optimization target**.
---
## Step 2: Candidate Prioritization
### HIGH PRIORITY (Expected +3-5% each)
#### 1. **free() wrapper path** (28.95% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524`
**Current Implementation**:
```c
void free(void* ptr) {
    if (!ptr) return;
    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }
    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg();  // ← Memory load
    if (__builtin_expect(wcfg->wrap_shape, 0)) {        // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return;  // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}
```
**Optimization Opportunities**:
**A. Cache `wrapper_env_cfg()` result** (Expected: +1-2%)
- Currently calls `wrapper_env_cfg()` on every free
- Could cache in TLS or register during init
- Risk: LOW (read-only after init)
**B. Inline `free_tiny_fast_hot()` decision** (Expected: +1-2%)
- The `hak_free_tiny_fast_hotcold_enabled()` branch is a runtime env check
- It could be resolved at compile time or cached at init time
- Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)
**C. Reduce branch mispredictions** (Expected: +0.5-1%)
- Reorder branches to put likely path first
- Current: `bench_fast_enabled()` checked first (unlikely=0)
- Optimization: Move Tiny fast path check earlier
- Risk: LOW
**Total Expected Gain: +2.5-5%**
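
Opportunity B (caching an env-derived flag) might look like the sketch below. It is illustrative only: `parse_env_flag()`, `free_hotcold_enabled_cached()`, and `g_hotcold_cached` are hypothetical names, and the rule "a value starting with `1` means enabled" is an assumption, not hakmem's documented parsing. The `__atomic_*` builtins assume GCC or Clang, which the existing `__builtin_expect` usage already implies.

```c
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = disabled, 1 = enabled */
static int g_hotcold_cached = -1;

/* Assumed parsing rule: a value starting with '1' enables the flag. */
static int parse_env_flag(const char* s) {
    return (s != NULL && s[0] == '1') ? 1 : 0;
}

/* The first call reads the environment; later calls are one relaxed load
 * plus a well-predicted branch. Concurrent first calls may both call
 * getenv(), but they store the same value. */
static inline int free_hotcold_enabled_cached(void) {
    int v = __atomic_load_n(&g_hotcold_cached, __ATOMIC_RELAXED);
    if (__builtin_expect(v < 0, 0)) {
        v = parse_env_flag(getenv("HAKMEM_FREE_TINY_FAST_HOTCOLD"));
        __atomic_store_n(&g_hotcold_cached, v, __ATOMIC_RELAXED);
    }
    assert(v == 0 || v == 1);
    return v;
}
```

Because the value is latched on first use, changing the environment variable afterwards has no effect, which is the intended trade-off for a read-only-after-init setting.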
#### 2. **tiny_route_for_class()** (4.39% self%, 24.78% children)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147`
**Current Implementation**:
```c
static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls);  // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}
```
**Optimization Opportunities**:
**A. Eliminate `g_tiny_route_snapshot_done` check** (Expected: +1-2%)
- Check happens on EVERY call from free path
- Phase 3 C3 already implemented static routing for alloc path
- **Proposal**: Apply same static route cache to free path
- Implementation: Add `tiny_static_route_for_free(ci)` that bypasses snapshot check
- Risk: MEDIUM (need to ensure init ordering)
**B. Remove bounds check `ci >= TINY_NUM_CLASSES`** (Expected: +0.5-1%)
- In free path, `ci` is derived from header (already validated)
- Could add `tiny_route_for_class_unchecked(ci)` variant
- Risk: MEDIUM (need careful caller audit)
**Total Expected Gain: +1.5-3%**
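
A combined sketch of opportunities A and B: a table filled once at init and read without the snapshot check or the release-mode bounds check. This is not the actual hakmem code. `TINY_NUM_CLASSES = 8` is assumed from the "C0-C7" wording, `TINY_ROUTE_DEMO_OTHER` and the `g_free_route_static*` globals are hypothetical, and the init function must run before any `free()` reaches the lookup (the init-ordering risk noted above).

```c
#include <assert.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* assumption: classes C0-C7 */

typedef enum {
    TINY_ROUTE_LEGACY = 0,
    TINY_ROUTE_DEMO_OTHER = 1  /* hypothetical second route, illustration only */
} tiny_route_kind_t;

static tiny_route_kind_t g_free_route_static[TINY_NUM_CLASSES];
static int g_free_route_static_ready = 0;

/* Call once during allocator init, before the free path can run. */
static void tiny_static_route_for_free_init(const tiny_route_kind_t* snapshot) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        g_free_route_static[i] = snapshot[i];
    }
    g_free_route_static_ready = 1;
}

/* Hot-path lookup: a single indexed load in release builds.
 * The asserts document the preconditions and vanish with NDEBUG. */
static inline tiny_route_kind_t tiny_static_route_for_free(uint8_t ci) {
    assert(g_free_route_static_ready);
    assert(ci < TINY_NUM_CLASSES);
    return g_free_route_static[ci];
}
```

Keeping the checks as debug-only asserts is one way to support the caller audit: debug builds still trap an unvalidated `ci`, while release builds pay nothing.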
#### 3. **tiny_alloc_gate_fast()** (12.75% self%)
**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139`
**Current Implementation**:
```c
static inline void* tiny_alloc_gate_fast(size_t size)
{
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }
    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Already optimized (Phase 3 C3)
    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }
    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }
    // ROUTE_TINY_FIRST fallback...
}
```
**Optimization Opportunities**:
**A. Specialize for common routes** (Expected: +1-2%)
- MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
- Could create `tiny_alloc_gate_fast_legacy_only()` variant
- Eliminates ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
- Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)
**B. Inline `malloc_tiny_fast_for_class()`** (Expected: +0.5-1%)
- Might not be fully inlined by LTO
- Add `__attribute__((always_inline))` hint
- Risk: LOW
**Total Expected Gain: +1.5-3%**
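
The shape of opportunity A can be sketched as follows. Everything except the function name `tiny_alloc_gate_fast_legacy_only` is a stand-in: `demo_size_to_class()` and `demo_alloc_for_class()` are hypothetical replacements for `hak_tiny_size_to_class()` and `malloc_tiny_fast_for_class()`, the power-of-two size mapping is invented for the demo, and `TINY_NUM_CLASSES = 8` is assumed from "C0-C7". The point is only that the route lookup and both route comparisons disappear when every class is known to be `ROUTE_TINY_ONLY`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_NUM_CLASSES 8  /* assumption: classes C0-C7 */

/* Stand-in mapping: class c serves sizes up to (8 << c) bytes.
 * Not hakmem's real size-class table. */
static int demo_size_to_class(size_t size) {
    if (size == 0 || size > ((size_t)8 << (TINY_NUM_CLASSES - 1))) return -1;
    int c = 0;
    while (((size_t)8 << c) < size) c++;
    return c;
}

/* Stand-in for the per-class fast allocator; defers to malloc(). */
static void* demo_alloc_for_class(size_t size, int class_idx) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    return malloc(size);
}

/* Specialized gate: no route lookup and no POOL_ONLY / TINY_FIRST
 * branches. Valid only when every class is ROUTE_TINY_ONLY. */
static inline void* tiny_alloc_gate_fast_legacy_only(size_t size) {
    int class_idx = demo_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;  /* not a tiny size: caller falls back */
    }
    return demo_alloc_for_class(size, class_idx);
}
```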
---
### MEDIUM PRIORITY (Expected +0.5-1% each)
#### 4. **tiny_front_v3_enabled()** (7.75% self%)
- Appears in free path via `free_tiny_fast_hot()`
- Likely a runtime env check that could be cached
- Risk: LOW
- Expected Gain: +0.5-1%
#### 5. **free.cold** (4.15% self%)
- Cold path for free wrapper
- Handles classification and fallback
- Not a hot optimization target (already in slow path)
- Expected Gain: <+0.5%
---
### LOW PRIORITY / IGNORE
#### 6. **main()** (12.53% self%)
- Benchmark overhead (not part of allocator)
- IGNORE
#### 7. **malloc()** (12.43% self%)
- Already optimized in previous phases
- Appears lower than free in profile
- Defer to next round
---
## Step 3: Recommended Next Steps
### Phase 3 D1: Free Path Route Cache ✅ ADOPT (PROMOTED TO DEFAULT)
**Target**: Remove the `tiny_route_for_class()` call from the free path
**Result**: Mixed 20-run mean **+2.19%** / median **+2.37%**
**Decision**: ✅ Promoted to the default of `MIXED_TINYV3_C7_SAFE`
**ENV Gate**:
- `HAKMEM_FREE_STATIC_ROUTE=0/1` (default: 0)
- The `MIXED_TINYV3_C7_SAFE` preset injects `1` as its default (set `0` to roll back)
---
### Phase 3 D2: Wrapper Env Cache ❌ NO-GO (FROZEN)
**Target**: Remove the `wrapper_env_cfg()` call from the wrapper hot path
**Result**: Mixed 10-run mean **-1.44%** regression
**Decision**: ❌ NO-GO (research box frozen, default OFF)
**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1` (default: 0)
---
### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)
**Target**: Minimize the branch shape of `tiny_alloc_gate_fast()` (for MIXED)
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours
**Implementation**:
1. New ENV gate: `HAKMEM_ALLOC_GATE_SHAPE=0/1`
2. Avoid `tiny_route_get()` and read `g_tiny_route[]` directly (bypasses the release logging branch)
3. Always honor `ROUTE_POOL_ONLY` (must not break `HAKMEM_TINY_PROFILE=hot/off`)
4. A/B test: BASELINE vs D3
**Design**: `docs/analysis/PHASE4_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md`
**ENV Gate**: `HAKMEM_ALLOC_GATE_SHAPE=0/1` (default: 0)
---
## Expected Cumulative Results (Updated)
| Phase | Optimization | Expected Gain | Notes |
|------------|----------------------------------|---------------|-------|
| Baseline | MID_V3=0 + B3+B4+C3 | - | - |
| **D1** | Free route cache | +0 to +2% | ✅ ADOPT (Mixed preset default ON) |
| **D2** | Wrapper env cache | - | NO-GO (frozen) |
| **D3** | Alloc gate specialization | +0 to +2% | Start only if perf shows more than 5% |
**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)
---
## Risk Assessment
| Optimization | Risk Level | Mitigation |
|---------------------|------------|-------------------------------------------------|
| Free route cache | MEDIUM | Ensure init ordering, ENV gate for rollback |
| Wrapper env cache | - | NO-GO (-1.44% regression) |
| Alloc specialization| LOW | Profile-specific, existing static route pattern |
**All optimizations**: Follow ENV gate + A/B test + decision pattern (research box)
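
The A/B decision step of this pattern could be automated roughly as below. This is a sketch under stated assumptions: the rule "both the mean gain and the median gain must reach a threshold" is inferred from the phrase "both criteria met", and the threshold value itself (`min_gain_pct`) is not given in this document. All `ab_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define AB_MAX_RUNS 64  /* hypothetical upper bound for this sketch */

typedef enum { AB_NO_GO = 0, AB_ADOPT = 1 } ab_decision_t;

static int ab_cmp(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

static double ab_mean(const double* v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

static double ab_median(const double* v, size_t n) {
    assert(n > 0 && n <= AB_MAX_RUNS);
    double s[AB_MAX_RUNS];
    memcpy(s, v, n * sizeof *v);
    qsort(s, n, sizeof *s, ab_cmp);
    return (n % 2) ? s[n / 2] : 0.5 * (s[n / 2 - 1] + s[n / 2]);
}

/* Relative change of cand over base, in percent. */
static double ab_gain_pct(double base, double cand) {
    return 100.0 * (cand - base) / base;
}

/* ADOPT only if both mean and median gains reach min_gain_pct. */
static ab_decision_t ab_decide(const double* base, size_t nb,
                               const double* cand, size_t nc,
                               double min_gain_pct) {
    double mean_gain = ab_gain_pct(ab_mean(base, nb), ab_mean(cand, nc));
    double median_gain = ab_gain_pct(ab_median(base, nb), ab_median(cand, nc));
    return (mean_gain >= min_gain_pct && median_gain >= min_gain_pct)
               ? AB_ADOPT : AB_NO_GO;
}
```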
---
## Post-D1/D2 Status (2025-12-13)
### Phase 3 D1/D2 Validation Complete ✅
1. **D1 (Free Route Cache)**: ✅ ADOPT - PROMOTED TO DEFAULT
- 20-run validation completed
- Results: Mean +2.19%, Median +2.37% (both criteria met)
- Status: Added to MIXED_TINYV3_C7_SAFE preset as default
- Implementation: `HAKMEM_FREE_STATIC_ROUTE=1`
2. **D2 (Wrapper Env Cache)**: ❌ FROZEN
- Results: -1.44% regression
- Status: Research box frozen, default OFF, do not pursue
- Implementation: `HAKMEM_WRAP_ENV_CACHE=1` (opt-in only, not recommended)
### Active Optimizations in MIXED_TINYV3_C7_SAFE
1. **B3**: Routing branch shape (+2.89% proven)
2. **B4**: Wrapper hot/cold split (+1.47% proven)
3. **C3**: Static routing (+2.20% proven)
4. **D1**: Free route cache (+2.19% proven) - NEW
5. **MID_V3**: OFF for Mixed (C6 routing fix, +13% proven)
**Cumulative gain**: ~7.6% (B3 + B4 + C3 + D1, excluding MID_V3 fix). The four per-phase figures above add up to 8.75%, so this total is not their simple sum.
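
Per-phase gains can be combined in two common ways: adding the percentages, or compounding them as multiplicative speedups. The sketch below implements both so the cumulative figure can be cross-checked; the function names are hypothetical, and which method (if either) the ~7.6% figure is based on is not stated in this document.

```c
#include <assert.h>
#include <stddef.h>

/* Simple sum of percentage gains. */
static double gain_sum_pct(const double* gains_pct, size_t n) {
    assert(gains_pct != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += gains_pct[i];
    return sum;
}

/* Compounded gain: product of (1 + g/100) factors, minus 1, in percent. */
static double gain_compound_pct(const double* gains_pct, size_t n) {
    assert(gains_pct != NULL || n == 0);
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) factor *= 1.0 + gains_pct[i] / 100.0;
    return (factor - 1.0) * 100.0;
}
```

For the B3, B4, C3, and D1 figures listed above (`{2.89, 1.47, 2.20, 2.19}`), the sum is 8.75% and the compounded value is slightly above 9%.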
### Next Actions
1. **Profile**: Run perf on current baseline to identify next targets
- Requirement: self% ≥5% for Phase 3 D3 consideration
- Target: `tiny_alloc_gate_fast` specialization
2. **Optional**: Phase 3 D3 (Alloc gate specialization) - pending perf validation
- Only proceed if perf shows ≥5% self% in alloc gate
- ENV: `HAKMEM_ALLOC_GATE_SHAPE=0/1`
3. **Phase 4 Planning**: If no more 5%+ targets, prepare Phase 4 roadmap
---
## Appendix: Raw Perf Data
### Full Perf Report (Top 20)
```
# Samples: 30 of event 'cycles:P'
# Event count (approx.): 921849973
46.11% 0.00% 0 [.] 0000000000000000
45.20% 28.95% 3 [.] free
29.11% 12.75% 3 [.] tiny_alloc_gate_fast.lto_priv.0
24.78% 4.39% 2 [.] tiny_route_for_class.lto_priv.0
21.00% 12.53% 3 [.] main
16.71% 12.43% 3 [.] malloc
12.95% 4.27% 1 [.] tiny_region_id_write_header.lto_priv.0
8.66% 4.39% 1 [.] tiny_c7_ultra_free
8.56% 4.28% 1 [.] free_tiny_fast_cold.lto_priv.0
7.85% 7.75% 2 [.] tiny_front_v3_enabled.lto_priv.0
4.27% 0.00% 0 [.] 0x00007ad3a9c2d001
4.23% 0.00% 0 [.] tiny_c7_ultra_enabled_env.lto_priv.0
4.21% 0.00% 0 [.] 0x00007ad3ab960c81
4.20% 0.00% 0 [.] 0x00007ad3ab939401
4.15% 4.15% 1 [.] free.cold
4.15% 0.00% 0 [.] unified_cache_push.lto_priv.0
4.02% 4.02% 1 [.] hak_pool_free
```
### Baseline Run Details
**Run 1**: 46.84M ops/s
```
Throughput = 46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208
```
**Run 2**: 46.79M ops/s
```
Throughput = 46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```
**Run 3**: 45.77M ops/s
```
Throughput = 45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176
```
**Run 4**: 47.12M ops/s
```
Throughput = 47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```
**Run 5**: 42.36M ops/s (outlier)
```
Throughput = 42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080
```
---
## Document History
- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)