# Phase 3: Baseline Establishment & Next Optimization Candidates

**Date**: 2025-12-13
**Status**: BASELINE ESTABLISHED
**Goal**: Identify next micro-optimization targets with +1-5% potential each

---

## Executive Summary

**Baseline Performance (MID_V3=0, MIXED workload)**:
- Mean: 45.78M ops/s
- Median: 46.79M ops/s
- Range: 42.36M - 47.12M ops/s
- StdDev: ~1.75M ops/s (~3.8% of the mean)

**Top Optimization Candidates**:
1. **free() wrapper** (28.95% self%) - HIGH PRIORITY
2. **tiny_alloc_gate_fast()** (12.75% self%) - HIGH PRIORITY
3. **main() benchmark overhead** (12.53% self%) - IGNORE (benchmark artifact)

**Expected Next Gains**: +3-8% cumulative from free path optimizations

---

## Step 0: Baseline Establishment

### Configuration Verification

**Profile**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`

**Key Settings** (verified in `/mnt/workdisk/public_share/hakmem/core/bench_profile.h:74`):
```c
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");  // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
```

**Optimization Flags Enabled**:
- `HAKMEM_FREE_TINY_FAST_HOTCOLD=1` (Phase FREE-TINY-FAST-DUALHOT-1)
- `HAKMEM_WRAP_SHAPE=1` (Phase 2 B4: Hot/Cold wrapper split)
- `HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1` (Phase 2 B3: Route branch optimization)
- `HAKMEM_TINY_STATIC_ROUTE=1` (Phase 3 C3: Static routing, +2.2%)

### Baseline Measurements (5 runs)

| Run | Throughput (M ops/s) | Time (ms) |
|-----|----------------------|-----------|
| 1   | 46.84                | 21        |
| 2   | 46.79                | 21        |
| 3   | 45.77                | 22        |
| 4   | 47.12                | 21        |
| 5   | 42.36                | 24        |

**Statistics**:
- **Mean**: 45.78M ops/s
- **Median**: 46.79M ops/s
- **Min**: 42.36M ops/s
- **Max**: 47.12M ops/s
- **Range**: 4.76M ops/s (11.2% of the minimum)
- **StdDev**: ~1.75M ops/s (~3.8% of the mean)

**Assessment**: The baseline is stable, with a relative standard deviation of ~3.8%. Run 5 (42.36M) is an outlier, likely due to system noise, so the median (46.79M) is the more reliable baseline reference.
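The summary statistics above can be recomputed from the five throughput samples. The following is a minimal sketch, not code from hakmem: the helper names (`bench_mean`, `bench_median`, `bench_variance`) are hypothetical, and it assumes the population variance (dividing by n rather than n - 1). It returns the variance rather than the standard deviation so that it compiles without linking the math library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helpers for summarizing benchmark runs; not part of hakmem. */

/* qsort comparator for doubles in ascending order. */
static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a;
    double y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n samples. */
static double bench_mean(const double* v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Population variance (divides by n); the standard deviation is its square root. */
static double bench_variance(const double* v, size_t n) {
    double m = bench_mean(v, n);
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += (v[i] - m) * (v[i] - m);
    return acc / (double)n;
}

/* Median of n samples; note that this sorts the array in place. */
static double bench_median(double* v, size_t n) {
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}
```

Fed the five table values (46.84, 46.79, 45.77, 47.12, 42.36), this gives a mean of about 45.78 and a variance of about 3.13, whose square root (about 1.77M ops/s) is in line with the ~1.75M reported above. Using the sample variance (n - 1) instead would give a somewhat larger spread.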
**Comparison to Previous**:
- Previous C3 baseline: ~39.8M ops/s (with default settings)
- **Current baseline: 46.79M ops/s**
- **Improvement: +17.5%** (confirms that MID_V3=0 plus the cumulative optimizations are working)

---

## Step 1: Perf Profiling Results

### Profiling Setup

**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1
```

**Results**:
- Samples: 30 (cycles:P event)
- Event count: 921,849,973 cycles
- Throughput: 47.37M ops/s (consistent with baseline)

### Top Functions by Self%

| Rank | Symbol                              | Self%  | Samples | Children% | Category        |
|------|-------------------------------------|--------|---------|-----------|-----------------|
| 1    | `free`                              | 28.95% | 3       | 45.20%    | **HOT WRAPPER** |
| 2    | `tiny_alloc_gate_fast.lto_priv.0`   | 12.75% | 3       | 29.11%    | **HOT ALLOC**   |
| 3    | `main`                              | 12.53% | 3       | 21.00%    | Benchmark       |
| 4    | `malloc`                            | 12.43% | 3       | 16.71%    | Wrapper         |
| 5    | `tiny_front_v3_enabled.lto_priv.0`  | 7.75%  | 2       | 7.85%     | Tiny front      |
| 6    | `tiny_route_for_class.lto_priv.0`   | 4.39%  | 2       | 24.78%    | Route lookup    |
| 7    | `free.cold`                         | 4.15%  | 1       | 4.15%     | Cold path       |
| 8    | `hak_pool_free`                     | 4.02%  | 1       | 4.02%     | Pool free       |

### Call Graph Analysis

**free() hot path** (28.95% self, 45.20% children):
```
free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%)  ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)
```

**tiny_alloc_gate_fast** (12.75% self, 29.11% children):
```
tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)
```

**Key Insight**: `tiny_route_for_class()` is called from `free()` and accounts for 20.38% of total time. This is the **#1 optimization target**.

---

## Step 2: Candidate Prioritization

### HIGH PRIORITY (Expected +3-5% each)

#### 1.
**free() wrapper path** (28.95% self%)

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524`

**Current Implementation**:
```c
void free(void* ptr) {
    if (!ptr) return;

    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }

    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg();  // ← Memory load
    if (__builtin_expect(wcfg->wrap_shape, 0)) {        // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return;  // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}
```

**Optimization Opportunities**:

**A. Cache `wrapper_env_cfg()` result** (Expected: +1-2%)
- Currently calls `wrapper_env_cfg()` on every free
- Could cache in TLS or register during init
- Risk: LOW (read-only after init)

**B. Inline `free_tiny_fast_hot()` decision** (Expected: +1-2%)
- Branch `hak_free_tiny_fast_hotcold_enabled()` is a runtime env check
- Could be compile-time or init-time cached
- Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)

**C. Reduce branch mispredictions** (Expected: +0.5-1%)
- Reorder branches to put the likely path first
- Current: `bench_fast_enabled()` checked first (unlikely=0)
- Optimization: Move Tiny fast path check earlier
- Risk: LOW

**Total Expected Gain: +2.5-5%**

#### 2.
**tiny_route_for_class()** (4.39% self%, 24.78% children)

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147`

**Current Implementation**:
```c
static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls);  // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}
```

**Optimization Opportunities**:

**A. Eliminate `g_tiny_route_snapshot_done` check** (Expected: +1-2%)
- Check happens on EVERY call from the free path
- Phase 3 C3 already implemented static routing for the alloc path
- **Proposal**: Apply the same static route cache to the free path
- Implementation: Add `tiny_static_route_for_free(ci)` that bypasses the snapshot check
- Risk: MEDIUM (need to ensure init ordering)

**B. Remove bounds check `ci >= TINY_NUM_CLASSES`** (Expected: +0.5-1%)
- In the free path, `ci` is derived from the header (already validated)
- Could add a `tiny_route_for_class_unchecked(ci)` variant
- Risk: MEDIUM (need careful caller audit)

**Total Expected Gain: +1.5-3%**

#### 3. **tiny_alloc_gate_fast()** (12.75% self%)

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139`

**Current Implementation**:
```c
static inline void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Already optimized (Phase 3 C3)
    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);
    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }
    // ROUTE_TINY_FIRST fallback...
}
```

**Optimization Opportunities**:

**A.
Specialize for common routes** (Expected: +1-2%)
- MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
- Could create a `tiny_alloc_gate_fast_legacy_only()` variant
- Eliminates the ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
- Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)

**B. Inline `malloc_tiny_fast_for_class()`** (Expected: +0.5-1%)
- Might not be fully inlined by LTO
- Add `__attribute__((always_inline))` hint
- Risk: LOW

**Total Expected Gain: +1.5-3%**

---

### MEDIUM PRIORITY (Expected +0.5-1% each)

#### 4. **tiny_front_v3_enabled()** (7.75% self%)
- Appears in free path via `free_tiny_fast_hot()`
- Likely a runtime env check that could be cached
- Risk: LOW
- Expected Gain: +0.5-1%

#### 5. **free.cold** (4.15% self%)
- Cold path for free wrapper
- Handles classification and fallback
- Not a hot optimization target (already in slow path)
- Expected Gain: <+0.5%

---

### LOW PRIORITY / IGNORE

#### 6. **main()** (12.53% self%)
- Benchmark overhead (not part of allocator)
- IGNORE

#### 7.
**malloc()** (12.43% self%)
- Already optimized in previous phases
- Appears lower than free in profile
- Defer to next round

---

## Step 3: Recommended Next Steps

### Phase 3 D1: Free Path Route Cache ✅ ADOPT (PROMOTED TO DEFAULT)

**Target**: Remove the `tiny_route_for_class()` call from the free path

**Result**: Mixed 20-run mean **+2.19%** / median **+2.37%**

**Decision**: ✅ Promoted to default in `MIXED_TINYV3_C7_SAFE`

**ENV Gate**:
- `HAKMEM_FREE_STATIC_ROUTE=0/1` (default: 0)
- The `MIXED_TINYV3_C7_SAFE` preset injects `1` as its default (set `0` to roll back)

---

### Phase 3 D2: Wrapper Env Cache ❌ NO-GO (FROZEN)

**Target**: Remove the `wrapper_env_cfg()` call from the wrapper hot path

**Result**: Mixed 10-run mean **-1.44%** regression

**Decision**: ❌ NO-GO (research box frozen, default OFF)

**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1` (default: 0)

---

### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)

**Target**: Minimize the branch shape of `tiny_alloc_gate_fast()` (for MIXED)

**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours

**Implementation**:
1. New ENV gate: `HAKMEM_ALLOC_GATE_SHAPE=0/1`
2. Avoid `tiny_route_get()` and replace it with a direct read of `g_tiny_route[]` (avoids the release logging branch)
3. Always honor `ROUTE_POOL_ONLY` (must not break `HAKMEM_TINY_PROFILE=hot/off`)
4.
A/B test: BASELINE vs D3

**Design**: `docs/analysis/PHASE4_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md`

**ENV Gate**: `HAKMEM_ALLOC_GATE_SHAPE=0/1` (default: 0)

---

## Expected Cumulative Results (Updated)

| Phase      | Optimization              | Expected Gain | Notes                                   |
|------------|---------------------------|---------------|-----------------------------------------|
| Baseline   | MID_V3=0 + B3+B4+C3       | -             | -                                       |
| **D1**     | Free route cache          | +0 to +2%     | ✅ ADOPT (Mixed preset default ON)      |
| **D2**     | Wrapper env cache         | -             | NO-GO (frozen)                          |
| **D3**     | Alloc gate specialization | +0 to +2%     | Start only if perf shows more than 5%   |

**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)

---

## Risk Assessment

| Optimization         | Risk Level | Mitigation                                      |
|----------------------|------------|-------------------------------------------------|
| Free route cache     | MEDIUM     | Ensure init ordering, ENV gate for rollback     |
| Wrapper env cache    | -          | NO-GO (-1.44% regression)                       |
| Alloc specialization | LOW        | Profile-specific, existing static route pattern |

**All optimizations**: Follow ENV gate + A/B test + decision pattern (research box)

---

## Post-D1/D2 Status (2025-12-13)

### Phase 3 D1/D2 Validation Complete ✅

1. **D1 (Free Route Cache)**: ✅ ADOPT - PROMOTED TO DEFAULT
   - 20-run validation completed
   - Results: Mean +2.19%, Median +2.37% (both criteria met)
   - Status: Added to MIXED_TINYV3_C7_SAFE preset as default
   - Implementation: `HAKMEM_FREE_STATIC_ROUTE=1`

2. **D2 (Wrapper Env Cache)**: ❌ FROZEN
   - Results: -1.44% regression
   - Status: Research box frozen, default OFF, do not pursue
   - Implementation: `HAKMEM_WRAP_ENV_CACHE=1` (opt-in only, not recommended)

### Active Optimizations in MIXED_TINYV3_C7_SAFE

1. **B3**: Routing branch shape (+2.89% proven)
2. **B4**: Wrapper hot/cold split (+1.47% proven)
3. **C3**: Static routing (+2.20% proven)
4. **D1**: Free route cache (+2.19% proven) - NEW
5. **MID_V3**: OFF for Mixed (C6 routing fix, +13% proven)

**Cumulative gain**: ~8.75% (B3 + B4 + C3 + D1 summed, excluding MID_V3 fix)

### Next Actions

1.
**Profile**: Run perf on current baseline to identify next targets
   - Requirement: self% ≥5% for Phase 3 D3 consideration
   - Target: `tiny_alloc_gate_fast` specialization

2. **Optional**: Phase 3 D3 (Alloc gate specialization) - pending perf validation
   - Only proceed if perf shows ≥5% self% in alloc gate
   - ENV: `HAKMEM_ALLOC_GATE_SHAPE=0/1`

3. **Phase 4 Planning**: If no more 5%+ targets, prepare Phase 4 roadmap

---

## Appendix: Raw Perf Data

### Full Perf Report (Top 20)

```
# Samples: 30 of event 'cycles:P'
# Event count (approx.): 921849973

    46.11%     0.00%     0  [.] 0000000000000000
    45.20%    28.95%     3  [.] free
    29.11%    12.75%     3  [.] tiny_alloc_gate_fast.lto_priv.0
    24.78%     4.39%     2  [.] tiny_route_for_class.lto_priv.0
    21.00%    12.53%     3  [.] main
    16.71%    12.43%     3  [.] malloc
    12.95%     4.27%     1  [.] tiny_region_id_write_header.lto_priv.0
     8.66%     4.39%     1  [.] tiny_c7_ultra_free
     8.56%     4.28%     1  [.] free_tiny_fast_cold.lto_priv.0
     7.85%     7.75%     2  [.] tiny_front_v3_enabled.lto_priv.0
     4.27%     0.00%     0  [.] 0x00007ad3a9c2d001
     4.23%     0.00%     0  [.] tiny_c7_ultra_enabled_env.lto_priv.0
     4.21%     0.00%     0  [.] 0x00007ad3ab960c81
     4.20%     0.00%     0  [.] 0x00007ad3ab939401
     4.15%     4.15%     1  [.] free.cold
     4.15%     0.00%     0  [.] unified_cache_push.lto_priv.0
     4.02%     4.02%     1  [.] hak_pool_free
```

### Baseline Run Details

**Run 1**: 46.84M ops/s
```
Throughput = 46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208
```

**Run 2**: 46.79M ops/s
```
Throughput = 46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```

**Run 3**: 45.77M ops/s
```
Throughput = 45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176
```

**Run 4**: 47.12M ops/s
```
Throughput = 47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080
```

**Run 5**: 42.36M ops/s (outlier)
```
Throughput = 42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080
```

---

## Document History

- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)