# Phase 3: Baseline Establishment & Next Optimization Candidates

**Date**: 2025-12-13
**Status**: BASELINE ESTABLISHED
**Goal**: Identify next micro-optimization targets with +1-5% potential each

---

## Executive Summary

**Baseline Performance (MID_V3=0, MIXED workload)**:
- Mean: 45.78M ops/s
- Median: 46.79M ops/s
- Range: 42.36M - 47.12M ops/s
- StdDev: ~1.77M ops/s (population form; ~3.9% of the mean)

**Top Optimization Candidates**:
1. **free() wrapper** (28.95% self%) - HIGH PRIORITY
2. **tiny_alloc_gate_fast()** (12.75% self%) - HIGH PRIORITY
3. **main() benchmark overhead** (12.53% self%) - IGNORE (benchmark artifact)

**Expected Next Gains**: +3-8% cumulative from free path optimizations

---

## Step 0: Baseline Establishment

### Configuration Verification

**Profile**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`

**Key Settings** (verified in `/mnt/workdisk/public_share/hakmem/core/bench_profile.h:74`):
```c
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");  // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
```

**Optimization Flags Enabled**:
- `HAKMEM_FREE_TINY_FAST_HOTCOLD=1` (Phase FREE-TINY-FAST-DUALHOT-1)
- `HAKMEM_WRAP_SHAPE=1` (Phase 2 B4: Hot/Cold wrapper split)
- `HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1` (Phase 2 B3: Route branch optimization)
- `HAKMEM_TINY_STATIC_ROUTE=1` (Phase 3 C3: Static routing, +2.2%)

### Baseline Measurements (5 runs)

| Run | Throughput (M ops/s) | Time (ms) |
|-----|----------------------|-----------|
| 1   | 46.84                | 21        |
| 2   | 46.79                | 21        |
| 3   | 45.77                | 22        |
| 4   | 47.12                | 21        |
| 5   | 42.36                | 24        |

**Statistics**:
- **Mean**: 45.78M ops/s
- **Median**: 46.79M ops/s
- **Min**: 42.36M ops/s
- **Max**: 47.12M ops/s
- **Range**: 4.76M ops/s (11.2% of the minimum)
- **StdDev**: ~1.77M ops/s (population form; ~3.9% coefficient of variation)

**Assessment**: The baseline is reasonably stable, with a run-to-run coefficient of variation of ~3.9%. Run 5 (42.36M) is an outlier, likely due to system noise, so the median (46.79M) is the more reliable baseline reference.


**Comparison to Previous**:
- Previous C3 baseline: ~39.8M ops/s (with default settings)
- **Current baseline: 46.79M ops/s**
- **Improvement: +17.5%** (consistent with MID_V3=0 plus the cumulative optimizations taking effect)

---

## Step 1: Perf Profiling Results

### Profiling Setup

**Command**:
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1
```

**Results**:
- Samples: 30 (cycles:P event)
- Event count: 921,849,973 cycles
- Throughput: 47.37M ops/s (consistent with baseline)

### Top Functions by Self%

| Rank | Symbol                              | Self%  | Samples | Children% | Category        |
|------|-------------------------------------|--------|---------|-----------|-----------------|
| 1    | `free`                              | 28.95% | 3       | 45.20%    | **HOT WRAPPER** |
| 2    | `tiny_alloc_gate_fast.lto_priv.0`   | 12.75% | 3       | 29.11%    | **HOT ALLOC**   |
| 3    | `main`                              | 12.53% | 3       | 21.00%    | Benchmark       |
| 4    | `malloc`                            | 12.43% | 3       | 16.71%    | Wrapper         |
| 5    | `tiny_front_v3_enabled.lto_priv.0`  | 7.75%  | 2       | 7.85%     | Tiny front      |
| 6    | `tiny_route_for_class.lto_priv.0`   | 4.39%  | 2       | 24.78%    | Route lookup    |
| 7    | `free.cold`                         | 4.15%  | 1       | 4.15%     | Cold path       |
| 8    | `hak_pool_free`                     | 4.02%  | 1       | 4.02%     | Pool free       |

### Call Graph Analysis

**free() hot path** (28.95% self, 45.20% children):
```
free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%)  ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)
```

**tiny_alloc_gate_fast** (12.75% self, 29.11% children):
```
tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)
```

**Key Insight**: `tiny_route_for_class()` is called from `free()` and accounts for 20.38% of total time. This is the **#1 optimization target**.

---

## Step 2: Candidate Prioritization

### HIGH PRIORITY (Expected +3-5% each)

#### 1. **free() wrapper path** (28.95% self%)

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524`

**Current Implementation**:
```c
void free(void* ptr) {
    if (!ptr) return;

    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }

    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg();  // ← Memory load

    if (__builtin_expect(wcfg->wrap_shape, 0)) {  // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return;  // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}
```

**Optimization Opportunities**:

**A. Cache `wrapper_env_cfg()` result** (Expected: +1-2%)
- Currently calls `wrapper_env_cfg()` on every free
- Could cache in TLS or register during init
- Risk: LOW (read-only after init)

**B. Inline `free_tiny_fast_hot()` decision** (Expected: +1-2%)
- Branch `hak_free_tiny_fast_hotcold_enabled()` is a runtime env check
- Could be compile-time or init-time cached
- Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)

**C. Reduce branch mispredictions** (Expected: +0.5-1%)
- Reorder branches to put the likely path first
- Current: `bench_fast_enabled()` checked first (unlikely=0)
- Optimization: Move the Tiny fast path check earlier
- Risk: LOW

**Total Expected Gain: +2.5-5%**

#### 2. **tiny_route_for_class()** (4.39% self%, 24.78% children)

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147`

**Current Implementation**:
```c
static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls);  // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}
```

**Optimization Opportunities**:

**A. Eliminate `g_tiny_route_snapshot_done` check** (Expected: +1-2%)
- Check happens on EVERY call from the free path
- Phase 3 C3 already implemented static routing for the alloc path
- **Proposal**: Apply the same static route cache to the free path
- Implementation: Add `tiny_static_route_for_free(ci)` that bypasses the snapshot check
- Risk: MEDIUM (need to ensure init ordering)

**B. Remove bounds check `ci >= TINY_NUM_CLASSES`** (Expected: +0.5-1%)
- In the free path, `ci` is derived from the header (already validated)
- Could add a `tiny_route_for_class_unchecked(ci)` variant
- Risk: MEDIUM (need careful caller audit)

**Total Expected Gain: +1.5-3%**

#### 3. **tiny_alloc_gate_fast()** (12.75% self%)

**Location**: `/mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139`

**Current Implementation**:
```c
static inline void* tiny_alloc_gate_fast(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Already optimized (Phase 3 C3)

    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }
    // ROUTE_TINY_FIRST fallback...
}
```

**Optimization Opportunities**:

**A. Specialize for common routes** (Expected: +1-2%)
- MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
- Could create a `tiny_alloc_gate_fast_legacy_only()` variant
- Eliminates the ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
- Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)

**B. Inline `malloc_tiny_fast_for_class()`** (Expected: +0.5-1%)
- Might not be fully inlined by LTO
- Add an `__attribute__((always_inline))` hint
- Risk: LOW

**Total Expected Gain: +1.5-3%**

---

### MEDIUM PRIORITY (Expected +0.5-1% each)

#### 4. **tiny_front_v3_enabled()** (7.75% self%)
- Appears in the free path via `free_tiny_fast_hot()`
- Likely a runtime env check that could be cached
- Risk: LOW
- Expected Gain: +0.5-1%

#### 5. **free.cold** (4.15% self%)
- Cold path for the free wrapper
- Handles classification and fallback
- Not a hot optimization target (already in the slow path)
- Expected Gain: <+0.5%

---

### LOW PRIORITY / IGNORE

#### 6. **main()** (12.53% self%)
- Benchmark overhead (not part of the allocator)
- IGNORE

#### 7. **malloc()** (12.43% self%)
- Already optimized in previous phases
- Appears lower than free in the profile
- Defer to next round

---

## Step 3: Recommended Next Steps

### Phase 3 D1: Free Path Route Cache (HIGH PRIORITY)

**Target**: Eliminate the snapshot check in `tiny_route_for_class()` on the free path
**Expected Gain**: +1-2%
**Risk**: MEDIUM
**Effort**: 2-3 hours

**Implementation**:
1. Add `tiny_static_route_for_free(ci)` function (mirror of the alloc path optimization)
2. Cache route decisions at init time in `g_tiny_static_route_free[8]`
3. Update `free_tiny_fast_hot()` to use the cached route
4. A/B test: BASELINE vs D1

**ENV Gate**: `HAKMEM_FREE_STATIC_ROUTE=1` (default: 0)

---

### Phase 3 D2: Wrapper Env Cache (HIGH PRIORITY)

**Target**: Cache the `wrapper_env_cfg()` result in the free path
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 1-2 hours

**Implementation**:
1. Cache the `wrapper_env_cfg()` result in TLS or an init-time global
2. Avoid the repeated memory load on every free() call
3. Update the free wrapper to use the cached pointer
4. A/B test: BASELINE vs D2

**ENV Gate**: `HAKMEM_WRAP_ENV_CACHE=1` (default: 0)

---

### Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)

**Target**: Specialize `tiny_alloc_gate_fast()` for the LEGACY-only route
**Expected Gain**: +1-2%
**Risk**: LOW
**Effort**: 2-3 hours

**Implementation**:
1. Create a `tiny_alloc_gate_fast_legacy()` specialized variant
2. Eliminate the ROUTE_POOL_ONLY and ROUTE_TINY_FIRST branches
3. Use it in the MIXED profile, where all classes are LEGACY
4. A/B test: BASELINE vs D3

**ENV Gate**: `HAKMEM_ALLOC_GATE_LEGACY_ONLY=1` (default: 0)

---

## Expected Cumulative Results

| Phase              | Optimization              | Expected Gain | Cumulative       |
|--------------------|---------------------------|---------------|------------------|
| Baseline           | MID_V3=0 + B3+B4+C3       | -             | 46.79M ops/s     |
| **Phase 3 D1**     | Free route cache          | +1-2%         | 47.3-47.7M       |
| **Phase 3 D2**     | Wrapper env cache         | +1-2%         | 47.8-48.7M       |
| **Phase 3 D3**     | Alloc gate specialization | +1-2%         | 48.3-49.7M       |
| **Total Expected** | -                         | **+3-6%**     | **48-50M ops/s** |

**With MID_V3 fix for Mixed**: +13% additional (expected ~56M ops/s total)

---

## Risk Assessment

| Optimization         | Risk Level | Mitigation                                      |
|----------------------|------------|-------------------------------------------------|
| Free route cache     | MEDIUM     | Ensure init ordering, ENV gate for rollback     |
| Wrapper env cache    | LOW        | Read-only after init, simple TLS cache          |
| Alloc specialization | LOW        | Profile-specific, existing static route pattern |

**All optimizations**: Follow the ENV gate + A/B test + decision pattern (research box)

---

## Next Actions

1. **Immediate**: Implement Phase 3 D1 (Free route cache)
   - Expected: +1-2% gain
   - Risk: MEDIUM (requires careful init ordering)
   - Timeline: 2-3 hours

2. **Follow-up**: Implement Phase 3 D2 (Wrapper env cache)
   - Expected: +1-2% gain
   - Risk: LOW
   - Timeline: 1-2 hours

3. **Optional**: Implement Phase 3 D3 (Alloc gate specialization)
   - Expected: +1-2% gain
   - Risk: LOW
   - Timeline: 2-3 hours

**Total Timeline**: 5-8 hours for +3-6% cumulative improvement

---

## Appendix: Raw Perf Data

### Full Perf Report (Top 20)

```
# Samples: 30 of event 'cycles:P'
# Event count (approx.): 921849973

    46.11%     0.00%     0  [.] 0000000000000000
    45.20%    28.95%     3  [.] free
    29.11%    12.75%     3  [.] tiny_alloc_gate_fast.lto_priv.0
    24.78%     4.39%     2  [.] tiny_route_for_class.lto_priv.0
    21.00%    12.53%     3  [.] main
    16.71%    12.43%     3  [.] malloc
    12.95%     4.27%     1  [.] tiny_region_id_write_header.lto_priv.0
     8.66%     4.39%     1  [.] tiny_c7_ultra_free
     8.56%     4.28%     1  [.] free_tiny_fast_cold.lto_priv.0
     7.85%     7.75%     2  [.] tiny_front_v3_enabled.lto_priv.0
     4.27%     0.00%     0  [.] 0x00007ad3a9c2d001
     4.23%     0.00%     0  [.] tiny_c7_ultra_enabled_env.lto_priv.0
     4.21%     0.00%     0  [.] 0x00007ad3ab960c81
     4.20%     0.00%     0  [.] 0x00007ad3ab939401
     4.15%     4.15%     1  [.] free.cold
     4.15%     0.00%     0  [.] unified_cache_push.lto_priv.0
     4.02%     4.02%     1  [.] hak_pool_free
```

### Baseline Run Details

**Run 1**: 46.84M ops/s
```
Throughput = 46841499 ops/s [iter=1000000 ws=400] time=0.021s [RSS] max_kb=30208
```

**Run 2**: 46.79M ops/s
```
Throughput = 46793317 ops/s [iter=1000000 ws=400] time=0.021s [RSS] max_kb=30080
```

**Run 3**: 45.77M ops/s
```
Throughput = 45772756 ops/s [iter=1000000 ws=400] time=0.022s [RSS] max_kb=34176
```

**Run 4**: 47.12M ops/s
```
Throughput = 47117176 ops/s [iter=1000000 ws=400] time=0.021s [RSS] max_kb=30080
```

**Run 5**: 42.36M ops/s (outlier)
```
Throughput = 42359615 ops/s [iter=1000000 ws=400] time=0.024s [RSS] max_kb=30080
```

---

## Document History

- **2025-12-13**: Initial baseline establishment and candidate analysis
- **Next**: Phase 3 D1 implementation (Free route cache)