hakmem/docs/analysis/PHASE3_BASELINE_AND_CANDIDATES.md
2025-12-14 00:05:11 +09:00

Phase 3: Baseline Establishment & Next Optimization Candidates

Date: 2025-12-13
Status: BASELINE ESTABLISHED
Goal: Identify next micro-optimization targets with +1-5% potential each


Executive Summary

Baseline Performance (MID_V3=0, MIXED workload):

  • Mean: 45.78M ops/s
  • Median: 46.79M ops/s
  • Range: 42.36M - 47.12M ops/s
  • StdDev: ~1.75M ops/s (3.8% of the mean)

Top Optimization Candidates:

  1. free() wrapper (28.95% self%) - HIGH PRIORITY
  2. tiny_alloc_gate_fast() (12.75% self%) - HIGH PRIORITY
  3. main() benchmark overhead (12.53% self%) - IGNORE (benchmark artifact)

Expected Next Gains: +3-8% cumulative from free path optimizations


Step 0: Baseline Establishment

Configuration Verification

Profile: HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE

Key Settings (verified in /mnt/workdisk/public_share/hakmem/core/bench_profile.h:74):

bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");  // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");

Optimization Flags Enabled:

  • HAKMEM_FREE_TINY_FAST_HOTCOLD=1 (Phase FREE-TINY-FAST-DUALHOT-1)
  • HAKMEM_WRAP_SHAPE=1 (Phase 2 B4: Hot/Cold wrapper split)
  • HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1 (Phase 2 B3: Route branch optimization)
  • HAKMEM_TINY_STATIC_ROUTE=1 (Phase 3 C3: Static routing, +2.2%)

Baseline Measurements (5 runs)

| Run | Throughput (M ops/s) | Time (ms) |
|-----|----------------------|-----------|
| 1 | 46.84 | 21 |
| 2 | 46.79 | 21 |
| 3 | 45.77 | 22 |
| 4 | 47.12 | 21 |
| 5 | 42.36 | 24 |

Statistics:

  • Mean: 45.78M ops/s
  • Median: 46.79M ops/s
  • Min: 42.36M ops/s
  • Max: 47.12M ops/s
  • Range: 4.76M ops/s (11.2%)
  • StdDev: ~1.75M ops/s (3.8% of the mean)

Assessment: The baseline is reasonably stable, with a standard deviation of ~3.8% of the mean. Run 5 (42.36M) is an outlier, likely due to system noise, so the median (46.79M) is the more reliable baseline reference.
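
The summary statistics above can be reproduced from the five run values. The following is a minimal sketch, not part of hakmem: all `demo_*` names are hypothetical, and it assumes the reported StdDev is the population standard deviation (dividing by n), which gives about 1.77M ops/s and is close to the ~1.75M quoted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Throughputs of the five baseline runs above, in M ops/s. */
const double demo_runs[5] = { 46.84, 46.79, 45.77, 47.12, 42.36 };

typedef struct {
    double mean, median, min, max;
    double stddev;   /* population standard deviation (divide by n) */
    double cv_pct;   /* stddev as a percentage of the mean */
} demo_stats_t;

static int demo_cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Newton iteration for sqrt so the sketch needs no libm.
 * Adequate for the small values used here. */
static double demo_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double r = (v > 1.0) ? v : 1.0;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + v / r);
    return r;
}

/* Summarize up to 64 samples; returns all zeros for unsupported n. */
demo_stats_t demo_stats(const double* xs, size_t n) {
    demo_stats_t s = {0};
    double sorted[64];
    double sum = 0.0, ss = 0.0;
    if (n == 0 || n > 64) return s;
    for (size_t i = 0; i < n; i++) { sorted[i] = xs[i]; sum += xs[i]; }
    qsort(sorted, n, sizeof sorted[0], demo_cmp_double);
    s.mean = sum / (double)n;
    s.min = sorted[0];
    s.max = sorted[n - 1];
    s.median = (n % 2) ? sorted[n / 2]
                       : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    for (size_t i = 0; i < n; i++) {
        double d = xs[i] - s.mean;
        ss += d * d;
    }
    s.stddev = demo_sqrt(ss / (double)n);
    s.cv_pct = 100.0 * s.stddev / s.mean;
    return s;
}
```

Calling `demo_stats(demo_runs, 5)` yields the mean, median, min and max listed above. The median is unaffected by the single slow run, which is why it is preferred as the reference value.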

Comparison to Previous:

  • Previous C3 baseline: ~39.8M ops/s (with default settings)
  • Current baseline: 46.79M ops/s
  • Improvement: +17.5% (confirms MID_V3=0 + cumulative optimizations working)

Step 1: Perf Profiling Results

Profiling Setup

Command:

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1

Results:

  • Samples: 30 (cycles:P event)
  • Event count: 921,849,973 cycles
  • Throughput: 47.37M ops/s (consistent with baseline)

Top Functions by Self%

| Rank | Symbol | Self% | Samples | Children% | Category |
|------|--------|-------|---------|-----------|----------|
| 1 | free | 28.95% | 3 | 45.20% | HOT WRAPPER |
| 2 | tiny_alloc_gate_fast.lto_priv.0 | 12.75% | 3 | 29.11% | HOT ALLOC |
| 3 | main | 12.53% | 3 | 21.00% | Benchmark |
| 4 | malloc | 12.43% | 3 | 16.71% | Wrapper |
| 5 | tiny_front_v3_enabled.lto_priv.0 | 7.75% | 2 | 7.85% | Tiny front |
| 6 | tiny_route_for_class.lto_priv.0 | 4.39% | 2 | 24.78% | Route lookup |
| 7 | free.cold | 4.15% | 1 | 4.15% | Cold path |
| 8 | hak_pool_free | 4.02% | 1 | 4.02% | Pool free |

Call Graph Analysis

free() hot path (28.95% self, 45.20% children):

free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%)  ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)

tiny_alloc_gate_fast (12.75% self, 29.11% children):

tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)

Key Insight: tiny_route_for_class() is called from free() and accounts for 20.38% of total time in that call graph. This makes it the #1 optimization target.


Step 2: Candidate Prioritization

HIGH PRIORITY (Expected +3-5% each)

1. free() wrapper path (28.95% self%)

Location: /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524

Current Implementation:

void free(void* ptr) {
    if (!ptr) return;

    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }

    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg();  // ← Memory load

    if (__builtin_expect(wcfg->wrap_shape, 0)) {        // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return;  // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}

Optimization Opportunities:

A. Cache wrapper_env_cfg() result (Expected: +1-2%)

  • Currently calls wrapper_env_cfg() on every free
  • Could cache in TLS or register during init
  • Risk: LOW (read-only after init)
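
Option A can be sketched as a per-thread cached pointer in front of the existing lookup. This is an illustration under stated assumptions, not the hakmem implementation: `demo_wrapper_cfg_t`, `demo_wrapper_env_cfg()` and the way `HAKMEM_WRAP_SHAPE` is parsed are hypothetical stand-ins, and the slow path assumes its first call is not raced by another thread. Note that the D2 result recorded later in this document measured a regression for this idea, so the sketch shows the shape of the change rather than a recommended one.

```c
#include <stdlib.h>

/* Hypothetical stand-in for wrapper_env_cfg_t. */
typedef struct { int wrap_shape; } demo_wrapper_cfg_t;

static demo_wrapper_cfg_t g_demo_cfg;
static int g_demo_cfg_ready;
int g_demo_cfg_slow_calls;   /* counts slow-path lookups, for demonstration */

/* Slow path: stands in for wrapper_env_cfg(); reads the environment once. */
const demo_wrapper_cfg_t* demo_wrapper_env_cfg(void) {
    g_demo_cfg_slow_calls++;
    if (!g_demo_cfg_ready) {
        const char* e = getenv("HAKMEM_WRAP_SHAPE");
        g_demo_cfg.wrap_shape = (e != NULL && e[0] == '1');
        g_demo_cfg_ready = 1;
    }
    return &g_demo_cfg;
}

/* Per-thread cache of the config pointer (C11 thread-local storage). */
static _Thread_local const demo_wrapper_cfg_t* t_demo_cfg;

/* Fast path: one TLS load and a well-predicted branch after the first call. */
const demo_wrapper_cfg_t* demo_wrapper_cfg_cached(void) {
    const demo_wrapper_cfg_t* c = t_demo_cfg;
    if (__builtin_expect(c == NULL, 0)) {
        c = demo_wrapper_env_cfg();
        t_demo_cfg = c;
    }
    return c;
}
```

The cached accessor trades the original memory load for a TLS load, so any gain depends on how cheap the original lookup already is.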

B. Inline free_tiny_fast_hot() decision (Expected: +1-2%)

  • Branch hak_free_tiny_fast_hotcold_enabled() is runtime env check
  • Could be compile-time or init-time cached
  • Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)

C. Reduce branch mispredictions (Expected: +0.5-1%)

  • Reorder branches to put likely path first
  • Current: bench_fast_enabled() checked first (unlikely=0)
  • Optimization: Move Tiny fast path check earlier
  • Risk: LOW

Total Expected Gain: +2.5-5%

2. tiny_route_for_class() (4.39% self%, 24.78% children)

Location: /mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147

Current Implementation:

static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls);  // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}

Optimization Opportunities:

A. Eliminate g_tiny_route_snapshot_done check (Expected: +1-2%)

  • Check happens on EVERY call from free path
  • Phase 3 C3 already implemented static routing for alloc path
  • Proposal: Apply same static route cache to free path
  • Implementation: Add tiny_static_route_for_free(ci) that bypasses snapshot check
  • Risk: MEDIUM (need to ensure init ordering)
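
A minimal sketch of the proposal in option A, assuming the route table can be copied once at startup. Every `demo_*` name, the class count of 8, and the toy route values are hypothetical; only the idea of a `tiny_static_route_for_free(ci)` style lookup without the snapshot check comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8   /* assumption: classes C0-C7 */

typedef enum { DEMO_ROUTE_LEGACY = 0, DEMO_ROUTE_OTHER = 1 } demo_route_kind_t;

static demo_route_kind_t g_demo_route_class[DEMO_NUM_CLASSES];
static int g_demo_snapshot_done;

/* Toy snapshot: every class is LEGACY except class 7. */
static void demo_route_snapshot_init(void) {
    for (int i = 0; i < DEMO_NUM_CLASSES; i++)
        g_demo_route_class[i] = DEMO_ROUTE_LEGACY;
    g_demo_route_class[7] = DEMO_ROUTE_OTHER;
    g_demo_snapshot_done = 1;
}

/* Existing shape: lazy-init check on every call. */
demo_route_kind_t demo_route_for_class(uint8_t ci) {
    if (__builtin_expect(!g_demo_snapshot_done, 0)) demo_route_snapshot_init();
    if (__builtin_expect(ci >= DEMO_NUM_CLASSES, 0)) return DEMO_ROUTE_LEGACY;
    return g_demo_route_class[ci];
}

static demo_route_kind_t g_demo_static_route_free[DEMO_NUM_CLASSES];
static int g_demo_static_route_ready;

/* Call once during startup, before any free() can reach the fast path. */
void demo_static_route_init(void) {
    for (int i = 0; i < DEMO_NUM_CLASSES; i++)
        g_demo_static_route_free[i] = demo_route_for_class((uint8_t)i);
    g_demo_static_route_ready = 1;
}

/* Free-path lookup: no snapshot check, only a bounds check and a load. */
demo_route_kind_t demo_static_route_for_free(uint8_t ci) {
    assert(g_demo_static_route_ready);   /* init-ordering guard, debug only */
    if (__builtin_expect(ci >= DEMO_NUM_CLASSES, 0)) return DEMO_ROUTE_LEGACY;
    return g_demo_static_route_free[ci];
}
```

The debug-only assertion is one way to surface the init-ordering risk mentioned above without adding a branch to release builds.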

B. Remove bounds check ci >= TINY_NUM_CLASSES (Expected: +0.5-1%)

  • In free path, ci is derived from header (already validated)
  • Could add tiny_route_for_class_unchecked(ci) variant
  • Risk: MEDIUM (need careful caller audit)
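
Option B could look like the sketch below: a checked lookup kept for general callers, plus an unchecked variant for callers that can prove the index is in range. The header layout is a pure assumption made for illustration (class index in the low 3 bits), as are all `demo_*` names; the real header format is not described in this document.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8   /* assumption: power of two, classes C0-C7 */

typedef enum { DEMO_ROUTE_LEGACY = 0, DEMO_ROUTE_OTHER = 1 } demo_route_kind_t;

/* Toy route table: class 7 differs so lookups are distinguishable. */
static const demo_route_kind_t g_demo_route_class[DEMO_NUM_CLASSES] = {
    [7] = DEMO_ROUTE_OTHER
};

/* Checked variant: safe for arbitrary input. */
demo_route_kind_t demo_route_for_class(uint8_t ci) {
    if (__builtin_expect(ci >= DEMO_NUM_CLASSES, 0)) return DEMO_ROUTE_LEGACY;
    return g_demo_route_class[ci];
}

/* Unchecked variant: the caller guarantees ci < DEMO_NUM_CLASSES.
 * The assert documents the contract and vanishes under NDEBUG. */
demo_route_kind_t demo_route_for_class_unchecked(uint8_t ci) {
    assert(ci < DEMO_NUM_CLASSES);
    return g_demo_route_class[ci];
}

/* Hypothetical header decode: masking makes the range guarantee
 * structural, so the unchecked lookup is safe for this caller. */
uint8_t demo_class_from_header(uint8_t header) {
    return (uint8_t)(header & (DEMO_NUM_CLASSES - 1));
}
```

A caller on the free path would then write `demo_route_for_class_unchecked(demo_class_from_header(h))`, while every other call site keeps the checked version until it has been audited.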

Total Expected Gain: +1.5-3%

3. tiny_alloc_gate_fast() (12.75% self%)

Location: /mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139

Current Implementation:

static inline void* tiny_alloc_gate_fast(size_t size)
{
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Already optimized (Phase 3 C3)

    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }

    // ROUTE_TINY_FIRST fallback...
}

Optimization Opportunities:

A. Specialize for common routes (Expected: +1-2%)

  • MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
  • Could create tiny_alloc_gate_fast_legacy_only() variant
  • Eliminates ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
  • Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)
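
The specialization in option A can be sketched as two gate variants with the choice made once at init time. This is a hedged illustration: the size classes, the `malloc`-backed allocator stub, and all `demo_*` names are hypothetical, and the ROUTE_TINY_FIRST fallback is omitted just as it is elided in the listing above.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NUM_CLASSES 8

typedef enum {
    DEMO_ROUTE_TINY_ONLY = 0, DEMO_ROUTE_TINY_FIRST, DEMO_ROUTE_POOL_ONLY
} demo_route_t;

static demo_route_t g_demo_route[DEMO_NUM_CLASSES];   /* zero = TINY_ONLY */

void demo_set_route(int class_idx, demo_route_t r) {
    if (class_idx >= 0 && class_idx < DEMO_NUM_CLASSES) g_demo_route[class_idx] = r;
}

/* Toy size classes: 8, 16, ..., 1024 bytes; -1 when the size is too large. */
int demo_size_to_class(size_t size) {
    size_t cap = 8;
    for (int c = 0; c < DEMO_NUM_CLASSES; c++, cap <<= 1)
        if (size <= cap) return c;
    return -1;
}

/* Stand-in for malloc_tiny_fast_for_class(). */
void* demo_tiny_alloc_for_class(size_t size, int class_idx) {
    (void)class_idx;
    return malloc(size ? size : 1);
}

/* General gate: honors POOL_ONLY by returning NULL to the caller. */
void* demo_alloc_gate_general(size_t size) {
    int c = demo_size_to_class(size);
    if (__builtin_expect(c < 0, 0)) return NULL;
    if (__builtin_expect(g_demo_route[c] == DEMO_ROUTE_POOL_ONLY, 0)) return NULL;
    return demo_tiny_alloc_for_class(size, c);
}

/* Specialized gate: valid only while every class is TINY_ONLY. */
void* demo_alloc_gate_tiny_only(size_t size) {
    int c = demo_size_to_class(size);
    if (__builtin_expect(c < 0, 0)) return NULL;
    return demo_tiny_alloc_for_class(size, c);
}

typedef void* (*demo_gate_fn)(size_t);

/* Pick the variant once; fall back to the general gate if any class
 * uses a different route, so POOL_ONLY is never silently ignored. */
demo_gate_fn demo_select_gate(void) {
    for (int c = 0; c < DEMO_NUM_CLASSES; c++)
        if (g_demo_route[c] != DEMO_ROUTE_TINY_ONLY) return demo_alloc_gate_general;
    return demo_alloc_gate_tiny_only;
}
```

A function pointer is used here only to keep the sketch short. In a hot path the selection would more likely be a cached flag or a compile-time choice, since an indirect call can cost more than the branches it removes.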

B. Inline malloc_tiny_fast_for_class() (Expected: +0.5-1%)

  • Might not be fully inlined by LTO
  • Add __attribute__((always_inline)) hint
  • Risk: LOW

Total Expected Gain: +1.5-3%


MEDIUM PRIORITY (Expected +0.5-1% each)

4. tiny_front_v3_enabled() (7.75% self%)

  • Appears in free path via free_tiny_fast_hot()
  • Likely a runtime env check that could be cached
  • Risk: LOW
  • Expected Gain: +0.5-1%
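
If tiny_front_v3_enabled() is indeed a runtime environment check, caching it usually takes the form of a tri-state flag that is resolved on first use. The sketch below assumes exactly that. The variable name `DEMO_FRONT_V3_ENABLED`, the default of "off when unset", and the call counter are hypothetical, and the unsynchronized first read is only acceptable if a duplicate `getenv` on another thread is harmless.

```c
#include <stdlib.h>

int g_demo_getenv_calls;   /* counts environment reads, for demonstration */

/* Returns 1 if the (hypothetical) gate is on, 0 otherwise.
 * The environment is consulted only on the first call. */
int demo_front_v3_enabled(void) {
    static int cached = -1;              /* -1 = not resolved yet */
    if (__builtin_expect(cached < 0, 0)) {
        const char* e = getenv("DEMO_FRONT_V3_ENABLED");
        g_demo_getenv_calls++;
        cached = (e != NULL && e[0] != '0') ? 1 : 0;   /* assumption: unset = off */
    }
    return cached;
}
```

After the first call the function reduces to one load and one predictable branch, which is the cost the 7.75% self time would need to be compared against.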

5. free.cold (4.15% self%)

  • Cold path for free wrapper
  • Handles classification and fallback
  • Not a hot optimization target (already in slow path)
  • Expected Gain: <+0.5%

LOW PRIORITY / IGNORE

6. main() (12.53% self%)

  • Benchmark overhead (not part of allocator)
  • IGNORE

7. malloc() (12.43% self%)

  • Already optimized in previous phases
  • Appears lower than free in profile
  • Defer to next round

Phase 3 D1: Free Path Route Cache: ADOPT (PROMOTED TO DEFAULT)

Target: Remove tiny_route_for_class() calls from the free path
Result: Mixed 20-run mean +2.19% / median +2.37%
Decision: Promoted to default in MIXED_TINYV3_C7_SAFE

ENV Gate:

  • HAKMEM_FREE_STATIC_ROUTE=0/1 (default: 0)
  • The MIXED_TINYV3_C7_SAFE preset injects 1 as its default (set 0 to roll back)
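
Preset default injection with rollback can be expressed with POSIX `setenv` and `overwrite=0`, which leaves a value the user already exported untouched. The sketch is an assumption about how a helper like bench_setenv_default might behave, not a copy of it; the `demo_*` functions are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <string.h>

/* Set name=value only if the variable is not already present.
 * Returns 0 on success, -1 on error (same as setenv). */
int demo_setenv_default(const char* name, const char* value) {
    return setenv(name, value, 0);   /* overwrite=0 keeps the user's value */
}

/* Hypothetical preset hook: turns the free route cache on by default. */
void demo_apply_mixed_preset(void) {
    demo_setenv_default("HAKMEM_FREE_STATIC_ROUTE", "1");
}

/* Reads a 0/1 gate; anything other than "1" counts as off. */
int demo_env_is_on(const char* name) {
    const char* e = getenv(name);
    return e != NULL && strcmp(e, "1") == 0;
}
```

With this shape, running the benchmark with `HAKMEM_FREE_STATIC_ROUTE=0` in the environment rolls the optimization back without touching the preset.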

Phase 3 D2: Wrapper Env Cache: NO-GO (FROZEN)

Target: Remove wrapper_env_cfg() calls from the wrapper hot path
Result: Mixed 10-run mean -1.44% regression
Decision: NO-GO (research box frozen, default OFF)

ENV Gate: HAKMEM_WRAP_ENV_CACHE=1 (default: 0)


Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)

Target: Minimize the branch shape of tiny_alloc_gate_fast() (for MIXED)
Expected Gain: +1-2%
Risk: LOW
Effort: 2-3 hours

Implementation:

  1. New ENV gate: HAKMEM_ALLOC_GATE_SHAPE=0/1
  2. Avoid tiny_route_get() and replace it with a direct g_tiny_route[] reference (avoids the release logging branch)
  3. Always honor ROUTE_POOL_ONLY (do not break HAKMEM_TINY_PROFILE=hot/off)
  4. A/B test: BASELINE vs D3

Design: docs/analysis/PHASE4_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md

ENV Gate: HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0)


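Expected Cumulative Results (updated)

| Phase | Optimization | Expected Gain | Notes |
|-------|--------------|---------------|-------|
| Baseline | MID_V3=0 + B3+B4+C3 | - | |
| D1 | Free route cache | +0 to +2% | ADOPT (Mixed preset default ON) |
| D2 | Wrapper env cache | - | NO-GO (frozen) |
| D3 | Alloc gate specialization | +0 to +2% | Start only if perf shows more than 5% |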

With MID_V3 fix for Mixed: +13% additional (expected ~56M ops/s total)


Risk Assessment

| Optimization | Risk Level | Mitigation |
|--------------|------------|------------|
| Free route cache | MEDIUM | Ensure init ordering, ENV gate for rollback |
| Wrapper env cache | NO-GO | -1.44% regression |
| Alloc specialization | LOW | Profile-specific, existing static route pattern |

All optimizations: Follow ENV gate + A/B test + decision pattern (research box)
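
The A/B decision step of that pattern can be sketched as a comparison of mean and median throughput between a baseline set of runs and a candidate set. The threshold is a hypothetical parameter, since the actual adoption criteria are not stated in this document, and the `demo_*` names are illustrative only.

```c
#include <stddef.h>
#include <stdlib.h>

typedef enum { DEMO_NO_GO = 0, DEMO_ADOPT = 1 } demo_decision_t;

static int demo_cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

static double demo_mean(const double* xs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += xs[i];
    return sum / (double)n;
}

/* Median of up to 64 samples, computed on a sorted copy. */
static double demo_median(const double* xs, size_t n) {
    double sorted[64];
    for (size_t i = 0; i < n; i++) sorted[i] = xs[i];
    qsort(sorted, n, sizeof sorted[0], demo_cmp_double);
    return (n % 2) ? sorted[n / 2] : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
}

/* Relative change of cand over base, in percent. */
double demo_gain_pct(double base, double cand) {
    return 100.0 * (cand - base) / base;
}

/* ADOPT only if both the mean gain and the median gain reach min_gain_pct.
 * Both arrays hold n throughput samples, with 1 <= n <= 64. */
demo_decision_t demo_ab_decide(const double* base, const double* cand,
                               size_t n, double min_gain_pct) {
    if (n == 0 || n > 64) return DEMO_NO_GO;
    double gm = demo_gain_pct(demo_mean(base, n), demo_mean(cand, n));
    double gd = demo_gain_pct(demo_median(base, n), demo_median(cand, n));
    return (gm >= min_gain_pct && gd >= min_gain_pct) ? DEMO_ADOPT : DEMO_NO_GO;
}
```

Requiring both statistics to clear the bar guards against a result that is driven by one or two outlier runs, such as the slow run seen in the baseline.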


Post-D1/D2 Status (2025-12-13)

Phase 3 D1/D2 Validation Complete

  1. D1 (Free Route Cache): ADOPT - PROMOTED TO DEFAULT

    • 20-run validation completed
    • Results: Mean +2.19%, Median +2.37% (both criteria met)
    • Status: Added to MIXED_TINYV3_C7_SAFE preset as default
    • Implementation: HAKMEM_FREE_STATIC_ROUTE=1
  2. D2 (Wrapper Env Cache): FROZEN

    • Results: -1.44% regression
    • Status: Research box frozen, default OFF, do not pursue
    • Implementation: HAKMEM_WRAP_ENV_CACHE=1 (opt-in only, not recommended)

Active Optimizations in MIXED_TINYV3_C7_SAFE

  1. B3: Routing branch shape (+2.89% proven)
  2. B4: Wrapper hot/cold split (+1.47% proven)
  3. C3: Static routing (+2.20% proven)
  4. D1: Free route cache (+2.19% proven) - NEW
  5. MID_V3: OFF for Mixed (C6 routing fix, +13% proven)

Cumulative gain: ~8.75% as the sum of the individually measured gains listed above (B3 + B4 + C3 + D1, excluding MID_V3 fix)
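
Per-phase gains measured in separate A/B tests can be combined additively or multiplicatively. The sketch below computes both for the four values listed above, assuming each figure is a relative gain over the configuration that preceded it; the `demo_*` names are hypothetical.

```c
#include <stddef.h>

/* Proven gains for B3, B4, C3 and D1 as listed above, in percent. */
const double demo_gains_pct[4] = { 2.89, 1.47, 2.20, 2.19 };

/* Simple sum of the percentages. */
double demo_additive_pct(const double* g, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += g[i];
    return sum;
}

/* Product of the (1 + g/100) factors, expressed back as a percentage. */
double demo_compounded_pct(const double* g, size_t n) {
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) factor *= 1.0 + g[i] / 100.0;
    return (factor - 1.0) * 100.0;
}
```

For these inputs the sum is 8.75% and the compounded figure is slightly above 9%. A measured end-to-end number can still differ from both, because individual optimizations may overlap.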

Next Actions

  1. Profile: Run perf on current baseline to identify next targets

    • Requirement: self% ≥5% for Phase 3 D3 consideration
    • Target: tiny_alloc_gate_fast specialization
  2. Optional: Phase 3 D3 (Alloc gate specialization) - pending perf validation

    • Only proceed if perf shows ≥5% self% in alloc gate
    • ENV: HAKMEM_ALLOC_GATE_SHAPE=0/1
  3. Phase 4 Planning: If no more 5%+ targets, prepare Phase 4 roadmap


Appendix: Raw Perf Data

Full Perf Report (Top 20)

# Samples: 30  of event 'cycles:P'
# Event count (approx.): 921849973

    46.11%     0.00%             0  [.] 0000000000000000
    45.20%    28.95%             3  [.] free
    29.11%    12.75%             3  [.] tiny_alloc_gate_fast.lto_priv.0
    24.78%     4.39%             2  [.] tiny_route_for_class.lto_priv.0
    21.00%    12.53%             3  [.] main
    16.71%    12.43%             3  [.] malloc
    12.95%     4.27%             1  [.] tiny_region_id_write_header.lto_priv.0
     8.66%     4.39%             1  [.] tiny_c7_ultra_free
     8.56%     4.28%             1  [.] free_tiny_fast_cold.lto_priv.0
     7.85%     7.75%             2  [.] tiny_front_v3_enabled.lto_priv.0
     4.27%     0.00%             0  [.] 0x00007ad3a9c2d001
     4.23%     0.00%             0  [.] tiny_c7_ultra_enabled_env.lto_priv.0
     4.21%     0.00%             0  [.] 0x00007ad3ab960c81
     4.20%     0.00%             0  [.] 0x00007ad3ab939401
     4.15%     4.15%             1  [.] free.cold
     4.15%     0.00%             0  [.] unified_cache_push.lto_priv.0
     4.02%     4.02%             1  [.] hak_pool_free

Baseline Run Details

Run 1: 46.84M ops/s

Throughput =  46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208

Run 2: 46.79M ops/s

Throughput =  46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080

Run 3: 45.77M ops/s

Throughput =  45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176

Run 4: 47.12M ops/s

Throughput =  47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080

Run 5: 42.36M ops/s (outlier)

Throughput =  42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080

Document History

  • 2025-12-13: Initial baseline establishment and candidate analysis
  • Next: Phase 3 D1 implementation (Free route cache)