Moe Charm (CI) b40aff290e Phase 4 D3 Design: Alloc Gate Shape

2025-12-14 00:05:11 +09:00

Phase 3: Baseline Establishment & Next Optimization Candidates

Date: 2025-12-13
Status: BASELINE ESTABLISHED
Goal: Identify next micro-optimization targets with +1-5% potential each

Executive Summary

Baseline Performance (MID_V3=0, MIXED workload):

Mean: 45.78M ops/s
Median: 46.79M ops/s
Range: 42.36M - 47.12M ops/s
StdDev: ~1.75M ops/s (about 3.8% of the mean)

Top Optimization Candidates:

free() wrapper (28.95% self%) - HIGH PRIORITY
tiny_alloc_gate_fast() (12.75% self%) - HIGH PRIORITY
main() benchmark overhead (12.53% self%) - IGNORE (benchmark artifact)

Expected Next Gains: +3-8% cumulative from free path optimizations

Step 0: Baseline Establishment

Configuration Verification

Profile: HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE

Key Settings (verified in /mnt/workdisk/public_share/hakmem/core/bench_profile.h:74):

bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");  // CRITICAL: MID_V3 disabled for Mixed
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");

Optimization Flags Enabled:

HAKMEM_FREE_TINY_FAST_HOTCOLD=1 (Phase FREE-TINY-FAST-DUALHOT-1)
HAKMEM_WRAP_SHAPE=1 (Phase 2 B4: Hot/Cold wrapper split)
HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1 (Phase 2 B3: Route branch optimization)
HAKMEM_TINY_STATIC_ROUTE=1 (Phase 3 C3: Static routing, +2.2%)

Baseline Measurements (5 runs)

Run	Throughput (M ops/s)	Time (ms)
1	46.84	21
2	46.79	21
3	45.77	22
4	47.12	21
5	42.36	24

Statistics:

Mean: 45.78M ops/s
Median: 46.79M ops/s
Min: 42.36M ops/s
Max: 47.12M ops/s
Range: 4.76M ops/s (11.2% of the minimum)
StdDev: ~1.75M ops/s (about 3.8% of the mean)

Assessment: The baseline is stable, with a standard deviation of about 3.8% of the mean. Run 5 (42.36M) is an outlier, likely due to system noise; the other four runs fall within 45.77M-47.12M. The median (46.79M) is therefore used as the baseline reference.
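The summary statistics can be re-derived from the five throughput values in the table. The sketch below is only a cross-check, not part of the benchmark harness; the helper names are hypothetical, and a small Newton iteration stands in for `sqrt` so the snippet does not need libm. With the population formula (divide by n) the standard deviation comes out at about 1.77M ops/s, close to the ~1.75M quoted; the sample formula (divide by n-1) would give about 1.98M.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Throughput of the five baseline runs in M ops/s (from the table above). */
static const double k_runs[5] = {46.84, 46.79, 45.77, 47.12, 42.36};

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double mean_of(const double *v, size_t n) {
    double sum = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Median of up to 16 values; works on a copy so the input stays unsorted. */
static double median_of(const double *v, size_t n) {
    double tmp[16];
    assert(n > 0 && n <= 16);
    for (size_t i = 0; i < n; i++) tmp[i] = v[i];
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Population variance (divides by n, not n-1). */
static double variance_of(const double *v, size_t n) {
    double m = mean_of(v, n), acc = 0.0;
    for (size_t i = 0; i < n; i++) acc += (v[i] - m) * (v[i] - m);
    return acc / (double)n;
}

/* Newton iteration for the square root; avoids a libm dependency. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = (x > 1.0) ? x : 1.0;
    for (int i = 0; i < 60; i++) r = 0.5 * (r + x / r);
    return r;
}
```

Dividing `sqrt_newton(variance_of(k_runs, 5))` by `mean_of(k_runs, 5)` gives the relative spread reported as 3.8%.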

Comparison to Previous:

Previous C3 baseline: ~39.8M ops/s (with default settings)
Current baseline: 46.79M ops/s
Improvement: +17.5% (confirms MID_V3=0 + cumulative optimizations working)

Step 1: Perf Profiling Results

Profiling Setup

Command:

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE perf record -F 99 -g ./bench_random_mixed_hakmem 10000000 400 1

Results:

Samples: 30 (cycles:P event)
Event count: 921,849,973 cycles
Throughput: 47.37M ops/s (consistent with baseline)

Top Functions by Self%

Rank	Symbol	Self%	Samples	Children%	Category
1	`free`	28.95%	3	45.20%	HOT WRAPPER
2	`tiny_alloc_gate_fast.lto_priv.0`	12.75%	3	29.11%	HOT ALLOC
3	`main`	12.53%	3	21.00%	Benchmark
4	`malloc`	12.43%	3	16.71%	Wrapper
5	`tiny_front_v3_enabled.lto_priv.0`	7.75%	2	7.85%	Tiny front
6	`tiny_route_for_class.lto_priv.0`	4.39%	2	24.78%	Route lookup
7	`free.cold`	4.15%	1	4.15%	Cold path
8	`hak_pool_free`	4.02%	1	4.02%	Pool free

Call Graph Analysis

free() hot path (28.95% self, 45.20% children):

free (28.95% self)
├── tiny_route_for_class.lto_priv.0 (20.38%)  ← MAJOR BOTTLENECK
├── free (recursive, 16.24%)
├── tiny_region_id_write_header.lto_priv.0 (4.29%)
└── malloc (4.28%)

tiny_alloc_gate_fast (12.75% self, 29.11% children):

tiny_alloc_gate_fast (12.75% self)
├── tiny_alloc_gate_fast (recursive inlining, 20.64%)
├── main (4.27%)
└── free (4.20%)

Key Insight: tiny_route_for_class(), called from free(), accounts for 20.38% of total time. This makes it the #1 optimization target.

Step 2: Candidate Prioritization

HIGH PRIORITY (Expected +1.5-5% each)

1. free() wrapper path (28.95% self%)

Location: /mnt/workdisk/public_share/hakmem/core/box/hak_wrappers.inc.h:524

Current Implementation:

void free(void* ptr) {
    if (!ptr) return;

    // BenchFast bypass (unlikely, 0)
    if (__builtin_expect(bench_fast_enabled(), 0)) { ... }

    const wrapper_env_cfg_t* wcfg = wrapper_env_cfg();  // ← Memory load

    if (__builtin_expect(wcfg->wrap_shape, 0)) {        // ← Branch
        if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 1)) {
            int freed;
            if (__builtin_expect(hak_free_tiny_fast_hotcold_enabled(), 0)) {
                freed = free_tiny_fast_hot(ptr);
            } else {
                freed = free_tiny_fast(ptr);
            }
            if (__builtin_expect(freed, 1)) {
                return;  // SUCCESS
            }
        }
        return free_cold(ptr, wcfg);
    }
    // Legacy path...
}

Optimization Opportunities:

A. Cache wrapper_env_cfg() result (Expected: +1-2%)

Currently calls wrapper_env_cfg() on every free
Could cache in TLS or register during init
Risk: LOW (read-only after init)
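One way to read option A is a per-thread cached pointer in front of the existing accessor. The sketch below is illustrative only: the one-field `wrapper_env_cfg_t`, the body of `wrapper_env_cfg()`, the counter `g_slow_calls`, and the name `wrapper_env_cfg_cached()` are all stand-ins invented here, not the hakmem definitions. Note that this document later records the wrapper env cache (Phase 3 D2) as a measured -1.44% regression, so treat this as a description of the idea rather than a recommendation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in; the real wrapper_env_cfg_t is not shown in this document. */
typedef struct {
    int wrap_shape;
} wrapper_env_cfg_t;

static wrapper_env_cfg_t g_wrapper_cfg;
static int g_wrapper_cfg_ready;
static int g_slow_calls; /* demo counter: how often the uncached accessor ran */

/* Stand-in for the existing accessor. It is not synchronized; a real
 * implementation would need a once-style initialization. */
static const wrapper_env_cfg_t *wrapper_env_cfg(void) {
    g_slow_calls++;
    if (!g_wrapper_cfg_ready) {
        const char *v = getenv("HAKMEM_WRAP_SHAPE");
        g_wrapper_cfg.wrap_shape = (v != NULL && v[0] == '1');
        g_wrapper_cfg_ready = 1;
    }
    return &g_wrapper_cfg;
}

/* Per-thread cached pointer; NULL until the first call on this thread. */
static _Thread_local const wrapper_env_cfg_t *t_wrapper_cfg;

static inline const wrapper_env_cfg_t *wrapper_env_cfg_cached(void) {
    const wrapper_env_cfg_t *cfg = t_wrapper_cfg;
    if (__builtin_expect(cfg == NULL, 0)) {
        cfg = wrapper_env_cfg(); /* slow path, taken once per thread */
        t_wrapper_cfg = cfg;
    }
    return cfg;
}
```

The cached path still costs a TLS load and a branch, which may be no cheaper than the accessor it replaces.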

B. Inline free_tiny_fast_hot() decision (Expected: +1-2%)

Branch hak_free_tiny_fast_hotcold_enabled() is runtime env check
Could be compile-time or init-time cached
Risk: LOW (already gated by HAKMEM_FREE_TINY_FAST_HOTCOLD)
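The two variants mentioned for option B (compile-time or init-time) could look roughly like this. It is a sketch under assumptions: the macro `HAKMEM_FREE_TINY_FAST_HOTCOLD_FIXED` and the function names are hypothetical, and treating an unset variable as "disabled" is a guess, since the real default is not stated here.

```c
#include <assert.h>
#include <stdlib.h>

#ifdef HAKMEM_FREE_TINY_FAST_HOTCOLD_FIXED
/* Compile-time variant (hypothetical macro): the decision is a constant,
 * so the compiler can drop the untaken branch entirely. */
static inline int hotcold_enabled_cached(void) {
    return HAKMEM_FREE_TINY_FAST_HOTCOLD_FIXED;
}
#else
/* Init-time variant: -1 means "not read yet", 0/1 is the cached answer. */
static int g_hotcold_state = -1;

static int hotcold_read_env(void) {
    const char *v = getenv("HAKMEM_FREE_TINY_FAST_HOTCOLD");
    return (v != NULL && v[0] == '1') ? 1 : 0; /* assumption: unset means off */
}

static inline int hotcold_enabled_cached(void) {
    int s = g_hotcold_state;
    if (__builtin_expect(s < 0, 0)) {
        s = hotcold_read_env(); /* getenv() runs only on the first call */
        g_hotcold_state = s;
    }
    return s;
}
#endif
```

After the first call the init-time variant reduces to one load and one well-predicted branch per free().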

C. Reduce branch mispredictions (Expected: +0.5-1%)

Reorder branches to put likely path first
Current: bench_fast_enabled() checked first (unlikely=0)
Optimization: Move Tiny fast path check earlier
Risk: LOW

Total Expected Gain: +2.5-5%

2. tiny_route_for_class() (4.39% self%, 24.78% children)

Location: /mnt/workdisk/public_share/hakmem/core/box/tiny_route_env_box.h:147

Current Implementation:

static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
    FREE_DISPATCH_STAT_INC(route_for_class_calls);  // Debug stat (RELEASE: noop)
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    if (__builtin_expect(ci >= TINY_NUM_CLASSES, 0)) {
        return TINY_ROUTE_LEGACY;
    }
    return g_tiny_route_class[ci];
}

Optimization Opportunities:

A. Eliminate g_tiny_route_snapshot_done check (Expected: +1-2%)

Check happens on EVERY call from free path
Phase 3 C3 already implemented static routing for alloc path
Proposal: Apply same static route cache to free path
Implementation: Add tiny_static_route_for_free(ci) that bypasses snapshot check
Risk: MEDIUM (need to ensure init ordering)
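A minimal sketch of the proposed `tiny_static_route_for_free(ci)`, assuming the table is filled once during allocator init. The class count of 8 (C0-C7), the enum values, the snapshot contents, and the names `g_free_static_route`, `g_free_static_route_ready`, and `tiny_static_route_for_free_init()` are stand-ins chosen for illustration; only `tiny_route_for_class`, `g_tiny_route_class`, and `g_tiny_route_snapshot_done` appear in the source.

```c
#include <assert.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8 /* assumption: classes C0-C7 */

typedef enum {
    TINY_ROUTE_LEGACY = 0,
    TINY_ROUTE_DEMO_OTHER = 1 /* hypothetical second route, unused below */
} tiny_route_kind_t;

static tiny_route_kind_t g_tiny_route_class[TINY_NUM_CLASSES];
static int g_tiny_route_snapshot_done;

/* Hypothetical free-path copy of the route table. */
static tiny_route_kind_t g_free_static_route[TINY_NUM_CLASSES];
static int g_free_static_route_ready;

/* Stand-in snapshot: every class routes to LEGACY, as described for MIXED. */
static void tiny_route_snapshot_init(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++)
        g_tiny_route_class[i] = TINY_ROUTE_LEGACY;
    g_tiny_route_snapshot_done = 1;
}

/* Must run before the first free() can reach the fast path. */
static void tiny_static_route_for_free_init(void) {
    if (!g_tiny_route_snapshot_done) tiny_route_snapshot_init();
    for (int i = 0; i < TINY_NUM_CLASSES; i++)
        g_free_static_route[i] = g_tiny_route_class[i];
    g_free_static_route_ready = 1;
}

/* Hot-path lookup: no snapshot-done check.
 * Precondition: init has run and ci < TINY_NUM_CLASSES. */
static inline tiny_route_kind_t tiny_static_route_for_free(uint8_t ci) {
    return g_free_static_route[ci];
}
```

The MEDIUM risk noted above is visible here: if a free() arrives before `tiny_static_route_for_free_init()` has run, the lookup reads a zero-initialized table. A caller could check `g_free_static_route_ready` once at gate-selection time and fall back to the checked lookup.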

B. Remove bounds check ci >= TINY_NUM_CLASSES (Expected: +0.5-1%)

In free path, ci is derived from header (already validated)
Could add tiny_route_for_class_unchecked(ci) variant
Risk: MEDIUM (need careful caller audit)
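The unchecked variant might be sketched as follows, with the bounds check turned into a debug-only `assert` so that release builds (compiled with `NDEBUG`) pay nothing for it. The table contents, class count, and snapshot stand-in are assumptions for the example; the snapshot check is kept because option B only concerns the bounds check.

```c
#include <assert.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8 /* assumption: classes C0-C7 */

typedef enum { TINY_ROUTE_LEGACY = 0 } tiny_route_kind_t;

static tiny_route_kind_t g_tiny_route_class[TINY_NUM_CLASSES];
static int g_tiny_route_snapshot_done;

/* Stand-in snapshot: every class routes to LEGACY. */
static void tiny_route_snapshot_init(void) {
    for (int i = 0; i < TINY_NUM_CLASSES; i++)
        g_tiny_route_class[i] = TINY_ROUTE_LEGACY;
    g_tiny_route_snapshot_done = 1;
}

/* For callers that have already validated ci (for example from the block header). */
static inline tiny_route_kind_t tiny_route_for_class_unchecked(uint8_t ci) {
    if (__builtin_expect(!g_tiny_route_snapshot_done, 0)) {
        tiny_route_snapshot_init();
    }
    assert(ci < TINY_NUM_CLASSES); /* debug builds only; removed by NDEBUG */
    return g_tiny_route_class[ci];
}
```

Because an out-of-range `ci` becomes an out-of-bounds read in release builds, every call site has to be audited, which is the MEDIUM risk noted above.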

Total Expected Gain: +1.5-3%

3. tiny_alloc_gate_fast() (12.75% self%)

Location: /mnt/workdisk/public_share/hakmem/core/box/tiny_alloc_gate_box.h:139

Current Implementation:

static inline void* tiny_alloc_gate_fast(size_t size)
{
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }

    TinyRoutePolicy route = tiny_route_get(class_idx);  // ← Already optimized (Phase 3 C3)

    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL;
    }

    void* user_ptr = malloc_tiny_fast_for_class(size, class_idx);

    if (__builtin_expect(route == ROUTE_TINY_ONLY, 1)) {
        return user_ptr;
    }

    // ROUTE_TINY_FIRST fallback...
}

Optimization Opportunities:

A. Specialize for common routes (Expected: +1-2%)

MIXED workload: C0-C7 are all ROUTE_TINY_ONLY (LEGACY)
Could create tiny_alloc_gate_fast_legacy_only() variant
Eliminates ROUTE_POOL_ONLY and ROUTE_TINY_FIRST checks
Risk: LOW (already profiled via HAKMEM_TINY_STATIC_ROUTE)
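A possible shape for the `tiny_alloc_gate_fast_legacy_only()` variant is shown below. It is a sketch, not the hakmem code: the size-class mapping (powers of two from 8 to 1024 bytes) and the `malloc`-backed `malloc_tiny_fast_for_class()` are placeholders so the example compiles on its own. It is only valid when every class is known to be ROUTE_TINY_ONLY.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_NUM_CLASSES 8 /* assumption: classes C0-C7 */

/* Placeholder mapping: class c holds sizes up to 8 << c bytes. */
static int hak_tiny_size_to_class(size_t size) {
    size_t cap = 8;
    for (int c = 0; c < TINY_NUM_CLASSES; c++, cap <<= 1) {
        if (size <= cap) return c;
    }
    return -1; /* too large for the tiny front */
}

/* Placeholder for the real per-class fast allocator. */
static void *malloc_tiny_fast_for_class(size_t size, int class_idx) {
    (void)class_idx;
    return malloc(size);
}

/* Specialized gate: no route lookup and no POOL_ONLY / TINY_FIRST branches. */
static inline void *tiny_alloc_gate_fast_legacy_only(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL; /* caller falls back to the general path */
    }
    return malloc_tiny_fast_for_class(size, class_idx);
}
```

The general gate would still be needed for profiles where some class is not ROUTE_TINY_ONLY, so the choice between the two has to be made from the route table at init time.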

B. Inline malloc_tiny_fast_for_class() (Expected: +0.5-1%)

Might not be fully inlined by LTO
Add __attribute__((always_inline)) hint
Risk: LOW
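The attribute placement for option B is shown below on a toy allocator. Everything except the GCC/Clang `always_inline` attribute is invented for the example: the bump arena, `DEMO_ARENA_BYTES`, and `tiny_alloc_gate_fast_demo()` merely give the attribute something to apply to.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ARENA_BYTES 4096 /* hypothetical demo arena size */

static unsigned char g_demo_arena[DEMO_ARENA_BYTES];
static size_t g_demo_used;

/* always_inline asks GCC/Clang to inline this function into every caller,
 * regardless of the optimizer's own cost heuristics. */
static inline __attribute__((always_inline))
void *malloc_tiny_fast_for_class(size_t size, int class_idx) {
    (void)class_idx;
    size_t need = (size + 15u) & ~(size_t)15u; /* round up to 16 bytes */
    if (need > DEMO_ARENA_BYTES - g_demo_used) return NULL;
    void *p = &g_demo_arena[g_demo_used];
    g_demo_used += need;
    return p;
}

/* The gate now contains the allocator body instead of a call to it. */
static void *tiny_alloc_gate_fast_demo(size_t size, int class_idx) {
    return malloc_tiny_fast_for_class(size, class_idx);
}
```

Whether this helps depends on whether LTO was already inlining the function; comparing the perf symbol list before and after is one way to check.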

Total Expected Gain: +1.5-3%

MEDIUM PRIORITY (Expected +0.5-1% each)

4. tiny_front_v3_enabled() (7.75% self%)

Appears in free path via free_tiny_fast_hot()
Likely a runtime env check that could be cached
Risk: LOW
Expected Gain: +0.5-1%

5. free.cold (4.15% self%)

Cold path for free wrapper
Handles classification and fallback
Not a hot optimization target (already in slow path)
Expected Gain: <+0.5%

LOW PRIORITY / IGNORE

6. main() (12.53% self%)

Benchmark overhead (not part of allocator)
IGNORE

7. malloc() (12.43% self%)

Already optimized in previous phases
Appears lower than free in profile
Defer to next round

Step 3: Recommended Next Steps

Phase 3 D1: Free Path Route Cache ✅ ADOPT (PROMOTED TO DEFAULT)

Target: Remove tiny_route_for_class() calls from the free path
Result: Mixed 20-run mean +2.19% / median +2.37%
Decision: ✅ Promoted to default in MIXED_TINYV3_C7_SAFE

ENV Gate:

HAKMEM_FREE_STATIC_ROUTE=0/1 (default: 0)
The MIXED_TINYV3_C7_SAFE preset injects 1 as its default (set 0 to roll back)

Phase 3 D2: Wrapper Env Cache ❌ NO-GO (FROZEN)

Target: Remove wrapper_env_cfg() calls from the wrapper hot path
Result: Mixed 10-run mean -1.44% regression
Decision: ❌ NO-GO (research box frozen, default OFF)

ENV Gate: HAKMEM_WRAP_ENV_CACHE=1 (default: 0)

Phase 3 D3: Alloc Gate Specialization (MEDIUM PRIORITY)

Target: Minimize the branch shape of tiny_alloc_gate_fast() (for MIXED)
Expected Gain: +1-2%
Risk: LOW
Effort: 2-3 hours

Implementation:

New ENV gate: HAKMEM_ALLOC_GATE_SHAPE=0/1
Avoid tiny_route_get() and read g_tiny_route[] directly instead (bypasses the release logging branch)
Always honor ROUTE_POOL_ONLY (must not break HAKMEM_TINY_PROFILE=hot/off)
A/B test: BASELINE vs D3

Design: docs/analysis/PHASE4_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md

ENV Gate: HAKMEM_ALLOC_GATE_SHAPE=0/1 (default: 0)
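The D3 shape described above (direct table read, ROUTE_POOL_ONLY still honored) could be sketched as follows. The enum layout, the size-class mapping, the `malloc`-backed allocator stub, and the function name `tiny_alloc_gate_shaped()` are assumptions made for the example. Because the ROUTE_TINY_FIRST fallback is elided in the source, the sketch simply returns the tiny result and leaves any fallback to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TINY_NUM_CLASSES 8 /* assumption: classes C0-C7 */

/* Hypothetical enum layout; only the three names come from the source. */
typedef enum { ROUTE_TINY_ONLY = 0, ROUTE_TINY_FIRST, ROUTE_POOL_ONLY } TinyRoutePolicy;

/* Zero-initialized, so every class starts as ROUTE_TINY_ONLY in this demo. */
static TinyRoutePolicy g_tiny_route[TINY_NUM_CLASSES];

/* Placeholder mapping: class c holds sizes up to 8 << c bytes. */
static int hak_tiny_size_to_class(size_t size) {
    size_t cap = 8;
    for (int c = 0; c < TINY_NUM_CLASSES; c++, cap <<= 1) {
        if (size <= cap) return c;
    }
    return -1;
}

/* Placeholder for the real per-class fast allocator. */
static void *malloc_tiny_fast_for_class(size_t size, int class_idx) {
    (void)class_idx;
    return malloc(size);
}

static inline void *tiny_alloc_gate_shaped(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0 || class_idx >= TINY_NUM_CLASSES, 0)) {
        return NULL;
    }
    /* Direct table read instead of tiny_route_get(). */
    TinyRoutePolicy route = g_tiny_route[class_idx];
    if (__builtin_expect(route == ROUTE_POOL_ONLY, 0)) {
        return NULL; /* POOL_ONLY is always honored */
    }
    /* TINY_ONLY and TINY_FIRST share this path; NULL means "caller falls back". */
    return malloc_tiny_fast_for_class(size, class_idx);
}
```

In an A/B run, `HAKMEM_ALLOC_GATE_SHAPE` would select between this function and the existing `tiny_alloc_gate_fast()`.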

Expected Cumulative Results (updated)

Phase	Optimization	Expected Gain	Notes
Baseline	MID_V3=0 + B3+B4+C3	-	-
D1	Free route cache	+0 to +2%	✅ ADOPT (Mixed preset default ON)
D2	Wrapper env cache	-	NO-GO (frozen)
D3	Alloc gate specialization	+0 to +2%	Start only if perf shows more than 5%

With MID_V3 fix for Mixed: +13% additional (expected ~56M ops/s total)

Risk Assessment

Optimization	Risk Level	Mitigation
Free route cache	MEDIUM	Ensure init ordering, ENV gate for rollback
Wrapper env cache	-	NO-GO (-1.44% regression)
Alloc specialization	LOW	Profile-specific, existing static route pattern

All optimizations: Follow ENV gate + A/B test + decision pattern (research box)
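The decision step of that pattern can be written down as a small comparison of two sets of runs. This is a sketch of one plausible rule: the threshold parameter `min_gain_pct`, the type `ab_result_t`, and the requirement that both mean and median clear the same threshold are assumptions, since the exact adoption criteria are not given here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define AB_MAX_RUNS 64 /* hypothetical cap so the median can use a stack buffer */

typedef struct {
    double mean_gain_pct;
    double median_gain_pct;
    int adopt; /* 1 = ADOPT, 0 = NO-GO */
} ab_result_t;

static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

static double mean_of(const double *v, size_t n) {
    double sum = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

static double median_of(const double *v, size_t n) {
    double tmp[AB_MAX_RUNS];
    assert(n > 0 && n <= AB_MAX_RUNS);
    for (size_t i = 0; i < n; i++) tmp[i] = v[i];
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Compares throughput of gate OFF (base) against gate ON (cand). */
static ab_result_t ab_decide(const double *base, size_t nb,
                             const double *cand, size_t nc,
                             double min_gain_pct) {
    ab_result_t r;
    r.mean_gain_pct = (mean_of(cand, nc) / mean_of(base, nb) - 1.0) * 100.0;
    r.median_gain_pct = (median_of(cand, nc) / median_of(base, nb) - 1.0) * 100.0;
    r.adopt = (r.mean_gain_pct >= min_gain_pct) &&
              (r.median_gain_pct >= min_gain_pct);
    return r;
}
```

Requiring both statistics to pass makes a single outlier run less likely to decide the outcome on its own.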

Post-D1/D2 Status (2025-12-13)

Phase 3 D1/D2 Validation Complete ✅

D1 (Free Route Cache): ✅ ADOPT - PROMOTED TO DEFAULT
- 20-run validation completed
- Results: Mean +2.19%, Median +2.37% (both criteria met)
- Status: Added to MIXED_TINYV3_C7_SAFE preset as default
- Implementation: HAKMEM_FREE_STATIC_ROUTE=1
D2 (Wrapper Env Cache): ❌ FROZEN
- Results: -1.44% regression
- Status: Research box frozen, default OFF, do not pursue
- Implementation: HAKMEM_WRAP_ENV_CACHE=1 (opt-in only, not recommended)

Active Optimizations in MIXED_TINYV3_C7_SAFE

B3: Routing branch shape (+2.89% proven)
B4: Wrapper hot/cold split (+1.47% proven)
C3: Static routing (+2.20% proven)
D1: Free route cache (+2.19% proven) - NEW
MID_V3: OFF for Mixed (C6 routing fix, +13% proven)

Cumulative gain: ~7.6% (B3 + B4 + C3 + D1, excluding MID_V3 fix)
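For reference, the four per-phase figures listed above can be combined in two simple ways. The sketch below computes both; the function names are hypothetical, and the compounded figure assumes each gain was measured on top of the previous configuration. Plain addition gives 8.75% and compounding gives about 9.04%, both higher than the ~7.6% quoted, so that figure was evidently derived some other way (for example from a direct end-to-end measurement), which this document does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Individually measured gains in percent: B3, B4, C3, D1 (from the list above). */
static const double k_gain_pct[4] = {2.89, 1.47, 2.20, 2.19};

/* Simple sum of the percentages. */
static double additive_gain_pct(const double *g, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += g[i];
    return sum;
}

/* Product of (1 + g/100) factors, expressed again as a percentage. */
static double compounded_gain_pct(const double *g, size_t n) {
    double factor = 1.0;
    for (size_t i = 0; i < n; i++) factor *= 1.0 + g[i] / 100.0;
    return (factor - 1.0) * 100.0;
}
```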

Next Actions

Profile: Run perf on current baseline to identify next targets
- Requirement: self% ≥5% for Phase 3 D3 consideration
- Target: tiny_alloc_gate_fast specialization
Optional: Phase 3 D3 (Alloc gate specialization) - pending perf validation
- Only proceed if perf shows ≥5% self% in alloc gate
- ENV: HAKMEM_ALLOC_GATE_SHAPE=0/1
Phase 4 Planning: If no more 5%+ targets, prepare Phase 4 roadmap

Appendix: Raw Perf Data

Full Perf Report (Top 20)

# Samples: 30  of event 'cycles:P'
# Event count (approx.): 921849973

    46.11%     0.00%             0  [.] 0000000000000000
    45.20%    28.95%             3  [.] free
    29.11%    12.75%             3  [.] tiny_alloc_gate_fast.lto_priv.0
    24.78%     4.39%             2  [.] tiny_route_for_class.lto_priv.0
    21.00%    12.53%             3  [.] main
    16.71%    12.43%             3  [.] malloc
    12.95%     4.27%             1  [.] tiny_region_id_write_header.lto_priv.0
     8.66%     4.39%             1  [.] tiny_c7_ultra_free
     8.56%     4.28%             1  [.] free_tiny_fast_cold.lto_priv.0
     7.85%     7.75%             2  [.] tiny_front_v3_enabled.lto_priv.0
     4.27%     0.00%             0  [.] 0x00007ad3a9c2d001
     4.23%     0.00%             0  [.] tiny_c7_ultra_enabled_env.lto_priv.0
     4.21%     0.00%             0  [.] 0x00007ad3ab960c81
     4.20%     0.00%             0  [.] 0x00007ad3ab939401
     4.15%     4.15%             1  [.] free.cold
     4.15%     0.00%             0  [.] unified_cache_push.lto_priv.0
     4.02%     4.02%             1  [.] hak_pool_free

Baseline Run Details

Run 1: 46.84M ops/s

Throughput =  46841499 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30208

Run 2: 46.79M ops/s

Throughput =  46793317 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080

Run 3: 45.77M ops/s

Throughput =  45772756 ops/s [iter=1000000 ws=400] time=0.022s
[RSS] max_kb=34176

Run 4: 47.12M ops/s

Throughput =  47117176 ops/s [iter=1000000 ws=400] time=0.021s
[RSS] max_kb=30080

Run 5: 42.36M ops/s (outlier)

Throughput =  42359615 ops/s [iter=1000000 ws=400] time=0.024s
[RSS] max_kb=30080

Document History

2025-12-13: Initial baseline establishment and candidate analysis
Next: Phase 3 D1 implementation (Free route cache)
