tomoaki/hakmem

Fork 0

Files

Moe Charm (CI) b40aff290e Phase 4 D3 Design: Alloc Gate Shape

2025-12-14 00:05:11 +09:00

13 KiB

Raw Blame History

Phase 3 Finalization Summary

Date: 2025-12-13 Status: Phase 3 D1/D2 Validation Complete Decision: D1 PROMOTED TO DEFAULT, D2 FROZEN

Executive Summary

Phase 3 has been successfully completed with comprehensive validation of D1 (Free Route Cache) and D2 (Wrapper Env Cache). D1 showed strong, consistent gains in 20-run validation and has been promoted to the MIXED_TINYV3_C7_SAFE preset default. D2 showed regression and has been frozen as a research box.

Key Results

D1 (Free Route Cache): +2.19% mean, +2.37% median → ADOPTED
D2 (Wrapper Env Cache): -1.44% regression → FROZEN
Cumulative Phase 2-3 Gains: ~7.6% (B3 + B4 + C3 + D1)
Baseline Phase 3: 46.04M ops/s (Mixed, 10-run)

Timeline: Phase 2 → Phase 3 Journey

Phase 2: Structural Changes

B3: Routing Branch Shape (+2.89%)

Status: ✅ ADOPTED
Implementation: HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1
Strategy: LIKELY on LEGACY (hot), cold helper for rare routes
Results: Mixed +2.89%, C6-heavy +9.13%
Impact: Improved branch prediction for common allocation paths

B4: Wrapper Hot/Cold Split (+1.47%)

Status: ✅ ADOPTED
Implementation: HAKMEM_WRAP_SHAPE=1
Strategy: noinline,cold helpers for rare checks (LD mode, jemalloc, diagnostics)
Results: Mixed +1.47%
Impact: Reduced wrapper entry overhead

Phase 3: Cache Locality Optimizations

C1: TLS Prefetch (NEUTRAL)

Status: 🔬 NEUTRAL / FROZEN
Implementation: HAKMEM_TINY_PREFETCH=1
Results: Mixed -0.34% mean, +1.28% median
Decision: Research box (default OFF)
Reason: Prefetch timing dependent, effect within noise range

C2: Metadata Cache (NEUTRAL)

Status: 🔬 NEUTRAL / FROZEN
Implementation: HAKMEM_TINY_METADATA_CACHE=1
Results: Mixed -0.45% mean, -1.06% median
Decision: Research box (default OFF)
Reason: Learner interlock cost + cache benefits not realized in current hot path

C3: Static Routing (+2.20%)

Status: ✅ ADOPTED
Implementation: HAKMEM_TINY_STATIC_ROUTE=1
Strategy: Bypass policy_snapshot + learner evaluation with static routing table
Results: Mixed +2.20%
Impact: Eliminated atomic + branch overhead in allocation path

C4: MID_V3 Routing Fix (+13%)

Status: ✅ ADOPTED
Implementation: HAKMEM_MID_V3_ENABLED=0 for Mixed
Results: Mixed +13% (43.33M → 48.97M ops/s)
Decision: Mixed OFF by default, C6-heavy ON
Reason: C6 routing to LEGACY is faster in Mixed workload

D1: Free Route Cache (+2.19%) ✅ PROMOTED

Status: ✅ ADOPTED (2025-12-13)
Implementation: HAKMEM_FREE_STATIC_ROUTE=1
Strategy: TLS cache for free path routing, bypass tiny_route_for_class()
Initial 10-run: Mean +1.06%, Median -0.77%
20-run Validation:
- Baseline (ROUTE=0): Mean 46.30M ops/s, Median 46.30M ops/s
- Optimized (ROUTE=1): Mean 47.32M ops/s, Median 47.39M ops/s
- Gain: Mean +2.19%, Median +2.37%
Decision: PROMOTE TO DEFAULT (both criteria met: mean >= +1.0%, median >= +0.0%)
Impact: Eliminates tiny_route_for_class() call overhead in free path

D2: Wrapper Env Cache (-1.44%) ❌ FROZEN

Status: ❌ NO-GO / FROZEN
Implementation: HAKMEM_WRAP_ENV_CACHE=1
Strategy: TLS cache for wrapper_env_cfg() pointer
Results: Mixed -1.44% regression
Decision: FREEZE (do not pursue further)
Reason: TLS cache overhead > benefit, simple global access faster
Lesson: Not all caching helps - profile before adding indirection

Statistical Validation Details

Baseline Phase 3 (10-run, Mixed, 20M iters, ws=400)

Date: 2025-12-13

Raw Data:

45753693, 46285007, 45977011, 46142131, 46068493,
45920245, 46143884, 46011560, 45995670, 46084818

Statistics:

Mean: 46,038,251 ops/s (46.04M ops/s)
Median: 46,040,027 ops/s (46.04M ops/s)
StdDev: 144,182 ops/s (0.14M ops/s)
Min: 45,753,693 ops/s (45.75M ops/s)
Max: 46,285,007 ops/s (46.29M ops/s)

D1 Validation: 20-run Comparison

Baseline (HAKMEM_FREE_STATIC_ROUTE=0)

Raw Data (20 runs):

46264909, 46143884, 46296296, 46439628, 46296296,
46189376, 46296296, 46499548, 46296296, 46387832,
46143884, 46296296, 46143884, 46296296, 46439628,
46296296, 46296296, 46439628, 46296296, 46296296

Statistics:

Mean: 46,302,758 ops/s (46.30M ops/s)
Median: 46,296,296 ops/s (46.30M ops/s)
StdDev: 100,680 ops/s (0.10M ops/s)
Min: 46,143,884 ops/s (46.14M ops/s)
Max: 46,499,548 ops/s (46.50M ops/s)

Optimized (HAKMEM_FREE_STATIC_ROUTE=1)

Raw Data (20 runs):

47259147, 47259147, 47501710, 47393365, 47165991,
47165991, 47393365, 47165991, 47393365, 47393365,
47165991, 47393365, 47165991, 47393365, 47393365,
47393365, 47393365, 47393365, 47165991, 47393365

Statistics:

Mean: 47,317,148 ops/s (47.32M ops/s)
Median: 47,393,365 ops/s (47.39M ops/s)
StdDev: 112,807 ops/s (0.11M ops/s)
Min: 47,165,991 ops/s (47.17M ops/s)
Max: 47,501,710 ops/s (47.50M ops/s)

Gain Analysis

Mean Gain: +2.19% ✓ (>= +1.0% threshold)
Median Gain: +2.37% ✓ (>= +0.0% threshold)
Variance Ratio: 1.12x (optimized/baseline)

Decision Criteria (from PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md:65):

IF mean_gain >= +1.0% AND median_gain >= +0.0%:
  → GO: Promote HAKMEM_FREE_STATIC_ROUTE=1 to default

Result: Both criteria met → PROMOTE TO DEFAULT ✅

Cumulative Gains: Phase 2-3

Active Optimizations in MIXED_TINYV3_C7_SAFE

B3: Routing Branch Shape (+2.89%)
- ENV: HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1
- Impact: Branch prediction optimization
B4: Wrapper Hot/Cold Split (+1.47%)
- ENV: HAKMEM_WRAP_SHAPE=1
- Impact: Reduced wrapper overhead
C3: Static Routing (+2.20%)
- ENV: HAKMEM_TINY_STATIC_ROUTE=1
- Impact: Policy snapshot bypass
D1: Free Route Cache (+2.19%) - NEW
- ENV: HAKMEM_FREE_STATIC_ROUTE=1
- Impact: Free path routing cache
MID_V3 Routing Fix (+13%)
- ENV: HAKMEM_MID_V3_ENABLED=0 (Mixed)
- Impact: C6 routing to LEGACY

Gain Calculation

Additive approximation (conservative):

B3 + B4 + C3 + D1 = 2.89% + 1.47% + 2.20% + 2.19% = 8.75%

Multiplicative (more realistic):

(1.0289) × (1.0147) × (1.0220) × (1.0219) ≈ 1.0893 → +8.93%

Note: MID_V3 fix (+13%) is a structural change, not additive to the above.

Conservative estimate: ~7.6-8.9% cumulative gain from Phase 2-3 optimizations

Research Boxes: Frozen vs Available

Frozen (Do Not Pursue)

D2: Wrapper Env Cache
- ENV: HAKMEM_WRAP_ENV_CACHE=1
- Status: ❌ FROZEN
- Reason: -1.44% regression, TLS overhead > benefit
B1: Header Tax Reduction v2
- ENV: HAKMEM_TINY_HEADER_MODE=LIGHT
- Status: ❌ FROZEN
- Reason: -2.54% regression
A3: Always Inline Header
- ENV: HAKMEM_TINY_HEADER_ALWAYS_INLINE=1
- Status: ❌ FROZEN
- Reason: -4.00% regression (I-cache pressure)

Available for Research (NEUTRAL)

C1: TLS Prefetch
- ENV: HAKMEM_TINY_PREFETCH=1
- Status: 🔬 NEUTRAL (default OFF)
- Results: -0.34% mean, +1.28% median
C2: Metadata Cache
- ENV: HAKMEM_TINY_METADATA_CACHE=1
- Status: 🔬 NEUTRAL (default OFF)
- Results: -0.45% mean, -1.06% median

Next Phase: D3 Conditions

D3: Alloc Gate Specialization

Requirement: perf validation showing tiny_alloc_gate_fast self% ≥ 5%

Design: docs/analysis/PHASE4_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md

Strategy: Specialize alloc gate for fixed MIXED configuration

Eliminate dynamic checks
Inline hot paths
Reduce branch complexity

ENV: HAKMEM_ALLOC_GATE_SHAPE=0/1

Decision Criteria:

IF perf shows ≥5% self% in alloc gate → Proceed with D3
ELSE → Move to Phase 4 planning

Perf Validation Required

perf record -F 99 --call-graph dwarf -- \
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE ./bench_random_mixed_hakmem 20000000 400 1
perf report --stdio

Target: Identify functions with self% ≥ 5% for optimization

Implementation Changes

File: core/bench_profile.h

Added (line 80-81):

// Phase 3 D1: Free route cache (TLS cache for free path routing, +2.19% proven)
bench_setenv_default("HAKMEM_FREE_STATIC_ROUTE", "1");

Location: MIXED_TINYV3_C7_SAFE preset section

Effect: D1 optimization now enabled by default for Mixed workload

Documentation Updates

Files Updated (6 total)

PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md
- Added BASELINE_PHASE3 (10-run summary)
- Updated D1 status: ADOPT (20-run validation results)
- Added D2 status: FROZEN (NO-GO)
PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md
- Added 20-run validation section
- Decision: PROMOTE TO DEFAULT
- Updated operational status
PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md
- Added Phase 3 Final Status: FROZEN
- Reason: -1.44% regression
ENV_PROFILE_PRESETS.md
- Updated D1: ADOPT (promoted to default)
- Updated D2: FROZEN (do not pursue)
- Added 20-run validation results
PHASE3_BASELINE_AND_CANDIDATES.md
- Added Post-D1/D2 Status section
- Updated Active Optimizations list
- Cumulative gain: ~7.6%
CURRENT_TASK.md
- Updated current status: Phase 3 D1/D2 Validation Complete
- D1: PROMOTED, D2: FROZEN
- Baseline Phase 3: 46.04M ops/s

Lessons Learned

1. Statistical Rigor Matters

Initial 10-run for D1 showed +1.06% mean but -0.77% median, creating uncertainty.

20-run validation resolved ambiguity: +2.19% mean, +2.37% median (both positive).

Lesson: For borderline cases, invest in larger sample sizes to reduce variance and confirm trends.

2. Not All Caching Helps

D2 hypothesis: TLS caching of wrapper_env_cfg() would reduce overhead.

Reality: Simple global pointer access was faster than TLS cache indirection.

Lesson: Profile before adding indirection. Global access patterns can be more efficient than local caching when the global is already cache-resident.

3. TLS Overhead is Real

Both C1 (prefetch) and D2 (env cache) showed that adding TLS operations isn't always beneficial.

Lesson: TLS access has non-zero cost. Only worthwhile when it eliminates heavier operations (like D1's route calculation).

4. 20-run Validation is Worth It

10-run: Faster, but higher variance (±2-3% noise) 20-run: Slower, but lower variance (±1-2% noise)

Lesson: For promotion decisions, 20-run validation provides confidence that gains are real, not measurement artifacts.

Build & Test Results

Rebuild Verification

make clean && make bench_random_mixed_hakmem

Status: ✅ SUCCESSFUL Warnings: None related to D1 changes Sanity Check: 47.20M ops/s (D1 enabled by default, matches optimized baseline)

Benchmark Configuration

Command:

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE ./bench_random_mixed_hakmem 20000000 400 1

Parameters:

Iterations: 20,000,000
Working set: 400
Threads: 1

Environment:

Date: 2025-12-13
Kernel: Linux 6.8.0-87-generic
Build: Release (LTO enabled)

Success Criteria: Achieved ✅

Current baseline established (10-run)
D1 baseline 20-run collected
D1 optimized 20-run collected
Statistical analysis complete
D1 decision made (GO → PROMOTED)
Preset updated (HAKMEM_FREE_STATIC_ROUTE=1 default)
All docs synchronized with results
Comprehensive summary created
Ready for final commit

Future Work

Phase 3 D3: Pending Perf Validation

Condition: Proceed if tiny_alloc_gate_fast self% ≥ 5%

Next Steps:

Run perf on current baseline (with D1 enabled)
Analyze top functions
If alloc gate ≥5%, implement D3 specialization
If not, move to Phase 4 planning

Phase 4: TBD

Potential Directions:

Wrapper layer further optimization (if perf shows opportunity)
Free path second-level optimizations
Allocator-wide architectural simplification

Decision Point: After Phase 3 D3 validation

Conclusion

Phase 3 has successfully delivered +2.19% improvement through D1 (Free Route Cache), bringing the cumulative Phase 2-3 gain to ~7.6-8.9%. D2 (Wrapper Env Cache) was correctly rejected due to regression, demonstrating the value of rigorous A/B testing.

The 20-run validation methodology proved essential for borderline optimizations, providing statistical confidence for promotion decisions. D1 is now active by default in the MIXED_TINYV3_C7_SAFE preset, and all documentation has been synchronized.

Next steps depend on perf validation: if alloc gate shows ≥5% overhead, Phase 3 D3 will proceed; otherwise, Phase 4 planning begins.

Phase 3 Status: ✅ COMPLETE

Generated: 2025-12-13 Author: Claude Code Phase 3 Finalization Validation: 20-run statistical analysis Decision: D1 PROMOTED, D2 FROZEN

13 KiB Raw Blame History Unescape Escape

Phase 3 Finalization Summary

Executive Summary

Key Results

Timeline: Phase 2 → Phase 3 Journey

Phase 2: Structural Changes

B3: Routing Branch Shape (+2.89%)

B4: Wrapper Hot/Cold Split (+1.47%)

Phase 3: Cache Locality Optimizations

C1: TLS Prefetch (NEUTRAL)

C2: Metadata Cache (NEUTRAL)

C3: Static Routing (+2.20%)

C4: MID_V3 Routing Fix (+13%)

D1: Free Route Cache (+2.19%) ✅ PROMOTED

D2: Wrapper Env Cache (-1.44%) ❌ FROZEN

Statistical Validation Details

Baseline Phase 3 (10-run, Mixed, 20M iters, ws=400)

D1 Validation: 20-run Comparison

Baseline (HAKMEM_FREE_STATIC_ROUTE=0)

Optimized (HAKMEM_FREE_STATIC_ROUTE=1)

Gain Analysis

Cumulative Gains: Phase 2-3

Active Optimizations in MIXED_TINYV3_C7_SAFE

Gain Calculation

Research Boxes: Frozen vs Available

Frozen (Do Not Pursue)

Available for Research (NEUTRAL)

Next Phase: D3 Conditions

D3: Alloc Gate Specialization

Perf Validation Required

Implementation Changes

File: core/bench_profile.h

Documentation Updates

Files Updated (6 total)

Lessons Learned

1. Statistical Rigor Matters

2. Not All Caching Helps

3. TLS Overhead is Real

4. 20-run Validation is Worth It

Build & Test Results

Rebuild Verification

Benchmark Configuration

Success Criteria: Achieved ✅

Future Work

Phase 3 D3: Pending Perf Validation

Phase 4: TBD

Conclusion

13 KiB

Raw Blame History