# Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path

## Goal

Optimize the C0-C3 classes (≈48% of calls) by treating them as a "second hot path" rather than a "cold path".

The implementation is **integrated into the HOTCOLD split (`free_tiny_fast_hot()`)**: C0-C3 returns early on the hot side, which avoids the function call into the `noinline,cold` path (making the path "dual hot").

## Background

### HOTCOLD-OPT-1 Learnings

Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:

- C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
- C0-C3 (legacy fallback): 48.43% of calls ← **NOT rare, second hot**
- Mistake: Marking C0-C3 `noinline` caused a 13% regression

**Lesson**: Don't call C0-C3 "cold" when it is 48% of the workload.

## Design

### Call Flow Analysis

**Current dispatch** (free on the Front Gate Unified side):

```
wrap_free(ptr)
  └─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
       if (HAKMEM_FREE_TINY_FAST_HOTCOLD == 1) free_tiny_fast_hot(ptr)
       else free_tiny_fast(ptr)   // monolithic
     }
```

**DUALHOT flow** (already implemented in `free_tiny_fast_hot()`):

```
free_tiny_fast_hot(ptr)
  ├─ header magic + class_idx + base
  ├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
  ├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX==0) {
  │    tiny_legacy_fallback_free_base(base, class_idx);
  │    return 1;
  │  }
  ├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
  └─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
```

### Optimization Target

**Cost savings for the C0-C3 path**:

1. **Eliminate policy snapshot**: `tiny_front_v3_snapshot_get()`
   - Estimated cost: 5-10 cycles per call
   - Frequency: 48.43% of all frees
   - Impact: 2-5% of total overhead
2. **Eliminate route determination**: `tiny_route_for_class()`
   - Estimated cost: 2-3 cycles
   - Impact: 1-2% of total overhead
3. **Direct function call** (instead of dispatcher logic):
   - Inlining potential
   - Better branch prediction

### Safety Guard: HAKMEM_TINY_LARSON_FIX

**When HAKMEM_TINY_LARSON_FIX=1:**

- The optimization is automatically disabled
- Frees fall through to the original path (with full validation)
- Larson compatibility mode is preserved

**Rationale**:

- Larson mode may require different C0-C3 handling
- Safety: Don't optimize if a special mode is active

## Implementation

### Target Files

- `core/front/malloc_tiny_fast.h` (inside `free_tiny_fast_hot()`)
- `core/box/hak_wrappers.inc.h` (HOTCOLD dispatch)

### Code Pattern

(The implementation lives inside `free_tiny_fast_hot()`; C0-C3 does `return 1` on the hot side.)

### ENV Gate (Safety)

Add a check for Larson mode:

```c
#include <stdlib.h>  /* getenv */
/* Note: calls getenv() on every use and treats any set value (even "0") as enabled; cache the result in real code. */
#define HAKMEM_TINY_LARSON_FIX \
    (__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))
```

Or use the existing pattern if available:

```c
extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
```

## Validation

### A/B Benchmark

**Configuration:**

- Profile: MIXED_TINYV3_C7_SAFE
- Workload: Random mixed (10-1024B)
- Runs: 10 iterations

**Command:**

```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```

### Perf Analysis

**Target metrics:**

1. **Throughput median** (±2% tolerance)
2. **Branch misses** (`perf stat -e branch-misses`)
   - Expect: Lower branch misses in the optimized version
   - Reason: Fewer conditional branches in the C0-C3 path

**Command:**

```bash
# Run once per configuration (HAKMEM_FREE_TINY_FAST_HOTCOLD=0 and =1) and compare the counters
perf stat -e branch-misses,cycles,instructions \
  -- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 100000000 400 1
```

## Success Criteria

| Criterion | Target | Rationale |
|-----------|--------|-----------|
| Throughput | ±2% | No regression vs. baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |

## Expected Impact

**If successful:**

- Skip the policy snapshot for 48.43% of frees
- Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
- Translate to ~3-5% throughput improvement

**Why modest gains:**

- C0-C3 is only 48% of calls
- The policy snapshot costs 5-10 cycles (not a large absolute time)
- The improvement should still be consistent across all mixed workloads

## Files to Modify

- `core/front/malloc_tiny_fast.h`
- `core/box/hak_wrappers.inc.h`

## Files to Reference

- `/mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h` (current implementation)
- `/mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h` (`tiny_legacy_fallback_free_base` signature)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h` (`tiny_front_v3_enabled`, etc.)

## Commit Message

```
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path

Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().

Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.

ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)
Expected: -2-4pp free self%, +3-5% throughput

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude
```
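
## Appendix: Dispatch Order Sketch

The DUALHOT ordering and the Larson safety gate described in this note can be illustrated with a small standalone sketch. This is not the code in `free_tiny_fast_hot()`: the `sketch_*` names, the route enum, and the rule that an unset, empty, or `"0"` value means "off" are all assumptions made for illustration. It shows two ideas only: the C7 check runs first, C0-C3 is tested second and skips the policy path unless the gate is on, and the environment variable is read once and cached instead of calling `getenv()` on every free.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical route labels for this sketch; not the allocator's real enum. */
typedef enum {
    SKETCH_ROUTE_C7_ULTRA,   /* class 7, first hot path */
    SKETCH_ROUTE_C0_C3_HOT,  /* classes 0-3, second hot path */
    SKETCH_ROUTE_FULL        /* policy snapshot + route switch */
} sketch_route_t;

/* Parse the gate value: unset, empty, or "0" means off; anything else is on.
 * Assumption: the real gate may interpret the variable differently. */
static int sketch_parse_gate(const char *value) {
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Read HAKMEM_TINY_LARSON_FIX once and cache the answer, so only the
 * first call pays for getenv(). Not thread-safe as written. */
static int sketch_larson_fix_enabled(void) {
    static int cached = -1;  /* -1 = not read yet */
    if (__builtin_expect(cached < 0, 0)) {
        cached = sketch_parse_gate(getenv("HAKMEM_TINY_LARSON_FIX"));
    }
    return cached;
}

/* Decide which path a free of the given class would take.
 * class_idx is assumed to be in the range 0..7. */
static sketch_route_t sketch_free_route(int class_idx, int c7_ultra_on,
                                        int larson_fix) {
    if (class_idx == 7 && c7_ultra_on) return SKETCH_ROUTE_C7_ULTRA;
    if (class_idx <= 3 && !larson_fix) return SKETCH_ROUTE_C0_C3_HOT;
    return SKETCH_ROUTE_FULL;
}
```

A caller would pass `sketch_larson_fix_enabled()` as the `larson_fix` argument. Keeping the gate as a parameter makes the ordering easy to check without touching the process environment.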