# Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path

## Goal

Optimize the C0-C3 classes (≈48% of calls) by treating them as a "second hot path" rather than a "cold path".

The implementation is **integrated into the HOTCOLD split (`free_tiny_fast_hot()`)**: C0-C3 return early on the hot side,
which avoids the function call into the `noinline,cold` path (that is, the path becomes "dual hot").

## Background

### HOTCOLD-OPT-1 Learnings

Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:
- C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
- C0-C3 (legacy fallback): 48.43% of calls ← **NOT rare, second hot**
- Mistake: Made C0-C3 noinline → -13% regression

**Lesson**: Don't call C0-C3 "cold" when it accounts for 48% of the workload.

## Design

### Call Flow Analysis

**Current dispatch** (free on the Front Gate Unified side):
```
wrap_free(ptr)
  └─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
        if (HAKMEM_FREE_TINY_FAST_HOTCOLD=1) free_tiny_fast_hot(ptr)
        else                                 free_tiny_fast(ptr)   // monolithic
      }
```

**DUALHOT flow** (implemented in `free_tiny_fast_hot()`):
```
free_tiny_fast_hot(ptr)
  ├─ header magic + class_idx + base
  ├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
  ├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX==0) {
  │     tiny_legacy_fallback_free_base(base, class_idx);
  │     return 1;
  │   }
  ├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
  └─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
```
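
The branch order in the DUALHOT flow can be sketched as a small routing function. This is a simplified model, not the code in `core/front/malloc_tiny_fast.h`: the `sketch_` names, the route enum, and the gate struct are hypothetical, and the real function calls the handlers and returns `1` instead of returning a tag.

```c
#include <stdbool.h>

/* Hypothetical route tags; the real code calls a handler instead. */
typedef enum {
    ROUTE_C7_ULTRA,     /* stands in for tiny_c7_ultra_free() */
    ROUTE_C0_C3_DIRECT, /* stands in for tiny_legacy_fallback_free_base() */
    ROUTE_POLICY        /* policy snapshot + route_kind switch, then cold path */
} sketch_route_t;

/* Stand-ins for the two runtime gates named in the flow above. */
typedef struct {
    bool c7_ultra_enabled; /* result of tiny_c7_ultra_enabled_env() */
    bool larson_fix;       /* true when HAKMEM_TINY_LARSON_FIX is active */
} sketch_gates_t;

/* Mirrors the DUALHOT branch order: C7 first, then C0-C3, and only
 * then the policy snapshot path. Assumes class_idx is in 0..7. */
static sketch_route_t sketch_free_route(int class_idx, sketch_gates_t g)
{
    if (class_idx == 7 && g.c7_ultra_enabled)
        return ROUTE_C7_ULTRA;
    if (class_idx <= 3 && !g.larson_fix)
        return ROUTE_C0_C3_DIRECT; /* early return: no snapshot, no cold call */
    return ROUTE_POLICY;
}
```

With the Larson gate active, a C0-C3 free falls through to `ROUTE_POLICY`, which matches the "falls through to original path" behavior described for the safety guard.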

### Optimization Target

**Cost savings for C0-C3 path**:
1. **Eliminate policy snapshot**: `tiny_front_v3_snapshot_get()`
   - Estimated cost: 5-10 cycles per call
   - Frequency: 48.43% of all frees
   - Impact: 2-5% of total overhead

2. **Eliminate route determination**: `tiny_route_for_class()`
   - Estimated cost: 2-3 cycles
   - Impact: 1-2% of total overhead

3. **Direct function call** (instead of dispatcher logic):
   - Inlining potential
   - Better branch prediction

### Safety Guard: HAKMEM_TINY_LARSON_FIX

**When HAKMEM_TINY_LARSON_FIX=1:**
- The optimization is automatically disabled
- Falls through to original path (with full validation)
- Preserves Larson compatibility mode

**Rationale**:
- Larson mode may require different C0-C3 handling
- Safety: Don't optimize if special mode is active

## Implementation

### Target Files
- `core/front/malloc_tiny_fast.h` (inside `free_tiny_fast_hot()`)
- `core/box/hak_wrappers.inc.h` (HOTCOLD dispatch)

### Code Pattern

(The implementation lives inside `free_tiny_fast_hot()`; C0-C3 are handled on the hot side and `return 1`.)

### ENV Gate (Safety)

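Add a check for Larson mode. Note that this macro form calls `getenv()` at every use and treats any set value (including `0`) as enabled: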
```c
#define HAKMEM_TINY_LARSON_FIX \
    (__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))
```

Or use existing pattern if available:
```c
extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
```
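
Because this gate sits on the free hot path, one option is to read the variable once and cache the result. The sketch below is hypothetical: the `sketch_` helpers are not part of hakmem, and treating an unset, empty, or `0` value as "disabled" is an assumption based on the `HAKMEM_TINY_LARSON_FIX==0` check in the flow above. The cache is not synchronized; concurrent first calls would all store the same value, but a stricter implementation might initialize it at startup instead.

```c
#include <stdlib.h>

/* Hypothetical parser: unset, empty, or a leading '0' means disabled;
 * any other value enables the gate. */
static int sketch_parse_gate(const char *value)
{
    return (value != NULL && value[0] != '\0' && value[0] != '0') ? 1 : 0;
}

/* Reads HAKMEM_TINY_LARSON_FIX on first use and caches the result, so
 * later calls pay a load and a compare instead of a getenv() call. */
static inline int sketch_larson_fix_enabled(void)
{
    static int cached = -1; /* -1 = not read yet */
    if (__builtin_expect(cached < 0, 0))
        cached = sketch_parse_gate(getenv("HAKMEM_TINY_LARSON_FIX"));
    return cached;
}
```

The hot-path condition would then read `if (class_idx <= 3 && !sketch_larson_fix_enabled()) { ... }`.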

## Validation

### A/B Benchmark

**Configuration:**
- Profile: MIXED_TINYV3_C7_SAFE
- Workload: Random mixed (10-1024B)
- Runs: 10 iterations

**Command:**
```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```

### Perf Analysis

**Target metrics:**
1. **Throughput median** (±2% tolerance)
2. **Branch misses** (`perf stat -e branch-misses`)
   - Expect: Lower branch misses in optimized version
   - Reason: Fewer conditional branches in C0-C3 path

**Command:**
```bash
perf stat -e branch-misses,cycles,instructions \
    -- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
    ./bench_random_mixed_hakmem 100000000 400 1
```

## Success Criteria

| Criterion | Target | Rationale |
|-----------|--------|-----------|
| Throughput | ±2% | No regression vs baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |
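
The throughput criterion compares medians across the 10 runs. A minimal sketch of that check follows; the helper names are hypothetical, and reading "±2%" as "the optimized median is at most 2% below the baseline median" is an assumption about how the tolerance is applied.

```c
#include <stdlib.h>

/* qsort comparator for doubles in ascending order. */
static int sketch_cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns the median (mean of the two
 * middle values when n is even). Requires n > 0. */
static double sketch_median(double *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], sketch_cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

/* Relative change of the optimized median vs. the baseline, in percent. */
static double sketch_delta_pct(double baseline_median, double opt_median)
{
    return 100.0 * (opt_median - baseline_median) / baseline_median;
}

/* Returns 1 when the change is no worse than -tolerance_pct. */
static int sketch_within_tolerance(double delta_pct, double tolerance_pct)
{
    return delta_pct >= -tolerance_pct;
}
```

Feed the per-run throughput figures from the baseline and optimized configurations into `sketch_median()`, then pass the two medians to `sketch_delta_pct()` and check the result against a tolerance of `2.0`.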

## Expected Impact

**If successful:**
- Skip policy snapshot for 48.43% of frees
- Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
- This should translate to a ~3-5% throughput improvement

**Why modest gains:**
- C0-C3 accounts for only 48% of calls
- The policy snapshot costs 5-10 cycles (small in absolute terms)
- The gain should still apply consistently across mixed workloads

## Files to Modify

- `core/front/malloc_tiny_fast.h`
- `core/box/hak_wrappers.inc.h`

## Files to Reference

- `/mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h` (current implementation)
- `/mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h` (tiny_legacy_fallback_free_base signature)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h` (tiny_front_v3_enabled, etc)

## Commit Message

```
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path

Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().

Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.

ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)

Expected: -2-4pp free self%, +3-5% throughput

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
```