# Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path

## Goal

Optimize the C0-C3 classes (≈48% of calls) by treating them as a "second hot path" rather than a "cold path".

The implementation is **integrated into the HOTCOLD split (`free_tiny_fast_hot()`)**: C0-C3 return early on the hot side, which avoids a function call into the `noinline,cold` path (making the free path "dual hot").

## Background

### HOTCOLD-OPT-1 Learnings

Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:
- C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
- C0-C3 (legacy fallback): 48.43% of calls ← **NOT rare, second hot**
- Mistake: Marking C0-C3 `noinline` caused a 13% regression

**Lesson**: Don't call C0-C3 "cold" if it's 48% of the workload.

## Design

### Call Flow Analysis

**Current dispatch** (free on the Front Gate Unified side):
```
wrap_free(ptr)
  └─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
       if (HAKMEM_FREE_TINY_FAST_HOTCOLD == 1) free_tiny_fast_hot(ptr)
       else                                    free_tiny_fast(ptr)  // monolithic
     }
```

**DUALHOT flow** (implemented in `free_tiny_fast_hot()`):
```
free_tiny_fast_hot(ptr)
  ├─ header magic + class_idx + base
  ├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
  ├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX == 0) {
  │      tiny_legacy_fallback_free_base(base, class_idx);
  │      return 1;
  │  }
  ├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
  └─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
```

### Optimization Target

**Cost savings for C0-C3 path**:
1. **Eliminate policy snapshot**: `tiny_front_v3_snapshot_get()`
   - Estimated cost: 5-10 cycles per call
   - Frequency: 48.43% of all frees
   - Impact: 2-5% of total overhead

2. **Eliminate route determination**: `tiny_route_for_class()`
   - Estimated cost: 2-3 cycles
   - Impact: 1-2% of total overhead

3. **Direct function call** (instead of dispatcher logic):
   - Inlining potential
   - Better branch prediction

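As a rough cross-check of the estimates above, the average saving per free can be worked out from the call share and the per-call costs. This is a back-of-the-envelope sketch: it assumes the two costs are simply additive and fully eliminated on the C0-C3 path, and the symbols $f$, $c_{\text{snapshot}}$ and $c_{\text{route}}$ are introduced here for illustration only.

$$
\Delta_{\text{avg}} = f_{\text{C0-C3}} \cdot \left(c_{\text{snapshot}} + c_{\text{route}}\right)
$$

$$
\Delta_{\text{avg}}^{\text{low}} = 0.4843 \cdot (5 + 2) \approx 3.4 \ \text{cycles}, \qquad
\Delta_{\text{avg}}^{\text{high}} = 0.4843 \cdot (10 + 3) \approx 6.3 \ \text{cycles}
$$

Under these assumptions the saving is roughly 3-6 cycles per free, averaged over all classes.
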
### Safety Guard: HAKMEM_TINY_LARSON_FIX

**When HAKMEM_TINY_LARSON_FIX=1:**
- The optimization is automatically disabled
- Falls through to the original path (with full validation)
- Preserves Larson compatibility mode

**Rationale**:
- Larson mode may require different C0-C3 handling
- Safety: Don't optimize if a special mode is active

## Implementation

### Target Files
- `core/front/malloc_tiny_fast.h` (inside `free_tiny_fast_hot()`)
- `core/box/hak_wrappers.inc.h` (HOTCOLD dispatch)

### Code Pattern

(The implementation lives inside `free_tiny_fast_hot()`; C0-C3 `return 1` on the hot side.)

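To make the pattern concrete, here is a minimal sketch of the dual-hot dispatch order described in the DUALHOT flow. It is not the actual hakmem code: every name ending in `_stub` or `_sketch` is a hypothetical stand-in, `class_idx` is passed in rather than decoded from the header, and the policy snapshot and `route_kind` switch are elided.

```c
#include <stddef.h>

/* All names ending in _stub/_sketch are hypothetical stand-ins for illustration. */
enum route { ROUTE_NONE = 0, ROUTE_C7_ULTRA, ROUTE_C0_C3_DIRECT, ROUTE_COLD };

int g_larson_fix_sketch = 0;          /* assumed: cached HAKMEM_TINY_LARSON_FIX flag */
int g_c7_ultra_enabled_sketch = 1;    /* assumed: cached tiny_c7_ultra_enabled_env() */
enum route g_last_route = ROUTE_NONE; /* records which path handled the last free */

static inline void c7_ultra_free_stub(void *ptr) {
    (void)ptr;
    g_last_route = ROUTE_C7_ULTRA;
}

static inline void legacy_fallback_free_base_stub(void *base, int class_idx) {
    (void)base; (void)class_idx;
    g_last_route = ROUTE_C0_C3_DIRECT;
}

/* Only the cold path is forced out of line. */
__attribute__((noinline, cold))
static int free_cold_stub(void *ptr, void *base, int class_idx) {
    (void)ptr; (void)base; (void)class_idx;
    g_last_route = ROUTE_COLD;
    return 1;
}

/* class_idx is passed in here; the real code decodes it (and base) from the header. */
int free_tiny_fast_hot_sketch(void *ptr, int class_idx) {
    void *base = ptr; /* placeholder for the header-derived base pointer */

    /* First hot path: C7 ULTRA. */
    if (class_idx == 7 && g_c7_ultra_enabled_sketch) {
        c7_ultra_free_stub(ptr);
        return 1;
    }
    /* Second hot path: C0-C3 return early, before any policy snapshot. */
    if (class_idx <= 3 && !g_larson_fix_sketch) {
        legacy_fallback_free_base_stub(base, class_idx);
        return 1;
    }
    /* The policy snapshot and route_kind switch are elided in this sketch. */
    return free_cold_stub(ptr, base, class_idx);
}
```

Calling `free_tiny_fast_hot_sketch(p, 2)` with the Larson flag cleared takes the direct C0-C3 branch; setting `g_larson_fix_sketch = 1` sends the same call down the full path instead, which mirrors the safety guard above.
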
### ENV Gate (Safety)

Add a check for Larson mode:
```c
#include <stdlib.h>  /* getenv */
/* Caveat: getenv() runs on every evaluation, and any set value (even "0") counts as enabled. */
#define HAKMEM_TINY_LARSON_FIX \
    (__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))
```

Or use the existing pattern if available:
```c
extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
```

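Because the gate sits on the free hot path, reading the environment on every call would likely cost more than the snapshot it is meant to skip. One possible shape for a cached gate is sketched below. The helper names and the parsing rule (unset, empty, or a leading `0` mean disabled) are assumptions made for illustration, not the project's existing API.

```c
#include <stdlib.h>

/* Hypothetical cache: -1 means "not read yet", otherwise the cached 0/1 flag. */
int g_larson_fix_cached = -1;

/* Assumed semantics: unset, empty, or a leading '0' all mean "disabled". */
int larson_fix_parse_sketch(const char *v) {
    return (v != NULL && v[0] != '\0' && v[0] != '0') ? 1 : 0;
}

/* Reads the environment once, then returns the cached flag on later calls.
 * Not synchronized: a production version would likely use an atomic or a
 * one-time init so that concurrent first calls are well defined. */
int tiny_larson_fix_enabled_sketch(void) {
    if (__builtin_expect(g_larson_fix_cached < 0, 0)) {
        g_larson_fix_cached = larson_fix_parse_sketch(getenv("HAKMEM_TINY_LARSON_FIX"));
    }
    return g_larson_fix_cached;
}
```

With this shape, `HAKMEM_TINY_LARSON_FIX=0` keeps the optimization enabled, which matches the `HAKMEM_TINY_LARSON_FIX==0` condition used in the DUALHOT flow.
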
## Validation

### A/B Benchmark

**Configuration:**
- Profile: MIXED_TINYV3_C7_SAFE
- Workload: Random mixed (10-1024B)
- Runs: 10 iterations

**Command:**
```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```

### Perf Analysis

**Target metrics:**
1. **Throughput median** (±2% tolerance)
2. **Branch misses** (`perf stat -e branch-misses`)
   - Expect: Lower branch misses in optimized version
   - Reason: Fewer conditional branches in C0-C3 path

**Command:**
```bash
perf stat -e branch-misses,cycles,instructions \
  -- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 100000000 400 1
```

## Success Criteria

| Criterion | Target | Rationale |
|-----------|--------|-----------|
| Throughput | ±2% | No regression vs baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |

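The throughput criterion compares medians over the 10 runs. A small helper along the following lines could be used to evaluate it. This is only a sketch: the function names are hypothetical, and reading "±2%" as "the optimized median is at most 2% below the baseline median" is an assumption about the intended pass rule.

```c
#include <stdlib.h>

/* Three-way comparison for qsort() on doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the caller's array in place. Requires n > 0. */
double median_sketch(double *v, size_t n) {
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Relative change of the optimized median vs. the baseline median, in percent. */
double delta_percent_sketch(double base_median, double opt_median) {
    return 100.0 * (opt_median - base_median) / base_median;
}

/* Returns 1 if the optimized median is no more than tol_percent below baseline. */
int within_tolerance_sketch(double base_median, double opt_median, double tol_percent) {
    return delta_percent_sketch(base_median, opt_median) >= -tol_percent;
}
```

Feed the per-run throughput figures from the baseline and optimized commands into `median_sketch()`, then pass both medians to `within_tolerance_sketch()` with a tolerance of `2.0`.
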
## Expected Impact

**If successful:**
- Skip policy snapshot for 48.43% of frees
- Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
- Translate to ~3-5% throughput improvement

**Why modest gains:**
- C0-C3 is only 48% of calls
- Policy snapshot is 5-10 cycles (not huge absolute time)
- But consistent improvement across all mixed workloads

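The link between the self% target and the throughput target can be checked with a simple model. It assumes that self% is the share of each run's own total time spent in free, and that time spent outside free is unchanged. The symbols are introduced here: $s$ and $s'$ are the free self% before and after, $T$ and $T'$ the corresponding total run times.

$$
(1 - s)\,T = (1 - s')\,T' \quad\Longrightarrow\quad \frac{T}{T'} - 1 = \frac{1 - s'}{1 - s} - 1
$$

$$
s = 0.3204:\qquad s' = 0.30 \Rightarrow \frac{0.70}{0.6796} - 1 \approx 3.0\%, \qquad s' = 0.28 \Rightarrow \frac{0.72}{0.6796} - 1 \approx 5.9\%
$$

Under these assumptions, a drop to 28-30% self% corresponds to roughly a 3-6% throughput gain, which is in line with the ~3-5% estimate.
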
## Files to Modify

- `core/front/malloc_tiny_fast.h`
- `core/box/hak_wrappers.inc.h`

## Files to Reference

- `/mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h` (current implementation)
- `/mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h` (tiny_legacy_fallback_free_base signature)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h` (tiny_front_v3_enabled, etc)

## Commit Message

```
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path

Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().

Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.

ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)

Expected: -2-4pp free self%, +3-5% throughput

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
```