hakmem/docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md

Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update Add comprehensive design docs and research boxes: - docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation - docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs - docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research - docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design - docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings - docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results - docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation Research boxes (SS page table): - core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate - core/box/ss_pt_types_box.h: 2-level page table structures - core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation - core/box/ss_pt_register_box.h: Page table registration - core/box/ss_pt_impl.c: Global definitions Updates: - docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars - core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration - core/box/pool_mid_inuse_deferred_box.h: Deferred API updates - core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection - core/hakmem_super_registry: SS page table integration Current Status: - FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption - ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box - Next: Optimization roadmap per ROI (mimalloc gap 2.5x) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:35:46 +09:00
# Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path
## Goal
Optimize the C0-C3 classes (≈48% of calls) by treating them as a "second hot path" rather than a "cold path".
The implementation is **integrated into the HOTCOLD split (`free_tiny_fast_hot()`)**: C0-C3 return early on the hot side,
which avoids a function call into `noinline,cold` code (that is, the path becomes "dual hot").
## Background
### HOTCOLD-OPT-1 Learnings
Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:
- C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
- C0-C3 (legacy fallback): 48.43% of calls ← **NOT rare, second hot**
- Mistake: Made C0-C3 noinline → -13% regression
**Lesson**: Don't call C0-C3 "cold" if it's 48% of workload.
## Design
### Call Flow Analysis
**Current dispatch** (free on the Front Gate Unified side):
```
wrap_free(ptr)
  └─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
       if (HAKMEM_FREE_TINY_FAST_HOTCOLD=1) free_tiny_fast_hot(ptr)
       else                                 free_tiny_fast(ptr)  // monolithic
     }
```
**DUALHOT flow** (implemented in `free_tiny_fast_hot()`):
```
free_tiny_fast_hot(ptr)
├─ header magic + class_idx + base
├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX==0) {
│ tiny_legacy_fallback_free_base(base, class_idx);
│ return 1;
│ }
├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
└─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
```
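
The ordering of the checks in the DUALHOT flow can be sketched in C as follows. This is a minimal model, not the real `free_tiny_fast_hot()`: the counters, the two global flags, and the function name `free_tiny_fast_hot_sketch` are hypothetical stand-ins, and the sketch assumes eight tiny classes (C0-C7). The point it illustrates is that both frequent cases return before the policy snapshot is touched.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real HAKMEM routines; they only count calls. */
enum { ROUTE_C7_ULTRA, ROUTE_LEGACY_DIRECT, ROUTE_POLICY, ROUTE_COUNT };
static unsigned g_route_hits[ROUTE_COUNT];

static int g_c7_ultra_enabled = 1;  /* models tiny_c7_ultra_enabled_env() */
static int g_larson_fix = 0;        /* models HAKMEM_TINY_LARSON_FIX */

/* Sketch of the dual-hot ordering: both frequent cases return
 * before the policy snapshot would be read. */
static inline int free_tiny_fast_hot_sketch(int class_idx)
{
    assert(class_idx >= 0 && class_idx <= 7);  /* assumed class range C0-C7 */

    /* Hot path 1: C7 ULTRA (about half of all frees). */
    if (class_idx == 7 && g_c7_ultra_enabled) {
        g_route_hits[ROUTE_C7_ULTRA]++;
        return 1;
    }
    /* Hot path 2: C0-C3 go straight to the legacy fallback,
     * unless the Larson safety gate is set. */
    if (class_idx <= 3 && !g_larson_fix) {
        g_route_hits[ROUTE_LEGACY_DIRECT]++;
        return 1;
    }
    /* Everything else: policy snapshot + route switch, then cold path. */
    g_route_hits[ROUTE_POLICY]++;
    return 1;
}
```

With `g_larson_fix` set to 1, a C0-C3 free falls through to the third branch, which models the "full path" behaviour of the safety gate.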
### Optimization Target
**Cost savings for C0-C3 path**:
1. **Eliminate policy snapshot**: `tiny_front_v3_snapshot_get()`
   - Estimated cost: 5-10 cycles per call
   - Frequency: 48.43% of all frees
   - Impact: 2-5% of total overhead
2. **Eliminate route determination**: `tiny_route_for_class()`
   - Estimated cost: 2-3 cycles
   - Impact: 1-2% of total overhead
3. **Direct function call** (instead of dispatcher logic):
   - Inlining potential
   - Better branch prediction
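
As a rough cross-check of the estimates above, the per-call costs can be amortized over all frees. The helper below is a hypothetical illustration; it assumes the skipped work costs the same on every C0-C3 call and that no other path changes.

```c
#include <assert.h>

/* Average cycles saved per free when a fraction `share` of calls
 * skips work that costs `cycles_skipped` cycles each. */
static double amortized_cycles_saved(double share, double cycles_skipped)
{
    assert(share >= 0.0 && share <= 1.0);
    assert(cycles_skipped >= 0.0);
    return share * cycles_skipped;
}
```

With `share = 0.4843`, skipping a 5-10 cycle snapshot saves about 2.4-4.8 cycles per free on average, and skipping a 2-3 cycle route lookup saves about 1.0-1.5 cycles.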
### Safety Guard: HAKMEM_TINY_LARSON_FIX
**When HAKMEM_TINY_LARSON_FIX=1:**
- The optimization is automatically disabled
- Falls through to original path (with full validation)
- Preserves Larson compatibility mode
**Rationale**:
- Larson mode may require different C0-C3 handling
- Safety: Don't optimize if special mode is active
## Implementation
### Target Files
- `core/front/malloc_tiny_fast.h` (inside `free_tiny_fast_hot()`)
- `core/box/hak_wrappers.inc.h` (HOTCOLD dispatch)
### Code Pattern
(The implementation lives in `free_tiny_fast_hot()`; C0-C3 `return 1` on the hot path.)
### ENV Gate (Safety)
Add a check for Larson mode:
```c
#define HAKMEM_TINY_LARSON_FIX \
(__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))
```
Or use existing pattern if available:
```c
extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
```
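
Because the gate sits on the hot free path, calling `getenv()` on every free would likely cost more than the snapshot being skipped. One possible shape for a cached gate is sketched below. The names `g_larson_fix_cached`, `parse_env_flag`, and `tiny_larson_fix_enabled` are hypothetical, and the parsing rule (unset, empty, or `0` means off) is an assumption chosen to match the `=0` / `=1` usage in this document. The sketch relies on GCC/Clang builtins.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cache slot: -1 = not read yet, 0 = off, 1 = on. */
static int g_larson_fix_cached = -1;

/* Assumed parsing rule: unset, empty, or a value starting with '0' means off. */
static int parse_env_flag(const char *s)
{
    return (s != NULL && s[0] != '\0' && s[0] != '0') ? 1 : 0;
}

/* Reads HAKMEM_TINY_LARSON_FIX on first use, then returns the cached value. */
static inline int tiny_larson_fix_enabled(void)
{
    int v = __atomic_load_n(&g_larson_fix_cached, __ATOMIC_RELAXED);
    if (__builtin_expect(v < 0, 0)) {
        v = parse_env_flag(getenv("HAKMEM_TINY_LARSON_FIX"));
        /* Racing threads compute the same value, so a relaxed store suffices. */
        __atomic_store_n(&g_larson_fix_cached, v, __ATOMIC_RELAXED);
    }
    assert(v == 0 || v == 1);
    return v;
}
```

After the first call the check reduces to one load and one predictable branch, so the C0-C3 condition could read `class_idx <= 3 && !tiny_larson_fix_enabled()`.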
## Validation
### A/B Benchmark
**Configuration:**
- Profile: MIXED_TINYV3_C7_SAFE
- Workload: Random mixed (10-1024B)
- Runs: 10 iterations
**Command:**
```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```
### Perf Analysis
**Target metrics:**
1. **Throughput median** (±2% tolerance)
2. **Branch misses** (`perf stat -e branch-misses`)
- Expect: Lower branch misses in optimized version
- Reason: Fewer conditional branches in C0-C3 path
**Command:**
```bash
perf stat -e branch-misses,cycles,instructions \
-- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```
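
The throughput criterion (median of the runs, ±2% tolerance) can be evaluated with a small helper like the one below. This is a sketch of one reasonable way to apply the tolerance; the function names are hypothetical, and treating only a drop of more than `tol` below baseline as a regression is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static double median(double *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Relative change of the optimized median versus the baseline median. */
static double relative_delta(double baseline, double opt)
{
    assert(baseline > 0.0);
    return (opt - baseline) / baseline;
}

/* Returns 1 if opt is at most `tol` (e.g. 0.02) below baseline. */
static int no_regression(double baseline, double opt, double tol)
{
    return relative_delta(baseline, opt) >= -tol;
}
```

Feed each configuration's 10 throughput samples to `median()`, then compare the two medians with `no_regression(baseline, opt, 0.02)`.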
## Success Criteria
| Criterion | Target | Rationale |
|-----------|--------|-----------|
| Throughput | ±2% | No regression vs baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |
## Expected Impact
**If successful:**
- Skip policy snapshot for 48.43% of frees
- Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
- Translates to ~3-5% throughput improvement
**Why modest gains:**
- C0-C3 is only 48% of calls
- Policy snapshot is 5-10 cycles (not huge absolute time)
- But consistent improvement across all mixed workloads
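
The link between free self% and throughput can be made explicit. The sketch below assumes that self% is free's share of total runtime and that time spent outside free is unchanged by the optimization; under those assumptions the new total runtime is `(1 - before) / (1 - after)` of the old one. The function name is hypothetical.

```c
#include <assert.h>

/* Speedup implied when free()'s share of runtime drops from `before`
 * to `after` (both fractions of total time) and all other work is unchanged. */
static double speedup_from_self_share(double before, double after)
{
    assert(before > 0.0 && before < 1.0);
    assert(after > 0.0 && after <= before);
    /* Non-free time (1 - before) is constant, so
     * new_total = (1 - before) / (1 - after) in units of the old total. */
    return (1.0 - after) / (1.0 - before);
}
```

Going from 32.04% to 30% gives a speedup of about 1.030 (+3.0%), and going to 28% gives about 1.059 (+5.9%), which brackets the ~3-5% estimate above.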
## Files to Modify
- `core/front/malloc_tiny_fast.h`
- `core/box/hak_wrappers.inc.h`
## Files to Reference
- `/mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h` (current implementation)
- `/mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h` (tiny_legacy_fallback_free_base signature)
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h` (tiny_front_v3_enabled, etc)
## Commit Message
```
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path
Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().
Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.
ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)
Expected: -2-4pp free self%, +3-5% throughput
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
```