Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation
Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions
Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration
Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path
Goal
Optimize C0-C3 classes (≈48% of calls) by treating them as "second hot path" rather than "cold path".
The implementation is integrated into the HOTCOLD split (free_tiny_fast_hot()): C0-C3 return early on the hot side,
which avoids a function call into the noinline,cold path (this is what makes it "dual hot").
Background
HOTCOLD-OPT-1 Learnings
Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:
- C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
- C0-C3 (legacy fallback): 48.43% of calls ← NOT rare, second hot
- Mistake: Made C0-C3 noinline → -13% regression
Lesson: Don't call C0-C3 "cold" if it's 48% of workload.
Design
Call Flow Analysis
Current dispatch (free on the Front Gate Unified side):
wrap_free(ptr)
└─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
if (HAKMEM_FREE_TINY_FAST_HOTCOLD=1) free_tiny_fast_hot(ptr)
else free_tiny_fast(ptr) // monolithic
}
DUALHOT flow (already implemented in free_tiny_fast_hot()):
free_tiny_fast_hot(ptr)
├─ header magic + class_idx + base
├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX==0) {
│ tiny_legacy_fallback_free_base(base, class_idx);
│ return 1;
│ }
├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
└─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
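The flow above can be sketched in C as follows. This is a minimal illustration, not the actual hakmem code: the `*_stub` and `*_sketch` names, the call counters, and the two `g_*` flags are hypothetical stand-ins for the real routines and ENV gates, and `class_idx` and `base` are passed in directly rather than decoded from the block header. The `noinline, cold` attribute assumes GCC or Clang.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real hakmem routines; they only count calls. */
static int g_c7_ultra_calls, g_legacy_calls, g_cold_calls;
static int g_larson_fix = 0;        /* models HAKMEM_TINY_LARSON_FIX */
static int g_c7_ultra_enabled = 1;  /* models tiny_c7_ultra_enabled_env() */

static void c7_ultra_free_stub(void *ptr) {
    (void)ptr;
    g_c7_ultra_calls++;
}

static void legacy_free_base_stub(void *base, int class_idx) {
    (void)base;
    (void)class_idx;
    g_legacy_calls++;
}

/* Rare path: kept out of line so the hot function stays small. */
__attribute__((noinline, cold))
static int free_cold_stub(void *ptr, void *base, int class_idx) {
    (void)ptr;
    (void)base;
    (void)class_idx;
    g_cold_calls++;
    return 1;
}

/* Dual-hot dispatch: both C7 and C0-C3 return before any cold call. */
static inline int free_tiny_fast_hot_sketch(void *ptr, void *base, int class_idx) {
    assert(class_idx >= 0 && class_idx <= 7);

    /* First hot path: C7 ULTRA. */
    if (class_idx == 7 && g_c7_ultra_enabled) {
        c7_ultra_free_stub(ptr);
        return 1;
    }
    /* Second hot path: C0-C3, unless the Larson safety gate is set. */
    if (class_idx <= 3 && !g_larson_fix) {
        legacy_free_base_stub(base, class_idx);
        return 1;
    }
    /* Policy snapshot and route_kind switch are elided in this sketch. */
    return free_cold_stub(ptr, base, class_idx);
}
```

With the gate set (`g_larson_fix = 1`), a C0-C3 free skips the early return and takes the remaining path, which matches the safety behaviour described for HAKMEM_TINY_LARSON_FIX.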
Optimization Target
Cost savings for C0-C3 path:
- Eliminate policy snapshot: tiny_front_v3_snapshot_get()
  - Estimated cost: 5-10 cycles per call
  - Frequency: 48.43% of all frees
  - Impact: 2-5% of total overhead
- Eliminate route determination: tiny_route_for_class()
  - Estimated cost: 2-3 cycles
  - Impact: 1-2% of total overhead
- Direct function call (instead of dispatcher logic):
  - Inlining potential
  - Better branch prediction
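As a rough cross-check of the estimates above, the average saving per free is the C0-C3 share multiplied by the cycles skipped on that path. The helper below is a hypothetical illustration (the function name is not from the codebase) and uses only the cycle estimates quoted in this section.

```c
#include <assert.h>

/* Average cycles saved per free when a fraction `share` of all frees
 * skips `cycles_skipped` cycles of work. Purely arithmetic. */
static double avg_cycles_saved_per_free(double share, double cycles_skipped) {
    assert(share >= 0.0 && share <= 1.0);
    assert(cycles_skipped >= 0.0);
    return share * cycles_skipped;
}
```

With a share of 0.4843 and a combined estimate of 7 to 13 cycles (snapshot 5-10 plus route 2-3), this gives roughly 3.4 to 6.3 cycles saved per free, averaged over all classes.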
Safety Guard: HAKMEM_TINY_LARSON_FIX
When HAKMEM_TINY_LARSON_FIX=1:
- The optimization is automatically disabled
- Falls through to original path (with full validation)
- Preserves Larson compatibility mode
Rationale:
- Larson mode may require different C0-C3 handling
- Safety: Don't optimize if special mode is active
Implementation
Target Files
- core/front/malloc_tiny_fast.h (inside free_tiny_fast_hot())
- core/box/hak_wrappers.inc.h (HOTCOLD dispatch)
Code Pattern
(The implementation lives inside free_tiny_fast_hot(); C0-C3 return 1 on the hot path.)
ENV Gate (Safety)
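Add a check for Larson mode (note that this form calls getenv() on every evaluation and treats any set value, including "0", as enabled):
#define HAKMEM_TINY_LARSON_FIX \
(__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))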
Or use existing pattern if available:
extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
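Either pattern can avoid repeated getenv() calls on the hot path by reading the variable once and caching the result. The sketch below shows one way to do that. The helper names and the cached variable are hypothetical, the rule that only the exact value "1" enables the gate is an assumption based on the `HAKMEM_TINY_LARSON_FIX=1` usage in this document, and the lazy initialisation is not synchronised, so a real implementation would need to account for threads.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* -1 = not read yet, 0 = disabled, 1 = enabled (hypothetical cache). */
static int g_larson_fix_cached = -1;

/* Assumption: only the exact string "1" enables the gate. */
static int larson_flag_from_string_sketch(const char *v) {
    return (v != NULL && strcmp(v, "1") == 0) ? 1 : 0;
}

/* Reads the environment on the first call only; later calls hit the cache. */
static inline int tiny_larson_fix_enabled_sketch(void) {
    if (__builtin_expect(g_larson_fix_cached < 0, 0)) {
        g_larson_fix_cached =
            larson_flag_from_string_sketch(getenv("HAKMEM_TINY_LARSON_FIX"));
    }
    assert(g_larson_fix_cached == 0 || g_larson_fix_cached == 1);
    return g_larson_fix_cached;
}
```

The hot-path check would then read `if (class_idx <= 3 && !tiny_larson_fix_enabled_sketch()) { ... }`, which costs one load and one predictable branch after the first call.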
Validation
A/B Benchmark
Configuration:
- Profile: MIXED_TINYV3_C7_SAFE
- Workload: Random mixed (10-1024B)
- Runs: 10 iterations
Command:
```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```
Perf Analysis
Target metrics:
1. Throughput median (±2% tolerance)
2. Branch misses (perf stat -e branch-misses)
   - Expect: Lower branch misses in optimized version
   - Reason: Fewer conditional branches in C0-C3 path
Command:
```bash
perf stat -e branch-misses,cycles,instructions \
-- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```
Success Criteria
| Criterion | Target | Rationale |
|---|---|---|
| Throughput | ±2% | No regression vs baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |
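The throughput criterion can be checked mechanically once the 10 runs per configuration are collected. The sketch below computes the median of each set and tests the optimized median against the baseline. The function names are hypothetical, and reading "±2%" as "no more than 2% below baseline" is an assumption about how the tolerance is meant to be applied.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n throughput samples; sorts the array in place. */
static double median_of_runs(double *samples, size_t n) {
    assert(n > 0);
    qsort(samples, n, sizeof samples[0], cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

/* Relative change of the optimized median vs. baseline (+0.03 = 3% faster). */
static double relative_change(double baseline_median, double opt_median) {
    assert(baseline_median > 0.0);
    return (opt_median - baseline_median) / baseline_median;
}

/* Returns 1 if the optimized median is at most `tolerance` below baseline. */
static int passes_throughput_gate(double baseline_median, double opt_median,
                                  double tolerance) {
    return relative_change(baseline_median, opt_median) >= -tolerance;
}
```

Using the median rather than the mean keeps a single outlier run (for example, one disturbed by background load) from deciding the A/B result.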
Expected Impact
If successful:
- Skip policy snapshot for 48.43% of frees
- Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
- Translates to ~3-5% throughput improvement
Why modest gains:
- C0-C3 is only 48% of calls
- Policy snapshot is 5-10 cycles (not huge absolute time)
- But consistent improvement across all mixed workloads
Files to Modify
- core/front/malloc_tiny_fast.h
- core/box/hak_wrappers.inc.h
Files to Reference
- /mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h (current implementation)
- /mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h (tiny_legacy_fallback_free_base signature)
- /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h (tiny_front_v3_enabled, etc.)
Commit Message
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path
Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().
Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.
ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)
Expected: -2-4pp free self%, +3-5% throughput
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>