hakmem/docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md
Moe Charm (CI) d9991f39ff Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation

Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions

Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration

Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:35:46 +09:00


Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path

Goal

Optimize the C0-C3 classes (≈48% of calls) by treating them as a "second hot path" rather than a "cold path".

The implementation is integrated into the HOTCOLD split (free_tiny_fast_hot()): C0-C3 return early on the hot side, which avoids a function call into the noinline,cold path (i.e. the path becomes "dual hot").

Background

HOTCOLD-OPT-1 Learnings

Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:

  • C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
  • C0-C3 (legacy fallback): 48.43% of calls ← NOT rare, second hot
  • Mistake: Made C0-C3 noinline → -13% regression

Lesson: Don't call C0-C3 "cold" if it's 48% of workload.
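The percentages above come from counting frees per class. A minimal sketch of such a counter follows; the counter array, the helper names, and the assumption of eight classes (C0..C7) are illustrative and not taken from the hakmem sources.

```c
#include <stdint.h>

#define TINY_CLASS_COUNT 8  /* assumption: classes C0..C7 */

/* Hypothetical per-class counters; one slot per size class. */
static uint64_t g_free_class_hits[TINY_CLASS_COUNT];

/* Record one free() for the given class index; out-of-range indices are ignored. */
static inline void free_class_hit(unsigned class_idx) {
    if (class_idx < TINY_CLASS_COUNT) g_free_class_hits[class_idx]++;
}

/* Share (0.0 to 100.0) of all recorded frees that fall in classes [lo, hi]. */
static double free_class_share_pct(unsigned lo, unsigned hi) {
    uint64_t total = 0, in_range = 0;
    for (unsigned c = 0; c < TINY_CLASS_COUNT; c++) {
        total += g_free_class_hits[c];
        if (c >= lo && c <= hi) in_range += g_free_class_hits[c];
    }
    return total ? 100.0 * (double)in_range / (double)total : 0.0;
}
```

Checking free_class_share_pct(0, 3) before marking a path noinline,cold would have flagged C0-C3 as a hot range rather than a rare one.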

Design

Call Flow Analysis

Current dispatch (free on the Front Gate Unified side):

wrap_free(ptr)
  └─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
        if (HAKMEM_FREE_TINY_FAST_HOTCOLD=1) free_tiny_fast_hot(ptr)
        else                                 free_tiny_fast(ptr)   // monolithic
      }

DUALHOT flow (already implemented in free_tiny_fast_hot()):

free_tiny_fast_hot(ptr)
  ├─ header magic + class_idx + base
  ├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
  ├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX==0) {
  │     tiny_legacy_fallback_free_base(base, class_idx);
  │     return 1;
  │   }
  ├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
  └─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
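The ordering in the flow above can be sketched in C as follows. This is only an illustration of the branch order: every helper is a stand-in stub, the globals replace the real environment checks, and class_idx and base are passed in as parameters instead of being decoded from the header as the real function does.

```c
#include <stddef.h>

/* Stand-in stubs so the sketch compiles on its own; the real functions
 * live in the hakmem tree and may have different signatures. */
static int g_last_route = 0;   /* 1 = C7 ULTRA, 2 = C0-C3 legacy, 3 = cold */
static int g_larson_fix = 0;   /* stands in for HAKMEM_TINY_LARSON_FIX */
static int g_c7_ultra   = 1;   /* stands in for tiny_c7_ultra_enabled_env() */

static void stub_c7_ultra_free(void *p)           { (void)p; g_last_route = 1; }
static void stub_legacy_free_base(void *b, int c) { (void)b; (void)c; g_last_route = 2; }

/* Only the genuinely rare remainder is kept out of line. */
static __attribute__((noinline, cold))
int stub_free_cold(void *p, void *b, int c) {
    (void)p; (void)b; (void)c;
    g_last_route = 3;
    return 1;
}

/* Dual-hot ordering: both frequent classes return before any policy work. */
static inline int sketch_free_tiny_fast_hot(void *ptr, void *base, int class_idx) {
    if (class_idx == 7 && g_c7_ultra) {        /* first hot path */
        stub_c7_ultra_free(ptr);
        return 1;
    }
    if (class_idx <= 3 && !g_larson_fix) {     /* second hot path */
        stub_legacy_free_base(base, class_idx);
        return 1;
    }
    /* The policy snapshot and route switch would go here (omitted). */
    return stub_free_cold(ptr, base, class_idx);
}
```

With g_larson_fix set to 1, a C0-C3 pointer skips the second branch and falls through to the full path, which mirrors the safety gate described in this document.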

Optimization Target

Cost savings for C0-C3 path:

  1. Eliminate policy snapshot: tiny_front_v3_snapshot_get()

    • Estimated cost: 5-10 cycles per call
    • Frequency: 48.43% of all frees
    • Impact: 2-5% of total overhead
  2. Eliminate route determination: tiny_route_for_class()

    • Estimated cost: 2-3 cycles
    • Impact: 1-2% of total overhead
  3. Direct function call (instead of dispatcher logic):

    • Inlining potential
    • Better branch prediction
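As a rough back-of-envelope check using only the estimates listed above, and assuming the two costs simply add and that the saving applies only to C0-C3 frees, the saving averaged over all frees would be:

```latex
\begin{align*}
\Delta_{\text{avg}}  &= f_{\text{C0-C3}} \times (c_{\text{snapshot}} + c_{\text{route}}) \\
\Delta_{\text{low}}  &= 0.4843 \times (5 + 2)  \approx 3.4 \text{ cycles per free} \\
\Delta_{\text{high}} &= 0.4843 \times (10 + 3) \approx 6.3 \text{ cycles per free}
\end{align*}
```

This ignores any gain from inlining or branch prediction, so it is an estimate of the direct savings only.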

Safety Guard: HAKMEM_TINY_LARSON_FIX

When HAKMEM_TINY_LARSON_FIX=1:

  • The optimization is automatically disabled
  • Falls through to original path (with full validation)
  • Preserves Larson compatibility mode

Rationale:

  • Larson mode may require different C0-C3 handling
  • Safety: Don't optimize if special mode is active

Implementation

Target Files

  • core/front/malloc_tiny_fast.h (inside free_tiny_fast_hot())
  • core/box/hak_wrappers.inc.h (HOTCOLD dispatch)

Code Pattern

(The implementation lives inside free_tiny_fast_hot(); C0-C3 return 1 from the hot path.)

ENV Gate (Safety)

Add a check for Larson mode:

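/* Note: as written this calls getenv() on every use, and any set value
 * (including "0") counts as enabled; cache the result in real code. */
#define HAKMEM_TINY_LARSON_FIX \
    (__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))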

Or use the existing pattern, if available:

extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
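One way to keep the gate off the hot path is to parse the variable once and cache the result. The sketch below is an assumption about how that could look: the helper names and the rule that an unset, empty, or "0"-prefixed value means "disabled" are illustrative, not the project's confirmed behavior.

```c
#include <stdlib.h>

/* Hypothetical parser: unset, empty, or a value starting with '0' is "off". */
static int env_flag_is_on(const char *s) {
    return (s && *s && *s != '0') ? 1 : 0;
}

/* Cached gate value; -1 means the environment has not been read yet. */
static int g_larson_fix_cached = -1;

/* Reads HAKMEM_TINY_LARSON_FIX on first use only; later calls are a load
 * plus a well-predicted branch. Not synchronized: concurrent first calls
 * would store the same value, but a real allocator may prefer an atomic
 * or an explicit init step. */
static inline int tiny_larson_fix_enabled(void) {
    if (__builtin_expect(g_larson_fix_cached < 0, 0)) {
        g_larson_fix_cached = env_flag_is_on(getenv("HAKMEM_TINY_LARSON_FIX"));
    }
    return g_larson_fix_cached;
}
```

The C0-C3 check then reads `if (class_idx <= 3 && !tiny_larson_fix_enabled())`, which matches the g_tiny_larson_mode pattern without a getenv() call per free.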

Validation

A/B Benchmark

Configuration:

  • Profile: MIXED_TINYV3_C7_SAFE
  • Workload: Random mixed (10-1024B)
  • Runs: 10 iterations

Command:

```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```

### Perf Analysis

**Target metrics:**
1. **Throughput median** (±2% tolerance)
2. **Branch misses** (`perf stat -e branch-misses`)
   - Expect: Lower branch misses in optimized version
   - Reason: Fewer conditional branches in C0-C3 path

**Command:**
```bash
# Run once with HAKMEM_FREE_TINY_FAST_HOTCOLD=0 and once with =1, then compare.
perf stat -e branch-misses,cycles,instructions \
    -- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
    ./bench_random_mixed_hakmem 100000000 400 1
```

Success Criteria

| Criterion | Target | Rationale |
|---|---|---|
| Throughput | ±2% | No regression vs baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |
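The throughput criterion can be checked mechanically from the per-run numbers. The following is a minimal sketch that reads ±2% as a noise band, so the optimized median passes if it is no more than 2% below the baseline median. The function names and the 64-sample cap are assumptions for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (1 <= n <= 64 assumed); the input is not modified. */
static double median_of(const double *v, size_t n) {
    double tmp[64];
    if (n == 0 || n > 64) return 0.0;
    memcpy(tmp, v, n * sizeof *v);
    qsort(tmp, n, sizeof *tmp, cmp_double);
    return (n & 1) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Returns 1 if the optimized median is at most tol (e.g. 0.02) below baseline. */
static int throughput_ok(const double *base, const double *opt, size_t n, double tol) {
    double mb = median_of(base, n), mo = median_of(opt, n);
    return mb > 0.0 && (mo - mb) / mb >= -tol;
}
```

Feeding it the ten throughput values from each configuration of the A/B runs gives a pass/fail answer for the first row of the table; the other rows still need perf output.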

Expected Impact

If successful:

  • Skip policy snapshot for 48.43% of frees
  • Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
  • Translate to ~3-5% throughput improvement

Why modest gains:

  • C0-C3 is only 48% of calls
  • Policy snapshot is 5-10 cycles (not huge absolute time)
  • But consistent improvement across all mixed workloads

Files to Modify

  • core/front/malloc_tiny_fast.h
  • core/box/hak_wrappers.inc.h

Files to Reference

  • /mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h (current implementation)
  • /mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h (tiny_legacy_fallback_free_base signature)
  • /mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h (tiny_front_v3_enabled, etc)

Commit Message

Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path

Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().

Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.

ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)

Expected: -2-4pp free self%, +3-5% throughput

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>