Moe Charm (CI) d9991f39ff Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update

Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation

Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions

Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration

Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-12-13 05:35:46 +09:00

Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 Direct Path

Goal

Optimize the C0-C3 classes (≈48% of calls) by treating them as a "second hot path" rather than a "cold path".

The implementation is merged into the HOTCOLD split (free_tiny_fast_hot()): C0-C3 return early on the hot side, which avoids a function call into the noinline,cold path (i.e., the path becomes "dual hot").

Background

HOTCOLD-OPT-1 Learnings

Phase FREE-TINY-FAST-HOTCOLD-OPT-1 revealed:

C7 (ULTRA): 50.11% of calls ← Correctly optimized as "hot"
C0-C3 (legacy fallback): 48.43% of calls ← NOT rare, second hot
Mistake: Made C0-C3 noinline → -13% regression

Lesson: Don't call C0-C3 "cold" if it's 48% of workload.
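
The lesson above depends on measuring the per-class distribution before labeling a path cold. The following is a minimal sketch of such a counter, not the instrumentation used in HOTCOLD-OPT-1; all `sketch_` names are hypothetical, and the class count of 8 (C0..C7) is an assumption based on the class indices mentioned in this document.

```c
#include <assert.h>

enum { SKETCH_CLASSES = 8 };  /* assumption: tiny classes C0..C7 */

/* One counter per size class; hypothetical, not a hakmem structure. */
static unsigned long g_sketch_free_count[SKETCH_CLASSES];

/* Record one free for the given class index. */
static void sketch_record_free(unsigned class_idx)
{
    assert(class_idx < SKETCH_CLASSES);
    g_sketch_free_count[class_idx]++;
}

/* Share (0.0-1.0) of recorded frees whose class lies in [lo, hi]. */
static double sketch_share(unsigned lo, unsigned hi)
{
    unsigned long total = 0, in_range = 0;
    for (unsigned c = 0; c < SKETCH_CLASSES; c++) {
        total += g_sketch_free_count[c];
        if (c >= lo && c <= hi) {
            in_range += g_sketch_free_count[c];
        }
    }
    return total ? (double)in_range / (double)total : 0.0;
}
```

Calling sketch_share(0, 3) after a benchmark run would give the C0-C3 share; a value near 0.48 argues against marking that path noinline,cold.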

Design

Call Flow Analysis

Current dispatch (free on the Front Gate Unified side):

wrap_free(ptr)
  └─ if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
        if (HAKMEM_FREE_TINY_FAST_HOTCOLD=1) free_tiny_fast_hot(ptr)
        else                                 free_tiny_fast(ptr)   // monolithic
      }

DUALHOT flow (already implemented in free_tiny_fast_hot()):

free_tiny_fast_hot(ptr)
  ├─ header magic + class_idx + base
  ├─ if (class_idx == 7 && tiny_c7_ultra_enabled_env()) { tiny_c7_ultra_free(ptr); return 1; }
  ├─ if (class_idx <= 3 && HAKMEM_TINY_LARSON_FIX==0) {
  │     tiny_legacy_fallback_free_base(base, class_idx);
  │     return 1;
  │   }
  ├─ policy snapshot + route_kind switch (ULTRA/MID/V7)
  └─ cold_path: free_tiny_fast_cold(ptr, base, class_idx)
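
The branch order in the flow above can be sketched in C as follows. This is an illustration under stated assumptions, not the code in malloc_tiny_fast.h: the real free routines are replaced by hypothetical counters, the two `g_sketch_` flags stand in for tiny_c7_ultra_enabled_env() and HAKMEM_TINY_LARSON_FIX, and the header decode step is omitted.

```c
#include <assert.h>

enum { SKETCH_NUM_CLASSES = 8 };  /* assumption: tiny classes C0..C7 */

/* Hypothetical counters standing in for the real free routines. */
typedef struct {
    unsigned c7_ultra;  /* frees taken by the C7 ULTRA branch          */
    unsigned legacy;    /* frees taken by the C0-C3 direct branch      */
    unsigned routed;    /* frees that reach policy snapshot + routing  */
} sketch_counters;

static sketch_counters g_sketch;
static int g_sketch_c7_ultra_enabled = 1;  /* stand-in for tiny_c7_ultra_enabled_env() */
static int g_sketch_larson_fix = 0;        /* stand-in for HAKMEM_TINY_LARSON_FIX      */

/* Returns 1 when the free was handled, mirroring "return 1" in the flow. */
static int sketch_free_tiny_fast_hot(unsigned class_idx)
{
    assert(class_idx < SKETCH_NUM_CLASSES);

    /* Hot path 1: C7 goes straight to the ULTRA free. */
    if (class_idx == 7 && g_sketch_c7_ultra_enabled) {
        g_sketch.c7_ultra++;
        return 1;
    }
    /* Hot path 2: C0-C3 skip the policy snapshot unless Larson mode is on. */
    if (class_idx <= 3 && !g_sketch_larson_fix) {
        g_sketch.legacy++;
        return 1;
    }
    /* Remaining classes: policy snapshot + route switch, then the cold path. */
    g_sketch.routed++;
    return 1;
}
```

With the Larson flag set, a C0-C3 free falls through to the third branch, which is the "safety disable" behavior described later in this document.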

Optimization Target

Cost savings for C0-C3 path:

1. Eliminate policy snapshot: tiny_front_v3_snapshot_get()
   - Estimated cost: 5-10 cycles per call
   - Frequency: 48.43% of all frees
   - Impact: 2-5% of total overhead
2. Eliminate route determination: tiny_route_for_class()
   - Estimated cost: 2-3 cycles
   - Impact: 1-2% of total overhead
3. Direct function call (instead of dispatcher logic):
   - Inlining potential
   - Better branch prediction
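
As a rough worked example of the estimates above, the per-call costs can be weighted by the C0-C3 share to get an average saving per free. The sketch below assumes the route determination cost applies to the same 48.43% of frees as the policy snapshot; the `sketch_` names are hypothetical.

```c
/* A closed range of cycle estimates. */
typedef struct {
    double lo;
    double hi;
} sketch_range;

/* Average cycles saved per free when a step with the given cost range
 * is skipped on a fraction `share` (0.0-1.0) of all frees. */
static sketch_range sketch_saving(double share, sketch_range cost)
{
    sketch_range r = { share * cost.lo, share * cost.hi };
    return r;
}

/* Sum of two independent savings. */
static sketch_range sketch_add(sketch_range a, sketch_range b)
{
    sketch_range r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}
```

With share = 0.4843, the snapshot (5-10 cycles) contributes about 2.4 to 4.8 cycles and the route lookup (2-3 cycles) about 1.0 to 1.5 cycles, so the combined average is roughly 3.4 to 6.3 cycles per free across all classes.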

Safety Guard: HAKMEM_TINY_LARSON_FIX

When HAKMEM_TINY_LARSON_FIX=1:

The optimization is automatically disabled
Falls through to original path (with full validation)
Preserves Larson compatibility mode

Rationale:

Larson mode may require different C0-C3 handling
Safety: Don't optimize if special mode is active

Implementation

Target Files

core/front/malloc_tiny_fast.h (inside free_tiny_fast_hot())
core/box/hak_wrappers.inc.h (HOTCOLD dispatch)

Code Pattern

(The implementation lives inside free_tiny_fast_hot(); C0-C3 return 1 on the hot side.)

ENV Gate (Safety)

Add a check for Larson mode:

#define HAKMEM_TINY_LARSON_FIX \
    (__builtin_expect((getenv("HAKMEM_TINY_LARSON_FIX") ? 1 : 0), 0))

Or use existing pattern if available:

extern int g_tiny_larson_mode;
if (class_idx <= 3 && !g_tiny_larson_mode) { ... }
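
Note that the macro form above calls getenv() on every evaluation and only tests whether the variable is present, so HAKMEM_TINY_LARSON_FIX=0 would also count as enabled. One possible alternative is a cached gate that parses the value once. The sketch below is an assumption about how that could look, not the existing hakmem pattern; the `sketch_` names are hypothetical, and the lazy initialization is not thread-safe as written (a real allocator would more likely read the variable once during startup).

```c
#include <stdlib.h>

/* Hypothetical parser: unset, empty, or a value starting with '0' mean off. */
static int sketch_env_flag(const char *value)
{
    return value != NULL && value[0] != '\0' && value[0] != '0';
}

/* Cached gate: -1 = not read yet, 0 = off, 1 = on.
 * getenv() runs only on the first call. */
static int sketch_larson_fix_enabled(void)
{
    static int cached = -1;
    if (__builtin_expect(cached < 0, 0)) {
        cached = sketch_env_flag(getenv("HAKMEM_TINY_LARSON_FIX"));
    }
    return cached;
}
```

The hot-path check then becomes `if (class_idx <= 3 && !sketch_larson_fix_enabled())`, which costs a single load and compare after the first call.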

Validation

A/B Benchmark

Configuration:

Profile: MIXED_TINYV3_C7_SAFE
Workload: Random mixed (10-1024B)
Runs: 10 iterations

Command:

```bash
# Baseline (monolithic)
HAKMEM_FREE_TINY_FAST_HOTCOLD=0 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Opt (HOTCOLD + DUALHOT in hot)
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1

# Safety disable (forces full path; useful A/B sanity)
HAKMEM_TINY_LARSON_FIX=1 \
HAKMEM_FREE_TINY_FAST_HOTCOLD=1 \
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
./bench_random_mixed_hakmem 100000000 400 1
```

Perf Analysis

Target metrics:

1. Throughput median (±2% tolerance)
2. Branch misses (perf stat -e branch-misses)
   - Expect: Lower branch misses in optimized version
   - Reason: Fewer conditional branches in C0-C3 path

Command:

```bash
perf stat -e branch-misses,cycles,instructions \
    -- env HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
    ./bench_random_mixed_hakmem 100000000 400 1
```

Success Criteria

| Criterion | Target | Rationale |
| --- | --- | --- |
| Throughput | ±2% | No regression vs baseline |
| Branch misses | Decreased | Direct path has fewer branches |
| free self% | Reduced | Fewer policy snapshots |
| Safety | No crashes | Larson mode doesn't break |
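
The throughput criterion can be checked mechanically once the per-run numbers are collected. The following is a minimal sketch, assuming that "±2%" means the optimized median may be at most 2% below the baseline median; the `sketch_` helpers are hypothetical and not part of the benchmark tooling.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int sketch_cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place. */
static double sketch_median(double *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof v[0], sketch_cmp_double);
    return (n % 2) ? v[n / 2] : 0.5 * (v[n / 2 - 1] + v[n / 2]);
}

/* Relative change of opt vs base, in percent. */
static double sketch_delta_pct(double base, double opt)
{
    return (opt - base) / base * 100.0;
}

/* 1 if opt is no more than tol_pct percent below base, else 0. */
static int sketch_no_regression(double base, double opt, double tol_pct)
{
    return sketch_delta_pct(base, opt) >= -tol_pct;
}
```

Feeding the 10 baseline and 10 optimized throughput values through sketch_median() and then sketch_no_regression(base, opt, 2.0) gives a pass/fail answer for the first row of the table.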

Expected Impact

If successful:

Skip policy snapshot for 48.43% of frees
Reduce free self% from 32.04% to ~28-30% (2-4 percentage points)
Translates to ~3-5% throughput improvement
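
The link between the self% figures and the throughput figure can be made explicit with a simple model. The sketch below assumes that only the time spent in free changes and all other time stays constant, which is an idealization; `sketch_throughput_gain` is a hypothetical helper. If free takes a fraction p of total time before and p' after, total time scales by (1 - p) / (1 - p'), so throughput scales by the inverse.

```c
#include <assert.h>

/* Relative throughput gain when the self-time share of one function drops
 * from self_before to self_after (both as fractions of total time in their
 * respective runs), assuming all other time is unchanged. */
static double sketch_throughput_gain(double self_before, double self_after)
{
    assert(self_before >= 0.0 && self_before < 1.0);
    assert(self_after >= 0.0 && self_after < 1.0);
    return (1.0 - self_after) / (1.0 - self_before) - 1.0;
}
```

Under this model, going from 32.04% to 30% gives about a 3.0% gain and going to 28% gives about 5.9%, which is in the same range as the estimate above.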

Why modest gains:

C0-C3 is only 48% of calls
Policy snapshot is 5-10 cycles (not huge absolute time)
But consistent improvement across all mixed workloads

Files to Modify

core/front/malloc_tiny_fast.h
core/box/hak_wrappers.inc.h

Files to Reference

/mnt/workdisk/public_share/hakmem/core/front/malloc_tiny_fast.h (current implementation)
/mnt/workdisk/public_share/hakmem/core/tiny_legacy.inc.h (tiny_legacy_fallback_free_base signature)
/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_lazy_init.inc.h (tiny_front_v3_enabled, etc)

Commit Message

Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path

Treat C0-C3 classes (48% of calls) as "second hot path", not cold.
Skip expensive policy snapshot and route determination, direct to
tiny_legacy_fallback_free_base().

Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed that C0-C3
is not rare (48.43% of all frees), so naive hot/cold split failed.
This phase applies the correct optimization: direct path for frequent
C0-C3 class.

ENV: HAKMEM_TINY_LARSON_FIX disables optimization (safety gate)

Expected: -2-4pp free self%, +3-5% throughput

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
