diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md index 082d5caf..d243b58a 100644 --- a/CURRENT_TASK.md +++ b/CURRENT_TASK.md @@ -113,6 +113,13 @@ **Ref**: - `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_6C_DUPLICATE_ROUTE_DEDUP_DESIGN.md` +- `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_6C_DUPLICATE_ROUTE_DEDUP_AB_TEST_RESULTS.md` + +**Next**: +- Phase 19-7: LARSON_FIX TLS consolidation(重複 `getenv("HAKMEM_TINY_LARSON_FIX")` を 1 箇所に集約) + - Ref: `docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_DESIGN.md` +- Phase 20 (proposal): WarmPool slab_idx hint(warm hit の O(cap) scan を削る) + - Ref: `docs/analysis/PHASE20_WARM_POOL_SLABIDX_HINT_1_DESIGN.md` --- diff --git a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_6C_DUPLICATE_ROUTE_DEDUP_AB_TEST_RESULTS.md b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_6C_DUPLICATE_ROUTE_DEDUP_AB_TEST_RESULTS.md new file mode 100644 index 00000000..d6fcf8c5 --- /dev/null +++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_6C_DUPLICATE_ROUTE_DEDUP_AB_TEST_RESULTS.md @@ -0,0 +1,37 @@ +## Phase 19-6C — Duplicate tiny route lookup de-dup (free path) — ✅ GO + +### Goal + +Eliminate redundant `tiny_route_for_class()` / `tiny_route_is_heap_kind()` work in the free hot/cold chain by computing the Tiny route once and passing it across the single boundary. + +### Patch summary + +- Add helper: `free_tiny_fast_compute_route_and_heap(class_idx, &route, &use_tiny_heap)` +- Pass-down: + - `free_tiny_fast_hot()` computes route once before entering `free_tiny_fast_cold(...)` + - `free_tiny_fast()` legacy fallback uses the same helper (no duplicate logic) +- Change `free_tiny_fast_cold()` signature to accept `route` + `use_tiny_heap` + +Files: +- `core/front/malloc_tiny_fast.h` + +### A/B Test (Mixed 10-run) + +Command: +- `scripts/run_mixed_10_cleanenv.sh` (profile `MIXED_TINYV3_C7_SAFE`, `iters=20M`, `ws=400`, `runs=10`) + +Results: + +| Metric | Baseline | Optimized | Delta | +|---|---:|---:|---:| +| Mean | 53.49M ops/s | 54.55M ops/s | +1.98% | + +### Decision + +- ✅ GO (>= +1.0% threshold) + +### Notes + +- Box Theory boundary stays single: **route compute (LUT) happens once**, then cold path consumes the precomputed values. +- Avoids mixing route enums (`SmallRouteKind` vs `tiny_route_kind_t`) by keeping this optimization in the Tiny-route layer. + diff --git a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_AB_TEST_RESULTS.md b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_AB_TEST_RESULTS.md new file mode 100644 index 00000000..2a2f6604 --- /dev/null +++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_AB_TEST_RESULTS.md @@ -0,0 +1,54 @@ +## Phase 19-7 — LARSON_FIX TLS Consolidation — ❌ NO-GO + +### Goal + +Eliminate 5 duplicate `getenv("HAKMEM_TINY_LARSON_FIX")` calls by consolidating into single per-thread TLS cache. + +### Code change + +- Add: `core/box/tiny_larson_fix_tls_box.h` (unified TLS cache helper) +- Modify: `core/front/malloc_tiny_fast.h` (replace 5 duplicate blocks with helper calls) + - Line 452 (free_tiny_fast_cold, block 1) + - Line 646 (free_tiny_fast_hot, C0-C3 direct path) + - Line 792 (free_tiny_fast, mono_dualhot) + - Line 822 (free_tiny_fast, mono_legacy_direct) + - Line 917 (free_tiny_fast, legacy_fallback) + +### A/B Test (Mixed 10-run) + +Command: +- `scripts/run_mixed_10_cleanenv.sh` (profile `MIXED_TINYV3_C7_SAFE`, `iters=20M`, `ws=400`, `runs=10`) + +Results: + +| Metric | Baseline | Optimized | Delta | +|---|---:|---:|---:| +| Mean | 54.55M ops/s | 53.82M ops/s | -1.34% | +| Median | 54.55M ops/s | 54.01M ops/s | -0.99% | + +### Decision + +- ❌ NO-GO (<= -1.0% threshold) +- Reverted immediately + +### Root Cause Analysis + +**Why consolidation failed**: + +1. **Compiler optimization interference**: Individual `static __thread` variables in separate scopes allow per-scope optimization +2. **Function call overhead**: Helper function (`tiny_larson_fix_enabled()`) not fully inlined despite `static inline` +3. **Cache locality**: 5 separate TLS caches may have better L1 locality than single shared cache accessed from multiple callsites +4. **Branch prediction**: Separate lazy-init gates may have better prediction patterns than unified gate + +**Key learning**: Not all "duplicate code" is inefficient. Per-scope TLS caching can outperform centralized caching when: +- Each scope has different access patterns +- Compiler can optimize each scope independently +- Function call overhead outweighs duplication cost + +### Notes + +- Expected gain: +0.3-0.7% +- Actual result: -1.34% +- **Delta from expected: -1.6 to -2.0 percentage points** +- This is a **rare case where duplication is optimal** (compiler benefits from independent scope optimization) + diff --git a/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_DESIGN.md b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_DESIGN.md new file mode 100644 index 00000000..bbb45097 --- /dev/null +++ b/docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_7_LARSON_FIX_TLS_CONSOLIDATION_DESIGN.md @@ -0,0 +1,259 @@ +# Phase 19-7: Consolidate HAKMEM_TINY_LARSON_FIX TLS Cache + +## Goal + +Eliminate 5 duplicate `getenv("HAKMEM_TINY_LARSON_FIX")` calls in free path: +- `free_tiny_fast_hot()` line 446-454: lazy-init into `static __thread int g_larson_fix = -1` +- `free_tiny_fast_hot()` line 640-648: **DUPLICATE** (different scope) +- `free_tiny_fast()` mono_dualhot line 786-794: **DUPLICATE** (different scope) +- `free_tiny_fast()` mono_legacy_direct line 816-824: **DUPLICATE** (different scope) +- `free_tiny_fast()` legacy_fallback line 911-919: **DUPLICATE** (different scope) + +Expected: **+0.3-0.7% throughput** from: +- Eliminating 4x redundant getenv() calls (heavy on benchmark startup) +- Reducing branch instructions (5→1 initialization gate) +- Improving I-cache locality (consolidate lazy-init into 1 TLS box) + +--- + +## Problem Analysis + +### Redundancy Pattern + +``` +free_tiny_fast_hot(): + ├─ line 446-454: static __thread int g_larson_fix = -1 + │ └─ if g_larson_fix == -1: getenv() + cache + │ + └─ line 640-648: static __thread int g_larson_fix = -1 (SAME SCOPE!) + └─ if g_larson_fix == -1: getenv() + cache (REDUNDANT!) + +free_tiny_fast(): + ├─ line 786-794: static __thread int g_larson_fix = -1 + │ └─ getenv() (RECOMPUTE, separate scope) + │ + ├─ line 816-824: static __thread int g_larson_fix = -1 + │ └─ getenv() (RECOMPUTE, separate scope) + │ + └─ line 911-919: static __thread int g_larson_fix = -1 + └─ getenv() (RECOMPUTE, separate scope) +``` + +### Issue: Scope Isolation + +Each static local has **separate scope** → compiler cannot elide duplicate reads +- Line 446-454 and 640-648 are both in `free_tiny_fast_hot()` → same function + - But **different variable names/scopes** → separate initialization checks + - Compiler sees each as independent → generates 2x init code +- Lines 786-794, 816-824, 911-919 are in different functions/blocks + - Definitely cannot share scope → 3x additional duplicates + +### Instruction Overhead + +Per free operation (assuming 20% go through each path): +- Current: 5x lazy-init gates (5x branch on g_larson_fix==−1, 5x conditional getenv calls) +- After: 1x unified TLS cache (1x branch, 1x conditional getenv for entire operation lifecycle) +- Savings: ~10-15 instructions per free (redundant branches eliminated) + +--- + +## Solution: Unified TLS Cache Box + +**Strategy**: Extract LARSON_FIX caching into standalone TLS box header. + +### Step 1: Create TLS Box + +File: `core/box/tiny_larson_fix_tls_box.h` (new) + +```c +#ifndef HAK_TINY_LARSON_FIX_TLS_BOX_H +#define HAK_TINY_LARSON_FIX_TLS_BOX_H + +#include + +// Unified per-thread cache for HAKMEM_TINY_LARSON_FIX +// Eliminates redundant getenv() calls across hot path +static inline int tiny_larson_fix_enabled(void) { + static __thread int g_larson_fix = -1; + if (__builtin_expect(g_larson_fix == -1, 0)) { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); + g_larson_fix = (e && *e && *e != '0') ? 1 : 0; + } + return g_larson_fix; +} + +#endif // HAK_TINY_LARSON_FIX_TLS_BOX_H +``` + +### Step 2: Update malloc_tiny_fast.h + +Replace all 5 duplicate instances with single call: + +**Before (line 446-454 in free_tiny_fast_hot)**: +```c +static __thread int g_larson_fix = -1; +if (__builtin_expect(g_larson_fix == -1, 0)) { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); + g_larson_fix = (e && *e && *e != '0') ? 1 : 0; +} + +if (__builtin_expect(g_larson_fix || ..., 0)) { +``` + +**After**: +```c +if (__builtin_expect(tiny_larson_fix_enabled() || ..., 0)) { +``` + +Repeat for all 5 locations: +- Line 452 (free_tiny_fast_hot) → use `tiny_larson_fix_enabled()` +- Line 646 (free_tiny_fast_hot) → use `tiny_larson_fix_enabled()` +- Line 792 (free_tiny_fast mono_dualhot) → use `tiny_larson_fix_enabled()` +- Line 822 (free_tiny_fast mono_legacy_direct) → use `tiny_larson_fix_enabled()` +- Line 917 (free_tiny_fast legacy_fallback) → use `tiny_larson_fix_enabled()` + +### Step 3: Add Include + +In `malloc_tiny_fast.h`, add: +```c +#include "../box/tiny_larson_fix_tls_box.h" // Phase 19-7: Unified LARSON_FIX TLS +``` + +--- + +## Code Changes Summary + +| File | Change | Type | +|------|--------|------| +| `core/box/tiny_larson_fix_tls_box.h` | NEW: Unified TLS box | New file | +| `core/front/malloc_tiny_fast.h` | Include new box | +1 line | +| `core/front/malloc_tiny_fast.h` | Line 452-454: Remove static + use helper | Modify | +| `core/front/malloc_tiny_fast.h` | Line 646-648: Remove static + use helper | Modify | +| `core/front/malloc_tiny_fast.h` | Line 792-794: Remove static + use helper | Modify | +| `core/front/malloc_tiny_fast.h` | Line 822-824: Remove static + use helper | Modify | +| `core/front/malloc_tiny_fast.h` | Line 917-919: Remove static + use helper | Modify | + +**Total**: ~15 lines removed, ~1 function added + +--- + +## A/B Test Protocol + +### Baseline (Phase 19-6C state) +```sh +scripts/run_mixed_10_cleanenv.sh +# Expected: 54.55M ops/s (19-6C state) +``` + +### Optimized (Phase 19-7) +```sh +scripts/run_mixed_10_cleanenv.sh +# Expected: 54.77-55.00M ops/s (+0.3-0.7% if instruction savings convert) +``` + +### GO Criteria +- Mean ≥ 54.55M ops/s (no regression) +- Optimal: +0.3-0.7% improvement + +### Validation +```sh +perf stat -e branches,branch-misses -- ./bench_random_mixed_hakmem 50000000 400 1 +# Check: branches/op should decrease (fewer init branches) +``` + +--- + +## Risk Assessment + +| Risk | Level | Mitigation | +|------|-------|-----------|\ +| TLS box initialization order | LOW | Static inline function, no external dependencies | +| Scope visibility | LOW | Only used in malloc_tiny_fast.h (internal) | +| getenv() thread-safety | LOW | getenv() is thread-safe, result cached in TLS | +| Multiple-definition errors | LOW | Static inline in header (no ODR violation) | + +**Overall**: **GREEN** (very low risk, mechanical refactoring) + +--- + +## Expected Performance Gains + +**Per-operation cost model**: +- Baseline: 5x lazy-init gates + 5x redundant branches +- After: 1x lazy-init gate + 1x branch (amortized) +- Instruction delta: -10-15 instructions per free operation + +**Throughput gain**: +- If 80-90% of operations involve free path: ~1-2 instructions/op saved +- Expected: **+0.3-0.7% throughput** + +--- + +## Next Steps + +1. ✅ Design (complete) +2. 🔄 Implementation: Create box + update all 5 call sites +3. 🔄 Build: `make clean && make -j bench_random_mixed_hakmem` +4. 🔄 A/B Test: Run 10-run benchmark +5. 📊 Decision: GO/NO-GO based on throughput delta +6. 📋 Commit + Document + +--- + +## Appendix: Before/After Comparison + +### Before (5 duplicate blocks) +```c +// free_tiny_fast_hot, block 1: +{ + static __thread int g_larson_fix = -1; + if (__builtin_expect(g_larson_fix == -1, 0)) { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); + g_larson_fix = (e && *e && *e != '0') ? 1 : 0; + } + if (g_larson_fix) { ... } +} + +// free_tiny_fast_hot, block 2: +{ + static __thread int g_larson_fix = -1; // DUPLICATE + if (__builtin_expect(g_larson_fix == -1, 0)) { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); // DUPLICATE + g_larson_fix = (e && *e && *e != '0') ? 1 : 0; + } + if (g_larson_fix) { ... } +} + +// free_tiny_fast, block 3: +{ + static __thread int g_larson_fix = -1; // DUPLICATE (separate scope) + if (__builtin_expect(g_larson_fix == -1, 0)) { + const char* e = getenv("HAKMEM_TINY_LARSON_FIX"); // DUPLICATE + g_larson_fix = (e && *e && *e != '0') ? 1 : 0; + } + if (!g_larson_fix && ...) { ... } +} + +// ... similar for blocks 4 and 5 ... +``` + +### After (1 unified function) +```c +// All 5 blocks become: +if (__builtin_expect(tiny_larson_fix_enabled(), 0)) { + // ... logic ... +} +``` + +**Benefit**: Single TLS cache, single getenv() call, single initialization gate + +--- + +## Related + +- Phase 19-5: Attempted global ENV cache (NO-GO, L1 thrashing) +- Phase 19-5v2: Attempted snapshot integration (NO-GO, broke TLS pattern) +- Phase 19-6C: ✅ Route dedup (completed, +1.98%) +- Phase 19-7: **THIS** — TLS consolidation (proposed) + +Next frontier: Policy snapshot caching in free paths (+0.3-0.7%) diff --git a/docs/analysis/PHASE20_WARM_POOL_SLABIDX_HINT_1_DESIGN.md b/docs/analysis/PHASE20_WARM_POOL_SLABIDX_HINT_1_DESIGN.md new file mode 100644 index 00000000..b3fb92f6 --- /dev/null +++ b/docs/analysis/PHASE20_WARM_POOL_SLABIDX_HINT_1_DESIGN.md @@ -0,0 +1,152 @@ +# Phase 20 (Proposal): WarmPool SlabIdx Hint (avoid O(cap) scan on warm hit) + +## TL;DR + +`unified_cache_refill()` already does **batch refill** (up to 128/256/512 blocks) and already has the **Warm Pool** to avoid registry scans. + +The remaining hot cost on the warm-hit path is typically: + +1. `tiny_warm_pool_pop(class_idx)` (O(1)) +2. `for (i=0..cap)` scan to find `slab_idx` where `tiny_get_class_from_ss(ss,i)==class_idx` (**O(cap)**) +3. `ss_tls_bind_one(...)` and optional TLS carve + +This Phase 20 reduces step (2) by storing a **slab_idx hint** alongside the warm pool entry. + +Expected ROI: **+1–4%** on Mixed (only if warm-hit rate is high and cap scan is nontrivial). + +--- + +## Box Theory framing + +### New box + +**WarmPoolEntryBox**: “warm entry = (SuperSlab*, slab_idx_hint)” + +- Responsibility: store + retrieve warm candidates with a verified slab index hint +- No side effects outside warm pool memory (pure stack operations) + +### Boundary (single conversion point) + +The only “conversion” is in `unified_cache_refill()`: + +`WarmPoolEntryBox.pop(class_idx)` → `(ss, slab_idx_hint)` +→ validate hint once +→ `ss_tls_bind_one(class_idx, tls, ss, slab_idx, tid)` + +No other code paths should re-infer slab_idx. + +### Rollback + +ENV gate: `HAKMEM_WARM_POOL_SLABIDX_HINT=0/1` (default 0, opt-in) + +- OFF: current `TinyWarmPool` behavior (ss only, scan to find slab_idx) +- ON: use entry-with-hint fast path; fallback to scan if hint invalid + +--- + +## Design details + +### Data structure + +Add a parallel “entry” type without changing SuperSlab layout: + +```c +typedef struct { + SuperSlab* ss; + uint16_t slab_idx_hint; // [0..cap-1], or 0xFFFF for “unknown” +} TinyWarmEntry; +``` + +Implementation options: + +1) Replace `TinyWarmPool.slabs[]` with `TinyWarmEntry entries[]` +2) Dual arrays (`slabs[]` + `slab_idx[]`) + +Prefer (2) if you want minimal diff risk. + +### Push contract + +When pushing a warm candidate, compute slab_idx once (cold-ish context) and store the hint: + +```c +int slab_idx = ss_find_first_slab_for_class(ss, class_idx); // may scan +warm_pool_push(class_idx, ss, slab_idx); +``` + +This moves scan cost to the producer side (where it is already scanning registry / iterating slabs). + +### Pop contract (hot) + +```c +TinyWarmEntry e = warm_pool_pop_entry(class_idx); +if (!e.ss) return NULL; + +int slab_idx = e.slab_idx_hint; +if (slab_idx != 0xFFFF && tiny_get_class_from_ss(e.ss, slab_idx) == class_idx) { + // fast: validated +} else { + // fallback: scan to find slab_idx (rare) + slab_idx = ss_find_first_slab_for_class(e.ss, class_idx); + if (slab_idx < 0) return NULL; // fail-fast → warm miss fallback +} +``` + +### Fail-fast rules + +- If hint invalid and scan fails → treat as warm miss (fallback to existing refill path) +- Never bind TLS to a slab_idx that doesn’t match class + +### Minimal observability + +Add only coarse counters (ENV-gated, or debug builds only): + +- `warm_hint_hit` (hint valid) +- `warm_hint_miss` (hint invalid → scanned) +- `warm_hint_scan_fail` (should be ~0; fail-fast signal) + +--- + +## A/B plan + +### Step 0: prove this is worth doing + +Before implementing, validate that warm pool is used and the slab scan is nontrivial: + +- `perf record` / `perf report` focus on `unified_cache_refill` +- confirm the slab scan loop shows up (e.g. `tiny_get_class_from_ss` / loop body) + +If `unified_cache_refill` is < ~3% total cycles in Mixed, ROI will be capped. + +### Step 1: implement behind ENV + +- `HAKMEM_WARM_POOL_SLABIDX_HINT=0/1` (default 0) +- Keep behavior identical when OFF. + +### Step 2: Mixed 10-run + +Use the existing cleanenv harness: + +```sh +HAKMEM_WARM_POOL_SLABIDX_HINT=0 scripts/run_mixed_10_cleanenv.sh +HAKMEM_WARM_POOL_SLABIDX_HINT=1 scripts/run_mixed_10_cleanenv.sh +``` + +GO/NO-GO: +- GO: mean **+1.0%** or more +- NEUTRAL: ±1.0% +- NO-GO: -1.0% or worse + +### Step 3: perf sanity + +If GO, confirm: +- `branches/op` down (or stable) +- no iTLB explosion + +--- + +## Why “Segment Batch Refill Layer” is probably redundant here + +`unified_cache_refill()` already batches heavily (up to 512) and already has Warm Pool + PageBox + carve boxes. + +If you want a Phase 20 with big ROI, aim at the remaining **O(cap) scan** and any remaining registry scans, not adding another batch layer on top of an existing batch layer. + diff --git a/hakmem.d b/hakmem.d index d44ab9da..829491c7 100644 --- a/hakmem.d +++ b/hakmem.d @@ -165,6 +165,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \ core/box/../front/../box/free_cold_shape_stats_box.h \ core/box/../front/../box/free_tiny_fast_mono_dualhot_env_box.h \ core/box/../front/../box/free_tiny_fast_mono_legacy_direct_env_box.h \ + core/box/../front/../box/tiny_larson_fix_tls_box.h \ core/box/tiny_alloc_gate_box.h core/box/tiny_route_box.h \ core/box/tiny_alloc_gate_shape_env_box.h \ core/box/tiny_front_config_box.h core/box/wrapper_env_box.h \ @@ -422,6 +423,7 @@ core/box/../front/../box/free_cold_shape_env_box.h: core/box/../front/../box/free_cold_shape_stats_box.h: core/box/../front/../box/free_tiny_fast_mono_dualhot_env_box.h: core/box/../front/../box/free_tiny_fast_mono_legacy_direct_env_box.h: +core/box/../front/../box/tiny_larson_fix_tls_box.h: core/box/tiny_alloc_gate_box.h: core/box/tiny_route_box.h: core/box/tiny_alloc_gate_shape_env_box.h: