Phase 20 (Proposal): WarmPool SlabIdx Hint (avoid O(cap) scan on warm hit)

TL;DR

unified_cache_refill() already does batch refill (up to 128/256/512 blocks) and already has the Warm Pool to avoid registry scans.

The remaining hot cost on the warm-hit path is typically:

  1. tiny_warm_pool_pop(class_idx) (O(1))
  2. for (i=0..cap) scan to find slab_idx where tiny_get_class_from_ss(ss,i)==class_idx (O(cap))
  3. ss_tls_bind_one(...) and optional TLS carve

This Phase 20 reduces step (2) by storing a slab_idx hint alongside the warm pool entry.

Expected ROI: +1-4% on Mixed (only if the warm-hit rate is high and the cap scan is nontrivial).
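
For concreteness, the scan in step (2) has roughly this shape (a sketch; the exact loop body inside unified_cache_refill() may differ):

int slab_idx = -1;
for (int i = 0; i < cap; i++) {   // O(cap) on every warm hit today
  if (tiny_get_class_from_ss(ss, i) == class_idx) { slab_idx = i; break; }
}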


Box Theory framing

New box

WarmPoolEntryBox: “warm entry = (SuperSlab*, slab_idx_hint)”

  • Responsibility: store + retrieve warm candidates with a verified slab index hint
  • No side effects outside warm pool memory (pure stack operations)

Boundary (single conversion point)

The only “conversion” is in unified_cache_refill():

WarmPoolEntryBox.pop(class_idx) → (ss, slab_idx_hint) → validate hint once → ss_tls_bind_one(class_idx, tls, ss, slab_idx, tid)

No other code paths should re-infer slab_idx.

Rollback

ENV gate: HAKMEM_WARM_POOL_SLABIDX_HINT=0/1 (default 0, opt-in)

  • OFF: current TinyWarmPool behavior (ss only, scan to find slab_idx)
  • ON: use entry-with-hint fast path; fallback to scan if hint invalid
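
A minimal sketch of the gate check, assuming a cached getenv() read (the helper name is illustrative, not existing code):

#include <stdlib.h>

static inline int warm_hint_enabled(void) {
  static int g_on = -1;                 // -1 = not yet read
  if (g_on < 0) {
    const char* e = getenv("HAKMEM_WARM_POOL_SLABIDX_HINT");
    g_on = (e && e[0] == '1') ? 1 : 0;  // default 0 (opt-in)
  }
  return g_on;
}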

Design details

Data structure

Add a parallel “entry” type without changing SuperSlab layout:

typedef struct {
  SuperSlab* ss;
  uint16_t slab_idx_hint; // [0..cap-1], or 0xFFFF for “unknown”
} TinyWarmEntry;

Implementation options:

  1. Replace TinyWarmPool.slabs[] with TinyWarmEntry entries[]
  2. Dual arrays (slabs[] + slab_idx[])

Prefer (2) if you want minimal diff risk.
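
A sketch of option (2), assuming a per-class pool with a count field; the array names, capacity, and TLS placement are illustrative, and the real TinyWarmPool layout may differ:

#define TINY_WARM_CAP    8   // assumed per-class pool capacity
#define TINY_NUM_CLASSES 8   // placeholder for the real class count

typedef struct {
  SuperSlab* slabs[TINY_WARM_CAP];     // existing array, unchanged
  uint16_t   slab_idx[TINY_WARM_CAP];  // new parallel hint array
  uint8_t    count;
} TinyWarmPoolClass;

static __thread TinyWarmPoolClass g_warm_pool[TINY_NUM_CLASSES]; // assumed TLS pool

static inline TinyWarmEntry warm_pool_pop_entry(int class_idx) {
  TinyWarmPoolClass* p = &g_warm_pool[class_idx];
  TinyWarmEntry e = { NULL, 0xFFFF };  // 0xFFFF = "unknown" hint
  if (p->count > 0) {
    p->count--;
    e.ss            = p->slabs[p->count];
    e.slab_idx_hint = p->slab_idx[p->count];
  }
  return e;
}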

Push contract

When pushing a warm candidate, compute slab_idx once (cold-ish context) and store the hint:

int slab_idx = ss_find_first_slab_for_class(ss, class_idx); // may scan
warm_pool_push(class_idx, ss, slab_idx);

This moves scan cost to the producer side (where it is already scanning registry / iterating slabs).
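
The producer side then becomes a one-line store next to the existing push (same assumed layout as the sketch above; dropping the candidate on a full pool mirrors the usual warm-pool policy and is an assumption):

static inline void warm_pool_push(int class_idx, SuperSlab* ss, int slab_idx) {
  TinyWarmPoolClass* p = &g_warm_pool[class_idx];
  if (p->count >= TINY_WARM_CAP) return;  // pool full: drop candidate
  p->slabs[p->count]    = ss;
  p->slab_idx[p->count] = (slab_idx >= 0) ? (uint16_t)slab_idx : 0xFFFF; // 0xFFFF = unknown
  p->count++;
}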

Pop contract (hot)

TinyWarmEntry e = warm_pool_pop_entry(class_idx);
if (!e.ss) return NULL;

int slab_idx = e.slab_idx_hint;
if (slab_idx != 0xFFFF && tiny_get_class_from_ss(e.ss, slab_idx) == class_idx) {
  // fast: hint validated, no scan
} else {
  // fallback: scan to find slab_idx (rare)
  slab_idx = ss_find_first_slab_for_class(e.ss, class_idx);
  if (slab_idx < 0) return NULL; // fail-fast → warm miss fallback
}

// bind exactly once at the boundary (signature per the contract above)
ss_tls_bind_one(class_idx, tls, e.ss, slab_idx, tid);

Fail-fast rules

  • If hint invalid and scan fails → treat as warm miss (fallback to existing refill path)
  • Never bind TLS to a slab_idx that doesn't match the class

Minimal observability

Add only coarse counters (ENV-gated, or debug builds only):

  • warm_hint_hit (hint valid)
  • warm_hint_miss (hint invalid → scanned)
  • warm_hint_scan_fail (should be ~0; fail-fast signal)
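
One possible shape for these counters, assuming C11 atomics and a debug-only guard (HAKMEM_DEBUG_COUNTERS and the macro name are illustrative):

#include <stdatomic.h>

#ifdef HAKMEM_DEBUG_COUNTERS
static _Atomic unsigned long g_warm_hint_hit;
static _Atomic unsigned long g_warm_hint_miss;
static _Atomic unsigned long g_warm_hint_scan_fail;
#define WARM_HINT_COUNT(c) atomic_fetch_add_explicit(&g_##c, 1, memory_order_relaxed)
#else
#define WARM_HINT_COUNT(c) ((void)0)  // compiles out in release builds
#endif

Call sites would be WARM_HINT_COUNT(warm_hint_hit) on the validated path, WARM_HINT_COUNT(warm_hint_miss) before the fallback scan, and WARM_HINT_COUNT(warm_hint_scan_fail) when the scan also fails.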

A/B plan

Step 0: prove this is worth doing

Before implementing, validate that the warm pool is actually hit and that the slab scan is nontrivial:

  • perf record / perf report, focusing on unified_cache_refill
  • confirm the slab scan loop shows up (e.g. tiny_get_class_from_ss / the loop body)
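
For example (bench_mixed is a placeholder for the Mixed benchmark binary):

perf record -g -- ./bench_mixed
perf report   # look for unified_cache_refill and the scan loop in its self time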

If unified_cache_refill accounts for less than ~3% of total cycles in Mixed, ROI will be capped.

Step 1: implement behind ENV

  • HAKMEM_WARM_POOL_SLABIDX_HINT=0/1 (default 0)
  • Keep behavior identical when OFF.

Step 2: Mixed 10-run

Use the existing cleanenv harness:

HAKMEM_WARM_POOL_SLABIDX_HINT=0 scripts/run_mixed_10_cleanenv.sh
HAKMEM_WARM_POOL_SLABIDX_HINT=1 scripts/run_mixed_10_cleanenv.sh

GO/NO-GO:

  • GO: mean +1.0% or more
  • NEUTRAL: within ±1.0%
  • NO-GO: -1.0% or worse

Step 3: perf sanity

If GO, confirm:

  • branches/op down (or stable)
  • no iTLB explosion
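
Both can be checked with standard perf events (bench_mixed is again a placeholder):

perf stat -e branches,instructions,iTLB-load-misses -- ./bench_mixed

Divide the branches count by the total operation count reported by the harness to get branches/op.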

Why “Segment Batch Refill Layer” is probably redundant here

unified_cache_refill() already batches heavily (up to 512) and already has Warm Pool + PageBox + carve boxes.

If you want a Phase 20 with big ROI, aim at the remaining O(cap) scan and any remaining registry scans, not at adding another batch layer on top of the existing one.