hakmem/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md
Moe Charm (CI) ad346f7885 Phase 18 v2: BENCH_MINIMAL design + instructions (instruction removal strategy)
## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression),
shift to instruction count reduction via compile-time removal:

- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput ±3%, instructions -5% to -15%
NO-GO: throughput < -2%, instructions < -5%

Key: If instructions do not drop -15%+, allocator is not the bottleneck
and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal:
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes

3. A/B test with perf stat (must measure instruction reduction)

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees.
Section splitting alone (without symbol ordering, PGO, or linker script)
destroyed code locality and increased I-cache misses 91%.

Switching to direct instruction removal is safer and more predictable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 05:55:22 +09:00


Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Implementation Instructions

Status

  • Phase 18 v1 (layout + sections) → NO-GO (I-cache regression)
  • Phase 18 v2 (instruction removal) → Next phase
  • Strategy: Remove stats/ENV overhead at compile-time
  • Expected: +10-20% throughput, -30-40% instructions

Ref: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md


0. Goal / Success Criteria

GO threshold (STRICT):

  • Throughput: +5% minimum (+8% preferred)
  • Instructions: -15% minimum (clear proof of concept)
  • I-cache: automatic improvement from smaller footprint

NEUTRAL:

  • Throughput: ±3% (marginal)
  • Instructions: -5% to -15% (incomplete optimization)

NO-GO:

  • Throughput: below -2% (regression)
  • Instructions: reduction smaller than 5% (failed to reduce overhead)

If instructions do not drop by at least 15%, abandon this phase (allocator is not the bottleneck).


1. Implementation Steps

1.1 Add Makefile knob

File: Makefile

After Phase 18 v1 section (around line 140), add:

# Phase 18 v2: BENCH_MINIMAL (remove instrumentation for benchmark builds)
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
  CFLAGS_SHARED += -DHAKMEM_BENCH_MINIMAL=1
  # Note: Both bench and shared lib will disable instrumentation
  # Mainly impacts bench_* binaries (where BENCH_MINIMAL is intentionally enabled)
endif

Location: After HOT_TEXT_GC_SECTIONS section (line ~145)
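
With BENCH_MINIMAL=0 the Makefile passes no -D flag, so HAKMEM_BENCH_MINIMAL is undefined in C. `#if HAKMEM_BENCH_MINIMAL` then evaluates to 0, but it can trigger warnings under -Wundef. A minimal sketch of a default-definition guard follows, assuming it lives in some shared config header; the header placement and the helper name hakmem_bench_minimal_active are hypothetical and not taken from the repository.

```c
/* Hypothetical guard for a shared config header (placement is an assumption). */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default: instrumentation stays enabled */
#endif

/* Reject anything other than 0 or 1 at compile time. */
#if HAKMEM_BENCH_MINIMAL != 0 && HAKMEM_BENCH_MINIMAL != 1
#error "HAKMEM_BENCH_MINIMAL must be 0 or 1"
#endif

/* Hypothetical helper so tools and tests can report the active build mode. */
static inline int hakmem_bench_minimal_active(void) {
    return HAKMEM_BENCH_MINIMAL;
}
```

With such a guard in place, both `#if HAKMEM_BENCH_MINIMAL` and `#if !HAKMEM_BENCH_MINIMAL` always see a defined value.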

1.2 Update front_fastlane_stats_box.h

File: core/box/front_fastlane_stats_box.h

Find the FRONT_FASTLANE_STAT_INC macro (currently defined as atomic increment).

Replace entire macro section:

// Before:
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while(0)

// After:
#if HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while(0)
#else
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while(0)
#endif

Rationale: Stats collection becomes no-op in BENCH_MINIMAL, compiled away entirely.
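
The effect of the guarded macro can be checked in isolation. The sketch below is a self-contained stand-in, not the real header: demo_stats_t, demo_side_effects, demo_hot_path, and the malloc_hit field are assumptions made for illustration, and the demo defaults HAKMEM_BENCH_MINIMAL to 1 so the no-op branch is exercised.

```c
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 1   /* demo default: pretend this is a bench build */
#endif

/* Stand-ins for the real stats state; names and layout are assumptions. */
typedef struct { _Atomic uint64_t malloc_hit; } demo_stats_t;
static demo_stats_t g_stats;
static int demo_side_effects;    /* counts calls to the enable check */

static inline int front_fastlane_stats_enabled(void) {
    demo_side_effects++;
    return 1;
}

#if HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while (0)
#else
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { \
        if (__builtin_expect(front_fastlane_stats_enabled(), 0)) { \
            atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed); \
        } \
    } while (0)
#endif

/* Hypothetical hot path: bumps the counter once per iteration. */
static inline uint64_t demo_hot_path(int iterations) {
    for (int i = 0; i < iterations; i++) {
        FRONT_FASTLANE_STAT_INC(malloc_hit);
    }
    return atomic_load_explicit(&g_stats.malloc_hit, memory_order_relaxed);
}
```

In the bench configuration neither the enable check nor the atomic add is evaluated, so the counter stays at zero and the loop body is empty.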

1.3 Update malloc_tiny_fast.h free stats

File: core/front/malloc_tiny_fast.h

Search for free_tiny_fast_mono_stat_inc_* calls (probably in free_tiny_fast function).

Wrap each call:

// Before:
free_tiny_fast_mono_stat_inc_hit();

// After:
#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif

Search pattern: free_tiny_fast_mono_stat_ (should find ~5-10 calls)

Rationale: Similar to malloc, free stats become optional in BENCH_MINIMAL.
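
Wrapping each call site works, but it spreads #if blocks through the hot path. One alternative, which is not what the steps above prescribe, is to give the stat function an empty inline body in bench builds so the call sites stay untouched. The sketch below assumes free_tiny_fast_mono_stat_inc_hit takes no arguments, as in the call shown above; demo_free_mono_hits and demo_free_path are hypothetical stand-ins.

```c
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 1   /* demo default: bench build */
#endif

/* Stand-in counter; the real stats storage is not shown in this document. */
static uint64_t demo_free_mono_hits;

#if HAKMEM_BENCH_MINIMAL
/* Bench build: empty body, so the optimizer can drop every call. */
static inline void free_tiny_fast_mono_stat_inc_hit(void) { }
#else
static inline void free_tiny_fast_mono_stat_inc_hit(void) { demo_free_mono_hits++; }
#endif

/* Hypothetical call site: identical source in both build modes. */
static inline uint64_t demo_free_path(int frees) {
    for (int i = 0; i < frees; i++) {
        free_tiny_fast_mono_stat_inc_hit();
    }
    return demo_free_mono_hits;
}
```

Either approach should produce the same machine code at -O2; the choice is mainly about how many places need editing.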

1.4 Update hakmem_env_snapshot_box.h

File: core/box/hakmem_env_snapshot_box.h

Find function hakmem_env_snapshot_enabled().

Replace implementation:

// Before:
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}

// After:
#if HAKMEM_BENCH_MINIMAL
// In bench mode, snapshot is always enabled (one-time cost, compile-away benefit)
static inline bool hakmem_env_snapshot_enabled(void) {
    return 1;
}
#else
// Normal mode: runtime check
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}
#endif

Rationale: ENV checks become compile-time constants in BENCH_MINIMAL, enabling better optimization.
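
To illustrate what the constant buys, the sketch below pairs the guarded function with a hypothetical caller. demo_set_snapshot and demo_select_path are assumptions made for this example, and g_snapshot_enabled is declared locally as a stand-in for the real global.

```c
#include <stdatomic.h>
#include <stdbool.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 1   /* demo default: bench build */
#endif

static _Atomic bool g_snapshot_enabled;   /* stand-in; false by default */

/* Hypothetical runtime toggle, e.g. driven by an environment variable. */
static inline void demo_set_snapshot(bool on) {
    atomic_store_explicit(&g_snapshot_enabled, on, memory_order_relaxed);
}

#if HAKMEM_BENCH_MINIMAL
static inline bool hakmem_env_snapshot_enabled(void) { return true; }
#else
static inline bool hakmem_env_snapshot_enabled(void) {
    return atomic_load_explicit(&g_snapshot_enabled, memory_order_relaxed);
}
#endif

/* Hypothetical caller choosing between two code paths. */
static inline int demo_select_path(void) {
    if (hakmem_env_snapshot_enabled()) {
        return 1;   /* snapshot path */
    }
    return 0;       /* legacy path: unreachable in bench builds */
}
```

Note the behavioral consequence: in a bench build the runtime toggle is ignored, so any A/B comparison that relies on switching the snapshot off at runtime has to use a BENCH_MINIMAL=0 binary.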

1.5 (Optional) Update debug logging

File: any file with hakmem_tls_trace_* or hakmem_diag_* calls

Only if needed (lower priority). Wrap verbose logging:

#if !HAKMEM_BENCH_MINIMAL
hakmem_tls_trace_alloc(ptr, size);
#endif

2. Build & Verify

2.1 Baseline build (BENCH_MINIMAL=0)

make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem

Expected: ~653K (same as before)

2.2 Optimized build (BENCH_MINIMAL=1)

make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
ls -lh bench_random_mixed_hakmem

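Expected: About the same size, possibly slightly smaller.

Note: The instrumentation is removed at compile time (preprocessor and optimizer), not by linker garbage collection, so any reduction is limited to the affected code paths and may not be visible at ls -lh granularity.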

2.3 Verify compilation

Both builds should complete without errors.

If you see errors such as "undeclared identifier", a conditional guard is wrong (for example, a declaration was guarded out while unguarded code still references it).

3. A/B Test Execution

3.1 Baseline run (BENCH_MINIMAL=0)

make clean
make -j bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh

Record:

  • Mean throughput
  • Stdev
  • Min/Max

Run perf stat:

perf stat -e cycles,instructions,branches,branch-misses,L1-icache-load-misses,cache-misses,dTLB-load-misses,minor-faults -- \
  env -i PATH="$PATH" HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  ./bench_random_mixed_hakmem 200000000 400 1

Record:

  • cycles
  • instructions
  • L1-icache-load-misses (event name and availability vary by CPU; check perf list)
  • branch-misses %

3.2 Optimized run (BENCH_MINIMAL=1)

make clean
make -j BENCH_MINIMAL=1 bench_random_mixed_hakmem bench_random_mixed_system
scripts/run_mixed_10_cleanenv.sh

Record the same values as for the baseline, then run perf stat with the same command as in 3.1.

3.3 Optional: system ceiling check

./bench_random_mixed_system 200000000 400 1 2>&1 | rg "Throughput"

(Already measured in Phase 17: ~85M ops/s)


4. Analysis & GO/NO-GO Decision

Compute deltas

Delta_throughput = (optimized_mean - baseline_mean) / baseline_mean * 100
Delta_instructions = (optimized_instructions - baseline_instructions) / baseline_instructions * 100
Delta_icache = (optimized_icache - baseline_icache) / baseline_icache * 100
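
All three deltas use the same percent-change formula, so one helper covers them. This is a minimal sketch; delta_percent is a hypothetical name, and the inputs are whatever the baseline and optimized runs actually measured.

```c
/* Percent change from baseline to optimized.
 * Negative means the optimized build used less of the measured quantity. */
static inline double delta_percent(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}
```

For instance, a count that falls from 100 units to 80 units gives a delta of -20, which is how an instruction reduction should appear.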

Decision logic

GO (if ALL true):

  • Delta_throughput ≥ +5%
  • Delta_instructions ≤ -15%
  • Delta_icache ≤ -10% (should follow instruction reduction)

NEUTRAL (if some criteria marginal):

  • Delta_throughput ∈ [-3%, +3%]
  • Delta_instructions ∈ [-15%, -5%]
  • Variance increase < 50%

NO-GO (if any critical missed):

  • Delta_throughput < -2%
  • Delta_instructions > -5% (failed to reduce overhead)
  • Variance increase > 100%
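
The decision rules can be sketched as a small classifier. Two assumptions are baked in and should be revisited: the variance criteria are left out because "variance increase" is not defined numerically here, and any result that is neither GO nor NO-GO is treated as NEUTRAL, including ranges the rules above leave unclassified (such as throughput between +3% and +5%). The names verdict_t and phase18_verdict are hypothetical.

```c
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Inputs are percent deltas (optimized vs. baseline), e.g. -15.0 for -15%. */
static inline verdict_t phase18_verdict(double d_throughput,
                                        double d_instructions,
                                        double d_icache) {
    /* NO-GO if any critical criterion is missed. */
    if (d_throughput < -2.0 || d_instructions > -5.0) {
        return VERDICT_NO_GO;
    }
    /* GO only if all three criteria hold. */
    if (d_throughput >= 5.0 && d_instructions <= -15.0 && d_icache <= -10.0) {
        return VERDICT_GO;
    }
    /* Assumption: everything in between is reported as NEUTRAL. */
    return VERDICT_NEUTRAL;
}
```

Checking NO-GO first means a throughput regression is never masked by a good instruction-count result.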

5. Reporting (required artifacts)

Create file: docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md

Include:

# Phase 18 v2: BENCH_MINIMAL — A/B Test Results

| Metric | Baseline | Optimized | Delta |
|--------|----------|-----------|-------|
| Throughput (mean) | X.XXM | X.XXM | +/-X.XX% |
| Throughput (σ) | X.XXM | X.XXM | % |
| Instructions | X.XB | X.XB | +/-X.X% |
| I-cache misses | XXXK | XXXK | +/-X.X% |
| Cycles | X.XB | X.XB | +/-X.X% |

## Verdict

GO / NEUTRAL / NO-GO with reasoning

Update: CURRENT_TASK.md (Phase 18 v2 status + next)


6. Important Notes

  • Ensure BENCH_MINIMAL=0 (default) is tested first
  • Both BENCH_MINIMAL=0 and BENCH_MINIMAL=1 must compile successfully
  • If instructions drop by less than 15%, the optimization is incomplete (check whether the compiler was already eliminating this code)
  • This is NOT a research-only knob; it's a valid bench build mode (safe to enable in CI)

7. If GO: Next Steps

  1. Promote BENCH_MINIMAL=1 as default for bench builds
  2. Prepare Phase 19 (next optimization frontier)
    • Possible targets: SIMD prefetch, lock-free structures, allocation batching
  3. Document learnings about allocator overhead vs memory latency

8. If NO-GO: Lessons

  • Allocator is memory-bound (IPC=2.30 constant across phases)
  • Instruction count reduction doesn't always yield throughput gains if memory latency dominates
  • Need architectural changes (cache-friendly layout, batching) rather than micro-optimizations