Phase 75-5: PGO Regeneration (C5/C6 Inline Slots Aware) — Next Instructions

Status: NEXT (HIGH PRIORITY)

Goal

Rebuild the FAST PGO SSOT binary (bench_random_mixed_hakmem_minimal_pgo) with a training profile that matches the current promoted defaults:

HAKMEM_WARM_POOL_SIZE=16
HAKMEM_TINY_C5_INLINE_SLOTS=1
HAKMEM_TINY_C6_INLINE_SLOTS=1

This is required because Phase 75-4 observed a large gap (the current baseline is about 14.1% below the historical one) between:

Phase 69 historical FAST baseline (62.63M ops/s)
Phase 75-4 current FAST PGO Point A baseline (53.81M ops/s)

SSOT Rules

Use scripts/run_mixed_10_cleanenv.sh as the harness.
Always pin the binary explicitly via BENCH_BIN=... to avoid Standard/FAST confusion.
Keep comparisons within the same binary when judging a single knob (C5/C6 OFF/ON).
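
The pinning rule above can be enforced with a small guard before each run. This is a minimal sketch: require_bench_bin is a hypothetical helper, not something the repo is known to provide, and it only checks that BENCH_BIN is set and points at an executable file.

```shell
# Hypothetical helper (assumption, not part of the repo): refuse to proceed
# unless BENCH_BIN explicitly names an executable file.
require_bench_bin() {
    if [ -z "${BENCH_BIN:-}" ]; then
        echo "BENCH_BIN is not set" >&2
        return 1
    fi
    if [ ! -x "$BENCH_BIN" ]; then
        echo "BENCH_BIN is not executable: $BENCH_BIN" >&2
        return 1
    fi
    # Echo the pinned binary so it ends up in the run log.
    echo "using BENCH_BIN=$BENCH_BIN"
}
```

Calling require_bench_bin at the top of a wrapper script makes an unpinned run fail loudly instead of silently measuring the wrong build.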

Step 1: Prepare training commands (C5/C6 ON)

Pick one of these approaches (A is preferred):

A) Training uses the harness (preferred)

Ensure the training workload exports the correct knobs:

export HAKMEM_WARM_POOL_SIZE=16
export HAKMEM_TINY_C5_INLINE_SLOTS=1
export HAKMEM_TINY_C6_INLINE_SLOTS=1

Then run the existing PGO training target (repo-specific; example):

make pgo-fast-full

B) Hard-pin knobs inside PGO training config (if needed)

If the training driver does not inherit the environment cleanly, update the PGO training config script to include:

HAKMEM_WARM_POOL_SIZE=16
HAKMEM_TINY_C5_INLINE_SLOTS=1
HAKMEM_TINY_C6_INLINE_SLOTS=1

Step 2: Validate the rebuilt binary

Run the Mixed SSOT 10-run benchmark on the FAST PGO binary:

BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo scripts/run_mixed_10_cleanenv.sh

Record the mean, median, and CV, and update the scorecard baseline if the result improved.
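
If the harness does not already print these statistics, they can be computed from the per-run throughput values. The sketch below assumes one value per line on stdin (how those values are extracted from the harness output is not specified here) and uses the sample standard deviation (n-1) for CV. summarize_runs is a hypothetical name.

```shell
# Hypothetical helper: reads one throughput value per line on stdin and
# prints mean, median, and CV (sample stddev / mean, in percent).
summarize_runs() {
    sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR < 2) { print "need at least 2 runs" > "/dev/stderr"; exit 1 }
            mean = sum / NR
            # Input is sorted, so the median is the middle element(s).
            if (NR % 2) median = v[(NR + 1) / 2]
            else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
            for (i = 1; i <= NR; i++) ss += (v[i] - mean) * (v[i] - mean)
            cv = 100 * sqrt(ss / (NR - 1)) / mean
            printf "mean=%.2f median=%.2f cv=%.2f%%\n", mean, median, cv
        }'
}
```

For example, `printf '30\n10\n20\n' | summarize_runs` prints `mean=20.00 median=20.00 cv=50.00%`.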

Step 3: Re-run Phase 75-4 matrix on FAST PGO (sanity)

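Run the 4-point matrix on FAST PGO to:

confirm Point D > Point A
quantify additivity (B/C contributions)

The four commands below cover the C5/C6 combinations OFF/OFF, ON/OFF, OFF/ON, and ON/ON, in that order.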

BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo \
  HAKMEM_TINY_C5_INLINE_SLOTS=0 HAKMEM_TINY_C6_INLINE_SLOTS=0 RUNS=10 \
  scripts/run_mixed_10_cleanenv.sh

BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo \
  HAKMEM_TINY_C5_INLINE_SLOTS=1 HAKMEM_TINY_C6_INLINE_SLOTS=0 RUNS=10 \
  scripts/run_mixed_10_cleanenv.sh

BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo \
  HAKMEM_TINY_C5_INLINE_SLOTS=0 HAKMEM_TINY_C6_INLINE_SLOTS=1 RUNS=10 \
  scripts/run_mixed_10_cleanenv.sh

BENCH_BIN=./bench_random_mixed_hakmem_minimal_pgo \
  HAKMEM_TINY_C5_INLINE_SLOTS=1 HAKMEM_TINY_C6_INLINE_SLOTS=1 RUNS=10 \
  scripts/run_mixed_10_cleanenv.sh
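
Once the four means are recorded, additivity can be quantified as the difference between the combined gain and the sum of the single-knob gains. This sketch assumes Points A, B, C, and D correspond to the four commands above in order (both OFF, C5 only, C6 only, both ON); additivity is a hypothetical helper name.

```shell
# Hypothetical helper. Usage: additivity A B C D  (mean ops/s per point)
# Assumes A = both OFF, B = C5 only, C = C6 only, D = both ON.
additivity() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
        gb = 100 * (b - a) / a   # gain of B over A, in percent
        gc = 100 * (c - a) / a   # gain of C over A, in percent
        gd = 100 * (d - a) / a   # gain of D over A, in percent
        # residual > 0: D gains more than B and C combined; < 0: less.
        printf "B=%+.2f%% C=%+.2f%% D=%+.2f%% residual=%+.2f%%\n", gb, gc, gd, gd - (gb + gc)
    }'
}
```

A residual near zero suggests the two knobs contribute independently. A clearly negative residual suggests they overlap or interfere.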

Step 4: If regression persists, do layout tax forensics

Use the forensics script to compare the Phase 69 best binary against the rebuilt one:

./scripts/box/layout_tax_forensics_box.sh \
  ./bench_random_mixed_hakmem_minimal_pgo_phase69_best \
  ./bench_random_mixed_hakmem_minimal_pgo

Then classify the regression by its dominant symptom:

IPC drop (>3%) → text layout / inlining / code placement issue
branch-miss spike (>10%) → hint mismatch / control-flow reshaping
cache/dTLB spike → data layout / TLS bloat / spill
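
The classification rules above can be applied mechanically once the percent deltas (new vs. Phase 69 best) are known. The sketch below is a hypothetical helper: it does not parse the forensics script output, whose format is not described here, so the deltas must be supplied by hand. The source gives no threshold for the cache/dTLB spike, so the default of 10% is a placeholder assumption that can be overridden with a fifth argument.

```shell
# Hypothetical helper.
# Usage: classify_layout_tax IPC_DELTA BRANCH_MISS_DELTA CACHE_MISS_DELTA DTLB_MISS_DELTA [SPIKE_PCT]
# All deltas are percent changes of the new binary relative to the old one.
classify_layout_tax() {
    awk -v ipc="$1" -v br="$2" -v cache="$3" -v dtlb="$4" -v spike="${5:-10}" 'BEGIN {
        n = 0
        if (ipc < -3) { print "text layout / inlining / code placement"; n++ }
        if (br > 10)  { print "hint mismatch / control-flow reshaping"; n++ }
        # spike threshold is an assumption; the rules above do not state one.
        if (cache > spike || dtlb > spike) { print "data layout / TLS bloat / spill"; n++ }
        if (n == 0) print "no class matched"
    }'
}
```

More than one line may be printed when several symptoms appear together, in which case the largest delta is the natural place to start.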

GO/NO-GO Gates

GO: FAST PGO baseline recovers significantly (target: close to the Phase 69 baseline of 62.63M ops/s), and Phase 75-4 D vs A remains ≥ +1.0%.
NEUTRAL: D vs A stays positive but baseline still low → keep investigating training config.
NO-GO: D vs A becomes negative → revert or rework inline slots integration for FAST builds.
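
The gates can be expressed as a small decision function. This is a sketch under stated assumptions: pgo_gate is a hypothetical name, and since "recovers significantly" has no numeric threshold above, the recovery target is taken as an explicit argument. Combinations the gates do not cover (for example, a recovered baseline with D vs A between 0% and +1.0%) are reported as UNDECIDED rather than guessed.

```shell
# Hypothetical helper.
# Usage: pgo_gate BASELINE_MOPS D_VS_A_PCT RECOVERY_TARGET_MOPS
# RECOVERY_TARGET_MOPS is chosen by the caller; the gates do not fix a value.
pgo_gate() {
    awk -v base="$1" -v dva="$2" -v target="$3" 'BEGIN {
        if (dva < 0)                            print "NO-GO"
        else if (base >= target && dva >= 1.0)  print "GO"
        else if (dva > 0 && base < target)      print "NEUTRAL"
        else                                    print "UNDECIDED"
    }'
}
```

An UNDECIDED result means the written gates need a human decision, not that the run failed.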
