hakmem/docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md
Moe Charm (CI) ad346f7885 Phase 18 v2: BENCH_MINIMAL design + instructions (instruction removal strategy)
## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression),
shift to instruction count reduction via compile-time removal:

- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput within ±3%, instruction reduction between 5% and 15%
NO-GO: throughput regression worse than 2%, instruction reduction below 5%

Key: If instructions do not drop by at least 15%, the allocator is not the bottleneck
and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal:
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes

3. A/B test with perf stat (must measure instruction reduction)

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees.
Section splitting alone (without symbol ordering, PGO, or linker script)
destroyed code locality and increased I-cache misses 91%.

Switching to direct instruction removal is safer and more predictable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 05:55:22 +09:00


Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Design

Context (Phase 18 v1 failed)

Phase 18 v1 attempted layout optimization using section splitting + GC (-ffunction-sections -fdata-sections -Wl,--gc-sections).

Result: CATASTROPHIC I-cache regression (+91%).

Root cause: Section-based splitting without explicit hot symbol ordering destroyed code locality.

Lesson: Layout tweaks are too fragile; instruction count reduction is the direct path.


0. Goal (High Impact)

Reduce instruction footprint per allocation/free by removing non-essential code paths at compile-time:

  • Stats collection (counter increments on every operation)
  • Environment variable checks (TLS lookups)
  • Debug logging (conditional output)

Expected impact:

  • Instruction count: -30-40% (benchmark workload)
  • I-cache misses: automatic improvement from smaller working set
  • Throughput: +10-20% (should close part of the Phase 17 gap, where the system allocator ran 48% fewer instructions than hakmem)

GO Criteria (strict):

  • Throughput: +5% minimum, +8% preferred
  • Instructions: -15% minimum (clear, measurable reduction)
  • I-cache: should improve proportionally to instruction reduction

If instructions do not drop meaningfully, abandon this phase.


1. Strategy: BENCH_MINIMAL Build Mode

1.1 What to Remove

Category A: Stats collection (highest frequency, low value for bench)

FRONT_FASTLANE_STAT_INC(malloc_total);      // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_hit);        // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_fallback_*); // ← conditional
free_tiny_fast_mono_stat_inc_hit();          // ← 1 per free

Category B: Environment variable checks (per-operation cost when not short-circuited)

if (hakmem_env_snapshot_enabled()) { ... }   // TLS + branch
if (front_fastlane_enabled()) { ... }        // TLS + branch

Category C: Debug logging / verbose output (rarely needed in bench)

hakmem_tls_trace_alloc();   // log path
hakmem_diag_record_fallback();

1.2 Compile-Time Gate

New build mode:

HAKMEM_BENCH_MINIMAL=0/1  (default 0, opt-in)

Activated via Makefile knob:

BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
  # ... applies to both bench binaries and shared lib
endif

Important: This knob only removes instrumentation; allocator behavior is unchanged, so the mode is safe to build and run. It stays opt-in (default 0) and is intended for benchmark builds.
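One way to make the gate robust is to normalize the macro in a shared header, so that `#if HAKMEM_BENCH_MINIMAL` behaves the same whether or not the Makefile passed `-D`. The fragment below is a minimal sketch under that assumption; the helper name `hakmem_bench_minimal_enabled` is hypothetical and does not appear in the original source.

```c
#include <assert.h>

/* Default the knob to 0 when the build did not pass -DHAKMEM_BENCH_MINIMAL=... */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Reject values other than 0/1 at compile time (C11 static_assert). */
static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

/* Hypothetical helper: lets a bench harness report which mode it was built in. */
static inline int hakmem_bench_minimal_enabled(void) {
    return HAKMEM_BENCH_MINIMAL;
}
```

A bench harness could print this value next to its results so that A/B runs are never compared across the wrong build modes.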

1.3 Safeguards

  1. Conditional compilation markers:

    • Wrap stats inside #if !HAKMEM_BENCH_MINIMAL
    • Wrap ENV checks inside #if !HAKMEM_BENCH_MINIMAL
    • Ensure code is still correct when HAKMEM_BENCH_MINIMAL=0 (normal operation)
  2. No behavioral changes:

    • Fast paths must remain identical (stats are just instrumentation)
    • Slow paths (fallback, cold) can be simplified (less instrumentation)
  3. Rollback: Single build knob, reversible


2. Implementation Plan

Phase 2.1: Stats removal (Priority 1)

File: core/box/front_fastlane_stats_box.h

#if HAKMEM_BENCH_MINIMAL
// Bench: stats compile away entirely
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { (void)0; } while (0)  // ← becomes no-op
#else
// Normal: relaxed atomic counter increment
#define FRONT_FASTLANE_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
#endif

File: core/front/malloc_tiny_fast.h

Search for all free_tiny_fast_mono_stat_* calls and wrap:

#if !HAKMEM_BENCH_MINIMAL
free_tiny_fast_mono_stat_inc_hit();
#endif

Expected savings: ~20-30 instructions/op (counter increments + memory sync)
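The stats gate can be exercised in isolation with a small stand-in. This is a sketch only: `demo_stats_t`, `demo_fast_path`, and `demo_stat_read_total` are hypothetical names, and the real `g_stats` layout may differ. With the knob at 0 the counters advance; with `-DHAKMEM_BENCH_MINIMAL=1` the macro expands to nothing and the counters stay at zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif
static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

/* Hypothetical stats layout; the real g_stats fields may differ. */
typedef struct {
    _Atomic uint64_t malloc_total;
    _Atomic uint64_t malloc_hit;
} demo_stats_t;

static demo_stats_t g_stats;

#if HAKMEM_BENCH_MINIMAL
/* Bench build: the increment disappears at compile time. */
#define FRONT_FASTLANE_STAT_INC(stat) do { (void)0; } while (0)
#else
/* Normal build: relaxed atomic increment of the named counter. */
#define FRONT_FASTLANE_STAT_INC(stat) \
    ((void)atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed))
#endif

/* Stand-in for a fast path: the return value never depends on the stats. */
static inline int demo_fast_path(int hit) {
    FRONT_FASTLANE_STAT_INC(malloc_total);
    if (hit) {
        FRONT_FASTLANE_STAT_INC(malloc_hit);
    }
    return hit;
}

/* Read back the total counter (stays 0 in a bench-minimal build). */
static inline uint64_t demo_stat_read_total(void) {
    return atomic_load_explicit(&g_stats.malloc_total, memory_order_relaxed);
}
```

Because the fast path returns the same value in both modes, a test that only checks allocator behavior passes unchanged under either setting; only the counter readout differs.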

Phase 2.2: Environment variable check removal (Priority 1)

File: core/box/hakmem_env_snapshot_box.h

#if !HAKMEM_BENCH_MINIMAL
// Normal: Check ENV at runtime
// (static inline so the header can be included from multiple translation units)
static inline bool hakmem_env_snapshot_enabled(void) {
    return g_env_snapshot_enabled;
}
#else
// Bench: Always enable snapshot (one-time cost)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;  // ← compile-time constant, eliminated by optimizer
}
#endif

File: core/bench_profile.h

Similar pattern for profile-related checks.

Expected savings: ~5-10 instructions/op (TLS lookups + branches)
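The effect of the constant-returning variant can be illustrated with a self-contained stand-in. This is a sketch under assumptions: the flag is modeled as a plain static `bool`, and `demo_set_env_snapshot` and `demo_pick_path` are hypothetical helpers. In the bench build the check is a constant, so an optimizing compiler can drop both the load and the branch in the caller.

```c
#include <assert.h>
#include <stdbool.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif
static_assert(HAKMEM_BENCH_MINIMAL == 0 || HAKMEM_BENCH_MINIMAL == 1,
              "HAKMEM_BENCH_MINIMAL must be 0 or 1");

/* Hypothetical stand-in for the runtime flag derived from the environment. */
static bool g_env_snapshot_enabled = false;

/* Hypothetical setter, standing in for one-time ENV parsing at startup. */
static inline void demo_set_env_snapshot(bool on) {
    g_env_snapshot_enabled = on;
}

static inline bool hakmem_env_snapshot_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    return true;                    /* constant: caller's branch folds away */
#else
    return g_env_snapshot_enabled;  /* runtime load + branch in the caller */
#endif
}

/* Hypothetical caller: 1 = snapshot path, 0 = legacy path. */
static inline int demo_pick_path(void) {
    if (hakmem_env_snapshot_enabled()) {
        return 1;
    }
    return 0;
}
```

Note the behavioral consequence: in a bench-minimal build the legacy path becomes unreachable regardless of the environment, so A/B runs should set the environment so that both builds take the snapshot path.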

Phase 2.3: Debug logging removal (Priority 2, if needed)

File: trace/logging functions

Conditional compilation for verbose paths (already often guarded by HAKMEM_BUILD_DEBUG).

Expected savings: ~0-5 instructions/op (rare paths, low impact for bench)


3. Risks / Mitigations

Risk A: Benchmark no longer representative

If we remove instrumentation, does the bench still measure the allocator?

Mitigation: BENCH_MINIMAL disables stats (instrumentation), not core logic.

  • Fast paths remain identical.
  • Only instrumentation overhead is removed.
  • This is similar to how production binaries disable debug tracing.

Risk B: Build regression

Conditional compilation could hide bugs.

Mitigation:

  • Ensure BENCH_MINIMAL=0 (default) always tested first.
  • Test both modes in CI if available.
  • Manual verification that code is syntactically correct in both modes.

Risk C: Instruction count doesn't drop

If removing stats + ENV checks doesn't clearly drop instructions, it means:

  • Compiler/LTO already optimizes these away in RELEASE builds
  • The overhead is elsewhere (memory access patterns, branch misprediction)

Mitigation: Run perf stat with BENCH_MINIMAL=0 and BENCH_MINIMAL=1 and compare instruction counts. If the reduction is less than 10%, reconsider the strategy.


4. Expected Impact

Conservative (stats only):

  • Throughput: +5-8%
  • Instructions: -20-30%
  • I-cache: improvement follows from instruction reduction

Ambitious (stats + ENV + debug):

  • Throughput: +10-20%
  • Instructions: -30-40%
  • I-cache misses: -20% to -30% (proportional to instruction reduction)

System binary ceiling: 85M ops/s

  • Baseline: 48M ops/s
  • Target: about 50-55M ops/s (via +5-15%)
  • About 53-58M ops/s with ambitious impact (+10-20%)
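The throughput projections are plain scaling of the baseline, which is easy to sanity-check. The helpers below are an illustrative sketch; `projected_ops` and `required_gain_pct` are hypothetical names, not part of hakmem.

```c
#include <assert.h>

/* Throughput after applying a percentage gain to a baseline (ops/s). */
static double projected_ops(double baseline_ops, double gain_pct) {
    assert(baseline_ops > 0.0);
    return baseline_ops * (1.0 + gain_pct / 100.0);
}

/* Percentage gain needed to move from a baseline to a target (ops/s). */
static double required_gain_pct(double baseline_ops, double target_ops) {
    assert(baseline_ops > 0.0);
    return (target_ops / baseline_ops - 1.0) * 100.0;
}
```

For example, `projected_ops(48e6, 5.0)` is 50.4M ops/s, and reaching 52M ops/s from a 48M baseline requires a gain of roughly 8.3%.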

5. Box Theory Compliance

Box: BenchMinimalBox

  • Boundary: HAKMEM_BENCH_MINIMAL=0/1 (compile-time)
  • Scope: Instrumentation removal only (no algorithm changes)
  • Rollback: Single build knob (default OFF, backward compatible)
  • Observability: Perf stat before/after (instructions, I-cache, throughput)

6. Success Criteria

GO (Proceed):

  • Throughput: +5% or more
  • Instructions: -15% or more (smoking gun for success)
  • I-cache: should improve proportionally

NEUTRAL (keep as research box):

  • Throughput: ±3% (within noise)
  • Instructions: -5% to -15% (marginal)

NO-GO (freeze):

  • Throughput: regression worse than 2%, or no improvement
  • Instructions: reduction below 5% (failed optimization objective)
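The decision rule can be written down as a small classifier over the two perf stat runs. This is a sketch with stated assumptions: the criteria leave some combinations unspecified (for example, throughput between +3% and +5%), and here anything that is neither GO nor NO-GO falls through to NEUTRAL. The I-cache criterion is not encoded, and all names are hypothetical.

```c
#include <assert.h>

typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_NO_GO } verdict_t;

/* Percent change from baseline to candidate; positive means an increase. */
static double pct_change(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (candidate - baseline) / baseline * 100.0;
}

/* Classify an A/B result from raw throughput (ops/s) and instruction counts.
 * Assumption: cases not covered by GO or NO-GO are treated as NEUTRAL. */
static verdict_t phase18_verdict(double base_ops, double minimal_ops,
                                 double base_insns, double minimal_insns) {
    double tput_gain = pct_change(base_ops, minimal_ops);
    double insn_drop = -pct_change(base_insns, minimal_insns);

    if (tput_gain >= 5.0 && insn_drop >= 15.0) {
        return VERDICT_GO;      /* both GO thresholds met */
    }
    if (tput_gain < -2.0 || insn_drop < 5.0) {
        return VERDICT_NO_GO;   /* regression, or instructions barely moved */
    }
    return VERDICT_NEUTRAL;     /* marginal: keep as research box */
}
```

Feeding it the medians of several runs, rather than single samples, keeps the ±3% noise band from flipping the verdict.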

7. Files to Modify

  1. core/box/hot_text_attrs_box.h (update to add BENCH_MINIMAL support)

    • Already created in Phase 18 v1
  2. core/box/front_fastlane_stats_box.h (stats conditional)

    • Update macro to be no-op when BENCH_MINIMAL=1
  3. core/front/malloc_tiny_fast.h (free stats wrapper)

    • Wrap free_tiny_fast_mono_stat_* calls
  4. core/box/hakmem_env_snapshot_box.h (ENV check simplification)

    • Make hakmem_env_snapshot_enabled() constant when BENCH_MINIMAL=1
  5. core/bench_profile.h (profile checks)

    • Simplify profile related checks
  6. Makefile

    • Add BENCH_MINIMAL ?= 0 knob
    • Apply -DHAKMEM_BENCH_MINIMAL=1 when enabled

8. Notes

  • Production safety: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
  • Bench-only: This mode is intended for benchmark binary builds, not shipped libraries.
  • Phase 17 learning: The system allocator executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves a 30% to 40% reduction, we should see +10-20% throughput.

9. Next Steps After Phase 18 v2

If GO (+5%+):

  • Promote BENCH_MINIMAL=1 as the new baseline for bench builds
  • Update CURRENT_TASK with Phase 18 v2 victory
  • Prepare Phase 19 (next optimization frontier)

If NEUTRAL:

  • Investigate why instructions didn't drop (compiler already optimized?)
  • Consider aggressive options: SIMD, allocation batching, lock-free

If NO-GO:

  • Acknowledge that Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
  • Shift focus to broader architectural changes