# Phase 18 v2: Hot Text Isolation — BENCH_MINIMAL Design
## Context (Phase 18 v1 failed)

Phase 18 v1 attempted layout optimization using section splitting + GC (`-ffunction-sections -fdata-sections -Wl,--gc-sections`).

**Result**: CATASTROPHIC I-cache regression (+91%).

**Root cause**: Section-based splitting without explicit hot symbol ordering destroyed code locality.

**Lesson**: Layout tweaks are too fragile; **instruction count reduction is the direct path**.

---

## 0. Goal (High Impact)

Reduce **instruction footprint** per allocation/free by removing non-essential code paths at compile time:
- Stats collection (counter increments on every operation)
- Environment variable checks (TLS lookups)
- Debug logging (conditional output)

**Expected impact**:
- Instruction count: -30% to -40% (benchmark workload)
- I-cache misses: should drop as a side effect of the smaller working set
- Throughput: +10-20% (Phase 17 found the system allocator executing 48% fewer instructions than hakmem; closing part of that gap should raise throughput roughly in proportion)

**GO Criteria** (strict):
- Throughput: +5% minimum, +8% preferred
- Instructions: -15% minimum (clear, measurable reduction)
- I-cache: should improve proportionally to instruction reduction

If instructions do not drop meaningfully, abandon this phase.

---

## 1. Strategy: BENCH_MINIMAL Build Mode

### 1.1 What to Remove

**Category A: Stats collection** (highest frequency, low value for bench)
```c
FRONT_FASTLANE_STAT_INC(malloc_total);       // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_hit);         // ← 1 per malloc
FRONT_FASTLANE_STAT_INC(malloc_fallback_*);  // ← conditional
free_tiny_fast_mono_stat_inc_hit();          // ← 1 per free
```

**Category B: Environment variable checks** (per-operation cost if not short-circuited)
```c
if (hakmem_env_snapshot_enabled()) { ... }  // TLS + branch
if (front_fastlane_enabled()) { ... }       // TLS + branch
```

**Category C: Debug logging / verbose output** (rarely needed in bench)
```c
hakmem_tls_trace_alloc();        // log path
hakmem_diag_record_fallback();
```

### 1.2 Compile-Time Gate

New build mode:
```
HAKMEM_BENCH_MINIMAL=0/1  (default 0, opt-in)
```

Activated via Makefile knob:
```makefile
BENCH_MINIMAL ?= 0
ifeq ($(BENCH_MINIMAL),1)
  CFLAGS += -DHAKMEM_BENCH_MINIMAL=1
  # ... applies to both bench binaries and shared lib
endif
```

**Important**: This is **not** a research-only knob. It's a build mode that's safe to ship (just disables instrumentation).

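
The guards in this document use `#if !HAKMEM_BENCH_MINIMAL`. The preprocessor treats an undefined macro as 0 there, but that relies on an implicit default and warns under `-Wundef`. A minimal sketch of an explicit default, assuming some central config header is included before the guarded code (the header placement and the helper `hakmem_bench_minimal_active` are hypothetical, not taken from the codebase):

```c
/* Hypothetical defaulting block for a central config header. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0   /* default: full instrumentation */
#endif

/* Reject anything other than the two documented values. */
#if HAKMEM_BENCH_MINIMAL != 0 && HAKMEM_BENCH_MINIMAL != 1
#error "HAKMEM_BENCH_MINIMAL must be 0 or 1"
#endif

/* Illustrative helper (assumption): lets a test or bench banner
 * report which mode the binary was built in. */
static inline int hakmem_bench_minimal_active(void) {
    return HAKMEM_BENCH_MINIMAL;
}
```

With a block like this, a stray `-DHAKMEM_BENCH_MINIMAL=2` fails at compile time instead of silently selecting a mode.
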
### 1.3 Safeguards

1. **Conditional compilation markers**:
   - Wrap stats inside `#if !HAKMEM_BENCH_MINIMAL`
   - Wrap ENV checks inside `#if !HAKMEM_BENCH_MINIMAL`
   - Ensure code is still correct when `HAKMEM_BENCH_MINIMAL=0` (normal operation)

2. **No behavioral changes**:
   - Fast paths must remain identical (stats are just instrumentation)
   - Slow paths (fallback, cold) can be simplified (less instrumentation)

3. **Rollback**: Single build knob, reversible

---

## 2. Implementation Plan

### Phase 2.1: Stats removal (Priority 1)

**File**: `core/box/front_fastlane_stats_box.h`

```c
#if HAKMEM_BENCH_MINIMAL
// Bench mode: the macro expands to a no-op
#define FRONT_FASTLANE_STAT_INC(stat) \
    do { (void)0; } while (0)
#else
// Normal mode: relaxed atomic increment (requires <stdatomic.h>)
#define FRONT_FASTLANE_STAT_INC(stat) \
    atomic_fetch_add_explicit(&g_stats.stat, 1, memory_order_relaxed)
#endif
```

**File**: `core/front/malloc_tiny_fast.h`

Search for all `free_tiny_fast_mono_stat_*` calls and wrap:
```c
#if !HAKMEM_BENCH_MINIMAL
    free_tiny_fast_mono_stat_inc_hit();
#endif
```

**Expected savings**: ~20-30 instructions/op (counter increments + atomic memory operations)

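
An alternative to wrapping every call site in `#if` is to give the stat helper an empty inline body in bench mode, so call sites stay unguarded. The sketch below illustrates that variant under stated assumptions: every `demo_*` name and the counter `g_demo_free_hits` are hypothetical stand-ins, not the real hakmem symbols.

```c
#include <stdatomic.h>
#include <stdint.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

/* Hypothetical counter; the real storage in hakmem may differ. */
static _Atomic uint64_t g_demo_free_hits;

#if HAKMEM_BENCH_MINIMAL
/* Bench mode: empty inline body, so the call compiles to nothing. */
static inline void demo_free_stat_inc_hit(void) { }
#else
/* Normal mode: relaxed atomic increment. */
static inline void demo_free_stat_inc_hit(void) {
    atomic_fetch_add_explicit(&g_demo_free_hits, 1, memory_order_relaxed);
}
#endif

/* Stand-in for a free() fast path: the call site needs no #if. */
static inline void demo_free_fast(void *p) {
    (void)p;                  /* real code would return p to a freelist */
    demo_free_stat_inc_hit();
}

/* Read the counter back (always 0 in bench mode). */
static inline uint64_t demo_free_hits(void) {
    return atomic_load_explicit(&g_demo_free_hits, memory_order_relaxed);
}
```

The trade-off is that the guard lives in one place, at the cost of relying on the compiler to inline the empty function, which is routine at `-O2` but worth confirming in the disassembly.
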
### Phase 2.2: Environment variable check removal (Priority 1)

**File**: `core/box/hakmem_env_snapshot_box.h`

```c
#include <stdbool.h>

#if !HAKMEM_BENCH_MINIMAL
// Normal: read the flag derived from ENV at runtime
// (static inline avoids duplicate definitions when the header is included in several files)
static inline bool hakmem_env_snapshot_enabled(void) {
    return g_env_snapshot_enabled;
}
#else
// Bench: Always enable snapshot (one-time cost)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;  // ← compile-time constant, branch eliminated by optimizer
}
#endif
```

**File**: `core/bench_profile.h`

Similar pattern for profile-related checks.

**Expected savings**: ~5-10 instructions/op (TLS lookups + branches)

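
To illustrate why a constant predicate removes work at the call site, here is a minimal sketch of a caller branching on such a predicate. The names `demo_snapshot_enabled`, `g_demo_snapshot_enabled`, and `demo_pick_path` are hypothetical; the real call sites in hakmem are assumed to look broadly similar.

```c
#include <stdbool.h>

#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

#if HAKMEM_BENCH_MINIMAL
/* Bench mode: constant predicate, foldable at compile time. */
static inline bool demo_snapshot_enabled(void) { return true; }
#else
/* Hypothetical flag, assumed to be set once at init from the environment. */
static bool g_demo_snapshot_enabled = false;
static inline bool demo_snapshot_enabled(void) { return g_demo_snapshot_enabled; }
#endif

/* Hypothetical caller: returns 1 for the snapshot path, 0 for the legacy path. */
static inline int demo_pick_path(void) {
    if (demo_snapshot_enabled()) {
        return 1;   /* snapshot path */
    }
    return 0;       /* legacy path: dead code when the predicate is constant true */
}
```

One caveat of this pattern: forcing the predicate to `true` changes behavior whenever the runtime default would have been off, so bench-mode numbers should be compared against a normal build that has the same feature enabled.
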
### Phase 2.3: Debug logging removal (Priority 2, if needed)

**File**: trace/logging functions

Conditional compilation for verbose paths (often already guarded by `HAKMEM_BUILD_DEBUG`).

**Expected savings**: ~0-5 instructions/op (rare paths, low impact for bench)

---

## 3. Risks / Mitigations

### Risk A: Benchmark no longer representative

If we remove instrumentation, does the bench still measure the allocator?

**Mitigation**: BENCH_MINIMAL disables instrumentation (stats, ENV checks, debug logging), not core logic.
- Fast paths remain identical.
- Only instrumentation overhead is removed.
- This is similar to how production binaries disable debug tracing.

### Risk B: Build regression

Conditional compilation could hide bugs.

**Mitigation**:
- Ensure `BENCH_MINIMAL=0` (default) is always tested first.
- Test both modes in CI if available.
- Manually verify that the code compiles in both modes.

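
One way to reduce the "hidden bug" risk is to keep the macro argument type-checked even when it expands to nothing at run time. The sketch below uses `sizeof`, whose operand is not evaluated here, so a misspelled counter name still fails to compile in bench mode. The struct, field names, and `DEMO_*` macros are illustrative assumptions, and both variants are defined side by side only so they can be compared in a single file.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stats struct; field names are illustrative only. */
struct demo_stats {
    _Atomic uint64_t malloc_total;
    _Atomic uint64_t malloc_hit;
};
static struct demo_stats g_demo_stats;

/* Full mode: relaxed atomic increment. */
#define DEMO_STAT_INC_FULL(stat) \
    ((void)atomic_fetch_add_explicit(&g_demo_stats.stat, 1, memory_order_relaxed))

/* Minimal mode: sizeof does not evaluate its operand, so no code is
 * emitted, but an unknown field name is still a compile error. */
#define DEMO_STAT_INC_MINIMAL(stat) \
    ((void)sizeof(g_demo_stats.stat))

/* Exercise both variants and return the resulting malloc_total. */
static inline uint64_t demo_exercise_both(void) {
    DEMO_STAT_INC_FULL(malloc_total);     /* increments the counter */
    DEMO_STAT_INC_MINIMAL(malloc_total);  /* no effect at run time */
    DEMO_STAT_INC_MINIMAL(malloc_hit);    /* type-checked only */
    return atomic_load_explicit(&g_demo_stats.malloc_total, memory_order_relaxed);
}
```

Compared with `do { (void)0; } while (0)`, this keeps typos such as a renamed counter from surviving unnoticed in the build mode that CI exercises less often.
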
### Risk C: Instruction count doesn't drop

If removing stats + ENV checks doesn't clearly drop instructions, it means one of:
- Compiler/LTO already optimizes these away in RELEASE builds
- The overhead is elsewhere (memory access patterns, branch misprediction)

**Mitigation**: Run `perf stat` with `BENCH_MINIMAL=1` to confirm instruction reduction. If the reduction is smaller than 10%, reconsider the strategy.

---

## 4. Expected Impact

**Conservative** (stats only):
- Throughput: +5-8%
- Instructions: -20% to -30%
- I-cache: improvement follows from instruction reduction

**Ambitious** (stats + ENV + debug):
- Throughput: +10-20%
- Instructions: -30% to -40%
- I-cache misses: -20% to -30% (proportional to instruction reduction)

**System binary ceiling**: 85M ops/s
- Baseline: 48M ops/s
- Target: 52-55M ops/s (roughly +8-15% over baseline)
- 60-68M ops/s would require +25-42%, beyond the ambitious estimate (+10-20% gives about 53-58M ops/s)

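
As a quick arithmetic check on the throughput figures in this section, the sketch below projects throughput from the 48M ops/s baseline and computes the gain needed to reach a given target. The function names are hypothetical helpers introduced only for this illustration. Worked by hand: 48 × 1.05 = 50.4, 48 × 1.15 = 55.2, and 48 × 1.20 = 57.6M ops/s, while 60M and 68M ops/s correspond to gains of +25% and about +41.7%.

```c
/* Projected throughput (M ops/s) after a relative gain given in percent. */
static inline double demo_project_mops(double baseline_mops, double gain_pct) {
    return baseline_mops * (1.0 + gain_pct / 100.0);
}

/* Relative gain (percent) needed to move from baseline to target. */
static inline double demo_required_gain_pct(double baseline_mops, double target_mops) {
    return (target_mops / baseline_mops - 1.0) * 100.0;
}
```
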
---
## 5. Box Theory Compliance

**Box**: BenchMinimalBox
- **Boundary**: `HAKMEM_BENCH_MINIMAL=0/1` (compile-time)
- **Scope**: Instrumentation removal only (no algorithm changes)
- **Rollback**: Single build knob (default OFF, backward compatible)
- **Observability**: `perf stat` before/after (instructions, I-cache, throughput)

---

## 6. Success Criteria

### GO (Proceed):
- Throughput: **+5% or more**
- Instructions: **reduced by 15% or more** (smoking gun for success)
- I-cache: should improve proportionally

### NEUTRAL (keep as research box):
- Throughput: ±3% (within noise)
- Instructions: -5% to -15% (marginal)

### NO-GO (freeze):
- Throughput: regression worse than -2%, or no improvement
- Instructions: reduced by less than 5% (failed optimization objective)

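
The thresholds above can be sketched as a small decision helper. This is one possible reading, not a definitive rule: the criteria leave gaps (for example, throughput between +3% and +5%) and overlaps (0% throughput is both "no improvement" and "within noise"), so this sketch assumes GO needs both GO conditions, NO-GO triggers on a throughput regression below -2% or an instruction reduction under 5%, and everything else counts as NEUTRAL. All names are hypothetical.

```c
typedef enum { DEMO_GO, DEMO_NEUTRAL, DEMO_NO_GO } demo_verdict;

/* Both inputs are percent changes relative to the BENCH_MINIMAL=0 baseline:
 * throughput_pct > 0 means faster, instr_pct < 0 means fewer instructions. */
static inline demo_verdict demo_classify(double throughput_pct, double instr_pct) {
    if (throughput_pct >= 5.0 && instr_pct <= -15.0) {
        return DEMO_GO;       /* both GO thresholds met */
    }
    if (throughput_pct < -2.0 || instr_pct > -5.0) {
        return DEMO_NO_GO;    /* regression, or instructions barely moved */
    }
    return DEMO_NEUTRAL;      /* assumption: everything in between */
}
```
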
---
## 7. Files to Modify

1. **`core/box/hot_text_attrs_box.h`** (update to add BENCH_MINIMAL support)
   - Already created in Phase 18 v1

2. **`core/box/front_fastlane_stats_box.h`** (stats conditional)
   - Update macro to be no-op when BENCH_MINIMAL=1

3. **`core/front/malloc_tiny_fast.h`** (free stats wrapper)
   - Wrap `free_tiny_fast_mono_stat_*` calls

4. **`core/box/hakmem_env_snapshot_box.h`** (ENV check simplification)
   - Make `hakmem_env_snapshot_enabled()` constant when BENCH_MINIMAL=1

5. **`core/bench_profile.h`** (profile checks)
   - Simplify profile-related checks

6. **`Makefile`**
   - Add `BENCH_MINIMAL ?= 0` knob
   - Apply `-DHAKMEM_BENCH_MINIMAL=1` when enabled

---

## 8. Notes

- **Production safety**: BENCH_MINIMAL is DISABLED by default. Production builds use full instrumentation.
- **Bench-only**: This mode is intended for benchmark builds. Although it only removes instrumentation and is safe to ship (Section 1.2), shipped libraries keep the default `BENCH_MINIMAL=0`.
- **Phase 17 learning**: The system allocator executed 48% fewer instructions than hakmem. If BENCH_MINIMAL achieves -30% to -40%, we should see +10-20% throughput.

---

## 9. Next Steps After Phase 18 v2

### If GO (+5% or more):
- Promote BENCH_MINIMAL=1 as the new baseline for bench builds
- Update CURRENT_TASK with Phase 18 v2 victory
- Prepare Phase 19 (next optimization frontier)

### If NEUTRAL:
- Investigate why instructions didn't drop further (compiler already optimized?)
- Consider aggressive options: SIMD, allocation batching, lock-free structures

### If NO-GO:
- Acknowledge that the Phase 17 +74% gap is primarily memory-bound (IPC=2.30)
- Shift focus to broader architectural changes