# Phase 78-0: SSOT Verification & Phase 78-1 Plan

## Phase 78-0 Complete: ✅ SSOT Verified

### Verification Results (Single Run)

- **Binary**: `./bench_random_mixed_hakmem` (Standard, C4/C5/C6 ON, C3 OFF)
- **Configuration**: `HAKMEM_ROUTE_BANNER=1`, `HAKMEM_MEASURE_UNIFIED_CACHE=1`
- **Workload**: 20M iterations, WS=400, 16-1040B mixed allocations

### Route Configuration

- `unified_cache_enabled = 1` ✓
- `warm_pool_max_per_class = 12` ✓
- All routes = LEGACY (correct for Phase 76-2 state) ✓

### Unified Cache Statistics (Per-Class)

| Class | Hits | Misses | Interpretation |
|-------|------|--------|----------------|
| C4 | 0 | 1 | Inline slots active (full interception) ✓ |
| C5 | 0 | 1 | Inline slots active (full interception) ✓ |
| C6 | 0 | 1 | Inline slots active (full interception) ✓ |

### Critical Insight

**Zero unified_cache hits for C4/C5/C6 = Expected and Correct**

The inline slots are working as intended:

- In steady state, inline slots intercept 100% of C4/C5/C6 traffic
- That traffic never reaches unified_cache on the normal allocation path
- The single miss per class occurs only during initialization/drain, not in steady state

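The interpretation above can be written down as a small check. This is a minimal sketch, not the project's measurement code: the `uc_class_stats_t` struct and the `inline_slots_fully_intercepting()` helper are hypothetical names introduced here, and the allowance of one init/drain miss per class is taken from the table above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-class unified_cache counters (not the real hakmem struct). */
typedef struct {
    uint64_t hits;
    uint64_t misses;
} uc_class_stats_t;

/*
 * Returns true when the counters match the "full interception" pattern:
 * no unified_cache hits at all, and no more misses than the number
 * expected from initialization/drain (1 per class in the run above).
 */
static bool inline_slots_fully_intercepting(const uc_class_stats_t *s,
                                            uint64_t max_init_misses) {
    return s->hits == 0 && s->misses <= max_init_misses;
}
```

With the values from the table (`hits = 0`, `misses = 1`) and `max_init_misses = 1`, the check passes for C4, C5, and C6. A nonzero hit count would indicate traffic leaking past the inline slots.
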
### Throughput Baseline

- **40.50 M ops/s** (confirms Phase 76-2 SSOT baseline intact)

### GATE DECISION

✅ **GO TO PHASE 78-1**

SSOT state verified:

- C4/C5/C6 inline slots confirmed active
- Traffic interception pattern correct
- Ready for per-op overhead optimization

---

## Phase 78-1: Per-Op Decision Overhead Removal

### Problem Statement

The current inline slot enable checks (`tiny_c4/c5/c6_inline_slots_enabled()`) add per-operation overhead:

```c
// Current (Phase 76-1): Called on EVERY alloc/free
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    // tiny_c4_inline_slots_enabled() = function call + cached static check
}
```

Each operation pays for:

1. Function call overhead
2. Static variable load (`g_c4_inline_slots_enabled`)
3. Lazy-init comparison (`== -1`): minimal but measurable

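To make the three costs concrete, here is a minimal sketch of a lazy-init enable check of the kind described above. It is an illustration, not the actual hakmem source: the ENV name `HAKMEM_TINY_C4_INLINE_SLOTS` and the "value `1` means enabled" parsing rule are assumptions made for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical ENV name, used only in this sketch. */
#define C4_ENV_NAME "HAKMEM_TINY_C4_INLINE_SLOTS"

/* -1 = not read yet, 0 = disabled, 1 = enabled (lazy-init sentinel). */
static int g_c4_inline_slots_enabled = -1;

/* Lazy-init check: the ENV is read on the first call only, but the
 * load of the static and the sentinel comparison run on every call. */
static int tiny_c4_inline_slots_enabled(void) {
    if (g_c4_inline_slots_enabled == -1) {
        const char *v = getenv(C4_ENV_NAME);
        g_c4_inline_slots_enabled = (v != NULL && strcmp(v, "1") == 0) ? 1 : 0;
    }
    return g_c4_inline_slots_enabled;
}
```

After the first call the ENV is never consulted again, so the remaining steady-state cost is exactly the call, the load, and the `== -1` comparison listed above.
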
### Solution: Fixed Mode Optimization

**New ENV**: `HAKMEM_TINY_INLINE_SLOTS_FIXED=0/1` (default OFF for conservative testing)

When `FIXED=1`:

1. At program startup (via `bench_profile_apply`), read all C4/C5/C6 ENVs once
2. Cache the decisions in static globals: `g_c4_inline_slots_fixed_mode`, etc.
3. Hot path: read the global directly instead of calling the lazy-init function (no function call and no sentinel comparison per operation)

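The three steps above can be sketched as follows. The names `g_c4/c5/c6_inline_slots_fixed_mode`, `tiny_inline_slots_fixed_mode_init()`, and `tiny_c4_inline_slots_enabled_fast()` come from this plan. The per-class ENV names, the `env_flag()` helper, and the parsing rule are assumptions for illustration only, and the fallback to the lazy-init check when `FIXED=0` is deliberately omitted.

```c
#include <stdlib.h>
#include <string.h>

/* Decisions cached once at startup; they stay 0 until init runs. */
static int g_c4_inline_slots_fixed_mode = 0;
static int g_c5_inline_slots_fixed_mode = 0;
static int g_c6_inline_slots_fixed_mode = 0;

/* Hypothetical helper: "1" means enabled, anything else disabled. */
static int env_flag(const char *name) {
    const char *v = getenv(name);
    return (v != NULL && strcmp(v, "1") == 0) ? 1 : 0;
}

/* Steps 1 and 2: read every per-class ENV once and cache the result.
 * The ENV names below are assumptions made for this sketch. */
void tiny_inline_slots_fixed_mode_init(void) {
    g_c4_inline_slots_fixed_mode = env_flag("HAKMEM_TINY_C4_INLINE_SLOTS");
    g_c5_inline_slots_fixed_mode = env_flag("HAKMEM_TINY_C5_INLINE_SLOTS");
    g_c6_inline_slots_fixed_mode = env_flag("HAKMEM_TINY_C6_INLINE_SLOTS");
}

/* Step 3: hot-path checks are plain global reads. */
static inline int tiny_c4_inline_slots_enabled_fast(void) {
    return g_c4_inline_slots_fixed_mode;
}
static inline int tiny_c5_inline_slots_enabled_fast(void) {
    return g_c5_inline_slots_fixed_mode;
}
static inline int tiny_c6_inline_slots_enabled_fast(void) {
    return g_c6_inline_slots_fixed_mode;
}
```

One consequence of this design is that ENV changes made after `tiny_inline_slots_fixed_mode_init()` has run are not observed, which is the intended trade-off of a fixed mode.
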
### Expected Performance Impact

- **Optimistic**: +1.5% to +3.0% (per-op decision overhead eliminated)
- **Realistic**: +0.5% to +1.5% (modern CPUs speculate through branches well)
- **Conservative**: +0.1% to +0.5% (if branch prediction already hides most of the cost)

### Implementation Checklist

#### Phase 78-1a: Create Fixed Mode Box

- ✓ Created: `core/box/tiny_inline_slots_fixed_mode_box.h`
- Global caching variables: `g_c4/c5/c6_inline_slots_fixed_mode`
- Initialization function: `tiny_inline_slots_fixed_mode_init()`
- Fast path functions: `tiny_c4_inline_slots_enabled_fast()`, etc.

#### Phase 78-1b: Update Alloc Path (tiny_front_hot_box.h)

- Replace `tiny_c4/c5/c6_inline_slots_enabled()` with fast versions
- Add include: `#include "tiny_inline_slots_fixed_mode_box.h"`
- Update enable checks to use `_fast()` suffix

#### Phase 78-1c: Update Free Path (tiny_legacy_fallback_box.h)

- Replace `tiny_c4/c5/c6_inline_slots_enabled()` with fast versions
- Add include: `#include "tiny_inline_slots_fixed_mode_box.h"`
- Update enable checks to use `_fast()` suffix

#### Phase 78-1d: Initialize at Program Startup

- Option 1: Call `tiny_inline_slots_fixed_mode_init()` from `bench_profile_apply()`
- Option 2: Call from `hakmem_tiny_init_thread()` (TLS init time)
- Recommended: Option 1 (once at program startup, not per-thread)

#### Phase 78-1e: A/B Test

- **Baseline**: `HAKMEM_TINY_INLINE_SLOTS_FIXED=0` (default, Phase 76-2 behavior)
- **Treatment**: `HAKMEM_TINY_INLINE_SLOTS_FIXED=1` (fixed mode optimization)
- **GO Threshold**: +1.0% (same as Phase 77-1, same binary)
- **Runs**: 10 per configuration (WS=400, 20M iterations)

### Code Pattern

#### Alloc Path (tiny_front_hot_box.h)

```c
#include "tiny_inline_slots_fixed_mode_box.h"  // NEW

// In tiny_hot_alloc_fast():
// Phase 78-1: C3 inline slots with fixed mode
if (class_idx == 3 && tiny_c3_inline_slots_enabled_fast()) {  // CHANGED: use _fast()
    // ...
}

// Phase 76-1: C4 Inline Slots with fixed mode
if (class_idx == 4 && tiny_c4_inline_slots_enabled_fast()) {  // CHANGED: use _fast()
    // ...
}
```

#### Initialization (bench_profile.h or hakmem_tiny.c)

```c
extern void tiny_inline_slots_fixed_mode_init(void);

void bench_apply_profile(void) {
    // ... existing code ...

    // Phase 78-1: Initialize fixed mode if enabled
    if (tiny_inline_slots_fixed_enabled()) {
        tiny_inline_slots_fixed_mode_init();
    }
}
```

### Rationale for This Optimization

1. **Proven Optimization**: C4/C5/C6 are locked to SSOT (+7.05% cumulative)
2. **Per-Op Overhead Matters**: Hot path executes 20M+ times per benchmark
3. **Low Risk**: Backward compatible (FIXED=0 is default, restores Phase 76-1 behavior)
4. **Architectural Fit**: Aligns with Box Pattern (single responsibility at initialization)
5. **Foundation for Future**: Can apply same technique to other per-op decisions

### Risk Assessment

**Low Risk**:

- Backward compatible (FIXED=0 by default)
- No change to inline slots logic, only to the enable checks
- Can be disabled quickly via ENV (FIXED=0)
- A/B testing validates correctness

**Potential Issues**:

- The compiler may already have eliminated the overhead we are trying to remove, for example by inlining the enable check (judged unlikely; the A/B test would show this as a sub-threshold gain)
- Cache coherency on multi-socket systems (the fixed-mode globals are written once at startup and only read afterwards, so this is unlikely to affect performance)

### Success Criteria

✅ **PASS** (≥ +1.0%):

- Implementation complete
- A/B test shows a gain of +1.0% or more
- Promote FIXED to default
- Document in PHASE78_1 results

⚠️ **MARGINAL** (+0.3% to < +1.0%):

- Measurable gain but below threshold
- Keep as optional optimization (FIXED=0 default)
- Investigate CPU branch prediction effectiveness

❌ **FAIL** (< +0.3%):

- Compiler/CPU already eliminated the overhead
- Revert to Phase 76-1 behavior (simpler code)
- Explore alternative optimizations (Phase 79+)

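The gate can be expressed as a small helper that turns the per-configuration means into a relative gain and maps it onto the three outcomes. This is a sketch under stated assumptions: the function and enum names are hypothetical, the gain is computed from the arithmetic mean of the runs, and any gain from +0.3% up to but not including +1.0% is treated as MARGINAL.

```c
#include <stddef.h>

typedef enum { GATE_FAIL, GATE_MARGINAL, GATE_PASS } gate_result_t;

/* Arithmetic mean of n throughput samples (M ops/s); 0.0 for n == 0. */
static double mean_throughput(const double *samples, size_t n) {
    if (n == 0) return 0.0;
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += samples[i];
    return sum / (double)n;
}

/* Relative gain of treatment over baseline, in percent. */
static double gain_percent(double baseline_mean, double treatment_mean) {
    return (treatment_mean / baseline_mean - 1.0) * 100.0;
}

/* Thresholds from the success criteria: >= +1.0% PASS,
 * >= +0.3% MARGINAL, otherwise FAIL. */
static gate_result_t classify_gain(double gain_pct) {
    if (gain_pct >= 1.0) return GATE_PASS;
    if (gain_pct >= 0.3) return GATE_MARGINAL;
    return GATE_FAIL;
}
```

Against the 40.50 M ops/s baseline, the +1.0% threshold corresponds to roughly 40.9 M ops/s for the `FIXED=1` mean. In practice the two inputs would be the means of the 10 runs per configuration.
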
---

## Next Steps

1. **Implement Phase 78-1** (if approved):
   - Update `tiny_c4/c5/c6_inline_slots_env_box.h` to check fixed mode
   - Update `tiny_front_hot_box.h` and `tiny_legacy_fallback_box.h`
   - Add initialization call to `bench_profile_apply()`
   - Build and test

2. **Run Phase 78-1 A/B Test** (10 runs per configuration)

3. **Decision Gate**:
   - ✅ ≥ +1.0% → Promote to SSOT
   - ⚠️ +0.3% to < +1.0% → Keep optional
   - ❌ < +0.3% → Revert (keep Phase 76-1 as is)

4. **Phase 79+**: If Phase 78-1 ≥ +1.0%, continue with alternative optimization axes

---

## Summary Table

| Phase | Focus | Result | Decision |
|-------|-------|--------|----------|
| 77-0 | C0-C3 Volume | C3 traffic minimal | Proceed to 77-1 |
| 77-1 | C3 Inline Slots | +0.40% | NO-GO, skip 77-2 |
| 78-0 | SSOT Verification | ✅ Verified | Proceed to 78-1 |
| **78-1** | **Per-Op Overhead** | **TBD** | **In Progress** |

---

**Status**: Phase 78-0 ✅ Complete, Phase 78-1 Plan Finalized, Ready for Implementation

**Binary Size**: Phase 76-2 baseline + ~1.5KB (new box, static globals)

**Code Quality**: Low-risk optimization (backward compatible, architectural alignment)