# Phase 61: C7 ULTRA Header-Light A/B Test Results

**Date**: 2025-12-17
**Status**: NEUTRAL (+0.31%, below +1.0% GO threshold)
**Decision**: Keep OFF by default, available as research flag

---

## Test Configuration

**Baseline**: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0` (header write on every alloc)
**Treatment**: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=1` (header write once at refill)

**Profile**: MIXED_TINYV3_C7_SAFE (Speed-first)
**Runs**: 10 iterations per configuration
**Binary**: bench_random_mixed_hakmem_minimal
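
A minimal sketch of how the two configurations could be driven from a shell. The `run_config` helper is hypothetical, and it assumes the benchmark prints its throughput on a single line. The arguments in the example comment are borrowed from the `perf record` command in this report; the arguments actually used for the A/B runs are not stated here.

```bash
# Hypothetical helper: run a command N times with the header-light flag set.
# Usage: run_config <flag_value> <runs> <command> [args...]
run_config() {
    flag="$1"
    runs="$2"
    shift 2
    i=1
    while [ "$i" -le "$runs" ]; do
        printf 'Run %d: ' "$i"
        # The variable is set only for this one invocation of the command.
        HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT="$flag" "$@"
        i=$((i + 1))
    done
}

# Example (assumed arguments, not confirmed by this report):
#   run_config 0 10 ./bench_random_mixed_hakmem_minimal 200000000 400 1
#   run_config 1 10 ./bench_random_mixed_hakmem_minimal 200000000 400 1
```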

---

## Runtime Profiling (Step 0)

**Command**:
```bash
perf record -F 99 -g -- ./bench_random_mixed_hakmem_minimal 200000000 400 1
perf report --no-children | head -60
```

**Top Hotspots**:
1. `free`: 30.92%
2. `malloc`: 24.77%
3. `tiny_region_id_write_header`: 2.32% (within `free` backtrace)
4. `tiny_c7_ultra_alloc`: 1.90%

**Observation**:
- Header write is a 2.32% hotspot (down from 4.56% in Phase 42)
- C7 ULTRA alloc is 1.90% of total cycles
- Combined target overhead: ~4.22% (2.32% + 1.90%)

---

## A/B Test Results

### Baseline (HEADER_LIGHT=0)

```
Run 1: 60,596,666 ops/s
Run 2: 60,631,338 ops/s
Run 3: 58,848,585 ops/s
Run 4: 57,592,486 ops/s
Run 5: 60,072,235 ops/s
Run 6: 58,936,742 ops/s
Run 7: 59,389,954 ops/s
Run 8: 59,785,720 ops/s
Run 9: 59,956,318 ops/s
Run 10: 59,619,539 ops/s
```

**Statistics**:
- Mean: 59,542,958 ops/s
- Median: 59,702,630 ops/s
- Min: 57,592,486 ops/s
- Max: 60,631,338 ops/s
- StdDev: 912,145 ops/s (sample)
- CV: 1.53%
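
The summary figures can be recomputed from the raw runs. The sketch below is one way to do it with `sort` and `awk`. The `summarize` helper is hypothetical, and it assumes the standard deviation is the sample (n - 1) form, which is the form that reproduces the StdDev reported here.

```bash
# Hypothetical helper: read one throughput value per line on stdin
# (thousands separators allowed) and print summary statistics.
summarize() {
    tr -d ',' | sort -n | awk '
        { v[NR] = $1; sum += $1 }
        END {
            if (NR < 2) { print "need at least 2 samples"; exit 1 }
            mean = sum / NR
            for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
            sd = sqrt(ss / (NR - 1))   # sample standard deviation (n - 1)
            # Input is sorted, so the median is the middle value(s).
            if (NR % 2) median = v[(NR + 1) / 2]
            else        median = (v[NR / 2] + v[NR / 2 + 1]) / 2
            printf "mean=%.0f median=%.0f min=%.0f max=%.0f stddev=%.0f cv=%.2f%%\n",
                   mean, median, v[1], v[NR], sd, 100 * sd / mean
        }'
}

# Baseline runs listed above.
summarize <<'EOF'
60,596,666
60,631,338
58,848,585
57,592,486
60,072,235
58,936,742
59,389,954
59,785,720
59,956,318
59,619,539
EOF
```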

### Treatment (HEADER_LIGHT=1)

```
Run 1: 58,677,671 ops/s
Run 2: 59,459,236 ops/s
Run 3: 61,090,929 ops/s
Run 4: 57,586,075 ops/s
Run 5: 61,556,526 ops/s
Run 6: 61,837,526 ops/s
Run 7: 58,629,333 ops/s
Run 8: 60,012,916 ops/s
Run 9: 57,548,197 ops/s
Run 10: 60,888,920 ops/s
```

**Statistics**:
- Mean: 59,728,733 ops/s
- Median: 59,736,076 ops/s
- Min: 57,548,197 ops/s
- Max: 61,837,526 ops/s
- StdDev: 1,591,714 ops/s (sample)
- CV: 2.66%

---

## Analysis

**Delta**: +0.31% (+185,775 ops/s in mean throughput)

**Decision Matrix**:
- GO: +1.0% or better → NOT MET
- NEUTRAL: within ±1.0% → **MATCHED** (+0.31%)
- NO-GO: -1.0% or worse → NOT MET

**Verdict**: **NEUTRAL**
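
The decision rule can be written down directly. The sketch below is a hypothetical helper (`classify` is not part of the benchmark tooling) that takes the baseline and treatment means and applies the ±1.0% thresholds from the decision matrix.

```bash
# Hypothetical helper: classify an A/B result from two mean throughputs.
# Usage: classify <baseline_mean> <treatment_mean>
classify() {
    awk -v base="$1" -v treat="$2" 'BEGIN {
        delta = 100 * (treat - base) / base      # relative change in percent
        if (delta >= 1.0)       verdict = "GO"
        else if (delta <= -1.0) verdict = "NO-GO"
        else                    verdict = "NEUTRAL"
        printf "delta=%+.2f%% verdict=%s\n", delta, verdict
    }'
}

# Means from the Statistics lists above.
classify 59542958 59728733
```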

---

## Discussion

### Why +0.31% is Below Expectations

1. **Header Write Overhead Lower Than Expected**:
   - Profiling shows 2.32% (not 4.56% as in Phase 42)
   - Mixed workload dilutes C7-specific hotspots
   - Expected: ~2-3% gain
   - Actual: +0.31%

2. **Higher Variance in Treatment**:
   - Baseline CV: 1.53%
   - Treatment CV: 2.66% (1.74x higher)
   - Suggests additional noise or cache effects
   - The +0.31% delta is smaller than either CV, so 10 runs per side cannot separate it from run-to-run noise

3. **Header Write Not the Bottleneck**:
   - C7 ULTRA alloc hit is already fast (~5-7 instructions)
   - Header write (~3-4 instructions) is a small part
   - Other factors (TLS cache locality, refill overhead) dominate

4. **Refill Phase Overhead**:
   - Header-light mode writes headers during refill (cold path)
   - Adds a branch in the hot path (`if (header_light)`)
   - Net instruction reduction: ~2-3 instructions (not 5-7)

### Positive Observations

1. **No Regression**: The measured delta is slightly positive (+0.31%), though within run-to-run noise
2. **Implementation Stable**: Pre-existing implementation works correctly
3. **No Safety Issues**: Invariant (headers present) holds
4. **Rollback Safe**: ENV gate=0 by default

---

## Recommendation

**Status**: Keep as **research flag** (default OFF)

**Rationale**:
1. Gain (+0.31%) is below the GO threshold (+1.0%)
2. Higher variance (CV 2.66% vs 1.53%) suggests instability
3. Instruction reduction is insufficient to justify the added complexity
4. Better opportunities exist (e.g., Phase 62: TLS prefetch, Phase 63: refill batching)
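
To put the variance argument in numbers, one option is to compare the difference in means against its standard error. The sketch below computes Welch's t statistic, a technique chosen for this illustration and not something the original analysis reports. `welch_t` is a hypothetical helper fed with the means and standard deviations from the Statistics lists. With the reported figures it gives a t value of roughly 0.32, meaning the difference is about a third of its own standard error, which is consistent with the NEUTRAL verdict. With only 10 runs per side this is a rough check, not a formal test.

```bash
# Hypothetical helper: Welch t statistic for two independent samples.
# Usage: welch_t <mean_a> <sd_a> <n_a> <mean_b> <sd_b> <n_b>
welch_t() {
    awk -v ma="$1" -v sa="$2" -v na="$3" -v mb="$4" -v sb="$5" -v nb="$6" 'BEGIN {
        se = sqrt(sa * sa / na + sb * sb / nb)   # standard error of (mean_b - mean_a)
        printf "diff=%.0f se=%.0f t=%.2f\n", mb - ma, se, (mb - ma) / se
    }'
}

# Baseline vs treatment, values taken from this report.
welch_t 59542958 912145 10 59728733 1591714 10
```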

**Future Re-evaluation**:
- Retry with a C7-heavy workload (>50% C7 allocations)
- Combine with other C7 optimizations (batch refill, SIMD header write)
- Profile with IPC/cache-miss counters (not just cycles)
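
For the counter-based follow-up, `perf stat` can report cycles, instructions, and cache events for each configuration. This is a sketch only: `perf_counters` is a hypothetical wrapper, the generic event names may map differently (or be unavailable) on a given CPU, and the benchmark arguments in the example comment are borrowed from the profiling command in this report.

```bash
# Hypothetical wrapper: collect hardware counters for one flag setting.
# Usage: perf_counters <flag_value> <command> [args...]
# PERF can be overridden (e.g. PERF=echo) to inspect the command line.
perf_counters() {
    flag="$1"
    shift
    HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT="$flag" \
        "${PERF:-perf}" stat -e cycles,instructions,cache-references,cache-misses -- "$@"
}

# Example (assumed arguments):
#   perf_counters 0 ./bench_random_mixed_hakmem_minimal 200000000 400 1
#   perf_counters 1 ./bench_random_mixed_hakmem_minimal 200000000 400 1
```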

---

## ENV Control

**Variable**: `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT`
**Default**: 0 (OFF)
**Location**: `core/box/tiny_front_v3_env_box.h:145-152`

**Usage**:
```bash
# Enable header-light mode (research only)
export HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=1

# Disable (default)
export HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT=0
# or unset
unset HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT
```

---

## Next Steps

1. **Keep implementation**: Code is clean, no removal needed
2. **Document as research flag**: Available for future C7-heavy workloads
3. **Phase 62 priorities**:
   - TLS prefetch optimization (higher impact potential)
   - Refill batch size tuning (reduce cold path overhead)
   - IPC profiling (identify real bottlenecks)

---

## Conclusion

Phase 61 achieves **NEUTRAL** status (+0.31%):
- Implementation works correctly (no bugs observed)
- Measured gain is small and within run-to-run noise (+0.31% < +1.0% threshold)
- Keep as research flag (default OFF)
- Focus on higher-impact optimizations (Phase 62+)

**Lesson**: Micro-optimizations require precise profiling. Cycle counts alone are insufficient; IPC, cache-miss counts, and workload-specific analysis are also needed.