# Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - Results

**Date**: 2025-12-17
**Status**: NEUTRAL (-0.71%, research box)
**Baseline**: 48.34% of mimalloc (Phase 59b Speed-first)

---

## Executive Summary

Phase 62A attempted to optimize the `tiny_c7_ultra_alloc()` hot path by replacing the per-call `tiny_front_v3_c7_ultra_header_light_enabled()` check with the TLS `headers_initialized` flag. The A/B test measured a **-0.71% regression (NEUTRAL)**, so the approach does not deliver the expected +1-3% gain.

**Conclusion**: Research box (default OFF, `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0`)

---

## A/B Test Results (Mixed benchmark, 10-run)

### Baseline (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)

**Runs** (M ops/s):
```
59.553, 59.906, 60.134, 59.533, 56.265, 59.368, 60.045, 58.487, 60.141, 59.569
```

**Statistics**:
- **Mean**: 59.300 M ops/s
- **Median**: 59.561 M ops/s
- **StdDev**: 1.173 M ops/s
- **CV**: 1.98%

---

### Treatment (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=1)

**Runs** (M ops/s):
```
56.352, 58.924, 58.946, 60.109, 58.630, 58.689, 59.609, 58.160, 59.939, 59.430
```

**Statistics**:
- **Mean**: 58.879 M ops/s
- **Median**: 58.935 M ops/s
- **StdDev**: 1.079 M ops/s
- **CV**: 1.83%

---

## Comparison

| Metric | Baseline | Treatment | Delta |
|--------|----------|-----------|-------|
| Mean (M ops/s) | 59.300 | 58.879 | **-0.71%** |
| Median (M ops/s) | 59.561 | 58.935 | -1.05% |
| StdDev (M ops/s) | 1.173 | 1.079 | -8.0% |
| CV | 1.98% | 1.83% | -0.15pp |

**Verdict**: **NEUTRAL** (the -0.71% mean delta is within the ±1.0% threshold, but negative)
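
The comparison above can be reproduced from the raw runs listed in the appendix. The following is a minimal sketch, not the project's actual tooling: the helper names (`mean_of`, `median_of`, `delta_pct`, `classify_delta`) and the `GO` label are assumptions made for illustration, while the ±1.0% threshold and the `NEUTRAL`/`NO-GO` labels come from this report. The verdict is taken on the mean delta, as in the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Raw runs in M ops/s, copied from the appendix of this report. */
static const double BASELINE_RUNS[10] = {
    59.553099, 59.906197, 60.134051, 59.533090, 56.265139,
    59.367898, 60.044922, 58.486467, 60.141028, 59.568791};
static const double TREATMENT_RUNS[10] = {
    56.351851, 58.923605, 58.946089, 60.109441, 58.629557,
    58.689160, 59.609485, 58.160391, 59.939368, 59.430088};

/* Arithmetic mean of n runs. */
static double mean_of(const double* runs, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += runs[i];
    return sum / (double)n;
}

static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

/* Median of n runs (n <= 64); sorts a local copy, input is untouched. */
static double median_of(const double* runs, size_t n) {
    assert(n > 0 && n <= 64);
    double sorted[64];
    memcpy(sorted, runs, n * sizeof(double));
    qsort(sorted, n, sizeof(double), cmp_double);
    return (n % 2) ? sorted[n / 2]
                   : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
}

/* Relative change of treatment versus baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    return 100.0 * (treatment - baseline) / baseline;
}

/* NEUTRAL when |delta| is within the threshold; "GO" is a hypothetical
   label for the positive side, which this report does not name. */
static const char* classify_delta(double delta, double threshold_pct) {
    if (delta > threshold_pct) return "GO";
    if (delta < -threshold_pct) return "NO-GO";
    return "NEUTRAL";
}
```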

---

## Implementation Details

### Optimization Strategy

**Original Code** (`tiny_c7_ultra_alloc` hot path):
```c
void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    const bool header_light = tiny_front_v3_c7_ultra_header_light_enabled();  // Per-call check

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (header_light) {  // Per-call branch
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}
```

**Optimized Code** (Phase 62A):
```c
void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    // No per-call header_light check - use TLS flag instead

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (tls->headers_initialized) {  // TLS flag set during refill
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}
```

**Intended Benefits**:
1. Eliminate the per-call `tiny_front_v3_c7_ultra_header_light_enabled()` function call
2. Replace it with a TLS field access (the TLS block is already in cache from the `count`/`freelist` accesses)
3. Reduce the dependency chain length
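
The Phase 62A pop path can be exercised in isolation with a small stand-alone model. This is a sketch under stated assumptions, not the HAKMEM implementation: every `model_*` name and `MODEL_CAP` are hypothetical, and `model_write_header()` only assumes that `tiny_region_id_write_header()` stores a one-byte class tag at `base` and returns the pointer one byte past it. Returning `NULL` on an empty freelist stands in for the elided "refill and retry" step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_CAP 8 /* hypothetical freelist capacity */

/* Simplified stand-in for tiny_c7_ultra_tls_t (fields taken from the
   excerpts above; the real layout is not shown in this report). */
typedef struct {
    void*    freelist[MODEL_CAP];
    uint16_t count;
    bool     headers_initialized; /* assumed to be set during refill */
} model_tls_t;

static _Thread_local model_tls_t g_model_tls;

/* Assumed behavior of tiny_region_id_write_header(): write a 1-byte
   class tag at base, return the user pointer just after it. */
static void* model_write_header(void* base, uint8_t class_idx) {
    *(uint8_t*)base = class_idx;
    return (uint8_t*)base + 1;
}

/* Push a block onto the LIFO freelist (test helper). */
static void model_push(void* base) {
    assert(g_model_tls.count < MODEL_CAP);
    g_model_tls.freelist[g_model_tls.count++] = base;
}

/* Phase 62A variant: branch on the TLS flag instead of a per-call check. */
static void* model_alloc(void) {
    model_tls_t* tls = &g_model_tls;
    uint16_t n = tls->count;
    if (n == 0) {
        return NULL; /* the real code refills and retries here */
    }
    void* base = tls->freelist[n - 1];
    tls->count = (uint16_t)(n - 1);
    if (tls->headers_initialized) {
        return (uint8_t*)base + 1; /* header already valid: skip the write */
    }
    return model_write_header(base, 7);
}
```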

---

## Root Cause Analysis

### Why No Improvement?

1. **LTO Optimization Already In Place**
   - In HAKMEM_BENCH_MINIMAL (`-flto`), `tiny_front_v3_c7_ultra_header_light_enabled()` is likely already inlined
   - The function call overhead may therefore already be zero in the built binary
   - Replacing it with a TLS field access does not improve latency (still an L1 cache hit)

2. **TLS Access Not Cheaper Than Expected**
   - The TLS field `headers_initialized` requires an offset calculation plus a memory access
   - The original check may actually be cheaper (register-based, already predicted)
   - Branch prediction on `if (header_light)` may be extremely accurate (99.99%+)

3. **Layout Tax from Added Code**
   - Phases 43, 46A, and 47 set the precedent: adding code branches can cause I-cache/alignment disruption
   - The added if-dispatch at function entry (`if (!c7_ultra_alloc_depchain_opt_enabled())`) may affect code layout
   - Result: the -0.71% regression is consistent with this pattern

4. **Hot Path May Already Be Optimal**
   - Phase 61 profiling showed `tiny_c7_ultra_alloc` at 5.18% stack %
   - But function-level optimization attempts (Phases 43/46A/47) all showed negative or marginal returns
   - This suggests the hot path is already well optimized by the compiler

---

## Lessons Learned

### 1. Function Call Overhead is Negligible in LTO Mode

With link-time optimization (`-flto`), calls to simple getters are aggressively inlined. Removing them does not necessarily improve performance because:
- The compiler has already determined the inlining it considers optimal
- Instruction fetch overhead may not be the bottleneck
- Replacing a call with a memory access can have similar latency

### 2. Layout Tax is Real and Persistent

This is the third time (Phase 43: -1.18%, Phase 46A: -0.68%, Phase 62A: -0.71%) that code addition/reorganization has resulted in a regression despite targeting hot functions. The pattern suggests:
- I-cache alignment matters more than instruction count
- Code layout disruptions can negate micro-optimization gains
- The Box Theory "minimal code change" principle is well justified

### 3. Per-Call Flags May Be Faster Than Per-TLS State

Counter-intuitive finding: accessing a per-call computed flag (via function inlining) may be faster than accessing TLS state, because:
- Function results are likely held in registers (temporaries)
- A TLS access requires an offset calculation plus a memory load
- The branch predictor handles the pattern well

### 4. 5.18% Stack % ≠ Optimizable Hotspot

Phase 61 profiling showed `tiny_c7_ultra_alloc` at 5.18% combined stack overhead, but this is misleading because:
- Much of the time is in the malloc/free wrappers and the benchmark loop (not C7 ultra itself)
- Self time (actual function execution) is likely 2-3%
- Micro-optimizations on already-optimized paths yield diminishing returns

---

## Decision

**NEUTRAL (research box)**:
- Set default to `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0` (OFF)
- Keep code with ENV gate for future reference
- Do not adopt as production default
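
A default-OFF ENV gate of this kind is commonly written as a cached getter. The sketch below is an assumption about its shape, not the code in the repository: the name `c7_ultra_alloc_depchain_opt_enabled()` and the variable `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT` appear in this report, but the parsing rule (enabled only when the value starts with `1`) and the helper `depchain_gate_parse()` are hypothetical. Caching keeps `getenv()` off the hot path; the dispatch branch itself remains, which is the layout cost discussed in the root cause analysis.

```c
#include <stdlib.h>

/* Hypothetical parsing rule: unset, empty, or anything not starting
   with '1' means OFF, so the gate defaults to OFF. */
static int depchain_gate_parse(const char* value) {
    return value != NULL && value[0] == '1';
}

/* Cached getter: the environment is read once, on the first call.
   Note: this plain static is not synchronized; a real implementation
   would need to initialize it before threads start or use an atomic. */
static int c7_ultra_alloc_depchain_opt_enabled(void) {
    static int cached = -1; /* -1 = not read yet */
    if (cached < 0) {
        cached = depchain_gate_parse(
            getenv("HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT"));
    }
    return cached;
}
```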

**Next Steps**:
1. Phase 62B: Try the secondary target (`tiny_region_id_write_header` reordering); higher risk
2. Or pivot to Phase 62C: Accept 48.34% as the performance ceiling and focus on production readiness
3. Or Phase 62D: Algorithmic redesign (batching, prefault strategy); very high cost/risk

---

## Box Theory Compliance

| Principle | Status | Notes |
|-----------|--------|-------|
| Single Conversion Point | ✅ Yes | `tiny_c7_ultra_alloc()` boundary |
| Clear Boundary | ✅ Yes | Env gate `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT` |
| Reversible | ✅ Yes | Can switch via ENV or compile flag |
| No Side Effects | ✅ Yes | Pure optimization attempt, no new data structures |
| Performance | ❌ No | **-0.71% regression (NEUTRAL), not adopted as default** |

**Overall**: Box Theory compliant, but the performance criterion is not met.

---

## Appendix: Raw Data

### Baseline (10-run, M ops/s)
```
59.553099
59.906197
60.134051
59.533090
56.265139
59.367898
60.044922
58.486467
60.141028
59.568791
```

### Treatment (10-run, M ops/s)
```
56.351851
58.923605
58.946089
60.109441
58.629557
58.689160
59.609485
58.160391
59.939368
59.430088
```

---

**End of Phase 62A Report**