hakmem/docs/analysis/PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md

# Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - Results

**Date**: 2025-12-17
**Status**: NEUTRAL (-0.71%, research box)
**Baseline**: 48.34% of mimalloc (Phase 59b Speed-first)

---

## Executive Summary

Phase 62A attempted to optimize the `tiny_c7_ultra_alloc()` hot path by eliminating the per-call `tiny_front_v3_c7_ultra_header_light_enabled()` check and using the TLS `headers_initialized` flag instead. Mean throughput changed by **-0.71% (NEUTRAL)**, which is inside the ±1.0% threshold, so the approach did not deliver the expected +1-3% gain.

**Conclusion**: Research box (default OFF, `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0`)

---

## A/B Test Results (Mixed benchmark, 10-run)

### Baseline (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)

**Runs** (M ops/s):
```
59.553, 59.906, 60.134, 59.533, 56.265, 59.368, 60.045, 58.487, 60.141, 59.569
```

**Statistics**:
- **Mean**: 59.300 M ops/s
- **Median**: 59.561 M ops/s
- **StdDev**: 1.173 M ops/s
- **CV**: 1.98%

---

### Treatment (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=1)

**Runs** (M ops/s):
```
56.352, 58.924, 58.946, 60.109, 58.630, 58.689, 59.609, 58.160, 59.939, 59.430
```

**Statistics**:
- **Mean**: 58.879 M ops/s
- **Median**: 58.935 M ops/s
- **StdDev**: 1.079 M ops/s
- **CV**: 1.83%

---

## Comparison

| Metric | Baseline | Treatment | Delta |
|--------|----------|-----------|-------|
| Mean | 59.300 | 58.879 | **-0.71%** |
| Median | 59.561 | 58.935 | -1.05% |
| StdDev | 1.173 | 1.079 | -8.0% |
| CV | 1.98% | 1.83% | -0.15pp |

**Verdict**: **NEUTRAL** (-0.71% is inside the ±1.0% threshold; the sign is negative, so there is no evidence of a gain)
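
To put the -0.71% delta next to the run-to-run noise, the comparison can be repeated as a Welch's t-test on the raw runs from the appendix. The following is a minimal sketch, not part of the HAKMEM benchmark tooling: the function names (`mean`, `sample_var`, `welch_t`, `sqrt_newton`) and the array names are illustrative assumptions, and the square root is a small Newton iteration only so that the sketch links without `-lm`.

```c
#include <assert.h>
#include <stddef.h>

/* Newton's method square root, used so the sketch does not need -lm. */
static double sqrt_newton(double x) {
    if (x <= 0.0) return 0.0;
    double r = x > 1.0 ? x : 1.0;          /* start at or above sqrt(x) */
    for (int i = 0; i < 64; i++) r = 0.5 * (r + x / r);
    return r;
}

static double mean(const double* v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += v[i];
    return s / (double)n;
}

/* Sample variance (n - 1 denominator), matching the StdDev rows above. */
static double sample_var(const double* v, size_t n) {
    assert(n > 1);
    double m = mean(v, n), ss = 0.0;
    for (size_t i = 0; i < n; i++) ss += (v[i] - m) * (v[i] - m);
    return ss / (double)(n - 1);
}

/* Welch's t statistic for (treatment - baseline), unequal variances allowed. */
static double welch_t(const double* base, size_t nb,
                      const double* treat, size_t nt) {
    double se = sqrt_newton(sample_var(base, nb) / (double)nb +
                            sample_var(treat, nt) / (double)nt);
    return (mean(treat, nt) - mean(base, nb)) / se;
}

/* Raw runs copied from the appendix (M ops/s). */
static const double kBaseline[10] = {
    59.553099, 59.906197, 60.134051, 59.533090, 56.265139,
    59.367898, 60.044922, 58.486467, 60.141028, 59.568791};
static const double kTreatment[10] = {
    56.351851, 58.923605, 58.946089, 60.109441, 58.629557,
    58.689160, 59.609485, 58.160391, 59.939368, 59.430088};
```

Calling `welch_t(kBaseline, 10, kTreatment, 10)` gives a |t| of roughly 0.8, well below the two-sided 5% critical value of about 2.1 for roughly 18 degrees of freedom. Under these assumptions the 10-run data cannot distinguish the treatment from the baseline, which fits the NEUTRAL label better than a confirmed regression would.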

---

## Implementation Details

### Optimization Strategy

**Original Code** (`tiny_c7_ultra_alloc` hot path):
```c
void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    const bool header_light = tiny_front_v3_c7_ultra_header_light_enabled();  // Per-call check

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (header_light) {  // Per-call branch
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}
```

**Optimized Code** (Phase 62A):
```c
void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    // No per-call header_light check - use TLS flag instead

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (tls->headers_initialized) {  // TLS flag set during refill
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}
```

**Intended Benefits**:
1. Eliminate per-call `tiny_front_v3_c7_ultra_header_light_enabled()` function call
2. Replace with TLS field access (already in cache from count/freelist)
3. Reduce dependency chain length
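
The idea behind these points is to decide once, at refill time, whether headers are already in place and to let the hot path read only that cached decision. The toy model below illustrates the pattern in isolation. It is a sketch under assumptions: `toy_tls_t`, `toy_refill`, `toy_alloc`, and the slab layout are hypothetical and do not reproduce HAKMEM's real TLS structure or refill logic, and it assumes that "header light" means the 1-byte headers were written during refill.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CAP   4    /* blocks per refill (hypothetical) */
#define TOY_BLOCK 64   /* bytes per block, including 1-byte header */

typedef struct {
    void*    freelist[TOY_CAP];
    uint16_t count;
    bool     headers_initialized;  /* written once per refill, read per alloc */
} toy_tls_t;

static _Thread_local toy_tls_t g_toy_tls;
static _Thread_local uint8_t   g_toy_slab[TOY_CAP][TOY_BLOCK];

/* Hypothetical refill: writes every header up front and records that fact. */
static void toy_refill(uint8_t class_idx) {
    toy_tls_t* tls = &g_toy_tls;
    for (uint16_t i = 0; i < TOY_CAP; i++) {
        g_toy_slab[i][0] = class_idx;      /* header byte at the block base */
        tls->freelist[i] = g_toy_slab[i];
    }
    tls->count = TOY_CAP;
    tls->headers_initialized = true;
}

static void* toy_alloc(void) {
    toy_tls_t* tls = &g_toy_tls;
    if (tls->count == 0) toy_refill(7);
    assert(tls->count > 0);
    void* base = tls->freelist[--tls->count];
    if (tls->headers_initialized) {
        return (uint8_t*)base + 1;         /* header already written at refill */
    }
    ((uint8_t*)base)[0] = 7;               /* slow path: write header now */
    return (uint8_t*)base + 1;
}
```

The flag shares a structure with `count` and `freelist`, which is what the "already in cache" argument relies on. The hot path still performs one load and one branch, so the pattern only wins if the replaced check was more expensive than that.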

---

## Root Cause Analysis

### Why No Improvement?

1. **LTO Optimization Already In Place**
   - In HAKMEM_BENCH_MINIMAL builds (`-flto`), `tiny_front_v3_c7_ultra_header_light_enabled()` is likely already inlined
   - The call overhead may therefore already be eliminated at link time
   - Replacing it with a TLS field access does not reduce latency (the field load is still an L1 cache hit at best)

2. **TLS Access Not as Cheap as Expected**
   - The TLS field `headers_initialized` requires an offset calculation plus a memory load
   - The inlined check may actually cost less (value held in a register, branch already well predicted)
   - Branch prediction on `if (header_light)` may be extremely accurate (99.99%+)

3. **Layout Tax from Added Code**
   - Precedent from Phases 43, 46A, and 47: adding branches can disrupt I-cache behavior and code alignment
   - The added dispatch at function entry (`if (!c7_ultra_alloc_depchain_opt_enabled())`) may shift the code layout
   - Result: the -0.71% delta is consistent with that pattern

4. **Hot Path May Already Be Optimal**
   - Phase 61 profiling showed `tiny_c7_ultra_alloc` at 5.18% stack %
   - But earlier function-level optimization attempts (Phases 43, 46A, 47) all showed negative or marginal returns
   - This suggests the compiler has already optimized the hot path well
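
The entry dispatch mentioned under item 3 is itself a per-call check. An ENV gate of this kind is commonly written as a cached `getenv` lookup, sketched below. This is an assumption about the general pattern, not the actual contents of the HAKMEM box header: `gate_parse`, `g_depchain_opt_cached`, and `depchain_opt_enabled` are hypothetical names, and the real `c7_ultra_alloc_depchain_opt_enabled()` may parse or cache differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical parser: a value starting with '1' is ON; unset or anything else is OFF. */
static int gate_parse(const char* value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* -1 = not read yet, 0 = OFF, 1 = ON. */
static int g_depchain_opt_cached = -1;

/* Reads the environment once, then serves the cached value.
 * Not synchronized: concurrent first calls would race benignly,
 * since every thread computes the same result. */
static inline int depchain_opt_enabled(void) {
    int v = g_depchain_opt_cached;
    if (v < 0) {
        v = gate_parse(getenv("HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT"));
        g_depchain_opt_cached = v;
    }
    assert(v == 0 || v == 1);
    return v;
}
```

Even after the first call, every allocation pays one load of the cached value and one branch, and both code paths stay in the function body. That is the mechanism by which a gate can add layout cost of its own while the optimization it guards is being measured.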

---

## Lessons Learned

### 1. Function Call Overhead is Negligible in LTO Mode

With link-time optimization (`-flto`), calls to simple getters are aggressively inlined. Removing them does not necessarily improve performance because:
- The compiler has already made its inlining decisions
- Instruction fetch overhead may not be the bottleneck
- Replacing a call with a memory access can have similar latency

### 2. Layout Tax is Real and Persistent

This is the third time (Phase 43: -1.18%, Phase 46A: -0.68%, Phase 62A: -0.71%) that adding or reorganizing code has produced a regression despite targeting a hot function. The pattern suggests:
- I-cache alignment matters more than instruction count
- Code layout disruptions can negate micro-optimization gains
- Box Theory "minimal code change" principle is well-justified

### 3. Per-Call Flags May Be Faster Than TLS State

Counter-intuitive hypothesis: reading a flag computed per call (via an inlined function) may be faster than reading TLS state, because:
- The function result is likely held in a register (temporary)
- TLS access requires a memory load plus an offset calculation
- The branch predictor handles the pattern well

### 4. 5.18% Stack % ≠ Optimizable Hotspot

Phase 61 profiling showed `tiny_c7_ultra_alloc` at 5.18% combined stack overhead, but this is misleading because:
- Much of the time is in malloc/free wrappers and benchmark loop (not C7 ultra itself)
- Self time is likely 2-3% (actual function execution)
- Micro-optimizations on already-optimized paths yield diminishing returns
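
The ceiling implied by these points can be estimated with Amdahl's law: if a fraction `f` of runtime becomes `s` times faster, overall throughput improves by `1 / ((1 - f) + f / s)`. The sketch below is illustrative only. The helper names are hypothetical, and it assumes the profile percentage can be read as a fraction of total runtime, which the report does not establish.

```c
#include <assert.h>

/* Overall speedup when a fraction f of runtime becomes s times faster. */
static double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Same result expressed as a percentage throughput gain. */
static double gain_percent(double f, double s) {
    return (amdahl_speedup(f, s) - 1.0) * 100.0;
}
```

With an assumed self time of 2.5% (the midpoint of the 2-3% estimate), `gain_percent(0.025, 1.25)` is about +0.5% and `gain_percent(0.025, 2.0)` is about +1.3%. Under that assumption, making the function 25% faster would still land inside the ±1.0% NEUTRAL band, and only a 2x speedup would clear it.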

---

## Decision

**NEUTRAL (research box)**:
- Set default to `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0` (OFF)
- Keep code with ENV gate for future reference
- Do not adopt as production default

**Next Steps**:
1. Phase 62B: Try the secondary target (`tiny_region_id_write_header` reordering) - higher risk
2. Or pivot to Phase 62C: Accept 48.34% of mimalloc as the performance ceiling and focus on production readiness
3. Or Phase 62D: Algorithmic redesign (batching, prefault strategy) - very high cost/risk

---

## Box Theory Compliance

| Principle | Status | Notes |
|-----------|--------|-------|
| Single Conversion Point | ✅ Yes | `tiny_c7_ultra_alloc()` boundary |
| Clear Boundary | ✅ Yes | Env gate `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT` |
| Reversible | ✅ Yes | Can switch via ENV or compile flag |
| No Side Effects | ✅ Yes | Pure optimization attempt, no new data structures |
| Performance | ❌ No | **-0.71% (NEUTRAL), not adopted as default** |

**Overall**: Box Theory compliant but performance non-compliant.

---

## Appendix: Raw Data

### Baseline (10-run, M ops/s)
```
59.553099
59.906197
60.134051
59.533090
56.265139
59.367898
60.044922
58.486467
60.141028
59.568791
```

### Treatment (10-run, M ops/s)
```
56.351851
58.923605
58.946089
60.109441
58.629557
58.689160
59.609485
58.160391
59.939368
59.430088
```

---

**End of Phase 62A Report**

---

**Commit message** (2025-12-17 16:34:03 +09:00):

Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - NEUTRAL (-0.71%)

Implemented C7 ULTRA allocation hotpath optimization attempt as per Phase 62A instructions.

**Objective**: Reduce dependency chain in `tiny_c7_ultra_alloc()` by:
1. Eliminating per-call `tiny_front_v3_c7_ultra_header_light_enabled()` checks
2. Using TLS `headers_initialized` flag set during refill
3. Reducing branch count and register pressure

**Implementation**:
- New ENV box: `core/box/c7_ultra_alloc_depchain_opt_box.h`
- `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0/1` gate (default OFF)
- Modified `tiny_c7_ultra_alloc()` with optimized path
- Preserved original path for compatibility

**Results** (Mixed benchmark, 10-run):
- Baseline (OPT=0): 59.300 M ops/s (CV 1.98%)
- Treatment (OPT=1): 58.879 M ops/s (CV 1.83%)
- Delta: -0.71% (NEUTRAL, within ±1.0% threshold but negative)
- Status: NEUTRAL → Research box (default OFF)

**Root Cause Analysis**:
1. LTO optimization already inlines header_light function (call cost = 0)
2. TLS access (memory load + offset) not cheaper than function call
3. Layout tax from code addition (I-cache disruption pattern from Phases 43/46A/47)
4. 5.18% stack % is not optimizable hotspot (already well-optimized)

**Key Lessons**:
- LTO-optimized function calls can be cheaper than TLS field access
- Micro-optimizations on already-optimized paths show diminishing/negative returns
- 48.34% gap to mimalloc is likely algorithmic, not micro-architectural
- Layout tax remains consistent pattern across attempted micro-optimizations

**Decision**:
- NEUTRAL verdict → kept as research box with ENV gate (default OFF)
- Not adopted as production default
- Next phases: Option B (production readiness pivot) likely higher ROI than further micro-opts

Box Theory Compliance: ✅ Compliant (single point, reversible, clear boundary)
Performance Compliance: ❌ No (-0.71% regression)

**Documentation**:
- PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md: Full A/B test analysis
- CURRENT_TASK.md: Updated with results and next phase options

🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
