hakmem/docs/analysis/PHASE77_1_C3_INLINE_SLOTS_RESULTS.md

# Phase 77-1: C3 Inline Slots A/B Test Results

## Executive Summary

**Decision**: **NO-GO** (+0.40% gain, below +1.0% GO threshold)

**Key Finding**: C3 inline slots provide minimal performance improvement (+0.40%) despite architectural alignment with successful C4-C6 optimizations. This suggests **C3 traffic is not a bottleneck** in the mixed workload (WS=400, 16-1040B allocations).

---

## Test Configuration

### Workload
- **Binary**: `./bench_random_mixed_hakmem` (with C3 inline slots compiled in)
- **Iterations**: 20,000,000 ops per run
- **Working Set**: 400 slots
- **Size Range**: 16-1040B (mixed allocations)
- **Runs**: 10 per configuration

### Configurations
- **Baseline**: C3 OFF (`HAKMEM_TINY_C3_INLINE_SLOTS=0`), C4/C5/C6 ON
- **Treatment**: C3 ON (`HAKMEM_TINY_C3_INLINE_SLOTS=1`), C4/C5/C6 ON
- **Measurement**: Throughput (ops/s)
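
The two arms differ only in one environment variable, so an A/B driver can be sketched as below. This is a minimal sketch, not the project's actual runner: `arm_env` and `run_arm` are hypothetical helper names, and the benchmark's command-line arguments and output format are not stated in this document, so the caller supplies the command and no parsing is attempted.

```python
import os
import subprocess

# Environment variable named in the configurations above.
C3_FLAG = "HAKMEM_TINY_C3_INLINE_SLOTS"


def arm_env(c3_on, base_env=None):
    """Return a copy of the environment with the C3 inline-slots flag set."""
    env = dict(os.environ if base_env is None else base_env)
    env[C3_FLAG] = "1" if c3_on else "0"
    return env


def run_arm(cmd, c3_on, runs=10):
    """Run `cmd` `runs` times for one arm and return each run's raw stdout.

    `cmd` is supplied by the caller (hypothetical here): the benchmark's
    arguments and output format are assumptions, not taken from this document.
    """
    outputs = []
    for _ in range(runs):
        result = subprocess.run(
            cmd, env=arm_env(c3_on), capture_output=True, text=True, check=True
        )
        outputs.append(result.stdout)
    return outputs
```

Building the environment from a copy keeps the two arms isolated from each other and from the calling shell, which matters when a stray variable can change the measured profile.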

---

## Raw Results (10 runs each)

### Baseline (C3 OFF)
```
40435972, 41430741, 41023773, 39807320, 40474129,
40436476, 40643305, 40116079, 40295157, 40622709
```
- **Mean**: 40.52 M ops/s
- **Min**: 39.80 M ops/s
- **Max**: 41.43 M ops/s
- **Std Dev**: ~0.57 M ops/s

### Treatment (C3 ON)
```
40836958, 40492669, 40726473, 41205860, 40609735,
40943945, 40612661, 41083970, 40370334, 40040018
```
- **Mean**: 40.69 M ops/s
- **Min**: 40.04 M ops/s
- **Max**: 41.20 M ops/s
- **Std Dev**: ~0.43 M ops/s

---

## Delta Analysis

| Metric | Value |
|--------|-------|
| **Baseline Mean** | 40.52 M ops/s |
| **Treatment Mean** | 40.69 M ops/s |
| **Absolute Gain** | 0.17 M ops/s |
| **Relative Gain** | **+0.40%** |
| **GO Threshold** | +1.0% |
| **Status** | ❌ **NO-GO** |

### Confidence Analysis
- Sample size: 10 per group
- Overlap: Baseline and Treatment ranges have significant overlap
- Signal-to-noise: Gain (0.17M) is well below the baseline std dev (0.57M)
- **Conclusion**: Gain is within noise, not statistically significant
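
The confidence claims above can be checked directly from the raw runs. The sketch below recomputes the relative gain and a Welch's t statistic using only the Python standard library. Welch's test is an assumption here; the document does not say which test, if any, was run. Recomputed values may also differ somewhat from the rounded figures listed above, depending on rounding and on whether a sample or population standard deviation is used.

```python
from math import sqrt
from statistics import mean, variance

# Raw throughput (ops/s), copied from the Raw Results section.
BASELINE = [40435972, 41430741, 41023773, 39807320, 40474129,
            40436476, 40643305, 40116079, 40295157, 40622709]
TREATMENT = [40836958, 40492669, 40726473, 41205860, 40609735,
             40943945, 40612661, 41083970, 40370334, 40040018]


def relative_gain_pct(base, treat):
    """Relative change of the treatment mean over the baseline mean, in percent."""
    return (mean(treat) - mean(base)) / mean(base) * 100.0


def welch_t(base, treat):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    vb = variance(base) / len(base)    # squared standard error, baseline
    vt = variance(treat) / len(treat)  # squared standard error, treatment
    t = (mean(treat) - mean(base)) / sqrt(vb + vt)
    df = (vb + vt) ** 2 / (vb ** 2 / (len(base) - 1) + vt ** 2 / (len(treat) - 1))
    return t, df


gain = relative_gain_pct(BASELINE, TREATMENT)
t, df = welch_t(BASELINE, TREATMENT)
print(f"gain = {gain:+.2f}%  t = {t:.2f}  df = {df:.1f}")
```

For two groups of 10 runs, a |t| well under about 2 is consistent with the "within noise" conclusion; a value above that would call for a closer look before a NO-GO.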

---

## Root Cause Analysis: Why No Gain?

### 1. **Phase 77-0 Observation Confirmed**
- Unified_cache statistics showed C3 had only 1 miss out of 20M operations (0.000005% miss rate)
- This ultra-low miss rate indicates C3 is already well-serviced by existing mechanisms

### 2. **Warm Pool Effectiveness**
- Warm pool + first-page-cache are likely intercepting C3 traffic
- C3 is below the "hot class" threshold where inline slots provide ROI

### 3. **TLS Overhead vs. Benefit**
- C3 adds 2KB/thread TLS overhead
- No corresponding reduction in unified_cache misses → overhead not justified
- Unlike C4-C6 where inline slots eliminated significant unified_cache traffic

### 4. **Workload Characteristics**
- WS=400 mixed workload is dominated by C5-C6 (57.17% + 28.55% = 85.7% of operations)
- C3 only ~15.6% of workload (64-128B size range)
- Even if C3 were optimized, it can only affect 15.6% of operations
- Only 4-5% of that traffic is currently hitting unified_cache (based on Phase 77-0 data)
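
Taken together, the last two bullets bound how much of the workload C3 inline slots could touch at all. A rough worked calculation follows, assuming the addressable share is simply the product of the C3 traffic share and the fraction of that traffic reaching unified_cache. This reading is an assumption, and `addressable_fraction` is a hypothetical helper name.

```python
def addressable_fraction(class_share, cache_hit_share):
    """Fraction of all operations that belong to the class AND reach unified_cache."""
    return class_share * cache_hit_share


C3_SHARE = 0.156  # ~15.6% of the workload, from the bullets above

# 4% and 5% are the two ends of the "4-5%" range quoted above.
for hit_share in (0.04, 0.05):
    pct = addressable_fraction(C3_SHARE, hit_share) * 100
    print(f"hit share {hit_share:.0%}: {pct:.2f}% of all operations")
```

Under this reading, less than 1% of operations are even candidates for improvement, so a gain near the +1.0% GO threshold would require nearly eliminating their cost.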

---

## Comparison to C4-C6 Success

### Why C4-C6 Succeeded (+7.05% cumulative)

| Factor | C4-C6 | C3 |
|--------|-------|-----|
| **Hot traffic %** | 14.29% + 28.55% + 57.17% = 100% of Tiny | ~15.6% of total |
| **Unified_cache hits** | Low but visible | Almost none |
| **Context dependency** | Super-additive synergy | No interaction |
| **Size class range** | 128-2048B (large objects) | 64-128B (small) |

**Key Insight**: C4-C6 optimization succeeded because it addressed **active contention** in the unified_cache. C3 optimization addresses **non-existent contention**.

---

## Per-Class Coverage Summary (Final)

### C0-C7 Optimization Status

| Class | Size Range | Coverage % | Optimization | Result | Status |
|-------|-----------|-----------|--------------|--------|--------|
| **C6** | 1025-2048B | 57.17% | Inline Slots | +2.87% | ✅ GO (Phase 75-1) |
| **C5** | 513-1024B | 28.55% | Inline Slots | +1.10% | ✅ GO (Phase 75-2) |
| **C4** | 257-512B | 14.29% | Inline Slots | +1.27% (in context) | ✅ GO (Phase 76-1, +7.05% cumulative) |
| **C3** | 65-256B | ~15.6% | Inline Slots | +0.40% | ❌ NO-GO (Phase 77-1) |
| **C2** | 33-64B | ~15.6% | Not tested | N/A | ⏸️ CONDITIONAL (blocked by C3 NO-GO) |
| **C7** | 2049-4096B | 0.00% | N/A | N/A | ✅ NO-GO (Phase 76-0) |
| **C0-C1** | ≤32B | Minimal | N/A | N/A | ⏸️ Future (blocked by C2) |

---

## Decision Logic

### Success Criteria
| Criterion | Threshold | Actual | Pass |
|-----------|-----------|--------|------|
| **GO Threshold** | ≥ +1.0% | **+0.40%** | ❌ |
| **Noise floor** | < 50% of baseline std dev | **30% of std dev** | ⚠️ |
| **Statistical significance** | p < 0.05 (10 samples) | High overlap | ❌ |

### Decision: **NO-GO**

**Rationale**:
1. ❌ **Below GO threshold**: +0.40% is well below the +1.0% GO floor
2. ❌ **Statistical insignificance**: Gain is within measurement noise
3. ❌ **Root cause confirmed**: Phase 77-0 data shows C3 has minimal unified_cache contention
4. ❌ **No follow-on to C2**: Phase 77-2 (C2) conditional on C3 success → BLOCKED

**Impact**: C3-C2 optimization axis exhausted. Per-class inline slots optimization complete at C4-C6.

---

## Phase 77-2 Status: **SKIPPED** (Conditional NO-GO)

Phase 77-2 (C2 inline slots) was conditional on Phase 77-1 (C3) success. Since Phase 77-1 is NO-GO:
- Phase 77-2 is **SKIPPED** (not implemented)
- C2 remains unoptimized (consistent with Phase 77-0 observation: negligible unified_cache traffic)

---

## Recommended Next Steps

### 1. **Lock C4-C6 as Permanent SSOT** ✅ (Already done Phase 76-2)
- C4+C5+C6 inline slots = **+7.05% cumulative gain, super-additive**
- Promoted to defaults in `core/bench_profile.h` and test scripts

### 2. **Explore Alternative Optimization Axes** (Phase 78+)
Given C3 NO-GO, consider:
- **Option A**: Allocation fast-path further optimization (instruction/branch reduction)
- **Option B**: Metadata/page lookup optimization (avoid pointer chasing)
- **Option C**: Warm pool tuning beyond Phase 69's WarmPool=16
- **Option D**: Alternative size-class strategies (C1/C2 with different thresholds)

### 3. **Track mimalloc Ratio** (Secondary Metric, ongoing)
- Current: 89.2% (Phase 76-2 baseline)
- Monitor code bloat from C4-C6 additions
- Rebase FAST PGO profile if bloat becomes a concern

---

## Conclusion

Phase 77-1 validates that per-class inline slots optimization has a **natural stopping point** at C3. Unlike C4-C6, which addressed hot unified_cache traffic, C3 (and by extension C2) appears to be well-serviced by the existing warm pool and caching mechanisms.

**Key Learning**: Not all size classes benefit equally from the same optimization pattern. C3's low traffic and non-existent unified_cache contention make inline slots wasteful in this workload.

**Status**: ✅ **DECISION MADE** (C3 NO-GO, C4-C6 locked to SSOT, Phase 77 complete)

---

**Phase 77 Status**: ✓ COMPLETE (Phase 77-0 GO, Phase 77-1 NO-GO, Phase 77-2 SKIPPED)

**Next Phase**: Phase 78 (Alternative optimization axis TBD)
Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update

Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-18 18:50:00 +09:00