# Phase 76-1: C4 Inline Slots A/B Test Results

## Executive Summary

**Decision**: **GO** (+1.73% gain, exceeds +1.0% threshold)

**Key Finding**: The C4 inline slots optimization provides a **+1.73% throughput gain** on the Standard binary, completing the C4/C5/C6 inline slots trilogy.

**Implementation**: Modular box pattern following the Phase 75-1/75-2 (C6/C5) design, integrating C4 into the existing cascade.

---

## Implementation Summary

### Modular Boxes Created

1. **`core/box/tiny_c4_inline_slots_env_box.h`**
   - ENV gate: `HAKMEM_TINY_C4_INLINE_SLOTS=0/1`
   - Lazy-init pattern (default OFF)

2. **`core/box/tiny_c4_inline_slots_tls_box.h`**
   - TLS ring buffer: 64 slots (512B per thread)
   - FIFO ring (head/tail indices, modulo 64)

3. **`core/front/tiny_c4_inline_slots.h`**
   - `c4_inline_push()` - always_inline
   - `c4_inline_pop()` - always_inline

4. **`core/tiny_c4_inline_slots.c`**
   - TLS variable definition
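
The ring described in box 2 can be sketched as follows. This is a minimal illustration, not the code in `tiny_c4_inline_slots_tls_box.h`: the struct layout, the `c4_ring_sketch_*` names, and the choice to keep one slot empty to distinguish full from empty are all assumptions made for the example. The head/tail indices add a few bytes on top of the 512B slot array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define C4_INLINE_CAP 64u  /* 64 slots x 8-byte pointers = 512B on a 64-bit target */

static_assert(C4_INLINE_CAP <= 256u, "head/tail are stored as uint8_t indices");

/* Hypothetical layout, for illustration only. */
typedef struct {
    void*   slots[C4_INLINE_CAP];
    uint8_t head;  /* next slot to pop  */
    uint8_t tail;  /* next slot to push */
} c4_ring_sketch_t;

/* One ring per thread. */
static _Thread_local c4_ring_sketch_t g_c4_ring_sketch;

static inline c4_ring_sketch_t* c4_ring_sketch_tls(void) {
    return &g_c4_ring_sketch;
}

/* Returns 1 on success, 0 if the ring is full (caller falls through). */
static inline int c4_ring_sketch_push(c4_ring_sketch_t* r, void* base) {
    uint8_t next = (uint8_t)((r->tail + 1u) % C4_INLINE_CAP);
    if (next == r->head) return 0;  /* full: one slot is kept empty */
    r->slots[r->tail] = base;
    r->tail = next;
    return 1;
}

/* Returns the oldest pointer (FIFO order), or NULL if the ring is empty. */
static inline void* c4_ring_sketch_pop(c4_ring_sketch_t* r) {
    if (r->head == r->tail) return NULL;  /* empty */
    void* base = r->slots[r->head];
    r->head = (uint8_t)((r->head + 1u) % C4_INLINE_CAP);
    return base;
}
```

Because the capacity is a power of two, the modulo compiles down to a mask, so push and pop each cost a handful of instructions when inlined.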

### Integration Points

**Alloc Path** (`tiny_front_hot_box.h`):
```c
// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    void* base = c4_inline_pop(c4_inline_tls());
    if (TINY_HOT_LIKELY(base != NULL)) {
        return tiny_header_finalize_alloc(base, class_idx);
    }
    // Miss (NULL): fall through to the rest of the cascade
}
```

**Free Path** (`tiny_legacy_fallback_box.h`):
```c
// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    if (c4_inline_push(c4_inline_tls(), base)) {
        return;  // Success
    }
    // Push rejected: fall through to the rest of the cascade
}
```
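
Both paths call `tiny_c4_inline_slots_enabled()` before touching the ring. A lazy-init ENV gate of that kind might look like the sketch below. The `c4_gate_sketch_*` names, the cached-int representation, and the parsing rule (only a leading `1` enables the path) are assumptions for illustration; only the variable name `HAKMEM_TINY_C4_INLINE_SLOTS` and the default-OFF behavior come from this document.

```c
#include <stdlib.h>

/* Hypothetical cache for the gate: -1 = not yet read, 0 = OFF, 1 = ON. */
static int g_c4_gate_sketch = -1;

/* Unset or empty means OFF; only a leading '1' enables the path. */
static inline int c4_gate_sketch_parse(const char* v) {
    return (v != NULL && v[0] == '1') ? 1 : 0;
}

/* getenv() runs on the first call only; later calls read the cached int. */
static inline int c4_gate_sketch_enabled(void) {
    if (g_c4_gate_sketch < 0) {
        g_c4_gate_sketch =
            c4_gate_sketch_parse(getenv("HAKMEM_TINY_C4_INLINE_SLOTS"));
    }
    return g_c4_gate_sketch;
}
```

In this sketch two threads racing on the first call would both compute the same value, so the worst case is a redundant `getenv()`. A production gate may still prefer an atomic or a one-time initializer.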

---

## 10-Run A/B Test Results

### Test Configuration

- **Workload**: Mixed SSOT (WS=400, ITERS=20000000)
- **Binary**: `./bench_random_mixed_hakmem` (Standard build)
- **Existing Defaults**: C5=1, C6=1 (Phase 75-3 promoted)
- **Runs**: 10 per configuration
- **Harness**: `scripts/run_mixed_10_cleanenv.sh`

### Raw Data

| Run | Baseline (C4=0) | Treatment (C4=1) | Delta |
|-----|-----------------|------------------|-------|
| 1   | 52.91 M ops/s   | 53.87 M ops/s    | +1.82% |
| 2   | 52.52 M ops/s   | 53.16 M ops/s    | +1.22% |
| 3   | 53.26 M ops/s   | 53.64 M ops/s    | +0.71% |
| 4   | 53.45 M ops/s   | 53.30 M ops/s    | -0.28% |
| 5   | 51.88 M ops/s   | 52.62 M ops/s    | +1.43% |
| 6   | 52.83 M ops/s   | 53.81 M ops/s    | +1.85% |
| 7   | 50.41 M ops/s   | 52.76 M ops/s    | +4.66% |
| 8   | 51.89 M ops/s   | 53.46 M ops/s    | +3.02% |
| 9   | 53.03 M ops/s   | 53.62 M ops/s    | +1.11% |
| 10  | 51.97 M ops/s   | 53.00 M ops/s    | +1.98% |

### Statistical Summary

| Metric | Baseline (C4=0) | Treatment (C4=1) | Delta |
|--------|-----------------|------------------|-------|
| **Mean** | **52.42 M ops/s** | **53.33 M ops/s** | **+1.73%** |
| Min | 50.41 M ops/s | 52.62 M ops/s | +4.39% |
| Max | 53.45 M ops/s | 53.87 M ops/s | +0.78% |
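
The summary row can be re-derived from the Raw Data table. The sketch below is not the contents of `/tmp/phase76_1_analysis.sh`; the helper names are made up for the example. Because the table values are already rounded to two decimals, the recomputed treatment mean comes out at roughly 53.32, within rounding of the reported 53.33, and the mean delta is roughly +1.73%.

```c
#include <stddef.h>

/* Per-run throughput in M ops/s, copied from the Raw Data table. */
static const double kBaseline[10] = {
    52.91, 52.52, 53.26, 53.45, 51.88, 52.83, 50.41, 51.89, 53.03, 51.97
};
static const double kTreatment[10] = {
    53.87, 53.16, 53.64, 53.30, 52.62, 53.81, 52.76, 53.46, 53.62, 53.00
};

/* Arithmetic mean of n samples (n must be > 0). */
static double mean_of(const double* v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

/* Relative change of b over a, in percent. */
static double delta_pct(double a, double b) {
    return (b / a - 1.0) * 100.0;
}
```

Calling `delta_pct(mean_of(kBaseline, 10), mean_of(kTreatment, 10))` gives the mean-over-mean delta used in the Decision Matrix. Note that this is not the same quantity as the average of the ten per-run deltas.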

---

## Decision Matrix

### Success Criteria

| Criterion | Threshold | Actual | Pass |
|-----------|-----------|--------|------|
| **GO Threshold** | ≥ +1.0% | **+1.73%** | ✓ |
| NEUTRAL Range | ±1.0% | N/A | N/A |
| NO-GO Threshold | ≤ -1.0% | N/A | N/A |

### Decision: **GO**

**Rationale**:
1. Mean throughput gain of **+1.73%** exceeds GO threshold (+1.0%)
2. No run regresses beyond the neutral range: the only negative delta is run 4 at -0.28%
3. 9 of 10 runs show a positive delta (+0.71% to +4.66%)
4. Pattern matches Phase 75-1 (C6: +2.87%) and Phase 75-2 (C5: +1.10%) success

**Quality Rating**: **Strong GO** (exceeds threshold by +0.73pp, robust across runs)
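
The thresholds in the Success Criteria table amount to a three-way classification of the mean delta. A minimal sketch, with a hypothetical enum and function name:

```c
/* Outcome labels mirroring the Success Criteria table. */
typedef enum {
    DECISION_NO_GO   = -1,
    DECISION_NEUTRAL = 0,
    DECISION_GO      = 1
} decision_t;

/* Classify a mean throughput delta given in percent. */
static decision_t classify_delta(double delta_pct) {
    if (delta_pct >= 1.0)  return DECISION_GO;     /* >= +1.0% */
    if (delta_pct <= -1.0) return DECISION_NO_GO;  /* <= -1.0% */
    return DECISION_NEUTRAL;                       /* strictly inside +/-1.0% */
}
```

With the measured +1.73%, `classify_delta(1.73)` returns `DECISION_GO`.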

---

## Per-Class Coverage Analysis

### C4-C7 Optimization Status

| Class | Size Range | Coverage % | Optimization | Status |
|-------|-----------|-----------|--------------|--------|
| **C4** | 257-512B | 14.29% | Inline Slots | **GO (+1.73%)** |
| **C5** | 513-1024B | 28.55% | Inline Slots | GO (+1.10%, Phase 75-2) |
| **C6** | 1025-2048B | 57.17% | Inline Slots | GO (+2.87%, Phase 75-1) |
| **C7** | 2049-4096B | 0.00% | N/A | NO-GO (Phase 76-0: 0% ops) |

**Combined C4-C6 Coverage**: 100% of C4-C7 operations (14.29% + 28.55% + 57.17%; the rounded figures sum to 100.01%)

### Cumulative Gain Tracking

| Optimization | Coverage | Individual Gain | Cumulative Impact |
|--------------|----------|-----------------|-------------------|
| C6 Inline Slots (Phase 75-1) | 57.17% | +2.87% | +2.87% |
| C5 Inline Slots (Phase 75-2) | 28.55% | +1.10% | +3.97% (C5+C6 4-point: +5.41%) |
| **C4 Inline Slots (Phase 76-1)** | **14.29%** | **+1.73%** | **+7.14%** (estimated, C4+C5+C6 combined) |

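**Note**: The +7.14% estimate adds the C4 gain (+1.73%) to the measured C5+C6 result (+5.41%). The actual cumulative gain will be measured in a follow-up 4-point matrix test if needed. Phase 75-3 showed that C5+C6 achieved +5.41% (near-perfect additivity, with 1.72% sub-additivity).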
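
The cumulative figures in the table are sums of percentage gains. For comparison, the sketch below also computes the compounded figure (multiplying throughput ratios), which is an alternative way to combine independent gains and is not a method this document uses. The function names are hypothetical.

```c
/* Additive estimate: sum of the percentage gains (as used in the table). */
static double additive_pct(double a_pct, double b_pct) {
    return a_pct + b_pct;
}

/* Compounded estimate: multiply the throughput ratios, then convert back. */
static double compounded_pct(double a_pct, double b_pct) {
    return ((1.0 + a_pct / 100.0) * (1.0 + b_pct / 100.0) - 1.0) * 100.0;
}
```

For the measured C5+C6 result (+5.41%) and the C4 gain (+1.73%), the additive estimate is +7.14% and the compounded one is roughly +7.23%. Both are projections only; neither replaces a combined measurement.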

---

## TLS Layout Impact

### TLS Cost Summary

| Component | Capacity | Size per Thread | Total (C4+C5+C6) |
|-----------|----------|-----------------|------------------|
| C4 inline slots | 64 | 512B | - |
| C5 inline slots | 128 | 1,024B | - |
| C6 inline slots | 128 | 1,024B | - |
| **Combined** | - | - | **2,560B (~2.5KB)** |

- **System-Wide** (10 threads): ~25KB total
- **Per-Thread L1-dcache**: +2.5KB footprint

**Observation**: No cache-miss spike observed (unlike Phase 74-2 LOCALIZE which showed +86% cache-misses). TLS expansion of 512B for C4 is well within safe limits.
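
The sizes in the TLS cost table follow from capacity times pointer size. The sketch below assumes 8-byte slot pointers (a 64-bit target) and counts only the slot arrays, not head/tail indices or padding; the names are hypothetical.

```c
#include <stddef.h>

/* Assumption: 64-bit target, so each slot holds an 8-byte pointer. */
enum { SKETCH_PTR_BYTES = 8 };

/* Bytes used by the slot array of one ring. */
static size_t ring_bytes(size_t capacity) {
    return capacity * SKETCH_PTR_BYTES;
}

/* C4 (64 slots) + C5 (128 slots) + C6 (128 slots), per thread. */
static size_t combined_tls_bytes(void) {
    return ring_bytes(64) + ring_bytes(128) + ring_bytes(128);
}
```

This gives 512B for C4 and 2,560B combined per thread, or 25,600B across 10 threads, consistent with the ~25KB figure.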

---

## Comparison: C4 vs C5 vs C6

| Phase | Class | Coverage | Capacity | TLS Cost | Individual Gain |
|-------|-------|----------|----------|----------|-----------------|
| 75-1 | C6 | 57.17% | 128 | 1KB | **+2.87%** (highest) |
| 75-2 | C5 | 28.55% | 128 | 1KB | +1.10% |
| **76-1** | **C4** | **14.29%** | **64** | **512B** | **+1.73%** |

**Key Insight**: C4 achieves a **+1.73% gain** with only **14.29% coverage**, a higher gain per percentage point of coverage than C5 (+1.10% with 28.55% coverage). This may indicate that the C4 class has higher branch overhead in the baseline unified_cache path.
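
The efficiency comparison can be made concrete by dividing each gain by its coverage. This is a rough ratio only: the three gains were measured against different baselines (C4 was tested with C5 and C6 already enabled), and the function name is hypothetical.

```c
/* Throughput gain (in %) per percentage point of operation coverage. */
static double gain_per_coverage_point(double gain_pct, double coverage_pct) {
    return gain_pct / coverage_pct;
}
```

With the table values this gives about 0.121 for C4 (1.73 / 14.29), about 0.039 for C5 (1.10 / 28.55), and about 0.050 for C6 (2.87 / 57.17).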

---

## Recommended Actions

### Immediate (Required)

1. **✓ Promote C4 Inline Slots to SSOT**
   - Set `HAKMEM_TINY_C4_INLINE_SLOTS=1` (default ON)
   - Update `core/bench_profile.h`
   - Update `scripts/run_mixed_10_cleanenv.sh`

2. **✓ Document Phase 76-1 Results**
   - Create `PHASE76_1_C4_INLINE_SLOTS_RESULTS.md`
   - Update `CURRENT_TASK.md`
   - Record in `PERFORMANCE_TARGETS_SCORECARD.md`

### Optional (Future Work)

3. **4-Point Matrix Test (C4+C5+C6)**
   - Measure full combined effect
   - Quantify sub-additivity of C4 on top of the measured C5+C6 gain (+5.41%)
   - Expected: +7-8% total gain if near-perfect additivity holds

4. **FAST PGO Rebase**
   - Test C4+C5+C6 on FAST PGO binary
   - Monitor for code bloat sensitivity (Phase 75-5 lesson)
   - Track mimalloc ratio progress

---

## Test Artifacts

### Log Files
- `/tmp/phase76_1_c4_baseline.log` (C4=0, 10 runs)
- `/tmp/phase76_1_c4_treatment.log` (C4=1, 10 runs)
- `/tmp/phase76_1_analysis.sh` (statistical analysis)

### Binary Information
- Binary: `./bench_random_mixed_hakmem`
- Build time: 2025-12-18 10:42
- Size: 674K
- Compiler: gcc -O3 -march=native -flto

---

## Conclusion

Phase 76-1 validates that the C4 inline slots optimization provides a **+1.73% throughput gain** on the Standard binary, completing the C4-C6 inline slots optimization trilogy.

The implementation follows the proven modular box pattern from Phase 75-1/75-2, integrates cleanly into the existing C5→C6→unified_cache cascade, and shows no adverse TLS or cache-miss effects.

**Recommendation**: Proceed with SSOT promotion to `core/bench_profile.h` and `scripts/run_mixed_10_cleanenv.sh`, setting `HAKMEM_TINY_C4_INLINE_SLOTS=1` as the new default.

---

**Phase 76-1 Status**: ✓ COMPLETE (GO, +1.73% gain validated on Standard binary)

**Next Phase**: Phase 76-2 (C4+C5+C6 4-point matrix validation) or SSOT promotion (if matrix deferred)