# Phase 76-1: C4 Inline Slots A/B Test Results

## Executive Summary

**Decision**: **GO** (+1.73% gain, exceeds +1.0% threshold)

**Key Finding**: C4 inline slots optimization provides **+1.73% throughput gain** on Standard binary, completing the C4/C5/C6 inline slots trilogy.

**Implementation**: Modular box pattern following Phase 75-1/75-2 (C6/C5) design, integrating C4 into existing cascade.

---

## Implementation Summary

### Modular Boxes Created

1. **`core/box/tiny_c4_inline_slots_env_box.h`**
   - ENV gate: `HAKMEM_TINY_C4_INLINE_SLOTS=0/1`
   - Lazy-init pattern (default OFF)

2. **`core/box/tiny_c4_inline_slots_tls_box.h`**
   - TLS ring buffer: 64 slots (512B per thread)
   - FIFO ring (head/tail indices, modulo 64)

3. **`core/front/tiny_c4_inline_slots.h`**
   - `c4_inline_push()` - always_inline
   - `c4_inline_pop()` - always_inline

4. **`core/tiny_c4_inline_slots.c`**
   - TLS variable definition

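
As a rough illustration of the ring described in item 2, the following is a minimal sketch of a 64-slot FIFO with head/tail indices. The `c4_ring_sketch_*` names, the `count` field, and the struct layout are assumptions made for illustration and are not taken from `tiny_c4_inline_slots_tls_box.h`. The 512B figure corresponds to the slot array alone, assuming 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define C4_SLOTS 64u  /* capacity from the list above: 64 x 8B = 512B of slots */

/* Hypothetical layout; the real TLS box may differ. */
typedef struct {
    void*    slots[C4_SLOTS];
    uint32_t head;   /* next index to pop  */
    uint32_t tail;   /* next index to push */
    uint32_t count;  /* live entries; distinguishes full from empty */
} c4_ring_sketch_t;

/* One ring per thread (C11 thread-local storage). */
static _Thread_local c4_ring_sketch_t g_c4_ring_sketch;

static inline c4_ring_sketch_t* c4_ring_sketch_tls(void) {
    return &g_c4_ring_sketch;
}

/* Returns 1 on success, 0 when the ring is full. */
static inline int c4_ring_sketch_push(c4_ring_sketch_t* r, void* base) {
    assert(base != NULL);  /* NULL is reserved as the "empty" result of pop */
    if (r->count == C4_SLOTS) return 0;
    r->slots[r->tail] = base;
    r->tail = (r->tail + 1u) % C4_SLOTS;
    r->count++;
    return 1;
}

/* Returns the oldest block, or NULL when the ring is empty. */
static inline void* c4_ring_sketch_pop(c4_ring_sketch_t* r) {
    if (r->count == 0) return NULL;
    void* base = r->slots[r->head];
    r->head = (r->head + 1u) % C4_SLOTS;
    r->count--;
    return base;
}
```

In this sketch a full ring makes push return 0 and an empty ring makes pop return `NULL`, so a caller can fall through to the next cache tier in either case.
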
### Integration Points

**Alloc Path** (`tiny_front_hot_box.h`):
```c
// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    void* base = c4_inline_pop(c4_inline_tls());
    if (TINY_HOT_LIKELY(base != NULL)) {
        return tiny_header_finalize_alloc(base, class_idx);
    }
    // Miss: fall through to the rest of the cascade
}
```

**Free Path** (`tiny_legacy_fallback_box.h`):
```c
// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    if (c4_inline_push(c4_inline_tls(), base)) {
        return;  // Success
    }
    // Push failed: fall through to the rest of the cascade
}
```

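
Both paths above check `tiny_c4_inline_slots_enabled()` before touching the ring. A lazy-init ENV gate of the kind listed under Modular Boxes might look roughly like the sketch below. The `c4_gate_sketch` names and the parsing rule (only a leading `1` enables the feature) are assumptions, not the contents of the real env box, and the sketch ignores concurrent first calls.

```c
#include <assert.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = off, 1 = on. Hypothetical name, not the real box. */
static int g_c4_gate_sketch = -1;

static inline int c4_gate_sketch_enabled(void) {
    if (g_c4_gate_sketch < 0) {
        /* Read the environment once, on first use. */
        const char* v = getenv("HAKMEM_TINY_C4_INLINE_SLOTS");
        /* Unset or anything not starting with '1' keeps the default (OFF). */
        g_c4_gate_sketch = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    assert(g_c4_gate_sketch == 0 || g_c4_gate_sketch == 1);
    return g_c4_gate_sketch;  /* later calls only pay for this cached read */
}
```
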
---

## 10-Run A/B Test Results

### Test Configuration

- **Workload**: Mixed SSOT (WS=400, ITERS=20000000)
- **Binary**: `./bench_random_mixed_hakmem` (Standard build)
- **Existing Defaults**: C5=1, C6=1 (Phase 75-3 promoted)
- **Runs**: 10 per configuration
- **Harness**: `scripts/run_mixed_10_cleanenv.sh`

### Raw Data

| Run | Baseline (C4=0) | Treatment (C4=1) | Delta |
|-----|-----------------|------------------|-------|
| 1 | 52.91 M ops/s | 53.87 M ops/s | +1.82% |
| 2 | 52.52 M ops/s | 53.16 M ops/s | +1.22% |
| 3 | 53.26 M ops/s | 53.64 M ops/s | +0.71% |
| 4 | 53.45 M ops/s | 53.30 M ops/s | -0.28% |
| 5 | 51.88 M ops/s | 52.62 M ops/s | +1.43% |
| 6 | 52.83 M ops/s | 53.81 M ops/s | +1.85% |
| 7 | 50.41 M ops/s | 52.76 M ops/s | +4.66% |
| 8 | 51.89 M ops/s | 53.46 M ops/s | +3.02% |
| 9 | 53.03 M ops/s | 53.62 M ops/s | +1.11% |
| 10 | 51.97 M ops/s | 53.00 M ops/s | +1.98% |

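
As a cross-check, the per-configuration mean and the relative delta can be recomputed from the ten raw samples above. The sketch below is illustrative only: the helper names are made up here and this is not the project's `phase76_1_analysis.sh`. Because the table values are rounded to two decimals, recomputed means can differ from the reported summary in the last digit.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples (n must be > 0). */
static double mean_of(const double* xs, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += xs[i];
    return sum / (double)n;
}

/* Relative change of treatment over baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    return (treatment / baseline - 1.0) * 100.0;
}

/* Raw throughput in M ops/s, copied from the table above. */
static const double kBaseline[10] = {
    52.91, 52.52, 53.26, 53.45, 51.88, 52.83, 50.41, 51.89, 53.03, 51.97
};
static const double kTreatment[10] = {
    53.87, 53.16, 53.64, 53.30, 52.62, 53.81, 52.76, 53.46, 53.62, 53.00
};
```

Calling `delta_pct(mean_of(kBaseline, 10), mean_of(kTreatment, 10))` compares the two means directly, which is not the same as averaging the ten per-run deltas.
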
### Statistical Summary

| Metric | Baseline (C4=0) | Treatment (C4=1) | Delta |
|--------|-----------------|------------------|-------|
| **Mean** | **52.42 M ops/s** | **53.33 M ops/s** | **+1.73%** |
| Min | 50.41 M ops/s | 52.62 M ops/s | +4.39% |
| Max | 53.45 M ops/s | 53.87 M ops/s | +0.78% |

---

## Decision Matrix

### Success Criteria

| Criterion | Threshold | Actual | Pass |
|-----------|-----------|--------|------|
| **GO Threshold** | ≥ +1.0% | **+1.73%** | ✓ |
| NEUTRAL Range | ±1.0% | N/A | N/A |
| NO-GO Threshold | ≤ -1.0% | N/A | N/A |

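
The criteria above amount to a three-way classification of the mean delta. A minimal sketch follows, assuming the boundary values of exactly +1.0% and -1.0% count as GO and NO-GO respectively (the table's NEUTRAL range of ±1.0% leaves the boundaries ambiguous). The type and function names are hypothetical.

```c
#include <assert.h>

typedef enum {
    VERDICT_NO_GO   = -1,
    VERDICT_NEUTRAL = 0,
    VERDICT_GO      = 1
} verdict_t;

/* Thresholds from the Success Criteria table: GO at >= +1.0%, NO-GO at <= -1.0%. */
static verdict_t classify_delta(double delta_pct) {
    const double kThresholdPct = 1.0;
    /* NaN fails both comparisons below and would be misreported as NEUTRAL. */
    assert(delta_pct == delta_pct);
    if (delta_pct >= kThresholdPct)  return VERDICT_GO;
    if (delta_pct <= -kThresholdPct) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

With this rule the measured +1.73% classifies as GO, while a result like run 4's -0.28% taken on its own would be NEUTRAL.
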
### Decision: **GO**

**Rationale**:
1. The mean throughput gain of **+1.73%** exceeds the GO threshold (+1.0%)
2. 9 of 10 runs show a positive delta; the only negative run (run 4) is near zero at -0.28%
3. The pattern matches the Phase 75-1 (C6: +2.87%) and Phase 75-2 (C5: +1.10%) results

**Quality Rating**: **Strong GO** (exceeds threshold by 0.73 pp, robust across runs)

---

## Per-Class Coverage Analysis

### C4-C7 Optimization Status

| Class | Size Range | Coverage % | Optimization | Status |
|-------|------------|------------|--------------|--------|
| **C4** | 257-512B | 14.29% | Inline Slots | **GO (+1.73%)** |
| **C5** | 513-1024B | 28.55% | Inline Slots | GO (+1.10%, Phase 75-2) |
| **C6** | 1025-2048B | 57.17% | Inline Slots | GO (+2.87%, Phase 75-1) |
| **C7** | 2049-4096B | 0.00% | N/A | NO-GO (Phase 76-0: 0% ops) |

**Combined C4-C6 Coverage**: ~100% of C4-C7 operations (14.29% + 28.55% + 57.17%; the sum reads 100.01% because of rounding)

### Cumulative Gain Tracking

| Optimization | Coverage | Individual Gain | Cumulative Impact |
|--------------|----------|-----------------|-------------------|
| C6 Inline Slots (Phase 75-1) | 57.17% | +2.87% | +2.87% |
| C5 Inline Slots (Phase 75-2) | 28.55% | +1.10% | +3.97% (C5+C6 4-point: +5.41%) |
| **C4 Inline Slots (Phase 76-1)** | **14.29%** | **+1.73%** | **+7.14%** (estimated, C4+C5+C6 combined) |

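
The +7.14% estimate in the last row equals the sum of the measured C5+C6 gain (+5.41%) and the C4 gain (+1.73%). For comparison, treating the two gains as throughput ratios and multiplying them gives roughly +7.23%. The sketch below shows both calculations; the function names are illustrative, and neither figure is a measurement.

```c
#include <assert.h>

/* Additive estimate: percentage gains are simply summed. */
static double combine_additive(double a_pct, double b_pct) {
    return a_pct + b_pct;
}

/* Compounded estimate: gains multiply as throughput ratios. */
static double combine_compounded(double a_pct, double b_pct) {
    assert(a_pct > -100.0 && b_pct > -100.0);  /* ratios must stay positive */
    return ((1.0 + a_pct / 100.0) * (1.0 + b_pct / 100.0) - 1.0) * 100.0;
}
```

Both estimates assume C4 adds its full standalone gain on top of C5+C6, which is the assumption a combined test would check.
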
**Note**: The actual cumulative gain will be measured in a follow-up 4-point matrix test if needed. Phase 75-3 showed that C5+C6 together achieved +5.41% (near-perfect sub-additivity at 1.72%).

---

## TLS Layout Impact

### TLS Cost Summary

| Component | Capacity | Size per Thread | Total (C4+C5+C6) |
|-----------|----------|-----------------|------------------|
| C4 inline slots | 64 | 512B | - |
| C5 inline slots | 128 | 1,024B | - |
| C6 inline slots | 128 | 1,024B | - |
| **Combined** | - | - | **2,560B (~2.5KB)** |

**System-Wide** (10 threads): ~25KB total

**Per-Thread L1-dcache**: +2.5KB footprint

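
The sizes above follow from one pointer per slot. A small sketch of the arithmetic, assuming 8-byte pointers (a 64-bit target) and counting only slot storage, not ring indices or padding:

```c
#include <assert.h>
#include <stddef.h>

/* Slot capacities from the TLS Cost Summary table. */
enum { C4_CAP = 64, C5_CAP = 128, C6_CAP = 128 };

/* Bytes of slot storage for one ring, assuming one pointer per slot. */
static size_t ring_slot_bytes(size_t capacity, size_t ptr_size) {
    assert(ptr_size > 0);
    return capacity * ptr_size;
}

/* Combined per-thread slot storage for C4 + C5 + C6. */
static size_t combined_slot_bytes(size_t ptr_size) {
    return ring_slot_bytes(C4_CAP, ptr_size)
         + ring_slot_bytes(C5_CAP, ptr_size)
         + ring_slot_bytes(C6_CAP, ptr_size);
}
```

With `ptr_size = 8` this gives 512B for C4, 1,024B each for C5 and C6, and 2,560B combined; ten threads then account for 25,600B, matching the ~25KB figure.
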
**Observation**: No cache-miss spike was observed (unlike Phase 74-2 LOCALIZE, which showed +86% cache-misses). The 512B TLS expansion for C4 is well within safe limits.

---

## Comparison: C4 vs C5 vs C6

| Phase | Class | Coverage | Capacity | TLS Cost | Individual Gain |
|-------|-------|----------|----------|----------|-----------------|
| 75-1 | C6 | 57.17% | 128 | 1KB | **+2.87%** (highest) |
| 75-2 | C5 | 28.55% | 128 | 1KB | +1.10% |
| **76-1** | **C4** | **14.29%** | **64** | **512B** | **+1.73%** |

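
One rough way to compare the rows above is throughput gain per percentage point of coverage. The sketch below computes that ratio; the metric and function name are introduced here for illustration only. The C5 and C6 gains come from earlier phases with different baselines, so the ratios are indicative at best.

```c
#include <assert.h>

/* Throughput gain (percent) per percentage point of operation coverage. */
static double gain_per_coverage_point(double gain_pct, double coverage_pct) {
    assert(coverage_pct > 0.0);
    return gain_pct / coverage_pct;
}
```

Using the table values, C4 comes out near 0.12, C6 near 0.05, and C5 near 0.04.
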
**Key Insight**: C4 achieves a **+1.73% gain** with only **14.29% coverage**, a higher gain per covered operation than C5 (+1.10% with 28.55% coverage). This suggests that the C4 class has higher branch overhead in the baseline unified_cache path.

---

## Recommended Actions

### Immediate (Required)

1. **✓ Promote C4 Inline Slots to SSOT**
   - Set `HAKMEM_TINY_C4_INLINE_SLOTS=1` (default ON)
   - Update `core/bench_profile.h`
   - Update `scripts/run_mixed_10_cleanenv.sh`

2. **✓ Document Phase 76-1 Results**
   - Create `PHASE76_1_C4_INLINE_SLOTS_RESULTS.md`
   - Update `CURRENT_TASK.md`
   - Record in `PERFORMANCE_TARGETS_SCORECARD.md`

### Optional (Future Work)

3. **4-Point Matrix Test (C4+C5+C6)**
   - Measure the full combined effect
   - Quantify the sub-additivity of C4 on top of the measured C5+C6 gain (+5.41%)
   - Expected: +7-8% total gain if near-perfect additivity holds

4. **FAST PGO Rebase**
   - Test C4+C5+C6 on the FAST PGO binary
   - Monitor for code bloat sensitivity (Phase 75-5 lesson)
   - Track mimalloc ratio progress

---

## Test Artifacts

### Log Files
- `/tmp/phase76_1_c4_baseline.log` (C4=0, 10 runs)
- `/tmp/phase76_1_c4_treatment.log` (C4=1, 10 runs)
- `/tmp/phase76_1_analysis.sh` (statistical analysis)

### Binary Information
- Binary: `./bench_random_mixed_hakmem`
- Build time: 2025-12-18 10:42
- Size: 674K
- Compiler: gcc -O3 -march=native -flto

---

## Conclusion

Phase 76-1 validates that the C4 inline slots optimization provides a **+1.73% throughput gain** on the Standard binary, completing the C4-C6 inline slots optimization trilogy.

The implementation follows the proven modular box pattern from Phase 75-1/75-2, integrates cleanly into the existing C5→C6→unified_cache cascade, and shows no adverse TLS or cache-miss effects.

**Recommendation**: Proceed with SSOT promotion in `core/bench_profile.h` and `scripts/run_mixed_10_cleanenv.sh`, setting `HAKMEM_TINY_C4_INLINE_SLOTS=1` as the new default.

---

**Phase 76-1 Status**: ✓ COMPLETE (GO, +1.73% gain validated on Standard binary)

**Next Phase**: Phase 76-2 (C4+C5+C6 4-point matrix validation) or SSOT promotion (if matrix deferred)