Key changes: - Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible) Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns - Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M): tcmalloc: 115.26M (92.33% of mimalloc) jemalloc: 97.39M (77.96% of mimalloc) system: 85.20M (68.24% of mimalloc) mimalloc: 124.82M (baseline) - hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements Result: baseline stabilized to 55.53M (44.46% of mimalloc) Previous unstable measurement (35.57M) was due to profile leak - Documentation: * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO) * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology - M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
7.5 KiB
Phase 76-1: C4 Inline Slots A/B Test Results
Executive Summary
Decision: GO (+1.73% gain, exceeds +1.0% threshold)
Key Finding: C4 inline slots optimization provides +1.73% throughput gain on Standard binary, completing the C4/C5/C6 inline slots trilogy.
Implementation: Modular box pattern following Phase 75-1/75-2 (C6/C5) design, integrating C4 into existing cascade.
Implementation Summary
Modular Boxes Created
-
core/box/tiny_c4_inline_slots_env_box.h- ENV gate:
HAKMEM_TINY_C4_INLINE_SLOTS=0/1 - Lazy-init pattern (default OFF)
- ENV gate:
-
core/box/tiny_c4_inline_slots_tls_box.h- TLS ring buffer: 64 slots (512B per thread)
- FIFO ring (head/tail indices, modulo 64)
-
core/front/tiny_c4_inline_slots.hc4_inline_push()- always_inlinec4_inline_pop()- always_inline
-
core/tiny_c4_inline_slots.c- TLS variable definition
Integration Points
Alloc Path (tiny_front_hot_box.h):
// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
void* base = c4_inline_pop(c4_inline_tls());
if (TINY_HOT_LIKELY(base != NULL)) {
return tiny_header_finalize_alloc(base, class_idx);
}
}
Free Path (tiny_legacy_fallback_box.h):
// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
if (c4_inline_push(c4_inline_tls(), base)) {
return; // Success
}
}
10-Run A/B Test Results
Test Configuration
- Workload: Mixed SSOT (WS=400, ITERS=20000000)
- Binary:
./bench_random_mixed_hakmem(Standard build) - Existing Defaults: C5=1, C6=1 (Phase 75-3 promoted)
- Runs: 10 per configuration
- Harness:
scripts/run_mixed_10_cleanenv.sh
Raw Data
| Run | Baseline (C4=0) | Treatment (C4=1) | Delta |
|---|---|---|---|
| 1 | 52.91 M ops/s | 53.87 M ops/s | +1.82% |
| 2 | 52.52 M ops/s | 53.16 M ops/s | +1.22% |
| 3 | 53.26 M ops/s | 53.64 M ops/s | +0.71% |
| 4 | 53.45 M ops/s | 53.30 M ops/s | -0.28% |
| 5 | 51.88 M ops/s | 52.62 M ops/s | +1.43% |
| 6 | 52.83 M ops/s | 53.81 M ops/s | +1.85% |
| 7 | 50.41 M ops/s | 52.76 M ops/s | +4.66% |
| 8 | 51.89 M ops/s | 53.46 M ops/s | +3.02% |
| 9 | 53.03 M ops/s | 53.62 M ops/s | +1.11% |
| 10 | 51.97 M ops/s | 53.00 M ops/s | +1.98% |
Statistical Summary
| Metric | Baseline (C4=0) | Treatment (C4=1) | Delta |
|---|---|---|---|
| Mean | 52.42 M ops/s | 53.33 M ops/s | +1.73% |
| Min | 50.41 M ops/s | 52.62 M ops/s | +4.39% |
| Max | 53.45 M ops/s | 53.87 M ops/s | +0.78% |
Decision Matrix
Success Criteria
| Criterion | Threshold | Actual | Pass |
|---|---|---|---|
| GO Threshold | ≥ +1.0% | +1.73% | ✓ |
| NEUTRAL Range | ±1.0% | N/A | N/A |
| NO-GO Threshold | ≤ -1.0% | N/A | N/A |
Decision: GO
Rationale:
- Mean throughput gain of +1.73% exceeds GO threshold (+1.0%)
- All individual runs show positive or near-zero delta (only 1/10 negative by -0.28%)
- Consistent improvement across multiple runs (9/10 positive)
- Pattern matches Phase 75-1 (C6: +2.87%) and Phase 75-2 (C5: +1.10%) success
Quality Rating: Strong GO (exceeds threshold by +0.73pp, robust across runs)
Per-Class Coverage Analysis
C4-C7 Optimization Status
| Class | Size Range | Coverage % | Optimization | Status |
|---|---|---|---|---|
| C4 | 257-512B | 14.29% | Inline Slots | GO (+1.73%) |
| C5 | 513-1024B | 28.55% | Inline Slots | GO (+1.10%, Phase 75-2) |
| C6 | 1025-2048B | 57.17% | Inline Slots | GO (+2.87%, Phase 75-1) |
| C7 | 2049-4096B | 0.00% | N/A | NO-GO (Phase 76-0: 0% ops) |
Combined C4-C6 Coverage: 100% of C4-C7 operations (14.29% + 28.55% + 57.17%)
Cumulative Gain Tracking
| Optimization | Coverage | Individual Gain | Cumulative Impact |
|---|---|---|---|
| C6 Inline Slots (Phase 75-1) | 57.17% | +2.87% | +2.87% |
| C5 Inline Slots (Phase 75-2) | 28.55% | +1.10% | +3.97% (C5+C6 4-point: +5.41%) |
| C4 Inline Slots (Phase 76-1) | 14.29% | +1.73% | +7.14% (estimated, C4+C5+C6 combined) |
Note: Actual cumulative gain will be measured in follow-up 4-point matrix test if needed. Phase 75-3 showed C5+C6 achieved +5.41% (near-perfect sub-additivity at 1.72%).
TLS Layout Impact
TLS Cost Summary
| Component | Capacity | Size per Thread | Total (C4+C5+C6) |
|---|---|---|---|
| C4 inline slots | 64 | 512B | - |
| C5 inline slots | 128 | 1,024B | - |
| C6 inline slots | 128 | 1,024B | - |
| Combined | - | - | 2,560B (~2.5KB) |
System-Wide (10 threads): ~25KB total Per-Thread L1-dcache: +2.5KB footprint
Observation: No cache-miss spike observed (unlike Phase 74-2 LOCALIZE which showed +86% cache-misses). TLS expansion of 512B for C4 is well within safe limits.
Comparison: C4 vs C5 vs C6
| Phase | Class | Coverage | Capacity | TLS Cost | Individual Gain |
|---|---|---|---|---|---|
| 75-1 | C6 | 57.17% | 128 | 1KB | +2.87% (highest) |
| 75-2 | C5 | 28.55% | 128 | 1KB | +1.10% |
| 76-1 | C4 | 14.29% | 64 | 512B | +1.73% |
Key Insight: C4 achieves +1.73% gain with only 14.29% coverage, showing higher efficiency per-operation than C5 (+1.10% with 28.55% coverage). This suggests C4 class has higher branch overhead in the baseline unified_cache path.
Recommended Actions
Immediate (Required)
-
✓ Promote C4 Inline Slots to SSOT
- Set
HAKMEM_TINY_C4_INLINE_SLOTS=1(default ON) - Update
core/bench_profile.h - Update
scripts/run_mixed_10_cleanenv.sh
- Set
-
✓ Document Phase 76-1 Results
- Create
PHASE76_1_C4_INLINE_SLOTS_RESULTS.md - Update
CURRENT_TASK.md - Record in
PERFORMANCE_TARGETS_SCORECARD.md
- Create
Optional (Future Work)
-
4-Point Matrix Test (C4+C5+C6)
- Measure full combined effect
- Quantify sub-additivity (C4 + (C5+C6 proven +5.41%))
- Expected: +7-8% total gain if near-perfect additivity holds
-
FAST PGO Rebase
- Test C4+C5+C6 on FAST PGO binary
- Monitor for code bloat sensitivity (Phase 75-5 lesson)
- Track mimalloc ratio progress
Test Artifacts
Log Files
/tmp/phase76_1_c4_baseline.log(C4=0, 10 runs)/tmp/phase76_1_c4_treatment.log(C4=1, 10 runs)/tmp/phase76_1_analysis.sh(statistical analysis)
Binary Information
- Binary:
./bench_random_mixed_hakmem - Build time: 2025-12-18 10:42
- Size: 674K
- Compiler: gcc -O3 -march=native -flto
Conclusion
Phase 76-1 validates that C4 inline slots optimization provides +1.73% throughput gain on Standard binary, completing the C4-C6 inline slots optimization trilogy.
The implementation follows the proven modular box pattern from Phase 75-1/75-2, integrates cleanly into the existing C5→C6→unified_cache cascade, and shows no adverse TLS or cache-miss effects.
Recommendation: Proceed with SSOT promotion to core/bench_profile.h and scripts/run_mixed_10_cleanenv.sh, setting HAKMEM_TINY_C4_INLINE_SLOTS=1 as the new default.
Phase 76-1 Status: ✓ COMPLETE (GO, +1.73% gain validated on Standard binary)
Next Phase: Phase 76-2 (C4+C5+C6 4-point matrix validation) or SSOT promotion (if matrix deferred)