Files

Moe Charm (CI) 89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update

Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-18 18:50:00 +09:00

7.5 KiB

Raw Blame History

Phase 76-1: C4 Inline Slots A/B Test Results

Executive Summary

Decision: GO (+1.73% gain, exceeds +1.0% threshold)

Key Finding: C4 inline slots optimization provides +1.73% throughput gain on Standard binary, completing the C4/C5/C6 inline slots trilogy.

Implementation: Modular box pattern following Phase 75-1/75-2 (C6/C5) design, integrating C4 into existing cascade.

Implementation Summary

Modular Boxes Created

core/box/tiny_c4_inline_slots_env_box.h
- ENV gate: HAKMEM_TINY_C4_INLINE_SLOTS=0/1
- Lazy-init pattern (default OFF)
core/box/tiny_c4_inline_slots_tls_box.h
- TLS ring buffer: 64 slots (512B per thread)
- FIFO ring (head/tail indices, modulo 64)
core/front/tiny_c4_inline_slots.h
- c4_inline_push() - always_inline
- c4_inline_pop() - always_inline
core/tiny_c4_inline_slots.c
- TLS variable definition

Integration Points

Alloc Path (tiny_front_hot_box.h):

// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    void* base = c4_inline_pop(c4_inline_tls());
    if (TINY_HOT_LIKELY(base != NULL)) {
        return tiny_header_finalize_alloc(base, class_idx);
    }
}

Free Path (tiny_legacy_fallback_box.h):

// C4 FIRST → C5 → C6 → unified_cache
if (class_idx == 4 && tiny_c4_inline_slots_enabled()) {
    if (c4_inline_push(c4_inline_tls(), base)) {
        return;  // Success
    }
}

10-Run A/B Test Results

Test Configuration

Workload: Mixed SSOT (WS=400, ITERS=20000000)
Binary: ./bench_random_mixed_hakmem (Standard build)
Existing Defaults: C5=1, C6=1 (Phase 75-3 promoted)
Runs: 10 per configuration
Harness: scripts/run_mixed_10_cleanenv.sh

Raw Data

Run	Baseline (C4=0)	Treatment (C4=1)	Delta
1	52.91 M ops/s	53.87 M ops/s	+1.82%
2	52.52 M ops/s	53.16 M ops/s	+1.22%
3	53.26 M ops/s	53.64 M ops/s	+0.71%
4	53.45 M ops/s	53.30 M ops/s	-0.28%
5	51.88 M ops/s	52.62 M ops/s	+1.43%
6	52.83 M ops/s	53.81 M ops/s	+1.85%
7	50.41 M ops/s	52.76 M ops/s	+4.66%
8	51.89 M ops/s	53.46 M ops/s	+3.02%
9	53.03 M ops/s	53.62 M ops/s	+1.11%
10	51.97 M ops/s	53.00 M ops/s	+1.98%

Statistical Summary

Metric	Baseline (C4=0)	Treatment (C4=1)	Delta
Mean	52.42 M ops/s	53.33 M ops/s	+1.73%
Min	50.41 M ops/s	52.62 M ops/s	+4.39%
Max	53.45 M ops/s	53.87 M ops/s	+0.78%

Decision Matrix

Success Criteria

Criterion	Threshold	Actual	Pass
GO Threshold	≥ +1.0%	+1.73%	✓
NEUTRAL Range	±1.0%	N/A	N/A
NO-GO Threshold	≤ -1.0%	N/A	N/A

Decision: GO

Rationale:

Mean throughput gain of +1.73% exceeds GO threshold (+1.0%)
All individual runs show positive or near-zero delta (only 1/10 negative by -0.28%)
Consistent improvement across multiple runs (9/10 positive)
Pattern matches Phase 75-1 (C6: +2.87%) and Phase 75-2 (C5: +1.10%) success

Quality Rating: Strong GO (exceeds threshold by +0.73pp, robust across runs)

Per-Class Coverage Analysis

C4-C7 Optimization Status

Class	Size Range	Coverage %	Optimization	Status
C4	257-512B	14.29%	Inline Slots	GO (+1.73%)
C5	513-1024B	28.55%	Inline Slots	GO (+1.10%, Phase 75-2)
C6	1025-2048B	57.17%	Inline Slots	GO (+2.87%, Phase 75-1)
C7	2049-4096B	0.00%	N/A	NO-GO (Phase 76-0: 0% ops)

Combined C4-C6 Coverage: 100% of C4-C7 operations (14.29% + 28.55% + 57.17%)

Cumulative Gain Tracking

Optimization	Coverage	Individual Gain	Cumulative Impact
C6 Inline Slots (Phase 75-1)	57.17%	+2.87%	+2.87%
C5 Inline Slots (Phase 75-2)	28.55%	+1.10%	+3.97% (C5+C6 4-point: +5.41%)
C4 Inline Slots (Phase 76-1)	14.29%	+1.73%	+7.14% (estimated, C4+C5+C6 combined)

Note: Actual cumulative gain will be measured in follow-up 4-point matrix test if needed. Phase 75-3 showed C5+C6 achieved +5.41% (near-perfect sub-additivity at 1.72%).

TLS Layout Impact

TLS Cost Summary

Component	Capacity	Size per Thread	Total (C4+C5+C6)
C4 inline slots	64	512B	-
C5 inline slots	128	1,024B	-
C6 inline slots	128	1,024B	-
Combined	-	-	2,560B (~2.5KB)

System-Wide (10 threads): ~25KB total Per-Thread L1-dcache: +2.5KB footprint

Observation: No cache-miss spike observed (unlike Phase 74-2 LOCALIZE which showed +86% cache-misses). TLS expansion of 512B for C4 is well within safe limits.

Comparison: C4 vs C5 vs C6

Phase	Class	Coverage	Capacity	TLS Cost	Individual Gain
75-1	C6	57.17%	128	1KB	+2.87% (highest)
75-2	C5	28.55%	128	1KB	+1.10%
76-1	C4	14.29%	64	512B	+1.73%

Key Insight: C4 achieves +1.73% gain with only 14.29% coverage, showing higher efficiency per-operation than C5 (+1.10% with 28.55% coverage). This suggests C4 class has higher branch overhead in the baseline unified_cache path.

Recommended Actions

Immediate (Required)

✓ Promote C4 Inline Slots to SSOT
- Set HAKMEM_TINY_C4_INLINE_SLOTS=1 (default ON)
- Update core/bench_profile.h
- Update scripts/run_mixed_10_cleanenv.sh
✓ Document Phase 76-1 Results
- Create PHASE76_1_C4_INLINE_SLOTS_RESULTS.md
- Update CURRENT_TASK.md
- Record in PERFORMANCE_TARGETS_SCORECARD.md

Optional (Future Work)

4-Point Matrix Test (C4+C5+C6)
- Measure full combined effect
- Quantify sub-additivity (C4 + (C5+C6 proven +5.41%))
- Expected: +7-8% total gain if near-perfect additivity holds
FAST PGO Rebase
- Test C4+C5+C6 on FAST PGO binary
- Monitor for code bloat sensitivity (Phase 75-5 lesson)
- Track mimalloc ratio progress

Test Artifacts

Log Files

/tmp/phase76_1_c4_baseline.log (C4=0, 10 runs)
/tmp/phase76_1_c4_treatment.log (C4=1, 10 runs)
/tmp/phase76_1_analysis.sh (statistical analysis)

Binary Information

Binary: ./bench_random_mixed_hakmem
Build time: 2025-12-18 10:42
Size: 674K
Compiler: gcc -O3 -march=native -flto

Conclusion

Phase 76-1 validates that C4 inline slots optimization provides +1.73% throughput gain on Standard binary, completing the C4-C6 inline slots optimization trilogy.

The implementation follows the proven modular box pattern from Phase 75-1/75-2, integrates cleanly into the existing C5→C6→unified_cache cascade, and shows no adverse TLS or cache-miss effects.

Recommendation: Proceed with SSOT promotion to core/bench_profile.h and scripts/run_mixed_10_cleanenv.sh, setting HAKMEM_TINY_C4_INLINE_SLOTS=1 as the new default.

Phase 76-1 Status: ✓ COMPLETE (GO, +1.73% gain validated on Standard binary)

Next Phase: Phase 76-2 (C4+C5+C6 4-point matrix validation) or SSOT promotion (if matrix deferred)

7.5 KiB Raw Blame History