
Moe Charm (CI) 10fb0497e2 Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - NEUTRAL (-0.71%)

Implemented C7 ULTRA allocation hotpath optimization attempt as per Phase 62A instructions.

Objective: Reduce dependency chain in tiny_c7_ultra_alloc() by:
1. Eliminating per-call tiny_front_v3_c7_ultra_header_light_enabled() checks
2. Using TLS headers_initialized flag set during refill
3. Reducing branch count and register pressure

Implementation:
- New ENV box: core/box/c7_ultra_alloc_depchain_opt_box.h
- HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0/1 gate (default OFF)
- Modified tiny_c7_ultra_alloc() with optimized path
- Preserved original path for compatibility

Results (Mixed benchmark, 10-run):
- Baseline (OPT=0): 59.300 M ops/s (CV 1.98%)
- Treatment (OPT=1): 58.879 M ops/s (CV 1.83%)
- Delta: -0.71% (NEUTRAL, within ±1.0% threshold but negative)
- Status: NEUTRAL → Research box (default OFF)

Root Cause Analysis:
1. LTO optimization already inlines header_light function (call cost = 0)
2. TLS access (memory load + offset) not cheaper than function call
3. Layout tax from code addition (I-cache disruption pattern from Phases 43/46A/47)
4. The 5.18% stack share is not an optimizable hotspot (already well-optimized)

Key Lessons:
- LTO-optimized function calls can be cheaper than TLS field access
- Micro-optimizations on already-optimized paths show diminishing/negative returns
- The remaining gap to mimalloc (throughput is 48.34% of mimalloc) is likely algorithmic, not micro-architectural
- Layout tax remains consistent pattern across attempted micro-optimizations

Decision:
- NEUTRAL verdict → kept as research box with ENV gate (default OFF)
- Not adopted as production default
- Next phases: Option B (production readiness pivot) likely higher ROI than further micro-opts

Box Theory Compliance: ✅ Compliant (single point, reversible, clear boundary)
Performance Compliance: ❌ No (-0.71% regression)

Documentation:
- PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md: Full A/B test analysis
- CURRENT_TASK.md: Updated with results and next phase options

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

2025-12-17 16:34:03 +09:00


Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - Results

Date: 2025-12-17
Status: NEUTRAL (-0.71%, research box)
Baseline: 48.34% of mimalloc (Phase 59b Speed-first)

Executive Summary

Phase 62A attempted to optimize the tiny_c7_ultra_alloc() hot path by eliminating the per-call tiny_front_v3_c7_ultra_header_light_enabled() check and using a TLS headers_initialized flag instead. Mean throughput moved by -0.71%, which falls inside the ±1.0% NEUTRAL band, so the approach did not deliver the expected +1-3% gain.

Conclusion: Research box (default OFF, HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)

A/B Test Results (Mixed benchmark, 10-run)

Baseline (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)

Runs (M ops/s):

59.553, 59.906, 60.134, 59.533, 56.265, 59.368, 60.045, 58.487, 60.141, 59.569

Statistics:

Mean: 59.300 M ops/s
Median: 59.561 M ops/s
StdDev: 1.173 M ops/s
CV: 1.98%

Treatment (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=1)

Runs (M ops/s):

56.352, 58.924, 58.946, 60.109, 58.630, 58.689, 59.609, 58.160, 59.939, 59.430

Statistics:

Mean: 58.879 M ops/s
Median: 58.935 M ops/s
StdDev: 1.079 M ops/s
CV: 1.83%

Comparison

Metric	Baseline	Treatment	Delta
Mean	59.300	58.879	-0.71%
Median	59.561	58.935	-1.05%
StdDev	1.173	1.079	-8.0%
CV	1.98%	1.83%	-0.15pp

Verdict: NEUTRAL (-0.71% within ±1.0% threshold, but negative)
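
The comparison figures can be re-derived from the raw runs listed above. The following is a minimal sketch in C, not the project's actual tooling: the helper names (stats_mean, stats_median, delta_pct, classify) and the GO label for gains above the threshold are assumptions made for illustration. Only the run values, the ±1.0% threshold, and the NEUTRAL and NO-GO labels come from this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Raw runs copied from this report (M ops/s). */
static const double k_baseline[10] = {
    59.553, 59.906, 60.134, 59.533, 56.265,
    59.368, 60.045, 58.487, 60.141, 59.569
};
static const double k_treatment[10] = {
    56.352, 58.924, 58.946, 60.109, 58.630,
    58.689, 59.609, 58.160, 59.939, 59.430
};

/* Arithmetic mean of n samples. */
static double stats_mean(const double *x, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* qsort comparator for doubles (ascending). */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts a local copy, so the input stays untouched. */
static double stats_median(const double *x, size_t n) {
    double tmp[64];
    assert(n > 0 && n <= 64);
    memcpy(tmp, x, n * sizeof tmp[0]);
    qsort(tmp, n, sizeof tmp[0], cmp_double);
    return (n % 2) ? tmp[n / 2] : 0.5 * (tmp[n / 2 - 1] + tmp[n / 2]);
}

/* Relative change of treatment versus baseline, in percent. */
static double delta_pct(double baseline, double treatment) {
    return (treatment - baseline) / baseline * 100.0;
}

/* Hypothetical verdict helper: NEUTRAL when |delta| <= threshold. */
typedef enum { VERDICT_NO_GO = -1, VERDICT_NEUTRAL = 0, VERDICT_GO = 1 } verdict_t;

static verdict_t classify(double delta, double threshold_pct) {
    if (delta > threshold_pct) return VERDICT_GO;
    if (delta < -threshold_pct) return VERDICT_NO_GO;
    return VERDICT_NEUTRAL;
}
```

Passing the two arrays through stats_mean and delta_pct gives a mean delta of about -0.71%, which classify(delta, 1.0) maps to VERDICT_NEUTRAL. The median delta (about -1.05%) would fall just outside the band, which is why the choice of mean as the deciding metric matters here.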

Implementation Details

Optimization Strategy

Original Code (tiny_c7_ultra_alloc hot path):

void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    const bool header_light = tiny_front_v3_c7_ultra_header_light_enabled();  // Per-call check

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (header_light) {  // Per-call branch
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}

Optimized Code (Phase 62A):

void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    // No per-call header_light check - use TLS flag instead

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (tls->headers_initialized) {  // TLS flag set during refill
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}

Intended Benefits:

- Eliminate the per-call tiny_front_v3_c7_ultra_header_light_enabled() function call
- Replace it with a TLS field access (already in cache from count/freelist)
- Reduce dependency chain length
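
To make the intended mechanism concrete, here is a self-contained toy model of the "flag set during refill" idea. It is a sketch under stated assumptions, not the project's code: the toy_* names, the capacity, the block size, and the one-byte header layout are hypothetical, and it assumes that tiny_region_id_write_header() writes a class byte at base and returns base + 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CAP      8   /* hypothetical freelist capacity */
#define TOY_BLOCK    64  /* hypothetical block size in bytes */
#define TOY_CLASS_ID 7   /* class index, as in the C7 path */

typedef struct {
    void*    freelist[TOY_CAP];
    uint16_t count;
    bool     headers_initialized;  /* true once refill has written every header */
} toy_tls_t;

static _Thread_local toy_tls_t g_toy_tls;
static _Thread_local uint8_t   g_toy_slab[TOY_CAP * TOY_BLOCK];

/* Refill: carve blocks from the slab, write each 1-byte header up front,
 * then record that fact in the TLS flag. */
static void toy_refill(void) {
    toy_tls_t* tls = &g_toy_tls;
    for (uint16_t i = 0; i < TOY_CAP; i++) {
        uint8_t* base = &g_toy_slab[(size_t)i * TOY_BLOCK];
        base[0] = TOY_CLASS_ID;
        tls->freelist[i] = base;
    }
    tls->count = TOY_CAP;
    tls->headers_initialized = true;
}

/* Alloc: pop from the TLS freelist; the header write is skipped
 * when refill has already done it. */
static void* toy_alloc(void) {
    toy_tls_t* tls = &g_toy_tls;
    if (tls->count == 0) toy_refill();

    uint16_t n = tls->count;
    uint8_t* base = tls->freelist[n - 1];
    tls->count = (uint16_t)(n - 1);

    if (!tls->headers_initialized) {
        base[0] = TOY_CLASS_ID;  /* fallback: write the header per call */
    }
    return base + 1;             /* user pointer sits just past the header */
}
```

In this toy the flag is always true after the first refill, so the fallback branch never runs. The model only shows where the work moves; it says nothing about whether the move is faster, which is the question the A/B test answers.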

Root Cause Analysis

Why No Improvement?

1. LTO Optimization Already In Place
   - In HAKMEM_BENCH_MINIMAL (-flto), tiny_front_v3_c7_ultra_header_light_enabled() is likely already inlined
   - Function call overhead may already be zero at compile time
   - Replacing with TLS field access doesn't improve latency (still L1 cache hit)
2. TLS Access Not Cheaper Than Expected
   - TLS field headers_initialized requires offset calculation + memory access
   - Function call overhead may actually be lower (register-based, already predicted)
   - Branch prediction on if (header_light) may be extremely accurate (99.99%+)
3. Layout Tax from Added Code
   - Phases 43, 46A, 47 precedent: adding code branches can cause I-cache/alignment disruption
   - Added if-dispatch at function entry (if (!c7_ultra_alloc_depchain_opt_enabled())) may affect code layout
   - Result: -0.71% regression consistent with pattern
4. Hot Path May Already Be Optimal
   - Phase 61 profiling showed tiny_c7_ultra_alloc at 5.18% stack %
   - But function-level optimization attempts (Phase 43/46A/47) all showed negative or marginal returns
   - Suggests hot path is already well-optimized by compiler

Lessons Learned

1. Function Call Overhead is Negligible in LTO Mode

With link-time optimization (-flto), calls to simple getters are aggressively inlined. Removing them doesn't necessarily improve performance because:

- The compiler has already chosen its inlining, so there may be no call left to remove
- Instruction fetch overhead may not be the bottleneck
- Replacing the call with a memory access can have similar latency

2. Layout Tax is Real and Persistent

This is the third time (Phase 43: -1.18%, Phase 46A: -0.68%, Phase 62A: -0.71%) that code addition/reorganization has resulted in regressions despite targeting hot functions. Pattern suggests:

- I-cache alignment matters more than instruction count
- Code layout disruptions can negate micro-optimization gains
- Box Theory "minimal code change" principle is well-justified

3. Per-Call Flags May Be Faster Than Per-TLS State

Counter-intuitive finding: accessing a per-call computed flag (via function inlining) may be faster than accessing TLS state, because:

- Function results are likely in registers (temporary)
- TLS access requires memory load + offset calculation
- Branch predictor handles pattern well

4. 5.18% Stack % ≠ Optimizable Hotspot

Phase 61 profiling showed tiny_c7_ultra_alloc at 5.18% combined stack overhead, but this is misleading because:

- Much of the time is in malloc/free wrappers and benchmark loop (not C7 ultra itself)
- Self time is likely 2-3% (actual function execution)
- Micro-optimizations on already-optimized paths yield diminishing returns
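
The ceiling implied by these numbers can be checked with Amdahl's law. The sketch below is illustrative only: the function names are hypothetical, and the 2-3% self-time figure is this report's estimate, not a measurement made here.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction f of total run time
 * is made s times faster and the rest is unchanged. */
static double amdahl_speedup(double f, double s) {
    assert(f >= 0.0 && f < 1.0 && s > 0.0);
    return 1.0 / ((1.0 - f) + f / s);
}

/* Upper bound on throughput gain, in percent, if the fraction f
 * were eliminated entirely (the limit of s going to infinity). */
static double max_gain_pct(double f) {
    assert(f >= 0.0 && f < 1.0);
    return (1.0 / (1.0 - f) - 1.0) * 100.0;
}
```

With a self time of 3%, max_gain_pct(0.03) is about 3.1%, and that requires deleting the function outright. A more realistic 2x speedup of the same code, amdahl_speedup(0.03, 2.0), yields roughly 1.5% overall, which is close to the ±1.0% noise threshold used for the verdict.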

Decision

NEUTRAL (research box):

- Set default to HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0 (OFF)
- Keep code with ENV gate for future reference
- Do not adopt as production default
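
For readers unfamiliar with the ENV-gate pattern, a default-OFF gate of this kind might look like the sketch below. The contents of core/box/c7_ultra_alloc_depchain_opt_box.h are not shown in this report, so the body here is an assumption: only the function name c7_ultra_alloc_depchain_opt_enabled() and the variable name HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT come from the text, while depchain_opt_parse and the cached-int scheme are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical parser: only the exact string "1" enables the gate,
 * so an unset or malformed variable leaves it OFF. */
static int depchain_opt_parse(const char* value) {
    return (value != NULL && value[0] == '1' && value[1] == '\0') ? 1 : 0;
}

/* Cached gate state: -1 = not read yet, 0 = OFF, 1 = ON. */
static int g_depchain_opt_cached = -1;

/* Reads the environment once, then answers from the cached value.
 * Not synchronized; concurrent first calls would simply compute the
 * same answer, but a production gate may want stricter guarantees. */
static inline int c7_ultra_alloc_depchain_opt_enabled(void) {
    if (g_depchain_opt_cached < 0) {
        g_depchain_opt_cached =
            depchain_opt_parse(getenv("HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT"));
    }
    assert(g_depchain_opt_cached == 0 || g_depchain_opt_cached == 1);
    return g_depchain_opt_cached;
}
```

Even when cached, a gate like this adds a load and a branch at function entry, which is the kind of added code the layout-tax discussion above refers to.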

Next Steps:

- Phase 62B: Try secondary target (tiny_region_id_write_header reordering) - higher risk
- Or pivot to Phase 62C: Accept 48.34% as performance ceiling, focus on production readiness
- Or Phase 62D: Algorithmic redesign (batching, prefault strategy) - very high cost/risk

Box Theory Compliance

Principle	Status	Notes
Single Conversion Point	✅ Yes	`tiny_c7_ultra_alloc()` boundary
Clear Boundary	✅ Yes	Env gate `HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT`
Reversible	✅ Yes	Can switch via ENV or compile flag
No Side Effects	✅ Yes	Pure optimization attempt, no new data structures
Performance	❌ No	-0.71% (NEUTRAL), not adopted as default

Overall: Box Theory compliant but performance non-compliant.

Appendix: Raw Data

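Baseline (10-run, M ops/s)

59.553, 59.906, 60.134, 59.533, 56.265, 59.368, 60.045, 58.487, 60.141, 59.569

Treatment (10-run, M ops/s)

56.352, 58.924, 58.946, 60.109, 58.630, 58.689, 59.609, 58.160, 59.939, 59.430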

End of Phase 62A Report
