hakmem/docs/analysis/PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md
Moe Charm (CI) 10fb0497e2 Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - NEUTRAL (-0.71%)
Implemented C7 ULTRA allocation hotpath optimization attempt as per Phase 62A instructions.

Objective: Reduce dependency chain in tiny_c7_ultra_alloc() by:
1. Eliminating per-call tiny_front_v3_c7_ultra_header_light_enabled() checks
2. Using TLS headers_initialized flag set during refill
3. Reducing branch count and register pressure

Implementation:
- New ENV box: core/box/c7_ultra_alloc_depchain_opt_box.h
- HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0/1 gate (default OFF)
- Modified tiny_c7_ultra_alloc() with optimized path
- Preserved original path for compatibility

Results (Mixed benchmark, 10-run):
- Baseline (OPT=0): 59.300 M ops/s (CV 1.98%)
- Treatment (OPT=1): 58.879 M ops/s (CV 1.83%)
- Delta: -0.71% (NEUTRAL, within ±1.0% threshold but negative)
- Status: NEUTRAL → Research box (default OFF)

Root Cause Analysis:
1. LTO optimization already inlines header_light function (call cost = 0)
2. TLS access (memory load + offset) not cheaper than function call
3. Layout tax from code addition (I-cache disruption pattern from Phases 43/46A/47)
4. 5.18% stack % is not optimizable hotspot (already well-optimized)

Key Lessons:
- LTO-optimized function calls can be cheaper than TLS field access
- Micro-optimizations on already-optimized paths show diminishing/negative returns
- The gap to mimalloc (currently at 48.34% of its throughput) is likely algorithmic, not micro-architectural
- Layout tax remains consistent pattern across attempted micro-optimizations

Decision:
- NEUTRAL verdict → kept as research box with ENV gate (default OFF)
- Not adopted as production default
- Next phases: Option B (production readiness pivot) likely higher ROI than further micro-opts

Box Theory Compliance: Compliant (single point, reversible, clear boundary)
Performance Compliance: No (-0.71% regression)

Documentation:
- PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md: Full A/B test analysis
- CURRENT_TASK.md: Updated with results and next phase options

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-17 16:34:03 +09:00


Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - Results

Date: 2025-12-17
Status: NEUTRAL (-0.71%, research box)
Baseline: 48.34% of mimalloc (Phase 59b Speed-first)


Executive Summary

Phase 62A attempted to optimize the tiny_c7_ultra_alloc() hot path by eliminating the per-call tiny_front_v3_c7_ultra_header_light_enabled() check and branching on a TLS headers_initialized flag instead. The change measured -0.71% against baseline, which falls inside the ±1.0% NEUTRAL band but is negative, so the approach did not deliver the expected +1-3% gain.

Conclusion: Research box (default OFF, HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)


A/B Test Results (Mixed benchmark, 10-run)

Baseline (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0)

Runs (M ops/s):

59.553, 59.906, 60.134, 59.533, 56.265, 59.368, 60.045, 58.487, 60.141, 59.569

Statistics:

  • Mean: 59.300 M ops/s
  • Median: 59.561 M ops/s
  • StdDev: 1.173 M ops/s
  • CV: 1.98%

Treatment (HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=1)

Runs (M ops/s):

56.352, 58.924, 58.946, 60.109, 58.630, 58.689, 59.609, 58.160, 59.939, 59.430

Statistics:

  • Mean: 58.879 M ops/s
  • Median: 58.935 M ops/s
  • StdDev: 1.079 M ops/s
  • CV: 1.83%

Comparison

| Metric | Baseline | Treatment | Delta |
|---|---|---|---|
| Mean (M ops/s) | 59.300 | 58.879 | -0.71% |
| Median (M ops/s) | 59.561 | 58.935 | -1.05% |
| StdDev (M ops/s) | 1.173 | 1.079 | -8.0% |
| CV | 1.98% | 1.83% | -0.15pp |

Verdict: NEUTRAL (-0.71% within ±1.0% threshold, but negative)
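The summary figures can be reproduced from the raw runs listed in the appendix. The following is a minimal sketch, not the project's benchmark tooling: the function names, the array names, and the "GO" label are assumptions made for illustration, and only the ±1.0% NEUTRAL band comes from this report. A small Newton iteration stands in for sqrt() so the snippet builds without linking libm.

```c
#include <assert.h>
#include <stddef.h>

/* Raw runs from the appendix, in M ops/s. */
const double PHASE62A_BASELINE[10] = {
    59.553099, 59.906197, 60.134051, 59.533090, 56.265139,
    59.367898, 60.044922, 58.486467, 60.141028, 59.568791
};
const double PHASE62A_TREATMENT[10] = {
    56.351851, 58.923605, 58.946089, 60.109441, 58.629557,
    58.689160, 59.609485, 58.160391, 59.939368, 59.430088
};

/* Square root by Newton iteration (avoids a libm dependency). */
double sqrt_newton(double v) {
    if (v <= 0.0) return 0.0;
    double r = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 64; i++) r = 0.5 * (r + v / r);
    return r;
}

/* Arithmetic mean of n samples. */
double mean_of(const double *x, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Sample standard deviation (n - 1 in the denominator). */
double stddev_of(const double *x, size_t n) {
    assert(n > 1);
    const double m = mean_of(x, n);
    double ss = 0.0;
    for (size_t i = 0; i < n; i++) ss += (x[i] - m) * (x[i] - m);
    return sqrt_newton(ss / (double)(n - 1));
}

/* Coefficient of variation in percent. */
double cv_pct(const double *x, size_t n) {
    return stddev_of(x, n) / mean_of(x, n) * 100.0;
}

/* Relative change of treatment versus baseline, in percent. */
double delta_pct(double baseline, double treatment) {
    return (treatment - baseline) / baseline * 100.0;
}

/* NEUTRAL inside +/- threshold_pct; the "GO" label is an assumption. */
const char *verdict_for(double delta, double threshold_pct) {
    if (delta > threshold_pct) return "GO";
    if (delta < -threshold_pct) return "NO-GO";
    return "NEUTRAL";
}
```

Called on the two arrays, mean_of() gives about 59.300 and 58.879 M ops/s, delta_pct() gives about -0.71, and verdict_for(delta, 1.0) classifies that as NEUTRAL.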


Implementation Details

Optimization Strategy

Original Code (tiny_c7_ultra_alloc hot path):

void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    const bool header_light = tiny_front_v3_c7_ultra_header_light_enabled();  // Per-call check

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (header_light) {  // Per-call branch
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}

Optimized Code (Phase 62A):

void* tiny_c7_ultra_alloc(size_t size) {
    tiny_c7_ultra_tls_t* tls = &g_tiny_c7_ultra_tls;
    // No per-call header_light check - use TLS flag instead

    uint16_t n = tls->count;
    if (n > 0) {
        void* base = tls->freelist[n - 1];
        tls->count = n - 1;

        if (tls->headers_initialized) {  // TLS flag set during refill
            return (uint8_t*)base + 1;
        }
        return tiny_region_id_write_header(base, 7);
    }
    // ... refill and retry
}

Intended Benefits:

  1. Eliminate per-call tiny_front_v3_c7_ultra_header_light_enabled() function call
  2. Replace with TLS field access (already in cache from count/freelist)
  3. Reduce dependency chain length
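How a refill-time flag can replace the per-call check is sketched below with a toy, single-size-class cache. Everything prefixed toy_ or TOY_ (the struct layout, capacity, block size, header byte value, and the toy_header_light_mode switch) is hypothetical and stands in for HAKMEM internals that this report does not show. The sketch also assumes that in header-light mode the headers are written once during refill, which is what makes skipping the per-allocation write safe.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CAP        4      /* hypothetical cache capacity */
#define TOY_BLOCK_SIZE 64     /* hypothetical block size in bytes */
#define TOY_HEADER     0xA7   /* hypothetical 1-byte header value */

typedef struct {
    void    *freelist[TOY_CAP];
    uint16_t count;
    bool     headers_initialized;  /* published once per refill */
} toy_c7_tls_t;

_Thread_local toy_c7_tls_t g_toy_tls;
_Thread_local uint8_t g_toy_slab[TOY_CAP * TOY_BLOCK_SIZE];

/* Stand-in for the header-light configuration query. */
bool toy_header_light_mode = true;

/* Slow variant: write the header on every allocation. */
void *toy_write_header(void *base) {
    uint8_t *p = (uint8_t *)base;
    p[0] = TOY_HEADER;
    return p + 1;
}

/* Refill: query the mode once, pre-write headers if light mode is on,
 * and record the outcome in the TLS flag. The toy simply re-carves the
 * same slab and has no free path. */
void toy_refill(toy_c7_tls_t *tls) {
    const bool light = toy_header_light_mode;
    for (uint16_t i = 0; i < TOY_CAP; i++) {
        uint8_t *base = &g_toy_slab[(size_t)i * TOY_BLOCK_SIZE];
        if (light) base[0] = TOY_HEADER;
        tls->freelist[i] = base;
    }
    tls->count = TOY_CAP;
    tls->headers_initialized = light;
}

/* Hot path: branch on the TLS flag, no per-call mode query. */
void *toy_alloc(void) {
    toy_c7_tls_t *tls = &g_toy_tls;
    if (tls->count == 0) toy_refill(tls);
    const uint16_t n = tls->count;
    void *base = tls->freelist[n - 1];
    tls->count = (uint16_t)(n - 1);
    if (tls->headers_initialized) return (uint8_t *)base + 1;
    return toy_write_header(base);
}
```

In both modes the caller receives a pointer one byte past a valid header; the only difference is whether the header write happens at refill time or at allocation time.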

Root Cause Analysis

Why No Improvement?

  1. LTO Optimization Already In Place

    • In HAKMEM_BENCH_MINIMAL (-flto), tiny_front_v3_c7_ultra_header_light_enabled() is likely already inlined
    • Function call overhead may already be zero at compile time
    • Replacing with TLS field access doesn't improve latency (still L1 cache hit)
  2. TLS Access Not Cheaper Than Expected

    • Reading the TLS field headers_initialized still costs an offset calculation plus a memory load
    • The inlined getter may actually be cheaper: its result can stay in a register, and the branch on it is already well predicted
    • Branch prediction on if (header_light) may be extremely accurate (99.99%+)
  3. Layout Tax from Added Code

    • Phases 43, 46A, 47 precedent: adding code branches can cause I-cache/alignment disruption
    • Added if-dispatch at function entry (if (!c7_ultra_alloc_depchain_opt_enabled())) may affect code layout
    • Result: -0.71% regression consistent with pattern
  4. Hot Path May Already Be Optimal

    • Phase 61 profiling showed tiny_c7_ultra_alloc at 5.18% stack %
    • But function-level optimization attempts (Phase 43/46A/47) all showed negative or marginal returns
    • Suggests hot path is already well-optimized by compiler

Lessons Learned

1. Function Call Overhead is Negligible in LTO Mode

With -flto and link-time optimization, function calls to simple getters are aggressively inlined. Removing them doesn't necessarily improve performance because:

  • Compiler already determined optimal inlining
  • Instruction fetch overhead may not be the bottleneck
  • Replacing call with memory access can have similar latency

2. Layout Tax is Real and Persistent

This is the third listed case (Phase 43: -1.18%, Phase 46A: -0.68%, Phase 62A: -0.71%) in which adding or reorganizing code produced a regression despite targeting a hot function. The pattern suggests:

  • I-cache alignment matters more than instruction count
  • Code layout disruptions can negate micro-optimization gains
  • Box Theory "minimal code change" principle is well-justified

3. Per-Call Flags May Be Faster Than Per-TLS State

Counter-intuitive finding: accessing a per-call computed flag (via function inlining) may be faster than accessing TLS state, because:

  • Function results are likely in registers (temporary)
  • TLS access requires memory load + offset calculation
  • Branch predictor handles pattern well

4. 5.18% Stack % ≠ Optimizable Hotspot

Phase 61 profiling showed tiny_c7_ultra_alloc at 5.18% combined stack overhead, but this is misleading because:

  • Much of the time is in malloc/free wrappers and benchmark loop (not C7 ultra itself)
  • Self time is likely 2-3% (actual function execution)
  • Micro-optimizations on already-optimized paths yield diminishing returns

Decision

NEUTRAL (research box):

  • Set default to HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0 (OFF)
  • Keep code with ENV gate for future reference
  • Do not adopt as production default
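For reference, an ENV gate of this kind is commonly implemented as a getenv() lookup cached on first use. The sketch below is an assumption about the general shape, not the contents of core/box/c7_ultra_alloc_depchain_opt_box.h: the _sketch suffix, the depchain_gate_parse() helper, and the rule that only the exact string "1" enables the gate are all hypothetical. It also ignores thread-safe initialization.

```c
#include <stddef.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = OFF, 1 = ON. */
int g_depchain_opt_cached = -1;

/* Hypothetical parsing rule: only "1" turns the gate on,
 * anything else (including unset) leaves it at the default OFF. */
int depchain_gate_parse(const char *s) {
    return (s != NULL && s[0] == '1' && s[1] == '\0') ? 1 : 0;
}

/* Read the environment once, then serve the cached value. */
int c7_ultra_alloc_depchain_opt_enabled_sketch(void) {
    if (g_depchain_opt_cached < 0) {
        g_depchain_opt_cached =
            depchain_gate_parse(getenv("HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT"));
    }
    return g_depchain_opt_cached;
}
```

Because the cached value never changes after the first call, the check adds one well-predicted branch per call, which is the same entry-point dispatch that the root cause analysis lists as a possible source of layout tax.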

Next Steps:

  1. Phase 62B: Try secondary target (tiny_region_id_write_header reordering) - higher risk
  2. Or pivot to Phase 62C: Accept 48.34% as performance ceiling, focus on production readiness
  3. Or Phase 62D: Algorithmic redesign (batching, prefault strategy) - very high cost/risk

Box Theory Compliance

| Principle | Status | Notes |
|---|---|---|
| Single Conversion Point | Yes | tiny_c7_ultra_alloc() boundary |
| Clear Boundary | Yes | ENV gate HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT |
| Reversible | Yes | Can switch via ENV or compile flag |
| No Side Effects | Yes | Pure optimization attempt, no new data structures |
| Performance | No | -0.71% (NEUTRAL verdict, not adopted as default) |

Overall: Box Theory compliant but performance non-compliant.


Appendix: Raw Data

Baseline (10-run, M ops/s)

59.553099
59.906197
60.134051
59.533090
56.265139
59.367898
60.044922
58.486467
60.141028
59.568791

Treatment (10-run, M ops/s)

56.351851
58.923605
58.946089
60.109441
58.629557
58.689160
59.609485
58.160391
59.939368
59.430088

End of Phase 62A Report