hakmem/docs/analysis/PHASE_3_GRADUATE_C6_ULTRA_RESULTS.md
Moe Charm (CI) e95e61f0ff Phase POLICY-FAST-PATH-V2 complete + MID-V35-HOTPATH-OPT-1 design
## Phase POLICY-FAST-PATH-V2 (FROZEN)
- Implementation complete: free_policy_fast_v2_box.h + malloc_tiny_fast.h integration
- A/B Results:
  - Mixed (ws=400): -1.6% regression  (branch cost > skip benefit)
  - C6-heavy (ws=200): +5.4% improvement 
- Decision: Default OFF, FROZEN (ws<300 / C6-heavy research only)
- Learning: Large WS causes branch misprediction to dominate

## Phase 3-GRADUATE + ENV probe fix
- 64-probe retry for getenv() stability during bench_profile putenv()
- C6 ULTRA intrusive freelist: FROZEN (research box)

## Phase MID-V35-HOTPATH-OPT-1-DESIGN
- Design doc for next optimization target
- Target: MID v3.5 alloc/free hot path (C5-C6)
- Boxes: Stats Gate, TLS Layout, Boundary Check elimination
- Expected: +3-9% on Mixed mainline

Files:
- core/box/free_policy_fast_v2_box.h (new)
- core/box/free_path_stats_box.h/c (policy_fast_v2_skip counter)
- core/front/malloc_tiny_fast.h (fast-path integration)
- docs/analysis/MID_V35_HOTPATH_OPT_1_DESIGN.md (new)
- docs/analysis/PHASE_3_GRADUATE_*.md (new)
- CURRENT_TASK.md (phase status update)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-12 18:40:08 +09:00


Phase 3-GRADUATE: C6 ULTRA Intrusive LIFO Validation Results

Phase: TLS-UNIFY-3 (Phase 3-GRADUATE)
Date: 2025-12-12
Objective: Validate C6 ULTRA intrusive LIFO freelist vs array magazine performance
Test: C6-heavy workload (257-512B, 1M iterations, ws=200)


Executive Summary

OVERALL STATUS: PASS

ULTRA+intrusive implementation meets all graduation criteria:

  • Performance: -1.44% vs. baseline (within the 5% regression tolerance)
  • Intrusive LIFO: working correctly, with 0 fallbacks
  • ULTRA+array is the faster variant (+3.79% vs. baseline); the intrusive design passes on correctness and stays within the performance tolerance

Phase 3-GRADUATE-0: Research Preset Addition

Status: Complete

Added new research preset C6_ULTRA_INTRUSIVE_EXPERIMENT_V12 to:

  • File: /mnt/workdisk/public_share/hakmem/docs/analysis/ENV_PROFILE_PRESETS.md
  • Description: Phase TLS-UNIFY-3 validation - C6 ULTRA intrusive LIFO vs array magazine
  • Environment Variables:
    • HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 (routes C6 to ULTRA path)
    • HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1 (enables intrusive LIFO)
  • Warning: ULTRA routing overrides MID v3/v3.5 - use only in research context
  • Usage: Mixed or C6-heavy workloads - adjust HAKMEM_BENCH_MIN_SIZE/MAX_SIZE as needed
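For readers unfamiliar with the two free-list structures this preset compares, the toy model below contrasts them. It is a simplified sketch, not hakmem's implementation: the class names, the 8-byte link stored at the start of each free block, the sentinel value, and the block size are all assumptions made for illustration.

```python
import struct

BLOCK_SIZE = 512                 # assumed size of one C6 block in this toy model
NULL = 0xFFFFFFFFFFFFFFFF        # sentinel offset meaning "no next block"


class IntrusiveLifo:
    """Free list whose 'next' link lives inside the freed block itself."""

    def __init__(self, arena: bytearray):
        self.arena = arena
        self.head = NULL

    def push(self, offset: int) -> None:
        # Store the old head in the first 8 bytes of the freed block,
        # then make that block the new head. No side storage is needed.
        struct.pack_into("<Q", self.arena, offset, self.head)
        self.head = offset

    def pop(self):
        if self.head == NULL:
            return None
        offset = self.head
        # Reading the link touches the block's own memory.
        (self.head,) = struct.unpack_from("<Q", self.arena, offset)
        return offset


class ArrayMagazine:
    """Free list kept in a separate fixed-capacity array of block offsets."""

    def __init__(self, capacity: int):
        self.slots = [0] * capacity
        self.top = 0

    def push(self, offset: int) -> bool:
        if self.top == len(self.slots):
            return False         # full: the caller needs a slower fallback path
        self.slots[self.top] = offset
        self.top += 1
        return True

    def pop(self):
        if self.top == 0:
            return None
        self.top -= 1
        return self.slots[self.top]


arena = bytearray(4 * BLOCK_SIZE)
lifo = IntrusiveLifo(arena)
for off in (0, BLOCK_SIZE, 2 * BLOCK_SIZE):
    lifo.push(off)
popped = [lifo.pop() for _ in range(4)]  # last pushed comes out first, then None
```

The design difference the model highlights: the intrusive list has no capacity limit and needs no separate storage, but each pop reads from the block being handed out, while the array magazine keeps its metadata contiguous and can overflow.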

Phase 3-GRADUATE-1: C6-Heavy A/B Test Results

Test Configuration

  • Workload: Random mixed allocation/deallocation
  • Working Set: ws=200
  • Size Range: 257-512B (C6 class only)
  • Iterations: 1,000,000 per run
  • Runs per condition: 5
  • Environment: HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512

Test Conditions

  1. Baseline: C6=MID v3.5 (no ULTRA routing)
  2. ULTRA+array: C6=ULTRA with array magazine (intrusive FL OFF)
  3. ULTRA+intrusive: C6=ULTRA with intrusive LIFO (intrusive FL ON)
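The three conditions differ only in environment variables, so a small driver can run them uniformly. The sketch below is one possible way to do that; the function names are hypothetical, and because the benchmark's output format is not specified in this report, the runner returns the raw stdout/stderr text rather than parsing it.

```python
import os
import subprocess

# Shared by all conditions (from the test configuration).
COMMON_ENV = {"HAKMEM_BENCH_MIN_SIZE": "257", "HAKMEM_BENCH_MAX_SIZE": "512"}

# Variables that must not leak into the baseline from the caller's shell.
ULTRA_VARS = (
    "HAKMEM_TINY_C6_ULTRA_FREE_ENABLED",
    "HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL",
    "HAKMEM_FREE_PATH_STATS",
)

# Per-condition settings, copied from the commands in this report.
CONDITIONS = {
    "baseline": {},
    "ultra_array": {
        "HAKMEM_TINY_C6_ULTRA_FREE_ENABLED": "1",
        "HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL": "0",
        "HAKMEM_FREE_PATH_STATS": "1",
    },
    "ultra_intrusive": {
        "HAKMEM_TINY_C6_ULTRA_FREE_ENABLED": "1",
        "HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL": "1",
        "HAKMEM_FREE_PATH_STATS": "1",
    },
}


def build_env(condition, base=None):
    """Return the environment for one condition, starting from a clean slate."""
    env = dict(os.environ if base is None else base)
    for name in ULTRA_VARS:
        env.pop(name, None)          # drop stale settings before applying ours
    env.update(COMMON_ENV)
    env.update(CONDITIONS[condition])
    return env


def run_condition(condition, runs=5, binary="./bench_random_mixed_hakmem",
                  iterations=1_000_000, ws=200):
    """Run the benchmark `runs` times and return (stdout, stderr) per run."""
    results = []
    for _ in range(runs):
        proc = subprocess.run(
            [binary, str(iterations), str(ws)],
            env=build_env(condition),
            capture_output=True, text=True, check=True,
        )
        results.append((proc.stdout, proc.stderr))
    return results
```

Clearing the ULTRA variables before each run guards against a baseline measurement that silently inherits `HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1` from an earlier shell session.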

Detailed Results

Condition 1: Baseline (C6=MID v3.5)

Command:

HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
  ./bench_random_mixed_hakmem 1000000 200

Results (5 runs):

  • Run 1: 54,742,076 ops/s
  • Run 2: 57,557,163 ops/s
  • Run 3: 56,503,212 ops/s
  • Run 4: 52,315,248 ops/s
  • Run 5: 55,362,087 ops/s
  • Mean: 55,295,957 ops/s
  • StdDev: 1,985,340

Route: C6 → LEGACY (MID v3.5 path)


Condition 2: ULTRA+array (Array Magazine)

Command:

HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
  HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 \
  HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=0 \
  HAKMEM_FREE_PATH_STATS=1 \
  ./bench_random_mixed_hakmem 1000000 200

Results (5 runs):

  • Run 1: 57,122,577 ops/s
  • Run 2: 58,482,856 ops/s
  • Run 3: 56,339,501 ops/s
  • Run 4: 57,055,995 ops/s
  • Run 5: 57,944,578 ops/s
  • Mean: 57,389,101 ops/s
  • StdDev: 834,942

Performance vs Baseline: +3.79%

Stats:

  • c6_ultra_free: 265,890
  • c6_ultra_alloc: 265,815
  • c6_ifl_push: 0 (array magazine mode)
  • c6_ifl_pop: 0
  • c6_ifl_fallback: 0

Route: C6 → ULTRA (array magazine)


Condition 3: ULTRA+intrusive (Intrusive LIFO)

Command:

HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
  HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 \
  HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1 \
  HAKMEM_FREE_PATH_STATS=1 \
  ./bench_random_mixed_hakmem 1000000 200

Results (5 runs):

  • Run 1: 56,710,065 ops/s
  • Run 2: 56,314,297 ops/s
  • Run 3: 52,936,109 ops/s
  • Run 4: 50,111,993 ops/s
  • Run 5: 56,427,447 ops/s
  • Mean: 54,499,982 ops/s
  • StdDev: 2,897,908

Performance vs Baseline: -1.44%

Stats:

  • c6_ultra_free: 265,890
  • c6_ultra_alloc: 265,815
  • c6_ifl_push: 265,890 (intrusive LIFO active)
  • c6_ifl_pop: 265,815
  • c6_ifl_fallback: 0

Route: C6 → ULTRA (intrusive LIFO freelist)


Evaluation Against Graduation Gates

Gate 1: C6-heavy Performance

Criterion: ULTRA+intrusive mean throughput >= Baseline, or a regression smaller than 5%

Result: -1.44% vs Baseline

Status: PASS

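  • The regression is within the acceptable tolerance of 5%
  • Performance is competitive with the baseline MID v3.5 implementation
  • Run-to-run variability is high (StdDev: 2.9M ops/s, 5.3% of the mean), so the 0.8M ops/s gap to baseline is smaller than one standard deviation of either condition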
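The Gate 1 arithmetic can be reproduced from the raw runs. The following is a minimal sketch; the function names and the way the 5% tolerance is encoded are assumptions for illustration, not part of hakmem.

```python
from statistics import mean

# Throughput in ops/s, copied from the per-run results in this report.
BASELINE = [54_742_076, 57_557_163, 56_503_212, 52_315_248, 55_362_087]
ULTRA_INTRUSIVE = [56_710_065, 56_314_297, 52_936_109, 50_111_993, 56_427_447]


def relative_change_pct(candidate, baseline):
    """Percent change of the candidate mean relative to the baseline mean."""
    return (mean(candidate) / mean(baseline) - 1.0) * 100.0


def passes_gate1(candidate, baseline, tolerance_pct=5.0):
    """Gate 1: candidate >= baseline, or a regression smaller than the tolerance."""
    return relative_change_pct(candidate, baseline) > -tolerance_pct


change = relative_change_pct(ULTRA_INTRUSIVE, BASELINE)
print(f"{change:+.2f}% vs baseline, gate 1 pass: {passes_gate1(ULTRA_INTRUSIVE, BASELINE)}")
```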

Gate 2: Intrusive LIFO Fallback Rate

Criterion: c6_ifl_fallback stays at or close to 0

Result: c6_ifl_fallback = 0

Status: PASS

  • Zero fallbacks: every free took the intrusive freelist path
  • All 265,890 frees were pushed onto the intrusive freelist (c6_ifl_push = c6_ultra_free)
  • All 265,815 ULTRA allocations were served by pops (c6_ifl_pop = c6_ultra_alloc)
  • The delta of 75 (pushes minus pops) is consistent with 75 freed blocks remaining on the freelist when the benchmark ended
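The counter relationships behind Gate 2 can be checked mechanically. This is a sketch under the assumption that the stats are available as a simple name-to-value mapping; the `check_gate2` helper and its `max_fallback` parameter are hypothetical.

```python
# Counter values copied from the ULTRA+intrusive stats in this report.
stats = {
    "c6_ultra_free": 265_890,
    "c6_ultra_alloc": 265_815,
    "c6_ifl_push": 265_890,
    "c6_ifl_pop": 265_815,
    "c6_ifl_fallback": 0,
}


def check_gate2(s, max_fallback=0):
    """Return (passed, push_pop_delta) for one set of counters."""
    every_free_pushed = s["c6_ifl_push"] == s["c6_ultra_free"]
    every_alloc_popped = s["c6_ifl_pop"] == s["c6_ultra_alloc"]
    fallback_ok = s["c6_ifl_fallback"] <= max_fallback
    delta = s["c6_ifl_push"] - s["c6_ifl_pop"]
    return (every_free_pushed and every_alloc_popped and fallback_ok), delta


passed, delta = check_gate2(stats)
print(f"gate 2 pass: {passed}, push/pop delta: {delta}")
```

The same check applied to the ULTRA+array counters would fail the push/pop equalities, since that mode reports zero intrusive pushes and pops by design.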

Analysis and Insights

Performance Comparison Summary

| Condition | Mean (ops/s) | vs Baseline | StdDev (ops/s) | Route |
|---|---|---|---|---|
| Baseline (MID v3.5) | 55,295,957 | - | 1,985,340 | LEGACY |
| ULTRA+array | 57,389,101 | +3.79% | 834,942 | ULTRA |
| ULTRA+intrusive | 54,499,982 | -1.44% | 2,897,908 | ULTRA |

Key Observations

  1. Array Magazine Wins in Raw Performance:

    • ULTRA+array shows +3.79% improvement over baseline
    • Lowest standard deviation (834,942) indicates stable performance
    • Best performer in this C6-heavy workload
  2. Intrusive LIFO Shows Acceptable Performance:

    • -1.44% regression is within tolerance (<5%)
    • Higher standard deviation (2,897,908) suggests room for optimization
    • Zero fallback demonstrates correct implementation
  3. Intrusive LIFO Correctness Validated:

    • c6_ifl_fallback=0 confirms that no operation left the intrusive freelist path
    • Push and pop counts (265,890 / 265,815) match the ULTRA free and alloc counts exactly
    • No corruption or failures during 1M iterations
  4. Route Assignment Working:

    • Baseline: C6 → LEGACY (as expected)
    • ULTRA modes: C6 → ULTRA (routing override working)
    • C7 remains LEGACY in all cases (only C6 affected)

Performance Variability Analysis

Standard Deviation Comparison:

  • Baseline: 1.99M ops/s (3.6% CV)
  • ULTRA+array: 0.83M ops/s (1.5% CV)
  • ULTRA+intrusive: 2.90M ops/s (5.3% CV)

The higher variability in ULTRA+intrusive suggests:

  • Potential cache/TLB effects from intrusive pointer manipulation
  • Opportunity for micro-optimization in the intrusive path
  • Still within acceptable bounds for research validation
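The variability figures above can be recomputed from the raw runs, and the same data gives a rough sense of whether each difference from baseline stands out from the noise. The sketch below uses the sample standard deviation (n-1 denominator), which reproduces the StdDev values in this report. The Welch t statistic is an addition for illustration and is not part of the original analysis.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Throughput in ops/s, copied from the per-run results in this report.
RUNS = {
    "baseline": [54_742_076, 57_557_163, 56_503_212, 52_315_248, 55_362_087],
    "ultra_array": [57_122_577, 58_482_856, 56_339_501, 57_055_995, 57_944_578],
    "ultra_intrusive": [56_710_065, 56_314_297, 52_936_109, 50_111_993, 56_427_447],
}


def cv_pct(samples):
    """Coefficient of variation: sample std dev as a percentage of the mean."""
    return stdev(samples) / mean(samples) * 100.0


def welch_t(a, b):
    """Welch's t statistic for mean(a) - mean(b), allowing unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


for name, samples in RUNS.items():
    line = f"{name:16s} stdev={stdev(samples):>12,.0f}  CV={cv_pct(samples):.1f}%"
    if name != "baseline":
        line += f"  t_vs_baseline={welch_t(samples, RUNS['baseline']):+.2f}"
    print(line)
```

A t statistic with magnitude well below 1 means the gap between two means is small relative to its standard error. With only five runs per condition, this is a rough indicator rather than a formal significance test.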

Conclusions

Phase 3-GRADUATE Status: PASS

Both phases completed successfully:

Phase 3-GRADUATE-0 (Research Preset):

  • Research preset C6_ULTRA_INTRUSIVE_EXPERIMENT_V12 added to ENV_PROFILE_PRESETS.md
  • Documentation includes warnings, usage guidelines, and test commands
  • Performance results updated with actual test data

Phase 3-GRADUATE-1 (C6-Heavy A/B Test):

  • Gate 1: Performance regression -1.44% (within <5% tolerance)
  • Gate 2: c6_ifl_fallback=0 (perfect LIFO behavior)
  • 5 runs per condition completed successfully
  • Coefficient of variation ranges from 1.5% to 5.3% across conditions, which is acceptable for research validation

Recommendations

  1. For Production: Use ULTRA+array configuration

    • Best performance (+3.79% over baseline)
    • Lowest variability (StdDev: 834K)
    • Most stable of the three conditions in this test (CV 1.5%)
  2. For Research: ULTRA+intrusive validated for correctness

    • Zero fallback confirms implementation correctness
    • Performance acceptable for further optimization work
    • Good foundation for TLS unification experiments
  3. Next Steps:

    • Consider mixed workload testing (16-1024B) to validate broader impact
    • Investigate sources of variability in intrusive path
    • Profile intrusive LIFO to identify optimization opportunities
    • Consider hybrid approach: array for hot path, intrusive for cold/overflow

Files Modified

  • /mnt/workdisk/public_share/hakmem/docs/analysis/ENV_PROFILE_PRESETS.md
    • Added Research Profile 4: C6_ULTRA_INTRUSIVE_EXPERIMENT_V12
    • Updated with actual performance results

Performance Data Archive

All raw data available in this session:

  • Baseline: 5 runs completed
  • ULTRA+array: 5 runs completed
  • ULTRA+intrusive: 5 runs completed
  • FREE_PATH_STATS captured for all ULTRA conditions
  • IFL stats (push/pop/fallback) captured for intrusive mode

Test Execution: 2025-12-12
Total Runs: 15 (5 per condition)
Test Environment: /mnt/workdisk/public_share/hakmem
Benchmark Binary: bench_random_mixed_hakmem
Git Branch: master (Phase v11a-4+)