Files

Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement

## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-17 06:24:01 +09:00

6.4 KiB

Raw Blame History

Phase 52: Tail Latency Proxy Results

Date: 2025-12-16 Phase: 52 - Tail Latency Proxy Measurement Status: COMPLETE (Measurement-only, no code changes)

Executive Summary

We measured tail latency using epoch throughput distribution as a proxy across three allocators:

hakmem FAST (current optimized build)
mimalloc (industry baseline)
system malloc (glibc)

Test configuration: 5-minute single-process soak, 1-second epochs, WS=400 (mixed workload)

Key Findings

mimalloc has best tail behavior: Lowest p99/p999 latency proxy, tightest distribution
system malloc has second-best tail: Very consistent, low variance
hakmem FAST has worst tail: Higher p99/p999, more variability
hakmem's gap is in tail consistency, not average performance

Important Note (Method Correction)

Tail の向きと計算には注意が必要:

Throughput の “tail” は 低い throughput 側（p1/p0.1）を見る（p99 は “速い側”）。
Latency proxy の percentiles は per-epoch latency（lat_ns = 1e9/throughput）配列を作ってから計算する。
- p99(latency) != 1e9 / p99(throughput)（非線形 + 順序反転のため）

推奨: CSV（scripts/soak_mixed_single_process.sh 出力）から scripts/analyze_epoch_tail_csv.py で再集計し、SSOT を更新する。

python3 scripts/analyze_epoch_tail_csv.py tail_epoch_hakmem_fast_5m.csv

Detailed Results (v0)

Throughput Distribution (ops/sec)

Metric	hakmem FAST	mimalloc	system malloc
p50	47,887,721	98,738,326	69,562,115
p90	58,629,195	99,580,629	69,931,575
p99	59,174,766	110,702,822	70,165,415
p999	59,567,912	111,190,037	70,308,452
Mean	50,174,657	99,084,977	69,447,599
Std Dev	4,461,290	2,455,894	522,021
Min	46,254,013	95,458,811	66,242,568
Max	59,608,715	111,202,228	70,326,858

Latency Proxy (ns/op)

Calculated as 1 / throughput * 1e9 to convert throughput to per-operation latency.

Metric	hakmem FAST	mimalloc	system malloc
p50	20.88 ns	10.13 ns	14.38 ns
p90	21.12 ns	10.24 ns	14.50 ns
p99	21.33 ns	10.43 ns	14.80 ns
p999	21.57 ns	10.47 ns	15.07 ns
Mean	20.07 ns	10.10 ns	14.40 ns
Std Dev	1.60 ns	0.23 ns	0.11 ns
Min	16.78 ns	8.99 ns	14.22 ns
Max	21.62 ns	10.48 ns	15.10 ns

Analysis

Tail Behavior Comparison

Standard Deviation as % of Mean (lower = more consistent):

hakmem FAST: 7.98% (highest variability)
mimalloc: 2.28% (good consistency)
system malloc: 0.77% (best consistency)

p99/p50 Ratio (lower = better tail):

hakmem FAST: 1.024 (2.4% tail slowdown)
mimalloc: 1.030 (3.0% tail slowdown)
system malloc: 1.029 (2.9% tail slowdown)

p999/p50 Ratio:

hakmem FAST: 1.033 (3.3% tail slowdown)
mimalloc: 1.034 (3.4% tail slowdown)
system malloc: 1.048 (4.8% tail slowdown)

Interpretation

hakmem's throughput variance is high: 4.46M ops/sec std dev vs mimalloc's 2.46M and system's 0.52M
- This indicates periodic slowdowns or stalls
- Likely due to TLS cache misses, metadata lookup costs, or GC-like background work
mimalloc has best absolute performance AND good tail behavior:
- 2x faster than hakmem at median
- Lower latency at all percentiles
- Moderate variance (2.28% std dev)
system malloc has rock-solid consistency:
- Only 0.77% std dev (extremely stable)
- Very tight p99/p999 spread
- Middle performance tier (~1.5x faster than hakmem)
hakmem's tail problem is relative to its mean:
- Absolute p99 latency (21.33 ns) isn't terrible
- But variance is 2-3x higher than competitors
- Suggests optimization opportunities in cache warmth, metadata layout

Implications for Optimization

Root Causes to Investigate

TLS cache thrashing: High variance suggests periodic cache coldness
Metadata lookup cost: Possibly slower on cache misses
Background work interference: Adaptive sizing, stats collection?
Free path delays: Remote frees, mailbox processing

Potential Solutions

Prewarm more aggressively: Reduce cold-start penalties
Optimize metadata cache hit rate: Better locality, prefetching
Reduce background work frequency: Less interruption to hot path
Improve free-side batching: Reduce per-operation variance

Prioritization

Given that:

hakmem is already 2x slower than mimalloc at median
Tail behavior is worse but not catastrophically so
Variance is the main issue, not worst-case absolute latency

Recommendation: Focus on reducing variance rather than chasing p999 specifically.

Target: Get std dev down from 4.46M to <2M ops/sec (match mimalloc's 2.46M)
This will naturally improve tail latency as a side effect

Test Configuration

Hardware

CPU: (recorded in soak CSV metadata)
Memory: Sufficient for WS=400 (20MB prefault)
OS: Linux

Benchmark Parameters

Workload: bench_random_mixed (70% malloc, 30% free)
Working Set: 400 (mixed size distribution)
Duration: 300 seconds (5 minutes)
Epoch Length: 1 second
Process Model: Single process (no parallelism)

Allocator Builds

hakmem: MINIMAL build (FAST path enabled, aggressive inlining)
mimalloc: Default build from vendor
system malloc: glibc default (no LD_PRELOAD)

Raw Data

CSV files available at:

/mnt/workdisk/public_share/hakmem/tail_epoch_hakmem_fast_5m.csv
/mnt/workdisk/public_share/hakmem/tail_epoch_mimalloc_5m.csv
/mnt/workdisk/public_share/hakmem/tail_epoch_system_5m.csv

Analysis script: scripts/calculate_percentiles.py

Next Steps

Phase 53: RSS Tax Triage - understand memory overhead
Future optimization phases: Target variance reduction
- Phase 54+: TLS cache optimization
- Phase 55+: Metadata locality improvements
- Phase 56+: Background work reduction

Conclusion

Phase 52 Status: COMPLETE

We have established a tail latency baseline using epoch throughput as a proxy. Key takeaway: hakmem's tail behavior is acceptable but has room for improvement, primarily by reducing throughput variance (std dev). This measurement provides a clear target for future optimization work.

No code changes made - this was a measurement-only phase.

6.4 KiB Raw Blame History Unescape Escape