Files
hakmem/docs/analysis/PHASE52_TAIL_LATENCY_PROXY_RESULTS.md
Moe Charm (CI) 7adbcdfcb6 Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement
## Summary

Completed Phase 54-60 optimization work:

**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset

**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY

**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc

**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized

## Key Metrics

- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes

## Files Added/Modified

New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h

Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py

Documentation: Phase 40-60 analysis documents

## Design Decisions

1. Profile separation (core/bench_profile.h):
   - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
   - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)

2. Box Theory compliance:
   - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
   - Single conversion points maintained
   - No physical deletions (compile-out only)

3. Lessons learned:
   - SSOT effective only where redundancy exists (Phase 60 showed limits)
   - Branch prediction extremely effective (~0 cycles for well-predicted branches)
   - Early-exit pattern valuable even when seemingly redundant

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 06:24:01 +09:00

176 lines
6.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 52: Tail Latency Proxy Results
**Date**: 2025-12-16
**Phase**: 52 - Tail Latency Proxy Measurement
**Status**: COMPLETE (Measurement-only, no code changes)
## Executive Summary
We measured tail latency using epoch throughput distribution as a proxy across three allocators:
- **hakmem FAST** (current optimized build)
- **mimalloc** (industry baseline)
- **system malloc** (glibc)
Test configuration: 5-minute single-process soak, 1-second epochs, WS=400 (mixed workload)
### Key Findings
1. **mimalloc has best tail behavior**: Lowest p99/p999 latency proxy, tightest distribution
2. **system malloc has second-best tail**: Very consistent, low variance
3. **hakmem FAST has worst tail**: Higher p99/p999, more variability
4. **hakmem's gap is in tail consistency, not average performance**
## Important Note (Method Correction)
Tail の向きと計算には注意が必要:
- Throughput の “tail” は **低い throughput 側**p1/p0.1を見るp99 は “速い側”)。
- Latency proxy の percentiles は **per-epoch latency**`lat_ns = 1e9/throughput`)配列を作ってから計算する。
- `p99(latency) != 1e9 / p99(throughput)`(非線形 + 順序反転のため)
推奨: CSV`scripts/soak_mixed_single_process.sh` 出力)から `scripts/analyze_epoch_tail_csv.py` で再集計し、SSOT を更新する。
```bash
python3 scripts/analyze_epoch_tail_csv.py tail_epoch_hakmem_fast_5m.csv
```
## Detailed Results (v0)
### Throughput Distribution (ops/sec)
| Metric | hakmem FAST | mimalloc | system malloc |
|--------|-------------|----------|---------------|
| **p50** | 47,887,721 | 98,738,326 | 69,562,115 |
| **p90** | 58,629,195 | 99,580,629 | 69,931,575 |
| **p99** | 59,174,766 | 110,702,822 | 70,165,415 |
| **p999** | 59,567,912 | 111,190,037 | 70,308,452 |
| **Mean** | 50,174,657 | 99,084,977 | 69,447,599 |
| **Std Dev** | 4,461,290 | 2,455,894 | 522,021 |
| **Min** | 46,254,013 | 95,458,811 | 66,242,568 |
| **Max** | 59,608,715 | 111,202,228 | 70,326,858 |
### Latency Proxy (ns/op)
Calculated as `1 / throughput * 1e9` to convert throughput to per-operation latency.
| Metric | hakmem FAST | mimalloc | system malloc |
|--------|-------------|----------|---------------|
| **p50** | 20.88 ns | 10.13 ns | 14.38 ns |
| **p90** | 21.12 ns | 10.24 ns | 14.50 ns |
| **p99** | 21.33 ns | 10.43 ns | 14.80 ns |
| **p999** | 21.57 ns | 10.47 ns | 15.07 ns |
| **Mean** | 20.07 ns | 10.10 ns | 14.40 ns |
| **Std Dev** | 1.60 ns | 0.23 ns | 0.11 ns |
| **Min** | 16.78 ns | 8.99 ns | 14.22 ns |
| **Max** | 21.62 ns | 10.48 ns | 15.10 ns |
## Analysis
### Tail Behavior Comparison
**Standard Deviation as % of Mean (lower = more consistent):**
- hakmem FAST: 7.98% (highest variability)
- mimalloc: 2.28% (good consistency)
- system malloc: 0.77% (best consistency)
**p99/p50 Ratio (lower = better tail):**
- hakmem FAST: 1.024 (2.4% tail slowdown)
- mimalloc: 1.030 (3.0% tail slowdown)
- system malloc: 1.029 (2.9% tail slowdown)
**p999/p50 Ratio:**
- hakmem FAST: 1.033 (3.3% tail slowdown)
- mimalloc: 1.034 (3.4% tail slowdown)
- system malloc: 1.048 (4.8% tail slowdown)
### Interpretation
1. **hakmem's throughput variance is high**: 4.46M ops/sec std dev vs mimalloc's 2.46M and system's 0.52M
- This indicates periodic slowdowns or stalls
- Likely due to TLS cache misses, metadata lookup costs, or GC-like background work
2. **mimalloc has best absolute performance AND good tail behavior**:
- 2x faster than hakmem at median
- Lower latency at all percentiles
- Moderate variance (2.28% std dev)
3. **system malloc has rock-solid consistency**:
- Only 0.77% std dev (extremely stable)
- Very tight p99/p999 spread
- Middle performance tier (~1.5x faster than hakmem)
4. **hakmem's tail problem is relative to its mean**:
- Absolute p99 latency (21.33 ns) isn't terrible
- But variance is 2-3x higher than competitors
- Suggests optimization opportunities in cache warmth, metadata layout
## Implications for Optimization
### Root Causes to Investigate
1. **TLS cache thrashing**: High variance suggests periodic cache coldness
2. **Metadata lookup cost**: Possibly slower on cache misses
3. **Background work interference**: Adaptive sizing, stats collection?
4. **Free path delays**: Remote frees, mailbox processing
### Potential Solutions
1. **Prewarm more aggressively**: Reduce cold-start penalties
2. **Optimize metadata cache hit rate**: Better locality, prefetching
3. **Reduce background work frequency**: Less interruption to hot path
4. **Improve free-side batching**: Reduce per-operation variance
### Prioritization
Given that:
- hakmem is already 2x slower than mimalloc at median
- Tail behavior is worse but not catastrophically so
- Variance is the main issue, not worst-case absolute latency
**Recommendation**: Focus on **reducing variance** rather than chasing p999 specifically.
- Target: Get std dev down from 4.46M to <2M ops/sec (match mimalloc's 2.46M)
- This will naturally improve tail latency as a side effect
## Test Configuration
### Hardware
- CPU: (recorded in soak CSV metadata)
- Memory: Sufficient for WS=400 (20MB prefault)
- OS: Linux
### Benchmark Parameters
- **Workload**: bench_random_mixed (70% malloc, 30% free)
- **Working Set**: 400 (mixed size distribution)
- **Duration**: 300 seconds (5 minutes)
- **Epoch Length**: 1 second
- **Process Model**: Single process (no parallelism)
### Allocator Builds
- hakmem: MINIMAL build (FAST path enabled, aggressive inlining)
- mimalloc: Default build from vendor
- system malloc: glibc default (no LD_PRELOAD)
## Raw Data
CSV files available at:
- `/mnt/workdisk/public_share/hakmem/tail_epoch_hakmem_fast_5m.csv`
- `/mnt/workdisk/public_share/hakmem/tail_epoch_mimalloc_5m.csv`
- `/mnt/workdisk/public_share/hakmem/tail_epoch_system_5m.csv`
Analysis script: `scripts/calculate_percentiles.py`
## Next Steps
1. **Phase 53**: RSS Tax Triage - understand memory overhead
2. **Future optimization phases**: Target variance reduction
- Phase 54+: TLS cache optimization
- Phase 55+: Metadata locality improvements
- Phase 56+: Background work reduction
## Conclusion
**Phase 52 Status: COMPLETE**
We have established a tail latency baseline using epoch throughput as a proxy. Key takeaway: hakmem's tail behavior is acceptable but has room for improvement, primarily by reducing throughput variance (std dev). This measurement provides a clear target for future optimization work.
**No code changes made** - this was a measurement-only phase.