Files
hakmem/docs/PHASE2_BENCHMARK_RESULTS.md
Moe Charm (CI) d397994b23 Add Phase 2 benchmark results: Headerless ON/OFF comparison
Results Summary:
- sh8bench: Headerless ON PASSES (no corruption), OFF FAILS (segfault)
- Simple alloc benchmark: OFF = 78.15 Mops/s, ON = 54.60 Mops/s (-30.1%)
- Library size: OFF = 547K, ON = 502K (-8.2%)

Key Findings:
1. Headerless ON successfully eliminates TLS_SLL_HDR_RESET corruption
2. Performance regression (30%) exceeds 5% target - needs optimization
3. Trade-off: Correctness vs Performance documented

Recommendation: Keep OFF as default short-term, optimize ON for long-term.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 17:23:32 +09:00

148 lines
4.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Performance Benchmark Results: Headerless ON vs OFF
**Date:** 2025-12-03
**Commit:** f90e261c5 (Phase 1.2 complete)
**Build Flags:** Default (HEADER_CLASSIDX=1, AGGRESSIVE_INLINE=1, PREWARM_TLS=1)
## Build Variants
### Headerless OFF (Phase 1 compatible)
- Command: `make shared -j8`
- Library Size: 547K
- Layout: Classes 1-6 have 1-byte header, user_ptr = base + 1
### Headerless ON (Phase 2 implementation)
- Command: `make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1"`
- Library Size: 502K (-8.2% smaller)
- Layout: No headers, user_ptr = base (alignment-perfect)
---
## Benchmark Results
### 1. sh8bench (Memory corruption test)
**Purpose:** Tests for neighbor block corruption and alignment issues
| Build | Status | Time | TLS_SLL_HDR_RESET | Notes |
|-------|--------|------|-------------------|-------|
| Headerless OFF | FAIL (Segfault) | 0.065s | ✅ Detected (cls=1) | Crashes immediately with corruption |
| Headerless ON | PASS | 22.0s | ❌ None | Completes successfully, no corruption |
**Analysis:**
- Headerless OFF: Still exhibits neighbor corruption bug (got=0xb1 expect=0xa1)
- Headerless ON: **Fixes the root cause** - no TLS_SLL_HDR_RESET, test completes
- This proves Phase 2 successfully addresses the alignment/corruption issue
---
### 2. Simple Allocation Benchmark (1M iterations × 100 alloc/free)
**Purpose:** Measures throughput for 16-byte allocations (Tiny class 1)
| Build | Run 1 (Mops/s) | Run 2 (Mops/s) | Run 3 (Mops/s) | Average (Mops/s) |
|-------|----------------|----------------|----------------|------------------|
| Headerless OFF | 78.22 | 77.68 | 78.54 | **78.15** |
| Headerless ON | 54.29 | 55.06 | 54.45 | **54.60** |
**Performance Impact:**
- Degradation: (78.15 - 54.60) / 78.15 = **-30.1%**
- This exceeds the 5% target by significant margin
**Root Cause Analysis (Hypothesis):**
1. SuperSlab Registry lookup overhead in Headerless mode
2. Free path now requires class identification via registry instead of inline header
3. Additional cache misses for registry access during free()
---
### 3. cfrac (Memory-intensive factorization)
| Build | Status | Notes |
|-------|--------|-------|
| Headerless OFF | FAIL (Abort) | Degenerate solution error, aborts |
| Headerless ON | FAIL (Segfault) | Crashes during execution |
**Analysis:** Both modes fail cfrac test, suggesting unrelated issue
---
### 4. larson (Multi-threaded stress test)
| Build | Status | Notes |
|-------|--------|-------|
| Headerless OFF | Timeout | Hangs or runs indefinitely |
| Headerless ON | Not tested | - |
| System malloc | Timeout | Baseline also hangs |
**Analysis:** larson test appears incompatible with test environment
---
## Summary
### Correctness
**Headerless ON fixes sh8bench corruption** - Major success for Phase 2
- Eliminates TLS_SLL_HDR_RESET completely
- No neighbor block corruption
- Validates the headerless strategy for correctness
### Performance
⚠️ **Headerless ON has 30% throughput regression** - Exceeds 5% target
- Headerless OFF: 78.15 Mops/s (baseline)
- Headerless ON: 54.60 Mops/s (-30.1%)
- Performance target (≤5% impact): **NOT MET**
### Trade-offs
| Metric | Headerless OFF | Headerless ON | Winner |
|--------|----------------|---------------|--------|
| Correctness (sh8bench) | FAIL | PASS | ✅ ON |
| Performance (simple_bench) | 78.15 Mops/s | 54.60 Mops/s | ✅ OFF |
| Library Size | 547K | 502K | ✅ ON |
| C Standard Compliance | Violates alignof | Compliant | ✅ ON |
---
## Recommendations
### Short-term (Current State)
Keep Headerless OFF as default for production:
- Higher performance for existing workloads
- Defensive measures (atomic fence + header write) mitigate corruption
- TLS_SLL_HDR_RESET provides early warning
### Medium-term (Optimization Path)
Investigate Headerless ON performance:
1. Profile SuperSlab Registry lookup overhead
2. Consider caching class_idx hints in TLS
3. Optimize registry data structure (faster lookup)
4. Target: Reduce gap from 30% to <5%
### Long-term (C Standard Compliance)
Once optimized, switch to Headerless ON:
- Guarantees `alignof(max_align_t)` compliance
- Eliminates entire class of corruption bugs
- Smaller library footprint
---
## Next Steps
1. **Profile Headerless ON free() path**
- Identify specific bottleneck (registry lookup, cache miss, etc.)
2. **Implement registry optimizations**
- TLS cache for recently freed classes
- Faster SuperSlab address class lookup
3. **Re-benchmark after optimization**
- Target: 5% performance impact
4. **Decision point**
- If optimized: Switch default to Headerless ON
- If not: Keep OFF, document trade-off
---
**Conclusion:** Phase 2 Headerless implementation is **functionally correct** and fixes the corruption bug, but requires performance optimization before production deployment.