# Phase 7 Comprehensive Benchmark Results
**Date**: 2025-11-08
**Build Configuration**: `HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1`
**Status**: CRITICAL BUGS FOUND - NOT PRODUCTION READY
---
## Executive Summary
### Production Readiness: FAILED
**Critical Issues Found:**
1. **Multi-threaded crash**: Larson 2T/4T fail with `free(): invalid pointer` (Exit 134)
2. **64B allocation crash**: Bus error (Exit 135) on 64-byte allocations
3. **Debug output in production**: "Phase 7: tiny_alloc(1024) rejected" messages indicate incomplete implementation
**Performance (Single-threaded, working sizes):**
- Single-thread performance is excellent (76-120% of System malloc)
- But crashes make this unusable in production
### Key Findings
| Category | Result | Status |
|----------|--------|--------|
| Larson 1T | 2.76M ops/s | ✅ PASS |
| Larson 2T/4T | CRASH (Exit 134) | ❌ CRITICAL FAIL |
| Random Mixed (9 of 10 sizes) | 35-72M ops/s | ✅ PASS |
| Random Mixed 64B | CRASH (Bus Error 135) | ❌ CRITICAL FAIL |
| Stability (1M iterations) | Stable scores | ✅ PASS |
| Overall Production Ready | NO | ❌ FAIL |
---
## Detailed Benchmark Results
### 1. Larson Multi-Thread Stress Test
| Threads | HAKMEM Result | System Result | Status |
|---------|---------------|---------------|--------|
| 1T | 2,758,490 ops/s | ~3.3M ops/s (est.) | ✅ 84% of System |
| 2T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |
| 4T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |
**Crash Details:**
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
Exit code: 134 (SIGABRT - double free or corruption)
```
**Root Cause**: Unknown - likely race condition in multi-threaded free path or malloc fallback integration issue.
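The exit codes in this report follow the shell convention of 128 plus the signal number, so 134 corresponds to signal 6 and 135 to signal 7. A minimal sketch of that decoding, assuming Linux signal numbering (6 = SIGABRT, 7 = SIGBUS); `decode_exit` is a hypothetical helper, not part of the HAKMEM tree:

```bash
# decode_exit: hypothetical helper, not part of HAKMEM.
# Shells report a signal-terminated child as 128 + signal number.
# The names below assume Linux numbering (6 = SIGABRT, 7 = SIGBUS).
decode_exit() {
    code=$1
    if [ "$code" -le 128 ]; then
        printf 'exit %s\n' "$code"
        return 0
    fi
    sig=$((code - 128))
    case "$sig" in
        6) printf 'SIGABRT\n' ;;
        7) printf 'SIGBUS\n' ;;
        *) printf 'signal %s\n' "$sig" ;;
    esac
}

decode_exit 134   # prints SIGABRT
decode_exit 135   # prints SIGBUS
```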
---
### 2. Random Mixed Allocation Benchmark
**Test**: 100,000 iterations of mixed malloc/free patterns
| Size | HAKMEM (ops/s) | System (ops/s) | HAKMEM % | Status |
|------|----------------|----------------|----------|--------|
| 16B | 66,878,359 | 87,810,575 | 76.1% | ✅ |
| 32B | 69,730,339 | 64,490,458 | **108.1%** | ✅ |
| **64B** | **CRASH (Bus Error 135)** | 78,147,467 | N/A | ❌ CRITICAL |
| 128B | 72,090,413 | 65,960,798 | **109.2%** | ✅ |
| 256B | 71,363,681 | 71,688,134 | 99.5% | ✅ |
| 512B | 60,501,851 | 62,967,613 | 96.0% | ✅ |
| 1024B | 63,229,630 | 67,220,203 | 94.0% | ✅ |
| 2048B | 55,868,013 | 46,557,492 | **119.9%** | ✅ |
| 4096B | 40,585,997 | 45,157,552 | 89.8% | ✅ |
| 8192B | 35,442,103 | 33,984,326 | **104.2%** | ✅ |
**Performance Highlights (working sizes):**
- **32B: +8% faster than System** (108.1%)
- **128B: +9% faster than System** (109.2%)
- **2048B: +20% faster than System** (119.9%)
- **8192B: +4% faster than System** (104.2%)
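The "HAKMEM %" column is HAKMEM throughput divided by System throughput. The sketch below recomputes it from the raw ops/s values; it truncates to one decimal place, which appears to match how the table was produced (an observation from recomputing the rows, not something the report states). `pct_of_system` is a hypothetical helper name.

```bash
# pct_of_system: hypothetical helper, not part of HAKMEM.
# Prints HAKMEM throughput as a percentage of System throughput,
# truncated (not rounded) to one decimal place.
pct_of_system() {
    awk -v h="$1" -v s="$2" 'BEGIN { printf "%.1f\n", int(h / s * 1000) / 10 }'
}

pct_of_system 72090413 65960798   # 128B row, prints 109.2
pct_of_system 55868013 46557492   # 2048B row, prints 119.9
```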
**64B Crash Details:**
```
Exit code: 135 (SIGBUS - unaligned memory access or invalid pointer)
Crash during allocation, not free
```
**Root Cause**: Unknown - possibly alignment issue or class index calculation error for 64B size class.
---
### 3. Long-Run Stability Tests
**Test**: 1,000,000 iterations (10x normal) to check for memory leaks and variance
| Size | Throughput (ops/s) | Change vs 100K run | Status |
|------|-------------------|------------------|--------|
| 128B | 72,829,711 | +1.0% | ✅ Stable |
| 256B | 72,305,587 | +1.3% | ✅ Stable |
| 1024B | 64,240,186 | +1.6% | ✅ Stable |
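**Analysis**:
- Throughput at 1M iterations is within 2% of the 100K-iteration results, indicating stable performance
- No throughput degradation over the longer run, which is consistent with no significant leak; this report records throughput only, so that is indirect evidence
- Scores are slightly higher in long runs (likely cache warmup effects)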
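The stability check amounts to comparing each 1M-iteration result with its 100K-iteration counterpart and testing the relative change against a 2% threshold. A minimal sketch, with `within_pct` as a hypothetical helper name:

```bash
# within_pct: hypothetical helper, not part of HAKMEM.
# Prints the relative change from a baseline and succeeds (exit 0)
# only if its magnitude is below the given threshold in percent.
within_pct() {
    awk -v base="$1" -v new="$2" -v limit="$3" 'BEGIN {
        delta = (new - base) / base * 100
        printf "%+.1f%%\n", delta
        if (delta < 0) delta = -delta
        exit (delta < limit ? 0 : 1)
    }'
}

within_pct 72090413 72829711 2   # 128B: 100K run vs 1M run, prints +1.0%
```

The exit status makes the helper usable in a script condition, for example `within_pct "$base" "$new" 2 || echo "unstable"`.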
---
### 4. Comparison vs Phase 6 Baseline
**Phase 6 Baseline** (from CLAUDE.md):
- Tiny: 52.59 M/s (38.7% of System 135.94 M/s)
- Phase 6 Goal: 85-92% of System
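**Phase 7 Results** (working sizes; improvement is measured against the Phase 6 throughput of 52.59 M/s):
- Tiny (128B): 72.09 M/s (109% of System 65.96 M/s) → **+37% improvement**
- Tiny (256B): 71.36 M/s (99.5% of System) → **+36% improvement**
- Mid (2048B): 55.87 M/s (120% of System) → Exceeds System by +20%
- Caveat: the System baseline differs between the two measurements (135.94 M/s for Phase 6 vs 65.96 M/s here), which suggests different workloads or settings, so the improvement figures are indicative only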
**Goal Achievement**:
- Target: 85-92% of System → **Achieved 76-120%** across working sizes (8 of 9 sizes at 89.8% or higher; 16B at 76.1% misses the target)
- But: **Critical crashes make this irrelevant**
---
### 5. Comprehensive Benchmark (Phase 8 features)
**Status**: Could not run - linking errors
**Issue**: `bench_comprehensive.c` calls Phase 8 functions:
- `hak_tiny_print_memory_profile()`
- `hkm_learner_init()`
- `superslab_ace_print_stats()`
These functions are not available in the Phase 7 build. Running this benchmark would require one of the following:
- Remove the Phase 8 dependencies from the benchmark
- Build with Phase 8 flags
- Use a simpler benchmark suite
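Before choosing an option, it can help to confirm exactly which of the three symbols the linker cannot resolve. A minimal sketch that filters `nm -u` output; `phase8_undefined` is a hypothetical helper, and the object file name in the usage comment is an assumption about the build layout:

```bash
# phase8_undefined: hypothetical helper, not part of HAKMEM.
# Reads `nm -u` style output on stdin and prints which of the three
# Phase 8 symbols named above appear among the undefined symbols.
phase8_undefined() {
    grep -E -o 'hak_tiny_print_memory_profile|hkm_learner_init|superslab_ace_print_stats' | sort -u
}

# Usage (object file name is an assumption):
#   nm -u bench_comprehensive.o | phase8_undefined
```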
---
## Root Cause Analysis
### Issue 1: Multi-threaded Crash (Larson 2T/4T)
**Symptoms**:
- Single-threaded works perfectly (2.76M ops/s)
- 2+ threads crash immediately with "free(): invalid pointer"
- Consistent across 2T and 4T tests
**Debug Output**:
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
```
**Hypotheses**:
1. **Race condition in TLS initialization**: Multiple threads accessing uninitialized TLS
2. **Malloc fallback bug**: Mixed HAKMEM/libc allocations causing double-free
3. **Free path ownership bug**: Wrong allocator freeing blocks from the other
**Priority**: CRITICAL - must fix before any production use
---
### Issue 2: 64B Bus Error Crash
**Symptoms**:
- Bus error (SIGBUS) on 64-byte allocations
- All other sizes (16, 32, 128, 256, ..., 8192) work fine
- Crash happens during allocation, not free
**Hypotheses**:
1. **Class index calculation error**: 64B might map to wrong class
2. **Alignment issue**: 64B blocks not aligned to required boundary
3. **Header corruption**: Class index stored in header (HEADER_CLASSIDX=1) might overflow for 64B
**Clue**: Debug message shows "tiny_alloc(1024) rejected" even for 64B allocations, suggesting routing logic is broken.
**Priority**: CRITICAL - 64B is a common allocation size
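To separate the hypotheses, it may help to sweep sizes around 64B and see whether the crash is isolated to exactly 64 or covers a range, since a single failing size would point toward a class boundary. A minimal sketch; `sweep_sizes` is a hypothetical helper, and the argument order (iterations, size, seed) is inferred from the benchmark invocations elsewhere in this report:

```bash
# sweep_sizes: hypothetical helper, not part of HAKMEM.
# Runs the benchmark once per size and prints its exit status.
# BENCH defaults to the random-mixed binary used in this report.
sweep_sizes() {
    bench=${BENCH:-./bench_random_mixed_hakmem}
    for size in "$@"; do
        rc=0
        "$bench" 100000 "$size" 1234567 >/dev/null 2>&1 || rc=$?
        printf '%sB exit=%s\n' "$size" "$rc"
    done
}

# Example: sweep_sizes 48 56 63 64 65 72 96
```

An exit status of 135 on a given line would correspond to the SIGBUS reported above.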
---
### Issue 3: Debug Output in Production Build
**Symptom**:
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
```
**Impact**:
- Performance overhead (fprintf in hot path)
- Indicates incomplete implementation (rejections shouldn't happen in production)
- Suggests Phase 7 optimizations have broken size routing
**Priority**: HIGH - indicates deeper implementation issues
---
## Production Readiness Assessment
### Success Criteria (from CURRENT_TASK.md)
| Criterion | Result | Status |
|-----------|--------|--------|
| All benchmarks complete without crashes | ❌ 2T/4T Larson crash, 64B crash | FAIL |
| Tiny performance: 85-92% of System | ✅ 8 of 9 working sizes at 89.8% or higher (range 76-120%; 16B at 76.1%) | PASS |
| Mid-Large performance: maintained | ✅ 120% of System | PASS |
| Multi-thread stability: no regression | ❌ Complete crash | FAIL |
| Fragmentation stress: acceptable | ⚠️ Not tested (build issues) | SKIP |
| Comprehensive report generated | ✅ This document | PASS |
**Overall**: **FAIL - 2 critical crashes**
---
## Recommended Next Steps
### Immediate Actions (Critical Bugs)
**1. Fix Multi-threaded Crash (Highest Priority)**
```bash
# Debug with ASan
make clean
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 \
ASAN=1 larson_hakmem
./larson_hakmem 2 8 128 1024 1 12345 2
# Check TLS initialization
grep -r "PREWARM_TLS" core/
# Verify all TLS variables are initialized before thread spawn
```
**Suspected Root Cause**: TLS prewarm not actually executing, or a race in initialization.
**2. Fix 64B Bus Error (High Priority)**
```c
/* Add debug output to the class index calculation.
   File: core/box/hak_alloc_api.inc.h or similar.
   stderr is unbuffered, so the line survives a crash. */
fprintf(stderr, "tiny_alloc(%zu) -> class %d\n", size, class_idx);

/* Check alignment (needs <assert.h> and <stdint.h>).
   File: core/hakmem_tiny_superslab.c */
assert((uintptr_t)ptr % 64 == 0); /* tests the hypothesis that 64B blocks are 64-byte aligned */
```
**Suspected Root Cause**: HEADER_CLASSIDX=1 storing the wrong class index for 64B.
**3. Remove Debug Output**
```bash
# Find and remove/disable debug prints
grep -r "DEBUG.*Phase 7" core/
# Should be gated by #ifdef HAKMEM_DEBUG
```
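Once the prints are gated, a regression check can assert that a production build emits no `[DEBUG]` lines. A minimal sketch that counts them in captured output; `count_debug_lines` is a hypothetical helper name:

```bash
# count_debug_lines: hypothetical helper, not part of HAKMEM.
# Reads program output on stdin and prints how many lines start with "[DEBUG]".
# grep -c exits non-zero when the count is 0, so the status is neutralized.
count_debug_lines() {
    grep -c '^\[DEBUG\]' || true
}

# Usage: a production build should print 0 here.
#   ./bench_random_mixed_hakmem 100000 128 1234567 2>&1 | count_debug_lines
```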
---
### Phase 7 Feature Regression Test
**Before deploying any fix, verify**:
1. All single-threaded benchmarks still pass
2. Performance doesn't regress to Phase 6 levels
3. No new crashes introduced
**Test Suite**:
```bash
# Single-thread (must pass)
./larson_hakmem 1 1 128 1024 1 12345 1 # Expect: 2.76M ops/s
./bench_random_mixed_hakmem 100000 128 1234567 # Expect: 72M ops/s
# Multi-thread (currently fails, must fix)
./larson_hakmem 2 8 128 1024 1 12345 2 # Expect: no crash
./larson_hakmem 4 8 128 1024 1 12345 4 # Expect: no crash
# 64B (currently fails, must fix)
./bench_random_mixed_hakmem 100000 64 1234567 # Expect: no crash, ~70M ops/s
```
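Because the multi-thread failure may be a race, a single clean run is weak evidence of a fix. A minimal sketch that repeats a command and reports how many runs exited cleanly; `repeat_run` is a hypothetical helper name:

```bash
# repeat_run: hypothetical helper, not part of HAKMEM.
# Runs a command N times and prints "passes/N".
repeat_run() {
    n=$1
    shift
    pass=0
    i=0
    while [ "$i" -lt "$n" ]; do
        if "$@" >/dev/null 2>&1; then
            pass=$((pass + 1))
        fi
        i=$((i + 1))
    done
    printf '%s/%s\n' "$pass" "$n"
}

# Example: repeat_run 20 ./larson_hakmem 4 8 128 1024 1 12345 4
```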
---
### Alternate Path: Revert Phase 7 Optimizations
If bugs are too complex to fix quickly:
```bash
# Revert to Phase 6
git checkout HEAD~3 # Or specific Phase 6 commit
# Verify Phase 6 still works
make clean && make larson_hakmem
./larson_hakmem 4 8 128 1024 1 12345 4 # Should work
# Incrementally re-apply Phase 7 optimizations
git cherry-pick <HEADER_CLASSIDX commit> # Test
git cherry-pick <AGGRESSIVE_INLINE commit> # Test
git cherry-pick <PREWARM_TLS commit> # Test
# Identify which commit introduced the bugs
```
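An alternative to manual cherry-picking, not used in this report, is `git bisect run`. One detail matters here: `git bisect run` treats exit code 0 as good, 1-127 (except 125) as bad, 125 as "skip", and aborts on any other code, so a crash reported as 134 or 135 has to be mapped to 1. A minimal sketch of such a wrapper; `bisect_step` is a hypothetical helper name:

```bash
# bisect_step: hypothetical wrapper for use with `git bisect run`.
# $1 is the build command, $2 is the test command (both as shell strings).
bisect_step() {
    build=$1
    run=$2
    sh -c "$build" >/dev/null 2>&1 || return 125   # cannot build: skip this commit
    sh -c "$run" >/dev/null 2>&1 && return 0       # ran cleanly: good
    return 1                                       # crash or failure: bad
}

# Example, as the last line of a script handed to `git bisect run`:
#   bisect_step 'make clean && make larson_hakmem' './larson_hakmem 4 8 128 1024 1 12345 4'
```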
---
## Build Information
**Compiler**: gcc with LTO
**Flags**:
```
-O3 -flto -march=native -mtune=native
-DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1
-DHAKMEM_TINY_FAST_PATH=1
-DHAKMEM_TINY_HEADER_CLASSIDX=1
-DHAKMEM_TINY_AGGRESSIVE_INLINE=1
-DHAKMEM_TINY_PREWARM_TLS=1
```
**Known Issues**:
- `bench_comprehensive` won't link (Phase 8 dependencies)
- `bench_fragment_stress` not tested (same issue)
- Debug output leaking into production builds
---
## Appendix: Full Benchmark Output Samples
### Larson 1T (Success)
```
=== LARSON 1T BASELINE ===
Throughput = 2758490 operations per second, relative time: 362.517s.
Done sleeping...
[ELO] Initialized 12 strategies (thresholds: 512KB-32MB)
[Batch] Initialized (threshold=8 MB, min_size=64 KB, bg=on)
[ACE] ACE disabled (HAKMEM_ACE_ENABLED=0)
```
### Larson 2T (Crash)
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
Exit code: 134
```
### 64B Crash
```
[SUPERSLAB_INIT] class 7 slab 0: usable_size=63488 block_size=1024 capacity=62
[SUPERSLAB_INIT] Expected: 63488 / 1024 = 62 blocks
Exit code: 135 (SIGBUS)
```
---
## Conclusion
**Phase 7 achieved strong single-threaded performance** (76-120% of System malloc on working sizes, with 8 of 9 sizes at 89.8% or higher), **but introduced critical bugs**:
1. **Multi-threaded crash**: Unusable with 2+ threads
2. **64B crash**: Unusable for common allocation size
3. **Incomplete implementation**: Debug fallbacks in production code
**Recommendation**: **DO NOT DEPLOY** to production. Revert to Phase 6 or fix critical bugs before proceeding to Phase 7 Tasks 6-9.
**Next Steps** (in priority order):
1. Fix multi-threaded crash (blocker for all production use)
2. Fix 64B bus error (blocker for most workloads)
3. Remove debug output (quality/performance issue)
4. Re-run comprehensive validation
5. Only then proceed to Phase 7 Tasks 6-9
---
**Generated**: 2025-11-08
**Test Duration**: ~2 hours
**Total Benchmarks**: 16 tests (10 sizes × random mixed, 3 × Larson, 3 × stability)
**Crashes Found**: 2 critical (Larson MT, 64B)
**Production Ready**: ❌ NO