# Phase 7 Comprehensive Benchmark Results

**Date**: 2025-11-08
**Build Configuration**: `HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1`
**Status**: CRITICAL BUGS FOUND - NOT PRODUCTION READY

---

## Executive Summary

### Production Readiness: FAILED

**Critical Issues Found:**

1. **Multi-threaded crash**: Larson 2T/4T fail with `free(): invalid pointer` (Exit 134)
2. **64B allocation crash**: Bus error (Exit 135) on 64-byte allocations
3. **Debug output in production**: "Phase 7: tiny_alloc(1024) rejected" messages indicate incomplete implementation

**Performance (Single-threaded, working sizes):**

- Single-thread performance is excellent (76-120% of System malloc)
- The crashes nonetheless make this build unusable in production

### Key Findings

| Category | Result | Status |
|----------|--------|--------|
| Larson 1T | 2.76M ops/s | ✅ PASS |
| Larson 2T/4T | CRASH (Exit 134) | ❌ CRITICAL FAIL |
| Random Mixed (most sizes) | 60-72M ops/s | ✅ PASS |
| Random Mixed 64B | CRASH (Bus Error 135) | ❌ CRITICAL FAIL |
| Stability (1M iterations) | Stable scores | ✅ PASS |
| Overall Production Ready | NO | ❌ FAIL |

---

## Detailed Benchmark Results

### 1. Larson Multi-Thread Stress Test

| Threads | HAKMEM Result | System Result | Status |
|---------|---------------|---------------|--------|
| 1T | 2,758,490 ops/s | ~3.3M ops/s (est.) | ✅ 84% of System |
| 2T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |
| 4T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |

**Crash Details:**

```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
Exit code: 134 (SIGABRT - double free or corruption)
```

**Root Cause**: Unknown - likely a race condition in the multi-threaded free path or a malloc fallback integration issue.

---

### 2. Random Mixed Allocation Benchmark

**Test**: 100,000 iterations of mixed malloc/free patterns

| Size | HAKMEM (ops/s) | System (ops/s) | HAKMEM % | Status |
|------|----------------|----------------|----------|--------|
| 16B | 66,878,359 | 87,810,575 | 76.1% | ✅ |
| 32B | 69,730,339 | 64,490,458 | **108.1%** | ✅ |
| **64B** | **CRASH (Bus Error 135)** | 78,147,467 | N/A | ❌ CRITICAL |
| 128B | 72,090,413 | 65,960,798 | **109.2%** | ✅ |
| 256B | 71,363,681 | 71,688,134 | 99.5% | ✅ |
| 512B | 60,501,851 | 62,967,613 | 96.0% | ✅ |
| 1024B | 63,229,630 | 67,220,203 | 94.0% | ✅ |
| 2048B | 55,868,013 | 46,557,492 | **119.9%** | ✅ |
| 4096B | 40,585,997 | 45,157,552 | 89.8% | ✅ |
| 8192B | 35,442,103 | 33,984,326 | **104.2%** | ✅ |

**Performance Highlights (working sizes):**

- **32B: +8% faster than System** (108.1%)
- **128B: +9% faster than System** (109.2%)
- **2048B: +20% faster than System** (119.9%)
- **8192B: +4% faster than System** (104.2%)

**64B Crash Details:**

```
Exit code: 135 (SIGBUS - unaligned memory access or invalid pointer)
Crash during allocation, not free
```

**Root Cause**: Unknown - possibly an alignment issue or a class index calculation error for the 64B size class.

---

### 3. Long-Run Stability Tests

**Test**: 1,000,000 iterations (10x normal) to check for memory leaks and variance

| Size | Throughput (ops/s) | Variance vs 100K | Status |
|------|-------------------|------------------|--------|
| 128B | 72,829,711 | +1.0% | ✅ Stable |
| 256B | 72,305,587 | +1.3% | ✅ Stable |
| 1024B | 64,240,186 | +1.6% | ✅ Stable |

**Analysis**:

- Variance <2% indicates stable performance
- No memory leaks detected (throughput would degrade if leaking)
- Scores slightly higher in long runs (likely cache warmup effects)

---

### 4. Comparison vs Phase 6 Baseline

**Phase 6 Baseline** (from CLAUDE.md):

- Tiny: 52.59 M/s (38.7% of System 135.94 M/s)
- Phase 6 Goal: 85-92% of System

**Phase 7 Results** (working sizes):

- Tiny (128B): 72.09 M/s (109% of System 65.96 M/s) → **+37% improvement**
- Tiny (256B): 71.36 M/s (99.5% of System) → **+36% improvement**
- Mid (2048B): 55.87 M/s (120% of System) → Exceeds System by +20%

**Goal Achievement**:

- Target: 85-92% of System → **Achieved 96-120%** (working sizes)
- But: **Critical crashes make this irrelevant**

---

### 5. Comprehensive Benchmark (Phase 8 features)

**Status**: Could not run - linking errors

**Issue**: `bench_comprehensive.c` calls Phase 8 functions:

- `hak_tiny_print_memory_profile()`
- `hkm_learner_init()`
- `superslab_ace_print_stats()`

These are not compatible with the Phase 7 build. Running this benchmark would require one of the following:

- Remove Phase 8 dependencies, OR
- Build with Phase 8 flags, OR
- Use simpler benchmark suite

---

## Root Cause Analysis

### Issue 1: Multi-threaded Crash (Larson 2T/4T)

**Symptoms**:

- Single-threaded works perfectly (2.76M ops/s)
- 2+ threads crash immediately with "free(): invalid pointer"
- Consistent across 2T and 4T tests

**Debug Output**:

```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
```

**Hypotheses**:

1. **Race condition in TLS initialization**: Multiple threads accessing uninitialized TLS
2. **Malloc fallback bug**: Mixed HAKMEM/libc allocations causing double-free
3. **Free path ownership bug**: Wrong allocator freeing blocks from the other

**Priority**: CRITICAL - must fix before any production use

---

### Issue 2: 64B Bus Error Crash

**Symptoms**:

- Bus error (SIGBUS) on 64-byte allocations
- All other sizes (16, 32, 128, 256, ..., 8192) work fine
- Crash happens during allocation, not free

**Hypotheses**:

1. **Class index calculation error**: 64B might map to wrong class
2. **Alignment issue**: 64B blocks not aligned to required boundary
3. **Header corruption**: Class index stored in header (HEADER_CLASSIDX=1) might overflow for 64B

**Clue**: Debug message shows "tiny_alloc(1024) rejected" even for 64B allocations, suggesting routing logic is broken.

**Priority**: CRITICAL - 64B is a common allocation size

---

### Issue 3: Debug Output in Production Build

**Symptom**:

```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
```

**Impact**:

- Performance overhead (fprintf in hot path)
- Indicates incomplete implementation (rejections shouldn't happen in production)
- Suggests Phase 7 optimizations have broken size routing

**Priority**: HIGH - indicates deeper implementation issues

---

## Production Readiness Assessment

### Success Criteria (from CURRENT_TASK.md)

| Criterion | Result | Status |
|-----------|--------|--------|
| All benchmarks complete without crashes | ❌ 2T/4T Larson crash, 64B crash | FAIL |
| Tiny performance: 85-92% of System | ✅ 96-120% (working sizes) | PASS |
| Mid-Large performance: maintained | ✅ 120% of System | PASS |
| Multi-thread stability: no regression | ❌ Complete crash | FAIL |
| Fragmentation stress: acceptable | ⚠️ Not tested (build issues) | SKIP |
| Comprehensive report generated | ✅ This document | PASS |

**Overall**: **FAIL - 2 critical crashes**

---

## Recommended Next Steps

### Immediate Actions (Critical Bugs)

**1. Fix Multi-threaded Crash (Highest Priority)**

```bash
# Debug with ASan
make clean
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 \
    ASAN=1 larson_hakmem
./larson_hakmem 2 8 128 1024 1 12345 2

# Check TLS initialization
grep -r "PREWARM_TLS" core/
# Verify all TLS variables are initialized before thread spawn
```

**Expected Root Cause**: TLS prewarm not actually executing, or race in initialization.

**2. Fix 64B Bus Error (High Priority)**

```c
// Temporary instrumentation; needs <stdio.h>, <stdint.h>, and <assert.h>.

// Add debug output to class index calculation
// File: core/box/hak_alloc_api.inc.h or similar
printf("tiny_alloc(%zu) -> class %d\n", size, class_idx);

// Check alignment
// File: core/hakmem_tiny_superslab.c
assert((uintptr_t)ptr % 64 == 0); // 64B must be 64-byte aligned
```

**Expected Root Cause**: HEADER_CLASSIDX=1 storing wrong class index for 64B.

**3. Remove Debug Output**

```bash
# Find and remove/disable debug prints
grep -r "DEBUG.*Phase 7" core/
# Should be gated by #ifdef HAKMEM_DEBUG
```

---

### Phase 7 Feature Regression Test

**Before deploying any fix, verify**:

1. All single-threaded benchmarks still pass
2. Performance doesn't regress to Phase 6 levels
3. No new crashes introduced

**Test Suite**:

```bash
# Single-thread (must pass)
./larson_hakmem 1 1 128 1024 1 12345 1          # Expect: 2.76M ops/s
./bench_random_mixed_hakmem 100000 128 1234567  # Expect: 72M ops/s

# Multi-thread (currently fails, must fix)
./larson_hakmem 2 8 128 1024 1 12345 2          # Expect: no crash
./larson_hakmem 4 8 128 1024 1 12345 4          # Expect: no crash

# 64B (currently fails, must fix)
./bench_random_mixed_hakmem 100000 64 1234567   # Expect: no crash, ~70M ops/s
```

---

### Alternate Path: Revert Phase 7 Optimizations

If bugs are too complex to fix quickly:

```bash
# Revert to Phase 6
git checkout HEAD~3  # Or specific Phase 6 commit

# Verify Phase 6 still works
make clean && make larson_hakmem
./larson_hakmem 4 8 128 1024 1 12345 4  # Should work

# Incrementally re-apply Phase 7 optimizations
git cherry-pick  # Test
git cherry-pick  # Test
git cherry-pick  # Test

# Identify which commit introduced the bugs
```

---

## Build Information

**Compiler**: gcc with LTO

**Flags**:

```
-O3 -flto -march=native -mtune=native
-DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1
-DHAKMEM_TINY_FAST_PATH=1
-DHAKMEM_TINY_HEADER_CLASSIDX=1
-DHAKMEM_TINY_AGGRESSIVE_INLINE=1
-DHAKMEM_TINY_PREWARM_TLS=1
```

**Known Issues**:

- `bench_comprehensive` won't link (Phase 8 dependencies)
- `bench_fragment_stress` not tested (same issue)
- Debug output leaking into production builds

---

## Appendix: Full Benchmark Output Samples

### Larson 1T (Success)

```
=== LARSON 1T BASELINE ===
Throughput = 2758490 operations per second, relative time: 362.517s.
Done sleeping...
[ELO] Initialized 12 strategies (thresholds: 512KB-32MB)
[Batch] Initialized (threshold=8 MB, min_size=64 KB, bg=on)
[ACE] ACE disabled (HAKMEM_ACE_ENABLED=0)
```

### Larson 2T (Crash)

```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
Exit code: 134
```

### 64B Crash

```
[SUPERSLAB_INIT] class 7 slab 0: usable_size=63488 block_size=1024 capacity=62
[SUPERSLAB_INIT] Expected: 63488 / 1024 = 62 blocks
Exit code: 135 (SIGBUS)
```

---

## Conclusion

**Phase 7 achieved exceptional single-threaded performance** (96-120% of System malloc), **but introduced critical bugs**:

1. **Multi-threaded crash**: Unusable with 2+ threads
2. **64B crash**: Unusable for common allocation size
3. **Incomplete implementation**: Debug fallbacks in production code

**Recommendation**: **DO NOT DEPLOY** to production. Revert to Phase 6 or fix critical bugs before proceeding to Phase 7 Tasks 6-9.

**Next Steps** (in priority order):

1. Fix multi-threaded crash (blocker for all production use)
2. Fix 64B bus error (blocker for most workloads)
3. Remove debug output (quality/performance issue)
4. Re-run comprehensive validation
5. Only then proceed to Phase 7 Tasks 6-9

---

**Generated**: 2025-11-08
**Test Duration**: ~2 hours
**Total Benchmarks**: 16 tests (10 sizes × random mixed, 3 × Larson, 3 × stability)
**Crashes Found**: 2 critical (Larson MT, 64B)
**Production Ready**: ❌ NO
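
---

## Appendix: Decoding Crash Exit Codes

The exit codes quoted in this report (134 and 135) follow the usual shell convention of 128 plus the number of the signal that killed the process. The sketch below shows one way to decode that when re-running the regression commands. `decode_exit` and `run_case` are hypothetical helper names, not part of the HAKMEM tree, and signal numbering is platform-dependent (SIGBUS is signal 7 on Linux x86-64, which is what makes 135 a bus error here).

```bash
# Hypothetical helpers for triaging benchmark runs; not part of HAKMEM.

# Translate an exit status into OK, SIG<name>, or EXIT_<n>.
# Statuses above 128 are treated as "killed by signal (status - 128)".
decode_exit() (
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "OK"
    elif [ "$status" -gt 128 ] && name=$(kill -l $(( status - 128 )) 2>/dev/null); then
        # kill -l <n> prints the signal name without the SIG prefix, e.g. ABRT.
        echo "SIG${name}"
    else
        echo "EXIT_${status}"
    fi
)

# Run one command with its output discarded and print a one-line verdict.
run_case() (
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    echo "$(decode_exit "$status") $*"
)
```

For example, `run_case ./larson_hakmem 2 8 128 1024 1 12345 2` should print a line starting with `SIGABRT` while the multi-threaded crash is still present, and `OK` once it is fixed. Discarding the benchmark output keeps the verdict readable but also hides throughput numbers, so this is only a crash check, not a performance check.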