# Phase 7 Comprehensive Benchmark Results

**Date**: 2025-11-08
**Build Configuration**: `HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1`
**Status**: CRITICAL BUGS FOUND - NOT PRODUCTION READY

---

## Executive Summary

### Production Readiness: FAILED

**Critical Issues Found:**
1. **Multi-threaded crash**: Larson 2T/4T fail with `free(): invalid pointer` (Exit 134)
2. **64B allocation crash**: Bus error (Exit 135) on 64-byte allocations
3. **Debug output in production**: "Phase 7: tiny_alloc(1024) rejected" messages indicate incomplete implementation

**Performance (Single-threaded, working sizes):**
- Single-thread throughput ranges from 76% to 120% of System malloc
- The crashes above make the build unusable in production regardless of throughput

### Key Findings

| Category | Result | Status |
|----------|--------|--------|
| Larson 1T | 2.76M ops/s | ✅ PASS |
| Larson 2T/4T | CRASH (Exit 134) | ❌ CRITICAL FAIL |
| Random Mixed (most sizes) | 60-72M ops/s | ✅ PASS |
| Random Mixed 64B | CRASH (Bus Error 135) | ❌ CRITICAL FAIL |
| Stability (1M iterations) | Throughput within 2% of 100K runs | ✅ PASS |
| Overall Production Ready | NO | ❌ FAIL |

---

## Detailed Benchmark Results

### 1. Larson Multi-Thread Stress Test

| Threads | HAKMEM Result | System Result | Status |
|---------|---------------|---------------|--------|
| 1T | 2,758,490 ops/s | ~3.3M ops/s (est.) | ✅ 84% of System |
| 2T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |
| 4T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |

**Crash Details:**
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
Exit code: 134 (SIGABRT - double free or corruption)
```

**Root Cause**: Unknown. Likely a race condition in the multi-threaded free path or a malloc fallback integration issue.
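
The exit codes in this report follow the shell convention of 128 plus the signal number, so 134 corresponds to signal 6 and 135 to signal 7 (SIGABRT and SIGBUS on Linux). A minimal sketch of that decoding, where both function names are hypothetical helpers and not part of the benchmark suite:

```c
#include <signal.h>

/* Hypothetical helper: returns the signal number encoded in a shell exit
 * status, or 0 if the status does not indicate death by signal. */
int signal_from_exit_code(int exit_code)
{
    if (exit_code > 128 && exit_code < 256)
        return exit_code - 128;
    return 0;
}

/* Maps the two signals seen in this report to short names. */
const char *crash_signal_name(int exit_code)
{
    switch (signal_from_exit_code(exit_code)) {
    case SIGABRT: return "SIGABRT";
    case SIGBUS:  return "SIGBUS";
    case 0:       return "none";
    default:      return "other";
    }
}
```

Signal numbers other than SIGABRT vary by platform, so the mapping above should be read against the `<signal.h>` of the machine that ran the benchmark.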

---

### 2. Random Mixed Allocation Benchmark

**Test**: 100,000 iterations of mixed malloc/free patterns

| Size | HAKMEM (ops/s) | System (ops/s) | HAKMEM % | Status |
|------|----------------|----------------|----------|--------|
| 16B | 66,878,359 | 87,810,575 | 76.1% | ✅ |
| 32B | 69,730,339 | 64,490,458 | **108.1%** | ✅ |
| **64B** | **CRASH (Bus Error 135)** | 78,147,467 | N/A | ❌ CRITICAL |
| 128B | 72,090,413 | 65,960,798 | **109.2%** | ✅ |
| 256B | 71,363,681 | 71,688,134 | 99.5% | ✅ |
| 512B | 60,501,851 | 62,967,613 | 96.0% | ✅ |
| 1024B | 63,229,630 | 67,220,203 | 94.0% | ✅ |
| 2048B | 55,868,013 | 46,557,492 | **119.9%** | ✅ |
| 4096B | 40,585,997 | 45,157,552 | 89.8% | ✅ |
| 8192B | 35,442,103 | 33,984,326 | **104.2%** | ✅ |

**Performance Highlights (working sizes):**
- **32B: 8% faster than System** (108.1%)
- **128B: 9% faster than System** (109.2%)
- **2048B: 20% faster than System** (119.9%)
- **8192B: 4% faster than System** (104.2%)
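
The HAKMEM % column is the ratio of the two throughput columns. A minimal sketch of that calculation, assuming nothing beyond the table itself (`relative_percent` is a hypothetical helper, not a function from the benchmark suite):

```c
#include <stdint.h>

/* HAKMEM throughput expressed as a percentage of System throughput.
 * Returns 0.0 when the System figure is missing (zero). */
double relative_percent(uint64_t hakmem_ops, uint64_t system_ops)
{
    if (system_ops == 0)
        return 0.0;
    return 100.0 * (double)hakmem_ops / (double)system_ops;
}
```

For the 2048B row this gives roughly 119.998, which the table lists as 119.9%, so the published percentages appear to be truncated to one decimal place rather than rounded.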

**64B Crash Details:**
```
Exit code: 135 (SIGBUS - unaligned memory access or invalid pointer)
Crash during allocation, not free
```

**Root Cause**: Unknown - possibly alignment issue or class index calculation error for 64B size class.

---

### 3. Long-Run Stability Tests

**Test**: 1,000,000 iterations (10x normal) to check for memory leaks and variance

| Size | Throughput (ops/s) | Change vs 100K run | Status |
|------|-------------------|------------------|--------|
| 128B | 72,829,711 | +1.0% | ✅ Stable |
| 256B | 72,305,587 | +1.3% | ✅ Stable |
| 1024B | 64,240,186 | +1.6% | ✅ Stable |

**Analysis**:
- A change of less than 2% against the 100K runs indicates stable performance
- No sign of memory leaks: throughput did not degrade over the longer run. This is indirect evidence only, since this report contains no memory usage measurements
- Scores slightly higher in long runs (likely cache warmup effects)
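
Throughput alone cannot rule out a leak, so a more direct check would compare peak memory between the 100K and 1M runs. The sketch below uses POSIX `getrusage` for that; `peak_rss_kb` and `rss_growth_exceeds` are hypothetical helpers, and the growth threshold is an assumption to be tuned, not a value from this report:

```c
#include <sys/resource.h>

/* Peak resident set size of the calling process, or -1 on error.
 * On Linux ru_maxrss is reported in kilobytes; other platforms may
 * use different units. */
long peak_rss_kb(void)
{
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0)
        return -1;
    return usage.ru_maxrss;
}

/* Returns 1 if memory grew by more than max_ratio between two samples,
 * 0 otherwise (including when the first sample is unusable). */
int rss_growth_exceeds(long before_kb, long after_kb, double max_ratio)
{
    if (before_kb <= 0)
        return 0;
    return (double)after_kb > (double)before_kb * max_ratio;
}
```

Sampling `peak_rss_kb()` at the end of a 100K run and again at the end of a 1M run, then passing both to `rss_growth_exceeds`, would flag growth that scales with iteration count.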

---

### 4. Comparison vs Phase 6 Baseline

**Phase 6 Baseline** (from CLAUDE.md):
- Tiny: 52.59 M/s (38.7% of System 135.94 M/s)
- Phase 6 Goal: 85-92% of System

**Phase 7 Results** (working sizes):
- Tiny (128B): 72.09 M/s (109% of System 65.96 M/s) → **+37% over the Phase 6 baseline of 52.59 M/s**
- Tiny (256B): 71.36 M/s (99.5% of System) → **+36% over the Phase 6 baseline**
- Mid (2048B): 55.87 M/s (120% of System) → exceeds System by 20%

**Goal Achievement**:
- Target: 85-92% of System → **Met or exceeded at every working size except 16B** (76.1%); the other sizes range from 89.8% to 119.9%
- But: **The critical crashes make this irrelevant**

---

### 5. Comprehensive Benchmark (Phase 8 features)

**Status**: Could not run - linking errors

**Issue**: `bench_comprehensive.c` calls Phase 8 functions:
- `hak_tiny_print_memory_profile()`
- `hkm_learner_init()`
- `superslab_ace_print_stats()`

These functions are not available in the Phase 7 build. Running this benchmark would require one of the following:
- Remove Phase 8 dependencies, OR
- Build with Phase 8 flags, OR
- Use simpler benchmark suite
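
One possible way to loosen the Phase 8 dependency without deleting the call sites is to declare the reporting functions as weak symbols, a GCC/Clang extension on ELF targets, and call them only when they were linked in. The sketch below assumes both functions take no arguments and return nothing; the report does not show their real prototypes, and `print_optional_phase8_stats` is a hypothetical wrapper:

```c
/* Assumption: the real prototypes are unknown; void(void) is used here
 * purely for illustration. A weak declaration lets the program link even
 * when no definition is provided, in which case the address is null. */
extern void hak_tiny_print_memory_profile(void) __attribute__((weak));
extern void superslab_ace_print_stats(void) __attribute__((weak));

/* Calls each optional reporter only if it is present in the binary.
 * Returns the number of reporters that were actually called. */
int print_optional_phase8_stats(void)
{
    int called = 0;

    if (hak_tiny_print_memory_profile) {
        hak_tiny_print_memory_profile();
        called++;
    }
    if (superslab_ace_print_stats) {
        superslab_ace_print_stats();
        called++;
    }
    return called;
}
```

In a Phase 7 build neither symbol is defined, so the wrapper returns 0 and the benchmark can continue without Phase 8 statistics.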

---

## Root Cause Analysis

### Issue 1: Multi-threaded Crash (Larson 2T/4T)

**Symptoms**:
- Single-threaded works perfectly (2.76M ops/s)
- 2+ threads crash immediately with "free(): invalid pointer"
- Consistent across 2T and 4T tests

**Debug Output**:
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
```

**Hypotheses**:
1. **Race condition in TLS initialization**: Multiple threads accessing uninitialized TLS
2. **Malloc fallback bug**: Mixed HAKMEM/libc allocations causing double-free
3. **Free path ownership bug**: Wrong allocator freeing blocks from the other

**Priority**: CRITICAL - must fix before any production use
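
Hypotheses 2 and 3 both come down to a pointer being released by an allocator that did not hand it out. A minimal sketch of ownership-based routing follows, assuming a single contiguous arena and address-range ownership. This is an illustration of the idea only: every `demo_` name is hypothetical, the bump allocator is not thread-safe, and HAKMEM's actual ownership test is not shown in this report.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_ARENA_SIZE (64 * 1024)

static unsigned char demo_arena[DEMO_ARENA_SIZE];
static size_t demo_arena_used;

/* Bump-allocates from the arena. Returns NULL when the request does not
 * fit, which is where a real allocator would fall back to malloc. */
void *demo_arena_alloc(size_t size)
{
    size_t rounded = (size + 15u) & ~(size_t)15u;
    if (rounded < size || rounded > DEMO_ARENA_SIZE - demo_arena_used)
        return NULL;
    void *p = demo_arena + demo_arena_used;
    demo_arena_used += rounded;
    return p;
}

/* Ownership test by address range: true only for arena pointers. */
int demo_arena_owns(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t base = (uintptr_t)demo_arena;
    return addr >= base && addr < base + DEMO_ARENA_SIZE;
}

/* Routes a free to the allocator that owns the pointer.
 * Returns 1 for arena pointers, 0 when forwarded to libc free(). */
int demo_routed_free(void *p)
{
    if (p == NULL)
        return 0;
    if (demo_arena_owns(p))
        return 1;           /* bump arena: nothing to release */
    free(p);                /* came from the malloc fallback */
    return 0;
}
```

The point of the design is that the free path never guesses: a pointer that fails the ownership test is always handed to libc, and a pointer that passes it never is.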

---

### Issue 2: 64B Bus Error Crash

**Symptoms**:
- Bus error (SIGBUS) on 64-byte allocations
- All other sizes (16, 32, 128, 256, ..., 8192) work fine
- Crash happens during allocation, not free

**Hypotheses**:
1. **Class index calculation error**: 64B might map to wrong class
2. **Alignment issue**: 64B blocks not aligned to required boundary
3. **Header corruption**: Class index stored in header (HEADER_CLASSIDX=1) might overflow for 64B

**Clue**: Debug message shows "tiny_alloc(1024) rejected" even for 64B allocations, suggesting routing logic is broken.

**Priority**: CRITICAL - 64B is a common allocation size
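
To make hypothesis 1 concrete, the sketch below shows a size-to-class mapping with boundary behaviour that is easy to test. It assumes eight power-of-two classes from 8B to 1024B, which is consistent with the `class 7 ... block_size=1024` line in the appendix log but is otherwise not confirmed by this report. All `DEMO_` and `demo_` names are hypothetical.

```c
#include <stddef.h>

#define DEMO_NUM_CLASSES 8   /* assumed: 8, 16, 32, 64, 128, 256, 512, 1024 */
#define DEMO_MIN_BLOCK   8u

/* Returns the smallest class whose block size fits `size`,
 * or -1 when the request is too large for the tiny classes. */
int demo_size_to_class(size_t size)
{
    size_t block = DEMO_MIN_BLOCK;
    for (int idx = 0; idx < DEMO_NUM_CLASSES; idx++, block <<= 1) {
        if (size <= block)
            return idx;
    }
    return -1;
}

/* Inverse mapping: block size for a class index, or 0 if out of range. */
size_t demo_class_to_block_size(int idx)
{
    if (idx < 0 || idx >= DEMO_NUM_CLASSES)
        return 0;
    return (size_t)DEMO_MIN_BLOCK << idx;
}
```

Under these assumptions a 64-byte request lands exactly on a class boundary (class 3), while 65 bytes moves to class 4. Whether the header byte is counted inside the block is not stated in this report, so an off-by-one at that boundary is one thing worth checking.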

---

### Issue 3: Debug Output in Production Build

**Symptom**:
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
```

**Impact**:
- Performance overhead (fprintf in hot path)
- Indicates incomplete implementation (rejections shouldn't happen in production)
- Suggests Phase 7 optimizations have broken size routing

**Priority**: HIGH - indicates deeper implementation issues

---

## Production Readiness Assessment

### Success Criteria (from CURRENT_TASK.md)

| Criterion | Result | Status |
|-----------|--------|--------|
| All benchmarks complete without crashes | ❌ 2T/4T Larson crash, 64B crash | FAIL |
| Tiny performance: 85-92% of System | ✅ 94-109% for 32B-1024B; 16B at 76.1% | PASS (16B below target) |
| Mid-Large performance: maintained | ✅ 89.8-119.9% of System (2048B-8192B) | PASS |
| Multi-thread stability: no regression | ❌ Crashes at 2T and 4T | FAIL |
| Fragmentation stress: acceptable | ⚠️ Not tested (build issues) | SKIP |
| Comprehensive report generated | ✅ This document | PASS |

**Overall**: **FAIL - 2 critical crashes**

---

## Recommended Next Steps

### Immediate Actions (Critical Bugs)

**1. Fix Multi-threaded Crash (Highest Priority)**
```bash
# Debug with ASan
make clean
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 \
  ASAN=1 larson_hakmem
./larson_hakmem 2 8 128 1024 1 12345 2

# Check TLS initialization
grep -r "PREWARM_TLS" core/
# Verify all TLS variables are initialized before thread spawn
```

**Suspected Root Cause**: TLS prewarm not actually executing, or race in initialization.
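
If the prewarm step can be skipped on some threads, a defensive pattern is to guard the per-thread state with a lazy initialization check in its accessor, so that a thread that was never prewarmed still sees valid state. The sketch below illustrates that pattern only. The struct layout and every `demo_` name are hypothetical, and `__thread` is the GCC/Clang thread-local storage extension.

```c
#include <stddef.h>

/* Hypothetical per-thread cache state. */
typedef struct {
    int    initialized;
    size_t cached_blocks;
} demo_tls_cache_t;

/* Thread-local storage is zero-initialized, so `initialized` starts at 0
 * on every new thread without any explicit prewarm call. */
static __thread demo_tls_cache_t demo_tls_cache;

/* Counts initializer runs on the calling thread (for verification only). */
static __thread int demo_tls_init_calls;

static void demo_tls_cache_init(demo_tls_cache_t *cache)
{
    cache->cached_blocks = 0;
    cache->initialized = 1;
    demo_tls_init_calls++;
}

/* Accessor that initializes on first use. No locking is needed because
 * each thread only ever touches its own copy. */
demo_tls_cache_t *demo_tls_cache_get(void)
{
    if (!demo_tls_cache.initialized)
        demo_tls_cache_init(&demo_tls_cache);
    return &demo_tls_cache;
}

int demo_tls_init_count(void)
{
    return demo_tls_init_calls;
}
```

The extra branch costs one predictable comparison per access, which may be acceptable while the crash is being diagnosed even if it is later removed from the hot path.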

**2. Fix 64B Bus Error (Critical)**
```c
/* Both fragments need <stdio.h>, <assert.h> and <stdint.h>. */

/* Add debug output to the class index calculation.
 * File: core/box/hak_alloc_api.inc.h or similar.
 * stderr is unbuffered, so the line is not lost when the process crashes. */
fprintf(stderr, "tiny_alloc(%zu) -> class %d\n", size, class_idx);

/* Check alignment.
 * File: core/hakmem_tiny_superslab.c */
assert((uintptr_t)ptr % 64 == 0); /* 64B must be 64-byte aligned */
```

**Suspected Root Cause**: HEADER_CLASSIDX=1 storing wrong class index for 64B.
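
The alignment assertion above interacts with the header: if a class index byte is stored in front of the user data, the pointer returned to the caller is offset from the block base. The sketch below shows how to check both pointers separately. The 1-byte header is an assumption suggested by the `HEADER_CLASSIDX` flag name, since the actual block layout is not shown in this report, and both `demo_` functions are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Returns 1 if p is aligned to `alignment`, which must be a nonzero
 * power of two; returns 0 otherwise. */
int demo_is_aligned(const void *p, size_t alignment)
{
    if (alignment == 0 || (alignment & (alignment - 1)) != 0)
        return 0;
    return ((uintptr_t)p & (alignment - 1)) == 0;
}

/* User pointer for a block that starts with header_size bytes of header. */
void *demo_user_ptr(void *block_base, size_t header_size)
{
    return (unsigned char *)block_base + header_size;
}
```

With a 64-byte-aligned block base and a 1-byte header, `demo_user_ptr` yields an address that fails the 64-byte check, so the assertion should be placed on whichever pointer the allocator actually promises to align.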

**3. Remove Debug Output**
```bash
# Find and remove/disable debug prints
grep -r "DEBUG.*Phase 7" core/
# Should be gated by #ifdef HAKMEM_DEBUG
```
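
A minimal sketch of the gating mentioned in the last comment, assuming a logging macro that compiles to nothing unless `HAKMEM_DEBUG` is defined. The macro name, the routing function, and the 512-byte limit are all hypothetical and chosen only for illustration:

```c
#include <stdio.h>

#ifdef HAKMEM_DEBUG
/* Debug builds: print to stderr with a fixed prefix. */
#define DEMO_DEBUG_LOG(...) fprintf(stderr, "[DEBUG] " __VA_ARGS__)
#else
/* Release builds: expands to a no-op, arguments are not evaluated. */
#define DEMO_DEBUG_LOG(...) ((void)0)
#endif

#define DEMO_TINY_MAX 512u   /* illustrative limit, not HAKMEM's real one */

/* Hypothetical routing decision: 1 = tiny path, 0 = malloc fallback. */
int demo_route_is_tiny(size_t size)
{
    if (size > DEMO_TINY_MAX) {
        DEMO_DEBUG_LOG("tiny_alloc(%zu) rejected, using malloc fallback\n", size);
        return 0;
    }
    return 1;
}
```

Because the release variant discards its arguments at preprocessing time, the hot path carries no `fprintf` call and no formatting cost when `HAKMEM_DEBUG` is undefined.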

---

### Phase 7 Feature Regression Test

**Before deploying any fix, verify**:
1. All single-threaded benchmarks still pass
2. Performance doesn't regress to Phase 6 levels
3. No new crashes introduced

**Test Suite**:
```bash
# Single-thread (must pass)
./larson_hakmem 1 1 128 1024 1 12345 1      # Expect: 2.76M ops/s
./bench_random_mixed_hakmem 100000 128 1234567  # Expect: 72M ops/s

# Multi-thread (currently fails, must fix)
./larson_hakmem 2 8 128 1024 1 12345 2      # Expect: no crash
./larson_hakmem 4 8 128 1024 1 12345 4      # Expect: no crash

# 64B (currently fails, must fix)
./bench_random_mixed_hakmem 100000 64 1234567   # Expect: no crash, ~70M ops/s
```

---

### Alternate Path: Revert Phase 7 Optimizations

If bugs are too complex to fix quickly:

```bash
# Revert to Phase 6
git checkout HEAD~3  # Or specific Phase 6 commit

# Verify Phase 6 still works
make clean && make larson_hakmem
./larson_hakmem 4 8 128 1024 1 12345 4  # Should work

# Incrementally re-apply Phase 7 optimizations
git cherry-pick <HEADER_CLASSIDX commit>  # Test
git cherry-pick <AGGRESSIVE_INLINE commit>  # Test
git cherry-pick <PREWARM_TLS commit>  # Test
# Identify which commit introduced the bugs
```

---

## Build Information

**Compiler**: gcc with LTO
**Flags**:
```
-O3 -flto -march=native -mtune=native
-DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1
-DHAKMEM_TINY_FAST_PATH=1
-DHAKMEM_TINY_HEADER_CLASSIDX=1
-DHAKMEM_TINY_AGGRESSIVE_INLINE=1
-DHAKMEM_TINY_PREWARM_TLS=1
```

**Known Issues**:
- `bench_comprehensive` won't link (Phase 8 dependencies)
- `bench_fragment_stress` not tested (same issue)
- Debug output leaking into production builds

---

## Appendix: Full Benchmark Output Samples

### Larson 1T (Success)
```
=== LARSON 1T BASELINE ===
Throughput = 2758490 operations per second, relative time: 362.517s.
Done sleeping...
[ELO] Initialized 12 strategies (thresholds: 512KB-32MB)
[Batch] Initialized (threshold=8 MB, min_size=64 KB, bg=on)
[ACE] ACE disabled (HAKMEM_ACE_ENABLED=0)
```

### Larson 2T (Crash)
```
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
free(): invalid pointer
Exit code: 134
```

### 64B Crash
```
[SUPERSLAB_INIT] class 7 slab 0: usable_size=63488 block_size=1024 capacity=62
[SUPERSLAB_INIT] Expected: 63488 / 1024 = 62 blocks
Exit code: 135 (SIGBUS)
```

---

## Conclusion

**Phase 7 achieved strong single-threaded performance** (76-120% of System malloc across working sizes), **but introduced critical bugs**:

1. **Multi-threaded crash**: Unusable with 2+ threads
2. **64B crash**: Unusable for common allocation size
3. **Incomplete implementation**: Debug fallbacks in production code

**Recommendation**: **DO NOT DEPLOY** to production. Revert to Phase 6 or fix critical bugs before proceeding to Phase 7 Tasks 6-9.

**Next Steps** (in priority order):
1. Fix multi-threaded crash (blocker for all production use)
2. Fix 64B bus error (blocker for most workloads)
3. Remove debug output (quality/performance issue)
4. Re-run comprehensive validation
5. Only then proceed to Phase 7 Tasks 6-9

---

**Generated**: 2025-11-08
**Test Duration**: ~2 hours
**Total Benchmarks**: 16 tests (10 sizes × random mixed, 3 × Larson, 3 × stability)
**Crashes Found**: 2 critical (Larson MT, 64B)
**Production Ready**: ❌ NO