Phase 1 完了:環境変数整理 + fprintf デバッグガード ENV変数削除(BG/HotMag系): - core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines) - core/hakmem_tiny_bg_spill.c: BG spill ENV 削除 - core/tiny_refill.h: BG remote 固定値化 - core/hakmem_tiny_slow.inc: BG refs 削除 fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE): - core/hakmem_shared_pool.c: Lock stats (~18 fprintf) - core/page_arena.c: Init/Shutdown/Stats (~27 fprintf) - core/hakmem.c: SIGSEGV init message ドキュメント整理: - 328 markdown files 削除(旧レポート・重複docs) 性能確認: - Larson: 52.35M ops/s (前回52.8M、安定動作✅) - ENV整理による機能影響なし - Debug出力は一部残存(次phase で対応) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
370 lines
11 KiB
Markdown
370 lines
11 KiB
Markdown
# Phase 7 Comprehensive Benchmark Results
|
||
|
||
**Date**: 2025-11-08
|
||
**Build Configuration**: `HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1`
|
||
**Status**: CRITICAL BUGS FOUND - NOT PRODUCTION READY
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
### Production Readiness: FAILED
|
||
|
||
**Critical Issues Found:**
|
||
1. **Multi-threaded crash**: Larson 2T/4T fail with `free(): invalid pointer` (Exit 134)
|
||
2. **64B allocation crash**: Bus error (Exit 135) on 64-byte allocations
|
||
3. **Debug output in production**: "Phase 7: tiny_alloc(1024) rejected" messages indicate incomplete implementation
|
||
|
||
**Performance (Single-threaded, working sizes):**
|
||
- Single-thread performance is excellent (76-120% of System malloc)
|
||
- But crashes make this unusable in production
|
||
|
||
### Key Findings
|
||
|
||
| Category | Result | Status |
|
||
|----------|--------|--------|
|
||
| Larson 1T | 2.76M ops/s | ✅ PASS |
|
||
| Larson 2T/4T | CRASH (Exit 134) | ❌ CRITICAL FAIL |
|
||
| Random Mixed (most sizes) | 60-72M ops/s | ✅ PASS |
|
||
| Random Mixed 64B | CRASH (Bus Error 135) | ❌ CRITICAL FAIL |
|
||
| Stability (1M iterations) | Stable scores | ✅ PASS |
|
||
| Overall Production Ready | NO | ❌ FAIL |
|
||
|
||
---
|
||
|
||
## Detailed Benchmark Results
|
||
|
||
### 1. Larson Multi-Thread Stress Test
|
||
|
||
| Threads | HAKMEM Result | System Result | Status |
|
||
|---------|---------------|---------------|--------|
|
||
| 1T | 2,758,490 ops/s | ~3.3M ops/s (est.) | ✅ 84% of System |
|
||
| 2T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |
|
||
| 4T | **CRASH (Exit 134)** | N/A | ❌ CRITICAL |
|
||
|
||
**Crash Details:**
|
||
```
|
||
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
|
||
free(): invalid pointer
|
||
Exit code: 134 (SIGABRT - double free or corruption)
|
||
```
|
||
|
||
**Root Cause**: Unknown - likely race condition in multi-threaded free path or malloc fallback integration issue.
|
||
|
||
---
|
||
|
||
### 2. Random Mixed Allocation Benchmark
|
||
|
||
**Test**: 100,000 iterations of mixed malloc/free patterns
|
||
|
||
| Size | HAKMEM (ops/s) | System (ops/s) | HAKMEM % | Status |
|
||
|------|----------------|----------------|----------|--------|
|
||
| 16B | 66,878,359 | 87,810,575 | 76.1% | ✅ |
|
||
| 32B | 69,730,339 | 64,490,458 | **108.1%** | ✅ |
|
||
| **64B** | **CRASH (Bus Error 135)** | 78,147,467 | N/A | ❌ CRITICAL |
|
||
| 128B | 72,090,413 | 65,960,798 | **109.2%** | ✅ |
|
||
| 256B | 71,363,681 | 71,688,134 | 99.5% | ✅ |
|
||
| 512B | 60,501,851 | 62,967,613 | 96.0% | ✅ |
|
||
| 1024B | 63,229,630 | 67,220,203 | 94.0% | ✅ |
|
||
| 2048B | 55,868,013 | 46,557,492 | **119.9%** | ✅ |
|
||
| 4096B | 40,585,997 | 45,157,552 | 89.8% | ✅ |
|
||
| 8192B | 35,442,103 | 33,984,326 | **104.2%** | ✅ |
|
||
|
||
**Performance Highlights (working sizes):**
|
||
- **32B: +8% faster than System** (108.1%)
|
||
- **128B: +9% faster than System** (109.2%)
|
||
- **2048B: +20% faster than System** (119.9%)
|
||
- **8192B: +4% faster than System** (104.2%)
|
||
|
||
**64B Crash Details:**
|
||
```
|
||
Exit code: 135 (SIGBUS - unaligned memory access or invalid pointer)
|
||
Crash during allocation, not free
|
||
```
|
||
|
||
**Root Cause**: Unknown - possibly alignment issue or class index calculation error for 64B size class.
|
||
|
||
---
|
||
|
||
### 3. Long-Run Stability Tests
|
||
|
||
**Test**: 1,000,000 iterations (10x normal) to check for memory leaks and variance
|
||
|
||
| Size | Throughput (ops/s) | Variance vs 100K | Status |
|
||
|------|-------------------|------------------|--------|
|
||
| 128B | 72,829,711 | +1.0% | ✅ Stable |
|
||
| 256B | 72,305,587 | +1.3% | ✅ Stable |
|
||
| 1024B | 64,240,186 | +1.6% | ✅ Stable |
|
||
|
||
**Analysis**:
|
||
- Variance <2% indicates stable performance
|
||
- No memory leaks detected (throughput would degrade if leaking)
|
||
- Scores slightly higher in long runs (likely cache warmup effects)
|
||
|
||
---
|
||
|
||
### 4. Comparison vs Phase 6 Baseline
|
||
|
||
**Phase 6 Baseline** (from CLAUDE.md):
|
||
- Tiny: 52.59 M/s (38.7% of System 135.94 M/s)
|
||
- Phase 6 Goal: 85-92% of System
|
||
|
||
**Phase 7 Results** (working sizes):
|
||
- Tiny (128B): 72.09 M/s (109% of System 65.96 M/s) → **+37% improvement**
|
||
- Tiny (256B): 71.36 M/s (99.5% of System) → **+36% improvement**
|
||
- Mid (2048B): 55.87 M/s (120% of System) → Exceeds System by +20%
|
||
|
||
**Goal Achievement**:
|
||
- Target: 85-92% of System → **Achieved 96-120%** (working sizes)
|
||
- But: **Critical crashes make this irrelevant**
|
||
|
||
---
|
||
|
||
### 5. Comprehensive Benchmark (Phase 8 features)
|
||
|
||
**Status**: Could not run - linking errors
|
||
|
||
**Issue**: `bench_comprehensive.c` calls Phase 8 functions:
|
||
- `hak_tiny_print_memory_profile()`
|
||
- `hkm_learner_init()`
|
||
- `superslab_ace_print_stats()`
|
||
|
||
These are not compatible with Phase 7 build. Would need:
|
||
- Remove Phase 8 dependencies, OR
|
||
- Build with Phase 8 flags, OR
|
||
- Use simpler benchmark suite
|
||
|
||
---
|
||
|
||
## Root Cause Analysis
|
||
|
||
### Issue 1: Multi-threaded Crash (Larson 2T/4T)
|
||
|
||
**Symptoms**:
|
||
- Single-threaded works perfectly (2.76M ops/s)
|
||
- 2+ threads crash immediately with "free(): invalid pointer"
|
||
- Consistent across 2T and 4T tests
|
||
|
||
**Debug Output**:
|
||
```
|
||
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
|
||
```
|
||
|
||
**Hypotheses**:
|
||
1. **Race condition in TLS initialization**: Multiple threads accessing uninitialized TLS
|
||
2. **Malloc fallback bug**: Mixed HAKMEM/libc allocations causing double-free
|
||
3. **Free path ownership bug**: Wrong allocator freeing blocks from the other
|
||
|
||
**Priority**: CRITICAL - must fix before any production use
|
||
|
||
---
|
||
|
||
### Issue 2: 64B Bus Error Crash
|
||
|
||
**Symptoms**:
|
||
- Bus error (SIGBUS) on 64-byte allocations
|
||
- All other sizes (16, 32, 128, 256, ..., 8192) work fine
|
||
- Crash happens during allocation, not free
|
||
|
||
**Hypotheses**:
|
||
1. **Class index calculation error**: 64B might map to wrong class
|
||
2. **Alignment issue**: 64B blocks not aligned to required boundary
|
||
3. **Header corruption**: Class index stored in header (HEADER_CLASSIDX=1) might overflow for 64B
|
||
|
||
**Clue**: Debug message shows "tiny_alloc(1024) rejected" even for 64B allocations, suggesting routing logic is broken.
|
||
|
||
**Priority**: CRITICAL - 64B is a common allocation size
|
||
|
||
---
|
||
|
||
### Issue 3: Debug Output in Production Build
|
||
|
||
**Symptom**:
|
||
```
|
||
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
|
||
```
|
||
|
||
**Impact**:
|
||
- Performance overhead (fprintf in hot path)
|
||
- Indicates incomplete implementation (rejections shouldn't happen in production)
|
||
- Suggests Phase 7 optimizations have broken size routing
|
||
|
||
**Priority**: HIGH - indicates deeper implementation issues
|
||
|
||
---
|
||
|
||
## Production Readiness Assessment
|
||
|
||
### Success Criteria (from CURRENT_TASK.md)
|
||
|
||
| Criterion | Result | Status |
|
||
|-----------|--------|--------|
|
||
| ✅ All benchmarks complete without crashes | ❌ 2T/4T Larson crash, 64B crash | FAIL |
|
||
| ✅ Tiny performance: 85-92% of System | ✅ 96-120% (working sizes) | PASS |
|
||
| ✅ Mid-Large performance: maintained | ✅ 120% of System | PASS |
|
||
| ✅ Multi-thread stability: no regression | ❌ Complete crash | FAIL |
|
||
| ✅ Fragmentation stress: acceptable | ⚠️ Not tested (build issues) | SKIP |
|
||
| ✅ Comprehensive report generated | ✅ This document | PASS |
|
||
|
||
**Overall**: **FAIL - 2 critical crashes**
|
||
|
||
---
|
||
|
||
## Recommended Next Steps
|
||
|
||
### Immediate Actions (Critical Bugs)
|
||
|
||
**1. Fix Multi-threaded Crash (Highest Priority)**
|
||
```bash
|
||
# Debug with ASan
|
||
make clean
|
||
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 \
|
||
ASAN=1 larson_hakmem
|
||
./larson_hakmem 2 8 128 1024 1 12345 2
|
||
|
||
# Check TLS initialization
|
||
grep -r "PREWARM_TLS" core/
|
||
# Verify all TLS variables are initialized before thread spawn
|
||
```
|
||
|
||
**Expected Root Cause**: TLS prewarm not actually executing, or race in initialization.
|
||
|
||
**2. Fix 64B Bus Error (High Priority)**
|
||
```bash
|
||
# Add debug output to class index calculation
|
||
# File: core/box/hak_alloc_api.inc.h or similar
|
||
printf("tiny_alloc(%zu) -> class %d\n", size, class_idx);
|
||
|
||
# Check alignment
|
||
# File: core/hakmem_tiny_superslab.c
|
||
assert((uintptr_t)ptr % 64 == 0); // 64B must be 64-byte aligned
|
||
```
|
||
|
||
**Expected Root Cause**: HEADER_CLASSIDX=1 storing wrong class index for 64B.
|
||
|
||
**3. Remove Debug Output**
|
||
```bash
|
||
# Find and remove/disable debug prints
|
||
grep -r "DEBUG.*Phase 7" core/
|
||
# Should be gated by #ifdef HAKMEM_DEBUG
|
||
```
|
||
|
||
---
|
||
|
||
### Phase 7 Feature Regression Test
|
||
|
||
**Before deploying any fix, verify**:
|
||
1. All single-threaded benchmarks still pass
|
||
2. Performance doesn't regress to Phase 6 levels
|
||
3. No new crashes introduced
|
||
|
||
**Test Suite**:
|
||
```bash
|
||
# Single-thread (must pass)
|
||
./larson_hakmem 1 1 128 1024 1 12345 1 # Expect: 2.76M ops/s
|
||
./bench_random_mixed_hakmem 100000 128 1234567 # Expect: 72M ops/s
|
||
|
||
# Multi-thread (currently fails, must fix)
|
||
./larson_hakmem 2 8 128 1024 1 12345 2 # Expect: no crash
|
||
./larson_hakmem 4 8 128 1024 1 12345 4 # Expect: no crash
|
||
|
||
# 64B (currently fails, must fix)
|
||
./bench_random_mixed_hakmem 100000 64 1234567 # Expect: no crash, ~70M ops/s
|
||
```
|
||
|
||
---
|
||
|
||
### Alternate Path: Revert Phase 7 Optimizations
|
||
|
||
If bugs are too complex to fix quickly:
|
||
|
||
```bash
|
||
# Revert to Phase 6
|
||
git checkout HEAD~3 # Or specific Phase 6 commit
|
||
|
||
# Verify Phase 6 still works
|
||
make clean && make larson_hakmem
|
||
./larson_hakmem 4 8 128 1024 1 12345 4 # Should work
|
||
|
||
# Incrementally re-apply Phase 7 optimizations
|
||
git cherry-pick <HEADER_CLASSIDX commit> # Test
|
||
git cherry-pick <AGGRESSIVE_INLINE commit> # Test
|
||
git cherry-pick <PREWARM_TLS commit> # Test
|
||
# Identify which commit introduced the bugs
|
||
```
|
||
|
||
---
|
||
|
||
## Build Information
|
||
|
||
**Compiler**: gcc with LTO
|
||
**Flags**:
|
||
```
|
||
-O3 -flto -march=native -mtune=native
|
||
-DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1
|
||
-DHAKMEM_TINY_FAST_PATH=1
|
||
-DHAKMEM_TINY_HEADER_CLASSIDX=1
|
||
-DHAKMEM_TINY_AGGRESSIVE_INLINE=1
|
||
-DHAKMEM_TINY_PREWARM_TLS=1
|
||
```
|
||
|
||
**Known Issues**:
|
||
- `bench_comprehensive` won't link (Phase 8 dependencies)
|
||
- `bench_fragment_stress` not tested (same issue)
|
||
- Debug output leaking into production builds
|
||
|
||
---
|
||
|
||
## Appendix: Full Benchmark Output Samples
|
||
|
||
### Larson 1T (Success)
|
||
```
|
||
=== LARSON 1T BASELINE ===
|
||
Throughput = 2758490 operations per second, relative time: 362.517s.
|
||
Done sleeping...
|
||
[ELO] Initialized 12 strategies (thresholds: 512KB-32MB)
|
||
[Batch] Initialized (threshold=8 MB, min_size=64 KB, bg=on)
|
||
[ACE] ACE disabled (HAKMEM_ACE_ENABLED=0)
|
||
```
|
||
|
||
### Larson 2T (Crash)
|
||
```
|
||
[DEBUG] Phase 7: tiny_alloc(1024) rejected, using malloc fallback
|
||
free(): invalid pointer
|
||
Exit code: 134
|
||
```
|
||
|
||
### 64B Crash
|
||
```
|
||
[SUPERSLAB_INIT] class 7 slab 0: usable_size=63488 block_size=1024 capacity=62
|
||
[SUPERSLAB_INIT] Expected: 63488 / 1024 = 62 blocks
|
||
Exit code: 135 (SIGBUS)
|
||
```
|
||
|
||
---
|
||
|
||
## Conclusion
|
||
|
||
**Phase 7 achieved exceptional single-threaded performance** (96-120% of System malloc), **but introduced critical bugs**:
|
||
|
||
1. **Multi-threaded crash**: Unusable with 2+ threads
|
||
2. **64B crash**: Unusable for common allocation size
|
||
3. **Incomplete implementation**: Debug fallbacks in production code
|
||
|
||
**Recommendation**: **DO NOT DEPLOY** to production. Revert to Phase 6 or fix critical bugs before proceeding to Phase 7 Tasks 6-9.
|
||
|
||
**Next Steps** (in priority order):
|
||
1. Fix multi-threaded crash (blocker for all production use)
|
||
2. Fix 64B bus error (blocker for most workloads)
|
||
3. Remove debug output (quality/performance issue)
|
||
4. Re-run comprehensive validation
|
||
5. Only then proceed to Phase 7 Tasks 6-9
|
||
|
||
---
|
||
|
||
**Generated**: 2025-11-08
|
||
**Test Duration**: ~2 hours
|
||
**Total Benchmarks**: 15 tests (10 sizes × random mixed, 3 × Larson, 3 × stability)
|
||
**Crashes Found**: 2 critical (Larson MT, 64B)
|
||
**Production Ready**: ❌ NO
|