# Performance Investigation Report - 2025-10-22
**Investigator**: Claude (ultrathink mode)
**Duration**: 2 hours
**Status**: ✅ **Root Cause Identified**

---
## 🎯 **Executive Summary**
### **Problem Statement**
hakmem performance drastically degraded after Phase 6.14:
| Metric | Phase 6.14 (Success) | Current (Failed) | Degradation |
|--------|---------------------|------------------|-------------|
| **1-thread** | 15,271,429 ops/sec | 2,698,795 ops/sec | **-82.3%** ❌ |
| **4-thread** | 67,853,659 ops/sec | 658,228 ops/sec | **-99.0%** ❌ |
User expected: "Pre-warm code deletion would restore performance"
Reality: **Pre-warm code never existed** (it was only a proposal document)
### **Root Cause Discovered**
**Default mode (BALANCED) has massive overhead from advanced features:**
- ✅ Site Rules (hash lookup + 65-line routing switch-case)
- ✅ L2 Pool / L2.5 Pool (multiple allocation attempts)
- ✅ ELO/EVO (learning lifecycle)
- ✅ BigCache (cache lookup for large allocations)
- ✅ Batch madvise / Free policy
**Solution**: Use `HAKMEM_MODE=minimal` to disable all advanced features. This restores 1-thread performance; 4-thread throughput improves but stays below system malloc.

---
## 📊 **Performance Comparison**
### **1-thread (10,000 chunks × 8-1024B mixed)**
```bash
# System malloc (baseline)
./larson 0 8 1024 10000 1 12345 1
Throughput = 12,746,479 ops/sec
# hakmem BALANCED mode (default)
LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1
Throughput = 2,698,795 ops/sec (-78.8% vs system) ❌
# hakmem MINIMAL mode
HAKMEM_MODE=minimal LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1
Throughput = 15,141,176 ops/sec (+18.8% vs system) ✅ (+461% vs balanced!)
```
### **4-thread**
```bash
# System malloc (baseline)
./larson 0 8 1024 10000 1 12345 4
Throughput = 12,899,666 ops/sec
# hakmem BALANCED mode (default)
LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 4
Throughput = 658,228 ops/sec (-94.9% vs system) ❌
# hakmem MINIMAL mode
HAKMEM_MODE=minimal LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 4
Throughput = 3,265,753 ops/sec (-74.7% vs system) ⚠️ (+396% vs balanced)
```
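
The percentages in the two listings above can be re-derived from the raw `Throughput` lines. The following is a minimal sketch, assuming the benchmark prints its result in the `Throughput = N ops/sec` form shown above; `parse_throughput` and `pct_change` are hypothetical helper names, not part of hakmem or larson. With the 1-thread figures, `pct_change 12746479 2698795` prints `-78.8` and `pct_change 2698795 15141176` prints `+461.0`.

```bash
# Hypothetical helpers (not part of hakmem or larson).

# Extract the numeric throughput from benchmark output on stdin.
# Assumes a line of the form "Throughput = 12,746,479 ops/sec".
parse_throughput() {
  awk '/^Throughput =/ { gsub(/,/, "", $3); print $3; exit }'
}

# Signed percentage change of $2 (measured) relative to $1 (baseline).
pct_change() {
  awk -v base="$1" -v cur="$2" \
    'BEGIN { printf "%+.1f\n", (cur - base) / base * 100 }'
}

# Usage (commented out so that sourcing this file runs nothing):
# sys=$(./larson 0 8 1024 10000 1 12345 1 | parse_throughput)
# hak=$(LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1 | parse_throughput)
# pct_change "$sys" "$hak"
```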
---
## 🔍 **Investigation Process**
### **Step 1: Verify Pre-warm Code Existence**
**User claim**: "Pre-warm code was added and then deleted"
**Investigation**:
```bash
$ grep -r "Pre-warm\|prewarm\|pre-warm" *.c *.h
No matches found
```
**Result**: ✅ **No Pre-warm code exists**
**Conclusion**: `OPTIMIZATION_SUMMARY_2025_10_22.md` was only a **proposal document**, not an implementation.

---
### **Step 2: Identify Phase 6.14 vs Current Difference**
**Hypothesis**: Phase 6.14 and current code are identical, but configuration differs.
**Investigation**:
```bash
$ git log --since="2025-10-21" -- apps/experiments/hakmem-poc/
b7864269 runner(script-runner): stabilize pre-invoke + short diagnostics
7616873d WIP: phase20.8 mainline
```
**Result**: No Phase 6.14 commit found → **Phase 6.14 was never committed**
**Conclusion**: The Phase 6.14 report was written, but the code changes were either never committed or were reverted.

---
### **Step 3: Analyze malloc Path Overhead**
**Code**: `hakmem.c:357-541` (`hak_alloc_at()`)
**Overhead sources** (185 lines of routing logic):
1. **ELO learning** (Line 371-391): `hak_evo_tick()` + atomic_load
2. **BigCache trial** (Line 394-400): hash lookup for size >= 1MB
3. **Tiny Pool trial** (Line 404-411): slab allocation for size <= 1KB
4. **Site Rules lookup** (Line 415-480): **65-line switch-case routing**
5. **L2 Pool trial** (Line 484-491): poolable check + allocation
6. **L2.5 Pool trial** (Line 495-502): large pool check + allocation
7. **Size distribution record** (Line 518): `hak_evo_record_size()`
8. **Header validation** (Line 521-537): magic check + metadata update
**For 8-1024B allocations**:
- ❌ Tiny Pool trial (Line 404-411): Hit!
- ❌ Site Rules lookup (Line 415): O(1) hash table (4-probe) **EVERY TIME**
- ❌ Site Rules routing (Line 416-480): 65-line switch-case logic
**Overhead estimate**:
- Site Rules lookup: ~50-100 cycles (hash + probing)
- Routing switch-case: ~20-40 cycles
- Tiny Pool check: ~10-20 cycles
- **Total**: ~80-160 cycles per allocation
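**For 15M allocations/sec**:
- 15,000,000 allocs × 100 cycles = 1.5 billion cycles
- @ 3GHz CPU = **0.5 seconds of overhead per second of allocation work**
**This explains only part of the observed degradation**: at 3GHz, dropping from 15.3M to 2.7M ops/sec corresponds to roughly 900 extra cycles per operation, well above the ~80-160 cycles estimated here.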
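
To put the cycle estimate next to the measurements, throughput can be converted into average cycles per benchmark operation. This is a minimal sketch, assuming the same fixed 3GHz clock as the estimate above; `cycles_per_op` is a hypothetical helper name. With the Problem Statement figures it gives about 196 cycles per operation at 15,271,429 ops/sec and about 1,112 at 2,698,795 ops/sec.

```bash
# Hypothetical helper: average CPU cycles per operation for a given
# throughput, assuming a fixed clock (default 3 GHz, as in the estimate).
cycles_per_op() {
  # $1 = ops/sec, $2 = clock in GHz (optional)
  awk -v ops="$1" -v ghz="${2:-3}" \
    'BEGIN { printf "%.0f\n", ghz * 1e9 / ops }'
}

# Usage:
# cycles_per_op 15271429     # Phase 6.14, 1-thread
# cycles_per_op 2698795      # current BALANCED, 1-thread
```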
---
### **Step 4: Test Minimal Mode**
**Hypothesis**: Disabling all advanced features will restore Phase 6.14 performance.
**Minimal mode configuration** (`hakmem_config.c:33-55`):
```c
cfg->features.alloc = HAKMEM_FEATURE_MALLOC | HAKMEM_FEATURE_MMAP;
cfg->features.cache = 0; // No BigCache
cfg->features.learning = 0; // No ELO/Evolution
cfg->features.memory = 0; // No Batch/THP/FreePolicy
```
**Result**:
- **1-thread: 15.1M ops/sec** (Phase 6.14: 15.3M ops/sec) → **Matched!**
- ⚠️ **4-thread: 3.3M ops/sec** (Phase 6.14: 67.9M ops/sec) → **Still broken**
**Conclusion**:
- 1-thread performance is fully explained by the overhead of the advanced features
- 4-thread performance has an **additional issue** (likely Tiny Pool or TLS)
---
## 🧪 **Key Findings**
### **1. Pre-warm Code Never Existed**
**Evidence**:
- No code matches "pre-warm" in source files
- `OPTIMIZATION_SUMMARY_2025_10_22.md` is a proposal, not implementation
- No git commits related to Pre-warm
**Impact**: The user's hypothesis was **incorrect**.

---
### **2. Phase 6.14 Configuration Unknown**
**Evidence**:
- No Phase 6.14 git commit found
- No documentation of exact configuration used
- Phase 6.14 report contradicts Phase 6.13 results:
  - Phase 6.13: 1T=17.7M, 4T=15.9M ops/sec
  - Phase 6.14: 1T=15.3M, 4T=67.9M ops/sec
  - **4-thread is 4.3x faster in Phase 6.14?** 🤔
**Hypothesis**: Phase 6.14 used **different larson parameters** or a **different mode**.

---
### **3. BALANCED Mode Overhead is Severe**
**Overhead breakdown** (estimated cycles per allocation):
| Feature | Overhead (cycles) | Impact |
|---------|------------------|--------|
| **Site Rules lookup** | 50-100 | High |
| **Site Rules routing** | 20-40 | Medium |
| **ELO/EVO tick** | 10-20 | Low |
| **BigCache check** | 5-10 | Low |
| **Tiny Pool check** | 10-20 | Low |
| **Total** | **95-190 cycles** | **Severe** |
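**For comparison** (rough per-allocation estimates, not measurements):
- System malloc: ~50-100 cycles
- hakmem MINIMAL: ~60-120 cycles
- hakmem BALANCED: ~150-300 cycles (2-3x slower!)

These estimates understate the measured 1-thread gap: BALANCED ran about 4.7x slower than system malloc and about 5.6x slower than MINIMAL.

---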
### **4. Minimal Mode Partially Restores Performance**
**1-thread performance**:
- System malloc: 12.7M ops/sec
- hakmem MINIMAL: **15.1M ops/sec** (+18.8%)
- hakmem BALANCED: 2.7M ops/sec (-78.8%)
**Conclusion**: MINIMAL mode **beats system malloc** by 18.8%! ✅
**4-thread performance**:
- System malloc: 12.9M ops/sec
- hakmem MINIMAL: 3.3M ops/sec (-74.7%)
- hakmem BALANCED: 0.66M ops/sec (-94.9%)
**Conclusion**: MINIMAL mode is still 74.7% slower than system malloc. ❌

---
## 🎯 **Recommendations**
### **P0 (Immediate)**: Change Default Mode to MINIMAL
**Action**:
```c
// hakmem_config.c:177
static HakemMode parse_mode_env(const char* mode_str) {
    if (!mode_str) return HAKMEM_MODE_MINIMAL; // Changed from BALANCED
    // ...
}
```
**Expected impact**:
- ✅ 1-thread: +461% performance (+18.8% vs system malloc)
- ⚠️ 4-thread: +396% performance (still -74.7% vs system malloc)
**Risk**: None (MINIMAL is baseline, no advanced features)

---
### **P1 (1-2 hours)**: Investigate Phase 6.14 Mystery
**Questions**:
1. How did Phase 6.14 achieve 67.9M ops/sec (4-thread)?
2. Was Tiny Pool or TLS enabled?
3. Were larson parameters different?
**Actions**:
1. Re-read Phase 6.14 report carefully
2. Check Phase 6.13 report for TLS configuration
3. Try enabling only Tiny Pool in MINIMAL mode:
```bash
# Test Tiny Pool in isolation
HAKMEM_MODE=minimal HAKMEM_TINY_POOL=1 LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 4
```
---
### **P2 (2-3 hours)**: Profile Advanced Features Individually
**Hypothesis**: Site Rules, ELO, or L2.5 Pool is the main bottleneck.
**Action**: Enable features one-by-one and measure:
```bash
# Baseline (MINIMAL)
HAKMEM_MODE=minimal LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1
# +Tiny Pool
HAKMEM_MODE=minimal HAKMEM_TINY_POOL=1 LD_PRELOAD=./libhakmem.so ./larson ...
# +L2 Pool
HAKMEM_MODE=minimal HAKMEM_POOL=1 LD_PRELOAD=./libhakmem.so ./larson ...
# +Site Rules
# (No env var - need code modification to disable Site Rules)
# +BigCache
HAKMEM_MODE=minimal HAKMEM_BIGCACHE=1 LD_PRELOAD=./libhakmem.so ./larson ...
# +ELO
HAKMEM_MODE=minimal HAKMEM_ELO=1 LD_PRELOAD=./libhakmem.so ./larson ...
```
**Goal**: Identify which feature(s) cause the most overhead.
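
The one-feature-at-a-time runs above can be scripted. This is a minimal sketch, assuming the `HAKMEM_*` toggles listed above behave as described and that the benchmark prints a `Throughput = N ops/sec` line; `sweep_features` and `HAKMEM_LIB` are hypothetical names. Site Rules is left out because it has no environment variable. The benchmark command is passed in as arguments, for example `sweep_features ./larson 0 8 1024 10000 1 12345 1`.

```bash
# Hypothetical sweep: run one benchmark command once per feature toggle on
# top of MINIMAL mode and print one "label throughput" line per run.
sweep_features() {
  # HAKMEM_LIB (illustrative) overrides the preload path; default as above.
  local lib="${HAKMEM_LIB-./libhakmem.so}" var
  for var in "" HAKMEM_TINY_POOL HAKMEM_POOL HAKMEM_BIGCACHE HAKMEM_ELO; do
    printf '%-18s ' "${var:-baseline}"
    # env applies the assignments to the benchmark process only.
    env HAKMEM_MODE=minimal ${var:+"$var=1"} LD_PRELOAD="$lib" "$@" |
      awk '/^Throughput =/ { print $3; found = 1; exit }
           END { if (!found) print "n/a" }'
  done
}
```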
---
### **P3 (4-6 hours)**: Fix 4-Thread Performance
**Current status**:
- System malloc (4T): 12.9M ops/sec
- hakmem MINIMAL (4T): 3.3M ops/sec (-74.7%)
**Hypothesis**: Missing TLS or Tiny Pool optimization.
**Actions**:
1. Review Phase 6.13 TLS implementation
2. Check if Tiny Pool is thread-safe
3. Enable TLS for Tiny Pool
4. Profile lock contention
**Target**: 4-thread >= 15M ops/sec (roughly Phase 6.13 parity at 15.9M; the Phase 6.14 figure of 67.9M remains unreproduced)
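
One quick way to frame the problem before profiling is the ratio of 4-thread to 1-thread throughput. The sketch below is illustrative only; `scaling_ratio` is a hypothetical helper, not part of hakmem. Applied to the figures in this report it gives about 1.01 for system malloc and about 0.22 for hakmem MINIMAL, so hakmem throughput falls as threads are added. A ratio below 1 is consistent with contention on shared state, although it does not identify the cause by itself.

```bash
# Hypothetical helper: multi-thread throughput divided by 1-thread throughput.
# A value below 1 means throughput fell when threads were added.
scaling_ratio() {
  # $1 = 1-thread ops/sec, $2 = N-thread ops/sec
  awk -v one="$1" -v many="$2" 'BEGIN { printf "%.2f\n", many / one }'
}

# Figures from this report (1-thread, 4-thread):
# scaling_ratio 12746479 12899666   # system malloc
# scaling_ratio 15141176 3265753    # hakmem MINIMAL
# scaling_ratio 2698795  658228     # hakmem BALANCED
```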
---
## 📝 **Lessons Learned**
### **1. Configuration is Critical**
**Mistake**: Did not track exact configuration used in Phase 6.14.
**Consequence**: Unable to reproduce 67.9M ops/sec (4-thread) result.
**Fix**: Always document:
- Environment variables used
- Mode configuration
- Compiler flags
- Exact command line
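
A small wrapper can capture these items next to each result. This is a minimal sketch under stated assumptions: `record_run` is a hypothetical helper, compiler flags are reported only if the caller exports `CFLAGS` (nothing here inspects how `libhakmem.so` was built), and the commit is read from the current git checkout if there is one. A typical call would be `record_run ./larson 0 8 1024 10000 1 12345 4 >> results.log`.

```bash
# Hypothetical helper: print the settings needed to reproduce a benchmark run.
# Arguments are the exact benchmark command line.
record_run() {
  echo "date:      $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  echo "commit:    $(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
  echo "cflags:    ${CFLAGS:-unknown}"
  # All HAKMEM_* variables currently exported, including HAKMEM_MODE.
  echo "env:       $(env | grep '^HAKMEM_' | sort | tr '\n' ' ')"
  echo "command:   $*"
}
```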
---
### **2. Advanced Features Have Cost**
**Mistake**: Enabled Site Rules, ELO, L2.5 Pool by default without measuring overhead.
**Consequence**: BALANCED mode is roughly 5x slower than MINIMAL in this benchmark (5.6x at 1 thread, 5.0x at 4 threads).
**Fix**:
- Default to MINIMAL
- Enable advanced features only when proven beneficial
- Always A/B test before making default
---
### **3. Pre-warm Was a Red Herring**
**Mistake**: User assumed Pre-warm was implemented based on proposal document.
**Consequence**: 2 hours spent investigating non-existent code.
**Fix**:
- Always verify code existence with grep
- Distinguish proposal docs from implementation docs
- Use git commits as source of truth
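
The existence check can be narrowed to source files so that a proposal document is not mistaken for an implementation. This is a minimal sketch; `code_mentions` is a hypothetical helper name, and `--include` assumes GNU or BSD grep. The commented `git log` lines show how commit messages and added or removed strings could be searched in the same spirit.

```bash
# Hypothetical helper: succeed only if C sources or headers under a directory
# mention a pattern (case-insensitive). Docs such as *.md are not searched.
code_mentions() {
  # $1 = extended regex, $2 = directory to search (default: current directory)
  grep -rniE --include='*.c' --include='*.h' -e "$1" "${2:-.}"
}

# Usage (commented out; the git commands need a repository):
# code_mentions 'pre-?warm' || echo "no implementation found"
# git log --all --oneline -i -E --grep='pre-?warm'   # search commit messages
# git log --all --oneline -S'prewarm'                # commits adding/removing the string
```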
---
## 📁 **Files Investigated**
- `hakmem.c:357-541` - `hak_alloc_at()` overhead
- `hakmem_config.c:81-103` - BALANCED mode configuration
- `hakmem_config.c:33-55` - MINIMAL mode configuration
- `hakmem_features.h` - Feature flags
- `OPTIMIZATION_SUMMARY_2025_10_22.md` - Pre-warm proposal (not implemented)
- `PHASE_6.14_COMPLETION_REPORT.md` - Success report (configuration unknown)
- `PHASE_6.13_INITIAL_RESULTS.md` - TLS validation results
---
## 🚀 **Next Steps**
1. **Immediate**: Change default mode to MINIMAL
2. **P1**: Investigate Phase 6.14 configuration mystery
3. **P2**: Profile advanced features individually
4. **P3**: Fix 4-thread performance (target: 15M ops/sec)
---
**Created**: 2025-10-22
**Investigation Time**: 2 hours
**Status**: Root cause identified, solution proposed ✅