# Phase 6.6 Complete Summary

**Date**: 2025-10-21
**Status**: ✅ **COMPLETE**

---

## 🎯 Goal & Achievement

**Goal**: Fix the ELO control flow bug that prevented batch madvise activation
**Result**: ✅ **Successfully fixed and verified** - Batch madvise is now working correctly

---

## 🐛 Problem

After Phase 6.5 (Learning Lifecycle) integration:
- 2MB allocations were using `MALLOC` instead of `MMAP`
- BigCache eviction called `free()` instead of `hak_batch_add()`
- Batch madvise statistics showed **0 blocks batched** (completely inactive)

---

## 🔍 Root Cause (Diagnosed by Gemini Pro)

**Control flow ordering bug** in `hakmem.c:hak_alloc_at()`:

1. The OLD policy decision (`infer_policy()`) executed FIRST → returned `POLICY_DEFAULT`
2. Allocation happened using the old policy → `alloc_malloc()` called
3. ELO strategy selection executed TOO LATE → its result never reached the allocation decision
4. ELO results were only used for BigCache eligibility, not for the allocation method

**Key insight**: "The right answer computed at the wrong time is the wrong answer"

---

## ✅ Fix Applied

**Modified**: `hakmem.c` (lines 645-720)

**Before** (WRONG):
```c
void* hak_alloc_at(size_t size, ...) {
    // 1. Old policy (WRONG!)
    policy = POLICY_DEFAULT;  // what infer_policy() returned

    // 2. Allocate (TOO EARLY!)
    ptr = allocate_with_policy(size, policy);  // Uses malloc

    // 3. ELO selection (TOO LATE!)
    strategy_id = hak_elo_select_strategy();  // Computed after the allocation, so it cannot affect it
    threshold = hak_elo_get_threshold(strategy_id);
}
```

**After** (CORRECT):
```c
void* hak_alloc_at(size_t size, ...) {
    // 1. ELO selection FIRST!
    strategy_id = hak_elo_select_strategy();
    threshold = hak_elo_get_threshold(strategy_id);

    // 2. BigCache check
    if (hak_bigcache_try_get(...)) return cached_ptr;

    // 3. Use ELO threshold to decide malloc vs mmap
    ptr = (size >= threshold) ? alloc_mmap(size) : alloc_malloc(size);
}
```
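
The reordering above can also be read as a data dependency: if the method choice takes the threshold as an input, it cannot run before ELO selection. The following is a minimal sketch of that idea, not the actual hakmem code. The `sketch_*` names, the `alloc_method_t` enum, and the threshold values are all assumptions made for illustration; only `MMAP` = 1 matches the debug log in this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; the real hakmem types are not shown in this summary. */
typedef enum { ALLOC_METHOD_MALLOC = 0, ALLOC_METHOD_MMAP = 1 } alloc_method_t;

/* Assumed candidate thresholds, one per ELO strategy (illustrative values only). */
static const size_t k_thresholds[] = { 512u * 1024, 1024u * 1024, 2048u * 1024 };
#define NUM_STRATEGIES (sizeof k_thresholds / sizeof k_thresholds[0])

/* Stand-in for hak_elo_get_threshold(): maps a strategy id to a byte threshold. */
size_t sketch_get_threshold(size_t strategy_id) {
    assert(strategy_id < NUM_STRATEGIES);
    return k_thresholds[strategy_id];
}

/* Fixed ordering: the threshold is looked up before the method is chosen,
 * so the ELO result always feeds the malloc-vs-mmap decision. */
alloc_method_t sketch_choose_method(size_t size, size_t strategy_id) {
    size_t threshold = sketch_get_threshold(strategy_id);
    return (size >= threshold) ? ALLOC_METHOD_MMAP : ALLOC_METHOD_MALLOC;
}
```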

**Result**: 2MB allocations now correctly use `mmap`, enabling batch madvise.

---

## 📊 Benchmark Results

**Configuration**: `bench_runner.sh --warmup 2 --runs 10` (200 total runs)

### VM Scenario (2MB allocations)

| Allocator | Median (ns) | vs Phase 6.4 | vs mimalloc |
|-----------|-------------|--------------|-------------|
| mimalloc | 19,964 | +12.6% | baseline |
| jemalloc | 26,241 | -3.0% | +31.4% |
| **hakmem-evolving** | **37,602** | **+2.6%** | **+88.3%** |
| hakmem-baseline | 40,282 | +9.1% | +101.7% |
| system | 59,995 | -4.4% | +200.4% |

### Analysis

1. ✅ **No regression**: The +2.6% difference vs Phase 6.4 is within measurement variance
2. ✅ **ELO working**: hakmem-evolving (37,602 ns) beats hakmem-baseline (40,282 ns)
3. ✅ **Batch madvise active**: Verified with debug logging
4. ⚠️ **Overhead gap**: Still roughly 2× slower than mimalloc (+88.3%) → Phase 6.7 investigation

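
The percentage columns are relative differences between medians. As a minimal sketch of that arithmetic, the helper below reproduces the two hakmem-evolving figures, taking 36,647 ns (the FINAL_RESULTS.md value) as the Phase 6.4 median. `percent_vs` is a hypothetical name used only for this example, not part of `bench_runner.sh` or hakmem.

```c
/* Relative difference of `value` against `baseline`, in percent.
 * Positive means `value` is slower (larger median) than `baseline`. */
double percent_vs(double value, double baseline) {
    return (value / baseline - 1.0) * 100.0;
}

/* hakmem-evolving vs Phase 6.4:  percent_vs(37602.0, 36647.0) is about +2.6  */
/* hakmem-evolving vs mimalloc:   percent_vs(37602.0, 19964.0) is about +88.3 */
```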

**Note**: README.md claimed "16,125 ns" for Phase 6.4, but FINAL_RESULTS.md shows 36,647 ns (the correct baseline for comparison).

---

## 🧪 Verification

### Batch Madvise Activation Confirmed

```
[DEBUG] BigCache eviction: method=1 (MMAP), size=2097152 ✅
[DEBUG] Calling hak_batch_add(raw=0x..., size=2097152) ✅

Batch Statistics:
  Total blocks added: 1 ✅
  Flush operations: 1 ✅
  Total bytes flushed: 2097152 ✅
```
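
For readers unfamiliar with the technique, the sketch below shows one plausible shape for batched madvise on Linux: evicted `mmap` blocks are queued, and a flush releases their pages in one pass. This is an assumption-laden illustration, not hakmem's implementation. The `sketch_batch_*` names, the capacity of 8, and the choice of `MADV_DONTNEED` are all hypothetical. It also hints at why the `MALLOC` path could never feed the batch: `madvise` needs a page-aligned address, which `mmap` returns and `malloc` does not promise. Queuing one 2 MiB block and flushing once yields the same counters as the log above (1 block, 1 flush, 2097152 bytes).

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical batch state; hak_batch_add() internals are not shown in this summary. */
#define SKETCH_BATCH_CAPACITY 8

typedef struct {
    void  *addr[SKETCH_BATCH_CAPACITY];
    size_t len[SKETCH_BATCH_CAPACITY];
    size_t count;          /* blocks currently queued */
    size_t blocks_added;   /* counters mirroring the statistics in the log */
    size_t flushes;
    size_t bytes_flushed;
} sketch_batch_t;

/* Release the physical pages of every queued block, then empty the queue. */
void sketch_batch_flush(sketch_batch_t *b) {
    if (b->count == 0) return;
    for (size_t i = 0; i < b->count; i++) {
        /* MADV_DONTNEED keeps the mapping but lets the kernel reclaim the pages. */
        if (madvise(b->addr[i], b->len[i], MADV_DONTNEED) == 0)
            b->bytes_flushed += b->len[i];
    }
    b->count = 0;
    b->flushes++;
}

/* Queue one page-aligned mmap'd block; flush first if the queue is full. */
void sketch_batch_add(sketch_batch_t *b, void *addr, size_t len) {
    if (b->count == SKETCH_BATCH_CAPACITY) sketch_batch_flush(b);
    b->addr[b->count] = addr;
    b->len[b->count] = len;
    b->count++;
    b->blocks_added++;
}
```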

---

## 🎓 Lessons Learned

### Design Mistakes

1. **Control flow ordering**: Strategy selection must happen BEFORE usage
2. **Dead code accumulation**: Old `infer_policy()` logic left behind
3. **Silent failures**: ELO results computed but not used

### Detection Challenges

1. **High-level symptoms**: "Batch not activating" didn't point to control flow
2. **Required detailed tracing**: Had to add debug logging to discover MALLOC usage
3. **Multi-layer architecture**: Problem spanned ELO, allocation, BigCache, batch

### AI Collaboration Success

- **Gemini Pro**: Root cause diagnosis from logs + code analysis
- **Claude**: Applied fix, tested, documented
- **Synergy**: Gemini saw the forest (control flow), Claude fixed the trees (code)

---

## 📝 Bonus Findings

### BigCache Size Check Bug (Already Fixed)

Gemini Task 5cfad9 diagnosed a heap-buffer-overflow bug:
- **Problem**: BigCache returned undersized blocks because it lacked an `actual_bytes >= requested_bytes` check
- **Impact**: The cold-churn benchmark (varying sizes) triggered a buffer overflow
- **Status**: ✅ **Already fixed** in a previous session
- **Code**: `hakmem_bigcache.c:151` has the size check, marked with a "Segfault fix!" comment

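
The kind of guard described above can be sketched as follows. The code at `hakmem_bigcache.c:151` is not reproduced in this summary, so the slot layout and the `sketch_cache_try_get` name are hypothetical. The point is only that a cached block smaller than the request must be reported as a miss rather than handed out.

```c
#include <stddef.h>

/* Hypothetical cache slot; the real BigCache layout is not shown in this summary. */
typedef struct {
    void  *ptr;           /* cached block, or NULL if the slot is empty */
    size_t actual_bytes;  /* usable size of the cached block */
} sketch_slot_t;

/* Return the cached block only if it can hold the request.
 * An undersized block is a miss: the slot is left untouched and NULL is returned. */
void *sketch_cache_try_get(sketch_slot_t *slot, size_t requested_bytes) {
    if (slot->ptr == NULL) return NULL;
    if (slot->actual_bytes < requested_bytes) return NULL;  /* the size check */
    void *p = slot->ptr;
    slot->ptr = NULL;          /* ownership moves to the caller */
    slot->actual_bytes = 0;
    return p;
}
```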

---

## 🚀 Next Steps (Phase 6.7)

### 1. Overhead Analysis

**Goal**: Identify why hakmem is roughly 2× slower than mimalloc (+88.3% for hakmem-evolving)

**Candidates** (from OVERHEAD_ANALYSIS_PLAN.md):
- P0: BigCache lookup (~50-100 ns)
- P0: ELO strategy selection (~100-200 ns)
- P1: mmap/munmap syscalls (~1,000-5,000 ns) ← **Main suspect**
- P1: Page faults (~100-500 ns per page)

**Strategy**:
1. Feature isolation testing (environment variables)
2. `perf` profiling (hotspot identification)
3. `strace` syscall counting

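
Feature isolation (step 1) could be wired up with a small toggle helper along these lines, so that each layer (BigCache, ELO, batch madvise) can be switched off in turn and the benchmark rerun to see how much of the gap it accounts for. This is a sketch under assumptions: the source does not name any environment variables, so the helper name and any variable passed to it (for example `SKETCH_BIGCACHE`) are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Returns 1 if the feature is enabled, 0 if it was switched off.
 * Convention assumed here: a feature is on by default and NAME=0 disables it. */
int sketch_feature_enabled(const char *env_name) {
    const char *value = getenv(env_name);
    if (value == NULL) return 1;        /* unset: keep the default */
    return strcmp(value, "0") != 0;     /* only the exact string "0" disables */
}
```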

### 2. Optimization Ideas

1. **FROZEN mode by default** (after learning) → -5% overhead
2. **BigCache direct indexing** (instead of linear search) → -5% overhead
3. **Pre-allocated arena** (Phase 7+) → -50% overhead target

**Realistic goal**: Reduce the gap vs mimalloc from +88% to +40% (Phase 7), then +20% (Phase 8)

**Limit**: Cannot beat mimalloc without a slab allocator (industry standard, 10+ years of optimization)

---

## 📁 Documentation Created

1. **PHASE_6.6_ELO_CONTROL_FLOW_FIX.md** (updated with benchmark results)
2. **OVERHEAD_ANALYSIS_PLAN.md** (Phase 6.7 preparation)
3. **PHASE_6.6_SUMMARY.md** (this file)
4. **GEMINI_BIGCACHE_ANALYSIS.md** (confirmed existing fix)

---

## 🏆 Final Status

**Phase 6.6**: ✅ **COMPLETE**

**Achievements**:
- ✅ ELO control flow bug fixed
- ✅ Batch madvise activation verified
- ✅ Performance parity with Phase 6.4 maintained (+2.6%, within measurement variance)
- ✅ Comprehensive documentation created
- ✅ Phase 6.7 roadmap prepared

**Code quality**:
- Modified files: 1 (`hakmem.c`)
- Lines changed: ~75 lines (reordering + cleanup)
- Test coverage: VM scenario verified (200 runs)

**Time investment**: ~6 hours (diagnosis + fix + benchmarking + documentation)

---

**Ready for Phase 6.7: Overhead Analysis & Optimization** 🚀