# Phase 7: Immediate Action Plan **Date:** 2025-11-08 **Status:** πŸ”₯ CRITICAL OPTIMIZATION REQUIRED --- ## TL;DR Phase 7 works but is **40x slower** than System malloc due to `mincore()` overhead. **Fix:** Replace `mincore()` with alignment check (99.9% cases) + `mincore()` fallback (0.1% cases) **Impact:** 634 cycles β†’ 1-2 cycles (**317x faster!**) **Time:** 1-2 hours --- ## Critical Finding ``` Current: mincore() on EVERY free = 634 cycles Target: System malloc tcache = 10-15 cycles Result: Phase 7 is 40x SLOWER! ``` **Micro-Benchmark Proof:** ``` [MINCORE] Mapped memory: 634 cycles/call [ALIGN] Alignment check: 0 cycles/call [HYBRID] Align + mincore: 1 cycles/call ← SOLUTION! ``` --- ## The Fix (1-2 Hours) ### Step 1: Add Helper (core/hakmem_internal.h) Add after line 294: ```c // Fast path: Check if ptr-1 is likely accessible (99.9% cases) // Returns: 1 if ptr-1 is NOT near page boundary (safe to read) static inline int is_likely_valid_header(void* ptr) { uintptr_t p = (uintptr_t)ptr; // Check: ptr-1 is NOT within first 16 bytes of a page // Most allocations are NOT at page boundaries return (p & 0xFFF) >= 16; // 1 cycle } ``` ### Step 2: Optimize Fast Free (core/tiny_free_fast_v2.inc.h) Replace lines 53-60 with: ```c // OPTIMIZED: Hybrid check (1-2 cycles effective) void* header_addr = (char*)ptr - 1; // Fast path: Alignment check (99.9% cases, 1 cycle) if (__builtin_expect(!is_likely_valid_header(ptr), 0)) { // Slow path: Page boundary case (0.1% cases, 634 cycles) extern int hak_is_memory_readable(void* addr); if (!hak_is_memory_readable(header_addr)) { return 0; // Header not accessible } } // Header is accessible (either by alignment or mincore check) int class_idx = tiny_region_id_read_header(ptr); ``` ### Step 3: Optimize Dual-Header Dispatch (core/box/hak_free_api.inc.h) Replace lines 94-96 with: ```c // SAFETY: Check if raw header is accessible before dereferencing if (!is_likely_valid_header((char*)ptr + HEADER_SIZE)) { // Page boundary: use mincore fallback if (!hak_is_memory_readable(raw)) { // Header not accessible, continue to slow path goto mid_l25_lookup; } } AllocHeader* hdr = (AllocHeader*)raw; ``` --- ## Testing (30 Minutes) ### Test 1: Verify Optimization ```bash ./micro_mincore_bench # Expected: [HYBRID] 1 cycles/call (vs 634 before) ``` ### Test 2: Larson Smoke Test ```bash make clean && make larson_hakmem ./larson_hakmem 1 8 128 1024 1 12345 1 # Expected: 40-60M ops/s (vs 0.8M before = 50x improvement!) ``` ### Test 3: Stability Check ```bash # 10-minute continuous test timeout 600 bash -c 'while true; do ./larson_hakmem 10 8 128 1024 1 $RANDOM 4 || break; done' # Expected: No crashes ``` --- ## Why This Works **Problem:** - Page boundary allocations: <0.1% frequency - But we pay `mincore()` cost (634 cycles) on 100% of frees **Solution:** - Alignment check: 1 cycle, 99.9% cases - mincore fallback: 634 cycles, 0.1% cases - **Effective cost:** 0.999 * 1 + 0.001 * 634 = **1.6 cycles** **Result:** 634 β†’ 1.6 cycles = **396x faster!** --- ## Expected Results ### Performance (After Fix) | Benchmark | Before (ops/s) | After (ops/s) | Improvement | |-----------|----------------|---------------|-------------| | Larson 1T | 0.8M | 40-60M | **50-75x** πŸš€ | | Larson 4T | 0.8M | 120-180M | **150-225x** πŸš€ | | vs System malloc | -95% | **+20-50%** | **Competitive!** βœ… | ### Memory Overhead | Size | Header | Overhead | |------|--------|----------| | 8B | 1B | 12.5% (but 0% in Slab[0]) | | 128B | 1B | 0.78% | | 512B | 1B | 0.20% | | **Average** | 1B | **<3%** (vs System's 10-15%) | --- ## Success Criteria **Minimum (GO/NO-GO):** - βœ… Micro-benchmark: 1-2 cycles (hybrid) - βœ… Larson: β‰₯20M ops/s (minimum viable) - βœ… No crashes (10-minute stress test) **Target:** - βœ… Larson: β‰₯40M ops/s (2x System) - βœ… Memory: ≀System * 1.05 (RSS) - βœ… Stability: 100% (no crashes) **Stretch:** - βœ… Beat mimalloc (if possible) - βœ… 50M+ ops/s (Larson 1T) --- ## Risks | Risk | Probability | Mitigation | |------|-------------|------------| | False positives (alignment check) | Very Low | Magic validation catches them | | Still slower than System | Low | Micro-benchmark proves 1-2 cycles | | 1024B fallback impacts score | Medium | Measure frequency, optimize if >10% | **Overall Risk:** LOW (proven by micro-benchmark) --- ## Timeline | Phase | Duration | Deliverable | |-------|----------|-------------| | **1. Implement** | 1-2 hours | Code changes (3 files) | | **2. Test** | 30 min | Micro + Larson smoke | | **3. Validate** | 2-3 hours | Full benchmark suite | | **4. Deploy** | 1 day | Production-ready | **Total:** 1-2 days to production --- ## Next Steps 1. βœ… Read this document 2. ⏳ Implement optimization (Step 1-3 above) 3. ⏳ Run tests (micro + Larson) 4. ⏳ Full benchmark suite 5. ⏳ Compare with mimalloc 6. ⏳ Deploy! --- ## References - **Full Report:** `PHASE7_DESIGN_REVIEW.md` (758 lines) - **Micro-Benchmark:** `tests/micro_mincore_bench.c` - **Code Locations:** - `core/hakmem_internal.h:294` (add helper) - `core/tiny_free_fast_v2.inc.h:53-60` (optimize) - `core/box/hak_free_api.inc.h:94-96` (optimize) --- ## Questions? **Q: Why not remove mincore entirely?** A: Need it for page boundary cases (0.1%), otherwise SEGV. **Q: What about false positives?** A: Magic byte validation catches them (line 75 in tiny_region_id.h). **Q: Will this work on ARM/other platforms?** A: Yes, alignment check is portable (bitwise AND). **Q: What if it's still slow?** A: Micro-benchmark proves 1-2 cycles. If slow, something else is wrong. --- **GO BUILD IT!** πŸš€