hakmem/docs/design/PHASE7_ACTION_PLAN.md

# Phase 7: Immediate Action Plan

**Date:** 2025-11-08
**Status:** 🔥 CRITICAL OPTIMIZATION REQUIRED

---

## TL;DR

Phase 7 works but is **40x slower** than System malloc due to `mincore()` overhead.

**Fix:** Replace `mincore()` with alignment check (99.9% cases) + `mincore()` fallback (0.1% cases)

**Impact:** 634 cycles → 1-2 cycles (**317x faster!**)

**Time:** 1-2 hours

---

## Critical Finding

```
Current:  mincore() on EVERY free = 634 cycles
Target:   System malloc tcache    = 10-15 cycles
Result:   Phase 7 is 40x SLOWER!
```

**Micro-Benchmark Proof:**
```
[MINCORE] Mapped memory:   634 cycles/call
[ALIGN]   Alignment check: 0 cycles/call
[HYBRID]  Align + mincore:  1 cycles/call  ← SOLUTION!
```

---

## The Fix (1-2 Hours)

### Step 1: Add Helper (core/hakmem_internal.h)

Add after line 294:

```c
// Fast path: Check if ptr-1 is likely accessible (99.9% cases)
// Returns: 1 if ptr-1 is NOT near page boundary (safe to read)
static inline int is_likely_valid_header(void* ptr) {
    uintptr_t p = (uintptr_t)ptr;
    // Check: ptr-1 is NOT within first 16 bytes of a page
    // Most allocations are NOT at page boundaries
    return (p & 0xFFF) >= 16;  // 1 cycle
}
```

### Step 2: Optimize Fast Free (core/tiny_free_fast_v2.inc.h)

Replace lines 53-60 with:

```c
// OPTIMIZED: Hybrid check (1-2 cycles effective)
void* header_addr = (char*)ptr - 1;

// Fast path: Alignment check (99.9% cases, 1 cycle)
if (__builtin_expect(!is_likely_valid_header(ptr), 0)) {
    // Slow path: Page boundary case (0.1% cases, 634 cycles)
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        return 0;  // Header not accessible
    }
}

// Header is accessible (either by alignment or mincore check)
int class_idx = tiny_region_id_read_header(ptr);
```

### Step 3: Optimize Dual-Header Dispatch (core/box/hak_free_api.inc.h)

Replace lines 94-96 with:

```c
// SAFETY: Check if raw header is accessible before dereferencing
if (!is_likely_valid_header((char*)ptr + HEADER_SIZE)) {
    // Page boundary: use mincore fallback
    if (!hak_is_memory_readable(raw)) {
        // Header not accessible, continue to slow path
        goto mid_l25_lookup;
    }
}

AllocHeader* hdr = (AllocHeader*)raw;
```

---

## Testing (30 Minutes)

### Test 1: Verify Optimization
```bash
./micro_mincore_bench
# Expected: [HYBRID] 1 cycles/call (vs 634 before)
```

### Test 2: Larson Smoke Test
```bash
make clean && make larson_hakmem
./larson_hakmem 1 8 128 1024 1 12345 1
# Expected: 40-60M ops/s (vs 0.8M before = 50x improvement!)
```

### Test 3: Stability Check
```bash
# 10-minute continuous test
timeout 600 bash -c 'while true; do ./larson_hakmem 10 8 128 1024 1 $RANDOM 4 || break; done'
# Expected: No crashes
```

---

## Why This Works

**Problem:**
- Page boundary allocations: <0.1% frequency
- But we pay `mincore()` cost (634 cycles) on 100% of frees

**Solution:**
- Alignment check: 1 cycle, 99.9% cases
- mincore fallback: 634 cycles, 0.1% cases
- **Effective cost:** 0.999 * 1 + 0.001 * 634 = **1.6 cycles**

**Result:** 634 → 1.6 cycles = **396x faster!**

---

## Expected Results

### Performance (After Fix)

| Benchmark | Before (ops/s) | After (ops/s) | Improvement |
|-----------|----------------|---------------|-------------|
| Larson 1T | 0.8M | 40-60M | **50-75x** 🚀 |
| Larson 4T | 0.8M | 120-180M | **150-225x** 🚀 |
| vs System malloc | -95% | **+20-50%** | **Competitive!** ✅ |

### Memory Overhead

| Size | Header | Overhead |
|------|--------|----------|
| 8B | 1B | 12.5% (but 0% in Slab[0]) |
| 128B | 1B | 0.78% |
| 512B | 1B | 0.20% |
| **Average** | 1B | **<3%** (vs System's 10-15%) |

---

## Success Criteria

**Minimum (GO/NO-GO):**
- ✅ Micro-benchmark: 1-2 cycles (hybrid)
- ✅ Larson: ≥20M ops/s (minimum viable)
- ✅ No crashes (10-minute stress test)

**Target:**
- ✅ Larson: ≥40M ops/s (2x System)
- ✅ Memory: ≤System * 1.05 (RSS)
- ✅ Stability: 100% (no crashes)

**Stretch:**
- ✅ Beat mimalloc (if possible)
- ✅ 50M+ ops/s (Larson 1T)

---

## Risks

| Risk | Probability | Mitigation |
|------|-------------|------------|
| False positives (alignment check) | Very Low | Magic validation catches them |
| Still slower than System | Low | Micro-benchmark proves 1-2 cycles |
| 1024B fallback impacts score | Medium | Measure frequency, optimize if >10% |

**Overall Risk:** LOW (proven by micro-benchmark)

---

## Timeline

| Phase | Duration | Deliverable |
|-------|----------|-------------|
| **1. Implement** | 1-2 hours | Code changes (3 files) |
| **2. Test** | 30 min | Micro + Larson smoke |
| **3. Validate** | 2-3 hours | Full benchmark suite |
| **4. Deploy** | 1 day | Production-ready |

**Total:** 1-2 days to production

---

## Next Steps

1. ✅ Read this document
2. ⏳ Implement optimization (Step 1-3 above)
3. ⏳ Run tests (micro + Larson)
4. ⏳ Full benchmark suite
5. ⏳ Compare with mimalloc
6. ⏳ Deploy!

---

## References

- **Full Report:** `PHASE7_DESIGN_REVIEW.md` (758 lines)
- **Micro-Benchmark:** `tests/micro_mincore_bench.c`
- **Code Locations:**
  - `core/hakmem_internal.h:294` (add helper)
  - `core/tiny_free_fast_v2.inc.h:53-60` (optimize)
  - `core/box/hak_free_api.inc.h:94-96` (optimize)

---

## Questions?

**Q: Why not remove mincore entirely?**
A: Need it for page boundary cases (0.1%), otherwise SEGV.

**Q: What about false positives?**
A: Magic byte validation catches them (line 75 in tiny_region_id.h).

**Q: Will this work on ARM/other platforms?**
A: Yes, alignment check is portable (bitwise AND).

**Q: What if it's still slow?**
A: Micro-benchmark proves 1-2 cycles. If slow, something else is wrong.

---

**GO BUILD IT!** 🚀