Files
hakmem/PHASE7_ACTION_PLAN.md
Moe Charm (CI) 4983352812 Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%)
## Summary
Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug.
Result: 2-3x performance improvement across all benchmarks.

## Performance Results
- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero 

## Changes

### 1. Hybrid mincore Optimization (317-634x faster)
**Problem**: `hak_is_memory_readable()` calls mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc

**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore
- Result: 634 → 1-2 cycles effective (99.6% skip mincore)

**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added

### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)
**Problem**: Macro definition order prevented Phase 7 header write
- hakmem_tiny.c:130 defined legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had `#ifndef` guard → skipped!
- Result: Headers NEVER written → All frees failed → Slow path

**Solution**: Force Phase 7 macro to override legacy
- hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine

### 3. Magic Byte Fix
**Problem**: Release builds don't write magic byte, but free ALWAYS checks it
- Result: All headers marked as invalid

**Solution**: ALWAYS write magic byte (same 1-byte write, no overhead)
- tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard

## Technical Details

### Hybrid mincore Effectiveness
| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (Step 1) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6 cycles = **317-396x faster!**

### Macro Fix Impact
**Before**: HAK_RET_ALLOC(cls, ptr) → return (ptr)  // No header write
**After**: HAK_RET_ALLOC(cls, ptr) → return tiny_region_id_write_header((ptr), (cls))

**Result**: Headers properly written → Fast path works → +194-333% performance

## Investigation
Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)

Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)

## Related
- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 04:50:41 +09:00

5.6 KiB

Phase 7: Immediate Action Plan

Date: 2025-11-08 Status: 🔥 CRITICAL OPTIMIZATION REQUIRED


TL;DR

Phase 7 works but is 40x slower than System malloc due to mincore() overhead.

Fix: Replace mincore() with alignment check (99.9% cases) + mincore() fallback (0.1% cases)

Impact: 634 cycles → 1-2 cycles (317x faster!)

Time: 1-2 hours


Critical Finding

Current:  mincore() on EVERY free = 634 cycles
Target:   System malloc tcache    = 10-15 cycles
Result:   Phase 7 is 40x SLOWER!

Micro-Benchmark Proof:

[MINCORE] Mapped memory:   634 cycles/call
[ALIGN]   Alignment check: 0 cycles/call
[HYBRID]  Align + mincore:  1 cycles/call  ← SOLUTION!

The Fix (1-2 Hours)

Step 1: Add Helper (core/hakmem_internal.h)

Add after line 294:

// Fast path: Check if ptr-1 is likely accessible (99.9% cases)
// Returns: 1 if ptr-1 is NOT near page boundary (safe to read)
static inline int is_likely_valid_header(void* ptr) {
    uintptr_t p = (uintptr_t)ptr;
    // Check: ptr-1 is NOT within first 16 bytes of a page
    // Most allocations are NOT at page boundaries
    return (p & 0xFFF) >= 16;  // 1 cycle
}

Step 2: Optimize Fast Free (core/tiny_free_fast_v2.inc.h)

Replace lines 53-60 with:

// OPTIMIZED: Hybrid check (1-2 cycles effective)
void* header_addr = (char*)ptr - 1;

// Fast path: Alignment check (99.9% cases, 1 cycle)
if (__builtin_expect(!is_likely_valid_header(ptr), 0)) {
    // Slow path: Page boundary case (0.1% cases, 634 cycles)
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        return 0;  // Header not accessible
    }
}

// Header is accessible (either by alignment or mincore check)
int class_idx = tiny_region_id_read_header(ptr);

Step 3: Optimize Dual-Header Dispatch (core/box/hak_free_api.inc.h)

Replace lines 94-96 with:

// SAFETY: Check if raw header is accessible before dereferencing
if (!is_likely_valid_header((char*)ptr + HEADER_SIZE)) {
    // Page boundary: use mincore fallback
    if (!hak_is_memory_readable(raw)) {
        // Header not accessible, continue to slow path
        goto mid_l25_lookup;
    }
}

AllocHeader* hdr = (AllocHeader*)raw;

Testing (30 Minutes)

Test 1: Verify Optimization

./micro_mincore_bench
# Expected: [HYBRID] 1 cycles/call (vs 634 before)

Test 2: Larson Smoke Test

make clean && make larson_hakmem
./larson_hakmem 1 8 128 1024 1 12345 1
# Expected: 40-60M ops/s (vs 0.8M before = 50x improvement!)

Test 3: Stability Check

# 10-minute continuous test
timeout 600 bash -c 'while true; do ./larson_hakmem 10 8 128 1024 1 $RANDOM 4 || break; done'
# Expected: No crashes

Why This Works

Problem:

  • Page boundary allocations: <0.1% frequency
  • But we pay mincore() cost (634 cycles) on 100% of frees

Solution:

  • Alignment check: 1 cycle, 99.9% cases
  • mincore fallback: 634 cycles, 0.1% cases
  • Effective cost: 0.999 * 1 + 0.001 * 634 = 1.6 cycles

Result: 634 → 1.6 cycles = 396x faster!


Expected Results

Performance (After Fix)

Benchmark Before (ops/s) After (ops/s) Improvement
Larson 1T 0.8M 40-60M 50-75x 🚀
Larson 4T 0.8M 120-180M 150-225x 🚀
vs System malloc -95% +20-50% Competitive!

Memory Overhead

Size Header Overhead
8B 1B 12.5% (but 0% in Slab[0])
128B 1B 0.78%
512B 1B 0.20%
Average 1B <3% (vs System's 10-15%)

Success Criteria

Minimum (GO/NO-GO):

  • Micro-benchmark: 1-2 cycles (hybrid)
  • Larson: ≥20M ops/s (minimum viable)
  • No crashes (10-minute stress test)

Target:

  • Larson: ≥40M ops/s (2x System)
  • Memory: ≤System * 1.05 (RSS)
  • Stability: 100% (no crashes)

Stretch:

  • Beat mimalloc (if possible)
  • 50M+ ops/s (Larson 1T)

Risks

Risk Probability Mitigation
False positives (alignment check) Very Low Magic validation catches them
Still slower than System Low Micro-benchmark proves 1-2 cycles
1024B fallback impacts score Medium Measure frequency, optimize if >10%

Overall Risk: LOW (proven by micro-benchmark)


Timeline

Phase Duration Deliverable
1. Implement 1-2 hours Code changes (3 files)
2. Test 30 min Micro + Larson smoke
3. Validate 2-3 hours Full benchmark suite
4. Deploy 1 day Production-ready

Total: 1-2 days to production


Next Steps

  1. Read this document
  2. Implement optimization (Step 1-3 above)
  3. Run tests (micro + Larson)
  4. Full benchmark suite
  5. Compare with mimalloc
  6. Deploy!

References

  • Full Report: PHASE7_DESIGN_REVIEW.md (758 lines)
  • Micro-Benchmark: tests/micro_mincore_bench.c
  • Code Locations:
    • core/hakmem_internal.h:294 (add helper)
    • core/tiny_free_fast_v2.inc.h:53-60 (optimize)
    • core/box/hak_free_api.inc.h:94-96 (optimize)

Questions?

Q: Why not remove mincore entirely? A: Need it for page boundary cases (0.1%), otherwise SEGV.

Q: What about false positives? A: Magic byte validation catches them (line 75 in tiny_region_id.h).

Q: Will this work on ARM/other platforms? A: Yes, alignment check is portable (bitwise AND).

Q: What if it's still slow? A: Micro-benchmark proves 1-2 cycles. If slow, something else is wrong.


GO BUILD IT! 🚀