ChatGPT Pro Response: mmap vs malloc Strategy
- Date: 2025-10-21
- Response Time: ~2 minutes
- Model: GPT-5 (via codex)
- Status: ✅ Clear recommendation received
🎯 Final Recommendation: GO with Option A
Decision: Switch POLICY_LARGE_INFREQUENT to mmap with kill-switch guard.
✅ Why Option A
- Phase 6.3 requires mmap: `madvise` is a no-op on `malloc` blocks
- BigCache absorbs risk: 90% hit rate → only 10% of frees hit the OS (1538 → 150 faults)
- mimalloc's secret: "keep mapping, lazily reclaim" with MADV_FREE/DONTNEED
- Immediate unlock: Phase 6.3 starts delivering as soon as the switch lands
🔥 CRITICAL BUG DISCOVERED in Current Code
Problem in hakmem.c:543:

```c
case ALLOC_METHOD_MMAP:
    if (hdr->size >= BATCH_MIN_SIZE) {
        hak_batch_add(raw, hdr->size);  // Add to batch
    }
    munmap(raw, hdr->size);             // ← BUG! Immediately unmaps
    break;
```
Why this is wrong:
- Calls `munmap` immediately after adding to the batch
- Negates the Phase 6.3 benefit: the batch cannot coalesce/defray TLB work
- The TLB flush happens on `munmap`, not on `madvise`
✅ Correct Implementation
Free Path Logic (Choose ONE):
Option 1: Cache in BigCache

```c
// Try BigCache first
if (hak_bigcache_try_insert(ptr, size, site_id)) {
    // Cached! Do NOT munmap
    // Optionally: madvise(MADV_FREE) on insert or eviction
    return;
}
```
Option 2: Batch for delayed reclaim

```c
// BigCache full, add to batch
if (size >= BATCH_MIN_SIZE) {
    hak_batch_add(raw, size);
    // Do NOT munmap here!
    // munmap happens on batch flush (coalesced)
    return;
}
```
Option 3: Immediate unmap (last resort)

```c
// Cold eviction only
munmap(raw, size);
```
🎯 Implementation Plan
Phase 1: Minimal Change (1-line)
File: hakmem.c:357

```c
case POLICY_LARGE_INFREQUENT:
    return alloc_mmap(size);  // Changed from alloc_malloc
```

Guard with kill-switch:

```c
#ifdef HAKO_HAKMEM_LARGE_MMAP
    return alloc_mmap(size);
#else
    return alloc_malloc(size);  // Safe fallback
#endif
```

Env variable: `HAKO_HAKMEM_LARGE_MMAP=1` (default OFF)
Phase 2: Fix Free Path
File: hakmem.c:543-548

Current (WRONG):

```c
case ALLOC_METHOD_MMAP:
    if (hdr->size >= BATCH_MIN_SIZE) {
        hak_batch_add(raw, hdr->size);
    }
    munmap(raw, hdr->size);  // ← Remove this!
    break;
```
Correct:

```c
case ALLOC_METHOD_MMAP:
    // Try BigCache first
    if (hdr->size >= 1048576) {  // 1MB threshold
        if (hak_bigcache_try_insert(user_ptr, hdr->size, site_id)) {
            // Cached, skip munmap
            return;
        }
    }
    // BigCache full, add to batch
    if (hdr->size >= BATCH_MIN_SIZE) {
        hak_batch_add(raw, hdr->size);
        // munmap deferred to batch flush
        return;
    }
    // Small or batch disabled, immediate unmap
    munmap(raw, hdr->size);
    break;
```
Phase 3: Batch Flush Implementation
File: hakmem_batch.c

```c
void hak_batch_flush(void) {
    if (batch_count == 0) return;

    // Use MADV_FREE (preferred) or MADV_DONTNEED (fallback)
    for (size_t i = 0; i < batch_count; i++) {
#ifdef __linux__
        madvise(batch[i].ptr, batch[i].size, MADV_FREE);
#else
        madvise(batch[i].ptr, batch[i].size, MADV_DONTNEED);
#endif
    }

    // Optional: munmap on cold eviction
    // (Keep VA mapped for reuse in most cases)
    batch_count = 0;
}
```
📊 Expected Performance Gains
Metrics Prediction:
| Metric | Current (malloc) | With Option A (mmap) | Improvement |
|---|---|---|---|
| Page faults | 513 | 120-180 | 65-77% fewer |
| TLB shootdowns | ~150 | 3-8 | 95% fewer |
| Latency (VM) | 36,647 ns | 24,000-28,000 ns | 30-45% faster |
Success Criteria:
- ✅ Page faults: 120-180 (vs 513 current)
- ✅ Batch flushes: 3-8 per run
- ✅ Latency: 24-28 µs (vs 36.6 µs current)
Rollback Criteria:
- ❌ Page faults > 500 (BigCache failing)
- ❌ Latency regression (slower than 36,647 ns)
🛡️ Risk Mitigation
1. Kill-Switch Guard
```c
// Compile-time or runtime flag
HAKO_HAKMEM_LARGE_MMAP=1  // Enable mmap path
```
2. BigCache Hard Cap
- Limit: 64-256 MB (1-2× working set)
- LRU eviction to batched reclaim
3. Prefer MADV_FREE
- Lower TLB cost than MADV_DONTNEED
- Better performance on quick reuse
- Linux: `MADV_FREE`, macOS: `MADV_FREE_REUSABLE`
4. Observability (Add Counters)
- mmap allocation count
- BigCache hits/misses for mmap
- Batch flush count
- munmap count
- Sample `minflt`/`majflt` before/after
🧪 Test Plan
Step 1: Enable mmap with guard
```make
# Makefile
CFLAGS += -DHAKO_HAKMEM_LARGE_MMAP=1
```
Step 2: Run VM scenario benchmark
```sh
# 10 runs, measure:
make bench_vm RUNS=10
```
Step 3: Collect metrics
- BigCache hit% for mmap
- Page faults (expect 120-180)
- Batch flushes (expect 3-8)
- Latency (expect 24-28 µs)
Step 4: Validate or rollback
```make
# If page faults > 500 or latency regresses:
CFLAGS += -UHAKO_HAKMEM_LARGE_MMAP  # Rollback
```
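The validate-or-rollback decision can be scripted. A sketch of the gate, where the fault count would come from GNU time's `-v` output or the counters above (the bench target and log format are assumptions from this note, not verified against the harness):

```shell
#!/bin/sh
# Rollback gate from the criteria above: >500 page faults means
# BigCache is failing and the mmap path should be rolled back.
check_faults() {
    if [ "$1" -gt 500 ]; then
        echo "ROLLBACK"
    else
        echo "OK"
    fi
}

# Example wiring (GNU time reports fault counts on stderr):
#   /usr/bin/time -v make bench_vm RUNS=10 2> time.log
#   faults=$(awk -F': ' '/Minor.*page faults/ { print $2 }' time.log)
#   check_faults "$faults"
check_faults 150   # prints OK
```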
🎯 BigCache + mmap Compatibility
ChatGPT Pro confirms: SAFE
- ✅ mmap blocks can be cached (same as malloc semantics)
- ✅ Content unspecified (matches malloc)
- ✅ Reusable after `MADV_FREE`

Required changes:
- Allocation: `hak_bigcache_try_get` serves mmap blocks
- Free: try BigCache insert first, skip `munmap` if cached
- Header: keep `ALLOC_METHOD_MMAP` on cached blocks
🏆 mimalloc's Secret Revealed
How mimalloc wins on VM scenario:
- Keep VA mapped: don't `munmap` immediately
- Lazy reclaim: use `MADV_FREE`/`MADV_FREE_REUSABLE`
- Batch TLB work: coalesce reclamation
- Per-segment reuse: cache large blocks

Our Option A emulates this: BigCache + mmap + MADV_FREE + batching
📋 Action Items
Immediate (Phase 1):
- Add kill-switch guard (`HAKO_HAKMEM_LARGE_MMAP`)
- Change line 357: `return alloc_mmap(size);`
- Test compile

Critical (Phase 2):
- Fix free path (remove immediate `munmap`)
- Implement BigCache insert check
- Defer `munmap` to batch flush

Optimization (Phase 3):
- Switch to `MADV_FREE` (Linux)
- Add observability counters
- Implement BigCache hard cap (64-256 MB)
Validation:
- Run VM scenario (10 runs)
- Verify page faults < 200
- Verify latency 24-28 µs
- Rollback if metrics fail
🎯 Alternative: Option C (ELO)
If Option A fails:
- Extend ELO action space: malloc vs mmap dimension
- Doubles ELO arms (12 → 24 strategies)
- Slower convergence, more complex
ChatGPT Pro says: "Overkill right now. Ship Option A with kill-switch first."
📊 Summary
Decision: ✅ GO with Option A (mmap + kill-switch)
Critical Fix: Remove immediate munmap in free path
Expected Gain: 30-45% improvement on VM scenario (36.6 → 24-28 µs)
Next Steps:
- Implement Phase 1 (1-line change + guard)
- Fix Phase 2 (free path)
- Run VM benchmark
- Validate or rollback
Confidence: HIGH (based on BigCache's 90% hit rate + mimalloc analysis)
- Generated: 2025-10-21 by ChatGPT-5 (via codex exec)
- Status: Ready for implementation
- Priority: P0 (unlocks Phase 6.3)