**P0-2: Lock Instrumentation** (✅ Complete) - Add atomic counters to g_shared_pool.alloc_lock - Track acquire_slab() vs release_slab() separately - Environment: HAKMEM_SHARED_POOL_LOCK_STATS=1 - Report stats at shutdown via destructor **P0-3: Analysis Results** (✅ Complete) - 100% contention from acquire_slab() (allocation path) - 0% from release_slab() (effectively lock-free!) - Lock rate: 0.206% (TLS hit rate: 99.8%) - Scaling: 4T→8T = 1.44x (sublinear, lock bottleneck) **Key Findings**: - 4T: 330 lock acquisitions / 160K ops - 8T: 658 lock acquisitions / 320K ops - futex: 68% of syscall time (from previous strace) - Bottleneck: acquire_slab 3-stage logic under mutex **Report**: MID_LARGE_LOCK_CONTENTION_ANALYSIS.md (2.3KB) - Detailed breakdown by code path - Root cause analysis (TLS miss → shared pool lock) - Lock-free implementation roadmap (P0-4/P0-5) - Expected impact: +50-73% throughput **Files Modified**: - core/hakmem_shared_pool.c: +60 lines instrumentation - Atomic counters: g_lock_acquire/release_slab_count - lock_stats_init() + lock_stats_report() - Per-path tracking in acquire/release functions **Next Steps**: - P0-4: Lock-free per-class free lists (Stage 1: LIFO stack CAS) - P0-5: Lock-free slot claiming (Stage 2: atomic bitmap) - P0-6: A/B comparison (target: +50-73%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
5.1 KiB
Mid-Large Mincore A/B Testing - Quick Summary
Date: 2025-11-14
Status: ✅ COMPLETE - Investigation finished, recommendation provided
Report: MID_LARGE_MINCORE_INVESTIGATION_REPORT.md
Quick Answer: Should We Disable mincore?
NO - mincore is Essential for Safety ⚠️
| Configuration | Throughput | Exit Code | Production Ready |
|---|---|---|---|
| mincore ON (default) | 1.04M ops/s | 0 (success) | ✅ Yes |
| mincore OFF | SEGFAULT | 139 (SIGSEGV) | ❌ No |
Key Findings
1. mincore is NOT the Bottleneck
Evidence:
strace -e trace=mincore -c ./bench_mid_large_mt_hakmem 2 200000 2048 42
# Result: Only 4 mincore calls (200K iterations)
Comparison:
- Tiny allocator: 1,574 mincore calls (200K iters) - 5.51% time
- Mid-Large allocator: 4 mincore calls (200K iters) - 0.1% time
Conclusion: mincore overhead is negligible for Mid-Large allocator.
2. Real Bottleneck: futex (68% Syscall Time)
perf Analysis:
| Syscall | % Time | usec/call | Calls | Root Cause |
|---|---|---|---|---|
| futex | 68.18% | 1,970 | 36 | Shared pool lock contention |
| munmap | 11.60% | 7 | 1,665 | SuperSlab deallocation |
| mmap | 7.28% | 4 | 1,692 | SuperSlab allocation |
| madvise | 6.85% | 4 | 1,591 | Unknown source |
| mincore | 5.51% | 3 | 1,574 | AllocHeader safety checks |
Recommendation: Fix futex contention (68%) before optimizing mincore (5%).
3. Why mincore is Essential
Without mincore:
- Headerless Tiny C7 (1KB): Blind read of
ptr - HEADER_SIZE→ SEGFAULT if SuperSlab unmapped - LD_PRELOAD mixed allocations: Cannot detect libc allocations → double-free or wrong-allocator crashes
- Double-free protection: Cannot detect already-freed memory → corruption
With mincore:
- Safe fallback to
__libc_free()when memory unmapped - Correct routing for headerless Tiny allocations
- Mixed HAKMEM/libc environment support
Trade-off: +5.51% overhead (Tiny) / +0.1% overhead (Mid-Large) for safety.
Implementation Summary
Code Changes (Available for Future Use)
Files Modified:
core/box/hak_free_api.inc.h- Added#ifdef HAKMEM_DISABLE_MINCORE_CHECKguardMakefile- AddedDISABLE_MINCOREflag (default: 0)build.sh- Added ENV support for A/B testing
Usage (NOT RECOMMENDED):
# Build with mincore disabled (will SEGFAULT!)
DISABLE_MINCORE=1 POOL_TLS_PHASE1=1 POOL_TLS_BIND_BOX=1 ./build.sh bench_mid_large_mt_hakmem
# Build with mincore enabled (default, safe)
POOL_TLS_PHASE1=1 POOL_TLS_BIND_BOX=1 ./build.sh bench_mid_large_mt_hakmem
Recommended Next Steps
Priority 1: Fix futex Contention (P0)
Impact: -68% syscall overhead → +73% throughput (1.04M → 1.8M ops/s)
Options:
- Lock-free Stage 1 free path (per-class atomic LIFO)
- Reduce shared pool lock scope
- Batch acquire (multiple slabs per lock)
Effort: Medium (2-3 days)
Priority 2: Investigate Pool TLS Routing (P1)
Impact: Unknown (requires debugging)
Mystery: Mid-Large benchmark (8-34KB) should use Pool TLS (8-52KB range), but frees fall through to mincore path.
Next Steps:
- Enable debug build
- Check
[POOL_TLS_REJECT]logs - Add free path routing logs
- Verify header writes/reads
Effort: Low (1 day)
Priority 3: Optimize mincore (P2 - Low Priority)
Impact: -5.51% syscall overhead → +5% throughput (Tiny only)
Options:
- Expand TLS page cache (2 → 16 entries)
- Use registry-based safety (replace mincore)
- Bloom filter for unmapped pages
Effort: Low (1-2 days)
Note: Only pursue if futex optimization doesn't close gap with System malloc.
Performance Targets
Short-Term (1-2 weeks)
- Fix futex → 1.8M ops/s (+73% vs baseline)
- Fix Pool TLS routing → 2.5M ops/s (+39% vs futex fix)
Medium-Term (1-2 months)
- Optimize mincore → 3.0M ops/s (+20% vs routing fix)
- Increase Pool TLS range (64KB) → 4.0M ops/s (+33% vs mincore)
Long-Term Goal
- 5.4M ops/s (match System malloc)
- 24.2M ops/s (match mimalloc) - requires architectural changes
Conclusion
Do NOT disable mincore - the A/B test confirmed it's:
- Not the bottleneck (only 4 calls, 0.1% time)
- Essential for safety (SEGFAULT without it)
- Low priority (fix futex first - 68% vs 5.51% impact)
Focus Instead On:
- futex contention (68% syscall time)
- Pool TLS routing mystery
- SuperSlab allocation churn
Expected Impact:
- futex fix alone: +73% throughput (1.04M → 1.8M ops/s)
- All optimizations: +285% throughput (1.04M → 4.0M ops/s)
A/B Testing Framework: ✅ Implemented and available
Recommendation: Keep mincore enabled (default: DISABLE_MINCORE=0)
Next Action: Fix futex contention (Priority P0)
Report: MID_LARGE_MINCORE_INVESTIGATION_REPORT.md (full details)
Date: 2025-11-14
Tool: Claude Code