Perf Baseline: Front-Direct Mode (Post-SEGV Fix)
Date: 2025-11-14
Commit: 696aa7c0b (SEGV fix with mincore() safety checks)
Test: bench_random_mixed_hakmem 200000 4096 1234567
Mode: HAKMEM_TINY_FRONT_DIRECT=1
📊 Performance Summary
Throughput
HAKMEM (Front-Direct): 563K ops/s (0.355s for 200K iterations)
System malloc: ~90M ops/s (estimated)
Gap: 160x slower (0.63% of target)
Regression Alert: Phase 11 achieved 9.38M ops/s (before the SEGV fix).
Current: 563K ops/s → -94% regression (mincore() overhead)
🔥 Hotspot Analysis
Syscall Statistics (200K iterations)
| Syscall | Count | Time (s) | % Time | Impact |
|---|---|---|---|---|
| munmap | 3,214 | 0.0258 | 47.4% | ❌ CRITICAL |
| mmap | 3,241 | 0.0149 | 27.4% | ❌ CRITICAL |
| madvise | 1,591 | 0.0072 | 13.3% | ⚠️ High |
| mincore | 1,591 | 0.0060 | 11.0% | ⚠️ High (SEGV fix overhead) |
| Other | 143 | 0.0006 | 1.0% | ✓ OK |
| Total | 9,780 | 0.0544 | 100% | |
Key Findings:
- mmap/munmap churn: 6,455 calls (74.8% of syscall time)
  - Root cause: SuperSlab aggressive deallocation
  - Expected: ~100-200 calls (mimalloc-style pooling)
  - Gap: 32-65x excessive syscalls
- mincore() overhead: 1,591 calls (11.0% of syscall time)
  - Added by SEGV fix (commit 696aa7c0b)
  - Called on EVERY unknown pointer in free wrapper
  - Optimization needed: Cache result, skip for known patterns
📈 Hardware Performance Counters
| Counter | Value | Notes |
|---|---|---|
| Cycles | 826M | |
| Instructions | 847M | |
| IPC | 1.03 | ⚠️ Low (target: 2-4) |
| Branches | 177M | |
| Branch misses | 12.1M | 6.82% miss rate (✓ OK) |
| Cache refs | 53.3M | |
| Cache misses | 8.7M | 16.32% miss rate (⚠️ High) |
| Page faults | 59,659 | ⚠️ High (0.30 per iteration) |
Performance Issues:
- Low IPC (1.03): Memory stalls dominating (cache misses, TLB pressure)
- High cache miss rate (16.32%): Pointer chasing, poor locality
- Page faults (59K): mmap/munmap churn causing TLB thrashing
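The derived figures in the counter table follow directly from the raw values. A minimal sketch of that arithmetic is shown here; the function names are illustrative and not part of HAKMEM, and the inputs are the rounded values from the table, so results can differ in the last digit (for example the branch miss rate).

```c
#include <assert.h>
#include <stdio.h>

/* Safe division: returns 0 when the denominator is not positive. */
static double ratio(double num, double den) {
    return den > 0.0 ? num / den : 0.0;
}

/* Instructions retired per cycle. */
double ipc(double instructions, double cycles) {
    return ratio(instructions, cycles);
}

/* Miss rate as a percentage of all events of that kind. */
double miss_rate_pct(double misses, double total) {
    assert(misses <= total);
    return 100.0 * ratio(misses, total);
}

/* Print the derived metrics for the counter values reported above. */
void print_derived_metrics(void) {
    printf("IPC:              %.2f\n", ipc(847e6, 826e6));
    printf("Branch miss rate: %.2f%%\n", miss_rate_pct(12.1e6, 177e6));
    printf("Cache miss rate:  %.2f%%\n", miss_rate_pct(8.7e6, 53.3e6));
    printf("Page faults/iter: %.2f\n", ratio(59659.0, 200000.0));
}
```

Calling `print_derived_metrics()` from a small driver is enough to re-check the table after a new perf run; only the literal counter values need to be replaced.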
🎯 Bottleneck Ranking (by Impact)
Box 1: SuperSlab/Shared Pool (CRITICAL - 74.8% syscall time)
Symptoms:
- mmap: 3,241 calls
- munmap: 3,214 calls
- madvise: 1,591 calls
- Total: 8,046 syscalls (82% of all syscalls)
Root Cause: Phase 9 Lazy Deallocation NOT working
- Hypothesis: LRU cache too small, prewarm insufficient
- Expected behavior: Reuse SuperSlabs, minimal syscalls
- Actual: Aggressive deallocation (mimalloc gap)
Attack Plan:
- Immediate: Verify LRU cache is active
  - Check g_ss_lru_* counters
  - ENV: HAKMEM_SS_LRU_DEBUG=1
- Phase 12 Design: Shared SuperSlab Pool (mimalloc-style)
  - 1 SuperSlab serves multiple size classes
  - Dynamic slab allocation
  - Target: 877 SuperSlabs → 100-200 (-77% to -89%)
Expected Impact: +1500% (74.8% → ~5%)
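The reuse behavior expected from the LRU cache can be illustrated with a minimal sketch: released regions are parked in a small cache instead of being unmapped, and the next acquire takes one from the cache before falling back to mmap(). Everything here is an assumption for illustration (the `ss_acquire`/`ss_release` names, the 2 MiB region size, the fixed capacity, and the counters); it is not HAKMEM's SuperSlab code, and it is single-threaded only.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define SS_SIZE      ((size_t)2 << 20)  /* assumed region size: 2 MiB */
#define SS_CACHE_CAP 64                 /* assumed cache capacity */

static void*  g_ss_cache[SS_CACHE_CAP]; /* parked, still-mapped regions */
static size_t g_ss_cached;              /* number of parked regions */
size_t g_mmap_calls, g_munmap_calls;    /* syscall counters for diagnosis */

/* Get a region: reuse a parked one if possible, otherwise mmap() a new one. */
void* ss_acquire(void) {
    if (g_ss_cached > 0) {
        return g_ss_cache[--g_ss_cached];   /* most recently released first */
    }
    void* p = mmap(NULL, SS_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    g_mmap_calls++;
    return p == MAP_FAILED ? NULL : p;
}

/* Return a region: park it if the cache has room, otherwise munmap() it. */
void ss_release(void* p) {
    if (p == NULL) return;
    if (g_ss_cached < SS_CACHE_CAP) {
        g_ss_cache[g_ss_cached++] = p;
        return;
    }
    munmap(p, SS_SIZE);
    g_munmap_calls++;
}
```

With a cache like this, a steady-state workload that repeatedly frees and reallocates the same number of regions should issue no further mmap/munmap calls once the cache is warm. If the real counters still show thousands of calls, either the cache is bypassed or its capacity is smaller than the working set.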
Box 2: mincore() Overhead (MODERATE - 11.0% syscall time)
Symptoms:
- mincore: 1,591 calls (11.0% time)
- Added by SEGV fix (commit 696aa7c0b)
- Called on EVERY external pointer in free wrapper
Root Cause: No caching, no fast-path for known patterns
Attack Plan:
- Optimization A: Cache mincore() result per page
  - TLS cache: last_checked_page → is_mapped
  - Hit rate estimate: 90-95% (same page repeated)
- Optimization B: Skip mincore() for known ranges
  - Check if ptr in expected range (heap, stack, mmap areas)
  - Use /proc/self/maps on init
- Optimization C: Remove from classify_ptr()
  - Already done (Step 3 removed AllocHeader probe)
  - Only free wrapper needs it
Expected Impact: +12-15% (11.0% → ~1%)
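Optimization B could look roughly like the following Linux-only sketch, which snapshots /proc/self/maps once and answers range queries from memory. The names (`ranges_init`, `ranges_contains`) and the table size are hypothetical. The snapshot goes stale as mappings change, so a miss would still have to fall back to mincore(), and a hit is only a hint if the region may have been unmapped since init.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_RANGES 1024  /* assumed upper bound on tracked mappings */

typedef struct { uintptr_t start, end; } addr_range_t;

static addr_range_t g_ranges[MAX_RANGES];
static size_t       g_range_count;

/* Snapshot the current mappings. Returns the number of ranges, or -1 on error. */
int ranges_init(void) {
    FILE* f = fopen("/proc/self/maps", "r");
    if (f == NULL) return -1;
    char line[512];
    g_range_count = 0;
    while (g_range_count < MAX_RANGES && fgets(line, sizeof line, f)) {
        unsigned long start, end;
        /* Each entry begins with "<start>-<end>" in hexadecimal. */
        if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
            g_ranges[g_range_count].start = (uintptr_t)start;
            g_ranges[g_range_count].end   = (uintptr_t)end;
            g_range_count++;
        }
        /* Discard the tail of an over-long line so it is not parsed as an entry. */
        if (strchr(line, '\n') == NULL) {
            int c;
            while ((c = fgetc(f)) != EOF && c != '\n') { /* skip */ }
        }
    }
    fclose(f);
    return (int)g_range_count;
}

/* Returns 1 if ptr lies inside a mapping recorded at init time, else 0. */
int ranges_contains(const void* ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (size_t i = 0; i < g_range_count; i++) {
        if (p >= g_ranges[i].start && p < g_ranges[i].end) return 1;
    }
    return 0;
}
```

A linear scan is fine for a sketch. Since /proc/self/maps is sorted by address, a binary search would be the natural next step if the lookup ends up on the free() hot path.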
Box 3: Front Cache Miss (LOW - visible in cache stats)
Symptoms:
- Cache miss rate: 16.32%
- IPC: 1.03 (low, memory-bound)
Attack Plan (after Box 1/2 fixed):
- Check FastCache hit rate
  - ENV: HAKMEM_FRONT_STATS=1
  - Target: >90% hit rate
- Tune FC capacity/refill size
  - ENV: HAKMEM_FC_CAP=256 (2x current)
  - ENV: HAKMEM_FC_REFILL=32 (2x current)
Expected Impact: +5-10% (after syscall fixes)
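For reference, integer tunables such as HAKMEM_FC_CAP are typically read once at init with a default and a clamp. The helper below is a generic sketch of that pattern, not HAKMEM's actual parsing code: `env_long` is a hypothetical name, and the default of 128 in the usage note is only inferred from "256 (2x current)".

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv()/unsetenv() when testing */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Read an integer tunable from the environment.
 * Returns `def` if the variable is unset, empty, or not a clean decimal
 * number; otherwise returns the parsed value clamped to [lo, hi].
 */
long env_long(const char* name, long def, long lo, long hi) {
    assert(lo <= hi);
    const char* s = getenv(name);
    if (s == NULL || *s == '\0') return def;

    errno = 0;
    char* end = NULL;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return def;

    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}
```

A call such as `env_long("HAKMEM_FC_CAP", 128, 1, 65536)` would then return 256 when the variable is exported as shown above, and fall back to 128 when it is unset or malformed.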
🚀 Optimization Priority
Phase A: SuperSlab Churn Fix (Target: +1500%)
# Step 1: Diagnose LRU
export HAKMEM_SS_LRU_DEBUG=1
export HAKMEM_SS_PREWARM_DEBUG=1
./bench_random_mixed_hakmem 200000 4096 1234567
# Step 2: Tune LRU size
export HAKMEM_SS_LRU_SIZE=128 # Current: unknown
export HAKMEM_SS_PREWARM=64 # Current: unknown
# Step 3: Design Phase 12 Shared Pool
# - Implement mimalloc-style dynamic slab allocation
# - Target: 6,455 syscalls → ~100 (-98%)
Phase B: mincore() Optimization (Target: +12-15%)
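/* Step 1: Page cache (TLS) */
static __thread struct {
    void* page;
    int   is_mapped;
} g_mincore_cache = {NULL, 0};

/* Step 2: Fast-path check.
 * Assumes `page` is ptr rounded down to a page boundary and `page_size`
 * is the system page size, e.g. from sysconf(_SC_PAGESIZE). */
if (page == g_mincore_cache.page) {
    is_mapped = g_mincore_cache.is_mapped;  /* Cache hit: no syscall */
} else {
    unsigned char vec;
    /* mincore() returns 0 if the range is mapped, -1 (ENOMEM) if it is not */
    is_mapped = (mincore(page, page_size, &vec) == 0);  /* Syscall */
    g_mincore_cache.page = page;
    g_mincore_cache.is_mapped = is_mapped;
}
/* Expected: 1,591 → ~100 calls (-94%) */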
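To measure the hit rate before wiring the cache into the free wrapper, the same idea can be packaged as a standalone function with counters. This is a sketch under assumptions: `ptr_is_mapped`, `t_mincore_calls`, and `t_cache_hits` are hypothetical names, the code targets Linux, and a cached "mapped" answer becomes stale if the page is unmapped later, so a real version would need to invalidate the entry on munmap.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* One-entry, per-thread cache of the last page checked. */
static __thread struct {
    uintptr_t page;
    int       valid;
    int       is_mapped;
} t_mincore_cache;

__thread unsigned long t_mincore_calls;  /* syscalls actually issued */
__thread unsigned long t_cache_hits;     /* lookups answered from the cache */

/* Returns 1 if the page containing ptr is mapped, 0 otherwise. */
int ptr_is_mapped(const void* ptr) {
    static long page_size;
    if (page_size == 0) page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    /* Round down to the start of the containing page. */
    uintptr_t page = (uintptr_t)ptr & ~((uintptr_t)page_size - 1);

    if (t_mincore_cache.valid && t_mincore_cache.page == page) {
        t_cache_hits++;
        return t_mincore_cache.is_mapped;
    }

    unsigned char vec;
    t_mincore_calls++;
    /* mincore() succeeds (0) for mapped ranges and fails with ENOMEM otherwise. */
    int mapped = (mincore((void*)page, (size_t)page_size, &vec) == 0);

    t_mincore_cache.page = page;
    t_mincore_cache.valid = 1;
    t_mincore_cache.is_mapped = mapped;
    return mapped;
}
```

Running the benchmark with this in place and comparing `t_cache_hits` against `t_mincore_calls` would confirm or refute the 90-95% hit rate estimate before any further tuning.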
Phase C: Front Tuning (Target: +5-10%)
# After Phase A/B complete
export HAKMEM_FC_CAP=256
export HAKMEM_FC_REFILL=32
export HAKMEM_FRONT_STATS=1
📋 Immediate Action Items
- [ultrathink/ChatGPT] Review this report
- [Task 1] Diagnose why Phase 9 LRU is not working
  - Run with HAKMEM_SS_LRU_DEBUG=1
  - Check LRU hit/miss counters
- [Task 2] Design mincore() page cache
  - TLS cache (page → is_mapped)
  - Measure hit rate
- [Task 3] Implement Phase 12 Shared SuperSlab Pool
  - Design doc: mimalloc-style dynamic allocation
  - Target: 877 → 100-200 SuperSlabs
🎯 Target Performance (After Optimizations)
Current: 563K ops/s
Target: 70-90M ops/s (System malloc: 90M)
Gap: 124-160x
Required: +12,300-15,900% improvement
Phase A (SuperSlab): +1500% → 8.5M ops/s (9.4% of target)
Phase B (mincore): +15% → 10.0M ops/s (11.1% of target)
Phase C (Front): +10% → 11.0M ops/s (12.2% of target)
Phase D (??): Need more (+540-720% to reach 70-90M from 11.0M)
Note: Current performance is worse than Phase 11 (9.38M → 563K ops/s).
Root cause: mincore() added in SEGV fix (1,591 syscalls).
Priority: Fix mincore() overhead FIRST (Phase B), then SuperSlab (Phase A).
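The phase-by-phase projection compounds multiplicatively, which is easy to re-check in code. The sketch below uses hypothetical helper names and reads Phase A as a roughly 15x multiplier, since that is what the 563K → 8.5M step in the projection implies; the other factors are the +15% and +10% estimates. All of these are the report's estimates, not measurements.

```c
#include <assert.h>
#include <stdio.h>

/* Apply a sequence of speedup factors to a baseline throughput (ops/s). */
double project_throughput(double base_ops, const double* factors, int n) {
    assert(n >= 0);
    double ops = base_ops;
    for (int i = 0; i < n; i++) {
        ops *= factors[i];
    }
    return ops;
}

/* Percent improvement still required to get from `from` to `to`. */
double required_gain_pct(double from, double to) {
    assert(from > 0.0);
    return (to / from - 1.0) * 100.0;
}

/* Print the cumulative projection and the remaining gap to a target. */
void print_projection(double base_ops, double target_ops) {
    const double factors[] = {15.0, 1.15, 1.10};  /* Phase A, B, C (assumed) */
    double after = project_throughput(base_ops, factors, 3);
    printf("After Phase A-C: %.1fM ops/s\n", after / 1e6);
    printf("Still required:  +%.0f%%\n", required_gain_pct(after, target_ops));
}
```

Calling `print_projection(563e3, 90e6)` makes the point of the Phase D line concrete: even if all three phases deliver their estimated gains, most of the gap to system malloc remains.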