# Current Task – 2025-11-08

## 🚀 Phase 7-1.3: Hybrid mincore Optimization - Preparing to Beat System malloc

### Mission

**Fix the CRITICAL BOTTLENECK in Phase 7**

- **Current**: 634 cycles/free (mincore overhead)
- **Target**: 1-2 cycles/free (hybrid approach)
- **Improvement**: **317-634x faster!** 🚀
- **Strategy**: Alignment check (fast) + mincore fallback (rare)

---

## 📊 Phase 7-1.2 Completion Status

### ✅ Completed

1. **Phase 7-1.0**: PoC implementation (+39%~+436% improvement)
2. **Phase 7-1.1**: Dual-header dispatch (Task Agent)
3. **Phase 7-1.2**: Page boundary SEGV fix (100% crash-free)

### 📈 Achievements

- ✅ 1-byte header system verified working
- ✅ Dual-header dispatch (Tiny + malloc/mmap)
- ✅ Page boundary safety ensured
- ✅ All benchmarks crash-free

### 🔥 CRITICAL Issue Found

**Findings from the Task Agent Ultrathink Analysis (Phase 7 Design Review):**

**Bottleneck**: `hak_is_memory_readable()` calls mincore() on **every free()**

- **Measured Cost**: 634 cycles/call
- **System tcache**: 10-15 cycles
- **Result**: Phase 7 runs at roughly **1/40 the speed** of System malloc 💀

**Why This Happened:**

- To prevent the page boundary SEGV, the readability of `ptr-1` is checked before the header is read
- However, page boundaries account for **<0.1%** of frees
- The **99.9%** normal case still pays the full 634 cycles

---

## ✅ Solution: Hybrid mincore Optimization

### Concept

**Fast path (alignment check) + Slow path (mincore fallback)**

```c
// Before (slow): mincore on every free
if (!hak_is_memory_readable((char*)ptr - 1)) return 0;      // 634 cycles

// After (fast): 99.9% of frees only need the alignment check
if (((uintptr_t)ptr & 0xFFF) == 0) {                        // 1-2 cycles
    // Page boundary (0.1%): safety check
    if (!hak_is_memory_readable((char*)ptr - 1)) return 0;  // 634 cycles
}
// Normal case (99.9%): direct header read
```

### Performance Impact

| Case | Frequency | Cost | Weighted cost (cycles) |
|------|-----------|------|------------------------|
| Normal (not boundary) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6-2.6 cycles = **244-396x faster!**

### Micro-Benchmark Results (Task Agent)

```
[MINCORE]  Mapped memory:    634 cycles/call  ← Current
[ALIGN]    Alignment check:    0 cycles/call
[HYBRID]   Align + mincore:    1 cycles/call  ← Optimized!
[BOUNDARY] Page boundary:   2155 cycles/call  (rare, <0.1%)
```

---

## 📋 Implementation Plan (Phase 7-1.3)

### Task 1: Implement Hybrid mincore (1-2 hours)

**File 1**: `core/tiny_free_fast_v2.inc.h:53-60`

**Before**:

```c
// CRITICAL: Check if header location (ptr-1) is accessible before reading
void* header_addr = (char*)ptr - 1;
extern int hak_is_memory_readable(void* addr);
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
    // Header not accessible - route to slow path
    return 0;
}
```

**After**:

```c
// CRITICAL: Fast check for page boundaries (0.1% case)
// Most allocations (99.9%) are NOT at page boundaries, so check alignment first
void* header_addr = (char*)ptr - 1;
if (__builtin_expect(((uintptr_t)ptr & 0xFFF) == 0, 0)) {
    // ptr is page-aligned, so ptr-1 lies on the previous page - do safety check
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        // Header not accessible - route to slow path
        return 0;
    }
}
// Normal case (99.9%): header shares ptr's page and is safe to read (no mincore call!)
```

**File 2**: `core/box/hak_free_api.inc.h:96` (Step 2 dual-header dispatch)

**Before**:

```c
// SAFETY: Check if raw header is accessible before dereferencing
if (hak_is_memory_readable(raw)) {
    AllocHeader* hdr = (AllocHeader*)raw;
    // ...
}
```

**After**:

```c
// SAFETY: Fast check for page boundaries first
// Assumes raw == (char*)ptr - sizeof(AllocHeader): the header starts on an
// earlier page only when ptr sits within sizeof(AllocHeader) bytes of a page
// start. Testing raw's own alignment would miss headers that straddle a page.
if (((uintptr_t)ptr & 0xFFF) < sizeof(AllocHeader)) {
    // Potential page boundary - do safety check
    if (!hak_is_memory_readable(raw)) {
        goto slow_path;
    }
}
// Normal case: raw header is safe to read
AllocHeader* hdr = (AllocHeader*)raw;
// ...
```

**File 3**: Add comment to `core/hakmem_internal.h:277-294`

```c
// NOTE: This function is expensive (634 cycles via mincore syscall).
// Use alignment check first to avoid calling this on normal allocations:
//   if (((uintptr_t)ptr & 0xFFF) == 0) {
//       if (!hak_is_memory_readable((char*)ptr - 1)) { /* handle page boundary */ }
//   }
static inline int hak_is_memory_readable(void* addr) {
    // ... existing implementation
}
```

### Task 2: Validate with Micro-Benchmark (30 min)

**File**: `tests/micro_mincore_bench.c` (already created by Task Agent)

```bash
# Build and run micro-benchmark
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench

# Expected output:
# [MINCORE] Mapped memory:   634 cycles/call
# [ALIGN]   Alignment check:   0 cycles/call
# [HYBRID]  Align + mincore:   1 cycles/call  ← Target!
```

**Success Criteria**:

- ✅ HYBRID shows ~1-2 cycles (vs 634 before)

### Task 3: Smoke Test with Larson (30 min)

```bash
# Rebuild Phase 7 with optimization
make clean && make HEADER_CLASSIDX=1 larson_hakmem

# Run smoke test (1T)
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1

# Expected: 20-40M ops/s (vs 1M before)
```

**Success Criteria**:

- ✅ Throughput > 20M ops/s (20x improvement)
- ✅ No crashes (stability)

### Task 4: Full Validation (1-2 hours)

```bash
# Test multiple sizes
for size in 128 256 512 1024 2048; do
    echo "=== Testing size=$size ==="
    ./bench_random_mixed_hakmem 10000 $size 1234567
done

# Test Larson 4T (MT stability)
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: All pass, 20-60M ops/s
```

---

## 🎯 Expected Outcomes

### Performance Targets

| Benchmark | Before (7-1.2) | After (7-1.3) | Improvement |
|-----------|----------------|---------------|-------------|
| **bench_random_mixed** | 692K ops/s | **40-60M ops/s** | **58-87x** 🚀 |
| **larson_hakmem 1T** | 838K ops/s | **40-80M ops/s** | **48-95x** 🚀 |
| **larson_hakmem 4T** | 838K ops/s | **120-240M ops/s** | **143-286x** 🚀 |

### vs System malloc

| Metric | System | HAKMEM (7-1.3) | Result |
|--------|--------|----------------|--------|
| **Tiny free** | 10-15 cycles | **1-2 cycles** | **5-15x faster** 🏆 |
| **Throughput** | 56M ops/s | **40-80M ops/s** | **70-140%** ✅ |

**Prediction**: **70-140% of System malloc** (on par, or a win!)

---

## 📁 Related Documents

### Task Agent Generated (Phase 7 Design Review)

- [`PHASE7_DESIGN_REVIEW.md`](PHASE7_DESIGN_REVIEW.md) - Full technical analysis (23KB, 758 lines)
- [`PHASE7_ACTION_PLAN.md`](PHASE7_ACTION_PLAN.md) - Implementation guide (5.7KB, 235 lines)
- [`PHASE7_SUMMARY.md`](PHASE7_SUMMARY.md) - Executive summary (11KB, 302 lines)
- [`PHASE7_QUICKREF.txt`](PHASE7_QUICKREF.txt) - Quick reference (5.3KB)
- [`tests/micro_mincore_bench.c`](tests/micro_mincore_bench.c) - Micro-benchmark (4.5KB)

### Phase 7 History

- [`REGION_ID_DESIGN.md`](REGION_ID_DESIGN.md) - Complete design (Task Agent Opus Ultrathink)
- [`PAGE_BOUNDARY_SEGV_FIX.md`](PAGE_BOUNDARY_SEGV_FIX.md) - Phase 7-1.2 fix report
- [`CLAUDE.md#phase-7`](CLAUDE.md#phase-7-region-id-direct-lookup---ultra-fast-free-path-2025-11-08-) - Phase 7 overview

---

## 🛠️ Commands to Run

### Step 1: Implement Hybrid Optimization (1-2 hours)

```bash
# Edit 3 files (see Task 1 above):
# - core/tiny_free_fast_v2.inc.h
# - core/box/hak_free_api.inc.h
# - core/hakmem_internal.h
```

### Step 2: Validate Micro-Benchmark (30 min)

```bash
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected: HYBRID ~1-2 cycles ✅
```

### Step 3: Smoke Test (30 min)

```bash
make clean && make HEADER_CLASSIDX=1 larson_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: >20M ops/s ✅
```

### Step 4: Full Validation (1-2 hours)

```bash
# Random mixed sizes
./bench_random_mixed_hakmem 10000 1024 1234567

# Larson MT
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: 40-80M ops/s, no crashes ✅
```

---

## 📅 Timeline

- **Phase 7-1.3 (Hybrid Optimization)**: 1-2 hours ← **We are here!**
- **Validation & Testing**: 1-2 hours
- **Phase 7-2 (Full Benchmark vs mimalloc)**: 2-3 hours
- **Total**: **Beat System malloc in 4-7 hours** 🎉

---

## 🚦 Go/No-Go Decision

### Phase 7-1.2 Status: NO-GO ⛔

**Reason**: mincore overhead (634 cycles = 40x slower than System)

### Phase 7-1.3 Status: CONDITIONAL GO 🟡

**Condition**:

1. ✅ Hybrid implementation complete
2. ✅ Micro-benchmark shows 1-2 cycles
3. ✅ Larson smoke test >20M ops/s

**Risk**: LOW (proven by Task Agent micro-benchmark)

---

## ✅ Completed (through Phase 7-1.2)

### Phase 7-1.2: Page Boundary SEGV Fix (2025-11-08)

- ✅ `hak_is_memory_readable()` check before header read
- ✅ All benchmarks crash-free (1024B, 2048B, 4096B)
- ✅ Committed: `24beb34de`
- **Issue**: mincore overhead (634 cycles) → to be fixed in Phase 7-1.3

### Phase 7-1.1: Dual-Header Dispatch (2025-11-08)

- ✅ Task Agent contributions (header validation, malloc fallback)
- ✅ 16-byte AllocHeader dispatch
- ✅ Committed

### Phase 7-1.0: PoC Implementation (2025-11-08)

- ✅ 1-byte header system
- ✅ Ultra-fast free path (basic version)
- ✅ Initial results: +39%~+436%

---

**Next action: start implementing the Phase 7-1.3 Hybrid Optimization!** 🚀
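
---

The page-boundary condition and the weighted-cost arithmetic used throughout this plan can be sanity-checked with a small sketch. It is not part of the hakmem sources: `sketch_header_needs_probe()` and `sketch_expected_cycles()` are hypothetical helpers, and the 4 KiB page size is an assumption taken from the `0xFFF` mask above. The first helper generalizes the `((uintptr_t)ptr & 0xFFF) == 0` test to a header of any size placed directly before `ptr`, which matters for the 16-byte AllocHeader; the second reproduces the "Weighted" column of the Performance Impact table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the 0xFFF mask used in this plan. */
#define SKETCH_PAGE_MASK ((uintptr_t)0xFFF)

/*
 * Hypothetical helper (not in hakmem): returns 1 when a header of
 * header_size bytes placed immediately before ptr starts on an earlier
 * page than ptr, i.e. when the expensive readability probe is needed.
 * Returns 0 when the whole header shares ptr's page.
 * For header_size == 1 this reduces to (ptr & 0xFFF) == 0.
 */
static inline int sketch_header_needs_probe(const void *ptr, size_t header_size)
{
    assert(header_size > 0 && header_size <= SKETCH_PAGE_MASK + 1);
    return ((uintptr_t)ptr & SKETCH_PAGE_MASK) < header_size;
}

/*
 * Hypothetical helper: expected cycles per free when a fraction
 * boundary_rate of frees pays probe_cycles and the rest pay fast_cycles.
 */
static inline double sketch_expected_cycles(double boundary_rate,
                                            double fast_cycles,
                                            double probe_cycles)
{
    return (1.0 - boundary_rate) * fast_cycles + boundary_rate * probe_cycles;
}
```

With a 0.1% boundary rate, `sketch_expected_cycles(0.001, 1.0, 634.0)` evaluates to about 1.63 and `sketch_expected_cycles(0.001, 2.0, 634.0)` to about 2.63, matching the 1.6-2.6 cycle total. Plugging in the 2155 cycles/call that the micro-benchmark reports for the boundary case instead gives roughly 3.2-4.2 cycles, which is still far below 634.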