Files

Moe Charm (CI) 4983352812 Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%)

## Summary
Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug.
Result: 2-3x performance improvement across all benchmarks.

## Performance Results
- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero ✅

## Changes

### 1. Hybrid mincore Optimization (317-634x faster)
**Problem**: `hak_is_memory_readable()` calls mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc

**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore
- Result: 634 → 1-2 cycles effective (99.6% skip mincore)

**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added

### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)
**Problem**: Macro definition order prevented Phase 7 header write
- hakmem_tiny.c:130 defined legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had `#ifndef` guard → skipped!
- Result: Headers NEVER written → All frees failed → Slow path

**Solution**: Force Phase 7 macro to override legacy
- hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine

### 3. Magic Byte Fix
**Problem**: Release builds don't write magic byte, but free ALWAYS checks it
- Result: All headers marked as invalid

**Solution**: ALWAYS write magic byte (same 1-byte write, no overhead)
- tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard

## Technical Details

### Hybrid mincore Effectiveness
| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (Step 1) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6 cycles = **317-396x faster!**

### Macro Fix Impact
**Before**: HAK_RET_ALLOC(cls, ptr) → return (ptr)  // No header write
**After**: HAK_RET_ALLOC(cls, ptr) → return tiny_region_id_write_header((ptr), (cls))

**Result**: Headers properly written → Fast path works → +194-333% performance

## Investigation
Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)

Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)

## Related
- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-08 04:50:41 +09:00

9.2 KiB

Raw Blame History

Current Task – 2025-11-08

🚀 Phase 7-1.3: Hybrid mincore Optimization - System malloc に勝つ準備

ミッション

Phase 7 の CRITICAL BOTTLENECK を修正

Current: 634 cycles/free (mincore overhead)
Target: 1-2 cycles/free (hybrid approach)
Improvement: 317-634x faster! 🚀
Strategy: Alignment check (fast) + mincore fallback (rare)

📊 Phase 7-1.2 完了状況

✅ 完了済み

Phase 7-1.0: PoC 実装 (+39%~+436% improvement)
Phase 7-1.1: Dual-header dispatch (Task Agent)
Phase 7-1.2: Page boundary SEGV fix (100% crash-free)

📈 達成した成果

✅ 1-byte header system 動作確認
✅ Dual-header dispatch (Tiny + malloc/mmap)
✅ Page boundary 安全性確保
✅ All benchmarks crash-free

🔥 発見された CRITICAL 問題

Task Agent Ultrathink Analysis (Phase 7 Design Review) の結果:

Bottleneck: hak_is_memory_readable() が すべての free() で mincore() を呼ぶ

Measured Cost: 634 cycles/call
System tcache: 10-15 cycles
Result: Phase 7 は System malloc の 1/40 の速度 💀

Why This Happened:

Page boundary SEGV を防ぐため、ptr-1 の readability を確認
しかし page boundary は <0.1% の頻度
99.9% の normal case でも 634 cycles 払っている

✅ 解決策: Hybrid mincore Optimization

Concept

Fast path (alignment check) + Slow path (mincore fallback)

// Before (slow): すべての free で mincore
if (!hak_is_memory_readable(ptr-1)) return 0;  // 634 cycles

// After (fast): 99.9% はアライメントチェックのみ
if (((uintptr_t)ptr & 0xFFF) == 0) {           // 1-2 cycles
    // Page boundary (0.1%): Safety check
    if (!hak_is_memory_readable(ptr-1)) return 0;  // 634 cycles
}
// Normal case (99.9%): Direct header read

Performance Impact

Case	Frequency	Cost	Weighted
Normal (not boundary)	99.9%	1-2 cycles	1-2
Page boundary	0.1%	634 cycles	0.6
Total	-	-	1.6-2.6 cycles

Improvement: 634 → 1.6 cycles = 317-396x faster!

Micro-Benchmark Results (Task Agent)

[MINCORE] Mapped memory:   634 cycles/call  ← Current
[ALIGN]   Alignment check: 0 cycles/call
[HYBRID]  Align + mincore:  1 cycles/call   ← Optimized!
[BOUNDARY] Page boundary:  2155 cycles/call (rare, <0.1%)

📋 実装計画（Phase 7-1.3）

Task 1: Implement Hybrid mincore (1-2 hours)

File 1: core/tiny_free_fast_v2.inc.h:53-60

Before:

// CRITICAL: Check if header location (ptr-1) is accessible before reading
void* header_addr = (char*)ptr - 1;
extern int hak_is_memory_readable(void* addr);
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
    // Header not accessible - route to slow path
    return 0;
}

After:

// CRITICAL: Fast check for page boundaries (0.1% case)
// Most allocations (99.9%) are NOT at page boundaries, so check alignment first
void* header_addr = (char*)ptr - 1;
if (__builtin_expect(((uintptr_t)ptr & 0xFFF) == 0, 0)) {
    // Potential page boundary - do safety check
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        // Header not accessible - route to slow path
        return 0;
    }
}
// Normal case (99.9%): header is safe to read (no mincore call!)

File 2: core/box/hak_free_api.inc.h:96 (Step 2 dual-header dispatch)

Before:

// SAFETY: Check if raw header is accessible before dereferencing
if (hak_is_memory_readable(raw)) {
    AllocHeader* hdr = (AllocHeader*)raw;
    // ...
}

After:

// SAFETY: Fast check for page boundaries first
if (((uintptr_t)raw & 0xFFF) == 0) {
    // Potential page boundary - do safety check
    if (!hak_is_memory_readable(raw)) {
        goto slow_path;
    }
}
// Normal case: raw header is safe to read
AllocHeader* hdr = (AllocHeader*)raw;
// ...

File 3: Add comment to core/hakmem_internal.h:277-294

// NOTE: This function is expensive (634 cycles via mincore syscall).
// Use alignment check first to avoid calling this on normal allocations:
//   if (((uintptr_t)ptr & 0xFFF) == 0) {
//       if (!hak_is_memory_readable(ptr)) { /* handle page boundary */ }
//   }
static inline int hak_is_memory_readable(void* addr) {
    // ... existing implementation
}

Task 2: Validate with Micro-Benchmark (30 min)

File: tests/micro_mincore_bench.c (already created by Task Agent)

# Build and run micro-benchmark
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench

# Expected output:
# [MINCORE] Mapped memory:   634 cycles/call
# [ALIGN]   Alignment check: 0 cycles/call
# [HYBRID]  Align + mincore:  1 cycles/call  ← Target!

Success Criteria:

✅ HYBRID shows ~1-2 cycles (vs 634 before)

Task 3: Smoke Test with Larson (30 min)

# Rebuild Phase 7 with optimization
make clean && make HEADER_CLASSIDX=1 larson_hakmem

# Run smoke test (1T)
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1

# Expected: 20-40M ops/s (vs 1M before)

Success Criteria:

✅ Throughput > 20M ops/s (20x improvement)
✅ No crashes (stability)

Task 4: Full Validation (1-2 hours)

# Test multiple sizes
for size in 128 256 512 1024 2048; do
    echo "=== Testing size=$size ==="
    ./bench_random_mixed_hakmem 10000 $size 1234567
done

# Test Larson 4T (MT stability)
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: All pass, 20-60M ops/s

🎯 Expected Outcomes

Performance Targets

Benchmark	Before (7-1.2)	After (7-1.3)	Improvement
bench_random_mixed	692K ops/s	40-60M ops/s	58-87x 🚀
larson_hakmem 1T	838K ops/s	40-80M ops/s	48-95x 🚀
larson_hakmem 4T	838K ops/s	120-240M ops/s	143-286x 🚀

vs System malloc

Metric	System	HAKMEM (7-1.3)	Result
Tiny free	10-15 cycles	1-2 cycles	5-15x faster 🏆
Throughput	56M ops/s	40-80M ops/s	70-140% ✅

Prediction: 70-140% of System malloc (互角〜勝ち!)

📁 関連ドキュメント

Task Agent Generated (Phase 7 Design Review)

PHASE7_DESIGN_REVIEW.md - 完全な技術分析 (23KB, 758 lines)
PHASE7_ACTION_PLAN.md - 実装ガイド (5.7KB, 235 lines)
PHASE7_SUMMARY.md - エグゼクティブサマリー (11KB, 302 lines)
PHASE7_QUICKREF.txt - クイックリファレンス (5.3KB)
tests/micro_mincore_bench.c - Micro-benchmark (4.5KB)

Phase 7 History

REGION_ID_DESIGN.md - 完全設計（Task Agent Opus Ultrathink）
PAGE_BOUNDARY_SEGV_FIX.md - Phase 7-1.2 修正レポート
CLAUDE.md#phase-7 - Phase 7 概要

🛠️ 実行コマンド

Step 1: Implement Hybrid Optimization (1-2 hours)

# Edit 3 files (see Task 1 above):
# - core/tiny_free_fast_v2.inc.h
# - core/box/hak_free_api.inc.h
# - core/hakmem_internal.h

Step 2: Validate Micro-Benchmark (30 min)

gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected: HYBRID ~1-2 cycles ✅

Step 3: Smoke Test (30 min)

make clean && make HEADER_CLASSIDX=1 larson_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: >20M ops/s ✅

Step 4: Full Validation (1-2 hours)

# Random mixed sizes
./bench_random_mixed_hakmem 10000 1024 1234567

# Larson MT
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: 40-80M ops/s, no crashes ✅

📅 Timeline

Phase 7-1.3 (Hybrid Optimization): 1-2時間 ← 今ここ！
Validation & Testing: 1-2時間
Phase 7-2 (Full Benchmark vs mimalloc): 2-3時間
Total: 4-6時間で System malloc に勝つ 🎉

🚦 Go/No-Go Decision

Phase 7-1.2 Status: NO-GO ⛔

Reason: mincore overhead (634 cycles = 40x slower than System)

Phase 7-1.3 Status: CONDITIONAL GO 🟡

Condition:

✅ Hybrid implementation complete
✅ Micro-benchmark shows 1-2 cycles
✅ Larson smoke test >20M ops/s

Risk: LOW (proven by Task Agent micro-benchmark)

✅ 完了済み（Phase 7-1.2 まで）

Phase 7-1.2: Page Boundary SEGV Fix (2025-11-08)

✅ hak_is_memory_readable() check before header read
✅ All benchmarks crash-free (1024B, 2048B, 4096B)
✅ Committed: 24beb34de
Issue: mincore overhead (634 cycles) → Phase 7-1.3 で修正

Phase 7-1.1: Dual-Header Dispatch (2025-11-08)

✅ Task Agent contributions (header validation, malloc fallback)
✅ 16-byte AllocHeader dispatch
✅ Committed

Phase 7-1.0: PoC Implementation (2025-11-08)

✅ 1-byte header system
✅ Ultra-fast free path (basic version)
✅ Initial results: +39%~+436%

次のアクション: Phase 7-1.3 Hybrid Optimization 実装開始！ 🚀

9.2 KiB Raw Blame History Unescape Escape