# Current Task – 2025-11-08

## 🚀 Phase 7-1.3: Hybrid mincore Optimization - Preparing to Beat System malloc

### Mission
**Fix the critical bottleneck in Phase 7**
- **Current**: 634 cycles/free (mincore overhead)
- **Target**: 1-2 cycles/free (hybrid approach)
- **Improvement**: **317-634x faster!** 🚀
- **Strategy**: Alignment check (fast) + mincore fallback (rare)

---

## 📊 Phase 7-1.2 Completion Status

### ✅ Completed
1. **Phase 7-1.0**: PoC implementation (+39% to +436% improvement)
2. **Phase 7-1.1**: Dual-header dispatch (Task Agent)
3. **Phase 7-1.2**: Page boundary SEGV fix (100% crash-free)

### 📈 Achievements
- ✅ 1-byte header system verified working
- ✅ Dual-header dispatch (Tiny + malloc/mmap)
- ✅ Page-boundary safety ensured
- ✅ All benchmarks crash-free

### 🔥 Critical Issue Discovered

**Findings from the Task Agent Ultrathink Analysis (Phase 7 Design Review):**

**Bottleneck**: `hak_is_memory_readable()` calls `mincore()` on **every `free()`**
- **Measured Cost**: 634 cycles/call
- **System tcache**: 10-15 cycles
- **Result**: Phase 7 runs at roughly **1/40 the speed** of System malloc 💀

**Why This Happened:**
- To prevent the page-boundary SEGV, the code checks that `ptr-1` is readable
- But page-boundary pointers occur in **<0.1%** of frees
- The **99.9%** normal case still pays the full 634 cycles
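
The cost above comes from entering the kernel on every `free()`. As a minimal sketch of what such a probe can look like, assuming a POSIX system that provides `mincore()`: the helper name `probe_page_mapped` is hypothetical, and this is not the HAKMEM implementation of `hak_is_memory_readable()`, which is not shown here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

/* Hypothetical helper (not the HAKMEM code): returns 1 if the page that
 * contains addr is currently mapped, 0 otherwise. Every call is a syscall. */
static int probe_page_mapped(const void *addr) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    /* mincore() requires a page-aligned start address. */
    uintptr_t base = (uintptr_t)addr & ~((uintptr_t)page - 1);
    unsigned char vec;
    /* mincore() fails (ENOMEM) when the range contains unmapped pages. */
    return mincore((void *)base, (size_t)page, &vec) == 0;
}
```

Note that `mincore()` reports whether a page is mapped, not whether it carries read permission, so a probe like this is a conservative stand-in for a true readability test.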

---

## ✅ Solution: Hybrid mincore Optimization

### Concept
**Fast path (alignment check) + Slow path (mincore fallback)**

```c
// Before (slow): mincore on every free
if (!hak_is_memory_readable(ptr-1)) return 0;  // 634 cycles

// After (fast): 99.9% of frees take only the alignment check
if (((uintptr_t)ptr & 0xFFF) == 0) {           // 1-2 cycles
    // Page boundary (0.1%): Safety check
    if (!hak_is_memory_readable(ptr-1)) return 0;  // 634 cycles
}
// Normal case (99.9%): Direct header read
```

### Performance Impact

| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (not boundary) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6-2.6 cycles = **244-396x faster!**
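
The weighted total can be reproduced with a small calculation. This is a sketch under the table's own assumptions (0.1% boundary frequency, 1-2 cycles for the check, 634 cycles for the probe); the function name `expected_cycles` is hypothetical. It charges the alignment check on every call and adds the probe only for the boundary fraction, which matches the table to within rounding.

```c
#include <assert.h>

/* Expected cycles per free(): the alignment check always runs, and the
 * mincore probe is paid only for the page-boundary fraction of calls. */
static double expected_cycles(double check_cycles, double probe_cycles,
                              double boundary_fraction) {
    assert(boundary_fraction >= 0.0 && boundary_fraction <= 1.0);
    return check_cycles + boundary_fraction * probe_cycles;
}
```

For example, `expected_cycles(1.0, 634.0, 0.001)` gives about 1.63 and `expected_cycles(2.0, 634.0, 0.001)` about 2.63. If the probe on a real boundary costs closer to the 2155 cycles reported by the micro-benchmark, the same formula gives roughly 3.2-4.2 cycles, still far below 634.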

### Micro-Benchmark Results (Task Agent)

```
[MINCORE] Mapped memory:   634 cycles/call  ← Current
[ALIGN]   Alignment check: 0 cycles/call
[HYBRID]  Align + mincore:  1 cycles/call   ← Optimized!
[BOUNDARY] Page boundary:  2155 cycles/call (rare, <0.1%)
```
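
The contents of `tests/micro_mincore_bench.c` are not reproduced here. As a rough illustration of how a per-call cost like the ones above can be measured, the following sketch times a loop with `clock_gettime()`. The names `ns_per_call` and `is_page_aligned` are hypothetical, and the result is in nanoseconds rather than cycles; converting to cycles requires the CPU clock frequency or a hardware cycle counter.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* The alignment test used by the hybrid fast path (assumes 4 KiB pages). */
static int is_page_aligned(uintptr_t p) { return (p & 0xFFF) == 0; }

/* Average wall-clock nanoseconds per call of fn over iters iterations.
 * The volatile sink keeps the compiler from discarding the calls. */
static double ns_per_call(int (*fn)(uintptr_t), uintptr_t start, long iters) {
    assert(fn != 0 && iters > 0);
    struct timespec t0, t1;
    volatile int sink = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        sink += fn(start + (uintptr_t)i * 16);  /* walk 16-byte-spaced addresses */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double elapsed_ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                      + (double)(t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / (double)iters;
}
```

A reading near zero for a check this small, like the `0 cycles/call` line above, usually means the cost is below the timer's resolution or hidden by out-of-order execution, so it is best treated as "too cheap to measure this way" rather than literally free.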

---

## 📋 Implementation Plan (Phase 7-1.3)

### Task 1: Implement Hybrid mincore (1-2 hours)

**File 1**: `core/tiny_free_fast_v2.inc.h:53-60`

**Before**:
```c
// CRITICAL: Check if header location (ptr-1) is accessible before reading
void* header_addr = (char*)ptr - 1;
extern int hak_is_memory_readable(void* addr);
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
    // Header not accessible - route to slow path
    return 0;
}
```

**After**:
```c
// CRITICAL: Fast check for page boundaries (0.1% case)
// Most allocations (99.9%) are NOT at page boundaries, so check alignment first
void* header_addr = (char*)ptr - 1;
if (__builtin_expect(((uintptr_t)ptr & 0xFFF) == 0, 0)) {
    // Potential page boundary - do safety check
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        // Header not accessible - route to slow path
        return 0;
    }
}
// Normal case (99.9%): header is safe to read (no mincore call!)
```
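
To check that the slow path really fires only for page-aligned pointers, the same logic can be exercised in isolation with a stub probe. This is a sketch, not the code in `core/tiny_free_fast_v2.inc.h`: `header_read_is_safe`, `probe_fn`, and `counting_probe` are hypothetical names, and 4 KiB pages are assumed as in the plan.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_MASK_4K ((uintptr_t)0xFFF)  /* assumes 4 KiB pages */

typedef int (*probe_fn)(const void *addr);

/* Returns 1 if the 1-byte header at ptr-1 may be read directly,
 * 0 if the caller should route to the slow path. */
static int header_read_is_safe(const void *ptr, probe_fn probe) {
    assert(ptr != 0 && probe != 0);
    if (((uintptr_t)ptr & PAGE_MASK_4K) == 0) {
        /* ptr starts a page, so ptr-1 lies on the previous page: probe it. */
        return probe((const void *)((uintptr_t)ptr - 1));
    }
    /* ptr-1 shares a page with ptr, which the caller assumes is mapped. */
    return 1;
}

/* Stub probe for testing: counts calls and reports "not readable". */
static int probe_calls;
static int counting_probe(const void *addr) {
    (void)addr;
    probe_calls++;
    return 0;
}
```

Passing the probe as a function pointer is only a testing convenience here; the planned change calls `hak_is_memory_readable()` directly.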

**File 2**: `core/box/hak_free_api.inc.h:96` (Step 2 dual-header dispatch)

**Before**:
```c
// SAFETY: Check if raw header is accessible before dereferencing
if (hak_is_memory_readable(raw)) {
    AllocHeader* hdr = (AllocHeader*)raw;
    // ...
}
```

**After**:
```c
// SAFETY: Fast check for page boundaries first
if (((uintptr_t)raw & 0xFFF) == 0) {
    // Potential page boundary - do safety check
    if (!hak_is_memory_readable(raw)) {
        goto slow_path;
    }
}
// Normal case: raw header is safe to read
AllocHeader* hdr = (AllocHeader*)raw;
// ...
```
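
One point worth verifying before applying this change: how `raw` is derived is not shown in this note. If it is computed as `ptr - sizeof(AllocHeader)` (an assumption), then a 16-byte header lands on the page before `ptr` whenever `ptr` sits less than 16 bytes into its page, and in that situation `raw` itself is not page-aligned. A size-aware predicate on the user pointer, sketched below with the hypothetical name `header_starts_on_previous_page`, covers both the 1-byte and the 16-byte header cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a header of hdr_size bytes that ends at ptr begins on an
 * earlier page than ptr itself (assumes 4 KiB pages). */
static int header_starts_on_previous_page(uintptr_t ptr, size_t hdr_size) {
    assert(hdr_size > 0 && hdr_size <= 4096);
    /* Offset of ptr within its page; the header occupies [ptr - hdr_size, ptr). */
    return (ptr & 0xFFF) < hdr_size;
}
```

With `hdr_size == 1` this reduces to the `(ptr & 0xFFF) == 0` test used for the tiny header. Whether the check in File 2 should test `raw` or the user pointer depends on the actual definition of `raw` in `core/box/hak_free_api.inc.h`.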

**File 3**: Add comment to `core/hakmem_internal.h:277-294`

```c
// NOTE: This function is expensive (634 cycles via mincore syscall).
// Use alignment check first to avoid calling this on normal allocations:
//   if (((uintptr_t)ptr & 0xFFF) == 0) {
//       if (!hak_is_memory_readable(ptr)) { /* handle page boundary */ }
//   }
static inline int hak_is_memory_readable(void* addr) {
    // ... existing implementation
}
```

### Task 2: Validate with Micro-Benchmark (30 min)

**File**: `tests/micro_mincore_bench.c` (already created by Task Agent)

```bash
# Build and run micro-benchmark
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench

# Expected output:
# [MINCORE] Mapped memory:   634 cycles/call
# [ALIGN]   Alignment check: 0 cycles/call
# [HYBRID]  Align + mincore:  1 cycles/call  ← Target!
```

**Success Criteria**:
- ✅ HYBRID shows ~1-2 cycles (vs 634 before)

### Task 3: Smoke Test with Larson (30 min)

```bash
# Rebuild Phase 7 with optimization
make clean && make HEADER_CLASSIDX=1 larson_hakmem

# Run smoke test (1T)
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1

# Expected: 20-40M ops/s (vs 1M before)
```

**Success Criteria**:
- ✅ Throughput > 20M ops/s (20x improvement)
- ✅ No crashes (stability)

### Task 4: Full Validation (1-2 hours)

```bash
# Test multiple sizes
for size in 128 256 512 1024 2048; do
    echo "=== Testing size=$size ==="
    ./bench_random_mixed_hakmem 10000 $size 1234567
done

# Test Larson 4T (MT stability)
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: All pass, 20-60M ops/s
```

---

## 🎯 Expected Outcomes

### Performance Targets

| Benchmark | Before (7-1.2) | After (7-1.3) | Improvement |
|-----------|----------------|---------------|-------------|
| **bench_random_mixed** | 692K ops/s | **40-60M ops/s** | **58-87x** 🚀 |
| **larson_hakmem 1T** | 838K ops/s | **40-80M ops/s** | **48-95x** 🚀 |
| **larson_hakmem 4T** | 838K ops/s | **120-240M ops/s** | **143-286x** 🚀 |

### vs System malloc

| Metric | System | HAKMEM (7-1.3) | Result |
|--------|--------|----------------|--------|
| **Tiny free** | 10-15 cycles | **1-2 cycles** (header check only) | **5-15x faster** 🏆 |
| **Throughput** | 56M ops/s | **40-80M ops/s** | **70-140%** ✅ |

**Prediction**: **70-140% of System malloc** (on par to ahead!)

---

## 📁 Related Documents

### Task Agent Generated (Phase 7 Design Review)
- [`PHASE7_DESIGN_REVIEW.md`](PHASE7_DESIGN_REVIEW.md) - Full technical analysis (23KB, 758 lines)
- [`PHASE7_ACTION_PLAN.md`](PHASE7_ACTION_PLAN.md) - Implementation guide (5.7KB, 235 lines)
- [`PHASE7_SUMMARY.md`](PHASE7_SUMMARY.md) - Executive summary (11KB, 302 lines)
- [`PHASE7_QUICKREF.txt`](PHASE7_QUICKREF.txt) - Quick reference (5.3KB)
- [`tests/micro_mincore_bench.c`](tests/micro_mincore_bench.c) - Micro-benchmark (4.5KB)

### Phase 7 History
- [`REGION_ID_DESIGN.md`](REGION_ID_DESIGN.md) - Complete design (Task Agent Opus Ultrathink)
- [`PAGE_BOUNDARY_SEGV_FIX.md`](PAGE_BOUNDARY_SEGV_FIX.md) - Phase 7-1.2 fix report
- [`CLAUDE.md#phase-7`](CLAUDE.md#phase-7-region-id-direct-lookup---ultra-fast-free-path-2025-11-08-) - Phase 7 overview

---

## 🛠️ Commands to Run

### Step 1: Implement Hybrid Optimization (1-2 hours)
```bash
# Edit 3 files (see Task 1 above):
# - core/tiny_free_fast_v2.inc.h
# - core/box/hak_free_api.inc.h
# - core/hakmem_internal.h
```

### Step 2: Validate Micro-Benchmark (30 min)
```bash
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected: HYBRID ~1-2 cycles ✅
```

### Step 3: Smoke Test (30 min)
```bash
make clean && make HEADER_CLASSIDX=1 larson_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: >20M ops/s ✅
```

### Step 4: Full Validation (1-2 hours)
```bash
# Random mixed sizes
./bench_random_mixed_hakmem 10000 1024 1234567

# Larson MT
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: 40-80M ops/s, no crashes ✅
```

---

## 📅 Timeline

- **Phase 7-1.3 (Hybrid Optimization)**: 1-2 hours ← **You are here!**
- **Validation & Testing**: 1-2 hours
- **Phase 7-2 (Full Benchmark vs mimalloc)**: 2-3 hours
- **Total**: **4-7 hours to beat System malloc** 🎉

---

## 🚦 Go/No-Go Decision

### Phase 7-1.2 Status: NO-GO ⛔
**Reason**: mincore overhead (634 cycles = 40x slower than System)

### Phase 7-1.3 Status: CONDITIONAL GO 🟡
**Condition**:
1. ✅ Hybrid implementation complete
2. ✅ Micro-benchmark shows 1-2 cycles
3. ✅ Larson smoke test >20M ops/s

**Risk**: LOW (proven by Task Agent micro-benchmark)

---

## ✅ Completed (through Phase 7-1.2)

### Phase 7-1.2: Page Boundary SEGV Fix (2025-11-08)
- ✅ `hak_is_memory_readable()` check before header read
- ✅ All benchmarks crash-free (1024B, 2048B, 4096B)
- ✅ Committed: `24beb34de`
- **Issue**: mincore overhead (634 cycles) → to be fixed in Phase 7-1.3

### Phase 7-1.1: Dual-Header Dispatch (2025-11-08)
- ✅ Task Agent contributions (header validation, malloc fallback)
- ✅ 16-byte AllocHeader dispatch
- ✅ Committed

### Phase 7-1.0: PoC Implementation (2025-11-08)
- ✅ 1-byte header system
- ✅ Ultra-fast free path (basic version)
- ✅ Initial results: +39%~+436%

---

**Next action: start implementing the Phase 7-1.3 Hybrid Optimization!** 🚀