## Summary

Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug.
Result: 2.9-4.3x throughput improvement on the benchmarks listed below.

## Performance Results

- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero ✅

## Changes

### 1. Hybrid mincore Optimization (317-634x faster)

**Problem**: `hak_is_memory_readable()` calls the mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc

**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore
- Result: 634 → 1-2 cycles effective (99.6% skip mincore)

**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added

### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)

**Problem**: Macro definition order prevented the Phase 7 header write
- hakmem_tiny.c:130 defined the legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had an `#ifndef` guard → skipped!
- Result: Headers NEVER written → All frees failed → Slow path

**Solution**: Force the Phase 7 macro to override the legacy one
- hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine

### 3. Magic Byte Fix

**Problem**: Release builds don't write the magic byte, but free ALWAYS checks it
- Result: All headers marked as invalid

**Solution**: ALWAYS write the magic byte (same 1-byte write, no overhead)
- tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard

## Technical Details

### Hybrid mincore Effectiveness

| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (Step 1) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6-2.6 cycles = **~244-396x faster!**

### Macro Fix Impact

**Before**: `HAK_RET_ALLOC(cls, ptr)` → `return (ptr)` // No header write
**After**: `HAK_RET_ALLOC(cls, ptr)` → `return tiny_region_id_write_header((ptr), (cls))`
**Result**: Headers properly written → Fast path works → +194-333% performance

## Investigation

Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)

Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)

## Related

- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

# Current Task – 2025-11-08

## 🚀 Phase 7-1.3: Hybrid mincore Optimization - Preparing to Beat System malloc

### Mission

**Fix the CRITICAL BOTTLENECK in Phase 7**
- **Current**: 634 cycles/free (mincore overhead)
- **Target**: 1-2 cycles/free (hybrid approach)
- **Improvement**: **317-634x faster!** 🚀
- **Strategy**: Alignment check (fast) + mincore fallback (rare)

---

## 📊 Phase 7-1.2 Completion Status

### ✅ Completed
1. **Phase 7-1.0**: PoC implementation (+39% to +436% improvement)
2. **Phase 7-1.1**: Dual-header dispatch (Task Agent)
3. **Phase 7-1.2**: Page boundary SEGV fix (100% crash-free)

### 📈 Results Achieved
- ✅ 1-byte header system verified working
- ✅ Dual-header dispatch (Tiny + malloc/mmap)
- ✅ Page boundary safety ensured
- ✅ All benchmarks crash-free

### 🔥 CRITICAL Issue Discovered

**Findings of the Task Agent Ultrathink Analysis (Phase 7 Design Review):**

**Bottleneck**: `hak_is_memory_readable()` calls mincore() on **every free()**
- **Measured Cost**: 634 cycles/call
- **System tcache**: 10-15 cycles
- **Result**: Phase 7 runs at **1/40 the speed** of System malloc 💀

**Why This Happened:**
- To prevent a page boundary SEGV, the readability of `ptr-1` is checked before the header read
- However, page boundaries account for **<0.1%** of frees
- The **99.9%** normal case still pays the full 634 cycles

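
To make the cost concrete, here is a minimal sketch of a mincore()-based mapping probe. The real body of `hak_is_memory_readable()` is not reproduced in this document, so `sketch_is_mapped` is a hypothetical stand-in. It assumes Linux, where mincore() fails with ENOMEM for an unmapped range, and it only reports whether the page is mapped, not whether its protection allows reads.

```c
#define _DEFAULT_SOURCE   /* exposes mincore() and MAP_ANONYMOUS in glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical stand-in for hak_is_memory_readable(): returns 1 if the page
 * containing addr is mapped, 0 otherwise. Every call is one syscall. */
int sketch_is_mapped(const void *addr) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return 0;
    /* mincore() requires a page-aligned start address. */
    uintptr_t base = (uintptr_t)addr & ~((uintptr_t)page - 1);
    unsigned char vec;  /* one status byte is enough for a single page */
    /* Assumption (Linux): an unmapped range makes mincore() fail (ENOMEM). */
    return mincore((void *)base, 1, &vec) == 0;
}
```

Because the probe always enters the kernel, calling it on every free makes the syscall the dominant cost of the free path, which is the problem the hybrid check addresses.
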

---

## ✅ Solution: Hybrid mincore Optimization

### Concept
**Fast path (alignment check) + Slow path (mincore fallback)**

```c
// Before (slow): mincore on every free
if (!hak_is_memory_readable((char*)ptr - 1)) return 0;  // 634 cycles

// After (fast): 99.9% of frees pay only for the alignment check
if (((uintptr_t)ptr & 0xFFF) == 0) {  // 1-2 cycles
    // Page boundary (0.1%): Safety check
    if (!hak_is_memory_readable((char*)ptr - 1)) return 0;  // 634 cycles
}
// Normal case (99.9%): Direct header read
```

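
The hybrid idea can be exercised in isolation. The sketch below is not the project's code: `sketch_header_readable`, `probe_fn`, and the counting probe are hypothetical names, and a 4 KiB page size is assumed, matching the `0xFFF` mask above. The expensive probe is passed in as a function pointer so that the number of slow-path calls can be observed.

```c
#include <stdint.h>

#define SKETCH_PAGE_MASK 0xFFFu  /* assumes 4 KiB pages, as in the check above */

typedef int (*probe_fn)(const void *addr);

/* Returns 1 if the 1-byte header at ptr-1 may be read, 0 to take the slow
 * path. The probe runs only when ptr sits exactly on a page boundary. */
int sketch_header_readable(const void *ptr, probe_fn slow_probe) {
    if (((uintptr_t)ptr & SKETCH_PAGE_MASK) == 0) {
        /* ptr-1 lies in the previous page, which may be unmapped. */
        return slow_probe((const void *)((uintptr_t)ptr - 1));
    }
    /* ptr-1 shares a page with ptr, so no probe is needed. */
    return 1;
}

/* Counting probe for demonstration: records calls, always says "readable". */
int sketch_probe_calls = 0;
int sketch_counting_probe(const void *addr) {
    (void)addr;
    sketch_probe_calls++;
    return 1;
}
```

Calling `sketch_header_readable` with an address such as `0x10010` leaves the counter untouched, while a page-aligned address such as `0x11000` invokes the probe once.
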
### Performance Impact

| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (not boundary) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6-2.6 cycles = **~244-396x faster!**

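
The weighted total is a frequency-weighted average of the two paths. A small helper, with the hypothetical name `sketch_expected_cycles`, makes the arithmetic reproducible:

```c
/* Expected cost per free, in cycles: the fast path weighted by how often it
 * is taken, plus the mincore fallback weighted by the boundary fraction. */
double sketch_expected_cycles(double boundary_fraction,
                              double fast_cycles,
                              double slow_cycles) {
    return (1.0 - boundary_fraction) * fast_cycles
         + boundary_fraction * slow_cycles;
}
```

With `boundary_fraction = 0.001` and `slow_cycles = 634`, a fast path of 1 or 2 cycles gives about 1.63 or 2.63 cycles, matching the table after rounding. Substituting the 2155 cycles that the micro-benchmark reports for the `[BOUNDARY]` case raises the estimate to roughly 3.2 to 4.2 cycles, still far below 634.
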
### Micro-Benchmark Results (Task Agent)

```
[MINCORE] Mapped memory: 634 cycles/call ← Current
[ALIGN] Alignment check: 0 cycles/call
[HYBRID] Align + mincore: 1 cycles/call ← Optimized!
[BOUNDARY] Page boundary: 2155 cycles/call (rare, <0.1%)
```

---

## 📋 Implementation Plan (Phase 7-1.3)

### Task 1: Implement Hybrid mincore (1-2 hours)

**File 1**: `core/tiny_free_fast_v2.inc.h:53-60`

**Before**:
```c
// CRITICAL: Check if header location (ptr-1) is accessible before reading
void* header_addr = (char*)ptr - 1;
extern int hak_is_memory_readable(void* addr);
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
    // Header not accessible - route to slow path
    return 0;
}
```

**After**:
```c
// CRITICAL: Fast check for page boundaries (0.1% case)
// Most allocations (99.9%) are NOT at page boundaries, so check alignment first
void* header_addr = (char*)ptr - 1;
if (__builtin_expect(((uintptr_t)ptr & 0xFFF) == 0, 0)) {
    // Potential page boundary - do safety check
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        // Header not accessible - route to slow path
        return 0;
    }
}
// Normal case (99.9%): header is safe to read (no mincore call!)
```

**File 2**: `core/box/hak_free_api.inc.h:96` (Step 2 dual-header dispatch)

**Before**:
```c
// SAFETY: Check if raw header is accessible before dereferencing
if (hak_is_memory_readable(raw)) {
    AllocHeader* hdr = (AllocHeader*)raw;
    // ...
}
```

**After**:
```c
// SAFETY: Fast check for page boundaries first.
// Assuming raw = ptr - HEADER_SIZE, the header can reach into the previous
// page only when ptr is within HEADER_SIZE bytes of a page start.
if (((uintptr_t)ptr & 0xFFF) < HEADER_SIZE) {
    // Potential page boundary - do safety check
    if (!hak_is_memory_readable(raw)) {
        goto slow_path;
    }
}
// Normal case: raw header is safe to read
AllocHeader* hdr = (AllocHeader*)raw;
// ...
```

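
For the 16-byte header, the relevant question is whether the header starts in the page before the one holding `ptr`. The commit summary states the Step 2 condition as `(ptr & 0xFFF) < HEADER_SIZE`. The sketch below isolates that predicate; the names and the constants (4 KiB pages, a 16-byte header) are assumptions for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE   4096u  /* assumed page size */
#define SKETCH_HEADER_SIZE 16u    /* assumed size of the AllocHeader */

/* Returns 1 when the header [ptr - SKETCH_HEADER_SIZE, ptr) begins in the
 * page before the one containing ptr, i.e. when a readability probe is
 * warranted before dereferencing the header. */
int sketch_header_crosses_page(uintptr_t ptr) {
    return (ptr & (SKETCH_PAGE_SIZE - 1)) < SKETCH_HEADER_SIZE;
}
```

Only 16 of the 4096 byte offsets within a page satisfy the predicate, which is where the roughly 0.4% probe rate quoted for Step 2 comes from. Testing only `(raw & 0xFFF) == 0` would miss these cases, because a header that reaches back into the previous page starts near the end of that page rather than at its beginning.
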
**File 3**: Add comment to `core/hakmem_internal.h:277-294`

```c
// NOTE: This function is expensive (634 cycles via mincore syscall).
// Use alignment check first to avoid calling this on normal allocations:
//   if (((uintptr_t)ptr & 0xFFF) == 0) {
//       if (!hak_is_memory_readable((char*)ptr - 1)) { /* handle page boundary */ }
//   }
static inline int hak_is_memory_readable(void* addr) {
    // ... existing implementation
}
```

### Task 2: Validate with Micro-Benchmark (30 min)

**File**: `tests/micro_mincore_bench.c` (already created by Task Agent)

```bash
# Build and run micro-benchmark
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench

# Expected output:
# [MINCORE] Mapped memory: 634 cycles/call
# [ALIGN] Alignment check: 0 cycles/call
# [HYBRID] Align + mincore: 1 cycles/call ← Target!
```

**Success Criteria**:
- ✅ HYBRID shows ~1-2 cycles (vs 634 before)

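
The contents of `tests/micro_mincore_bench.c` are not reproduced here. As a rough illustration of how a per-call cost can be measured, the following hypothetical harness times a function over many iterations. It reports wall-clock nanoseconds via `clock_gettime` rather than cycles, so its numbers are not directly comparable to the cycle counts above.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

typedef int (*bench_fn)(uintptr_t arg);

/* Average wall-clock nanoseconds per call of fn over iters calls. */
double sketch_ns_per_call(bench_fn fn, uint64_t iters) {
    struct timespec t0, t1;
    volatile int sink = 0;  /* keeps the calls from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        sink += fn((uintptr_t)i * 16u);  /* walk 16-byte-spaced addresses */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return iters ? ns / (double)iters : 0.0;
}

/* Example workload: the alignment test alone (assumes 4 KiB pages). */
int sketch_align_only(uintptr_t p) {
    return (p & 0xFFFu) == 0;
}
```

Timing `sketch_align_only` against a variant that always calls the mincore-based probe should show the same qualitative gap as the `[ALIGN]` and `[MINCORE]` lines, although the absolute values depend on the machine.
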
### Task 3: Smoke Test with Larson (30 min)

```bash
# Rebuild Phase 7 with optimization
make clean && make HEADER_CLASSIDX=1 larson_hakmem

# Run smoke test (1T)
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1

# Expected: 20-40M ops/s (vs 1M before)
```

**Success Criteria**:
- ✅ Throughput > 20M ops/s (20x improvement)
- ✅ No crashes (stability)

### Task 4: Full Validation (1-2 hours)

```bash
# Test multiple sizes
for size in 128 256 512 1024 2048; do
    echo "=== Testing size=$size ==="
    ./bench_random_mixed_hakmem 10000 $size 1234567
done

# Test Larson 4T (MT stability)
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: All pass, 20-60M ops/s
```

---

## 🎯 Expected Outcomes

### Performance Targets

| Benchmark | Before (7-1.2) | After (7-1.3) | Improvement |
|-----------|----------------|---------------|-------------|
| **bench_random_mixed** | 692K ops/s | **40-60M ops/s** | **58-87x** 🚀 |
| **larson_hakmem 1T** | 838K ops/s | **40-80M ops/s** | **48-95x** 🚀 |
| **larson_hakmem 4T** | 838K ops/s | **120-240M ops/s** | **143-286x** 🚀 |

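
The improvement column is the target throughput divided by the measured baseline. A one-line helper, named `sketch_speedup` here purely for illustration, lets the ranges be re-derived when new measurements replace the targets:

```c
/* Improvement factor of a throughput figure over a baseline.
 * Both arguments must use the same unit (for example ops/s). */
double sketch_speedup(double before_ops, double after_ops) {
    return after_ops / before_ops;
}
```

For example, `sketch_speedup(692e3, 40e6)` is about 57.8 and `sketch_speedup(692e3, 60e6)` is about 86.7, which round to the 58-87x shown for bench_random_mixed.
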
### vs System malloc

| Metric | System | HAKMEM (7-1.3) | Result |
|--------|--------|----------------|--------|
| **Tiny free** | 10-15 cycles | **1-2 cycles** | **5-15x faster** 🏆 |
| **Throughput** | 56M ops/s | **40-80M ops/s** | **70-140%** ✅ |

**Prediction**: **70-140% of System malloc** (on par to ahead!)

---

## 📁 Related Documents

### Task Agent Generated (Phase 7 Design Review)
- [`PHASE7_DESIGN_REVIEW.md`](PHASE7_DESIGN_REVIEW.md) - Complete technical analysis (23KB, 758 lines)
- [`PHASE7_ACTION_PLAN.md`](PHASE7_ACTION_PLAN.md) - Implementation guide (5.7KB, 235 lines)
- [`PHASE7_SUMMARY.md`](PHASE7_SUMMARY.md) - Executive summary (11KB, 302 lines)
- [`PHASE7_QUICKREF.txt`](PHASE7_QUICKREF.txt) - Quick reference (5.3KB)
- [`tests/micro_mincore_bench.c`](tests/micro_mincore_bench.c) - Micro-benchmark (4.5KB)

### Phase 7 History
- [`REGION_ID_DESIGN.md`](REGION_ID_DESIGN.md) - Complete design (Task Agent Opus Ultrathink)
- [`PAGE_BOUNDARY_SEGV_FIX.md`](PAGE_BOUNDARY_SEGV_FIX.md) - Phase 7-1.2 fix report
- [`CLAUDE.md#phase-7`](CLAUDE.md#phase-7-region-id-direct-lookup---ultra-fast-free-path-2025-11-08-) - Phase 7 overview

---

## 🛠️ Commands to Run

### Step 1: Implement Hybrid Optimization (1-2 hours)
```bash
# Edit 3 files (see Task 1 above):
# - core/tiny_free_fast_v2.inc.h
# - core/box/hak_free_api.inc.h
# - core/hakmem_internal.h
```

### Step 2: Validate Micro-Benchmark (30 min)
```bash
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected: HYBRID ~1-2 cycles ✅
```

### Step 3: Smoke Test (30 min)
```bash
make clean && make HEADER_CLASSIDX=1 larson_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: >20M ops/s ✅
```

### Step 4: Full Validation (1-2 hours)
```bash
# Random mixed sizes
./bench_random_mixed_hakmem 10000 1024 1234567

# Larson MT
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: 40-80M ops/s, no crashes ✅
```

---

## 📅 Timeline

- **Phase 7-1.3 (Hybrid Optimization)**: 1-2 hours ← **You are here!**
- **Validation & Testing**: 1-2 hours
- **Phase 7-2 (Full Benchmark vs mimalloc)**: 2-3 hours
- **Total**: **Beat System malloc in 4-7 hours** 🎉

---

## 🚦 Go/No-Go Decision

### Phase 7-1.2 Status: NO-GO ⛔
**Reason**: mincore overhead (634 cycles = 40x slower than System)

### Phase 7-1.3 Status: CONDITIONAL GO 🟡
**Condition**:
1. ✅ Hybrid implementation complete
2. ✅ Micro-benchmark shows 1-2 cycles
3. ✅ Larson smoke test >20M ops/s

**Risk**: LOW (proven by Task Agent micro-benchmark)

---

## ✅ Completed (through Phase 7-1.2)

### Phase 7-1.2: Page Boundary SEGV Fix (2025-11-08)
- ✅ `hak_is_memory_readable()` check before header read
- ✅ All benchmarks crash-free (1024B, 2048B, 4096B)
- ✅ Committed: `24beb34de`
- **Issue**: mincore overhead (634 cycles) → to be fixed in Phase 7-1.3

### Phase 7-1.1: Dual-Header Dispatch (2025-11-08)
- ✅ Task Agent contributions (header validation, malloc fallback)
- ✅ 16-byte AllocHeader dispatch
- ✅ Committed

### Phase 7-1.0: PoC Implementation (2025-11-08)
- ✅ 1-byte header system
- ✅ Ultra-fast free path (basic version)
- ✅ Initial results: +39%~+436%

---

**Next action: begin implementing the Phase 7-1.3 Hybrid Optimization!** 🚀