Files
hakmem/CURRENT_TASK.md
Moe Charm (CI) 4983352812 Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%)
## Summary
Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug.
Result: 2-3x performance improvement across all benchmarks.

## Performance Results
- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero 

## Changes

### 1. Hybrid mincore Optimization (317-634x faster)
**Problem**: `hak_is_memory_readable()` calls mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc

**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore
- Result: 634 → 1-2 cycles effective (99.6% skip mincore)

**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added

### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)
**Problem**: Macro definition order prevented Phase 7 header write
- hakmem_tiny.c:130 defined legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had `#ifndef` guard → skipped!
- Result: Headers NEVER written → All frees failed → Slow path

**Solution**: Force Phase 7 macro to override legacy
- hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine

### 3. Magic Byte Fix
**Problem**: Release builds don't write magic byte, but free ALWAYS checks it
- Result: All headers marked as invalid

**Solution**: ALWAYS write magic byte (same 1-byte write, no overhead)
- tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard

## Technical Details

### Hybrid mincore Effectiveness
| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (Step 1) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6 cycles = **317-396x faster!**

### Macro Fix Impact
**Before**: HAK_RET_ALLOC(cls, ptr) → return (ptr)  // No header write
**After**: HAK_RET_ALLOC(cls, ptr) → return tiny_region_id_write_header((ptr), (cls))

**Result**: Headers properly written → Fast path works → +194-333% performance

## Investigation
Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)

Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)

## Related
- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 04:50:41 +09:00

322 lines
9.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Current Task 2025-11-08
## 🚀 Phase 7-1.3: Hybrid mincore Optimization - System malloc に勝つ準備
### ミッション
**Phase 7 の CRITICAL BOTTLENECK を修正**
- **Current**: 634 cycles/free (mincore overhead)
- **Target**: 1-2 cycles/free (hybrid approach)
- **Improvement**: **317-634x faster!** 🚀
- **Strategy**: Alignment check (fast) + mincore fallback (rare)
---
## 📊 Phase 7-1.2 完了状況
### ✅ 完了済み
1. **Phase 7-1.0**: PoC 実装 (+39%~+436% improvement)
2. **Phase 7-1.1**: Dual-header dispatch (Task Agent)
3. **Phase 7-1.2**: Page boundary SEGV fix (100% crash-free)
### 📈 達成した成果
- ✅ 1-byte header system 動作確認
- ✅ Dual-header dispatch (Tiny + malloc/mmap)
- ✅ Page boundary 安全性確保
- ✅ All benchmarks crash-free
### 🔥 発見された CRITICAL 問題
**Task Agent Ultrathink Analysis (Phase 7 Design Review) の結果:**
**Bottleneck**: `hak_is_memory_readable()`**すべての free()** で mincore() を呼ぶ
- **Measured Cost**: 634 cycles/call
- **System tcache**: 10-15 cycles
- **Result**: Phase 7 は System malloc の **1/40 の速度** 💀
**Why This Happened:**
- Page boundary SEGV を防ぐため、`ptr-1` の readability を確認
- しかし page boundary は **<0.1%** の頻度
- **99.9%** normal case でも 634 cycles 払っている
---
## ✅ 解決策: Hybrid mincore Optimization
### Concept
**Fast path (alignment check) + Slow path (mincore fallback)**
```c
// Before (slow): すべての free で mincore
if (!hak_is_memory_readable(ptr-1)) return 0; // 634 cycles
// After (fast): 99.9% はアライメントチェックのみ
if (((uintptr_t)ptr & 0xFFF) == 0) { // 1-2 cycles
// Page boundary (0.1%): Safety check
if (!hak_is_memory_readable(ptr-1)) return 0; // 634 cycles
}
// Normal case (99.9%): Direct header read
```
### Performance Impact
| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (not boundary) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |
**Improvement**: 634 1.6 cycles = **317-396x faster!**
### Micro-Benchmark Results (Task Agent)
```
[MINCORE] Mapped memory: 634 cycles/call ← Current
[ALIGN] Alignment check: 0 cycles/call
[HYBRID] Align + mincore: 1 cycles/call ← Optimized!
[BOUNDARY] Page boundary: 2155 cycles/call (rare, <0.1%)
```
---
## 📋 実装計画Phase 7-1.3
### Task 1: Implement Hybrid mincore (1-2 hours)
**File 1**: `core/tiny_free_fast_v2.inc.h:53-60`
**Before**:
```c
// CRITICAL: Check if header location (ptr-1) is accessible before reading
void* header_addr = (char*)ptr - 1;
extern int hak_is_memory_readable(void* addr);
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
// Header not accessible - route to slow path
return 0;
}
```
**After**:
```c
// CRITICAL: Fast check for page boundaries (0.1% case)
// Most allocations (99.9%) are NOT at page boundaries, so check alignment first
void* header_addr = (char*)ptr - 1;
if (__builtin_expect(((uintptr_t)ptr & 0xFFF) == 0, 0)) {
// Potential page boundary - do safety check
extern int hak_is_memory_readable(void* addr);
if (!hak_is_memory_readable(header_addr)) {
// Header not accessible - route to slow path
return 0;
}
}
// Normal case (99.9%): header is safe to read (no mincore call!)
```
**File 2**: `core/box/hak_free_api.inc.h:96` (Step 2 dual-header dispatch)
**Before**:
```c
// SAFETY: Check if raw header is accessible before dereferencing
if (hak_is_memory_readable(raw)) {
AllocHeader* hdr = (AllocHeader*)raw;
// ...
}
```
**After**:
```c
// SAFETY: Fast check for page boundaries first
if (((uintptr_t)raw & 0xFFF) == 0) {
// Potential page boundary - do safety check
if (!hak_is_memory_readable(raw)) {
goto slow_path;
}
}
// Normal case: raw header is safe to read
AllocHeader* hdr = (AllocHeader*)raw;
// ...
```
**File 3**: Add comment to `core/hakmem_internal.h:277-294`
```c
// NOTE: This function is expensive (634 cycles via mincore syscall).
// Use alignment check first to avoid calling this on normal allocations:
// if (((uintptr_t)ptr & 0xFFF) == 0) {
// if (!hak_is_memory_readable(ptr)) { /* handle page boundary */ }
// }
static inline int hak_is_memory_readable(void* addr) {
// ... existing implementation
}
```
### Task 2: Validate with Micro-Benchmark (30 min)
**File**: `tests/micro_mincore_bench.c` (already created by Task Agent)
```bash
# Build and run micro-benchmark
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected output:
# [MINCORE] Mapped memory: 634 cycles/call
# [ALIGN] Alignment check: 0 cycles/call
# [HYBRID] Align + mincore: 1 cycles/call ← Target!
```
**Success Criteria**:
- HYBRID shows ~1-2 cycles (vs 634 before)
### Task 3: Smoke Test with Larson (30 min)
```bash
# Rebuild Phase 7 with optimization
make clean && make HEADER_CLASSIDX=1 larson_hakmem
# Run smoke test (1T)
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: 20-40M ops/s (vs 1M before)
```
**Success Criteria**:
- Throughput > 20M ops/s (20x improvement)
- ✅ No crashes (stability)
### Task 4: Full Validation (1-2 hours)
```bash
# Test multiple sizes
for size in 128 256 512 1024 2048; do
echo "=== Testing size=$size ==="
./bench_random_mixed_hakmem 10000 $size 1234567
done
# Test Larson 4T (MT stability)
./larson_hakmem 10 8 128 1024 1 12345 4
# Expected: All pass, 20-60M ops/s
```
---
## 🎯 Expected Outcomes
### Performance Targets
| Benchmark | Before (7-1.2) | After (7-1.3) | Improvement |
|-----------|----------------|---------------|-------------|
| **bench_random_mixed** | 692K ops/s | **40-60M ops/s** | **58-87x** 🚀 |
| **larson_hakmem 1T** | 838K ops/s | **40-80M ops/s** | **48-95x** 🚀 |
| **larson_hakmem 4T** | 838K ops/s | **120-240M ops/s** | **143-286x** 🚀 |
### vs System malloc
| Metric | System | HAKMEM (7-1.3) | Result |
|--------|--------|----------------|--------|
| **Tiny free** | 10-15 cycles | **1-2 cycles** | **5-15x faster** 🏆 |
| **Throughput** | 56M ops/s | **40-80M ops/s** | **70-140%** ✅ |
**Prediction**: **70-140% of System malloc** (互角〜勝ち!)
---
## 📁 関連ドキュメント
### Task Agent Generated (Phase 7 Design Review)
- [`PHASE7_DESIGN_REVIEW.md`](PHASE7_DESIGN_REVIEW.md) - 完全な技術分析 (23KB, 758 lines)
- [`PHASE7_ACTION_PLAN.md`](PHASE7_ACTION_PLAN.md) - 実装ガイド (5.7KB, 235 lines)
- [`PHASE7_SUMMARY.md`](PHASE7_SUMMARY.md) - エグゼクティブサマリー (11KB, 302 lines)
- [`PHASE7_QUICKREF.txt`](PHASE7_QUICKREF.txt) - クイックリファレンス (5.3KB)
- [`tests/micro_mincore_bench.c`](tests/micro_mincore_bench.c) - Micro-benchmark (4.5KB)
### Phase 7 History
- [`REGION_ID_DESIGN.md`](REGION_ID_DESIGN.md) - 完全設計Task Agent Opus Ultrathink
- [`PAGE_BOUNDARY_SEGV_FIX.md`](PAGE_BOUNDARY_SEGV_FIX.md) - Phase 7-1.2 修正レポート
- [`CLAUDE.md#phase-7`](CLAUDE.md#phase-7-region-id-direct-lookup---ultra-fast-free-path-2025-11-08-) - Phase 7 概要
---
## 🛠️ 実行コマンド
### Step 1: Implement Hybrid Optimization (1-2 hours)
```bash
# Edit 3 files (see Task 1 above):
# - core/tiny_free_fast_v2.inc.h
# - core/box/hak_free_api.inc.h
# - core/hakmem_internal.h
```
### Step 2: Validate Micro-Benchmark (30 min)
```bash
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected: HYBRID ~1-2 cycles ✅
```
### Step 3: Smoke Test (30 min)
```bash
make clean && make HEADER_CLASSIDX=1 larson_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: >20M ops/s ✅
```
### Step 4: Full Validation (1-2 hours)
```bash
# Random mixed sizes
./bench_random_mixed_hakmem 10000 1024 1234567
# Larson MT
./larson_hakmem 10 8 128 1024 1 12345 4
# Expected: 40-80M ops/s, no crashes ✅
```
---
## 📅 Timeline
- **Phase 7-1.3 (Hybrid Optimization)**: 1-2時間 ← **今ここ!**
- **Validation & Testing**: 1-2時間
- **Phase 7-2 (Full Benchmark vs mimalloc)**: 2-3時間
- **Total**: **4-6時間で System malloc に勝つ** 🎉
---
## 🚦 Go/No-Go Decision
### Phase 7-1.2 Status: NO-GO ⛔
**Reason**: mincore overhead (634 cycles = 40x slower than System)
### Phase 7-1.3 Status: CONDITIONAL GO 🟡
**Condition**:
1. ✅ Hybrid implementation complete
2. ✅ Micro-benchmark shows 1-2 cycles
3. ✅ Larson smoke test >20M ops/s
**Risk**: LOW (proven by Task Agent micro-benchmark)
---
## ✅ 完了済みPhase 7-1.2 まで)
### Phase 7-1.2: Page Boundary SEGV Fix (2025-11-08)
-`hak_is_memory_readable()` check before header read
- ✅ All benchmarks crash-free (1024B, 2048B, 4096B)
- ✅ Committed: `24beb34de`
- **Issue**: mincore overhead (634 cycles) → Phase 7-1.3 で修正
### Phase 7-1.1: Dual-Header Dispatch (2025-11-08)
- ✅ Task Agent contributions (header validation, malloc fallback)
- ✅ 16-byte AllocHeader dispatch
- ✅ Committed
### Phase 7-1.0: PoC Implementation (2025-11-08)
- ✅ 1-byte header system
- ✅ Ultra-fast free path (basic version)
- ✅ Initial results: +39%~+436%
---
**次のアクション: Phase 7-1.3 Hybrid Optimization 実装開始!** 🚀