hakmem/CURRENT_TASK.md
Moe Charm (CI) 4983352812 Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%)
## Summary
Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug.
Result: roughly 3-4x throughput (+194% to +333%) on the benchmarks measured below.

## Performance Results
- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero 

## Changes

### 1. Hybrid mincore Optimization (317-634x faster)
**Problem**: `hak_is_memory_readable()` calls mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc

**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore
- Result: 634 → 1-2 cycles effective (99.6% skip mincore)
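
The two alignment tests above can be sketched as small predicates. This is a minimal illustration only, assuming 4 KiB pages and a 16-byte `HEADER_SIZE`; the names `DEMO_PAGE_OFFSET_MASK`, `needs_probe_1byte`, and `needs_probe_header` are hypothetical and are not taken from the hakmem sources.

```c
#include <stdint.h>

#define DEMO_PAGE_OFFSET_MASK 0xFFFu  /* assumes 4 KiB pages */
#define HEADER_SIZE           16u     /* assumed size of the 16-byte header */

/* Step 1: the 1-byte header lives at ptr-1, which falls on the previous
 * page only when ptr itself is page-aligned. */
static inline int needs_probe_1byte(const void *ptr) {
    return ((uintptr_t)ptr & DEMO_PAGE_OFFSET_MASK) == 0;
}

/* Step 2: a HEADER_SIZE-byte header ends at ptr-1, so it reaches into the
 * previous page whenever ptr lies in the first HEADER_SIZE bytes of a page. */
static inline int needs_probe_header(const void *ptr) {
    return ((uintptr_t)ptr & DEMO_PAGE_OFFSET_MASK) < HEADER_SIZE;
}
```

Only when a predicate returns non-zero would the caller fall back to the expensive `hak_is_memory_readable()` probe; in every other case the header shares a page with `ptr` and can be read directly.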

**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added

### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)
**Problem**: Macro definition order prevented Phase 7 header write
- hakmem_tiny.c:130 defined legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had `#ifndef` guard → skipped!
- Result: Headers NEVER written → All frees failed → Slow path

**Solution**: Force Phase 7 macro to override legacy
- hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine
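
The definition-order problem can be reproduced in a single translation unit. The sketch below is an assumption-laden illustration, not the actual hakmem code: `demo_write_header` stands in for `tiny_region_id_write_header()`, and `demo_alloc` is a hypothetical caller.

```c
#include <stdint.h>

/* Hypothetical stand-in for tiny_region_id_write_header(): stores the class
 * index in the byte at `base` and returns the address just past it. */
static inline void *demo_write_header(void *base, int cls) {
    *(uint8_t *)base = (uint8_t)cls;
    return (uint8_t *)base + 1;
}

/* Legacy definition (appears earlier in the translation unit): no header
 * write. The guard lets a definition that is already in place win. */
#ifndef HAK_RET_ALLOC
#define HAK_RET_ALLOC(cls, ptr) return (ptr)
#endif

/* Phase 7 definition (included later). An #ifndef guard here would be
 * skipped because the legacy macro already exists, so remove it first. */
#ifdef HAK_RET_ALLOC
#undef HAK_RET_ALLOC
#endif
#define HAK_RET_ALLOC(cls, ptr) return demo_write_header((ptr), (cls))

/* Hypothetical allocation exit point that uses the macro. */
static void *demo_alloc(void *slot, int cls) {
    HAK_RET_ALLOC(cls, slot);
}
```

With the `#undef` in place, `demo_alloc` writes the header byte and returns the address after it; replacing the `#undef` block with an `#ifndef` guard would silently keep the legacy expansion.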

### 3. Magic Byte Fix
**Problem**: Release builds don't write magic byte, but free ALWAYS checks it
- Result: All headers marked as invalid

**Solution**: ALWAYS write magic byte (same 1-byte write, no overhead)
- tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard
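
To show why an unconditional magic write costs nothing extra, here is a minimal sketch of a 1-byte header that packs a magic value and a class index into the same store. The bit layout, the `0xA0` magic value, and the `demo_hdr_*` names are assumptions made for illustration; the real layout in `tiny_region_id.h` is not shown in this document.

```c
#include <stdint.h>

#define DEMO_HDR_MAGIC      0xA0u  /* hypothetical magic value (high nibble) */
#define DEMO_HDR_MAGIC_MASK 0xF0u
#define DEMO_HDR_CLASS_MASK 0x0Fu

/* Write the header unconditionally: magic and class index share one byte,
 * so release builds pay the same single store as debug builds. */
static inline void *demo_hdr_write(void *base, int cls) {
    uint8_t *hdr = (uint8_t *)base;
    *hdr = (uint8_t)(DEMO_HDR_MAGIC | ((unsigned)cls & DEMO_HDR_CLASS_MASK));
    return hdr + 1;
}

/* Free-side check: returns the class index, or -1 if the magic is missing. */
static inline int demo_hdr_read(const void *user_ptr) {
    uint8_t hdr = *((const uint8_t *)user_ptr - 1);
    if ((hdr & DEMO_HDR_MAGIC_MASK) != DEMO_HDR_MAGIC) return -1;
    return (int)(hdr & DEMO_HDR_CLASS_MASK);
}
```

If the writer skipped the magic bits (as the release build did before this fix), `demo_hdr_read` would return -1 for every block, which matches the flood of invalid-header results described above.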

## Technical Details

### Hybrid mincore Effectiveness
| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (Step 1) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |

**Improvement**: 634 → 1.6-2.6 cycles = **roughly 244-396x faster!**
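
The weighted figure follows from a simple expected-cost calculation. The worked version below uses only the frequencies and per-case costs from the table, rounded the same way; the symbol names are introduced here for illustration.

```latex
\begin{align*}
E[\text{cost}] &= p_{\text{normal}}\, c_{\text{align}} + p_{\text{boundary}}\, c_{\text{mincore}} \\
               &\approx 0.999 \times (1 \text{ to } 2) + 0.001 \times 634 \\
               &\approx 1.6 \text{ to } 2.6 \ \text{cycles} \\[4pt]
\text{speedup} &= \frac{634}{E[\text{cost}]}
                \approx \frac{634}{2.6} \text{ to } \frac{634}{1.6}
                \approx 244\times \text{ to } 396\times
\end{align*}
```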

### Macro Fix Impact
**Before**: HAK_RET_ALLOC(cls, ptr) → return (ptr)  // No header write
**After**: HAK_RET_ALLOC(cls, ptr) → return tiny_region_id_write_header((ptr), (cls))

**Result**: Headers properly written → Fast path works → +194-333% performance

## Investigation
Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)

Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)

## Related
- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 04:50:41 +09:00


Current Task 2025-11-08

🚀 Phase 7-1.3: Hybrid mincore Optimization - Preparing to Beat System malloc

Mission

Fix the CRITICAL BOTTLENECK in Phase 7

  • Current: 634 cycles/free (mincore overhead)
  • Target: 1-2 cycles/free (hybrid approach)
  • Improvement: 317-634x faster! 🚀
  • Strategy: Alignment check (fast) + mincore fallback (rare)

📊 Phase 7-1.2 Completion Status

Completed

  1. Phase 7-1.0: PoC implementation (+39%~+436% improvement)
  2. Phase 7-1.1: Dual-header dispatch (Task Agent)
  3. Phase 7-1.2: Page boundary SEGV fix (100% crash-free)

📈 Achievements

  • 1-byte header system verified working
  • Dual-header dispatch (Tiny + malloc/mmap)
  • Page boundary safety ensured
  • All benchmarks crash-free

🔥 CRITICAL Issue Discovered

Results of the Task Agent Ultrathink Analysis (Phase 7 Design Review):

Bottleneck: hak_is_memory_readable() calls mincore() on every free()

  • Measured Cost: 634 cycles/call
  • System tcache: 10-15 cycles
  • Result: Phase 7 runs at about 1/40 the speed of System malloc 💀

Why This Happened:

  • To prevent page boundary SEGVs, the readability of ptr-1 is checked before the header is read
  • But page boundaries account for fewer than 0.1% of frees
  • The 99.9% normal case still pays the full 634 cycles
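
For context on where the cost comes from, the following is a minimal sketch of a mincore()-based probe on Linux. It is not the actual `hak_is_memory_readable()` implementation, which this document does not show; `demo_is_mapped` is a hypothetical name. Note that mincore() reports whether a page is mapped, which is only a proxy for readability (a `PROT_NONE` mapping would still pass).

```c
#define _DEFAULT_SOURCE   /* exposes mincore() in glibc's <sys/mman.h> */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing addr is mapped, 0 otherwise.
 * mincore() fails with ENOMEM when the range includes unmapped memory. */
static int demo_is_mapped(const void *addr) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return 0;
    /* mincore() requires a page-aligned start address. */
    uintptr_t base = (uintptr_t)addr & ~((uintptr_t)page - 1);
    unsigned char vec;  /* one status byte is enough for a single page */
    return mincore((void *)base, 1, &vec) == 0;
}
```

Every call crosses into the kernel, which is why paying for it on each free() dominates the fast path even though the answer is almost always "mapped".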

Solution: Hybrid mincore Optimization

Concept

Fast path (alignment check) + Slow path (mincore fallback)

// Before (slow): mincore on every free
if (!hak_is_memory_readable((char*)ptr - 1)) return 0;  // 634 cycles

// After (fast): 99.9% of frees do only the alignment check
if (((uintptr_t)ptr & 0xFFF) == 0) {           // 1-2 cycles
    // Page boundary (0.1%): Safety check
    if (!hak_is_memory_readable((char*)ptr - 1)) return 0;  // 634 cycles
}
// Normal case (99.9%): Direct header read

Performance Impact

| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (not boundary) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| Total | - | - | 1.6-2.6 cycles |

Improvement: 634 → 1.6-2.6 cycles = roughly 244-396x faster!

Micro-Benchmark Results (Task Agent)

[MINCORE] Mapped memory:   634 cycles/call  ← Current
[ALIGN]   Alignment check: 0 cycles/call
[HYBRID]  Align + mincore:  1 cycles/call   ← Optimized!
[BOUNDARY] Page boundary:  2155 cycles/call (rare, <0.1%)

📋 Implementation Plan (Phase 7-1.3)

Task 1: Implement Hybrid mincore (1-2 hours)

File 1: core/tiny_free_fast_v2.inc.h:53-60

Before:

// CRITICAL: Check if header location (ptr-1) is accessible before reading
void* header_addr = (char*)ptr - 1;
extern int hak_is_memory_readable(void* addr);
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
    // Header not accessible - route to slow path
    return 0;
}

After:

// CRITICAL: Fast check for page boundaries (0.1% case)
// Most allocations (99.9%) are NOT at page boundaries, so check alignment first
void* header_addr = (char*)ptr - 1;
if (__builtin_expect(((uintptr_t)ptr & 0xFFF) == 0, 0)) {
    // Potential page boundary - do safety check
    extern int hak_is_memory_readable(void* addr);
    if (!hak_is_memory_readable(header_addr)) {
        // Header not accessible - route to slow path
        return 0;
    }
}
// Normal case (99.9%): header is safe to read (no mincore call!)

File 2: core/box/hak_free_api.inc.h:96 (Step 2 dual-header dispatch)

Before:

// SAFETY: Check if raw header is accessible before dereferencing
if (hak_is_memory_readable(raw)) {
    AllocHeader* hdr = (AllocHeader*)raw;
    // ...
}

After:

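// SAFETY: Fast check for page boundaries first.
// Assumes raw == (char*)ptr - HEADER_SIZE, so the header reaches into the
// previous page only when ptr lies within HEADER_SIZE bytes of a page start.
if (((uintptr_t)ptr & 0xFFF) < HEADER_SIZE) {
    // Potential page boundary - do safety check
    if (!hak_is_memory_readable(raw)) {
        goto slow_path;
    }
}
// Normal case: raw header is safe to read
AllocHeader* hdr = (AllocHeader*)raw;
// ...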

File 3: Add comment to core/hakmem_internal.h:277-294

// NOTE: This function is expensive (634 cycles via mincore syscall).
// Use alignment check first to avoid calling this on normal allocations:
//   if (((uintptr_t)ptr & 0xFFF) == 0) {
//       if (!hak_is_memory_readable(ptr)) { /* handle page boundary */ }
//   }
static inline int hak_is_memory_readable(void* addr) {
    // ... existing implementation
}

Task 2: Validate with Micro-Benchmark (30 min)

File: tests/micro_mincore_bench.c (already created by Task Agent)

# Build and run micro-benchmark
gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench

# Expected output:
# [MINCORE] Mapped memory:   634 cycles/call
# [ALIGN]   Alignment check: 0 cycles/call
# [HYBRID]  Align + mincore:  1 cycles/call  ← Target!

Success Criteria:

  • HYBRID shows ~1-2 cycles (vs 634 before)
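
A timing harness for this kind of comparison can be sketched as follows. This is not the contents of `tests/micro_mincore_bench.c`, which is not reproduced here; the `demo_*` names are hypothetical, and the sketch reports wall-clock nanoseconds per call rather than cycles, so its numbers are only comparable to each other.

```c
#define _POSIX_C_SOURCE 200809L  /* clock_gettime, CLOCK_MONOTONIC */
#include <stdint.h>
#include <time.h>

typedef int (*demo_probe_fn)(const void *);

/* Cheap probe used as the baseline: the alignment test from the hybrid path. */
static int demo_align_probe(const void *p) {
    return ((uintptr_t)p & 0xFFFu) == 0;
}

/* Average wall-clock nanoseconds per call of fn(arg) over iters calls. */
static double demo_ns_per_call(demo_probe_fn fn, const void *arg, long iters) {
    struct timespec t0, t1;
    volatile int sink = 0;  /* keeps the calls from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) sink += fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

Passing a mincore-based probe and the alignment probe through the same `demo_ns_per_call` loop gives a like-for-like ratio, which is the quantity the success criterion above cares about.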

Task 3: Smoke Test with Larson (30 min)

# Rebuild Phase 7 with optimization
make clean && make HEADER_CLASSIDX=1 larson_hakmem

# Run smoke test (1T)
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1

# Expected: 20-40M ops/s (vs 1M before)

Success Criteria:

  • Throughput > 20M ops/s (20x improvement)
  • No crashes (stability)

Task 4: Full Validation (1-2 hours)

# Test multiple sizes
for size in 128 256 512 1024 2048; do
    echo "=== Testing size=$size ==="
    ./bench_random_mixed_hakmem 10000 $size 1234567
done

# Test Larson 4T (MT stability)
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: All pass, 20-60M ops/s

🎯 Expected Outcomes

Performance Targets

| Benchmark | Before (7-1.2) | After (7-1.3) | Improvement |
|-----------|----------------|---------------|-------------|
| bench_random_mixed | 692K ops/s | 40-60M ops/s | 58-87x 🚀 |
| larson_hakmem 1T | 838K ops/s | 40-80M ops/s | 48-95x 🚀 |
| larson_hakmem 4T | 838K ops/s | 120-240M ops/s | 143-286x 🚀 |

vs System malloc

| Metric | System | HAKMEM (7-1.3) | Result |
|--------|--------|----------------|--------|
| Tiny free | 10-15 cycles | 1-2 cycles | 5-15x faster 🏆 |
| Throughput | 56M ops/s | 40-80M ops/s | 70-140% |

Prediction: 70-140% of System malloc (on par to ahead!)


📁 Related Documents

Task Agent Generated (Phase 7 Design Review)

Phase 7 History


🛠️ Commands to Run

Step 1: Implement Hybrid Optimization (1-2 hours)

# Edit 3 files (see Task 1 above):
# - core/tiny_free_fast_v2.inc.h
# - core/box/hak_free_api.inc.h
# - core/hakmem_internal.h

Step 2: Validate Micro-Benchmark (30 min)

gcc -O3 -o micro_mincore_bench tests/micro_mincore_bench.c
./micro_mincore_bench
# Expected: HYBRID ~1-2 cycles ✅

Step 3: Smoke Test (30 min)

make clean && make HEADER_CLASSIDX=1 larson_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 ./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: >20M ops/s ✅

Step 4: Full Validation (1-2 hours)

# Random mixed sizes
./bench_random_mixed_hakmem 10000 1024 1234567

# Larson MT
./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: 40-80M ops/s, no crashes ✅

📅 Timeline

  • Phase 7-1.3 (Hybrid Optimization): 1-2 hours ← currently here!
  • Validation & Testing: 1-2 hours
  • Phase 7-2 (Full Benchmark vs mimalloc): 2-3 hours
  • Total: beat System malloc in 4-6 hours 🎉

🚦 Go/No-Go Decision

Phase 7-1.2 Status: NO-GO

Reason: mincore overhead (634 cycles = 40x slower than System)

Phase 7-1.3 Status: CONDITIONAL GO 🟡

Condition:

  1. Hybrid implementation complete
  2. Micro-benchmark shows 1-2 cycles
  3. Larson smoke test >20M ops/s

Risk: LOW (proven by Task Agent micro-benchmark)


Completed (through Phase 7-1.2)

Phase 7-1.2: Page Boundary SEGV Fix (2025-11-08)

  • hak_is_memory_readable() check before header read
  • All benchmarks crash-free (1024B, 2048B, 4096B)
  • Committed: 24beb34de
  • Issue: mincore overhead (634 cycles) → fixed in Phase 7-1.3

Phase 7-1.1: Dual-Header Dispatch (2025-11-08)

  • Task Agent contributions (header validation, malloc fallback)
  • 16-byte AllocHeader dispatch
  • Committed

Phase 7-1.0: PoC Implementation (2025-11-08)

  • 1-byte header system
  • Ultra-fast free path (basic version)
  • Initial results: +39%~+436%

Next action: Begin implementing the Phase 7-1.3 Hybrid Optimization! 🚀