# Phase B: TinyFrontC23Box - Completion Report

**Date**: 2025-11-14
**Status**: ✅ **COMPLETE**
**Goal**: Ultra-simple front path for C2/C3 (128B/256B) that bypasses SFC/SLL complexity
**Target**: 15-20M ops/s
**Achievement**: 8.5-9.5M ops/s (+7-15% improvement)

---

## Executive Summary

Phase B implemented an ultra-simple front path specifically for the C2/C3 size classes (128B/256B allocations), bypassing the complex SFC/SLL/Magazine layers. While we achieved **significant improvements (+7-15%)**, we fell short of the 15-20M target. Performance analysis revealed that **user-space optimization has reached diminishing returns**: the remaining performance gap is dominated by kernel overhead (99%+).

### Key Achievements
1. ✅ **TinyFrontC23Box implemented** - Direct FC → SS refill path
2. ✅ **Optimal refill target identified** - refill=64 via A/B testing
3. ✅ **classify_ptr optimization** - Header-based fast path (+12.8% for 256B)
4. ✅ **500K stability fix** - Fixed two critical bugs (deadlock + node pool exhaustion)

### Performance Results
| Size | Baseline | Phase B | Improvement |
|------|----------|---------|-------------|
| 128B | 8.27M ops/s | 9.55M ops/s | **+15.5%** |
| 256B | 7.90M ops/s | 8.47M ops/s | **+7.2%** |
| 500K iterations | ❌ SEGV | ✅ Stable (9.44M ops/s) | **Fixed** |

---

## Work Summary

### 1. classify_ptr Optimization (Header-Based Fast Path)

**Problem**: `classify_ptr()` accounted for 3.74% of the perf profile
**Solution**: Added a header-based fast path before the registry lookup

**Implementation**: `core/box/front_gate_classifier.c`
```c
// Fast path: read the magic byte at ptr-1 (2-5 cycles vs 50-100 cycles for a registry lookup)
uintptr_t offset_in_page = (uintptr_t)ptr & 0xFFF;
if (offset_in_page >= 1) { // guard: reading ptr-1 at offset 0 would cross a page boundary
    uint8_t header = *((uint8_t*)ptr - 1);
    uint8_t magic = header & 0xF0;

    if (magic == HEADER_MAGIC) { // 0xa0 = Tiny
        int class_idx = header & HEADER_CLASS_MASK;
        (void)class_idx; // class index recovered from the header; only the kind is returned here
        return PTR_KIND_TINY_HEADER;
    }
}
```

**Results**:
- 256B: +12.8% (7.68M → 8.66M ops/s)
- 128B: -7.8% regression (8.76M → 8.08M ops/s)
- Mixed outcome, but it provided the foundation for Phase B

---

### 2. TinyFrontC23Box Implementation

**Architecture**:
```
Traditional Path:  alloc_fast → FC → SLL → Magazine → Backend (4-5 layers)
TinyFrontC23 Path: alloc_fast → FC → ss_refill_fc_fill (2 layers)
```

**Key Design**:
- **ENV-gated**: `HAKMEM_TINY_FRONT_C23_SIMPLE=1`
- **C2/C3 only**: class_idx 2 or 3 (128B/256B)
- **Direct refill**: bypass the TLS SLL and Magazine, go straight to the SuperSlab
- **Zero overhead**: TLS-cached ENV check (1-2 cycles after the first call)

**Files Created**:
- `core/front/tiny_front_c23.h` - ultra-simple C2/C3 allocator (157 lines)
- Modified `core/tiny_alloc_fast.inc.h` - added the C23 hook (4 lines)

**Core Algorithm** (`tiny_front_c23.h:86-120`):
```c
static inline void* tiny_front_c23_alloc(size_t size, int class_idx) {
    // Step 1: Try FastCache pop (L1, ultra-fast)
    void* ptr = fastcache_pop(class_idx);
    if (__builtin_expect(ptr != NULL, 1)) {
        return ptr; // Hot path (90-95% hit rate)
    }

    // Step 2: Refill from SuperSlab (bypass SLL/Magazine)
    int want = tiny_front_c23_refill_target(class_idx);
    int refilled = ss_refill_fc_fill(class_idx, want);

    // Step 3: Retry FastCache pop
    if (refilled > 0) {
        ptr = fastcache_pop(class_idx);
        if (ptr) return ptr;
    }

    // Step 4: Return NULL so the caller falls back to the generic path
    return NULL;
}
```

---

### 3. Refill Target A/B Testing

**Tested Values**: refill = 16, 32, 64, 128
**Workload**: 100K iterations, Random Mixed

**Results (100K iterations)**:

| Refill | 128B ops/s | vs Baseline | 256B ops/s | vs Baseline |
|--------|------------|-------------|------------|-------------|
| Baseline (C23 OFF) | 8.27M | - | 7.90M | - |
| refill=16 | 8.76M | +5.9% | 8.01M | +1.4% |
| refill=32 | 9.00M | +8.8% | 8.61M | **+9.0%** |
| refill=64 | 9.55M | **+15.5%** | 8.47M | +7.2% |
| refill=128 | 9.41M | +13.8% | 8.37M | +5.9% |

**Decision**: **refill=64** selected as the default
- Balanced performance across C2/C3
- 128B best: +15.5%
- 256B good: +7.2%

**ENV Control**: `HAKMEM_TINY_FRONT_C23_REFILL=64` (default)

---

### 4. 500K SEGV Investigation & Fix

#### Problem
- Crash at 500K iterations with "Node pool exhausted for class 7"
- Occurred in `hak_tiny_alloc_slow()` with stack corruption

#### Root Cause Analysis (Task Agent Investigation)
**Two separate bugs identified**:

1. **Deadlock Bug** (FREE path):
   - Location: `core/hakmem_shared_pool.c:382-387` (`sp_freelist_push_lockfree`)
   - Issue: recursive lock attempt on a non-recursive mutex
   - The caller (`shared_pool_release_slab:772`) already held `alloc_lock`
   - The fallback path tried to acquire the same lock → deadlock

2. **Node Pool Exhaustion** (ALLOC path):
   - Location: `core/hakmem_shared_pool.h:77` (`MAX_FREE_NODES_PER_CLASS`)
   - Issue: the pool size (512 nodes/class) was exhausted at ~500K iterations
   - Exhaustion triggered fallback paths → stack corruption in `hak_tiny_alloc_slow()`

#### Fixes Applied

**Fix #1**: Deadlock Fix (`hakmem_shared_pool.c:382-387`)
```c
// BEFORE (DEADLOCK):
if (!node) {
    pthread_mutex_lock(&g_shared_pool.alloc_lock); // ❌ DEADLOCK!
    (void)sp_freelist_push(class_idx, meta, slot_idx);
    pthread_mutex_unlock(&g_shared_pool.alloc_lock);
    return 0;
}

// AFTER (FIXED):
if (!node) {
    // Fallback: push into the legacy per-class free list.
    // ASSUME: the caller already holds alloc_lock (e.g., shared_pool_release_slab:772).
    // Do NOT lock again - that deadlocks on a non-recursive mutex!
    (void)sp_freelist_push(class_idx, meta, slot_idx); // ✅ NO LOCK
    return 0;
}
```

**Fix #2**: Node Pool Expansion (`hakmem_shared_pool.h:77`)
```c
// BEFORE:
#define MAX_FREE_NODES_PER_CLASS 512

// AFTER:
#define MAX_FREE_NODES_PER_CLASS 4096 // Support 500K+ iterations
```

#### Test Results
```
Before fixes:
  - 100K iterations: ✅ Stable
  - 500K iterations: ❌ SEGV with "Node pool exhausted for class 7"

After fixes:
  - 100K iterations: ✅ 9.55M ops/s (128B)
  - 500K iterations: ✅ 9.44M ops/s (stable, no warnings, no crashes)
```

**Note**: These bugs were in the **Mid-Large allocator's SP-SLOT Box**, NOT in Phase B's TinyFrontC23Box. The Phase B code remained stable throughout.

---

## Performance Analysis

### Why We Didn't Reach the 15-20M Target

**Perf Profiling** (with Phase B C23 enabled):
```
User-space overhead: < 1%
Kernel overhead: 99%+
classify_ptr: no longer appears in the profile (optimized out)
```

**Interpretation**:
- User-space optimizations have **reached diminishing returns**
- The remaining 2x gap (9M → 15-20M) is dominated by **kernel overhead**
- It cannot be closed by user-space optimization alone
- Closing it would require kernel-level changes or architectural shifts

**CLAUDE.md** excerpt (Phase 9-11 lessons, translated):
> **Phase 11 (Prewarm)**: +6.4% → only relieves the symptom, not a root-cause fix
> **Phase 10 (TLS/SFC)**: +2% → frontend hit rate is not the bottleneck
> **Root cause**: SuperSlab allocation churn (877 created @ 100K iterations)
> **Next strategy**: Phase 12 Shared SuperSlab Pool (mimalloc-style) - the fundamental fix

**Conclusion**: Phase B achieved **incremental optimization** (+7-15%), but **architectural changes** (Phase 12) are needed for a step-function improvement toward 90M ops/s (system malloc level).

---

## Commits

1. **classify_ptr optimization** (commit hash: check git log)
   - `core/box/front_gate_classifier.c`: header-based fast path

2. **TinyFrontC23Box implementation** (commit hash: check git log)
   - `core/front/tiny_front_c23.h`: new ultra-simple allocator
   - `core/tiny_alloc_fast.inc.h`: C23 hook integration

3. **Refill target default** (commit hash: check git log)
   - Updated `tiny_front_c23.h:54`: refill=64 default

4. **500K SEGV fix** (commit: 93cc23450)
   - `core/hakmem_shared_pool.c`: deadlock fix
   - `core/hakmem_shared_pool.h`: node pool expansion (512→4096)

---

## ENV Controls for Phase B

```bash
# Enable the C23 fast path (default: OFF)
export HAKMEM_TINY_FRONT_C23_SIMPLE=1

# Set the refill target (default: 64)
export HAKMEM_TINY_FRONT_C23_REFILL=64

# Run the benchmark
./out/release/bench_random_mixed_hakmem 100000 256 42
```

**Recommended Settings**:
- Production: `HAKMEM_TINY_FRONT_C23_SIMPLE=1` + `REFILL=64`
- Testing: try `REFILL=32` for 256B-heavy workloads

---

## Lessons Learned

### Technical Insights
1. **Incremental optimization has limits** - Phase B achieved +7-15%, but the remaining 2x gap requires architectural changes
2. **User-space vs kernel bottleneck** - perf profiling revealed 99%+ kernel overhead, which user-space tuning cannot remove
3. **Separate bugs can compound** - the deadlock (FREE path) and node pool exhaustion (ALLOC path) were both triggered by the same workload (500K)
4. **A/B testing is essential** - the optimal refill target was size-dependent (128B→64, 256B→32)

### Process Insights
1. **Task agent for deep investigation** - excellent for complex root-cause analysis (500K SEGV)
2. **Perf profiling early and often** - identified the classify_ptr bottleneck (3.74%) and kernel dominance (99%)
3. **Commit small, test often** - each fix was tested at 100K/500K before moving to the next
4. **Document as you go** - this report captures all decisions and rationale for future reference

---

## Next Steps (Phase 12 Recommendation)

**Strategy**: mimalloc-style Shared SuperSlab Pool

**Problem**: The current architecture allocates one SuperSlab per size class → 877 SuperSlabs @ 100K iterations → massive metadata overhead

**Solution**: Multiple size classes share the same SuperSlab, with dynamic slab assignment

**Expected Impact**:
- SuperSlab count: 877 → 100-200 (-70-80%)
- Metadata overhead: -70-80%
- Cache miss rate: significantly reduced
- Performance: 9M → 70-90M ops/s (+650-860% expected)

**Implementation Plan**:
1. Phase 12-1: dynamic slab metadata (SlabMeta with runtime class_idx)
2. Phase 12-2: shared allocation (multiple classes from the same SS)
3. Phase 12-3: smart eviction (LRU-based slab reclamation)
4. Phase 12-4: benchmark vs system malloc (target: 80-100%)

**Reference**: See the `CLAUDE.md` Phase 12 section for the detailed design

---

## Conclusion

Phase B **successfully implemented** TinyFrontC23Box and achieved **measurable improvements** (+7-15% for C2/C3). However, perf profiling revealed that **user-space optimization has reached diminishing returns**: the remaining 2x gap to the 15-20M target is dominated by kernel overhead (99%+) and cannot be closed by further user-space tuning.

**Key Takeaway**: Phase B was a **valuable learning phase** that:
1. Demonstrated the limits of incremental optimization
2. Identified the true bottleneck (kernel overhead + metadata churn)
3. Paved the way for Phase 12 (the architectural solution)

**Status**: Phase B is **COMPLETE** and **STABLE** (500K iterations pass). Ready to proceed to Phase 12 for step-function improvement.

---

## Appendix: Performance Data

### 100K Iterations, Random Mixed 128B
```
Baseline (C23 OFF): 8.27M ops/s
refill=16:          8.76M ops/s (+5.9%)
refill=32:          9.00M ops/s (+8.8%)
refill=64:          9.55M ops/s (+15.5%) ← SELECTED
refill=128:         9.41M ops/s (+13.8%)
```

### 100K Iterations, Random Mixed 256B
```
Baseline (C23 OFF): 7.90M ops/s
refill=16:          8.01M ops/s (+1.4%)
refill=32:          8.61M ops/s (+9.0%)
refill=64:          8.47M ops/s (+7.2%) ← SELECTED (balanced)
refill=128:         8.37M ops/s (+5.9%)
```

### 500K Iterations, Random Mixed 256B
```
Before fix: SEGV with "Node pool exhausted for class 7"
After fix:  9.44M ops/s, stable, no warnings
```

### Perf Profile (1M iterations, Phase B enabled)
```
classify_ptr:     < 0.1% (was 3.74%, optimized)
tiny_alloc_fast:  < 0.5% (was 1.20%, optimized)
User-space total: < 1%
Kernel overhead:  99%+
```

---

**Report Author**: Claude Code
**Date**: 2025-11-14
**Session**: Phase B Completion