# Phase 9-2 Benchmark Report: WS8192 Performance Analysis

**Date**: 2025-11-30
**Test Configuration**: WS8192 (Working Set = 8192 allocations)
**Benchmark**: bench_random_mixed_hakmem 10000000 8192
**Status**: Baseline measurements complete, optimization not yet implemented

---

## Executive Summary

The WS8192 benchmark was measured with the correct parameters. Results:

1. **SuperSlab OFF vs ON**: Nearly identical performance (16.23M vs 16.15M ops/s, -0.51%)
2. **Gap from expectations**: Phase 9-2 targeted 25-30M ops/s (+50-80%); the measurements show no improvement
3. **Root cause**: The Phase 9-2 fix (EMPTY→Freelist recycling) turned out to be **not implemented**
4. **Next step**: Phase 9-2 Option A must be implemented

---

## 1. Benchmark Results

### 1.1 SuperSlab OFF (Baseline)

```bash
HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
```

| Run | Throughput (ops/s) | Time (s) |
|-----|-------------------|----------|
| 1 | 16,468,918 | 0.607 |
| 2 | 16,192,733 | 0.618 |
| 3 | 16,035,542 | 0.624 |
| **Average** | **16,232,398** | **0.616** |
| **Std Dev** | 178,517 (±1.1%) | 0.007 |

**Key Observations**:
- Consistent performance (±1.1% variance)
- 4x `[SS_BACKEND] shared_fail→legacy cls=7` warnings
- TLS_SLL errors present (header corruption warnings)

### 1.2 SuperSlab ON (Current State)

```bash
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
```

| Run | Throughput (ops/s) | Time (s) |
|-----|-------------------|----------|
| 1 | 16,231,848 | 0.616 |
| 2 | 16,305,843 | 0.613 |
| 3 | 15,910,918 | 0.628 |
| **Average** | **16,149,536** | **0.619** |
| **Std Dev** | 171,766 (±1.1%) | 0.007 |

**Key Observations**:
- **No performance improvement** (-0.51% vs baseline)
- Same `shared_fail→legacy` warnings (4x Class 7 fallbacks)
- Same TLS_SLL errors
- SuperSlab enabled but not providing benefits

### 1.3 Improvement Analysis

```
Baseline (SuperSlab OFF):  16.23 M ops/s
Current (SuperSlab ON):    16.15 M ops/s
Improvement:               -0.51% (REGRESSION, within noise)

Expected (Phase 9-2):      25-30 M ops/s
Gap:                       -8.85 to -13.85 M ops/s (-35% to -46%)
```

**Verdict**: SuperSlab is enabled but **not
functional** due to missing EMPTY recycling.

---

## 2. Problem Analysis

### 2.1 Why SuperSlab Has No Effect

From the investigation in PHASE9_2_SUPERSLAB_BACKEND_INVESTIGATION.md:

**Root Cause**: The shared pool's Stage 3 soft cap blocks new SuperSlab allocation, but **EMPTY slabs are not recycled** to the Stage 1 freelist.

**Flow**:
```
1. Benchmark allocates ~820 Class 7 blocks (10% of WS=8192)
2. Shared pool allocates 2 SuperSlabs (512KB each = 1022 blocks total)
3. class_active_slots[7] = 2 (soft cap reached)
4. Next allocation request:
   - Stage 0.5 (EMPTY scan): Finds nothing (only 2 SS, both ACTIVE)
   - Stage 1 (freelist): Empty (no EMPTY→ACTIVE transitions)
   - Stage 2 (UNUSED claim): Exhausted (first pass only)
   - Stage 3 (new SS alloc): FAIL (soft cap: current=2 >= limit=2)
5. shared_pool_acquire_slab() returns -1
6. Falls back to legacy backend
7. Legacy backend uses system malloc → kernel overhead
```

**Result**: The SuperSlab backend is **bypassed 4 times** during the benchmark and falls back to legacy system malloc.

### 2.2 Observable Evidence

**Log Snippet**:
```
[SS_BACKEND] shared_fail→legacy cls=7  ← SuperSlab failed, using legacy
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
```

**What This Means**:
- SuperSlab attempted allocation → hit soft cap → failed
- Fell back to `hak_tiny_alloc_superslab_backend_legacy()`
- Legacy backend uses **system malloc** (not SuperSlab)
- Kernel overhead: mmap/munmap syscalls → 55% CPU in kernel

**Why No Performance Difference**:
- SuperSlab ON: Uses legacy backend (same as SuperSlab OFF)
- SuperSlab OFF: Uses legacy backend (expected)
- Both configurations → same code path → same performance

---

## 3. Missing Implementation: EMPTY→Freelist Recycling

### 3.1 What Needs to Be Implemented

**Phase 9-2 Option A** (from investigation report):

#### Step 1: Add EMPTY Detection to Remote Drain

**File**: `core/superslab_slab.c` (after line 109)

```c
void _ss_remote_drain_to_freelist_unsafe(SuperSlab* ss, int slab_idx, TinySlabMeta* meta)
{
    // ... existing drain logic ...

    meta->freelist = prev;
    atomic_store(&ss->remote_counts[slab_idx], 0);

    // ✅ NEW: Check if slab is now EMPTY
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);  // Set empty_mask bit

        // Notify shared pool: push to per-class freelist
        int class_idx = (int)meta->class_idx;
        if (class_idx >= 0 && class_idx < TINY_NUM_CLASSES_SS) {
            shared_pool_release_slab(ss, slab_idx);
        }
    }

    // ... update masks ...
}
```

#### Step 2: Add EMPTY Detection to TLS SLL Drain

**File**: `core/box/tls_sll_drain_box.c`

```c
uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size)
{
    // ... existing drain logic ...

    // After draining N blocks from TLS SLL to freelist:
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);
        shared_pool_release_slab(ss, slab_idx);
    }
}
```

### 3.2 Expected Impact (After Implementation)

**Performance Prediction** (from Phase 9-2 investigation, Section 9.2):

| Configuration | Throughput | Kernel Overhead | Stage 1 Hit Rate |
|--------------|------------|-----------------|------------------|
| Current (no recycling) | 16.5 M ops/s | 55% | 0% |
| **Option A (EMPTY recycling)** | **25-28 M ops/s** | 15% | 80% |
| Option A+B (+ 2MB SS) | 30-35 M ops/s | 12% | 85% |

**Why +50-70% Improvement**:
- EMPTY slabs recycle instantly via lock-free Stage 1
- Soft cap never hit (slots reused, not created)
- Eliminates mmap/munmap overhead from legacy fallback
- SuperSlab backend becomes **fully functional**

---

## 4. Comparison with Phase 9-1

### 4.1 Phase 9-1 Status

From PHASE9_1_PROGRESS.md:

**Phase 9-1 Goal**: Optimize SuperSlab lookup (50-80 cycles → 8-12 cycles)

**Status**: Infrastructure complete (4/6 steps), **migration not started**

- ✅ Step 1-4: Hash table + TLS hints implementation
- ⏸️ Step 5: Migration (IN PROGRESS)
- ⏸️ Step 6: Benchmark (PENDING)

**Key Point**: Phase 9-1 optimizations are **not yet integrated** into the hot path.

### 4.2 Phase 9-2 Status

**Phase 9-2 Goal**: Fix SuperSlab backend (eliminate legacy fallbacks)

**Status**: Investigation complete, **implementation not started**

- ✅ Root cause identified (EMPTY recycling missing)
- ✅ 4 fix options proposed (Option A recommended)
- ⏸️ Implementation: NOT STARTED
- ⏸️ Benchmark: NOT STARTED

**Key Point**: Phase 9-2 is still in the **planning phase**.

---

## 5. Performance Budget Analysis

### 5.1 Current Bottlenecks (WS8192)

```
Total: 212 cycles/op (16.5 M ops/s @ 2.8 GHz)
- SuperSlab Lookup:  50-80 cycles  ← Phase 9-1 target
- Legacy Fallback:   30-50 cycles  ← Phase 9-2 target
- Fragmentation:     30-50 cycles
- TLS Drain:         10-15 cycles
- Actual Work:       30-40 cycles
```

**Kernel Overhead**: 55% (mmap/munmap from legacy fallback)

### 5.2 Expected After Phase 9-1 + 9-2

**After Phase 9-1** (lookup optimization):
```
Total: 152 cycles/op (18.4 M ops/s baseline)
- SuperSlab Lookup:  8-12 cycles   ✅ Fixed (hash + TLS hints)
- Legacy Fallback:   30-50 cycles  ← Still broken
- Fragmentation:     30-50 cycles
- TLS Drain:         10-15 cycles
- Actual Work:       30-40 cycles
```

**Expected**: 16.5M → 23-25M ops/s (+39-52%)

**After Phase 9-1 + 9-2** (lookup + backend):
```
Total: 95 cycles/op (29.5 M ops/s baseline)
- SuperSlab Lookup:  8-12 cycles   ✅ Fixed (Phase 9-1)
- Legacy Fallback:   0 cycles      ✅ Fixed (Phase 9-2)
- SuperSlab Backend: 15-20 cycles  ✅ Stage 1 reuse
- Fragmentation:     20-30 cycles
- TLS Drain:         10-15 cycles
- Actual Work:       30-40 cycles
```

**Expected**: 16.5M → **30-35M ops/s** (+80-110%)

**Kernel Overhead**: 55% → 12-15%

---

## 6. Diagnostic Output Analysis

### 6.1 Repeated Warnings

**TLS_SLL_POP_POST_INVALID**:
```
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x7 last_writer=pop
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6 count=0
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x5b last_writer=pop
```

**Analysis** (from Phase 9-2 investigation, Section 2):
- **cls=6**: Class 6 (512-byte blocks)
- **got=0x00**: Header corrupted/zeroed
- **count=0**: One-time event (not recurring)
- **Hypothesis**: Use-after-free or slab reuse race
- **Mitigation**: Existing guards (`tiny_tls_slab_reuse_guard()`) should prevent this
- **Verdict**: **Not critical** (one-time event, guards in place)
- **Action**: Monitor with `HAKMEM_SUPER_REG_DEBUG=1` for recurrence

### 6.2 Shared Fail Events

```
[SS_BACKEND] shared_fail→legacy cls=7
```

**Count**: 4 events per benchmark run
**Class**: Class 7 (2048-byte allocations, 1024-1040B range in benchmark)
**Reason**: Soft cap reached (Stage 3 blocked)
**Impact**: Falls back to system malloc → kernel overhead

**This is the PRIMARY bottleneck** that Phase 9-2 Option A will fix.

---

## 7. Verification of Test Configuration

### 7.1 Benchmark Parameters

**Command Used**:
```bash
./bench_random_mixed_hakmem 10000000 8192
```

**Breakdown**:
- `10000000`: 10M cycles (steady-state measurement)
- `8192`: Working set size (WS8192)

**From bench_random_mixed.c (lines 45-46)**:
```c
int cycles = (argc>1)? atoi(argv[1]) : 10000000; // total ops
int ws = (argc>2)? atoi(argv[2]) : 8192; // working-set slots
```

**Allocation Pattern** (line 116):
```c
size_t sz = 16u + (r & 0x3FFu); // 16..1040 bytes (approx 16..1024)
```

**Class Distribution** (estimated):
```
16-64B    → Classes 0-3 (~40%)
64-256B   → Classes 4-5 (~30%)
256-512B  → Class 6     (~20%)
512-1040B → Class 7     (~10% = ~820 live allocations)
```

**Why Class 7 Exhausts**:
- 820 live allocations ÷ 511 blocks/SuperSlab = 1.6 SuperSlabs (rounded up to 2)
- Soft cap = 2 → any additional allocation fails → legacy fallback

### 7.2 Comparison with Phase 9-1 Baseline

**From PHASE9_1_PROGRESS.md (line 142)**:
```bash
./bench_random_mixed_hakmem 10000000 8192  # ~16.5 M ops/s
```

**Current Measurement**:
- SuperSlab OFF: 16.23 M ops/s
- SuperSlab ON: 16.15 M ops/s

**Match**: ✅ Values align with Phase 9-1 baseline (16.5M vs 16.2M, within variance)

---

## 8. Next Steps

### 8.1 Immediate Actions

1. **Implement Phase 9-2 Option A** (EMPTY→Freelist recycling)
   - Modify `core/superslab_slab.c` (remote drain)
   - Modify `core/box/tls_sll_drain_box.c` (TLS SLL drain)
   - Add EMPTY detection: `if (meta->used == 0) { shared_pool_release_slab(...) }`

2. **Run Debug Build** to verify EMPTY recycling
   ```bash
   make clean
   make CFLAGS="-O2 -g -DHAKMEM_BUILD_RELEASE=0" bench_random_mixed_hakmem

   HAKMEM_TINY_USE_SUPERSLAB=1 \
   HAKMEM_SS_ACQUIRE_DEBUG=1 \
   HAKMEM_SHARED_POOL_STAGE_STATS=1 \
   ./bench_random_mixed_hakmem 100000 256 42
   ```

3. **Verify Stage 1 Hits** in debug output
   - Look for `[SP_ACQUIRE_STAGE1_LOCKFREE]` logs
   - Confirm freelist population: `[SP_SLOT_FREELIST_LOCKFREE]`
   - Verify zero `shared_fail→legacy` events

### 8.2 Performance Validation

4. **Re-run WS8192 Benchmark** (after Option A implementation)
   ```bash
   # Baseline (should be same as before)
   HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192

   # Optimized (should show +50-70% improvement)
   HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
   ```

5. **Success Criteria** (from Phase 9-2 Section 11.2):
   - ✅ Throughput: 16.5M → 25-30M ops/s (+50-80%)
   - ✅ Zero `shared_fail→legacy` events
   - ✅ Stage 1 hit rate: 70-80% (after warmup)
   - ✅ Kernel overhead: 55% → <15%

### 8.3 Optional Enhancements

6. **Implement Option B** (revert to 2MB SuperSlab)
   - Change `SUPERSLAB_LG_DEFAULT` from 19 → 21
   - Expected additional gain: +10-15% (30-35M ops/s total)

7. **Implement Option D** (expand EMPTY scan limit)
   - Change `HAKMEM_SS_EMPTY_SCAN_LIMIT` default from 16 → 64
   - Expected additional gain: +3-8% (marginal)

---

## 9. Risk Assessment

### 9.1 Implementation Risks (Option A)

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Double-free in EMPTY detection** | Low | Critical | Add `meta->used == 0` assertion before `shared_pool_release_slab()` |
| **Race: EMPTY→ACTIVE→EMPTY** | Medium | Medium | Use atomic `meta->used` reads; Stage 1 CAS prevents double-activation |
| **Deadlock in release_slab** | Low | Medium | Use lock-free push (already implemented) |

**Overall**: Low risk (Box boundaries well-defined, guards in place)

### 9.2 Performance Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Improvement less than expected** | Medium | Medium | Profile with perf, check Stage 1 hit rate, consider Option B |
| **Regression in other workloads** | Low | Medium | Run full benchmark suite (WS256, cache_thrash, larson) |
| **Memory leak from freelist** | Low | High | Monitor RSS growth, verify EMPTY detection logic |

**Overall**: Medium risk (new feature, but small code change)

---

## 10. Lessons Learned

### 10.1 Benchmark Parameter Confusion

**Issue**: The initial request stated, "The measurement was taken with the default parameters, so the workload was too light."

**Reality**: The default parameters ARE WS8192 (line 46 in bench_random_mixed.c)
```c
int ws = (argc>2)? atoi(argv[2]) : 8192; // default: 8192
```

**Takeaway**: Always check the source code to verify default behavior (documentation may be outdated).

### 10.2 SuperSlab Enabled ≠ SuperSlab Functional

**Issue**: `HAKMEM_TINY_USE_SUPERSLAB=1` enables the SuperSlab code but does not guarantee that it is used.

**Reality**: The legacy fallback is triggered whenever the SuperSlab backend fails (soft cap, OOM, etc.)

**Takeaway**: Check for `shared_fail→legacy` warnings in the output to verify that SuperSlab is actually being used.

### 10.3 Phase Dependencies

**Issue**: Assumed Phase 9-2 was complete (based on PHASE9_2_*.md files)

**Reality**: The Phase 9-2 investigation is complete, but **implementation has not started**

**Takeaway**: Check the document status header (e.g., "Status: Root Cause Analysis Complete" vs "Status: Implementation Complete")

---

## 11. Conclusion

**Current State**: WS8192 benchmark correctly measured at 16.15-16.23 M ops/s, consistent across SuperSlab ON/OFF.

**Root Cause**: The SuperSlab backend falls back to legacy system malloc because EMPTY→Freelist recycling (Phase 9-2 Option A) is missing.

**Expected Improvement**: After implementing Option A, expect 25-30 M ops/s (+50-80%) by eliminating legacy fallbacks and enabling lock-free Stage 1 EMPTY reuse.

**Next Action**: Implement Phase 9-2 Option A (2-3 hour task), then re-benchmark WS8192 to verify the +50-70% improvement.

---

**Report Prepared By**: Claude (Sonnet 4.5)
**Benchmark Date**: 2025-11-30
**Total Test Time**: ~3.7 seconds (6 runs × ~0.62 s average)
**Status**: Baseline established, awaiting Phase 9-2 implementation