# Phase 9-2 Benchmark Report: WS8192 Performance Analysis
**Date**: 2025-11-30
**Test Configuration**: WS8192 (Working Set = 8192 allocations)
**Benchmark**: bench_random_mixed_hakmem 10000000 8192
**Status**: Baseline measurements complete, optimization not yet implemented
---
## Executive Summary
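The WS8192 benchmark was measured with the correct parameters. Results:
1. **SuperSlab OFF vs ON**: Nearly identical performance (16.23M vs 16.15M ops/s, -0.51%)
2. **Gap vs. expectation**: Phase 9-2 targets 25-30M ops/s (+50-80%), but the measurements show no improvement
3. **Root cause**: The Phase 9-2 fix (EMPTY→Freelist recycling) turned out to be **not implemented**
4. **Next step**: Phase 9-2 Option A needs to be implemented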
---
## 1. Benchmark Results
### 1.1 SuperSlab OFF (Baseline)
```bash
HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
```
| Run | Throughput (ops/s) | Time (s) |
|-----|-------------------|----------|
| 1 | 16,468,918 | 0.607 |
| 2 | 16,192,733 | 0.618 |
| 3 | 16,035,542 | 0.624 |
| **Average** | **16,232,398** | **0.616** |
| **Std Dev** | 178,517 (±1.1%) | 0.007 |
**Key Observations**:
- Consistent performance (±1.1% variance)
- 4x `[SS_BACKEND] shared_fail→legacy cls=7` warnings
- TLS_SLL errors present (header corruption warnings)
### 1.2 SuperSlab ON (Current State)
```bash
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
```
| Run | Throughput (ops/s) | Time (s) |
|-----|-------------------|----------|
| 1 | 16,231,848 | 0.616 |
| 2 | 16,305,843 | 0.613 |
| 3 | 15,910,918 | 0.628 |
| **Average** | **16,149,536** | **0.619** |
| **Std Dev** | 171,766 (±1.1%) | 0.007 |
**Key Observations**:
- **No performance improvement** (-0.51% vs baseline)
- Same `shared_fail→legacy` warnings (4x Class 7 fallbacks)
- Same TLS_SLL errors
- SuperSlab enabled but not providing benefits
### 1.3 Improvement Analysis
```
Baseline (SuperSlab OFF): 16.23 M ops/s
Current (SuperSlab ON): 16.15 M ops/s
Improvement: -0.51% (REGRESSION, within noise)
Expected (Phase 9-2): 25-30 M ops/s
Gap: -8.85 to -13.85 M ops/s (-35% to -46%)
```
**Verdict**: SuperSlab is enabled but **not functional** due to missing EMPTY recycling.
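
The averages and the -0.51% figure above can be reproduced from the per-run throughputs. The following is a minimal sketch; the helper names `mean_ops` and `pct_change` are illustrative and not part of hakmem.

```c
#include <stddef.h>

/* Arithmetic mean of n throughput samples (ops/s). */
double mean_ops(const double *runs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return n ? sum / (double)n : 0.0;
}

/* Relative change of `current` versus `baseline`, in percent.
 * A negative value means `current` is slower than `baseline`. */
double pct_change(double baseline, double current) {
    return (current - baseline) / baseline * 100.0;
}
```

Feeding the three SuperSlab OFF runs and the three SuperSlab ON runs through `mean_ops`, then passing both means to `pct_change`, gives a change of about -0.51%, which is smaller than the ±1.1% run-to-run spread reported in the tables.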
---
## 2. Problem Analysis
### 2.1 Why SuperSlab Has No Effect
From the investigation in PHASE9_2_SUPERSLAB_BACKEND_INVESTIGATION.md:
**Root Cause**: Shared pool Stage 3 soft cap blocks new SuperSlab allocation, but **EMPTY slabs are not recycled** to Stage 1 freelist.
**Flow**:
```
1. Benchmark allocates ~820 Class 7 blocks (10% of WS=8192)
2. Shared pool allocates 2 SuperSlabs (512KB each = 1022 blocks total)
3. class_active_slots[7] = 2 (soft cap reached)
4. Next allocation request:
   - Stage 0.5 (EMPTY scan): Finds nothing (only 2 SS, both ACTIVE)
   - Stage 1 (freelist): Empty (no EMPTY→ACTIVE transitions)
   - Stage 2 (UNUSED claim): Exhausted (first pass only)
   - Stage 3 (new SS alloc): FAIL (soft cap: current=2 >= limit=2)
5. shared_pool_acquire_slab() returns -1
6. Falls back to legacy backend
7. Legacy backend uses system malloc → kernel overhead
```
**Result**: SuperSlab backend is **bypassed 4 times** during benchmark → falls back to legacy system malloc.
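
The staged fallthrough described in the flow can be modelled in a few lines. This is a simplified sketch under stated assumptions: `PoolModel` and `model_acquire` are hypothetical names, the counters stand in for the real shared pool state, and none of this mirrors the actual `shared_pool_acquire_slab()` implementation.

```c
/* Hypothetical, simplified per-class pool state (not the real hakmem structs). */
typedef struct {
    int empty_slabs;     /* slabs marked EMPTY, found by the Stage 0.5 scan */
    int freelist_slots;  /* recycled slots available to Stage 1 */
    int unused_slots;    /* never-used slots claimable in Stage 2 */
    int active_ss;       /* SuperSlabs currently held by this class */
    int soft_cap;        /* Stage 3 limit on active_ss */
} PoolModel;

/* Returns the stage that satisfied the request (0 stands for Stage 0.5),
 * or -1 when every stage fails and the caller falls back to legacy. */
int model_acquire(PoolModel *p) {
    if (p->empty_slabs > 0)    { p->empty_slabs--;    return 0; }
    if (p->freelist_slots > 0) { p->freelist_slots--; return 1; }
    if (p->unused_slots > 0)   { p->unused_slots--;   return 2; }
    if (p->active_ss < p->soft_cap) { p->active_ss++; return 3; }
    return -1;  /* soft cap reached: current >= limit */
}
```

With `active_ss == soft_cap == 2` and every other counter at zero, `model_acquire` returns -1, which corresponds to step 5 of the flow. Adding a single slot to `freelist_slots` makes the same call succeed at Stage 1, which is the effect EMPTY recycling is meant to have.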
### 2.2 Observable Evidence
**Log Snippet**:
```
[SS_BACKEND] shared_fail→legacy cls=7 ← SuperSlab failed, using legacy
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
```
**What This Means**:
- SuperSlab attempted allocation → hit soft cap → failed
- Fell back to `hak_tiny_alloc_superslab_backend_legacy()`
- Legacy backend uses **system malloc** (not SuperSlab)
- Kernel overhead: mmap/munmap syscalls → 55% CPU in kernel
**Why No Performance Difference**:
- SuperSlab ON: Uses legacy backend (same as SuperSlab OFF)
- SuperSlab OFF: Uses legacy backend (expected)
- Both configurations → same code path → same performance
---
## 3. Missing Implementation: EMPTY→Freelist Recycling
### 3.1 What Needs to Be Implemented
**Phase 9-2 Option A** (from investigation report):
#### Step 1: Add EMPTY Detection to Remote Drain
**File**: `core/superslab_slab.c` (after line 109)
```c
void _ss_remote_drain_to_freelist_unsafe(SuperSlab* ss, int slab_idx, TinySlabMeta* meta) {
    // ... existing drain logic ...
    meta->freelist = prev;
    atomic_store(&ss->remote_counts[slab_idx], 0);

    // ✅ NEW: Check if slab is now EMPTY
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);  // Set empty_mask bit

        // Notify shared pool: push to per-class freelist
        int class_idx = (int)meta->class_idx;
        if (class_idx >= 0 && class_idx < TINY_NUM_CLASSES_SS) {
            shared_pool_release_slab(ss, slab_idx);
        }
    }
    // ... update masks ...
}
```
#### Step 2: Add EMPTY Detection to TLS SLL Drain
**File**: `core/box/tls_sll_drain_box.c`
```c
uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size) {
    // ... existing drain logic (provides ss, slab_idx, and meta) ...

    // After draining N blocks from TLS SLL to freelist:
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);
        shared_pool_release_slab(ss, slab_idx);
    }
    // ... existing return path ...
}
```
### 3.2 Expected Impact (After Implementation)
**Performance Prediction** (from Phase 9-2 investigation, Section 9.2):
| Configuration | Throughput | Kernel Overhead | Stage 1 Hit Rate |
|--------------|------------|-----------------|------------------|
| Current (no recycling) | 16.5 M ops/s | 55% | 0% |
| **Option A (EMPTY recycling)** | **25-28 M ops/s** | 15% | 80% |
| Option A+B (+ 2MB SS) | 30-35 M ops/s | 12% | 85% |
**Why +50-70% Improvement**:
- EMPTY slabs recycle instantly via lock-free Stage 1
- Soft cap never hit (slots reused, not created)
- Eliminates mmap/munmap overhead from legacy fallback
- SuperSlab backend becomes **fully functional**
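
To make the "lock-free Stage 1" idea concrete, the sketch below pushes a slot onto a Treiber-style stack when a slab's live count drops to zero. It is a toy model, not the hakmem code: `FreeSlot`, `SlotFreelist`, `SlabMetaModel`, and `model_free_block` are hypothetical names, and the pop side ignores the ABA problem that a production lock-free stack has to handle (for example with tagged pointers).

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct FreeSlot {
    struct FreeSlot *next;
    int slab_idx;
} FreeSlot;

typedef struct {
    _Atomic(FreeSlot *) head;
} SlotFreelist;

/* Hypothetical stand-in for the used/capacity fields of a slab's metadata. */
typedef struct {
    unsigned used;
    unsigned capacity;
} SlabMetaModel;

/* Lock-free push: retry the CAS until the head is swapped in. */
void freelist_push(SlotFreelist *fl, FreeSlot *slot) {
    FreeSlot *old = atomic_load_explicit(&fl->head, memory_order_relaxed);
    do {
        slot->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &fl->head, &old, slot,
                 memory_order_release, memory_order_relaxed));
}

/* Lock-free pop; returns NULL when the list is empty. Not ABA-safe. */
FreeSlot *freelist_pop(SlotFreelist *fl) {
    FreeSlot *old = atomic_load_explicit(&fl->head, memory_order_acquire);
    while (old && !atomic_compare_exchange_weak_explicit(
                      &fl->head, &old, old->next,
                      memory_order_acquire, memory_order_acquire)) {
        /* A failed CAS reloads `old`; loop and try again. */
    }
    return old;
}

/* Frees one block; returns 1 if the slab became EMPTY and was recycled. */
int model_free_block(SlabMetaModel *m, SlotFreelist *fl, FreeSlot *slot) {
    if (m->used == 0) {
        return 0;  /* nothing live, nothing to free */
    }
    m->used--;
    if (m->used == 0 && m->capacity > 0) {
        freelist_push(fl, slot);  /* same condition as the EMPTY check in Step 1 */
        return 1;
    }
    return 0;
}
```

In this model a later allocation calls `freelist_pop` first and only considers creating a new SuperSlab when it returns NULL, so recycled slots are reused before the soft cap is ever consulted.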
---
## 4. Comparison with Phase 9-1
### 4.1 Phase 9-1 Status
From PHASE9_1_PROGRESS.md:
**Phase 9-1 Goal**: Optimize SuperSlab lookup (50-80 cycles → 8-12 cycles)
**Status**: Infrastructure complete (4/6 steps), **migration not started**
- ✅ Step 1-4: Hash table + TLS hints implementation
- ⏸️ Step 5: Migration (IN PROGRESS)
- ⏸️ Step 6: Benchmark (PENDING)
**Key Point**: Phase 9-1 optimizations are **not yet integrated** into hot path.
### 4.2 Phase 9-2 Status
**Phase 9-2 Goal**: Fix SuperSlab backend (eliminate legacy fallbacks)
**Status**: Investigation complete, **implementation not started**
- ✅ Root cause identified (EMPTY recycling missing)
- ✅ 4 fix options proposed (Option A recommended)
- ⏸️ Implementation: NOT STARTED
- ⏸️ Benchmark: NOT STARTED
**Key Point**: Phase 9-2 is still in **planning phase**.
---
## 5. Performance Budget Analysis
### 5.1 Current Bottlenecks (WS8192)
```
Total: 212 cycles/op (16.5 M ops/s @ 2.8 GHz)
- SuperSlab Lookup: 50-80 cycles ← Phase 9-1 target
- Legacy Fallback: 30-50 cycles ← Phase 9-2 target
- Fragmentation: 30-50 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
```
**Kernel Overhead**: 55% (mmap/munmap from legacy fallback)
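
The cycle totals in this section relate to throughput through the clock rate: ops/s = clock / cycles-per-op. The sketch below uses illustrative helper names and assumes the 2.8 GHz clock quoted in the budget.

```c
/* Throughput (ops/s) implied by a per-operation cycle budget at a given clock. */
double ops_per_sec(double clock_hz, double cycles_per_operation) {
    return clock_hz / cycles_per_operation;
}

/* Cycle budget per operation implied by a throughput at a given clock. */
double cycles_for(double clock_hz, double ops_s) {
    return clock_hz / ops_s;
}
```

At 2.8 GHz, 212 cycles/op corresponds to roughly 13.2 M ops/s, while 16.5 M ops/s corresponds to roughly 170 cycles/op, so the 212-cycle total and the 16.5 M figure do not line up exactly at that clock. The 152-cycle and 95-cycle totals in Section 5.2 do match their quoted 18.4 M and 29.5 M ops/s.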
### 5.2 Expected After Phase 9-1 + 9-2
**After Phase 9-1** (lookup optimization):
```
Total: 152 cycles/op (18.4 M ops/s baseline)
- SuperSlab Lookup: 8-12 cycles ✅ Fixed (hash + TLS hints)
- Legacy Fallback: 30-50 cycles ← Still broken
- Fragmentation: 30-50 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
```
**Expected**: 16.5M → 23-25M ops/s (+39-52%)
**After Phase 9-1 + 9-2** (lookup + backend):
```
Total: 95 cycles/op (29.5 M ops/s baseline)
- SuperSlab Lookup: 8-12 cycles ✅ Fixed (Phase 9-1)
- Legacy Fallback: 0 cycles ✅ Fixed (Phase 9-2)
- SuperSlab Backend: 15-20 cycles ✅ Stage 1 reuse
- Fragmentation: 20-30 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
```
**Expected**: 16.5M → **30-35M ops/s** (+80-110%)
**Kernel Overhead**: 55% → 12-15%
---
## 6. Diagnostic Output Analysis
### 6.1 Repeated Warnings
**TLS_SLL_POP_POST_INVALID**:
```
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x7 last_writer=pop
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6 count=0
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x5b last_writer=pop
```
**Analysis** (from Phase 9-2 investigation, Section 2):
- **cls=6**: Class 6 (512-byte blocks)
- **got=0x00**: Header corrupted/zeroed
- **count=0**: One-time event (not recurring)
- **Hypothesis**: Use-after-free or slab reuse race
- **Mitigation**: Existing guards (`tiny_tls_slab_reuse_guard()`) should prevent
- **Verdict**: **Not critical** (one-time event, guards in place)
- **Action**: Monitor with `HAKMEM_SUPER_REG_DEBUG=1` for recurrence
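
The `got=0x00 expect=0xa6` pair in the log suggests a one-byte header that combines a fixed tag with the class index. The sketch below assumes the layout `0xA0 | class_idx`, inferred only from the single `cls=6` log line; the macro and function names are hypothetical and the real header format may differ.

```c
#include <stdint.h>

#define MODEL_HDR_MAGIC      0xA0u  /* assumed tag, from expect=0xa6 with cls=6 */
#define MODEL_HDR_CLASS_MASK 0x0Fu  /* assumed: class index in the low nibble */

/* Header byte the model expects for a block of the given class. */
static inline uint8_t model_expected_header(int class_idx) {
    return (uint8_t)(MODEL_HDR_MAGIC |
                     ((unsigned)class_idx & MODEL_HDR_CLASS_MASK));
}

/* Returns 1 when the header byte read from a block matches its class. */
int model_header_ok(uint8_t got, int class_idx) {
    return got == model_expected_header(class_idx);
}
```

Under this assumption a zeroed header (`0x00`) fails the check for every class, which is consistent with the reset being reported for a block whose header was overwritten.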
### 6.2 Shared Fail Events
```
[SS_BACKEND] shared_fail→legacy cls=7
```
**Count**: 4 events per benchmark run
**Class**: Class 7 (2048-byte allocations, 1024-1040B range in benchmark)
**Reason**: Soft cap reached (Stage 3 blocked)
**Impact**: Falls back to system malloc → kernel overhead
**This is the PRIMARY bottleneck** that Phase 9-2 Option A will fix.
---
## 7. Verification of Test Configuration
### 7.1 Benchmark Parameters
**Command Used**:
```bash
./bench_random_mixed_hakmem 10000000 8192
```
**Breakdown**:
- `10000000`: 10M cycles (steady-state measurement)
- `8192`: Working set size (WS8192)
**From bench_random_mixed.c (lines 45-46)**:
```c
int cycles = (argc>1)? atoi(argv[1]) : 10000000; // total ops
int ws = (argc>2)? atoi(argv[2]) : 8192; // working-set slots
```
**Allocation Pattern** (line 116):
```c
size_t sz = 16u + (r & 0x3FFu); // 16..1039 bytes (approx 16..1024)
```
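
The range of sizes this expression can produce follows directly from the mask: `0x3FF` keeps the low 10 bits of `r`, giving 0..1023, so the result lies in 16..1039 inclusive. A minimal sketch that wraps the same expression (the function name `bench_size` is illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Same expression as the benchmark's allocation-size line. */
size_t bench_size(uint32_t r) {
    return 16u + (r & 0x3FFu);
}
```

Calling `bench_size` with `r = 0` gives the minimum of 16 bytes, and any `r` whose low 10 bits are all set gives the maximum of 1039 bytes.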
**Class Distribution** (estimated):
```
16-64B → Classes 0-3 (~40%)
64-256B → Classes 4-5 (~30%)
256-512B → Class 6 (~20%)
512-1040B → Class 7 (~10% = ~820 live allocations)
```
**Why Class 7 Exhausts**:
- 820 live allocations ÷ 511 blocks/SuperSlab = 1.6 SuperSlabs (rounded to 2)
- Soft cap = 2 → any additional allocation fails → legacy fallback
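
The rounding step above is an integer ceiling division. A short sketch using the figures from this section (the function name is illustrative):

```c
/* Number of SuperSlabs needed to hold `live` blocks at `per_ss` blocks each. */
unsigned superslabs_needed(unsigned live, unsigned per_ss) {
    return (live + per_ss - 1) / per_ss;  /* integer ceiling division */
}
```

With 820 live blocks and 511 blocks per SuperSlab the result is 2, which already equals the soft cap, so the working set leaves no headroom for a third SuperSlab.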
### 7.2 Comparison with Phase 9-1 Baseline
**From PHASE9_1_PROGRESS.md (line 142)**:
```bash
./bench_random_mixed_hakmem 10000000 8192 # ~16.5 M ops/s
```
**Current Measurement**:
- SuperSlab OFF: 16.23 M ops/s
- SuperSlab ON: 16.15 M ops/s
**Match**: ✅ Values align with Phase 9-1 baseline (16.5M vs 16.2M, within variance)
---
## 8. Next Steps
### 8.1 Immediate Actions
1. **Implement Phase 9-2 Option A** (EMPTY→Freelist recycling)
- Modify `core/superslab_slab.c` (remote drain)
- Modify `core/box/tls_sll_drain_box.c` (TLS SLL drain)
- Add EMPTY detection: `if (meta->used == 0) { shared_pool_release_slab(...) }`
2. **Run Debug Build** to verify EMPTY recycling
```bash
make clean
make CFLAGS="-O2 -g -DHAKMEM_BUILD_RELEASE=0" bench_random_mixed_hakmem
HAKMEM_TINY_USE_SUPERSLAB=1 \
HAKMEM_SS_ACQUIRE_DEBUG=1 \
HAKMEM_SHARED_POOL_STAGE_STATS=1 \
./bench_random_mixed_hakmem 100000 256 42
```
3. **Verify Stage 1 Hits** in debug output
- Look for `[SP_ACQUIRE_STAGE1_LOCKFREE]` logs
- Confirm freelist population: `[SP_SLOT_FREELIST_LOCKFREE]`
- Verify zero `shared_fail→legacy` events
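
Checking the debug output for these tags can be automated with a simple substring count over the captured log text. The helper below is a sketch with an illustrative name; it assumes the log has been read into a NUL-terminated buffer and that the tags appear verbatim as listed above.

```c
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of `tag` in the NUL-terminated `log`. */
size_t count_tag(const char *log, const char *tag) {
    size_t hits = 0;
    size_t len = strlen(tag);
    if (len == 0) {
        return 0;  /* an empty tag would match everywhere */
    }
    for (const char *p = strstr(log, tag); p != NULL; p = strstr(p + len, tag)) {
        hits++;
    }
    return hits;
}
```

After Option A is in place, the expectation from the checklist is a non-zero count for `[SP_ACQUIRE_STAGE1_LOCKFREE]` and a count of zero for `shared_fail`.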
### 8.2 Performance Validation
4. **Re-run WS8192 Benchmark** (after Option A implementation)
```bash
# Baseline (should be same as before)
HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
# Optimized (should show +50-70% improvement)
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
```
5. **Success Criteria** (from Phase 9-2 Section 11.2):
- ✅ Throughput: 16.5M → 25-30M ops/s (+50-80%)
- ✅ Zero `shared_fail→legacy` events
- ✅ Stage 1 hit rate: 70-80% (after warmup)
- ✅ Kernel overhead: 55% → <15%
### 8.3 Optional Enhancements
6. **Implement Option B** (revert to 2MB SuperSlab)
- Change `SUPERSLAB_LG_DEFAULT` from 19 → 21
- Expected additional gain: +10-15% (30-35M ops/s total)
7. **Implement Option D** (expand EMPTY scan limit)
- Change `HAKMEM_SS_EMPTY_SCAN_LIMIT` default from 16 → 64
- Expected additional gain: +3-8% (marginal)
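
For Option B, the `SUPERSLAB_LG_DEFAULT` values are base-2 logarithms of the SuperSlab size, so the change from 19 to 21 quadruples it. A one-function sketch (the function name is illustrative):

```c
#include <stddef.h>

/* Size in bytes of a SuperSlab whose log2 size is `lg`. */
size_t superslab_bytes(unsigned lg) {
    return (size_t)1 << lg;
}
```

Here `superslab_bytes(19)` is 524,288 bytes (512 KB) and `superslab_bytes(21)` is 2,097,152 bytes (2 MB), matching the sizes named in this report.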
---
## 9. Risk Assessment
### 9.1 Implementation Risks (Option A)
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Double-free in EMPTY detection** | Low | Critical | Assert `meta->used == 0` before `shared_pool_release_slab()` |
| **Race: EMPTY→ACTIVE→EMPTY** | Medium | Medium | Use atomic `meta->used` reads; Stage 1 CAS prevents double-activation |
| **Deadlock in release_slab** | Low | Medium | Use lock-free push (already implemented) |
**Overall**: Low risk (Box boundaries well-defined, guards in place)
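
One way to guard against a slab being released twice is to let only the caller that flips the slab's EMPTY bit from 0 to 1 perform the release. The sketch below illustrates that pattern with C11 atomics. The function names are hypothetical, the mask is assumed to be a 32-bit word with one bit per slab, and the real `ss_mark_slab_empty()` may work differently.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Sets the slab's bit; returns 1 only for the caller that changed it 0 -> 1.
 * Requires slab_idx < 32. */
int model_try_mark_empty(_Atomic uint32_t *empty_mask, unsigned slab_idx) {
    uint32_t bit = (uint32_t)1 << slab_idx;
    uint32_t prev = atomic_fetch_or_explicit(empty_mask, bit,
                                             memory_order_acq_rel);
    return (prev & bit) == 0;
}

/* Clears the bit when the slab is reactivated (EMPTY -> ACTIVE). */
void model_clear_empty(_Atomic uint32_t *empty_mask, unsigned slab_idx) {
    uint32_t bit = (uint32_t)1 << slab_idx;
    atomic_fetch_and_explicit(empty_mask, ~bit, memory_order_acq_rel);
}
```

A drain path would call the release function only when `model_try_mark_empty` returns 1, so two drains that both observe `used == 0` cannot both push the same slab onto the freelist.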
### 9.2 Performance Risks
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Improvement less than expected** | Medium | Medium | Profile with perf, check Stage 1 hit rate, consider Option B |
| **Regression in other workloads** | Low | Medium | Run full benchmark suite (WS256, cache_thrash, larson) |
| **Memory leak from freelist** | Low | High | Monitor RSS growth, verify EMPTY detection logic |
**Overall**: Medium risk (new feature, but small code change)
---
## 10. Lessons Learned
### 10.1 Benchmark Parameter Confusion
**Issue**: The initial request said "I accidentally measured with the default parameters, so the workload was too light"
**Reality**: Default parameters ARE WS8192 (line 46 in bench_random_mixed.c)
```c
int ws = (argc>2)? atoi(argv[2]) : 8192; // default: 8192
```
**Takeaway**: Always check source code to verify default behavior (documentation may be outdated).
### 10.2 SuperSlab Enabled ≠ SuperSlab Functional
**Issue**: `HAKMEM_TINY_USE_SUPERSLAB=1` enables SuperSlab code, but doesn't guarantee it's used.
**Reality**: Legacy fallback is triggered when SuperSlab backend fails (soft cap, OOM, etc.)
**Takeaway**: Check for `shared_fail→legacy` warnings in output to verify SuperSlab is actually being used.
### 10.3 Phase Dependencies
**Issue**: Assumed Phase 9-2 was complete (based on PHASE9_2_*.md files)
**Reality**: Phase 9-2 investigation is complete, but **implementation is not started**
**Takeaway**: Check document status header (e.g., "Status: Root Cause Analysis Complete" vs "Status: Implementation Complete")
---
## 11. Conclusion
**Current State**: WS8192 benchmark correctly measured at 16.1-16.2 M ops/s, consistent across SuperSlab ON/OFF.
**Root Cause**: SuperSlab backend falls back to legacy system malloc due to missing EMPTY→Freelist recycling (Phase 9-2 Option A).
**Expected Improvement**: After implementing Option A, expect 25-30 M ops/s (+50-80%) by eliminating legacy fallbacks and enabling lock-free Stage 1 EMPTY reuse.
**Next Action**: Implement Phase 9-2 Option A (2-3 hour task), then re-benchmark WS8192 to verify the expected +50-80% improvement.
---
**Report Prepared By**: Claude (Sonnet 4.5)
**Benchmark Date**: 2025-11-30
**Total Test Time**: ~3.7 seconds (6 runs × ~0.62s average)
**Status**: Baseline established, awaiting Phase 9-2 implementation