# Phase 9-2 Benchmark Report: WS8192 Performance Analysis

**Date**: 2025-11-30
**Test Configuration**: WS8192 (Working Set = 8192 allocations)
**Benchmark**: `bench_random_mixed_hakmem 10000000 8192`
**Status**: Baseline measurements complete, optimization not yet implemented

---

## Executive Summary

The WS8192 benchmark was measured with the correct parameters. Results:

1. **SuperSlab OFF vs ON**: Nearly identical performance (16.23M vs 16.15M ops/s, -0.51%)
2. **Gap from expectations**: Phase 9-2 targeted 25-30M ops/s (+50-80%); the measurements show no improvement
3. **Root cause**: The Phase 9-2 fix (EMPTY→Freelist recycling) turned out to be **not yet implemented**
4. **Next step**: Phase 9-2 Option A needs to be implemented

---
|
## 1. Benchmark Results

### 1.1 SuperSlab OFF (Baseline)

```bash
HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
```

| Run | Throughput (ops/s) | Time (s) |
|-----|-------------------|----------|
| 1 | 16,468,918 | 0.607 |
| 2 | 16,192,733 | 0.618 |
| 3 | 16,035,542 | 0.624 |
| **Average** | **16,232,398** | **0.616** |
| **Std Dev** | 178,517 (±1.1%) | 0.007 |

**Key Observations**:
- Consistent performance (±1.1% variance)
- 4x `[SS_BACKEND] shared_fail→legacy cls=7` warnings
- TLS_SLL errors present (header corruption warnings)
|
### 1.2 SuperSlab ON (Current State)

```bash
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
```

| Run | Throughput (ops/s) | Time (s) |
|-----|-------------------|----------|
| 1 | 16,231,848 | 0.616 |
| 2 | 16,305,843 | 0.613 |
| 3 | 15,910,918 | 0.628 |
| **Average** | **16,149,536** | **0.619** |
| **Std Dev** | 171,766 (±1.1%) | 0.007 |

**Key Observations**:
- **No performance improvement** (-0.51% vs baseline)
- Same `shared_fail→legacy` warnings (4x Class 7 fallbacks)
- Same TLS_SLL errors
- SuperSlab enabled but not providing benefits
|
### 1.3 Improvement Analysis

```
Baseline (SuperSlab OFF):  16.23 M ops/s
Current (SuperSlab ON):    16.15 M ops/s
Change:                    -0.51% (nominal regression, within the ±1.1% run-to-run noise)

Expected (Phase 9-2):      25-30 M ops/s
Gap:                       -8.85 to -13.85 M ops/s (-35% to -46%)
```

**Verdict**: SuperSlab is enabled but **not functional** due to missing EMPTY recycling.
|
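The averages and the relative change quoted in this section can be re-derived from the per-run throughput figures. The sketch below is a standalone helper, not part of hakmem; the names `mean_of` and `percent_change` are illustrative assumptions. Fed the three SuperSlab OFF runs and the three SuperSlab ON runs, it reproduces averages of about 16,232,398 and 16,149,536 ops/s and a change of about -0.51%.

```c
#include <stddef.h>

/* Arithmetic mean of n samples (n must be greater than 0). */
double mean_of(const double *xs, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += xs[i];
    }
    return sum / (double)n;
}

/* Relative change of `current` versus `baseline`, in percent.
 * A negative result means `current` is slower than `baseline`. */
double percent_change(double baseline, double current) {
    return (current - baseline) / baseline * 100.0;
}
```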
---

## 2. Problem Analysis

### 2.1 Why SuperSlab Has No Effect

From the `PHASE9_2_SUPERSLAB_BACKEND_INVESTIGATION.md` investigation:

**Root Cause**: Shared pool Stage 3 soft cap blocks new SuperSlab allocation, but **EMPTY slabs are not recycled** to Stage 1 freelist.

**Flow**:
```
1. Benchmark allocates ~820 Class 7 blocks (10% of WS=8192)
2. Shared pool allocates 2 SuperSlabs (512KB each = 1022 blocks total)
3. class_active_slots[7] = 2 (soft cap reached)
4. Next allocation request:
   - Stage 0.5 (EMPTY scan): Finds nothing (only 2 SS, both ACTIVE)
   - Stage 1 (freelist): Empty (no EMPTY→ACTIVE transitions)
   - Stage 2 (UNUSED claim): Exhausted (first pass only)
   - Stage 3 (new SS alloc): FAIL (soft cap: current=2 >= limit=2)
5. shared_pool_acquire_slab() returns -1
6. Falls back to legacy backend
7. Legacy backend uses system malloc → kernel overhead
```

**Result**: SuperSlab backend is **bypassed 4 times** during benchmark → falls back to legacy system malloc.
|
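The failure path in steps 4 to 6 can be modelled in a few lines. This is a toy model under stated assumptions, not the hakmem implementation: `ToyPool`, `toy_acquire`, and all field names are hypothetical, and the real `shared_pool_acquire_slab()` has a different signature and much more state. Starting from `active_slots == soft_cap` with nothing in the earlier stages, the model returns -1, which is the condition that triggers the legacy fallback; adding a single recycled slot to the freelist is enough for Stage 1 to succeed.

```c
/* Hypothetical, simplified view of one size class in the shared pool. */
typedef struct {
    int empty_superslabs; /* Stage 0.5: SuperSlabs currently marked EMPTY     */
    int freelist_slots;   /* Stage 1: recycled slots waiting on the freelist  */
    int unused_slots;     /* Stage 2: never-used slots in existing SuperSlabs */
    int active_slots;     /* SuperSlabs currently owned by this class         */
    int soft_cap;         /* Stage 3: upper limit on active_slots             */
} ToyPool;

/* Try each stage in order.
 * Returns 0 for Stage 0.5, 1, 2 or 3 for the later stages,
 * or -1 when every stage fails (the caller then uses the legacy backend). */
int toy_acquire(ToyPool *p) {
    if (p->empty_superslabs > 0) { p->empty_superslabs--; return 0; }
    if (p->freelist_slots > 0)   { p->freelist_slots--;   return 1; }
    if (p->unused_slots > 0)     { p->unused_slots--;     return 2; }
    if (p->active_slots < p->soft_cap) { p->active_slots++; return 3; }
    return -1; /* soft cap reached and nothing to reuse */
}
```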
### 2.2 Observable Evidence

**Log Snippet**:
```
[SS_BACKEND] shared_fail→legacy cls=7   ← SuperSlab failed, using legacy
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
```

**What This Means**:
- SuperSlab attempted allocation → hit soft cap → failed
- Fell back to `hak_tiny_alloc_superslab_backend_legacy()`
- Legacy backend uses **system malloc** (not SuperSlab)
- Kernel overhead: mmap/munmap syscalls → 55% CPU in kernel

**Why No Performance Difference**:
- SuperSlab ON: Uses legacy backend (same as SuperSlab OFF)
- SuperSlab OFF: Uses legacy backend (expected)
- Both configurations → same code path → same performance

---
|
## 3. Missing Implementation: EMPTY→Freelist Recycling

### 3.1 What Needs to Be Implemented

**Phase 9-2 Option A** (from investigation report):

#### Step 1: Add EMPTY Detection to Remote Drain
**File**: `core/superslab_slab.c` (after line 109)
```c
void _ss_remote_drain_to_freelist_unsafe(SuperSlab* ss, int slab_idx, TinySlabMeta* meta) {
    // ... existing drain logic ...

    meta->freelist = prev;
    atomic_store(&ss->remote_counts[slab_idx], 0);

    // ✅ NEW: Check if slab is now EMPTY
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);  // Set empty_mask bit

        // Notify shared pool: push to per-class freelist
        int class_idx = (int)meta->class_idx;
        if (class_idx >= 0 && class_idx < TINY_NUM_CLASSES_SS) {
            shared_pool_release_slab(ss, slab_idx);
        }
    }

    // ... update masks ...
}
```
|
#### Step 2: Add EMPTY Detection to TLS SLL Drain
**File**: `core/box/tls_sll_drain_box.c`
```c
uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size) {
    // ... existing drain logic ...

    // After draining N blocks from TLS SLL to freelist:
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);
        shared_pool_release_slab(ss, slab_idx);
    }
}
```
|
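Both steps apply the same rule: when the last live block of a slab is released, the slab is handed to the per-class freelist so that Stage 1 can reuse it. The sketch below isolates that rule in a self-contained toy. It is an assumption-laden illustration, not hakmem code: `ToyMeta`, `ToyFreelist`, `toy_free_block`, and `toy_stage1_pop` are hypothetical names, and the freelist is a plain single-threaded array stack, whereas the report describes the real Stage 1 path as lock-free.

```c
#include <stdint.h>

#define TOY_MAX_SLOTS 64

/* Hypothetical stand-in for TinySlabMeta: only the fields the check needs. */
typedef struct {
    uint16_t used;     /* live blocks in this slab */
    uint16_t capacity; /* total blocks in this slab (0 = slab not initialised) */
} ToyMeta;

/* Hypothetical per-class freelist of recycled slab indices (single-threaded). */
typedef struct {
    int slots[TOY_MAX_SLOTS];
    int count;
} ToyFreelist;

/* Release one block. If the slab becomes EMPTY, push it to the freelist.
 * Returns 1 when the slab was recycled, 0 otherwise. */
int toy_free_block(ToyMeta *meta, int slab_idx, ToyFreelist *fl) {
    if (meta->used == 0) {
        return 0; /* nothing live: a real allocator would treat this as a double free */
    }
    meta->used--;
    if (meta->used == 0 && meta->capacity > 0 && fl->count < TOY_MAX_SLOTS) {
        fl->slots[fl->count++] = slab_idx; /* EMPTY → freelist */
        return 1;
    }
    return 0;
}

/* Stage 1 reuse: pop a recycled slab index, or -1 if the freelist is empty. */
int toy_stage1_pop(ToyFreelist *fl) {
    return fl->count > 0 ? fl->slots[--fl->count] : -1;
}
```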
### 3.2 Expected Impact (After Implementation)

**Performance Prediction** (from Phase 9-2 investigation, Section 9.2):

| Configuration | Throughput | Kernel Overhead | Stage 1 Hit Rate |
|--------------|------------|-----------------|------------------|
| Current (no recycling) | 16.5 M ops/s | 55% | 0% |
| **Option A (EMPTY recycling)** | **25-28 M ops/s** | 15% | 80% |
| Option A+B (+ 2MB SS) | 30-35 M ops/s | 12% | 85% |

**Why +50-70% Improvement**:
- EMPTY slabs recycle instantly via lock-free Stage 1
- Soft cap never hit (slots reused, not created)
- Eliminates mmap/munmap overhead from legacy fallback
- SuperSlab backend becomes **fully functional**

---
|
## 4. Comparison with Phase 9-1

### 4.1 Phase 9-1 Status

From `PHASE9_1_PROGRESS.md`:

**Phase 9-1 Goal**: Optimize SuperSlab lookup (50-80 cycles → 8-12 cycles)
**Status**: Infrastructure complete (4/6 steps), **migration not yet complete**
- ✅ Step 1-4: Hash table + TLS hints implementation
- ⏸️ Step 5: Migration (IN PROGRESS)
- ⏸️ Step 6: Benchmark (PENDING)

**Key Point**: Phase 9-1 optimizations are **not yet integrated** into the hot path.

### 4.2 Phase 9-2 Status

**Phase 9-2 Goal**: Fix SuperSlab backend (eliminate legacy fallbacks)
**Status**: Investigation complete, **implementation not started**
- ✅ Root cause identified (EMPTY recycling missing)
- ✅ 4 fix options proposed (Option A recommended)
- ⏸️ Implementation: NOT STARTED
- ⏸️ Benchmark: NOT STARTED

**Key Point**: Phase 9-2 is still in the **planning phase**.

---
|
## 5. Performance Budget Analysis

### 5.1 Current Bottlenecks (WS8192)

```
Total: 212 cycles/op (16.5 M ops/s @ 2.8 GHz)
  - SuperSlab Lookup:  50-80 cycles  ← Phase 9-1 target
  - Legacy Fallback:   30-50 cycles  ← Phase 9-2 target
  - Fragmentation:     30-50 cycles
  - TLS Drain:         10-15 cycles
  - Actual Work:       30-40 cycles
```

**Kernel Overhead**: 55% (mmap/munmap from legacy fallback)

### 5.2 Expected After Phase 9-1 + 9-2

**After Phase 9-1** (lookup optimization):
```
Total: 152 cycles/op (18.4 M ops/s baseline)
  - SuperSlab Lookup:  8-12 cycles   ✅ Fixed (hash + TLS hints)
  - Legacy Fallback:   30-50 cycles  ← Still broken
  - Fragmentation:     30-50 cycles
  - TLS Drain:         10-15 cycles
  - Actual Work:       30-40 cycles
```
**Expected**: 16.5M → 23-25M ops/s (+39-52%)

**After Phase 9-1 + 9-2** (lookup + backend):
```
Total: 95 cycles/op (29.5 M ops/s baseline)
  - SuperSlab Lookup:  8-12 cycles   ✅ Fixed (Phase 9-1)
  - Legacy Fallback:   0 cycles      ✅ Fixed (Phase 9-2)
  - SuperSlab Backend: 15-20 cycles  ✅ Stage 1 reuse
  - Fragmentation:     20-30 cycles
  - TLS Drain:         10-15 cycles
  - Actual Work:       30-40 cycles
```
**Expected**: 16.5M → **30-35M ops/s** (+80-110%)
**Kernel Overhead**: 55% → 12-15%
|
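The cycle budgets and the throughput figures in this section are related by throughput = clock frequency / cycles per operation. A minimal sketch of that conversion follows; the helper names are illustrative, and the 2.8 GHz clock is the one quoted in the budget. At 2.8 GHz, 152 cycles/op gives about 18.4 M ops/s and 95 cycles/op about 29.5 M ops/s, matching the totals listed. By the same formula, 212 cycles/op corresponds to about 13.2 M ops/s, and 16.5 M ops/s corresponds to roughly 170 cycles/op, so the 212-cycle total and the 16.5 M ops/s figure do not agree exactly at that clock.

```c
/* Throughput in million ops/s for a given clock (GHz) and cost (cycles/op).
 * 1 GHz = 1000 million cycles per second. */
double mops_from_cycles(double ghz, double cycles_per_op) {
    return ghz * 1000.0 / cycles_per_op;
}

/* Inverse: cycles/op implied by a clock (GHz) and a throughput (M ops/s). */
double cycles_from_mops(double ghz, double mops) {
    return ghz * 1000.0 / mops;
}
```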
---

## 6. Diagnostic Output Analysis

### 6.1 Repeated Warnings

**TLS_SLL_POP_POST_INVALID**:
```
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x7 last_writer=pop
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6 count=0
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x5b last_writer=pop
```

**Analysis** (from Phase 9-2 investigation, Section 2):
- **cls=6**: Class 6 (512-byte blocks)
- **got=0x00**: Header corrupted/zeroed
- **count=0**: One-time event (not recurring)
- **Hypothesis**: Use-after-free or slab reuse race
- **Mitigation**: Existing guards (`tiny_tls_slab_reuse_guard()`) should prevent this
- **Verdict**: **Not critical** (one-time event, guards in place)
- **Action**: Monitor with `HAKMEM_SUPER_REG_DEBUG=1` for recurrence
|
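The `expect=0xa6` value logged for `cls=6` is consistent with a one-byte block header that combines a fixed tag in the high nibble with the class index in the low nibble. That layout is an inference from this single log line, not something the report or the investigation confirms, so the sketch below is purely illustrative and both function names are hypothetical. Under that assumption, a zeroed header (`got=0x00`) fails the check for every class.

```c
#include <stdint.h>

/* Assumed header layout (hypothetical): high nibble 0xA is a magic tag,
 * low nibble is the size-class index. */
uint8_t toy_expected_header(int class_idx) {
    return (uint8_t)(0xA0u | ((unsigned)class_idx & 0x0Fu));
}

/* Returns 1 if the header byte matches the value expected for class_idx,
 * 0 otherwise. */
int toy_header_ok(uint8_t got, int class_idx) {
    return got == toy_expected_header(class_idx);
}
```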
### 6.2 Shared Fail Events

```
[SS_BACKEND] shared_fail→legacy cls=7
```

**Count**: 4 events per benchmark run
**Class**: Class 7 (2048-byte allocations, 1024-1040B range in benchmark)
**Reason**: Soft cap reached (Stage 3 blocked)
**Impact**: Falls back to system malloc → kernel overhead

**This is the PRIMARY bottleneck** that Phase 9-2 Option A will fix.

---
|
## 7. Verification of Test Configuration

### 7.1 Benchmark Parameters

**Command Used**:
```bash
./bench_random_mixed_hakmem 10000000 8192
```

**Breakdown**:
- `10000000`: 10M cycles (total operations, long enough for a steady-state measurement)
- `8192`: Working set size (WS8192)

**From `bench_random_mixed.c` (lines 45-46)**:
```c
int cycles = (argc>1)? atoi(argv[1]) : 10000000; // total ops
int ws = (argc>2)? atoi(argv[2]) : 8192;         // working-set slots
```

**Allocation Pattern** (line 116):
```c
size_t sz = 16u + (r & 0x3FFu); // 16..1040 bytes (approx 16..1024)
```

**Class Distribution** (estimated):
```
16-64B    → Classes 0-3 (~40%)
64-256B   → Classes 4-5 (~30%)
256-512B  → Class 6     (~20%)
512-1040B → Class 7     (~10% = ~820 live allocations)
```

**Why Class 7 Exhausts**:
- 820 live allocations ÷ 511 blocks/SuperSlab = 1.6 SuperSlabs (rounded up to 2)
- Soft cap = 2 → any additional allocation fails → legacy fallback
|
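Two pieces of arithmetic in this subsection are easy to check in isolation: the size formula and the SuperSlab count. The mask `0x3FF` keeps the low 10 bits of `r`, so the formula yields sizes from 16 to 1039 bytes inclusive, and holding 820 live blocks at 511 blocks per SuperSlab takes 2 SuperSlabs by ceiling division. The sketch below reproduces both; `bench_size` and `superslabs_needed` are illustrative names, not functions from the benchmark or from hakmem.

```c
#include <stddef.h>
#include <stdint.h>

/* Size formula quoted from the benchmark: 16 + (r & 0x3FF).
 * The mask limits the random part to 0..1023. */
size_t bench_size(uint32_t r) {
    return 16u + (r & 0x3FFu);
}

/* Number of SuperSlabs needed to hold `live` blocks when each SuperSlab
 * holds `blocks_per_ss` blocks (ceiling division, blocks_per_ss > 0). */
size_t superslabs_needed(size_t live, size_t blocks_per_ss) {
    return (live + blocks_per_ss - 1) / blocks_per_ss;
}
```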
### 7.2 Comparison with Phase 9-1 Baseline

**From `PHASE9_1_PROGRESS.md` (line 142)**:
```bash
./bench_random_mixed_hakmem 10000000 8192  # ~16.5 M ops/s
```

**Current Measurement**:
- SuperSlab OFF: 16.23 M ops/s
- SuperSlab ON: 16.15 M ops/s

**Match**: ✅ Values align with Phase 9-1 baseline (16.5M vs 16.2M, within variance)

---
|
## 8. Next Steps

### 8.1 Immediate Actions

1. **Implement Phase 9-2 Option A** (EMPTY→Freelist recycling)
   - Modify `core/superslab_slab.c` (remote drain)
   - Modify `core/box/tls_sll_drain_box.c` (TLS SLL drain)
   - Add EMPTY detection: `if (meta->used == 0) { shared_pool_release_slab(...) }`

2. **Run Debug Build** to verify EMPTY recycling
   ```bash
   make clean
   make CFLAGS="-O2 -g -DHAKMEM_BUILD_RELEASE=0" bench_random_mixed_hakmem

   HAKMEM_TINY_USE_SUPERSLAB=1 \
   HAKMEM_SS_ACQUIRE_DEBUG=1 \
   HAKMEM_SHARED_POOL_STAGE_STATS=1 \
   ./bench_random_mixed_hakmem 100000 256 42
   ```

3. **Verify Stage 1 Hits** in debug output
   - Look for `[SP_ACQUIRE_STAGE1_LOCKFREE]` logs
   - Confirm freelist population: `[SP_SLOT_FREELIST_LOCKFREE]`
   - Verify zero `shared_fail→legacy` events

### 8.2 Performance Validation

4. **Re-run WS8192 Benchmark** (after Option A implementation)
   ```bash
   # Baseline (should be same as before)
   HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192

   # Optimized (should show +50-70% improvement)
   HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
   ```

5. **Success Criteria** (from Phase 9-2 Section 11.2):
   - ✅ Throughput: 16.5M → 25-30M ops/s (+50-80%)
   - ✅ Zero `shared_fail→legacy` events
   - ✅ Stage 1 hit rate: 70-80% (after warmup)
   - ✅ Kernel overhead: 55% → <15%
|
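The four success criteria can be turned into a single pass/fail check for the re-run. This is a sketch under assumptions: `Phase92Result` and `phase92_success` are hypothetical names, the thresholds are the lower bounds of the ranges listed in the criteria, and the measured values would have to be collected by hand from the benchmark output and a profiler. With the current baseline (16.15 M ops/s, 4 fallbacks, 0% Stage 1 hits, 55% kernel time) the check fails, as expected.

```c
#include <stdbool.h>

/* Hypothetical container for the four measured quantities. */
typedef struct {
    double throughput_mops;  /* throughput in M ops/s                    */
    long   legacy_fallbacks; /* number of shared_fail→legacy events      */
    double stage1_hit_pct;   /* Stage 1 hit rate after warmup, in %      */
    double kernel_pct;       /* share of CPU time spent in kernel, in %  */
} Phase92Result;

/* True only when every criterion from Section 8.2 is met. */
bool phase92_success(const Phase92Result *r) {
    return r->throughput_mops >= 25.0
        && r->legacy_fallbacks == 0
        && r->stage1_hit_pct >= 70.0
        && r->kernel_pct < 15.0;
}
```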
### 8.3 Optional Enhancements

6. **Implement Option B** (revert to 2MB SuperSlab)
   - Change `SUPERSLAB_LG_DEFAULT` from 19 → 21
   - Expected additional gain: +10-15% (30-35M ops/s total)

7. **Implement Option D** (expand EMPTY scan limit)
   - Change `HAKMEM_SS_EMPTY_SCAN_LIMIT` default from 16 → 64
   - Expected additional gain: +3-8% (marginal)

---
|
## 9. Risk Assessment

### 9.1 Implementation Risks (Option A)

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Double-free in EMPTY detection** | Low | Critical | Add a `meta->used == 0` assertion before `shared_pool_release_slab()` |
| **Race: EMPTY→ACTIVE→EMPTY** | Medium | Medium | Use atomic `meta->used` reads; Stage 1 CAS prevents double-activation |
| **Deadlock in release_slab** | Low | Medium | Use lock-free push (already implemented) |

**Overall**: Low risk (Box boundaries well-defined, guards in place)

### 9.2 Performance Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| **Improvement less than expected** | Medium | Medium | Profile with perf, check Stage 1 hit rate, consider Option B |
| **Regression in other workloads** | Low | Medium | Run full benchmark suite (WS256, cache_thrash, larson) |
| **Memory leak from freelist** | Low | High | Monitor RSS growth, verify EMPTY detection logic |

**Overall**: Medium risk (new feature, but small code change)

---
|
## 10. Lessons Learned

### 10.1 Benchmark Parameter Confusion

**Issue**: The initial request said "I ended up measuring with the default parameters, and the workload was too light"
**Reality**: Default parameters ARE WS8192 (line 46 in `bench_random_mixed.c`)
```c
int ws = (argc>2)? atoi(argv[2]) : 8192;  // default: 8192
```

**Takeaway**: Always check source code to verify default behavior (documentation may be outdated).

### 10.2 SuperSlab Enabled ≠ SuperSlab Functional

**Issue**: `HAKMEM_TINY_USE_SUPERSLAB=1` enables SuperSlab code, but doesn't guarantee it's used.
**Reality**: Legacy fallback is triggered when SuperSlab backend fails (soft cap, OOM, etc.)

**Takeaway**: Check for `shared_fail→legacy` warnings in output to verify SuperSlab is actually being used.

### 10.3 Phase Dependencies

**Issue**: Assumed Phase 9-2 was complete (based on PHASE9_2_*.md files)
**Reality**: Phase 9-2 investigation is complete, but **implementation has not started**

**Takeaway**: Check the document status header (e.g., "Status: Root Cause Analysis Complete" vs "Status: Implementation Complete")

---
|
## 11. Conclusion

**Current State**: WS8192 benchmark correctly measured at about 16.2 M ops/s (16.15-16.23 M), consistent across SuperSlab ON/OFF.

**Root Cause**: SuperSlab backend falls back to legacy system malloc due to missing EMPTY→Freelist recycling (Phase 9-2 Option A).

**Expected Improvement**: After implementing Option A, expect 25-30 M ops/s (+50-80%) by eliminating legacy fallbacks and enabling lock-free Stage 1 EMPTY reuse.

**Next Action**: Implement Phase 9-2 Option A (2-3 hour task), then re-benchmark WS8192 to verify +50-70% improvement.

---

**Report Prepared By**: Claude (Sonnet 4.5)
**Benchmark Date**: 2025-11-30
**Total Test Time**: ~3.7 seconds (6 runs × ~0.62s average)
**Status**: Baseline established, awaiting Phase 9-2 implementation