# 3-Layer Architecture Performance Comparison (2025-11-01)

## 📊 Results Summary

### Tiny Hot Bench (64B)

| Metric | Baseline (old) | 3-Layer (current) | Change |
|--------|----------------|-------------------|--------|
| **Throughput** | 179 M ops/s | 116.64 M ops/s | **-35%** ❌ |
| **Latency** | 5.6 ns/op | 8.57 ns/op | +53% ❌ |
| **Instructions/op** | 100.1 | 169.9 | **+70%** ❌ |
| **Total instructions** | 2.00B | 3.40B | +70% ❌ |
| **Branch misses** | 0.14% | 0.13% | -7% ✅ |
| **L1 cache misses** | 1.34M | 0.54M | -60% ✅ |

---
## 🔍 Layer Hit Statistics (3-Layer)

```
=== 3-Layer Architecture Stats ===
Bump hits:            0 ( 0.00%) ❌
Mag hits:       9843754 (98.44%) ✅
Slow hits:       156252 ( 1.56%) ✅
Total allocs:  10000006
Refill count:    156252
Refill items:   9843876 (avg 63.0/refill)
```
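
The percentages in the stats output above can be recomputed from the raw counters. The following is a minimal sketch of that arithmetic; the struct and function names (`layer_stats_t`, `hit_rate_pct`, `avg_items_per_refill`) are hypothetical and are not taken from the hakmem sources. It assumes that total allocations equal the sum of the three per-layer hit counters, which holds for the numbers shown (0 + 9843754 + 156252 = 10000006).

```c
#include <stdint.h>

/* Hypothetical counter snapshot; field names are illustrative only. */
typedef struct {
    uint64_t bump_hits;
    uint64_t mag_hits;
    uint64_t slow_hits;
    uint64_t refill_count;
    uint64_t refill_items;
} layer_stats_t;

/* Share of all allocations served by one layer, in percent. */
static double hit_rate_pct(uint64_t hits, const layer_stats_t *s) {
    uint64_t total = s->bump_hits + s->mag_hits + s->slow_hits;
    return total ? 100.0 * (double)hits / (double)total : 0.0;
}

/* Average number of items placed into the magazine per refill. */
static double avg_items_per_refill(const layer_stats_t *s) {
    return s->refill_count
        ? (double)s->refill_items / (double)s->refill_count
        : 0.0;
}
```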

**Analysis**:
- ✅ **Magazine working**: 98.44% hit rate (was 0% in first attempt)
- ❌ **Bump allocator NOT working**: 0% hit rate (not implemented)
- ✅ **Slow path reduced**: 1.56% (was 100% in first attempt)
- ✅ **Refill logic working**: 156K refills, 63 items/refill average

---
## 🚨 Root Cause Analysis

### Why is performance WORSE?

#### 1. Expensive Slow Path Refill (Critical Issue)

**Current implementation** (`tiny_alloc_slow_new`):
```c
// Calls hak_tiny_alloc_slow 64 times per refill!
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx);  // 64 function calls!
    items[refilled++] = p;
}
```

**Cost per refill**:
- 64 function calls to `hak_tiny_alloc_slow`
- Each call goes through the old 6-7 layer architecture
- Each call has full overhead (locks, checks, slab management)

**Total overhead**:
- 156,252 refills × 64 calls = **~10 million** expensive slow path calls
- That is roughly one slow path call per allocation, or 50% of all operations (20M ops, which appears to count alloc + free)!
- Each slow path call costs ~100+ instructions

**Calculation**:
```
Estimated extra instructions from refill = 10M × 100 = 1 billion instructions
Baseline instructions = 2.0 billion
3-layer instructions  = 3.4 billion
Measured overhead     = 3.4B - 2.0B = 1.4 billion
                        (same order as the estimate; implies ~140 instructions per slow path call)
```
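
The same arithmetic can be checked mechanically. The sketch below is illustrative only: the helper names are hypothetical, and it takes the document's figure of 64 slow path calls per refill at face value (the stats report an average of 63.0 items actually stored per refill). It derives the number of slow path calls and the per-call instruction cost implied by the measured instruction delta.

```c
#include <stdint.h>

/* Slow path calls issued by the refill loop: one call per requested item. */
static uint64_t refill_slow_calls(uint64_t refills, uint64_t calls_per_refill) {
    return refills * calls_per_refill;
}

/* Instruction cost per slow path call implied by a measured total delta.
 * Assumes the whole delta is attributable to the refill loop. */
static double implied_insns_per_call(uint64_t insns_new, uint64_t insns_base,
                                     uint64_t calls) {
    return calls ? (double)(insns_new - insns_base) / (double)calls : 0.0;
}
```

With the numbers from this report, `refill_slow_calls(156252, 64)` gives 10,000,128 calls, and dividing the 1.4 billion instruction delta by that count lands close to 140 instructions per call.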

#### 2. Bump Allocator Not Implemented

- Bump allocator returns NULL (not implemented)
- Hot classes (0-2: 8B/16B/32B) fall through to Magazine
- Missing ultra-fast path (2-3 instructions/op target)

#### 3. Magazine-only vs Layered Fast Paths

**Old architecture had specialized hot paths**:
- `HAKMEM_TINY_BENCH_FASTPATH` (SLL + Magazine for benchmarks)
- `TinyHotMag` (class 0-2 specialized)
- `g_hot_alloc_fn` (class 0-3 specialized functions)

**New architecture only has**:
- Small Magazine (generic for all classes)

**Missing optimization**: No specialized hot paths for 8B/16B/32B

---
## 🎯 Performance Goals vs Reality

| Metric | Baseline | Goal | Current | Gap to goal |
|--------|----------|------|---------|-------------|
| **Tiny Hot insns/op** | 100 | 20-30 | **169.9** | +140 to +150 insns/op (too high) |
| **Tiny Hot throughput** | 179 M/s | 240-250 M/s | **116.64 M/s** | -123 to -133 M/s |
| **Random Mixed insns/op** | 412 | 100-150 | **Not tested** | N/A |

**Status**: ❌ Both measured metrics miss their goals by a significant margin

---
## 🔧 Options to Fix

### Option A: Optimize Slow Path Refill (High Priority)

**Problem**: Calling `hak_tiny_alloc_slow` 64 times is too expensive

**Solution 1**: Batch allocation from slab
```c
// Instead of 64 individual calls, allocate from slab in one shot
void* slab_batch_alloc(int class_idx, int count, void** out_items);
```
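
To make the idea concrete, here is a minimal sketch of what a batch call might do internally, assuming the slab keeps an intrusive singly linked free list. Everything in it is an assumption for illustration: `demo_slab_t`, `free_node_t` and `demo_slab_batch_alloc` are hypothetical names, the function takes a slab pointer instead of `class_idx`, and it returns the number of items written instead of the `void*` in the prototype above. Locks and other checks would be taken once per batch rather than once per item.

```c
#include <stddef.h>

/* Hypothetical intrusive free list node stored inside each free block. */
typedef struct free_node {
    struct free_node *next;
} free_node_t;

/* Hypothetical slab descriptor; not the real hakmem structure. */
typedef struct {
    free_node_t *free_head;
    size_t free_count;
} demo_slab_t;

/* Pop up to `count` blocks in a single call.
 * Returns how many pointers were written to out_items (may be < count). */
static int demo_slab_batch_alloc(demo_slab_t *slab, int count, void **out_items) {
    int n = 0;
    while (n < count && slab->free_head != NULL) {
        free_node_t *node = slab->free_head;
        slab->free_head = node->next;   /* unlink the head block */
        out_items[n++] = node;
    }
    slab->free_count -= (size_t)n;
    return n;
}
```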

**Expected gain**:
- 64 calls → 1 call per refill, so the fixed per-call overhead is paid roughly 64x less often
- Instructions/op: 169.9 → ~110 (estimate)
- Throughput: 116.64 → ~155 M ops/s (estimate)

**Solution 2**: Direct slab carving
```c
// Directly carve from superslab without going through slow path
void* items = superslab_carve_batch(class_idx, 64, size);
```
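
As a rough illustration of direct carving, the sketch below hands out consecutive fixed-size blocks from one contiguous region with plain pointer arithmetic. It is not the real `superslab_carve_batch`: the region type `demo_region_t`, the function `demo_carve_batch`, its parameter order and its return-count convention are all assumptions made for this example.

```c
#include <stddef.h>

/* Hypothetical carve cursor over one contiguous slab region. */
typedef struct {
    char *cur;   /* next uncarved byte */
    char *end;   /* one past the last usable byte */
} demo_region_t;

/* Carve up to `count` blocks of `block_size` bytes from the region.
 * Returns the number of pointers written to out_items. */
static int demo_carve_batch(demo_region_t *r, size_t block_size, int count,
                            void **out_items) {
    int n = 0;
    if (block_size == 0 || r->cur == NULL) {
        return 0;                       /* nothing sensible to carve */
    }
    while (n < count && (size_t)(r->end - r->cur) >= block_size) {
        out_items[n++] = r->cur;
        r->cur += block_size;           /* advance past the carved block */
    }
    return n;
}
```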

**Expected gain**:
- Eliminate all slow path overhead
- Instructions/op: 169.9 → ~70-80 (estimate)
- Throughput: 116.64 → ~185 M ops/s (estimate)

### Option B: Implement Bump Allocator (Medium Priority)

**Status**: Currently returns NULL (not implemented)

**Implementation needed**:
```c
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
```
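
For context, the allocation side that such a refill would feed is typically a bounds check plus a pointer increment. The sketch below pairs a refill with that fast path; the `bcur`/`bend` field names mirror the snippet above, but `demo_bump_t`, `demo_bump_refill` and `demo_bump_alloc` are hypothetical and operate on a caller-provided struct instead of the `g_tiny_bump` array. Returning NULL on exhaustion is an assumption about how the caller would fall through to the Magazine.

```c
#include <stddef.h>

/* Hypothetical per-class bump state; field names mirror bcur/bend above. */
typedef struct {
    char *bcur;  /* next free byte */
    char *bend;  /* one past the end of the current chunk */
} demo_bump_t;

/* Point the bump state at a fresh chunk of memory. */
static void demo_bump_refill(demo_bump_t *b, void *base, size_t total_size) {
    b->bcur = (char *)base;
    b->bend = (char *)base + total_size;
}

/* Fast path: one bounds check and one pointer increment.
 * Returns NULL when the chunk is missing or exhausted. */
static inline void *demo_bump_alloc(demo_bump_t *b, size_t size) {
    if (b->bcur == NULL || (size_t)(b->bend - b->bcur) < size) {
        return NULL;
    }
    void *p = b->bcur;
    b->bcur += size;
    return p;
}
```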

**Expected gain**:
- Hot classes (0-2) hit Bump first (2-3 insns/op)
- Reduce Magazine pressure
- Instructions/op: -10 to -20 (estimate)

### Option C: Rollback to Baseline

**When**: If Options A + B don't achieve the goals

**Decision criteria**:
- If instructions/op > 100 after optimizations
- If throughput < 179 M ops/s after optimizations
- If complexity outweighs benefits

---
## 📋 Next Steps

### Immediate (Fix slow path refill)

1. **Implement direct slab carving** (Option A, Solution 2)
   - Create `superslab_carve_batch` function
   - Bypass old slow path entirely
   - Directly carve 64 items from superslab

2. **Test and measure**
   - Rebuild and run `bench_tiny_hot_hakx`
   - Check instructions/op (target: < 110)
   - Check throughput (target: > 155 M ops/s)

3. **If successful, implement Bump** (Option B)
   - Add `tiny_bump_refill` to slow path
   - Allocate 4KB slab, use for Bump
   - Test hot classes (0-2) hit rate

### Decision Point

**If after A + B**:
- ✅ Instructions/op < 100: Continue with 3-layer
- ⚠️ Instructions/op 100-120: Evaluate, may keep if stable
- ❌ Instructions/op > 120: Rollback, 3-layer adds too much overhead

---
## 🤔 Objective Assessment

### User's request: "Please judge objectively" (translated from Japanese)

**Current status**:
- ❌ Performance is WORSE (-35% throughput, +70% instructions)
- ✅ Magazine working (98.44% hit rate)
- ❌ Slow path refill too expensive (about 1 billion extra instructions by estimate)
- ❌ Bump allocator not implemented

**Root cause**: Architectural mismatch
- Old slow path not designed for batch refill
- Calling it 64 times per refill defeats the purpose of simplification

**Recommendation**:
1. **Fix slow path refill** (batch allocation) - this is critical
2. **Test again** with realistic refill cost
3. **If still worse than baseline**: Rollback and try a different approach

**Alternative approach if fix fails**:
- Instead of replacing the entire architecture, add a specialized fastpath for class 0-2 only
- Keep the existing architecture for class 3+ (proven to work)
- Smaller, safer change with lower risk

---

**User emphasized**: "Be careful if it gets complex and becomes heavier" (translated from Japanese)

**Current reality**: ❌ The 3-layer version did get heavier (slower); it needs a fix or a rollback