# 3-Layer Architecture Performance Comparison (2025-11-01)

## 📊 Results Summary

### Tiny Hot Bench (64B)

| Metric | Baseline (old) | 3-Layer (current) | Change |
|--------|----------------|-------------------|--------|
| **Throughput** | 179 M ops/s | 116.64 M ops/s | **-35%** ❌ |
| **Latency** | 5.6 ns/op | 8.57 ns/op | +53% ❌ |
| **Instructions/op** | 100.1 | 169.9 | **+70%** ❌ |
| **Total instructions** | 2.00B | 3.40B | +70% ❌ |
| **Branch misses** | 0.14% | 0.13% | -7% ✅ |
| **L1 cache misses** | 1.34M | 0.54M | -60% ✅ |

---

## 🔍 Layer Hit Statistics (3-Layer)

```
=== 3-Layer Architecture Stats ===
Bump hits:             0 ( 0.00%) ❌
Mag hits:        9843754 (98.44%) ✅
Slow hits:        156252 ( 1.56%) ✅
Total allocs:   10000006
Refill count:     156252
Refill items:    9843876 (avg 63.0/refill)
```

**Analysis**:
- ✅ **Magazine working**: 98.44% hit rate (was 0% in first attempt)
- ❌ **Bump allocator NOT working**: 0% hit rate (not implemented)
- ✅ **Slow path reduced**: 1.56% (was 100% in first attempt)
- ✅ **Refill logic working**: 156K refills, 63 items/refill average
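
The percentages in the stats dump follow directly from the raw counters. Below is a minimal sketch of that arithmetic, assuming a hypothetical `layer_stats_t` counter struct; the field and function names are illustrative and are not taken from the allocator source. With the counters reported above it gives a total of 10,000,006 allocations and an average of 63 items per refill.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter layout; names are illustrative only. */
typedef struct {
    uint64_t bump_hits;
    uint64_t mag_hits;
    uint64_t slow_hits;
    uint64_t refill_count;
    uint64_t refill_items;
} layer_stats_t;

/* Every allocation is served by exactly one layer, so the total is the sum. */
static uint64_t layer_total(const layer_stats_t *s) {
    return s->bump_hits + s->mag_hits + s->slow_hits;
}

/* Share of all allocations served by one layer, in percent. */
static double layer_hit_pct(uint64_t hits, const layer_stats_t *s) {
    uint64_t total = layer_total(s);
    assert(total > 0);
    return 100.0 * (double)hits / (double)total;
}

/* Average number of items moved into the Magazine per refill. */
static double avg_items_per_refill(const layer_stats_t *s) {
    assert(s->refill_count > 0);
    return (double)s->refill_items / (double)s->refill_count;
}
```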

---

## 🚨 Root Cause Analysis

### Why is performance WORSE?

#### 1. Expensive Slow Path Refill (Critical Issue)

**Current implementation** (`tiny_alloc_slow_new`):
```c
// Calls hak_tiny_alloc_slow 64 times per refill!
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx);  // 64 function calls!
    items[refilled++] = p;
}
```

**Cost per refill**:
- 64 function calls to `hak_tiny_alloc_slow`
- Each call goes through the old 6-7 layer architecture
- Each call has full overhead (locks, checks, slab management)

**Total overhead**:
- 156,252 refills × 64 calls = **10 million** expensive slow path calls
- This is 50% of the 20M total ops (3.40B instructions ÷ 169.9 instructions/op), or roughly one slow path call per allocation (10,000,006 allocs)
- Each slow path call costs ~100+ instructions

**Calculation**:
```
Extra instructions from refill = 10M × 100 = 1 billion instructions (at ~100 insns/call)
Baseline instructions          = 2 billion
3-layer instructions           = 3.4 billion
Observed overhead              = 3.4B - 2.0B = 1.4 billion
=> refill explains at least ~1.0B of the 1.4B; ~140 insns/call would explain all of it
```
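
The estimate above can be reproduced with a few lines of arithmetic. This is an illustrative sketch only: the helper names are hypothetical, and the ~100 instructions per call is the rough figure from this report, not a measurement. With 156,252 refills of 64 calls each it yields 10,000,128 slow path calls and about 1.0B extra instructions, and it shows that roughly 140 instructions per call would account for the whole 1.4B gap.

```c
#include <assert.h>
#include <stdint.h>

/* Slow path calls implied by the refill loop: one call per item per refill. */
static uint64_t refill_slow_calls(uint64_t refills, uint64_t calls_per_refill) {
    return refills * calls_per_refill;
}

/* Estimated extra instructions for an assumed per-call cost. */
static uint64_t refill_extra_insns(uint64_t slow_calls, uint64_t insns_per_call) {
    return slow_calls * insns_per_call;
}

/* Per-call cost that would explain the observed overhead entirely. */
static double implied_insns_per_call(uint64_t observed_overhead, uint64_t slow_calls) {
    assert(slow_calls > 0);
    return (double)observed_overhead / (double)slow_calls;
}
```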

#### 2. Bump Allocator Not Implemented

- Bump allocator returns NULL (not implemented)
- Hot classes (0-2: 8B/16B/32B) fall through to Magazine
- Missing ultra-fast path (2-3 instructions/op target)

#### 3. Magazine-only vs Layered Fast Paths

**Old architecture had specialized hot paths**:
- `HAKMEM_TINY_BENCH_FASTPATH` (SLL + Magazine for benchmarks)
- `TinyHotMag` (class 0-2 specialized)
- `g_hot_alloc_fn` (class 0-3 specialized functions)

**New architecture only has**:
- Small Magazine (generic for all classes)

**Missing optimization**: No specialized hot paths for 8B/16B/32B

---

## 🎯 Performance Goals vs Reality

| Metric | Baseline | Goal | Current | Gap to goal |
|--------|----------|------|---------|-------------|
| **Tiny Hot insns/op** | 100 | 20-30 | **169.9** | +140 to +150 insns/op (too many) |
| **Tiny Hot throughput** | 179 M/s | 240-250 M/s | **116.64 M/s** | -123 to -133 M/s |
| **Random Mixed insns/op** | 412 | 100-150 | **Not tested** | N/A |

**Status**: ❌ Both measured goals are missed by a wide margin; Random Mixed has not been tested yet

---

## 🔧 Options to Fix

### Option A: Optimize Slow Path Refill (High Priority)

**Problem**: Calling `hak_tiny_alloc_slow` 64 times per refill is too expensive

**Solution 1**: Batch allocation from slab
```c
// Instead of 64 individual calls, allocate from slab in one shot
void* slab_batch_alloc(int class_idx, int count, void** out_items);
```

**Expected gain**:
- 64 calls → 1 call per refill (~64x fewer slow path calls; fixed per-call overhead such as locks and checks is paid once per batch instead of once per item)
- Instructions/op: 169.9 → ~110 (estimate)
- Throughput: 116.64 → ~155 M ops/s (estimate)
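
To make the batching idea concrete, here is a minimal sketch that pops many blocks from a slab in a single call. It assumes the slab keeps an intrusive singly linked free list, which this report does not confirm. `demo_slab_t`, `demo_slab_init`, and `demo_slab_batch_alloc` are hypothetical names, and unlike the prototype in Solution 1 the sketch returns the number of blocks handed out so the caller can detect a short batch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free block header: each free block points to the next one. */
typedef struct free_block {
    struct free_block *next;
} free_block_t;

/* Hypothetical per-class slab state; not the allocator's real structure. */
typedef struct {
    free_block_t *free_list;
    size_t free_count;
} demo_slab_t;

/* Thread n blocks of block_size bytes from buf onto the free list, lowest address first. */
static void demo_slab_init(demo_slab_t *slab, void *buf, size_t block_size, size_t n) {
    assert(block_size >= sizeof(free_block_t));
    assert(block_size % sizeof(void *) == 0);
    assert((uintptr_t)buf % sizeof(void *) == 0);
    slab->free_list = NULL;
    slab->free_count = 0;
    for (size_t i = n; i > 0; i--) {
        free_block_t *b = (free_block_t *)((char *)buf + (i - 1) * block_size);
        b->next = slab->free_list;
        slab->free_list = b;
        slab->free_count++;
    }
}

/* Pop up to count blocks in one call; returns how many were written to out_items. */
static int demo_slab_batch_alloc(demo_slab_t *slab, int count, void **out_items) {
    int n = 0;
    /* Any locking or validation would happen once here, not once per block. */
    while (n < count && slab->free_list != NULL) {
        free_block_t *b = slab->free_list;
        slab->free_list = b->next;
        out_items[n++] = b;
    }
    slab->free_count -= (size_t)n;
    return n;
}
```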

**Solution 2**: Direct slab carving
```c
// Directly carve from superslab without going through slow path
void* items = superslab_carve_batch(class_idx, 64, size);
```

**Expected gain**:
- Removes the old slow path overhead from the refill entirely
- Instructions/op: 169.9 → ~70-80 (estimate)
- Throughput: 116.64 → ~185 M ops/s (estimate)
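
Carving can be sketched as plain pointer arithmetic over a contiguous, not-yet-used region. The sketch below is an assumption about how such a routine might look, not the real `superslab_carve_batch`: `demo_carve_region_t` and `demo_carve_batch` are hypothetical names, and the real superslab layout and metadata updates are not shown in this report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical carve region: a contiguous, still unused span of one slab. */
typedef struct {
    char *cur;  /* next uncarved byte */
    char *end;  /* one past the last usable byte */
} demo_carve_region_t;

/* Carve up to count blocks of size bytes; returns how many were written to out_items. */
static int demo_carve_batch(demo_carve_region_t *r, size_t size, int count, void **out_items) {
    assert(size > 0);
    int n = 0;
    /* No per-block function call: each block costs one store and one add. */
    while (n < count && (size_t)(r->end - r->cur) >= size) {
        out_items[n++] = r->cur;
        r->cur += size;
    }
    return n;
}
```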

### Option B: Implement Bump Allocator (Medium Priority)

**Status**: Currently returns NULL (not implemented)

**Implementation needed**:
```c
// Point the per-class bump region at a fresh block [base, base + total_size)
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
```

**Expected gain**:
- Hot classes (0-2) hit Bump first (2-3 insns/op)
- Reduce Magazine pressure
- Instructions/op: -10 to -20 (estimate)
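
For context, the allocation side that such a refill would feed is typically one bounds check plus one pointer increment. The following is a minimal sketch, assuming a per-class state with the same `bcur`/`bend` fields as the snippet above; `demo_bump_t`, `demo_bump_refill`, and `demo_bump_alloc` are hypothetical names, and alignment handling is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-class bump state, mirroring the bcur/bend fields. */
typedef struct {
    char *bcur;  /* next free byte */
    char *bend;  /* one past the end of the region */
} demo_bump_t;

/* Point the bump region at a fresh block [base, base + total_size). */
static void demo_bump_refill(demo_bump_t *b, void *base, size_t total_size) {
    b->bcur = (char *)base;
    b->bend = (char *)base + total_size;
}

/* Fast path: one bounds check and one pointer increment. */
static void *demo_bump_alloc(demo_bump_t *b, size_t size) {
    assert(size > 0);
    if (b->bcur == NULL || (size_t)(b->bend - b->bcur) < size) {
        return NULL;  /* empty or exhausted: caller falls through to the Magazine */
    }
    void *p = b->bcur;
    b->bcur += size;
    return p;
}
```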

### Option C: Rollback to Baseline

**When**: If Option A + B don't achieve goals

**Decision criteria** (after the optimizations):
- Instructions/op still above the baseline (> 100)
- Throughput still below the baseline (< 179 M ops/s)
- Complexity outweighs benefits

---

## 📋 Next Steps

### Immediate (Fix slow path refill)

1. **Implement batched slab carving** (Option A, Solution 2)
   - Create `superslab_carve_batch` function
   - Bypass old slow path entirely
   - Directly carve 64 items from superslab

2. **Test and measure**
   - Rebuild and run `bench_tiny_hot_hakx`
   - Check instructions/op (target: < 110, the conservative Solution 1 estimate)
   - Check throughput (target: > 155 M ops/s, the conservative Solution 1 estimate)

3. **If successful, implement Bump** (Option B)
   - Add `tiny_bump_refill` to slow path
   - Allocate 4KB slab, use for Bump
   - Test hot classes (0-2) hit rate

### Decision Point

**If after A + B**:
- ✅ Instructions/op < 100: Continue with 3-layer
- ⚠️ Instructions/op 100-120: Evaluate, may keep if stable
- ❌ Instructions/op > 120: Rollback, 3-layer adds too much overhead
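
If the decision is to be checked automatically after each benchmark run, the three bands can be encoded as a small helper. This is only a sketch of the thresholds listed above; `decision_t` and `decide_after_a_and_b` are hypothetical names, and the "may keep if stable" judgment in the middle band still needs a human. The current measurement of 169.9 instructions/op falls in the rollback band.

```c
#include <assert.h>

typedef enum {
    DECISION_CONTINUE,  /* < 100 insns/op: keep the 3-layer design */
    DECISION_EVALUATE,  /* 100-120 insns/op: judgment call */
    DECISION_ROLLBACK   /* > 120 insns/op: return to baseline */
} decision_t;

/* Map measured instructions/op to the three outcomes of the decision point. */
static decision_t decide_after_a_and_b(double insns_per_op) {
    assert(insns_per_op > 0.0);
    if (insns_per_op < 100.0) {
        return DECISION_CONTINUE;
    }
    if (insns_per_op <= 120.0) {
        return DECISION_EVALUATE;
    }
    return DECISION_ROLLBACK;
}
```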

---

## 🤔 Objective Assessment

### User's request: "Please judge objectively"

**Current status**:
- ❌ Performance is WORSE (-35% throughput, +70% instructions)
- ✅ Magazine working (98.44% hit rate)
- ❌ Slow path refill too expensive (an estimated ~1 billion extra instructions)
- ❌ Bump allocator not implemented

**Root cause**: Architectural mismatch
- Old slow path not designed for batch refill
- Calling it 64 times per refill defeats the purpose of simplification

**Recommendation**:
1. **Fix slow path refill** (batch allocation) - this is critical
2. **Test again** with realistic refill cost
3. **If still worse than baseline**: Roll back and try a different approach

**Alternative approach if fix fails**:
- Instead of replacing the entire architecture, add a specialized fastpath for class 0-2 only
- Keep existing architecture for class 3+ (proven to work)
- Smaller, safer change with lower risk

---

**User emphasized**: "Be careful when added complexity looks like it will make things heavier instead"

**Current reality**: ❌ We did get heavier (slower); need to fix or roll back