
# 3-Layer Architecture Performance Comparison (2025-11-01)

## 📊 Results Summary

### Tiny Hot Bench (64B)

| Metric | Baseline (old) | 3-Layer (current) | Change |
|--------|----------------|-------------------|--------|
| **Throughput** | 179 M ops/s | 116.64 M ops/s | **-35%** ❌ |
| **Latency** | 5.6 ns/op | 8.57 ns/op | +53% ❌ |
| **Instructions/op** | 100.1 | 169.9 | **+70%** ❌ |
| **Total instructions** | 2.00B | 3.40B | +70% ❌ |
| **Branch misses** | 0.14% | 0.13% | -7% ✅ |
| **L1 cache misses** | 1.34M | 0.54M | -60% ✅ |

---

## 🔍 Layer Hit Statistics (3-Layer)

```
=== 3-Layer Architecture Stats ===
Bump hits:              0 ( 0.00%)  ❌
Mag hits:         9843754 (98.44%)  ✅
Slow hits:         156252 ( 1.56%)  ✅
Total allocs:    10000006
Refill count:      156252
Refill items:     9843876 (avg 63.0/refill)
```

**Analysis**:
- ✅ **Magazine working**: 98.44% hit rate (was 0% in first attempt)
- ❌ **Bump allocator NOT working**: 0% hit rate (not implemented)
- ✅ **Slow path reduced**: 1.56% (was 100% in first attempt)
- ✅ **Refill logic working**: 156K refills, 63 items/refill average

---

## 🚨 Root Cause Analysis

### Why is performance WORSE?

#### 1. Expensive Slow Path Refill (Critical Issue)

**Current implementation** (`tiny_alloc_slow_new`):
```c
// Calls hak_tiny_alloc_slow 64 times per refill!
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx);  // 64 function calls!
    items[refilled++] = p;
}
```

**Cost per refill**:
- 64 function calls to `hak_tiny_alloc_slow`
- Each call goes through the old 6-7 layer architecture
- Each call has full overhead (locks, checks, slab management)

**Total overhead**:
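- 156,252 refills × 64 calls ≈ **10 million** expensive slow path calls
- That is about one slow path call per allocation (10M allocs), or 50% of all operations (20M ops = 2.00B instructions / 100.1 per op, presumably allocs plus frees)!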
- Each slow path call costs ~100+ instructions

**Calculation**:
```
Estimated extra instructions from refill = 10M calls × ~100 insns = ~1.0 billion
Baseline instructions = 2.0 billion
3-layer instructions  = 3.4 billion
Measured overhead     = 3.4B - 2.0B = 1.4 billion
  → ~140 insns per slow path call, consistent with the "~100+" estimate
```
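
The arithmetic above can be rechecked mechanically. The following is a minimal sketch, not part of the allocator: the function names are hypothetical, and it assumes that the whole instruction delta between baseline and 3-layer is attributable to the refill loop, which the measurements suggest but do not prove.

```c
#include <assert.h>
#include <stdint.h>

/* Slow-path calls triggered by refills: refills x calls issued per refill. */
uint64_t refill_slow_calls(uint64_t refills, uint64_t calls_per_refill) {
    return refills * calls_per_refill;
}

/* Instruction delta versus baseline, spread evenly over the slow-path calls.
 * Assumption: the entire delta comes from the refill loop. */
double implied_insns_per_slow_call(double baseline_insns,
                                   double current_insns,
                                   uint64_t slow_calls) {
    assert(slow_calls > 0);
    return (current_insns - baseline_insns) / (double)slow_calls;
}
```

With the measured values (156,252 refills, 64 calls each, 2.00B vs 3.40B instructions) this yields 10,000,128 slow path calls at roughly 140 instructions each.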

#### 2. Bump Allocator Not Implemented

- Bump allocator returns NULL (not implemented)
- Hot classes (0-2: 8B/16B/32B) fall through to Magazine
- Missing ultra-fast path (2-3 instructions/op target)

#### 3. Magazine-only vs Layered Fast Paths

**Old architecture had specialized hot paths**:
- HAKMEM_TINY_BENCH_FASTPATH (SLL + Magazine for benchmarks)
- TinyHotMag (class 0-2 specialized)
- g_hot_alloc_fn (class 0-3 specialized functions)

**New architecture only has**:
- Small Magazine (generic for all classes)

**Missing optimization**: No specialized hot paths for 8B/16B/32B

---

## 🎯 Performance Goals vs Reality

| Metric | Baseline | Goal | Current | Gap |
|--------|----------|------|---------|-----|
| **Tiny Hot insns/op** | 100 | 20-30 | **169.9** | +140 to +150 (over goal) |
| **Tiny Hot throughput** | 179 M/s | 240-250 M/s | **116.64 M/s** | -123 to -133 M/s |
| **Random Mixed insns/op** | 412 | 100-150 | **Not tested** | N/A |

**Status**: ❌ Missing both measured goals by a significant margin (Random Mixed not yet tested)

---

## 🔧 Options to Fix

### Option A: Optimize Slow Path Refill (High Priority)

**Problem**: Calling `hak_tiny_alloc_slow` 64 times is too expensive

**Solution 1**: Batch allocation from slab
```c
// Instead of 64 individual calls, allocate from slab in one shot
void* slab_batch_alloc(int class_idx, int count, void** out_items);
```

**Expected gain**:
- 64 calls → 1 call per refill: 64x fewer slow path entries, with the fixed per-call overhead (locks, checks) amortized over the batch
- Instructions/op: 169.9 → ~110 (estimate)
- Throughput: 116.64 → ~155 M ops/s (estimate)
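
To make the batch idea concrete, here is a minimal sketch under stated assumptions. It does not use hakmem's real slab structures: `demo_slab_t`, `demo_slab_init`, and `demo_slab_batch_alloc` are hypothetical names, the slab is modeled as an intrusive free list, and locking is omitted. Unlike the prototype above, it takes the slab directly and returns the number of items written.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slab model: free blocks form an intrusive singly linked list. */
typedef struct demo_free_node { struct demo_free_node* next; } demo_free_node_t;

typedef struct {
    demo_free_node_t* free_head;  /* first free block, or NULL when empty */
    int free_count;               /* number of blocks on the list */
} demo_slab_t;

/* Thread every block of a raw, pointer-aligned buffer onto the free list.
 * Blocks are pushed last-to-first so they pop in address order. */
void demo_slab_init(demo_slab_t* slab, void* buf, size_t block_size, int nblocks) {
    assert(block_size >= sizeof(demo_free_node_t));
    slab->free_head = NULL;
    slab->free_count = 0;
    for (int i = nblocks - 1; i >= 0; i--) {
        demo_free_node_t* n =
            (demo_free_node_t*)((char*)buf + (size_t)i * block_size);
        n->next = slab->free_head;
        slab->free_head = n;
        slab->free_count++;
    }
}

/* Pop up to `count` blocks in a single call; returns how many were written
 * to out_items. Any lock or slab lookup would be paid once, not per item. */
int demo_slab_batch_alloc(demo_slab_t* slab, int count, void** out_items) {
    int n = 0;
    while (n < count && slab->free_head != NULL) {
        demo_free_node_t* node = slab->free_head;
        slab->free_head = node->next;
        out_items[n++] = node;
    }
    slab->free_count -= n;
    return n;
}
```

A refill would then call `demo_slab_batch_alloc(slab, 64, items)` once and treat a short count as "slab exhausted", instead of entering the slow path 64 times.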

**Solution 2**: Direct slab carving
```c
// Directly carve from superslab without going through slow path
void* items = superslab_carve_batch(class_idx, 64, size);
```

**Expected gain**:
- Eliminate all slow path overhead
- Instructions/op: 169.9 → ~70-80 (estimate)
- Throughput: 116.64 → ~185 M ops/s (estimate)
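
Direct carving could look roughly like the sketch below. It is illustrative only: `demo_region_t` and `demo_carve_batch` are hypothetical, the real `superslab_carve_batch` signature and superslab layout are not shown in this document, and the sketch assumes the uncarved part of a superslab can be described by a simple cursor and end pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the uncarved part of a superslab. */
typedef struct {
    char* cur;  /* next uncarved byte */
    char* end;  /* one past the end of the region */
} demo_region_t;

/* Carve up to `count` blocks of `block_size` bytes. The bounds check runs
 * once per batch; the per-item work is a store and a pointer add.
 * Returns the number of blocks written to out_items. */
int demo_carve_batch(demo_region_t* r, size_t block_size, int count,
                     void** out_items) {
    assert(block_size > 0 && count >= 0);
    size_t avail = (size_t)(r->end - r->cur) / block_size;
    int n = (avail < (size_t)count) ? (int)avail : count;
    char* p = r->cur;
    for (int i = 0; i < n; i++) {
        out_items[i] = p;
        p += block_size;
    }
    r->cur = p;
    return n;
}
```

The design point is that no free list is touched at all: fresh memory is handed out by address arithmetic, which is why this variant is expected to remove more overhead than Solution 1.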

### Option B: Implement Bump Allocator (Medium Priority)

**Status**: Currently returns NULL (not implemented)

**Implementation needed**:
```c
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
```

**Expected gain**:
- Hot classes (0-2) hit Bump first (2-3 insns/op)
- Reduce Magazine pressure
- Instructions/op: -10 to -20 (estimate)
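
For reference, the allocation side of a bump allocator might be sketched as follows. The field names `bcur` and `bend` follow the refill snippet above; everything else (`demo_bump_t`, `demo_bump_alloc`, passing the state by pointer instead of indexing `g_tiny_bump`) is an assumption made to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char* bcur;  /* next free byte */
    char* bend;  /* one past the end of the bump region */
} demo_bump_t;

/* Point the bump region at a fresh chunk of memory. */
void demo_bump_refill(demo_bump_t* b, void* base, size_t total_size) {
    b->bcur = (char*)base;
    b->bend = (char*)base + total_size;
}

/* Fast path: one bounds check and one pointer add.
 * Returns NULL when the region is empty or exhausted, so the caller can
 * fall through to the next layer (the Magazine in the 3-layer design). */
void* demo_bump_alloc(demo_bump_t* b, size_t size) {
    assert(size > 0);
    if (b->bcur == NULL || (size_t)(b->bend - b->bcur) < size) {
        return NULL;
    }
    void* p = b->bcur;
    b->bcur += size;
    return p;
}
```

Note that a bump region cannot reuse freed blocks, so frees still have to go to the Magazine or slab; the bump layer only helps while fresh memory is being handed out.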

### Option C: Rollback to Baseline

**When**: If Option A + B don't achieve goals

**Decision criteria**:
- If instructions/op stays above the baseline (~100) after optimizations, with > 120 as the hard rollback threshold (see Decision Point)
- If throughput < 179 M ops/s after optimizations
- If complexity outweighs benefits

---

## 📋 Next Steps

### Immediate (Fix slow path refill)

1. **Implement direct slab carving** (Option A, Solution 2)
   - Create `superslab_carve_batch` function
   - Bypass old slow path entirely
   - Directly carve 64 items from superslab

2. **Test and measure**
   - Rebuild and run bench_tiny_hot_hakx
   - Check instructions/op (target: < 110)
   - Check throughput (target: > 155 M ops/s)

3. **If successful, implement Bump** (Option B)
   - Add `tiny_bump_refill` to slow path
   - Allocate 4KB slab, use for Bump
   - Test hot classes (0-2) hit rate

### Decision Point

**If after A + B**:
- ✅ Instructions/op < 100: Continue with 3-layer
- ⚠️ Instructions/op 100-120: Evaluate, may keep if stable
- ❌ Instructions/op > 120: Rollback, 3-layer adds too much overhead
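
The thresholds above can be encoded so that a benchmark script applies them consistently. This is a small sketch with hypothetical names; it assumes the boundary values 100 and 120 both fall into the "evaluate" band, which the list above leaves unspecified.

```c
#include <assert.h>

typedef enum {
    DEMO_CONTINUE,  /* < 100 insns/op: keep the 3-layer design */
    DEMO_EVALUATE,  /* 100-120 insns/op: judgment call */
    DEMO_ROLLBACK   /* > 120 insns/op: revert to baseline */
} demo_decision_t;

/* Map a measured instructions/op value to the decision bands. */
demo_decision_t demo_decide(double insns_per_op) {
    assert(insns_per_op >= 0.0);
    if (insns_per_op < 100.0) return DEMO_CONTINUE;
    if (insns_per_op <= 120.0) return DEMO_EVALUATE;
    return DEMO_ROLLBACK;
}
```

Applied to the current measurement of 169.9 insns/op, the function returns `DEMO_ROLLBACK`, which is why the refill fix has to land before the design can be judged fairly.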

---

## 🤔 Objective Assessment

### User's request: "Please judge objectively"

**Current status**:
- ❌ Performance is WORSE (-35% throughput, +70% instructions)
- ✅ Magazine working (98.44% hit rate)
- ❌ Slow path refill too expensive (an estimated 1 billion of the 1.4 billion extra instructions)
- ❌ Bump allocator not implemented

**Root cause**: Architectural mismatch
- Old slow path not designed for batch refill
- Calling it 64 times defeats the purpose of simplification

**Recommendation**:
1. **Fix slow path refill** (batch allocation) - this is critical
2. **Test again** with realistic refill cost
3. **If still worse than baseline**: Roll back and try a different approach

**Alternative approach if fix fails**:
- Instead of replacing entire architecture, add specialized fastpath for class 0-2 only
- Keep existing architecture for class 3+ (proven to work)
- Smaller, safer change with lower risk
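
The alternative approach amounts to a thin dispatch in front of the existing allocator. The sketch below shows only the routing; `demo_hot_fastpath` and `demo_legacy_alloc` are hypothetical stand-ins that count their calls, not hakmem functions.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_HOT_CLASS_MAX = 2 };  /* classes 0-2 (8B/16B/32B), per the text */

/* Stand-ins for the two paths; the counters make the routing observable. */
int demo_fast_calls = 0;
int demo_legacy_calls = 0;
static char demo_dummy_block[64];

void* demo_hot_fastpath(int class_idx) {
    (void)class_idx;
    demo_fast_calls++;
    return demo_dummy_block;  /* a real fast path could also return NULL */
}

void* demo_legacy_alloc(int class_idx) {
    (void)class_idx;
    demo_legacy_calls++;
    return demo_dummy_block;
}

/* Route hot classes to the specialized path; everything else, and any
 * fast-path miss, goes through the existing architecture unchanged. */
void* demo_tiny_alloc(int class_idx) {
    assert(class_idx >= 0);
    if (class_idx <= DEMO_HOT_CLASS_MAX) {
        void* p = demo_hot_fastpath(class_idx);
        if (p != NULL) return p;
    }
    return demo_legacy_alloc(class_idx);
}
```

Because class 3+ never enters the new code, a regression in the fast path cannot affect sizes that already perform well, which is the "lower risk" property claimed above.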

---

**User emphasized**: "Be careful when a change looks like it will get complex and end up heavier instead."

**Current reality**: ❌ We did get heavier (slower); we need to fix it or roll back