# 3-Layer Architecture Performance Comparison (2025-11-01)
## 📊 Results Summary
### Tiny Hot Bench (64B)
| Metric | Baseline (old) | 3-Layer (current) | Change |
|--------|----------------|-------------------|--------|
| **Throughput** | 179 M ops/s | 116.64 M ops/s | **-35%** ❌ |
| **Latency** | 5.6 ns/op | 8.57 ns/op | +53% ❌ |
| **Instructions/op** | 100.1 | 169.9 | **+70%** ❌ |
| **Total instructions** | 2.00B | 3.40B | +70% ❌ |
| **Branch misses** | 0.14% | 0.13% | -7% ✅ |
| **L1 cache misses** | 1.34M | 0.54M | -60% ✅ |
---
## 🔍 Layer Hit Statistics (3-Layer)
```
=== 3-Layer Architecture Stats ===
Bump hits: 0 ( 0.00%) ❌
Mag hits: 9843754 (98.44%) ✅
Slow hits: 156252 ( 1.56%) ✅
Total allocs: 10000006
Refill count: 156252
Refill items: 9843876 (avg 63.0/refill)
```
**Analysis**:
- **Magazine working**: 98.44% hit rate (was 0% in first attempt)
- **Bump allocator NOT working**: 0% hit rate (not implemented)
- **Slow path reduced**: 1.56% (was 100% in first attempt)
- **Refill logic working**: 156K refills, 63 items/refill average
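These ratios fall directly out of a 64-item magazine: every 64th allocation misses and triggers a refill (1/64 ≈ 1.56%), and the other 63 hit the magazine (63/64 ≈ 98.44%). A minimal self-contained sketch of the layer order (names like `tiny_alloc`, `bump_alloc`, and `slow_alloc` are illustrative stand-ins, not the real hakmem API):

```c
#include <stddef.h>

#define MAG_CAP 64  /* items carved per refill, matching the 63.0 avg above */

typedef struct { void* items[MAG_CAP]; int count; } tiny_mag_t;

static tiny_mag_t g_mag;
static long g_bump_hits, g_mag_hits, g_slow_hits;

static void* bump_alloc(void) { return NULL; }  /* Layer 1: not implemented */

static void* slow_alloc(void) {                 /* stand-in for hak_tiny_alloc_slow */
    static char pool[1 << 20];
    static size_t off;
    void* p = &pool[off];
    off += 64;
    return p;
}

static void* tiny_alloc(void) {
    void* p = bump_alloc();
    if (p) { g_bump_hits++; return p; }         /* Layer 1: Bump (always misses) */
    if (g_mag.count > 0) {                      /* Layer 2: Magazine */
        g_mag_hits++;
        return g_mag.items[--g_mag.count];
    }
    g_slow_hits++;                              /* Layer 3: slow path + 64-item refill */
    for (int i = 0; i < MAG_CAP; i++) g_mag.items[i] = slow_alloc();
    g_mag.count = MAG_CAP;
    return g_mag.items[--g_mag.count];
}
```

Driving this loop yields exactly one slow-path hit per 64 allocations, i.e. the measured 98.44% / 1.56% split, so the counters are consistent with a 64-item magazine and a dead Bump layer.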
---
## 🚨 Root Cause Analysis
### Why is performance WORSE?
#### 1. Expensive Slow Path Refill (Critical Issue)
**Current implementation** (`tiny_alloc_slow_new`):
```c
// Calls hak_tiny_alloc_slow 64 times per refill!
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx);  // one full slow-path call each
    items[refilled++] = p;
}
```
**Cost per refill**:
- 64 function calls to `hak_tiny_alloc_slow`
- Each call goes through old 6-7 layer architecture
- Each call has full overhead (locks, checks, slab management)
**Total overhead**:
- 156,252 refills × 64 calls = **10 million** expensive slow path calls
- That is one slow-path call per allocation (10M), i.e. 50% of the benchmark's 20M combined alloc/free operations
- Each slow path call costs ~100+ instructions
**Calculation**:
```
Extra instructions from refill ≈ 10M × ~100-140 = 1.0-1.4 billion
Baseline instructions = 2.0 billion
3-layer instructions  = 3.4 billion
Measured overhead     = 1.4 billion (consistent with the refill cost)
```
#### 2. Bump Allocator Not Implemented
- Bump allocator returns NULL (not implemented)
- Hot classes (0-2: 8B/16B/32B) fall through to Magazine
- Missing ultra-fast path (2-3 instructions/op target)
#### 3. Magazine-only vs Layered Fast Paths
**Old architecture had specialized hot paths**:
- HAKMEM_TINY_BENCH_FASTPATH (SLL + Magazine for benchmarks)
- TinyHotMag (class 0-2 specialized)
- g_hot_alloc_fn (class 0-3 specialized functions)
**New architecture only has**:
- Small Magazine (generic for all classes)
**Missing optimization**: No specialized hot paths for 8B/16B/32B
---
## 🎯 Performance Goals vs Reality
| Metric | Baseline | Goal | Current | Gap |
|--------|----------|------|---------|-----|
| **Tiny Hot insns/op** | 100 | 20-30 | **169.9** | +140 to +150 over goal |
| **Tiny Hot throughput** | 179 M/s | 240-250 M/s | **116.64 M/s** | -123 to -133 M/s |
| **Random Mixed insns/op** | 412 | 100-150 | **Not tested** | N/A |
**Status**: ❌ Missing all goals by significant margin
---
## 🔧 Options to Fix
### Option A: Optimize Slow Path Refill (High Priority)
**Problem**: Calling `hak_tiny_alloc_slow` 64 times is too expensive
**Solution 1**: Batch allocation from slab
```c
// Instead of 64 individual calls, allocate from slab in one shot
void* slab_batch_alloc(int class_idx, int count, void** out_items);
```
**Expected gain**:
- 64 calls → 1 call = ~60x reduction in overhead
- Instructions/op: 169.9 → ~110 (estimate)
- Throughput: 116.64 → ~155 M ops/s (estimate)
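A sketch of what such a batch path could look like, assuming a simple per-class slab with a carve cursor (`slab_t` and this `slab_batch_alloc` signature are hypothetical, not the existing hakmem API):

```c
#include <stddef.h>

/* Hypothetical per-class slab state: a carve cursor over contiguous memory. */
typedef struct {
    char*  cur;       /* next free byte */
    char*  end;       /* one past the end of the slab */
    size_t obj_size;  /* object size for this class */
} slab_t;

/* Carve up to `count` objects in a single pass and return how many fit.
 * One call replaces 64 individual hak_tiny_alloc_slow invocations. */
static int slab_batch_alloc(slab_t* s, int count, void** out_items) {
    int n = 0;
    while (n < count && s->cur + s->obj_size <= s->end) {
        out_items[n++] = s->cur;
        s->cur += s->obj_size;
    }
    return n;
}
```

The refill loop then collapses to a single call whose per-object cost is a pointer bump, instead of 64 trips through locks, checks, and slab management.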
**Solution 2**: Direct slab carving
```c
// Directly carve 64 blocks from the superslab without going through
// the slow path; items are written into an out-array.
void* items[64];
int carved = superslab_carve_batch(class_idx, 64, items);
```
**Expected gain**:
- Eliminate all slow path overhead
- Instructions/op: 169.9 → ~70-80 (estimate)
- Throughput: 116.64 → ~185 M ops/s (estimate)
### Option B: Implement Bump Allocator (Medium Priority)
**Status**: Currently returns NULL (not implemented)
**Implementation needed**:
```c
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
```
**Expected gain**:
- Hot classes (0-2) hit Bump first (2-3 insns/op)
- Reduce Magazine pressure
- Instructions/op: -10 to -20 (estimate)
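Paired with that refill, the hit path is one bounds check plus one pointer bump. A self-contained sketch (the `tiny_bump_t` layout is inferred from the `bcur`/`bend` fields used in `tiny_bump_refill`; `TINY_NUM_CLASSES` is a placeholder):

```c
#include <stddef.h>

#define TINY_NUM_CLASSES 8  /* placeholder class count */

/* Inferred from the bcur/bend fields referenced by tiny_bump_refill. */
typedef struct { char* bcur; char* bend; } tiny_bump_t;
static tiny_bump_t g_tiny_bump[TINY_NUM_CLASSES];

static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}

/* Hot path: one compare and one pointer bump on a hit. */
static inline void* tiny_bump_alloc(int class_idx, size_t size) {
    char* cur = g_tiny_bump[class_idx].bcur;
    if (cur + size <= g_tiny_bump[class_idx].bend) {
        g_tiny_bump[class_idx].bcur = cur + size;
        return cur;
    }
    return NULL;  /* exhausted: fall through to the Magazine layer */
}
```

On a hit this compiles to a handful of instructions, which is what makes the 2-3 insns/op target for hot classes plausible.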
### Option C: Rollback to Baseline
**When**: If Option A + B don't achieve goals
**Decision criteria**:
- If instructions/op > 100 after optimizations
- If throughput < 179 M ops/s after optimizations
- If complexity outweighs benefits
---
## 📋 Next Steps
### Immediate (Fix slow path refill)
1. **Implement slab batch allocation** (Option A, Solution 2)
- Create `superslab_carve_batch` function
- Bypass old slow path entirely
- Directly carve 64 items from superslab
2. **Test and measure**
- Rebuild and run bench_tiny_hot_hakx
- Check instructions/op (target: < 110)
- Check throughput (target: > 155 M ops/s)
3. **If successful, implement Bump** (Option B)
- Add `tiny_bump_refill` to slow path
- Allocate 4KB slab, use for Bump
- Test hot classes (0-2) hit rate
### Decision Point
**If after A + B**:
- ✅ Instructions/op < 100: Continue with 3-layer
- Instructions/op 100-120: Evaluate, may keep if stable
- Instructions/op > 120: Rollback, 3-layer adds too much overhead
---
## 🤔 Objective Assessment
### User's request: "客観的に判断おねがいね" (Please judge objectively)
**Current status**:
- ❌ Performance is WORSE (-35% throughput, +70% instructions)
- ✅ Magazine working (98.44% hit rate)
- ❌ Slow path refill too expensive (1 billion extra instructions)
- ❌ Bump allocator not implemented
**Root cause**: Architectural mismatch
- Old slow path not designed for batch refill
- Calling it 64 times defeats the purpose of simplification
**Recommendation**:
1. **Fix slow path refill** (batch allocation) - this is critical
2. **Test again** with realistic refill cost
3. **If still worse than baseline**: Rollback and try different approach
**Alternative approach if fix fails**:
- Instead of replacing entire architecture, add specialized fastpath for class 0-2 only
- Keep existing architecture for class 3+ (proven to work)
- Smaller, safer change with lower risk
---
**User emphasized**: "複雑で逆に重くなりそうなときは注意ね"
Translation: "Be careful if it gets complex and becomes heavier"
**Current reality**: the warning came true. We got heavier (slower), so we must fix it or roll back.