Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

Co-Authored-By: Claude <noreply@anthropic.com>
3-Layer Architecture Performance Comparison (2025-11-01)
📊 Results Summary
Tiny Hot Bench (64B)
| Metric | Baseline (old) | 3-Layer (current) | Change |
|---|---|---|---|
| Throughput | 179 M ops/s | 116.64 M ops/s | -35% ❌ |
| Latency | 5.6 ns/op | 8.57 ns/op | +53% ❌ |
| Instructions/op | 100.1 | 169.9 | +70% ❌ |
| Total instructions | 2.00B | 3.40B | +70% ❌ |
| Branch misses | 0.14% | 0.13% | -7% ✅ |
| L1 cache misses | 1.34M | 0.54M | -60% ✅ |
🔍 Layer Hit Statistics (3-Layer)
=== 3-Layer Architecture Stats ===
Bump hits: 0 ( 0.00%) ❌
Mag hits: 9843754 (98.44%) ✅
Slow hits: 156252 ( 1.56%) ✅
Total allocs: 10000006
Refill count: 156252
Refill items: 9843876 (avg 63.0/refill)
Analysis:
- ✅ Magazine working: 98.44% hit rate (was 0% in first attempt)
- ❌ Bump allocator NOT working: 0% hit rate (not implemented)
- ✅ Slow path reduced: 1.56% (was 100% in first attempt)
- ✅ Refill logic working: 156K refills, 63 items/refill average
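The hit-rate and items-per-refill figures above can be cross-checked with a couple of lines of arithmetic (counter values copied from the stats dump; the helper names are ours, not part of the allocator):

```c
#include <assert.h>

/* Cross-check of the layer-hit stats above (values from the dump). */
static double mag_hit_rate(long long mag_hits, long long total_allocs) {
    return 100.0 * (double)mag_hits / (double)total_allocs;
}

static double avg_items_per_refill(long long refill_items, long long refills) {
    return (double)refill_items / (double)refills;
}
```

9,843,754 / 10,000,006 is 98.44%, and 9,843,876 / 156,252 is exactly 63.0, matching the dump.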
🚨 Root Cause Analysis
Why is performance WORSE?
1. Expensive Slow Path Refill (Critical Issue)
Current implementation (tiny_alloc_slow_new):
// Refills the magazine by calling hak_tiny_alloc_slow 64 times per refill!
void* items[64];
int refilled = 0;
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx); // full slow-path cost on every call
    if (!p) break;
    items[refilled++] = p;
}
Cost per refill:
- 64 function calls to hak_tiny_alloc_slow
- Each call goes through the old 6-7 layer architecture
- Each call has full overhead (locks, checks, slab management)
Total overhead:
- 156,252 refills × 64 calls ≈ 10 million expensive slow path calls
- That is ~50% of the benchmark's total operations (20M ops)
- Each slow path call costs ~100+ instructions
Calculation:
Extra instructions from refill ≈ 10M × 100 = 1.0 billion
Baseline instructions = 2.00 billion
3-layer instructions = 3.40 billion
Observed overhead = 1.40 billion (refill accounts for ~1.0 billion of it)
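The accounting above can be verified mechanically (the ~100 instructions per slow-path call is the estimate from the text, not a measurement; helper names are ours):

```c
#include <assert.h>

/* Back-of-envelope accounting for the refill overhead described above. */
static long long refill_slow_calls(long long refills, long long calls_per_refill) {
    return refills * calls_per_refill;
}

static long long refill_extra_insns(long long slow_calls, long long insns_per_call) {
    return slow_calls * insns_per_call;
}
```

156,252 × 64 = 10,000,128 slow calls; at ~100 instructions each that is ~1.0B instructions, which sits inside the 1.4B observed delta (3.40B − 2.00B).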
2. Bump Allocator Not Implemented
- Bump allocator returns NULL (not implemented)
- Hot classes (0-2: 8B/16B/32B) fall through to Magazine
- Missing ultra-fast path (2-3 instructions/op target)
3. Magazine-only vs Layered Fast Paths
Old architecture had specialized hot paths:
- HAKMEM_TINY_BENCH_FASTPATH (SLL + Magazine for benchmarks)
- TinyHotMag (class 0-2 specialized)
- g_hot_alloc_fn (class 0-3 specialized functions)
New architecture only has:
- Small Magazine (generic for all classes)
Missing optimization: No specialized hot paths for 8B/16B/32B
🎯 Performance Goals vs Reality
| Metric | Baseline | Goal | Current | Gap |
|---|---|---|---|---|
| Tiny Hot insns/op | 100 | 20-30 | 169.9 | -140 to -150 |
| Tiny Hot throughput | 179 M/s | 240-250 M/s | 116.64 M/s | -123 to -133 M/s |
| Random Mixed insns/op | 412 | 100-150 | Not tested | N/A |
Status: ❌ Missing all goals by a significant margin
🔧 Options to Fix
Option A: Optimize Slow Path Refill (High Priority)
Problem: Calling hak_tiny_alloc_slow 64 times is too expensive
Solution 1: Batch allocation from slab
// Instead of 64 individual calls, allocate from the slab in one shot
int slab_batch_alloc(int class_idx, int count, void** out_items); // returns items filled
Expected gain:
- 64 calls → 1 call = ~60x reduction in overhead
- Instructions/op: 169.9 → ~110 (estimate)
- Throughput: 116.64 → ~155 M ops/s (estimate)
Solution 2: Direct slab carving
// Directly carve from the superslab without going through the slow path
int carved = superslab_carve_batch(class_idx, 64, size, out_items);
Expected gain:
- Eliminate all slow path overhead
- Instructions/op: 169.9 → ~70-80 (estimate)
- Throughput: 116.64 → ~185 M ops/s (estimate)
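A minimal sketch of what direct carving could look like. The cursor struct and function shape are assumptions; the real superslab tracks more metadata per size class:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical carve cursor -- the real superslab metadata differs. */
typedef struct {
    char* cur;  /* next free byte in the superslab region */
    char* end;  /* one past the end of the region */
} carve_cursor_t;

/* Carve up to `count` objects of `size` bytes in one pass, with no
 * per-object slow-path calls. Returns the number of items carved. */
static int superslab_carve_batch(carve_cursor_t* c, size_t size,
                                 int count, void** out_items) {
    int carved = 0;
    while (carved < count && c->cur + size <= c->end) {
        out_items[carved++] = c->cur;
        c->cur += size;
    }
    return carved;
}
```

The loop is pure pointer arithmetic, so one refill of 64 items costs a few dozen instructions instead of 64 full slow-path calls.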
Option B: Implement Bump Allocator (Medium Priority)
Status: Currently returns NULL (not implemented)
Implementation needed:
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
g_tiny_bump[class_idx].bcur = base;
g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
Expected gain:
- Hot classes (0-2) hit Bump first (2-3 insns/op)
- Reduce Magazine pressure
- Instructions/op: -10 to -20 (estimate)
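Paired with the refill above, the fast path could look like this sketch. tiny_bump_alloc, its exhaustion handling, and the 8-class array size are assumptions, not existing code:

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the g_tiny_bump fields used in the refill snippet above. */
typedef struct { char* bcur; char* bend; } tiny_bump_t;

static tiny_bump_t g_tiny_bump[8]; /* assumed class count */

static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = (char*)base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}

/* Fast path: one bounds check plus a pointer bump -- this is what makes
 * the 2-3 instructions/op target plausible for hot classes. */
static inline void* tiny_bump_alloc(int class_idx, size_t size) {
    tiny_bump_t* b = &g_tiny_bump[class_idx];
    if ((size_t)(b->bend - b->bcur) < size) return NULL; /* fall to Magazine */
    void* p = b->bcur;
    b->bcur += size;
    return p;
}
```

On exhaustion the NULL return lets the caller fall through to the Magazine, so the bump layer never needs its own slow path.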
Option C: Rollback to Baseline
When: If Option A + B don't achieve goals
Decision criteria:
- If instructions/op > 100 after optimizations
- If throughput < 179 M ops/s after optimizations
- If complexity outweighs benefits
📋 Next Steps
Immediate (fix slow path refill):
1. Implement slab batch allocation (Option A, Solution 2)
   - Create a superslab_carve_batch function
   - Bypass the old slow path entirely
   - Directly carve 64 items from the superslab
2. Test and measure
   - Rebuild and run bench_tiny_hot_hakx
   - Check instructions/op (target: < 110)
   - Check throughput (target: > 155 M ops/s)
3. If successful, implement Bump (Option B)
   - Add tiny_bump_refill to the slow path
   - Allocate a 4 KB slab and use it for Bump
   - Test hot classes (0-2) hit rate
Decision Point
If after A + B:
- ✅ Instructions/op < 100: Continue with 3-layer
- ⚠️ Instructions/op 100-120: Evaluate, may keep if stable
- ❌ Instructions/op > 120: Rollback, 3-layer adds too much overhead
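The decision rule above is mechanical enough to write down directly. Thresholds are taken from the bullets; the enum and function name are ours:

```c
#include <assert.h>

typedef enum { KEEP_3LAYER, EVALUATE, ROLLBACK } verdict_t;

/* Encodes the post-(A+B) decision thresholds listed above. */
static verdict_t decide_after_a_plus_b(double insns_per_op) {
    if (insns_per_op < 100.0)  return KEEP_3LAYER; /* beats baseline */
    if (insns_per_op <= 120.0) return EVALUATE;    /* borderline: keep if stable */
    return ROLLBACK;                               /* too much overhead */
}
```

At today's 169.9 insns/op the rule says rollback; the fixes in Options A and B are what could move the measurement into the other two bands.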
🤔 Objective Assessment
User's request: "客観的に判断おねがいね" (Please judge objectively)
Current status:
- ❌ Performance is WORSE (-35% throughput, +70% instructions)
- ✅ Magazine working (98.44% hit rate)
- ❌ Slow path refill too expensive (1 billion extra instructions)
- ❌ Bump allocator not implemented
Root cause: Architectural mismatch
- Old slow path not designed for batch refill
- Calling it 64 times defeats the purpose of simplification
Recommendation:
- Fix slow path refill (batch allocation) - this is critical
- Test again with realistic refill cost
- If still worse than baseline: Rollback and try different approach
Alternative approach if fix fails:
- Instead of replacing entire architecture, add specialized fastpath for class 0-2 only
- Keep existing architecture for class 3+ (proven to work)
- Smaller, safer change with lower risk
User emphasized: "複雑で逆に重くなりそうなときは注意ね" ("Be careful if it gets complex and ends up heavier instead")
Current reality: ❌ We did get heavier (slower); fix or roll back