Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

Co-Authored-By: Claude <noreply@anthropic.com>
3-Layer Architecture Performance Comparison (2025-11-01)
📊 Results Summary
Tiny Hot Bench (64B)
| Metric | Baseline (old) | 3-Layer (current) | Change |
|---|---|---|---|
| Throughput | 179 M ops/s | 116.64 M ops/s | -35% ❌ |
| Latency | 5.6 ns/op | 8.57 ns/op | +53% ❌ |
| Instructions/op | 100.1 | 169.9 | +70% ❌ |
| Total instructions | 2.00B | 3.40B | +70% ❌ |
| Branch misses | 0.14% | 0.13% | -7% ✅ |
| L1 cache misses | 1.34M | 0.54M | -60% ✅ |
🔍 Layer Hit Statistics (3-Layer)
=== 3-Layer Architecture Stats ===
Bump hits: 0 ( 0.00%) ❌
Mag hits: 9843754 (98.44%) ✅
Slow hits: 156252 ( 1.56%) ✅
Total allocs: 10000006
Refill count: 156252
Refill items: 9843876 (avg 63.0/refill)
Analysis:
- ✅ Magazine working: 98.44% hit rate (was 0% in first attempt)
- ❌ Bump allocator NOT working: 0% hit rate (not implemented)
- ✅ Slow path reduced: 1.56% (was 100% in first attempt)
- ✅ Refill logic working: 156K refills, 63 items/refill average
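The hit-rate and items-per-refill figures above can be cross-checked with a couple of lines of arithmetic (counter values copied from the stats dump; the helper names are ours, not part of the allocator):

```c
#include <assert.h>

/* Cross-check of the layer-hit stats above (values from the dump). */
static double mag_hit_rate(long long mag_hits, long long total_allocs) {
    return 100.0 * (double)mag_hits / (double)total_allocs;
}

static double avg_items_per_refill(long long refill_items, long long refills) {
    return (double)refill_items / (double)refills;
}
```

9,843,754 / 10,000,006 is 98.44%, and 9,843,876 / 156,252 is exactly 63.0, matching the dump.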
🚨 Root Cause Analysis
Why is performance WORSE?
1. Expensive Slow Path Refill (Critical Issue)
Current implementation (tiny_alloc_slow_new):
// Refills the magazine by calling hak_tiny_alloc_slow 64 times per refill!
void* items[64];
int refilled = 0;
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx); // full slow-path cost on every call
    if (!p) break;
    items[refilled++] = p;
}
Cost per refill:
- 64 function calls to hak_tiny_alloc_slow
- Each call goes through the old 6-7 layer architecture
- Each call has full overhead (locks, checks, slab management)
Total overhead:
- 156,252 refills × 64 calls ≈ 10 million expensive slow path calls
- That is ~50% of the benchmark's total operations (20M ops)
- Each slow path call costs ~100+ instructions
Calculation:
Extra instructions from refill ≈ 10M × 100 = 1.0 billion
Baseline instructions = 2.00 billion
3-layer instructions = 3.40 billion
Observed overhead = 1.40 billion (refill accounts for ~1.0 billion of it)
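The accounting above can be verified mechanically (the ~100 instructions per slow-path call is the estimate from the text, not a measurement; helper names are ours):

```c
#include <assert.h>

/* Back-of-envelope accounting for the refill overhead described above. */
static long long refill_slow_calls(long long refills, long long calls_per_refill) {
    return refills * calls_per_refill;
}

static long long refill_extra_insns(long long slow_calls, long long insns_per_call) {
    return slow_calls * insns_per_call;
}
```

156,252 × 64 = 10,000,128 slow calls; at ~100 instructions each that is ~1.0B instructions, which sits inside the 1.4B observed delta (3.40B − 2.00B).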
2. Bump Allocator Not Implemented
- Bump allocator returns NULL (not implemented)
- Hot classes (0-2: 8B/16B/32B) fall through to Magazine
- Missing ultra-fast path (2-3 instructions/op target)
3. Magazine-only vs Layered Fast Paths
Old architecture had specialized hot paths:
- HAKMEM_TINY_BENCH_FASTPATH (SLL + Magazine for benchmarks)
- TinyHotMag (class 0-2 specialized)
- g_hot_alloc_fn (class 0-3 specialized functions)
New architecture only has:
- Small Magazine (generic for all classes)
Missing optimization: No specialized hot paths for 8B/16B/32B
🎯 Performance Goals vs Reality
| Metric | Baseline | Goal | Current | Gap |
|---|---|---|---|---|
| Tiny Hot insns/op | 100 | 20-30 | 169.9 | -140 to -150 |
| Tiny Hot throughput | 179 M/s | 240-250 M/s | 116.64 M/s | -123 to -133 M/s |
| Random Mixed insns/op | 412 | 100-150 | Not tested | N/A |
Status: ❌ Missing all goals by a significant margin
🔧 Options to Fix
Option A: Optimize Slow Path Refill (High Priority)
Problem: Calling hak_tiny_alloc_slow 64 times is too expensive
Solution 1: Batch allocation from slab
// Instead of 64 individual calls, allocate from the slab in one shot
int slab_batch_alloc(int class_idx, int count, void** out_items); // returns items filled
Expected gain:
- 64 calls → 1 call = ~60x reduction in overhead
- Instructions/op: 169.9 → ~110 (estimate)
- Throughput: 116.64 → ~155 M ops/s (estimate)
Solution 2: Direct slab carving
// Directly carve from the superslab without going through the slow path
int carved = superslab_carve_batch(class_idx, 64, size, out_items);
Expected gain:
- Eliminate all slow path overhead
- Instructions/op: 169.9 → ~70-80 (estimate)
- Throughput: 116.64 → ~185 M ops/s (estimate)
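A minimal sketch of what direct carving could look like. The cursor struct and function shape are assumptions; the real superslab tracks more metadata per size class:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical carve cursor -- the real superslab metadata differs. */
typedef struct {
    char* cur;  /* next free byte in the superslab region */
    char* end;  /* one past the end of the region */
} carve_cursor_t;

/* Carve up to `count` objects of `size` bytes in one pass, with no
 * per-object slow-path calls. Returns the number of items carved. */
static int superslab_carve_batch(carve_cursor_t* c, size_t size,
                                 int count, void** out_items) {
    int carved = 0;
    while (carved < count && c->cur + size <= c->end) {
        out_items[carved++] = c->cur;
        c->cur += size;
    }
    return carved;
}
```

The loop is pure pointer arithmetic, so one refill of 64 items costs a few dozen instructions instead of 64 full slow-path calls.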
Option B: Implement Bump Allocator (Medium Priority)
Status: Currently returns NULL (not implemented)
Implementation needed:
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
g_tiny_bump[class_idx].bcur = base;
g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
Expected gain:
- Hot classes (0-2) hit Bump first (2-3 insns/op)
- Reduce Magazine pressure
- Instructions/op: -10 to -20 (estimate)
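Paired with the refill above, the fast path could look like this sketch. tiny_bump_alloc, its exhaustion handling, and the 8-class array size are assumptions, not existing code:

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the g_tiny_bump fields used in the refill snippet above. */
typedef struct { char* bcur; char* bend; } tiny_bump_t;

static tiny_bump_t g_tiny_bump[8]; /* assumed class count */

static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = (char*)base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}

/* Fast path: one bounds check plus a pointer bump -- this is what makes
 * the 2-3 instructions/op target plausible for hot classes. */
static inline void* tiny_bump_alloc(int class_idx, size_t size) {
    tiny_bump_t* b = &g_tiny_bump[class_idx];
    if ((size_t)(b->bend - b->bcur) < size) return NULL; /* fall to Magazine */
    void* p = b->bcur;
    b->bcur += size;
    return p;
}
```

On exhaustion the NULL return lets the caller fall through to the Magazine, so the bump layer never needs its own slow path.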
Option C: Rollback to Baseline
When: If Option A + B don't achieve goals
Decision criteria:
- If instructions/op > 100 after optimizations
- If throughput < 179 M ops/s after optimizations
- If complexity outweighs benefits
📋 Next Steps
Immediate (fix slow path refill):
1. Implement slab batch allocation (Option A, Solution 2)
   - Create a superslab_carve_batch function
   - Bypass the old slow path entirely
   - Directly carve 64 items from the superslab
2. Test and measure
   - Rebuild and run bench_tiny_hot_hakx
   - Check instructions/op (target: < 110)
   - Check throughput (target: > 155 M ops/s)
3. If successful, implement Bump (Option B)
   - Add tiny_bump_refill to the slow path
   - Allocate a 4 KB slab and use it for Bump
   - Test hot classes (0-2) hit rate
Decision Point
If after A + B:
- ✅ Instructions/op < 100: Continue with 3-layer
- ⚠️ Instructions/op 100-120: Evaluate, may keep if stable
- ❌ Instructions/op > 120: Rollback, 3-layer adds too much overhead
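The decision rule above is mechanical enough to write down directly. Thresholds are taken from the bullets; the enum and function name are ours:

```c
#include <assert.h>

typedef enum { KEEP_3LAYER, EVALUATE, ROLLBACK } verdict_t;

/* Encodes the post-(A+B) decision thresholds listed above. */
static verdict_t decide_after_a_plus_b(double insns_per_op) {
    if (insns_per_op < 100.0)  return KEEP_3LAYER; /* beats baseline */
    if (insns_per_op <= 120.0) return EVALUATE;    /* borderline: keep if stable */
    return ROLLBACK;                               /* too much overhead */
}
```

At today's 169.9 insns/op the rule says rollback; the fixes in Options A and B are what could move the measurement into the other two bands.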
🤔 Objective Assessment
User's request: "客観的に判断おねがいね" (Please judge objectively)
Current status:
- ❌ Performance is WORSE (-35% throughput, +70% instructions)
- ✅ Magazine working (98.44% hit rate)
- ❌ Slow path refill too expensive (1 billion extra instructions)
- ❌ Bump allocator not implemented
Root cause: Architectural mismatch
- Old slow path not designed for batch refill
- Calling it 64 times defeats the purpose of simplification
Recommendation:
- Fix slow path refill (batch allocation) - this is critical
- Test again with realistic refill cost
- If still worse than baseline: Rollback and try different approach
Alternative approach if fix fails:
- Instead of replacing entire architecture, add specialized fastpath for class 0-2 only
- Keep existing architecture for class 3+ (proven to work)
- Smaller, safer change with lower risk
User emphasized: "複雑で逆に重くなりそうなときは注意ね" ("Be careful if it gets complex and ends up heavier instead")
Current reality: ❌ We did get heavier (slower); fix or roll back