hakmem/archive/analysis/3LAYER_COMPARISON.md
Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

2025-11-05 12:31:14 +09:00


# 3-Layer Architecture Performance Comparison (2025-11-01)

## 📊 Results Summary

### Tiny Hot Bench (64B)

| Metric | Baseline (old) | 3-Layer (current) | Change |
|---|---|---|---|
| Throughput | 179 M ops/s | 116.64 M ops/s | -35% |
| Latency | 5.6 ns/op | 8.57 ns/op | +53% |
| Instructions/op | 100.1 | 169.9 | +70% |
| Total instructions | 2.00B | 3.40B | +70% |
| Branch misses | 0.14% | 0.13% | -7% |
| L1 cache misses | 1.34M | 0.54M | -60% |

## 🔍 Layer Hit Statistics (3-Layer)

```
=== 3-Layer Architecture Stats ===
Bump hits:              0 ( 0.00%)  ❌
Mag hits:         9843754 (98.44%)  ✅
Slow hits:         156252 ( 1.56%)  ✅
Total allocs:    10000006
Refill count:      156252
Refill items:     9843876 (avg 63.0/refill)
```

Analysis:

  • Magazine working: 98.44% hit rate (was 0% in first attempt)
  • Bump allocator NOT working: 0% hit rate (not implemented)
  • Slow path reduced: 1.56% (was 100% in first attempt)
  • Refill logic working: 156K refills, 63 items/refill average

## 🚨 Root Cause Analysis

### Why is performance WORSE?

#### 1. Expensive Slow Path Refill (Critical Issue)

Current implementation (`tiny_alloc_slow_new`):

```c
// Calls hak_tiny_alloc_slow 64 times per refill!
for (int i = 0; i < 64; i++) {
    void* p = hak_tiny_alloc_slow(0, class_idx);  // 64 function calls!
    items[refilled++] = p;
}
```

Cost per refill:

  • 64 function calls to hak_tiny_alloc_slow
  • Each call goes through old 6-7 layer architecture
  • Each call has full overhead (locks, checks, slab management)

Total overhead:

  • 156,252 refills × 64 calls = 10 million expensive slow path calls
  • This is 50% of total allocations (20M ops)!
  • Each slow path call costs ~100+ instructions

Calculation:

```
Extra instructions from refill ≈ 10M calls × ~100 insns = ~1 billion
Baseline instructions          = 2.0 billion
3-layer instructions           = 3.4 billion
Measured overhead              = 1.4 billion (refill accounts for ~1B of it)
```

#### 2. Bump Allocator Not Implemented

  • Bump allocator returns NULL (not implemented)
  • Hot classes (0-2: 8B/16B/32B) fall through to Magazine
  • Missing ultra-fast path (2-3 instructions/op target)

#### 3. Magazine-only vs Layered Fast Paths

Old architecture had specialized hot paths:

  • HAKMEM_TINY_BENCH_FASTPATH (SLL + Magazine for benchmarks)
  • TinyHotMag (class 0-2 specialized)
  • g_hot_alloc_fn (class 0-3 specialized functions)

New architecture only has:

  • Small Magazine (generic for all classes)

Missing optimization: No specialized hot paths for 8B/16B/32B


## 🎯 Performance Goals vs Reality

| Metric | Baseline | Goal | Current | Gap |
|---|---|---|---|---|
| Tiny Hot insns/op | 100 | 20-30 | 169.9 | +140 to +150 over goal |
| Tiny Hot throughput | 179 M/s | 240-250 M/s | 116.64 M/s | -123 to -133 M/s |
| Random Mixed insns/op | 412 | 100-150 | Not tested | N/A |

Status: missing all goals by a significant margin.


## 🔧 Options to Fix

### Option A: Optimize Slow Path Refill (High Priority)

Problem: Calling `hak_tiny_alloc_slow` 64 times per refill is too expensive

Solution 1: Batch allocation from slab

```c
// Instead of 64 individual calls, allocate from the slab in one shot
void* slab_batch_alloc(int class_idx, int count, void** out_items);
```

Expected gain:

  • 64 calls → 1 call = ~60x reduction in overhead
  • Instructions/op: 169.9 → ~110 (estimate)
  • Throughput: 116.64 → ~155 M ops/s (estimate)

Solution 2: Direct slab carving

```c
// Directly carve from the superslab without going through the slow path
void* items = superslab_carve_batch(class_idx, 64, size);
```

Expected gain:

  • Eliminate all slow path overhead
  • Instructions/op: 169.9 → ~70-80 (estimate)
  • Throughput: 116.64 → ~185 M ops/s (estimate)

### Option B: Implement Bump Allocator (Medium Priority)

Status: Currently returns NULL (not implemented)

Implementation needed:

```c
static void tiny_bump_refill(int class_idx, void* base, size_t total_size) {
    g_tiny_bump[class_idx].bcur = base;
    g_tiny_bump[class_idx].bend = (char*)base + total_size;
}
```

Expected gain:

  • Hot classes (0-2) hit Bump first (2-3 insns/op)
  • Reduce Magazine pressure
  • Instructions/op: -10 to -20 (estimate)

### Option C: Rollback to Baseline

When: If Option A + B don't achieve goals

Decision criteria:

  • If instructions/op > 100 after optimizations
  • If throughput < 179 M ops/s after optimizations
  • If complexity outweighs benefits

## 📋 Next Steps

### Immediate (Fix slow path refill)

  1. Implement slab batch allocation (Option A, Solution 2)

    • Create superslab_carve_batch function
    • Bypass old slow path entirely
    • Directly carve 64 items from superslab
  2. Test and measure

    • Rebuild and run bench_tiny_hot_hakx
    • Check instructions/op (target: < 110)
    • Check throughput (target: > 155 M ops/s)
  3. If successful, implement Bump (Option B)

    • Add tiny_bump_refill to slow path
    • Allocate 4KB slab, use for Bump
    • Test hot classes (0-2) hit rate

### Decision Point

If after A + B:

  • Instructions/op < 100: Continue with 3-layer
  • ⚠️ Instructions/op 100-120: Evaluate, may keep if stable
  • Instructions/op > 120: Rollback, 3-layer adds too much overhead

## 🤔 Objective Assessment

User's request: "客観的に判断おねがいね" (Please judge objectively)

Current status:

  • Performance is WORSE (-35% throughput, +70% instructions)
  • Magazine working (98.44% hit rate)
  • Slow path refill too expensive (1 billion extra instructions)
  • Bump allocator not implemented

Root cause: Architectural mismatch

  • Old slow path not designed for batch refill
  • Calling it 64 times defeats the purpose of simplification

Recommendation:

  1. Fix slow path refill (batch allocation) - this is critical
  2. Test again with realistic refill cost
  3. If still worse than baseline: Rollback and try different approach

Alternative approach if fix fails:

  • Instead of replacing entire architecture, add specialized fastpath for class 0-2 only
  • Keep existing architecture for class 3+ (proven to work)
  • Smaller, safer change with lower risk

User emphasized: "複雑で逆に重くなりそうなときは注意ね" ("Be careful when it looks like it will get complex and end up heavier instead").

Current reality: we did get heavier (slower); we need to fix it or roll back.