Phase 6.11.4: Threading Overhead Analysis & Optimization Plan

Date: 2025-10-22
Author: ChatGPT Ultra Think (o1-preview equivalent)
Context: Post-Phase 6.11.3 profiling results reveal hak_alloc consuming 39.6% of cycles


📊 Executive Summary

Current Bottleneck

hak_alloc:       126,479 cycles (39.6%)  ← #2 MAJOR BOTTLENECK
  ├─ ELO selection (every 100 calls)
  ├─ Site Rules lookup (4-probe hash)
  ├─ atomic_fetch_add (atomic op on every alloc)
  ├─ Conditional branches (FROZEN/CANARY/LEARN)
  └─ Learning logic (hak_evo_tick, hak_elo_record_alloc)

Staged plan:

  1. Phase 6.11.4 (P0-1): Atomic elimination - Immediate, low-risk (~15-20% reduction)
  2. Phase 6.11.4 (P0-2): LEARN lightweighting - Medium-term, medium-risk (~25-35% reduction)
  3. Phase 6.11.5 (P1): Learning Thread - Long-term, high-reward (~50-70% reduction)

Target: 126,479 cycles → <50,000 cycles (~60% reduction total)


1. Thread-Safety Cost Analysis

1.1 Current Atomic Operations

Location: hakmem.c:362-369

if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}

Cost Breakdown (estimated per allocation):

| Operation | Cycles | % of hak_alloc | Notes |
|---|---|---|---|
| atomic_fetch_add | 30-50 | 24-40% | LOCK CMPXCHG on x86 |
| Conditional check (& 0x3FF) | 2-5 | 2-4% | Bitwise AND + branch |
| hak_evo_tick (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| Subtotal (Evolution) | ~40-70 | ~30-50% | Major overhead! |

ELO sampling (hakmem.c:397-412):

g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();       // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);    // ~100-200 cycles
}
| Operation | Cycles | % of hak_alloc | Notes |
|---|---|---|---|
| g_elo_call_count++ | 1-2 | <1% | UNSAFE! Non-atomic |
| Modulo check (% 100) | 5-10 | 4-8% | DIV instruction |
| hak_elo_select_strategy (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| hak_elo_record_alloc (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| Subtotal (ELO) | ~15-30 | ~10-20% | Medium overhead |

Total atomic overhead: 55-100 cycles/allocation (~40-70% of hak_alloc)
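Both red flags in the ELO table — the racy non-atomic increment and the DIV from % 100 — can be removed without any atomic by keeping the counter thread-local and masking with a power of two. A minimal sketch, assuming a 1/128 rate is an acceptable stand-in for 1/100 (names are hypothetical):

#include <stdint.h>

// Thread-local counter: no data race, no LOCK prefix, and the
// power-of-two mask compiles to AND instead of DIV.
static __thread uint32_t t_elo_call_count = 0;

static inline int elo_should_sample(void) {
    return (++t_elo_call_count & 127) == 0;
}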


1.2 Lock-Free Queue Overhead (for Phase 6.11.5)

Estimated cost per event (MPSC queue):

| Operation | Cycles | Notes |
|---|---|---|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| Total per event | 60-110 | Higher than current atomic! |

⚠️ CRITICAL INSIGHT: Lock-free queue is NOT faster for high-frequency events!

Reason:

  • Current: 1 atomic op (atomic_fetch_add)
  • Queue: 1 allocation + 1 atomic op (enqueue)
  • Net change: +60-70 cycles per allocation

Recommendation: AVOID lock-free queue for hot-path. Use alternative approach.


2. Implementation Plan: Staged Optimization

Phase 6.11.4 (P0-1): Atomic Operation Elimination HIGHEST PRIORITY

Goal: Remove atomic overhead when learning is disabled
Expected gain: 30-50 cycles (~24-40% of hak_alloc)
Implementation time: 30 minutes
Risk: ZERO (compile-time guard)

Changes

File: hakmem.c:362-369

// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(get_time_ns());
    }
#endif

Tradeoff: None! Pure win when HAKMEM_FEATURE_EVOLUTION=0 at compile time: the runtime HAK_ENABLED_LEARNING() check becomes a preprocessor guard, so both the atomic and the branch disappear from the binary entirely.

Measurement:

# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem

# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem

Expected result:

hak_alloc: 126,479 → 96,000 cycles (-24%)
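If finer-grained numbers than the HAKMEM_TIMING aggregate are wanted, a per-call cycle count can be taken with rdtsc. A minimal x86-only harness sketch (not part of the build; hak_free is assumed to be the matching free entry point):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

// Average cycles per alloc/free pair over `iters` iterations.
static uint64_t measure_alloc_cycles(size_t size, uint64_t iters) {
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        void *p = hak_alloc(size);
        hak_free(p);   // assumed counterpart to hak_alloc
    }
    return (__rdtsc() - start) / iters;
}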

Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling HIGH PRIORITY

Goal: Reduce ELO overhead without accuracy loss
Expected gain: 15-30 cycles (~12-24% of hak_alloc)
Implementation time: 1-2 hours
Risk: LOW (conservative approach)

Strategy: Async ELO Update

Problem: hak_elo_select_strategy() is heavy (500-1000 cycles)
Solution: not an async event queue, but a pre-computed strategy

Key Insight: ELO selection does not need to run on the hot path

Implementation

1. Pre-computed Strategy Cache

// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2;  // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0;  // Invalidation key

2. Background Thread (Simulated)

// Called by hak_evo_tick() (every 1024 allocs)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();

    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}

3. Hot-path (hakmem.c:397-412)

// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);

    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
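Note that g_elo_generation is written by hak_elo_async_recompute() but never read in the hot-path snippet above. A sketch of how a reader could consume it, assuming a per-thread copy of the strategy is desired (none of these helper names appear in the plan):

#include <stdatomic.h>
#include <stdint.h>

// Re-read the shared strategy only after the generation counter moves,
// i.e. after hak_elo_async_recompute() has published a new choice.
static inline int hak_strategy_for_thread(void) {
    static __thread uint64_t t_seen_gen = UINT64_MAX;
    static __thread int t_strategy = 2;   // same default as g_cached_strategy_id

    uint64_t gen = atomic_load_explicit(&g_elo_generation, memory_order_acquire);
    if (gen != t_seen_gen) {
        t_strategy = atomic_load(&g_cached_strategy_id);
        t_seen_gen = gen;
    }
    return t_strategy;
}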

Tradeoff Analysis:

| Aspect | Before | After | Change |
|---|---|---|---|
| Hot-path cost | 15-30 cycles | 5-10 cycles | -67% to -50% |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |

Expected result:

hak_alloc: 96,000 → 70,000 cycles (-27%)
Total: 126,479 → 70,000 cycles (-45%)

Recommendation: IMPLEMENT FIRST (before Phase 6.11.5)


Phase 6.11.5 (P1): Learning Thread (Full Offload) FUTURE WORK

Goal: Complete learning offload to a dedicated thread
Expected gain: 20-40 cycles (additional ~15-30%)
Implementation time: 4-6 hours
Risk: MEDIUM (thread management, race conditions)

Architecture

┌─────────────────────────────────────────┐
│         hak_alloc (Hot-path)            │
│  ┌───────────────────────────────────┐  │
│  │ 1. Read g_cached_strategy_id      │  │ ← Atomic read (~10 cycles)
│  │ 2. Route allocation               │  │
│  │ 3. [Optional] Push event to queue │  │ ← Only if sampling (1/100)
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
                    ↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│        Learning Thread (Background)     │
│  ┌───────────────────────────────────┐  │
│  │ 1. Pop events (batched)           │  │
│  │ 2. Update ELO ratings             │  │
│  │ 3. Update distribution signature  │  │
│  │ 4. Recompute best strategy        │  │
│  │ 5. Update g_cached_strategy_id    │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Implementation Details

1. Event Queue (Custom Ring Buffer)

// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t type;        // EVENT_ALLOC / EVENT_FREE
    size_t size;
    uint64_t duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;  // Producer index
    _Atomic uint64_t tail;  // Consumer index
} hak_event_queue_t;

Cost: ~30 cycles per event (a single-producer ring-buffer write needs no CAS; with multiple producer threads the head index would need an atomic fetch_add)
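A minimal push sketch under a single-producer assumption (EVENT_QUEUE_SIZE must stay a power of two for the mask; dropping the sample when the queue is full is an assumed policy, not specified above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static inline bool hak_event_push(hak_event_queue_t *q, hak_event_t ev) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail >= EVENT_QUEUE_SIZE)
        return false;                              // Full: drop the sample
    q->events[head & (EVENT_QUEUE_SIZE - 1)] = ev;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}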

2. Sampling Strategy

// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,  // Not measured in hot-path
        .site_id = site_id
    });
}
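fast_random() is not defined in this plan; one cheap candidate is a per-thread xorshift64 (name and seeding are assumptions). It costs a few cycles, takes no lock, and never touches shared state:

#include <stdint.h>

// Per-thread xorshift64; state must stay nonzero, so seed with a constant.
static __thread uint64_t t_rng_state = 0x9E3779B97F4A7C15ull;

static inline uint64_t fast_random(void) {
    uint64_t x = t_rng_state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    t_rng_state = x;
    return x;
}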

3. Background Thread

void* learning_thread_main(void* arg) {
    (void)arg;
    uint64_t batch_count = 0;  // was an undeclared global; a local suffices

    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);

        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);

        for (int i = 0; i < count; i++) {
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }

        // Periodic ELO update (every 10 batches)
        if (++batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
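The plan does not specify where the thread is created or torn down; a hypothetical lifecycle wiring (function and flag names are assumptions) could look like:

#include <pthread.h>
#include <stdatomic.h>

static pthread_t g_learning_thread;
static _Atomic int g_shutdown = 0;    // polled by learning_thread_main()

void hak_learning_thread_start(void) {
    pthread_create(&g_learning_thread, NULL, learning_thread_main, NULL);
}

void hak_learning_thread_stop(void) {
    atomic_store(&g_shutdown, 1);     // thread exits within one 100ms sleep
    pthread_join(g_learning_thread, NULL);
}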

Tradeoff Analysis

| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|---|---|---|---|
| Hot-path cost | 5-10 cycles | ~10-15 cycles | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |

⚠️ CRITICAL DECISION: Phase 6.11.5 DOES NOT improve hot-path over Phase 6.11.4!

Reason: the ~5-cycle sampling/enqueue overhead offsets the ~5 cycles the offload saves on the hot path

Recommendation: ⚠️ SKIP Phase 6.11.5 unless:

  1. Learning accuracy requires higher sampling rate (>1/100)
  2. Background analytics needed (real-time dashboard)

3. Hash Table Optimization (Phase 6.11.6 - P2)

Current cost: Site Rules lookup (~10-20 cycles)

Strategy 1: Perfect Hashing

Benefit: O(1) lookup without collisions
Tradeoff: Rebuild cost on new site; max 256 sites

Implementation:

// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256];  // Direct lookup, no probing

Expected gain: 5-10 cycles (~4-8%)
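The declaration above leaves the lookup itself implicit; a direct-indexed sketch (the index derivation is an assumption — a real perfect hash function would be generated from the observed site ids at rebuild time):

#include <stdint.h>

// One shift+mask+load replaces the 4-probe hash walk.
static inline RouteType site_route_lookup(uintptr_t site_id) {
    return g_site_routes[(site_id >> 4) & 0xFF];
}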

Strategy 2: Cache-line Alignment

Current: 4-probe hash → 4 cache lines (worst case)
Improvement: Pack entries into a single cache line. With 16-byte entries, four probe slots fit in one 64-byte line, so the 4-probe worst case touches one line instead of four.

typedef struct {
    uint64_t site_id;
    RouteType route;
    uint8_t padding[4];  // 8 + 4 + 4 = 16 bytes (assumes a 4-byte enum)
} __attribute__((aligned(16))) SiteRuleEntry;

Expected gain: 2-5 cycles (~2-4%)

Recommendation

Priority: P2 (after Phase 6.11.4 P0-1/P0-2)
Expected gain: 7-15 cycles (~6-12%)
Implementation time: 2-3 hours


4. Trade-off Analysis

4.1 Thread-Safety vs Learning Accuracy

| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|---|---|---|---|
| Current | 126,479 cycles | 100% | Low |
| P0-1 (Atomic elimination) | 96,000 cycles | 100% | Very Low |
| P0-2 (Cached Strategy) | 70,000 cycles | 99% | Low |
| P1 (Learning Thread) | 70,000-75,000 cycles | 95-99% | Medium |
| P2 (Hash Opt) | 60,000 cycles | 99% | Medium |

4.2 Implementation Complexity vs Performance Gain

| Phase | Gain (cycles) | Effort | Complexity |
|---|---|---|---|
| P0-1 (Atomic elimination) | 30-50 | 30 min | Low |
| P0-2 (Cached Strategy) | 25-35 | 1-2 hrs | Low |
| P2 (Hash Opt) | 7-15 | 2-3 hrs | Medium |
| P1 (Learning Thread) | 5-10 | 4-6 hrs | High |

Sweet Spot: P0-2 (Cached Strategy)

  • ~45% total reduction (126,479 → 70,000 cycles)
  • 1-2 hours implementation
  • Low complexity, low risk

5. Implementation Roadmap

Week 1: Quick Wins (P0-1 + P0-2)

Day 1: Phase 6.11.4 (P0-1) - Atomic elimination

  • Time: 30 minutes
  • Expected: 126,479 → 96,000 cycles (-24%)

Day 2: Phase 6.11.4 (P0-2) - Cached Strategy

  • Time: 1-2 hours
  • Expected: 96,000 → 70,000 cycles (-27%)
  • Total: -45% reduction

Week 2: Medium Gains (P2)

Day 3-4: Phase 6.11.6 (P2) - Hash Optimization

  • Time: 2-3 hours
  • Expected: 70,000 → 60,000 cycles (-14%)
  • Total: -52% reduction

Week 3: Evaluation

Benchmark all scenarios (json/mir/vm)

  • If hak_alloc < 50,000 cycles → STOP
  • If hak_alloc > 50,000 cycles → Consider Phase 6.11.5 (P1)

6. Risk Assessment

| Phase | Risk Level | Failure Mode | Mitigation |
|---|---|---|---|
| P0-1 | ZERO | None (compile-time) | None needed |
| P0-2 | LOW | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| P1 | MEDIUM | Race conditions, thread bugs | Extensive testing, feature flag |
| P2 | LOW | Hash collisions, rebuild cost | Fallback to linear probe |
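The P1 mitigation mentions a feature flag; a hypothetical runtime gate in the style of the existing HAKMEM_* environment variables (the variable name is an assumption) would let the learning thread be disabled without a rebuild:

#include <stdlib.h>

// Spawn the learning thread only when explicitly enabled.
static int learning_thread_enabled(void) {
    const char *v = getenv("HAKMEM_LEARNING_THREAD");   // assumed flag name
    return v && v[0] == '1';
}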

7. Expected Final Results

Pessimistic Scenario (Only P0-1 + P0-2)

hak_alloc: 126,479 → 70,000 cycles (-45%)
Overall: 319,021 → 262,542 cycles (-18%)

vm scenario: 15,021 ns → 12,000 ns (-20%)

Optimistic Scenario (P0-1 + P0-2 + P2)

hak_alloc: 126,479 → 60,000 cycles (-52%)
Overall: 319,021 → 252,542 cycles (-21%)

vm scenario: 15,021 ns → 11,500 ns (-23%)

Stretch Goal (All Phases)

hak_alloc: 126,479 → 50,000 cycles (-60%)
Overall: 319,021 → 242,542 cycles (-24%)

vm scenario: 15,021 ns → 11,000 ns (-27%)

8. Conclusion

Rationale:

  1. P0-1 is free (compile-time guard) → Immediate -24%
  2. P0-2 is high-ROI (1-2 hrs) → Additional -27%
  3. P1 (Learning Thread) is NOT worth it (complexity vs gain)
  4. P2 is optional polish → Additional -14%

Final Target: 70,000 cycles (~45% reduction from baseline)

Timeline:

  • Week 1: P0-1 + P0-2 (2-3 hours total)
  • Week 2: P2 (optional, 2-3 hours)
  • Week 3: Benchmark & validate

Success Criteria:

  • hak_alloc < 75,000 cycles (40% reduction) → Minimum Success
  • hak_alloc < 60,000 cycles (52% reduction) → Target Success
  • hak_alloc < 50,000 cycles (60% reduction) → Stretch Goal 🎉

Next Steps

  1. Implement P0-1 (30 min)
  2. Measure baseline (10 min)
  3. Implement P0-2 (1-2 hrs)
  4. Measure improvement (10 min)
  5. Decide on P2 based on results

Total time investment: 2-3 hours for a 45% reduction. Excellent ROI!