Phase 6.11.4: Threading Overhead Analysis & Optimization Plan

Date: 2025-10-22
Author: ChatGPT Ultra Think (o1-preview equivalent)
Context: Post-Phase 6.11.3 profiling results reveal hak_alloc consuming 39.6% of cycles


📊 Executive Summary

Current Bottleneck

hak_alloc:       126,479 cycles (39.6%)  ← #2 MAJOR BOTTLENECK
  ├─ ELO selection (every 100 calls)
  ├─ Site Rules lookup (4-probe hash)
  ├─ atomic_fetch_add (atomic op on every alloc)
  ├─ Conditional branches (FROZEN/CANARY/LEARN)
  └─ Learning logic (hak_evo_tick, hak_elo_record_alloc)

Staged plan:

  1. Phase 6.11.4 (P0-1): Atomic elimination - Immediate, low-risk (~15-20% reduction)
  2. Phase 6.11.4 (P0-2): LEARN lightweighting - Medium-term, medium-risk (~25-35% reduction)
  3. Phase 6.11.5 (P1): Learning Thread - Long-term, high-reward (~50-70% reduction)

Target: 126,479 cycles → <50,000 cycles (~60% reduction total)


1. Thread-Safety Cost Analysis

1.1 Current Atomic Operations

Location: hakmem.c:362-369

if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}

Cost Breakdown (estimated per allocation):

| Operation | Cycles | % of hak_alloc | Notes |
|---|---|---|---|
| atomic_fetch_add | 30-50 | 24-40% | LOCK CMPXCHG on x86 |
| Conditional check (& 0x3FF) | 2-5 | 2-4% | Bitwise AND + branch |
| hak_evo_tick (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| Subtotal (Evolution) | ~40-70 | ~30-50% | Major overhead! |

ELO sampling (hakmem.c:397-412):

g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();       // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);    // ~100-200 cycles
}
| Operation | Cycles | % of hak_alloc | Notes |
|---|---|---|---|
| g_elo_call_count++ | 1-2 | <1% | UNSAFE! Non-atomic |
| Modulo check (% 100) | 5-10 | 4-8% | DIV instruction |
| hak_elo_select_strategy (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| hak_elo_record_alloc (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| Subtotal (ELO) | ~15-30 | ~10-20% | Medium overhead |

Total atomic overhead: 55-100 cycles/allocation (~40-70% of hak_alloc)
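Both red flags in the ELO table — the racy non-atomic increment and the DIV from % 100 — can be removed without any atomic by keeping the counter thread-local and masking with a power of two. A minimal sketch, assuming a 1/128 rate is an acceptable stand-in for 1/100 (names are hypothetical):

#include <stdint.h>

// Thread-local counter: no data race, no LOCK prefix, and the
// power-of-two mask compiles to AND instead of DIV.
static __thread uint32_t t_elo_call_count = 0;

static inline int elo_should_sample(void) {
    return (++t_elo_call_count & 127) == 0;
}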


1.2 Lock-Free Queue Overhead (for Phase 6.11.5)

Estimated cost per event (MPSC queue):

| Operation | Cycles | Notes |
|---|---|---|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| Total per event | 60-110 | Higher than current atomic! |

⚠️ CRITICAL INSIGHT: Lock-free queue is NOT faster for high-frequency events!

Reason:

  • Current: 1 atomic op (atomic_fetch_add)
  • Queue: 1 allocation + 1 atomic op (enqueue)
  • Net change: +60-70 cycles per allocation

Recommendation: AVOID lock-free queue for hot-path. Use alternative approach.


2. Implementation Plan: Staged Optimization

Phase 6.11.4 (P0-1): Atomic Operation Elimination HIGHEST PRIORITY

Goal: Remove atomic overhead when learning is disabled
Expected gain: 30-50 cycles (~24-40% of hak_alloc)
Implementation time: 30 minutes
Risk: ZERO (compile-time guard)

Changes

File: hakmem.c:362-369

// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(get_time_ns());
    }
#endif

Tradeoff: None! Pure win when HAKMEM_FEATURE_EVOLUTION=0 at compile time: the runtime HAK_ENABLED_LEARNING() check becomes a preprocessor guard, so both the atomic and the branch disappear from the binary entirely.

Measurement:

# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem

# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem

Expected result:

hak_alloc: 126,479 → 96,000 cycles (-24%)
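If finer-grained numbers than the HAKMEM_TIMING aggregate are wanted, a per-call cycle count can be taken with rdtsc. A minimal x86-only harness sketch (not part of the build; hak_free is assumed to be the matching free entry point):

#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>

// Average cycles per alloc/free pair over `iters` iterations.
static uint64_t measure_alloc_cycles(size_t size, uint64_t iters) {
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < iters; i++) {
        void *p = hak_alloc(size);
        hak_free(p);   // assumed counterpart to hak_alloc
    }
    return (__rdtsc() - start) / iters;
}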

Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling HIGH PRIORITY

Goal: Reduce ELO overhead without accuracy loss
Expected gain: 15-30 cycles (~12-24% of hak_alloc)
Implementation time: 1-2 hours
Risk: LOW (conservative approach)

Strategy: Async ELO Update

Problem: hak_elo_select_strategy() is heavy (500-1000 cycles)
Solution: not an async event queue, but a pre-computed strategy

Key Insight: ELO selection does not need to run on the hot path

Implementation

1. Pre-computed Strategy Cache

// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2;  // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0;  // Invalidation key

2. Background Thread (Simulated)

// Called by hak_evo_tick() (every 1024 allocs)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();

    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}

3. Hot-path (hakmem.c:397-412)

// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);

    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
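Note that g_elo_generation is written by hak_elo_async_recompute() but never read in the hot-path snippet above. A sketch of how a reader could consume it, assuming a per-thread copy of the strategy is desired (none of these helper names appear in the plan):

#include <stdatomic.h>
#include <stdint.h>

// Re-read the shared strategy only after the generation counter moves,
// i.e. after hak_elo_async_recompute() has published a new choice.
static inline int hak_strategy_for_thread(void) {
    static __thread uint64_t t_seen_gen = UINT64_MAX;
    static __thread int t_strategy = 2;   // same default as g_cached_strategy_id

    uint64_t gen = atomic_load_explicit(&g_elo_generation, memory_order_acquire);
    if (gen != t_seen_gen) {
        t_strategy = atomic_load(&g_cached_strategy_id);
        t_seen_gen = gen;
    }
    return t_strategy;
}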

Tradeoff Analysis:

| Aspect | Before | After | Change |
|---|---|---|---|
| Hot-path cost | 15-30 cycles | 5-10 cycles | -67% to -50% |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |

Expected result:

hak_alloc: 96,000 → 70,000 cycles (-27%)
Total: 126,479 → 70,000 cycles (-45%)

Recommendation: IMPLEMENT FIRST (before Phase 6.11.5)


Phase 6.11.5 (P1): Learning Thread (Full Offload) FUTURE WORK

Goal: Complete learning offload to a dedicated thread
Expected gain: 20-40 cycles (additional ~15-30%)
Implementation time: 4-6 hours
Risk: MEDIUM (thread management, race conditions)

Architecture

┌─────────────────────────────────────────┐
│         hak_alloc (Hot-path)            │
│  ┌───────────────────────────────────┐  │
│  │ 1. Read g_cached_strategy_id      │  │ ← Atomic read (~10 cycles)
│  │ 2. Route allocation               │  │
│  │ 3. [Optional] Push event to queue │  │ ← Only if sampling (1/100)
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
                    ↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│        Learning Thread (Background)     │
│  ┌───────────────────────────────────┐  │
│  │ 1. Pop events (batched)           │  │
│  │ 2. Update ELO ratings             │  │
│  │ 3. Update distribution signature  │  │
│  │ 4. Recompute best strategy        │  │
│  │ 5. Update g_cached_strategy_id    │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘

Implementation Details

1. Event Queue (Custom Ring Buffer)

// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t type;        // EVENT_ALLOC / EVENT_FREE
    size_t size;
    uint64_t duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;  // Producer index
    _Atomic uint64_t tail;  // Consumer index
} hak_event_queue_t;

Cost: ~30 cycles per event (a single-producer ring-buffer write needs no CAS; with multiple producer threads the head index would need an atomic fetch_add)
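A minimal push sketch under a single-producer assumption (EVENT_QUEUE_SIZE must stay a power of two for the mask; dropping the sample when the queue is full is an assumed policy, not specified above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static inline bool hak_event_push(hak_event_queue_t *q, hak_event_t ev) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail >= EVENT_QUEUE_SIZE)
        return false;                              // Full: drop the sample
    q->events[head & (EVENT_QUEUE_SIZE - 1)] = ev;
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}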

2. Sampling Strategy

// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,  // Not measured in hot-path
        .site_id = site_id
    });
}
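fast_random() is not defined in this plan; one cheap candidate is a per-thread xorshift64 (name and seeding are assumptions). It costs a few cycles, takes no lock, and never touches shared state:

#include <stdint.h>

// Per-thread xorshift64; state must stay nonzero, so seed with a constant.
static __thread uint64_t t_rng_state = 0x9E3779B97F4A7C15ull;

static inline uint64_t fast_random(void) {
    uint64_t x = t_rng_state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    t_rng_state = x;
    return x;
}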

3. Background Thread

void* learning_thread_main(void* arg) {
    (void)arg;
    uint64_t batch_count = 0;  // was an undeclared global; a local suffices

    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);

        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);

        for (int i = 0; i < count; i++) {
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }

        // Periodic ELO update (every 10 batches)
        if (++batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
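The plan does not specify where the thread is created or torn down; a hypothetical lifecycle wiring (function and flag names are assumptions) could look like:

#include <pthread.h>
#include <stdatomic.h>

static pthread_t g_learning_thread;
static _Atomic int g_shutdown = 0;    // polled by learning_thread_main()

void hak_learning_thread_start(void) {
    pthread_create(&g_learning_thread, NULL, learning_thread_main, NULL);
}

void hak_learning_thread_stop(void) {
    atomic_store(&g_shutdown, 1);     // thread exits within one 100ms sleep
    pthread_join(g_learning_thread, NULL);
}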

Tradeoff Analysis

| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|---|---|---|---|
| Hot-path cost | 5-10 cycles | ~10-15 cycles | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |

⚠️ CRITICAL DECISION: Phase 6.11.5 DOES NOT improve hot-path over Phase 6.11.4!

Reason: the ~5-cycle sampling/enqueue overhead offsets the ~5 cycles the offload saves on the hot path

Recommendation: ⚠️ SKIP Phase 6.11.5 unless:

  1. Learning accuracy requires higher sampling rate (>1/100)
  2. Background analytics needed (real-time dashboard)

3. Hash Table Optimization (Phase 6.11.6 - P2)

Current cost: Site Rules lookup (~10-20 cycles)

Strategy 1: Perfect Hashing

Benefit: O(1) lookup without collisions
Tradeoff: Rebuild cost on new site; max 256 sites

Implementation:

// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256];  // Direct lookup, no probing

Expected gain: 5-10 cycles (~4-8%)
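The declaration above leaves the lookup itself implicit; a direct-indexed sketch (the index derivation is an assumption — a real perfect hash function would be generated from the observed site ids at rebuild time):

#include <stdint.h>

// One shift+mask+load replaces the 4-probe hash walk.
static inline RouteType site_route_lookup(uintptr_t site_id) {
    return g_site_routes[(site_id >> 4) & 0xFF];
}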

Strategy 2: Cache-line Alignment

Current: 4-probe hash → 4 cache lines (worst case)
Improvement: Pack entries into a single cache line. With 16-byte entries, four probe slots fit in one 64-byte line, so the 4-probe worst case touches one line instead of four.

typedef struct {
    uint64_t site_id;
    RouteType route;
    uint8_t padding[4];  // 8 + 4 + 4 = 16 bytes (assumes a 4-byte enum)
} __attribute__((aligned(16))) SiteRuleEntry;

Expected gain: 2-5 cycles (~2-4%)

Recommendation

Priority: P2 (after Phase 6.11.4 P0-1/P0-2)
Expected gain: 7-15 cycles (~6-12%)
Implementation time: 2-3 hours


4. Trade-off Analysis

4.1 Thread-Safety vs Learning Accuracy

| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|---|---|---|---|
| Current | 126,479 cycles | 100% | Low |
| P0-1 (Atomic elimination) | 96,000 cycles | 100% | Very Low |
| P0-2 (Cached Strategy) | 70,000 cycles | 99% | Low |
| P1 (Learning Thread) | 70,000-75,000 cycles | 95-99% | Medium |
| P2 (Hash Opt) | 60,000 cycles | 99% | Medium |

4.2 Implementation Complexity vs Performance Gain

| Phase | Gain (cycles) | Effort | Complexity |
|---|---|---|---|
| P0-1 (Atomic elimination) | 30-50 | 30 min | Low |
| P0-2 (Cached Strategy) | 25-35 | 1-2 hrs | Low |
| P2 (Hash Opt) | 7-15 | 2-3 hrs | Medium |
| P1 (Learning Thread) | 5-10 | 4-6 hrs | High |

Sweet Spot: P0-2 (Cached Strategy)

  • ~45% total reduction (126,479 → 70,000 cycles)
  • 1-2 hours implementation
  • Low complexity, low risk

5. Implementation Roadmap

Week 1: Quick Wins (P0-1 + P0-2)

Day 1: Phase 6.11.4 (P0-1) - Atomic elimination

  • Time: 30 minutes
  • Expected: 126,479 → 96,000 cycles (-24%)

Day 2: Phase 6.11.4 (P0-2) - Cached Strategy

  • Time: 1-2 hours
  • Expected: 96,000 → 70,000 cycles (-27%)
  • Total: -45% reduction

Week 2: Medium Gains (P2)

Day 3-4: Phase 6.11.6 (P2) - Hash Optimization

  • Time: 2-3 hours
  • Expected: 70,000 → 60,000 cycles (-14%)
  • Total: -52% reduction

Week 3: Evaluation

Benchmark all scenarios (json/mir/vm)

  • If hak_alloc < 50,000 cycles → STOP
  • If hak_alloc > 50,000 cycles → Consider Phase 6.11.5 (P1)

6. Risk Assessment

| Phase | Risk Level | Failure Mode | Mitigation |
|---|---|---|---|
| P0-1 | ZERO | None (compile-time) | None needed |
| P0-2 | LOW | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| P1 | MEDIUM | Race conditions, thread bugs | Extensive testing, feature flag |
| P2 | LOW | Hash collisions, rebuild cost | Fallback to linear probe |
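The P1 mitigation mentions a feature flag; a hypothetical runtime gate in the style of the existing HAKMEM_* environment variables (the variable name is an assumption) would let the learning thread be disabled without a rebuild:

#include <stdlib.h>

// Spawn the learning thread only when explicitly enabled.
static int learning_thread_enabled(void) {
    const char *v = getenv("HAKMEM_LEARNING_THREAD");   // assumed flag name
    return v && v[0] == '1';
}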

7. Expected Final Results

Pessimistic Scenario (Only P0-1 + P0-2)

hak_alloc: 126,479 → 70,000 cycles (-45%)
Overall: 319,021 → 262,542 cycles (-18%)

vm scenario: 15,021 ns → 12,000 ns (-20%)

Optimistic Scenario (P0-1 + P0-2 + P2)

hak_alloc: 126,479 → 60,000 cycles (-52%)
Overall: 319,021 → 252,542 cycles (-21%)

vm scenario: 15,021 ns → 11,500 ns (-23%)

Stretch Goal (All Phases)

hak_alloc: 126,479 → 50,000 cycles (-60%)
Overall: 319,021 → 242,542 cycles (-24%)

vm scenario: 15,021 ns → 11,000 ns (-27%)

8. Conclusion

Rationale:

  1. P0-1 is free (compile-time guard) → Immediate -24%
  2. P0-2 is high-ROI (1-2 hrs) → Additional -27%
  3. P1 (Learning Thread) is NOT worth it (complexity vs gain)
  4. P2 is optional polish → Additional -14%

Final Target: 70,000 cycles (~45% reduction from baseline)

Timeline:

  • Week 1: P0-1 + P0-2 (2-3 hours total)
  • Week 2: P2 (optional, 2-3 hours)
  • Week 3: Benchmark & validate

Success Criteria:

  • hak_alloc < 75,000 cycles (40% reduction) → Minimum Success
  • hak_alloc < 60,000 cycles (52% reduction) → Target Success
  • hak_alloc < 50,000 cycles (60% reduction) → Stretch Goal 🎉

Next Steps

  1. Implement P0-1 (30 min)
  2. Measure baseline (10 min)
  3. Implement P0-2 (1-2 hrs)
  4. Measure improvement (10 min)
  5. Decide on P2 based on results

Total time investment: 2-3 hours for a 45% reduction. Excellent ROI!