# Phase 6.11.4: Threading Overhead Analysis & Optimization Plan

**Date**: 2025-10-22
**Author**: ChatGPT Ultra Think (o1-preview equivalent)
**Context**: Post-Phase 6.11.3 profiling results reveal `hak_alloc` consuming 39.6% of cycles

---

## 📊 Executive Summary

### Current Bottleneck

```
hak_alloc: 126,479 cycles (39.6%)  ← #2 MAJOR BOTTLENECK
├─ ELO selection (every 100 calls)
├─ Site Rules lookup (4-probe hash)
├─ atomic_fetch_add (atomic op on every allocation)
├─ Branching (FROZEN/CANARY/LEARN)
└─ Learning logic (hak_evo_tick, hak_elo_record_alloc)
```

### Recommended Strategy: **Staged Optimization** (3 Phases)

1. **Phase 6.11.4 (P0-1)**: Atomic elimination - immediate, low-risk (~15-20% reduction)
2. **Phase 6.11.4 (P0-2)**: LEARN-mode lightweighting - medium-term, medium-risk (~25-35% reduction)
3. **Phase 6.11.5 (P1)**: Learning Thread - long-term, high-reward (~50-70% reduction)

**Target**: 126,479 cycles → **<50,000 cycles** (~60% reduction total)

---

## 1. Thread-Safety Cost Analysis

### 1.1 Current Atomic Operations

**Location**: `hakmem.c:362-369`

```c
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}
```

**Cost Breakdown** (estimated per allocation):

| Operation | Cycles | % of `hak_alloc` | Notes |
|-----------|--------|------------------|-------|
| `atomic_fetch_add` | **30-50** | **24-40%** | LOCK XADD on x86 |
| Conditional check (`& 0x3FF`) | 2-5 | 2-4% | Bitwise AND + branch |
| `hak_evo_tick` (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| **Subtotal (Evolution)** | **~40-70** | **~30-50%** | **Major overhead!** |

**ELO sampling** (`hakmem.c:397-412`):

```c
g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();     // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // ~100-200 cycles
}
```

| Operation | Cycles | % of `hak_alloc` | Notes |
|-----------|--------|------------------|-------|
| `g_elo_call_count++` | 1-2 | <1% | **UNSAFE! Non-atomic** |
| Modulo check (`% 100`) | 5-10 | 4-8% | DIV instruction |
| `hak_elo_select_strategy` (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| `hak_elo_record_alloc` (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| **Subtotal (ELO)** | **~15-30** | **~10-20%** | Medium overhead |

**Total atomic overhead**: **55-100 cycles/allocation** (~40-80% of `hak_alloc`)

---

### 1.2 Lock-Free Queue Overhead (for Phase 6.11.5)

**Estimated cost per event** (MPSC queue):

| Operation | Cycles | Notes |
|-----------|--------|-------|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| **Total per event** | **60-110** | Higher than current atomic! |

**⚠️ CRITICAL INSIGHT**: A lock-free queue is **NOT faster** for high-frequency events!

**Reason**:
- Current: 1 atomic op (`atomic_fetch_add`)
- Queue: 1 allocation + 1 atomic op (enqueue)
- **Net change**: +30-60 cycles per allocation (per the table above)

**Recommendation**: **AVOID a lock-free queue on the hot path**. Use an alternative approach.

---
## 2. Implementation Plan: Staged Optimization

### Phase 6.11.4 (P0-1): Atomic Operation Elimination ⭐ **HIGHEST PRIORITY**

**Goal**: Remove atomic overhead when learning is disabled
**Expected gain**: **30-50 cycles** (~24-40% of `hak_alloc`)
**Implementation time**: **30 minutes**
**Risk**: **ZERO** (compile-time guard)

#### Changes

**File**: `hakmem.c:362-369`

```c
// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(get_time_ns());
    }
#endif
```

**Tradeoff**: None! Pure win when `HAKMEM_FEATURE_EVOLUTION=0` at compile time.

**Measurement**:

```bash
# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem

# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
```

**Expected result**:

```
hak_alloc: 126,479 → 96,000 cycles (-24%)
```

---

### Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling ⭐ **HIGH PRIORITY**

**Goal**: Reduce ELO overhead without accuracy loss
**Expected gain**: **15-30 cycles** (~12-24% of `hak_alloc`)
**Implementation time**: **1-2 hours**
**Risk**: **LOW** (conservative approach)

#### Strategy: Async ELO Update

**Problem**: `hak_elo_select_strategy()` is heavy (500-1000 cycles)
**Solution**: a pre-computed strategy, **not** an async event queue

**Key Insight**: ELO selection is **not needed on the hot path**!

#### Implementation

**1. Pre-computed Strategy Cache**

```c
// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2;   // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0;  // Invalidation key
```
**2. Background Thread (Simulated)**

```c
// Called by hak_evo_tick() (every 1024 allocs)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}
```

**3. Hot-path (hakmem.c:397-412)**

```c
// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);

    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
```

**Tradeoff Analysis**:

| Aspect | Before | After | Change |
|--------|--------|-------|--------|
| Hot-path cost | 15-30 cycles | **5-10 cycles** | **-50% to -67%** |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |

**Expected result**:

```
hak_alloc: 96,000 → 70,000 cycles (-27%)
Total:     126,479 → 70,000 cycles (-45%)
```

**Recommendation**: ✅ **IMPLEMENT FIRST** (before Phase 6.11.5)

---

### Phase 6.11.5 (P1): Learning Thread (Full Offload) ⭐ **FUTURE WORK**

**Goal**: Complete learning offload to a dedicated thread
**Expected gain**: **20-40 cycles** (additional ~15-30%)
**Implementation time**: **4-6 hours**
**Risk**: **MEDIUM** (thread management, race conditions)

#### Architecture

```
┌─────────────────────────────────────────┐
│ hak_alloc (Hot-path)                    │
│ ┌─────────────────────────────────────┐ │
│ │ 1. Read g_cached_strategy_id        │ │ ← Atomic read (~10 cycles)
│ │ 2. Route allocation                 │ │
│ │ 3. [Optional] Push event to queue   │ │ ← Only if sampling (1/100)
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
                 ↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│ Learning Thread (Background)            │
│ ┌─────────────────────────────────────┐ │
│ │ 1. Pop events (batched)             │ │
│ │ 2. Update ELO ratings               │ │
│ │ 3. Update distribution signature    │ │
│ │ 4. Recompute best strategy          │ │
│ │ 5. Update g_cached_strategy_id      │ │
│ └─────────────────────────────────────┘ │
└─────────────────────────────────────────┘
```

#### Implementation Details

**1. Event Queue (Custom Ring Buffer)**

```c
// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t   type;         // EVENT_ALLOC / EVENT_FREE
    size_t    size;
    uint64_t  duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;  // Producer index
    _Atomic uint64_t tail;  // Consumer index
} hak_event_queue_t;
```

**Cost**: ~30 cycles (ring-buffer write; no CAS for a single producer, though a true MPSC enqueue still needs one `atomic_fetch_add` on `head`)

**2. Sampling Strategy**

```c
// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,   // Not measured in hot-path
        .site_id = site_id
    });
}
```
**3. Background Thread**

```c
static _Atomic int g_shutdown = 0;
static uint64_t g_batch_count = 0;  // Only touched by this thread

void* learning_thread_main(void* arg) {
    (void)arg;
    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);

        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);
        for (int i = 0; i < count; i++) {
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }

        // Periodic ELO update (every 10 batches)
        if (++g_batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
```

#### Tradeoff Analysis

| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|--------|---------------------|-------------------|--------|
| Hot-path cost | 5-10 cycles | **~10-15 cycles** | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200 ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |

**⚠️ CRITICAL DECISION**: Phase 6.11.5 **does NOT improve the hot path** over Phase 6.11.4!

**Reason**: The sampling overhead (~5 cycles) cancels out the atomic elimination (~5 cycles).

**Recommendation**: ⚠️ **SKIP Phase 6.11.5** unless:

1. Learning accuracy requires a higher sampling rate (>1/100)
2. Background analytics are needed (real-time dashboard)

---
## 3. Hash Table Optimization (Phase 6.11.6 - P2)

**Current cost**: Site Rules lookup (~10-20 cycles)

### Strategy 1: Perfect Hashing

**Benefit**: O(1) lookup without collisions
**Tradeoff**: Rebuild cost on a new site, max 256 sites

**Implementation**:

```c
// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256];  // Direct lookup, no probing
```

**Expected gain**: **5-10 cycles** (~4-8%)

### Strategy 2: Cache-line Alignment

**Current**: 4-probe hash → up to 4 cache lines touched (worst case)
**Improvement**: Pack entries into a single cache line

```c
typedef struct {
    uint64_t  site_id;
    RouteType route;
    uint8_t   padding[4];  // Pad to 16 bytes (assuming a 4-byte RouteType)
} __attribute__((aligned(16))) SiteRuleEntry;
```

**Expected gain**: **2-5 cycles** (~2-4%)

### Recommendation

**Priority**: P2 (after Phase 6.11.4 P0-1/P0-2)
**Expected gain**: **7-15 cycles** (~6-12%)
**Implementation time**: 2-3 hours

---

## 4. Trade-off Analysis

### 4.1 Thread-Safety vs Learning Accuracy

| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|----------|---------------|-------------------|------------|
| **Current** | 126,479 cycles | 100% | Low |
| **P0-1 (Atomic elimination)** | 96,000 cycles | 100% | Very Low |
| **P0-2 (Cached Strategy)** | 70,000 cycles | 99% | Low |
| **P1 (Learning Thread)** | 70,000-75,000 cycles | 95-99% | Medium |
| **P2 (Hash Opt)** | 60,000 cycles | 99% | Medium |

### 4.2 Implementation Complexity vs Performance Gain

```
Performance Gain
↑
│ P0-1 (Atomic elimination)   30-50 cycles, 30 min
│ P0-2 (Cached Strategy)      25-35 cycles, 1-2 hrs
│ P2   (Hash Opt)              7-15 cycles, 2-3 hrs
│ P1   (Learning Thread)       5-10 cycles, 4-6 hrs
└──────────────────────────────────────→ Complexity
   Low              Med              High
```

**Sweet Spot**: **P0-2 (Cached Strategy)**

- 45% total reduction (126,479 → 70,000 cycles)
- 1-2 hours implementation
- Low complexity, low risk

---

## 5. Recommended Implementation Order

### Week 1: Quick Wins (P0-1 + P0-2)

**Day 1**: Phase 6.11.4 (P0-1) - Atomic elimination
- Time: 30 minutes
- Expected: 126,479 → 96,000 cycles (-24%)

**Day 2**: Phase 6.11.4 (P0-2) - Cached Strategy
- Time: 1-2 hours
- Expected: 96,000 → 70,000 cycles (-27%)
- **Total: -45% reduction** ✅

### Week 2: Medium Gains (P2)

**Day 3-4**: Phase 6.11.6 (P2) - Hash Optimization
- Time: 2-3 hours
- Expected: 70,000 → 60,000 cycles (-14%)
- **Total: -52% reduction** ✅

### Week 3: Evaluation

**Benchmark** all scenarios (json/mir/vm)
- If `hak_alloc` < 50,000 cycles → **STOP** ✅
- If `hak_alloc` > 50,000 cycles → Consider Phase 6.11.5 (P1)

---

## 6. Risk Assessment

| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0-1** | **ZERO** | None (compile-time) | None needed |
| **P0-2** | **LOW** | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| **P1** | **MEDIUM** | Race conditions, thread bugs | Extensive testing, feature flag |
| **P2** | **LOW** | Hash collisions, rebuild cost | Fallback to linear probe |

---
## 7. Expected Final Results

### Pessimistic Scenario (Only P0-1 + P0-2)

```
hak_alloc:   126,479 → 70,000 cycles (-45%)
Overall:     319,021 → 262,542 cycles (-18%)
vm scenario: 15,021 ns → 12,000 ns (-20%)
```

### Optimistic Scenario (P0-1 + P0-2 + P2)

```
hak_alloc:   126,479 → 60,000 cycles (-52%)
Overall:     319,021 → 252,542 cycles (-21%)
vm scenario: 15,021 ns → 11,500 ns (-23%)
```

### Stretch Goal (All Phases)

```
hak_alloc:   126,479 → 50,000 cycles (-60%)
Overall:     319,021 → 242,542 cycles (-24%)
vm scenario: 15,021 ns → 11,000 ns (-27%)
```

---

## 8. Conclusion

### ✅ Recommended Path: **Staged Optimization** (P0-1 → P0-2 → P2)

**Rationale**:
1. **P0-1** is free (compile-time guard) → immediate -24%
2. **P0-2** is high-ROI (1-2 hrs) → additional -27%
3. **P1 (Learning Thread) is NOT worth it** (complexity vs gain)
4. **P2** is optional polish → additional -14%

**Final Target**: **70,000 cycles** (45% reduction from baseline)

**Timeline**:
- Week 1: P0-1 + P0-2 (2-3 hours total)
- Week 2: P2 (optional, 2-3 hours)
- Week 3: Benchmark & validate

**Success Criteria**:
- `hak_alloc` < 75,000 cycles (40% reduction) → **Minimum Success**
- `hak_alloc` < 60,000 cycles (52% reduction) → **Target Success** ✅
- `hak_alloc` < 50,000 cycles (60% reduction) → **Stretch Goal** 🎉

---

## Next Steps

1. **Implement P0-1** (30 min)
2. **Measure baseline** (10 min)
3. **Implement P0-2** (1-2 hrs)
4. **Measure improvement** (10 min)
5. **Decide on P2** based on results

**Total time investment**: 2-3 hours for a **45% reduction** ← **Excellent ROI!**