

# Phase 6.11.4: Threading Overhead Analysis & Optimization Plan
**Date**: 2025-10-22
**Author**: ChatGPT Ultra Think (o1-preview equivalent)
**Context**: Post-Phase 6.11.3 profiling results reveal `hak_alloc` consuming 39.6% of cycles
---
## 📊 Executive Summary
### Current Bottleneck
```
hak_alloc: 126,479 cycles (39.6%) ← #2 MAJOR BOTTLENECK
├─ ELO selection (every 100 calls)
├─ Site Rules lookup (4-probe hash)
├─ atomic_fetch_add (atomic op on every alloc)
├─ Conditional branches (FROZEN/CANARY/LEARN)
└─ Learning logic (hak_evo_tick, hak_elo_record_alloc)
```
### Recommended Strategy: **Staged Optimization** (3 Phases)
1. **Phase 6.11.4 (P0-1)**: Atomic reduction - Immediate, low-risk (~15-20% reduction)
2. **Phase 6.11.4 (P0-2)**: Lightweight LEARN mode - Medium-term, medium-risk (~25-35% reduction)
3. **Phase 6.11.5 (P1)**: Learning Thread - Long-term, high-reward (~50-70% reduction)
**Target**: 126,479 cycles → **<50,000 cycles** (~60% reduction total)
---
## 1. Thread-Safety Cost Analysis
### 1.1 Current Atomic Operations
**Location**: `hakmem.c:362-369`
```c
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}
```
**Cost Breakdown** (estimated per allocation):
| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `atomic_fetch_add` | **30-50** | **24-40%** | LOCK CMPXCHG on x86 |
| Conditional check (`& 0x3FF`) | 2-5 | 2-4% | Bitwise AND + branch |
| `hak_evo_tick` (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| **Subtotal (Evolution)** | **~40-70** | **~30-50%** | **Major overhead!** |
**ELO sampling** (`hakmem.c:397-412`):
```c
g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();       // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);    // ~100-200 cycles
}
```
| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `g_elo_call_count++` | 1-2 | <1% | **UNSAFE! Non-atomic** |
| Modulo check (`% 100`) | 5-10 | 4-8% | DIV instruction |
| `hak_elo_select_strategy` (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| `hak_elo_record_alloc` (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| **Subtotal (ELO)** | **~15-30** | **~10-20%** | Medium overhead |
**Total atomic overhead**: **55-100 cycles/allocation** (~40-70% of `hak_alloc`)
---
### 1.2 Lock-Free Queue Overhead (for Phase 6.11.5)
**Estimated cost per event** (MPSC queue):
| Operation | Cycles | Notes |
|-----------|--------|-------|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| **Total per event** | **60-110** | Higher than current atomic! |
**CRITICAL INSIGHT**: Lock-free queue is **NOT faster** for high-frequency events!
**Reason**:
- Current: 1 atomic op (`atomic_fetch_add`)
- Queue: 1 allocation + 1 atomic op (enqueue)
- **Net change**: +60-70 cycles per allocation
**Recommendation**: **AVOID lock-free queue for hot-path**. Use alternative approach.
---
## 2. Implementation Plan: Staged Optimization
### Phase 6.11.4 (P0-1): Atomic Operation Elimination ⭐ **HIGHEST PRIORITY**
**Goal**: Remove atomic overhead when learning disabled
**Expected gain**: **30-50 cycles** (~24-40% of `hak_alloc`)
**Implementation time**: **30 minutes**
**Risk**: **ZERO** (compile-time guard)
#### Changes
**File**: `hakmem.c:362-369`
```c
// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
    hak_evo_tick(get_time_ns());
}
#endif
```
**Tradeoff**: None! Pure win when `HAKMEM_FEATURE_EVOLUTION=0` at compile-time.
**Measurement**:
```bash
# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
```
**Expected result**:
```
hak_alloc: 126,479 → 96,000 cycles (-24%)
```
---
### Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling ⭐ **HIGH PRIORITY**
**Goal**: Reduce ELO overhead without accuracy loss
**Expected gain**: **15-30 cycles** (~12-24% of `hak_alloc`)
**Implementation time**: **1-2 hours**
**Risk**: **LOW** (conservative approach)
#### Strategy: Async ELO Update
**Problem**: `hak_elo_select_strategy()` is heavy (500-1000 cycles)
**Solution**: A pre-computed strategy, **not** an async event queue
**Key Insight**: ELO selection is **not needed on the hot path**
#### Implementation
**1. Pre-computed Strategy Cache**
```c
// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2; // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0; // Invalidation key
```
**2. Background Thread (Simulated)**
```c
// Called from hak_evo_tick() (every 1024 allocations)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}
```
**3. Hot-path (hakmem.c:397-412)**
```c
// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);
    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
```
**Tradeoff Analysis**:
| Aspect | Before | After | Change |
|--------|--------|-------|--------|
| Hot-path cost | 15-30 cycles | **5-10 cycles** | **-67% to -50%** |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |
**Expected result**:
```
hak_alloc: 96,000 → 70,000 cycles (-27%)
Total: 126,479 → 70,000 cycles (-45%)
```
**Recommendation**: **IMPLEMENT FIRST** (before Phase 6.11.5)
---
### Phase 6.11.5 (P1): Learning Thread (Full Offload) ⭐ **FUTURE WORK**
**Goal**: Complete learning offload to dedicated thread
**Expected gain**: **20-40 cycles** (additional ~15-30%)
**Implementation time**: **4-6 hours**
**Risk**: **MEDIUM** (thread management, race conditions)
#### Architecture
```
┌─────────────────────────────────────────┐
│ hak_alloc (Hot-path) │
│ ┌───────────────────────────────────┐ │
│ │ 1. Read g_cached_strategy_id │ │ ← Atomic read (~10 cycles)
│ │ 2. Route allocation │ │
│ │ 3. [Optional] Push event to queue │ │ ← Only if sampling (1/100)
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│ Learning Thread (Background) │
│ ┌───────────────────────────────────┐ │
│ │ 1. Pop events (batched) │ │
│ │ 2. Update ELO ratings │ │
│ │ 3. Update distribution signature │ │
│ │ 4. Recompute best strategy │ │
│ │ 5. Update g_cached_strategy_id │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
#### Implementation Details
**1. Event Queue (Custom Ring Buffer)**
```c
// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t type;            // EVENT_ALLOC / EVENT_FREE
    size_t size;
    uint64_t duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;   // Producer index
    _Atomic uint64_t tail;   // Consumer index
} hak_event_queue_t;
```
**Cost**: ~30 cycles (ring buffer write, no CAS needed!)
**2. Sampling Strategy**
```c
// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,   // Not measured in hot-path
        .site_id = site_id
    });
}
```
**3. Background Thread**
```c
void* learning_thread_main(void* arg) {
    (void)arg;
    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);
        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);
        for (int i = 0; i < count; i++) {
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }
        // Periodic ELO update (every 10 batches)
        if (++g_batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
```
#### Tradeoff Analysis
| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|--------|---------------------|-------------------|--------|
| Hot-path cost | 5-10 cycles | **~10-15 cycles** | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |
**CRITICAL DECISION**: Phase 6.11.5 **DOES NOT improve hot-path** over Phase 6.11.4!
**Reason**: Sampling overhead (~5 cycles) cancels out atomic elimination (~5 cycles)
**Recommendation**: **SKIP Phase 6.11.5** unless:
1. Learning accuracy requires higher sampling rate (>1/100)
2. Background analytics needed (real-time dashboard)
---
## 3. Hash Table Optimization (Phase 6.11.6 - P2)
**Current cost**: Site Rules lookup (~10-20 cycles)
### Strategy 1: Perfect Hashing
**Benefit**: O(1) lookup without collisions
**Tradeoff**: Rebuild cost on new site, max 256 sites
**Implementation**:
```c
// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256]; // Direct lookup, no probing
```
**Expected gain**: **5-10 cycles** (~4-8%)
### Strategy 2: Cache-line Alignment
**Current**: 4-probe hash → 4 cache lines (worst case)
**Improvement**: Pack entries into single cache line
```c
typedef struct {
    uint64_t site_id;
    uint32_t route;          // RouteType stored as a fixed-width value
    uint8_t  padding[4];     // Pad to 16 bytes → 4 entries per 64-byte line
} __attribute__((aligned(16))) SiteRuleEntry;

_Static_assert(sizeof(SiteRuleEntry) == 16, "quarter of a cache line");
```
**Expected gain**: **2-5 cycles** (~2-4%)
### Recommendation
**Priority**: P2 (after Phase 6.11.4 P0-1/P0-2)
**Expected gain**: **7-15 cycles** (~6-12%)
**Implementation time**: 2-3 hours
---
## 4. Trade-off Analysis
### 4.1 Thread-Safety vs Learning Accuracy
| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|----------|---------------|-------------------|------------|
| **Current** | 126,479 cycles | 100% | Low |
| **P0-1 (Atomic Reduction)** | 96,000 cycles | 100% | Very Low |
| **P0-2 (Cached Strategy)** | 70,000 cycles | 99% | Low |
| **P1 (Learning Thread)** | 70,000-75,000 cycles | 95-99% | Medium |
| **P2 (Hash Opt)** | 60,000 cycles | 99% | Medium |
### 4.2 Implementation Complexity vs Performance Gain
```
Performance Gain
P0-1 ──────────────────┼────────────┐ (30-50 cycles, 30 min)
(Atomic Reduction)     │            │
│ │
P0-2 ──────────────────┼──────┐ │ (25-35 cycles, 1-2 hrs)
(Cached Strategy) │ │ │
│ │ │
P2 ─────────────────┼──────┼─────┼──┐ (7-15 cycles, 2-3 hrs)
(Hash Opt) │ │ │ │
│ │ │ │
P1 ────────────────┼──────┼─────┼──┤ (5-10 cycles, 4-6 hrs)
(Learning Thread) │ │ │ │
0──────────────────→ Complexity
Low Med High
```
**Sweet Spot**: **P0-2 (Cached Strategy)**
- 45% total reduction (126,479 → 70,000 cycles)
- 1-2 hours implementation
- Low complexity, low risk
---
## 5. Recommended Implementation Order
### Week 1: Quick Wins (P0-1 + P0-2)
**Day 1**: Phase 6.11.4 (P0-1) - Atomic Reduction
- Time: 30 minutes
- Expected: 126,479 → 96,000 cycles (-24%)
**Day 2**: Phase 6.11.4 (P0-2) - Cached Strategy
- Time: 1-2 hours
- Expected: 96,000 → 70,000 cycles (-27%)
- **Total: -45% reduction** ✅
### Week 2: Medium Gains (P2)
**Day 3-4**: Phase 6.11.6 (P2) - Hash Optimization
- Time: 2-3 hours
- Expected: 70,000 → 60,000 cycles (-14%)
- **Total: -52% reduction** ✅
### Week 3: Evaluation
**Benchmark** all scenarios (json/mir/vm)
- If `hak_alloc` < 50,000 cycles → **STOP**
- If `hak_alloc` > 50,000 cycles → Consider Phase 6.11.5 (P1)
---
## 6. Risk Assessment
| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0-1** | **ZERO** | None (compile-time) | None needed |
| **P0-2** | **LOW** | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| **P1** | **MEDIUM** | Race conditions, thread bugs | Extensive testing, feature flag |
| **P2** | **LOW** | Hash collisions, rebuild cost | Fallback to linear probe |
---
## 7. Expected Final Results
### Pessimistic Scenario (Only P0-1 + P0-2)
```
hak_alloc: 126,479 → 70,000 cycles (-45%)
Overall: 319,021 → 262,542 cycles (-18%)
vm scenario: 15,021 ns → 12,000 ns (-20%)
```
### Optimistic Scenario (P0-1 + P0-2 + P2)
```
hak_alloc: 126,479 → 60,000 cycles (-52%)
Overall: 319,021 → 252,542 cycles (-21%)
vm scenario: 15,021 ns → 11,500 ns (-23%)
```
### Stretch Goal (All Phases)
```
hak_alloc: 126,479 → 50,000 cycles (-60%)
Overall: 319,021 → 242,542 cycles (-24%)
vm scenario: 15,021 ns → 11,000 ns (-27%)
```
---
## 8. Conclusion
### ✅ Recommended Path: **Staged Optimization** (P0-1 → P0-2 → P2)
**Rationale**:
1. **P0-1** is free (compile-time guard) → Immediate -24%
2. **P0-2** is high-ROI (1-2 hrs) → Additional -27%
3. **P1 (Learning Thread) is NOT worth it** (complexity vs gain)
4. **P2** is optional polish → Additional -14%
**Final Target**: **70,000 cycles** (45% reduction from baseline)
**Timeline**:
- Week 1: P0-1 + P0-2 (2-3 hours total)
- Week 2: P2 (optional, 2-3 hours)
- Week 3: Benchmark & validate
**Success Criteria**:
- `hak_alloc` < 75,000 cycles (40% reduction) **Minimum Success**
- `hak_alloc` < 60,000 cycles (52% reduction) **Target Success**
- `hak_alloc` < 50,000 cycles (60% reduction) **Stretch Goal** 🎉
---
## Next Steps
1. **Implement P0-1** (30 min)
2. **Measure baseline** (10 min)
3. **Implement P0-2** (1-2 hrs)
4. **Measure improvement** (10 min)
5. **Decide on P2** based on results
**Total time investment**: 2-3 hours for a **45% reduction**. **Excellent ROI!**