hakmem/docs/analysis/PHASE_6.11.4_THREADING_COST_ANALYSIS.md

# Phase 6.11.4: Threading Overhead Analysis & Optimization Plan
**Date**: 2025-10-22
**Author**: ChatGPT Ultra Think (o1-preview equivalent)
**Context**: Post-Phase 6.11.3 profiling results reveal `hak_alloc` consuming 39.6% of cycles
---
## 📊 Executive Summary
### Current Bottleneck
```
hak_alloc: 126,479 cycles (39.6%) ← #2 MAJOR BOTTLENECK
├─ ELO selection (every 100 calls)
├─ Site Rules lookup (4-probe hash)
├─ atomic_fetch_add (atomic op on every alloc)
├─ Branching (FROZEN/CANARY/LEARN)
└─ Learning logic (hak_evo_tick, hak_elo_record_alloc)
```
### Recommended Strategy: **Staged Optimization** (3 Phases)
1. **Phase 6.11.4 (P0-1)**: Atomic elimination - Immediate, low-risk (~15-20% reduction)
2. **Phase 6.11.4 (P0-2)**: Lightweight LEARN mode - Medium-term, medium-risk (~25-35% reduction)
3. **Phase 6.11.5 (P1)**: Learning thread - Long-term, high-reward (~50-70% reduction)
**Target**: 126,479 cycles → **<50,000 cycles** (~60% reduction total)
---
## 1. Thread-Safety Cost Analysis
### 1.1 Current Atomic Operations
**Location**: `hakmem.c:362-369`
```c
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}
```
**Cost Breakdown** (estimated per allocation):
| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `atomic_fetch_add` | **30-50** | **24-40%** | LOCK XADD on x86 |
| Conditional check (`& 0x3FF`) | 2-5 | 2-4% | Bitwise AND + branch |
| `hak_evo_tick` (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| **Subtotal (Evolution)** | **~40-70** | **~30-50%** | **Major overhead!** |
**ELO sampling** (`hakmem.c:397-412`):
```c
g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();    // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0); // ~100-200 cycles
}
```
| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `g_elo_call_count++` | 1-2 | <1% | **UNSAFE! Non-atomic** |
| Modulo check (`% 100`) | 5-10 | 4-8% | DIV instruction |
| `hak_elo_select_strategy` (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| `hak_elo_record_alloc` (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| **Subtotal (ELO)** | **~15-30** | **~10-20%** | Medium overhead |
**Total atomic overhead**: **55-100 cycles/allocation** (~40-80% of `hak_alloc`)
---
### 1.2 Lock-Free Queue Overhead (for Phase 6.11.5)
**Estimated cost per event** (MPSC queue):
| Operation | Cycles | Notes |
|-----------|--------|-------|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| **Total per event** | **60-110** | Higher than current atomic! |
**⚠️ CRITICAL INSIGHT**: Lock-free queue is **NOT faster** for high-frequency events!
**Reason**:
- Current: 1 atomic op (`atomic_fetch_add`)
- Queue: 1 allocation + 1 atomic op (enqueue)
- **Net change**: +60-70 cycles per allocation
**Recommendation**: **AVOID lock-free queue for hot-path**. Use alternative approach.
---
## 2. Implementation Plan: Staged Optimization
### Phase 6.11.4 (P0-1): Atomic Operation Elimination ⭐ **HIGHEST PRIORITY**
**Goal**: Remove atomic overhead when learning disabled
**Expected gain**: **30-50 cycles** (~24-40% of `hak_alloc`)
**Implementation time**: **30 minutes**
**Risk**: **ZERO** (compile-time guard)
#### Changes
**File**: `hakmem.c:362-369`
```c
// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(get_time_ns());
    }
#endif
```
**Tradeoff**: None! Pure win when `HAKMEM_FEATURE_EVOLUTION=0` at compile-time.
**Measurement**:
```bash
# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
```
**Expected result**:
```
hak_alloc: 126,479 → 96,000 cycles (-24%)
```
---
### Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling ⭐ **HIGH PRIORITY**
**Goal**: Reduce ELO overhead without accuracy loss
**Expected gain**: **15-30 cycles** (~12-24% of `hak_alloc`)
**Implementation time**: **1-2 hours**
**Risk**: **LOW** (conservative approach)
#### Strategy: Async ELO Update
**Problem**: `hak_elo_select_strategy()` is heavy (500-1000 cycles)
**Solution**: A pre-computed strategy cache, **not** an async event queue
**Key Insight**: ELO selection is **not needed** on the hot path
#### Implementation
**1. Pre-computed Strategy Cache**
```c
// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2; // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0; // Invalidation key
```
**2. Background Thread (Simulated)**
```c
// Called by hak_evo_tick() (every 1024 allocs)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}
```
**3. Hot-path (hakmem.c:397-412)**
```c
// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);
    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
```
**Tradeoff Analysis**:
| Aspect | Before | After | Change |
|--------|--------|-------|--------|
| Hot-path cost | 15-30 cycles | **5-10 cycles** | **-67% to -50%** |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |
**Expected result**:
```
hak_alloc: 96,000 → 70,000 cycles (-27%)
Total: 126,479 → 70,000 cycles (-45%)
```
**Recommendation**: ✅ **IMPLEMENT FIRST** (before Phase 6.11.5)
---
### Phase 6.11.5 (P1): Learning Thread (Full Offload) ⭐ **FUTURE WORK**
**Goal**: Complete learning offload to dedicated thread
**Expected gain**: **20-40 cycles** (additional ~15-30%)
**Implementation time**: **4-6 hours**
**Risk**: **MEDIUM** (thread management, race conditions)
#### Architecture
```
┌─────────────────────────────────────────┐
│ hak_alloc (Hot-path) │
│ ┌───────────────────────────────────┐ │
│ │ 1. Read g_cached_strategy_id │ │ ← Atomic read (~10 cycles)
│ │ 2. Route allocation │ │
│ │ 3. [Optional] Push event to queue │ │ ← Only if sampling (1/100)
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│ Learning Thread (Background) │
│ ┌───────────────────────────────────┐ │
│ │ 1. Pop events (batched) │ │
│ │ 2. Update ELO ratings │ │
│ │ 3. Update distribution signature │ │
│ │ 4. Recompute best strategy │ │
│ │ 5. Update g_cached_strategy_id │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
#### Implementation Details
**1. Event Queue (Custom Ring Buffer)**
```c
// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t   type;          // EVENT_ALLOC / EVENT_FREE
    size_t    size;
    uint64_t  duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;   // Producer index
    _Atomic uint64_t tail;   // Consumer index
} hak_event_queue_t;
```
**Cost**: ~30 cycles (ring-buffer write; no CAS needed as long as each queue has a single producer)
**2. Sampling Strategy**
```c
// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,  // Not measured in hot-path
        .site_id = site_id
    });
}
}
```
**3. Background Thread**
```c
void* learning_thread_main(void* arg) {
    (void)arg;
    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);
        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);
        for (int i = 0; i < count; i++) {
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }
        // Periodic ELO update (every 10 batches)
        if (++g_batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
```
#### Tradeoff Analysis
| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|--------|---------------------|-------------------|--------|
| Hot-path cost | 5-10 cycles | **~10-15 cycles** | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |
**⚠️ CRITICAL DECISION**: Phase 6.11.5 **DOES NOT improve hot-path** over Phase 6.11.4!
**Reason**: Sampling overhead (~5 cycles) cancels out atomic elimination (~5 cycles)
**Recommendation**: ⚠️ **SKIP Phase 6.11.5** unless:
1. Learning accuracy requires higher sampling rate (>1/100)
2. Background analytics needed (real-time dashboard)
---
## 3. Hash Table Optimization (Phase 6.11.6 - P2)
**Current cost**: Site Rules lookup (~10-20 cycles)
### Strategy 1: Perfect Hashing
**Benefit**: O(1) lookup without collisions
**Tradeoff**: Rebuild cost on new site, max 256 sites
**Implementation**:
```c
// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256]; // Direct lookup, no probing
```
**Expected gain**: **5-10 cycles** (~4-8%)
### Strategy 2: Cache-line Alignment
**Current**: 4-probe hash → 4 cache lines (worst case)
**Improvement**: Pack entries into single cache line
```c
typedef struct {
    uint64_t site_id;
    RouteType route;
    uint8_t padding[6];  // Pad to 16 bytes (4 entries per 64-byte cache line)
} __attribute__((aligned(16))) SiteRuleEntry;
```
**Expected gain**: **2-5 cycles** (~2-4%)
### Recommendation
**Priority**: P2 (after Phase 6.11.4 P0-1/P0-2)
**Expected gain**: **7-15 cycles** (~6-12%)
**Implementation time**: 2-3 hours
---
## 4. Trade-off Analysis
### 4.1 Thread-Safety vs Learning Accuracy
| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|----------|---------------|-------------------|------------|
| **Current** | 126,479 cycles | 100% | Low |
| **P0-1 (Atomic elimination)** | 96,000 cycles | 100% | Very Low |
| **P0-2 (Cached Strategy)** | 70,000 cycles | 99% | Low |
| **P1 (Learning Thread)** | 70,000-75,000 cycles | 95-99% | Medium |
| **P2 (Hash Opt)** | 60,000 cycles | 99% | Medium |
### 4.2 Implementation Complexity vs Performance Gain
```
Performance Gain
P0-1 ──────────────────┼────────────┐ (30-50 cycles, 30 min)
(Atomic elim.) │ │
│ │
P0-2 ──────────────────┼──────┐ │ (25-35 cycles, 1-2 hrs)
(Cached Strategy) │ │ │
│ │ │
P2 ─────────────────┼──────┼─────┼──┐ (7-15 cycles, 2-3 hrs)
(Hash Opt) │ │ │ │
│ │ │ │
P1 ────────────────┼──────┼─────┼──┤ (5-10 cycles, 4-6 hrs)
(Learning Thread) │ │ │ │
0──────────────────→ Complexity
Low Med High
```
**Sweet Spot**: **P0-2 (Cached Strategy)**
- 55% total reduction (126,479 → 70,000 cycles)
- 1-2 hours implementation
- Low complexity, low risk
---
## 5. Recommended Implementation Order
### Week 1: Quick Wins (P0-1 + P0-2)
**Day 1**: Phase 6.11.4 (P0-1) - Atomic elimination
- Time: 30 minutes
- Expected: 126,479 → 96,000 cycles (-24%)
**Day 2**: Phase 6.11.4 (P0-2) - Cached Strategy
- Time: 1-2 hours
- Expected: 96,000 → 70,000 cycles (-27%)
- **Total: -45% reduction** ✅
### Week 2: Medium Gains (P2)
**Day 3-4**: Phase 6.11.6 (P2) - Hash Optimization
- Time: 2-3 hours
- Expected: 70,000 → 60,000 cycles (-14%)
- **Total: -52% reduction** ✅
### Week 3: Evaluation
**Benchmark** all scenarios (json/mir/vm)
- If `hak_alloc` < 50,000 cycles → **STOP**
- If `hak_alloc` > 50,000 cycles → Consider Phase 6.11.5 (P1)
---
## 6. Risk Assessment
| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0-1** | **ZERO** | None (compile-time) | None needed |
| **P0-2** | **LOW** | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| **P1** | **MEDIUM** | Race conditions, thread bugs | Extensive testing, feature flag |
| **P2** | **LOW** | Hash collisions, rebuild cost | Fallback to linear probe |
---
## 7. Expected Final Results
### Pessimistic Scenario (Only P0-1 + P0-2)
```
hak_alloc: 126,479 → 70,000 cycles (-45%)
Overall: 319,021 → 262,542 cycles (-18%)
vm scenario: 15,021 ns → 12,000 ns (-20%)
```
### Optimistic Scenario (P0-1 + P0-2 + P2)
```
hak_alloc: 126,479 → 60,000 cycles (-52%)
Overall: 319,021 → 252,542 cycles (-21%)
vm scenario: 15,021 ns → 11,500 ns (-23%)
```
### Stretch Goal (All Phases)
```
hak_alloc: 126,479 → 50,000 cycles (-60%)
Overall: 319,021 → 242,542 cycles (-24%)
vm scenario: 15,021 ns → 11,000 ns (-27%)
```
---
## 8. Conclusion
### ✅ Recommended Path: **Staged Optimization** (P0-1 → P0-2 → P2)
**Rationale**:
1. **P0-1** is free (compile-time guard) → Immediate -24%
2. **P0-2** is high-ROI (1-2 hrs) → Additional -27%
3. **P1 (Learning Thread) is NOT worth it** (complexity vs gain)
4. **P2** is optional polish → Additional -14%
**Final Target**: **70,000 cycles** (55% reduction from baseline)
**Timeline**:
- Week 1: P0-1 + P0-2 (2-3 hours total)
- Week 2: P2 (optional, 2-3 hours)
- Week 3: Benchmark & validate
**Success Criteria**:
- `hak_alloc` < 75,000 cycles (40% reduction) **Minimum Success**
- `hak_alloc` < 60,000 cycles (52% reduction) **Target Success**
- `hak_alloc` < 50,000 cycles (60% reduction) **Stretch Goal** 🎉
---
## Next Steps
1. **Implement P0-1** (30 min)
2. **Measure baseline** (10 min)
3. **Implement P0-2** (1-2 hrs)
4. **Measure improvement** (10 min)
5. **Decide on P2** based on results
**Total time investment**: 2-3 hours for **45% reduction**. **Excellent ROI!**