# Phase 6.11.4: Threading Overhead Analysis & Optimization Plan

**Date**: 2025-10-22
**Author**: ChatGPT Ultra Think (o1-preview equivalent)
**Context**: Post-Phase 6.11.3 profiling results reveal `hak_alloc` consuming 39.6% of cycles

---

## 📊 Executive Summary

### Current Bottleneck

```
hak_alloc: 126,479 cycles (39.6%) ← #2 MAJOR BOTTLENECK
├─ ELO selection (every 100 calls)
├─ Site Rules lookup (4-probe hash)
├─ atomic_fetch_add (atomic operation on every alloc)
├─ Conditional branches (FROZEN/CANARY/LEARN)
└─ Learning logic (hak_evo_tick, hak_elo_record_alloc)
```

### Recommended Strategy: **Staged Optimization** (3 Phases)

1. **Phase 6.11.4 (P0-1)**: Atomic reduction - Immediate, Low-risk (~15-20% reduction)
2. **Phase 6.11.4 (P0-2)**: Lightweight LEARN mode - Medium-term, Medium-risk (~25-35% reduction)
3. **Phase 6.11.5 (P1)**: Learning Thread - Long-term, High-reward (~50-70% reduction)

**Target**: 126,479 cycles → **<50,000 cycles** (~60% reduction total)

---

## 1. Thread-Safety Cost Analysis

### 1.1 Current Atomic Operations

**Location**: `hakmem.c:362-369`

```c
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}
```

**Cost Breakdown** (estimated per allocation):

| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `atomic_fetch_add` | **30-50** | **24-40%** | LOCK XADD on x86 |
| Conditional check (`& 0x3FF`) | 2-5 | 2-4% | Bitwise AND + branch |
| `hak_evo_tick` (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| **Subtotal (Evolution)** | **~40-70** | **~30-50%** | **Major overhead!** |

**ELO sampling** (`hakmem.c:397-412`):

```c
g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();  // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);  // ~100-200 cycles
}
```

| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `g_elo_call_count++` | 1-2 | <1% | **UNSAFE! Non-atomic** |
| Modulo check (`% 100`) | 5-10 | 4-8% | DIV, or a multiply-and-shift sequence if the compiler strength-reduces it |
| `hak_elo_select_strategy` (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| `hak_elo_record_alloc` (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| **Subtotal (ELO)** | **~15-30** | **~10-20%** | Medium overhead |

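The amortized entries in the two tables come from dividing the cost of a periodic operation by its period. A minimal sketch of that arithmetic follows; the helper name `amortized_cycles` is hypothetical and not part of hakmem, and the inputs are the estimates from the tables, not measurements.

```c
#include <stdint.h>

/* Hypothetical helper (not part of hakmem): per-allocation share of a
 * cost that is paid once every `period` allocations. */
double amortized_cycles(double cost_cycles, uint64_t period) {
    return cost_cycles / (double)period;
}
```

With the estimates above, `amortized_cycles(5000, 1024)` is about 4.9 and `amortized_cycles(10000, 1024)` is about 9.8, which gives the "~5-10 cycles/alloc" entry for `hak_evo_tick`. Likewise `amortized_cycles(500, 100)` gives 5 for `hak_elo_select_strategy`.
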
**Total learning-path overhead** (Evolution + ELO): **55-100 cycles/allocation** (~40-80% of `hak_alloc`)

---

### 1.2 Lock-Free Queue Overhead (for Phase 6.11.5)

**Estimated cost per event** (MPSC queue):

| Operation | Cycles | Notes |
|-----------|--------|-------|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| **Total per event** | **60-110** | Higher than current atomic! |

**⚠️ CRITICAL INSIGHT**: Lock-free queue is **NOT faster** for high-frequency events!

**Reason**:
- Current: 1 atomic op (`atomic_fetch_add`)
- Queue: 1 allocation + 1 atomic op (enqueue)
- **Net change**: roughly +30-60 cycles per allocation (60-110 for the queue vs. 30-50 for the current atomic)

**Recommendation**: **AVOID lock-free queue for hot-path**. Use alternative approach.

---

## 2. Implementation Plan: Staged Optimization

### Phase 6.11.4 (P0-1): Atomic Operation Elimination ⭐ **HIGHEST PRIORITY**

**Goal**: Remove atomic overhead when learning disabled
**Expected gain**: **30-50 cycles** (~24-40% of `hak_alloc`)
**Implementation time**: **30 minutes**
**Risk**: **ZERO** (compile-time guard)

#### Changes

**File**: `hakmem.c:362-369`

```c
// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(get_time_ns());
    }
#endif
```

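The guard above can be tried in isolation. The following is a minimal, single-threaded sketch under stated assumptions: `demo_alloc_hook`, `demo_evo_tick`, and `demo_tick_calls` are hypothetical stand-ins, and the flag is assumed to be a plain 0/1 macro that would normally come from a config header.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Assumption: the feature flag is a 0/1 macro, normally set in a config header. */
#ifndef HAKMEM_FEATURE_EVOLUTION
#define HAKMEM_FEATURE_EVOLUTION 1
#endif

/* Stand-in for hak_evo_tick(); it only counts how often the heavy path runs.
 * The plain counter is fine for this single-threaded demo. */
static uint64_t g_demo_tick_calls = 0;
static void demo_evo_tick(void) { g_demo_tick_calls++; }

/* Hypothetical hot-path hook. With the flag at 0 the body is empty, so
 * neither the atomic counter nor the branch exists in the compiled code. */
void demo_alloc_hook(void) {
#if HAKMEM_FEATURE_EVOLUTION
    static _Atomic uint64_t tick_counter = 0;
    /* atomic_fetch_add returns the previous value, so the tick fires on
     * call 1, 1025, 2049, ... */
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        demo_evo_tick();
    }
#endif
}

uint64_t demo_tick_calls(void) { return g_demo_tick_calls; }
```

Building the same file with `-DHAKMEM_FEATURE_EVOLUTION=0` leaves `demo_alloc_hook` empty, which is the effect the P0-1 change relies on.
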
**Tradeoff**: None when `HAKMEM_FEATURE_EVOLUTION=0` at compile time: the counter and the branch are compiled out entirely.

**Measurement**:
```bash
# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem

# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
```

**Expected result**:
```
hak_alloc: 126,479 → 96,000 cycles (-24%)
```

---

### Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling ⭐ **HIGH PRIORITY**

**Goal**: Reduce ELO overhead without accuracy loss
**Expected gain**: **15-30 cycles** (~12-24% of `hak_alloc`)
**Implementation time**: **1-2 hours**
**Risk**: **LOW** (conservative approach)

#### Strategy: Async ELO Update

**Problem**: `hak_elo_select_strategy()` is heavy (500-1000 cycles)
**Solution**: A pre-computed strategy, **not** an asynchronous event queue

**Key Insight**: ELO selection is **not needed on the hot path**!

#### Implementation

**1. Pre-computed Strategy Cache**

```c
// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2;   // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0;  // Invalidation key
```

**2. Background Thread (Simulated)**

```c
// Called by hak_evo_tick() (every 1024 allocs)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();

    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}
```

**3. Hot-path (hakmem.c:397-412)**

```c
// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);

    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
```

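The publish/read split between steps 2 and 3 can be sketched on its own. This is an illustration under stated assumptions, not the hakmem implementation: the threshold table and all `demo_` names are hypothetical, and the real selection logic is replaced by whatever value the caller passes in.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical threshold table; the real values live in the ELO module. */
static const size_t k_demo_thresholds[] = {
    512u * 1024u, 1024u * 1024u, 2048u * 1024u, 4096u * 1024u
};
#define DEMO_STRATEGY_COUNT (sizeof k_demo_thresholds / sizeof k_demo_thresholds[0])

static _Atomic int g_demo_cached_strategy_id = 2;  /* default: 2 MB threshold */

/* Cold path: publish a newly selected strategy (e.g. once per tick).
 * Out-of-range IDs are ignored so the hot path never indexes past the table. */
void demo_publish_strategy(int new_strategy) {
    if (new_strategy < 0 || (size_t)new_strategy >= DEMO_STRATEGY_COUNT) {
        return;
    }
    atomic_store_explicit(&g_demo_cached_strategy_id, new_strategy,
                          memory_order_release);
}

/* Hot path: one atomic load plus a table lookup, no ELO call. */
size_t demo_current_threshold(void) {
    int id = atomic_load_explicit(&g_demo_cached_strategy_id,
                                  memory_order_acquire);
    return k_demo_thresholds[id];
}
```

On x86 an acquire load compiles to an ordinary load instruction, so the reader side carries no locked operation; only the infrequent publisher pays for the store.
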
**Tradeoff Analysis**:

| Aspect | Before | After | Change |
|--------|--------|-------|--------|
| Hot-path cost | 15-30 cycles | **5-10 cycles** | **≈ -67%** |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |

**Expected result**:
```
hak_alloc: 96,000 → 70,000 cycles (-27%)
Total: 126,479 → 70,000 cycles (-45%)
```

**Recommendation**: ✅ **IMPLEMENT FIRST** (before Phase 6.11.5)

---

### Phase 6.11.5 (P1): Learning Thread (Full Offload) ⭐ **FUTURE WORK**

**Goal**: Complete learning offload to dedicated thread
**Expected gain**: **20-40 cycles** (additional ~15-30%)
**Implementation time**: **4-6 hours**
**Risk**: **MEDIUM** (thread management, race conditions)

#### Architecture

```
┌─────────────────────────────────────────┐
│ hak_alloc (Hot-path)                    │
│  ┌───────────────────────────────────┐  │
│  │ 1. Read g_cached_strategy_id      │  │ ← Atomic read (~10 cycles)
│  │ 2. Route allocation               │  │
│  │ 3. [Optional] Push event to queue │  │ ← Only if sampling (1/100)
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
                    ↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│ Learning Thread (Background)            │
│  ┌───────────────────────────────────┐  │
│  │ 1. Pop events (batched)           │  │
│  │ 2. Update ELO ratings             │  │
│  │ 3. Update distribution signature  │  │
│  │ 4. Recompute best strategy        │  │
│  │ 5. Update g_cached_strategy_id    │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
```

#### Implementation Details

**1. Event Queue (Custom Ring Buffer)**

```c
// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t type;  // EVENT_ALLOC / EVENT_FREE
    size_t size;
    uint64_t duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;  // Producer index
    _Atomic uint64_t tail;  // Consumer index
} hak_event_queue_t;
```

**Cost**: ~30 cycles (ring buffer write, no CAS loop). Note that with more than one producer thread, claiming a slot still needs at least one atomic read-modify-write on `head`.

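A ring buffer genuinely avoids read-modify-write atomics only in the single-producer, single-consumer case. The sketch below shows that simpler case under stated assumptions: all `demo_` names are hypothetical, the event struct is trimmed, and exactly one thread may call push while one other thread calls pop. An MPSC design, as drawn in the architecture above, would need either one such queue per producer thread or an atomic slot claim.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_QUEUE_SIZE 1024u  /* must be a power of two */

typedef struct {
    uint8_t   type;
    size_t    size;
    uintptr_t site_id;
} demo_event_t;

typedef struct {
    demo_event_t     events[DEMO_QUEUE_SIZE];
    _Atomic uint64_t head;  /* next slot to write (producer only) */
    _Atomic uint64_t tail;  /* next slot to read (consumer only) */
} demo_queue_t;

/* Producer side. Returns false (event dropped) when the queue is full. */
bool demo_event_push(demo_queue_t *q, demo_event_t ev) {
    uint64_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head - tail == DEMO_QUEUE_SIZE) {
        return false;
    }
    q->events[head & (DEMO_QUEUE_SIZE - 1)] = ev;
    /* Release: the event data becomes visible before the new head does. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side. Copies up to `max` events into `out`, returns the count. */
size_t demo_event_pop_batch(demo_queue_t *q, demo_event_t *out, size_t max) {
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint64_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    size_t n = 0;
    while (tail != head && n < max) {
        out[n++] = q->events[tail & (DEMO_QUEUE_SIZE - 1)];
        tail++;
    }
    /* Release: slots are handed back only after they have been copied out. */
    atomic_store_explicit(&q->tail, tail, memory_order_release);
    return n;
}
```

Dropping events when the queue is full is a deliberate choice for this sketch: sampled learning data tolerates loss, and the hot path never blocks.
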
**2. Sampling Strategy**

```c
// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,  // Not measured in hot-path
        .site_id = site_id
    });
}
```

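The plan does not define `fast_random()`. One possible stand-in, shown here as an assumption rather than the project's choice, is a xorshift32 generator with thread-local state, which keeps the sampling decision free of shared memory traffic. The names `demo_fast_random` and `demo_should_sample` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for fast_random(): xorshift32 with per-thread state.
 * The state must never be zero, or the generator stays at zero. */
static _Thread_local uint32_t t_demo_rng_state = 0x9E3779B9u;

uint32_t demo_fast_random(void) {
    uint32_t x = t_demo_rng_state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    t_demo_rng_state = x;
    return x;
}

/* True for roughly 1 in 100 calls. */
bool demo_should_sample(void) {
    return demo_fast_random() % 100u == 0u;
}
```

Because the state is thread-local, every thread starts from the same seed in this sketch; a real implementation would probably mix in a per-thread value so that threads do not sample in lockstep.
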
**3. Background Thread**

```c
void* learning_thread_main(void* arg) {
    (void)arg;
    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);

        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);

        for (int i = 0; i < count; i++) {
            // NOTE: the hot-path code passes a strategy_id as the first argument;
            // hak_event_t carries only site_id, so it would likely need a
            // strategy_id field for this call to be meaningful.
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }

        // Periodic ELO update (every 10 batches)
        if (++g_batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
```

#### Tradeoff Analysis

| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|--------|---------------------|-------------------|--------|
| Hot-path cost | 5-10 cycles | **~10-15 cycles** | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |

**⚠️ CRITICAL DECISION**: Phase 6.11.5 **DOES NOT improve hot-path** over Phase 6.11.4!

**Reason**: Sampling overhead (~5 cycles) cancels out atomic elimination (~5 cycles)

**Recommendation**: ⚠️ **SKIP Phase 6.11.5** unless:
1. Learning accuracy requires higher sampling rate (>1/100)
2. Background analytics needed (real-time dashboard)

---

## 3. Hash Table Optimization (Phase 6.11.6 - P2)

**Current cost**: Site Rules lookup (~10-20 cycles)

### Strategy 1: Perfect Hashing

**Benefit**: O(1) lookup without collisions
**Tradeoff**: Rebuild cost on new site, max 256 sites

**Implementation**:
```c
// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256];  // Direct lookup, no probing
```

**Expected gain**: **5-10 cycles** (~4-8%)

### Strategy 2: Cache-line Alignment

**Current**: 4-probe hash → 4 cache lines (worst case)
**Improvement**: Pack all 4 probe entries into a single 64-byte cache line (4 × 16 bytes)

```c
typedef struct {
    uint64_t site_id;
    RouteType route;
    uint8_t padding[6];  // Pads to 16 bytes only if sizeof(RouteType) <= 2
} __attribute__((aligned(16))) SiteRuleEntry;
```

**Expected gain**: **2-5 cycles** (~2-4%)

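Whether four entries really share one cache line depends on `sizeof(RouteType)`, so it is worth checking at compile time. The sketch below assumes a one-byte route type and a 64-byte cache line; `DemoRouteType`, `DemoSiteRuleEntry`, `DemoSiteBucket`, and `demo_bucket_find` are hypothetical names, and a `site_id` of 0 is indistinguishable from an empty slot here.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the route type fits in one byte; the real RouteType may differ. */
typedef uint8_t DemoRouteType;

typedef struct {
    uint64_t      site_id;
    DemoRouteType route;
    uint8_t       padding[7];  /* 8 + 1 + 7 = 16 bytes */
} DemoSiteRuleEntry;

/* One bucket = 4 probe slots = one 64-byte cache line. */
typedef struct {
    _Alignas(64) DemoSiteRuleEntry slots[4];
} DemoSiteBucket;

_Static_assert(sizeof(DemoSiteRuleEntry) == 16, "entry must be 16 bytes");
_Static_assert(sizeof(DemoSiteBucket) == 64, "bucket must fill one cache line");

/* Probe the 4 slots of one bucket; returns NULL when the site is absent. */
const DemoSiteRuleEntry *demo_bucket_find(const DemoSiteBucket *b,
                                          uint64_t site_id) {
    for (size_t i = 0; i < 4; i++) {
        if (b->slots[i].site_id == site_id) {
            return &b->slots[i];
        }
    }
    return NULL;
}
```

If the real `RouteType` is a 4-byte enum, the static assertion fails and the padding has to shrink accordingly, which is the failure you want to see at build time rather than in a profile.
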
### Recommendation

**Priority**: P2 (after Phase 6.11.4 P0-1/P0-2)
**Expected gain**: **7-15 cycles** (~6-12%)
**Implementation time**: 2-3 hours

---

## 4. Trade-off Analysis

### 4.1 Thread-Safety vs Learning Accuracy

| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|----------|---------------|-------------------|------------|
| **Current** | 126,479 cycles | 100% | Low |
| **P0-1 (Atomic Reduction)** | 96,000 cycles | 100% | Very Low |
| **P0-2 (Cached Strategy)** | 70,000 cycles | 99% | Low |
| **P1 (Learning Thread)** | 70,000-75,000 cycles | 95-99% | Medium |
| **P2 (Hash Opt)** | 60,000 cycles | 99% | Medium |

### 4.2 Implementation Complexity vs Performance Gain

```
                 Performance Gain
                        ↑
                        │
P0-1 ───────────────────┼────────────┐       (30-50 cycles, 30 min)
(Atomic Reduction)      │            │
                        │            │
P0-2 ───────────────────┼──────┐     │       (25-35 cycles, 1-2 hrs)
(Cached Strategy)       │      │     │
                        │      │     │
P2 ─────────────────────┼──────┼─────┼──┐    (7-15 cycles, 2-3 hrs)
(Hash Opt)              │      │     │  │
                        │      │     │  │
P1 ─────────────────────┼──────┼─────┼──┤    (5-10 cycles, 4-6 hrs)
(Learning Thread)       │      │     │  │
                        0──────────────────→ Complexity
                          Low   Med   High
```

**Sweet Spot**: **P0-2 (Cached Strategy)**
- 45% total reduction (126,479 → 70,000 cycles)
- 1-2 hours implementation
- Low complexity, low risk

---

## 5. Recommended Implementation Order

### Week 1: Quick Wins (P0-1 + P0-2)

**Day 1**: Phase 6.11.4 (P0-1) - Atomic reduction
- Time: 30 minutes
- Expected: 126,479 → 96,000 cycles (-24%)

**Day 2**: Phase 6.11.4 (P0-2) - Cached Strategy
- Time: 1-2 hours
- Expected: 96,000 → 70,000 cycles (-27%)
- **Total: -45% reduction** ✅

### Week 2: Medium Gains (P2)

**Day 3-4**: Phase 6.11.6 (P2) - Hash Optimization
- Time: 2-3 hours
- Expected: 70,000 → 60,000 cycles (-14%)
- **Total: -52% reduction** ✅

### Week 3: Evaluation

**Benchmark** all scenarios (json/mir/vm)
- If `hak_alloc` < 50,000 cycles → **STOP** ✅
- If `hak_alloc` > 50,000 cycles → Consider Phase 6.11.5 (P1)

---

## 6. Risk Assessment

| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0-1** | **ZERO** | None (compile-time) | None needed |
| **P0-2** | **LOW** | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| **P1** | **MEDIUM** | Race conditions, thread bugs | Extensive testing, feature flag |
| **P2** | **LOW** | Hash collisions, rebuild cost | Fallback to linear probe |

---

## 7. Expected Final Results

### Pessimistic Scenario (Only P0-1 + P0-2)
```
hak_alloc: 126,479 → 70,000 cycles (-45%)
Overall: 319,021 → 262,542 cycles (-18%)

vm scenario: 15,021 ns → 12,000 ns (-20%)
```

### Optimistic Scenario (P0-1 + P0-2 + P2)
```
hak_alloc: 126,479 → 60,000 cycles (-52%)
Overall: 319,021 → 252,542 cycles (-21%)

vm scenario: 15,021 ns → 11,500 ns (-23%)
```

### Stretch Goal (All Phases)
```
hak_alloc: 126,479 → 50,000 cycles (-60%)
Overall: 319,021 → 242,542 cycles (-24%)

vm scenario: 15,021 ns → 11,000 ns (-27%)
```

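The "Overall" figures in the three scenarios follow from subtracting the `hak_alloc` saving from the baseline total, assuming nothing else in the profile changes. A small sketch of that projection, with hypothetical helper names `projected_total` and `reduction_pct`; for example, `projected_total(70000)` yields 262,542 and `reduction_pct(126479, 70000)` is about 44.7, which the scenarios round to 45%.

```c
#include <stdint.h>

#define BASELINE_TOTAL_CYCLES     319021  /* "Overall" baseline from the scenarios */
#define BASELINE_HAK_ALLOC_CYCLES 126479  /* hak_alloc share of that total */

/* Overall cycles if only hak_alloc changes and everything else stays flat. */
int64_t projected_total(int64_t hak_alloc_after) {
    return BASELINE_TOTAL_CYCLES - (BASELINE_HAK_ALLOC_CYCLES - hak_alloc_after);
}

/* Reduction relative to `before`, in percent. */
double reduction_pct(int64_t before, int64_t after) {
    return 100.0 * (double)(before - after) / (double)before;
}
```
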
---

## 8. Conclusion

### ✅ Recommended Path: **Staged Optimization** (P0-1 → P0-2 → P2)

**Rationale**:
1. **P0-1** is free (compile-time guard) → Immediate -24%
2. **P0-2** is high-ROI (1-2 hrs) → Additional -27%
3. **P1 (Learning Thread) is NOT worth it** (complexity vs gain)
4. **P2** is optional polish → Additional -14%

**Final Target**: **70,000 cycles** (45% reduction from baseline)

**Timeline**:
- Week 1: P0-1 + P0-2 (2-3 hours total)
- Week 2: P2 (optional, 2-3 hours)
- Week 3: Benchmark & validate

**Success Criteria**:
- `hak_alloc` < 75,000 cycles (40% reduction) → **Minimum Success**
- `hak_alloc` < 60,000 cycles (52% reduction) → **Target Success** ✅
- `hak_alloc` < 50,000 cycles (60% reduction) → **Stretch Goal** 🎉

---

## Next Steps

1. **Implement P0-1** (30 min)
2. **Measure baseline** (10 min)
3. **Implement P0-2** (1-2 hrs)
4. **Measure improvement** (10 min)
5. **Decide on P2** based on results

**Total time investment**: 2-3 hours for **45% reduction** ← **Excellent ROI!**