

# Phase 6.11.4: Threading Overhead Analysis & Optimization Plan
**Date**: 2025-10-22
**Author**: ChatGPT Ultra Think (o1-preview equivalent)
**Context**: Post-Phase 6.11.3 profiling results reveal `hak_alloc` consuming 39.6% of cycles
---
## 📊 Executive Summary
### Current Bottleneck
```
hak_alloc: 126,479 cycles (39.6%) ← #2 MAJOR BOTTLENECK
├─ ELO selection (every 100 calls)
├─ Site Rules lookup (4-probe hash)
├─ atomic_fetch_add (atomic op on every alloc)
├─ Conditional branches (FROZEN/CANARY/LEARN)
└─ Learning logic (hak_evo_tick, hak_elo_record_alloc)
```
### Recommended Strategy: **Staged Optimization** (3 Phases)
1. **Phase 6.11.4 (P0-1)**: Atomic reduction - Immediate, low-risk (~15-20% reduction)
2. **Phase 6.11.4 (P0-2)**: Lightweight LEARN mode - Medium-term, medium-risk (~25-35% reduction)
3. **Phase 6.11.5 (P1)**: Learning Thread - Long-term, high-reward (~50-70% reduction)
**Target**: 126,479 cycles → **<50,000 cycles** (~60% reduction total)
---
## 1. Thread-Safety Cost Analysis
### 1.1 Current Atomic Operations
**Location**: `hakmem.c:362-369`
```c
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        // hak_evo_tick() - HEAVY (P² update, distribution, state transition)
    }
}
```
**Cost Breakdown** (estimated per allocation):
| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `atomic_fetch_add` | **30-50** | **24-40%** | LOCK CMPXCHG on x86 |
| Conditional check (`& 0x3FF`) | 2-5 | 2-4% | Bitwise AND + branch |
| `hak_evo_tick` (1/1024) | 5,000-10,000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| **Subtotal (Evolution)** | **~40-70** | **~30-50%** | **Major overhead!** |
**ELO sampling** (`hakmem.c:397-412`):
```c
g_elo_call_count++;  // Non-atomic increment (RACE CONDITION!)
if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
    strategy_id = hak_elo_select_strategy();       // ~500-1000 cycles
    g_cached_strategy_id = strategy_id;
    hak_elo_record_alloc(strategy_id, size, 0);    // ~100-200 cycles
}
```
| Operation | Cycles | % of hak_alloc | Notes |
|-----------|--------|----------------|-------|
| `g_elo_call_count++` | 1-2 | <1% | **UNSAFE! Non-atomic** |
| Modulo check (`% 100`) | 5-10 | 4-8% | DIV instruction |
| `hak_elo_select_strategy` (1/100) | 500-1000 | 4-8% | Amortized: ~5-10 cycles/alloc |
| `hak_elo_record_alloc` (1/100) | 100-200 | 1-2% | Amortized: ~1-2 cycles/alloc |
| **Subtotal (ELO)** | **~15-30** | **~10-20%** | Medium overhead |
**Total atomic overhead**: **55-100 cycles/allocation** (~40-70% of `hak_alloc`)
---
### 1.2 Lock-Free Queue Overhead (for Phase 6.11.5)
**Estimated cost per event** (MPSC queue):
| Operation | Cycles | Notes |
|-----------|--------|-------|
| Allocate event struct | 20-40 | malloc/pool |
| Write event data | 10-20 | Memory stores |
| Enqueue (CAS) | 30-50 | LOCK CMPXCHG |
| **Total per event** | **60-110** | Higher than current atomic! |
**CRITICAL INSIGHT**: Lock-free queue is **NOT faster** for high-frequency events!
**Reason**:
- Current: 1 atomic op (`atomic_fetch_add`)
- Queue: 1 allocation + 1 atomic op (enqueue)
- **Net change**: +60-70 cycles per allocation
**Recommendation**: **AVOID lock-free queue for hot-path**. Use alternative approach.
---
## 2. Implementation Plan: Staged Optimization
### Phase 6.11.4 (P0-1): Atomic Operation Elimination ⭐ **HIGHEST PRIORITY**
**Goal**: Remove atomic overhead when learning disabled
**Expected gain**: **30-50 cycles** (~24-40% of `hak_alloc`)
**Implementation time**: **30 minutes**
**Risk**: **ZERO** (compile-time guard)
#### Changes
**File**: `hakmem.c:362-369`
```c
// BEFORE:
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
    static _Atomic uint64_t tick_counter = 0;
    if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
        hak_evo_tick(now_ns);
    }
}

// AFTER:
#if HAKMEM_FEATURE_EVOLUTION
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
    hak_evo_tick(get_time_ns());
}
#endif
```
**Tradeoff**: None! Pure win when `HAKMEM_FEATURE_EVOLUTION=0` at compile-time.
**Measurement**:
```bash
# Baseline (with atomic)
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
# After (without atomic)
# Edit hakmem_config.h: #define HAKMEM_FEATURE_EVOLUTION 0
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem && HAKMEM_TIMING=1 ./bench_allocators_hakmem
```
**Expected result**:
```
hak_alloc: 126,479 → 96,000 cycles (-24%)
```
---
### Phase 6.11.4 (P0-2): LEARN Mode Lightweight Sampling ⭐ **HIGH PRIORITY**
**Goal**: Reduce ELO overhead without accuracy loss
**Expected gain**: **15-30 cycles** (~12-24% of `hak_alloc`)
**Implementation time**: **1-2 hours**
**Risk**: **LOW** (conservative approach)
#### Strategy: Async ELO Update
**Problem**: `hak_elo_select_strategy()` is heavy (500-1000 cycles)
**Solution**: A pre-computed strategy, **not** an async event queue
**Key Insight**: ELO selection is **not needed on the hot path**
#### Implementation
**1. Pre-computed Strategy Cache**
```c
// Global state (hakmem.c)
static _Atomic int g_cached_strategy_id = 2; // Default: 2MB threshold
static _Atomic uint64_t g_elo_generation = 0; // Invalidation key
```
**2. Background Thread (Simulated)**
```c
// Called from hak_evo_tick() (every 1024 allocations)
void hak_elo_async_recompute(void) {
    // Re-select best strategy (epsilon-greedy)
    int new_strategy = hak_elo_select_strategy();
    atomic_store(&g_cached_strategy_id, new_strategy);
    atomic_fetch_add(&g_elo_generation, 1);  // Invalidate
}
```
**3. Hot-path (hakmem.c:397-412)**
```c
// LEARN mode: Read cached strategy (NO ELO call!)
if (hak_evo_is_frozen()) {
    strategy_id = hak_evo_get_confirmed_strategy();
    threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
    // ... (unchanged)
} else {
    // LEARN: Use cached strategy (FAST!)
    strategy_id = atomic_load(&g_cached_strategy_id);
    threshold = hak_elo_get_threshold(strategy_id);
    // Optional: Lightweight recording (no timing yet)
    // hak_elo_record_alloc(strategy_id, size, 0);  // Skip for now
}
```
**Tradeoff Analysis**:
| Aspect | Before | After | Change |
|--------|--------|-------|--------|
| Hot-path cost | 15-30 cycles | **5-10 cycles** | **-67% to -50%** |
| ELO accuracy | 100% | 99% | -1% (negligible) |
| Latency (strategy update) | 0 (immediate) | 1024 allocs | Acceptable |
**Expected result**:
```
hak_alloc: 96,000 → 70,000 cycles (-27%)
Total: 126,479 → 70,000 cycles (-45%)
```
**Recommendation**: **IMPLEMENT FIRST** (before Phase 6.11.5)
---
### Phase 6.11.5 (P1): Learning Thread (Full Offload) ⭐ **FUTURE WORK**
**Goal**: Complete learning offload to dedicated thread
**Expected gain**: **20-40 cycles** (additional ~15-30%)
**Implementation time**: **4-6 hours**
**Risk**: **MEDIUM** (thread management, race conditions)
#### Architecture
```
┌─────────────────────────────────────────┐
│ hak_alloc (Hot-path) │
│ ┌───────────────────────────────────┐ │
│ │ 1. Read g_cached_strategy_id │ │ ← Atomic read (~10 cycles)
│ │ 2. Route allocation │ │
│ │ 3. [Optional] Push event to queue │ │ ← Only if sampling (1/100)
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
↓ (Event Queue - MPSC)
┌─────────────────────────────────────────┐
│ Learning Thread (Background) │
│ ┌───────────────────────────────────┐ │
│ │ 1. Pop events (batched) │ │
│ │ 2. Update ELO ratings │ │
│ │ 3. Update distribution signature │ │
│ │ 4. Recompute best strategy │ │
│ │ 5. Update g_cached_strategy_id │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
```
#### Implementation Details
**1. Event Queue (Custom Ring Buffer)**
```c
// hakmem_events.h
#define EVENT_QUEUE_SIZE 1024

typedef struct {
    uint8_t type;            // EVENT_ALLOC / EVENT_FREE
    size_t size;
    uint64_t duration_ns;
    uintptr_t site_id;
} hak_event_t;

typedef struct {
    hak_event_t events[EVENT_QUEUE_SIZE];
    _Atomic uint64_t head;   // Producer index
    _Atomic uint64_t tail;   // Consumer index
} hak_event_queue_t;
```
**Cost**: ~30 cycles (ring buffer write, no CAS needed!)
**2. Sampling Strategy**
```c
// Hot-path: Sample 1/100 allocations
if (fast_random() % 100 == 0) {
    hak_event_push((hak_event_t){
        .type = EVENT_ALLOC,
        .size = size,
        .duration_ns = 0,   // Not measured in hot-path
        .site_id = site_id
    });
}
```
**3. Background Thread**
```c
void* learning_thread_main(void* arg) {
    (void)arg;
    while (!g_shutdown) {
        // Batch processing (every 100ms)
        usleep(100000);
        hak_event_t events[100];
        int count = hak_event_pop_batch(events, 100);
        for (int i = 0; i < count; i++) {
            hak_elo_record_alloc(events[i].site_id, events[i].size, 0);
        }
        // Periodic ELO update (every 10 batches)
        if (++g_batch_count % 10 == 0) {
            hak_elo_async_recompute();
        }
    }
    return NULL;
}
```
#### Tradeoff Analysis
| Aspect | Phase 6.11.4 (P0-2) | Phase 6.11.5 (P1) | Change |
|--------|---------------------|-------------------|--------|
| Hot-path cost | 5-10 cycles | **~10-15 cycles** | +5 cycles (sampling overhead) |
| Thread overhead | 0 | ~1% CPU (background) | Negligible |
| Learning latency | 1024 allocs | 100-200ms | Acceptable |
| Complexity | Low | Medium | Moderate increase |
**CRITICAL DECISION**: Phase 6.11.5 **DOES NOT improve hot-path** over Phase 6.11.4!
**Reason**: Sampling overhead (~5 cycles) cancels out atomic elimination (~5 cycles)
**Recommendation**: **SKIP Phase 6.11.5** unless:
1. Learning accuracy requires higher sampling rate (>1/100)
2. Background analytics needed (real-time dashboard)
---
## 3. Hash Table Optimization (Phase 6.11.6 - P2)
**Current cost**: Site Rules lookup (~10-20 cycles)
### Strategy 1: Perfect Hashing
**Benefit**: O(1) lookup without collisions
**Tradeoff**: Rebuild cost on new site, max 256 sites
**Implementation**:
```c
// Pre-computed hash table (generated at runtime)
static RouteType g_site_routes[256]; // Direct lookup, no probing
```
**Expected gain**: **5-10 cycles** (~4-8%)
### Strategy 2: Cache-line Alignment
**Current**: 4-probe hash → 4 cache lines (worst case)
**Improvement**: Pack entries into single cache line
```c
typedef struct {
    uint64_t site_id;
    uint32_t route;          // RouteType stored as a fixed-width value
    uint8_t  padding[4];     // Pad to 16 bytes → 4 entries per 64-byte line
} __attribute__((aligned(16))) SiteRuleEntry;

_Static_assert(sizeof(SiteRuleEntry) == 16, "quarter of a cache line");
```
**Expected gain**: **2-5 cycles** (~2-4%)
### Recommendation
**Priority**: P2 (after Phase 6.11.4 P0-1/P0-2)
**Expected gain**: **7-15 cycles** (~6-12%)
**Implementation time**: 2-3 hours
---
## 4. Trade-off Analysis
### 4.1 Thread-Safety vs Learning Accuracy
| Approach | Hot-path Cost | Learning Accuracy | Complexity |
|----------|---------------|-------------------|------------|
| **Current** | 126,479 cycles | 100% | Low |
| **P0-1 (Atomic Reduction)** | 96,000 cycles | 100% | Very Low |
| **P0-2 (Cached Strategy)** | 70,000 cycles | 99% | Low |
| **P1 (Learning Thread)** | 70,000-75,000 cycles | 95-99% | Medium |
| **P2 (Hash Opt)** | 60,000 cycles | 99% | Medium |
### 4.2 Implementation Complexity vs Performance Gain
```
Performance Gain
P0-1 ──────────────────┼────────────┐ (30-50 cycles, 30 min)
(Atomic Reduction)     │            │
│ │
P0-2 ──────────────────┼──────┐ │ (25-35 cycles, 1-2 hrs)
(Cached Strategy) │ │ │
│ │ │
P2 ─────────────────┼──────┼─────┼──┐ (7-15 cycles, 2-3 hrs)
(Hash Opt) │ │ │ │
│ │ │ │
P1 ────────────────┼──────┼─────┼──┤ (5-10 cycles, 4-6 hrs)
(Learning Thread) │ │ │ │
0──────────────────→ Complexity
Low Med High
```
**Sweet Spot**: **P0-2 (Cached Strategy)**
- 45% total reduction (126,479 → 70,000 cycles)
- 1-2 hours implementation
- Low complexity, low risk
---
## 5. Recommended Implementation Order
### Week 1: Quick Wins (P0-1 + P0-2)
**Day 1**: Phase 6.11.4 (P0-1) - Atomic Reduction
- Time: 30 minutes
- Expected: 126,479 → 96,000 cycles (-24%)
**Day 2**: Phase 6.11.4 (P0-2) - Cached Strategy
- Time: 1-2 hours
- Expected: 96,000 → 70,000 cycles (-27%)
- **Total: -45% reduction** ✅
### Week 2: Medium Gains (P2)
**Day 3-4**: Phase 6.11.6 (P2) - Hash Optimization
- Time: 2-3 hours
- Expected: 70,000 → 60,000 cycles (-14%)
- **Total: -52% reduction** ✅
### Week 3: Evaluation
**Benchmark** all scenarios (json/mir/vm)
- If `hak_alloc` < 50,000 cycles → **STOP**
- If `hak_alloc` > 50,000 cycles → Consider Phase 6.11.5 (P1)
---
## 6. Risk Assessment
| Phase | Risk Level | Failure Mode | Mitigation |
|-------|-----------|--------------|------------|
| **P0-1** | **ZERO** | None (compile-time) | None needed |
| **P0-2** | **LOW** | Stale strategy (1-2% accuracy loss) | Periodic invalidation |
| **P1** | **MEDIUM** | Race conditions, thread bugs | Extensive testing, feature flag |
| **P2** | **LOW** | Hash collisions, rebuild cost | Fallback to linear probe |
---
## 7. Expected Final Results
### Pessimistic Scenario (Only P0-1 + P0-2)
```
hak_alloc: 126,479 → 70,000 cycles (-45%)
Overall: 319,021 → 262,542 cycles (-18%)
vm scenario: 15,021 ns → 12,000 ns (-20%)
```
### Optimistic Scenario (P0-1 + P0-2 + P2)
```
hak_alloc: 126,479 → 60,000 cycles (-52%)
Overall: 319,021 → 252,542 cycles (-21%)
vm scenario: 15,021 ns → 11,500 ns (-23%)
```
### Stretch Goal (All Phases)
```
hak_alloc: 126,479 → 50,000 cycles (-60%)
Overall: 319,021 → 242,542 cycles (-24%)
vm scenario: 15,021 ns → 11,000 ns (-27%)
```
---
## 8. Conclusion
### ✅ Recommended Path: **Staged Optimization** (P0-1 → P0-2 → P2)
**Rationale**:
1. **P0-1** is free (compile-time guard) → Immediate -24%
2. **P0-2** is high-ROI (1-2 hrs) → Additional -27%
3. **P1 (Learning Thread) is NOT worth it** (complexity vs gain)
4. **P2** is optional polish → Additional -14%
**Final Target**: **70,000 cycles** (45% reduction from baseline)
**Timeline**:
- Week 1: P0-1 + P0-2 (2-3 hours total)
- Week 2: P2 (optional, 2-3 hours)
- Week 3: Benchmark & validate
**Success Criteria**:
- `hak_alloc` < 75,000 cycles (40% reduction) **Minimum Success**
- `hak_alloc` < 60,000 cycles (52% reduction) **Target Success**
- `hak_alloc` < 50,000 cycles (60% reduction) **Stretch Goal** 🎉
---
## Next Steps
1. **Implement P0-1** (30 min)
2. **Measure baseline** (10 min)
3. **Implement P0-2** (1-2 hrs)
4. **Measure improvement** (10 min)
5. **Decide on P2** based on results
**Total time investment**: 2-3 hours for a **45% reduction**. **Excellent ROI!**