# ACE Learning Layer - Implementation Plan
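**Goal**: Use learning to eliminate the allocator's weak spots (fragmentation, huge working sets (WS), and realloc) so that it is strong across the board
## Priorities (impact × ease of implementation)
### 🔥 Phase 1: Minimal implementation (1 day), maximum impact
**Target**: Fragmentation stress 3.87 → 10-20 M ops/s
#### 1.1 Metrics collection infrastructure (2-3 hours)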
```c
// core/hakmem_ace_metrics.h
#include <stdint.h>

typedef struct ACEMetrics {
    // Fast metrics (1Hz collection)
    uint64_t throughput_ops;          // ops/sec
    uint64_t p50_lat_ns;              // median latency
    uint64_t p90_lat_ns;              // p90 latency
    double   llc_miss_rate;           // LLC misses / loads
    uint64_t mutex_wait_ns;           // mutex contention time
    uint64_t remote_free_backlog[8];  // per size-class

    // Slow metrics (30-60s collection)
    uint64_t rss_mb;                  // resident set size
    double   fragmentation_ratio;     // idle share of held memory: 1 - allocated / held
                                      // (higher = more fragmented, as the slow loop assumes)
    uint32_t partial_empty_slabs;     // partially empty slabs
    uint32_t arena_contention[4];     // per-arena contention
} ACEMetrics;

// Lightweight collection (no syscalls in fast path)
void ace_metrics_tick_1hz(ACEMetrics* m);
void ace_metrics_tick_60s(ACEMetrics* m);
```
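**Implementation**:
- `throughput_ops`: aggregated from the existing TLS counters
- `llc_miss_rate`: read via `rdpmc` (lightweight)
- `mutex_wait_ns`: taken from the existing lock timing statistics
- `remote_free_backlog`: taken from the existing Tiny/Mid MT counters
- `fragmentation_ratio`: computed from Superslab/Mid MT utilization
- `rss_mb`: read from `/proc/self/statm`, only once every 60 seconds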
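
As a rough illustration of the slow-path `rss_mb` collection listed above, the sketch below parses one line of `/proc/self/statm` and converts the resident page count to MiB. The helper names `ace_parse_statm_rss_mb` and `ace_read_rss_mb` are hypothetical and are not existing hakmem functions. The code relies on the Linux convention that the second field of `statm` is the resident set size in pages.

```c
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: parse a statm line ("size resident shared ...")
 * and return the resident set size in MiB, or 0 if parsing fails. */
uint64_t ace_parse_statm_rss_mb(const char* line, uint64_t page_size) {
    unsigned long long size_pages = 0, resident_pages = 0;
    if (sscanf(line, "%llu %llu", &size_pages, &resident_pages) != 2) {
        return 0;
    }
    return ((uint64_t)resident_pages * page_size) >> 20;  /* bytes -> MiB */
}

/* Hypothetical helper: read the RSS of the current process (Linux only).
 * Meant for the 60 s slow tick, never for the allocation fast path. */
uint64_t ace_read_rss_mb(void) {
    char buf[128];
    uint64_t mb = 0;
    FILE* f = fopen("/proc/self/statm", "r");
    if (!f) return 0;
    if (fgets(buf, sizeof buf, f)) {
        long page = sysconf(_SC_PAGESIZE);
        if (page > 0) mb = ace_parse_statm_rss_mb(buf, (uint64_t)page);
    }
    fclose(f);
    return mb;
}
```

Keeping the parser separate from the file read makes the conversion testable without touching `/proc`.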
#### 1.2 Fast loop skeleton (2-3 hours)
```c
// core/hakmem_ace_controller.c
typedef struct ACEKnobs {
    // TLS/Magazine capacity (per class)
    uint16_t tls_cap[8];             // [4, 8, 16, 32, 64]
    // Bundle width (central freelist transfer)
    uint16_t bundle_width;           // [16, 32, 64, 128]
    // Remote free drain threshold
    uint16_t drain_threshold[8];     // [32, 64, 128, 256]
    // Partial release threshold
    uint8_t  partial_release_pages;  // [1, 2, 4, 8]
    // Diet factor (applied when a huge working set is detected)
    float    diet_factor;            // [0.5, 0.66, 0.75, 0.9]
} ACEKnobs;

// Fast loop (every 0.5-1s)
void ace_fast_loop_tick(ACEMetrics* m, ACEKnobs* k) {
    // Compute reward
    double reward =
          1.0 * normalize(m->throughput_ops)
        - 0.3 * m->llc_miss_rate
        - 0.2 * normalize(m->mutex_wait_ns)
        - 0.2 * normalize_backlog(m->remote_free_backlog);

    // 1) Remote backlog ↑ → drain threshold ↓
    if (max_backlog(m->remote_free_backlog) > 128) {
        lower_drain_threshold(k);
    }

    // 2) LLC miss ↑ → TLS capacity ↓ (diet)
    if (m->llc_miss_rate > 0.15) {
        apply_diet(k, k->diet_factor);
    }

    // 3) Mutex wait ↑ → bundle width ↑
    if (m->mutex_wait_ns > threshold) {
        increase_bundle_width(k);
    }

    // Update learning (UCB1 or simple hill-climb)
    ucb1_update_knobs(reward, k);
}
```
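The reward expression relies on `normalize()`, which this plan does not define. One possible definition, purely an assumption, is to scale each raw metric by a reference value (for example a baseline measured with ACE off) and clamp the result to [0, 1]. The names `ace_normalize` and `ace_reward` are hypothetical; only the weights come from the fast loop above.

```c
#include <stdint.h>

/* Hypothetical: scale a raw metric by a reference value, clamped to [0, 1]. */
double ace_normalize(uint64_t value, uint64_t reference) {
    if (reference == 0) return 0.0;
    double x = (double)value / (double)reference;
    return x > 1.0 ? 1.0 : x;
}

/* Reward using the weights from the fast loop; inputs are expected in [0, 1]. */
double ace_reward(double tput, double llc_miss, double mutex_wait, double backlog) {
    return 1.0 * tput - 0.3 * llc_miss - 0.2 * mutex_wait - 0.2 * backlog;
}
```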
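**Implementation priority**:
1. **TLS capacity tuning** (most effective; make the existing `TINY_TLS_MAG_CAP` dynamic)
2. **Drain threshold tuning** (make the existing `POOL_REMOTE_DRAIN_THRESHOLD` dynamic)
3. ⚠️ **Bundle width tuning** (requires new implementation; can be deferred)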
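
Turning a compile-time constant such as `TINY_TLS_MAG_CAP` into a dynamic knob could look roughly like the sketch below: the controller publishes a per-class capacity through a relaxed atomic, and the allocation path reads it with a single load. The array and accessor names are hypothetical, the default of 16 is illustrative, and the real layout of the magazine code is not shown in this plan.

```c
#include <stdatomic.h>
#include <stdint.h>

#define ACE_NUM_CLASSES 8    /* matches tls_cap[8] in ACEKnobs */
#define ACE_TLS_CAP_MIN 4
#define ACE_TLS_CAP_MAX 64

/* Hypothetical global: per-class capacity, written by the ACE controller. */
static _Atomic uint16_t g_ace_tls_cap[ACE_NUM_CLASSES] = {
    16, 16, 16, 16, 16, 16, 16, 16   /* illustrative defaults */
};

/* Controller side: clamp to [4, 64] and publish the new capacity. */
void ace_set_tls_cap(int cls, uint16_t cap) {
    if (cls < 0 || cls >= ACE_NUM_CLASSES) return;
    if (cap < ACE_TLS_CAP_MIN) cap = ACE_TLS_CAP_MIN;
    if (cap > ACE_TLS_CAP_MAX) cap = ACE_TLS_CAP_MAX;
    atomic_store_explicit(&g_ace_tls_cap[cls], cap, memory_order_relaxed);
}

/* Allocator side: one relaxed load replaces the former constant.
 * The caller is expected to pass a valid class index. */
uint16_t ace_get_tls_cap(int cls) {
    return atomic_load_explicit(&g_ace_tls_cap[cls], memory_order_relaxed);
}
```

Relaxed ordering is assumed to be enough here because a thread that briefly sees a stale capacity only refills or flushes slightly more or less than intended.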
#### 1.3 Learning algorithm (1-2 hours)
```c
// Simple UCB1 for discrete knobs
#include <math.h>    // sqrt, log, INFINITY
#include <stdint.h>

typedef struct UCB1Arm {
    uint32_t pulls;
    double   total_reward;
} UCB1Arm;

typedef struct UCB1Bandit {
    UCB1Arm  arms[5];  // 5 candidates per knob
    uint32_t total_pulls;
} UCB1Bandit;

// Select next knob value
int ucb1_select_arm(UCB1Bandit* b) {
    double best_ucb = -INFINITY;
    int best_arm = 0;
    for (int i = 0; i < 5; i++) {
        // Try every arm once before trusting the averages
        // (avoids dividing by a near-zero pull count)
        if (b->arms[i].pulls == 0) return i;
        double avg = b->arms[i].total_reward / b->arms[i].pulls;
        double ucb = avg + sqrt(2 * log(b->total_pulls + 1.0) / b->arms[i].pulls);
        if (ucb > best_ucb) {
            best_ucb = ucb;
            best_arm = i;
        }
    }
    return best_arm;
}
```
**Candidate values**:
```c
static const uint16_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64};
static const uint16_t DRAIN_THRESHOLD_CANDIDATES[] = {32, 64, 128, 256, 512};
static const float DIET_FACTOR_CANDIDATES[] = {0.5, 0.66, 0.75, 0.9, 1.0};
```
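The fast loop mentions a simple hill-climb as an alternative to UCB1. A minimal sketch of that alternative over the TLS capacity candidates follows; it keeps moving in the same direction while the reward does not drop and reverses when it does. The `HillClimber` type and `hill_climb_step` are hypothetical names, and the sketch assumes the reward passed in was measured with the currently selected candidate.

```c
#include <stdint.h>

#define ACE_NUM_CANDIDATES 5
static const uint16_t ACE_TLS_CAP_TABLE[ACE_NUM_CANDIDATES] = {4, 8, 16, 32, 64};

typedef struct {
    int    index;        /* current position in the candidate table */
    int    direction;    /* +1 or -1 */
    double last_reward;  /* reward observed at the previous step */
    int    has_last;     /* 0 until the first reward is recorded */
} HillClimber;

/* Feed the reward measured with the current candidate and move one step.
 * Returns the candidate value to apply next. */
uint16_t hill_climb_step(HillClimber* h, double reward) {
    /* Reward got worse: the last move was a mistake, so turn around. */
    if (h->has_last && reward < h->last_reward) {
        h->direction = -h->direction;
    }
    h->last_reward = reward;
    h->has_last = 1;

    int next = h->index + h->direction;
    if (next < 0) {                      /* bounce off the lower edge */
        next = 0;
        h->direction = +1;
    }
    if (next >= ACE_NUM_CANDIDATES) {    /* bounce off the upper edge */
        next = ACE_NUM_CANDIDATES - 1;
        h->direction = -1;
    }
    h->index = next;
    return ACE_TLS_CAP_TABLE[next];
}
```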
---
### 🚀 Phase 2: Fragmentation countermeasures (half a day)
**Target**: Fragmentation stress 3.87 → 10-20 M ops/s
#### 2.1 Slow loop (2-3 hours)
```c
// Slow loop (every 30-60s)
void ace_slow_loop_tick(ACEMetrics* m, ACEKnobs* k) {
    // Fragmentation ↑ or RSS ↑ → partial release
    if (m->fragmentation_ratio > 0.6 || m->rss_mb > threshold) {
        // Lower threshold to trigger more scavenging
        if (k->partial_release_pages > 1) {
            k->partial_release_pages--;
        }
        // Run budgeted scavenge (max 5ms)
        run_partial_scavenge_budgeted(5);
    }
    // RSS stable & low miss → restore thresholds
    else if (m->llc_miss_rate < 0.10 && m->fragmentation_ratio < 0.4) {
        // Raise threshold (hysteresis)
        if (k->partial_release_pages < 8) {
            k->partial_release_pages++;
        }
    }
}
```
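The thresholds in the slow loop (release above 0.6, restore below 0.4) only make sense if `fragmentation_ratio` grows as more held memory sits idle. Under that assumption, the ratio and the hysteresis decision could be sketched as follows. The function and enum names are hypothetical; the 0.6, 0.4 and 0.10 constants are the ones used in the slow loop.

```c
#include <stdint.h>

/* Hypothetical helper: share of held memory that is not in live use.
 * 0.0 = every held byte is allocated, values near 1.0 = almost all idle. */
double ace_fragmentation_ratio(uint64_t allocated_bytes, uint64_t held_bytes) {
    if (held_bytes == 0 || allocated_bytes >= held_bytes) return 0.0;
    return (double)(held_bytes - allocated_bytes) / (double)held_bytes;
}

typedef enum {
    ACE_SLOW_HOLD = 0,     /* inside the hysteresis band: change nothing */
    ACE_SLOW_RELEASE = 1,  /* fragmented: scavenge more aggressively */
    ACE_SLOW_RESTORE = 2   /* healthy: relax the release threshold */
} AceSlowAction;

/* Decision with a dead band between 0.4 and 0.6 to avoid flapping. */
AceSlowAction ace_slow_decide(double frag_ratio, double llc_miss_rate) {
    if (frag_ratio > 0.6) return ACE_SLOW_RELEASE;
    if (llc_miss_rate < 0.10 && frag_ratio < 0.4) return ACE_SLOW_RESTORE;
    return ACE_SLOW_HOLD;
}
```

The gap between the two thresholds is what keeps `partial_release_pages` from oscillating when the ratio hovers around a single cut-off.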
#### 2.2 Budgeted scavenge (2-3 hours)
```c
// Partial release with time budget
void run_partial_scavenge_budgeted(uint32_t max_ms) {
    uint64_t start_ns = get_ns();
    uint64_t budget_ns = max_ms * 1000000ULL;

    // Iterate slabs/superslabs
    // (pseudocode: the actual iterator depends on the slab registry)
    for (each slab/superslab) {
        // Check consecutive empty pages
        uint32_t empty_pages = count_consecutive_empty_pages(slab);
        if (empty_pages >= g_ace_knobs.partial_release_pages) {
            // madvise DONTNEED or munmap
            partial_release_pages(slab, empty_pages);
        }

        // Check budget
        if (get_ns() - start_ns > budget_ns) {
            break;  // Stop, resume next cycle
        }
    }
}
```
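**Implementation notes**:
- Superslab: detect empty pages from the existing bitmap
- Mid MT: per-segment utilization tracking is required (add a new field)
- Return physical memory with `madvise(MADV_DONTNEED)`, which is lighter than `munmap`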
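
To make the `madvise` point concrete, here is a minimal Linux-oriented sketch of releasing a run of pages inside a page-aligned mapping. The helper name `ace_release_pages` is hypothetical; `madvise` and `MADV_DONTNEED` are the standard calls. The virtual range stays mapped, and for private anonymous memory on Linux the released pages read back as zero on the next touch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes madvise/MAP_ANONYMOUS under strict C modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: give `n_pages` pages, starting at page index
 * `first_page` of the page-aligned mapping `base`, back to the OS.
 * Returns 0 on success, -1 on error (errno is set by madvise). */
int ace_release_pages(void* base, size_t first_page, size_t n_pages) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || n_pages == 0) return -1;
    char* start = (char*)base + first_page * (size_t)page;
    return madvise(start, n_pages * (size_t)page, MADV_DONTNEED);
}
```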
---
### 🎯 Phase 3: Huge WS countermeasures (half a day)
**Target**: Huge WS 22 → 30-45 M ops/s
#### 3.1 Automatic diet (make the already-implemented knob dynamic)
```c
void apply_diet(ACEKnobs* k, float factor) {
    for (int i = 0; i < 8; i++) {
        // Hot classes (0-2) diet more aggressively
        float class_factor = (i < 3) ? factor : (factor + 1.0f) / 2.0f;
        uint16_t new_cap = (uint16_t)(k->tls_cap[i] * class_factor);
        k->tls_cap[i] = MAX(4, new_cap);  // Min capacity = 4
    }
}

void restore_diet(ACEKnobs* k) {
    for (int i = 0; i < 8; i++) {
        // Gradual restore (10% per cycle, but at least +1: with integer
        // truncation 4 * 1.1 stays at 4, so small caps would never recover)
        uint16_t grown = (uint16_t)(k->tls_cap[i] * 1.1f);
        if (grown <= k->tls_cap[i]) grown = k->tls_cap[i] + 1;
        k->tls_cap[i] = MIN(64, grown);
    }
}
```
#### 3.2 LLC miss monitoring
```c
// Using rdpmc (lightweight)
double read_llc_miss_rate(void) {
    static uint64_t prev_misses = 0;
    static uint64_t prev_refs = 0;

    uint64_t misses = rdpmc(LLC_MISS_EVENT);
    uint64_t refs = rdpmc(LLC_REF_EVENT);

    // +1 avoids division by zero when no references were counted
    double rate = (double)(misses - prev_misses) / (refs - prev_refs + 1);

    prev_misses = misses;
    prev_refs = refs;
    return rate;
}
```
---
### 🔧 Phase 4: realloc optimization (1 day)
**Target**: Reduce the 6.6-277 ns latency spread by 1.3-2x
#### 4.1 In-place extension
```c
void* hak_realloc_inplace(void* ptr, size_t old_size, size_t new_size) {
    // Shrinking (or same size) needs no extra space; this also keeps
    // new_size - old_size below from underflowing
    if (new_size <= old_size) return ptr;

    // Check adjacent free space
    void* next_block = (char*)ptr + old_size;
    if (is_free(next_block)) {
        size_t available = get_free_size(next_block);
        size_t needed = new_size - old_size;
        if (available >= needed) {
            // Merge and extend in-place
            merge_free_block(ptr, next_block);
            return ptr;  // No memcpy!
        }
    }
    return NULL;  // Fall back to copy
}
```
#### 4.2 Copy optimization
```c
void* hak_realloc_copy(void* ptr, size_t old_size, size_t new_size) {
    void* new_ptr = hak_malloc(new_size);
    if (!new_ptr) return NULL;  // Allocation failed: leave the old block intact

    // Never copy more than the new block can hold
    size_t copy_size = (old_size < new_size) ? old_size : new_size;

    // Size-dependent strategy
    if (copy_size < 4096) {
        // Small: direct memcpy
        memcpy(new_ptr, ptr, copy_size);
    }
    else if (copy_size < g_ace_knobs.nt_threshold) {  // learned knob (Phase 4; not yet in ACEKnobs)
        // Medium: prefetch + memcpy
        __builtin_prefetch(ptr, 0, 3);
        __builtin_prefetch((char*)ptr + 64, 0, 3);
        memcpy(new_ptr, ptr, copy_size);
    }
    else {
        // Large: non-temporal store
        memcpy_nt(new_ptr, ptr, copy_size);
    }
    hak_free(ptr);
    return new_ptr;
}
```
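The large-copy branch calls `memcpy_nt`, which this plan does not spell out. One possible shape, offered only as a sketch, uses SSE2 streaming stores on x86 and falls back to plain `memcpy` elsewhere. It assumes the source and destination do not overlap. The destination is aligned to 16 bytes first because `_mm_stream_si128` requires an aligned address.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Sketch of a non-temporal copy: bypasses the cache for the bulk of the
 * data so a large realloc does not evict the working set. */
void memcpy_nt(void* dst, const void* src, size_t n) {
#if defined(__SSE2__)
    unsigned char* d = (unsigned char*)dst;
    const unsigned char* s = (const unsigned char*)src;

    /* Head: plain copy until the destination is 16-byte aligned. */
    size_t head = (16 - ((uintptr_t)d & 15)) & 15;
    if (head > n) head = n;
    memcpy(d, s, head);
    d += head; s += head; n -= head;

    /* Body: unaligned load, streaming (non-temporal) aligned store. */
    for (; n >= 16; n -= 16, d += 16, s += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)s);
        _mm_stream_si128((__m128i*)d, v);
    }
    _mm_sfence();  /* make the streaming stores visible before returning */

    /* Tail: remaining bytes (fewer than 16). */
    memcpy(d, s, n);
#else
    memcpy(dst, src, n);  /* portable fallback without streaming stores */
#endif
}
```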
---
## Implementation order (recommended)
### Day 1: Minimal ACE, maximum impact
1. ✅ Metrics collection (throughput, llc_miss, backlog)
2. ✅ Fast loop skeleton
3. ✅ Dynamic TLS capacity tuning
4. ✅ UCB1 learning
5. ✅ ON/OFF switch (environment variable)

**Expected**: Immediate gain on the fragmentation case (3.9 → 8-12 M ops/s)
### Day 2: Fragmentation countermeasures
1. ✅ Slow loop
2. ✅ Fragmentation metrics
3. ✅ Budgeted scavenge
4. ✅ Partial release

**Expected**: 10-20 M ops/s on the fragmentation case
### Day 3: Huge WS countermeasures
1. ✅ LLC miss monitoring
2. ✅ Diet adjustment
3. ✅ Gradual restore

**Expected**: 30-45 M ops/s on the huge WS case
### Day 4: realloc optimization (optional)
1. ✅ In-place extension
2. ✅ Copy optimization
3. ✅ NT threshold learning

**Expected**: realloc improves by 1.3-2x

---
## Safety valves (required)
```c
typedef struct ACEGuards {
    uint64_t max_p99_lat_ns;   // p99 latency limit
    uint64_t max_rss_mb;       // RSS limit
    uint32_t max_cpu_percent;  // limit on the CPU share used by ACE
    bool     enabled;          // ON/OFF switch (needs <stdbool.h>)
} ACEGuards;

void ace_check_guards(ACEMetrics* m, ACEGuards* g) {
    if (!g->enabled) return;

    // Latency guard
    // NOTE: ACEMetrics only tracks p50/p90, so p90 stands in for p99 here;
    // compare against a real p99 once that metric is collected
    if (m->p90_lat_ns > g->max_p99_lat_ns) {
        ace_emergency_stop("Latency spike");
        return;
    }

    // RSS guard
    if (m->rss_mb > g->max_rss_mb) {
        ace_emergency_stop("RSS overflow");
        return;
    }
}
```
---
## Environment variable control
```bash
# Enable/disable ACE
export HAKMEM_ACE_ENABLED=1
# Fast loop interval (ms)
export HAKMEM_ACE_FAST_INTERVAL_MS=500
# Slow loop interval (s)
export HAKMEM_ACE_SLOW_INTERVAL_S=30
# Guards
export HAKMEM_ACE_MAX_P99_LAT_NS=10000000 # 10ms
export HAKMEM_ACE_MAX_RSS_MB=16384 # 16GB
# Debug logging
export HAKMEM_ACE_LOG_LEVEL=1 # 0=off, 1=info, 2=debug
```
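Reading these variables at startup could be done with a small helper like the one sketched below, which falls back to a default when a variable is unset or malformed. The helper name `ace_env_u64` is hypothetical; `getenv` and `strtoull` are standard C library calls.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: read an unsigned decimal integer from the
 * environment. Returns `fallback` when the variable is unset, empty,
 * negative, out of range, or has trailing garbage. */
uint64_t ace_env_u64(const char* name, uint64_t fallback) {
    const char* s = getenv(name);
    if (!s || !*s || *s == '-') return fallback;

    char* end = NULL;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return fallback;
    return (uint64_t)v;
}

/* Example: fast loop interval with the 500 ms default from this plan. */
uint64_t ace_fast_interval_ms(void) {
    return ace_env_u64("HAKMEM_ACE_FAST_INTERVAL_MS", 500);
}
```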
---
## A/B measurement
```bash
# Baseline (ACE OFF)
HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress > baseline.txt
# ACE ON
HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress > ace_on.txt
# Compare
diff <(grep "ops/sec" baseline.txt) <(grep "ops/sec" ace_on.txt)
```
---
## Expected impact (conservative estimates)
| Workload | Before | After (ACE) | Improvement |
|----------|--------|-------------|-------------|
| Fragmentation stress | 3.87 M ops/s | 10-20 M ops/s | **2.6-5.2x** |
| Huge WS | 22.15 M ops/s | 30-45 M ops/s | **1.4-2.0x** |
| realloc-heavy | 277 ns (worst) | 140-210 ns | **1.3-2.0x** |
| Mid MT (maintain) | 111.6 M ops/s | 110-115 M ops/s | **±5%** |
---
## File layout
```
core/
├── hakmem_ace_metrics.h ← NEW (metrics collection)
├── hakmem_ace_metrics.c ← NEW
├── hakmem_ace_controller.h ← NEW (fast/slow loops)
├── hakmem_ace_controller.c ← NEW
├── hakmem_ace_ucb1.h ← NEW (learning)
├── hakmem_ace_ucb1.c ← NEW
├── hakmem_ace_scavenge.h ← NEW (partial release)
├── hakmem_ace_scavenge.c ← NEW
└── hakmem_ace_realloc.h ← NEW (realloc opt, Phase 4)
# Existing files to modify:
core/hakmem_tiny_magazine.c ← Add dynamic TLS capacity
core/hakmem_pool.c ← Add dynamic drain threshold
core/hakmem.c ← Integrate ace_tick() calls
```
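The `ace_tick()` integration in `core/hakmem.c` is not detailed in this plan. One way it might dispatch the two loops is sketched below: the caller passes a monotonic timestamp, and each loop runs at most once per call when its interval has elapsed. The `ACETicker` type and the signature of `ace_tick` are assumptions, and counters stand in for the real `ace_fast_loop_tick` / `ace_slow_loop_tick` calls.

```c
#include <stdint.h>

typedef struct {
    uint64_t fast_interval_ns;  /* e.g. 500 ms */
    uint64_t slow_interval_ns;  /* e.g. 30 s */
    uint64_t next_fast_ns;      /* deadline of the next fast loop run */
    uint64_t next_slow_ns;      /* deadline of the next slow loop run */
    uint32_t fast_runs;         /* counters stand in for the loop calls */
    uint32_t slow_runs;
} ACETicker;

void ace_ticker_init(ACETicker* t, uint64_t now_ns,
                     uint64_t fast_ms, uint64_t slow_s) {
    t->fast_interval_ns = fast_ms * 1000000ULL;
    t->slow_interval_ns = slow_s * 1000000000ULL;
    t->next_fast_ns = now_ns + t->fast_interval_ns;
    t->next_slow_ns = now_ns + t->slow_interval_ns;
    t->fast_runs = 0;
    t->slow_runs = 0;
}

/* Call from a low-frequency hook with a monotonic timestamp. */
void ace_tick(ACETicker* t, uint64_t now_ns) {
    if (now_ns >= t->next_fast_ns) {
        t->fast_runs++;   /* ace_fast_loop_tick(...) would run here */
        t->next_fast_ns = now_ns + t->fast_interval_ns;
    }
    if (now_ns >= t->next_slow_ns) {
        t->slow_runs++;   /* ace_slow_loop_tick(...) would run here */
        t->next_slow_ns = now_ns + t->slow_interval_ns;
    }
}
```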
---
## Next Step
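**Which phase should we start with?**

Recommended: **Phase 1 (Day 1)**, the minimal implementation with the largest impact
- Metrics collection
- Fast loop
- TLS capacity tuning
- UCB1 learning

Shall I write the Phase 1 code (`hakmem_ace_metrics.h/.c` + `hakmem_ace_controller.h/.c`) right now? 🚀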