hakmem/docs/ACE_LEARNING_LAYER.md
Moe Charm (CI) 0546454168 WIP: Add TLS SLL validation and SuperSlab registry fallback
ChatGPT's diagnostic changes to address TLS_SLL_HDR_RESET issue.
Current status: Partial mitigation, but root cause remains.

Changes Applied:
1. SuperSlab Registry Fallback (hakmem_super_registry.h)
   - Added legacy table probe when hash map lookup misses
   - Prevents NULL returns for valid SuperSlabs during initialization
   - Status: Works but may hide underlying registration issues

2. TLS SLL Push Validation (tls_sll_box.h)
   - Reject push if SuperSlab lookup returns NULL
   - Reject push if class_idx mismatch detected
   - Added [TLS_SLL_PUSH_NO_SS] diagnostic message
   - Status: Prevents list corruption (defensive)

3. SuperSlab Allocation Class Fix (superslab_allocate.c)
   - Pass actual class_idx to sp_internal_allocate_superslab
   - Prevents dummy class=8 causing OOB access
   - Status: Root cause fix for allocation path

4. Debug Output Additions
   - First 256 push/pop operations traced
   - First 4 mismatches logged with details
   - SuperSlab registration state logged
   - Status: Diagnostic tool (not a fix)

5. TLS Hint Box Removed
   - Deleted ss_tls_hint_box.{c,h} (Phase 1 optimization)
   - Simplified to focus on stability first
   - Status: Can be re-added after root cause fixed

Current Problem (REMAINS UNSOLVED):
- [TLS_SLL_HDR_RESET] still occurs after ~60 seconds of sh8bench
- Pointer is 16 bytes offset from expected (class 1 → class 2 boundary)
- hak_super_lookup returns NULL for that pointer
- Suggests: Use-After-Free, Double-Free, or pointer arithmetic error

Root Cause Analysis:
- Pattern: Pointer offset by +16 (one class 1 stride)
- Timing: Cumulative problem (appears after 60s, not immediately)
- Location: Header corruption detected during TLS SLL pop

Remaining Issues:
⚠️ Registry fallback is defensive (may hide registration bugs)
⚠️ Push validation prevents symptoms but not root cause
⚠️ 16-byte pointer offset source unidentified

Next Steps for Investigation:
1. Full pointer arithmetic audit (Magazine ⇔ TLS SLL paths)
2. Enhanced logging at HDR_RESET point:
   - Expected vs actual pointer value
   - Pointer provenance (where it came from)
   - Allocation trace for that block
3. Verify Headerless flag is OFF throughout build
4. Check for double-offset application in conversions

Technical Assessment:
- 60% root cause fixes (allocation class, validation)
- 40% defensive mitigation (registry fallback, push rejection)

Performance Impact:
- Registry fallback: +10-30 cycles on cold path (negligible)
- Push validation: +5-10 cycles per push (acceptable)
- Overall: < 2% performance impact estimated

Related Issues:
- Phase 1 TLS Hint Box removed temporarily
- Phase 2 Headerless blocked until stability achieved

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:42:28 +09:00


# ACE Learning Layer - ACE (Agentic Context Engineering)
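> In implementation terms it acts as an "Adaptive Control Engine" that tunes the knobs of the L1 layer,
> but ACE itself consistently stands for Agentic Context Engineering: an agent-style loop of observe → decide → apply.

**Purpose**: Use learning to eliminate the weaknesses around fragmentation, huge working sets (WS), and realloc, and make the allocator strong 💪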
## Current Weaknesses (from the ChatGPT Pro analysis)
| Problem | Current | Target | Improvement |
|------|------|------|--------|
| **Fragmentation stress** | 3.87 M ops/s | 10-20 M ops/s | **2.6-5.2x** ⭐ |
| **Huge WS** | 22.15 M ops/s | 30-45 M ops/s | **1.4-2.0x** |
| **realloc-dependent** | 6.6-277 ns (high variance) | 140-210 ns | **1.3-2.0x** |
| Mid MT (maintain) | 111.6 M ops/s | 110-115 M ops/s | ±5% |
## Concept
### Learning Loop
```
┌─────────────────────────────────────────────────────────┐
│ Metrics Collection (1Hz)                                │
│ ├─ throughput_ops                                       │
│ ├─ llc_miss_rate (LLC misses)                           │
│ ├─ mutex_wait_ns (contention)                           │
│ ├─ remote_free_backlog (per class)                      │
│ ├─ fragmentation_ratio (slow, 60s)                      │
│ └─ rss_mb (slow, 60s)                                   │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Fast Loop (0.5-1s) - immediate adjustment               │
│ ├─ Remote backlog ↑ → drain threshold ↓                 │
│ ├─ LLC miss ↑ → TLS capacity ↓ (diet)                   │
│ └─ Mutex wait ↑ → bundle width ↑                        │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ Slow Loop (30-60s) - fragmentation/RSS control          │
│ ├─ Fragmentation ↑ → partial release                    │
│ ├─ RSS ↑ → budgeted scavenge (max 5ms)                  │
│ └─ Stable → restore thresholds (hysteresis)             │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ UCB1 Learning - knob tuning                             │
│ └─ reward = throughput - (misses + locks + frag)        │
└─────────────────────────────────────────────────────────┘
```
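
The diagram names UCB1 but does not spell out the selection rule. As a reference point, the textbook UCB1 index is shown below; whether the implementation uses exactly this form, and how the reward is scaled, are assumptions of this sketch (standard UCB1 expects rewards normalized to $[0, 1]$).

$$
\mathrm{UCB1}_i = \bar{r}_i + \sqrt{\frac{2 \ln N}{n_i}}
$$

Here $\bar{r}_i$ is the mean reward observed for candidate $i$, $n_i$ is the number of times candidate $i$ has been tried, and $N$ is the total number of trials. Each candidate is tried once first, then the candidate with the largest index is chosen.

A worked example with made-up numbers: after $N = 10$ trials, a candidate with $n = 4$ and $\bar{r} = 0.60$ scores $0.60 + \sqrt{2 \ln 10 / 4} \approx 0.60 + 1.073 = 1.673$, while a candidate with $n = 2$ and $\bar{r} = 0.55$ scores $0.55 + \sqrt{2 \ln 10 / 2} \approx 0.55 + 1.517 = 2.067$. The less-explored candidate is selected next even though its mean reward is lower.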
### Tunable Knobs
| Knob | Candidate values | Effect |
|------|--------|------|
| **TLS capacity** | [4, 8, 16, 32, 64] | Mitigates LLC misses; diet adjustment |
| **Drain threshold** | [32, 64, 128, 256, 512] | Clears the remote-free backlog |
| **Partial release pages** | [1, 2, 4, 8] | Reduces fragmentation and RSS |
| **Diet factor** | [0.5, 0.66, 0.75, 0.9, 1.0] | TLS shrink ratio under a huge working set |
| **Bundle width** | [16, 32, 64, 128] | Bundled transfers on the central free list |
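
Combining the candidate values above with the three fast-loop rules from the learning-loop diagram, one tick of the fast loop could look roughly like the sketch below. The type names, the function names, and the idea of stepping one position at a time through each candidate table are assumptions for illustration; the document does not say how "rising" is detected or how large each step is.

```c
#include <stddef.h>

/* Candidate tables copied from the knob table in this document. */
static const unsigned ACE_TLS_CAP[]      = {4, 8, 16, 32, 64};
static const unsigned ACE_DRAIN_THRESH[] = {32, 64, 128, 256, 512};
static const unsigned ACE_BUNDLE_WIDTH[] = {16, 32, 64, 128};
#define ACE_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Current position of each knob within its candidate table (hypothetical). */
typedef struct {
    size_t tls_cap;
    size_t drain;
    size_t bundle;
} ace_knob_idx_t;

/* Hypothetical per-tick trend flags computed by the metrics layer. */
typedef struct {
    int backlog_rising;
    int llc_miss_rising;
    int mutex_wait_rising;
} ace_trend_t;

/* Step one position down or up, clamped to the table bounds. */
static size_t ace_down(size_t i) { return i > 0 ? i - 1 : 0; }
static size_t ace_up(size_t i, size_t n) { return i + 1 < n ? i + 1 : i; }

/* Apply the three fast-loop rules from the diagram. */
static void ace_fast_adjust(ace_knob_idx_t *k, const ace_trend_t *t) {
    if (t->backlog_rising)    k->drain   = ace_down(k->drain);   /* backlog up  -> drain threshold down */
    if (t->llc_miss_rising)   k->tls_cap = ace_down(k->tls_cap); /* LLC miss up -> TLS capacity down    */
    if (t->mutex_wait_rising) k->bundle  = ace_up(k->bundle, ACE_COUNT(ACE_BUNDLE_WIDTH)); /* mutex wait up -> bundle width up */
}
```

Starting from indices `{3, 2, 1}` (TLS capacity 32, drain threshold 128, bundle width 32) with all three trends rising, one call moves the knobs to 16, 64, and 64 respectively.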
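## Implementation Phases
### 🔥 Phase 1: Minimal Implementation (1 day, maximum impact)
**Target**: Fragmentation stress 3.87 → 8-12 M ops/s

Implementation tasks:
1. Metrics collection (2-3 hours)
   - throughput, LLC miss, mutex wait, backlog
2. Fast loop skeleton (2-3 hours)
   - Reward calculation and knob-adjustment logic
3. Dynamic TLS capacity (1-2 hours)
   - Make the existing `TINY_TLS_MAG_CAP` adjustable at runtime
4. UCB1 learning (1-2 hours)
   - Select the best value from the discrete candidates
5. ON/OFF switch (1 hour)
   - Environment variable `HAKMEM_ACE_ENABLED=1`

**Expected effect**: Immediate benefit in the fragmentation case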
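
Task 3 turns a compile-time constant into a value the controller can change while allocation threads are reading it. One possible shape, sketched with C11 atomics, is shown below. The class count of 8 and all `ace_tls_cap_*` names are hypothetical; only `TINY_TLS_MAG_CAP` comes from this document, and how the real magazine code reads its capacity is not shown here.

```c
#include <stdatomic.h>
#include <stddef.h>

#define ACE_NUM_CLASSES 8  /* hypothetical number of size classes */

/* Per-class TLS capacity, written by the ACE controller and read on the
 * allocation path. Relaxed ordering is assumed to be enough because a
 * slightly stale capacity only delays the adjustment. */
static _Atomic unsigned g_ace_tls_cap[ACE_NUM_CLASSES];

/* Seed every class with the former compile-time value. */
static void ace_tls_cap_init(unsigned initial_cap) {
    for (size_t c = 0; c < ACE_NUM_CLASSES; c++)
        atomic_store_explicit(&g_ace_tls_cap[c], initial_cap, memory_order_relaxed);
}

/* Called by the controller; out-of-range classes are ignored. */
static void ace_tls_cap_set(size_t cls, unsigned cap) {
    if (cls < ACE_NUM_CLASSES)
        atomic_store_explicit(&g_ace_tls_cap[cls], cap, memory_order_relaxed);
}

/* Called on the allocation path; returns 0 for an out-of-range class. */
static unsigned ace_tls_cap_get(size_t cls) {
    return cls < ACE_NUM_CLASSES
        ? atomic_load_explicit(&g_ace_tls_cap[cls], memory_order_relaxed)
        : 0;
}
```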
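### 🚀 Phase 2: Fragmentation Countermeasures (half day)
**Target**: Fragmentation stress 8-12 → 10-20 M ops/s

Implementation tasks:
1. Slow loop (2-3 hours)
   - Fragmentation/RSS monitoring
2. Budgeted scavenge (2-3 hours)
   - Time-limited partial release (max 5 ms)
   - Return memory to the OS with `madvise(MADV_DONTNEED)`

**Expected effect**: Comprehensive fragmentation mitigation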
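
A time-budgeted scavenge pass can be sketched as a loop that checks a monotonic clock before each `madvise` call and stops once the budget is spent. The `ace_free_range_t` descriptor and the function names are hypothetical; how the allocator actually tracks releasable, page-aligned ranges is not described in this document.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>

/* Monotonic time in nanoseconds. */
static uint64_t ace_now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical descriptor of a page-aligned range with no live blocks. */
typedef struct {
    void  *addr;
    size_t len;
} ace_free_range_t;

/* Release ranges until the list is exhausted or the time budget is used up.
 * Returns the number of ranges successfully released. */
static size_t ace_scavenge(const ace_free_range_t *ranges, size_t n,
                           uint64_t budget_ns) {
    uint64_t deadline = ace_now_ns() + budget_ns;
    size_t released = 0;
    for (size_t i = 0; i < n; i++) {
        if (ace_now_ns() >= deadline)
            break;  /* budget exhausted: leave the rest for the next slow tick */
        if (madvise(ranges[i].addr, ranges[i].len, MADV_DONTNEED) == 0)
            released++;
    }
    return released;
}
```

With the 5 ms cap from the task list, a call would pass `5000000` as `budget_ns`. The deadline is only checked between calls, so a single slow `madvise` can still overshoot the budget.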
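### 🎯 Phase 3: Huge Working Set Countermeasures (half day)
**Target**: Huge WS 22 → 30-45 M ops/s

Implementation tasks:
1. LLC miss monitoring (1-2 hours)
   - Lightweight measurement using rdpmc
2. Auto diet (1-2 hours)
   - High miss rate → shrink TLS
   - Stable → restore in stages

**Expected effect**: Up to 2x improvement on huge working sets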
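
The auto-diet rule (shrink on a high miss rate, restore in stages once stable) can be sketched as a small state machine over the diet-factor candidates. The thresholds (`0.10`, `0.05`), the three-tick hold before restoring, and all names below are assumptions; the document only gives the candidate factors and the direction of each adjustment.

```c
#include <stddef.h>

/* Diet-factor candidates from the knob table in this document. */
static const double ACE_DIET[] = {0.5, 0.66, 0.75, 0.9, 1.0};
#define ACE_DIET_N (sizeof(ACE_DIET) / sizeof(ACE_DIET[0]))

typedef struct {
    size_t   idx;           /* current position in ACE_DIET */
    unsigned stable_ticks;  /* consecutive ticks below the low threshold */
} ace_diet_t;

/* One controller tick. Returns the diet factor to apply. */
static double ace_diet_tick(ace_diet_t *d, double llc_miss_rate) {
    const double   HIGH = 0.10;  /* assumed: shrink above this miss rate */
    const double   LOW  = 0.05;  /* assumed: count as stable below this  */
    const unsigned HOLD = 3;     /* assumed: stable ticks per restore step */

    if (llc_miss_rate > HIGH) {
        if (d->idx > 0) d->idx--;          /* shrink one step */
        d->stable_ticks = 0;
    } else if (llc_miss_rate < LOW) {
        if (++d->stable_ticks >= HOLD) {   /* hysteresis: restore slowly */
            if (d->idx + 1 < ACE_DIET_N) d->idx++;
            d->stable_ticks = 0;
        }
    } else {
        d->stable_ticks = 0;               /* in between: hold position */
    }
    return ACE_DIET[d->idx];
}

/* Scale a base TLS capacity by the diet factor, never dropping below 1. */
static unsigned ace_diet_cap(unsigned base_cap, double factor) {
    unsigned cap = (unsigned)(base_cap * factor);
    return cap > 0 ? cap : 1;
}
```

For example, a base capacity of 32 with a diet factor of 0.75 yields 24.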
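### 🔧 Phase 4: realloc Optimization (1 day, optional)
**Target**: 1.3-2x improvement in realloc

Implementation tasks:
1. In-place expansion (half day)
   - Merge with adjacent free blocks
2. Copy optimization (half day)
   - Size-specific strategies (NT store)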
## Usage
### Basics
```bash
# Enable ACE
export HAKMEM_ACE_ENABLED=1
# Fast loop interval (default: 500 ms)
export HAKMEM_ACE_FAST_INTERVAL_MS=500
# Slow loop interval (default: 30 s)
export HAKMEM_ACE_SLOW_INTERVAL_S=30
# Run the benchmark
bash benchmarks/scripts/stress/run_fragmentation_stress.sh
```
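
On the C side, these variables have to be read with a fallback to the documented defaults. A minimal sketch, assuming a helper that rejects malformed or out-of-range input instead of aborting; the helper names and the accepted ranges are hypothetical, while the variable names and defaults come from the block above.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal integer; return fallback on NULL, empty, malformed,
 * or out-of-range input. */
static long ace_parse_long(const char *text, long fallback, long lo, long hi) {
    if (text == NULL || *text == '\0')
        return fallback;
    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || v < lo || v > hi)
        return fallback;
    return v;
}

/* Read an integer setting from the environment. */
static long ace_env_long(const char *name, long fallback, long lo, long hi) {
    return ace_parse_long(getenv(name), fallback, lo, hi);
}
```

A caller might then use `ace_env_long("HAKMEM_ACE_FAST_INTERVAL_MS", 500, 1, 60000)` and `ace_env_long("HAKMEM_ACE_ENABLED", 0, 0, 1)`; treating ACE as disabled when the variable is unset is an assumption here.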
### A/B Comparison
```bash
# Baseline (ACE OFF)
HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt
# ACE ON
HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt
# Compare
diff baseline.txt ace_on.txt
```
### Debugging
```bash
# Set the log level
export HAKMEM_ACE_LOG_LEVEL=2 # 0=off, 1=info, 2=debug
# When run, ACE adjustments are displayed in real time:
# [ACE] Fast loop: reward=0.85, llc_miss=0.12, backlog=45
# [ACE] Adjusting TLS cap[2]: 32 → 24 (diet factor=0.75)
# [ACE] Drain threshold[3]: 128 → 64 (high backlog)
```
## Safety Valves
ACE stops automatically when any of the following limits is exceeded:
```bash
# Latency guard (default: 10 ms)
export HAKMEM_ACE_MAX_P99_LAT_NS=10000000
# RSS guard (default: 16 GB)
export HAKMEM_ACE_MAX_RSS_MB=16384
# CPU usage cap (default: 5%)
export HAKMEM_ACE_MAX_CPU_PERCENT=5
```
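
The three guards reduce to a single predicate evaluated each tick. The sketch below uses the default values listed above; the struct layouts, the function names, and the assumption that exceeding any one limit is enough to stop ACE are illustrative, not taken from the implementation.

```c
#include <stdint.h>

/* Limits corresponding to the three environment variables above. */
typedef struct {
    uint64_t max_p99_lat_ns;   /* HAKMEM_ACE_MAX_P99_LAT_NS  */
    uint64_t max_rss_mb;       /* HAKMEM_ACE_MAX_RSS_MB      */
    double   max_cpu_percent;  /* HAKMEM_ACE_MAX_CPU_PERCENT */
} ace_limits_t;

/* Hypothetical snapshot of the values being guarded. */
typedef struct {
    uint64_t p99_lat_ns;
    uint64_t rss_mb;
    double   ace_cpu_percent;  /* CPU share consumed by ACE itself */
} ace_observed_t;

/* Defaults as documented: 10 ms, 16 GB, 5%. */
static ace_limits_t ace_default_limits(void) {
    ace_limits_t lim = { 10000000u, 16384u, 5.0 };
    return lim;
}

/* Returns 1 if ACE should stop adjusting, 0 otherwise. */
static int ace_should_stop(const ace_limits_t *lim, const ace_observed_t *obs) {
    return obs->p99_lat_ns      > lim->max_p99_lat_ns
        || obs->rss_mb          > lim->max_rss_mb
        || obs->ace_cpu_percent > lim->max_cpu_percent;
}
```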
## Expected Results (conservative estimates)
| Workload | Before | After (ACE) | Notes |
|------------|--------|-------------|----------|
| **Fragmentation stress** | 3.87 M ops/s | **10-20 M ops/s** | Budgeted scavenge effect |
| **Huge WS** | 22.15 M ops/s | **30-45 M ops/s** | Auto diet + LLC miss optimization |
| **realloc heavy** | 277 ns (worst) | **140-210 ns** | In-place expansion + NT store |
| Mid MT (4-8 threads) | 111.6 M ops/s | **110-115 M ops/s** | Maintained (within ±5%) |
| 16 threads | Plateaus | **+0-10%** | Sharding effect |
## File Layout
```
core/
├── hakmem_ace_metrics.{h,c}     ← Metrics collection
├── hakmem_ace_controller.{h,c}  ← Fast/Slow loops
├── hakmem_ace_ucb1.{h,c}        ← UCB1 learning
├── hakmem_ace_scavenge.{h,c}    ← Partial release
└── hakmem_ace_realloc.{h,c}     ← realloc optimization
# Changes to existing files:
core/hakmem_tiny_magazine.c      ← Dynamic TLS capacity
core/hakmem_pool.c               ← Dynamic drain threshold
core/hakmem.c                    ← ace_tick() integration
```
## Technical Details
See `docs/ACE_LEARNING_LAYER_PLAN.md` for the detailed implementation plan.
## Next Steps
1. ✅ Documentation (this file)
2. 📝 Phase 1 implementation
   - [ ] hakmem_ace_metrics.{h,c}
   - [ ] hakmem_ace_controller.{h,c}
   - [ ] hakmem_ace_ucb1.{h,c}
   - [ ] Integration with existing code
3. 🧪 A/B testing
4. 📊 Results report
---
**Status**: PLANNING (2025-11-01)
**Priority**: HIGH - next development phase 🎯
**Expected Impact**: 2-5x improvement on weak workloads