hakmem/docs/ACE_LEARNING_LAYER.md

# ACE Learning Layer - ACE (Agentic Context Engineering)

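> In the implementation, ACE acts as an "Adaptive Control Engine" that tunes the knobs of the L1 layer,  
> but the name ACE itself is consistently defined as Agentic Context Engineering (an agent-style loop of observe → decide → apply).

**Goal**: Use learning to eliminate the weak spots in fragmentation, huge working sets (WS), and realloc, and make hakmem strong across the board 💪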

## Current Weaknesses (from ChatGPT Pro analysis)

| Problem | Current | Target | Improvement |
|------|------|------|--------|
| **Fragmentation stress** | 3.87 M ops/s | 10-20 M ops/s | **2.6-5.2x** ⭐ |
| **Huge WS** | 22.15 M ops/s | 30-45 M ops/s | **1.4-2.0x** |
| **realloc-heavy** | 6.6-277 ns (high variance) | 140-210 ns | **1.3-2.0x** |
| Mid MT (maintain) | 111.6 M ops/s | 110-115 M ops/s | ±5% |

## Concept

### Learning Loop

```
┌─────────────────────────────────────────────────────────┐
│  Metrics Collection (1Hz)                                │
│  ├─ throughput_ops                                       │
│  ├─ llc_miss_rate (LLC misses)                          │
│  ├─ mutex_wait_ns (contention)                          │
│  ├─ remote_free_backlog (per class)                     │
│  ├─ fragmentation_ratio (slow, 60s)                     │
│  └─ rss_mb (slow, 60s)                                  │
└─────────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Fast Loop (0.5-1s) - rapid-response tuning             │
│  ├─ Remote backlog ↑ → drain threshold ↓                │
│  ├─ LLC miss ↑ → TLS capacity ↓ (diet)                 │
│  └─ Mutex wait ↑ → bundle width ↑                      │
└─────────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────┐
│  Slow Loop (30-60s) - fragmentation / RSS control       │
│  ├─ Fragmentation ↑ → partial release                   │
│  ├─ RSS ↑ → budgeted scavenge (max 5ms)                │
│  └─ Stable → restore thresholds (hysteresis)            │
└─────────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────────┐
│  UCB1 Learning - knob tuning                            │
│  └─ reward = throughput - (misses + locks + frag)       │
└─────────────────────────────────────────────────────────┘
```
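
The reward line at the bottom of the diagram can be read as a weighted difference between throughput and the three penalty terms. The following is a minimal Python sketch, assuming every metric has already been normalized to the range [0, 1]. The `AceSample` type, the `ace_reward` function, and the weight values are illustrative assumptions and do not come from the plan.

```python
from dataclasses import dataclass


@dataclass
class AceSample:
    """One metrics snapshot; every field is assumed pre-normalized to [0, 1]."""
    throughput: float      # throughput_ops / best throughput seen so far
    llc_miss_rate: float   # LLC misses per access
    lock_wait: float       # mutex_wait_ns / length of the sampling interval
    fragmentation: float   # fragmentation_ratio


# Illustrative penalty weights (assumptions, not values from the plan).
W_MISS, W_LOCK, W_FRAG = 0.5, 0.3, 0.2


def ace_reward(s: AceSample) -> float:
    """reward = throughput - (misses + locks + frag), with weighted penalties."""
    penalty = (W_MISS * s.llc_miss_rate
               + W_LOCK * s.lock_wait
               + W_FRAG * s.fragmentation)
    return s.throughput - penalty
```

Normalizing first keeps the penalty terms comparable; without it, a raw `mutex_wait_ns` value would swamp a miss rate expressed as a fraction.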

### Tunable Knobs

| Knob | Candidate values | Effect |
|------|--------|------|
| **TLS capacity** | [4, 8, 16, 32, 64] | Mitigates LLC misses; adjusted by the diet |
| **Drain threshold** | [32, 64, 128, 256, 512] | Clears the remote free backlog |
| **Partial release pages** | [1, 2, 4, 8] | Reduces fragmentation and RSS |
| **Diet factor** | [0.5, 0.66, 0.75, 0.9, 1.0] | TLS shrink ratio under a huge WS |
| **Bundle width** | [16, 32, 64, 128] | Batched moves on the central free list |
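
As an illustration of how these candidate lists might be represented, the sketch below stores each knob as a list plus a current index and applies two of the fast-loop rules by stepping one candidate at a time. The `Knob` class, the `fast_loop_step` function, and the starting indices are hypothetical; only the candidate values are taken from the table.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Knob:
    """A tunable knob: fixed candidate list plus the index currently in use."""
    name: str
    candidates: Sequence[float]
    index: int

    @property
    def value(self) -> float:
        return self.candidates[self.index]

    def step(self, direction: int) -> float:
        """Move one candidate up (+1) or down (-1), clamped at both ends."""
        self.index = max(0, min(len(self.candidates) - 1, self.index + direction))
        return self.value


# Candidate lists copied from the table; the starting indices are assumptions.
KNOBS = {
    "tls_capacity": Knob("tls_capacity", [4, 8, 16, 32, 64], 3),
    "drain_threshold": Knob("drain_threshold", [32, 64, 128, 256, 512], 2),
    "partial_release_pages": Knob("partial_release_pages", [1, 2, 4, 8], 0),
    "diet_factor": Knob("diet_factor", [0.5, 0.66, 0.75, 0.9, 1.0], 4),
    "bundle_width": Knob("bundle_width", [16, 32, 64, 128], 1),
}


def fast_loop_step(knobs, backlog_high: bool, mutex_wait_high: bool) -> None:
    """Two of the fast-loop rules: backlog up -> drain threshold down,
    mutex wait up -> bundle width up."""
    if backlog_high:
        knobs["drain_threshold"].step(-1)
    if mutex_wait_high:
        knobs["bundle_width"].step(+1)
```

Restricting each knob to a short discrete list keeps the search space small, which is what makes a bandit method such as UCB1 practical here.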

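## Implementation Phases

### 🔥 Phase 1: Minimal Implementation (1 day, highest impact)

**Target**: fragmentation stress 3.87 → 8-12 M ops/s

Work items:
1. Metrics collection (2-3 hours)
   - throughput, LLC miss, mutex wait, backlog
2. Fast loop skeleton (2-3 hours)
   - reward calculation, knob adjustment logic
3. Dynamic TLS capacity (1-2 hours)
   - make the existing `TINY_TLS_MAG_CAP` adjustable at runtime
4. UCB1 learning (1-2 hours)
   - select the best value from the discrete candidates
5. ON/OFF switch (1 hour)
   - environment variable `HAKMEM_ACE_ENABLED=1`

**Expected effect**: immediate gains in the fragmentation case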
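
The UCB1 item in Phase 1 selects the best value from a discrete candidate list. A minimal sketch of the standard UCB1 rule, written in Python for brevity, might look like the following. The `Ucb1` class and its method names are assumptions, and the reward passed to `update` is assumed to come from the fast loop.

```python
import math


class Ucb1:
    """UCB1 bandit over a discrete candidate list (one arm per candidate)."""

    def __init__(self, n_arms: int) -> None:
        if n_arms <= 0:
            raise ValueError("n_arms must be positive")
        self.pulls = [0] * n_arms          # times each arm was tried
        self.reward_sum = [0.0] * n_arms   # total reward observed per arm

    def select(self) -> int:
        """Return the arm with the highest upper confidence bound.

        score_i = mean_i + sqrt(2 * ln(total_pulls) / pulls_i);
        arms that were never tried are picked first.
        """
        for arm, n in enumerate(self.pulls):
            if n == 0:
                return arm
        total = sum(self.pulls)

        def score(arm: int) -> float:
            mean = self.reward_sum[arm] / self.pulls[arm]
            return mean + math.sqrt(2.0 * math.log(total) / self.pulls[arm])

        return max(range(len(self.pulls)), key=score)

    def update(self, arm: int, reward: float) -> None:
        """Record the reward observed after one interval with `arm`."""
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
```

One bandit per knob would probably be enough for Phase 1: `select()` returns an index into the candidate list (for example `[4, 8, 16, 32, 64]` for TLS capacity), and after one fast-loop interval the measured reward is passed back through `update()`.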

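### 🚀 Phase 2: Fragmentation Mitigation (half a day)

**Target**: fragmentation stress 8-12 → 10-20 M ops/s

Work items:
1. Slow loop (2-3 hours)
   - fragmentation/RSS monitoring
2. Budgeted scavenge (2-3 hours)
   - time-limited partial release (max 5 ms)
   - return memory with `madvise(MADV_DONTNEED)`

**Expected effect**: full coverage of the fragmentation case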
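
The budgeted scavenge combines two limits: a page cap (the partial release pages knob) and a wall-clock budget of 5 ms. The sketch below shows only that control logic, under the assumption that returning a page to the OS is hidden behind a callback; `budgeted_scavenge`, its parameters, and the injectable `clock` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Iterable


def budgeted_scavenge(
    free_pages: Iterable[int],
    release: Callable[[int], None],
    budget_ns: int = 5_000_000,                    # max 5 ms per pass
    max_pages: int = 8,                            # partial release pages knob
    clock: Callable[[], int] = time.monotonic_ns,
) -> int:
    """Release free pages until the page cap or the time budget is reached.

    `release` stands in for the C-side call that returns one page to the OS,
    e.g. madvise(addr, page_size, MADV_DONTNEED). Returns the number of
    pages released in this pass.
    """
    deadline = clock() + budget_ns
    released = 0
    for page in free_pages:
        if released >= max_pages or clock() >= deadline:
            break
        release(page)
        released += 1
    return released
```

Checking the clock before every release bounds the overrun to the cost of a single release call, which seems to be the intent of the 5 ms limit.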

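### 🎯 Phase 3: Huge WS Mitigation (half a day)

**Target**: huge WS 22 → 30-45 M ops/s

Work items:
1. LLC miss monitoring (1-2 hours)
   - lightweight measurement using rdpmc
2. Auto diet (1-2 hours)
   - high miss rate → shrink TLS
   - stable → restore in steps

**Expected effect**: up to 2x improvement on huge WS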
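
The auto diet can be sketched as a small state machine with hysteresis: shrink one step whenever the miss rate is high, and restore one step only after the miss rate has stayed low for several samples. In this Python sketch the diet factors come from the knob table, while the `AutoDiet` class, both thresholds, and the three-sample stability window are assumptions made for illustration.

```python
class AutoDiet:
    """Shrink TLS capacity when the LLC miss rate is high; restore it stepwise
    once the miss rate has stayed low for several consecutive samples."""

    FACTORS = [0.5, 0.66, 0.75, 0.9, 1.0]   # diet factor candidates

    def __init__(self, base_capacity: int, high: float = 0.10,
                 low: float = 0.05, stable_samples: int = 3) -> None:
        # Thresholds and the stability window are illustrative assumptions.
        self.base_capacity = base_capacity
        self.high, self.low = high, low
        self.stable_samples = stable_samples
        self.level = len(self.FACTORS) - 1   # start at factor 1.0 (no diet)
        self.calm = 0                        # consecutive low-miss samples

    def on_sample(self, llc_miss_rate: float) -> int:
        """Feed one miss-rate sample and return the TLS capacity to apply."""
        if llc_miss_rate > self.high:
            self.level = max(0, self.level - 1)        # diet: one step smaller
            self.calm = 0
        elif llc_miss_rate < self.low:
            self.calm += 1
            if self.calm >= self.stable_samples:       # stable: one step back
                self.level = min(len(self.FACTORS) - 1, self.level + 1)
                self.calm = 0
        else:
            self.calm = 0                              # dead band: hold
        return max(1, int(self.base_capacity * self.FACTORS[self.level]))
```

The gap between `low` and `high` acts as a dead band, so a miss rate hovering near a single threshold does not make the capacity oscillate.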

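### 🔧 Phase 4: realloc Optimization (1 day, optional)

**Target**: 1.3-2x realloc improvement

Work items:
1. In-place expansion (half a day)
   - merge with adjacent free blocks
2. Copy optimization (half a day)
   - size-specific strategies (NT stores)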

## Usage

### Basics

```bash
# Enable ACE
export HAKMEM_ACE_ENABLED=1

# Fast loop interval (default 500 ms)
export HAKMEM_ACE_FAST_INTERVAL_MS=500

# Slow loop interval (default 30 s)
export HAKMEM_ACE_SLOW_INTERVAL_S=30

# Run the benchmark
bash benchmarks/scripts/stress/run_fragmentation_stress.sh
```

### A/B Comparison

```bash
# Baseline (ACE OFF)
HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt

# ACE ON
HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt

# Compare
diff baseline.txt ace_on.txt
```

### Debugging

```bash
# Set the log level
export HAKMEM_ACE_LOG_LEVEL=2  # 0=off, 1=info, 2=debug

# At runtime, ACE prints its adjustments in real time, for example:
# [ACE] Fast loop: reward=0.85, llc_miss=0.12, backlog=45
# [ACE] Adjusting TLS cap[2]: 32 → 24 (diet factor=0.75)
# [ACE] Drain threshold[3]: 128 → 64 (high backlog)
```

## Safety Valves

ACE stops automatically when any of the following limits is exceeded:

```bash
# Latency guard (default 10 ms)
export HAKMEM_ACE_MAX_P99_LAT_NS=10000000

# RSS guard (default 16 GB)
export HAKMEM_ACE_MAX_RSS_MB=16384

# CPU usage cap (default 5%)
export HAKMEM_ACE_MAX_CPU_PERCENT=5
```
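
A guard check along these lines could run at the start of every fast-loop tick. The sketch below reads the three environment variables with the documented defaults and reports which limits the current readings exceed. The `AceGuards` class, the `_env_int` helper, and the way the readings are passed in are assumptions; the document does not say how p99 latency or CPU share are measured.

```python
import os
from dataclasses import dataclass


def _env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back on bad or missing input."""
    try:
        return int(os.environ.get(name, default))
    except ValueError:
        return default


@dataclass
class AceGuards:
    max_p99_lat_ns: int
    max_rss_mb: int
    max_cpu_percent: int

    @classmethod
    def from_env(cls) -> "AceGuards":
        # Defaults mirror the documented values: 10 ms, 16 GB, 5%.
        return cls(
            max_p99_lat_ns=_env_int("HAKMEM_ACE_MAX_P99_LAT_NS", 10_000_000),
            max_rss_mb=_env_int("HAKMEM_ACE_MAX_RSS_MB", 16_384),
            max_cpu_percent=_env_int("HAKMEM_ACE_MAX_CPU_PERCENT", 5),
        )

    def tripped(self, p99_lat_ns: int, rss_mb: int, cpu_percent: float) -> list:
        """Return the names of all guards that the current readings exceed."""
        checks = [
            ("latency", p99_lat_ns > self.max_p99_lat_ns),
            ("rss", rss_mb > self.max_rss_mb),
            ("cpu", cpu_percent > self.max_cpu_percent),
        ]
        return [name for name, exceeded in checks if exceeded]
```

If `tripped(...)` returns a non-empty list, the controller would stop adjusting knobs; whether it also reverts them to their defaults is left open here.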

## Expected Results (conservative estimates)

| Workload | Before | After (ACE) | Notes |
|------------|--------|-------------|----------|
| **Fragmentation stress** | 3.87 M ops/s | **10-20 M ops/s** | Effect of budgeted scavenge |
| **Huge WS** | 22.15 M ops/s | **30-45 M ops/s** | Auto diet + LLC miss optimization |
| **realloc heavy** | 277 ns (worst) | **140-210 ns** | In-place expansion + NT stores |
| Mid MT (4-8 threads) | 111.6 M ops/s | **110-115 M ops/s** | Maintained (within ±5%) |
| 16 threads | Plateaued | **+0-10%** | Effect of sharding |

## File Layout

```
core/
├── hakmem_ace_metrics.{h,c}       ← metrics collection
├── hakmem_ace_controller.{h,c}    ← Fast/Slow loops
├── hakmem_ace_ucb1.{h,c}          ← UCB1 learning
├── hakmem_ace_scavenge.{h,c}      ← partial release
└── hakmem_ace_realloc.{h,c}       ← realloc optimization

# Changes to existing files:
core/hakmem_tiny_magazine.c        ← make TLS capacity dynamic
core/hakmem_pool.c                 ← make drain threshold dynamic
core/hakmem.c                      ← integrate ace_tick()
```

## Technical Details

See `docs/ACE_LEARNING_LAYER_PLAN.md` for the detailed implementation plan.

## Next Steps

1. ✅ Documentation (this file)
2. 📝 Phase 1 implementation
   - [ ] hakmem_ace_metrics.{h,c}
   - [ ] hakmem_ace_controller.{h,c}
   - [ ] hakmem_ace_ucb1.{h,c}
   - [ ] integration with existing code
3. 🧪 A/B testing
4. 📊 Results report

---

**Status**: PLANNING (2025-11-01)
**Priority**: HIGH - next development phase 🎯
**Expected Impact**: 2-5x improvement on weak workloads