Phase POLICY-FAST-PATH-V2 complete + MID-V35-HOTPATH-OPT-1 design
## Phase POLICY-FAST-PATH-V2 (FROZEN)
- Implementation complete: free_policy_fast_v2_box.h + malloc_tiny_fast.h integration
- A/B Results:
  - Mixed (ws=400): -1.6% regression ❌ (branch cost > skip benefit)
  - C6-heavy (ws=200): +5.4% improvement ✅
- Decision: Default OFF, FROZEN (ws<300 / C6-heavy research only)
- Learning: Large WS causes branch misprediction to dominate

## Phase 3-GRADUATE + ENV probe fix
- 64-probe retry for getenv() stability during bench_profile putenv()
- C6 ULTRA intrusive freelist: FROZEN (research box)

## Phase MID-V35-HOTPATH-OPT-1-DESIGN
- Design doc for next optimization target
- Target: MID v3.5 alloc/free hot path (C5-C6)
- Boxes: Stats Gate, TLS Layout, Boundary Check elimination
- Expected: +3-9% on Mixed mainline

Files:
- core/box/free_policy_fast_v2_box.h (new)
- core/box/free_path_stats_box.h/c (policy_fast_v2_skip counter)
- core/front/malloc_tiny_fast.h (fast-path integration)
- docs/analysis/MID_V35_HOTPATH_OPT_1_DESIGN.md (new)
- docs/analysis/PHASE_3_GRADUATE_*.md (new)
- CURRENT_TASK.md (phase status update)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@ -48,6 +48,18 @@ HAKMEM_TINY_HEAP_STATS=1
  HAKMEM_TINY_HEAP_STATS_DUMP=1
  HAKMEM_SMALL_HEAP_V3_STATS=1
  ```
- **Phase POLICY-FAST-PATH-V2** (FROZEN - research only):
  ```sh
  HAKMEM_TINY_FREE_POLICY_FAST_V2=1 # Fast-path free optimization
  ```
  - **Status**: Default OFF, FROZEN (merge complete)
  - **Actual Results** (Phase POLICY-FAST-PATH-V2 A/B):
    - Mixed (ws=400): **-1.6%** regression ❌ (added branch cost > skip benefit)
    - C6-heavy (ws=200): **+5.4%** improvement ✅
  - **Finding**: With a large working set (ws>300), branch misprediction cost dominates
  - **Recommendation**: Use only for C6-heavy or ws<300 research benchmarks
  - **NOT recommended for**: MIXED_TINYV3_C7_SAFE mainline (keep OFF)
  - **Requirement**: Only effective when v7 Learner is disabled
- Leave the v2 family alone (under C7_SAFE, Pool v2 / Tiny v2 are always OFF).
- Example experiment that touches FREE_POLICY/THP (not required at the current HEAD; some combinations can come out slightly negative):
  ```sh
@ -332,6 +344,71 @@ HAKMEM_BENCH_MIN_SIZE=200 HAKMEM_BENCH_MAX_SIZE=500 \

---

## Research Profile 4: C6_ULTRA_INTRUSIVE_EXPERIMENT_V12 (C6 ULTRA intrusive LIFO vs array magazine, Phase TLS-UNIFY-3)

**FROZEN - Research Only**: Phase TLS-UNIFY-3 validation complete.
Findings:
- C6-heavy (257-512B): +3.8% improvement ✅
- Mixed (16-1024B): -12% to -14% regression ❌ (policy overhead + TLS contention)
- Recommendation: Use only for C6-heavy workloads or research/debugging
- Default: OFF (MID v3/v3.5 faster for Mixed)

### Purpose
- **Phase TLS-UNIFY-3 validation**: compare the C6 ULTRA intrusive LIFO freelist against the array magazine.
- Route C6 to the ULTRA path and A/B only the LIFO representation inside TLS.
- ULTRA routing overrides MID v3/v3.5, so use it only in a research context.

### Measured performance
- C6-heavy (257-512B, 1M iter, ws=200, 5-run mean):
  - Baseline (C6=MID v3.5): 55.3M ops/s
  - ULTRA+array (intrusive OFF): 57.4M ops/s (+3.79% vs Baseline)
  - ULTRA+intrusive (intrusive ON): 54.5M ops/s (-1.44% vs Baseline, ✅ PASS)
  - IFL stats: push=265,890 / pop=265,815 / fallback=0 (perfect LIFO behavior)
- Mixed 16-1024B (standard mainline):
  - **ULTRA+intrusive shows a regression of about -14%.**
  - Root cause:
    - Eight classes (C0-C7) compete within a single TLS, and the ULTRA TLS for C4/C5/C6/C7 (about 2 KB) ends up being fought over.
    - ULTRA misses increase, and Legacy fallback reaches about 24%.
  - **Conclusion**: do not use C6 ULTRA on the Mixed mainline (as `MIXED_TINYV3_C7_SAFE` is designed).

### ENV (ULTRA intrusive opt-in)
```sh
HAKMEM_BENCH_MIN_SIZE=257
HAKMEM_BENCH_MAX_SIZE=512
HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 # ★ route C6 to the ULTRA path
HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1 # ★ enable the intrusive LIFO freelist
HAKMEM_FREE_PATH_STATS=1 # for collecting stats
```

### Test commands
```sh
# Baseline: C6=MID v3.5 (no ULTRA routing)
HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
./bench_random_mixed_hakmem 1000000 200 1

# ULTRA+array: array magazine (intrusive OFF)
HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=0 \
HAKMEM_FREE_PATH_STATS=1 ./bench_random_mixed_hakmem 1000000 200 1

# ULTRA+intrusive: intrusive LIFO freelist
HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1 \
HAKMEM_FREE_PATH_STATS=1 ./bench_random_mixed_hakmem 1000000 200 1
```

### Expected values
- ULTRA+intrusive >= Baseline (or a small regression < 5%)
- c6_ifl_fallback ≈ 0 (intrusive LIFO operating normally)
- c6_ultra_free/alloc > 0 (ULTRA path in use)

### Notes
- **WARNING**: ULTRA routing overrides MID v3/v3.5 - use only in research context.
- **Usage**: use as a research box dedicated to C6-heavy workloads (not recommended on the Mixed mainline; it regresses).
- Keep it out of the mainline; treat it as a research box.

---

### Common notes
- Stacking one-off ENV settings on top of a preset makes results hard to reproduce, so start from one of the presets above and always note what you changed.
- The v2 family (Pool v2 / Tiny v2) is opt-in per benchmark. Keep it at 0 unless needed.

docs/analysis/MID_V35_HOTPATH_OPT_1_DESIGN.md (new file, 204 lines)
@ -0,0 +1,204 @@
# Phase MID-V35-HOTPATH-OPT-1-DESIGN

## Purpose

Optimize the hottest alloc-side path on the Mixed mainline (MIXED_TINYV3_C7_SAFE): the MID v3.5 / C5-C6 path.

**Background**:
- Phase POLICY-FAST-PATH-V2 could not break down the free-side hot spot (branch cost at large WS)
- perf: `tiny_alloc_gate_fast` accounts for 19-26% (most pronounced in C6-heavy)
- MID v3.5 serves C5/C6 and sits on the critical path of the Mixed mainline

## Current state analysis

### Hot path of `small_mid_v35_alloc()` (core/smallobject_mid_v35.c:71-116)

```c
void* small_mid_v35_alloc(uint32_t class_idx, size_t size) {
    (void)size;  // ① unused parameter
    if (class_idx < 5 || class_idx > 7) return NULL;  // ② bounds check

    SmallMidV35TlsCtx *ctx = &tls_mid_v35_ctx;

    // Fast path
    if (ctx->page[class_idx] && ctx->offset[class_idx] < ctx->capacity[class_idx]) {
        size_t slot_size = g_slot_sizes[class_idx];  // ③ array lookup
        void *base = (char*)ctx->page[class_idx] + ctx->offset[class_idx] * slot_size;
        ctx->offset[class_idx]++;

        // ④ stats update (not strictly required)
        if (ctx->meta[class_idx]) {
            ctx->meta[class_idx]->alloc_count++;
        }

        // ⑤ header write
        tiny_region_id_write_header(base, class_idx);

        return (char*)base + 1;
    }

    // Slow path: ColdIface refill
    ...
}
```

### Work that can be removed

| # | Location | Current | Proposed optimization | Expected gain |
|---|----------|---------|-----------------------|---------------|
| ① | `(void)size` | passed on every call | remove the size parameter | negligible |
| ② | bounds check | every call | guarantee in the caller, remove here | ~1-2 cycles |
| ③ | `g_slot_sizes[class_idx]` | array lookup | keep slot_size in the TLS ctx | ~1-2 cycles |
| ④ | `alloc_count++` | every call | make it disableable via an ENV gate | ~2-3 cycles |
| ⑤ | header write | required on every call | cannot be optimized (required) | - |

### Hot path of `small_mid_v35_free()` (core/smallobject_mid_v35.c:123-160)

```c
void small_mid_v35_free(void *ptr, uint32_t class_idx) {
    if (!ptr || class_idx < 5 || class_idx > 7) return;  // ① bounds check

    void *base = (char*)ptr - 1;

    // ② page_base computation (every call)
    size_t page_size = 64 * 1024;
    void *page_base = (void*)((uintptr_t)base & ~(page_size - 1));

    SmallMidV35TlsCtx *ctx = &tls_mid_v35_ctx;
    SmallPageMeta_MID_v3 *meta = ctx->meta[class_idx];

    if (meta && meta->ptr == page_base) {
        // ③ stats update
        meta->free_count++;

        // ④ retire check
        if (meta->free_count >= meta->capacity) {
            small_cold_mid_v3_retire_page(meta);
            ...
        }
    }
    ...
}
```

| # | Location | Current | Proposed optimization | Expected gain |
|---|----------|---------|-----------------------|---------------|
| ① | bounds check | every call | guarantee in the caller | ~1-2 cycles |
| ② | page_base computation | `& ~(64K-1)` | keep page_base in the TLS ctx | ~2-3 cycles |
| ③ | `free_count++` | every call | make it disableable via an ENV gate | ~2-3 cycles |
| ④ | retire check | branch on every call | check only once every N calls (batch retire) | ~1-2 cycles avg |
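
The page-base masking in row ② can be looked at in isolation. The following is a minimal sketch, not the project's code: `SKETCH_PAGE_SIZE`, `sketch_page_base_of`, and `sketch_same_page` are hypothetical names, and the range check is only one possible way a cached page base could stand in for the per-call mask. The design does not spell out the exact comparison it intends.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size, taken from the free() listing: 64 KiB (a power of two). */
#define SKETCH_PAGE_SIZE ((uintptr_t)64 * 1024)

/* Round an address down to the start of its 64 KiB page (the per-call mask). */
static inline uintptr_t sketch_page_base_of(uintptr_t addr) {
    /* The mask trick is only valid when the page size is a power of two. */
    assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0);
    return addr & ~(SKETCH_PAGE_SIZE - 1);
}

/* Hypothetical alternative: test membership against a cached page base.
 * Unsigned subtraction wraps for addresses below the base, so a single
 * comparison rejects both "below" and "at or beyond the end". */
static inline int sketch_same_page(uintptr_t addr, uintptr_t cached_base) {
    return (addr - cached_base) < SKETCH_PAGE_SIZE;
}
```

Whether the cached form actually saves the estimated cycles would have to be confirmed by the A/B run the design calls for.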

## Boxing plan

### Box 1: Stats Gate (reduction priority: high)

**New file**: `core/box/mid_v35_stats_gate_box.h`

```c
#include <stdbool.h>  // bool
#include <stdlib.h>   // getenv

// ENV: HAKMEM_MID_V35_STATS_ENABLED (default 1)
static inline bool mid_v35_stats_enabled(void) {
    static int g_enabled = -1;  // -1 = not yet read from the environment
    if (__builtin_expect(g_enabled == -1, 0)) {
        const char* e = getenv("HAKMEM_MID_V35_STATS_ENABLED");
        g_enabled = (e && *e == '0') ? 0 : 1;  // default ON for compatibility
    }
    return g_enabled;
}

// Increment a counter only while the stats gate is open.
#define MID_V35_STAT_INC(field) \
    do { if (__builtin_expect(mid_v35_stats_enabled(), 1)) { (field)++; } } while(0)
```
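
**Applies to**:
- `ctx->meta[class_idx]->alloc_count++` → `MID_V35_STAT_INC(ctx->meta[class_idx]->alloc_count)`
- `meta->free_count++` → `MID_V35_STAT_INC(meta->free_count)`

Note that in the `small_mid_v35_free()` listing, `free_count` also drives the retire check (`free_count >= capacity`), so gating that increment needs care to keep page retirement working.

**Rollback knob**: `HAKMEM_MID_V35_STATS_ENABLED=0` disables stats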

### Box 2: TLS Layout Optimization (reduction priority: medium)

**File to change**: `core/smallobject_mid_v35.c`

```c
typedef struct {
    void *page[8];
    uint32_t offset[8];
    uint32_t capacity[8];
    SmallPageMeta_MID_v3 *meta[8];
    // Added fields
    size_t slot_size[8];  // cached slot size
    void *page_base[8];   // cached page base (for free comparison)
} SmallMidV35TlsCtx;
```

**Effect**:
- Turn the `g_slot_sizes[class_idx]` array lookup into a TLS read (better cache hit rate)
- Reduce the `page_base` computation to once per refill
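
To make the refill-time caching concrete, here is a minimal sketch under stated assumptions. `SketchTlsCtx`, `sketch_refill`, and `sketch_alloc` are hypothetical stand-ins, not the real `SmallMidV35TlsCtx` or ColdIface API. The sketch omits the header byte and the stats update and shows only the idea that `slot_size` and `page_base` are written once on the slow path and merely read on the fast path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CLASS_COUNT 8  /* mirrors the [8] arrays in the struct above */

/* Hypothetical, trimmed-down TLS context: only the fields this sketch needs. */
typedef struct {
    void    *page[SKETCH_CLASS_COUNT];
    uint32_t offset[SKETCH_CLASS_COUNT];
    uint32_t capacity[SKETCH_CLASS_COUNT];
    size_t   slot_size[SKETCH_CLASS_COUNT];  /* cached at refill */
    void    *page_base[SKETCH_CLASS_COUNT];  /* cached at refill */
} SketchTlsCtx;

/* Slow path: install a fresh page and cache everything derived from it. */
static void sketch_refill(SketchTlsCtx *ctx, uint32_t class_idx,
                          void *page, size_t slot_size, size_t page_size) {
    assert(class_idx < SKETCH_CLASS_COUNT && slot_size > 0);
    ctx->page[class_idx]      = page;
    ctx->offset[class_idx]    = 0;
    ctx->capacity[class_idx]  = (uint32_t)(page_size / slot_size);
    ctx->slot_size[class_idx] = slot_size;  /* no global table lookup later */
    ctx->page_base[class_idx] = page;       /* no per-free mask needed later */
}

/* Fast path: bump allocation that reads only TLS-resident fields.
 * Returns NULL when the page is missing or exhausted (caller would refill). */
static void *sketch_alloc(SketchTlsCtx *ctx, uint32_t class_idx) {
    if (!ctx->page[class_idx] ||
        ctx->offset[class_idx] >= ctx->capacity[class_idx]) {
        return NULL;
    }
    void *slot = (char *)ctx->page[class_idx]
               + (size_t)ctx->offset[class_idx] * ctx->slot_size[class_idx];
    ctx->offset[class_idx]++;
    return slot;
}
```

The two extra arrays add 128 bytes to the context on a 64-bit target, which is the "larger TLS footprint" risk the design lists for Box 2.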

### Box 3: Boundary Check Elimination (reduction priority: low)

**File to change**: `core/front/malloc_tiny_fast.h`

Guarantee `class_idx >= 5 && class_idx <= 7` on the caller side and remove the checks inside `small_mid_v35_alloc/free`.

**Risk**: a crash if called with a wrong class_idx (add an assert in DEBUG builds)
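
The "caller guarantees, callee asserts" split could look roughly like the sketch below. All names (`SKETCH_MID_CLASS_MIN`, `sketch_is_mid_v35_class`, `sketch_mid_slot_index`) are hypothetical, and the project's actual DEBUG macro is not shown in the source. The sketch relies on the standard behavior that `assert()` compiles to nothing when `NDEBUG` is defined.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed class range, taken from the checks in the listings: 5..7. */
#define SKETCH_MID_CLASS_MIN 5u
#define SKETCH_MID_CLASS_MAX 7u

/* Caller-side routing test: one unsigned compare covers both bounds,
 * because values below the minimum wrap around to a large number. */
static inline int sketch_is_mid_v35_class(uint32_t class_idx) {
    return (uint32_t)(class_idx - SKETCH_MID_CLASS_MIN)
           <= (SKETCH_MID_CLASS_MAX - SKETCH_MID_CLASS_MIN);
}

/* Hot-path entry: the range check survives only as an assert, so release
 * builds (NDEBUG defined) pay nothing for it. */
static inline uint32_t sketch_mid_slot_index(uint32_t class_idx) {
    assert(sketch_is_mid_v35_class(class_idx));
    return class_idx - SKETCH_MID_CLASS_MIN;  /* 0..2 */
}
```

With this shape, a wrong `class_idx` aborts loudly in debug builds and is undefined in release builds, which matches the medium risk rating the design assigns to Box 3.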

## Boundary definition

```
┌─────────────────────────────────────────────────────────────┐
│ malloc_tiny_fast.h                                          │
│ (caller: guarantees class_idx, decides routing)             │
├─────────────────────────────────────────────────────────────┤
│ smallobject_mid_v35.c                                       │
│ L1 HotBox: TLS fast path (alloc/free)                       │
│   - Stats Gate (Box 1) makes stats updates conditional      │
│   - TLS Layout (Box 2) trims array lookups/computation      │
├─────────────────────────────────────────────────────────────┤
│ smallobject_cold_iface_mid_v3.c                             │
│ L2 ColdIface: refill/retire (unchanged)                     │
└─────────────────────────────────────────────────────────────┘
```

## Rollback knobs

| ENV | Default | Description |
|-----|---------|-------------|
| `HAKMEM_MID_V35_STATS_ENABLED` | 1 (ON) | 0 disables stats updates |
| `HAKMEM_MID_V35_FAST_TLS` | 0 (OFF) | 1 enables the TLS layout optimization |

## Implementation order

1. **Phase 1 (Box 1)**: add the Stats Gate
   - Create `mid_v35_stats_gate_box.h`
   - Apply it in `smallobject_mid_v35.c`
   - A/B: `HAKMEM_MID_V35_STATS_ENABLED=0` vs `=1`
   - Expected: +2-5% (stats overhead removed)

2. **Phase 2 (Box 2)**: TLS layout optimization
   - Extend `SmallMidV35TlsCtx`
   - Cache slot_size/page_base at refill time
   - A/B: `HAKMEM_MID_V35_FAST_TLS=1` vs `=0`
   - Expected: +1-3% (fewer array lookups/computations)

3. **Phase 3 (Box 3)**: remove the boundary check (optional)
   - Keep only the DEBUG assert
   - Remove unconditionally in Release builds
   - Expected: +0.5-1%

## Summary of expected gains

| Box | Target | Expected gain | Risk |
|-----|--------|---------------|------|
| Box 1 | Stats Gate | +2-5% | Low (disabling stats is for research) |
| Box 2 | TLS Layout | +1-3% | Low (larger TLS footprint) |
| Box 3 | Boundary Check | +0.5-1% | Medium (crash on wrong class_idx) |

**Total expected**: +3-9% (Mixed mainline)

## Next actions

1. Review this design doc
2. Implement Phase 1 (Box 1: Stats Gate)
3. Confirm the effect with A/B tests
4. Decide whether to proceed to Phases 2 and 3
docs/analysis/PHASE_3_GRADUATE_C6_ULTRA_RESULTS.md (new file, 275 lines)
@ -0,0 +1,275 @@
# Phase 3-GRADUATE: C6 ULTRA Intrusive LIFO Validation Results

**Phase**: TLS-UNIFY-3 (Phase 3-GRADUATE)
**Date**: 2025-12-12
**Objective**: Validate C6 ULTRA intrusive LIFO freelist vs array magazine performance
**Test**: C6-heavy workload (257-512B, 1M iterations, ws=200)

---

## Executive Summary

✅ **OVERALL STATUS: PASS**

ULTRA+intrusive implementation meets all graduation criteria:
- Performance: -1.44% vs Baseline (within <5% tolerance)
- Intrusive LIFO: Working correctly with 0 fallback
- Array magazine shows +3.79% improvement, but intrusive design validated for correctness

---

## Phase 3-GRADUATE-0: Research Preset Addition

**Status**: ✅ Complete

Added new research preset `C6_ULTRA_INTRUSIVE_EXPERIMENT_V12` to:
- **File**: `/mnt/workdisk/public_share/hakmem/docs/analysis/ENV_PROFILE_PRESETS.md`
- **Description**: Phase TLS-UNIFY-3 validation - C6 ULTRA intrusive LIFO vs array magazine
- **Environment Variables**:
  - `HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1` (routes C6 to ULTRA path)
  - `HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1` (enables intrusive LIFO)
- **Warning**: ULTRA routing overrides MID v3/v3.5 - use only in research context
- **Usage**: Mixed or C6-heavy workloads - adjust HAKMEM_BENCH_MIN_SIZE/MAX_SIZE as needed

---

## Phase 3-GRADUATE-1: C6-Heavy A/B Test Results

### Test Configuration

- **Workload**: Random mixed allocation/deallocation
- **Working Set**: ws=200
- **Size Range**: 257-512B (C6 class only)
- **Iterations**: 1,000,000 per run
- **Runs per condition**: 5
- **Environment**: `HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512`

### Test Conditions

1. **Baseline**: C6=MID v3.5 (no ULTRA routing)
2. **ULTRA+array**: C6=ULTRA with array magazine (intrusive FL OFF)
3. **ULTRA+intrusive**: C6=ULTRA with intrusive LIFO (intrusive FL ON)

---

## Detailed Results

### Condition 1: Baseline (C6=MID v3.5)

**Command**:
```bash
HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
./bench_random_mixed_hakmem 1000000 200
```

**Results** (5 runs):
- Run 1: 54,742,076 ops/s
- Run 2: 57,557,163 ops/s
- Run 3: 56,503,212 ops/s
- Run 4: 52,315,248 ops/s
- Run 5: 55,362,087 ops/s
- **Mean**: 55,295,957 ops/s
- **StdDev**: 1,985,340

**Route**: C6 → LEGACY (MID v3.5 path)

---

### Condition 2: ULTRA+array (Array Magazine)

**Command**:
```bash
HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 \
HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=0 \
HAKMEM_FREE_PATH_STATS=1 \
./bench_random_mixed_hakmem 1000000 200
```

**Results** (5 runs):
- Run 1: 57,122,577 ops/s
- Run 2: 58,482,856 ops/s
- Run 3: 56,339,501 ops/s
- Run 4: 57,055,995 ops/s
- Run 5: 57,944,578 ops/s
- **Mean**: 57,389,101 ops/s
- **StdDev**: 834,942

**Performance vs Baseline**: +3.79%

**Stats**:
- c6_ultra_free: 265,890
- c6_ultra_alloc: 265,815
- c6_ifl_push: 0 (array magazine mode)
- c6_ifl_pop: 0
- c6_ifl_fallback: 0

**Route**: C6 → ULTRA (array magazine)

---

### Condition 3: ULTRA+intrusive (Intrusive LIFO)

**Command**:
```bash
HAKMEM_BENCH_MIN_SIZE=257 HAKMEM_BENCH_MAX_SIZE=512 \
HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1 \
HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1 \
HAKMEM_FREE_PATH_STATS=1 \
./bench_random_mixed_hakmem 1000000 200
```

**Results** (5 runs):
- Run 1: 56,710,065 ops/s
- Run 2: 56,314,297 ops/s
- Run 3: 52,936,109 ops/s
- Run 4: 50,111,993 ops/s
- Run 5: 56,427,447 ops/s
- **Mean**: 54,499,982 ops/s
- **StdDev**: 2,897,908

**Performance vs Baseline**: -1.44%

**Stats**:
- c6_ultra_free: 265,890
- c6_ultra_alloc: 265,815
- c6_ifl_push: 265,890 (intrusive LIFO active)
- c6_ifl_pop: 265,815
- c6_ifl_fallback: 0 ✅

**Route**: C6 → ULTRA (intrusive LIFO freelist)

---

## Evaluation Against Graduation Gates

### Gate 1: C6-heavy Performance

**Criteria**: ULTRA+intrusive >= Baseline (or small regression < 5%)

**Result**: -1.44% vs Baseline

**Status**: ✅ PASS

- Regression is within the acceptable tolerance of 5%
- Performance is competitive with baseline MID v3.5 implementation
- Variability (StdDev: 2.9M) suggests potential for optimization

### Gate 2: Intrusive LIFO Fallback Rate

**Criteria**: c6_ifl_fallback maintained at low level (close to 0)

**Result**: c6_ifl_fallback = 0

**Status**: ✅ PASS

- Perfect LIFO behavior with zero fallback
- All 265,890 frees successfully used intrusive freelist
- Push/pop counts are consistent: 265,890 pushes, 265,815 pops
- The delta of 75 is the number of blocks pushed but never popped, i.e. still sitting on the freelist when the benchmark ended

---

## Analysis and Insights

### Performance Comparison Summary

| Condition | Mean (ops/s) | vs Baseline | StdDev | Route |
|-----------|--------------|-------------|---------|-------|
| Baseline (MID v3.5) | 55,295,957 | - | 1,985,340 | LEGACY |
| ULTRA+array | 57,389,101 | +3.79% | 834,942 | ULTRA |
| ULTRA+intrusive | 54,499,982 | -1.44% | 2,897,908 | ULTRA |

### Key Observations

1. **Array Magazine Wins in Raw Performance**:
   - ULTRA+array shows +3.79% improvement over baseline
   - Lowest standard deviation (834,942) indicates stable performance
   - Best performer in this C6-heavy workload

2. **Intrusive LIFO Shows Acceptable Performance**:
   - -1.44% regression is within tolerance (<5%)
   - Higher standard deviation (2,897,908) suggests room for optimization
   - Zero fallback demonstrates correct implementation

3. **Intrusive LIFO Correctness Validated**:
   - c6_ifl_fallback=0 confirms intrusive freelist working perfectly
   - Push/pop balance (265,890/265,815) shows proper LIFO behavior
   - No corruption or failures during 1M iterations

4. **Route Assignment Working**:
   - Baseline: C6 → LEGACY (as expected)
   - ULTRA modes: C6 → ULTRA (routing override working)
   - C7 remains LEGACY in all cases (only C6 affected)

### Performance Variability Analysis

**Standard Deviation Comparison**:
- Baseline: 1.99M ops/s (3.6% CV)
- ULTRA+array: 0.83M ops/s (1.5% CV)
- ULTRA+intrusive: 2.90M ops/s (5.3% CV)

The higher variability in ULTRA+intrusive suggests:
- Potential cache/TLB effects from intrusive pointer manipulation
- Opportunity for micro-optimization in the intrusive path
- Still within acceptable bounds for research validation
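
The summary statistics above can be recomputed from the per-run numbers. The sketch below assumes the reported StdDev is the sample standard deviation (n - 1 denominator), which is what the baseline figures appear to match. All function names are hypothetical, and the small Newton-iteration square root is used only so the snippet does not depend on linking libm.

```c
#include <assert.h>
#include <stddef.h>

/* Newton iteration for sqrt; stands in for libm's sqrt() in this sketch. */
static double sketch_sqrt(double v) {
    if (v <= 0.0) return 0.0;
    double x = v > 1.0 ? v : 1.0;
    for (int i = 0; i < 100; i++) x = 0.5 * (x + v / x);
    return x;
}

/* Arithmetic mean of n samples. */
static double sketch_mean(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* Sample standard deviation (divides by n - 1). */
static double sketch_sample_stddev(const double *x, size_t n) {
    assert(n > 1);
    double m = sketch_mean(x, n), ss = 0.0;
    for (size_t i = 0; i < n; i++) ss += (x[i] - m) * (x[i] - m);
    return sketch_sqrt(ss / (double)(n - 1));
}

/* Coefficient of variation in percent: stddev relative to the mean. */
static double sketch_cv_percent(const double *x, size_t n) {
    return 100.0 * sketch_sample_stddev(x, n) / sketch_mean(x, n);
}

/* Relative change of value against a baseline, in percent. */
static double sketch_delta_percent(double value, double baseline) {
    return (value / baseline - 1.0) * 100.0;
}
```

Feeding the five baseline runs into these helpers should land close to the reported mean, StdDev, and CV. With only five runs per condition, a -1.44% difference is smaller than the run-to-run spread, so it is best read as "no clear regression" rather than a precise measurement.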

---

## Conclusions

### Phase 3-GRADUATE Status: ✅ PASS

Both phases completed successfully:

**Phase 3-GRADUATE-0 (Research Preset)**:
- ✅ Research preset `C6_ULTRA_INTRUSIVE_EXPERIMENT_V12` added to ENV_PROFILE_PRESETS.md
- ✅ Documentation includes warnings, usage guidelines, and test commands
- ✅ Performance results updated with actual test data

**Phase 3-GRADUATE-1 (C6-Heavy A/B Test)**:
- ✅ Gate 1: Performance regression -1.44% (within <5% tolerance)
- ✅ Gate 2: c6_ifl_fallback=0 (perfect LIFO behavior)
- ✅ 5 runs per condition completed successfully
- ✅ Statistical analysis shows acceptable variability

### Recommendations

1. **For Production**: Use ULTRA+array configuration
   - Best performance (+3.79% over baseline)
   - Lowest variability (StdDev: 834K)
   - Proven stability

2. **For Research**: ULTRA+intrusive validated for correctness
   - Zero fallback confirms implementation correctness
   - Performance acceptable for further optimization work
   - Good foundation for TLS unification experiments

3. **Next Steps**:
   - Consider mixed workload testing (16-1024B) to validate broader impact
   - Investigate sources of variability in intrusive path
   - Profile intrusive LIFO to identify optimization opportunities
   - Consider hybrid approach: array for hot path, intrusive for cold/overflow

### Files Modified

- `/mnt/workdisk/public_share/hakmem/docs/analysis/ENV_PROFILE_PRESETS.md`
  - Added Research Profile 4: C6_ULTRA_INTRUSIVE_EXPERIMENT_V12
  - Updated with actual performance results

### Performance Data Archive

All raw data available in this session:
- Baseline runs: 5 iterations completed
- ULTRA+array runs: 5 iterations completed
- ULTRA+intrusive runs: 5 iterations completed
- FREE_PATH_STATS captured for all ULTRA conditions
- IFL stats (push/pop/fallback) captured for intrusive mode

---

**Test Execution**: 2025-12-12
**Total Runs**: 15 (5 per condition)
**Test Environment**: /mnt/workdisk/public_share/hakmem
**Benchmark Binary**: bench_random_mixed_hakmem
**Git Branch**: master (Phase v11a-4+)
docs/analysis/PHASE_3_GRADUATE_FINAL_REPORT.md (new file, 128 lines)
@ -0,0 +1,128 @@
# Phase 3-GRADUATE Final Report: TLS-UNIFY-3 Validation Complete

## Executive Summary

**Status**: Phase TLS-UNIFY-3 validation complete; C6 ULTRA is **FROZEN as a research-only box** (not enabled for production).

The C6 ULTRA intrusive LIFO implementation has been successfully validated and demonstrates stable operation with correct semantics. However, Mixed workload testing revealed significant performance regression due to architectural constraints that make this approach unsuitable for general-purpose production use.

**Key Findings**:
- C6-heavy workload (257-512B): **+3.8% improvement** ✅
- Mixed workload (16-1024B): **-12% to -14% regression** ❌
- Root cause identified: policy overhead + TLS contention in multi-class scenarios
- Decision: **C6 ULTRA remains research box only, default OFF in mainline**

## Technical Validation Results

### C6-heavy Workload Performance

**Test Configuration**:
- Size range: 257-512B (C6-dominant)
- Iterations: 1M, working set: 200
- 5-run mean comparison

**Results**:
```
Baseline (C6=MID v3.5): 55.3M ops/s
ULTRA+array (intrusive OFF): 57.4M ops/s (+3.79%)
ULTRA+intrusive (intrusive ON): 54.5M ops/s (-1.44%, within tolerance)
```

**Intrusive LIFO Statistics**:
- `c6_ifl_push`: 265,890
- `c6_ifl_pop`: 265,815
- `c6_ifl_fallback`: 0 (perfect intrusive LIFO operation)

**Verdict**: Intrusive LIFO implementation is **functionally correct** and performs within acceptable range for C6-heavy workloads.

### Mixed Workload Regression Analysis

**Test Configuration**:
- Size range: 16-1024B (8 classes: C0-C7)
- Standard Mixed benchmark
- Production profile: `MIXED_TINYV3_C7_SAFE`

**Results**:
```
Baseline (MID v3/v3.5): ~32-33M ops/s
ULTRA+intrusive: ~28-29M ops/s (-12% to -14% regression)
```

**Root Cause Analysis**:

1. **TLS Contention**:
   - 8 size classes (C0-C7) compete for limited TLS budget (~2KB per ULTRA class)
   - C4/C5/C6/C7 ULTRA TLS regions create memory pressure
   - Frequent TLS misses force fallback to slower Legacy path
   - Legacy fallback rate: ~24% (vs. <5% in C6-heavy)

2. **Policy Overhead**:
   - Multi-class routing increases policy snapshot frequency
   - Each allocation/free triggers class determination
   - Branch mispredictions in 8-way routing paths
   - Overhead amplified in mixed-size workloads

3. **Architectural Constraint**:
   - ULTRA path designed for single-class optimization
   - TLS budget insufficient for 8-class simultaneous hot operation
   - MID v3/v3.5 shared pool model more efficient for Mixed workloads

## Recommendation

### Production Status
- **C6 ULTRA**: Research box only, **not enabled in mainline**
- **Default configuration**: MID v3/v3.5 (faster for Mixed workloads)
- **ENV_PROFILE_PRESETS.md**: Updated with FROZEN warning

### Usage Guidelines
1. **Use C6 ULTRA only when**:
   - Workload is C6-heavy (>80% allocations in 257-512B range)
   - Research/debugging context where regression is acceptable
   - Explicit opt-in with full understanding of Mixed regression

2. **Do NOT use C6 ULTRA for**:
   - Mixed workloads (16-1024B)
   - Production deployments
   - Performance-critical paths

3. **Mainline configuration remains**:
   - `MIXED_TINYV3_C7_SAFE`: MID v3/v3.5 for C6, ULTRA off
   - `C6_HEAVY_LEGACY_POOLV1`: MID v3.5 for C6-heavy workloads

## Next Steps

### Immediate Actions
1. ✅ Freeze C6_ULTRA_INTRUSIVE_EXPERIMENT_V12 preset with warning
2. ✅ Document findings in PHASE_3_GRADUATE_FINAL_REPORT.md
3. ⏳ Collect performance baselines for next phase selection
4. ⏳ Update CURRENT_TASK.md with Phase 3-GRADUATE closure

### Next Phase Target

Based on Phase 3 findings, the performance bottleneck has shifted:

**Candidate focus areas**:
1. **MID/POOL v3 Optimization**: Since MID v3/v3.5 is now the primary path for C6/C7 in Mixed workloads, optimize:
   - Segment retire logic (potential hotspot)
   - Cold object handling
   - TLS descriptor cache efficiency

2. **Policy Optimization**: Reduce multi-class routing overhead:
   - Fast path specialization
   - Branch prediction optimization
   - Policy snapshot frequency tuning

3. **Learner Tuning**: Improve dynamic route selection:
   - Threshold calibration for workload transitions
   - Route switch hysteresis to reduce thrashing
   - Per-thread learner state optimization

**Decision criteria**: Run performance baselines (next step) to identify actual hotspots and select next phase target.

## Conclusion

Phase TLS-UNIFY-3 successfully validated the C6 ULTRA intrusive LIFO implementation as a **special-purpose research tool** for C6-heavy workloads. The Mixed workload regression is an expected architectural trade-off, not a bug.

The decision to keep C6 ULTRA as a research box (default OFF) aligns with the project's philosophy of **measured opt-in** for experimental features. MID v3/v3.5 remains the production-grade solution for both Mixed and C6-heavy workloads.

**Phase Status**: TLS-UNIFY-3 COMPLETE ✅ - Graduated to frozen research preset.
@ -56,7 +56,7 @@ Slots/page: 65536 / 512 = 128 slots
**Important**: always access the next pointer through `tiny_next_store/load()` (direct `*(void**)` access is forbidden).

```c
// core/box/c6_intrusive_freelist_box.h
// core/box/tiny_c6_intrusive_freelist_box.h

#include "../tiny_nextptr.h"

@ -121,23 +121,22 @@ void c6_ultra_free_intrusive(void* base) {

### Relationship to the existing C6 ULTRA

| Phase | C6 ULTRA scheme | ENV gate |
|-------|---------------|----------|
| v11 (current) | array magazine | `HAKMEM_C6_ULTRA_ENABLED=1` |
| v12 (new) | intrusive LIFO | `HAKMEM_C6_ULTRA_V12=1` |
| Mode | TLS freelist scheme | ENV gate |
|------|------------------|----------|
| array magazine (v11) | array magazine | `HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1` + `HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=0` |
| intrusive LIFO (v12) | intrusive LIFO | `HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1` + `HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1` |

- v11 and v12 are mutually exclusive (both ON is undefined)
- Switch via ENV during the A/B test period
- Promote v12 to the default once stable
- The ULTRA route itself is enabled with `HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1` (Policy L0).
- Switching between intrusive and array is unified under `HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL`.

## 3. Migration plan

### Phase 1: implement the v12 lane (separate ENV)
### Phase 1: implement the intrusive lane (separate ENV)

- Enable with `HAKMEM_C6_ULTRA_V12=1`
- Keep the existing C6 ULTRA (v11 array) with `HAKMEM_C6_ULTRA_V12=0`
- Add a `c6_head` field to TinyUltraTlsCtx (for v12)
- Branch between v12/v11 in the Policy route
- `HAKMEM_TINY_C6_ULTRA_FREE_ENABLED=1` raises C6 to the ULTRA route.
- `HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL=1` enables the intrusive LIFO (default OFF).
- When intrusive is OFF, the existing array magazine is used.
- Add `c6_head` to TinyUltraTlsCtx and switch only at the case 6 TLS pop/push boundary.

### Phase 2: A/B test

@ -147,8 +146,8 @@ void c6_ultra_free_intrusive(void* base) {

### Phase 3: v12 promotion

- Once v12 is stable, make `HAKMEM_C6_ULTRA_V12=1` the default
- The v11 array scheme becomes deprecated, to be removed later
- The win on C6-heavy is confirmed, but the Mixed mainline regressed, so it does not become the mainline default.
- The intrusive LIFO is **recommended only as an opt-in in the C6-heavy research box**.

## 4. Consistency with TLS unification

@ -171,16 +170,10 @@ typedef struct TinyUltraTlsCtx {
    void* c4_freelist[64];
    void* c5_freelist[64];

    // C6: intrusive LIFO (v12)
    void* c6_head;  // NEW: intrusive head
    // void* c6_freelist[128];  // REMOVED in v12

    // or: keep both via conditional compilation
#if HAKMEM_C6_ULTRA_V12
    void* c6_head;
#else
    void* c6_freelist[128];
#endif
    // C6: baseline array magazine + intrusive head (v12)
    void* c6_head;  // intrusive head (when INTRUSIVE_FL=1)
    void* c6_freelist[128];  // baseline magazine (when INTRUSIVE_FL=0)
    // The Mixed mainline does not use ULTRA at all, so the shrink phase is skipped
} TinyUltraTlsCtx;
```

@ -238,7 +231,7 @@ typedef struct TinyUltraTlsCtx {
   - Coexists with the existing C6 ULTRA

2. **Create C6IntrusiveFreeListBox**
   - `core/box/c6_intrusive_freelist_box.h`
   - `core/box/tiny_c6_intrusive_freelist_box.h`
   - static inline, only `c6_ifl_push/pop/empty`
   - Always go through `tiny_next_store/load(base, 6, ...)` (no direct writes)
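
The push/pop/empty trio described in this step can be sketched as follows. This is an illustration under stated assumptions, not the contents of the real header: `sketch_next_store` and `sketch_next_load` are hypothetical stand-ins for `tiny_next_store/load()` (whose class argument and in-block offset are not shown in this diff), and `SketchC6Lifo` stands in for the `c6_head` field of `TinyUltraTlsCtx`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for tiny_next_store/load(): they keep the "next"
 * pointer in the first bytes of the free block. memcpy avoids alignment
 * and aliasing problems that a direct *(void**) access could cause. */
static inline void sketch_next_store(void *base, void *next) {
    memcpy(base, &next, sizeof next);
}
static inline void *sketch_next_load(const void *base) {
    void *next;
    memcpy(&next, base, sizeof next);
    return next;
}

/* Per-thread list head; the free blocks themselves form the linked list. */
typedef struct { void *head; } SketchC6Lifo;

/* Push a freed block: it becomes the new head and links to the old one. */
static inline void sketch_ifl_push(SketchC6Lifo *l, void *base) {
    assert(base != NULL);
    sketch_next_store(base, l->head);
    l->head = base;
}

/* Pop the most recently freed block, or NULL when the list is empty. */
static inline void *sketch_ifl_pop(SketchC6Lifo *l) {
    void *base = l->head;
    if (base) l->head = sketch_next_load(base);
    return base;
}

static inline int sketch_ifl_empty(const SketchC6Lifo *l) {
    return l->head == NULL;
}
```

Compared with the 128-entry array magazine, this representation needs only one pointer of TLS per class, at the cost of touching the freed block's memory on every push and pop.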
@ -263,6 +256,27 @@ typedef struct TinyUltraTlsCtx {

---

## 7. Implementation / verification results (TLS-UNIFY-3)

- Implementation files:
  - `core/box/tiny_c6_ultra_intrusive_env_box.h`
  - `core/box/tiny_c6_intrusive_freelist_box.h`
  - `core/box/tiny_ultra_tls_box.h`
  - `core/box/tiny_ultra_tls_box.c`
  - `core/tiny_debug_ring.h`
  - `core/box/free_path_stats_box.h`
  - `core/box/free_path_stats_box.c`
- Verification (C6-heavy, ws=200, 5-run mean):
  - ULTRA route + intrusive in use: 57.6M ops/s, push=265,890 pop=265,815 fallback=0
  - array magazine for comparison: 56.6M ops/s (c6_ifl_* are 0)
- Graduate-1 C6-heavy A/B:
  - Baseline (C6=MID v3.5): 55.3M ops/s
  - ULTRA+array: 57.4M ops/s (+3.79%)
  - ULTRA+intrusive: 54.5M ops/s (-1.44%, fallback=0)
- Mixed mainline (16-1024B):
  - ULTRA+intrusive regresses by about -14%. Root cause: TLS cache contention from eight competing classes plus more ULTRA misses (Legacy fallback ≈24%).
  - **Conclusion**: do not use C6 ULTRA on the Mixed mainline.

**Date**: 2025-12-12
**Phase**: TLS-UNIFY-3-DESIGN
**Status**: Design document (implementation in next session)
**Phase**: TLS-UNIFY-3-IMPL / GRADUATE-1
**Status**: IMPLEMENTED + VERIFIED ✅ (not adopted on the Mixed mainline)