hakmem/CURRENT_TASK.md
2025-11-30 11:02:39 +09:00
# Current Task: Phase 9-2: Unify SuperSlab State Management
**Date**: 2025-12-01
**Status**: Runtime bug provisionally resolved (slot sync stops registry exhaustion)
**Goal**: Eliminate the Legacy/Shared dual metadata and consolidate SuperSlab state management into the shared pool as the root-cause fix.
---
## Background / Symptoms
- With `HAKMEM_TINY_USE_SUPERSLAB=1`, `SuperSlab registry full` occurs: registry entries are never released, so the registry exhausts.
- Cause: SuperSlabs acquired via the Legacy path were never reflected in the Shared Pool's slot state, so `shared_pool_release_slab()` returned early.
- Provisional fix (already applied): `sp_meta_sync_slots_from_ss()` detects the divergence, syncs the slots, and proceeds through EMPTY → FREE list → registry unregistration.
## Root Cause (Box Theory view)
- Dual state management: the Legacy path and the Shared Pool path each hold their own SuperSlab state, and the two drift out of sync.
- Multiplied boundaries: acquire/free crosses several boundaries, scattering EMPTY detection and slot transitions across the code.
## Goals
1) Consolidate SuperSlab state transitions (UNUSED/ACTIVE/EMPTY) into the Shared Pool's slot state.
2) Route the acquire/free/adopt/drain boundaries through the shared-pool path (with A/B guards so the change can be rolled back).
3) Keep the Legacy backend as a compatibility box that syncs at its entry point, moving toward eventual removal.
## Next Steps
1. **Design the unified entry point**
   - Route `superslab_allocate()` through a thin shared-pool wrapper so registration and `SharedSSMeta` initialization always run (toggled ON/OFF via env var).
2. **Clean up the free path**
   - Make `shared_pool_release_slab()` the sole owner of EMPTY detection from TLS drain / remote free / local free.
   - Draft a design that consolidates `empty_mask/nonempty_mask/freelist_mask` updates into shared-pool-internal helpers.
3. **Observation and guards**
   - A/B test via `HAKMEM_TINY_SS_SHARED` / `HAKMEM_TINY_USE_SUPERSLAB`; one-shot observation via `*_DEBUG`.
   - Dashboard the `shared_fail→legacy` counter and registry occupancy to judge when the migration is complete.
4. **Write the phased convergence plan**
   - Document the stages for defaulting the Legacy backend to OFF and deleting it, plus the retreat (rollback) conditions.
## Current Blockers / Risks
- If code keeps growing with Legacy/Shared mixed, sync gaps are likely to reappear.
- LRU/EMPTY mask responsibilities are scattered; consolidating them may surface side effects.
## Deliverables
- Design note: unified entry wrapper, mask-update helpers, A/B guard design.
- Minimal patch proposal: introduce the wrapper and consolidate mask updates (code changes come in the next step).
- Verification procedure: regression test for registry exhaustion; confirm the `shared_fail→legacy` counter converges.
---
## Commits
### Phase 8 Root Cause Fix
**Commit**: `191e65983`
**Date**: 2025-11-30
**Files**: 3 files, 36 insertions(+), 13 deletions(-)
**Changes**:
1. `bench_fast_box.c` (Layer 0 + Layer 1):
- Removed unified_cache_init() call (design misunderstanding)
- Limited prealloc to 128 blocks/class (actual TLS SLL capacity)
- Added root cause comments explaining why unified_cache_init() was wrong
2. `bench_fast_box.h` (Layer 3):
- Added Box Contract documentation (BenchFast uses TLS SLL, NOT UC)
- Documented scope separation (workload vs infrastructure allocations)
- Added contract violation example (Phase 8 bug explanation)
3. `tiny_unified_cache.c` (Layer 2):
- Changed calloc() → __libc_calloc() (infrastructure isolation)
- Changed free() → __libc_free() (symmetric cleanup)
- Added defensive fix comments explaining infrastructure bypass
### Phase 8-TLS-Fix
**Commit**: `da8f4d2c8`
**Date**: 2025-11-30
**Files**: 3 files, 21 insertions(+), 11 deletions(-)
**Changes**:
1. `bench_fast_box.c` (TLS→Atomic):
- Changed `__thread int bench_fast_init_in_progress` → `atomic_int g_bench_fast_init_in_progress`
- Added atomic_load() for reads, atomic_store() for writes
- Added root cause comments (pthread_once creates fresh TLS)
2. `bench_fast_box.h` (TLS→Atomic):
- Updated extern declaration to match atomic_int
- Added Phase 8-TLS-Fix comment explaining cross-thread safety
3. `bench_fast_box.c` (Header Write):
- Replaced `tiny_region_id_write_header()` → direct write `*(uint8_t*)base = 0xa0 | class_idx`
- Added Phase 8-P3-Fix comment explaining P3 optimization bypass
- Contract: BenchFast always writes headers (required for free routing)
4. `hak_wrappers.inc.h` (Atomic):
- Updated bench_fast_init_in_progress check to use atomic_load()
- Added Phase 8-TLS-Fix comment for cross-thread safety
---
## Performance Journey
### Phase-by-Phase Progress
```
Phase 3 (mincore removal): 56.8 M ops/s
Phase 4 (Hot/Cold Box): 57.2 M ops/s (+0.7%)
Phase 5 (Mid MT fix): 52.3 M ops/s (-8.6% regression)
Phase 6 (Lock-free Mid MT): 42.1 M ops/s (Mid MT: +2.65%)
Phase 7-Step1 (Unified front): 80.6 M ops/s (+54.2%!) ⭐
Phase 7-Step4 (Dead code): 81.5 M ops/s (+1.1%) ⭐⭐
Phase 8 (Normal mode): 16.3 M ops/s (working, different workload)
Total improvement: +43.5% (56.8M → 81.5M) from Phase 3
```
**Note**: Phase 8 used different benchmark (10M iterations, ws=8192) vs Phase 7 (ws=256).
Normal mode performance: 16.3M ops/s (working, no crash).
---
## Technical Details
### Layer 0: Prealloc Capacity Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 131-148
**Root Cause**:
- Old code preallocated 50,000 blocks/class
- TLS SLL actual capacity: 128 blocks (adaptive sizing limit)
- Lost blocks (beyond 128) caused heap corruption
**Fix**:
```c
// Before:
const uint32_t PREALLOC_COUNT = 50000;  // Too large!

// After:
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 128;  // Observed actual capacity
for (int cls = 2; cls <= 7; cls++) {
    uint32_t capacity = ACTUAL_TLS_SLL_CAPACITY;
    for (int i = 0; i < (int)capacity; i++) {
        // preallocate...
    }
}
```
### Layer 1: Design Misunderstanding Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 123-128 (REMOVED)
**Root Cause**:
- BenchFast uses TLS SLL directly (g_tls_sll[])
- Unified Cache is NOT used by BenchFast
- unified_cache_init() created 16KB allocations (infrastructure)
- Later freed by BenchFast → header misclassification → CRASH
**Fix**:
```c
// REMOVED:
// unified_cache_init(); // WRONG! BenchFast uses TLS SLL, not Unified Cache
// Added comment:
// Phase 8 Root Cause Fix: REMOVED unified_cache_init() call
// Reason: BenchFast uses TLS SLL directly, NOT Unified Cache
```
### Layer 2: Infrastructure Isolation
**File**: `core/front/tiny_unified_cache.c`
**Lines**: 61-71 (init), 103-109 (shutdown)
**Strategy**: Dual-Path Separation
- **Workload allocations** (measured): HAKMEM paths (TLS SLL, Unified Cache)
- **Infrastructure allocations** (unmeasured): __libc_calloc/__libc_free
**Fix**:
```c
// Before:
g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*));
// After:
extern void* __libc_calloc(size_t, size_t);
g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*));
```
### Layer 3: Box Contract Documentation
**File**: `core/box/bench_fast_box.h`
**Lines**: 13-51
**Added Documentation**:
- BenchFast uses TLS SLL, NOT Unified Cache
- Scope separation (workload vs infrastructure)
- Preconditions and guarantees
- Contract violation example (Phase 8 bug)
### TLS→Atomic Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 22-27 (declaration), 37, 124, 215 (usage)
**Root Cause**:
```
pthread_once() → creates new thread
New thread has fresh TLS (bench_fast_init_in_progress = 0)
Guard broken → getenv() allocates → freed by __libc_free() → CRASH
```
**Fix**:
```c
// Before (TLS - broken):
__thread int bench_fast_init_in_progress = 0;
if (__builtin_expect(bench_fast_init_in_progress, 0)) { ... }
// After (Atomic - fixed):
atomic_int g_bench_fast_init_in_progress = 0;
if (__builtin_expect(atomic_load(&g_bench_fast_init_in_progress), 0)) { ... }
```
**Box Theory Validation**:
- **Responsibility**: Guard must protect entire process (not per-thread)
- **Contract**: "No BenchFast allocations during init" (all threads)
- **Observable**: Atomic variable visible across all threads
- **Composable**: Works with pthread_once() threading model
### Header Write Fix
**File**: `core/box/bench_fast_box.c`
**Lines**: 70-80
**Root Cause**:
- P3 optimization: tiny_region_id_write_header() skips header writes by default
- BenchFast free routing checks header magic (0xa0-0xa7)
- No header → free() misroutes to __libc_free() → CRASH
**Fix**:
```c
// Before (broken - calls function that skips write):
tiny_region_id_write_header(base, class_idx);
return (void*)((char*)base + 1);
// After (fixed - direct write):
*(uint8_t*)base = (uint8_t)(0xa0 | (class_idx & 0x0f)); // Direct write
return (void*)((char*)base + 1);
```
**Contract**: BenchFast always writes headers (required for free routing)
---
## Next Phase Options
### Option A: Continue Phase 7 (Steps 5-7) 📦
**Goal**: Remove remaining legacy layers (complete dead code elimination)
**Expected**: Additional +3-5% via further code cleanup
**Duration**: 1-2 days
**Risk**: Low (infrastructure already in place)
**Remaining Steps**:
- Step 5: Compile library with PGO flag (Makefile change)
- Step 6: Verify dead code elimination in assembly
- Step 7: Measure performance improvement
### Option B: PGO Re-enablement 🚀
**Goal**: Re-enable PGO workflow from Phase 4-Step1
**Expected**: +6-13% cumulative (on top of 81.5M)
**Duration**: 2-3 days
**Risk**: Low (proven pattern)
**Current projection**:
- Phase 7 baseline: 81.5 M ops/s
- With PGO: ~86-93 M ops/s (+6-13%)
### Option C: BenchFast Pool Expansion 🏎️
**Goal**: Increase BenchFast pool size for full 10M iteration support
**Expected**: Structural ceiling measurement (30-40M ops/s target)
**Duration**: 1 day
**Risk**: Low (just increase prealloc count)
**Current status**:
- Pool: 128 blocks/class (768 total)
- Exhaustion: C6/C7 exhaust after ~200 iterations
- Need: ~10,000 blocks/class for 10M iterations (60,000 total)
### Option D: Production Readiness 📊
**Goal**: Comprehensive benchmark suite, deployment guide
**Expected**: Full performance comparison, stability testing
**Duration**: 3-5 days
**Risk**: Low (documentation + testing)
---
## Recommendation
### Top Pick: **Option C (BenchFast Pool Expansion)** 🏎️
**Reasoning**:
1. **Phase 8 fixes working**: TLS→Atomic + Header write proven
2. **Quick win**: Just increase ACTUAL_TLS_SLL_CAPACITY to 10,000
3. **Scientific value**: Measure true structural ceiling (no safety costs)
4. **Low risk**: 1-day task, no code changes (just capacity tuning)
5. **Data-driven**: Enables comparison vs normal mode (16.3M vs 30-40M expected)
**Expected Result**:
```
Normal mode: 16.3 M ops/s (current)
BenchFast mode: 30-40 M ops/s (target, 2-2.5x faster)
```
**Implementation**:
```c
// core/box/bench_fast_box.c:140
const uint32_t ACTUAL_TLS_SLL_CAPACITY = 10000; // Was 128
```
---
### Second Choice: **Option B (PGO Re-enablement)** 🚀
**Reasoning**:
1. **Proven benefit**: +6.25% in Phase 4-Step1
2. **Cumulative**: Would stack with Phase 7 (81.5M baseline)
3. **Low risk**: Just fix build issue
4. **High impact**: ~86-93 M ops/s projected
---
## Current Performance Summary
### bench_random_mixed (16B-1KB, Tiny workload)
```
Phase 7-Step4 (ws=256): 81.5 M ops/s (+55.5% total)
Phase 8 (ws=8192): 16.3 M ops/s (normal mode, working)
```
### bench_mid_mt_gap (1KB-8KB, Mid MT workload, ws=256)
```
After Phase 6-B (lock-free): 42.09 M ops/s (+2.65%)
vs System malloc: 26.8 M ops/s (1.57x faster)
```
### Overall Status
- ✅ **Tiny allocations** (16B-1KB): **81.5 M ops/s** (excellent, +55.5%!)
- ✅ **Mid MT allocations** (1KB-8KB): 42 M ops/s (excellent, 1.57x vs system)
- ✅ **BenchFast mode**: No crash (TLS→Atomic + Header fix working)
- ⏸️ **Large allocations** (32KB-2MB): Not benchmarked yet
- ⏸️ **MT workloads**: No MT benchmarks yet
---
## Decision Time
**Choose your next phase**:
- **Option A**: Continue Phase 7 (Steps 5-7, final cleanup)
- **Option B**: PGO re-enablement (recommended for normal builds)
- **Option C**: BenchFast pool expansion (recommended for ceiling measurement)
- **Option D**: Production readiness & benchmarking
**Or**: Celebrate Phase 8 success! 🎉 (Root cause fixes complete!)
---
Updated: 2025-11-30
Phase: 8 COMPLETE (Root Cause Fixes) → 9 PENDING
Previous: Phase 7 (Tiny Front Unification, +55.5%)
Achievement: BenchFast crash investigation and fixes (Box Theory root cause analysis!)