Files
hakmem/CURRENT_TASK.md

400 lines
12 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Current Task: Phase 7 + Pool TLS — Step 4.x Integration & ValidationTiny P0: デフォルトON
**Date**: 2025-11-09
**Status**: 🚀 In Progress (Step 4.x)
**Priority**: HIGH
---
## 🎯 Goal
Box理論に沿って、Pool TLS を中心に「syscall 希薄化」と「境界一箇所化」を推し進め、Tiny/Mid/Larson の安定高速化を図る。
### **Why This Works**
Phase 7 Task 3 achieved **+180-280% improvement** by pre-warming:
- **Before**: First allocation → TLS miss → SuperSlab refill (100+ cycles)
- **After**: First allocation → TLS hit (15 cycles, pre-populated cache)
**Same bottleneck exists in Pool TLS**:
- First 8KB allocation → TLS miss → Arena carve → mmap (1000+ cycles)
- Pre-warm eliminates this cold-start penalty
---
## 📊 Current StatusStep 4までの主な進捗
### 実装サマリTiny + Pool TLS
- ✅ Tiny 1024B 特例(ヘッダ無し)+ class7 補給の軽量適応mmap 多発の主因を遮断)
- ✅ OS 降下の境界化(`hak_os_map_boundary()`mmap 呼び出しを一箇所に集約
- ✅ Pool TLS Arena1→2→4→8MB指数成長, ENV で可変mmap をアリーナへ集約
- ✅ Page Registryチャンク登録/lookup で owner 解決)
- ✅ Remote QueuePool 用, mutex バケット版)+ alloc 前の軽量 drain を配線
#### Tiny P0Batch Refill
- ✅ P0 致命バグ修正freelist→SLL一括移送後に `meta->used += from_freelist` が抜けていた)
- ✅ 線形 carve の FailFast ガード(簡素/一般/TLSバンプの全経路
- ✅ ランタイム A/B スイッチ実装:
- 既定ON`HAKMEM_TINY_P0_ENABLE` 未設定/≠0
- Kill: `HAKMEM_TINY_P0_DISABLE=1`、Drain 切替: `HAKMEM_TINY_P0_NO_DRAIN=1`、ログ: `HAKMEM_TINY_P0_LOG=1`
- ✅ ベンチ: 100k×256B1Tで P0 ON 最速(~2.76M ops/s、P0 OFF ~2.73M ops/s安定
- ⚠️ 既知: `[P0_COUNTER_MISMATCH]` 警告active_delta と taken の差分が稀に出るが、SEGV は解消済(継続監査)
##### NEW: P0 carve ループの根本原因と修正SEGV 解消)
- 🔴 根因: P0 バッチ carve ループ内で `superslab_refill(class_idx)` により TLS が新しい SuperSlab を指すのに、`tls` を再読込せず `meta=tls->meta` のみ更新 → `ss_active_add(tls->ss, batch)` が古い SuperSlab に加算され、active カウンタ破壊・SEGV に繋がる。
- 🛠 修正: `superslab_refill()` 後に `tls = &g_tls_slabs[class_idx]; meta = tls->meta;` を再読込core/hakmem_tiny_refill_p0.inc.h
- 🧪 検証: 固定サイズ 256B/1KB 200k iters完走、SEGV 再現なし。active_delta=0 を確認。RS はわずかに改善0.80.9% → 継続最適化対象)。
詳細: docs/TINY_P0_BATCH_REFILL.md
---
## 🚀 次のステップ(アクション)
1) Remote Queue の drain を Pool TLS refill 境界とも統合(低水位時は drain→refill→bind
- 現状: pool_alloc 入口で drain, pop 後 low-water で追加 drain を実装済み
- 追加: refill 経路(`pool_refill_and_alloc` 呼出し直前)でも drain を試行し、drain 成功時は refill を回避
2) strace による syscall 減少確認(指標化)
- RandomMixed: 256 / 1024B, それぞれ `mmap/madvise/munmap` 回数(-c合計
- PoolTLS: 1T/4T の `mmap/madvise/munmap` 減少を比較Arena導入前後
3) 性能A/BENV: INIT/MAX/GROWTHで最適化勘所を探索
- `HAKMEM_POOL_TLS_ARENA_MB_INIT`, `HAKMEM_POOL_TLS_ARENA_MB_MAX`, `HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS` の組合せを評価
- 目標: syscall を削減しつつメモリ使用量を許容範囲に維持
4) Remote Queue の高速化(次フェーズ)
5) Tiny 256B/1KB の直詰め最適化(性能)
- P0→FC 直詰めの一往復設計を活用し、以下を段階的に適用A/Bスイッチ済み
- FC cap/batch 上限の掃引class5/7
- remote drain 閾値化のチューニング(頻度削減)
- adopt 先行の徹底map 前に再試行)
- 配列詰めの軽い unroll分岐ヒントの見直しbranchmiss 低減)
- まずはmutex→lock分割/軽量スピン化、必要に応じてクラス別queue
- Page Registry の O(1) 化(ページ単位のテーブル), 将来はper-arena ID化
### NEW: 本日の適用と計測スナップショットRyzen 7 5825U
- 変更点Tiny 256B/1KB 向け)
- FastCache 有効容量を per-class で厳密適用(`tiny_fc_room/push_bulk``g_fast_cap[c]` を使用)
- 既定 cap 見直し: class5=96, class7=48ENVで上書き可: `HAKMEM_TINY_FAST_CAP_C{5,7}`
- Direct-FC の drain 閾値 既定を 32→64ENV: `HAKMEM_TINY_P0_DRAIN_THRESH`
- class7 の Direct-FC 既定は OFF`HAKMEM_TINY_P0_DIRECT_FC_C7=1` で明示ON
- 固定サイズベンチrelease, 200k iters
- 256B: 4.494.54M ops/s, branch-miss ≈ 8.89%(先行値 ≈11% から改善)
- 1KB: 現状 SEGVDirect-FC OFF でも再現)→ P0 一般経路の残存不具合の可能性
- 結果保存: benchmarks/results/<date>_ryzen7-5825U_fixed/
- 推奨: class7 は当面 P0 をA/Bで停止`HAKMEM_TINY_P0_DISABLE=1` もしくは class7限定ガード導入し、256Bのチューニングを先行。
**Challenge**: Pool blocks are LARGE (8KB-52KB) vs Tiny (128B-1KB)
**Memory Budget Analysis**:
```
Phase 7 Tiny:
- 16 blocks × 1KB = 16KB per class
- 7 classes × 16KB = 112KB total ✅ Acceptable
Pool TLS (Naive):
- 16 blocks × 8KB = 128KB (class 0)
- 16 blocks × 52KB = 832KB (class 6)
- Total: ~4-5MB ❌ Too much!
```
**Smart Strategy**: Variable pre-warm counts based on expected usage
```c
// Hot classes (8-24KB) - common in real workloads
Class 0 (8KB): 16 blocks = 128KB
Class 1 (16KB): 16 blocks = 256KB
Class 2 (24KB): 12 blocks = 288KB
// Warm classes (32-40KB)
Class 3 (32KB): 8 blocks = 256KB
Class 4 (40KB): 8 blocks = 320KB
// Cold classes (48-52KB) - rare
Class 5 (48KB): 4 blocks = 192KB
Class 6 (52KB): 4 blocks = 208KB
Total: ~1.6MB Acceptable
```
**Rationale**:
1. Smaller classes are used more frequently (Pareto principle)
2. Total memory: 1.6MB (reasonable for 8-52KB allocations)
3. Covers most real-world workload patterns
---
## ENVArena 関連)
```
# Initial chunk size in MB (default: 1)
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2
# Maximum chunk size in MB (default: 8)
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16
# Number of growth levels (default: 3 → 1→2→4→8MB)
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4
```
**Location**: `core/pool_tls.c`
**Code**:
```c
// Pre-warm counts optimized for memory usage
static const int PREWARM_COUNTS[POOL_SIZE_CLASSES] = {
16, 16, 12, // Hot: 8KB, 16KB, 24KB
8, 8, // Warm: 32KB, 40KB
4, 4 // Cold: 48KB, 52KB
};
void pool_tls_prewarm(void) {
for (int class_idx = 0; class_idx < POOL_SIZE_CLASSES; class_idx++) {
int count = PREWARM_COUNTS[class_idx];
size_t size = POOL_CLASS_SIZES[class_idx];
// Allocate then immediately free to populate TLS cache
for (int i = 0; i < count; i++) {
void* ptr = pool_alloc(size);
if (ptr) {
pool_free(ptr); // Goes back to TLS freelist
} else {
// OOM during pre-warm (rare, but handle gracefully)
break;
}
}
}
}
```
**Header Addition** (`core/pool_tls.h`):
```c
// Pre-warm TLS cache (call once at thread init)
void pool_tls_prewarm(void);
```
---
## 軽い確認(推奨)
```
# PoolTLS
./build.sh bench_pool_tls_hakmem
./bench_pool_tls_hakmem 1 100000 256 42
./bench_pool_tls_hakmem 4 50000 256 42
# syscall 計測mmap/madvise/munmap 合計が減っているか確認)
strace -e trace=mmap,madvise,munmap -c ./bench_pool_tls_hakmem 1 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 1024 42
```
**Location**: `core/hakmem.c` (or wherever Pool TLS init happens)
**Code**:
```c
#ifdef HAKMEM_POOL_TLS_PHASE1
// Initialize Pool TLS
pool_thread_init();
// Pre-warm cache (Phase 1.5b optimization)
#ifdef HAKMEM_POOL_TLS_PREWARM
pool_tls_prewarm();
#endif
#endif
```
**Makefile Addition**:
```makefile
# Pool TLS Phase 1.5b - Pre-warm optimization
ifeq ($(POOL_TLS_PREWARM),1)
CFLAGS += -DHAKMEM_POOL_TLS_PREWARM=1
endif
```
**Update `build.sh`**:
```bash
make \
POOL_TLS_PHASE1=1 \
POOL_TLS_PREWARM=1 \ # NEW!
HEADER_CLASSIDX=1 \
AGGRESSIVE_INLINE=1 \
PREWARM_TLS=1 \
"${TARGET}"
```
---
### **Step 4: Build & Smoke Test** ⏳ 10 min
```bash
# Build with pre-warm enabled
./build_pool_tls.sh bench_mid_large_mt_hakmem
# Quick smoke test
./dev_pool_tls.sh test
# Expected: No crashes, similar or better performance
```
---
### **Step 5: Benchmark** ⏳ 15 min
```bash
# Full benchmark vs System malloc
./run_pool_bench.sh
# Expected results:
# Before (1.5a): 1.79M ops/s
# After (1.5b): 5-15M ops/s (+3-8x)
```
**Additional benchmarks**:
```bash
# Different sizes
./bench_mid_large_mt_hakmem 1 100000 256 42 # 8-32KB mixed
./bench_mid_large_mt_hakmem 1 100000 1024 42 # Larger workset
# Multi-threaded
./bench_mid_large_mt_hakmem 4 100000 256 42 # 4T
```
---
### **Step 6: Measure & Analyze** ⏳ 10 min
**Metrics to collect**:
1. ops/s improvement (target: +3-8x)
2. Memory overhead (should be ~1.6MB per thread)
3. Cold-start penalty reduction (first allocation latency)
**Success Criteria**:
- ✅ No crashes or stability issues
- ✅ +200% or better improvement (5M ops/s minimum)
- ✅ Memory overhead < 2MB per thread
- No performance regression on small workloads
---
### **Step 7: Tune (if needed)** ⏳ 15 min (optional)
**If results are suboptimal**, adjust pre-warm counts:
**Too slow** (< 5M ops/s):
- Increase hot class pre-warm (16 24)
- More aggressive: Pre-warm all classes to 16
**Memory too high** (> 2MB):
- Reduce cold class pre-warm (4 → 2)
- Lazy pre-warm: Only hot classes initially
**Adaptive approach**:
```c
// Pre-warm based on runtime heuristics
void pool_tls_prewarm_adaptive(void) {
// Start with minimal pre-warm
static const int MIN_PREWARM[7] = {8, 8, 4, 4, 2, 2, 2};
// TODO: Track usage patterns and adjust dynamically
}
```
---
## 📋 **Implementation Checklist**
### **Phase 1.5b: Pre-warm Optimization**
- [ ] **Step 1**: Design pre-warm strategy (15 min)
- [ ] Analyze memory budget
- [ ] Decide pre-warm counts per class
- [ ] Document rationale
- [ ] **Step 2**: Implement `pool_tls_prewarm()` (20 min)
- [ ] Add PREWARM_COUNTS array
- [ ] Write pre-warm function
- [ ] Add to pool_tls.h
- [ ] **Step 3**: Integrate with init (10 min)
- [ ] Add call to hakmem.c init
- [ ] Add Makefile flag
- [ ] Update build.sh
- [ ] **Step 4**: Build & smoke test (10 min)
- [ ] Build with pre-warm enabled
- [ ] Run dev_pool_tls.sh test
- [ ] Verify no crashes
- [ ] **Step 5**: Benchmark (15 min)
- [ ] Run run_pool_bench.sh
- [ ] Test different sizes
- [ ] Test multi-threaded
- [ ] **Step 6**: Measure & analyze (10 min)
- [ ] Record performance improvement
- [ ] Measure memory overhead
- [ ] Validate success criteria
- [ ] **Step 7**: Tune (optional, 15 min)
- [ ] Adjust pre-warm counts if needed
- [ ] Re-benchmark
- [ ] Document final configuration
**Total Estimated Time**: 1.5 hours (90 minutes)
---
## 🎯 **Expected Outcomes**
### **Performance Targets**
```
Phase 1.5a (current): 1.79M ops/s
Phase 1.5b (target): 5-15M ops/s (+3-8x)
Conservative: 5M ops/s (+180%)
Expected: 8M ops/s (+350%)
Optimistic: 15M ops/s (+740%)
```
### **Comparison to Phase 7**
```
Phase 7 Task 3 (Tiny):
Before: 21M → After: 59M ops/s (+181%)
Phase 1.5b (Pool):
Before: 1.79M → After: 5-15M ops/s (+180-740%)
Similar or better improvement expected!
```
### **Risk Assessment**
- **Technical Risk**: LOW (proven pattern from Phase 7)
- **Stability Risk**: LOW (simple, non-invasive change)
- **Memory Risk**: LOW (1.6MB is negligible for Pool workloads)
- **Complexity Risk**: LOW (< 50 LOC change)
---
## 📁 **Related Documents**
- `CLAUDE.md` - Development history (Phase 1.5a documented)
- `POOL_TLS_QUICKSTART.md` - Quick start guide
- `POOL_TLS_INVESTIGATION_FINAL.md` - Phase 1.5a debugging journey
- `PHASE7_TASK3_RESULTS.md` - Pre-warm success pattern (Tiny)
---
## 🚀 **Next Actions**
**NOW**: Start Step 1 - Design pre-warm strategy
**NEXT**: Implement pool_tls_prewarm() function
**THEN**: Build, test, benchmark
**Estimated Completion**: 1.5 hours from start
**Success Probability**: 90% (proven technique)
---
**Status**: Ready to implement - awaiting user confirmation to proceed! 🚀