hakmem/CURRENT_TASK.md
Moe Charm (CI) 1010a961fb Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.
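
To make the mismatch concrete, here is a minimal sketch of header-aware capacity and carve arithmetic. The names `tiny_stride`, `slab_capacity`, and `carve_offset`, the `SKETCH_*` constants, and the sizes used in the note below are illustrative assumptions; they are not the identifiers or values used in `superslab_init_slab()`.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HEADER_BYTES 1u     /* assumption: 1-byte class-index header */
#define SKETCH_HEADERLESS_CLASS 7  /* class7 (1024B) stays headerless */

/* Effective stride: block size plus header for classes 0..6. */
static size_t tiny_stride(int class_idx, size_t block_size) {
    return (class_idx == SKETCH_HEADERLESS_CLASS)
               ? block_size
               : block_size + SKETCH_HEADER_BYTES;
}

/* Capacity must be derived from the stride, not the bare block size. */
static size_t slab_capacity(size_t usable_bytes, int class_idx,
                            size_t block_size) {
    return usable_bytes / tiny_stride(class_idx, block_size);
}

/* Linear carve with a fail-fast bound check (active only when assert is). */
static size_t carve_offset(size_t usable_bytes, int class_idx,
                           size_t block_size, size_t index) {
    size_t stride = tiny_stride(class_idx, block_size);
    size_t offset = index * stride;
    assert(offset + stride <= usable_bytes && "carve would exceed usable bytes");
    return offset;
}
```

As an illustration, with 65,536 usable bytes and a hypothetical 64-byte class, the bare size suggests 1,024 blocks, but 1,024 × 65 = 66,560 bytes overruns the slab; the stride-based capacity is 1,008.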

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate P0 batch carve, and continue reducing branch-miss as planned.
2025-11-09 18:55:50 +09:00


# Current Task: Phase 7 + Pool TLS — Step 4.x Integration & Validation
**Date**: 2025-11-09
**Status**: 🚀 In Progress (Step 4.x)
**Priority**: HIGH
---
## 🎯 Goal
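In line with Box Theory, use Pool TLS as the focal point to reduce syscall frequency and funnel each boundary crossing through a single place, aiming for stable speedups on the Tiny, Mid, and Larson workloads.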
### **Why This Works**
Phase 7 Task 3 achieved **+180-280% improvement** by pre-warming:
- **Before**: First allocation → TLS miss → SuperSlab refill (100+ cycles)
- **After**: First allocation → TLS hit (15 cycles, pre-populated cache)
**Same bottleneck exists in Pool TLS**:
- First 8KB allocation → TLS miss → Arena carve → mmap (1000+ cycles)
- Pre-warm eliminates this cold-start penalty
---
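## 📊 Current Status (main progress through Step 4)
### Implementation Summary
- ✅ Tiny 1024B special case (headerless) plus lightweight adaptive refill for class7 (cuts off the main cause of frequent mmap calls)
- ✅ OS descent placed behind a boundary (`hak_os_map_boundary()`): mmap calls are consolidated in one place
- ✅ Pool TLS Arena (exponential growth 1→2→4→8MB, configurable via ENV): mmap calls are consolidated into the arena
- ✅ Page Registry (chunk registration/lookup resolves the owner)
- ✅ Remote Queue (for Pool, mutex-bucket version) plus a lightweight drain wired in before alloc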
---
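## 🚀 Next Steps (Actions)
1) Integrate the Remote Queue drain with the Pool TLS refill boundary as well (at low water: drain → refill → bind)
- Current: drain at `pool_alloc` entry, plus an additional drain at low water after pop, is already implemented
- To add: also attempt a drain on the refill path (just before calling `pool_refill_and_alloc`), and skip the refill when the drain succeeds
2) Confirm the syscall reduction with strace (turn it into a metric)
- RandomMixed: 256 / 1024B, `mmap/madvise/munmap` counts for each (totals from `-c`)
- PoolTLS: compare the reduction in `mmap/madvise/munmap` at 1T/4T (before vs. after introducing the Arena)
3) Performance A/B (ENV: INIT/MAX/GROWTH) to find the tuning sweet spots
- Evaluate combinations of `HAKMEM_POOL_TLS_ARENA_MB_INIT`, `HAKMEM_POOL_TLS_ARENA_MB_MAX`, and `HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS`
- Goal: reduce syscalls while keeping memory usage within an acceptable range
4) Speed up the Remote Queue (next phase)
- First: mutex → lock splitting / lightweight spinning, with per-class queues if needed
- Make Page Registry lookups O(1) (per-page table); in the future, move to per-arena IDs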
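
The drain-before-refill idea in item 1 can be sketched as follows. This is a single-threaded model under stated assumptions: `PoolClassSketch`, `sketch_drain`, `sketch_refill`, and `sketch_alloc` are hypothetical names, the remote-queue locking is omitted, and the refill is a stub that only counts how often the expensive path runs. The demo at the end shows two remotely freed blocks serving two allocations with no refill; only the third allocation falls through to the refill path.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Block { struct Block* next; } Block;

typedef struct {
    Block* local_head;   /* thread-local freelist */
    Block* remote_head;  /* blocks freed by other threads (locking omitted) */
    int    refills;      /* counts how often the expensive path ran */
} PoolClassSketch;

/* Move every remotely freed block onto the local freelist; returns count. */
static int sketch_drain(PoolClassSketch* pc) {
    int moved = 0;
    while (pc->remote_head) {
        Block* b = pc->remote_head;
        pc->remote_head = b->next;
        b->next = pc->local_head;
        pc->local_head = b;
        moved++;
    }
    return moved;
}

/* Stand-in for the refill path; a real one would carve from the arena. */
static Block* sketch_refill(PoolClassSketch* pc) {
    pc->refills++;
    return NULL;  /* no backing arena in this sketch */
}

/* Alloc: local pop, else drain, and refill only if the drain found nothing. */
static Block* sketch_alloc(PoolClassSketch* pc) {
    assert(pc != NULL);
    if (!pc->local_head && sketch_drain(pc) == 0) {
        return sketch_refill(pc);
    }
    Block* b = pc->local_head;
    pc->local_head = b->next;
    return b;
}

/* Demo: start with two remotely freed blocks, run `allocs` allocations,
 * and report how many of them had to take the refill path. */
static int sketch_demo_refills_after(int allocs) {
    static Block a, b;
    PoolClassSketch pc = {0};
    a.next = &b;
    b.next = NULL;
    pc.remote_head = &a;
    for (int i = 0; i < allocs; i++) (void)sketch_alloc(&pc);
    return pc.refills;
}
```
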
**Challenge**: Pool blocks are LARGE (8KB-52KB) vs Tiny (128B-1KB)
**Memory Budget Analysis**:
```
Phase 7 Tiny:
- 16 blocks × 1KB = 16KB per class
- 7 classes × 16KB = 112KB total ✅ Acceptable
Pool TLS (Naive):
- 16 blocks × 8KB = 128KB (class 0)
- 16 blocks × 52KB = 832KB (class 6)
- Total: ~3.4MB (16 × 220KB across all 7 classes) ❌ Too much!
```
**Smart Strategy**: Variable pre-warm counts based on expected usage
```c
// Hot classes (8-24KB) - common in real workloads
Class 0 (8KB): 16 blocks = 128KB
Class 1 (16KB): 16 blocks = 256KB
Class 2 (24KB): 12 blocks = 288KB
// Warm classes (32-40KB)
Class 3 (32KB): 8 blocks = 256KB
Class 4 (40KB): 8 blocks = 320KB
// Cold classes (48-52KB) - rare
Class 5 (48KB): 4 blocks = 192KB
Class 6 (52KB): 4 blocks = 208KB
Total: ~1.6MB ✅ Acceptable
```
**Rationale**:
1. Smaller classes are used more frequently (Pareto principle)
2. Total memory: 1.6MB (reasonable for 8-52KB allocations)
3. Covers most real-world workload patterns
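
The budget arithmetic can be checked with a short sketch. It counts block payloads only (no headers or metadata), and the `SKETCH_*` names and `prewarm_budget_kb` are illustrative, not part of the code base. The variable counts sum to 1,648KB (about 1.6MB), while a flat 16 blocks per class sums to 3,520KB of payload.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CLASSES 7

/* Class sizes in KB, as listed in the plan above. */
static const size_t SKETCH_CLASS_KB[SKETCH_CLASSES] = {8, 16, 24, 32, 40, 48, 52};

/* Flat 16 blocks per class versus the variable hot/warm/cold counts. */
static const int SKETCH_NAIVE[SKETCH_CLASSES]    = {16, 16, 16, 16, 16, 16, 16};
static const int SKETCH_VARIABLE[SKETCH_CLASSES] = {16, 16, 12, 8, 8, 4, 4};

/* Sum of counts[i] * size[i] in KB (block payloads only, no metadata). */
static size_t prewarm_budget_kb(const int counts[SKETCH_CLASSES]) {
    assert(counts != NULL);
    size_t total = 0;
    for (int i = 0; i < SKETCH_CLASSES; i++) {
        total += (size_t)counts[i] * SKETCH_CLASS_KB[i];
    }
    return total;
}
```
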
---
## ENV (Arena-related)
```
# Initial chunk size in MB (default: 1)
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2
# Maximum chunk size in MB (default: 8)
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16
# Number of growth levels (default: 3 → 1→2→4→8MB)
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4
```
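One plausible reading of these three variables is sketched below: the chunk size starts at INIT, doubles on each growth step for at most GROWTH_LEVELS steps, and is capped at MAX. The helper names `env_size_or` and `arena_chunk_mb` are assumptions for illustration; the actual parsing and growth logic in the arena code may differ. With the defaults (1, 8, 3) this yields 1→2→4→8MB and then stays at 8MB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Read a positive integer from the environment, else use the default. */
static size_t env_size_or(const char* name, size_t dflt) {
    const char* s = getenv(name);
    if (!s || !*s) return dflt;
    char* end = NULL;
    unsigned long v = strtoul(s, &end, 10);
    return (end == s || *end != '\0' || v == 0) ? dflt : (size_t)v;
}

/* Size in MB of the n-th chunk (n = 0 is the first): doubles each time,
 * stops growing after `levels` doublings, and never exceeds max_mb. */
static size_t arena_chunk_mb(size_t init_mb, size_t max_mb, int levels, int n) {
    assert(init_mb > 0 && max_mb >= init_mb);
    int shifts = (n < levels) ? n : levels;
    size_t mb = init_mb;
    for (int i = 0; i < shifts && mb < max_mb; i++) mb *= 2;
    return (mb > max_mb) ? max_mb : mb;
}
```
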
**Location**: `core/pool_tls.c`
**Code**:
```c
// Pre-warm counts optimized for memory usage
#define PREWARM_MAX_COUNT 16  // largest entry in PREWARM_COUNTS
static const int PREWARM_COUNTS[POOL_SIZE_CLASSES] = {
    16, 16, 12,  // Hot: 8KB, 16KB, 24KB
    8, 8,        // Warm: 32KB, 40KB
    4, 4         // Cold: 48KB, 52KB
};

void pool_tls_prewarm(void) {
    for (int class_idx = 0; class_idx < POOL_SIZE_CLASSES; class_idx++) {
        int count = PREWARM_COUNTS[class_idx];
        size_t size = POOL_CLASS_SIZES[class_idx];
        void* blocks[PREWARM_MAX_COUNT];
        int got = 0;

        // Hold all blocks before freeing any: an alloc/free pair per
        // iteration would likely keep recycling the same cached block.
        while (got < count) {
            void* ptr = pool_alloc(size);
            if (!ptr) break;  // OOM during pre-warm (rare): stop this class
            blocks[got++] = ptr;
        }
        // Free everything so the blocks go back to the TLS freelist
        while (got > 0) {
            pool_free(blocks[--got]);
        }
    }
}
```
**Header Addition** (`core/pool_tls.h`):
```c
// Pre-warm TLS cache (call once at thread init)
void pool_tls_prewarm(void);
```
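Since the pre-warm is meant to run once at thread init, a per-thread guard is one way to make repeated init calls harmless. The sketch below assumes C11 `_Thread_local`; `sketch_thread_init`, `sketch_pool_tls_prewarm`, and the call counter are hypothetical stand-ins, not the project's actual init path.

```c
#include <assert.h>
#include <stdbool.h>

/* Test hook: counts how often the (stubbed) pre-warm ran on this thread. */
static _Thread_local int sketch_prewarm_calls = 0;

/* Stand-in for pool_tls_prewarm(); the real one fills the TLS cache. */
static void sketch_pool_tls_prewarm(void) {
    sketch_prewarm_calls++;
}

/* Per-thread flag so repeated init calls pre-warm only once per thread. */
static _Thread_local bool sketch_prewarmed = false;

static void sketch_thread_init(void) {
    if (!sketch_prewarmed) {
        sketch_prewarmed = true;  /* set first, in case pre-warm re-enters init */
        sketch_pool_tls_prewarm();
    }
}

/* Calls init twice on the current thread and reports the pre-warm count. */
static int sketch_init_twice(void) {
    sketch_thread_init();
    sketch_thread_init();
    assert(sketch_prewarmed);
    return sketch_prewarm_calls;
}
```
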
---
## Quick Check (recommended)
```
# PoolTLS
./build.sh bench_pool_tls_hakmem
./bench_pool_tls_hakmem 1 100000 256 42
./bench_pool_tls_hakmem 4 50000 256 42
# Measure syscalls (check that the mmap/madvise/munmap totals have decreased)
strace -e trace=mmap,madvise,munmap -c ./bench_pool_tls_hakmem 1 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 1024 42
```
**Location**: `core/hakmem.c` (or wherever Pool TLS init happens)
**Code**:
```c
#ifdef HAKMEM_POOL_TLS_PHASE1
// Initialize Pool TLS
pool_thread_init();
// Pre-warm cache (Phase 1.5b optimization)
#ifdef HAKMEM_POOL_TLS_PREWARM
pool_tls_prewarm();
#endif
#endif
```
**Makefile Addition**:
```makefile
# Pool TLS Phase 1.5b - Pre-warm optimization
ifeq ($(POOL_TLS_PREWARM),1)
CFLAGS += -DHAKMEM_POOL_TLS_PREWARM=1
endif
```
**Update `build.sh`**:
```bash
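# POOL_TLS_PREWARM=1 is the new flag. Keep comments off the continued
# lines: a trailing "\ # comment" breaks the line continuation.
make \
  POOL_TLS_PHASE1=1 \
  POOL_TLS_PREWARM=1 \
  HEADER_CLASSIDX=1 \
  AGGRESSIVE_INLINE=1 \
  PREWARM_TLS=1 \
  "${TARGET}"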
```
---
### **Step 4: Build & Smoke Test** ⏳ 10 min
```bash
# Build with pre-warm enabled
./build_pool_tls.sh bench_mid_large_mt_hakmem
# Quick smoke test
./dev_pool_tls.sh test
# Expected: No crashes, similar or better performance
```
---
### **Step 5: Benchmark** ⏳ 15 min
```bash
# Full benchmark vs System malloc
./run_pool_bench.sh
# Expected results:
# Before (1.5a): 1.79M ops/s
# After (1.5b): 5-15M ops/s (roughly 3-8x)
```
**Additional benchmarks**:
```bash
# Different sizes
./bench_mid_large_mt_hakmem 1 100000 256 42 # 8-32KB mixed
./bench_mid_large_mt_hakmem 1 100000 1024 42 # Larger workset
# Multi-threaded
./bench_mid_large_mt_hakmem 4 100000 256 42 # 4T
```
---
### **Step 6: Measure & Analyze** ⏳ 10 min
**Metrics to collect**:
1. ops/s improvement (target: roughly 3-8x)
2. Memory overhead (should be ~1.6MB per thread)
3. Cold-start penalty reduction (first allocation latency)
**Success Criteria**:
- ✅ No crashes or stability issues
- ✅ +180% or better improvement (5M ops/s minimum)
- ✅ Memory overhead < 2MB per thread
- ✅ No performance regression on small workloads
---
### **Step 7: Tune (if needed)** ⏳ 15 min (optional)
**If results are suboptimal**, adjust pre-warm counts:
**Too slow** (< 5M ops/s):
- Increase hot class pre-warm (16 → 24)
- More aggressive: Pre-warm all classes to 16
**Memory too high** (> 2MB):
- Reduce cold class pre-warm (4 → 2)
- Lazy pre-warm: Only hot classes initially
**Adaptive approach**:
```c
// Pre-warm based on runtime heuristics
void pool_tls_prewarm_adaptive(void) {
    // Start with minimal pre-warm
    static const int MIN_PREWARM[7] = {8, 8, 4, 4, 2, 2, 2};
    // TODO: Track usage patterns and adjust dynamically
}
```
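The TODO above leaves the adjustment policy open. As one hypothetical policy, and not a description of planned behavior, a class's pre-warm count could double when the class saw TLS misses during the last measurement window and halve when it saw none, bounded by a per-class minimum and a global maximum. `adapt_prewarm_count` and `SKETCH_MAX_PREWARM` are invented for this sketch.

```c
#include <assert.h>

#define SKETCH_MAX_PREWARM 16  /* assumption: same ceiling as the hot classes */

/* Hypothetical policy: grow a class's pre-warm count when it saw TLS misses
 * in the last window, shrink it when it saw none, within [min_count, max]. */
static int adapt_prewarm_count(int current, int min_count, unsigned misses) {
    assert(min_count >= 1 && current >= min_count);
    if (misses > 0) {
        int grown = current * 2;
        return (grown > SKETCH_MAX_PREWARM) ? SKETCH_MAX_PREWARM : grown;
    }
    int shrunk = current / 2;
    return (shrunk < min_count) ? min_count : shrunk;
}
```
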
---
## 📋 **Implementation Checklist**
### **Phase 1.5b: Pre-warm Optimization**
- [ ] **Step 1**: Design pre-warm strategy (15 min)
  - [ ] Analyze memory budget
  - [ ] Decide pre-warm counts per class
  - [ ] Document rationale
- [ ] **Step 2**: Implement `pool_tls_prewarm()` (20 min)
  - [ ] Add PREWARM_COUNTS array
  - [ ] Write pre-warm function
  - [ ] Add to pool_tls.h
- [ ] **Step 3**: Integrate with init (10 min)
  - [ ] Add call to hakmem.c init
  - [ ] Add Makefile flag
  - [ ] Update build.sh
- [ ] **Step 4**: Build & smoke test (10 min)
  - [ ] Build with pre-warm enabled
  - [ ] Run dev_pool_tls.sh test
  - [ ] Verify no crashes
- [ ] **Step 5**: Benchmark (15 min)
  - [ ] Run run_pool_bench.sh
  - [ ] Test different sizes
  - [ ] Test multi-threaded
- [ ] **Step 6**: Measure & analyze (10 min)
  - [ ] Record performance improvement
  - [ ] Measure memory overhead
  - [ ] Validate success criteria
- [ ] **Step 7**: Tune (optional, 15 min)
  - [ ] Adjust pre-warm counts if needed
  - [ ] Re-benchmark
  - [ ] Document final configuration
**Total Estimated Time**: 1.5 hours (90 minutes)
---
## 🎯 **Expected Outcomes**
### **Performance Targets**
```
Phase 1.5a (current): 1.79M ops/s
Phase 1.5b (target): 5-15M ops/s (roughly 3-8x)
Conservative: 5M ops/s (+180%)
Expected: 8M ops/s (+350%)
Optimistic: 15M ops/s (+740%)
```
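The percentages in this table follow from the 1.79M ops/s baseline. A small sketch of the arithmetic, with helper names chosen here for illustration, keeps the distinction between a percentage gain and a speedup factor explicit:

```c
#include <assert.h>

/* Percentage gain of `after` over `before`; doubling throughput is +100%. */
static double percent_gain(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

/* Speedup factor; doubling throughput is 2.0. */
static double speedup(double before, double after) {
    assert(before > 0.0);
    return after / before;
}
```

Against 1.79M ops/s, 5M is about +179% (2.8x), 8M about +347% (4.5x), and 15M about +738% (8.4x).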
### **Comparison to Phase 7**
```
Phase 7 Task 3 (Tiny):
Before: 21M → After: 59M ops/s (+181%)
Phase 1.5b (Pool):
Before: 1.79M → After: 5-15M ops/s (+180-740%)
Similar or better improvement expected!
```
### **Risk Assessment**
- **Technical Risk**: LOW (proven pattern from Phase 7)
- **Stability Risk**: LOW (simple, non-invasive change)
- **Memory Risk**: LOW (1.6MB is negligible for Pool workloads)
- **Complexity Risk**: LOW (< 50 LOC change)
---
## 📁 **Related Documents**
- `CLAUDE.md` - Development history (Phase 1.5a documented)
- `POOL_TLS_QUICKSTART.md` - Quick start guide
- `POOL_TLS_INVESTIGATION_FINAL.md` - Phase 1.5a debugging journey
- `PHASE7_TASK3_RESULTS.md` - Pre-warm success pattern (Tiny)
---
## 🚀 **Next Actions**
**NOW**: Start Step 1 - Design pre-warm strategy
**NEXT**: Implement pool_tls_prewarm() function
**THEN**: Build, test, benchmark
**Estimated Completion**: 1.5 hours from start
**Success Probability**: 90% (proven technique)
---
**Status**: Ready to implement - awaiting user confirmation to proceed! 🚀