Commit 1010a961fb (Moe Charm, CI): Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.

Next
- Re-run the Tiny benches (256B/1024B) in debug to confirm stability, then in
  release. If a crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0 to
  isolate the P0 batch carve, and continue reducing branch misses as planned.

Committed: 2025-11-09 18:55:50 +09:00


Current Task: Phase 7 + Pool TLS — Step 4.x Integration & Validation

Date: 2025-11-09 | Status: 🚀 In Progress (Step 4.x) | Priority: HIGH


🎯 Goal

In line with the Box theory, push forward "syscall dilution" (fewer syscalls) and "single-point boundaries" (OS calls funneled through one place), centered on Pool TLS, to achieve stable speedups for the Tiny/Mid/Larson workloads.

Why This Works

Phase 7 Task 3 achieved +180-280% improvement by pre-warming:

  • Before: First allocation → TLS miss → SuperSlab refill (100+ cycles)
  • After: First allocation → TLS hit (15 cycles, pre-populated cache)

Same bottleneck exists in Pool TLS:

  • First 8KB allocation → TLS miss → Arena carve → mmap (1000+ cycles)
  • Pre-warm eliminates this cold-start penalty
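
Below is a minimal, self-contained sketch of the fast/slow-path split that pre-warming targets. All names here (tls_cache, slow_refill, sketch_alloc) are illustrative stand-ins, not the actual hakmem Pool TLS API; the point is only that a populated TLS cache turns the very first allocation into the cheap path.

#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#define SKETCH_CLASSES 7
static __thread void* tls_cache[SKETCH_CLASSES];   /* one cached block per class */

static void* slow_refill(size_t size) {
    /* Cold-path stand-in: the real allocator carves from an arena and may
     * fall through to mmap (1000+ cycles). */
    void* p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

static void* sketch_alloc(int class_idx, size_t size) {
    void* p = tls_cache[class_idx];
    if (p) {                          /* TLS hit: no syscall */
        tls_cache[class_idx] = NULL;
        return p;
    }
    return slow_refill(size);         /* TLS miss: cold-start penalty */
}

int main(void) {
    /* Pre-warming just ensures tls_cache is populated before the first
     * "real" allocation, so the hot path is taken immediately. */
    tls_cache[0] = slow_refill(8 * 1024);       /* pre-warm class 0 */
    void* p = sketch_alloc(0, 8 * 1024);        /* served from the TLS cache */
    printf("first 8KB allocation -> %p (TLS hit)\n", p);
    return 0;
}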

📊 Current Status: Main Progress Through Step 4

Implementation summary

  • Tiny 1024B special case (headerless) plus lightweight adaptive class7 refill (cuts off the main source of frequent mmap calls)
  • OS-descent boundary (hak_os_map_boundary()): consolidates mmap calls in a single place
  • Pool TLS Arena (1→2→4→8MB exponential growth, tunable via ENV): concentrates mmap into the arena
  • Page Registry (chunk registration/lookup to resolve the owner)
  • Remote Queue (for Pool, mutex-bucket version) plus a lightweight drain wired in before alloc

🚀 Next Steps (Actions)

  1. Integrate the Remote Queue drain with the Pool TLS refill boundary as well (at low water: drain → refill → bind); see the sketch after this list
  • Current state: a drain at the pool_alloc entry, plus an extra drain at low water after pop, are already implemented
  • Addition: also attempt a drain on the refill path (right before the pool_refill_and_alloc call) and skip the refill when the drain succeeds
  1. Confirm the syscall reduction with strace (turn it into a metric)
  • RandomMixed: 256B and 1024B; total mmap/madvise/munmap counts for each (strace -c)
  • PoolTLS: compare the mmap/madvise/munmap reduction at 1T/4T (before vs. after introducing the Arena)
  1. Performance A/B (ENV: INIT/MAX/GROWTH) to find the tuning sweet spot
  • Evaluate combinations of HAKMEM_POOL_TLS_ARENA_MB_INIT, HAKMEM_POOL_TLS_ARENA_MB_MAX, and HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS
  • Goal: reduce syscalls while keeping memory usage within an acceptable range
  1. Speed up the Remote Queue (next phase)
  • First, split the mutex into finer locks / use lightweight spinning; add per-class queues if needed
  • Make the Page Registry O(1) (a per-page table); later, move to per-arena IDs
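
The following is a rough sketch of the ordering proposed in item 1, under the assumption that the Pool TLS internals look roughly like the stubs below (pool_remote_drain, pool_freelist_pop, pool_refill_and_alloc are hypothetical names standing in for the real symbols): drain first, and only pay for a refill when the drain produced nothing usable.

#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Stubs standing in for the real Pool TLS internals. */
static bool  pool_remote_drain(int class_idx)     { (void)class_idx; return false; }
static void* pool_freelist_pop(int class_idx)     { (void)class_idx; return NULL; }
static void* pool_refill_and_alloc(int class_idx) { (void)class_idx; return malloc(16); }

static void* pool_alloc_slow(int class_idx) {
    /* 1) Drain remote frees into the local freelist before paying for a
     *    refill; if the drain produced a block, skip the refill entirely. */
    if (pool_remote_drain(class_idx)) {
        void* p = pool_freelist_pop(class_idx);
        if (p) return p;
    }
    /* 2) Only now carve fresh blocks from the arena. */
    return pool_refill_and_alloc(class_idx);
}

int main(void) {
    void* p = pool_alloc_slow(0);
    free(p);   /* the stub refill used malloc, so free it here */
    return 0;
}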

Challenge: Pool blocks are LARGE (8KB-52KB) vs Tiny (128B-1KB)

Memory Budget Analysis:

Phase 7 Tiny:
- 16 blocks × 1KB = 16KB per class
- 7 classes × 16KB = 112KB total ✅ Acceptable

Pool TLS (Naive):
- 16 blocks × 8KB = 128KB (class 0)
- 16 blocks × 52KB = 832KB (class 6)
- Total: ~4-5MB ❌ Too much!

Smart Strategy: Variable pre-warm counts based on expected usage

// Hot classes (8-24KB) - common in real workloads
Class 0 (8KB):  16 blocks = 128KB
Class 1 (16KB): 16 blocks = 256KB
Class 2 (24KB): 12 blocks = 288KB

// Warm classes (32-40KB)
Class 3 (32KB): 8 blocks = 256KB
Class 4 (40KB): 8 blocks = 320KB

// Cold classes (48-52KB) - rare
Class 5 (48KB): 4 blocks = 192KB
Class 6 (52KB): 4 blocks = 208KB

Total: ~1.6MB ✅ Acceptable

Rationale:

  1. Smaller classes are used more frequently (Pareto principle)
  2. Total memory: 1.6MB (reasonable for 8-52KB allocations)
  3. Covers most real-world workload patterns

ENV (Arena-related)

# Initial chunk size in MB (default: 1)
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2

# Maximum chunk size in MB (default: 8)
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16

# Number of growth levels (default: 3 → 1→2→4→8MB)
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4
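
As a sanity check on how these knobs could be consumed at init, here is a small standalone sketch that parses them with getenv/strtol using the defaults listed above (1 / 8 / 3). The helper name env_long() is just for this example; the real parsing in core/pool_tls.c may differ.

#include <stdio.h>
#include <stdlib.h>

static long env_long(const char* name, long dflt) {
    const char* s = getenv(name);
    if (!s || !*s) return dflt;
    char* end = NULL;
    long v = strtol(s, &end, 10);
    return (end && *end == '\0' && v > 0) ? v : dflt;   /* fall back on junk input */
}

int main(void) {
    long init_mb = env_long("HAKMEM_POOL_TLS_ARENA_MB_INIT", 1);
    long max_mb  = env_long("HAKMEM_POOL_TLS_ARENA_MB_MAX", 8);
    long levels  = env_long("HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS", 3);
    if (max_mb < init_mb) max_mb = init_mb;   /* keep the range sane */
    printf("arena: init=%ldMB max=%ldMB growth_levels=%ld\n",
           init_mb, max_mb, levels);
    return 0;
}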

Location: core/pool_tls.c

Code:

// Pre-warm counts optimized for memory usage
static const int PREWARM_COUNTS[POOL_SIZE_CLASSES] = {
    16, 16, 12,  // Hot: 8KB, 16KB, 24KB
    8, 8,        // Warm: 32KB, 40KB
    4, 4         // Cold: 48KB, 52KB
};

void pool_tls_prewarm(void) {
    for (int class_idx = 0; class_idx < POOL_SIZE_CLASSES; class_idx++) {
        int count = PREWARM_COUNTS[class_idx];
        size_t size = POOL_CLASS_SIZES[class_idx];

        // Allocate then immediately free to populate TLS cache
        for (int i = 0; i < count; i++) {
            void* ptr = pool_alloc(size);
            if (ptr) {
                pool_free(ptr);  // Goes back to TLS freelist
            } else {
                // OOM during pre-warm (rare, but handle gracefully)
                break;
            }
        }
    }
}

Header Addition (core/pool_tls.h):

// Pre-warm TLS cache (call once at thread init)
void pool_tls_prewarm(void);

Quick checks (recommended)

# PoolTLS
./build.sh bench_pool_tls_hakmem
./bench_pool_tls_hakmem 1 100000 256 42
./bench_pool_tls_hakmem 4 50000 256 42

# syscall measurement (check that the mmap/madvise/munmap totals have decreased)
strace -e trace=mmap,madvise,munmap -c ./bench_pool_tls_hakmem 1 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 1024 42

Location: core/hakmem.c (or wherever Pool TLS init happens)

Code:

#ifdef HAKMEM_POOL_TLS_PHASE1
    // Initialize Pool TLS
    pool_thread_init();

    // Pre-warm cache (Phase 1.5b optimization)
    #ifdef HAKMEM_POOL_TLS_PREWARM
    pool_tls_prewarm();
    #endif
#endif

Makefile Addition:

# Pool TLS Phase 1.5b - Pre-warm optimization
ifeq ($(POOL_TLS_PREWARM),1)
CFLAGS += -DHAKMEM_POOL_TLS_PREWARM=1
endif

Update build.sh (the new flag is POOL_TLS_PREWARM=1):

make \
  POOL_TLS_PHASE1=1 \
  POOL_TLS_PREWARM=1 \
  HEADER_CLASSIDX=1 \
  AGGRESSIVE_INLINE=1 \
  PREWARM_TLS=1 \
  "${TARGET}"

Step 4: Build & Smoke Test (10 min)

# Build with pre-warm enabled
./build_pool_tls.sh bench_mid_large_mt_hakmem

# Quick smoke test
./dev_pool_tls.sh test

# Expected: No crashes, similar or better performance

Step 5: Benchmark (15 min)

# Full benchmark vs System malloc
./run_pool_bench.sh

# Expected results:
# Before (1.5a): 1.79M ops/s
# After (1.5b):  5-15M ops/s (+3-8x)

Additional benchmarks:

# Different sizes
./bench_mid_large_mt_hakmem 1 100000 256 42   # 8-32KB mixed
./bench_mid_large_mt_hakmem 1 100000 1024 42  # Larger workset

# Multi-threaded
./bench_mid_large_mt_hakmem 4 100000 256 42   # 4T

Step 6: Measure & Analyze (10 min)

Metrics to collect:

  1. ops/s improvement (target: +3-8x)
  2. Memory overhead (should be ~1.6MB per thread)
  3. Cold-start penalty reduction (first allocation latency)
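
For metric 3, a quick standalone way to see the cold-start effect is to time the first allocation of a class against a later one. The sketch below uses clock_gettime and plain malloc as a placeholder; swap in pool_alloc/pool_free when measuring hakmem itself.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

int main(void) {
    double t0 = now_ns();
    void* first = malloc(8 * 1024);     /* cold: first 8KB allocation */
    double t1 = now_ns();
    free(first);

    double t2 = now_ns();
    void* second = malloc(8 * 1024);    /* warm: cache/freelist already primed */
    double t3 = now_ns();
    free(second);

    printf("first alloc: %.0f ns, second alloc: %.0f ns\n", t1 - t0, t3 - t2);
    return 0;
}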

Success Criteria:

  • No crashes or stability issues
  • +180% or better improvement (at least 5M ops/s)
  • Memory overhead < 2MB per thread
  • No performance regression on small workloads

Step 7: Tune if needed (optional, 15 min)

If results are suboptimal, adjust pre-warm counts:

Too slow (< 5M ops/s):

  • Increase hot class pre-warm (16 → 24)
  • More aggressive: Pre-warm all classes to 16

Memory too high (> 2MB):

  • Reduce cold class pre-warm (4 → 2)
  • Lazy pre-warm: Only hot classes initially

Adaptive approach:

// Pre-warm based on runtime heuristics
void pool_tls_prewarm_adaptive(void) {
    // Start with minimal pre-warm
    static const int MIN_PREWARM[7] = {8, 8, 4, 4, 2, 2, 2};
    (void)MIN_PREWARM;  // unused until the adaptive logic below is implemented

    // TODO: Track usage patterns and adjust dynamically
}
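
One possible direction for that TODO, sketched below under the assumption that per-class allocation counters exist: top up only the classes that have actually been busy. The counters, the threshold, and the sketch_pool_* stubs are illustrative only, not part of the current hakmem design.

#include <stdlib.h>

#define SKETCH_CLASSES 7
static __thread unsigned long g_alloc_count[SKETCH_CLASSES];   /* hypothetical per-class counters */
static const size_t SKETCH_SIZES[SKETCH_CLASSES] =
    { 8192, 16384, 24576, 32768, 40960, 49152, 53248 };

static void* sketch_pool_alloc(size_t sz) { return malloc(sz); }  /* stand-in for pool_alloc */
static void  sketch_pool_free(void* p)    { free(p); }            /* stand-in for pool_free */

/* Call occasionally from the slow path: re-warm only classes that are hot. */
static void sketch_prewarm_topup(void) {
    for (int c = 0; c < SKETCH_CLASSES; c++) {
        if (g_alloc_count[c] < 64) continue;          /* class not hot yet */
        for (int i = 0; i < 4; i++) {
            void* p = sketch_pool_alloc(SKETCH_SIZES[c]);
            if (!p) break;
            sketch_pool_free(p);   /* with the real pool_free this would land on the TLS freelist */
        }
        g_alloc_count[c] = 0;
    }
}

int main(void) {
    g_alloc_count[0] = 128;        /* pretend class 0 has been busy */
    sketch_prewarm_topup();
    return 0;
}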

📋 Implementation Checklist

Phase 1.5b: Pre-warm Optimization

  • Step 1: Design pre-warm strategy (15 min)

    • Analyze memory budget
    • Decide pre-warm counts per class
    • Document rationale
  • Step 2: Implement pool_tls_prewarm() (20 min)

    • Add PREWARM_COUNTS array
    • Write pre-warm function
    • Add to pool_tls.h
  • Step 3: Integrate with init (10 min)

    • Add call to hakmem.c init
    • Add Makefile flag
    • Update build.sh
  • Step 4: Build & smoke test (10 min)

    • Build with pre-warm enabled
    • Run dev_pool_tls.sh test
    • Verify no crashes
  • Step 5: Benchmark (15 min)

    • Run run_pool_bench.sh
    • Test different sizes
    • Test multi-threaded
  • Step 6: Measure & analyze (10 min)

    • Record performance improvement
    • Measure memory overhead
    • Validate success criteria
  • Step 7: Tune (optional, 15 min)

    • Adjust pre-warm counts if needed
    • Re-benchmark
    • Document final configuration

Total Estimated Time: 1.5 hours (90 minutes)


🎯 Expected Outcomes

Performance Targets

Phase 1.5a (current): 1.79M ops/s
Phase 1.5b (target):  5-15M ops/s (+3-8x)

Conservative: 5M ops/s   (+180%)
Expected:     8M ops/s   (+350%)
Optimistic:   15M ops/s  (+740%)

Comparison to Phase 7

Phase 7 Task 3 (Tiny):
  Before: 21M → After: 59M ops/s (+181%)

Phase 1.5b (Pool):
  Before: 1.79M → After: 5-15M ops/s (+180-740%)

Similar or better improvement expected!

Risk Assessment

  • Technical Risk: LOW (proven pattern from Phase 7)
  • Stability Risk: LOW (simple, non-invasive change)
  • Memory Risk: LOW (1.6MB is negligible for Pool workloads)
  • Complexity Risk: LOW (< 50 LOC change)

Related Documents

  • CLAUDE.md - Development history (Phase 1.5a documented)
  • POOL_TLS_QUICKSTART.md - Quick start guide
  • POOL_TLS_INVESTIGATION_FINAL.md - Phase 1.5a debugging journey
  • PHASE7_TASK3_RESULTS.md - Pre-warm success pattern (Tiny)

🚀 Next Actions

NOW: Start Step 1 - Design pre-warm strategy
NEXT: Implement the pool_tls_prewarm() function
THEN: Build, test, benchmark

Estimated Completion: 1.5 hours from start
Success Probability: 90% (proven technique)


Status: Ready to implement - awaiting user confirmation to proceed! 🚀