Files
hakmem/docs/analysis/MID_LARGE_P0_FIX_REPORT_20251114.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 完了:環境変数整理 + fprintf デバッグガード

ENV変数削除(BG/HotMag系):
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除(旧レポート・重複docs)

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作)
- ENV整理による機能影響なし
- Debug出力は一部残存(次phase で対応)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00

8.6 KiB
Raw Blame History

Mid-Large Allocator P0 Fix Report (2025-11-14)

Executive Summary

Status: P0-1 FIXED - Pool TLS disabled by default Status: 🚧 P0-2 IDENTIFIED - Remote queue mutex contention

Performance Impact:

Before Fix (Pool TLS OFF): 0.24M ops/s (1% of mimalloc)
After Fix  (Pool TLS ON):  0.97M ops/s (4% of mimalloc, +304%)
Remaining Gap:             5.6x slower than System, 25x slower than mimalloc

Problem 1: Pool TLS Disabled by Default FIXED

Root Cause

File: build.sh:105-107

# Default: Pool TLSはOFF必要時のみ明示ON。短時間ベンチでのmutexとpage faultコストを避ける。
POOL_TLS_PHASE1_DEFAULT=${POOL_TLS_PHASE1:-0}  # デフォルト: OFF
POOL_TLS_PREWARM_DEFAULT=${POOL_TLS_PREWARM:-0}  # デフォルト: OFF

Impact: 8KB-52KB allocations bypassed Pool TLS entirely, falling through to:

  1. Mid allocator (ineffective for some sizes)
  2. ACE allocator (returns NULL for 33KB)
  3. Final mmap fallback (extremely slow)

Allocation Path Analysis

Before Fix (8KB-32KB allocations):

hak_alloc_at()
  ├─ Tiny check (size > 1024) → SKIP
  ├─ Pool TLS check → DISABLED ❌
  ├─ Mid check → SKIP/NULL
  ├─ ACE check → NULL (confirmed via logs)
  └─ Final fallback → mmap (SLOW!)

After Fix:

hak_alloc_at()
  ├─ Tiny check (size > 1024) → SKIP
  ├─ Pool TLS check → pool_alloc() ✅
  │   ├─ TLS cache hit → FAST!
  │   └─ Cold path → arena_batch_carve()
  └─ (no fallback needed)

Fix Applied

Build Command:

POOL_TLS_PHASE1=1 POOL_TLS_PREWARM=1 ./build.sh bench_mid_large_mt_hakmem

Result:

  • Pool TLS enabled and functional
  • No [POOL_ARENA] or [POOL_TLS] error logs → normal operation
  • Performance: 0.24M → 0.97M ops/s (+304%)

Problem 2: Remote Queue Mutex Contention 🚧 IDENTIFIED

Syscall Analysis (strace)

% time   calls  usec/call  syscall
------- ------- ---------- -------
67.59%    209      6,482   futex      ← Dominant bottleneck!
17.30%  46,665        7    mincore
14.95%  47,647        6    gettid
 0.10%    209        9    mmap

futex accounts for 67% of syscall time (1.35 seconds total)

Root Cause

File: core/pool_tls_remote.c:27-44

int pool_remote_push(int class_idx, void* ptr, int owner_tid){
  // ...
  pthread_mutex_lock(&g_locks[b]);   // ← Cross-thread free → mutex contention!
  // Push to remote queue
  pthread_mutex_unlock(&g_locks[b]);
  return 1;
}

Why This is Expensive:

  • Multi-threaded benchmark: 2 threads × 40K ops = 80K allocations
  • Cross-thread frees are frequent in mixed workload
  • Every cross-thread free → mutex lock → potential futex syscall
  • Threads contend on g_locks[b] hash buckets

Also Found: pool_tls_registry.c uses mutex for registry operations:

  • pool_reg_register(): line 31 (on chunk allocation)
  • pool_reg_unregister(): line 41 (on chunk deallocation)
  • pool_reg_lookup(): line 52 (on pointer ownership resolution)

Registry calls: 209 (matches mmap count), less frequent but still contributes.


Performance Comparison

Current Results (Pool TLS ON)

Benchmark: bench_mid_large_mt_hakmem 2 40000 2048 42

System malloc:   5.4M ops/s  (100%)
mimalloc:       24.2M ops/s  (448%)
HAKMEM (before): 0.24M ops/s  (4.4%)  ← Pool TLS OFF
HAKMEM (after):  0.97M ops/s  (18%)   ← Pool TLS ON (+304%)

Remaining Gap:

  • vs System: 5.6x slower
  • vs mimalloc: 25x slower

Perf Stat Analysis

perf stat -e cycles,instructions,branches,branch-misses,cache-misses -- \
  ./bench_mid_large_mt_hakmem 2 40000 2048 42

Throughput:         0.93M ops/s (average of 3 runs)
Branch misses:      11.03% (high)
Cache misses:       2.3M
L1 D-cache misses:  6.4M

Debug Logs Added

Files Modified:

  1. core/pool_tls_arena.c:82-90 - mmap failure logging
  2. core/pool_tls_arena.c:126-133 - chunk_ensure failure logging
  3. core/pool_tls.c:118-128 - refill failure logging

Example Output:

[POOL_ARENA] mmap FAILED: new_size=8 MB, growth_level=3, errno=12
[POOL_ARENA] chunk_ensure FAILED: class=3, block_size=32768, count=64, needed=2097152
[POOL_TLS] pool_refill_and_alloc FAILED: class=3, size=32768

Result: No errors logged → Pool TLS operating normally.


Next Steps (Priority Order)

Option A: Fix Remote Queue Mutex (High Impact) 🔥

Priority: P0 (67% syscall time!)

Approaches:

  1. Lock-free MPSC queue (multi-producer, single-consumer)

    • Use atomic operations (CAS) instead of mutex
    • Example: mimalloc's thread message queue
    • Expected: 50-70% futex time reduction
  2. Per-thread batching

    • Buffer remote frees on sender side
    • Push in batches (e.g., every 64 frees)
    • Reduces lock frequency 64x
  3. Thread-local remote slots (TLS sender buffer)

    • Each thread maintains per-class remote buffers
    • Periodic flush to owner's queue
    • Avoids lock on every free

Expected Impact: 0.97M → 3-5M ops/s (+200-400%)

Option B: Fix build.sh Default (Mid Impact) 🛠️

Priority: P1 (prevents future confusion)

Change: build.sh:106

# OLD (buggy default):
POOL_TLS_PHASE1_DEFAULT=${POOL_TLS_PHASE1:-0}  # OFF

# NEW (correct default for mid-large targets):
if [[ "${TARGET}" == *"mid_large"* || "${TARGET}" == *"pool_tls"* ]]; then
  POOL_TLS_PHASE1_DEFAULT=${POOL_TLS_PHASE1:-1}  # AUTO-ENABLE for mid-large
else
  POOL_TLS_PHASE1_DEFAULT=${POOL_TLS_PHASE1:-0}  # Keep OFF for tiny benchmarks
fi

Benefit: Prevents accidental regression for mid-large workloads.

Option C: Re-run A/B Benchmark (Low Priority) 📊

Command:

POOL_TLS_PHASE1=1 POOL_TLS_PREWARM=1 scripts/bench_mid_large_mt_ab.sh

Purpose:

  • Measure Pool TLS improvement across thread counts (2, 4, 8)
  • Compare with system/mimalloc baselines
  • Generate updated results CSV

Expected Results:

  • 2 threads: 0.97M ops/s (current)
  • 4 threads: ~1.5M ops/s (if futex contention increases)

Lessons Learned

1. Always Check Build Flags First ⚠️

Mistake: Spent time debugging allocator internals before checking build configuration.

Lesson: When benchmark performance is unexpectedly poor, verify:

  • Build flags (make print-flags)
  • Compiler optimizations (-O3, -DNDEBUG)
  • Feature toggles (e.g., POOL_TLS_PHASE1)

2. Debug Logs Are Essential 📋

Impact: Added 3 debug logs (15 lines of code) → instantly confirmed Pool TLS was working.

Pattern:

static _Atomic int fail_count = 0;
int n = atomic_fetch_add(&fail_count, 1);
if (n < 10) {  // Limit spam
    fprintf(stderr, "[MODULE] Event: details\n");
}

3. strace Overhead Can Mislead 🐌

Observation:

  • Without strace: 0.97M ops/s
  • With strace: 0.079M ops/s (12x slower!)

Lesson: Use perf stat for low-overhead profiling, reserve strace for syscall pattern analysis only.

4. Futex Time ≠ Futex Count

Data:

  • futex calls: 209
  • futex time: 67% (1.35 sec)
  • Average: 6.5ms per futex call!

Implication: High contention → threads sleeping on mutex → expensive futex waits.


Code Changes Summary

1. Debug Instrumentation Added

File Lines Purpose
core/pool_tls_arena.c 82-90 Log mmap failures
core/pool_tls_arena.c 126-133 Log chunk_ensure failures
core/pool_tls.c 118-128 Log refill failures

2. Headers Added

File Change
core/pool_tls_arena.c Added <stdio.h>, <errno.h>, <stdatomic.h>
core/pool_tls.c Added <stdatomic.h>

Note: No logic changes, only observability improvements.


Recommendations

Immediate (This Session)

  1. Done: Fix Pool TLS disabled issue (+304%)
  2. Done: Identify futex bottleneck (pool_remote_push)
  3. 🔄 Pending: Implement lock-free remote queue (Option A)

Short-Term (Next Session)

  1. Lock-free MPSC queue for pool_remote_push()
  2. Update build.sh to auto-enable Pool TLS for mid-large targets
  3. Re-run A/B benchmarks with Pool TLS enabled

Long-Term

  1. Registry optimization: Lock-free hash table or per-thread caching
  2. mincore reduction: 17% syscall time, Phase 7 side-effect?
  3. gettid caching: 47K calls, should be cached via TLS

Conclusion

P0-1 FIXED: Pool TLS disabled by default caused 97x performance gap.

P0-2 IDENTIFIED: Remote queue mutex accounts for 67% syscall time.

Current Status: 0.97M ops/s (4% of mimalloc, +304% from baseline)

Next Priority: Implement lock-free remote queue to target 3-5M ops/s.


Report Generated: 2025-11-14 Author: Claude Code + User Collaboration Session: Bottleneck Analysis Phase 12