# Pool Hot Path Bottleneck Analysis
## Executive Summary
**Root Cause**: The Pool allocator is ~100x slower than expected because of a **pthread_mutex_lock in the hot path** (line 267 of `core/box/pool_core_api.inc.h`).

- **Current Performance**: 434,611 ops/s
- **Expected Performance**: 50-80M ops/s
- **Gap**: ~100x slower
## Critical Finding: Mutex in Hot Path
### The Smoking Gun (Line 267)
```c
// core/box/pool_core_api.inc.h:267
pthread_mutex_t* lock = &g_pool.freelist_locks[class_idx][shard_idx].m;
pthread_mutex_lock(lock); // 💀 FULL KERNEL MUTEX IN HOT PATH
```
**Impact**: Every allocation that misses ALL TLS caches falls into this mutex lock:
- **Mutex overhead**: 100-500 cycles (atomic fast path when free; futex syscall once contended)
- **Contention overhead**: 1000+ cycles under MT load
- **Cache invalidation**: 50-100 cycles from cache line bouncing
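
To put numbers on this, the per-pair cost is easy to measure directly. Below is a minimal microbenchmark (not from the hakmem tree; assumes x86-64 and GCC/Clang, where `__rdtsc` comes from `<x86intrin.h>`) that times an uncontended lock/unlock pair; under real MT contention the same pair degrades by roughly an order of magnitude once the futex path is taken:

```c
// Hedged sketch: cycles per uncontended lock/unlock pair.
// Build: gcc -O2 bench_mutex.c -o bench_mutex -lpthread (x86-64 only).
#include <pthread.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    enum { N = 1000000 };
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    unsigned long long t1 = __rdtsc();
    printf("%.1f cycles per uncontended lock/unlock pair\n",
           (double)(t1 - t0) / N);
    return 0;
}
```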
## Detailed Bottleneck Breakdown
### Pool Allocator Hot Path (hak_pool_try_alloc)
```c
Line 234-236: TC drain check        // ~20-30 cycles
Line 236:     TLS ring check        // ~10-20 cycles
Line 237:     TLS LIFO check        // ~10-20 cycles
Line 240-256: Trylock probe loop    // ~100-300 cycles (3 attempts!)
Line 258-261: Active page checks    // ~30-50 cycles (3 pages!)
Line 267:     pthread_mutex_lock    // 💀 100-500+ cycles
Line 280:     refill_freelist       // ~1000+ cycles (mmap)
```
**Total worst case**: 1500-2500 cycles per allocation
### Tiny Allocator Hot Path (tiny_alloc_fast)
```c
Line 205: Load TLS head        // 1 cycle
Line 206: Check NULL           // 1 cycle
Line 238: Update head = *next  // 2-3 cycles
Return                         // 1 cycle
```
**Total**: 5-6 cycles (300x faster!)
## Performance Analysis
### Cycle Cost Breakdown
| Operation | Pool (cycles) | Tiny (cycles) | Ratio |
|-----------|---------------|---------------|-------|
| TLS cache check | 60-100 | 2-3 | 30x slower |
| Trylock probes | 100-300 | 0 | ∞ |
| Mutex lock | 100-500 | 0 | ∞ |
| Atomic operations | 50-100 | 0 | ∞ |
| Random generation | 10-20 | 0 | ∞ |
| **Total Hot Path** | **320-1020** | **5-6** | **64-170x slower** |
### Why Tiny is Fast
1. **Single TLS freelist**: Direct pointer pop (3-4 instructions)
2. **No locks**: Pure TLS, zero synchronization
3. **No atomics**: Thread-local only
4. **Simple refill**: Batch from SuperSlab when empty
### Why Pool is Slow
1. **Multiple cache layers**: Ring + LIFO + Active pages (complex checks)
2. **Trylock probes**: Up to 3 mutex attempts before main lock
3. **Full mutex lock**: Kernel syscall in hot path
4. **Atomic remote lists**: Memory barriers and cache invalidation
5. **Per-allocation RNG**: Extra cycles for sampling
## Root Causes
### 1. Over-Engineered Architecture
Pool has 5 layers of caching before hitting the mutex:
- TC (Thread Cache) drain
- TLS ring
- TLS LIFO
- Active pages (3 of them!)
- Trylock probes
Each layer adds branches and cycles, yet the path still falls back to the mutex!
### 2. Mutex-Protected Freelist
The core freelist is protected by **64 mutexes** (7 classes × 8 shards + extra), but this still causes massive contention under MT load.
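The `.m` accessor in the smoking-gun snippet suggests each mutex is wrapped in a padded struct. A hypothetical reconstruction of that layout (illustrative names and sizes, not the actual hakmem definitions) shows why padding avoids false sharing between neighboring locks but does nothing about contention on a single hot shard:

```c
// Hypothetical sketch of the sharded lock table; not the real definitions.
#include <pthread.h>

#define POOL_NUM_CLASSES 7
#define POOL_NUM_SHARDS  8   // illustrative; the text says 64 locks total

typedef struct {
    pthread_mutex_t m;                              // matches the `.m` accessor
    char pad[64 - (sizeof(pthread_mutex_t) % 64)];  // pad to a full cache line
} PaddedLock;

static PaddedLock g_freelist_locks_sketch[POOL_NUM_CLASSES][POOL_NUM_SHARDS];
```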
### 3. Complex Shard Selection
```c
// Line 238-239
int shard_idx = hak_pool_get_shard_index(site_id);
int s0 = choose_nonempty_shard(class_idx, shard_idx);
```
Selecting a shard requires a hash computation plus a nonempty-mask scan.
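In outline, the selection likely amounts to hashing the call-site ID down to a shard index, then falling back to a bitmask scan for a shard that actually has blocks. A hypothetical sketch (the `_sketch` names are illustrative, not the real implementation):

```c
#include <stdint.h>

#define POOL_NUM_SHARDS 8

// Hypothetical: hash the call-site ID down to a shard index
// (Fibonacci hashing: multiply by 2^64/phi, keep the top 3 bits).
static inline int shard_index_sketch(uintptr_t site_id) {
    return (int)(((uint64_t)site_id * 0x9E3779B97F4A7C15ull) >> 61);
}

// Hypothetical: prefer the hashed shard; otherwise scan a per-class
// nonempty bitmask (one bit per shard) for any shard with blocks.
static inline int choose_nonempty_shard_sketch(uint32_t nonempty_mask,
                                               int preferred) {
    if (nonempty_mask & (1u << preferred)) return preferred;
    if (nonempty_mask == 0) return -1;    // every shard is empty
    return __builtin_ctz(nonempty_mask);  // lowest nonempty shard
}
```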
## Proposed Fix: Lock-Free Pool Allocator
### Solution 1: Copy Tiny's Approach (Recommended)
- **Effort**: 4-6 hours
- **Expected Performance**: 40-60M ops/s

Replace the entire Pool hot path with a Tiny-style TLS freelist:
```c
void* hak_pool_try_alloc_fast(size_t size, uintptr_t site_id) {
    int class_idx = hak_pool_get_class_index(size);
    // Simple TLS freelist (like Tiny)
    void* head = g_tls_pool_head[class_idx];
    if (head) {
        g_tls_pool_head[class_idx] = *(void**)head;
        return (char*)head + HEADER_SIZE;
    }
    // Refill from backend (batch, no lock)
    return pool_refill_and_alloc(class_idx);
}
```
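`pool_refill_and_alloc` is left unspecified above; one plausible shape for it (a sketch under assumed names; `HEADER_SIZE` and `g_tls_pool_head` are taken from the fast path above, the rest is hypothetical) carves a fresh page into blocks and threads them onto the TLS freelist in one pass, amortizing the mmap cost over ~64 allocations:

```c
#include <stddef.h>
#include <sys/mman.h>

#define REFILL_BATCH 64
#define HEADER_SIZE  16  // assumed; matches the usage in the fast path above

extern __thread void* g_tls_pool_head[];                 // from the fast path
extern size_t hak_pool_class_block_size(int class_idx);  // hypothetical helper

void* pool_refill_and_alloc(int class_idx) {
    size_t block = hak_pool_class_block_size(class_idx);
    char* page = mmap(NULL, block * REFILL_BATCH, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return NULL;

    // Thread blocks 1..N-1 onto the TLS freelist (LIFO, one pass);
    // block 0 is handed straight back to the caller.
    for (int i = 1; i < REFILL_BATCH; i++) {
        void** slot = (void**)(page + (size_t)i * block);
        *slot = g_tls_pool_head[class_idx];
        g_tls_pool_head[class_idx] = slot;
    }
    return page + HEADER_SIZE;
}
```

The only syscall on this path is the occasional `mmap`; no shared freelist or lock is touched.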
### Solution 2: Remove Mutex, Use CAS
- **Effort**: 8-12 hours
- **Expected Performance**: 20-30M ops/s

Replace the mutex with lock-free CAS operations:
```c
// Instead of pthread_mutex_lock
PoolBlock* old_head;
do {
    old_head = atomic_load(&g_pool.freelist[class_idx][shard_idx]);
    if (!old_head) break;
} while (!atomic_compare_exchange_weak(&g_pool.freelist[class_idx][shard_idx],
                                       &old_head, old_head->next));
```
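One caveat worth stating with this option: a bare CAS pop is exposed to the classic ABA problem, where the head block is popped, freed, and pushed back between the `atomic_load` and the CAS, so the CAS succeeds against a stale `next` pointer. Common mitigations are hazard pointers, never unmapping pool pages (so `old_head->next` at least stays readable), or packing a generation counter next to the pointer in a double-width CAS. A hedged sketch of the last option, assuming x86-64 with 16-byte atomics (compile with `-mcx16`, link `-latomic` if needed; names are illustrative):

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

// Pointer plus generation counter: the counter makes a recycled head
// value distinguishable from the original, defeating ABA.
typedef struct { PoolBlock* ptr; uintptr_t gen; } TaggedHead;

static _Atomic TaggedHead g_head;  // 16-byte atomic

static PoolBlock* tagged_pop(void) {
    TaggedHead old = atomic_load(&g_head);
    while (old.ptr) {
        // Reading old.ptr->next is safe only if pool pages are never unmapped.
        TaggedHead desired = { old.ptr->next, old.gen + 1 };
        // On failure, `old` is refreshed with the current head+generation.
        if (atomic_compare_exchange_weak(&g_head, &old, desired))
            return old.ptr;
    }
    return NULL;
}
```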
### Solution 3: Increase TLS Cache Hit Rate
- **Effort**: 2-3 hours
- **Expected Performance**: 5-10M ops/s (partial improvement)
- Increase POOL_L2_RING_CAP from 64 to 256
- Pre-warm TLS caches at init (like Tiny Phase 7; see the sketch after this list)
- Batch refill 64 blocks at once
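Pre-warming is just one batch refill per size class during init, before any workload traffic. A minimal sketch (reusing the hypothetical `pool_refill_and_alloc` from Solution 1; `hak_pool_free` is an assumed matching free entry point):

```c
// Hedged sketch: one batch refill per class at init, so the first real
// allocations hit the TLS freelist instead of the shared freelist.
#define POOL_NUM_CLASSES 7

extern void* pool_refill_and_alloc(int class_idx);
extern void  hak_pool_free(void* p);  // assumed free entry point

static void hak_pool_prewarm_tls(void) {
    for (int c = 0; c < POOL_NUM_CLASSES; c++) {
        void* p = pool_refill_and_alloc(c);  // pulls a whole batch into TLS
        if (p) hak_pool_free(p);             // recycle the one carved block
    }
}
```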
## Implementation Plan
### Quick Win (2 hours)
1. Increase `POOL_L2_RING_CAP` to 256
2. Add pre-warming in `hak_pool_init()`
3. Test performance
### Full Fix (6 hours)
1. Create `pool_fast_path.inc.h` (copy from tiny_alloc_fast.inc.h)
2. Replace `hak_pool_try_alloc` with simple TLS freelist
3. Implement batch refill without locks
4. Add feature flag for rollback safety (sketch after this list)
5. Test MT performance
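For step 4, the flag can be a one-line compile-time switch in front of the public entry point (a sketch; `hak_pool_try_alloc_legacy` is a hypothetical name for the current mutex-based path):

```c
#include <stddef.h>
#include <stdint.h>

// Hedged sketch: feature flag for rollback safety. Default to the new
// fast path; building with -DHAKMEM_POOL_FAST_PATH=0 restores old behavior.
#ifndef HAKMEM_POOL_FAST_PATH
#define HAKMEM_POOL_FAST_PATH 1
#endif

extern void* hak_pool_try_alloc_fast(size_t size, uintptr_t site_id);
extern void* hak_pool_try_alloc_legacy(size_t size, uintptr_t site_id);

void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
#if HAKMEM_POOL_FAST_PATH
    return hak_pool_try_alloc_fast(size, site_id);
#else
    return hak_pool_try_alloc_legacy(size, site_id);
#endif
}
```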
## Expected Results
With proposed fix (Solution 1):
- **Current**: 434,611 ops/s
- **Expected**: 40-60M ops/s
- **Improvement**: 92-138x faster
- **vs System**: Should achieve 70-90% of System malloc
## Files to Modify
1. `core/box/pool_core_api.inc.h`: Replace lines 229-286
2. `core/hakmem_pool.h`: Add TLS freelist declarations
3. Create `core/pool_fast_path.inc.h`: New fast path implementation
## Success Metrics
- ✅ Pool allocation hot path < 20 cycles
- ✅ No mutex locks in common case
- ✅ TLS hit rate > 95%
- ✅ Performance > 40M ops/s for 8-32KB allocations
- ✅ MT scaling without contention
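
The first and fourth metrics can be checked with a cycle-counting harness along these lines (a sketch, not part of the repo; assumes x86-64 and a matching `hak_pool_free`):

```c
// Hedged sketch: steady-state cycles per alloc/free pair for a
// mid-range class (16 KiB) once TLS caches are warm. x86-64 only.
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

extern void* hak_pool_try_alloc_fast(size_t size, uintptr_t site_id);
extern void  hak_pool_free(void* p);  // assumed free entry point

int main(void) {
    enum { WARMUP = 10000, N = 1000000 };
    const size_t sz = 16 * 1024;
    for (int i = 0; i < WARMUP; i++)  // fill TLS caches first
        hak_pool_free(hak_pool_try_alloc_fast(sz, 0));
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < N; i++)
        hak_pool_free(hak_pool_try_alloc_fast(sz, 0));
    unsigned long long t1 = __rdtsc();
    printf("%.1f cycles per alloc/free pair\n", (double)(t1 - t0) / N);
    return 0;
}
```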