# Pool TLS Phase 1.5a - Arena munmap Bug Fix
## Problem
**Symptom:** `./bench_mid_large_mt_hakmem 1 50000 256 42` → SEGV (Exit 139)
**Root Cause:** TLS Arena was `munmap()`ing old chunks when growing, but **live allocations** still pointed into those chunks!
**Failure Scenario:**
1. Thread allocates 64 blocks of 8KB (refill from arena)
2. Blocks are returned to user code
3. Some blocks are freed back to TLS cache
4. More allocations trigger another refill
5. Arena chunk grows → `munmap()` of old chunk
6. **Old blocks still in use now point to unmapped memory!**
7. When those blocks are freed → SEGV when accessing header
**Code Location:** `/mnt/workdisk/public_share/hakmem/core/pool_tls_arena.c:40`
```c
// BUGGY CODE (removed):
if (chunk->chunk_base) {
    munmap(chunk->chunk_base, chunk->chunk_size); // ← SEGV! Live ptrs exist!
}
```
## Solution
**Arena Standard Behavior:** Arenas grow but **never shrink** during thread lifetime.
Old chunks are intentionally "leaked" because they contain live allocations. They are only freed at thread exit via `arena_cleanup_thread()`.
**Fix Applied:**
```c
// CRITICAL FIX: DO NOT munmap old chunk!
// Reason: Live allocations may still point into it. Arena chunks are kept
// alive for the thread's lifetime and only freed at thread exit.
// This is standard arena behavior - grow but never shrink.
//
// OLD CHUNK IS LEAKED INTENTIONALLY - it contains live allocations
```
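The grow-only policy described above can be sketched as follows. This is a minimal illustration, not the code in `pool_tls_arena.c`: every `sketch_*` name, the 1 MiB initial size, and the doubling growth factor are assumptions made for the example. The point it shows is that growing links the old chunk behind the new one instead of unmapping it, and only the cleanup function (the analogue of `arena_cleanup_thread()`) calls `munmap()`.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical chunk header; the real hakmem layout may differ. */
typedef struct sketch_chunk {
    struct sketch_chunk *prev;     /* older chunk, deliberately kept mapped */
    size_t size;                   /* total mapping size in bytes */
    size_t used;                   /* bump offset, header included */
} sketch_chunk;

typedef struct { sketch_chunk *current; } sketch_arena;

/* Header size rounded up so payloads stay 16-byte aligned. */
#define SKETCH_HDR ((sizeof(sketch_chunk) + 15) & ~(size_t)15)

static int sketch_arena_grow(sketch_arena *a, size_t min_payload) {
    size_t size = a->current ? a->current->size * 2 : (size_t)1 << 20;
    while (size < SKETCH_HDR + min_payload) size *= 2;
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    sketch_chunk *c = mem;
    c->prev = a->current;          /* NO munmap: live blocks point into it */
    c->size = size;
    c->used = SKETCH_HDR;
    a->current = c;
    return 0;
}

static void *sketch_arena_alloc(sketch_arena *a, size_t n) {
    assert(a != NULL);
    n = (n + 15) & ~(size_t)15;    /* round request up to 16 bytes */
    if (!a->current || a->current->size - a->current->used < n) {
        if (sketch_arena_grow(a, n) != 0) return NULL;
    }
    void *p = (char *)a->current + a->current->used;
    a->current->used += n;
    return p;
}

/* Thread-exit cleanup: the only place chunks are unmapped. */
static void sketch_arena_cleanup(sketch_arena *a) {
    sketch_chunk *c = a->current;
    while (c) {
        sketch_chunk *prev = c->prev;
        munmap(c, c->size);
        c = prev;
    }
    a->current = NULL;
}

/* Returns 0 if a block from the first chunk is still readable after growth. */
int sketch_arena_demo(void) {
    sketch_arena a = {0};
    unsigned char *old = sketch_arena_alloc(&a, 8192);
    if (!old) return -1;
    old[0] = 0xA5;
    sketch_chunk *first = a.current;
    while (a.current == first) {   /* allocate 8KB blocks until the arena grows */
        if (!sketch_arena_alloc(&a, 8192)) return -1;
    }
    int ok = (old[0] == 0xA5);     /* old chunk is still mapped */
    sketch_arena_cleanup(&a);
    return ok ? 0 : 1;
}
```

With the buggy behavior, the read of `old[0]` after growth would touch unmapped memory; here it succeeds because the first chunk stays on the `prev` list until cleanup.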
## Results
### Before Fix
- 100 iterations: **PASS**
- 150 iterations: **PASS**
- 200 iterations: **SEGV** (Exit 139)
- 50K iterations: **SEGV** (Exit 139)
### After Fix
- 50K iterations (1T): **898K ops/s**
- 100K iterations (1T): **1.01M ops/s**
- 50K iterations (4T): **2.66M ops/s**
**Stability:** 3 consecutive runs at 50K iterations:
- Run 1: 900,870 ops/s
- Run 2: 887,748 ops/s
- Run 3: 893,364 ops/s
**Average:** ~894K ops/s (consistent with previous 863K target, variance is normal)
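The average and run-to-run spread quoted above can be checked with a few lines of arithmetic. The helper names below are hypothetical; the three samples are the run results listed above.

```c
#include <assert.h>
#include <stddef.h>

/* The three 50K-iteration runs reported above, in ops/s. */
const double sketch_runs[3] = {900870.0, 887748.0, 893364.0};

/* Arithmetic mean of n samples. */
double sketch_mean(const double *x, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (double)n;
}

/* (max - min) / mean, as a fraction of the mean. */
double sketch_rel_spread(const double *x, size_t n) {
    assert(n > 0);
    double lo = x[0], hi = x[0];
    for (size_t i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    return (hi - lo) / sketch_mean(x, n);
}
```

For these samples the mean is 893,994 ops/s and the spread between the fastest and slowest run is about 1.5% of the mean.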
## Why Previous Fixes Weren't Sufficient
**Previous session fixes (all still in place):**
1. `/mnt/workdisk/public_share/hakmem/core/tiny_region_id.h:74` - Magic validation
2. `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h:56-77` - Header safety checks
3. `/mnt/workdisk/public_share/hakmem/core/box/hak_free_api.inc.h:81-111` - Pool TLS dispatch
These fixes prevented **invalid header dereference**, but didn't fix the **root cause** of unmapped memory access from prematurely freed arena chunks.
## Memory Impact
**Q:** Does this leak memory?
**A:** Not in the unbounded sense. Old chunks are retained until thread exit, which is standard arena behavior:
- Old chunks are kept alive (containing live allocations)
- Thread-local arena (~1.6MB typical working set)
- Chunks are freed at thread exit via `arena_cleanup_thread()`
- Total memory: O(thread count × working set) - acceptable
**Alternative (complex):** Track live allocations per chunk with reference counting → too slow for hot path
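**Industry Standard:** jemalloc, tcmalloc, and mimalloc likewise never unmap a region that still holds live allocations; they return memory to the OS only once it is known to be unused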
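For comparison, the reference-counting alternative mentioned above might look roughly like the sketch below. All `rc_*` names are hypothetical and this is not code from hakmem. It only illustrates the extra bookkeeping that every allocation and free would carry. A real implementation would also have to locate the owning chunk from a bare pointer on free, and would need atomic counters if blocks can be freed from another thread.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical per-chunk bookkeeping stored at the start of the mapping. */
typedef struct {
    size_t size;      /* mapping size in bytes */
    size_t live;      /* blocks handed out and not yet freed */
    int    retired;   /* set once the arena has moved to a newer chunk */
} rc_chunk;

rc_chunk *rc_chunk_map(size_t size) {
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return NULL;
    rc_chunk *c = mem;
    c->size = size;
    c->live = 0;
    c->retired = 0;
    return c;
}

/* Extra work on every allocation. */
void rc_on_alloc(rc_chunk *c) { c->live++; }

/* Extra work on every free. Returns 1 if the chunk was unmapped. */
int rc_on_free(rc_chunk *c) {
    assert(c->live > 0);
    if (--c->live == 0 && c->retired) {
        munmap(c, c->size);
        return 1;
    }
    return 0;
}

/* Called on growth instead of an unconditional munmap(). */
int rc_retire(rc_chunk *c) {
    c->retired = 1;
    if (c->live == 0) {
        munmap(c, c->size);
        return 1;
    }
    return 0;
}

/* Returns 0 if the chunk survives retirement and dies on the last free. */
int rc_demo(void) {
    rc_chunk *c = rc_chunk_map((size_t)1 << 16);
    if (!c) return -1;
    rc_on_alloc(c);
    rc_on_alloc(c);
    if (rc_retire(c) != 0) return 1;    /* two live blocks: must stay mapped */
    if (rc_on_free(c) != 0) return 2;   /* one block still live */
    if (rc_on_free(c) != 1) return 3;   /* last free unmaps the chunk */
    return 0;
}
```

The grow-only approach avoids all of this: allocation and free never touch per-chunk state, at the cost of holding old chunks until thread exit.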
## Files Modified
1. `/mnt/workdisk/public_share/hakmem/core/pool_tls_arena.c:38-54` - Removed buggy `munmap()` call
## Build Commands
```bash
make clean
make POOL_TLS_PHASE1=1 HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 bench_mid_large_mt_hakmem
./bench_mid_large_mt_hakmem 1 50000 256 42
```
## Next Steps
Pool TLS Phase 1.5a is now **STABLE** at 50K+ iterations!
Ready for:
- ✅ Phase 1.5b: Pre-warm TLS cache (next task)
- ✅ Phase 1.5c: Optimize mincore() overhead (future)
## Lessons Learned
1. **Arena Lifetime Management:** Never `munmap()` chunks with potential live allocations
2. **Load-Dependent Bugs:** The crash appeared only at 200+ iterations, which pointed to chunk growth as the trigger
3. **Standard Patterns:** Follow industry-standard arena behavior (grow-only)