# Task for Other AI: Fix 4T High-Contention Crash (Mixed Allocation Bug)

**Date**: 2025-11-08
**Priority**: CRITICAL
**Status**: BLOCKING production deployment

---

## Executive Summary

**Problem**: 4T high-contention crash with **70% failure rate** (6/20 success)
**Root Cause Identified**: Mixed HAKMEM/libc allocations causing `free(): invalid pointer`
**Your Mission**: Fix the mixed allocation bug to achieve **100% stability**

---

## Background

### Current Status

Phase 7 optimization achieved **excellent performance**:

- Single-threaded: **91.3% of System malloc** (target was 40-55%) ✅
- Multi-threaded low-contention: **100% stable** ✅
- **BUT**: 4T high-contention: **70% crash rate** ❌

### What Works

```bash
# ✅ Works perfectly (100% stable)
./larson_hakmem 1 1 128 1024 1 12345 1    # 1T: 2.74M ops/s
./larson_hakmem 2 8 128 1024 1 12345 2    # 2T: 4.91M ops/s
./larson_hakmem 10 8 128 256 1 12345 4    # 4T low: 251K ops/s

# ❌ Crashes 70% of the time
./larson_hakmem 10 8 128 1024 1 12345 4   # 4T high: 981K ops/s (when it works)
```

### What Breaks

**Crash pattern**:

```
free(): invalid pointer
[DEBUG] superslab_refill returned NULL (OOM) detail: class=4 prev_ss=(nil) active=0 bitmap=0x00000000 prev_meta=(nil) used=0 cap=0 slab_idx=0 reused_freelist=0 free_idx=-2 errno=12
```

**Sequence of events**:

1. Thread exhausts SuperSlab for class 6 (or 1, 4)
2. `superslab_refill()` fails with OOM (errno=12, ENOMEM)
3. Code falls back to `malloc()` (libc malloc)
4. Now we have **mixed allocations**: some from HAKMEM, some from libc
5. `free()` receives a libc-allocated pointer
6. HAKMEM's free path tries to handle it → **CRASH**

---

## Root Cause Analysis (from Task Agent)

### The Mixed Allocation Problem

**File**: `core/box/hak_alloc_api.inc.h` or similar allocation paths

**Current behavior**:

```c
// Pseudo-code of current allocation path
void* hak_alloc(size_t size) {
    // Try HAKMEM allocation
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;

    // HAKMEM failed (OOM) → fallback to libc malloc
    return malloc(size);  // ← PROBLEM: Now we have mixed allocations!
}

void hak_free(void* ptr) {
    // Try to free as HAKMEM allocation
    if (looks_like_hakmem(ptr)) {
        hakmem_free(ptr);  // ← PROBLEM: What if it's actually from malloc()?
    } else {
        free(ptr);         // ← PROBLEM: What if we guessed wrong?
    }
}
```

**Why this crashes**:

- HAKMEM can't distinguish between HAKMEM-allocated and malloc-allocated pointers
- Header-based detection is unreliable (malloc memory might look like HAKMEM headers)
- Cross-allocation free causes corruption/crashes
- `free(): invalid pointer` is glibc's own abort message, so at least one misrouted pointer ends up in libc `free()`

### Why SuperSlab OOM Happens

**High-contention scenario**:

- 4 threads × 1024 chunks each = 4096 concurrent allocations
- All threads allocate 128B blocks (class 4 or 6)
- SuperSlab runs out of slabs for that class
- No dynamic scaling → OOM

**Evidence**: `bitmap=0x00000000` means all 32 slabs exhausted

---

## Your Mission: 3 Potential Fixes (Choose Best Approach)

### Option A: Disable malloc Fallback (Recommended - Safest)

**Idea**: Make allocation failures explicit instead of silently falling back

**Implementation**:

**File**: Find the allocation path that does malloc fallback (likely `core/box/hak_alloc_api.inc.h` or `core/hakmem_tiny.c`)

**Change**:

```c
#include <errno.h>   // errno, ENOMEM (needed by the "After" version)
#include <stdio.h>   // fprintf

// Before (BROKEN):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;

    // Fallback to malloc (causes mixed allocations)
    return malloc(size);  // ❌ BAD
}

// After (SAFE):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (!ptr) {
        // OOM: Log and fail explicitly
        fprintf(stderr, "[HAKMEM] OOM for size=%zu, returning NULL\n", size);
        errno = ENOMEM;
        return NULL;  // ✅ Explicit failure
    }
    return ptr;
}
```

**Pros**:

- Simple and safe
- No mixed allocations
- Caller can handle OOM explicitly

**Cons**:

- Applications must handle NULL returns
- Might break code that assumes malloc never fails

**Testing**:

```bash
# Should complete without crashes OR fail cleanly with OOM message
./larson_hakmem 10 8 128 1024 1 12345 4
```

---

### Option B: Fix SuperSlab Starvation (Recommended - Best Long-term)

**Idea**: Prevent OOM by dynamically scaling SuperSlab capacity

**Implementation**:

**File**: `core/tiny_superslab_alloc.inc.h` or SuperSlab management code

**Change 1: Detect starvation**:

```c
// In superslab_refill()
// Sketch only: the helper names below are placeholders for the real SuperSlab API.
if (bitmap == 0x00000000) {
    // All slabs exhausted → try to allocate more
    fprintf(stderr, "[HAKMEM] SuperSlab class %d exhausted, allocating more...\n", class_idx);

    // Allocate a new SuperSlab
    SuperSlab* new_ss = allocate_superslab(class_idx);
    if (new_ss) {
        register_superslab(new_ss);
        // Retry refill from new SuperSlab
        return refill_from_superslab(new_ss, class_idx, count);
    }
}
```

**Change 2: Increase initial capacity for hot classes**:

```c
// In SuperSlab initialization
// Classes 1, 4, 6 are hot in multi-threaded workloads
if (class_idx == 1 || class_idx == 4 || class_idx == 6) {
    initial_slabs = 64;  // Double capacity for hot classes
} else {
    initial_slabs = 32;  // Default
}
```

**Pros**:

- Fixes root cause (OOM)
- No mixed allocations needed
- Scales naturally with workload

**Cons**:

- More complex
- Memory overhead for extra SuperSlabs

**Testing**:

```bash
# Should complete 100% of the time without OOM
for i in {1..20}; do
  ./larson_hakmem 10 8 128 1024 1 12345 4
done
```

---

### Option C: Add Allocation Ownership Tracking (Comprehensive)

**Idea**: Track which allocator owns each pointer

**Implementation**:

**File**: `core/box/hak_free_api.inc.h` or free path

**Change 1: Add ownership bitmap**:

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Global bitmap to track HAKMEM allocations
// Each bit represents a 64KB region
#define OWNERSHIP_REGION_SHIFT 16             // 64KB regions
#define OWNERSHIP_BITMAP_SIZE  (1ULL << 20)   // 1M bits = 64GB coverage
static _Atomic uint64_t g_hakmem_ownership_bitmap[OWNERSHIP_BITMAP_SIZE / 64];

// Mark every 64KB region touched by [ptr, ptr + size) as HAKMEM-owned.
// NOTE: addresses at or above 64GB fall outside the bitmap and are skipped;
// a real implementation needs a base offset or a multi-level table.
static inline void mark_hakmem_allocation(void* ptr, size_t size) {
    uintptr_t first = (uintptr_t)ptr >> OWNERSHIP_REGION_SHIFT;
    uintptr_t last  = ((uintptr_t)ptr + (size ? size - 1 : 0)) >> OWNERSHIP_REGION_SHIFT;
    for (uintptr_t region = first; region <= last && region < OWNERSHIP_BITMAP_SIZE; region++) {
        atomic_fetch_or(&g_hakmem_ownership_bitmap[region / 64], 1ULL << (region % 64));
    }
}

// Check if allocation is HAKMEM-owned
static inline int is_hakmem_allocation(void* ptr) {
    uintptr_t region = (uintptr_t)ptr >> OWNERSHIP_REGION_SHIFT;
    if (region >= OWNERSHIP_BITMAP_SIZE) return 0;  // outside tracked range
    uint64_t word = atomic_load(&g_hakmem_ownership_bitmap[region / 64]);
    return (word & (1ULL << (region % 64))) != 0;
}
```

**Change 2: Use ownership in free path**:

```c
void hak_free(void* ptr) {
    if (is_hakmem_allocation(ptr)) {
        hakmem_free(ptr);  // ✅ Confirmed HAKMEM
    } else {
        free(ptr);         // ✅ Confirmed libc malloc
    }
}
```

**Pros**:

- Allows mixed allocations safely
- Works with existing malloc fallback

**Cons**:

- Complex to implement correctly
- Memory overhead for bitmap
- Atomic operations on free path
- Only correct if each marked 64KB region is exclusively HAKMEM-owned; the sketch above never clears bits and does not track addresses beyond its 64GB window

---

## Recommendation: **Combine Option A + Option B**

**Phase 1 (Immediate - 1 hour)**: Disable malloc fallback (Option A)

- Quick and safe fix
- Prevents crashes immediately
- Test 4T stability → should be 100%

**Phase 2 (Next - 2-4 hours)**: Fix SuperSlab starvation (Option B)

- Implement dynamic SuperSlab scaling
- Increase capacity for hot classes (1, 4, 6)
- Remove Option A workaround

**Phase 3 (Optional)**: Add ownership tracking (Option C) for defense-in-depth

---

## Testing Requirements

### Test 1: Stability (CRITICAL)

```bash
# Must achieve 100% success rate
for i in {1..20}; do
  echo "Run $i:"
  env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
    ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "Throughput"
  # PIPESTATUS[0] is the benchmark's exit code; plain $? would report grep's
  echo "Exit code: ${PIPESTATUS[0]}"
done
# Expected: 20/20 success (100%)
```

### Test 2: Performance (No regression)

```bash
# Should maintain ~981K ops/s
env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
  ./larson_hakmem 10 8 128 1024 1 12345 4
# Expected: Throughput ≈ 981K ops/s (same as before)
```

### Test 3: Regression Check

```bash
# Ensure low-contention still works
./larson_hakmem 1 1 128 1024 1 12345 1   # 1T
./larson_hakmem 2 8 128 1024 1 12345 2   # 2T
./larson_hakmem 10 8 128 256 1 12345 4   # 4T low
# Expected: All complete successfully
```

---

## Success Criteria

✅ **4T high-contention stability: 100% (20/20 runs)**
✅ **No performance regression** (≥950K ops/s)
✅ **No crashes or OOM errors**
✅ **1T/2T/4T low-contention still work**

---

## Files to Review/Modify

**Likely files** (search for malloc fallback):

1. `core/box/hak_alloc_api.inc.h` - Main allocation API
2. `core/hakmem_tiny.c` - Tiny allocator implementation
3. `core/tiny_alloc_fast.inc.h` - Fast path allocation
4. `core/tiny_superslab_alloc.inc.h` - SuperSlab allocation
5. `core/hakmem_tiny_refill_p0.inc.h` - Refill logic

**Search commands**:

```bash
# Find malloc fallback
grep -rn "malloc(" core/ | grep -v "//.*malloc"

# Find OOM handling
grep -rn "errno.*ENOMEM\|OOM\|returned NULL" core/

# Find SuperSlab allocation
grep -rn "superslab_refill\|allocate.*superslab" core/
```

---

## Expected Deliverable

**Report file**: `/mnt/workdisk/public_share/hakmem/PHASE7_MIXED_ALLOCATION_FIX.md`

**Required sections**:

1. **Approach chosen** (A, B, C, or combination)
2. **Code changes** (diffs showing before/after)
3. **Why it works** (explanation of fix)
4. **Test results** (20/20 stability test)
5. **Performance impact** (before/after comparison)
6. **Production readiness** (YES/NO verdict)

---

## Context Documents

- `PHASE7_4T_STABILITY_VERIFICATION.md` - Recent stability test (30% success)
- `PHASE7_BUG3_FIX_REPORT.md` - Previous debugging attempts
- `PHASE7_FINAL_BENCHMARK_RESULTS.md` - Overall Phase 7 results
- `CLAUDE.md` - Project history and status

---

## Questions? Debug Hints

**Q: Where is the malloc fallback code?**
A: Search for `malloc(` in `core/box/*.inc.h` and `core/hakmem_tiny*.c`

**Q: How do I test just the fix without full rebuild?**
A: Build only the `larson_hakmem` target: `make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem`

**Q: What if Option A causes application crashes?**
A: That's expected if the app doesn't handle malloc failures. Move to Option B.

**Q: How do I know if SuperSlab OOM is fixed?**
A: No more `[DEBUG] superslab_refill returned NULL (OOM)` messages in output

---

**Good luck! Let's achieve 100% stability! 🚀**
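
---

## Appendix: Toy Illustration of Explicit OOM and Address-Based Ownership

As a companion to Options A and C above, the following is a minimal, single-threaded sketch of the two ideas they rely on: failing explicitly with `ENOMEM` instead of falling back to `malloc()`, and deciding ownership from the pointer's address rather than from header contents. It is not HAKMEM code. The arena, the `demo_*` names, and the block sizes are all hypothetical, and a real fix would need the thread-safety that this toy omits.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical fixed-capacity arena standing in for one SuperSlab class. */
#define DEMO_BLOCK_SIZE  128u
#define DEMO_BLOCK_COUNT 32u

static _Alignas(16) unsigned char demo_arena[DEMO_BLOCK_SIZE * DEMO_BLOCK_COUNT];
static uint32_t demo_used_bitmap;  /* bit i set => block i is in use */

/* Ownership is decided by address range alone, never by inspecting headers. */
static int demo_owns(const void* ptr) {
    uintptr_t p    = (uintptr_t)ptr;
    uintptr_t base = (uintptr_t)demo_arena;
    return p >= base && p < base + sizeof(demo_arena);
}

/* Allocate one block. On exhaustion, fail explicitly (no malloc fallback). */
static void* demo_alloc(void) {
    for (unsigned i = 0; i < DEMO_BLOCK_COUNT; i++) {
        if (!(demo_used_bitmap & (1u << i))) {
            demo_used_bitmap |= (1u << i);
            return demo_arena + (size_t)i * DEMO_BLOCK_SIZE;
        }
    }
    errno = ENOMEM;
    return NULL;
}

/* Route the free by ownership: arena blocks return to the bitmap,
 * every other pointer is assumed to come from libc malloc. */
static void demo_free(void* ptr) {
    if (ptr == NULL) return;
    if (demo_owns(ptr)) {
        size_t idx = (size_t)((unsigned char*)ptr - demo_arena) / DEMO_BLOCK_SIZE;
        assert(demo_used_bitmap & (1u << idx));  /* trips on a double free */
        demo_used_bitmap &= ~(1u << idx);
    } else {
        free(ptr);
    }
}
```

Because `demo_alloc()` never hands out libc memory, the range check in `demo_free()` exists only to pass foreign pointers through safely. That is the defense-in-depth role Phase 3 assigns to ownership tracking.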