# Task for Other AI: Fix 4T High-Contention Crash (Mixed Allocation Bug)

**Date**: 2025-11-08
**Priority**: CRITICAL
**Status**: BLOCKING production deployment

---
|
## Executive Summary

**Problem**: 4T high-contention crash with **70% failure rate** (6/20 success)

**Root Cause Identified**: Mixed HAKMEM/libc allocations causing `free(): invalid pointer`

**Your Mission**: Fix the mixed allocation bug to achieve **100% stability**

---
|
## Background

### Current Status

Phase 7 optimization achieved **excellent performance**:
- Single-threaded: **91.3% of System malloc** (target was 40-55%) ✅
- Multi-threaded low-contention: **100% stable** ✅
- **BUT**: 4T high-contention: **70% crash rate** ❌
|
### What Works

```bash
# ✅ Works perfectly (100% stable)
./larson_hakmem 1 1 128 1024 1 12345 1   # 1T: 2.74M ops/s
./larson_hakmem 2 8 128 1024 1 12345 2   # 2T: 4.91M ops/s
./larson_hakmem 10 8 128 256 1 12345 4   # 4T low: 251K ops/s

# ❌ Crashes 70% of the time
./larson_hakmem 10 8 128 1024 1 12345 4  # 4T high: 981K ops/s (when it works)
```
|
### What Breaks

**Crash pattern**:
```
free(): invalid pointer
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4 prev_ss=(nil) active=0 bitmap=0x00000000
prev_meta=(nil) used=0 cap=0 slab_idx=0
reused_freelist=0 free_idx=-2 errno=12
```
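The `bitmap=0x00000000` field in the log can be read as a per-slab availability mask. The following is a minimal sketch of that reading, assuming one bit per slab and that a set bit means "slab still available"; the actual encoding in HAKMEM may differ. All `hyp_` names are hypothetical helpers, not HAKMEM functions. The last helper simply multiplies thread count by chunks per thread, which is the only benchmark argument that differs between the stable and the crashing 4T runs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one bit per slab, and a set bit means the slab is still available. */
#define HYP_SLABS_PER_SUPERSLAB 32

/* Number of slabs that are still available according to the bitmap. */
static int hyp_available_slabs(uint32_t bitmap) {
    int count = 0;
    while (bitmap != 0) {
        bitmap &= bitmap - 1;  /* clear the lowest set bit */
        count++;
    }
    assert(count <= HYP_SLABS_PER_SUPERSLAB);
    return count;
}

/* Index of the lowest available slab, or -1 when the bitmap is empty. */
static int hyp_first_available_slab(uint32_t bitmap) {
    for (int i = 0; i < HYP_SLABS_PER_SUPERSLAB; i++) {
        if (bitmap & (UINT32_C(1) << i)) return i;
    }
    return -1;
}

/* Peak number of live blocks the benchmark keeps: threads x chunks per thread. */
static unsigned hyp_peak_live_blocks(unsigned threads, unsigned chunks_per_thread) {
    return threads * chunks_per_thread;
}
```

Under these assumptions, `hyp_available_slabs(0x00000000)` is 0, and the crashing configuration keeps 4 × 1024 = 4096 blocks live versus 4 × 256 = 1024 for the stable 4T run.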
|
**Sequence of events**:
1. A thread exhausts the SuperSlab for class 6 (or 1, 4)
2. `superslab_refill()` fails with OOM (errno=12, ENOMEM)
3. The code falls back to `malloc()` (libc malloc)
4. Now we have **mixed allocations**: some from HAKMEM, some from libc
5. `free()` later receives one of the libc-allocated pointers
6. HAKMEM's free path tries to handle it as one of its own blocks → **CRASH**
|
---

## Root Cause Analysis (from Task Agent)

### The Mixed Allocation Problem

**File**: `core/box/hak_alloc_api.inc.h` or similar allocation paths
|
**Current behavior**:
```c
// Pseudo-code of current allocation path
void* hak_alloc(size_t size) {
    // Try HAKMEM allocation
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;

    // HAKMEM failed (OOM) → fallback to libc malloc
    return malloc(size);  // ← PROBLEM: Now we have mixed allocations!
}

void hak_free(void* ptr) {
    // Try to free as HAKMEM allocation
    if (looks_like_hakmem(ptr)) {
        hakmem_free(ptr);  // ← PROBLEM: What if it's actually from malloc()?
    } else {
        free(ptr);         // ← PROBLEM: What if we guessed wrong?
    }
}
```
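To see why a check like `looks_like_hakmem()` in the pseudo-code above can guess wrong, here is a minimal sketch. It assumes a hypothetical one-byte header stored directly before the user pointer, with a magic value in the high nibble; HAKMEM's real header layout may differ, and every `hyp_`/`HYP_` name is made up for illustration. The probe uses an interior pointer of a libc block to stand in for "a pointer HAKMEM never produced", so the byte in front of it is plain application data.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header: high nibble = magic, low nibble = size class. */
#define HYP_HEADER_MAGIC 0xA0u

/* Header-based guess: inspect the byte just before the user pointer. */
static int hyp_looks_like_hakmem(const void *ptr) {
    const uint8_t *p = (const uint8_t *)ptr;
    return (p[-1] & 0xF0u) == HYP_HEADER_MAGIC;
}

/* Places `preceding` right before a foreign pointer and returns the guess
 * the header check makes for that pointer (1 = "HAKMEM", 0 = "not HAKMEM"). */
static int hyp_classify_with_preceding_byte(uint8_t preceding) {
    uint8_t *buf = malloc(64);           /* libc-owned memory */
    assert(buf != NULL);
    memset(buf, 0, 64);
    buf[15] = preceding;                 /* data that sits before the pointer */
    int guess = hyp_looks_like_hakmem(buf + 16);
    free(buf);
    return guess;
}
```

With this layout, `hyp_classify_with_preceding_byte(0xA4)` reports "HAKMEM" for memory that libc owns, which is exactly the misclassification that sends a foreign pointer down the wrong free path.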
|
**Why this crashes**:
- HAKMEM cannot reliably distinguish HAKMEM-allocated pointers from malloc-allocated ones
- Header-based detection is unreliable (the bytes in front of a malloc pointer may happen to look like a HAKMEM header)
- Freeing a pointer through the wrong allocator causes corruption or crashes
|
### Why SuperSlab OOM Happens

**High-contention scenario**:
- 4 threads × 1024 chunks each = 4096 concurrent allocations
- All threads allocate small blocks (8-128B per the benchmark arguments; the refill failures hit class 4 or 6)
- SuperSlab runs out of slabs for that class
- No dynamic scaling → OOM

**Evidence**: `bitmap=0x00000000` in the crash log means all 32 slabs are exhausted
|
---

## Your Mission: 3 Potential Fixes (Choose Best Approach)

### Option A: Disable malloc Fallback (Recommended - Safest)

**Idea**: Make allocation failures explicit instead of silently falling back

**Implementation**:

**File**: Find the allocation path that does malloc fallback (likely `core/box/hak_alloc_api.inc.h` or `core/hakmem_tiny.c`)
|
**Change**:
```c
// Before (BROKEN):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;

    // Fallback to malloc (causes mixed allocations)
    return malloc(size);  // ❌ BAD
}

// After (SAFE): needs <errno.h> and <stdio.h>
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (!ptr) {
        // OOM: Log and fail explicitly
        fprintf(stderr, "[HAKMEM] OOM for size=%zu, returning NULL\n", size);
        errno = ENOMEM;
        return NULL;  // ✅ Explicit failure
    }
    return ptr;
}
```
|
**Pros**:
- Simple and safe
- No mixed allocations
- Caller can handle OOM explicitly

**Cons**:
- Applications must handle NULL returns
- Might break code that assumes malloc never fails

**Testing**:
```bash
# Should complete without crashes OR fail cleanly with OOM message
./larson_hakmem 10 8 128 1024 1 12345 4
```
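Option A pushes OOM handling to the caller, so it helps to see what a well-behaved caller looks like. The following is a minimal sketch, not HAKMEM code: `hyp_alloc()` is a hypothetical stand-in for `hak_alloc()` after the change (it fails explicitly once a small budget is used up), and `hyp_alloc_all_or_nothing()` is an illustrative caller that checks every return value and rolls back cleanly.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for hak_alloc() after Option A: explicit failure, no fallback.
 * The budget only exists to force an OOM in this sketch. */
static size_t hyp_budget = 3;

static void *hyp_alloc(size_t size) {
    if (hyp_budget == 0) {
        errno = ENOMEM;
        return NULL;
    }
    hyp_budget--;
    return malloc(size);  /* one allocator only, so free() always matches */
}

/* Fills n slots; on OOM releases what was allocated and returns -1. */
static int hyp_alloc_all_or_nothing(void **slots, size_t n, size_t size) {
    for (size_t i = 0; i < n; i++) {
        slots[i] = hyp_alloc(size);
        if (slots[i] == NULL) {
            int saved_errno = errno;  /* free() may clobber errno on some libcs */
            while (i > 0) {
                i--;
                free(slots[i]);
                slots[i] = NULL;
            }
            errno = saved_errno;
            return -1;
        }
    }
    return 0;
}
```

A caller written this way sees `-1` with `errno == ENOMEM` instead of a silently mixed heap, which is the behavior the "fail cleanly" test expectation above relies on.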
|
---

### Option B: Fix SuperSlab Starvation (Recommended - Best Long-term)

**Idea**: Prevent OOM by dynamically scaling SuperSlab capacity

**Implementation**:

**File**: `core/tiny_superslab_alloc.inc.h` or SuperSlab management code
|
**Change 1: Detect starvation**:
```c
// In superslab_refill()
if (bitmap == 0x00000000) {
    // All slabs exhausted → try to allocate more
    fprintf(stderr, "[HAKMEM] SuperSlab class %d exhausted, allocating more...\n", class_idx);

    // Allocate a new SuperSlab
    SuperSlab* new_ss = allocate_superslab(class_idx);
    if (new_ss) {
        register_superslab(new_ss);
        // Retry refill from new SuperSlab
        return refill_from_superslab(new_ss, class_idx, count);
    }
    // Allocation failed: fall through to the existing OOM handling
}
```
|
**Change 2: Increase initial capacity for hot classes**:
```c
// In SuperSlab initialization
// Classes 1, 4, 6 are hot in multi-threaded workloads
if (class_idx == 1 || class_idx == 4 || class_idx == 6) {
    initial_slabs = 64;  // Double capacity for hot classes
} else {
    initial_slabs = 32;  // Default
}
```
|
**Pros**:
- Fixes root cause (OOM)
- No mixed allocations needed
- Scales naturally with workload

**Cons**:
- More complex
- Memory overhead for extra SuperSlabs

**Testing**:
```bash
# Should complete 100% of the time without OOM
for i in {1..20}; do ./larson_hakmem 10 8 128 1024 1 12345 4; done
```
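The retry-after-growth idea behind Option B can be modelled in isolation. The sketch below is a single-threaded toy, not the HAKMEM data structures: `ToySuperSlab`, `ToyClassPool`, and `toy_take_slab()` are hypothetical names, each toy SuperSlab is assumed to hold 32 slabs tracked by a bitmap where a set bit means "available", and `malloc()` merely stands in for mapping a new SuperSlab. A real implementation would also need locking or atomics around the list and bitmaps.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy model: 32 slabs per SuperSlab, set bit = slab available (assumption). */
typedef struct ToySuperSlab {
    uint32_t bitmap;
    struct ToySuperSlab *next;
} ToySuperSlab;

typedef struct {
    ToySuperSlab *head;
    int superslab_count;
} ToyClassPool;

/* Claims one slab. When every bitmap is empty, grows the pool by one
 * SuperSlab and retries there. Returns 0 on success, -1 if growth fails. */
static int toy_take_slab(ToyClassPool *pool) {
    for (ToySuperSlab *ss = pool->head; ss != NULL; ss = ss->next) {
        if (ss->bitmap != 0) {
            ss->bitmap &= ss->bitmap - 1;  /* claim the lowest available slab */
            return 0;
        }
    }
    ToySuperSlab *fresh = malloc(sizeof *fresh);  /* stand-in for a new mapping */
    if (fresh == NULL) return -1;  /* genuine OOM: report it, no foreign fallback */
    fresh->bitmap = UINT32_MAX;
    fresh->bitmap &= fresh->bitmap - 1;  /* claim the first slab for the caller */
    fresh->next = pool->head;
    pool->head = fresh;
    pool->superslab_count++;
    return 0;
}

/* Releases every toy SuperSlab in the pool. */
static void toy_pool_destroy(ToyClassPool *pool) {
    while (pool->head != NULL) {
        ToySuperSlab *next = pool->head->next;
        free(pool->head);
        pool->head = next;
    }
    pool->superslab_count = 0;
}
```

In this model, an empty bitmap triggers growth rather than an error, so 100 slab requests end up spread over four toy SuperSlabs.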
|
---

### Option C: Add Allocation Ownership Tracking (Comprehensive)

**Idea**: Track which allocator owns each pointer

**Implementation**:

**File**: `core/box/hak_free_api.inc.h` or free path
|
**Change 1: Add ownership bitmap**:
```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Global bitmap to track HAKMEM allocations
// Each bit represents a 64KB region
#define OWNERSHIP_BITMAP_SIZE (1ULL << 20)  // 1M bits = 64GB coverage
static _Atomic uint64_t g_hakmem_ownership_bitmap[OWNERSHIP_BITMAP_SIZE / 64];

// NOTE: indexing by absolute address only covers the first 64GB of the
// address space. mmap'd memory on x86-64 Linux usually sits far above that,
// so index relative to the arena base (or enlarge the bitmap) before use.

// Mark allocation as HAKMEM-owned
static inline void mark_hakmem_allocation(void* ptr, size_t size) {
    (void)size;  // only the region containing ptr is marked
    uintptr_t addr = (uintptr_t)ptr;
    size_t region = addr / (64 * 1024);  // 64KB regions
    if (region >= OWNERSHIP_BITMAP_SIZE) return;  // outside bitmap coverage
    size_t word = region / 64;
    size_t bit = region % 64;
    atomic_fetch_or(&g_hakmem_ownership_bitmap[word], 1ULL << bit);
}

// Check if allocation is HAKMEM-owned
static inline int is_hakmem_allocation(void* ptr) {
    uintptr_t addr = (uintptr_t)ptr;
    size_t region = addr / (64 * 1024);
    if (region >= OWNERSHIP_BITMAP_SIZE) return 0;  // outside bitmap coverage
    size_t word = region / 64;
    size_t bit = region % 64;
    return (atomic_load(&g_hakmem_ownership_bitmap[word]) & (1ULL << bit)) != 0;
}
```
|
**Change 2: Use ownership in free path**:
```c
void hak_free(void* ptr) {
    if (is_hakmem_allocation(ptr)) {
        hakmem_free(ptr);  // ✅ Confirmed HAKMEM
    } else {
        free(ptr);         // ✅ Confirmed libc malloc
    }
}
```
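One practical detail for Option C is how a pointer is turned into a bitmap index. The following is a minimal sketch that indexes regions relative to a base address and rejects anything outside the tracked window, so lookups can never run past the bitmap. It assumes HAKMEM can reserve or know a contiguous address window for its memory, which the source does not confirm; `HypOwnershipMap` and all `hyp_` functions are hypothetical, and the window is kept tiny (64 MiB) for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define HYP_REGION_SHIFT 16           /* 64 KiB regions, as in the text */
#define HYP_REGION_COUNT (1u << 10)   /* 1024 regions = 64 MiB window (sketch only) */

typedef struct {
    uintptr_t base;                   /* start of the tracked address window */
    _Atomic uint64_t bits[HYP_REGION_COUNT / 64];
} HypOwnershipMap;

/* Computes the region index for ptr; returns 0 if ptr is outside the window. */
static int hyp_region_index(const HypOwnershipMap *m, const void *ptr, size_t *out) {
    uintptr_t addr = (uintptr_t)ptr;
    if (addr < m->base) return 0;
    uintptr_t region = (addr - m->base) >> HYP_REGION_SHIFT;
    if (region >= HYP_REGION_COUNT) return 0;
    *out = (size_t)region;
    return 1;
}

/* Marks the region containing ptr as owned; returns 0 if it cannot be tracked. */
static int hyp_mark(HypOwnershipMap *m, const void *ptr) {
    size_t r;
    if (!hyp_region_index(m, ptr, &r)) return 0;
    atomic_fetch_or_explicit(&m->bits[r / 64], UINT64_C(1) << (r % 64),
                             memory_order_release);
    return 1;
}

/* Returns 1 only if ptr lies in a region previously marked as owned. */
static int hyp_is_owned(HypOwnershipMap *m, const void *ptr) {
    size_t r;
    if (!hyp_region_index(m, ptr, &r)) return 0;
    uint64_t word = atomic_load_explicit(&m->bits[r / 64], memory_order_acquire);
    return (int)((word >> (r % 64)) & 1u);
}
```

Returning a status from `hyp_mark()` is a deliberate choice in this sketch: if a block cannot be tracked, the allocator can refuse to hand it out instead of later freeing it through the wrong allocator.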
|
**Pros**:
- Allows mixed allocations safely
- Works with existing malloc fallback

**Cons**:
- Complex to implement correctly
- Memory overhead for bitmap
- Atomic operations on the allocation and free paths
|
---

## Recommendation: **Combine Option A + Option B**

**Phase 1 (Immediate - 1 hour)**: Disable malloc fallback (Option A)
- Quick and safe fix
- Prevents crashes immediately
- Test 4T stability → should be 100%

**Phase 2 (Next - 2-4 hours)**: Fix SuperSlab starvation (Option B)
- Implement dynamic SuperSlab scaling
- Increase capacity for hot classes (1, 4, 6)
- Remove Option A workaround

**Phase 3 (Optional)**: Add ownership tracking (Option C) for defense-in-depth
|
---

## Testing Requirements

### Test 1: Stability (CRITICAL)
|
```bash
# Must achieve 100% success rate
for i in {1..20}; do
  echo "Run $i:"
  env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
    ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "Throughput"
  # PIPESTATUS[0] is the benchmark's exit status; plain $? would report grep's
  echo "Exit code: ${PIPESTATUS[0]}"
done

# Expected: 20/20 success (100%)
```
|
### Test 2: Performance (No regression)

```bash
# Should maintain ~981K ops/s
env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
  ./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: Throughput ≈ 981K ops/s (same as before)
```
|
### Test 3: Regression Check

```bash
# Ensure low-contention still works
./larson_hakmem 1 1 128 1024 1 12345 1  # 1T
./larson_hakmem 2 8 128 1024 1 12345 2  # 2T
./larson_hakmem 10 8 128 256 1 12345 4  # 4T low

# Expected: All complete successfully
```
|
---

## Success Criteria

✅ **4T high-contention stability: 100% (20/20 runs)**
✅ **No performance regression** (≥950K ops/s)
✅ **No crashes or OOM errors**
✅ **1T/2T/4T low-contention still work**
|
---

## Files to Review/Modify

**Likely files** (search for malloc fallback):
1. `core/box/hak_alloc_api.inc.h` - Main allocation API
2. `core/hakmem_tiny.c` - Tiny allocator implementation
3. `core/tiny_alloc_fast.inc.h` - Fast path allocation
4. `core/tiny_superslab_alloc.inc.h` - SuperSlab allocation
5. `core/hakmem_tiny_refill_p0.inc.h` - Refill logic

**Search commands**:
```bash
# Find malloc fallback
grep -rn "malloc(" core/ | grep -v "//.*malloc"

# Find OOM handling
grep -rn "errno.*ENOMEM\|OOM\|returned NULL" core/

# Find SuperSlab allocation
grep -rn "superslab_refill\|allocate.*superslab" core/
```
|
---

## Expected Deliverable

**Report file**: `/mnt/workdisk/public_share/hakmem/PHASE7_MIXED_ALLOCATION_FIX.md`

**Required sections**:
1. **Approach chosen** (A, B, C, or combination)
2. **Code changes** (diffs showing before/after)
3. **Why it works** (explanation of fix)
4. **Test results** (20/20 stability test)
5. **Performance impact** (before/after comparison)
6. **Production readiness** (YES/NO verdict)
|
---

## Context Documents

- `PHASE7_4T_STABILITY_VERIFICATION.md` - Recent stability test (30% success)
- `PHASE7_BUG3_FIX_REPORT.md` - Previous debugging attempts
- `PHASE7_FINAL_BENCHMARK_RESULTS.md` - Overall Phase 7 results
- `CLAUDE.md` - Project history and status
|
---

## Questions? Debug Hints

**Q: Where is the malloc fallback code?**
A: Search for `malloc(` in `core/box/*.inc.h` and `core/hakmem_tiny*.c`

**Q: How do I test just the fix without full rebuild?**
A: Build only the benchmark target: `make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem`. Note that `make clean` still forces a full recompile of that target.

**Q: What if Option A causes application crashes?**
A: That's expected if the app doesn't handle malloc failures. Move to Option B.

**Q: How do I know if SuperSlab OOM is fixed?**
A: The output no longer contains `[DEBUG] superslab_refill returned NULL (OOM)` messages
|
---

**Good luck! Let's achieve 100% stability! 🚀**