## Changes ### 1. core/page_arena.c - Removed init failure message (lines 25-27) - error is handled by returning early - All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks ### 2. core/hakmem.c - Wrapped SIGSEGV handler init message (line 72) - CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs ### 3. core/hakmem_shared_pool.c - Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE: - Node pool exhaustion warning (line 252) - SP_META_CAPACITY_ERROR warning (line 421) - SP_FIX_GEOMETRY debug logging (line 745) - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865) - SP_ACQUIRE_STAGE0_L0 debug logging (line 803) - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922) - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996) - SP_ACQUIRE_STAGE3 debug logging (line 1116) - SP_SLOT_RELEASE debug logging (line 1245) - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305) - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316) - Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized ## Performance Validation Before: 51M ops/s (with debug fprintf overhead) After: 49.1M ops/s (consistent performance, fprintf removed from hot paths) ## Build & Test ```bash ./build.sh larson_hakmem ./out/release/larson_hakmem 1 5 1 1000 100 10000 42 # Result: 49.1M ops/s ``` Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
10 KiB
Task for Other AI: Fix 4T High-Contention Crash (Mixed Allocation Bug)
Date: 2025-11-08 Priority: CRITICAL Status: BLOCKING production deployment
Executive Summary
Problem: 4T high-contention crash with 70% failure rate (6/20 success)
Root Cause Identified: Mixed HAKMEM/libc allocations causing free(): invalid pointer
Your Mission: Fix the mixed allocation bug to achieve 100% stability
Background
Current Status
Phase 7 optimization achieved excellent performance:
- Single-threaded: 91.3% of System malloc (target was 40-55%) ✅
- Multi-threaded low-contention: 100% stable ✅
- BUT: 4T high-contention: 70% crash rate ❌
What Works
# ✅ Works perfectly (100% stable)
./larson_hakmem 1 1 128 1024 1 12345 1 # 1T: 2.74M ops/s
./larson_hakmem 2 8 128 1024 1 12345 2 # 2T: 4.91M ops/s
./larson_hakmem 10 8 128 256 1 12345 4 # 4T low: 251K ops/s
# ❌ Crashes 70% of the time
./larson_hakmem 10 8 128 1024 1 12345 4 # 4T high: 981K ops/s (when it works)
What Breaks
Crash pattern:
free(): invalid pointer
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4 prev_ss=(nil) active=0 bitmap=0x00000000
prev_meta=(nil) used=0 cap=0 slab_idx=0
reused_freelist=0 free_idx=-2 errno=12
Sequence of events:
- Thread exhausts SuperSlab for class 6 (or 1, 4)
superslab_refill()fails with OOM (errno=12, ENOMEM)- Code falls back to
malloc()(libc malloc) - Now we have mixed allocations: some from HAKMEM, some from libc
free()receives a libc-allocated pointer- HAKMEM's free path tries to handle it → CRASH
Root Cause Analysis (from Task Agent)
The Mixed Allocation Problem
File: core/box/hak_alloc_api.inc.h or similar allocation paths
Current behavior:
// Pseudo-code of current allocation path
void* hak_alloc(size_t size) {
// Try HAKMEM allocation
void* ptr = hak_tiny_alloc(size);
if (ptr) return ptr;
// HAKMEM failed (OOM) → fallback to libc malloc
return malloc(size); // ← PROBLEM: Now we have mixed allocations!
}
void hak_free(void* ptr) {
// Try to free as HAKMEM allocation
if (looks_like_hakmem(ptr)) {
hakmem_free(ptr); // ← PROBLEM: What if it's actually from malloc()?
} else {
free(ptr); // ← PROBLEM: What if we guessed wrong?
}
}
Why this crashes:
- HAKMEM can't distinguish between HAKMEM-allocated and malloc-allocated pointers
- Header-based detection is unreliable (malloc memory might look like HAKMEM headers)
- Cross-allocation free causes corruption/crashes
Why SuperSlab OOM Happens
High-contention scenario:
- 4 threads × 1024 chunks each = 4096 concurrent allocations
- All threads allocate 128B blocks (class 4 or 6)
- SuperSlab runs out of slabs for that class
- No dynamic scaling → OOM
Evidence: bitmap=0x00000000 means all 32 slabs exhausted
Your Mission: 3 Potential Fixes (Choose Best Approach)
Option A: Disable malloc Fallback (Recommended - Safest)
Idea: Make allocation failures explicit instead of silently falling back
Implementation:
File: Find the allocation path that does malloc fallback (likely core/box/hak_alloc_api.inc.h or core/hakmem_tiny.c)
Change:
// Before (BROKEN):
void* hak_alloc(size_t size) {
void* ptr = hak_tiny_alloc(size);
if (ptr) return ptr;
// Fallback to malloc (causes mixed allocations)
return malloc(size); // ❌ BAD
}
// After (SAFE):
void* hak_alloc(size_t size) {
void* ptr = hak_tiny_alloc(size);
if (!ptr) {
// OOM: Log and fail explicitly
fprintf(stderr, "[HAKMEM] OOM for size=%zu, returning NULL\n", size);
errno = ENOMEM;
return NULL; // ✅ Explicit failure
}
return ptr;
}
Pros:
- Simple and safe
- No mixed allocations
- Caller can handle OOM explicitly
Cons:
- Applications must handle NULL returns
- Might break code that assumes malloc never fails
Testing:
# Should complete without crashes OR fail cleanly with OOM message
./larson_hakmem 10 8 128 1024 1 12345 4
Option B: Fix SuperSlab Starvation (Recommended - Best Long-term)
Idea: Prevent OOM by dynamically scaling SuperSlab capacity
Implementation:
File: core/tiny_superslab_alloc.inc.h or SuperSlab management code
Change 1: Detect starvation:
// In superslab_refill()
if (bitmap == 0x00000000) {
// All slabs exhausted → try to allocate more
fprintf(stderr, "[HAKMEM] SuperSlab class %d exhausted, allocating more...\n", class_idx);
// Allocate a new SuperSlab
SuperSlab* new_ss = allocate_superslab(class_idx);
if (new_ss) {
register_superslab(new_ss);
// Retry refill from new SuperSlab
return refill_from_superslab(new_ss, class_idx, count);
}
}
Change 2: Increase initial capacity for hot classes:
// In SuperSlab initialization
// Classes 1, 4, 6 are hot in multi-threaded workloads
if (class_idx == 1 || class_idx == 4 || class_idx == 6) {
initial_slabs = 64; // Double capacity for hot classes
} else {
initial_slabs = 32; // Default
}
Pros:
- Fixes root cause (OOM)
- No mixed allocations needed
- Scales naturally with workload
Cons:
- More complex
- Memory overhead for extra SuperSlabs
Testing:
# Should complete 100% of the time without OOM
for i in {1..20}; do ./larson_hakmem 10 8 128 1024 1 12345 4; done
Option C: Add Allocation Ownership Tracking (Comprehensive)
Idea: Track which allocator owns each pointer
Implementation:
File: core/box/hak_free_api.inc.h or free path
Change 1: Add ownership bitmap:
// Global bitmap to track HAKMEM allocations
// Each bit represents a 64KB region
#define OWNERSHIP_BITMAP_SIZE (1ULL << 20) // 1M bits = 64GB coverage
static uint64_t g_hakmem_ownership_bitmap[OWNERSHIP_BITMAP_SIZE / 64];
// Mark allocation as HAKMEM-owned
static inline void mark_hakmem_allocation(void* ptr, size_t size) {
uintptr_t addr = (uintptr_t)ptr;
size_t region = addr / (64 * 1024); // 64KB regions
size_t word = region / 64;
size_t bit = region % 64;
atomic_fetch_or(&g_hakmem_ownership_bitmap[word], 1ULL << bit);
}
// Check if allocation is HAKMEM-owned
static inline int is_hakmem_allocation(void* ptr) {
uintptr_t addr = (uintptr_t)ptr;
size_t region = addr / (64 * 1024);
size_t word = region / 64;
size_t bit = region % 64;
return (g_hakmem_ownership_bitmap[word] & (1ULL << bit)) != 0;
}
Change 2: Use ownership in free path:
void hak_free(void* ptr) {
if (is_hakmem_allocation(ptr)) {
hakmem_free(ptr); // ✅ Confirmed HAKMEM
} else {
free(ptr); // ✅ Confirmed libc malloc
}
}
Pros:
- Allows mixed allocations safely
- Works with existing malloc fallback
Cons:
- Complex to implement correctly
- Memory overhead for bitmap
- Atomic operations on free path
Recommendation: Combine Option A + Option B
Phase 1 (Immediate - 1 hour): Disable malloc fallback (Option A)
- Quick and safe fix
- Prevents crashes immediately
- Test 4T stability → should be 100%
Phase 2 (Next - 2-4 hours): Fix SuperSlab starvation (Option B)
- Implement dynamic SuperSlab scaling
- Increase capacity for hot classes (1, 4, 6)
- Remove Option A workaround
Phase 3 (Optional): Add ownership tracking (Option C) for defense-in-depth
Testing Requirements
Test 1: Stability (CRITICAL)
# Must achieve 100% success rate
for i in {1..20}; do
echo "Run $i:"
env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "Throughput"
echo "Exit code: $?"
done
# Expected: 20/20 success (100%)
Test 2: Performance (No regression)
# Should maintain ~981K ops/s
env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
./larson_hakmem 10 8 128 1024 1 12345 4
# Expected: Throughput ≈ 981K ops/s (same as before)
Test 3: Regression Check
# Ensure low-contention still works
./larson_hakmem 1 1 128 1024 1 12345 1 # 1T
./larson_hakmem 2 8 128 1024 1 12345 2 # 2T
./larson_hakmem 10 8 128 256 1 12345 4 # 4T low
# Expected: All complete successfully
Success Criteria
✅ 4T high-contention stability: 100% (20/20 runs) ✅ No performance regression (≥950K ops/s) ✅ No crashes or OOM errors ✅ 1T/2T/4T low-contention still work
Files to Review/Modify
Likely files (search for malloc fallback):
core/box/hak_alloc_api.inc.h- Main allocation APIcore/hakmem_tiny.c- Tiny allocator implementationcore/tiny_alloc_fast.inc.h- Fast path allocationcore/tiny_superslab_alloc.inc.h- SuperSlab allocationcore/hakmem_tiny_refill_p0.inc.h- Refill logic
Search commands:
# Find malloc fallback
grep -rn "malloc(" core/ | grep -v "//.*malloc"
# Find OOM handling
grep -rn "errno.*ENOMEM\|OOM\|returned NULL" core/
# Find SuperSlab allocation
grep -rn "superslab_refill\|allocate.*superslab" core/
Expected Deliverable
Report file: /mnt/workdisk/public_share/hakmem/PHASE7_MIXED_ALLOCATION_FIX.md
Required sections:
- Approach chosen (A, B, C, or combination)
- Code changes (diffs showing before/after)
- Why it works (explanation of fix)
- Test results (20/20 stability test)
- Performance impact (before/after comparison)
- Production readiness (YES/NO verdict)
Context Documents
PHASE7_4T_STABILITY_VERIFICATION.md- Recent stability test (30% success)PHASE7_BUG3_FIX_REPORT.md- Previous debugging attemptsPHASE7_FINAL_BENCHMARK_RESULTS.md- Overall Phase 7 resultsCLAUDE.md- Project history and status
Questions? Debug Hints
Q: Where is the malloc fallback code?
A: Search for malloc( in core/box/*.inc.h and core/hakmem_tiny*.c
Q: How do I test just the fix without full rebuild?
A: make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem
Q: What if Option A causes application crashes? A: That's expected if the app doesn't handle malloc failures. Move to Option B.
Q: How do I know if SuperSlab OOM is fixed?
A: No more [DEBUG] superslab_refill returned NULL (OOM) messages in output
Good luck! Let's achieve 100% stability! 🚀