hakmem/docs/status/TASK_FOR_OTHER_AI.md


Task for Other AI: Fix 4T High-Contention Crash (Mixed Allocation Bug)

Date: 2025-11-08
Priority: CRITICAL
Status: BLOCKING production deployment


Executive Summary

Problem: 4T high-contention crash with 70% failure rate (6/20 success)

Root Cause Identified: Mixed HAKMEM/libc allocations causing free(): invalid pointer

Your Mission: Fix the mixed allocation bug to achieve 100% stability


Background

Current Status

Phase 7 optimization achieved excellent performance:

  • Single-threaded: 91.3% of System malloc (target was 40-55%)
  • Multi-threaded low-contention: 100% stable
  • BUT: 4T high-contention: 70% crash rate

What Works

# ✅ Works perfectly (100% stable)
./larson_hakmem 1 1 128 1024 1 12345 1   # 1T: 2.74M ops/s
./larson_hakmem 2 8 128 1024 1 12345 2   # 2T: 4.91M ops/s
./larson_hakmem 10 8 128 256 1 12345 4   # 4T low: 251K ops/s

# ❌ Crashes 70% of the time
./larson_hakmem 10 8 128 1024 1 12345 4  # 4T high: 981K ops/s (when it works)

What Breaks

Crash pattern:

free(): invalid pointer
[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=4 prev_ss=(nil) active=0 bitmap=0x00000000
  prev_meta=(nil) used=0 cap=0 slab_idx=0
  reused_freelist=0 free_idx=-2 errno=12

Sequence of events:

  1. Thread exhausts SuperSlab for class 6 (or 1, 4)
  2. superslab_refill() fails with OOM (errno=12, ENOMEM)
  3. Code falls back to malloc() (libc malloc)
  4. Now we have mixed allocations: some from HAKMEM, some from libc
  5. free() receives a libc-allocated pointer
  6. HAKMEM's free path tries to handle it → CRASH

Root Cause Analysis (from Task Agent)

The Mixed Allocation Problem

File: core/box/hak_alloc_api.inc.h or similar allocation paths

Current behavior:

// Pseudo-code of current allocation path
void* hak_alloc(size_t size) {
    // Try HAKMEM allocation
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;

    // HAKMEM failed (OOM) → fallback to libc malloc
    return malloc(size);  // ← PROBLEM: Now we have mixed allocations!
}

void hak_free(void* ptr) {
    // Try to free as HAKMEM allocation
    if (looks_like_hakmem(ptr)) {
        hakmem_free(ptr);  // ← PROBLEM: What if it's actually from malloc()?
    } else {
        free(ptr);  // ← PROBLEM: What if we guessed wrong?
    }
}

Why this crashes:

  • HAKMEM can't distinguish between HAKMEM-allocated and malloc-allocated pointers
  • Header-based detection is unreliable (malloc memory might look like HAKMEM headers)
  • Cross-allocation free causes corruption/crashes
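
The second point can be illustrated with a minimal sketch of a header check misfiring. It assumes a hypothetical one-byte header stored immediately before the user pointer, with a tag in the high nibble. `TOY_MAGIC`, `toy_looks_like_hakmem`, and `toy_foreign_block` are illustrative names, not HAKMEM's actual layout or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag stored in the high nibble of a one-byte header. */
#define TOY_MAGIC 0xA0u

/* Header-style check: inspects the byte just before the user pointer. */
static int toy_looks_like_hakmem(const void *ptr) {
    assert(ptr != NULL);
    const uint8_t *p = (const uint8_t *)ptr;
    return (p[-1] & 0xF0u) == TOY_MAGIC;
}

/* Simulates a block from another allocator: the byte before the returned
 * pointer is whatever that allocator (or a neighbouring object) left there. */
static void *toy_foreign_block(uint8_t *backing, uint8_t preceding_byte) {
    assert(backing != NULL);
    backing[0] = preceding_byte;
    return backing + 1;
}
```

In this toy model, a foreign block whose preceding byte happens to be 0xA4 passes the check and would be routed to the HAKMEM free path, while one preceded by 0x14 would not. The outcome depends on bytes the other allocator controls, which is why the check cannot be trusted on its own.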

Why SuperSlab OOM Happens

High-contention scenario:

  • 4 threads × 1024 chunks each = 4096 concurrent allocations
  • All threads allocate 128B blocks (class 4 or 6)
  • SuperSlab runs out of slabs for that class
  • No dynamic scaling → OOM

Evidence: bitmap=0x00000000 means all 32 slabs exhausted
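
The exhaustion condition can be sketched as follows, assuming a set bit marks a free slab (consistent with a zero bitmap meaning "all 32 slabs exhausted"). The `toy_*` names are illustrative only and do not correspond to functions in the HAKMEM source.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SLABS_PER_SUPERSLAB 32

/* Returns the index of the lowest free slab, or -1 when the bitmap is zero
 * (assumption: a set bit marks a free slab). */
static int toy_first_free_slab(uint32_t free_bitmap) {
    for (int idx = 0; idx < TOY_SLABS_PER_SUPERSLAB; idx++) {
        if ((free_bitmap >> idx) & 1u) return idx;
    }
    return -1;
}

/* Clears the bit for a claimed slab and returns the updated bitmap. */
static uint32_t toy_claim_slab(uint32_t free_bitmap, int idx) {
    assert(idx >= 0 && idx < TOY_SLABS_PER_SUPERSLAB);
    return free_bitmap & ~(UINT32_C(1) << idx);
}

/* Claims slabs until none remain; returns how many were handed out. */
static int toy_drain(uint32_t *free_bitmap) {
    int claimed = 0;
    int idx;
    while ((idx = toy_first_free_slab(*free_bitmap)) >= 0) {
        *free_bitmap = toy_claim_slab(*free_bitmap, idx);
        claimed++;
    }
    return claimed;
}
```

Starting from a full bitmap, `toy_drain` hands out exactly 32 slabs and leaves the bitmap at zero. With no way to add capacity, the next request has nowhere to go, which corresponds to the OOM state in the crash log.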


Your Mission: 3 Potential Fixes (Choose Best Approach)

Option A: Disable malloc Fallback

Idea: Make allocation failures explicit instead of silently falling back

Implementation:

File: Find the allocation path that does malloc fallback (likely core/box/hak_alloc_api.inc.h or core/hakmem_tiny.c)

Change:

// Before (BROKEN):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;

    // Fallback to malloc (causes mixed allocations)
    return malloc(size);  // ❌ BAD
}

// After (SAFE):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (!ptr) {
        // OOM: Log and fail explicitly
        fprintf(stderr, "[HAKMEM] OOM for size=%zu, returning NULL\n", size);
        errno = ENOMEM;
        return NULL;  // ✅ Explicit failure
    }
    return ptr;
}

Pros:

  • Simple and safe
  • No mixed allocations
  • Caller can handle OOM explicitly

Cons:

  • Applications must handle NULL returns
  • Might break code that assumes malloc never fails
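
To show what the first con means in practice, here is a minimal sketch of a caller that checks for NULL and propagates the failure. `toy_hak_alloc` is a hypothetical stand-in with a fixed 256-byte budget, not the real `hak_alloc`, and `toy_copy_record` is an invented caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for hak_alloc() under Option A: a fixed budget and
 * no libc fallback. On exhaustion it sets errno and returns NULL. */
static unsigned char toy_pool[256];
static size_t toy_used = 0;

static void *toy_hak_alloc(size_t size) {
    if (size == 0 || size > sizeof(toy_pool) - toy_used) {
        errno = ENOMEM;
        return NULL;
    }
    void *p = toy_pool + toy_used;
    toy_used += size;
    return p;
}

/* Caller side: copy a record, reporting allocation failure to the caller
 * instead of dereferencing NULL. Returns 0 on success, -ENOMEM on failure. */
static int toy_copy_record(const void *src, size_t len, void **out) {
    assert(src != NULL && out != NULL);
    void *dst = toy_hak_alloc(len);
    if (dst == NULL) {
        *out = NULL;
        return -ENOMEM;
    }
    memcpy(dst, src, len);
    *out = dst;
    return 0;
}
```

Code that already follows this pattern keeps working under Option A. Code that uses the result of an allocation without checking it is what the second con refers to.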

Testing:

# Should complete without crashes OR fail cleanly with OOM message
./larson_hakmem 10 8 128 1024 1 12345 4

Option B: Fix SuperSlab Starvation

Idea: Prevent OOM by dynamically scaling SuperSlab capacity

Implementation:

File: core/tiny_superslab_alloc.inc.h or SuperSlab management code

Change 1: Detect starvation:

// In superslab_refill()
if (bitmap == 0x00000000) {
    // All slabs exhausted → try to allocate more
    fprintf(stderr, "[HAKMEM] SuperSlab class %d exhausted, allocating more...\n", class_idx);

    // Allocate a new SuperSlab
    SuperSlab* new_ss = allocate_superslab(class_idx);
    if (new_ss) {
        register_superslab(new_ss);
        // Retry refill from new SuperSlab
        return refill_from_superslab(new_ss, class_idx, count);
    }
}

Change 2: Increase initial capacity for hot classes:

// In SuperSlab initialization
// Classes 1, 4, 6 are hot in multi-threaded workloads
if (class_idx == 1 || class_idx == 4 || class_idx == 6) {
    initial_slabs = 64;  // Double capacity for hot classes
} else {
    initial_slabs = 32;  // Default
}

Pros:

  • Fixes root cause (OOM)
  • No mixed allocations needed
  • Scales naturally with workload

Cons:

  • More complex
  • Memory overhead for extra SuperSlabs

Testing:

# Should complete 100% of the time without OOM
for i in {1..20}; do ./larson_hakmem 10 8 128 1024 1 12345 4; done

Option C: Add Allocation Ownership Tracking (Comprehensive)

Idea: Track which allocator owns each pointer

Implementation:

File: core/box/hak_free_api.inc.h or free path

Change 1: Add ownership bitmap:

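#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Global bitmap to track HAKMEM allocations
// Each bit represents a 64KB region
// NOTE: a flat bitmap of this size only covers addresses below 64GB;
// pointers above that range are reported as not HAKMEM-owned
#define OWNERSHIP_REGION_SHIFT 16           // 64KB regions
#define OWNERSHIP_BITMAP_SIZE (1ULL << 20)  // 1M bits = 64GB coverage
static _Atomic uint64_t g_hakmem_ownership_bitmap[OWNERSHIP_BITMAP_SIZE / 64];

// Mark every 64KB region touched by [ptr, ptr + size) as HAKMEM-owned
static inline void mark_hakmem_allocation(void* ptr, size_t size) {
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t first = addr >> OWNERSHIP_REGION_SHIFT;
    uintptr_t last = (addr + (size ? size - 1 : 0)) >> OWNERSHIP_REGION_SHIFT;
    for (uintptr_t region = first; region <= last; region++) {
        if (region >= OWNERSHIP_BITMAP_SIZE) return;  // outside bitmap coverage
        atomic_fetch_or(&g_hakmem_ownership_bitmap[region / 64],
                        UINT64_C(1) << (region % 64));
    }
}

// Check if allocation is HAKMEM-owned
static inline int is_hakmem_allocation(void* ptr) {
    uintptr_t region = (uintptr_t)ptr >> OWNERSHIP_REGION_SHIFT;
    if (region >= OWNERSHIP_BITMAP_SIZE) return 0;  // untracked address
    uint64_t word = atomic_load(&g_hakmem_ownership_bitmap[region / 64]);
    return (word & (UINT64_C(1) << (region % 64))) != 0;
}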

Change 2: Use ownership in free path:

void hak_free(void* ptr) {
    if (is_hakmem_allocation(ptr)) {
        hakmem_free(ptr);  // ✅ Confirmed HAKMEM
    } else {
        free(ptr);  // ✅ Confirmed libc malloc
    }
}
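
As a worked example of the index arithmetic behind the bitmap, the sketch below maps an address to its region, word, and bit using the same 64KB region size and 1M-bit bitmap as the proposal. `ToyBitPos` and `toy_locate` are illustrative names, and addresses are passed as `uint64_t` so the example does not depend on the platform's pointer width.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_REGION_SHIFT 16                    /* 64KB regions, as in the proposal */
#define TOY_BITMAP_BITS  (UINT64_C(1) << 20)   /* 1M bits, as in the proposal */

typedef struct {
    uint64_t region;   /* addr / 64KB */
    uint64_t word;     /* index into the uint64_t array */
    unsigned bit;      /* bit position within that word */
    int      covered;  /* 1 if the region falls inside the bitmap */
} ToyBitPos;

/* Splits an address into the coordinates the ownership bitmap would use. */
static ToyBitPos toy_locate(uint64_t addr) {
    ToyBitPos pos;
    pos.region  = addr >> TOY_REGION_SHIFT;
    pos.word    = pos.region / 64;
    pos.bit     = (unsigned)(pos.region % 64);
    pos.covered = pos.region < TOY_BITMAP_BITS;
    assert(pos.bit < 64);
    return pos;
}
```

For address 0x12345678 this yields region 0x1234, word 72, bit 52, inside the bitmap. An address such as 0x7f0000000000, a magnitude often seen for mmap-backed memory on 64-bit Linux, yields region 0x7f000000, far beyond the 2^20 regions the bitmap covers, so any implementation of Option C needs a bounds check or a different lookup structure for such addresses.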

Pros:

  • Allows mixed allocations safely
  • Works with existing malloc fallback

Cons:

  • Complex to implement correctly
  • Memory overhead for bitmap
  • Atomic operations on free path

Recommendation: Combine Option A + Option B

Phase 1 (Immediate - 1 hour): Disable malloc fallback (Option A)

  • Quick and safe fix
  • Prevents crashes immediately
  • Test 4T stability → should be 100%

Phase 2 (Next - 2-4 hours): Fix SuperSlab starvation (Option B)

  • Implement dynamic SuperSlab scaling
  • Increase capacity for hot classes (1, 4, 6)
  • Remove Option A workaround

Phase 3 (Optional): Add ownership tracking (Option C) for defense-in-depth


Testing Requirements

Test 1: Stability (CRITICAL)

# Must achieve 100% success rate
for i in {1..20}; do
  echo "Run $i:"
  env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
    ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "Throughput"
  # PIPESTATUS[0] is the benchmark's exit status; $? would report grep's
  echo "Exit code: ${PIPESTATUS[0]}"
done

# Expected: 20/20 success (100%)

Test 2: Performance (No regression)

# Should maintain ~981K ops/s
env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
  ./larson_hakmem 10 8 128 1024 1 12345 4

# Expected: Throughput ≈ 981K ops/s (same as before)

Test 3: Regression Check

# Ensure low-contention still works
./larson_hakmem 1 1 128 1024 1 12345 1  # 1T
./larson_hakmem 2 8 128 1024 1 12345 2  # 2T
./larson_hakmem 10 8 128 256 1 12345 4  # 4T low

# Expected: All complete successfully

Success Criteria

  • 4T high-contention stability: 100% (20/20 runs)
  • No performance regression (≥950K ops/s)
  • No crashes or OOM errors
  • 1T/2T/4T low-contention still work
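
The criteria above can be expressed as a single predicate over a summary of the 20-run stability test. This is only a sketch: `ToyRunSummary` and `toy_meets_criteria` are hypothetical names, and the fields would have to be filled in by whatever script collects the benchmark output.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int    runs;          /* number of 4T high-contention runs attempted */
    int    successes;     /* runs that exited cleanly */
    double ops_per_sec;   /* measured 4T high-contention throughput */
    int    oom_messages;  /* count of superslab_refill OOM lines seen */
} ToyRunSummary;

/* Returns 1 only if every listed criterion holds. */
static int toy_meets_criteria(const ToyRunSummary *s) {
    assert(s != NULL);
    return s->runs == 20
        && s->successes == s->runs
        && s->ops_per_sec >= 950e3
        && s->oom_messages == 0;
}
```

The low-contention regression runs are not modelled here; they would be checked separately.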


Files to Review/Modify

Likely files (search for malloc fallback):

  1. core/box/hak_alloc_api.inc.h - Main allocation API
  2. core/hakmem_tiny.c - Tiny allocator implementation
  3. core/tiny_alloc_fast.inc.h - Fast path allocation
  4. core/tiny_superslab_alloc.inc.h - SuperSlab allocation
  5. core/hakmem_tiny_refill_p0.inc.h - Refill logic

Search commands:

# Find malloc fallback
grep -rn "malloc(" core/ | grep -v "//.*malloc"

# Find OOM handling
grep -rn "errno.*ENOMEM\|OOM\|returned NULL" core/

# Find SuperSlab allocation
grep -rn "superslab_refill\|allocate.*superslab" core/

Expected Deliverable

Report file: /mnt/workdisk/public_share/hakmem/PHASE7_MIXED_ALLOCATION_FIX.md

Required sections:

  1. Approach chosen (A, B, C, or combination)
  2. Code changes (diffs showing before/after)
  3. Why it works (explanation of fix)
  4. Test results (20/20 stability test)
  5. Performance impact (before/after comparison)
  6. Production readiness (YES/NO verdict)

Context Documents

  • PHASE7_4T_STABILITY_VERIFICATION.md - Recent stability test (30% success)
  • PHASE7_BUG3_FIX_REPORT.md - Previous debugging attempts
  • PHASE7_FINAL_BENCHMARK_RESULTS.md - Overall Phase 7 results
  • CLAUDE.md - Project history and status

Questions? Debug Hints

Q: Where is the malloc fallback code?
A: Search for malloc( in core/box/*.inc.h and core/hakmem_tiny*.c

Q: How do I test just the fix without full rebuild?
A: make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem

Q: What if Option A causes application crashes?
A: That's expected if the app doesn't handle malloc failures. Move to Option B.

Q: How do I know if SuperSlab OOM is fixed?
A: No more [DEBUG] superslab_refill returned NULL (OOM) messages in output


Good luck! Let's achieve 100% stability! 🚀