hakmem/docs/status/TASK_FOR_OTHER_AI.md
# Task for Other AI: Fix 4T High-Contention Crash (Mixed Allocation Bug)
- **Date**: 2025-11-08
- **Priority**: CRITICAL
- **Status**: BLOCKING production deployment

---
## Executive Summary
- **Problem**: 4T high-contention crash with **70% failure rate** (6/20 success)
- **Root Cause Identified**: Mixed HAKMEM/libc allocations causing `free(): invalid pointer`
- **Your Mission**: Fix the mixed allocation bug to achieve **100% stability**

---
## Background
### Current Status
Phase 7 optimization achieved **excellent performance**:
- Single-threaded: **91.3% of System malloc** (target was 40-55%) ✅
- Multi-threaded low-contention: **100% stable**
- **BUT**: 4T high-contention: **70% crash rate**
### What Works
```bash
# ✅ Works perfectly (100% stable)
./larson_hakmem 1 1 128 1024 1 12345 1 # 1T: 2.74M ops/s
./larson_hakmem 2 8 128 1024 1 12345 2 # 2T: 4.91M ops/s
./larson_hakmem 10 8 128 256 1 12345 4 # 4T low: 251K ops/s
# ❌ Crashes 70% of the time
./larson_hakmem 10 8 128 1024 1 12345 4 # 4T high: 981K ops/s (when it works)
```
### What Breaks
**Crash pattern**:
```
free(): invalid pointer
[DEBUG] superslab_refill returned NULL (OOM) detail:
class=4 prev_ss=(nil) active=0 bitmap=0x00000000
prev_meta=(nil) used=0 cap=0 slab_idx=0
reused_freelist=0 free_idx=-2 errno=12
```
**Sequence of events**:
1. Thread exhausts SuperSlab for class 6 (or 1, 4)
2. `superslab_refill()` fails with OOM (errno=12, ENOMEM)
3. Code falls back to `malloc()` (libc malloc)
4. Now we have **mixed allocations**: some from HAKMEM, some from libc
5. `free()` receives a libc-allocated pointer
6. HAKMEM's free path tries to handle it → **CRASH**
---
## Root Cause Analysis (from Task Agent)
### The Mixed Allocation Problem
**File**: `core/box/hak_alloc_api.inc.h` or similar allocation paths
**Current behavior**:
```c
// Pseudo-code of current allocation path
void* hak_alloc(size_t size) {
    // Try HAKMEM allocation
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;
    // HAKMEM failed (OOM) → fallback to libc malloc
    return malloc(size); // ← PROBLEM: Now we have mixed allocations!
}

void hak_free(void* ptr) {
    // Try to free as HAKMEM allocation
    if (looks_like_hakmem(ptr)) {
        hakmem_free(ptr); // ← PROBLEM: What if it's actually from malloc()?
    } else {
        free(ptr); // ← PROBLEM: What if we guessed wrong?
    }
}
```
**Why this crashes**:
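- Once the libc fallback is in play, HAKMEM can't reliably distinguish between HAKMEM-allocated and malloc-allocated pointers
- Header-based detection is unreliable (malloc memory might look like HAKMEM headers)
- A wrong guess in either direction causes corruption or a crash; `free(): invalid pointer` is glibc's own abort message, which suggests that at least some misrouted pointers end up in libc `free()`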
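
The header-detection problem in the bullets above can be illustrated with a small sketch. It assumes a hypothetical 1-byte header stored just before the user pointer (a magic nibble plus a class index); the magic value and all helper names here are illustrative and are not taken from HAKMEM's actual layout. A pointer that was never given a header is still classified as HAKMEM-owned whenever the preceding byte happens to match.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header layout: high nibble is a magic tag, low nibble is the class index. */
#define HDR_MAGIC      0xA0u
#define HDR_MAGIC_MASK 0xF0u

/* Header-based guess: inspects only the byte just before the user pointer. */
static int looks_like_hakmem(const void* ptr) {
    const uint8_t* p = (const uint8_t*)ptr;
    return (p[-1] & HDR_MAGIC_MASK) == HDR_MAGIC;
}

/* Returns 1 if a foreign pointer (never allocated with a header) is misclassified. */
static int foreign_pointer_misclassified(void) {
    /* Stand-in for memory owned by another allocator; its bytes are arbitrary data. */
    static uint8_t foreign_memory[64];
    memset(foreign_memory, 0, sizeof foreign_memory);
    foreign_memory[15] = 0xA4;  /* data that happens to look like "magic | class 4" */
    void* foreign_ptr = &foreign_memory[16];
    return looks_like_hakmem(foreign_ptr);
}

/* Returns 1 if a foreign pointer with a non-matching preceding byte is rejected. */
static int foreign_pointer_rejected(void) {
    static uint8_t foreign_memory[64];
    memset(foreign_memory, 0, sizeof foreign_memory);
    void* foreign_ptr = &foreign_memory[16];  /* preceding byte is 0x00 */
    return !looks_like_hakmem(foreign_ptr);
}
```

Both helpers return 1: the check is right for most foreign pointers and wrong for some, which is consistent with an intermittent crash rather than a deterministic one.
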
### Why SuperSlab OOM Happens
**High-contention scenario**:
- 4 threads × 1024 chunks each = 4096 concurrent allocations
- All threads allocate small blocks (up to 128B, class 4 or 6)
- SuperSlab runs out of slabs for that class
- No dynamic scaling → OOM

**Evidence**: `bitmap=0x00000000` in the crash log means all 32 slabs are exhausted

---
## Your Mission: 3 Potential Fixes (Choose Best Approach)
### Option A: Disable malloc Fallback (Recommended - Safest)
**Idea**: Make allocation failures explicit instead of silently falling back
**Implementation**:
**File**: Find the allocation path that does malloc fallback (likely `core/box/hak_alloc_api.inc.h` or `core/hakmem_tiny.c`)
**Change**:
```c
// Before (BROKEN):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (ptr) return ptr;
    // Fallback to malloc (causes mixed allocations)
    return malloc(size); // ❌ BAD
}

// After (SAFE):
void* hak_alloc(size_t size) {
    void* ptr = hak_tiny_alloc(size);
    if (!ptr) {
        // OOM: Log and fail explicitly
        fprintf(stderr, "[HAKMEM] OOM for size=%zu, returning NULL\n", size);
        errno = ENOMEM;
        return NULL; // ✅ Explicit failure
    }
    return ptr;
}
```
**Pros**:
- Simple and safe
- No mixed allocations
- Caller can handle OOM explicitly

**Cons**:
- Applications must handle NULL returns
- Might break code that assumes malloc never fails

**Testing**:
```bash
# Should complete without crashes OR fail cleanly with OOM message
./larson_hakmem 10 8 128 1024 1 12345 4
```
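
On the caller side, Option A only helps if the NULL return is actually checked. A minimal sketch of that pattern follows, assuming a stand-in `hak_alloc_stub()` that mimics the post-fix behavior; the stub, its fixed 4096-byte arena, and `copy_string()` are hypothetical and exist only to keep the example self-contained.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the fixed hak_alloc(): fails explicitly instead of falling back
 * to malloc(). The fixed-size arena is an assumption made for this example only. */
static unsigned char g_arena[4096];
static size_t g_arena_used = 0;

static void* hak_alloc_stub(size_t size) {
    if (size == 0 || size > sizeof g_arena - g_arena_used) {
        errno = ENOMEM;  /* explicit failure, no libc fallback */
        return NULL;
    }
    void* ptr = &g_arena[g_arena_used];
    g_arena_used += size;
    return ptr;
}

/* Caller that propagates OOM instead of dereferencing NULL.
 * Returns 0 on success, -1 on OOM (errno stays ENOMEM). */
static int copy_string(const char* src, char** out) {
    size_t len = strlen(src) + 1;
    char* dst = hak_alloc_stub(len);
    if (dst == NULL) {
        *out = NULL;
        return -1;
    }
    memcpy(dst, src, len);
    *out = dst;
    return 0;
}
```

Code that skips the `dst == NULL` check is exactly the code listed under **Cons**: it will now fault at the point of use instead of silently receiving a libc pointer.
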
---
### Option B: Fix SuperSlab Starvation (Recommended - Best Long-term)
**Idea**: Prevent OOM by dynamically scaling SuperSlab capacity
**Implementation**:
**File**: `core/tiny_superslab_alloc.inc.h` or SuperSlab management code
**Change 1: Detect starvation**:
```c
// In superslab_refill()
if (bitmap == 0x00000000) {
    // All slabs exhausted → try to allocate more
    fprintf(stderr, "[HAKMEM] SuperSlab class %d exhausted, allocating more...\n", class_idx);
    // Allocate a new SuperSlab
    SuperSlab* new_ss = allocate_superslab(class_idx);
    if (new_ss) {
        register_superslab(new_ss);
        // Retry refill from new SuperSlab
        return refill_from_superslab(new_ss, class_idx, count);
    }
}
```
**Change 2: Increase initial capacity for hot classes**:
```c
// In SuperSlab initialization
// Classes 1, 4, 6 are hot in multi-threaded workloads
if (class_idx == 1 || class_idx == 4 || class_idx == 6) {
    initial_slabs = 64; // Double capacity for hot classes
} else {
    initial_slabs = 32; // Default
}
```
**Pros**:
- Fixes root cause (OOM)
- No mixed allocations needed
- Scales naturally with workload

**Cons**:
- More complex
- Memory overhead for extra SuperSlabs

**Testing**:
```bash
# Should complete 100% of the time without OOM
for i in {1..20}; do ./larson_hakmem 10 8 128 1024 1 12345 4; done
```
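
The grow-on-exhaustion idea behind Option B can be sketched with a toy pool. Everything here is an assumption made for illustration: the `ToySlab`/`ToyPool` types, the convention that a set bit means a free block, and the use of `calloc()` as backing store (a real allocator would obtain memory from the OS instead). The sketch is single-threaded and does not model per-class pools or locking.

```c
#include <stdint.h>
#include <stdlib.h>

#define BLOCKS_PER_SLAB 32
#define BLOCK_SIZE      128

/* Toy slab: bit i of free_bitmap is set while block i is free. */
typedef struct ToySlab {
    struct ToySlab* next;
    uint32_t free_bitmap;
    unsigned char blocks[BLOCKS_PER_SLAB][BLOCK_SIZE];
} ToySlab;

typedef struct {
    ToySlab* head;
    int slab_count;
} ToyPool;

/* Add a fresh, fully free slab; returns NULL only if the backing allocation fails. */
static ToySlab* toy_pool_grow(ToyPool* pool) {
    ToySlab* slab = calloc(1, sizeof *slab);
    if (!slab) return NULL;
    slab->free_bitmap = UINT32_MAX;
    slab->next = pool->head;
    pool->head = slab;
    pool->slab_count++;
    return slab;
}

/* Take one block: scan existing slabs, and grow instead of failing
 * when every bitmap is 0 (the starvation case). */
static void* toy_pool_alloc(ToyPool* pool) {
    ToySlab* slab = pool->head;
    while (slab && slab->free_bitmap == 0) slab = slab->next;
    if (!slab) slab = toy_pool_grow(pool);  /* all slabs exhausted */
    if (!slab) return NULL;                 /* genuine OOM */
    int idx = 0;
    while (!(slab->free_bitmap & (1u << idx))) idx++;  /* lowest free block */
    slab->free_bitmap &= ~(1u << idx);
    return slab->blocks[idx];
}
```

Allocating 100 blocks from an empty `ToyPool` leaves it with four slabs, since each slab holds 32 blocks. The caller only sees NULL when the backing store itself is out of memory.
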
---
### Option C: Add Allocation Ownership Tracking (Comprehensive)
**Idea**: Track which allocator owns each pointer
**Implementation**:
**File**: `core/box/hak_free_api.inc.h` or free path
**Change 1: Add ownership bitmap**:
```c
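#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Global bitmap to track HAKMEM allocations
// Each bit represents a 64KB region
// Assumes HAKMEM regions are 64KB-aligned and never shared with libc allocations
#define OWNERSHIP_REGION_SIZE (64 * 1024)   // 64KB regions
#define OWNERSHIP_BITMAP_SIZE (1ULL << 20)  // 1M bits = 64GB coverage
static _Atomic uint64_t g_hakmem_ownership_bitmap[OWNERSHIP_BITMAP_SIZE / 64];

// Mark allocation as HAKMEM-owned (every 64KB region it touches)
// Returns 0 if the range lies outside the bitmap's coverage and cannot be tracked
static inline int mark_hakmem_allocation(void* ptr, size_t size) {
    uintptr_t addr = (uintptr_t)ptr;
    size_t first = addr / OWNERSHIP_REGION_SIZE;
    size_t last = (addr + (size ? size - 1 : 0)) / OWNERSHIP_REGION_SIZE;
    if (last >= OWNERSHIP_BITMAP_SIZE) return 0;  // avoid out-of-bounds write
    for (size_t region = first; region <= last; region++) {
        atomic_fetch_or(&g_hakmem_ownership_bitmap[region / 64], 1ULL << (region % 64));
    }
    return 1;
}

// Check if allocation is HAKMEM-owned
static inline int is_hakmem_allocation(void* ptr) {
    size_t region = (uintptr_t)ptr / OWNERSHIP_REGION_SIZE;
    if (region >= OWNERSHIP_BITMAP_SIZE) return 0;  // outside coverage: not tracked
    uint64_t word = atomic_load(&g_hakmem_ownership_bitmap[region / 64]);
    return (word & (1ULL << (region % 64))) != 0;
}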
```
**Change 2: Use ownership in free path**:
```c
void hak_free(void* ptr) {
    if (is_hakmem_allocation(ptr)) {
        hakmem_free(ptr); // ✅ Confirmed HAKMEM
    } else {
        free(ptr); // ✅ Confirmed libc malloc
    }
}
```
**Pros**:
- Allows mixed allocations safely
- Works with existing malloc fallback

**Cons**:
- Complex to implement correctly
- Memory overhead for bitmap
- Atomic operations on free path
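
The bitmap's 64GB coverage is worth checking against the addresses the allocator will actually see. The sketch below only does the index arithmetic for a 1M-bit bitmap of 64KB regions; the helper names are hypothetical. On x86-64 Linux, mmap-backed memory is typically placed at high addresses (around `0x7f...`), which would fall far outside a bitmap indexed directly by `addr / 64KB`.

```c
#include <stdint.h>

#define REGION_SHIFT 16            /* 64KB regions */
#define BITMAP_BITS  (1ULL << 20)  /* 1M bits */

/* Total address range the bitmap can describe, in bytes:
 * 2^20 regions * 2^16 bytes = 2^36 bytes = 64 GiB. */
static uint64_t bitmap_coverage_bytes(void) {
    return BITMAP_BITS << REGION_SHIFT;
}

/* Region index for an address. */
static uint64_t region_index(uint64_t addr) {
    return addr >> REGION_SHIFT;
}

/* 1 if the address falls inside the tracked range, 0 otherwise. */
static int address_is_covered(uint64_t addr) {
    return region_index(addr) < BITMAP_BITS;
}
```

If the SuperSlab addresses turn out to be above the covered range, the bitmap would need a different index (for example an offset from a reserved base address) before Option C can work.
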
---
## Recommendation: **Combine Option A + Option B**
**Phase 1 (Immediate - 1 hour)**: Disable malloc fallback (Option A)
- Quick and safe fix
- Prevents mixed-allocation crashes immediately
- Test 4T stability → should be 100% crash-free (runs may still stop with a clean OOM message)

**Phase 2 (Next - 2-4 hours)**: Fix SuperSlab starvation (Option B)
- Implement dynamic SuperSlab scaling
- Increase capacity for hot classes (1, 4, 6)
- Remove Option A workaround

**Phase 3 (Optional)**: Add ownership tracking (Option C) for defense-in-depth

---
## Testing Requirements
### Test 1: Stability (CRITICAL)
```bash
# Must achieve 100% success rate
for i in {1..20}; do
  echo "Run $i:"
  # PIPESTATUS[0] is larson_hakmem's exit status; $? would report grep's
  env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
    ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "Throughput"
  echo "Exit code: ${PIPESTATUS[0]}"
done
# Expected: 20/20 success (100%)
```
### Test 2: Performance (No regression)
```bash
# Should maintain ~981K ops/s
env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
./larson_hakmem 10 8 128 1024 1 12345 4
# Expected: Throughput ≈ 981K ops/s (same as before)
```
### Test 3: Regression Check
```bash
# Ensure low-contention still works
./larson_hakmem 1 1 128 1024 1 12345 1 # 1T
./larson_hakmem 2 8 128 1024 1 12345 2 # 2T
./larson_hakmem 10 8 128 256 1 12345 4 # 4T low
# Expected: All complete successfully
```
---
## Success Criteria
- **4T high-contention stability: 100% (20/20 runs)**
- **No performance regression** (≥950K ops/s)
- **No crashes or OOM errors**
- **1T/2T/4T low-contention still work**

---
## Files to Review/Modify
**Likely files** (search for malloc fallback):
1. `core/box/hak_alloc_api.inc.h` - Main allocation API
2. `core/hakmem_tiny.c` - Tiny allocator implementation
3. `core/tiny_alloc_fast.inc.h` - Fast path allocation
4. `core/tiny_superslab_alloc.inc.h` - SuperSlab allocation
5. `core/hakmem_tiny_refill_p0.inc.h` - Refill logic

**Search commands**:
```bash
# Find malloc fallback
grep -rn "malloc(" core/ | grep -v "//.*malloc"
# Find OOM handling
grep -rn "errno.*ENOMEM\|OOM\|returned NULL" core/
# Find SuperSlab allocation
grep -rn "superslab_refill\|allocate.*superslab" core/
```
---
## Expected Deliverable
**Report file**: `/mnt/workdisk/public_share/hakmem/PHASE7_MIXED_ALLOCATION_FIX.md`
**Required sections**:
1. **Approach chosen** (A, B, C, or combination)
2. **Code changes** (diffs showing before/after)
3. **Why it works** (explanation of fix)
4. **Test results** (20/20 stability test)
5. **Performance impact** (before/after comparison)
6. **Production readiness** (YES/NO verdict)
---
## Context Documents
- `PHASE7_4T_STABILITY_VERIFICATION.md` - Recent stability test (30% success)
- `PHASE7_BUG3_FIX_REPORT.md` - Previous debugging attempts
- `PHASE7_FINAL_BENCHMARK_RESULTS.md` - Overall Phase 7 results
- `CLAUDE.md` - Project history and status
---
## Questions? Debug Hints
**Q: Where is the malloc fallback code?**
A: Search for `malloc(` in `core/box/*.inc.h` and `core/hakmem_tiny*.c`

**Q: How do I test just the fix without full rebuild?**
A: `make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem` builds only the benchmark target. Note that `make clean` normally forces everything to be recompiled; drop it for an incremental build when the flags are unchanged.

**Q: What if Option A causes application crashes?**
A: That's expected if the app doesn't handle malloc failures. Move to Option B.

**Q: How do I know if SuperSlab OOM is fixed?**
A: No more `[DEBUG] superslab_refill returned NULL (OOM)` messages in output

---
**Good luck! Let's achieve 100% stability! 🚀**