
Moe Charm (CI) 707056b765 feat: Phase 7 + Phase 2 - Massive performance & stability improvements

Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓

Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
  Result: +180-280% improvement, 85-146% of System malloc

Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)

Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
  Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
  Result: 50% → 95% stability (19/20 4T success)

Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
  Files: core/tiny_adaptive_sizing.c/h (new)

Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
  Files: core/hakmem_bigcache.c/h
  Expected: +10-20% cache hit rate

Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)

Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis

Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files

Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-08 17:08:00 +09:00

16 KiB


Phase 2a: SuperSlab Dynamic Expansion Implementation

Date: 2025-11-08
Priority: 🔴 CRITICAL - BLOCKING 100% stability
Estimated Effort: 7-10 days
Status: Ready for implementation

Executive Summary

Problem: SuperSlab uses a fixed 32-slab array → OOM under 4T high contention
Solution: Implement mimalloc-style chunk linking → unlimited slab expansion
Expected Result: 50% → 100% stability (20/20 success rate)

Current Architecture (BROKEN)

File: `core/superslab/superslab_types.h:82`

typedef struct SuperSlab {
    Slab slabs[SLABS_PER_SUPERSLAB_MAX];  // ← FIXED 32 slabs! Cannot grow!
    uint32_t bitmap;                       // ← 32 bits = 32 slabs max
    size_t total_active_blocks;
    int class_idx;
    // ...
} SuperSlab;

Why This Fails

4T high-contention scenario:

Thread 1: allocates from slabs[0-7]   → bitmap bits 0-7 = 0
Thread 2: allocates from slabs[8-15]  → bitmap bits 8-15 = 0
Thread 3: allocates from slabs[16-23] → bitmap bits 16-23 = 0
Thread 4: allocates from slabs[24-31] → bitmap bits 24-31 = 0

→ bitmap = 0x00000000 (all slabs busy)
→ superslab_refill() returns NULL
→ OOM → malloc fallback (now disabled) → CRASH

Evidence from logs:

[DEBUG] superslab_refill returned NULL (OOM) detail:
  class=4 prev_ss=(nil) active=0 bitmap=0x00000000
  prev_meta=(nil) used=0 cap=0 slab_idx=0
  reused_freelist=0 free_idx=-2 errno=12
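
The exhaustion described above can be reproduced in isolation. The following is a minimal sketch, not the HAKMEM code: `bitmap_take_slab` and `drain_bitmap` are hypothetical helpers, and the convention "bit set = slab free" is taken from the log line `bitmap=0x00000000` meaning no slab is available. `__builtin_ctz` is a GCC/Clang builtin, the same one the plan uses later.

```c
#include <stdint.h>

/* Hypothetical helper: claim the lowest free slab in a 32-bit bitmap
 * (bit set = slab free). Returns the slab index, or -1 when no bit is set,
 * which is the state in which superslab_refill() reports OOM. */
static int bitmap_take_slab(uint32_t *bitmap) {
    if (*bitmap == 0) return -1;            /* all slabs busy */
    int idx = __builtin_ctz(*bitmap);       /* lowest set bit */
    *bitmap &= ~(UINT32_C(1) << idx);       /* mark slab busy */
    return idx;
}

/* Claim slabs until the bitmap is exhausted; returns how many were claimed. */
static int drain_bitmap(uint32_t *bitmap) {
    int taken = 0;
    while (bitmap_take_slab(bitmap) >= 0) taken++;
    return taken;
}
```

Starting from `0xFFFFFFFF`, exactly 32 claims succeed regardless of how they are split across threads, and the 33rd has nowhere to go. That hard ceiling is what the chunk list removes.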

Proposed Architecture (mimalloc-style)

Design Pattern: Linked Chunks

Inspiration: mimalloc uses linked segments, jemalloc uses linked chunks

typedef struct SuperSlabChunk {
    Slab slabs[32];                    // Initial 32 slabs per chunk
    struct SuperSlabChunk* next;       // ← Link to next chunk
    uint32_t bitmap;                   // 32 bits for this chunk's slabs
    size_t total_active_blocks;        // Active blocks in this chunk
    int class_idx;
} SuperSlabChunk;

typedef struct SuperSlabHead {
    SuperSlabChunk* first_chunk;       // Head of chunk list
    SuperSlabChunk* current_chunk;     // Current chunk for allocation
    size_t total_chunks;               // Total chunks allocated
    int class_idx;
    pthread_mutex_t lock;              // Protect chunk list
} SuperSlabHead;

Allocation Flow

1. superslab_refill() called
   ↓
2. Try current_chunk
   ↓
3. bitmap == 0x00000000? (all slabs busy)
   ↓ YES
4. Try current_chunk->next
   ↓ NULL (no next chunk)
5. Allocate new chunk via mmap
   ↓
6. current_chunk->next = new_chunk
   ↓
7. current_chunk = new_chunk
   ↓
8. Refill from new_chunk
   ↓ SUCCESS
9. Return blocks to caller
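
The flow above can be sketched as a small single-threaded model. Everything here is an assumption made for illustration: the `Flow*` names are hypothetical, `calloc` stands in for `mmap`, locking is omitted, and a "refill" is reduced to claiming one slab index.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct FlowChunk {
    struct FlowChunk *next;
    uint32_t bitmap;                 /* bit set = slab free */
} FlowChunk;

typedef struct FlowHead {
    FlowChunk *first_chunk;
    FlowChunk *current_chunk;
    size_t total_chunks;
} FlowHead;

static FlowChunk *flow_new_chunk(void) {
    FlowChunk *c = calloc(1, sizeof *c);   /* stand-in for the 2MB mmap */
    if (c) c->bitmap = 0xFFFFFFFFu;        /* all 32 slabs free */
    return c;
}

/* Steps 2-8 of the flow: returns a slab index inside head->current_chunk,
 * or -1 only when a new chunk cannot be allocated (true OOM). */
static int flow_refill(FlowHead *head) {
    FlowChunk *c = head->current_chunk;            /* step 2 */
    while (!c || c->bitmap == 0) {                 /* step 3 */
        FlowChunk *next = c ? c->next : NULL;      /* step 4 */
        if (!next) {
            next = flow_new_chunk();               /* step 5 */
            if (!next) return -1;
            if (c) c->next = next;                 /* step 6 */
            else head->first_chunk = next;
            head->total_chunks++;
        }
        head->current_chunk = c = next;            /* step 7 */
    }
    int idx = __builtin_ctz(c->bitmap);            /* step 8 */
    c->bitmap &= ~(UINT32_C(1) << idx);
    return idx;
}
```

With an empty `FlowHead`, the first 32 calls are served from chunk 1 and the 33rd call links chunk 2 and returns slab 0 from it. The loop (rather than a single `if`) covers the case where an already linked next chunk is itself exhausted.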

Visual Representation

Before (BROKEN):
┌─────────────────────────────────┐
│ SuperSlab (2MB)                 │
│ slabs[32] ← FIXED!              │
│ [0][1][2]...[31]                │
│ bitmap = 0x00000000 → OOM 💥    │
└─────────────────────────────────┘

After (DYNAMIC):
┌─────────────────────────────────┐
│ SuperSlabHead                   │
│ ├─ first_chunk ──────────────┐  │
│ └─ current_chunk ────────┐   │  │
└──────────────────────────│───│──┘
                           │   │
                           ▼   ▼
                    ┌────────────────┐      ┌────────────────┐
                    │ Chunk 1 (2MB)  │ ───► │ Chunk 2 (2MB)  │ ───► ...
                    │ slabs[32]      │ next │ slabs[32]      │ next
                    │ bitmap=0x0000  │      │ bitmap=0xFFFF  │
                    └────────────────┘      └────────────────┘
                     (all busy)              (has free slabs!)

Implementation Tasks

Task 1: Define New Data Structures (2-3 hours)

File: core/superslab/superslab_types.h

Changes:

Rename existing SuperSlab → SuperSlabChunk:

typedef struct SuperSlabChunk {
    Slab slabs[32];                    // Keep 32 slabs per chunk
    struct SuperSlabChunk* next;       // NEW: Link to next chunk
    uint32_t bitmap;
    size_t total_active_blocks;
    int class_idx;

    // Existing fields...
} SuperSlabChunk;

Add new SuperSlabHead:

typedef struct SuperSlabHead {
    SuperSlabChunk* first_chunk;       // Head of chunk list
    SuperSlabChunk* current_chunk;     // Current chunk for fast allocation
    size_t total_chunks;               // Total chunks in list
    int class_idx;

    // Thread safety
    pthread_mutex_t expansion_lock;    // Protect chunk list expansion
} SuperSlabHead;

Update global registry:

// Before:
extern SuperSlab* g_superslab_registry[MAX_SUPERSLABS];

// After:
extern SuperSlabHead* g_superslab_heads[TINY_NUM_CLASSES];
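
Because each chunk keeps the pairing "32 slabs, 32-bit bitmap", it may be worth pinning that relationship down at compile time. A minimal sketch, assuming C11; `DEMO_SLABS_PER_CHUNK`, `demo_bitmap_t`, and `demo_all_free_mask` are hypothetical names, not part of the existing headers.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>
#include <stdint.h>

#define DEMO_SLABS_PER_CHUNK 32        /* assumed to mirror slabs[32] */

typedef uint32_t demo_bitmap_t;

/* Fail the build if the bitmap cannot track every slab in a chunk. */
static_assert(sizeof(demo_bitmap_t) * CHAR_BIT >= DEMO_SLABS_PER_CHUNK,
              "bitmap too narrow for slabs per chunk");

/* All-free mask for the first nslabs slabs. The nslabs >= 32 case is
 * handled separately because shifting a 32-bit value by 32 is undefined. */
static demo_bitmap_t demo_all_free_mask(unsigned nslabs) {
    if (nslabs >= 32) return UINT32_MAX;
    return (UINT32_C(1) << nslabs) - 1;
}
```

`demo_all_free_mask(32)` yields `0xFFFFFFFF`, the value a freshly initialized chunk is expected to carry.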

Task 2: Implement Chunk Allocation (3-4 hours)

File: core/superslab/superslab_alloc.c (new file or add to existing)

Function 1: Allocate new chunk:

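#include <errno.h>     // errno
#include <stdio.h>     // fprintf
#include <sys/mman.h>  // mmap, MAP_FAILED

// Allocate a new SuperSlabChunk via mmap
static SuperSlabChunk* alloc_new_chunk(int class_idx) {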
    size_t chunk_size = SUPERSLAB_SIZE;  // 2MB

    // mmap new chunk
    void* raw = mmap(NULL, chunk_size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) {
        fprintf(stderr, "[HAKMEM] CRITICAL: Failed to mmap new SuperSlabChunk for class %d (errno=%d)\n",
                class_idx, errno);
        return NULL;
    }

    // Initialize chunk structure
    SuperSlabChunk* chunk = (SuperSlabChunk*)raw;
    chunk->next = NULL;
    chunk->bitmap = 0xFFFFFFFF;  // All 32 slabs available
    chunk->total_active_blocks = 0;
    chunk->class_idx = class_idx;

    // Initialize slabs
    size_t block_size = class_to_size(class_idx);
    init_slabs_in_chunk(chunk, block_size);

    return chunk;
}

Function 2: Link new chunk to head:

// Expand SuperSlabHead by linking new chunk
static int expand_superslab_head(SuperSlabHead* head) {
    if (!head) return -1;

    // Allocate new chunk
    SuperSlabChunk* new_chunk = alloc_new_chunk(head->class_idx);
    if (!new_chunk) {
        return -1;  // True OOM (system out of memory)
    }

    // Thread-safe linking
    pthread_mutex_lock(&head->expansion_lock);

    if (head->current_chunk) {
        // Link at end of list
        SuperSlabChunk* tail = head->current_chunk;
        while (tail->next) {
            tail = tail->next;
        }
        tail->next = new_chunk;
    } else {
        // First chunk
        head->first_chunk = new_chunk;
    }

    // Update current chunk to new chunk
    head->current_chunk = new_chunk;
    head->total_chunks++;

    pthread_mutex_unlock(&head->expansion_lock);

    fprintf(stderr, "[HAKMEM] Expanded SuperSlabHead for class %d: %zu chunks now\n",
            head->class_idx, head->total_chunks);

    return 0;
}
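
One point to watch in `expand_superslab_head()`: the chunk is allocated before the lock is taken, so two threads that see the same exhausted chunk at the same time would each link a new 2MB chunk. A possible variant, sketched here under assumptions, passes in the chunk the caller saw as exhausted and expands only if it is still the current one. The `Exp*` names are hypothetical, `calloc` stands in for `mmap`, and the sketch relies on `current_chunk` always being the tail of the list.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct ExpChunk {
    struct ExpChunk *next;
    uint32_t bitmap;                       /* bit set = slab free */
} ExpChunk;

typedef struct ExpHead {
    ExpChunk *first_chunk;
    ExpChunk *current_chunk;               /* assumed to be the list tail */
    size_t total_chunks;
    pthread_mutex_t expansion_lock;
} ExpHead;

/* Expand only if `seen` is still the current chunk. Returns 0 when the
 * head now points past `seen` (expanded by us or by another thread),
 * -1 when allocation of a new chunk failed. */
static int exp_expand_if_current(ExpHead *head, ExpChunk *seen) {
    int rc = 0;
    pthread_mutex_lock(&head->expansion_lock);
    if (head->current_chunk == seen) {             /* nobody expanded yet */
        ExpChunk *fresh = calloc(1, sizeof *fresh);  /* stand-in for mmap */
        if (!fresh) {
            rc = -1;
        } else {
            fresh->bitmap = 0xFFFFFFFFu;
            if (seen) seen->next = fresh;
            else head->first_chunk = fresh;
            head->current_chunk = fresh;
            head->total_chunks++;
        }
    }
    pthread_mutex_unlock(&head->expansion_lock);
    return rc;
}
```

The trade-off is that the allocation now happens while the lock is held, which lengthens the critical section but keeps the chunk count tied to actual demand.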

Task 3: Update Refill Logic (4-5 hours)

File: core/tiny_superslab_alloc.inc.h (or wherever superslab_refill() is defined)

Modify superslab_refill() to try all chunks:

// Before (BROKEN):
void* superslab_refill(int class_idx, int count) {
    SuperSlab* ss = get_superslab_for_class(class_idx);
    if (!ss) return NULL;

    if (ss->bitmap == 0x00000000) {
        // All slabs busy → OOM!
        return NULL;  // ← CRASH HERE
    }

    // Try to refill from this SuperSlab
    return refill_from_superslab(ss, count);
}

// After (DYNAMIC):
void* superslab_refill(int class_idx, int count) {
    SuperSlabHead* head = g_superslab_heads[class_idx];
    if (!head) {
        // Initialize head for this class (first time)
        head = init_superslab_head(class_idx);
        if (!head) return NULL;
        g_superslab_heads[class_idx] = head;
    }

    SuperSlabChunk* chunk = head->current_chunk;

    // Try current chunk first (fast path)
    if (chunk && chunk->bitmap != 0x00000000) {
        return refill_from_chunk(chunk, count);
    }

    // Current chunk exhausted, try to expand
    fprintf(stderr, "[DEBUG] SuperSlabChunk exhausted for class %d (bitmap=0x00000000), expanding...\n", class_idx);

    if (expand_superslab_head(head) < 0) {
        fprintf(stderr, "[HAKMEM] CRITICAL: Failed to expand SuperSlabHead for class %d\n", class_idx);
        return NULL;  // True system OOM
    }

    // Retry refill from new chunk
    chunk = head->current_chunk;
    if (!chunk || chunk->bitmap == 0x00000000) {
        fprintf(stderr, "[HAKMEM] CRITICAL: New chunk still has no free slabs for class %d\n", class_idx);
        return NULL;
    }

    return refill_from_chunk(chunk, count);
}

Helper function:

// Refill from a specific chunk
static void* refill_from_chunk(SuperSlabChunk* chunk, int count) {
    if (!chunk || chunk->bitmap == 0x00000000) return NULL;

    // Use existing P0 optimization (ctz-based slab selection)
    uint32_t mask = chunk->bitmap;
    while (mask && count > 0) {
        int slab_idx = __builtin_ctz(mask);
        mask &= ~(1u << slab_idx);

        Slab* slab = &chunk->slabs[slab_idx];
        // Try to acquire slab and refill
        // ... existing refill logic
    }

    return /* refilled blocks */;
}
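
The ctz-based walk in the helper can be tried on its own. This is a minimal sketch of the iteration only, with the slab acquisition left out; `demo_candidate_slabs` is a hypothetical name, and `mask &= mask - 1` is used as an equivalent way to clear the lowest set bit.

```c
#include <stdint.h>

/* Collect up to max_out candidate slab indices from a bitmap, lowest
 * index first (bit set = slab free). The bitmap is passed by value, so
 * the caller's copy is not modified. Returns the number of indices
 * written to out. */
static int demo_candidate_slabs(uint32_t bitmap, int *out, int max_out) {
    int n = 0;
    uint32_t mask = bitmap;
    while (mask && n < max_out) {
        int idx = __builtin_ctz(mask);   /* index of lowest set bit */
        mask &= mask - 1;                /* clear that bit */
        out[n++] = idx;
    }
    return n;
}
```

For a bitmap of `0x00000112` the candidates come out as slabs 1, 4, and 8, in that order. The loop never calls `__builtin_ctz` with zero, whose result is undefined.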

Task 4: Update Initialization (2-3 hours)

File: core/hakmem_tiny.c or initialization code

Modify hak_tiny_init():

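// Forward declaration: init_superslab_head() is defined below but used here
static SuperSlabHead* init_superslab_head(int class_idx);

void hak_tiny_init(void) {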
    // Initialize SuperSlabHead for each class
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        SuperSlabHead* head = init_superslab_head(class_idx);
        if (!head) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to initialize SuperSlabHead for class %d\n", class_idx);
            abort();
        }
        g_superslab_heads[class_idx] = head;
    }
}

// Initialize SuperSlabHead with initial chunk(s)
static SuperSlabHead* init_superslab_head(int class_idx) {
    SuperSlabHead* head = calloc(1, sizeof(SuperSlabHead));
    if (!head) return NULL;

    head->class_idx = class_idx;
    head->total_chunks = 0;
    pthread_mutex_init(&head->expansion_lock, NULL);

    // Allocate initial chunk(s)
    int initial_chunks = 1;

    // Hot classes (1, 4, 6) get 2 initial chunks
    if (class_idx == 1 || class_idx == 4 || class_idx == 6) {
        initial_chunks = 2;
    }

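    for (int i = 0; i < initial_chunks; i++) {
        if (expand_superslab_head(head) < 0) {
            fprintf(stderr, "[HAKMEM] CRITICAL: Failed to allocate initial chunk %d for class %d\n", i, class_idx);
            // Unmap chunks that were already linked so they are not leaked
            for (SuperSlabChunk* c = head->first_chunk; c; ) {
                SuperSlabChunk* next = c->next;  // read before unmapping
                munmap(c, SUPERSLAB_SIZE);
                c = next;
            }
            pthread_mutex_destroy(&head->expansion_lock);
            free(head);
            return NULL;
        }
    }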

    return head;
}

Task 5: Update Free Path (2-3 hours)

File: core/hakmem_tiny_free.inc or free path code

Modify free to find correct chunk:

void hak_tiny_free(void* ptr) {
    if (!ptr) return;

    // Determine class_idx from header or registry
    int class_idx = get_class_idx_for_ptr(ptr);
    if (class_idx < 0) {
        fprintf(stderr, "[HAKMEM] Invalid free: ptr=%p not in any SuperSlab\n", ptr);
        return;
    }

    // Find which chunk this ptr belongs to
    SuperSlabHead* head = g_superslab_heads[class_idx];
    if (!head) {
        fprintf(stderr, "[HAKMEM] Invalid free: no SuperSlabHead for class %d\n", class_idx);
        return;
    }

    SuperSlabChunk* chunk = head->first_chunk;
    while (chunk) {
        // Check if ptr is within this chunk's memory range
        uintptr_t chunk_start = (uintptr_t)chunk;
        uintptr_t chunk_end = chunk_start + SUPERSLAB_SIZE;
        uintptr_t ptr_addr = (uintptr_t)ptr;

        if (ptr_addr >= chunk_start && ptr_addr < chunk_end) {
            // Found the chunk, free to it
            free_to_chunk(chunk, ptr);
            return;
        }

        chunk = chunk->next;
    }

    fprintf(stderr, "[HAKMEM] Invalid free: ptr=%p not found in any chunk for class %d\n", ptr, class_idx);
}

Task 6: Update Registry (3-4 hours)

File: Registry code (wherever SuperSlab registry is managed)

Replace flat registry with per-class heads:

// Before:
SuperSlab* g_superslab_registry[MAX_SUPERSLABS];

// After:
SuperSlabHead* g_superslab_heads[TINY_NUM_CLASSES];

Update registry lookup:

// Before:
SuperSlab* find_superslab_for_ptr(void* ptr) {
    for (int i = 0; i < MAX_SUPERSLABS; i++) {
        SuperSlab* ss = g_superslab_registry[i];
        if (ptr_in_range(ptr, ss)) return ss;
    }
    return NULL;
}

// After:
SuperSlabChunk* find_chunk_for_ptr(void* ptr, int* out_class_idx) {
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        SuperSlabHead* head = g_superslab_heads[class_idx];
        if (!head) continue;

        SuperSlabChunk* chunk = head->first_chunk;
        while (chunk) {
            if (ptr_in_chunk_range(ptr, chunk)) {
                if (out_class_idx) *out_class_idx = class_idx;
                return chunk;
            }
            chunk = chunk->next;
        }
    }
    return NULL;
}
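
`ptr_in_chunk_range()` is used above but not shown. A minimal sketch of the half-open range test it would need follows; `demo_ptr_in_range` is a hypothetical stand-in that takes the base address and size explicitly (for a chunk these would be the chunk pointer and `SUPERSLAB_SIZE`).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if ptr lies in the half-open range [base, base + size).
 * Comparing (p - b) against size avoids computing base + size, which
 * could wrap around for a range near the top of the address space. */
static bool demo_ptr_in_range(const void *ptr, const void *base, size_t size) {
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t b = (uintptr_t)base;
    return p >= b && (p - b) < size;
}
```

The first and last bytes of the range are inside, the address one past the end is outside, which matches the `ptr_addr >= chunk_start && ptr_addr < chunk_end` test in the free path.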

Testing Strategy

Test 1: Build Verification

# Rebuild with new architecture
make clean
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 larson_hakmem

# Check for compilation errors
echo $?  # Should be 0

Test 2: Single-Thread Stability

# Should work perfectly (no change in behavior)
./larson_hakmem 1 1 128 1024 1 12345 1

# Expected: 2.68-2.71M ops/s (no regression)

Test 3: 4T High-Contention (CRITICAL)

# Run 20 times, count successes
success=0
for i in {1..20}; do
  echo "=== Run $i ==="
  env HAKMEM_TINY_USE_SUPERSLAB=1 HAKMEM_TINY_MEM_DIET=0 \
    ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | tee phase2a_run_$i.log

  if grep -q "Throughput" phase2a_run_$i.log; then
    ((success++))
    echo "✓ Success ($success/20)"
  else
    echo "✗ Failed"
  fi
done

echo "Final: $success/20 success rate"

# TARGET: 20/20 (100%)
# Current baseline: 10/20 (50%)

Test 4: Chunk Expansion Verification

# Enable debug logging
HAKMEM_LOG=1 ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "Expanded SuperSlabHead"

# Should see:
# [HAKMEM] Expanded SuperSlabHead for class 4: 2 chunks now
# [HAKMEM] Expanded SuperSlabHead for class 4: 3 chunks now
# ...

Test 5: Memory Leak Check

# Valgrind test (may be slow)
valgrind --leak-check=full --show-leak-kinds=all \
  ./larson_hakmem 1 1 128 1024 1 12345 1 2>&1 | tee valgrind_phase2a.log

# Check for leaks
grep "definitely lost" valgrind_phase2a.log
# Should be 0 bytes

Success Criteria

✅ Compilation: No errors, no warnings
✅ Single-thread: 2.68-2.71M ops/s (no regression)
✅ 4T stability: 20/20 (100%) ← KEY METRIC
✅ Chunk expansion: Logs show multiple chunks allocated
✅ No memory leaks: Valgrind clean
✅ Performance: 4T throughput ≥981K ops/s (when it works)

Deliverable

Report file: /mnt/workdisk/public_share/hakmem/PHASE2A_IMPLEMENTATION_REPORT.md

Required sections:

Architecture changes (SuperSlab → SuperSlabChunk + SuperSlabHead)
Code diffs (all modified files)
Test results (20/20 stability test)
Performance comparison (before/after)
Chunk expansion behavior (how many chunks allocated under load)
Memory usage (overhead per chunk, total memory)
Production readiness (YES/NO verdict)

Files to Create/Modify

New files:

core/superslab/superslab_alloc.c - Chunk allocation functions

Modified files:

core/superslab/superslab_types.h - SuperSlabChunk + SuperSlabHead
core/tiny_superslab_alloc.inc.h - Refill logic with expansion
core/hakmem_tiny_free.inc - Free path with chunk lookup
core/hakmem_tiny.c - Initialization with SuperSlabHead
Registry code - Update to per-class heads

Estimated LOC: 500-800 lines (new code + modifications)

Risk Mitigation

Risk 1: Performance regression

Mitigation: Keep fast path (current_chunk) unchanged
Single-chunk case should be identical to before

Risk 2: Thread safety issues

Mitigation: Use expansion_lock only for chunk linking
Slab-level atomics unchanged

Risk 3: Memory overhead

Each chunk: 2MB (same as before)
SuperSlabHead: ~64 bytes per class
Total overhead: negligible
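
To put a number on the startup footprint, the initial chunk count from Task 4 (one chunk per class, two for hot classes 1, 4, and 6) can be worked out directly. This is a sketch under assumptions: the `mem_*` names are hypothetical, and the class count is a parameter because the value of `TINY_NUM_CLASSES` is not stated here.

```c
#include <stddef.h>

#define MEM_CHUNK_BYTES ((size_t)2 * 1024 * 1024)   /* 2MB per chunk */

/* Number of chunks mapped at init: 1 per class, 2 for hot classes
 * 1, 4, and 6 (only counted if they exist for this class count). */
static size_t mem_initial_chunks(int num_classes) {
    size_t chunks = 0;
    for (int c = 0; c < num_classes; c++) {
        int hot = (c == 1 || c == 4 || c == 6);
        chunks += hot ? 2 : 1;
    }
    return chunks;
}

/* Address space reserved by those chunks, in bytes. */
static size_t mem_initial_reserved_bytes(int num_classes) {
    return mem_initial_chunks(num_classes) * MEM_CHUNK_BYTES;
}
```

If there were 8 classes, for example, init would map 11 chunks, or 22MB of address space. On Linux, anonymous `mmap` pages are only backed by physical memory once touched, so resident memory would typically be lower than the reserved figure.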

Risk 4: Complexity

Mitigation: Follow mimalloc pattern (proven design)
Keep chunk size fixed (2MB) for simplicity

Let's implement Phase 2a and achieve 100% stability! 🚀
