hakmem/docs/design/HYBRID_IMPLEMENTATION_DESIGN.md
Hybrid Bitmap+Mini-Magazine Implementation Design

Date: 2025-10-26
Goal: Clean, modular implementation of the Hybrid approach
Target: 83ns → 40-50ns (roughly 40-52% lower latency)
Philosophy: 綺麗綺麗大作戦 ("Operation Squeaky Clean") - clean code from the start


Current Structure Analysis

Existing Implementation (Phase 6.X)

Already Implemented:

  1. Two-Tier Bitmap: summary + bitmap (lines 103-106 in hakmem_tiny.h)
  2. TLS Magazine: 2048 items (lines 54-70 in hakmem_tiny.c)
  3. TLS Active Slabs: A/B slabs (lines 72-73)
  4. Remote Free MPSC: Lock-free remote free (lines 126-138)
  5. Hint Word: Scan start position (line 102)

Current Hot Path (lines 656-661):

if (mag->top > 0) {
    void* p = mag->items[--mag->top].ptr;
    t_tiny_rng ^= t_tiny_rng << 13;  // Statistics XOR (3 ns)
    t_tiny_rng ^= t_tiny_rng >> 17;
    t_tiny_rng ^= t_tiny_rng << 5;
    if ((t_tiny_rng & ...) == 0) g_tiny_pool.alloc_count[class_idx]++;
    return p;
}

Bottlenecks Identified:

  1. Statistics in hot path: +10-15ns (lines 658-659, 677-678, 793-794)
  2. Bitmap scan on TLS slab: +5-6ns (line 671, 776)
  3. Multiple TLS reads: +2-3ns (mag, slab_a, slab_b)

Current Performance: 83 ns/op
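
To make the cost of the current statistics path concrete, here is a minimal sketch of sampled counting driven by an xorshift32 step (shifts 13, 17, 5, as in the hot path above). The sampling mask is elided in the original code, so `DEMO_SAMPLE_MASK` (1 in 16) is an assumption, as are all `demo_*` names and the 32-bit state width.

```c
#include <stdint.h>

// Hypothetical sampling mask: count 1 in 16 calls. The real mask is elided
// in the design, so this value is an assumption.
#define DEMO_SAMPLE_MASK 0xFu

static uint32_t demo_rng = 2463534242u;   // any non-zero seed works
static uint64_t demo_sampled_hits = 0;    // stand-in for a global counter

// One xorshift32 step (shifts 13, 17, 5), then a sampled increment.
static inline void demo_sampled_record(void) {
    demo_rng ^= demo_rng << 13;
    demo_rng ^= demo_rng >> 17;
    demo_rng ^= demo_rng << 5;
    if ((demo_rng & DEMO_SAMPLE_MASK) == 0) demo_sampled_hits++;
}

// Record n events and return the scaled-up estimate of n.
static uint64_t demo_sampled_estimate(uint64_t n) {
    demo_sampled_hits = 0;
    for (uint64_t i = 0; i < n; i++) demo_sampled_record();
    return demo_sampled_hits * (DEMO_SAMPLE_MASK + 1u);
}
```

Every call pays for three dependent shift/XOR pairs plus a branch, even when no counter is updated, and the resulting count is only an estimate. That is the overhead Module 4 below is meant to remove.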


Clean Modular Design

Module 1: Page Mini-Magazine (Data Plane)

Purpose: O(1) LIFO free-list for fast allocation
Location: New file hakmem_tiny_mini_mag.h

// ============================================================================
// Page Mini-Magazine: Fast LIFO Cache (Data Plane)
// ============================================================================

// Intrusive LIFO block
typedef struct MiniMagBlock {
    struct MiniMagBlock* next;  // 8 bytes in free block
} MiniMagBlock;

// Mini-magazine metadata (embedded in TinySlab)
typedef struct {
    MiniMagBlock* head;         // LIFO stack head
    uint16_t count;             // Current item count
    uint16_t capacity;          // Max items (16-32)
} PageMiniMag;

// Fast path: Pop from mini-magazine (1-2 ns)
static inline void* mini_mag_pop(PageMiniMag* mag) {
    MiniMagBlock* b = mag->head;
    if (!b) return NULL;

    mag->head = b->next;
    mag->count--;
    return (void*)b;
}

// Fast path: Push to mini-magazine (1-2 ns)
static inline int mini_mag_push(PageMiniMag* mag, void* ptr) {
    if (mag->count >= mag->capacity) return 0;  // Full

    MiniMagBlock* b = (MiniMagBlock*)ptr;
    b->next = mag->head;
    mag->head = b;
    mag->count++;
    return 1;
}

Characteristics:

  • Zero per-block metadata overhead (the next-pointer lives inside the free block itself)
  • O(1) pop/push (1-2 ns)
  • Cache-friendly (LIFO = temporal locality)
  • No locks (owner-only access)

Module 2: Batch Refill from Bitmap (Control Plane)

Purpose: Batch refill mini-magazine from two-tier bitmap
Location: New file hakmem_tiny_batch_refill.h

// ============================================================================
// Batch Refill: Two-Tier Bitmap → Mini-Magazine (Control Plane)
// ============================================================================

// Refill mini-magazine from bitmap (batch of N items)
// Returns number of items refilled
static inline int batch_refill_from_bitmap(
    TinySlab* slab,
    PageMiniMag* mag,
    int want
) {
    if (want <= 0) return 0;

    // Clamp to the room left in the magazine so that every block taken
    // from the bitmap is guaranteed to fit (a failed push would leak it:
    // marked used in the bitmap but held by nobody).
    int room = (int)mag->capacity - (int)mag->count;
    if (want > room) want = room;

    size_t block_size = g_tiny_class_sizes[slab->class_idx];
    int got = 0;

    // Two-tier bitmap scan (using existing summary)
    while (got < want && slab->free_count > 0) {
        int block_idx = hak_tiny_find_free_block(slab);
        if (block_idx < 0) break;

        // Mark as used in bitmap
        hak_tiny_set_used(slab, block_idx);
        slab->free_count--;

        // Calculate block pointer
        void* ptr = (char*)slab->base + ((size_t)block_idx * block_size);

        // Push to mini-magazine (cannot fail: want <= room)
        mini_mag_push(mag, ptr);
        got++;
    }

    return got;
}

// Spill mini-magazine back to bitmap (on slab eviction)
static inline void batch_spill_to_bitmap(
    TinySlab* slab,
    PageMiniMag* mag
) {
    size_t block_size = g_tiny_class_sizes[slab->class_idx];

    while (mag->count > 0) {
        void* ptr = mini_mag_pop(mag);
        if (!ptr) break;

        // Calculate block index
        uintptr_t offset = (uintptr_t)ptr - (uintptr_t)slab->base;
        int block_idx = offset / block_size;

        // Mark as free in bitmap
        hak_tiny_set_free(slab, block_idx);
        slab->free_count++;
    }
}

Characteristics:

  • Batch processing (amortized cost: 3ns/item)
  • Uses existing two-tier bitmap
  • Preserves bitmap consistency
  • Minimal code (~40 lines)
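
The refill loop relies on `hak_tiny_find_free_block`, which is not shown in this document. The following is only a sketch of how a two-tier lookup can work, under the assumptions that a set bit means "free" and that summary bit w is set whenever bitmap word w has at least one free block. All `demo_*` names and the fixed 8192-block size are hypothetical, the hint word is not modeled, and `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <stdint.h>

#define DEMO_WORDS 128   // 128 words x 64 bits = 8192 blocks (hypothetical)

typedef struct {
    uint64_t bitmap[DEMO_WORDS];        // bit = 1 means "block is free"
    uint64_t summary[DEMO_WORDS / 64];  // bit w = 1 means bitmap[w] != 0
} DemoTwoTier;

// Mark one block free and keep the summary in sync.
static void demo_set_free(DemoTwoTier* t, int idx) {
    int w = idx / 64;
    t->bitmap[w] |= 1ULL << (idx % 64);
    t->summary[w / 64] |= 1ULL << (w % 64);
}

// Mark one block used; clear the summary bit when the word empties.
static void demo_set_used(DemoTwoTier* t, int idx) {
    int w = idx / 64;
    t->bitmap[w] &= ~(1ULL << (idx % 64));
    if (t->bitmap[w] == 0) t->summary[w / 64] &= ~(1ULL << (w % 64));
}

// Find the lowest free block index, or -1 if the slab is full.
static int demo_find_free(const DemoTwoTier* t) {
    for (int s = 0; s < DEMO_WORDS / 64; s++) {
        if (t->summary[s] == 0) continue;   // skips 64 bitmap words at once
        int w = s * 64 + __builtin_ctzll(t->summary[s]);
        return w * 64 + __builtin_ctzll(t->bitmap[w]);
    }
    return -1;
}
```

With this layout a lookup touches at most two summary words and one bitmap word, which is why batching 16 lookups per refill keeps the amortized per-item cost low.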

Module 3: Integrated TinySlab Structure

Purpose: Add mini-magazine to existing TinySlab
Location: hakmem_tiny.h (modification)

// Modified TinySlab structure
typedef struct TinySlab {
    void* base;                     // Base address (64KB aligned)
    uint64_t* bitmap;               // Free block bitmap (dynamic size)
    uint16_t free_count;            // Number of free blocks
    uint16_t total_count;           // Total blocks in slab
    uint8_t class_idx;              // Size class index (0-7)
    uint8_t _padding[3];
    struct TinySlab* next;          // Next slab in list

    // MPSC remote-free stack head (lock-free)
    atomic_uintptr_t remote_head;
    atomic_uint remote_count;
    pthread_t owner_tid;

    // Two-tier bitmap (existing)
    uint16_t hint_word;
    uint8_t summary_words;
    uint8_t _pad_sum[1];
    uint64_t* summary;

    // NEW: Page Mini-Magazine (16-32 items)
    PageMiniMag mini_mag;           // 16 bytes (aligned)
} TinySlab;

Changes:

  • Additive only (no breaking changes)
  • Cache-line aligned (64 bytes)
  • Backward compatible

Module 4: Statistics Batching (Out-of-Band)

Purpose: Remove statistics from hot path
Location: New file hakmem_tiny_stats.h

// ============================================================================
// Statistics: Batched TLS Counters (Out-of-Band)
// ============================================================================

// Per-thread batch counters (flushed periodically)
static __thread uint32_t t_alloc_batch[TINY_NUM_CLASSES];
static __thread uint32_t t_free_batch[TINY_NUM_CLASSES];

// Fast path: Increment TLS counter only (0.5 ns)
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++;
}

static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++;
}

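// Cold path: Flush batch to global counters
static inline void stats_flush_if_needed(int class_idx) {
    // Flush once at least 256 events are pending, and publish the exact
    // pending count so nothing is lost if this check runs late.
    // NOTE: if several threads can flush concurrently, the global
    // counters need atomic adds.
    if (t_alloc_batch[class_idx] >= 256) {
        g_tiny_pool.alloc_count[class_idx] += t_alloc_batch[class_idx];
        t_alloc_batch[class_idx] = 0;
    }
    if (t_free_batch[class_idx] >= 256) {
        g_tiny_pool.free_count[class_idx] += t_free_batch[class_idx];
        t_free_batch[class_idx] = 0;
    }
}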

Characteristics:

  • Hot path: 0.5 ns (simple increment)
  • Cold path: Batch flush (every 256 ops)
  • Accuracy: 99.6% (vs 93.75% with sampling)
  • No XOR RNG overhead
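
The batching idea can be exercised in isolation. The sketch below is a single-threaded illustration with hypothetical `demo_*` names: a plain array stands in for `g_tiny_pool.alloc_count`, and the final `demo_flush_all` (for example at thread exit) is an assumption that this design does not describe.

```c
#include <stdint.h>

#define DEMO_CLASSES 8
#define DEMO_FLUSH_THRESHOLD 256u

// Stand-in for g_tiny_pool.alloc_count. A multi-threaded build would need
// atomic adds here (assumption: this demo runs on a single thread).
static uint64_t demo_global_alloc[DEMO_CLASSES];

// Per-thread pending counts (GCC/Clang __thread, as in the design).
static __thread uint32_t demo_pending[DEMO_CLASSES];

// Hot path: thread-local increment only.
static inline void demo_record_alloc(int c) { demo_pending[c]++; }

// Cold path: publish the exact pending count once it reaches the threshold.
static inline void demo_flush_if_needed(int c) {
    if (demo_pending[c] >= DEMO_FLUSH_THRESHOLD) {
        demo_global_alloc[c] += demo_pending[c];
        demo_pending[c] = 0;
    }
}

// Final flush so that no residue below the threshold is lost.
static void demo_flush_all(void) {
    for (int c = 0; c < DEMO_CLASSES; c++) {
        demo_global_alloc[c] += demo_pending[c];
        demo_pending[c] = 0;
    }
}
```

After 1000 recorded allocations in one class, with `demo_flush_if_needed` called after each, the global counter reads 768 (three flushes of 256); after `demo_flush_all` it reads 1000. The global view therefore lags by fewer than 256 events per thread and class.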

Implementation Phases

Phase 1: Add Mini-Magazine to TinySlab (2-3 hours)

Goal: Enable page-level fast path

Files to modify:

  1. hakmem_tiny.h: Add PageMiniMag to TinySlab
  2. Create hakmem_tiny_mini_mag.h: Mini-magazine operations
  3. Create hakmem_tiny_batch_refill.h: Batch refill/spill

Changes:

// hakmem_tiny.h (line 107, after summary)
typedef struct TinySlab {
    // ... existing fields ...
    uint64_t* summary;

    // NEW: Page Mini-Magazine
    PageMiniMag mini_mag;
} TinySlab;

// Initialize mini-magazine on slab creation (hakmem_tiny.c:allocate_new_slab)
slab->mini_mag.head = NULL;
slab->mini_mag.count = 0;
slab->mini_mag.capacity = 16;  // Start with 16 items

Testing:

# Compile
make clean && make -j4

# Unit test: Mini-magazine operations
# (create test_mini_mag.c)
./test_mini_mag

# Integration test: Verify no regressions
./bench_tiny --iterations=100000 --threads=1

Expected: No performance change (feature not used yet)


Phase 2: Integrate Mini-Magazine into Hot Path (3-4 hours)

Goal: Use mini-magazine in allocation fast path

Files to modify:

  1. hakmem_tiny.c: Modify hak_tiny_alloc() to use mini-mag

New hot path logic:

void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // 1. TLS Magazine (existing fast path)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        void* p = mag->items[--mag->top].ptr;
        stats_record_alloc(class_idx);  // NEW: Batched
        return p;
    }

    // 2. TLS Active Slab Mini-Magazine (NEW: lock-free)
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (tls && tls->mini_mag.count > 0) {
        void* p = mini_mag_pop(&tls->mini_mag);
        if (p) {
            stats_record_alloc(class_idx);
            return p;
        }
    }

    // 3. Refill mini-magazine from bitmap (medium path)
    if (tls && tls->free_count > 0) {
        int got = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
        if (got > 0) {
            void* p = mini_mag_pop(&tls->mini_mag);
            if (p) {
                stats_record_alloc(class_idx);
                return p;
            }
        }
    }

    // 4. Slow path (existing global pool logic)
    return tiny_alloc_slow_path(class_idx);
}

Testing:

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 55-65ns

# Multi-threaded test
./bench_tiny_mt --iterations=100000 --threads=4
# Expected: No regressions

# Correctness test
./test_mf2
./test_mf2_warmup

Expected: 83ns → 55-65ns (roughly 22-34% lower latency)
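
The interplay of steps 2 and 3 (pop from the mini-magazine, refill in batches when it runs dry) can be tried on a toy slab. This is a sketch under simplifying assumptions, not the hakmem code: one 64-bit bitmap word instead of the two-tier bitmap, a set bit meaning "free", no TLS magazine in front, and hypothetical `demo_*` names throughout. `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_BLOCKS 64          // one bitmap word (hypothetical toy slab)
#define DEMO_BLOCK_SIZE 16      // bytes per block; must hold a pointer
#define DEMO_BATCH 16           // refill batch size, as in step 3

typedef union DemoBlock {
    union DemoBlock* next;                   // valid only while block is free
    unsigned char payload[DEMO_BLOCK_SIZE];  // user data while allocated
} DemoBlock;

typedef struct {
    DemoBlock blocks[DEMO_BLOCKS];
    uint64_t free_bits;         // bit = 1 means free (assumption)
    DemoBlock* head;            // mini-magazine LIFO head
    int mag_count;
    int refills;                // how many times the medium path ran
} DemoSlab;

static void demo_slab_init(DemoSlab* s) {
    s->free_bits = ~0ULL;
    s->head = NULL;
    s->mag_count = 0;
    s->refills = 0;
}

// Medium path: move up to DEMO_BATCH free blocks from bitmap to magazine.
static int demo_refill(DemoSlab* s) {
    int got = 0;
    while (got < DEMO_BATCH && s->free_bits != 0) {
        int idx = __builtin_ctzll(s->free_bits);
        s->free_bits &= s->free_bits - 1;    // clear lowest set bit
        DemoBlock* b = &s->blocks[idx];
        b->next = s->head;
        s->head = b;
        s->mag_count++;
        got++;
    }
    if (got > 0) s->refills++;
    return got;
}

// Fast path first; fall back to one batch refill; NULL when slab is empty.
static void* demo_alloc(DemoSlab* s) {
    if (s->head == NULL && demo_refill(s) == 0) return NULL;
    DemoBlock* b = s->head;
    s->head = b->next;
    s->mag_count--;
    return b;
}
```

Draining the toy slab takes 64 allocations but only 4 bitmap refills, so 60 of the 64 calls stay on the pointer-pop fast path. The very first allocation returns block 15, because the refill pushes blocks 0 through 15 and the magazine is LIFO.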


Phase 3: Remove Statistics from Hot Path (1 hour)

Goal: Eliminate XOR RNG overhead

Files to modify:

  1. Create hakmem_tiny_stats.h: Batched statistics
  2. hakmem_tiny.c: Replace XOR RNG with batched counters

Changes:

// Remove lines 658-659, 677-678, 793-794 (XOR RNG)
// Replace with:
stats_record_alloc(class_idx);

// Add periodic flush (in slow path)
stats_flush_if_needed(class_idx);

Testing:

# Verify statistics still work
./test_mf2
# Then call hak_tiny_get_stats(...) from a test program; it should show reasonable counts

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 55-65ns → 45-55ns

Expected: 55-65ns → 45-55ns (about 10ns faster)


Phase 4: TLS Magazine Integration (1-2 hours)

Goal: Optimize TLS Magazine refill from mini-magazines

Files to modify:

  1. hakmem_tiny.c: Refill TLS Magazine from slab mini-magazines

Changes:

// When TLS Magazine is low, refill from multiple slab mini-magazines
static void refill_tls_magazine(int class_idx) {
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    int room = mag->cap - mag->top;
    if (room <= 0) return;

    // Try TLS active slab A
    TinySlab* tls_a = g_tls_active_slab_a[class_idx];
    if (tls_a) {
        while (room > 0 && tls_a->mini_mag.count > 0) {
            void* p = mini_mag_pop(&tls_a->mini_mag);
            if (!p) break;
            mag->items[mag->top++].ptr = p;
            room--;
        }

        // If mini-mag empty, refill from bitmap
        if (tls_a->mini_mag.count == 0 && tls_a->free_count > 0) {
            batch_refill_from_bitmap(tls_a, &tls_a->mini_mag, 16);
        }
    }

    // Try TLS active slab B (if still room)
    if (room > 0) {
        TinySlab* tls_b = g_tls_active_slab_b[class_idx];
        // ... similar logic ...
    }
}

Testing:

# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1
./bench_tiny_mt --iterations=100000 --threads=4
./bench_allocators_hakmem --scenario json
./bench_allocators_hakmem --scenario mir

# Expected: 45-55ns → 40-50ns

Expected: 45-55ns → 40-50ns (about 5ns faster)


Code Organization (綺麗綺麗, "squeaky clean")

File Structure

hakmem/
├── hakmem_tiny.h                  # Main header (modified)
├── hakmem_tiny.c                  # Main implementation (modified)
├── hakmem_tiny_mini_mag.h         # NEW: Mini-magazine operations
├── hakmem_tiny_batch_refill.h     # NEW: Batch refill/spill
├── hakmem_tiny_stats.h            # NEW: Batched statistics
└── hakmem_tiny_superslab.h        # Existing (unchanged)

Module Dependencies

hakmem_tiny.c
  ├── includes hakmem_tiny.h
  ├── includes hakmem_tiny_mini_mag.h     (Phase 1)
  ├── includes hakmem_tiny_batch_refill.h (Phase 2)
  └── includes hakmem_tiny_stats.h        (Phase 3)

Coding Standards

Naming Convention:

// Prefix: mini_mag_*, batch_*, stats_*
mini_mag_pop()
mini_mag_push()
batch_refill_from_bitmap()
batch_spill_to_bitmap()
stats_record_alloc()
stats_flush_if_needed()

Inline Policy:

// Hot path: always inline
static inline void* mini_mag_pop(...) __attribute__((always_inline));

// Medium path: let compiler decide
static inline int batch_refill_from_bitmap(...);

// Cold path: never inline
static void* tiny_alloc_slow_path(...) __attribute__((noinline));

Alignment:

// Cache-line aligned structures
typedef struct __attribute__((aligned(64))) {
    // ...
} PageMiniMag;

Comments:

// Module-level comment (box)
// ============================================================================
// Page Mini-Magazine: Fast LIFO Cache (Data Plane)
// ============================================================================

// Function-level comment (purpose + cost)
// Fast path: Pop from mini-magazine (1-2 ns)
static inline void* mini_mag_pop(PageMiniMag* mag) {
    // ...
}

Testing Strategy

Unit Tests

test_mini_mag.c:

// Test mini-magazine operations
// (needs <assert.h>, <stdlib.h>, and hakmem_tiny_mini_mag.h)
void test_push_pop(void) {
    PageMiniMag mag = {.head = NULL, .count = 0, .capacity = 16};
    void* ptrs[16];

    // Push 16 items
    for (int i = 0; i < 16; i++) {
        ptrs[i] = malloc(64);
        assert(ptrs[i] != NULL);
        assert(mini_mag_push(&mag, ptrs[i]) == 1);
    }
    assert(mag.count == 16);

    // Push when full (should fail)
    void* extra = malloc(64);
    assert(extra != NULL);
    assert(mini_mag_push(&mag, extra) == 0);
    free(extra);  // rejected block is still owned by the test

    // Pop 16 items (LIFO order)
    for (int i = 15; i >= 0; i--) {
        void* p = mini_mag_pop(&mag);
        assert(p == ptrs[i]);  // LIFO
        free(p);               // keep the test valgrind-clean
    }
    assert(mag.count == 0);

    // Pop when empty (should return NULL)
    assert(mini_mag_pop(&mag) == NULL);
}

test_batch_refill.c:

// Test batch refill from bitmap
void test_refill(void) {
    TinySlab* slab = allocate_new_slab(0);  // 8B class
    assert(slab != NULL);
    assert(slab->free_count == 8192);       // 64KB slab / 8B blocks

    // Refill 16 items
    int got = batch_refill_from_bitmap(slab, &slab->mini_mag, 16);
    assert(got == 16);
    assert(slab->mini_mag.count == 16);
    assert(slab->free_count == 8192 - 16);

    // Verify items are valid: non-NULL and inside the slab
    for (int i = 0; i < 16; i++) {
        char* p = (char*)mini_mag_pop(&slab->mini_mag);
        assert(p != NULL);
        assert(p >= (char*)slab->base && p < (char*)slab->base + TINY_SLAB_SIZE);
    }
}

Integration Tests

Existing tests should pass:

./test_mf2
./test_mf2_warmup
./bench_tiny --iterations=100000 --threads=1
./bench_tiny_mt --iterations=100000 --threads=4

Performance Tests

Before/After comparison:

# Baseline (before optimization)
./bench_tiny --iterations=1000000 --threads=1 > baseline.txt

# After Phase 1 (mini-magazine added but not used)
./bench_tiny --iterations=1000000 --threads=1 > phase1.txt
diff baseline.txt phase1.txt  # ns/op should match baseline within run-to-run noise

# After Phase 2 (mini-magazine integrated)
./bench_tiny --iterations=1000000 --threads=1 > phase2.txt
# Expected: 83ns → 55-65ns

# After Phase 3 (statistics batched)
./bench_tiny --iterations=1000000 --threads=1 > phase3.txt
# Expected: 55-65ns → 45-55ns

# After Phase 4 (TLS integration)
./bench_tiny --iterations=1000000 --threads=1 > phase4.txt
# Expected: 45-55ns → 40-50ns
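
For readers without the bench_tiny binary at hand, an ns/op figure of this kind can be approximated with a simple timing loop. The sketch below is not how bench_tiny works (its internals are not described here). It times plain `malloc`/`free` as a stand-in for the hakmem entry points, and `demo_ns_per_op` is a hypothetical helper using POSIX `clock_gettime`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   // expose clock_gettime in strict modes
#endif
#include <stdlib.h>
#include <time.h>

// Average nanoseconds per malloc/free pair of `size` bytes over `iters` runs.
// Assumption: plain malloc/free stand in for the allocator under test.
static double demo_ns_per_op(size_t size, long iters) {
    struct timespec t0, t1;
    void* volatile sink;   // keeps the compiler from eliding the pair

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        sink = malloc(size);
        free(sink);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)iters;
}
```

A single run is noisy, so taking the median of several calls such as `demo_ns_per_op(64, 1000000)` gives a more stable number to compare across phases.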

Success Criteria

Performance Targets

Phase      Target        Pass Criteria
Baseline   83 ns/op      Current performance
Phase 1    83 ns/op      No regression
Phase 2    55-65 ns/op   22-34% lower latency than baseline
Phase 3    45-55 ns/op   34-46% lower latency than baseline
Phase 4    40-50 ns/op   40-52% lower latency than baseline
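
The improvement percentages are relative to the 83 ns/op baseline. As a worked example, the arithmetic can be written as a one-line helper (`demo_reduction_pct` is a hypothetical name used only for illustration):

```c
// Percentage latency reduction when going from `before_ns` to `after_ns`.
static double demo_reduction_pct(double before_ns, double after_ns) {
    return (before_ns - after_ns) / before_ns * 100.0;
}
```

For the 83 ns baseline this gives about 21.7% at 65 ns, 33.7% at 55 ns, 45.8% at 45 ns, 39.8% at 50 ns, and 51.8% at 40 ns.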

Functional Requirements

  • All existing tests pass
  • No memory leaks (valgrind clean)
  • Thread-safe (helgrind clean)
  • Statistics accurate (within 1% of actual)
  • No regressions on L2/L2.5 pools

Code Quality

  • Clean compilation (-Wall -Wextra -Werror)
  • Modular design (separate .h files)
  • Inline hints appropriate
  • Cache-line aligned critical structures
  • Comments explain "why" not "what"

Rollback Plan

Each phase lands as its own commit, so the phases can be reverted one at a time, newest first:

# Revert Phase 4
git revert <phase4-commit>

# Revert Phase 3
git revert <phase3-commit>

# Revert Phase 2
git revert <phase2-commit>

# Revert Phase 1
git revert <phase1-commit>

All changes are backward-compatible (additive only).


Last Updated: 2025-10-26
Status: Design complete, ready for implementation
Next: Begin Phase 1 - Add Mini-Magazine to TinySlab