hakmem/docs/design/HYBRID_IMPLEMENTATION_DESIGN.md
# Hybrid Bitmap+Mini-Magazine Implementation Design
**Date**: 2025-10-26
**Goal**: Clean, modular implementation of Hybrid approach
**Target**: 83ns → 40-50ns (40-52% lower latency)
**Philosophy**: Operation Squeaky Clean (綺麗綺麗大作戦): clean code from the start

---
## Current Structure Analysis
### Existing Implementation (Phase 6.X)
**Already Implemented** ✅:
1. **Two-Tier Bitmap**: `summary` + `bitmap` (lines 103-106 in hakmem_tiny.h)
2. **TLS Magazine**: 2048 items (lines 54-70 in hakmem_tiny.c)
3. **TLS Active Slabs**: A/B slabs (lines 72-73)
4. **Remote Free MPSC**: Lock-free remote free (lines 126-138)
5. **Hint Word**: Scan start position (line 102)
**Current Hot Path** (lines 656-661):
```c
if (mag->top > 0) {
    void* p = mag->items[--mag->top].ptr;
    t_tiny_rng ^= t_tiny_rng << 13;  // Statistics XOR (3 ns)
    t_tiny_rng ^= t_tiny_rng >> 17;
    t_tiny_rng ^= t_tiny_rng << 5;
    if ((t_tiny_rng & ...) == 0) g_tiny_pool.alloc_count[class_idx]++;
    return p;
}
```
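The three shifts in the hot path above are one xorshift32 step, and the masked comparison samples only a fraction of allocations into the global counter. The sketch below isolates that pattern. The mask value is elided in the source, so `DEMO_SAMPLE_MASK` is a hypothetical stand-in (0xF would sample roughly 1 in 16), and the `demo_*` names are illustrative, not part of hakmem.

```c
#include <stdint.h>

// One xorshift32 step using the same 13/17/5 shift triple as the hot path.
static inline uint32_t xorshift32_step(uint32_t x) {
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

// HYPOTHETICAL mask: the real value is elided in the source.
// 0xF makes roughly 1 in 16 events hit the counter.
#define DEMO_SAMPLE_MASK 0xFu

// Simulates n allocations and returns how many were sampled into the counter.
static uint64_t demo_sampled_count(uint32_t seed, uint64_t n) {
    uint32_t rng = seed;  // must be nonzero: xorshift maps 0 to 0 forever
    uint64_t sampled = 0;
    for (uint64_t i = 0; i < n; i++) {
        rng = xorshift32_step(rng);
        if ((rng & DEMO_SAMPLE_MASK) == 0) sampled++;
    }
    return sampled;
}
```

Every allocation pays for the three dependent shift/XOR pairs even when the counter is not touched, which is the cost the batched counters in Module 4 aim to remove.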
**Bottlenecks Identified**:
1. ❌ Statistics in hot path: +10-15ns (lines 658-659, 677-678, 793-794)
2. ❌ Bitmap scan on TLS slab: +5-6ns (line 671, 776)
3. ❌ Multiple TLS reads: +2-3ns (mag, slab_a, slab_b)
**Current Performance**: 83 ns/op

---
## Clean Modular Design
### Module 1: Page Mini-Magazine (Data Plane)
**Purpose**: O(1) LIFO free-list for fast allocation
**Location**: New file `hakmem_tiny_mini_mag.h`
```c
// ============================================================================
// Page Mini-Magazine: Fast LIFO Cache (Data Plane)
// ============================================================================
#include <stddef.h>  // NULL
#include <stdint.h>  // uint16_t

// Intrusive LIFO block
typedef struct MiniMagBlock {
    struct MiniMagBlock* next;  // 8 bytes in free block
} MiniMagBlock;

// Mini-magazine metadata (embedded in TinySlab)
typedef struct {
    MiniMagBlock* head;   // LIFO stack head
    uint16_t count;       // Current item count
    uint16_t capacity;    // Max items (16-32)
} PageMiniMag;

// Fast path: Pop from mini-magazine (1-2 ns)
static inline void* mini_mag_pop(PageMiniMag* mag) {
    MiniMagBlock* b = mag->head;
    if (!b) return NULL;
    mag->head = b->next;
    mag->count--;
    return (void*)b;
}

// Fast path: Push to mini-magazine (1-2 ns)
// ptr must point to a free block of at least sizeof(MiniMagBlock) bytes.
static inline int mini_mag_push(PageMiniMag* mag, void* ptr) {
    if (mag->count >= mag->capacity) return 0;  // Full
    MiniMagBlock* b = (MiniMagBlock*)ptr;
    b->next = mag->head;
    mag->head = b;
    mag->count++;
    return 1;
}
```
**Characteristics**:
- ✅ Zero metadata overhead (the next-pointer is stored inside the free block itself, so blocks must be at least pointer-sized)
- ✅ O(1) pop/push (1-2 ns)
- ✅ Cache-friendly (LIFO = temporal locality)
- ✅ No locks (owner-only access)
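To make the intrusive LIFO behaviour concrete, here is a minimal standalone sketch that carves a raw buffer into fixed-size blocks and threads them through a magazine. The `Demo*` types mirror `MiniMagBlock` and `PageMiniMag` but are redefined locally so the block compiles on its own; `demo_fill` is a hypothetical helper, not an existing hakmem function.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Local mirrors of MiniMagBlock / PageMiniMag so the sketch is self-contained.
typedef struct DemoBlock { struct DemoBlock* next; } DemoBlock;
typedef struct {
    DemoBlock* head;
    uint16_t count;
    uint16_t capacity;
} DemoMag;

// Pops the most recently pushed block, or NULL when the magazine is empty.
static inline void* demo_pop(DemoMag* m) {
    DemoBlock* b = m->head;
    if (!b) return NULL;
    m->head = b->next;
    m->count--;
    return b;
}

// Pushes a free block; returns 0 when the magazine is already full.
static inline int demo_push(DemoMag* m, void* p) {
    if (m->count >= m->capacity) return 0;
    DemoBlock* b = (DemoBlock*)p;
    b->next = m->head;  // the link lives in the free block's own bytes
    m->head = b;
    m->count++;
    return 1;
}

// Hypothetical helper: carve up to n blocks of block_size bytes out of buf
// and push them in address order. Returns how many were accepted.
static int demo_fill(DemoMag* m, void* buf, size_t block_size, int n) {
    int pushed = 0;
    for (int i = 0; i < n; i++) {
        if (!demo_push(m, (char*)buf + (size_t)i * block_size)) break;
        pushed++;
    }
    return pushed;
}
```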
---
### Module 2: Batch Refill from Bitmap (Control Plane)
**Purpose**: Batch refill mini-magazine from two-tier bitmap
**Location**: New file `hakmem_tiny_batch_refill.h`
```c
// ============================================================================
// Batch Refill: Two-Tier Bitmap → Mini-Magazine (Control Plane)
// ============================================================================

// Refill mini-magazine from bitmap (batch of N items)
// Returns number of items refilled
static inline int batch_refill_from_bitmap(
    TinySlab* slab,
    PageMiniMag* mag,
    int want
) {
    // Clamp to the remaining room so a push can never fail after the block
    // has already been marked used (that would leak the block).
    int room = (int)mag->capacity - (int)mag->count;
    if (want > room) want = room;
    if (want <= 0) return 0;

    size_t block_size = g_tiny_class_sizes[slab->class_idx];
    int got = 0;

    // Two-tier bitmap scan (using existing summary)
    while (got < want && slab->free_count > 0) {
        int block_idx = hak_tiny_find_free_block(slab);
        if (block_idx < 0) break;

        // Mark as used in bitmap
        hak_tiny_set_used(slab, block_idx);
        slab->free_count--;

        // Calculate block pointer
        void* ptr = (char*)slab->base + ((size_t)block_idx * block_size);

        // Push to mini-magazine (cannot fail: want <= room)
        mini_mag_push(mag, ptr);
        got++;
    }
    return got;
}

// Spill mini-magazine back to bitmap (on slab eviction)
static inline void batch_spill_to_bitmap(
    TinySlab* slab,
    PageMiniMag* mag
) {
    size_t block_size = g_tiny_class_sizes[slab->class_idx];
    while (mag->count > 0) {
        void* ptr = mini_mag_pop(mag);
        if (!ptr) break;

        // Calculate block index
        uintptr_t offset = (uintptr_t)ptr - (uintptr_t)slab->base;
        int block_idx = (int)(offset / block_size);

        // Mark as free in bitmap
        hak_tiny_set_free(slab, block_idx);
        slab->free_count++;
    }
}
```
**Characteristics**:
- ✅ Batch processing (amortized cost: 3ns/item)
- ✅ Uses existing two-tier bitmap
- ✅ Preserves bitmap consistency
- ✅ Minimal code (~40 lines)
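The refill loop relies on `hak_tiny_find_free_block`, whose body is not shown in this document. As a rough illustration of how a two-tier scan can work, the sketch below keeps one summary bit per bitmap word and uses count-trailing-zeros to jump straight to a free block. The bit convention (set = free), the `Demo*` names, and the fixed 8192-block size are all assumptions made for the example. `__builtin_ctzll` is a GCC/Clang builtin.

```c
#include <stdint.h>

#define DEMO_WORDS 128                        // 128 words * 64 bits = 8192 blocks
#define DEMO_SUMMARY_WORDS ((DEMO_WORDS + 63) / 64)

typedef struct {
    uint64_t summary[DEMO_SUMMARY_WORDS];     // bit w set => bitmap[w] != 0
    uint64_t bitmap[DEMO_WORDS];              // ASSUMED: bit set => block is free
} DemoBitmap;

// Marks a block free and records that its word is non-empty.
static void demo_mark_free(DemoBitmap* b, int idx) {
    int w = idx / 64;
    b->bitmap[w] |= 1ULL << (idx % 64);
    b->summary[w / 64] |= 1ULL << (w % 64);
}

// Marks a block used; clears the summary bit once the word has no free bits.
static void demo_mark_used(DemoBitmap* b, int idx) {
    int w = idx / 64;
    b->bitmap[w] &= ~(1ULL << (idx % 64));
    if (b->bitmap[w] == 0) b->summary[w / 64] &= ~(1ULL << (w % 64));
}

// Returns the lowest free block index, or -1 when nothing is free.
static int demo_find_free(const DemoBitmap* b) {
    for (int s = 0; s < DEMO_SUMMARY_WORDS; s++) {
        if (b->summary[s] == 0) continue;     // skip 64 empty words at once
        int w = s * 64 + __builtin_ctzll(b->summary[s]);
        return w * 64 + __builtin_ctzll(b->bitmap[w]);
    }
    return -1;
}
```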
---
### Module 3: Integrated TinySlab Structure
**Purpose**: Add mini-magazine to existing TinySlab
**Location**: `hakmem_tiny.h` (modification)
```c
// Modified TinySlab structure
typedef struct TinySlab {
    void* base;                 // Base address (64KB aligned)
    uint64_t* bitmap;           // Free block bitmap (dynamic size)
    uint16_t free_count;        // Number of free blocks
    uint16_t total_count;       // Total blocks in slab
    uint8_t class_idx;          // Size class index (0-7)
    uint8_t _padding[3];
    struct TinySlab* next;      // Next slab in list

    // MPSC remote-free stack head (lock-free)
    atomic_uintptr_t remote_head;
    atomic_uint remote_count;
    pthread_t owner_tid;

    // Two-tier bitmap (existing)
    uint16_t hint_word;
    uint8_t summary_words;
    uint8_t _pad_sum[1];
    uint64_t* summary;

    // NEW: Page Mini-Magazine (16-32 items)
    PageMiniMag mini_mag;       // 16 bytes on LP64 (12 bytes of fields + padding)
} TinySlab;
```
**Changes**:
- ✅ Additive only (no breaking changes)
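- ✅ Header grows by 16 bytes (on LP64 targets the struct is larger than one 64-byte cache line, so 64-byte alignment applies to where it starts, not to its size)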
- ✅ Backward compatible
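The size and alignment claims above can be checked mechanically instead of by hand. The sketch below redefines the structure locally (as `DemoTinySlab`, a mirror for illustration only) and exposes its size in cache lines and the offset of the new field. On an x86-64 Linux build the header likely comes to 88 bytes, i.e. two cache lines, but `pthread_t` and pointer sizes are platform-dependent, so treat that number as an estimate.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

// Local mirrors of the design's types, for layout inspection only.
typedef struct DemoMiniMagBlock { struct DemoMiniMagBlock* next; } DemoMiniMagBlock;
typedef struct {
    DemoMiniMagBlock* head;
    uint16_t count;
    uint16_t capacity;
} DemoPageMiniMag;

typedef struct DemoTinySlab {
    void* base;
    uint64_t* bitmap;
    uint16_t free_count;
    uint16_t total_count;
    uint8_t class_idx;
    uint8_t _padding[3];
    struct DemoTinySlab* next;
    atomic_uintptr_t remote_head;
    atomic_uint remote_count;
    pthread_t owner_tid;
    uint16_t hint_word;
    uint8_t summary_words;
    uint8_t _pad_sum[1];
    uint64_t* summary;
    DemoPageMiniMag mini_mag;   // the new field
} DemoTinySlab;

// Number of 64-byte cache lines one slab header occupies on this target.
static size_t demo_slab_cache_lines(void) {
    return (sizeof(DemoTinySlab) + 63) / 64;
}

// Byte offset of the mini-magazine inside the slab header.
static size_t demo_mini_mag_offset(void) {
    return offsetof(DemoTinySlab, mini_mag);
}
```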
---
### Module 4: Statistics Batching (Out-of-Band)
**Purpose**: Remove statistics from hot path
**Location**: New file `hakmem_tiny_stats.h`
```c
// ============================================================================
// Statistics: Batched TLS Counters (Out-of-Band)
// ============================================================================

// Per-thread batch counters (flushed periodically)
static __thread uint32_t t_alloc_batch[TINY_NUM_CLASSES];
static __thread uint32_t t_free_batch[TINY_NUM_CLASSES];

// Fast path: Increment TLS counter only (0.5 ns)
static inline void stats_record_alloc(int class_idx) {
    t_alloc_batch[class_idx]++;
}

static inline void stats_record_free(int class_idx) {
    t_free_batch[class_idx]++;
}

// Cold path: Flush batch to global counters
// NOTE: assumes the global counters tolerate concurrent flushes from
// several threads (e.g. atomic adds); that detail is not specified here.
static inline void stats_flush_if_needed(int class_idx) {
    // Flush once at least 256 events are pending. Adding the actual pending
    // count (not a fixed 256) keeps totals exact even if this is only called
    // from the slow path and the counter has already passed 256.
    if (t_alloc_batch[class_idx] >= 256) {
        g_tiny_pool.alloc_count[class_idx] += t_alloc_batch[class_idx];
        t_alloc_batch[class_idx] = 0;
    }
    if (t_free_batch[class_idx] >= 256) {
        g_tiny_pool.free_count[class_idx] += t_free_batch[class_idx];
        t_free_batch[class_idx] = 0;
    }
}
```
**Characteristics**:
- ✅ Hot path: 0.5 ns (simple increment)
- ✅ Cold path: Batch flush (every 256 ops)
- ✅ Accuracy: 99.6% (vs 93.75% with sampling); global counters lag only by each thread's unflushed batch
- ✅ No XOR RNG overhead
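A small standalone model of the batching scheme shows how the global total and the per-thread remainder relate. This is a sketch under assumptions: the `demo_*` names and the class count of 8 are illustrative, and the global counter is made atomic here because the design does not say how concurrent flushes are synchronized.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8          // hypothetical class count for the demo
#define DEMO_FLUSH_THRESHOLD 256u

static _Atomic uint64_t g_demo_alloc_count[DEMO_NUM_CLASSES];
static _Thread_local uint32_t t_demo_alloc_batch[DEMO_NUM_CLASSES];

// Hot path: a plain thread-local increment.
static inline void demo_record_alloc(int class_idx) {
    t_demo_alloc_batch[class_idx]++;
}

// Cold path: publish the pending count once the threshold is reached.
static inline void demo_flush_if_needed(int class_idx) {
    uint32_t pending = t_demo_alloc_batch[class_idx];
    if (pending >= DEMO_FLUSH_THRESHOLD) {
        atomic_fetch_add_explicit(&g_demo_alloc_count[class_idx],
                                  (uint64_t)pending, memory_order_relaxed);
        t_demo_alloc_batch[class_idx] = 0;
    }
}

// Published total for a class.
static uint64_t demo_global_count(int class_idx) {
    return atomic_load_explicit(&g_demo_alloc_count[class_idx],
                                memory_order_relaxed);
}

// Events recorded by this thread but not yet published.
static uint32_t demo_pending_count(int class_idx) {
    return t_demo_alloc_batch[class_idx];
}

// Records n allocations for one class, checking the threshold after each.
static void demo_run(int class_idx, int n) {
    for (int i = 0; i < n; i++) {
        demo_record_alloc(class_idx);
        demo_flush_if_needed(class_idx);
    }
}
```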
---
## Implementation Phases
### Phase 1: Add Mini-Magazine to TinySlab (2-3 hours)
**Goal**: Enable page-level fast path
**Files to modify**:
1. `hakmem_tiny.h`: Add `PageMiniMag` to `TinySlab`
2. Create `hakmem_tiny_mini_mag.h`: Mini-magazine operations
3. Create `hakmem_tiny_batch_refill.h`: Batch refill/spill
**Changes**:
```c
// hakmem_tiny.h (line 107, after summary)
typedef struct TinySlab {
    // ... existing fields ...
    uint64_t* summary;

    // NEW: Page Mini-Magazine
    PageMiniMag mini_mag;
} TinySlab;

// Initialize mini-magazine on slab creation (hakmem_tiny.c:allocate_new_slab)
slab->mini_mag.head = NULL;
slab->mini_mag.count = 0;
slab->mini_mag.capacity = 16;  // Start with 16 items
```
**Testing**:
```bash
# Compile
make clean && make -j4
# Unit test: Mini-magazine operations
# (create test_mini_mag.c)
./test_mini_mag
# Integration test: Verify no regressions
./bench_tiny --iterations=100000 --threads=1
```
**Expected**: No performance change (feature not used yet)

---
### Phase 2: Integrate Mini-Magazine into Hot Path (3-4 hours)
**Goal**: Use mini-magazine in allocation fast path
**Files to modify**:
1. `hakmem_tiny.c`: Modify `hak_tiny_alloc()` to use mini-mag
**New hot path logic**:
```c
void* hak_tiny_alloc(size_t size) {
    int class_idx = hak_tiny_size_to_class(size);
    if (class_idx < 0) return NULL;

    // 1. TLS Magazine (existing fast path)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        void* p = mag->items[--mag->top].ptr;
        stats_record_alloc(class_idx);  // NEW: Batched
        return p;
    }

    // 2. TLS Active Slab Mini-Magazine (NEW: lock-free)
    TinySlab* tls = g_tls_active_slab_a[class_idx];
    if (tls && tls->mini_mag.count > 0) {
        void* p = mini_mag_pop(&tls->mini_mag);
        if (p) {
            stats_record_alloc(class_idx);
            return p;
        }
    }

    // 3. Refill mini-magazine from bitmap (medium path)
    if (tls && tls->free_count > 0) {
        int got = batch_refill_from_bitmap(tls, &tls->mini_mag, 16);
        if (got > 0) {
            void* p = mini_mag_pop(&tls->mini_mag);
            if (p) {
                stats_record_alloc(class_idx);
                return p;
            }
        }
    }

    // 4. Slow path (existing global pool logic)
    return tiny_alloc_slow_path(class_idx);
}
```
**Testing**:
```bash
# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 55-65ns
# Multi-threaded test
./bench_tiny_mt --iterations=100000 --threads=4
# Expected: No regressions
# Correctness test
./test_mf2
./test_mf2_warmup
```
**Expected**: 83ns → 55-65ns (22-34% lower latency)

---
### Phase 3: Remove Statistics from Hot Path (1 hour)
**Goal**: Eliminate XOR RNG overhead
**Files to modify**:
1. Create `hakmem_tiny_stats.h`: Batched statistics
2. `hakmem_tiny.c`: Replace XOR RNG with batched counters
**Changes**:
```c
// Remove lines 658-659, 677-678, 793-794 (XOR RNG)
// Replace with:
stats_record_alloc(class_idx);
// Add periodic flush (in slow path)
stats_flush_if_needed(class_idx);
```
**Testing**:
```bash
# Verify statistics still work
./test_mf2
# Then check hak_tiny_get_stats(...) from a test program; it should show reasonable counts
# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 55-65ns → 45-55ns
```
**Expected**: 55-65ns → 45-55ns (about 10 ns faster)

---
### Phase 4: TLS Magazine Integration (1-2 hours)
**Goal**: Optimize TLS Magazine refill from mini-magazines
**Files to modify**:
1. `hakmem_tiny.c`: Refill TLS Magazine from slab mini-magazines
**Changes**:
```c
// When TLS Magazine is low, refill from multiple slab mini-magazines
static void refill_tls_magazine(int class_idx) {
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    int room = mag->cap - mag->top;
    if (room <= 0) return;

    // Try TLS active slab A
    TinySlab* tls_a = g_tls_active_slab_a[class_idx];
    if (tls_a) {
        while (room > 0 && tls_a->mini_mag.count > 0) {
            void* p = mini_mag_pop(&tls_a->mini_mag);
            if (!p) break;
            mag->items[mag->top++].ptr = p;
            room--;
        }
        // If mini-mag empty, refill from bitmap
        if (tls_a->mini_mag.count == 0 && tls_a->free_count > 0) {
            batch_refill_from_bitmap(tls_a, &tls_a->mini_mag, 16);
        }
    }

    // Try TLS active slab B (if still room)
    if (room > 0) {
        TinySlab* tls_b = g_tls_active_slab_b[class_idx];
        // ... similar logic ...
    }
}
```
**Testing**:
```bash
# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1
./bench_tiny_mt --iterations=100000 --threads=4
./bench_allocators_hakmem --scenario json
./bench_allocators_hakmem --scenario mir
# Expected: 45-55ns → 40-50ns
```
**Expected**: 45-55ns → 40-50ns (about 5 ns faster)

---
## Code Organization (Squeaky Clean)
### File Structure
```
hakmem/
├── hakmem_tiny.h # Main header (modified)
├── hakmem_tiny.c # Main implementation (modified)
├── hakmem_tiny_mini_mag.h # NEW: Mini-magazine operations
├── hakmem_tiny_batch_refill.h # NEW: Batch refill/spill
├── hakmem_tiny_stats.h # NEW: Batched statistics
└── hakmem_tiny_superslab.h # Existing (unchanged)
```
### Module Dependencies
```
hakmem_tiny.c
├── includes hakmem_tiny.h
├── includes hakmem_tiny_mini_mag.h (Phase 1)
├── includes hakmem_tiny_batch_refill.h (Phase 2)
└── includes hakmem_tiny_stats.h (Phase 3)
```
### Coding Standards
**Naming Convention**:
```c
// Prefix: mini_mag_*, batch_*, stats_*
mini_mag_pop()
mini_mag_push()
batch_refill_from_bitmap()
batch_spill_to_bitmap()
stats_record_alloc()
stats_flush_if_needed()
```
**Inline Policy**:
```c
// Hot path: always inline
static inline void* mini_mag_pop(PageMiniMag* mag) __attribute__((always_inline));
// Medium path: let compiler decide
static inline int batch_refill_from_bitmap(TinySlab* slab, PageMiniMag* mag, int want);
// Cold path: never inline (returns the block, as hak_tiny_alloc expects)
static void* tiny_alloc_slow_path(int class_idx) __attribute__((noinline));
```
**Alignment**:
```c
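// Cache-line aligned structures.
// Align the slab header rather than the embedded PageMiniMag:
// aligned(64) on PageMiniMag would pad it from 16 to 64 bytes.
typedef struct __attribute__((aligned(64))) TinySlab {
    // ...
} TinySlab;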
```
**Comments**:
```c
// Module-level comment (box)
// ============================================================================
// Page Mini-Magazine: Fast LIFO Cache (Data Plane)
// ============================================================================
// Function-level comment (purpose + cost)
// Fast path: Pop from mini-magazine (1-2 ns)
static inline void* mini_mag_pop(PageMiniMag* mag) {
// ...
}
```
---
## Testing Strategy
### Unit Tests
**test_mini_mag.c**:
```c
#include <assert.h>
#include <stdlib.h>
#include "hakmem_tiny_mini_mag.h"

// Test mini-magazine operations
void test_push_pop(void) {
    PageMiniMag mag = {.head = NULL, .count = 0, .capacity = 16};
    void* ptrs[16];

    // Push 16 items
    for (int i = 0; i < 16; i++) {
        ptrs[i] = malloc(64);
        assert(ptrs[i] != NULL);
        assert(mini_mag_push(&mag, ptrs[i]) == 1);
    }
    assert(mag.count == 16);

    // Push when full (should fail)
    void* extra = malloc(64);
    assert(mini_mag_push(&mag, extra) == 0);
    free(extra);

    // Pop 16 items (LIFO order)
    for (int i = 15; i >= 0; i--) {
        void* p = mini_mag_pop(&mag);
        assert(p == ptrs[i]);  // LIFO
    }
    assert(mag.count == 0);

    // Pop when empty (should return NULL)
    assert(mini_mag_pop(&mag) == NULL);

    // Release the backing blocks so the test stays valgrind clean
    for (int i = 0; i < 16; i++) free(ptrs[i]);
}
```
**test_batch_refill.c**:
```c
// Test batch refill from bitmap
void test_refill(void) {
    TinySlab* slab = allocate_new_slab(0);  // 8B class
    assert(slab != NULL);
    assert(slab->free_count == 8192);

    // Refill 16 items
    int got = batch_refill_from_bitmap(slab, &slab->mini_mag, 16);
    assert(got == 16);
    assert(slab->mini_mag.count == 16);
    assert(slab->free_count == 8192 - 16);

    // Verify items are valid (compare as char* so both sides share a type)
    for (int i = 0; i < 16; i++) {
        char* p = (char*)mini_mag_pop(&slab->mini_mag);
        assert(p >= (char*)slab->base && p < (char*)slab->base + TINY_SLAB_SIZE);
    }
}
```
### Integration Tests
**Existing tests should pass**:
```bash
./test_mf2
./test_mf2_warmup
./bench_tiny --iterations=100000 --threads=1
./bench_tiny_mt --iterations=100000 --threads=4
```
### Performance Tests
**Before/After comparison**:
```bash
# Baseline (before optimization)
./bench_tiny --iterations=1000000 --threads=1 > baseline.txt
# After Phase 1 (mini-magazine added but not used)
./bench_tiny --iterations=1000000 --threads=1 > phase1.txt
diff baseline.txt phase1.txt  # Timings should match within run-to-run noise (not byte-identical)
# After Phase 2 (mini-magazine integrated)
./bench_tiny --iterations=1000000 --threads=1 > phase2.txt
# Expected: 83ns → 55-65ns
# After Phase 3 (statistics batched)
./bench_tiny --iterations=1000000 --threads=1 > phase3.txt
# Expected: 55-65ns → 45-55ns
# After Phase 4 (TLS integration)
./bench_tiny --iterations=1000000 --threads=1 > phase4.txt
# Expected: 45-55ns → 40-50ns
```
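The ns/op figures above come from `bench_tiny`, whose internals are not shown here. For readers who want to sanity-check a number independently, the following is a minimal sketch of one way to time allocate/free pairs with a monotonic clock. It measures the system `malloc`/`free` as a stand-in, and the `demo_*` names are hypothetical; swapping in `hak_tiny_alloc` and its matching free would be the obvious adaptation.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  // exposes clock_gettime under strict -std=c11
#endif
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

// Nanoseconds elapsed between two timespec samples.
static inline int64_t demo_elapsed_ns(struct timespec start, struct timespec end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

// Times `iters` malloc/free pairs of `size` bytes and returns the
// average cost of one pair in nanoseconds.
static double demo_ns_per_op(size_t size, uint64_t iters) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < iters; i++) {
        void* volatile p = malloc(size);  // volatile keeps the pair from being elided
        free(p);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)demo_elapsed_ns(t0, t1) / (double)iters;
}
```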
---
## Success Criteria
### Performance Targets
| Phase | Target | Pass Criteria (vs. 83 ns baseline) |
|-------|--------|---------------|
| **Baseline** | 83 ns/op | Current performance |
| **Phase 1** | 83 ns/op | No regression |
| **Phase 2** | 55-65 ns/op | +22-34% improvement |
| **Phase 3** | 45-55 ns/op | +34-46% improvement |
| **Phase 4** | 40-50 ns/op | **+40-52% improvement** ✅ |
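Improvement percentages are easiest to keep consistent when they are derived from the latency targets rather than stated separately. Measured against the 83 ns/op baseline, the reduction is (before − after) / before: 65 ns gives about 21.7%, 55 ns about 33.7%, 50 ns about 39.8%, 45 ns about 45.8%, and 40 ns about 51.8%. A one-line helper (hypothetical name) captures the same arithmetic:

```c
// Percentage reduction in latency going from before_ns to after_ns.
// before_ns must be nonzero.
static inline double demo_reduction_pct(double before_ns, double after_ns) {
    return (before_ns - after_ns) / before_ns * 100.0;
}
```

For example, `demo_reduction_pct(83.0, 50.0)` evaluates to roughly 39.76.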
### Functional Requirements
- [ ] All existing tests pass
- [ ] No memory leaks (valgrind clean)
- [ ] Thread-safe (helgrind clean)
- [ ] Statistics accurate (within 1% of actual)
- [ ] No regressions on L2/L2.5 pools
### Code Quality
- [ ] Clean compilation (`-Wall -Wextra -Werror`)
- [ ] Modular design (separate .h files)
- [ ] Inline hints appropriate
- [ ] Cache-line aligned critical structures
- [ ] Comments explain "why" not "what"
---
## Rollback Plan
Each phase lands as its own commit and can be reverted. Later phases build on earlier ones, so revert newest first:
```bash
# Revert Phase 4
git revert <phase4-commit>
# Revert Phase 3
git revert <phase3-commit>
# Revert Phase 2
git revert <phase2-commit>
# Revert Phase 1
git revert <phase1-commit>
```
All changes are backward-compatible (additive only).

---
**Last Updated**: 2025-10-26
**Status**: Design complete, ready for implementation
**Next**: Begin Phase 1 - Add Mini-Magazine to TinySlab