hakmem/docs/design/MIMALLOC_IMPLEMENTATION_ROADMAP.md

# mimalloc Optimization Implementation Roadmap
## Closing the 47% Performance Gap
**Current:** 16.53 M ops/sec
**Target:** 24.00 M ops/sec (+45%)
**Strategy:** Three-phase implementation with incremental validation
---
## Phase 1: Direct Page Cache ⚡ **HIGH PRIORITY**
**Target:** +2.5-3.3 M ops/sec (15-20% improvement)
**Effort:** 1-2 days
**Risk:** Low
**Dependencies:** None
### Implementation Steps
#### Step 1.1: Add Direct Cache to Heap Structure
**File:** `core/hakmem_tiny.h`
```c
#define HAKMEM_DIRECT_PAGES 129  // Indices 0..128 cover sizes up to 1024 bytes (128 * 8)

typedef struct hakmem_tiny_heap_s {
    // Existing fields...
    hakmem_tiny_class_t size_classes[32];

    // NEW: Direct page cache
    hakmem_tiny_page_t* pages_direct[HAKMEM_DIRECT_PAGES];

    // Existing fields...
} hakmem_tiny_heap_t;
```
**Memory cost:** 129 × 8 = 1,032 bytes per heap on 64-bit targets (acceptable)
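This cost can be pinned down at compile time so the array cannot silently grow. A minimal sketch, assuming an LP64 target; the wrapper struct and its name are illustrative, not part of the tree:

```c
#include <assert.h>

#define HAKMEM_DIRECT_PAGES 129

// Forward declaration stands in for the real page type.
typedef struct hakmem_tiny_page_s hakmem_tiny_page_t;

// Illustrative wrapper around the cache array from Step 1.1.
typedef struct {
    hakmem_tiny_page_t* pages_direct[HAKMEM_DIRECT_PAGES];
} hakmem_direct_cache_t;

// 129 slots of 8-byte pointers = 1,032 bytes on LP64 targets.
_Static_assert(sizeof(hakmem_direct_cache_t) == HAKMEM_DIRECT_PAGES * sizeof(void*),
               "direct cache must stay one pointer per slot");
```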
#### Step 1.2: Initialize Direct Cache
**File:** `core/hakmem_tiny.c`
```c
void hakmem_tiny_heap_init(hakmem_tiny_heap_t* heap) {
    // Existing initialization...

    // Initialize direct cache
    for (size_t i = 0; i < HAKMEM_DIRECT_PAGES; i++) {
        heap->pages_direct[i] = NULL;
    }

    // Populate from existing size classes
    hakmem_tiny_rebuild_direct_cache(heap);
}
```
#### Step 1.3: Cache Update Function
**File:** `core/hakmem_tiny.c`
```c
static inline void hakmem_tiny_update_direct_cache(
    hakmem_tiny_heap_t* heap,
    hakmem_tiny_page_t* page,
    size_t block_size)
{
    if (block_size > 1024) return;  // Only cache small sizes

    size_t idx = (block_size + 7) / 8;  // Round up to word size
    if (idx < HAKMEM_DIRECT_PAGES) {
        heap->pages_direct[idx] = page;
    }
}

// Call this whenever a page is added/removed from a size class
```
#### Step 1.4: Fast Path Using Direct Cache
**File:** `core/hakmem_tiny.c`
```c
static inline void* hakmem_tiny_malloc_direct(
    hakmem_tiny_heap_t* heap,
    size_t size)
{
    // Fast path: direct cache lookup
    if (size <= 1024) {
        size_t idx = (size + 7) / 8;
        hakmem_tiny_page_t* page = heap->pages_direct[idx];
        if (page && page->free_list) {
            // Pop from free list
            hakmem_block_t* block = page->free_list;
            page->free_list = block->next;
            page->used++;
            return block;
        }
    }

    // Fallback to existing generic path
    return hakmem_tiny_malloc_generic(heap, size);
}

// Update main malloc to call this:
void* hakmem_malloc(size_t size) {
    if (size <= HAKMEM_TINY_MAX) {
        return hakmem_tiny_malloc_direct(tls_heap, size);
    }
    // ... existing large allocation path
}
```
### Validation
**Benchmark command:**
```bash
./bench_random_mixed_hakx
```
**Expected output:**
```
Before: 16.53 M ops/sec
After: 19.00-20.00 M ops/sec (+15-20%)
```
**If target not met:**
1. Profile with `perf record -e cycles,cache-misses ./bench_random_mixed_hakx`
2. Check direct cache hit rate
3. Verify cache is being updated correctly
4. Check for branch mispredictions
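For item 2, a pair of debug-only counters is enough to measure the hit rate. This is a sketch with hypothetical names, not instrumentation that exists in the tree; in practice the counters would be wrapped in `!HAKMEM_BUILD_RELEASE` guards:

```c
#include <stdio.h>

// Hypothetical debug-only counters.
static unsigned long long g_direct_hits = 0;
static unsigned long long g_direct_misses = 0;

// Call at the two exit points of hakmem_tiny_malloc_direct():
// hit on the direct-cache return, miss on the generic fallback.
static inline void direct_cache_count(int hit) {
    if (hit) g_direct_hits++;
    else     g_direct_misses++;
}

// Report once at process exit, e.g. registered via atexit().
static void direct_cache_report(void) {
    unsigned long long total = g_direct_hits + g_direct_misses;
    if (total > 0) {
        fprintf(stderr, "direct cache hit rate: %.1f%% (%llu/%llu)\n",
                100.0 * (double)g_direct_hits / (double)total,
                g_direct_hits, total);
    }
}
```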
---
## Phase 2: Dual Free Lists 🚀 **MEDIUM PRIORITY**
**Target:** +2.0-3.3 M ops/sec additional (10-15% improvement)
**Effort:** 3-5 days
**Risk:** Medium (structural changes)
**Dependencies:** Phase 1 complete
### Implementation Steps
#### Step 2.1: Modify Page Structure
**File:** `core/hakmem_tiny.h`
```c
typedef struct hakmem_tiny_page_s {
    // Existing fields...
    uint32_t block_size;
    uint32_t capacity;

    // OLD: Single free list
    // hakmem_block_t* free_list;

    // NEW: Three separate free lists
    hakmem_block_t* free;              // Hot allocation path
    hakmem_block_t* local_free;        // Local frees (no atomic!)
    _Atomic(uintptr_t) thread_free;    // Remote frees + flags (lower 2 bits)

    uint32_t used;
    // ... other fields
} hakmem_tiny_page_t;
```
**Note:** `thread_free` encodes both pointer and flags in lower 2 bits (aligned blocks allow this)
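The encoding can be sketched with three small helpers. The names here are illustrative; the only assumption is that blocks are at least 4-byte aligned, so the low 2 bits of a block pointer are always zero:

```c
#include <stdint.h>

typedef struct hakmem_block_s {
    struct hakmem_block_s* next;
} hakmem_block_t;

#define HAKMEM_TF_FLAG_MASK ((uintptr_t)0x3)

// Pack a list head and up to 2 flag bits into one word.
static inline uintptr_t tf_encode(hakmem_block_t* head, uintptr_t flags) {
    return (uintptr_t)head | (flags & HAKMEM_TF_FLAG_MASK);
}

// Recover the pointer by masking off the flag bits.
static inline hakmem_block_t* tf_ptr(uintptr_t tf) {
    return (hakmem_block_t*)(tf & ~HAKMEM_TF_FLAG_MASK);
}

// Recover just the flag bits.
static inline uintptr_t tf_flags(uintptr_t tf) {
    return tf & HAKMEM_TF_FLAG_MASK;
}
```

The collection path in Step 2.3 strips the flags with the same mask before walking the remote list.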
#### Step 2.2: Update Free Path
**File:** `core/hakmem_tiny.c`
```c
void hakmem_tiny_free(void* ptr) {
    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(ptr);
    hakmem_block_t* block = (hakmem_block_t*)ptr;

    // Fast path: local thread owns this page
    if (hakmem_tiny_is_local_page(page)) {
        // Add to local_free (no atomic!)
        block->next = page->local_free;
        page->local_free = block;
        page->used--;

        // Retire page if fully free
        if (page->used == 0) {
            hakmem_tiny_page_retire(page);
        }
        return;
    }

    // Slow path: remote free (atomic)
    hakmem_tiny_free_remote(page, block);
}
```
#### Step 2.3: Migration Logic
**File:** `core/hakmem_tiny.c`
```c
static void hakmem_tiny_collect_frees(hakmem_tiny_page_t* page) {
    // Step 1: Collect remote frees (atomic)
    uintptr_t tfree = atomic_exchange(&page->thread_free, 0);
    hakmem_block_t* remote_list = (hakmem_block_t*)(tfree & ~(uintptr_t)0x3);
    if (remote_list) {
        // Append to local_free
        hakmem_block_t* tail = remote_list;
        while (tail->next) tail = tail->next;
        tail->next = page->local_free;
        page->local_free = remote_list;
    }

    // Step 2: Migrate local_free to free
    if (page->local_free && !page->free) {
        page->free = page->local_free;
        page->local_free = NULL;
    }
}

// Call this in the allocation path when the free list is empty
void* hakmem_tiny_malloc_direct(hakmem_tiny_heap_t* heap, size_t size) {
    // ... direct cache lookup
    hakmem_tiny_page_t* page = heap->pages_direct[idx];
    if (page) {
        // Try to allocate from free list
        hakmem_block_t* block = page->free;
        if (block) {
            page->free = block->next;
            page->used++;
            return block;
        }

        // Free list empty - collect and retry
        hakmem_tiny_collect_frees(page);
        block = page->free;
        if (block) {
            page->free = block->next;
            page->used++;
            return block;
        }
    }

    // Fallback
    return hakmem_tiny_malloc_generic(heap, size);
}
```
### Validation
**Benchmark command:**
```bash
./bench_random_mixed_hakx
```
**Expected output:**
```
After Phase 1: 19.00-20.00 M ops/sec
After Phase 2: 21.50-23.00 M ops/sec (+10-15% additional)
```
**Key metrics to track:**
1. Atomic operation count (should drop significantly)
2. Cache miss rate (should improve)
3. Free path latency (should be faster)
**If target not met:**
1. Profile atomic operations: `perf record -e cpu-cycles,instructions,cache-references,cache-misses ./bench_random_mixed_hakx`
2. Check remote free percentage
3. Verify migration is happening correctly
4. Analyze cache line bouncing
---
## Phase 3: Branch Hints + Bit-Packed Flags 🎯 **LOW PRIORITY**
**Target:** +1.0-2.0 M ops/sec additional (5-8% improvement)
**Effort:** 1-2 days
**Risk:** Low
**Dependencies:** Phase 2 complete
### Implementation Steps
#### Step 3.1: Add Branch Hint Macros
**File:** `core/hakmem_config.h`
```c
#if defined(__GNUC__) || defined(__clang__)
#define hakmem_likely(x) __builtin_expect(!!(x), 1)
#define hakmem_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define hakmem_likely(x) (x)
#define hakmem_unlikely(x) (x)
#endif
```
#### Step 3.2: Add Branch Hints to Hot Path
**File:** `core/hakmem_tiny.c`
```c
void* hakmem_tiny_malloc_direct(hakmem_tiny_heap_t* heap, size_t size) {
    // Fast path hint
    if (hakmem_likely(size <= 1024)) {
        size_t idx = (size + 7) / 8;
        hakmem_tiny_page_t* page = heap->pages_direct[idx];
        if (hakmem_likely(page != NULL)) {
            hakmem_block_t* block = page->free;
            if (hakmem_likely(block != NULL)) {
                page->free = block->next;
                page->used++;
                return block;
            }

            // Slow path within fast path
            hakmem_tiny_collect_frees(page);
            block = page->free;
            if (hakmem_likely(block != NULL)) {
                page->free = block->next;
                page->used++;
                return block;
            }
        }
    }

    // Fallback (unlikely)
    return hakmem_tiny_malloc_generic(heap, size);
}

void hakmem_tiny_free(void* ptr) {
    if (hakmem_unlikely(ptr == NULL)) return;

    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(ptr);
    hakmem_block_t* block = (hakmem_block_t*)ptr;

    // Local free is likely
    if (hakmem_likely(hakmem_tiny_is_local_page(page))) {
        block->next = page->local_free;
        page->local_free = block;
        page->used--;

        // Rarely fully free
        if (hakmem_unlikely(page->used == 0)) {
            hakmem_tiny_page_retire(page);
        }
        return;
    }

    // Remote free is unlikely
    hakmem_tiny_free_remote(page, block);
}
```
#### Step 3.3: Bit-Pack Page Flags
**File:** `core/hakmem_tiny.h`
```c
typedef union hakmem_page_flags_u {
    uint8_t combined;  // For fast check
    struct {
        uint8_t is_full          : 1;
        uint8_t has_remote_frees : 1;
        uint8_t is_retired       : 1;
        uint8_t unused           : 5;
    } bits;
} hakmem_page_flags_t;

typedef struct hakmem_tiny_page_s {
    // ... other fields
    hakmem_page_flags_t flags;
    // ...
} hakmem_tiny_page_t;
```
**Usage:**
```c
// Single comparison instead of multiple
if (hakmem_likely(page->flags.combined == 0)) {
    // Fast path: not full, no remote frees, not retired
    // ... 3-instruction free
}
```
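A quick sanity check that the union behaves as intended: setting any single flag makes `combined` nonzero, so the one-byte compare is equivalent to testing all three flags. The helper name below is illustrative:

```c
#include <stdint.h>

typedef union hakmem_page_flags_u {
    uint8_t combined;
    struct {
        uint8_t is_full          : 1;
        uint8_t has_remote_frees : 1;
        uint8_t is_retired       : 1;
        uint8_t unused           : 5;
    } bits;
} hakmem_page_flags_t;

// One load + one compare replaces three separate flag tests.
static inline int page_flags_all_clear(const hakmem_page_flags_t* f) {
    return f->combined == 0;
}
```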
### Validation
**Benchmark command:**
```bash
./bench_random_mixed_hakx
```
**Expected output:**
```
After Phase 2: 21.50-23.00 M ops/sec
After Phase 3: 23.00-24.50 M ops/sec (+5-8% additional)
```
**Key metrics:**
1. Branch misprediction rate (should decrease)
2. Instruction count (should decrease slightly)
3. Code size (should decrease due to better branch layout)
---
## Testing Strategy
### Unit Tests
**File:** `test_hakmem_phases.c`
```c
// Phase 1: Direct cache correctness
void test_direct_cache() {
    hakmem_tiny_heap_t* heap = hakmem_tiny_heap_create();

    // Allocate various sizes
    void* p8  = hakmem_malloc(8);
    void* p16 = hakmem_malloc(16);
    void* p32 = hakmem_malloc(32);

    // Verify direct cache is populated
    assert(heap->pages_direct[1] != NULL);  // 8 bytes
    assert(heap->pages_direct[2] != NULL);  // 16 bytes
    assert(heap->pages_direct[4] != NULL);  // 32 bytes

    // Free and verify cache is updated
    hakmem_free(p8);
    assert(heap->pages_direct[1]->free != NULL);

    // Clean up remaining allocations
    hakmem_free(p16);
    hakmem_free(p32);
    hakmem_tiny_heap_destroy(heap);
}

// Phase 2: Dual free lists
void test_dual_free_lists() {
    hakmem_tiny_heap_t* heap = hakmem_tiny_heap_create();

    void* p = hakmem_malloc(64);
    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(p);

    // Local free goes to local_free
    hakmem_free(p);
    assert(page->local_free != NULL);
    assert(page->free == NULL || page->free != p);

    // Allocating again triggers migration
    void* p2 = hakmem_malloc(64);
    assert(page->local_free == NULL);  // Migrated

    hakmem_free(p2);
    hakmem_tiny_heap_destroy(heap);
}

// Phase 3: Branch hints (no functional change)
void test_branch_hints() {
    // Just verify compilation and no regression
    for (int i = 0; i < 10000; i++) {
        void* p = hakmem_malloc(64);
        hakmem_free(p);
    }
}
```
### Benchmark Suite
**Run after each phase:**
```bash
# Core benchmark
./bench_random_mixed_hakx
# Stress tests
./bench_mid_large_hakx
./bench_tiny_hot_hakx
./bench_fragment_stress_hakx
# Multi-threaded
./bench_mid_large_mt_hakx
```
### Validation Checklist
**Phase 1:**
- [ ] Direct cache correctly populated
- [ ] Cache hit rate > 95% for small allocations
- [ ] Performance gain: 15-20%
- [ ] No memory leaks
- [ ] All existing tests pass
**Phase 2:**
- [ ] Local frees go to local_free
- [ ] Remote frees go to thread_free
- [ ] Migration works correctly
- [ ] Atomic operation count reduced by 80%+
- [ ] Performance gain: 10-15% additional
- [ ] Thread-safety maintained
- [ ] All existing tests pass
**Phase 3:**
- [ ] Branch hints compile correctly
- [ ] Bit-packed flags work as expected
- [ ] Performance gain: 5-8% additional
- [ ] Code size reduced or unchanged
- [ ] All existing tests pass
---
## Rollback Plan
### Phase 1 Rollback
If Phase 1 doesn't meet targets:
```c
// #define HAKMEM_USE_DIRECT_CACHE 1  // Comment out

void* hakmem_malloc(size_t size) {
#ifdef HAKMEM_USE_DIRECT_CACHE
    return hakmem_tiny_malloc_direct(tls_heap, size);
#else
    return hakmem_tiny_malloc_generic(tls_heap, size);  // Old path
#endif
}
```
```
### Phase 2 Rollback
If Phase 2 causes issues:
```c
// Revert to single free list
typedef struct hakmem_tiny_page_s {
#ifdef HAKMEM_USE_DUAL_LISTS
    hakmem_block_t* free;
    hakmem_block_t* local_free;
    _Atomic(uintptr_t) thread_free;
#else
    hakmem_block_t* free_list;  // Old single list
#endif
    // ...
} hakmem_tiny_page_t;
```
---
## Success Criteria
### Minimum Acceptable Performance
- **Phase 1:** +10% (18.18 M ops/sec)
- **Phase 2:** +20% cumulative (19.84 M ops/sec)
- **Phase 3:** +35% cumulative (22.32 M ops/sec)
### Target Performance
- **Phase 1:** +15% (19.01 M ops/sec)
- **Phase 2:** +27% cumulative (21.00 M ops/sec)
- **Phase 3:** +40% cumulative (23.14 M ops/sec)
### Stretch Goal
- **Phase 3:** +45% cumulative (24.00 M ops/sec) - **Match mimalloc!**
---
## Timeline
### Conservative Estimate
- **Week 1:** Phase 1 implementation + validation
- **Week 2:** Phase 2 implementation
- **Week 3:** Phase 2 validation + debugging
- **Week 4:** Phase 3 implementation + final validation
**Total: 4 weeks**
### Aggressive Estimate
- **Day 1-2:** Phase 1 implementation + validation
- **Day 3-6:** Phase 2 implementation + validation
- **Day 7-8:** Phase 3 implementation + validation
**Total: 8 days**
---
## Risk Mitigation
### Technical Risks
1. **Cache coherency issues** (Phase 2)
- Mitigation: Extensive multi-threaded testing
- Fallback: Keep atomic operations on critical path
2. **Memory overhead** (Phase 1)
- Mitigation: Monitor RSS increase
- Fallback: Reduce HAKMEM_DIRECT_PAGES to 65 (caps cached sizes at 512 bytes)
3. **Correctness bugs** (Phase 2)
- Mitigation: Extensive unit tests, ASAN/TSAN builds
- Fallback: Revert to single free list
### Performance Risks
1. **Phase 1 underperforms** (<10%)
- Action: Profile cache hit rate
- Fix: Adjust cache update logic
2. **Phase 2 adds latency** (cache bouncing)
- Action: Profile cache misses
- Fix: Adjust migration threshold
3. **Phase 3 no improvement** (compiler already optimized)
- Action: Check assembly output
- Fix: Skip phase or use PGO
---
## Monitoring
### Key Metrics to Track
1. **Operations/sec** (primary metric)
2. **Latency percentiles** (p50, p95, p99)
3. **Memory usage** (RSS)
4. **Cache miss rate**
5. **Branch misprediction rate**
6. **Atomic operation count**
### Profiling Commands
```bash
# Basic profiling
perf record -e cycles,instructions,cache-misses ./bench_random_mixed_hakx
perf report
# Cache analysis
perf record -e cache-references,cache-misses,L1-dcache-load-misses ./bench_random_mixed_hakx
# Branch analysis
perf record -e branch-misses,branches ./bench_random_mixed_hakx
# ASAN/TSAN builds
CC=clang CFLAGS="-fsanitize=address" make
CC=clang CFLAGS="-fsanitize=thread" make
```
---
## Next Steps
1. **Implement Phase 1** (direct page cache)
2. **Benchmark and validate** (target: +15-20%)
3. **If successful:** Proceed to Phase 2
4. **If not:** Debug and iterate
**Start now with Phase 1 - it's low-risk and high-reward!**