hakmem/docs/design/MIMALLOC_IMPLEMENTATION_ROADMAP.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

# mimalloc Optimization Implementation Roadmap
## Closing the 47% Performance Gap
- **Current:** 16.53 M ops/sec
- **Target:** 24.00 M ops/sec (+45%)
- **Strategy:** Three-phase implementation with incremental validation
---
## Phase 1: Direct Page Cache ⚡ **HIGH PRIORITY**
- **Target:** +2.5-3.3 M ops/sec (15-20% improvement)
- **Effort:** 1-2 days
- **Risk:** Low
- **Dependencies:** None
### Implementation Steps
#### Step 1.1: Add Direct Cache to Heap Structure
**File:** `core/hakmem_tiny.h`
```c
#define HAKMEM_DIRECT_PAGES 129  // Covers sizes up to 1024 bytes (index = size / 8, range 0..128)

typedef struct hakmem_tiny_heap_s {
    // Existing fields...
    hakmem_tiny_class_t size_classes[32];

    // NEW: Direct page cache
    hakmem_tiny_page_t* pages_direct[HAKMEM_DIRECT_PAGES];

    // Existing fields...
} hakmem_tiny_heap_t;
```
**Memory cost:** 129 pointers × 8 bytes = 1,032 bytes per heap (acceptable)
#### Step 1.2: Initialize Direct Cache
**File:** `core/hakmem_tiny.c`
```c
void hakmem_tiny_heap_init(hakmem_tiny_heap_t* heap) {
    // Existing initialization...

    // Initialize direct cache
    for (size_t i = 0; i < HAKMEM_DIRECT_PAGES; i++) {
        heap->pages_direct[i] = NULL;
    }

    // Populate from existing size classes
    hakmem_tiny_rebuild_direct_cache(heap);
}
```
#### Step 1.3: Cache Update Function
**File:** `core/hakmem_tiny.c`
```c
static inline void hakmem_tiny_update_direct_cache(
    hakmem_tiny_heap_t* heap,
    hakmem_tiny_page_t* page,
    size_t block_size)
{
    if (block_size > 1024) return;  // Only cache small sizes

    size_t idx = (block_size + 7) / 8;  // Round up to 8-byte granularity
    if (idx < HAKMEM_DIRECT_PAGES) {
        heap->pages_direct[idx] = page;
    }
}
// Call this whenever a page is added/removed from a size class
```
#### Step 1.4: Fast Path Using Direct Cache
**File:** `core/hakmem_tiny.c`
```c
static inline void* hakmem_tiny_malloc_direct(
    hakmem_tiny_heap_t* heap,
    size_t size)
{
    // Fast path: direct cache lookup
    if (size <= 1024) {
        size_t idx = (size + 7) / 8;
        hakmem_tiny_page_t* page = heap->pages_direct[idx];
        if (page && page->free_list) {
            // Pop from free list
            hakmem_block_t* block = page->free_list;
            page->free_list = block->next;
            page->used++;
            return block;
        }
    }
    // Fallback to existing generic path
    return hakmem_tiny_malloc_generic(heap, size);
}

// Update main malloc to call this:
void* hakmem_malloc(size_t size) {
    if (size <= HAKMEM_TINY_MAX) {
        return hakmem_tiny_malloc_direct(tls_heap, size);
    }
    // ... existing large allocation path
}
```
### Validation
**Benchmark command:**
```bash
./bench_random_mixed_hakx
```
**Expected output:**
```
Before: 16.53 M ops/sec
After: 19.00-20.00 M ops/sec (+15-20%)
```
**If target not met:**
1. Profile with `perf record -e cycles,cache-misses ./bench_random_mixed_hakx`
2. Check direct cache hit rate
3. Verify cache is being updated correctly
4. Check for branch mispredictions
---
## Phase 2: Dual Free Lists 🚀 **MEDIUM PRIORITY**
- **Target:** +2.0-3.3 M ops/sec additional (10-15% improvement)
- **Effort:** 3-5 days
- **Risk:** Medium (structural changes)
- **Dependencies:** Phase 1 complete
### Implementation Steps
#### Step 2.1: Modify Page Structure
**File:** `core/hakmem_tiny.h`
```c
typedef struct hakmem_tiny_page_s {
    // Existing fields...
    uint32_t block_size;
    uint32_t capacity;

    // OLD: Single free list
    // hakmem_block_t* free_list;

    // NEW: Three separate free lists
    hakmem_block_t* free;            // Hot allocation path
    hakmem_block_t* local_free;      // Local frees (no atomics!)
    _Atomic(uintptr_t) thread_free;  // Remote frees + flags (lower 2 bits)

    uint32_t used;
    // ... other fields
} hakmem_tiny_page_t;
```
**Note:** `thread_free` packs the list head pointer and two flag bits into a single word. Because blocks are at least 4-byte aligned, the low 2 bits of any block address are always zero and are free to carry flags.
#### Step 2.2: Update Free Path
**File:** `core/hakmem_tiny.c`
```c
void hakmem_tiny_free(void* ptr) {
    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(ptr);
    hakmem_block_t* block = (hakmem_block_t*)ptr;

    // Fast path: the calling thread owns this page
    if (hakmem_tiny_is_local_page(page)) {
        // Push onto local_free (no atomics!)
        block->next = page->local_free;
        page->local_free = block;
        page->used--;

        // Retire page if fully free
        if (page->used == 0) {
            hakmem_tiny_page_retire(page);
        }
        return;
    }

    // Slow path: remote free (atomic)
    hakmem_tiny_free_remote(page, block);
}
```
#### Step 2.3: Migration Logic
**File:** `core/hakmem_tiny.c`
```c
static void hakmem_tiny_collect_frees(hakmem_tiny_page_t* page) {
    // Step 1: Collect remote frees (atomic)
    uintptr_t tfree = atomic_exchange(&page->thread_free, 0);
    hakmem_block_t* remote_list = (hakmem_block_t*)(tfree & ~(uintptr_t)0x3);
    if (remote_list) {
        // Append to local_free
        hakmem_block_t* tail = remote_list;
        while (tail->next) tail = tail->next;
        tail->next = page->local_free;
        page->local_free = remote_list;
    }

    // Step 2: Migrate local_free to free
    if (page->local_free && !page->free) {
        page->free = page->local_free;
        page->local_free = NULL;
    }
}

// Call this in the allocation path when the free list is empty
void* hakmem_tiny_malloc_direct(hakmem_tiny_heap_t* heap, size_t size) {
    // ... direct cache lookup
    hakmem_tiny_page_t* page = heap->pages_direct[idx];
    if (page) {
        // Try to allocate from the free list
        hakmem_block_t* block = page->free;
        if (block) {
            page->free = block->next;
            page->used++;
            return block;
        }

        // Free list empty - collect and retry
        hakmem_tiny_collect_frees(page);
        block = page->free;
        if (block) {
            page->free = block->next;
            page->used++;
            return block;
        }
    }
    // Fallback
    return hakmem_tiny_malloc_generic(heap, size);
}
```
### Validation
**Benchmark command:**
```bash
./bench_random_mixed_hakx
```
**Expected output:**
```
After Phase 1: 19.00-20.00 M ops/sec
After Phase 2: 21.50-23.00 M ops/sec (+10-15% additional)
```
**Key metrics to track:**
1. Atomic operation count (should drop significantly)
2. Cache miss rate (should improve)
3. Free path latency (should be faster)
**If target not met:**
1. Profile atomic operations: `perf record -e cpu-cycles,instructions,cache-references,cache-misses ./bench_random_mixed_hakx`
2. Check remote free percentage
3. Verify migration is happening correctly
4. Analyze cache line bouncing
---
## Phase 3: Branch Hints + Bit-Packed Flags 🎯 **LOW PRIORITY**
- **Target:** +1.0-2.0 M ops/sec additional (5-8% improvement)
- **Effort:** 1-2 days
- **Risk:** Low
- **Dependencies:** Phase 2 complete
### Implementation Steps
#### Step 3.1: Add Branch Hint Macros
**File:** `core/hakmem_config.h`
```c
#if defined(__GNUC__) || defined(__clang__)
#define hakmem_likely(x) __builtin_expect(!!(x), 1)
#define hakmem_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define hakmem_likely(x) (x)
#define hakmem_unlikely(x) (x)
#endif
```
#### Step 3.2: Add Branch Hints to Hot Path
**File:** `core/hakmem_tiny.c`
```c
void* hakmem_tiny_malloc_direct(hakmem_tiny_heap_t* heap, size_t size) {
    // Fast path hint
    if (hakmem_likely(size <= 1024)) {
        size_t idx = (size + 7) / 8;
        hakmem_tiny_page_t* page = heap->pages_direct[idx];
        if (hakmem_likely(page != NULL)) {
            hakmem_block_t* block = page->free;
            if (hakmem_likely(block != NULL)) {
                page->free = block->next;
                page->used++;
                return block;
            }

            // Slow path within the fast path
            hakmem_tiny_collect_frees(page);
            block = page->free;
            if (hakmem_likely(block != NULL)) {
                page->free = block->next;
                page->used++;
                return block;
            }
        }
    }
    // Fallback (unlikely)
    return hakmem_tiny_malloc_generic(heap, size);
}

void hakmem_tiny_free(void* ptr) {
    if (hakmem_unlikely(ptr == NULL)) return;

    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(ptr);
    hakmem_block_t* block = (hakmem_block_t*)ptr;

    // Local free is the common case
    if (hakmem_likely(hakmem_tiny_is_local_page(page))) {
        block->next = page->local_free;
        page->local_free = block;
        page->used--;

        // A page is rarely fully free
        if (hakmem_unlikely(page->used == 0)) {
            hakmem_tiny_page_retire(page);
        }
        return;
    }

    // Remote free is the uncommon case
    hakmem_tiny_free_remote(page, block);
}
```
#### Step 3.3: Bit-Pack Page Flags
**File:** `core/hakmem_tiny.h`
```c
typedef union hakmem_page_flags_u {
    uint8_t combined;  // For a single fast check
    struct {
        uint8_t is_full          : 1;
        uint8_t has_remote_frees : 1;
        uint8_t is_retired       : 1;
        uint8_t unused           : 5;
    } bits;
} hakmem_page_flags_t;

typedef struct hakmem_tiny_page_s {
    // ... other fields
    hakmem_page_flags_t flags;
    // ...
} hakmem_tiny_page_t;
```
**Usage:**
```c
// Single comparison instead of multiple flag checks
if (hakmem_likely(page->flags.combined == 0)) {
    // Fast path: not full, no remote frees, not retired
    // ... 3-instruction free
}
```
### Validation
**Benchmark command:**
```bash
./bench_random_mixed_hakx
```
**Expected output:**
```
After Phase 2: 21.50-23.00 M ops/sec
After Phase 3: 23.00-24.50 M ops/sec (+5-8% additional)
```
**Key metrics:**
1. Branch misprediction rate (should decrease)
2. Instruction count (should decrease slightly)
3. Code size (should decrease due to better branch layout)
---
## Testing Strategy
### Unit Tests
**File:** `test_hakmem_phases.c`
```c
// Phase 1: Direct cache correctness
void test_direct_cache() {
    hakmem_tiny_heap_t* heap = hakmem_tiny_heap_create();

    // Allocate various sizes
    void* p8  = hakmem_malloc(8);
    void* p16 = hakmem_malloc(16);
    void* p32 = hakmem_malloc(32);
    (void)p16; (void)p32;

    // Verify the direct cache is populated
    assert(heap->pages_direct[1] != NULL);  // 8 bytes  -> idx 1
    assert(heap->pages_direct[2] != NULL);  // 16 bytes -> idx 2
    assert(heap->pages_direct[4] != NULL);  // 32 bytes -> idx 4

    // Free and verify the cache is updated
    hakmem_free(p8);
    assert(heap->pages_direct[1]->free != NULL);

    hakmem_tiny_heap_destroy(heap);
}

// Phase 2: Dual free lists
void test_dual_free_lists() {
    hakmem_tiny_heap_t* heap = hakmem_tiny_heap_create();
    void* p = hakmem_malloc(64);
    hakmem_tiny_page_t* page = hakmem_tiny_ptr_to_page(p);

    // A local free goes to local_free
    hakmem_free(p);
    assert(page->local_free != NULL);
    assert(page->free == NULL || page->free != p);

    // Allocating again triggers migration
    void* p2 = hakmem_malloc(64);
    (void)p2;
    assert(page->local_free == NULL);  // Migrated

    hakmem_tiny_heap_destroy(heap);
}

// Phase 3: Branch hints (no functional change)
void test_branch_hints() {
    // Just verify compilation and no regression
    for (int i = 0; i < 10000; i++) {
        void* p = hakmem_malloc(64);
        hakmem_free(p);
    }
}
```
### Benchmark Suite
**Run after each phase:**
```bash
# Core benchmark
./bench_random_mixed_hakx
# Stress tests
./bench_mid_large_hakx
./bench_tiny_hot_hakx
./bench_fragment_stress_hakx
# Multi-threaded
./bench_mid_large_mt_hakx
```
### Validation Checklist
**Phase 1:**
- [ ] Direct cache correctly populated
- [ ] Cache hit rate > 95% for small allocations
- [ ] Performance gain: 15-20%
- [ ] No memory leaks
- [ ] All existing tests pass
**Phase 2:**
- [ ] Local frees go to local_free
- [ ] Remote frees go to thread_free
- [ ] Migration works correctly
- [ ] Atomic operation count reduced by 80%+
- [ ] Performance gain: 10-15% additional
- [ ] Thread-safety maintained
- [ ] All existing tests pass
**Phase 3:**
- [ ] Branch hints compile correctly
- [ ] Bit-packed flags work as expected
- [ ] Performance gain: 5-8% additional
- [ ] Code size reduced or unchanged
- [ ] All existing tests pass
---
## Rollback Plan
### Phase 1 Rollback
If Phase 1 doesn't meet targets:
```c
// #define HAKMEM_USE_DIRECT_CACHE 1  // Comment out to disable

void* hakmem_malloc(size_t size) {
#ifdef HAKMEM_USE_DIRECT_CACHE
    return hakmem_tiny_malloc_direct(tls_heap, size);
#else
    return hakmem_tiny_malloc_generic(tls_heap, size);  // Old path
#endif
}
```
### Phase 2 Rollback
If Phase 2 causes issues:
```c
// Revert to a single free list
typedef struct hakmem_tiny_page_s {
#ifdef HAKMEM_USE_DUAL_LISTS
    hakmem_block_t* free;
    hakmem_block_t* local_free;
    _Atomic(uintptr_t) thread_free;
#else
    hakmem_block_t* free_list;  // Old single list
#endif
    // ...
} hakmem_tiny_page_t;
```
---
## Success Criteria
### Minimum Acceptable Performance
- **Phase 1:** +10% (18.18 M ops/sec)
- **Phase 2:** +20% cumulative (19.84 M ops/sec)
- **Phase 3:** +35% cumulative (22.32 M ops/sec)
### Target Performance
- **Phase 1:** +15% (19.01 M ops/sec)
- **Phase 2:** +27% cumulative (21.00 M ops/sec)
- **Phase 3:** +40% cumulative (23.14 M ops/sec)
### Stretch Goal
- **Phase 3:** +45% cumulative (24.00 M ops/sec) - **Match mimalloc!**
---
## Timeline
### Conservative Estimate
- **Week 1:** Phase 1 implementation + validation
- **Week 2:** Phase 2 implementation
- **Week 3:** Phase 2 validation + debugging
- **Week 4:** Phase 3 implementation + final validation
**Total: 4 weeks**
### Aggressive Estimate
- **Day 1-2:** Phase 1 implementation + validation
- **Day 3-6:** Phase 2 implementation + validation
- **Day 7-8:** Phase 3 implementation + validation
**Total: 8 days**
---
## Risk Mitigation
### Technical Risks
1. **Cache coherency issues** (Phase 2)
- Mitigation: Extensive multi-threaded testing
- Fallback: Keep atomic operations on critical path
2. **Memory overhead** (Phase 1)
- Mitigation: Monitor RSS increase
- Fallback: Reduce HAKMEM_DIRECT_PAGES to 65 (512 bytes)
3. **Correctness bugs** (Phase 2)
- Mitigation: Extensive unit tests, ASAN/TSAN builds
- Fallback: Revert to single free list
### Performance Risks
1. **Phase 1 underperforms** (<10%)
- Action: Profile cache hit rate
- Fix: Adjust cache update logic
2. **Phase 2 adds latency** (cache bouncing)
- Action: Profile cache misses
- Fix: Adjust migration threshold
3. **Phase 3 no improvement** (compiler already optimized)
- Action: Check assembly output
- Fix: Skip phase or use PGO
---
## Monitoring
### Key Metrics to Track
1. **Operations/sec** (primary metric)
2. **Latency percentiles** (p50, p95, p99)
3. **Memory usage** (RSS)
4. **Cache miss rate**
5. **Branch misprediction rate**
6. **Atomic operation count**
### Profiling Commands
```bash
# Basic profiling
perf record -e cycles,instructions,cache-misses ./bench_random_mixed_hakx
perf report
# Cache analysis
perf record -e cache-references,cache-misses,L1-dcache-load-misses ./bench_random_mixed_hakx
# Branch analysis
perf record -e branch-misses,branches ./bench_random_mixed_hakx
# ASAN/TSAN builds
CC=clang CFLAGS="-fsanitize=address" make
CC=clang CFLAGS="-fsanitize=thread" make
```
---
## Next Steps
1. **Implement Phase 1** (direct page cache)
2. **Benchmark and validate** (target: +15-20%)
3. **If successful:** Proceed to Phase 2
4. **If not:** Debug and iterate
**Start now with Phase 1 - it's low-risk and high-reward!**