Tiny Pool Optimization Strategy
Date: 2025-10-26
Goal: Reduce the 5.9x performance gap with mimalloc for small allocations (8-64 bytes)
Current: 83 ns/op (hakmem) vs 14 ns/op (mimalloc)
Target: 50-55 ns/op (35-40% improvement, ~3.5x gap remaining)
Status: Ready for implementation
Executive Summary
The 5.9x Gap: Root Cause Analysis
Based on comprehensive mimalloc analysis (see ANALYSIS_SUMMARY.md), the performance gap stems from architectural differences, not bugs:
| Component | mimalloc | hakmem | Impact |
|---|---|---|---|
| Primary data structure | LIFO free list (intrusive) | Bitmap + magazine | +20 ns |
| State location | Thread-local only | Thread-local + global | +10 ns |
| Cache validation | Implicit (per-thread pages) | Explicit (ownership tracking) | +5 ns |
| Statistics overhead | Batched/deferred | Per-allocation sampled | +10 ns |
| Control flow | 1 branch | 3-4 branches | +5 ns |
Total measured gap: 69 ns (83 − 14). The per-component estimates above are approximate and account for roughly 50 ns of it.
Why We Can't Match mimalloc's 14 ns
Irreducible architectural gaps (10-13 ns total):
- Bitmap lookup [5 ns]: Find-first-set + bit extraction vs single pointer read
- Magazine validation [3-5 ns]: Ownership tracking for diagnostics vs implicit ownership
- Statistics integration [2-3 ns]: Per-class stats require bookkeeping vs atomic counters
What We Can Realistically Achieve
Addressable overhead (30-35 ns):
- P0: Lookup table classification → +3-5 ns
- P1: Remove stats from hot path → +10-15 ns
- P2: Inline fast path → +5-10 ns
- P3: Branch elimination → +10-15 ns
Expected result: 83 ns → 50-55 ns (35-40% improvement)
Strategic Approach: Three-Phase Implementation
Phase I: Quick Wins (P0 + P1) - 90 minutes
Target: 83 ns → 65-70 ns (~20% improvement)
ROI: Highest impact per time invested
Why Start Here:
- P0 and P1 are independent (no code conflicts)
- Combined gain: 13-20 ns
- Low risk (simple, localized changes)
- Immediate validation possible
Phase II: Fast Path Optimization (P2) - 60 minutes
Target: 65-70 ns → 55-60 ns (~30% cumulative improvement)
ROI: High impact, moderate complexity
Why Second:
- Depends on P1 completion (stats removed from hot path)
- Creates foundation for P3 (branch elimination)
- More complex but well-documented in roadmap
Phase III: Advanced Optimization (P3) - 90 minutes
Target: 55-60 ns → 50-55 ns (~40% cumulative improvement)
ROI: Moderate, requires careful testing
Why Last:
- Most complex (branchless logic)
- Highest risk of subtle bugs
- Marginal improvement (diminishing returns)
Detailed Implementation Plan
P0: Lookup Table Size Classification
File: hakmem_tiny.h
Effort: 30 minutes
Gain: +3-5 ns
Risk: Very Low
Current Implementation (If-Chain)
static inline int hak_tiny_size_to_class(size_t size) {
if (size <= 8) return 0;
if (size <= 16) return 1;
if (size <= 32) return 2;
if (size <= 64) return 3;
if (size <= 128) return 4;
if (size <= 256) return 5;
if (size <= 512) return 6;
if (size <= 1024) return 7;
return -1;
}
// Branches: up to 8 (worst case), ~4 taken on average; misprediction-prone on mixed size streams
// Cost: 5-8 ns
Target Implementation (LUT)
// Add after line 36 in hakmem_tiny.h (after g_tiny_blocks_per_slab)
static const uint8_t g_tiny_size_to_class[1025] = {
// 0-8: class 0
0,0,0,0,0,0,0,0,0,
// 9-16: class 1
1,1,1,1,1,1,1,1,
// 17-32: class 2
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
// 33-64: class 3
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
// 65-128: class 4
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
// 129-256: class 5
// ... (continue pattern for all 1025 entries)
// 513-1024: class 7
};
static inline int hak_tiny_size_to_class_fast(size_t size) {
// Fast path: direct table lookup
if (__builtin_expect(size <= 1024, 1)) {
return g_tiny_size_to_class[size];
}
// Slow path: out of Tiny Pool range
return -1;
}
// Branches: 1 (predictable)
// Cost: 0.5-1 ns (L1 cache hit)
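If hand-writing 1025 entries proves error-prone, the table can instead be filled once at startup from the class boundaries. A minimal sketch, with the trade-off that the table can then no longer live in read-only memory (const):
static uint8_t g_tiny_size_to_class[1025];
static void hak_tiny_size_lut_init(void) {
    // Inclusive upper bound of each size class, matching the if-chain above
    static const size_t bounds[8] = {8, 16, 32, 64, 128, 256, 512, 1024};
    int cls = 0;
    for (size_t s = 0; s <= 1024; s++) {
        while (s > bounds[cls]) cls++;  // advance to the first class that fits s
        g_tiny_size_to_class[s] = (uint8_t)cls;
    }
}
// Call once from the pool's existing init path (function name is project-specific)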
Implementation Steps
- Add the g_tiny_size_to_class[1025] table to hakmem_tiny.h (after line 36)
- Replace hak_tiny_size_to_class() with hak_tiny_size_to_class_fast()
- Update all call sites in hakmem_tiny.c
- Compile and verify correctness
Testing
# Build and verify correctness
make clean && make -j4
# Unit test: verify classification accuracy
./test_tiny_size_class # (create if needed)
# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 78-80ns
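The test_tiny_size_class binary referenced above does not exist yet. A minimal version could exhaustively compare the LUT against the original if-chain (this assumes the old function is kept, or temporarily restored, alongside the new one):
// test_tiny_size_class.c -- exhaustive check of the LUT (hypothetical test file)
#include <assert.h>
#include <stdio.h>
#include "hakmem_tiny.h"
int main(void) {
    for (size_t s = 0; s <= 1024; s++) {
        // The LUT must agree with the reference if-chain for every in-range size
        assert(hak_tiny_size_to_class_fast(s) == hak_tiny_size_to_class(s));
    }
    assert(hak_tiny_size_to_class_fast(1025) == -1); // out-of-range sizes rejected
    printf("size-class LUT: all 1025 entries verified\n");
    return 0;
}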
Rollback
git diff hakmem_tiny.h # Verify changes
git checkout hakmem_tiny.h # If regression detected
P1: Remove Statistics from Critical Path
File: hakmem_tiny.c
Effort: 60 minutes
Gain: +10-15 ns
Risk: Low (statistics semantics preserved)
Current Implementation (Per-Allocation Sampling)
// hakmem_tiny.c:656-659 (hot path)
void* p = mag->items[--mag->top].ptr;
// Sampled counter update (XOR-based PRNG)
t_tiny_rng ^= t_tiny_rng << 13; // 3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u<<g_tiny_count_sample_exp)-1u)) == 0u) {
g_tiny_pool.alloc_count[class_idx]++; // Atomic increment (contention!)
}
return p;
// Total stats overhead: 10-15 ns
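For reference, the snippets in this document assume a TLS magazine shaped roughly as below; the real definition lives in hakmem_tiny.h and may differ (TINY_MAG_CAPACITY is a placeholder name):
typedef struct { void* ptr; } TinyMagItem;
typedef struct {
    TinyMagItem items[TINY_MAG_CAPACITY]; // LIFO stack of cached free blocks
    int top;                              // count of cached blocks (0 = empty)
} TinyTLSMag;
static __thread TinyTLSMag g_tls_mags[TINY_NUM_CLASSES];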
Target Implementation (Batched Lazy Update)
// New TLS counter (add to hakmem_tiny.c TLS section)
static __thread uint64_t t_tiny_alloc_counter[TINY_NUM_CLASSES] = {0};
// Hot path (remove all stats code)
void* p = mag->items[--mag->top].ptr;
// NO STATS HERE!
return p;
// Stats overhead: 0 ns
// Cold path: lazy accumulation (called during magazine refill)
// Each refill hands the caller a batch of blocks, so credit the whole batch
// at once; the global counter is touched only once >= 100 counts are pending.
static void hak_tiny_lazy_counter_update(int class_idx, uint64_t batch) {
    t_tiny_alloc_counter[class_idx] += batch;
    if (t_tiny_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += t_tiny_alloc_counter[class_idx];
        t_tiny_alloc_counter[class_idx] = 0;
    }
}
// Call from slow path (magazine refill function)
void hak_tiny_refill_magazine(...) {
    // ... existing refill logic ...
    hak_tiny_lazy_counter_update(class_idx, num_refilled); // blocks just moved into the magazine
}
Implementation Steps
- Add the t_tiny_alloc_counter[TINY_NUM_CLASSES] TLS array
- Remove the XOR PRNG code from the hak_tiny_alloc() hot path (lines 656-659)
- Add the hak_tiny_lazy_counter_update() function
- Call the lazy update in the slow path (magazine refill, slab allocation)
- Update hak_tiny_get_stats() to flush pending TLS counters (see the sketch below)
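A sketch of the step-5 flush. TLS is visible only to its owning thread, so a stats read can fold in at most the calling thread's pending counts; other threads' residue (fewer than 100 per class) stays pending until their next refill. This assumes g_tiny_pool.alloc_count updates are atomic, or are already serialized in the stats path:
static void hak_tiny_flush_tls_counters(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (t_tiny_alloc_counter[c] != 0) {
            g_tiny_pool.alloc_count[c] += t_tiny_alloc_counter[c];
            t_tiny_alloc_counter[c] = 0;
        }
    }
}
// At the top of the existing stats entry point:
//   hak_tiny_flush_tls_counters();  // fold in this thread's pending counts
//   ... existing snapshot logic ...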
Statistics Accuracy Trade-off
- Before: Sampled (roughly 1 in 16 allocations counted)
- After: Batched (every allocation counted in TLS, flushed to the global counter in batches of at least 100)
- Impact: Both approximate the instantaneous global total, but batching is faster and more accurate: it has no sampling variance, and the read-time error is bounded by the fewer-than-100 counts still pending per thread
Testing
# Build
make clean && make -j4
# Functional test: verify stats still work
./test_mf2 # Should show non-zero alloc_count
# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 78-80ns → 63-70ns
Rollback
git diff hakmem_tiny.c
git checkout hakmem_tiny.c # If stats broken
P2: Inline Fast Path
Files: hakmem_tiny_alloc_fast.h (new), hakmem.h, hakmem_tiny.c
Effort: 60 minutes
Gain: +5-10 ns
Risk: Moderate (function splitting, ABI considerations)
Current Implementation (Unified Function)
// hakmem_tiny.c: Single function for fast + slow path
void* hak_tiny_alloc(size_t size) {
// Size classification
int class_idx = hak_tiny_size_to_class(size);
// Magazine check
TinyTLSMag* mag = &g_tls_mags[class_idx];
if (mag->top > 0) {
return mag->items[--mag->top].ptr; // Fast path
}
// TLS active slab A
TinySlab* slab_a = g_tls_active_slab_a[class_idx];
if (slab_a && slab_a->free_count > 0) {
// ... bitmap scan ...
}
// TLS active slab B
// ...
// Global pool (slow path with locks)
// ...
}
// Problem: Compiler can't inline due to size/complexity
// Call overhead: 5-10 ns
Target Implementation (Split Fast/Slow)
// New file: hakmem_tiny_alloc_fast.h
#ifndef HAKMEM_TINY_ALLOC_FAST_H
#define HAKMEM_TINY_ALLOC_FAST_H
#include "hakmem_tiny.h"
// External slow path declaration
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);
// Inlined fast path (magazine-only)
__attribute__((always_inline))
static inline void* hak_tiny_alloc_hot(size_t size) {
// Fast bounds check
if (__builtin_expect(size > TINY_MAX_SIZE, 0)) {
return NULL; // Not Tiny Pool range
}
// O(1) size classification (uses P0 LUT)
int class_idx = g_tiny_size_to_class[size];
// TLS magazine check (fast path only)
TinyTLSMag* mag = &g_tls_mags[class_idx];
if (__builtin_expect(mag->top > 0, 1)) {
return mag->items[--mag->top].ptr;
}
// Fall through to slow path
return hak_tiny_alloc_slow(size, class_idx);
}
#endif
// hakmem_tiny.c: Slow path only
void* hak_tiny_alloc_slow(size_t size, int class_idx) {
// TLS active slab A
TinySlab* slab_a = g_tls_active_slab_a[class_idx];
if (slab_a && slab_a->free_count > 0) {
// ... bitmap scan ...
}
// TLS active slab B
// ...
// Global pool (locks)
// ...
}
// hakmem.h: Update public API
#include "hakmem_tiny_alloc_fast.h"
// Use inlined fast path for tiny allocations
static inline void* hak_alloc_at(size_t size, void* site) {
if (size <= TINY_MAX_SIZE) {
return hak_tiny_alloc_hot(size);
}
// ... L2/L2.5 pools ...
}
Benefits
- Zero call overhead: Compiler inlines directly into caller
- Better register allocation: Hot path uses minimal stack
- Improved branch prediction: Fast path separate from cold code
Implementation Steps
- Create hakmem_tiny_alloc_fast.h
- Move the magazine-only logic to hak_tiny_alloc_hot()
- Rename hak_tiny_alloc() → hak_tiny_alloc_slow()
- Update hakmem.h to include the new header
- Update all call sites
Testing
# Build
make clean && make -j4
# Functional test: verify all paths work
./test_mf2
./test_mf2_warmup
# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 63-70ns → 55-65ns
Rollback
git rm hakmem_tiny_alloc_fast.h
git checkout hakmem.h hakmem_tiny.c
Testing Strategy
Phase-by-Phase Validation
After P0 (Lookup Table)
# Correctness
./test_tiny_size_class # Verify all 1025 entries correct
# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 83ns → 78-80ns
# No regressions on other sizes
./bench_allocators_hakmem --scenario json # L2.5 64KB
./bench_allocators_hakmem --scenario mir # L2.5 256KB
After P1 (Remove Stats)
# Correctness: Stats still work
./test_mf2
# hak_tiny_get_stats() output should show non-zero counts
# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 78-80ns → 63-70ns
# Multi-threaded: Verify TLS counters work
./bench_tiny_mt --iterations=100000 --threads=4
# Should see reasonable alloc_count
After P2 (Inline Fast Path)
# Correctness: All paths work
./test_mf2
./test_mf2_warmup
# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 63-70ns → 55-65ns
# Code inspection: Verify inlining
objdump -d bench_tiny | grep -A 20 'hak_alloc_at'
# Should show inline expansion, not call instruction
Comprehensive Validation
# After all P0-P2 complete
make clean && make -j4
# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1 # Single-threaded
./bench_tiny_mt --iterations=100000 --threads=4 # Multi-threaded
./bench_allocators_hakmem --scenario json # L2.5 64KB
./bench_allocators_hakmem --scenario mir # L2.5 256KB
# Git commit
git add .
git commit -m "Tiny Pool optimization P0-P2: 83ns → 55-65ns
- P0: Lookup table size classification (+3-5ns)
- P1: Remove statistics from hot path (+10-15ns)
- P2: Inline fast path (+5-10ns)
Cumulative improvement: ~30-35%
Remaining gap to mimalloc (14ns): 3.5-4x (irreducible)"
Risk Mitigation
Low-Risk Approach
- Incremental changes: One optimization at a time
- Validation at each step: Benchmark + test after each P0, P1, P2
- Git commits: Separate commit for each optimization
- Rollback ready: git checkout if regression detected
Potential Issues and Mitigation
Issue 1: LUT Size (1 KB)
- Risk: L1 cache pressure
- Mitigation: 1 KB fits in L1 (32 KB typical), negligible impact
- Fallback: Use a smaller LUT (65 entries covering sizes ≤ 64) and compute larger classes by rounding up to the next power of two (see the sketch below)
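A sketch of that fallback; g_tiny_size_to_class_64 is a hypothetical 65-entry table covering sizes 0-64, and classes 4-7 are derived from the bit width of size - 1:
static inline int hak_tiny_size_to_class_small(size_t size) {
    if (size <= 64) return g_tiny_size_to_class_64[size]; // hot sizes: 65-entry LUT
    if (size > 1024) return -1;                           // out of Tiny Pool range
    // 65..128 -> 4, 129..256 -> 5, 257..512 -> 6, 513..1024 -> 7
    return 61 - __builtin_clzll((unsigned long long)(size - 1));
}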
Issue 2: Inlining Bloat
- Risk: Code size increase from inlining
- Mitigation: Only inline magazine path (~10 instructions)
- Fallback: Use __attribute__((hot)) instead of always_inline
Issue 3: Stats Accuracy
- Risk: Batched counters less accurate than sampled
- Mitigation: Actually MORE accurate (no sampling variance)
- Fallback: Keep TLS counters but flush more frequently
Success Criteria
Performance Targets
| Optimization | Current | Target | Achieved? |
|---|---|---|---|
| Baseline | 83 ns/op | - | ✓ (established) |
| P0 Complete | 83 ns | 78-80 ns | [ ] |
| P1 Complete | 78-80 ns | 63-70 ns | [ ] |
| P2 Complete | 63-70 ns | 55-65 ns | [ ] |
| P0-P2 Combined | 83 ns | 55-65 ns | [ ] |
Stretch Goal (with the optional P3): 50-55 ns/op (~40% improvement, ~3.5x gap to mimalloc)
Functional Requirements
- All existing tests pass
- No regressions on L2/L2.5 pools
- Statistics still functional (batched counters work)
- Multi-threaded safety maintained
- Zero hard page faults (memory reuse preserved)
Code Quality
- Clean compilation with -Wall -Wextra
- No compiler warnings
- Documented trade-offs (stats batching, LUT size)
- Git commit messages reference this strategy doc
Timeline and Effort
Phase I: Quick Wins (P0 + P1)
- P0 Implementation: 30 minutes
- P0 Testing: 15 minutes
- P1 Implementation: 60 minutes
- P1 Testing: 15 minutes
- Total Phase I: 2 hours
Phase II: Fast Path (P2)
- P2 Implementation: 60 minutes
- P2 Testing: 15 minutes
- Total Phase II: 1.25 hours
Phase III: Validation
- Comprehensive testing: 30 minutes
- Documentation update: 15 minutes
- Git commit: 15 minutes
- Total Phase III: 1 hour
Grand Total: 4.25 hours
Realistic estimate with contingency: 5-6 hours
Beyond P0-P2: Future Optimization (Optional)
P3: Branch Elimination (Not Included in This Strategy)
Effort: 90 minutes
Gain: +10-15 ns
Risk: High (complex; subtle bugs possible)
Why Deferred:
- Diminishing returns (30% improvement already achieved)
- Complexity vs gain trade-off unfavorable
- Can be addressed in future iteration
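For a flavor of what P3 entails, here is an illustrative fragment (not part of this strategy) showing a branchless variant of the magazine pop inside hak_tiny_alloc_hot; the speculative read is exactly the kind of subtlety that makes P3 risky:
// Illustrative only -- assumes the TinyTLSMag layout sketched earlier.
int has = (mag->top > 0);       // 0 or 1, no conditional jump needed
int idx = mag->top - has;       // clamps at 0 when the magazine is empty
void* p = mag->items[idx].ptr;  // speculative read; items[0] must be valid memory
mag->top = idx;
if (has) return p;              // ideally the only remaining branch
return hak_tiny_alloc_slow(size, class_idx);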
Alternative: NEXT_STEPS.md Approach
After P0-P2 complete, consider NEXT_STEPS.md optimizations:
- MPSC opportunistic drain during alloc slow path
- Immediate full→free slab promotion after drain
- Adaptive magazine capacity per site
These may yield better ROI than P3.
Comparison to Alternatives
Alternative 1: Implement NEXT_STEPS.md First
Pros: Novel features (adaptive magazines, ELO learning)
Cons: Higher complexity, uncertain performance gain
Why Not: The mimalloc analysis shows statistics overhead costs 10-15 ns; addressing a known bottleneck first is safer
Alternative 2: Copy mimalloc's Free List Architecture
Pros: Would match mimalloc's 14 ns performance
Cons: Abandons the bitmap approach; loses diagnostics/ownership tracking
Why Not: Violates hakmem's research goals (flexible architecture)
Alternative 3: Do Nothing
Pros: Zero effort
Cons: The 5.9x gap remains; nothing is learned
Why Not: P0-P2 are low-risk, high-ROI quick wins
Conclusion
This strategy prioritizes high-impact, low-risk optimizations to achieve a 30-40% performance improvement in 4-6 hours of work.
Key Principles:
- Incremental validation: Test after each step
- Focus on hot path: Remove overhead from critical path
- Preserve semantics: No behavior changes, only optimization
- Accept trade-offs: 50-55ns is excellent; chasing 14ns abandons research goals
Next Steps:
- Review and approve this strategy
- Implement P0 (Lookup Table) - 30 minutes
- Validate P0 - 15 minutes
- Implement P1 (Remove Stats) - 60 minutes
- Validate P1 - 15 minutes
- Implement P2 (Inline Fast Path) - 60 minutes
- Validate P2 - 15 minutes
- Comprehensive testing and documentation - 1 hour
Expected Outcome: Tiny Pool performance improves from 83 ns to 55-65 ns after P0-P2, and to 50-55 ns with the optional P3, closing roughly 40% of the gap with mimalloc while preserving hakmem's research architecture.
Last Updated: 2025-10-26
Status: Ready for implementation
Approval: Pending user confirmation
Implementation Start: After approval