Implementation:
- Register tiny_fast_print_stats() via atexit() on the first refill (see the sketch below)
- Add a forward declaration to satisfy function ordering
- Enable with HAKMEM_TINY_FAST_STATS=1
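A minimal sketch of the counter plumbing, assuming hypothetical counter names and a tiny_fast_count_refill() hook on the refill path (the real fields and call sites in core/hakmem.c may differ):
```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counters; the real names in core/hakmem.c may differ. */
static atomic_ulong g_refills;
static atomic_ulong g_drains;
static atomic_flag g_stats_registered = ATOMIC_FLAG_INIT;

static void tiny_fast_print_stats(void) {
    fprintf(stderr, "tiny_fast: refills=%lu drains=%lu\n",
            atomic_load(&g_refills), atomic_load(&g_drains));
}

/* Called from the refill slow path: registers the atexit() hook exactly
 * once, on the first refill, and only when the env var is set. */
static void tiny_fast_count_refill(void) {
    static int enabled = -1;
    if (enabled < 0) {
        const char *e = getenv("HAKMEM_TINY_FAST_STATS");
        enabled = (e && e[0] == '1');
    }
    if (!enabled)
        return;
    if (!atomic_flag_test_and_set(&g_stats_registered))
        atexit(tiny_fast_print_stats);
    atomic_fetch_add(&g_refills, 1);
}
```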
Usage:
```bash
HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
Results (threads=4, Throughput=1.377M ops/s):
- refills = 1,285 per thread
- drains = 0 (cache never full)
- Total ops = 2.754M (2 seconds)
- Refill allocations = 20,560 (1,285 × 16)
- **Refill rate: 0.75%** (20,560 / 2.754M ops)
- **Cache hit rate: 99.25%** ✨
Analysis:
Contrary to expectations, refill cost is NOT the bottleneck:
- Current refill cost: 1,285 × 1,600 cycles = 2.056M cycles
- Even if batched (200 cycles): saves 1.799M cycles
- But refills are only 0.75% of operations!
The true bottleneck must be one of:
1. The fast path itself (99.25% of allocations; sketched below)
   - malloc() overhead despite reordering
   - size_to_class mapping (even a LUT lookup has cost)
   - TLS cache access pattern
2. The free() path (not optimized yet)
3. Cross-thread synchronization (22.8% of cycles in profiling)
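For reference, the fast path under suspicion has roughly this shape (a simplified sketch; tiny_fast_alloc(), the LUT, and the cache layout are paraphrased from the description, not the exact core/hakmem.c code):
```c
#include <stddef.h>

#define TINY_CLASSES 8  /* assumed class count for the 8-128B range */

/* LUT filled at init: byte size -> class index (cost: one table load). */
static unsigned char g_size_to_class[129];

typedef struct { void *slots[16]; int count; } tiny_cache;
static __thread tiny_cache t_cache[TINY_CLASSES];  /* TLS access cost */

/* Caller has already checked size <= 128 (TINY_FAST_THRESHOLD). */
static inline void *tiny_fast_alloc(size_t size) {
    unsigned cls = g_size_to_class[size];   /* LUT load                */
    tiny_cache *c = &t_cache[cls];          /* TLS base + index        */
    if (__builtin_expect(c->count > 0, 1))  /* hit ~99.25% of the time */
        return c->slots[--c->count];
    return NULL;                            /* miss: refill path       */
}
```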
Key insight:
Phase 1 (entry point optimization) and Phase 3 (batch refill)
won't help much because:
- Entry point: the fast path already hits 99.25%
- Batch refill: only affects 0.75% of operations
Next steps:
1. Add malloc/free counters to identify which is slower
2. Consider Phase 2 (Dual Free Lists) for locality
3. Investigate free() path optimization
4. May need to profile TLS cache access patterns
Related: mimalloc research shows that dual free lists reduce cache-line
bouncing; this may be more important than refill cost. A sketch follows.
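A minimal sketch of the dual-free-list idea from the mimalloc design (names are illustrative, not HAKMEM's): same-thread frees go to a plain local list, cross-thread frees are pushed onto a separate atomic list, so the hot list never bounces between cores:
```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct block { struct block *next; } block_t;

/* Illustrative page header after the mimalloc design; not HAKMEM's layout. */
typedef struct page {
    block_t *local_free;            /* touched only by the owning thread */
    _Atomic(block_t *) thread_free; /* cross-thread frees land here      */
} page_t;

/* Owning thread: cheap, no atomics on the hot path. */
static void free_local(page_t *p, block_t *b) {
    b->next = p->local_free;
    p->local_free = b;
}

/* Other threads: lock-free push onto the shared list. */
static void free_remote(page_t *p, block_t *b) {
    block_t *head = atomic_load_explicit(&p->thread_free, memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &p->thread_free, &head, b,
                 memory_order_release, memory_order_relaxed));
}

/* Allocation slow path: when local_free runs dry, claim the whole remote
 * list in one atomic exchange, amortizing the atomic traffic. */
static block_t *collect_thread_free(page_t *p) {
    return atomic_exchange_explicit(&p->thread_free, NULL,
                                    memory_order_acquire);
}
```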
Implementation: Move the HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by the mimalloc/tcache entry-point design.
Strategy:
- tcache has 0 branches before its fast path
- mimalloc has 1-2 branches before its fast path
- The old HAKMEM had 8+ branches before its fast path
- Phase 1: move the fast path to the top of malloc(), add branch-prediction hints
Changes in core/hakmem.c (sketched after this list):
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
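A sketch of the reordered entry point (the slow-path function name is a placeholder; g_initialized, TINY_FAST_THRESHOLD, and tiny_fast_alloc() come from the notes above, but their exact signatures are assumed):
```c
#include <stddef.h>

#define TINY_FAST_THRESHOLD 128          /* assumed tiny-path bound       */
extern int g_initialized;                /* set once the allocator is up  */
extern void *tiny_fast_alloc(size_t);    /* TLS cache pop; NULL on miss   */
extern void *hakmem_malloc_slow(size_t); /* placeholder: recursion guard,
                                            LD_PRELOAD bootstrap, init... */

void *malloc(size_t size) {
    /* Fast path first: 3 branches between entry and a cache-hit return. */
    if (__builtin_expect(g_initialized, 1) &&                /* branch 1 */
        __builtin_expect(size <= TINY_FAST_THRESHOLD, 1)) {  /* branch 2 */
        void *p = tiny_fast_alloc(size);                     /* branch 3 */
        if (__builtin_expect(p != NULL, 1))
            return p;
    }
    /* Slow path: every guard that used to run first now runs only on miss. */
    return hakmem_malloc_slow(size);
}
```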
Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)
Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (only +12%)
- MT: 1.856M → 1.453M ops/s (-22% regression!)
Analysis:
- Similar pattern to the previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- The true bottleneck may be:
1. tiny_fast_alloc() internals (size-to-class, cache access)
2. Refill cost (1,600 cycles for 16 individual calls)
3. Batch Refill optimization (Phase 3) may need priority (see the sketch below)
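For reference, batch refill would replace 16 individual slow-path calls with one carve of a contiguous run, roughly like this (a sketch under assumed names; hakmem_carve_run() and the batch constant are placeholders, not existing API):
```c
#include <stddef.h>

#define REFILL_BATCH 16

/* Placeholder: one slow-path call returning a contiguous run of
 * REFILL_BATCH blocks of block_size bytes (e.g. from a fresh chunk). */
extern void *hakmem_carve_run(size_t block_size, size_t count);

/* One carve plus a pointer loop instead of 16 separate calls: per the
 * estimate above, ~1,600 cycles per refill would drop to ~200. */
static int tiny_fast_refill(void **slots, size_t block_size) {
    char *run = hakmem_carve_run(block_size, REFILL_BATCH);
    if (!run)
        return 0;
    for (int i = 0; i < REFILL_BATCH; i++)
        slots[i] = run + (size_t)i * block_size;
    return REFILL_BATCH;
}
```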
Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- A combination of multiple optimizations may be needed for a breakthrough
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)
Implementation:
- malloc(): Fast Path now executes with 3 branches total
  - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
  - Branch 3: tiny_fast_alloc() cache hit check
  - Slow Path: all guard checks moved after a Fast Path miss
- free(): Fast Path with 1-2 branches (sketched below)
  - Branch 1: g_initialized check
  - Dispatches directly to hak_free_at() in the normal case
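A sketch of that free() shape (hak_free_at() is named above but its signature is assumed here; the slow-path function is a placeholder):
```c
#include <stddef.h>

extern int g_initialized;
extern void hak_free_at(void *);      /* normal-case dispatch; signature assumed */
extern void hakmem_free_slow(void *); /* placeholder: pre-init/guarded path      */

void free(void *ptr) {
    /* Branch 1: initialized check; hak_free_at() is assumed to
     * tolerate NULL, as free() must. */
    if (__builtin_expect(g_initialized, 1)) {
        hak_free_at(ptr);             /* direct dispatch on the normal case */
        return;
    }
    hakmem_free_slow(ptr);            /* rare: free() before init completes */
}
```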
Performance Results (Larson benchmark, size=8-128B):
Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After: 0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓
Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After: 1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗
Analysis:
- The ST improvement shows the Fast Path optimization works
- The MT regression suggests contention or cache issues
- Did not meet the target (+200-400%); further optimization is needed
Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)