Implementation:
- Register tiny_fast_print_stats() via atexit() on first refill
- Forward declaration so the stats function can be referenced before its definition
- Enable with HAKMEM_TINY_FAST_STATS=1
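A minimal sketch of how this registration could look (the counter names, the drain hook, and the exact env check are assumptions, not the actual hakmem code):
```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical global counters; the notes report per-thread figures, so the
   real code likely keeps thread-local counters and aggregates them. */
static atomic_ulong g_refills;
static atomic_ulong g_drains;

static void tiny_fast_print_stats(void);   /* forward declaration for ordering */

static void tiny_fast_on_refill(void)
{
    /* Register the atexit() handler once, on the first refill,
       and only when HAKMEM_TINY_FAST_STATS=1 is set. */
    static atomic_flag registered = ATOMIC_FLAG_INIT;
    const char *env = getenv("HAKMEM_TINY_FAST_STATS");
    if (env && env[0] == '1' && !atomic_flag_test_and_set(&registered))
        atexit(tiny_fast_print_stats);
    atomic_fetch_add(&g_refills, 1);
}

static void tiny_fast_on_drain(void)
{
    atomic_fetch_add(&g_drains, 1);
}

static void tiny_fast_print_stats(void)
{
    fprintf(stderr, "tiny_fast: refills=%lu drains=%lu\n",
            (unsigned long)atomic_load(&g_refills),
            (unsigned long)atomic_load(&g_drains));
}
```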
Usage:
```bash
HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
Results (threads=4, Throughput=1.377M ops/s):
- refills = 1,285 per thread
- drains = 0 (cache never full)
- Total ops = 2.754M (2 seconds)
- Refill allocations = 20,560 (1,285 refills × 16 objects per refill)
- **Refill rate: 0.75%** (20,560 / 2.754M total ops)
- **Cache hit rate: 99.25%** ✨
Analysis:
Contrary to expectations, refill cost is NOT the bottleneck:
- Current refill cost: 1,285 × 1,600 cycles = 2.056M cycles
- Even if batched down to 200 cycles per refill: saves only 1.799M cycles (1,285 × 1,400)
- But refills are only 0.75% of operations!
True bottleneck must be:
1. Fast path itself (99.25% of allocations; see the fast-path sketch after this list)
- malloc() overhead despite reordering
- size_to_class mapping (even LUT has cost)
- TLS cache access pattern
2. free() path (not optimized yet)
3. Cross-thread synchronization (22.8% of cycles in profiling)
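To make the fast-path suspects concrete, here is a hedged sketch of what such a path typically looks like; the 128-byte limit, 8 classes, 16-slot cache, and all names are assumptions, not the real hakmem structures:
```c
#include <stddef.h>
#include <stdint.h>

#define TINY_CLASSES 8

/* Size-to-class LUT, filled at init; even this lookup costs an extra
   (likely L1-resident) load on every allocation. */
static uint8_t size_to_class_lut[129];

typedef struct {
    void    *slots[TINY_CLASSES][16];   /* cached free objects per class */
    uint32_t count[TINY_CLASSES];       /* fill level per class          */
} tiny_tls_cache;

static _Thread_local tiny_tls_cache t_cache;

/* Stub for the slow path; the real refill grabs 16 objects (~1,600 cycles). */
static void *tiny_refill_and_alloc(size_t cls)
{
    (void)cls;
    return NULL;
}

static inline void *tiny_malloc_fast(size_t size)
{
    if (size > 128)
        return NULL;                        /* defer to the general allocator */

    size_t cls = size_to_class_lut[size];   /* LUT cost paid on 100% of calls */
    uint32_t n = t_cache.count[cls];
    if (n != 0) {                           /* the 99.25% path                */
        t_cache.count[cls] = n - 1;
        return t_cache.slots[cls][n - 1];   /* TLS access pattern under study */
    }
    return tiny_refill_and_alloc(cls);      /* the 0.75% path                 */
}
```
Every cost in this sketch (LUT load, TLS base lookup, the two cache-array accesses) is paid on the 99.25% path, which is why shaving it matters more than the refill branch.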
Key insight:
Phase 1 (entry point optimization) and Phase 3 (batch refill)
won't help much because:
- Entry point: the fast path already hits 99.25%
- Batch refill: Only affects 0.75% of operations
Next steps:
1. Add malloc/free counters to identify which is slower (see the timing sketch after this list)
2. Consider Phase 2 (Dual Free Lists) for locality
3. Investigate free() path optimization
4. May need to profile TLS cache access patterns
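For step 1, a rough sketch of per-thread cycle counters (x86 rdtsc via GCC/Clang; the wrapped entry points below are placeholders, not the real hakmem functions):
```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc(); x86-only */

/* Placeholders for the real fast-path entry points; swap in the actual
   functions when wiring this into the allocator. */
static void *tiny_malloc_fast(size_t size) { return malloc(size); }
static void  tiny_free_fast(void *p)       { free(p); }

/* Per-thread cycle accounting for the malloc vs. free comparison. */
static _Thread_local uint64_t t_malloc_cycles, t_malloc_calls;
static _Thread_local uint64_t t_free_cycles,   t_free_calls;

static inline void *tiny_malloc_timed(size_t size)
{
    uint64_t t0 = __rdtsc();
    void *p = tiny_malloc_fast(size);
    t_malloc_cycles += __rdtsc() - t0;
    t_malloc_calls++;
    return p;
}

static inline void tiny_free_timed(void *p)
{
    uint64_t t0 = __rdtsc();
    tiny_free_fast(p);
    t_free_cycles += __rdtsc() - t0;
    t_free_calls++;
}

static void tiny_report_thread_costs(void)
{
    fprintf(stderr, "malloc: %.1f cyc/call  free: %.1f cyc/call\n",
            t_malloc_calls ? (double)t_malloc_cycles / t_malloc_calls : 0.0,
            t_free_calls   ? (double)t_free_cycles   / t_free_calls   : 0.0);
}
```
Comparing the two per-thread averages should show directly whether the malloc() or the free() path dominates.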
Related: mimalloc research shows dual free lists reduce cache
line bouncing - this may be more important than refill cost.
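A hedged sketch of the dual-free-list idea (mimalloc keeps its free/local_free/thread_free lists per page; the per-thread-cache layout and names below are simplifications for illustration):
```c
#include <stdatomic.h>
#include <stddef.h>
#include <pthread.h>

typedef struct tiny_block { struct tiny_block *next; } tiny_block;

typedef struct {
    tiny_block *alloc_list;              /* popped by malloc(); single owner */
    tiny_block *local_free;              /* same-thread frees; no atomics    */
    _Atomic(tiny_block *) remote_free;   /* the only list other threads ever
                                            write, so the hot lists above
                                            never bounce between cores       */
    pthread_t owner;
} tiny_dual_list;

static void tiny_dual_free(tiny_dual_list *l, tiny_block *b)
{
    if (pthread_equal(pthread_self(), l->owner)) {
        b->next = l->local_free;         /* hot path: plain stores only */
        l->local_free = b;
    } else {
        tiny_block *head = atomic_load_explicit(&l->remote_free,
                                                memory_order_relaxed);
        do {                             /* lock-free push onto remote list */
            b->next = head;
        } while (!atomic_compare_exchange_weak_explicit(
                     &l->remote_free, &head, b,
                     memory_order_release, memory_order_relaxed));
    }
}

static tiny_block *tiny_dual_alloc(tiny_dual_list *l)
{
    if (!l->alloc_list) {                /* only then touch the other lists */
        l->alloc_list = l->local_free;   /* swap in same-thread frees first */
        l->local_free = NULL;
        if (!l->alloc_list)              /* fall back to draining remote    */
            l->alloc_list = atomic_exchange_explicit(&l->remote_free, NULL,
                                                     memory_order_acquire);
    }
    tiny_block *b = l->alloc_list;
    if (b)
        l->alloc_list = b->next;
    return b;
}
```
The point of the split is that remote frees only ever touch remote_free, so the cache lines backing alloc_list and local_free stay exclusive to the owning thread.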