Implementation:
Ultra-lightweight CPU cycle profiling using the RDTSC instruction (~10 cycles overhead).
Changes:
1. Added rdtsc() inline function for x86_64 CPU cycle counter
2. Instrumented tiny_fast_alloc(), tiny_fast_free(), tiny_fast_refill()
3. Track malloc, free, refill, and migration cycles separately
4. Profile output via HAKMEM_TINY_PROFILE=1 environment variable
5. Renamed variables to avoid conflict with core/hakmem.c globals
Files modified:
- core/tiny_fastcache.h: rdtsc() (sketched below), profile helpers, extern declarations
- core/tiny_fastcache.c: counter definitions, print_profile() output
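For reference, a minimal sketch of the cycle counter and one per-event accumulator (the rdtsc() name matches this commit; the accumulator and helper names below are hypothetical, and the real code keeps separate counters for malloc/free/refill/migration):

```c
#include <stdint.h>

/* Read the x86_64 time-stamp counter (~10 cycles of overhead). */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical per-thread accumulators for one event class (malloc). */
static __thread uint64_t g_malloc_cycles = 0;
static __thread uint64_t g_malloc_count  = 0;

static inline void profile_malloc(uint64_t start, uint64_t end) {
    g_malloc_cycles += end - start;
    g_malloc_count  += 1;
}
```

At exit, avg_cycles is simply g_malloc_cycles / g_malloc_count, which is what the avg_cycles column in the output below corresponds to.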
Usage:
```bash
HAKMEM_TINY_PROFILE=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
Results (Larson 4 threads, 1.637M ops/s):
```
[MALLOC] count=20,480, avg_cycles=2,476
[REFILL] count=1,285, avg_cycles=38,412 ← 15.5x slower!
[FREE] (no data - not called via fast path)
```
Critical discoveries:
1. **REFILL is the bottleneck:**
- Average 38,412 cycles per refill (15.5x slower than malloc)
- Refill accounts for 1,285 × 38,412 ≈ 49.4M cycles
- Despite Phase 3 batch optimization, still extremely slow
- Calling hak_tiny_alloc() 16 times per refill has massive overhead
2. **MALLOC is 24x slower than expected:**
- Average 2,476 cycles (expected ~100 cycles for tcache)
- Even cache hits are slow
- Profiling overhead is only ~10 cycles, so real cost is ~2,466 cycles
- Something fundamentally wrong with fast path
3. **Only 2.5% of allocations use fast path:**
- Total operations: 1.637M × 2s = 3.27M ops
- Tiny fast alloc: 20,480 × 4 threads = 81,920 ops
- Coverage: 81,920 / 3,270,000 = **2.5%**
- **97.5% of allocations bypass tiny_fast_alloc entirely!**
4. **FREE is not instrumented:**
- No free() calls captured by profiling
- hakmem.c's free() likely takes a different path
- Not calling tiny_fast_free() at all
Root cause analysis:
The 4x performance gap (vs system malloc) is NOT due to:
- Entry point overhead (Phase 1) ❌
- Dual free lists (Phase 2) ❌
- Batch refill efficiency (Phase 3) ❌
The REAL problems:
1. **Tiny fast path is barely used** (2.5% coverage)
2. **Refill is catastrophically slow** (38K cycles)
3. **Even cache hits are 24x too slow** (2.5K cycles)
4. **Free path is completely bypassed**
Why system malloc is 4x faster:
- System tcache serves a malloc in ~100 cycles
- System tcache achieves a ~90% hit rate (vs our 2.5% fast-path usage)
- System malloc and free are symmetrically optimized (we only optimize malloc)
Next steps:
1. Investigate why 97.5% bypass tiny_fast_alloc
2. Profile the slow path (hak_alloc_at) that handles 97.5%
3. Understand why even cache hits take 2,476 cycles
4. Instrument the free() path to see where frees go (see the counter sketch after this list)
5. May need to optimize slow path instead of fast path
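As a concrete starting point for steps 1 and 4, a minimal sketch of per-path counters (only tiny_fast_alloc(), hak_alloc_at(), and hak_free_at() are existing names; the counters and dump function are hypothetical):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-thread counters: bump g_alloc_fast_hits on a
 * tiny_fast_alloc() hit, g_alloc_slow_hits just before falling back to
 * hak_alloc_at(), and g_free_slow_hits where free() reaches hak_free_at(). */
static __thread uint64_t g_alloc_fast_hits = 0;
static __thread uint64_t g_alloc_slow_hits = 0;
static __thread uint64_t g_free_slow_hits  = 0;

/* Dump at exit (e.g. via atexit()) to see the fast/slow split directly. */
static void path_counters_dump(void) {
    fprintf(stderr, "[paths] alloc fast=%llu slow=%llu, free slow=%llu\n",
            (unsigned long long)g_alloc_fast_hits,
            (unsigned long long)g_alloc_slow_hits,
            (unsigned long long)g_free_slow_hits);
}
```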
This profiling reveals we've been optimizing the wrong thing.
The "fast path" is neither fast (2.5K cycles) nor widely used (2.5% coverage).
Implementation:
- Register tiny_fast_print_stats() via atexit() on first refill
- Add a forward declaration of tiny_fast_print_stats() to satisfy function ordering
- Enable with HAKMEM_TINY_FAST_STATS=1
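A minimal sketch of the registration logic (the tiny_fast_print_stats / tiny_fast_refill names follow this commit; the one-shot guard flag and stub bodies are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>

/* Forward declaration so the refill function can reference the printer
 * defined further down in the file. */
static void tiny_fast_print_stats(void);

static int g_stats_registered = 0;   /* hypothetical one-shot guard */

static void tiny_fast_refill(void) {
    /* Register the stats printer exactly once, on the first refill,
     * and only when HAKMEM_TINY_FAST_STATS=1 is set. */
    if (!g_stats_registered) {
        const char* e = getenv("HAKMEM_TINY_FAST_STATS");
        if (e && *e == '1')
            atexit(tiny_fast_print_stats);
        g_stats_registered = 1;
    }
    /* ... actual batch refill of the thread-local cache ... */
}

static void tiny_fast_print_stats(void) {
    /* The real printer reports refill/drain counters; stubbed here. */
    fprintf(stderr, "[tiny_fast] refills/drains printed at exit\n");
}
```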
Usage:
```bash
HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
Results (threads=4, Throughput=1.377M ops/s):
- refills = 1,285 per thread
- drains = 0 (cache never full)
- Total ops = 2.754M (2 seconds)
- Refill allocations = 20,560 (1,285 × 16)
- **Refill rate: 0.75%**
- **Cache hit rate: 99.25%** ✨
Analysis:
Contrary to expectations, refill cost is NOT the bottleneck:
- Current refill cost: 1,285 × 1,600 cycles = 2.056M cycles
- Even if batched down to 200 cycles per refill: saves only 1,285 × 1,400 = 1.799M cycles
- But refills are only 0.75% of operations!
True bottleneck must be:
1. Fast path itself (99.25% of allocations)
- malloc() overhead despite reordering
- size_to_class mapping (even a LUT has a cost; see the sketch after this list)
- TLS cache access pattern
2. free() path (not optimized yet)
3. Cross-thread synchronization (22.8% of cycles in profiling)
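For context on the size_to_class point above, a sketch of a LUT-based mapping (the 128 B cutoff and 8-byte class granularity are assumptions; TINY_FAST_THRESHOLD is the name used elsewhere in this log):

```c
#include <stddef.h>
#include <stdint.h>

#define TINY_FAST_THRESHOLD 128          /* assumed tiny cutoff */

static uint8_t g_size_class_lut[TINY_FAST_THRESHOLD + 1];

/* Called once at init: class index for every size 0..128 (8-byte steps). */
static void size_class_lut_init(void) {
    for (size_t s = 0; s <= TINY_FAST_THRESHOLD; s++)
        g_size_class_lut[s] = (uint8_t)((s + 7) / 8);
}

static inline int size_to_class(size_t size) {
    /* One compare plus one indexed byte load: cheap, but still a data
     * dependency on the critical path of every allocation. */
    return (size <= TINY_FAST_THRESHOLD) ? (int)g_size_class_lut[size] : -1;
}
```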
Key insight:
Phase 1 (entry point optimization) and Phase 3 (batch refill)
won't help much because:
- Entry point: the fast path already achieves a 99.25% cache hit rate
- Batch refill: only affects the 0.75% of operations that trigger a refill
Next steps:
1. Add malloc/free counters to identify which is slower
2. Consider Phase 2 (Dual Free Lists) for locality
3. Investigate free() path optimization
4. May need to profile TLS cache access patterns
Related: mimalloc research shows dual free lists reduce cache
line bouncing - this may be more important than refill cost.
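For reference, a conceptual sketch of the mimalloc-style dual free list idea (all names and the struct layout are illustrative, not HAKMEM's actual types):

```c
#include <stdatomic.h>

typedef struct block { struct block* next; } block_t;

typedef struct tiny_page {
    block_t*          local_free;    /* frees from the owning thread: no atomics  */
    _Atomic(block_t*) remote_free;   /* frees from other threads: lock-free push  */
} tiny_page_t;

/* Owner-thread free: touches only cache lines the owner already owns. */
static void page_free_local(tiny_page_t* pg, block_t* b) {
    b->next = pg->local_free;
    pg->local_free = b;
}

/* Cross-thread free: contends only on the remote list head, so the hot
 * local list never bounces between cores. The owner drains remote_free
 * into local_free occasionally (e.g. when local_free runs empty). */
static void page_free_remote(tiny_page_t* pg, block_t* b) {
    block_t* head = atomic_load_explicit(&pg->remote_free, memory_order_relaxed);
    do {
        b->next = head;
    } while (!atomic_compare_exchange_weak_explicit(
                 &pg->remote_free, &head, b,
                 memory_order_release, memory_order_relaxed));
}
```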
Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by mimalloc/tcache entry point design.
Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move the fast path to the very top of malloc(), add branch-prediction hints
Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
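A sketch of the reordered entry point (g_initialized, TINY_FAST_THRESHOLD, and tiny_fast_alloc() are names from this log; the extern declarations, the 128 B threshold value, and malloc_slow_path() are illustrative stand-ins):

```c
#include <stddef.h>

extern int   g_initialized;
extern void* tiny_fast_alloc(size_t size);
extern void* malloc_slow_path(size_t size);      /* original guarded path */
#define TINY_FAST_THRESHOLD 128                  /* assumed tiny cutoff */

void* malloc(size_t size) {
    /* Branches 1+2: size in the tiny range and allocator ready. */
    if (__builtin_expect(size <= TINY_FAST_THRESHOLD && g_initialized, 1)) {
        /* Branch 3: thread-local cache hit. */
        void* p = tiny_fast_alloc(size);
        if (__builtin_expect(p != NULL, 1))
            return p;
    }
    /* Recursion guards, LD_PRELOAD bootstrap, and other rare cases
     * now live behind the fast-path miss. */
    return malloc_slow_path(size);
}
```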
Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)
Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)
Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
1. tiny_fast_alloc() internals (size-to-class, cache access)
2. Refill cost (1,600 cycles for 16 individual calls)
3. Need Batch Refill optimization (Phase 3) as priority
Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- May need combination of multiple optimizations for breakthrough
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)
Implementation:
- malloc(): Fast Path now executes with 3 branches total
- Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
- Branch 3: tiny_fast_alloc() cache hit check
- Slow Path: All guard checks moved after Fast Path miss
- free(): Fast Path with 1-2 branches
- Branch 1: g_initialized check
- Direct to hak_free_at() on normal case
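A sketch of the reordered free() entry point (g_initialized and hak_free_at() are names from this log; hak_free_at()'s exact signature and free_slow_path() are assumptions for illustration):

```c
#include <stddef.h>

extern int  g_initialized;
extern void hak_free_at(void* ptr);      /* signature assumed */
extern void free_slow_path(void* ptr);   /* hypothetical rare-case handler */

void free(void* ptr) {
    /* Branch 1: allocator initialized; the normal case goes straight
     * to hak_free_at() with no further guard checks. */
    if (__builtin_expect(g_initialized != 0, 1)) {
        hak_free_at(ptr);
        return;
    }
    /* Rare cases: frees before init, LD_PRELOAD bootstrap, etc. */
    free_slow_path(ptr);
}
```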
Performance Results (Larson benchmark, size=8-128B):
Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After: 0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓
Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After: 1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗
Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet target (+200-400%), further optimization needed
Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)