Commit Graph

4 Commits

872622b78b Phase 6-8: RDTSC cycle profiling - Critical bottleneck discovered!
Implementation:
Ultra-lightweight CPU cycle profiling using the RDTSC instruction (~10 cycles of overhead).

Changes:
1. Added rdtsc() inline function for x86_64 CPU cycle counter
2. Instrumented tiny_fast_alloc(), tiny_fast_free(), tiny_fast_refill()
3. Track malloc, free, refill, and migration cycles separately
4. Profile output via HAKMEM_TINY_PROFILE=1 environment variable
5. Renamed variables to avoid conflict with core/hakmem.c globals
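
A minimal sketch of the mechanism, assuming x86_64 with GCC/Clang inline asm; the wrapper shape and counter names are illustrative, not the exact code in core/tiny_fastcache.h:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

void *tiny_fast_alloc(size_t size);  /* the real fast path; signature assumed */

/* Read the x86_64 time-stamp counter (~10 cycles of overhead). */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Cycle accumulators; the real code tracks malloc, free, refill, and
   migration separately (these names are illustrative). */
static uint64_t g_prof_malloc_cycles, g_prof_malloc_count;

static inline void *tiny_fast_alloc_profiled(size_t size)
{
    uint64_t t0 = rdtsc();
    void *p = tiny_fast_alloc(size);
    g_prof_malloc_cycles += rdtsc() - t0;
    g_prof_malloc_count++;
    return p;
}

/* Output gated by HAKMEM_TINY_PROFILE=1. */
static void print_profile(void)
{
    if (!getenv("HAKMEM_TINY_PROFILE") || !g_prof_malloc_count)
        return;
    fprintf(stderr, "[MALLOC] count=%llu, avg_cycles=%llu\n",
            (unsigned long long)g_prof_malloc_count,
            (unsigned long long)(g_prof_malloc_cycles / g_prof_malloc_count));
}
```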

Files modified:
- core/tiny_fastcache.h: rdtsc(), profile helpers, extern declarations
- core/tiny_fastcache.c: counter definitions, print_profile() output

Usage:
```bash
HAKMEM_TINY_PROFILE=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```

Results (Larson 4 threads, 1.637M ops/s):
```
[MALLOC] count=20,480, avg_cycles=2,476
[REFILL] count=1,285,  avg_cycles=38,412  ← 15.5x slower!
[FREE]   (no data - not called via fast path)
```

Critical discoveries:

1. **REFILL is the bottleneck:**
   - Average 38,412 cycles per refill (15.5x slower than malloc)
   - Refill accounts for 1,285 × 38,412 ≈ 49.4M cycles
   - Despite Phase 3's batch optimization, refill is still extremely slow
   - Calling hak_tiny_alloc() 16 times per refill has massive overhead (see the sketch after this list)

2. **MALLOC is 24x slower than expected:**
   - Average 2,476 cycles (expected ~100 cycles for tcache)
   - Even cache hits are slow
   - Profiling overhead is only ~10 cycles, so real cost is ~2,466 cycles
   - Something is fundamentally wrong with the fast path

3. **Only 2.5% of allocations use fast path:**
   - Total operations: 1.637M ops/s × 2 s = 3.27M ops
   - Tiny fast alloc: 20,480 × 4 threads = 81,920 ops
   - Coverage: 81,920 / 3,270,000 = **2.5%**
   - **97.5% of allocations bypass tiny_fast_alloc entirely!**

4. **FREE is not instrumented:**
   - No free() calls were captured by the profiling
   - hakmem.c's free() likely takes a different path
   - It never reaches tiny_fast_free() at all
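
For concreteness, a hedged sketch of the refill shape implied by discovery 1; hak_tiny_alloc()'s signature and the list seeding are assumptions. Each refill pays the slow path's full cost 16 times:

```c
#include <stddef.h>

#define TINY_REFILL_BATCH 16

void *hak_tiny_alloc(size_t size);   /* slow-path allocator; signature assumed */

typedef struct blk { struct blk *next; } blk_t;

/* Pull a batch of blocks from the slow path to seed the fast cache.
   Each iteration pays hak_tiny_alloc()'s full cost, hence ~38K cycles
   per refill despite the batching. */
static int tiny_fast_refill(blk_t **list, size_t size)
{
    int n = 0;
    for (; n < TINY_REFILL_BATCH; n++) {
        blk_t *b = (blk_t *)hak_tiny_alloc(size);
        if (!b)
            break;
        b->next = *list;             /* push onto the per-class free list */
        *list = b;
    }
    return n;
}
```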

Root cause analysis:

The 4x performance gap (vs system malloc) is NOT due to:
- Entry point overhead (Phase 1) 
- Dual free lists (Phase 2) 
- Batch refill efficiency (Phase 3) 

The REAL problems:
1. **Tiny fast path is barely used** (2.5% coverage)
2. **Refill is catastrophically slow** (38K cycles)
3. **Even cache hits are 24x too slow** (2.5K cycles)
4. **Free path is completely bypassed**

Why system malloc is 4x faster:
- System tcache has a ~100-cycle malloc
- System tcache has a ~90% hit rate (vs our 2.5% usage)
- System malloc/free are symmetric (we optimize only malloc)

Next steps:
1. Investigate why 97.5% bypass tiny_fast_alloc
2. Profile the slow path (hak_alloc_at) that handles 97.5%
3. Understand why even cache hits take 2,476 cycles
4. Instrument free() path to see where frees go
5. May need to optimize slow path instead of fast path

This profiling reveals we've been optimizing the wrong thing.
The "fast path" is neither fast (2.5K cycles) nor used (2.5%).
2025-11-05 05:44:18 +00:00
3429ed4457 Phase 6-7: Dual Free Lists (Phase 2) - Mixed results
Implementation:
Separate alloc/free paths to reduce cache line bouncing (mimalloc's strategy).

Changes:
1. Added g_tiny_fast_free_head[] - separate free staging area
2. Modified tiny_fast_alloc() - lazy migration from free_head
3. Modified tiny_fast_free() - push to free_head (separate cache line)
4. Modified tiny_fast_drain() - drain from free_head

Key design (inspired by mimalloc):
- alloc_head: Hot allocation path (g_tiny_fast_cache)
- free_head: Local frees staging (g_tiny_fast_free_head)
- Migration: Pointer swap when alloc_head empty (zero-cost batching)
- Benefit: alloc/free touch different cache lines → reduce bouncing
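
A minimal sketch of this design, keeping the names from the commit; everything else (types, class count) is illustrative:

```c
#include <stddef.h>

#define TINY_CLASSES 16  /* illustrative class count */

typedef struct tiny_blk { struct tiny_blk *next; } tiny_blk_t;

/* Hot allocation list and separate free staging list; keeping them in
   different TLS slots is what moves alloc and free onto different
   cache lines. */
static __thread tiny_blk_t *g_tiny_fast_cache[TINY_CLASSES];
static __thread tiny_blk_t *g_tiny_fast_free_head[TINY_CLASSES];

static inline void *tiny_fast_alloc(unsigned cls)
{
    tiny_blk_t *b = g_tiny_fast_cache[cls];
    if (!b) {
        /* Lazy migration: adopt all staged frees in one pointer swap. */
        b = g_tiny_fast_free_head[cls];
        if (!b)
            return NULL;              /* caller falls back to refill */
        g_tiny_fast_free_head[cls] = NULL;
    }
    g_tiny_fast_cache[cls] = b->next;
    return b;
}

static inline void tiny_fast_free(void *p, unsigned cls)
{
    tiny_blk_t *b = (tiny_blk_t *)p;
    b->next = g_tiny_fast_free_head[cls];  /* never touches alloc_head */
    g_tiny_fast_free_head[cls] = b;
}
```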

Results (Larson 2s 8-128B 1024):
- Phase 3 baseline: ST 0.474M, MT 1.712M ops/s
- Phase 2: ST 0.600M, MT 1.624M ops/s
- Change: **+27% ST, -5% MT** ⚠️

Analysis - Mixed results:

1. **Single-thread: +27% improvement**
   - Better cache locality (alloc/free separated)
   - No contention; a pure memory-access-pattern win

2. **Multi-thread: -5% regression** (expected +30-50%)
   - Migration-logic overhead (extra branches)
   - Dual arrays increase TLS size → more cache misses?
   - Pointer-swap cost on the migration path
   - May not help Larson's specific access pattern

Comparison to system malloc:
- Current: 1.624M ops/s (MT)
- System: ~7.2M ops/s (MT)
- **Gap: Still 4.4x slower**

Key insights:
1. mimalloc's dual free lists help with *cross-thread* frees
2. Larson may be mostly *same-thread* frees → less benefit
3. Migration overhead > cache line bouncing reduction
4. ST improvement shows memory locality matters
5. Need to profile actual malloc/free patterns in Larson

Why mimalloc succeeds where HAKMEM doesn't:
- mimalloc has a sophisticated remote-free queue (lock-free MPSC; the generic pattern is sketched below)
- HAKMEM's simple dual lists don't handle cross-thread frees well
- Larson's workload may differ from mimalloc's target benchmarks
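
For reference, a minimal sketch of the generic lock-free MPSC remote-free pattern (a Treiber-style push with a single-consumer drain); this is the textbook technique, not mimalloc's actual code:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct blk { struct blk *next; } blk_t;
typedef struct { _Atomic(blk_t *) head; } remote_free_q;

/* Any thread may push a freed block (multi-producer). */
static void remote_free_push(remote_free_q *q, blk_t *b)
{
    blk_t *old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        b->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &q->head, &old, b,
                 memory_order_release, memory_order_relaxed));
}

/* Only the owning thread drains, taking the whole list in one exchange
   (single consumer). */
static blk_t *remote_free_drain(remote_free_q *q)
{
    return atomic_exchange_explicit(&q->head, NULL, memory_order_acquire);
}
```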

Next considerations:
- Verify Larson's same-thread vs cross-thread free ratio
- Consider combining all 3 phases (may have synergy)
- Profile with actual counters (malloc vs free hotspots)
- May need fundamentally different approach
2025-11-05 05:35:06 +00:00
09e1d89e8d Phase 6-4: Larson benchmark optimizations - LUT size-to-class
Two optimizations to improve Larson benchmark performance:

1. **Option A: Fast Path Priority** (core/hakmem.c)
   - Move HAKMEM_TINY_FAST_PATH check before all guard checks
   - Reduce malloc() fast path from 8+ branches to 3 branches
   - Results: +42% ST, -20% MT (mixed results)

2. **LUT Optimization** (core/tiny_fastcache.h)
   - Replace 11-branch linear search with O(1) lookup table
   - Use size_to_class_lut[size >> 3] for fast mapping
   - Results: +24% MT, -24% ST (MT-optimized tradeoff)
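
A hedged sketch combining both changes; TINY_FAST_MAX, the table contents, and the helper signatures are assumptions, not the real definitions in core/hakmem.c or core/tiny_fastcache.h:

```c
#include <stddef.h>
#include <stdint.h>

#define TINY_FAST_MAX 128  /* assumed tiny-size ceiling, per the 8-128B runs */

/* O(1) size→class table, one byte per 8-byte bucket, replacing the old
   11-branch linear search. Values are placeholders for the real layout. */
static const uint8_t size_to_class_lut[(TINY_FAST_MAX >> 3) + 1] = {
    0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};

void *hak_alloc_slow(size_t size);          /* slow path; signature assumed */
void *tiny_fast_alloc_class(unsigned cls);  /* fast path; signature assumed */

/* Option A shape: the fast-path check comes before every guard check,
   cutting the hot path to roughly three branches. */
static inline void *hak_malloc_entry(size_t size)
{
    if (size - 1 < TINY_FAST_MAX) {                    /* branch 1: tiny? */
        unsigned cls = size_to_class_lut[size >> 3];   /* O(1), no search */
        void *p = tiny_fast_alloc_class(cls);
        if (p)                                         /* branch 2: hit? */
            return p;
    }
    return hak_alloc_slow(size);  /* guards, init, large sizes, refill */
}
```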

Benchmark results (Larson 2s 8-128B 1024 chunks):
- Original:     ST 0.498M ops/s, MT 1.502M ops/s
- LUT version:  ST 0.377M ops/s, MT 1.856M ops/s

Analysis:
- ST regression: the branch predictor learns the linear-search pattern
- MT improvement: the LUT avoids branch mispredictions after context switches
- Recommendation: keep the LUT for multi-threaded workloads

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 04:58:03 +00:00
52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
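
A hedged sketch of what the Free Pipeline counters might look like; the names come from the commit, the shape is an assumption:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical counter definitions; each free site would bump its counter,
   e.g. atomic_fetch_add_explicit(&g_free_ss_local, 1, memory_order_relaxed). */
static _Atomic unsigned long g_free_ss_local;   /* frees into the local SuperSlab */
static _Atomic unsigned long g_free_ss_remote;  /* frees into a remote SuperSlab  */
static _Atomic unsigned long g_free_tls_sll;    /* frees onto the TLS singly-linked list */

static void dump_free_pipeline(void)
{
    fprintf(stderr, "[FREE] ss_local=%lu ss_remote=%lu tls_sll=%lu\n",
            atomic_load(&g_free_ss_local),
            atomic_load(&g_free_ss_remote),
            atomic_load(&g_free_tls_sll));
}
```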

Bug Fixes:
- Fix SuperSlab being disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00