09e1d89e8d
Phase 6-4: Larson benchmark optimizations - LUT size-to-class
Two optimizations to improve Larson benchmark performance:
1. **Option A: Fast Path Priority** (core/hakmem.c)
- Move HAKMEM_TINY_FAST_PATH check before all guard checks
- Reduce malloc() fast path from 8+ branches to 3 branches (see the first sketch after this list)
- Results: +42% ST, -20% MT (mixed results)
2. **LUT Optimization** (core/tiny_fastcache.h)
- Replace 11-branch linear search with O(1) lookup table
- Use size_to_class_lut[size >> 3] for fast mapping (see the second sketch after this list)
- Results: +24% MT, -24% ST (MT-optimized tradeoff)
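A minimal sketch of Option A under stated assumptions: only HAKMEM_TINY_FAST_PATH and the "fast path before guard checks" idea come from this commit; TINY_FAST_MAX, tiny_fast_alloc, and hakmem_malloc_slow are hypothetical names, stubbed so the sketch compiles on its own.
```c
/* Sketch of Option A: try the tiny fast path before any guard checks.
 * TINY_FAST_MAX, tiny_fast_alloc, and hakmem_malloc_slow are assumed
 * names, not the actual core/hakmem.c identifiers. */
#include <stddef.h>
#include <stdlib.h>

#define TINY_FAST_MAX 128  /* assumed tiny-size threshold */

static void *tiny_fast_alloc(size_t size)    { (void)size; return NULL; } /* TLS cache pop (stub) */
static void *hakmem_malloc_slow(size_t size) { return malloc(size); }     /* guards + slow path (stub) */

void *hakmem_malloc(size_t size)
{
#ifdef HAKMEM_TINY_FAST_PATH
    /* Hot path: one size check and one cache-hit check, nothing else. */
    if (size <= TINY_FAST_MAX) {
        void *p = tiny_fast_alloc(size);
        if (p)
            return p;
    }
#endif
    /* Guard checks, lazy init, and large-size handling run only after
     * the fast path has been tried, not in front of every allocation. */
    return hakmem_malloc_slow(size);
}
```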
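A minimal sketch of the size-to-class LUT, assuming 8-byte granularity up to 128 B. The size_to_class_lut[size >> 3] indexing is from the commit; the class boundaries, table contents, and the tiny_size_to_class helper are illustrative assumptions, not the actual core/tiny_fastcache.h layout.
```c
#include <stdint.h>
#include <stddef.h>

/* Covers tiny sizes up to 128 bytes (the 8-128B range used in the Larson run). */
#define TINY_LUT_ENTRIES ((128 >> 3) + 1)   /* indices 0..16 */

/* Illustrative mapping: one class per 8-byte step. The real table may
 * use different class boundaries. */
static const uint8_t size_to_class_lut[TINY_LUT_ENTRIES] = {
    0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 15
};

static inline unsigned tiny_size_to_class(size_t size)
{
    /* Single O(1) load replaces the former 11-branch linear scan. */
    return size_to_class_lut[size >> 3];
}
```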
Benchmark results (Larson 2s 8-128B 1024 chunks):
- Original: ST 0.498M ops/s, MT 1.502M ops/s
- LUT version: ST 0.377M ops/s, MT 1.856M ops/s
Analysis:
- ST regression: the branch predictor fully learns the linear-search pattern, so the branchy code was already cheap in single-threaded runs
- MT improvement: the LUT avoids branch mispredictions after context switches
- Recommendation: Keep LUT for multi-threaded workloads
Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 04:58:03 +00:00

52386401b3
Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll), sketched after this list
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
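A minimal sketch of what the free-pipeline counters could look like; the counter names (ss_local, ss_remote, tls_sll) are from this commit, but the struct layout, the HAKMEM_DEBUG_COUNTERS guard, and the helpers are assumptions for illustration.
```c
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    _Atomic unsigned long ss_local;   /* frees returned to the owning SuperSlab */
    _Atomic unsigned long ss_remote;  /* frees routed to a remote SuperSlab     */
    _Atomic unsigned long tls_sll;    /* frees pushed onto the TLS free list    */
} free_pipeline_counters_t;

static free_pipeline_counters_t g_free_counters;

/* Compiled out unless debug counters are enabled, so the release fast
 * path pays no cost (macro name is an assumption). */
#ifdef HAKMEM_DEBUG_COUNTERS
#  define FREE_COUNTER_INC(field) \
        atomic_fetch_add_explicit(&g_free_counters.field, 1, memory_order_relaxed)
#else
#  define FREE_COUNTER_INC(field) ((void)0)
#endif

/* Example use at a free-pipeline decision point (hypothetical function). */
static void tiny_free_local_example(void)
{
    FREE_COUNTER_INC(ss_local);
}

static void free_counters_dump(void)
{
    printf("ss_local=%lu ss_remote=%lu tls_sll=%lu\n",
           atomic_load(&g_free_counters.ss_local),
           atomic_load(&g_free_counters.ss_remote),
           atomic_load(&g_free_counters.tls_sll));
}
```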
Bug Fixes:
- Fix SuperSlab being disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files
Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)
This is a clean repository without large log files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00