tomoaki/hakmem

Commit Graph

Author	SHA1	Message	Date

Claude

09e1d89e8d

Phase 6-4: Larson benchmark optimizations - LUT size-to-class

Two optimizations to improve Larson benchmark performance:

1. **Option A: Fast Path Priority** (core/hakmem.c)
   - Move HAKMEM_TINY_FAST_PATH check before all guard checks
   - Reduce malloc() fast path from 8+ branches to 3 branches
   - Results: +42% single-thread (ST), -20% multi-thread (MT); mixed

2. **LUT Optimization** (core/tiny_fastcache.h)
   - Replace 11-branch linear search with O(1) lookup table
   - Use size_to_class_lut[size >> 3] for fast mapping
   - Results: +24% MT, -24% ST (MT-optimized tradeoff)

Benchmark results (Larson 2s 8-128B 1024 chunks):
- Original:     ST 0.498M ops/s, MT 1.502M ops/s
- LUT version:  ST 0.377M ops/s, MT 1.856M ops/s

Analysis:
- ST regression: Branch predictor learns linear search pattern
- MT improvement: LUT avoids branch misprediction on context switch
- Recommendation: Keep LUT for multi-threaded workloads

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md

2025-11-05 04:58:03 +00:00

Claude

b64cfc055e

Implement Option A: Fast Path priority optimization (Phase 6-4)

Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)

Implementation:
- malloc(): Fast Path now executes with 3 branches total
  - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
  - Branch 3: tiny_fast_alloc() cache hit check
  - Slow Path: All guard checks moved after Fast Path miss

- free(): Fast Path with 1-2 branches
  - Branch 1: g_initialized check
  - Direct to hak_free_at() on normal case
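
The branch ordering listed above can be sketched roughly as follows. The names `g_initialized`, `TINY_FAST_THRESHOLD`, `tiny_fast_alloc()`, and `__builtin_expect` appear in this commit message; everything else (the threshold value of 128, the single fixed-size cache, `sketch_malloc`, `sketch_cache_push`, `alloc_slow`, `g_slow_calls`) is hypothetical and stands in for code the log does not show. The sketch assumes GCC or Clang, since `__builtin_expect` is a compiler builtin.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

#define TINY_FAST_THRESHOLD 128 /* assumed value */
#define CACHE_CAP 16            /* hypothetical cache capacity */

static int g_initialized = 0;
static int g_slow_calls = 0;    /* instrumentation for this sketch only */

/* Hypothetical cache of free blocks, each TINY_FAST_THRESHOLD bytes.
 * A real allocator would keep this per thread and per size class. */
static void *g_cache[CACHE_CAP];
static int g_cache_len = 0;

/* Pop a cached block, or return NULL on a cache miss. */
static inline void *tiny_fast_alloc(size_t size) {
    (void)size; /* single size class in this sketch */
    return g_cache_len > 0 ? g_cache[--g_cache_len] : NULL;
}

/* Stand-in for the free side: stash a block; returns 0 if the cache is full. */
static int sketch_cache_push(void *p) {
    assert(p != NULL);
    if (g_cache_len == CACHE_CAP) return 0;
    g_cache[g_cache_len++] = p;
    return 1;
}

/* Slow path: initialization and guard checks live here, after a fast-path miss. */
static void *alloc_slow(size_t size) {
    g_slow_calls++;
    if (UNLIKELY(!g_initialized)) g_initialized = 1; /* placeholder for real init */
    /* recursion / LD_PRELOAD guards would run here */
    return malloc(size);
}

void *sketch_malloc(size_t size) {
    /* Branch 1+2: initialized and tiny */
    if (LIKELY(g_initialized && size <= TINY_FAST_THRESHOLD)) {
        void *p = tiny_fast_alloc(size);
        /* Branch 3: cache hit */
        if (LIKELY(p != NULL)) return p;
    }
    return alloc_slow(size);
}
```

The first call always takes the slow path because `g_initialized` is still 0; later tiny requests return from the cache without touching any guard check as long as the cache is non-empty.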

Performance Results (Larson benchmark, size=8-128B):

Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After:  0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓

Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After:  1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗

Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet target (+200-400%); further optimization needed

Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)

2025-11-05 04:44:50 +00:00

Moe Charm (CI)

52386401b3

Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
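
One common way to build counters like these is a compile-time-gated macro over relaxed atomics, sketched below. The counter names `ss_local`, `ss_remote`, and `tls_sll` come from this commit message, but what each one counts is not spelled out here. The macro `SKETCH_DEBUG_COUNTERS`, the struct, `COUNT()`, and `dump_free_counters()` are hypothetical and are not hakmem's actual infrastructure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical build switch; set to 0 to compile the counters out entirely. */
#ifndef SKETCH_DEBUG_COUNTERS
#define SKETCH_DEBUG_COUNTERS 1
#endif

/* One counter per free-pipeline route named in the commit message. */
typedef struct {
    atomic_ulong ss_local;
    atomic_ulong ss_remote;
    atomic_ulong tls_sll;
} free_pipeline_counters;

/* Static storage, so every counter starts at zero. */
static free_pipeline_counters g_free_counters;

#if SKETCH_DEBUG_COUNTERS
/* Relaxed ordering is enough: the values are statistics, not synchronization. */
#define COUNT(field) \
    ((void)atomic_fetch_add_explicit(&g_free_counters.field, 1, memory_order_relaxed))
#else
#define COUNT(field) ((void)0)
#endif

/* Print a one-line snapshot, e.g. at exit or after a benchmark run. */
static void dump_free_counters(FILE *out) {
    fprintf(out, "ss_local=%lu ss_remote=%lu tls_sll=%lu\n",
            atomic_load_explicit(&g_free_counters.ss_local, memory_order_relaxed),
            atomic_load_explicit(&g_free_counters.ss_remote, memory_order_relaxed),
            atomic_load_explicit(&g_free_counters.tls_sll, memory_order_relaxed));
}
```

A free routine would call `COUNT(ss_local);` on whichever route it takes, and `dump_free_counters(stderr);` prints the totals. With the switch set to 0 the increments compile to nothing, so release builds pay no cost.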

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

3 Commits