Commit Graph

409 Commits

3429ed4457 Phase 6-7: Dual Free Lists (Phase 2) - Mixed results
Implementation:
Separate alloc/free paths to reduce cache line bouncing (mimalloc's strategy).

Changes:
1. Added g_tiny_fast_free_head[] - separate free staging area
2. Modified tiny_fast_alloc() - lazy migration from free_head
3. Modified tiny_fast_free() - push to free_head (separate cache line)
4. Modified tiny_fast_drain() - drain from free_head

Key design (inspired by mimalloc):
- alloc_head: Hot allocation path (g_tiny_fast_cache)
- free_head: Local frees staging (g_tiny_fast_free_head)
- Migration: Pointer swap when alloc_head empty (zero-cost batching)
- Benefit: alloc/free touch different cache lines → reduce bouncing
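
A minimal C sketch of this split, assuming the names above (g_tiny_fast_cache as alloc_head, g_tiny_fast_free_head as free_head) and a hypothetical block_t link type:

```c
#include <stddef.h>

#define TINY_CLASSES 16

typedef struct block { struct block *next; } block_t;

static __thread block_t *g_tiny_fast_cache[TINY_CLASSES];     /* alloc_head */
static __thread block_t *g_tiny_fast_free_head[TINY_CLASSES]; /* free_head  */

static inline void *tiny_fast_alloc(int cls) {
    block_t *b = g_tiny_fast_cache[cls];
    if (__builtin_expect(b == NULL, 0)) {
        /* Lazy migration: take the whole staged free list in one pointer swap. */
        b = g_tiny_fast_free_head[cls];
        if (b == NULL)
            return NULL;               /* true miss: caller falls back to refill */
        g_tiny_fast_free_head[cls] = NULL;
    }
    g_tiny_fast_cache[cls] = b->next;
    return b;
}

static inline void tiny_fast_free(void *p, int cls) {
    /* Frees touch only free_head, keeping the hot alloc_head line quiet. */
    block_t *b = p;
    b->next = g_tiny_fast_free_head[cls];
    g_tiny_fast_free_head[cls] = b;
}
```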

Results (Larson 2s 8-128B 1024):
- Phase 3 baseline: ST 0.474M, MT 1.712M ops/s
- Phase 2: ST 0.600M, MT 1.624M ops/s
- Change: **+27% ST, -5% MT** ⚠️

Analysis - Mixed results:
- Single-thread: +27% improvement
  - Better cache locality (alloc/free separated)
  - No contention; a pure memory-access-pattern win

- Multi-thread: -5% regression (expected +30-50%)
  - Migration logic overhead (extra branches)
  - Dual arrays increase TLS size → more cache misses?
  - Pointer swap cost on the migration path
  - May not help in Larson's specific access pattern

Comparison to system malloc:
- Current: 1.624M ops/s (MT)
- System: ~7.2M ops/s (MT)
- **Gap: Still 4.4x slower**

Key insights:
1. mimalloc's dual free lists help with *cross-thread* frees
2. Larson may be mostly *same-thread* frees → less benefit
3. Migration overhead > cache line bouncing reduction
4. ST improvement shows memory locality matters
5. Need to profile actual malloc/free patterns in Larson

Why mimalloc succeeds but HAKMEM doesn't:
- mimalloc has sophisticated remote free queue (lock-free MPSC)
- HAKMEM's simple dual lists don't handle cross-thread frees well
- Larson's workload may differ from mimalloc's target benchmarks

Next considerations:
- Verify Larson's same-thread vs cross-thread free ratio
- Consider combining all 3 phases (may have synergy)
- Profile with actual counters (malloc vs free hotspots)
- May need fundamentally different approach
2025-11-05 05:35:06 +00:00
e3514e7fa9 Phase 6-6: Batch Refill Optimization (Phase 3) - Success!
Implementation:
Replace 16 individual cache pushes with batch linking for refill path.

Changes in core/tiny_fastcache.c:
1. Allocate blocks into temporary batch[] array
2. Link all blocks in one pass: batch[i] → batch[i+1]
3. Attach linked list to cache head atomically
4. Pop one for caller

Optimization:
- OLD: 16 allocs + 16 individual pushes (scattered memory writes)
- NEW: 16 allocs + batch link in one pass (sequential writes)
- Memory writes reduced: ~16 → ~2 per block (-87%)
- Cache locality improved: sequential vs scattered access
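
A sketch of the batched refill (backend_alloc is a hypothetical stand-in for the slow-path block source; error handling elided):

```c
#include <stddef.h>

#define REFILL_BATCH 16

typedef struct block { struct block *next; } block_t;

extern void *backend_alloc(int cls);           /* hypothetical slow-path source */
static __thread block_t *g_tiny_fast_cache[16];

static void *tiny_fast_refill(int cls) {
    block_t *batch[REFILL_BATCH];

    /* 1. Allocate blocks into a temporary array. */
    for (int i = 0; i < REFILL_BATCH; i++)
        batch[i] = backend_alloc(cls);

    /* 2. Link everything in one sequential pass: batch[i] -> batch[i+1]. */
    for (int i = 0; i < REFILL_BATCH - 1; i++)
        batch[i]->next = batch[i + 1];
    batch[REFILL_BATCH - 1]->next = g_tiny_fast_cache[cls];

    /* 3. Attach the chain with a single head update, popping one for the caller. */
    g_tiny_fast_cache[cls] = batch[0]->next;
    return batch[0];
}
```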

Results (Larson 2s 8-128B 1024):
- Phase 1 baseline: ST 0.424M, MT 1.453M ops/s
- Phase 3: ST 0.474M, MT 1.712M ops/s
- **Improvement: +12% ST, +18% MT** 

Analysis:
Better than expected! We predicted +0.65% (refills are only 0.75% of ops),
but achieved +12-18% due to:
1. Batch linking improves cache efficiency
2. Eliminated 16 scattered freelist push overhead
3. Better memory locality (sequential vs random writes)

Comparison to system malloc:
- Current: 1.712M ops/s (MT)
- System: ~7.2M ops/s (MT)
- **Gap: Still 4.2x slower**

Key insight:
Phase 3 was more effective than Phase 1 (entry-point reordering),
which suggests memory access patterns matter more than branch counts.

Next: Phase 2 (Dual Free Lists) - the main target
Expected: +30-50% from reducing cache line bouncing (mimalloc's key advantage)
2025-11-05 05:27:18 +00:00
494205435b Add debug counters for refill analysis - Surprising discovery
Implementation:
- Register tiny_fast_print_stats() via atexit() on first refill
- Forward declaration for function ordering
- Enable with HAKMEM_TINY_FAST_STATS=1

Usage:
```bash
HAKMEM_TINY_FAST_STATS=1 ./larson_hakmem 2 8 128 1024 1 12345 4
```
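
A minimal sketch of the one-shot registration described above; the counter and helper names other than tiny_fast_print_stats() are hypothetical:

```c
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static _Atomic unsigned long g_refill_count;

static void tiny_fast_print_stats(void);       /* forward declaration */

static void tiny_fast_note_refill(void) {
    static atomic_flag registered = ATOMIC_FLAG_INIT;
    atomic_fetch_add(&g_refill_count, 1);
    /* Register the atexit() printer exactly once, on the first refill,
     * and only when HAKMEM_TINY_FAST_STATS=1 is set in the environment. */
    if (!atomic_flag_test_and_set(&registered)) {
        const char *env = getenv("HAKMEM_TINY_FAST_STATS");
        if (env && env[0] == '1')
            atexit(tiny_fast_print_stats);
    }
}

static void tiny_fast_print_stats(void) {
    fprintf(stderr, "tiny_fast: refills=%lu\n", atomic_load(&g_refill_count));
}
```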

Results (threads=4, Throughput=1.377M ops/s):
- refills = 1,285 per thread
- drains = 0 (cache never full)
- Total ops = 2.754M (2 seconds)
- Refill allocations = 20,560 (1,285 × 16)
- **Refill rate: 0.75%**
- **Cache hit rate: 99.25%** 

Analysis:
Contrary to expectations, refill cost is NOT the bottleneck:
- Current refill cost: 1,285 × 1,600 cycles = 2.056M cycles
- Even if batched (200 cycles): saves 1.799M cycles
- But refills are only 0.75% of operations!

True bottleneck must be:
1. Fast path itself (99.25% of allocations)
   - malloc() overhead despite reordering
   - size_to_class mapping (even LUT has cost)
   - TLS cache access pattern
2. free() path (not optimized yet)
3. Cross-thread synchronization (22.8% cycles in profiling)

Key insight:
Phase 1 (entry point optimization) and Phase 3 (batch refill)
won't help much because:
- Entry point: Fast path already hit 99.25%
- Batch refill: Only affects 0.75% of operations

Next steps:
1. Add malloc/free counters to identify which is slower
2. Consider Phase 2 (Dual Free Lists) for locality
3. Investigate free() path optimization
4. May need to profile TLS cache access patterns

Related: mimalloc research shows dual free lists reduce cache
line bouncing - this may be more important than refill cost.
2025-11-05 05:19:32 +00:00
3e4e90eadb Phase 6-5: Entry Point Optimization (Phase 1) - Unexpected results
Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by mimalloc/tcache entry point design.

Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move fast path to line 1, add branch prediction hints

Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths
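
A sketch of the reordered entry point under those constraints; hak_malloc_slow() and the exact tiny_fast_alloc() signature are hypothetical stand-ins for the guard-laden slow path and the TLS cache hit check:

```c
#include <stddef.h>

#define TINY_FAST_THRESHOLD 128

extern int   g_initialized;
extern void *tiny_fast_alloc(size_t size);   /* returns NULL on cache miss */
extern void *hak_malloc_slow(size_t size);   /* hypothetical: guards + general path */

void *malloc(size_t size) {
    /* Branches 1+2: tiny size and initialized; branch 3: cache hit. */
    if (__builtin_expect(size <= TINY_FAST_THRESHOLD && g_initialized, 1)) {
        void *p = tiny_fast_alloc(size);
        if (__builtin_expect(p != NULL, 1))
            return p;
    }
    /* Slow path: recursion/LD_PRELOAD guards, lazy init, refill, large sizes. */
    return hak_malloc_slow(size);
}
```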

Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)

Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)

Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
  1. tiny_fast_alloc() internals (size-to-class, cache access)
  2. Refill cost (1,600 cycles for 16 individual calls)
  3. Need Batch Refill optimization (Phase 3) as priority

Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- May need combination of multiple optimizations for breakthrough

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 05:10:02 +00:00
09e1d89e8d Phase 6-4: Larson benchmark optimizations - LUT size-to-class
Two optimizations to improve Larson benchmark performance:

1. **Option A: Fast Path Priority** (core/hakmem.c)
   - Move HAKMEM_TINY_FAST_PATH check before all guard checks
   - Reduce malloc() fast path from 8+ branches to 3 branches
   - Results: +42% ST, -20% MT (mixed results)

2. **LUT Optimization** (core/tiny_fastcache.h)
   - Replace 11-branch linear search with O(1) lookup table
   - Use size_to_class_lut[size >> 3] for fast mapping
   - Results: +24% MT, -24% ST (MT-optimized tradeoff)
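
A sketch of that lookup; only the size_to_class_lut[size >> 3] indexing comes from this commit, and the class layout below is illustrative:

```c
/* With size >> 3 indexing, exact multiples of 8 round up one bucket,
 * which wastes a little space but keeps the lookup a single indexed load. */
#include <stdint.h>
#include <stddef.h>

/* Hypothetical tiny classes: 8, 16, 32, 48, 64, 80, 96, 112, 128 bytes. */
static const uint8_t size_to_class_lut[17] = {
    0, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 8
};

static inline int tiny_size_to_class(size_t size) {
    return size_to_class_lut[size >> 3];   /* replaces the 11-branch search */
}
```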

Benchmark results (Larson 2s 8-128B 1024 chunks):
- Original:     ST 0.498M ops/s, MT 1.502M ops/s
- LUT version:  ST 0.377M ops/s, MT 1.856M ops/s

Analysis:
- ST regression: with a single thread, the branch predictor learns the linear search pattern, so the branchy code was effectively free and the LUT's table load becomes pure overhead
- MT improvement: the LUT avoids the branch mispredictions that context switches cause
- Recommendation: Keep LUT for multi-threaded workloads

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 04:58:03 +00:00
b64cfc055e Implement Option A: Fast Path priority optimization (Phase 6-4)
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)

Implementation:
- malloc(): Fast Path now executes with 3 branches total
  - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
  - Branch 3: tiny_fast_alloc() cache hit check
  - Slow Path: All guard checks moved after Fast Path miss

- free(): Fast Path with 1-2 branches
  - Branch 1: g_initialized check
  - Direct to hak_free_at() on normal case
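
A sketch of the free() fast path described above (hak_free_at() is named in the message; its signature and the pre-init fallback are assumptions):

```c
extern int  g_initialized;
extern void hak_free_at(void *p);
extern void hak_free_bootstrap(void *p);   /* hypothetical pre-init fallback */

void free(void *p) {
    if (__builtin_expect(g_initialized, 1)) {  /* branch 1: the common case */
        hak_free_at(p);                        /* straight to the normal path */
        return;
    }
    hak_free_bootstrap(p);                     /* rare: free() before init done */
}
```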

Performance Results (Larson benchmark, size=8-128B):

Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After:  0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓

Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After:  1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗

Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet the target (+200-400%); further optimization needed

Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)
2025-11-05 04:44:50 +00:00
f0c87d0cac Add Larson performance analysis and optimized profile
Ultrathink analysis reveals root cause of 4x performance gap:

Key Findings:
- Single-thread: HAKMEM 0.46M ops/s vs system 4.29M ops/s (10.7%)
- Multi-thread: HAKMEM 1.81M ops/s vs system 7.23M ops/s (25.0%)
- Root cause: malloc() entry point has 8+ branch checks
- Bottleneck: the Fast Path is structurally more complex than the system tcache fast path

Files Added:
- LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md: Detailed analysis with 3 optimization strategies
- scripts/profiles/tinyhot_optimized.env: CLAUDE.md-based optimized config

Proposed Solutions:
- Option A: Optimize malloc() guard checks (+200-400% expected)
- Option B: Improve refill efficiency (+30-50% expected)
- Option C: Complete Fast Path simplification (+400-800% expected)

Target: Achieve 60-80% of system malloc performance
2025-11-05 04:03:10 +00:00
b4e4416544 Add mimalloc-bench submodule and simplify larson_hakmem build
Changes:
- Add mimalloc-bench as git submodule for Larson benchmark source
- Simplify Makefile: Remove shim layer (hakmem.o provides malloc/free directly)
- Enable larson.sh script to build and run Larson benchmarks

This allows running: ./scripts/larson.sh hakmem --profile tinyhot_tput 2 4
2025-11-05 03:43:50 +00:00
52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00