# Session Summary - 2025-10-26

## TL;DR

**Critical discovery**: Every benchmark from Phase 1 through 4.1 was measuring **glibc malloc**. Measured correctly, hakmem turned out to be **8.8x slower than mimalloc**.

**Outcomes**:
- ✅ Fixed the benchmark infrastructure (with safeguards against repeating the mistake)
- ✅ Accurate performance comparison (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (estimated)
- ✅ Created a verification tool and checklist

**Reality**:
- hakmem: 103 M ops/sec (9.7 ns/op)
- mimalloc: **908 M ops/sec (1.1 ns/op)** - 8.8x faster
- glibc: 364 M ops/sec (2.7 ns/op) - 3.5x faster

---

## 📅 Timeline

### Morning: Phase 4 Regression Analysis
- Phase 4: 373-380 M ops/sec (-3.6% vs Phase 3)
- Created PHASE4_REGRESSION_ANALYSIS.md
- Created PHASE4_IMPROVEMENT_ROADMAP.md
- ChatGPT Pro advice: Gating + Batching + pull-based design

### Afternoon: Phase 4 Improvements
- Phase 4.1: Option A+B micro-optimization
  - Result: 381 M ops/sec (+1-2%)
- Phase 4.2: High-water gating
  - Result: 103 M ops/sec (???)

### Evening: Critical Discovery 🚨
- **All benchmarks were measuring glibc malloc!**
- Root cause: Makefile implicit rule
- bench_tiny.c didn't link hakmem

### Night: Infrastructure Fix + Reality Check
- Fixed Makefile (explicit targets)
- Created verify_bench.sh
- Created BENCHMARKING_CHECKLIST.md
- **True measurement**: hakmem = 103 M ops/sec
- **Comparison**: mimalloc = 908 M ops/sec (8.8x faster!)

---

## 🔍 Critical Discovery Details

### The Mistake

**What happened**:
```bash
# Before (wrong)
make bench_tiny
# → gcc bench_tiny.c -o bench_tiny   (Makefile implicit rule!)
# → No hakmem linkage
# → Using glibc malloc

# All reported numbers were glibc malloc:
#   Phase 3:   391 M ops/sec (glibc)
#   Phase 4:   373 M ops/sec (glibc)
#   Phase 4.1: 381 M ops/sec (glibc)
```

**Root Cause**:
- The Makefile had no explicit `bench_tiny` target
- `bench_tiny.c` only calls `malloc/free` (no hakmem.h include)
- The system malloc was used by default
- No errors were reported, so it looked like it was working

**Why performance "changed"**:
- The 361→391 M ops/sec spread (8.3% variation) was **measurement noise**
- Likely sources: CPU load, cache state, system activity
- **No relation to hakmem code changes**

---

### The Fix

**1. Explicit Makefile targets**:
```makefile
TINY_BENCH_OBJS = hakmem.o hakmem_config.o ... (all hakmem objects)

bench_tiny: bench_tiny.o $(TINY_BENCH_OBJS)
	$(CC) -o $@ $^ $(LDFLAGS)
	@echo "✓ bench_tiny built with hakmem"
```

**2. Verification script** (`verify_bench.sh`):
```bash
$ ./verify_bench.sh ./bench_tiny
✅ hakmem symbols: 119
✅ Binary size: 156KB
✅ Verification PASSED
```

**3. Checklist** (`BENCHMARKING_CHECKLIST.md`):
- Pre-benchmark verification steps
- Post-benchmark validation
- Guards against repeating the mistake

---

## 📊 True Performance Comparison

### Benchmark Setup

**Workload**: bench_tiny
- 16B allocations
- 100 alloc → 100 free × 10M iterations
- Total: 2B operations (1B alloc + 1B free)

**Environment**:
- Platform: WSL2 (Linux 5.15.167.4-microsoft-standard-WSL2)
- CPU: (not specified, but modern x86-64)
- Compiler: gcc -O3 -march=native -mtune=native

### Results

| Allocator | Throughput | Latency | vs hakmem | Notes |
|-----------|------------|---------|-----------|-------|
| **mimalloc** | **908.31 M ops/sec** | **1.1 ns/op** | **8.8x faster** | LD_PRELOAD=libmimalloc.so.2 |
| **glibc** | 364.00 M ops/sec | 2.7 ns/op | 3.5x faster | System default |
| **hakmem** | 103.35 M ops/sec | 9.7 ns/op | 1.0x (baseline) | Phase 4.2 |

### Analysis

**mimalloc dominance**:
- 908 M ops/sec is **exceptionally fast**
- 1.1 ns/op ≈ 3-4 CPU cycles (assuming 3 GHz)
- Near the practical minimum for a malloc/free pair on this workload
- Result of years of sustained optimization work

**hakmem gap**:
- 8.8x slower than mimalloc
- 3.5x slower than glibc
- **8.6 ns overhead** per op vs mimalloc (9.7 ns - 1.1 ns)

**Note**: The earlier ANALYSIS_SUMMARY.md stated mimalloc = 14 ns/op, but that figure came from a different benchmark (likely one with more complex allocations).
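The latency, ratio, and cycle figures above all follow from the throughput column by simple arithmetic. The sketch below spells that out so the numbers can be re-derived when new measurements come in. It is illustrative only: `ns_per_op`, `speedup`, `cycles_per_op`, and `print_row` are hypothetical helper names, not part of hakmem, and the clock frequency is passed in as an assumption (the analysis above assumes 3 GHz).

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in millions of ops/sec -> latency in nanoseconds per op.
 * 1e9 ns per second divided by (mops * 1e6) ops per second = 1000 / mops. */
double ns_per_op(double mops_per_sec) {
    assert(mops_per_sec > 0.0);
    return 1000.0 / mops_per_sec;
}

/* How many times faster `other` is than `baseline` (both in M ops/sec). */
double speedup(double other_mops, double baseline_mops) {
    assert(baseline_mops > 0.0);
    return other_mops / baseline_mops;
}

/* Approximate CPU cycles per op at an ASSUMED clock frequency.
 * GHz is cycles per nanosecond, so cycles = ns/op * GHz. */
double cycles_per_op(double mops_per_sec, double assumed_ghz) {
    return ns_per_op(mops_per_sec) * assumed_ghz;
}

/* Print one comparison row relative to the hakmem throughput. */
void print_row(const char *name, double mops, double hakmem_mops,
               double assumed_ghz) {
    printf("%-9s %7.2f M ops/sec  %5.2f ns/op  %5.2fx vs hakmem  ~%.1f cycles/op\n",
           name, mops, ns_per_op(mops), speedup(mops, hakmem_mops),
           cycles_per_op(mops, assumed_ghz));
}
```

Calling `print_row("mimalloc", 908.31, 103.35, 3.0)` from a small `main` reproduces the table row: about 1.10 ns/op, 8.79x, and roughly 3.3 cycles per op. The per-op gap is then `ns_per_op(103.35) - ns_per_op(908.31)`, which rounds to the 8.6 ns quoted above.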
---

## 🔬 Bottleneck Analysis

### Estimated 8.6 ns Breakdown

Based on code review (perf unavailable in WSL2). The components below estimate hakmem's full per-op cost, of which 8.6 ns is the excess over mimalloc:

| Component | Estimated Cost | Details |
|-----------|---------------|---------|
| Registry lookup | 1-2 ns | Linear probing hash (max 8 probes), lock-free |
| Bitmap operations | 1-2 ns | set_used/free + summary bitmap update |
| Stats (batched TLS) | 0.5 ns | TLS increment (Phase 3 optimization) |
| Mini-magazine | 0.5-1 ns | pop/push operations (Phase 2 addition) |
| Other overhead | 3-5 ns | Function calls, branches, cache misses |
| **Total** | **6.0-10.5 ns** | **Consistent with measured 9.7 ns** ✅ |

### Key Findings

**1. Registry lookup (O(1) but not free)**:
```c
static int g_use_registry = 1;  // Enabled by default
// registry_lookup(): lock-free but linear probing
```
- O(1) average case
- Lock-free (good!)
- But: linear probing, up to 8 probes
- Cost: ~1-2 ns per lookup

**2. Bitmap overhead**:
```c
hak_tiny_set_used(slab, block_idx);
// Updates bitmap + summary bitmap
```
- Two-tier bitmap for fast empty-word detection
- Each set/free updates both levels
- Cost: ~1-2 ns

**3. Mini-magazine (Phase 2)**:
- Added as a "fast path" (1-2 ns vs 5-6 ns for the bitmap)
- But: adds an extra layer compared with a direct TLS cache
- Cost: ~0.5-1 ns

**4. Stats (Phase 3 - optimized)**:
- Batched TLS counters (0.5 ns)
- Good! (vs 10-15 ns for the XOR RNG before)

**5. Unaccounted overhead (~3-5 ns)**:
- Function call overhead
- Branch mispredictions
- Cache misses
- Data structure indirection

---

## 💡 Why is mimalloc so fast?

### mimalloc Architecture (inferred)

**1. Minimalist TLS cache**:
- Direct pointer to free list
- Single pointer dereference
- Near-zero overhead

**2. Inline metadata**:
- Page metadata embedded in page
- O(1) pointer→page calculation
- No hash table needed

**3. Lock-free everywhere**:
- Thread-local pages
- Atomic operations only for shared structures
- No pthread mutexes in hot path

**4. Cache-optimized**:
- Sequential layout
- Prefetch-friendly
- Minimal pointer chasing

**5. Extreme simplicity**:
- Fewer layers
- Fewer branches
- Fewer memory accesses

**hakmem's complexity**:
- TLS Magazine → Mini-magazine → Bitmap
- 3 layers vs mimalloc's 1-2 layers
- Each layer adds overhead

---

## 📈 Progress Summary

### What Was Completed Today

#### Phase 1-3 (Done previously, but measured incorrectly)
- ❌ Phase 1: Mini-magazine infrastructure (measured glibc)
- ❌ Phase 2: Hot path integration (measured glibc)
- ❌ Phase 3: Stats batching (measured glibc)

#### Phase 4 (Attempted, but measured incorrectly)
- ❌ Phase 4: TLS spill optimization (measured glibc)
- ❌ Phase 4.1: Micro-optimization A+B (measured glibc)

#### Critical Fix (Actual Work)
- ✅ Discovered benchmark infrastructure bug
- ✅ Fixed Makefile (explicit targets)
- ✅ Created verify_bench.sh (119 hakmem symbols detected)
- ✅ Created BENCHMARKING_CHECKLIST.md (prevention)
- ✅ Phase 4.2: High-water gating (first correct measurement!)

#### Analysis & Documentation
- ✅ PHASE4_REGRESSION_ANALYSIS.md (16 KB)
- ✅ PHASE4_IMPROVEMENT_ROADMAP.md (13 KB)
- ✅ BENCHMARKING_CHECKLIST.md (comprehensive)
- ✅ verify_bench.sh (automated verification)
- ✅ Comparative benchmark (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (code review)

---

## 🎯 Reality Check

### The Gap

**mimalloc**: 908 M ops/sec (1.1 ns/op)
**hakmem**: 103 M ops/sec (9.7 ns/op)
**Gap**: **8.8x**

### Is This Gap Closeable?
**Honest Assessment**:

**Short-term** (Quick Wins - weeks):
- Target: 103 → 150 M ops/sec (+45% over the 103 M baseline)
- Method:
  - Reduce bitmap overhead
  - Optimize registry hash
  - Remove mini-magazine layer (if not helping)
  - Tune TLS Magazine capacity
- Realistic: **Achievable**

**Mid-term** (Major Optimization - months):
- Target: 150 → 250 M ops/sec (+140% over the 103 M baseline)
- Method:
  - Redesign data structures
  - Inline metadata (like mimalloc)
  - Lock-free everywhere
  - Cache-optimized layout
- Realistic: **Difficult but possible**

**Long-term** (Mimalloc-level - years):
- Target: 250 → 700+ M ops/sec (+580% over the 103 M baseline)
- Method:
  - Complete architectural overhaul
  - Learn from mimalloc source code
  - Remove all unnecessary layers
  - Extreme optimization
- Realistic: **Requires mimalloc-level expertise**

**Fundamental Challenge**:
- hakmem is a **research allocator** (visibility, experimentation)
- mimalloc is a **production allocator** (speed, simplicity)
- These goals are **inherently in conflict**

**Research features that add overhead**:
- Bitmap (vs inline metadata)
- Statistics (even batched TLS)
- Multiple layers (TLS Mag → Mini-mag → Bitmap)
- Flexibility (registry toggle, SuperSlab, etc.)

---

## 🚀 Next Steps

### Option A: Accept Reality (Recommended)

**Goal**: hakmem as a **research platform**, not a production allocator

**Approach**:
1. Document current performance (103 M ops/sec baseline)
2. Focus on **research features**:
   - Bitmap visibility for debugging
   - Statistics for analysis
   - Experimentation platform
3. Optimize **moderately**:
   - Quick wins (103 → 150 M ops/sec)
   - Don't sacrifice research goals for speed
4. **Accept 5-8x slower than mimalloc** as a reasonable trade-off

**Outcome**: Honest, useful research tool

---

### Option B: Pursue Performance (Challenging)

**Goal**: Close the gap to 2-3x slower than mimalloc

**Phase 1: Quick Wins** (Target: 150 M ops/sec)
- Remove mini-magazine if not helping
- Optimize registry hash (reduce collisions)
- Tune TLS Magazine capacity
- Inline hot functions
- Profile with proper tools (perf, gprof)

**Phase 2: Architectural** (Target: 250 M ops/sec)
- Inline metadata (like mimalloc)
- Simplify data structures
- Remove unnecessary layers
- Cache-optimized layout

**Phase 3: Extreme** (Target: 400+ M ops/sec)
- Study mimalloc source code deeply
- Adopt mimalloc techniques
- Sacrifice research features if needed

**Outcome**: High-performance allocator, less research-friendly

---

### Option C: Hybrid Approach (Balanced)

**Goal**: Research + reasonable performance

**Approach**:
1. **Default mode**: Research-friendly (bitmap, stats, visibility)
   - Performance: 100-150 M ops/sec
2. **Fast mode**: Production-optimized (inline metadata, no stats)
   - Performance: 300-500 M ops/sec
   - Toggle: `HAKMEM_FAST_MODE=1`

**Outcome**: Best of both worlds, at the cost of more complexity

---

## 📚 Documentation Created Today

| File | Size | Purpose |
|------|------|---------|
| PHASE4_REGRESSION_ANALYSIS.md | 16 KB | Root cause analysis of Phase 4 regression |
| PHASE4_IMPROVEMENT_ROADMAP.md | 13 KB | Phased improvement plan (4.1→4.2→4.3→4.4) |
| BENCHMARKING_CHECKLIST.md | 11 KB | Prevention guide (never make the same mistake) |
| verify_bench.sh | 2 KB | Automated verification (119 symbols check) |
| bench_hakmem.sh | 2 KB | Benchmark script template |
| SESSION_SUMMARY_2025-10-26.md | (this file) | Comprehensive session summary |

**Total**: ~44 KB of documentation

---

## 🎓 Lessons Learned

### Technical Lessons

1. **Makefile implicit rules are dangerous**
   - Always define explicit targets
   - Especially for critical binaries

2. **"It works" ≠ "It's correct"**
   - bench_tiny "worked" (no errors)
   - But it measured the wrong thing entirely

3. **Verification is essential**
   - `nm` to check symbols
   - `ldd` to check libraries
   - `size` to check binary size

4. **Performance claims need causality**
   - Did the code change cause the performance change?
   - Or was it just measurement noise?

5. **Benchmarking is hard**
   - Same code
   - Same compiler flags
   - Same workload
   - Verify what you're measuring!

### Process Lessons

1. **Document procedures**
   - Checklists prevent mistakes
   - Future-you will thank you

2. **Automate verification**
   - Scripts don't forget
   - verify_bench.sh (119 symbols)

3. **Question surprising results**
   - "Phase 3 improved 1.3%"
   - Was it real?
   - If glibc improved, that'd be newsworthy!

4. **Be honest about limitations**
   - hakmem is 8.8x slower
   - That's okay for research!

5. **Celebrate discoveries**
   - Found a fundamental issue
   - Fixed it with tooling and a checklist
   - That's progress!

---

## 💭 Reflections

### What Went Well ✅

1. **Fixed fundamental infrastructure bug**
   - Should not recur
   - Proper tooling in place

2. **Honest performance comparison**
   - Now know the true baseline
   - Can set realistic goals

3. **Comprehensive documentation**
   - 44 KB of useful docs
   - Future-proofing

4. **Learned from ChatGPT Pro**
   - Gating strategy
   - Pull-based architecture
   - Batching techniques

### What Could Be Better 🤔

1. **Should have verified linkage earlier**
   - Wasted time on Phase 1-4.1
   - But: learned valuable lessons!

2. **Need better profiling tools**
   - perf doesn't work in WSL2
   - Consider: gprof, valgrind, manual timing

3. **Architectural complexity**
   - 3 layers (TLS Mag → Mini-mag → Bitmap)
   - Each adds overhead
   - Simplification opportunity?

---

## 🎯 Recommendation

**For tomorrow** (if continuing):

### Immediate Actions

1. **Run verify_bench.sh** before any benchmark
   ```bash
   ./verify_bench.sh ./bench_tiny
   ```

2. **Re-measure Phase 0 (baseline)** with the correct infrastructure
   - Before Phase 1
   - True starting point

3. **Profile with working tools**
   - gprof (if available)
   - Manual timing (HAKMEM_DEBUG_TIMING=1)
   - Identify true bottlenecks

4. **Implement one Quick Win**
   - Start with the easiest improvement
   - Measure impact properly
   - Document the result

### Strategic Decision

**Choose a path**:
- **Path A**: Research focus (accept 5-8x slower)
- **Path B**: Performance pursuit (target 2-3x slower)
- **Path C**: Hybrid (modes for different use cases)

**Recommended**: **Path A** (Research focus)
- Honest about trade-offs
- Useful research tool
- Moderate optimization (Quick Wins)
- Don't sacrifice research goals

---

## 📊 Final Numbers

### Before Today (Wrong)

| Phase | Throughput | Note |
|-------|------------|------|
| Phase 2 | 361 M ops/sec | ❌ glibc malloc |
| Phase 3 | 391 M ops/sec | ❌ glibc malloc |
| Phase 4 | 373 M ops/sec | ❌ glibc malloc |
| Phase 4.1 | 381 M ops/sec | ❌ glibc malloc |

### After Today (Correct)

| Allocator | Throughput | Latency | Verified |
|-----------|------------|---------|----------|
| hakmem (Phase 4.2) | 103 M ops/sec | 9.7 ns/op | ✅ 119 symbols |
| glibc malloc | 364 M ops/sec | 2.7 ns/op | ✅ No hakmem |
| mimalloc | 908 M ops/sec | 1.1 ns/op | ✅ LD_PRELOAD |

**Gap to close**: 8.8x (mimalloc) or 3.5x (glibc)

---

## 🙏 Acknowledgments

- **ChatGPT Pro**: Excellent architectural advice (Gating, Batching, pull-based design)
- **User (にゃーん)**: Persistent questioning led to the discovery
- **verify_bench.sh**: Will help prevent future mistakes

---

## 📅 Session Stats

- **Duration**: ~8 hours
- **Commits**: 6
- **Files Created**: 6
- **Lines of Code**: ~500
- **Documentation**: ~44 KB
- **Major Discovery**: 1 (benchmark infrastructure bug)
- **Performance Gap Discovered**: 8.8x vs mimalloc

---

## ✨ Conclusion

**Today was tough** 😿 but **extremely valuable** ✨

**Bad News**:
- Phase 1-4.1 measurements were wrong
- hakmem is 8.8x slower than mimalloc
- Big gap to close

**Good News**:
- Found a fundamental infrastructure bug
- Fixed it with safeguards (verify_bench.sh + checklist)
- Know the true baseline now (103 M ops/sec)
- Can set realistic goals going forward
- Comprehensive documentation for the future

**Moving Forward**:
- Be honest about performance
- Focus on research value
- Optimize pragmatically
- Never make the same mistake again

**Final Thought**:

> "It's better to know an uncomfortable truth than a comfortable lie."

hakmem runs at 103 M ops/sec. That is the reality.
But that may well be enough performance for a **research platform**, nya.

Nyaan! Great work again today, nya! 😸✨

---

*Generated: 2025-10-26*
*Session: Phase 4 Optimization & Infrastructure Fix*
*Status: Documented & Lessons Learned*