Session Summary - 2025-10-26
TL;DR
Major discovery: every benchmark in Phase 1-4.1 was actually measuring glibc malloc. Measuring hakmem correctly shows it is 8.8x slower than mimalloc.
Accomplishments:
- ✅ Fixed the benchmark infrastructure (with safeguards so this can never happen again)
- ✅ Accurate performance comparison (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (estimated)
- ✅ Created verification tooling and a checklist
Reality:
- hakmem: 103 M ops/sec (9.7 ns/op)
- mimalloc: 908 M ops/sec (1.1 ns/op) - 8.8x faster
- glibc: 364 M ops/sec (2.7 ns/op) - 3.5x faster
📅 Timeline
Morning: Phase 4 Regression Analysis
- Phase 4: 373-380 M ops/sec (-3.6% vs Phase 3)
- Created PHASE4_REGRESSION_ANALYSIS.md
- Created PHASE4_IMPROVEMENT_ROADMAP.md
- ChatGPT Pro advice: Gating + Batching + pull-model design
Afternoon: Phase 4 Improvements
- Phase 4.1: Option A+B micro-optimization
- Result: 381 M ops/sec (+1-2%)
- Phase 4.2: High-water gating
- Result: 103 M ops/sec (???)
Evening: Critical Discovery 🚨
- All benchmarks were measuring glibc malloc!
- Root cause: Makefile implicit rule
- bench_tiny.c didn't link hakmem
Night: Infrastructure Fix + Reality Check
- Fixed Makefile (explicit targets)
- Created verify_bench.sh
- Created BENCHMARKING_CHECKLIST.md
- True measurement: hakmem = 103 M ops/sec
- Comparison: mimalloc = 908 M ops/sec (8.8x faster!)
🔍 Critical Discovery Details
The Mistake
What happened:
# Before (wrong)
make bench_tiny
→ gcc bench_tiny.c -o bench_tiny # Makefile implicit rule!
→ No hakmem linkage
→ Using glibc malloc
# All reported numbers were glibc malloc:
Phase 3: 391 M ops/sec (glibc)
Phase 4: 373 M ops/sec (glibc)
Phase 4.1: 381 M ops/sec (glibc)
Root Cause:
- Makefile had no explicit bench_tiny target
- bench_tiny.c only calls malloc/free (no hakmem.h include)
- System malloc was used by default
- No errors → looked like it was working
Why performance "changed":
- 361→391 M ops/sec (8.3% variation) was measurement noise
- CPU load, cache state, system activity
- No relation to hakmem code changes
The Fix
1. Explicit Makefile targets:
TINY_BENCH_OBJS = hakmem.o hakmem_config.o ... (all hakmem objects)
bench_tiny: bench_tiny.o $(TINY_BENCH_OBJS)
$(CC) -o $@ $^ $(LDFLAGS)
@echo "✓ bench_tiny built with hakmem"
2. Verification script (verify_bench.sh):
$ ./verify_bench.sh ./bench_tiny
✅ hakmem symbols: 119
✅ Binary size: 156KB
✅ Verification PASSED
3. Checklist (BENCHMARKING_CHECKLIST.md):
- Pre-benchmark verification steps
- Post-benchmark validation
- Prevents future mistakes
📊 True Performance Comparison
Benchmark Setup
Workload: bench_tiny
- 16B allocations
- 100 alloc → 100 free × 10M iterations
- Total: 2B operations (1B alloc + 1B free)
Environment:
- Platform: WSL2 (Linux 5.15.167.4-microsoft-standard-WSL2)
- CPU: (not specified, but modern x86-64)
- Compiler: gcc -O3 -march=native -mtune=native
Results
| Allocator | Throughput | Latency | vs hakmem | Notes |
|---|---|---|---|---|
| mimalloc | 908.31 M ops/sec | 1.1 ns/op | 8.8x faster | LD_PRELOAD=libmimalloc.so.2 |
| glibc | 364.00 M ops/sec | 2.7 ns/op | 3.5x faster | System default |
| hakmem | 103.35 M ops/sec | 9.7 ns/op | 1.0x (baseline) | Phase 4.2 |
Analysis
mimalloc dominance:
- 908 M ops/sec is exceptionally fast
- 1.1 ns/op ≈ 3-4 CPU cycles (assuming 3 GHz)
- Near theoretical minimum for malloc/free
- Result of 10+ years of optimization
hakmem gap:
- 8.8x slower than mimalloc
- 3.5x slower than glibc
- 8.6 ns overhead vs mimalloc
Note: the earlier ANALYSIS_SUMMARY.md stated mimalloc = 14 ns/op, but that figure came from a different benchmark (likely with more complex allocation patterns).
🔬 Bottleneck Analysis
Estimated 8.6 ns Breakdown
Based on code review (perf unavailable in WSL2):
| Component | Estimated Cost | Details |
|---|---|---|
| Registry lookup | 1-2 ns | Linear probing hash (max 8 probes), lock-free |
| Bitmap operations | 1-2 ns | set_used/free + summary bitmap update |
| Stats (batched TLS) | 0.5 ns | TLS increment (Phase 3 optimization) |
| Mini-magazine | 0.5-1 ns | pop/push operations (Phase 2 addition) |
| Other overhead | 3-5 ns | Function calls, branches, cache misses |
| Total | 6.5-10.5 ns | Matches measured 9.7 ns ✅ |
Key Findings
1. Registry lookup (O(1) but not free):
static int g_use_registry = 1; // Enabled by default
// registry_lookup(): lock-free but linear probing
- O(1) average case
- Lock-free (good!)
- But: linear probing up to 8 probes
- Cost: ~1-2 ns per lookup
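A minimal sketch of such a lock-free, bounded linear-probing lookup (slot count, hash function, and the 64 KiB slab alignment are assumptions, not hakmem's actual values):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define REG_SIZE   1024                   /* power of two, hypothetical */
#define MAX_PROBES 8                      /* matches the "max 8 probes" note */
#define SLAB_MASK  (~(uintptr_t)0xFFFF)   /* assume 64 KiB-aligned slabs */

typedef struct { _Atomic uintptr_t key; void *slab; } reg_entry_t;
static reg_entry_t g_registry[REG_SIZE];  /* zero-initialized: key 0 = empty */

static size_t reg_hash(uintptr_t base) {
    return ((base >> 16) * 2654435761u) & (REG_SIZE - 1);
}

/* Publish a slab base into the table via CAS on the key slot. */
static int registry_insert(void *slab_base) {
    uintptr_t base = (uintptr_t)slab_base;
    size_t h = reg_hash(base);
    for (int i = 0; i < MAX_PROBES; i++) {
        size_t idx = (h + i) & (REG_SIZE - 1);
        uintptr_t expect = 0;
        if (atomic_compare_exchange_strong(&g_registry[idx].key, &expect, base)) {
            g_registry[idx].slab = slab_base;  /* real code needs release ordering here */
            return 1;
        }
    }
    return 0;   /* probe limit reached: table too crowded */
}

/* Lock-free lookup: mask to the slab base, hash, probe at most 8 slots. */
static void *registry_lookup(void *ptr) {
    uintptr_t base = (uintptr_t)ptr & SLAB_MASK;
    size_t h = reg_hash(base);
    for (int i = 0; i < MAX_PROBES; i++) {
        size_t idx = (h + i) & (REG_SIZE - 1);
        uintptr_t k = atomic_load_explicit(&g_registry[idx].key, memory_order_acquire);
        if (k == base) return g_registry[idx].slab;   /* hit */
        if (k == 0)    return NULL;                   /* empty slot: unregistered */
    }
    return NULL;
}
```

Even in the hit case this costs a mask, a multiply, and at least one cache-line load, which is consistent with the ~1-2 ns estimate.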
2. Bitmap overhead:
hak_tiny_set_used(slab, block_idx);
// Updates bitmap + summary bitmap
- Two-tier bitmap for fast empty-word detection
- Each set/free updates both levels
- Cost: ~1-2 ns
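The two-tier scheme can be illustrated as follows (sizes and names are hypothetical, not hakmem's actual layout):

```c
#include <stdint.h>

/* Two-tier bitmap sketch: 1024 blocks tracked by 16 x 64-bit words,
 * plus one 16-bit summary word marking which words still have a free block. */
#define NBLOCKS 1024
#define NWORDS  (NBLOCKS / 64)

typedef struct {
    uint64_t used[NWORDS];   /* bit set = block in use */
    uint16_t nonfull;        /* bit set = word has at least one free block */
} tiny_bitmap_t;

static void bm_init(tiny_bitmap_t *bm) {
    for (int i = 0; i < NWORDS; i++) bm->used[i] = 0;
    bm->nonfull = (uint16_t)~0u;             /* everything free initially */
}

/* Both set and clear touch two levels -- the ~1-2 ns cost noted above. */
static void bm_set_used(tiny_bitmap_t *bm, int idx) {
    int w = idx / 64, b = idx % 64;
    bm->used[w] |= (uint64_t)1 << b;
    if (bm->used[w] == ~(uint64_t)0)         /* word became full */
        bm->nonfull &= (uint16_t)~(1u << w);
}

static void bm_set_free(tiny_bitmap_t *bm, int idx) {
    int w = idx / 64, b = idx % 64;
    bm->used[w] &= ~((uint64_t)1 << b);
    bm->nonfull |= (uint16_t)(1u << w);      /* word has a free block again */
}

/* Find a free block: scan the summary first, then one word with ctz. */
static int bm_find_free(const tiny_bitmap_t *bm) {
    if (!bm->nonfull) return -1;             /* slab completely full */
    int w = __builtin_ctz(bm->nonfull);
    return w * 64 + __builtin_ctzll(~bm->used[w]);
}
```

The payoff of the summary word is that a full word is skipped with one bit test instead of a 64-bit scan; the cost is the second update on every set/free.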
3. Mini-magazine (Phase 2):
- Added for "fast path" (1-2 ns vs 5-6 ns bitmap)
- But: adds extra layer vs direct TLS cache
- Cost: ~0.5-1 ns
4. Stats (Phase 3 - optimized):
- Batched TLS counters (0.5 ns)
- Good! (vs 10-15 ns XOR RNG before)
5. Unaccounted overhead (~3-5 ns):
- Function call overhead
- Branch mispredictions
- Cache misses
- Data structure indirection
💡 Why is mimalloc so fast?
mimalloc Architecture (inferred)
1. Minimalist TLS cache:
- Direct pointer to free list
- Single pointer dereference
- Near-zero overhead
2. Inline metadata:
- Page metadata embedded in page
- O(1) pointer→page calculation
- No hash table needed
3. Lock-free everywhere:
- Thread-local pages
- Atomic operations only for shared structures
- No pthread mutexes in hot path
4. Cache-optimized:
- Sequential layout
- Prefetch-friendly
- Minimal pointer chasing
5. Extreme simplicity:
- Fewer layers
- Fewer branches
- Fewer memory accesses
hakmem's complexity:
- TLS Magazine → Mini-magazine → Bitmap
- 3 layers vs mimalloc's 1-2 layers
- Each layer adds overhead
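The inline-metadata idea in point 2 can be shown in a few lines (the 64 KiB page size and struct layout are illustrative, not mimalloc's actual design):

```c
#include <stdint.h>
#include <stdlib.h>

/* If every page is a 64 KiB-aligned region with its header at the start,
 * pointer-to-metadata is a single AND -- no registry, no hashing, no probing. */
#define PAGE_SIZE  (64 * 1024)
#define PAGE_MASK  (~(uintptr_t)(PAGE_SIZE - 1))

typedef struct page_meta {
    uint32_t block_size;    /* size class served by this page */
    uint32_t used_count;    /* live blocks in this page */
    void    *free_list;     /* intrusive free list inside the page */
} page_meta_t;

static inline page_meta_t *ptr_to_page(void *p) {
    return (page_meta_t *)((uintptr_t)p & PAGE_MASK);   /* one AND, zero probes */
}
```

Compare this single mask against hakmem's registry_lookup above: same question ("which slab/page owns this pointer?"), but answered with arithmetic instead of a table walk.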
📈 Progress Summary
What Was Completed Today
Phase 1-3 (Previously, but measured wrong)
- ❌ Phase 1: Mini-magazine infrastructure (measured glibc)
- ❌ Phase 2: Hot path integration (measured glibc)
- ❌ Phase 3: Stats batching (measured glibc)
Phase 4 (Attempted, but measured wrong)
- ❌ Phase 4: TLS spill optimization (measured glibc)
- ❌ Phase 4.1: Micro-optimization A+B (measured glibc)
Critical Fix (Actual Work)
- ✅ Discovered benchmark infrastructure bug
- ✅ Fixed Makefile (explicit targets)
- ✅ Created verify_bench.sh (119 hakmem symbols detected)
- ✅ Created BENCHMARKING_CHECKLIST.md (prevention)
- ✅ Phase 4.2: High-water gating (first correct measurement!)
Analysis & Documentation
- ✅ PHASE4_REGRESSION_ANALYSIS.md (16 KB)
- ✅ PHASE4_IMPROVEMENT_ROADMAP.md (13 KB)
- ✅ BENCHMARKING_CHECKLIST.md (comprehensive)
- ✅ verify_bench.sh (automated verification)
- ✅ Comparative benchmark (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (code review)
🎯 Reality Check
The Gap
mimalloc: 908 M ops/sec (1.1 ns/op)
hakmem:   103 M ops/sec (9.7 ns/op)
Gap:      8.8x
Is This Gap Closeable?
Honest Assessment:
Short-term (Quick Wins - weeks):
- Target: 103 → 150 M ops/sec (+45%)
- Method:
- Reduce bitmap overhead
- Optimize registry hash
- Remove mini-magazine layer (if not helping)
- Tune TLS Magazine capacity
- Realistic: Achievable
Mid-term (Major Optimization - months):
- Target: 150 → 250 M ops/sec (+140% over the 103 M ops/sec baseline)
- Method:
- Redesign data structures
- Inline metadata (like mimalloc)
- Lock-free everywhere
- Cache-optimized layout
- Realistic: Difficult but possible
Long-term (Mimalloc-level - years):
- Target: 250 → 700+ M ops/sec (+580% over the 103 M ops/sec baseline)
- Method:
- Complete architectural overhaul
- Learn from mimalloc source code
- Remove all unnecessary layers
- Extreme optimization
- Realistic: Requires mimalloc-level expertise
Fundamental Challenge:
- hakmem is a research allocator (visibility, experimentation)
- mimalloc is a production allocator (speed, simplicity)
- These goals are inherently in conflict
Research features that add overhead:
- Bitmap (vs inline metadata)
- Statistics (even batched TLS)
- Multiple layers (TLS Mag → Mini-mag → Bitmap)
- Flexibility (registry toggle, SuperSlab, etc.)
🚀 Next Steps
Option A: Accept Reality (Recommended)
Goal: hakmem as research platform, not production allocator
Approach:
- Document current performance (103 M ops/sec baseline)
- Focus on research features:
- Bitmap visibility for debugging
- Statistics for analysis
- Experimentation platform
- Optimize moderately:
- Quick wins (103 → 150 M ops/sec)
- Don't sacrifice research goals for speed
- Accept 5-8x slower than mimalloc as reasonable trade-off
Outcome: Honest, useful research tool
Option B: Pursue Performance (Challenging)
Goal: Close the gap to 2-3x slower than mimalloc
Phase 1: Quick Wins (Target: 150 M ops/sec)
- Remove mini-magazine if not helping
- Optimize registry hash (reduce collisions)
- Tune TLS Magazine capacity
- Inline hot functions
- Profile with proper tools (perf, gprof)
Phase 2: Architectural (Target: 250 M ops/sec)
- Inline metadata (like mimalloc)
- Simplify data structures
- Remove unnecessary layers
- Cache-optimized layout
Phase 3: Extreme (Target: 400+ M ops/sec)
- Study mimalloc source code deeply
- Adopt mimalloc techniques
- Sacrifice research features if needed
Outcome: High-performance allocator, less research-friendly
Option C: Hybrid Approach (Balanced)
Goal: Research + reasonable performance
Approach:
- Default mode: Research-friendly (bitmap, stats, visibility)
- Performance: 100-150 M ops/sec
- Fast mode: Production-optimized (inline metadata, no stats)
- Performance: 300-500 M ops/sec
- Toggle:
HAKMEM_FAST_MODE=1
Outcome: Best of both worlds, more complexity
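One plausible shape for the toggle (the read-once caching pattern is an assumption; hakmem's actual config handling may differ):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical HAKMEM_FAST_MODE toggle: read the environment variable once
 * at first use and cache it, so the hot path pays only a predictable branch
 * rather than a getenv() call per allocation. */
static int g_fast_mode = -1;   /* -1 = not yet initialized */

static int hak_fast_mode(void) {
    if (g_fast_mode < 0) {
        const char *s = getenv("HAKMEM_FAST_MODE");
        g_fast_mode = (s && strcmp(s, "1") == 0) ? 1 : 0;
    }
    return g_fast_mode;
}
```

The allocator would then branch once per call site, e.g. `if (hak_fast_mode()) return fast_alloc(sz);`, keeping the research-friendly path as the default.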
📚 Documentation Created Today
| File | Size | Purpose |
|---|---|---|
| PHASE4_REGRESSION_ANALYSIS.md | 16 KB | Root cause analysis of Phase 4 regression |
| PHASE4_IMPROVEMENT_ROADMAP.md | 13 KB | Phased improvement plan (4.1→4.2→4.3→4.4) |
| BENCHMARKING_CHECKLIST.md | 11 KB | Prevention guide (never make same mistake) |
| verify_bench.sh | 2 KB | Automated verification (119 symbols check) |
| bench_hakmem.sh | 2 KB | Benchmark script template |
| SESSION_SUMMARY_2025-10-26.md | (this file) | Comprehensive session summary |
Total: ~44 KB of documentation
🎓 Lessons Learned
Technical Lessons
1. Makefile implicit rules are dangerous
   - Always define explicit targets
   - Especially for critical binaries
2. "It works" ≠ "It's correct"
   - bench_tiny "worked" (no errors)
   - But it measured the wrong thing entirely
3. Verification is essential
   - nm to check symbols
   - ldd to check linked libraries
   - size to check binary size
4. Performance claims need causality
   - Code change → performance change?
   - Or just measurement noise?
5. Benchmarking is hard
   - Same code, same compiler flags, same workload
   - Verify what you're actually measuring!
Process Lessons
1. Document procedures
   - Checklists prevent mistakes
   - Future-you will thank you
2. Automate verification
   - Scripts don't forget
   - verify_bench.sh (119 symbols)
3. Question surprising results
   - "Phase 3 improved 1.3%" - was it real?
   - If glibc improved, that'd be newsworthy!
4. Be honest about limitations
   - hakmem is 8.8x slower
   - That's okay for research!
5. Celebrate discoveries
   - Found a fundamental issue
   - Fixed it permanently
   - That's progress!
💭 Reflections
What Went Well ✅
1. Fixed fundamental infrastructure bug
   - Will never happen again
   - Proper tooling in place
2. Honest performance comparison
   - Now know the true baseline
   - Can set realistic goals
3. Comprehensive documentation
   - 44 KB of useful docs
   - Future-proofing
4. Learned from ChatGPT Pro
   - Gating strategy
   - Pull-model architecture
   - Batching techniques
What Could Be Better 🤔
1. Should have verified linkage earlier
   - Wasted time on Phase 1-4.1
   - But: learned valuable lessons!
2. Need better profiling tools
   - perf doesn't work in WSL2
   - Consider: gprof, valgrind, manual timing
3. Architectural complexity
   - 3 layers (TLS Mag → Mini-mag → Bitmap)
   - Each adds overhead
   - Simplification opportunity?
🎯 Recommendation
For tomorrow (if continuing):
Immediate Actions
1. Run verify_bench.sh before any benchmark
   ./verify_bench.sh ./bench_tiny
2. Re-measure Phase 0 (baseline) with correct infrastructure
   - Before Phase 1
   - True starting point
3. Profile with working tools
   - gprof (if available)
   - Manual timing (HAKMEM_DEBUG_TIMING=1)
   - Identify true bottlenecks
4. Implement one Quick Win
   - Start with the easiest improvement
   - Measure impact properly
   - Document the result
Strategic Decision
Choose a path:
- Path A: Research focus (accept 5-8x slower)
- Path B: Performance pursuit (target 2-3x slower)
- Path C: Hybrid (modes for different use cases)
Recommended: Path A (Research focus)
- Honest about trade-offs
- Useful research tool
- Moderate optimization (Quick Wins)
- Don't sacrifice research goals
📊 Final Numbers
Before Today (Wrong)
| Phase | Throughput | Note |
|---|---|---|
| Phase 2 | 361 M ops/sec | ❌ glibc malloc |
| Phase 3 | 391 M ops/sec | ❌ glibc malloc |
| Phase 4 | 373 M ops/sec | ❌ glibc malloc |
| Phase 4.1 | 381 M ops/sec | ❌ glibc malloc |
After Today (Correct)
| Allocator | Throughput | Latency | Verified |
|---|---|---|---|
| hakmem (Phase 4.2) | 103 M ops/sec | 9.7 ns/op | ✅ 119 symbols |
| glibc malloc | 364 M ops/sec | 2.7 ns/op | ✅ No hakmem |
| mimalloc | 908 M ops/sec | 1.1 ns/op | ✅ LD_PRELOAD |
Gap to close: 8.8x (mimalloc) or 3.5x (glibc)
🙏 Acknowledgments
- ChatGPT Pro: Excellent architectural advice (Gating, Batching, pull-model)
- User (にゃーん): Persistent questioning led to discovery
- verify_bench.sh: Will prevent future mistakes
📅 Session Stats
- Duration: ~8 hours
- Commits: 6
- Files Created: 6
- Lines of Code: ~500
- Documentation: ~44 KB
- Major Discovery: 1 (benchmark infrastructure bug)
- Performance Gap Discovered: 8.8x vs mimalloc
✨ Conclusion
Today was tough 😿 but extremely valuable ✨
Bad News:
- Phase 1-4.1 measurements were wrong
- hakmem is 8.8x slower than mimalloc
- Big gap to close
Good News:
- Found fundamental infrastructure bug
- Fixed it permanently (verify_bench.sh + checklist)
- Know true baseline now (103 M ops/sec)
- Can set realistic goals going forward
- Comprehensive documentation for future
Moving Forward:
- Be honest about performance
- Focus on research value
- Optimize pragmatically
- Never make same mistake again
Final Thought:
"It's better to know uncomfortable truth than comfortable lie."
hakmem is at 103 M ops/sec. That's the reality. But as a research platform, that may well be enough performance, nya.
Nyaan! Thanks for another full day of hard work, nya! 😸✨
Generated: 2025-10-26 Session: Phase 4 Optimization & Infrastructure Fix Status: Documented & Lessons Learned