# Session Summary - 2025-10-26
## TL;DR
**Critical discovery**: every benchmark in Phases 1-4.1 was actually measuring **glibc malloc**.
Measuring hakmem correctly shows it is **8.8x slower than mimalloc**.
**Achievements**:
- ✅ Fixed the benchmark infrastructure (with safeguards so the mistake cannot recur)
- ✅ Accurate performance comparison: hakmem vs glibc vs mimalloc
- ✅ Bottleneck analysis (estimated)
- ✅ Created verification tools and a checklist
**Reality**:
- hakmem: 103 M ops/sec (9.7 ns/op)
- mimalloc: **908 M ops/sec (1.1 ns/op)** - 8.8x faster
- glibc: 364 M ops/sec (2.7 ns/op) - 3.5x faster
---
## 📅 Timeline
### Morning: Phase 4 Regression Analysis
- Phase 4: 373-380 M ops/sec (-3.6% vs Phase 3)
- Created PHASE4_REGRESSION_ANALYSIS.md
- Created PHASE4_IMPROVEMENT_ROADMAP.md
- ChatGPT Pro advice: gating + batching + pull-style refill
### Afternoon: Phase 4 Improvements
- Phase 4.1: Option A+B micro-optimization
- Result: 381 M ops/sec (+1-2%)
- Phase 4.2: High-water gating
- Result: 103 M ops/sec (???)
### Evening: Critical Discovery 🚨
- **All benchmarks were measuring glibc malloc!**
- Root cause: Makefile implicit rule
- bench_tiny.c didn't link hakmem
### Night: Infrastructure Fix + Reality Check
- Fixed Makefile (explicit targets)
- Created verify_bench.sh
- Created BENCHMARKING_CHECKLIST.md
- **True measurement**: hakmem = 103 M ops/sec
- **Comparison**: mimalloc = 908 M ops/sec (8.8x faster!)
---
## 🔍 Critical Discovery Details
### The Mistake
**What happened**:
```bash
# Before (wrong)
make bench_tiny
→ gcc bench_tiny.c -o bench_tiny # Makefile implicit rule!
→ No hakmem linkage
→ Using glibc malloc
# All reported numbers were glibc malloc:
Phase 3: 391 M ops/sec (glibc)
Phase 4: 373 M ops/sec (glibc)
Phase 4.1: 381 M ops/sec (glibc)
```
**Root Cause**:
- Makefile had no explicit `bench_tiny` target
- `bench_tiny.c` only calls `malloc/free` (no hakmem.h include)
- System malloc was used by default
- No errors → looked like it was working
**Why performance "changed"**:
- 361→391 M ops/sec (8.3% variation) was **measurement noise**
- CPU load, cache state, system activity
- **No relation to hakmem code changes**
---
### The Fix
**1. Explicit Makefile targets**:
```makefile
TINY_BENCH_OBJS = hakmem.o hakmem_config.o ...  # (all hakmem objects)

bench_tiny: bench_tiny.o $(TINY_BENCH_OBJS)
	$(CC) -o $@ $^ $(LDFLAGS)
	@echo "✓ bench_tiny built with hakmem"
```
**2. Verification script** (`verify_bench.sh`):
```bash
$ ./verify_bench.sh ./bench_tiny
✅ hakmem symbols: 119
✅ Binary size: 156KB
✅ Verification PASSED
```
**3. Checklist** (`BENCHMARKING_CHECKLIST.md`):
- Pre-benchmark verification steps
- Post-benchmark validation
- Prevents future mistakes
---
## 📊 True Performance Comparison
### Benchmark Setup
**Workload**: bench_tiny
- 16B allocations
- 100 alloc → 100 free × 10M iterations
- Total: 2B operations (1B alloc + 1B free)
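The workload above can be sketched as a simple loop (a hypothetical reconstruction for illustration; bench_tiny's actual source may differ):

```c
#include <stdlib.h>

#define BATCH 100           /* 100 allocations per batch */

/* Hot loop: 100 x 16B allocs, then 100 frees, repeated `iters` times.
 * With iters = 10M this gives 2 * 100 * 10M = 2B malloc/free operations. */
unsigned long bench_tiny_loop(unsigned long iters) {
    void *slots[BATCH];
    unsigned long ops = 0;
    for (unsigned long i = 0; i < iters; i++) {
        for (int j = 0; j < BATCH; j++) slots[j] = malloc(16);
        for (int j = 0; j < BATCH; j++) free(slots[j]);
        ops += 2 * BATCH;
    }
    return ops;
}
```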
**Environment**:
- Platform: WSL2 (Linux 5.15.167.4-microsoft-standard-WSL2)
- CPU: (not specified, but modern x86-64)
- Compiler: gcc -O3 -march=native -mtune=native
### Results
| Allocator | Throughput | Latency | vs hakmem | Notes |
|-----------|------------|---------|-----------|-------|
| **mimalloc** | **908.31 M ops/sec** | **1.1 ns/op** | **8.8x faster** | LD_PRELOAD=libmimalloc.so.2 |
| **glibc** | 364.00 M ops/sec | 2.7 ns/op | 3.5x faster | System default |
| **hakmem** | 103.35 M ops/sec | 9.7 ns/op | 1.0x (baseline) | Phase 4.2 |
### Analysis
**mimalloc dominance**:
- 908 M ops/sec is **exceptionally fast**
- 1.1 ns/op ≈ 3-4 CPU cycles (assuming 3 GHz)
- Near theoretical minimum for malloc/free
- Result of 10+ years of optimization
**hakmem gap**:
- 8.8x slower than mimalloc
- 3.5x slower than glibc
- **8.6 ns overhead** vs mimalloc
**Note**: The earlier ANALYSIS_SUMMARY.md stated mimalloc = 14 ns/op, but that figure came from a different benchmark (likely with more complex allocation patterns).
---
## 🔬 Bottleneck Analysis
### Estimated 8.6 ns Breakdown
Based on code review (perf unavailable in WSL2):
| Component | Estimated Cost | Details |
|-----------|---------------|---------|
| Registry lookup | 1-2 ns | Linear probing hash (max 8 probes), lock-free |
| Bitmap operations | 1-2 ns | set_used/free + summary bitmap update |
| Stats (batched TLS) | 0.5 ns | TLS increment (Phase 3 optimization) |
| Mini-magazine | 0.5-1 ns | pop/push operations (Phase 2 addition) |
| Other overhead | 3-5 ns | Function calls, branches, cache misses |
| **Total** | **6-10.5 ns** | **Matches measured 9.7 ns** ✅ |
### Key Findings
**1. Registry lookup (O(1) but not free)**:
```c
static int g_use_registry = 1; // Enabled by default
// registry_lookup(): lock-free but linear probing
```
- O(1) average case
- Lock-free (good!)
- But: linear probing up to 8 probes
- Cost: ~1-2 ns per lookup
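A lock-free linear-probing lookup of this shape can be sketched as follows (illustrative, not hakmem's actual code: table size, hash, and the `registry_insert` helper are assumptions, and real code would publish entries with C11 atomics):

```c
#include <stdint.h>
#include <stddef.h>

#define REG_SIZE 1024   /* power of two, illustrative */
#define MAX_PROBES 8

typedef struct { uintptr_t base; void *slab; } reg_entry_t;
static reg_entry_t g_registry[REG_SIZE];

/* Hypothetical helper: register a page base -> slab mapping. */
static int registry_insert(uintptr_t page_base, void *slab) {
    size_t h = (page_base >> 12) & (REG_SIZE - 1);
    for (int i = 0; i < MAX_PROBES; i++) {
        reg_entry_t *e = &g_registry[(h + i) & (REG_SIZE - 1)];
        if (e->base == 0) { e->slab = slab; e->base = page_base; return 1; }
    }
    return 0;  /* probe limit reached: caller must handle overflow */
}

/* Lock-free lookup: hash the page base, then probe linearly up to
 * MAX_PROBES slots. Each probe is one cache-line load, which is where
 * the estimated ~1-2 ns per lookup comes from. */
static void *registry_lookup(uintptr_t page_base) {
    size_t h = (page_base >> 12) & (REG_SIZE - 1);
    for (int i = 0; i < MAX_PROBES; i++) {
        reg_entry_t *e = &g_registry[(h + i) & (REG_SIZE - 1)];
        if (e->base == page_base) return e->slab;
        if (e->base == 0) return NULL;  /* empty slot: not registered */
    }
    return NULL;
}
```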
**2. Bitmap overhead**:
```c
hak_tiny_set_used(slab, block_idx);
// Updates bitmap + summary bitmap
```
- Two-tier bitmap for fast empty-word detection
- Each set/free updates both levels
- Cost: ~1-2 ns
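A two-tier bitmap of this shape can be sketched as below (field names and sizes are illustrative assumptions, not hakmem's actual layout):

```c
#include <stdint.h>

#define WORDS 16  /* 16 x 64 = 1024 blocks per slab, illustrative */

typedef struct {
    uint64_t used[WORDS];   /* tier 1: one bit per block */
    uint16_t summary;       /* tier 2: bit w set => used[w] is full */
} tiny_bitmap_t;

/* Mark a block used: set its bit, and if the word became full,
 * set the summary bit so allocation can skip the word entirely. */
static void bm_set_used(tiny_bitmap_t *bm, int idx) {
    int w = idx / 64, b = idx % 64;
    bm->used[w] |= 1ULL << b;
    if (bm->used[w] == ~0ULL) bm->summary |= (uint16_t)(1u << w);
}

/* Free a block: clear its bit and the summary bit (the word now
 * definitely has at least one free slot). Both set and free touch
 * both tiers, which is the per-operation cost noted above. */
static void bm_set_free(tiny_bitmap_t *bm, int idx) {
    int w = idx / 64, b = idx % 64;
    bm->used[w] &= ~(1ULL << b);
    bm->summary &= (uint16_t)~(1u << w);
}
```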
**3. Mini-magazine (Phase 2)**:
- Added for "fast path" (1-2 ns vs 5-6 ns bitmap)
- But: adds extra layer vs direct TLS cache
- Cost: ~0.5-1 ns
**4. Stats (Phase 3 - optimized)**:
- Batched TLS counters (0.5 ns)
- Good! (vs 10-15 ns XOR RNG before)
**5. Unaccounted overhead (~3-5 ns)**:
- Function call overhead
- Branch mispredictions
- Cache misses
- Data structure indirection
---
## 💡 Why is mimalloc so fast?
### mimalloc Architecture (inferred)
**1. Minimalist TLS cache**:
- Direct pointer to free list
- Single pointer dereference
- Near-zero overhead
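That fast path is essentially a singly-linked thread-local free list; a minimal sketch of the idea (not mimalloc's actual code):

```c
#include <stddef.h>

typedef struct block { struct block *next; } block_t;

/* Per-thread free list head: the hot path is one load + one store. */
static _Thread_local block_t *tls_free_list;

static void *fast_alloc(void) {
    block_t *b = tls_free_list;
    if (b) tls_free_list = b->next;  /* pop head */
    return b;                        /* NULL => take the slow path */
}

static void fast_free(void *p) {
    block_t *b = p;
    b->next = tls_free_list;  /* push head */
    tls_free_list = b;
}
```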
**2. Inline metadata**:
- Page metadata embedded in page
- O(1) pointer→page calculation
- No hash table needed
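The O(1) pointer→page calculation relies on alignment: because pages are allocated on a fixed power-of-two boundary, masking the low bits of any block pointer yields the page header directly. A sketch of the idea (constants and struct fields are illustrative, not mimalloc's real layout):

```c
#include <stdint.h>

#define PAGE_SIZE (64 * 1024)  /* illustrative page alignment */

typedef struct page_meta {
    void    *free_list;   /* metadata lives at the start of the page */
    uint32_t block_size;
} page_meta_t;

/* O(1) pointer -> page metadata: mask off the low bits.
 * No hash table, no probing, no extra memory access. */
static page_meta_t *ptr_to_page(void *p) {
    return (page_meta_t *)((uintptr_t)p & ~(uintptr_t)(PAGE_SIZE - 1));
}
```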
**3. Lock-free everywhere**:
- Thread-local pages
- Atomic operations only for shared structures
- No pthread mutexes in hot path
**4. Cache-optimized**:
- Sequential layout
- Prefetch-friendly
- Minimal pointer chasing
**5. Extreme simplicity**:
- Fewer layers
- Fewer branches
- Fewer memory accesses
**hakmem's complexity**:
- TLS Magazine → Mini-magazine → Bitmap
- 3 layers vs mimalloc's 1-2 layers
- Each layer adds overhead
---
## 📈 Progress Summary
### What Was Completed Today
#### Phase 1-3 (Previously, but measured wrong)
- ❌ Phase 1: Mini-magazine infrastructure (measured glibc)
- ❌ Phase 2: Hot path integration (measured glibc)
- ❌ Phase 3: Stats batching (measured glibc)
#### Phase 4 (Attempted, but measured wrong)
- ❌ Phase 4: TLS spill optimization (measured glibc)
- ❌ Phase 4.1: Micro-optimization A+B (measured glibc)
#### Critical Fix (Actual Work)
- ✅ Discovered benchmark infrastructure bug
- ✅ Fixed Makefile (explicit targets)
- ✅ Created verify_bench.sh (119 hakmem symbols detected)
- ✅ Created BENCHMARKING_CHECKLIST.md (prevention)
- ✅ Phase 4.2: High-water gating (first correct measurement!)
#### Analysis & Documentation
- ✅ PHASE4_REGRESSION_ANALYSIS.md (16 KB)
- ✅ PHASE4_IMPROVEMENT_ROADMAP.md (13 KB)
- ✅ BENCHMARKING_CHECKLIST.md (comprehensive)
- ✅ verify_bench.sh (automated verification)
- ✅ Comparative benchmark (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (code review)
---
## 🎯 Reality Check
### The Gap
**mimalloc**: 908 M ops/sec (1.1 ns/op)
**hakmem**: 103 M ops/sec (9.7 ns/op)
**Gap**: **8.8x**
### Is This Gap Closeable?
**Honest Assessment**:
**Short-term** (Quick Wins - weeks):
- Target: 103 → 150 M ops/sec (+45%)
- Method:
- Reduce bitmap overhead
- Optimize registry hash
- Remove mini-magazine layer (if not helping)
- Tune TLS Magazine capacity
- Realistic: **Achievable**
**Mid-term** (Major Optimization - months):
- Target: 150 → 250 M ops/sec (+140%)
- Method:
- Redesign data structures
- Inline metadata (like mimalloc)
- Lock-free everywhere
- Cache-optimized layout
- Realistic: **Difficult but possible**
**Long-term** (Mimalloc-level - years):
- Target: 250 → 700+ M ops/sec (+580%)
- Method:
- Complete architectural overhaul
- Learn from mimalloc source code
- Remove all unnecessary layers
- Extreme optimization
- Realistic: **Requires mimalloc-level expertise**
**Fundamental Challenge**:
- hakmem is a **research allocator** (visibility, experimentation)
- mimalloc is a **production allocator** (speed, simplicity)
- These goals are **inherently in conflict**
**Research features that add overhead**:
- Bitmap (vs inline metadata)
- Statistics (even batched TLS)
- Multiple layers (TLS Mag → Mini-mag → Bitmap)
- Flexibility (registry toggle, SuperSlab, etc.)
---
## 🚀 Next Steps
### Option A: Accept Reality (Recommended)
**Goal**: hakmem as **research platform**, not production allocator
**Approach**:
1. Document current performance (103 M ops/sec baseline)
2. Focus on **research features**:
- Bitmap visibility for debugging
- Statistics for analysis
- Experimentation platform
3. Optimize **moderately**:
- Quick wins (103 → 150 M ops/sec)
- Don't sacrifice research goals for speed
4. **Accept 5-8x slower than mimalloc** as reasonable trade-off
**Outcome**: Honest, useful research tool
---
### Option B: Pursue Performance (Challenging)
**Goal**: Close the gap to 2-3x slower than mimalloc
**Phase 1: Quick Wins** (Target: 150 M ops/sec)
- Remove mini-magazine if not helping
- Optimize registry hash (reduce collisions)
- Tune TLS Magazine capacity
- Inline hot functions
- Profile with proper tools (perf, gprof)
**Phase 2: Architectural** (Target: 250 M ops/sec)
- Inline metadata (like mimalloc)
- Simplify data structures
- Remove unnecessary layers
- Cache-optimized layout
**Phase 3: Extreme** (Target: 400+ M ops/sec)
- Study mimalloc source code deeply
- Adopt mimalloc techniques
- Sacrifice research features if needed
**Outcome**: High-performance allocator, less research-friendly
---
### Option C: Hybrid Approach (Balanced)
**Goal**: Research + reasonable performance
**Approach**:
1. **Default mode**: Research-friendly (bitmap, stats, visibility)
- Performance: 100-150 M ops/sec
2. **Fast mode**: Production-optimized (inline metadata, no stats)
- Performance: 300-500 M ops/sec
- Toggle: `HAKMEM_FAST_MODE=1`
**Outcome**: Best of both worlds, more complexity
---
## 📚 Documentation Created Today
| File | Size | Purpose |
|------|------|---------|
| PHASE4_REGRESSION_ANALYSIS.md | 16 KB | Root cause analysis of Phase 4 regression |
| PHASE4_IMPROVEMENT_ROADMAP.md | 13 KB | Phased improvement plan (4.1→4.2→4.3→4.4) |
| BENCHMARKING_CHECKLIST.md | 11 KB | Prevention guide (never make same mistake) |
| verify_bench.sh | 2 KB | Automated verification (119 symbols check) |
| bench_hakmem.sh | 2 KB | Benchmark script template |
| SESSION_SUMMARY_2025-10-26.md | (this file) | Comprehensive session summary |
**Total**: ~44 KB of documentation
---
## 🎓 Lessons Learned
### Technical Lessons
1. **Makefile implicit rules are dangerous**
- Always define explicit targets
- Especially for critical binaries
2. **"It works" ≠ "It's correct"**
- bench_tiny "worked" (no errors)
- But measured wrong thing entirely
3. **Verification is essential**
- `nm` to check symbols
- `ldd` to check libraries
- `size` to check binary size
4. **Performance claims need causality**
- Code change → performance change?
- Or just measurement noise?
5. **Benchmarking is hard**
- Same code
- Same compiler flags
- Same workload
- Verify what you're measuring!
### Process Lessons
1. **Document procedures**
- Checklists prevent mistakes
- Future-you will thank you
2. **Automate verification**
- Scripts don't forget
- verify_bench.sh (119 symbols)
3. **Question surprising results**
- "Phase 3 improved 1.3%" - Was it real?
- If glibc improved, that'd be newsworthy!
4. **Be honest about limitations**
- hakmem is 8.8x slower
- That's okay for research!
5. **Celebrate discoveries**
- Found a fundamental issue
- Fixed it permanently
- That's progress!
---
## 💭 Reflections
### What Went Well ✅
1. **Fixed fundamental infrastructure bug**
- Will never happen again
- Proper tooling in place
2. **Honest performance comparison**
- Now know true baseline
- Can set realistic goals
3. **Comprehensive documentation**
- 44 KB of useful docs
- Future-proofing
4. **Learned from ChatGPT Pro**
- Gating strategy
- Pull-style architecture
- Batching techniques
### What Could Be Better 🤔
1. **Should have verified linkage earlier**
- Wasted time on Phase 1-4.1
- But: learned valuable lessons!
2. **Need better profiling tools**
- perf doesn't work in WSL2
- Consider: gprof, valgrind, manual timing
3. **Architectural complexity**
- 3 layers (TLS Mag → Mini-mag → Bitmap)
- Each adds overhead
- Simplification opportunity?
---
## 🎯 Recommendation
**For tomorrow** (if continuing):
### Immediate Actions
1. **Run verify_bench.sh** before any benchmark
```bash
./verify_bench.sh ./bench_tiny
```
2. **Re-measure Phase 0 (baseline)** with correct infrastructure
- Before Phase 1
- True starting point
3. **Profile with working tools**
- gprof (if available)
- Manual timing (HAKMEM_DEBUG_TIMING=1)
- Identify true bottlenecks
4. **Implement one Quick Win**
- Start with easiest improvement
- Measure impact properly
- Document result
### Strategic Decision
**Choose a path**:
- **Path A**: Research focus (accept 5-8x slower)
- **Path B**: Performance pursuit (target 2-3x slower)
- **Path C**: Hybrid (modes for different use cases)
**Recommended**: **Path A** (Research focus)
- Honest about trade-offs
- Useful research tool
- Moderate optimization (Quick Wins)
- Don't sacrifice research goals
---
## 📊 Final Numbers
### Before Today (Wrong)
| Phase | Throughput | Note |
|-------|------------|------|
| Phase 2 | 361 M ops/sec | ❌ glibc malloc |
| Phase 3 | 391 M ops/sec | ❌ glibc malloc |
| Phase 4 | 373 M ops/sec | ❌ glibc malloc |
| Phase 4.1 | 381 M ops/sec | ❌ glibc malloc |
### After Today (Correct)
| Allocator | Throughput | Latency | Verified |
|-----------|------------|---------|----------|
| hakmem (Phase 4.2) | 103 M ops/sec | 9.7 ns/op | ✅ 119 symbols |
| glibc malloc | 364 M ops/sec | 2.7 ns/op | ✅ No hakmem |
| mimalloc | 908 M ops/sec | 1.1 ns/op | ✅ LD_PRELOAD |
**Gap to close**: 8.8x (mimalloc) or 3.5x (glibc)
---
## 🙏 Acknowledgments
- **ChatGPT Pro**: Excellent architectural advice (gating, batching, pull-style refill)
- **User (にゃーん)**: Persistent questioning led to discovery
- **verify_bench.sh**: Will prevent future mistakes
---
## 📅 Session Stats
- **Duration**: ~8 hours
- **Commits**: 6
- **Files Created**: 6
- **Lines of Code**: ~500
- **Documentation**: ~44 KB
- **Major Discovery**: 1 (benchmark infrastructure bug)
- **Performance Gap Discovered**: 8.8x vs mimalloc
---
## ✨ Conclusion
**Today was tough** 😿 but **extremely valuable**
**Bad News**:
- Phase 1-4.1 measurements were wrong
- hakmem is 8.8x slower than mimalloc
- Big gap to close
**Good News**:
- Found fundamental infrastructure bug
- Fixed it permanently (verify_bench.sh + checklist)
- Know true baseline now (103 M ops/sec)
- Can set realistic goals going forward
- Comprehensive documentation for future
**Moving Forward**:
- Be honest about performance
- Focus on research value
- Optimize pragmatically
- Never make same mistake again
**Final Thought**:
> "It's better to know uncomfortable truth than comfortable lie."
hakmem is at 103 M ops/sec. That's the reality.
But as a **research platform**, that might be performance enough.
Nyaan! Thanks for another full day of work! 😸✨
---
*Generated: 2025-10-26*
*Session: Phase 4 Optimization & Infrastructure Fix*
*Status: Documented & Lessons Learned*