# Session Summary - 2025-10-26

## TL;DR

**Critical discovery**: Every benchmark from Phase 1 through Phase 4.1 was measuring **glibc malloc**, not hakmem.
Once hakmem was measured correctly, it turned out to be **8.8x slower than mimalloc**.

**Outcomes**:
- ✅ Fixed the benchmark infrastructure (with safeguards so the same mistake cannot recur)
- ✅ Accurate performance comparison (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (estimated)
- ✅ Created a verification tool and checklist

**Reality**:
- hakmem: 103 M ops/sec (9.7 ns/op)
- mimalloc: **908 M ops/sec (1.1 ns/op)** - 8.8x faster
- glibc: 364 M ops/sec (2.7 ns/op) - 3.5x faster

---

## 📅 Timeline

### Morning: Phase 4 Regression Analysis
- Phase 4: 373-380 M ops/sec (-3.6% vs Phase 3)
- Created PHASE4_REGRESSION_ANALYSIS.md
- Created PHASE4_IMPROVEMENT_ROADMAP.md
- ChatGPT Pro advice: Gating + Batching + pull-based design

### Afternoon: Phase 4 Improvements
- Phase 4.1: Option A+B micro-optimization
  - Result: 381 M ops/sec (+1-2%)
- Phase 4.2: High-water gating
  - Result: 103 M ops/sec (??? - an unexplained 3.7x drop at the time)

### Evening: Critical Discovery 🚨
- **All benchmarks were measuring glibc malloc!**
- Root cause: Makefile implicit rule
- bench_tiny.c didn't link hakmem

### Night: Infrastructure Fix + Reality Check
- Fixed Makefile (explicit targets)
- Created verify_bench.sh
- Created BENCHMARKING_CHECKLIST.md
- **True measurement**: hakmem = 103 M ops/sec
- **Comparison**: mimalloc = 908 M ops/sec (8.8x faster!)

---

## 🔍 Critical Discovery Details

### The Mistake

**What happened**:
```bash
# Before (wrong)
make bench_tiny
# → gcc bench_tiny.c -o bench_tiny   # Makefile implicit rule!
# → No hakmem linkage
# → Using glibc malloc

# All reported numbers were glibc malloc:
#   Phase 3:   391 M ops/sec (glibc)
#   Phase 4:   373 M ops/sec (glibc)
#   Phase 4.1: 381 M ops/sec (glibc)
```

**Root Cause**:
- Makefile had no explicit `bench_tiny` target, so make fell back to its built-in rule
- `bench_tiny.c` only calls `malloc`/`free` (no hakmem.h include), so it compiles and links without hakmem
- System malloc was used by default
- No errors → looked like it was working
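
One way to turn this silent failure into a loud one is to have the benchmark check at startup that an allocator-specific symbol is really present in the binary. The sketch below assumes a GCC or Clang toolchain on an ELF platform (it relies on weak symbols) and uses a hypothetical marker function, `hak_link_marker`. This summary does not list hakmem's real exported names, so the marker would need to be replaced with a symbol the library actually provides.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical marker exported by the allocator under test. It is declared
 * weak, so the benchmark still links when the library is missing; in that
 * case the symbol's address resolves to NULL. */
extern void hak_link_marker(void) __attribute__((weak));

/* Returns 1 if the allocator's marker symbol was linked in, 0 otherwise. */
int allocator_is_linked(void)
{
    return hak_link_marker != NULL;
}

/* Intended to be called first thing in main(): refuse to benchmark
 * the wrong allocator instead of silently measuring glibc. */
void require_allocator_or_die(const char *bench_name)
{
    assert(bench_name != NULL);
    if (!allocator_is_linked()) {
        fprintf(stderr, "%s: hakmem is not linked; refusing to run\n",
                bench_name);
        exit(EXIT_FAILURE);
    }
}
```

With a guard like this, the `make bench_tiny` build from the transcript above would still succeed, but the resulting binary would exit with an error instead of reporting glibc numbers under hakmem's name.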

**Why performance "changed"**:
- 361→391 M ops/sec (8.3% variation) was **measurement noise**
  - CPU load, cache state, system activity
- **No relation to hakmem code changes**
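
The noise figure can be reproduced from the reported numbers. A minimal sketch, assuming the spread is defined as (max - min) / min over repeated runs; the function name is illustrative:

```c
#include <assert.h>
#include <stddef.h>

/* Relative spread of repeated throughput samples: (max - min) / min.
 * A code change is only worth reporting if its effect clearly exceeds
 * the spread observed between runs of identical code. */
double relative_spread(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double lo = samples[0], hi = samples[0];
    for (size_t i = 1; i < n; i++) {
        if (samples[i] < lo) lo = samples[i];
        if (samples[i] > hi) hi = samples[i];
    }
    return (hi - lo) / lo;
}
```

Feeding in the four glibc-backed results (361, 391, 373, 381 M ops/sec) gives 30 / 361 ≈ 0.083, the 8.3% quoted above. Every "improvement" and "regression" reported for Phases 2 through 4.1 falls inside that band.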

---

### The Fix

**1. Explicit Makefile targets**:
```makefile
# ... stands for the remaining hakmem objects (all of them must be listed)
TINY_BENCH_OBJS = hakmem.o hakmem_config.o ...

bench_tiny: bench_tiny.o $(TINY_BENCH_OBJS)
	$(CC) -o $@ $^ $(LDFLAGS)
	@echo "✓ bench_tiny built with hakmem"
```

**2. Verification script** (`verify_bench.sh`):
```bash
$ ./verify_bench.sh ./bench_tiny
✅ hakmem symbols: 119
✅ Binary size: 156KB
✅ Verification PASSED
```

**3. Checklist** (`BENCHMARKING_CHECKLIST.md`):
- Pre-benchmark verification steps
- Post-benchmark validation
- Guards against repeating this mistake

---

## 📊 True Performance Comparison

### Benchmark Setup

**Workload**: bench_tiny
- 16B allocations
- 100 allocs followed by 100 frees, repeated for 10M iterations
- Total: 2B operations (1B alloc + 1B free)
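
The workload is simple enough to sketch in full. The code below is an illustration of the loop described above, not the actual `bench_tiny.c` source. The timing via `clock_gettime`, the struct, and all names are assumptions.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

enum { BATCH = 100, ALLOC_SIZE = 16 };

typedef struct {
    uint64_t ops;      /* total malloc + free calls */
    double   seconds;  /* wall-clock time for the whole loop */
} bench_result;

/* Each iteration performs BATCH allocations of ALLOC_SIZE bytes and
 * then frees them all. */
bench_result run_tiny_bench(uint64_t iterations)
{
    void *ptrs[BATCH];
    struct timespec t0, t1;
    bench_result r = {0, 0.0};

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t it = 0; it < iterations; it++) {
        for (int i = 0; i < BATCH; i++) {
            ptrs[i] = malloc(ALLOC_SIZE);
            assert(ptrs[i] != NULL);
            /* Touch the block so the compiler cannot elide the pair. */
            *(volatile char *)ptrs[i] = 1;
        }
        for (int i = 0; i < BATCH; i++)
            free(ptrs[i]);
        r.ops += 2u * BATCH;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    r.seconds = (double)(t1.tv_sec - t0.tv_sec)
              + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return r;
}
```

Calling `run_tiny_bench(10000000)` issues the 2B operations listed above; throughput in M ops/sec is then `r.ops / r.seconds / 1e6`. Because the loop only calls `malloc` and `free`, it measures whichever allocator is linked or preloaded, which is exactly why the linkage has to be verified separately.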

**Environment**:
- Platform: WSL2 (Linux 5.15.167.4-microsoft-standard-WSL2)
- CPU: not recorded (modern x86-64)
- Compiler: gcc -O3 -march=native -mtune=native

### Results

| Allocator | Throughput | Latency | vs hakmem | Notes |
|-----------|------------|---------|-----------|-------|
| **mimalloc** | **908.31 M ops/sec** | **1.1 ns/op** | **8.8x faster** | LD_PRELOAD=libmimalloc.so.2 |
| **glibc** | 364.00 M ops/sec | 2.7 ns/op | 3.5x faster | System default |
| **hakmem** | 103.35 M ops/sec | 9.7 ns/op | 1.0x (baseline) | Phase 4.2 |

### Analysis

**mimalloc dominance**:
- 908 M ops/sec is **exceptionally fast**
- 1.1 ns/op ≈ 3-4 CPU cycles (assuming 3 GHz)
- Probably close to the practical floor for malloc/free on this workload
- Result of years of sustained optimization

**hakmem gap**:
- 8.8x slower than mimalloc
- 3.5x slower than glibc
- **8.6 ns overhead** per operation vs mimalloc (9.7 ns - 1.1 ns)
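
The latency, ratio, and cycle figures all follow from the measured throughputs. A small sketch of the arithmetic, with illustrative function names:

```c
#include <assert.h>

/* Convert throughput in millions of ops/sec to nanoseconds per op:
 * 1e9 ns per second divided by (mops * 1e6) ops per second. */
double ns_per_op(double mops_per_sec)
{
    assert(mops_per_sec > 0.0);
    return 1000.0 / mops_per_sec;
}

/* How many times faster `other` is than `base` (both in M ops/sec). */
double speedup(double other_mops, double base_mops)
{
    assert(base_mops > 0.0);
    return other_mops / base_mops;
}

/* Approximate CPU cycles per op at a given clock frequency in GHz. */
double cycles_per_op(double mops_per_sec, double ghz)
{
    return ns_per_op(mops_per_sec) * ghz;
}
```

For example, `ns_per_op(103.35)` is about 9.68, `speedup(908.31, 103.35)` is about 8.79, and `cycles_per_op(908.31, 3.0)` is about 3.3. The 3 GHz clock is the same assumption made in the analysis; the CPU model was not recorded.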

**Note**: The earlier ANALYSIS_SUMMARY.md stated mimalloc = 14 ns/op, but that figure came from a different benchmark (likely one with more complex allocation patterns).

---

## 🔬 Bottleneck Analysis

### Estimated Per-Op Cost Breakdown (9.7 ns measured)

Based on code review only (perf was not available in this WSL2 environment), so these figures are estimates, not measurements:

| Component | Estimated Cost | Details |
|-----------|---------------|---------|
| Registry lookup | 1-2 ns | Linear probing hash (max 8 probes), lock-free |
| Bitmap operations | 1-2 ns | set_used/free + summary bitmap update |
| Stats (batched TLS) | 0.5 ns | TLS increment (Phase 3 optimization) |
| Mini-magazine | 0.5-1 ns | pop/push operations (Phase 2 addition) |
| Other overhead | 3-5 ns | Function calls, branches, cache misses |
| **Total** | **6.0-10.5 ns** | **Consistent with measured 9.7 ns** ✅ |
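
As a quick sanity check, the component ranges can be summed mechanically. The sketch below copies the five estimates from the table; the struct and names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;
    double lo_ns, hi_ns;   /* estimated cost range per operation */
} cost_estimate;

/* Estimates copied from the breakdown table. */
const cost_estimate k_estimates[] = {
    {"registry lookup",   1.0, 2.0},
    {"bitmap operations", 1.0, 2.0},
    {"stats (batched)",   0.5, 0.5},
    {"mini-magazine",     0.5, 1.0},
    {"other overhead",    3.0, 5.0},
};

/* Add up the lower and upper bounds of every component. */
void sum_estimates(const cost_estimate *e, size_t n, double *lo, double *hi)
{
    assert(e != NULL && lo != NULL && hi != NULL);
    *lo = *hi = 0.0;
    for (size_t i = 0; i < n; i++) {
        *lo += e[i].lo_ns;
        *hi += e[i].hi_ns;
    }
}
```

The bounds come out at 6.0 ns and 10.5 ns, so the measured 9.7 ns sits near the top of the estimated range. Roughly half of the upper bound is the unattributed "other overhead" row, which is the part a real profiler would need to break down.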

### Key Findings

**1. Registry lookup (O(1) but not free)**:
```c
static int g_use_registry = 1;  // Enabled by default
// registry_lookup(): lock-free but linear probing
```
- O(1) average case
- Lock-free (good!)
- But: linear probing up to 8 probes
- Cost: ~1-2 ns per lookup
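
To make the cost concrete, here is a minimal sketch of a lock-free registry with bounded linear probing. It is not hakmem's `registry_lookup()`. The table size, the 64 KiB slab alignment, the `slab_meta` layout, and every name are assumptions, and removal is left out so that an empty slot can safely end a probe chain.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum { REG_SIZE = 1024, REG_MAX_PROBES = 8, SLAB_SHIFT = 16 }; /* assumed */

typedef struct {
    uintptr_t base;       /* slab base address, 64 KiB aligned */
    int       class_idx;  /* hypothetical size-class index */
} slab_meta;

static _Atomic(slab_meta *) g_registry[REG_SIZE];

static size_t reg_hash(uintptr_t base)
{
    return (size_t)(base >> SLAB_SHIFT) & (REG_SIZE - 1);
}

/* Claim the first empty slot in the probe window with a CAS.
 * Returns 0 on success, -1 if all REG_MAX_PROBES slots are taken. */
int registry_insert(slab_meta *m)
{
    assert(m != NULL && m->base != 0);
    size_t h = reg_hash(m->base);
    for (int i = 0; i < REG_MAX_PROBES; i++) {
        slab_meta *expected = NULL;
        if (atomic_compare_exchange_strong(
                &g_registry[(h + i) & (REG_SIZE - 1)], &expected, m))
            return 0;
    }
    return -1;
}

/* Map any pointer inside a slab to its metadata, or NULL if unknown.
 * Readers take no lock; each probe is one atomic acquire load. */
slab_meta *registry_lookup(const void *ptr)
{
    uintptr_t base = (uintptr_t)ptr & ~(((uintptr_t)1 << SLAB_SHIFT) - 1);
    size_t h = reg_hash(base);
    for (int i = 0; i < REG_MAX_PROBES; i++) {
        slab_meta *m = atomic_load_explicit(
            &g_registry[(h + i) & (REG_SIZE - 1)], memory_order_acquire);
        if (m == NULL) return NULL;   /* empty slot ends the chain */
        if (m->base == base) return m;
    }
    return NULL;
}
```

Even in the best case a lookup costs a shift, a mask, one load, a dependent second load (`m->base`), and a compare, and each collision adds another pair of loads. That is roughly where an estimate of 1-2 ns per lookup would come from.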

**2. Bitmap overhead**:
```c
hak_tiny_set_used(slab, block_idx);
// Updates bitmap + summary bitmap
```
- Two-tier bitmap for fast empty-word detection
- Each set/free updates both levels
- Cost: ~1-2 ns
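
The two-tier idea can be sketched as follows. This illustrates the technique and is not the body of `hak_tiny_set_used`. The slab size of 16 words, the polarity of the summary bits, and all names are assumptions, and `__builtin_ctzll` requires GCC or Clang.

```c
#include <assert.h>
#include <stdint.h>

enum { WORDS = 16 };   /* assumed: 16 x 64 = 1024 blocks per slab */

typedef struct {
    uint64_t used[WORDS];  /* bit = 1: block is allocated */
    uint64_t summary;      /* bit w = 1: used[w] still has a free block */
} tiny_bitmap;

void bitmap_init(tiny_bitmap *b)
{
    for (int w = 0; w < WORDS; w++) b->used[w] = 0;
    b->summary = (1ULL << WORDS) - 1;   /* every word has space */
}

void bitmap_set_used(tiny_bitmap *b, int idx)
{
    assert(idx >= 0 && idx < WORDS * 64);
    int w = idx / 64;
    b->used[w] |= 1ULL << (idx % 64);
    if (b->used[w] == ~0ULL)            /* word just became full */
        b->summary &= ~(1ULL << w);
}

void bitmap_set_free(tiny_bitmap *b, int idx)
{
    assert(idx >= 0 && idx < WORDS * 64);
    int w = idx / 64;
    b->used[w] &= ~(1ULL << (idx % 64));
    b->summary |= 1ULL << w;            /* word has space again */
}

/* Returns the index of a free block, or -1 if the slab is full. */
int bitmap_find_free(const tiny_bitmap *b)
{
    if (b->summary == 0) return -1;
    int w   = __builtin_ctzll(b->summary);    /* first word with space */
    int bit = __builtin_ctzll(~b->used[w]);   /* first free bit in it */
    return w * 64 + bit;
}
```

The summary word makes the search two count-trailing-zeros instructions instead of a scan, but every allocation and free now reads and possibly writes two separate words. That extra memory traffic is what the 1-2 ns estimate is charging for.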

**3. Mini-magazine (Phase 2)**:
- Added for "fast path" (1-2 ns vs 5-6 ns bitmap)
- But: adds extra layer vs direct TLS cache
- Cost: ~0.5-1 ns

**4. Stats (Phase 3 - optimized)**:
- Batched TLS counters (0.5 ns)
- Good! (vs 10-15 ns XOR RNG before)
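
Batched thread-local counters can be sketched like this. The batch size of 64 and all names are assumptions; the point is only that the hot path touches thread-private memory and the shared counter is updated once per batch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { STATS_BATCH = 64 };   /* assumed flush interval */
static_assert(STATS_BATCH > 0, "batch size must be positive");

static _Atomic uint64_t g_alloc_count;           /* shared, touched rarely */
static _Thread_local uint32_t t_pending_allocs;  /* private to each thread */

/* Publish this thread's pending count to the shared counter. */
void stats_flush(void)
{
    atomic_fetch_add_explicit(&g_alloc_count, t_pending_allocs,
                              memory_order_relaxed);
    t_pending_allocs = 0;
}

/* Hot path: a plain thread-local increment; the atomic add runs
 * only once every STATS_BATCH events. */
void stats_on_alloc(void)
{
    if (++t_pending_allocs >= STATS_BATCH)
        stats_flush();
}

uint64_t stats_total_allocs(void)
{
    return atomic_load_explicit(&g_alloc_count, memory_order_relaxed);
}
```

The trade-off is accuracy at read time: up to `STATS_BATCH - 1` events per thread are invisible until the next flush, so a final `stats_flush()` is needed before reporting totals.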

**5. Unaccounted overhead (~3-5 ns)**:
- Function call overhead
- Branch mispredictions
- Cache misses
- Data structure indirection

---

## 💡 Why is mimalloc so fast?

### mimalloc Architecture (inferred)

**1. Minimalist TLS cache**:
- Direct pointer to free list
- Single pointer dereference
- Near-zero overhead

**2. Inline metadata**:
- Page metadata embedded in page
- O(1) pointer→page calculation
- No hash table needed
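
Points 1 and 2 can be illustrated together. The sketch below is a toy page allocator in the style described above, not mimalloc's actual code. The 64 KiB page size, the header layout, and all names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

enum { PAGE_SIZE_BYTES = 64 * 1024, BLOCK_SIZE = 16 };  /* assumed sizes */

typedef struct block { struct block *next; } block;

/* Header stored at the start of every page-aligned region. */
typedef struct page {
    block *free;   /* intrusive singly linked free list */
    size_t used;   /* number of blocks currently handed out */
} page;

/* Carve an aligned region into blocks and thread them onto the list. */
page *page_create(void)
{
    page *pg = aligned_alloc(PAGE_SIZE_BYTES, PAGE_SIZE_BYTES);
    if (pg == NULL) return NULL;
    pg->free = NULL;
    pg->used = 0;
    size_t start = (sizeof(page) + BLOCK_SIZE - 1) / BLOCK_SIZE * BLOCK_SIZE;
    for (size_t off = PAGE_SIZE_BYTES - BLOCK_SIZE; off >= start;
         off -= BLOCK_SIZE) {
        block *b = (block *)((char *)pg + off);
        b->next = pg->free;
        pg->free = b;
    }
    return pg;
}

/* Fast path: pop the list head (one load, one store). */
void *page_alloc(page *pg)
{
    block *b = pg->free;
    if (b == NULL) return NULL;
    pg->free = b->next;
    pg->used++;
    return b;
}

/* O(1) pointer -> page: mask off the low bits; no hash table involved. */
page *page_of(const void *p)
{
    return (page *)((uintptr_t)p & ~((uintptr_t)PAGE_SIZE_BYTES - 1));
}

/* Push the block back onto its page's free list. */
void page_free(void *p)
{
    assert(p != NULL);
    page *pg = page_of(p);
    block *b = p;
    b->next = pg->free;
    pg->free = b;
    pg->used--;
}
```

In this layout a free costs one mask plus a list push, where hakmem currently pays for a registry probe and two bitmap updates. The price is that the free list lives inside the freed blocks, so block state cannot be inspected from a separate bitmap.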

**3. Lock-free everywhere**:
- Thread-local pages
- Atomic operations only for shared structures
- No pthread mutexes in hot path

**4. Cache-optimized**:
- Sequential layout
- Prefetch-friendly
- Minimal pointer chasing

**5. Extreme simplicity**:
- Fewer layers
- Fewer branches
- Fewer memory accesses

**hakmem's complexity**:
- TLS Magazine → Mini-magazine → Bitmap
- 3 layers vs mimalloc's 1-2 layers
- Each layer adds overhead

---

## 📈 Progress Summary

### What Was Completed Today

#### Phase 1-3 (Previously, but measured wrong)
- ❌ Phase 1: Mini-magazine infrastructure (measured glibc)
- ❌ Phase 2: Hot path integration (measured glibc)
- ❌ Phase 3: Stats batching (measured glibc)

#### Phase 4 (Attempted, but measured wrong)
- ❌ Phase 4: TLS spill optimization (measured glibc)
- ❌ Phase 4.1: Micro-optimization A+B (measured glibc)

#### Critical Fix (Actual Work)
- ✅ Discovered benchmark infrastructure bug
- ✅ Fixed Makefile (explicit targets)
- ✅ Created verify_bench.sh (119 hakmem symbols detected)
- ✅ Created BENCHMARKING_CHECKLIST.md (prevention)
- ✅ Phase 4.2: High-water gating (first correct measurement!)

#### Analysis & Documentation
- ✅ PHASE4_REGRESSION_ANALYSIS.md (16 KB)
- ✅ PHASE4_IMPROVEMENT_ROADMAP.md (13 KB)
- ✅ BENCHMARKING_CHECKLIST.md (comprehensive)
- ✅ verify_bench.sh (automated verification)
- ✅ Comparative benchmark (hakmem vs glibc vs mimalloc)
- ✅ Bottleneck analysis (code review)

---

## 🎯 Reality Check

### The Gap

- **mimalloc**: 908 M ops/sec (1.1 ns/op)
- **hakmem**: 103 M ops/sec (9.7 ns/op)
- **Gap**: **8.8x**

### Is This Gap Closeable?

**Honest Assessment**:

**Short-term** (Quick Wins - weeks):
- Target: 103 → 150 M ops/sec (+45% vs today's baseline)
- Method:
  - Reduce bitmap overhead
  - Optimize registry hash
  - Remove mini-magazine layer (if not helping)
  - Tune TLS Magazine capacity
- Realistic: **Achievable**

**Mid-term** (Major Optimization - months):
- Target: 150 → 250 M ops/sec (+140% vs today's baseline)
- Method:
  - Redesign data structures
  - Inline metadata (like mimalloc)
  - Lock-free everywhere
  - Cache-optimized layout
- Realistic: **Difficult but possible**

**Long-term** (Mimalloc-level - years):
- Target: 250 → 700+ M ops/sec (+580% vs today's baseline)
- Method:
  - Complete architectural overhaul
  - Learn from mimalloc source code
  - Remove all unnecessary layers
  - Extreme optimization
- Realistic: **Requires mimalloc-level expertise**

**Fundamental Challenge**:
- hakmem is a **research allocator** (visibility, experimentation)
- mimalloc is a **production allocator** (speed, simplicity)
- These goals are **inherently in conflict**

**Research features that add overhead**:
- Bitmap (vs inline metadata)
- Statistics (even batched TLS)
- Multiple layers (TLS Mag → Mini-mag → Bitmap)
- Flexibility (registry toggle, SuperSlab, etc.)

---

## 🚀 Next Steps

### Option A: Accept Reality (Recommended)
**Goal**: hakmem as **research platform**, not production allocator

**Approach**:
1. Document current performance (103 M ops/sec baseline)
2. Focus on **research features**:
   - Bitmap visibility for debugging
   - Statistics for analysis
   - Experimentation platform
3. Optimize **moderately**:
   - Quick wins (103 → 150 M ops/sec)
   - Don't sacrifice research goals for speed
4. **Accept 5-8x slower than mimalloc** as reasonable trade-off

**Outcome**: Honest, useful research tool

---

### Option B: Pursue Performance (Challenging)
**Goal**: Close the gap to 2-3x slower than mimalloc

**Phase 1: Quick Wins** (Target: 150 M ops/sec)
- Remove mini-magazine if not helping
- Optimize registry hash (reduce collisions)
- Tune TLS Magazine capacity
- Inline hot functions
- Profile with proper tools (perf, gprof)

**Phase 2: Architectural** (Target: 250 M ops/sec)
- Inline metadata (like mimalloc)
- Simplify data structures
- Remove unnecessary layers
- Cache-optimized layout

**Phase 3: Extreme** (Target: 400+ M ops/sec)
- Study mimalloc source code deeply
- Adopt mimalloc techniques
- Sacrifice research features if needed

**Outcome**: High-performance allocator, less research-friendly

---

### Option C: Hybrid Approach (Balanced)
**Goal**: Research + reasonable performance

**Approach**:
1. **Default mode**: Research-friendly (bitmap, stats, visibility)
   - Performance: 100-150 M ops/sec
2. **Fast mode**: Production-optimized (inline metadata, no stats)
   - Performance: 300-500 M ops/sec
   - Toggle: `HAKMEM_FAST_MODE=1`
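
A toggle like this is usually read once and cached so the hot path never calls `getenv`. The sketch below shows one possible shape. Only the variable name `HAKMEM_FAST_MODE` comes from this document; the function names, the rule that only the exact value `1` enables fast mode, and the caching scheme are assumptions.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Interpret the variable's value: only the exact string "1" enables
 * fast mode; unset, empty, or anything else keeps research mode. */
int hak_parse_fast_mode(const char *value)
{
    return value != NULL && strcmp(value, "1") == 0;
}

/* Read HAKMEM_FAST_MODE once and cache the answer, so later calls pay
 * for a single branch instead of a getenv() lookup. Not thread-safe as
 * written; meant to be called first during allocator initialization. */
int hak_fast_mode_enabled(void)
{
    static int cached = -1;   /* -1 = not read yet */
    if (cached < 0)
        cached = hak_parse_fast_mode(getenv("HAKMEM_FAST_MODE"));
    assert(cached == 0 || cached == 1);
    return cached;
}
```

A run would then look like `HAKMEM_FAST_MODE=1 ./bench_tiny`. A runtime flag still leaves a branch on every allocation, so reaching the 300-500 M ops/sec target might instead call for a compile-time switch; that choice is left open here.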

**Outcome**: Best of both worlds, more complexity

---

## 📚 Documentation Created Today

| File | Size | Purpose |
|------|------|---------|
| PHASE4_REGRESSION_ANALYSIS.md | 16 KB | Root cause analysis of Phase 4 regression |
| PHASE4_IMPROVEMENT_ROADMAP.md | 13 KB | Phased improvement plan (4.1→4.2→4.3→4.4) |
| BENCHMARKING_CHECKLIST.md | 11 KB | Prevention guide (never make the same mistake) |
| verify_bench.sh | 2 KB | Automated verification (119 symbols check) |
| bench_hakmem.sh | 2 KB | Benchmark script template |
| SESSION_SUMMARY_2025-10-26.md | (this file) | Comprehensive session summary |

**Total**: ~44 KB of documentation (excluding this file)

---

## 🎓 Lessons Learned

### Technical Lessons

1. **Makefile implicit rules are dangerous**
   - Always define explicit targets
   - Especially for critical binaries

2. **"It works" ≠ "It's correct"**
   - bench_tiny "worked" (no errors)
   - But measured wrong thing entirely

3. **Verification is essential**
   - `nm` to check symbols
   - `ldd` to check libraries
   - `size` to check binary size

4. **Performance claims need causality**
   - Code change → performance change?
   - Or just measurement noise?

5. **Benchmarking is hard**
   - Same code
   - Same compiler flags
   - Same workload
   - Verify what you're measuring!

### Process Lessons

1. **Document procedures**
   - Checklists prevent mistakes
   - Future-you will thank you

2. **Automate verification**
   - Scripts don't forget
   - verify_bench.sh (119 symbols)

3. **Question surprising results**
   - "Phase 3 improved 1.3%" - Was it real?
   - If glibc improved, that'd be newsworthy!

4. **Be honest about limitations**
   - hakmem is 8.8x slower
   - That's okay for research!

5. **Celebrate discoveries**
   - Found a fundamental issue
   - Fixed it permanently
   - That's progress!

---

## 💭 Reflections

### What Went Well ✅

1. **Fixed fundamental infrastructure bug**
   - Explicit targets and a verification script now guard against a repeat
   - Proper tooling in place

2. **Honest performance comparison**
   - Now know true baseline
   - Can set realistic goals

3. **Comprehensive documentation**
   - 44 KB of useful docs
   - Future-proofing

4. **Learned from ChatGPT Pro**
   - Gating strategy
   - Pull-based architecture
   - Batching techniques

### What Could Be Better 🤔

1. **Should have verified linkage earlier**
   - Wasted time on Phase 1-4.1
   - But: learned valuable lessons!

2. **Need better profiling tools**
   - perf was not usable in this WSL2 setup
   - Consider: gprof, valgrind, manual timing

3. **Architectural complexity**
   - 3 layers (TLS Mag → Mini-mag → Bitmap)
   - Each adds overhead
   - Simplification opportunity?

---

## 🎯 Recommendation

**For tomorrow** (if continuing):

### Immediate Actions

1. **Run verify_bench.sh** before any benchmark
   ```bash
   ./verify_bench.sh ./bench_tiny
   ```

2. **Re-measure Phase 0 (baseline)** with correct infrastructure
   - Before Phase 1
   - True starting point

3. **Profile with working tools**
   - gprof (if available)
   - Manual timing (HAKMEM_DEBUG_TIMING=1)
   - Identify true bottlenecks

4. **Implement one Quick Win**
   - Start with easiest improvement
   - Measure impact properly
   - Document result

### Strategic Decision

**Choose a path**:
- **Path A**: Research focus (accept 5-8x slower)
- **Path B**: Performance pursuit (target 2-3x slower)
- **Path C**: Hybrid (modes for different use cases)

**Recommended**: **Path A** (Research focus)
- Honest about trade-offs
- Useful research tool
- Moderate optimization (Quick Wins)
- Don't sacrifice research goals

---

## 📊 Final Numbers

### Before Today (Wrong)

| Phase | Throughput | Note |
|-------|------------|------|
| Phase 2 | 361 M ops/sec | ❌ glibc malloc |
| Phase 3 | 391 M ops/sec | ❌ glibc malloc |
| Phase 4 | 373 M ops/sec | ❌ glibc malloc |
| Phase 4.1 | 381 M ops/sec | ❌ glibc malloc |

### After Today (Correct)

| Allocator | Throughput | Latency | Verified |
|-----------|------------|---------|----------|
| hakmem (Phase 4.2) | 103 M ops/sec | 9.7 ns/op | ✅ 119 symbols |
| glibc malloc | 364 M ops/sec | 2.7 ns/op | ✅ No hakmem |
| mimalloc | 908 M ops/sec | 1.1 ns/op | ✅ LD_PRELOAD |

**Gap to close**: 8.8x (mimalloc) or 3.5x (glibc)

---

## 🙏 Acknowledgments

- **ChatGPT Pro**: Excellent architectural advice (Gating, Batching, pull-based design)
- **User (にゃーん)**: Persistent questioning led to discovery
- **verify_bench.sh**: Will prevent future mistakes

---

## 📅 Session Stats

- **Duration**: ~8 hours
- **Commits**: 6
- **Files Created**: 6
- **Lines of Code**: ~500
- **Documentation**: ~44 KB
- **Major Discovery**: 1 (benchmark infrastructure bug)
- **Performance Gap Discovered**: 8.8x vs mimalloc

---

## ✨ Conclusion

**Today was tough** 😿 but **extremely valuable** ✨

**Bad News**:
- Phase 1-4.1 measurements were wrong
- hakmem is 8.8x slower than mimalloc
- Big gap to close

**Good News**:
- Found fundamental infrastructure bug
- Fixed it permanently (verify_bench.sh + checklist)
- Know true baseline now (103 M ops/sec)
- Can set realistic goals going forward
- Comprehensive documentation for future

**Moving Forward**:
- Be honest about performance
- Focus on research value
- Optimize pragmatically
- Never make the same mistake again

**Final Thought**:
> "It's better to know an uncomfortable truth than a comfortable lie."

hakmem runs at 103 M ops/sec. That is the reality.
But that may well be enough performance for a **research platform**, nya.

Nyaan! Great work again today, nya! 😸✨

---

*Generated: 2025-10-26*
*Session: Phase 4 Optimization & Infrastructure Fix*
*Status: Documented & Lessons Learned*