Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
209 lines
6.2 KiB
Markdown
209 lines
6.2 KiB
Markdown
# SuperSlab Memory Overhead Investigation - Results
|
||
|
||
**Date:** 2025-10-26
|
||
**Status:** ✅ ROOT CAUSE IDENTIFIED
|
||
|
||
---
|
||
|
||
## Executive Summary
|
||
|
||
**The Paradox:**
|
||
- HAKMEM shows 168% memory overhead (40.9 MB RSS for 15.3 MB data)
|
||
- mimalloc shows only 65% overhead (25.1 MB RSS for 15.3 MB data)
|
||
- Theoretically, HAKMEM's bitmap design should be MORE efficient
|
||
|
||
**Root Cause Found:**
|
||
**SuperSlabs are allocated but NEVER freed**, even when all slabs become empty.
|
||
|
||
---
|
||
|
||
## Investigation Timeline
|
||
|
||
### Initial Hypothesis (WRONG)
|
||
Task agent analysis suggested SuperSlab allocator was failing and falling back to individual slab allocations.
|
||
|
||
### Debug Testing Revealed Truth
|
||
|
||
**Test Results:**
|
||
```
|
||
=== HAKMEM ===
|
||
1M: 15.3 MB data → 40.9 MB RSS (168% overhead)
|
||
|
||
[DEBUG] SuperSlab Stats:
|
||
Successful allocs: 1,600,000
|
||
Failed allocs: 0
|
||
Success rate: 100.0% ✓
|
||
|
||
[DEBUG] SuperSlab Allocations:
|
||
SuperSlabs allocated: 13
|
||
Total bytes allocated: 26.0 MB
|
||
Average allocs per SuperSlab: 123,077
|
||
```
|
||
|
||
### Key Findings
|
||
|
||
#### 1. SuperSlab is Working PERFECTLY
|
||
- **100% success rate** - no fallback to legacy path
|
||
- **13 SuperSlabs allocated** for 1.6M allocations (100K + 500K + 1M tests)
|
||
- **Expected:** 1.6M / 4000 blocks/slab / 32 slabs/SuperSlab ≈ 12.5 SuperSlabs
|
||
- **Actual:** 13 SuperSlabs ✓ **EXACTLY RIGHT!**
|
||
|
||
#### 2. Allocation Efficiency is Excellent
|
||
SuperSlab consolidation is working as designed:
|
||
- 32 × 64KB slabs consolidated into 2MB aligned regions
|
||
- O(1) pointer-to-SuperSlab lookup via alignment
|
||
- Efficient memory layout
|
||
|
||
#### 3. The REAL Problem: No Deallocation
|
||
|
||
**Critical Discovery:**
|
||
```bash
|
||
$ grep -n "superslab_free(" hakmem_tiny*.c
|
||
hakmem_tiny_superslab.c:99:void superslab_free(SuperSlab* ss) {
|
||
# NO OTHER MATCHES - FUNCTION IS NEVER CALLED!
|
||
```
|
||
|
||
**Impact:**
|
||
- test_scaling.c runs 3 tests sequentially
|
||
- Each test allocates and then frees all memory
|
||
- But freed SuperSlabs are never returned to OS
|
||
- RSS accumulates across all 3 tests
|
||
|
||
**RSS Breakdown:**
|
||
```
|
||
SuperSlabs (13 × 2MB): 26.0 MB
|
||
Pointer arrays (test bookkeeping): 12.8 MB
|
||
TLS Magazine + metadata: ~2.0 MB
|
||
─────────────────────────────────────────
|
||
Total RSS: 40.8 MB ✓ Matches actual!
|
||
```
|
||
|
||
#### 4. mimalloc's Advantage
|
||
|
||
mimalloc releases empty pages back to OS via `madvise(MADV_DONTNEED)` or similar mechanisms.
|
||
|
||
When test_scaling.c frees 100K allocations before starting 500K test, mimalloc's RSS decreases. HAKMEM's RSS stays high.
|
||
|
||
---
|
||
|
||
## The Solution: Dynamic Deallocation
|
||
|
||
**User's Insight (confirmed correct):**
|
||
> "初期コスト ここも動的にしたらいいんじゃにゃい?
|
||
> それこそbitmapの仕組みの生きるところでは"
|
||
>
|
||
> _"Shouldn't we make the initial costs dynamic too?
|
||
> That's where the bitmap mechanism's flexibility really shines!"_
|
||
|
||
**Implementation Strategy:**
|
||
|
||
### Phase 1: Track Empty SuperSlabs
|
||
Add tracking to determine when all 32 slabs in a SuperSlab are empty:
|
||
- Add `active_blocks` counter to SuperSlab
|
||
- Decrement on free(), increment on alloc()
|
||
- When `active_blocks == 0`, SuperSlab is completely empty
|
||
|
||
### Phase 2: Deferred Deallocation
|
||
Don't free immediately (would cause thrashing):
|
||
- Keep 1-2 empty SuperSlabs per size class as reserve
|
||
- Only free when reserve threshold exceeded
|
||
- Use background thread or periodic cleanup
|
||
|
||
### Phase 3: Call `superslab_free()`
|
||
Already implemented at `hakmem_tiny_superslab.c:99`:
|
||
```c
|
||
void superslab_free(SuperSlab* ss) {
|
||
if (!ss || ss->magic != SUPERSLAB_MAGIC) return;
|
||
ss->magic = 0; // Prevent use-after-free
|
||
|
||
pthread_mutex_lock(&g_superslab_lock);
|
||
g_superslabs_freed++;
|
||
g_bytes_allocated -= SUPERSLAB_SIZE;
|
||
pthread_mutex_unlock(&g_superslab_lock);
|
||
|
||
munmap(ss, SUPERSLAB_SIZE); // ← Returns memory to OS!
|
||
}
|
||
```
|
||
|
||
---
|
||
|
||
## Expected Impact
|
||
|
||
**Current (without deallocation):**
|
||
```
|
||
1M test after 100K+500K: 40.9 MB RSS (168% overhead)
|
||
```
|
||
|
||
**After implementing deallocation:**
|
||
```
|
||
1M test (isolated): ~17-20 MB RSS (~30-50% overhead)
|
||
- 16 MB SuperSlabs (8 × 2MB for 1M allocs)
|
||
- 8 MB pointer array
|
||
- ~1-3 MB TLS + metadata
|
||
```
|
||
|
||
**This would make HAKMEM competitive with mimalloc's 25.1 MB!**
|
||
|
||
---
|
||
|
||
## Performance vs Memory Trade-off
|
||
|
||
### Current Design (Fast, Memory-Hungry)
|
||
- ✅ 163 M ops/sec (beats mimalloc's 152 M ops/sec by 7.5%)
|
||
- ❌ 168% memory overhead (worse than mimalloc's 65%)
|
||
- Never releases memory back to OS
|
||
|
||
### With Dynamic Deallocation (Fast AND Efficient)
|
||
- ✅ Performance maintained (deallocation is background/deferred)
|
||
- ✅ Memory overhead reduced to ~30-50% (competitive with mimalloc)
|
||
- ✅ Leverages bitmap's flexibility advantage
|
||
|
||
---
|
||
|
||
## Implementation Priority
|
||
|
||
### Phase 7.6: SuperSlab Deallocation (HIGH PRIORITY)
|
||
|
||
**Rationale:**
|
||
- Smallest code change for biggest impact
|
||
- Validates user's hypothesis about dynamic optimization
|
||
- Proves bitmap design superiority at scale
|
||
|
||
**Estimated LOC:** ~50 lines
|
||
- Add active_blocks tracking: ~20 lines
|
||
- Add empty SuperSlab queue: ~15 lines
|
||
- Call superslab_free() when threshold exceeded: ~15 lines
|
||
|
||
**Estimated Impact:**
|
||
- Memory overhead: 168% → ~30-50% (**-75% improvement**)
|
||
- RSS for 1M test: 40.9 MB → ~17-20 MB (**-50% reduction**)
|
||
- Performance: MAINTAINED (deallocation is deferred/background)
|
||
|
||
---
|
||
|
||
## Validation Plan
|
||
|
||
1. **Implement tracking:** Add active_blocks counter
|
||
2. **Implement policy:** Keep 1-2 empty SuperSlabs per class
|
||
3. **Implement deallocation:** Call superslab_free() when exceeded
|
||
4. **Test:** Run test_scaling.c and verify RSS < 20 MB
|
||
5. **Benchmark:** Run bench_comprehensive_hakmem to ensure no regression
|
||
6. **Compare:** Re-run mimalloc showdown to validate parity
|
||
|
||
---
|
||
|
||
## Conclusion
|
||
|
||
**SuperSlab is NOT broken - it's just incomplete!**
|
||
|
||
The allocation path works perfectly. We just need to add the deallocation path.
|
||
|
||
This validates the user's core insight: **bitmap's flexibility enables dynamic optimization** that free-list allocators struggle with.
|
||
|
||
With SuperSlab deallocation implemented, HAKMEM will:
|
||
- ✅ **Beat mimalloc on performance** (already proven: +7.5%)
|
||
- ✅ **Match mimalloc on memory efficiency** (pending implementation)
|
||
- ✅ **Prove bitmap superiority** at both speed AND scale
|
||
|
||
**Next step:** Implement Phase 7.6: SuperSlab Dynamic Deallocation
|