Final Verdict: HAKMEM Memory Overhead Analysis
The Real Answer
After deep investigation, the 39.6 MB RSS for 1M × 16B allocations breaks down as follows:
Component Breakdown
- Actual Data: 15.26 MB (1M × 16B)
- Pointer Array: 7.63 MB (the test program's void** ptrs)
- HAKMEM Overhead: 16.71 MB (39.6 MB total − 15.26 MB data − 7.63 MB pointers)
Where Does the 16.71 MB Come From?
The investigation revealed that RSS != actual memory allocations due to:
- Page Granularity: RSS counts in 4 KB pages
  - Slab size: 64 KB (16 pages)
  - 245 slabs × 16 pages = 3,920 pages (1M × 16 B ÷ 64 KB per slab ≈ 245 slabs)
  - 3,920 × 4 KB = 15.31 MB (matches the data!)
- Metadata is Separate: Bitmaps, slab headers, etc. are allocated separately
  - Primary bitmaps: 122.5 KB
  - Summary bitmaps: 1.9 KB
  - Slab metadata: 21 KB
  - TLS Magazine: 128 KB
  - Total metadata: ~274 KB
- The Mystery 16 MB: After eliminating all known sources, the remaining 16 MB is likely:
  - Virtual memory overhead from the system allocator used by aligned_alloc()
  - TLS and stack overhead from threading infrastructure
  - Shared library overhead (HAKMEM itself as a .so file)
  - Process overhead (heap arena, etc.)
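One way to narrow this down is to compare glibc's own bookkeeping with the kernel's view of the process. A rough sketch, assuming Linux (/proc/self/smaps_rollup needs kernel 4.14+) and glibc 2.33+ for mallinfo2(); these are standard system interfaces, not part of HAKMEM:

```c
#include <malloc.h>   /* mallinfo2(), glibc >= 2.33 */
#include <stdio.h>
#include <string.h>

/* Dump glibc's heap usage next to the kernel's RSS breakdown, so overhead
 * living in the system allocator (aligned_alloc backing, TLS, .so data)
 * can be told apart from HAKMEM's own slab mappings. */
static void dump_memory_sources(void) {
    struct mallinfo2 mi = mallinfo2();
    printf("glibc heap in use : %zu bytes\n", mi.uordblks);
    printf("glibc heap mapped : %zu bytes\n", mi.arena + mi.hblkhd);

    FILE *f = fopen("/proc/self/smaps_rollup", "r");  /* Linux >= 4.14 */
    if (!f) return;
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "Rss:", 4) == 0 || strncmp(line, "Anonymous:", 10) == 0)
            fputs(line, stdout);
    fclose(f);
}

int main(void) {
    dump_memory_sources();
    return 0;
}
```

If glibc's mapped heap accounts for most of the mystery 16 MB, the overhead lives in the system allocator and process plumbing rather than in HAKMEM's slabs.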
The Real Problem: Not What We Thought!
Initial Hypothesis (WRONG)
aligned_alloc() wastes 64 KB per slab due to alignment
Evidence Against
- A test showed aligned_alloc(64KB) × 100 only added 1.5 MB RSS, not 6.4 MB
- This means the system allocator is efficient at alignment
Actual Problem (CORRECT)
The benchmark may be fundamentally flawed!
The test program (test_memory_usage.c) only touches ONE BYTE per allocation:
```c
ptrs[i] = malloc(16);
if (ptrs[i]) *(char*)ptrs[i] = 'A'; // Only touches the first byte!
```
RSS only counts touched pages!
If only the first byte of each 16-byte block is touched, and blocks are packed:
- 256 blocks fit in a 4 KB page (256 × 16 B = 4 KB)
- 1M blocks need 3,907 pages minimum
- But if blocks span pages due to slab boundaries...
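One way to test this hypothesis is a variant of the test loop that touches every byte of each block instead of one, so page-touch granularity can no longer hide untouched memory. A minimal sketch, assuming the same 1M × 16B workload; the constants and layout here are illustrative, not the actual test_memory_usage.c:

```c
#include <stdlib.h>
#include <string.h>

#define NUM_ALLOCS 1000000
#define ALLOC_SIZE 16

int main(void) {
    void **ptrs = malloc(NUM_ALLOCS * sizeof *ptrs);
    if (!ptrs) return 1;

    for (size_t i = 0; i < NUM_ALLOCS; i++) {
        ptrs[i] = malloc(ALLOC_SIZE);
        if (ptrs[i])
            memset(ptrs[i], 'A', ALLOC_SIZE);  /* touch all 16 bytes, not just one */
    }

    /* Measure RSS here (e.g., VmRSS from /proc/self/status) and compare
     * against the single-byte version of the test. */

    for (size_t i = 0; i < NUM_ALLOCS; i++) free(ptrs[i]);
    free(ptrs);
    return 0;
}
```

If RSS is essentially unchanged versus the single-byte version, the untouched bytes inside already-touched pages are not what is distorting the comparison.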
Revised Analysis
I need to run actual measurements to understand where the overhead truly comes from.
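A minimal helper for those measurements, assuming Linux, where /proc/self/status reports VmRSS in kilobytes:

```c
#include <stdio.h>
#include <string.h>

/* Return the process's current RSS in KiB, or 0 on failure.
 * Call before and after the allocation loop to see how much RSS the
 * allocations themselves add. */
static long current_rss_kib(void) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return 0;
    char line[256];
    long rss = 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0) {
            sscanf(line + 6, "%ld", &rss);
            break;
        }
    }
    fclose(f);
    return rss;
}

int main(void) {
    printf("RSS now: %ld KiB\n", current_rss_kib());
    return 0;
}
```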
The Scaling Pattern is Real
100K allocs: HAKMEM 221% OH, mimalloc 234% OH → HAKMEM wins!
1M allocs: HAKMEM 160% OH, mimalloc 65% OH → mimalloc wins!
This suggests HAKMEM has:
- Better fixed overhead (wins at small scale)
- Worse variable overhead (loses at large scale)
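That claim can be made concrete with a crude two-point fit against the numbers above. Assuming overhead(N) ≈ fixed + per_alloc × N, with the overhead percentages taken relative to the live data (N × 16 B); the model and its constants are an assumption of this note, not a measurement:

```c
#include <stdio.h>

/* Crude two-point fit: overhead_bytes(N) = fixed + per_alloc * N.
 * Overhead percentages are relative to the live data size (N * 16 B). */
static void fit(const char *name, double oh_100k, double oh_1m) {
    double data_100k = 100e3 * 16, data_1m = 1e6 * 16;
    double ov_100k = oh_100k * data_100k;   /* overhead bytes at 100K allocs */
    double ov_1m   = oh_1m   * data_1m;     /* overhead bytes at 1M allocs   */
    double per_alloc = (ov_1m - ov_100k) / (1e6 - 100e3);
    double fixed     = ov_100k - per_alloc * 100e3;
    printf("%-9s fixed ~%.1f MB, variable ~%.1f B/alloc\n",
           name, fixed / (1 << 20), per_alloc);
}

int main(void) {
    fit("HAKMEM",   2.21, 1.60);   /* 221% and 160% overhead */
    fit("mimalloc", 2.34, 0.65);   /* 234% and 65% overhead  */
    return 0;
}
```

Under that (admittedly crude) model, HAKMEM comes out near 1 MB fixed and ~25 B per allocation, mimalloc near 3 MB fixed and ~7 B per allocation, which matches the crossover: HAKMEM wins while the fixed cost dominates and loses once the per-allocation term does.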
Conclusion
The document MEMORY_OVERHEAD_ANALYSIS.md contains correct diagnostic methodology but may have jumped to conclusions about aligned_alloc().
The real issue is likely one of:
- SuperSlab is NOT being used (g_use_superslab=1 but not active)
- TLS Magazine is holding too many blocks
- Slab fragmentation (last slab partially filled)
- Test methodology issue (RSS vs actual allocations)
Recommendation: Run actual instrumented tests with slab counters to see exactly how many slabs are allocated and what their utilization is.
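A sketch of what such counters could look like; the names and hook points below are hypothetical, not HAKMEM's actual internals:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical instrumentation: incremented at slab-allocation and
 * block-carve-out sites, dumped at exit to show how many slabs exist
 * and how full they are. */
static _Atomic size_t g_slabs_allocated;
static _Atomic size_t g_blocks_in_use;

#define SLAB_SIZE       (64 * 1024)
#define BLOCK_SIZE      16
#define BLOCKS_PER_SLAB (SLAB_SIZE / BLOCK_SIZE)

void counters_on_slab_alloc(void)  { atomic_fetch_add(&g_slabs_allocated, 1); }
void counters_on_block_alloc(void) { atomic_fetch_add(&g_blocks_in_use, 1); }
void counters_on_block_free(void)  { atomic_fetch_sub(&g_blocks_in_use, 1); }

void counters_dump(void) {
    size_t slabs    = atomic_load(&g_slabs_allocated);
    size_t blocks   = atomic_load(&g_blocks_in_use);
    size_t capacity = slabs * BLOCKS_PER_SLAB;
    printf("slabs allocated : %zu (%zu KB mapped)\n", slabs, slabs * SLAB_SIZE / 1024);
    printf("blocks in use   : %zu / %zu (%.1f%% utilization)\n",
           blocks, capacity, capacity ? 100.0 * blocks / capacity : 0.0);
}

int main(void) {
    /* Simulate the 1M x 16B workload to show the report format. */
    for (int s = 0; s < 245; s++) counters_on_slab_alloc();
    for (long b = 0; b < 1000000; b++) counters_on_block_alloc();
    counters_dump();
    return 0;
}
```

Comparing blocks in use against slabs × blocks-per-slab directly exposes how much capacity sits in partially filled slabs.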