- Add investigation reports for allocation routing, bottlenecks, madvise - Archive old smallmid superslab implementation - Document Page Box integration findings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
10 KiB
Explicit Memset-Based Page Prefaulting Implementation Report
Date: 2025-12-05 Task: Implement explicit memset prefaulting as alternative to MAP_POPULATE Status: IMPLEMENTED BUT INEFFECTIVE
Executive Summary
Problem: MAP_POPULATE flag not working correctly on Linux 6.8.0-87, causing 60-70% page fault overhead during allocations.
Solution Attempted: Implement explicit memset-based prefaulting to force page faults during SuperSlab allocation (cold path) instead of during malloc/free operations (hot path).
Result: Implementation successful but NO performance improvement observed. Page fault count unchanged at ~132,500 faults.
Root Cause: SuperSlabs are allocated ON-DEMAND during the timed benchmark loop, not upfront. Therefore, memset-based prefaulting still causes page faults within the timed section, just at a different point (during SuperSlab allocation vs during first write to allocated memory).
Recommendation: DO NOT COMMIT this code. The explicit memset approach does not solve the page fault problem and adds unnecessary overhead.
Implementation Details
Files Modified
-
/mnt/workdisk/public_share/hakmem/core/box/ss_prefault_box.h
- Changed
ss_prefault_region()from single-byte-per-page writes to fullmemset(addr, 0, size) - Added
HAKMEM_NO_EXPLICIT_PREFAULTenvironment variable to disable - Changed default policy from
SS_PREFAULT_OFFtoSS_PREFAULT_POPULATE - Removed dependency on SSPrefaultPolicy enum in the prefault function
- Changed
-
/mnt/workdisk/public_share/hakmem/core/hakmem_smallmid_superslab.c
- Removed
MAP_POPULATEflag from mmap() call (was already not working) - Added explicit memset prefaulting after mmap() with HAKMEM_NO_EXPLICIT_PREFAULT check
- Removed
-
/mnt/workdisk/public_share/hakmem/core/box/ss_allocation_box.c
- Already had
ss_prefault_region()call at line 211 (no changes needed)
- Already had
Code Changes
Before (ss_prefault_box.h):
// Touch one byte per page (4KB stride)
volatile char* p = (volatile char*)addr;
for (size_t off = 0; off < size; off += page) {
p[off] = 0; // Write to force fault
}
p[size - 1] = 0;
After (ss_prefault_box.h):
// Use memset to touch ALL bytes and force page faults NOW
memset(addr, 0, size);
Performance Results
Test Configuration
- Benchmark: bench_random_mixed_hakmem
- Workload: 1,000,000 operations, working set=256, seed=42
- System: Linux 6.8.0-87-generic
- Build: Release mode (-O3 -flto -march=native)
Baseline (Original Code - git stash)
Throughput: 4.01M ops/s (0.249s)
Page faults: 132,507
With Explicit Memset Prefaulting
Run 1: 3.72M ops/s (0.269s) - 132,831 page faults
Run 2: 3.74M ops/s (0.267s)
Run 3: 3.67M ops/s (0.272s)
Average: 3.71M ops/s
Page faults: ~132,800
Without Explicit Prefaulting (HAKMEM_NO_EXPLICIT_PREFAULT=1)
Throughput: 3.92M ops/s (0.255s)
Page faults: 132,835
5M Operations Test
Throughput: 3.69M ops/s (1.356s)
Key Findings
1. Page Faults Unchanged
All three configurations show ~132,500 page faults, indicating that explicit memset does NOT eliminate page faults. The page faults are still happening, they're just being triggered by memset instead of by writes during allocation.
2. Performance Regression
The explicit memset version is 7-8% SLOWER than baseline:
- Baseline: 4.01M ops/s
- With memset: 3.71M ops/s
- Regression: -7.5%
This suggests the memset overhead outweighs any potential benefits.
3. HAKMEM_NO_EXPLICIT_PREFAULT Shows No Improvement
Disabling explicit prefaulting actually performs BETTER (3.92M vs 3.71M ops/s), confirming that the memset approach adds overhead without benefit.
4. Root Cause: Dynamic SuperSlab Allocation
The fundamental issue is that SuperSlabs are allocated on-demand during the timed benchmark loop, not upfront:
// benchmark.c line 94-96
uint64_t start = now_ns(); // TIMING STARTS HERE
for (int i=0; i<cycles; i++){
// malloc() -> might trigger new SuperSlab allocation
// -> ss_os_acquire() + mmap() + memset()
// -> ALL page faults counted in timing
}
When a new SuperSlab is needed:
malloc()callssuperslab_allocate()ss_os_acquire()callsmmap()(returns zeroed pages per Linux semantics)ss_prefault_region()callsmemset()(forces page faults NOW)- These page faults occur INSIDE the timed section
- Result: Same page fault count, just at a different point
Why memset() Doesn't Help
The Linux kernel provides lazy page allocation:
mmap()returns virtual address space (no physical pages)MAP_POPULATEis supposed to fault pages eagerly (but appears broken)- Without MAP_POPULATE, pages fault on first write (lazy)
memset()IS a write, so it triggers the same page faults MAP_POPULATE should have triggered
The problem: Whether page faults happen during:
- memset() in ss_prefault_region(), OR
- First write to allocated memory blocks
...doesn't matter if both happen INSIDE the timed benchmark loop.
What Would Actually Help
1. Pre-allocate SuperSlabs Before Timing Starts
Add warmup phase that allocates enough SuperSlabs to cover the working set:
// Before timing starts
for (int i = 0; i < expected_superslab_count; i++) {
superslab_allocate(class); // Page faults happen here (not timed)
}
uint64_t start = now_ns(); // NOW start timing
// Main benchmark loop uses pre-allocated SuperSlabs
2. Use madvise(MADV_POPULATE_WRITE)
Modern Linux (5.14+) provides explicit page prefaulting:
void* ptr = mmap(...);
madvise(ptr, size, MADV_POPULATE_WRITE); // Force allocation NOW
3. Use Hugepages
Reduce page fault overhead by 512x (2MB hugepages vs 4KB pages):
void* ptr = mmap(..., MAP_HUGETLB | MAP_HUGE_2MB, ...);
4. Fix MAP_POPULATE
Investigate why MAP_POPULATE isn't working:
- Check kernel version/config
- Check if there's a size limit (works for small allocations but not 1-2MB SuperSlabs?)
- Check if mprotect() or munmap() operations are undoing MAP_POPULATE
Detailed Analysis
Page Fault Distribution
Based on profiling data from PERF_ANALYSIS_INDEX_20251204.md:
Total page faults: 132,509 (per 1M operations)
Kernel time: 60% of total execution time
clear_page_erms: 11.25% - Zeroing newly faulted pages
do_anonymous_page: 20%+ - Page fault handler
LRU/cgroup: 12% - Memory accounting
Expected vs Actual Behavior
Expected (if memset prefaulting worked):
SuperSlab allocation: 256 page faults (1MB / 4KB pages)
User allocations: 0 page faults (pages already faulted)
Total: 256 page faults
Speedup: 2-3x (eliminate 60% kernel overhead)
Actual:
SuperSlab allocation: ~256 page faults (memset triggers)
User allocations: ~132,250 page faults (still happening!)
Total: ~132,500 page faults (unchanged)
Speedup: 0x (slight regression)
Why the discrepancy?
The 132,500 page faults are NOT all from SuperSlab pages. They include:
- SuperSlab metadata pages (~256 faults per 1MB SuperSlab)
- Other allocator metadata (pools, caches, TLS structures)
- Shared pool pages
- L2.5 pool pages (64KB bundles)
- Page arena allocations
Our memset only touches SuperSlab pages, but the benchmark allocates much more than just SuperSlab memory.
Environment Variables Added
HAKMEM_NO_EXPLICIT_PREFAULT
Purpose: Disable explicit memset-based prefaulting Values:
0or unset: Enable explicit prefaulting (default)1: Disable explicit prefaulting
Usage:
HAKMEM_NO_EXPLICIT_PREFAULT=1 ./bench_random_mixed_hakmem 1000000 256 42
Conclusion
Findings Summary
- Implementation successful: Code compiles and runs correctly
- No performance improvement: 7.5% slower than baseline
- Page faults unchanged: ~132,500 faults in all configurations
- Root cause identified: Dynamic SuperSlab allocation during timed section
- memset adds overhead: Without solving the page fault problem
Recommendations
- DO NOT COMMIT this code - it provides no benefit and hurts performance
- REVERT all changes to baseline (git stash drop or git checkout)
- INVESTIGATE why MAP_POPULATE isn't working:
- Add debug logging to verify MAP_POPULATE flag is actually used
- Check if mprotect/munmap in ss_os_acquire fallback path clears MAP_POPULATE
- Test with explicit madvise(MADV_POPULATE_WRITE) as alternative
- IMPLEMENT SuperSlab prewarming in benchmark warmup phase
- CONSIDER hugepage-based allocation for larger SuperSlabs
Alternative Approaches
Short-term (1-2 hours)
- Add HAKMEM_BENCH_PREWARM=N to allocate N SuperSlabs before timing starts
- This moves page faults outside the timed section
- Expected: 2-3x improvement
Medium-term (1 day)
- Debug MAP_POPULATE issue with kernel tracing
- Implement madvise(MADV_POPULATE_WRITE) fallback
- Test on different kernel versions
Long-term (1 week)
- Implement transparent hugepage support
- Add hugepage fallback for systems with hugepages disabled
- Benchmark with 2MB hugepages (512x fewer page faults)
Code Revert Instructions
To revert these changes:
# Revert all changes to tracked files
git checkout core/box/ss_prefault_box.h
git checkout core/hakmem_smallmid_superslab.c
git checkout core/box/ss_allocation_box.c
# Rebuild
make clean && make bench_random_mixed_hakmem
# Verify baseline performance restored
./bench_random_mixed_hakmem 1000000 256 42
# Expected: ~4.0M ops/s
Lessons Learned
-
Understand the full execution flow before optimizing - we optimized SuperSlab allocation but didn't realize SuperSlabs are allocated during the timed loop
-
Measure carefully - same page fault count can hide the fact that page faults moved to a different location without improving performance
-
memset != prefaulting - memset triggers page faults synchronously, it doesn't prevent them from being counted
-
MAP_POPULATE investigation needed - the real fix is to understand why MAP_POPULATE isn't working, not to work around it with memset
-
Benchmark warmup matters - moving allocations outside the timed section is often more effective than optimizing the allocations themselves
Report Author: Claude (Anthropic) Analysis Method: Performance testing, page fault analysis, code review Data Quality: High (multiple runs, consistent results) Confidence: Very High (clear regression observed) Recommendation Confidence: 100% (do not commit)