# hakmem Profiling Guide

**Purpose**: Practical commands for validating Phase 6.7 overhead analysis
**Date**: 2025-10-21

---

## 1. Feature Isolation Testing

### 1.1 Add Environment Variable Support

**File**: `hakmem.c`

Add at top of file:

```c
#include <stdio.h>
#include <stdlib.h>

// Environment variable control (for profiling)
static int g_disable_bigcache = 0;
static int g_disable_elo = 0;
static int g_minimal_mode = 0;

void hak_init_env_vars(void) {
    g_disable_bigcache = getenv("HAKMEM_DISABLE_BIGCACHE") ? 1 : 0;
    g_disable_elo = getenv("HAKMEM_DISABLE_ELO") ? 1 : 0;
    g_minimal_mode = getenv("HAKMEM_MINIMAL") ? 1 : 0;

    if (g_minimal_mode) {
        g_disable_bigcache = 1;
        g_disable_elo = 1;
    }

    if (g_disable_bigcache) fprintf(stderr, "[hakmem] BigCache disabled (profiling mode)\n");
    if (g_disable_elo) fprintf(stderr, "[hakmem] ELO disabled (fixed 2MB threshold)\n");
    if (g_minimal_mode) fprintf(stderr, "[hakmem] Minimal mode (all features OFF)\n");
}
```

Modify `hak_alloc_at()`:

```c
void* hak_alloc_at(size_t size, hak_callsite_t site) {
    // Note: this lazy-init flag is not thread-safe; for multi-threaded
    // benchmarks, call hak_init_env_vars() once at startup instead.
    static int initialized = 0;
    if (!initialized) {
        hak_init_env_vars();
        initialized = 1;
    }

    // ... existing code, but wrap features in checks:

    // ELO selection (skip if disabled)
    if (!g_disable_elo && !hak_evo_is_frozen()) {
        strategy_id = hak_elo_select_strategy();
        threshold = hak_elo_get_threshold(strategy_id);
    } else {
        threshold = 2097152;  // Fixed 2MB threshold
    }

    // BigCache lookup (skip if disabled)
    if (!g_disable_bigcache && size >= 1048576) {
        void* cached_ptr = NULL;
        if (hak_bigcache_try_get(size, site, &cached_ptr)) {
            return cached_ptr;
        }
    }

    // ...
    // rest unchanged
}
```

---

### 1.2 Run Feature Isolation Tests

**Baseline** (all features):

```bash
make clean && make bench
bash bench_runner.sh --warmup 2 --runs 10 --output baseline.csv
grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```

Expected: ~37,602 ns (`NR==5` prints the 5th of the 10 sorted run times, i.e. the lower median)

**No BigCache**:

```bash
HAKMEM_DISABLE_BIGCACHE=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_bigcache.csv
grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```

Expected: ~37,552 ns (-50 ns)

**No ELO**:

```bash
HAKMEM_DISABLE_ELO=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_elo.csv
grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```

Expected: ~37,452 ns (-150 ns)

**FROZEN mode**:

```bash
HAKMEM_EVO_POLICY=frozen bash bench_runner.sh --warmup 2 --runs 10 --output frozen.csv
grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```

Expected: ~37,452 ns (-150 ns, same as No ELO)

**MINIMAL mode**:

```bash
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 10 --output minimal.csv
grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```

Expected: ~37,252 ns (-350 ns total)

**Analysis script**:

```bash
echo "Configuration,Median,Delta"
echo "Baseline,$(grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),0"
echo "No BigCache,$(grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-50"
echo "No ELO,$(grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "FROZEN,$(grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "MINIMAL,$(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-350"
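# Optional helper (a sketch, not part of the original script): factor out the
# repeated median pipeline. Assumes 10 runs per config, so NR==5 is the lower median.
median() {
    grep "hakmem-evolving,vm" "$1" | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'
}
# e.g. echo "MINIMAL,$(median minimal.csv),-350"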
```

---

## 2. Profiling with perf

### 2.1 Compile with Debug Symbols

```bash
make clean
make CFLAGS="-g -O2 -fno-omit-frame-pointer"
```

### 2.2 Run perf record

**Single run** (quick check):

```bash
perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator hakmem-evolving \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_hakmem.txt
```

**Expected top functions** (from analysis):

1. `alloc_mmap` / `mmap` - 60-70% (syscall overhead)
2. `hak_elo_select_strategy` - 5-10% (strategy selection)
3. `hak_bigcache_try_get` - 3-5% (cache lookup)
4. `memset` / `memcpy` - 10-15% (memory initialization)

**Validation**:

```bash
grep -A 20 "Overhead" perf_hakmem.txt | head -25
```

---

### 2.3 Compare with mimalloc

```bash
# mimalloc profiling
perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator mimalloc \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_mimalloc.txt

# Compare top functions
echo "=== hakmem top 10 ==="
grep -A 30 "Overhead" perf_hakmem.txt | head -35 | tail -10
echo ""
echo "=== mimalloc top 10 ==="
grep -A 30 "Overhead" perf_mimalloc.txt | head -35 | tail -10
```

**Expected differences**:
- hakmem: More time in `hak_elo_select_strategy`, `hak_bigcache_try_get`
- mimalloc: More time in `mi_alloc_huge` (internal allocator), less overhead

---

### 2.4 Annotate Source Code

```bash
perf annotate hak_alloc_at > annotate_hak_alloc_at.txt
cat annotate_hak_alloc_at.txt
```

**Look for**:
- Hot lines (high % of samples)
- Branch mispredictions
- Cache misses

---

## 3. Cache Performance Analysis

### 3.1 Cache Miss Profiling

```bash
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100
```

**Expected metrics**:

| Metric | hakmem | mimalloc | Analysis |
|--------|--------|----------|----------|
| IPC (instructions/cycle) | ~1.5 | ~2.0 | mimalloc more efficient |
| L1 miss rate | 5-10% | 2-5% | hakmem less cache-friendly |
| Cache references | Higher | Lower | hakmem touches more memory |

---

### 3.2 TLB Profiling

```bash
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100
```

**Expected**:
- Similar TLB miss rates (both use 2MB pages)
- Slight advantage to mimalloc (better locality)

---

## 4. Micro-benchmarks

### 4.1 BigCache Lookup Speed

**File**: `test_bigcache_speed.c`

```c
#include "hakmem_bigcache.h"
#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    hak_bigcache_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        void* ptr = NULL;
        hak_bigcache_try_get(2097152, (uintptr_t)i, &ptr);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);
    printf("BigCache lookup: %.1f ns/op\n", (double)ns / N);

    hak_bigcache_shutdown();
    return 0;
}
```

Note: nothing is ever inserted into the cache here, so this measures the lookup miss path only.

**Compile & run**:

```bash
gcc -O2 -o test_bigcache_speed test_bigcache_speed.c hakmem_bigcache.c -I.
./test_bigcache_speed
```

**Expected**: 50-100 ns/op

---

### 4.2 ELO Selection Speed

**File**: `test_elo_speed.c`

```c
#include "hakmem_elo.h"
#include <stdio.h>
#include <stdint.h>
#include <time.h>

int main(void) {
    hak_elo_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        int strategy = hak_elo_select_strategy();
        (void)strategy;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);
    printf("ELO selection: %.1f ns/op\n", (double)ns / N);

    hak_elo_shutdown();
    return 0;
}
```

**Compile & run**:

```bash
gcc -O2 -o test_elo_speed test_elo_speed.c hakmem_elo.c -I. -lm
./test_elo_speed
```

**Expected**: 100-200 ns/op

---

### 4.3 Header Operations Speed

**File**: `test_header_speed.c`

```c
#include "hakmem.h"
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    uint32_t magic;
    uint32_t method;
    size_t requested_size;
    size_t actual_size;
    uintptr_t alloc_site;
    size_t class_bytes;
} AllocHeader;

#define HAKMEM_MAGIC 0x48414B4D

int main(void) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        volatile AllocHeader hdr;  // volatile so -O2 cannot elide the loop
        hdr.magic = HAKMEM_MAGIC;
        hdr.method = 1;
        hdr.alloc_site = (uintptr_t)&hdr;
        hdr.class_bytes = 2097152;
        if (hdr.magic != HAKMEM_MAGIC) abort();
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);
    printf("Header operations: %.1f ns/op\n", (double)ns / N);
    return 0;
}
```

**Compile & run**:

```bash
gcc -O2 -o test_header_speed test_header_speed.c
./test_header_speed
```

**Expected**: 30-50 ns/op

---

## 5. Syscall Tracing (Already Done)

### 5.1 Detailed Syscall Trace

```bash
# Full trace (warning: huge output)
strace -o hakmem_full.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Repeat with --allocator mimalloc and -o mimalloc_full.strace for Section 5.2

# Count only
strace -c -o hakmem_summary.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

cat hakmem_summary.strace
```

### 5.2 Compare Syscall Patterns

```bash
# Extract mmap calls
grep "mmap" hakmem_full.strace > hakmem_mmap.txt
grep "mmap" mimalloc_full.strace > mimalloc_mmap.txt

# Compare first 10
echo "=== hakmem mmap ==="
head -10 hakmem_mmap.txt
echo ""
echo "=== mimalloc mmap ==="
head -10 mimalloc_mmap.txt
```

**Look for**:
- Same syscall arguments? (flags, prot, etc.)
- Same order of operations?
- Any hakmem-specific patterns?

---

## 6. Memory Layout Analysis

### 6.1 /proc/self/maps

**File**: `dump_maps.c`

```c
#include <stdio.h>
#include <stdlib.h>
#include "hakmem.h"

int main(void) {
    // Allocate 10 blocks
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = hak_alloc_cs(2097152);
    }

    // Dump memory map
    system("cat /proc/self/maps | grep -E 'rw-p.*anon'");

    // Free all
    for (int i = 0; i < 10; i++) {
        hak_free_cs(ptrs[i], 2097152);
    }
    return 0;
}
```

**Compile & run**:

```bash
gcc -O2 -o dump_maps dump_maps.c hakmem.c hakmem_bigcache.c hakmem_elo.c hakmem_batch.c -I. -lm
./dump_maps
```

**Look for**:
- Fragmentation (many small mappings vs few large ones)
- Address space layout (sequential vs scattered)

---

## 7. Comparative Analysis Script

**File**: `compare_all.sh`

```bash
#!/bin/bash
echo "Running comparative profiling..."

# 1. Feature isolation
echo ""
echo "=== Feature Isolation ==="
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 3 --output minimal.csv
echo "Baseline: 37602 ns"
# NR==2 is the median of the 3 sorted run times
echo "Minimal:  $(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==2{print $1}') ns"

# 2. perf stat
echo ""
echo "=== perf stat (hakmem) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

echo ""
echo "=== perf stat (mimalloc) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

# 3. Micro-benchmarks
echo ""
echo "=== Micro-benchmarks ==="
./test_bigcache_speed 2>/dev/null
./test_elo_speed 2>/dev/null
./test_header_speed 2>/dev/null

echo ""
echo "Done. See above for results."
```

**Run**:

```bash
bash compare_all.sh > profiling_results.txt 2>&1
cat profiling_results.txt
```

---

## 8. Expected Results Summary

| Test | Expected Result | Interpretation |
|------|-----------------|----------------|
| **Feature isolation** | -350 ns total | Features contribute < 1% overhead |
| **MINIMAL mode** | 37,252 ns | Still +86% vs mimalloc → structural gap |
| **perf top** | 60-70% in mmap | Syscalls dominate, but equally for both |
| **Cache misses** | 5-10% L1 miss | Slightly worse than mimalloc |
| **BigCache micro** | 50-100 ns | Hash lookup overhead |
| **ELO micro** | 100-200 ns | Strategy selection overhead |
| **Header micro** | 30-50 ns | Metadata overhead |

**Key validation**: If MINIMAL mode still shows the +86% gap, then the **allocation model** (not the features) is the bottleneck.

---

## 9. Next Steps

After profiling:

1. ✅ **Validate overhead breakdown** (does it match the Phase 6.7 analysis?)
2. ✅ **Identify unexpected hotspots** (any perf report surprises?)
3. ✅ **Document findings** (update PHASE_6.7_OVERHEAD_ANALYSIS.md)
4. ✅ **Decide on optimizations** (Priority 0/1/2 from the analysis)

**If the analysis is correct**: Move to Phase 7 (focus on learning, not speed)
**If the analysis is wrong**: Investigate the new bottlenecks revealed by profiling

---

**End of Profiling Guide** 🔬