hakmem Profiling Guide
Purpose: Practical commands for validating Phase 6.7 overhead analysis
Date: 2025-10-21
1. Feature Isolation Testing
1.1 Add Environment Variable Support
File: hakmem.c
Add at top of file:
// Environment variable control (for profiling)
// (requires <stdlib.h> for getenv and <stdio.h> for fprintf)
static int g_disable_bigcache = 0;
static int g_disable_elo = 0;
static int g_minimal_mode = 0;
void hak_init_env_vars(void) {
g_disable_bigcache = getenv("HAKMEM_DISABLE_BIGCACHE") ? 1 : 0;
g_disable_elo = getenv("HAKMEM_DISABLE_ELO") ? 1 : 0;
g_minimal_mode = getenv("HAKMEM_MINIMAL") ? 1 : 0;
if (g_minimal_mode) {
g_disable_bigcache = 1;
g_disable_elo = 1;
}
if (g_disable_bigcache) fprintf(stderr, "[hakmem] BigCache disabled (profiling mode)\n");
if (g_disable_elo) fprintf(stderr, "[hakmem] ELO disabled (fixed 2MB threshold)\n");
if (g_minimal_mode) fprintf(stderr, "[hakmem] Minimal mode (all features OFF)\n");
}
Modify hak_alloc_at():
void* hak_alloc_at(size_t size, hak_callsite_t site) {
static int initialized = 0;  // NOTE: not thread-safe; acceptable for single-threaded profiling runs
if (!initialized) {
hak_init_env_vars();
initialized = 1;
}
// ... existing code, but wrap features in checks:
// ELO selection (skip if disabled)
if (!g_disable_elo && !hak_evo_is_frozen()) {
strategy_id = hak_elo_select_strategy();
threshold = hak_elo_get_threshold(strategy_id);
} else {
threshold = 2097152; // Fixed 2MB threshold
}
// BigCache lookup (skip if disabled)
if (!g_disable_bigcache && size >= 1048576) {
void* cached_ptr = NULL;
if (hak_bigcache_try_get(size, site, &cached_ptr)) {
return cached_ptr;
}
}
// ... rest unchanged
}
1.2 Run Feature Isolation Tests
Baseline (all features):
make clean && make bench
bash bench_runner.sh --warmup 2 --runs 10 --output baseline.csv
grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
Expected: ~37,602 ns
No BigCache:
HAKMEM_DISABLE_BIGCACHE=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_bigcache.csv
grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
Expected: ~37,552 ns (-50 ns)
No ELO:
HAKMEM_DISABLE_ELO=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_elo.csv
grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
Expected: ~37,452 ns (-150 ns)
FROZEN mode:
HAKMEM_EVO_POLICY=frozen bash bench_runner.sh --warmup 2 --runs 10 --output frozen.csv
grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
Expected: ~37,452 ns (-150 ns, same as No ELO)
MINIMAL mode:
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 10 --output minimal.csv
grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
Expected: ~37,252 ns (-350 ns total)
Analysis script:
echo "Configuration,Median,Delta"
echo "Baseline,$(grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),0"
echo "No BigCache,$(grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-50"
echo "No ELO,$(grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "FROZEN,$(grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "MINIMAL,$(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-350"
2. Profiling with perf
2.1 Compile with Debug Symbols
make clean
make CFLAGS="-g -O2 -fno-omit-frame-pointer"   # overrides the Makefile's CFLAGS; re-add any required -D/-I flags
2.2 Run perf record
Single run (quick check):
perf record -g -e cycles:u \
./bench_allocators \
--allocator hakmem-evolving \
--scenario vm \
--iterations 100
perf report --stdio > perf_hakmem.txt
Expected top functions (from analysis):
- alloc_mmap / mmap: 60-70% (syscall overhead)
- hak_elo_select_strategy: 5-10% (strategy selection)
- hak_bigcache_try_get: 3-5% (cache lookup)
- memset / memcpy: 10-15% (memory initialization)
Validation:
grep -A 20 "Overhead" perf_hakmem.txt | head -25
2.3 Compare with mimalloc
# mimalloc profiling
perf record -g -e cycles:u \
./bench_allocators \
--allocator mimalloc \
--scenario vm \
--iterations 100
perf report --stdio > perf_mimalloc.txt
# Compare top functions
echo "=== hakmem top 10 ==="
grep -A 30 "Overhead" perf_hakmem.txt | head -35 | tail -10
echo ""
echo "=== mimalloc top 10 ==="
grep -A 30 "Overhead" perf_mimalloc.txt | head -35 | tail -10
Expected differences:
- hakmem: more time in hak_elo_select_strategy and hak_bigcache_try_get
- mimalloc: more time in mi_alloc_huge (internal allocator), less overall overhead
2.4 Annotate Source Code
perf annotate hak_alloc_at > annotate_hak_alloc_at.txt
cat annotate_hak_alloc_at.txt
Look for:
- Hot lines (high % samples)
- Branch mispredictions
- Cache misses
3. Cache Performance Analysis
3.1 Cache Miss Profiling
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
./bench_allocators --allocator mimalloc --scenario vm --iterations 100
Expected metrics:
| Metric | hakmem | mimalloc | Analysis |
|---|---|---|---|
| IPC (instructions/cycle) | ~1.5 | ~2.0 | mimalloc more efficient |
| L1 miss rate | 5-10% | 2-5% | hakmem more cache-unfriendly |
| Cache references | Higher | Lower | hakmem touches more memory |
3.2 TLB Profiling
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
./bench_allocators --allocator mimalloc --scenario vm --iterations 100
Expected:
- Similar TLB miss rates (both use 2MB pages)
- Slight advantage to mimalloc (better locality)
4. Micro-benchmarks
4.1 BigCache Lookup Speed
File: test_bigcache_speed.c
#include "hakmem_bigcache.h"
#include <stdint.h>   // uint64_t, uintptr_t
#include <stdio.h>
#include <time.h>
int main() {
hak_bigcache_init();
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
const int N = 1000000;
for (int i = 0; i < N; i++) {
void* ptr = NULL;
hak_bigcache_try_get(2097152, (uintptr_t)i, &ptr);
}
clock_gettime(CLOCK_MONOTONIC, &end);
uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
(end.tv_nsec - start.tv_nsec);
printf("BigCache lookup: %.1f ns/op\n", (double)ns / N);
hak_bigcache_shutdown();
return 0;
}
Compile & run:
gcc -O2 -o test_bigcache_speed test_bigcache_speed.c hakmem_bigcache.c -I.
./test_bigcache_speed
Expected: 50-100 ns/op
4.2 ELO Selection Speed
File: test_elo_speed.c
#include "hakmem_elo.h"
#include <stdint.h>   // uint64_t
#include <stdio.h>
#include <time.h>
int main() {
hak_elo_init();
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
const int N = 1000000;
for (int i = 0; i < N; i++) {
int strategy = hak_elo_select_strategy();
(void)strategy;
}
clock_gettime(CLOCK_MONOTONIC, &end);
uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
(end.tv_nsec - start.tv_nsec);
printf("ELO selection: %.1f ns/op\n", (double)ns / N);
hak_elo_shutdown();
return 0;
}
Compile & run:
gcc -O2 -o test_elo_speed test_elo_speed.c hakmem_elo.c -I. -lm
./test_elo_speed
Expected: 100-200 ns/op
4.3 Header Operations Speed
File: test_header_speed.c
#include <stdint.h>   // uint32_t, uintptr_t; hakmem.h is not needed since the header struct is defined locally
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
typedef struct {
uint32_t magic;
uint32_t method;
size_t requested_size;
size_t actual_size;
uintptr_t alloc_site;
size_t class_bytes;
} AllocHeader;
#define HAKMEM_MAGIC 0x48414B4D
int main() {
struct timespec start, end;
volatile uint32_t sink = 0;  // keeps the loop from being optimized away at -O2
clock_gettime(CLOCK_MONOTONIC, &start);
const int N = 1000000;
for (int i = 0; i < N; i++) {
AllocHeader hdr;
hdr.magic = HAKMEM_MAGIC;
hdr.method = 1;
hdr.alloc_site = (uintptr_t)&hdr;
hdr.class_bytes = 2097152;
if (hdr.magic != HAKMEM_MAGIC) abort();
sink += hdr.magic;
}
clock_gettime(CLOCK_MONOTONIC, &end);
uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
(end.tv_nsec - start.tv_nsec);
printf("Header operations: %.1f ns/op\n", (double)ns / N);
return 0;
}
Compile & run:
gcc -O2 -o test_header_speed test_header_speed.c
./test_header_speed
Expected: 30-50 ns/op
5. Syscall Tracing (Already Done)
5.1 Detailed Syscall Trace
# Full trace (warning: huge output)
strace -o hakmem_full.strace \
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
# Count only
strace -c -o hakmem_summary.strace \
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
cat hakmem_summary.strace
5.2 Compare Syscall Patterns
# Extract mmap calls
grep "mmap" hakmem_full.strace > hakmem_mmap.txt
grep "mmap" mimalloc_full.strace > mimalloc_mmap.txt
# Compare first 10
echo "=== hakmem mmap ==="
head -10 hakmem_mmap.txt
echo ""
echo "=== mimalloc mmap ==="
head -10 mimalloc_mmap.txt
Look for:
- Same syscall arguments? (flags, prot, etc.)
- Same order of operations?
- Any hakmem-specific patterns?
6. Memory Layout Analysis
6.1 /proc/self/maps
File: dump_maps.c
#include <stdio.h>
#include <stdlib.h>
#include "hakmem.h"
int main() {
// Allocate 10 blocks
void* ptrs[10];
for (int i = 0; i < 10; i++) {
ptrs[i] = hak_alloc_cs(2097152);
}
// Dump memory map
system("grep rw-p /proc/self/maps | grep -v /");  // anonymous rw mappings usually have an empty pathname
// Free all
for (int i = 0; i < 10; i++) {
hak_free_cs(ptrs[i], 2097152);
}
return 0;
}
Compile & run:
gcc -O2 -o dump_maps dump_maps.c hakmem.c hakmem_bigcache.c hakmem_elo.c hakmem_batch.c -I. -lm
./dump_maps
Look for:
- Fragmentation (many small mappings vs few large ones)
- Address space layout (sequential vs scattered)
7. Comparative Analysis Script
File: compare_all.sh
#!/bin/bash
echo "Running comparative profiling..."
# 1. Feature isolation
echo ""
echo "=== Feature Isolation ==="
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 3 --output minimal.csv
echo "Baseline: 37602 ns"
echo "Minimal: $(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==2{print $1}') ns"
# 2. perf stat
echo ""
echo "=== perf stat (hakmem) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"
echo ""
echo "=== perf stat (mimalloc) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
./bench_allocators --allocator mimalloc --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"
# 3. Micro-benchmarks
echo ""
echo "=== Micro-benchmarks ==="
./test_bigcache_speed 2>/dev/null
./test_elo_speed 2>/dev/null
./test_header_speed 2>/dev/null
echo ""
echo "Done. See above for results."
Run:
bash compare_all.sh > profiling_results.txt 2>&1
cat profiling_results.txt
8. Expected Results Summary
| Test | Expected Result | Interpretation |
|---|---|---|
| Feature isolation | -350 ns total | Features contribute < 1% overhead |
| MINIMAL mode | 37,252 ns | Still +86% vs mimalloc → structural gap |
| perf top | 60-70% in mmap | Syscalls dominate, but equal for both |
| Cache misses | 5-10% L1 miss | Slightly worse than mimalloc |
| BigCache micro | 50-100 ns | Hash lookup overhead |
| ELO micro | 100-200 ns | Strategy selection overhead |
| Header micro | 30-50 ns | Metadata overhead |
Key validation: If MINIMAL mode still has +86% gap, then allocation model (not features) is the bottleneck.
9. Next Steps
After profiling:
- ✅ Validate overhead breakdown (does it match Phase 6.7 analysis?)
- ✅ Identify unexpected hotspots (perf report surprises?)
- ✅ Document findings (update PHASE_6.7_OVERHEAD_ANALYSIS.md)
- ✅ Decide on optimizations (Priority 0/1/2 from analysis)
If analysis is correct: Move to Phase 7 (focus on learning, not speed)
If analysis is wrong: Investigate new bottlenecks revealed by profiling
End of Profiling Guide 🔬