Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
530 lines
13 KiB
Markdown
530 lines
13 KiB
Markdown
# hakmem Profiling Guide
|
|
|
|
**Purpose**: Practical commands for validating Phase 6.7 overhead analysis
|
|
|
|
**Date**: 2025-10-21
|
|
|
|
---
|
|
|
|
## 1. Feature Isolation Testing
|
|
|
|
### 1.1 Add Environment Variable Support
|
|
|
|
**File**: `hakmem.c`
|
|
|
|
Add at top of file:
|
|
```c
|
|
// Environment variable control (for profiling)
|
|
static int g_disable_bigcache = 0;
|
|
static int g_disable_elo = 0;
|
|
static int g_minimal_mode = 0;
|
|
|
|
void hak_init_env_vars(void) {
|
|
g_disable_bigcache = getenv("HAKMEM_DISABLE_BIGCACHE") ? 1 : 0;
|
|
g_disable_elo = getenv("HAKMEM_DISABLE_ELO") ? 1 : 0;
|
|
g_minimal_mode = getenv("HAKMEM_MINIMAL") ? 1 : 0;
|
|
|
|
if (g_minimal_mode) {
|
|
g_disable_bigcache = 1;
|
|
g_disable_elo = 1;
|
|
}
|
|
|
|
if (g_disable_bigcache) fprintf(stderr, "[hakmem] BigCache disabled (profiling mode)\n");
|
|
if (g_disable_elo) fprintf(stderr, "[hakmem] ELO disabled (fixed 2MB threshold)\n");
|
|
if (g_minimal_mode) fprintf(stderr, "[hakmem] Minimal mode (all features OFF)\n");
|
|
}
|
|
```
|
|
|
|
Modify `hak_alloc_at()`:
|
|
```c
|
|
void* hak_alloc_at(size_t size, hak_callsite_t site) {
|
|
static int initialized = 0;
|
|
if (!initialized) {
|
|
hak_init_env_vars();
|
|
initialized = 1;
|
|
}
|
|
|
|
// ... existing code, but wrap features in checks:
|
|
|
|
// ELO selection (skip if disabled)
|
|
if (!g_disable_elo && !hak_evo_is_frozen()) {
|
|
strategy_id = hak_elo_select_strategy();
|
|
threshold = hak_elo_get_threshold(strategy_id);
|
|
} else {
|
|
threshold = 2097152; // Fixed 2MB threshold
|
|
}
|
|
|
|
// BigCache lookup (skip if disabled)
|
|
if (!g_disable_bigcache && size >= 1048576) {
|
|
void* cached_ptr = NULL;
|
|
if (hak_bigcache_try_get(size, site, &cached_ptr)) {
|
|
return cached_ptr;
|
|
}
|
|
}
|
|
|
|
// ... rest unchanged
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
### 1.2 Run Feature Isolation Tests
|
|
|
|
**Baseline** (all features):
|
|
```bash
|
|
make clean && make bench
|
|
bash bench_runner.sh --warmup 2 --runs 10 --output baseline.csv
|
|
grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
|
|
```
|
|
Expected: ~37,602 ns
|
|
|
|
**No BigCache**:
|
|
```bash
|
|
HAKMEM_DISABLE_BIGCACHE=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_bigcache.csv
|
|
grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
|
|
```
|
|
Expected: ~37,552 ns (-50 ns)
|
|
|
|
**No ELO**:
|
|
```bash
|
|
HAKMEM_DISABLE_ELO=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_elo.csv
|
|
grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
|
|
```
|
|
Expected: ~37,452 ns (-150 ns)
|
|
|
|
**FROZEN mode**:
|
|
```bash
|
|
HAKMEM_EVO_POLICY=frozen bash bench_runner.sh --warmup 2 --runs 10 --output frozen.csv
|
|
grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
|
|
```
|
|
Expected: ~37,452 ns (-150 ns, same as No ELO)
|
|
|
|
**MINIMAL mode**:
|
|
```bash
|
|
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 10 --output minimal.csv
|
|
grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
|
|
```
|
|
Expected: ~37,252 ns (-350 ns total)
|
|
|
|
**Analysis script**:
|
|
```bash
|
|
echo "Configuration,Median,Delta"
|
|
echo "Baseline,$(grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),0"
|
|
echo "No BigCache,$(grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-50"
|
|
echo "No ELO,$(grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
|
|
echo "FROZEN,$(grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
|
|
echo "MINIMAL,$(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-350"
|
|
```
|
|
|
|
---
|
|
|
|
## 2. Profiling with perf
|
|
|
|
### 2.1 Compile with Debug Symbols
|
|
|
|
```bash
|
|
make clean
|
|
make CFLAGS="-g -O2 -fno-omit-frame-pointer"
|
|
```
|
|
|
|
### 2.2 Run perf record
|
|
|
|
**Single run** (quick check):
|
|
```bash
|
|
perf record -g -e cycles:u \
|
|
./bench_allocators \
|
|
--allocator hakmem-evolving \
|
|
--scenario vm \
|
|
--iterations 100
|
|
|
|
perf report --stdio > perf_hakmem.txt
|
|
```
|
|
|
|
**Expected top functions** (from analysis):
|
|
1. `alloc_mmap` / `mmap` - 60-70% (syscall overhead)
|
|
2. `hak_elo_select_strategy` - 5-10% (strategy selection)
|
|
3. `hak_bigcache_try_get` - 3-5% (cache lookup)
|
|
4. `memset` / `memcpy` - 10-15% (memory initialization)
|
|
|
|
**Validation**:
|
|
```bash
|
|
grep -A 20 "Overhead" perf_hakmem.txt | head -25
|
|
```
|
|
|
|
---
|
|
|
|
### 2.3 Compare with mimalloc
|
|
|
|
```bash
|
|
# mimalloc profiling
|
|
perf record -g -e cycles:u \
|
|
./bench_allocators \
|
|
--allocator mimalloc \
|
|
--scenario vm \
|
|
--iterations 100
|
|
|
|
perf report --stdio > perf_mimalloc.txt
|
|
|
|
# Compare top functions
|
|
echo "=== hakmem top 10 ==="
|
|
grep -A 30 "Overhead" perf_hakmem.txt | head -35 | tail -10
|
|
|
|
echo ""
|
|
echo "=== mimalloc top 10 ==="
|
|
grep -A 30 "Overhead" perf_mimalloc.txt | head -35 | tail -10
|
|
```
|
|
|
|
**Expected differences**:
|
|
- hakmem: More time in `hak_elo_select_strategy`, `hak_bigcache_try_get`
|
|
- mimalloc: More time in `mi_alloc_huge` (internal allocator), less overhead
|
|
|
|
---
|
|
|
|
### 2.4 Annotate Source Code
|
|
|
|
```bash
|
|
perf annotate hak_alloc_at > annotate_hak_alloc_at.txt
|
|
cat annotate_hak_alloc_at.txt
|
|
```
|
|
|
|
**Look for**:
|
|
- Hot lines (high % samples)
|
|
- Branch mispredictions
|
|
- Cache misses
|
|
|
|
---
|
|
|
|
## 3. Cache Performance Analysis
|
|
|
|
### 3.1 Cache Miss Profiling
|
|
|
|
```bash
|
|
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
|
|
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
|
|
|
|
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
|
|
./bench_allocators --allocator mimalloc --scenario vm --iterations 100
|
|
```
|
|
|
|
**Expected metrics**:
|
|
| Metric | hakmem | mimalloc | Analysis |
|
|
|--------|--------|----------|----------|
|
|
| IPC (instructions/cycle) | ~1.5 | ~2.0 | mimalloc more efficient |
|
|
| L1 miss rate | 5-10% | 2-5% | hakmem more cache-unfriendly |
|
|
| Cache references | Higher | Lower | hakmem touches more memory |
|
|
|
|
---
|
|
|
|
### 3.2 TLB Profiling
|
|
|
|
```bash
|
|
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
|
|
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
|
|
|
|
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
|
|
./bench_allocators --allocator mimalloc --scenario vm --iterations 100
|
|
```
|
|
|
|
**Expected**:
|
|
- Similar TLB miss rates (both use 2MB pages)
|
|
- Slight advantage to mimalloc (better locality)
|
|
|
|
---
|
|
|
|
## 4. Micro-benchmarks
|
|
|
|
### 4.1 BigCache Lookup Speed
|
|
|
|
**File**: `test_bigcache_speed.c`
|
|
```c
|
|
#include "hakmem_bigcache.h"
|
|
#include <time.h>
|
|
#include <stdio.h>
|
|
|
|
int main() {
|
|
hak_bigcache_init();
|
|
|
|
struct timespec start, end;
|
|
clock_gettime(CLOCK_MONOTONIC, &start);
|
|
|
|
const int N = 1000000;
|
|
for (int i = 0; i < N; i++) {
|
|
void* ptr = NULL;
|
|
hak_bigcache_try_get(2097152, (uintptr_t)i, &ptr);
|
|
}
|
|
|
|
clock_gettime(CLOCK_MONOTONIC, &end);
|
|
uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
|
|
(end.tv_nsec - start.tv_nsec);
|
|
|
|
printf("BigCache lookup: %.1f ns/op\n", (double)ns / N);
|
|
|
|
hak_bigcache_shutdown();
|
|
return 0;
|
|
}
|
|
```
|
|
|
|
**Compile & run**:
|
|
```bash
|
|
gcc -O2 -o test_bigcache_speed test_bigcache_speed.c hakmem_bigcache.c -I.
|
|
./test_bigcache_speed
|
|
```
|
|
|
|
**Expected**: 50-100 ns/op
|
|
|
|
---
|
|
|
|
### 4.2 ELO Selection Speed
|
|
|
|
**File**: `test_elo_speed.c`
|
|
```c
|
|
#include "hakmem_elo.h"
|
|
#include <time.h>
|
|
#include <stdio.h>
|
|
|
|
int main() {
|
|
hak_elo_init();
|
|
|
|
struct timespec start, end;
|
|
clock_gettime(CLOCK_MONOTONIC, &start);
|
|
|
|
const int N = 1000000;
|
|
for (int i = 0; i < N; i++) {
|
|
int strategy = hak_elo_select_strategy();
|
|
(void)strategy;
|
|
}
|
|
|
|
clock_gettime(CLOCK_MONOTONIC, &end);
|
|
uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
|
|
(end.tv_nsec - start.tv_nsec);
|
|
|
|
printf("ELO selection: %.1f ns/op\n", (double)ns / N);
|
|
|
|
hak_elo_shutdown();
|
|
return 0;
|
|
}
|
|
```
|
|
|
|
**Compile & run**:
|
|
```bash
|
|
gcc -O2 -o test_elo_speed test_elo_speed.c hakmem_elo.c -I. -lm
|
|
./test_elo_speed
|
|
```
|
|
|
|
**Expected**: 100-200 ns/op
|
|
|
|
---
|
|
|
|
### 4.3 Header Operations Speed
|
|
|
|
**File**: `test_header_speed.c`
|
|
```c
|
|
#include "hakmem.h"
|
|
#include <time.h>
|
|
#include <stdio.h>
|
|
#include <stdlib.h>
|
|
|
|
typedef struct {
|
|
uint32_t magic;
|
|
uint32_t method;
|
|
size_t requested_size;
|
|
size_t actual_size;
|
|
uintptr_t alloc_site;
|
|
size_t class_bytes;
|
|
} AllocHeader;
|
|
|
|
#define HAKMEM_MAGIC 0x48414B4D
|
|
|
|
int main() {
|
|
struct timespec start, end;
|
|
clock_gettime(CLOCK_MONOTONIC, &start);
|
|
|
|
const int N = 1000000;
|
|
for (int i = 0; i < N; i++) {
|
|
AllocHeader hdr;
|
|
hdr.magic = HAKMEM_MAGIC;
|
|
hdr.method = 1;
|
|
hdr.alloc_site = (uintptr_t)&hdr;
|
|
hdr.class_bytes = 2097152;
|
|
|
|
if (hdr.magic != HAKMEM_MAGIC) abort();
|
|
}
|
|
|
|
clock_gettime(CLOCK_MONOTONIC, &end);
|
|
uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
|
|
(end.tv_nsec - start.tv_nsec);
|
|
|
|
printf("Header operations: %.1f ns/op\n", (double)ns / N);
|
|
|
|
return 0;
|
|
}
|
|
```
|
|
|
|
**Compile & run**:
|
|
```bash
|
|
gcc -O2 -o test_header_speed test_header_speed.c
|
|
./test_header_speed
|
|
```
|
|
|
|
**Expected**: 30-50 ns/op
|
|
|
|
---
|
|
|
|
## 5. Syscall Tracing (Already Done)
|
|
|
|
### 5.1 Detailed Syscall Trace
|
|
|
|
```bash
|
|
# Full trace (warning: huge output)
|
|
strace -o hakmem_full.strace \
|
|
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
|
|
|
|
# Count only
|
|
strace -c -o hakmem_summary.strace \
|
|
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10
|
|
|
|
cat hakmem_summary.strace
|
|
```
|
|
|
|
### 5.2 Compare Syscall Patterns
|
|
|
|
```bash
|
|
# Extract mmap calls
|
|
grep "mmap" hakmem_full.strace > hakmem_mmap.txt
|
|
grep "mmap" mimalloc_full.strace > mimalloc_mmap.txt
|
|
|
|
# Compare first 10
|
|
echo "=== hakmem mmap ==="
|
|
head -10 hakmem_mmap.txt
|
|
|
|
echo ""
|
|
echo "=== mimalloc mmap ==="
|
|
head -10 mimalloc_mmap.txt
|
|
```
|
|
|
|
**Look for**:
|
|
- Same syscall arguments? (flags, prot, etc.)
|
|
- Same order of operations?
|
|
- Any hakmem-specific patterns?
|
|
|
|
---
|
|
|
|
## 6. Memory Layout Analysis
|
|
|
|
### 6.1 /proc/self/maps
|
|
|
|
**File**: `dump_maps.c`
|
|
```c
|
|
#include <stdio.h>
|
|
#include <stdlib.h>
|
|
#include "hakmem.h"
|
|
|
|
int main() {
|
|
// Allocate 10 blocks
|
|
void* ptrs[10];
|
|
for (int i = 0; i < 10; i++) {
|
|
ptrs[i] = hak_alloc_cs(2097152);
|
|
}
|
|
|
|
// Dump memory map
|
|
system("cat /proc/self/maps | grep -E 'rw-p.*anon'");
|
|
|
|
// Free all
|
|
for (int i = 0; i < 10; i++) {
|
|
hak_free_cs(ptrs[i], 2097152);
|
|
}
|
|
|
|
return 0;
|
|
}
|
|
```
|
|
|
|
**Compile & run**:
|
|
```bash
|
|
gcc -O2 -o dump_maps dump_maps.c hakmem.c hakmem_bigcache.c hakmem_elo.c hakmem_batch.c -I. -lm
|
|
./dump_maps
|
|
```
|
|
|
|
**Look for**:
|
|
- Fragmentation (many small mappings vs few large ones)
|
|
- Address space layout (sequential vs scattered)
|
|
|
|
---
|
|
|
|
## 7. Comparative Analysis Script
|
|
|
|
**File**: `compare_all.sh`
|
|
```bash
|
|
#!/bin/bash
|
|
|
|
echo "Running comparative profiling..."
|
|
|
|
# 1. Feature isolation
|
|
echo ""
|
|
echo "=== Feature Isolation ==="
|
|
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 3 --output minimal.csv
|
|
echo "Baseline: 37602 ns"
|
|
echo "Minimal: $(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==2{print $1}') ns"
|
|
|
|
# 2. perf stat
|
|
echo ""
|
|
echo "=== perf stat (hakmem) ==="
|
|
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
|
|
./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"
|
|
|
|
echo ""
|
|
echo "=== perf stat (mimalloc) ==="
|
|
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
|
|
./bench_allocators --allocator mimalloc --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"
|
|
|
|
# 3. Micro-benchmarks
|
|
echo ""
|
|
echo "=== Micro-benchmarks ==="
|
|
./test_bigcache_speed 2>/dev/null
|
|
./test_elo_speed 2>/dev/null
|
|
./test_header_speed 2>/dev/null
|
|
|
|
echo ""
|
|
echo "Done. See above for results."
|
|
```
|
|
|
|
**Run**:
|
|
```bash
|
|
bash compare_all.sh > profiling_results.txt 2>&1
|
|
cat profiling_results.txt
|
|
```
|
|
|
|
---
|
|
|
|
## 8. Expected Results Summary
|
|
|
|
| Test | Expected Result | Interpretation |
|
|
|------|-----------------|----------------|
|
|
| **Feature isolation** | -350 ns total | Features contribute < 1% overhead |
|
|
| **MINIMAL mode** | 37,252 ns | Still +86% vs mimalloc → structural gap |
|
|
| **perf top** | 60-70% in mmap | Syscalls dominate, but equal for both |
|
|
| **Cache misses** | 5-10% L1 miss | Slightly worse than mimalloc |
|
|
| **BigCache micro** | 50-100 ns | Hash lookup overhead |
|
|
| **ELO micro** | 100-200 ns | Strategy selection overhead |
|
|
| **Header micro** | 30-50 ns | Metadata overhead |
|
|
|
|
**Key validation**: If MINIMAL mode still has +86% gap, then **allocation model** (not features) is the bottleneck.
|
|
|
|
---
|
|
|
|
## 9. Next Steps
|
|
|
|
After profiling:
|
|
|
|
1. ✅ **Validate overhead breakdown** (does it match Phase 6.7 analysis?)
|
|
2. ✅ **Identify unexpected hotspots** (perf report surprises?)
|
|
3. ✅ **Document findings** (update PHASE_6.7_OVERHEAD_ANALYSIS.md)
|
|
4. ✅ **Decide on optimizations** (Priority 0/1/2 from analysis)
|
|
|
|
**If analysis is correct**: Move to Phase 7 (focus on learning, not speed)
|
|
|
|
**If analysis is wrong**: Investigate new bottlenecks revealed by profiling
|
|
|
|
---
|
|
|
|
**End of Profiling Guide** 🔬
|