hakmem/docs/archive/PROFILING_GUIDE.md

# hakmem Profiling Guide

**Purpose**: Practical commands for validating Phase 6.7 overhead analysis

**Date**: 2025-10-21

---

## 1. Feature Isolation Testing

### 1.1 Add Environment Variable Support

**File**: `hakmem.c`

Add at top of file:
```c
// Environment variable control (for profiling)
static int g_disable_bigcache = 0;
static int g_disable_elo = 0;
static int g_minimal_mode = 0;

void hak_init_env_vars(void) {
    g_disable_bigcache = getenv("HAKMEM_DISABLE_BIGCACHE") ? 1 : 0;
    g_disable_elo = getenv("HAKMEM_DISABLE_ELO") ? 1 : 0;
    g_minimal_mode = getenv("HAKMEM_MINIMAL") ? 1 : 0;

    if (g_minimal_mode) {
        g_disable_bigcache = 1;
        g_disable_elo = 1;
    }

    if (g_disable_bigcache) fprintf(stderr, "[hakmem] BigCache disabled (profiling mode)\n");
    if (g_disable_elo) fprintf(stderr, "[hakmem] ELO disabled (fixed 2MB threshold)\n");
    if (g_minimal_mode) fprintf(stderr, "[hakmem] Minimal mode (all features OFF)\n");
}
```

Modify `hak_alloc_at()`:
```c
void* hak_alloc_at(size_t size, hak_callsite_t site) {
    static int initialized = 0;
    if (!initialized) {
        hak_init_env_vars();
        initialized = 1;
    }

    // ... existing code, but wrap features in checks:

    // ELO selection (skip if disabled)
    if (!g_disable_elo && !hak_evo_is_frozen()) {
        strategy_id = hak_elo_select_strategy();
        threshold = hak_elo_get_threshold(strategy_id);
    } else {
        threshold = 2097152;  // Fixed 2MB threshold
    }

    // BigCache lookup (skip if disabled)
    if (!g_disable_bigcache && size >= 1048576) {
        void* cached_ptr = NULL;
        if (hak_bigcache_try_get(size, site, &cached_ptr)) {
            return cached_ptr;
        }
    }

    // ... rest unchanged
}
```

---

### 1.2 Run Feature Isolation Tests

**Baseline** (all features):
```bash
make clean && make bench
bash bench_runner.sh --warmup 2 --runs 10 --output baseline.csv
grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```
Expected: ~37,602 ns

**No BigCache**:
```bash
HAKMEM_DISABLE_BIGCACHE=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_bigcache.csv
grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```
Expected: ~37,552 ns (-50 ns)

**No ELO**:
```bash
HAKMEM_DISABLE_ELO=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_elo.csv
grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```
Expected: ~37,452 ns (-150 ns)

**FROZEN mode**:
```bash
HAKMEM_EVO_POLICY=frozen bash bench_runner.sh --warmup 2 --runs 10 --output frozen.csv
grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```
Expected: ~37,452 ns (-150 ns, same as No ELO)

**MINIMAL mode**:
```bash
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 10 --output minimal.csv
grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'
```
Expected: ~37,252 ns (-350 ns total)

**Analysis script**:
```bash
echo "Configuration,Median,Delta"
echo "Baseline,$(grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),0"
echo "No BigCache,$(grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-50"
echo "No ELO,$(grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "FROZEN,$(grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "MINIMAL,$(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-350"
```

---

## 2. Profiling with perf

### 2.1 Compile with Debug Symbols

```bash
make clean
make CFLAGS="-g -O2 -fno-omit-frame-pointer"
```

### 2.2 Run perf record

**Single run** (quick check):
```bash
perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator hakmem-evolving \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_hakmem.txt
```

**Expected top functions** (from analysis):
1. `alloc_mmap` / `mmap` - 60-70% (syscall overhead)
2. `hak_elo_select_strategy` - 5-10% (strategy selection)
3. `hak_bigcache_try_get` - 3-5% (cache lookup)
4. `memset` / `memcpy` - 10-15% (memory initialization)

**Validation**:
```bash
grep -A 20 "Overhead" perf_hakmem.txt | head -25
```

---

### 2.3 Compare with mimalloc

```bash
# mimalloc profiling
perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator mimalloc \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_mimalloc.txt

# Compare top functions
echo "=== hakmem top 10 ==="
grep -A 30 "Overhead" perf_hakmem.txt | head -35 | tail -10

echo ""
echo "=== mimalloc top 10 ==="
grep -A 30 "Overhead" perf_mimalloc.txt | head -35 | tail -10
```

**Expected differences**:
- hakmem: More time in `hak_elo_select_strategy`, `hak_bigcache_try_get`
- mimalloc: More time in `mi_alloc_huge` (internal allocator), less overhead

---

### 2.4 Annotate Source Code

```bash
perf annotate hak_alloc_at > annotate_hak_alloc_at.txt
cat annotate_hak_alloc_at.txt
```

**Look for**:
- Hot lines (high % samples)
- Branch mispredictions
- Cache misses

---

## 3. Cache Performance Analysis

### 3.1 Cache Miss Profiling

```bash
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100
```

**Expected metrics**:
| Metric | hakmem | mimalloc | Analysis |
|--------|--------|----------|----------|
| IPC (instructions/cycle) | ~1.5 | ~2.0 | mimalloc more efficient |
| L1 miss rate | 5-10% | 2-5% | hakmem more cache-unfriendly |
| Cache references | Higher | Lower | hakmem touches more memory |

---

### 3.2 TLB Profiling

```bash
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100
```

**Expected**:
- Similar TLB miss rates (both use 2MB pages)
- Slight advantage to mimalloc (better locality)

---

## 4. Micro-benchmarks

### 4.1 BigCache Lookup Speed

**File**: `test_bigcache_speed.c`
```c
#include "hakmem_bigcache.h"
#include <time.h>
#include <stdio.h>

int main() {
    hak_bigcache_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        void* ptr = NULL;
        hak_bigcache_try_get(2097152, (uintptr_t)i, &ptr);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("BigCache lookup: %.1f ns/op\n", (double)ns / N);

    hak_bigcache_shutdown();
    return 0;
}
```

**Compile & run**:
```bash
gcc -O2 -o test_bigcache_speed test_bigcache_speed.c hakmem_bigcache.c -I.
./test_bigcache_speed
```

**Expected**: 50-100 ns/op

---

### 4.2 ELO Selection Speed

**File**: `test_elo_speed.c`
```c
#include "hakmem_elo.h"
#include <time.h>
#include <stdio.h>

int main() {
    hak_elo_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        int strategy = hak_elo_select_strategy();
        (void)strategy;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("ELO selection: %.1f ns/op\n", (double)ns / N);

    hak_elo_shutdown();
    return 0;
}
```

**Compile & run**:
```bash
gcc -O2 -o test_elo_speed test_elo_speed.c hakmem_elo.c -I. -lm
./test_elo_speed
```

**Expected**: 100-200 ns/op

---

### 4.3 Header Operations Speed

**File**: `test_header_speed.c`
```c
#include "hakmem.h"
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    uint32_t magic;
    uint32_t method;
    size_t   requested_size;
    size_t   actual_size;
    uintptr_t alloc_site;
    size_t   class_bytes;
} AllocHeader;

#define HAKMEM_MAGIC 0x48414B4D

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        AllocHeader hdr;
        hdr.magic = HAKMEM_MAGIC;
        hdr.method = 1;
        hdr.alloc_site = (uintptr_t)&hdr;
        hdr.class_bytes = 2097152;

        if (hdr.magic != HAKMEM_MAGIC) abort();
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("Header operations: %.1f ns/op\n", (double)ns / N);

    return 0;
}
```

**Compile & run**:
```bash
gcc -O2 -o test_header_speed test_header_speed.c
./test_header_speed
```

**Expected**: 30-50 ns/op

---

## 5. Syscall Tracing (Already Done)

### 5.1 Detailed Syscall Trace

```bash
# Full trace (warning: huge output)
strace -o hakmem_full.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Count only
strace -c -o hakmem_summary.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

cat hakmem_summary.strace
```

### 5.2 Compare Syscall Patterns

```bash
# Extract mmap calls
grep "mmap" hakmem_full.strace > hakmem_mmap.txt
grep "mmap" mimalloc_full.strace > mimalloc_mmap.txt

# Compare first 10
echo "=== hakmem mmap ==="
head -10 hakmem_mmap.txt

echo ""
echo "=== mimalloc mmap ==="
head -10 mimalloc_mmap.txt
```

**Look for**:
- Same syscall arguments? (flags, prot, etc.)
- Same order of operations?
- Any hakmem-specific patterns?

---

## 6. Memory Layout Analysis

### 6.1 /proc/self/maps

**File**: `dump_maps.c`
```c
#include <stdio.h>
#include <stdlib.h>
#include "hakmem.h"

int main() {
    // Allocate 10 blocks
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = hak_alloc_cs(2097152);
    }

    // Dump memory map
    system("cat /proc/self/maps | grep -E 'rw-p.*anon'");

    // Free all
    for (int i = 0; i < 10; i++) {
        hak_free_cs(ptrs[i], 2097152);
    }

    return 0;
}
```

**Compile & run**:
```bash
gcc -O2 -o dump_maps dump_maps.c hakmem.c hakmem_bigcache.c hakmem_elo.c hakmem_batch.c -I. -lm
./dump_maps
```

**Look for**:
- Fragmentation (many small mappings vs few large ones)
- Address space layout (sequential vs scattered)

---

## 7. Comparative Analysis Script

**File**: `compare_all.sh`
```bash
#!/bin/bash

echo "Running comparative profiling..."

# 1. Feature isolation
echo ""
echo "=== Feature Isolation ==="
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 3 --output minimal.csv
echo "Baseline: 37602 ns"
echo "Minimal:  $(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==2{print $1}') ns"

# 2. perf stat
echo ""
echo "=== perf stat (hakmem) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

echo ""
echo "=== perf stat (mimalloc) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

# 3. Micro-benchmarks
echo ""
echo "=== Micro-benchmarks ==="
./test_bigcache_speed 2>/dev/null
./test_elo_speed 2>/dev/null
./test_header_speed 2>/dev/null

echo ""
echo "Done. See above for results."
```

**Run**:
```bash
bash compare_all.sh > profiling_results.txt 2>&1
cat profiling_results.txt
```

---

## 8. Expected Results Summary

| Test | Expected Result | Interpretation |
|------|-----------------|----------------|
| **Feature isolation** | -350 ns total | Features contribute < 1% overhead |
| **MINIMAL mode** | 37,252 ns | Still +86% vs mimalloc → structural gap |
| **perf top** | 60-70% in mmap | Syscalls dominate, but equal for both |
| **Cache misses** | 5-10% L1 miss | Slightly worse than mimalloc |
| **BigCache micro** | 50-100 ns | Hash lookup overhead |
| **ELO micro** | 100-200 ns | Strategy selection overhead |
| **Header micro** | 30-50 ns | Metadata overhead |

**Key validation**: If MINIMAL mode still has +86% gap, then **allocation model** (not features) is the bottleneck.

---

## 9. Next Steps

After profiling:

1. ✅ **Validate overhead breakdown** (does it match Phase 6.7 analysis?)
2. ✅ **Identify unexpected hotspots** (perf report surprises?)
3. ✅ **Document findings** (update PHASE_6.7_OVERHEAD_ANALYSIS.md)
4. ✅ **Decide on optimizations** (Priority 0/1/2 from analysis)

**If analysis is correct**: Move to Phase 7 (focus on learning, not speed)

**If analysis is wrong**: Investigate new bottlenecks revealed by profiling

---

**End of Profiling Guide** 🔬