hakmem Profiling Guide

Purpose: Practical commands for validating Phase 6.7 overhead analysis

Date: 2025-10-21


1. Feature Isolation Testing

1.1 Add Environment Variable Support

File: hakmem.c

Add at top of file:

// Environment variable control (for profiling)
static int g_disable_bigcache = 0;
static int g_disable_elo = 0;
static int g_minimal_mode = 0;

void hak_init_env_vars(void) {
    g_disable_bigcache = getenv("HAKMEM_DISABLE_BIGCACHE") ? 1 : 0;
    g_disable_elo = getenv("HAKMEM_DISABLE_ELO") ? 1 : 0;
    g_minimal_mode = getenv("HAKMEM_MINIMAL") ? 1 : 0;

    if (g_minimal_mode) {
        g_disable_bigcache = 1;
        g_disable_elo = 1;
    }

    if (g_disable_bigcache) fprintf(stderr, "[hakmem] BigCache disabled (profiling mode)\n");
    if (g_disable_elo) fprintf(stderr, "[hakmem] ELO disabled (fixed 2MB threshold)\n");
    if (g_minimal_mode) fprintf(stderr, "[hakmem] Minimal mode (all features OFF)\n");
}

Modify hak_alloc_at():

void* hak_alloc_at(size_t size, hak_callsite_t site) {
    static int initialized = 0;
    if (!initialized) {
        hak_init_env_vars();
        initialized = 1;
    }

    // ... existing code, but wrap features in checks:

    // ELO selection (skip if disabled)
    if (!g_disable_elo && !hak_evo_is_frozen()) {
        strategy_id = hak_elo_select_strategy();
        threshold = hak_elo_get_threshold(strategy_id);
    } else {
        threshold = 2097152;  // Fixed 2MB threshold
    }

    // BigCache lookup (skip if disabled)
    if (!g_disable_bigcache && size >= 1048576) {
        void* cached_ptr = NULL;
        if (hak_bigcache_try_get(size, site, &cached_ptr)) {
            return cached_ptr;
        }
    }

    // ... rest unchanged
}
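
A quick sanity check that the switches actually take effect (assuming bench_allocators links the modified hakmem.c): the stderr banners printed by hak_init_env_vars() should appear on the first allocation:

HAKMEM_MINIMAL=1 ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 1 2>&1 | grep "\[hakmem\]"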

1.2 Run Feature Isolation Tests

Baseline (all features):

make clean && make bench
bash bench_runner.sh --warmup 2 --runs 10 --output baseline.csv
grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,602 ns

No BigCache:

HAKMEM_DISABLE_BIGCACHE=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_bigcache.csv
grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,552 ns (-50 ns)

No ELO:

HAKMEM_DISABLE_ELO=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_elo.csv
grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,452 ns (-150 ns)

FROZEN mode:

HAKMEM_EVO_POLICY=frozen bash bench_runner.sh --warmup 2 --runs 10 --output frozen.csv
grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,452 ns (-150 ns, same as No ELO)

MINIMAL mode:

HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 10 --output minimal.csv
grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,252 ns (-350 ns total)

Analysis script:

echo "Configuration,Median(ns),ExpectedDelta(ns)"
echo "Baseline,$(grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),0"
echo "No BigCache,$(grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-50"
echo "No ELO,$(grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "FROZEN,$(grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "MINIMAL,$(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-350"
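
The same grep/sort/awk median extraction is repeated for every CSV; a small helper keeps it consistent (a sketch, assuming the per-run latency stays in the fourth CSV field and 10 runs per file, so row 5 approximates the median):

median_ns() {   # usage: median_ns file.csv
    grep "hakmem-evolving,vm" "$1" | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'
}

for cfg in baseline no_bigcache no_elo frozen minimal; do
    echo "$cfg,$(median_ns "$cfg.csv")"
done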

2. Profiling with perf

2.1 Compile with Debug Symbols

make clean
make CFLAGS="-g -O2 -fno-omit-frame-pointer"

2.2 Run perf record

Single run (quick check):

perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator hakmem-evolving \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_hakmem.txt

Expected top functions (from analysis):

  1. alloc_mmap / mmap - 60-70% (syscall overhead)
  2. hak_elo_select_strategy - 5-10% (strategy selection)
  3. hak_bigcache_try_get - 3-5% (cache lookup)
  4. memset / memcpy - 10-15% (memory initialization)

Validation:

grep -A 20 "Overhead" perf_hakmem.txt | head -25
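
To check the expected functions directly instead of only the head of the report, the symbols can be pulled out by name (assuming they appear verbatim in the perf output; inlining may fold some of them away):

grep -E "alloc_mmap|hak_elo_select_strategy|hak_bigcache_try_get|memset|memcpy" perf_hakmem.txt | head -20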

2.3 Compare with mimalloc

# mimalloc profiling
perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator mimalloc \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_mimalloc.txt

# Compare top functions
echo "=== hakmem top 10 ==="
grep -A 30 "Overhead" perf_hakmem.txt | head -35 | tail -10

echo ""
echo "=== mimalloc top 10 ==="
grep -A 30 "Overhead" perf_mimalloc.txt | head -35 | tail -10

Expected differences:

  • hakmem: More time in hak_elo_select_strategy, hak_bigcache_try_get
  • mimalloc: More time in mi_alloc_huge (internal allocator), less overhead

2.4 Annotate Source Code

perf annotate hak_alloc_at > annotate_hak_alloc_at.txt
cat annotate_hak_alloc_at.txt

Look for:

  • Hot lines (high % samples)
  • Branch mispredictions
  • Cache misses
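
To put rough numbers on the branch and cache behaviour listed above, a supplementary counter run can be used alongside the annotation:

perf stat -e branches,branch-misses,cache-references,cache-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100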

3. Cache Performance Analysis

3.1 Cache Miss Profiling

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100

Expected metrics:

Metric                     hakmem    mimalloc   Analysis
IPC (instructions/cycle)   ~1.5      ~2.0       mimalloc more efficient
L1 miss rate               5-10%     2-5%       hakmem less cache-friendly
Cache references           Higher    Lower      hakmem touches more memory
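
IPC and the L1 miss rate can also be computed directly from perf stat's CSV mode rather than read off the human-readable output (a sketch; perf stat writes its counters to stderr):

perf stat -x, -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100 \
    2> hakmem_stat.csv
awk -F, '$3=="cycles"{c=$1} $3=="instructions"{i=$1}
         $3=="L1-dcache-loads"{l=$1} $3=="L1-dcache-load-misses"{m=$1}
         END{printf "IPC: %.2f  L1 miss rate: %.1f%%\n", i/c, 100*m/l}' hakmem_stat.csv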

3.2 TLB Profiling

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100

Expected:

  • Similar TLB miss rates (both use 2MB pages)
  • Slight advantage to mimalloc (better locality)
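
The 2MB-page assumption can be spot-checked on the host before comparing TLB numbers (standard sysfs/procfs paths; output format varies by kernel):

cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo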

4. Micro-benchmarks

4.1 BigCache Lookup Speed

File: test_bigcache_speed.c

#include "hakmem_bigcache.h"
#include <stdint.h>   /* uintptr_t, uint64_t */
#include <stdio.h>
#include <time.h>

int main() {
    hak_bigcache_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        void* ptr = NULL;
        hak_bigcache_try_get(2097152, (uintptr_t)i, &ptr);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("BigCache lookup: %.1f ns/op\n", (double)ns / N);

    hak_bigcache_shutdown();
    return 0;
}

Compile & run:

gcc -O2 -o test_bigcache_speed test_bigcache_speed.c hakmem_bigcache.c -I.
./test_bigcache_speed

Expected: 50-100 ns/op


4.2 ELO Selection Speed

File: test_elo_speed.c

#include "hakmem_elo.h"
#include <stdint.h>   /* uint64_t */
#include <stdio.h>
#include <time.h>

int main() {
    hak_elo_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        int strategy = hak_elo_select_strategy();
        (void)strategy;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("ELO selection: %.1f ns/op\n", (double)ns / N);

    hak_elo_shutdown();
    return 0;
}

Compile & run:

gcc -O2 -o test_elo_speed test_elo_speed.c hakmem_elo.c -I. -lm
./test_elo_speed

Expected: 100-200 ns/op


4.3 Header Operations Speed

File: test_header_speed.c

#include "hakmem.h"
#include <stdint.h>   /* uint32_t, uintptr_t */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    uint32_t magic;
    uint32_t method;
    size_t   requested_size;
    size_t   actual_size;
    uintptr_t alloc_site;
    size_t   class_bytes;
} AllocHeader;

#define HAKMEM_MAGIC 0x48414B4D

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        volatile AllocHeader hdr;  /* volatile so -O2 cannot elide the stores below */
        hdr.magic = HAKMEM_MAGIC;
        hdr.method = 1;
        hdr.alloc_site = (uintptr_t)&hdr;
        hdr.class_bytes = 2097152;

        if (hdr.magic != HAKMEM_MAGIC) abort();
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("Header operations: %.1f ns/op\n", (double)ns / N);

    return 0;
}

Compile & run:

gcc -O2 -o test_header_speed test_header_speed.c
./test_header_speed

Expected: 30-50 ns/op


5. Syscall Tracing (Already Done)

5.1 Detailed Syscall Trace

# Full trace (warning: huge output)
strace -o hakmem_full.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Trace mimalloc too (mimalloc_full.strace is used by the comparison in 5.2)
strace -o mimalloc_full.strace \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 10

# Count only
strace -c -o hakmem_summary.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

cat hakmem_summary.strace

5.2 Compare Syscall Patterns

# Extract mmap calls
grep "mmap" hakmem_full.strace > hakmem_mmap.txt
grep "mmap" mimalloc_full.strace > mimalloc_mmap.txt

# Compare first 10
echo "=== hakmem mmap ==="
head -10 hakmem_mmap.txt

echo ""
echo "=== mimalloc mmap ==="
head -10 mimalloc_mmap.txt

Look for:

  • Same syscall arguments? (flags, prot, etc.)
  • Same order of operations?
  • Any hakmem-specific patterns?
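
One way to surface hakmem-specific patterns is a size histogram of the mmap requests (assuming GNU grep -P; the exact line format depends on the strace version):

echo "=== hakmem mmap sizes ==="
grep -oP 'mmap\([^,]+, \K[0-9]+' hakmem_mmap.txt | sort -n | uniq -c | sort -rn | head

echo "=== mimalloc mmap sizes ==="
grep -oP 'mmap\([^,]+, \K[0-9]+' mimalloc_mmap.txt | sort -n | uniq -c | sort -rn | head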

6. Memory Layout Analysis

6.1 /proc/self/maps

File: dump_maps.c

#include <stdio.h>
#include <stdlib.h>
#include "hakmem.h"

int main() {
    // Allocate 10 blocks
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = hak_alloc_cs(2097152);
    }

    // Dump memory map
    system("grep -E 'rw-p' /proc/self/maps");  /* anonymous mappings have an empty pathname column */

    // Free all
    for (int i = 0; i < 10; i++) {
        hak_free_cs(ptrs[i], 2097152);
    }

    return 0;
}

Compile & run:

gcc -O2 -o dump_maps dump_maps.c hakmem.c hakmem_bigcache.c hakmem_elo.c hakmem_batch.c -I. -lm
./dump_maps

Look for:

  • Fragmentation (many small mappings vs few large ones)
  • Address space layout (sequential vs scattered)
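
A rough fragmentation summary can be computed from the dump output (a sketch assuming GNU awk for strtonum; it counts every writable private mapping printed above, so library data segments are included):

./dump_maps | awk '{ split($1, a, "-");
                     n++; sz += strtonum("0x" a[2]) - strtonum("0x" a[1]) }
                   END { printf "%d rw-p mappings, %.1f MiB total\n", n, sz / 1048576 }'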

7. Comparative Analysis Script

File: compare_all.sh

#!/bin/bash

echo "Running comparative profiling..."

# 1. Feature isolation
echo ""
echo "=== Feature Isolation ==="
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 3 --output minimal.csv
echo "Baseline: 37602 ns"
echo "Minimal:  $(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==2{print $1}') ns"

# 2. perf stat
echo ""
echo "=== perf stat (hakmem) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

echo ""
echo "=== perf stat (mimalloc) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

# 3. Micro-benchmarks
echo ""
echo "=== Micro-benchmarks ==="
./test_bigcache_speed 2>/dev/null
./test_elo_speed 2>/dev/null
./test_header_speed 2>/dev/null

echo ""
echo "Done. See above for results."

Run:

bash compare_all.sh > profiling_results.txt 2>&1
cat profiling_results.txt

8. Expected Results Summary

Test               Expected Result    Interpretation
Feature isolation  -350 ns total      Features contribute < 1% overhead
MINIMAL mode       37,252 ns          Still +86% vs mimalloc → structural gap
perf top           60-70% in mmap     Syscalls dominate, but equal for both
Cache misses       5-10% L1 miss      Slightly worse than mimalloc
BigCache micro     50-100 ns          Hash lookup overhead
ELO micro          100-200 ns         Strategy selection overhead
Header micro       30-50 ns           Metadata overhead

Key validation: If MINIMAL mode still shows the +86% gap, then the allocation model (not the features) is the bottleneck.
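
The gap can be recomputed from the CSVs rather than taken from the table (a sketch assuming baseline.csv also contains mimalloc rows in the same allocator,scenario,...,latency format):

HAK=$(grep "hakmem-evolving,vm" minimal.csv  | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}')
MIM=$(grep "mimalloc,vm"        baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}')
awk -v h="$HAK" -v m="$MIM" 'BEGIN { printf "MINIMAL vs mimalloc: %+.0f%%\n", 100 * (h - m) / m }'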


9. Next Steps

After profiling:

  1. Validate overhead breakdown (does it match Phase 6.7 analysis?)
  2. Identify unexpected hotspots (perf report surprises?)
  3. Document findings (update PHASE_6.7_OVERHEAD_ANALYSIS.md)
  4. Decide on optimizations (Priority 0/1/2 from analysis)

If the analysis is correct: move to Phase 7 (focus on learning, not speed)

If the analysis is wrong: investigate the new bottlenecks revealed by profiling


End of Profiling Guide 🔬