hakmem Profiling Guide

Purpose: Practical commands for validating Phase 6.7 overhead analysis

Date: 2025-10-21


1. Feature Isolation Testing

1.1 Add Environment Variable Support

File: hakmem.c

Add at top of file:

// Environment variable control (for profiling)
static int g_disable_bigcache = 0;
static int g_disable_elo = 0;
static int g_minimal_mode = 0;

void hak_init_env_vars(void) {
    g_disable_bigcache = getenv("HAKMEM_DISABLE_BIGCACHE") ? 1 : 0;
    g_disable_elo = getenv("HAKMEM_DISABLE_ELO") ? 1 : 0;
    g_minimal_mode = getenv("HAKMEM_MINIMAL") ? 1 : 0;

    if (g_minimal_mode) {
        g_disable_bigcache = 1;
        g_disable_elo = 1;
    }

    if (g_disable_bigcache) fprintf(stderr, "[hakmem] BigCache disabled (profiling mode)\n");
    if (g_disable_elo) fprintf(stderr, "[hakmem] ELO disabled (fixed 2MB threshold)\n");
    if (g_minimal_mode) fprintf(stderr, "[hakmem] Minimal mode (all features OFF)\n");
}

Modify hak_alloc_at():

void* hak_alloc_at(size_t size, hak_callsite_t site) {
    static int initialized = 0;
    if (!initialized) {
        hak_init_env_vars();
        initialized = 1;
    }

    // ... existing code, but wrap features in checks:

    // ELO selection (skip if disabled)
    if (!g_disable_elo && !hak_evo_is_frozen()) {
        strategy_id = hak_elo_select_strategy();
        threshold = hak_elo_get_threshold(strategy_id);
    } else {
        threshold = 2097152;  // Fixed 2MB threshold
    }

    // BigCache lookup (skip if disabled)
    if (!g_disable_bigcache && size >= 1048576) {
        void* cached_ptr = NULL;
        if (hak_bigcache_try_get(size, site, &cached_ptr)) {
            return cached_ptr;
        }
    }

    // ... rest unchanged
}
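
A quick sanity check that the switches actually take effect (assuming bench_allocators links the modified hakmem.c): the stderr banners printed by hak_init_env_vars() should appear on the first allocation:

HAKMEM_MINIMAL=1 ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 1 2>&1 | grep "\[hakmem\]"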

1.2 Run Feature Isolation Tests

Baseline (all features):

make clean && make bench
bash bench_runner.sh --warmup 2 --runs 10 --output baseline.csv
grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,602 ns

No BigCache:

HAKMEM_DISABLE_BIGCACHE=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_bigcache.csv
grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,552 ns (-50 ns)

No ELO:

HAKMEM_DISABLE_ELO=1 bash bench_runner.sh --warmup 2 --runs 10 --output no_elo.csv
grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,452 ns (-150 ns)

FROZEN mode:

HAKMEM_EVO_POLICY=frozen bash bench_runner.sh --warmup 2 --runs 10 --output frozen.csv
grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,452 ns (-150 ns, same as No ELO)

MINIMAL mode:

HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 10 --output minimal.csv
grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print "Median:", $1}'

Expected: ~37,252 ns (-350 ns total)

Analysis script:

echo "Configuration,Median(ns),ExpectedDelta(ns)"
echo "Baseline,$(grep "hakmem-evolving,vm" baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),0"
echo "No BigCache,$(grep "hakmem-evolving,vm" no_bigcache.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-50"
echo "No ELO,$(grep "hakmem-evolving,vm" no_elo.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "FROZEN,$(grep "hakmem-evolving,vm" frozen.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-150"
echo "MINIMAL,$(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'),-350"
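
The same grep/sort/awk median extraction is repeated for every CSV; a small helper keeps it consistent (a sketch, assuming the per-run latency stays in the fourth CSV field and 10 runs per file, so row 5 approximates the median):

median_ns() {   # usage: median_ns file.csv
    grep "hakmem-evolving,vm" "$1" | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}'
}

for cfg in baseline no_bigcache no_elo frozen minimal; do
    echo "$cfg,$(median_ns "$cfg.csv")"
done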

2. Profiling with perf

2.1 Compile with Debug Symbols

make clean
make CFLAGS="-g -O2 -fno-omit-frame-pointer"

2.2 Run perf record

Single run (quick check):

perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator hakmem-evolving \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_hakmem.txt

Expected top functions (from analysis):

  1. alloc_mmap / mmap - 60-70% (syscall overhead)
  2. hak_elo_select_strategy - 5-10% (strategy selection)
  3. hak_bigcache_try_get - 3-5% (cache lookup)
  4. memset / memcpy - 10-15% (memory initialization)

Validation:

grep -A 20 "Overhead" perf_hakmem.txt | head -25
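
To check the expected functions directly instead of only the head of the report, the symbols can be pulled out by name (assuming they appear verbatim in the perf output; inlining may fold some of them away):

grep -E "alloc_mmap|hak_elo_select_strategy|hak_bigcache_try_get|memset|memcpy" perf_hakmem.txt | head -20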

2.3 Compare with mimalloc

# mimalloc profiling
perf record -g -e cycles:u \
    ./bench_allocators \
    --allocator mimalloc \
    --scenario vm \
    --iterations 100

perf report --stdio > perf_mimalloc.txt

# Compare top functions
echo "=== hakmem top 10 ==="
grep -A 30 "Overhead" perf_hakmem.txt | head -35 | tail -10

echo ""
echo "=== mimalloc top 10 ==="
grep -A 30 "Overhead" perf_mimalloc.txt | head -35 | tail -10

Expected differences:

  • hakmem: More time in hak_elo_select_strategy, hak_bigcache_try_get
  • mimalloc: More time in mi_alloc_huge (internal allocator), less overhead

2.4 Annotate Source Code

perf annotate hak_alloc_at > annotate_hak_alloc_at.txt
cat annotate_hak_alloc_at.txt

Look for:

  • Hot lines (high % samples)
  • Branch mispredictions
  • Cache misses
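
To put rough numbers on the branch and cache behaviour listed above, a supplementary counter run can be used alongside the annotation:

perf stat -e branches,branch-misses,cache-references,cache-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100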

3. Cache Performance Analysis

3.1 Cache Miss Profiling

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100

Expected metrics:

Metric                     hakmem    mimalloc   Analysis
IPC (instructions/cycle)   ~1.5      ~2.0       mimalloc more efficient
L1 miss rate               5-10%     2-5%       hakmem less cache-friendly
Cache references           Higher    Lower      hakmem touches more memory
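
IPC and the L1 miss rate can also be computed directly from perf stat's CSV mode rather than read off the human-readable output (a sketch; perf stat writes its counters to stderr):

perf stat -x, -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100 \
    2> hakmem_stat.csv
awk -F, '$3=="cycles"{c=$1} $3=="instructions"{i=$1}
         $3=="L1-dcache-loads"{l=$1} $3=="L1-dcache-load-misses"{m=$1}
         END{printf "IPC: %.2f  L1 miss rate: %.1f%%\n", i/c, 100*m/l}' hakmem_stat.csv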

3.2 TLB Profiling

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100

perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 100

Expected:

  • Similar TLB miss rates (both use 2MB pages)
  • Slight advantage to mimalloc (better locality)
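
The 2MB-page assumption can be spot-checked on the host before comparing TLB numbers (standard sysfs/procfs paths; output format varies by kernel):

cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo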

4. Micro-benchmarks

4.1 BigCache Lookup Speed

File: test_bigcache_speed.c

#include "hakmem_bigcache.h"
#include <stdint.h>   /* uintptr_t, uint64_t */
#include <stdio.h>
#include <time.h>

int main() {
    hak_bigcache_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        void* ptr = NULL;
        hak_bigcache_try_get(2097152, (uintptr_t)i, &ptr);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("BigCache lookup: %.1f ns/op\n", (double)ns / N);

    hak_bigcache_shutdown();
    return 0;
}

Compile & run:

gcc -O2 -o test_bigcache_speed test_bigcache_speed.c hakmem_bigcache.c -I.
./test_bigcache_speed

Expected: 50-100 ns/op


4.2 ELO Selection Speed

File: test_elo_speed.c

#include "hakmem_elo.h"
#include <stdint.h>   /* uint64_t */
#include <stdio.h>
#include <time.h>

int main() {
    hak_elo_init();

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        int strategy = hak_elo_select_strategy();
        (void)strategy;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("ELO selection: %.1f ns/op\n", (double)ns / N);

    hak_elo_shutdown();
    return 0;
}

Compile & run:

gcc -O2 -o test_elo_speed test_elo_speed.c hakmem_elo.c -I. -lm
./test_elo_speed

Expected: 100-200 ns/op


4.3 Header Operations Speed

File: test_header_speed.c

#include "hakmem.h"
#include <stdint.h>   /* uint32_t, uintptr_t */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct {
    uint32_t magic;
    uint32_t method;
    size_t   requested_size;
    size_t   actual_size;
    uintptr_t alloc_site;
    size_t   class_bytes;
} AllocHeader;

#define HAKMEM_MAGIC 0x48414B4D

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    const int N = 1000000;
    for (int i = 0; i < N; i++) {
        volatile AllocHeader hdr;  /* volatile so -O2 cannot elide the stores below */
        hdr.magic = HAKMEM_MAGIC;
        hdr.method = 1;
        hdr.alloc_site = (uintptr_t)&hdr;
        hdr.class_bytes = 2097152;

        if (hdr.magic != HAKMEM_MAGIC) abort();
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    uint64_t ns = (end.tv_sec - start.tv_sec) * 1000000000ULL +
                  (end.tv_nsec - start.tv_nsec);

    printf("Header operations: %.1f ns/op\n", (double)ns / N);

    return 0;
}

Compile & run:

gcc -O2 -o test_header_speed test_header_speed.c
./test_header_speed

Expected: 30-50 ns/op


5. Syscall Tracing (Already Done)

5.1 Detailed Syscall Trace

# Full trace (warning: huge output)
strace -o hakmem_full.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

# Trace mimalloc too (mimalloc_full.strace is used by the comparison in 5.2)
strace -o mimalloc_full.strace \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 10

# Count only
strace -c -o hakmem_summary.strace \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10

cat hakmem_summary.strace

5.2 Compare Syscall Patterns

# Extract mmap calls
grep "mmap" hakmem_full.strace > hakmem_mmap.txt
grep "mmap" mimalloc_full.strace > mimalloc_mmap.txt

# Compare first 10
echo "=== hakmem mmap ==="
head -10 hakmem_mmap.txt

echo ""
echo "=== mimalloc mmap ==="
head -10 mimalloc_mmap.txt

Look for:

  • Same syscall arguments? (flags, prot, etc.)
  • Same order of operations?
  • Any hakmem-specific patterns?
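
One way to surface hakmem-specific patterns is a size histogram of the mmap requests (assuming GNU grep -P; the exact line format depends on the strace version):

echo "=== hakmem mmap sizes ==="
grep -oP 'mmap\([^,]+, \K[0-9]+' hakmem_mmap.txt | sort -n | uniq -c | sort -rn | head

echo "=== mimalloc mmap sizes ==="
grep -oP 'mmap\([^,]+, \K[0-9]+' mimalloc_mmap.txt | sort -n | uniq -c | sort -rn | head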

6. Memory Layout Analysis

6.1 /proc/self/maps

File: dump_maps.c

#include <stdio.h>
#include <stdlib.h>
#include "hakmem.h"

int main() {
    // Allocate 10 blocks
    void* ptrs[10];
    for (int i = 0; i < 10; i++) {
        ptrs[i] = hak_alloc_cs(2097152);
    }

    // Dump memory map
    system("grep -E 'rw-p' /proc/self/maps");  /* anonymous mappings have an empty pathname column */

    // Free all
    for (int i = 0; i < 10; i++) {
        hak_free_cs(ptrs[i], 2097152);
    }

    return 0;
}

Compile & run:

gcc -O2 -o dump_maps dump_maps.c hakmem.c hakmem_bigcache.c hakmem_elo.c hakmem_batch.c -I. -lm
./dump_maps

Look for:

  • Fragmentation (many small mappings vs few large ones)
  • Address space layout (sequential vs scattered)
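
A rough fragmentation summary can be computed from the dump output (a sketch assuming GNU awk for strtonum; it counts every writable private mapping printed above, so library data segments are included):

./dump_maps | awk '{ split($1, a, "-");
                     n++; sz += strtonum("0x" a[2]) - strtonum("0x" a[1]) }
                   END { printf "%d rw-p mappings, %.1f MiB total\n", n, sz / 1048576 }'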

7. Comparative Analysis Script

File: compare_all.sh

#!/bin/bash

echo "Running comparative profiling..."

# 1. Feature isolation
echo ""
echo "=== Feature Isolation ==="
HAKMEM_MINIMAL=1 bash bench_runner.sh --warmup 2 --runs 3 --output minimal.csv
echo "Baseline: 37602 ns"
echo "Minimal:  $(grep "hakmem-evolving,vm" minimal.csv | awk -F, '{print $4}' | sort -n | awk 'NR==2{print $1}') ns"

# 2. perf stat
echo ""
echo "=== perf stat (hakmem) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

echo ""
echo "=== perf stat (mimalloc) ==="
perf stat -e cycles,instructions,cache-misses,L1-dcache-load-misses \
    ./bench_allocators --allocator mimalloc --scenario vm --iterations 10 2>&1 | grep -E "cycles|instructions|misses"

# 3. Micro-benchmarks
echo ""
echo "=== Micro-benchmarks ==="
./test_bigcache_speed 2>/dev/null
./test_elo_speed 2>/dev/null
./test_header_speed 2>/dev/null

echo ""
echo "Done. See above for results."

Run:

bash compare_all.sh > profiling_results.txt 2>&1
cat profiling_results.txt

8. Expected Results Summary

Test               Expected Result    Interpretation
Feature isolation  -350 ns total      Features contribute < 1% overhead
MINIMAL mode       37,252 ns          Still +86% vs mimalloc → structural gap
perf top           60-70% in mmap     Syscalls dominate, but equal for both
Cache misses       5-10% L1 miss      Slightly worse than mimalloc
BigCache micro     50-100 ns          Hash lookup overhead
ELO micro          100-200 ns         Strategy selection overhead
Header micro       30-50 ns           Metadata overhead

Key validation: If MINIMAL mode still shows the +86% gap, then the allocation model (not the features) is the bottleneck.
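
The gap can be recomputed from the CSVs rather than taken from the table (a sketch assuming baseline.csv also contains mimalloc rows in the same allocator,scenario,...,latency format):

HAK=$(grep "hakmem-evolving,vm" minimal.csv  | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}')
MIM=$(grep "mimalloc,vm"        baseline.csv | awk -F, '{print $4}' | sort -n | awk 'NR==5{print $1}')
awk -v h="$HAK" -v m="$MIM" 'BEGIN { printf "MINIMAL vs mimalloc: %+.0f%%\n", 100 * (h - m) / m }'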


9. Next Steps

After profiling:

  1. Validate overhead breakdown (does it match Phase 6.7 analysis?)
  2. Identify unexpected hotspots (perf report surprises?)
  3. Document findings (update PHASE_6.7_OVERHEAD_ANALYSIS.md)
  4. Decide on optimizations (Priority 0/1/2 from analysis)

If the analysis is correct: move to Phase 7 (focus on learning, not speed)

If the analysis is wrong: investigate the new bottlenecks revealed by profiling


End of Profiling Guide 🔬