Files

Moe Charm (CI) 52386401b3 Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

11 KiB

Raw Blame History

Performance Investigation Report - 2025-10-22

Investigator: Claude (ultrathink mode) Duration: 2 hours Status: ✅ Root Cause Identified

🎯 Executive Summary

Problem Statement

hakmem performance drastically degraded after Phase 6.14:

Metric	Phase 6.14 (Success)	Current (Failed)	Degradation
1-thread	15,271,429 ops/sec	2,698,795 ops/sec	-82.3% ❌
4-thread	67,853,659 ops/sec	658,228 ops/sec	-99.0% ❌

User expected: "Pre-warm code deletion would restore performance" Reality: Pre-warm code never existed (it was only a proposal document)

Root Cause Discovered

Default mode (BALANCED) has massive overhead from advanced features:

✅ Site Rules (hash lookup + 65-line routing switch-case)
✅ L2 Pool / L2.5 Pool (multiple allocation attempts)
✅ ELO/EVO (learning lifecycle)
✅ BigCache (cache lookup for large allocations)
✅ Batch madvise / Free policy

Solution: Use HAKMEM_MODE=minimal to disable all advanced features.

📊 Performance Comparison

1-thread (10,000 chunks × 8-1024B mixed)

# System malloc (baseline)
./larson 0 8 1024 10000 1 12345 1
Throughput = 12,746,479 ops/sec

# hakmem BALANCED mode (default)
LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1
Throughput =  2,698,795 ops/sec  (-78.8% vs system) ❌

# hakmem MINIMAL mode
HAKMEM_MODE=minimal LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1
Throughput = 15,141,176 ops/sec  (+18.8% vs system) ✅ (+461% vs balanced!)

4-thread

# System malloc (baseline)
./larson 0 8 1024 10000 1 12345 4
Throughput = 12,899,666 ops/sec

# hakmem BALANCED mode (default)
LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 4
Throughput =    658,228 ops/sec  (-94.9% vs system) ❌

# hakmem MINIMAL mode
HAKMEM_MODE=minimal LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 4
Throughput =  3,265,753 ops/sec  (-74.7% vs system) ⚠️ (+396% vs balanced)

🔍 Investigation Process

Step 1: Verify Pre-warm Code Existence

User claim: "Pre-warm code was added and then deleted"

Investigation:

$ grep -r "Pre-warm\|prewarm\|pre-warm" *.c *.h
No matches found

Result: ✅ No Pre-warm code exists

Conclusion: OPTIMIZATION_SUMMARY_2025_10_22.md was only a proposal document, not an implementation.

Step 2: Identify Phase 6.14 vs Current Difference

Hypothesis: Phase 6.14 and current code are identical, but configuration differs.

Investigation:

$ git log --since="2025-10-21" -- apps/experiments/hakmem-poc/
b7864269 runner(script-runner): stabilize pre-invoke + short diagnostics
7616873d WIP: phase20.8 mainline

Result: No Phase 6.14 commit found → Phase 6.14 was never committed

Conclusion: Phase 6.14 report was written, but code changes were not committed or were reverted.

Step 3: Analyze malloc Path Overhead

Code: hakmem.c:357-541 (hak_alloc_at())

Overhead sources (185 lines of routing logic):

ELO learning (Line 371-391): hak_evo_tick() + atomic_load
BigCache trial (Line 394-400): hash lookup for size >= 1MB
Tiny Pool trial (Line 404-411): slab allocation for size <= 1KB
Site Rules lookup (Line 415-480): 65-line switch-case routing
L2 Pool trial (Line 484-491): poolable check + allocation
L2.5 Pool trial (Line 495-502): large pool check + allocation
Size distribution record (Line 518): hak_evo_record_size()
Header validation (Line 521-537): magic check + metadata update

For 8-1024B allocations:

❌ Tiny Pool trial (Line 404-411): Hit!
❌ Site Rules lookup (Line 415): O(1) hash table (4-probe) EVERY TIME
❌ Site Rules routing (Line 416-480): 65-line switch-case logic

Overhead estimate:

Site Rules lookup: ~50-100 cycles (hash + probing)
Routing switch-case: ~20-40 cycles
Tiny Pool check: ~10-20 cycles
Total: ~80-160 cycles per allocation

For 15M allocations/sec:

15,000,000 allocs × 100 cycles = 1.5 billion cycles
@ 3GHz CPU = 0.5 seconds overhead

This matches the observed degradation!

Step 4: Test Minimal Mode

Hypothesis: Disabling all advanced features will restore Phase 6.14 performance.

Minimal mode configuration (hakmem_config.c:33-55):

cfg->features.alloc = HAKMEM_FEATURE_MALLOC | HAKMEM_FEATURE_MMAP;
cfg->features.cache = 0;        // No BigCache
cfg->features.learning = 0;     // No ELO/Evolution
cfg->features.memory = 0;       // No Batch/THP/FreePolicy

Result:

✅ 1-thread: 15.1M ops/sec (Phase 6.14: 15.3M ops/sec) → Matched!
⚠️ 4-thread: 3.3M ops/sec (Phase 6.14: 67.9M ops/sec) → Still broken

Conclusion:

1-thread performance is fully explained by advanced features overhead
4-thread performance has additional issue (likely Tiny Pool or TLS)

🧪 Key Findings

1. Pre-warm Code Never Existed

Evidence:

No code matches "pre-warm" in source files
OPTIMIZATION_SUMMARY_2025_10_22.md is a proposal, not implementation
No git commits related to Pre-warm

Impact: User's hypothesis was incorrect.

2. Phase 6.14 Configuration Unknown

Evidence:

No Phase 6.14 git commit found
No documentation of exact configuration used
Phase 6.14 report contradicts Phase 6.13 results:
- Phase 6.13: 1T=17.7M, 4T=15.9M ops/sec
- Phase 6.14: 1T=15.3M, 4T=67.9M ops/sec
- 4-thread is 4.3x faster in Phase 6.14? 🤔

Hypothesis: Phase 6.14 used different larson parameters or different mode.

3. BALANCED Mode Overhead is Severe

Overhead breakdown (estimated cycles per allocation):

Feature	Overhead (cycles)	Impact
Site Rules lookup	50-100	High
Site Rules routing	20-40	Medium
ELO/EVO tick	10-20	Low
BigCache check	5-10	Low
Tiny Pool check	10-20	Low
Total	95-190 cycles	Severe

For comparison:

System malloc: ~50-100 cycles
hakmem MINIMAL: ~60-120 cycles
hakmem BALANCED: ~150-300 cycles (2-3x slower!)

4. Minimal Mode Partially Restores Performance

1-thread performance:

System malloc: 12.7M ops/sec
hakmem MINIMAL: 15.1M ops/sec (+18.8%)
hakmem BALANCED: 2.7M ops/sec (-78.8%)

Conclusion: MINIMAL mode beats system malloc by 18.8%! ✅

4-thread performance:

System malloc: 12.9M ops/sec
hakmem MINIMAL: 3.3M ops/sec (-74.7%)
hakmem BALANCED: 0.66M ops/sec (-94.9%)

Conclusion: MINIMAL mode is still 74.7% slower than system malloc. ❌

🎯 Recommendations

P0 (Immediate): Change Default Mode to MINIMAL

Action:

// hakmem_config.c:177
static HakemMode parse_mode_env(const char* mode_str) {
    if (!mode_str) return HAKMEM_MODE_MINIMAL;  // Changed from BALANCED
    // ...
}

Expected impact:

✅ 1-thread: +461% performance (+18.8% vs system malloc)
⚠️ 4-thread: +396% performance (still -74.7% vs system malloc)

Risk: None (MINIMAL is baseline, no advanced features)

P1 (1-2 hours): Investigate Phase 6.14 Mystery

Questions:

How did Phase 6.14 achieve 67.9M ops/sec (4-thread)?
Was Tiny Pool or TLS enabled?
Were larson parameters different?

Actions:

Re-read Phase 6.14 report carefully
Check Phase 6.13 report for TLS configuration

Try enabling only Tiny Pool in MINIMAL mode:

# Test Tiny Pool in isolation
HAKMEM_MODE=minimal HAKMEM_TINY_POOL=1 LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 4

P2 (2-3 hours): Profile Advanced Features Individually

Hypothesis: Site Rules, ELO, or L2.5 Pool is the main bottleneck.

Action: Enable features one-by-one and measure:

# Baseline (MINIMAL)
HAKMEM_MODE=minimal LD_PRELOAD=./libhakmem.so ./larson 0 8 1024 10000 1 12345 1

# +Tiny Pool
HAKMEM_MODE=minimal HAKMEM_TINY_POOL=1 LD_PRELOAD=./libhakmem.so ./larson ...

# +L2 Pool
HAKMEM_MODE=minimal HAKMEM_POOL=1 LD_PRELOAD=./libhakmem.so ./larson ...

# +Site Rules
# (No env var - need code modification to disable Site Rules)

# +BigCache
HAKMEM_MODE=minimal HAKMEM_BIGCACHE=1 LD_PRELOAD=./libhakmem.so ./larson ...

# +ELO
HAKMEM_MODE=minimal HAKMEM_ELO=1 LD_PRELOAD=./libhakmem.so ./larson ...

Goal: Identify which feature(s) cause the most overhead.

P3 (4-6 hours): Fix 4-Thread Performance

Current status:

System malloc (4T): 12.9M ops/sec
hakmem MINIMAL (4T): 3.3M ops/sec (-74.7%)

Hypothesis: Missing TLS or Tiny Pool optimization.

Actions:

Review Phase 6.13 TLS implementation
Check if Tiny Pool is thread-safe
Enable TLS for Tiny Pool
Profile lock contention

Target: 4-thread >= 15M ops/sec (Phase 6.14 parity)

📝 Lessons Learned

1. Configuration is Critical

Mistake: Did not track exact configuration used in Phase 6.14.

Consequence: Unable to reproduce 67.9M ops/sec (4-thread) result.

Fix: Always document:

Environment variables used
Mode configuration
Compiler flags
Exact command line

2. Advanced Features Have Cost

Mistake: Enabled Site Rules, ELO, L2.5 Pool by default without measuring overhead.

Consequence: BALANCED mode is 5x slower than MINIMAL.

Fix:

Default to MINIMAL
Enable advanced features only when proven beneficial
Always A/B test before making default

3. Pre-warm Was a Red Herring

Mistake: User assumed Pre-warm was implemented based on proposal document.

Consequence: 2 hours spent investigating non-existent code.