Phase 6.8: Configuration Cleanup & Mode-based Architecture
Date: 2025-10-21 Status: 🚧 IN PROGRESS
🎯 Goal
Problem: hakmem currently has too many environment variables to manage
- HAKMEM_FREE_POLICY, HAKMEM_THP, HAKMEM_EVO_POLICY, etc.
- Combinations are complex, and an invalid combination can introduce bugs
- Benchmark comparisons are difficult (which settings should be compared?)
Solution: Consolidate into 5 preset modes
- A single HAKMEM_MODE=balanced yields a sensible configuration
- Each feature's impact can be measured incrementally
- Easy to explain in the paper
📊 5 Modes Definition
Mode Overview
| Mode | Use Case | Target Audience | Performance Goal |
|---|---|---|---|
| MINIMAL | Baseline measurement | Benchmark comparison | On par with system malloc |
| FAST | Production (speed-first) | Production use | mimalloc +20% |
| BALANCED | Recommended default | General use | mimalloc +40% |
| LEARNING | Learning phase | Development | mimalloc +60% |
| RESEARCH | Development & debugging | Research | N/A (all features ON) |
Feature Matrix
| Feature | MINIMAL | FAST | BALANCED | LEARNING | RESEARCH |
|---|---|---|---|---|---|
| ELO learning | ❌ | ❌ FROZEN | ✅ FROZEN | ✅ LEARN | ✅ LEARN |
| BigCache | ❌ | ✅ | ✅ | ✅ | ✅ |
| Batch madvise | ❌ | ✅ | ✅ | ✅ | ✅ |
| TinyPool (future) | ❌ | ✅ | ❌ | ❌ | ❌ |
| Free policy | batch | adaptive | adaptive | adaptive | adaptive |
| THP | off | auto | auto | auto | on |
| Evolution lifecycle | - | FROZEN | FROZEN | LEARN→FROZEN | LEARN |
| Debug logging | ❌ | ❌ | ❌ | ⚠️ minimal | ✅ verbose |
🔧 Implementation Plan
Step 0: Baseline Measurement ✅ (Already done in Phase 6.6-6.7)
Current state:
- hakmem-evolving: 37,602 ns (VM scenario, 2MB)
- mimalloc: 19,964 ns (+88.3% gap)
- All features ON (uncontrolled)
Step 1: MINIMAL Mode 🎯 (P0 - Foundation)
Goal: Create baseline with all features OFF
Implementation:
// hakmem_config.h
typedef enum {
HAKMEM_MODE_MINIMAL = 0,
HAKMEM_MODE_FAST,
HAKMEM_MODE_BALANCED,
HAKMEM_MODE_LEARNING,
HAKMEM_MODE_RESEARCH,
} HakemMode;
typedef struct {
HakemMode mode;
// Feature flags
int enable_elo;
int enable_bigcache;
int enable_batch;
int enable_pool; // future (Step 5)
// Policies
FreePolicy free_policy;
THPPolicy thp_policy;
const char* evo_phase; // "frozen", "learn", "canary"
// Debug
int debug_logging;
} HakemConfig;
extern HakemConfig g_hakem_config;
void hak_config_init(void);
Changes:
- hakmem_config.h/c: New files
- hakmem.c: Call hak_config_init() in hak_init()
- All modules: Check g_hakem_config flags before enabling features
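To make the presets concrete, here is a minimal sketch of how hak_config_apply_mode() could translate a mode name into the flags defined above. The preset values follow the Feature Matrix; the helper structure, the "balanced" default, and parse details are assumptions for illustration, not the actual implementation.

```c
// Hypothetical hakmem_config.c sketch: map HAKMEM_MODE to the flags in
// hakmem_config.h. Preset values follow the Feature Matrix; everything
// else here is an illustrative assumption.
#include <stdlib.h>
#include <string.h>
#include "hakmem_config.h"

HakemConfig g_hakem_config;

void hak_config_apply_mode(const char* mode) {
    HakemConfig* c = &g_hakem_config;
    memset(c, 0, sizeof(*c));            // MINIMAL baseline: everything OFF
    c->evo_phase = "frozen";
    if (strcmp(mode, "balanced") == 0) {
        c->mode = HAKMEM_MODE_BALANCED;
        c->enable_elo = 1;               // ELO in FROZEN phase
        c->enable_bigcache = 1;
        c->enable_batch = 1;
    } else if (strcmp(mode, "fast") == 0) {
        c->mode = HAKMEM_MODE_FAST;
        c->enable_bigcache = 1;
        c->enable_batch = 1;
        c->enable_pool = 1;              // TinyPool (Step 5)
    } else if (strcmp(mode, "learning") == 0) {
        c->mode = HAKMEM_MODE_LEARNING;
        c->enable_elo = 1;
        c->enable_bigcache = 1;
        c->enable_batch = 1;
        c->evo_phase = "learn";          // LEARN → FROZEN lifecycle
    } // "minimal" keeps the all-OFF defaults; "research" adds debug_logging
}

void hak_config_init(void) {
    const char* mode = getenv("HAKMEM_MODE");
    hak_config_apply_mode(mode ? mode : "balanced");   // assumed default
}
```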
Benchmark:
HAKMEM_MODE=minimal ./bench_allocators --allocator hakmem-evolving --scenario vm --iterations 100
Expected:
- Performance: ~40,000-50,000 ns (slower than current, no optimizations)
- Serves as baseline for feature comparison
Estimated time: 1 day
Step 2: Enable BigCache 🎯 (P0 - Tier-2 Cache)
Goal: Measure BigCache impact in isolation
Implementation:
- MINIMAL + BigCache ON
- Keep ELO/Batch/THP OFF
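As an illustration of how a module gates on the new flag, a hedged sketch of the large-allocation path is shown below; hak_alloc_large(), bigcache_try_get(), and hak_mmap_alloc() are assumed names, and only g_hakem_config.enable_bigcache comes from the Step 1 config.

```c
// Illustrative gating of the Tier-2 cache on the config flag. All function
// names are assumptions; only g_hakem_config.enable_bigcache is from Step 1.
#include <stddef.h>
#include "hakmem_config.h"

void* bigcache_try_get(size_t sz);   // hypothetical Tier-2 cache lookup
void* hak_mmap_alloc(size_t sz);     // hypothetical mmap-backed fallback

void* hak_alloc_large(size_t sz) {
    if (g_hakem_config.enable_bigcache) {
        void* p = bigcache_try_get(sz);
        if (p) return p;              // hit: no mmap syscall needed
    }
    return hak_mmap_alloc(sz);        // miss, or cache disabled in MINIMAL
}
```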
Benchmark:
HAKMEM_MODE=minimal ./bench_runner.sh --warmup 2 --runs 10
# Then:
# hakmem.c: g_hakem_config.enable_bigcache = 1;
./bench_runner.sh --warmup 2 --runs 10
Expected:
- VM scenario hit rate: 99%+
- Performance: -5,000 ns improvement (cache hits avoid mmap)
- Target: 35,000-40,000 ns
Measurement:
- BigCache hit rate
- mmap syscall count (should drop)
- Performance delta
Estimated time: 0.5 day
Step 3: Enable Batch madvise 🎯 (P1 - TLB Optimization)
Goal: Measure batch madvise impact
Implementation:
- MINIMAL + BigCache + Batch ON
- Keep ELO/THP OFF
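The batching idea in sketch form: instead of issuing madvise(MADV_DONTNEED) per freed block, blocks are queued and released in one pass. The queue size and names below are assumptions; only the enable_batch flag comes from Step 1.

```c
// Minimal sketch of batched MADV_DONTNEED with a fixed-size pending list.
// BATCH_MAX and batch_add()/batch_flush() are illustrative names; only
// g_hakem_config.enable_batch comes from the Step 1 config.
#include <stddef.h>
#include <sys/mman.h>
#include "hakmem_config.h"

#define BATCH_MAX 64
static struct { void* addr; size_t len; } g_batch[BATCH_MAX];
static int g_batch_cnt = 0;

static void batch_flush(void) {                  // one pass over pending blocks
    for (int i = 0; i < g_batch_cnt; i++)
        madvise(g_batch[i].addr, g_batch[i].len, MADV_DONTNEED);
    g_batch_cnt = 0;
}

void batch_add(void* addr, size_t len) {
    if (!g_hakem_config.enable_batch) {          // feature off: release now
        madvise(addr, len, MADV_DONTNEED);
        return;
    }
    g_batch[g_batch_cnt].addr = addr;            // defer the syscall
    g_batch[g_batch_cnt].len  = len;
    if (++g_batch_cnt == BATCH_MAX)
        batch_flush();                           // bounded flush keeps TLB churn low
}
```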
Benchmark:
# Previous: MINIMAL + BigCache
# New: MINIMAL + BigCache + Batch
./bench_runner.sh --warmup 2 --runs 10
Expected:
- Batch flush operations: 1-10 per run
- Performance: -500-1,000 ns improvement (TLB optimization)
- Target: 34,000-39,000 ns
Measurement:
- Batch statistics (blocks added, flush count)
- madvise syscall count
- Performance delta
Estimated time: 0.5 day
Step 4: Enable ELO (FROZEN) 🎯 (P1 - Strategy Selection)
Goal: Measure ELO overhead in FROZEN mode (no learning)
Implementation:
- BALANCED mode = MINIMAL + BigCache + Batch + ELO(FROZEN)
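A rough picture of where the ~100-200 ns goes: even in FROZEN mode each allocation still consults the rating table to pick a strategy, it just skips the rating update afterwards. The table size and names below are assumptions for illustration.

```c
// Illustrative FROZEN-mode selection: pick the best-rated strategy, skip
// rating updates. NUM_STRATEGIES and g_elo_rating are assumed names.
#define NUM_STRATEGIES 4
static double g_elo_rating[NUM_STRATEGIES];

int elo_select_strategy(void) {
    int best = 0;
    for (int i = 1; i < NUM_STRATEGIES; i++)   // small per-allocation scan
        if (g_elo_rating[i] > g_elo_rating[best])
            best = i;
    return best;   // FROZEN: no rating update follows the allocation
}
```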
Benchmark:
HAKMEM_MODE=balanced ./bench_runner.sh --warmup 2 --runs 10
Expected:
- ELO overhead: ~100-200 ns (strategy selection per allocation)
- Performance: +100-200 ns regression (acceptable for adaptability)
- Target: 34,500-39,500 ns
Measurement:
- ELO selection overhead
- Strategy distribution
- Performance delta
Estimated time: 0.5 day
Step 5: TinyPool Implementation (FAST mode) 🚀 (P2 - Fast Path)
Goal: Implement pool-based fast path (ChatGPT Pro proposal)
Implementation:
- FAST mode = BALANCED + TinyPool
- 7 size classes: 16/32/64/128/256/512/1024B
- Per-thread free lists
- class×shard O(1) mapping
Code sketch:
// hakmem_pool.h
#include <stdint.h>
typedef struct Node { struct Node* next; } Node;
typedef struct { Node* head; uint32_t cnt; } FreeList;
#define SHARDS 64
#define CLASSES 7 // 16B to 1024B
typedef struct {
    FreeList list[SHARDS];
} ClassPools;
_Thread_local ClassPools tls_pools[CLASSES];
// Fast path (O(1))
void* hak_alloc_small(size_t sz, void* pc);
void hak_free_small(void* p, void* pc);
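Building on the header sketch above, the allocation fast path might look as follows; the size-class mapping, the shard hash, and hak_pool_refill() are assumptions for illustration, not the final design.

```c
// Hypothetical O(1) fast path over the structures in hakmem_pool.h above.
// size_class(), the shard hash, and hak_pool_refill() are assumed names.
#include <stdint.h>
#include "hakmem_pool.h"

void* hak_pool_refill(int cls, int shard);     // assumed slow path

static inline int size_class(size_t sz) {      // 16B..1024B → class 0..6
    int c = 0;
    for (size_t s = 16; s < sz && c < CLASSES - 1; s <<= 1) c++;
    return c;
}

void* hak_alloc_small(size_t sz, void* pc) {
    int c = size_class(sz);
    int shard = ((uintptr_t)pc >> 4) & (SHARDS - 1);   // class×shard mapping
    FreeList* fl = &tls_pools[c].list[shard];
    if (fl->head) {                    // pool hit: pop the thread-local node
        Node* n = fl->head;
        fl->head = n->next;
        fl->cnt--;
        return n;
    }
    return hak_pool_refill(c, shard);  // miss: fall back to the slow path
}
```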
Benchmark:
# Baseline: BALANCED mode
HAKMEM_MODE=balanced ./bench_runner.sh --warmup 10 --runs 50
# New: FAST mode
HAKMEM_MODE=fast ./bench_runner.sh --warmup 10 --runs 50
Expected:
- Small allocations (≤1KB): 9-15 ns fast path
- VM scenario (2MB): No change (pool not used for large allocations)
- Need new benchmark: tiny-hot (16/32/64B allocations)
Measurement:
- Pool hit rate
- Fast path latency (perf profiling)
- Comparison with mimalloc on tiny-hot
Estimated time: 2-3 weeks (MVP: 2 weeks, MT support: +1 week)
Step 6: ELO LEARNING mode 🎯 (P2 - Adaptive Learning)
Goal: Measure learning overhead and convergence
Implementation:
- LEARNING mode = BALANCED + ELO(LEARN→FROZEN)
Benchmark:
HAKMEM_MODE=learning ./bench_runner.sh --warmup 100 --runs 100
Expected:
- LEARN phase: +200-500 ns overhead (ELO selection + recording)
- Convergence: 1024-2048 allocations → FROZEN
- FROZEN phase: Same as BALANCED mode
- Overall: +50-100 ns average (amortized)
Measurement:
- ELO rating convergence
- Phase transitions (LEARN → FROZEN → CANARY)
- Learning overhead vs benefit
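To show where the lifecycle transition would sit, here is a hedged sketch of a LEARN→FROZEN switch after a fixed number of allocations; the counter, threshold constant, and elo_record_outcome() hook are assumptions based on the convergence target above.

```c
// Sketch of the LEARN→FROZEN transition after ~2048 allocations. The
// counter, threshold, and elo_record_outcome() are illustrative assumptions.
enum { EVO_LEARN, EVO_FROZEN, EVO_CANARY };

void elo_record_outcome(void);                 // assumed rating-update hook

static int g_evo_phase = EVO_LEARN;
static unsigned long g_alloc_count = 0;
#define CONVERGE_THRESHOLD 2048                // "convergence within 2048 allocations"

void evo_on_allocation(void) {
    if (g_evo_phase != EVO_LEARN) return;      // FROZEN: no learning overhead
    elo_record_outcome();                      // LEARN: pay the recording cost
    if (++g_alloc_count >= CONVERGE_THRESHOLD)
        g_evo_phase = EVO_FROZEN;              // freeze the learned ratings
}
```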
Estimated time: 1 day
Step 7: RESEARCH mode (All features) 🎯 (P3 - Development)
Goal: Enable all features + debug logging
Implementation:
- RESEARCH mode = LEARNING + THP(ON) + Debug logging
Use case:
- Development & debugging only
- Not for benchmarking (too slow)
Estimated time: 0.5 day
📈 Benchmark Plan
Comparison Matrix
| Scenario | MINIMAL | +BigCache | +Batch | BALANCED | FAST | LEARNING |
|---|---|---|---|---|---|---|
| VM (2MB) | 45,000 | 40,000 | 39,000 | 39,500 | 39,500 | 39,600 |
| tiny-hot | 50 | 50 | 50 | 50 | 12 | 52 |
| cold-churn | TBD | TBD | TBD | TBD | TBD | TBD |
| json-parse | TBD | TBD | TBD | TBD | TBD | TBD |
Note: Numbers are estimates, actual results TBD
Metrics to Collect
For each mode:
- Performance: Median latency (ns)
- Syscalls: mmap/munmap/madvise counts
- Page faults: soft/hard counts
- Memory: RSS delta
- Cache: Hit rates (BigCache, Pool)
Benchmark Script
#!/bin/bash
# bench_modes.sh - Compare all modes
MODES="minimal balanced fast learning"
SCENARIOS="vm cold-churn json-parse"
for mode in $MODES; do
  for scenario in $SCENARIOS; do
    echo "=== Mode: $mode, Scenario: $scenario ==="
    HAKMEM_MODE=$mode ./bench_runner.sh \
      --allocator hakmem-evolving \
      --scenario $scenario \
      --warmup 10 --runs 50 \
      --output results_${mode}_${scenario}.csv
  done
done
# Aggregate results
python3 analyze_modes.py results_*.csv
🎯 Success Metrics
Step 1-4 (MINIMAL → BALANCED)
- ✅ Each feature's impact is measurable
- ✅ Performance regression < 10% per feature
- ✅ Total BALANCED overhead: +40-60% vs mimalloc
Step 5 (FAST mode with TinyPool)
- ✅ tiny-hot benchmark: mimalloc +20% or better
- ✅ VM scenario: No regression vs BALANCED
- ✅ Pool hit rate: 90%+ for small allocations
Step 6 (LEARNING mode)
- ✅ Convergence within 2048 allocations
- ✅ Learning overhead amortized to < 5%
- ✅ FROZEN performance = BALANCED
📝 Migration Plan (Backward Compatibility)
Environment Variable Priority
// 1. HAKMEM_MODE has highest priority
const char* mode_env = getenv("HAKMEM_MODE");
if (mode_env) {
hak_config_apply_mode(mode_env); // Apply preset
} else {
// 2. Fall back to individual settings (legacy)
const char* free_policy = getenv("HAKMEM_FREE_POLICY");
const char* thp = getenv("HAKMEM_THP");
// ... etc
}
// 3. Individual settings can override mode
// Example: HAKMEM_MODE=balanced HAKMEM_THP=off
// → Use BALANCED preset, but force THP=off
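One possible shape of the override pass run after the preset is applied; parse_thp_policy() is an assumed helper, and only thp_policy / g_hakem_config come from the Step 1 struct.

```c
// Hypothetical override step after hak_config_apply_mode(): individual env
// vars win over the preset defaults. parse_thp_policy() is an assumed helper.
const char* thp_override = getenv("HAKMEM_THP");
if (thp_override) {
    g_hakem_config.thp_policy = parse_thp_policy(thp_override);
}
// Other legacy vars (HAKMEM_FREE_POLICY, HAKMEM_EVO_POLICY, ...) would be
// handled the same way, so HAKMEM_MODE=balanced HAKMEM_THP=off behaves as above.
```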
Deprecation Timeline
- Phase 6.8: Both HAKMEM_MODE and individual env vars supported
- Phase 7: Prefer HAKMEM_MODE, warn if individual vars used
- Phase 8: Deprecate individual vars (only HAKMEM_MODE)
🚀 Implementation Timeline
| Step | Task | Time | Cumulative | Status |
|---|---|---|---|---|
| 0 | Baseline (done) | - | - | ✅ |
| 1 | MINIMAL mode | 1 day | 1 day | 🚧 |
| 2 | +BigCache | 0.5 day | 1.5 days | ⏳ |
| 3 | +Batch | 0.5 day | 2 days | ⏳ |
| 4 | BALANCED (ELO FROZEN) | 0.5 day | 2.5 days | ⏳ |
| 5 | FAST (TinyPool MVP) | 2-3 weeks | 3.5-4.5 weeks | ⏳ |
| 6 | LEARNING mode | 1 day | 3.6-4.6 weeks | ⏳ |
| 7 | RESEARCH mode | 0.5 day | 3.65-4.65 weeks | ⏳ |
Total: 3.7-4.7 weeks (MVP: 2.5 days, Full: 4-5 weeks)
📚 Documentation Updates
README.md
Add section:
## 🎯 Quick Start: Choosing a Mode
- **Development**: `HAKMEM_MODE=learning` (adaptive, slow)
- **Production**: `HAKMEM_MODE=fast` (mimalloc +20%)
- **General**: `HAKMEM_MODE=balanced` (default, mimalloc +40%)
- **Benchmarking**: `HAKMEM_MODE=minimal` (baseline)
- **Research**: `HAKMEM_MODE=research` (all features + debug)
New Files
- PHASE_6.8_CONFIG_CLEANUP.md (this file)
- apps/experiments/hakmem-poc/hakmem_config.h
- apps/experiments/hakmem-poc/hakmem_config.c
- apps/experiments/hakmem-poc/bench_modes.sh
- apps/experiments/hakmem-poc/analyze_modes.py
🎓 Expected Outcomes
For Paper
Before Phase 6.8:
- ❌ "hakmem is +88% slower than mimalloc"
- ⚠️ Complex configuration, hard to reproduce
- ⚠️ Unclear which features contribute to overhead
After Phase 6.8:
- ✅ "BALANCED mode: +40% overhead for adaptive learning"
- ✅ "FAST mode: +20% overhead, competitive with production allocators"
- ✅ "Each feature's impact clearly measured"
- ✅ "5 simple modes, easy to reproduce"
For Future Work
- Step 5 (TinyPool) can become Phase 7 if successful
- ChatGPT Pro's hybrid architecture validated
- Clear path to mimalloc-level performance
🏆 Final Status
Phase 6.8: 🚧 IN PROGRESS
Next Steps:
- ✅ Design document created (this file)
- 🚧 Implement Step 1 (MINIMAL mode)
- ⏳ Measure & iterate through Steps 2-7
Ready to start implementation! 🚀