## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression),
shift to instruction count reduction via compile-time removal:

- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput ±3%, instructions -5% to -15%
NO-GO: throughput < -2%, instructions < -5%

Key: If instructions do not drop by at least 15%, the allocator is not the
bottleneck and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal (see the sketch after this list):
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes

3. A/B test with perf stat (must measure instruction reduction)
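
As a sketch of step 2's gating (assumptions: HAKMEM_BENCH_MINIMAL and FRONT_FASTLANE_STAT_INC come from the design above; the env-check helper HAK_ENV_FLAG/hak_env_flag is a hypothetical shape, not the real API):

/* Sketch: make BENCH_MINIMAL=1 is assumed to pass -DHAKMEM_BENCH_MINIMAL=1. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0                       /* default OFF: production-safe */
#endif

#if !HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(c) ((c)++)           /* stats kept in normal builds  */
#define HAK_ENV_FLAG(name, dflt)   hak_env_flag((name), (dflt))  /* runtime lookup   */
#else
#define FRONT_FASTLANE_STAT_INC(c) ((void)0)         /* compiles to nothing          */
#define HAK_ENV_FLAG(name, dflt)   (dflt)            /* constant-folds, branch gone  */
#endif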

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees.
Section splitting alone (without symbol ordering, PGO, or linker script)
destroyed code locality and increased I-cache misses 91%.

Switching to direct instruction removal is safer and more predictable.


hakmem PoC - Call-site Profiling + UCB1 Evolution

Documentation entry point: docs/INDEX.md (links by category) / reorganization plan: docs/DOCS_REORG_PLAN.md

Purpose: Proof-of-Concept for the core ideas from the paper:

  1. "Call-site address is an implicit purpose label - same location → same pattern"
  2. "UCB1 bandit learns optimal allocation policies automatically"

🎯 Current Status (2025-11-01)

Mid-Range Multi-Threaded Complete (110M ops/sec)

  • Achievement: 110M ops/sec on mid-range MT workload (8-32KB)
  • Comparison: 100-101% of mimalloc, 2.12x faster than glibc
  • Implementation: core/hakmem_mid_mt.{c,h}
  • Benchmarks: benchmarks/scripts/mid/ (run_mid_mt_bench.sh, compare_mid_mt_allocators.sh)
  • Report: MID_MT_COMPLETION_REPORT.md

Repository Reorganization Complete

  • New Structure: All benchmarks under benchmarks/, tests under tests/
  • Root Directory: 252 → 70 items (72% reduction)
  • Organization:
    • benchmarks/src/{tiny,mid,comprehensive,stress}/ - Benchmark sources
    • benchmarks/scripts/{tiny,mid,comprehensive,utils}/ - Scripts organized by category
    • benchmarks/results/ - All benchmark results (871+ files)
    • tests/{unit,integration,stress}/ - Tests by type
  • Details: FOLDER_REORGANIZATION_2025_11_01.md

ACE Learning Layer Phase 1 Complete (ACE = Agentic Context Engineering / Adaptive Control Engine)

  • Status: Phase 1 Infrastructure COMPLETE (2025-11-01)
  • Goal: Fix weak workloads with adaptive learning
    • Fragmentation stress: 3.87 → 10-20 M ops/s (2.6-5.2x target)
    • Large working set: 22.15 → 30-45 M ops/s (1.4-2.0x target)
    • realloc: 277ns → 140-210ns (1.3-2.0x target)
  • Phase 1 Deliverables (100% complete):
    • Metrics collection infrastructure (hakmem_ace_metrics.{c,h})
    • UCB1 learning algorithm (hakmem_ace_ucb1.{c,h})
    • Dual-loop controller (hakmem_ace_controller.{c,h})
    • Dynamic TLS capacity adjustment
    • Hot-path metrics integration (alloc/free tracking)
    • A/B benchmark script (scripts/bench_ace_ab.sh)
  • Documentation:
    • User guide: docs/ACE_LEARNING_LAYER.md
    • Implementation plan: docs/ACE_LEARNING_LAYER_PLAN.md
    • Progress report: ACE_PHASE1_PROGRESS.md
  • Usage: HAKMEM_ACE_ENABLED=1 ./your_benchmark
  • Next: Phase 2 - Extended benchmarking + learning convergence validation

📂 Quick Navigation

  • Build & Run: See "Quick Start" section below
  • Benchmarks: benchmarks/scripts/ organized by category
  • Documentation: DOCS_INDEX.md - Central documentation hub
  • Current Work: CURRENT_TASK.md

🧪 Larson Quick Run (Tiny + Superslab, mainline)

Use the defaults wrapper so critical env vars are always set:

  • Throughput-oriented (2s, threads=1,4): scripts/run_larson_defaults.sh
  • Lower page-fault/sys (10s, threads=4): scripts/run_larson_defaults.sh pf 10 4
  • Claude-friendly presets (envs pre-wired for reproducible debug): scripts/run_larson_claude.sh [tput|pf|repro|fast0|guard|debug] 2 4
    • For Claude Code runs with log capture, use scripts/claude_code_debug.sh.

The mainline ("no segfaults") is now the default. The default environment assumes the publish→mail→adopt pipeline is active:

  • Tiny/Superslab gates: HAKMEM_TINY_USE_SUPERSLAB=1 (default ON), HAKMEM_TINY_MUST_ADOPT=1, HAKMEM_TINY_SS_ADOPT=1
  • Fast-tier spill to create publish: HAKMEM_TINY_FAST_CAP=64, HAKMEM_TINY_FAST_SPARE_PERIOD=8
  • TLS list: HAKMEM_TINY_TLS_LIST=1
  • Mailbox discovery: HAKMEM_TINY_MAILBOX_SLOWDISC=1, HAKMEM_TINY_MAILBOX_SLOWDISC_PERIOD=256
  • Superslab sizing/cache/precharge: per mode (tput vs pf)

Debugging tips:

  • Add HAKMEM_TINY_RF_TRACE=1 for one-shot publish/mail traces.
  • Use scripts/run_larson_claude.sh debug 2 4 to enable TRACE_RING and emit early SIGUSR2 so the Tiny ring is dumped before crashes.

SLL-first Fast Path (Box 5)

  • Hot path favors TLS SLL (per-thread freelist) first; on miss, falls back to HotMag/TLS list, then SuperSlab.
  • Learning shifts to SLL via sll_cap_for_class() with per-class override/multiplier (small classes 0..3).
  • Ownership → remote drain → bind is centralized via SlabHandle (Box 3→2) for safety and determinism.
  • A/B knobs:
    • HAKMEM_TINY_TLS_SLL=0/1 (default 1)
    • HAKMEM_SLL_MULTIPLIER=N and HAKMEM_TINY_SLL_CAP_C{0..7}
    • HAKMEM_TINY_TLS_LIST=0/1

P0 batch refill is now compile-time only; runtime P0 env toggles were removed.

Benchmark Matrix

  • Quick matrix to compare mid layers vs SLL-first:
    • scripts/bench_matrix.sh 30 8 (duration=30s, threads=8)
  • Single run (throughput):
    • HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 scripts/run_larson_claude.sh tput 30 8
  • Force-notify path (A/B): set HAKMEM_TINY_RF_FORCE_NOTIFY=1 to surface missing first-notify cases.

Build Modes (Box Refactor)

  • Default (mainline): the Box Theory refactor (Phase 61.7) and the Superslab path are always ON
    • Compile flag: -DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1 (Makefile default)
    • Runtime default: g_use_superslab=1 (ON unless explicitly set to 0 via environment variable)
    • A/B against the legacy path: make BOX_REFACTOR_DEFAULT=0 larson_hakmem

🚨 Segfault-free Policy (hard requirement)

  • The mainline is designed and implemented with "no segfaults" as the top priority.
  • Before adopting a change, run it through the following guards:
    • Guard run: ./scripts/larson.sh guard 2 4 (Trace Ring + Safe Free)
    • ASan/UBSan/TSan: ./scripts/larson.sh asan 2 4 / ubsan / tsan
    • Fail-fast environment: HAKMEM_TINY_RF_TRACE=0 etc.; follow the safety procedure in LARSON_GUIDE.md
    • Confirm that remote_invalid / SENTINEL_TRAP do not appear at the tail of the ring

New A/B Observation and Controls

  • Registry window: HAKMEM_TINY_REG_SCAN_MAX (default 256)
    • Caps how many entries the small registry window scans (for A/B of scan cost vs adopt hit rate)
  • Simplified Mid refill: HAKMEM_TINY_MID_REFILL_SIMPLE=1 (skips the multi-stage search for class >= 4)
    • For throughput-focused A/B (reduces adopt/search work). Check page faults/RSS before regular use.

Mimalloc vs HAKMEM (Larson quick A/B)

  • Recommended HAKMEM env (Tiny Hot, SLL-only, fast tier on):
HAKMEM_TINY_REFILL_COUNT_HOT=64 \
HAKMEM_TINY_FAST_CAP=16 \
HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 \
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 \
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 \
./larson_hakmem 2 8 128 1024 1 12345 4
  • One-shot refill path confirmation (noisy print just once):
HAKMEM_TINY_REFILL_OPT_DEBUG=1 <above_env> ./larson_hakmem 2 8 128 1024 1 12345 4
  • Mimalloc (direct link binary):
LD_LIBRARY_PATH=$PWD/mimalloc-bench/extern/mi/out/release ./larson_mi 2 8 128 1024 1 12345 4
  • Perf (selected counters):
perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,\
  L1-dcache-loads,L1-dcache-load-misses -- \
  env <above_env> ./larson_hakmem 5 8 128 1024 1 12345 4

🎯 What This Proves

Phase 1: Call-site Profiling (DONE)

  1. Call-site capture works: __builtin_return_address(0) uniquely identifies allocation sites
  2. Different sites have different patterns: JSON (small, frequent) vs MIR (medium) vs VM (large)
  3. Profiling is lightweight: Simple hash table + sampling
  4. Zero user burden: Just replace malloc → hak_alloc_cs
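
A compilable sketch of the capture behind item 1 (the real HAK_CALLSITE() macro and entry-point shape may differ; hak_alloc_cs is the API named in this README):

#include <stddef.h>
#include <stdint.h>

void *hak_alloc_cs(size_t size, uintptr_t site);   /* profiling allocator entry */

/* In a non-inlined wrapper, frame 0's return address is the instruction
   right after the caller's call: a stable, unique ID per allocation site. */
__attribute__((noinline))
void *hak_alloc(size_t size) {
    uintptr_t site = (uintptr_t)__builtin_return_address(0);
    return hak_alloc_cs(size, site);
}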

Phase 2-4: UCB1 Evolution + A/B Testing (DONE)

  1. KPI measurement: P50/P95/P99 latency, Page Faults, RSS delta
  2. Discrete policy steps: 6 levels (64KB → 2MB)
  3. UCB1 bandit: Exploration + Exploitation balance
  4. Safety mechanisms:
    • ±1 step exploration (safe)
    • Hysteresis (8% improvement × 3 consecutive)
    • Cooldown (180 seconds)
  5. A/B testing: baseline vs evolving modes
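
For reference, the UCB1 selection in item 3 reduces to a few lines; the arm bookkeeping below is a hedged sketch, not the exact hakmem_ucb1.c code:

#include <math.h>

typedef struct { double mean_reward; unsigned pulls; } arm_t;

/* Pick the policy step maximizing mean + sqrt(2 ln N / n):
   high mean = exploitation, rarely-tried arms = exploration. */
static int ucb1_select(const arm_t *arms, int n_arms, unsigned total_pulls) {
    int best = 0;
    double best_score = -1.0;
    for (int i = 0; i < n_arms; i++) {
        if (arms[i].pulls == 0)
            return i;                              /* try every arm once first */
        double score = arms[i].mean_reward
                     + sqrt(2.0 * log((double)total_pulls) / arms[i].pulls);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}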

Phase 5: Benchmarking Infrastructure (COMPLETE)

  1. Allocator comparison framework: hakmem vs jemalloc/mimalloc/system malloc
  2. Fair benchmarking: Same workload, 50 runs per config, 1000 total runs
  3. KPI measurement: Latency (P50/P95/P99), page faults, RSS, throughput
  4. Paper-ready output: CSV format for graphs/tables
  5. Initial ranking (UCB1): 🥉 3rd place among 5 allocators

This proves Sections 3.6-3.7 of the paper. See PAPER_SUMMARY.md for detailed results.

Phase 6.1-6.4: ELO Rating System (COMPLETE)

  1. Strategy diversity: 6 threshold levels (64KB, 128KB, 256KB, 512KB, 1MB, 2MB)
  2. ELO rating: Each strategy has rating, learns from win/loss/draw
  3. Softmax selection: Probability ∝ exp(rating/temperature)
  4. BigCache optimization: Tier-2 size-class caching for large allocations
  5. Batch madvise: MADV_DONTNEED batching for reduced syscall overhead
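
A sketch of the softmax rule from item 3 (function and RNG handling are illustrative; the real hakmem_elo.c may differ):

#include <math.h>
#include <stdlib.h>

/* P(strategy i) ∝ exp(rating[i] / T); lower T → greedier selection.
   n <= 6 here: the six threshold strategies (64KB..2MB). */
static int softmax_pick(const double *rating, int n, double temperature) {
    double w[6], total = 0.0, max = rating[0];
    for (int i = 1; i < n; i++)
        if (rating[i] > max) max = rating[i];      /* stabilize exp() */
    for (int i = 0; i < n; i++) {
        w[i] = exp((rating[i] - max) / temperature);
        total += w[i];
    }
    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < n; i++) {
        r -= w[i];
        if (r <= 0.0) return i;
    }
    return n - 1;
}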

🏆 VM Scenario Benchmark Results (iterations=100):

🥇 mimalloc         15,822 ns  (baseline)
🥈 hakmem-evolving  16,125 ns  (+1.9%)  ← BigCache effect
🥉 system           16,814 ns  (+6.3%)
4th jemalloc        17,575 ns  (+11.1%)

Key achievement: 1.9% gap to 1st place (down from -50% in Phase 5!)

See PHASE_6.2_ELO_IMPLEMENTATION.md for details.

Phase 6.5: Learning Lifecycle (COMPLETE)

  1. 3-state machine: LEARN → FROZEN → CANARY
    • LEARN: Active learning with ELO updates
    • FROZEN: Zero-overhead production mode (confirmed best policy)
    • CANARY: Safe 5% trial sampling to detect workload changes
  2. Convergence detection: P² algorithm for O(1) p99 estimation
  3. Distribution signature: L1 distance for workload shift detection
  4. Environment variables: Fully configurable (freeze time, window size, etc.)
  5. Production ready: 6/6 tests passing, LEARN→FROZEN transition verified
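
The state machine in item 1, reduced to a sketch (the predicates are stand-ins for the real convergence/canary checks in hakmem_evo.c):

typedef enum { EVO_LEARN, EVO_FROZEN, EVO_CANARY } evo_state_t;

/* One decision tick: LEARN until converged, FROZEN until a canary
   window opens, CANARY re-learns only if the workload shifted. */
static evo_state_t evo_next(evo_state_t s, int converged,
                            int canary_window, int workload_shifted) {
    switch (s) {
    case EVO_LEARN:  return converged        ? EVO_FROZEN : EVO_LEARN;
    case EVO_FROZEN: return canary_window    ? EVO_CANARY : EVO_FROZEN;
    case EVO_CANARY: return workload_shifted ? EVO_LEARN  : EVO_FROZEN;
    }
    return s;
}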

Key feature: Learning converges in ~180 seconds, then runs at zero overhead in FROZEN mode!

See PHASE_6.5_LEARNING_LIFECYCLE.md for complete documentation.

Phase 6.6: ELO Control Flow Fix (COMPLETE)

Problem: After Phase 6.5 integration, batch madvise stopped activating.
Root cause: ELO strategy selection happened AFTER allocation, so its result was ignored.
Fix: Reordered hak_alloc_at() to apply the ELO threshold BEFORE allocation.

Diagnosed by: Gemini Pro (2025-10-21). Fixed by: Claude (2025-10-21).

Key insight:

  • OLD: allocate_with_policy(POLICY_DEFAULT) → malloc → ELO selection (too late!)
  • NEW: ELO selection → size >= threshold ? mmap : malloc

Result: 2MB allocations now correctly use mmap, enabling batch madvise optimization.
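
In code form, the fix is an ordering change; the helper names below are hypothetical shapes, not the exact hakmem signatures:

#include <stddef.h>

size_t elo_select_threshold(void);   /* hypothetical: ELO-chosen mmap threshold */
void  *alloc_via_mmap(size_t size);  /* hypothetical backends */
void  *alloc_via_malloc(size_t size);

/* NEW order: select the strategy first, then let it steer the allocation,
   so 2MB requests actually reach mmap and become batch-madvise candidates. */
void *hak_alloc_at(size_t size) {
    size_t threshold = elo_select_threshold();
    return (size >= threshold) ? alloc_via_mmap(size)
                               : alloc_via_malloc(size);
}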

See PHASE_6.6_ELO_CONTROL_FLOW_FIX.md for detailed analysis.

Phase 6.7: Overhead Analysis (COMPLETE)

Goal: Identify why hakmem is 2× slower than mimalloc despite identical syscall counts

Key Findings:

  1. Syscall overhead is NOT the bottleneck
    • hakmem: 292 mmap, 206 madvise (same as mimalloc)
    • Batch madvise working correctly
  2. The gap is structural, not algorithmic
    • mimalloc: Pool-based allocation (9ns fast path)
    • hakmem: Hash-based caching (31ns fast path)
    • 3.4× fast path difference explains 2× total gap
  3. hakmem's "smart features" have < 1% overhead
    • ELO: ~100-200ns (0.5%)
    • BigCache: ~50-100ns (0.3%)
    • Total: ~350ns out of 17,638ns gap (2%)

Recommendation: Accept the gap for research prototype OR implement hybrid pool fast-path (ChatGPT Pro proposal)

Deliverables:

Phase 6.8: Configuration Cleanup (COMPLETE)

Goal: Simplify complex environment variables into 5 preset modes + implement feature flags

Critical Bug Fixed: Task Agent investigation revealed complete design vs implementation gap:

  • Design: "Check g_hakem_config flags before enabling features"
  • Implementation: Features ran unconditionally (never checked!)
  • Impact: "MINIMAL mode" measured 14,959 ns but was actually BALANCED (all features ON)

Solution Implemented: Mode-based configuration + Feature-gated initialization

# Simple preset modes
export HAKMEM_MODE=minimal    # Baseline (all features OFF)
export HAKMEM_MODE=fast       # Production (pool fast-path + FROZEN)
export HAKMEM_MODE=balanced   # Default (BigCache + ELO FROZEN + Batch)
export HAKMEM_MODE=learning   # Development (ELO LEARN + adaptive)
export HAKMEM_MODE=research   # Debug (all features + verbose logging)

🎯 Benchmark Results - PROOF OF SUCCESS!

Test: VM scenario (2MB allocations, 100 iterations)

MINIMAL mode:  216,173 ns  (all features OFF - true baseline)
BALANCED mode:  15,487 ns  (BigCache + ELO ON)
→ 13.95x speedup from optimizations! 🚀

Feature Matrix (Now Actually Enforced!):

Feature            MINIMAL  FAST     BALANCED  LEARNING  RESEARCH
ELO learning       -        FROZEN   FROZEN    LEARN     LEARN
BigCache           -        ✓        ✓         ✓         ✓
Batch madvise      -        ✓        ✓         ✓         ✓
TinyPool (future)  -        ✓        -         -         ✓
Debug logging      -        -        -         -         ⚠️

Code Quality Improvements:

  • hakmem.c: 899 → 600 lines (-33% reduction)
  • New infrastructure: hakmem_features.h, hakmem_config.c/h, hakmem_internal.h (692 lines)
  • Static inline helpers: Zero-cost abstraction (100% inlined with -O2)
  • Feature flags: Runtime checks with < 0.1% overhead
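
A sketch of the feature-gated pattern behind these numbers (struct and field names here are illustrative; see hakmem_config.c/h for the real ones):

/* One config struct, populated once from HAKMEM_MODE at init. */
typedef struct {
    int elo_enabled;
    int bigcache_enabled;
    int batch_madvise_enabled;
} hak_config_t;

extern hak_config_t g_config;

/* Static inline checks: with -O2 these inline to one test of a hot,
   read-only global, which is where the "< 0.1% overhead" comes from. */
static inline int hak_feature_bigcache(void) { return g_config.bigcache_enabled; }
static inline int hak_feature_batch(void)    { return g_config.batch_madvise_enabled; }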

Benefits Delivered:

  • Easy to use (HAKMEM_MODE=balanced)
  • Clear benchmarking (14x performance difference proven!)
  • Backward compatible (individual env vars still work)
  • Paper-friendly (quantified feature impact)

See PHASE_6.8_PROGRESS.md for complete implementation details.


🚀 Quick Start

🎯 Choose Your Mode (Phase 6.8+)

New: hakmem now supports 5 simple preset modes!

# 1. MINIMAL - Baseline (all optimizations OFF)
export HAKMEM_MODE=minimal
./bench_allocators --allocator hakmem-evolving --scenario vm

# 2. BALANCED - Default recommended (BigCache + ELO FROZEN + Batch)
export HAKMEM_MODE=balanced  # or omit (default)
./bench_allocators --allocator hakmem-evolving --scenario vm

# 3. LEARNING - Development (ELO learns, adapts to workload)
export HAKMEM_MODE=learning
./test_hakmem

# 4. FAST - Production (future: pool fast-path + FROZEN)
export HAKMEM_MODE=fast
./bench_allocators --allocator hakmem-evolving --scenario vm

# 5. RESEARCH - Debug (all features + verbose logging)
export HAKMEM_MODE=research
./test_hakmem

Quick reference:

  • Just want it to work? → Use balanced (default)
  • Benchmarking baseline? → Use minimal
  • Development/testing? → Use learning
  • Production deployment? → Use fast (after Phase 7)
  • Debugging issues? → Use research

📖 Legacy Usage (Phase 1-6.7)

# Build
make

# Run basic test
make run

# Run A/B test (baseline mode)
./test_hakmem

# Run A/B test (evolving mode - UCB1 enabled)
env HAKMEM_MODE=evolving ./test_hakmem

# Override individual settings (backward compatible)
export HAKMEM_MODE=balanced
export HAKMEM_THP=off  # Override THP policy
./bench_allocators --allocator hakmem-evolving --scenario vm

⚙️ Useful Environment Variables

Tiny publish/adopt pipeline

# Enable SuperSlab (required for publish/adopt)
export HAKMEM_TINY_USE_SUPERSLAB=1
# Optional: must-adopt-before-mmap (one-pass adopt before mmap)
export HAKMEM_TINY_MUST_ADOPT=1
  • HAKMEM_TINY_USE_SUPERSLAB=1

    • The publish→mailbox→adopt pipeline runs only when the SuperSlab path is ON (with it OFF, the pipeline does nothing).
    • Recommended ON by default for benchmarks (for A/B you can compare with OFF, which favors memory efficiency).
  • HAKMEM_SAFE_FREE=1

    • Adds a best-effort mincore() guard before reading headers on free().
    • Safer with LD_PRELOAD at the cost of extra overhead. Default: off.
  • HAKMEM_WRAP_TINY=1

    • Allows Tiny Pool allocations during malloc/free wrappers (LD_PRELOAD).
    • Wrapper-context uses a magazine-only fast path (no locks/refill) for safety.
    • Default: off for stability. Enable to test Tiny impact on small-object workloads.
  • HAKMEM_TINY_MAG_CAP=INT

    • Upper bound for Tiny TLS magazine per class (soft). Default: build limit (2048); recommended 1024 for BURST.
  • HAKMEM_SITE_RULES=1

    • Enables Site Rules. Note: tier selection no longer uses Site Rules (SACS3); they remain only as layer-internal hints for future use.
  • HAKMEM_PROF=1, HAKMEM_PROF_SAMPLE=N

    • Enables the lightweight sampling profiler. N is an exponent: sample every 2^N calls (default 12). Outputs per-category average ns. (A sampling-gate sketch follows this list.)
  • HAKMEM_ACE_SAMPLE=N

    • ACE layer (L1) stats sampling for mid/large hit/miss and L1 fallback. Default off.
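
The HAKMEM_PROF_SAMPLE gate can be this small; a sketch assuming a per-thread counter (variable names illustrative):

#include <stdint.h>

static __thread uint64_t t_prof_calls;    /* per-thread call counter */
static unsigned g_prof_shift = 12;        /* N from HAKMEM_PROF_SAMPLE, default 12 */

/* True once every 2^N calls: a power-of-two period makes the hot-path
   check one increment plus one mask test, no division. */
static inline int prof_should_sample(void) {
    return (++t_prof_calls & ((1ull << g_prof_shift) - 1)) == 0;
}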

🧪 Larson Runner (Reproducible)

Use the provided runner to compare system/mimalloc/hakmem under identical settings.

scripts/run_larson.sh [options] [runtime_sec] [threads_csv]

Options:
  -d SECONDS     Runtime seconds (default: 10)
  -t CSV         Threads CSV, e.g. 1,4 (default: 1,4)
  -c NUM         Chunks per thread (default: 10000)
  -r NUM         Rounds (default: 1)
  -m BYTES       Min size (default: 8)
  -M BYTES       Max size (default: 1024)
  -s SEED        Random seed (default: 12345)
  -p PRESET      Preset: burst|loop (sets -c/-r)

Presets:
  burst → chunks/thread=10000, rounds=1   # harsher (many objects held at once)
  loop  → chunks/thread=100,   rounds=100 # gentler (high locality)

Examples:
  scripts/run_larson.sh -d 10 -t 1,4            # burst (default)
  scripts/run_larson.sh -d 10 -t 1,4 -p loop    # 100×100 loop

Performance-oriented env (recommended when comparing hakmem):

HAKMEM_DISABLE_BATCH=0
HAKMEM_TINY_META_ALLOC=0
HAKMEM_TINY_META_FREE=0
HAKMEM_TINY_SS_ADOPT=1
bash scripts/run_larson.sh -d 10 -t 1,4


Counters dump (refill/publish visibility):

Legacy-compatible (individual ENV):

HAKMEM_TINY_COUNTERS_DUMP=1 ./test_hakmem   # prints [Refill Stage Counters]/[Publish Hits] at exit

Via the stats master box (Phase 4d):

HAKMEM_STATS=counters ./test_hakmem   # the same counters, enabled in bulk via HAKMEM_STATS
HAKMEM_STATS_DUMP=1 ./test_hakmem     # dump all Tiny counters at exit (atexit)


LD_PRELOAD notes:

- This repository provides `libhakmem.so` (`make shared`).
- The `bench/larson/larson` bundled with mimalloc-bench is a distributed binary, so it may fail to run here due to a GLIBC version mismatch.
- To reproduce the LD_PRELOAD path, either prepare a GLIBC-compatible binary separately, or apply `LD_PRELOAD=$(pwd)/libhakmem.so` to a system-linked benchmark (e.g. comprehensive_system).

Current status (quick snapshot, burst: `-d 2 -t 1,4 -m 8 -M 128 -c 1024 -r 1`):

- system (1T): ~14.6 M ops/s
- mimalloc (1T): ~16.8 M ops/s
- hakmem (1T): ~1.1–1.3 M ops/s
- system (4T): ~16.8 M ops/s
- mimalloc (4T): ~16.8 M ops/s
- hakmem (4T): ~4.2 M ops/s

Note: Larson still shows a large gap, but the other built-in benchmarks (Tiny Hot / Random Mixed, etc.) are close contests (Tiny Hot: ~98% of mimalloc, confirmed). The focus for improving Larson is optimizing the free→alloc publish/pop connection and the MT wiring (Adopt Gate already in place).

### 🔬 Profiler Sweep (Overhead Tracking)

Use the sweep helper to probe size ranges and gather sampling profiler output quickly (2s per run by default):

scripts/prof_sweep.sh -d 2 -t 1,4 -s 8                  # sample=1/256, 1T/4T, multiple ranges
scripts/prof_sweep.sh -d 2 -t 4 -s 10 -m 2048 -M 32768  # focus (2–32KiB)


Env tips:
- `HAKMEM_TINY_MAG_CAP=1024` recommended for BURST style runs.
- Profiling ON adds minimal overhead due to sampling; keep N high (8–12) for realistic loads.

Profiler categories (subset):
- `tiny_alloc`, `ace_alloc`, `malloc_alloc`, `mmap_alloc`, `bigcache_try`
- Tiny internals: `tiny_bitmap`, `tiny_drain_locked/owner`, `tiny_spill`, `tiny_reg_lookup/register`
- Pool internals: `pool_lock/refill`, `l25_lock/refill`

Notes:

  • Runner uses absolute LD_PRELOAD paths for reliability.
  • Set MIMALLOC_SO=/path/to/libmimalloc.so.2 if auto-detection fails.

🧱 TLS Active Slab (Arena-lite)

Tiny Pool keeps one "TLS Active Slab" per thread per class.

  • On a magazine miss, allocation comes lock-free from the TLS Slab (only the owning thread updates the bitmap).
  • Remote frees go to an MPSC stack; the owning thread drains it lock-free via tiny_remote_drain_owner() (sketched below).
  • Adopt runs once under the class lock (trylock only while inside the wrapper).

This minimizes lock contention and false sharing, giving stable speedups at both 1T and 4T.
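
A hedged sketch of the remote-free MPSC stack described above (node/field names illustrative; cf. tiny_remote_drain_owner()):

#include <stdatomic.h>
#include <stddef.h>

typedef struct rnode { struct rnode *next; } rnode_t;
typedef struct { _Atomic(rnode_t *) head; } rstack_t;

/* Producers (any thread): lock-free push of a freed block. */
static void remote_push(rstack_t *s, rnode_t *n) {
    rnode_t *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Consumer (owning thread only): one exchange takes the whole list,
   which can then be walked with no further synchronization. */
static rnode_t *remote_drain_owner(rstack_t *s) {
    return atomic_exchange_explicit(&s->head, NULL, memory_order_acquire);
}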

🧊 EVO/Gating (low overhead by default)

Measurement for the learning system (EVO) is disabled by default (HAKMEM_EVO_SAMPLE=0).

  • In free(), clock_gettime() and P² updates run only while sampling is enabled.
  • Set HAKMEM_EVO_SAMPLE=N only when you want to see the measurements.

🏆 Benchmark Comparison (Phase 5)

# Build benchmark programs
make bench

# Run quick benchmark (3 warmup, 5 runs)
bash bench_runner.sh --warmup 3 --runs 5

# Run full benchmark (10 warmup, 50 runs)
bash bench_runner.sh --warmup 10 --runs 50 --output results.csv

# Manual single run
./bench_allocators_hakmem --allocator hakmem-baseline --scenario json
./bench_allocators_system --allocator system --scenario json
LD_PRELOAD=libjemalloc.so.2 ./bench_allocators_system --allocator jemalloc --scenario json

Benchmark scenarios:

  • json - Small (64KB), frequent (1000 iterations)
  • mir - Medium (256KB), moderate (100 iterations)
  • vm - Large (2MB), infrequent (10 iterations)
  • mixed - All patterns combined

Allocators tested:

  • hakmem-baseline - Fixed policy (256KB threshold)
  • hakmem-evolving - UCB1 adaptive learning
  • system - glibc malloc (baseline)
  • jemalloc - Industry standard (Firefox, Redis)
  • mimalloc - Microsoft allocator (state-of-the-art)

📊 Expected Results

Basic Test (test_hakmem)

You should see 3 different call-sites with distinct patterns:

Site #1:
  Address:    0x55d8a7b012ab
  Allocs:     1000
  Total:      64000000 bytes
  Avg size:   64000 bytes      # JSON parsing (64KB)
  Max size:   65536 bytes
  Policy:     SMALL_FREQUENT (malloc)

Site #2:
  Address:    0x55d8a7b012f3
  Allocs:     100
  Total:      25600000 bytes
  Avg size:   256000 bytes     # MIR build (256KB)
  Max size:   262144 bytes
  Policy:     MEDIUM (malloc)

Site #3:
  Address:    0x55d8a7b0133b
  Allocs:     10
  Total:      20971520 bytes
  Avg size:   2097152 bytes    # VM execution (2MB)
  Max size:   2097152 bytes
  Policy:     LARGE_INFREQUENT (mmap)

Key observation: Same code, different call-sites → automatically different profiles!

Benchmark Results (Phase 5) - FINAL

🏆 Overall Ranking (Points System: 5 allocators × 4 scenarios)

🥇 #1: mimalloc             18 points
🥈 #2: jemalloc             13 points
🥉 #3: hakmem-evolving      12 points ← Our contribution
   #4: system               10 points
   #5: hakmem-baseline      7 points

📊 Performance by Scenario (Median Latency, 50 runs each)

Scenario      hakmem-evolving   Best (Winner)            Gap       Status
JSON (64KB)   284.0 ns          263.5 ns (system)        +7.8%     Acceptable overhead
MIR (512KB)   1,750.5 ns        1,350.5 ns (mimalloc)    +29.6%    ⚠️ Competitive
VM (2MB)      58,600.0 ns       18,724.5 ns (mimalloc)   +213.0%   Needs per-site caching
MIXED         969.5 ns          518.5 ns (mimalloc)      +87.0%    Needs work

🔑 Key Findings:

  1. Call-site profiling overhead is acceptable (+7.8% on JSON)
  2. Competitive on medium allocations (+29.6% on MIR)
  3. Large allocation gap (3.1× slower than mimalloc on VM)
    • Root cause: Lack of per-site free-list caching
    • Future work: Implement Tier-2 MappedRegion hash map

🔥 Critical Discovery: Page Faults Issue

  • Initial direct mmap(): 1,538 page faults (769× more than system malloc!)
  • Fixed with malloc-based approach: 1,025 page faults (now equal to system)
  • Performance swing: VM scenario -54% → +14.4% (68.4 point improvement!)

See PAPER_SUMMARY.md for detailed analysis and paper narrative.


🔧 Implementation Details

Files

Phase 1-5 (UCB1 + Benchmarking):

  • hakmem.h - C API (call-site profiling + KPI measurement, ~110 lines)
  • hakmem.c - Core implementation (profiling + KPI + lifecycle, ~750 lines)
  • hakmem_ucb1.c - UCB1 bandit evolution (~330 lines)
  • test_hakmem.c - A/B test program (~135 lines)
  • bench_allocators.c - Benchmark framework (~360 lines)
  • bench_runner.sh - Automated benchmark runner (~200 lines)

Phase 6.1-6.4 (ELO System):

  • hakmem_elo.h/.c - ELO rating system (~450 lines)
  • hakmem_bigcache.h/.c - BigCache tier-2 optimization (~210 lines)
  • hakmem_batch.h/.c - Batch madvise optimization (~120 lines)

Phase 6.5 (Learning Lifecycle):

  • hakmem_p2.h/.c - P² percentile estimation (~130 lines)
  • hakmem_sizeclass_dist.h/.c - Distribution signature (~120 lines)
  • hakmem_evo.h/.c - State machine core (~610 lines)
  • test_evo.c - Lifecycle tests (~220 lines)

Documentation:

  • BENCHMARK_DESIGN.md, PAPER_SUMMARY.md, PHASE_6.2_ELO_IMPLEMENTATION.md, PHASE_6.5_LEARNING_LIFECYCLE.md

Phase 6.16 (SACS3)

SACS3: size-only tier selection + ACE for L1.

  • L0 Tiny (≤1KiB): TinySlab with TLS magazine and TLS Active Slab.
  • L1 ACE (1KiB–2MiB): unified hkm_ace_alloc()
    • MidPool (2/4/8/16/32 KiB), LargePool (64/128/256/512 KiB/1 MiB)
    • W_MAX rounding: allow rounding up to a class if class size ≤ W_MAX×size (FrozenPolicy.w_max)
    • 32–64KiB gap absorbed to 64KiB when allowed by W_MAX
  • L2 Big (≥2MiB): BigCache/mmap (THP gate)

Site Rules is OFF by default and no longer used for tier selection. Hot path has no clock_gettime except optional sampling.

New modules:

  • hakmem_policy.h/.c - FrozenPolicy (RCU snapshot). Hot path loads once per call; learning thread publishes a new snapshot.
  • hakmem_ace.h/.c - ACE-layer alloc (L1 unified), W_MAX rounding.
  • hakmem_prof.h/.c - sampling profiler (categories, average ns).
  • hakmem_ace_stats.h/.c - L1 mid/large hit/miss + L1 fallback counters (sampling).

Learning Targets (4 axes)

SACS3's "smart cache" optimizes along the following four axes:

  • Threshold (mmap / L1↔L2 switch): to be reflected into FrozenPolicy.thp_threshold in the future
  • Number of containers (size-class count): how many Mid/Large classes exist (variable slots introduced gradually)
  • Shape of containers (size boundaries, granularity, W_MAX): e.g. w_max_mid/large
  • Amount per container (CAP/inventory): per-class CAP (pages/bundles) → refill intensity controlled by Soft CAP (implemented)

Runtime Controls (environment variables)

  • Learner: HAKMEM_LEARN=1

    • Window length: HAKMEM_LEARN_WINDOW_MS (default 1000)
    • Target hit rates: HAKMEM_TARGET_HIT_MID (0.65), HAKMEM_TARGET_HIT_LARGE (0.55)
    • Steps: HAKMEM_CAP_STEP_MID (4), HAKMEM_CAP_STEP_LARGE (1)
    • Budget constraints: HAKMEM_BUDGET_MID, HAKMEM_BUDGET_LARGE (0 = disabled)
    • Minimum samples per window: HAKMEM_LEARN_MIN_SAMPLES (256)
  • Manual CAP override: HAKMEM_CAP_MID=a,b,c,d,e, HAKMEM_CAP_LARGE=a,b,c,d,e

  • Round-up tolerance: HAKMEM_WMAX_MID, HAKMEM_WMAX_LARGE

  • Mid free A/B: HAKMEM_POOL_TLS_FREE=0/1 (default 1)

Future additions (experimental):

  • Allow L1 inside wrappers: HAKMEM_WRAP_L2=1, HAKMEM_WRAP_L25=1
  • Variable Mid class slot (manual): HAKMEM_MID_DYN1=<bytes>

Inline/Hot Path Policy

  • The hot path is "immediate size decision + O(1) table lookup + minimal branching".
  • System calls such as clock_gettime() are banned on the hot path (they run on the sampling/learning thread side).
  • static inline + LUT makes class determination O(1) (see hakmem_pool.c / hakmem_l25_pool.c, and the sketch below).
  • FrozenPolicy loads its RCU snapshot once at function entry and only reads it afterwards.
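
A sketch of the LUT pattern (table contents are illustrative, derived from the 2/4/8/16/32KiB Mid classes with the 32–64KiB gap rounded up; the real tables live in hakmem_pool.c):

#include <stddef.h>

/* Index = size in 2KiB steps (rounded up); value = Mid class.
   Classes: 0=2K 1=4K 2=8K 3=16K 4=32K 5=64K (gap absorbed). */
static const unsigned char g_mid_class_lut[33] = {
    0,0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
    5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};

static inline unsigned mid_class_for(size_t size) {   /* size <= 64KiB */
    return g_mid_class_lut[(size + 2047) >> 11];      /* one shift + one load */
}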

Soft CAP (implemented) and Learner (implemented)

  • Mid/L2.5 refill consults the FrozenPolicy CAP and adjusts the number of refill bundles (sketched after this list).
    • At or over CAP: bundles = 1
    • Under CAP: 1–4 depending on the deficit (floor of 2 when the deficit is large)
  • Empty shard & excess CAP: steal from nearby shards with 1–2 probes (Mid/L2.5).
  • The learner evaluates hit rates per window on a separate thread, adjusts CAP by ±Δ (with hysteresis/budget constraints), and publishes via hkm_policy_publish().
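
The Soft CAP refill decision in C (scaling thresholds are illustrative; only the 1 / 1–4 / floor-of-2 shape comes from the description above):

/* Bundles to fetch on refill, given current inventory vs the policy CAP. */
static inline unsigned refill_bundles(unsigned inventory, unsigned cap) {
    if (inventory >= cap)
        return 1;                        /* at or over CAP: minimal refill */
    unsigned deficit = cap - inventory;
    unsigned bundles = 1 + deficit / 8;  /* grow with the deficit (illustrative) */
    if (bundles > 4) bundles = 4;        /* clamp to the 1..4 range */
    if (deficit > cap / 2 && bundles < 2)
        bundles = 2;                     /* large deficit: floor of 2 */
    return bundles;
}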

Staged Introduction (proposal)

  1. Introduce one variable Mid class slot (e.g. 1–4KB) and optimize its boundary to match the distribution peak.
  2. Optimize W_MAX over discrete candidates with a bandit + CANARY.
  3. Learn the mmap threshold (L1↔L2) with a bandit/ELO and reflect it into thp_threshold.
  4. Two variable slots → automatic optimization of class count/boundaries (heavy background computation).


What's Implemented

Phase 1-5 (Foundation):

  • Call-site capture (HAK_CALLSITE() macro)
  • Zero-friction API (hak_alloc_cs() / hak_free_cs())
  • Simple hash table (256 slots, linear probing)
  • Basic profiling (count, size, avg, max)
  • Policy-based optimization (malloc vs mmap)
  • UCB1 bandit evolution
  • KPI measurement (P50/P95/P99, page faults, RSS)
  • A/B testing (baseline vs evolving)
  • Benchmark framework (jemalloc/mimalloc comparison)

Phase 6.1-6.4 (ELO System):

  • ELO rating system (6 strategies with win/loss/draw)
  • Softmax selection (temperature-based exploration)
  • BigCache tier-2 (size-class caching for large allocations)
  • Batch madvise (MADV_DONTNEED syscall optimization)

Phase 6.5 (Learning Lifecycle):

  • 3-state machine (LEARN → FROZEN → CANARY)
  • P² algorithm (O(1) p99 estimation)
  • Size-class distribution signature (L1 distance)
  • Environment variable configuration
  • Zero-overhead FROZEN mode (confirmed best policy)
  • CANARY mode (5% trial sampling)
  • Convergence detection & workload shift detection

What's NOT Implemented (Future)

  • Multi-threaded support (single-threaded PoC)
  • Advanced mmap strategies (MADV_HUGEPAGE, etc.)
  • Redis/Nginx real-world benchmarks
  • Confusion Matrix for auto-inference accuracy

📈 Implementation Progress

Phase          Feature                                    Status     Date
Phase 1        Call-site profiling                        Complete   2025-10-21 AM
Phase 2        Policy optimization (malloc/mmap)          Complete   2025-10-21 PM
Phase 3        UCB1 bandit evolution                      Complete   2025-10-21 Eve
Phase 4        A/B testing                                Complete   2025-10-21 Eve
Phase 5        jemalloc/mimalloc comparison               Complete   2025-10-21 Night
Phase 6.1-6.4  ELO rating system integration              Complete   2025-10-21
Phase 6.5      Learning lifecycle (LEARN→FROZEN→CANARY)   Complete   2025-10-21
Phase 7        Redis/Nginx real-world benchmarks          📋 Next    TBD

💡 Key Insights from PoC

  1. Call-site works as identity: Different hak_alloc_cs() calls → different addresses
  2. Zero overhead abstraction: Macro expands to __builtin_return_address(0)
  3. Profiling overhead is acceptable: +7.8% on JSON (64KB), competitive on MIR (+29.6%)
  4. Hash table is fast: Simple power-of-2 hash, <8 probes
  5. Learning phase works: First 9 allocations gather data, 10th triggers optimization
  6. UCB1 evolution improves performance: hakmem-evolving +71% vs hakmem-baseline (12 vs 7 points)
  7. Page faults matter critically: 769× difference (1,538 vs 2) on direct mmap without caching
  8. Memory reuse is essential: System malloc's free-list enables 3.1× speedup on large allocations
  9. Per-site caching is the missing piece: Clear path to competitive performance (1st place)

📝 Connection to Paper

This PoC implements:

  • Section 3.6.2: Call-site Profiling API
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (O(1) lookup, <300ns overhead)
  • Section 5: Evaluation Framework (A/B test + benchmarking)

Paper Sections Proven:

  • Section 3.6.2: Call-site Profiling
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (<50ns overhead)
  • Section 5: Evaluation Framework (A/B test + jemalloc/mimalloc comparison) 🔄

🧪 Verification Checklist

Run the test and check:

  • 3 distinct call-sites detected
  • Allocation counts match (1000/100/10)
  • Average sizes are correct (64KB/256KB/2MB)
  • No crashes or memory leaks
  • Policy inference works (SMALL_FREQUENT/MEDIUM/LARGE_INFREQUENT)
  • Optimization strategies applied (malloc vs mmap)
  • Learning phase demonstrated (9 malloc + 1 mmap for large allocs)
  • A/B testing works (baseline vs evolving modes)
  • Benchmark framework functional
  • Full benchmark results collected (1000 runs, 5 allocators)

If all checks pass → Core concept AND optimization proven! 🎉


🎊 Summary

What We've Proven:

  1. Call-site = implicit purpose label
  2. Automatic policy inference (rule-based → UCB1 → ELO)
  3. ELO evolution with adaptive learning
  4. Call-site profiling overhead is acceptable (+7.8% on JSON)
  5. Competitive 3rd place ranking among 5 allocators
  6. KPI measurement (P50/P95/P99, page faults, RSS)
  7. A/B testing (baseline vs evolving)
  8. Honest comparison vs jemalloc/mimalloc (1000 benchmark runs)
  9. Production-ready lifecycle: LEARN → FROZEN → CANARY
  10. Zero-overhead frozen mode: Confirmed best policy after convergence
  11. P² percentile estimation: O(1) memory p99 tracking
  12. Workload shift detection: L1 distribution distance
  13. 🔍 Critical discovery: Page faults issue (769× difference) → malloc-based approach
  14. 📋 Clear path forward: Redis/Nginx real-world benchmarks

Code Size:

  • Phase 1-5 (UCB1 + Benchmarking): ~1625 lines
  • Phase 6.1-6.4 (ELO System): ~780 lines
  • Phase 6.5 (Learning Lifecycle): ~1340 lines
  • Total: ~3745 lines for complete production-ready allocator!

Paper Sections Proven:

  • Section 3.6.2: Call-site Profiling
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (+7.8% overhead on JSON)
  • Section 5: Evaluation Framework (5 allocators, 1000 runs, honest comparison)
  • Gemini S+ requirement met: jemalloc/mimalloc comparison

Status: ACE Learning Layer Planning + Mid MT Complete 🎯
Date: 2025-11-01

Latest Updates (2025-11-01)

  • Mid MT Complete: 110M ops/sec achieved (100-101% of mimalloc)
  • Repository Reorganized: Benchmarks/tests consolidated, root cleaned (72% reduction)
  • 🎯 ACE Learning Layer: Documentation complete, ready for Phase 1 implementation
    • Target: Fix fragmentation (2.6-5.2x), large WS (1.4-2.0x), realloc (1.3-2.0x)
    • Approach: Dual-loop adaptive control + UCB1 learning
    • See docs/ACE_LEARNING_LAYER.md for details

⚠️ Critical Update (2025-10-22): Thread Safety Issue Discovered

Problem: hakmem is completely thread-unsafe (no pthread_mutex anywhere)

  • 1-thread: 15.1M ops/sec (normal)
  • 4-thread: 3.3M ops/sec (-78% collapse: race condition)

Phase 6.14 Clarification:

  • Registry ON/OFF toggle implementation (Pattern 2)
  • O(N) Sequential proven 2.9-13.7x faster than O(1) Hash for Small-N
  • Default: g_use_registry = 0 (O(N), L1 cache hit 95%+)
  • Reported 67.9M ops/sec at 4-thread: NOT REPRODUCIBLE (measurement error)

Phase 6.15 Plan (12-13 hours, 6 days):

  1. Step 1 (1h): Documentation updates
  2. Step 2 (2-3h): P0 Safety Lock (pthread_mutex global lock) → 4T = 13-15M ops/sec
  3. Step 3 (8-10h): TLS implementation (Tiny/L2/L2.5 Pool TLS) → 4T = 15-22M ops/sec

Validation: Phase 6.13 already proved TLS works (15.9M ops/sec at 4T, +381%)

Details: See PHASE_6.15_PLAN.md, PHASE_6.15_SUMMARY.md, THREAD_SAFETY_SOLUTION.md


Previous Status: Phase 6.5 Complete - Production-Ready Learning Lifecycle! 🎉
Previous Date: 2025-10-21

Timeline:

  • 2025-10-21 AM: Phase 1 - Call-site profiling PoC
  • 2025-10-21 PM: Phase 2 - Policy-based optimization (malloc/mmap)
  • 2025-10-21 Evening: Phase 3-4 - UCB1 bandit + A/B testing
  • 2025-10-21 Night: Phase 5 - Benchmark infrastructure (1000 runs, 🥉 3rd place!)
  • 2025-10-21 Late Night: Phase 6.1-6.4 - ELO rating system integration
  • 2025-10-21 Night: Phase 6.5 - Learning lifecycle complete (6/6 tests passing)

Phase 6.5 Achievement:

  • 3-state machine: LEARN → FROZEN → CANARY
  • Zero-overhead FROZEN mode: 10-20× faster than LEARN mode
  • P² p99 estimation: O(1) memory percentile tracking
  • Distribution shift detection: L1 distance for workload changes
  • Environment variable config: Full control over freeze/convergence/canary settings
  • Production ready: All lifecycle transitions verified

Key Results:

  • VM scenario ranking: 🥈 2nd place (+1.9% gap to 1st!)
  • Phase 5 (UCB1): 🥉 3rd place (12 points) among 5 allocators
  • Phase 6.4 (ELO+BigCache): 🥈 2nd place, nearly tied with mimalloc
  • Call-site profiling overhead: +7.8% (acceptable)
  • FROZEN mode overhead: Zero (confirmed best policy, no ELO updates)
  • Convergence time: ~180 seconds (configurable via HAKMEM_FREEZE_SEC)
  • CANARY sampling: 5% trial (configurable via HAKMEM_CANARY_FRAC)

Next Steps:

  1. Phase 1-5 complete (UCB1 + benchmarking)
  2. Phase 6.1-6.4 complete (ELO system)
  3. Phase 6.5 complete (learning lifecycle)
  4. 🔧 Phase 6.6: Fix Batch madvise (0 blocks batched) → 1st place target 🏆
  5. 📋 Phase 7: Redis/Nginx real-world benchmarks
  6. 📝 Paper writeup (see PAPER_SUMMARY.md)

Related Documentation:
