
hakmem PoC - Call-site Profiling + UCB1 Evolution

Detailed documentation entry point: docs/INDEX.md (links by category) / Reorganization plan: docs/DOCS_REORG_PLAN.md

Purpose: Proof-of-Concept for the core ideas from the paper:

  1. "Call-site address is an implicit purpose label - same location → same pattern"
  2. "UCB1 bandit learns optimal allocation policies automatically"

🎯 Current Status (2025-11-01)

Mid-Range Multi-Threaded Complete (110M ops/sec)

  • Achievement: 110M ops/sec on mid-range MT workload (8-32KB)
  • Comparison: 100-101% of mimalloc, 2.12x faster than glibc
  • Implementation: core/hakmem_mid_mt.{c,h}
  • Benchmarks: benchmarks/scripts/mid/ (run_mid_mt_bench.sh, compare_mid_mt_allocators.sh)
  • Report: MID_MT_COMPLETION_REPORT.md

Repository Reorganization Complete

  • New Structure: All benchmarks under benchmarks/, tests under tests/
  • Root Directory: 252 → 70 items (72% reduction)
  • Organization:
    • benchmarks/src/{tiny,mid,comprehensive,stress}/ - Benchmark sources
    • benchmarks/scripts/{tiny,mid,comprehensive,utils}/ - Scripts organized by category
    • benchmarks/results/ - All benchmark results (871+ files)
    • tests/{unit,integration,stress}/ - Tests by type
  • Details: FOLDER_REORGANIZATION_2025_11_01.md

ACE Learning Layer Phase 1 Complete (ACE = Agentic Context Engineering / Adaptive Control Engine)

  • Status: Phase 1 Infrastructure COMPLETE (2025-11-01)
  • Goal: Fix weak workloads with adaptive learning
    • Fragmentation stress: 3.87 → 10-20 M ops/s (2.6-5.2x target)
    • Large working set: 22.15 → 30-45 M ops/s (1.4-2.0x target)
    • realloc: 277ns → 140-210ns (1.3-2.0x target)
  • Phase 1 Deliverables (100% complete):
    • Metrics collection infrastructure (hakmem_ace_metrics.{c,h})
    • UCB1 learning algorithm (hakmem_ace_ucb1.{c,h})
    • Dual-loop controller (hakmem_ace_controller.{c,h})
    • Dynamic TLS capacity adjustment
    • Hot-path metrics integration (alloc/free tracking)
    • A/B benchmark script (scripts/bench_ace_ab.sh)
  • Documentation:
    • User guide: docs/ACE_LEARNING_LAYER.md
    • Implementation plan: docs/ACE_LEARNING_LAYER_PLAN.md
    • Progress report: ACE_PHASE1_PROGRESS.md
  • Usage: HAKMEM_ACE_ENABLED=1 ./your_benchmark
  • Next: Phase 2 - Extended benchmarking + learning convergence validation

📂 Quick Navigation

  • Build & Run: See "Quick Start" section below
  • Benchmarks: benchmarks/scripts/ organized by category
  • Documentation: DOCS_INDEX.md - Central documentation hub
  • Current Work: CURRENT_TASK.md

🧪 Larson Quick Run (Tiny + Superslab, mainline)

Use the defaults wrapper so critical env vars are always set:

  • Throughput-oriented (2s, threads=1,4): scripts/run_larson_defaults.sh
  • Lower page-fault/sys (10s, threads=4): scripts/run_larson_defaults.sh pf 10 4
  • Claude-friendly presets (envs pre-wired for reproducible debug): scripts/run_larson_claude.sh [tput|pf|repro|fast0|guard|debug] 2 4
    • For Claude Code runs with log capture, use scripts/claude_code_debug.sh.

The mainline now defaults to "no segfaults". These defaults assume the publish→mail→adopt pipeline is active:

  • Tiny/Superslab gates: HAKMEM_TINY_USE_SUPERSLAB=1 (default ON), HAKMEM_TINY_MUST_ADOPT=1, HAKMEM_TINY_SS_ADOPT=1
  • Fast-tier spill to create publish: HAKMEM_TINY_FAST_CAP=64, HAKMEM_TINY_FAST_SPARE_PERIOD=8
  • TLS list: HAKMEM_TINY_TLS_LIST=1
  • Mailbox discovery: HAKMEM_TINY_MAILBOX_SLOWDISC=1, HAKMEM_TINY_MAILBOX_SLOWDISC_PERIOD=256
  • Superslab sizing/cache/precharge: per mode (tput vs pf)

Debugging tips:

  • Add HAKMEM_TINY_RF_TRACE=1 for one-shot publish/mail traces.
  • Use scripts/run_larson_claude.sh debug 2 4 to enable TRACE_RING and emit early SIGUSR2 so the Tiny ring is dumped before crashes.

SLL-first Fast Path (Box 5)

  • Hot path favors the TLS SLL (per-thread freelist) first; on miss, falls back to HotMag/TLS list, then SuperSlab (see the sketch after this list).
  • Learning shifts to SLL via sll_cap_for_class() with per-class override/multiplier (small classes 0..3).
  • Ownership → remote drain → bind is centralized via SlabHandle (Box 3→2) for safety and determinism.
  • A/B knobs:
    • HAKMEM_TINY_TLS_SLL=0/1 (default 1)
    • HAKMEM_SLL_MULTIPLIER=N and HAKMEM_TINY_SLL_CAP_C{0..7}
    • HAKMEM_TINY_TLS_LIST=0/1
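
A minimal sketch of the SLL-first pop, with hypothetical names (tls_sll, slow_path_alloc); the real per-class state and the HotMag/TLS-list/SuperSlab fallback chain live in the Tiny Pool code:

#include <stdlib.h>

typedef struct free_node { struct free_node *next; } free_node_t;
static __thread free_node_t *tls_sll[8];        /* one freelist per small class */

/* Stub for the miss path (HotMag / TLS list / SuperSlab in the real allocator). */
static void *slow_path_alloc(int cls) { (void)cls; return malloc(64); }

static inline void *tiny_alloc_fast(int cls) {
    free_node_t *n = tls_sll[cls];
    if (n) {                                    /* hit: one TLS load + one store */
        tls_sll[cls] = n->next;
        return n;
    }
    return slow_path_alloc(cls);                /* miss: fall back */
}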

P0 batch refill is now compile-time only; runtime P0 env toggles were removed.

Benchmark Matrix

  • Quick matrix to compare mid-layers vs SLL-first:
    • scripts/bench_matrix.sh 30 8 (duration=30s, threads=8)
  • Single run (throughput):
    • HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 scripts/run_larson_claude.sh tput 30 8
  • Force-notify path (A/B) with HAKMEM_TINY_RF_FORCE_NOTIFY=1 to surface missing first-notify cases.

Build Modes (Box Refactor)

  • Default (mainline): the Box Theory refactor (Phase 61.7) and the Superslab path are always ON
    • Compile flag: -DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1 (Makefile default)
    • Runtime default: g_use_superslab=1 (ON unless explicitly set to 0 via environment variable)
    • A/B against the legacy path: make BOX_REFACTOR_DEFAULT=0 larson_hakmem

🚨 Segfault-free Policy (Hard Requirement)

  • The mainline is designed and implemented with "no segfaults" as the top priority.
  • Run changes through the following guards before adopting them:
    • Guard run: ./scripts/larson.sh guard 2 4 (Trace Ring + Safe Free)
    • ASan/UBSan/TSan: ./scripts/larson.sh asan 2 4 / ubsan / tsan
    • Fail-fast environment: HAKMEM_TINY_RF_TRACE=0 etc.; follow the safety procedure in LARSON_GUIDE.md
    • Confirm that no remote_invalid / SENTINEL_TRAP appears at the tail of the ring

New A/B Observation and Control

  • Registry window: HAKMEM_TINY_REG_SCAN_MAX (default 256)
    • Caps the scan over the small registry window (for A/B of search cost vs adopt hit rate)
  • Simplified Mid refill: HAKMEM_TINY_MID_REFILL_SIMPLE=1 (skips the multi-stage search for class >= 4)
    • For throughput-focused A/B (reduces adopt/search). Check page faults/RSS before regular use.

Mimalloc vs HAKMEM (Larson quick A/B)

  • Recommended HAKMEM env (Tiny Hot, SLL-only, fast tier on):
HAKMEM_TINY_REFILL_COUNT_HOT=64 \
HAKMEM_TINY_FAST_CAP=16 \
HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 \
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 \
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 \
./larson_hakmem 2 8 128 1024 1 12345 4
  • One-shot refill path confirmation (noisy print just once):
HAKMEM_TINY_REFILL_OPT_DEBUG=1 <above_env> ./larson_hakmem 2 8 128 1024 1 12345 4
  • Mimalloc (direct link binary):
LD_LIBRARY_PATH=$PWD/mimalloc-bench/extern/mi/out/release ./larson_mi 2 8 128 1024 1 12345 4
  • Perf (selected counters):
perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,\
  L1-dcache-loads,L1-dcache-load-misses -- \
  env <above_env> ./larson_hakmem 5 8 128 1024 1 12345 4

🎯 What This Proves

Phase 1: Call-site Profiling (DONE)

  1. Call-site capture works: __builtin_return_address(0) uniquely identifies allocation sites
  2. Different sites have different patterns: JSON (small, frequent) vs MIR (medium) vs VM (large)
  3. Profiling is lightweight: Simple hash table + sampling
  4. Zero user burden: Just replace malloc with hak_alloc_cs

Phase 2-4: UCB1 Evolution + A/B Testing (DONE)

  1. KPI measurement: P50/P95/P99 latency, Page Faults, RSS delta
  2. Discrete policy steps: 6 levels (64KB → 2MB)
  3. UCB1 bandit: Exploration + Exploitation balance
  4. Safety mechanisms:
    • ±1 step exploration (safe)
    • Hysteresis (8% improvement × 3 consecutive)
    • Cooldown (180 seconds)
  5. A/B testing: baseline vs evolving modes
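
A sketch of the textbook UCB1 rule over the 6 discrete steps (the reward shaping, ±1-step exploration window, hysteresis, and cooldown in hakmem_ucb1.c are not shown):

#include <math.h>

#define N_STEPS 6                       /* 64KB ... 2MB threshold steps */

typedef struct { double reward_sum; long pulls; } arm_t;

static int ucb1_select(const arm_t arms[N_STEPS], long total_pulls) {
    int best = 0;
    double best_score = -1e300;
    for (int i = 0; i < N_STEPS; i++) {
        if (arms[i].pulls == 0) return i;       /* explore untried arms first */
        double mean  = arms[i].reward_sum / arms[i].pulls;
        double bonus = sqrt(2.0 * log((double)total_pulls) / arms[i].pulls);
        if (mean + bonus > best_score) { best_score = mean + bonus; best = i; }
    }
    return best;                                /* exploitation + exploration balance */
}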

Phase 5: Benchmarking Infrastructure (COMPLETE)

  1. Allocator comparison framework: hakmem vs jemalloc/mimalloc/system malloc
  2. Fair benchmarking: Same workload, 50 runs per config, 1000 total runs
  3. KPI measurement: Latency (P50/P95/P99), page faults, RSS, throughput
  4. Paper-ready output: CSV format for graphs/tables
  5. Initial ranking (UCB1): 🥉 3rd place among 5 allocators

This proves Sections 3.6-3.7 of the paper. See PAPER_SUMMARY.md for detailed results.

Phase 6.1-6.4: ELO Rating System (COMPLETE)

  1. Strategy diversity: 6 threshold levels (64KB, 128KB, 256KB, 512KB, 1MB, 2MB)
  2. ELO rating: Each strategy has rating, learns from win/loss/draw
  3. Softmax selection: Probability ∝ exp(rating/temperature)
  4. BigCache optimization: Tier-2 size-class caching for large allocations
  5. Batch madvise: MADV_DONTNEED batching for reduced syscall overhead
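
A sketch of the softmax pick in item 3, assuming ratings are plain doubles (the real strategy table lives in hakmem_elo.c); the max-shift keeps exp() from overflowing:

#include <math.h>
#include <stdlib.h>

#define N_STRATS 6                              /* 64KB..2MB threshold strategies */

static int elo_softmax_pick(const double rating[N_STRATS], double temperature) {
    double maxr = rating[0];
    for (int i = 1; i < N_STRATS; i++) if (rating[i] > maxr) maxr = rating[i];
    double w[N_STRATS], sum = 0.0;
    for (int i = 0; i < N_STRATS; i++) {        /* P(i) ∝ exp(rating/temperature) */
        w[i] = exp((rating[i] - maxr) / temperature);
        sum += w[i];
    }
    double r = ((double)rand() / RAND_MAX) * sum;   /* PoC-grade RNG */
    for (int i = 0; i < N_STRATS; i++) {
        r -= w[i];
        if (r <= 0.0) return i;
    }
    return N_STRATS - 1;                        /* guard against rounding */
}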

🏆 VM Scenario Benchmark Results (iterations=100):

🥇 mimalloc         15,822 ns  (baseline)
🥈 hakmem-evolving  16,125 ns  (+1.9%)  ← BigCache effect
🥉 system           16,814 ns  (+6.3%)
4th jemalloc        17,575 ns  (+11.1%)

Key achievement: 1.9% gap to 1st place (down from -50% in Phase 5!)

See PHASE_6.2_ELO_IMPLEMENTATION.md for details.

Phase 6.5: Learning Lifecycle (COMPLETE)

  1. 3-state machine: LEARN → FROZEN → CANARY
    • LEARN: Active learning with ELO updates
    • FROZEN: Zero-overhead production mode (confirmed best policy)
    • CANARY: Safe 5% trial sampling to detect workload changes
  2. Convergence detection: P² algorithm for O(1) p99 estimation
  3. Distribution signature: L1 distance for workload shift detection
  4. Environment variables: Fully configurable (freeze time, window size, etc.)
  5. Production ready: 6/6 tests passing, LEARN→FROZEN transition verified

Key feature: Learning converges in ~180 seconds, then runs at zero overhead in FROZEN mode!

See PHASE_6.5_LEARNING_LIFECYCLE.md for complete documentation.
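
The transition structure as a minimal sketch (the predicates are stand-ins for the real P² convergence, canary-sampling, and distribution-shift detectors):

#include <stdbool.h>

typedef enum { HKM_LEARN, HKM_FROZEN, HKM_CANARY } hkm_state_t;

static bool converged(void)      { return false; }  /* P² p99 stable across windows */
static bool canary_window(void)  { return false; }  /* 5% trial sampling slot */
static bool shift_detected(void) { return false; }  /* L1 distribution distance */

static hkm_state_t lifecycle_step(hkm_state_t s) {
    switch (s) {
    case HKM_LEARN:  return converged()      ? HKM_FROZEN : HKM_LEARN;
    case HKM_FROZEN: return canary_window()  ? HKM_CANARY : HKM_FROZEN;
    case HKM_CANARY: return shift_detected() ? HKM_LEARN  : HKM_FROZEN;
    }
    return s;
}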

Phase 6.6: ELO Control Flow Fix (COMPLETE)

Problem: After Phase 6.5 integration, batch madvise stopped activating.
Root Cause: ELO strategy selection happened AFTER allocation, so its result was ignored.
Fix: Reordered hak_alloc_at() to use the ELO threshold BEFORE allocation.

Diagnosis by: Gemini Pro (2025-10-21)
Fixed by: Claude (2025-10-21)

Key insight:

  • OLD: allocate_with_policy(POLICY_DEFAULT) → malloc → ELO selection (too late!)
  • NEW: ELO selection → size >= threshold ? mmap : malloc

Result: 2MB allocations now correctly use mmap, enabling batch madvise optimization.
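
The NEW ordering as a minimal sketch (elo_select_threshold is a hypothetical stand-in for the strategy pick):

#include <stdlib.h>
#include <sys/mman.h>

static size_t elo_select_threshold(void) { return 256 * 1024; }  /* stub pick */

static void *alloc_with_elo(size_t size) {
    size_t threshold = elo_select_threshold();  /* ELO consulted BEFORE allocating */
    if (size >= threshold) {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;    /* mmap path → batch madvise works */
    }
    return malloc(size);
}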

See PHASE_6.6_ELO_CONTROL_FLOW_FIX.md for detailed analysis.

Phase 6.7: Overhead Analysis (COMPLETE)

Goal: Identify why hakmem is 2× slower than mimalloc despite identical syscall counts

Key Findings:

  1. Syscall overhead is NOT the bottleneck
    • hakmem: 292 mmap, 206 madvise (same as mimalloc)
    • Batch madvise working correctly
  2. The gap is structural, not algorithmic
    • mimalloc: Pool-based allocation (9ns fast path)
    • hakmem: Hash-based caching (31ns fast path)
    • 3.4× fast path difference explains 2× total gap
  3. hakmem's "smart features" have < 1% overhead
    • ELO: ~100-200ns (0.5%)
    • BigCache: ~50-100ns (0.3%)
    • Total: ~350ns out of 17,638ns gap (2%)

Recommendation: Accept the gap for research prototype OR implement hybrid pool fast-path (ChatGPT Pro proposal)

Deliverables:

Phase 6.8: Configuration Cleanup (COMPLETE)

Goal: Simplify complex environment variables into 5 preset modes + implement feature flags

Critical Bug Fixed: Task Agent investigation revealed complete design vs implementation gap:

  • Design: "Check g_hakmem_config flags before enabling features"
  • Implementation: Features ran unconditionally (never checked!)
  • Impact: "MINIMAL mode" measured 14,959 ns but was actually BALANCED (all features ON)

Solution Implemented: Mode-based configuration + Feature-gated initialization

# Simple preset modes
export HAKMEM_MODE=minimal    # Baseline (all features OFF)
export HAKMEM_MODE=fast       # Production (pool fast-path + FROZEN)
export HAKMEM_MODE=balanced   # Default (BigCache + ELO FROZEN + Batch)
export HAKMEM_MODE=learning   # Development (ELO LEARN + adaptive)
export HAKMEM_MODE=research   # Debug (all features + verbose logging)
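
A sketch of what "feature-gated" means here, with hypothetical flag names (the real set lives in hakmem_features.h / hakmem_config.c): parse HAKMEM_MODE once, then every feature consults its flag instead of running unconditionally:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

typedef struct { bool bigcache, batch_madvise, elo_learn; } hkm_config_t;
static hkm_config_t g_cfg;

static void hkm_config_init(void) {
    const char *mode = getenv("HAKMEM_MODE");
    if (!mode) mode = "balanced";
    if (strcmp(mode, "minimal") == 0) {
        memset(&g_cfg, 0, sizeof g_cfg);              /* true baseline: all OFF */
    } else if (strcmp(mode, "learning") == 0) {
        g_cfg = (hkm_config_t){ true, true, true };   /* ELO in LEARN */
    } else {                                          /* balanced and friends */
        g_cfg = (hkm_config_t){ true, true, false };  /* ELO FROZEN */
    }
}

/* The Phase 6.8 bug class: this check existed in design but was never consulted. */
static inline bool use_bigcache(void) { return g_cfg.bigcache; }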

🎯 Benchmark Results - PROOF OF SUCCESS!

Test: VM scenario (2MB allocations, 100 iterations)

MINIMAL mode:  216,173 ns  (all features OFF - true baseline)
BALANCED mode:  15,487 ns  (BigCache + ELO ON)
→ 13.95x speedup from optimizations! 🚀

Feature Matrix (Now Actually Enforced!):

Feature            MINIMAL   FAST     BALANCED   LEARNING   RESEARCH
ELO learning       OFF       FROZEN   FROZEN     LEARN      LEARN
BigCache           OFF       OFF      ON         ON         ON
Batch madvise      OFF       OFF      ON         ON         ON
TinyPool (future)  OFF       ON       OFF        OFF        ON
Debug logging      OFF       OFF      OFF        OFF        ⚠️ ON

Code Quality Improvements:

  • hakmem.c: 899 → 600 lines (-33% reduction)
  • New infrastructure: hakmem_features.h, hakmem_config.c/h, hakmem_internal.h (692 lines)
  • Static inline helpers: Zero-cost abstraction (100% inlined with -O2)
  • Feature flags: Runtime checks with < 0.1% overhead

Benefits Delivered:

  • Easy to use (HAKMEM_MODE=balanced)
  • Clear benchmarking (14x performance difference proven!)
  • Backward compatible (individual env vars still work)
  • Paper-friendly (quantified feature impact)

See PHASE_6.8_PROGRESS.md for complete implementation details.


🚀 Quick Start

🎯 Choose Your Mode (Phase 6.8+)

New: hakmem now supports 5 simple preset modes!

# 1. MINIMAL - Baseline (all optimizations OFF)
export HAKMEM_MODE=minimal
./bench_allocators --allocator hakmem-evolving --scenario vm

# 2. BALANCED - Default recommended (BigCache + ELO FROZEN + Batch)
export HAKMEM_MODE=balanced  # or omit (default)
./bench_allocators --allocator hakmem-evolving --scenario vm

# 3. LEARNING - Development (ELO learns, adapts to workload)
export HAKMEM_MODE=learning
./test_hakmem

# 4. FAST - Production (future: pool fast-path + FROZEN)
export HAKMEM_MODE=fast
./bench_allocators --allocator hakmem-evolving --scenario vm

# 5. RESEARCH - Debug (all features + verbose logging)
export HAKMEM_MODE=research
./test_hakmem

Quick reference:

  • Just want it to work? → Use balanced (default)
  • Benchmarking baseline? → Use minimal
  • Development/testing? → Use learning
  • Production deployment? → Use fast (after Phase 7)
  • Debugging issues? → Use research

📖 Legacy Usage (Phase 1-6.7)

# Build
make

# Run basic test
make run

# Run A/B test (baseline mode)
./test_hakmem

# Run A/B test (evolving mode - UCB1 enabled)
env HAKMEM_MODE=evolving ./test_hakmem

# Override individual settings (backward compatible)
export HAKMEM_MODE=balanced
export HAKMEM_THP=off  # Override THP policy
./bench_allocators --allocator hakmem-evolving --scenario vm

⚙️ Useful Environment Variables

Tiny publish/adopt pipeline

# Enable SuperSlab (required for publish/adopt)
export HAKMEM_TINY_USE_SUPERSLAB=1
# Optional: must-adopt-before-mmap (one-pass adopt before mmap)
export HAKMEM_TINY_MUST_ADOPT=1
  • HAKMEM_TINY_USE_SUPERSLAB=1

    • publish→mailbox→adopt works only when the SuperSlab path is ON (with it OFF, the pipeline does not run).
    • Recommended ON by default for benchmarks (you can also A/B with it OFF to compare a memory-efficiency-first setup).
  • HAKMEM_SAFE_FREE=1

    • Adds a best-effort mincore() guard before reading headers on free().
    • Safer with LD_PRELOAD at the cost of extra overhead. Default: off.
  • HAKMEM_WRAP_TINY=1

    • Allows Tiny Pool allocations during malloc/free wrappers (LD_PRELOAD).
    • Wrapper-context uses a magazine-only fast path (no locks/refill) for safety.
    • Default: off for stability. Enable to test Tiny impact on small-object workloads.
  • HAKMEM_TINY_MAG_CAP=INT

    • Upper bound for Tiny TLS magazine per class (soft). Default: build limit (2048); recommended 1024 for BURST.
  • HAKMEM_SITE_RULES=1

    • Enables Site Rules. Note: tier selection no longer uses Site Rules (SACS-3); only layer-internal future hints.
  • HAKMEM_PROF=1, HAKMEM_PROF_SAMPLE=N

    • Enables the lightweight sampling profiler. N is an exponent: sample every 2^N calls (default 12). Outputs per-category avg ns. (See the sketch after this list.)
  • HAKMEM_ACE_SAMPLE=N

    • ACE layer (L1) stats sampling for mid/large hit/miss and L1 fallback. Default off.
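
For HAKMEM_PROF_SAMPLE, the hot-path cost is one thread-local counter increment and a mask; a minimal sketch (names hypothetical), as referenced from the HAKMEM_PROF entry above:

#include <stdint.h>

static __thread uint64_t prof_calls;

/* Sample every 2^n_exp calls; clock_gettime() runs only on sampled calls. */
static inline int prof_should_sample(unsigned n_exp) {
    return (++prof_calls & (((uint64_t)1 << n_exp) - 1)) == 0;
}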

🧪 Larson Runner (Reproducible)

Use the provided runner to compare system/mimalloc/hakmem under identical settings.

scripts/run_larson.sh [options] [runtime_sec] [threads_csv]

Options:
  -d SECONDS     Runtime seconds (default: 10)
  -t CSV         Threads CSV, e.g. 1,4 (default: 1,4)
  -c NUM         Chunks per thread (default: 10000)
  -r NUM         Rounds (default: 1)
  -m BYTES       Min size (default: 8)
  -M BYTES       Max size (default: 1024)
  -s SEED        Random seed (default: 12345)
  -p PRESET      Preset: burst|loop (sets -c/-r)

Presets:
  burst → chunks/thread=10000, rounds=1   # harsh (many objects held at once)
  loop  → chunks/thread=100,   rounds=100 # lenient (high locality)

Examples:
  scripts/run_larson.sh -d 10 -t 1,4            # burst (default)
  scripts/run_larson.sh -d 10 -t 1,4 -p loop    # 100×100 loop

Performance-oriented env (recommended when comparing hakmem):

HAKMEM_DISABLE_BATCH=0
HAKMEM_TINY_META_ALLOC=0
HAKMEM_TINY_META_FREE=0
HAKMEM_TINY_SS_ADOPT=1
bash scripts/run_larson.sh -d 10 -t 1,4


Counters dump (refill/publish visibility):

Legacy-compatible (individual ENV):

HAKMEM_TINY_COUNTERS_DUMP=1 ./test_hakmem # prints [Refill Stage Counters]/[Publish Hits] at exit

Via the master box (Phase 4d):

HAKMEM_STATS=counters ./test_hakmem   # enables the same counters in one shot via HAKMEM_STATS
HAKMEM_STATS_DUMP=1 ./test_hakmem     # dumps all Tiny counters at exit (atexit)


LD_PRELOAD notes:

- This repository ships `libhakmem.so` (build with `make shared`).
- The `bench/larson/larson` binary bundled with mimalloc-bench is a prebuilt distribution and may fail to run in this environment due to a GLIBC version mismatch.
- If you need to reproduce the LD_PRELOAD path, either prepare a GLIBC-compatible binary or apply `LD_PRELOAD=$(pwd)/libhakmem.so` to a system benchmark (e.g. comprehensive_system).

Current status (quick snapshot, burst: `-d 2 -t 1,4 -m 8 -M 128 -c 1024 -r 1`):

- system (1T): ~14.6 M ops/s
- mimalloc (1T): ~16.8 M ops/s
- hakmem (1T): ~1.1-1.3 M ops/s
- system (4T): ~16.8 M ops/s
- mimalloc (4T): ~16.8 M ops/s
- hakmem (4T): ~4.2 M ops/s

Note: Larson still shows a large gap, but the other built-in benchmarks (Tiny Hot / Random Mixed, etc.) are competitive (Tiny Hot: ~98% of mimalloc confirmed). The main focus for improving Larson is optimizing the free→alloc publish/pop connection and the MT wiring (Adopt Gate already in place).

### 🔬 Profiler Sweep (Overhead Tracking)

Use the sweep helper to probe size ranges and gather sampling profiler output quickly (2s per run by default):

scripts/prof_sweep.sh -d 2 -t 1,4 -s 8                  # sample=1/256, 1T/4T, multiple ranges
scripts/prof_sweep.sh -d 2 -t 4 -s 10 -m 2048 -M 32768  # focus (2-32KiB)


Env tips:
- `HAKMEM_TINY_MAG_CAP=1024` recommended for BURST style runs.
- Profiling ON adds minimal overhead due to sampling; keep N high (8-12) for realistic loads.

Profiler categories (subset):
- `tiny_alloc`, `ace_alloc`, `malloc_alloc`, `mmap_alloc`, `bigcache_try`
- Tiny internals: `tiny_bitmap`, `tiny_drain_locked/owner`, `tiny_spill`, `tiny_reg_lookup/register`
- Pool internals: `pool_lock/refill`, `l25_lock/refill`

Notes:

  • Runner uses absolute LD_PRELOAD paths for reliability.
  • Set MIMALLOC_SO=/path/to/libmimalloc.so.2 if auto-detection fails.

🧱 TLS Active Slab (Arena-lite)

The Tiny Pool keeps one "TLS Active Slab" per thread and per class.

  • On a magazine miss, allocation comes lock-free from the TLS slab (only the owning thread updates the bitmap).
  • Remote frees go to an MPSC stack; the owning thread drains it lock-free via tiny_remote_drain_owner().
  • Adopt runs once under the class lock (trylock only while inside wrappers).

This minimizes lock contention and false sharing, and shortens latency stably at both 1T and 4T.

🧊 EVO/Gating (low overhead by default)

EVO (learning-side) measurement is disabled by default (HAKMEM_EVO_SAMPLE=0).

  • On free(), clock_gettime() and P² updates run only while sampling is enabled.
  • Set HAKMEM_EVO_SAMPLE=N only when you want to see the measurements.

🏆 Benchmark Comparison (Phase 5)

# Build benchmark programs
make bench

# Run quick benchmark (3 warmup, 5 runs)
bash bench_runner.sh --warmup 3 --runs 5

# Run full benchmark (10 warmup, 50 runs)
bash bench_runner.sh --warmup 10 --runs 50 --output results.csv

# Manual single run
./bench_allocators_hakmem --allocator hakmem-baseline --scenario json
./bench_allocators_system --allocator system --scenario json
LD_PRELOAD=libjemalloc.so.2 ./bench_allocators_system --allocator jemalloc --scenario json

Benchmark scenarios:

  • json - Small (64KB), frequent (1000 iterations)
  • mir - Medium (256KB), moderate (100 iterations)
  • vm - Large (2MB), infrequent (10 iterations)
  • mixed - All patterns combined

Allocators tested:

  • hakmem-baseline - Fixed policy (256KB threshold)
  • hakmem-evolving - UCB1 adaptive learning
  • system - glibc malloc (baseline)
  • jemalloc - Industry standard (Firefox, Redis)
  • mimalloc - Microsoft allocator (state-of-the-art)

📊 Expected Results

Basic Test (test_hakmem)

You should see 3 different call-sites with distinct patterns:

Site #1:
  Address:    0x55d8a7b012ab
  Allocs:     1000
  Total:      64000000 bytes
  Avg size:   64000 bytes      # JSON parsing (64KB)
  Max size:   65536 bytes
  Policy:     SMALL_FREQUENT (malloc)

Site #2:
  Address:    0x55d8a7b012f3
  Allocs:     100
  Total:      25600000 bytes
  Avg size:   256000 bytes     # MIR build (256KB)
  Max size:   262144 bytes
  Policy:     MEDIUM (malloc)

Site #3:
  Address:    0x55d8a7b0133b
  Allocs:     10
  Total:      20971520 bytes
  Avg size:   2097152 bytes    # VM execution (2MB)
  Max size:   2097152 bytes
  Policy:     LARGE_INFREQUENT (mmap)

Key observation: Same code, different call-sites → automatically different profiles!

Benchmark Results (Phase 5) - FINAL

🏆 Overall Ranking (Points System: 5 allocators × 4 scenarios)

🥇 #1: mimalloc             18 points
🥈 #2: jemalloc             13 points
🥉 #3: hakmem-evolving      12 points ← Our contribution
   #4: system               10 points
   #5: hakmem-baseline      7 points

📊 Performance by Scenario (Median Latency, 50 runs each)

Scenario      hakmem-evolving   Best (Winner)            Gap       Status
JSON (64KB)   284.0 ns          263.5 ns (system)        +7.8%     Acceptable overhead
MIR (512KB)   1,750.5 ns        1,350.5 ns (mimalloc)    +29.6%    ⚠️ Competitive
VM (2MB)      58,600.0 ns       18,724.5 ns (mimalloc)   +213.0%   Needs per-site caching
MIXED         969.5 ns          518.5 ns (mimalloc)      +87.0%    Needs work

🔑 Key Findings:

  1. Call-site profiling overhead is acceptable (+7.8% on JSON)
  2. Competitive on medium allocations (+29.6% on MIR)
  3. Large allocation gap (3.1× slower than mimalloc on VM)
    • Root cause: Lack of per-site free-list caching
    • Future work: Implement Tier-2 MappedRegion hash map

🔥 Critical Discovery: Page Faults Issue

  • Initial direct mmap(): 1,538 page faults (769× more than system malloc!)
  • Fixed with malloc-based approach: 1,025 page faults (now equal to system)
  • Performance swing: VM scenario -54% → +14.4% (68.4 point improvement!)

See PAPER_SUMMARY.md for detailed analysis and paper narrative.


🔧 Implementation Details

Files

Phase 1-5 (UCB1 + Benchmarking):

  • hakmem.h - C API (call-site profiling + KPI measurement, ~110 lines)
  • hakmem.c - Core implementation (profiling + KPI + lifecycle, ~750 lines)
  • hakmem_ucb1.c - UCB1 bandit evolution (~330 lines)
  • test_hakmem.c - A/B test program (~135 lines)
  • bench_allocators.c - Benchmark framework (~360 lines)
  • bench_runner.sh - Automated benchmark runner (~200 lines)

Phase 6.1-6.4 (ELO System):

  • hakmem_elo.h/.c - ELO rating system (~450 lines)
  • hakmem_bigcache.h/.c - BigCache tier-2 optimization (~210 lines)
  • hakmem_batch.h/.c - Batch madvise optimization (~120 lines)

Phase 6.5 (Learning Lifecycle):

  • hakmem_p2.h/.c - P² percentile estimation (~130 lines)
  • hakmem_sizeclass_dist.h/.c - Distribution signature (~120 lines)
  • hakmem_evo.h/.c - State machine core (~610 lines)
  • test_evo.c - Lifecycle tests (~220 lines)

Documentation:

  • BENCHMARK_DESIGN.md, PAPER_SUMMARY.md, PHASE_6.2_ELO_IMPLEMENTATION.md, PHASE_6.5_LEARNING_LIFECYCLE.md

Phase 6.16 (SACS-3)

SACS-3: size-only tier selection + ACE for L1.

  • L0 Tiny (≤1KiB): TinySlab with TLS magazine and TLS Active Slab.
  • L1 ACE (1KiB-2MiB): unified hkm_ace_alloc()
    • MidPool (2/4/8/16/32 KiB), LargePool (64/128/256/512 KiB / 1 MiB)
    • W_MAX rounding: allow class round-up if class ≤ W_MAX×size (FrozenPolicy.w_max)
    • The 32-64KiB gap is absorbed to 64KiB when allowed by W_MAX
  • L2 Big (≥2MiB): BigCache/mmap (THP gate)

Site Rules is OFF by default and no longer used for tier selection. Hot path has no clock_gettime except optional sampling.

New modules:

  • hakmem_policy.h/.c: FrozenPolicy (RCU snapshot). Hot path loads once per call; the learning thread publishes a new snapshot.
  • hakmem_ace.h/.c: ACE layer alloc (L1 unified), W_MAX rounding.
  • hakmem_prof.h/.c: sampling profiler (categories, avg ns).
  • hakmem_ace_stats.h/.c: L1 mid/large hit/miss + L1 fallback counters (sampling).
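
A minimal sketch of the FrozenPolicy publish/load pair (field names are illustrative; safe reclamation of the old snapshot is the real code's concern, not shown here):

#include <stdatomic.h>
#include <stddef.h>

typedef struct {
    unsigned w_max_mid, w_max_large;
    unsigned cap_mid[5], cap_large[5];
    size_t   thp_threshold;
} frozen_policy_t;

static _Atomic(frozen_policy_t *) g_policy;

/* Hot path: one acquire-load per call, then plain reads of the snapshot. */
static inline const frozen_policy_t *policy_load(void) {
    return atomic_load_explicit(&g_policy, memory_order_acquire);
}

/* Learning thread: build a fresh snapshot, then publish it with one store. */
static void policy_publish(frozen_policy_t *fresh) {
    atomic_store_explicit(&g_policy, fresh, memory_order_release);
}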

Learning Targets (4 axes)

SACS-3's "smart cache" optimizes along the following four axes:

  • Threshold (mmap / L1↔L2 switch): will be reflected into FrozenPolicy.thp_threshold
  • Number of containers (size-class count): the number of Mid/Large classes (variable slots introduced in stages)
  • Shape of containers (size boundaries, granularity, W_MAX): e.g. w_max_mid/large
  • Amount per container (CAP/inventory): per-class CAP (pages/bundles) → Soft CAP controls refill intensity (implemented)

Runtime Control (environment variables)

  • Learner: HAKMEM_LEARN=1

    • Window length: HAKMEM_LEARN_WINDOW_MS (default 1000)
    • Target hit rates: HAKMEM_TARGET_HIT_MID (0.65), HAKMEM_TARGET_HIT_LARGE (0.55)
    • Steps: HAKMEM_CAP_STEP_MID (4), HAKMEM_CAP_STEP_LARGE (1)
    • Budget constraints: HAKMEM_BUDGET_MID, HAKMEM_BUDGET_LARGE (0 = disabled)
    • Min samples per window: HAKMEM_LEARN_MIN_SAMPLES (256)
  • Manual CAP override: HAKMEM_CAP_MID=a,b,c,d,e, HAKMEM_CAP_LARGE=a,b,c,d,e

  • Round-up tolerance: HAKMEM_WMAX_MID, HAKMEM_WMAX_LARGE

  • Mid free A/B: HAKMEM_POOL_TLS_FREE=0/1 (default 1)

Future additions (experimental):

  • Allow L1 inside wrappers: HAKMEM_WRAP_L2=1, HAKMEM_WRAP_L25=1
  • Variable Mid class slot (manual): HAKMEM_MID_DYN1=<bytes>

Inline/Hot Path Policy

  • The hot path is "immediate size decision + O(1) table lookup + minimal branching".
  • System calls such as clock_gettime() are banned on the hot path (they run on the sampling/learning thread instead).
  • Class decision is O(1) via static inline + LUT (see hakmem_pool.c / hakmem_l25_pool.c, and the sketch below).
  • FrozenPolicy is loaded once (RCU snapshot) at function entry; only reads afterwards.
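
The LUT idea from the list above, as a minimal sketch (granularity and table contents are hypothetical; the real classes are in hakmem_pool.c):

#include <stddef.h>

#define MID_MAX   (32 * 1024)
#define MID_SHIFT 11                     /* 2 KiB granularity → 16 LUT entries */

static unsigned char g_mid_class_lut[MID_MAX >> MID_SHIFT];  /* filled at init */

/* O(1), branch-light size→class: one shift, one table load. */
static inline int mid_class_for(size_t size) {
    if (size == 0 || size > MID_MAX) return -1;   /* not a Mid allocation */
    return g_mid_class_lut[(size - 1) >> MID_SHIFT];
}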

Soft CAP (implemented) and the Learner (implemented)

  • Mid/L2.5 refill consults the FrozenPolicy CAP and adjusts the number of refill bundles (see the sketch after this list).
    • Over CAP: bundles = 1
    • Under CAP: 1-4 depending on the deficit (floor of 2 when the deficit is large)
  • Shard empty & over CAP: steal from a neighboring shard with 1-2 probes (Mid/L2.5).
  • The learner runs on a separate thread, evaluates hit rates per window, steps CAP by ±Δ (with hysteresis/budget constraints), then publishes via hkm_policy_publish().
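
The bundle-count rule above as a sketch (the exact scaling in the real refill code may differ):

/* Over CAP → 1 bundle; under CAP → 1..4 with the deficit, floor 2 when large. */
static inline int refill_bundles(long inventory, long cap) {
    if (cap <= 0 || inventory >= cap) return 1;   /* over CAP: minimal refill */
    long deficit = cap - inventory;
    int n = (int)((deficit * 4) / cap);           /* scale with the shortfall */
    if (deficit > cap / 2 && n < 2) n = 2;        /* large deficit: floor of 2 */
    if (n < 1) n = 1;
    if (n > 4) n = 4;
    return n;
}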

Staged Introduction (proposal)

  1. One variable Mid class slot (e.g. 14KB): fit boundaries to the distribution peak.
  2. Optimize W_MAX over discrete candidates with a bandit + CANARY.
  3. Learn the mmap threshold (L1↔L2) with a bandit/ELO and reflect it into thp_threshold.
  4. Two variable slots → automatic optimization of class count/boundaries (heavy background computation).

Total: ~3745 lines for complete production-ready allocator!

What's Implemented

Phase 1-5 (Foundation):

  • Call-site capture (HAK_CALLSITE() macro)
  • Zero-friction API (hak_alloc_cs() / hak_free_cs())
  • Simple hash table (256 slots, linear probing)
  • Basic profiling (count, size, avg, max)
  • Policy-based optimization (malloc vs mmap)
  • UCB1 bandit evolution
  • KPI measurement (P50/P95/P99, page faults, RSS)
  • A/B testing (baseline vs evolving)
  • Benchmark framework (jemalloc/mimalloc comparison)

Phase 6.1-6.4 (ELO System):

  • ELO rating system (6 strategies with win/loss/draw)
  • Softmax selection (temperature-based exploration)
  • BigCache tier-2 (size-class caching for large allocations)
  • Batch madvise (MADV_DONTNEED syscall optimization)

Phase 6.5 (Learning Lifecycle):

  • 3-state machine (LEARN → FROZEN → CANARY)
  • P² algorithm (O(1) p99 estimation)
  • Size-class distribution signature (L1 distance)
  • Environment variable configuration
  • Zero-overhead FROZEN mode (confirmed best policy)
  • CANARY mode (5% trial sampling)
  • Convergence detection & workload shift detection

What's NOT Implemented (Future)

  • Multi-threaded support (single-threaded PoC)
  • Advanced mmap strategies (MADV_HUGEPAGE, etc.)
  • Redis/Nginx real-world benchmarks
  • Confusion Matrix for auto-inference accuracy

📈 Implementation Progress

Phase          Feature                                    Status     Date
Phase 1        Call-site profiling                        Complete   2025-10-21 AM
Phase 2        Policy optimization (malloc/mmap)          Complete   2025-10-21 PM
Phase 3        UCB1 bandit evolution                      Complete   2025-10-21 Eve
Phase 4        A/B testing                                Complete   2025-10-21 Eve
Phase 5        jemalloc/mimalloc comparison               Complete   2025-10-21 Night
Phase 6.1-6.4  ELO rating system integration              Complete   2025-10-21
Phase 6.5      Learning lifecycle (LEARN→FROZEN→CANARY)   Complete   2025-10-21
Phase 7        Redis/Nginx real-world benchmarks          📋 Next    TBD

💡 Key Insights from PoC

  1. Call-site works as identity: Different hak_alloc_cs() calls → different addresses
  2. Zero overhead abstraction: Macro expands to __builtin_return_address(0)
  3. Profiling overhead is acceptable: +7.8% on JSON (64KB), competitive on MIR (+29.6%)
  4. Hash table is fast: Simple power-of-2 hash, <8 probes
  5. Learning phase works: First 9 allocations gather data, 10th triggers optimization
  6. UCB1 evolution improves performance: hakmem-evolving +71% vs hakmem-baseline (12 vs 7 points)
  7. Page faults matter critically: 769× difference (1,538 vs 2) on direct mmap without caching
  8. Memory reuse is essential: System malloc's free-list enables 3.1× speedup on large allocations
  9. Per-site caching is the missing piece: Clear path to competitive performance (1st place)

📝 Connection to Paper

This PoC implements:

  • Section 3.6.2: Call-site Profiling API
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (O(1) lookup, <300ns overhead)
  • Section 5: Evaluation Framework (A/B test + benchmarking)

Paper Sections Proven:

  • Section 3.6.2: Call-site Profiling
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (<50ns overhead)
  • Section 5: Evaluation Framework (A/B test + jemalloc/mimalloc comparison) 🔄

🧪 Verification Checklist

Run the test and check:

  • 3 distinct call-sites detected
  • Allocation counts match (1000/100/10)
  • Average sizes are correct (64KB/256KB/2MB)
  • No crashes or memory leaks
  • Policy inference works (SMALL_FREQUENT/MEDIUM/LARGE_INFREQUENT)
  • Optimization strategies applied (malloc vs mmap)
  • Learning phase demonstrated (9 malloc + 1 mmap for large allocs)
  • A/B testing works (baseline vs evolving modes)
  • Benchmark framework functional
  • Full benchmark results collected (1000 runs, 5 allocators)

If all checks pass → Core concept AND optimization proven! 🎉


🎊 Summary

What We've Proven:

  1. Call-site = implicit purpose label
  2. Automatic policy inference (rule-based → UCB1 → ELO)
  3. ELO evolution with adaptive learning
  4. Call-site profiling overhead is acceptable (+7.8% on JSON)
  5. Competitive 3rd place ranking among 5 allocators
  6. KPI measurement (P50/P95/P99, page faults, RSS)
  7. A/B testing (baseline vs evolving)
  8. Honest comparison vs jemalloc/mimalloc (1000 benchmark runs)
  9. Production-ready lifecycle: LEARN → FROZEN → CANARY
  10. Zero-overhead frozen mode: Confirmed best policy after convergence
  11. P² percentile estimation: O(1) memory p99 tracking
  12. Workload shift detection: L1 distribution distance
  13. 🔍 Critical discovery: Page faults issue (769× difference) → malloc-based approach
  14. 📋 Clear path forward: Redis/Nginx real-world benchmarks

Code Size:

  • Phase 1-5 (UCB1 + Benchmarking): ~1625 lines
  • Phase 6.1-6.4 (ELO System): ~780 lines
  • Phase 6.5 (Learning Lifecycle): ~1340 lines
  • Total: ~3745 lines for complete production-ready allocator!

Paper Sections Proven:

  • Section 3.6.2: Call-site Profiling
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (+7.8% overhead on JSON)
  • Section 5: Evaluation Framework (5 allocators, 1000 runs, honest comparison)
  • Gemini S+ requirement met: jemalloc/mimalloc comparison

Status: ACE Learning Layer Planning + Mid MT Complete 🎯 Date: 2025-11-01

Latest Updates (2025-11-01)

  • Mid MT Complete: 110M ops/sec achieved (100-101% of mimalloc)
  • Repository Reorganized: Benchmarks/tests consolidated, root cleaned (72% reduction)
  • 🎯 ACE Learning Layer: Documentation complete, ready for Phase 1 implementation
    • Target: Fix fragmentation (2.6-5.2x), large WS (1.4-2.0x), realloc (1.3-2.0x)
    • Approach: Dual-loop adaptive control + UCB1 learning
    • See docs/ACE_LEARNING_LAYER.md for details

⚠️ Critical Update (2025-10-22): Thread Safety Issue Discovered

Problem: hakmem is completely thread-unsafe (no pthread_mutex anywhere)

  • 1-thread: 15.1M ops/sec Normal
  • 4-thread: 3.3M ops/sec -78% collapse (Race Condition)

Phase 6.14 Clarification:

  • Registry ON/OFF toggle implementation (Pattern 2)
  • O(N) Sequential proven 2.9-13.7x faster than O(1) Hash for Small-N
  • Default: g_use_registry = 0 (O(N), L1 cache hit 95%+)
  • Reported 67.9M ops/sec at 4-thread: NOT REPRODUCIBLE (measurement error)

Phase 6.15 Plan (12-13 hours, 6 days):

  1. Step 1 (1h): Documentation updates
  2. Step 2 (2-3h): P0 Safety Lock (pthread_mutex global lock) → 4T = 13-15M ops/sec
  3. Step 3 (8-10h): TLS implementation (Tiny/L2/L2.5 Pool TLS) → 4T = 15-22M ops/sec

Validation: Phase 6.13 already proved TLS works (15.9M ops/sec at 4T, +381%)

Details: See PHASE_6.15_PLAN.md, PHASE_6.15_SUMMARY.md, THREAD_SAFETY_SOLUTION.md


Previous Status: Phase 6.5 Complete - Production-Ready Learning Lifecycle! 🎉 Previous Date: 2025-10-21

Timeline:

  • 2025-10-21 AM: Phase 1 - Call-site profiling PoC
  • 2025-10-21 PM: Phase 2 - Policy-based optimization (malloc/mmap)
  • 2025-10-21 Evening: Phase 3-4 - UCB1 bandit + A/B testing
  • 2025-10-21 Night: Phase 5 - Benchmark infrastructure (1000 runs, 🥉 3rd place!)
  • 2025-10-21 Late Night: Phase 6.1-6.4 - ELO rating system integration
  • 2025-10-21 Night: Phase 6.5 - Learning lifecycle complete (6/6 tests passing)

Phase 6.5 Achievement:

  • 3-state machine: LEARN → FROZEN → CANARY
  • Zero-overhead FROZEN mode: 10-20× faster than LEARN mode
  • P² p99 estimation: O(1) memory percentile tracking
  • Distribution shift detection: L1 distance for workload changes
  • Environment variable config: Full control over freeze/convergence/canary settings
  • Production ready: All lifecycle transitions verified

Key Results:

  • VM scenario ranking: 🥈 2nd place (+1.9% gap to 1st!)
  • Phase 5 (UCB1): 🥉 3rd place (12 points) among 5 allocators
  • Phase 6.4 (ELO+BigCache): 🥈 2nd place, nearly tied with mimalloc
  • Call-site profiling overhead: +7.8% (acceptable)
  • FROZEN mode overhead: Zero (confirmed best policy, no ELO updates)
  • Convergence time: ~180 seconds (configurable via HAKMEM_FREEZE_SEC)
  • CANARY sampling: 5% trial (configurable via HAKMEM_CANARY_FRAC)

Next Steps:

  1. Phase 1-5 complete (UCB1 + benchmarking)
  2. Phase 6.1-6.4 complete (ELO system)
  3. Phase 6.5 complete (learning lifecycle)
  4. 🔧 Phase 6.6: Fix Batch madvise (0 blocks batched) → 1st place target 🏆
  5. 📋 Phase 7: Redis/Nginx real-world benchmarks
  6. 📝 Paper writeup (see PAPER_SUMMARY.md)

Related Documentation:
