## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression),
shift to instruction count reduction via compile-time removal:

- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput ±3%, instructions -5% to -15%
NO-GO: throughput < -2%, instructions < -5%

Key: If instructions do not drop by at least 15%, the allocator is not the
bottleneck and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal (see the sketch after this list):
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes

3. A/B test with perf stat (must measure instruction reduction)
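
As a sketch of step 2's gating (assumptions: HAKMEM_BENCH_MINIMAL and FRONT_FASTLANE_STAT_INC come from the design above; the env-check helper HAK_ENV_FLAG/hak_env_flag is a hypothetical shape, not the real API):

/* Sketch: make BENCH_MINIMAL=1 is assumed to pass -DHAKMEM_BENCH_MINIMAL=1. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0                       /* default OFF: production-safe */
#endif

#if !HAKMEM_BENCH_MINIMAL
#define FRONT_FASTLANE_STAT_INC(c) ((c)++)           /* stats kept in normal builds  */
#define HAK_ENV_FLAG(name, dflt)   hak_env_flag((name), (dflt))  /* runtime lookup   */
#else
#define FRONT_FASTLANE_STAT_INC(c) ((void)0)         /* compiles to nothing          */
#define HAK_ENV_FLAG(name, dflt)   (dflt)            /* constant-folds, branch gone  */
#endif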

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees.
Section splitting alone (without symbol ordering, PGO, or linker script)
destroyed code locality and increased I-cache misses 91%.

Switching to direct instruction removal is safer and more predictable.


hakmem PoC - Call-site Profiling + UCB1 Evolution

Documentation entry point: docs/INDEX.md (links by category) / reorganization plan: docs/DOCS_REORG_PLAN.md

Purpose: Proof-of-Concept for the core ideas from the paper:

  1. "Call-site address is an implicit purpose label - same location → same pattern"
  2. "UCB1 bandit learns optimal allocation policies automatically"

🎯 Current Status (2025-11-01)

Mid-Range Multi-Threaded Complete (110M ops/sec)

  • Achievement: 110M ops/sec on mid-range MT workload (8-32KB)
  • Comparison: 100-101% of mimalloc, 2.12x faster than glibc
  • Implementation: core/hakmem_mid_mt.{c,h}
  • Benchmarks: benchmarks/scripts/mid/ (run_mid_mt_bench.sh, compare_mid_mt_allocators.sh)
  • Report: MID_MT_COMPLETION_REPORT.md

Repository Reorganization Complete

  • New Structure: All benchmarks under benchmarks/, tests under tests/
  • Root Directory: 252 → 70 items (72% reduction)
  • Organization:
    • benchmarks/src/{tiny,mid,comprehensive,stress}/ - Benchmark sources
    • benchmarks/scripts/{tiny,mid,comprehensive,utils}/ - Scripts organized by category
    • benchmarks/results/ - All benchmark results (871+ files)
    • tests/{unit,integration,stress}/ - Tests by type
  • Details: FOLDER_REORGANIZATION_2025_11_01.md

ACE Learning Layer Phase 1 Complete (ACE = Agentic Context Engineering / Adaptive Control Engine)

  • Status: Phase 1 Infrastructure COMPLETE (2025-11-01)
  • Goal: Fix weak workloads with adaptive learning
    • Fragmentation stress: 3.87 → 10-20 M ops/s (2.6-5.2x target)
    • Large working set: 22.15 → 30-45 M ops/s (1.4-2.0x target)
    • realloc: 277ns → 140-210ns (1.3-2.0x target)
  • Phase 1 Deliverables (100% complete):
    • Metrics collection infrastructure (hakmem_ace_metrics.{c,h})
    • UCB1 learning algorithm (hakmem_ace_ucb1.{c,h})
    • Dual-loop controller (hakmem_ace_controller.{c,h})
    • Dynamic TLS capacity adjustment
    • Hot-path metrics integration (alloc/free tracking)
    • A/B benchmark script (scripts/bench_ace_ab.sh)
  • Documentation:
    • User guide: docs/ACE_LEARNING_LAYER.md
    • Implementation plan: docs/ACE_LEARNING_LAYER_PLAN.md
    • Progress report: ACE_PHASE1_PROGRESS.md
  • Usage: HAKMEM_ACE_ENABLED=1 ./your_benchmark
  • Next: Phase 2 - Extended benchmarking + learning convergence validation

📂 Quick Navigation

  • Build & Run: See "Quick Start" section below
  • Benchmarks: benchmarks/scripts/ organized by category
  • Documentation: DOCS_INDEX.md - Central documentation hub
  • Current Work: CURRENT_TASK.md

🧪 Larson Quick Run (Tiny + Superslab, mainline)

Use the defaults wrapper so critical env vars are always set:

  • Throughput-oriented (2s, threads=1,4): scripts/run_larson_defaults.sh
  • Lower page-fault/sys (10s, threads=4): scripts/run_larson_defaults.sh pf 10 4
  • Claude-friendly presets (envs pre-wired for reproducible debug): scripts/run_larson_claude.sh [tput|pf|repro|fast0|guard|debug] 2 4
    • For Claude Code runs with log capture, use scripts/claude_code_debug.sh.

The mainline ("no segfaults") is now the default. The default environment assumes the publish→mail→adopt pipeline is active:

  • Tiny/Superslab gates: HAKMEM_TINY_USE_SUPERSLAB=1 (default ON), HAKMEM_TINY_MUST_ADOPT=1, HAKMEM_TINY_SS_ADOPT=1
  • Fast-tier spill to create publish: HAKMEM_TINY_FAST_CAP=64, HAKMEM_TINY_FAST_SPARE_PERIOD=8
  • TLS list: HAKMEM_TINY_TLS_LIST=1
  • Mailbox discovery: HAKMEM_TINY_MAILBOX_SLOWDISC=1, HAKMEM_TINY_MAILBOX_SLOWDISC_PERIOD=256
  • Superslab sizing/cache/precharge: per mode (tput vs pf)

Debugging tips:

  • Add HAKMEM_TINY_RF_TRACE=1 for one-shot publish/mail traces.
  • Use scripts/run_larson_claude.sh debug 2 4 to enable TRACE_RING and emit early SIGUSR2 so the Tiny ring is dumped before crashes.

SLL-first Fast Path (Box 5)

  • Hot path favors TLS SLL (per-thread freelist) first; on miss, falls back to HotMag/TLS list, then SuperSlab.
  • Learning shifts to SLL via sll_cap_for_class() with per-class override/multiplier (small classes 0..3).
  • Ownership → remote drain → bind is centralized via SlabHandle (Box 3→2) for safety and determinism.
  • A/B knobs:
    • HAKMEM_TINY_TLS_SLL=0/1 (default 1)
    • HAKMEM_SLL_MULTIPLIER=N and HAKMEM_TINY_SLL_CAP_C{0..7}
    • HAKMEM_TINY_TLS_LIST=0/1

P0 batch refill is now compile-time only; runtime P0 env toggles were removed.

Benchmark Matrix

  • Quick matrix to compare mid layers vs SLL-first:
    • scripts/bench_matrix.sh 30 8 (duration=30s, threads=8)
  • Single run (throughput):
    • HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 scripts/run_larson_claude.sh tput 30 8
  • Force-notify path (A/B): set HAKMEM_TINY_RF_FORCE_NOTIFY=1 to surface missing first-notify cases.

Build Modes (Box Refactor)

  • Default (mainline): the Box Theory refactor (Phase 61.7) and the Superslab path are always ON
    • Compile flag: -DHAKMEM_TINY_PHASE6_BOX_REFACTOR=1 (Makefile default)
    • Runtime default: g_use_superslab=1 (ON unless explicitly set to 0 via environment variable)
    • A/B against the legacy path: make BOX_REFACTOR_DEFAULT=0 larson_hakmem

🚨 Segfault-free Policy (hard requirement)

  • The mainline is designed and implemented with "no segfaults" as the top priority.
  • Before adopting a change, run it through the following guards:
    • Guard run: ./scripts/larson.sh guard 2 4 (Trace Ring + Safe Free)
    • ASan/UBSan/TSan: ./scripts/larson.sh asan 2 4 / ubsan / tsan
    • Fail-fast environment: HAKMEM_TINY_RF_TRACE=0 etc.; follow the safety procedure in LARSON_GUIDE.md
    • Confirm that remote_invalid / SENTINEL_TRAP do not appear at the tail of the ring

New A/B Observation and Controls

  • Registry window: HAKMEM_TINY_REG_SCAN_MAX (default 256)
    • Caps how many entries the small registry window scans (for A/B of scan cost vs adopt hit rate)
  • Simplified Mid refill: HAKMEM_TINY_MID_REFILL_SIMPLE=1 (skips the multi-stage search for class >= 4)
    • For throughput-focused A/B (reduces adopt/search work). Check page faults/RSS before regular use.

Mimalloc vs HAKMEM (Larson quick A/B)

  • Recommended HAKMEM env (Tiny Hot, SLL-only, fast tier on):
HAKMEM_TINY_REFILL_COUNT_HOT=64 \
HAKMEM_TINY_FAST_CAP=16 \
HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 \
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 \
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 \
./larson_hakmem 2 8 128 1024 1 12345 4
  • One-shot refill path confirmation (noisy print just once):
HAKMEM_TINY_REFILL_OPT_DEBUG=1 <above_env> ./larson_hakmem 2 8 128 1024 1 12345 4
  • Mimalloc (direct link binary):
LD_LIBRARY_PATH=$PWD/mimalloc-bench/extern/mi/out/release ./larson_mi 2 8 128 1024 1 12345 4
  • Perf (selected counters):
perf stat -e cycles,instructions,branches,branch-misses,cache-references,cache-misses,\
  L1-dcache-loads,L1-dcache-load-misses -- \
  env <above_env> ./larson_hakmem 5 8 128 1024 1 12345 4

🎯 What This Proves

Phase 1: Call-site Profiling (DONE)

  1. Call-site capture works: __builtin_return_address(0) uniquely identifies allocation sites
  2. Different sites have different patterns: JSON (small, frequent) vs MIR (medium) vs VM (large)
  3. Profiling is lightweight: Simple hash table + sampling
  4. Zero user burden: Just replace malloc → hak_alloc_cs
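
A compilable sketch of the capture behind item 1 (the real HAK_CALLSITE() macro and entry-point shape may differ; hak_alloc_cs is the API named in this README):

#include <stddef.h>
#include <stdint.h>

void *hak_alloc_cs(size_t size, uintptr_t site);   /* profiling allocator entry */

/* In a non-inlined wrapper, frame 0's return address is the instruction
   right after the caller's call: a stable, unique ID per allocation site. */
__attribute__((noinline))
void *hak_alloc(size_t size) {
    uintptr_t site = (uintptr_t)__builtin_return_address(0);
    return hak_alloc_cs(size, site);
}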

Phase 2-4: UCB1 Evolution + A/B Testing (DONE)

  1. KPI measurement: P50/P95/P99 latency, Page Faults, RSS delta
  2. Discrete policy steps: 6 levels (64KB → 2MB)
  3. UCB1 bandit: Exploration + Exploitation balance
  4. Safety mechanisms:
    • ±1 step exploration (safe)
    • Hysteresis (8% improvement × 3 consecutive)
    • Cooldown (180 seconds)
  5. A/B testing: baseline vs evolving modes
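
For reference, the UCB1 selection in item 3 reduces to a few lines; the arm bookkeeping below is a hedged sketch, not the exact hakmem_ucb1.c code:

#include <math.h>

typedef struct { double mean_reward; unsigned pulls; } arm_t;

/* Pick the policy step maximizing mean + sqrt(2 ln N / n):
   high mean = exploitation, rarely-tried arms = exploration. */
static int ucb1_select(const arm_t *arms, int n_arms, unsigned total_pulls) {
    int best = 0;
    double best_score = -1.0;
    for (int i = 0; i < n_arms; i++) {
        if (arms[i].pulls == 0)
            return i;                              /* try every arm once first */
        double score = arms[i].mean_reward
                     + sqrt(2.0 * log((double)total_pulls) / arms[i].pulls);
        if (score > best_score) { best_score = score; best = i; }
    }
    return best;
}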

Phase 5: Benchmarking Infrastructure (COMPLETE)

  1. Allocator comparison framework: hakmem vs jemalloc/mimalloc/system malloc
  2. Fair benchmarking: Same workload, 50 runs per config, 1000 total runs
  3. KPI measurement: Latency (P50/P95/P99), page faults, RSS, throughput
  4. Paper-ready output: CSV format for graphs/tables
  5. Initial ranking (UCB1): 🥉 3rd place among 5 allocators

This proves Sections 3.6-3.7 of the paper. See PAPER_SUMMARY.md for detailed results.

Phase 6.1-6.4: ELO Rating System (COMPLETE)

  1. Strategy diversity: 6 threshold levels (64KB, 128KB, 256KB, 512KB, 1MB, 2MB)
  2. ELO rating: Each strategy has rating, learns from win/loss/draw
  3. Softmax selection: Probability ∝ exp(rating/temperature)
  4. BigCache optimization: Tier-2 size-class caching for large allocations
  5. Batch madvise: MADV_DONTNEED batching for reduced syscall overhead
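
A sketch of the softmax rule from item 3 (function and RNG handling are illustrative; the real hakmem_elo.c may differ):

#include <math.h>
#include <stdlib.h>

/* P(strategy i) ∝ exp(rating[i] / T); lower T → greedier selection.
   n <= 6 here: the six threshold strategies (64KB..2MB). */
static int softmax_pick(const double *rating, int n, double temperature) {
    double w[6], total = 0.0, max = rating[0];
    for (int i = 1; i < n; i++)
        if (rating[i] > max) max = rating[i];      /* stabilize exp() */
    for (int i = 0; i < n; i++) {
        w[i] = exp((rating[i] - max) / temperature);
        total += w[i];
    }
    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < n; i++) {
        r -= w[i];
        if (r <= 0.0) return i;
    }
    return n - 1;
}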

🏆 VM Scenario Benchmark Results (iterations=100):

🥇 mimalloc         15,822 ns  (baseline)
🥈 hakmem-evolving  16,125 ns  (+1.9%)  ← BigCache effect
🥉 system           16,814 ns  (+6.3%)
4th jemalloc        17,575 ns  (+11.1%)

Key achievement: 1.9% gap to 1st place (down from -50% in Phase 5!)

See PHASE_6.2_ELO_IMPLEMENTATION.md for details.

Phase 6.5: Learning Lifecycle (COMPLETE)

  1. 3-state machine: LEARN → FROZEN → CANARY
    • LEARN: Active learning with ELO updates
    • FROZEN: Zero-overhead production mode (confirmed best policy)
    • CANARY: Safe 5% trial sampling to detect workload changes
  2. Convergence detection: P² algorithm for O(1) p99 estimation
  3. Distribution signature: L1 distance for workload shift detection
  4. Environment variables: Fully configurable (freeze time, window size, etc.)
  5. Production ready: 6/6 tests passing, LEARN→FROZEN transition verified
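
The state machine in item 1, reduced to a sketch (the predicates are stand-ins for the real convergence/canary checks in hakmem_evo.c):

typedef enum { EVO_LEARN, EVO_FROZEN, EVO_CANARY } evo_state_t;

/* One decision tick: LEARN until converged, FROZEN until a canary
   window opens, CANARY re-learns only if the workload shifted. */
static evo_state_t evo_next(evo_state_t s, int converged,
                            int canary_window, int workload_shifted) {
    switch (s) {
    case EVO_LEARN:  return converged        ? EVO_FROZEN : EVO_LEARN;
    case EVO_FROZEN: return canary_window    ? EVO_CANARY : EVO_FROZEN;
    case EVO_CANARY: return workload_shifted ? EVO_LEARN  : EVO_FROZEN;
    }
    return s;
}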

Key feature: Learning converges in ~180 seconds, then runs at zero overhead in FROZEN mode!

See PHASE_6.5_LEARNING_LIFECYCLE.md for complete documentation.

Phase 6.6: ELO Control Flow Fix (COMPLETE)

Problem: After Phase 6.5 integration, batch madvise stopped activating.
Root cause: ELO strategy selection happened AFTER allocation, so its result was ignored.
Fix: Reordered hak_alloc_at() to apply the ELO threshold BEFORE allocation.

Diagnosed by: Gemini Pro (2025-10-21). Fixed by: Claude (2025-10-21).

Key insight:

  • OLD: allocate_with_policy(POLICY_DEFAULT) → malloc → ELO selection (too late!)
  • NEW: ELO selection → size >= threshold ? mmap : malloc

Result: 2MB allocations now correctly use mmap, enabling batch madvise optimization.
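
In code form, the fix is an ordering change; the helper names below are hypothetical shapes, not the exact hakmem signatures:

#include <stddef.h>

size_t elo_select_threshold(void);   /* hypothetical: ELO-chosen mmap threshold */
void  *alloc_via_mmap(size_t size);  /* hypothetical backends */
void  *alloc_via_malloc(size_t size);

/* NEW order: select the strategy first, then let it steer the allocation,
   so 2MB requests actually reach mmap and become batch-madvise candidates. */
void *hak_alloc_at(size_t size) {
    size_t threshold = elo_select_threshold();
    return (size >= threshold) ? alloc_via_mmap(size)
                               : alloc_via_malloc(size);
}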

See PHASE_6.6_ELO_CONTROL_FLOW_FIX.md for detailed analysis.

Phase 6.7: Overhead Analysis (COMPLETE)

Goal: Identify why hakmem is 2× slower than mimalloc despite identical syscall counts

Key Findings:

  1. Syscall overhead is NOT the bottleneck
    • hakmem: 292 mmap, 206 madvise (same as mimalloc)
    • Batch madvise working correctly
  2. The gap is structural, not algorithmic
    • mimalloc: Pool-based allocation (9ns fast path)
    • hakmem: Hash-based caching (31ns fast path)
    • 3.4× fast path difference explains 2× total gap
  3. hakmem's "smart features" have < 1% overhead
    • ELO: ~100-200ns (0.5%)
    • BigCache: ~50-100ns (0.3%)
    • Total: ~350ns out of 17,638ns gap (2%)

Recommendation: Accept the gap for research prototype OR implement hybrid pool fast-path (ChatGPT Pro proposal)

Deliverables:

Phase 6.8: Configuration Cleanup (COMPLETE)

Goal: Simplify complex environment variables into 5 preset modes + implement feature flags

Critical Bug Fixed: Task Agent investigation revealed complete design vs implementation gap:

  • Design: "Check g_hakem_config flags before enabling features"
  • Implementation: Features ran unconditionally (never checked!)
  • Impact: "MINIMAL mode" measured 14,959 ns but was actually BALANCED (all features ON)

Solution Implemented: Mode-based configuration + Feature-gated initialization

# Simple preset modes
export HAKMEM_MODE=minimal    # Baseline (all features OFF)
export HAKMEM_MODE=fast       # Production (pool fast-path + FROZEN)
export HAKMEM_MODE=balanced   # Default (BigCache + ELO FROZEN + Batch)
export HAKMEM_MODE=learning   # Development (ELO LEARN + adaptive)
export HAKMEM_MODE=research   # Debug (all features + verbose logging)

🎯 Benchmark Results - PROOF OF SUCCESS!

Test: VM scenario (2MB allocations, 100 iterations)

MINIMAL mode:  216,173 ns  (all features OFF - true baseline)
BALANCED mode:  15,487 ns  (BigCache + ELO ON)
→ 13.95x speedup from optimizations! 🚀

Feature Matrix (Now Actually Enforced!):

Feature            MINIMAL  FAST     BALANCED  LEARNING  RESEARCH
ELO learning       -        FROZEN   FROZEN    LEARN     LEARN
BigCache           -        ✓        ✓         ✓         ✓
Batch madvise      -        ✓        ✓         ✓         ✓
TinyPool (future)  -        ✓        -         -         ✓
Debug logging      -        -        -         -         ⚠️

Code Quality Improvements:

  • hakmem.c: 899 → 600 lines (-33% reduction)
  • New infrastructure: hakmem_features.h, hakmem_config.c/h, hakmem_internal.h (692 lines)
  • Static inline helpers: Zero-cost abstraction (100% inlined with -O2)
  • Feature flags: Runtime checks with < 0.1% overhead
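
A sketch of the feature-gated pattern behind these numbers (struct and field names here are illustrative; see hakmem_config.c/h for the real ones):

/* One config struct, populated once from HAKMEM_MODE at init. */
typedef struct {
    int elo_enabled;
    int bigcache_enabled;
    int batch_madvise_enabled;
} hak_config_t;

extern hak_config_t g_config;

/* Static inline checks: with -O2 these inline to one test of a hot,
   read-only global, which is where the "< 0.1% overhead" comes from. */
static inline int hak_feature_bigcache(void) { return g_config.bigcache_enabled; }
static inline int hak_feature_batch(void)    { return g_config.batch_madvise_enabled; }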

Benefits Delivered:

  • Easy to use (HAKMEM_MODE=balanced)
  • Clear benchmarking (14x performance difference proven!)
  • Backward compatible (individual env vars still work)
  • Paper-friendly (quantified feature impact)

See PHASE_6.8_PROGRESS.md for complete implementation details.


🚀 Quick Start

🎯 Choose Your Mode (Phase 6.8+)

New: hakmem now supports 5 simple preset modes!

# 1. MINIMAL - Baseline (all optimizations OFF)
export HAKMEM_MODE=minimal
./bench_allocators --allocator hakmem-evolving --scenario vm

# 2. BALANCED - Default recommended (BigCache + ELO FROZEN + Batch)
export HAKMEM_MODE=balanced  # or omit (default)
./bench_allocators --allocator hakmem-evolving --scenario vm

# 3. LEARNING - Development (ELO learns, adapts to workload)
export HAKMEM_MODE=learning
./test_hakmem

# 4. FAST - Production (future: pool fast-path + FROZEN)
export HAKMEM_MODE=fast
./bench_allocators --allocator hakmem-evolving --scenario vm

# 5. RESEARCH - Debug (all features + verbose logging)
export HAKMEM_MODE=research
./test_hakmem

Quick reference:

  • Just want it to work? → Use balanced (default)
  • Benchmarking baseline? → Use minimal
  • Development/testing? → Use learning
  • Production deployment? → Use fast (after Phase 7)
  • Debugging issues? → Use research

📖 Legacy Usage (Phase 1-6.7)

# Build
make

# Run basic test
make run

# Run A/B test (baseline mode)
./test_hakmem

# Run A/B test (evolving mode - UCB1 enabled)
env HAKMEM_MODE=evolving ./test_hakmem

# Override individual settings (backward compatible)
export HAKMEM_MODE=balanced
export HAKMEM_THP=off  # Override THP policy
./bench_allocators --allocator hakmem-evolving --scenario vm

⚙️ Useful Environment Variables

Tiny publish/adopt pipeline

# Enable SuperSlab (required for publish/adopt)
export HAKMEM_TINY_USE_SUPERSLAB=1
# Optional: must-adopt-before-mmap (one-pass adopt before mmap)
export HAKMEM_TINY_MUST_ADOPT=1
  • HAKMEM_TINY_USE_SUPERSLAB=1

    • The publish→mailbox→adopt pipeline runs only when the SuperSlab path is ON (with it OFF, the pipeline does nothing).
    • Recommended ON by default for benchmarks (for A/B you can compare with OFF, which favors memory efficiency).
  • HAKMEM_SAFE_FREE=1

    • Adds a best-effort mincore() guard before reading headers on free().
    • Safer with LD_PRELOAD at the cost of extra overhead. Default: off.
  • HAKMEM_WRAP_TINY=1

    • Allows Tiny Pool allocations during malloc/free wrappers (LD_PRELOAD).
    • Wrapper-context uses a magazine-only fast path (no locks/refill) for safety.
    • Default: off for stability. Enable to test Tiny impact on small-object workloads.
  • HAKMEM_TINY_MAG_CAP=INT

    • Upper bound for Tiny TLS magazine per class (soft). Default: build limit (2048); recommended 1024 for BURST.
  • HAKMEM_SITE_RULES=1

    • Enables Site Rules. Note: tier selection no longer uses Site Rules (SACS3); they remain only as layer-internal hints for future use.
  • HAKMEM_PROF=1, HAKMEM_PROF_SAMPLE=N

    • Enables the lightweight sampling profiler. N is an exponent: sample every 2^N calls (default 12). Outputs per-category average ns. (A sampling-gate sketch follows this list.)
  • HAKMEM_ACE_SAMPLE=N

    • ACE layer (L1) stats sampling for mid/large hit/miss and L1 fallback. Default off.
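
The HAKMEM_PROF_SAMPLE gate can be this small; a sketch assuming a per-thread counter (variable names illustrative):

#include <stdint.h>

static __thread uint64_t t_prof_calls;    /* per-thread call counter */
static unsigned g_prof_shift = 12;        /* N from HAKMEM_PROF_SAMPLE, default 12 */

/* True once every 2^N calls: a power-of-two period makes the hot-path
   check one increment plus one mask test, no division. */
static inline int prof_should_sample(void) {
    return (++t_prof_calls & ((1ull << g_prof_shift) - 1)) == 0;
}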

🧪 Larson Runner (Reproducible)

Use the provided runner to compare system/mimalloc/hakmem under identical settings.

scripts/run_larson.sh [options] [runtime_sec] [threads_csv]

Options:
  -d SECONDS     Runtime seconds (default: 10)
  -t CSV         Threads CSV, e.g. 1,4 (default: 1,4)
  -c NUM         Chunks per thread (default: 10000)
  -r NUM         Rounds (default: 1)
  -m BYTES       Min size (default: 8)
  -M BYTES       Max size (default: 1024)
  -s SEED        Random seed (default: 12345)
  -p PRESET      Preset: burst|loop (sets -c/-r)

Presets:
  burst → chunks/thread=10000, rounds=1   # harsher (many objects held at once)
  loop  → chunks/thread=100,   rounds=100 # gentler (high locality)

Examples:
  scripts/run_larson.sh -d 10 -t 1,4            # burst (default)
  scripts/run_larson.sh -d 10 -t 1,4 -p loop    # 100×100 loop

Performance-oriented env (recommended when comparing hakmem):

HAKMEM_DISABLE_BATCH=0
HAKMEM_TINY_META_ALLOC=0
HAKMEM_TINY_META_FREE=0
HAKMEM_TINY_SS_ADOPT=1
bash scripts/run_larson.sh -d 10 -t 1,4


Counters dump (refill/publish visibility):

Legacy-compatible (individual ENV):

HAKMEM_TINY_COUNTERS_DUMP=1 ./test_hakmem   # prints [Refill Stage Counters]/[Publish Hits] at exit

Via the stats master box (Phase 4d):

HAKMEM_STATS=counters ./test_hakmem   # the same counters, enabled in bulk via HAKMEM_STATS
HAKMEM_STATS_DUMP=1 ./test_hakmem     # dump all Tiny counters at exit (atexit)


LD_PRELOAD notes:

- This repository provides `libhakmem.so` (`make shared`).
- The `bench/larson/larson` bundled with mimalloc-bench is a distributed binary, so it may fail to run here due to a GLIBC version mismatch.
- To reproduce the LD_PRELOAD path, either prepare a GLIBC-compatible binary separately, or apply `LD_PRELOAD=$(pwd)/libhakmem.so` to a system-linked benchmark (e.g. comprehensive_system).

Current status (quick snapshot, burst: `-d 2 -t 1,4 -m 8 -M 128 -c 1024 -r 1`):

- system (1T): ~14.6 M ops/s
- mimalloc (1T): ~16.8 M ops/s
- hakmem (1T): ~1.1–1.3 M ops/s
- system (4T): ~16.8 M ops/s
- mimalloc (4T): ~16.8 M ops/s
- hakmem (4T): ~4.2 M ops/s

Note: Larson still shows a large gap, but the other built-in benchmarks (Tiny Hot / Random Mixed, etc.) are close contests (Tiny Hot: ~98% of mimalloc, confirmed). The focus for improving Larson is optimizing the free→alloc publish/pop connection and the MT wiring (Adopt Gate already in place).

### 🔬 Profiler Sweep (Overhead Tracking)

Use the sweep helper to probe size ranges and gather sampling profiler output quickly (2s per run by default):

scripts/prof_sweep.sh -d 2 -t 1,4 -s 8                  # sample=1/256, 1T/4T, multiple ranges
scripts/prof_sweep.sh -d 2 -t 4 -s 10 -m 2048 -M 32768  # focus (2–32KiB)


Env tips:
- `HAKMEM_TINY_MAG_CAP=1024` recommended for BURST style runs.
- Profiling ON adds minimal overhead due to sampling; keep N high (8–12) for realistic loads.

Profiler categories (subset):
- `tiny_alloc`, `ace_alloc`, `malloc_alloc`, `mmap_alloc`, `bigcache_try`
- Tiny internals: `tiny_bitmap`, `tiny_drain_locked/owner`, `tiny_spill`, `tiny_reg_lookup/register`
- Pool internals: `pool_lock/refill`, `l25_lock/refill`

Notes:

  • Runner uses absolute LD_PRELOAD paths for reliability.
  • Set MIMALLOC_SO=/path/to/libmimalloc.so.2 if auto-detection fails.

🧱 TLS Active Slab (Arena-lite)

Tiny Pool keeps one "TLS Active Slab" per thread per class.

  • On a magazine miss, allocation comes lock-free from the TLS Slab (only the owning thread updates the bitmap).
  • Remote frees go to an MPSC stack; the owning thread drains it lock-free via tiny_remote_drain_owner() (sketched below).
  • Adopt runs once under the class lock (trylock only while inside the wrapper).

This minimizes lock contention and false sharing, giving stable speedups at both 1T and 4T.
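
A hedged sketch of the remote-free MPSC stack described above (node/field names illustrative; cf. tiny_remote_drain_owner()):

#include <stdatomic.h>
#include <stddef.h>

typedef struct rnode { struct rnode *next; } rnode_t;
typedef struct { _Atomic(rnode_t *) head; } rstack_t;

/* Producers (any thread): lock-free push of a freed block. */
static void remote_push(rstack_t *s, rnode_t *n) {
    rnode_t *old = atomic_load_explicit(&s->head, memory_order_relaxed);
    do {
        n->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &old, n,
                 memory_order_release, memory_order_relaxed));
}

/* Consumer (owning thread only): one exchange takes the whole list,
   which can then be walked with no further synchronization. */
static rnode_t *remote_drain_owner(rstack_t *s) {
    return atomic_exchange_explicit(&s->head, NULL, memory_order_acquire);
}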

🧊 EVO/Gating (low overhead by default)

Measurement for the learning system (EVO) is disabled by default (HAKMEM_EVO_SAMPLE=0).

  • In free(), clock_gettime() and P² updates run only while sampling is enabled.
  • Set HAKMEM_EVO_SAMPLE=N only when you want to see the measurements.

🏆 Benchmark Comparison (Phase 5)

# Build benchmark programs
make bench

# Run quick benchmark (3 warmup, 5 runs)
bash bench_runner.sh --warmup 3 --runs 5

# Run full benchmark (10 warmup, 50 runs)
bash bench_runner.sh --warmup 10 --runs 50 --output results.csv

# Manual single run
./bench_allocators_hakmem --allocator hakmem-baseline --scenario json
./bench_allocators_system --allocator system --scenario json
LD_PRELOAD=libjemalloc.so.2 ./bench_allocators_system --allocator jemalloc --scenario json

Benchmark scenarios:

  • json - Small (64KB), frequent (1000 iterations)
  • mir - Medium (256KB), moderate (100 iterations)
  • vm - Large (2MB), infrequent (10 iterations)
  • mixed - All patterns combined

Allocators tested:

  • hakmem-baseline - Fixed policy (256KB threshold)
  • hakmem-evolving - UCB1 adaptive learning
  • system - glibc malloc (baseline)
  • jemalloc - Industry standard (Firefox, Redis)
  • mimalloc - Microsoft allocator (state-of-the-art)

📊 Expected Results

Basic Test (test_hakmem)

You should see 3 different call-sites with distinct patterns:

Site #1:
  Address:    0x55d8a7b012ab
  Allocs:     1000
  Total:      64000000 bytes
  Avg size:   64000 bytes      # JSON parsing (64KB)
  Max size:   65536 bytes
  Policy:     SMALL_FREQUENT (malloc)

Site #2:
  Address:    0x55d8a7b012f3
  Allocs:     100
  Total:      25600000 bytes
  Avg size:   256000 bytes     # MIR build (256KB)
  Max size:   262144 bytes
  Policy:     MEDIUM (malloc)

Site #3:
  Address:    0x55d8a7b0133b
  Allocs:     10
  Total:      20971520 bytes
  Avg size:   2097152 bytes    # VM execution (2MB)
  Max size:   2097152 bytes
  Policy:     LARGE_INFREQUENT (mmap)

Key observation: Same code, different call-sites → automatically different profiles!

Benchmark Results (Phase 5) - FINAL

🏆 Overall Ranking (Points System: 5 allocators × 4 scenarios)

🥇 #1: mimalloc             18 points
🥈 #2: jemalloc             13 points
🥉 #3: hakmem-evolving      12 points ← Our contribution
   #4: system               10 points
   #5: hakmem-baseline      7 points

📊 Performance by Scenario (Median Latency, 50 runs each)

Scenario      hakmem-evolving   Best (Winner)            Gap       Status
JSON (64KB)   284.0 ns          263.5 ns (system)        +7.8%     Acceptable overhead
MIR (512KB)   1,750.5 ns        1,350.5 ns (mimalloc)    +29.6%    ⚠️ Competitive
VM (2MB)      58,600.0 ns       18,724.5 ns (mimalloc)   +213.0%   Needs per-site caching
MIXED         969.5 ns          518.5 ns (mimalloc)      +87.0%    Needs work

🔑 Key Findings:

  1. Call-site profiling overhead is acceptable (+7.8% on JSON)
  2. Competitive on medium allocations (+29.6% on MIR)
  3. Large allocation gap (3.1× slower than mimalloc on VM)
    • Root cause: Lack of per-site free-list caching
    • Future work: Implement Tier-2 MappedRegion hash map

🔥 Critical Discovery: Page Faults Issue

  • Initial direct mmap(): 1,538 page faults (769× more than system malloc!)
  • Fixed with malloc-based approach: 1,025 page faults (now equal to system)
  • Performance swing: VM scenario -54% → +14.4% (68.4 point improvement!)

See PAPER_SUMMARY.md for detailed analysis and paper narrative.


🔧 Implementation Details

Files

Phase 1-5 (UCB1 + Benchmarking):

  • hakmem.h - C API (call-site profiling + KPI measurement, ~110 lines)
  • hakmem.c - Core implementation (profiling + KPI + lifecycle, ~750 lines)
  • hakmem_ucb1.c - UCB1 bandit evolution (~330 lines)
  • test_hakmem.c - A/B test program (~135 lines)
  • bench_allocators.c - Benchmark framework (~360 lines)
  • bench_runner.sh - Automated benchmark runner (~200 lines)

Phase 6.1-6.4 (ELO System):

  • hakmem_elo.h/.c - ELO rating system (~450 lines)
  • hakmem_bigcache.h/.c - BigCache tier-2 optimization (~210 lines)
  • hakmem_batch.h/.c - Batch madvise optimization (~120 lines)

Phase 6.5 (Learning Lifecycle):

  • hakmem_p2.h/.c - P² percentile estimation (~130 lines)
  • hakmem_sizeclass_dist.h/.c - Distribution signature (~120 lines)
  • hakmem_evo.h/.c - State machine core (~610 lines)
  • test_evo.c - Lifecycle tests (~220 lines)

Documentation:

  • BENCHMARK_DESIGN.md, PAPER_SUMMARY.md, PHASE_6.2_ELO_IMPLEMENTATION.md, PHASE_6.5_LEARNING_LIFECYCLE.md

Phase 6.16 (SACS3)

SACS3: size-only tier selection + ACE for L1.

  • L0 Tiny (≤1KiB): TinySlab with TLS magazine and TLS Active Slab.
  • L1 ACE (1KiB–2MiB): unified hkm_ace_alloc()
    • MidPool (2/4/8/16/32 KiB), LargePool (64/128/256/512 KiB/1 MiB)
    • W_MAX rounding: allow rounding up to a class if class size ≤ W_MAX×size (FrozenPolicy.w_max)
    • 32–64KiB gap absorbed to 64KiB when allowed by W_MAX
  • L2 Big (≥2MiB): BigCache/mmap (THP gate)

Site Rules is OFF by default and no longer used for tier selection. Hot path has no clock_gettime except optional sampling.

New modules:

  • hakmem_policy.h/.c - FrozenPolicy (RCU snapshot). Hot path loads once per call; learning thread publishes a new snapshot.
  • hakmem_ace.h/.c - ACE-layer alloc (L1 unified), W_MAX rounding.
  • hakmem_prof.h/.c - sampling profiler (categories, average ns).
  • hakmem_ace_stats.h/.c - L1 mid/large hit/miss + L1 fallback counters (sampling).

Learning Targets (4 axes)

SACS3's "smart cache" optimizes along the following four axes:

  • Threshold (mmap / L1↔L2 switch): to be reflected into FrozenPolicy.thp_threshold in the future
  • Number of containers (size-class count): how many Mid/Large classes exist (variable slots introduced gradually)
  • Shape of containers (size boundaries, granularity, W_MAX): e.g. w_max_mid/large
  • Amount per container (CAP/inventory): per-class CAP (pages/bundles) → refill intensity controlled by Soft CAP (implemented)

Runtime Controls (environment variables)

  • Learner: HAKMEM_LEARN=1

    • Window length: HAKMEM_LEARN_WINDOW_MS (default 1000)
    • Target hit rates: HAKMEM_TARGET_HIT_MID (0.65), HAKMEM_TARGET_HIT_LARGE (0.55)
    • Steps: HAKMEM_CAP_STEP_MID (4), HAKMEM_CAP_STEP_LARGE (1)
    • Budget constraints: HAKMEM_BUDGET_MID, HAKMEM_BUDGET_LARGE (0 = disabled)
    • Minimum samples per window: HAKMEM_LEARN_MIN_SAMPLES (256)
  • Manual CAP override: HAKMEM_CAP_MID=a,b,c,d,e, HAKMEM_CAP_LARGE=a,b,c,d,e

  • Round-up tolerance: HAKMEM_WMAX_MID, HAKMEM_WMAX_LARGE

  • Mid free A/B: HAKMEM_POOL_TLS_FREE=0/1 (default 1)

Future additions (experimental):

  • Allow L1 inside wrappers: HAKMEM_WRAP_L2=1, HAKMEM_WRAP_L25=1
  • Variable Mid class slot (manual): HAKMEM_MID_DYN1=<bytes>

Inline/Hot Path Policy

  • The hot path is "immediate size decision + O(1) table lookup + minimal branching".
  • System calls such as clock_gettime() are banned on the hot path (they run on the sampling/learning thread side).
  • static inline + LUT makes class determination O(1) (see hakmem_pool.c / hakmem_l25_pool.c, and the sketch below).
  • FrozenPolicy loads its RCU snapshot once at function entry and only reads it afterwards.
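
A sketch of the LUT pattern (table contents are illustrative, derived from the 2/4/8/16/32KiB Mid classes with the 32–64KiB gap rounded up; the real tables live in hakmem_pool.c):

#include <stddef.h>

/* Index = size in 2KiB steps (rounded up); value = Mid class.
   Classes: 0=2K 1=4K 2=8K 3=16K 4=32K 5=64K (gap absorbed). */
static const unsigned char g_mid_class_lut[33] = {
    0,0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
    5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};

static inline unsigned mid_class_for(size_t size) {   /* size <= 64KiB */
    return g_mid_class_lut[(size + 2047) >> 11];      /* one shift + one load */
}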

Soft CAP (implemented) and Learner (implemented)

  • Mid/L2.5 refill consults the FrozenPolicy CAP and adjusts the number of refill bundles (sketched after this list).
    • At or over CAP: bundles = 1
    • Under CAP: 1–4 depending on the deficit (floor of 2 when the deficit is large)
  • Empty shard & excess CAP: steal from nearby shards with 1–2 probes (Mid/L2.5).
  • The learner evaluates hit rates per window on a separate thread, adjusts CAP by ±Δ (with hysteresis/budget constraints), and publishes via hkm_policy_publish().
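
The Soft CAP refill decision in C (scaling thresholds are illustrative; only the 1 / 1–4 / floor-of-2 shape comes from the description above):

/* Bundles to fetch on refill, given current inventory vs the policy CAP. */
static inline unsigned refill_bundles(unsigned inventory, unsigned cap) {
    if (inventory >= cap)
        return 1;                        /* at or over CAP: minimal refill */
    unsigned deficit = cap - inventory;
    unsigned bundles = 1 + deficit / 8;  /* grow with the deficit (illustrative) */
    if (bundles > 4) bundles = 4;        /* clamp to the 1..4 range */
    if (deficit > cap / 2 && bundles < 2)
        bundles = 2;                     /* large deficit: floor of 2 */
    return bundles;
}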

Staged Introduction (proposal)

  1. Introduce one variable Mid class slot (e.g. 1–4KB) and optimize its boundary to match the distribution peak.
  2. Optimize W_MAX over discrete candidates with a bandit + CANARY.
  3. Learn the mmap threshold (L1↔L2) with a bandit/ELO and reflect it into thp_threshold.
  4. Two variable slots → automatic optimization of class count/boundaries (heavy background computation).


What's Implemented

Phase 1-5 (Foundation):

  • Call-site capture (HAK_CALLSITE() macro)
  • Zero-friction API (hak_alloc_cs() / hak_free_cs())
  • Simple hash table (256 slots, linear probing)
  • Basic profiling (count, size, avg, max)
  • Policy-based optimization (malloc vs mmap)
  • UCB1 bandit evolution
  • KPI measurement (P50/P95/P99, page faults, RSS)
  • A/B testing (baseline vs evolving)
  • Benchmark framework (jemalloc/mimalloc comparison)

Phase 6.1-6.4 (ELO System):

  • ELO rating system (6 strategies with win/loss/draw)
  • Softmax selection (temperature-based exploration)
  • BigCache tier-2 (size-class caching for large allocations)
  • Batch madvise (MADV_DONTNEED syscall optimization)

Phase 6.5 (Learning Lifecycle):

  • 3-state machine (LEARN → FROZEN → CANARY)
  • P² algorithm (O(1) p99 estimation)
  • Size-class distribution signature (L1 distance)
  • Environment variable configuration
  • Zero-overhead FROZEN mode (confirmed best policy)
  • CANARY mode (5% trial sampling)
  • Convergence detection & workload shift detection

What's NOT Implemented (Future)

  • Multi-threaded support (single-threaded PoC)
  • Advanced mmap strategies (MADV_HUGEPAGE, etc.)
  • Redis/Nginx real-world benchmarks
  • Confusion Matrix for auto-inference accuracy

📈 Implementation Progress

Phase          Feature                                    Status     Date
Phase 1        Call-site profiling                        Complete   2025-10-21 AM
Phase 2        Policy optimization (malloc/mmap)          Complete   2025-10-21 PM
Phase 3        UCB1 bandit evolution                      Complete   2025-10-21 Eve
Phase 4        A/B testing                                Complete   2025-10-21 Eve
Phase 5        jemalloc/mimalloc comparison               Complete   2025-10-21 Night
Phase 6.1-6.4  ELO rating system integration              Complete   2025-10-21
Phase 6.5      Learning lifecycle (LEARN→FROZEN→CANARY)   Complete   2025-10-21
Phase 7        Redis/Nginx real-world benchmarks          📋 Next    TBD

💡 Key Insights from PoC

  1. Call-site works as identity: Different hak_alloc_cs() calls → different addresses
  2. Zero overhead abstraction: Macro expands to __builtin_return_address(0)
  3. Profiling overhead is acceptable: +7.8% on JSON (64KB), competitive on MIR (+29.6%)
  4. Hash table is fast: Simple power-of-2 hash, <8 probes
  5. Learning phase works: First 9 allocations gather data, 10th triggers optimization
  6. UCB1 evolution improves performance: hakmem-evolving +71% vs hakmem-baseline (12 vs 7 points)
  7. Page faults matter critically: 769× difference (1,538 vs 2) on direct mmap without caching
  8. Memory reuse is essential: System malloc's free-list enables 3.1× speedup on large allocations
  9. Per-site caching is the missing piece: Clear path to competitive performance (1st place)

📝 Connection to Paper

This PoC implements:

  • Section 3.6.2: Call-site Profiling API
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (O(1) lookup, <300ns overhead)
  • Section 5: Evaluation Framework (A/B test + benchmarking)

Paper Sections Proven:

  • Section 3.6.2: Call-site Profiling
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (<50ns overhead)
  • Section 5: Evaluation Framework (A/B test + jemalloc/mimalloc comparison) 🔄

🧪 Verification Checklist

Run the test and check:

  • 3 distinct call-sites detected
  • Allocation counts match (1000/100/10)
  • Average sizes are correct (64KB/256KB/2MB)
  • No crashes or memory leaks
  • Policy inference works (SMALL_FREQUENT/MEDIUM/LARGE_INFREQUENT)
  • Optimization strategies applied (malloc vs mmap)
  • Learning phase demonstrated (9 malloc + 1 mmap for large allocs)
  • A/B testing works (baseline vs evolving modes)
  • Benchmark framework functional
  • Full benchmark results collected (1000 runs, 5 allocators)

If all checks pass → Core concept AND optimization proven! 🎉


🎊 Summary

What We've Proven:

  1. Call-site = implicit purpose label
  2. Automatic policy inference (rule-based → UCB1 → ELO)
  3. ELO evolution with adaptive learning
  4. Call-site profiling overhead is acceptable (+7.8% on JSON)
  5. Competitive 3rd place ranking among 5 allocators
  6. KPI measurement (P50/P95/P99, page faults, RSS)
  7. A/B testing (baseline vs evolving)
  8. Honest comparison vs jemalloc/mimalloc (1000 benchmark runs)
  9. Production-ready lifecycle: LEARN → FROZEN → CANARY
  10. Zero-overhead frozen mode: Confirmed best policy after convergence
  11. P² percentile estimation: O(1) memory p99 tracking
  12. Workload shift detection: L1 distribution distance
  13. 🔍 Critical discovery: Page faults issue (769× difference) → malloc-based approach
  14. 📋 Clear path forward: Redis/Nginx real-world benchmarks

Code Size:

  • Phase 1-5 (UCB1 + Benchmarking): ~1625 lines
  • Phase 6.1-6.4 (ELO System): ~780 lines
  • Phase 6.5 (Learning Lifecycle): ~1340 lines
  • Total: ~3745 lines for complete production-ready allocator!

Paper Sections Proven:

  • Section 3.6.2: Call-site Profiling
  • Section 3.7: Learning ≠ LLM (UCB1 = lightweight online optimization)
  • Section 4.3: Hot-Path Performance (+7.8% overhead on JSON)
  • Section 5: Evaluation Framework (5 allocators, 1000 runs, honest comparison)
  • Gemini S+ requirement met: jemalloc/mimalloc comparison

Status: ACE Learning Layer Planning + Mid MT Complete 🎯
Date: 2025-11-01

Latest Updates (2025-11-01)

  • Mid MT Complete: 110M ops/sec achieved (100-101% of mimalloc)
  • Repository Reorganized: Benchmarks/tests consolidated, root cleaned (72% reduction)
  • 🎯 ACE Learning Layer: Documentation complete, ready for Phase 1 implementation
    • Target: Fix fragmentation (2.6-5.2x), large WS (1.4-2.0x), realloc (1.3-2.0x)
    • Approach: Dual-loop adaptive control + UCB1 learning
    • See docs/ACE_LEARNING_LAYER.md for details

⚠️ Critical Update (2025-10-22): Thread Safety Issue Discovered

Problem: hakmem is completely thread-unsafe (no pthread_mutex anywhere)

  • 1-thread: 15.1M ops/sec (normal)
  • 4-thread: 3.3M ops/sec (-78% collapse: race condition)

Phase 6.14 Clarification:

  • Registry ON/OFF toggle implementation (Pattern 2)
  • O(N) Sequential proven 2.9-13.7x faster than O(1) Hash for Small-N
  • Default: g_use_registry = 0 (O(N), L1 cache hit 95%+)
  • Reported 67.9M ops/sec at 4-thread: NOT REPRODUCIBLE (measurement error)

Phase 6.15 Plan (12-13 hours, 6 days):

  1. Step 1 (1h): Documentation updates
  2. Step 2 (2-3h): P0 Safety Lock (pthread_mutex global lock) → 4T = 13-15M ops/sec
  3. Step 3 (8-10h): TLS implementation (Tiny/L2/L2.5 Pool TLS) → 4T = 15-22M ops/sec

Validation: Phase 6.13 already proved TLS works (15.9M ops/sec at 4T, +381%)

Details: See PHASE_6.15_PLAN.md, PHASE_6.15_SUMMARY.md, THREAD_SAFETY_SOLUTION.md


Previous Status: Phase 6.5 Complete - Production-Ready Learning Lifecycle! 🎉
Previous Date: 2025-10-21

Timeline:

  • 2025-10-21 AM: Phase 1 - Call-site profiling PoC
  • 2025-10-21 PM: Phase 2 - Policy-based optimization (malloc/mmap)
  • 2025-10-21 Evening: Phase 3-4 - UCB1 bandit + A/B testing
  • 2025-10-21 Night: Phase 5 - Benchmark infrastructure (1000 runs, 🥉 3rd place!)
  • 2025-10-21 Late Night: Phase 6.1-6.4 - ELO rating system integration
  • 2025-10-21 Night: Phase 6.5 - Learning lifecycle complete (6/6 tests passing)

Phase 6.5 Achievement:

  • 3-state machine: LEARN → FROZEN → CANARY
  • Zero-overhead FROZEN mode: 10-20× faster than LEARN mode
  • P² p99 estimation: O(1) memory percentile tracking
  • Distribution shift detection: L1 distance for workload changes
  • Environment variable config: Full control over freeze/convergence/canary settings
  • Production ready: All lifecycle transitions verified

Key Results:

  • VM scenario ranking: 🥈 2nd place (+1.9% gap to 1st!)
  • Phase 5 (UCB1): 🥉 3rd place (12 points) among 5 allocators
  • Phase 6.4 (ELO+BigCache): 🥈 2nd place, nearly tied with mimalloc
  • Call-site profiling overhead: +7.8% (acceptable)
  • FROZEN mode overhead: Zero (confirmed best policy, no ELO updates)
  • Convergence time: ~180 seconds (configurable via HAKMEM_FREEZE_SEC)
  • CANARY sampling: 5% trial (configurable via HAKMEM_CANARY_FRAC)

Next Steps:

  1. Phase 1-5 complete (UCB1 + benchmarking)
  2. Phase 6.1-6.4 complete (ELO system)
  3. Phase 6.5 complete (learning lifecycle)
  4. 🔧 Phase 6.6: Fix Batch madvise (0 blocks batched) → 1st place target 🏆
  5. 📋 Phase 7: Redis/Nginx real-world benchmarks
  6. 📝 Paper writeup (see PAPER_SUMMARY.md)

Related Documentation:
