2013514f7b
Working state before pushing to cyu remote
2025-12-19 03:45:01 +09:00
e4c5f05355
Phase 86: Free Path Legacy Mask (NO-GO, +0.25%)
...
## Summary
Implemented Phase 86 "mask-only commit" optimization for free path:
- Bitset mask (0x7f for C0-C6) to identify LEGACY classes
- Direct call to tiny_legacy_fallback_free_base_with_env()
- No indirect function pointers (avoids Phase 85's -0.86% regression)
- Fail-fast on LARSON_FIX=1 (cross-thread validation incompatibility)
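A minimal sketch of the mask-only check described above (tiny_legacy_fallback_free_base_with_env is named in this commit; the mask variable, the slow-path helper, and the signatures are assumptions):
```c
#include <stdint.h>

extern uint8_t g_free_legacy_mask;   /* assumption: 0x7f when enabled, 0 when off */
extern void tiny_legacy_fallback_free_base_with_env(void *ptr, int class_idx);  /* assumed signature */
extern void free_generic_path(void *ptr, int class_idx);                        /* hypothetical slow path */

static inline void free_with_legacy_mask(void *ptr, int class_idx) {
    /* Two branch checks, as noted above: mask enabled / class bit set. */
    if (g_free_legacy_mask & (1u << class_idx)) {
        tiny_legacy_fallback_free_base_with_env(ptr, class_idx);  /* direct call, no function pointer */
        return;
    }
    free_generic_path(ptr, class_idx);
}
```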
## Results (10-run SSOT)
**NO-GO**: +0.25% improvement (threshold: +1.0%)
- Control: 51,750,467 ops/s (CV: 2.26%)
- Treatment: 51,881,055 ops/s (CV: 2.32%)
- Delta: +0.25% (mean), -0.15% (median)
## Root Cause
Competing optimizations plateau:
1. Phase 9/10 MONO LEGACY (+1.89%) already captures most of the free-path benefit
2. Remaining margin insufficient to overcome:
- Two branch checks (mask_enabled + has_class)
- I-cache layout tax in hot path
- Direct function call overhead
## Phase 85 vs Phase 86
| Metric | Phase 85 | Phase 86 |
|--------|----------|----------|
| Approach | Indirect calls + table | Bitset mask + direct call |
| Result | -0.86% | +0.25% |
| Verdict | NO-GO (regression) | NO-GO (insufficient) |
Phase 86 correctly avoided indirect-call penalties but revealed an architectural
limit: the free path can't escape the Phase 9/10 overlay without restructuring.
## Recommendation
The free-path optimization layer has reached its practical ceiling:
- Phase 9/10 +1.89% + Phase 6/19/FASTLANE +16-27% ≈ 18-29% total
- Further attempts on ceremony elimination face same constraints
- Recommend focus on different optimization layers (malloc, etc.)
## Files Changed
### New
- core/box/free_path_legacy_mask_box.h (API + globals)
- core/box/free_path_legacy_mask_box.c (refresh logic)
### Modified
- core/bench_profile.h (added refresh call)
- core/front/malloc_tiny_fast.h (added Phase 86 fast path check)
- Makefile (added object files)
- CURRENT_TASK.md (documented result)
All changes conditional on HAKMEM_FREE_PATH_LEGACY_MASK=1 (default OFF).
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-18 22:05:34 +09:00
89a9212700
Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
...
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns
- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
tcmalloc: 115.26M (92.33% of mimalloc)
jemalloc: 97.39M (77.96% of mimalloc)
system: 85.20M (68.24% of mimalloc)
mimalloc: 124.82M (baseline)
- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
Result: baseline stabilized to 55.53M (44.46% of mimalloc)
Previous unstable measurement (35.57M) was due to profile leak
- Documentation:
* PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
* PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
* ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
* ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology
- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-18 18:50:00 +09:00
043d34ad5a
Phase 75-2: C5-only Inline Slots (P2) - GO (+1.10%)
...
Extends Phase 75-1 pattern to C5 class (28.5% of C4-C7 ops):
- Created 4 new boxes: env_box, tls_box, fast_path_api, TLS variable
- Integration: 2 minimal boundary points (alloc/free for C5)
- Test strategy: C5-only isolation (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Default OFF: zero overhead when disabled
Results (10-run Mixed SSOT, WS=400, C6 already enabled):
- Baseline (C5=OFF, C6=ON): 44.26 M ops/s (σ=0.37)
- Treatment (C5=ON, C6=ON): 44.74 M ops/s (σ=0.54)
- Delta: +0.49 M ops/s (+1.10%)
Status: ✅ GO - C5 individual contribution confirmed
Cumulative since Phase 75-0: +2.87% (C6) + 1.10% (C5) = potential +3.97% combined
Next: Phase 75-3 (test C5+C6 interaction + non-additivity + promote to preset default)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-18 08:39:48 +09:00
0009ce13b3
Phase 75-1: C6-only Inline Slots (P2) - GO (+2.87%)
...
Modular implementation of hot-class inline slots optimization:
- Created 5 new boxes: env_box, tls_box, fast_path_api, integration_box, test_script
- Single decision point at TLS init (ENV gate: HAKMEM_TINY_C6_INLINE_SLOTS=0/1)
- Integration: 2 minimal boundary points (alloc/free paths for C6 class)
- Default OFF: zero overhead when disabled (full backward compatibility)
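As a rough illustration of the inline-slots idea (all names, the capacity, and the layout below are assumptions, not the actual box code; the HAKMEM_TINY_C6_INLINE_SLOTS gate at TLS init is not shown):
```c
#include <stddef.h>

#define C6_INLINE_SLOT_CAP 8            /* assumed small, L1-resident capacity */

static __thread void  *t_c6_slots[C6_INLINE_SLOT_CAP];
static __thread size_t t_c6_count;      /* stays 0 when the ENV gate is disabled */

static inline void *c6_inline_alloc_try(void) {
    return t_c6_count ? t_c6_slots[--t_c6_count] : NULL;  /* single branch, no ring-buffer math */
}

static inline int c6_inline_free_try(void *p) {
    if (t_c6_count < C6_INLINE_SLOT_CAP) { t_c6_slots[t_c6_count++] = p; return 1; }
    return 0;                            /* fall back to unified_cache_push() */
}
```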
Results (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): 44.24 M ops/s
- Treatment (C6 inline ON): 45.51 M ops/s
- Delta: +1.27 M ops/s (+2.87%)
Status: ✅ GO - Strong improvement via C6 ring buffer fast-path
Mechanism: Branch elimination on unified_cache_push/pop for C6 allocations
Next: Phase 75-2 (add C5 inline slots, target 85% C4-C7 coverage)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-18 08:22:09 +09:00
b6212bbe31
Phase 69: Refill tuning completion (Warm Pool Size=16 optimized)
...
- Promoted Warm Pool Size=16 as the new baseline (+3.26% gain).
- Updated PERFORMANCE_TARGETS_SCORECARD.md with Phase 69 results.
- Updated scripts/run_mixed_10_cleanenv.sh and core/bench_profile.h to use HAKMEM_WARM_POOL_SIZE=16 by default.
- Clarified that TINY_REFILL_BATCH_SIZE is not currently connected.
2025-12-18 01:55:27 +09:00
84f5034e45
Phase 68: PGO training set diversification (seed/WS expansion)
...
Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active
Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-17 21:08:17 +09:00
7adbcdfcb6
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement
...
## Summary
Completed Phase 54-60 optimization work:
**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset
**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY
**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc
**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized
## Key Metrics
- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes
## Files Added/Modified
New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h
Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py
Documentation: Phase 40-60 analysis documents
## Design Decisions
1. Profile separation (core/bench_profile.h):
- MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)
2. Box Theory compliance:
- All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
- Single conversion points maintained
- No physical deletions (compile-out only)
3. Lessons learned:
- SSOT effective only where redundancy exists (Phase 60 showed limits)
- Branch prediction extremely effective (~0 cycles for well-predicted branches)
- Early-exit pattern valuable even when seemingly redundant
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-17 06:24:01 +09:00
b7085c47e1
Phase 35-39: FAST build optimization complete (+7.13% cumulative)
...
Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%)
- tiny_front_v3_enabled() → constant true
- tiny_metadata_cache_enabled() → constant 0
- learner_v7_enabled() → constant false
- small_learner_v2_enabled() → constant false
Phase 36: Policy snapshot init-once (GO +0.71%)
- small_policy_v7_snapshot() version check skip in BENCH_MINIMAL
- TLS cache for policy snapshot
Phase 37: Standard TLS cache (NO-GO -0.07%)
- TLS cache for Standard build attempted
- Runtime gate overhead negates benefit
Phase 38: FAST/OBSERVE/Standard workflow established
- make perf_fast, make perf_observe targets
- Scorecard and documentation updates
Phase 39: Hot path gate constantization (GO +1.98%)
- front_gate_unified_enabled() → constant 1
- alloc_dualhot_enabled() → constant 0
- g_bench_fast_front, g_v3_enabled blocks → compile-out
- free_dispatch_stats_enabled() → constant false
Results:
- FAST v3: 56.04M ops/s (47.4% of mimalloc)
- Standard: 53.50M ops/s (45.3% of mimalloc)
- M1 target (50%): 5.5% remaining
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-16 15:01:56 +09:00
8052e8b320
Phase 24-26: Hot path atomic telemetry prune (+2.00% cumulative)
...
Summary:
- Phase 24 (alloc stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL (code cleanliness)
- Total: 11 atomics compiled-out, +2.00% improvement
Phase 24: OBSERVE tax prune (tiny_class_stats_box.h)
- Added HAKMEM_TINY_CLASS_STATS_COMPILED (default: 0)
- Wrapped 5 stats functions: uc_miss, warm_hit, shared_lock, tls_carve_*
- Result: +0.93% (baseline 56.675M vs compiled-in 56.151M ops/s)
Phase 25: Tiny free stats prune (tiny_superslab_free.inc.h)
- Added HAKMEM_TINY_FREE_STATS_COMPILED (default: 0)
- Wrapped g_free_ss_enter atomic in free hot path
- Result: +1.07% (baseline 57.017M vs compiled-in 56.415M ops/s)
Phase 26: Hot path diagnostic atomics prune
- Added 5 compile gates for low-frequency error counters:
- HAKMEM_TINY_C7_FREE_COUNT_COMPILED
- HAKMEM_TINY_HDR_MISMATCH_LOG_COMPILED
- HAKMEM_TINY_HDR_META_MISMATCH_COMPILED
- HAKMEM_TINY_METRIC_BAD_CLASS_COMPILED
- HAKMEM_TINY_HDR_META_FAST_COMPILED
- Result: -0.33% NEUTRAL (within noise, kept for cleanliness)
Alignment with mimalloc principles:
- "No atomics on hot path" - telemetry moved to compile-time opt-in
- Fixed per-op tax elimination
- Production builds: maximum performance (atomics compiled-out)
- Research builds: full diagnostics (COMPILED=1)
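A minimal sketch of the compile-time gate pattern this describes (g_free_ss_enter is named in Phase 25; the macro shape is an assumption):
```c
#include <stdatomic.h>

#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
#define HAKMEM_TINY_FREE_STATS_COMPILED 0   /* default: telemetry compiled out */
#endif

#if HAKMEM_TINY_FREE_STATS_COMPILED
extern _Atomic unsigned long g_free_ss_enter;
#define FREE_STAT_INC() atomic_fetch_add_explicit(&g_free_ss_enter, 1, memory_order_relaxed)
#else
#define FREE_STAT_INC() ((void)0)           /* no atomic on the hot path */
#endif
```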
Generated with Claude Code
https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-16 05:35:11 +09:00
ec87025da6
Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)
...
## Phase 17 v2: FORCE_LIBC Gap Validation Fix
**Critical bug fix**: the Phase 17 v1 measurement was broken.
**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 was only seen after the FastLane path,
so the same-binary A/B was effectively "hakmem vs hakmem" (the +0.39% result was a mismeasurement).
**Fix**: added an early bypass for g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171
and :645, routing straight to __libc_malloc/__libc_free before anything else.
**Result**: correct same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)
**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)
**Conclusion**: Case A confirmed (allocator dominant, NOT layout).
Phase 17 v1's Case B verdict was wrong.
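For illustration, a minimal sketch of the early-bypass shape described in the Fix above (hakmem_malloc_path is a hypothetical stand-in for the normal wrapper path):
```c
#include <stddef.h>

extern void *__libc_malloc(size_t size);       /* glibc internal entry point */
extern int   g_force_libc_alloc;               /* set from HAKMEM_FORCE_LIBC_ALLOC at startup */
extern void *hakmem_malloc_path(size_t size);  /* hypothetical: the normal hakmem wrapper path */

void *malloc(size_t size) {
    if (g_force_libc_alloc == 1)               /* checked before FastLane, so A/B really compares allocators */
        return __libc_malloc(size);
    return hakmem_malloc_path(size);
}
```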
Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)
---
## Phase 19: FastLane Instruction Reduction Analysis
**Goal**: libc との instruction gap (-35% instructions, -56% branches) を削減
**perf stat 分析** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)
**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**
**Reduction candidates**:
- A: Remove wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate ENV snapshots (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)
Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md
---
## Phase 19-1b: FastLane Direct — GO (+5.88%)
**Strategy**: Bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()
**Why Phase 19-1 was NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired (unfair A/B)
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)
**Phase 19-1b fixes**:
1. Removed __builtin_expect()
2. Call free_tiny_fast() directly
**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)
**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)
**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs
**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
- HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
- Single _Atomic global (solves the wrapper caching problem)
2. **Wrapper changes**: core/box/hak_wrappers.inc.h
- malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
- free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1
- Safety: direct path is skipped while !g_initialized; fallback retained
3. **Preset promotion**: core/bench_profile.h:88
- bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
- Comment: +5.88% proven on Mixed, 10-run
4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
- HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
- Promoted the same way as Phase 9/10
**Verdict**: GO. Adopted on the main line; preset promotion complete
**Rollback**: HAKMEM_FASTLANE_DIRECT=0 restores the existing FastLane path
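A rough sketch of the FASTLANE_DIRECT dispatch shape (the single _Atomic gate follows the implementation notes above; free_fastlane_path and the exact checks are assumptions):
```c
#include <stdatomic.h>
#include <stddef.h>

extern _Atomic int g_fastlane_direct;      /* HAKMEM_FASTLANE_DIRECT snapshot */
extern int         g_initialized;
extern void        free_tiny_fast(void *ptr);
extern void        free_fastlane_path(void *ptr);   /* hypothetical: existing FastLane path */

void free(void *ptr) {
    if (!ptr) return;
    if (g_initialized && atomic_load_explicit(&g_fastlane_direct, memory_order_relaxed)) {
        free_tiny_fast(ptr);               /* bypass the wrapper/FastLane layers */
        return;
    }
    free_fastlane_path(ptr);               /* when disabled or before init */
}
```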
Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md
---
## Cumulative Performance
- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**
Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.2%** (was +62.7% before Phase 19-1b)
Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-15 11:28:40 +09:00
bc2c5ded76
Phase 18 v2: BENCH_MINIMAL — NEUTRAL (+2.32% throughput, -5.06% instructions)
...
## Summary
Phase 18 v2 attempted instruction count reduction via conditional compilation:
- Stats collection → no-op
- ENV checks → constant propagation
- Binary size: 653K → 649K (-4K, -0.6%)
Result: NEUTRAL (below GO threshold)
- Throughput: +2.32% (target: +5% minimum) ❌
- Instructions: -5.06% (target: -15% minimum) ❌
- Cycles: -3.26% (positive signal)
- Branches: -8.67% (positive signal)
- Cache-misses: +30% (unexpected, likely layout)
## Analysis
Positive signals:
- Implementation correct (Branch -8.67%, Instruction -5.06%)
- Binary size reduced (-4K)
- Modest throughput gain (+2.32%)
- Cycles and branch overhead reduced
Negative signals:
- Instruction reduction insufficient (-5.06% << -15% smoking gun)
- Throughput gain below +5% threshold
- Cache-misses increased (+30%, layout noise?)
## Verdict
Freeze Phase 18 v2 (weak positive, insufficient for production).
Per user guidance: "If instructions don't drop clearly, continuation value is thin."
-5.06% instruction reduction is marginal. Allocator micro-optimization plateau confirmed.
## Key Insight
Phase 17 showed:
- IPC = 2.30 (consistent, memory-bound)
- I-cache gap: 55% (Phase 17: 153K → 68K)
- Instruction gap: 48% (Phase 17: 41.3B → 21.5B)
Phase 18 v1/v2 results confirm:
- Layout tweaks are fragile (v1: I-cache +91%)
- Instruction removal is modest benefit (v2: -5.06%)
- Allocator is NOT the bottleneck (IPC constant, memory-limited)
## Recommendation
Do NOT continue Phase 18 micro-optimizations.
Next frontier requires different approach:
1. Architectural redesign (SIMD, lock-free, batching)
2. Memory layout optimization (cache-friendly structures)
3. Broader profiling (not allocator-focused)
Or: Accept that 48M → 85M (75% gap) is achievable with current architecture.
Files:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md (results)
- CURRENT_TASK.md (Phase 18 complete status)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-15 06:02:28 +09:00
b1912d6587
Phase 18 v1: Hot Text Isolation — NO-GO (I-cache regression)
...
## Summary
Phase 18 v1 attempted layout optimization using section splitting + GC:
- `-ffunction-sections -fdata-sections -Wl,--gc-sections`
Result: **Catastrophic I-cache regression**
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: +91.06% (131K → 250K)
- Variance: +80% (σ=0.45M → σ=0.81M)
Root cause: Section-based splitting without explicit hot symbol ordering
fragments code locality, destroying natural compiler/LTO layout.
## Build Knob Safety
Makefile updated to separate concerns:
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO)
Both kept as research boxes (default OFF).
## Verdict
Freeze Phase 18 v1:
- Do NOT use section-based linking without strong ordering strategy
- Keep hot/cold attributes as placeholder (currently unused)
- Proceed to Phase 18 v2: BENCH_MINIMAL compile-out
Expected impact v2: +10-20% via instruction count reduction
- GO threshold: +5% minimum, +8% preferred
- Only continue if instructions clearly drop
## Files
New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md
Modified:
- Makefile (build knob safety isolation)
- CURRENT_TASK.md (Phase 18 v1 verdict)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md
## Lessons
1. Layout optimization is extremely fragile without ordering guarantees
2. I-cache is first-order performance factor (IPC=2.30 is memory-bound)
3. Compiler defaults may be better than manual section splitting
4. Next frontier: instruction count reduction (stats/ENV removal)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-15 05:53:58 +09:00
f8e7cf05b4
Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
...
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed** — allocator difference negligible, layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-15 05:25:47 +09:00
87fa27518c
Phase 15 v1: UnifiedCache FIFO→LIFO NEUTRAL (-0.70% Mixed, +0.42% C7)
...
Transform existing array-based UnifiedCache from FIFO ring to LIFO stack.
A/B Results:
- Mixed (16-1024B): -0.70% (52,965,966 → 52,593,948 ops/s)
- C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s)
Verdict: NEUTRAL (both below +1.0% GO threshold) - freeze as research box
Implementation:
- L0 ENV gate: tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1)
- L1 LIFO ops: tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo)
- L2 integration: tiny_front_hot_box.h (mode check at entry)
- Reuses existing slots[] array (no intrusive pointers)
Root Causes:
1. Mode check overhead (tiny_unified_lifo_enabled() call)
2. Minimal LIFO vs FIFO locality delta in practice
3. Existing FIFO ring already well-optimized
Bonus Fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
- Converted static inline to extern + non-inline implementation
- Fixes undefined reference during LTO linking
Design: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md
Results: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-15 02:19:26 +09:00
f8fb05bc13
Phase 14 v1: Pointer-Chase Reduction (tcache) NEUTRAL (+0.20%)
...
Implementation:
- Intrusive LIFO tcache layer (L1) before UnifiedCache
- TLS per-class bins (head pointer + count)
- Intrusive next pointers (via tiny_next_store/load SSOT)
- Cap: 64 blocks per class (default)
- ENV: HAKMEM_TINY_TCACHE=0/1 (default: 0, OFF)
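A minimal sketch of the intrusive per-class bins described above (tiny_next_store/tiny_next_load are from this commit; the bin struct and array size are assumptions):
```c
#include <stddef.h>

#define TCACHE_CAP 64                    /* per-class cap from this commit */

extern void  tiny_next_store(void *block, void *next);  /* SSOT next-pointer access */
extern void *tiny_next_load(void *block);

typedef struct { void *head; unsigned count; } TinyTcacheBin;
static __thread TinyTcacheBin t_tcache[8];              /* assumed: one bin per tiny class */

static inline int tcache_push(int cls, void *block) {
    TinyTcacheBin *b = &t_tcache[cls];
    if (b->count >= TCACHE_CAP) return 0;               /* overflow -> unified cache */
    tiny_next_store(block, b->head);                    /* intrusive LIFO link */
    b->head = block;
    b->count++;
    return 1;
}

static inline void *tcache_pop(int cls) {
    TinyTcacheBin *b = &t_tcache[cls];
    void *block = b->head;
    if (!block) return NULL;
    b->head = tiny_next_load(block);
    b->count--;
    return block;
}
```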
A/B Test Results (Mixed 10-run):
- Baseline (TCACHE=0): 51,083,379 ops/s
- Optimized (TCACHE=1): 51,186,838 ops/s
- Mean delta: +0.20% (below +1.0% GO threshold)
- Median delta: +0.59%
Verdict: NEUTRAL - Freeze as research box (default OFF)
Root Cause (v1 wiring incomplete):
- Free side pushes to tcache via unified_cache_push()
- Alloc hot path (tiny_hot_alloc_fast) doesn't consume tcache
- tcache becomes "sink" without alloc-side pop → ROI not measurable
Files:
- Created: core/box/tiny_tcache_{env_box,box}.h, tiny_tcache_env_box.c
- Modified: core/front/tiny_unified_cache.h (integration)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (build integration)
- Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
- v2 Instructions: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md
Next: Phase 14 v2 (connect tcache to tiny_front_hot_box alloc/free hot path)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-15 01:28:50 +09:00
cbb35ee27f
Phase 13 v1 + E5-2 retest: Both NEUTRAL, freeze as research boxes
...
Phase 13 v1: Header Write Elimination (C7 preserve header)
- Verdict: NEUTRAL (+0.78%)
- Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF)
- Makes C7 nextptr offset conditional (0→1 when enabled)
- 4-point matrix A/B test results:
* Case A (baseline): 51.49M ops/s
* Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%)
* Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%)
* Case D (both): 51.89M ops/s (+0.78% NEUTRAL)
- Action: Freeze as research box (default OFF, manual opt-in)
Phase 5 E5-2: Header Write-Once retest (promotion test)
- Verdict: NEUTRAL (+0.54%)
- Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run
- Results (20-run):
* Case A (baseline): 51.10M ops/s
* Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%)
- Previous test: +0.45% (consistent with NEUTRAL)
- Action: Keep as research box (default OFF, manual opt-in)
Key findings:
- Header write tax optimization shows consistent NEUTRAL results
- Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%)
- Both implemented as reversible ENV gates for future research
Files changed:
- New: core/box/tiny_c7_preserve_header_env_box.{c,h}
- Modified: core/box/tiny_layout_box.h (C7 offset conditional)
- Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (add new .o files)
- Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV)
- Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results)
Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO)
🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-15 00:32:25 +09:00
be723ca052
Phase 8: FREE-STATIC-ROUTE ENV Cache Hardening (GO +2.61%)
...
Results:
- A/B test: +2.61% on Mixed (10-run, clean env)
- Baseline: 49.26M ops/s
- Optimized: 50.55M ops/s
- Improvement: +1.29M ops/s (+2.61%)
Strategy:
- Fix the ENV cache accident (gate value was cached before main())
- Add refresh mechanism to sync with bench_profile putenv
- Ensure Phase 3 D1 optimization works reliably
Success factors:
1. Performance improvement: +2.61% (existing win-box now reliable)
2. ENV cache accident fixed: refresh mechanism works correctly
3. Standard deviation improved: 867K → 336K ops/s (61% reduction)
4. Baseline quality improved: existing optimization now guaranteed
Implementation:
- Patch 1: Make ENV gate refreshable (tiny_free_route_cache_env_box.{h,c})
- Changed static int to extern _Atomic int
- Added tiny_free_static_route_refresh_from_env()
- Patch 2: Integrate refresh into bench_profile.h
- Call refresh after bench_setenv_default() group
- Patch 3: Update Makefile for new .c file
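A minimal sketch of the refreshable ENV gate pattern (the refresh function name is from this commit; the variable name and parsing are assumptions):
```c
#include <stdatomic.h>
#include <stdlib.h>

_Atomic int g_free_static_route_enabled = 0;   /* was a function-local static cache before the fix */

void tiny_free_static_route_refresh_from_env(void) {
    const char *v = getenv("HAKMEM_FREE_STATIC_ROUTE");
    int on = (v && v[0] == '1');
    atomic_store_explicit(&g_free_static_route_enabled, on, memory_order_relaxed);
}

/* bench_profile.h calls the refresh after its bench_setenv_default() group,
 * so a value cached before main()/putenv can no longer go stale. */
```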
ENV cache fix verification:
- [FREE_STATIC_ROUTE] enabled appears twice (refresh working)
- bench_profile putenv now reliably reflected
Files modified:
- core/box/tiny_free_route_cache_env_box.h: extern + refresh API
- core/box/tiny_free_route_cache_env_box.c: NEW (global state + refresh)
- core/bench_profile.h: add refresh call
- Makefile: add new .o file
Health check: PASSED (all profiles)
Rollback: HAKMEM_FREE_STATIC_ROUTE=0 or revert Patch 1/2
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 18:49:08 +09:00
ea221d057a
Phase 6: promote Front FastLane (default ON)
2025-12-14 16:28:23 +09:00
5528612f2a
Phase 5 E4-2: Malloc Wrapper ENV Snapshot (+21.83% GO, ADOPTED)
...
Target: Consolidate malloc wrapper TLS reads + eliminate function calls
- malloc (16.13%) + tiny_alloc_gate_fast (19.50%) = 35.63% combined
- Strategy: E4-1 success pattern + function call elimination
Implementation:
- ENV gate: HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/malloc_wrapper_env_snapshot_box.{h,c}: New box
- Consolidates multiple TLS reads → 1 TLS read
- Pre-caches tiny_max_size() == 256 (eliminates function call)
- Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in malloc() wrapper
- Makefile: Add malloc_wrapper_env_snapshot_box.o to all targets
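A rough sketch of the wrapper ENV snapshot idea (field names and the lazy-init shape are assumptions; the pre-cached tiny_max_size()==256 is from this commit):
```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    int    initialized;
    int    enabled;          /* HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT */
    size_t tiny_max;         /* pre-cached tiny_max_size() result */
} MallocWrapperSnapshot;

static __thread MallocWrapperSnapshot t_mw_snap;   /* one TLS read replaces several */

static inline const MallocWrapperSnapshot *mw_snapshot(void) {
    if (!t_mw_snap.initialized) {                  /* lazy init, after bench_profile putenv */
        const char *v = getenv("HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT");
        t_mw_snap.enabled     = (v && v[0] == '1');
        t_mw_snap.tiny_max    = 256;               /* avoids calling tiny_max_size() per malloc */
        t_mw_snap.initialized = 1;
    }
    return &t_mw_snap;
}
```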
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 35.74M ops/s (mean), 35.75M ops/s (median)
- Optimized (SNAPSHOT=1): 43.54M ops/s (mean), 43.92M ops/s (median)
- Improvement: +21.83% mean, +22.86% median (+7.80M ops/s)
Decision: GO (+21.83% >> +1.0% threshold, 21.8x over)
- Why 6.2x better than E4-1 (+3.51%)?
- Higher malloc call frequency (allocation-heavy workload)
- Function call elimination (tiny_max_size pre-cached)
- Larger target: 35.63% vs free's 25.26%
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset
Phase 5 Cumulative (estimated):
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- E4-2 (Malloc Wrapper Snapshot): +21.83%
- Estimated combined: ~+30% (needs validation)
Next Steps:
- Combined A/B test (E4-1 + E4-2 simultaneously)
- Measure actual cumulative effect
- Profile new baseline for next optimization targets
Deliverables:
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_DESIGN.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-2 added)
- CURRENT_TASK.md (E4-2 complete)
- core/bench_profile.h (E4-2 promoted to default)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 05:13:29 +09:00
4a070d8a14
Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)
...
Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path
Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
- Consolidates 2 TLS reads → 1 TLS read (50% reduction)
- Reduces 4 branches → 3 branches (25% reduction)
- Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median
Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset
Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%
E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
- Branch prediction hint mismatch (UNLIKELY with always-true)
- Retest confirmed -1.78% regression
- Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer
Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 04:24:34 +09:00
88717a8737
Phase 4 E1: ENV Snapshot Consolidation - GO (+3.92% avg, +4.01% median)
...
Target: Consolidate 3 ENV gate TLS reads → 1 TLS read
- tiny_c7_ultra_enabled_env(): 1.28% self
- tiny_front_v3_enabled(): 1.01% self
- tiny_metadata_cache_enabled(): 0.97% self
- Total overhead: 3.26% self (perf profile analysis)
Implementation:
- core/box/hakmem_env_snapshot_box.h (new): ENV snapshot struct & API
- core/box/hakmem_env_snapshot_box.c (new): TLS snapshot implementation
- core/front/malloc_tiny_fast.h: Migrated 5 call sites to snapshot
- core/box/tiny_legacy_fallback_box.h: Migrated 2 call sites
- core/box/tiny_metadata_cache_hot_box.h: Migrated 1 call site
- core/bench_profile.h: Added hakmem_env_snapshot_refresh_from_env()
- Makefile: Added hakmem_env_snapshot_box.o to build
- ENV gate: HAKMEM_ENV_SNAPSHOT=0/1 (default: 0, research box)
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (E1=0): 43,617,549 ops/s (avg), 43,562,895 ops/s (median)
- Optimized (E1=1): 45,327,239 ops/s (avg), 45,309,218 ops/s (median)
- Improvement: avg +3.92%, median +4.01%
Decision: GO (+3.92% >= +2.5% threshold)
- Action: Keep as research box (default OFF) for Phase 4
- Next: Consider promotion to default in MIXED_TINYV3_C7_SAFE preset
Design Rationale:
- Shape optimizations (B3, D3) reached saturation (+0.56% NEUTRAL)
- Shift to memory/TLS overhead optimization (new optimization frontier)
- Pattern: Similar to existing tiny_front_v3_snapshot (proven approach)
- Expected: +1-3% from 3.26% ENV overhead → Achieved: +3.92%
Technical Details:
- Consolidation: 3 TLS reads → 1 TLS read (66% reduction)
- Learner interlock: tiny_metadata_cache_eff pre-computed in snapshot
- Version sync: Refreshes on small_policy_v7_version_changed()
- Fallback safety: Existing ENV gates still available when E1=0
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-14 00:59:12 +09:00
deecda7336
Phase 3 C2: Slab Metadata Cache Optimization (3 patches) - NEUTRAL
...
Patch 1: Policy Hot Cache
- Add TinyPolicyHot struct (route_kind[8] cached in TLS)
- Eliminate policy_snapshot() calls (~2 memory ops saved)
- Safety: disabled when learner v7 active
- Files: tiny_metadata_cache_env_box.h, tiny_metadata_cache_hot_box.{h,c}
- Integration: malloc_tiny_fast.h route selection
Patch 2: First Page Inline Cache
- Cache current slab page pointer in TLS per-class
- Avoid superslab metadata lookup (1-2 memory ops)
- Fast-path in tiny_legacy_fallback_free_base()
- Files: tiny_first_page_cache.h, tiny_unified_cache.c
- Integration: tiny_legacy_fallback_box.h
Patch 3: Bounds Check Compile-out
- Hardcode unified_cache capacity as MACRO constant
- Eliminate modulo operation (constant fold)
- Macros: TINY_UNIFIED_CACHE_CAPACITY_POW2=11, CAPACITY=2048, MASK=2047
- File: tiny_unified_cache.h
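A minimal sketch of the Patch 3 constant-capacity indexing (macro names and values are from this commit; the helper is illustrative):
```c
#define TINY_UNIFIED_CACHE_CAPACITY_POW2 11
#define TINY_UNIFIED_CACHE_CAPACITY      (1u << TINY_UNIFIED_CACHE_CAPACITY_POW2)  /* 2048 */
#define TINY_UNIFIED_CACHE_MASK          (TINY_UNIFIED_CACHE_CAPACITY - 1u)        /* 2047 */

static inline unsigned uc_wrap(unsigned idx) {
    return idx & TINY_UNIFIED_CACHE_MASK;   /* constant-folded mask, no runtime modulo */
}
```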
A/B Test Results (Mixed, 10-run):
- Baseline (C2=0): 40.43M ops/s (avg), 40.72M ops/s (median)
- Optimized (C2=1): 40.25M ops/s (avg), 40.29M ops/s (median)
- Improvement: -0.45% (avg), -1.06% (median)
- DECISION: NEUTRAL (within ±1.0% threshold)
- Action: Keep as research box (ENV gate OFF by default)
Cumulative Gain (Phase 2-3):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- C2 (Metadata cache): -0.45%
- Total: ~6.1% (from baseline 37.5M → 39.8M ops/s)
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 19:19:42 +09:00
1798ed656d
Phase 3 C3: Tiny Static Routing Box Implementation (Step 1A)
...
Research Box Implementation:
- core/box/tiny_static_route_box.h: TinyStaticRoute struct & API
- core/box/tiny_static_route_box.c: Static route table management
- Makefile: Added tiny_static_route_box.o to 3 OBJS lists
Design:
- ENV gate: HAKMEM_TINY_STATIC_ROUTE=0/1 (default: 0)
- Learner auto-disable: If HAKMEM_TINY_LEARNER_ENABLED=1, force OFF
- Constructor priority: 102 (runs after wrapper_env_ctor at 101)
- Thread-safe: Atomic CAS for exactly-once initialization
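A minimal sketch of the exactly-once constructor init described above (priorities 101/102 are from this commit; the state machine is an assumption):
```c
#include <stdatomic.h>

static _Atomic int g_static_route_init_state = 0;   /* 0=uninit, 1=initializing, 2=ready */

__attribute__((constructor(102)))                    /* runs after wrapper_env_ctor (priority 101) */
static void tiny_static_route_ctor(void) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&g_static_route_init_state, &expected, 1)) {
        /* build the static route table here (omitted in this sketch) */
        atomic_store(&g_static_route_init_state, 2);
    }
}
```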
Baseline Profiling (Step 0 Complete):
- Throughput: 46.2M ops/s (10M iterations × 400 ws)
- Instructions/cycle: 2.11 insn/cycle
- Frontend stalls: 10.62% (memory latency bottleneck)
- Cache-misses: 3.46% of references
Expected C3 gain: +5-8% (policy_snapshot bypass)
Next Steps (Step 1B onwards):
1. Integrate static route into malloc_tiny_fast_for_class()
2. A/B test: Mixed 10-run, expect +1% minimum for GO
3. Decision: GO if +1%, NO-GO if -1%, else freeze
Status:
✅ Phase 2 (B3+B4): +4.4% cumulative
✅ Phase 3 planning & C3 Step 0-1A complete
⏳ Phase 3 C3 Step 1B-3 pending (malloc integration & testing)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 18:04:14 +09:00
d9991f39ff
Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
...
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation
Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions
Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration
Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-13 05:35:46 +09:00
1a8652a91a
Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)
...
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS
Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters
A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-12 16:26:42 +09:00
212739607a
Phase v11a-3: MID v3.5 Activation (Build Complete)
...
Integrated MID v3.5 into active code path, making it available for C5/C6/C7 routing.
Key Changes:
- Policy Box: Added SMALL_ROUTE_MID_V35 with ENV gates (HAKMEM_MID_V35_ENABLED, HAKMEM_MID_V35_CLASSES)
- HotBox: Implemented small_mid_v35_alloc/free with TLS-cached page allocation
- Front Gate: Wired MID_V35 routing into malloc_tiny_fast.h (priority: ULTRA > MID_V35 > V7)
- Build: Added core/smallobject_mid_v35.o to all object lists
Architecture:
- Slot sizes: C5=384B, C6=512B, C7=1024B
- Page size: 64KB (170/128/64 slots)
- Integration: ColdIface v2 (refill/retire), Stats v2 (observation), Learner v2 (dormant)
Status: Build successful, ready for A/B benchmarking
Next: Performance validation (C6-heavy, C5+C6-only, Mixed benchmarks)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-12 06:52:14 +09:00
0dba67ba9d
Phase v11a-2: Core MID v3.5 implementation - segment, cold iface, stats, learner
...
Implement 5-layer infrastructure for multi-class MID v3.5 (C5-C7, 257-1KiB):
1. SegmentBox_mid_v3 (L2 Physical)
- core/smallobject_segment_mid_v3.c (9.5 KB)
- 2MiB segments, 64KiB pages (32 per segment)
- Per-class free page stacks (LIFO)
- RegionIdBox registration
- Slots: C5→170, C6→102, C7→64
2. ColdIface_mid_v3 (L2→L1)
- core/box/smallobject_cold_iface_mid_v3_box.h (NEW)
- core/smallobject_cold_iface_mid_v3.c (3.5 KB)
- refill: get page from free stack or new segment
- retire: calculate free_hit_ratio, publish stats, return to stack
- Clean separation: TLS cache for hot path, ColdIface for cold path
3. StatsBox_mid_v3 (L2→L3)
- core/smallobject_stats_mid_v3.c (7.2 KB)
- Circular buffer history (1000 events)
- Per-page metrics: class_idx, allocs, frees, free_hit_ratio_bps
- Periodic aggregation (every 100 retires)
- Learner notification callback
4. Learner v2 (L3)
- core/smallobject_learner_v2.c (11 KB)
- Multi-class aggregation: allocs[8], retire_count[8], avg_free_hit_bps[8]
- Exponential smoothing (90% history + 10% new)
- Per-class efficiency tracking
- Stats snapshot API
- Route decision disabled for v11a-2 (v11b feature)
5. Build Integration
- Modified Makefile: added 4 new .o files (segment, cold_iface, stats, learner)
- Updated box header prototypes
- Clean compilation, all dependencies resolved
Architecture Decision Implementation:
- v7 remains frozen (C5/C6 research preset)
- MID v3.5 becomes unified 257-1KiB main path
- Multi-class isolation: per-class free stacks
- Dormant infrastructure: linked but not active (zero overhead)
Performance:
- Build: clean compilation
- Sanity benchmark: 27.3M ops/s (no regression vs v10)
- Memory: ~30MB RSS (baseline maintained)
Design Compliance:
✅ Layer separation: L2 (segment) → L2 (cold iface) → L3 (stats) → L3 (learner)
✅ Hot path clean: alloc/free never touch stats/learner
✅ Backward compatible: existing MID v3 routes unchanged
✅ Transparent: v11a-2 is dormant (no behavior change)
Next Phase (v11a-3):
- Activate C5/C6/C7 routing through MID v3.5
- Connect TLS cache to segment refill
- Verify performance under load
- Then Phase v11a-4: dynamic C5 ratio routing
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-12 06:37:06 +09:00
8143e8b797
Phase v7-4: Policy Box introduction (clarify the L3 layer, rebuild the front-end core)
...
- SmallPolicyV7 Box: placed in the L3 Policy layer, centralizes route decisions
- Route kind enum: SMALL_ROUTE_ULTRA / V7 / MID_V3 / LEGACY
- ENV priority (fixed): ULTRA > v7 > MID_v3 > LEGACY
- Frontend integration: v7 routing now goes through the Policy Box (staged migration)
- Legacy compatibility: the existing tiny_route_env_box.h remains in use alongside it
Box Theory layer structure:
- L0: ULTRA (C4-C7, FROZEN)
- L1: SmallObject v7 (research box)
- L1': MID_v3 / LEGACY (fallback)
- L2: Segment / RegionId
- L3: Policy / Stats / Learner ← Policy Box added here
Frontend now follows clean "size→class→route_kind→switch" pattern.
ENV variables read once at Policy init, not scattered across frontend.
Future: ULTRA/MID_v3/LEGACY consolidation, Learner integration, flexible priority.
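A minimal sketch of the size→class→route_kind→switch pattern (the route enum follows this commit; every helper below is an assumed name for illustration):
```c
#include <stddef.h>

typedef enum { SMALL_ROUTE_ULTRA, SMALL_ROUTE_V7, SMALL_ROUTE_MID_V3, SMALL_ROUTE_LEGACY } small_route_kind;

/* All helpers below are assumptions, not the repo's real API. */
extern int              small_size_to_class(size_t size);
extern small_route_kind small_policy_route_for_class(int cls);   /* reads the Policy Box snapshot */
extern void *small_ultra_alloc(int cls);
extern void *small_v7_alloc(int cls);
extern void *small_mid_v3_alloc(int cls);
extern void *small_legacy_alloc(int cls);

static inline void *small_alloc_front(size_t size) {
    int cls = small_size_to_class(size);
    switch (small_policy_route_for_class(cls)) {   /* ENV read once at Policy init, not here */
        case SMALL_ROUTE_ULTRA:  return small_ultra_alloc(cls);
        case SMALL_ROUTE_V7:     return small_v7_alloc(cls);
        case SMALL_ROUTE_MID_V3: return small_mid_v3_alloc(cls);
        default:                 return small_legacy_alloc(cls);
    }
}
```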
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-12 03:50:58 +09:00
39a3c53dbc
Phase v7-2: SmallObject v7 C6-only implementation with RegionIdBox integration
...
- SmallSegment_v7: 2MiB segment with TLS slot and free page stack
- ColdIface_v7: Page refill/retire between HotBox and SegmentBox
- HotBox_v7: Full C6-only alloc/free with header writing (HEADER_MAGIC|class_idx)
- Free path early-exit: Check v7 route BEFORE ss_fast_lookup (separate mmap segment)
- RegionIdBox: Register v7 segment for ptr->region lookup
- Benchmark: v7 ON ~54.5M ops/s (-7% overhead vs 58.6M legacy baseline)
v7 correctly balances alloc/free counts and page lifecycle.
RegionIdBox overhead identified as primary cost driver.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-12 03:12:28 +09:00
7bb179df6c
Fix: Add core/mid_hotbox_v3.o to BENCH_HAKMEM_OBJS_BASE
...
core/mid_hotbox_v3.o was missing from BENCH_HAKMEM_OBJS_BASE, causing
linker errors. Added it after core/region_id_v6.o.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-12 01:06:30 +09:00
510cf338f3
MID-V3-6: hakmem.c integration (box modularization)
...
Integrate MID/Pool v3 into hakmem.c main allocation path using
box modularization pattern.
Changes:
- core/hakmem.c: Include MID v3 headers
- core/box/hak_alloc_api.inc.h: Add v3 allocation gate
- C6 (145-256B) and C7 (769-1024B) size classes
- ENV opt-in via HAKMEM_MID_V3_ENABLED + HAKMEM_MID_V3_CLASSES
- Priority: v6 > v3 > v4 > pool
- core/box/hak_free_api.inc.h: Add v3 free path
- RegionIdBox lookup based ownership check
- Makefile: Add core/mid_hotbox_v3.o to TINY_BENCH_OBJS_BASE
ENV controls (default OFF):
HAKMEM_MID_V3_ENABLED=1
HAKMEM_MID_V3_CLASSES=0x40 (C6)
HAKMEM_MID_V3_CLASSES=0x80 (C7)
HAKMEM_MID_V3_DEBUG=1
Verified with bench_mid_large_mt_hakmem (7-9M ops/s, no crashes)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-12 01:04:55 +09:00
710541b69e
MID-V3 Phase 3-5: RegionId integration, alloc/free implementation
...
- MID-V3-3: RegionId integration (page registration at carve)
- mid_segment_v3_carve_page(): Register with RegionIdBox
- mid_segment_v3_return_page(): Unregister from RegionIdBox
- Uses REGION_KIND_MID_V3 for region identification
- MID-V3-4: Allocation fast path implementation
- mid_hot_v3_alloc_slow(): Slow path for lane miss
- mid_cold_v3_refill_page(): Segment-based page allocation
- mid_lane_refill_from_page(): Batch transfer (16 items default)
- mid_page_build_freelist(): Initial freelist construction
- MID-V3-5: Free/cold path implementation
- mid_hot_v3_free(): RegionIdBox lookup based free
- mid_page_push_free(): Page freelist push
- Local/remote page detection via lane ownership
ENV controls (default OFF):
HAKMEM_MID_V3_ENABLED=1
HAKMEM_MID_V3_CLASSES=0xC0 (C6+C7)
HAKMEM_MID_V3_DEBUG=1
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-12 00:53:42 +09:00
df216b6901
Phase V6-HDR-3: SmallSegmentV6 real allocation & RegionIdBox registration
...
Implementation:
1. SmallSegmentV6 mmap allocation was already implemented in v6-0
2. small_heap_ctx_v6() now calls region_id_register_v6_segment() when it acquires a segment
3. TLS-scoped segment registration logic implemented in region_id_v6.c:
- Segment info cached in 4 static __thread variables
- region_id_register_v6_segment(): records the segment base/end in TLS
- region_id_lookup_v6(): runs the TLS segment range check first
- TLS cache updates give O(1) lookup
4. region_id_v6_box.h: added the SmallSegmentV6 type include & function declarations
5. small_v6_region_observe_validate(): added a region_id_observe_lookup() call
Effects:
- With the headerless design, RegionIdBox can now officially return the SMALL_V6 classification
- Concise TLS-scoped registration mechanism (multi-thread capable)
- Fast path: TLS segment range check -> page_meta lookup
- Fallback path: dynamic detection via the legacy small_page_meta_v6_of()
- Latency: the O(1) TLS cache hit covers the bulk of v6 alloc/free
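A minimal sketch of the TLS-scoped registration and lookup described above (function names are from this commit; the four cached fields are assumptions):
```c
#include <stdint.h>

static __thread uintptr_t t_v6_seg_base;   /* the 4 __thread variables caching segment info */
static __thread uintptr_t t_v6_seg_end;
static __thread void     *t_v6_seg;
static __thread int       t_v6_seg_valid;

void region_id_register_v6_segment(void *seg, uintptr_t base, uintptr_t end) {
    t_v6_seg = seg; t_v6_seg_base = base; t_v6_seg_end = end; t_v6_seg_valid = 1;
}

void *region_id_lookup_v6(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    if (t_v6_seg_valid && p >= t_v6_seg_base && p < t_v6_seg_end)
        return t_v6_seg;                   /* O(1) TLS range check first */
    return NULL;                           /* caller falls back to small_page_meta_v6_of() */
}
```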
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 23:51:48 +09:00
9fb2240319
Fix: Add alloc_gate_stats_box.o to BENCH_HAKMEM_OBJS_BASE; Document PERF-ULTRA-REBASE-4 findings
...
Phase PERF-ULTRA-REBASE-4 confirmed:
- dispatcher (25.48%) and alloc gate (21.13%) already heavily optimized via snapshot
- New bottleneck: C7 ULTRA refill path (tiny_c7_ultra_page_of at 1.78%)
- Recommendation: Next optimize C7 ULTRA refill for +1-2% overall gain
2025-12-11 21:36:58 +09:00
0f15adae4e
Phase ALLOC-GATE-OPT-1: tiny_alloc_gate_fast stats instrumentation
...
- Added AllocGateStats struct (size2class/route/env/class distributions)
- Embedded counters in malloc_tiny_fast
- ENV: HAKMEM_ALLOC_GATE_STATS (default 0)
- No behavior change (measurement only)
Measurement results:
- Mixed: total=542k, size2class=0, route_calls=0, env_checks=275k, C4-C7=95.2%
- size_to_class/route_for_class are fully eliminated (LUT effect)
- C4-C7 account for 95% → the ULTRA fast path is effective
- env_checks ≈ c7_calls → the C7 ULTRA ENV gate is evaluated on every call
- C6-heavy: total=11 → malloc_tiny_fast is almost never taken (mid/pool dominate)
Conclusion:
- The alloc gate is already well optimized (reduced via LUT + ULTRA)
- Remaining headroom is small (env_checks already lightweight; a few percent at most)
- Next phase should target other bottlenecks such as the free dispatcher (29%) and C7 ULTRA refill (7%)
Details: docs/analysis/ALLOC_GATE_ANALYSIS.md
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 21:32:40 +09:00
118c0e4857
Phase FREE-DISPATCHER-OPT-1: free dispatcher stats instrumentation
...
**Goal**: break down and measure the free dispatcher (29%).
**Implementation**:
- Added FreeDispatchStats struct (ENV: HAKMEM_FREE_DISPATCH_STATS, default 0)
- Counters: total_calls / domain (tiny/mid/large) / route (ultra/legacy/pool/v6) / env_checks / route_for_class_calls
- Embedded counters in hak_free_at / tiny_route_for_class / tiny_route_snapshot_init
- No behavior change (measurement only; zero overhead when ENV is OFF)
**Measurement results**:
Mixed 16-1024B (1M iter, ws=400):
- total=8,081, route_calls=267,967, env_checks=9
- Most frees return early thanks to BENCH_FAST_FRONT
- route_for_class is called mainly on the alloc side (267k calls vs 8k frees)
- ENV checks: only 9 at initialization (snapshot effect)
C6-heavy (257-768B, 1M iter, ws=400):
- total=500,099, route_calls=1,034, env_checks=9
- Many frees reach fg_classify_domain
- route_for_class calls are minimal (snapshot effect)
**Conclusion**:
- ENV checks are already well optimized (init-time only)
- route_for_class is mostly called on the alloc side; the free side is O(1) via the snapshot
- Next phase (OPT-2) will explore a different approach
**Documentation added**:
- docs/analysis/FREE_DISPATCHER_ANALYSIS.md (new)
- CURRENT_TASK.md: added a Phase FREE-DISPATCHER-OPT-1 section
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-11 21:21:40 +09:00
fb88725a43
Phase FREE-LEGACY-OPT-6: C4 ULTRA Implementation
...
Implement C4 ULTRA free TLS cache with parasitic free+alloc pattern,
achieving 99.7-99.9% elimination of C4 legacy fallback calls.
Key Features:
- TLS cache cap=64 (tuned for L1 cache fit, smaller than C5/C6's 128)
- Segment learning via ss_fast_lookup() on first free
- Free-side cache push + alloc-side TLS pop pattern
- ENV gate: HAKMEM_TINY_C4_ULTRA_FREE_ENABLED (default OFF)
- Full FREE_PATH_STATS instrumentation
Benchmark Results:
C4-heavy (65-128B range):
- C4 legacy: 591,583 → 1,711 (-99.7%)
- c4_ultra cache hits: ~599k (free) + ~599k (alloc)
- Mixed load: 340,732 → 284 C4 legacy (-99.9%)
Legacy fallback reduction:
- C4-heavy: 589,872 fewer legacy calls (-10.9% total)
- Mixed: 340,448 fewer C4 legacy calls (-12.8% in mixed)
Performance note: ~2% throughput cost in isolated C4-heavy case,
acceptable tradeoff for 99%+ legacy elimination per class.
Files:
NEW: core/box/tiny_c4_ultra_free_box.h/c
NEW: core/box/tiny_c4_ultra_free_env_box.h
MOD: core/box/tiny_ultra_classes_box.h (added C4 macros)
MOD: core/box/free_path_stats_box.h/c (C4 ULTRA counters)
MOD: core/front/malloc_tiny_fast.h (C4 alloc+free integration)
MOD: Makefile (added C4 ULTRA object)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 19:38:27 +09:00
ea6ed1a6e4
Phase FREE-LEGACY-OPT-5-1/5-2: C5 ULTRA free+alloc integration
...
Summary:
========
Implemented C5 ULTRA TLS cache pattern following the successful C6 ULTRA design:
- Phase 5-1: Free-side TLS cache + segment learning
- Phase 5-2: Alloc-side TLS pop for complete free+alloc cycle integration
Targets C5 class (129-256B) as next legacy reduction after C6 completion.
Key Changes:
============
1. NEW FILES:
- core/box/tiny_c5_ultra_free_box.h: C5 ULTRA TLS cache structure
- core/box/tiny_c5_ultra_free_box.c: C5 free path implementation (same pattern as C6)
- core/box/tiny_c5_ultra_free_env_box.h: ENV gating (HAKMEM_TINY_C5_ULTRA_FREE_ENABLED)
2. MODIFIED FILES:
- core/front/malloc_tiny_fast.h:
* Added C5 ULTRA includes
* Added C5 alloc-side TLS pop at lines 186-194 (integrated with C6)
* Added C5 free path at lines 333-337 (integrated with C6)
- core/box/tiny_ultra_classes_box.h:
* Added TINY_CLASS_C5 constant
* Added tiny_class_is_c5() macro
* Extended tiny_class_is_ultra() to include C5
- core/box/free_path_stats_box.h:
* Added c5_ultra_free_fast counter
* Added c5_ultra_alloc_hit counter
- core/box/free_path_stats_box.c:
* Updated stats dump to output C5 counters
- Makefile:
* Added core/box/tiny_c5_ultra_free_box.o to all object lists
3. Design Rationale:
- Exact copy of C6 ULTRA pattern (proven effective)
- TLS cache capacity: 128 blocks (same as C6 for consistency)
- Segment learning on first C5 free via ss_fast_lookup()
- Alloc-side pop integrated directly in malloc_tiny_fast.h hotpath
- Legacy fallback unification via tiny_legacy_fallback_free_base()
4. Expected Impact:
- C5 legacy calls: 68,871 → 0 (100% elimination)
- Total legacy reduction: ~53% of remaining 129,623
- Mixed workload: Minimal regression (C5 is smaller class, fewer allocations)
5. Stats Collection:
Run with: HAKMEM_TINY_C5_ULTRA_FREE_ENABLED=1 HAKMEM_FREE_PATH_STATS=1 ./bench_allocators_hakmem
Expected output:
[FREE_PATH_STATS] ... c5_ultra_free=68871 c5_ultra_alloc=68871 ... legacy_fb=60752 ...
[FREE_PATH_STATS_LEGACY_BY_CLASS] ... c5=0 ...
Status:
=======
- Code: ✅ COMPLETE (3 new files + 5 modified files)
- Compilation: ✅ Verified (no errors, only unused variable warnings unrelated to C5)
- Functionality: Ready to benchmark (ENV gating: default OFF, opt-in via ENV)
Phase Progression:
==================
✅ Phase 4-4: C6 ULTRA free+alloc (legacy C6: 137,319 → 0)
✅ Phase 5-1/5-2: C5 ULTRA free+alloc (legacy C5: 68,871 → 0 expected)
⏳ Phase 4.5: C4 ULTRA (34,727 remaining)
📋 Future: C3/C2 ULTRA if beneficial
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 19:26:51 +09:00
7b7de53167
Phase FREE-FRONT-V3-1: Free route snapshot infrastructure + build fix
...
Summary:
========
Implemented Phase FREE-FRONT-V3 infrastructure to optimize free hotpath by:
1. Creating snapshot-based route decision table (consolidating route logic)
2. Removing redundant ENV checks from hot path
3. Preparing for future integration into hak_free_at()
Key Changes:
============
1. NEW FILES:
- core/box/free_front_v3_env_box.h: Route snapshot definition & API
- core/box/free_front_v3_env_box.c: Snapshot initialization & caching
2. Infrastructure Details:
- FreeRouteSnapshotV3: Maps class_idx → free_route_kind for all 8 classes
- Routes defined: LEGACY, TINY_V3, CORE_V6_C6, POOL_V1
- ENV-gated initialization (HAKMEM_TINY_FREE_FRONT_V3_ENABLED, default OFF)
- Per-thread TLS caching to avoid repeated ENV reads (see the sketch after this list)
3. Design Goals:
- Consolidate tiny_route_for_class() results into snapshot table
- Remove C7 ULTRA / v4 / v5 / v6 ENV checks from hot path
- Limit lookup (ss_fast_lookup/slab_index_for) to paths that truly need it
- Clear ownership boundary: front v3 handles routing, downstream handles free
4. Phase Plan:
- v3-1 ✅ COMPLETE: Infrastructure (snapshot table, ENV initialization, TLS cache)
- v3-2 (INFRASTRUCTURE ONLY): Placeholder integration in hak_free_api.inc.h
- v3-3 (FUTURE): Full integration + benchmark A/B to measure hotpath improvement
5. BUILD FIX:
- Added missing core/box/c7_meta_used_counter_box.o to OBJS_BASE in Makefile
- This symbol was referenced but not linked, causing undefined reference errors
- Benchmark targets now build cleanly without LTO
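For reference, a snapshot of the shape described under Infrastructure Details might look like the following. This is a sketch with illustrative names and a hardcoded example route; the actual definitions are in core/box/free_front_v3_env_box.h/.c.
```c
/* Hypothetical sketch of a per-thread free-route snapshot (names are illustrative). */
#include <stdint.h>
#include <stdlib.h>

typedef enum {
    FREE_ROUTE_LEGACY = 0,
    FREE_ROUTE_TINY_V3,
    FREE_ROUTE_CORE_V6_C6,
    FREE_ROUTE_POOL_V1,
} FreeRouteKindV3;

typedef struct {
    uint8_t         initialized;
    FreeRouteKindV3 route[8];       /* class_idx (C0-C7) -> free route */
} FreeRouteSnapshotV3;

static __thread FreeRouteSnapshotV3 g_free_route_v3;

/* Build the snapshot once per thread so the hot path never re-reads ENV. */
static inline const FreeRouteSnapshotV3 *free_route_snapshot_v3(void) {
    FreeRouteSnapshotV3 *s = &g_free_route_v3;
    if (!s->initialized) {
        const char *e = getenv("HAKMEM_TINY_FREE_FRONT_V3_ENABLED");
        int enabled = (e && e[0] == '1');
        for (int c = 0; c < 8; c++)
            s->route[c] = FREE_ROUTE_LEGACY;       /* default route for every class */
        if (enabled)
            s->route[6] = FREE_ROUTE_CORE_V6_C6;   /* e.g. C6 routed to Core v6 */
        s->initialized = 1;
    }
    return s;
}
```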
Status:
=======
- Build: ✅ PASS (bench_allocators_hakmem builds without errors)
- Integration: Currently DISABLED (default OFF, ready for v3-2 phase)
- No performance impact: Infrastructure-only, hotpath unchanged
Future Work:
============
- Phase v3-2: Integrate snapshot routing into hak_free_at() main path
- Phase v3-3: Measure the free hotpath performance improvement (target: 1-2% fewer branch mispredictions)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 19:17:30 +09:00
1b196b3ac0
Phase FREE-LEGACY-OPT-4-2/4-3: C6 ULTRA-free TLS cache + segment learning
...
Phase 4-2:
- Add TinyC6UltraFreeTLS structure with 128-slot TLS freelist
- Implement tiny_c6_ultra_free_fast/slow for C6 free hot path
- Add c6_ultra_free_fast counter to FreePathStats
- ENV gate: HAKMEM_TINY_C6_ULTRA_FREE_ENABLED (default: OFF)
Phase 4-3:
- Add segment learning on first C6 free via ss_fast_lookup()
- Learn seg_base/seg_end from the SuperSlab for the range check (sketched after this list)
- Increase cache capacity from 32 to 128 blocks
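The segment learning step is roughly the following. ss_fast_lookup() exists in the tree, but its signature and the fields read here are assumptions made for illustration only.
```c
/* Hypothetical sketch of one-shot segment learning on the first C6 free. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uintptr_t base; size_t size; } SuperSlabView;   /* illustrative stand-in */

SuperSlabView *ss_fast_lookup(const void *ptr);                  /* assumed signature */

static __thread uintptr_t g_c6_seg_base, g_c6_seg_end;

static inline void c6_ultra_learn_segment(const void *ptr) {
    if (g_c6_seg_base) return;               /* learn only once per thread */
    SuperSlabView *ss = ss_fast_lookup(ptr);
    if (!ss) return;                         /* unknown block: legacy fallback handles it */
    g_c6_seg_base = ss->base;
    g_c6_seg_end  = ss->base + ss->size;     /* later frees only do a cheap range check */
}
```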
Results:
- Segment learning works: fast path captures blocks in segment
- However, without alloc integration, cache fills up and overflows to legacy
- Net effect: +1-3% (within noise range)
- Drain strategy also tested: no benefit (equal overhead)
Conclusion:
- Free-only TLS cache is limited without alloc-side integration
- Core v6 already has alloc/free integrated TLS (but -12% slower)
- Keep as research box (ENV default OFF)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-11 18:34:27 +09:00
c60199182e
Phase v6-1/2/3/4: SmallObject Core v6 - C6-only implementation + refactor
...
Phase v6-1: C6-only route stub (v1/pool fallback)
Phase v6-2: Segment v6 + ColdIface v6 + Core v6 HotPath implementation
- 2MiB segment / 64KiB page allocation
- O(1) ptr→page_meta lookup with segment masking (sketched after this list)
- C6-heavy A/B: SEGV-free but -44% performance (15.3M ops/s)
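The O(1) lookup relies on 2MiB-aligned segments: masking a pointer recovers the segment header, and the in-segment offset indexes the page metadata. A minimal sketch, with an assumed struct layout and illustrative names:
```c
/* Hypothetical sketch of O(1) ptr -> page metadata lookup via segment masking.
 * The 2MiB segment / 64KiB page geometry comes from the commit text. */
#include <stddef.h>
#include <stdint.h>

#define SMALL_V6_SEG_SIZE  (2u << 20)                              /* 2MiB segment */
#define SMALL_V6_PAGE_SIZE (64u << 10)                             /* 64KiB page   */
#define SMALL_V6_PAGES     (SMALL_V6_SEG_SIZE / SMALL_V6_PAGE_SIZE) /* 32 pages     */

typedef struct { uint16_t used; uint16_t class_idx; } SmallPageMetaV6;   /* illustrative */
typedef struct { SmallPageMetaV6 pages[SMALL_V6_PAGES]; } SmallSegmentV6;

/* Segments are 2MiB-aligned, so a mask recovers the segment base and the
 * remaining offset selects the page's metadata slot. */
static inline SmallPageMetaV6 *small_v6_page_meta_of(void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    SmallSegmentV6 *seg = (SmallSegmentV6 *)(p & ~(uintptr_t)(SMALL_V6_SEG_SIZE - 1));
    size_t idx = (p & (SMALL_V6_SEG_SIZE - 1)) / SMALL_V6_PAGE_SIZE;
    return &seg->pages[idx];
}
```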
Phase v6-3: Thin-layer optimization (TLS ownership check + batch header + refill batching)
- TLS ownership fast path skips the page_meta lookup for 90%+ of frees
- Batch header writes during refill (32 allocs = 1 header write)
- TLS batch refill (1/32 refill frequency)
- C6-heavy A/B: v6-2 15.3M → v6-3 27.1M ops/s (±0% vs baseline) ✅
Phase v6-4: Mixed hang fix (segment metadata lookup correction)
- Root cause: metadata lookup was reading mmap region instead of TLS slot
- Fix: use TLS slot descriptor with in_use validation
- Mixed health: 5M iterations SEGV-free, 35.8M ops/s ✅
Phase v6-refactor: Code quality improvements (macro unification + inline + docs)
- Add SMALL_V6_* prefix macros (header, pointer conversion, page index)
- Extract inline validation functions (small_page_v6_valid, small_ptr_in_segment_v6)
- Doxygen-style comments for all public functions
- Result: 0 compiler warnings, maintained +1.2% performance
Files:
- core/box/smallobject_core_v6_box.h (new, type & API definitions)
- core/box/smallobject_cold_iface_v6.h (new, cold iface API)
- core/box/smallsegment_v6_box.h (new, segment type definitions)
- core/smallobject_core_v6.c (new, C6 alloc/free implementation)
- core/smallobject_cold_iface_v6.c (new, refill/retire logic)
- core/smallsegment_v6.c (new, segment allocator)
- docs/analysis/SMALLOBJECT_CORE_V6_DESIGN.md (new, design document)
- core/box/tiny_route_env_box.h (modified, v6 route added)
- core/front/malloc_tiny_fast.h (modified, v6 case in route switch)
- Makefile (modified, v6 objects added)
- CURRENT_TASK.md (modified, v6 status added)
Status:
- C6-heavy: v6 OFF 27.1M → v6-3 ON 27.1M ops/s (±0%) ✅
- Mixed: v6 ON 35.8M ops/s (C6-only, other classes via v1) ✅
- Build: 0 warnings, fully documented ✅
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 15:29:59 +09:00
e0fb7d550a
Phase v5-2: SmallObject v5 C6-only full implementation (WIP - header fix)
...
Full-implementation fixes:
- Added tiny_region_id_write_header(): now correctly returns the USER pointer (see the sketch after this list)
- Segment lookup from TLS slots (page_meta_of)
- Segment reuse via page-level allocation
- Guaranteed 2MiB alignment (reserve 4MiB, then align)
- Fixed the free-path route (removed the v4 → v5 fallthrough)
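The USER-pointer fix follows the usual "write the header, return the block after it" pattern. A minimal sketch; the actual header layout behind tiny_region_id_write_header() is an assumption here.
```c
/* Hypothetical sketch: write a region-id header, hand back the USER pointer. */
#include <stdint.h>

#define TINY_V5_HEADER_SIZE sizeof(uintptr_t)   /* assumed one-word header */

static inline void *tiny_v5_write_header_and_return_user(void *raw, uintptr_t region_id) {
    uintptr_t *hdr = (uintptr_t *)raw;
    *hdr = region_id;                                        /* header precedes the user block */
    return (void *)((uintptr_t)raw + TINY_V5_HEADER_SIZE);   /* USER pointer, not the raw block */
}
```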
Verification:
- SEGV eliminated: basic alloc/free works
- Performance: ~18-20M ops/s (about 40-45% of the 43-47M baseline)
- Regression cause: O(n) linear scan of TLS slots, O(n) find_page
Remaining tasks:
- O(1) segment lookup optimization (hash or direct array indexing)
- Remove find_page when the segment lookup succeeds
- Optimize partial_count/list management
ENV default is OFF, so the mainline is unaffected.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-11 04:14:51 +09:00
9c24bebf08
Phase v5-1: SmallObject v5 C6-only route stub wiring
...
- tiny_route_env_box.h: added the TINY_ROUTE_SMALL_HEAP_V5 enum; the route snapshot now branches C6 → v5
- malloc_tiny_fast.h: added the v5 case to the alloc/free switch (v1/pool fallback, sketched below)
- smallobject_hotbox_v5.c: stub implementation (alloc returns NULL, free is a no-op)
- smallobject_hotbox_v5_box.h: added a ctx parameter to the function signatures
- Makefile: added core/smallobject_hotbox_v5.o to the link list
- ENV_PROFILE_PRESETS.md: added the v5-1 preset
- CURRENT_TASK.md: recorded Phase v5-1 completion
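The v5 case in the alloc switch behaves roughly as sketched below (the reduced enum and function names are illustrative, not the exact code in malloc_tiny_fast.h):
```c
/* Hypothetical sketch of the v5 dispatch case with v1/pool fallback. */
#include <stddef.h>

typedef enum { TINY_ROUTE_V1 = 0, TINY_ROUTE_SMALL_HEAP_V5 } TinyRouteKind;  /* reduced */

void *small_heap_v5_alloc(size_t size, int class_idx);   /* v5-1 stub: always returns NULL */
void *tiny_v1_pool_alloc(size_t size, int class_idx);    /* existing v1/pool path, assumed */

static inline void *tiny_alloc_dispatch(TinyRouteKind route, size_t size, int class_idx) {
    if (route == TINY_ROUTE_SMALL_HEAP_V5) {
        void *p = small_heap_v5_alloc(size, class_idx);
        if (p) return p;          /* v5-2 will return real blocks here */
        /* v5-1 stub returns NULL, so drop to the v1/pool route below */
    }
    return tiny_v1_pool_alloc(size, class_idx);
}
```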
**Characteristics**:
- ENV: opt-in via HAKMEM_SMALL_HEAP_V5_ENABLED=1 / HAKMEM_SMALL_HEAP_V5_CLASSES=0x40
- Test results: C6-heavy (v5 OFF 15.5M → v5 ON 16.4M ops/s, healthy), Mixed 47.2M ops/s, no SEGV/asserts
- Behavior matches the v1/pool fallback (the real implementation lands in v5-2)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-11 03:25:37 +09:00
bbb55b018a
Add C7 ULTRA segment skeleton and TLS freelist
2025-12-10 22:19:32 +09:00
cbd33511eb
Phase v4-3.1: reuse C7 v4 pages and record prep calls
2025-12-10 17:58:42 +09:00
acc64f2438
Phase ML1: Pool v1 memset 89.73% overhead reduction (+15.34% improvement)
...
## Summary
- Fixed the setenv segfault in bench_profile.h (with ChatGPT's help; switched to going through RTLD_NEXT)
- New core/box/pool_zero_mode_box.h: unified ZERO_MODE management via a cached ENV read
- core/hakmem_pool.c: memset control based on the zero mode (full/header/off); see the sketch after this list
- A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy)
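The zero-mode gating works roughly as sketched below. The full/header/off modes come from this commit; the enum, getter, ENV variable name, and header size parameter are illustrative assumptions.
```c
/* Hypothetical sketch of zero-mode-gated memset in the pool path. */
#include <stdlib.h>
#include <string.h>

typedef enum { POOL_ZERO_FULL, POOL_ZERO_HEADER, POOL_ZERO_OFF } PoolZeroMode;

/* Simplified cached ENV read; the real box centralizes this in pool_zero_mode_box.h. */
static PoolZeroMode pool_zero_mode(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *e = getenv("HAKMEM_POOL_ZERO_MODE");   /* ENV name is an assumption */
        cached = (e && !strcmp(e, "header")) ? POOL_ZERO_HEADER
               : (e && !strcmp(e, "off"))    ? POOL_ZERO_OFF
               :                               POOL_ZERO_FULL;
    }
    return (PoolZeroMode)cached;
}

static inline void pool_zero_block(void *p, size_t size, size_t header_size) {
    switch (pool_zero_mode()) {
    case POOL_ZERO_FULL:   memset(p, 0, size);        break;  /* old behavior: ~89.73% of time */
    case POOL_ZERO_HEADER: memset(p, 0, header_size); break;  /* +15.34% in the A/B run */
    case POOL_ZERO_OFF:    /* no clearing */          break;
    }
}
```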
## Files Modified
- core/box/pool_api.inc.h: pool_zero_mode_box.h include
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: documented the Phase ML1 results
## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K | 3.06 M ops/s | 3.17 M ops/s | +3.65% |
| 1M | 23.71 M ops/s | 27.34 M ops/s | **+15.34%** |
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-10 09:08:18 +09:00
a905e0ffdd
Guard madvise ENOMEM and stabilize pool/tiny front v3
2025-12-09 21:50:15 +09:00
fda6cd2e67
Boxify superslab registry, add bench profile, and document C7 hotpath experiments
2025-12-07 03:12:27 +09:00
03538055ae
Restore C7 Warm/TLS carve for release and add policy scaffolding
2025-12-06 01:34:04 +09:00