Files
hakmem/core/bench_profile.h

249 lines
14 KiB
C
Raw Permalink Normal View History

#pragma once
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#ifdef USE_HAKMEM
#include "box/wrapper_env_box.h" // wrapper_env_refresh_from_env (Phase 2 B4)
#include "box/tiny_static_route_box.h" // tiny_static_route_refresh_from_env (Phase 3 C3)
#include "box/hakmem_env_snapshot_box.h" // hakmem_env_snapshot_refresh_from_env (Phase 4 E1)
Phase 8: FREE-STATIC-ROUTE ENV Cache Hardening (GO +2.61%) Results: - A/B test: +2.61% on Mixed (10-run, clean env) - Baseline: 49.26M ops/s - Optimized: 50.55M ops/s - Improvement: +1.29M ops/s (+2.61%) Strategy: - Fix ENV cache accident (main前キャッシュ事故の修正) - Add refresh mechanism to sync with bench_profile putenv - Ensure Phase 3 D1 optimization works reliably Success factors: 1. Performance improvement: +2.61% (existing win-box now reliable) 2. ENV cache accident fixed: refresh mechanism works correctly 3. Standard deviation improved: 867K → 336K ops/s (61% reduction) 4. Baseline quality improved: existing optimization now guaranteed Implementation: - Patch 1: Make ENV gate refreshable (tiny_free_route_cache_env_box.{h,c}) - Changed static int to extern _Atomic int - Added tiny_free_static_route_refresh_from_env() - Patch 2: Integrate refresh into bench_profile.h - Call refresh after bench_setenv_default() group - Patch 3: Update Makefile for new .c file ENV cache fix verification: - [FREE_STATIC_ROUTE] enabled appears twice (refresh working) - bench_profile putenv now reliably reflected Files modified: - core/box/tiny_free_route_cache_env_box.h: extern + refresh API - core/box/tiny_free_route_cache_env_box.c: NEW (global state + refresh) - core/bench_profile.h: add refresh call - Makefile: add new .o file Health check: PASSED (all profiles) Rollback: HAKMEM_FREE_STATIC_ROUTE=0 or revert Patch 1/2 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 18:49:08 +09:00
#include "box/tiny_free_route_cache_env_box.h" // tiny_free_static_route_refresh_from_env (Phase 8)
Phase 13 v1 + E5-2 retest: Both NEUTRAL, freeze as research boxes Phase 13 v1: Header Write Elimination (C7 preserve header) - Verdict: NEUTRAL (+0.78%) - Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF) - Makes C7 nextptr offset conditional (0→1 when enabled) - 4-point matrix A/B test results: * Case A (baseline): 51.49M ops/s * Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%) * Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%) * Case D (both): 51.89M ops/s (+0.78% NEUTRAL) - Action: Freeze as research box (default OFF, manual opt-in) Phase 5 E5-2: Header Write-Once retest (promotion test) - Verdict: NEUTRAL (+0.54%) - Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run - Results (20-run): * Case A (baseline): 51.10M ops/s * Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%) - Previous test: +0.45% (consistent with NEUTRAL) - Action: Keep as research box (default OFF, manual opt-in) Key findings: - Header write tax optimization shows consistent NEUTRAL results - Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%) - Both implemented as reversible ENV gates for future research Files changed: - New: core/box/tiny_c7_preserve_header_env_box.{c,h} - Modified: core/box/tiny_layout_box.h (C7 offset conditional) - Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments) - Modified: core/bench_profile.h (refresh sync) - Modified: Makefile (add new .o files) - Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV) - Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results) Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO) 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 00:32:25 +09:00
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added ## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%) Target: Reduce alloc-side fixed costs by adding LEGACY direct path to FastLane entry, mirroring Phase 9/10 free-side winning pattern. Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as research box (default OFF). Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause: unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3 only (matching existing dualhot pattern). Files: - core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new) - core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119) - core/bench_profile.h (env refresh sync) - Makefile (new obj) - docs/analysis/PHASE16_*.md (design/results/instructions) ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in) Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/ routing optimization ROI is exhausted post-Phase-6 FastLane collapse. --- ## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed Purpose: Validate "system malloc faster" observation using same-binary A/B testing to isolate allocator logic差 vs binary layout penalty. Method: - Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem) - System binary: bench_random_mixed_system (21K separate binary) - Perf stat: Hardware counter analysis (I-cache, cycles, instructions) Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates. Gap breakdown (Mixed, 20M iters, ws=400): - hakmem (FORCE_LIBC=0): 48.12M ops/s - libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level) - system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem Perf stat (200M iters): - I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun) - Cycles: 17.9B → 10.2B = -43% - Instructions: 41.3B → 21.5B = -48% - Binary size: 653K → 21K (30x difference) Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >> algorithmic efficiency. Conclusion: Phase 12's "system malloc 1.6x faster" was real, but misattributed. Gap is layout/I-cache, NOT allocator algorithm. Files: - docs/analysis/PHASE17_*.md (results/instructions) - scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned) Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt) --- ## Phase 18: Hot Text Isolation — Design Added Purpose: Reduce I-cache misses + instruction footprint via layout control (binary optimization, not allocator algorithm changes). Strategy (v1 → v2 progression): v1 (TU split + hot/cold attrs + optional gc-sections): - Target: +2% throughput (GO threshold, realistic for layout tweaks) - Secondary: I-cache -10%, instructions -5% (direction confirmation) - Risk: Low (reversible via build knob) - Expected: +0-2% (NEUTRAL likely, but validates approach) v2 (BENCH_MINIMAL compile-out): - Target: +10-20% throughput (本命) - Method: Conditional compilation removes stats/ENV/debug from hot path - Expected: Instruction count -30-40% → significant I-cache improvement Files: - docs/analysis/PHASE18_*.md (design/instructions) - CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan) Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob) Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%) ## Phase 17 v2: FORCE_LIBC Gap Validation Fix **Critical bug fix**: Phase 17 v1 の測定が壊れていた **Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 が FastLane より後でしか見えず、 same-binary A/B が実質 "hakmem vs hakmem" になっていた(+0.39% 誤測定) **Fix**: core/box/hak_wrappers.inc.h:171 と :645 に g_force_libc_alloc==1 の early bypass を追加、__libc_malloc/__libc_free に最初に直行 **Result**: 正しい同一バイナリ A/B 測定 - hakmem (FORCE_LIBC=0): 48.99M ops/s - libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%) - system binary: 88.06M ops/s (+10.5% vs libc) **Gap 分解**: - Allocator 差: +62.7% (主戦場) - Layout penalty: +10.5% (副次的) **Conclusion**: Case A 確定 (allocator dominant, NOT layout) Phase 17 v1 の Case B 判定は誤り。 Files: - docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2) - docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated) --- ## Phase 19: FastLane Instruction Reduction Analysis **Goal**: libc との instruction gap (-35% instructions, -56% branches) を削減 **perf stat 分析** (FORCE_LIBC=0 vs 1, 200M ops): - hakmem: 209.09 instructions/op, 52.33 branches/op - libc: 135.92 instructions/op, 22.93 branches/op - Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%) **Hot path** (perf report): - front_fastlane_try_free: 23.97% cycles - malloc wrapper: 23.84% cycles - free wrapper: 6.82% cycles - **Wrapper overhead: ~55% of all cycles** **Reduction candidates**: - A: Wrapper layer 削除 (-17.5 inst/op, +10-15% 期待) - B: ENV snapshot 統合 (-10.0 inst/op, +5-8%) - C: Stats 削除 (-5.0 inst/op, +3-5%) - D: Header inline (-4.0 inst/op, +2-3%) - E: Route fast path (-3.5 inst/op, +2-3%) Files: - docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md - docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md --- ## Phase 19-1b: FastLane Direct — GO (+5.88%) **Strategy**: Wrapper layer を bypass し、core allocator を直接呼ぶ - free() → free_tiny_fast() (not free_tiny_fast_hot) - malloc() → malloc_tiny_fast() **Phase 19-1 が NO-GO (-3.81%) だった原因**: 1. __builtin_expect(fastlane_direct_enabled(), 0) が逆効果(A/B 不公平) 2. free_tiny_fast_hot() が誤選択(free_tiny_fast() が勝ち筋) **Phase 19-1b の修正**: 1. __builtin_expect() 削除 2. free_tiny_fast() を直接呼び出し **Result** (Mixed, 10-run, 20M iters, ws=400): - Baseline (FASTLANE_DIRECT=0): 49.17M ops/s - Optimized (FASTLANE_DIRECT=1): 52.06M ops/s - **Delta: +5.88%** (GO 基準 +5% クリア) **perf stat** (200M iters): - Instructions/op: 199.90 → 169.45 (-30.45, -15.23%) - Branches/op: 51.49 → 41.52 (-9.97, -19.36%) - Cycles/op: 88.88 → 84.37 (-4.51, -5.07%) - I-cache miss: 111K → 98K (-11.79%) **Trade-offs** (acceptable): - iTLB miss: +41.46% (front-end cost) - dTLB miss: +29.15% (backend cost) - Overall gain (+5.88%) outweighs costs **Implementation**: 1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c} - HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in) - Single _Atomic global (wrapper キャッシュ問題を解決) 2. **Wrapper 修正**: core/box/hak_wrappers.inc.h - malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1 - free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1 - Safety: !g_initialized では direct 使わない、fallback 維持 3. **Preset 昇格**: core/bench_profile.h:88 - bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1") - Comment: +5.88% proven on Mixed, 10-run 4. **cleanenv 更新**: scripts/run_mixed_10_cleanenv.sh:22 - HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1} - Phase 9/10 と同様に昇格 **Verdict**: GO — 本線採用、プリセット昇格完了 **Rollback**: HAKMEM_FASTLANE_DIRECT=0 で既存 FastLane path に戻る Files: - core/box/fastlane_direct_env_box.{h,c} (new) - core/box/hak_wrappers.inc.h (modified) - core/bench_profile.h (preset promotion) - scripts/run_mixed_10_cleanenv.sh (ENV default aligned) - Makefile (new obj) - docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md --- ## Cumulative Performance - Baseline (all optimizations OFF): ~40M ops/s (estimated) - Current (Phase 19-1b): 52.06M ops/s - **Cumulative gain: ~+30% from baseline** Remaining gap to libc (79.72M): - Current: 52.06M ops/s - Target: 79.72M ops/s - **Gap: +53.2%** (was +62.7% before Phase 19-1b) Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 11:28:40 +09:00
#include "box/fastlane_direct_env_box.h" // fastlane_direct_env_refresh_from_env (Phase 19-1)
#include "box/tiny_header_hotfull_env_box.h" // tiny_header_hotfull_env_refresh_from_env (Phase 21)
#include "box/tiny_inline_slots_fixed_mode_box.h" // tiny_inline_slots_fixed_mode_refresh_from_env (Phase 78-1)
Phase 86: Free Path Legacy Mask (NO-GO, +0.25%) ## Summary Implemented Phase 86 "mask-only commit" optimization for free path: - Bitset mask (0x7f for C0-C6) to identify LEGACY classes - Direct call to tiny_legacy_fallback_free_base_with_env() - No indirect function pointers (avoids Phase 85's -0.86% regression) - Fail-fast on LARSON_FIX=1 (cross-thread validation incompatibility) ## Results (10-run SSOT) **NO-GO**: +0.25% improvement (threshold: +1.0%) - Control: 51,750,467 ops/s (CV: 2.26%) - Treatment: 51,881,055 ops/s (CV: 2.32%) - Delta: +0.25% (mean), -0.15% (median) ## Root Cause Competing optimizations plateau: 1. Phase 9/10 MONO LEGACY (+1.89%) already capture most free path benefit 2. Remaining margin insufficient to overcome: - Two branch checks (mask_enabled + has_class) - I-cache layout tax in hot path - Direct function call overhead ## Phase 85 vs Phase 86 | Metric | Phase 85 | Phase 86 | |--------|----------|----------| | Approach | Indirect calls + table | Bitset mask + direct call | | Result | -0.86% | +0.25% | | Verdict | NO-GO (regression) | NO-GO (insufficient) | Phase 86 correctly avoided indirect call penalties but revealed architectural limit: can't escape Phase 9/10 overlay without restructuring. ## Recommendation Free path optimization layer has reached practical ceiling: - Phase 9/10 +1.89% + Phase 6/19/FASTLANE +16-27% ≈ 18-29% total - Further attempts on ceremony elimination face same constraints - Recommend focus on different optimization layers (malloc, etc.) ## Files Changed ### New - core/box/free_path_legacy_mask_box.h (API + globals) - core/box/free_path_legacy_mask_box.c (refresh logic) ### Modified - core/bench_profile.h (added refresh call) - core/front/malloc_tiny_fast.h (added Phase 86 fast path check) - Makefile (added object files) - CURRENT_TASK.md (documented result) All changes conditional on HAKMEM_FREE_PATH_LEGACY_MASK=1 (default OFF). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 22:05:34 +09:00
#include "box/free_path_commit_once_fixed_box.h" // free_path_commit_once_refresh_from_env (Phase 85)
#include "box/free_path_legacy_mask_box.h" // free_path_legacy_mask_refresh_from_env (Phase 86)
#include "box/tiny_c6_inline_slots_ifl_env_box.h" // tiny_c6_inline_slots_ifl_refresh_from_env (Phase 91)
#endif
// env が未設定のときだけ既定値を入れる
static inline void bench_setenv_default(const char* key, const char* val) {
if (getenv(key) != NULL) return;
static void* (*real_malloc)(size_t) = NULL;
static int (*real_putenv)(char*) = NULL;
if (!real_malloc) {
real_malloc = (void* (*)(size_t))dlsym(RTLD_NEXT, "malloc");
if (!real_malloc) real_malloc = malloc;
}
if (!real_putenv) {
real_putenv = (int (*)(char*))dlsym(RTLD_NEXT, "putenv");
if (!real_putenv) real_putenv = putenv;
}
size_t klen = strlen(key);
size_t vlen = strlen(val);
char* buf = (char*)real_malloc(klen + vlen + 2);
if (!buf) return;
memcpy(buf, key, klen);
buf[klen] = '=';
memcpy(buf + klen + 1, val, vlen);
buf[klen + 1 + vlen] = '\0';
{
char msg[256];
int n = snprintf(msg, sizeof(msg), "[bench_profile] set %s=%s\n", key, val);
if (n > 0) {
if (n > (int)sizeof(msg)) n = (int)sizeof(msg);
ssize_t w = write(2, msg, (size_t)n);
(void)w;
}
}
real_putenv(buf); // takes ownership; do not free
}
// ベンチ専用: HAKMEM_PROFILE に応じて ENV をプリセットする
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement ## Summary Completed Phase 54-60 optimization work: **Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)** - Implemented ss_mem_lean_env_box.h with ENV gates - Balanced mode (LEAN+OFF) promoted as production default - Result: +1.2% throughput, better stability, zero syscall overhead - Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset **Phase 57: 60-min soak finalization** - Balanced mode: 60-min soak, RSS drift 0%, CV 5.38% - Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58% - Syscall budget: 1.25e-7/op (800× under target) - Status: PRODUCTION-READY **Phase 59: 50% recovery baseline rebase** - hakmem FAST (Balanced): 59.184M ops/s, CV 1.31% - mimalloc: 120.466M ops/s, CV 3.50% - Ratio: 49.13% (M1 ACHIEVED within statistical noise) - Superior stability: 2.68× better CV than mimalloc **Phase 60: Alloc pass-down SSOT (NO-GO)** - Implemented alloc_passdown_ssot_env_box.h - Modified malloc_tiny_fast.h for SSOT pattern - Result: -0.46% (NO-GO) - Key lesson: SSOT not applicable where early-exit already optimized ## Key Metrics - Performance: 49.13% of mimalloc (M1 effectively achieved) - Stability: CV 1.31% (superior to mimalloc 3.50%) - Syscall budget: 1.25e-7/op (excellent) - RSS: 33MB stable, 0% drift over 60 minutes ## Files Added/Modified New boxes: - core/box/ss_mem_lean_env_box.h - core/box/ss_release_policy_box.{h,c} - core/box/alloc_passdown_ssot_env_box.h Scripts: - scripts/soak_mixed_single_process.sh - scripts/analyze_epoch_tail_csv.py - scripts/soak_mixed_rss.sh - scripts/calculate_percentiles.py - scripts/analyze_soak.py Documentation: Phase 40-60 analysis documents ## Design Decisions 1. Profile separation (core/bench_profile.h): - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN) - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF) 2. Box Theory compliance: - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT) - Single conversion points maintained - No physical deletions (compile-out only) 3. Lessons learned: - SSOT effective only where redundancy exists (Phase 60 showed limits) - Branch prediction extremely effective (~0 cycles for well-predicted branches) - Early-exit pattern valuable even when seemingly redundant 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 06:24:01 +09:00
static inline void bench_apply_mixed_tinyv3_c7_common(void) {
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
bench_setenv_default("HAKMEM_TINY_C7_HOT", "1");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x0");
bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED", "0");
bench_setenv_default("HAKMEM_SMALL_SEGMENT_V4_ENABLED", "0");
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
bench_setenv_default("HAKMEM_TINY_FRONT_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_TINY_FRONT_V3_LUT_ENABLED", "1");
bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED", "1");
// Phase FREE-TINY-FAST-DUALHOT-1: C0-C3 direct fast free (skip policy snapshot)
bench_setenv_default("HAKMEM_FREE_TINY_FAST_HOTCOLD", "1");
// Phase 2 B4: Wrapper hot/cold split (malloc/free wrapper shape)
bench_setenv_default("HAKMEM_WRAP_SHAPE", "1");
// Phase 4 E1: ENV Snapshot Consolidation (+3.92% proven on Mixed)
bench_setenv_default("HAKMEM_ENV_SNAPSHOT", "1");
// Phase 5 E4-1: Free wrapper ENV snapshot (+3.51% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT", "1");
// Phase 5 E4-2: Malloc wrapper ENV snapshot (+21.83% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT", "1");
// Phase 5 E5-1: Free Tiny Direct Path (+3.35% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FREE_TINY_DIRECT", "1");
// Phase 6-1: Front FastLane (Layer Collapse) (+11.13% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
// Phase 21: Tiny Header HotFull (alloc header hot/cold split; opt-out with 0)
bench_setenv_default("HAKMEM_TINY_HEADER_HOTFULL", "1");
// Phase 19-1b: FastLane Direct (wrapper layer bypass, +5.88% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
// Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
// Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT", "1");
// Phase 4-4: C6 ULTRA free+alloc 統合を有効化 (default OFF, manual opt-in)
bench_setenv_default("HAKMEM_TINY_C6_ULTRA_FREE_ENABLED", "0");
// Phase MID-V3: Mid/Pool HotBox v3
// Mixed (161024B) では MID_V3(C6) が大きく遅くなるため、デフォルト OFF に固定。
// C6-heavy プロファイル側でのみ ON を推奨するC6-heavy のみ最適化対象)。
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
// Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
// Phase 3 C3: Static routing (policy_snapshot bypass, +2.2% proven)
bench_setenv_default("HAKMEM_TINY_STATIC_ROUTE", "1");
// Phase 3 D1: Free route cache (TLS cache for free path routing, +2.19% proven)
bench_setenv_default("HAKMEM_FREE_STATIC_ROUTE", "1");
// Phase 69-1: Warm Pool Size=16 (+3.26% Strong GO, ENV-only)
bench_setenv_default("HAKMEM_WARM_POOL_SIZE", "16");
Phase 75-3: C5+C6 Interaction Matrix Test (4-Point A/B) - STRONG GO (+5.41%) Comprehensive interaction testing with single binary, ENV-only configuration: 4-Point Matrix Results (Mixed SSOT, WS=400): - Point A (C5=0, C6=0): 42.36 M ops/s [Baseline] - Point B (C5=1, C6=0): 43.54 M ops/s (+2.79% vs A) - Point C (C5=0, C6=1): 44.25 M ops/s (+4.46% vs A) - Point D (C5=1, C6=1): 44.65 M ops/s (+5.41% vs A) **[COMBINED TARGET]** Additivity Analysis: - Expected additive: 45.43 M ops/s (B+C-A) - Actual: 44.65 M ops/s (D) - Sub-additivity: 1.72% (near-perfect, minimal negative interaction) Perf Stat Validation (Point D vs A): - Instructions: -6.1% (function call elimination confirmed) - Branches: -6.1% (matches instructions reduction) - Cache-misses: -31.5% (improved locality, NO code explosion) - Throughput: +5.41% (net positive) Decision: ✅ STRONG GO (exceeds +3.0% GO threshold) - D vs A: +5.41% >> +3.0% - Sub-additivity: 1.72% << 20% acceptable - Phase 73 hypothesis validated: -6.1% instructions/branches → +5.41% throughput Promotion to Defaults: - core/bench_profile.h: C5+C6 added to bench_apply_mixed_tinyv3_c7_common() - scripts/run_mixed_10_cleanenv.sh: C5+C6 ENV defaults added - C5+C6 inline slots now PRESET DEFAULT for MIXED_TINYV3_C7_SAFE New Baseline: 44.65 M ops/s (36.75% of mimalloc, +5.41% from Phase 75-0) M2 Target: 55% of mimalloc ≈ 66.8 M ops/s (remaining gap: 22.15 M ops/s) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 08:53:01 +09:00
// Phase 75-3: C5+C6 Inline Slots (GO +5.41% proven, 4-point matrix A/B)
bench_setenv_default("HAKMEM_TINY_C5_INLINE_SLOTS", "1");
bench_setenv_default("HAKMEM_TINY_C6_INLINE_SLOTS", "1");
// Phase 76-1: C4 Inline Slots (GO +1.73%, 10-run A/B)
bench_setenv_default("HAKMEM_TINY_C4_INLINE_SLOTS", "1");
// Phase 78-1: Inline Slots Fixed Mode (GO, removes per-op ENV gate overhead)
bench_setenv_default("HAKMEM_TINY_INLINE_SLOTS_FIXED", "1");
// Phase 80-1: Inline Slots Switch Dispatch (GO +1.65%, removes if-chain comparisons)
bench_setenv_default("HAKMEM_TINY_INLINE_SLOTS_SWITCHDISPATCH", "1");
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement ## Summary Completed Phase 54-60 optimization work: **Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)** - Implemented ss_mem_lean_env_box.h with ENV gates - Balanced mode (LEAN+OFF) promoted as production default - Result: +1.2% throughput, better stability, zero syscall overhead - Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset **Phase 57: 60-min soak finalization** - Balanced mode: 60-min soak, RSS drift 0%, CV 5.38% - Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58% - Syscall budget: 1.25e-7/op (800× under target) - Status: PRODUCTION-READY **Phase 59: 50% recovery baseline rebase** - hakmem FAST (Balanced): 59.184M ops/s, CV 1.31% - mimalloc: 120.466M ops/s, CV 3.50% - Ratio: 49.13% (M1 ACHIEVED within statistical noise) - Superior stability: 2.68× better CV than mimalloc **Phase 60: Alloc pass-down SSOT (NO-GO)** - Implemented alloc_passdown_ssot_env_box.h - Modified malloc_tiny_fast.h for SSOT pattern - Result: -0.46% (NO-GO) - Key lesson: SSOT not applicable where early-exit already optimized ## Key Metrics - Performance: 49.13% of mimalloc (M1 effectively achieved) - Stability: CV 1.31% (superior to mimalloc 3.50%) - Syscall budget: 1.25e-7/op (excellent) - RSS: 33MB stable, 0% drift over 60 minutes ## Files Added/Modified New boxes: - core/box/ss_mem_lean_env_box.h - core/box/ss_release_policy_box.{h,c} - core/box/alloc_passdown_ssot_env_box.h Scripts: - scripts/soak_mixed_single_process.sh - scripts/analyze_epoch_tail_csv.py - scripts/soak_mixed_rss.sh - scripts/calculate_percentiles.py - scripts/analyze_soak.py Documentation: Phase 40-60 analysis documents ## Design Decisions 1. Profile separation (core/bench_profile.h): - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN) - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF) 2. Box Theory compliance: - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT) - Single conversion points maintained - No physical deletions (compile-out only) 3. Lessons learned: - SSOT effective only where redundancy exists (Phase 60 showed limits) - Branch prediction extremely effective (~0 cycles for well-predicted branches) - Early-exit pattern valuable even when seemingly redundant 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 06:24:01 +09:00
}
static inline void bench_apply_profile(void) {
const char* p = getenv("HAKMEM_PROFILE");
if (!p || !*p) return;
if (strcmp(p, "MIXED_TINYV3_C7_SAFE") == 0) {
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement ## Summary Completed Phase 54-60 optimization work: **Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)** - Implemented ss_mem_lean_env_box.h with ENV gates - Balanced mode (LEAN+OFF) promoted as production default - Result: +1.2% throughput, better stability, zero syscall overhead - Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset **Phase 57: 60-min soak finalization** - Balanced mode: 60-min soak, RSS drift 0%, CV 5.38% - Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58% - Syscall budget: 1.25e-7/op (800× under target) - Status: PRODUCTION-READY **Phase 59: 50% recovery baseline rebase** - hakmem FAST (Balanced): 59.184M ops/s, CV 1.31% - mimalloc: 120.466M ops/s, CV 3.50% - Ratio: 49.13% (M1 ACHIEVED within statistical noise) - Superior stability: 2.68× better CV than mimalloc **Phase 60: Alloc pass-down SSOT (NO-GO)** - Implemented alloc_passdown_ssot_env_box.h - Modified malloc_tiny_fast.h for SSOT pattern - Result: -0.46% (NO-GO) - Key lesson: SSOT not applicable where early-exit already optimized ## Key Metrics - Performance: 49.13% of mimalloc (M1 effectively achieved) - Stability: CV 1.31% (superior to mimalloc 3.50%) - Syscall budget: 1.25e-7/op (excellent) - RSS: 33MB stable, 0% drift over 60 minutes ## Files Added/Modified New boxes: - core/box/ss_mem_lean_env_box.h - core/box/ss_release_policy_box.{h,c} - core/box/alloc_passdown_ssot_env_box.h Scripts: - scripts/soak_mixed_single_process.sh - scripts/analyze_epoch_tail_csv.py - scripts/soak_mixed_rss.sh - scripts/calculate_percentiles.py - scripts/analyze_soak.py Documentation: Phase 40-60 analysis documents ## Design Decisions 1. Profile separation (core/bench_profile.h): - MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN) - MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF) 2. Box Theory compliance: - All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT) - Single conversion points maintained - No physical deletions (compile-out only) 3. Lessons learned: - SSOT effective only where redundancy exists (Phase 60 showed limits) - Branch prediction extremely effective (~0 cycles for well-predicted branches) - Early-exit pattern valuable even when seemingly redundant 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 06:24:01 +09:00
// Speed-first default (Phase 57): do not set HAKMEM_SS_MEM_LEAN here.
bench_apply_mixed_tinyv3_c7_common();
} else if (strcmp(p, "MIXED_TINYV3_C7_BALANCED") == 0) {
// Balanced mode (Phase 55/56): LEAN+OFF (prewarm suppression only).
bench_apply_mixed_tinyv3_c7_common();
bench_setenv_default("HAKMEM_SS_MEM_LEAN", "1");
bench_setenv_default("HAKMEM_SS_MEM_LEAN_DECOMMIT", "OFF");
bench_setenv_default("HAKMEM_SS_MEM_LEAN_TARGET_MB", "10");
} else if (strcmp(p, "C6_HEAVY_LEGACY_POOLV1") == 0) {
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_ENABLED", "0");
bench_setenv_default("HAKMEM_MID_DESC_CACHE_ENABLED", "1");
// Phase 4-4: C6 ULTRA free+alloc 統合を有効化 (default OFF, manual opt-in)
bench_setenv_default("HAKMEM_TINY_C6_ULTRA_FREE_ENABLED", "0");
// Phase MID-V3: Mid/Pool HotBox v3 (257-768B, C6 only)
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x40");
// Phase 6-1: Front FastLane (Layer Collapse) (+11.13% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
// Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
// Phase 21: Tiny Header HotFull (alloc header hot/cold split; opt-out with 0)
bench_setenv_default("HAKMEM_TINY_HEADER_HOTFULL", "1");
Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%) ## Phase 17 v2: FORCE_LIBC Gap Validation Fix **Critical bug fix**: Phase 17 v1 の測定が壊れていた **Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 が FastLane より後でしか見えず、 same-binary A/B が実質 "hakmem vs hakmem" になっていた(+0.39% 誤測定) **Fix**: core/box/hak_wrappers.inc.h:171 と :645 に g_force_libc_alloc==1 の early bypass を追加、__libc_malloc/__libc_free に最初に直行 **Result**: 正しい同一バイナリ A/B 測定 - hakmem (FORCE_LIBC=0): 48.99M ops/s - libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%) - system binary: 88.06M ops/s (+10.5% vs libc) **Gap 分解**: - Allocator 差: +62.7% (主戦場) - Layout penalty: +10.5% (副次的) **Conclusion**: Case A 確定 (allocator dominant, NOT layout) Phase 17 v1 の Case B 判定は誤り。 Files: - docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2) - docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated) --- ## Phase 19: FastLane Instruction Reduction Analysis **Goal**: libc との instruction gap (-35% instructions, -56% branches) を削減 **perf stat 分析** (FORCE_LIBC=0 vs 1, 200M ops): - hakmem: 209.09 instructions/op, 52.33 branches/op - libc: 135.92 instructions/op, 22.93 branches/op - Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%) **Hot path** (perf report): - front_fastlane_try_free: 23.97% cycles - malloc wrapper: 23.84% cycles - free wrapper: 6.82% cycles - **Wrapper overhead: ~55% of all cycles** **Reduction candidates**: - A: Wrapper layer 削除 (-17.5 inst/op, +10-15% 期待) - B: ENV snapshot 統合 (-10.0 inst/op, +5-8%) - C: Stats 削除 (-5.0 inst/op, +3-5%) - D: Header inline (-4.0 inst/op, +2-3%) - E: Route fast path (-3.5 inst/op, +2-3%) Files: - docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md - docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md --- ## Phase 19-1b: FastLane Direct — GO (+5.88%) **Strategy**: Wrapper layer を bypass し、core allocator を直接呼ぶ - free() → free_tiny_fast() (not free_tiny_fast_hot) - malloc() → malloc_tiny_fast() **Phase 19-1 が NO-GO (-3.81%) だった原因**: 1. __builtin_expect(fastlane_direct_enabled(), 0) が逆効果(A/B 不公平) 2. free_tiny_fast_hot() が誤選択(free_tiny_fast() が勝ち筋) **Phase 19-1b の修正**: 1. __builtin_expect() 削除 2. free_tiny_fast() を直接呼び出し **Result** (Mixed, 10-run, 20M iters, ws=400): - Baseline (FASTLANE_DIRECT=0): 49.17M ops/s - Optimized (FASTLANE_DIRECT=1): 52.06M ops/s - **Delta: +5.88%** (GO 基準 +5% クリア) **perf stat** (200M iters): - Instructions/op: 199.90 → 169.45 (-30.45, -15.23%) - Branches/op: 51.49 → 41.52 (-9.97, -19.36%) - Cycles/op: 88.88 → 84.37 (-4.51, -5.07%) - I-cache miss: 111K → 98K (-11.79%) **Trade-offs** (acceptable): - iTLB miss: +41.46% (front-end cost) - dTLB miss: +29.15% (backend cost) - Overall gain (+5.88%) outweighs costs **Implementation**: 1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c} - HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in) - Single _Atomic global (wrapper キャッシュ問題を解決) 2. **Wrapper 修正**: core/box/hak_wrappers.inc.h - malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1 - free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1 - Safety: !g_initialized では direct 使わない、fallback 維持 3. **Preset 昇格**: core/bench_profile.h:88 - bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1") - Comment: +5.88% proven on Mixed, 10-run 4. **cleanenv 更新**: scripts/run_mixed_10_cleanenv.sh:22 - HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1} - Phase 9/10 と同様に昇格 **Verdict**: GO — 本線採用、プリセット昇格完了 **Rollback**: HAKMEM_FASTLANE_DIRECT=0 で既存 FastLane path に戻る Files: - core/box/fastlane_direct_env_box.{h,c} (new) - core/box/hak_wrappers.inc.h (modified) - core/bench_profile.h (preset promotion) - scripts/run_mixed_10_cleanenv.sh (ENV default aligned) - Makefile (new obj) - docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md --- ## Cumulative Performance - Baseline (all optimizations OFF): ~40M ops/s (estimated) - Current (Phase 19-1b): 52.06M ops/s - **Cumulative gain: ~+30% from baseline** Remaining gap to libc (79.72M): - Current: 52.06M ops/s - Target: 79.72M ops/s - **Gap: +53.2%** (was +62.7% before Phase 19-1b) Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 11:28:40 +09:00
// Phase 19-1b: FastLane Direct (wrapper layer bypass)
bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
// Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
} else if (strcmp(p, "C6_V7_STUB") == 0) {
// Phase v7-1: C6-only v7 stub 実験用MID v3 fallback
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x40");
// v7 stub ON (C6-only)
bench_setenv_default("HAKMEM_SMALL_HEAP_V7_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V7_CLASSES", "0x40");
} else if (strcmp(p, "C6_HEAVY_LEGACY_POOLV1_FLATTEN") == 0) {
// LEGACY mid/smallmid ベンチ専用C7_SAFE では使用しない)
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "LEGACY");
bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_ENABLED", "1");
bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_STATS", "1");
bench_setenv_default("HAKMEM_POOL_ZERO_MODE", "header");
} else if (strcmp(p, "DEBUG_TINY_FRONT_PERF") == 0) {
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
bench_setenv_default("HAKMEM_TINY_C7_HOT", "1");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
bench_setenv_default("HAKMEM_TINY_FRONT_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_TINY_FRONT_V3_LUT_ENABLED", "1");
bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED", "1");
} else if (strcmp(p, "C6_SMALL_HEAP_V3_EXPERIMENT") == 0) {
// C6 を SmallObject v3 に載せる研究用(標準では使用しない)
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
bench_setenv_default("HAKMEM_TINY_C6_HOT", "1");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x40"); // C6 only
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x0");
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
} else if (strcmp(p, "C6_SMALL_HEAP_V4_EXPERIMENT") == 0) {
// C6 を SmallObject v4 に載せる研究用(標準では使用しない)
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
bench_setenv_default("HAKMEM_TINY_C6_HOT", "1");
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x0");
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "1");
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x40"); // C6 only
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
}
#ifdef USE_HAKMEM
// Phase 3 C3 Step 0: Ensure policy snapshot reflects final ENV after putenv defaults.
small_policy_v7_bump_version();
// Phase 2 B4: Sync wrapper ENV cache after bench_profile putenv defaults.
wrapper_env_refresh_from_env();
// Phase 3 C3: Sync static route cache after bench_profile putenv defaults.
tiny_static_route_refresh_from_env();
// Phase 4 E1: Sync ENV snapshot cache after bench_profile putenv defaults.
hakmem_env_snapshot_refresh_from_env();
Phase 8: FREE-STATIC-ROUTE ENV Cache Hardening (GO +2.61%) Results: - A/B test: +2.61% on Mixed (10-run, clean env) - Baseline: 49.26M ops/s - Optimized: 50.55M ops/s - Improvement: +1.29M ops/s (+2.61%) Strategy: - Fix ENV cache accident (main前キャッシュ事故の修正) - Add refresh mechanism to sync with bench_profile putenv - Ensure Phase 3 D1 optimization works reliably Success factors: 1. Performance improvement: +2.61% (existing win-box now reliable) 2. ENV cache accident fixed: refresh mechanism works correctly 3. Standard deviation improved: 867K → 336K ops/s (61% reduction) 4. Baseline quality improved: existing optimization now guaranteed Implementation: - Patch 1: Make ENV gate refreshable (tiny_free_route_cache_env_box.{h,c}) - Changed static int to extern _Atomic int - Added tiny_free_static_route_refresh_from_env() - Patch 2: Integrate refresh into bench_profile.h - Call refresh after bench_setenv_default() group - Patch 3: Update Makefile for new .c file ENV cache fix verification: - [FREE_STATIC_ROUTE] enabled appears twice (refresh working) - bench_profile putenv now reliably reflected Files modified: - core/box/tiny_free_route_cache_env_box.h: extern + refresh API - core/box/tiny_free_route_cache_env_box.c: NEW (global state + refresh) - core/bench_profile.h: add refresh call - Makefile: add new .o file Health check: PASSED (all profiles) Rollback: HAKMEM_FREE_STATIC_ROUTE=0 or revert Patch 1/2 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 18:49:08 +09:00
// Phase 8: Sync free static route ENV cache after bench_profile putenv defaults.
tiny_free_static_route_refresh_from_env();
Phase 13 v1 + E5-2 retest: Both NEUTRAL, freeze as research boxes Phase 13 v1: Header Write Elimination (C7 preserve header) - Verdict: NEUTRAL (+0.78%) - Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF) - Makes C7 nextptr offset conditional (0→1 when enabled) - 4-point matrix A/B test results: * Case A (baseline): 51.49M ops/s * Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%) * Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%) * Case D (both): 51.89M ops/s (+0.78% NEUTRAL) - Action: Freeze as research box (default OFF, manual opt-in) Phase 5 E5-2: Header Write-Once retest (promotion test) - Verdict: NEUTRAL (+0.54%) - Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run - Results (20-run): * Case A (baseline): 51.10M ops/s * Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%) - Previous test: +0.45% (consistent with NEUTRAL) - Action: Keep as research box (default OFF, manual opt-in) Key findings: - Header write tax optimization shows consistent NEUTRAL results - Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%) - Both implemented as reversible ENV gates for future research Files changed: - New: core/box/tiny_c7_preserve_header_env_box.{c,h} - Modified: core/box/tiny_layout_box.h (C7 offset conditional) - Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments) - Modified: core/bench_profile.h (refresh sync) - Modified: Makefile (add new .o files) - Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV) - Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results) Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO) 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 00:32:25 +09:00
// Phase 13 v1: Sync C7 preserve header ENV cache after bench_profile putenv defaults.
tiny_c7_preserve_header_env_refresh_from_env();
// Phase 14 v1: Sync tcache ENV cache after bench_profile putenv defaults.
tiny_tcache_env_refresh_from_env();
// Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
tiny_unified_lifo_env_refresh_from_env();
Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added ## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%) Target: Reduce alloc-side fixed costs by adding LEGACY direct path to FastLane entry, mirroring Phase 9/10 free-side winning pattern. Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as research box (default OFF). Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause: unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3 only (matching existing dualhot pattern). Files: - core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new) - core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119) - core/bench_profile.h (env refresh sync) - Makefile (new obj) - docs/analysis/PHASE16_*.md (design/results/instructions) ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in) Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/ routing optimization ROI is exhausted post-Phase-6 FastLane collapse. --- ## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed Purpose: Validate "system malloc faster" observation using same-binary A/B testing to isolate allocator logic差 vs binary layout penalty. Method: - Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem) - System binary: bench_random_mixed_system (21K separate binary) - Perf stat: Hardware counter analysis (I-cache, cycles, instructions) Result: **Case B confirmed** — Allocator差 negligible, layout penalty dominates. Gap breakdown (Mixed, 20M iters, ws=400): - hakmem (FORCE_LIBC=0): 48.12M ops/s - libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level) - system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem Perf stat (200M iters): - I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun) - Cycles: 17.9B → 10.2B = -43% - Instructions: 41.3B → 21.5B = -48% - Binary size: 653K → 21K (30x difference) Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >> algorithmic efficiency. Conclusion: Phase 12's "system malloc 1.6x faster" was real, but misattributed. Gap is layout/I-cache, NOT allocator algorithm. Files: - docs/analysis/PHASE17_*.md (results/instructions) - scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned) Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt) --- ## Phase 18: Hot Text Isolation — Design Added Purpose: Reduce I-cache misses + instruction footprint via layout control (binary optimization, not allocator algorithm changes). Strategy (v1 → v2 progression): v1 (TU split + hot/cold attrs + optional gc-sections): - Target: +2% throughput (GO threshold, realistic for layout tweaks) - Secondary: I-cache -10%, instructions -5% (direction confirmation) - Risk: Low (reversible via build knob) - Expected: +0-2% (NEUTRAL likely, but validates approach) v2 (BENCH_MINIMAL compile-out): - Target: +10-20% throughput (本命) - Method: Conditional compilation removes stats/ENV/debug from hot path - Expected: Instruction count -30-40% → significant I-cache improvement Files: - docs/analysis/PHASE18_*.md (design/instructions) - CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan) Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob) Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
// Phase 19-1: Sync FastLane Direct ENV cache after bench_profile putenv defaults.
fastlane_direct_env_refresh_from_env();
// Phase 21: Sync Tiny Header HotFull ENV cache after bench_profile putenv defaults.
tiny_header_hotfull_env_refresh_from_env();
// Phase 78-1: Optionally pin C3/C4/C5/C6 inline-slots modes (avoid per-op ENV gates).
tiny_inline_slots_fixed_mode_refresh_from_env();
Phase 86: Free Path Legacy Mask (NO-GO, +0.25%) ## Summary Implemented Phase 86 "mask-only commit" optimization for free path: - Bitset mask (0x7f for C0-C6) to identify LEGACY classes - Direct call to tiny_legacy_fallback_free_base_with_env() - No indirect function pointers (avoids Phase 85's -0.86% regression) - Fail-fast on LARSON_FIX=1 (cross-thread validation incompatibility) ## Results (10-run SSOT) **NO-GO**: +0.25% improvement (threshold: +1.0%) - Control: 51,750,467 ops/s (CV: 2.26%) - Treatment: 51,881,055 ops/s (CV: 2.32%) - Delta: +0.25% (mean), -0.15% (median) ## Root Cause Competing optimizations plateau: 1. Phase 9/10 MONO LEGACY (+1.89%) already capture most free path benefit 2. Remaining margin insufficient to overcome: - Two branch checks (mask_enabled + has_class) - I-cache layout tax in hot path - Direct function call overhead ## Phase 85 vs Phase 86 | Metric | Phase 85 | Phase 86 | |--------|----------|----------| | Approach | Indirect calls + table | Bitset mask + direct call | | Result | -0.86% | +0.25% | | Verdict | NO-GO (regression) | NO-GO (insufficient) | Phase 86 correctly avoided indirect call penalties but revealed architectural limit: can't escape Phase 9/10 overlay without restructuring. ## Recommendation Free path optimization layer has reached practical ceiling: - Phase 9/10 +1.89% + Phase 6/19/FASTLANE +16-27% ≈ 18-29% total - Further attempts on ceremony elimination face same constraints - Recommend focus on different optimization layers (malloc, etc.) ## Files Changed ### New - core/box/free_path_legacy_mask_box.h (API + globals) - core/box/free_path_legacy_mask_box.c (refresh logic) ### Modified - core/bench_profile.h (added refresh call) - core/front/malloc_tiny_fast.h (added Phase 86 fast path check) - Makefile (added object files) - CURRENT_TASK.md (documented result) All changes conditional on HAKMEM_FREE_PATH_LEGACY_MASK=1 (default OFF). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 22:05:34 +09:00
// Phase 85: Optionally commit-once for C4-C7 LEGACY free path (skip policy/route/mono ceremony).
free_path_commit_once_refresh_from_env();
// Phase 86: Optionally use legacy mask for early exit (no indirect calls, just bit test).
free_path_legacy_mask_refresh_from_env();
// Phase 91: C6 intrusive LIFO inline slots (per-class LIFO transformation).
tiny_c6_inline_slots_ifl_refresh_from_env();
#endif
}