hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	f8e7cf05b4	Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added ## Phase 16 v1: Front FastLane Alloc LEGACY Direct, NEUTRAL (+0.62%) Target: Reduce alloc-side fixed costs by adding LEGACY direct path to FastLane entry, mirroring Phase 9/10 free-side winning pattern. Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as research box (default OFF). Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause: unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3 only (matching existing dualhot pattern). Files: - core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new) - core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119) - core/bench_profile.h (env refresh sync) - Makefile (new obj) - docs/analysis/PHASE16_.md (design/results/instructions) ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in) Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/routing optimization ROI is exhausted post-Phase-6 FastLane collapse. --- ## Phase 17: FORCE_LIBC Gap Validation, Case B Confirmed Purpose: Validate "system malloc faster" observation using same-binary A/B testing to isolate the allocator-logic difference from the binary layout penalty. Method: - Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem) - System binary: bench_random_mixed_system (21K separate binary) - Perf stat: Hardware counter analysis (I-cache, cycles, instructions) Result: Case B confirmed. The allocator difference is negligible; the layout penalty dominates. Gap breakdown (Mixed, 20M iters, ws=400): - hakmem (FORCE_LIBC=0): 48.12M ops/s - libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level) - system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem Perf stat (200M iters): - I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun) - Cycles: 17.9B → 10.2B = -43% - Instructions: 41.3B → 21.5B = -48% - Binary size: 653K → 21K (30x difference) Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >> algorithmic efficiency. Conclusion: Phase 12's "system malloc 1.6x faster" was real, but misattributed. Gap is layout/I-cache, NOT allocator algorithm. Files: - docs/analysis/PHASE17_.md (results/instructions) - scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned) Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt) --- ## Phase 18: Hot Text Isolation, Design Added Purpose: Reduce I-cache misses + instruction footprint via layout control (binary optimization, not allocator algorithm changes). Strategy (v1 → v2 progression): v1 (TU split + hot/cold attrs + optional gc-sections): - Target: +2% throughput (GO threshold, realistic for layout tweaks) - Secondary: I-cache -10%, instructions -5% (direction confirmation) - Risk: Low (reversible via build knob) - Expected: +0-2% (NEUTRAL likely, but validates approach) v2 (BENCH_MINIMAL compile-out): - Target: +10-20% throughput (primary target) - Method: Conditional compilation removes stats/ENV/debug from hot path - Expected: Instruction count -30-40% → significant I-cache improvement Files: - docs/analysis/PHASE18_.md (design/instructions) - CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan) Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob) Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-15 05:25:47 +09:00
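The Phase 18 plan (hot/cold attributes in v1, BENCH_MINIMAL compile-out in v2) can be illustrated with a minimal sketch. It assumes GCC or Clang function attributes; every `demo_`/`DEMO_` name is hypothetical, and the `BENCH_MINIMAL` macro here only mimics the idea named in the commit, not the repository's actual build wiring.

```c
/* Hypothetical build knob: 1 compiles statistics out of the code entirely. */
#ifndef BENCH_MINIMAL
#define BENCH_MINIMAL 0
#endif

#if defined(__GNUC__) || defined(__clang__)
#define DEMO_HOT       __attribute__((hot))
#define DEMO_COLD      __attribute__((cold, noinline))
#define DEMO_LIKELY(x) __builtin_expect(!!(x), 1)
#else
#define DEMO_HOT
#define DEMO_COLD
#define DEMO_LIKELY(x) (x)
#endif

static unsigned long g_demo_slow_hits; /* illustrative stats counter */

/* Rare path: marked cold so the compiler may place it away from hot text. */
DEMO_COLD int demo_alloc_slow(int class_idx) {
#if !BENCH_MINIMAL
    g_demo_slow_hits++;      /* stats exist only in non-minimal builds */
#endif
    return -class_idx - 1;   /* stand-in result: negative means "slow path" */
}

/* Common path: marked hot, with the cache-hit branch hinted as likely. */
DEMO_HOT int demo_alloc_fast(int class_idx, int cache_has_block) {
    if (DEMO_LIKELY(cache_has_block)) return class_idx;
    return demo_alloc_slow(class_idx);
}

unsigned long demo_slow_hits(void) { return g_demo_slow_hits; }
```

Building with `-DBENCH_MINIMAL=1` drops the counter increment from the generated code, which is the kind of instruction-footprint reduction the v2 plan is aiming for.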
Moe Charm (CI)	87fa27518c	Phase 15 v1: UnifiedCache FIFO→LIFO NEUTRAL (-0.70% Mixed, +0.42% C7) Transform existing array-based UnifiedCache from FIFO ring to LIFO stack. A/B Results: - Mixed (16-1024B): -0.70% (52,965,966 → 52,593,948 ops/s) - C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s) Verdict: NEUTRAL (both below +1.0% GO threshold) - freeze as research box Implementation: - L0 ENV gate: tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1) - L1 LIFO ops: tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo) - L2 integration: tiny_front_hot_box.h (mode check at entry) - Reuses existing slots[] array (no intrusive pointers) Root Causes: 1. Mode check overhead (tiny_unified_lifo_enabled() call) 2. Minimal LIFO vs FIFO locality delta in practice 3. Existing FIFO ring already well-optimized Bonus Fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue) - Converted static inline to extern + non-inline implementation - Fixes undefined reference during LTO linking Design: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md Results: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-15 02:19:26 +09:00
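The FIFO-ring versus LIFO-stack distinction over a shared `slots[]` array can be sketched as follows. This is an illustration only: the struct layout, the capacity, and all `demo_`/`Demo` names are assumptions rather than the UnifiedCache code, and a given cache instance is assumed to stay in one mode for its whole lifetime.

```c
#include <stddef.h>

#define DEMO_CAP 8u /* hypothetical capacity */

typedef struct {
    void    *slots[DEMO_CAP];
    unsigned head;  /* FIFO mode: index of the oldest element (stays 0 in LIFO mode) */
    unsigned count; /* number of stored elements, both modes */
} DemoCache;

/* Returns 1 on success, 0 when the cache is full. */
int demo_push(DemoCache *c, void *p, int lifo) {
    if (c->count == DEMO_CAP) return 0;
    if (lifo) c->slots[c->count] = p;                      /* stack top */
    else      c->slots[(c->head + c->count) % DEMO_CAP] = p; /* ring tail */
    c->count++;
    return 1;
}

/* Returns NULL when the cache is empty. */
void *demo_pop(DemoCache *c, int lifo) {
    if (c->count == 0) return NULL;
    c->count--;
    if (lifo) return c->slots[c->count]; /* newest block first */
    void *p = c->slots[c->head];         /* oldest block first */
    c->head = (c->head + 1) % DEMO_CAP;
    return p;
}
```

The `lifo` argument plays the role of the per-call mode check that the commit lists as one source of overhead.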
Moe Charm (CI)	f8fb05bc13	Phase 14 v1: Pointer-Chase Reduction (tcache) NEUTRAL (+0.20%) Implementation: - Intrusive LIFO tcache layer (L1) before UnifiedCache - TLS per-class bins (head pointer + count) - Intrusive next pointers (via tiny_next_store/load SSOT) - Cap: 64 blocks per class (default) - ENV: HAKMEM_TINY_TCACHE=0/1 (default: 0, OFF) A/B Test Results (Mixed 10-run): - Baseline (TCACHE=0): 51,083,379 ops/s - Optimized (TCACHE=1): 51,186,838 ops/s - Mean delta: +0.20% (below +1.0% GO threshold) - Median delta: +0.59% Verdict: NEUTRAL - Freeze as research box (default OFF) Root Cause (v1 wiring incomplete): - Free side pushes to tcache via unified_cache_push() - Alloc hot path (tiny_hot_alloc_fast) doesn't consume tcache - tcache becomes "sink" without alloc-side pop → ROI not measurable Files: - Created: core/box/tiny_tcache_{env_box,box}.h, tiny_tcache_env_box.c - Modified: core/front/tiny_unified_cache.h (integration) - Modified: core/bench_profile.h (refresh sync) - Modified: Makefile (build integration) - Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md - v2 Instructions: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md Next: Phase 14 v2 (connect tcache to tiny_front_hot_box alloc/free hot path) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-15 01:28:50 +09:00
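An intrusive LIFO bin with a head pointer, a count, and a 64-block cap, as described above, might look like the sketch below. The names are hypothetical and the next pointer is simply stored in the first bytes of the free block; the repository routes this through `tiny_next_store/load`, whose details are not shown in the commit.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_TCACHE_CAP 64u /* cap from the commit message */

typedef struct {
    void    *head;  /* most recently freed block, or NULL */
    unsigned count; /* blocks currently held */
} DemoBin;

/* Free side: returns 0 when the bin is full so the caller can fall through. */
int demo_tcache_push(DemoBin *bin, void *blk) {
    if (bin->count >= DEMO_TCACHE_CAP) return 0;
    memcpy(blk, &bin->head, sizeof(void *)); /* intrusive next pointer */
    bin->head = blk;
    bin->count++;
    return 1;
}

/* Alloc side: returns NULL when the bin is empty. */
void *demo_tcache_pop(DemoBin *bin) {
    void *blk = bin->head;
    if (blk == NULL) return NULL;
    memcpy(&bin->head, blk, sizeof(void *)); /* follow the next pointer */
    bin->count--;
    return blk;
}
```

Both halves are needed for any benefit: per the root-cause note, v1 wired only the push side, so blocks entered the bin and were never popped by the alloc hot path.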
Moe Charm (CI)	cbb35ee27f	Phase 13 v1 + E5-2 retest: Both NEUTRAL, freeze as research boxes Phase 13 v1: Header Write Elimination (C7 preserve header) - Verdict: NEUTRAL (+0.78%) - Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF) - Makes C7 nextptr offset conditional (0→1 when enabled) - 4-point matrix A/B test results: * Case A (baseline): 51.49M ops/s * Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%) * Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%) * Case D (both): 51.89M ops/s (+0.78% NEUTRAL) - Action: Freeze as research box (default OFF, manual opt-in) Phase 5 E5-2: Header Write-Once retest (promotion test) - Verdict: NEUTRAL (+0.54%) - Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run - Results (20-run): * Case A (baseline): 51.10M ops/s * Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%) - Previous test: +0.45% (consistent with NEUTRAL) - Action: Keep as research box (default OFF, manual opt-in) Key findings: - Header write tax optimization shows consistent NEUTRAL results - Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%) - Both implemented as reversible ENV gates for future research Files changed: - New: core/box/tiny_c7_preserve_header_env_box.{c,h} - Modified: core/box/tiny_layout_box.h (C7 offset conditional) - Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments) - Modified: core/bench_profile.h (refresh sync) - Modified: Makefile (add new .o files) - Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV) - Docs: PHASE13_, PHASE5_E5_2_HEADER_WRITE_ONCE_ (design/results) Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO) 🤖 Generated with Claude Code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-15 00:32:25 +09:00
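The percentages in the 4-point matrix follow from a plain relative-delta calculation, sketched below. The function names are hypothetical, and treating exactly +1.0% as GO is an assumption; the commit only states that +1.0% is the threshold.

```c
/* Relative throughput change of candidate over baseline, in percent. */
double demo_delta_pct(double baseline_ops, double candidate_ops) {
    return (candidate_ops - baseline_ops) / baseline_ops * 100.0;
}

/* GO when the delta reaches the +1.0% threshold (boundary handling assumed). */
int demo_is_go(double delta_pct) {
    return delta_pct >= 1.0;
}
```

With the figures above, `demo_delta_pct(51.49, 52.07)` gives about +1.13% (Case B) and `demo_delta_pct(51.49, 51.89)` about +0.78% (Case D). A single run can cross the threshold by chance, which is why Case B was retested with a dedicated 20-run before any promotion.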
Moe Charm (CI)	71b1354d32	Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (GO +1.89%) Results: - A/B test: +1.89% on Mixed (10-run, clean env) - Baseline: 51.96M ops/s - Optimized: 52.94M ops/s - Improvement: +984K ops/s (+1.89%) - C6-heavy verification: +7.86% (nonlegacy_mask works correctly, no misfires) Strategy: - Extend Phase 9 (C0-C3 DUALHOT) to C4-C7 LEGACY DIRECT - Fail-Fast principle: Never misclassify MID/ULTRA/V7 as LEGACY - nonlegacy_mask: Cached at init, hot path uses single bit operation Success factors: 1. Performance improvement: +1.89% (1.9x GO threshold) 2. Safety verified: nonlegacy_mask prevents MID v3 misfire in C6-heavy 3. Phase 9 coexistence: C0-C3 (Phase 9) + C4-C7 (Phase 10) = full LEGACY coverage 4. Minimal overhead: Single bit operation in hot path (mask & (1u<<class)) Implementation: - Patch 1: ENV gate box (free_tiny_fast_mono_legacy_direct_env_box.h) - ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1 (default 0) - nonlegacy_mask cached (reuses free_policy_fast_v2_nonlegacy_mask()) - Probe window: 64 (avoid bench_profile putenv race) - Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h) - Conditions: !nonlegacy_mask, route==LEGACY, !LARSON_FIX, done==1 - Direct call: tiny_legacy_fallback_free_base() - Patch 3: Visibility (free_path_stats_box.h) - mono_legacy_direct_hit counter (compile-out in release) - Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh) - ENV leak protection Safety verification (C6-heavy): - OFF: 19.75M ops/s - ON: 21.30M ops/s (+7.86%) - nonlegacy_mask correctly excludes C6 (MID v3 active) - Improvement from C0-C5, C7 direct path acceleration Files modified: - core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset - core/front/malloc_tiny_fast.h: early-exit insertion - core/box/free_path_stats_box.h: counter - core/box/free_tiny_fast_mono_legacy_direct_env_box.h: NEW (ENV gate + nonlegacy_mask) - scripts/run_mixed_10_cleanenv.sh: ENV leak protection Health check: PASSED (all profiles) Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out) Rollback: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-14 20:09:40 +09:00
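The "cached at init, single bit operation in the hot path" idea behind nonlegacy_mask can be sketched like this. The builder function, its input array, and the `demo_` names are hypothetical; only the `mask & (1u<<class)` test is taken from the commit message.

```c
#include <stdint.h>

/* Init time: set bit i when class i is NOT served by the legacy route. */
uint32_t demo_build_nonlegacy_mask(const int *is_legacy, unsigned num_classes) {
    uint32_t mask = 0;
    for (unsigned i = 0; i < num_classes && i < 32u; i++) {
        if (!is_legacy[i]) mask |= (1u << i);
    }
    return mask;
}

/* Hot path: one AND against the cached mask decides direct-path eligibility. */
int demo_can_take_legacy_direct(uint32_t nonlegacy_mask, unsigned class_idx) {
    return (nonlegacy_mask & (1u << class_idx)) == 0;
}
```

In the C6-heavy scenario described above, class 6 is handled by MID v3, so its bit is set and the direct path is skipped for that class, while the remaining classes stay eligible.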
Moe Charm (CI)	871034da1f	Phase 9: FREE-TINY-FAST MONO DUALHOT (GO +2.72%) Results: - A/B test: +2.72% on Mixed (10-run, clean env) - Baseline: 48.89M ops/s - Optimized: 50.22M ops/s - Improvement: +1.33M ops/s (+2.72%) - Stability: Standard deviation reduced by 60.8% (2.44M → 955K ops/s) Strategy: - Transplant C0-C3 "second hot" path to monolithic free_tiny_fast() - Early-exit within monolithic (no hot/cold split) - FastLane free now benefits from C0-C3 direct path Success factors: 1. Performance improvement: +2.72% (2.7x GO threshold) 2. Stability improvement: 2.6x more stable (stdev 60.8% reduction) 3. Learned from Phase 7 failure: - Phase 7: Function split (hot/cold) → NO-GO - Phase 9: Early-exit within monolithic → GO 4. FastLane free compatibility: C0-C3 direct path now works with FastLane 5. Policy snapshot overhead reduction: C0-C3 (48% of Mixed) skip route lookup Implementation: - Patch 1: ENV gate box (free_tiny_fast_mono_dualhot_env_box.h) - ENV: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0/1 (default 0) - Probe window: 64 (avoid bench_profile putenv race) - Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h) - Conditions: class_idx <= 3, !LARSON_FIX, route==LEGACY - Direct call: tiny_legacy_fallback_free_base() - Patch 3: Visibility (free_path_stats_box.h) - mono_dualhot_hit counter (compile-out in release) - Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh) - ENV leak protection Files modified: - core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset - core/front/malloc_tiny_fast.h: early-exit insertion - core/box/free_path_stats_box.h: counter - core/box/free_tiny_fast_mono_dualhot_env_box.h: NEW (ENV gate) - scripts/run_mixed_10_cleanenv.sh: ENV leak protection Health check: PASSED (all profiles) Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out) Rollback: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-14 19:16:49 +09:00
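The "early-exit within monolithic" pattern from Phase 9 keeps a single function body and returns early for the common small classes, rather than splitting into separate hot and cold functions. A minimal sketch using the three conditions listed in Patch 2; the enums and the function are hypothetical stand-ins, not the real `free_tiny_fast()`.

```c
typedef enum { DEMO_ROUTE_LEGACY, DEMO_ROUTE_OTHER } DemoRoute;
typedef enum { DEMO_PATH_DUALHOT_DIRECT, DEMO_PATH_GENERAL } DemoPath;

/* Reports which path a free of the given class would take. */
DemoPath demo_free_tiny_fast(unsigned class_idx, DemoRoute route, int larson_fix) {
    /* Early exit inside the one function body: C0-C3 on the legacy route. */
    if (class_idx <= 3u && !larson_fix && route == DEMO_ROUTE_LEGACY) {
        return DEMO_PATH_DUALHOT_DIRECT; /* stand-in for the direct legacy free call */
    }
    /* Everything else continues through the general path in the same function. */
    return DEMO_PATH_GENERAL;
}
```

Here `demo_free_tiny_fast(2, DEMO_ROUTE_LEGACY, 0)` takes the direct path, while class 5, a non-legacy route, or an active Larson fix all fall through to the general path.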
Moe Charm (CI)	be723ca052	Phase 8: FREE-STATIC-ROUTE ENV Cache Hardening (GO +2.61%) Results: - A/B test: +2.61% on Mixed (10-run, clean env) - Baseline: 49.26M ops/s - Optimized: 50.55M ops/s - Improvement: +1.29M ops/s (+2.61%) Strategy: - Fix ENV cache accident (the value was cached before main ran) - Add refresh mechanism to sync with bench_profile putenv - Ensure Phase 3 D1 optimization works reliably Success factors: 1. Performance improvement: +2.61% (existing win-box now reliable) 2. ENV cache accident fixed: refresh mechanism works correctly 3. Standard deviation improved: 867K → 336K ops/s (61% reduction) 4. Baseline quality improved: existing optimization now guaranteed Implementation: - Patch 1: Make ENV gate refreshable (tiny_free_route_cache_env_box.{h,c}) - Changed static int to extern _Atomic int - Added tiny_free_static_route_refresh_from_env() - Patch 2: Integrate refresh into bench_profile.h - Call refresh after bench_setenv_default() group - Patch 3: Update Makefile for new .c file ENV cache fix verification: - [FREE_STATIC_ROUTE] enabled appears twice (refresh working) - bench_profile putenv now reliably reflected Files modified: - core/box/tiny_free_route_cache_env_box.h: extern + refresh API - core/box/tiny_free_route_cache_env_box.c: NEW (global state + refresh) - core/bench_profile.h: add refresh call - Makefile: add new .o file Health check: PASSED (all profiles) Rollback: HAKMEM_FREE_STATIC_ROUTE=0 or revert Patch 1/2 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-14 18:49:08 +09:00
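A refreshable ENV gate of the kind Patch 1 describes (an `_Atomic int` plus an explicit refresh call) can be sketched as below. The variable name `DEMO_GATE`, the parsing rule ("1" enables, anything else disables), and all `demo_` functions are assumptions for illustration, not the repository's API.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = disabled, 1 = enabled */
static _Atomic int g_demo_gate = -1;

/* Assumed parsing rule: a value starting with '1' enables the gate. */
static int demo_parse_gate(const char *value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Overwrites the cached state; safe to call again after the environment changes. */
void demo_gate_refresh_from_string(const char *value) {
    atomic_store_explicit(&g_demo_gate, demo_parse_gate(value), memory_order_relaxed);
}

void demo_gate_refresh_from_env(void) {
    demo_gate_refresh_from_string(getenv("DEMO_GATE")); /* hypothetical variable */
}

/* Hot-path query: reads the cache, filling it lazily on first use. */
int demo_gate_enabled(void) {
    int v = atomic_load_explicit(&g_demo_gate, memory_order_relaxed);
    if (v < 0) {
        demo_gate_refresh_from_env();
        v = atomic_load_explicit(&g_demo_gate, memory_order_relaxed);
    }
    return v;
}
```

A purely lazy cache would keep whatever value it saw on first use, which may be before a benchmark profile has called `putenv`; calling the refresh function after the profile sets its defaults is what brings the cached value back in sync.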
Moe Charm (CI)	dcc1d42e7f	Phase 6-2: Promote Front FastLane Free DeDup (default ON) Results: - A/B test: +5.18% on Mixed (10-run, clean env) - Baseline: 46.68M ops/s - Optimized: 49.10M ops/s - Improvement: +2.42M ops/s (+5.18%) Strategy: - Eliminate duplicate header validation in front_fastlane_try_free() - Direct call to free_tiny_fast() when dedup enabled - Single validation path (no redundant checks) Success factors: 1. Complete duplicate elimination (free path optimization) 2. Free path importance (50% of Mixed workload) 3. Improved execution stability (CV: 1.00% → 0.58%) Phase 6 cumulative: - Phase 6-1 FastLane: +11.13% - Phase 6-2 Free DeDup: +5.18% - Total: ~+16-17% from baseline (multiplicative effect) Promotion: - Default: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=1 (opt-out) - Added to MIXED_TINYV3_C7_SAFE preset - Added to C6_HEAVY_LEGACY_POOLV1 preset - Rollback: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=0 Files modified: - core/box/front_fastlane_env_box.h: default 0 → 1 - core/bench_profile.h: added to presets - CURRENT_TASK.md: Phase 6-2 GO result Health check: PASSED (all profiles) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-14 17:38:21 +09:00
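The "~+16-17% (multiplicative effect)" figure follows from compounding the two gains rather than adding them. A small sketch of that arithmetic; the function name is hypothetical, and since each gain was measured against its own baseline run, the result is only an approximation.

```c
/* Compounds successive percentage gains: product of (1 + g/100), minus 1, in percent. */
double demo_compound_pct(const double *gains_pct, int n) {
    double factor = 1.0;
    for (int i = 0; i < n; i++) {
        factor *= 1.0 + gains_pct[i] / 100.0;
    }
    return (factor - 1.0) * 100.0;
}
```

For the two Phase 6 steps, 1.1113 × 1.0518 ≈ 1.1689, so `demo_compound_pct` over `{11.13, 5.18}` returns roughly +16.9%, slightly more than the plain sum of +16.31%.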
Moe Charm (CI)	ea221d057a	Phase 6: promote Front FastLane (default ON)	2025-12-14 16:28:23 +09:00

9 Commits