hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	bc2c5ded76	Phase 18 v2: BENCH_MINIMAL - NEUTRAL (+2.32% throughput, -5.06% instructions)	2025-12-15 06:02:28 +09:00

## Summary

Phase 18 v2 attempted instruction count reduction via conditional compilation:
- Stats collection → no-op
- ENV checks → constant propagation
- Binary size: 653K → 649K (-4K, -0.6%)

Result: NEUTRAL (below GO threshold)
- Throughput: +2.32% (target: +5% minimum) ❌
- Instructions: -5.06% (target: -15% minimum) ❌
- Cycles: -3.26% (positive signal)
- Branches: -8.67% (positive signal)
- Cache-misses: +30% (unexpected, likely layout)

## Analysis

Positive signals:
- Implementation correct (Branch -8.67%, Instruction -5.06%)
- Binary size reduced (-4K)
- Modest throughput gain (+2.32%)
- Cycles and branch overhead reduced

Negative signals:
- Instruction reduction insufficient (-5.06% << -15% smoking gun)
- Throughput gain below +5% threshold
- Cache-misses increased (+30%, layout noise?)

## Verdict

Freeze Phase 18 v2 (weak positive, insufficient for production).

Per user guidance: "If instructions don't drop clearly, continuation value is thin."
-5.06% instruction reduction is marginal. Allocator micro-optimization plateau confirmed.

## Key Insight

Phase 17 showed:
- IPC = 2.30 (consistent, memory-bound)
- I-cache gap: 55% (Phase 17: 153K → 68K)
- Instruction gap: 48% (Phase 17: 41.3B → 21.5B)

Phase 18 v1/v2 results confirm:
- Layout tweaks are fragile (v1: I-cache +91%)
- Instruction removal is modest benefit (v2: -5.06%)
- Allocator is NOT the bottleneck (IPC constant, memory-limited)

## Recommendation

Do NOT continue Phase 18 micro-optimizations.

Next frontier requires different approach:
1. Architectural redesign (SIMD, lock-free, batching)
2. Memory layout optimization (cache-friendly structures)
3. Broader profiling (not allocator-focused)

Or: Accept that 48M → 85M (75% gap) is achievable with current architecture.

Files:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md (results)
- CURRENT_TASK.md (Phase 18 complete status)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
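The Phase 18 v2 commit describes compiling stats collection down to a no-op and turning ENV checks into compile-time constants. A minimal sketch of that general technique follows. The flag `DEMO_BENCH_MINIMAL`, the environment variable `DEMO_ALIGN16`, and every function name here are hypothetical stand-ins, not identifiers from the hakmem tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical build flag: compile with -DDEMO_BENCH_MINIMAL=1 to strip
 * bookkeeping from the hot path. It defaults to the full build. */
#ifndef DEMO_BENCH_MINIMAL
#define DEMO_BENCH_MINIMAL 0
#endif

static unsigned long demo_alloc_count; /* stats counter (unused when minimal) */

#if DEMO_BENCH_MINIMAL
/* Stats collection becomes a no-op: the increment disappears entirely. */
#define DEMO_STAT_INC(counter) ((void)0)
/* The ENV check becomes a constant, so the compiler can fold the branch. */
static inline int demo_align16_enabled(void) { return 1; }
#else
#define DEMO_STAT_INC(counter) ((counter)++)
/* Full build: consult the (hypothetical) environment variable; default on. */
static inline int demo_align16_enabled(void) {
    const char *v = getenv("DEMO_ALIGN16");
    return v == NULL || v[0] != '0';
}
#endif

/* Toy allocation entry point that contains both kinds of overhead. */
static void *demo_alloc(size_t n) {
    assert(n > 0); /* zero-size requests are out of scope for this sketch */
    DEMO_STAT_INC(demo_alloc_count);
    if (demo_align16_enabled()) {
        n = (n + 15u) & ~(size_t)15u; /* round the request up to 16 bytes */
    }
    return malloc(n);
}
```

In the full build each call pays for a counter increment and a `getenv` lookup. In the minimal build both vanish at compile time, which is the kind of instruction reduction the commit measured.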
Moe Charm (CI)	4a070d8a14	Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)	2025-12-14 04:24:34 +09:00

Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path

Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates 2 TLS reads → 1 TLS read (50% reduction)
  - Reduces 4 branches → 3 branches (25% reduction)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median

Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%

E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
  - Branch prediction hint mismatch (UNLIKELY with always-true)
  - Retest confirmed -1.78% regression
  - Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer

Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
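The GO decision above rests on simple arithmetic: the relative change between baseline and optimized throughput, compared against a threshold. The sketch below shows one way to express that, with mean and median helpers for a set of runs. The type and function names are hypothetical, and the symmetric rule (a regression beyond the threshold counts as NO-GO, anything in between as NEUTRAL) is an assumption made for illustration, not a rule stated in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } AbVerdict;

/* Arithmetic mean of n samples (n must be > 0). */
static double ab_mean(const double *v, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += v[i];
    return sum / (double)n;
}

static int ab_cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts the array in place (n must be > 0). */
static double ab_median(double *v, size_t n) {
    assert(n > 0);
    qsort(v, n, sizeof v[0], ab_cmp_double);
    return (n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}

/* Relative change of optimized vs. baseline throughput, in percent. */
static double ab_improvement_pct(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}

/* Assumed symmetric rule: gains at or above the threshold are GO, losses at
 * or below its negation are NO-GO, everything else is NEUTRAL. */
static AbVerdict ab_classify(double pct, double go_threshold_pct) {
    if (pct >= go_threshold_pct) return AB_GO;
    if (pct <= -go_threshold_pct) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

With the reported means, `ab_improvement_pct(45.35, 46.94)` is about 3.51, so `ab_classify` with a 1.0 threshold returns `AB_GO`. The -1.78% retest quoted for E3-4 falls on the NO-GO side under the same assumed rule.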
Moe Charm (CI)	21e2e4ac2b	Phase 4 E3-4: ENV Constructor Init (+4.75% GO)	2025-12-14 02:57:35 +09:00

Target: Eliminate E1 lazy init check overhead (3.22% self%)
- E1 consolidated ENV gates but lazy check remained in hot path
- Strategy: __attribute__((constructor(101))) for pre-main init

Implementation:
- ENV gate: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0, research box)
- core/box/hakmem_env_snapshot_box.c: Constructor function added
  - Reads ENV before main() when CTOR=1
  - Refresh also syncs gate state for bench_profile putenv
- core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check
  - CTOR=1 fast path: direct global read (no lazy branch)
  - CTOR=0 fallback: legacy lazy init (rollback safe)
- Branch hints adjusted for default OFF baseline

A/B Test Results (Mixed, 10-run, 20M iters, E1=1):
- Baseline (CTOR=0): 44.28M ops/s (mean), 44.60M ops/s (median)
- Optimized (CTOR=1): 46.38M ops/s (mean), 46.53M ops/s (median)
- Improvement: +4.75% mean, +4.35% median

Decision: GO (+4.75% >> +0.5% threshold)
- Expected +0.5-1.5%, achieved +4.75%
- Lazy init branch overhead was larger than expected
- Action: Keep as research box (default OFF), evaluate promotion

Phase 4 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
- E3-4 (Constructor Init): +4.75%
- Total Phase 4: ~+8.5%

Deliverables:
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
- docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
- scripts/verify_health_profiles.sh (sanity check script)
- CURRENT_TASK.md (E3-4 complete, next instructions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
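The dual-mode check described in this commit (a constructor that captures ENV before `main()`, with the lazy path kept as a fallback) can be sketched roughly as below. This assumes GCC or Clang, since it relies on `__attribute__((constructor))` and `__builtin_expect`. The variable names, function names, and the environment variables `DEMO_SNAPSHOT_CTOR` and `DEMO_FEATURE` are hypothetical and do not come from the hakmem sources.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_UNLIKELY(x) __builtin_expect(!!(x), 0)

static int g_demo_ctor_ran;   /* set once the constructor has executed */
static int g_demo_ctor_mode;  /* 1 when the constructor captured the flag */
static int g_demo_lazy_done;  /* 1 once the lazy fallback has initialised */
static int g_demo_feature_on; /* the snapshotted flag value */

/* Treat a variable as enabled only when its value starts with '1'. */
static int demo_read_flag(const char *name) {
    assert(name != NULL);
    const char *v = getenv(name);
    return v != NULL && v[0] == '1';
}

/* Runs before main(). Priority 101 is the lowest value not reserved for the
 * implementation, so this runs ahead of default-priority constructors. */
__attribute__((constructor(101)))
static void demo_snapshot_ctor(void) {
    g_demo_ctor_ran = 1;
    if (demo_read_flag("DEMO_SNAPSHOT_CTOR")) {
        g_demo_feature_on = demo_read_flag("DEMO_FEATURE");
        g_demo_ctor_mode = 1;
    }
}

/* Dual-mode check: a plain global read when the constructor did the work,
 * otherwise the legacy lazy initialisation. */
static inline int demo_feature_enabled(void) {
    /* This hint assumes constructor mode is normally off. If the mode is
     * on for the whole run, the hint points the wrong way. */
    if (DEMO_UNLIKELY(g_demo_ctor_mode)) return g_demo_feature_on;
    if (DEMO_UNLIKELY(!g_demo_lazy_done)) {
        g_demo_feature_on = demo_read_flag("DEMO_FEATURE");
        g_demo_lazy_done = 1;
    }
    return g_demo_feature_on;
}
```

A branch hint only helps when it matches the runtime state. Marking the constructor-mode test as unlikely is reasonable for a default-off gate, but it becomes a misprediction hint once that mode is switched on for every call.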
Moe Charm (CI)	88717a8737	Phase 4 E1: ENV Snapshot Consolidation - GO (+3.92% avg, +4.01% median)	2025-12-14 00:59:12 +09:00

Target: Consolidate 3 ENV gate TLS reads → 1 TLS read
- tiny_c7_ultra_enabled_env(): 1.28% self
- tiny_front_v3_enabled(): 1.01% self
- tiny_metadata_cache_enabled(): 0.97% self
- Total overhead: 3.26% self (perf profile analysis)

Implementation:
- core/box/hakmem_env_snapshot_box.h (new): ENV snapshot struct & API
- core/box/hakmem_env_snapshot_box.c (new): TLS snapshot implementation
- core/front/malloc_tiny_fast.h: Migrated 5 call sites to snapshot
- core/box/tiny_legacy_fallback_box.h: Migrated 2 call sites
- core/box/tiny_metadata_cache_hot_box.h: Migrated 1 call site
- core/bench_profile.h: Added hakmem_env_snapshot_refresh_from_env()
- Makefile: Added hakmem_env_snapshot_box.o to build
- ENV gate: HAKMEM_ENV_SNAPSHOT=0/1 (default: 0, research box)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (E1=0): 43,617,549 ops/s (avg), 43,562,895 ops/s (median)
- Optimized (E1=1): 45,327,239 ops/s (avg), 45,309,218 ops/s (median)
- Improvement: avg +3.92%, median +4.01%

Decision: GO (+3.92% >= +2.5% threshold)
- Action: Keep as research box (default OFF) for Phase 4
- Next: Consider promotion to default in MIXED_TINYV3_C7_SAFE preset

Design Rationale:
- Shape optimizations (B3, D3) reached saturation (+0.56% NEUTRAL)
- Shift to memory/TLS overhead optimization (new optimization frontier)
- Pattern: Similar to existing tiny_front_v3_snapshot (proven approach)
- Expected: +1-3% from 3.26% ENV overhead → Achieved: +3.92%

Technical Details:
- Consolidation: 3 TLS reads → 1 TLS read (66% reduction)
- Learner interlock: tiny_metadata_cache_eff pre-computed in snapshot
- Version sync: Refreshes on small_policy_v7_version_changed()
- Fallback safety: Existing ENV gates still available when E1=0

🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
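The consolidation in this commit replaces three separate per-thread gate lookups with one lookup of a combined snapshot. A minimal sketch of that pattern follows, using C11 `_Thread_local`. The struct layout, the function names, and the `DEMO_*` environment variables are hypothetical; the commit names the real gate functions but not the underlying variable names, so none are assumed here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* All gates live in one thread-local struct, so a hot path touches
 * thread-local storage once instead of once per gate. */
typedef struct {
    bool ready;          /* false until the first refresh on this thread */
    bool c7_ultra;
    bool front_v3;
    bool metadata_cache;
} DemoEnvSnapshot;

static _Thread_local DemoEnvSnapshot t_demo_snap;

/* Treat a variable as enabled only when its value starts with '1'. */
static bool demo_env_flag(const char *name) {
    assert(name != NULL);
    const char *v = getenv(name);
    return v != NULL && v[0] == '1';
}

/* Re-read every gate from the environment. Callers that change the
 * environment at runtime must call this to resynchronise the snapshot. */
static void demo_snapshot_refresh(void) {
    t_demo_snap.c7_ultra = demo_env_flag("DEMO_C7_ULTRA");
    t_demo_snap.front_v3 = demo_env_flag("DEMO_FRONT_V3");
    t_demo_snap.metadata_cache = demo_env_flag("DEMO_METADATA_CACHE");
    t_demo_snap.ready = true;
}

/* Single accessor: one thread-local lookup yields all three gates. */
static inline const DemoEnvSnapshot *demo_snapshot(void) {
    if (!t_demo_snap.ready) demo_snapshot_refresh();
    return &t_demo_snap;
}

/* Example hot-path consumer: packs the gates into a small routing mask. */
static int demo_hot_path_route(void) {
    const DemoEnvSnapshot *s = demo_snapshot();
    return (s->c7_ultra ? 4 : 0) | (s->front_v3 ? 2 : 0) |
           (s->metadata_cache ? 1 : 0);
}
```

The trade-off is staleness: once captured, the snapshot ignores later environment changes until something refreshes it, which is presumably why the commit adds an explicit refresh hook for the benchmark profile and a version-based refresh.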

4 Commits