hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	84f5034e45	Phase 68: PGO training set diversification (seed/WS expansion) Changes: - scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3) for reduced overfitting and better production workload representativeness - PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%) - CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active Results: - 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold) - M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp) - Stability: 10-run mean/median with <2.1% CV 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-17 21:08:17 +09:00
Moe Charm (CI)	f99ef77ad7	Phase 29: Pool Hotbox v2 Stats Prune - NO-OP (infrastructure ready) Target: g_pool_hotbox_v2_stats atomics (12 total) in Pool v2 Result: 0.00% impact (code path inactive by default, ENV-gated) Verdict: NO-OP - Maintain compile-out for future-proofing Audit Results: - Classification: 12/12 TELEMETRY (100% observational) - Counters: alloc_calls, alloc_fast, alloc_refill, alloc_refill_fail, alloc_fallback_v1, free_calls, free_fast, free_fallback_v1, page_of_fail_* (4 failure counters) - Verification: All stats/logging only, zero flow control usage - Phase 28 lesson applied: Traced all usages, confirmed no CORRECTNESS Key Finding: Pool v2 OFF by default - Requires HAKMEM_POOL_V2_ENABLED=1 to activate - Benchmark never executes Pool v2 code paths - Compile-out has zero performance impact (code never runs) Implementation (future-ready): - Added HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED (default: 0) - Wrapped 13 atomic write sites in core/hakmem_pool.c - Pattern: #if HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED ... #endif - Expected impact if Pool v2 enabled: +0.3~0.8% (HOT+WARM atomics) A/B Test Results: - Baseline (COMPILED=0): 52.98 M ops/s (±0.43M, 0.81% stdev) - Research (COMPILED=1): 53.31 M ops/s (±0.80M, 1.50% stdev) - Delta: -0.62% (noise, not real effect - code path not active) Critical Lesson Learned (NEW): Phase 29 revealed ENV-gated features can appear on hot paths but never execute. Updated audit checklist: 1. Classify atomics (CORRECTNESS vs TELEMETRY) 2. Verify no flow control usage 3. NEW: Verify code path is ACTIVE in benchmark (check ENV gates) 4. Implement compile-out 5. A/B test Verification methods added to documentation: - rg "getenv.*FEATURE" to check ENV gates - perf record/report to verify execution - Debug printf for quick validation Cumulative Progress (Phase 24-29): - Phase 24 (class stats): +0.93% GO - Phase 25 (free stats): +1.07% GO - Phase 26 (diagnostics): -0.33% NEUTRAL - Phase 27 (unified cache): +0.74% GO - Phase 28 (bg spill): NO-OP (all CORRECTNESS) - Phase 29 (pool v2): NO-OP (inactive code path) - Total: 17 atomics removed, +2.74% improvement Documentation: - PHASE29_POOL_HOTBOX_V2_AUDIT.md: Complete audit with TELEMETRY classification - PHASE29_POOL_HOTBOX_V2_STATS_RESULTS.md: Results + new lesson learned - ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md: Updated with Phase 29 + new checklist - PHASE29_COMPLETE.md: Completion summary with recommendations Decision: Keep compile-out despite NO-OP - Code cleanliness (binary size reduction) - Future-proofing (ready when Pool v2 enabled) - Consistency with Phase 24-28 pattern Generated with Claude Code https://claude.com/claude-code Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-16 06:33:41 +09:00
Moe Charm (CI)	19056282b6	Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO] Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path - Strategy: Cache wrapper env configuration pointer in TLS - Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*) Implementation: - core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE) - core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast) - core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths - ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF) A/B Test Results (Mixed, 10-run, 20M iters): - Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median) - Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median) - Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO) Analysis: - Regression cause: TLS cache adds overhead (branch + TLS access) - wrapper_env_cfg() is already minimal (pointer return after simple check) - Adding TLS caching layer makes it worse, not better - Branch prediction penalty outweighs any potential savings Cumulative Phase 2-3: - B3: +2.89%, B4: +1.47%, C3: +2.20% - D1: +1.06% (opt-in), D2: -1.44% (NO-GO) - Total: ~7.2% (excluding D2) Decision: FREEZE as research box (default OFF, regression confirmed) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-13 22:03:27 +09:00
Moe Charm (CI)	acc64f2438	Phase ML1: Pool v1 memset 89.73% overhead 軽量化 (+15.34% improvement) ## Summary - ChatGPT により bench_profile.h の setenv segfault を修正（RTLD_NEXT 経由に切り替え） - core/box/pool_zero_mode_box.h 新設：ENV キャッシュ経由で ZERO_MODE を統一管理 - core/hakmem_pool.c で zero mode に応じた memset 制御（FULL/header/off） - A/B テスト結果：ZERO_MODE=header で +15.34% improvement（1M iterations, C6-heavy） ## Files Modified - core/box/pool_api.inc.h: pool_zero_mode_box.h include - core/bench_profile.h: glibc setenv → malloc+putenv（segfault 回避） - core/hakmem_pool.c: zero mode 参照・制御ロジック - core/box/pool_zero_mode_box.h (新設): enum/getter - CURRENT_TASK.md: Phase ML1 結果記載 ## Test Results \| Iterations \| ZERO_MODE=full \| ZERO_MODE=header \| Improvement \| \|-----------\|----------------\|-----------------\|------------\| \| 10K \| 3.06 M ops/s \| 3.17 M ops/s \| +3.65% \| \| 1M \| 23.71 M ops/s \| 27.34 M ops/s \| +15.34% \| 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 09:08:18 +09:00
Moe Charm (CI)	a905e0ffdd	Guard madvise ENOMEM and stabilize pool/tiny front v3	2025-12-09 21:50:15 +09:00
Moe Charm (CI)	bd5e97f38a	Save current state before investigating TLS_SLL_HDR_RESET	2025-12-03 10:34:39 +09:00
Moe Charm (CI)	4ef0171bc0	feat: Add ACE allocation failure tracing and debug hooks This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations. Key changes include: - ACE Tracing Implementation: - Added environment variable to enable/disable detailed logging of allocation failures. - Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure). - Build System Fixes: - Corrected to ensure is properly linked into , resolving an error. - LD_PRELOAD Wrapper Adjustments: - Investigated and understood the wrapper's behavior under , particularly its interaction with and checks. - Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator. - Debugging & Verification: - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed. - Created to facilitate testing of the tracing features. This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.	2025-12-01 16:37:59 +09:00
Moe Charm (CI)	03ba62df4d	Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified Summary: - Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s) - PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM) - Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization Phase 23 Changes: 1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h}) - Direct SuperSlab carve (TLS SLL bypass) - Self-contained pop-or-refill pattern - ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128 2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h) - Unified ON → direct cache access (skip all intermediate layers) - Alloc: unified_cache_pop_or_refill() → immediate fail to slow - Free: unified_cache_push() → fallback to SLL only if full PageFaultTelemetry Changes: 3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h}) - PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement - Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked() 4. Measurement results (Random Mixed 500K / 256B): - Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page) - SSM: 512 pages (initialization footprint) - MID/L25: 0 (unused in this workload) - Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny) Ring Cache Enhancements: 5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h}) - ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size - Conditional compilation cleanup Documentation: 6. Analysis reports - RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown - RANDOM_MIXED_SUMMARY.md: Phase 23 summary - RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage - CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan Next Steps (Phase 24): - Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K) - Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal) - Expected improvement: +30-50% for Mid/Large workloads Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 02:47:58 +09:00
Moe Charm (CI)	1010a961fb	Tiny: fix header/stride mismatch and harden refill paths - Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte header during allocation, but linear carve/refill and initial slab capacity still used bare class block sizes. This mismatch could overrun slab usable space and corrupt freelists, causing reproducible SEGV at ~100k iters. Changes - Superslab: compute capacity with effective stride (block_size + header for classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a debug-only bound check in superslab_alloc_from_slab() to fail fast if carve would exceed usable bytes. - Refill (non-P0 and P0): use header-aware stride for all linear carving and TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h also uses stride, not raw class size. - Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes before splicing into freelist (already present). Notes - This unifies the memory layout across alloc/linear-carve/refill with a single stride definition and keeps class7 (1024B) headerless as designed. - Debug builds add fail-fast checks; release builds remain lean. Next - Re-run Tiny benches (256/1024B) in debug to confirm stability, then in release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0 to isolate P0 batch carve, and continue reducing branch-miss as planned.	2025-11-09 18:55:50 +09:00

9 Commits