hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	b7085c47e1	Phase 35-39: FAST build optimization complete (+7.13% cumulative) Phase 35-A: BENCH_MINIMAL gate function elimination (GO +4.39%) - tiny_front_v3_enabled() → constant true - tiny_metadata_cache_enabled() → constant 0 - learner_v7_enabled() → constant false - small_learner_v2_enabled() → constant false Phase 36: Policy snapshot init-once (GO +0.71%) - small_policy_v7_snapshot() version check skip in BENCH_MINIMAL - TLS cache for policy snapshot Phase 37: Standard TLS cache (NO-GO -0.07%) - TLS cache for Standard build attempted - Runtime gate overhead negates benefit Phase 38: FAST/OBSERVE/Standard workflow established - make perf_fast, make perf_observe targets - Scorecard and documentation updates Phase 39: Hot path gate constantization (GO +1.98%) - front_gate_unified_enabled() → constant 1 - alloc_dualhot_enabled() → constant 0 - g_bench_fast_front, g_v3_enabled blocks → compile-out - free_dispatch_stats_enabled() → constant false Results: - FAST v3: 56.04M ops/s (47.4% of mimalloc) - Standard: 53.50M ops/s (45.3% of mimalloc) - M1 target (50%): 5.5% remaining 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-16 15:01:56 +09:00
Moe Charm (CI)	11b0e3f32b	Phase 4 D3: alloc gate shape (env-gated)	2025-12-14 00:26:57 +09:00
Moe Charm (CI)	d0f939c2eb	Phase ALLOC-GATE-SSOT-1 + ALLOC-TINY-FAST-DUALHOT-2: Structure fixes for alloc path 4 patches to eliminate allocation overhead and enable research path: Patch 1: Extract malloc_tiny_fast_for_class(size, class_idx) - SSOT: size→class conversion happens once in gate - malloc_tiny_fast() becomes thin wrapper - Foundation for eliminating duplicate lookups Patch 2: Update tiny_alloc_gate_fast() to call *_for_class - Pass class_idx computed in gate to malloc_tiny_fast_for_class() - Eliminates second hak_tiny_size_to_class() call - Impact: +1-2% expected from reduced instruction count Patch 3: Reposition DUALHOT branch (C0-C3 only) - Move class_idx <= 3 check outside alloc_dualhot_enabled() - C4-C7 no longer evaluate ENV gate (even when OFF) - Impact: Maintains neutral performance on default path Patch 4: Probe window for ENV gate - Tolerate early putenv() before probe window exhausted (64 calls) - Maintains correctness for bench_profile setenv timing A/B Results (DUALHOT=0 vs DUALHOT=1): - Mixed median: 48.75M → 48.62M ops/s (-0.27%, neutral within variance) - C6-heavy median: 23.24M → 23.63M ops/s (+1.68%, SSOT benefit) Decision: ADOPT with DUALHOT default OFF (research feature) - SSOT provides structural improvement - No regression on default configuration - C6-heavy shows SSOT effectiveness (+1.68%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-13 06:50:39 +09:00
Moe Charm (CI)	acc64f2438	Phase ML1: Reduce Pool v1 memset overhead of 89.73% (+15.34% improvement) ## Summary - ChatGPT fixed the setenv segfault in bench_profile.h (switched to going through RTLD_NEXT) - Added core/box/pool_zero_mode_box.h: unified management of ZERO_MODE via an ENV cache - core/hakmem_pool.c now controls memset according to the zero mode (FULL/header/off) - A/B test result: ZERO_MODE=header gives +15.34% improvement (1M iterations, C6-heavy) ## Files Modified - core/box/pool_api.inc.h: include pool_zero_mode_box.h - core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault) - core/hakmem_pool.c: zero mode lookup and control logic - core/box/pool_zero_mode_box.h (new): enum/getter - CURRENT_TASK.md: recorded Phase ML1 results ## Test Results \| Iterations \| ZERO_MODE=full \| ZERO_MODE=header \| Improvement \| \|-----------\|----------------\|-----------------\|------------\| \| 10K \| 3.06 M ops/s \| 3.17 M ops/s \| +3.65% \| \| 1M \| 23.71 M ops/s \| 27.34 M ops/s \| +15.34% \| 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 09:08:18 +09:00
Moe Charm (CI)	5685c2f4c9	Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete) Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being implemented, causing all cache misses to go through expensive superslab_refill registry scans. Root Cause Analysis: - Warm pool was initialized once and pushed a single slab after each refill - When that slab was exhausted, it was discarded (not pushed back) - Next refill would push another single slab, which was immediately exhausted - Pool would oscillate between 0 and 1 items, yielding 0% hit rate Solution: Secondary Prefill on Cache Miss When warm pool becomes empty, we now do multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure. Implementation Details: - Modified unified_cache_refill() cold path to detect empty pool - Added prefill loop: when pool count == 0, load 3 extra superslabs - Store extra slabs in warm pool, keep 1 in TLS for immediate carving - Track prefill events in g_warm_pool_stats[].prefilled counter Results (1M Random Mixed 256B allocations): - Before: C7 hits=1, misses=3976, hit_rate=0.0% - After: C7 hits=3929, misses=3143, hit_rate=55.6% - Throughput: 4.055M ops/s (maintained vs 4.07M baseline) - Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s) Performance Impact: - No regression: throughput remained stable at ~4.1M ops/s - Registry scan avoided in 55.6% of cache misses (significant savings) - Warm pool now functioning as intended with strong locality Configuration: - TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill - Prefill budget hardcoded to 3 (tunable via env var if needed later) - All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1 Next Steps: - Monitor for further optimization opportunities (prefill budget tuning) - Consider adaptive prefill budget based on class-specific hit rates - Validate at larger allocation counts (10M+ pending registry size fix) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-04 23:31:54 +09:00
Moe Charm (CI)	d5e6ed535c	P-Tier + Tiny Route Policy: Aggressive Superslab Management + Safe Routing ## Phase 1: Utilization-Aware Superslab Tiering (Plan B, implemented) - Add ss_tier_box.h: Classify SuperSlabs into HOT/DRAINING/FREE based on utilization - HOT (>25%): Accept new allocations - DRAINING (≤25%): Drain only, no new allocs - FREE (0%): Ready for eager munmap - Enhanced shared_pool_release_slab(): - Check tier transition after each slab release - If tier→FREE: Force remaining slots to EMPTY and call superslab_free() immediately - Bypasses LRU cache to prevent registry bloat from accumulating DRAINING SuperSlabs - Test results (bench_random_mixed_hakmem): - 1M iterations: ✅ ~1.03M ops/s (previously passed) - 10M iterations: ✅ ~1.15M ops/s (previously: registry full error) - 50M iterations: ✅ ~1.08M ops/s (stress test) ## Phase 2: Tiny Front Routing Policy (new Box) - Add tiny_route_box.h/c: Single 8-byte table for class→routing decisions - ROUTE_TINY_ONLY: Tiny front exclusive (no fallback) - ROUTE_TINY_FIRST: Try Tiny, fallback to Pool if fails - ROUTE_POOL_ONLY: Skip Tiny entirely - Profiles via HAKMEM_TINY_PROFILE ENV: - "hot": C0-C3=TINY_ONLY, C4-C6=TINY_FIRST, C7=POOL_ONLY - "conservative" (default): All TINY_FIRST - "off": All POOL_ONLY (disable Tiny) - "full": All TINY_ONLY (microbench mode) - A/B test results (ws=256, 100k ops random_mixed): - Default (conservative): ~2.90M ops/s - hot: ~2.65M ops/s (more conservative) - off: ~2.86M ops/s - full: ~2.98M ops/s (slightly best) ## Design Rationale ### Registry Pressure Fix (Plan B) - Problem: DRAINING tier SS occupied registry indefinitely - Solution: When total_active_blocks→0, immediately free to clear registry slot - Result: No more "registry full" errors under stress ### Routing Policy Box (new) - Problem: Tiny front optimization scattered across ENV/branches - Solution: Centralize routing in single table, select profiles via ENV - Benefit: Safe A/B testing without touching hot path code - Future: Integrate with RSS budget/learning layers for dynamic profile switching ## Next Steps (performance optimization) - Profile Tiny front internals (TLS SLL, FastCache, Superslab backend latency) - Identify bottleneck between current ~2.9M ops/s and mimalloc ~100M ops/s - Consider: - Reduce shared pool lock contention - Optimize unified cache hit rate - Streamline Superslab carving logic 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-04 18:01:25 +09:00
Moe Charm (CI)	3a2e466af1	Add lightweight Fail-Fast layer to Gatekeeper Boxes Core Changes: - Modified: core/box/tiny_free_gate_box.h * Added address range check in tiny_free_gate_try_fast() (line 142) * Catches obviously invalid pointers (addr < 4096) * Rejects fast path for garbage pointers, delegates to slow path * Logs [TINY_FREE_GATE_RANGE_INVALID] (debug-only, max 8 messages) * Cost: ~1 cycle (comparison + unlikely branch) * Behavior: Fails safe by delegating to hak_tiny_free() slow path - Modified: core/box/tiny_alloc_gate_box.h * Added range check for malloc_tiny_fast() return value (line 143) * Debug-only: Checks if returned user_ptr has addr < 4096 * On failure: Logs [TINY_ALLOC_GATE_RANGE_INVALID] and calls abort() * Release build: Entire check compiled out (zero overhead) * Rationale: Invalid allocator return is catastrophic - fail immediately Design Rationale: - Early detection of memory corruption/undefined behavior - Conservative threshold (4096) captures NULL and kernel space - Free path: Graceful degradation (delegate to slow path) - Alloc path: Hard fail (allocator corruption is non-recoverable) - Zero performance impact in production (Release) builds - Debug-only diagnostic output prevents log spam Fail-Fast Strategy: - Layer 3a: Address range sanity check (always enabled) * Rejects addr < 4096 (NULL, low memory garbage) * Free: delegates to slow path (safe fallback) * Alloc: aborts (corruption indicator) - Layer 3b: Detailed Bridge/Header validation (ENV-controlled) * Traditional HAKMEM_TINY_FREE_GATE_DIAG / HAKMEM_TINY_ALLOC_GATE_DIAG * For advanced debugging and observability Testing: - Compilation: RELEASE=0 and RELEASE=1 both successful - Smoke tests: 3/3 passed (simple_alloc, loop 10M, pool_tls) - Performance: No regressions detected - Address threshold (4096): Conservative, minimizes false positives - Verified via Task agent (PASS verdict) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-04 12:36:32 +09:00
Moe Charm (CI)	291c84a1a7	Add Tiny Alloc Gatekeeper Box for unified malloc entry point Core Changes: - New file: core/box/tiny_alloc_gate_box.h * Thin wrapper around malloc_tiny_fast() with diagnostic hooks * TinyAllocGateContext structure for size/class_idx/user/base/bridge information * tiny_alloc_gate_diag_enabled() - ENV-controlled diagnostic mode * tiny_alloc_gate_validate() - Validates class_idx/header/meta consistency * tiny_alloc_gate_fast() - Main gatekeeper function * Zero performance impact when diagnostics disabled - Modified: core/box/hak_wrappers.inc.h * Added #include "tiny_alloc_gate_box.h" (line 35) * Integrated gatekeeper into malloc wrapper (lines 198-200) * Diagnostic mode via HAKMEM_TINY_ALLOC_GATE_DIAG env var Design Rationale: - Complements Free Gatekeeper Box: Together they provide entry/exit hooks - Validates allocation consistency at malloc time - Enables Bridge + BASE/USER conversion validation in debug mode - Maintains backward compatibility: existing behavior unchanged Validation Features: - tiny_ptr_bridge_classify_raw() - Verifies Superslab/Slab/meta lookup - Header vs meta class consistency check (rate-limited, 8 msgs max) - class_idx validation via hak_tiny_size_to_class() - All validation logged but non-blocking (observation points for Guard) Testing: - All smoke tests pass (10M malloc/free cycles, pool TLS, real programs) - Diagnostic mode validated with HAKMEM_TINY_ALLOC_GATE_DIAG=1 - No regressions in existing functionality - Verified via Task agent (PASS verdict) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-04 12:06:14 +09:00
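
The utilization tiers described in commit d5e6ed535c (HOT above 25%, DRAINING at 25% or below, FREE at 0%) can be sketched roughly as follows. This is only an illustration under stated assumptions: the type and function names (`ss_tier_t`, `ss_tier_classify`, `ss_tier_accepts_alloc`) and the `active_blocks`/`capacity` parameters are hypothetical, since the log does not show the actual contents of ss_tier_box.h.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names: the real ss_tier_box.h API is not shown in the log. */
typedef enum { SS_TIER_FREE, SS_TIER_DRAINING, SS_TIER_HOT } ss_tier_t;

/* Classify a SuperSlab by utilization, using the thresholds from the
 * commit message: FREE at 0%, DRAINING at <= 25%, HOT above 25%. */
static inline ss_tier_t ss_tier_classify(uint32_t active_blocks,
                                         uint32_t capacity)
{
    assert(capacity > 0 && active_blocks <= capacity);
    if (active_blocks == 0)
        return SS_TIER_FREE;
    /* active/capacity <= 25% is the same as 4*active <= capacity,
     * which avoids floating point; widen to 64 bits to avoid overflow. */
    if ((uint64_t)active_blocks * 4 <= capacity)
        return SS_TIER_DRAINING;
    return SS_TIER_HOT;
}

/* Only HOT SuperSlabs accept new allocations; DRAINING ones are
 * drain-only and FREE ones are candidates for immediate release. */
static inline int ss_tier_accepts_alloc(ss_tier_t tier)
{
    return tier == SS_TIER_HOT;
}
```

Under this sketch, a release path would re-run the classification after each slab release and free the SuperSlab as soon as the result is `SS_TIER_FREE`, which corresponds to the tier transition check the commit message describes for shared_pool_release_slab().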

8 Commits
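
The routing profiles listed in commit d5e6ed535c map each of the eight tiny classes (C0-C7) to one of three routes through a single 8-byte table. A minimal sketch of that idea is shown here, assuming one byte per class. The `ROUTE_*` names and the profile strings come from the commit message; `g_tiny_route`, `tiny_route_load_profile`, and `tiny_route_for_class` are hypothetical names, not the actual tiny_route_box.h API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { ROUTE_TINY_ONLY, ROUTE_TINY_FIRST, ROUTE_POOL_ONLY } tiny_route_t;

enum { TINY_ROUTE_CLASSES = 8 };

/* One byte per class, so the whole table is 8 bytes (hypothetical name).
 * It is zero-initialized until a profile is loaded. */
static uint8_t g_tiny_route[TINY_ROUTE_CLASSES];

/* Fill the table from a profile name. Unknown or NULL names fall back to
 * "conservative" (all TINY_FIRST), which the log gives as the default. */
static void tiny_route_load_profile(const char *profile)
{
    uint8_t fill = ROUTE_TINY_FIRST;

    if (profile != NULL && strcmp(profile, "hot") == 0) {
        /* "hot": C0-C3 = TINY_ONLY, C4-C6 = TINY_FIRST, C7 = POOL_ONLY */
        for (int c = 0; c < TINY_ROUTE_CLASSES; c++) {
            g_tiny_route[c] = (c <= 3) ? ROUTE_TINY_ONLY
                            : (c <= 6) ? ROUTE_TINY_FIRST
                                       : ROUTE_POOL_ONLY;
        }
        return;
    }
    if (profile != NULL && strcmp(profile, "off") == 0)
        fill = ROUTE_POOL_ONLY;   /* disable the Tiny front */
    else if (profile != NULL && strcmp(profile, "full") == 0)
        fill = ROUTE_TINY_ONLY;   /* microbench mode */

    memset(g_tiny_route, fill, sizeof g_tiny_route);
}

/* Hot-path lookup: a single indexed byte load, no ENV access. */
static inline tiny_route_t tiny_route_for_class(int class_idx)
{
    assert(class_idx >= 0 && class_idx < TINY_ROUTE_CLASSES);
    return (tiny_route_t)g_tiny_route[class_idx];
}
```

In a setup like this, the loader would presumably be called once at initialization with the value of the HAKMEM_TINY_PROFILE environment variable, so that switching profiles for an A/B run changes only the table contents and leaves the hot path code untouched.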