hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	90c7f148fc	Larson Fix: Increase batch refill from 64 to 128 blocks to reduce lock contention Root Cause (identified via perf profiling): - shared_pool_acquire_slab() consumed 85% CPU (lock contention) - 19,372 locks/sec (1 lock per ~10 allocations) - Only ~64 blocks carved per SuperSlab refill → frequent lock acquisitions Fix Applied: 1. Increased HAKMEM_TINY_REFILL_DEFAULT from 64 → 128 blocks 2. Added larson targets to Pool TLS auto-enable in build.sh 3. Increased refill max ceiling from 256 → 512 (allows future tuning) Expected Impact: - Lock frequency: 19K → ~1.6K locks/sec (12x reduction) - Target performance: 0.74M → ~3-5M ops/sec (4-7x improvement) Known Issues: - Multi-threaded Larson (>1 thread) has pre-existing crash bug (NOT caused by this change) - Verified: Original code also crashes with >1 thread - Single-threaded Larson works fine: ~480-792K ops/sec - Root cause: "Node pool exhausted for class 7" → requires separate investigation Files Modified: - core/hakmem_build_flags.h: HAKMEM_TINY_REFILL_DEFAULT 64→128 - build.sh: Enable Pool TLS for larson targets Related: - Task agent report: LARSON_CATASTROPHIC_SLOWDOWN_ROOT_CAUSE.md - Priority 1 fix from 4-step optimization plan 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 22:09:14 +09:00
Moe Charm (CI)	696aa7c0b9	CRITICAL FIX: Restore mincore() safety checks in classify_ptr() and free wrapper Root Cause: - Phase 9 gutted hak_is_memory_readable() to always return 1 (unsafe!) - classify_ptr() Step 3 and free wrapper AllocHeader dispatch both relied on this - Result: SEGV when freeing external pointers (e.g. 0x5555... executable area) - Crash: hdr->magic dereference at unmapped memory (page boundary crossing) Fix (2-file, minimal patch): 1. core/box/front_gate_classifier.c (Line 211-230): - REMOVED unsafe AllocHeader probe from classify_ptr() - Return PTR_KIND_UNKNOWN immediately after registry lookups fail - Let free wrapper handle unknown pointers safely 2. core/box/hak_free_api.inc.h (Line 194-211): - RESTORED real mincore() check before AllocHeader dereference - Check BOTH pages if header crosses page boundary (40-byte header) - Only dereference hdr->magic if memory is verified mapped Verification: - ws=4096 benchmark: 10/10 runs passed (was: 100% crash) - Exit code: 0 (was: 139/SIGSEGV) - Crash location: eliminated (was: classify_ptr+298, hdr->magic read) Performance Impact: - Minimal (only affects unknown pointers, rare case) - mincore() syscall only when ptr NOT in Pool/SuperSlab registries Files Changed: - core/box/front_gate_classifier.c (+20 simplified, -30 unsafe) - core/box/hak_free_api.inc.h (+16 mincore check)	2025-11-14 06:09:02 +09:00
Moe Charm (CI)	030132f911	Phase 10: TLS/SFC aggressive cache tuning (syscall reduction failed) Goal: Reduce backend transitions by increasing frontend hit rate Result: +2% best case, syscalls unchanged (root cause: SuperSlab churn) Implementation: 1. Cache capacity expansion (2-8x per-class) - Hot classes (C0-C3): 4x increase (512 slots) - Medium classes (C4-C6): 2-3x increase - Class 7 (1KB): 2x increase (128 slots) - Fast cache: 2x default capacity 2. Refill batch size increase (4-8x) - Global default: 16 → 64 (4x) - Hot classes: 128 (8x) via HAKMEM_TINY_REFILL_COUNT_HOT - Mid classes: 96 (6x) via HAKMEM_TINY_REFILL_COUNT_MID - Class 7: 64 → 128 (2x) - SFC refill: 64 → 128 (2x) 3. Adaptive sizing aggressive parameters - Grow threshold: 80% → 70% (expand earlier) - Shrink threshold: 20% → 10% (shrink less) - Growth rate: 2x → 1.5x (smoother growth) - Max capacity: 2048 → 4096 (2x ceiling) - Adapt frequency: Every 10 → 5 refills (more responsive) Performance Results (100K iterations): Before (Phase 9): - Performance: 9.71M ops/s - Syscalls: 1,729 (mmap:877, munmap:852) After (Phase 10): - Default settings: 8.77M ops/s (-9.7%) ⚠️ - Optimal ENV: 9.89M ops/s (+2%) ✅ - Syscalls: 1,729 (unchanged) ❌ Optimal ENV configuration: export HAKMEM_TINY_REFILL_COUNT_HOT=256 export HAKMEM_TINY_REFILL_COUNT_MID=192 Root Cause Analysis: Bottleneck is NOT TLS/SFC hit rate, but SuperSlab allocation churn: - 877 SuperSlabs allocated (877MB via mmap) - Phase 9 LRU cache not utilized (no frees during benchmark) - All SuperSlabs retained until program exit - System malloc: 9 syscalls vs HAKMEM: 1,729 syscalls (192x gap) Conclusion: TLS/SFC tuning cannot solve SuperSlab allocation policy problem. Next step: Phase 11 SuperSlab Prewarm strategy to eliminate mmap/munmap during benchmark execution. ChatGPT review: Strategy validated, Option A (Prewarm) recommended. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-13 14:25:54 +09:00
Moe Charm (CI)	862e8ea7db	Infrastructure and build updates - Update build configuration and flags - Add missing header files and dependencies - Update TLS list implementation with proper scoping - Fix various compilation warnings and issues - Update debug ring and tiny allocation infrastructure - Update benchmark results documentation Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>	2025-11-11 21:49:05 +09:00
HakmemBot	5b31629650	tiny: fix TLS list next_off scope; default TLS_LIST=1; add sentinel guards; header-aware TLS ops; release quiet for benches	2025-11-11 10:00:36 +09:00
Moe Charm (CI)	0da9f8cba3	Phase 7 + Pool TLS 1.5b stabilization:\n- Add build hygiene (dep tracking, flag consistency, print-flags)\n- Add build.sh + verify_build.sh (unified recipe, freshness check)\n- Quiet verbose logs behind HAKMEM_DEBUG_VERBOSE\n- A/B free safety via HAKMEM_TINY_SAFE_FREE (mincore strict vs boundary)\n- Tweak Tiny header path to reduce noise; Pool TLS free guard optimized\n- Fix mimalloc link retention (--no-as-needed + force symbol)\n- Add docs/BUILD_PHASE7_POOL_TLS.md (cheatsheet)	2025-11-09 11:50:18 +09:00
Moe Charm (CI)	7975e243ee	Phase 7 Task 3: Pre-warm TLS cache (+180-280% improvement!) MAJOR SUCCESS: HAKMEM now achieves 85-92% of System malloc on tiny allocations (128-512B) and BEATS System at 146% on 1024B allocations! Performance Results: - Random Mixed 128B: 21M → 59M ops/s (+181%) 🚀 - Random Mixed 256B: 19M → 70M ops/s (+268%) 🚀 - Random Mixed 512B: 21M → 68M ops/s (+224%) 🚀 - Random Mixed 1024B: 21M → 65M ops/s (+210%, 146% of System!) 🏆 - Larson 1T: 2.68M ops/s (stable, no regression) Implementation: 1. Task 3a: Remove profiling overhead in release builds - Wrapped RDTSC calls in #if !HAKMEM_BUILD_RELEASE - Compiler can eliminate profiling code completely - Effect: +2% (2.68M → 2.73M Larson) 2. Task 3b: Simplify refill logic - Use constants from hakmem_build_flags.h - TLS cache already optimal - Effect: No regression 3. Task 3c: Pre-warm TLS cache (GAME CHANGER!) - Pre-allocate 16 blocks per class at init - Eliminates cold-start penalty - Effect: +180-280% improvement 🚀 Root Cause: The bottleneck was cold-start, not the hot path! First allocation in each class triggered a SuperSlab refill (100+ cycles). Pre-warming eliminated this penalty, revealing Phase 7's true potential. Files Modified: - core/hakmem_tiny.c: Pre-warm function implementation - core/box/hak_core_init.inc.h: Pre-warm initialization call - core/tiny_alloc_fast.inc.h: Profiling overhead removal - core/hakmem_phase7_config.h: Task 3 constants (NEW) - core/hakmem_build_flags.h: Phase 7 feature flags - Makefile: PREWARM_TLS flag, phase7 targets - CLAUDE.md: Phase 7 success summary - PHASE7_TASK3_RESULTS.md: Comprehensive results report (NEW) Build: make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 phase7-bench 🎉 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-08 12:54:52 +09:00
Moe Charm (CI)	1da8754d45	CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消問題: - Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走) - System/mimalloc は 4T で 33.52M ops/s 正常動作 - SS OFF + Remote OFF でも 4T で SEGV 根本原因: (Task agent ultrathink 調査結果) ``` CRASH: mov (%r15),%r13 R15 = 0x6261 ← ASCII "ba" (ゴミ値、未初期化TLS) ``` Worker スレッドの TLS 変数が未初期化: - `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← 初期化なし - pthread_create() で生成されたスレッドでゼロ初期化されない - NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV 修正内容: 全 TLS 配列に明示的初期化子 `= {0}` を追加: 1. core/hakmem_tiny.c: - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}` - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}` - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}` - `g_tls_bcur[TINY_NUM_CLASSES] = {0}` - `g_tls_bend[TINY_NUM_CLASSES] = {0}` 2. core/tiny_fastcache.c: - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}` 3. core/hakmem_tiny_magazine.c: - `g_tls_mags[TINY_NUM_CLASSES] = {0}` 4. core/tiny_sticky.c: - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}` - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}` - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}` 効果: ``` Before: 1T: 2.09M ✅ \| 4T: SEGV 💀 After: 1T: 2.41M ✅ \| 4T: 4.19M ✅ (+15% 1T, SEGV解消) ``` テスト: ```bash # 1 thread: 完走 ./larson_hakmem 2 8 128 1024 1 12345 1 → Throughput = 2,407,597 ops/s ✅ # 4 threads: 完走（以前は SEGV） ./larson_hakmem 2 8 128 1024 1 12345 4 → Throughput = 4,192,155 ops/s ✅ ``` 調査協力: Task agent (ultrathink mode) による完璧な根本原因特定 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-07 01:27:04 +09:00
Moe Charm (CI)	52386401b3	Debug Counters Implementation - Clean History Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 12:31:14 +09:00

9 Commits