hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	8feeb63c2b	release: silence runtime logs and stabilize benches - Fix HAKMEM_LOG gating to use (numeric) so release builds compile out logs. - Switch remaining prints to HAKMEM_LOG or guard with : - core/box/hak_core_init.inc.h (EVO sample warning, shutdown banner) - core/hakmem_config.c (config/feature prints) - core/hakmem.c (BigCache eviction prints) - core/hakmem_tiny_superslab.c (OOM, head init/expand, C7 init diagnostics) - core/hakmem_elo.c (init/evolution) - core/hakmem_batch.c (init/flush/stats) - core/hakmem_ace.c (33KB route diagnostics) - core/hakmem_ace_controller.c (ACE logs macro → no-op in release) - core/hakmem_site_rules.c (init banner) - core/box/hak_free_api.inc.h (unknown method error → release-gated) - Rebuilt benches and verified quiet output for release: - bench_fixed_size_hakmem/system - bench_random_mixed_hakmem/system - bench_mid_large_mt_hakmem/system - bench_comprehensive_hakmem/system Note: Kept debug logs available in debug builds and when explicitly toggled via env.	2025-11-11 01:47:06 +09:00
Moe Charm (CI)	518bf29754	Fix TLS-SLL splice alignment issue causing SIGSEGV - core/box/tls_sll_box.h: Normalize splice head, remove heuristics, fix misalignment guard - core/tiny_refill_opt.h: Add LINEAR_LINK debug logging after carve - core/ptr_trace.h: Fix function declaration conflicts for debug builds - core/hakmem.c: Add stdatomic.h include and ptr_trace_dump_now declaration Fixes misaligned memory access in splice_trav that was causing SIGSEGV. TLS-SLL GUARD identified: base=0x7244b7e10009 (should be 0x7244b7e10401) Preserves existing ptr=0xa0 guard for small pointer free detection. Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>	2025-11-10 23:41:53 +09:00
Moe Charm (CI)	382980d450	Phase 6-2.4: Fix SuperSlab free SEGV: remove guess loop and add memory readability check; add registry atomic consistency (base as _Atomic uintptr_t with acq/rel); add debug toggles (SUPER_REG_DEBUG/REQTRACE); update CURRENT_TASK with results and next steps; capture suite results.	2025-11-07 18:07:48 +09:00
Moe Charm (CI)	77ed72fcf6	Fix: LIBC/HAKMEM mixed allocation crashes (0% → 80% success) Problem: 4T Larson crashed 100% due to "free(): invalid pointer" Root Causes (6 bugs found via Task Agent ultrathink): 1. Invalid magic fallback (`hak_free_api.inc.h:87`) - When `hdr->magic != HAKMEM_MAGIC`, ptr came from LIBC (no header) - Was calling `free(raw)` where `raw = ptr - HEADER_SIZE` (garbage!) - Fixed: Use `__libc_free(ptr)` instead 2. BigCache eviction (`hakmem.c:230`) - Same issue: invalid magic means LIBC allocation - Fixed: Use `__libc_free(ptr)` directly 3. Malloc wrapper recursion (`hakmem_internal.h:209`) - `hak_alloc_malloc_impl()` called `malloc()` → wrapper recursion - Fixed: Use `__libc_malloc()` directly 4. ALLOC_METHOD_MALLOC free (`hak_free_api.inc.h:106`) - Was calling `free(raw)` → wrapper recursion - Fixed: Use `__libc_free(raw)` directly 5. fopen/fclose crash (`hakmem_tiny_superslab.c:131`) - `log_superslab_oom_once()` used `fopen()` → FILE buffer via wrapper - `fclose()` calls `__libc_free()` on HAKMEM-allocated buffer → crash - Fixed: Wrap with `g_hakmem_lock_depth++/--` to force LIBC path 6. g_hakmem_lock_depth visibility (`hakmem.c:163`) - Was `static`, needed by hakmem_tiny_superslab.c - Fixed: Remove `static` keyword Result: 4T Larson success rate improved 0% → 80% (8/10 runs) ✅ Remaining: 20% crash rate still needs investigation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-07 02:48:20 +09:00
Moe Charm (CI)	1da8754d45	CRITICAL FIX: Fully resolve the 4T SEGV caused by uninitialized TLS Problem: - Larson 4T hit SEGV 100% of the time (1T completed at 2.09M ops/s) - System/mimalloc ran normally with 4T at 33.52M ops/s - SEGV with 4T even with SS OFF + Remote OFF Root cause: (findings of the Task agent ultrathink investigation) ``` CRASH: mov (%r15),%r13 R15 = 0x6261 ← ASCII "ba" (garbage value, uninitialized TLS) ``` TLS variables in worker threads were uninitialized: - `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← no initializer - Not zero-initialized in threads created by pthread_create() - NULL check passes (0x6261 != NULL) → dereference → SEGV Fix: Added an explicit `= {0}` initializer to all TLS arrays: 1. core/hakmem_tiny.c: - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}` - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}` - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}` - `g_tls_bcur[TINY_NUM_CLASSES] = {0}` - `g_tls_bend[TINY_NUM_CLASSES] = {0}` 2. core/tiny_fastcache.c: - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}` 3. core/hakmem_tiny_magazine.c: - `g_tls_mags[TINY_NUM_CLASSES] = {0}` 4. core/tiny_sticky.c: - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}` - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}` - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}` Effect: ``` Before: 1T: 2.09M ✅ | 4T: SEGV 💀 After: 1T: 2.41M ✅ | 4T: 4.19M ✅ (+15% 1T, SEGV resolved) ``` Test: ```bash # 1 thread: completes ./larson_hakmem 2 8 128 1024 1 12345 1 → Throughput = 2,407,597 ops/s ✅ # 4 threads: completes (previously SEGV) ./larson_hakmem 2 8 128 1024 1 12345 4 → Throughput = 4,192,155 ops/s ✅ ``` Investigation support: root cause pinpointed by the Task agent (ultrathink mode) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-07 01:27:04 +09:00
Claude	5ec9d1746f	Option A (Full): Inline TLS cache access in malloc() Implementation: 1. Added g_initialized check to fast path (skip bootstrap overhead) 2. Inlined hak_tiny_size_to_class() - LUT lookup (~1 load) 3. Inlined TLS cache pop - direct g_tls_sll_head access (3-4 instructions) 4. Eliminated function call overhead on fast path hit Result: +11.5% improvement (1.31M → 1.46M ops/s avg, threads=4) - Before: Function call + internal processing (~15-20 instructions) - After: LUT + TLS load + pop + return (~5-6 instructions) Still below target (1.81M ops/s). Next: RDTSC profiling to identify remaining bottleneck.	2025-11-05 07:07:47 +00:00
Claude	6550cd3970	Remove overhead: diagnostic + counters for fast path ### Changes: 1. Removed diagnostic from wrapper (hakmem_tiny.c:1542) - Was: getenv() + fprintf() on every wrapper call - Now: Direct return tiny_alloc_fast(size) - Relies on LTO (-flto) for inlining 2. Removed counter overhead from malloc() (hakmem.c:1242) - Was: 4 TLS counter increments per malloc - g_malloc_total_calls++ - g_malloc_tiny_size_match++ - g_malloc_fast_path_tried++ - g_malloc_fast_path_null++ (on miss) - Now: Zero counter overhead ### Performance Results: ``` Before (with overhead): 1.51M ops/s After (zero overhead): 1.59M ops/s (+5% 🎉) Baseline (old impl): 1.68M ops/s (-5% gap remains) System malloc: 8.08M ops/s (reference) ``` ### Analysis: What was heavy: - Counter increments: ~4 TLS writes per malloc (cache pollution) - Diagnostic: getenv() + fprintf() check (even if disabled) - These added ~80K ops/s overhead Remaining gap (-5% vs baseline): Box Theory (1.59M) vs Old implementation (1.68M) - Likely due to: ownership check in free path - Or: refill backend (sll_refill_small_from_ss vs hak_tiny_alloc x16) ### Bottleneck Update: From profiling data (2,418 cycles per fast path): ``` Fast path time: 49.5M cycles (49.1% of total) Refill time: 51.3M cycles (50.9% of total) Counter overhead removed: ~5% improvement LTO should inline wrapper: Further gains expected ``` ### Status: ✅ IMPROVEMENT - Removed overhead, 5% faster ❌ STILL SHORT - 5% slower than baseline (1.68M target) ### Next Steps: A. Investigate ownership check overhead in free path B. Compare refill backend efficiency C. Consider reverting to old implementation if gap persists Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md	2025-11-05 06:25:29 +00:00
Claude	08593fea14	Fix: Box Theory routing - direct call before guards ### Problem Identified: Previous commit routed malloc() → guards → hak_alloc_at() → Box Theory This added massive overhead (guard checks, function calls) defeating the "3-4 instruction" fast path promise. ### Root Cause: "It makes no sense for fewer instructions to run slower" - User's insight was correct! Box Theory claims 3-4 instructions, but routing added dozens of instructions before reaching TLS freelist pop. ### Fix: Move Box Theory call to malloc() entry point (line ~1253), BEFORE all guards: ```c #ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR if (size <= TINY_FAST_THRESHOLD) { void* ptr = hak_tiny_alloc_fast_wrapper(size); if (ptr) return ptr; // ✅ Fast path: No guards, no overhead } #endif // SLOW PATH: All guards here... ``` ### Performance Results: ``` Baseline (old tiny_fast_alloc): 1.68M ops/s Box Theory (no env vars): 1.22M ops/s (-27%) Box Theory (with env vars): 1.39M ops/s (-17%) ← Improved! System malloc: 8.08M ops/s CLAUDE.md expectation: 2.75M (+64%) ~ 4.19M (+150%) ← Not reached ``` ### Env Vars Used: ``` HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0 HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 HAKMEM_TINY_TRACE_RING=0 HAKMEM_SAFE_FREE=0 HAKMEM_TINY_REFILL_COUNT=128 ``` ### Verification: - ✅ HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 confirmed active - ✅ hak_tiny_alloc_fast_wrapper() called (FRONT diagnostics) - ✅ Routing now bypasses guards for fast path - ❌ Still -17% slower than baseline (investigation needed) ### Status: 🔬 PARTIAL SUCCESS - Routing fixed, but performance below expectation. Box Theory is active and bypassing guards, but still slower than old implementation. ### Next Steps: - Compare refill implementations (old vs Box Theory) - Profile to identify specific bottleneck - Investigate why Box Theory underperforms vs CLAUDE.md claims Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7	2025-11-05 06:12:32 +00:00
Claude	0c66991393	WIP: Unify fast path to Box Theory (experimental) ### Changes: - Removed duplicate fast paths: Disabled HAKMEM_TINY_FAST_PATH in: - malloc() entry point (line ~1257) - hak_alloc_at() helper (line ~682) - Unified to Box Theory: All tiny allocations now use Box Theory's hak_tiny_alloc_fast_wrapper() at line ~712 (HAKMEM_TINY_PHASE6_BOX_REFACTOR) ### Rationale: - Previous implementation had 2 fast path checks (double overhead) - Box Theory (tiny_alloc_fast.inc.h) provides optimized 3-4 instruction path - CLAUDE.md claims +64% (debug), +150% (production) with Box Theory - Attempt to eliminate redundant checks and unify to single fast path ### Performance Results: ⚠️ REGRESSION - Performance decreased: ``` Baseline (old tiny_fast_alloc): 1.68M ops/s Box Theory (unified): 1.35M ops/s (-20%) System malloc: 8.08M ops/s (reference) ``` ### Status: 🔬 EXPERIMENTAL - This commit documents the attempt but shows regression. Possible issues: 1. Box Theory may need additional tuning (env vars not sufficient) 2. Refill backend may be slower than old implementation 3. TLS freelist initialization overhead 4. Missing optimizations in Box Theory integration ### Next Steps: - Profile to identify why Box Theory is slower - Compare refill efficiency: old vs Box Theory - Check if TLS SLL variables are properly initialized - Consider reverting if root cause not found Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7	2025-11-05 06:06:34 +00:00
Claude	31af3eab27	Add malloc routing analysis and refill success tracking ### Changes: - Routing Counters: Added per-thread counters in hakmem.c to track: - g_malloc_total_calls: Total malloc() invocations - g_malloc_tiny_size_match: Calls within tiny size range (<=128B) - g_malloc_fast_path_tried: Calls that attempted fast path - g_malloc_fast_path_null: Fast path returned NULL - g_malloc_slow_path: Calls routed to slow path - Refill Success Tracking: Added counters in tiny_fastcache.c: - g_refill_success_count: Full batch (16 blocks) - g_refill_partial_count: Partial batch (<16 blocks) - g_refill_fail_count: Zero blocks allocated - g_refill_total_blocks: Total blocks across all refills - Profile Output Enhanced: tiny_fast_print_profile() now shows: - Routing statistics (which path allocations take) - Refill success/failure breakdown - Average blocks per refill ### Key Findings: ✅ Fast path routing: 100% success (20,479/20,480 calls per thread) ✅ Refill success: 100% (1,285 refills, all 16 blocks each) ⚠️ Performance: Still only 1.68M ops/s vs System's 8.06M (20.8%) Root Cause Confirmed: - NOT a routing problem (100% reach fast path) - NOT a refill failure (100% success) - IS a structural performance issue (2,418 cycles avg for malloc) Bottlenecks Identified: 1. Fast path cache hits: ~2,418 cycles (vs tcache ~100 cycles) 2. Refill operations: ~39,938 cycles (expensive but infrequent) 3. Overall throughput: 4.8x slower than system malloc Next Steps (per LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md): - Option B: Refill efficiency (batch allocation from SuperSlab) - Option C: Ultra-fast path redesign (tcache-equivalent) Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md	2025-11-05 05:56:02 +00:00
Claude	3e4e90eadb	Phase 6-5: Entry Point Optimization (Phase 1) - Unexpected results Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks in malloc(), inspired by mimalloc/tcache entry point design. Strategy: - tcache has 0 branches before fast path - mimalloc has 1-2 branches before fast path - Old HAKMEM had 8+ branches before fast path - Phase 1: Move fast path to line 1, add branch prediction hints Changes in core/hakmem.c: 1. Fast Path First: Size check → Init check → Cache hit (3 branches) 2. Slow Path: All guards moved after fast path (rare cases) 3. Branch hints: __builtin_expect() for hot paths Expected results (from research): - ST: 0.46M → 1.4-2.3M ops/s (+204-400%) - MT: 1.86M → 3.7-5.6M ops/s (+99-201%) Actual results (Larson 2s 8-128B 1024): - ST: 0.377M → 0.424M ops/s (+12% only) - MT: 1.856M → 1.453M ops/s (-22% regression!) Analysis: - Similar pattern to previous Option A test (+42% ST, -20% MT) - Entry point reordering alone is insufficient - True bottleneck may be: 1. tiny_fast_alloc() internals (size-to-class, cache access) 2. Refill cost (1,600 cycles for 16 individual calls) 3. Need Batch Refill optimization (Phase 3) as priority Next steps: - Investigate refill bottleneck with perf profiling - Consider implementing Phase 3 (Batch Refill) before Phase 2 - May need combination of multiple optimizations for breakthrough Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md	2025-11-05 05:10:02 +00:00
Claude	09e1d89e8d	Phase 6-4: Larson benchmark optimizations - LUT size-to-class Two optimizations to improve Larson benchmark performance: 1. Option A: Fast Path Priority (core/hakmem.c) - Move HAKMEM_TINY_FAST_PATH check before all guard checks - Reduce malloc() fast path from 8+ branches to 3 branches - Results: +42% ST, -20% MT (mixed results) 2. LUT Optimization (core/tiny_fastcache.h) - Replace 11-branch linear search with O(1) lookup table - Use size_to_class_lut[size >> 3] for fast mapping - Results: +24% MT, -24% ST (MT-optimized tradeoff) Benchmark results (Larson 2s 8-128B 1024 chunks): - Original: ST 0.498M ops/s, MT 1.502M ops/s - LUT version: ST 0.377M ops/s, MT 1.856M ops/s Analysis: - ST regression: Branch predictor learns linear search pattern - MT improvement: LUT avoids branch misprediction on context switch - Recommendation: Keep LUT for multi-threaded workloads Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md	2025-11-05 04:58:03 +00:00
Claude	b64cfc055e	Implement Option A: Fast Path priority optimization (Phase 6-4) Changes: - Reorder malloc() to prioritize Fast Path (initialized + tiny size check first) - Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.) - Optimize free() with same strategy (initialized check first) - Add branch prediction hints (__builtin_expect) Implementation: - malloc(): Fast Path now executes with 3 branches total - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD - Branch 3: tiny_fast_alloc() cache hit check - Slow Path: All guard checks moved after Fast Path miss - free(): Fast Path with 1-2 branches - Branch 1: g_initialized check - Direct to hak_free_at() on normal case Performance Results (Larson benchmark, size=8-128B): Single-thread (threads=1): - Before: 0.46M ops/s (10.7% of system malloc) - After: 0.65M ops/s (15.4% of system malloc) - Change: +42% improvement ✓ Multi-thread (threads=4): - Before: 1.81M ops/s (25.0% of system malloc) - After: 1.44M ops/s (19.9% of system malloc) - Change: -20% regression ✗ Analysis: - ST improvement shows Fast Path optimization works - MT regression suggests contention or cache issues - Did not meet target (+200-400%), further optimization needed Next Steps: - Investigate MT regression (cache coherency?) - Consider more aggressive inlining - Explore Option B (Refill optimization)	2025-11-05 04:44:50 +00:00
Moe Charm (CI)	52386401b3	Debug Counters Implementation - Clean History Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 12:31:14 +09:00
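
Several of the commits above (5ec9d1746f, 09e1d89e8d, b64cfc055e) describe the same fast path: map the request size to a class through a lookup table, then pop the head of a per-thread singly linked free list. The following is a minimal sketch of that idea, not the project's code. Every `sk_`/`SK_` name is hypothetical, the 16 linear classes of 8 bytes are an assumption made for illustration, and the index is rounded up with `(size + 7) >> 3` rather than the plain `size >> 3` quoted in the log. `__builtin_expect` is the GCC/Clang builtin the commits mention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: 16 classes in 8-byte steps, covering 1..128 bytes. */
#define SK_NUM_CLASSES 16
#define SK_MAX_SIZE    128

/* Per-thread heads of the intrusive free lists. The explicit {0} mirrors
   the initializers added in commit 1da8754d45. */
static _Thread_local void *sk_tls_head[SK_NUM_CLASSES] = {0};

/* Lookup table indexed by (size + 7) >> 3, i.e. the size rounded up to a
   multiple of 8 and divided by 8. Index 0 only occurs for size 0. */
static const uint8_t sk_class_lut[(SK_MAX_SIZE >> 3) + 1] = {
    0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
};

/* One bounds check plus one table load; returns -1 for non-tiny sizes. */
static inline int sk_size_to_class(size_t size) {
    if (size > SK_MAX_SIZE) return -1;
    return sk_class_lut[(size + 7) >> 3];
}

/* Pop the list head. The first word of a free block stores the next pointer.
   Returns NULL when the list is empty, which is where a refill would run. */
static inline void *sk_pop(int cls) {
    void *head = sk_tls_head[cls];
    if (__builtin_expect(head != NULL, 1)) {
        sk_tls_head[cls] = *(void **)head;
    }
    return head;
}

/* Push a freed block. The block must be at least pointer-sized and
   pointer-aligned, because its first word is overwritten with the link. */
static inline void sk_push(int cls, void *block) {
    assert(block != NULL);
    *(void **)block = sk_tls_head[cls];
    sk_tls_head[cls] = block;
}
```

A caller would compute `cls = sk_size_to_class(size)`, try `sk_pop(cls)`, and fall through to the guarded slow path only on `-1` or `NULL`. With linear classes the table is equivalent to simple arithmetic; a table earns its keep once class boundaries stop being evenly spaced.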

14 Commits
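
Commit 77ed72fcf6 turns on one rule: when the header in front of a pointer does not carry the allocator's magic value, the pointer came from libc and must be released as-is, never as `ptr - HEADER_SIZE`. The sketch below illustrates that routing under stated assumptions. The `sk_` names, the magic constant, and the 16-byte header layout are all hypothetical. Because nothing is interposed here, plain `malloc`/`free` stand in for the `__libc_malloc`/`__libc_free` calls the commit uses to avoid wrapper recursion, and the readability check from commit 382980d450 is omitted, so `sk_owns` assumes the bytes in front of `ptr` are readable.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SK_MAGIC 0x48414B4Du  /* hypothetical magic value */

/* Hypothetical header, padded to 16 bytes so user pointers stay aligned. */
typedef struct {
    uint32_t magic;
    uint32_t size;
    uint64_t pad;
} sk_header;

/* Allocate size bytes with a tagged header placed in front of them. */
static void *sk_alloc(size_t size) {
    assert(size <= UINT32_MAX);
    unsigned char *raw = malloc(sizeof(sk_header) + size);
    if (raw == NULL) return NULL;
    sk_header hdr = { SK_MAGIC, (uint32_t)size, 0 };
    memcpy(raw, &hdr, sizeof hdr);
    return raw + sizeof hdr;
}

/* True when the bytes in front of ptr carry our magic value.
   Assumes those bytes are readable (no readability probe in this sketch). */
static int sk_owns(const void *ptr) {
    sk_header hdr;
    memcpy(&hdr, (const unsigned char *)ptr - sizeof hdr, sizeof hdr);
    return hdr.magic == SK_MAGIC;
}

/* Pointer that must be handed to the underlying free:
   the raw base for our own blocks, the unmodified pointer otherwise. */
static void *sk_free_target(void *ptr) {
    if (sk_owns(ptr)) return (unsigned char *)ptr - sizeof(sk_header);
    return ptr;
}

/* Release a block through the routing rule above. */
static void sk_free(void *ptr) {
    if (ptr == NULL) return;
    free(sk_free_target(ptr));
}
```

Keeping the routing decision in `sk_free_target` separate from the actual release makes the rule easy to check in isolation, for example against a pointer into a zeroed buffer that stands in for a foreign allocation.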