hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	6154e7656c	根治修正: unified_cache_refill SEGVAULT + コンパイラ最適化対策問題: - リリース版sh8benchでunified_cache_refill+0x46fでSEGVAULT - コンパイラ最適化により、ヘッダー書き込みとtiny_next_read()の順序が入れ替わり、破損したポインタをout[]に格納根本原因: - ヘッダー書き込みがtiny_next_read()の後にあった - volatile barrierがなく、コンパイラが自由に順序を変更 - ASan版では最適化が制限されるため問題が隠蔽されていた修正内容（P1-P3）: P1: unified_cache_refill SEGVAULT修正 (core/front/tiny_unified_cache.c:341-350) - ヘッダー書き込みをtiny_next_read()の前に移動 - __atomic_thread_fence(__ATOMIC_RELEASE)追加 - コンパイラ最適化による順序入れ替えを防止 P2: 二重書き込み削除 (core/box/tiny_front_cold_box.h:75-82) - tiny_region_id_write_header()削除 - unified_cache_refillが既にヘッダー書き込み済み - 不要なメモリ操作を削除して効率化 P3: tiny_next_read()安全性強化 (core/tiny_nextptr.h:73-86) - __atomic_thread_fence(__ATOMIC_ACQUIRE)追加 - メモリ操作の順序を保証 P4: ヘッダー書き込みデフォルトON (core/tiny_region_id.h - ChatGPT修正) - g_write_headerのデフォルトを1に変更 - HAKMEM_TINY_WRITE_HEADER=0で旧挙動に戻せるテスト結果: ✅ unified_cache_refill SEGVAULT: 解消（sh8bench実行可能に） ❌ TLS_SLL_HDR_RESET: まだ発生中（別の根本原因、調査継続） 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-03 09:57:12 +09:00
Moe Charm (CI)	c04cccf723	Phase 6-A: Clarify debug-only validation (code readability, no perf change) Explicitly guard SuperSlab validation with #if !HAKMEM_BUILD_RELEASE to document that this code is debug-only. Changes: - core/tiny_region_id.h: Add #if !HAKMEM_BUILD_RELEASE guard around hak_super_lookup() validation code (lines 199-239) - Improves code readability: Makes debug-only intent explicit - Self-documenting: No need to check Makefile to understand behavior - Defensive: Works correctly even if LTO is disabled Performance Impact: - Measured: +1.67% (bench_random_mixed), +1.33% (bench_mid_mt_gap) - Expected: +12-15% (based on initial perf interpretation) - Actual: NO measurable improvement (within noise margin ±3.6%) Root Cause (Investigation): - Compiler (LTO) already eliminated hak_super_lookup() automatically - The function never existed in compiled binary (verified via nm/objdump) - Default Makefile has -DHAKMEM_BUILD_RELEASE=1 + -flto - perf's "15.84% CPU" was misattributed (was free(), not hak_super_lookup) Conclusion: This change provides NO performance benefit, but IMPROVES code clarity by making the debug-only nature explicit rather than relying on implicit compiler optimization. Files: - core/tiny_region_id.h - Add explicit debug guard - PHASE6A_DISCREPANCY_INVESTIGATION.md - Full investigation report Lessons Learned: 1. Always verify assembly output before claiming optimizations 2. perf attribution can be misleading - cross-reference with symbols 3. LTO is extremely aggressive at dead code elimination 4. Small improvements (<2× stdev) need statistical validation See PHASE6A_DISCREPANCY_INVESTIGATION.md for complete analysis. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 15:22:31 +09:00
Moe Charm (CI)	20f8d6f179	Cleanup: Add tiny_debug_api.h to eliminate guard/failfast implicit warnings Created central header for debug instrumentation API to fix implicit function declaration warnings across the codebase. Changes: 1. Created core/tiny_debug_api.h - Declares guard system API (3 functions) - Declares failfast debugging API (3 functions) - Uses forward declarations for SuperSlab/TinySlabMeta 2. Updated 3 files to include tiny_debug_api.h: - core/tiny_region_id.h (removed inline externs) - core/hakmem_tiny_tls_ops.h - core/tiny_superslab_alloc.inc.h Warnings eliminated (6 of 11 total): ✅ tiny_guard_is_enabled() ✅ tiny_guard_on_alloc() ✅ tiny_guard_on_invalid() ✅ tiny_failfast_log() ✅ tiny_failfast_abort_ptr() ✅ tiny_refill_failfast_level() Remaining warnings (deferred to P1): - ss_active_add (2 occurrences) - expand_superslab_head - hkm_ace_set_tls_capacity - smallmid_backend_free Impact: - Cleaner build output - Better type safety for debug functions - No behavior changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 06:47:13 +09:00
Moe Charm (CI)	6ac6f5ae1b	Refactor: Split hakmem_tiny_superslab.c + unified backend exit point Major refactoring to improve maintainability and debugging: 1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files: - superslab_allocate.c: SuperSlab allocation/deallocation - superslab_backend.c: Backend allocation paths (legacy, shared) - superslab_ace.c: ACE (Adaptive Cache Engine) logic - superslab_slab.c: Slab initialization and bitmap management - superslab_cache.c: LRU cache and prewarm cache management - superslab_head.c: SuperSlabHead management and expansion - superslab_stats.c: Statistics tracking and debugging 2. Created hakmem_tiny_superslab_internal.h for shared declarations 3. Added superslab_return_block() as single exit point for header writing: - All backend allocations now go through this helper - Prevents bugs where headers are forgotten in some paths - Makes future debugging easier 4. Updated Makefile for new file structure 5. Added header writing to ss_legacy_backend_box.c and ss_unified_backend_box.c (though not currently linked) Note: Header corruption bug in Larson benchmark still exists. Class 1-6 allocations go through TLS refill/carve paths, not backend. Further investigation needed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 05:13:04 +09:00
Moe Charm (CI)	912123cbbe	P3: Skip header write in alloc path when class_map is active Skip the 1-byte header write in tiny_region_id_write_header() when class_map is active (default). class_map provides out-of-band class_idx lookup, making the header byte unnecessary for the free path. Changes: - Add ENV-gated conditional to skip header write (default: skip) - ENV: HAKMEM_TINY_WRITE_HEADER=1 to force header write (legacy mode) - Memory layout preserved: user pointer = base + 1 (1B unused when skipped) Performance improvement: - tiny_hot 64B: 83.5M → 84.2M ops/sec (+0.8%) - random_mixed ws=256: 68.1M → 72.2M ops/sec (+6%) The header skip reduces one store instruction per allocation, which is particularly beneficial for mixed-size workloads like random_mixed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-28 14:46:55 +09:00
Moe Charm (CI)	316ea4dfd6	ENV Cleanup Step 4: Gate HAKMEM_WATCH_ADDR in tiny_region_id.h Gate get_watch_addr() debug functionality with HAKMEM_BUILD_RELEASE, returning 0 in release builds to disable address watching overhead. Performance: 30.31M ops/s (baseline: 30.2M) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-28 00:47:16 +09:00
Moe Charm (CI)	0a8bdb8b18	Fix release build debug logging in tiny_region_id.h The allocation logging at line 236-249 was missing the #if !HAKMEM_BUILD_RELEASE guard, causing fprintf(stderr) on every allocation even in release builds. Impact: 19.8M ops/s → 28.0M ops/s (+42%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-27 11:58:00 +09:00
Moe Charm (CI)	d8e3971dc2	Fix cross-thread ownership check: Use bits 8-15 for owner_tid_low Problem: - TLS_SLL_PUSH_DUP crash in Larson multi-threaded benchmark - Cross-thread frees incorrectly routed to same-thread TLS path - Root cause: pthread_t on glibc is 256-byte aligned (TCB base) so lower 8 bits are ALWAYS 0x00 for ALL threads Fix: - Change owner_tid_low from (tid & 0xFF) to ((tid >> 8) & 0xFF) - Bits 8-15 actually vary between threads, enabling correct detection - Applied consistently across all ownership check locations: - superslab_inline.h: ss_owner_try_acquire/release/is_mine - slab_handle.h: slab_try_acquire - tiny_free_fast.inc.h: tiny_free_is_same_thread_ss - tiny_free_fast_v2.inc.h: cross-thread detection - tiny_superslab_free.inc.h: same-thread check - ss_allocation_box.c: slab initialization - hakmem_tiny_superslab.c: ownership handling Also added: - Address watcher debug infrastructure (tiny_region_id.h) - Cross-thread detection in malloc_tiny_fast.h Front Gate Test results: - Larson 1T/2T/4T: PASS (no TLS_SLL_PUSH_DUP crash) - random_mixed: PASS - Performance: ~20M ops/s (regression from 48M, needs optimization) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-27 11:52:11 +09:00
Moe Charm (CI)	25d963a4aa	Code Cleanup: Remove false positives, redundant validations, and reduce verbose logging Following the C7 stride upgrade fix (commit `23c0d9541`), this commit performs comprehensive cleanup to improve code quality and reduce debug noise. ## Changes ### 1. Disable False Positive Checks (tiny_nextptr.h) - Disabled: NXT_MISALIGN validation block with `#if 0` - Reason: Produces false positives due to slab base offsets (2048, 65536) not being stride-aligned, causing all blocks to appear "misaligned" - TODO: Reimplement to check stride DISTANCE between consecutive blocks instead of absolute alignment to stride boundaries ### 2. Remove Redundant Geometry Validations hakmem_tiny_refill_p0.inc.h (P0 batch refill) - Removed 25-line CARVE_GEOMETRY_FIX validation block - Replaced with NOTE explaining redundancy - Reason: Stride table is now correct in tiny_block_stride_for_class(), defense-in-depth validation adds overhead without benefit ss_legacy_backend_box.c (legacy backend) - Removed 18-line LEGACY_FIX_GEOMETRY validation block - Replaced with NOTE explaining redundancy - Reason: Shared_pool validates geometry at acquisition time ### 3. Reduce Verbose Logging hakmem_shared_pool.c (sp_fix_geometry_if_needed) - Made SP_FIX_GEOMETRY logging conditional on `!HAKMEM_BUILD_RELEASE` - Reason: Geometry fixes are expected during stride upgrades, no need to log in release builds ### 4. Verification - Build: ✅ Successful (LTO warnings expected) - Test: ✅ 10K iterations (1.87M ops/s, no crashes) - NXT_MISALIGN false positives: ✅ Eliminated ## Files Modified - core/tiny_nextptr.h - Disabled false positive NXT_MISALIGN check - core/hakmem_tiny_refill_p0.inc.h - Removed redundant CARVE validation - core/box/ss_legacy_backend_box.c - Removed redundant LEGACY validation - core/hakmem_shared_pool.c - Made SP_FIX_GEOMETRY logging debug-only ## Impact - Code clarity: Removed 43 lines of redundant validation code - Debug noise: Reduced false positive diagnostics - Performance: Eliminated overhead from redundant geometry checks - Maintainability: Single source of truth for geometry validation 🧹 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-21 23:00:24 +09:00
Moe Charm (CI)	72b38bc994	Phase E3-FINAL: Fix Box API offset bugs - ALL classes now use correct offsets ## Root Cause Analysis (GPT5) Physical Layout Constraints: - Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed = ❌ IMPOSSIBLE - Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 = ✅ POSSIBLE - Class 7: 1KB → offset 0 (compatibility) Correct Specification: - HAKMEM_TINY_HEADER_CLASSIDX != 0: - Class 0, 7: next at offset 0 (overwrites header when on freelist) - Class 1-6: next at offset 1 (after header) - HAKMEM_TINY_HEADER_CLASSIDX == 0: - All classes: next at offset 0 Previous Bug: - Attempted "ALL classes offset 1" unification - Class 0 with offset 1 caused immediate SEGV (9B > 8B block size) - Mixed 2-arg/3-arg API caused confusion ## Fixes Applied ### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h) ```c // Correct signatures void tiny_next_write(int class_idx, void* base, void* next_value) void* tiny_next_read(int class_idx, const void* base) // Correct offset calculation size_t offset = (class_idx == 0 \|\| class_idx == 7) ? 0 : 1; ``` ### 2. Updated 123+ Call Sites Across 34 Files - hakmem_tiny_hot_pop_v4.inc.h (4 locations) - hakmem_tiny_fastcache.inc.h (3 locations) - hakmem_tiny_tls_list.h (12 locations) - superslab_inline.h (5 locations) - tiny_fastcache.h (3 locations) - ptr_trace.h (macro definitions) - tls_sll_box.h (2 locations) - + 27 additional files Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)` Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)` ### 3. Added Sentinel Detection Guards - tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next - tls_list_push(): Block nodes with sentinel in ptr or ptr->next - Defense-in-depth against remote free sentinel leakage ## Verification (GPT5 Report) Test Command: `./out/release/bench_random_mixed_hakmem --iterations=70000` Results: - ✅ Main loop completed successfully - ✅ Drain phase completed successfully - ✅ NO SEGV (previous crash at iteration 66151 is FIXED) - ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers Analysis: - Class 0 immediate SEGV: ✅ RESOLVED (correct offset 0 now used) - 66K iteration crash: ✅ RESOLVED (offset consistency fixed) - Box API conflicts: ✅ RESOLVED (unified 3-arg API) ## Technical Details ### Offset Logic Justification ``` Class 0: 8B block → next pointer (8B) fits ONLY at offset 0 Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header) Class 2: 32B block → next pointer (8B) fits at offset 1 ... Class 6: 512B block → next pointer (8B) fits at offset 1 Class 7: 1024B block → offset 0 for legacy compatibility ``` ### Files Modified (Summary) - Core API: `box/tiny_next_ptr_box.h` - Hot paths: `hakmem_tiny_hot_pop.inc.h`, `tiny_fastcache.h` - TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h` - SuperSlab: `superslab_inline.h`, `tiny_superslab_.inc.h` - Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h` - Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h` - Documentation: Multiple Phase E3 reports ## Remaining Work None for Box API offset bugs - all structural issues resolved. Future enhancements (non-critical): - Periodic `grep -R '(void*)' core/` to detect direct pointer access violations - Enforce Box API usage via static analysis - Document offset rationale in architecture docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-13 06:50:20 +09:00
Moe Charm (CI)	84dbd97fe9	Fix #16 : Resolve double BASE→USER conversion causing header corruption 🎯 ROOT CAUSE: Internal allocation helpers were prematurely converting BASE → USER pointers before returning to caller. The caller then applied HAK_RET_ALLOC/tiny_region_id_write_header which performed ANOTHER BASE→USER conversion, resulting in double offset (BASE+2) and header written at wrong location. 📦 BOX THEORY SOLUTION: Establish clean pointer conversion boundary at tiny_region_id_write_header, making it the single source of truth for BASE → USER conversion. 🔧 CHANGES: - Fix #16: Remove premature BASE→USER conversions (6 locations) * core/tiny_alloc_fast.inc.h (3 fixes) * core/hakmem_tiny_refill.inc.h (2 fixes) * core/hakmem_tiny_fastcache.inc.h (1 fix) - Fix #12: Add header validation in tls_sll_pop (detect corruption) - Fix #14: Defense-in-depth header restoration in tls_sll_splice - Fix #15: USER pointer detection (for debugging) - Fix #13: Bump window header restoration - Fix #2, #6, #7, #8: Various header restoration & NULL termination 🧪 TEST RESULTS: 100% SUCCESS - 10K-500K iterations: All passed - 8 seeds × 100K: All passed (42,123,456,789,999,314,271,161) - Performance: ~630K ops/s average (stable) - Header corruption: ZERO 📋 FIXES SUMMARY: Fix #1-8: Initial header restoration & chain fixes (chatgpt-san) Fix #9-10: USER pointer auto-fix (later disabled) Fix #12: Validation system (caught corruption at call 14209) Fix #13: Bump window header writes Fix #14: Splice defense-in-depth Fix #15: USER pointer detection (debugging tool) Fix #16: Double conversion fix (FINAL SOLUTION) ✅ 🎓 LESSONS LEARNED: 1. Validation catches bugs early (Fix #12 was critical) 2. Class-specific inline logging reveals patterns (Option C) 3. Box Theory provides clean architectural boundaries 4. Multiple investigation approaches (Task/chatgpt-san collaboration) 📄 DOCUMENTATION: - P0_BUG_STATUS.md: Complete bug tracking timeline - C2_CORRUPTION_ROOT_CAUSE_FINAL.md: Detailed root cause analysis - FINAL_ANALYSIS_C2_CORRUPTION.md: Investigation methodology 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Task Agent <task@anthropic.com> Co-Authored-By: ChatGPT <chatgpt@openai.com>	2025-11-12 10:33:57 +09:00
Moe Charm (CI)	dde490f842	Phase 7: header-aware TLS front caches and FG gating - core/hakmem_tiny_fastcache.inc.h: make tiny_fast_pop/push read/write next at base+1 for C0–C6; clear C7 next on pop - core/hakmem_tiny_hot_pop.inc.h: header-aware next reads for g_fast_head pops (classes 0–3) - core/tiny_free_magazine.inc.h: header-aware chain linking for BG spill chain (base+1 for C0–C6) - core/box/front_gate_classifier.c: registry fallback classifies headerless only for class 7; others as headered Build OK; bench_fixed_size_hakmem still SIGBUS right after init. FREE_ROUTE trace shows invalid frees (ptr=0xa0, etc.). Next steps: instrument early frees and audit remaining header-aware writes in any front caches not yet patched.	2025-11-10 18:04:08 +09:00
Moe Charm (CI)	1010a961fb	Tiny: fix header/stride mismatch and harden refill paths - Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte header during allocation, but linear carve/refill and initial slab capacity still used bare class block sizes. This mismatch could overrun slab usable space and corrupt freelists, causing reproducible SEGV at ~100k iters. Changes - Superslab: compute capacity with effective stride (block_size + header for classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a debug-only bound check in superslab_alloc_from_slab() to fail fast if carve would exceed usable bytes. - Refill (non-P0 and P0): use header-aware stride for all linear carving and TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h also uses stride, not raw class size. - Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes before splicing into freelist (already present). Notes - This unifies the memory layout across alloc/linear-carve/refill with a single stride definition and keeps class7 (1024B) headerless as designed. - Debug builds add fail-fast checks; release builds remain lean. Next - Re-run Tiny benches (256/1024B) in debug to confirm stability, then in release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0 to isolate P0 batch carve, and continue reducing branch-miss as planned.	2025-11-09 18:55:50 +09:00
Moe Charm (CI)	0da9f8cba3	Phase 7 + Pool TLS 1.5b stabilization:\n- Add build hygiene (dep tracking, flag consistency, print-flags)\n- Add build.sh + verify_build.sh (unified recipe, freshness check)\n- Quiet verbose logs behind HAKMEM_DEBUG_VERBOSE\n- A/B free safety via HAKMEM_TINY_SAFE_FREE (mincore strict vs boundary)\n- Tweak Tiny header path to reduce noise; Pool TLS free guard optimized\n- Fix mimalloc link retention (--no-as-needed + force symbol)\n- Add docs/BUILD_PHASE7_POOL_TLS.md (cheatsheet)	2025-11-09 11:50:18 +09:00
Moe Charm (CI)	7975e243ee	Phase 7 Task 3: Pre-warm TLS cache (+180-280% improvement!) MAJOR SUCCESS: HAKMEM now achieves 85-92% of System malloc on tiny allocations (128-512B) and BEATS System at 146% on 1024B allocations! Performance Results: - Random Mixed 128B: 21M → 59M ops/s (+181%) 🚀 - Random Mixed 256B: 19M → 70M ops/s (+268%) 🚀 - Random Mixed 512B: 21M → 68M ops/s (+224%) 🚀 - Random Mixed 1024B: 21M → 65M ops/s (+210%, 146% of System!) 🏆 - Larson 1T: 2.68M ops/s (stable, no regression) Implementation: 1. Task 3a: Remove profiling overhead in release builds - Wrapped RDTSC calls in #if !HAKMEM_BUILD_RELEASE - Compiler can eliminate profiling code completely - Effect: +2% (2.68M → 2.73M Larson) 2. Task 3b: Simplify refill logic - Use constants from hakmem_build_flags.h - TLS cache already optimal - Effect: No regression 3. Task 3c: Pre-warm TLS cache (GAME CHANGER!) - Pre-allocate 16 blocks per class at init - Eliminates cold-start penalty - Effect: +180-280% improvement 🚀 Root Cause: The bottleneck was cold-start, not the hot path! First allocation in each class triggered a SuperSlab refill (100+ cycles). Pre-warming eliminated this penalty, revealing Phase 7's true potential. Files Modified: - core/hakmem_tiny.c: Pre-warm function implementation - core/box/hak_core_init.inc.h: Pre-warm initialization call - core/tiny_alloc_fast.inc.h: Profiling overhead removal - core/hakmem_phase7_config.h: Task 3 constants (NEW) - core/hakmem_build_flags.h: Phase 7 feature flags - Makefile: PREWARM_TLS flag, phase7 targets - CLAUDE.md: Phase 7 success summary - PHASE7_TASK3_RESULTS.md: Comprehensive results report (NEW) Build: make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 phase7-bench 🎉 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-08 12:54:52 +09:00
Moe Charm (CI)	4983352812	Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%) ## Summary Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug. Result: 2-3x performance improvement across all benchmarks. ## Performance Results - Larson 1T: 631K → 2.73M ops/s (+333%) 🚀 - bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀 - bench_random_mixed (512B): → 1.43M ops/s (new) - [HEADER_INVALID] messages: Many → ~Zero ✅ ## Changes ### 1. Hybrid mincore Optimization (317-634x faster) Problem: `hak_is_memory_readable()` calls mincore() syscall on EVERY free - Cost: 634 cycles/call - Impact: 40x slower than System malloc Solution: Check alignment BEFORE calling mincore() - Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore - Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore - Result: 634 → 1-2 cycles effective (99.6% skip mincore) Files: - core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check - core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check - core/hakmem_internal.h:281-312 - Performance warning added ### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG) Problem: Macro definition order prevented Phase 7 header write - hakmem_tiny.c:130 defined legacy macro (no header write) - tiny_alloc_fast.inc.h:67 had `#ifndef` guard → skipped! - Result: Headers NEVER written → All frees failed → Slow path Solution: Force Phase 7 macro to override legacy - hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard - tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine ### 3. Magic Byte Fix Problem: Release builds don't write magic byte, but free ALWAYS checks it - Result: All headers marked as invalid Solution: ALWAYS write magic byte (same 1-byte write, no overhead) - tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard ## Technical Details ### Hybrid mincore Effectiveness \| Case \| Frequency \| Cost \| Weighted \| \|------\|-----------\|------\|----------\| \| Normal (Step 1) \| 99.9% \| 1-2 cycles \| 1-2 \| \| Page boundary \| 0.1% \| 634 cycles \| 0.6 \| \| Total \| - \| - \| 1.6-2.6 cycles \| Improvement: 634 → 1.6 cycles = 317-396x faster! ### Macro Fix Impact Before: HAK_RET_ALLOC(cls, ptr) → return (ptr) // No header write After: HAK_RET_ALLOC(cls, ptr) → return tiny_region_id_write_header((ptr), (cls)) Result: Headers properly written → Fast path works → +194-333% performance ## Investigation Task Agent Ultrathink analysis identified: 1. mincore() syscall overhead (634 cycles) 2. Macro definition order conflict 3. Release/Debug build mismatch (magic byte) Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines) ## Related - Phase 7-1.0: PoC implementation (+39%~+436%) - Phase 7-1.1: Dual-header dispatch (Task Agent) - Phase 7-1.2: Page boundary SEGV fix (100% crash-free) - Phase 7-1.3: Hybrid mincore + Macro fix (this commit) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-08 04:50:41 +09:00
Moe Charm (CI)	48fadea590	Phase 7-1.1: Fix 1024B crash (header validation + malloc fallback) Fixed critical bugs preventing Phase 7 from working with 1024B allocations. ## Bug Fixes (by Task Agent Ultrathink) 1. Header Validation Missing in Release Builds - `core/tiny_region_id.h:73-97` - Removed `#if !HAKMEM_BUILD_RELEASE` - Always validate magic byte and class_idx (prevents SEGV on Mid/Large) 2. 1024B Malloc Fallback Missing - `core/box/hak_alloc_api.inc.h:35-49` - Direct fallback to malloc - Phase 7 rejects 1024B (needs header) → skip ACE → use malloc ## Test Results \| Test \| Result \| \|------\|--------\| \| 128B, 512B, 1023B (Tiny) \| +39%~+436% ✅ \| \| 1024B only (100 allocs) \| 100% success ✅ \| \| Mixed 128B+1024B (200) \| 100% success ✅ \| \| bench_random_mixed 1024B \| Still crashes ❌ \| ## Known Issue `bench_random_mixed` with 1024B still crashes (intermittent SEGV). Simple tests pass, suggesting issue is with complex allocation patterns. Investigation pending. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Task Agent Ultrathink	2025-11-08 03:35:07 +09:00
Moe Charm (CI)	6b1382959c	Phase 7-1 PoC: Region-ID Direct Lookup (+39%~+436% improvement!) Implemented ultra-fast header-based free path that eliminates SuperSlab lookup bottleneck (100+ cycles → 5-10 cycles). ## Key Changes 1. Smart Headers (core/tiny_region_id.h): - 1-byte header before each allocation stores class_idx - Memory layout: [Header: 1B] [User data: N-1B] - Overhead: <2% average (0% for Slab[0] using wasted padding) 2. Ultra-Fast Allocation (core/tiny_alloc_fast.inc.h): - Write header at base: base = class_idx - Return user pointer: base + 1 3. Ultra-Fast Free* (core/tiny_free_fast_v2.inc.h): - Read class_idx from header (ptr-1): 2-3 cycles - Push base (ptr-1) to TLS freelist: 3-5 cycles - Total: 5-10 cycles (vs 500+ cycles current!) 4. Free Path Integration (core/box/hak_free_api.inc.h): - Removed SuperSlab lookup from fast path - Direct header validation (no lookup needed!) 5. Size Class Adjustment (core/hakmem_tiny.h): - Max tiny size: 1023B (was 1024B) - 1024B requests → Mid allocator fallback ## Performance Results \| Size \| Baseline \| Phase 7 \| Improvement \| \|------\|----------\|---------\|-------------\| \| 128B \| 1.22M \| 6.54M \| +436% 🚀 \| \| 512B \| 1.22M \| 1.70M \| +39% \| \| 1023B \| 1.22M \| 1.92M \| +57% \| ## Build & Test Enable Phase 7: make HEADER_CLASSIDX=1 bench_random_mixed_hakmem Run benchmark: HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000 128 1234567 ## Known Issues - 1024B requests fallback to Mid allocator (by design) - Target 40-60M ops/s not yet reached (current: 1.7-6.5M) - Further optimization needed (TLS capacity tuning, refill optimization) ## Credits Design: ChatGPT Pro Ultrathink, Claude Code Implementation: Claude Code with Task Agent Ultrathink support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-08 03:18:17 +09:00

18 Commits