2dc9d5d596
Fix include order in hakmem.c - move hak_kpi_util.inc.h before hak_core_init.inc.h
...
Problem: hak_core_init.inc.h references KPI measurement variables
(g_latency_histogram, g_latency_samples, g_baseline_soft_pf, etc.)
but hakmem.c was including hak_kpi_util.inc.h AFTER hak_core_init.inc.h,
causing undefined reference errors.
Solution: Reorder includes so hak_kpi_util.inc.h (definition) comes
before hak_core_init.inc.h (usage).
Build result: ✅ Success (libhakmem.so 547KB, 0 errors)
Minor changes:
- Added extern __thread declarations for TLS SLL debug variables
- Added signal handler logging for debug_dump_last_push
- Improved hakmem_tiny.c structure for Phase 2 preparation
🤖 Generated with Claude Code + Task Agent
Co-Authored-By: Gemini <gemini@example.com >
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-03 13:28:44 +09:00
b5be708b6a
Fix potential freelist corruption in unified_cache_refill (Class 0) and improve TLS SLL logging/safety
2025-12-03 12:43:02 +09:00
c2716f5c01
Implement Phase 2: Headerless Allocator Support (Partial)
...
- Feature: Added HAKMEM_TINY_HEADERLESS toggle (A/B testing)
- Feature: Implemented Headerless layout logic (Offset=0)
- Refactor: Centralized layout definitions in tiny_layout_box.h
- Refactor: Abstracted pointer arithmetic in free path via ptr_conversion_box.h
- Verification: sh8bench passes in Headerless mode (No TLS_SLL_HDR_RESET)
- Known Issue: Regression in Phase 1 mode due to blind pointer conversion logic
2025-12-03 12:11:27 +09:00
a6aeeb7a4e
Phase 1 Refactoring Complete: Box-based Logic Consolidation ✅
...
Summary:
- Task 1.1 ✅ : Created tiny_layout_box.h for centralized class/header definitions
- Task 1.2 ✅ : Updated tiny_nextptr.h to use layout Box (bitmasking optimization)
- Task 1.3 ✅ : Enhanced ptr_conversion_box.h with Phantom Types support
- Task 1.4 ✅ : Implemented test_phantom.c for Debug-mode type checking
Verification Results (by Task Agent):
- Box Pattern Compliance: ⭐ ⭐ ⭐ ⭐ ⭐ (5/5) - MISSION/DESIGN documented
- Type Safety: ⭐ ⭐ ⭐ ⭐ ⭐ (5/5) - Phantom Types working as designed
- Test Coverage: ⭐ ⭐ ⭐ ☆☆ (3/5) - Compile-time tests OK, runtime tests planned
- Performance: 0 bytes, 0 cycles overhead in Release build
- Build Status: ✅ Success (526KB libhakmem.so, zero warnings)
Key Achievements:
1. Single Source of Truth principle fully implemented
2. Circular dependency eliminated (layout→header→nextptr→conversion)
3. Release build: 100% inlining, zero overhead
4. Debug build: Full type checking with Phantom Types
5. HAK_RET_ALLOC macro migrated to Box API
Known Issues (unrelated to Phase 1):
- TLS_SLL_HDR_RESET from sh8bench (existing, will be resolved in Phase 2)
Next Steps:
- Phase 2 readiness: ✅ READY
- Recommended: Create migration guide + runtime test suite
- Alignment guarantee will be addressed in Phase 2 (Headerless layout)
🤖 Generated with Claude Code + Gemini (implementation) + Task Agent (verification)
Co-Authored-By: Gemini <gemini@example.com >
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-03 11:38:11 +09:00
6df1bdec37
Fix TLS SLL race condition with atomic fence and report investigation results
2025-12-03 10:57:16 +09:00
bd5e97f38a
Save current state before investigating TLS_SLL_HDR_RESET
2025-12-03 10:34:39 +09:00
6154e7656c
根治修正: unified_cache_refill SEGVAULT + コンパイラ最適化対策
...
問題:
- リリース版sh8benchでunified_cache_refill+0x46fでSEGVAULT
- コンパイラ最適化により、ヘッダー書き込みとtiny_next_read()の
順序が入れ替わり、破損したポインタをout[]に格納
根本原因:
- ヘッダー書き込みがtiny_next_read()の後にあった
- volatile barrierがなく、コンパイラが自由に順序を変更
- ASan版では最適化が制限されるため問題が隠蔽されていた
修正内容(P1-P3):
P1: unified_cache_refill SEGVAULT修正 (core/front/tiny_unified_cache.c:341-350)
- ヘッダー書き込みをtiny_next_read()の前に移動
- __atomic_thread_fence(__ATOMIC_RELEASE)追加
- コンパイラ最適化による順序入れ替えを防止
P2: 二重書き込み削除 (core/box/tiny_front_cold_box.h:75-82)
- tiny_region_id_write_header()削除
- unified_cache_refillが既にヘッダー書き込み済み
- 不要なメモリ操作を削除して効率化
P3: tiny_next_read()安全性強化 (core/tiny_nextptr.h:73-86)
- __atomic_thread_fence(__ATOMIC_ACQUIRE)追加
- メモリ操作の順序を保証
P4: ヘッダー書き込みデフォルトON (core/tiny_region_id.h - ChatGPT修正)
- g_write_headerのデフォルトを1に変更
- HAKMEM_TINY_WRITE_HEADER=0で旧挙動に戻せる
テスト結果:
✅ unified_cache_refill SEGVAULT: 解消(sh8bench実行可能に)
❌ TLS_SLL_HDR_RESET: まだ発生中(別の根本原因、調査継続)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-03 09:57:12 +09:00
4cc2d8addf
sh8bench修正: LRU registry未登録問題 + self-heal修復
...
問題:
- sh8benchでfree(): invalid pointer発生
- header=0xA0だがsuperslab registry未登録のポインタがlibcへ
根本原因:
- LRU pop時にhak_super_register()が呼ばれていなかった
- hakmem_super_registry.c:hak_ss_lru_pop()の設計不備
修正内容:
1. 根治修正 (core/hakmem_super_registry.c:466)
- LRU popしたSuperSlabを明示的にregistry再登録
- hak_super_register((uintptr_t)curr, curr) 追加
- これによりfree時のhak_super_lookup()が成功
2. Self-heal修復 (core/box/hak_wrappers.inc.h:387-436)
- Safety net: 未登録SuperSlabを検出して再登録
- mincore()でマッピング確認 + magic検証
- libcへの誤ルート遮断(free()クラッシュ回避)
- 詳細デバッグログ追加(HAKMEM_WRAP_DIAG=1)
3. デバッグ指示書追加 (docs/sh8bench_debug_instruction.md)
- TLS_SLL_HDR_RESET問題の調査手順
テスト:
- cfrac, larson等の他ベンチマークは正常動作確認
- sh8benchのTLS_SLL_HDR_RESET問題は別issue(調査中)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-03 09:15:59 +09:00
f7d0d236e0
malloc_count アトミック操作削除: sh8bench 17s→10s (41%改善)
...
perf分析により、malloc()関数内のmalloc_countインクリメントが
27.55%のCPU時間を消費していることが判明。
変更:
- core/box/hak_wrappers.inc.h:84-86
- NDEBUGビルドでmalloc_countインクリメントを無効化
- lock incq命令によるキャッシュライン競合を完全に排除
効果:
- sh8bench (8スレッド): 17秒 → 10-11秒 (35-41%改善)
- 目標14秒を大幅に達成
- futex時間: 2.4s → 3.2s (総実行時間短縮により相対的に増加)
分析手法:
- perf record -g で詳細プロファイリング実施
- アトミック操作がボトルネックと特定
- sysalloc比較: hakmem 10s vs sysalloc 3s (差を大幅縮小)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-03 07:56:38 +09:00
60b02adf54
hak_init_wait_for_ready: タイムアウト削除 + デバッグ出力抑制
...
- hak_init_wait_for_ready(): タイムアウト(i > 1000000)を削除
- 他スレッドは初期化完了まで確実に待機するように変更
- init_waitによるlibcフォールバックを防止
- tls_sll_drain_box.h: デバッグ出力を#ifndef NDEBUGで囲む
- releaseビルドでの不要なfprintf出力を抑制
- [TLS_SLL_DRAIN] メッセージがベンチマーク時に出なくなった
性能への影響:
- sh8bench 8スレッド: 17秒(変更なし)
- フォールバック: 8回(初期化時のみ、正常動作)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-02 23:29:07 +09:00
daddbc926c
fix(Phase 11+): Cold Start lazy init for unified_cache_refill
...
Root cause: unified_cache_refill() accessed cache->slots before initialization
when a size class was first used via the refill path (not pop path).
Fix: Add lazy initialization check at start of unified_cache_refill()
- Check if cache->slots is NULL before accessing
- Call unified_cache_init() if needed
- Return NULL if init fails (graceful degradation)
Also includes:
- ss_cold_start_box.inc.h: Box Pattern for default prewarm settings
- hakmem_super_registry.c: Use static array in prewarm (avoid recursion)
- Default prewarm enabled (1 SuperSlab/class, configurable via ENV)
Test: 8B→16B→Mixed allocation pattern now works correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-02 19:43:23 +09:00
644e3c30d1
feat(Phase 2-1): Lane Classification + Fallback Reduction
...
## Phase 2-1: Lane Classification Box (Single Source of Truth)
### New Module: hak_lane_classify.inc.h
- Centralized size-to-lane mapping with unified boundary definitions
- Lane architecture:
- LANE_TINY: [0, 1024B] SuperSlab (unchanged)
- LANE_POOL: [1025, 52KB] Pool per-thread (extended!)
- LANE_ACE: [52KB, 2MB] ACE learning
- LANE_HUGE: [2MB+] mmap direct
- Key invariant: POOL_MIN = TINY_MAX + 1 (no gaps)
### Fixed: Tiny/Pool Boundary Mismatch
- Before: TINY_MAX_SIZE=1024 vs tiny_get_max_size()=2047 (inconsistent!)
- After: Both reference LANE_TINY_MAX=1024 (authoritative)
- Impact: Eliminates 1025-2047B "unmanaged zone" causing libc fragmentation
### Updated Files
- core/hakmem_tiny.h: Use LANE_TINY_MAX, fix sizes[7]=1024 (was 2047)
- core/hakmem_pool.h: Use POOL_MIN_REQUEST_SIZE=1025 (was 2048)
- core/box/hak_alloc_api.inc.h: Lane-based routing (HAK_LANE_IS_*)
## jemalloc Block Bug Fix
### Root Cause
- g_jemalloc_loaded initialized to -1 (unknown)
- Condition `if (block && g_jemalloc_loaded)` treated -1 as true
- Result: ALL allocations fallback to libc (even when jemalloc not loaded!)
### Fix
- Change condition to `g_jemalloc_loaded > 0`
- Only fallback when jemalloc is ACTUALLY loaded
- Applied to: malloc/free/calloc/realloc
### Impact
- Before: 100% libc fallback (jemalloc block false positive)
- After: Only genuine cases fallback (init_wait, lockdepth, etc.)
## Fallback Diagnostics (ChatGPT contribution)
### New Feature: HAKMEM_WRAP_DIAG
- ENV flag to enable fallback logging
- Reason-specific counters (init_wait, jemalloc_block, lockdepth, etc.)
- First 4 occurrences logged per reason
- Helps identify unwanted fallback paths
### Implementation
- core/box/wrapper_env_box.{c,h}: ENV cache + DIAG flag
- core/box/hak_wrappers.inc.h: wrapper_record_fallback() calls
## Verification
### Fallback Reduction
- Before fix: [wrap] libc malloc: jemalloc block (100% fallback)
- After fix: Only init_wait + lockdepth (expected, minimal)
### Known Issue
- Tiny allocator OOM (size=8) still crashes
- This is a pre-existing bug, unrelated to Phase 2-1
- Was hidden by jemalloc block false positive
- Will be investigated separately
## Performance Impact
### sh8bench 8 threads
- Phase 1-1: 15秒
- Phase 2-1: 14秒 (~7% improvement)
### Note
- True hakmem performance now measurable (no more 100% fallback)
- Tiny OOM prevents full benchmark completion
- Next: Fix Tiny allocator for complete evaluation
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
Co-Authored-By: ChatGPT <chatgpt@openai.com >
2025-12-02 19:13:28 +09:00
695aec8279
feat(Phase 1-2): Add atomic initialization wait mechanism (safety improvement)
...
Implements thread-safe atomic initialization tracking and a wait helper for
non-init threads to avoid libc fallback during the initialization window.
Changes:
- Convert g_initializing to _Atomic type for thread-safe access
- Add g_init_thread to identify which thread performs initialization
- Implement hak_init_wait_for_ready() helper with spin/yield mechanism
- Update hak_core_init.inc.h to use atomic operations
- Update hak_wrappers.inc.h to call wait helper instead of checking g_initializing
Results & Analysis:
- Performance: ±0% (21s → 21s, no measurable improvement)
- Safety: ✓ Prevents recursion in init window
- Investigation: Initialization overhead is <1% of total allocations
- Expected: 2-8% improvement
- Actual: 0% improvement (spin/yield overhead ≈ savings)
- libc overhead: 41% → 57% (relative increase, likely sampling variation)
Key Findings from Perf Analysis:
- getenv: 0% (maintained from Phase 1-1) ✓
- libc malloc/free: ~24.54% of cycles
- libc fragmentation (malloc_consolidate/unlink_chunk): ~16% of cycles
- Total libc overhead: ~41% (difficult to optimize without changing algorithm)
Next Phase Target:
- Phase 2: Investigate libc fragmentation (malloc_consolidate 9.33%, unlink_chunk 6.90%)
- Potential approaches: hakmem Mid/ACE allocator expansion, sh8bench pattern analysis
Recommendation: Keep Phase 1-2 for safety (no performance regression), proceed to Phase 2.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-02 16:44:27 +09:00
49969d2e0f
feat(Phase 1-1): Complete getenv elimination from malloc/free hot paths (+39-42% perf)
...
## Summary
Eliminated all getenv() calls from malloc/free wrappers and allocator hot paths by implementing
constructor-based environment variable caching. This achieves 39-42% performance improvement
(36s → 22s on sh8bench single-thread).
## Performance Impact
- sh8bench 1 thread: 35-36s → 21-22s (+39-42% improvement) 🚀
- sh8bench 8 threads: ~15s (maintained)
- getenv overhead: 36.32% → 0% (completely eliminated)
## Changes
### New Files
- **core/box/tiny_env_box.{c,h}**: Centralized environment variable cache for Tiny allocator
- Caches 43 environment variables (HAKMEM_TINY_*, HAKMEM_SLL_*, HAKMEM_SS_*, etc.)
- Constructor-based initialization with atomic CAS for thread safety
- Inline accessor tiny_env_cfg() for hot path access
- **core/box/wrapper_env_box.{c,h}**: Environment cache for malloc/free wrappers
- Caches 3 wrapper variables (HAKMEM_STEP_TRACE, HAKMEM_LD_SAFE, HAKMEM_FREE_WRAP_TRACE)
- Constructor priority 101 ensures early initialization
- Replaces all lazy-init patterns in wrapper code
### Modified Files
- **Makefile**: Added tiny_env_box.o and wrapper_env_box.o to OBJS_BASE and SHARED_OBJS
- **core/box/hak_wrappers.inc.h**:
- Removed static lazy-init variables (g_step_trace, ld_safe_mode cache)
- Replaced with wrapper_env_cfg() lookups (wcfg->step_trace, wcfg->ld_safe_mode)
- All getenv() calls eliminated from malloc/free hot paths
- **core/hakmem.c**:
- Added hak_ld_env_init() with constructor for LD_PRELOAD caching
- Added hak_force_libc_ctor() for HAKMEM_FORCE_LIBC_ALLOC* caching
- Simplified hak_ld_env_mode() to return cached value only
- Simplified hak_force_libc_alloc() to use cached values
- Eliminated all getenv/atoi calls from hot paths
## Technical Details
### Constructor Initialization Pattern
All environment variables are now read once at library load time using __attribute__((constructor)):
```c
__attribute__((constructor(101)))
static void wrapper_env_ctor(void) {
wrapper_env_init_once(); // Atomic CAS ensures exactly-once init
}
```
### Thread Safety
- Atomic compare-and-swap (CAS) ensures single initialization
- Spin-wait for initialization completion in multi-threaded scenarios
- Memory barriers (memory_order_acq_rel) ensure visibility
### Hot Path Impact
Before: Every malloc/free → getenv("LD_PRELOAD") + getenv("HAKMEM_STEP_TRACE") + ...
After: Every malloc/free → Single pointer dereference (wcfg->field)
## Next Optimization Target (Phase 1-2)
Perf analysis reveals libc fallback accounts for ~51% of cycles:
- _int_malloc: 15.04%
- malloc: 9.81%
- _int_free: 10.07%
- malloc_consolidate: 9.27%
- unlink_chunk: 6.82%
Reducing libc fallback from 51% → 10% could yield additional +25-30% improvement.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
Co-Authored-By: ChatGPT <chatgpt@openai.com >
2025-12-02 16:16:51 +09:00
f1b7964ef9
Remove unused Mid MT layer
2025-12-01 23:43:44 +09:00
195c74756c
Fix mid free routing and relax mid W_MAX
2025-12-01 22:06:10 +09:00
4ef0171bc0
feat: Add ACE allocation failure tracing and debug hooks
...
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.
Key changes include:
- **ACE Tracing Implementation**:
- Added environment variable to enable/disable detailed logging of allocation failures.
- Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
- Corrected to ensure is properly linked into , resolving an error.
- **LD_PRELOAD Wrapper Adjustments**:
- Investigated and understood the wrapper's behavior under , particularly its interaction with and checks.
- Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
- Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed.
- Created to facilitate testing of the tracing features.
This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00
2bd8da9267
fix: guard Tiny FG misclass and add fg_tiny_gate box
2025-12-01 16:05:55 +09:00
0bc33dc4f5
Phase 9-2: Remove Legacy Backend & Unify to Shared Pool (50M ops/s)
...
- Removed Legacy Backend fallback; Shared Pool is now the sole backend.
- Removed Soft Cap limit in Shared Pool to allow full memory management.
- Implemented EMPTY slab recycling with batched meta->used decrement in remote drain.
- Updated tiny_free_local_box to return is_empty status for safe recycling.
- Fixed race condition in release path by removing from legacy list early.
- Achieved 50.3M ops/s in WS8192 benchmark (+200% vs baseline).
2025-12-01 13:47:23 +09:00
e769dec283
Refactor: Clean up SuperSlab shared pool code
...
- Removed unused/disabled L0 cache implementation from core/hakmem_shared_pool.c.
- Deleted stale backup file core/hakmem_tiny_superslab.c.bak.
- Removed untracked and obsolete shared_pool source files.
2025-11-30 15:27:53 +09:00
0276420938
Extract adopt/refill boundary into tiny_adopt_refill_box
2025-11-30 11:06:44 +09:00
eea3b988bd
Phase 9-3: Box Theory refactoring (TLS_SLL_DUP root fix)
...
Implementation:
- Step 1: TLS SLL Guard Box (push前meta/class/state突合)
- Step 2: SP_REBIND_SLOT macro (原子的slab rebind)
- Step 3: Unified Geometry Box (ポインタ演算API統一)
- Step 4: Unified Guard Box (HAKMEM_TINY_GUARD=1 統一制御)
New Files (545 lines):
- core/box/tiny_guard_box.h (277L)
- TLS push guard (SuperSlab/slab/class/state validation)
- Recycle guard (EMPTY確認)
- Drain guard (準備)
- 統一ENV制御: HAKMEM_TINY_GUARD=1
- core/box/tiny_geometry_box.h (174L)
- BASE_FROM_USER/USER_FROM_BASE conversion
- SS_FROM_PTR/SLAB_IDX_FROM_PTR lookup
- PTR_CLASSIFY combined helper
- 85+箇所の重複コード削減候補を特定
- core/box/sp_rebind_slot_box.h (94L)
- SP_REBIND_SLOT macro (geometry + TLS reset + class_map原子化)
- 6箇所に適用 (Stage 0/0.5/1/2/3)
- デバッグトレース: HAKMEM_SP_REBIND_TRACE=1
Results:
- ✅ TLS_SLL_DUP完全根絶 (0 crashes, 0 guard rejects)
- ✅ パフォーマンス改善 +5.9% (15.16M → 16.05M ops/s on WS8192)
- ✅ コンパイル警告0件(新規)
- ✅ Box Theory準拠 (Single Responsibility, Clear Contract, Observable, Composable)
Test Results:
- Debug build: HAKMEM_TINY_GUARD=1 で10M iterations完走
- Release build: 3回平均 16.05M ops/s
- Guard reject rate: 0%
- Core dump: なし
Box Theory Compliance:
- Single Responsibility: 各Boxが単一責任 (guard/rebind/geometry)
- Clear Contract: 明確なAPI境界
- Observable: ENV変数で制御可能な検証
- Composable: 全allocation/free pathから利用可能
Performance Impact:
- Release build (guard無効): 影響なし (+5.9%改善)
- Debug build (guard有効): 数%のオーバーヘッド (検証コスト)
Architecture Improvements:
- ポインタ演算の一元管理 (85+箇所の統一候補)
- Slab rebindの原子性保証
- 検証機能の統合 (単一ENV制御)
Phase 9 Status:
- 性能目標 (25-30M ops/s): 未達 (16.05M = 53-64%)
- TLS_SLL_DUP根絶: ✅ 達成
- コード品質: ✅ 大幅向上
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-30 10:48:50 +09:00
87b7d30998
Phase 9: SuperSlab optimization & EMPTY slab recycling (WIP)
...
Phase 9-1: O(1) SuperSlab lookup optimization
- Created ss_addr_map_box: Hash table (8192 buckets) for O(1) SuperSlab lookup
- Created ss_tls_hint_box: TLS caching layer for SuperSlab hints
- Integrated hash table into registry (init, insert, remove, lookup)
- Modified hak_super_lookup() to use new hash table
- Expected: 50-80 cycles → 10-20 cycles (not verified - SuperSlab disabled by default)
Phase 9-2: EMPTY slab recycling implementation
- Created slab_recycling_box: SLAB_TRY_RECYCLE() macro following Box pattern
- Integrated into remote drain (superslab_slab.c)
- Integrated into TLS SLL drain (tls_sll_drain_box.h) with touched slab tracking
- Observable: Debug tracing via HAKMEM_SLAB_RECYCLE_TRACE
- Updated Makefile: Added new box objects to 3 build targets
Known Issues:
- SuperSlab registry exhaustion still occurs (unregistration not working)
- shared_pool_release_slab() may not be removing from g_super_reg[]
- Needs investigation before Phase 9-2 can be completed
Expected Impact (when fixed):
- Stage 1 hit rate: 0% → 80%
- shared_fail events: 4 → 0
- Kernel overhead: 55% → 15%
- Throughput: 16.5M → 25-30M ops/s (+50-80%)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-30 07:16:50 +09:00
da8f4d2c86
Phase 8-TLS-Fix: BenchFast crash root cause fixes
...
Two critical bugs fixed:
1. TLS→Atomic guard (cross-thread safety):
- Changed `__thread int bench_fast_init_in_progress` to `atomic_int`
- Root cause: pthread_once() creates threads with fresh TLS (= 0)
- Guard must protect entire process, not just calling thread
- Box Contract: Observable state across all threads
2. Direct header write (P3 optimization bypass):
- bench_fast_alloc() now writes header directly: 0xa0 | class_idx
- Root cause: P3 optimization skips header writes by default
- BenchFast REQUIRES headers for free routing (0xa0-0xa7 magic)
- Box Contract: BenchFast always writes headers
Result:
- Normal mode: 16.3M ops/s (working)
- BenchFast mode: No crash (pool exhaustion expected with 128 blocks/class)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-30 05:12:32 +09:00
191e659837
Phase 8 Root Cause Fix: BenchFast crash investigation and infrastructure isolation
...
Goal: Fix BenchFast mode crash and improve infrastructure separation
Status: Normal mode works perfectly (17.9M ops/s), BenchFast crash reduced but persists (separate issue)
Root Cause Analysis (Layers 0-3):
Layer 1: Removed unnecessary unified_cache_init() call
- Problem: Phase 8 Step 2 added unified_cache_init() to bench_fast_init()
- Design error: BenchFast uses TLS SLL strategy, NOT Unified Cache
- Impact: 16KB mmap allocations created, later misclassified as Tiny → crash
- Fix: Removed unified_cache_init() call from bench_fast_box.c lines 123-129
- Rationale: BenchFast and Unified Cache are different allocation strategies
Layer 2: Infrastructure isolation (__libc bypass)
- Problem: Infrastructure allocations (cache arrays) went through HAKMEM wrapper
- Risk: Can interact with BenchFast mode, causing path conflicts
- Fix: Use __libc_calloc/__libc_free in unified_cache_init/shutdown
- Benefit: Clean separation between workload (measured) and infrastructure (unmeasured)
- Defense: Prevents future crashes from infrastructure/workload mixing
Layer 3: Box Contract documentation
- Problem: Implicit assumptions about BenchFast behavior were undocumented
- Fix: Added comprehensive Box Contract to bench_fast_box.h (lines 13-51)
- Documents:
* Workload allocations: Tiny only, TLS SLL strategy
* Infrastructure allocations: __libc bypass, no HAKMEM interaction
* Preconditions, guarantees, and violation examples
- Benefit: Future developers understand design constraints
Layer 0: Limit prealloc to actual TLS SLL capacity
- Problem: Old code preallocated 50,000 blocks/class
- Reality: Adaptive sizing limits TLS SLL to 128 blocks/class at runtime
- Lost blocks: 50,000 - 128 = 49,872 blocks/class × 6 = 299,232 lost blocks!
- Impact: Lost blocks caused heap corruption
- Fix: Hard-code prealloc to 128 blocks/class (observed actual capacity)
- Result: 768 total blocks (128 × 6), zero lost blocks
Performance Impact:
- Normal mode: ✅ 17.9M ops/s (perfect, no regression)
- BenchFast mode: ⚠️ Still crashes (different root cause, requires further investigation)
Benefits:
- Unified Cache infrastructure properly isolated (__libc bypass)
- BenchFast Box Contract documented (prevents future misunderstandings)
- Prealloc overflow eliminated (no more lost blocks)
- Normal mode unchanged (backward compatible)
Known Issue (separate):
- BenchFast mode still crashes with "free(): invalid pointer"
- Crash location: Likely bench_random_mixed.c line 145 (BENCH_META_FREE(slots))
- Next steps: GDB debugging, AddressSanitizer build, or strace analysis
- Not caused by Phase 8 changes (pre-existing issue)
Files Modified:
- core/box/bench_fast_box.h - Box Contract documentation (Layer 3)
- core/box/bench_fast_box.c - Removed prewarm, limited prealloc (Layer 0+1)
- core/front/tiny_unified_cache.c - __libc bypass (Layer 2)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-30 04:51:36 +09:00
cfa587c61d
Phase 8-Step1-3: Unified Cache hot path optimization (config macro + prewarm + PGO init removal)
...
Goal: Reduce branches in Unified Cache hot paths (-2 branches per op)
Expected improvement: +2-3% in PGO mode
Changes:
1. Config Macro (Step 1):
- Added TINY_FRONT_UNIFIED_CACHE_ENABLED macro to tiny_front_config_box.h
- PGO mode: compile-time constant (1)
- Normal mode: runtime function call unified_cache_enabled()
- Replaced unified_cache_enabled() calls in 3 locations:
* unified_cache_pop() line 142
* unified_cache_push() line 182
* unified_cache_pop_or_refill() line 228
2. Function Declaration Fix:
- Moved unified_cache_enabled() from static inline to non-static
- Implementation in tiny_unified_cache.c (was in .h as static inline)
- Forward declaration in tiny_front_config_box.h
- Resolves declaration conflict between config box and header
3. Prewarm (Step 2):
- Added unified_cache_init() call to bench_fast_init()
- Ensures cache is initialized before benchmark starts
- Enables PGO builds to remove lazy init checks
4. Conditional Init Removal (Step 3):
- Wrapped lazy init checks in #if !HAKMEM_TINY_FRONT_PGO
- PGO builds assume prewarm → no init check needed (-1 branch)
- Normal builds keep lazy init for safety
- Applied to 3 functions: unified_cache_pop(), unified_cache_push(), unified_cache_pop_or_refill()
Performance Impact:
PGO mode: -2 branches per operation (enabled check + init check)
Normal mode: Same as before (runtime checks)
Branch Elimination (PGO):
Before: if (!unified_cache_enabled()) + if (slots == NULL)
After: if (!1 ) [eliminated] + [init check removed]
Result: -2 branches in alloc/free hot paths
Files Modified:
core/box/tiny_front_config_box.h - Config macro + forward declaration
core/front/tiny_unified_cache.h - Config macro usage + PGO conditionals
core/front/tiny_unified_cache.c - unified_cache_enabled() implementation
core/box/bench_fast_box.c - Prewarm call in bench_fast_init()
Note: BenchFast mode has pre-existing crash (not caused by these changes)
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 17:58:42 +09:00
6b75453072
Phase 7-Step8: Replace SFC/HEAP_V2/ULTRA_SLIM runtime checks with config macros
...
**Goal**: Complete dead code elimination infrastructure for all runtime checks
**Changes**:
1. core/box/tiny_front_config_box.h:
- Rename sfc_cascade_enabled() → tiny_sfc_enabled() (avoid name collision)
- Update TINY_FRONT_SFC_ENABLED macro to use tiny_sfc_enabled()
2. core/tiny_alloc_fast.inc.h (5 locations):
- Line 274: tiny_heap_v2_alloc_by_class() - use TINY_FRONT_HEAP_V2_ENABLED
- Line 431: SFC TLS cache init - use TINY_FRONT_SFC_ENABLED
- Line 678: SFC cascade check - use TINY_FRONT_SFC_ENABLED
- Line 740: Ultra SLIM debug check - use TINY_FRONT_ULTRA_SLIM_ENABLED
3. core/hakmem_tiny_free.inc (1 location):
- Line 233: Heap V2 free path - use TINY_FRONT_HEAP_V2_ENABLED
**Performance**: 79.5M ops/s (maintained, -0.4M vs Step 7, within noise)
- Normal mode: Neutral (runtime checks preserved)
- PGO mode: Ready for dead code elimination
**Total Runtime Checks Replaced (Phase 7)**:
- ✅ TINY_FRONT_FASTCACHE_ENABLED: 3 locations (Step 4-6)
- ✅ TINY_FRONT_TLS_SLL_ENABLED: 7 locations (Step 7)
- ✅ TINY_FRONT_SFC_ENABLED: 2 locations (Step 8)
- ✅ TINY_FRONT_HEAP_V2_ENABLED: 2 locations (Step 8)
- ✅ TINY_FRONT_ULTRA_SLIM_ENABLED: 1 location (Step 8)
**Total**: 15 runtime checks → config macros
**PGO Mode Expected Benefit**:
- Eliminate 15 runtime checks across hot paths
- Reduce branch mispredictions
- Smaller code size (dead code removed by compiler)
- Better instruction cache locality
**Design Complete**: Config Box as single entry point for all Tiny Front policy
- Unified macro interface for all feature toggles
- Include order independent (static inline wrappers)
- Dual-mode support (PGO compile-time vs normal runtime)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 17:40:05 +09:00
69e6df4cbc
Phase 7-Step7: Replace g_tls_sll_enable with TINY_FRONT_TLS_SLL_ENABLED macro
...
**Goal**: Enable dead code elimination for TLS SLL checks in PGO mode
**Changes**:
1. core/box/tiny_front_config_box.h:
- Add TINY_FRONT_TLS_SLL_ENABLED macro (PGO: 1, Normal: tiny_tls_sll_enabled())
- Add tiny_tls_sll_enabled() wrapper function (static inline)
2. core/tiny_alloc_fast.inc.h (5 hot path locations):
- Line 220: tiny_heap_v2_refill_mag() - early return check
- Line 388: SLIM mode - SLL freelist check
- Line 459: tiny_alloc_fast_pop() - Layer 1 SLL check
- Line 774: Main alloc path - cached sll_enabled check (most critical!)
- Line 815: Generic front - SLL toggle respect
3. core/hakmem_tiny_refill.inc.h (2 locations):
- Line 186: bulk_mag_refill_fc() - refill from SLL
- Line 213: bulk_mag_to_sll_if_room() - push to SLL
**Performance**: 79.9M ops/s (maintained, +0.1M vs Step 6)
- Normal mode: Same performance (runtime checks preserved)
- PGO mode: Dead code elimination ready (if (!1 ) → removed by compiler)
**Expected PGO benefit**:
- Eliminate 7 TLS SLL checks across hot paths
- Reduce instruction count in main alloc loop
- Better branch prediction (no runtime checks)
**Design**: Config Box as single entry point
- All TLS SLL checks now use TINY_FRONT_TLS_SLL_ENABLED
- Consistent pattern with FASTCACHE/SFC/HEAP_V2 macros
- Include order independent (wrapper in config box header)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 17:35:51 +09:00
ae00221a0a
Phase 7-Step6: Fix include order issue - refill path optimization complete
...
**Problem**: Include order dependency prevented using TINY_FRONT_FASTCACHE_ENABLED
macro in hakmem_tiny_refill.inc.h (included before tiny_alloc_fast.inc.h).
**Solution** (from ChatGPT advice):
- Move wrapper functions to tiny_front_config_box.h as static inline
- This makes them available regardless of include order
- Enables dead code elimination in PGO mode for refill path
**Changes**:
1. core/box/tiny_front_config_box.h:
- Add tiny_fastcache_enabled() and sfc_cascade_enabled() as static inline
- These access static global variables via extern declaration
2. core/hakmem_tiny_refill.inc.h:
- Include tiny_front_config_box.h
- Use TINY_FRONT_FASTCACHE_ENABLED macro (line 162)
- Enables dead code elimination in PGO mode
3. core/tiny_alloc_fast.inc.h:
- Remove duplicate wrapper function definitions
- Now uses functions from config box header
**Performance**: 79.8M ops/s (maintained, 77M/81M/81M across 3 runs)
**Design Principle**: Config Box as "single entry point" for Tiny Front policy
- All config checks go through TINY_FRONT_*_ENABLED macros
- Wrapper functions centralized in config box header
- Include order independent (static inline in header)
🐱 Generated with ChatGPT advice for solving include order dependencies
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 17:31:32 +09:00
490b1c132a
Phase 7-Step1: Unified front path branch hint reversal (+54.2% improvement!)
...
Performance Results (bench_random_mixed, ws=256):
- Before: 52.3 M ops/s (Phase 5/6 baseline)
- After: 80.6 M ops/s (+54.2% improvement, +28.3M ops/s)
Implementation:
- Changed __builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 0) → (..., 1)
- Applied to BOTH malloc and free paths
- Lines changed: 137 (malloc), 190 (free)
Root Cause (from ChatGPT + Task agent analysis):
- Unified fast path existed but was marked UNLIKELY (hint = 0)
- Compiler optimized for legacy path, not unified cache path
- malloc/free consumed 43% CPU due to branch misprediction
- Reversing hint: unified path now primary, legacy path fallback
Impact Analysis:
- Tiny allocations now hit malloc_tiny_fast() → Unified Cache → SuperSlab
- Legacy layers (FastCache/SFC/HeapV2/TLS SLL) still exist but cold
- Next step: Compile-time elimination of legacy paths (Step 2)
Code Changes:
- core/box/hak_wrappers.inc.h:137 (malloc path)
- core/box/hak_wrappers.inc.h:190 (free path)
- Total: 2 lines changed (4 lines including comments)
Why This Works:
- CPU branch predictor now expects unified path
- Cache locality improved (unified path hot, legacy path cold)
- Instruction cache pressure reduced (hot path smaller)
Next Steps (ChatGPT recommendations):
1. ✅ free side hint reversal (DONE - already applied)
2. ⏸️ Compile-time unified ON fixed (Step 2)
3. ⏸️ Document Phase 7 results (Step 3)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 16:17:34 +09:00
c19bb6a3bc
Phase 6-B: Header-based Mid MT free (lock-free, +2.65% improvement)
...
Performance Results (bench_mid_mt_gap, 1KB-8KB, ws=256):
- Before: 41.0 M ops/s (mutex-protected registry)
- After: 42.09 M ops/s (+2.65% improvement)
Expected vs Actual:
- Expected: +17-27% (based on perf showing 13.98% mutex overhead)
- Actual: +2.65% (needs investigation)
Implementation:
- Added MidMTHeader (8 bytes) to each Mid MT allocation
- Allocation: Write header with block_size, class_idx, magic (0xAB42)
- Free: Read header for O(1) metadata lookup (no mutex!)
- Eliminated entire registry infrastructure (127 lines deleted)
Changes:
- core/hakmem_mid_mt.h: Added MidMTHeader, removed registry structures
- core/hakmem_mid_mt.c: Updated alloc/free, removed registry functions
- core/box/mid_free_route_box.h: Header-based detection instead of registry lookup
Code Quality:
✅ Lock-free (no pthread_mutex operations)
✅ Simpler (O(1) header read vs O(log N) binary search)
✅ Smaller binary (127 lines deleted)
✅ Positive improvement (no regression)
Next: Investigate why improvement is smaller than expected
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 15:45:29 +09:00
6f8742582b
Phase 5-Step3: Mid/Large Config Box (future workload optimization)
...
Add compile-time configuration for Mid/Large allocation paths using Box pattern.
Implementation:
- Created core/box/mid_large_config_box.h
- Dual-mode config: PGO (compile-time) vs Normal (runtime)
- Replace HAK_ENABLED_* checks with MID_LARGE_* macros
- Dead code elimination when HAKMEM_MID_LARGE_PGO=1
Target Checks Eliminated (PGO mode):
- MID_LARGE_BIGCACHE_ENABLED (BigCache for 2MB+ allocations)
- MID_LARGE_ELO_ENABLED (ELO learning/threshold)
- MID_LARGE_ACE_ENABLED (ACE allocator gate)
- MID_LARGE_EVOLUTION_ENABLED (Evolution sampling)
Files:
- core/box/mid_large_config_box.h (NEW) - Config Box pattern
- core/hakmem_build_flags.h - Add HAKMEM_MID_LARGE_PGO flag
- core/box/hak_alloc_api.inc.h - Replace 2 checks (ELO, BigCache)
- core/box/hak_free_api.inc.h - Replace 2 checks (BigCache)
Performance Impact:
- Current workloads (16B-8KB): No effect (checks not in hot path)
- Future workloads (2MB+): Expected +2-4% via dead code elimination
Box Pattern: ✅ Single responsibility, clear contract, testable
Note: Config Box infrastructure ready for future large allocation benchmarks.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 14:39:07 +09:00
3daf75e57f
Phase 5-Step2: Mid Free Route Box (+28.9x free perf, 1.53x faster than system)
...
Fix critical 19x free() slowdown in Mid MT allocator (1KB-8KB range).
Root Cause:
- Mid MT registers chunks in MidGlobalRegistry
- Free path searches Pool's mid_desc registry (different registry!)
- Result: 100% lookup failure → 4x cascading lookups → libc fallback
Solution (Box Pattern):
- Created core/box/mid_free_route_box.h
- Try Mid MT registry BEFORE classify_ptr() in free()
- Direct route to mid_mt_free() if found
- Fall through to existing path if not found
Performance Results (bench_mid_mt_gap, 1KB-8KB allocs):
- Before: 1.49 M ops/s (19x slower than system malloc)
- After: 41.0 M ops/s (+28.9x improvement)
- vs System malloc: 1.53x faster (41.0 vs 26.8 M ops/s)
Files:
- core/box/mid_free_route_box.h (NEW) - Mid Free Route Box
- core/box/hak_wrappers.inc.h - Add mid_free_route_try() call
- core/hakmem_mid_mt.h - Fix mid_get_min_size() (1024 not 2048)
- bench_mid_mt_gap.c (NEW) - Targeted 1KB-8KB benchmark
- Makefile - Add bench_mid_mt_gap targets
Box Pattern: ✅ Single responsibility, clear contract, testable, minimal change
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 14:18:20 +09:00
e0aa51dba1
Phase 4-Step3: Add Front Config Box (+2.7-4.9% dead code elimination)
...
Implement compile-time configuration system for dead code elimination in Tiny
allocation hot paths. The Config Box provides dual-mode configuration:
- Normal mode: Runtime ENV checks (backward compatible, flexible)
- PGO mode: Compile-time constants (dead code elimination, performance)
PERFORMANCE:
- Baseline (runtime config): 50.32 M ops/s (avg of 5 runs)
- Config Box (PGO mode): 52.77 M ops/s (avg of 5 runs)
- Improvement: +2.45 M ops/s (+4.87% with outlier, +2.72% without)
- Target: +5-8% (partially achieved)
IMPLEMENTATION:
1. core/box/tiny_front_config_box.h (NEW):
- Defines TINY_FRONT_*_ENABLED macros for all config checks
- PGO mode (#if HAKMEM_TINY_FRONT_PGO): Macros expand to constants (0/1)
- Normal mode (#else): Macros expand to function calls
- Functions remain in their original locations (no code duplication)
2. core/hakmem_build_flags.h:
- Added HAKMEM_TINY_FRONT_PGO build flag (default: 0, off)
- Documentation: Usage with make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1"
3. core/box/hak_wrappers.inc.h:
- Replaced front_gate_unified_enabled() with TINY_FRONT_UNIFIED_GATE_ENABLED
- 2 call sites updated (malloc and free fast paths)
- Added config box include
EXPECTED DEAD CODE ELIMINATION (PGO mode):
if (TINY_FRONT_UNIFIED_GATE_ENABLED) { ... }
→ if (1) { ... } // Constant, always true
→ Compiler optimizes away the branch, keeps body
SCOPE:
Currently only front_gate_unified_enabled() is replaced (2 call sites).
To achieve full +5-8% target, expand to other config checks:
- ultra_slim_mode_enabled()
- tiny_heap_v2_enabled()
- sfc_cascade_enabled()
- tiny_fastcache_enabled()
- tiny_metrics_enabled()
- tiny_diag_enabled()
BUILD USAGE:
Normal mode (runtime config, default):
make bench_random_mixed_hakmem
PGO mode (compile-time config, dead code elimination):
make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1" bench_random_mixed_hakmem
BOX PATTERN COMPLIANCE:
✅ Single Responsibility: Configuration management ONLY
✅ Clear Contract: Dual-mode (PGO = constants, Normal = runtime)
✅ Observable: Config report function (debug builds)
✅ Safe: Backward compatible (default is normal mode)
✅ Testable: Easy A/B comparison (PGO vs normal builds)
WHY +2.7-4.9% (below +5-8% target)?
- Limited scope: Only 2 call sites for 1 config function replaced
- Lazy init overhead: front_gate_unified_enabled() cached after first call
- Need to expand to more config checks for full benefit
NEXT STEPS:
- Expand config macro usage to other functions (optional)
- OR proceed with PGO re-enablement (Final polish)
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 12:18:37 +09:00
04186341c1
Phase 4-Step2: Add Hot/Cold Path Box (+7.3% performance)
...
Implemented Hot/Cold Path separation using Box pattern for Tiny allocations:
Performance Improvement (without PGO):
- Baseline (Phase 26-A): 53.3 M ops/s
- Hot/Cold Box (Phase 4-Step2): 57.2 M ops/s
- Gain: +7.3% (+3.9 M ops/s)
Implementation:
1. core/box/tiny_front_hot_box.h - Ultra-fast hot path (1 branch)
- Removed range check (caller guarantees valid class_idx)
- Inline cache hit path with branch prediction hints
- Debug metrics with zero overhead in Release builds
2. core/box/tiny_front_cold_box.h - Slow cold path (noinline, cold)
- Refill logic (batch allocation from SuperSlab)
- Drain logic (batch free to SuperSlab)
- Error reporting and diagnostics
3. core/front/malloc_tiny_fast.h - Updated to use Hot/Cold Boxes
- Hot path: tiny_hot_alloc_fast() (1 branch: cache empty check)
- Cold path: tiny_cold_refill_and_alloc() (noinline, cold attribute)
- Clear separation improves i-cache locality
Branch Analysis:
- Baseline: 4-5 branches in hot path (range check + cache check + refill logic mixed)
- Hot/Cold Box: 1 branch in hot path (cache empty check only)
- Reduction: 3-4 branches eliminated from hot path
Design Principles (Box Pattern):
✅ Single Responsibility: Hot path = cache hit only, Cold path = refill/errors
✅ Clear Contract: Hot returns NULL on miss, Cold handles miss
✅ Observable: Debug metrics (TINY_HOT_METRICS_*) gated by NDEBUG
✅ Safe: Branch prediction hints (TINY_HOT_LIKELY/UNLIKELY)
✅ Testable: Isolated hot/cold paths, easy A/B testing
PGO Status:
- Temporarily disabled (build issues with __gcov_merge_time_profile)
- Will re-enable PGO in future commit after resolving gcc/lto issues
- Current benchmarks are without PGO (fair A/B comparison)
Other Changes:
- .gitignore: Added *.d files (dependency files, auto-generated)
- Makefile: PGO targets temporarily disabled (show informational message)
- build_pgo.sh: Temporarily disabled (show "PGO paused" message)
Next: Phase 4-Step3 (Front Config Box, target +5-8%)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 11:58:37 +09:00
d78baf41ce
Phase 3: Remove mincore() syscall completely
...
Problem:
- mincore() was already disabled by default (DISABLE_MINCORE=1)
- Phase 1b/2 registry-based validation made mincore obsolete
- Dead code (~60 lines) remained with complex #ifdef guards
Solution:
Complete removal of mincore() syscall and related infrastructure:
1. Makefile:
- Removed DISABLE_MINCORE configuration (lines 167-177)
- Added Phase 3 comment documenting removal rationale
2. core/box/hak_free_api.inc.h:
- Removed ~60 lines of mincore logic with TLS page cache
- Simplified to: int is_mapped = 1;
- Added comprehensive history comment
3. core/box/external_guard_box.h:
- Simplified external_guard_is_mapped() from 20 lines to 4 lines
- Always returns 1 (assume mapped)
- Added Phase 3 comment
Safety:
Trust internal metadata for all validation:
- SuperSlab registry: validates Tiny allocations (Phase 1b/2)
- AllocHeader: validates Mid/Large allocations
- FrontGate classifier: routes external allocations
Testing:
✓ Build: Clean compilation (no warnings)
✓ Stability: 100/100 test iterations passed (0% crash rate)
✓ Performance: No regression (mincore already disabled)
History:
- Phase 9: Used mincore() for safety
- 2025-11-14: Added DISABLE_MINCORE flag (+10.3% perf improvement)
- Phase 1b/2: Registry-based validation (0% crash rate)
- Phase 3: Dead code cleanup (this commit)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 09:04:32 +09:00
4f2bcb7d32
Refactor: Phase 2 Box化 - SuperSlab Lookup Box with multiple contract levels
...
Purpose: Formalize SuperSlab lookup responsibilities with clear safety guarantees
Evolution:
- Phase 12: UNSAFE mask+dereference (5-10 cycles) → 12% crash rate
- Phase 1b: SAFE registry lookup (50-100 cycles) → 0% crash rate
- Phase 2: Box化 - multiple contracts (UNSAFE/SAFE/GUARDED)
Box Pattern Benefits:
1. Clear Contracts: Each API documents preconditions and guarantees
2. Multiple Levels: Choose speed vs safety based on context
3. Future-Proof: Enables optimizations without breaking existing code
API Design:
- ss_lookup_unsafe(): 5-10 cycles, requires validated pointer (internal use only)
- ss_lookup_safe(): 50-100 cycles, works with arbitrary pointers (recommended)
- ss_lookup_guarded(): 100-200 cycles, adds integrity checks (debug only)
- ss_fast_lookup(): Backward compatible (→ ss_lookup_safe)
Implementation:
- Created core/box/superslab_lookup_box.h with full contract documentation
- Integrated into core/superslab/superslab_inline.h
- ss_lookup_safe() implemented as macro to avoid circular dependency
- ss_lookup_guarded() only available in debug builds
- Removed conflicting extern declarations from 3 locations
Testing:
- Build: Success (all warnings resolved)
- Crash rate: 0% (50/50 iterations passed)
- Backward compatibility: Maintained via ss_fast_lookup() macro
Future Optimization Opportunities (documented in Box):
- Phase 2.1: Hybrid lookup (try UNSAFE first, fallback to SAFE)
- Phase 2.2: Per-thread cache (1-2 cycles hit rate)
- Phase 2.3: Hardware-assisted validation (PAC/CPUID)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 08:44:29 +09:00
846daa3edf
Cleanup: Fix 2 additional Class 0/7 header bugs (correctness fix)
...
Task Agent Investigation:
- Found 2 more instances of hardcoded `class_idx != 7` checks
- These are real bugs (C0 also uses offset=0, not just C7)
- However, NOT the root cause of 12% crash rate
Bug Fixes (2 locations):
1. tls_sll_drain_box.h:190
- Path: TLS SLL drain → tiny_free_local_box()
- Fix: Use tiny_header_write_for_alloc() (ALL classes)
- Reason: tiny_free_local_box() reads header for class_idx
2. hakmem_tiny_refill.inc.h:384
- Path: SuperSlab refill → TLS SLL push
- Fix: Use tiny_header_write_if_preserved() (C1-C6 only)
- Reason: TLS SLL push needs header for validation
Test Results:
- Before: 12% crash rate (88/100 runs successful)
- After: 12% crash rate (44/50 runs successful)
- Conclusion: Correctness fix, but not primary crash cause
Analysis:
- Bugs are real (incorrect Class 0 handling)
- Fixes don't reduce crash rate → different root cause exists
- Heisenbug characteristics (disappears under gdb)
- Likely: Race condition, uninitialized memory, or use-after-free
Remaining Work:
- 12% crash rate persists (requires different investigation)
- Next: Focus on TLS initialization, race conditions, allocation paths
Design Note:
- tls_sll_drain_box.h uses tiny_header_write_for_alloc()
because tiny_free_local_box() needs header to read class_idx
- hakmem_tiny_refill.inc.h uses tiny_header_write_if_preserved()
because TLS SLL push validates header (C1-C6 only)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 08:12:08 +09:00
6e2552e654
Bugfix: Add Header Box and fix Class 0/7 header handling (crash rate -50%)
...
Root Cause Analysis:
- tls_sll_box.h had hardcoded `class_idx != 7` checks
- This incorrectly assumed only C7 uses offset=0
- But C0 (8B) also uses offset=0 (header overwritten by next pointer)
- Result: C0 blocks had corrupted headers in TLS SLL → crash
Architecture Fix: Header Box (Single Source of Truth)
- Created core/box/tiny_header_box.h
- Encapsulates "which classes preserve headers" logic
- Delegates to tiny_nextptr.h (0x7E bitmask: C0=0, C1-C6=1, C7=0)
- API:
* tiny_class_preserves_header() - C1-C6 only
* tiny_header_write_if_preserved() - Conditional write
* tiny_header_validate() - Conditional validation
* tiny_header_write_for_alloc() - Unconditional (alloc path)
Bug Fixes (6 locations):
- tls_sll_box.h:366 - push header restore (C1-C6 only; skip C0/C7)
- tls_sll_box.h:560 - pop header validate (C1-C6 only; skip C0/C7)
- tls_sll_box.h:700 - splice header restore head (C1-C6 only)
- tls_sll_box.h:722 - splice header restore next (C1-C6 only)
- carve_push_box.c:198 - freelist→TLS SLL header restore
- hakmem_tiny_free.inc:78 - drain freelist header restore
Impact:
- Before: 23.8% crash rate (bench_random_mixed_hakmem)
- After: 12% crash rate
- Improvement: 49.6% reduction in crashes
- Test: 88/100 runs successful (vs 76/100 before)
Design Principles:
- Eliminates hardcoded class_idx checks (class_idx != 7)
- Single Source of Truth (tiny_nextptr.h → Header Box)
- Type-safe API prevents future bugs
- Future: Add lint to forbid direct header manipulation
Remaining Work:
- 12% crash rate still exists (likely different root cause)
- Next: Investigate with core dump analysis
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 07:57:49 +09:00
3f461ba25f
Cleanup: Consolidate debug ENV vars to HAKMEM_DEBUG_LEVEL
...
Integrated 4 new debug environment variables added during bug fixes
into the existing unified HAKMEM_DEBUG_LEVEL system (expanded to 0-5 levels).
Changes:
1. Expanded HAKMEM_DEBUG_LEVEL from 0-3 to 0-5 levels:
- 0 = OFF (production)
- 1 = ERROR (critical errors)
- 2 = WARN (warnings)
- 3 = INFO (allocation paths, header validation, stats)
- 4 = DEBUG (guard instrumentation, failfast)
- 5 = TRACE (verbose tracing)
2. Integrated 4 environment variables:
- HAKMEM_ALLOC_PATH_TRACE → HAKMEM_DEBUG_LEVEL >= 3 (INFO)
- HAKMEM_TINY_SLL_VALIDATE_HDR → HAKMEM_DEBUG_LEVEL >= 3 (INFO)
- HAKMEM_TINY_REFILL_FAILFAST → HAKMEM_DEBUG_LEVEL >= 4 (DEBUG)
- HAKMEM_TINY_GUARD → HAKMEM_DEBUG_LEVEL >= 4 (DEBUG)
3. Kept 2 special-purpose variables (fine-grained control):
- HAKMEM_TINY_GUARD_CLASS (target class for guard)
- HAKMEM_TINY_GUARD_MAX (max guard events)
4. Backward compatibility:
- Legacy ENV vars still work via hak_debug_check_level()
- New code uses unified system
- No behavior changes for existing users
Updated files:
- core/hakmem_debug_master.h (level 0-5 expansion)
- core/hakmem_tiny_superslab_internal.h (alloc path trace)
- core/box/tls_sll_box.h (header validation)
- core/tiny_failfast.c (failfast level)
- core/tiny_refill_opt.h (failfast guard)
- core/hakmem_tiny_ace_guard_box.inc (guard enable)
- core/hakmem_tiny.c (include hakmem_debug_master.h)
Impact:
- Simpler debug control: HAKMEM_DEBUG_LEVEL=3 instead of 4 separate ENVs
- Easier to discover/use
- Consistent debug levels across codebase
- Reduces ENV variable proliferation (43+ vars surveyed)
Future work:
- Consolidate remaining 39+ debug variables (documented in survey)
- Gradual migration over 2-3 releases
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 06:57:03 +09:00
20f8d6f179
Cleanup: Add tiny_debug_api.h to eliminate guard/failfast implicit warnings
...
Created central header for debug instrumentation API to fix implicit
function declaration warnings across the codebase.
Changes:
1. Created core/tiny_debug_api.h
- Declares guard system API (3 functions)
- Declares failfast debugging API (3 functions)
- Uses forward declarations for SuperSlab/TinySlabMeta
2. Updated 3 files to include tiny_debug_api.h:
- core/tiny_region_id.h (removed inline externs)
- core/hakmem_tiny_tls_ops.h
- core/tiny_superslab_alloc.inc.h
Warnings eliminated (6 of 11 total):
✅ tiny_guard_is_enabled()
✅ tiny_guard_on_alloc()
✅ tiny_guard_on_invalid()
✅ tiny_failfast_log()
✅ tiny_failfast_abort_ptr()
✅ tiny_refill_failfast_level()
Remaining warnings (deferred to P1):
- ss_active_add (2 occurrences)
- expand_superslab_head
- hkm_ace_set_tls_capacity
- smallmid_backend_free
Impact:
- Cleaner build output
- Better type safety for debug functions
- No behavior changes
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 06:47:13 +09:00
3c6c76cb11
Fix: Restore headers in box_carve_and_push_with_freelist()
...
Root cause identified by Task exploration agent:
- box_carve_and_push_with_freelist() pops blocks from slab
freelist without restoring headers before pushing to TLS SLL
- Freelist blocks have stale data at offset 0
- When popped from TLS SLL, header validation fails
- Error: [TLS_SLL_HDR_RESET] cls=1 got=0x00 expect=0xa1
Fix applied:
1. Added HEADER_MAGIC restoration before tls_sll_push()
in box_carve_and_push_with_freelist() (carve_push_box.c:193-198)
2. Added tiny_region_id.h include for HEADER_MAGIC definition
Results:
- 20 threads: Header corruption ELIMINATED ✓
- 4 threads: Still shows 1 corruption (partial fix)
- Suggests multiple freelist pop paths exist
Additional work needed:
- Check hakmem_tiny_alloc_new.inc freelist pops
- Verify all freelist → TLS SLL paths write headers
Reference:
Same pattern as tiny_superslab_alloc.inc.h:159-169 (correct impl)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 05:44:13 +09:00
5582cbc22c
Refactor: Unified allocation macros + header validation
...
1. Archive unused backend files (ss_legacy/unified_backend_box.c/h)
- These files were not linked in the build
- Moved to archive/ to reduce confusion
2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
- Replaces superslab_return_block() function
- Consistent with existing HAK_RET_ALLOC pattern
- Single source of truth for header writing
- Defined in hakmem_tiny_superslab_internal.h
3. Added header validation on TLS SLL push
- Detects blocks pushed without proper header
- Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
- Always on in debug builds
- Logs first 10 violations with backtraces
Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design
Note: Larson bug still reproduces - header corruption occurs
before push validation can catch it.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 05:37:24 +09:00
6ac6f5ae1b
Refactor: Split hakmem_tiny_superslab.c + unified backend exit point
...
Major refactoring to improve maintainability and debugging:
1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
- superslab_allocate.c: SuperSlab allocation/deallocation
- superslab_backend.c: Backend allocation paths (legacy, shared)
- superslab_ace.c: ACE (Adaptive Cache Engine) logic
- superslab_slab.c: Slab initialization and bitmap management
- superslab_cache.c: LRU cache and prewarm cache management
- superslab_head.c: SuperSlabHead management and expansion
- superslab_stats.c: Statistics tracking and debugging
2. Created hakmem_tiny_superslab_internal.h for shared declarations
3. Added superslab_return_block() as single exit point for header writing:
- All backend allocations now go through this helper
- Prevents bugs where headers are forgotten in some paths
- Makes future debugging easier
4. Updated Makefile for new file structure
5. Added header writing to ss_legacy_backend_box.c and
ss_unified_backend_box.c (though not currently linked)
Note: Header corruption bug in Larson benchmark still exists.
Class 1-6 allocations go through TLS refill/carve paths, not backend.
Further investigation needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-29 05:13:04 +09:00
da3f3507b8
Perf optimization: Add __builtin_expect hints to hot paths
...
Problem: Branch mispredictions in allocation hot paths.
Perf analysis suggested adding likely/unlikely hints.
Solution: Added __builtin_expect hints to critical allocation paths:
1. smallmid_is_enabled() - unlikely (disabled by default)
2. sm_ptr/tiny_ptr/pool_ptr/mid_ptr null checks - likely (success expected)
Optimized locations (core/box/hak_alloc_api.inc.h):
- Line 44: smallmid check (unlikely)
- Line 53: smallmid success check (likely)
- Line 81: tiny success check (likely)
- Line 112: pool success check (likely)
- Line 126: mid success check (likely)
Benchmark results (10M iterations × 5 runs, ws=256):
- Before (Opt2): 71.30M ops/s (avg)
- After (Opt3): 72.92M ops/s (avg)
- Improvement: +2.3% (+1.62M ops/s)
Matches Task agent's prediction of +2-3% throughput gain.
Perf analysis: commit 53bc92842
2025-11-28 18:04:32 +09:00
5a5aaf7514
Cleanup: Reformat super-long line in pool_api.inc.h for readability
...
Refactored the extremely compressed line 312 (previously 600+ chars) into
properly indented, readable code while preserving identical logic:
- Broke down TLS local freelist spill operation into clear steps
- Added clarifying comment for spill operation
- Improved atomic CAS loop formatting
- No functional changes, only formatting improvements
Performance verified: 16-18M ops/s maintained (same as before)
2025-11-28 17:10:32 +09:00
f36ebe83aa
Phase 4c: Add master trace control (HAKMEM_TRACE)
...
Add unified trace control that allows enabling specific trace modules
using comma-separated values or "all" to enable everything.
New file: core/hakmem_trace_master.h
- HAKMEM_TRACE=all: Enable all trace modules
- HAKMEM_TRACE=ptr,refill,free,mailbox: Enable specific modules
- HAKMEM_TRACE_LEVEL=N: Set trace verbosity (1-3)
- hak_trace_check(): Check if module should enable tracing
Available trace modules:
ptr, refill, superslab, ring, free, mailbox, registry
Priority order:
1. HAKMEM_QUIET=1 → suppress all
2. Specific module ENV (e.g., HAKMEM_PTR_TRACE=1)
3. HAKMEM_TRACE=module1,module2
4. Default → disabled
Updated files:
- core/tiny_refill.h: Use hak_trace_check() for refill tracing
- core/box/mailbox_box.c: Use hak_trace_check() for mailbox tracing
Performance: No regression (72.9M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-28 16:08:44 +09:00
a6e681aae7
P2: TLS SLL Redesign - class_map default, tls_cached tracking, conditional header restore
...
This commit completes the P2 phase of the Tiny Pool TLS SLL redesign to fix the
Header/Next pointer conflict that was causing ~30% crash rates.
Changes:
- P2.1: Make class_map lookup the default (ENV: HAKMEM_TINY_NO_CLASS_MAP=1 for legacy)
- P2.2: Add meta->tls_cached field to track blocks cached in TLS SLL
- P2.3: Make Header restoration conditional in tiny_next_store() (default: skip)
- P2.4: Add invariant verification functions (active + tls_cached ≈ used)
- P0.4: Document new ENV variables in ENV_VARS.md
New ENV variables:
- HAKMEM_TINY_ACTIVE_TRACK=1: Enable active/tls_cached tracking (~1% overhead)
- HAKMEM_TINY_NO_CLASS_MAP=1: Disable class_map (legacy mode)
- HAKMEM_TINY_RESTORE_HEADER=1: Force header restoration (legacy mode)
- HAKMEM_TINY_INVARIANT_CHECK=1: Enable invariant verification (debug)
- HAKMEM_TINY_INVARIANT_DUMP=1: Enable periodic state dumps (debug)
Benchmark results (bench_tiny_hot_hakmem 64B):
- Default (class_map ON): 84.49 M ops/sec
- ACTIVE_TRACK=1: 83.62 M ops/sec (-1%)
- NO_CLASS_MAP=1 (legacy): 85.06 M ops/sec
- MT performance: +21-28% vs system allocator
No crashes observed. All tests passed.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-28 14:11:37 +09:00
6b86c60a20
P1.3: Add meta->active for TLS SLL tracking
...
Add active field to TinySlabMeta to track blocks currently held by
users (not in TLS SLL or freelist caches). This enables accurate
empty slab detection that accounts for TLS SLL cached blocks.
Changes:
- superslab_types.h: Add _Atomic uint16_t active field
- ss_allocation_box.c, hakmem_tiny_superslab.c: Initialize active=0
- tiny_free_fast_v2.inc.h: Decrement active on TLS SLL push
- tiny_alloc_fast.inc.h: Add tiny_active_track_alloc() helper,
increment active on TLS SLL pop (all code paths)
- ss_hot_cold_box.h: ss_is_slab_empty() uses active when enabled
All tracking is ENV-gated: HAKMEM_TINY_ACTIVE_TRACK=1 to enable.
Default is off for zero performance impact.
Invariant: active = used - tls_cached (active <= used)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-28 13:53:45 +09:00
dc9e650db3
Tiny Pool redesign: P0.1, P0.3, P1.1, P1.2 - Out-of-band class_idx lookup
...
This commit implements the first phase of Tiny Pool redesign based on
ChatGPT architecture review. The goal is to eliminate Header/Next pointer
conflicts by moving class_idx lookup out-of-band (to SuperSlab metadata).
## P0.1: C0(8B) class upgraded to 16B
- Size table changed: {16,32,64,128,256,512,1024,2048} (8 classes)
- LUT updated: 1..16 → class 0, 17..32 → class 1, etc.
- tiny_next_off: C0 now uses offset 1 (header preserved)
- Eliminates edge cases for 8B allocations
## P0.3: Slab reuse guard Box (tls_slab_reuse_guard_box.h)
- New Box for draining TLS SLL before slab reuse
- ENV gate: HAKMEM_TINY_SLAB_REUSE_GUARD=1
- Prevents stale pointers when slabs are recycled
- Follows Box theory: single responsibility, minimal API
## P1.1: SuperSlab class_map addition
- Added uint8_t class_map[SLABS_PER_SUPERSLAB_MAX] to SuperSlab
- Maps slab_idx → class_idx for out-of-band lookup
- Initialized to 255 (UNASSIGNED) on SuperSlab creation
- Set correctly on slab initialization in all backends
## P1.2: Free fast path uses class_map
- ENV gate: HAKMEM_TINY_USE_CLASS_MAP=1
- Free path can now get class_idx from class_map instead of Header
- Falls back to Header read if class_map returns invalid value
- Fixed Legacy Backend dynamic slab initialization bug
## Documentation added
- HAKMEM_ARCHITECTURE_OVERVIEW.md: 4-layer architecture analysis
- TLS_SLL_ARCHITECTURE_INVESTIGATION.md: Root cause analysis
- PTR_LIFECYCLE_TRACE_AND_ROOT_CAUSE_ANALYSIS.md: Pointer tracking
- TINY_REDESIGN_CHECKLIST.md: Implementation roadmap (P0-P3)
## Test results
- Baseline: 70% success rate (30% crash - pre-existing issue)
- class_map enabled: 70% success rate (same as baseline)
- Performance: ~30.5M ops/s (unchanged)
## Next steps (P1.3, P2, P3)
- P1.3: Add meta->active for accurate TLS/freelist sync
- P2: TLS SLL redesign with Box-based counting
- P3: Complete Header out-of-band migration
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-28 13:42:39 +09:00