|
|
707056b765
|
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-08 17:08:00 +09:00 |
|
|
|
a430545820
|
Phase 6-2.8: SuperSlab modular refactoring (665 lines → 104 lines)
目的: hakmem_tiny_superslab.h の肥大化を解消 (500+ 行)
実装内容:
1. superslab_types.h を作成
- SuperSlab 構造体定義 (TinySlabMeta, SuperSlab)
- 設定定数 (SUPERSLAB_SIZE_MAX, TINY_NUM_CLASSES_SS)
- コンパイル時アサーション
2. superslab_inline.h を作成
- ホットパス用インライン関数を集約
- ss_slabs_capacity(), slab_index_for()
- tiny_slab_base_for(), ss_remote_push()
- _ss_remote_drain_to_freelist_unsafe()
- Fail-fast validation helpers
- ACE helpers (hak_now_ns, hak_tiny_superslab_next_lg)
3. hakmem_tiny_superslab.h をリファクタリング
- 665 行 → 104 行 (-84%)
- include のみに書き換え
- 関数宣言と extern 宣言のみ残す
効果:
✅ ビルド成功 (libhakmem.so, larson_hakmem)
✅ Mid-Large allocator テスト通過 (3.98M ops/s)
⚠️ Tiny allocator の freelist corruption バグは未解決 (リファクタリングのスコープ外)
注意:
- Phase 6-2.6/6-2.7 の freelist バグは依然として存在
- リファクタリングは保守性向上のみが目的
- バグ修正は次のフェーズで対応
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-07 23:05:33 +09:00 |
|
|
|
d2f0d84584
|
Phase 6-2.5: Fix SuperSlab alignment bug + refactor constants
## Problem: 53-byte misalignment mystery
**Symptom:** All SuperSlab allocations misaligned by exactly 53 bytes
```
[TRC_FAILFAST_PTR] stage=alloc_ret_align cls=7 ptr=0x..f835
offset=63541 (expected: 63488)
Diff: 63541 - 63488 = 53 bytes
```
## Root Cause (Ultrathink investigation)
**sizeof(SuperSlab) != hardcoded offset:**
- `sizeof(SuperSlab)` = 1088 bytes (actual struct size)
- `tiny_slab_base_for()` used: 1024 (hardcoded)
- `superslab_init_slab()` assumed: 2048 (in capacity calc)
**Impact:**
1. Memory corruption: 64-byte overlap with SuperSlab metadata
2. Misalignment: 1088 % 1024 = 64 (violates class 7 alignment)
3. Inconsistency: Init assumed 2048, but runtime used 1024
## Solution
### 1. Centralize constants (NEW)
**File:** `core/hakmem_tiny_superslab_constants.h`
- `SLAB_SIZE` = 64KB
- `SUPERSLAB_HEADER_SIZE` = 1088
- `SUPERSLAB_SLAB0_DATA_OFFSET` = 2048 (aligned to 1024)
- `SUPERSLAB_SLAB0_USABLE_SIZE` = 63488 (64KB - 2048)
- Compile-time validation checks
**Why 2048?**
- Round up 1088 to next 1024-byte boundary
- Ensures proper alignment for class 7 (1024-byte blocks)
- Previous: (1088 + 1023) & ~1023 = 2048
### 2. Update all code to use constants
- `hakmem_tiny_superslab.h`: `tiny_slab_base_for()` → use `SUPERSLAB_SLAB0_DATA_OFFSET`
- `hakmem_tiny_superslab.c`: `superslab_init_slab()` → use `SUPERSLAB_SLAB0_USABLE_SIZE`
- Removed hardcoded 1024, 2048 magic numbers
### 3. Add class consistency check
**File:** `core/tiny_superslab_alloc.inc.h:433-449`
- Verify `tls->ss->size_class == class_idx` before allocation
- Unbind TLS if mismatch detected
- Prevents using wrong block_size for calculations
## Status
⚠️ **INCOMPLETE - New issue discovered**
After fix, benchmark hits different error:
```
[TRC_FAILFAST] stage=freelist_next cls=7 node=0x...d474
```
Freelist corruption detected. Likely caused by:
- 2048 offset change affects free() path
- Block addresses no longer match freelist expectations
- Needs further investigation
## Files Modified
- `core/hakmem_tiny_superslab_constants.h` - NEW: Centralized constants
- `core/hakmem_tiny_superslab.h` - Use SUPERSLAB_SLAB0_DATA_OFFSET
- `core/hakmem_tiny_superslab.c` - Use SUPERSLAB_SLAB0_USABLE_SIZE
- `core/tiny_superslab_alloc.inc.h` - Add class consistency check
- `core/hakmem_tiny_init.inc` - Remove diet mode override (Phase 6-2.5)
- `core/hakmem_super_registry.h` - Remove debug output (cleaned)
- `PERFORMANCE_INVESTIGATION_REPORT.md` - Task agent analysis
## Next Steps
1. Investigate freelist corruption with 2048 offset
2. Verify free() path uses tiny_slab_base_for() correctly
3. Consider reverting to 1024 and fixing capacity calculation instead
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-07 21:45:20 +09:00 |
|
|
|
c9053a43ac
|
Phase 6-2.3~6-2.5: Critical bug fixes + SuperSlab optimization (WIP)
## Phase 6-2.3: Fix 4T Larson crash (active counter bug) ✅
**Problem:** 4T Larson crashed with "free(): invalid pointer", OOM errors
**Root cause:** core/hakmem_tiny_refill_p0.inc.h:103
- P0 batch refill moved freelist blocks to TLS cache
- Active counter NOT incremented → double-decrement on free
- Counter underflows → SuperSlab appears full → OOM → crash
**Fix:** Added ss_active_add(tls->ss, from_freelist);
**Result:** 4T stable at 838K ops/s ✅
## Phase 6-2.4: Fix SEGV in random_mixed/mid_large_mt benchmarks ✅
**Problem:** bench_random_mixed_hakmem, bench_mid_large_mt_hakmem → immediate SEGV
**Root cause #1:** core/box/hak_free_api.inc.h:92-95
- "Guess loop" dereferenced unmapped memory when registry lookup failed
**Root cause #2:** core/box/hak_free_api.inc.h:115
- Header magic check dereferenced unmapped memory
**Fix:**
1. Removed dangerous guess loop (lines 92-95)
2. Added hak_is_memory_readable() check before dereferencing header
(core/hakmem_internal.h:277-294 - uses mincore() syscall)
**Result:**
- random_mixed (2KB): SEGV → 2.22M ops/s ✅
- random_mixed (4KB): SEGV → 2.58M ops/s ✅
- Larson 4T: no regression (838K ops/s) ✅
## Phase 6-2.5: Performance investigation + SuperSlab fix (WIP) ⚠️
**Problem:** Severe performance gaps (19-26x slower than system malloc)
**Investigation:** Task agent identified root cause
- hak_is_memory_readable() syscall overhead (100-300 cycles per free)
- ALL frees hit unmapped_header_fallback path
- SuperSlab lookup NEVER called
- Why? g_use_superslab = 0 (disabled by diet mode)
**Root cause:** core/hakmem_tiny_init.inc:104-105
- Diet mode (default ON) disables SuperSlab
- SuperSlab defaults to 1 (hakmem_config.c:334)
- BUT diet mode overrides it to 0 during init
**Fix:** Separate SuperSlab from diet mode
- SuperSlab: Performance-critical (fast alloc/free)
- Diet mode: Memory efficiency (magazine capacity limits only)
- Both are independent features, should not interfere
**Status:** ⚠️ INCOMPLETE - New SEGV discovered after fix
- SuperSlab lookup now works (confirmed via debug output)
- But benchmark crashes (Exit 139) after ~20 lookups
- Needs further investigation
**Files modified:**
- core/hakmem_tiny_init.inc:99-109 - Removed diet mode override
- PERFORMANCE_INVESTIGATION_REPORT.md - Task agent analysis (303x instruction gap)
**Next steps:**
- Investigate new SEGV (likely SuperSlab free path bug)
- OR: Revert Phase 6-2.5 changes if blocking progress
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-07 20:31:01 +09:00 |
|
|
|
1da8754d45
|
CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消
**問題:**
- Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走)
- System/mimalloc は 4T で 33.52M ops/s 正常動作
- SS OFF + Remote OFF でも 4T で SEGV
**根本原因: (Task agent ultrathink 調査結果)**
```
CRASH: mov (%r15),%r13
R15 = 0x6261 ← ASCII "ba" (ゴミ値、未初期化TLS)
```
Worker スレッドの TLS 変数が未初期化:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← 初期化なし
- pthread_create() で生成されたスレッドでゼロ初期化されない
- NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV
**修正内容:**
全 TLS 配列に明示的初期化子 `= {0}` を追加:
1. **core/hakmem_tiny.c:**
- `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
- `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
- `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
- `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
- `g_tls_bend[TINY_NUM_CLASSES] = {0}`
2. **core/tiny_fastcache.c:**
- `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`
3. **core/hakmem_tiny_magazine.c:**
- `g_tls_mags[TINY_NUM_CLASSES] = {0}`
4. **core/tiny_sticky.c:**
- `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
- `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
- `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`
**効果:**
```
Before: 1T: 2.09M ✅ | 4T: SEGV 💀
After: 1T: 2.41M ✅ | 4T: 4.19M ✅ (+15% 1T, SEGV解消)
```
**テスト:**
```bash
# 1 thread: 完走
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s ✅
# 4 threads: 完走(以前は SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s ✅
```
**調査協力:** Task agent (ultrathink mode) による完璧な根本原因特定
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-07 01:27:04 +09:00 |
|
|
|
cd6507468e
|
Fix critical SuperSlab accounting bug + ACE improvements
Critical Bug Fix (OOM Root Cause):
- ss_remote_push() was missing ss_active_dec_one() call
- Cross-thread frees did not decrement total_active_blocks
- SuperSlabs appeared "full" even when empty
- hak_tiny_trim() could never free SuperSlabs → OOM
- Result: alloc=49,123 freed=0 bytes=103GB
One-Line Fix (core/hakmem_tiny_superslab.h:360):
+ ss_active_dec_one(ss); // Decrement on cross-thread free
Impact:
- OOM eliminated (167GB VmSize → clean exit)
- SuperSlabs now properly freed
- Performance maintained: 4.19M ops/s (±0%)
- Memory leak fixed (freed: 0 → expected ~45,000+)
ACE Improvements:
- Set SUPERSLAB_LG_DEFAULT = 21 (2MB, was 1MB)
- g_ss_min_lg_env now uses SUPERSLAB_LG_DEFAULT
- hak_tiny_superslab_next_lg() fallback to default if uninitialized
- Centralized ACE constants in .h for easier tuning
Verification:
- Larson benchmark: Clean completion, no OOM
- Throughput: 4,192,124 ops/s (baseline maintained)
Root cause analysis by Task agent: Larson 50%+ cross-thread frees
triggered accounting leak, preventing SuperSlab reclamation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-06 22:26:58 +09:00 |
|
|
|
52386401b3
|
Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files
Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)
This is a clean repository without large log files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-05 12:31:14 +09:00 |
|