faed928969
Perf: Optimize remote queue drain to skip when empty
Optimization:
=============
Check remote_counts[slab_idx] BEFORE calling drain function.
If remote queue is empty (count == 0), skip the drain entirely.
Impact:
- Single-threaded: remote_count is ALWAYS 0 → drain calls = 0
- Multi-threaded: only drain when there are actual remote frees
- Reduces unnecessary function call overhead in common case
Code:
  if (tls->ss && tls->slab_idx >= 0) {
      uint32_t remote_count = atomic_load_explicit(
          &tls->ss->remote_counts[tls->slab_idx], memory_order_relaxed);
      if (remote_count > 0) {
          _ss_remote_drain_to_freelist_unsafe(tls->ss, tls->slab_idx, meta);
      }
  }
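The check above can be tried in isolation with a minimal sketch. Everything here is a hypothetical stand-in (`demo_slab_t`, `demo_drain`, `demo_drain_if_needed`); the real drain function and SuperSlab layout are not reproduced, only the "load the counter, skip when zero" shape.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for one slab's remote-free state. */
typedef struct {
    _Atomic uint32_t remote_count; /* blocks waiting in the remote queue */
    uint32_t drain_calls;          /* how many times the drain actually ran */
} demo_slab_t;

/* Stand-in for the real drain: records the call and empties the counter. */
static void demo_drain(demo_slab_t *slab) {
    slab->drain_calls++;
    atomic_store_explicit(&slab->remote_count, 0, memory_order_relaxed);
}

/* Drain only when the relaxed counter reports pending remote frees.
 * Returns 1 if a drain ran, 0 if it was skipped. */
static int demo_drain_if_needed(demo_slab_t *slab) {
    uint32_t pending =
        atomic_load_explicit(&slab->remote_count, memory_order_relaxed);
    if (pending == 0) {
        return 0; /* common single-threaded case: no call at all */
    }
    demo_drain(slab);
    return 1;
}
```

With a counter that stays at zero, `demo_drain_if_needed()` never reaches `demo_drain()`, which mirrors the "drain calls = 0" claim for the single-threaded case.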
Benchmark Results:
==================
bench_random_mixed (1 thread):
Before: 1,020,163 ops/s
After: 1,015,347 ops/s (-0.5%, within noise)
larson_hakmem (4 threads):
Before: 931,629 ops/s (1073 sec)
After: 929,709 ops/s (1075 sec) (-0.2%, within noise)
Note: Performance unchanged, but code is cleaner and avoids
unnecessary work in single-threaded case. Real bottleneck
appears to be elsewhere (Magazine layer overhead per CLAUDE.md).
Next: Profile with perf to find actual hotspots.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 01:44:24 +09:00
0b1c825f25
Fix: CRITICAL multi-threaded freelist/remote queue race condition
Root Cause:
===========
Freelist and remote queue contained the SAME blocks, causing use-after-free:
1. Thread A (owner): pops block X from freelist → allocates to user
2. User writes data ("ab") to block X
3. Thread B (remote): free(block X) → adds to remote queue
4. Thread A (later): drains remote queue → *(void**)block_X = chain_head
→ OVERWRITES USER DATA! 💥
The freelist pop path did NOT drain the remote queue first, so blocks could
be simultaneously in both freelist and remote queue.
Fix:
====
Add remote queue drain BEFORE freelist pop in refill path:
core/hakmem_tiny_refill_p0.inc.h:
- Call _ss_remote_drain_to_freelist_unsafe() BEFORE trc_pop_from_freelist()
- Add #include "superslab/superslab_inline.h"
- This ensures freelist and remote queue are mutually exclusive
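The drain-before-pop ordering can be sketched on a simplified model. The types and functions below (`rq_block_t`, `rq_slab_t`, `rq_remote_free`, `rq_drain`, `rq_alloc`) are hypothetical and do not mirror the actual hakmem structures; the sketch assumes a lock-free stack for remote frees and an owner-only freelist, and it shows only the ordering, not the full race.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical block: the first word doubles as the "next" link. */
typedef struct rq_block { struct rq_block *next; } rq_block_t;

typedef struct {
    rq_block_t *freelist;              /* touched by the owner thread only */
    _Atomic(rq_block_t *) remote_head; /* other threads push here */
} rq_slab_t;

/* Remote free: lock-free push onto the remote stack. */
static void rq_remote_free(rq_slab_t *slab, rq_block_t *b) {
    rq_block_t *old =
        atomic_load_explicit(&slab->remote_head, memory_order_relaxed);
    do {
        b->next = old;
    } while (!atomic_compare_exchange_weak_explicit(
        &slab->remote_head, &old, b,
        memory_order_release, memory_order_relaxed));
}

/* Owner only: detach the whole remote chain and splice it onto the
 * freelist. Returns the number of blocks moved. */
static size_t rq_drain(rq_slab_t *slab) {
    rq_block_t *chain = atomic_exchange_explicit(
        &slab->remote_head, NULL, memory_order_acquire);
    size_t moved = 0;
    while (chain != NULL) {
        rq_block_t *next = chain->next;
        chain->next = slab->freelist;
        slab->freelist = chain;
        chain = next;
        moved++;
    }
    return moved;
}

/* Owner only: drain first, then pop, so the remote queue is empty at the
 * moment a block leaves the freelist. */
static rq_block_t *rq_alloc(rq_slab_t *slab) {
    rq_drain(slab);
    rq_block_t *b = slab->freelist;
    if (b != NULL) {
        slab->freelist = b->next;
    }
    return b;
}
```

In this model a block is linked into exactly one of the two lists at a time, because the drain detaches the remote chain before any pop can hand a block out.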
Test Results:
=============
BEFORE:
larson_hakmem (4 threads): ❌ SEGV in seconds (freelist corruption)
AFTER:
larson_hakmem (4 threads): ✅ 931,629 ops/s (1073 sec stable run)
bench_random_mixed: ✅ 1,020,163 ops/s (no crashes)
Evidence:
- Fail-Fast logs showed next pointer corruption: 0x...6261 (ASCII "ab")
- Single-threaded benchmarks worked (865K ops/s)
- Multi-threaded Larson crashed immediately
- Fix eliminates all crashes in both benchmarks
Files:
- core/hakmem_tiny_refill_p0.inc.h: Add remote drain before freelist pop
- CURRENT_TASK.md: Document fix details
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 01:35:45 +09:00
b7021061b8
Fix: CRITICAL double-allocation bug in trc_linear_carve()
Root Cause:
trc_linear_carve() used meta->used as cursor, but meta->used decrements
on free, causing already-allocated blocks to be re-carved.
Evidence:
- [LINEAR_CARVE] used=61 batch=1 → block 61 created
- (blocks freed, used decrements 62→59)
- [LINEAR_CARVE] used=59 batch=3 → blocks 59,60,61 RE-CREATED!
- Result: double-allocation → memory corruption → SEGV
Fix Implementation:
1. Added TinySlabMeta.carved (monotonic counter, never decrements)
2. Changed trc_linear_carve() to use carved instead of used
3. carved tracks carve progress, used tracks active count
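The split between the two counters can be illustrated with a small sketch. `carve_meta_t`, `carve_linear`, and `carve_note_free` are hypothetical names, not the real `TinySlabMeta` or `trc_linear_carve()`; the assumption is a slab of equal-sized blocks carved linearly from a base address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slab metadata with separate carve and usage counters. */
typedef struct {
    uint8_t *base;       /* start of the slab payload */
    size_t   block_size; /* bytes per block */
    uint32_t capacity;   /* total blocks in the slab */
    uint32_t carved;     /* monotonic: blocks ever carved, never decremented */
    uint32_t used;       /* active blocks: goes up on alloc, down on free */
} carve_meta_t;

/* Carve up to `want` never-before-used blocks into out[].
 * The cursor is `carved`, so frees cannot move it backwards. */
static uint32_t carve_linear(carve_meta_t *m, uint32_t want, void **out) {
    uint32_t avail = m->capacity - m->carved;
    uint32_t n = want < avail ? want : avail;
    for (uint32_t i = 0; i < n; i++) {
        out[i] = m->base + (size_t)(m->carved + i) * m->block_size;
    }
    m->carved += n;
    m->used += n;
    return n;
}

/* A free lowers only the active count. */
static void carve_note_free(carve_meta_t *m) {
    m->used--;
}
```

Replaying the scenario from the evidence (62 blocks carved, three freed, then a batch of three) yields blocks 62 to 64 in this sketch rather than a second copy of 59 to 61.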
Files Modified:
- core/superslab/superslab_types.h: Add carved field
- core/tiny_refill_opt.h: Use carved in trc_linear_carve()
- core/hakmem_tiny_superslab.c: Initialize carved=0
- core/tiny_alloc_fast.inc.h: Add next pointer validation
- core/hakmem_tiny_free.inc: Add drain/free validation
Test Results:
✅ bench_random_mixed: 950,037 ops/s (no crash)
✅ Fail-fast mode: 651,627 ops/s (with diagnostic logs)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 01:18:37 +09:00
c9053a43ac
Phase 6-2.3~6-2.5: Critical bug fixes + SuperSlab optimization (WIP)
## Phase 6-2.3: Fix 4T Larson crash (active counter bug) ✅
**Problem:** 4T Larson crashed with "free(): invalid pointer" and OOM errors
**Root cause:** core/hakmem_tiny_refill_p0.inc.h:103
- P0 batch refill moved freelist blocks to TLS cache
- Active counter NOT incremented → double-decrement on free
- Counter underflows → SuperSlab appears full → OOM → crash
**Fix:** Added ss_active_add(tls->ss, from_freelist);
**Result:** 4T stable at 838K ops/s ✅
## Phase 6-2.4: Fix SEGV in random_mixed/mid_large_mt benchmarks ✅
**Problem:** bench_random_mixed_hakmem, bench_mid_large_mt_hakmem → immediate SEGV
**Root cause #1:** core/box/hak_free_api.inc.h:92-95
- "Guess loop" dereferenced unmapped memory when registry lookup failed
**Root cause #2:** core/box/hak_free_api.inc.h:115
- Header magic check dereferenced unmapped memory
**Fix:**
1. Removed dangerous guess loop (lines 92-95)
2. Added hak_is_memory_readable() check before dereferencing header
(core/hakmem_internal.h:277-294 - uses mincore() syscall)
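A mincore()-based probe of the kind described above might look roughly like the following. This is a sketch under assumptions, not the contents of core/hakmem_internal.h: `demo_is_mapped` is a hypothetical name, and it relies on Linux behavior where mincore() fails with ENOMEM for an unmapped range. Note that mincore() reports whether a page is mapped, not whether its protection allows reads.

```c
#define _DEFAULT_SOURCE /* exposes mincore() in glibc headers */
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page containing `addr` is mapped in this process,
 * 0 otherwise. Costs one syscall per call. */
static int demo_is_mapped(const void *addr) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) {
        return 0;
    }
    /* mincore() requires a page-aligned start address. */
    uintptr_t start = (uintptr_t)addr & ~((uintptr_t)page - 1);
    unsigned char vec; /* one status byte is enough for a single page */
    return mincore((void *)start, 1, &vec) == 0;
}
```

Because every call is a syscall, a probe like this is best kept off hot paths, which matches the overhead reported in Phase 6-2.5.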
**Result:**
- random_mixed (2KB): SEGV → 2.22M ops/s ✅
- random_mixed (4KB): SEGV → 2.58M ops/s ✅
- Larson 4T: no regression (838K ops/s) ✅
## Phase 6-2.5: Performance investigation + SuperSlab fix (WIP) ⚠️
**Problem:** Severe performance gaps (19-26x slower than system malloc)
**Investigation:** Task agent identified root cause
- hak_is_memory_readable() syscall overhead (100-300 cycles per free)
- ALL frees hit unmapped_header_fallback path
- SuperSlab lookup NEVER called
- Why? g_use_superslab = 0 (disabled by diet mode)
**Root cause:** core/hakmem_tiny_init.inc:104-105
- Diet mode (default ON) disables SuperSlab
- SuperSlab defaults to 1 (hakmem_config.c:334)
- BUT diet mode overrides it to 0 during init
**Fix:** Separate SuperSlab from diet mode
- SuperSlab: Performance-critical (fast alloc/free)
- Diet mode: Memory efficiency (magazine capacity limits only)
- Both are independent features, should not interfere
**Status:** ⚠️ INCOMPLETE - New SEGV discovered after fix
- SuperSlab lookup now works (confirmed via debug output)
- But benchmark crashes (Exit 139) after ~20 lookups
- Needs further investigation
**Files modified:**
- core/hakmem_tiny_init.inc:99-109 - Removed diet mode override
- PERFORMANCE_INVESTIGATION_REPORT.md - Task agent analysis (303x instruction gap)
**Next steps:**
- Investigate new SEGV (likely SuperSlab free path bug)
- OR: Revert Phase 6-2.5 changes if blocking progress
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 20:31:01 +09:00
f6b06a0311
Fix: Active counter double-decrement in P0 batch refill (4T crash → stable)
## Problem
HAKMEM 4T crashed with "free(): invalid pointer" on startup:
- System/mimalloc: 3.3M ops/s ✅
- HAKMEM 1T: 838K ops/s (-75%) ⚠️
- HAKMEM 4T: Crash (Exit 134) ❌
Error: superslab_refill returned NULL (OOM), active=0, bitmap=0x00000000
## Root Cause (Ultrathink Task Agent Investigation)
Active counter double-decrement when re-allocating from freelist:
1. Free → counter-- ✅
2. Remote drain → add to freelist (no counter change) ✅
3. P0 batch refill → move to TLS cache (forgot counter++) ❌ BUG!
4. Next free → counter-- ❌ Double decrement!
Result: Counter underflow → SuperSlab appears "full" → OOM → crash
## Fix (1 line)
File: core/hakmem_tiny_refill_p0.inc.h:103
+ss_active_add(tls->ss, from_freelist);
Reason: Freelist re-allocation moves block from "free" to "allocated" state,
so active counter MUST increment.
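The counter bookkeeping can be modeled in a few lines. The names below (`ac_slab_t`, `ac_tls_t`, `ac_free_block`, `ac_refill`) are hypothetical, and the model folds the free and the remote drain into one step for brevity; it is meant only to show why the refill path has to add back what the free path subtracted.

```c
#include <stdint.h>

/* Hypothetical slab counters. */
typedef struct {
    uint32_t active;       /* blocks currently handed out (user or TLS cache) */
    uint32_t freelist_len; /* blocks sitting on the slab freelist */
} ac_slab_t;

/* Hypothetical per-thread cache. */
typedef struct {
    uint32_t cached;
} ac_tls_t;

/* Free path: the block leaves the "allocated" state. */
static void ac_free_block(ac_slab_t *s) {
    s->active--;
    s->freelist_len++;
}

/* Batch refill: move up to `want` blocks from the freelist into the TLS
 * cache and count them as active again. */
static uint32_t ac_refill(ac_slab_t *s, ac_tls_t *t, uint32_t want) {
    uint32_t n = want < s->freelist_len ? want : s->freelist_len;
    s->freelist_len -= n;
    t->cached += n;
    s->active += n; /* counterpart of the one-line fix; without it the
                       next round of frees would underflow `active` */
    return n;
}
```

Dropping the `s->active += n` line from this model reproduces the underflow: four frees, a refill of four, and four more frees would take an unsigned counter below zero.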
## Verification
| Setting | Before | After | Result |
|----------------|---------|----------------|--------------|
| 4T default | ❌ Crash | ✅ 838,445 ops/s | 🎉 Stable |
| Stability (2x) | - | ✅ Same score | Reproducible |
## Remaining Issue
❌ HAKMEM_TINY_REFILL_COUNT_HOT=64 triggers crash (class=4 OOM)
- Suspected: TLS cache over-accumulation or memory leak
- Next: Investigate HAKMEM_TINY_FAST_CAP interaction
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 12:37:23 +09:00
1da8754d45
CRITICAL FIX: Fully resolve the 4T SEGV caused by uninitialized TLS
**Problem:**
- Larson 4T hit SEGV 100% of the time (1T completed at 2.09M ops/s)
- System/mimalloc ran normally at 4T with 33.52M ops/s
- 4T still hit SEGV with SS OFF + Remote OFF
**Root cause: (result of the Task agent ultrathink investigation)**
```
CRASH: mov (%r15),%r13
R15 = 0x6261 ← ASCII "ba" (garbage value, uninitialized TLS)
```
TLS variables in worker threads were uninitialized:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← no initializer
- Not zero-initialized in threads created by pthread_create()
- The NULL check passes (0x6261 != NULL) → dereference → SEGV
**Fix:**
Added an explicit initializer `= {0}` to every TLS array:
1. **core/hakmem_tiny.c:**
- `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
- `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
- `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
- `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
- `g_tls_bend[TINY_NUM_CLASSES] = {0}`
2. **core/tiny_fastcache.c:**
- `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`
3. **core/hakmem_tiny_magazine.c:**
- `g_tls_mags[TINY_NUM_CLASSES] = {0}`
4. **core/tiny_sticky.c:**
- `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
- `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
- `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`
**Effect:**
```
Before: 1T: 2.09M ✅ | 4T: SEGV 💀
After: 1T: 2.41M ✅ | 4T: 4.19M ✅ (+15% 1T, SEGV resolved)
```
**Tests:**
```bash
# 1 thread: runs to completion
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s ✅
# 4 threads: runs to completion (previously SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s ✅
```
**Investigation credit:** Root cause pinpointed by the Task agent (ultrathink mode)
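The declaration pattern from this commit, together with the NULL check it protects, can be sketched as follows. `DEMO_NUM_CLASSES`, `demo_tls_head`, `demo_tls_count`, `demo_tls_push`, and `demo_tls_pop` are hypothetical stand-ins for the real TLS arrays and their accessors. The explicit `= {0}` states the intended starting value in the declaration itself (C11 also specifies zero-initialization for thread-storage objects that have no initializer).

```c
#include <stddef.h>

#define DEMO_NUM_CLASSES 8 /* hypothetical; stands in for TINY_NUM_CLASSES */

/* Per-thread singly linked free lists, explicitly zero-initialized. */
static __thread void *demo_tls_head[DEMO_NUM_CLASSES] = {0};
static __thread unsigned demo_tls_count[DEMO_NUM_CLASSES] = {0};

/* Push a free block; its first word stores the previous head. */
static void demo_tls_push(int cls, void *block) {
    *(void **)block = demo_tls_head[cls];
    demo_tls_head[cls] = block;
    demo_tls_count[cls]++;
}

/* Pop a block, or return NULL when the list is empty. The NULL check is
 * only meaningful if every thread starts with a zeroed head. */
static void *demo_tls_pop(int cls) {
    void *head = demo_tls_head[cls];
    if (head == NULL) {
        return NULL;
    }
    demo_tls_head[cls] = *(void **)head;
    demo_tls_count[cls]--;
    return head;
}
```

A thread that has never pushed anything gets NULL from `demo_tls_pop()` for every class, which is the state the crash dump above shows was violated.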
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 01:27:04 +09:00
52386401b3
Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files
Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)
This is a clean repository without large log files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00