4983352812
Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%)
...
## Summary
Fixed CRITICAL bottleneck (mincore overhead) and macro definition bug.
Result: 2-3x performance improvement across all benchmarks.
## Performance Results
- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero ✅
## Changes
### 1. Hybrid mincore Optimization (317-634x faster)
**Problem**: `hak_is_memory_readable()` calls mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc
**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% call mincore
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% call mincore
- Result: 634 → 1-2 cycles effective (99.6% skip mincore)
**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added
### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)
**Problem**: Macro definition order prevented Phase 7 header write
- hakmem_tiny.c:130 defined legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had `#ifndef` guard → skipped!
- Result: Headers NEVER written → All frees failed → Slow path
**Solution**: Force Phase 7 macro to override legacy
- hakmem_tiny.c:119 - Added `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before redefine
### 3. Magic Byte Fix
**Problem**: Release builds don't write magic byte, but free ALWAYS checks it
- Result: All headers marked as invalid
**Solution**: ALWAYS write magic byte (same 1-byte write, no overhead)
- tiny_region_id.h:50-54 - Removed `#if !HAKMEM_BUILD_RELEASE` guard
## Technical Details
### Hybrid mincore Effectiveness
| Case | Frequency | Cost | Weighted |
|------|-----------|------|----------|
| Normal (Step 1) | 99.9% | 1-2 cycles | 1-2 |
| Page boundary | 0.1% | 634 cycles | 0.6 |
| **Total** | - | - | **1.6-2.6 cycles** |
**Improvement**: 634 → 1.6 cycles = **317-396x faster!**
### Macro Fix Impact
**Before**: HAK_RET_ALLOC(cls, ptr) → return (ptr) // No header write
**After**: HAK_RET_ALLOC(cls, ptr) → return tiny_region_id_write_header((ptr), (cls))
**Result**: Headers properly written → Fast path works → +194-333% performance
## Investigation
Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)
Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)
## Related
- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-08 04:50:41 +09:00
6b1382959c
Phase 7-1 PoC: Region-ID Direct Lookup (+39%~+436% improvement!)
...
Implemented ultra-fast header-based free path that eliminates SuperSlab
lookup bottleneck (100+ cycles → 5-10 cycles).
## Key Changes
1. **Smart Headers** (core/tiny_region_id.h):
- 1-byte header before each allocation stores class_idx
- Memory layout: [Header: 1B] [User data: N-1B]
- Overhead: <2% average (0% for Slab[0] using wasted padding)
2. **Ultra-Fast Allocation** (core/tiny_alloc_fast.inc.h):
- Write header at base: *base = class_idx
- Return user pointer: base + 1
3. **Ultra-Fast Free** (core/tiny_free_fast_v2.inc.h):
- Read class_idx from header (ptr-1): 2-3 cycles
- Push base (ptr-1) to TLS freelist: 3-5 cycles
- Total: 5-10 cycles (vs 500+ cycles current!)
4. **Free Path Integration** (core/box/hak_free_api.inc.h):
- Removed SuperSlab lookup from fast path
- Direct header validation (no lookup needed!)
5. **Size Class Adjustment** (core/hakmem_tiny.h):
- Max tiny size: 1023B (was 1024B)
- 1024B requests → Mid allocator fallback
## Performance Results
| Size | Baseline | Phase 7 | Improvement |
|------|----------|---------|-------------|
| 128B | 1.22M | 6.54M | **+436%** 🚀 |
| 512B | 1.22M | 1.70M | **+39%** |
| 1023B | 1.22M | 1.92M | **+57%** |
## Build & Test
Enable Phase 7:
make HEADER_CLASSIDX=1 bench_random_mixed_hakmem
Run benchmark:
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000 128 1234567
## Known Issues
- 1024B requests fallback to Mid allocator (by design)
- Target 40-60M ops/s not yet reached (current: 1.7-6.5M)
- Further optimization needed (TLS capacity tuning, refill optimization)
## Credits
Design: ChatGPT Pro Ultrathink, Claude Code
Implementation: Claude Code with Task Agent Ultrathink support
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-08 03:18:17 +09:00
0b1c825f25
Fix: CRITICAL multi-threaded freelist/remote queue race condition
...
Root Cause:
===========
Freelist and remote queue contained the SAME blocks, causing use-after-free:
1. Thread A (owner): pops block X from freelist → allocates to user
2. User writes data ("ab") to block X
3. Thread B (remote): free(block X) → adds to remote queue
4. Thread A (later): drains remote queue → *(void**)block_X = chain_head
→ OVERWRITES USER DATA! 💥
The freelist pop path did NOT drain the remote queue first, so blocks could
be simultaneously in both freelist and remote queue.
Fix:
====
Add remote queue drain BEFORE freelist pop in refill path:
core/hakmem_tiny_refill_p0.inc.h:
- Call _ss_remote_drain_to_freelist_unsafe() BEFORE trc_pop_from_freelist()
- Add #include "superslab/superslab_inline.h"
- This ensures freelist and remote queue are mutually exclusive
Test Results:
=============
BEFORE:
larson_hakmem (4 threads): ❌ SEGV in seconds (freelist corruption)
AFTER:
larson_hakmem (4 threads): ✅ 931,629 ops/s (1073 sec stable run)
bench_random_mixed: ✅ 1,020,163 ops/s (no crashes)
Evidence:
- Fail-Fast logs showed next pointer corruption: 0x...6261 (ASCII "ab")
- Single-threaded benchmarks worked (865K ops/s)
- Multi-threaded Larson crashed immediately
- Fix eliminates all crashes in both benchmarks
Files:
- core/hakmem_tiny_refill_p0.inc.h: Add remote drain before freelist pop
- CURRENT_TASK.md: Document fix details
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-08 01:35:45 +09:00
b7021061b8
Fix: CRITICAL double-allocation bug in trc_linear_carve()
...
Root Cause:
trc_linear_carve() used meta->used as cursor, but meta->used decrements
on free, causing already-allocated blocks to be re-carved.
Evidence:
- [LINEAR_CARVE] used=61 batch=1 → block 61 created
- (blocks freed, used decrements 62→59)
- [LINEAR_CARVE] used=59 batch=3 → blocks 59,60,61 RE-CREATED!
- Result: double-allocation → memory corruption → SEGV
Fix Implementation:
1. Added TinySlabMeta.carved (monotonic counter, never decrements)
2. Changed trc_linear_carve() to use carved instead of used
3. carved tracks carve progress, used tracks active count
Files Modified:
- core/superslab/superslab_types.h: Add carved field
- core/tiny_refill_opt.h: Use carved in trc_linear_carve()
- core/hakmem_tiny_superslab.c: Initialize carved=0
- core/tiny_alloc_fast.inc.h: Add next pointer validation
- core/hakmem_tiny_free.inc: Add drain/free validation
Test Results:
✅ bench_random_mixed: 950,037 ops/s (no crash)
✅ Fail-fast mode: 651,627 ops/s (with diagnostic logs)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-08 01:18:37 +09:00
c9053a43ac
Phase 6-2.3~6-2.5: Critical bug fixes + SuperSlab optimization (WIP)
...
## Phase 6-2.3: Fix 4T Larson crash (active counter bug) ✅
**Problem:** 4T Larson crashed with "free(): invalid pointer", OOM errors
**Root cause:** core/hakmem_tiny_refill_p0.inc.h:103
- P0 batch refill moved freelist blocks to TLS cache
- Active counter NOT incremented → double-decrement on free
- Counter underflows → SuperSlab appears full → OOM → crash
**Fix:** Added ss_active_add(tls->ss, from_freelist);
**Result:** 4T stable at 838K ops/s ✅
## Phase 6-2.4: Fix SEGV in random_mixed/mid_large_mt benchmarks ✅
**Problem:** bench_random_mixed_hakmem, bench_mid_large_mt_hakmem → immediate SEGV
**Root cause #1:** core/box/hak_free_api.inc.h:92-95
- "Guess loop" dereferenced unmapped memory when registry lookup failed
**Root cause #2:** core/box/hak_free_api.inc.h:115
- Header magic check dereferenced unmapped memory
**Fix:**
1. Removed dangerous guess loop (lines 92-95)
2. Added hak_is_memory_readable() check before dereferencing header
(core/hakmem_internal.h:277-294 - uses mincore() syscall)
**Result:**
- random_mixed (2KB): SEGV → 2.22M ops/s ✅
- random_mixed (4KB): SEGV → 2.58M ops/s ✅
- Larson 4T: no regression (838K ops/s) ✅
## Phase 6-2.5: Performance investigation + SuperSlab fix (WIP) ⚠️
**Problem:** Severe performance gaps (19-26x slower than system malloc)
**Investigation:** Task agent identified root cause
- hak_is_memory_readable() syscall overhead (100-300 cycles per free)
- ALL frees hit unmapped_header_fallback path
- SuperSlab lookup NEVER called
- Why? g_use_superslab = 0 (disabled by diet mode)
**Root cause:** core/hakmem_tiny_init.inc:104-105
- Diet mode (default ON) disables SuperSlab
- SuperSlab defaults to 1 (hakmem_config.c:334)
- BUT diet mode overrides it to 0 during init
**Fix:** Separate SuperSlab from diet mode
- SuperSlab: Performance-critical (fast alloc/free)
- Diet mode: Memory efficiency (magazine capacity limits only)
- Both are independent features, should not interfere
**Status:** ⚠️ INCOMPLETE - New SEGV discovered after fix
- SuperSlab lookup now works (confirmed via debug output)
- But benchmark crashes (Exit 139) after ~20 lookups
- Needs further investigation
**Files modified:**
- core/hakmem_tiny_init.inc:99-109 - Removed diet mode override
- PERFORMANCE_INVESTIGATION_REPORT.md - Task agent analysis (303x instruction gap)
**Next steps:**
- Investigate new SEGV (likely SuperSlab free path bug)
- OR: Revert Phase 6-2.5 changes if blocking progress
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-07 20:31:01 +09:00
382980d450
Phase 6-2.4: Fix SuperSlab free SEGV: remove guess loop and add memory readability check; add registry atomic consistency (base as _Atomic uintptr_t with acq/rel); add debug toggles (SUPER_REG_DEBUG/REQTRACE); update CURRENT_TASK with results and next steps; capture suite results.
2025-11-07 18:07:48 +09:00
b6d9c92f71
Fix: SuperSlab guess loop & header magic SEGV (random_mixed/mid_large_mt)
...
## Problem
bench_random_mixed_hakmem and bench_mid_large_mt_hakmem crashed with SEGV:
- random_mixed: Exit 139 (SEGV) ❌
- mid_large_mt: Exit 139 (SEGV) ❌
- Larson: 838K ops/s ✅ (worked fine)
Error: Unmapped memory dereference in free path
## Root Causes (2 bugs found by Ultrathink Task)
### Bug 1: Guess Loop (core/box/hak_free_api.inc.h:92-95)
```c
for (int lg=21; lg>=20; lg--) {
SuperSlab* guess=(SuperSlab*)((uintptr_t)ptr & ~mask);
if (guess && guess->magic==SUPERSLAB_MAGIC) { // ← SEGV
// Dereferences unmapped memory
}
}
```
### Bug 2: Header Magic Check (core/box/hak_free_api.inc.h:115)
```c
void* raw = (char*)ptr - HEADER_SIZE;
AllocHeader* hdr = (AllocHeader*)raw;
if (hdr->magic != HAKMEM_MAGIC) { // ← SEGV
// Dereferences unmapped memory if ptr has no header
}
```
**Why SEGV:**
- Registry lookup fails (allocation not from SuperSlab)
- Guess loop calculates 1MB/2MB aligned address
- No memory mapping validation
- Dereferences unmapped memory → SEGV
**Why Larson worked but random_mixed failed:**
- Larson: All from SuperSlab → registry hit → never reaches guess loop
- random_mixed: Diverse sizes (8-4096B) → registry miss → enters buggy paths
**Why LD_PRELOAD worked:**
- hak_core_init.inc.h:119-121 disables SuperSlab by default
- → SS-first path skipped → buggy code never executed
## Fix (2-part)
### Part 1: Remove Guess Loop
File: core/box/hak_free_api.inc.h:92-95
- Deleted unsafe guess loop (4 lines)
- If registry lookup fails, allocation is not from SuperSlab
### Part 2: Add Memory Safety Check
File: core/hakmem_internal.h:277-294
```c
static inline int hak_is_memory_readable(void* addr) {
unsigned char vec;
return mincore(addr, 1, &vec) == 0; // Check if mapped
}
```
File: core/box/hak_free_api.inc.h:115-131
```c
if (!hak_is_memory_readable(raw)) {
// Not accessible → route to appropriate handler
// Prevents SEGV on unmapped memory
goto done;
}
// Safe to dereference now
AllocHeader* hdr = (AllocHeader*)raw;
```
## Verification
| Test | Before | After | Result |
|------|--------|-------|--------|
| random_mixed (2KB) | ❌ SEGV | ✅ 2.22M ops/s | 🎉 Fixed |
| random_mixed (4KB) | ❌ SEGV | ✅ 2.58M ops/s | 🎉 Fixed |
| Larson 4T | ✅ 838K | ✅ 838K ops/s | ✅ No regression |
**Performance Impact:** 0% (mincore only on fallback path)
## Investigation
- Complete analysis: SEGV_ROOT_CAUSE_COMPLETE.md
- Fix report: SEGV_FIX_REPORT.md
- Previous investigation: SEGFAULT_INVESTIGATION_REPORT.md
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-07 17:34:24 +09:00
f6b06a0311
Fix: Active counter double-decrement in P0 batch refill (4T crash → stable)
...
## Problem
HAKMEM 4T crashed with "free(): invalid pointer" on startup:
- System/mimalloc: 3.3M ops/s ✅
- HAKMEM 1T: 838K ops/s (-75%) ⚠️
- HAKMEM 4T: Crash (Exit 134) ❌
Error: superslab_refill returned NULL (OOM), active=0, bitmap=0x00000000
## Root Cause (Ultrathink Task Agent Investigation)
Active counter double-decrement when re-allocating from freelist:
1. Free → counter-- ✅
2. Remote drain → add to freelist (no counter change) ✅
3. P0 batch refill → move to TLS cache (forgot counter++) ❌ BUG!
4. Next free → counter-- ❌ Double decrement!
Result: Counter underflow → SuperSlab appears "full" → OOM → crash
## Fix (1 line)
File: core/hakmem_tiny_refill_p0.inc.h:103
+ss_active_add(tls->ss, from_freelist);
Reason: Freelist re-allocation moves block from "free" to "allocated" state,
so active counter MUST increment.
## Verification
| Setting | Before | After | Result |
|----------------|---------|----------------|--------------|
| 4T default | ❌ Crash | ✅ 838,445 ops/s | 🎉 Stable |
| Stability (2x) | - | ✅ Same score | Reproducible |
## Remaining Issue
❌ HAKMEM_TINY_REFILL_COUNT_HOT=64 triggers crash (class=4 OOM)
- Suspected: TLS cache over-accumulation or memory leak
- Next: Investigate HAKMEM_TINY_FAST_CAP interaction
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-07 12:37:23 +09:00
8f3095fb85
CI-safe debug runners: add ASan LD_PRELOAD + UBSan mailbox targets; add asan_preload script; document sanitizer-safe workflows and results in CURRENT_TASK.md (debug complete).
2025-11-07 12:09:28 +09:00
1da8754d45
CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消
...
**問題:**
- Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走)
- System/mimalloc は 4T で 33.52M ops/s 正常動作
- SS OFF + Remote OFF でも 4T で SEGV
**根本原因: (Task agent ultrathink 調査結果)**
```
CRASH: mov (%r15),%r13
R15 = 0x6261 ← ASCII "ba" (ゴミ値、未初期化TLS)
```
Worker スレッドの TLS 変数が未初期化:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← 初期化なし
- pthread_create() で生成されたスレッドでゼロ初期化されない
- NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV
**修正内容:**
全 TLS 配列に明示的初期化子 `= {0}` を追加:
1. **core/hakmem_tiny.c:**
- `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
- `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
- `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
- `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
- `g_tls_bend[TINY_NUM_CLASSES] = {0}`
2. **core/tiny_fastcache.c:**
- `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
- `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`
3. **core/hakmem_tiny_magazine.c:**
- `g_tls_mags[TINY_NUM_CLASSES] = {0}`
4. **core/tiny_sticky.c:**
- `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
- `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
- `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`
**効果:**
```
Before: 1T: 2.09M ✅ | 4T: SEGV 💀
After: 1T: 2.41M ✅ | 4T: 4.19M ✅ (+15% 1T, SEGV解消)
```
**テスト:**
```bash
# 1 thread: 完走
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s ✅
# 4 threads: 完走(以前は SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s ✅
```
**調査協力:** Task agent (ultrathink mode) による完璧な根本原因特定
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-07 01:27:04 +09:00
582ebdfd4f
CURRENT_TASK: Registry 線形スキャン ボトルネック特定 (2025-11-05)
...
- perf 分析で superslab_refill が 28.51% CPU を消費
- Root cause: 262,144 エントリの線形スキャン (97.65% の hot instructions)
- 解決策: per-class registry (8×4096 = 32K entries)
- 期待効果: +200-300% (2.59M → 7.8-10.4M ops/s)
- Box Refactor は既に動いている (+463% ST, +131% MT)
次のアクション: Phase 1 実装 (per-class registry 変更)
詳細: PERF_ANALYSIS_2025_11_05.md
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-05 16:47:04 +09:00
52386401b3
Debug Counters Implementation - Clean History
...
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files
Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)
This is a clean repository without large log files.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-05 12:31:14 +09:00