Commit Graph

7 Commits

Author SHA1 Message Date
d55ee48459 Tiny C7(1KB) SEGV triage hardening: always-on lightweight free-time guards for headerless class7 in both hak_tiny_free_with_slab and superslab free path (alignment/range check, fail-fast via SIGUSR2). Leave C7 P0/direct-FC gated OFF by default. Add docs/TINY_C7_1KB_SEGV_TRIAGE.md for Claude with repro matrix, hypotheses, instrumentation and acceptance criteria. 2025-11-10 01:59:11 +09:00
83bb8624f6 Tiny: fix remote sentinel leak → SEGV; add defense-in-depth; PoolTLS: refill-boundary remote drain; build UX help; quickstart docs
Summary
- Fix SEGV root cause in Tiny random_mixed: TINY_REMOTE_SENTINEL leaked from Remote queue into freelist/TLS SLL.
- Clear/guard sentinel at the single boundary where Remote merges to freelist.
- Add minimal defense-in-depth in freelist_pop and TLS SLL pop.
- Silence verbose prints behind debug gates to reduce noise in release runs.
- Pool TLS: integrate Remote Queue drain at refill boundary to avoid unnecessary backend carve/OS calls when possible.
- DX: strengthen build.sh with help/list/verify and add docs/BUILDING_QUICKSTART.md.

Details
- core/superslab/superslab_inline.h: guard head/node against TINY_REMOTE_SENTINEL; sanitize node[0] when splicing local chain; only print diagnostics when debug guard is enabled.
- core/slab_handle.h: freelist_pop breaks on sentinel head (fail-fast under strict).
- core/tiny_alloc_fast_inline.h: TLS SLL pop breaks on sentinel head (rare branch).
- core/tiny_superslab_free.inc.h: sentinel scan log behind debug guard.
- core/pool_refill.c: try pool_remote_pop_chain() before backend carve in pool_refill_and_alloc().
- core/tiny_adaptive_sizing.c: default adaptive logs off; enable via HAKMEM_ADAPTIVE_LOG=1.
- build.sh: add help/list/verify; EXTRA_MAKEFLAGS passthrough; echo pinned flags.
- docs/BUILDING_QUICKSTART.md: add one‑pager for targets/flags/env/perf/strace.

Verification (high level)
- Tiny random_mixed 10k 256/1024: SEGV resolved; runs complete.
- Pool TLS 1T/4T perf: HAKMEM >= system (≈ +0.7% 1T, ≈ +2.9% 4T); syscall counts ~10–13.

Known issues (to address next)
- Tiny random_mixed perf is weak vs system:
  - 1T/500k/256: cycles/op ≈ 240 vs ~47 (≈5× slower), IPC ≈0.92, branch‑miss ≈11%.
  - 1T/500k/1024: cycles/op ≈ 149 vs ~53 (≈2.8× slower), IPC ≈0.82, branch‑miss ≈10.5%.
  - Hypothesis: frequent SuperSlab path for class7 (fast_cap=0), branchy refill/adopt, and hot-path divergence.
- Proposed next steps:
  - Introduce fast_cap>0 for class7 (bounded TLS SLL) and a simpler batch refill.
  - Add env‑gated Remote Side OFF for 1T A/B (reduce side-table and guards).
  - Revisit likely/unlikely and unify adopt boundary sequencing (drain→bind→acquire) for Tiny.
2025-11-09 16:49:34 +09:00
6b1382959c Phase 7-1 PoC: Region-ID Direct Lookup (+39%~+436% improvement!)
Implemented ultra-fast header-based free path that eliminates SuperSlab
lookup bottleneck (100+ cycles → 5-10 cycles).

## Key Changes

1. **Smart Headers** (core/tiny_region_id.h):
   - 1-byte header before each allocation stores class_idx
   - Memory layout: [Header: 1B] [User data: N-1B]
   - Overhead: <2% average (0% for Slab[0] using wasted padding)

2. **Ultra-Fast Allocation** (core/tiny_alloc_fast.inc.h):
   - Write header at base: *base = class_idx
   - Return user pointer: base + 1

3. **Ultra-Fast Free** (core/tiny_free_fast_v2.inc.h):
   - Read class_idx from header (ptr-1): 2-3 cycles
   - Push base (ptr-1) to TLS freelist: 3-5 cycles
   - Total: 5-10 cycles (vs 500+ cycles current!)

4. **Free Path Integration** (core/box/hak_free_api.inc.h):
   - Removed SuperSlab lookup from fast path
   - Direct header validation (no lookup needed!)

5. **Size Class Adjustment** (core/hakmem_tiny.h):
   - Max tiny size: 1023B (was 1024B)
   - 1024B requests → Mid allocator fallback

## Performance Results

| Size | Baseline | Phase 7 | Improvement |
|------|----------|---------|-------------|
| 128B | 1.22M | 6.54M | **+436%** 🚀 |
| 512B | 1.22M | 1.70M | **+39%** |
| 1023B | 1.22M | 1.92M | **+57%** |

## Build & Test

Enable Phase 7:
  make HEADER_CLASSIDX=1 bench_random_mixed_hakmem

Run benchmark:
  HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000 128 1234567

## Known Issues

- 1024B requests fallback to Mid allocator (by design)
- Target 40-60M ops/s not yet reached (current: 1.7-6.5M)
- Further optimization needed (TLS capacity tuning, refill optimization)

## Credits

Design: ChatGPT Pro Ultrathink, Claude Code
Implementation: Claude Code with Task Agent Ultrathink support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 03:18:17 +09:00
b7021061b8 Fix: CRITICAL double-allocation bug in trc_linear_carve()
Root Cause:
trc_linear_carve() used meta->used as cursor, but meta->used decrements
on free, causing already-allocated blocks to be re-carved.

Evidence:
- [LINEAR_CARVE] used=61 batch=1 → block 61 created
- (blocks freed, used decrements 62→59)
- [LINEAR_CARVE] used=59 batch=3 → blocks 59,60,61 RE-CREATED!
- Result: double-allocation → memory corruption → SEGV

Fix Implementation:
1. Added TinySlabMeta.carved (monotonic counter, never decrements)
2. Changed trc_linear_carve() to use carved instead of used
3. carved tracks carve progress, used tracks active count

Files Modified:
- core/superslab/superslab_types.h: Add carved field
- core/tiny_refill_opt.h: Use carved in trc_linear_carve()
- core/hakmem_tiny_superslab.c: Initialize carved=0
- core/tiny_alloc_fast.inc.h: Add next pointer validation
- core/hakmem_tiny_free.inc: Add drain/free validation

Test Results:
 bench_random_mixed: 950,037 ops/s (no crash)
 Fail-fast mode: 651,627 ops/s (with diagnostic logs)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 01:18:37 +09:00
c9053a43ac Phase 6-2.3~6-2.5: Critical bug fixes + SuperSlab optimization (WIP)
## Phase 6-2.3: Fix 4T Larson crash (active counter bug) 
**Problem:** 4T Larson crashed with "free(): invalid pointer", OOM errors
**Root cause:** core/hakmem_tiny_refill_p0.inc.h:103
  - P0 batch refill moved freelist blocks to TLS cache
  - Active counter NOT incremented → double-decrement on free
  - Counter underflows → SuperSlab appears full → OOM → crash
**Fix:** Added ss_active_add(tls->ss, from_freelist);
**Result:** 4T stable at 838K ops/s 

## Phase 6-2.4: Fix SEGV in random_mixed/mid_large_mt benchmarks 
**Problem:** bench_random_mixed_hakmem, bench_mid_large_mt_hakmem → immediate SEGV
**Root cause #1:** core/box/hak_free_api.inc.h:92-95
  - "Guess loop" dereferenced unmapped memory when registry lookup failed
**Root cause #2:** core/box/hak_free_api.inc.h:115
  - Header magic check dereferenced unmapped memory
**Fix:**
  1. Removed dangerous guess loop (lines 92-95)
  2. Added hak_is_memory_readable() check before dereferencing header
     (core/hakmem_internal.h:277-294 - uses mincore() syscall)
**Result:**
  - random_mixed (2KB): SEGV → 2.22M ops/s 
  - random_mixed (4KB): SEGV → 2.58M ops/s 
  - Larson 4T: no regression (838K ops/s) 

## Phase 6-2.5: Performance investigation + SuperSlab fix (WIP) ⚠️
**Problem:** Severe performance gaps (19-26x slower than system malloc)
**Investigation:** Task agent identified root cause
  - hak_is_memory_readable() syscall overhead (100-300 cycles per free)
  - ALL frees hit unmapped_header_fallback path
  - SuperSlab lookup NEVER called
  - Why? g_use_superslab = 0 (disabled by diet mode)

**Root cause:** core/hakmem_tiny_init.inc:104-105
  - Diet mode (default ON) disables SuperSlab
  - SuperSlab defaults to 1 (hakmem_config.c:334)
  - BUT diet mode overrides it to 0 during init

**Fix:** Separate SuperSlab from diet mode
  - SuperSlab: Performance-critical (fast alloc/free)
  - Diet mode: Memory efficiency (magazine capacity limits only)
  - Both are independent features, should not interfere

**Status:** ⚠️ INCOMPLETE - New SEGV discovered after fix
  - SuperSlab lookup now works (confirmed via debug output)
  - But benchmark crashes (Exit 139) after ~20 lookups
  - Needs further investigation

**Files modified:**
- core/hakmem_tiny_init.inc:99-109 - Removed diet mode override
- PERFORMANCE_INVESTIGATION_REPORT.md - Task agent analysis (303x instruction gap)

**Next steps:**
- Investigate new SEGV (likely SuperSlab free path bug)
- OR: Revert Phase 6-2.5 changes if blocking progress

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 20:31:01 +09:00
1da8754d45 CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消
**問題:**
- Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走)
- System/mimalloc は 4T で 33.52M ops/s 正常動作
- SS OFF + Remote OFF でも 4T で SEGV

**根本原因: (Task agent ultrathink 調査結果)**
```
CRASH: mov (%r15),%r13
R15 = 0x6261  ← ASCII "ba" (ゴミ値、未初期化TLS)
```

Worker スレッドの TLS 変数が未初期化:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];`  ← 初期化なし
- pthread_create() で生成されたスレッドでゼロ初期化されない
- NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV

**修正内容:**
全 TLS 配列に明示的初期化子 `= {0}` を追加:

1. **core/hakmem_tiny.c:**
   - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
   - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
   - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bend[TINY_NUM_CLASSES] = {0}`

2. **core/tiny_fastcache.c:**
   - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`

3. **core/hakmem_tiny_magazine.c:**
   - `g_tls_mags[TINY_NUM_CLASSES] = {0}`

4. **core/tiny_sticky.c:**
   - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`

**効果:**
```
Before: 1T: 2.09M   |  4T: SEGV 💀
After:  1T: 2.41M   |  4T: 4.19M   (+15% 1T, SEGV解消)
```

**テスト:**
```bash
# 1 thread: 完走
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s 

# 4 threads: 完走(以前は SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s 
```

**調査協力:** Task agent (ultrathink mode) による完璧な根本原因特定

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 01:27:04 +09:00
602edab87f Phase 1: Box Theory refactoring + include reduction
Phase 1-1: Split hakmem_tiny_free.inc (1,711 → 452 lines, -73%)
- Created tiny_free_magazine.inc.h (413 lines) - Magazine layer
- Created tiny_superslab_alloc.inc.h (394 lines) - SuperSlab alloc
- Created tiny_superslab_free.inc.h (305 lines) - SuperSlab free

Phase 1-2++: Refactor hakmem_pool.c (1,481 → 907 lines, -38.8%)
- Created pool_tls_types.inc.h (32 lines) - TLS structures
- Created pool_mf2_types.inc.h (266 lines) - MF2 data structures
- Created pool_mf2_helpers.inc.h (158 lines) - Helper functions
- Created pool_mf2_adoption.inc.h (129 lines) - Adoption logic

Phase 1-3: Reduce hakmem_tiny.c includes (60 → 46, -23.3%)
- Created tiny_system.h - System headers umbrella (stdio, stdlib, etc.)
- Created tiny_api.h - API headers umbrella (stats, query, rss, registry)

Performance: 4.19M ops/s maintained (±0% regression)
Verified: Larson benchmark 2×8×128×1024 = 4,192,128 ops/s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 21:54:12 +09:00