Commit Graph

8 Commits

Author SHA1 Message Date
b8ed2b05b4 Phase 6-2.6: Fix slab_data_start() consistency in refill/validation paths
Problem:
- Phase 6-2.5 changed SUPERSLAB_SLAB0_DATA_OFFSET from 1024 → 2048
- Fixed sizeof(SuperSlab) mismatch (1088 bytes)
- But 3 locations still used old slab_data_start() + manual offset

This caused:
- Address mismatch between allocation carving and validation
- Freelist corruption false positives
- 53-byte misalignment errors resolved, but new errors appeared

Changes:
1. core/tiny_tls_guard.h:34
   - Validation: slab_data_start() → tiny_slab_base_for()
   - Ensures validation uses same base address as allocation

2. core/hakmem_tiny_refill.inc.h:222
   - Allocation carving: Remove manual +2048 hack
   - Use canonical tiny_slab_base_for()

3. core/hakmem_tiny_refill.inc.h:275
   - Bump allocation: Remove duplicate slab_start calculation
   - Use existing base calculation with tiny_slab_base_for()

Result:
- Consistent use of tiny_slab_base_for() across all paths
- All code uses SUPERSLAB_SLAB0_DATA_OFFSET constant
- Remaining freelist corruption needs deeper investigation (not simple offset bug)

Related commits:
- d2f0d8458: Phase 6-2.5 (constants.h + 2048 offset)
- c9053a43a: Phase 6-2.3~6-2.4 (active counter + SEGV fixes)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 22:34:24 +09:00
d2f0d84584 Phase 6-2.5: Fix SuperSlab alignment bug + refactor constants
## Problem: 53-byte misalignment mystery
**Symptom:** All SuperSlab allocations misaligned by exactly 53 bytes
```
[TRC_FAILFAST_PTR] stage=alloc_ret_align cls=7 ptr=0x..f835
offset=63541 (expected: 63488)
Diff: 63541 - 63488 = 53 bytes
```

## Root Cause (Ultrathink investigation)
**sizeof(SuperSlab) != hardcoded offset:**
- `sizeof(SuperSlab)` = 1088 bytes (actual struct size)
- `tiny_slab_base_for()` used: 1024 (hardcoded)
- `superslab_init_slab()` assumed: 2048 (in capacity calc)

**Impact:**
1. Memory corruption: 64-byte overlap with SuperSlab metadata
2. Misalignment: 1088 % 1024 = 64 (violates class 7 alignment)
3. Inconsistency: Init assumed 2048, but runtime used 1024

## Solution
### 1. Centralize constants (NEW)
**File:** `core/hakmem_tiny_superslab_constants.h`
- `SLAB_SIZE` = 64KB
- `SUPERSLAB_HEADER_SIZE` = 1088
- `SUPERSLAB_SLAB0_DATA_OFFSET` = 2048 (aligned to 1024)
- `SUPERSLAB_SLAB0_USABLE_SIZE` = 63488 (64KB - 2048)
- Compile-time validation checks

**Why 2048?**
- Round up 1088 to next 1024-byte boundary
- Ensures proper alignment for class 7 (1024-byte blocks)
- Previous: (1088 + 1023) & ~1023 = 2048

### 2. Update all code to use constants
- `hakmem_tiny_superslab.h`: `tiny_slab_base_for()` → use `SUPERSLAB_SLAB0_DATA_OFFSET`
- `hakmem_tiny_superslab.c`: `superslab_init_slab()` → use `SUPERSLAB_SLAB0_USABLE_SIZE`
- Removed hardcoded 1024, 2048 magic numbers

### 3. Add class consistency check
**File:** `core/tiny_superslab_alloc.inc.h:433-449`
- Verify `tls->ss->size_class == class_idx` before allocation
- Unbind TLS if mismatch detected
- Prevents using wrong block_size for calculations

## Status
⚠️ **INCOMPLETE - New issue discovered**

After fix, benchmark hits different error:
```
[TRC_FAILFAST] stage=freelist_next cls=7 node=0x...d474
```

Freelist corruption detected. Likely caused by:
- 2048 offset change affects free() path
- Block addresses no longer match freelist expectations
- Needs further investigation

## Files Modified
- `core/hakmem_tiny_superslab_constants.h` - NEW: Centralized constants
- `core/hakmem_tiny_superslab.h` - Use SUPERSLAB_SLAB0_DATA_OFFSET
- `core/hakmem_tiny_superslab.c` - Use SUPERSLAB_SLAB0_USABLE_SIZE
- `core/tiny_superslab_alloc.inc.h` - Add class consistency check
- `core/hakmem_tiny_init.inc` - Remove diet mode override (Phase 6-2.5)
- `core/hakmem_super_registry.h` - Remove debug output (cleaned)
- `PERFORMANCE_INVESTIGATION_REPORT.md` - Task agent analysis

## Next Steps
1. Investigate freelist corruption with 2048 offset
2. Verify free() path uses tiny_slab_base_for() correctly
3. Consider reverting to 1024 and fixing capacity calculation instead

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 21:45:20 +09:00
c9053a43ac Phase 6-2.3~6-2.5: Critical bug fixes + SuperSlab optimization (WIP)
## Phase 6-2.3: Fix 4T Larson crash (active counter bug) 
**Problem:** 4T Larson crashed with "free(): invalid pointer", OOM errors
**Root cause:** core/hakmem_tiny_refill_p0.inc.h:103
  - P0 batch refill moved freelist blocks to TLS cache
  - Active counter NOT incremented → double-decrement on free
  - Counter underflows → SuperSlab appears full → OOM → crash
**Fix:** Added ss_active_add(tls->ss, from_freelist);
**Result:** 4T stable at 838K ops/s 

## Phase 6-2.4: Fix SEGV in random_mixed/mid_large_mt benchmarks 
**Problem:** bench_random_mixed_hakmem, bench_mid_large_mt_hakmem → immediate SEGV
**Root cause #1:** core/box/hak_free_api.inc.h:92-95
  - "Guess loop" dereferenced unmapped memory when registry lookup failed
**Root cause #2:** core/box/hak_free_api.inc.h:115
  - Header magic check dereferenced unmapped memory
**Fix:**
  1. Removed dangerous guess loop (lines 92-95)
  2. Added hak_is_memory_readable() check before dereferencing header
     (core/hakmem_internal.h:277-294 - uses mincore() syscall)
**Result:**
  - random_mixed (2KB): SEGV → 2.22M ops/s 
  - random_mixed (4KB): SEGV → 2.58M ops/s 
  - Larson 4T: no regression (838K ops/s) 

## Phase 6-2.5: Performance investigation + SuperSlab fix (WIP) ⚠️
**Problem:** Severe performance gaps (19-26x slower than system malloc)
**Investigation:** Task agent identified root cause
  - hak_is_memory_readable() syscall overhead (100-300 cycles per free)
  - ALL frees hit unmapped_header_fallback path
  - SuperSlab lookup NEVER called
  - Why? g_use_superslab = 0 (disabled by diet mode)

**Root cause:** core/hakmem_tiny_init.inc:104-105
  - Diet mode (default ON) disables SuperSlab
  - SuperSlab defaults to 1 (hakmem_config.c:334)
  - BUT diet mode overrides it to 0 during init

**Fix:** Separate SuperSlab from diet mode
  - SuperSlab: Performance-critical (fast alloc/free)
  - Diet mode: Memory efficiency (magazine capacity limits only)
  - Both are independent features, should not interfere

**Status:** ⚠️ INCOMPLETE - New SEGV discovered after fix
  - SuperSlab lookup now works (confirmed via debug output)
  - But benchmark crashes (Exit 139) after ~20 lookups
  - Needs further investigation

**Files modified:**
- core/hakmem_tiny_init.inc:99-109 - Removed diet mode override
- PERFORMANCE_INVESTIGATION_REPORT.md - Task agent analysis (303x instruction gap)

**Next steps:**
- Investigate new SEGV (likely SuperSlab free path bug)
- OR: Revert Phase 6-2.5 changes if blocking progress

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 20:31:01 +09:00
1da8754d45 CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消
**問題:**
- Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走)
- System/mimalloc は 4T で 33.52M ops/s 正常動作
- SS OFF + Remote OFF でも 4T で SEGV

**根本原因: (Task agent ultrathink 調査結果)**
```
CRASH: mov (%r15),%r13
R15 = 0x6261  ← ASCII "ba" (ゴミ値、未初期化TLS)
```

Worker スレッドの TLS 変数が未初期化:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];`  ← 初期化なし
- pthread_create() で生成されたスレッドでゼロ初期化されない
- NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV

**修正内容:**
全 TLS 配列に明示的初期化子 `= {0}` を追加:

1. **core/hakmem_tiny.c:**
   - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
   - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
   - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bend[TINY_NUM_CLASSES] = {0}`

2. **core/tiny_fastcache.c:**
   - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`

3. **core/hakmem_tiny_magazine.c:**
   - `g_tls_mags[TINY_NUM_CLASSES] = {0}`

4. **core/tiny_sticky.c:**
   - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`

**効果:**
```
Before: 1T: 2.09M   |  4T: SEGV 💀
After:  1T: 2.41M   |  4T: 4.19M   (+15% 1T, SEGV解消)
```

**テスト:**
```bash
# 1 thread: 完走
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s 

# 4 threads: 完走(以前は SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s 
```

**調査協力:** Task agent (ultrathink mode) による完璧な根本原因特定

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 01:27:04 +09:00
602edab87f Phase 1: Box Theory refactoring + include reduction
Phase 1-1: Split hakmem_tiny_free.inc (1,711 → 452 lines, -73%)
- Created tiny_free_magazine.inc.h (413 lines) - Magazine layer
- Created tiny_superslab_alloc.inc.h (394 lines) - SuperSlab alloc
- Created tiny_superslab_free.inc.h (305 lines) - SuperSlab free

Phase 1-2++: Refactor hakmem_pool.c (1,481 → 907 lines, -38.8%)
- Created pool_tls_types.inc.h (32 lines) - TLS structures
- Created pool_mf2_types.inc.h (266 lines) - MF2 data structures
- Created pool_mf2_helpers.inc.h (158 lines) - Helper functions
- Created pool_mf2_adoption.inc.h (129 lines) - Adoption logic

Phase 1-3: Reduce hakmem_tiny.c includes (60 → 46, -23.3%)
- Created tiny_system.h - System headers umbrella (stdio, stdlib, etc.)
- Created tiny_api.h - API headers umbrella (stats, query, rss, registry)

Performance: 4.19M ops/s maintained (±0% regression)
Verified: Larson benchmark 2×8×128×1024 = 4,192,128 ops/s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 21:54:12 +09:00
5ea6c1237b Tiny: add per-class refill count tuning infrastructure (ChatGPT)
External AI (ChatGPT Pro) implemented hierarchical refill count tuning:
- Move getenv() from hot path to init (performance hygiene)
- Add per-class granularity: global → hot/mid → per-class precedence
- Environment variables:
  * HAKMEM_TINY_REFILL_COUNT (global default)
  * HAKMEM_TINY_REFILL_COUNT_HOT (classes 0-3)
  * HAKMEM_TINY_REFILL_COUNT_MID (classes 4-7)
  * HAKMEM_TINY_REFILL_COUNT_C{0..7} (per-class override)

Performance impact: Neutral (no tuning applied yet, default=16)
- Larson 4-thread: 4.19M ops/s (unchanged)
- No measurable overhead from init-time parsing

Code quality improvement:
- Better separation: hot path reads plain ints (no syscalls)
- Future-proof: enables A/B testing per size class
- Documentation: ENV_VARS.md updated

Note: Per Ultrathink's advice, further tuning deferred until bottleneck
visualization (superslab_refill branch analysis) is complete.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: ChatGPT <external-ai@openai.com>
2025-11-05 17:45:11 +09:00
6550cd3970 Remove overhead: diagnostic + counters for fast path
### Changes:
1. **Removed diagnostic from wrapper** (hakmem_tiny.c:1542)
   - Was: getenv() + fprintf() on every wrapper call
   - Now: Direct return tiny_alloc_fast(size)
   - Relies on LTO (-flto) for inlining

2. **Removed counter overhead from malloc()** (hakmem.c:1242)
   - Was: 4 TLS counter increments per malloc
     - g_malloc_total_calls++
     - g_malloc_tiny_size_match++
     - g_malloc_fast_path_tried++
     - g_malloc_fast_path_null++ (on miss)
   - Now: Zero counter overhead

### Performance Results:
```
Before (with overhead):  1.51M ops/s
After (zero overhead):   1.59M ops/s  (+5% 🎉)
Baseline (old impl):     1.68M ops/s  (-5% gap remains)
System malloc:           8.08M ops/s  (reference)
```

### Analysis:
**What was heavy:**
- Counter increments: ~4 TLS writes per malloc (cache pollution)
- Diagnostic: getenv() + fprintf() check (even if disabled)
- These added ~80K ops/s overhead

**Remaining gap (-5% vs baseline):**
Box Theory (1.59M) vs Old implementation (1.68M)
- Likely due to: ownership check in free path
- Or: refill backend (sll_refill_small_from_ss vs hak_tiny_alloc x16)

### Bottleneck Update:
From profiling data (2,418 cycles per fast path):
```
Fast path time: 49.5M cycles (49.1% of total)
Refill time:    51.3M cycles (50.9% of total)

Counter overhead removed: ~5% improvement
LTO should inline wrapper: Further gains expected
```

### Status:
 IMPROVEMENT - Removed overhead, 5% faster
 STILL SHORT - 5% slower than baseline (1.68M target)

### Next Steps:
A. Investigate ownership check overhead in free path
B. Compare refill backend efficiency
C. Consider reverting to old implementation if gap persists

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 06:25:29 +00:00
52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00