Commit Graph

14 Commits

Author SHA1 Message Date
8feeb63c2b release: silence runtime logs and stabilize benches
- Fix HAKMEM_LOG gating to use  (numeric) so release builds compile out logs.
- Switch remaining prints to HAKMEM_LOG or guard with :
  - core/box/hak_core_init.inc.h (EVO sample warning, shutdown banner)
  - core/hakmem_config.c (config/feature prints)
  - core/hakmem.c (BigCache eviction prints)
  - core/hakmem_tiny_superslab.c (OOM, head init/expand, C7 init diagnostics)
  - core/hakmem_elo.c (init/evolution)
  - core/hakmem_batch.c (init/flush/stats)
  - core/hakmem_ace.c (33KB route diagnostics)
  - core/hakmem_ace_controller.c (ACE logs macro → no-op in release)
  - core/hakmem_site_rules.c (init banner)
  - core/box/hak_free_api.inc.h (unknown method error → release-gated)
- Rebuilt benches and verified quiet output for release:
  - bench_fixed_size_hakmem/system
  - bench_random_mixed_hakmem/system
  - bench_mid_large_mt_hakmem/system
  - bench_comprehensive_hakmem/system

Note: Kept debug logs available in debug builds and when explicitly toggled via env.
2025-11-11 01:47:06 +09:00
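As a rough illustration of the numeric gating described in commit 8feeb63c2b, the sketch below compiles a log macro out when a build flag is 1. The names `EX_BUILD_RELEASE`, `EX_LOG`, `ex_log_count`, and `ex_logging_compiled_in` are hypothetical and do not come from the HAKMEM sources. The point of a numeric test is that `#if FLAG` distinguishes 0 from 1, whereas `#ifdef FLAG` is true even when the flag is defined as 0.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical build flag: always defined, 0 for debug and 1 for release. */
#ifndef EX_BUILD_RELEASE
#define EX_BUILD_RELEASE 0
#endif

/* Reject values that a plain #ifdef test would silently accept. */
static_assert(EX_BUILD_RELEASE == 0 || EX_BUILD_RELEASE == 1,
              "EX_BUILD_RELEASE must be 0 or 1");

static int ex_log_count = 0;   /* number of messages actually emitted */

#if EX_BUILD_RELEASE
/* Release: the call and its arguments disappear at compile time. */
#define EX_LOG(...) ((void)0)
#else
/* Debug: count the message and print it to stderr. */
#define EX_LOG(...) \
    do { ex_log_count++; fprintf(stderr, __VA_ARGS__); } while (0)
#endif

/* Reports whether logging was compiled into this build. */
static int ex_logging_compiled_in(void) { return !EX_BUILD_RELEASE; }
```

Building with `-DEX_BUILD_RELEASE=1` would turn every `EX_LOG(...)` call into a no-op without touching the call sites.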
518bf29754 Fix TLS-SLL splice alignment issue causing SIGSEGV
- core/box/tls_sll_box.h: Normalize splice head, remove heuristics, fix misalignment guard
- core/tiny_refill_opt.h: Add LINEAR_LINK debug logging after carve
- core/ptr_trace.h: Fix function declaration conflicts for debug builds
- core/hakmem.c: Add stdatomic.h include and ptr_trace_dump_now declaration

Fixes misaligned memory access in splice_trav that was causing SIGSEGV.
TLS-SLL GUARD identified: base=0x7244b7e10009 (should be 0x7244b7e10401)
Preserves existing ptr=0xa0 guard for small pointer free detection.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-11-10 23:41:53 +09:00
382980d450 Phase 6-2.4: Fix SuperSlab free SEGV
- Remove guess loop and add memory readability check
- Add registry atomic consistency (base as _Atomic uintptr_t with acq/rel)
- Add debug toggles (SUPER_REG_DEBUG/REQTRACE)
- Update CURRENT_TASK with results and next steps
- Capture suite results
2025-11-07 18:07:48 +09:00
77ed72fcf6 Fix: LIBC/HAKMEM mixed allocation crashes (0% → 80% success)
**Problem**: 4T Larson crashed 100% due to "free(): invalid pointer"

**Root Causes** (6 bugs found via Task Agent ultrathink):

1. **Invalid magic fallback** (`hak_free_api.inc.h:87`)
   - When `hdr->magic != HAKMEM_MAGIC`, ptr came from LIBC (no header)
   - Was calling `free(raw)` where `raw = ptr - HEADER_SIZE` (garbage!)
   - Fixed: Use `__libc_free(ptr)` instead

2. **BigCache eviction** (`hakmem.c:230`)
   - Same issue: invalid magic means LIBC allocation
   - Fixed: Use `__libc_free(ptr)` directly

3. **Malloc wrapper recursion** (`hakmem_internal.h:209`)
   - `hak_alloc_malloc_impl()` called `malloc()` → wrapper recursion
   - Fixed: Use `__libc_malloc()` directly

4. **ALLOC_METHOD_MALLOC free** (`hak_free_api.inc.h:106`)
   - Was calling `free(raw)` → wrapper recursion
   - Fixed: Use `__libc_free(raw)` directly

5. **fopen/fclose crash** (`hakmem_tiny_superslab.c:131`)
   - `log_superslab_oom_once()` used `fopen()` → FILE buffer via wrapper
   - `fclose()` calls `__libc_free()` on HAKMEM-allocated buffer → crash
   - Fixed: Wrap with `g_hakmem_lock_depth++/--` to force LIBC path

6. **g_hakmem_lock_depth visibility** (`hakmem.c:163`)
   - Was `static`, needed by hakmem_tiny_superslab.c
   - Fixed: Remove `static` keyword

**Result**: 4T Larson success rate improved 0% → 80% (8/10 runs) 

**Remaining**: 20% crash rate still needs investigation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 02:48:20 +09:00
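The first root cause in commit 77ed72fcf6 (invalid magic fallback) can be sketched as a routing decision: if the header in front of the pointer does not carry the allocator's magic, the block belongs to libc and the original pointer must be freed, not `ptr - HEADER_SIZE`. This is a minimal sketch under stated assumptions: `ex_header_t`, `EX_MAGIC`, `ex_route_free`, and the header layout are hypothetical, and the function only decides the route instead of calling `__libc_free()` so that it can be run on an ordinary buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EX_MAGIC 0x48414B4Du   /* hypothetical magic value */

/* Hypothetical header placed directly in front of each allocator block. */
typedef struct {
    uint32_t magic;
    uint32_t method;
    size_t   size;
} ex_header_t;

#define EX_HEADER_SIZE sizeof(ex_header_t)

typedef enum { EX_FREE_VIA_ALLOCATOR, EX_FREE_VIA_LIBC } ex_route_t;

/* Decide who owns `ptr` and which address the owner's free must receive. */
static ex_route_t ex_route_free(void *ptr, void **to_free) {
    assert(ptr != NULL && to_free != NULL);
    ex_header_t hdr;
    /* memcpy avoids alignment assumptions about the bytes before ptr. */
    memcpy(&hdr, (unsigned char *)ptr - EX_HEADER_SIZE, sizeof hdr);
    if (hdr.magic != EX_MAGIC) {
        /* No valid header: libc allocated this block, so pass ptr itself. */
        *to_free = ptr;
        return EX_FREE_VIA_LIBC;
    }
    /* Valid header: the allocation really starts at the header. */
    *to_free = (unsigned char *)ptr - EX_HEADER_SIZE;
    return EX_FREE_VIA_ALLOCATOR;
}
```

Reading the bytes in front of a foreign pointer is itself risky when that memory may be unmapped, which is presumably why the later Phase 6-2.4 commit adds a memory readability check.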
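1da8754d45 CRITICAL FIX: Fully resolve 4T SEGV caused by uninitialized TLS
**Problem:**
- Larson 4T crashed with SEGV 100% of the time (1T completed at 2.09M ops/s)
- System/mimalloc ran normally at 4T with 33.52M ops/s
- SEGV at 4T persisted even with SS OFF + Remote OFF

**Root cause (findings from the Task agent ultrathink investigation):**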
```
CRASH: mov (%r15),%r13
R15 = 0x6261  ← ASCII "ba" (garbage value, uninitialized TLS)
```

TLS variables in worker threads were uninitialized:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];`  ← no initializer
- Not zero-initialized in threads created by pthread_create()
- The NULL check passes (0x6261 != NULL) → dereference → SEGV

**Fix:**
Added an explicit `= {0}` initializer to every TLS array:

1. **core/hakmem_tiny.c:**
   - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
   - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
   - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bend[TINY_NUM_CLASSES] = {0}`

2. **core/tiny_fastcache.c:**
   - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`

3. **core/hakmem_tiny_magazine.c:**
   - `g_tls_mags[TINY_NUM_CLASSES] = {0}`

4. **core/tiny_sticky.c:**
   - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`

**Effect:**
```
Before: 1T: 2.09M   |  4T: SEGV 💀
After:  1T: 2.41M   |  4T: 4.19M   (+15% 1T, SEGV resolved)
```

**Tests:**
```bash
# 1 thread: completes
./larson_hakmem 2 8 128 1024 1 12345 1
# → Throughput = 2,407,597 ops/s

# 4 threads: completes (previously SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
# → Throughput = 4,192,155 ops/s
```

**Investigation credit:** Flawless root-cause identification by the Task agent (ultrathink mode)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 01:27:04 +09:00
5ec9d1746f Option A (Full): Inline TLS cache access in malloc()
Implementation:
1. Added g_initialized check to fast path (skip bootstrap overhead)
2. Inlined hak_tiny_size_to_class() - LUT lookup (~1 load)
3. Inlined TLS cache pop - direct g_tls_sll_head access (3-4 instructions)
4. Eliminated function call overhead on fast path hit

Result: +11.5% improvement (1.31M → 1.46M ops/s avg, threads=4)
- Before: Function call + internal processing (~15-20 instructions)
- After: LUT + TLS load + pop + return (~5-6 instructions)

Still below target (1.81M ops/s). Next: RDTSC profiling to identify remaining bottleneck.
2025-11-05 07:07:47 +00:00
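The inlined fast path from commit 5ec9d1746f (size-to-class mapping, TLS load, pop, return) might look roughly like the following. This is a sketch, not the HAKMEM code: `ex_tls_head`, `ex_size_to_class`, `ex_fast_push`, `ex_fast_pop`, and the 16-class, 8-byte-step layout are assumptions chosen for illustration, and the class mapping here uses a shift instead of the commit's lookup table. `__thread` and `__builtin_expect` are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>

#define EX_NUM_CLASSES 16    /* hypothetical: 8-byte steps up to 128 B */
#define EX_TINY_MAX    128

/* Per-thread singly linked free list, one head per size class. */
static __thread void *ex_tls_head[EX_NUM_CLASSES] = {0};

/* Map sizes 1..128 to classes 0..15; size 0 shares class 0. */
static inline int ex_size_to_class(size_t size) {
    return size == 0 ? 0 : (int)((size - 1) >> 3);
}

/* Return a block to the current thread's list; the next pointer is
 * stored in the first word of the freed block. */
static inline void ex_fast_push(void *block, size_t size) {
    assert(block != NULL && size <= EX_TINY_MAX);
    int cls = ex_size_to_class(size);
    *(void **)block = ex_tls_head[cls];
    ex_tls_head[cls] = block;
}

/* Fast path: class lookup, TLS load, pop. Returns NULL on a miss so the
 * caller can fall through to the slow path. */
static inline void *ex_fast_pop(size_t size) {
    if (__builtin_expect(size > EX_TINY_MAX, 0)) return NULL;
    int cls = ex_size_to_class(size);
    void *head = ex_tls_head[cls];
    if (__builtin_expect(head == NULL, 0)) return NULL;
    ex_tls_head[cls] = *(void **)head;   /* advance to the next free block */
    return head;
}
```

A caller would try `ex_fast_pop(size)` first and only run guards and refill logic when it returns NULL.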
6550cd3970 Remove overhead: diagnostic + counters for fast path
### Changes:
1. **Removed diagnostic from wrapper** (hakmem_tiny.c:1542)
   - Was: getenv() + fprintf() on every wrapper call
   - Now: Direct return tiny_alloc_fast(size)
   - Relies on LTO (-flto) for inlining

2. **Removed counter overhead from malloc()** (hakmem.c:1242)
   - Was: 4 TLS counter increments per malloc
     - g_malloc_total_calls++
     - g_malloc_tiny_size_match++
     - g_malloc_fast_path_tried++
     - g_malloc_fast_path_null++ (on miss)
   - Now: Zero counter overhead

### Performance Results:
```
Before (with overhead):  1.51M ops/s
After (zero overhead):   1.59M ops/s  (+5% 🎉)
Baseline (old impl):     1.68M ops/s  (-5% gap remains)
System malloc:           8.08M ops/s  (reference)
```

### Analysis:
**What was heavy:**
- Counter increments: ~4 TLS writes per malloc (cache pollution)
- Diagnostic: getenv() + fprintf() check (even if disabled)
- These added ~80K ops/s overhead

**Remaining gap (-5% vs baseline):**
Box Theory (1.59M) vs Old implementation (1.68M)
- Likely due to: ownership check in free path
- Or: refill backend (sll_refill_small_from_ss vs hak_tiny_alloc x16)

### Bottleneck Update:
From profiling data (2,418 cycles per fast path):
```
Fast path time: 49.5M cycles (49.1% of total)
Refill time:    51.3M cycles (50.9% of total)

Counter overhead removed: ~5% improvement
LTO should inline wrapper: Further gains expected
```

### Status:
IMPROVEMENT - Removed overhead, 5% faster
STILL SHORT - 5% slower than baseline (1.68M target)

### Next Steps:
A. Investigate ownership check overhead in free path
B. Compare refill backend efficiency
C. Consider reverting to old implementation if gap persists

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 06:25:29 +00:00
08593fea14 Fix: Box Theory routing - direct call before guards
### Problem Identified:
Previous commit routed malloc() → guards → hak_alloc_at() → Box Theory
This added massive overhead (guard checks, function calls) defeating the
"3-4 instruction" fast path promise.

### Root Cause:
"It makes no sense to get slower with fewer instructions" - User's insight was correct!
Box Theory claims 3-4 instructions, but routing added dozens of instructions
before reaching TLS freelist pop.

### Fix:
Move Box Theory call to malloc() entry point (line ~1253), BEFORE all guards:
```c
#ifdef HAKMEM_TINY_PHASE6_BOX_REFACTOR
    if (size <= TINY_FAST_THRESHOLD) {
        void* ptr = hak_tiny_alloc_fast_wrapper(size);
        if (ptr) return ptr;  // Fast path: No guards, no overhead
    }
#endif
// SLOW PATH: All guards here...
```

### Performance Results:
```
Baseline (old tiny_fast_alloc):  1.68M ops/s
Box Theory (no env vars):        1.22M ops/s  (-27%)
Box Theory (with env vars):      1.39M ops/s  (-17%)  ← Improved!
System malloc:                   8.08M ops/s

CLAUDE.md expectation:           2.75M (+64%) ~ 4.19M (+150%)  ← Not reached
```

### Env Vars Used:
```
HAKMEM_TINY_TLS_SLL=1 HAKMEM_TINY_TLS_LIST=0 HAKMEM_TINY_HOTMAG=0
HAKMEM_WRAP_TINY=1 HAKMEM_TINY_SS_ADOPT=1 HAKMEM_TINY_TRACE_RING=0
HAKMEM_SAFE_FREE=0 HAKMEM_TINY_REFILL_COUNT=128
```

### Verification:
- HAKMEM_TINY_PHASE6_BOX_REFACTOR=1 confirmed active
- hak_tiny_alloc_fast_wrapper() called (FRONT diagnostics)
- Routing now bypasses guards for fast path
- Still -17% slower than baseline (investigation needed)

### Status:
🔬 PARTIAL SUCCESS - Routing fixed, but performance below expectation.
Box Theory is active and bypassing guards, but still slower than old implementation.

### Next Steps:
- Compare refill implementations (old vs Box Theory)
- Profile to identify specific bottleneck
- Investigate why Box Theory underperforms vs CLAUDE.md claims

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
2025-11-05 06:12:32 +00:00
0c66991393 WIP: Unify fast path to Box Theory (experimental)
### Changes:
- **Removed duplicate fast paths**: Disabled HAKMEM_TINY_FAST_PATH in:
  - malloc() entry point (line ~1257)
  - hak_alloc_at() helper (line ~682)
- **Unified to Box Theory**: All tiny allocations now use Box Theory's
  hak_tiny_alloc_fast_wrapper() at line ~712 (HAKMEM_TINY_PHASE6_BOX_REFACTOR)

### Rationale:
- Previous implementation had **2 fast path checks** (double overhead)
- Box Theory (tiny_alloc_fast.inc.h) provides optimized 3-4 instruction path
- CLAUDE.md claims +64% (debug), +150% (production) with Box Theory
- Attempt to eliminate redundant checks and unify to single fast path

### Performance Results:
⚠️ **REGRESSION** - Performance decreased:
```
Baseline (old tiny_fast_alloc): 1.68M ops/s
Box Theory (unified):           1.35M ops/s  (-20%)
System malloc:                  8.08M ops/s  (reference)
```

### Status:
🔬 **EXPERIMENTAL** - This commit documents the attempt but shows regression.
Possible issues:
1. Box Theory may need additional tuning (env vars not sufficient)
2. Refill backend may be slower than old implementation
3. TLS freelist initialization overhead
4. Missing optimizations in Box Theory integration

### Next Steps:
- Profile to identify why Box Theory is slower
- Compare refill efficiency: old vs Box Theory
- Check if TLS SLL variables are properly initialized
- Consider reverting if root cause not found

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md, CLAUDE.md Phase 6-1.7
2025-11-05 06:06:34 +00:00
31af3eab27 Add malloc routing analysis and refill success tracking
### Changes:
- **Routing Counters**: Added per-thread counters in hakmem.c to track:
  - g_malloc_total_calls: Total malloc() invocations
  - g_malloc_tiny_size_match: Calls within tiny size range (<=128B)
  - g_malloc_fast_path_tried: Calls that attempted fast path
  - g_malloc_fast_path_null: Fast path returned NULL
  - g_malloc_slow_path: Calls routed to slow path

- **Refill Success Tracking**: Added counters in tiny_fastcache.c:
  - g_refill_success_count: Full batch (16 blocks)
  - g_refill_partial_count: Partial batch (<16 blocks)
  - g_refill_fail_count: Zero blocks allocated
  - g_refill_total_blocks: Total blocks across all refills

- **Profile Output Enhanced**: tiny_fast_print_profile() now shows:
  - Routing statistics (which path allocations take)
  - Refill success/failure breakdown
  - Average blocks per refill

### Key Findings:
Fast path routing: 100% success (20,479/20,480 calls per thread)
Refill success: 100% (1,285 refills, all 16 blocks each)
⚠️  Performance: Still only 1.68M ops/s vs System's 8.06M (20.8%)

**Root Cause Confirmed**:
- NOT a routing problem (100% reach fast path)
- NOT a refill failure (100% success)
- IS a structural performance issue (2,418 cycles avg for malloc)

**Bottlenecks Identified**:
1. Fast path cache hits: ~2,418 cycles (vs tcache ~100 cycles)
2. Refill operations: ~39,938 cycles (expensive but infrequent)
3. Overall throughput: 4.8x slower than system malloc

**Next Steps** (per LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md):
- Option B: Refill efficiency (batch allocation from SuperSlab)
- Option C: Ultra-fast path redesign (tcache-equivalent)

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 05:56:02 +00:00
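The refill success tracking in commit 31af3eab27 (full, partial, and failed batches plus average blocks per refill) could be modeled along these lines. The struct and function names below are hypothetical stand-ins for the `g_refill_*` counters named in the commit; in an allocator the struct would likely be declared per thread.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical grouping of the refill counters. */
typedef struct {
    uint64_t success;       /* full batch delivered */
    uint64_t partial;       /* some blocks, fewer than requested */
    uint64_t fail;          /* zero blocks delivered */
    uint64_t total_blocks;  /* blocks across all refills */
} ex_refill_stats_t;

/* Classify one refill that asked for `batch` blocks and received `got`. */
static void ex_refill_record(ex_refill_stats_t *s, unsigned got, unsigned batch) {
    assert(batch > 0 && got <= batch);
    if (got == batch) s->success++;
    else if (got > 0) s->partial++;
    else              s->fail++;
    s->total_blocks += got;
}

/* Average blocks per refill; 0.0 when nothing has been recorded. */
static double ex_refill_avg_blocks(const ex_refill_stats_t *s) {
    uint64_t refills = s->success + s->partial + s->fail;
    return refills ? (double)s->total_blocks / (double)refills : 0.0;
}
```

With the figures reported in the commit (1,285 refills, all 16 blocks each), this average would come out at exactly 16.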
3e4e90eadb Phase 6-5: Entry Point Optimization (Phase 1) - Unexpected results
Implementation: Move HAKMEM_TINY_FAST_PATH check BEFORE all guard checks
in malloc(), inspired by mimalloc/tcache entry point design.

Strategy:
- tcache has 0 branches before fast path
- mimalloc has 1-2 branches before fast path
- Old HAKMEM had 8+ branches before fast path
- Phase 1: Move fast path to line 1, add branch prediction hints

Changes in core/hakmem.c:
1. Fast Path First: Size check → Init check → Cache hit (3 branches)
2. Slow Path: All guards moved after fast path (rare cases)
3. Branch hints: __builtin_expect() for hot paths

Expected results (from research):
- ST: 0.46M → 1.4-2.3M ops/s (+204-400%)
- MT: 1.86M → 3.7-5.6M ops/s (+99-201%)

Actual results (Larson 2s 8-128B 1024):
- ST: 0.377M → 0.424M ops/s (+12% only)
- MT: 1.856M → 1.453M ops/s (-22% regression!)

Analysis:
- Similar pattern to previous Option A test (+42% ST, -20% MT)
- Entry point reordering alone is insufficient
- True bottleneck may be:
  1. tiny_fast_alloc() internals (size-to-class, cache access)
  2. Refill cost (1,600 cycles for 16 individual calls)
  3. Need Batch Refill optimization (Phase 3) as priority

Next steps:
- Investigate refill bottleneck with perf profiling
- Consider implementing Phase 3 (Batch Refill) before Phase 2
- May need combination of multiple optimizations for breakthrough

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 05:10:02 +00:00
09e1d89e8d Phase 6-4: Larson benchmark optimizations - LUT size-to-class
Two optimizations to improve Larson benchmark performance:

1. **Option A: Fast Path Priority** (core/hakmem.c)
   - Move HAKMEM_TINY_FAST_PATH check before all guard checks
   - Reduce malloc() fast path from 8+ branches to 3 branches
   - Results: +42% ST, -20% MT (mixed results)

2. **LUT Optimization** (core/tiny_fastcache.h)
   - Replace 11-branch linear search with O(1) lookup table
   - Use size_to_class_lut[size >> 3] for fast mapping
   - Results: +24% MT, -24% ST (MT-optimized tradeoff)

Benchmark results (Larson 2s 8-128B 1024 chunks):
- Original:     ST 0.498M ops/s, MT 1.502M ops/s
- LUT version:  ST 0.377M ops/s, MT 1.856M ops/s

Analysis:
- ST regression: Branch predictor learns linear search pattern
- MT improvement: LUT avoids branch misprediction on context switch
- Recommendation: Keep LUT for multi-threaded workloads

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
2025-11-05 04:58:03 +00:00
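To make the LUT optimization in commit 09e1d89e8d concrete, the sketch below contrasts a branchy linear search with a table lookup that yields the same class. The class sizes (8, 16, 32, 64, 128) and all `ex_` names are assumptions for illustration; the real allocator's classes are not given in the commit. The index here rounds up with `(size + 7) >> 3` so that, for example, 9 bytes lands in the 16-byte class, which is a choice made for this sketch and differs from the `size >> 3` index quoted in the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_NUM_CLASSES 5
#define EX_TINY_MAX    128

/* Hypothetical size classes; boundaries are multiples of 8. */
static const size_t ex_class_size[EX_NUM_CLASSES] = { 8, 16, 32, 64, 128 };

/* Reference implementation: one compare per class until a fit is found. */
static int ex_class_linear(size_t size) {
    for (int c = 0; c < EX_NUM_CLASSES; c++)
        if (size <= ex_class_size[c]) return c;
    return -1;
}

/* One entry per 8-byte bucket: index i covers sizes (8*(i-1), 8*i]. */
static uint8_t ex_lut[(EX_TINY_MAX >> 3) + 1];

/* Fill the table once from the linear search. */
static void ex_lut_init(void) {
    for (size_t i = 0; i <= (EX_TINY_MAX >> 3); i++) {
        int c = ex_class_linear(i * 8);
        assert(c >= 0);
        ex_lut[i] = (uint8_t)c;
    }
}

/* O(1) mapping: one bounds check and one table load. */
static int ex_class_lut(size_t size) {
    if (size > EX_TINY_MAX) return -1;
    return ex_lut[(size + 7) >> 3];
}
```

The table is 17 bytes, so it fits in a single cache line alongside other hot data.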
b64cfc055e Implement Option A: Fast Path priority optimization (Phase 6-4)
Changes:
- Reorder malloc() to prioritize Fast Path (initialized + tiny size check first)
- Move Fast Path check before all guard checks (recursion, LD_PRELOAD, etc.)
- Optimize free() with same strategy (initialized check first)
- Add branch prediction hints (__builtin_expect)

Implementation:
- malloc(): Fast Path now executes with 3 branches total
  - Branch 1+2: g_initialized && size <= TINY_FAST_THRESHOLD
  - Branch 3: tiny_fast_alloc() cache hit check
  - Slow Path: All guard checks moved after Fast Path miss

- free(): Fast Path with 1-2 branches
  - Branch 1: g_initialized check
  - Direct to hak_free_at() on normal case

Performance Results (Larson benchmark, size=8-128B):

Single-thread (threads=1):
- Before: 0.46M ops/s (10.7% of system malloc)
- After:  0.65M ops/s (15.4% of system malloc)
- Change: +42% improvement ✓

Multi-thread (threads=4):
- Before: 1.81M ops/s (25.0% of system malloc)
- After:  1.44M ops/s (19.9% of system malloc)
- Change: -20% regression ✗

Analysis:
- ST improvement shows Fast Path optimization works
- MT regression suggests contention or cache issues
- Did not meet target (+200-400%), further optimization needed

Next Steps:
- Investigate MT regression (cache coherency?)
- Consider more aggressive inlining
- Explore Option B (Refill optimization)
2025-11-05 04:44:50 +00:00
52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00