hakmem

tomoaki/hakmem

Fork 0

Commit Graph

Author	SHA1	Message	Date
Moe Charm (CI)	030132f911	Phase 10: TLS/SFC aggressive cache tuning (syscall reduction failed) Goal: Reduce backend transitions by increasing frontend hit rate Result: +2% best case, syscalls unchanged (root cause: SuperSlab churn) Implementation: 1. Cache capacity expansion (2-8x per-class) - Hot classes (C0-C3): 4x increase (512 slots) - Medium classes (C4-C6): 2-3x increase - Class 7 (1KB): 2x increase (128 slots) - Fast cache: 2x default capacity 2. Refill batch size increase (4-8x) - Global default: 16 → 64 (4x) - Hot classes: 128 (8x) via HAKMEM_TINY_REFILL_COUNT_HOT - Mid classes: 96 (6x) via HAKMEM_TINY_REFILL_COUNT_MID - Class 7: 64 → 128 (2x) - SFC refill: 64 → 128 (2x) 3. Adaptive sizing aggressive parameters - Grow threshold: 80% → 70% (expand earlier) - Shrink threshold: 20% → 10% (shrink less) - Growth rate: 2x → 1.5x (smoother growth) - Max capacity: 2048 → 4096 (2x ceiling) - Adapt frequency: Every 10 → 5 refills (more responsive) Performance Results (100K iterations): Before (Phase 9): - Performance: 9.71M ops/s - Syscalls: 1,729 (mmap:877, munmap:852) After (Phase 10): - Default settings: 8.77M ops/s (-9.7%) ⚠️ - Optimal ENV: 9.89M ops/s (+2%) ✅ - Syscalls: 1,729 (unchanged) ❌ Optimal ENV configuration: export HAKMEM_TINY_REFILL_COUNT_HOT=256 export HAKMEM_TINY_REFILL_COUNT_MID=192 Root Cause Analysis: Bottleneck is NOT TLS/SFC hit rate, but SuperSlab allocation churn: - 877 SuperSlabs allocated (877MB via mmap) - Phase 9 LRU cache not utilized (no frees during benchmark) - All SuperSlabs retained until program exit - System malloc: 9 syscalls vs HAKMEM: 1,729 syscalls (192x gap) Conclusion: TLS/SFC tuning cannot solve SuperSlab allocation policy problem. Next step: Phase 11 SuperSlab Prewarm strategy to eliminate mmap/munmap during benchmark execution. ChatGPT review: Strategy validated, Option A (Prewarm) recommended. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-13 14:25:54 +09:00
Moe Charm (CI)	1da8754d45	CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消問題: - Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走) - System/mimalloc は 4T で 33.52M ops/s 正常動作 - SS OFF + Remote OFF でも 4T で SEGV 根本原因: (Task agent ultrathink 調査結果) ``` CRASH: mov (%r15),%r13 R15 = 0x6261 ← ASCII "ba" (ゴミ値、未初期化TLS) ``` Worker スレッドの TLS 変数が未初期化: - `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← 初期化なし - pthread_create() で生成されたスレッドでゼロ初期化されない - NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV 修正内容: 全 TLS 配列に明示的初期化子 `= {0}` を追加: 1. core/hakmem_tiny.c: - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}` - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}` - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}` - `g_tls_bcur[TINY_NUM_CLASSES] = {0}` - `g_tls_bend[TINY_NUM_CLASSES] = {0}` 2. core/tiny_fastcache.c: - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}` - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}` 3. core/hakmem_tiny_magazine.c: - `g_tls_mags[TINY_NUM_CLASSES] = {0}` 4. core/tiny_sticky.c: - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}` - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}` - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}` 効果: ``` Before: 1T: 2.09M ✅ \| 4T: SEGV 💀 After: 1T: 2.41M ✅ \| 4T: 4.19M ✅ (+15% 1T, SEGV解消) ``` テスト: ```bash # 1 thread: 完走 ./larson_hakmem 2 8 128 1024 1 12345 1 → Throughput = 2,407,597 ops/s ✅ # 4 threads: 完走（以前は SEGV） ./larson_hakmem 2 8 128 1024 1 12345 4 → Throughput = 4,192,155 ops/s ✅ ``` 調査協力: Task agent (ultrathink mode) による完璧な根本原因特定 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-07 01:27:04 +09:00
Moe Charm (CI)	52386401b3	Debug Counters Implementation - Clean History Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-05 12:31:14 +09:00

Author

SHA1

Message

Date

Moe Charm (CI)

030132f911

Phase 10: TLS/SFC aggressive cache tuning (syscall reduction failed)

Goal: Reduce backend transitions by increasing frontend hit rate
Result: +2% best case, syscalls unchanged (root cause: SuperSlab churn)

Implementation:

1. Cache capacity expansion (2-8x per-class)
   - Hot classes (C0-C3): 4x increase (512 slots)
   - Medium classes (C4-C6): 2-3x increase
   - Class 7 (1KB): 2x increase (128 slots)
   - Fast cache: 2x default capacity

2. Refill batch size increase (4-8x)
   - Global default: 16 → 64 (4x)
   - Hot classes: 128 (8x) via HAKMEM_TINY_REFILL_COUNT_HOT
   - Mid classes: 96 (6x) via HAKMEM_TINY_REFILL_COUNT_MID
   - Class 7: 64 → 128 (2x)
   - SFC refill: 64 → 128 (2x)

3. Adaptive sizing aggressive parameters
   - Grow threshold: 80% → 70% (expand earlier)
   - Shrink threshold: 20% → 10% (shrink less)
   - Growth rate: 2x → 1.5x (smoother growth)
   - Max capacity: 2048 → 4096 (2x ceiling)
   - Adapt frequency: Every 10 → 5 refills (more responsive)

Performance Results (100K iterations):

Before (Phase 9):
- Performance: 9.71M ops/s
- Syscalls: 1,729 (mmap:877, munmap:852)

After (Phase 10):
- Default settings: 8.77M ops/s (-9.7%) ⚠️
- Optimal ENV: 9.89M ops/s (+2%) ✅
- Syscalls: 1,729 (unchanged) ❌

Optimal ENV configuration:
export HAKMEM_TINY_REFILL_COUNT_HOT=256
export HAKMEM_TINY_REFILL_COUNT_MID=192

Root Cause Analysis:

Bottleneck is NOT TLS/SFC hit rate, but SuperSlab allocation churn:
- 877 SuperSlabs allocated (877MB via mmap)
- Phase 9 LRU cache not utilized (no frees during benchmark)
- All SuperSlabs retained until program exit
- System malloc: 9 syscalls vs HAKMEM: 1,729 syscalls (192x gap)

Conclusion:

TLS/SFC tuning cannot solve SuperSlab allocation policy problem.
Next step: Phase 11 SuperSlab Prewarm strategy to eliminate
mmap/munmap during benchmark execution.

ChatGPT review: Strategy validated, Option A (Prewarm) recommended.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-13 14:25:54 +09:00

Moe Charm (CI)

1da8754d45

CRITICAL FIX: TLS 未初期化による 4T SEGV を完全解消

**問題:**
- Larson 4T で 100% SEGV (1T は 2.09M ops/s で完走)
- System/mimalloc は 4T で 33.52M ops/s 正常動作
- SS OFF + Remote OFF でも 4T で SEGV

**根本原因: (Task agent ultrathink 調査結果)**
```
CRASH: mov (%r15),%r13
R15 = 0x6261  ← ASCII "ba" (ゴミ値、未初期化TLS)
```

Worker スレッドの TLS 変数が未初期化:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];`  ← 初期化なし
- pthread_create() で生成されたスレッドでゼロ初期化されない
- NULL チェックが通過 (0x6261 != NULL) → dereference → SEGV

**修正内容:**
全 TLS 配列に明示的初期化子 `= {0}` を追加:

1. **core/hakmem_tiny.c:**
   - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
   - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
   - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bend[TINY_NUM_CLASSES] = {0}`

2. **core/tiny_fastcache.c:**
   - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`

3. **core/hakmem_tiny_magazine.c:**
   - `g_tls_mags[TINY_NUM_CLASSES] = {0}`

4. **core/tiny_sticky.c:**
   - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`

**効果:**
```
Before: 1T: 2.09M ✅  |  4T: SEGV 💀
After:  1T: 2.41M ✅  |  4T: 4.19M ✅  (+15% 1T, SEGV解消)
```

**テスト:**
```bash
# 1 thread: 完走
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s ✅

# 4 threads: 完走（以前は SEGV）
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s ✅
```

**調査協力:** Task agent (ultrathink mode) による完璧な根本原因特定

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-07 01:27:04 +09:00

Moe Charm (CI)

52386401b3

Debug Counters Implementation - Clean History

Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-05 12:31:14 +09:00

3 Commits