|
|
131cdb7b88
|
Doc: Add benchmark reports, atomic freelist docs, and .gitignore update
Phase 1 Commit: Comprehensive documentation and build system cleanup
Added Documentation:
- BENCHMARK_SUMMARY_20251122.md: Current performance baseline
- COMPREHENSIVE_BENCHMARK_REPORT_20251122.md: Detailed analysis
- LARSON_SLOWDOWN_INVESTIGATION_REPORT.md: Larson benchmark deep dive
- ATOMIC_FREELIST_*.md (5 files): Complete atomic freelist documentation
  - Implementation strategy, quick start, site-by-site guide
  - Index and summary for easy navigation
Added Scripts:
- run_comprehensive_benchmark.sh: Automated benchmark runner
- scripts/analyze_freelist_sites.sh: Freelist analysis tool
- scripts/verify_atomic_freelist_conversion.sh: Conversion verification
Build System:
- Updated .gitignore: Added *.d (build dependency files)
- Cleaned up tracked .d files (will be ignored going forward)
Performance Status (2025-11-22):
- Random Mixed 256B: 59.6M ops/s (VERIFIED WORKING)
- Benchmark command: ./out/release/bench_random_mixed_hakmem 10000000 256 42
- Known issue: workset=8192 causes SEGV (to be fixed separately)
Notes:
- bench_random_mixed.c already tracked, working state confirmed
- Ultra SLIM implementation backed up to /tmp/ (Phase 2 restore pending)
- Documentation covers atomic freelist conversion and benchmarking methodology
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-22 06:11:55 +09:00 |
|
|
|
7975e243ee
|
Phase 7 Task 3: Pre-warm TLS cache (+180-280% improvement!)
MAJOR SUCCESS: HAKMEM now achieves 85-92% of System malloc on tiny
allocations (128-512B) and BEATS System at 146% on 1024B allocations!
Performance Results:
- Random Mixed 128B: 21M → 59M ops/s (+181%) 🚀
- Random Mixed 256B: 19M → 70M ops/s (+268%) 🚀
- Random Mixed 512B: 21M → 68M ops/s (+224%) 🚀
- Random Mixed 1024B: 21M → 65M ops/s (+210%, 146% of System!) 🏆
- Larson 1T: 2.68M ops/s (stable, no regression)
Implementation:
1. Task 3a: Remove profiling overhead in release builds
   - Wrapped RDTSC calls in #if !HAKMEM_BUILD_RELEASE
   - Compiler can eliminate profiling code completely
   - Effect: +2% (2.68M → 2.73M Larson)
2. Task 3b: Simplify refill logic
   - Use constants from hakmem_build_flags.h
   - TLS cache already optimal
   - Effect: No regression
3. Task 3c: Pre-warm TLS cache (GAME CHANGER!)
   - Pre-allocate 16 blocks per class at init
   - Eliminates cold-start penalty
   - Effect: +180-280% improvement 🚀
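A minimal sketch of the Task 3a idea, assuming GCC or Clang. Only `HAKMEM_BUILD_RELEASE` comes from this commit; `DEMO_PROF_BEGIN`, `DEMO_PROF_END`, `demo_cycles`, `demo_sum`, and `demo_prof_calls_seen` are hypothetical names used for illustration, and the non-x86 fallback is an assumption added so the sketch compiles elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0      /* assumption: debug build unless overridden */
#endif

#if !HAKMEM_BUILD_RELEASE
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>              /* __rdtsc() on x86 with GCC/Clang */
static inline uint64_t demo_cycles(void) { return __rdtsc(); }
#else
static inline uint64_t demo_cycles(void) { return 0; }  /* no TSC: timing disabled */
#endif
static uint64_t demo_prof_cycles;   /* hypothetical accumulators */
static uint64_t demo_prof_calls;
#define DEMO_PROF_BEGIN() uint64_t demo_t0_ = demo_cycles()
#define DEMO_PROF_END() \
    do { demo_prof_cycles += demo_cycles() - demo_t0_; demo_prof_calls++; } while (0)
#else
/* Release build: both macros expand to no-ops, so no timestamp reads are emitted. */
#define DEMO_PROF_BEGIN() ((void)0)
#define DEMO_PROF_END()   ((void)0)
#endif

/* Stand-in for a hot path that is instrumented only in non-release builds. */
uint64_t demo_sum(const uint32_t *v, int n) {
    assert(v != NULL || n == 0);
    DEMO_PROF_BEGIN();
    uint64_t s = 0;
    for (int i = 0; i < n; i++) s += v[i];
    DEMO_PROF_END();
    return s;
}

/* Number of profiled calls so far; always 0 in a release build. */
uint64_t demo_prof_calls_seen(void) {
#if !HAKMEM_BUILD_RELEASE
    return demo_prof_calls;
#else
    return 0;
#endif
}
```

Building with `-DHAKMEM_BUILD_RELEASE=1` removes the instrumentation at preprocessing time, so the hot path carries no profiling code at all.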
Root Cause:
The bottleneck was cold-start, not the hot path! First allocation in
each class triggered a SuperSlab refill (100+ cycles). Pre-warming
eliminated this penalty, revealing Phase 7's true potential.
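The cold-start effect described above can be sketched as follows. This is an illustration under stated assumptions, not the code in core/hakmem_tiny.c: every `demo_*` name is hypothetical, the class count and size mapping are invented, and `malloc` stands in for the SuperSlab refill. Only the figure of 16 blocks per class comes from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_NUM_CLASSES   4    /* hypothetical; the real class count is not stated here */
#define DEMO_PREWARM_COUNT 16   /* "16 blocks per class" from the commit */

static __thread void *demo_head[DEMO_NUM_CLASSES]  = {0};
static __thread int   demo_count[DEMO_NUM_CLASSES] = {0};
static __thread int   demo_slow_refills            = 0;

/* Hypothetical size mapping: 128, 256, 512, 1024 bytes. */
static size_t demo_class_size(int cls) { return (size_t)128 << cls; }

/* Slow-path stand-in: carve n blocks out of one chunk and link them into the
 * per-thread free list. The chunk is never returned; this is only a sketch. */
static int demo_refill(int cls, int n) {
    size_t sz = demo_class_size(cls);
    char *chunk = malloc(sz * (size_t)n);
    if (chunk == NULL) return 0;
    for (int i = 0; i < n; i++) {
        void *blk = chunk + (size_t)i * sz;
        *(void **)blk = demo_head[cls];   /* next pointer lives inside the free block */
        demo_head[cls] = blk;
    }
    demo_count[cls] += n;
    return n;
}

/* Called once per thread at init so the first allocation never hits the slow path. */
void demo_prewarm(void) {
    for (int c = 0; c < DEMO_NUM_CLASSES; c++)
        if (demo_count[c] == 0) demo_refill(c, DEMO_PREWARM_COUNT);
}

void *demo_alloc(int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    if (demo_head[cls] == NULL) {         /* cold or drained list: slow path */
        demo_slow_refills++;
        if (!demo_refill(cls, DEMO_PREWARM_COUNT)) return NULL;
    }
    void *blk = demo_head[cls];
    demo_head[cls] = *(void **)blk;
    demo_count[cls]--;
    return blk;
}

int demo_cached(int cls)          { return demo_count[cls]; }
int demo_slow_refill_count(void)  { return demo_slow_refills; }
```

After `demo_prewarm()`, the first `demo_alloc()` in each class pops from an already populated list, so the slow-refill counter stays at zero until a list drains.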
Files Modified:
- core/hakmem_tiny.c: Pre-warm function implementation
- core/box/hak_core_init.inc.h: Pre-warm initialization call
- core/tiny_alloc_fast.inc.h: Profiling overhead removal
- core/hakmem_phase7_config.h: Task 3 constants (NEW)
- core/hakmem_build_flags.h: Phase 7 feature flags
- Makefile: PREWARM_TLS flag, phase7 targets
- CLAUDE.md: Phase 7 success summary
- PHASE7_TASK3_RESULTS.md: Comprehensive results report (NEW)
Build:
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1 phase7-bench
🎉 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-08 12:54:52 +09:00 |
|
|
|
382980d450
|
Phase 6-2.4: Fix SuperSlab free SEGV: remove guess loop and add memory readability check; add registry atomic consistency (base as _Atomic uintptr_t with acq/rel); add debug toggles (SUPER_REG_DEBUG/REQTRACE); update CURRENT_TASK with results and next steps; capture suite results.
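As an illustration of the "base as _Atomic uintptr_t with acq/rel" item, here is a hedged C11 sketch of a registry slot published with a release store and read with an acquire load. The struct layout, slot count, and all `demo_*` names are assumptions; the actual registry is not shown in this log, and the sketch does not address safe slot reuse under concurrent readers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_REG_SLOTS 8              /* hypothetical registry size */

typedef struct {
    _Atomic uintptr_t base;           /* 0 means "slot empty" (assumption) */
    size_t            size;           /* plain field, made visible via base */
} demo_reg_entry;

static demo_reg_entry demo_registry[DEMO_REG_SLOTS];

/* Writer: fill the plain fields first, then publish base with release
 * ordering so a reader that observes base also observes size. */
void demo_reg_publish(int slot, uintptr_t base, size_t size) {
    assert(slot >= 0 && slot < DEMO_REG_SLOTS && base != 0);
    demo_registry[slot].size = size;
    atomic_store_explicit(&demo_registry[slot].base, base, memory_order_release);
}

void demo_reg_clear(int slot) {
    assert(slot >= 0 && slot < DEMO_REG_SLOTS);
    atomic_store_explicit(&demo_registry[slot].base, (uintptr_t)0,
                          memory_order_release);
}

/* Reader: the acquire load pairs with the release store above. Returns the
 * base of the registered region containing ptr, or 0 if none matches, so the
 * caller can reject unknown pointers instead of guessing a header location. */
uintptr_t demo_reg_lookup(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < DEMO_REG_SLOTS; i++) {
        uintptr_t b = atomic_load_explicit(&demo_registry[i].base,
                                           memory_order_acquire);
        if (b != 0 && p >= b && p - b < demo_registry[i].size) return b;
    }
    return 0;
}
```

A free path built on such a lookup can bail out when the result is 0 rather than dereferencing memory that may not belong to the allocator.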
|
2025-11-07 18:07:48 +09:00 |
|
|
|
8f3095fb85
|
CI-safe debug runners: add ASan LD_PRELOAD + UBSan mailbox targets; add asan_preload script; document sanitizer-safe workflows and results in CURRENT_TASK.md (debug complete).
|
2025-11-07 12:09:28 +09:00 |
|
|
|
1da8754d45
|
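CRITICAL FIX: Completely resolve the 4T SEGV caused by uninitialized TLS
**Problem:**
- Larson 4T: SEGV 100% of the time (1T completes at 2.09M ops/s)
- System/mimalloc run normally at 33.52M ops/s with 4T
- 4T still hits SEGV even with SS OFF + Remote OFF
**Root cause: (findings of the Task agent ultrathink investigation)**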
```
CRASH: mov (%r15),%r13
R15 = 0x6261 ← ASCII "ba" (garbage value, uninitialized TLS)
```
TLS variables in worker threads are uninitialized:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← no initializer
- Not zero-initialized in threads created by pthread_create()
- The NULL check passes (0x6261 != NULL) → dereference → SEGV
**Fix:**
Added an explicit `= {0}` initializer to every TLS array:
1. **core/hakmem_tiny.c:**
   - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
   - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
   - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bend[TINY_NUM_CLASSES] = {0}`
2. **core/tiny_fastcache.c:**
   - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`
3. **core/hakmem_tiny_magazine.c:**
   - `g_tls_mags[TINY_NUM_CLASSES] = {0}`
4. **core/tiny_sticky.c:**
   - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`
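The declaration style adopted by this fix can be sketched as below. This only mirrors the pattern named in the commit; `DEMO_NUM_CLASSES` and every `demo_*` identifier are hypothetical stand-ins, and `demo_worker` merely has a pthread-compatible signature so it could be handed to `pthread_create()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8   /* stand-in for TINY_NUM_CLASSES */

/* Per-thread singly linked free lists, declared with the explicit `= {0}`
 * initializer so every head is written down as starting at NULL. */
static __thread void    *demo_sll_head[DEMO_NUM_CLASSES]  = {0};
static __thread uint32_t demo_sll_count[DEMO_NUM_CLASSES] = {0};

void demo_push(int cls, void *blk) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES && blk != NULL);
    *(void **)blk = demo_sll_head[cls];   /* next pointer stored in the block */
    demo_sll_head[cls] = blk;
    demo_sll_count[cls]++;
}

void *demo_pop(int cls) {
    assert(cls >= 0 && cls < DEMO_NUM_CLASSES);
    void *blk = demo_sll_head[cls];
    if (blk == NULL) return NULL;         /* only meaningful if head starts as NULL */
    demo_sll_head[cls] = *(void **)blk;   /* a garbage head would fault here */
    demo_sll_count[cls]--;
    return blk;
}

/* pthread-style entry point: returns NULL on success, non-NULL on failure.
 * It checks that this thread's lists start empty, then does one push/pop. */
void *demo_worker(void *arg) {
    (void)arg;
    for (int c = 0; c < DEMO_NUM_CLASSES; c++)
        if (demo_pop(c) != NULL || demo_sll_count[c] != 0) return (void *)1;
    void *block[4];                       /* pointer-aligned scratch block */
    demo_push(3, block);
    return demo_pop(3) == (void *)block ? NULL : (void *)1;
}
```

Running `demo_worker` in several threads gives each thread its own copy of the arrays, which is the situation the 4T Larson run exercises.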
**Effect:**
```
Before: 1T: 2.09M ✅ | 4T: SEGV 💀
After: 1T: 2.41M ✅ | 4T: 4.19M ✅ (+15% 1T, SEGV resolved)
```
**Tests:**
```bash
# 1 thread: runs to completion
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s ✅
# 4 threads: runs to completion (previously SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s ✅
```
**Investigation credit:** precise root-cause identification by the Task agent (ultrathink mode)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-07 01:27:04 +09:00 |
|
|
|
f0c87d0cac
|
Add Larson performance analysis and optimized profile
Ultrathink analysis reveals root cause of 4x performance gap:
Key Findings:
- Single-thread: HAKMEM 0.46M ops/s vs system 4.29M ops/s (10.7%)
- Multi-thread: HAKMEM 1.81M ops/s vs system 7.23M ops/s (25.0%)
- Root cause: malloc() entry point has 8+ branch checks
- Bottleneck: Fast Path is structurally complex vs system tcache
Files Added:
- LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md: Detailed analysis with 3 optimization strategies
- scripts/profiles/tinyhot_optimized.env: CLAUDE.md-based optimized config
Proposed Solutions:
- Option A: Optimize malloc() guard checks (+200-400% expected)
- Option B: Improve refill efficiency (+30-50% expected)
- Option C: Complete Fast Path simplification (+400-800% expected)
Target: Achieve 60-80% of system malloc performance
|
2025-11-05 04:03:10 +00:00 |
|
|
|
52386401b3
|
Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
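A rough sketch of what Free Pipeline counters of this kind might look like. The path names `ss_local`, `ss_remote`, and `tls_sll` come from the commit; the enum, the relaxed-atomic counters, and all `demo_*` functions are assumptions for illustration, not the project's actual infrastructure.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

typedef enum {
    DEMO_FREE_SS_LOCAL,    /* freed into the owning thread's SuperSlab */
    DEMO_FREE_SS_REMOTE,   /* freed from another thread */
    DEMO_FREE_TLS_SLL,     /* pushed onto the thread-local free list */
    DEMO_FREE_PATHS
} demo_free_path;

static _Atomic uint64_t demo_free_counts[DEMO_FREE_PATHS];

/* Relaxed ordering is enough here: the counters are statistics only and do
 * not synchronize any other data. */
void demo_count_free(demo_free_path p) {
    assert((unsigned)p < DEMO_FREE_PATHS);
    atomic_fetch_add_explicit(&demo_free_counts[p], 1, memory_order_relaxed);
}

uint64_t demo_free_count(demo_free_path p) {
    assert((unsigned)p < DEMO_FREE_PATHS);
    return atomic_load_explicit(&demo_free_counts[p], memory_order_relaxed);
}

/* Print one line per path, for example at process exit. */
void demo_dump_free_counts(FILE *out) {
    static const char *names[DEMO_FREE_PATHS] = {"ss_local", "ss_remote", "tls_sll"};
    for (int i = 0; i < DEMO_FREE_PATHS; i++)
        fprintf(out, "%-9s %llu\n", names[i],
                (unsigned long long)demo_free_count((demo_free_path)i));
}
```

Comparing the per-path totals after a benchmark run shows which branch of the free pipeline dominates, which is the kind of question an early-return analysis needs answered.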
Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files
Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)
This is a clean repository without large log files.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-05 12:31:14 +09:00 |
|