# CURRENT TASK – Performance Optimization Status
**Last Updated**: 2025-11-25
**Scope**: Random Mixed 16-1024B / Arena Allocator / Architecture Limit Analysis

---

## 🎯 Current Status Summary

### ✅ Arena Allocator Implemented - mmap/munmap Syscalls Reduced ~95%
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| mmap syscalls | 401 | 32 | -92% |
| munmap syscalls | 378 | 3 | -99% |
| Performance (10M) | ~60M ops/s | **68-70M ops/s** | +15% |

### Current Performance Comparison (10M iterations)
```
System malloc: 93M ops/s (baseline)
HAKMEM: 68-70M ops/s (73-76% of system malloc)
Gap: ~25% (structural overhead)
```

---

## 🔬 Phase 27 Findings: Confirming the Architecture Limit

### Optimizations Attempted (All Failed)
| Optimization | Result | Effect |
|---------|------|------|
| C5 TLS capacity doubled (1024→2048) | 68-69M | No change |
| Registry lookup removed | 68-70M | No change |
| Ultra SLIM 4-layer | ~69M | No change |
| **Phase 27-A: Ultra-Inline (all sizes)** | **56-61M** | **-15% regression** ❌ |
| **Phase 27-B: Ultra-Inline (9-512B)** | **61-62M** | **-10% regression** ❌ |

### Why Phase 27 Failed
- ~52% of the workload lands in headerless classes (cls 0: 1-8B, cls 7: 513-1024B)
- The conditional branch that filters out headerless classes is itself overhead
- The gain on classes 1-6 is smaller than the cost of that branch

### Sources of the Remaining ~25% Gap (Structural Overhead)
1. **Header byte overhead** - one byte written/read on every alloc/free
2. **TLS SLL counter** - count++ / count-- on every operation (vs tcache: pointer only)
3. **Multi-layer dispatch** - 4-5 dispatch layers (vs tcache: 2-3)

### Conclusion
**68-70M ops/s is the practical ceiling of the current architecture.** Reaching system malloc's 93M ops/s would require:
- A full redesign toward a header-free layout
- Mimicking tcache (drop the counter, cut dispatch layers)

However, the return on investment is currently low.

---

## 📁 Key Modified Files (Arena Allocator Implementation)
- `core/box/ss_cache_box.inc:138-229` - SSArena allocator added
- `core/box/tls_sll_box.h:509-561` - recycle check made optional in release mode
- `core/tiny_free_fast_v2.inc.h:113-148` - cross-check removed in release mode
- `core/hakmem_tiny_sll_cap_box.inc:8-25` - C5 raised to full capacity
- `core/hakmem_policy.c:24-30` - min_keep tuning
- `core/tiny_alloc_fast_sfc.inc.h:18-26` - SFC defaults tuning

---

## 🗃 Past Problems and Solutions (Reference)

### State Before the Arena Allocator
- **Random Mixed (5M ops)**: ~56-60M ops/s, **418 mmap calls** (26x mimalloc)
- **Root cause**: design mismatch - the SuperSlab served as both the allocation unit and the cache unit
- **Problem**: at ws=256, slabs stall at 5-15% utilization → never become fully EMPTY → the LRU cache never fires → mmap/munmap on every cycle

### How the Arena Allocator Solved It
- Implemented an Arena allocator that treats the SuperSlab as the OS-level unit
- mmap calls 418 → 32 (-92%), munmap calls 378 → 3 (-99%)
- Performance 60M → 68-70M ops/s (+15%)

---

## 📊 Architecture Mapping to Other Allocators (Reference)
| HAKMEM | mimalloc | tcmalloc | jemalloc |
|--------|----------|----------|----------|
| SuperSlab (2MB) | Segment (~2MiB) | PageHeap | Extent |
| Slab (64KB) | Page (~64KiB) | Span | Run/slab |
| per-class freelist | pages_queue | Central freelist | bin/slab lists |
| Arena allocator | segment cache | PageHeap | extent_avail |

---

## 🚀 Future Possibilities (Long-Term)

### Slab-Level EMPTY Recycling (Not Implemented)
- **Goal**: make slabs reusable across size classes
- **Design**: keep EMPTY slabs on a lock-free stack; assign class_idx dynamically at alloc time
- **Expected benefit**: better memory efficiency (throughput gains likely limited)

### Abandoned SuperSlab (for MT, Not Implemented)
- **Goal**: let surviving threads reclaim memory after a thread exits
- **Design**: equivalent of mimalloc's abandoned segments
- **When**: once MT workloads actually need it

---

## ✅ Completed Milestones
1. **Arena Allocator implementation** - mmap/munmap syscalls cut by ~95% ✅
2. **Phase 27 investigation** - architecture limit confirmed ✅
3. **Performance 68-70M ops/s** - 73-76% of system malloc ✅
**Current recommendation**: accept 68-70M ops/s as the baseline and focus optimization effort on other workloads (Mid-Large, Larson, etc.).