Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization
Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
- Direct SuperSlab carve (TLS SLL bypass)
- Self-contained pop-or-refill pattern
- ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128
2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
- Unified ON → direct cache access (skip all intermediate layers)
- Alloc: unified_cache_pop_or_refill() → immediate fail to slow
- Free: unified_cache_push() → fallback to SLL only if full
PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
- PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
- Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()
4. Measurement results (Random Mixed 500K / 256B):
- Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
- SSM: 512 pages (initialization footprint)
- MID/L25: 0 (unused in this workload)
- Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)
Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
- ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
- Conditional compilation cleanup
Documentation:
6. Analysis reports
- RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
- RANDOM_MIXED_SUMMARY.md: Phase 23 summary
- RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
- CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan
Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads
Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
@ -1,306 +1,189 @@
|
|||||||
# Large Files Analysis - Document Index
|
# Random Mixed ボトルネック分析 - 完全レポート
|
||||||
|
|
||||||
## Overview
|
**Analysis Date**: 2025-11-16
|
||||||
|
**Status**: Complete & Implementation Ready
|
||||||
Comprehensive analysis of 1000+ line files in HAKMEM allocator codebase, with detailed refactoring recommendations and implementation plan.
|
**Priority**: 🔴 HIGHEST
|
||||||
|
**Expected Gain**: +13-29% (19.4M → 22-25M ops/s)
|
||||||
**Analysis Date**: 2025-11-06
|
|
||||||
**Status**: COMPLETE - Ready for Implementation
|
|
||||||
**Scope**: 5 large files, 9,008 lines (28% of codebase)
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Documents
|
## ドキュメント一覧
|
||||||
|
|
||||||
### 1. LARGE_FILES_ANALYSIS.md (645 lines) - Main Analysis Report
|
### 1. **RANDOM_MIXED_SUMMARY.md** (推奨・最初に読む)
|
||||||
**Length**: 645 lines | **Read Time**: 30-40 minutes
|
**用途**: エグゼクティブサマリー + 優先度付き推奨施策
|
||||||
|
**対象**: マネージャー、意思決定者
|
||||||
|
**内容**:
|
||||||
|
- Cycles 分布(表形式)
|
||||||
|
- FrontMetrics 現状
|
||||||
|
- Class別プロファイル
|
||||||
|
- 優先度付き候補(A/B/C/D)
|
||||||
|
- 最終推奨(1-4優先度順)
|
||||||
|
|
||||||
**Contents**:
|
**読む時間**: 5分
|
||||||
- Executive summary with priority matrix
|
**ファイル**: `/mnt/workdisk/public_share/hakmem/RANDOM_MIXED_SUMMARY.md`
|
||||||
- Detailed analysis of each of the 5 large files:
|
|
||||||
- hakmem_pool.c (2,592 lines)
|
|
||||||
- hakmem_tiny.c (1,765 lines)
|
|
||||||
- hakmem.c (1,745 lines)
|
|
||||||
- hakmem_tiny_free.inc (1,711 lines) - CRITICAL
|
|
||||||
- hakmem_l25_pool.c (1,195 lines)
|
|
||||||
|
|
||||||
**For each file**:
|
|
||||||
- Primary responsibilities
|
|
||||||
- Code structure breakdown (line ranges)
|
|
||||||
- Key functions listing
|
|
||||||
- Include analysis
|
|
||||||
- Cross-file dependencies
|
|
||||||
- Complexity metrics
|
|
||||||
- Refactoring recommendations with rationale
|
|
||||||
|
|
||||||
**Key Findings**:
|
|
||||||
- hakmem_tiny_free.inc: Average 171 lines per function (EXTREME - should be 20-30)
|
|
||||||
- hakmem_pool.c: 65 functions mixed across 4 responsibilities
|
|
||||||
- hakmem_tiny.c: 35 header includes (extreme coupling)
|
|
||||||
- hakmem.c: 38 includes, mixing API + dispatch + config
|
|
||||||
- hakmem_l25_pool.c: Code duplication with MidPool
|
|
||||||
|
|
||||||
**When to Use**:
|
|
||||||
- First time readers wanting detailed analysis
|
|
||||||
- Technical discussions and design reviews
|
|
||||||
- Understanding current code structure
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### 2. LARGE_FILES_REFACTORING_PLAN.md (577 lines) - Implementation Guide
|
### 2. **RANDOM_MIXED_BOTTLENECK_ANALYSIS.md** (詳細分析)
|
||||||
**Length**: 577 lines | **Read Time**: 20-30 minutes
|
**用途**: 深掘りボトルネック分析、技術的根拠の確認
|
||||||
|
**対象**: エンジニア、最適化担当者
|
||||||
|
**内容**:
|
||||||
|
- Executive Summary
|
||||||
|
- Cycles 分布分析(詳細)
|
||||||
|
- FrontMetrics 状況確認
|
||||||
|
- Class別パフォーマンスプロファイル
|
||||||
|
- 次の一手候補の詳細分析(A/B/C/D)
|
||||||
|
- 優先順位付け結論
|
||||||
|
- 推奨施策(スクリプト付き)
|
||||||
|
- 長期ロードマップ
|
||||||
|
- 技術的根拠(Fixed vs Mixed 比較、Refill Cost 見積もり)
|
||||||
|
|
||||||
**Contents**:
|
**読む時間**: 15-20分
|
||||||
- Critical path timeline (5 phases)
|
**ファイル**: `/mnt/workdisk/public_share/hakmem/RANDOM_MIXED_BOTTLENECK_ANALYSIS.md`
|
||||||
- Phase-by-phase implementation details:
|
|
||||||
- Phase 1: Tiny Free Path (Week 1) - CRITICAL
|
|
||||||
- Phase 2: Pool Manager (Week 2) - CRITICAL
|
|
||||||
- Phase 3: Tiny Core (Week 3) - CRITICAL
|
|
||||||
- Phase 4: Main Dispatcher (Week 4) - HIGH
|
|
||||||
- Phase 5: Pool Core Library (Week 5) - HIGH
|
|
||||||
|
|
||||||
**For each phase**:
|
|
||||||
- Specific deliverables
|
|
||||||
- Metrics (before/after)
|
|
||||||
- Build integration details
|
|
||||||
- Dependency graphs
|
|
||||||
- Expected results
|
|
||||||
|
|
||||||
**Additional sections**:
|
|
||||||
- Before/after dependency graph visualization
|
|
||||||
- Metrics comparison table
|
|
||||||
- Risk mitigation strategies
|
|
||||||
- Success criteria checklist
|
|
||||||
- Time & effort estimates
|
|
||||||
- Rollback procedures
|
|
||||||
- Next immediate steps
|
|
||||||
|
|
||||||
**Key Timeline**:
|
|
||||||
- Total: 2 weeks (1 developer) or 1 week (2 developers)
|
|
||||||
- Phase 1: 3 days (Tiny Free, CRITICAL)
|
|
||||||
- Phase 2: 4 days (Pool, CRITICAL)
|
|
||||||
- Phase 3: 3 days (Tiny core consolidation, CRITICAL)
|
|
||||||
- Phase 4: 2 days (Dispatcher split, HIGH)
|
|
||||||
- Phase 5: 2 days (Pool core library, HIGH)
|
|
||||||
|
|
||||||
**When to Use**:
|
|
||||||
- Implementation planning
|
|
||||||
- Work breakdown structure
|
|
||||||
- Parallel work assignment
|
|
||||||
- Risk assessment
|
|
||||||
- Timeline estimation
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
### 3. LARGE_FILES_QUICK_REFERENCE.md (270 lines) - Quick Reference
|
### 3. **RING_CACHE_ACTIVATION_GUIDE.md** (即実施ガイド)
|
||||||
**Length**: 270 lines | **Read Time**: 10-15 minutes
|
**用途**: Ring Cache C4-C7 有効化の実施手順書
|
||||||
|
**対象**: 実装者
|
||||||
|
**内容**:
|
||||||
|
- 概要(なぜ Ring Cache か)
|
||||||
|
- Ring Cache アーキテクチャ解説
|
||||||
|
- 実装状況確認方法
|
||||||
|
- テスト実施手順(Step 1-5)
|
||||||
|
- Baseline 測定
|
||||||
|
- C2/C3 Ring テスト
|
||||||
|
- **C4-C7 Ring テスト(推奨)** ← これを実施すること
|
||||||
|
- Combined テスト
|
||||||
|
- ENV変数リファレンス
|
||||||
|
- トラブルシューティング
|
||||||
|
- 成功基準
|
||||||
|
- 次のステップ
|
||||||
|
|
||||||
**Contents**:
|
**読む時間**: 10分
|
||||||
- TL;DR problem summary
|
**実施時間**: 30分~1時間
|
||||||
- TL;DR solution summary (5 phases)
|
**ファイル**: `/mnt/workdisk/public_share/hakmem/RING_CACHE_ACTIVATION_GUIDE.md`
|
||||||
- Quick reference tables
|
|
||||||
- Phase 1 quick start checklist
|
|
||||||
- Key metrics to track (before/after)
|
|
||||||
- Common FAQ section
|
|
||||||
- File organization diagram
|
|
||||||
- Next steps checklist
|
|
||||||
|
|
||||||
**Key Checklists**:
|
|
||||||
- Phase 1 (Tiny Free): 10-point implementation checklist
|
|
||||||
- Success criteria per phase
|
|
||||||
- Metrics to establish baseline
|
|
||||||
|
|
||||||
**When to Use**:
|
|
||||||
- Executive summary for stakeholders
|
|
||||||
- Quick review before meetings
|
|
||||||
- Team onboarding
|
|
||||||
- Daily progress tracking
|
|
||||||
- Decision-making checklist
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Quick Navigation
|
## クイックスタート
|
||||||
|
|
||||||
### By Role
|
### 最速で結果を見たい場合(5分)
|
||||||
|
|
||||||
**Technical Lead**:
|
```bash
|
||||||
1. Start: LARGE_FILES_QUICK_REFERENCE.md (overview)
|
# 1. このガイドを読む
|
||||||
2. Deep dive: LARGE_FILES_ANALYSIS.md (current state)
|
cat /mnt/workdisk/public_share/hakmem/RING_CACHE_ACTIVATION_GUIDE.md
|
||||||
3. Plan: LARGE_FILES_REFACTORING_PLAN.md (implementation)
|
|
||||||
|
|
||||||
**Developer**:
|
# 2. Baseline 測定
|
||||||
1. Start: LARGE_FILES_QUICK_REFERENCE.md (quick reference)
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
2. Checklist: Phase-specific section in REFACTORING_PLAN.md
|
|
||||||
3. Details: Relevant section in ANALYSIS.md
|
|
||||||
|
|
||||||
**Project Manager**:
|
# 3. Ring Cache C4-C7 有効化してテスト
|
||||||
1. Overview: LARGE_FILES_QUICK_REFERENCE.md (TL;DR)
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
2. Timeline: LARGE_FILES_REFACTORING_PLAN.md (phase breakdown)
|
export HAKMEM_TINY_HOT_RING_C4=128
|
||||||
3. Metrics: Metrics section in QUICK_REFERENCE.md
|
export HAKMEM_TINY_HOT_RING_C5=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C6=64
|
||||||
|
export HAKMEM_TINY_HOT_RING_C7=64
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
**Code Reviewer**:
|
# 期待結果: 19.4M → 22-25M ops/s (+13-29%)
|
||||||
1. Analysis: LARGE_FILES_ANALYSIS.md (current structure)
|
|
||||||
2. Refactoring: LARGE_FILES_REFACTORING_PLAN.md (expected changes)
|
|
||||||
3. Checklist: Success criteria in REFACTORING_PLAN.md
|
|
||||||
|
|
||||||
### By Priority
|
|
||||||
|
|
||||||
**CRITICAL READS** (required):
|
|
||||||
- LARGE_FILES_ANALYSIS.md - Detailed problem analysis
|
|
||||||
- LARGE_FILES_REFACTORING_PLAN.md - Implementation approach
|
|
||||||
|
|
||||||
**HIGHLY RECOMMENDED** (important):
|
|
||||||
- LARGE_FILES_QUICK_REFERENCE.md - Overview and checklists
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Key Statistics
|
|
||||||
|
|
||||||
### Current State (Before)
|
|
||||||
- Files over 1000 lines: 5
|
|
||||||
- Total lines in large files: 9,008 (28% of 32,175)
|
|
||||||
- Max file size: 2,592 lines
|
|
||||||
- Avg function size: 40-171 lines (extreme)
|
|
||||||
- Worst file: hakmem_tiny_free.inc (171 lines/function)
|
|
||||||
- Includes in worst file: 35 (hakmem_tiny.c)
|
|
||||||
|
|
||||||
### Target State (After)
|
|
||||||
- Files over 1000 lines: 0
|
|
||||||
- Files over 800 lines: 0
|
|
||||||
- Max file size: 800 lines (-69%)
|
|
||||||
- Avg function size: 25-35 lines (-60%)
|
|
||||||
- Includes per file: 5-8 (-80%)
|
|
||||||
- Compilation time: 2.5x faster
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Quick Start
|
|
||||||
|
|
||||||
### For Immediate Understanding
|
|
||||||
1. Read LARGE_FILES_QUICK_REFERENCE.md (10 min)
|
|
||||||
2. Review TL;DR sections in this index (5 min)
|
|
||||||
3. Review metrics comparison table (5 min)
|
|
||||||
|
|
||||||
### For Implementation Planning
|
|
||||||
1. Review LARGE_FILES_QUICK_REFERENCE.md Phase 1 checklist (5 min)
|
|
||||||
2. Read Phase 1 section in REFACTORING_PLAN.md (10 min)
|
|
||||||
3. Identify owner and schedule (5 min)
|
|
||||||
|
|
||||||
### For Technical Deep Dive
|
|
||||||
1. Read LARGE_FILES_ANALYSIS.md completely (40 min)
|
|
||||||
2. Review before/after dependency graphs in REFACTORING_PLAN.md (10 min)
|
|
||||||
3. Review code structure sections per file (20 min)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Summary of Files
|
|
||||||
|
|
||||||
| File | Lines | Functions | Avg/Func | Priority | Phase |
|
|
||||||
|------|-------|-----------|----------|----------|-------|
|
|
||||||
| hakmem_pool.c | 2,592 | 65 | 40 | CRITICAL | 2 |
|
|
||||||
| hakmem_tiny.c | 1,765 | 57 | 31 | CRITICAL | 3 |
|
|
||||||
| hakmem.c | 1,745 | 29 | 60 | HIGH | 4 |
|
|
||||||
| hakmem_tiny_free.inc | 1,711 | 10 | 171 | CRITICAL | 1 |
|
|
||||||
| hakmem_l25_pool.c | 1,195 | 39 | 31 | HIGH | 5 |
|
|
||||||
| **TOTAL** | **9,008** | **200** | **45** | - | - |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Implementation Roadmap
|
|
||||||
|
|
||||||
```
|
|
||||||
Week 1: Phase 1 - Split tiny_free.inc (3 days)
|
|
||||||
Phase 2 - Split pool.c starts (parallel)
|
|
||||||
|
|
||||||
Week 2: Phase 2 - Split pool.c (1 more day)
|
|
||||||
Phase 3 - Consolidate tiny.c starts
|
|
||||||
|
|
||||||
Week 3: Phase 3 - Consolidate tiny.c (1 more day)
|
|
||||||
Phase 4 - Split hakmem.c starts
|
|
||||||
|
|
||||||
Week 4: Phase 4 - Split hakmem.c
|
|
||||||
Phase 5 - Extract pool_core starts (parallel)
|
|
||||||
|
|
||||||
Week 5: Phase 5 - Extract pool_core (final polish)
|
|
||||||
Final testing and merge
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Parallel Work Possible**: Yes, with careful coordination
|
---
|
||||||
**Rollback Possible**: Yes, simple git revert per phase
|
|
||||||
**Risk Level**: LOW (changes isolated, APIs unchanged)
|
## ボトルネック要約
|
||||||
|
|
||||||
|
### 根本原因
|
||||||
|
Random Mixed が 23% で停滞している理由:
|
||||||
|
|
||||||
|
1. **Class切り替え多発**:
|
||||||
|
- Random Mixed は C2-C7 を均等に使用(16B-1040B)
|
||||||
|
- 毎iteration ごとに異なるクラスを処理
|
||||||
|
- TLS SLL(per-class)が複数classで頻繁に空になる
|
||||||
|
|
||||||
|
2. **最適化カバレッジ不足**:
|
||||||
|
- C0-C3: HeapV2 で 88-99% ヒット率 ✅
|
||||||
|
- **C4-C7: 最適化なし** ❌(Random Mixed の 50%)
|
||||||
|
- Ring Cache は実装済みだが **デフォルト OFF**
|
||||||
|
- HeapV2 拡張試験で効果薄(+0.3%)
|
||||||
|
|
||||||
|
3. **支配的ボトルネック**:
|
||||||
|
- SuperSlab refill: 50-200 cycles/回
|
||||||
|
- TLS SLL ポインタチェイス: 3 mem accesses
|
||||||
|
- Metadata 走査: 32 slab iteration
|
||||||
|
|
||||||
|
### 解決策
|
||||||
|
**Ring Cache C4-C7 有効化**:
|
||||||
|
- ポインタチェイス: 3 mem → 2 mem (-33%)
|
||||||
|
- キャッシュミス削減(配列アクセス)
|
||||||
|
- 既実装(有効化のみ)、低リスク
|
||||||
|
- **期待: +13-29%** (19.4M → 22-25M ops/s)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Success Criteria
|
## 推奨実施順序
|
||||||
|
|
||||||
### Phase Completion
|
### Phase 0: 理解
|
||||||
- All deliverable files created
|
1. RANDOM_MIXED_SUMMARY.md を読む(5分)
|
||||||
- Compilation succeeds without errors
|
2. なぜ C4-C7 が遅いかを理解
|
||||||
- Larson benchmark unchanged (±1%)
|
|
||||||
- No valgrind errors
|
|
||||||
- Code review approved
|
|
||||||
|
|
||||||
### Overall Success
|
### Phase 1: Baseline 測定
|
||||||
- 0 files over 1000 lines
|
1. RING_CACHE_ACTIVATION_GUIDE.md Step 1-2 を実施
|
||||||
- Max file size: 800 lines
|
2. 現在の性能 (19.4M ops/s) を確認
|
||||||
- Avg function size: 25-35 lines
|
|
||||||
- Compilation time: 60% improvement
|
### Phase 2: Ring Cache 有効化テスト
|
||||||
- Development speed: 3-6x faster for common tasks
|
1. RING_CACHE_ACTIVATION_GUIDE.md Step 4 を実施
|
||||||
|
2. C4-C7 Ring Cache を有効化
|
||||||
|
3. 性能向上を測定(目標: 22-25M ops/s)
|
||||||
|
|
||||||
|
### Phase 3: 詳細分析(必要に応じて)
|
||||||
|
1. RANDOM_MIXED_BOTTLENECK_ANALYSIS.md で深掘り
|
||||||
|
2. FrontMetrics で Ring hit rate 確認
|
||||||
|
3. 次の最適化への道筋を検討
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Next Steps
|
## 予想される性能向上パス
|
||||||
|
|
||||||
1. **Today**: Review this index + QUICK_REFERENCE.md
|
```
|
||||||
2. **Tomorrow**: Technical discussion + ANALYSIS.md review
|
Now: 19.4M ops/s (23.4% of system)
|
||||||
3. **Day 3**: Phase 1 implementation planning
|
↓
|
||||||
4. **Day 4**: Phase 1 begins (estimated 3 days)
|
Phase 21-1 (Ring C4/C7): 22-25M ops/s (25-28%) ← これを実施
|
||||||
5. **Day 7**: Phase 1 review + Phase 2 starts
|
↓
|
||||||
|
Phase 21-2 (Hot Slab): 25-30M ops/s (28-33%)
|
||||||
|
↓
|
||||||
|
Phase 21-3 (Minimal Meta): 28-35M ops/s (31-39%)
|
||||||
|
↓
|
||||||
|
Phase 12 (Shared SS Pool): 70-90M ops/s (70-90%) 🎯
|
||||||
|
```
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Document Glossary
|
## 関連ファイル
|
||||||
|
|
||||||
**Phase**: A 2-4 day work item splitting one or more large files
|
### 実装ファイル
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/front/tiny_ring_cache.h` - Ring Cache header
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/front/tiny_ring_cache.c` - Ring Cache impl
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast.inc.h` - Alloc fast path
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h` - TLS SLL API
|
||||||
|
|
||||||
**Deliverable**: Specific file(s) to be created or modified in a phase
|
### 参考ドキュメント
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/CURRENT_TASK.md` - Phase 21-22 計画
|
||||||
**Metric**: Quantifiable measure (lines, complexity, time)
|
- `/mnt/workdisk/public_share/hakmem/bench_random_mixed.c` - ベンチマーク実装
|
||||||
|
|
||||||
**Responsibility**: A distinct task or subsystem within a file
|
|
||||||
|
|
||||||
**Cohesion**: How closely related functions are within a module
|
|
||||||
|
|
||||||
**Coupling**: How dependent a module is on other modules
|
|
||||||
|
|
||||||
**Cyclomatic Complexity**: Number of independent code paths (lower is better)
|
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Document Metadata
|
## チェックリスト
|
||||||
|
|
||||||
- **Created**: 2025-11-06
|
- [ ] RANDOM_MIXED_SUMMARY.md を読む
|
||||||
- **Last Updated**: 2025-11-06
|
- [ ] RING_CACHE_ACTIVATION_GUIDE.md を読む
|
||||||
- **Status**: COMPLETE
|
- [ ] Baseline を測定 (19.4M ops/s 確認)
|
||||||
- **Review Status**: Ready for technical review
|
- [ ] Ring Cache C4-C7 を有効化
|
||||||
- **Implementation Status**: Ready for Phase 1 kickoff
|
- [ ] テスト実施 (22-25M ops/s 目標)
|
||||||
|
- [ ] 結果が目標値を達成したら ✓ 成功!
|
||||||
|
- [ ] 詳細分析が必要ならば RANDOM_MIXED_BOTTLENECK_ANALYSIS.md を参照
|
||||||
|
- [ ] Phase 21-2 計画に進む
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Contact & Questions
|
**準備完了。実施をお待ちしています。**
|
||||||
|
|
||||||
For questions about the analysis:
|
|
||||||
1. Review the relevant document above
|
|
||||||
2. Check FAQ section in QUICK_REFERENCE.md
|
|
||||||
3. Refer to corresponding phase in REFACTORING_PLAN.md
|
|
||||||
|
|
||||||
For implementation support:
|
|
||||||
- Use phase-specific checklists
|
|
||||||
- Follow week-by-week breakdown
|
|
||||||
- Reference success criteria
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
Generated by: Large Files Analysis System
|
|
||||||
Repository: /mnt/workdisk/public_share/hakmem
|
|
||||||
Codebase: HAKMEM Memory Allocator
|
|
||||||
|
|||||||
318
CURRENT_TASK.md
318
CURRENT_TASK.md
@ -44,6 +44,244 @@
|
|||||||
|
|
||||||
### 2.1 Fixed-size Tiny ベンチ(HAKMEM vs System)
|
### 2.1 Fixed-size Tiny ベンチ(HAKMEM vs System)
|
||||||
|
|
||||||
|
**Phase 21-1: Ring Cache Implementation (C2/C3/C5) (2025-11-16)** 🎯
|
||||||
|
- **Goal**: Eliminate pointer chasing in TLS SLL by using array-based ring buffer cache
|
||||||
|
- **Strategy**: 3-layer hierarchy (Ring L0 → SLL L1 → SuperSlab L2)
|
||||||
|
- **Implementation**:
|
||||||
|
- Added `TinyRingCache` struct with power-of-2 ring buffer (128 slots default)
|
||||||
|
- Implemented `ring_cache_pop/push` for ultra-fast alloc/free (1-2 instructions)
|
||||||
|
- Extended to C2 (32B), C3 (64B), C5 (256B) size classes
|
||||||
|
- ENV variables: `HAKMEM_TINY_HOT_RING_ENABLE=1`, `HAKMEM_TINY_HOT_RING_C2/C3/C5=128`
|
||||||
|
- **Results** (`bench_random_mixed_hakmem 500K, 256B workload`):
|
||||||
|
- **Baseline** (Ring OFF): 20.18M ops/s
|
||||||
|
- **C2/C3 Ring**: 21.15M ops/s (**+4.8%** improvement) ✅
|
||||||
|
- **C2/C3/C5 Ring**: 21.18M ops/s (**+5.0%** total improvement) ✅
|
||||||
|
- **Analysis**:
|
||||||
|
- C2/C3 provide most of the gain (small sizes are hottest)
|
||||||
|
- C5 addition provides marginal benefit (+0.03M ops/s)
|
||||||
|
- Implementation complete and stable
|
||||||
|
- **Files Modified**:
|
||||||
|
- `core/front/tiny_ring_cache.h/c` - Ring buffer implementation
|
||||||
|
- `core/tiny_alloc_fast.inc.h` - Alloc path integration
|
||||||
|
- `core/tiny_free_fast_v2.inc.h` - Free path integration (line 154-160)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Phase 21-1-D: Ring Cache Default ON (2025-11-16)** 🚀
|
||||||
|
- **Goal**: Enable Ring Cache by default for production use (remove ENV gating)
|
||||||
|
- **Implementation**: 1-line change in `core/front/tiny_ring_cache.h:72`
|
||||||
|
- Changed logic: `g_enable = (e && *e == '0') ? 0 : 1; // DEFAULT: ON`
|
||||||
|
- ENV=0 disables, ENV unset or ENV=1 enables
|
||||||
|
- **Results** (`bench_random_mixed_hakmem 500K, 256B workload, 3-run average`):
|
||||||
|
- **Ring ON** (default): **20.31M ops/s** (baseline)
|
||||||
|
- **Ring OFF** (ENV=0): 19.30M ops/s
|
||||||
|
- **Improvement**: **+5.2%** (+1.01M ops/s) ✅
|
||||||
|
- **Impact**: Ring Cache now active in all builds without manual ENV configuration
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Performance Bottleneck Analysis (Task-sensei Report, 2025-11-16)** 🔍
|
||||||
|
|
||||||
|
**Root Cause: Cache Misses (6.6x worse than System malloc)**
|
||||||
|
- **L1 D-cache miss rate**: HAKMEM 5.15% vs System 0.78% → **6.6x higher**
|
||||||
|
- **IPC (instructions/cycle)**: HAKMEM 0.52 vs System 1.43 → **2.75x worse**
|
||||||
|
- **Branch miss rate**: HAKMEM 11.86% vs System 4.77% → **2.5x higher**
|
||||||
|
- **Per-operation cost**: HAKMEM **8-10 cache misses** vs System **2-3 cache misses**
|
||||||
|
|
||||||
|
**Problem: 4-5 Layer Frontend Cascade**
|
||||||
|
```
|
||||||
|
Random Mixed allocation flow:
|
||||||
|
Ring (L0) miss → FastCache (L1) miss → SFC (L2) miss → TLS SLL (L3) miss → SuperSlab refill (L4)
|
||||||
|
= 8-10 cache misses per allocation (each layer = 2 misses: head + next pointer)
|
||||||
|
```
|
||||||
|
|
||||||
|
**System malloc tcache: 2-3 cache misses (single-layer array-based bins)**
|
||||||
|
|
||||||
|
**Improvement Roadmap** (Target: 48-77M ops/s, System比 53-86%):
|
||||||
|
1. **P1 (Done)**: Ring Cache default ON → **+5.2%** (20.3M ops/s) ✅
|
||||||
|
2. **P2 (Next)**: Unified Frontend Cache (flatten 4-5 layers → 1 layer) → **+50-100%** (30-40M expected)
|
||||||
|
3. **P3**: Adaptive refill optimization → **+20-30%**
|
||||||
|
4. **P4**: Branchless dispatch table → **+10-15%**
|
||||||
|
5. **P5**: Metadata locality optimization → **+15-20%**
|
||||||
|
|
||||||
|
**Conservative Target**: 48M ops/s (+136% vs current, 53% of System)
|
||||||
|
**Optimistic Target**: 77M ops/s (+279% vs current, 86% of System)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Phase 22: Lazy Per-Class Initialization (2025-11-16)** 🚀
|
||||||
|
- **Goal**: Cold-start page faultを削減 (ChatGPT分析: `hak_tiny_init()` → 94.94% of page faults)
|
||||||
|
- **Strategy**: Eager init (全8クラス初期化) → Lazy init (使用クラスのみ初期化)
|
||||||
|
- **Results** (`bench_random_mixed_hakmem 500K, 256B workload`):
|
||||||
|
- **Cold-start**: 18.1M ops/s (Phase 21-1: 16.2M) → **+12% improvement** ✅
|
||||||
|
- **Steady-state**: 25.5M ops/s (Phase 21-1: 26.1M) → -2.3% (誤差範囲)
|
||||||
|
- **Key Achievement**: `hak_tiny_init.part.0` 完全削除、未使用クラスのpage touchを回避
|
||||||
|
- **Remaining Bottleneck**: SuperSlab allocation時の`memset` page fault (42.40%)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**📊 PERFORMANCE MAP (2025-11-16) - 全体性能俯瞰** 🗺️
|
||||||
|
|
||||||
|
ベンチマーク自動化スクリプト: `scripts/bench_performance_map.sh`
|
||||||
|
最新結果: `bench_results/performance_map/20251116_095827/`
|
||||||
|
|
||||||
|
### 🎯 固定サイズ (16-1024B) - Tiny層の現実
|
||||||
|
|
||||||
|
| Size | System | HAKMEM | Ratio | Status |
|
||||||
|
|------|--------|--------|-------|--------|
|
||||||
|
| 16B | 118.6M | 50.0M | 42.2% | ❌ Slow |
|
||||||
|
| 32B | 103.3M | 49.3M | 47.7% | ❌ Slow |
|
||||||
|
| 64B | 104.3M | 49.2M | 47.1% | ❌ Slow |
|
||||||
|
| **128B** | **74.0M** | **51.8M** | **70.0%** | **⚠️ Gap** ✨ |
|
||||||
|
| 256B | 115.7M | 36.2M | 31.3% | ❌ Slow |
|
||||||
|
| 512B | 103.5M | 41.5M | 40.1% | ❌ Slow |
|
||||||
|
| 1024B| 96.0M | 47.8M | 49.8% | ❌ Slow |
|
||||||
|
|
||||||
|
**発見**:
|
||||||
|
- **128Bのみ 70%** (唯一Gap範囲) - 他は全て50%未満
|
||||||
|
- **256Bが最悪 31.3%** - Phase 22で18.1M → 36.2Mに改善したが、systemの1/3に留まる
|
||||||
|
- **小サイズ (16-64B) 42-47%** - UltraHot経由でも system の半分
|
||||||
|
|
||||||
|
### 🌀 Random Mixed (128B-1KB)
|
||||||
|
|
||||||
|
| Allocator | ops/s | vs System |
|
||||||
|
|-----------|--------|-----------|
|
||||||
|
| System | 90.2M | 100% (baseline) |
|
||||||
|
| **Mimalloc** | **117.5M** | **130%** 🏆 (systemより速い!) |
|
||||||
|
| **HAKMEM** | **21.1M** | **23.4%** ❌ (mimallocの1/5.5) |
|
||||||
|
|
||||||
|
**衝撃的発見**:
|
||||||
|
- Mimallocは system より 30%速い
|
||||||
|
- HAKMEMは mimalloc の **1/5.5** - 巨大なギャップ
|
||||||
|
|
||||||
|
### 💥 CRITICAL ISSUES - Mid-Large / MT層が完全破壊
|
||||||
|
|
||||||
|
**Mid-Large MT (8-32KB)**: ❌ **CRASHED** (コアダンプ)
|
||||||
|
- **原因**: `hkm_ace_alloc` が 33KB allocation で NULL返却
|
||||||
|
- **結果**: `free(): invalid pointer` → クラッシュ
|
||||||
|
- **Mimalloc**: 40.2M ops/s (system の 449%!)
|
||||||
|
- **HAKMEM**: 0 ops/s (動作不能)
|
||||||
|
|
||||||
|
**VM Mixed**: ❌ **CRASHED** (コアダンプ)
|
||||||
|
- System: 957K ops/s
|
||||||
|
- HAKMEM: 0 ops/s
|
||||||
|
|
||||||
|
**Larson (MT churn)**: ❌ **SEGV**
|
||||||
|
- System: 3.4M ops/s
|
||||||
|
- Mimalloc: 3.4M ops/s
|
||||||
|
- HAKMEM: 0 ops/s
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**🔧 Mid-Large Crash FIX (2025-11-16)** ✅
|
||||||
|
|
||||||
|
**Root Cause (ChatGPT分析)**:
|
||||||
|
- `classify_ptr()` が AllocHeader (Mid/Large mmap allocations) をチェックしていない
|
||||||
|
- Free wrapper が `PTR_KIND_MID_LARGE` ケースを処理していない
|
||||||
|
- 結果: Mid-Large ポインタが `PTR_KIND_UNKNOWN` → `__libc_free()` → `free(): invalid pointer`
|
||||||
|
|
||||||
|
**修正内容**:
|
||||||
|
1. **`classify_ptr()` に AllocHeader チェック追加** (`core/box/front_gate_classifier.c:256-271`)
|
||||||
|
- `hak_header_from_user()` + `hak_header_validate()` で HAKMEM_MAGIC 確認
|
||||||
|
- `ALLOC_METHOD_MMAP/POOL/L25_POOL` → `PTR_KIND_MID_LARGE` 返却
|
||||||
|
2. **Free wrapper に `PTR_KIND_MID_LARGE` ケース追加** (`core/box/hak_wrappers.inc.h:181`)
|
||||||
|
- `is_hakmem_owned = 1` で HAKMEM 管轄として処理
|
||||||
|
|
||||||
|
**修正結果**:
|
||||||
|
- **Mid-Large MT (8-32KB)**: 0 → **10.5M ops/s** (System 8.7M = **120%**) 🏆
|
||||||
|
- **VM Mixed**: 0 → **285K ops/s** (System 939K = 30.4%)
|
||||||
|
- ✅ クラッシュ完全解消、Mid-Large で system malloc を **20% 上回る**
|
||||||
|
|
||||||
|
**残存課題**:
|
||||||
|
- ❌ **random_mixed**: SEGV (AllocHeader読み込みでページ境界越え)
|
||||||
|
- ❌ **Larson**: SEGV継続 (Tiny 8-128B 領域、別原因)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**🔧 random_mixed Crash FIX (2025-11-16)** ✅
|
||||||
|
|
||||||
|
**Root Cause**:
|
||||||
|
- Mid-Large fix で追加した `classify_ptr()` の AllocHeader check が unsafe
|
||||||
|
- AllocHeader = 40 bytes → `ptr - 40` がページ境界越えると SEGV
|
||||||
|
- 例: `ptr = 0x7ffff6a00000` (page-aligned) → header at `0x7ffff69fffd8` (別ページ、unmapped)
|
||||||
|
|
||||||
|
**修正内容** (`core/box/front_gate_classifier.c:263-266`):
|
||||||
|
```c
|
||||||
|
// Safety check: Need at least HEADER_SIZE (40 bytes) before ptr
|
||||||
|
uintptr_t offset_in_page_for_hdr = (uintptr_t)ptr & 0xFFF;
|
||||||
|
if (offset_in_page_for_hdr >= HEADER_SIZE) {
|
||||||
|
// Safe to read AllocHeader (won't cross page boundary)
|
||||||
|
AllocHeader* hdr = hak_header_from_user(ptr);
|
||||||
|
...
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**修正結果**:
|
||||||
|
- **random_mixed**: SEGV → **1.92M ops/s** ✅
|
||||||
|
- ✅ Single-thread workloads 完全修復
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**🔧 Larson MT Crash FIX (2025-11-16)** ✅
|
||||||
|
|
||||||
|
**2-Layer Problem Structure**:
|
||||||
|
|
||||||
|
**Layer 1: Cross-thread Free (TLS SLL Corruption)**
|
||||||
|
- **Root Cause**: Block allocated by Thread A, freed by Thread B → pushed to B's TLS SLL
|
||||||
|
- B allocates the block → metadata still points to A's SuperSlab → corruption
|
||||||
|
- Poison values (0xbada55bada55bada) in TLS SLL → SEGV in `tiny_alloc_fast()`
|
||||||
|
- **Fix** (`core/tiny_free_fast_v2.inc.h:176-205`):
|
||||||
|
- Made cross-thread check **ALWAYS ON** (removed ENV gating)
|
||||||
|
- Check `owner_tid_low` on every free, route cross-thread to remote queue via `tiny_free_remote_box()`
|
||||||
|
- **Status**: ✅ **FIXED** - TLS SLL corruption eliminated
|
||||||
|
|
||||||
|
**Layer 2: SP Metadata Capacity Limit**
|
||||||
|
- **Root Cause**: `[SP_META_CAPACITY_ERROR] Exceeded MAX_SS_METADATA_ENTRIES=2048`
|
||||||
|
- Larson rapid churn workload → 2048+ SuperSlabs → registry exhaustion → hang
|
||||||
|
- **Fix** (`core/hakmem_shared_pool.h:122-126`):
|
||||||
|
- Increased `MAX_SS_METADATA_ENTRIES` from 2048 → **8192** (4x capacity)
|
||||||
|
- **Status**: ✅ **FIXED** - Larson completes successfully
|
||||||
|
|
||||||
|
**Results** (10 seconds, 4 threads):
|
||||||
|
- **Before**: 4.2TB virtual memory, 65,531 mappings, indefinite hang (kill -9 required)
|
||||||
|
- **After**: 6.7GB virtual (-99.84%), 424MB RSS, completes in 10-18 seconds
|
||||||
|
- **Throughput**: 7,387-8,499 ops/s (0.014% of system malloc 60.6M)
|
||||||
|
|
||||||
|
**Layer 3: Performance Optimization (IN PROGRESS)**
|
||||||
|
- Cross-thread check adds SuperSlab lookup on every free (20-50 cycles overhead)
|
||||||
|
- **Drain Interval Tuning** (2025-11-16):
|
||||||
|
- Baseline (drain=2048): 7,663 ops/s
|
||||||
|
- Moderate (drain=1024): **8,514 ops/s** (+11.1%) ✅
|
||||||
|
- Aggressive (drain=512): Core dump ❌ (too aggressive, causes crash)
|
||||||
|
- **Recommendation**: `export HAKMEM_TINY_SLL_DRAIN_INTERVAL=1024` for stable +11% gain
|
||||||
|
- **Remaining Work**: LRU policy tuning (MAX_CACHED, MAX_MEMORY_MB, TTL_SEC)
|
||||||
|
- Goal: Improve from 0.014% → 80% of system malloc (currently 0.015% with drain=1024)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 📈 Summary (Performance Map 2025-11-16 17:15)
|
||||||
|
|
||||||
|
**修正後の全体結果**:
|
||||||
|
- ✅ Competitive (≥80%): **0/10 benchmarks** (0%)
|
||||||
|
- ⚠️ Gap (50-80%): **1/10 benchmarks** (10%) ← 64B固定のみ 53.6%
|
||||||
|
- ❌ Slow (<50%): **9/10 benchmarks** (90%)
|
||||||
|
|
||||||
|
**主要ベンチマーク**:
|
||||||
|
1. **Fixed-size (16-1024B)**: 38.5-53.6% of system (64B が最良)
|
||||||
|
2. **Random Mixed (128-1KB)**: **19.4M ops/s** (24.0% of system)
|
||||||
|
3. **Mid-Large MT (8-32KB)**: **891K ops/s** (12.1% of system, crash 修正済み ✅)
|
||||||
|
4. **VM Mixed**: **275K ops/s** (30.7% of system, crash 修正済み ✅)
|
||||||
|
5. **Larson (MT churn)**: **7.4-8.5K ops/s** (0.014% of system, crash 修正済み ✅, 性能最適化は Layer 3 で対応予定)
|
||||||
|
|
||||||
|
**優先課題 (2025-11-16 更新)**:
|
||||||
|
1. ✅ **完了**: Mid-Large crash 修復 (classify_ptr + AllocHeader check)
|
||||||
|
2. ✅ **完了**: VM Mixed crash 修復 (Mid-Large fix で解消)
|
||||||
|
3. ✅ **完了**: random_mixed crash 修復 (page boundary check)
|
||||||
|
4. 🔴 **P0**: Larson SP metadata limit 拡大 (2048 → 4096-8192)
|
||||||
|
5. 🟡 **P1**: Fixed-size 性能改善 (38-53% → 目標 80%+)
|
||||||
|
6. 🟡 **P1**: Random Mixed 性能改善 (24% → 目標 80%+)
|
||||||
|
7. 🟡 **P1**: Mid-Large MT 性能改善 (12% → 目標 80%+, mimalloc 449%が参考値)
|
||||||
|
|
||||||
`bench_fixed_size_hakmem` / `bench_fixed_size_system`(workset=128, 500K iterations 相当)
|
`bench_fixed_size_hakmem` / `bench_fixed_size_system`(workset=128, 500K iterations 相当)
|
||||||
|
|
||||||
| Size | HAKMEM (Phase 15) | System malloc | 比率 |
|
| Size | HAKMEM (Phase 15) | System malloc | 比率 |
|
||||||
@ -940,3 +1178,83 @@ Phase 21-3 (Minimal Meta Access):
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## HAKMEM ハング問題調査 (2025-11-16)
|
||||||
|
|
||||||
|
### 症状
|
||||||
|
1. `bench_fixed_size_hakmem 1 16 128` → 5秒以上ハング
|
||||||
|
2. `bench_random_mixed_hakmem 500000 256 42` → キルされた
|
||||||
|
|
||||||
|
### Root Cause
|
||||||
|
**Cross-thread check の always-on 化** (直前の修正)
|
||||||
|
- `core/tiny_free_fast_v2.inc.h:175-204` で ENV ゲート削除
|
||||||
|
- Single-thread でも毎回 SuperSlab lookup 実行
|
||||||
|
|
||||||
|
### ハング箇所の推定 (確度順)
|
||||||
|
|
||||||
|
| 箇所 | ファイル:行 | 原因 | 確度 |
|
||||||
|
|------|-----------|------|------|
|
||||||
|
| `hak_super_lookup()` registry probing | `core/hakmem_super_registry.h:119-187` | 線形探索 32-64 iterations / free | **高** |
|
||||||
|
| Node pool exhausted fallback | `core/hakmem_shared_pool.c:394-400` | sp_freelist_push_lockfree fallback の unsafe | 中 |
|
||||||
|
| `tls_sll_push()` CAS loop | `core/box/tls_sll_box.h:75-184` | 単純実装、無限ループはなさそう | 低 |
|
||||||
|
|
||||||
|
### パフォーマンス影響
|
||||||
|
|
||||||
|
```
|
||||||
|
Before (header-based): 5-10 cycles/free
|
||||||
|
After (cross-thread): 110-520 cycles/free (11-51倍遅い!)
|
||||||
|
|
||||||
|
500K iterations:
|
||||||
|
500K × 200 cycles = 100M cycles @ 3GHz = 33ms
|
||||||
|
→ Overhead は大きいが単なる遅さ?
|
||||||
|
```
|
||||||
|
|
||||||
|
### Node pool exhausted の真実
|
||||||
|
|
||||||
|
- `MAX_FREE_NODES_PER_CLASS = 4096`
|
||||||
|
- 500K iterations > 4096 → exhausted ⚠️
|
||||||
|
- しかし fallback (`sp_freelist_push()`) は lock-free で安全
|
||||||
|
- **副作用であり、直接的ハング原因ではない可能性高い**
|
||||||
|
|
||||||
|
### 推奨修正
|
||||||
|
|
||||||
|
✅ **ENV ゲートで cross-thread check を復活**
|
||||||
|
```c
|
||||||
|
// core/tiny_free_fast_v2.inc.h:175
|
||||||
|
static int g_larson_fix = -1;
|
||||||
|
if (__builtin_expect(g_larson_fix == -1, 0)) {
|
||||||
|
const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
|
||||||
|
g_larson_fix = (e && *e && *e != '0') ? 1 : 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (__builtin_expect(g_larson_fix, 0)) {
|
||||||
|
// Cross-thread check - only for MT
|
||||||
|
SuperSlab* ss = hak_super_lookup(base);
|
||||||
|
// ... rest of check
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**利点:**
|
||||||
|
- Single-thread ベンチ: 5-10 cycles (fast)
|
||||||
|
- Larson MT: `HAKMEM_TINY_LARSON_FIX=1` で有効 (safe)
|
||||||
|
|
||||||
|
### 検証コマンド
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# 1. ハング確認
|
||||||
|
timeout 5 ./out/release/bench_fixed_size_hakmem 1 16 128
|
||||||
|
echo $? # 124 = timeout
|
||||||
|
|
||||||
|
# 2. 修正後確認
|
||||||
|
HAKMEM_TINY_LARSON_FIX=0 ./out/release/bench_fixed_size_hakmem 1 16 128
|
||||||
|
# Should complete fast
|
||||||
|
|
||||||
|
# 3. 500K テスト
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42 2>&1 | grep "Node pool"
|
||||||
|
# Output: [P0-4 WARN] Node pool exhausted for class 7
|
||||||
|
```
|
||||||
|
|
||||||
|
### 詳細レポート
|
||||||
|
- **HANG分析**: `/tmp/HAKMEM_HANG_INVESTIGATION_FINAL.md`
|
||||||
|
|||||||
8
Makefile
8
Makefile
@ -190,12 +190,12 @@ LDFLAGS += $(EXTRA_LDFLAGS)
|
|||||||
|
|
||||||
# Targets
|
# Targets
|
||||||
TARGET = test_hakmem
|
TARGET = test_hakmem
|
||||||
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/front/tiny_ring_cache.o core/link_stubs.o core/tiny_failfast.o test_hakmem.o
|
OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/pagefault_telemetry_box.o core/front/tiny_ring_cache.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o test_hakmem.o
|
||||||
OBJS = $(OBJS_BASE)
|
OBJS = $(OBJS_BASE)
|
||||||
|
|
||||||
# Shared library
|
# Shared library
|
||||||
SHARED_LIB = libhakmem.so
|
SHARED_LIB = libhakmem.so
|
||||||
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o hakmem_tiny_superslab_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/free_local_box_shared.o core/box/free_remote_box_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/bench_fast_box_shared.o core/front/tiny_ring_cache_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_mid_mt_shared.o hakmem_super_registry_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o
|
SHARED_OBJS = hakmem_shared.o hakmem_config_shared.o hakmem_tiny_config_shared.o hakmem_ucb1_shared.o hakmem_bigcache_shared.o hakmem_pool_shared.o hakmem_l25_pool_shared.o hakmem_site_rules_shared.o hakmem_tiny_shared.o hakmem_tiny_superslab_shared.o hakmem_smallmid_shared.o core/box/superslab_expansion_box_shared.o core/box/integrity_box_shared.o core/box/mailbox_box_shared.o core/box/front_gate_box_shared.o core/box/free_local_box_shared.o core/box/free_remote_box_shared.o core/box/free_publish_box_shared.o core/box/capacity_box_shared.o core/box/carve_push_box_shared.o core/box/prewarm_box_shared.o core/box/bench_fast_box_shared.o core/front/tiny_ring_cache_shared.o core/front/tiny_unified_cache_shared.o tiny_sticky_shared.o tiny_remote_shared.o tiny_publish_shared.o tiny_debug_ring_shared.o hakmem_tiny_magazine_shared.o hakmem_tiny_stats_shared.o hakmem_tiny_sfc_shared.o hakmem_tiny_query_shared.o hakmem_tiny_rss_shared.o hakmem_tiny_registry_shared.o hakmem_tiny_remote_target_shared.o hakmem_tiny_bg_spill_shared.o tiny_adaptive_sizing_shared.o hakmem_mid_mt_shared.o hakmem_super_registry_shared.o hakmem_elo_shared.o hakmem_batch_shared.o hakmem_p2_shared.o hakmem_sizeclass_dist_shared.o hakmem_evo_shared.o hakmem_debug_shared.o hakmem_sys_shared.o hakmem_whale_shared.o hakmem_policy_shared.o hakmem_ace_shared.o hakmem_ace_stats_shared.o hakmem_ace_controller_shared.o hakmem_ace_metrics_shared.o hakmem_ace_ucb1_shared.o hakmem_prof_shared.o hakmem_learner_shared.o hakmem_size_hist_shared.o hakmem_learn_log_shared.o hakmem_syscall_shared.o tiny_fastcache_shared.o
|
||||||
|
|
||||||
# Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
|
# Pool TLS Phase 1 (enable with POOL_TLS_PHASE1=1)
|
||||||
ifeq ($(POOL_TLS_PHASE1),1)
|
ifeq ($(POOL_TLS_PHASE1),1)
|
||||||
@ -222,7 +222,7 @@ endif
|
|||||||
# Benchmark targets
|
# Benchmark targets
|
||||||
BENCH_HAKMEM = bench_allocators_hakmem
|
BENCH_HAKMEM = bench_allocators_hakmem
|
||||||
BENCH_SYSTEM = bench_allocators_system
|
BENCH_SYSTEM = bench_allocators_system
|
||||||
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/front/tiny_ring_cache.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o
|
BENCH_HAKMEM_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/pagefault_telemetry_box.o core/front/tiny_ring_cache.o core/front/tiny_unified_cache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o bench_allocators_hakmem.o
|
||||||
BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
|
BENCH_HAKMEM_OBJS = $(BENCH_HAKMEM_OBJS_BASE)
|
||||||
ifeq ($(POOL_TLS_PHASE1),1)
|
ifeq ($(POOL_TLS_PHASE1),1)
|
||||||
BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
BENCH_HAKMEM_OBJS += pool_tls.o pool_refill.o pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
||||||
@ -399,7 +399,7 @@ test-box-refactor: box-refactor
|
|||||||
./larson_hakmem 10 8 128 1024 1 12345 4
|
./larson_hakmem 10 8 128 1024 1 12345 4
|
||||||
|
|
||||||
# Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
|
# Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
|
||||||
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/front/tiny_ring_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/link_stubs.o core/tiny_failfast.o
|
TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o hakmem_tiny_superslab.o hakmem_smallmid.o hakmem_smallmid_superslab.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_local_box.o core/box/free_remote_box.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/unified_batch_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/pagefault_telemetry_box.o core/front/tiny_ring_cache.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_mid_mt.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o
|
||||||
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
|
TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
|
||||||
ifeq ($(POOL_TLS_PHASE1),1)
|
ifeq ($(POOL_TLS_PHASE1),1)
|
||||||
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
|
||||||
|
|||||||
412
RANDOM_MIXED_BOTTLENECK_ANALYSIS.md
Normal file
412
RANDOM_MIXED_BOTTLENECK_ANALYSIS.md
Normal file
@ -0,0 +1,412 @@
|
|||||||
|
# Random Mixed (128-1KB) ボトルネック分析レポート
|
||||||
|
|
||||||
|
**Analyzed**: 2025-11-16
|
||||||
|
**Performance Gap**: 19.4M ops/s → 23.4% of System (目標: 80%)
|
||||||
|
**Analysis Depth**: Architecture review + Code tracing + Performance pathfinding
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Executive Summary
|
||||||
|
|
||||||
|
Random Mixed が 23% で停滞している根本原因は、**複数の最適化層が C2-C7(64B-1KB)の異なるクラスに部分的にしか適用されていない** ことです。Fixed-size 256B (40.3M ops/s) との性能差から、**class切り替え頻度と、各クラスの最適化カバレッジ不足** が支配的ボトルネックです。
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Cycles 分布分析
|
||||||
|
|
||||||
|
### 1.1 レイヤー別コスト推定
|
||||||
|
|
||||||
|
| Layer | Target Classes | Hit Rate | Cycles | Assessment |
|
||||||
|
|-------|---|---|---|---|
|
||||||
|
| **HeapV2** | C0-C3 (8-64B) | 88-99% ✅ | **Low (2-3)** | Working well |
|
||||||
|
| **Ring Cache** | C2-C3 only | 0% (OFF) ❌ | N/A | Not enabled |
|
||||||
|
| **TLS SLL** | C0-C7 (全) | 0.7-2.7% | **Medium (8-12)** | Fallback only |
|
||||||
|
| **SuperSlab refill** | All classes | ~2-5% miss | **High (50-200)** | Dominant cost |
|
||||||
|
| **UltraHot** | C1-C2 | 11.7% | Medium | Disabled (Phase 19) |
|
||||||
|
|
||||||
|
### 1.2 支配的ボトルネック: SuperSlab Refill
|
||||||
|
|
||||||
|
**理由**:
|
||||||
|
1. **Refill頻度**: Random Mixed では class切り替え多発 → TLS SLL が複数クラスで頻繁に空になる
|
||||||
|
2. **Class-specific carving**: SuperSlab内の各slabは「1クラス専用」→ C4/C5/C6/C7 では carving/batch overhead が相対的に大きい
|
||||||
|
3. **Metadata access**: SuperSlab → TinySlabMeta → carving → SLL push の連鎖で 50-200 cycles
|
||||||
|
|
||||||
|
**Code Path** (`core/tiny_alloc_fast.inc.h:386-450` + `core/hakmem_tiny_refill_p0.inc.h`):
|
||||||
|
```
|
||||||
|
tiny_alloc_fast_pop() miss
|
||||||
|
↓
|
||||||
|
tiny_alloc_fast_refill() called
|
||||||
|
↓
|
||||||
|
sll_refill_batch_from_ss() or sll_refill_small_from_ss()
|
||||||
|
↓
|
||||||
|
hak_super_registry lookup (linear search)
|
||||||
|
↓
|
||||||
|
SuperSlab -> TinySlabMeta[] iteration (32 slabs)
|
||||||
|
↓
|
||||||
|
carve_batch_from_slab() (write multiple fields)
|
||||||
|
↓
|
||||||
|
tls_sll_push() (chain push)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 1.3 ボトルネック確定
|
||||||
|
|
||||||
|
**最優先**: **SuperSlab refill コスト** (50-200 cycles/refill)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. FrontMetrics 状況確認
|
||||||
|
|
||||||
|
### 2.1 実装状況
|
||||||
|
|
||||||
|
✅ **実装完了** (`core/box/front_metrics_box.{h,c}`)
|
||||||
|
|
||||||
|
**Current Status** (Phase 19-4):
|
||||||
|
- HeapV2: C0-C3 で 88-99% ヒット率 → 本命層として機能中
|
||||||
|
- UltraHot: デフォルト OFF (Phase 19-4 で +12.9% 改善のため削除)
|
||||||
|
- FC/SFC: 実質 OFF
|
||||||
|
- TLS SLL: Fallback のみ (0.7-2.7%)
|
||||||
|
|
||||||
|
### 2.2 Fixed vs Random Mixed の構造的違い
|
||||||
|
|
||||||
|
| 側面 | Fixed 256B | Random Mixed |
|
||||||
|
|------|---|---|
|
||||||
|
| **使用クラス** | C5 のみ (100%) | C3, C5, C6, C7 (混在) |
|
||||||
|
| **Class切り替え** | 0 (固定) | 頻繁 (各iteration) |
|
||||||
|
| **HeapV2適用** | C5 には非適用 ❌ | C0-C3 のみ適用 (部分) |
|
||||||
|
| **TLS SLL hit率** | High (C5は SLL頼り) | Low (複数class混在) |
|
||||||
|
| **Refill頻度** | 低い (C5 warm) | **高い (class ごとに空)** |
|
||||||
|
|
||||||
|
### 2.3 「死んでいる層」の候補
|
||||||
|
|
||||||
|
**C4-C7 (128B-1KB) に対する最適化が極度に不足**:
|
||||||
|
|
||||||
|
| Class | Size | Ring | HeapV2 | UltraHot | Coverage |
|
||||||
|
|-------|---|---|---|---|---|
|
||||||
|
| C0 | 8B | ❌ | ✅ | ❌ | 1/3 |
|
||||||
|
| C1 | 16B | ❌ | ✅ | ❌ (OFF) | 1/3 |
|
||||||
|
| C2 | 32B | ❌ (OFF) | ✅ | ❌ (OFF) | 1/3 |
|
||||||
|
| C3 | 64B | ❌ (OFF) | ✅ | ❌ (OFF) | 1/3 |
|
||||||
|
| **C4** | **128B** | ❌ | ❌ | ❌ | **0/3** ← 完全未最適化 |
|
||||||
|
| **C5** | **256B** | ❌ | ❌ | ❌ | **0/3** ← 完全未最適化 |
|
||||||
|
| **C6** | **512B** | ❌ | ❌ | ❌ | **0/3** ← 完全未最適化 |
|
||||||
|
| **C7** | **1024B** | ❌ | ❌ | ❌ | **0/3** ← 完全未最適化 |
|
||||||
|
|
||||||
|
**衝撃的発見**: Random Mixed で使用されるクラスの **50%** (C5, C6, C7) が全く最適化されていない!
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Class別パフォーマンスプロファイル
|
||||||
|
|
||||||
|
### 3.1 Random Mixed で使用されるクラス
|
||||||
|
|
||||||
|
コード分析 (`bench_random_mixed.c:77`):
|
||||||
|
```c
|
||||||
|
size_t sz = 16u + (r & 0x3FFu); // 16B-1040B の範囲
|
||||||
|
```
|
||||||
|
|
||||||
|
マッピング:
|
||||||
|
```
|
||||||
|
16-31B → C2 (32B) [16B requested]
|
||||||
|
32-63B → C3 (64B) [32-63B requested]
|
||||||
|
64-127B → C4 (128B) [64-127B requested]
|
||||||
|
128-255B → C5 (256B) [128-255B requested]
|
||||||
|
256-511B → C6 (512B) [256-511B requested]
|
||||||
|
512-1024B → C7 (1024B) [512-1023B requested]
|
||||||
|
```
|
||||||
|
|
||||||
|
**実際の分布**: ほぼ均一分布(ビット選択の性質上)
|
||||||
|
|
||||||
|
### 3.2 各クラスの最適化カバレッジ
|
||||||
|
|
||||||
|
**C0-C3 (HeapV2): 実装済みだが Random Mixed では使用量少ない**
|
||||||
|
- HeapV2 magazine capacity: 16/class
|
||||||
|
- Hit rate: 88-99%(実装は良い)
|
||||||
|
- **制限**: C4+ に対応していない
|
||||||
|
|
||||||
|
**C4-C7 (完全未最適化)**:
|
||||||
|
- Ring cache: 実装済みだが **デフォルト OFF** (`HAKMEM_TINY_HOT_RING_ENABLE=0`)
|
||||||
|
- HeapV2: C0-C3 のみ
|
||||||
|
- UltraHot: デフォルト OFF
|
||||||
|
- **結果**: 素の TLS SLL + SuperSlab refill に頼る
|
||||||
|
|
||||||
|
### 3.3 性能への影響
|
||||||
|
|
||||||
|
Random Mixed の大半は C4-C7 で処理されているのに、**全く最適化されていない**:
|
||||||
|
|
||||||
|
```
|
||||||
|
固定 256B での性能向上の理由:
|
||||||
|
- C5 単独 → HeapV2 未適用だが TLS SLL warm保持可能
|
||||||
|
- Class切り替えない → refill不要
|
||||||
|
- 結果: 40.3M ops/s
|
||||||
|
|
||||||
|
Random Mixed での性能低下の理由:
|
||||||
|
- C3/C5/C6/C7 混在
|
||||||
|
- 各クラス TLS SLL small → refill頻繁
|
||||||
|
- Refill cost: 50-200 cycles/回
|
||||||
|
- 結果: 19.4M ops/s (47% の性能低下)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. 次の一手候補の優先度付け
|
||||||
|
|
||||||
|
### 候補分析
|
||||||
|
|
||||||
|
#### 候補A: Ring Cache を C4/C5 に拡張 🔴 最優先
|
||||||
|
|
||||||
|
**理由**:
|
||||||
|
- Phase 21-1 で既に **実装済み**(`core/front/tiny_ring_cache.{h,c}`)
|
||||||
|
- C2/C3 では未使用(デフォルト OFF)
|
||||||
|
- C4-C7 への拡張は小さな変更で済む
|
||||||
|
- **効果**: ポインタチェイス削減 (+15-20%)
|
||||||
|
|
||||||
|
**実装状況**:
|
||||||
|
```c
|
||||||
|
// tiny_ring_cache.h:67-80
|
||||||
|
static inline int ring_cache_enabled(void) {
|
||||||
|
const char* e = getenv("HAKMEM_TINY_HOT_RING_ENABLE");
|
||||||
|
// デフォルト: 0 (OFF)
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
**有効化方法**:
|
||||||
|
```bash
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
export HAKMEM_TINY_HOT_RING_C4=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C5=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C6=64
|
||||||
|
export HAKMEM_TINY_HOT_RING_C7=64
|
||||||
|
```
|
||||||
|
|
||||||
|
**推定効果**:
|
||||||
|
- 19.4M → 22-25M ops/s (+13-29%)
|
||||||
|
- TLS SLL pointer chasing: 3 mem → 2 mem
|
||||||
|
- Cache locality 向上
|
||||||
|
|
||||||
|
**実装コスト**: **LOW** (既存実装の有効化のみ)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 候補B: HeapV2 を C4/C5 に拡張 🟡 中優先度
|
||||||
|
|
||||||
|
**理由**:
|
||||||
|
- Phase 13-A で既に **実装済み**(`core/front/tiny_heap_v2.h`)
|
||||||
|
- 現在 C0-C3 のみ(`HAKMEM_TINY_HEAP_V2_CLASS_MASK=0xE`)
|
||||||
|
- Magazine supply で TLS SLL hit rate 向上可能
|
||||||
|
|
||||||
|
**制限**:
|
||||||
|
- Magazine size: 16/class → Random Mixed では小さい
|
||||||
|
- Phase 17-1 実験: `+0.3%` のみ改善
|
||||||
|
- **理由**: Delegation overhead = TLS savings
|
||||||
|
|
||||||
|
**推定効果**: +2-5% (TLS refill削減)
|
||||||
|
|
||||||
|
**実装コスト**: LOW(ENV設定変更のみ)
|
||||||
|
|
||||||
|
**判断**: Ring Cache の方が効果的(候補A推奨)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 候補C: C7 (1KB) 専用 HotPath 実装 🟢 長期
|
||||||
|
|
||||||
|
**理由**:
|
||||||
|
- C7 は Random Mixed の ~16% を占める
|
||||||
|
- SuperSlab refill cost が大きい
|
||||||
|
- 専用設計で carve/batch overhead 削減可能
|
||||||
|
|
||||||
|
**推定効果**: +5-10% (C7 単体で)
|
||||||
|
|
||||||
|
**実装コスト**: **HIGH** (新規設計)
|
||||||
|
|
||||||
|
**判断**: 後回し(Ring Cache + その他の最適化後に検討)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
#### 候補D: SuperSlab refill の高速化 🔥 超長期
|
||||||
|
|
||||||
|
**理由**:
|
||||||
|
- 根本原因(50-200 cycles/refill)の直接攻撃
|
||||||
|
- Phase 12 (Shared SuperSlab Pool) でアーキテクチャ変更
|
||||||
|
- 877 SuperSlab → 100-200 に削減
|
||||||
|
|
||||||
|
**推定効果**: **+300-400%** (9.38M → 70-90M ops/s)
|
||||||
|
|
||||||
|
**実装コスト**: **VERY HIGH** (アーキテクチャ変更)
|
||||||
|
|
||||||
|
**判断**: Phase 21(前提となる細かい最適化)完了後に着手
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 優先順位付け結論
|
||||||
|
|
||||||
|
```
|
||||||
|
🔴 最優先: Ring Cache C4/C7 拡張 (実装済み、有効化のみ)
|
||||||
|
期待: +13-29% (19.4M → 22-25M ops/s)
|
||||||
|
工数: LOW
|
||||||
|
リスク: LOW
|
||||||
|
|
||||||
|
🟡 次点: HeapV2 C4/C5 拡張 (実装済み、有効化のみ)
|
||||||
|
期待: +2-5%
|
||||||
|
工数: LOW
|
||||||
|
リスク: LOW
|
||||||
|
判断: 効果が小さい(Ring優先)
|
||||||
|
|
||||||
|
🟢 長期: C7 専用 HotPath
|
||||||
|
期待: +5-10%
|
||||||
|
工数: HIGH
|
||||||
|
判断: 後回し
|
||||||
|
|
||||||
|
🔥 超長期: SuperSlab Shared Pool (Phase 12)
|
||||||
|
期待: +300-400%
|
||||||
|
工数: VERY HIGH
|
||||||
|
判断: 根本解決(Phase 21終了後)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. 推奨施策
|
||||||
|
|
||||||
|
### 5.1 即実施: Ring Cache 有効化テスト
|
||||||
|
|
||||||
|
**スクリプト** (`scripts/test_ring_cache.sh` の例):
|
||||||
|
```bash
|
||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
echo "=== Ring Cache OFF (Baseline) ==="
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
|
echo "=== Ring Cache ON (C4/C7) ==="
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
export HAKMEM_TINY_HOT_RING_C4=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C5=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C6=64
|
||||||
|
export HAKMEM_TINY_HOT_RING_C7=64
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
|
echo "=== Ring Cache ON (C2/C3 original) ==="
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
export HAKMEM_TINY_HOT_RING_C2=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C3=128
|
||||||
|
unset HAKMEM_TINY_HOT_RING_C4 HAKMEM_TINY_HOT_RING_C5 HAKMEM_TINY_HOT_RING_C6 HAKMEM_TINY_HOT_RING_C7
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
```
|
||||||
|
|
||||||
|
**期待結果**:
|
||||||
|
- Baseline: 19.4M ops/s (23.4%)
|
||||||
|
- Ring C4/C7: 22-25M ops/s (24-28%) ← +13-29%
|
||||||
|
- Ring C2/C3: 20-21M ops/s (23-24%) ← +3-8%
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5.2 検証用 FrontMetrics 計測
|
||||||
|
|
||||||
|
**有効化**:
|
||||||
|
```bash
|
||||||
|
export HAKMEM_TINY_FRONT_METRICS=1
|
||||||
|
export HAKMEM_TINY_FRONT_DUMP=1
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42 2>&1 | grep -A 100 "Frontend Metrics"
|
||||||
|
```
|
||||||
|
|
||||||
|
**期待出力**: クラス別ヒット率一覧(Ring 有効化前後で比較)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 5.3 長期ロードマップ
|
||||||
|
|
||||||
|
```
|
||||||
|
フェーズ 21-1: Ring Cache 有効化 (即実施)
|
||||||
|
├─ C2/C3 テスト(既実装)
|
||||||
|
├─ C4-C7 拡張テスト
|
||||||
|
└─ 期待: 20-25M ops/s (+13-29%)
|
||||||
|
|
||||||
|
フェーズ 21-2: Hot Slab Direct Index (Class5+)
|
||||||
|
└─ SuperSlab slab ループ削減
|
||||||
|
└─ 期待: 22-30M ops/s (+13-55%)
|
||||||
|
|
||||||
|
フェーズ 21-3: Minimal Meta Access
|
||||||
|
└─ 触るフィールド削減(accessed pattern 限定)
|
||||||
|
└─ 期待: 24-35M ops/s (+24-80%)
|
||||||
|
|
||||||
|
フェーズ 22: Phase 12 (Shared SuperSlab Pool) 着手
|
||||||
|
└─ 877 SuperSlab → 100-200 削減
|
||||||
|
└─ 期待: 70-90M ops/s (+260-364%)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. 技術的根拠
|
||||||
|
|
||||||
|
### 6.1 Fixed 256B (C5) vs Random Mixed (C3/C5/C6/C7)
|
||||||
|
|
||||||
|
**固定の高速性の理由**:
|
||||||
|
1. **Class 固定** → TLS SLL warm保持
|
||||||
|
2. **HeapV2 非適用** → でも SLL hit率高い
|
||||||
|
3. **Refill少ない** → class切り替えない
|
||||||
|
|
||||||
|
**Random Mixed の低速性の理由**:
|
||||||
|
1. **Class 頻繁切り替え** → TLS SLL → 複数class で枯渇
|
||||||
|
2. **各クラス refill多発** → 50-200 cycles × 多発
|
||||||
|
3. **最適化カバレッジ 0%** → C4-C7 が素のパス
|
||||||
|
|
||||||
|
**差分**: 40.3M - 19.4M = **20.9M ops/s**
|
||||||
|
|
||||||
|
素の TLS SLL と Ring Cache の差:
|
||||||
|
```
|
||||||
|
TLS SLL (pointer chasing): 3 mem accesses
|
||||||
|
- Load head: 1 mem
|
||||||
|
- Load next: 1 mem (cache miss)
|
||||||
|
- Update head: 1 mem
|
||||||
|
|
||||||
|
Ring Cache (array): 2 mem accesses
|
||||||
|
- Load from array: 1 mem
|
||||||
|
- Update index: 1 mem (同一cache line)
|
||||||
|
|
||||||
|
改善: 3→2 = -33% cycles
|
||||||
|
```
|
||||||
|
|
||||||
|
### 6.2 Refill Cost 見積もり
|
||||||
|
|
||||||
|
```
|
||||||
|
Random Mixed refill frequency:
|
||||||
|
- Total iterations: 500K
|
||||||
|
- Classes: 6 (C2-C7)
|
||||||
|
- Per-class avg lifetime: 500K/6 ≈ 83K
|
||||||
|
- TLS SLL typical warmth: 16-32 blocks
|
||||||
|
- Refill per 50 ops: ~1 refill per 50-100 ops
|
||||||
|
|
||||||
|
→ 500K × 1/75 ≈ 6.7K refills
|
||||||
|
|
||||||
|
Refill cost:
|
||||||
|
- SuperSlab lookup: 10-20 cycles
|
||||||
|
- Slab iteration: 30-50 cycles (32 slabs)
|
||||||
|
- Carving: 10-15 cycles
|
||||||
|
- Push chain: 5-10 cycles
|
||||||
|
Total: ~60-95 cycles/refill (average)
|
||||||
|
|
||||||
|
Impact:
|
||||||
|
- 6.7K × 80 cycles = 536K cycles
|
||||||
|
- vs 500K × 50 cycles = 25M cycles total
|
||||||
|
= 2.1% のみ
|
||||||
|
|
||||||
|
理由: refill は相対的に少ない、むしろ TLS hit rate の悪さと
|
||||||
|
class切り替え overhead が支配的
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. 最終推奨
|
||||||
|
|
||||||
|
| 項目 | 内容 |
|
||||||
|
|------|------|
|
||||||
|
| **最優先施策** | **Ring Cache C4/C7 有効化テスト** |
|
||||||
|
| **期待改善** | +13-29% (19.4M → 22-25M ops/s) |
|
||||||
|
| **実装期間** | < 1日 (ENV設定のみ) |
|
||||||
|
| **リスク** | 極低(既実装、有効化のみ) |
|
||||||
|
| **成功条件** | 23-25M ops/s 到達 (25-28% of system) |
|
||||||
|
| **次ステップ** | Phase 21-2 (Hot Slab Cache) |
|
||||||
|
| **長期目標** | Phase 12 (Shared SS Pool) で 70-90M ops/s |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**End of Analysis**
|
||||||
|
|
||||||
148
RANDOM_MIXED_SUMMARY.md
Normal file
148
RANDOM_MIXED_SUMMARY.md
Normal file
@ -0,0 +1,148 @@
|
|||||||
|
# Random Mixed ボトルネック分析 - 返答フォーマット
|
||||||
|
|
||||||
|
## Random Mixed ボトルネック分析
|
||||||
|
|
||||||
|
### 1. Cycles 分布
|
||||||
|
|
||||||
|
| Layer | Target Classes | Hit Rate | Cycles | Status |
|
||||||
|
|-------|---|---|---|---|
|
||||||
|
| Ring Cache | C2-C3 only | 0% (OFF) | N/A | Not enabled |
|
||||||
|
| HeapV2 | C0-C3 | 88-99% | Low (2-3) | Working ✅ |
|
||||||
|
| TLS SLL | C0-C7 | 0.7-2.7% | Medium (8-12) | Fallback only |
|
||||||
|
| **SuperSlab refill** | **All classes** | **~2-5% miss** | **High (50-200)** | **BOTTLENECK** 🔥 |
|
||||||
|
| UltraHot | C1-C2 | N/A | Medium | OFF (Phase 19) |
|
||||||
|
|
||||||
|
- **Ring Cache**: Low (2-3 cycles) - ポインタチェイス削減(未使用)
|
||||||
|
- **HeapV2**: Low (2-3 cycles) - Magazine供給(C0-C3のみ有効)
|
||||||
|
- **TLS SLL**: Medium (8-12 cycles) - Fallback層、複数classで枯渇
|
||||||
|
- **SuperSlab refill**: High (50-200 cycles) - Metadata走査+carving(支配的)
|
||||||
|
- **UltraHot**: Medium - デフォルトOFF(Phase 19で削除)
|
||||||
|
|
||||||
|
**ボトルネック**: **SuperSlab refill** (50-200 cycles/refill) - Random Mixed では class切り替え多発により TLS SLL が頻繁に空になり、refill多発
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 2. FrontMetrics 状況
|
||||||
|
|
||||||
|
- **実装**: ✅ ある (`core/box/front_metrics_box.{h,c}`)
|
||||||
|
- **HeapV2**: 88-99% ヒット率 → C0-C3 では本命層として機能中
|
||||||
|
- **UltraHot**: デフォルト OFF (Phase 19-4で +12.9% 改善のため削除)
|
||||||
|
- **FC/SFC**: 実質無効化
|
||||||
|
|
||||||
|
**Fixed vs Mixed の違い**:
|
||||||
|
| 側面 | Fixed 256B | Random Mixed |
|
||||||
|
|------|---|---|
|
||||||
|
| 使用クラス | C5 のみ | C3, C5, C6, C7 (混在) |
|
||||||
|
| Class切り替え | 0 (固定) | 頻繁 (毎iteration) |
|
||||||
|
| HeapV2適用 | 非適用 | C0-C3のみ(部分)|
|
||||||
|
| TLS SLL hit率 | High | Low(複数class枯渇)|
|
||||||
|
| Refill頻度 | **低い(C5 warm保持)** | **高い(class毎に空)** |
|
||||||
|
|
||||||
|
**死んでいる層**: **C4-C7 (128B-1KB) が全く最適化されていない**
|
||||||
|
- C0-C3: HeapV2 ✅
|
||||||
|
- C4: Ring ❌, HeapV2 ❌, UltraHot ❌ → 素のTLS SLL + refill
|
||||||
|
- C5: Ring ❌, HeapV2 ❌, UltraHot ❌ → 素のTLS SLL + refill
|
||||||
|
- C6: Ring ❌, HeapV2 ❌, UltraHot ❌ → 素のTLS SLL + refill
|
||||||
|
- C7: Ring ❌, HeapV2 ❌, UltraHot ❌ → 素のTLS SLL + refill
|
||||||
|
|
||||||
|
Random Mixed で使用されるクラスの **50%以上** が完全未最適化!
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 3. Class別プロファイル
|
||||||
|
|
||||||
|
**使用クラス** (bench_random_mixed.c:77 分析):
|
||||||
|
```c
|
||||||
|
size_t sz = 16u + (r & 0x3FFu); // 16B-1040B
|
||||||
|
→ C2 (16-31B), C3 (32-63B), C4 (64-127B), C5 (128-255B), C6 (256-511B), C7 (512-1024B)
|
||||||
|
```
|
||||||
|
|
||||||
|
**最適化カバレッジ**:
|
||||||
|
- Ring Cache: 4個クラス対応済み(C0-C7)だが **デフォルト OFF**
|
||||||
|
- `HAKMEM_TINY_HOT_RING_ENABLE=0` (有効化されていない)
|
||||||
|
- HeapV2: 4個クラス対応(C0-C3)
|
||||||
|
- C4-C7 に拡張可能だが Phase 17-1 実験で +0.3% のみ効果
|
||||||
|
- 素のTLS SLL: 全クラス(fallback)
|
||||||
|
|
||||||
|
**素のTLS SLL 経路の割合**:
|
||||||
|
- C0-C3: ~88-99% HeapV2(TLS SLL は2-12% fallback)
|
||||||
|
- **C4-C7: ~100% TLS SLL + SuperSlab refill**(最適化なし)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
### 4. 推奨施策(優先度順)
|
||||||
|
|
||||||
|
#### 1. **最優先**: Ring Cache C4/C7 拡張
|
||||||
|
- **効果推定**: **High (+13-29%)**
|
||||||
|
- **理由**:
|
||||||
|
- Phase 21-1 で実装済み(`core/front/tiny_ring_cache.h`)
|
||||||
|
- C2-C3 未使用(デフォルト OFF)
|
||||||
|
- **ポインタチェイス削減**: TLS SLL 3mem → Ring 2mem (-33%)
|
||||||
|
- Random Mixed の C4-C7 (50%) をカバー可能
|
||||||
|
- **実装期間**: **低** (ENV 有効化のみ、≦1日)
|
||||||
|
- **リスク**: **低** (既実装、有効化のみ)
|
||||||
|
- **期待値**: 19.4M → 22-25M ops/s (25-28%)
|
||||||
|
- **有効化**:
|
||||||
|
```bash
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
export HAKMEM_TINY_HOT_RING_C4=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C5=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C6=64
|
||||||
|
export HAKMEM_TINY_HOT_RING_C7=64
|
||||||
|
```
|
||||||
|
|
||||||
|
#### 2. **次点**: HeapV2 を C4/C5 に拡張
|
||||||
|
- **効果推定**: **Low to Medium (+2-5%)**
|
||||||
|
- **理由**:
|
||||||
|
- Phase 13-A で実装済み(`core/front/tiny_heap_v2.h`)
|
||||||
|
- Magazine supply で TLS SLL hit rate 向上
|
||||||
|
- **制限**: Phase 17-1 実験で +0.3% のみ(delegation overhead = TLS savings)
|
||||||
|
- **実装期間**: **低** (ENV 変更のみ)
|
||||||
|
- **リスク**: **低**
|
||||||
|
- **期待値**: 19.4M → 19.8-20.4M ops/s (+2-5%)
|
||||||
|
- **判断**: Ring Cache の方が効果的(Ring を優先)
|
||||||
|
|
||||||
|
#### 3. **長期**: C7 (1KB) 専用 HotPath
|
||||||
|
- **効果推定**: **Medium (+5-10%)**
|
||||||
|
- **理由**: C7 は Random Mixed の ~16% を占める
|
||||||
|
- **実装期間**: **高**(新規実装)
|
||||||
|
- **判断**: 後回し(Ring Cache + Phase 21-2 後に検討)
|
||||||
|
|
||||||
|
#### 4. **超長期**: SuperSlab Shared Pool (Phase 12)
|
||||||
|
- **効果推定**: **VERY HIGH (+300-400%)**
|
||||||
|
- **理由**: 877 SuperSlab → 100-200 削減(根本解決)
|
||||||
|
- **実装期間**: **Very High**(アーキテクチャ変更)
|
||||||
|
- **期待値**: 70-90M ops/s(System の 70-90%)
|
||||||
|
- **判断**: Phase 21 完了後に着手
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 最終推奨(フォーマット通り)
|
||||||
|
|
||||||
|
### 優先度付き推奨施策
|
||||||
|
|
||||||
|
1. **最優先**: **Ring Cache C4/C7 有効化**
|
||||||
|
- 理由: ポインタチェイス削減で +13-29% 期待、実装済み(有効化のみ)
|
||||||
|
- 期待: 19.4M → 22-25M ops/s (25-28% of system)
|
||||||
|
|
||||||
|
2. **次点**: **HeapV2 C4/C5 拡張**
|
||||||
|
- 理由: TLS refill 削減で +2-5% 期待、ただし Ring より効果薄
|
||||||
|
- 期待: 19.4M → 19.8-20.4M ops/s (+2-5%)
|
||||||
|
|
||||||
|
3. **長期**: **C7 専用 HotPath 実装**
|
||||||
|
- 理由: 1KB 単体の最適化、実装コスト大
|
||||||
|
- 期待: +5-10%
|
||||||
|
|
||||||
|
4. **超長期**: **Phase 12 (Shared SuperSlab Pool)**
|
||||||
|
- 理由: 根本的なメタデータ圧縮(構造的ボトルネック攻撃)
|
||||||
|
- 期待: +300-400% (70-90M ops/s)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**本分析の根拠ファイル**:
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/front/tiny_ring_cache.h` - Ring Cache 実装
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/front/tiny_heap_v2.h` - HeapV2 実装
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast.inc.h` - Allocation fast path
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/core/box/tls_sll_box.h` - TLS SLL 実装
|
||||||
|
- `/mnt/workdisk/public_share/hakmem/CURRENT_TASK.md` - Phase 19-22 実装状況
|
||||||
|
|
||||||
301
RING_CACHE_ACTIVATION_GUIDE.md
Normal file
301
RING_CACHE_ACTIVATION_GUIDE.md
Normal file
@ -0,0 +1,301 @@
|
|||||||
|
# Ring Cache C4-C7 有効化ガイド(Phase 21-1 即実施版)
|
||||||
|
|
||||||
|
**Priority**: 🔴 HIGHEST
|
||||||
|
**Status**: Implementation Ready (待つだけ)
|
||||||
|
**Expected Gain**: +13-29% (19.4M → 22-25M ops/s)
|
||||||
|
**Risk Level**: LOW (既実装、有効化のみ)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 概要
|
||||||
|
|
||||||
|
Random Mixed の bottleneck は **C4-C7 (128B-1KB) が完全未最適化** されている点です。
|
||||||
|
Phase 21-1 で実装済みの **Ring Cache** を有効化することで、TLS SLL のポインタチェイス(3 mem)を 配列アクセス(2 mem)に削減し、+13-29% の性能向上が期待できます。
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Ring Cache とは
|
||||||
|
|
||||||
|
### アーキテクチャ
|
||||||
|
|
||||||
|
```
|
||||||
|
3-層階層:
|
||||||
|
Layer 0: Ring Cache (array-based, 128 slots)
|
||||||
|
└─ Fast pop/push (1-2 mem accesses)
|
||||||
|
|
||||||
|
Layer 1: TLS SLL (linked list)
|
||||||
|
└─ Medium pop/push (3 mem accesses + cache miss)
|
||||||
|
|
||||||
|
Layer 2: SuperSlab
|
||||||
|
└─ Slow refill (50-200 cycles)
|
||||||
|
```
|
||||||
|
|
||||||
|
### 性能改善の仕組み
|
||||||
|
|
||||||
|
**従来の TLS SLL (pointer chasing)**:
|
||||||
|
```
|
||||||
|
Pop:
|
||||||
|
1. Load head pointer: mov rax, [g_tls_sll_head]
|
||||||
|
2. Load next pointer: mov rdx, [rax] ← cache miss!
|
||||||
|
3. Update head: mov [g_tls_sll_head], rdx
|
||||||
|
= 3 memory accesses
|
||||||
|
```
|
||||||
|
|
||||||
|
**Ring Cache (array-based)**:
|
||||||
|
```
|
||||||
|
Pop:
|
||||||
|
1. Load from array: mov rax, [g_ring_cache + head*8]
|
||||||
|
2. Update head index: add head, 1 ← CPU register!
|
||||||
|
= 2 memory accesses、キャッシュミスなし
|
||||||
|
```
|
||||||
|
|
||||||
|
**改善**: 3 → 2 memory = -33% cycles per alloc/free
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 実装状況確認
|
||||||
|
|
||||||
|
### ファイル一覧
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ring Cache 実装ファイル
|
||||||
|
ls -la /mnt/workdisk/public_share/hakmem/core/front/tiny_ring_cache.{h,c}
|
||||||
|
|
||||||
|
# 確認コマンド
|
||||||
|
grep -n "ring_cache_enabled\|HAKMEM_TINY_HOT_RING" \
|
||||||
|
/mnt/workdisk/public_share/hakmem/core/front/tiny_ring_cache.h | head -20
|
||||||
|
```
|
||||||
|
|
||||||
|
### 既実装機能の確認
|
||||||
|
|
||||||
|
```c
|
||||||
|
// core/front/tiny_ring_cache.h:67-80
|
||||||
|
static inline int ring_cache_enabled(void) {
|
||||||
|
static int g_enable = -1;
|
||||||
|
if (__builtin_expect(g_enable == -1, 0)) {
|
||||||
|
const char* e = getenv("HAKMEM_TINY_HOT_RING_ENABLE");
|
||||||
|
g_enable = (e && *e && *e != '0') ? 1 : 0; // Default: 0 (OFF)
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
if (g_enable) {
|
||||||
|
fprintf(stderr, "[Ring-INIT] ring_cache_enabled() = %d\n", g_enable);
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
return g_enable;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Ring pop/push already implemented:
|
||||||
|
// - ring_cache_pop() (line 159-190)
|
||||||
|
// - ring_cache_push() (line 195-228)
|
||||||
|
// - Per-class capacities: C2/C3 (default: 128, configurable)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## テスト実施手順
|
||||||
|
|
||||||
|
### Step 1: ビルド確認
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd /mnt/workdisk/public_share/hakmem
|
||||||
|
|
||||||
|
# Release ビルド
|
||||||
|
./build.sh bench_random_mixed_hakmem
|
||||||
|
./build.sh bench_random_mixed_system
|
||||||
|
|
||||||
|
# 確認
|
||||||
|
ls -lh ./out/release/bench_random_mixed_*
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 2: Baseline 測定
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ring Cache OFF (現在のデフォルト)
|
||||||
|
echo "=== Baseline (Ring Cache OFF) ==="
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
|
# Expected: ~19.4M ops/s (23.4% of system)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 3: Ring Cache C2/C3 テスト(既存)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
echo "=== Ring Cache C2/C3 (experimental baseline) ==="
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
export HAKMEM_TINY_HOT_RING_C2=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C3=128
|
||||||
|
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
|
# Expected: ~20-21M ops/s (+3-8% from baseline)
|
||||||
|
# Note: C2/C3 は Random Mixed で少数派
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: Ring Cache C4-C7 テスト(推奨)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
echo "=== Ring Cache C4-C7 (推奨: Random Mixed の主要クラス) ==="
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
export HAKMEM_TINY_HOT_RING_C4=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C5=128
|
||||||
|
export HAKMEM_TINY_HOT_RING_C6=64
|
||||||
|
export HAKMEM_TINY_HOT_RING_C7=64
|
||||||
|
unset HAKMEM_TINY_HOT_RING_C2 HAKMEM_TINY_HOT_RING_C3
|
||||||
|
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
|
# Expected: ~22-25M ops/s (+13-29% from baseline)
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 5: Combined (全クラス) テスト
|
||||||
|
|
||||||
|
```bash
|
||||||
|
echo "=== Ring Cache All Classes (C0-C7) ==="
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1
|
||||||
|
# デフォルト: C2=128, C3=128, C4=128, C5=128, C6=64, C7=64
|
||||||
|
unset HAKMEM_TINY_HOT_RING_C2 HAKMEM_TINY_HOT_RING_C3 HAKMEM_TINY_HOT_RING_C4 \
|
||||||
|
HAKMEM_TINY_HOT_RING_C5 HAKMEM_TINY_HOT_RING_C6 HAKMEM_TINY_HOT_RING_C7
|
||||||
|
|
||||||
|
./out/release/bench_random_mixed_hakmem 500000 256 42
|
||||||
|
|
||||||
|
# Expected: ~23-24M ops/s (+18-24% from baseline)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## ENV変数リファレンス
|
||||||
|
|
||||||
|
### 有効化/無効化
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ring Cache 全体の有効/無効
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=1 # ON (default: 0 = OFF)
|
||||||
|
export HAKMEM_TINY_HOT_RING_ENABLE=0 # OFF
|
||||||
|
```
|
||||||
|
|
||||||
|
### クラス別容量設定
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# デフォルト値: すべて 128 (Ring サイズ)
|
||||||
|
export HAKMEM_TINY_HOT_RING_C0=128 # 8B
|
||||||
|
export HAKMEM_TINY_HOT_RING_C1=128 # 16B
|
||||||
|
export HAKMEM_TINY_HOT_RING_C2=128 # 32B
|
||||||
|
export HAKMEM_TINY_HOT_RING_C3=128 # 64B
|
||||||
|
export HAKMEM_TINY_HOT_RING_C4=128 # 128B (新)
|
||||||
|
export HAKMEM_TINY_HOT_RING_C5=128 # 256B (新)
|
||||||
|
export HAKMEM_TINY_HOT_RING_C6=64 # 512B (新)
|
||||||
|
export HAKMEM_TINY_HOT_RING_C7=64 # 1024B (新)
|
||||||
|
|
||||||
|
# サイズ指定: 32-256 (power of 2 に自動調整)
|
||||||
|
# 小さい: 32, 64 → メモリ効率優先、ヒット率低
|
||||||
|
# 中: 128 → バランス型(推奨)
|
||||||
|
# 大: 256 → ヒット率優先、メモリ多消費
|
||||||
|
```
|
||||||
|
|
||||||
|
### カスケード設定(上級)
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Ring → SLL への一方向補充(デフォルト: OFF)
|
||||||
|
export HAKMEM_TINY_HOT_RING_CASCADE=1 # SLL 空時に Ring から補充
|
||||||
|
```
|
||||||
|
|
||||||
|
### デバッグ出力
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Metrics 出力(リリースビルド時は無効)
|
||||||
|
export HAKMEM_DEBUG_COUNTERS=1 # Ring hit/miss カウント
|
||||||
|
export HAKMEM_BUILD_RELEASE=0 # デバッグビルド(遅い)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## テスト結果フォーマット
|
||||||
|
|
||||||
|
各テストの結果を以下形式で記録してください:
|
||||||
|
|
||||||
|
```markdown
|
||||||
|
### Test Results (YYYY-MM-DD HH:MM)
|
||||||
|
|
||||||
|
| Test | Iterations | Workset | Seed | Result | vs Baseline | Status |
|
||||||
|
|------|---|---|---|---|---|---|
|
||||||
|
| Baseline (OFF) | 500K | 256 | 42 | 19.4M | - | ✓ |
|
||||||
|
| C2/C3 Ring | 500K | 256 | 42 | 20.5M | +5.7% | ✓ |
|
||||||
|
| C4/C7 Ring | 500K | 256 | 42 | 23.0M | +18.6% | ✓✓ |
|
||||||
|
| All Classes | 500K | 256 | 42 | 22.8M | +17.5% | ✓✓ |
|
||||||
|
|
||||||
|
**Recommendation**: C4-C7 設定で +18.6% 改善、目標達成
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## トラブルシューティング
|
||||||
|
|
||||||
|
### 問題: Ring Cache 有効化しても性能向上しない
|
||||||
|
|
||||||
|
**診断**:
|
||||||
|
```bash
|
||||||
|
# ENV が実際に反映されているか確認
|
||||||
|
./out/release/bench_random_mixed_hakmem 100 256 42 2>&1 | grep -i "ring\|cache"
|
||||||
|
|
||||||
|
# 期待出力: [Ring-INIT] ring_cache_enabled() = 1
|
||||||
|
```
|
||||||
|
|
||||||
|
**原因候補**:
|
||||||
|
1. **ENV が設定されていない** → `export HAKMEM_TINY_HOT_RING_ENABLE=1` を再確認
|
||||||
|
2. **ビルドが古い** → `./build.sh clean && ./build.sh bench_random_mixed_hakmem`
|
||||||
|
3. **リリースビルド** → デバッグ出力なし(正常、性能測定のため)
|
||||||
|
|
||||||
|
### 問題: ハング or SEGV
|
||||||
|
|
||||||
|
**対応**:
|
||||||
|
```bash
|
||||||
|
# Ring Cache OFF に戻す
|
||||||
|
unset HAKMEM_TINY_HOT_RING_ENABLE
|
||||||
|
unset HAKMEM_TINY_HOT_RING_C{0..7}
|
||||||
|
|
||||||
|
./out/release/bench_random_mixed_hakmem 100 256 42
|
||||||
|
```
|
||||||
|
|
||||||
|
**報告**: 発生時は StackTrace + ENV 設定を記録
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 成功基準
|
||||||
|
|
||||||
|
| 項目 | 基準 | 判定 |
|
||||||
|
|------|------|------|
|
||||||
|
| **Baseline 測定** | 19-20M ops/s | ✅ Pass |
|
||||||
|
| **C4-C7 Ring 有効化** | 22M ops/s 以上 | ✅ Pass (+13%+) |
|
||||||
|
| **目標達成** | 23-25M ops/s | 🎯 Target |
|
||||||
|
| **Crash/Hang** | なし | ✅ Stability |
|
||||||
|
| **FrontMetrics 検証** | Ring hit > 50% | ✅ Confirm |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 次のステップ
|
||||||
|
|
||||||
|
### 成功時 (23-25M ops/s 到達):
|
||||||
|
1. ✅ Ring Cache C4-C7 を本番設定として固定
|
||||||
|
2. 🔄 Phase 21-2 (Hot Slab Direct Index) 実装開始
|
||||||
|
3. 📊 FrontMetrics で詳細分析(class別 hit rate)
|
||||||
|
|
||||||
|
### 失敗時 (改善なし):
|
||||||
|
1. 🔍 FrontMetrics で Ring hit rate 確認
|
||||||
|
2. 🐛 Ring cache initialization デバッグ
|
||||||
|
3. 🔧 キャパシティ調整テスト(64 / 256 等)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 参考資料
|
||||||
|
|
||||||
|
- **実装**: `/mnt/workdisk/public_share/hakmem/core/front/tiny_ring_cache.h/c`
|
||||||
|
- **ボトルネック分析**: `/mnt/workdisk/public_share/hakmem/RANDOM_MIXED_BOTTLENECK_ANALYSIS.md`
|
||||||
|
- **Phase 21-1 計画**: `/mnt/workdisk/public_share/hakmem/CURRENT_TASK.md` § 10, 11
|
||||||
|
- **Alloc fast path**: `/mnt/workdisk/public_share/hakmem/core/tiny_alloc_fast.inc.h:199-310`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**End of Guide**
|
||||||
|
|
||||||
|
準備完了。実施をお待ちしています!
|
||||||
|
|
||||||
@ -28,11 +28,13 @@
|
|||||||
__thread uint64_t g_classify_header_hit = 0;
|
__thread uint64_t g_classify_header_hit = 0;
|
||||||
__thread uint64_t g_classify_headerless_hit = 0;
|
__thread uint64_t g_classify_headerless_hit = 0;
|
||||||
__thread uint64_t g_classify_pool_hit = 0;
|
__thread uint64_t g_classify_pool_hit = 0;
|
||||||
|
__thread uint64_t g_classify_mid_large_hit = 0;
|
||||||
__thread uint64_t g_classify_unknown_hit = 0;
|
__thread uint64_t g_classify_unknown_hit = 0;
|
||||||
|
|
||||||
void front_gate_print_stats(void) {
|
void front_gate_print_stats(void) {
|
||||||
uint64_t total = g_classify_header_hit + g_classify_headerless_hit +
|
uint64_t total = g_classify_header_hit + g_classify_headerless_hit +
|
||||||
g_classify_pool_hit + g_classify_unknown_hit;
|
g_classify_pool_hit + g_classify_mid_large_hit +
|
||||||
|
g_classify_unknown_hit;
|
||||||
if (total == 0) return;
|
if (total == 0) return;
|
||||||
|
|
||||||
fprintf(stderr, "\n========== Front Gate Classification Stats ==========\n");
|
fprintf(stderr, "\n========== Front Gate Classification Stats ==========\n");
|
||||||
@ -42,6 +44,8 @@ void front_gate_print_stats(void) {
|
|||||||
g_classify_headerless_hit, 100.0 * g_classify_headerless_hit / total);
|
g_classify_headerless_hit, 100.0 * g_classify_headerless_hit / total);
|
||||||
fprintf(stderr, "Pool TLS: %lu (%.2f%%)\n",
|
fprintf(stderr, "Pool TLS: %lu (%.2f%%)\n",
|
||||||
g_classify_pool_hit, 100.0 * g_classify_pool_hit / total);
|
g_classify_pool_hit, 100.0 * g_classify_pool_hit / total);
|
||||||
|
fprintf(stderr, "Mid-Large (MMAP): %lu (%.2f%%)\n",
|
||||||
|
g_classify_mid_large_hit, 100.0 * g_classify_mid_large_hit / total);
|
||||||
fprintf(stderr, "Unknown: %lu (%.2f%%)\n",
|
fprintf(stderr, "Unknown: %lu (%.2f%%)\n",
|
||||||
g_classify_unknown_hit, 100.0 * g_classify_unknown_hit / total);
|
g_classify_unknown_hit, 100.0 * g_classify_unknown_hit / total);
|
||||||
fprintf(stderr, "Total: %lu\n", total);
|
fprintf(stderr, "Total: %lu\n", total);
|
||||||
@ -253,6 +257,30 @@ ptr_classification_t classify_ptr(void* ptr) {
|
|||||||
return result;
|
return result;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Check for Mid-Large allocation with AllocHeader (MMAP/POOL/L25_POOL)
|
||||||
|
// AllocHeader is placed before user pointer (user_ptr - HEADER_SIZE)
|
||||||
|
//
|
||||||
|
// Safety check: Need at least HEADER_SIZE (40 bytes) before ptr to read AllocHeader
|
||||||
|
// If ptr is too close to page start, skip this check (avoid SEGV)
|
||||||
|
uintptr_t offset_in_page_for_hdr = (uintptr_t)ptr & 0xFFF;
|
||||||
|
if (offset_in_page_for_hdr >= HEADER_SIZE) {
|
||||||
|
// Safe to read AllocHeader (won't cross page boundary)
|
||||||
|
AllocHeader* hdr = hak_header_from_user(ptr);
|
||||||
|
if (hak_header_validate(hdr)) {
|
||||||
|
// Valid HAKMEM header found
|
||||||
|
if (hdr->method == ALLOC_METHOD_MMAP ||
|
||||||
|
hdr->method == ALLOC_METHOD_POOL ||
|
||||||
|
hdr->method == ALLOC_METHOD_L25_POOL) {
|
||||||
|
result.kind = PTR_KIND_MID_LARGE;
|
||||||
|
result.ss = NULL;
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_classify_mid_large_hit++;
|
||||||
|
#endif
|
||||||
|
return result;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// Unknown pointer (external allocation or Mid/Large)
|
// Unknown pointer (external allocation or Mid/Large)
|
||||||
// Let free wrapper handle Mid/Large registry lookups
|
// Let free wrapper handle Mid/Large registry lookups
|
||||||
result.kind = PTR_KIND_UNKNOWN;
|
result.kind = PTR_KIND_UNKNOWN;
|
||||||
|
|||||||
@ -70,6 +70,7 @@ ptr_classification_t classify_ptr(void* ptr);
|
|||||||
extern __thread uint64_t g_classify_header_hit;
|
extern __thread uint64_t g_classify_header_hit;
|
||||||
extern __thread uint64_t g_classify_headerless_hit;
|
extern __thread uint64_t g_classify_headerless_hit;
|
||||||
extern __thread uint64_t g_classify_pool_hit;
|
extern __thread uint64_t g_classify_pool_hit;
|
||||||
|
extern __thread uint64_t g_classify_mid_large_hit;
|
||||||
extern __thread uint64_t g_classify_unknown_hit;
|
extern __thread uint64_t g_classify_unknown_hit;
|
||||||
|
|
||||||
void front_gate_print_stats(void);
|
void front_gate_print_stats(void);
|
||||||
|
|||||||
@ -265,8 +265,10 @@ static void hak_init_impl(void) {
|
|||||||
hak_site_rules_init();
|
hak_site_rules_init();
|
||||||
}
|
}
|
||||||
|
|
||||||
// NEW Phase 6.12: Tiny Pool (≤1KB allocations)
|
// Phase 22: Tiny Pool initialization now LAZY (per-class on first use)
|
||||||
hak_tiny_init();
|
// hak_tiny_init() moved to lazy_init_class() in hakmem_tiny_lazy_init.inc.h
|
||||||
|
// OLD: hak_tiny_init(); (eager init of all 8 classes → 94.94% page faults)
|
||||||
|
// NEW: Lazy init triggered by tiny_alloc_fast() → only used classes initialized
|
||||||
|
|
||||||
// Env: optional Tiny flush on exit (memory efficiency evaluation)
|
// Env: optional Tiny flush on exit (memory efficiency evaluation)
|
||||||
{
|
{
|
||||||
|
|||||||
@ -178,6 +178,7 @@ void free(void* ptr) {
|
|||||||
case PTR_KIND_TINY_HEADER:
|
case PTR_KIND_TINY_HEADER:
|
||||||
case PTR_KIND_TINY_HEADERLESS:
|
case PTR_KIND_TINY_HEADERLESS:
|
||||||
case PTR_KIND_POOL_TLS:
|
case PTR_KIND_POOL_TLS:
|
||||||
|
case PTR_KIND_MID_LARGE: // FIX: Include Mid-Large (mmap/ACE) pointers
|
||||||
is_hakmem_owned = 1; break;
|
is_hakmem_owned = 1; break;
|
||||||
default: break;
|
default: break;
|
||||||
}
|
}
|
||||||
|
|||||||
83
core/box/pagefault_telemetry_box.c
Normal file
83
core/box/pagefault_telemetry_box.c
Normal file
@ -0,0 +1,83 @@
|
|||||||
|
// pagefault_telemetry_box.c - Box PageFaultTelemetry implementation
|
||||||
|
|
||||||
|
#include "pagefault_telemetry_box.h"
|
||||||
|
|
||||||
|
#include "../hakmem_tiny_stats_api.h" // For macros / flags
|
||||||
|
#include <stdio.h>
|
||||||
|
#include <stdlib.h>
|
||||||
|
|
||||||
|
// Per-thread state
|
||||||
|
__thread uint64_t g_pf_bloom[PF_BUCKET_MAX][16] = {{0}};
|
||||||
|
__thread uint64_t g_pf_touch[PF_BUCKET_MAX] = {0};
|
||||||
|
|
||||||
|
// Enable flag (cached)
|
||||||
|
int pagefault_telemetry_enabled(void) {
|
||||||
|
static int g_enabled = -1;
|
||||||
|
if (__builtin_expect(g_enabled == -1, 0)) {
|
||||||
|
const char* env = getenv("HAKMEM_TINY_PAGEFAULT_TELEMETRY");
|
||||||
|
g_enabled = (env && *env && *env != '0') ? 1 : 0;
|
||||||
|
}
|
||||||
|
return g_enabled;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Dump helper
|
||||||
|
void pagefault_telemetry_dump(void) {
|
||||||
|
if (!pagefault_telemetry_enabled()) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const char* dump_env = getenv("HAKMEM_TINY_PAGEFAULT_DUMP");
|
||||||
|
if (!(dump_env && *dump_env && *dump_env != '0')) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
fprintf(stderr, "\n========== Box PageFaultTelemetry: Tiny Page Touch Stats ==========\n");
|
||||||
|
fprintf(stderr, "Note: pages ~= popcount(1024-bit bloom); collisions → 下限近似値\n\n");
|
||||||
|
fprintf(stderr, "%-5s %12s %12s %12s\n", "Bucket", "touches", "approx_pages", "touches/page");
|
||||||
|
fprintf(stderr, "------|------------|------------|------------\n");
|
||||||
|
|
||||||
|
for (int b = 0; b < PF_BUCKET_MAX; b++) {
|
||||||
|
uint64_t touches = g_pf_touch[b];
|
||||||
|
if (touches == 0) {
|
||||||
|
continue;
|
||||||
|
}
|
||||||
|
|
||||||
|
uint64_t bits = 0;
|
||||||
|
for (int w = 0; w < 16; w++) {
|
||||||
|
bits += (uint64_t)__builtin_popcountll(g_pf_bloom[b][w]);
|
||||||
|
}
|
||||||
|
|
||||||
|
double pages = (double)bits;
|
||||||
|
double tpp = pages > 0.0 ? (double)touches / pages : 0.0;
|
||||||
|
|
||||||
|
const char* name = NULL;
|
||||||
|
char buf[8];
|
||||||
|
if (b < PF_BUCKET_TINY_LIMIT) {
|
||||||
|
snprintf(buf, sizeof(buf), "C%d", b);
|
||||||
|
name = buf;
|
||||||
|
} else if (b == PF_BUCKET_MID) {
|
||||||
|
name = "MID";
|
||||||
|
} else if (b == PF_BUCKET_L25) {
|
||||||
|
name = "L25";
|
||||||
|
} else if (b == PF_BUCKET_SS_META) {
|
||||||
|
name = "SSM";
|
||||||
|
} else {
|
||||||
|
snprintf(buf, sizeof(buf), "X%d", b);
|
||||||
|
name = buf;
|
||||||
|
}
|
||||||
|
|
||||||
|
fprintf(stderr, "%-5s %12llu %12llu %12.1f\n",
|
||||||
|
name,
|
||||||
|
(unsigned long long)touches,
|
||||||
|
(unsigned long long)bits,
|
||||||
|
tpp);
|
||||||
|
}
|
||||||
|
|
||||||
|
fprintf(stderr, "===============================================================\n\n");
|
||||||
|
}
|
||||||
|
|
||||||
|
// Auto-dump at thread exit (bench系で 1 回だけ実行される想定)
|
||||||
|
static void pagefault_telemetry_atexit(void) __attribute__((destructor));
|
||||||
|
static void pagefault_telemetry_atexit(void) {
|
||||||
|
pagefault_telemetry_dump();
|
||||||
|
}
|
||||||
4
core/box/pagefault_telemetry_box.d
Normal file
4
core/box/pagefault_telemetry_box.d
Normal file
@ -0,0 +1,4 @@
|
|||||||
|
core/box/pagefault_telemetry_box.o: core/box/pagefault_telemetry_box.c \
|
||||||
|
core/box/pagefault_telemetry_box.h core/box/../hakmem_tiny_stats_api.h
|
||||||
|
core/box/pagefault_telemetry_box.h:
|
||||||
|
core/box/../hakmem_tiny_stats_api.h:
|
||||||
96
core/box/pagefault_telemetry_box.h
Normal file
96
core/box/pagefault_telemetry_box.h
Normal file
@ -0,0 +1,96 @@
|
|||||||
|
// pagefault_telemetry_box.h - Box PageFaultTelemetry: Tiny page-touch visualization
|
||||||
|
// Purpose:
|
||||||
|
// - Approximate「何枚のページをどれだけ触ったか」をクラス別に計測する箱。
|
||||||
|
// - Tiny フロントエンド側からのみ呼び出し、Superslab/カーネル側の挙動は変更しない。
|
||||||
|
//
|
||||||
|
// Design:
|
||||||
|
// - 4KB ページ単位でアドレスを正規化し、簡易 Bloom/ビットセットにハッシュ。
|
||||||
|
// - 1 クラスあたり 1024bit (= 16 x uint64_t) を用意し、popcount で「近似ページ枚数」を算出。
|
||||||
|
// - 衝突は起こり得るが「下限近似値」として十分。目的は傾向把握。
|
||||||
|
//
|
||||||
|
// ENV Control:
|
||||||
|
// - HAKMEM_TINY_PAGEFAULT_TELEMETRY=1 … 計測有効化
|
||||||
|
// - HAKMEM_TINY_PAGEFAULT_DUMP=1 … 終了時に stderr へ 1 回だけダンプ
|
||||||
|
|
||||||
|
#ifndef HAK_BOX_PAGEFAULT_TELEMETRY_H
|
||||||
|
#define HAK_BOX_PAGEFAULT_TELEMETRY_H
|
||||||
|
|
||||||
|
#include <stdint.h>
|
||||||
|
|
||||||
|
#ifdef __cplusplus
|
||||||
|
extern "C" {
|
||||||
|
#endif
|
||||||
|
|
||||||
|
// Tiny クラス数(既存定義が無ければ 8 とみなす)
|
||||||
|
#ifndef TINY_NUM_CLASSES
|
||||||
|
#define TINY_NUM_CLASSES 8
|
||||||
|
#endif
|
||||||
|
|
||||||
|
// ドメインバケット定義:
|
||||||
|
// 0..7 : Tiny C0..C7
|
||||||
|
// 8 : Mid Pool (hak_pool_*)
|
||||||
|
// 9 : L25 Pool (hak_l25_pool_*)
|
||||||
|
// 10 : Shared SuperSlab meta / backing
|
||||||
|
// 11 : 予備
|
||||||
|
enum {
|
||||||
|
PF_BUCKET_TINY_BASE = 0,
|
||||||
|
PF_BUCKET_TINY_LIMIT = TINY_NUM_CLASSES,
|
||||||
|
PF_BUCKET_MID = TINY_NUM_CLASSES,
|
||||||
|
PF_BUCKET_L25 = TINY_NUM_CLASSES + 1,
|
||||||
|
PF_BUCKET_SS_META = TINY_NUM_CLASSES + 2,
|
||||||
|
PF_BUCKET_RESERVED = TINY_NUM_CLASSES + 3,
|
||||||
|
PF_BUCKET_MAX = TINY_NUM_CLASSES + 4
|
||||||
|
};
|
||||||
|
|
||||||
|
// ビットセット本体(1 バケットあたり 1024bit)
|
||||||
|
extern __thread uint64_t g_pf_bloom[PF_BUCKET_MAX][16];
|
||||||
|
// タッチ総数(ページ単位ではなく「呼び出し回数」)
|
||||||
|
extern __thread uint64_t g_pf_touch[PF_BUCKET_MAX];
|
||||||
|
|
||||||
|
// ENV による有効/無効判定(キャッシュ付き)
|
||||||
|
int pagefault_telemetry_enabled(void);
|
||||||
|
|
||||||
|
// 集計・ダンプ(ENV HAKMEM_TINY_PAGEFAULT_DUMP=1 のときだけ出力)
|
||||||
|
void pagefault_telemetry_dump(void);
|
||||||
|
|
||||||
|
// ----------------------------------------------------------------------------
|
||||||
|
// Inline helper: ページタッチ記録
|
||||||
|
// ----------------------------------------------------------------------------
|
||||||
|
|
||||||
|
static inline void pagefault_telemetry_touch(int cls, const void* ptr) {
|
||||||
|
#if HAKMEM_DEBUG_COUNTERS
|
||||||
|
if (!pagefault_telemetry_enabled()) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (cls < 0 || cls >= PF_BUCKET_MAX) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// 4KB ページに正規化
|
||||||
|
uintptr_t addr = (uintptr_t)ptr;
|
||||||
|
uintptr_t page = addr >> 12;
|
||||||
|
|
||||||
|
// 1024 エントリのビットセットにハッシュ
|
||||||
|
uint32_t idx = (uint32_t)(page & 1023u);
|
||||||
|
uint32_t word = idx >> 6;
|
||||||
|
uint32_t bit = idx & 63u;
|
||||||
|
uint64_t mask = (uint64_t)1u << bit;
|
||||||
|
|
||||||
|
uint64_t old = g_pf_bloom[cls][word];
|
||||||
|
if (!(old & mask)) {
|
||||||
|
g_pf_bloom[cls][word] = old | mask;
|
||||||
|
}
|
||||||
|
|
||||||
|
g_pf_touch[cls]++;
|
||||||
|
#else
|
||||||
|
(void)cls;
|
||||||
|
(void)ptr;
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
#ifdef __cplusplus
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
|
||||||
|
#endif // HAK_BOX_PAGEFAULT_TELEMETRY_H
|
||||||
@ -2,6 +2,8 @@
|
|||||||
#ifndef POOL_API_INC_H
|
#ifndef POOL_API_INC_H
|
||||||
#define POOL_API_INC_H
|
#define POOL_API_INC_H
|
||||||
|
|
||||||
|
#include "pagefault_telemetry_box.h" // Box PageFaultTelemetry (PF_BUCKET_MID)
|
||||||
|
|
||||||
void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
||||||
// Debug: IMMEDIATE output to verify function is called
|
// Debug: IMMEDIATE output to verify function is called
|
||||||
static int first_call = 1;
|
static int first_call = 1;
|
||||||
@ -52,10 +54,12 @@ void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
|||||||
void* raw = (void*)tlsb;
|
void* raw = (void*)tlsb;
|
||||||
AllocHeader* hdr = (AllocHeader*)raw;
|
AllocHeader* hdr = (AllocHeader*)raw;
|
||||||
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
||||||
|
void* user0 = (char*)raw + HEADER_SIZE;
|
||||||
mid_page_inuse_inc(raw);
|
mid_page_inuse_inc(raw);
|
||||||
t_pool_rng ^= t_pool_rng << 13; t_pool_rng ^= t_pool_rng >> 17; t_pool_rng ^= t_pool_rng << 5;
|
t_pool_rng ^= t_pool_rng << 13; t_pool_rng ^= t_pool_rng >> 17; t_pool_rng ^= t_pool_rng << 5;
|
||||||
if ((t_pool_rng & ((1u<<g_count_sample_exp)-1u)) == 0u) g_pool.hits[class_idx]++;
|
if ((t_pool_rng & ((1u<<g_count_sample_exp)-1u)) == 0u) g_pool.hits[class_idx]++;
|
||||||
return (char*)raw + HEADER_SIZE;
|
pagefault_telemetry_touch(PF_BUCKET_MID, user0);
|
||||||
|
return user0;
|
||||||
}
|
}
|
||||||
} else { HKM_TIME_END(HKM_CAT_TC_DRAIN, t_tc_drain); }
|
} else { HKM_TIME_END(HKM_CAT_TC_DRAIN, t_tc_drain); }
|
||||||
}
|
}
|
||||||
@ -70,9 +74,11 @@ void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
|||||||
void* raw = (void*)tlsb;
|
void* raw = (void*)tlsb;
|
||||||
AllocHeader* hdr = (AllocHeader*)raw;
|
AllocHeader* hdr = (AllocHeader*)raw;
|
||||||
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
||||||
|
void* user1 = (char*)raw + HEADER_SIZE;
|
||||||
t_pool_rng ^= t_pool_rng << 13; t_pool_rng ^= t_pool_rng >> 17; t_pool_rng ^= t_pool_rng << 5;
|
t_pool_rng ^= t_pool_rng << 13; t_pool_rng ^= t_pool_rng >> 17; t_pool_rng ^= t_pool_rng << 5;
|
||||||
if ((t_pool_rng & ((1u<<g_count_sample_exp)-1u)) == 0u) g_pool.hits[class_idx]++;
|
if ((t_pool_rng & ((1u<<g_count_sample_exp)-1u)) == 0u) g_pool.hits[class_idx]++;
|
||||||
return (char*)raw + HEADER_SIZE;
|
pagefault_telemetry_touch(PF_BUCKET_MID, user1);
|
||||||
|
return user1;
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
if (g_tls_bin[class_idx].lo_head) {
|
if (g_tls_bin[class_idx].lo_head) {
|
||||||
@ -83,10 +89,12 @@ void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
|||||||
HKM_TIME_END(HKM_CAT_POOL_TLS_LIFO_POP, t_lifo_pop0);
|
HKM_TIME_END(HKM_CAT_POOL_TLS_LIFO_POP, t_lifo_pop0);
|
||||||
void* raw = (void*)b; AllocHeader* hdr = (AllocHeader*)raw;
|
void* raw = (void*)b; AllocHeader* hdr = (AllocHeader*)raw;
|
||||||
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
||||||
|
void* user2 = (char*)raw + HEADER_SIZE;
|
||||||
mid_page_inuse_inc(raw);
|
mid_page_inuse_inc(raw);
|
||||||
t_pool_rng ^= t_pool_rng << 13; t_pool_rng ^= t_pool_rng >> 17; t_pool_rng ^= t_pool_rng << 5;
|
t_pool_rng ^= t_pool_rng << 13; t_pool_rng ^= t_pool_rng >> 17; t_pool_rng ^= t_pool_rng << 5;
|
||||||
if ((t_pool_rng & ((1u<<g_count_sample_exp)-1u)) == 0u) g_pool.hits[class_idx]++;
|
if ((t_pool_rng & ((1u<<g_count_sample_exp)-1u)) == 0u) g_pool.hits[class_idx]++;
|
||||||
return (char*)raw + HEADER_SIZE;
|
pagefault_telemetry_touch(PF_BUCKET_MID, user2);
|
||||||
|
return user2;
|
||||||
}
|
}
|
||||||
|
|
||||||
// Compute shard only when we need to access shared structures
|
// Compute shard only when we need to access shared structures
|
||||||
@ -231,9 +239,11 @@ void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
|||||||
else if (ap->page && ap->count > 0 && ap->bump < ap->end) { takeb = (PoolBlock*)(void*)ap->bump; ap->bump += (HEADER_SIZE + g_class_sizes[class_idx]); ap->count--; if (ap->bump >= ap->end || ap->count==0){ ap->page=NULL; ap->count=0; } }
|
else if (ap->page && ap->count > 0 && ap->bump < ap->end) { takeb = (PoolBlock*)(void*)ap->bump; ap->bump += (HEADER_SIZE + g_class_sizes[class_idx]); ap->count--; if (ap->bump >= ap->end || ap->count==0){ ap->page=NULL; ap->count=0; } }
|
||||||
void* raw2 = (void*)takeb; AllocHeader* hdr2 = (AllocHeader*)raw2;
|
void* raw2 = (void*)takeb; AllocHeader* hdr2 = (AllocHeader*)raw2;
|
||||||
mid_set_header(hdr2, g_class_sizes[class_idx], site_id);
|
mid_set_header(hdr2, g_class_sizes[class_idx], site_id);
|
||||||
|
void* user3 = (char*)raw2 + HEADER_SIZE;
|
||||||
mid_page_inuse_inc(raw2);
|
mid_page_inuse_inc(raw2);
|
||||||
g_pool.hits[class_idx]++;
|
g_pool.hits[class_idx]++;
|
||||||
return (char*)raw2 + HEADER_SIZE;
|
pagefault_telemetry_touch(PF_BUCKET_MID, user3);
|
||||||
|
return user3;
|
||||||
}
|
}
|
||||||
HKM_TIME_START(t_refill);
|
HKM_TIME_START(t_refill);
|
||||||
struct timespec ts_rf; int rf = hkm_prof_begin(&ts_rf);
|
struct timespec ts_rf; int rf = hkm_prof_begin(&ts_rf);
|
||||||
@ -266,8 +276,10 @@ void* hak_pool_try_alloc(size_t size, uintptr_t site_id) {
|
|||||||
|
|
||||||
void* raw = (void*)take; AllocHeader* hdr = (AllocHeader*)raw;
|
void* raw = (void*)take; AllocHeader* hdr = (AllocHeader*)raw;
|
||||||
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
mid_set_header(hdr, g_class_sizes[class_idx], site_id);
|
||||||
|
void* user4 = (char*)raw + HEADER_SIZE;
|
||||||
mid_page_inuse_inc(raw);
|
mid_page_inuse_inc(raw);
|
||||||
return (char*)raw + HEADER_SIZE;
|
pagefault_telemetry_touch(PF_BUCKET_MID, user4);
|
||||||
|
return user4;
|
||||||
}
|
}
|
||||||
|
|
||||||
void hak_pool_free(void* ptr, size_t size, uintptr_t site_id) {
|
void hak_pool_free(void* ptr, size_t size, uintptr_t site_id) {
|
||||||
|
|||||||
26
core/box/unified_batch_box.c
Normal file
26
core/box/unified_batch_box.c
Normal file
@ -0,0 +1,26 @@
|
|||||||
|
// unified_batch_box.c - Box U2: Batch Alloc Connector Implementation
|
||||||
|
#include "unified_batch_box.h"
|
||||||
|
#include "carve_push_box.h"
|
||||||
|
#include "../box/tls_sll_box.h"
|
||||||
|
#include <stddef.h>
|
||||||
|
|
||||||
|
// Batch allocate blocks from SuperSlab
|
||||||
|
// Returns: Actual count allocated (0 = failed)
|
||||||
|
int superslab_batch_alloc(int class_idx, void** blocks, int max_count) {
|
||||||
|
if (!blocks || max_count <= 0) return 0;
|
||||||
|
|
||||||
|
// Step 1: Carve N blocks from SuperSlab and push to TLS SLL
|
||||||
|
// (uses existing Box C1 carve_push logic)
|
||||||
|
uint32_t carved = box_carve_and_push_with_freelist(class_idx, (uint32_t)max_count);
|
||||||
|
if (carved == 0) return 0;
|
||||||
|
|
||||||
|
// Step 2: Pop carved blocks from TLS SLL into output array
|
||||||
|
int got = 0;
|
||||||
|
for (uint32_t i = 0; i < carved; i++) {
|
||||||
|
void* base;
|
||||||
|
if (!tls_sll_pop(class_idx, &base)) break; // Should not happen
|
||||||
|
blocks[got++] = base;
|
||||||
|
}
|
||||||
|
|
||||||
|
return got;
|
||||||
|
}
|
||||||
39
core/box/unified_batch_box.d
Normal file
39
core/box/unified_batch_box.d
Normal file
@ -0,0 +1,39 @@
|
|||||||
|
core/box/unified_batch_box.o: core/box/unified_batch_box.c \
|
||||||
|
core/box/unified_batch_box.h core/box/carve_push_box.h \
|
||||||
|
core/box/../box/tls_sll_box.h core/box/../box/../hakmem_tiny_config.h \
|
||||||
|
core/box/../box/../hakmem_build_flags.h core/box/../box/../tiny_remote.h \
|
||||||
|
core/box/../box/../tiny_region_id.h \
|
||||||
|
core/box/../box/../hakmem_build_flags.h \
|
||||||
|
core/box/../box/../tiny_box_geometry.h \
|
||||||
|
core/box/../box/../hakmem_tiny_superslab_constants.h \
|
||||||
|
core/box/../box/../hakmem_tiny_config.h core/box/../box/../ptr_track.h \
|
||||||
|
core/box/../box/../hakmem_tiny_integrity.h \
|
||||||
|
core/box/../box/../hakmem_tiny.h core/box/../box/../hakmem_trace.h \
|
||||||
|
core/box/../box/../hakmem_tiny_mini_mag.h core/box/../box/../ptr_track.h \
|
||||||
|
core/box/../box/../ptr_trace.h \
|
||||||
|
core/box/../box/../box/tiny_next_ptr_box.h core/hakmem_tiny_config.h \
|
||||||
|
core/tiny_nextptr.h core/hakmem_build_flags.h \
|
||||||
|
core/box/../box/../tiny_debug_ring.h
|
||||||
|
core/box/unified_batch_box.h:
|
||||||
|
core/box/carve_push_box.h:
|
||||||
|
core/box/../box/tls_sll_box.h:
|
||||||
|
core/box/../box/../hakmem_tiny_config.h:
|
||||||
|
core/box/../box/../hakmem_build_flags.h:
|
||||||
|
core/box/../box/../tiny_remote.h:
|
||||||
|
core/box/../box/../tiny_region_id.h:
|
||||||
|
core/box/../box/../hakmem_build_flags.h:
|
||||||
|
core/box/../box/../tiny_box_geometry.h:
|
||||||
|
core/box/../box/../hakmem_tiny_superslab_constants.h:
|
||||||
|
core/box/../box/../hakmem_tiny_config.h:
|
||||||
|
core/box/../box/../ptr_track.h:
|
||||||
|
core/box/../box/../hakmem_tiny_integrity.h:
|
||||||
|
core/box/../box/../hakmem_tiny.h:
|
||||||
|
core/box/../box/../hakmem_trace.h:
|
||||||
|
core/box/../box/../hakmem_tiny_mini_mag.h:
|
||||||
|
core/box/../box/../ptr_track.h:
|
||||||
|
core/box/../box/../ptr_trace.h:
|
||||||
|
core/box/../box/../box/tiny_next_ptr_box.h:
|
||||||
|
core/hakmem_tiny_config.h:
|
||||||
|
core/tiny_nextptr.h:
|
||||||
|
core/hakmem_build_flags.h:
|
||||||
|
core/box/../box/../tiny_debug_ring.h:
|
||||||
29
core/box/unified_batch_box.h
Normal file
29
core/box/unified_batch_box.h
Normal file
@ -0,0 +1,29 @@
|
|||||||
|
// unified_batch_box.h - Box U2: Batch Alloc Connector for Unified Cache
|
||||||
|
//
|
||||||
|
// Purpose: Provide batch allocation API for Unified Frontend Cache (Box U1)
|
||||||
|
// Design: Thin wrapper over existing Box flow (Carve/Push Box C1)
|
||||||
|
//
|
||||||
|
// API:
|
||||||
|
// int superslab_batch_alloc(int class_idx, void** blocks, int max_count)
|
||||||
|
// - Allocates up to max_count blocks from SuperSlab
|
||||||
|
// - Returns actual count allocated
|
||||||
|
// - blocks[] receives BASE pointers (caller converts to USER)
|
||||||
|
//
|
||||||
|
// Box Theory:
|
||||||
|
// - Box U2 (this) = Connector layer (no state, pure function)
|
||||||
|
// - Box U1 (Unified Cache) calls this for batch refill
|
||||||
|
// - This delegates to Box C1 (Carve/Push) for actual allocation
|
||||||
|
//
|
||||||
|
// ENV: None (controlled by caller Box U1)
|
||||||
|
|
||||||
|
#ifndef HAK_BOX_UNIFIED_BATCH_BOX_H
|
||||||
|
#define HAK_BOX_UNIFIED_BATCH_BOX_H
|
||||||
|
|
||||||
|
#include <stdint.h>
|
||||||
|
|
||||||
|
// Batch allocate blocks from SuperSlab (for Unified Cache refill)
|
||||||
|
// Returns: Actual count allocated (0 = failed)
|
||||||
|
// Note: blocks[] contains BASE pointers (not USER pointers)
|
||||||
|
int superslab_batch_alloc(int class_idx, void** blocks, int max_count);
|
||||||
|
|
||||||
|
#endif // HAK_BOX_UNIFIED_BATCH_BOX_H
|
||||||
@ -10,6 +10,7 @@
|
|||||||
|
|
||||||
__thread TinyRingCache g_ring_cache_c2 = {NULL, 0, 0, 0, 0};
|
__thread TinyRingCache g_ring_cache_c2 = {NULL, 0, 0, 0, 0};
|
||||||
__thread TinyRingCache g_ring_cache_c3 = {NULL, 0, 0, 0, 0};
|
__thread TinyRingCache g_ring_cache_c3 = {NULL, 0, 0, 0, 0};
|
||||||
|
__thread TinyRingCache g_ring_cache_c5 = {NULL, 0, 0, 0, 0};
|
||||||
|
|
||||||
// ============================================================================
|
// ============================================================================
|
||||||
// Metrics (Phase 21-1-E, optional for Phase 21-1-C)
|
// Metrics (Phase 21-1-E, optional for Phase 21-1-C)
|
||||||
@ -63,10 +64,31 @@ void ring_cache_init(void) {
|
|||||||
g_ring_cache_c3.head = 0;
|
g_ring_cache_c3.head = 0;
|
||||||
g_ring_cache_c3.tail = 0;
|
g_ring_cache_c3.tail = 0;
|
||||||
|
|
||||||
|
// C5 init
|
||||||
|
size_t cap_c5 = ring_capacity_c5();
|
||||||
|
g_ring_cache_c5.slots = (void**)calloc(cap_c5, sizeof(void*));
|
||||||
|
if (!g_ring_cache_c5.slots) {
|
||||||
#if !HAKMEM_BUILD_RELEASE
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
fprintf(stderr, "[Ring-INIT] C2=%zu slots (%zu bytes), C3=%zu slots (%zu bytes)\n",
|
fprintf(stderr, "[Ring-INIT] Failed to allocate C5 ring (%zu slots)\n", cap_c5);
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
// Free C2 and C3 if C5 failed
|
||||||
|
free(g_ring_cache_c2.slots);
|
||||||
|
g_ring_cache_c2.slots = NULL;
|
||||||
|
free(g_ring_cache_c3.slots);
|
||||||
|
g_ring_cache_c3.slots = NULL;
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
g_ring_cache_c5.capacity = (uint16_t)cap_c5;
|
||||||
|
g_ring_cache_c5.mask = (uint16_t)(cap_c5 - 1);
|
||||||
|
g_ring_cache_c5.head = 0;
|
||||||
|
g_ring_cache_c5.tail = 0;
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[Ring-INIT] C2=%zu slots (%zu bytes), C3=%zu slots (%zu bytes), C5=%zu slots (%zu bytes)\n",
|
||||||
cap_c2, cap_c2 * sizeof(void*),
|
cap_c2, cap_c2 * sizeof(void*),
|
||||||
cap_c3, cap_c3 * sizeof(void*));
|
cap_c3, cap_c3 * sizeof(void*),
|
||||||
|
cap_c5, cap_c5 * sizeof(void*));
|
||||||
fflush(stderr);
|
fflush(stderr);
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
@ -92,8 +114,13 @@ void ring_cache_shutdown(void) {
|
|||||||
g_ring_cache_c3.slots = NULL;
|
g_ring_cache_c3.slots = NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if (g_ring_cache_c5.slots) {
|
||||||
|
free(g_ring_cache_c5.slots);
|
||||||
|
g_ring_cache_c5.slots = NULL;
|
||||||
|
}
|
||||||
|
|
||||||
#if !HAKMEM_BUILD_RELEASE
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
fprintf(stderr, "[Ring-SHUTDOWN] C2/C3 rings freed\n");
|
fprintf(stderr, "[Ring-SHUTDOWN] C2/C3/C5 rings freed\n");
|
||||||
fflush(stderr);
|
fflush(stderr);
|
||||||
#endif
|
#endif
|
||||||
}
|
}
|
||||||
|
|||||||
@ -1,4 +1,4 @@
|
|||||||
// tiny_ring_cache.h - Phase 21-1: Array-based hot cache (C2/C3 only)
|
// tiny_ring_cache.h - Phase 21-1: Array-based hot cache (C2/C3/C5)
|
||||||
//
|
//
|
||||||
// Goal: Eliminate pointer chasing in TLS SLL by using ring buffer
|
// Goal: Eliminate pointer chasing in TLS SLL by using ring buffer
|
||||||
// Target: +15-20% performance (54.4M → 62-65M ops/s)
|
// Target: +15-20% performance (54.4M → 62-65M ops/s)
|
||||||
@ -46,6 +46,7 @@ typedef struct {
|
|||||||
|
|
||||||
extern __thread TinyRingCache g_ring_cache_c2;
|
extern __thread TinyRingCache g_ring_cache_c2;
|
||||||
extern __thread TinyRingCache g_ring_cache_c3;
|
extern __thread TinyRingCache g_ring_cache_c3;
|
||||||
|
extern __thread TinyRingCache g_ring_cache_c5;
|
||||||
|
|
||||||
// ============================================================================
|
// ============================================================================
|
||||||
// Metrics (Phase 21-1-E, optional for Phase 21-1-C)
|
// Metrics (Phase 21-1-E, optional for Phase 21-1-C)
|
||||||
@ -63,12 +64,12 @@ extern __thread uint64_t g_ring_cache_refill[8]; // Refill count (SLL → Ring)
|
|||||||
// ENV Control (cached, lazy init)
|
// ENV Control (cached, lazy init)
|
||||||
// ============================================================================
|
// ============================================================================
|
||||||
|
|
||||||
// Enable flag (default: 0, OFF)
|
// Enable flag (default: 1, ON)
|
||||||
static inline int ring_cache_enabled(void) {
|
static inline int ring_cache_enabled(void) {
|
||||||
static int g_enable = -1;
|
static int g_enable = -1;
|
||||||
if (__builtin_expect(g_enable == -1, 0)) {
|
if (__builtin_expect(g_enable == -1, 0)) {
|
||||||
const char* e = getenv("HAKMEM_TINY_HOT_RING_ENABLE");
|
const char* e = getenv("HAKMEM_TINY_HOT_RING_ENABLE");
|
||||||
g_enable = (e && *e && *e != '0') ? 1 : 0;
|
g_enable = (e && *e == '0') ? 0 : 1; // DEFAULT: ON (set ENV=0 to disable)
|
||||||
#if !HAKMEM_BUILD_RELEASE
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
if (g_enable) {
|
if (g_enable) {
|
||||||
fprintf(stderr, "[Ring-INIT] ring_cache_enabled() = %d\n", g_enable);
|
fprintf(stderr, "[Ring-INIT] ring_cache_enabled() = %d\n", g_enable);
|
||||||
@ -126,6 +127,29 @@ static inline size_t ring_capacity_c3(void) {
|
|||||||
return g_cap;
|
return g_cap;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// C5 capacity (default: 128)
|
||||||
|
static inline size_t ring_capacity_c5(void) {
|
||||||
|
static size_t g_cap = 0;
|
||||||
|
if (__builtin_expect(g_cap == 0, 0)) {
|
||||||
|
const char* e = getenv("HAKMEM_TINY_HOT_RING_C5");
|
||||||
|
g_cap = (e && *e) ? (size_t)atoi(e) : 128; // Default: 128
|
||||||
|
|
||||||
|
// Round up to power of 2
|
||||||
|
if (g_cap < 32) g_cap = 32;
|
||||||
|
if (g_cap > 256) g_cap = 256;
|
||||||
|
|
||||||
|
size_t pow2 = 32;
|
||||||
|
while (pow2 < g_cap) pow2 *= 2;
|
||||||
|
g_cap = pow2;
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[Ring-INIT] C5 capacity = %zu (power of 2)\n", g_cap);
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
return g_cap;
|
||||||
|
}
|
||||||
|
|
||||||
// Cascade enable flag (default: 0, OFF)
|
// Cascade enable flag (default: 0, OFF)
|
||||||
static inline int ring_cascade_enabled(void) {
|
static inline int ring_cascade_enabled(void) {
|
||||||
static int g_enable = -1;
|
static int g_enable = -1;
|
||||||
@ -159,9 +183,10 @@ void ring_cache_print_stats(void);
|
|||||||
static inline void* ring_cache_pop(int class_idx) {
|
static inline void* ring_cache_pop(int class_idx) {
|
||||||
// Fast path: Ring disabled or wrong class → return NULL immediately
|
// Fast path: Ring disabled or wrong class → return NULL immediately
|
||||||
if (__builtin_expect(!ring_cache_enabled(), 0)) return NULL;
|
if (__builtin_expect(!ring_cache_enabled(), 0)) return NULL;
|
||||||
if (__builtin_expect(class_idx != 2 && class_idx != 3, 0)) return NULL;
|
if (__builtin_expect(class_idx != 2 && class_idx != 3 && class_idx != 5, 0)) return NULL;
|
||||||
|
|
||||||
TinyRingCache* ring = (class_idx == 2) ? &g_ring_cache_c2 : &g_ring_cache_c3;
|
TinyRingCache* ring = (class_idx == 2) ? &g_ring_cache_c2 :
|
||||||
|
(class_idx == 3) ? &g_ring_cache_c3 : &g_ring_cache_c5;
|
||||||
|
|
||||||
// Lazy init check (once per thread)
|
// Lazy init check (once per thread)
|
||||||
if (__builtin_expect(ring->slots == NULL, 0)) {
|
if (__builtin_expect(ring->slots == NULL, 0)) {
|
||||||
@ -195,9 +220,10 @@ static inline void* ring_cache_pop(int class_idx) {
|
|||||||
static inline int ring_cache_push(int class_idx, void* base) {
|
static inline int ring_cache_push(int class_idx, void* base) {
|
||||||
// Fast path: Ring disabled or wrong class → return 0 (not handled)
|
// Fast path: Ring disabled or wrong class → return 0 (not handled)
|
||||||
if (__builtin_expect(!ring_cache_enabled(), 0)) return 0;
|
if (__builtin_expect(!ring_cache_enabled(), 0)) return 0;
|
||||||
if (__builtin_expect(class_idx != 2 && class_idx != 3, 0)) return 0;
|
if (__builtin_expect(class_idx != 2 && class_idx != 3 && class_idx != 5, 0)) return 0;
|
||||||
|
|
||||||
TinyRingCache* ring = (class_idx == 2) ? &g_ring_cache_c2 : &g_ring_cache_c3;
|
TinyRingCache* ring = (class_idx == 2) ? &g_ring_cache_c2 :
|
||||||
|
(class_idx == 3) ? &g_ring_cache_c3 : &g_ring_cache_c5;
|
||||||
|
|
||||||
// Lazy init check (once per thread)
|
// Lazy init check (once per thread)
|
||||||
if (__builtin_expect(ring->slots == NULL, 0)) {
|
if (__builtin_expect(ring->slots == NULL, 0)) {
|
||||||
|
|||||||
231
core/front/tiny_unified_cache.c
Normal file
231
core/front/tiny_unified_cache.c
Normal file
@ -0,0 +1,231 @@
|
|||||||
|
// tiny_unified_cache.c - Phase 23: Unified Frontend Cache Implementation
|
||||||
|
#include "tiny_unified_cache.h"
|
||||||
|
#include "../box/unified_batch_box.h" // Phase 23-D: Box U2 batch alloc (deprecated in 23-E)
|
||||||
|
#include "../tiny_tls.h" // Phase 23-E: TinyTLSSlab, TinySlabMeta
|
||||||
|
#include "../tiny_box_geometry.h" // Phase 23-E: tiny_stride_for_class, tiny_slab_base_for_geometry
|
||||||
|
#include "../box/tiny_next_ptr_box.h" // Phase 23-E: tiny_next_read (freelist traversal)
|
||||||
|
#include "../hakmem_tiny_superslab.h" // Phase 23-E: SuperSlab
|
||||||
|
#include "../superslab/superslab_inline.h" // Phase 23-E: ss_active_add
|
||||||
|
#include "../box/pagefault_telemetry_box.h" // Phase 24: Box PageFaultTelemetry (Tiny page touch stats)
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <string.h>
|
||||||
|
|
||||||
|
// Phase 23-E: Forward declarations
|
||||||
|
extern __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES]; // From hakmem_tiny_superslab.c
|
||||||
|
extern int superslab_refill(int class_idx); // From hakmem_tiny_superslab.c
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// TLS Variables (defined here, extern in header)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
__thread TinyUnifiedCache g_unified_cache[TINY_NUM_CLASSES];
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Metrics (Phase 23, optional for debugging)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
__thread uint64_t g_unified_cache_hit[TINY_NUM_CLASSES] = {0};
|
||||||
|
__thread uint64_t g_unified_cache_miss[TINY_NUM_CLASSES] = {0};
|
||||||
|
__thread uint64_t g_unified_cache_push[TINY_NUM_CLASSES] = {0};
|
||||||
|
__thread uint64_t g_unified_cache_full[TINY_NUM_CLASSES] = {0};
|
||||||
|
#endif
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Init (called at thread start or lazy on first access)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
void unified_cache_init(void) {
|
||||||
|
if (!unified_cache_enabled()) return;
|
||||||
|
|
||||||
|
// Initialize all classes (C0-C7)
|
||||||
|
for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
|
||||||
|
if (g_unified_cache[cls].slots != NULL) continue; // Already initialized
|
||||||
|
|
||||||
|
size_t cap = unified_capacity(cls);
|
||||||
|
g_unified_cache[cls].slots = (void**)calloc(cap, sizeof(void*));
|
||||||
|
|
||||||
|
if (!g_unified_cache[cls].slots) {
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[Unified-INIT] Failed to allocate C%d cache (%zu slots)\n", cls, cap);
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
continue; // Skip this class, try others
|
||||||
|
}
|
||||||
|
|
||||||
|
g_unified_cache[cls].capacity = (uint16_t)cap;
|
||||||
|
g_unified_cache[cls].mask = (uint16_t)(cap - 1);
|
||||||
|
g_unified_cache[cls].head = 0;
|
||||||
|
g_unified_cache[cls].tail = 0;
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[Unified-INIT] C%d: %zu slots (%zu bytes)\n",
|
||||||
|
cls, cap, cap * sizeof(void*));
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Shutdown (called at thread exit, optional)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
void unified_cache_shutdown(void) {
|
||||||
|
if (!unified_cache_enabled()) return;
|
||||||
|
|
||||||
|
// TODO: Drain caches to SuperSlab before shutdown (prevent leak)
|
||||||
|
|
||||||
|
// Free cache buffers
|
||||||
|
for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
|
||||||
|
if (g_unified_cache[cls].slots) {
|
||||||
|
free(g_unified_cache[cls].slots);
|
||||||
|
g_unified_cache[cls].slots = NULL;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[Unified-SHUTDOWN] All caches freed\n");
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Stats (Phase 23 metrics)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
void unified_cache_print_stats(void) {
|
||||||
|
if (!unified_cache_enabled()) return;
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "\n[Unified-STATS] Unified Cache Metrics:\n");
|
||||||
|
|
||||||
|
for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
|
||||||
|
uint64_t total_allocs = g_unified_cache_hit[cls] + g_unified_cache_miss[cls];
|
||||||
|
uint64_t total_frees = g_unified_cache_push[cls] + g_unified_cache_full[cls];
|
||||||
|
|
||||||
|
if (total_allocs == 0 && total_frees == 0) continue; // Skip unused classes
|
||||||
|
|
||||||
|
double hit_rate = (total_allocs > 0) ? (100.0 * g_unified_cache_hit[cls] / total_allocs) : 0.0;
|
||||||
|
double full_rate = (total_frees > 0) ? (100.0 * g_unified_cache_full[cls] / total_frees) : 0.0;
|
||||||
|
|
||||||
|
// Current occupancy
|
||||||
|
uint16_t count = (g_unified_cache[cls].tail >= g_unified_cache[cls].head)
|
||||||
|
? (g_unified_cache[cls].tail - g_unified_cache[cls].head)
|
||||||
|
: (g_unified_cache[cls].capacity - g_unified_cache[cls].head + g_unified_cache[cls].tail);
|
||||||
|
|
||||||
|
fprintf(stderr, " C%d: %u/%u slots occupied, hit=%llu miss=%llu (%.1f%% hit), push=%llu full=%llu (%.1f%% full)\n",
|
||||||
|
cls,
|
||||||
|
count, g_unified_cache[cls].capacity,
|
||||||
|
(unsigned long long)g_unified_cache_hit[cls],
|
||||||
|
(unsigned long long)g_unified_cache_miss[cls],
|
||||||
|
hit_rate,
|
||||||
|
(unsigned long long)g_unified_cache_push[cls],
|
||||||
|
(unsigned long long)g_unified_cache_full[cls],
|
||||||
|
full_rate);
|
||||||
|
}
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Phase 23-E: Direct SuperSlab Carve (TLS SLL Bypass)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Batch refill from SuperSlab (called on cache miss)
|
||||||
|
// Returns: BASE pointer (first block), or NULL if failed
|
||||||
|
// Design: Direct carve from SuperSlab to array (no TLS SLL intermediate layer)
|
||||||
|
void* unified_cache_refill(int class_idx) {
|
||||||
|
TinyTLSSlab* tls = &g_tls_slabs[class_idx];
|
||||||
|
|
||||||
|
// Step 1: Ensure SuperSlab available
|
||||||
|
if (!tls->ss) {
|
||||||
|
if (!superslab_refill(class_idx)) return NULL;
|
||||||
|
tls = &g_tls_slabs[class_idx]; // Reload after refill
|
||||||
|
}
|
||||||
|
|
||||||
|
TinyUnifiedCache* cache = &g_unified_cache[class_idx];
|
||||||
|
|
||||||
|
// Step 2: Calculate available room in unified cache
|
||||||
|
int room = (int)cache->capacity - 1; // Leave 1 slot for full detection
|
||||||
|
if (cache->head > cache->tail) {
|
||||||
|
room = cache->head - cache->tail - 1;
|
||||||
|
} else if (cache->head < cache->tail) {
|
||||||
|
room = cache->capacity - (cache->tail - cache->head) - 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (room <= 0) return NULL;
|
||||||
|
if (room > 128) room = 128; // Batch size limit
|
||||||
|
|
||||||
|
// Step 3: Direct carve from SuperSlab into local array (bypass TLS SLL!)
|
||||||
|
void* out[128];
|
||||||
|
int produced = 0;
|
||||||
|
TinySlabMeta* m = tls->meta;
|
||||||
|
size_t bs = tiny_stride_for_class(class_idx);
|
||||||
|
uint8_t* base = tls->slab_base
|
||||||
|
? tls->slab_base
|
||||||
|
: tiny_slab_base_for_geometry(tls->ss, tls->slab_idx);
|
||||||
|
|
||||||
|
while (produced < room) {
|
||||||
|
if (m->freelist) {
|
||||||
|
// Freelist pop
|
||||||
|
void* p = m->freelist;
|
||||||
|
m->freelist = tiny_next_read(class_idx, p);
|
||||||
|
|
||||||
|
// PageFaultTelemetry: record page touch for this BASE
|
||||||
|
pagefault_telemetry_touch(class_idx, p);
|
||||||
|
|
||||||
|
// ✅ CRITICAL: Restore header (overwritten by freelist link)
|
||||||
|
#if HAKMEM_TINY_HEADER_CLASSIDX
|
||||||
|
*(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
|
||||||
|
#endif
|
||||||
|
|
||||||
|
m->used++;
|
||||||
|
out[produced++] = p;
|
||||||
|
|
||||||
|
} else if (m->carved < m->capacity) {
|
||||||
|
// Linear carve (fresh block, no freelist link)
|
||||||
|
void* p = (void*)(base + ((size_t)m->carved * bs));
|
||||||
|
|
||||||
|
// PageFaultTelemetry: record page touch for this BASE
|
||||||
|
pagefault_telemetry_touch(class_idx, p);
|
||||||
|
|
||||||
|
// ✅ CRITICAL: Write header (new block)
|
||||||
|
#if HAKMEM_TINY_HEADER_CLASSIDX
|
||||||
|
*(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
|
||||||
|
#endif
|
||||||
|
|
||||||
|
m->carved++;
|
||||||
|
m->used++;
|
||||||
|
out[produced++] = p;
|
||||||
|
|
||||||
|
} else {
|
||||||
|
// SuperSlab exhausted → refill and retry
|
||||||
|
if (!superslab_refill(class_idx)) break;
|
||||||
|
|
||||||
|
// ✅ CRITICAL: Reload TLS pointers after refill (avoid stale pointer bug)
|
||||||
|
tls = &g_tls_slabs[class_idx];
|
||||||
|
m = tls->meta;
|
||||||
|
base = tls->slab_base
|
||||||
|
? tls->slab_base
|
||||||
|
: tiny_slab_base_for_geometry(tls->ss, tls->slab_idx);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (produced == 0) return NULL;
|
||||||
|
|
||||||
|
// Step 4: Update active counter
|
||||||
|
ss_active_add(tls->ss, (uint32_t)produced);
|
||||||
|
|
||||||
|
// Step 5: Store blocks into unified cache (skip first, return it)
|
||||||
|
void* first = out[0];
|
||||||
|
for (int i = 1; i < produced; i++) {
|
||||||
|
cache->slots[cache->tail] = out[i];
|
||||||
|
cache->tail = (cache->tail + 1) & cache->mask;
|
||||||
|
}
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_miss[class_idx]++;
|
||||||
|
#endif
|
||||||
|
|
||||||
|
return first; // Return first block (BASE pointer)
|
||||||
|
}
|
||||||
40
core/front/tiny_unified_cache.d
Normal file
40
core/front/tiny_unified_cache.d
Normal file
@ -0,0 +1,40 @@
|
|||||||
|
core/front/tiny_unified_cache.o: core/front/tiny_unified_cache.c \
|
||||||
|
core/front/tiny_unified_cache.h core/front/../hakmem_build_flags.h \
|
||||||
|
core/front/../hakmem_tiny_config.h core/front/../box/unified_batch_box.h \
|
||||||
|
core/front/../tiny_tls.h core/front/../hakmem_tiny_superslab.h \
|
||||||
|
core/front/../superslab/superslab_types.h \
|
||||||
|
core/hakmem_tiny_superslab_constants.h \
|
||||||
|
core/front/../superslab/superslab_inline.h \
|
||||||
|
core/front/../superslab/superslab_types.h \
|
||||||
|
core/front/../tiny_debug_ring.h core/front/../hakmem_build_flags.h \
|
||||||
|
core/front/../tiny_remote.h \
|
||||||
|
core/front/../hakmem_tiny_superslab_constants.h \
|
||||||
|
core/front/../tiny_box_geometry.h core/front/../hakmem_tiny_config.h \
|
||||||
|
core/front/../box/tiny_next_ptr_box.h core/hakmem_tiny_config.h \
|
||||||
|
core/tiny_nextptr.h core/hakmem_build_flags.h \
|
||||||
|
core/front/../hakmem_tiny_superslab.h \
|
||||||
|
core/front/../superslab/superslab_inline.h \
|
||||||
|
core/front/../box/pagefault_telemetry_box.h
|
||||||
|
core/front/tiny_unified_cache.h:
|
||||||
|
core/front/../hakmem_build_flags.h:
|
||||||
|
core/front/../hakmem_tiny_config.h:
|
||||||
|
core/front/../box/unified_batch_box.h:
|
||||||
|
core/front/../tiny_tls.h:
|
||||||
|
core/front/../hakmem_tiny_superslab.h:
|
||||||
|
core/front/../superslab/superslab_types.h:
|
||||||
|
core/hakmem_tiny_superslab_constants.h:
|
||||||
|
core/front/../superslab/superslab_inline.h:
|
||||||
|
core/front/../superslab/superslab_types.h:
|
||||||
|
core/front/../tiny_debug_ring.h:
|
||||||
|
core/front/../hakmem_build_flags.h:
|
||||||
|
core/front/../tiny_remote.h:
|
||||||
|
core/front/../hakmem_tiny_superslab_constants.h:
|
||||||
|
core/front/../tiny_box_geometry.h:
|
||||||
|
core/front/../hakmem_tiny_config.h:
|
||||||
|
core/front/../box/tiny_next_ptr_box.h:
|
||||||
|
core/hakmem_tiny_config.h:
|
||||||
|
core/tiny_nextptr.h:
|
||||||
|
core/hakmem_build_flags.h:
|
||||||
|
core/front/../hakmem_tiny_superslab.h:
|
||||||
|
core/front/../superslab/superslab_inline.h:
|
||||||
|
core/front/../box/pagefault_telemetry_box.h:
|
||||||
233
core/front/tiny_unified_cache.h
Normal file
233
core/front/tiny_unified_cache.h
Normal file
@ -0,0 +1,233 @@
|
|||||||
|
// tiny_unified_cache.h - Phase 23: Unified Frontend Cache (tcache-style)
|
||||||
|
//
|
||||||
|
// Goal: Flatten 4-5 layer frontend cascade into single-layer array cache
|
||||||
|
// Target: +50-100% performance (20.3M → 30-40M ops/s)
|
||||||
|
//
|
||||||
|
// Design (Task-sensei analysis):
|
||||||
|
// - Replace: Ring → FastCache → SFC → TLS SLL (4 layers, 8-10 cache misses)
|
||||||
|
// - With: Single unified array cache per class (1 layer, 2-3 cache misses)
|
||||||
|
// - Fallback: Direct SuperSlab refill (skip intermediate layers)
|
||||||
|
//
|
||||||
|
// Performance:
|
||||||
|
// - Alloc: 2-3 cache misses (TLS access + array access)
|
||||||
|
// - Free: 2-3 cache misses (similar to System malloc tcache)
|
||||||
|
// - vs Current: 8-10 cache misses → 2-3 cache misses (70% reduction)
|
||||||
|
//
|
||||||
|
// ENV Variables:
|
||||||
|
// HAKMEM_TINY_UNIFIED_CACHE=1 # Enable Unified cache (default: 0, OFF)
|
||||||
|
// HAKMEM_TINY_UNIFIED_C0=128 # C0 cache size (default: 128)
|
||||||
|
// ...
|
||||||
|
// HAKMEM_TINY_UNIFIED_C7=128 # C7 cache size (default: 128)
|
||||||
|
|
||||||
|
#ifndef HAK_FRONT_TINY_UNIFIED_CACHE_H
|
||||||
|
#define HAK_FRONT_TINY_UNIFIED_CACHE_H
|
||||||
|
|
||||||
|
#include <stdint.h>
|
||||||
|
#include <stdlib.h>
|
||||||
|
#include <stdio.h>
|
||||||
|
#include "../hakmem_build_flags.h"
|
||||||
|
#include "../hakmem_tiny_config.h" // For TINY_NUM_CLASSES
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Unified Cache Structure (per class)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
typedef struct {
|
||||||
|
void** slots; // Dynamic array (allocated at init, power-of-2 size)
|
||||||
|
uint16_t head; // Pop index (consumer)
|
||||||
|
uint16_t tail; // Push index (producer)
|
||||||
|
uint16_t capacity; // Cache size (power of 2 for fast modulo: & (capacity-1))
|
||||||
|
uint16_t mask; // Capacity - 1 (for fast modulo)
|
||||||
|
} TinyUnifiedCache;
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// External TLS Variables (defined in tiny_unified_cache.c)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
extern __thread TinyUnifiedCache g_unified_cache[TINY_NUM_CLASSES];
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Metrics (Phase 23, optional for debugging)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
extern __thread uint64_t g_unified_cache_hit[TINY_NUM_CLASSES]; // Alloc hits
|
||||||
|
extern __thread uint64_t g_unified_cache_miss[TINY_NUM_CLASSES]; // Alloc misses
|
||||||
|
extern __thread uint64_t g_unified_cache_push[TINY_NUM_CLASSES]; // Free pushes
|
||||||
|
extern __thread uint64_t g_unified_cache_full[TINY_NUM_CLASSES]; // Free full (fallback to SuperSlab)
|
||||||
|
#endif
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// ENV Control (cached, lazy init)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Enable flag (default: 0, OFF)
|
||||||
|
static inline int unified_cache_enabled(void) {
|
||||||
|
static int g_enable = -1;
|
||||||
|
if (__builtin_expect(g_enable == -1, 0)) {
|
||||||
|
const char* e = getenv("HAKMEM_TINY_UNIFIED_CACHE");
|
||||||
|
g_enable = (e && *e && *e != '0') ? 1 : 0;
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
if (g_enable) {
|
||||||
|
fprintf(stderr, "[Unified-INIT] unified_cache_enabled() = %d\n", g_enable);
|
||||||
|
fflush(stderr);
|
||||||
|
}
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
return g_enable;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Per-class capacity (default: 128 for all classes)
|
||||||
|
static inline size_t unified_capacity(int class_idx) {
|
||||||
|
static size_t g_cap[TINY_NUM_CLASSES] = {0};
|
||||||
|
if (__builtin_expect(g_cap[class_idx] == 0, 0)) {
|
||||||
|
char env_name[64];
|
||||||
|
snprintf(env_name, sizeof(env_name), "HAKMEM_TINY_UNIFIED_C%d", class_idx);
|
||||||
|
const char* e = getenv(env_name);
|
||||||
|
g_cap[class_idx] = (e && *e) ? (size_t)atoi(e) : 128; // Default: 128
|
||||||
|
|
||||||
|
// Round up to power of 2 (for fast modulo)
|
||||||
|
if (g_cap[class_idx] < 32) g_cap[class_idx] = 32;
|
||||||
|
if (g_cap[class_idx] > 512) g_cap[class_idx] = 512;
|
||||||
|
|
||||||
|
// Ensure power of 2
|
||||||
|
size_t pow2 = 32;
|
||||||
|
while (pow2 < g_cap[class_idx]) pow2 *= 2;
|
||||||
|
g_cap[class_idx] = pow2;
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[Unified-INIT] C%d capacity = %zu (power of 2)\n", class_idx, g_cap[class_idx]);
|
||||||
|
fflush(stderr);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
return g_cap[class_idx];
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Init/Shutdown Forward Declarations
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
void unified_cache_init(void);
|
||||||
|
void unified_cache_shutdown(void);
|
||||||
|
void unified_cache_print_stats(void);
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Phase 23-D: Self-Contained Refill (Box U1 + Box U2 integration)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Batch refill from SuperSlab (called on cache miss)
|
||||||
|
// Returns: BASE pointer (first block), or NULL if failed
|
||||||
|
void* unified_cache_refill(int class_idx);
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Ultra-Fast Pop/Push (2-3 cache misses, tcache-style)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Pop from unified cache (alloc fast path)
|
||||||
|
// Returns: BASE pointer (caller must convert to USER with +1)
|
||||||
|
static inline void* unified_cache_pop(int class_idx) {
|
||||||
|
// Fast path: Unified cache disabled → return NULL immediately
|
||||||
|
if (__builtin_expect(!unified_cache_enabled(), 0)) return NULL;
|
||||||
|
|
||||||
|
TinyUnifiedCache* cache = &g_unified_cache[class_idx]; // 1 cache miss (TLS)
|
||||||
|
|
||||||
|
// Lazy init check (once per thread, per class)
|
||||||
|
if (__builtin_expect(cache->slots == NULL, 0)) {
|
||||||
|
unified_cache_init(); // First call in this thread
|
||||||
|
// Re-check after init (may fail if allocation failed)
|
||||||
|
if (cache->slots == NULL) return NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Empty check
|
||||||
|
if (__builtin_expect(cache->head == cache->tail, 0)) {
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_miss[class_idx]++;
|
||||||
|
#endif
|
||||||
|
return NULL; // Empty
|
||||||
|
}
|
||||||
|
|
||||||
|
// Pop from head (consumer)
|
||||||
|
void* base = cache->slots[cache->head]; // 1 cache miss (array access)
|
||||||
|
cache->head = (cache->head + 1) & cache->mask; // Fast modulo (power of 2)
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_hit[class_idx]++;
|
||||||
|
#endif
|
||||||
|
|
||||||
|
return base; // Return BASE pointer (2-3 cache misses total)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Push to unified cache (free fast path)
|
||||||
|
// Input: BASE pointer (caller must pass BASE, not USER)
|
||||||
|
// Returns: 1=SUCCESS, 0=FULL
|
||||||
|
static inline int unified_cache_push(int class_idx, void* base) {
|
||||||
|
// Fast path: Unified cache disabled → return 0 (not handled)
|
||||||
|
if (__builtin_expect(!unified_cache_enabled(), 0)) return 0;
|
||||||
|
|
||||||
|
TinyUnifiedCache* cache = &g_unified_cache[class_idx]; // 1 cache miss (TLS)
|
||||||
|
|
||||||
|
// Lazy init check (once per thread, per class)
|
||||||
|
if (__builtin_expect(cache->slots == NULL, 0)) {
|
||||||
|
unified_cache_init(); // First call in this thread
|
||||||
|
// Re-check after init (may fail if allocation failed)
|
||||||
|
if (cache->slots == NULL) return 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
uint16_t next_tail = (cache->tail + 1) & cache->mask;
|
||||||
|
|
||||||
|
// Full check (leave 1 slot empty to distinguish full/empty)
|
||||||
|
if (__builtin_expect(next_tail == cache->head, 0)) {
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_full[class_idx]++;
|
||||||
|
#endif
|
||||||
|
return 0; // Full
|
||||||
|
}
|
||||||
|
|
||||||
|
// Push to tail (producer)
|
||||||
|
cache->slots[cache->tail] = base; // 1 cache miss (array write)
|
||||||
|
cache->tail = next_tail;
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_push[class_idx]++;
|
||||||
|
#endif
|
||||||
|
|
||||||
|
return 1; // SUCCESS (2-3 cache misses total)
|
||||||
|
}
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Phase 23-D: Self-Contained Pop-or-Refill (tcache-style, single-layer)
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// All-in-one: Pop from cache, or refill from SuperSlab on miss
|
||||||
|
// Returns: BASE pointer (caller converts to USER), or NULL if failed
|
||||||
|
// Design: Self-contained, bypasses all other frontend layers (Ring/FC/SFC/SLL)
|
||||||
|
static inline void* unified_cache_pop_or_refill(int class_idx) {
|
||||||
|
// Fast path: Unified cache disabled → return NULL (caller uses legacy cascade)
|
||||||
|
if (__builtin_expect(!unified_cache_enabled(), 0)) return NULL;
|
||||||
|
|
||||||
|
TinyUnifiedCache* cache = &g_unified_cache[class_idx]; // 1 cache miss (TLS)
|
||||||
|
|
||||||
|
// Lazy init check (once per thread, per class)
|
||||||
|
if (__builtin_expect(cache->slots == NULL, 0)) {
|
||||||
|
unified_cache_init();
|
||||||
|
if (cache->slots == NULL) return NULL;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Try pop from cache (fast path)
|
||||||
|
if (__builtin_expect(cache->head != cache->tail, 1)) {
|
||||||
|
void* base = cache->slots[cache->head]; // 1 cache miss (array access)
|
||||||
|
cache->head = (cache->head + 1) & cache->mask;
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_hit[class_idx]++;
|
||||||
|
#endif
|
||||||
|
return base; // Hit! (2-3 cache misses total)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Cache miss → Batch refill from SuperSlab
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
g_unified_cache_miss[class_idx]++;
|
||||||
|
#endif
|
||||||
|
return unified_cache_refill(class_idx); // Refill + return first block
|
||||||
|
}
|
||||||
|
|
||||||
|
#endif // HAK_FRONT_TINY_UNIFIED_CACHE_H
|
||||||
@ -50,6 +50,7 @@
|
|||||||
#include "hakmem_config.h"
|
#include "hakmem_config.h"
|
||||||
#include "hakmem_internal.h" // For AllocHeader and HAKMEM_MAGIC
|
#include "hakmem_internal.h" // For AllocHeader and HAKMEM_MAGIC
|
||||||
#include "hakmem_syscall.h" // Phase 6.X P0 Fix: Box 3 syscall layer (bypasses LD_PRELOAD)
|
#include "hakmem_syscall.h" // Phase 6.X P0 Fix: Box 3 syscall layer (bypasses LD_PRELOAD)
|
||||||
|
#include "box/pagefault_telemetry_box.h" // Box PageFaultTelemetry (PF_BUCKET_L25)
|
||||||
#include <stdlib.h>
|
#include <stdlib.h>
|
||||||
#include <string.h>
|
#include <string.h>
|
||||||
#include <stdio.h>
|
#include <stdio.h>
|
||||||
@ -343,6 +344,11 @@ static inline int l25_alloc_new_run(int class_idx) {
|
|||||||
// Register page descriptors for headerless free
|
// Register page descriptors for headerless free
|
||||||
l25_desc_insert_range(ar->base, ar->end, class_idx);
|
l25_desc_insert_range(ar->base, ar->end, class_idx);
|
||||||
|
|
||||||
|
// PageFaultTelemetry: mark all backing pages for this run (approximate)
|
||||||
|
for (size_t off = 0; off < run_bytes; off += 4096) {
|
||||||
|
pagefault_telemetry_touch(PF_BUCKET_L25, ar->base + off);
|
||||||
|
}
|
||||||
|
|
||||||
// Stats (best-effort)
|
// Stats (best-effort)
|
||||||
g_l25_pool.total_bytes_allocated += run_bytes;
|
g_l25_pool.total_bytes_allocated += run_bytes;
|
||||||
g_l25_pool.total_bundles_allocated += blocks;
|
g_l25_pool.total_bundles_allocated += blocks;
|
||||||
|
|||||||
@ -1,6 +1,7 @@
|
|||||||
#include "hakmem_shared_pool.h"
|
#include "hakmem_shared_pool.h"
|
||||||
#include "hakmem_tiny_superslab.h"
|
#include "hakmem_tiny_superslab.h"
|
||||||
#include "hakmem_tiny_superslab_constants.h"
|
#include "hakmem_tiny_superslab_constants.h"
|
||||||
|
#include "box/pagefault_telemetry_box.h" // Box PageFaultTelemetry (PF_BUCKET_SS_META)
|
||||||
|
|
||||||
#include <stdlib.h>
|
#include <stdlib.h>
|
||||||
#include <string.h>
|
#include <string.h>
|
||||||
@ -477,6 +478,12 @@ shared_pool_allocate_superslab_unlocked(void)
|
|||||||
return NULL;
|
return NULL;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// PageFaultTelemetry: mark all backing pages for this Superslab (approximate)
|
||||||
|
size_t ss_bytes = (size_t)1 << ss->lg_size;
|
||||||
|
for (size_t off = 0; off < ss_bytes; off += 4096) {
|
||||||
|
pagefault_telemetry_touch(PF_BUCKET_SS_META, (char*)ss + off);
|
||||||
|
}
|
||||||
|
|
||||||
// superslab_allocate() already:
|
// superslab_allocate() already:
|
||||||
// - zeroes slab metadata / remote queues,
|
// - zeroes slab metadata / remote queues,
|
||||||
// - sets magic/lg_size/etc,
|
// - sets magic/lg_size/etc,
|
||||||
|
|||||||
@ -121,7 +121,8 @@ typedef struct SharedSuperSlabPool {
|
|||||||
|
|
||||||
// SharedSSMeta array for all SuperSlabs in pool
|
// SharedSSMeta array for all SuperSlabs in pool
|
||||||
// RACE FIX: Fixed-size array (no realloc!) to avoid race with lock-free Stage 2
|
// RACE FIX: Fixed-size array (no realloc!) to avoid race with lock-free Stage 2
|
||||||
#define MAX_SS_METADATA_ENTRIES 2048
|
// LARSON FIX (2025-11-16): Increased from 2048 → 8192 for MT churn workloads
|
||||||
|
#define MAX_SS_METADATA_ENTRIES 8192
|
||||||
SharedSSMeta ss_metadata[MAX_SS_METADATA_ENTRIES]; // Fixed-size array
|
SharedSSMeta ss_metadata[MAX_SS_METADATA_ENTRIES]; // Fixed-size array
|
||||||
_Atomic uint32_t ss_meta_count; // Used entries (atomic for lock-free Stage 2)
|
_Atomic uint32_t ss_meta_count; // Used entries (atomic for lock-free Stage 2)
|
||||||
} SharedSuperSlabPool;
|
} SharedSuperSlabPool;
|
||||||
|
|||||||
@ -44,12 +44,13 @@ core/hakmem_tiny.o: core/hakmem_tiny.c core/hakmem_tiny.h \
|
|||||||
core/tiny_atomic.h core/tiny_alloc_fast.inc.h \
|
core/tiny_atomic.h core/tiny_alloc_fast.inc.h \
|
||||||
core/tiny_alloc_fast_sfc.inc.h core/hakmem_tiny_fastcache.inc.h \
|
core/tiny_alloc_fast_sfc.inc.h core/hakmem_tiny_fastcache.inc.h \
|
||||||
core/front/tiny_front_c23.h core/front/../hakmem_build_flags.h \
|
core/front/tiny_front_c23.h core/front/../hakmem_build_flags.h \
|
||||||
core/front/tiny_ring_cache.h core/front/tiny_heap_v2.h \
|
core/front/tiny_ring_cache.h core/front/tiny_unified_cache.h \
|
||||||
|
core/front/../hakmem_tiny_config.h core/front/tiny_heap_v2.h \
|
||||||
core/front/tiny_ultra_hot.h core/front/../box/tls_sll_box.h \
|
core/front/tiny_ultra_hot.h core/front/../box/tls_sll_box.h \
|
||||||
core/box/front_metrics_box.h core/tiny_alloc_fast_inline.h \
|
core/box/front_metrics_box.h core/hakmem_tiny_lazy_init.inc.h \
|
||||||
core/tiny_free_fast.inc.h core/hakmem_tiny_alloc.inc \
|
core/tiny_alloc_fast_inline.h core/tiny_free_fast.inc.h \
|
||||||
core/hakmem_tiny_slow.inc core/hakmem_tiny_free.inc \
|
core/hakmem_tiny_alloc.inc core/hakmem_tiny_slow.inc \
|
||||||
core/box/free_publish_box.h core/mid_tcache.h \
|
core/hakmem_tiny_free.inc core/box/free_publish_box.h core/mid_tcache.h \
|
||||||
core/tiny_free_magazine.inc.h core/tiny_superslab_alloc.inc.h \
|
core/tiny_free_magazine.inc.h core/tiny_superslab_alloc.inc.h \
|
||||||
core/box/superslab_expansion_box.h \
|
core/box/superslab_expansion_box.h \
|
||||||
core/box/../superslab/superslab_types.h core/box/../tiny_tls.h \
|
core/box/../superslab/superslab_types.h core/box/../tiny_tls.h \
|
||||||
@ -155,10 +156,13 @@ core/hakmem_tiny_fastcache.inc.h:
|
|||||||
core/front/tiny_front_c23.h:
|
core/front/tiny_front_c23.h:
|
||||||
core/front/../hakmem_build_flags.h:
|
core/front/../hakmem_build_flags.h:
|
||||||
core/front/tiny_ring_cache.h:
|
core/front/tiny_ring_cache.h:
|
||||||
|
core/front/tiny_unified_cache.h:
|
||||||
|
core/front/../hakmem_tiny_config.h:
|
||||||
core/front/tiny_heap_v2.h:
|
core/front/tiny_heap_v2.h:
|
||||||
core/front/tiny_ultra_hot.h:
|
core/front/tiny_ultra_hot.h:
|
||||||
core/front/../box/tls_sll_box.h:
|
core/front/../box/tls_sll_box.h:
|
||||||
core/box/front_metrics_box.h:
|
core/box/front_metrics_box.h:
|
||||||
|
core/hakmem_tiny_lazy_init.inc.h:
|
||||||
core/tiny_alloc_fast_inline.h:
|
core/tiny_alloc_fast_inline.h:
|
||||||
core/tiny_free_fast.inc.h:
|
core/tiny_free_fast.inc.h:
|
||||||
core/hakmem_tiny_alloc.inc:
|
core/hakmem_tiny_alloc.inc:
|
||||||
|
|||||||
139
core/hakmem_tiny_lazy_init.inc.h
Normal file
139
core/hakmem_tiny_lazy_init.inc.h
Normal file
@ -0,0 +1,139 @@
|
|||||||
|
// hakmem_tiny_lazy_init.inc.h - Phase 22: Lazy Per-Class Initialization
|
||||||
|
// Goal: Reduce cold-start page faults by initializing only used classes
|
||||||
|
//
|
||||||
|
// ChatGPT Analysis (2025-11-16):
|
||||||
|
// - hak_tiny_init() page faults: 94.94% of all page faults
|
||||||
|
// - Cause: Eager init of all 8 classes even if only C2/C3 used
|
||||||
|
// - Solution: Lazy init per class on first use
|
||||||
|
//
|
||||||
|
// Expected Impact:
|
||||||
|
// - Page faults: -90% (only touch C2/C3 for 256B workload)
|
||||||
|
// - Cold start: +30-40% performance (16.2M → 22-25M ops/s)
|
||||||
|
|
||||||
|
#ifndef HAKMEM_TINY_LAZY_INIT_INC_H
|
||||||
|
#define HAKMEM_TINY_LAZY_INIT_INC_H
|
||||||
|
|
||||||
|
#include <pthread.h>
|
||||||
|
#include <stdint.h>
|
||||||
|
#include "superslab/superslab_types.h" // For SuperSlabACEState
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Phase 22-1: Per-Class Initialization State
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Track which classes are initialized (per-thread)
|
||||||
|
__thread uint8_t g_class_initialized[TINY_NUM_CLASSES] = {0};
|
||||||
|
|
||||||
|
// Global one-time init flag (for shared resources)
|
||||||
|
static int g_tiny_global_initialized = 0;
|
||||||
|
static pthread_mutex_t g_lazy_init_lock = PTHREAD_MUTEX_INITIALIZER;
|
||||||
|
|
||||||
|
// ============================================================================
|
||||||
|
// Phase 22-2: Lazy Init Implementation
|
||||||
|
// ============================================================================
|
||||||
|
|
||||||
|
// Initialize one class lazily (called on first use)
|
||||||
|
static inline void lazy_init_class(int class_idx) {
|
||||||
|
// Fast path: already initialized
|
||||||
|
if (__builtin_expect(g_class_initialized[class_idx], 1)) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Slow path: need to initialize this class
|
||||||
|
pthread_mutex_lock(&g_lazy_init_lock);
|
||||||
|
|
||||||
|
// Double-check after acquiring lock
|
||||||
|
if (g_class_initialized[class_idx]) {
|
||||||
|
pthread_mutex_unlock(&g_lazy_init_lock);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extract from hak_tiny_init.inc lines 84-103: TLS List Init
|
||||||
|
{
|
||||||
|
TinyTLSList* tls = &g_tls_lists[class_idx];
|
||||||
|
tls->head = NULL;
|
||||||
|
tls->count = 0;
|
||||||
|
uint32_t base_cap = (uint32_t)tiny_default_cap(class_idx);
|
||||||
|
uint32_t class_max = (uint32_t)tiny_cap_max_for_class(class_idx);
|
||||||
|
if (base_cap > class_max) base_cap = class_max;
|
||||||
|
|
||||||
|
// Apply global cap limit if set
|
||||||
|
extern int g_mag_cap_limit;
|
||||||
|
extern int g_mag_cap_override[TINY_NUM_CLASSES];
|
||||||
|
if ((uint32_t)g_mag_cap_limit < base_cap) base_cap = (uint32_t)g_mag_cap_limit;
|
||||||
|
if (g_mag_cap_override[class_idx] > 0) {
|
||||||
|
uint32_t ov = (uint32_t)g_mag_cap_override[class_idx];
|
||||||
|
if (ov > class_max) ov = class_max;
|
||||||
|
if (ov > (uint32_t)g_mag_cap_limit) ov = (uint32_t)g_mag_cap_limit;
|
||||||
|
if (ov != 0u) base_cap = ov;
|
||||||
|
}
|
||||||
|
if (base_cap == 0u) base_cap = 32u;
|
||||||
|
|
||||||
|
tls->cap = base_cap;
|
||||||
|
tls->refill_low = tiny_tls_default_refill(base_cap);
|
||||||
|
tls->spill_high = tiny_tls_default_spill(base_cap);
|
||||||
|
tiny_tls_publish_targets(class_idx, base_cap);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extract from hak_tiny_init.inc lines 623-625: Per-class lock
|
||||||
|
pthread_mutex_init(&g_tiny_class_locks[class_idx].m, NULL);
|
||||||
|
|
||||||
|
// Extract from hak_tiny_init.inc lines 628-637: ACE state
|
||||||
|
{
|
||||||
|
extern SuperSlabACEState g_ss_ace[TINY_NUM_CLASSES];
|
||||||
|
g_ss_ace[class_idx].current_lg = 20; // Start with 1MB SuperSlabs
|
||||||
|
g_ss_ace[class_idx].target_lg = 20;
|
||||||
|
g_ss_ace[class_idx].hot_score = 0;
|
||||||
|
g_ss_ace[class_idx].alloc_count = 0;
|
||||||
|
g_ss_ace[class_idx].refill_count = 0;
|
||||||
|
g_ss_ace[class_idx].spill_count = 0;
|
||||||
|
g_ss_ace[class_idx].live_blocks = 0;
|
||||||
|
g_ss_ace[class_idx].last_tick_ns = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Mark as initialized
|
||||||
|
g_class_initialized[class_idx] = 1;
|
||||||
|
|
||||||
|
pthread_mutex_unlock(&g_lazy_init_lock);
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[LAZY_INIT] Class %d initialized\n", class_idx);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
// Global initialization (called once, for non-class resources)
|
||||||
|
static inline void lazy_init_global(void) {
|
||||||
|
if (__builtin_expect(g_tiny_global_initialized, 1)) {
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
pthread_mutex_lock(&g_lazy_init_lock);
|
||||||
|
|
||||||
|
if (g_tiny_global_initialized) {
|
||||||
|
pthread_mutex_unlock(&g_lazy_init_lock);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initialize SuperSlab subsystem (only once)
|
||||||
|
extern int g_use_superslab;
|
||||||
|
if (g_use_superslab) {
|
||||||
|
extern void hak_super_registry_init(void);
|
||||||
|
extern void hak_ss_lru_init(void);
|
||||||
|
extern void hak_ss_prewarm_init(void);
|
||||||
|
|
||||||
|
hak_super_registry_init();
|
||||||
|
hak_ss_lru_init();
|
||||||
|
hak_ss_prewarm_init();
|
||||||
|
}
|
||||||
|
|
||||||
|
// Mark global resources as initialized
|
||||||
|
g_tiny_global_initialized = 1;
|
||||||
|
|
||||||
|
pthread_mutex_unlock(&g_lazy_init_lock);
|
||||||
|
|
||||||
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
|
fprintf(stderr, "[LAZY_INIT] Global resources initialized\n");
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
|
#endif // HAKMEM_TINY_LAZY_INIT_INC_H
|
||||||
@ -29,10 +29,12 @@
|
|||||||
#ifdef HAKMEM_TINY_HEADER_CLASSIDX
|
#ifdef HAKMEM_TINY_HEADER_CLASSIDX
|
||||||
#include "front/tiny_front_c23.h" // Phase B: Ultra-simple C2/C3 front
|
#include "front/tiny_front_c23.h" // Phase B: Ultra-simple C2/C3 front
|
||||||
#include "front/tiny_ring_cache.h" // Phase 21-1: Ring cache (C2/C3 array-based TLS cache)
|
#include "front/tiny_ring_cache.h" // Phase 21-1: Ring cache (C2/C3 array-based TLS cache)
|
||||||
|
#include "front/tiny_unified_cache.h" // Phase 23: Unified frontend cache (tcache-style, all classes)
|
||||||
#include "front/tiny_heap_v2.h" // Phase 13-A: TinyHeapV2 magazine front
|
#include "front/tiny_heap_v2.h" // Phase 13-A: TinyHeapV2 magazine front
|
||||||
#include "front/tiny_ultra_hot.h" // Phase 14: TinyUltraHot C1/C2 ultra-fast path
|
#include "front/tiny_ultra_hot.h" // Phase 14: TinyUltraHot C1/C2 ultra-fast path
|
||||||
#endif
|
#endif
|
||||||
#include "box/front_metrics_box.h" // Phase 19-1: Frontend layer metrics
|
#include "box/front_metrics_box.h" // Phase 19-1: Frontend layer metrics
|
||||||
|
#include "hakmem_tiny_lazy_init.inc.h" // Phase 22: Lazy per-class initialization
|
||||||
#include <stdio.h>
|
#include <stdio.h>
|
||||||
|
|
||||||
// Phase 7 Task 2: Aggressive inline TLS cache access
|
// Phase 7 Task 2: Aggressive inline TLS cache access
|
||||||
@ -562,6 +564,9 @@ static inline void* tiny_alloc_fast(size_t size) {
|
|||||||
uint64_t call_num = atomic_fetch_add(&alloc_call_count, 1);
|
uint64_t call_num = atomic_fetch_add(&alloc_call_count, 1);
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
|
// Phase 22: Global init (once per process)
|
||||||
|
lazy_init_global();
|
||||||
|
|
||||||
// 1. Size → class index (inline, fast)
|
// 1. Size → class index (inline, fast)
|
||||||
int class_idx = hak_tiny_size_to_class(size);
|
int class_idx = hak_tiny_size_to_class(size);
|
||||||
|
|
||||||
@ -569,6 +574,9 @@ static inline void* tiny_alloc_fast(size_t size) {
|
|||||||
return NULL; // Size > 1KB, not Tiny
|
return NULL; // Size > 1KB, not Tiny
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Phase 22: Lazy per-class init (on first use)
|
||||||
|
lazy_init_class(class_idx);
|
||||||
|
|
||||||
#if !HAKMEM_BUILD_RELEASE
|
#if !HAKMEM_BUILD_RELEASE
|
||||||
// Phase 3: Debug checks eliminated in release builds
|
// Phase 3: Debug checks eliminated in release builds
|
||||||
// CRITICAL: Bounds check to catch corruption
|
// CRITICAL: Bounds check to catch corruption
|
||||||
@ -606,8 +614,26 @@ static inline void* tiny_alloc_fast(size_t size) {
|
|||||||
}
|
}
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
|
// Phase 23-E: Unified Frontend Cache (self-contained, single-layer tcache)
|
||||||
|
// ENV-gated: HAKMEM_TINY_UNIFIED_CACHE=1 (default: OFF)
|
||||||
|
// Design: Pop-or-Refill → Direct SuperSlab batch refill (bypasses ALL frontend layers)
|
||||||
|
// Target: 20-30% improvement (25-27M ops/s) via cache miss reduction (8-10 → 2-3)
|
||||||
|
if (__builtin_expect(unified_cache_enabled(), 0)) {
|
||||||
|
void* base = unified_cache_pop_or_refill(class_idx);
|
||||||
|
if (base) {
|
||||||
|
// Unified cache hit OR refill success - return USER pointer (BASE + 1)
|
||||||
|
HAK_RET_ALLOC(class_idx, base);
|
||||||
|
}
|
||||||
|
// Unified cache is enabled but refill failed (OOM) → go directly to slow path.
|
||||||
|
ptr = hak_tiny_alloc_slow(size, class_idx);
|
||||||
|
if (ptr) {
|
||||||
|
HAK_RET_ALLOC(class_idx, ptr);
|
||||||
|
}
|
||||||
|
return ptr;
|
||||||
|
}
|
||||||
|
|
||||||
// Phase 21-1: Ring Cache (C2/C3 only) - Array-based TLS cache
|
// Phase 21-1: Ring Cache (C2/C3 only) - Array-based TLS cache
|
||||||
// ENV-gated: HAKMEM_TINY_HOT_RING_ENABLE=1
|
// ENV-gated: HAKMEM_TINY_HOT_RING_ENABLE=1 (default: ON after Phase 21-1-D)
|
||||||
// Target: +15-20% (54.4M → 62-65M ops/s) by eliminating pointer chasing
|
// Target: +15-20% (54.4M → 62-65M ops/s) by eliminating pointer chasing
|
||||||
// Design: Ring (L0) → SLL (L1) → SuperSlab (L2) cascade hierarchy
|
// Design: Ring (L0) → SLL (L1) → SuperSlab (L2) cascade hierarchy
|
||||||
if (class_idx == 2 || class_idx == 3) {
|
if (class_idx == 2 || class_idx == 3) {
|
||||||
|
|||||||
27
core/tiny_alloc_fast_push.c
Normal file
27
core/tiny_alloc_fast_push.c
Normal file
@ -0,0 +1,27 @@
|
|||||||
|
// tiny_alloc_fast_push.c - Out-of-line helper for Box 5/6
|
||||||
|
// Purpose:
|
||||||
|
// Provide a non-inline definition of tiny_alloc_fast_push() for TUs
|
||||||
|
// that include tiny_free_fast_v2.inc.h / hak_free_api.inc.h without
|
||||||
|
// also including tiny_alloc_fast.inc.h.
|
||||||
|
//
|
||||||
|
// Box Theory:
|
||||||
|
// - Box 5 (Alloc Fast Path) owns the TLS freelist push semantics.
|
||||||
|
// - This file is a thin proxy that reuses existing Box APIs
|
||||||
|
// (front_gate_push_tls or tls_sll_push) without duplicating policy.
|
||||||
|
|
||||||
|
#include <stdint.h>
|
||||||
|
#include "hakmem_tiny_config.h"
|
||||||
|
#include "box/tls_sll_box.h"
|
||||||
|
#include "box/front_gate_box.h"
|
||||||
|
|
||||||
|
void tiny_alloc_fast_push(int class_idx, void* ptr) {
|
||||||
|
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
|
||||||
|
// When FrontGate Box is enabled, delegate to its TLS push helper.
|
||||||
|
front_gate_push_tls(class_idx, ptr);
|
||||||
|
#else
|
||||||
|
// Default: push directly into TLS SLL with "unbounded" capacity.
|
||||||
|
uint32_t capacity = UINT32_MAX;
|
||||||
|
(void)tls_sll_push(class_idx, ptr, capacity);
|
||||||
|
#endif
|
||||||
|
}
|
||||||
|
|
||||||
38
core/tiny_alloc_fast_push.d
Normal file
38
core/tiny_alloc_fast_push.d
Normal file
@ -0,0 +1,38 @@
|
|||||||
|
core/tiny_alloc_fast_push.o: core/tiny_alloc_fast_push.c \
|
||||||
|
core/hakmem_tiny_config.h core/box/tls_sll_box.h \
|
||||||
|
core/box/../hakmem_tiny_config.h core/box/../hakmem_build_flags.h \
|
||||||
|
core/box/../tiny_remote.h core/box/../tiny_region_id.h \
|
||||||
|
core/box/../hakmem_build_flags.h core/box/../tiny_box_geometry.h \
|
||||||
|
core/box/../hakmem_tiny_superslab_constants.h \
|
||||||
|
core/box/../hakmem_tiny_config.h core/box/../ptr_track.h \
|
||||||
|
core/box/../hakmem_tiny_integrity.h core/box/../hakmem_tiny.h \
|
||||||
|
core/box/../hakmem_trace.h core/box/../hakmem_tiny_mini_mag.h \
|
||||||
|
core/box/../ptr_track.h core/box/../ptr_trace.h \
|
||||||
|
core/box/../box/tiny_next_ptr_box.h core/hakmem_tiny_config.h \
|
||||||
|
core/tiny_nextptr.h core/hakmem_build_flags.h \
|
||||||
|
core/box/../tiny_debug_ring.h core/box/front_gate_box.h \
|
||||||
|
core/hakmem_tiny.h
|
||||||
|
core/hakmem_tiny_config.h:
|
||||||
|
core/box/tls_sll_box.h:
|
||||||
|
core/box/../hakmem_tiny_config.h:
|
||||||
|
core/box/../hakmem_build_flags.h:
|
||||||
|
core/box/../tiny_remote.h:
|
||||||
|
core/box/../tiny_region_id.h:
|
||||||
|
core/box/../hakmem_build_flags.h:
|
||||||
|
core/box/../tiny_box_geometry.h:
|
||||||
|
core/box/../hakmem_tiny_superslab_constants.h:
|
||||||
|
core/box/../hakmem_tiny_config.h:
|
||||||
|
core/box/../ptr_track.h:
|
||||||
|
core/box/../hakmem_tiny_integrity.h:
|
||||||
|
core/box/../hakmem_tiny.h:
|
||||||
|
core/box/../hakmem_trace.h:
|
||||||
|
core/box/../hakmem_tiny_mini_mag.h:
|
||||||
|
core/box/../ptr_track.h:
|
||||||
|
core/box/../ptr_trace.h:
|
||||||
|
core/box/../box/tiny_next_ptr_box.h:
|
||||||
|
core/hakmem_tiny_config.h:
|
||||||
|
core/tiny_nextptr.h:
|
||||||
|
core/hakmem_build_flags.h:
|
||||||
|
core/box/../tiny_debug_ring.h:
|
||||||
|
core/box/front_gate_box.h:
|
||||||
|
core/hakmem_tiny.h:
|
||||||
@ -15,6 +15,8 @@
|
|||||||
// 3. Done! (No lookup, no validation, no atomic)
|
// 3. Done! (No lookup, no validation, no atomic)
|
||||||
|
|
||||||
#pragma once
|
#pragma once
|
||||||
|
#include <stdlib.h> // For getenv() in cross-thread check ENV gate
|
||||||
|
#include <pthread.h> // For pthread_self() in cross-thread check
|
||||||
#include "tiny_region_id.h"
|
#include "tiny_region_id.h"
|
||||||
#include "hakmem_build_flags.h"
|
#include "hakmem_build_flags.h"
|
||||||
#include "hakmem_tiny_config.h" // For TINY_TLS_MAG_CAP, TINY_NUM_CLASSES
|
#include "hakmem_tiny_config.h" // For TINY_TLS_MAG_CAP, TINY_NUM_CLASSES
|
||||||
@ -24,6 +26,10 @@
|
|||||||
#include "front/tiny_heap_v2.h" // Phase 13-B: TinyHeapV2 magazine supply
|
#include "front/tiny_heap_v2.h" // Phase 13-B: TinyHeapV2 magazine supply
|
||||||
#include "front/tiny_ultra_hot.h" // Phase 14: TinyUltraHot C1/C2 ultra-fast path
|
#include "front/tiny_ultra_hot.h" // Phase 14: TinyUltraHot C1/C2 ultra-fast path
|
||||||
#include "front/tiny_ring_cache.h" // Phase 21-1: Ring cache (C2/C3 array-based TLS cache)
|
#include "front/tiny_ring_cache.h" // Phase 21-1: Ring cache (C2/C3 array-based TLS cache)
|
||||||
|
#include "front/tiny_unified_cache.h" // Phase 23: Unified frontend cache (tcache-style, all classes)
|
||||||
|
#include "hakmem_super_registry.h" // For hak_super_lookup (cross-thread check)
|
||||||
|
#include "superslab/superslab_inline.h" // For slab_index_for (cross-thread check)
|
||||||
|
#include "box/free_remote_box.h" // For tiny_free_remote_box (cross-thread routing)
|
||||||
|
|
||||||
// Phase 7: Header-based ultra-fast free
|
// Phase 7: Header-based ultra-fast free
|
||||||
#if HAKMEM_TINY_HEADER_CLASSIDX
|
#if HAKMEM_TINY_HEADER_CLASSIDX
|
||||||
@ -36,6 +42,11 @@ extern int g_tls_sll_enable; // Honored for fast free: when 0, fall back to slo
|
|||||||
// External functions
|
// External functions
|
||||||
extern void hak_tiny_free(void* ptr); // Fallback for non-header allocations
|
extern void hak_tiny_free(void* ptr); // Fallback for non-header allocations
|
||||||
|
|
||||||
|
// Inline helper: Get current thread ID (lower 32 bits)
|
||||||
|
static inline uint32_t tiny_self_u32_local(void) {
|
||||||
|
return (uint32_t)(uintptr_t)pthread_self();
|
||||||
|
}
|
||||||
|
|
||||||
// ========== Ultra-Fast Free (Header-based) ==========
|
// ========== Ultra-Fast Free (Header-based) ==========
|
||||||
|
|
||||||
// Ultra-fast free for header-based allocations
|
// Ultra-fast free for header-based allocations
|
||||||
@ -137,8 +148,21 @@ static inline int hak_tiny_free_fast_v2(void* ptr) {
|
|||||||
// → 正史(TLS SLL)の在庫を正しく保つ
|
// → 正史(TLS SLL)の在庫を正しく保つ
|
||||||
// → UltraHot refill は alloc 側で TLS SLL から借りる
|
// → UltraHot refill は alloc 側で TLS SLL から借りる
|
||||||
|
|
||||||
|
// Phase 23: Unified Frontend Cache (all classes) - tcache-style single-layer cache
|
||||||
|
// ENV-gated: HAKMEM_TINY_UNIFIED_CACHE=1 (default: OFF)
|
||||||
|
// Target: +50-100% (20.3M → 30-40M ops/s) by flattening 4-5 layer cascade
|
||||||
|
// Design: Single unified array cache (2-3 cache misses vs current 8-10)
|
||||||
|
if (__builtin_expect(unified_cache_enabled(), 0)) {
|
||||||
|
if (unified_cache_push(class_idx, base)) {
|
||||||
|
// Unified cache push success - done!
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
// Unified cache full while enabled → fall back to existing TLS helper directly.
|
||||||
|
return tiny_alloc_fast_push(class_idx, base);
|
||||||
|
}
|
||||||
|
|
||||||
// Phase 21-1: Ring Cache (C2/C3 only) - Array-based TLS cache
|
// Phase 21-1: Ring Cache (C2/C3 only) - Array-based TLS cache
|
||||||
// ENV-gated: HAKMEM_TINY_HOT_RING_ENABLE=1
|
// ENV-gated: HAKMEM_TINY_HOT_RING_ENABLE=1 (default: ON after Phase 21-1-D)
|
||||||
// Target: +15-20% (54.4M → 62-65M ops/s) by eliminating pointer chasing
|
// Target: +15-20% (54.4M → 62-65M ops/s) by eliminating pointer chasing
|
||||||
// Design: Ring (L0) → SLL (L1) → SuperSlab (L2) cascade hierarchy
|
// Design: Ring (L0) → SLL (L1) → SuperSlab (L2) cascade hierarchy
|
||||||
if (class_idx == 2 || class_idx == 3) {
|
if (class_idx == 2 || class_idx == 3) {
|
||||||
@ -163,6 +187,48 @@ static inline int hak_tiny_free_fast_v2(void* ptr) {
|
|||||||
// Magazine full → fall through to TLS SLL
|
// Magazine full → fall through to TLS SLL
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// LARSON FIX (2025-11-16): Cross-thread free detection - ENV GATED
|
||||||
|
// Problem: Larson MT crash - TLS SLL poison (0xbada55...) from cross-thread free
|
||||||
|
// Root cause: Block allocated by Thread A, freed by Thread B → pushed to B's TLS SLL
|
||||||
|
// → B allocates the block → metadata still points to A's SuperSlab → corruption
|
||||||
|
// Solution: Check owner_tid_low, route cross-thread free to remote queue
|
||||||
|
// Status: ENV-gated for performance (HAKMEM_TINY_LARSON_FIX=1 to enable)
|
||||||
|
// Performance: OFF=5-10 cycles/free, ON=110-520 cycles/free (registry lookup overhead)
|
||||||
|
{
|
||||||
|
// TLS-cached ENV check (initialized once per thread)
|
||||||
|
static __thread int g_larson_fix = -1;
|
||||||
|
if (__builtin_expect(g_larson_fix == -1, 0)) {
|
||||||
|
const char* e = getenv("HAKMEM_TINY_LARSON_FIX");
|
||||||
|
g_larson_fix = (e && *e && *e != '0') ? 1 : 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (__builtin_expect(g_larson_fix, 0)) {
|
||||||
|
// Cross-thread check enabled - MT safe mode
|
||||||
|
SuperSlab* ss = hak_super_lookup(base);
|
||||||
|
if (__builtin_expect(ss != NULL, 1)) {
|
||||||
|
int slab_idx = slab_index_for(ss, base);
|
||||||
|
if (__builtin_expect(slab_idx >= 0, 1)) {
|
||||||
|
uint32_t self_tid = tiny_self_u32_local();
|
||||||
|
uint8_t owner_tid_low = ss->slabs[slab_idx].owner_tid_low;
|
||||||
|
|
||||||
|
// Check if this is a cross-thread free (lower 8 bits mismatch)
|
||||||
|
if (__builtin_expect((owner_tid_low & 0xFF) != (self_tid & 0xFF), 0)) {
|
||||||
|
// Cross-thread free → remote queue routing
|
||||||
|
TinySlabMeta* meta = &ss->slabs[slab_idx];
|
||||||
|
if (tiny_free_remote_box(ss, slab_idx, meta, ptr, self_tid)) {
|
||||||
|
// Successfully queued to remote, done
|
||||||
|
return 1;
|
||||||
|
}
|
||||||
|
// Remote push failed → fall through to slow path
|
||||||
|
return 0;
|
||||||
|
}
|
||||||
|
// Same-thread free → continue to TLS SLL fast path below
|
||||||
|
}
|
||||||
|
}
|
||||||
|
// SuperSlab lookup failed → fall through to TLS SLL (may be headerless C7)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
// REVERT E3-2: Use Box TLS-SLL for all builds (testing hypothesis)
|
// REVERT E3-2: Use Box TLS-SLL for all builds (testing hypothesis)
|
||||||
// Hypothesis: Box TLS-SLL acts as verification layer, masking underlying bugs
|
// Hypothesis: Box TLS-SLL acts as verification layer, masking underlying bugs
|
||||||
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
|
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
|
||||||
|
|||||||
10
hakmem.d
10
hakmem.d
@ -36,7 +36,11 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
|
|||||||
core/box/../front/../hakmem_tiny.h core/box/../front/tiny_ultra_hot.h \
|
core/box/../front/../hakmem_tiny.h core/box/../front/tiny_ultra_hot.h \
|
||||||
core/box/../front/../box/tls_sll_box.h \
|
core/box/../front/../box/tls_sll_box.h \
|
||||||
core/box/../front/tiny_ring_cache.h \
|
core/box/../front/tiny_ring_cache.h \
|
||||||
core/box/../front/../hakmem_build_flags.h core/box/front_gate_v2.h \
|
core/box/../front/../hakmem_build_flags.h \
|
||||||
|
core/box/../front/tiny_unified_cache.h \
|
||||||
|
core/box/../front/../hakmem_tiny_config.h \
|
||||||
|
core/box/../superslab/superslab_inline.h \
|
||||||
|
core/box/../box/free_remote_box.h core/box/front_gate_v2.h \
|
||||||
core/box/external_guard_box.h core/box/hak_wrappers.inc.h \
|
core/box/external_guard_box.h core/box/hak_wrappers.inc.h \
|
||||||
core/box/front_gate_classifier.h
|
core/box/front_gate_classifier.h
|
||||||
core/hakmem.h:
|
core/hakmem.h:
|
||||||
@ -119,6 +123,10 @@ core/box/../front/tiny_ultra_hot.h:
|
|||||||
core/box/../front/../box/tls_sll_box.h:
|
core/box/../front/../box/tls_sll_box.h:
|
||||||
core/box/../front/tiny_ring_cache.h:
|
core/box/../front/tiny_ring_cache.h:
|
||||||
core/box/../front/../hakmem_build_flags.h:
|
core/box/../front/../hakmem_build_flags.h:
|
||||||
|
core/box/../front/tiny_unified_cache.h:
|
||||||
|
core/box/../front/../hakmem_tiny_config.h:
|
||||||
|
core/box/../superslab/superslab_inline.h:
|
||||||
|
core/box/../box/free_remote_box.h:
|
||||||
core/box/front_gate_v2.h:
|
core/box/front_gate_v2.h:
|
||||||
core/box/external_guard_box.h:
|
core/box/external_guard_box.h:
|
||||||
core/box/hak_wrappers.inc.h:
|
core/box/hak_wrappers.inc.h:
|
||||||
|
|||||||
@ -1,7 +1,8 @@
|
|||||||
hakmem_l25_pool.o: core/hakmem_l25_pool.c core/hakmem_l25_pool.h \
|
hakmem_l25_pool.o: core/hakmem_l25_pool.c core/hakmem_l25_pool.h \
|
||||||
core/hakmem_config.h core/hakmem_features.h core/hakmem_internal.h \
|
core/hakmem_config.h core/hakmem_features.h core/hakmem_internal.h \
|
||||||
core/hakmem.h core/hakmem_build_flags.h core/hakmem_sys.h \
|
core/hakmem.h core/hakmem_build_flags.h core/hakmem_sys.h \
|
||||||
core/hakmem_whale.h core/hakmem_syscall.h core/hakmem_prof.h \
|
core/hakmem_whale.h core/hakmem_syscall.h \
|
||||||
|
core/box/pagefault_telemetry_box.h core/hakmem_prof.h \
|
||||||
core/hakmem_debug.h core/hakmem_policy.h
|
core/hakmem_debug.h core/hakmem_policy.h
|
||||||
core/hakmem_l25_pool.h:
|
core/hakmem_l25_pool.h:
|
||||||
core/hakmem_config.h:
|
core/hakmem_config.h:
|
||||||
@ -12,6 +13,7 @@ core/hakmem_build_flags.h:
|
|||||||
core/hakmem_sys.h:
|
core/hakmem_sys.h:
|
||||||
core/hakmem_whale.h:
|
core/hakmem_whale.h:
|
||||||
core/hakmem_syscall.h:
|
core/hakmem_syscall.h:
|
||||||
|
core/box/pagefault_telemetry_box.h:
|
||||||
core/hakmem_prof.h:
|
core/hakmem_prof.h:
|
||||||
core/hakmem_debug.h:
|
core/hakmem_debug.h:
|
||||||
core/hakmem_policy.h:
|
core/hakmem_policy.h:
|
||||||
|
|||||||
@ -7,7 +7,8 @@ hakmem_pool.o: core/hakmem_pool.c core/hakmem_pool.h core/hakmem_config.h \
|
|||||||
core/box/pool_mf2_types.inc.h core/box/pool_mf2_helpers.inc.h \
|
core/box/pool_mf2_types.inc.h core/box/pool_mf2_helpers.inc.h \
|
||||||
core/box/pool_mf2_adoption.inc.h core/box/pool_tls_core.inc.h \
|
core/box/pool_mf2_adoption.inc.h core/box/pool_tls_core.inc.h \
|
||||||
core/box/pool_refill.inc.h core/box/pool_init_api.inc.h \
|
core/box/pool_refill.inc.h core/box/pool_init_api.inc.h \
|
||||||
core/box/pool_stats.inc.h core/box/pool_api.inc.h
|
core/box/pool_stats.inc.h core/box/pool_api.inc.h \
|
||||||
|
core/box/pagefault_telemetry_box.h
|
||||||
core/hakmem_pool.h:
|
core/hakmem_pool.h:
|
||||||
core/hakmem_config.h:
|
core/hakmem_config.h:
|
||||||
core/hakmem_features.h:
|
core/hakmem_features.h:
|
||||||
@ -31,3 +32,4 @@ core/box/pool_refill.inc.h:
|
|||||||
core/box/pool_init_api.inc.h:
|
core/box/pool_init_api.inc.h:
|
||||||
core/box/pool_stats.inc.h:
|
core/box/pool_stats.inc.h:
|
||||||
core/box/pool_api.inc.h:
|
core/box/pool_api.inc.h:
|
||||||
|
core/box/pagefault_telemetry_box.h:
|
||||||
|
|||||||
@ -3,7 +3,8 @@ hakmem_shared_pool.o: core/hakmem_shared_pool.c core/hakmem_shared_pool.h \
|
|||||||
core/hakmem_tiny_superslab.h core/superslab/superslab_inline.h \
|
core/hakmem_tiny_superslab.h core/superslab/superslab_inline.h \
|
||||||
core/superslab/superslab_types.h core/tiny_debug_ring.h \
|
core/superslab/superslab_types.h core/tiny_debug_ring.h \
|
||||||
core/hakmem_build_flags.h core/tiny_remote.h \
|
core/hakmem_build_flags.h core/tiny_remote.h \
|
||||||
core/hakmem_tiny_superslab_constants.h
|
core/hakmem_tiny_superslab_constants.h \
|
||||||
|
core/box/pagefault_telemetry_box.h
|
||||||
core/hakmem_shared_pool.h:
|
core/hakmem_shared_pool.h:
|
||||||
core/superslab/superslab_types.h:
|
core/superslab/superslab_types.h:
|
||||||
core/hakmem_tiny_superslab_constants.h:
|
core/hakmem_tiny_superslab_constants.h:
|
||||||
@ -14,3 +15,4 @@ core/tiny_debug_ring.h:
|
|||||||
core/hakmem_build_flags.h:
|
core/hakmem_build_flags.h:
|
||||||
core/tiny_remote.h:
|
core/tiny_remote.h:
|
||||||
core/hakmem_tiny_superslab_constants.h:
|
core/hakmem_tiny_superslab_constants.h:
|
||||||
|
core/box/pagefault_telemetry_box.h:
|
||||||
|
|||||||
@ -1,5 +1,3 @@
|
|||||||
pool_tls.o: core/pool_tls.c core/pool_tls.h core/pool_tls_registry.h \
|
pool_tls.o: core/pool_tls.c core/pool_tls.h core/pool_tls_registry.h
|
||||||
core/pool_tls_bind.h
|
|
||||||
core/pool_tls.h:
|
core/pool_tls.h:
|
||||||
core/pool_tls_registry.h:
|
core/pool_tls_registry.h:
|
||||||
core/pool_tls_bind.h:
|
|
||||||
|
|||||||
Reference in New Issue
Block a user