# Phase 12: Strategic Pause Results

**Date**: 2025-12-14

**Status**: 🔍 **CRITICAL FINDINGS** - System malloc significantly faster than hakmem

---

## Executive Summary

The Strategic Pause investigation revealed a striking fact: **system malloc (glibc ptmalloc2) is +63.7% faster than hakmem**.

This gap is larger than the entire Phase 6-10 optimization gain (+24.6%) and suggests that a **structural issue** may remain in hakmem.

---
## 1. Mixed Baseline (after Phase 10)

### 10-Run Statistics

```
Iterations: 20M, Working Set: 400, Seed: 1
```

| Metric | Value |
|--------|-------|
| **Mean** | **51.76M ops/s** |
| Median | 51.74M ops/s |
| Stdev | 0.53M ops/s |
| CV (Coefficient of Variation) | **1.03%** ✅ |
| Min | 51.00M ops/s |
| Max | 52.67M ops/s |

**Analysis**:
- Very stable run-to-run (CV 1.03%, i.e. stdev/mean = 0.53 / 51.76 ≈ 1.0%)
- With the cumulative Phase 6-10 gains, throughput holds steadily in the 51-52M ops/s range

---
## 2. Health Check

**Status**: ✅ **PASS**

| Profile | Throughput | Status |
|---------|-----------|--------|
| MIXED_TINYV3_C7_SAFE | 52.44M ops/s | ✅ PASS |
| C6_HEAVY_LEGACY_POOLV1 | 18.78M ops/s | ✅ PASS |

---
## 3. Perf Stat (Memory-System Metrics)

**Test**: 200M iterations, ws=400

| Metric | Value | Notes |
|--------|-------|-------|
| **Throughput** | **52.06M ops/s** | Baseline |
| Cycles | 16,765,096,198 | |
| Instructions | 37,198,361,135 | |
| **IPC** | **2.22** | Instructions per cycle |
| Branches | 9,361,888,532 | |
| **Branch miss rate** | **2.48%** | ✅ Good |
| Cache misses | 1,328,188 | 6.6 misses per 1K ops |
| dTLB load misses | 66,195 | 0.33 per 1K ops |
| Minor faults | 6,938 | |

**Analysis**:
- IPC of 2.22 is good (37,198,361,135 instructions / 16,765,096,198 cycles ≈ 2.22); instruction-level parallelism is working
- The 2.48% branch miss rate is low (branch prediction accuracy is high)
- Cache misses and dTLB misses are rare (good locality)

---
## 4. Allocator Comparison

**Critical Finding**: System malloc is unexpectedly fast.

### 4.1 Throughput Comparison (200M iterations)

| Allocator | Throughput | vs hakmem | RSS (MB) |
|-----------|-----------|-----------|----------|
| **hakmem** (Phase 10) | 52.43M ops/s | Baseline | 33.8 |
| **jemalloc** (LD_PRELOAD) | 48.60M ops/s | **-7.3%** | 35.6 |
| **system malloc** (glibc) | 85.96M ops/s | **+63.9%** 🚨 | N/A |

### 4.2 Verification (20M iterations)

| Allocator | Throughput | Time | vs hakmem |
|-----------|-----------|------|-----------|
| hakmem | 52.88M ops/s | 0.378s | Baseline |
| system malloc | 86.58M ops/s | 0.231s | **+63.7%** 🚨 |

**Analysis**:
- System malloc is **1.64x faster**, far exceeding hakmem's Phase 6-10 improvement of +24.6%
- jemalloc is **slightly slower** than hakmem (-7.3%)
- **Memory footprint is comparable** (hakmem 33.8MB vs jemalloc 35.6MB)

---
## 5. Gap Analysis: Why is system malloc so fast?

### Hypotheses (to be verified)

#### 5.1 Header Overhead

**hakmem**:
- Writes a 1-byte header on every allocation (`HAKMEM_TINY_HEADER_CLASSIDX`)
- User pointer = base + 1

**system malloc**:
- Places its header before the user pointer (user pointer = base)
- Possibly no extra write overhead

**Impact**: 1 write per allocation → 400M writes (200M alloc+free)
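To make the hypothesis concrete, below is a minimal sketch of the two layouts. The function names and the 1-byte class-index layout are illustrative assumptions for this note, not hakmem's or glibc's actual code.

```c
#include <stdint.h>

/* hakmem-style (assumed): a 1-byte class header sits in front of the user
 * data, so every allocation performs one extra byte write. */
static void *alloc_with_header(void *base, uint8_t class_idx) {
    *(uint8_t *)base = class_idx;       /* the per-allocation header write */
    return (uint8_t *)base + 1;         /* user pointer = base + 1 */
}

static uint8_t class_from_header(void *user_ptr) {
    return *((uint8_t *)user_ptr - 1);  /* free path reads the byte back */
}

/* ptmalloc2-style (simplified): the size word lives in the chunk header the
 * allocator maintains anyway and survives reuse, so the hot path can hand
 * out the payload pointer without an extra per-allocation byte write. */
static void *alloc_without_extra_write(void *payload) {
    return payload;                     /* user pointer = base */
}
```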
---

#### 5.2 Classification Overhead

**hakmem**:
- Size → Class LUT (`hak_tiny_size_to_class`)
- Class → Route → Handler

**system malloc**:
- Size → Bin (direct mapping, possibly without complex branching)

**Impact**: Fixed cost of the LUT lookup plus routing
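As a rough illustration of the two classification styles, here is a sketch with a hypothetical LUT and routing step; it does not reproduce the internals of `hak_tiny_size_to_class`.

```c
#include <stddef.h>
#include <stdint.h>

#define TINY_MAX 256
static uint8_t size_to_class_lut[TINY_MAX + 1];   /* filled at init (assumed) */

/* hakmem-style (assumed): LUT lookup, then a routing step picks a handler. */
static int classify_and_route(size_t size) {
    if (size > TINY_MAX) return -1;         /* not a tiny allocation */
    uint8_t cls = size_to_class_lut[size];  /* step 1: size -> class */
    switch (cls >> 4) {                     /* step 2: class -> route */
        case 0:  return 0;                  /* e.g. unified tiny cache */
        default: return 1;                  /* e.g. legacy/fallback path */
    }
}

/* ptmalloc2-style (simplified): the request size maps straight to a bin. */
static size_t size_to_bin(size_t size) {
    return ((size + 15) & ~(size_t)15) / 16;  /* one bin per 16-byte step */
}
```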
---

#### 5.3 Thread Caching Implementation

**hakmem**:
- TinyUnifiedCache (per-class)
- Already consolidated via FastLane

**system malloc (ptmalloc2)**:
- tcache (thread-local cache)
- Introduced in glibc 2.26+, very fast

**Impact**: glibc's tcache implementation may be more optimized than hakmem's
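For reference, this is the general shape of a tcache-style fast path, loosely modeled on glibc's malloc.c but heavily simplified (bin count, field set, and missing safety checks are simplifications, not the real code):

```c
#include <stddef.h>
#include <stdint.h>

#define TCACHE_BINS 64

typedef struct tcache_entry {
    struct tcache_entry *next;          /* link stored inside the freed chunk */
} tcache_entry;

typedef struct tcache_perthread {
    uint16_t      counts[TCACHE_BINS];
    tcache_entry *entries[TCACHE_BINS];
} tcache_perthread;

static _Thread_local tcache_perthread tcache;

static void tcache_put(void *chunk, size_t bin) {
    tcache_entry *e = (tcache_entry *)chunk;
    e->next = tcache.entries[bin];      /* push: one pointer write into user memory */
    tcache.entries[bin] = e;
    tcache.counts[bin]++;
}

static void *tcache_get(size_t bin) {
    tcache_entry *e = tcache.entries[bin];
    if (e == NULL) return NULL;         /* fall back to the slower path */
    tcache.entries[bin] = e->next;      /* pop: one load, one store, no lock */
    tcache.counts[bin]--;
    return e;
}
```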
---

#### 5.4 Metadata Access Pattern

**hakmem**:
- Indirection through SuperSlab → Slab → Metadata
- Metadata is fetched in `tiny_legacy_fallback_free_base()`

**system malloc**:
- Chunk metadata is laid out contiguously with the data
- Possibly better cache locality

**Impact**: Difference in cache miss rate for metadata accesses
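An illustrative sketch of the access-pattern difference; these structs mirror the hypothesis, not hakmem's or glibc's real data structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Indirection-chain style: free must chase dependent pointers before it can
 * read the slab metadata, each hop potentially touching a new cache line. */
typedef struct SlabMeta  { uint32_t free_count; uint8_t class_idx; } SlabMeta;
typedef struct Slab      { SlabMeta *meta; } Slab;
typedef struct SuperSlab { Slab *slabs; } SuperSlab;

static uint8_t class_via_indirection(SuperSlab *ss, size_t slab_idx) {
    return ss->slabs[slab_idx].meta->class_idx;   /* dependent loads */
}

/* Inline style: metadata sits in the same or an adjacent cache line as the
 * chunk itself, so a single load usually suffices. */
typedef struct Chunk { size_t size_and_flags; uint8_t payload[]; } Chunk;

static size_t chunk_size_inline(Chunk *c) {
    return c->size_and_flags & ~(size_t)0x7;      /* one load, good locality */
}
```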
---

#### 5.5 Freelist Management

**hakmem**:
- Freelist embedded via the header (1-byte header + next ptr)
- Header write on pop/push

**system malloc**:
- Freelist links kept inside the chunk (reusing the user-data area)
- No header write

**Impact**: Write overhead of freelist operations
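A minimal sketch of the header-embedded push described above (names are hypothetical); compared with the tcache-style push shown under 5.3, it performs one extra byte write per operation.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical header-embedded freelist push: one header write plus one
 * next-pointer write (memcpy used because base + 1 is unaligned). */
static void freelist_push_with_header(void **head, void *base, uint8_t class_idx) {
    *(uint8_t *)base = class_idx;                        /* header write */
    memcpy((uint8_t *)base + 1, head, sizeof(void *));   /* next-pointer write */
    *head = base;
}
```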
---

### Investigation Priorities (highest ROI first)

1. **Header write overhead** (5.1)
   - The most direct overhead
   - 1 write per allocation = 400M writes
   - **Expected ROI**: +10-20% (if the header write can be eliminated)

2. **Thread caching implementation** (5.3)
   - system malloc's tcache is very fast
   - Compare against hakmem's TinyUnifiedCache
   - **Expected ROI**: +20-30% (tcache-level optimization)

3. **Metadata access pattern** (5.4)
   - Cache-locality gap
   - SuperSlab → Slab → Metadata indirection
   - **Expected ROI**: +5-10% (locality improvement)

4. **Classification overhead** (5.2)
   - Fixed cost of LUT + routing
   - **Expected ROI**: +5% (already largely optimized by FastLane)

5. **Freelist management** (5.5)
   - Header-embedded vs in-chunk placement
   - **Expected ROI**: +5% (overlaps with the header write item)

---
## 6. Next Phase Direction (Conditions for Lifting the Pause)

### Release Gates (instruction document, Section 5)

The pause is lifted when any one of the following is satisfied:

1. ✅ **A clear bottleneck has been identified on a real workload**
   - The +63.7% gap against system malloc was identified
   - Hypotheses: header overhead / tcache implementation / metadata access

2. ✅ **The gap vs mimalloc/system can be articulated as a single structural issue**
   - Header write overhead is the leading candidate (400M writes / 200M iters)
   - Difference in thread caching implementation (tcache vs TinyUnifiedCache)

3. ⚠️ **The Phase 12 comparison has narrowed the work down to one effective optimization direction**
   - Multiple hypotheses remain, so prioritization is needed
   - Top priority: header write elimination (+10-20% expected)

---
### Recommendation: Direction for Phase 13

#### Option 1: Header Write Elimination (top priority)

**Target**: Eliminate the 1-byte header write

**Strategy**:
- Place the header before the user pointer (the system malloc pattern)
- Or use header-less classification (RegionId only); see the sketch after this option

**Expected ROI**: **+10-20%**

**Risk**: ⚠️ MEDIUM - the structural change is large

**Next Action**:
- Measure the actual header-write overhead (perf record -e cycles:pp)
- Verify the feasibility of header-less classification (can the free path be handled with the RegionId alone?)
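A minimal sketch of the header-less direction, under one large assumption: tiny regions are power-of-two sized and aligned, so the region base (and its metadata) can be recovered from the user pointer by masking. The names and the 2MB region size are assumptions for illustration, not existing hakmem code.

```c
#include <stdint.h>

#define REGION_SIZE ((uintptr_t)1 << 21)   /* assumed 2MB-aligned regions */

typedef struct RegionMeta {
    uint32_t region_id;
    uint8_t  class_idx;                    /* one size class per region */
} RegionMeta;

/* Region metadata assumed to live at the start of each aligned region. */
static inline RegionMeta *region_of(void *user_ptr) {
    uintptr_t base = (uintptr_t)user_ptr & ~(REGION_SIZE - 1);
    return (RegionMeta *)base;
}

/* Free path: the class comes from the region, so malloc never writes a
 * per-object header and the user pointer can equal the chunk base. */
static inline uint8_t class_of(void *user_ptr) {
    return region_of(user_ptr)->class_idx;
}
```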
---

#### Option 2: Thread Cache Redesign (high ROI)

**Target**: Optimize TinyUnifiedCache to the level of system malloc's tcache

**Strategy**:
- Study the tcache implementation (glibc 2.26+ source)
- Comparative analysis against TinyUnifiedCache
- Identify the differences and apply them

**Expected ROI**: **+20-30%**

**Risk**: ⚠️ HIGH - requires understanding the design philosophy behind tcache

**Next Action**:
- Source code analysis of glibc tcache
- Create a comparison table against TinyUnifiedCache
- Prototype implementation

---
#### Option 3: Metadata Access Optimization (medium ROI)

**Target**: Reduce the SuperSlab → Slab → Metadata indirection

**Strategy**:
- Lay metadata out contiguously (locality improvement); see the sketch after this option
- Optimize cache-line alignment

**Expected ROI**: **+5-10%**

**Risk**: ⚪ LOW - improves the existing metadata structures

**Next Action**:
- Measure the cache miss rate of metadata accesses (perf stat -e cache-misses)
- Design an optimized metadata layout
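One possible shape for the contiguous layout, with assumed field names, entry size, and slab count:

```c
#include <stddef.h>
#include <stdint.h>

#define SLABS_PER_SUPERSLAB 64

typedef struct SlabMetaPacked {
    uint16_t free_count;
    uint8_t  class_idx;
    uint8_t  flags;
} SlabMetaPacked;                        /* 4 bytes: 16 entries per 64B line */

/* All slab metadata of one SuperSlab in a single cache-line-aligned array;
 * a linear index replaces the SuperSlab -> Slab -> Metadata pointer chain. */
typedef struct SuperSlabMeta {
    _Alignas(64) SlabMetaPacked slabs[SLABS_PER_SUPERSLAB];
} SuperSlabMeta;

static inline uint8_t class_of_slab(const SuperSlabMeta *ss, size_t slab_idx) {
    return ss->slabs[slab_idx].class_idx;   /* one load from a dense array */
}
```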
---

## 7. Summary & Recommendation

### Key Findings

1. **hakmem after Phase 10**: 51.76M ops/s (stable, CV 1.03%)
2. **System malloc**: 86.58M ops/s (**+63.7%** faster 🚨)
3. **Hypothesized gap causes**: header write / tcache implementation / metadata access

### Strategic Decision

**Lift the pause and proceed to Phase 13** ✅

**Direction for Phase 13**: **Header Write Elimination** (Option 1)

**Rationale**:
1. The most direct overhead (400M writes)
2. Measurable (easy to verify with perf)
3. Clear ROI (+10-20% expected)

**Alternative**: Investigate the tcache redesign (Option 2) in parallel, as a long-term investment

---
### Next Actions (Phase 13 Preparation)

1. **Measure the header-write overhead**

   ```sh
   perf record -e cycles:pp -F 9999 -- ./bench_random_mixed_hakmem 200000000 400 1
   perf annotate tiny_region_id_write_header
   ```

2. **Verify the feasibility of header-less classification**
   - Confirm whether the free path can be handled with the RegionId alone
   - Verify whether class_idx can be recovered without a header

3. **Source code analysis of system malloc's tcache**
   - Read the tcache implementation in glibc malloc/malloc.c
   - Create a comparison table against TinyUnifiedCache

4. **Write the Phase 13 design document**
   - Concrete implementation plan for header elimination
   - A/B test plan
   - Rollback strategy

---

**Status**: 🔍 **CRITICAL PHASE** - Major performance gap identified

**Date**: 2025-12-14

**Next**: Phase 13 preparation (Header Write Elimination)

---

**Analysis**: Claude Code

**Context**: Strategic Pause - Allocator Comparison

**Finding**: System malloc **+63.7% faster** than hakmem (86.58M vs 52.88M ops/s)