
Moe Charm (CI) f88e51e45b Phase 12: Strategic Pause Results - Critical Finding

Completed Strategic Pause investigation with shocking discovery:
- System malloc (glibc ptmalloc2): 86.58M ops/s
- hakmem (Phase 10): 52.88M ops/s
- Gap: **+63.7%** 🚨

Baseline (Phase 10):
- Mean: 51.76M ops/s (10-run, CV 1.03%)
- Health check: PASS
- Perf stat: IPC 2.22, branch miss 2.48%, good cache locality

Allocator comparison:
- hakmem: 52.43M ops/s (RSS: 33.8MB)
- jemalloc: 48.60M ops/s (RSS: 35.6MB) [-7.3%]
- system malloc: 85.96M ops/s [+63.9%] 🚨

Gap analysis (5 hypotheses):
1. Header write overhead (400M writes) - Expected ROI: +10-20%
2. Thread cache implementation (tcache vs TinyUnifiedCache) - Expected ROI: +20-30%
3. Metadata access pattern (indirection overhead) - Expected ROI: +5-10%
4. Classification overhead (LUT + routing) - Expected ROI: +5%
5. Freelist management (header vs chunk placement) - Expected ROI: +5%

Recommendation: Proceed to Phase 13 (Header Write Elimination)
- Most direct overhead (400M writes per 200M iters)
- Measurable with perf
- Clear ROI (+10-20% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-14 21:17:48 +09:00


Phase 12: Strategic Pause Results

Date: 2025-12-14
Status: 🔍 CRITICAL FINDINGS - System malloc significantly faster than hakmem

Executive Summary

The Strategic Pause investigation produced a startling result: system malloc (glibc ptmalloc2) is 63.7% faster than hakmem.

This gap exceeds the gain from hakmem's Phase 6-10 optimizations (+24.6%) and suggests that a structural problem may exist.

1. Mixed Baseline (after Phase 10)

10-Run Statistics

Iterations: 20M, Working Set: 400, Seed: 1

Metric	Value
Mean	51.76M ops/s
Median	51.74M ops/s
Stdev	0.53M ops/s
CV (Coefficient of Variation)	1.03% ✅
Min	51.00M ops/s
Max	52.67M ops/s

Analysis:

Very stable (CV 1.03%)
With the cumulative effect of Phases 6-10, throughput holds steady between 51.0M and 52.7M ops/s
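
As a quick sanity check on the stability claim, the CV can be recomputed from the rounded values in the table above. This is a minimal sketch; the function and variable names are illustrative and not part of the hakmem benchmark tooling. Because the inputs are rounded, the result lands near 1.02%, marginally below the reported 1.03%.

```python
# Summary values copied from the 10-run table above (in M ops/s).
mean_mops = 51.76
stdev_mops = 0.53
min_mops = 51.00
max_mops = 52.67


def cv_percent(mean, stdev):
    """Coefficient of variation: standard deviation as a percentage of the mean."""
    return stdev / mean * 100.0


cv = cv_percent(mean_mops, stdev_mops)
# Min-to-max spread relative to the mean, as a second stability indicator.
spread = (max_mops - min_mops) / mean_mops * 100.0

print(f"CV = {cv:.2f}%")
print(f"min-max spread = {spread:.1f}% of mean")
```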

2. Health Check

Status: ✅ PASS

Profile	Throughput	Status
MIXED_TINYV3_C7_SAFE	52.44M ops/s	✅ PASS
C6_HEAVY_LEGACY_POOLV1	18.78M ops/s	✅ PASS

3. Perf Stat (Memory-System Metrics)

Test: 200M iterations, ws=400

Metric	Value	Notes
Throughput	52.06M ops/s	Baseline
Cycles	16,765,096,198
Instructions	37,198,361,135
IPC	2.22	Instructions per cycle
Branches	9,361,888,532
Branch miss rate	2.48%	✅ Good
Cache misses	1,328,188	6.6 misses per 1K ops
dTLB load misses	66,195	0.33 per 1K ops
Minor faults	6,938

Analysis:

IPC of 2.22 is good (instruction-level parallelism is being exploited)
Branch miss rate of 2.48% is low (branch prediction is accurate)
Cache misses and dTLB misses are rare (good locality)
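
The derived figures in the table can be reproduced from the raw counters. The sketch below assumes that one "op" equals one benchmark iteration and that the counters cover the whole process run, including setup; neither assumption is stated explicitly in the report. The per-iteration cycle and instruction budgets are additional derived values, not numbers taken from the report.

```python
# Raw counters copied from the perf stat table above (200M iterations, ws=400).
iterations = 200_000_000
cycles = 16_765_096_198
instructions = 37_198_361_135
cache_misses = 1_328_188
dtlb_load_misses = 66_195

# Instructions retired per cycle.
ipc = instructions / cycles

# Per-iteration budgets (assumes one op == one iteration).
cycles_per_iter = cycles / iterations
instructions_per_iter = instructions / iterations

# Miss rates normalized to 1,000 operations.
cache_misses_per_1k = cache_misses / iterations * 1000
dtlb_misses_per_1k = dtlb_load_misses / iterations * 1000

print(f"IPC: {ipc:.2f}")
print(f"cycles/iter: {cycles_per_iter:.1f}, instructions/iter: {instructions_per_iter:.1f}")
print(f"cache misses per 1K ops: {cache_misses_per_1k:.1f}")
print(f"dTLB load misses per 1K ops: {dtlb_misses_per_1k:.2f}")
```

Read this way, each iteration costs roughly 84 cycles, which is the budget any header-write or cache-path saving has to be measured against.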

4. Allocator Comparison

Critical Finding: System malloc is unexpectedly fast

4.1 Throughput Comparison (200M iterations)

Allocator	Throughput	vs hakmem	RSS (MB)
hakmem (Phase 10)	52.43M ops/s	Baseline	33.8
jemalloc (LD_PRELOAD)	48.60M ops/s	-7.3%	35.6
system malloc (glibc)	85.96M ops/s	+63.9% 🚨	N/A

4.2 Verification (20M iterations)

Allocator	Throughput	Time	vs hakmem
hakmem	52.88M ops/s	0.378s	Baseline
system malloc	86.58M ops/s	0.231s	+63.7% 🚨

Analysis:

System malloc is 1.64x faster (a far larger margin than hakmem's +24.6% gain from Phases 6-10)
jemalloc is slightly slower than hakmem (-7.3%)
Memory footprints are comparable (hakmem 33.8MB vs jemalloc 35.6MB)
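
The relative gaps and run times in both tables follow directly from the throughput figures. A minimal sketch with illustrative helper names; the inputs are the rounded values from the tables, so the last digit can differ slightly from the reported percentages (the 200M-iteration system malloc gap comes out at about +63.95% against the reported +63.9%).

```python
def relative_gap_percent(candidate_mops, baseline_mops):
    """Throughput difference of candidate vs baseline, in percent."""
    return (candidate_mops / baseline_mops - 1.0) * 100.0


def run_time_seconds(iterations, mops):
    """Wall time implied by a throughput given in millions of ops per second."""
    return iterations / (mops * 1e6)


# 4.1: 200M-iteration run (M ops/s).
hakmem_200m, jemalloc_200m, system_200m = 52.43, 48.60, 85.96
# 4.2: 20M-iteration verification run (M ops/s).
hakmem_20m, system_20m = 52.88, 86.58

print(f"jemalloc vs hakmem (200M): {relative_gap_percent(jemalloc_200m, hakmem_200m):+.1f}%")
print(f"system vs hakmem (200M):   {relative_gap_percent(system_200m, hakmem_200m):+.2f}%")
print(f"system vs hakmem (20M):    {relative_gap_percent(system_20m, hakmem_20m):+.1f}%")
print(f"hakmem time (20M): {run_time_seconds(20_000_000, hakmem_20m):.3f}s")
print(f"system time (20M): {run_time_seconds(20_000_000, system_20m):.3f}s")
```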

5. Gap Analysis: Why is system malloc so fast?

Hypotheses (to be verified)

5.1 Header Overhead

hakmem:

Writes a 1-byte header on every allocation (HAKMEM_TINY_HEADER_CLASSIDX)
User pointer = base + 1

system malloc:

Places the header before the user pointer (user pointer = base)
Possibly no header write on the allocation path

Impact: one header write per allocation, estimated at 400M writes over the 200M-iteration run (alloc + free)
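
To make the hypothesized cost concrete, the following toy model mimics the layout described above: the class index is stored in one byte at the block base, and the user pointer is base + 1. It is a sketch over a Python bytearray, not hakmem code; the class `ToySlab`, its fields, and the bump-style allocation are all hypothetical.

```python
HEADER_BYTES = 1  # per the text: 1-byte header, user pointer = base + 1


class ToySlab:
    """Toy slab of fixed-size blocks; offsets into a bytearray stand in for pointers."""

    def __init__(self, block_size, block_count, class_idx):
        self.memory = bytearray(block_size * block_count)
        self.block_size = block_size
        self.block_count = block_count
        self.class_idx = class_idx
        self.next_block = 0
        self.header_writes = 0  # counts the stores the hypothesis is about

    def alloc(self):
        if self.next_block >= self.block_count:
            raise MemoryError("toy slab exhausted")
        base = self.next_block * self.block_size
        self.next_block += 1
        # The store under suspicion: one header byte written per allocation.
        self.memory[base] = self.class_idx
        self.header_writes += 1
        return base + HEADER_BYTES

    def class_of(self, user_ptr):
        # The free path recovers the class by reading the byte before the user pointer.
        return self.memory[user_ptr - HEADER_BYTES]


slab = ToySlab(block_size=16, block_count=4, class_idx=3)
ptr = slab.alloc()
print(ptr, slab.class_of(ptr), slab.header_writes)
```

The model only shows where the write happens and what the free path gets in return. Whether that single store is expensive in practice is exactly what the perf measurement is meant to establish.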

5.2 Classification Overhead

hakmem:

Size → Class LUT (hak_tiny_size_to_class)
Class → Route → Handler

system malloc:

Size → Bin (direct mapping, possibly without complex branching)

Impact: fixed cost of the LUT lookup plus routing
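
For readers unfamiliar with the pattern, a size-to-class lookup table can be sketched as follows. The class sizes are hypothetical and are not hakmem's actual classes; `size_to_class` here is an illustrative stand-in, not the real `hak_tiny_size_to_class`.

```python
# Hypothetical size classes (bytes); hakmem's real class table is not given in this report.
CLASS_SIZES = [8, 16, 32, 64, 128, 256, 512, 1024]
MAX_TINY = CLASS_SIZES[-1]


def build_lut(class_sizes):
    """Precompute, for every size up to the largest class, the smallest class that fits."""
    lut = []
    class_idx = 0
    for size in range(class_sizes[-1] + 1):
        while size > class_sizes[class_idx]:
            class_idx += 1
        lut.append(class_idx)
    return lut


SIZE_TO_CLASS = build_lut(CLASS_SIZES)


def size_to_class(size):
    """One bounds check plus one table load; returns -1 when the size is not a tiny size."""
    if not 0 <= size <= MAX_TINY:
        return -1
    return SIZE_TO_CLASS[size]


for size in (1, 8, 9, 100, 1024, 1025):
    print(size, size_to_class(size))
```

The lookup itself is a single load. The fixed cost named in the text comes from the lookup plus the routing step that follows it, which this sketch does not model.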

5.3 Thread Caching Implementation

hakmem:

TinyUnifiedCache (per-class)
Already consolidated via FastLane

system malloc (ptmalloc2):

tcache (thread-local cache)
Introduced in glibc 2.26; very fast

Impact: tcache's implementation may be better optimized than hakmem's
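
Both designs share the same basic shape: a small per-thread, per-class LIFO of free blocks that serves the fast path without locking. The sketch below is a toy model of that idea only; it mirrors neither glibc's tcache code nor TinyUnifiedCache, and the class name and methods are hypothetical. The bin depth of 7 echoes glibc's historical default and should be treated as an illustrative value.

```python
class ToyThreadCache:
    """Toy per-thread cache: one bounded LIFO bin per size class (one instance per thread)."""

    def __init__(self, num_classes, max_per_bin=7):
        self.bins = [[] for _ in range(num_classes)]
        self.max_per_bin = max_per_bin

    def get(self, class_idx):
        """Alloc fast path: pop the most recently freed block, or None on a miss."""
        bin_ = self.bins[class_idx]
        return bin_.pop() if bin_ else None

    def put(self, class_idx, block):
        """Free fast path: push the block unless the bin is full (caller then takes the slow path)."""
        bin_ = self.bins[class_idx]
        if len(bin_) >= self.max_per_bin:
            return False
        bin_.append(block)
        return True


cache = ToyThreadCache(num_classes=4)
cache.put(2, 0x1000)
cache.put(2, 0x1040)
print(hex(cache.get(2)))  # the most recently freed block is reused first
```

In a comparison table, the interesting differences are likely to be in what each implementation does around this core (bounds, refill and flush policy, extra bookkeeping per operation), since the core itself is only a push or a pop.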

5.4 Metadata Access Pattern

hakmem:

Indirection chain: SuperSlab → Slab → Metadata
Metadata is fetched in tiny_legacy_fallback_free_base()

system malloc:

Chunk metadata is laid out contiguously
Possibly better cache locality

Impact: difference in cache miss rate on metadata access

5.5 Freelist Management

hakmem:

Freelist is embedded in the header (1-byte header + next ptr)
Pop/push performs a header write

system malloc:

Freelist lives inside the chunk (reuses the user data area)
No header write

Impact: write overhead of freelist operations
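
The "freelist inside the chunk" pattern attributed to system malloc above can be sketched as an intrusive list: the next pointer is stored in the freed block's own user data area, so no separate header is touched. This is a toy model over a bytearray with hypothetical names, not glibc's implementation, and it assumes blocks of at least 8 bytes.

```python
import struct

END_OF_LIST = 0xFFFFFFFFFFFFFFFF  # sentinel offset meaning "no next block"


class IntrusiveFreelist:
    """Toy LIFO freelist whose links live inside the freed blocks themselves."""

    def __init__(self, memory):
        self.memory = memory
        self.head = END_OF_LIST

    def push(self, offset):
        # Reuse the first 8 bytes of the freed block to store the old head.
        struct.pack_into("<Q", self.memory, offset, self.head)
        self.head = offset

    def pop(self):
        if self.head == END_OF_LIST:
            return None
        offset = self.head
        # Read the link back out of the block that is being handed out.
        (self.head,) = struct.unpack_from("<Q", self.memory, offset)
        return offset


memory = bytearray(64)
freelist = IntrusiveFreelist(memory)
freelist.push(0)
freelist.push(16)
print(freelist.pop(), freelist.pop(), freelist.pop())
```

Note that push still performs one store into the block, so the pattern removes the header write rather than all writes. This is consistent with the report's own remark that this item overlaps with the header write hypothesis.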

Investigation Priorities (highest ROI first)

1. Header write overhead (5.1)
- The most direct overhead
- 1 write per allocation = 400M writes
- Expected ROI: +10-20% (if the header write can be removed)
2. Thread caching implementation (5.3)
- system malloc's tcache is very fast
- Compare against hakmem's TinyUnifiedCache
- Expected ROI: +20-30% (with tcache-level optimization)
3. Metadata access pattern (5.4)
- Difference in cache locality
- SuperSlab → Slab → Metadata indirection
- Expected ROI: +5-10% (from improved locality)
4. Classification overhead (5.2)
- Fixed cost of LUT + routing
- Expected ROI: +5% (already optimized by FastLane)
5. Freelist management (5.5)
- Header vs in-chunk placement
- Expected ROI: +5% (overlaps with the header write item)
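
It is useful to compare the expected ROI figures against the measured gap. The sketch below projects throughput from the 200M-iteration baseline under the assumption that the gains are independent and compound multiplicatively. That assumption is not made in the report and is optimistic, since the report itself notes that the freelist item overlaps with the header write item.

```python
baseline_mops = 52.43   # hakmem, 200M-iteration run
target_mops = 85.96     # system malloc, same run

# (low, high) expected gains as fractions, taken from the priority list above.
expected_roi = {
    "header write": (0.10, 0.20),
    "thread cache": (0.20, 0.30),
    "metadata access": (0.05, 0.10),
    "classification": (0.05, 0.05),
    "freelist": (0.05, 0.05),
}


def compounded(baseline, gains):
    """Apply each fractional gain multiplicatively (assumes the gains are independent)."""
    result = baseline
    for gain in gains:
        result *= 1.0 + gain
    return result


header_low, header_high = (compounded(baseline_mops, [g]) for g in expected_roi["header write"])
all_low = compounded(baseline_mops, [low for low, _ in expected_roi.values()])
all_high = compounded(baseline_mops, [high for _, high in expected_roi.values()])

print(f"header write only: {header_low:.1f} to {header_high:.1f} M ops/s")
print(f"all five combined: {all_low:.1f} to {all_high:.1f} M ops/s")
print(f"system malloc:     {target_mops:.2f} M ops/s")
```

Under these assumptions the header item alone tops out at about 62.9M ops/s, and the pessimistic combination of all five items (about 80M ops/s) still falls short of system malloc. Only the optimistic combination exceeds it.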

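6. Next Phase Direction (Conditions for Lifting the Pause)

Exit Gates (Instruction Document, Section 5)

The Pause is lifted when any one of the following is satisfied:

✅ A clear bottleneck was identified on a real workload
- The +63.7% gap to system malloc was identified
- Hypotheses: header overhead / tcache implementation / metadata access
✅ The difference from mimalloc/system could be articulated as a "single structural issue"
- Header write overhead is the leading candidate (400M writes / 200M iters)
- Difference in thread caching implementation (tcache vs TinyUnifiedCache)
⚠️ The Phase 12 comparison narrowed the "effective optimization direction" down to one
- Multiple hypotheses remain, so prioritization is needed
- Top priority: Header write elimination (+10-20% expected)

Recommendation: Phase 13 Direction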

Option 1: Header Write Elimination (top priority)

Target: Remove the 1-byte header write

Strategy:

Place the header before the user pointer (the system malloc pattern)
Or use header-less classification (RegionId only)

Expected ROI: +10-20%

Risk: ⚠️ MEDIUM - large structural change

Next Action:

Measure the actual header write overhead (perf record -e cycles:pp)
Verify the feasibility of header-less classification (can the free path be handled with RegionId alone?)

Option 2: Thread Cache Redesign (high ROI)

Target: Optimize TinyUnifiedCache to the level of system malloc's tcache

Strategy:

Study the tcache implementation (glibc 2.26+ source)
Compare it against TinyUnifiedCache
Identify the differences and apply them

Expected ROI: +20-30%

Risk: ⚠️ HIGH - requires understanding tcache's design philosophy

Next Action:

Source code analysis of glibc tcache
Build a comparison table against TinyUnifiedCache
Prototype implementation

Option 3: Metadata Access Optimization (medium ROI)

Target: Reduce the SuperSlab → Slab → Metadata indirection

Strategy:

Lay out metadata contiguously (improves locality)
Optimize cache-line alignment

Expected ROI: +5-10%

Risk: ⚪ LOW - improves the existing metadata structure

Next Action:

Measure the cache miss rate of metadata access (perf stat -e cache-misses)
Design a proposal for an optimized metadata layout

7. Summary & Recommendation

Key Findings

hakmem after Phase 10: 51.76M ops/s (stable, CV 1.03%)
System malloc: 86.58M ops/s (+63.7% faster than hakmem's 52.88M ops/s in the same 20M-iteration run 🚨)
Hypothesized causes of the gap: header write / tcache implementation / metadata access

Strategic Decision

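Lift the Pause and proceed to Phase 13 ✅

Phase 13 direction: Header Write Elimination (Option 1)

Rationale:

The most direct overhead (400M writes)
Measurable (easy to verify with perf)
Clear ROI (+10-20% expected)

Alternative: investigate the tcache redesign (Option 2) in parallel, as a long-term investment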

Next Actions (Phase 13 Preparation)

Measure the header write overhead

perf record -e cycles:pp -F 9999 -- ./bench_random_mixed_hakmem 200000000 400 1
perf annotate tiny_region_id_write_header

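Verify the feasibility of header-less classification
- Confirm whether the free path can be handled with RegionId alone
- Verify whether class_idx can be recovered without a header
Analyze the system malloc tcache source code
- Read the tcache implementation in glibc's malloc/malloc.c
- Build a comparison table against TinyUnifiedCache
Write the Phase 13 design document
- Concrete implementation plan for header elimination
- A/B test plan
- Rollback strategy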

Status: 🔍 CRITICAL PHASE - Major performance gap identified
Date: 2025-12-14
Next: Phase 13 preparation (Header Write Elimination)

Analysis: Claude Code
Context: Strategic Pause - Allocator Comparison
Finding: System malloc +63.7% faster than hakmem (86.58M vs 52.88M ops/s)
