Files
hakmem/CURRENT_TASK.md

161 lines
7.6 KiB
Markdown
Raw Normal View History

# CURRENT_TASKRolling, SSOT
## 0) 今の「正」SSOT
- **性能比較の正**: FAST PGO build`make pgo-fast-full``bench_random_mixed_hakmem_minimal_pgo` **WarmPool=16**Phase 69 強GOで昇格済み
- **安全・互換の正**: Standard build`make bench_random_mixed_hakmem`
- **観測の正**: OBSERVE build`make perf_observe`
- **スコアカード(目標/現在値)**: `docs/analysis/PERFORMANCE_TARGETS_SCORECARD.md`
- Current baselineFAST v3 + PGO, Phase 69: **62.63M ops/s = 51.77% of mimalloc**
- 次の目標: **M2 = 55%**(残り **+3.23pp**
- **Mixed 10-run SSOT**: `scripts/run_mixed_10_cleanenv.sh``ITERS=20000000 WS=400``HAKMEM_WARM_POOL_SIZE=16` デフォルト)
## 1) 迷子防止(経路/観測)
“経路が踏まれていない最適化” を防ぐための最小手順。
- **Route Banner経路の誤認を潰す**: `HAKMEM_ROUTE_BANNER=1`
- 出力: Route assignmentsbackend route kind+ cache config`unified_cache_enabled` / `warm_pool_max_per_class`
- **Refill観測のSSOT**: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- WS=400Mixed SSOTでは miss が極小 → `unified_cache_refill()` 最適化は **凍結ROIゼロ**
## 2) 直近の結論(要点だけ)
- **Phase 69WarmPool sweep**: `HAKMEM_WARM_POOL_SIZE=16`**強GO+3.26%**、baseline 昇格済み。
- 設計: `docs/analysis/PHASE69_REFILL_TUNING_0_DESIGN.md`
- 結果: `docs/analysis/PHASE69_REFILL_TUNING_1_RESULTS.md`
- **Phase 70観測SSOT**: 統計の見える化/前提ゲート確立。WS=400 SSOT では refill は冷たい。
- SSOT: `docs/analysis/PHASE70_REFILL_OBSERVABILITY_PREREQS_SSOT.md`
- **Phase 71/73WarmPool=16 の勝ち筋確定)**: 勝ち筋は **instruction/branch の微減**perf stat で確定)。
- 詳細: `docs/analysis/PHASE70_71_WARMPOOL16_ANALYSIS.md`
- **Phase 72ENV knob ROI枯れ**: WarmPool=16 を超える ENV-only 勝ち筋なし → **構造(コード)で攻める段階**
## 3) 運用ルールBox Theory + layout tax 対策)
- 変更は必ず **箱 + 境界1箇所 + ENVで戻せる** で積むFail-fast、最小可視化
- A/B は **同一バイナリでENVトグル**が原則(別バイナリ比較は layout が混ざる)。
- “削除して速い” は封印link-out/大削除は layout tax で符号反転しやすい)→ **compile-out** を優先。
- 診断: `scripts/box/layout_tax_forensics_box.sh` / `docs/analysis/PHASE67A_LAYOUT_TAX_FORENSICS_SSOT.md`
## 4) 次の指示書Active
### Phase 74構造: UnifiedCache hit-path を短くする ✅ **P1 (LOCALIZE) 凍結**
**前提**:
- WS=400 SSOT では UnifiedCache miss が極小 → refill最適化は ROIゼロ。
- WarmPool=16 の勝ちは instruction/branch 微減 → hit-path を短くするのが正攻法。
**Phase 74-1: LOCALIZE (ENV-gated)** ✅ **完了 (NEUTRAL +0.50%)**
- ENV: `HAKMEM_TINY_UC_LOCALIZE=0/1`
- Runtime branch overhead で instructions/branches **増加** (+0.7%/+0.4%)
- 判定: **NEUTRAL (+0.50%)**
**Phase 74-2: LOCALIZE (compile-time gate)** ✅ **完了 (NEUTRAL -0.87%)**
- Build flag: `HAKMEM_TINY_UC_LOCALIZE_COMPILED=0/1` (default 0)
- Runtime branch 削除 → instructions/branches **改善** (-0.6%/-2.3%) ✓
- しかし **cache-misses +86%** (register pressure / spill) → throughput **-0.87%**
- 切り分け成功: **LOCALIZE本体は勝ち、cache-miss 増加で相殺**
- 判定: **NEUTRAL (-0.87%)****P1 (LOCALIZE) 凍結**
**結論**:
- P1 (LOCALIZE) は default OFF で凍結dependency chain 削減の ROI 低い)
- 次: **Phase 74-3 (P0: FASTAPI)** へ進む
**Phase 74-3: P0 (FASTAPI)** ✅ **完了 (NEUTRAL +0.32%)**
**Goal**: `unified_cache_enabled()` / `lazy-init` / `stats` 判定を **hot loop の外へ追い出す**
**Approach**:
- `unified_cache_push_fast()` / `unified_cache_pop_fast()` API 追加
- 前提: "valid/enabled/no-stats" を caller 側で保証
- Fail-fast: 想定外の状態なら slow path へ fallback境界1箇所
- ENV gate: `HAKMEM_TINY_UC_FASTAPI=0/1` (default 0, research box)
2025-12-17 16:27:06 +09:00
**Results** (10-run Mixed SSOT, WS=400):
- Throughput: **+0.32%** (NEUTRAL, below +1.0% GO threshold)
- cache-misses: **-16.31%** (positive signal, insufficient throughput gain)
**判定**: **NEUTRAL (+0.32%)****P0 (FASTAPI) 凍結**
**参考**:
- 設計: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_0_DESIGN.md`
- 指示書: `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_1_NEXT_INSTRUCTIONS.md`
- 結果 (P1/P0): `docs/analysis/PHASE74_UNIFIEDCACHE_HITPATH_STRUCTURAL_OPT_2_RESULTS.md`
---
## Phase 75構造: Hot-class Inline Slots (P2) 🟡 **準備中**
**Goal**: C4-C7 の統計分析 → targeted optimization 戦略決定
**前提** (Phase 74 learnings):
- UnifiedCache hit-path optimization の ROI が低い ← register pressure / cache-miss effects
- 次の軸: **per-class 特性を活用** → TLS-direct inline slots で branch elimination
**Phase 75-0: Per-Class Analysis** ✅ **完了**
Per-class Unified-STATS (Mixed SSOT, WS=400, HAKMEM_MEASURE_UNIFIED_CACHE=1):
| Class | Capacity | Occupied | Hits | Pushes | Total Ops | Hit % | % of C4-C7 |
|-------|----------|----------|------|--------|-----------|-------|-----------|
| C6 | 128 | 127 | 2,750,854 | 2,750,855 | **5,501,709** | 100% | **57.2%** |
| C5 | 128 | 127 | 1,373,604 | 1,373,605 | **2,747,209** | 100% | **28.5%** |
| C4 | 64 | 63 | 687,563 | 687,564 | **1,375,127** | 100% | **14.3%** |
| C7 | ? | ? | ? | ? | **?** | ? | **?** |
**Key findings**:
1. C6 圧倒的支配: 57.2% の操作 (2.75M hits)
2. 全クラス 100% hit rate (refill inactive in SSOT)
3. Cache occupancy near-capacity (98-99%)
**Phase 75-1: C6-only Inline Slots** ✅ **完了 (GO +2.87%)**
**Approach**: Modular box theory design with single decision point at TLS init
**Implementation** (5 new boxes + test script):
- ENV gate box: `HAKMEM_TINY_C6_INLINE_SLOTS=0/1` (lazy-init, default OFF)
- TLS extension: 128-slot ring buffer (1KB per thread, zero overhead when OFF)
- Fast-path API: `c6_inline_push()` / `c6_inline_pop()` (always_inline, 1-2 cycles)
- Integration: Minimal (2 boundary points: alloc/free for C6 class only)
- Backward compatible: Legacy code intact, fail-fast to unified_cache
**Results** (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF): **44.24 M ops/s**
- Treatment (C6 inline ON): **45.51 M ops/s**
- Delta: **+1.27 M ops/s (+2.87%)**
**Decision**: ✅ **GO** (exceeds +1.0% strict threshold)
**Mechanism**: Branch elimination on unified_cache for C6 (57.2% of C4-C7 ops)
**参考**:
- Per-class分析: `docs/analysis/PHASE75_PERCLASS_ANALYSIS_0_SSOT.md`
- 結果: `docs/analysis/PHASE75_C6_INLINE_SLOTS_1_RESULTS.md`
---
**Phase 75-2: C5 Inline Slots (85% Coverage Target)** 🟡 **次の指示**
**Goal**: Expand to C5 class (28.5% of C4-C7) for 85.7% cumulative coverage
**Approach**: Replicate C6 pattern
- Add C5 ring buffer (128 slots, 1KB TLS)
- ENV gate: `HAKMEM_TINY_C5_INLINE_SLOTS=0/1`
- Integration: same alloc/free boundary points (3 total: C6+C5 alloc/free)
- A/B test: target +2-3% cumulative (Phase 75-1: +2.87% + Phase 75-2 delta)
**Risk Assessment**:
- TLS expansion: ~2KB total (C6+C5), manageable
- Rollback: Simple (ENV gate)
- Expected: +1.5-2.0% additional (diminishing returns from alloc branching)
**Success Criteria**:
- GO: +1.0% or higher cumulative vs Phase 75 baseline
- NEUTRAL: freeze, evaluate Phase 76
- NO-GO: revert C5, keep C6 as Phase 75 final
## 5) アーカイブ
Phase 62A: C7 ULTRA Alloc Dependency Chain Trim - NEUTRAL (-0.71%) Implemented C7 ULTRA allocation hotpath optimization attempt as per Phase 62A instructions. Objective: Reduce dependency chain in tiny_c7_ultra_alloc() by: 1. Eliminating per-call tiny_front_v3_c7_ultra_header_light_enabled() checks 2. Using TLS headers_initialized flag set during refill 3. Reducing branch count and register pressure Implementation: - New ENV box: core/box/c7_ultra_alloc_depchain_opt_box.h - HAKMEM_C7_ULTRA_ALLOC_DEPCHAIN_OPT=0/1 gate (default OFF) - Modified tiny_c7_ultra_alloc() with optimized path - Preserved original path for compatibility Results (Mixed benchmark, 10-run): - Baseline (OPT=0): 59.300 M ops/s (CV 1.98%) - Treatment (OPT=1): 58.879 M ops/s (CV 1.83%) - Delta: -0.71% (NEUTRAL, within ±1.0% threshold but negative) - Status: NEUTRAL → Research box (default OFF) Root Cause Analysis: 1. LTO optimization already inlines header_light function (call cost = 0) 2. TLS access (memory load + offset) not cheaper than function call 3. Layout tax from code addition (I-cache disruption pattern from Phases 43/46A/47) 4. 5.18% stack % is not optimizable hotspot (already well-optimized) Key Lessons: - LTO-optimized function calls can be cheaper than TLS field access - Micro-optimizations on already-optimized paths show diminishing/negative returns - 48.34% gap to mimalloc is likely algorithmic, not micro-architectural - Layout tax remains consistent pattern across attempted micro-optimizations Decision: - NEUTRAL verdict → kept as research box with ENV gate (default OFF) - Not adopted as production default - Next phases: Option B (production readiness pivot) likely higher ROI than further micro-opts Box Theory Compliance: ✅ Compliant (single point, reversible, clear boundary) Performance Compliance: ❌ No (-0.71% regression) Documentation: - PHASE62A_C7_ULTRA_DEPCHAIN_OPT_RESULTS.md: Full A/B test analysis - CURRENT_TASK.md: Updated with results and next phase options 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-17 16:34:03 +09:00
- 詳細ログ: `CURRENT_TASK_ARCHIVE_20251210.md`
- 整理前スナップショット: `docs/analysis/CURRENT_TASK_ARCHIVE.md`