diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index 9c823c19..31152cc6 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -37,15 +37,77 @@
 
 ---
 
-## Next target: Profile Adoption & Remaining Optimization
+## Next target: mimalloc Gap Closure Roadmap (2.5x → 1.9x)
 
-**Priority A** - Free promotion:
-- Prepare to incorporate FREE-TINY-FAST-DUALHOT-1 into the MIXED_TINYV3_C7_SAFE profile
-- ENV: HAKMEM_TINY_LARSON_FIX=0 as default (DUALHOT ON)
+**Gap Analysis**: hakmem 50.7M ops/s vs mimalloc 127M ops/s
 
-**Priority B** - Alloc structural optimization (deferred):
-- Reduce the "structural" overhead of `malloc` / Front Gate
-- PGO / const propagation / inline optimizations
+Root causes (in ROI order):
+1. **Observation tax** (+2-3%): Stats macros branch even when OFF
+2. **Policy snapshot** (+10-15%): Per-call TLS policy read + atomic sync
+3. **Header management** (+5-10%): 1-byte header per block
+4. **Wrapper layer** (+5-10%): malloc → tiny_alloc_gate_fast + security checks
+5. **Routing switch** (+3-5%): Per-call switch statement
+
+### Phase 1: Quick Wins (Week 1) - Target: +4-7% (52-56M ops/s)
+
+**Priority A1** - Promote the winning FREE box to mainline:
+- Make HAKMEM_FREE_TINY_FAST_HOTCOLD=1 the default for MIXED_TINYV3_C7_SAFE
+- Enable FREE-TINY-FAST-DUALHOT-1 by default
+- Expected: +2-3% (DUALHOT effect already measured: +13%)
+
+**Priority A2** - Zero the observation tax (compile-out stats):
+- Add HAKMEM_DEBUG_COUNTERS compile-time flag (default 0)
+- When 0: `#define ALLOC_GATE_STAT_INC(x) do {} while(0)` (zero cost)
+- Files: `alloc_gate_stats_box.h`, `free_path_stats_box.h`, `tiny_front_stats_box.h`, `free_tiny_fast_hotcold_stats_box.h`
+- Expected: +2-3% (eliminate branching on all stats)
+
+**Priority A3** - Inline header write:
+- Add `__attribute__((always_inline))` to `tiny_region_id_write_header()`
+- Eliminate function call overhead in hot path
+- Expected: +1-2%
+
+### Phase 2: Structural Changes (Weeks 2-3) - Target: +5-10% (55-61M ops/s)
+
+**Priority B1** - Reduce the C4-C7 header tax:
+- Remove 1-byte header for C6 (512B) / C7 (1024B) allocations
+- Use registry-only lookup on free
+- Expected: +3-5% (C6/C7 = 30% of workload, no header = 10% size savings)
+
+**Priority B2** - Dedicated C0-C3 fast path:
+- Create `malloc_tiny_fast_c0c3()` entry point (no policy snapshot)
+- Conditional dispatch from wrapper based on size
+- Expected: +1-2%
+
+**Priority B3** - Routing jump table:
+- Replace switch(route_kind) with function pointer array
+- Reduce branch prediction misses (5-way switch → direct dispatch)
+- Expected: +1-3%
+
+### Phase 3: Cache Locality (Weeks 4-5) - Target: +12-22% (57-68M ops/s)
+
+**Priority C1** - TLS cache prefetch:
+- `__builtin_prefetch(g_small_policy_v7, 0, 3)` on malloc entry
+- Improve L1 hit rate on cold start
+- Expected: +2-4%
+
+**Priority C2** - Slab metadata cache optimization:
+- Profile cache-miss hotspots (policy struct, slab metadata)
+- Hot/cold split of metadata
+- Inline first slab descriptor
+- Expected: +5-10%
+
+**Priority C3** - Static routing (if no learner):
+- Detect static routes at init
+- Bypass policy snapshot entirely
+- Expected: +5-8%
+
+### Architectural Insight (Long-term)
+
+**Reality check**: hakmem 4-5 layer design (wrapper → gate → policy → route → handler) adds 50-100x instruction overhead vs mimalloc's 1-layer TLS buckets.
+
+**Maximum realistic** without redesign: 65-70M ops/s (still ~1.9x gap)
+
+**Future pivot**: Consider static-compiled routing + optional learner (not per-call policy)
 
 ---
 
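
Note on the A2 item (compile-out stats): a minimal sketch of how the `HAKMEM_DEBUG_COUNTERS` switch could gate `ALLOC_GATE_STAT_INC`, so that the default build carries neither a counter update nor a branch in the hot path. The flag and macro names come from the roadmap. Everything else here (`alloc_gate_stats_t`, `g_alloc_gate_stats`, `ALLOC_GATE_STAT_GET`, `demo_alloc_gate`, `demo_run`, and the 1024-byte cutoff) is hypothetical and exists only to illustrate the pattern; the real layout in `alloc_gate_stats_box.h` may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time switch named in the roadmap; 0 compiles the counters out. */
#ifndef HAKMEM_DEBUG_COUNTERS
#define HAKMEM_DEBUG_COUNTERS 0
#endif

/* Hypothetical counter layout, for illustration only. */
typedef struct {
    uint64_t total_calls;
    uint64_t fast_hits;
} alloc_gate_stats_t;

#if HAKMEM_DEBUG_COUNTERS
/* Debug build: per-thread counters, incremented unconditionally. */
static _Thread_local alloc_gate_stats_t g_alloc_gate_stats;
#define ALLOC_GATE_STAT_INC(field) do { g_alloc_gate_stats.field++; } while (0)
#define ALLOC_GATE_STAT_GET(field) (g_alloc_gate_stats.field)
#else
/* Default build: the macro expands to an empty statement, so no load,
 * store, or branch is emitted for it. */
#define ALLOC_GATE_STAT_INC(field) do {} while (0)
#define ALLOC_GATE_STAT_GET(field) ((uint64_t)0)
#endif

/* Hypothetical stand-in for the allocation gate: returns 1 when the
 * request would take the tiny fast path, 0 otherwise. */
static int demo_alloc_gate(size_t size) {
    ALLOC_GATE_STAT_INC(total_calls);
    if (size <= 1024) {
        ALLOC_GATE_STAT_INC(fast_hits);
        return 1;
    }
    return 0;
}

/* Drives the gate with ten power-of-two sizes (8 .. 4096) and returns
 * the number of calls recorded by the counters. */
static uint64_t demo_run(void) {
    for (size_t sz = 8; sz <= 4096; sz *= 2) {
        (void)demo_alloc_gate(sz);
    }
    assert(ALLOC_GATE_STAT_GET(fast_hits) <= ALLOC_GATE_STAT_GET(total_calls));
    return ALLOC_GATE_STAT_GET(total_calls);
}
```

Building with `-DHAKMEM_DEBUG_COUNTERS=1` turns the counters back on without touching call sites, which is the main reason to prefer a macro over a runtime flag check here. Whether this recovers the estimated +2-3% would still need to be measured.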