Optimization Roadmap: mimalloc Gap Analysis & Phase 1-3 Plan

Add comprehensive mimalloc vs hakmem performance gap analysis (2.5x).

Gap sources (ranked by ROI):
1. Observation tax (stats macros): +2-3% overhead
2. Policy snapshot: +10-15% overhead (per-call TLS read + atomic sync)
3. Header management: +5-10% overhead (1-byte per block)
4. Wrapper layer: +5-10% overhead (LD_PRELOAD interception)
5. Routing switch: +3-5% overhead (5-way switch)

Optimization roadmap:
- Phase 1 (Quick Wins): +4-7% via FREE adoption + compile-out stats + inline
- Phase 2 (Structural): +5-10% via header tax removal + C0-C3 path + jump table
- Phase 3 (Cache): +12-22% via prefetch + cache optimization + static routing

Expected outcome: 52-68M ops/s (vs current 50.7M, gap from 2.5x → 1.9x)

Architectural reality: hakmem's 4-5 layer design adds 50-100x instruction
overhead vs mimalloc's 1-layer design. Gap closure caps at ~1.9x without
fundamental redesign.

Next immediate step: Implement Phase 1A (FREE adoption + compile-out stats)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@@ -37,15 +37,77 @@

---

## Next target: Profile Adoption & Remaining Optimization
## Next target: mimalloc Gap Closure Roadmap (2.5x → 1.9x)

**Priority A** - Promote the free path:
- Prepare to adopt FREE-TINY-FAST-DUALHOT-1 into the MIXED_TINYV3_C7_SAFE profile
- ENV: HAKMEM_TINY_LARSON_FIX=0 as default (DUALHOT ON)
**Gap Analysis**: hakmem 50.7M ops/s vs mimalloc 127M ops/s

**Priority B** - Alloc structural optimization (deferred):
- Reduce the "structural" overhead of `malloc` / Front Gate
- PGO / const propagation / inline optimizations
Root causes (ranked by ROI):
1. **Observation tax** (+2-3%): Stats macros branch even when OFF
2. **Policy snapshot** (+10-15%): Per-call TLS policy read + atomic sync
3. **Header management** (+5-10%): 1-byte header per block
4. **Wrapper layer** (+5-10%): malloc → tiny_alloc_gate_fast + security checks
5. **Routing switch** (+3-5%): Per-call switch statement

### Phase 1: Quick Wins (Week 1) - Target: +4-7% (52-56M ops/s)

**Priority A1** - Promote the winning FREE box to the main line:
- Make HAKMEM_FREE_TINY_FAST_HOTCOLD=1 the default for MIXED_TINYV3_C7_SAFE
- Enable FREE-TINY-FAST-DUALHOT-1 by default
- Expected: +2-3% (the DUALHOT effect has already been measured at +13%)

**Priority A2** - Eliminate the observation tax (compile-out stats):
- Add HAKMEM_DEBUG_COUNTERS compile-time flag (default 0)
- When 0: `#define ALLOC_GATE_STAT_INC(x) do {} while(0)` (zero cost)
- Files: `alloc_gate_stats_box.h`, `free_path_stats_box.h`, `tiny_front_stats_box.h`, `free_tiny_fast_hotcold_stats_box.h`
- Expected: +2-3% (eliminate branching on all stats)

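The compile-out pattern for A2 could look roughly like the sketch below. Only `HAKMEM_DEBUG_COUNTERS` and `ALLOC_GATE_STAT_INC` come from the plan; the counter struct, its field names, and `alloc_gate_demo()` are hypothetical stand-ins, and the real stats boxes may be organized differently.

```c
#include <assert.h>
#include <stdint.h>

#ifndef HAKMEM_DEBUG_COUNTERS
#define HAKMEM_DEBUG_COUNTERS 0   /* default OFF, as the plan proposes */
#endif

/* Hypothetical counter block; the real stats boxes may be laid out differently. */
typedef struct {
    uint64_t total_calls;
    uint64_t fast_hits;
} alloc_gate_stats_t;

alloc_gate_stats_t g_alloc_gate_stats;

#if HAKMEM_DEBUG_COUNTERS
/* Counters compiled in: a plain increment, no runtime "stats enabled?" branch. */
#define ALLOC_GATE_STAT_INC(field) do { g_alloc_gate_stats.field++; } while (0)
#else
/* Counters compiled out: expands to an empty statement. */
#define ALLOC_GATE_STAT_INC(field) do {} while (0)
#endif

/* Hypothetical instrumented hot path: returns 1 for the fast path, 0 otherwise. */
int alloc_gate_demo(int size)
{
    assert(size > 0);
    ALLOC_GATE_STAT_INC(total_calls);
    if (size <= 64) {
        ALLOC_GATE_STAT_INC(fast_hits);
        return 1;
    }
    return 0;
}
```

Building with `-DHAKMEM_DEBUG_COUNTERS=1` turns the counters back on without touching call sites. With the flag at 0 the macro leaves no code behind, so the counters stay at zero.
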
**Priority A3** - Inline header write:
- Add `__attribute__((always_inline))` to `tiny_region_id_write_header()`
- Eliminate function call overhead in hot path
- Expected: +1-2%

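A minimal sketch of what A3 might look like with GCC or Clang. The function name comes from the plan, but its signature, the header encoding (`TINY_HEADER_MAGIC` plus a 4-bit class index), and the read-back helper are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header encoding: a magic high nibble plus the size class index. */
#define TINY_HEADER_MAGIC 0xA0u

/* Assumed signature; always_inline asks the compiler to inline at every call site. */
static inline __attribute__((always_inline))
void *tiny_region_id_write_header(void *base, unsigned class_idx)
{
    assert(base != NULL && class_idx < 16u);
    uint8_t *p = (uint8_t *)base;
    p[0] = (uint8_t)(TINY_HEADER_MAGIC | class_idx);
    return p + 1;   /* the user pointer starts right after the 1-byte header */
}

/* Hypothetical helper: reads the class index back, as the free path would. */
static inline unsigned tiny_region_id_read_class(const void *user_ptr)
{
    const uint8_t *p = (const uint8_t *)user_ptr;
    return p[-1] & 0x0Fu;
}
```

The attribute only removes the call overhead. The 1-byte store itself remains, which is the cost that Phase 2 targets separately.
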
### Phase 2: Structural Changes (Weeks 2-3) - Target: +5-10% (55-61M ops/s)

**Priority B1** - C4-C7 header tax reduction:
- Remove 1-byte header for C6 (512B) / C7 (1024B) allocations
- Use registry-only lookup on free
- Expected: +3-5% (C6/C7 = 30% of workload, no header = 10% size savings)

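To make "registry-only lookup on free" concrete, here is a deliberately simplified sketch. The 64 KiB slab alignment, the direct-mapped table, and all names are assumptions. The actual hakmem registry is not shown in this diff and is likely more elaborate (for example, it would need to handle slot collisions).

```c
#include <assert.h>
#include <stdint.h>

#define SLAB_SHIFT     16u     /* assumption: slabs are 64 KiB aligned */
#define REGISTRY_SLOTS 1024u   /* assumption: small direct-mapped table */
#define CLASS_NONE     0xFFu   /* returned when the pointer is not registered */

typedef struct {
    uintptr_t slab_base;
    uint8_t   class_idx;
} registry_entry_t;

static registry_entry_t g_registry[REGISTRY_SLOTS];

static unsigned registry_slot(uintptr_t slab_base)
{
    return (unsigned)((slab_base >> SLAB_SHIFT) % REGISTRY_SLOTS);
}

/* Records the size class of a slab; a colliding entry is simply overwritten. */
void registry_register(uintptr_t slab_base, uint8_t class_idx)
{
    assert((slab_base & (((uintptr_t)1 << SLAB_SHIFT) - 1)) == 0);
    registry_entry_t *e = &g_registry[registry_slot(slab_base)];
    e->slab_base = slab_base;
    e->class_idx = class_idx;
}

/* Free-side lookup: derives the class from the address alone, no block header. */
uint8_t registry_class_of(const void *ptr)
{
    uintptr_t base = (uintptr_t)ptr & ~(((uintptr_t)1 << SLAB_SHIFT) - 1);
    const registry_entry_t *e = &g_registry[registry_slot(base)];
    return (base != 0 && e->slab_base == base) ? e->class_idx : CLASS_NONE;
}
```

The trade-off is that free now pays for a table read instead of a header byte read, so whether this wins depends on how cache-resident the registry is for C6/C7 traffic.
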
**Priority B2** - Dedicated C0-C3 fast path:
- Create `malloc_tiny_fast_c0c3()` entry point (no policy snapshot)
- Conditional dispatch from wrapper based on size
- Expected: +1-2%

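The size-based dispatch in B2 might be sketched as follows. `malloc_tiny_fast_c0c3()` is the name from the plan. The 64-byte cutoff assumes class n holds `8 << n` bytes (consistent with C6 = 512B and C7 = 1024B above, but not confirmed for C0-C3), and `wrapper_malloc()`, `malloc_general()`, and the call counters are hypothetical demo scaffolding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define C0C3_MAX_SIZE 64u   /* assumption: class n holds 8 << n bytes, so C3 = 64B */

/* Demo-only call counters so the chosen path is observable. */
unsigned g_fast_calls, g_general_calls;

static unsigned size_to_class_c0c3(size_t size)
{
    assert(size >= 1 && size <= C0C3_MAX_SIZE);
    if (size <= 8)  return 0;
    if (size <= 16) return 1;
    if (size <= 32) return 2;
    return 3;
}

/* Fast entry point: class is computed from size alone, no policy snapshot. */
void *malloc_tiny_fast_c0c3(size_t size)
{
    unsigned cls = size_to_class_c0c3(size);
    g_fast_calls++;
    return malloc((size_t)8 << cls);   /* stand-in for a TLS freelist pop */
}

/* Stand-in for the existing gate -> policy -> route path. */
void *malloc_general(size_t size)
{
    g_general_calls++;
    return malloc(size);
}

void *wrapper_malloc(size_t size)
{
    /* size - 1 wraps to SIZE_MAX for size == 0, so one compare covers 1..64. */
    if (size - 1 < C0C3_MAX_SIZE)
        return malloc_tiny_fast_c0c3(size);
    return malloc_general(size);
}
```

Folding the zero-size case into the single unsigned compare keeps the wrapper at one branch before the fast path.
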
**Priority B3** - Routing jump table:
- Replace switch(route_kind) with function pointer array
- Reduce branch prediction misses (5-way switch → direct dispatch)
- Expected: +1-3%

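A sketch of the function-pointer table from B3. The five route names and handler bodies are invented for illustration; only the idea of indexing by `route_kind` comes from the plan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical route kinds; the plan only says the switch is 5-way. */
typedef enum {
    ROUTE_TINY, ROUTE_SMALL, ROUTE_MID, ROUTE_LARGE, ROUTE_LEGACY,
    ROUTE_COUNT
} route_kind_t;

typedef int (*route_fn)(size_t size);

/* Placeholder handlers that just report which route ran. */
static int route_tiny(size_t size)   { (void)size; return ROUTE_TINY; }
static int route_small(size_t size)  { (void)size; return ROUTE_SMALL; }
static int route_mid(size_t size)    { (void)size; return ROUTE_MID; }
static int route_large(size_t size)  { (void)size; return ROUTE_LARGE; }
static int route_legacy(size_t size) { (void)size; return ROUTE_LEGACY; }

/* const table with designated initializers: one load plus one indirect call. */
static const route_fn g_route_table[ROUTE_COUNT] = {
    [ROUTE_TINY]   = route_tiny,
    [ROUTE_SMALL]  = route_small,
    [ROUTE_MID]    = route_mid,
    [ROUTE_LARGE]  = route_large,
    [ROUTE_LEGACY] = route_legacy,
};

static inline int dispatch_route(route_kind_t kind, size_t size)
{
    assert((unsigned)kind < ROUTE_COUNT);
    return g_route_table[kind](size);
}
```

Compilers often lower a dense `switch` to a jump table already, and an indirect call prevents inlining of the handlers, so the expected gain is worth confirming with a benchmark before committing to this form.
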
### Phase 3: Cache Locality (Weeks 4-5) - Target: +12-22% (57-68M ops/s)

**Priority C1** - TLS cache prefetch:
- `__builtin_prefetch(g_small_policy_v7, 0, 3)` on malloc entry
- Improve L1 hit rate on cold start
- Expected: +2-4%

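The prefetch in C1 could be placed as in this sketch (GCC/Clang builtin). `g_small_policy_v7` is the name from the plan. Its type, the policy contents, and `size_to_class()` are assumptions. The second argument `0` means "prefetch for read" and the third argument `3` requests the highest temporal locality.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy layout: one route kind per size class. */
typedef struct {
    unsigned route_kind[8];
} small_policy_t;

static small_policy_t g_policy_storage = { { 0, 0, 0, 0, 1, 1, 2, 2 } };

/* Thread-local pointer to the active policy (assumed shape of the real TLS slot). */
static _Thread_local small_policy_t *g_small_policy_v7 = &g_policy_storage;

/* Assumed class mapping: class n holds up to 8 << n bytes, capped at class 7. */
static unsigned size_to_class(size_t size)
{
    unsigned cls = 0;
    size_t cap = 8;
    while (cap < size && cls < 7) {
        cap <<= 1;
        cls++;
    }
    return cls;
}

unsigned policy_route_for(size_t size)
{
    assert(size > 0);
    /* Issue the prefetch first so the class computation overlaps the fetch. */
    __builtin_prefetch(g_small_policy_v7, 0, 3);
    unsigned cls = size_to_class(size);
    return g_small_policy_v7->route_kind[cls];
}
```

A prefetch only helps when there is independent work between the hint and the first use. If the policy line is already hot in L1, the instruction is pure overhead, so this is best validated on the cold-start case the plan mentions.
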
**Priority C2** - Slab metadata cache optimization:
- Profile cache-miss hotspots (policy struct, slab metadata)
- Hot/cold split of metadata
- Inline first slab descriptor
- Expected: +5-10%

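One way the hot/cold split in C2 might be expressed is shown below. Every field name is hypothetical, and the 64-byte cache line is an assumption about the target CPU. The point is only that fields touched on every alloc/free sit together, while rarely used bookkeeping lives behind a pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cold part: statistics and ownership data that the fast path never touches. */
typedef struct slab_cold {
    uint64_t alloc_count;
    uint64_t free_count;
    uint32_t owner_tid;
    uint32_t flags;
    char     debug_name[32];
} slab_cold_t;

/* Hot part: everything a freelist pop or push needs, kept small. */
typedef struct slab_hot {
    void        *freelist;   /* singly linked list threaded through free blocks */
    uint16_t     used;
    uint16_t     capacity;
    uint8_t      class_idx;
    slab_cold_t *cold;       /* followed only on slow paths */
} slab_hot_t;

_Static_assert(sizeof(slab_hot_t) <= 64,
               "hot slab metadata should fit in one (assumed 64B) cache line");

/* Fast-path pop: reads and writes the hot struct only. */
static inline void *slab_pop(slab_hot_t *s)
{
    assert(s != NULL);
    void *blk = s->freelist;
    if (blk != NULL) {
        s->freelist = *(void **)blk;   /* next pointer is stored in the free block */
        s->used++;
    }
    return blk;
}
```

The static assertion turns the layout goal into a build-time check, which keeps later field additions from silently pushing the hot struct past a cache line.
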
**Priority C3** - Static routing (if no learner):
- Detect static routes at init
- Bypass policy snapshot entirely
- Expected: +5-8%

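The init-time detection in C3 could be sketched like this. The `policy_t` layout, the `learner_enabled` flag, and both function names are assumptions. The sketch only illustrates freezing routes once when no learner can change them, so that the per-call path reads a plain array instead of taking a policy snapshot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CLASSES 8

/* Hypothetical policy: per-class route plus a flag saying whether a learner runs. */
typedef struct {
    unsigned char route_kind[NUM_CLASSES];
    bool          learner_enabled;
} policy_t;

static unsigned char g_static_route[NUM_CLASSES];
static bool          g_static_routing_on;

/* Called once at startup: freeze the routes if nothing will update them later. */
void routing_init(const policy_t *initial)
{
    assert(initial != NULL);
    g_static_routing_on = !initial->learner_enabled;
    if (g_static_routing_on) {
        for (size_t i = 0; i < NUM_CLASSES; i++)
            g_static_route[i] = initial->route_kind[i];
    }
}

/* Per-call lookup: the static table bypasses the live policy entirely. */
unsigned route_for_class(const policy_t *live, unsigned cls)
{
    assert(live != NULL && cls < NUM_CLASSES);
    if (g_static_routing_on)
        return g_static_route[cls];
    return live->route_kind[cls];   /* stands in for the per-call snapshot */
}
```

In the static case, later edits to the live policy are intentionally ignored, which is what makes the bypass safe only when no learner is active.
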
### Architectural Insight (Long-term)

**Reality check**: hakmem's 4-5 layer design (wrapper → gate → policy → route → handler) adds 50-100x instruction overhead vs mimalloc's 1-layer TLS buckets.

**Maximum realistic** without redesign: 65-70M ops/s (still ~1.9x gap)

**Future pivot**: Consider static-compiled routing + optional learner (not per-call policy)

---