Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
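The intrusive LIFO and ENV gate described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code in `tiny_c6_intrusive_freelist_box.h`: the names `ifl_list_t`, `ifl_next_store`, `ifl_next_load`, `ifl_push`, `ifl_pop`, and `ifl_enabled` are hypothetical, the link is assumed to sit one byte past the user pointer (one reading of "USER+1"), and treating the value `1` as "enabled" is an assumption. Only the environment variable name comes from the commit.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: the link lives 1 byte past the user pointer ("USER+1"). */
#define IFL_NEXT_OFFSET 1

typedef struct {
    void *head; /* top of the LIFO, NULL when empty */
} ifl_list_t;

/* The only two places that know how the link is encoded. memcpy is used
 * because the slot at offset 1 is not pointer-aligned. */
static inline void ifl_next_store(void *user, void *next) {
    memcpy((unsigned char *)user + IFL_NEXT_OFFSET, &next, sizeof next);
}

static inline void *ifl_next_load(const void *user) {
    void *next;
    memcpy(&next, (const unsigned char *)user + IFL_NEXT_OFFSET, sizeof next);
    return next;
}

/* Push a freed block: it becomes the new head and links to the old head. */
static inline void ifl_push(ifl_list_t *l, void *user) {
    ifl_next_store(user, l->head);
    l->head = user;
}

/* Pop the most recently freed block, or NULL when the list is empty. */
static inline void *ifl_pop(ifl_list_t *l) {
    void *user = l->head;
    if (user != NULL) {
        l->head = ifl_next_load(user);
    }
    return user;
}

/* ENV gate, read once and cached. Default OFF; "1" enabling it is an
 * assumption. The cache is not synchronized, so a real allocator would
 * resolve it during single-threaded init or use an atomic. */
int ifl_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL");
        cached = (v != NULL && v[0] == '1') ? 1 : 0;
    }
    return cached;
}
```

Routing every link access through the two load/store helpers keeps the offset and encoding in one place, which is presumably what "single source of truth" refers to. Because the list stores its link inside the freed block itself, it needs no side array, at the cost of touching the block's memory on every push and pop.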
1.5 KiB
mimalloc Gap Summary (Phase v11b-1)
Current Status: 2025-12-12
Throughput Comparison (Mixed 16-1024B, ws=400, 10M iter)
| Allocator | Throughput | vs mimalloc |
|---|---|---|
| mimalloc | 65.5M ops/s | 1.00x |
| hakmem v11b-1 | 50.7M ops/s | 0.77x |
Progress Summary
| Phase | Throughput | vs mimalloc | Key Change |
|---|---|---|---|
| v11a-4 | 38.6M | 0.59x | baseline |
| v11a-5 | 45.4M | 0.69x | alloc path: single switch + C7 early-exit |
| v11b-1 | 50.7M | 0.77x | free path: single switch + C7 early-exit |
perf stat Comparison (Mixed 16-1024B, v11a-5 data)
| Metric | mimalloc | hakmem | Ratio |
|---|---|---|---|
| cycles | ~500M | 1.04B | 2.1x |
| instructions | ~920M | 2.2B | 2.4x |
| cache-misses | ~90K | 408K | 4.5x |
| branch-misses | ~6.3M | 14.5M | 2.3x |
Next Target
Front-end optimization of both the alloc and free paths is complete. The next target is either the backend core or cache locality.
Candidates:
- cache locality: the 4.5x cache-misses gap is the largest difference → TLS page prefetch, hot page reuse
- instruction reduction: 2.4x → inlining, macro expansion
- design of the small band (C2-C3) for small-object v7
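As an illustration of the prefetch candidate, the sketch below pops from a per-thread freelist and hints the following block into cache so the next allocation is less likely to miss. It is a hypothetical example, not code from hakmem: `block_t` and `pop_with_prefetch` are made-up names, and whether this closes any of the measured gap is untested. `__builtin_prefetch` is the GCC/Clang builtin, so it is guarded for other compilers.

```c
#include <stddef.h>

/* Hypothetical freelist node: the link occupies the start of a free block. */
typedef struct block {
    struct block *next;
} block_t;

/* Pop the head block and prefetch its successor for the following pop. */
static inline void *pop_with_prefetch(block_t **head) {
    block_t *b = *head;
    if (b == NULL) {
        return NULL;
    }
    *head = b->next;
#if defined(__GNUC__) || defined(__clang__)
    if (b->next != NULL) {
        /* rw=1: the block will be written by the caller; locality=3: keep in cache */
        __builtin_prefetch(b->next, 1, 3);
    }
#endif
    return b;
}
```

The LIFO order already gives a form of hot reuse, since the most recently freed block is handed out first. The prefetch only helps when consecutive allocations are far enough apart for the hint to complete.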
Key Insight
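- Converting both the alloc and free paths to a single switch (jump table) is effective
- Front-layer optimization alone gave a +31% improvement from v11a-4 to v11b-1 (38.6M → 50.7M)
- The remaining gap to mimalloc is mainly cache-misses (4.5x) and instructions (2.4x)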
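The "single switch + C7 early-exit" shape can be sketched as below. This is one possible reading, not the hakmem code: `size_to_class`, `route_alloc`, and the `path_t` values are hypothetical, and the class boundaries (class `c` covers sizes up to `8 << c` bytes) are an assumption extrapolated from the C6 range of 257-512B. Whether the compiler actually emits a jump table for the switch depends on the compiler and flags.

```c
#include <stddef.h>

/* Hypothetical routing targets for the front end. */
typedef enum { PATH_C7, PATH_C6, PATH_SMALL, PATH_FALLBACK } path_t;

/* Assumed mapping: class c covers sizes up to (8 << c) bytes, so C6 is
 * 257-512B and C7 is 513-1024B. Returns 8 for anything larger. */
static inline int size_to_class(size_t size) {
    int c = 0;
    size_t limit = 8;
    while (c < 8 && size > limit) {
        limit <<= 1;
        c++;
    }
    return c;
}

/* One class lookup, an early exit for C7, then a single dense switch
 * instead of a chain of per-class range checks. */
static inline path_t route_alloc(size_t size) {
    int c = size_to_class(size);
    if (c == 7) {
        return PATH_C7; /* handled before the switch is reached */
    }
    switch (c) {
    case 6:
        return PATH_C6;
    case 0: case 1: case 2: case 3: case 4: case 5:
        return PATH_SMALL;
    default:
        return PATH_FALLBACK;
    }
}
```

Under these assumptions `route_alloc(512)` selects the C6 path and `route_alloc(513)` takes the C7 early exit. The free path would use the same shape, keyed on the class recovered from the block rather than on the requested size.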
Date: 2025-12-12 Phase: v11b-1 complete