Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)

Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Singly linked LIFO with the next pointer stored at the USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS
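
The list above can be illustrated with a minimal sketch of an intrusive LIFO freelist. It is not the code from this commit: `ifl_t`, `ifl_push`, `ifl_pop`, `ifl_next_store` and `ifl_next_load` are hypothetical stand-ins for the real box API (the latter two play the role of tiny_next_store/tiny_next_load), the 1-byte link offset is an assumed reading of "USER+1", and counting a pop on an empty list as a fallback is also an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { IFL_NEXT_OFF = 1 };  /* assumed byte offset of the link inside a free block */

/* Single place that knows where the link lives (hypothetical helpers).
 * memcpy keeps the access well-defined even though the offset is unaligned. */
static void ifl_next_store(void *blk, void *next) {
    memcpy((unsigned char *)blk + IFL_NEXT_OFF, &next, sizeof next);
}

static void *ifl_next_load(const void *blk) {
    void *next;
    memcpy(&next, (const unsigned char *)blk + IFL_NEXT_OFF, sizeof next);
    return next;
}

typedef struct {
    void *head;                         /* top of the LIFO, NULL when empty */
    unsigned long push, pop, fallback;  /* counters in the spirit of c6_ifl_* */
} ifl_t;

/* Free path: the freed block itself becomes the new list head. */
static void ifl_push(ifl_t *fl, void *blk) {
    ifl_next_store(blk, fl->head);
    fl->head = blk;
    fl->push++;
}

/* Alloc path: pop the head, or report a fallback when the list is empty. */
static void *ifl_pop(ifl_t *fl) {
    void *blk = fl->head;
    if (blk == NULL) {
        fl->fallback++;
        return NULL;
    }
    fl->head = ifl_next_load(blk);
    fl->pop++;
    return blk;
}

/* Usage: two pushes, then three pops; returns the fallback count. */
static unsigned long ifl_demo(void) {
    unsigned char a[64], b[64];
    ifl_t fl = {0};
    ifl_push(&fl, a);
    ifl_push(&fl, b);
    void *first = ifl_pop(&fl);
    void *second = ifl_pop(&fl);
    void *third = ifl_pop(&fl);
    assert(first == b && second == a);  /* last in, first out */
    assert(third == NULL);              /* empty list falls back */
    return fl.fallback;
}
```

Because the link is stored inside the free block, the list needs no side array, which is the difference from the ENV_OFF (array) variant measured below.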

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-12 16:26:42 +09:00
parent bf83612b97
commit 1a8652a91a
18 changed files with 1268 additions and 217 deletions


@ -435,4 +435,30 @@ moving on to "internal optimization" of the v3 backend so_alloc_fast/so_free_fast paths
**Recommendation**: Before implementing Phase SO-BACKEND-OPT-2, measure so_alloc_fast/so_free_fast in detail with a perf profile (cycles:u).
---
## Phase v11b-1: Free Path Micro-Optimization (2025-12-12)
### Changes
A perf profile showed that the serial ULTRA checks (C7→C6→C5→C4) in `free_tiny_fast()` account for 11.73% overhead. The same pattern used in `malloc_tiny_fast()` is applied:
1. **C7 ULTRA early-exit**: Check for C7 before taking the policy snapshot (shortens the most frequent path)
2. **Single switch**: Branch once on route_kind[class_idx] (generates a jump table)
3. **Dead code removal**: Remove the unused v4 check and the duplicate v7 check
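
The first two steps can be sketched as follows. This is an illustration only, not the actual `free_tiny_fast()`: the route names, the handler functions, the contents of `route_kind[]` and the `free_tiny_sketch` name are all hypothetical, and whether the switch becomes a jump table depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical route kinds; the real enum is not shown in this log. */
typedef enum { ROUTE_LEGACY, ROUTE_ULTRA, ROUTE_MID, ROUTE_COUNT } route_t;

enum { NUM_CLASSES = 8 };

/* Illustrative per-class routing table (contents are arbitrary). */
static const unsigned char route_kind[NUM_CLASSES] = {
    ROUTE_LEGACY, ROUTE_LEGACY, ROUTE_LEGACY, ROUTE_LEGACY,
    ROUTE_ULTRA,  ROUTE_ULTRA,  ROUTE_MID,    ROUTE_ULTRA,
};

/* Stub handlers that only count how often each route is taken. */
static unsigned long hits[ROUTE_COUNT];
static void free_legacy(void *p) { (void)p; hits[ROUTE_LEGACY]++; }
static void free_ultra(void *p)  { (void)p; hits[ROUTE_ULTRA]++; }
static void free_mid(void *p)    { (void)p; hits[ROUTE_MID]++; }

static void free_tiny_sketch(void *p, unsigned class_idx) {
    /* Step 1: the hottest class exits before any policy state is read. */
    if (class_idx == 7) {
        free_ultra(p);
        return;
    }
    /* Out-of-range classes go to the slow path instead of indexing the table. */
    if (class_idx >= NUM_CLASSES) {
        free_legacy(p);
        return;
    }
    /* Step 2: one table lookup and one switch replace the serial
     * C7 -> C6 -> C5 -> C4 if-else chain. */
    switch (route_kind[class_idx]) {
    case ROUTE_ULTRA: free_ultra(p);  break;
    case ROUTE_MID:   free_mid(p);    break;
    default:          free_legacy(p); break;
    }
}
```

With the serial chain, a class that matches late pays for every earlier failed check; with the table and switch, each class pays for one lookup and one indirect branch.
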
### Results
| Workload | v11a-5 | v11b-1 | Improvement |
|----------|--------|--------|------|
| Mixed 16-1024B | 45.4M ops/s | 50.7M ops/s | **+11.7%** |
| C6-heavy | 49.1M ops/s | 52.0M ops/s | **+5.9%** |
| C6-heavy + MID v3.5 | 53.1M ops/s | 53.6M ops/s | +0.9% |
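### Lessons learned
- The same pattern used for the alloc path optimization (v11a-5) also works for the free path
- Replacing the serial if-else chain with a switch (jump table) gives a large improvement
- Branch cost in the front layer is larger than in the backend (+11.7% this time vs. an expected +1-2%)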
***