Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)

Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00
parent bf83612b97
commit 1a8652a91a
18 changed files with 1268 additions and 217 deletions
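
For readers unfamiliar with the technique named in the commit message, the following is a minimal sketch of an intrusive LIFO freelist, where the link to the next free block is stored inside the freed block itself. It is not the code from this commit: the names `ifl_list`, `ifl_next_store`, `ifl_next_load`, `ifl_push`, and `ifl_pop` are hypothetical, and reading "USER+1 offset" as one byte past the user pointer is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical byte offset of the link inside a free block. The commit
 * message says "USER+1"; treating that as one byte past the user pointer
 * is an assumption of this sketch. */
#define IFL_NEXT_OFFSET 1

typedef struct {
    void  *head;   /* most recently freed block, or NULL when empty */
    size_t count;  /* number of blocks currently linked */
} ifl_list;

/* The link slot may be unaligned, so it is accessed through memcpy. */
static void ifl_next_store(void *block, void *next) {
    memcpy((unsigned char *)block + IFL_NEXT_OFFSET, &next, sizeof next);
}

static void *ifl_next_load(const void *block) {
    void *next;
    memcpy(&next, (const unsigned char *)block + IFL_NEXT_OFFSET, sizeof next);
    return next;
}

/* Push: the freed block becomes the new head and points at the old head. */
static void ifl_push(ifl_list *l, void *block) {
    assert(block != NULL);
    ifl_next_store(block, l->head);
    l->head = block;
    l->count++;
}

/* Pop: return the head, or NULL so the caller can take a fallback path. */
static void *ifl_pop(ifl_list *l) {
    void *block = l->head;
    if (block == NULL) return NULL;
    l->head = ifl_next_load(block);
    l->count--;
    return block;
}
```

Routing every link access through one store/load pair mirrors the "single source of truth" idea in the message: the offset is defined in exactly one place. Blocks must be at least `IFL_NEXT_OFFSET + sizeof(void *)` bytes for this sketch to be safe.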
--- a/docs/analysis/SMALLOBJECT_HOTBOX_V3_DESIGN.md
+++ b/docs/analysis/SMALLOBJECT_HOTBOX_V3_DESIGN.md
@@ -435,4 +435,30 @@ Moving on to "internal optimization" of the v3 backend so_alloc_fast/so_free_fast paths

 **Recommendation**: Before implementing Phase SO-BACKEND-OPT-2, measure so_alloc_fast/so_free_fast in detail with a perf profile (cycles:u).

+---
+
+## Phase v11b-1: Free Path Micro-Optimization (2025-12-12)
+
+### Changes
+
+The perf profile showed that the serial ULTRA checks (C7→C6→C5→C4) in `free_tiny_fast()` accounted for 11.73% overhead. This phase applies the same pattern used in `malloc_tiny_fast()`:
+
+1. **C7 ULTRA early-exit**: Check for C7 before taking the policy snapshot (shortens the most frequent path)
+2. **Single switch**: Dispatch in one step on route_kind[class_idx] (generates a jump table)
+3. **Dead code removal**: Remove unused v4 checks and duplicate v7 checks
+
+### Results
+
+| Workload | v11a-5 | v11b-1 | Improvement |
+|----------|--------|--------|------|
+| Mixed 16-1024B | 45.4M ops/s | 50.7M ops/s | **+11.7%** |
+| C6-heavy | 49.1M ops/s | 52.0M ops/s | **+5.9%** |
+| C6-heavy + MID v3.5 | 53.1M ops/s | 53.6M ops/s | +0.9% |
+
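+### Lessons Learned
+
+- The same pattern as the alloc-path optimization (v11a-5) also works on the free path
+- Replacing the serial if-else chain with a switch (jump table) gave a large improvement
+- Branch cost in the front layer is larger than in the backend (+11.7% this time vs. an expected +1-2%)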
+
 ***
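
The dispatch shape described in the Phase v11b-1 notes (early exit for C7, then a single switch on `route_kind[class_idx]`) can be sketched roughly as below. This is an illustration, not the code from the commit: `route_kind_t`, `free_path_t`, `policy_snapshot`, `free_dispatch`, the enum values, and the class count of 8 are all hypothetical, and the function only reports which path would be taken instead of freeing anything.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_CLASSES = 8 };  /* assumed class count (C0..C7) */

/* Hypothetical route kinds stored per size class in the policy. */
typedef enum { ROUTE_LEGACY = 0, ROUTE_ULTRA, ROUTE_MID } route_kind_t;

/* Hypothetical identifiers for the path a free request ends up on. */
typedef enum { PATH_LEGACY, PATH_C7_ULTRA, PATH_ULTRA, PATH_MID } free_path_t;

typedef struct {
    route_kind_t route_kind[NUM_CLASSES];
} policy_snapshot;

static free_path_t free_dispatch(const policy_snapshot *policy,
                                 int class_idx, int c7_ultra_on) {
    assert(class_idx >= 0 && class_idx < NUM_CLASSES);

    /* Early exit: the most frequent class is handled before the policy
     * snapshot is read at all. */
    if (class_idx == 7 && c7_ultra_on) return PATH_C7_ULTRA;

    /* One switch replaces the serial C7 -> C6 -> C5 -> C4 checks. */
    switch (policy->route_kind[class_idx]) {
    case ROUTE_ULTRA: return PATH_ULTRA;
    case ROUTE_MID:   return PATH_MID;
    case ROUTE_LEGACY:
    default:          return PATH_LEGACY;
    }
}
```

Whether the compiler actually lowers the switch to a jump table depends on the number and density of cases and on optimization flags, so the generated assembly is worth checking before relying on that.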