Phase PERF-ULTRA-REFILL-OPT-1a/1b: C7 ULTRA refill path optimization

Changes:
- Phase 1a: Page size as a macro
  - Define TINY_C7_ULTRA_PAGE_SHIFT (16)
  - In tiny_c7_ultra_page_of, replace division with a bit shift
  - In refill/free, compute seg_end with a bit shift instead of multiplication
- Phase 1b: Move segment learning
  - Move segment learning from the first free to alloc refill
  - Remove the unlikely segment_from_ptr call on the free side
  - Assumes the segment is already learned in the normal pattern (alloc → free)

Benchmark results (Mixed 16-1024B, 1M iter, ws=400):
- Baseline: 39.5M ops/s
- Phase 1a: 39.5M ops/s (within noise)
- Phase 1b: 42.3M ops/s
- Final average: 43.9M ops/s (+11.1% = +4.4M ops/s)

The tiny_c7_ultra_page_of change (Phase 1a) measures the same as the baseline, but in practice the following improve:
- Lower division cost (a few cycles per call)
- Segment learning removed from free (one fewer call per thread)
- Simpler computation in refill

Together these complete the optimization of the overall refill path.
2025-12-11 22:16:07 +09:00
parent 17b6be518b
commit fc1c47043c
2 changed files with 22 additions and 13 deletions
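The division-to-shift change described in the commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the `DEMO_` macros and `demo_` functions are hypothetical names, and only the values 64 KiB and 16 come from the diff. The `static_assert` is an added assumption about how one might keep the shift and the page size in sync.

```c
#include <assert.h>   // static_assert (C11)
#include <stddef.h>
#include <stdint.h>

// Values mirror the macros in the diff; the DEMO_ names are hypothetical.
#define DEMO_PAGE_SIZE  ((size_t)(64 * 1024))
#define DEMO_PAGE_SHIFT 16

// Guard (an assumption, not in the commit): the shift is only valid
// while it matches the page size exactly.
static_assert(((size_t)1 << DEMO_PAGE_SHIFT) == DEMO_PAGE_SIZE,
              "page shift must match page size");

// Page index by division (the form before the change).
static uint32_t demo_page_idx_div(size_t offset) {
    return (uint32_t)(offset / DEMO_PAGE_SIZE);
}

// Page index by shift (the form after the change).
static uint32_t demo_page_idx_shift(size_t offset) {
    return (uint32_t)(offset >> DEMO_PAGE_SHIFT);
}

// Segment end: base + num_pages * page_size, written as a shift.
static uintptr_t demo_seg_end(uintptr_t base, uint32_t num_pages) {
    return base + ((uintptr_t)num_pages << DEMO_PAGE_SHIFT);
}
```

Both index functions agree for every offset because the page size is a power of two. The shift matters mainly when the divisor is a runtime value, as with `seg->page_size` in the original line: the compiler cannot reduce that division to a shift on its own. The trade-off is that the shifted form silently assumes every segment uses 64 KiB pages.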
--- a/core/tiny_c7_ultra_segment.c
+++ b/core/tiny_c7_ultra_segment.c
@@ -11,6 +11,7 @@
 // Split the 2MiB segment into 64KiB pages (C7 only; power of two for easy masking)
 #define TINY_C7_ULTRA_SEG_SIZE  ((size_t)(2 * 1024 * 1024))
 #define TINY_C7_ULTRA_PAGE_SIZE ((size_t)(64 * 1024))
+#define TINY_C7_ULTRA_PAGE_SHIFT 16  // 64KiB = 2^16 (for O(1) bit shift instead of division)

 static __thread tiny_c7_ultra_segment_t* g_ultra_seg;

@@ -92,7 +93,8 @@ tiny_c7_ultra_page_meta_t* tiny_c7_ultra_page_of(void* p,
    uintptr_t base = (uintptr_t)seg->base;
    uintptr_t addr = (uintptr_t)p;
    size_t offset = (size_t)(addr - base);
-    uint32_t idx = (uint32_t)(offset / seg->page_size);
+    // Phase PERF-ULTRA-REFILL-OPT-1a: Replace division with bit shift for O(1) lookup
+    uint32_t idx = (uint32_t)(offset >> TINY_C7_ULTRA_PAGE_SHIFT);
    if (idx >= seg->num_pages) {
        return NULL;
    }