Phase 6-4: Larson benchmark optimizations - LUT size-to-class

Two optimizations to improve Larson benchmark performance:

1. **Option A: Fast Path Priority** (core/hakmem.c)
   - Move HAKMEM_TINY_FAST_PATH check before all guard checks
   - Reduce malloc() fast path from 8+ branches to 3 branches
   - Results: +42% ST, -20% MT (mixed results)

2. **LUT Optimization** (core/tiny_fastcache.h)
   - Replace the 11-branch linear search with an O(1) lookup table
   - Index size_to_class_lut by (size + 7) >> 3, i.e. the size rounded up to an 8-byte granule, so each entry maps exactly one class
   - Results: +24% MT, -24% ST (an MT-optimized tradeoff)

Benchmark results (Larson 2s 8-128B 1024 chunks):
- Original:     ST 0.498M ops/s, MT 1.502M ops/s
- LUT version:  ST 0.377M ops/s, MT 1.856M ops/s

Analysis:
- ST regression: with a single thread the branch predictor quickly learns the linear-search pattern, so the 11 branches were nearly free; the LUT trades them for a dependent memory load
- MT improvement: the LUT has no data-dependent branches to mispredict after a context switch, when predictor state has been disturbed
- Recommendation: keep the LUT for multi-threaded workloads

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Author: Claude
Date: 2025-11-05 04:58:03 +00:00
Parent: b64cfc055e
Commit: 09e1d89e8d
2 changed files with 79 additions and 82 deletions


@@ -37,25 +37,44 @@ extern __thread uint32_t g_tiny_fast_count[TINY_FAST_CLASS_COUNT];
 extern __thread int g_tiny_fast_initialized;
 // ========== Size to Class Mapping ==========
-// Inline size-to-class for fast path (minimal branches)
+// Inline size-to-class for fast path (O(1) lookup table)
 static inline int tiny_fast_size_to_class(size_t size) {
-    // Class mapping (same as existing Tiny classes):
-    // 0: 16B, 1: 24B, 2: 32B, 3: 40B, 4: 48B, 5: 56B, 6: 64B
-    // 7: 80B, 8: 96B, 9: 112B, 10: 128B, 11-15: reserved
-    if (size <= 16) return 0;
-    if (size <= 24) return 1;
-    if (size <= 32) return 2;
-    if (size <= 40) return 3;
-    if (size <= 48) return 4;
-    if (size <= 56) return 5;
-    if (size <= 64) return 6;
-    if (size <= 80) return 7;
-    if (size <= 96) return 8;
-    if (size <= 112) return 9;
-    if (size <= 128) return 10;
-    return -1; // Not tiny
+    // Optimized: lookup table for O(1) mapping (vs 11-branch linear search)
+    // Indexed by (size + 7) >> 3 (size rounded up to an 8-byte granule),
+    // so each index selects exactly one class for sizes 0-128
+    // Class mapping: 0:16B, 1:24B, 2:32B, 3:40B, 4:48B, 5:56B, 6:64B, 7:80B, 8:96B, 9:112B, 10:128B
+    static const int8_t size_to_class_lut[17] = {
+        0, 0, 0,     //    0..16  -> 16B  (class 0)
+        1,           //   17..24  -> 24B  (class 1)
+        2,           //   25..32  -> 32B  (class 2)
+        3,           //   33..40  -> 40B  (class 3)
+        4,           //   41..48  -> 48B  (class 4)
+        5,           //   49..56  -> 56B  (class 5)
+        6,           //   57..64  -> 64B  (class 6)
+        7, 7,        //   65..80  -> 80B  (class 7)
+        8, 8,        //   81..96  -> 96B  (class 8)
+        9, 9,        //   97..112 -> 112B (class 9)
+        10, 10       //  113..128 -> 128B (class 10)
+    };
+    if (__builtin_expect(size > 128, 0)) return -1; // Not tiny
+    // Fast path: one table load, no data-dependent branches
+    return size_to_class_lut[(size + 7) >> 3];
 }
// ========== Forward Declarations ==========