Performance Measurement Framework: Unified Cache, TLS SLL, Shared Pool Analysis

## Summary

Implemented production-grade measurement infrastructure to quantify top 3 bottlenecks:
- Unified cache hit/miss rates + refill cost
- TLS SLL usage patterns
- Shared pool lock contention distribution

## Changes

### 1. Unified Cache Metrics (tiny_unified_cache.h/c)
- Added atomic counters:
  - g_unified_cache_hits_global: successful cache pops
  - g_unified_cache_misses_global: refill triggers
  - g_unified_cache_refill_cycles_global: refill cost in CPU cycles (rdtsc)
- Instrumented `unified_cache_pop_or_refill()` to count hits
- Instrumented `unified_cache_refill()` with cycle measurement
- ENV-gated: HAKMEM_MEASURE_UNIFIED_CACHE=1 (default: off)
- Added unified_cache_print_measurements() output function

### 2. TLS SLL Metrics (tls_sll_box.h)
- Added atomic counters:
  - g_tls_sll_push_count_global: total pushes
  - g_tls_sll_pop_count_global: successful pops
  - g_tls_sll_pop_empty_count_global: empty list conditions
- Instrumented push/pop paths
- Added tls_sll_print_measurements() output function

### 3. Shared Pool Contention (hakmem_shared_pool_acquire.c)
- Added atomic counters:
  - g_sp_stage2_lock_acquired_global: Stage 2 locks
  - g_sp_stage3_lock_acquired_global: Stage 3 allocations
  - g_sp_alloc_lock_contention_global: total lock acquisitions
- Instrumented all pthread_mutex_lock calls in hot paths
- Added shared_pool_print_measurements() output function

### 4. Benchmark Integration (bench_random_mixed.c)
- Called all 3 print functions after benchmark loop
- Functions active only when HAKMEM_MEASURE_UNIFIED_CACHE=1 set

## Design Principles

- **Zero overhead when disabled**: Inline checks with __builtin_expect hints
- **Atomic relaxed memory order**: Minimal synchronization overhead
- **ENV-gated**: Single flag controls all measurements
- **Production-safe**: Compiles in release builds, no functional changes

## Usage

```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```

Output (when enabled):
```
========================================
Unified Cache Statistics
========================================
Hits: 1234567
Misses: 56789
Hit Rate: 95.6%
Avg Refill Cycles: 1234
========================================
TLS SLL Statistics
========================================
Total Pushes: 1234567
Total Pops: 345678
Pop Empty Count: 12345
Hit Rate: 98.8%
========================================
Shared Pool Contention Statistics
========================================
Stage 2 Locks: 123456 (33%)
Stage 3 Locks: 234567 (67%)
Total Contention: 357 locks per 1M ops
```

## Next Steps

1. **Enable measurements** and run benchmarks to gather data
2. **Analyze miss rates**: Which bottleneck dominates?
3. **Profile hottest stage**: Focus optimization on top contributor
4. Possible targets:
   - Increase unified cache capacity if miss rate >5%
   - Profile if TLS SLL is unused (potential legacy code removal)
   - Analyze if Stage 2 lock can be replaced with CAS

## Makefile Updates

Added core/box/tiny_route_box.o to:
- OBJS_BASE (test build)
- SHARED_OBJS (shared library)
- BENCH_HAKMEM_OBJS_BASE (benchmark)
- TINY_BENCH_OBJS_BASE (tiny benchmark)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
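
## Illustrative Sketch

The design principles in this message (an ENV gate read once, a branch hinted as unlikely, relaxed atomic counters) can be sketched in isolation as follows. This is a minimal sketch and not the hakmem code: every `demo_*` name and the `DEMO_MEASURE` variable are hypothetical, and `__builtin_expect` assumes GCC or Clang. Unlike the diff, the sketch stores the cached gate in an `_Atomic int` so that a concurrent first call is not a data race.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical counters, named demo_* to avoid implying they match hakmem. */
static _Atomic uint64_t demo_hits = 0;
static _Atomic uint64_t demo_misses = 0;
static _Atomic uint64_t demo_refill_cycles = 0;

/* Cached gate: -1 = not read yet, 0 = off, 1 = on. */
static _Atomic int demo_gate = -1;

/* Read the (hypothetical) DEMO_MEASURE variable once, then reuse the result. */
static inline int demo_measure_enabled(void) {
    int g = atomic_load_explicit(&demo_gate, memory_order_relaxed);
    if (__builtin_expect(g == -1, 0)) {
        const char *e = getenv("DEMO_MEASURE");
        g = (e && *e && *e != '0') ? 1 : 0;
        atomic_store_explicit(&demo_gate, g, memory_order_relaxed);
    }
    return g;
}

/* Hot path: when the gate is off this is one load and one predicted branch. */
static inline void demo_record_hit(void) {
    if (__builtin_expect(demo_measure_enabled(), 0)) {
        atomic_fetch_add_explicit(&demo_hits, 1, memory_order_relaxed);
    }
}

/* Miss path: count the miss and accumulate the measured refill cost. */
static inline void demo_record_miss(uint64_t refill_cycles) {
    if (__builtin_expect(demo_measure_enabled(), 0)) {
        atomic_fetch_add_explicit(&demo_misses, 1, memory_order_relaxed);
        atomic_fetch_add_explicit(&demo_refill_cycles, refill_cycles,
                                  memory_order_relaxed);
    }
}

/* Hit rate in percent; 0.0 when nothing was recorded. */
static inline double demo_hit_rate(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    assert(total >= hits);  /* guards against uint64_t wraparound */
    return total ? (100.0 * (double)hits) / (double)total : 0.0;
}

/* Average refill cost per miss; 0.0 when there were no misses. */
static inline double demo_avg_refill(uint64_t cycles, uint64_t misses) {
    return misses ? (double)cycles / (double)misses : 0.0;
}

/* Print a summary to stderr; prints nothing when the gate is off. */
static inline void demo_print(void) {
    if (!demo_measure_enabled()) return;
    uint64_t h = atomic_load_explicit(&demo_hits, memory_order_relaxed);
    uint64_t m = atomic_load_explicit(&demo_misses, memory_order_relaxed);
    uint64_t c = atomic_load_explicit(&demo_refill_cycles, memory_order_relaxed);
    fprintf(stderr, "Hits: %llu Misses: %llu Hit Rate: %.1f%% Avg Refill: %.0f\n",
            (unsigned long long)h, (unsigned long long)m,
            demo_hit_rate(h, m), demo_avg_refill(c, m));
}
```

Relaxed ordering is enough here because the counters are only summed and read after the workload finishes; they do not publish any other data between threads. A caller would invoke `demo_record_hit()` on a cache pop, `demo_record_miss(delta)` after a refill, and `demo_print()` once at the end of the run.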
2025-12-04 18:26:39 +09:00
parent d5e6ed535c
commit 860991ee50
8 changed files with 292 additions and 5 deletions
--- a/core/front/tiny_unified_cache.c
+++ b/core/front/tiny_unified_cache.c
@@ -11,6 +11,40 @@
 #include <stdlib.h>
 #include <string.h>
 #include <stdatomic.h>
+#include <time.h>
+
+// ============================================================================
+// Performance Measurement: Unified Cache (ENV-gated)
+// ============================================================================
+// Global atomic counters for unified cache performance measurement
+// ENV: HAKMEM_MEASURE_UNIFIED_CACHE=1 to enable (default: OFF)
+_Atomic uint64_t g_unified_cache_hits_global = 0;
+_Atomic uint64_t g_unified_cache_misses_global = 0;
+_Atomic uint64_t g_unified_cache_refill_cycles_global = 0;
+
+// Helper: Get cycle count (x86_64 rdtsc)
+static inline uint64_t read_tsc(void) {
+#if defined(__x86_64__) || defined(_M_X64)
+    uint32_t lo, hi;
+    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
+    return ((uint64_t)hi << 32) | lo;
+#else
+    // Fallback for non-x86 platforms: returns CLOCK_MONOTONIC nanoseconds, not cycles
+    struct timespec ts;
+    clock_gettime(CLOCK_MONOTONIC, &ts);
+    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
+#endif
+}
+
+// Check if measurement is enabled (cached)
+static inline int unified_cache_measure_enabled(void) {
+    static int g_measure = -1;
+    if (__builtin_expect(g_measure == -1, 0)) {
+        const char* e = getenv("HAKMEM_MEASURE_UNIFIED_CACHE");
+        g_measure = (e && *e && *e != '0') ? 1 : 0;
+    }
+    return g_measure;
+}

 // Phase 23-E: Forward declarations
 extern __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES];  // From hakmem_tiny_superslab.c
@@ -294,6 +328,13 @@ static inline int unified_refill_validate_base(int class_idx,
 // Returns: BASE pointer (first block, wrapped), or NULL-wrapped if failed
 // Design: Direct carve from SuperSlab to array (no TLS SLL intermediate layer)
 hak_base_ptr_t unified_cache_refill(int class_idx) {
+    // Measure refill cost if enabled
+    uint64_t start_cycles = 0;
+    int measure = unified_cache_measure_enabled();
+    if (measure) {
+        start_cycles = read_tsc();
+    }
+
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];

    // Step 1: Ensure SuperSlab available
@@ -443,5 +484,51 @@ hak_base_ptr_t unified_cache_refill(int class_idx) {
    g_unified_cache_miss[class_idx]++;
    #endif

+    // Measure refill cycles
+    if (measure) {
+        uint64_t end_cycles = read_tsc();
+        uint64_t delta = end_cycles - start_cycles;
+        atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_global, delta, memory_order_relaxed);
+        atomic_fetch_add_explicit(&g_unified_cache_misses_global, 1, memory_order_relaxed);
+    }
+
    return HAK_BASE_FROM_RAW(first);  // Return first block (BASE pointer)
 }
+
+// ============================================================================
+// Performance Measurement: Print Statistics
+// ============================================================================
+void unified_cache_print_measurements(void) {
+    if (!unified_cache_measure_enabled()) {
+        return;  // Measurement disabled, nothing to print
+    }
+
+    uint64_t hits = atomic_load_explicit(&g_unified_cache_hits_global, memory_order_relaxed);
+    uint64_t misses = atomic_load_explicit(&g_unified_cache_misses_global, memory_order_relaxed);
+    uint64_t refill_cycles = atomic_load_explicit(&g_unified_cache_refill_cycles_global, memory_order_relaxed);
+
+    uint64_t total = hits + misses;
+    if (total == 0) {
+        fprintf(stderr, "\n========================================\n");
+        fprintf(stderr, "Unified Cache Statistics\n");
+        fprintf(stderr, "========================================\n");
+        fprintf(stderr, "No operations recorded\n");
+        fprintf(stderr, "========================================\n\n");
+        return;
+    }
+
+    double hit_rate = (100.0 * hits) / total;
+    double avg_refill_cycles = misses > 0 ? (double)refill_cycles / misses : 0.0;
+
+    // Estimate time at 1GHz (conservative, most modern CPUs are 2-4GHz)
+    double avg_refill_us = avg_refill_cycles / 1000.0;
+
+    fprintf(stderr, "\n========================================\n");
+    fprintf(stderr, "Unified Cache Statistics\n");
+    fprintf(stderr, "========================================\n");
+    fprintf(stderr, "Hits:        %llu\n", (unsigned long long)hits);
+    fprintf(stderr, "Misses:      %llu\n", (unsigned long long)misses);
+    fprintf(stderr, "Hit Rate:    %.1f%%\n", hit_rate);
+    fprintf(stderr, "Avg Refill Cycles: %.0f (est. %.2fus @ 1GHz)\n", avg_refill_cycles, avg_refill_us);
+    fprintf(stderr, "========================================\n\n");
+}