Performance Measurement Framework: Unified Cache, TLS SLL, Shared Pool Analysis

## Summary

Implemented production-grade measurement infrastructure to quantify top 3 bottlenecks:
- Unified cache hit/miss rates + refill cost
- TLS SLL usage patterns
- Shared pool lock contention distribution

## Changes

### 1. Unified Cache Metrics (tiny_unified_cache.h/c)
- Added atomic counters:
  - g_unified_cache_hits_global: successful cache pops
  - g_unified_cache_misses_global: refill triggers
  - g_unified_cache_refill_cycles_global: refill cost in CPU cycles (rdtsc)
- Instrumented `unified_cache_pop_or_refill()` to count hits
- Instrumented `unified_cache_refill()` with cycle measurement
- ENV-gated: HAKMEM_MEASURE_UNIFIED_CACHE=1 (default: off)
- Added unified_cache_print_measurements() output function
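
The hit-side instrumentation lives in the header and is not among the hunks shown on this page. As a rough sketch of the idea only, the following toy cache counts a hit when a pop succeeds and a miss when a refill is triggered. Every `demo_*` name, the 8-slot cache, and the plain `demo_measure` flag are assumptions made for illustration, not the actual `unified_cache_pop_or_refill()` code.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real global counters. */
static _Atomic uint64_t demo_hits;
static _Atomic uint64_t demo_misses;
static int demo_measure;               /* stand-in for the ENV gate */

/* Toy LIFO cache with 8 slots; the real cache layout is not shown here. */
typedef struct { void *slots[8]; int count; } demo_cache_t;
static char demo_blocks[8][16];

/* Toy refill: load all 8 slots and hand the last one straight back. */
static void *demo_refill(demo_cache_t *c) {
    for (int i = 0; i < 8; i++) c->slots[i] = demo_blocks[i];
    c->count = 7;
    return c->slots[7];
}

/* Pop from the cache; count a hit on success, a miss when refilling. */
static void *demo_pop(demo_cache_t *c) {
    if (c->count > 0) {
        if (demo_measure)
            atomic_fetch_add_explicit(&demo_hits, 1, memory_order_relaxed);
        return c->slots[--c->count];
    }
    if (demo_measure)
        atomic_fetch_add_explicit(&demo_misses, 1, memory_order_relaxed);
    return demo_refill(c);
}
```

With this toy layout, nine consecutive pops on an empty cache record two misses (the first and the ninth pop) and seven hits.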

### 2. TLS SLL Metrics (tls_sll_box.h)
- Added atomic counters:
  - g_tls_sll_push_count_global: total pushes
  - g_tls_sll_pop_count_global: successful pops
  - g_tls_sll_pop_empty_count_global: empty list conditions
- Instrumented push/pop paths
- Added tls_sll_print_measurements() output function
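
The TLS SLL code itself is not included in the hunks on this page. The sketch below shows, under that caveat, how push, pop, and pop-on-empty events could be counted on a per-thread singly linked list. The `demo_*` names and the node layout are hypothetical, and the ENV gate is left out to keep the example short.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical intrusive node; the real block header layout is not shown. */
typedef struct demo_node { struct demo_node *next; } demo_node_t;

static __thread demo_node_t *demo_head;      /* per-thread list head */
static _Atomic uint64_t demo_push_count;     /* total pushes */
static _Atomic uint64_t demo_pop_count;      /* successful pops */
static _Atomic uint64_t demo_pop_empty_count;/* pops that found an empty list */

/* Push a node onto the calling thread's list and count it. */
static void demo_sll_push(demo_node_t *n) {
    n->next = demo_head;
    demo_head = n;
    atomic_fetch_add_explicit(&demo_push_count, 1, memory_order_relaxed);
}

/* Pop the most recently pushed node, or return NULL and count the empty case. */
static demo_node_t *demo_sll_pop(void) {
    demo_node_t *n = demo_head;
    if (!n) {
        atomic_fetch_add_explicit(&demo_pop_empty_count, 1, memory_order_relaxed);
        return NULL;
    }
    demo_head = n->next;
    atomic_fetch_add_explicit(&demo_pop_count, 1, memory_order_relaxed);
    return n;
}
```

The list head is thread-local while the counters are process-wide, which is why the counters need atomic increments even though the list itself needs no locking.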

### 3. Shared Pool Contention (hakmem_shared_pool_acquire.c)
- Added atomic counters:
  - g_sp_stage2_lock_acquired_global: Stage 2 locks
  - g_sp_stage3_lock_acquired_global: Stage 3 allocations
  - g_sp_alloc_lock_contention_global: total lock acquisitions (every acquisition is counted, not only contended ones)
- Instrumented all pthread_mutex_lock calls in hot paths
- Added shared_pool_print_measurements() output function
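
The counters above record how often a lock is taken. If the goal is to tell contended acquisitions apart from uncontended ones, one possible approach (not what this commit does) is to try the lock first and count a contention event only when that attempt fails. The `demo_*` names below are hypothetical.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t demo_lock_acquired;   /* every acquisition */
static _Atomic uint64_t demo_lock_contended;  /* acquisitions that had to wait */

/* Acquire m, recording whether the first non-blocking attempt failed. */
static void demo_counted_lock(pthread_mutex_t *m) {
    if (pthread_mutex_trylock(m) != 0) {
        /* Lock was busy (or trylock failed): count it, then block as usual. */
        atomic_fetch_add_explicit(&demo_lock_contended, 1, memory_order_relaxed);
        pthread_mutex_lock(m);
    }
    atomic_fetch_add_explicit(&demo_lock_acquired, 1, memory_order_relaxed);
}
```

The ratio of `demo_lock_contended` to `demo_lock_acquired` then approximates how often threads actually wait. The extra `pthread_mutex_trylock()` call has its own cost, so a wrapper like this would also belong behind the measurement flag.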

### 4. Benchmark Integration (bench_random_mixed.c)
- Called all 3 print functions after benchmark loop
- The print functions produce output only when HAKMEM_MEASURE_UNIFIED_CACHE=1 is set

## Design Principles

- **Near-zero overhead when disabled**: Each instrumented path pays only an inline check of a cached flag, with __builtin_expect hints
- **Atomic relaxed memory order**: Minimal synchronization overhead
- **ENV-gated**: Single flag controls all measurements
- **Production-safe**: Compiles in release builds, no functional changes
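
These principles combine into a small pattern: read the environment variable once, cache the result, and guard each relaxed atomic increment with the cached flag. The sketch below follows the flag parsing visible in the diff of this commit; `demo_flag_is_on`, `demo_measure_enabled`, `demo_count_event`, and `demo_events` are hypothetical names used only for illustration.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

static _Atomic uint64_t demo_events;   /* hypothetical counter */

/* A value enables measurement when it is non-empty and does not start with '0'. */
static int demo_flag_is_on(const char *e) {
    return (e && *e && *e != '0') ? 1 : 0;
}

/* Read the environment once; later calls only test the cached value. */
static inline int demo_measure_enabled(void) {
    static int cached = -1;
    if (__builtin_expect(cached == -1, 0))
        cached = demo_flag_is_on(getenv("HAKMEM_MEASURE_UNIFIED_CACHE"));
    return cached;
}

/* Count one event; hinted as unlikely so the disabled path stays cheap. */
static inline void demo_count_event(void) {
    if (__builtin_expect(demo_measure_enabled(), 0))
        atomic_fetch_add_explicit(&demo_events, 1, memory_order_relaxed);
}
```

Relaxed ordering is sufficient here because the counters are only summed and printed, never used to synchronize other memory. The cached flag is a plain `int`, so concurrent first calls may each read the environment, but they all store the same value.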

## Usage

```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```

Output (when enabled):
```
========================================
Unified Cache Statistics
========================================
Hits:        1234567
Misses:      56789
Hit Rate:    95.6%
Avg Refill Cycles: 1234 (est. 1.23us @ 1GHz)

========================================
TLS SLL Statistics
========================================
Total Pushes:     1234567
Total Pops:       345678
Pop Empty Count:  12345
Hit Rate:         98.8%

========================================
Shared Pool Contention Statistics
========================================
Stage 2 Locks:    123456 (33%)
Stage 3 Locks:    234567 (67%)
Total Contention: 357 locks per 1M ops
```
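
The unified cache figures follow directly from the raw counters. The helpers below restate the formulas used by `unified_cache_print_measurements()` and make the clock frequency an explicit parameter instead of the fixed 1GHz estimate. The `demo_*` names and the `ghz` parameter are assumptions for illustration.

```c
#include <stdint.h>

/* Hit rate in percent; returns 0.0 when nothing was recorded. */
static double demo_hit_rate_pct(uint64_t hits, uint64_t misses) {
    uint64_t total = hits + misses;
    return total ? (100.0 * (double)hits) / (double)total : 0.0;
}

/* Average refill cost in cycles; returns 0.0 when there were no misses. */
static double demo_avg_refill_cycles(uint64_t refill_cycles, uint64_t misses) {
    return misses ? (double)refill_cycles / (double)misses : 0.0;
}

/* Convert a cycle count to microseconds for an assumed frequency in GHz. */
static double demo_cycles_to_us(double cycles, double ghz) {
    return cycles / (ghz * 1000.0);
}
```

For the sample values, 1234567 hits and 56789 misses give 100 × 1234567 / 1291356 ≈ 95.6%, matching the Hit Rate line. On x86 the relevant frequency is that of the timestamp counter, which on current CPUs ticks at a constant rate that may differ from the core clock.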

## Next Steps

1. **Enable measurements** and run benchmarks to gather data
2. **Analyze miss rates**: Which bottleneck dominates?
3. **Profile hottest stage**: Focus optimization on top contributor
4. Possible targets:
   - Increase unified cache capacity if miss rate >5%
   - Profile if TLS SLL is unused (potential legacy code removal)
   - Analyze if Stage 2 lock can be replaced with CAS
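
As a starting point for the CAS question, the sketch below claims a free slot with a compare-and-swap instead of a mutex. The Stage 2 code is not shown in this commit, so the idea that Stage 2 claims a slot from a small shared array is purely an assumption, and all `demo_*` names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SLOTS 4

/* 0 means free; any other value is the id of the owner. */
static _Atomic uint32_t demo_slot_owner[DEMO_SLOTS];

/* Try to claim any free slot for a non-zero owner id.
 * Returns the slot index, or -1 when every slot is taken. */
static int demo_claim_slot(uint32_t owner) {
    for (int i = 0; i < DEMO_SLOTS; i++) {
        uint32_t expected = 0;
        if (atomic_compare_exchange_strong_explicit(
                &demo_slot_owner[i], &expected, owner,
                memory_order_acquire, memory_order_relaxed))
            return i;
    }
    return -1;
}

/* Release a slot so that another thread can claim it. */
static void demo_release_slot(int i) {
    atomic_store_explicit(&demo_slot_owner[i], 0, memory_order_release);
}
```

Whether this is applicable depends on what the Stage 2 lock actually protects. If it guards several fields that must change together, a single CAS is not a drop-in replacement.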

## Makefile Updates

Added core/box/tiny_route_box.o to:
- OBJS_BASE (test build)
- SHARED_OBJS (shared library)
- BENCH_HAKMEM_OBJS_BASE (benchmark)
- TINY_BENCH_OBJS_BASE (tiny benchmark)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moe Charm (CI)
2025-12-04 18:26:39 +09:00
parent d5e6ed535c
commit 860991ee50
8 changed files with 292 additions and 5 deletions

@@ -11,6 +11,40 @@
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
#include <time.h>

// ============================================================================
// Performance Measurement: Unified Cache (ENV-gated)
// ============================================================================
// Global atomic counters for unified cache performance measurement
// ENV: HAKMEM_MEASURE_UNIFIED_CACHE=1 to enable (default: OFF)
_Atomic uint64_t g_unified_cache_hits_global = 0;
_Atomic uint64_t g_unified_cache_misses_global = 0;
_Atomic uint64_t g_unified_cache_refill_cycles_global = 0;

// Helper: Get cycle count (x86_64 rdtsc)
static inline uint64_t read_tsc(void) {
#if defined(__x86_64__) || defined(_M_X64)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    // Fallback to clock_gettime for non-x86 platforms
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

// Check if measurement is enabled (cached)
static inline int unified_cache_measure_enabled(void) {
    static int g_measure = -1;
    if (__builtin_expect(g_measure == -1, 0)) {
        const char* e = getenv("HAKMEM_MEASURE_UNIFIED_CACHE");
        g_measure = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_measure;
}

// Phase 23-E: Forward declarations
extern __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES]; // From hakmem_tiny_superslab.c
@@ -294,6 +328,13 @@ static inline int unified_refill_validate_base(int class_idx,
// Returns: BASE pointer (first block, wrapped), or NULL-wrapped if failed
// Design: Direct carve from SuperSlab to array (no TLS SLL intermediate layer)
hak_base_ptr_t unified_cache_refill(int class_idx) {
    // Measure refill cost if enabled
    uint64_t start_cycles = 0;
    int measure = unified_cache_measure_enabled();
    if (measure) {
        start_cycles = read_tsc();
    }

    TinyTLSSlab* tls = &g_tls_slabs[class_idx];

    // Step 1: Ensure SuperSlab available
@@ -443,5 +484,51 @@ hak_base_ptr_t unified_cache_refill(int class_idx) {
    g_unified_cache_miss[class_idx]++;
#endif

    // Measure refill cycles
    if (measure) {
        uint64_t end_cycles = read_tsc();
        uint64_t delta = end_cycles - start_cycles;
        atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_global, delta, memory_order_relaxed);
        atomic_fetch_add_explicit(&g_unified_cache_misses_global, 1, memory_order_relaxed);
    }

    return HAK_BASE_FROM_RAW(first); // Return first block (BASE pointer)
}

// ============================================================================
// Performance Measurement: Print Statistics
// ============================================================================

void unified_cache_print_measurements(void) {
    if (!unified_cache_measure_enabled()) {
        return; // Measurement disabled, nothing to print
    }

    uint64_t hits = atomic_load_explicit(&g_unified_cache_hits_global, memory_order_relaxed);
    uint64_t misses = atomic_load_explicit(&g_unified_cache_misses_global, memory_order_relaxed);
    uint64_t refill_cycles = atomic_load_explicit(&g_unified_cache_refill_cycles_global, memory_order_relaxed);
    uint64_t total = hits + misses;

    if (total == 0) {
        fprintf(stderr, "\n========================================\n");
        fprintf(stderr, "Unified Cache Statistics\n");
        fprintf(stderr, "========================================\n");
        fprintf(stderr, "No operations recorded (measurement may be disabled)\n");
        fprintf(stderr, "========================================\n\n");
        return;
    }

    double hit_rate = (100.0 * hits) / total;
    double avg_refill_cycles = misses > 0 ? (double)refill_cycles / misses : 0.0;
    // Estimate time at 1GHz (conservative, most modern CPUs are 2-4GHz)
    double avg_refill_us = avg_refill_cycles / 1000.0;

    fprintf(stderr, "\n========================================\n");
    fprintf(stderr, "Unified Cache Statistics\n");
    fprintf(stderr, "========================================\n");
    fprintf(stderr, "Hits: %llu\n", (unsigned long long)hits);
    fprintf(stderr, "Misses: %llu\n", (unsigned long long)misses);
    fprintf(stderr, "Hit Rate: %.1f%%\n", hit_rate);
    fprintf(stderr, "Avg Refill Cycles: %.0f (est. %.2fus @ 1GHz)\n", avg_refill_cycles, avg_refill_us);
    fprintf(stderr, "========================================\n\n");
}