Performance Measurement Framework: Unified Cache, TLS SLL, Shared Pool Analysis

## Summary

Implemented production-grade measurement infrastructure to quantify top 3 bottlenecks:
- Unified cache hit/miss rates + refill cost
- TLS SLL usage patterns
- Shared pool lock contention distribution

## Changes

### 1. Unified Cache Metrics (tiny_unified_cache.h/c)
- Added atomic counters:
  - g_unified_cache_hits_global: successful cache pops
  - g_unified_cache_misses_global: refill triggers
  - g_unified_cache_refill_cycles_global: refill cost in CPU cycles (rdtsc)
- Instrumented `unified_cache_pop_or_refill()` to count hits
- Instrumented `unified_cache_refill()` with cycle measurement
- ENV-gated: HAKMEM_MEASURE_UNIFIED_CACHE=1 (default: off)
- Added unified_cache_print_measurements() output function

### 2. TLS SLL Metrics (tls_sll_box.h)
- Added atomic counters:
  - g_tls_sll_push_count_global: total pushes
  - g_tls_sll_pop_count_global: successful pops
  - g_tls_sll_pop_empty_count_global: empty list conditions
- Instrumented push/pop paths
- Added tls_sll_print_measurements() output function

### 3. Shared Pool Contention (hakmem_shared_pool_acquire.c)
- Added atomic counters:
  - g_sp_stage2_lock_acquired_global: Stage 2 locks
  - g_sp_stage3_lock_acquired_global: Stage 3 allocations
  - g_sp_alloc_lock_contention_global: total lock acquisitions
- Instrumented all pthread_mutex_lock calls in hot paths
- Added shared_pool_print_measurements() output function

### 4. Benchmark Integration (bench_random_mixed.c)
- Called all 3 print functions after benchmark loop
- Functions active only when HAKMEM_MEASURE_UNIFIED_CACHE=1 set

## Design Principles

- **Zero overhead when disabled**: Inline checks with __builtin_expect hints
- **Atomic relaxed memory order**: Minimal synchronization overhead
- **ENV-gated**: Single flag controls all measurements
- **Production-safe**: Compiles in release builds, no functional changes

## Usage

```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```

Output (when enabled):
```
========================================
Unified Cache Statistics
========================================
Hits:        1234567
Misses:      56789
Hit Rate:    95.6%
Avg Refill Cycles: 1234

========================================
TLS SLL Statistics
========================================
Total Pushes:     1234567
Total Pops:       345678
Pop Empty Count:  12345
Hit Rate:         98.8%

========================================
Shared Pool Contention Statistics
========================================
Stage 2 Locks:    123456 (33%)
Stage 3 Locks:    234567 (67%)
Total Contention: 357 locks per 1M ops
```

## Next Steps

1. **Enable measurements** and run benchmarks to gather data
2. **Analyze miss rates**: Which bottleneck dominates?
3. **Profile hottest stage**: Focus optimization on top contributor
4. Possible targets:
   - Increase unified cache capacity if miss rate >5%
   - Profile if TLS SLL is unused (potential legacy code removal)
   - Analyze if Stage 2 lock can be replaced with CAS

## Makefile Updates

Added core/box/tiny_route_box.o to:
- OBJS_BASE (test build)
- SHARED_OBJS (shared library)
- BENCH_HAKMEM_OBJS_BASE (benchmark)
- TINY_BENCH_OBJS_BASE (tiny benchmark)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
Moe Charm (CI)
2025-12-04 18:26:39 +09:00
parent d5e6ed535c
commit 860991ee50
8 changed files with 292 additions and 5 deletions

View File

@ -25,11 +25,34 @@
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdatomic.h>
#include "../hakmem_build_flags.h"
#include "../hakmem_tiny_config.h" // For TINY_NUM_CLASSES
#include "../box/ptr_type_box.h" // Phantom pointer types (BASE/USER)
#include "../box/tiny_front_config_box.h" // Phase 8-Step1: Config macros
// ============================================================================
// Performance Measurement: Unified Cache (ENV-gated)
// ============================================================================
// Global atomic counters for production performance measurement
// ENV: HAKMEM_MEASURE_UNIFIED_CACHE=1 to enable (default: OFF)
extern _Atomic uint64_t g_unified_cache_hits_global;
extern _Atomic uint64_t g_unified_cache_misses_global;
extern _Atomic uint64_t g_unified_cache_refill_cycles_global;
// Print statistics function
void unified_cache_print_measurements(void);
// Check if measurement is enabled (inline for hot path)
static inline int unified_cache_measure_check(void) {
static int g_measure = -1;
if (__builtin_expect(g_measure == -1, 0)) {
const char* e = getenv("HAKMEM_MEASURE_UNIFIED_CACHE");
g_measure = (e && *e && *e != '0') ? 1 : 0;
}
return g_measure;
}
// ============================================================================
// Unified Cache Structure (per class)
// ============================================================================
@ -242,6 +265,10 @@ static inline hak_base_ptr_t unified_cache_pop_or_refill(int class_idx) {
#if !HAKMEM_BUILD_RELEASE
g_unified_cache_hit[class_idx]++;
#endif
// Performance measurement: count cache hits
if (__builtin_expect(unified_cache_measure_check(), 0)) {
atomic_fetch_add_explicit(&g_unified_cache_hits_global, 1, memory_order_relaxed);
}
return HAK_BASE_FROM_RAW(base); // Hit! (2-3 cache misses total)
}