hakmem/archive/analysis/RING_SIZE_SOLUTION.md

# Solution: Separate Ring Sizes Per Pool

## Problem Summary

`POOL_TLS_RING_CAP` currently controls the ring size for BOTH the L2 and L2.5 pools, so a single value has to serve workloads with opposite needs:
- **mid_large_mt** (8-32KB) uses L2 Pool → benefits from Ring=64
- **random_mixed** (8-128B) uses Tiny Pool → hurt by L2's TLS growth

**Root cause:** L2 Pool TLS grows from 980B → 3,668B (Ring 16→64), evicting Tiny Pool data from L1 cache.

## Solution: Per-Pool Ring Sizes

**Target configuration:**
- L2 Pool: Ring=48 (balanced performance + cache fit)
- L2.5 Pool: Ring=16 (unchanged, optimal for large allocs)
- Tiny Pool: No ring (uses freelist, unchanged)

**Expected outcome:**
- mid_large_mt: +2.1% vs baseline (36.04M → 36.8M ops/s)
- random_mixed: ±0% (22.5M maintained)
- TLS memory: -33% vs Ring=64 (5.0KB → 3.4KB)

---

## Implementation Steps

### Step 1: Modify L2 Pool (hakmem_pool.c)

Replace `POOL_TLS_RING_CAP` with `POOL_L2_RING_CAP`:

```c
// Lines 77-79 (current):
#ifndef POOL_TLS_RING_CAP
#define POOL_TLS_RING_CAP 64  // QW1-adjusted: Moderate increase
#endif

// Change to:
#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48  // Optimized for mid-size allocations (2-32KB)
#endif

// Line 80:
typedef struct { PoolBlock* items[POOL_TLS_RING_CAP]; int top; } PoolTLSRing;

// Change to:
typedef struct { PoolBlock* items[POOL_L2_RING_CAP]; int top; } PoolTLSRing;
```

**Then replace ALL occurrences** of `POOL_TLS_RING_CAP` → `POOL_L2_RING_CAP` in:
- Lines 265, 1721, 1954, 2146, 2173, 2174, 2265, 2266, 2319, 2397

**Command:**
```bash
sed -i 's/POOL_TLS_RING_CAP/POOL_L2_RING_CAP/g' core/hakmem_pool.c
```

### Step 2: Modify L2.5 Pool (hakmem_l25_pool.c)

Replace `POOL_TLS_RING_CAP` with `POOL_L25_RING_CAP`:

```c
// Lines 75-77 (current):
#ifndef POOL_TLS_RING_CAP
#define POOL_TLS_RING_CAP 16
#endif

// Change to:
#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16  // Optimized for large allocations (64KB-1MB)
#endif

// Line 78:
typedef struct { L25Block* items[POOL_TLS_RING_CAP]; int top; } L25TLSRing;

// Change to:
typedef struct { L25Block* items[POOL_L25_RING_CAP]; int top; } L25TLSRing;
```

**Then replace ALL occurrences** of `POOL_TLS_RING_CAP` → `POOL_L25_RING_CAP`:

**Command:**
```bash
sed -i 's/POOL_TLS_RING_CAP/POOL_L25_RING_CAP/g' core/hakmem_l25_pool.c
```

### Step 3: Update Makefile

Update build flags to expose separate ring sizes:

```makefile
# Line 12 (current):
CFLAGS_SHARED = ... -DPOOL_TLS_RING_CAP=$(RING_CAP) ...

# Change to:
CFLAGS_SHARED = ... -DPOOL_L2_RING_CAP=$(L2_RING) -DPOOL_L25_RING_CAP=$(L25_RING) ...

# Add default values:
L2_RING ?= 48
L25_RING ?= 16
```

**Full definition (defaults plus the complete `CFLAGS_SHARED` line):**
```makefile
L2_RING ?= 48
L25_RING ?= 16
CFLAGS_SHARED = -O3 -march=native -mtune=native -Wall -Wextra -std=c11 -D_GNU_SOURCE -D_POSIX_C_SOURCE=199309L -D_GLIBC_USE_ISOC2X=0 -D__isoc23_strtol=strtol -D__isoc23_strtoll=strtoll -D__isoc23_strtoul=strtoul -D__isoc23_strtoull=strtoull -DHAKMEM_DEBUG_TIMING=$(HAKMEM_TIMING) -fPIC -DPOOL_L2_RING_CAP=$(L2_RING) -DPOOL_L25_RING_CAP=$(L25_RING) -ffast-math -funroll-loops -flto -fno-semantic-interposition -fno-plt -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -I core
```
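
Once the Makefile passes the two new macros, a build script that still passes `-DPOOL_TLS_RING_CAP=...` would compile cleanly and silently have no effect. A compile-time guard can surface that mistake. The block below is a minimal sketch, not existing hakmem code: the upper bound of 1024 is an illustrative assumption, and the fallback defaults simply repeat the values from Steps 1 and 2.

```c
#include <assert.h>  /* static_assert (C11) */

/* Hypothetical guard: fail the build if the retired macro is still supplied. */
#ifdef POOL_TLS_RING_CAP
#error "POOL_TLS_RING_CAP is retired: pass POOL_L2_RING_CAP and POOL_L25_RING_CAP"
#endif

/* Defaults repeated from Steps 1 and 2 so this fragment compiles on its own. */
#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48
#endif
#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16
#endif

/* Assumed sanity range; pick bounds that suit the real TLS budget. */
static_assert(POOL_L2_RING_CAP >= 1 && POOL_L2_RING_CAP <= 1024,
              "POOL_L2_RING_CAP out of range");
static_assert(POOL_L25_RING_CAP >= 1 && POOL_L25_RING_CAP <= 1024,
              "POOL_L25_RING_CAP out of range");
```

If adopted, each guard would sit next to the corresponding `#ifndef` block in its own source file, so that `make L2_RING=0` or a stale flag fails at compile time rather than at benchmark time.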

### Step 4: Add Documentation Comments

Add to `core/hakmem_pool.c` (after line 78):

```c
// POOL_L2_RING_CAP: TLS ring buffer capacity for L2 Pool (2-32KB allocations)
// - Default: 48 (balanced performance + L1 cache fit)
// - Larger values (64+): Better for high-contention mid-size workloads
//   but increases TLS footprint (may evict other pools from L1 cache)
// - Smaller values (16-32): Lower TLS memory, better for mixed workloads
// - Memory per thread: 7 classes × (CAP×8 + 12) bytes
//   Ring=48: 7 × 396 = 2,772 bytes (~44 cache lines)
```
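
The footprint formula in the comment above can be checked with a few lines of C. This is a minimal sketch: `ring_tls_bytes` and `cache_lines` are hypothetical helper names, not part of hakmem, and the 64-byte cache line is an assumption implied by the "~44 cache lines" figure.

```c
#include <assert.h>
#include <stddef.h>

/* Per-thread ring footprint using the formula quoted above:
 * classes x (cap x 8 + 12) bytes. The 8 is the pointer size on a 64-bit
 * target; the 12 is the per-ring overhead stated in this document. */
static size_t ring_tls_bytes(size_t classes, size_t cap) {
    return classes * (cap * 8u + 12u);
}

/* Number of cache lines needed to hold `bytes`, rounded up. */
static size_t cache_lines(size_t bytes, size_t line_size) {
    assert(line_size > 0);
    return (bytes + line_size - 1) / line_size;
}
```

With these helpers, `ring_tls_bytes(7, 16)`, `ring_tls_bytes(7, 48)` and `ring_tls_bytes(7, 64)` give 980, 2,772 and 3,668 bytes, matching the figures quoted in this document, and `cache_lines(2772, 64)` gives 44.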

Add to `core/hakmem_l25_pool.c` (after line 76):

```c
// POOL_L25_RING_CAP: TLS ring buffer capacity for L2.5 Pool (64KB-1MB allocations)
// - Default: 16 (optimal for large, less-frequent allocations)
// - Memory per thread: 5 classes × 148 bytes = 740 bytes (~12 cache lines)
```

---

## Testing Plan

### Test 1: Baseline Validation (Ring=16)

```bash
make clean
make L2_RING=16 L25_RING=16 bench_mid_large_mt bench_random_mixed

echo "=== Baseline Ring=16 ===" | tee baseline.txt
./bench_mid_large_mt 2 40000 128 | tee -a baseline.txt
./bench_random_mixed 200000 400 | tee -a baseline.txt
```

**Expected:**
- mid_large_mt: ~36.04M ops/s
- random_mixed: ~22.5M ops/s

### Test 2: Sweep L2 Ring Size (L2.5 fixed at 16)

```bash
rm -f sweep_results.txt
for RING in 24 32 40 48 56 64; do
    echo "=== Testing L2_RING=$RING ===" | tee -a sweep_results.txt
    make clean
    make L2_RING=$RING L25_RING=16 bench_mid_large_mt bench_random_mixed
    
    echo "mid_large_mt:" | tee -a sweep_results.txt
    ./bench_mid_large_mt 2 40000 128 | tee -a sweep_results.txt
    
    echo "random_mixed:" | tee -a sweep_results.txt
    ./bench_random_mixed 200000 400 | tee -a sweep_results.txt
    echo "" | tee -a sweep_results.txt
done
```

### Test 3: Validate Optimal Configuration (L2=48)

```bash
make clean
make L2_RING=48 L25_RING=16 bench_mid_large_mt bench_random_mixed

echo "=== Optimal L2=48, L25=16 ===" | tee optimal.txt
./bench_mid_large_mt 2 40000 128 | tee -a optimal.txt
./bench_random_mixed 200000 400 | tee -a optimal.txt
```

**Target:**
- mid_large_mt: ≥36.5M ops/s (+1.3% vs baseline)
- random_mixed: ≥22.4M ops/s (within ±1% of baseline)

### Test 4: Full Benchmark Suite

```bash
# Build with optimal config
make clean
make L2_RING=48 L25_RING=16

# Run comprehensive suite
./scripts/run_bench_suite.sh 2>&1 | tee full_suite.txt

# Check for regressions
grep -E "ops/sec|Throughput" full_suite.txt
```

---

## Expected Performance Matrix

| Configuration | mid_large_mt (ops/s) | random_mixed (ops/s) | Average (ops/s) | TLS (KB) | L1 Cache % (of 32 KB) |
|---------------|--------------|--------------|---------|----------|------------|
| Ring=16 (baseline) | 36.04M | 22.5M | 29.27M | 2.36 | 7.4% |
| Ring=64 (current) | 37.22M | 21.29M | 29.26M | 5.05 | 15.8% |
| **L2=48, L25=16** | **36.8M** | **22.5M** | **29.65M** | **3.4** | **10.6%** |

**Change vs Ring=64:**
- mid_large_mt: -1.1% (acceptable trade-off)
- random_mixed: **+5.7%** (recovered performance)
- Average: **+1.3%**
- TLS footprint: **-33%**

**Gains vs Ring=16:**
- mid_large_mt: **+2.1%**
- random_mixed: ±0%
- Average: **+1.3%**
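
The percentages above follow directly from the matrix and can be re-derived when the sweep in Test 2 produces new numbers. A small sketch, where `pct_change` is a hypothetical helper name:

```c
#include <assert.h>

/* Relative change from `from` to `to`, expressed in percent. */
static double pct_change(double from, double to) {
    assert(from != 0.0);
    return (to - from) / from * 100.0;
}
```

For the projected values, `pct_change(36.04, 36.8)` is about +2.1, `pct_change(37.22, 36.8)` about -1.1, `pct_change(21.29, 22.5)` about +5.7, and `pct_change(5.05, 3.4)` about -33.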

---

## Rollback Plan

If performance regresses unexpectedly:

```bash
# Revert to Ring=64 (current)
make clean
make L2_RING=64 L25_RING=16

# Or revert to uniform Ring=16 (safe baseline)
make clean
make L2_RING=16 L25_RING=16
```

---

## Future Enhancements

### 1. Per-Size-Class Ring Tuning

```c
static const int g_l2_ring_caps[POOL_NUM_CLASSES] = {
    24,  // 2KB   (hot, minimal TLS)
    32,  // 4KB   (hot, moderate TLS)
    48,  // 8KB   (warm, larger TLS)
    64,  // 16KB  (warm, largest TLS)
    64,  // 32KB  (cold, largest TLS)
    32,  // 40KB  (bridge)
    24,  // 52KB  (bridge)
};
```

**Benefit:** Targeted optimization per size class (estimated +2-3% additional gain).
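
One way the per-class table could be consulted is sketched below. Everything here is an assumption for illustration: the type `L2RingSketch`, the helpers `l2_ring_push` and `l2_ring_pop`, the stack-style use of `top`, and the choice to size storage for the largest cap. None of it is taken from the hakmem sources.

```c
#include <assert.h>
#include <stddef.h>

#define L2_NUM_CLASSES 7   /* assumption: the 7 entries listed above */
#define L2_RING_MAX    64  /* assumption: storage sized for the largest cap */

typedef struct { void* items[L2_RING_MAX]; int top; } L2RingSketch;

static const int l2_ring_caps[L2_NUM_CLASSES] = { 24, 32, 48, 64, 64, 32, 24 };

/* Push a block onto the ring for `class_idx`.
 * Returns 1 on success, 0 when that class's cap has been reached. */
static int l2_ring_push(L2RingSketch* r, int class_idx, void* block) {
    assert(class_idx >= 0 && class_idx < L2_NUM_CLASSES);
    if (r->top >= l2_ring_caps[class_idx]) return 0;
    r->items[r->top++] = block;
    return 1;
}

/* Pop the most recently pushed block, or NULL when the ring is empty. */
static void* l2_ring_pop(L2RingSketch* r) {
    return r->top > 0 ? r->items[--r->top] : NULL;
}
```

Note the trade-off in this layout: because every ring still reserves `L2_RING_MAX` slots, the reserved TLS size does not shrink; only the portion actually touched by each class does. Shrinking the reserved size would require per-class storage of different lengths.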

### 2. Runtime Adaptive Sizing

```c
// Environment variables:
// HAKMEM_L2_RING_CAP=48
// HAKMEM_L25_RING_CAP=16
```

**Benefit:** A/B testing without rebuild.
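
Reading those variables safely means rejecting garbage and clamping to whatever storage was compiled in. The following is a minimal sketch under that assumption; `parse_ring_cap` and `ring_cap_from_env` are hypothetical names, and `max_cap` stands for the compile-time array size.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a ring capacity from text. Falls back when the text is missing,
 * not a plain decimal number, or outside [1, max_cap]. */
static int parse_ring_cap(const char* text, int fallback, int max_cap) {
    assert(fallback >= 1 && fallback <= max_cap);
    if (text == NULL || *text == '\0') return fallback;
    char* end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || v < 1 || v > max_cap) return fallback;
    return (int)v;
}

/* Look up an environment variable such as "HAKMEM_L2_RING_CAP". */
static int ring_cap_from_env(const char* name, int fallback, int max_cap) {
    return parse_ring_cap(getenv(name), fallback, max_cap);
}
```

A call such as `ring_cap_from_env("HAKMEM_L2_RING_CAP", 48, 64)` would then be made once at pool initialization, with the result used as a soft limit below the compiled array size.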

### 3. Dynamic Ring Adjustment

Monitor ring hit rate and adjust capacity at runtime based on workload.

**Benefit:** Optimal performance for changing workloads.
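
A rough sketch of what such an adjustment loop might look like is given below. The struct `RingStats`, the function `ring_adapt`, the step of 8 slots, and the 80% / 95% thresholds are all illustrative assumptions; real values would have to come from the sweep in the Testing Plan.

```c
#include <assert.h>

typedef struct {
    unsigned hits;    /* allocations served from the ring */
    unsigned misses;  /* allocations that found the ring empty */
    int cap;          /* current soft capacity */
} RingStats;

/* Re-evaluate the soft capacity after one sampling window, then reset
 * the counters. Thresholds and step size are placeholders. */
static void ring_adapt(RingStats* s, int min_cap, int max_cap) {
    assert(min_cap >= 1 && min_cap <= max_cap);
    unsigned long long total = (unsigned long long)s->hits + s->misses;
    if (total == 0) return;  /* nothing observed in this window */
    unsigned hit_pct = (unsigned)((100ull * s->hits) / total);
    if (hit_pct < 80 && s->cap + 8 <= max_cap) {
        s->cap += 8;   /* ring runs dry often: allow it to hold more */
    } else if (hit_pct > 95 && s->cap - 8 >= min_cap) {
        s->cap -= 8;   /* ring is rarely empty: give TLS back */
    }
    s->hits = 0;
    s->misses = 0;
}
```

Keeping `cap` within `[min_cap, max_cap]` means the adjustment never exceeds the compiled storage, so it could coexist with the compile-time `POOL_L2_RING_CAP` as an upper bound.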

---

## Success Criteria

1. **mid_large_mt:** ≥36.5M ops/s (+1.3% vs baseline)
2. **random_mixed:** ≥22.4M ops/s (within ±1% of baseline)
3. **No regressions** in full benchmark suite
4. **TLS memory:** ≤3.5 KB per thread

## Timeline

- **Steps 1-3:** 30 minutes (code changes)
- **Testing:** 2-3 hours (sweep + validation)
- **Documentation:** 30 minutes
- **Total:** ~4 hours

2025-11-05 12:31:14 +09:00