# Solution: Separate Ring Sizes Per Pool
## Problem Summary
`POOL_TLS_RING_CAP` currently controls the ring size for BOTH the L2 and L2.5 pools, and raising it affects the two key benchmarks in opposite ways:
- **mid_large_mt** (8-32KB) uses L2 Pool → benefits from Ring=64
- **random_mixed** (8-128B) uses Tiny Pool → hurt by L2's TLS growth
**Root cause:** L2 Pool TLS grows from 980B → 3,668B (Ring 16→64), evicting Tiny Pool data from L1 cache.
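The byte counts in the root cause can be reproduced from the per-thread formula quoted in Step 4 (7 classes × (CAP×8 + 12) bytes). The sketch below is only that arithmetic. The helper names and the 64-byte cache line are assumptions for illustration, not identifiers from hakmem.

```c
#include <assert.h>

/* Constants taken from this document's formula: 7 classes x (CAP*8 + 12) bytes.
 * The 64-byte cache line is an assumption (typical for x86-64). */
enum {
    L2_NUM_CLASSES      = 7,
    RING_SLOT_BYTES     = 8,
    RING_OVERHEAD_BYTES = 12,
    CACHE_LINE_BYTES    = 64
};

/* Per-thread TLS bytes used by the L2 Pool rings for a given ring capacity. */
static unsigned l2_ring_tls_bytes(unsigned ring_cap) {
    assert(ring_cap > 0);
    return L2_NUM_CLASSES * (ring_cap * RING_SLOT_BYTES + RING_OVERHEAD_BYTES);
}

/* Number of cache lines needed to hold `bytes`, rounded up. */
static unsigned cache_lines_for(unsigned bytes) {
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

Under these assumptions, `l2_ring_tls_bytes(16)` is 980, `l2_ring_tls_bytes(64)` is 3,668, and `l2_ring_tls_bytes(48)` is 2,772 (44 cache lines), matching the figures used in this document.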
## Solution: Per-Pool Ring Sizes
**Target configuration:**
- L2 Pool: Ring=48 (balanced performance + cache fit)
- L2.5 Pool: Ring=16 (unchanged, optimal for large allocs)
- Tiny Pool: No ring (uses freelist, unchanged)
**Expected outcome:**
- mid_large_mt: +2.1% vs baseline (36.04M → 36.8M ops/s)
- random_mixed: ±0% (22.5M maintained)
- TLS memory: -33% vs Ring=64 (5.0KB → 3.4KB)
---
## Implementation Steps
### Step 1: Modify L2 Pool (hakmem_pool.c)
Replace `POOL_TLS_RING_CAP` with `POOL_L2_RING_CAP`:
```c
// Line 77-78 (current):
#ifndef POOL_TLS_RING_CAP
#define POOL_TLS_RING_CAP 64 // QW1-adjusted: Moderate increase
// Change to:
#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48 // Optimized for mid-size allocations (2-32KB)
#endif
// Line 80:
typedef struct { PoolBlock* items[POOL_TLS_RING_CAP]; int top; } PoolTLSRing;
// Change to:
typedef struct { PoolBlock* items[POOL_L2_RING_CAP]; int top; } PoolTLSRing;
```
**Then replace ALL occurrences** of `POOL_TLS_RING_CAP` → `POOL_L2_RING_CAP` in:
- Lines 265, 1721, 1954, 2146, 2173, 2174, 2265, 2266, 2319, 2397
**Command:**
```bash
sed -i 's/POOL_TLS_RING_CAP/POOL_L2_RING_CAP/g' core/hakmem_pool.c
```
### Step 2: Modify L2.5 Pool (hakmem_l25_pool.c)
Replace `POOL_TLS_RING_CAP` with `POOL_L25_RING_CAP`:
```c
// Line 75-76 (current):
#ifndef POOL_TLS_RING_CAP
#define POOL_TLS_RING_CAP 16
// Change to:
#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16 // Optimized for large allocations (64KB-1MB)
#endif
// Line 78:
typedef struct { L25Block* items[POOL_TLS_RING_CAP]; int top; } L25TLSRing;
// Change to:
typedef struct { L25Block* items[POOL_L25_RING_CAP]; int top; } L25TLSRing;
```
**Then replace ALL occurrences** of `POOL_TLS_RING_CAP` → `POOL_L25_RING_CAP` in the rest of the file.
**Command:**
```bash
sed -i 's/POOL_TLS_RING_CAP/POOL_L25_RING_CAP/g' core/hakmem_l25_pool.c
```
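After the rename, a stale `-DPOOL_TLS_RING_CAP=...` left on a build command line would no longer affect either pool, and nothing would report it. One possible safeguard, sketched here as an assumption rather than existing hakmem code, is a compile-time check placed near the new defines. The 1..256 bounds are hypothetical, and in practice each source file would carry only the check for its own macro.

```c
#include <assert.h>  /* provides static_assert in C11 */

/* Fail the build if the old, shared macro is still being passed in. */
#ifdef POOL_TLS_RING_CAP
#error "POOL_TLS_RING_CAP is obsolete; use POOL_L2_RING_CAP or POOL_L25_RING_CAP"
#endif

#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48
#endif

#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16
#endif

/* Hypothetical sanity bounds; adjust to whatever range the pools support. */
static_assert(POOL_L2_RING_CAP >= 1 && POOL_L2_RING_CAP <= 256,
              "POOL_L2_RING_CAP out of range");
static_assert(POOL_L25_RING_CAP >= 1 && POOL_L25_RING_CAP <= 256,
              "POOL_L25_RING_CAP out of range");
```

With this in place, a command such as `make CFLAGS+=-DPOOL_TLS_RING_CAP=64` would stop at the `#error` instead of building with the defaults.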
### Step 3: Update Makefile
Update build flags to expose separate ring sizes:
```makefile
# Line 12 (current):
CFLAGS_SHARED = ... -DPOOL_TLS_RING_CAP=$(RING_CAP) ...
# Change to:
CFLAGS_SHARED = ... -DPOOL_L2_RING_CAP=$(L2_RING) -DPOOL_L25_RING_CAP=$(L25_RING) ...
# Add default values:
L2_RING ?= 48
L25_RING ?= 16
```
**Full line:**
```makefile
L2_RING ?= 48
L25_RING ?= 16
CFLAGS_SHARED = -O3 -march=native -mtune=native -Wall -Wextra -std=c11 -D_GNU_SOURCE -D_POSIX_C_SOURCE=199309L -D_GLIBC_USE_ISOC2X=0 -D__isoc23_strtol=strtol -D__isoc23_strtoll=strtoll -D__isoc23_strtoul=strtoul -D__isoc23_strtoull=strtoull -DHAKMEM_DEBUG_TIMING=$(HAKMEM_TIMING) -fPIC -DPOOL_L2_RING_CAP=$(L2_RING) -DPOOL_L25_RING_CAP=$(L25_RING) -ffast-math -funroll-loops -flto -fno-semantic-interposition -fno-plt -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -I core
```
### Step 4: Add Documentation Comments
Add to `core/hakmem_pool.c` (after line 78):
```c
// POOL_L2_RING_CAP: TLS ring buffer capacity for L2 Pool (2-32KB allocations)
// - Default: 48 (balanced performance + L1 cache fit)
// - Larger values (64+): Better for high-contention mid-size workloads
// but increases TLS footprint (may evict other pools from L1 cache)
// - Smaller values (16-32): Lower TLS memory, better for mixed workloads
// - Memory per thread: 7 classes × (CAP×8 + 12) bytes
// Ring=48: 7 × 396 = 2,772 bytes (~44 cache lines)
```
Add to `core/hakmem_l25_pool.c` (after line 76):
```c
// POOL_L25_RING_CAP: TLS ring buffer capacity for L2.5 Pool (64KB-1MB allocations)
// - Default: 16 (optimal for large, less-frequent allocations)
// - Memory per thread: 5 classes × 148 bytes = 740 bytes (~12 cache lines)
```
---
## Testing Plan
### Test 1: Baseline Validation (Ring=16)
```bash
make clean
make L2_RING=16 L25_RING=16 bench_mid_large_mt bench_random_mixed
echo "=== Baseline Ring=16 ===" | tee baseline.txt
./bench_mid_large_mt 2 40000 128 | tee -a baseline.txt
./bench_random_mixed 200000 400 | tee -a baseline.txt
```
**Expected:**
- mid_large_mt: ~36.04M ops/s
- random_mixed: ~22.5M ops/s
### Test 2: Sweep L2 Ring Size (L2.5 fixed at 16)
```bash
rm -f sweep_results.txt
for RING in 24 32 40 48 56 64; do
echo "=== Testing L2_RING=$RING ===" | tee -a sweep_results.txt
make clean
make L2_RING=$RING L25_RING=16 bench_mid_large_mt bench_random_mixed
echo "mid_large_mt:" | tee -a sweep_results.txt
./bench_mid_large_mt 2 40000 128 | tee -a sweep_results.txt
echo "random_mixed:" | tee -a sweep_results.txt
./bench_random_mixed 200000 400 | tee -a sweep_results.txt
echo "" | tee -a sweep_results.txt
done
```
### Test 3: Validate Optimal Configuration (L2=48)
```bash
make clean
make L2_RING=48 L25_RING=16 bench_mid_large_mt bench_random_mixed
echo "=== Optimal L2=48, L25=16 ===" | tee optimal.txt
./bench_mid_large_mt 2 40000 128 | tee -a optimal.txt
./bench_random_mixed 200000 400 | tee -a optimal.txt
```
**Target:**
- mid_large_mt: ≥36.5M ops/s (+1.3% vs baseline)
- random_mixed: ≥22.4M ops/s (within ±1% of baseline)
### Test 4: Full Benchmark Suite
```bash
# Build with optimal config
make clean
make L2_RING=48 L25_RING=16
# Run comprehensive suite
./scripts/run_bench_suite.sh 2>&1 | tee full_suite.txt
# Check for regressions
grep -E "ops/sec|Throughput" full_suite.txt
```
---
## Expected Performance Matrix
| Configuration | mid_large_mt | random_mixed | Average | TLS (KB) | L1 Cache % |
|---------------|--------------|--------------|---------|----------|------------|
| Ring=16 (baseline) | 36.04M | 22.5M | 29.27M | 2.36 | 7.4% |
| Ring=64 (current) | 37.22M | 21.29M | 29.26M | 5.05 | 15.8% |
| **L2=48, L25=16** | **36.8M** | **22.5M** | **29.65M** | **3.4** | **10.6%** |
**Gains vs Ring=64:**
- mid_large_mt: -1.1% (acceptable trade-off)
- random_mixed: **+5.7%** (recovered performance)
- Average: **+1.3%**
- TLS footprint: **-33%**
**Gains vs Ring=16:**
- mid_large_mt: **+2.1%**
- random_mixed: ±0%
- Average: **+1.3%**
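The percentages in the two lists above are plain relative changes computed from the rows of the matrix. A minimal sketch of that arithmetic follows; the helper names are hypothetical.

```c
#include <assert.h>

/* Relative change from `from` to `to`, in percent. */
static double pct_change(double from, double to) {
    assert(from > 0.0);
    return (to - from) / from * 100.0;
}

/* Arithmetic mean of the two benchmark scores (the "Average" column). */
static double avg2(double a, double b) {
    return (a + b) / 2.0;
}
```

For example, `pct_change(37.22, 36.8)` is about -1.1 and `pct_change(21.29, 22.5)` is about +5.7, which are the mid_large_mt and random_mixed entries in the comparison against Ring=64.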
---
## Rollback Plan
If performance regresses unexpectedly:
```bash
# Revert to Ring=64 (current)
make clean
make L2_RING=64 L25_RING=16
# Or revert to uniform Ring=16 (safe baseline)
make clean
make L2_RING=16 L25_RING=16
```
---
## Future Enhancements
### 1. Per-Size-Class Ring Tuning
```c
static const int g_l2_ring_caps[POOL_NUM_CLASSES] = {
24, // 2KB (hot, minimal TLS)
32, // 4KB (hot, moderate TLS)
48, // 8KB (warm, larger TLS)
64, // 16KB (warm, largest TLS)
64, // 32KB (cold, largest TLS)
32, // 40KB (bridge)
24, // 52KB (bridge)
};
```
**Benefit:** Targeted optimization per size class (estimated +2-3% additional gain).
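Because the ring array is sized at compile time, per-class caps would most likely act as a logical limit inside storage sized for the largest cap. The sketch below shows one way that could look. `POOL_L2_RING_MAX`, `ring_push`, and `ring_pop` are hypothetical names, and `PoolBlock` is left opaque because its definition is not shown in this document.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_NUM_CLASSES 7   /* assumed: the 7 L2 classes used in this document */
#define POOL_L2_RING_MAX 64  /* hypothetical: storage sized for the largest cap */

typedef struct PoolBlock PoolBlock;  /* opaque here; the real type lives in hakmem_pool.c */

typedef struct { PoolBlock* items[POOL_L2_RING_MAX]; int top; } PoolTLSRing;

static const int g_l2_ring_caps[POOL_NUM_CLASSES] = { 24, 32, 48, 64, 64, 32, 24 };

/* Push a block; returns 1 on success, 0 once the class's logical cap is reached. */
static int ring_push(PoolTLSRing* r, int class_idx, PoolBlock* b) {
    assert(class_idx >= 0 && class_idx < POOL_NUM_CLASSES);
    if (r->top >= g_l2_ring_caps[class_idx]) return 0;
    r->items[r->top++] = b;
    return 1;
}

/* Pop the most recently pushed block, or NULL if the ring is empty. */
static PoolBlock* ring_pop(PoolTLSRing* r) {
    return (r->top > 0) ? r->items[--r->top] : NULL;
}
```

In this layout the reserved TLS size stays at the maximum, but only the first `cap` slots of each ring are ever touched, so the hot cache footprint still shrinks for classes with small caps.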
### 2. Runtime Adaptive Sizing
```c
// Environment variables:
// HAKMEM_L2_RING_CAP=48
// HAKMEM_L25_RING_CAP=16
```
**Benefit:** A/B testing without rebuild.
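A minimal sketch of how those environment variables might be read at startup, assuming standard `getenv` and `strtol`. The function names, the fallback behaviour, and the clamping range are all assumptions, not existing hakmem code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a ring capacity from text. Returns `def` for missing or malformed
 * input, otherwise the value clamped to [min_cap, max_cap]. */
static int parse_ring_cap(const char* text, int def, int min_cap, int max_cap) {
    assert(min_cap <= max_cap);
    if (text == NULL || *text == '\0') return def;
    char* end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || end == text || *end != '\0') return def;
    if (v < min_cap) return min_cap;
    if (v > max_cap) return max_cap;
    return (int)v;
}

/* Read the capacity from an environment variable such as HAKMEM_L2_RING_CAP. */
static int ring_cap_from_env(const char* name, int def, int min_cap, int max_cap) {
    return parse_ring_cap(getenv(name), def, min_cap, max_cap);
}
```

A call such as `ring_cap_from_env("HAKMEM_L2_RING_CAP", 48, 8, POOL_L2_RING_CAP)` would then yield the runtime limit. Since the ring arrays are sized at compile time, a runtime value can only act as a logical limit at or below the compiled capacity.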
### 3. Dynamic Ring Adjustment
Monitor ring hit rate and adjust capacity at runtime based on workload.
**Benefit:** Optimal performance for changing workloads.
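One way such an adjustment loop could be structured is sketched below. The `RingTuner` type, the 50% and 90% thresholds, and the step of 8 are hypothetical tuning choices for illustration; the document does not specify a policy.

```c
#include <assert.h>

typedef struct {
    int cap;          /* current logical capacity */
    int min_cap;      /* lower bound for cap */
    int max_cap;      /* upper bound for cap (at most the compiled array size) */
    unsigned hits;    /* allocations served from the ring in this window */
    unsigned misses;  /* allocations that fell through to a slower path */
} RingTuner;

/* Re-evaluate capacity at the end of a sampling window, then reset counters. */
static void ring_tuner_adjust(RingTuner* t) {
    assert(t->min_cap <= t->cap && t->cap <= t->max_cap);
    unsigned total = t->hits + t->misses;
    if (total > 0) {
        unsigned hit_pct = (unsigned)((unsigned long long)t->hits * 100u / total);
        if (hit_pct < 50u && t->cap + 8 <= t->max_cap) {
            t->cap += 8;   /* many misses: grow the ring */
        } else if (hit_pct > 90u && t->cap - 8 >= t->min_cap) {
            t->cap -= 8;   /* mostly hits: try a smaller TLS footprint */
        }
    }
    t->hits = 0;
    t->misses = 0;
}
```

Each thread would own its tuner and call `ring_tuner_adjust` every N operations, so no synchronization is needed in this sketch.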
---
## Success Criteria
1. **mid_large_mt:** ≥36.5M ops/s (+1.3% vs baseline)
2. **random_mixed:** ≥22.4M ops/s (within ±1%)
3. **No regressions** in full benchmark suite
4. **TLS memory:** ≤3.5 KB per thread
## Timeline
- **Steps 1-3:** 30 minutes (code changes)
- **Testing:** 2-3 hours (sweep + validation)
- **Documentation:** 30 minutes
- **Total:** ~4 hours