# Solution: Separate Ring Sizes Per Pool

## Problem Summary

`POOL_TLS_RING_CAP` currently controls ring size for BOTH L2 and L2.5 pools:
- **mid_large_mt** (8-32KB) uses L2 Pool → benefits from Ring=64
- **random_mixed** (8-128B) uses Tiny Pool → hurt by L2's TLS growth

**Root cause:** L2 Pool TLS grows from 980B → 3,668B (Ring 16→64), evicting Tiny Pool data from L1 cache.

## Solution: Per-Pool Ring Sizes

**Target configuration:**
- L2 Pool: Ring=48 (balanced performance + cache fit)
- L2.5 Pool: Ring=16 (unchanged, optimal for large allocs)
- Tiny Pool: No ring (uses freelist, unchanged)

**Expected outcome:**
- mid_large_mt: +2.1% vs baseline (36.04M → 36.8M ops/s)
- random_mixed: ±0% (22.5M maintained)
- TLS memory: -33% vs Ring=64 (5.0KB → 3.4KB)

---

## Implementation Steps

### Step 1: Modify L2 Pool (hakmem_pool.c)

Replace `POOL_TLS_RING_CAP` with `POOL_L2_RING_CAP`:

```c
// Lines 77-78 (current):
#ifndef POOL_TLS_RING_CAP
#define POOL_TLS_RING_CAP 64 // QW1-adjusted: Moderate increase

// Change to:
#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48 // Optimized for mid-size allocations (2-32KB)
#endif

// Line 80:
typedef struct { PoolBlock* items[POOL_TLS_RING_CAP]; int top; } PoolTLSRing;

// Change to:
typedef struct { PoolBlock* items[POOL_L2_RING_CAP]; int top; } PoolTLSRing;
```

**Then replace ALL occurrences** of `POOL_TLS_RING_CAP` → `POOL_L2_RING_CAP` in:
- Lines 265, 1721, 1954, 2146, 2173, 2174, 2265, 2266, 2319, 2397

**Command:**
```bash
sed -i 's/POOL_TLS_RING_CAP/POOL_L2_RING_CAP/g' core/hakmem_pool.c
```
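For orientation, the renamed macro is what bounds the thread-local ring at each of the call sites listed above. The sketch below is a minimal illustration and not the code in `hakmem_pool.c`: it assumes `top` counts the blocks currently cached and that the ring is used as a LIFO stack. The stub `PoolBlock` and the helper names `l2_ring_push` and `l2_ring_pop` are hypothetical.

```c
#include <stddef.h>

#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48
#endif

/* Stub block type for illustration only; the real PoolBlock is defined in hakmem_pool.c. */
typedef struct PoolBlock { struct PoolBlock* next; } PoolBlock;

typedef struct { PoolBlock* items[POOL_L2_RING_CAP]; int top; } PoolTLSRing;

/* Hypothetical helper: cache a freed block. Returns 0 when the ring is full,
 * in which case the caller would fall back to a slower path. */
static int l2_ring_push(PoolTLSRing* r, PoolBlock* b) {
    if (r->top >= POOL_L2_RING_CAP) return 0;
    r->items[r->top++] = b;
    return 1;
}

/* Hypothetical helper: take the most recently cached block, or NULL if empty. */
static PoolBlock* l2_ring_pop(PoolTLSRing* r) {
    if (r->top <= 0) return NULL;
    return r->items[--r->top];
}
```

Under these assumptions, a larger cap only changes how many frees can be absorbed before the fallback path is taken, which is why the capacity can be tuned per pool without touching the push/pop logic.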

### Step 2: Modify L2.5 Pool (hakmem_l25_pool.c)

Replace `POOL_TLS_RING_CAP` with `POOL_L25_RING_CAP`:

```c
// Lines 75-76 (current):
#ifndef POOL_TLS_RING_CAP
#define POOL_TLS_RING_CAP 16

// Change to:
#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16 // Optimized for large allocations (64KB-1MB)
#endif

// Line 78:
typedef struct { L25Block* items[POOL_TLS_RING_CAP]; int top; } L25TLSRing;

// Change to:
typedef struct { L25Block* items[POOL_L25_RING_CAP]; int top; } L25TLSRing;
```

**Then replace ALL occurrences** of `POOL_TLS_RING_CAP` → `POOL_L25_RING_CAP` in the rest of the file.

**Command:**
```bash
sed -i 's/POOL_TLS_RING_CAP/POOL_L25_RING_CAP/g' core/hakmem_l25_pool.c
```

### Step 3: Update Makefile

Update build flags to expose separate ring sizes:

```makefile
# Line 12 (current):
CFLAGS_SHARED = ... -DPOOL_TLS_RING_CAP=$(RING_CAP) ...

# Change to:
CFLAGS_SHARED = ... -DPOOL_L2_RING_CAP=$(L2_RING) -DPOOL_L25_RING_CAP=$(L25_RING) ...

# Add default values:
L2_RING ?= 48
L25_RING ?= 16
```

**Full line:**
```makefile
L2_RING ?= 48
L25_RING ?= 16
CFLAGS_SHARED = -O3 -march=native -mtune=native -Wall -Wextra -std=c11 -D_GNU_SOURCE -D_POSIX_C_SOURCE=199309L -D_GLIBC_USE_ISOC2X=0 -D__isoc23_strtol=strtol -D__isoc23_strtoll=strtoll -D__isoc23_strtoul=strtoul -D__isoc23_strtoull=strtoull -DHAKMEM_DEBUG_TIMING=$(HAKMEM_TIMING) -fPIC -DPOOL_L2_RING_CAP=$(L2_RING) -DPOOL_L25_RING_CAP=$(L25_RING) -ffast-math -funroll-loops -flto -fno-semantic-interposition -fno-plt -fomit-frame-pointer -fno-unwind-tables -fno-asynchronous-unwind-tables -I core
```
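Once the sources no longer read `POOL_TLS_RING_CAP`, a stale `-DPOOL_TLS_RING_CAP=...` left in a build script would be silently ignored. One possible safeguard is a compile-time check along the following lines. This is a sketch under stated assumptions, not part of the existing sources: the bounds of 1 to 256 are arbitrary illustrative limits, and the accessor names `hak_l2_ring_cap` and `hak_l25_ring_cap` are hypothetical.

```c
/* Fail the build if the retired flag is still passed on the command line. */
#ifdef POOL_TLS_RING_CAP
#error "POOL_TLS_RING_CAP is retired; pass POOL_L2_RING_CAP and POOL_L25_RING_CAP instead"
#endif

/* Same defaults as the Makefile, so a plain compile without -D flags still works. */
#ifndef POOL_L2_RING_CAP
#define POOL_L2_RING_CAP 48
#endif
#ifndef POOL_L25_RING_CAP
#define POOL_L25_RING_CAP 16
#endif

/* Illustrative sanity bounds (assumption): at least 1 slot, at most 256. */
_Static_assert(POOL_L2_RING_CAP >= 1 && POOL_L2_RING_CAP <= 256,
               "POOL_L2_RING_CAP out of range");
_Static_assert(POOL_L25_RING_CAP >= 1 && POOL_L25_RING_CAP <= 256,
               "POOL_L25_RING_CAP out of range");

/* Hypothetical accessors, handy for printing the active configuration in a benchmark banner. */
static inline int hak_l2_ring_cap(void)  { return POOL_L2_RING_CAP; }
static inline int hak_l25_ring_cap(void) { return POOL_L25_RING_CAP; }
```

`_Static_assert` requires C11, which matches the `-std=c11` flag already present in `CFLAGS_SHARED`.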

### Step 4: Add Documentation Comments

Add to `core/hakmem_pool.c` (after line 78):

```c
// POOL_L2_RING_CAP: TLS ring buffer capacity for L2 Pool (2-32KB allocations)
// - Default: 48 (balanced performance + L1 cache fit)
// - Larger values (64+): Better for high-contention mid-size workloads
//   but increases TLS footprint (may evict other pools from L1 cache)
// - Smaller values (16-32): Lower TLS memory, better for mixed workloads
// - Memory per thread: 7 classes × (CAP×8 + 12) bytes
//   Ring=48: 7 × 396 = 2,772 bytes (~44 cache lines)
```

Add to `core/hakmem_l25_pool.c` (after line 76):

```c
// POOL_L25_RING_CAP: TLS ring buffer capacity for L2.5 Pool (64KB-1MB allocations)
// - Default: 16 (optimal for large, less-frequent allocations)
// - Memory per thread: 5 classes × 148 bytes = 740 bytes (~12 cache lines)
```
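The footprint figures in these comments can be reproduced with a few lines of arithmetic. The sketch below applies the stated L2 formula, classes × (CAP×8 + 12), and rounds bytes up to cache lines. The 64-byte cache line size is an assumption (it is consistent with the "~44" and "~12" figures above), the function names are hypothetical, and the 740-byte L2.5 total is taken as given rather than derived.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64  /* assumption: 64-byte L1 cache lines */

/* Per-thread L2 ring bytes, using the formula from the comment above:
 * num_classes × (ring_cap × 8 + 12). */
static size_t l2_ring_tls_bytes(size_t num_classes, size_t ring_cap) {
    return num_classes * (ring_cap * 8 + 12);
}

/* Number of cache lines needed to hold `bytes`, rounded up. */
static size_t cache_lines(size_t bytes) {
    return (bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* With 7 classes:
 *   l2_ring_tls_bytes(7, 16) ->   980 bytes
 *   l2_ring_tls_bytes(7, 48) -> 2,772 bytes (44 cache lines)
 *   l2_ring_tls_bytes(7, 64) -> 3,668 bytes
 */
```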

---

## Testing Plan

### Test 1: Baseline Validation (Ring=16)

```bash
make clean
make L2_RING=16 L25_RING=16 bench_mid_large_mt bench_random_mixed

echo "=== Baseline Ring=16 ===" | tee baseline.txt
./bench_mid_large_mt 2 40000 128 | tee -a baseline.txt
./bench_random_mixed 200000 400 | tee -a baseline.txt
```

**Expected:**
- mid_large_mt: ~36.04M ops/s
- random_mixed: ~22.5M ops/s

### Test 2: Sweep L2 Ring Size (L2.5 fixed at 16)

```bash
rm -f sweep_results.txt
for RING in 24 32 40 48 56 64; do
    echo "=== Testing L2_RING=$RING ===" | tee -a sweep_results.txt
    make clean
    make L2_RING=$RING L25_RING=16 bench_mid_large_mt bench_random_mixed

    echo "mid_large_mt:" | tee -a sweep_results.txt
    ./bench_mid_large_mt 2 40000 128 | tee -a sweep_results.txt

    echo "random_mixed:" | tee -a sweep_results.txt
    ./bench_random_mixed 200000 400 | tee -a sweep_results.txt
    echo "" | tee -a sweep_results.txt
done
```

### Test 3: Validate Optimal Configuration (L2=48)

```bash
make clean
make L2_RING=48 L25_RING=16 bench_mid_large_mt bench_random_mixed

echo "=== Optimal L2=48, L25=16 ===" | tee optimal.txt
./bench_mid_large_mt 2 40000 128 | tee -a optimal.txt
./bench_random_mixed 200000 400 | tee -a optimal.txt
```

**Target:**
- mid_large_mt: ≥36.5M ops/s (+1.3% vs baseline)
- random_mixed: ≥22.4M ops/s (within ±1% of baseline)

### Test 4: Full Benchmark Suite

```bash
# Build with optimal config
make clean
make L2_RING=48 L25_RING=16

# Run comprehensive suite
./scripts/run_bench_suite.sh 2>&1 | tee full_suite.txt

# Check for regressions
grep -E "ops/sec|Throughput" full_suite.txt
```

---

## Expected Performance Matrix

| Configuration | mid_large_mt | random_mixed | Average | TLS (KB) | L1 Cache % |
|---------------|--------------|--------------|---------|----------|------------|
| Ring=16 (baseline) | 36.04M | 22.5M | 29.27M | 2.36 | 7.4% |
| Ring=64 (current) | 37.22M | 21.29M | 29.26M | 5.05 | 15.8% |
| **L2=48, L25=16** | **36.8M** | **22.5M** | **29.65M** | **3.4** | **10.6%** |

**Gains vs Ring=64:**
- mid_large_mt: -1.1% (acceptable trade-off)
- random_mixed: **+5.7%** (recovered performance)
- Average: **+1.3%**
- TLS footprint: **-33%**

**Gains vs Ring=16:**
- mid_large_mt: **+2.1%**
- random_mixed: ±0%
- Average: **+1.3%**
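The percentages above follow directly from the throughput figures in the matrix. As a quick cross-check, the two hypothetical helpers below compute a relative change and the unweighted mean used in the "Average" column. For example, `pct_change(36.04, 36.8)` comes out at roughly +2.1 and `pct_change(21.29, 22.5)` at roughly +5.7, matching the listed gains. The same helpers can be applied to measured numbers from the sweep.

```c
/* Relative change from `from` to `to`, in percent. */
static double pct_change(double from, double to) {
    return (to - from) / from * 100.0;
}

/* Unweighted mean of the two benchmark throughputs (in M ops/s). */
static double mean2(double a, double b) {
    return (a + b) / 2.0;
}
```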

---

## Rollback Plan

If performance regresses unexpectedly:

```bash
# Revert to Ring=64 (current)
make clean
make L2_RING=64 L25_RING=16

# Or revert to uniform Ring=16 (safe baseline)
make clean
make L2_RING=16 L25_RING=16
```

---

## Future Enhancements

### 1. Per-Size-Class Ring Tuning

```c
static const int g_l2_ring_caps[POOL_NUM_CLASSES] = {
    24, // 2KB (hot, minimal TLS)
    32, // 4KB (hot, moderate TLS)
    48, // 8KB (warm, larger TLS)
    64, // 16KB (warm, largest TLS)
    64, // 32KB (cold, largest TLS)
    32, // 40KB (bridge)
    24, // 52KB (bridge)
};
```

**Benefit:** Targeted optimization per size class (estimated +2-3% additional gain).

### 2. Runtime Adaptive Sizing

```c
// Environment variables:
// HAKMEM_L2_RING_CAP=48
// HAKMEM_L25_RING_CAP=16
```

**Benefit:** A/B testing without rebuild.
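Because the `items[]` array is sized at compile time, a runtime setting can presumably only choose an effective capacity at or below that compile-time maximum. A minimal sketch of reading such a variable is shown here. The environment variable names come from the list above, while `parse_ring_cap`, `ring_cap_from_env`, and the clamping policy are hypothetical choices for illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a ring capacity from text. Falls back to `def` on missing or malformed
 * input, and clamps valid numbers to the range [1, max]. */
static int parse_ring_cap(const char* s, int def, int max) {
    if (s == NULL || *s == '\0') return def;
    char* end = NULL;
    errno = 0;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return def;  /* overflow or trailing junk */
    if (v < 1) return 1;
    if (v > max) return max;
    return (int)v;
}

/* Look up `name` in the environment and parse it with the rules above. */
static int ring_cap_from_env(const char* name, int def, int max) {
    return parse_ring_cap(getenv(name), def, max);
}
```

A call such as `ring_cap_from_env("HAKMEM_L2_RING_CAP", 48, POOL_L2_RING_CAP)` during thread-cache initialization would keep the array size fixed while letting the effective capacity vary per run.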

### 3. Dynamic Ring Adjustment

Monitor ring hit rate and adjust capacity at runtime based on workload.

**Benefit:** Optimal performance for changing workloads.
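One way this could look is a small per-ring counter pair that is sampled periodically. The sketch below is purely illustrative: the `RingStats` type, the `ring_adapt` function, the 90% and 99% thresholds, and the step of 8 slots are all hypothetical placeholders that would need to be tuned against real measurements. It also assumes the effective capacity stays within a compile-time maximum.

```c
typedef struct {
    unsigned hits;    /* allocations served from the ring in this window */
    unsigned misses;  /* allocations that fell through to a slower path */
    int cap;          /* current effective capacity */
} RingStats;

/* Hypothetical policy: grow by 8 when the hit rate is below 90%, shrink by 8
 * when it is above 99%, always staying within [min_cap, max_cap]. Resets the
 * counters so each call evaluates a fresh window. Returns the new capacity. */
static int ring_adapt(RingStats* s, int min_cap, int max_cap) {
    unsigned long long total = (unsigned long long)s->hits + s->misses;
    if (total == 0) return s->cap;  /* no samples: leave capacity unchanged */

    unsigned long long hit_pct = (100ull * s->hits) / total;
    if (hit_pct < 90 && s->cap < max_cap) {
        s->cap += 8;
    } else if (hit_pct > 99 && s->cap > min_cap) {
        s->cap -= 8;
    }
    if (s->cap > max_cap) s->cap = max_cap;
    if (s->cap < min_cap) s->cap = min_cap;

    s->hits = 0;
    s->misses = 0;
    return s->cap;
}
```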

---

## Success Criteria

1. **mid_large_mt:** ≥36.5M ops/s (+1.3% vs baseline)
2. **random_mixed:** ≥22.4M ops/s (within ±1% of baseline)
3. **No regressions** in full benchmark suite
4. **TLS memory:** ≤3.5 KB per thread

## Timeline

- **Steps 1-3:** 30 minutes (code changes)
- **Testing:** 2-3 hours (sweep + validation)
- **Documentation:** 30 minutes
- **Total:** ~4 hours