Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
Hypothesis: Headers redundant for reused blocks
Strategy: Write headers ONCE at refill boundary, skip in hot alloc

Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median

Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)

Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)

Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)

Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance

Health Check: PASS (all profiles)

Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%

Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md (new file, +240 lines)
# Phase 5 E5-2: Header Write-Once Optimization - A/B Test Results

## Summary

**Target**: `tiny_region_id_write_header` (3.35% self% in perf profile)
**Strategy**: Write headers ONCE at refill boundary (C1-C6), skip writes in hot allocation path
**Result**: **NEUTRAL** (+0.45% mean, -0.38% median)
**Decision**: FREEZE as research box (default OFF)

---
## A/B Test Results (Mixed Workload)

### Configuration

- **Workload**: Mixed (16-1024B)
- **Iterations**: 20M per run
- **Working set**: 400
- **Runs**: 10 baseline, 10 optimized
- **ENV baseline**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` + `HAKMEM_TINY_HEADER_WRITE_ONCE=0`
- **ENV optimized**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` + `HAKMEM_TINY_HEADER_WRITE_ONCE=1`

### Results

| Metric | Baseline (WRITE_ONCE=0) | Optimized (WRITE_ONCE=1) | Delta |
|--------|-------------------------|--------------------------|-------|
| Mean   | 44.22M ops/s            | 44.42M ops/s             | +0.45% |
| Median | 44.53M ops/s            | 44.36M ops/s             | -0.38% |
| StdDev | 0.96M ops/s             | 0.48M ops/s              | -50%  |
### Raw Data

**Baseline (WRITE_ONCE=0)**:
```
Run 1: 44.31M ops/s
Run 2: 45.34M ops/s
Run 3: 44.48M ops/s
Run 4: 41.95M ops/s (outlier)
Run 5: 44.86M ops/s
Run 6: 44.57M ops/s
Run 7: 44.68M ops/s
Run 8: 44.72M ops/s
Run 9: 43.87M ops/s
Run 10: 43.42M ops/s
```

**Optimized (WRITE_ONCE=1)**:
```
Run 1: 44.23M ops/s
Run 2: 44.93M ops/s
Run 3: 44.26M ops/s
Run 4: 44.46M ops/s
Run 5: 43.86M ops/s
Run 6: 44.98M ops/s
Run 7: 44.10M ops/s
Run 8: 45.06M ops/s
Run 9: 43.65M ops/s
Run 10: 44.66M ops/s
```
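
The summary figures above can be reproduced from the raw runs with a small standalone helper. This is an illustrative sketch, not part of the allocator or its scripts; it uses the sample standard deviation (n-1), which matches the 0.96M/0.48M values in the table up to rounding of the median.

```c
/* Illustrative helper: recompute mean / median / sample stddev from the raw runs. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int cmp_double(const void* a, const void* b) {
    double x = *(const double*)a, y = *(const double*)b;
    return (x > y) - (x < y);
}

static void summarize(const char* label, const double* v, int n) {
    double sum = 0.0, sq = 0.0, sorted[16];       /* n <= 16 assumed */
    for (int i = 0; i < n; i++) { sum += v[i]; sorted[i] = v[i]; }
    double mean = sum / n;
    for (int i = 0; i < n; i++) sq += (v[i] - mean) * (v[i] - mean);
    qsort(sorted, (size_t)n, sizeof(double), cmp_double);
    double median = (n % 2) ? sorted[n / 2]
                            : 0.5 * (sorted[n / 2 - 1] + sorted[n / 2]);
    printf("%s: mean=%.2fM median=%.2fM stddev=%.2fM\n",
           label, mean, median, sqrt(sq / (n - 1)));   /* sample stddev */
}

int main(void) {
    const double base[10] = {44.31, 45.34, 44.48, 41.95, 44.86,
                             44.57, 44.68, 44.72, 43.87, 43.42};
    const double opt[10]  = {44.23, 44.93, 44.26, 44.46, 43.86,
                             44.98, 44.10, 45.06, 43.65, 44.66};
    summarize("Baseline (WRITE_ONCE=0)", base, 10);
    summarize("Optimized (WRITE_ONCE=1)", opt, 10);
    return 0;
}
```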

---

## Analysis

### Why NEUTRAL?

1. **Baseline variance**: Run 4 (41.95M) was an outlier, introducing high variance (σ=0.96M)
2. **Optimization reduced variance**: σ dropped from 0.96M → 0.48M (50% improvement in stability)
3. **Net effect**: Mean +0.45%, Median -0.38% → **within noise threshold (±1.0%)**

### Expected vs Actual

- **Expected**: +1-3% (based on reducing the 3.35% self% overhead)
- **Actual**: +0.45% mean (below the +1% lower bound, and ~7.5x smaller than the 3.35% self% of the target function)
- **Gap**: The optimization did not deliver the expected benefit
### Why Lower Than Expected?

**Hypothesis 1: Headers already written at refill**
- Inspection of `unified_cache_refill()` shows headers are ALREADY written during freelist pop (lines 835, 864)
- Hot path writes are **not redundant** - they write headers for blocks that DON'T have them yet
- E5-2 assumption (redundant writes) was incorrect

**Hypothesis 2: Branch overhead > write savings**
- E5-2 adds 2 branches to hot path:
  - `if (tiny_header_write_once_enabled())` (ENV gate check)
  - `if (tiny_class_preserves_header(class_idx))` (class check)
- These branches cost ~2 cycles each = 4 cycles total
- Header write saves ~3-5 cycles
- **Net**: 4 cycles overhead vs 3-5 cycles savings → marginal or negative (see the C sketch below)
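
To make the trade-off concrete, here is a minimal sketch of the two hot-path variants being compared, mirroring the companion design doc. The constants and the two predicate helpers are stand-in stubs so the example compiles on its own; they are not the allocator's real definitions.

```c
/* Minimal sketch (assumption from the design doc: 1-byte header at base,
 * user pointer at base+1). Stubs stand in for the real allocator. */
#include <stdint.h>

#define HEADER_MAGIC      0xA0u   /* illustrative value, not the real constant */
#define HEADER_CLASS_MASK 0x0Fu   /* illustrative value, not the real constant */

static inline int tiny_header_write_once_enabled(void) { return 1; }   /* ENV gate stub */
static inline int tiny_class_preserves_header(int class_idx) {         /* C1-C6 stub */
    return class_idx >= 1 && class_idx <= 6;
}

/* Baseline hot path: one unconditional byte store (~3-5 cycles). */
static inline void* finalize_baseline(void* base, int class_idx) {
    *(uint8_t*)base = (uint8_t)(HEADER_MAGIC | ((unsigned)class_idx & HEADER_CLASS_MASK));
    return (uint8_t*)base + 1;
}

/* E5-2 hot path: two (predicted) branches, ~2 cycles each, to skip that store. */
static inline void* finalize_write_once(void* base, int class_idx) {
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        return (uint8_t*)base + 1;               /* header prefilled at refill */
    }
    return finalize_baseline(base, class_idx);   /* C0/C7, or gate OFF */
}
```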

**Hypothesis 3: Prefill loop cost**
- `unified_cache_prefill_headers()` runs at refill boundary
- Loop over 128-512 blocks × 2 cycles per header write = 256-1024 cycles
- Amortized over 2048 allocations = 0.125-0.5 cycles/alloc
- Still negligible, but adds to overall cost

### Reduced Variance (Good)

- **Baseline StdDev**: 0.96M ops/s
- **Optimized StdDev**: 0.48M ops/s
- **50% reduction in StdDev**

This is a positive signal - the optimization makes performance more **stable**, even if it doesn't make it faster.

---

## Health Check

```bash
scripts/verify_health_profiles.sh
```

**Result**: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 41.9M ops/s (no regression)
- C6_HEAVY_LEGACY_POOLV1: 22.6M ops/s (no regression)
- All profiles passed functional tests

---
## Decision Matrix

| Criterion | Threshold | Actual | Status |
|-----------|-----------|--------|--------|
| Mean gain | >= +1.0% (GO) | +0.45% | ❌ FAIL |
| Median gain | >= +1.0% (GO) | -0.38% | ❌ FAIL |
| Health check | PASS | ✅ PASS | ✅ PASS |
| Correctness | No SEGV/assert | ✅ No issues | ✅ PASS |

**Decision**: **NEUTRAL** → FREEZE as research box

---
## Verdict

### FREEZE (Default OFF)

**Rationale**:
1. **Gain within noise**: +0.45% mean is below the +1.0% GO threshold
2. **Median slightly negative**: -0.38% suggests no consistent benefit
3. **Root cause**: Original assumption (redundant header writes) was incorrect
   - Headers are already written correctly at refill (freelist pop path)
   - Hot path writes are NOT redundant
4. **Branch overhead**: ENV gate + class check (~4 cycles) roughly cancels the savings (~3-5 cycles)

### Positive Outcomes

1. **Reduced variance**: σ dropped 50% (0.96M → 0.48M)
   - Optimization makes performance more predictable
   - Useful for benchmarking/profiling stability
2. **Clean implementation**: Box theory design is correct, safe, and maintainable
3. **Learning**: perf self% doesn't always translate to optimization ROI
   - Need to verify assumptions (redundancy) before optimizing

---
## Files Modified

### New Files Created (3)

1. **core/box/tiny_header_write_once_env_box.h**:
   - ENV gate: `HAKMEM_TINY_HEADER_WRITE_ONCE=0/1` (default: 0)

2. **core/box/tiny_header_write_once_stats_box.h**:
   - Stats counters (optional, ENV-gated)

3. **docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md**:
   - Design document

### Existing Files Modified (4)

1. **core/box/tiny_header_box.h**:
   - Added `tiny_header_finalize_alloc()` function
   - Enables write-once optimization for C1-C6

2. **core/front/tiny_unified_cache.c**:
   - Added `unified_cache_prefill_headers()` helper (lines 523-549)
   - Integrated prefill at 3 refill boundaries (lines 633, 805, 950)
   - Added includes for ENV box and header box (lines 31-32)

3. **core/box/tiny_front_hot_box.h**:
   - Changed hot path to use `tiny_header_finalize_alloc()` (line 131)
   - Added include for `tiny_header_box.h` (line 32)

4. **docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md**:
   - This file

---
## Rollback Plan

**ENV gate**:
```bash
export HAKMEM_TINY_HEADER_WRITE_ONCE=0  # Already default
```

**Code rollback**: Not needed (default OFF, no preset promotion)

---

## Next Steps

1. **E5-2**: FREEZE as research box (do not promote to preset)
2. **E5-3**: Attempt next candidate (ENV snapshot shape optimization, 2.97% target)
3. **Alternative**: Investigate other perf hot spots (>= 3% self%)

---
## Key Lessons

### Lesson 1: Verify Assumptions

- **Assumption**: Header writes are redundant (blocks reused from freelist)
- **Reality**: Headers are already written correctly at freelist pop
- **Learning**: Always inspect code paths before optimizing based on a perf profile

### Lesson 2: perf self% ≠ Optimization ROI

- **Observation**: 3.35% self% → +0.45% gain (7.5x gap)
- **Reason**: self% measures time IN the function, not time saved by REMOVING it
- **Learning**: Hot spot optimization requires understanding WHY it's hot, not just THAT it's hot

### Lesson 3: Branch Overhead Matters

- **Cost**: 2 new branches (ENV gate + class check) = ~4 cycles
- **Savings**: Header write skip = ~3-5 cycles
- **Net**: Marginal or negative
- **Learning**: Even "simple" optimizations can add overhead that cancels savings

### Lesson 4: Reduced Variance is Valuable

- **Outcome**: σ dropped 50% despite a neutral mean
- **Value**: More stable performance → better for profiling/benchmarking
- **Learning**: Optimization success isn't just throughput; stability matters too

---

**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: COMPLETE - NEUTRAL (FREEZE as research box)
docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md (new file, +361 lines)
# Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)

## Status

**Target**: `tiny_region_id_write_header` (3.35% self% in perf profile)
**Baseline**: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
**Goal**: +1-3% by moving header writes from the allocation hot path to the refill cold boundary

---
## Hypothesis

**Problem**: `tiny_region_id_write_header()` is called on **every** allocation, writing the same header multiple times for reused blocks:
1. **First allocation**: Block carved from slab → header written
2. **Free**: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
3. **Second allocation**: Block popped from TLS → **header written AGAIN** (redundant for C1-C6)

**Observation**:
- **C1-C6** (16B-1024B): Headers are **preserved** in freelist (next pointer at offset +1)
  - Rewriting the same header on every allocation is pure waste
- **C0, C7** (8B, 2048B): Headers are **overwritten** by next pointer (offset 0)
  - Must write header on every allocation (cannot skip)

**Opportunity**: For C1-C6, write header **once** at refill boundary (when block is initially created), skip writes on subsequent allocations (the assumed block layout is sketched below).
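
The following is a minimal illustrative sketch of the layout assumption above (1-byte header at the block base, user pointer at base+1, freelist next pointer at offset 0 for C0/C7 and offset +1 for C1-C6). Names and helpers are simplified stand-ins, not the allocator's actual definitions.

```c
/* Illustrative layout sketch (simplified; not the allocator's real code).
 *
 *   C1-C6 block:  [ header byte ][ user payload ... ]
 *                   ^base          ^base+1  <- pointer returned to the user
 *                 On free, the freelist next pointer is stored at offset +1,
 *                 so the header byte at offset 0 survives reuse.
 *
 *   C0/C7 block:  the next pointer is stored at offset 0 and clobbers the
 *                 header byte, so the header must be rewritten on every alloc.
 */
#include <stdint.h>
#include <string.h>

static inline void* tiny_block_user_ptr(void* base) {
    return (uint8_t*)base + 1;   /* payload starts right after the header byte */
}

static inline void tiny_block_free_push(void* base, void** freelist_head,
                                        int preserves_header) {
    /* preserves_header != 0 models C1-C6: link at offset +1, header survives.
     * preserves_header == 0 models C0/C7: link at offset 0, header clobbered. */
    void* slot = preserves_header ? (void*)((uint8_t*)base + 1) : base;
    memcpy(slot, freelist_head, sizeof(void*));  /* store current head as next link */
    *freelist_head = base;
}
```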

---

## Box Theory Design

### L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)

```c
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
    static int cached = -1;
    if (cached == -1) {
        cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
    }
    return cached;
}
```

**Default**: 0 (OFF, research box)
**MIXED preset**: 1 (ON after GO)
### L1: Refill Boundary (unified_cache_refill)

**Current flow** (core/front/tiny_unified_cache.c):
```c
// unified_cache_refill() populates TLS cache from backend
// Backend returns BASE pointers (no header written yet)
// Each allocation calls tiny_region_id_write_header(base, class_idx)
```

**Optimized flow** (write-once):
```c
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
    void* base = slots[i];
    if (tiny_class_preserves_header(class_idx)) {
        // Write header ONCE at refill boundary
        *(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
    }
    // C0, C7: Skip (header will be overwritten by next pointer anyway)
}
```

**Hot path change**:
```c
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx);  // 3.35% self%

// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1);  // Direct offset, no write

// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx);  // Still need write
```

---

## Implementation Strategy

### Step 1: Refill-time header prefill

**File**: `core/front/tiny_unified_cache.c`
**Function**: `unified_cache_refill()`

**Modification**:
```c
static void unified_cache_refill(int class_idx) {
    // ... existing refill logic ...

    // After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        for (int i = 0; i < refill_count; i++) {
            void* base = cache->slots[tail_idx];
            *(uint8_t*)base = header_byte;
            tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
        }
    }
#endif
}
```

**Safety**:
- Only prefills for C1-C6 (`tiny_class_preserves_header()`)
- C0, C7 are skipped (headers will be overwritten anyway)
- Uses existing `HEADER_MAGIC` constant
- Fail-fast: If `WRITE_ONCE=1` but headers not prefilled, hot path still writes header (no corruption)
### Step 2: Hot path skip logic

**File**: `core/front/malloc_tiny_fast.h`
**Functions**: All allocation paths (tiny_hot_alloc_fast, etc.)

**Before**:
```c
#if HAKMEM_TINY_HEADER_CLASSIDX
    return tiny_region_id_write_header(base, class_idx);
#else
    return base;
#endif
```

**After**:
```c
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        // Header already written at refill boundary (C1-C6)
        return (void*)((uint8_t*)base + 1);  // Fast: skip write, direct offset
    } else {
        // C0, C7, or WRITE_ONCE=0: Traditional path
        return tiny_region_id_write_header(base, class_idx);
    }
#else
    return base;
#endif
```

**Inline optimization**:
```c
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        return (void*)((uint8_t*)base + 1);  // Prefilled, skip write
    }
    return tiny_region_id_write_header(base, class_idx);  // Traditional
#else
    (void)class_idx;
    return base;
#endif
}
```
### Step 3: Stats counters (optional)

**File**: `core/box/tiny_header_write_once_stats_box.h`

```c
typedef struct {
    uint64_t refill_prefill_count;  // Headers prefilled at refill
    uint64_t alloc_skip_count;      // Allocations that skipped header write
    uint64_t alloc_write_count;     // Allocations that wrote header (C0, C7)
} TinyHeaderWriteOnceStats;

extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
```
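
As a usage illustration, the counters could be bumped at the prefill loop and at the two hot-path outcomes, then dumped once at exit. The helpers below are a hypothetical sketch, not the project's actual reporting code, and assume the stats box header (struct above) is included.

```c
/* Hypothetical usage sketch for the stats box (illustrative only). */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

__thread TinyHeaderWriteOnceStats g_header_write_once_stats;  /* TLS definition for the extern */

static inline void stats_on_prefill(int blocks) {
    g_header_write_once_stats.refill_prefill_count += (uint64_t)blocks;
}

static inline void stats_on_alloc(int skipped_write) {
    if (skipped_write) g_header_write_once_stats.alloc_skip_count++;   /* C1-C6 skip */
    else               g_header_write_once_stats.alloc_write_count++;  /* C0/C7 write */
}

static void stats_dump(void) {
    const TinyHeaderWriteOnceStats* s = &g_header_write_once_stats;
    fprintf(stderr,
            "header-write-once: prefill=%" PRIu64 " skip=%" PRIu64 " write=%" PRIu64 "\n",
            s->refill_prefill_count, s->alloc_skip_count, s->alloc_write_count);
}
```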

---

## Expected Performance Impact

### Cost Breakdown (Before - WRITE_ONCE=0)

**Hot path** (every allocation):
```
tiny_region_id_write_header():
  1. NULL check (1 cycle)
  2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
  3. Offset calculation: return (uint8_t*)base + 1 (1 cycle)
Total: ~5 cycles per allocation
```

**perf profile**: 3.35% self% → **~1.5M ops/s** overhead at the 43.998M ops/s baseline

### Optimized Path (WRITE_ONCE=1, C1-C6)

**Refill boundary** (once per 2048 allocations):
```
unified_cache_refill():
  Loop over refill_count (~128-256 blocks):
    *(uint8_t*)base = header_byte (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
```

**Hot path** (every allocation):
```
tiny_header_finalize_alloc():
  1. Branch: if (write_once && preserves) (1 cycle, predicted)
  2. Offset: return (uint8_t*)base + 1 (1 cycle)
Total: ~2 cycles per allocation
```

**Net savings**: 5 cycles → 2 cycles = **3 cycles per allocation** (60% reduction)

### Expected Gain

**Formula**: 3.35% overhead × 60% reduction = **~2.0% throughput gain**

**Conservative estimate**: +1.0% to +2.5% (accounting for branch misprediction, ENV check overhead)

**Target**: 43.998M → **44.9M - 45.1M ops/s** (+2.0% to +2.5%)
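
A quick worked check of that arithmetic, under the simple model stated in the formula (throughput gain ≈ the function's profile share × the fraction of its cycles removed); this snippet is illustrative only:

```c
/* Worked check of the expected-gain estimate above. */
#include <stdio.h>

int main(void) {
    const double baseline_mops = 43.998;    /* Mixed baseline (M ops/s) */
    const double self_share    = 0.0335;    /* 3.35% self% */
    const double cycle_cut     = 3.0 / 5.0; /* 5 -> 2 cycles = 60% reduction */

    double gain = self_share * cycle_cut;                      /* ~0.0201 -> ~2.0% */
    printf("expected gain:   %.2f%%\n", gain * 100.0);
    printf("expected target: %.2fM ops/s\n", baseline_mops * (1.0 + gain));  /* ~44.9M */
    return 0;
}
```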

---

## Safety & Rollback

### Safety Mechanisms

1. **ENV gate**: `HAKMEM_TINY_HEADER_WRITE_ONCE=0` reverts to traditional path
2. **Class filter**: Only C1-C6 use write-once (C0, C7 always write header)
3. **Fail-safe**: If ENV=1 but refill prefill is broken, hot path still works (writes header)
4. **No ABI change**: User pointers identical, only internal optimization

### Rollback Plan

```bash
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
```

**Rollback triggers**:
- A/B test shows < +1.0% gain (NEUTRAL → freeze as research box)
- A/B test shows <= -1.0% regression (NO-GO → freeze)
- Health check fails (revert preset default)

---

## Integration Points

### Files to modify

1. **core/box/tiny_header_write_once_env_box.h** (new):
   - ENV gate: `tiny_header_write_once_enabled()`

2. **core/box/tiny_header_write_once_stats_box.h** (new, optional):
   - Stats counters for observability

3. **core/box/tiny_header_box.h** (existing):
   - New function: `tiny_header_finalize_alloc(base, class_idx)`
   - Inline logic for write-once vs traditional

4. **core/front/tiny_unified_cache.c** (existing):
   - Modify `unified_cache_refill()` to prefill headers

5. **core/front/malloc_tiny_fast.h** (existing):
   - Replace `tiny_region_id_write_header()` calls with `tiny_header_finalize_alloc()`
   - ~15-20 call sites

6. **core/bench_profile.h** (existing, after GO):
   - Add `HAKMEM_TINY_HEADER_WRITE_ONCE=1` to the `MIXED_TINYV3_C7_SAFE` preset

---
## A/B Test Plan

### Baseline (WRITE_ONCE=0)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**, collect mean/median/stddev.

### Optimized (WRITE_ONCE=1)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**, collect mean/median/stddev.

### GO/NO-GO Criteria

- **GO**: mean >= +1.0% (promote to MIXED preset)
- **NEUTRAL**: -1.0% < mean < +1.0% (freeze as research box)
- **NO-GO**: mean <= -1.0% (freeze, do not pursue)
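
A tiny helper makes the thresholds above unambiguous. This is an illustrative sketch rather than the project's actual tooling; `delta_pct` is assumed to be the mean throughput delta of optimized vs baseline, in percent.

```c
/* Illustrative GO/NO-GO classifier (not the project's scripts).
 * delta_pct = (optimized_mean / baseline_mean - 1.0) * 100.0 */
#include <stdio.h>

typedef enum { DECISION_GO, DECISION_NEUTRAL, DECISION_NO_GO } ab_decision_t;

static ab_decision_t ab_classify(double delta_pct) {
    if (delta_pct >= 1.0)  return DECISION_GO;      /* promote to MIXED preset */
    if (delta_pct <= -1.0) return DECISION_NO_GO;   /* freeze, do not pursue */
    return DECISION_NEUTRAL;                        /* freeze as research box */
}

int main(void) {
    /* E5-2 result: 44.22M -> 44.42M ops/s mean */
    double delta_pct = (44.42 / 44.22 - 1.0) * 100.0;   /* ~ +0.45% */
    static const char* names[] = { "GO", "NEUTRAL", "NO-GO" };
    printf("delta=%.2f%% decision=%s\n", delta_pct, names[ab_classify(delta_pct)]);
    return 0;
}
```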

### Health Check

```bash
scripts/verify_health_profiles.sh
```

**Requirements**:
- MIXED_TINYV3_C7_SAFE: No regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: No regression vs baseline

---
## Success Metrics

### Performance

- **Primary**: Mixed throughput +1.0% or higher (mean)
- **Secondary**: `tiny_region_id_write_header` self% drops from 3.35% to <1.5%

### Correctness

- **No SEGV**: All benchmarks pass without segmentation faults
- **No assert failures**: Debug builds pass validation
- **Health check**: All profiles pass functional tests

---

## Key Insights (Box Theory)

### Why This Works

1. **Single Source of Truth**: `tiny_class_preserves_header()` encapsulates C1-C6 logic
2. **Boundary Optimization**: Write cost moved from hot (N times) to cold (1 time)
3. **Deduplication**: Eliminates redundant header writes on freelist reuse
4. **Fail-fast**: C0, C7 continue to write headers (no special case complexity)

### Design Patterns

- **L0 Gate**: ENV flag with static cache (zero runtime cost)
- **L1 Cold Boundary**: Refill is cold path (amortized cost is negligible)
- **L1 Hot Path**: Branch predicted (write_once=1 is stable state)
- **Safety**: Class-based filtering ensures correctness

### Comparison to E5-1 Success

- **E5-1 strategy**: Consolidation (eliminate redundant checks in wrapper)
- **E5-2 strategy**: Deduplication (eliminate redundant header writes)
- **Common pattern**: "Do once what you were doing N times"

---

## Next Steps

1. **Implement**: Create ENV box, modify refill boundary, update hot paths
2. **A/B test**: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
3. **Validate**: Health check on all profiles
4. **Decide**: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
5. **Document**: Update `CURRENT_TASK.md` and `ENV_PROFILE_PRESETS.md`

---

**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: DESIGN COMPLETE, ready for implementation