# Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)

## Status

**Target**: `tiny_region_id_write_header` (3.35% self% in perf profile)
**Baseline**: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
**Goal**: +1% to +3% by moving header writes from the allocation hot path to the cold refill boundary

---

## Hypothesis

**Problem**: `tiny_region_id_write_header()` is called on **every** allocation, writing the same header multiple times for reused blocks:

1. **First allocation**: Block carved from slab → header written
2. **Free**: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
3. **Second allocation**: Block popped from TLS → **header written AGAIN** (redundant for C1-C6)

**Observation**:

- **C1-C6** (16B-1024B): Headers are **preserved** in the freelist (next pointer at offset +1)
  - Rewriting the same header on every allocation is pure waste
- **C0, C7** (8B, 2048B): Headers are **overwritten** by the next pointer (offset 0)
  - Must write the header on every allocation (cannot skip)

**Opportunity**: For C1-C6, write the header **once** at the refill boundary (when the block is initially created) and skip the write on subsequent allocations.

---

## Box Theory Design

### L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)

```c
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
    static int cached = -1;
    if (cached == -1) {
        cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
    }
    return cached;
}
```

**Default**: 0 (OFF, research box)
**MIXED preset**: 1 (ON after GO)

### L1: Refill Boundary (unified_cache_refill)

**Current flow** (core/front/tiny_unified_cache.c):

```c
// unified_cache_refill() populates the TLS cache from the backend.
// The backend returns BASE pointers (no header written yet).
// Each allocation then calls tiny_region_id_write_header(base, class_idx).
```

**Optimized flow** (write-once):

```c
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
    void* base = slots[i];
    if (tiny_class_preserves_header(class_idx)) {
        // Write header ONCE at refill boundary
        *(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
    }
    // C0, C7: skip (header will be overwritten by the next pointer anyway)
}
```

**Hot path change**:

```c
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx);  // 3.35% self%

// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1);  // Direct offset, no write

// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx);  // Still needs the write
```
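`tiny_class_preserves_header()` gates both snippets above but is not shown in this note. For reference, a minimal sketch of the assumed predicate, inferred from the C1-C6 / C0-C7 layout described under Hypothesis; the real implementation already exists in a box header and should be used as-is:

```c
// Sketch only -- assumed shape of the existing class filter.
// Per the Hypothesis section: C1-C6 keep the freelist next pointer at
// offset +1, so the header byte at offset 0 survives free/alloc reuse;
// C0 (8B) and C7 (2048B) store the next pointer at offset 0 and clobber it.
static inline int tiny_class_preserves_header(int class_idx) {
    return class_idx >= 1 && class_idx <= 6;
}
```

Keeping this predicate as the single place that encodes the C1-C6 range is what lets both the refill loop and the hot path agree on which classes are safe to skip.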
---

## Implementation Strategy

### Step 1: Refill-time header prefill

**File**: `core/front/tiny_unified_cache.c`
**Function**: `unified_cache_refill()`
**Modification**:

```c
static void unified_cache_refill(int class_idx) {
    // ... existing refill logic (obtains cache, tail_idx, refill_count) ...

    // After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        for (int i = 0; i < refill_count; i++) {
            void* base = cache->slots[tail_idx];
            *(uint8_t*)base = header_byte;
            tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
        }
    }
#endif
}
```

**Safety**:

- Only prefills for C1-C6 (`tiny_class_preserves_header()`)
- C0, C7 are skipped (their headers would be overwritten anyway)
- Uses the existing `HEADER_MAGIC` constant
- Fail-safe: if `WRITE_ONCE=1` but a block's header was not prefilled, the traditional write path still covers it (no corruption)

### Step 2: Hot path skip logic

**File**: `core/front/malloc_tiny_fast.h`
**Functions**: All allocation paths (tiny_hot_alloc_fast, etc.)

**Before**:

```c
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif
```

**After**:

```c
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
    // Header already written at refill boundary (C1-C6)
    return (void*)((uint8_t*)base + 1);  // Fast: skip write, direct offset
} else {
    // C0, C7, or WRITE_ONCE=0: traditional path
    return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif
```

**Inline optimization**:

```c
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        return (void*)((uint8_t*)base + 1);  // Prefilled, skip write
    }
    return tiny_region_id_write_header(base, class_idx);  // Traditional
#else
    (void)class_idx;
    return base;
#endif
}
```

### Step 3: Stats counters (optional)

**File**: `core/box/tiny_header_write_once_stats_box.h`

```c
typedef struct {
    uint64_t refill_prefill_count;  // Headers prefilled at refill
    uint64_t alloc_skip_count;      // Allocations that skipped the header write
    uint64_t alloc_write_count;     // Allocations that wrote the header (C0, C7)
} TinyHeaderWriteOnceStats;

extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
```
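If the optional counters are adopted, the natural hook point on the allocation side is the Step 2 helper. A sketch of the wiring, assuming the names defined above; the instrumented variant (`tiny_header_finalize_alloc_stats`) is illustrative only, and in practice the increments would sit behind their own compile-time stats gate so the release hot path keeps the cycle budget estimated in the next section:

```c
// Illustrative wiring only -- a hypothetical stats-enabled variant of the
// Step 2 helper. Assumes the Step 2/Step 3 headers above are included;
// anything not defined in those snippets is an assumption.
__thread TinyHeaderWriteOnceStats g_header_write_once_stats;

static inline void* tiny_header_finalize_alloc_stats(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        g_header_write_once_stats.alloc_skip_count++;   // header was prefilled at refill
        return (void*)((uint8_t*)base + 1);
    }
    g_header_write_once_stats.alloc_write_count++;      // C0, C7, or gate off
    return tiny_region_id_write_header(base, class_idx);
#else
    (void)class_idx;
    return base;
#endif
}
```

`refill_prefill_count` would be bumped once per block inside the Step 1 prefill loop.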
---

## Expected Performance Impact

### Cost Breakdown (Before, WRITE_ONCE=0)

**Hot path** (every allocation):

```
tiny_region_id_write_header():
1. NULL check                                                (1 cycle)
2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx  (2-3 cycles, store)
3. Offset calculation: return (uint8_t*)base + 1             (1 cycle)
Total: ~5 cycles per allocation
```

**perf profile**: 3.35% self% → **~1.5M ops/s** of overhead at the 43.998M ops/s baseline

### Optimized Path (WRITE_ONCE=1, C1-C6)

**Refill boundary** (once per 2048 allocations):

```
unified_cache_refill():
Loop over refill_count (~128-256 blocks):
  *(uint8_t*)base = header_byte   (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
```

**Hot path** (every allocation):

```
tiny_header_finalize_alloc():
1. Branch: if (write_once && preserves)   (1 cycle, predicted)
2. Offset: return (uint8_t*)base + 1      (1 cycle)
Total: ~2 cycles per allocation
```

**Net savings**: 5 cycles → 2 cycles = **3 cycles per allocation** (60% reduction)

### Expected Gain

**Formula**: 3.35% overhead × 60% reduction = **~2.0% throughput gain**
**Conservative estimate**: +1.0% to +2.5% (accounting for branch misprediction and ENV-check overhead)
**Target**: 43.998M → **44.9M-45.1M ops/s** (+2.0% to +2.5%)

---

## Safety & Rollback

### Safety Mechanisms

1. **ENV gate**: `HAKMEM_TINY_HEADER_WRITE_ONCE=0` reverts to the traditional path
2. **Class filter**: Only C1-C6 use write-once (C0, C7 always write the header)
3. **Fail-safe**: If ENV=1 but the refill prefill is broken, the hot path still works (writes the header; see the debug sketch after the rollback plan)
4. **No ABI change**: User pointers are identical; this is an internal optimization only

### Rollback Plan

```bash
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
```

**Rollback triggers**:

- A/B test mean gain is below +1.0% (NEUTRAL → freeze as research box)
- A/B test mean is below -1.0% (NO-GO → freeze)
- Health check fails (revert preset default)
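To make the fail-safe in mechanism 3 observable rather than silent, a debug-only check could verify the prefill at pop time. A hypothetical sketch, not part of the plan above; the name `tiny_header_debug_check_prefill` and its call site are assumptions, and it relies on the constants and predicates from the earlier snippets:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical debug-only guard (assumes HEADER_MAGIC, HEADER_CLASS_MASK,
// and the gate/predicate from the snippets above): if WRITE_ONCE=1 promised
// a prefilled header for this class, verify the byte actually survived
// before the pointer is handed out. A broken refill prefill then fails
// fast in debug builds instead of surfacing as a misclassified block later.
static inline void tiny_header_debug_check_prefill(void* base, int class_idx) {
#ifndef NDEBUG
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t expect = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        assert(*(const uint8_t*)base == expect && "refill prefill missed a block");
    }
#else
    (void)base; (void)class_idx;
#endif
}
```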
---

## Integration Points

### Files to modify

1. **core/box/tiny_header_write_once_env_box.h** (new):
   - ENV gate: `tiny_header_write_once_enabled()`
2. **core/box/tiny_header_write_once_stats_box.h** (new, optional):
   - Stats counters for observability
3. **core/box/tiny_header_box.h** (existing):
   - New function: `tiny_header_finalize_alloc(base, class_idx)`
   - Inline logic for write-once vs traditional
4. **core/front/tiny_unified_cache.c** (existing):
   - Modify `unified_cache_refill()` to prefill headers
5. **core/front/malloc_tiny_fast.h** (existing):
   - Replace `tiny_region_id_write_header()` calls with `tiny_header_finalize_alloc()`
   - ~15-20 call sites
6. **core/bench_profile.h** (existing, after GO):
   - Add `HAKMEM_TINY_HEADER_WRITE_ONCE=1` to the `MIXED_TINYV3_C7_SAFE` preset

---

## A/B Test Plan

### Baseline (WRITE_ONCE=0)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**; collect mean/median/stddev.

### Optimized (WRITE_ONCE=1)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**; collect mean/median/stddev.

### GO/NO-GO Criteria

- **GO**: mean >= +1.0% (promote to MIXED preset)
- **NEUTRAL**: -1.0% < mean < +1.0% (freeze as research box)
- **NO-GO**: mean <= -1.0% (freeze, do not pursue)

### Health Check

```bash
scripts/verify_health_profiles.sh
```

**Requirements**:

- MIXED_TINYV3_C7_SAFE: no regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: no regression vs baseline

---

## Success Metrics

### Performance

- **Primary**: Mixed throughput +1.0% or higher (mean)
- **Secondary**: `tiny_region_id_write_header` self% drops from 3.35% to below 1.5%

### Correctness

- **No SEGV**: All benchmarks pass without segmentation faults
- **No assert failures**: Debug builds pass validation
- **Health check**: All profiles pass functional tests

---

## Key Insights (Box Theory)

### Why This Works

1. **Single Source of Truth**: `tiny_class_preserves_header()` encapsulates the C1-C6 logic
2. **Boundary Optimization**: Write cost moves from hot (N times) to cold (1 time)
3. **Deduplication**: Eliminates redundant header writes on freelist reuse
4. **Fail-fast**: C0, C7 continue to write headers (no special-case complexity)

### Design Patterns

- **L0 Gate**: ENV flag with static cache (zero runtime cost after the first call)
- **L1 Cold Boundary**: Refill is the cold path (amortized cost is negligible)
- **L1 Hot Path**: Branch is well predicted (write_once=1 is a stable state)
- **Safety**: Class-based filtering ensures correctness

### Comparison to E5-1 Success

- **E5-1 strategy**: Consolidation (eliminate redundant checks in the wrapper)
- **E5-2 strategy**: Deduplication (eliminate redundant header writes)
- **Common pattern**: "Do once what you were doing N times"

---

## Next Steps

1. **Implement**: Create the ENV box, modify the refill boundary, update the hot paths
2. **A/B test**: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
3. **Validate**: Health check on all profiles
4. **Decide**: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
5. **Document**: Update `CURRENT_TASK.md` and `ENV_PROFILE_PRESETS.md`

---

**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: DESIGN COMPLETE, ready for implementation