# Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)

## Status

**Target**: `tiny_region_id_write_header` (3.35% self% in perf profile)
**Baseline**: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
**Goal**: +1% to +3% by moving header writes from the allocation hot path to the cold refill boundary

---

## Hypothesis

**Problem**: `tiny_region_id_write_header()` is called on **every** allocation, writing the same header multiple times for reused blocks:

1. **First allocation**: Block carved from slab → header written
2. **Free**: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
3. **Second allocation**: Block popped from TLS → **header written AGAIN** (redundant for C1-C6)

**Observation**:

- **C1-C6** (16B-1024B): Headers are **preserved** in the freelist (next pointer at offset +1)
  - Rewriting the same header on every allocation is pure waste
- **C0, C7** (8B, 2048B): Headers are **overwritten** by the next pointer (offset 0)
  - Must write the header on every allocation (cannot skip)

**Opportunity**: For C1-C6, write the header **once** at the refill boundary (when the block is initially created) and skip the write on subsequent allocations.

---

## Box Theory Design

### L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)

```c
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
    static int cached = -1;
    if (cached == -1) {
        cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
    }
    return cached;
}
```

**Default**: 0 (OFF, research box)
**MIXED preset**: 1 (ON after GO)

### L1: Refill Boundary (unified_cache_refill)

**Current flow** (core/front/tiny_unified_cache.c):

```c
// unified_cache_refill() populates the TLS cache from the backend.
// The backend returns BASE pointers (no header written yet).
// Each allocation then calls tiny_region_id_write_header(base, class_idx).
```

**Optimized flow** (write-once):

```c
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
    void* base = slots[i];
    if (tiny_class_preserves_header(class_idx)) {
        // Write header ONCE at refill boundary
        *(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
    }
    // C0, C7: skip (header will be overwritten by the next pointer anyway)
}
```

**Hot path change**:

```c
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx);  // 3.35% self%

// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1);  // Direct offset, no write

// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx);  // Still needs the write
```
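`tiny_class_preserves_header()` gates both snippets above but is not shown in this note. For reference, a minimal sketch of the assumed predicate, inferred from the C1-C6 / C0-C7 layout described under Hypothesis; the real implementation already exists in a box header and should be used as-is:

```c
// Sketch only -- assumed shape of the existing class filter.
// Per the Hypothesis section: C1-C6 keep the freelist next pointer at
// offset +1, so the header byte at offset 0 survives free/alloc reuse;
// C0 (8B) and C7 (2048B) store the next pointer at offset 0 and clobber it.
static inline int tiny_class_preserves_header(int class_idx) {
    return class_idx >= 1 && class_idx <= 6;
}
```

Keeping this predicate as the single place that encodes the C1-C6 range is what lets both the refill loop and the hot path agree on which classes are safe to skip.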
---

## Implementation Strategy

### Step 1: Refill-time header prefill

**File**: `core/front/tiny_unified_cache.c`
**Function**: `unified_cache_refill()`
**Modification**:

```c
static void unified_cache_refill(int class_idx) {
    // ... existing refill logic (obtains cache, tail_idx, refill_count) ...

    // After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        for (int i = 0; i < refill_count; i++) {
            void* base = cache->slots[tail_idx];
            *(uint8_t*)base = header_byte;
            tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
        }
    }
#endif
}
```

**Safety**:

- Only prefills for C1-C6 (`tiny_class_preserves_header()`)
- C0, C7 are skipped (their headers would be overwritten anyway)
- Uses the existing `HEADER_MAGIC` constant
- Fail-safe: if `WRITE_ONCE=1` but a block's header was not prefilled, the traditional write path still covers it (no corruption)

### Step 2: Hot path skip logic

**File**: `core/front/malloc_tiny_fast.h`
**Functions**: All allocation paths (tiny_hot_alloc_fast, etc.)

**Before**:

```c
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif
```

**After**:

```c
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
    // Header already written at refill boundary (C1-C6)
    return (void*)((uint8_t*)base + 1);  // Fast: skip write, direct offset
} else {
    // C0, C7, or WRITE_ONCE=0: traditional path
    return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif
```

**Inline optimization**:

```c
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        return (void*)((uint8_t*)base + 1);  // Prefilled, skip write
    }
    return tiny_region_id_write_header(base, class_idx);  // Traditional
#else
    (void)class_idx;
    return base;
#endif
}
```

### Step 3: Stats counters (optional)

**File**: `core/box/tiny_header_write_once_stats_box.h`

```c
typedef struct {
    uint64_t refill_prefill_count;  // Headers prefilled at refill
    uint64_t alloc_skip_count;      // Allocations that skipped the header write
    uint64_t alloc_write_count;     // Allocations that wrote the header (C0, C7)
} TinyHeaderWriteOnceStats;

extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
```
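If the optional counters are adopted, the natural hook point on the allocation side is the Step 2 helper. A sketch of the wiring, assuming the names defined above; the instrumented variant (`tiny_header_finalize_alloc_stats`) is illustrative only, and in practice the increments would sit behind their own compile-time stats gate so the release hot path keeps the cycle budget estimated in the next section:

```c
// Illustrative wiring only -- a hypothetical stats-enabled variant of the
// Step 2 helper. Assumes the Step 2/Step 3 headers above are included;
// anything not defined in those snippets is an assumption.
__thread TinyHeaderWriteOnceStats g_header_write_once_stats;

static inline void* tiny_header_finalize_alloc_stats(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        g_header_write_once_stats.alloc_skip_count++;   // header was prefilled at refill
        return (void*)((uint8_t*)base + 1);
    }
    g_header_write_once_stats.alloc_write_count++;      // C0, C7, or gate off
    return tiny_region_id_write_header(base, class_idx);
#else
    (void)class_idx;
    return base;
#endif
}
```

`refill_prefill_count` would be bumped once per block inside the Step 1 prefill loop.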
---

## Expected Performance Impact

### Cost Breakdown (Before, WRITE_ONCE=0)

**Hot path** (every allocation):

```
tiny_region_id_write_header():
1. NULL check                                                (1 cycle)
2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx  (2-3 cycles, store)
3. Offset calculation: return (uint8_t*)base + 1             (1 cycle)
Total: ~5 cycles per allocation
```

**perf profile**: 3.35% self% → **~1.5M ops/s** of overhead at the 43.998M ops/s baseline

### Optimized Path (WRITE_ONCE=1, C1-C6)

**Refill boundary** (once per 2048 allocations):

```
unified_cache_refill():
Loop over refill_count (~128-256 blocks):
  *(uint8_t*)base = header_byte   (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
```

**Hot path** (every allocation):

```
tiny_header_finalize_alloc():
1. Branch: if (write_once && preserves)   (1 cycle, predicted)
2. Offset: return (uint8_t*)base + 1      (1 cycle)
Total: ~2 cycles per allocation
```

**Net savings**: 5 cycles → 2 cycles = **3 cycles per allocation** (60% reduction)

### Expected Gain

**Formula**: 3.35% overhead × 60% reduction = **~2.0% throughput gain**
**Conservative estimate**: +1.0% to +2.5% (accounting for branch misprediction and ENV-check overhead)
**Target**: 43.998M → **44.9M-45.1M ops/s** (+2.0% to +2.5%)

---

## Safety & Rollback

### Safety Mechanisms

1. **ENV gate**: `HAKMEM_TINY_HEADER_WRITE_ONCE=0` reverts to the traditional path
2. **Class filter**: Only C1-C6 use write-once (C0, C7 always write the header)
3. **Fail-safe**: If ENV=1 but the refill prefill is broken, the hot path still works (writes the header; see the debug sketch after the rollback plan)
4. **No ABI change**: User pointers are identical; this is an internal optimization only

### Rollback Plan

```bash
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
```

**Rollback triggers**:

- A/B test mean gain is below +1.0% (NEUTRAL → freeze as research box)
- A/B test mean is below -1.0% (NO-GO → freeze)
- Health check fails (revert preset default)
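To make the fail-safe in mechanism 3 observable rather than silent, a debug-only check could verify the prefill at pop time. A hypothetical sketch, not part of the plan above; the name `tiny_header_debug_check_prefill` and its call site are assumptions, and it relies on the constants and predicates from the earlier snippets:

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical debug-only guard (assumes HEADER_MAGIC, HEADER_CLASS_MASK,
// and the gate/predicate from the snippets above): if WRITE_ONCE=1 promised
// a prefilled header for this class, verify the byte actually survived
// before the pointer is handed out. A broken refill prefill then fails
// fast in debug builds instead of surfacing as a misclassified block later.
static inline void tiny_header_debug_check_prefill(void* base, int class_idx) {
#ifndef NDEBUG
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t expect = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        assert(*(const uint8_t*)base == expect && "refill prefill missed a block");
    }
#else
    (void)base; (void)class_idx;
#endif
}
```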
---

## Integration Points

### Files to modify

1. **core/box/tiny_header_write_once_env_box.h** (new):
   - ENV gate: `tiny_header_write_once_enabled()`
2. **core/box/tiny_header_write_once_stats_box.h** (new, optional):
   - Stats counters for observability
3. **core/box/tiny_header_box.h** (existing):
   - New function: `tiny_header_finalize_alloc(base, class_idx)`
   - Inline logic for write-once vs traditional
4. **core/front/tiny_unified_cache.c** (existing):
   - Modify `unified_cache_refill()` to prefill headers
5. **core/front/malloc_tiny_fast.h** (existing):
   - Replace `tiny_region_id_write_header()` calls with `tiny_header_finalize_alloc()`
   - ~15-20 call sites
6. **core/bench_profile.h** (existing, after GO):
   - Add `HAKMEM_TINY_HEADER_WRITE_ONCE=1` to the `MIXED_TINYV3_C7_SAFE` preset

---

## A/B Test Plan

### Baseline (WRITE_ONCE=0)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**; collect mean/median/stddev.

### Optimized (WRITE_ONCE=1)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**; collect mean/median/stddev.

### GO/NO-GO Criteria

- **GO**: mean >= +1.0% (promote to MIXED preset)
- **NEUTRAL**: -1.0% < mean < +1.0% (freeze as research box)
- **NO-GO**: mean <= -1.0% (freeze, do not pursue)

### Health Check

```bash
scripts/verify_health_profiles.sh
```

**Requirements**:

- MIXED_TINYV3_C7_SAFE: no regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: no regression vs baseline

---

## Success Metrics

### Performance

- **Primary**: Mixed throughput +1.0% or higher (mean)
- **Secondary**: `tiny_region_id_write_header` self% drops from 3.35% to below 1.5%

### Correctness

- **No SEGV**: All benchmarks pass without segmentation faults
- **No assert failures**: Debug builds pass validation
- **Health check**: All profiles pass functional tests

---

## Key Insights (Box Theory)

### Why This Works

1. **Single Source of Truth**: `tiny_class_preserves_header()` encapsulates the C1-C6 logic
2. **Boundary Optimization**: Write cost moves from hot (N times) to cold (1 time)
3. **Deduplication**: Eliminates redundant header writes on freelist reuse
4. **Fail-fast**: C0, C7 continue to write headers (no special-case complexity)

### Design Patterns

- **L0 Gate**: ENV flag with static cache (zero runtime cost after the first call)
- **L1 Cold Boundary**: Refill is the cold path (amortized cost is negligible)
- **L1 Hot Path**: Branch is well predicted (write_once=1 is a stable state)
- **Safety**: Class-based filtering ensures correctness

### Comparison to E5-1 Success

- **E5-1 strategy**: Consolidation (eliminate redundant checks in the wrapper)
- **E5-2 strategy**: Deduplication (eliminate redundant header writes)
- **Common pattern**: "Do once what you were doing N times"

---

## Next Steps

1. **Implement**: Create the ENV box, modify the refill boundary, update the hot paths
2. **A/B test**: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
3. **Validate**: Health check on all profiles
4. **Decide**: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
5. **Document**: Update `CURRENT_TASK.md` and `ENV_PROFILE_PRESETS.md`

---

**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: DESIGN COMPLETE, ready for implementation