# Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)

## Status

**Target**: `tiny_region_id_write_header` (3.35% self% in the perf profile)

**Baseline**: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)

**Goal**: +1-3% by moving header writes from the allocation hot path to the cold refill boundary

---
## Hypothesis

**Problem**: `tiny_region_id_write_header()` is called on **every** allocation, writing the same header multiple times for reused blocks:

1. **First allocation**: Block carved from slab → header written
2. **Free**: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
3. **Second allocation**: Block popped from TLS → **header written AGAIN** (redundant for C1-C6)

**Observation**:

- **C1-C6** (16B-1024B): Headers are **preserved** in the freelist (next pointer at offset +1)
  - Rewriting the same header on every allocation is pure waste
- **C0, C7** (8B, 2048B): Headers are **overwritten** by the next pointer (offset 0)
  - The header must be written on every allocation (cannot skip)

**Opportunity**: For C1-C6, write the header **once** at the refill boundary (when the block is first created) and skip the write on subsequent allocations.

---
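To make the class split concrete, here is a self-contained sketch of the layout argument above. The constants (`HEADER_MAGIC`, `HEADER_CLASS_MASK`) and the helper names are illustrative stand-ins, not the allocator's real definitions: a next pointer stored at offset +1 leaves the header byte at offset 0 intact (C1-C6), while a next pointer at offset 0 clobbers it (C0, C7).

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the real header constants. */
enum { HEADER_MAGIC = 0xA0, HEADER_CLASS_MASK = 0x0F };

/* Freelist push: C1-C6 store the next pointer at base+1 (header survives),
 * C0/C7 store it at base+0 (header is clobbered). */
static void freelist_push(uint8_t* base, void* next, int preserves_header) {
    size_t off = preserves_header ? 1 : 0;
    memcpy(base + off, &next, sizeof next);
}

static int header_intact(const uint8_t* base, int class_idx) {
    return base[0] == (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
}

/* C3 block: the header byte at offset 0 survives a freelist push. */
static int demo_c3_header_survives(void) {
    uint8_t block[16];
    block[0] = (uint8_t)(HEADER_MAGIC | 3);
    freelist_push(block, (void*)0, 1);      /* next pointer lands at +1 */
    return header_intact(block, 3);
}

/* C0 block: the next pointer lands on offset 0 and destroys the header. */
static int demo_c0_header_clobbered(void) {
    uint8_t block[16];
    block[0] = (uint8_t)(HEADER_MAGIC | 0);
    freelist_push(block, (void*)0x1000, 0); /* next pointer lands at +0 */
    return header_intact(block, 0);
}
```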
## Box Theory Design

### L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)

```c
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
    static int cached = -1;
    if (cached == -1) {
        cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
    }
    return cached;
}
```
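`getenv_flag()` is referenced above but not shown in this document; the following is a minimal sketch of what it is assumed to do (return the default when the variable is unset, otherwise treat the value as a boolean) — the real helper may differ:

```c
#include <stdlib.h>

/* Assumed semantics of the getenv_flag() helper used by the ENV gate:
 * unset -> def; "0" or empty -> 0; anything else -> 1. */
static int getenv_flag(const char* name, int def) {
    const char* v = getenv(name);
    if (v == NULL) return def;
    return (v[0] != '\0' && v[0] != '0') ? 1 : 0;
}
```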
**Default**: 0 (OFF, research box)
**MIXED preset**: 1 (ON after GO)

### L1: Refill Boundary (unified_cache_refill)

**Current flow** (core/front/tiny_unified_cache.c):

```c
// unified_cache_refill() populates the TLS cache from the backend.
// The backend returns BASE pointers (no header written yet).
// Each allocation then calls tiny_region_id_write_header(base, class_idx).
```
**Optimized flow** (write-once):

```c
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
    void* base = slots[i];
    if (tiny_class_preserves_header(class_idx)) {
        // Write header ONCE at refill boundary
        *(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
    }
    // C0, C7: skip (header will be overwritten by the next pointer anyway)
}
```
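A standalone, runnable version of the prefill loop above, with hypothetical stand-ins for the constants and the class predicate; it checks that C1-C6 slots receive the header byte while C0/C7 refills skip the prefill entirely:

```c
#include <stdint.h>

/* Hypothetical stand-ins, not the allocator's real definitions. */
enum { HEADER_MAGIC = 0xA0, HEADER_CLASS_MASK = 0x0F, DEMO_SLOTS = 4 };

static int tiny_class_preserves_header(int class_idx) {
    return class_idx >= 1 && class_idx <= 6;   /* C1-C6 per the design */
}

/* Prefill header bytes for a batch of refilled blocks. Returns the number
 * of headers written (0 for C0/C7, which skip the prefill). */
static int prefill_headers(uint8_t* slots[], int count, int class_idx) {
    if (!tiny_class_preserves_header(class_idx)) return 0;
    uint8_t header_byte = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));
    for (int i = 0; i < count; i++) slots[i][0] = header_byte;
    return count;
}

/* Simulate one refill of DEMO_SLOTS blocks; returns the number of headers
 * written, or -1 if a written header does not match magic|class. */
static int demo_prefill_class(int class_idx) {
    uint8_t blocks[DEMO_SLOTS][16];
    uint8_t* slots[DEMO_SLOTS];
    for (int i = 0; i < DEMO_SLOTS; i++) { blocks[i][0] = 0; slots[i] = blocks[i]; }
    int written = prefill_headers(slots, DEMO_SLOTS, class_idx);
    for (int i = 0; i < written; i++)
        if (slots[i][0] != (uint8_t)(HEADER_MAGIC | class_idx)) return -1;
    return written;
}
```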
**Hot path change**:

```c
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx);  // 3.35% self%

// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1);  // Direct offset, no write

// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx);  // Still needs the write
```

---
## Implementation Strategy

### Step 1: Refill-time header prefill

**File**: `core/front/tiny_unified_cache.c`
**Function**: `unified_cache_refill()`

**Modification**:

```c
static void unified_cache_refill(int class_idx) {
    // ... existing refill logic ...

    // After populating slots[], prefill headers (C1-C6 only).
    // refill_count and tail_idx come from the existing refill logic above.
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        for (int i = 0; i < refill_count; i++) {
            void* base = cache->slots[tail_idx];
            *(uint8_t*)base = header_byte;
            tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
        }
    }
#endif
}
```
**Safety**:

- Only prefills for C1-C6 (`tiny_class_preserves_header()`)
- C0, C7 are skipped (headers will be overwritten anyway)
- Uses the existing `HEADER_MAGIC` constant
- Fail-fast: if `WRITE_ONCE=1` but headers were not prefilled, the hot path still writes the header (no corruption)
### Step 2: Hot path skip logic

**File**: `core/front/malloc_tiny_fast.h`
**Functions**: All allocation paths (`tiny_hot_alloc_fast`, etc.)

**Before**:

```c
#if HAKMEM_TINY_HEADER_CLASSIDX
    return tiny_region_id_write_header(base, class_idx);
#else
    return base;
#endif
```

**After**:

```c
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        // Header already written at refill boundary (C1-C6)
        return (void*)((uint8_t*)base + 1);  // Fast: skip write, direct offset
    } else {
        // C0, C7, or WRITE_ONCE=0: traditional path
        return tiny_region_id_write_header(base, class_idx);
    }
#else
    return base;
#endif
```
**Inline optimization**:

```c
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        return (void*)((uint8_t*)base + 1);  // Prefilled, skip write
    }
    return tiny_region_id_write_header(base, class_idx);  // Traditional
#else
    (void)class_idx;
    return base;
#endif
}
```
### Step 3: Stats counters (optional)

**File**: `core/box/tiny_header_write_once_stats_box.h`

```c
typedef struct {
    uint64_t refill_prefill_count;  // Headers prefilled at refill
    uint64_t alloc_skip_count;      // Allocations that skipped the header write
    uint64_t alloc_write_count;     // Allocations that wrote the header (C0, C7)
} TinyHeaderWriteOnceStats;

extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
```

---
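A sketch of how these counters might be maintained; only the struct itself is part of the design above — the helper names here are hypothetical:

```c
#include <stdint.h>

typedef struct {
    uint64_t refill_prefill_count;  /* Headers prefilled at refill */
    uint64_t alloc_skip_count;      /* Allocations that skipped the write */
    uint64_t alloc_write_count;     /* Allocations that wrote the header */
} TinyHeaderWriteOnceStats;

static __thread TinyHeaderWriteOnceStats g_header_write_once_stats;

/* Called once per refill with the number of headers prefilled. */
static void stats_on_refill(int prefilled) {
    g_header_write_once_stats.refill_prefill_count += (uint64_t)prefilled;
}

/* Called once per allocation: skipped_write=1 for the C1-C6 fast path. */
static void stats_on_alloc(int skipped_write) {
    if (skipped_write) g_header_write_once_stats.alloc_skip_count++;
    else               g_header_write_once_stats.alloc_write_count++;
}

static uint64_t stats_total_allocs(void) {
    return g_header_write_once_stats.alloc_skip_count
         + g_header_write_once_stats.alloc_write_count;
}

/* One refill of 128 blocks followed by three allocations (two skips, one write). */
static uint64_t demo_stats(void) {
    stats_on_refill(128);
    stats_on_alloc(1);
    stats_on_alloc(1);
    stats_on_alloc(0);
    return stats_total_allocs();
}
```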
## Expected Performance Impact

### Cost Breakdown (Before: WRITE_ONCE=0)

**Hot path** (every allocation):

```
tiny_region_id_write_header():
  1. NULL check                                               (1 cycle)
  2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
  3. Offset calculation: return (uint8_t*)base + 1            (1 cycle)
Total: ~5 cycles per allocation
```

**perf profile**: 3.35% self% → **~1.5M ops/s** of overhead at the 43.998M ops/s baseline

### Optimized Path (WRITE_ONCE=1, C1-C6)

**Refill boundary** (once per 2048 allocations):

```
unified_cache_refill():
  Loop over refill_count (~128-256 blocks):
    *(uint8_t*)base = header_byte  (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
```
**Hot path** (every allocation):

```
tiny_header_finalize_alloc():
  1. Branch: if (write_once && preserves)  (1 cycle, predicted)
  2. Offset: return (uint8_t*)base + 1     (1 cycle)
Total: ~2 cycles per allocation
```

**Net savings**: 5 cycles → 2 cycles = **3 cycles per allocation** (60% reduction)

### Expected Gain

**Formula**: 3.35% overhead × 60% reduction = **~2.0% throughput gain**

**Conservative estimate**: +1.0% to +2.5% (accounting for branch misprediction and ENV-check overhead)

**Target**: 43.998M → **44.9M-45.1M ops/s** (+2.0% to +2.5%)

---
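The target range is simple arithmetic on the figures quoted above; as a sanity check:

```c
/* Projected throughput: baseline Mops/s scaled by the expected gain.
 * Inputs are the figures quoted in this document (43.998M baseline,
 * +2.0% to +2.5% expected gain). */
static double projected_mops(double baseline_mops, double gain_pct) {
    return baseline_mops * (1.0 + gain_pct / 100.0);
}
```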
## Safety & Rollback

### Safety Mechanisms

1. **ENV gate**: `HAKMEM_TINY_HEADER_WRITE_ONCE=0` reverts to the traditional path
2. **Class filter**: Only C1-C6 use write-once (C0, C7 always write the header)
3. **Fail-safe**: If ENV=1 but the refill prefill is broken, the hot path still works (writes the header)
4. **No ABI change**: User pointers are identical; this is an internal optimization only

### Rollback Plan

```bash
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
```

**Rollback triggers**:

- A/B test shows a gain below +1.0% (NEUTRAL → freeze as research box)
- A/B test shows a regression beyond -1.0% (NO-GO → freeze)
- Health check fails (revert the preset default)

---
## Integration Points

### Files to modify

1. **core/box/tiny_header_write_once_env_box.h** (new):
   - ENV gate: `tiny_header_write_once_enabled()`
2. **core/box/tiny_header_write_once_stats_box.h** (new, optional):
   - Stats counters for observability
3. **core/box/tiny_header_box.h** (existing):
   - New function: `tiny_header_finalize_alloc(base, class_idx)`
   - Inline logic for write-once vs traditional
4. **core/front/tiny_unified_cache.c** (existing):
   - Modify `unified_cache_refill()` to prefill headers
5. **core/front/malloc_tiny_fast.h** (existing):
   - Replace `tiny_region_id_write_header()` calls with `tiny_header_finalize_alloc()`
   - ~15-20 call sites
6. **core/bench_profile.h** (existing, after GO):
   - Add `HAKMEM_TINY_HEADER_WRITE_ONCE=1` to the `MIXED_TINYV3_C7_SAFE` preset

---
## A/B Test Plan

### Baseline (WRITE_ONCE=0)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**; collect mean/median/stddev.

### Optimized (WRITE_ONCE=1)

```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
```

**Run 10 times**; collect mean/median/stddev.

### GO/NO-GO Criteria

- **GO**: mean >= +1.0% (promote to MIXED preset)
- **NEUTRAL**: -1.0% < mean < +1.0% (freeze as research box)
- **NO-GO**: mean <= -1.0% (freeze, do not pursue)

### Health Check

```bash
scripts/verify_health_profiles.sh
```
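The GO/NO-GO thresholds above can be expressed as a small decision helper (an illustrative sketch, not part of the codebase):

```c
/* Decision helper mirroring the GO/NO-GO thresholds above. */
typedef enum { DECISION_GO, DECISION_NEUTRAL, DECISION_NO_GO } ABDecision;

static ABDecision ab_decide(double mean_gain_pct) {
    if (mean_gain_pct >= 1.0)  return DECISION_GO;      /* promote to preset */
    if (mean_gain_pct <= -1.0) return DECISION_NO_GO;   /* freeze, do not pursue */
    return DECISION_NEUTRAL;                            /* freeze as research box */
}
```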
**Requirements**:

- MIXED_TINYV3_C7_SAFE: No regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: No regression vs baseline

---
## Success Metrics

### Performance

- **Primary**: Mixed throughput +1.0% or higher (mean)
- **Secondary**: `tiny_region_id_write_header` self% drops from 3.35% to <1.5%

### Correctness

- **No SEGV**: All benchmarks pass without segmentation faults
- **No assert failures**: Debug builds pass validation
- **Health check**: All profiles pass functional tests

---
## Key Insights (Box Theory)

### Why This Works

1. **Single source of truth**: `tiny_class_preserves_header()` encapsulates the C1-C6 logic
2. **Boundary optimization**: Write cost moves from hot (N times) to cold (once)
3. **Deduplication**: Eliminates redundant header writes on freelist reuse
4. **Fail-fast**: C0 and C7 continue to write headers (no special-case complexity)

### Design Patterns

- **L0 gate**: ENV flag with a static cache (near-zero runtime cost)
- **L1 cold boundary**: Refill is a cold path (amortized cost is negligible)
- **L1 hot path**: Branch is predicted (write_once=1 is a stable state)
- **Safety**: Class-based filtering ensures correctness

### Comparison to E5-1 Success

- **E5-1 strategy**: Consolidation (eliminate redundant checks in the wrapper)
- **E5-2 strategy**: Deduplication (eliminate redundant header writes)
- **Common pattern**: "Do once what you were doing N times"

---
## Next Steps

1. **Implement**: Create the ENV box, modify the refill boundary, update hot paths
2. **A/B test**: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
3. **Validate**: Health check on all profiles
4. **Decide**: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
5. **Document**: Update `CURRENT_TASK.md` and `ENV_PROFILE_PRESETS.md`

---

**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: DESIGN COMPLETE, ready for implementation