Commit f7b18aaf13 (Moe Charm, CI): Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc

Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median

Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)

Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)

Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)

Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance

Health Check: PASS (all profiles)

Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%

Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 06:22:25 +09:00


Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)

Status

Target: tiny_region_id_write_header (3.35% self% in perf profile)
Baseline: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
Goal: +1-3% by moving header writes from the allocation hot path to the refill cold boundary


Hypothesis

Problem: tiny_region_id_write_header() is called on every allocation, writing the same header multiple times for reused blocks:

  1. First allocation: Block carved from slab → header written
  2. Free: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
  3. Second allocation: Block popped from TLS → header written AGAIN (redundant for C1-C6)

Observation:

  • C1-C6 (16B-1024B): Headers are preserved in freelist (next pointer at offset +1)
    • Rewriting the same header on every allocation is pure waste
  • C0, C7 (8B, 2048B): Headers are overwritten by next pointer (offset 0)
    • Must write header on every allocation (cannot skip)

Opportunity: For C1-C6, write header once at refill boundary (when block is initially created), skip writes on subsequent allocations.
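
To make the layout assumption concrete, here is a minimal sketch of the two freelist-push variants implied above (illustrative only; the offsets follow the description above, and the function names are hypothetical):

#include <stdint.h>

/* C1-C6: next pointer stored at base+1, so the header byte at base[0]
 * survives a free/alloc cycle and never needs rewriting. */
static inline void freelist_push_preserving(void** head, void* base) {
    *(void**)((uint8_t*)base + 1) = *head;  /* header at base[0] untouched */
    *head = base;
}

/* C0, C7: next pointer stored at base+0, clobbering the header byte,
 * so the header must be rewritten on every allocation. */
static inline void freelist_push_clobbering(void** head, void* base) {
    *(void**)base = *head;                  /* overwrites header at base[0] */
    *head = base;
}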


Box Theory Design

L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)

// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
    static int cached = -1;
    if (cached == -1) {
        cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
    }
    return cached;
}

Default: 0 (OFF, research box)
MIXED preset: 1 (ON after GO)
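
getenv_flag() above is assumed to be an existing helper; a minimal sketch of the semantics this gate relies on (the real helper may parse more formats):

#include <stdlib.h>

/* Sketch: return def when the variable is unset or empty, else treat any
 * value other than "0" as 1. Assumed behavior, not the actual helper. */
static int getenv_flag(const char* name, int def) {
    const char* v = getenv(name);
    if (v == NULL || v[0] == '\0') return def;
    return (v[0] != '0');
}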

L1: Refill Boundary (unified_cache_refill)

Current flow (core/front/tiny_unified_cache.c):

// unified_cache_refill() populates TLS cache from backend
// Backend returns BASE pointers (no header written yet)
// Each allocation calls tiny_region_id_write_header(base, class_idx)

Optimized flow (write-once):

// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
    void* base = slots[i];
    if (tiny_class_preserves_header(class_idx)) {
        // Write header ONCE at refill boundary
        *(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
    }
    // C0, C7: Skip (header will be overwritten by next pointer anyway)
}

Hot path change:

// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx);  // 3.35% self%

// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1);  // Direct offset, no write

// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx);  // Still need write

Implementation Strategy

Step 1: Refill-time header prefill

File: core/front/tiny_unified_cache.c
Function: unified_cache_refill()

Modification:

static void unified_cache_refill(int class_idx) {
    // ... existing refill logic ...
    // (cache, tail_idx, refill_count are assumed to be the refill state
    //  already in scope here, named as in the existing function)

    // After populating slots[], prefill headers (C1-C6 only)
    #if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
        for (int i = 0; i < refill_count; i++) {
            void* base = cache->slots[tail_idx];
            *(uint8_t*)base = header_byte;
            tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
        }
    }
    #endif
}

Safety:

  • Only prefills for C1-C6 (tiny_class_preserves_header(); see the sketch after this list)
  • C0, C7 are skipped (headers will be overwritten anyway)
  • Uses existing HEADER_MAGIC constant
  • Fail-fast: If WRITE_ONCE=1 but headers not prefilled, hot path still writes header (no corruption)
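
The class filter leans on tiny_class_preserves_header() as the single source of truth; a minimal sketch of the predicate as this design assumes it behaves (the real implementation may differ):

/* Sketch: true for C1-C6 (next pointer at offset +1, header preserved),
 * false for C0 and C7 (next pointer overlaps the header byte). */
static inline int tiny_class_preserves_header(int class_idx) {
    return (class_idx >= 1) && (class_idx <= 6);
}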

Step 2: Hot path skip logic

File: core/front/malloc_tiny_fast.h
Functions: All allocation paths (tiny_hot_alloc_fast, etc.)

Before:

#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif

After:

#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
    // Header already written at refill boundary (C1-C6)
    return (void*)((uint8_t*)base + 1);  // Fast: skip write, direct offset
} else {
    // C0, C7, or WRITE_ONCE=0: Traditional path
    return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif

Inline optimization:

// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
        return (void*)((uint8_t*)base + 1);  // Prefilled, skip write
    }
    return tiny_region_id_write_header(base, class_idx);  // Traditional
#else
    (void)class_idx;
    return base;
#endif
}
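
Each of the ~15-20 call sites in malloc_tiny_fast.h then collapses to a single call, since both the #if and the class branch live inside the helper:

// Call-site replacement (all allocation paths):
return tiny_header_finalize_alloc(base, class_idx);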

Step 3: Stats counters (optional)

File: core/box/tiny_header_write_once_stats_box.h

typedef struct {
    uint64_t refill_prefill_count;  // Headers prefilled at refill
    uint64_t alloc_skip_count;      // Allocations that skipped header write
    uint64_t alloc_write_count;     // Allocations that wrote header (C0, C7)
} TinyHeaderWriteOnceStats;

extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
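
A sketch of how the counters could be bumped and dumped; the increment helpers and the dump function are illustrative additions, not existing code:

#include <stdio.h>

/* TLS definition matching the extern declaration above. */
__thread TinyHeaderWriteOnceStats g_header_write_once_stats;

static inline void hwo_note_skip(void)  { g_header_write_once_stats.alloc_skip_count++; }
static inline void hwo_note_write(void) { g_header_write_once_stats.alloc_write_count++; }

/* Example dump, e.g. registered via atexit() in debug builds. */
static void hwo_dump(void) {
    fprintf(stderr, "write_once: prefill=%llu skip=%llu write=%llu\n",
            (unsigned long long)g_header_write_once_stats.refill_prefill_count,
            (unsigned long long)g_header_write_once_stats.alloc_skip_count,
            (unsigned long long)g_header_write_once_stats.alloc_write_count);
}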

Expected Performance Impact

Cost Breakdown (Before - WRITE_ONCE=0)

Hot path (every allocation):

tiny_region_id_write_header():
  1. NULL check (1 cycle)
  2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
  3. Offset calculation: return (uint8_t*)base + 1 (1 cycle)
  Total: ~5 cycles per allocation

perf profile: 3.35% self% → ~1.5M ops/s overhead at 43.998M ops/s baseline

Optimized Path (WRITE_ONCE=1, C1-C6)

Refill boundary (once per 2048 allocations):

unified_cache_refill():
  Loop over refill_count (~128-256 blocks):
    *(uint8_t*)base = header_byte (2 cycles × 128 = 256 cycles)
  Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc

Hot path (every allocation):

tiny_header_finalize_alloc():
  1. Branch: if (write_once && preserves) (1 cycle, predicted)
  2. Offset: return (uint8_t*)base + 1 (1 cycle)
  Total: ~2 cycles per allocation

Net savings: 5 cycles → 2 cycles = 3 cycles per allocation (60% reduction)

Expected Gain

Formula: 3.35% overhead × 60% reduction = ~2.0% throughput gain

Conservative estimate: +1.0% to +2.5% (accounting for branch misprediction, ENV check overhead)

Target: 43.998M → 44.9M - 45.1M ops/s (+2.0% to +2.5%)


Safety & Rollback

Safety Mechanisms

  1. ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0 reverts to traditional path
  2. Class filter: Only C1-C6 use write-once (C0, C7 always write header)
  3. Fail-safe: If ENV=1 but refill prefill is broken, hot path still works (writes header)
  4. No ABI change: User pointers identical, only internal optimization

Rollback Plan

# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1

Rollback triggers:

  • A/B test shows less than +1.0% gain (NEUTRAL → freeze as research box)
  • A/B test shows a regression beyond -1.0% (NO-GO → freeze)
  • Health check fails (revert preset default)

Integration Points

Files to modify

  1. core/box/tiny_header_write_once_env_box.h (new):
    • ENV gate: tiny_header_write_once_enabled()
  2. core/box/tiny_header_write_once_stats_box.h (new, optional):
    • Stats counters for observability
  3. core/box/tiny_header_box.h (existing):
    • New function: tiny_header_finalize_alloc(base, class_idx)
    • Inline logic for write-once vs traditional
  4. core/front/tiny_unified_cache.c (existing):
    • Modify unified_cache_refill() to prefill headers
  5. core/front/malloc_tiny_fast.h (existing):
    • Replace tiny_region_id_write_header() calls with tiny_header_finalize_alloc()
    • ~15-20 call sites
  6. core/bench_profile.h (existing, after GO):
    • Add HAKMEM_TINY_HEADER_WRITE_ONCE=1 to MIXED_TINYV3_C7_SAFE preset

A/B Test Plan

Baseline (WRITE_ONCE=0)

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
  ./bench_random_mixed_hakmem 20000000 400 1

Run 10 times, collect mean/median/stddev.

Optimized (WRITE_ONCE=1)

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
  HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
  ./bench_random_mixed_hakmem 20000000 400 1

Run 10 times, collect mean/median/stddev.
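
One way to script a 10-run batch (a sketch; assumes the benchmark prints its throughput on stdout, so the log can be post-processed for mean/median/stddev):

for i in $(seq 1 10); do
  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
    HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
    ./bench_random_mixed_hakmem 20000000 400 1
done | tee write_once_on.log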

GO/NO-GO Criteria

  • GO: mean >= +1.0% (promote to MIXED preset)
  • NEUTRAL: -1.0% < mean < +1.0% (freeze as research box)
  • NO-GO: mean <= -1.0% (freeze, do not pursue)

Health Check

scripts/verify_health_profiles.sh

Requirements:

  • MIXED_TINYV3_C7_SAFE: No regression vs baseline
  • C6_HEAVY_LEGACY_POOLV1: No regression vs baseline

Success Metrics

Performance

  • Primary: Mixed throughput +1.0% or higher (mean)
  • Secondary: tiny_region_id_write_header self% drops from 3.35% to <1.5% (see the perf sketch after this list)
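
The secondary metric can be re-checked with standard perf usage (a sketch; exact options and symbol availability depend on the build):

HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
  perf record -g ./bench_random_mixed_hakmem 20000000 400 1
perf report --no-children | grep tiny_region_id_write_header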

Correctness

  • No SEGV: All benchmarks pass without segmentation faults
  • No assert failures: Debug builds pass validation
  • Health check: All profiles pass functional tests

Key Insights (Box Theory)

Why This Works

  1. Single Source of Truth: tiny_class_preserves_header() encapsulates C1-C6 logic
  2. Boundary Optimization: Write cost moved from hot (N times) to cold (1 time)
  3. Deduplication: Eliminates redundant header writes on freelist reuse
  4. Fail-fast: C0, C7 continue to write headers (no special case complexity)

Design Patterns

  • L0 Gate: ENV flag with static cache (near-zero steady-state cost: one cached-flag branch per call)
  • L1 Cold Boundary: Refill is cold path (amortized cost is negligible)
  • L1 Hot Path: Branch predicted (write_once=1 is stable state)
  • Safety: Class-based filtering ensures correctness

Comparison to E5-1 Success

  • E5-1 strategy: Consolidation (eliminate redundant checks in wrapper)
  • E5-2 strategy: Deduplication (eliminate redundant header writes)
  • Common pattern: "Do once what you were doing N times"

Next Steps

  1. Implement: Create ENV box, modify refill boundary, update hot paths
  2. A/B test: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
  3. Validate: Health check on all profiles
  4. Decide: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
  5. Document: Update CURRENT_TASK.md and ENV_PROFILE_PRESETS.md

Date: 2025-12-14
Phase: 5 E5-2
Status: DESIGN COMPLETE, ready for implementation