Phase 13 v1: Header Write Elimination (C7 preserve header)
- Verdict: NEUTRAL (+0.78%)
- Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF)
- Makes C7 nextptr offset conditional (0→1 when enabled)
- 4-point matrix A/B test results:
* Case A (baseline): 51.49M ops/s
* Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%)
* Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%)
* Case D (both): 51.89M ops/s (+0.78% NEUTRAL)
- Action: Freeze as research box (default OFF, manual opt-in)
Phase 5 E5-2: Header Write-Once retest (promotion test)
- Verdict: NEUTRAL (+0.54%)
- Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run
- Results (20-run):
* Case A (baseline): 51.10M ops/s
* Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%)
- Previous test: +0.45% (consistent with NEUTRAL)
- Action: Keep as research box (default OFF, manual opt-in)
Key findings:
- Header write tax optimization shows consistent NEUTRAL results
- Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%)
- Both implemented as reversible ENV gates for future research
Files changed:
- New: core/box/tiny_c7_preserve_header_env_box.{c,h}
- Modified: core/box/tiny_layout_box.h (C7 offset conditional)
- Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (add new .o files)
- Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV)
- Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results)
Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO)
Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)
Status
Target: tiny_region_id_write_header (3.35% self% in perf profile)
Baseline: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
Goal: +1-3% by moving header writes from allocation hot path to refill cold boundary
Update (2025-12-14):
- The Phase 13 v1 4-point matrix observed +1.13% for HAKMEM_TINY_HEADER_WRITE_ONCE=1 alone (promotion candidate): docs/analysis/PHASE13_HEADER_WRITE_ELIMINATION_1_AB_TEST_RESULTS.md
- A dedicated clean-env 20-run retest showed only +0.54% (NEUTRAL), so promotion was declined:
docs/analysis/PHASE5_E5_2_HEADER_WRITE_ONCE_RETEST_AB_TEST_RESULTS.md
Hypothesis
Problem: tiny_region_id_write_header() is called on every allocation, writing the same header multiple times for reused blocks:
- First allocation: Block carved from slab → header written
- Free: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
- Second allocation: Block popped from TLS → header written AGAIN (redundant for C1-C6)
Observation:
- C1-C6 (16B-1024B): Headers are preserved in freelist (next pointer at offset +1)
- Rewriting the same header on every allocation is pure waste
- C0, C7 (8B, 2048B): Headers are overwritten by next pointer (offset 0)
- Must write header on every allocation (cannot skip)
Opportunity: For C1-C6, write header once at refill boundary (when block is initially created), skip writes on subsequent allocations.
Box Theory Design
L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
static int cached = -1;
if (cached == -1) {
cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
}
return cached;
}
Default: 0 (OFF, research box)
MIXED preset: 1 (ON after GO)
L1: Refill Boundary (unified_cache_refill)
Current flow (core/front/tiny_unified_cache.c):
// unified_cache_refill() populates TLS cache from backend
// Backend returns BASE pointers (no header written yet)
// Each allocation calls tiny_region_id_write_header(base, class_idx)
Optimized flow (write-once):
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
void* base = slots[i];
if (tiny_class_preserves_header(class_idx)) {
// Write header ONCE at refill boundary
*(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
}
// C0, C7: Skip (header will be overwritten by next pointer anyway)
}
Hot path change:
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx); // 3.35% self%
// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1); // Direct offset, no write
// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx); // Still need write
Implementation Strategy
Step 1: Refill-time header prefill
File: core/front/tiny_unified_cache.c
Function: unified_cache_refill()
Modification:
static void unified_cache_refill(int class_idx) {
// ... existing refill logic ...
// After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
for (int i = 0; i < refill_count; i++) {
void* base = cache->slots[tail_idx];
*(uint8_t*)base = header_byte;
tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
}
}
#endif
}
Safety:
- Only prefills for C1-C6 (tiny_class_preserves_header())
- C0, C7 are skipped (headers will be overwritten anyway)
- Uses existing HEADER_MAGIC constant
- Fail-fast: If WRITE_ONCE=1 but headers are not prefilled, the hot path still writes the header (no corruption)
Step 2: Hot path skip logic
File: core/front/malloc_tiny_fast.h
Functions: All allocation paths (tiny_hot_alloc_fast, etc.)
Before:
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif
After:
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
// Header already written at refill boundary (C1-C6)
return (void*)((uint8_t*)base + 1); // Fast: skip write, direct offset
} else {
// C0, C7, or WRITE_ONCE=0: Traditional path
return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif
Inline optimization:
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
return (void*)((uint8_t*)base + 1); // Prefilled, skip write
}
return tiny_region_id_write_header(base, class_idx); // Traditional
#else
(void)class_idx;
return base;
#endif
}
Step 3: Stats counters (optional)
File: core/box/tiny_header_write_once_stats_box.h
typedef struct {
uint64_t refill_prefill_count; // Headers prefilled at refill
uint64_t alloc_skip_count; // Allocations that skipped header write
uint64_t alloc_write_count; // Allocations that wrote header (C0, C7)
} TinyHeaderWriteOnceStats;
extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
Expected Performance Impact
Cost Breakdown (Before - WRITE_ONCE=0)
Hot path (every allocation):
tiny_region_id_write_header():
1. NULL check (1 cycle)
2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
3. Offset calculation: return (uint8_t*)base + 1 (1 cycle)
Total: ~5 cycles per allocation
perf profile: 3.35% self% → ~1.5M ops/s overhead at 43.998M ops/s baseline
Optimized Path (WRITE_ONCE=1, C1-C6)
Refill boundary (once per 2048 allocations):
unified_cache_refill():
Loop over refill_count (~128-256 blocks):
*(uint8_t*)base = header_byte (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
Hot path (every allocation):
tiny_header_finalize_alloc():
1. Branch: if (write_once && preserves) (1 cycle, predicted)
2. Offset: return (uint8_t*)base + 1 (1 cycle)
Total: ~2 cycles per allocation
Net savings: 5 cycles → 2 cycles = 3 cycles per allocation (60% reduction)
Expected Gain
Formula: 3.35% overhead × 60% reduction = ~2.0% throughput gain
Conservative estimate: +1.0% to +2.5% (accounting for branch misprediction, ENV check overhead)
Target: 43.998M → 44.9M - 45.1M ops/s (+2.0% to +2.5%)
Safety & Rollback
Safety Mechanisms
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0 reverts to the traditional path
- Class filter: Only C1-C6 use write-once (C0, C7 always write the header)
- Fail-safe: If ENV=1 but refill prefill is broken, hot path still works (writes header)
- No ABI change: User pointers identical, only internal optimization
Rollback Plan
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
Rollback triggers:
- A/B test shows <+1.0% gain (NEUTRAL → freeze as research box)
- A/B test shows <-1.0% regression (NO-GO → freeze)
- Health check fails (revert preset default)
Integration Points
Files to modify
- core/box/tiny_header_write_once_env_box.h (new):
  - ENV gate: tiny_header_write_once_enabled()
- core/box/tiny_header_write_once_stats_box.h (new, optional):
  - Stats counters for observability
- core/box/tiny_header_box.h (existing):
  - New function: tiny_header_finalize_alloc(base, class_idx)
  - Inline logic for write-once vs traditional
- core/front/tiny_unified_cache.c (existing):
  - Modify unified_cache_refill() to prefill headers
- core/front/malloc_tiny_fast.h (existing):
  - Replace tiny_region_id_write_header() calls with tiny_header_finalize_alloc() (~15-20 call sites)
- core/bench_profile.h (existing, after GO):
  - Add HAKMEM_TINY_HEADER_WRITE_ONCE=1 to the MIXED_TINYV3_C7_SAFE preset
A/B Test Plan
Baseline (WRITE_ONCE=0)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
Run 10 times, collect mean/median/stddev.
Optimized (WRITE_ONCE=1)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
Run 10 times, collect mean/median/stddev.
GO/NO-GO Criteria
- GO: mean >= +1.0% (promote to MIXED preset)
- NEUTRAL: -1.0% < mean < +1.0% (freeze as research box)
- NO-GO: mean <= -1.0% (freeze, do not pursue)
Health Check
scripts/verify_health_profiles.sh
Requirements:
- MIXED_TINYV3_C7_SAFE: No regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: No regression vs baseline
Success Metrics
Performance
- Primary: Mixed throughput +1.0% or higher (mean)
- Secondary: tiny_region_id_write_header self% drops from 3.35% to <1.5%
Correctness
- No SEGV: All benchmarks pass without segmentation faults
- No assert failures: Debug builds pass validation
- Health check: All profiles pass functional tests
Key Insights (Box Theory)
Why This Works
- Single Source of Truth: tiny_class_preserves_header() encapsulates the C1-C6 logic
- Boundary Optimization: Write cost moved from hot path (N times) to cold path (1 time)
- Deduplication: Eliminates redundant header writes on freelist reuse
- Fail-fast: C0, C7 continue to write headers (no special case complexity)
Design Patterns
- L0 Gate: ENV flag with static cache (zero runtime cost)
- L1 Cold Boundary: Refill is cold path (amortized cost is negligible)
- L1 Hot Path: Branch predicted (write_once=1 is stable state)
- Safety: Class-based filtering ensures correctness
Comparison to E5-1 Success
- E5-1 strategy: Consolidation (eliminate redundant checks in wrapper)
- E5-2 strategy: Deduplication (eliminate redundant header writes)
- Common pattern: "Do once what you were doing N times"
Next Steps
- Implement: Create ENV box, modify refill boundary, update hot paths
- A/B test: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
- Validate: Health check on all profiles
- Decide: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
- Document: Update CURRENT_TASK.md and ENV_PROFILE_PRESETS.md
Date: 2025-12-14
Phase: 5 E5-2
Status: DESIGN COMPLETE, ready for implementation