Phase 13 v1: Header Write Elimination (C7 preserve header)
- Verdict: NEUTRAL (+0.78%)
- Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF)
- Makes C7 nextptr offset conditional (0→1 when enabled)
- 4-point matrix A/B test results:
* Case A (baseline): 51.49M ops/s
* Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%)
* Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%)
* Case D (both): 51.89M ops/s (+0.78% NEUTRAL)
- Action: Freeze as research box (default OFF, manual opt-in)
Phase 5 E5-2: Header Write-Once retest (promotion test)
- Verdict: NEUTRAL (+0.54%)
- Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run
- Results (20-run):
* Case A (baseline): 51.10M ops/s
* Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%)
- Previous test: +0.45% (consistent with NEUTRAL)
- Action: Keep as research box (default OFF, manual opt-in)
Key findings:
- Header write tax optimization shows consistent NEUTRAL results
- Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%)
- Both implemented as reversible ENV gates for future research
Files changed:
- New: core/box/tiny_c7_preserve_header_env_box.{c,h}
- Modified: core/box/tiny_layout_box.h (C7 offset conditional)
- Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (add new .o files)
- Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV)
- Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results)
Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO)
Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)
Status
Target: tiny_region_id_write_header (3.35% self% in perf profile)
Baseline: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
Goal: +1-3% by moving header writes from allocation hot path to refill cold boundary
Update (2025-12-14):
- The Phase 13 v1 4-point matrix observed +1.13% for HAKMEM_TINY_HEADER_WRITE_ONCE=1 alone (promotion candidate): docs/analysis/PHASE13_HEADER_WRITE_ELIMINATION_1_AB_TEST_RESULTS.md
- A dedicated clean-env 20-run retest showed only +0.54% (NEUTRAL), so promotion was declined:
docs/analysis/PHASE5_E5_2_HEADER_WRITE_ONCE_RETEST_AB_TEST_RESULTS.md
Hypothesis
Problem: tiny_region_id_write_header() is called on every allocation, writing the same header multiple times for reused blocks:
- First allocation: Block carved from slab → header written
- Free: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
- Second allocation: Block popped from TLS → header written AGAIN (redundant for C1-C6)
Observation:
- C1-C6 (16B-1024B): Headers are preserved in freelist (next pointer at offset +1)
- Rewriting the same header on every allocation is pure waste
- C0, C7 (8B, 2048B): Headers are overwritten by next pointer (offset 0)
- Must write header on every allocation (cannot skip)
Opportunity: For C1-C6, write header once at refill boundary (when block is initially created), skip writes on subsequent allocations.
Box Theory Design
L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
static int cached = -1;
if (cached == -1) {
cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
}
return cached;
}
Default: 0 (OFF, research box)
MIXED preset: 1 (ON after GO)
L1: Refill Boundary (unified_cache_refill)
Current flow (core/front/tiny_unified_cache.c):
// unified_cache_refill() populates TLS cache from backend
// Backend returns BASE pointers (no header written yet)
// Each allocation calls tiny_region_id_write_header(base, class_idx)
Optimized flow (write-once):
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
void* base = slots[i];
if (tiny_class_preserves_header(class_idx)) {
// Write header ONCE at refill boundary
*(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
}
// C0, C7: Skip (header will be overwritten by next pointer anyway)
}
Hot path change:
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx); // 3.35% self%
// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1); // Direct offset, no write
// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx); // Still need write
Implementation Strategy
Step 1: Refill-time header prefill
File: core/front/tiny_unified_cache.c
Function: unified_cache_refill()
Modification:
static void unified_cache_refill(int class_idx) {
// ... existing refill logic ...
// After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
for (int i = 0; i < refill_count; i++) {
void* base = cache->slots[tail_idx];
*(uint8_t*)base = header_byte;
tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
}
}
#endif
}
Safety:
- Only prefills for C1-C6 (tiny_class_preserves_header())
- C0, C7 are skipped (headers will be overwritten anyway)
- Uses existing HEADER_MAGIC constant
- Fail-fast: If WRITE_ONCE=1 but headers are not prefilled, the hot path still writes the header (no corruption)
Step 2: Hot path skip logic
File: core/front/malloc_tiny_fast.h
Functions: All allocation paths (tiny_hot_alloc_fast, etc.)
Before:
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif
After:
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
// Header already written at refill boundary (C1-C6)
return (void*)((uint8_t*)base + 1); // Fast: skip write, direct offset
} else {
// C0, C7, or WRITE_ONCE=0: Traditional path
return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif
Inline optimization:
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
return (void*)((uint8_t*)base + 1); // Prefilled, skip write
}
return tiny_region_id_write_header(base, class_idx); // Traditional
#else
(void)class_idx;
return base;
#endif
}
Step 3: Stats counters (optional)
File: core/box/tiny_header_write_once_stats_box.h
typedef struct {
uint64_t refill_prefill_count; // Headers prefilled at refill
uint64_t alloc_skip_count; // Allocations that skipped header write
uint64_t alloc_write_count; // Allocations that wrote header (C0, C7)
} TinyHeaderWriteOnceStats;
extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
Expected Performance Impact
Cost Breakdown (Before - WRITE_ONCE=0)
Hot path (every allocation):
tiny_region_id_write_header():
1. NULL check (1 cycle)
2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
3. Offset calculation: return (uint8_t*)base + 1 (1 cycle)
Total: ~5 cycles per allocation
perf profile: 3.35% self% → ~1.5M ops/s overhead at 43.998M ops/s baseline
Optimized Path (WRITE_ONCE=1, C1-C6)
Refill boundary (once per 2048 allocations):
unified_cache_refill():
Loop over refill_count (~128-256 blocks):
*(uint8_t*)base = header_byte (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
Hot path (every allocation):
tiny_header_finalize_alloc():
1. Branch: if (write_once && preserves) (1 cycle, predicted)
2. Offset: return (uint8_t*)base + 1 (1 cycle)
Total: ~2 cycles per allocation
Net savings: 5 cycles → 2 cycles = 3 cycles per allocation (60% reduction)
Expected Gain
Formula: 3.35% overhead × 60% reduction = ~2.0% throughput gain
Conservative estimate: +1.0% to +2.5% (accounting for branch misprediction, ENV check overhead)
Target: 43.998M → 44.9M - 45.1M ops/s (+2.0% to +2.5%)
Safety & Rollback
Safety Mechanisms
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0 reverts to the traditional path
- Class filter: Only C1-C6 use write-once (C0, C7 always write the header)
- Fail-safe: If ENV=1 but refill prefill is broken, hot path still works (writes header)
- No ABI change: User pointers identical, only internal optimization
Rollback Plan
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
Rollback triggers:
- A/B test shows <+1.0% gain (NEUTRAL → freeze as research box)
- A/B test shows <-1.0% regression (NO-GO → freeze)
- Health check fails (revert preset default)
Integration Points
Files to modify
- core/box/tiny_header_write_once_env_box.h (new):
  - ENV gate: tiny_header_write_once_enabled()
- core/box/tiny_header_write_once_stats_box.h (new, optional):
  - Stats counters for observability
- core/box/tiny_header_box.h (existing):
  - New function: tiny_header_finalize_alloc(base, class_idx)
  - Inline logic for write-once vs traditional
- core/front/tiny_unified_cache.c (existing):
  - Modify unified_cache_refill() to prefill headers
- core/front/malloc_tiny_fast.h (existing):
  - Replace tiny_region_id_write_header() calls with tiny_header_finalize_alloc() (~15-20 call sites)
- core/bench_profile.h (existing, after GO):
  - Add HAKMEM_TINY_HEADER_WRITE_ONCE=1 to the MIXED_TINYV3_C7_SAFE preset
A/B Test Plan
Baseline (WRITE_ONCE=0)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
Run 10 times, collect mean/median/stddev.
Optimized (WRITE_ONCE=1)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
Run 10 times, collect mean/median/stddev.
GO/NO-GO Criteria
- GO: mean >= +1.0% (promote to MIXED preset)
- NEUTRAL: -1.0% < mean < +1.0% (freeze as research box)
- NO-GO: mean <= -1.0% (freeze, do not pursue)
Health Check
scripts/verify_health_profiles.sh
Requirements:
- MIXED_TINYV3_C7_SAFE: No regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: No regression vs baseline
Success Metrics
Performance
- Primary: Mixed throughput +1.0% or higher (mean)
- Secondary: tiny_region_id_write_header self% drops from 3.35% to <1.5%
Correctness
- No SEGV: All benchmarks pass without segmentation faults
- No assert failures: Debug builds pass validation
- Health check: All profiles pass functional tests
Key Insights (Box Theory)
Why This Works
- Single Source of Truth: tiny_class_preserves_header() encapsulates the C1-C6 logic
- Boundary Optimization: Write cost moved from hot path (N times) to cold path (1 time)
- Deduplication: Eliminates redundant header writes on freelist reuse
- Fail-fast: C0, C7 continue to write headers (no special case complexity)
Design Patterns
- L0 Gate: ENV flag with static cache (zero runtime cost)
- L1 Cold Boundary: Refill is cold path (amortized cost is negligible)
- L1 Hot Path: Branch predicted (write_once=1 is stable state)
- Safety: Class-based filtering ensures correctness
Comparison to E5-1 Success
- E5-1 strategy: Consolidation (eliminate redundant checks in wrapper)
- E5-2 strategy: Deduplication (eliminate redundant header writes)
- Common pattern: "Do once what you were doing N times"
Next Steps
- Implement: Create ENV box, modify refill boundary, update hot paths
- A/B test: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
- Validate: Health check on all profiles
- Decide: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
- Document: Update CURRENT_TASK.md and ENV_PROFILE_PRESETS.md
Date: 2025-12-14
Phase: 5 E5-2
Status: DESIGN COMPLETE, ready for implementation