Outcome Summary (E5-2 completion commit)
Target: tiny_region_id_write_header (3.35% self%)
Hypothesis: Headers redundant for reused blocks
Strategy: Write headers ONCE at refill boundary, skip in hot alloc
Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median
Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)
Root Cause Analysis:
- Header writes are NOT redundant: the existing code writes only when needed
- Branch overhead (~4 cycles) cancels the savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)
Key Lessons:
1. Verify assumptions before optimizing (inspect the code paths)
2. Hot-spot self% measures time spent IN a function, not the savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)
Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M): more stable performance
Health Check: PASS (all profiles)
Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%
Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)
Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)
Status
Target: tiny_region_id_write_header (3.35% self% in perf profile)
Baseline: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
Goal: +1-3% by moving header writes from allocation hot path to refill cold boundary
Hypothesis
Problem: tiny_region_id_write_header() is called on every allocation, writing the same header multiple times for reused blocks:
- First allocation: Block carved from slab → header written
- Free: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
- Second allocation: Block popped from TLS → header written AGAIN (redundant for C1-C6)
Observation:
- C1-C6 (16B-1024B): Headers are preserved in the freelist (next pointer at offset +1)
  - Rewriting the same header on every allocation is pure waste
- C0, C7 (8B, 2048B): Headers are overwritten by the next pointer (offset 0)
  - Must write the header on every allocation (cannot skip)
Opportunity: For C1-C6, write header once at refill boundary (when block is initially created), skip writes on subsequent allocations.
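For reference, a minimal sketch of the class predicate this plan leans on, assuming the C1-C6 behavior described above (the real tiny_class_preserves_header() is defined elsewhere in the tree; this is illustrative only):
// Hedged sketch, not the actual hakmem definition: C1-C6 preserve the
// header byte because the freelist next pointer lives at offset +1,
// while C0 and C7 store it at offset 0 and clobber the header.
static inline int tiny_class_preserves_header(int class_idx) {
    return class_idx >= 1 && class_idx <= 6;  /* C1 (16B) .. C6 (1024B) */
}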
Box Theory Design
L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
static int cached = -1;
if (cached == -1) {
cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
}
return cached;
}
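The gate above assumes a getenv_flag() helper; a minimal stand-in consistent with that call (the real helper may accept more formats) could look like:
#include <stdlib.h>

// Assumed helper: read an environment variable as a 0/1 flag,
// returning def when it is unset or empty.
static int getenv_flag(const char* name, int def) {
    const char* v = getenv(name);
    if (!v || !*v) return def;
    return *v != '0';
}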
Default: 0 (OFF, research box)
MIXED preset: 1 (ON after GO)
L1: Refill Boundary (unified_cache_refill)
Current flow (core/front/tiny_unified_cache.c):
// unified_cache_refill() populates TLS cache from backend
// Backend returns BASE pointers (no header written yet)
// Each allocation calls tiny_region_id_write_header(base, class_idx)
Optimized flow (write-once):
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
void* base = slots[i];
if (tiny_class_preserves_header(class_idx)) {
// Write header ONCE at refill boundary
*(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
}
// C0, C7: Skip (header will be overwritten by next pointer anyway)
}
Hot path change:
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx); // 3.35% self%
// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1); // Direct offset, no write
// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx); // Still need write
Implementation Strategy
Step 1: Refill-time header prefill
File: core/front/tiny_unified_cache.c
Function: unified_cache_refill()
Modification:
static void unified_cache_refill(int class_idx) {
// ... existing refill logic populates cache->slots[] and sets
//     tail_idx / refill_count ...
// After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
for (int i = 0; i < refill_count; i++) {
void* base = cache->slots[tail_idx];
*(uint8_t*)base = header_byte;
tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
}
}
#endif
}
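The (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK step relies on a power-of-two slot count; a standalone illustration of the wrap-around (the capacity value here is an assumption for the example):
#include <stdio.h>

#define CACHE_CAP  2048u             /* assumed power-of-two capacity */
#define CACHE_MASK (CACHE_CAP - 1u)  /* mask form of modulo */

int main(void) {
    unsigned tail_idx = CACHE_CAP - 2;
    for (int i = 0; i < 4; i++) {
        printf("%u ", tail_idx);     /* prints: 2046 2047 0 1 (wraps) */
        tail_idx = (tail_idx + 1) & CACHE_MASK;
    }
    return 0;
}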
Safety:
- Only prefills for C1-C6 (tiny_class_preserves_header())
- C0, C7 are skipped (headers will be overwritten by the next pointer anyway)
- Uses the existing HEADER_MAGIC constant
- Coverage requirement: with WRITE_ONCE=1 the hot path skips the write for C1-C6, so every refill site must prefill headers; with WRITE_ONCE=0 the traditional path writes unconditionally (no corruption)
Step 2: Hot path skip logic
File: core/front/malloc_tiny_fast.h
Functions: All allocation paths (tiny_hot_alloc_fast, etc.)
Before:
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif
After:
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
// Header already written at refill boundary (C1-C6)
return (void*)((uint8_t*)base + 1); // Fast: skip write, direct offset
} else {
// C0, C7, or WRITE_ONCE=0: Traditional path
return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif
Inline optimization:
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
return (void*)((uint8_t*)base + 1); // Prefilled, skip write
}
return tiny_region_id_write_header(base, class_idx); // Traditional
#else
(void)class_idx;
return base;
#endif
}
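A self-contained sanity check of the write-once invariant the helper depends on; the HEADER_MAGIC and HEADER_CLASS_MASK values below are assumptions for illustration, not the real constants:
#include <assert.h>
#include <stdint.h>

#define HEADER_MAGIC      0x80u  /* assumed value, illustration only */
#define HEADER_CLASS_MASK 0x07u  /* assumed value, illustration only */

int main(void) {
    uint8_t block[16];
    int class_idx = 3;  /* a C1-C6 class: header survives reuse */

    /* Refill boundary: header written once. */
    block[0] = (uint8_t)(HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK));

    /* Hot path (WRITE_ONCE=1): no write, just hand out base + 1. */
    uint8_t* user = block + 1;

    /* free() can still recover the class byte from user[-1]. */
    assert(user[-1] == (HEADER_MAGIC | class_idx));
    return 0;
}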
Step 3: Stats counters (optional)
File: core/box/tiny_header_write_once_stats_box.h
typedef struct {
uint64_t refill_prefill_count; // Headers prefilled at refill
uint64_t alloc_skip_count; // Allocations that skipped header write
uint64_t alloc_write_count; // Allocations that wrote header (C0, C7)
} TinyHeaderWriteOnceStats;
extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
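A hedged sketch of the counter plumbing; only the struct above comes from the design, and the definition and increment macro below are assumed names for illustration:
/* Per-thread definition (would live in the stats box implementation). */
__thread TinyHeaderWriteOnceStats g_header_write_once_stats = {0, 0, 0};

/* Assumed increment helper, e.g. HWO_STAT_INC(refill_prefill_count)
   at the refill boundary and HWO_STAT_INC(alloc_skip_count) in the
   hot path. */
#define HWO_STAT_INC(field) ((void)(g_header_write_once_stats.field++))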
Expected Performance Impact
Cost Breakdown (Before - WRITE_ONCE=0)
Hot path (every allocation):
tiny_region_id_write_header():
1. NULL check (1 cycle)
2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
3. Offset calculation: return (uint8_t*)base + 1 (1 cycle)
Total: ~5 cycles per allocation
perf profile: 3.35% self% → ~1.5M ops/s overhead at 43.998M ops/s baseline
Optimized Path (WRITE_ONCE=1, C1-C6)
Refill boundary (once per ~2048 allocations; refilled blocks are reused from the TLS freelist many times before another refill is needed):
unified_cache_refill():
Loop over refill_count (~128-256 blocks):
*(uint8_t*)base = header_byte (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
Hot path (every allocation):
tiny_header_finalize_alloc():
1. Branch: if (write_once && preserves) (1 cycle, predicted)
2. Offset: return (uint8_t*)base + 1 (1 cycle)
Total: ~2 cycles per allocation
Net savings: 5 cycles → 2 cycles = 3 cycles per allocation (60% reduction)
Expected Gain
Formula: 3.35% overhead × 60% reduction = ~2.0% throughput gain
Conservative estimate: +1.0% to +2.5% (accounting for branch misprediction, ENV check overhead)
Target: 43.998M → 44.9M - 45.1M ops/s (+2.0% to +2.5%)
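The projection reduces to straight arithmetic; the 5-to-2 cycle figures and the 3.35% target share are the assumptions being propagated here, not measurements:
#include <stdio.h>

int main(void) {
    double baseline    = 43.998e6;          /* Mixed baseline, ops/s */
    double target_frac = 0.0335;            /* perf self% of the header write */
    double reduction   = (5.0 - 2.0) / 5.0; /* 60% of those cycles removed */

    double gain = target_frac * reduction;  /* ~0.0201 */
    printf("expected gain: %.2f%% -> %.2fM ops/s\n",
           gain * 100.0, baseline * (1.0 + gain) / 1e6);
    /* prints: expected gain: 2.01% -> 44.88M ops/s */
    return 0;
}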
Safety & Rollback
Safety Mechanisms
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0 reverts to the traditional path (unconditional header writes)
- Class filter: Only C1-C6 use write-once (C0, C7 always write the header)
- Prefill coverage: with ENV=1 the hot path skips the write for C1-C6, so all three refill sites must prefill headers
- No ABI change: User pointers are identical; this is a purely internal optimization
Rollback Plan
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
Rollback triggers:
- A/B test shows <+1.0% gain (NEUTRAL → freeze as research box)
- A/B test shows <-1.0% regression (NO-GO → freeze)
- Health check fails (revert preset default)
Integration Points
Files to modify
- core/box/tiny_header_write_once_env_box.h (new):
  - ENV gate: tiny_header_write_once_enabled()
- core/box/tiny_header_write_once_stats_box.h (new, optional):
  - Stats counters for observability
- core/box/tiny_header_box.h (existing):
  - New function: tiny_header_finalize_alloc(base, class_idx), inline logic for write-once vs traditional
- core/front/tiny_unified_cache.c (existing):
  - Modify unified_cache_refill() to prefill headers
- core/front/malloc_tiny_fast.h (existing):
  - Replace tiny_region_id_write_header() calls with tiny_header_finalize_alloc() (~15-20 call sites)
- core/bench_profile.h (existing, after GO):
  - Add HAKMEM_TINY_HEADER_WRITE_ONCE=1 to the MIXED_TINYV3_C7_SAFE preset
A/B Test Plan
Baseline (WRITE_ONCE=0)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
Run 10 times, collect mean/median/stddev.
Optimized (WRITE_ONCE=1)
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
Run 10 times, collect mean/median/stddev.
GO/NO-GO Criteria
- GO: mean >= +1.0% (promote to MIXED preset)
- NEUTRAL: -1.0% < mean < +1.0% (freeze as research box)
- NO-GO: mean <= -1.0% (freeze, do not pursue)
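The same rule expressed as code for precision (thresholds from the criteria above; the measured E5-2 result of +0.45% mean from the outcome summary lands in NEUTRAL):
typedef enum { AB_GO, AB_NEUTRAL, AB_NO_GO } ab_decision;

/* mean_delta_pct: mean throughput delta vs baseline, in percent. */
static ab_decision ab_classify(double mean_delta_pct) {
    if (mean_delta_pct >= 1.0)  return AB_GO;     /* promote to MIXED preset */
    if (mean_delta_pct <= -1.0) return AB_NO_GO;  /* freeze, do not pursue */
    return AB_NEUTRAL;                            /* freeze as research box */
}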
Health Check
scripts/verify_health_profiles.sh
Requirements:
- MIXED_TINYV3_C7_SAFE: No regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: No regression vs baseline
Success Metrics
Performance
- Primary: Mixed throughput +1.0% or higher (mean)
- Secondary: tiny_region_id_write_header self% drops from 3.35% to <1.5%
Correctness
- No SEGV: All benchmarks pass without segmentation faults
- No assert failures: Debug builds pass validation
- Health check: All profiles pass functional tests
Key Insights (Box Theory)
Why This Works
- Single Source of Truth: tiny_class_preserves_header() encapsulates the C1-C6 logic
- Boundary Optimization: Write cost moves from the hot path (N times) to the cold path (once)
- Deduplication: Eliminates redundant header writes on freelist reuse
- No special cases: C0, C7 simply continue to write headers on every allocation
Design Patterns
- L0 Gate: ENV flag with static cache (zero runtime cost)
- L1 Cold Boundary: Refill is cold path (amortized cost is negligible)
- L1 Hot Path: Branch predicted (write_once=1 is stable state)
- Safety: Class-based filtering ensures correctness
Comparison to E5-1 Success
- E5-1 strategy: Consolidation (eliminate redundant checks in wrapper)
- E5-2 strategy: Deduplication (eliminate redundant header writes)
- Common pattern: "Do once what you were doing N times"
Next Steps
- Implement: Create ENV box, modify refill boundary, update hot paths
- A/B test: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
- Validate: Health check on all profiles
- Decide: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
- Document: Update CURRENT_TASK.md and ENV_PROFILE_PRESETS.md
Date: 2025-12-14
Phase: 5 E5-2
Status: DESIGN COMPLETE, ready for implementation