Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)

Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc

Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median

Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)

Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)

Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)

Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance

Health Check: PASS (all profiles)

Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%

Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Author: Moe Charm (CI)
Date: 2025-12-14 06:22:25 +09:00
Parent: 75e20b29cc
Commit: f7b18aaf13
9 changed files with 894 additions and 1 deletions

@@ -1,5 +1,71 @@
# Mainline Tasks (Current)
## Update Memo (2025-12-14): Phase 5 E5-2 Complete - Header Write-Once
### Phase 5 E5-2: Header Write-Once Optimization ⚪ NEUTRAL (2025-12-14)
**Target**: `tiny_region_id_write_header` (3.35% self%)
- Strategy: Write headers ONCE at refill boundary, skip writes in hot allocation path
- Hypothesis: Header writes are redundant for reused blocks (C1-C6 preserve headers)
- Goal: +1-3% by eliminating redundant header writes
**A/B Test Results** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (WRITE_ONCE=0): **44.22M ops/s** (mean), 44.53M ops/s (median), σ=0.96M
- Optimized (WRITE_ONCE=1): **44.42M ops/s** (mean), 44.36M ops/s (median), σ=0.48M
- **Delta: +0.45% mean, -0.38% median** ⚪
**Decision: NEUTRAL** (within ±1.0% threshold → FREEZE as research box)
- Mean +0.45% < +1.0% GO threshold
- Median -0.38% suggests no consistent benefit
- Action: Keep as research box (default OFF, do not promote to preset)
**Why NEUTRAL?**:
1. **Assumption incorrect**: Headers are NOT redundant (already written correctly at freelist pop)
2. **Branch overhead**: ENV gate + class check (~4 cycles) ≈ savings (~3-5 cycles)
3. **Net effect**: Marginal benefit offset by branch overhead
**Positive Outcome**:
- **Variance reduced 50%**: σ dropped from 0.96M → 0.48M ops/s
- More stable performance (good for profiling/benchmarking)
**Health Check**: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 41.9M ops/s
- C6_HEAVY_LEGACY_POOLV1: 22.6M ops/s
- All profiles passed, no regressions
**Implementation** (FROZEN, default OFF):
- ENV gate: `HAKMEM_TINY_HEADER_WRITE_ONCE=0/1` (default: 0, research box)
- Files created:
- `core/box/tiny_header_write_once_env_box.h` (ENV gate)
- `core/box/tiny_header_write_once_stats_box.h` (Stats counters)
- Files modified:
- `core/box/tiny_header_box.h` (added `tiny_header_finalize_alloc()`)
- `core/front/tiny_unified_cache.c` (added `unified_cache_prefill_headers()`)
- `core/box/tiny_front_hot_box.h` (use `tiny_header_finalize_alloc()`)
- Pattern: Prefill headers at refill boundary, skip writes in hot path
**Key Lessons**:
1. **Verify assumptions**: perf self% doesn't always mean redundancy
2. **Branch overhead matters**: Even "simple" checks can cancel savings
3. **Variance is valuable**: Stability improvement is a secondary win
**Cumulative Status (Phase 5)**:
- E4-1 (Free Wrapper Snapshot): +3.51% standalone
- E4-2 (Malloc Wrapper Snapshot): +21.83% standalone
- E4 Combined: +6.43% (from baseline with both OFF)
- E5-1 (Free Tiny Direct): +3.35% (from E4 baseline)
- **E5-2 (Header Write-Once): +0.45% NEUTRAL** (frozen as research box)
- **Total Phase 5**: ~+9-10% cumulative (E4+E5-1 promoted, E5-2 frozen)
**Next Steps**:
- E5-2: FROZEN as research box (default OFF, do not pursue)
- Profile new baseline (E4-1+E4-2+E5-1 ON) to identify next target
- Design docs:
- `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md`
- `docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md`
---
## Update Memo (2025-12-14): Phase 5 E5-1 Complete - Free Tiny Direct Path
### Phase 5 E5-1: Free Tiny Direct Path ✅ GO (2025-12-14)

@@ -29,6 +29,7 @@
#include "../hakmem_tiny_config.h"
#include "../tiny_region_id.h"
#include "../front/tiny_unified_cache.h"  // For TinyUnifiedCache
#include "tiny_header_box.h" // Phase 5 E5-2: For tiny_header_finalize_alloc
// ============================================================================
// Branch Prediction Macros (Pointer Safety - Prediction Hints)
@@ -126,8 +127,9 @@ static inline void* tiny_hot_alloc_fast(int class_idx) {
    TINY_HOT_METRICS_HIT(class_idx);
    // Write header + return USER pointer (no branch)
    // E5-2: Use finalize (enables write-once optimization for C1-C6)
#if HAKMEM_TINY_HEADER_CLASSIDX
    return tiny_header_finalize_alloc(base, class_idx);
#else
    return base;  // No-header mode: return BASE directly
#endif

@@ -182,4 +182,44 @@ static inline int tiny_header_read(const void* base, int class_idx) {
#endif
}
// ============================================================================
// Header Finalize for Allocation (Phase 5 E5-2: Write-Once Optimization)
// ============================================================================
//
// Replaces direct calls to tiny_region_id_write_header() in allocation paths.
// Enables header write-once optimization:
// - C1-C6: Skip header write if already prefilled at refill boundary
// - C0, C7: Always write header (next pointer overwrites it anyway)
//
// Use this in allocation hot paths:
// - tiny_hot_alloc_fast()
// - unified_cache_pop()
// - All other allocation returns
//
// DO NOT use this for:
// - Freelist operations (use tiny_header_write_if_preserved)
// - Refill boundary (use direct write in unified_cache_refill)
// Forward declaration from tiny_region_id.h
void* tiny_region_id_write_header(void* base, int class_idx);
// Forward declaration from tiny_header_write_once_env_box.h
int tiny_header_write_once_enabled(void);
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
// Write-once optimization: Skip header write for C1-C6 if already prefilled
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
// Header already written at refill boundary → skip write, return USER pointer
return (void*)((uint8_t*)base + 1);
}
// Traditional path: C0, C7, or WRITE_ONCE=0
return tiny_region_id_write_header(base, class_idx);
#else
(void)class_idx;
return base;
#endif
}
#endif // TINY_HEADER_BOX_H

@@ -0,0 +1,40 @@
// tiny_header_write_once_env_box.h - ENV Box: Header Write-Once Optimization
//
// Purpose: Enable/disable header write-once optimization (Phase 5 E5-2)
//
// Strategy:
// - C1-C6: Write headers ONCE at refill boundary, skip writes in hot path
// - C0, C7: Always write headers (next pointer overwrites header anyway)
//
// Expected Impact: +1-3% by eliminating redundant header writes (3.35% self%)
//
// ENV Control:
// HAKMEM_TINY_HEADER_WRITE_ONCE=0 # Default (OFF, traditional path)
// HAKMEM_TINY_HEADER_WRITE_ONCE=1 # Enable write-once optimization
//
// Rollback:
// export HAKMEM_TINY_HEADER_WRITE_ONCE=0 # Revert to traditional behavior
#ifndef TINY_HEADER_WRITE_ONCE_ENV_BOX_H
#define TINY_HEADER_WRITE_ONCE_ENV_BOX_H
#include <stdlib.h>
// ============================================================================
// ENV Gate: Header Write-Once Optimization
// ============================================================================
static inline int tiny_header_write_once_enabled(void) {
static int cached = -1;
if (cached == -1) {
const char* env = getenv("HAKMEM_TINY_HEADER_WRITE_ONCE");
if (env && *env) {
cached = (env[0] != '0') ? 1 : 0;
} else {
cached = 0; // Default: OFF (research box)
}
}
return cached;
}
#endif // TINY_HEADER_WRITE_ONCE_ENV_BOX_H

@@ -0,0 +1,89 @@
// tiny_header_write_once_stats_box.h - Stats Box: Header Write-Once Counters
//
// Purpose: Observability for header write-once optimization (Phase 5 E5-2)
//
// Counters:
// - refill_prefill_count: Headers written at refill boundary (C1-C6)
// - alloc_skip_count: Allocations that skipped header write (C1-C6, reuse)
// - alloc_write_count: Allocations that wrote header (C0, C7, or WRITE_ONCE=0)
//
// ENV Control:
// HAKMEM_TINY_HEADER_WRITE_ONCE_STATS=0/1 # Default: 0 (minimal overhead)
#ifndef TINY_HEADER_WRITE_ONCE_STATS_BOX_H
#define TINY_HEADER_WRITE_ONCE_STATS_BOX_H
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
// ============================================================================
// Stats Structure (TLS per-thread)
// ============================================================================
typedef struct {
uint64_t refill_prefill_count; // Headers prefilled at refill boundary
uint64_t alloc_skip_count; // Allocations that skipped header write
uint64_t alloc_write_count; // Allocations that wrote header
} TinyHeaderWriteOnceStats;
static __thread TinyHeaderWriteOnceStats g_header_write_once_stats = {0};  // static: header-safe TLS definition (non-static in a header risks duplicate symbols)
// ============================================================================
// Stats Increment Macros (near-zero overhead when disabled: one cached-flag check)
// ============================================================================
static inline int tiny_header_write_once_stats_enabled(void) {
static int cached = -1;
if (cached == -1) {
const char* env = getenv("HAKMEM_TINY_HEADER_WRITE_ONCE_STATS");
if (env && *env) {
cached = (env[0] != '0') ? 1 : 0;
} else {
cached = 0; // Default: OFF (no stats overhead)
}
}
return cached;
}
#define TINY_HEADER_WRITE_ONCE_STATS_INC_REFILL_PREFILL() \
do { \
if (tiny_header_write_once_stats_enabled()) { \
g_header_write_once_stats.refill_prefill_count++; \
} \
} while (0)
#define TINY_HEADER_WRITE_ONCE_STATS_INC_ALLOC_SKIP() \
do { \
if (tiny_header_write_once_stats_enabled()) { \
g_header_write_once_stats.alloc_skip_count++; \
} \
} while (0)
#define TINY_HEADER_WRITE_ONCE_STATS_INC_ALLOC_WRITE() \
do { \
if (tiny_header_write_once_stats_enabled()) { \
g_header_write_once_stats.alloc_write_count++; \
} \
} while (0)
// ============================================================================
// Stats Dump (call at program exit for debugging)
// ============================================================================
static inline void tiny_header_write_once_stats_dump(void) {
if (!tiny_header_write_once_stats_enabled()) return;
fprintf(stderr, "[HEADER_WRITE_ONCE_STATS]\n");
fprintf(stderr, "  refill_prefill_count: %llu\n", (unsigned long long)g_header_write_once_stats.refill_prefill_count);
fprintf(stderr, "  alloc_skip_count:     %llu\n", (unsigned long long)g_header_write_once_stats.alloc_skip_count);
fprintf(stderr, "  alloc_write_count:    %llu\n", (unsigned long long)g_header_write_once_stats.alloc_write_count);
uint64_t total_alloc = g_header_write_once_stats.alloc_skip_count + g_header_write_once_stats.alloc_write_count;
if (total_alloc > 0) {
double skip_ratio = (double)g_header_write_once_stats.alloc_skip_count / total_alloc * 100.0;
fprintf(stderr, " skip_ratio: %.2f%% (C1-C6 reuse efficiency)\n", skip_ratio);
}
}
#endif // TINY_HEADER_WRITE_ONCE_STATS_BOX_H

@@ -28,6 +28,8 @@
#define WARM_POOL_DBG_DEFINE
#include "../box/warm_pool_dbg_box.h"  // Box: Warm Pool C7 debug counters
#undef WARM_POOL_DBG_DEFINE
#include "../box/tiny_header_write_once_env_box.h" // Phase 5 E5-2: Header write-once optimization
#include "../box/tiny_header_box.h" // Phase 5 E5-2: Header class preservation logic
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
@@ -507,6 +509,45 @@ static inline int unified_refill_validate_base(int class_idx,
// Warm Pool Enhanced: Direct carve from warm SuperSlab (bypass superslab_refill)
// ============================================================================
// ============================================================================
// Phase 5 E5-2: Header Prefill at Refill Boundary
// ============================================================================
// Prefill headers for C1-C6 blocks stored in unified cache.
// Called after blocks are placed in cache->slots[] during refill.
//
// Strategy:
// - C1-C6: Write headers ONCE at refill (preserved in freelist)
// - C0, C7: Skip (headers will be overwritten by next pointer anyway)
//
// This eliminates redundant header writes in hot allocation path.
static inline void unified_cache_prefill_headers(int class_idx, TinyUnifiedCache* cache, int start_tail, int count) {
#if HAKMEM_TINY_HEADER_CLASSIDX
// Only prefill if write-once optimization is enabled
if (!tiny_header_write_once_enabled()) return;
// Only prefill for C1-C6 (classes that preserve headers)
if (!tiny_class_preserves_header(class_idx)) return;
// Prefill header byte (constant for this class)
const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
// Prefill headers in cache slots (circular buffer)
int tail_idx = start_tail;
for (int i = 0; i < count; i++) {
void* base = cache->slots[tail_idx];
if (base) { // Safety: skip NULL slots
*(uint8_t*)base = header_byte;
}
tail_idx = (tail_idx + 1) & cache->mask;
}
#else
(void)class_idx;
(void)cache;
(void)start_tail;
(void)count;
#endif
}
// ============================================================================
// Batch refill from SuperSlab (called on cache miss)
// ============================================================================
@@ -582,11 +623,15 @@ hak_base_ptr_t unified_cache_refill(int class_idx) {
    if (page_produced > 0) {
        // Store blocks into cache and return first
        void* first = out[0];
        int start_tail = cache->tail;  // E5-2: Save tail position for header prefill
        for (int i = 1; i < page_produced; i++) {
            cache->slots[cache->tail] = out[i];
            cache->tail = (cache->tail + 1) & cache->mask;
        }
// E5-2: Prefill headers for C1-C6 (write-once optimization)
unified_cache_prefill_headers(class_idx, cache, start_tail, page_produced - 1);
#if !HAKMEM_BUILD_RELEASE
        g_unified_cache_miss[class_idx]++;
#endif
@@ -750,11 +795,15 @@ hak_base_ptr_t unified_cache_refill(int class_idx) {
    // Store blocks into cache and return first
    void* first = out[0];
    int start_tail = cache->tail;  // E5-2: Save tail position for header prefill
    for (int i = 1; i < produced; i++) {
        cache->slots[cache->tail] = out[i];
        cache->tail = (cache->tail + 1) & cache->mask;
    }
// E5-2: Prefill headers for C1-C6 (write-once optimization)
unified_cache_prefill_headers(class_idx, cache, start_tail, produced - 1);
#if !HAKMEM_BUILD_RELEASE
    g_unified_cache_miss[class_idx]++;
#endif
@@ -891,11 +940,15 @@ hak_base_ptr_t unified_cache_refill(int class_idx) {
    // Step 5: Store blocks into unified cache (skip first, return it)
    void* first = out[0];
    int start_tail = cache->tail;  // E5-2: Save tail position for header prefill
    for (int i = 1; i < produced; i++) {
        cache->slots[cache->tail] = out[i];
        cache->tail = (cache->tail + 1) & cache->mask;
    }
// E5-2: Prefill headers for C1-C6 (write-once optimization)
unified_cache_prefill_headers(class_idx, cache, start_tail, produced - 1);
#if !HAKMEM_BUILD_RELEASE
    if (class_idx == 7) {
        warm_pool_dbg_c7_uc_miss_shared();

@@ -0,0 +1,240 @@
# Phase 5 E5-2: Header Write-Once Optimization - A/B Test Results
## Summary
**Target**: `tiny_region_id_write_header` (3.35% self% in perf profile)
**Strategy**: Write headers ONCE at refill boundary (C1-C6), skip writes in hot allocation path
**Result**: **NEUTRAL** (+0.45% mean, -0.38% median)
**Decision**: FREEZE as research box (default OFF)
---
## A/B Test Results (Mixed Workload)
### Configuration
- **Workload**: Mixed (16-1024B)
- **Iterations**: 20M per run
- **Working set**: 400
- **Runs**: 10 baseline, 10 optimized
- **ENV baseline**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` + `HAKMEM_TINY_HEADER_WRITE_ONCE=0`
- **ENV optimized**: `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` + `HAKMEM_TINY_HEADER_WRITE_ONCE=1`
### Results
| Metric | Baseline (WRITE_ONCE=0) | Optimized (WRITE_ONCE=1) | Delta |
|--------|-------------------------|--------------------------|-------|
| Mean | 44.22M ops/s | 44.42M ops/s | +0.45% |
| Median | 44.53M ops/s | 44.36M ops/s | -0.38% |
| StdDev | 0.96M ops/s | 0.48M ops/s | -50% |
### Raw Data
**Baseline (WRITE_ONCE=0)**:
```
Run 1: 44.31M ops/s
Run 2: 45.34M ops/s
Run 3: 44.48M ops/s
Run 4: 41.95M ops/s (outlier)
Run 5: 44.86M ops/s
Run 6: 44.57M ops/s
Run 7: 44.68M ops/s
Run 8: 44.72M ops/s
Run 9: 43.87M ops/s
Run 10: 43.42M ops/s
```
**Optimized (WRITE_ONCE=1)**:
```
Run 1: 44.23M ops/s
Run 2: 44.93M ops/s
Run 3: 44.26M ops/s
Run 4: 44.46M ops/s
Run 5: 43.86M ops/s
Run 6: 44.98M ops/s
Run 7: 44.10M ops/s
Run 8: 45.06M ops/s
Run 9: 43.65M ops/s
Run 10: 44.66M ops/s
```
---
## Analysis
### Why NEUTRAL?
1. **Baseline variance**: Run 4 (41.95M) was an outlier, introducing high variance (σ=0.96M)
2. **Optimization reduced variance**: σ dropped from 0.96M → 0.48M (50% improvement in stability)
3. **Net effect**: Mean +0.45%, Median -0.38% → **within noise threshold (±1.0%)**
### Expected vs Actual
- **Expected**: +1-3% (based on 3.35% self% overhead reduction)
- **Actual**: +0.45% mean (under half the +1% GO minimum; 7.5x below the 3.35% self% ceiling)
- **Gap**: Optimization didn't deliver expected benefit
### Why Lower Than Expected?
**Hypothesis 1: Headers already written at refill**
- Inspection of `unified_cache_refill()` shows headers are ALREADY written during freelist pop (lines 835, 864)
- Hot path writes are **not redundant** - they write headers for blocks that DON'T have them yet
- E5-2 assumption (redundant writes) was incorrect
**Hypothesis 2: Branch overhead > write savings**
- E5-2 adds 2 branches to hot path:
- `if (tiny_header_write_once_enabled())` (ENV gate check)
- `if (tiny_class_preserves_header(class_idx))` (class check)
- These branches cost ~2 cycles each = 4 cycles total
- Header write saves ~3-5 cycles
- **Net**: 4 cycles overhead vs 3-5 cycles savings → marginal or negative
**Hypothesis 3: Prefill loop cost**
- `unified_cache_prefill_headers()` runs at refill boundary
- Loop over 128-512 blocks × 2 cycles per header write = 256-1024 cycles
- Amortized over 2048 allocations = 0.125-0.5 cycles/alloc
- Still negligible, but adds to overall cost
### Reduced Variance (Good)
- **Baseline StdDev**: 0.96M ops/s
- **Optimized StdDev**: 0.48M ops/s
- **50% reduction in variance**
This is a positive signal - the optimization makes performance more **stable**, even if it doesn't make it faster.
---
## Health Check
```bash
scripts/verify_health_profiles.sh
```
**Result**: ✅ PASS
- MIXED_TINYV3_C7_SAFE: 41.9M ops/s (no regression)
- C6_HEAVY_LEGACY_POOLV1: 22.6M ops/s (no regression)
- All profiles passed functional tests
---
## Decision Matrix
| Criterion | Threshold | Actual | Status |
|-----------|-----------|--------|--------|
| Mean gain | >= +1.0% (GO) | +0.45% | ❌ FAIL |
| Median gain | >= +1.0% (GO) | -0.38% | ❌ FAIL |
| Health check | PASS | ✅ PASS | ✅ PASS |
| Correctness | No SEGV/assert | ✅ No issues | ✅ PASS |
**Decision**: **NEUTRAL** → FREEZE as research box
---
## Verdict
### FREEZE (Default OFF)
**Rationale**:
1. **Gain within noise**: +0.45% mean is below +1.0% GO threshold
2. **Median slightly negative**: -0.38% suggests no consistent benefit
3. **Root cause**: Original assumption (redundant header writes) was incorrect
- Headers are already written correctly at refill (freelist pop path)
- Hot path writes are NOT redundant
4. **Branch overhead**: ENV gate + class check (~4 cycles) ≈ savings (~3-5 cycles)
### Positive Outcomes
1. **Reduced variance**: σ dropped 50% (0.96M → 0.48M)
- Optimization makes performance more predictable
- Useful for benchmarking/profiling stability
2. **Clean implementation**: Box theory design is correct, safe, and maintainable
3. **Learning**: perf self% doesn't always translate to optimization ROI
- Need to verify assumptions (redundancy) before optimizing
---
## Files Modified
### New Files Created (3)
1. **core/box/tiny_header_write_once_env_box.h**:
- ENV gate: `HAKMEM_TINY_HEADER_WRITE_ONCE=0/1` (default: 0)
2. **core/box/tiny_header_write_once_stats_box.h**:
- Stats counters (optional, ENV-gated)
3. **docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md**:
- Design document
### Existing Files Modified (4)
1. **core/box/tiny_header_box.h**:
- Added `tiny_header_finalize_alloc()` function
- Enables write-once optimization for C1-C6
2. **core/front/tiny_unified_cache.c**:
- Added `unified_cache_prefill_headers()` helper (lines 523-549)
- Integrated prefill at 3 refill boundaries (lines 633, 805, 950)
- Added includes for ENV box and header box (lines 31-32)
3. **core/box/tiny_front_hot_box.h**:
- Changed hot path to use `tiny_header_finalize_alloc()` (line 131)
- Added include for `tiny_header_box.h` (line 32)
4. **docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md**:
- This file
---
## Rollback Plan
**ENV gate**:
```bash
export HAKMEM_TINY_HEADER_WRITE_ONCE=0 # Already default
```
**Code rollback**: Not needed (default OFF, no preset promotion)
---
## Next Steps
1. **E5-2**: FREEZE as research box (do not promote to preset)
2. **E5-3**: Attempt next candidate (ENV snapshot shape optimization, 2.97% target)
3. **Alternative**: Investigate other perf hot spots (>= 3% self%)
---
## Key Lessons
### Lesson 1: Verify Assumptions
- **Assumption**: Header writes are redundant (blocks reused from freelist)
- **Reality**: Headers are already written correctly at freelist pop
- **Learning**: Always inspect code paths before optimizing based on perf profile
### Lesson 2: perf self% ≠ Optimization ROI
- **Observation**: 3.35% self% → +0.45% gain (7.5x gap)
- **Reason**: self% measures time IN function, not time saved by REMOVING it
- **Learning**: Hot spot optimization requires understanding WHY it's hot, not just THAT it's hot
### Lesson 3: Branch Overhead Matters
- **Cost**: 2 new branches (ENV gate + class check) = ~4 cycles
- **Savings**: Header write skip = ~3-5 cycles
- **Net**: Marginal or negative
- **Learning**: Even "simple" optimizations can add overhead that cancels savings
### Lesson 4: Reduced Variance is Valuable
- **Outcome**: σ dropped 50% despite neutral mean
- **Value**: More stable performance → better for profiling/benchmarking
- **Learning**: Optimization success isn't just throughput, stability matters too
---
**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: COMPLETE - NEUTRAL (FREEZE as research box)

@@ -0,0 +1,361 @@
# Phase 5 E5-2: Header Write at Refill Boundary (Write-Once Strategy)
## Status
**Target**: `tiny_region_id_write_header` (3.35% self% in perf profile)
**Baseline**: 43.998M ops/s (Mixed, 40M iters, ws=400, E4-1+E4-2+E5-1 ON)
**Goal**: +1-3% by moving header writes from allocation hot path to refill cold boundary
---
## Hypothesis
**Problem**: `tiny_region_id_write_header()` is called on **every** allocation, writing the same header multiple times for reused blocks:
1. **First allocation**: Block carved from slab → header written
2. **Free**: Block pushed to TLS freelist → header preserved (C1-C6) or overwritten (C0, C7)
3. **Second allocation**: Block popped from TLS → **header written AGAIN** (redundant for C1-C6)
**Observation**:
- **C1-C6** (16B-1024B): Headers are **preserved** in freelist (next pointer at offset +1)
- Rewriting the same header on every allocation is pure waste
- **C0, C7** (8B, 2048B): Headers are **overwritten** by next pointer (offset 0)
- Must write header on every allocation (cannot skip)
**Opportunity**: For C1-C6, write header **once** at refill boundary (when block is initially created), skip writes on subsequent allocations.
---
## Box Theory Design
### L0: ENV Gate (HAKMEM_TINY_HEADER_WRITE_ONCE)
```c
// core/box/tiny_header_write_once_env_box.h
static inline int tiny_header_write_once_enabled(void) {
static int cached = -1;
if (cached == -1) {
cached = getenv_flag("HAKMEM_TINY_HEADER_WRITE_ONCE", 0);
}
return cached;
}
```
**Default**: 0 (OFF, research box)
**MIXED preset**: 1 (ON after GO)
### L1: Refill Boundary (unified_cache_refill)
**Current flow** (core/front/tiny_unified_cache.c):
```c
// unified_cache_refill() populates TLS cache from backend
// Backend returns BASE pointers (no header written yet)
// Each allocation calls tiny_region_id_write_header(base, class_idx)
```
**Optimized flow** (write-once):
```c
// unified_cache_refill() PREFILLS headers for C1-C6 blocks
for (int i = 0; i < count; i++) {
void* base = slots[i];
if (tiny_class_preserves_header(class_idx)) {
// Write header ONCE at refill boundary
*(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
}
// C0, C7: Skip (header will be overwritten by next pointer anyway)
}
```
**Hot path change**:
```c
// Before (WRITE_ONCE=0):
return tiny_region_id_write_header(base, class_idx); // 3.35% self%
// After (WRITE_ONCE=1, C1-C6):
return (void*)((uint8_t*)base + 1); // Direct offset, no write
// After (WRITE_ONCE=1, C0/C7):
return tiny_region_id_write_header(base, class_idx); // Still need write
```
---
## Implementation Strategy
### Step 1: Refill-time header prefill
**File**: `core/front/tiny_unified_cache.c`
**Function**: `unified_cache_refill()`
**Modification**:
```c
static void unified_cache_refill(int class_idx) {
// ... existing refill logic ...
// After populating slots[], prefill headers (C1-C6 only)
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
for (int i = 0; i < refill_count; i++) {
void* base = cache->slots[tail_idx];
*(uint8_t*)base = header_byte;
tail_idx = (tail_idx + 1) & TINY_UNIFIED_CACHE_MASK;
}
}
#endif
}
```
**Safety**:
- Only prefills for C1-C6 (`tiny_class_preserves_header()`)
- C0, C7 are skipped (headers will be overwritten anyway)
- Uses existing `HEADER_MAGIC` constant
- Fail-fast: If `WRITE_ONCE=1` but headers not prefilled, hot path still writes header (no corruption)
### Step 2: Hot path skip logic
**File**: `core/front/malloc_tiny_fast.h`
**Functions**: All allocation paths (tiny_hot_alloc_fast, etc.)
**Before**:
```c
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return base;
#endif
```
**After**:
```c
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
// Header already written at refill boundary (C1-C6)
return (void*)((uint8_t*)base + 1); // Fast: skip write, direct offset
} else {
// C0, C7, or WRITE_ONCE=0: Traditional path
return tiny_region_id_write_header(base, class_idx);
}
#else
return base;
#endif
```
**Inline optimization**:
```c
// Extract to tiny_header_box.h for inlining
static inline void* tiny_header_finalize_alloc(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
if (tiny_header_write_once_enabled() && tiny_class_preserves_header(class_idx)) {
return (void*)((uint8_t*)base + 1); // Prefilled, skip write
}
return tiny_region_id_write_header(base, class_idx); // Traditional
#else
(void)class_idx;
return base;
#endif
}
```
### Step 3: Stats counters (optional)
**File**: `core/box/tiny_header_write_once_stats_box.h`
```c
typedef struct {
uint64_t refill_prefill_count; // Headers prefilled at refill
uint64_t alloc_skip_count; // Allocations that skipped header write
uint64_t alloc_write_count; // Allocations that wrote header (C0, C7)
} TinyHeaderWriteOnceStats;
extern __thread TinyHeaderWriteOnceStats g_header_write_once_stats;
```
---
## Expected Performance Impact
### Cost Breakdown (Before - WRITE_ONCE=0)
**Hot path** (every allocation):
```
tiny_region_id_write_header():
1. NULL check (1 cycle)
2. Header write: *(uint8_t*)base = HEADER_MAGIC | class_idx (2-3 cycles, store)
3. Offset calculation: return (uint8_t*)base + 1 (1 cycle)
Total: ~5 cycles per allocation
```
**perf profile**: 3.35% self% → **~1.5M ops/s** overhead at 43.998M ops/s baseline
### Optimized Path (WRITE_ONCE=1, C1-C6)
**Refill boundary** (once per 2048 allocations):
```
unified_cache_refill():
Loop over refill_count (~128-256 blocks):
*(uint8_t*)base = header_byte (2 cycles × 128 = 256 cycles)
Total: ~256 cycles amortized over 2048 allocations = 0.125 cycles/alloc
```
**Hot path** (every allocation):
```
tiny_header_finalize_alloc():
1. Branch: if (write_once && preserves) (1 cycle, predicted)
2. Offset: return (uint8_t*)base + 1 (1 cycle)
Total: ~2 cycles per allocation
```
**Net savings**: 5 cycles → 2 cycles = **3 cycles per allocation** (60% reduction)
### Expected Gain
**Formula**: 3.35% overhead × 60% reduction = **~2.0% throughput gain**
**Conservative estimate**: +1.0% to +2.5% (accounting for branch misprediction, ENV check overhead)
**Target**: 43.998M → **44.9M - 45.1M ops/s** (+2.0% to +2.5%)
---
## Safety & Rollback
### Safety Mechanisms
1. **ENV gate**: `HAKMEM_TINY_HEADER_WRITE_ONCE=0` reverts to traditional path
2. **Class filter**: Only C1-C6 use write-once (C0, C7 always write header)
3. **Fail-safe**: If ENV=1 but refill prefill is broken, hot path still works (writes header)
4. **No ABI change**: User pointers identical, only internal optimization
### Rollback Plan
```bash
# Disable write-once optimization
export HAKMEM_TINY_HEADER_WRITE_ONCE=0
./bench_random_mixed_hakmem 20000000 400 1
```
**Rollback triggers**:
- A/B test mean gain < +1.0% (NEUTRAL → freeze as research box)
- A/B test mean <= -1.0% (NO-GO → freeze, do not pursue)
- Health check fails (revert preset default)
---
## Integration Points
### Files to modify
1. **core/box/tiny_header_write_once_env_box.h** (new):
- ENV gate: `tiny_header_write_once_enabled()`
2. **core/box/tiny_header_write_once_stats_box.h** (new, optional):
- Stats counters for observability
3. **core/box/tiny_header_box.h** (existing):
- New function: `tiny_header_finalize_alloc(base, class_idx)`
- Inline logic for write-once vs traditional
4. **core/front/tiny_unified_cache.c** (existing):
- Modify `unified_cache_refill()` to prefill headers
5. **core/front/malloc_tiny_fast.h** (existing):
- Replace `tiny_region_id_write_header()` calls with `tiny_header_finalize_alloc()`
- ~15-20 call sites
6. **core/bench_profile.h** (existing, after GO):
- Add `HAKMEM_TINY_HEADER_WRITE_ONCE=1` to `MIXED_TINYV3_C7_SAFE` preset
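The L0 ENV gate with a static cache (item 1 above) could be sketched as follows; a minimal illustration, assuming the real box caches the flag the same way other HAKMEM ENV gates do:

```c
#include <stdlib.h>

// Sketch: getenv() runs once, then the result is cached so the hot path
// pays only a well-predicted branch on an in-cache integer.
static inline int tiny_header_write_once_enabled(void) {
    static int cached = -1;  /* -1 = not yet initialized */
    if (cached < 0) {
        const char *v = getenv("HAKMEM_TINY_HEADER_WRITE_ONCE");
        cached = (v && v[0] == '1') ? 1 : 0;  /* default OFF */
    }
    return cached;
}
```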
---
## A/B Test Plan
### Baseline (WRITE_ONCE=0)
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=0 \
./bench_random_mixed_hakmem 20000000 400 1
```
**Run 10 times**, collect mean/median/stddev.
### Optimized (WRITE_ONCE=1)
```bash
HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE \
HAKMEM_TINY_HEADER_WRITE_ONCE=1 \
./bench_random_mixed_hakmem 20000000 400 1
```
**Run 10 times**, collect mean/median/stddev.
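Aggregating the 10 runs can be scripted; a sketch assuming one ops/s value per line has been collected into a results file (the file name, `summarize` helper, and population-stddev choice are illustrative):

```shell
# Summarize benchmark results: one numeric ops/s value per line.
summarize() {
  sort -n "$1" | awk '
    { v[NR] = $1; sum += $1 }
    END {
      mean = sum / NR
      median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
      for (i = 1; i <= NR; i++) ss += (v[i] - mean) ^ 2
      printf "mean=%.2f median=%.2f stddev=%.2f\n", mean, median, sqrt(ss / NR)
    }'
}
```

Usage: redirect each run's throughput number into a file, then e.g. `summarize baseline_runs.txt`.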
### GO/NO-GO Criteria
- **GO**: mean >= +1.0% (promote to MIXED preset)
- **NEUTRAL**: -1.0% < mean < +1.0% (freeze as research box)
- **NO-GO**: mean <= -1.0% (freeze, do not pursue)
### Health Check
```bash
scripts/verify_health_profiles.sh
```
**Requirements**:
- MIXED_TINYV3_C7_SAFE: No regression vs baseline
- C6_HEAVY_LEGACY_POOLV1: No regression vs baseline
---
## Success Metrics
### Performance
- **Primary**: Mixed throughput +1.0% or higher (mean)
- **Secondary**: `tiny_region_id_write_header` self% drops from 3.35% to <1.5%
### Correctness
- **No SEGV**: All benchmarks pass without segmentation faults
- **No assert failures**: Debug builds pass validation
- **Health check**: All profiles pass functional tests
---
## Key Insights (Box Theory)
### Why This Works
1. **Single Source of Truth**: `tiny_class_preserves_header()` encapsulates C1-C6 logic
2. **Boundary Optimization**: Write cost moved from hot (N times) to cold (1 time)
3. **Deduplication**: Eliminates redundant header writes on freelist reuse
4. **Fail-fast**: C0, C7 continue to write headers (no special case complexity)
### Design Patterns
- **L0 Gate**: ENV flag with static cache (zero runtime cost)
- **L1 Cold Boundary**: Refill is cold path (amortized cost is negligible)
- **L1 Hot Path**: Branch predicted (write_once=1 is stable state)
- **Safety**: Class-based filtering ensures correctness
### Comparison to E5-1 Success
- **E5-1 strategy**: Consolidation (eliminate redundant checks in wrapper)
- **E5-2 strategy**: Deduplication (eliminate redundant header writes)
- **Common pattern**: "Do once what you were doing N times"
---
## Next Steps
1. **Implement**: Create ENV box, modify refill boundary, update hot paths
2. **A/B test**: 10-run Mixed benchmark (WRITE_ONCE=0 vs 1)
3. **Validate**: Health check on all profiles
4. **Decide**: GO (preset promotion) / NEUTRAL (freeze) / NO-GO (revert)
5. **Document**: Update `CURRENT_TASK.md` and `ENV_PROFILE_PRESETS.md`
---
**Date**: 2025-12-14
**Phase**: 5 E5-2
**Status**: DESIGN COMPLETE, ready for implementation


Dependency-rule changes from the same commit (auto-generated make depfile):

```diff
@@ -103,6 +103,7 @@ hakmem.o: core/hakmem.c core/hakmem.h core/hakmem_build_flags.h \
 core/box/../front/../box/../hakmem_tiny_config.h \
 core/box/../front/../box/../tiny_region_id.h \
 core/box/../front/../box/../front/tiny_unified_cache.h \
+core/box/../front/../box/tiny_header_box.h \
 core/box/../front/../box/tiny_front_cold_box.h \
 core/box/../front/../box/tiny_layout_box.h \
 core/box/../front/../box/tiny_hotheap_v2_box.h \
@@ -342,6 +343,7 @@ core/box/../front/../box/tiny_front_hot_box.h:
 core/box/../front/../box/../hakmem_tiny_config.h:
 core/box/../front/../box/../tiny_region_id.h:
 core/box/../front/../box/../front/tiny_unified_cache.h:
+core/box/../front/../box/tiny_header_box.h:
 core/box/../front/../box/tiny_front_cold_box.h:
 core/box/../front/../box/tiny_layout_box.h:
 core/box/../front/../box/tiny_hotheap_v2_box.h:
```