// tiny_front_hot_box.h - Phase 4-Step2: Tiny Front Hot Path Box
//
// Purpose:     Ultra-fast allocation path (5-7 branches max)
// Contract:    TLS cache-hit path only; falls back to the cold path on a miss
// Performance: Target +10-15% (60.6M → 68-75M ops/s)
//
// Design Principles (Box Pattern):
//   1. Single Responsibility: hot path ONLY (cache hit)
//   2. Clear Contract: assumes the cache is initialized, returns NULL on miss
//   3. Observable: debug metrics (zero overhead in Release)
//   4. Safe: pointer safety via branch hints, type-safe operations
//   5. Testable: isolated from the cold path, easy to benchmark
//
// Branch Count Analysis:
//   Hot path (cache hit):
//     1. class_idx range check (UNLIKELY)
//     2. cache empty check (LIKELY hit)
//     3. (header write - no branch)
//     Total: 2 branches (down from 4-5)
//
//   Cold path (cache miss):
//     Return NULL → caller handles via tiny_cold_refill_and_alloc()
#ifndef TINY_FRONT_HOT_BOX_H
#define TINY_FRONT_HOT_BOX_H

#include <stdint.h>
#include <stddef.h>
#include "../hakmem_build_flags.h"
#include "../hakmem_tiny_config.h"
#include "../tiny_region_id.h"
#include "../front/tiny_unified_cache.h"  // For TinyUnifiedCache

// Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
//   Target: tiny_region_id_write_header (3.35% self%)
//   - Hypothesis: headers are redundant for reused blocks
//   - Strategy: write headers ONCE at the refill boundary, skip in hot alloc
//   Implementation:
//   - ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
//   - core/box/tiny_header_write_once_env_box.h: ENV gate
//   - core/box/tiny_header_write_once_stats_box.h: stats counters
//   - core/box/tiny_header_box.h: added tiny_header_finalize_alloc()
//   - core/front/tiny_unified_cache.c: prefill at 3 refill sites
//   - core/box/tiny_front_hot_box.h: use the finalize function
//   A/B test results (Mixed, 10 runs, 20M iters):
//   - Baseline  (WRITE_ONCE=0): 44.22M ops/s mean, 44.53M ops/s median
//   - Optimized (WRITE_ONCE=1): 44.42M ops/s mean, 44.36M ops/s median
//   - Improvement: +0.45% mean, -0.38% median
//   Decision: NEUTRAL (within the ±1.0% threshold)
//   - Action: FREEZE as a research box (default OFF, do not promote)
//   Root-cause analysis:
//   - Header writes are NOT redundant - the existing code writes only when needed
//   - Branch overhead (~4 cycles) cancels the savings (~3-5 cycles)
//   - perf self% ≠ optimization ROI (3.35% target → +0.45% gain)
//   Key lessons:
//   1. Verify assumptions before optimizing (inspect the code paths)
//   2. Hot-spot self% measures time IN a function, not savings from REMOVING it
//   3. Branch overhead matters (even "simple" checks add cycles)
//   Positive outcome: StdDev reduced by 50% (0.96M → 0.48M) - more stable performance
//   Health check: PASS (all profiles)
//   Next candidates: free_tiny_fast_cold (7.14% self%), unified_cache_push (3.39%),
//                    hakmem_env_snapshot_enabled (2.97%)
//   See: docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
//        docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
#include "tiny_header_box.h" // Phase 5 E5-2: For tiny_header_finalize_alloc

// Phase 15 v1: UnifiedCache FIFO→LIFO (NEUTRAL: -0.70% Mixed, +0.42% C7)
//   Transforms the existing array-based UnifiedCache from a FIFO ring into a LIFO stack.
//   A/B results:
//   - Mixed (16-1024B):     -0.70% (52,965,966 → 52,593,948 ops/s)
//   - C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s)
//   Verdict: NEUTRAL (both below the +1.0% GO threshold) - frozen as a research box
//   Implementation:
//   - L0 ENV gate:    tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1)
//   - L1 LIFO ops:    tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo)
//   - L2 integration: tiny_front_hot_box.h (mode check at entry)
//   - Reuses the existing slots[] array (no intrusive pointers)
//   Root causes of the neutral result:
//   1. Mode-check overhead (the tiny_unified_lifo_enabled() call)
//   2. Minimal LIFO-vs-FIFO locality delta in practice
//   3. The existing FIFO ring is already well optimized
//   Bonus fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
//   - Converted static inline to extern + a non-inline implementation
//   - Fixes an undefined reference during LTO linking
//   See: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md
//        docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md
#include "tiny_unified_lifo_box.h" // Phase 15 v1: UnifiedCache FIFO→LIFO
// ============================================================================
// Branch Prediction Macros (Pointer Safety - Prediction Hints)
// ============================================================================

// TINY_HOT_LIKELY: hint to the compiler that the condition is very likely true.
// Usage:  if (TINY_HOT_LIKELY(ptr != NULL)) { ... }
// Result: the CPU pipeline is optimized for the hot path; the cold path is
//         predicted as unlikely.
#define TINY_HOT_LIKELY(x)   __builtin_expect(!!(x), 1)

// TINY_HOT_UNLIKELY: hint to the compiler that the condition is very unlikely.
// Usage:  if (TINY_HOT_UNLIKELY(error)) { ... }
// Result: the CPU pipeline avoids speculative execution of the error path.
#define TINY_HOT_UNLIKELY(x) __builtin_expect(!!(x), 0)
// ============================================================================
// Debug Metrics (Zero Overhead in Release)
// ============================================================================

#if !HAKMEM_BUILD_RELEASE
// Increment the cache hit counter (debug only)
#define TINY_HOT_METRICS_HIT(class_idx) \
    do { extern __thread uint64_t g_unified_cache_hit[]; \
         g_unified_cache_hit[class_idx]++; } while (0)

// Increment the cache miss counter (debug only)
#define TINY_HOT_METRICS_MISS(class_idx) \
    do { extern __thread uint64_t g_unified_cache_miss[]; \
         g_unified_cache_miss[class_idx]++; } while (0)
#else
// Release builds: macros expand to nothing (zero overhead)
#define TINY_HOT_METRICS_HIT(class_idx)  ((void)0)
#define TINY_HOT_METRICS_MISS(class_idx) ((void)0)
#endif
// ============================================================================
// Box 2: Tiny Hot Alloc (Ultra-Fast Path)
// ============================================================================

// Ultra-fast allocation from the TLS unified cache.
//
// CONTRACT:
//   Input:         class_idx (0-7, caller must validate)
//   Output:        USER pointer (base+1) on success, NULL on miss
//   Precondition:  cache initialized (caller ensures via lazy init or prewarm)
//   Postcondition: cache head advanced, object header written
//
// PERFORMANCE:
//   Hot path (cache hit):   2 branches, 2-3 cache misses
//   Cold path (cache miss): returns NULL (caller handles)
//
// BRANCH ANALYSIS:
//   1. class_idx range check (removed - caller validates, see note in body)
//   2. cache empty check (LIKELY hit)
//   3. (no branch for header write, direct store)
//
// ASSEMBLY (expected, x86-64, FIFO hot path):
//   mov    g_unified_cache@TPOFF(%rax,%rdi,8), %rcx  ; TLS cache access
//   movzwl (%rcx), %edx                              ; head
//   movzwl 2(%rcx), %esi                             ; tail
//   cmp    %dx, %si                                  ; head != tail ?
//   je     .Lcache_miss
//   mov    8(%rcx), %rax                             ; slots
//   mov    (%rax,%rdx,8), %rax                       ; base = slots[head]
//   inc    %dx                                       ; head++
//   and    6(%rcx), %dx                              ; head & mask
//   mov    %dx, (%rcx)                               ; store head
//   movb   $0xA0, (%rax)                             ; header magic
//   or     %dil, (%rax)                              ; header |= class_idx
//   lea    1(%rax), %rax                             ; base+1 → USER
//   ret
// .Lcache_miss:
//   xor    %eax, %eax                                ; return NULL
//   ret
__attribute__((always_inline))
static inline void* tiny_hot_alloc_fast(int class_idx) {
    // Phase 15 v1: mode check at entry (once per call, not scattered in hot path)
    int lifo_mode = tiny_unified_lifo_enabled();

    extern __thread TinyUnifiedCache g_unified_cache[];

    // TLS cache access (1 cache miss)
    // NOTE: range check removed - caller (hak_tiny_size_to_class) guarantees a valid class_idx
    TinyUnifiedCache* cache = &g_unified_cache[class_idx];
    // Phase 15 v1: LIFO vs FIFO mode switch
    if (lifo_mode) {
        // === LIFO MODE: stack-based ===
        // Try to pop from the stack (tail is the stack depth)
        void* base = unified_cache_try_pop_lifo(class_idx);
        if (TINY_HOT_LIKELY(base != NULL)) {
            TINY_HOT_METRICS_HIT(class_idx);
#if HAKMEM_TINY_HEADER_CLASSIDX
            return tiny_header_finalize_alloc(base, class_idx);
#else
            return base;
#endif
        }
        // LIFO miss → fall through to the cold path
        TINY_HOT_METRICS_MISS(class_idx);
        return NULL;
    }

    // === FIFO MODE: ring-based (default) ===
    // Branch 1: cache empty check (LIKELY hit)
    //   Hot path:  cache has objects (head != tail)
    //   Cold path: cache empty (head == tail) → refill needed
    if (TINY_HOT_LIKELY(cache->head != cache->tail)) {
        // === HOT PATH: cache hit ===

        // Pop from the cache (1 cache miss for the array access)
        void* base = cache->slots[cache->head];
        cache->head = (cache->head + 1) & cache->mask;  // Fast modulo (power of 2)

        // Debug metrics (zero overhead in release)
        TINY_HOT_METRICS_HIT(class_idx);

        // Write the header and return the USER pointer (no branch)
        // E5-2: use finalize (enables the write-once optimization for C1-C6)
#if HAKMEM_TINY_HEADER_CLASSIDX
        return tiny_header_finalize_alloc(base, class_idx);
#else
        return base;  // No-header mode: return BASE directly
#endif
    }

    // === COLD PATH: cache miss ===
    // Don't refill here - let the caller handle it via tiny_cold_refill_and_alloc().
    // This keeps the hot path small and predictable.
    TINY_HOT_METRICS_MISS(class_idx);
    return NULL;
}

// ============================================================================
// Box 2b: Tiny Hot Free (Ultra-Fast Path)
// ============================================================================

// Ultra-fast free to the TLS unified cache.
//
// CONTRACT:
//   Input:         class_idx (0-7), base pointer (BASE, not USER)
//   Output:        1 = SUCCESS (pushed to cache), 0 = FULL (caller handles)
//   Precondition:  cache initialized, base is a valid BASE pointer
//   Postcondition: cache tail advanced, object pushed to the cache
//
// PERFORMANCE:
//   Hot path (cache not full): 2 branches, 2-3 cache misses
//   Cold path (cache full):    returns 0 (caller handles)
//
// BRANCH ANALYSIS:
//   1. class_idx range check (removed - caller validates, see note in body)
//   2. cache full check (UNLIKELY full)
__attribute__((always_inline))
static inline int tiny_hot_free_fast(int class_idx, void* base) {
    // Phase 15 v1: mode check at entry (once per call, not scattered in hot path)
    int lifo_mode = tiny_unified_lifo_enabled();

    extern __thread TinyUnifiedCache g_unified_cache[];

    // TLS cache access (1 cache miss)
    // NOTE: range check removed - caller guarantees a valid class_idx
    TinyUnifiedCache* cache = &g_unified_cache[class_idx];
    // Phase 15 v1: LIFO vs FIFO mode switch
    if (lifo_mode) {
        // === LIFO MODE: stack-based ===
        // Try to push onto the stack (tail is the stack depth)
        if (unified_cache_try_push_lifo(class_idx, base)) {
#if !HAKMEM_BUILD_RELEASE
            extern __thread uint64_t g_unified_cache_push[];
            g_unified_cache_push[class_idx]++;
#endif
            return 1;  // SUCCESS
        }
        // LIFO overflow → fall through to the cold path
#if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_unified_cache_full[];
        g_unified_cache_full[class_idx]++;
#endif
        return 0;  // FULL
    }

    // === FIFO MODE: ring-based (default) ===
    // Calculate the next tail (for the full check)
    uint16_t next_tail = (cache->tail + 1) & cache->mask;

    // Branch 1: cache full check (UNLIKELY full)
    //   Hot path:  cache has space (next_tail != head)
    //   Cold path: cache full (next_tail == head) → drain needed
    if (TINY_HOT_LIKELY(next_tail != cache->head)) {
        // === HOT PATH: cache has space ===

        // Push to the cache (1 cache miss for the array write)
        cache->slots[cache->tail] = base;
        cache->tail = next_tail;

        // Debug metrics (zero overhead in release)
#if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_unified_cache_push[];
        g_unified_cache_push[class_idx]++;
#endif
        return 1;  // SUCCESS
    }

    // === COLD PATH: cache full ===
    // Don't drain here - let the caller handle it via tiny_cold_drain_and_free()
#if !HAKMEM_BUILD_RELEASE
    extern __thread uint64_t g_unified_cache_full[];
    g_unified_cache_full[class_idx]++;
#endif
    return 0;  // FULL
}

// ============================================================================
// Performance Notes
// ============================================================================

// Expected improvements (Phase 4-Step2):
//   - Random Mixed 256: 60.6M → 68-75M ops/s (+10-15%)
//   - Tiny Hot 64B:     +10-15%
//
// Key optimizations:
//   1. Branch reduction: 4-5 → 2 branches (hot path)
//   2. Branch hints: LIKELY/UNLIKELY guide the CPU pipeline
//   3. Hot/cold separation: keeps the hot path small (better i-cache)
//   4. Always inline: eliminates function-call overhead
//   5. Metrics gated: zero overhead in release builds
//
// Trade-offs:
//   1. Code size: +50-100 bytes per call site (inline expansion)
//   2. Cold-path complexity: the caller must handle NULL/0 returns
//   3. Cache assumption: assumes the cache is initialized (lazy init moved to the caller)

#endif // TINY_FRONT_HOT_BOX_H