Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization
Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
- Direct SuperSlab carve (TLS SLL bypass)
- Self-contained pop-or-refill pattern
- ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128
2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
- Unified ON → direct cache access (skip all intermediate layers)
- Alloc: unified_cache_pop_or_refill() → immediate fail to slow
- Free: unified_cache_push() → fallback to SLL only if full
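As a sketch, the resulting fast-path shape looks like the following
(unified_cache_pop_or_refill()/unified_cache_push() are from this change;
tiny_alloc_slow() and tiny_sll_push() are illustrative stand-ins for the
existing slow path and TLS SLL):
    void* tiny_alloc_fast_unified(int cls) {
        void* p = unified_cache_pop_or_refill(cls); // pop, or self-refill from SuperSlab
        return p ? p : tiny_alloc_slow(cls);        // immediate fail to slow path
    }
    void tiny_free_fast_unified(int cls, void* p) {
        if (!unified_cache_push(cls, p))            // cache full?
            tiny_sll_push(cls, p);                  // fallback to TLS SLL
    }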
PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
- PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
- Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()
4. Measurement results (Random Mixed 500K / 256B):
- Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
- SSM: 512 pages (initialization footprint)
- MID/L25: 0 (unused in this workload)
- Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)
Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
- ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
- Conditional compilation cleanup
Documentation:
6. Analysis reports
- RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
- RANDOM_MIXED_SUMMARY.md: Phase 23 summary
- RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
- CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan
Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads
Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 02:47:58 +09:00
// tiny_unified_cache.c - Phase 23: Unified Frontend Cache Implementation

#include "tiny_unified_cache.h"
Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.
Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate
Solution: Secondary Prefill on Cache Miss
When warm pool becomes empty, we now do multiple superslab_refills and prefill
the pool with 3 additional HOT superslabs before attempting to carve. This
builds a working set of slabs that can sustain allocation pressure.
Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
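As a sketch of the cold path after this change (tiny_warm_pool_count()/
tiny_warm_pool_push() shapes are assumed from tiny_warm_pool.h; the budget
of 3 and g_warm_pool_stats[].prefilled are from this change):
    if (tiny_warm_pool_count(cls) == 0) {
        for (int i = 0; i < 3; i++) {            // prefill budget
            SuperSlab* extra = superslab_refill(cls);
            if (!extra) break;
            tiny_warm_pool_push(cls, extra);     // build a working set
            g_warm_pool_stats[cls].prefilled++;
        }
    }
    // then refill once more, keep that slab in TLS, and carve from it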
Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)
Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality
Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00
#include "tiny_warm_pool.h"  // Warm Pool: O(1) SuperSlab lookup
#include "../tiny_tls.h" // Phase 23-E: TinyTLSSlab, TinySlabMeta
|
|
|
|
|
|
#include "../tiny_box_geometry.h" // Phase 23-E: tiny_stride_for_class, tiny_slab_base_for_geometry
|
|
|
|
|
|
#include "../box/tiny_next_ptr_box.h" // Phase 23-E: tiny_next_read (freelist traversal)
|
2025-11-22 07:40:35 +09:00
|
|
|
|
#include "../hakmem_tiny_superslab.h" // Phase 23-E: SuperSlab, superslab_refill()
|
|
|
|
|
|
#include "../superslab/superslab_inline.h" // Phase 23-E: ss_active_add, slab_index_for, ss_slabs_capacity
|
|
|
|
|
|
#include "../hakmem_super_registry.h" // For hak_super_lookup (pointer→SuperSlab)
#include "../box/pagefault_telemetry_box.h" // Phase 24: Box PageFaultTelemetry (Tiny page touch stats)
#include "../box/ss_tier_box.h" // For ss_tier_is_hot() tier checks
|
|
|
|
|
|
#include "../box/ss_slab_meta_box.h" // For ss_active_add() and slab metadata operations
|
2025-12-04 23:39:02 +09:00
|
|
|
|
#include "../box/warm_pool_stats_box.h" // Box: Warm Pool Statistics Recording (inline)
|
|
|
|
|
|
#include "../box/slab_carve_box.h" // Box: Slab Carving (inline O(slabs) scan)
|
2025-12-05 23:41:01 +09:00
|
|
|
|
#define WARM_POOL_REL_DEFINE
|
|
|
|
|
|
#include "../box/warm_pool_rel_counters_box.h" // Box: Release-side C7 counters
|
|
|
|
|
|
#undef WARM_POOL_REL_DEFINE
|
|
|
|
|
|
#include "../box/c7_meta_used_counter_box.h" // Box: C7 meta->used increment counters
|
2025-12-04 23:39:02 +09:00
|
|
|
|
#include "../box/warm_pool_prefill_box.h" // Box: Warm Pool Prefill (secondary optimization)
|
2025-12-07 03:12:27 +09:00
|
|
|
|
#include "../box/tiny_mem_stats_box.h" // Box: Tiny front memory accounting
|
2025-12-02 20:25:48 +09:00
|
|
|
|
#include "../hakmem_env_cache.h" // Priority-2: ENV cache (eliminate syscalls)
|
2025-12-05 15:31:44 +09:00
|
|
|
|
#include "../box/tiny_page_box.h" // Tiny-Plus Page Box (C5–C7 initial hook)
|
2025-12-05 19:57:30 +09:00
|
|
|
|
#include "../box/ss_tls_bind_box.h" // Box: TLS Bind (SuperSlab -> TLS binding)
|
2025-12-05 23:41:01 +09:00
|
|
|
|
#include "../box/tiny_tls_carve_one_block_box.h" // Box: TLS carve helper (shared)
|
2025-12-06 01:34:04 +09:00
|
|
|
|
#include "../box/tiny_class_policy_box.h" // Box: per-class policy (Page/Warm caps)
|
|
|
|
|
|
#include "../box/tiny_class_stats_box.h" // Box: lightweight per-class stats
|
2025-12-05 23:41:01 +09:00
|
|
|
|
#include "../box/warm_tls_bind_logger_box.h" // Box: Warm TLS Bind logging (throttled)
|
|
|
|
|
|
#define WARM_POOL_DBG_DEFINE
|
|
|
|
|
|
#include "../box/warm_pool_dbg_box.h" // Box: Warm Pool C7 debug counters
|
|
|
|
|
|
#undef WARM_POOL_DBG_DEFINE
|
Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc
Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function
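As a sketch of the gated hot path (tiny_region_id_write_header() is the perf
target named above; tiny_header_finalize_alloc() is from this change with an
assumed signature; header_write_once_enabled() stands in for the ENV gate):
    void* p = unified_cache_pop_or_refill(cls);
    if (p) {
        if (header_write_once_enabled())
            p = tiny_header_finalize_alloc(p, cls); // header already written at refill
        else
            tiny_region_id_write_header(p, cls);    // legacy: write on every alloc
    }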
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median
Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)
Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)
Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)
Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance
Health Check: PASS (all profiles)
Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%
Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 06:22:25 +09:00
#include "../box/tiny_header_write_once_env_box.h" // Phase 5 E5-2: Header write-once optimization
#include "../box/tiny_header_box.h"                // Phase 5 E5-2: Header class preservation logic
#include <stdlib.h>
#include <string.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

// ============================================================================
// Performance Measurement: Unified Cache (ENV-gated)
// ============================================================================
// Global atomic counters for unified cache performance measurement
// ENV: HAKMEM_MEASURE_UNIFIED_CACHE=1 to enable (default: OFF)
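// Usage sketch (the benchmark binary name is illustrative):
//   HAKMEM_TINY_UNIFIED_CACHE=1 HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_random_mixed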
#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
_Atomic uint64_t g_unified_cache_hits_global = 0;
_Atomic uint64_t g_unified_cache_misses_global = 0;
_Atomic uint64_t g_unified_cache_refill_cycles_global = 0;

// Per-class counters (per-Tiny-class Unified Cache observability)
_Atomic uint64_t g_unified_cache_hits_by_class[TINY_NUM_CLASSES] = {0};
_Atomic uint64_t g_unified_cache_misses_by_class[TINY_NUM_CLASSES] = {0};
_Atomic uint64_t g_unified_cache_refill_cycles_by_class[TINY_NUM_CLASSES] = {0};

// Helper: Get cycle count (x86_64 rdtsc)
static inline uint64_t read_tsc(void) {
#if defined(__x86_64__) || defined(_M_X64)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#else
    // Fallback to clock_gettime for non-x86 platforms
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
#endif
}

// Check if measurement is enabled (cached)
static inline int unified_cache_measure_enabled(void) {
    static int g_measure = -1;
    if (__builtin_expect(g_measure == -1, 0)) {
        const char* e = getenv("HAKMEM_MEASURE_UNIFIED_CACHE");
        g_measure = (e && *e && *e != '0') ? 1 : 0;
    }
    return g_measure;
}
#endif
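
// Measurement pattern (sketch; the actual call sites live in the fast-path
// headers and are shown here only as an assumption): bracket a refill with
// read_tsc() and accumulate into the counters above, e.g.
//   uint64_t t0 = read_tsc();
//   /* ... refill ... */
//   atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_by_class[cls],
//                             read_tsc() - t0, memory_order_relaxed);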
// Phase 23-E: Forward declarations
extern __thread TinyTLSSlab g_tls_slabs[TINY_NUM_CLASSES]; // From hakmem_tiny_superslab.c
extern void ss_active_add(SuperSlab* ss, uint32_t n); // From hakmem_tiny_ss_active_box.inc
// ============================================================================
// TLS Variables (defined here, extern in header)
// ============================================================================

__thread TinyUnifiedCache g_unified_cache[TINY_NUM_CLASSES];

// Phase 3 C2 Patch 2: First Page Inline Cache (TLS per-class)
#include "tiny_first_page_cache.h"
__thread TinyFirstPageCache g_first_page_cache[TINY_NUM_CLASSES] = {0};

// Warm Pool: Per-thread warm SuperSlab pools (one per class)
__thread TinyWarmPool g_tiny_warm_pool[TINY_NUM_CLASSES] = {0};

// ============================================================================
// Metrics (Phase 23, optional for debugging)
// ============================================================================
#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
__thread uint64_t g_unified_cache_hit[TINY_NUM_CLASSES] = {0};
__thread uint64_t g_unified_cache_miss[TINY_NUM_CLASSES] = {0};
__thread uint64_t g_unified_cache_push[TINY_NUM_CLASSES] = {0};
__thread uint64_t g_unified_cache_full[TINY_NUM_CLASSES] = {0};
#endif

// Release-side lightweight telemetry (C7 Warm path only)
#if HAKMEM_BUILD_RELEASE
_Atomic uint64_t g_rel_c7_warm_pop = 0;
_Atomic uint64_t g_rel_c7_warm_push = 0;
#endif

// Warm Pool metrics (definition - declared in tiny_warm_pool.h as extern)
// Note: These are kept outside !HAKMEM_BUILD_RELEASE for profiling in release builds
__thread TinyWarmPoolStats g_warm_pool_stats[TINY_NUM_CLASSES] = {0};

#if !HAKMEM_BUILD_RELEASE
// Debug-only diagnostics for Warm Pool effectiveness
_Atomic uint64_t g_dbg_warm_prefill_attempts = 0;
_Atomic uint64_t g_dbg_warm_prefill_refill_fail = 0;
_Atomic uint64_t g_dbg_warm_prefill_push_ok = 0;
_Atomic uint64_t g_dbg_warm_prefill_push_full = 0;
_Atomic uint64_t g_dbg_warm_pop_attempts = 0;
_Atomic uint64_t g_dbg_warm_pop_hits = 0;
_Atomic uint64_t g_dbg_warm_pop_empty = 0;
_Atomic uint64_t g_dbg_warm_pop_carve_zero = 0;
#endif

// Warm TLS Bind (C7) mode selector
// mode 0: Legacy warm path (debug only; not recommended for C7)
// mode 1: Bind-only production path (C7 default)
// mode 2: Bind + TLS carve, experimental path (debug only)
// The mode is read once from HAKMEM_WARM_TLS_BIND_C7 and clamped to 0..2;
// the default is 1 (Bind-only). Release and debug builds share this logic.
static inline int warm_tls_bind_mode_c7(void) {
    static int g_warm_tls_bind_mode_c7 = -1;
    if (__builtin_expect(g_warm_tls_bind_mode_c7 == -1, 0)) {
        const char* e = getenv("HAKMEM_WARM_TLS_BIND_C7");
        int mode = (e && *e) ? atoi(e) : 1; // default = Bind-only
        if (mode < 0) mode = 0;
        if (mode > 2) mode = 2;
        g_warm_tls_bind_mode_c7 = mode;
    }
    return g_warm_tls_bind_mode_c7;
}

// Forward declaration for Warm Pool stats printer (defined later in this file)
static inline void tiny_warm_pool_print_stats(void);

// ============================================================================
// Phase 8-Step1-Fix: unified_cache_enabled() implementation (non-static)
// ============================================================================

// Enable flag (default: ON, disable with HAKMEM_TINY_UNIFIED_CACHE=0)
int unified_cache_enabled(void) {
    // Priority-2: Use cached ENV (eliminate lazy-init static overhead)
    static int g_enable = -1;
    if (__builtin_expect(g_enable == -1, 0)) {
        g_enable = HAK_ENV_TINY_UNIFIED_CACHE();
#if !HAKMEM_BUILD_RELEASE
        if (g_enable) {
            fprintf(stderr, "[Unified-INIT] unified_cache_enabled() = %d\n", g_enable);
            fflush(stderr);
        }
#else
        if (g_enable) {
            static int printed = 0;
            if (!printed) {
                fprintf(stderr, "[Rel-Unified] unified_cache_enabled() = %d\n", g_enable);
                fflush(stderr);
                printed = 1;
            }
        }
#endif
    }
    return g_enable;
}
// ============================================================================
// Init (called at thread start or lazy on first access)
// ============================================================================

void unified_cache_init(void) {
    if (!unified_cache_enabled()) return;
Phase 8 Root Cause Fix: BenchFast crash investigation and infrastructure isolation
Goal: Fix BenchFast mode crash and improve infrastructure separation
Status: Normal mode works perfectly (17.9M ops/s), BenchFast crash reduced but persists (separate issue)
Root Cause Analysis (Layers 0-3):
Layer 1: Removed unnecessary unified_cache_init() call
- Problem: Phase 8 Step 2 added unified_cache_init() to bench_fast_init()
- Design error: BenchFast uses TLS SLL strategy, NOT Unified Cache
- Impact: 16KB mmap allocations created, later misclassified as Tiny → crash
- Fix: Removed unified_cache_init() call from bench_fast_box.c lines 123-129
- Rationale: BenchFast and Unified Cache are different allocation strategies
Layer 2: Infrastructure isolation (__libc bypass)
- Problem: Infrastructure allocations (cache arrays) went through HAKMEM wrapper
- Risk: Can interact with BenchFast mode, causing path conflicts
- Fix: Use __libc_calloc/__libc_free in unified_cache_init/shutdown
- Benefit: Clean separation between workload (measured) and infrastructure (unmeasured)
- Defense: Prevents future crashes from infrastructure/workload mixing
Layer 3: Box Contract documentation
- Problem: Implicit assumptions about BenchFast behavior were undocumented
- Fix: Added comprehensive Box Contract to bench_fast_box.h (lines 13-51)
- Documents:
* Workload allocations: Tiny only, TLS SLL strategy
* Infrastructure allocations: __libc bypass, no HAKMEM interaction
* Preconditions, guarantees, and violation examples
- Benefit: Future developers understand design constraints
Layer 0: Limit prealloc to actual TLS SLL capacity
- Problem: Old code preallocated 50,000 blocks/class
- Reality: Adaptive sizing limits TLS SLL to 128 blocks/class at runtime
- Lost blocks: 50,000 - 128 = 49,872 blocks/class × 6 = 299,232 lost blocks!
- Impact: Lost blocks caused heap corruption
- Fix: Hard-code prealloc to 128 blocks/class (observed actual capacity)
- Result: 768 total blocks (128 × 6), zero lost blocks
Performance Impact:
- Normal mode: ✅ 17.9M ops/s (perfect, no regression)
- BenchFast mode: ⚠️ Still crashes (different root cause, requires further investigation)
Benefits:
- Unified Cache infrastructure properly isolated (__libc bypass)
- BenchFast Box Contract documented (prevents future misunderstandings)
- Prealloc overflow eliminated (no more lost blocks)
- Normal mode unchanged (backward compatible)
Known Issue (separate):
- BenchFast mode still crashes with "free(): invalid pointer"
- Crash location: Likely bench_random_mixed.c line 145 (BENCH_META_FREE(slots))
- Next steps: GDB debugging, AddressSanitizer build, or strace analysis
- Not caused by Phase 8 changes (pre-existing issue)
Files Modified:
- core/box/bench_fast_box.h - Box Contract documentation (Layer 3)
- core/box/bench_fast_box.c - Removed prewarm, limited prealloc (Layer 0+1)
- core/front/tiny_unified_cache.c - __libc bypass (Layer 2)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 04:51:36 +09:00
    // Layer 2 Defensive Fix: Use __libc_calloc for infrastructure allocations
    // Rationale: Cache arrays are infrastructure (not workload), bypass HAKMEM entirely
    // This prevents interaction with BenchFast mode and ensures clean separation
    extern void* __libc_calloc(size_t, size_t);

    // Initialize all classes (C0-C7)
    for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
        if (g_unified_cache[cls].slots != NULL) continue; // Already initialized

        size_t cap = unified_capacity(cls);
        g_unified_cache[cls].slots = (void**)__libc_calloc(cap, sizeof(void*));
        if (!g_unified_cache[cls].slots) {
#if !HAKMEM_BUILD_RELEASE
            fprintf(stderr, "[Unified-INIT] Failed to allocate C%d cache (%zu slots)\n", cls, cap);
            fflush(stderr);
#endif
            continue; // Skip this class, try others
        }

        tiny_mem_stats_add_unified((ssize_t)(cap * sizeof(void*)));
        g_unified_cache[cls].capacity = (uint16_t)cap;
        g_unified_cache[cls].mask = (uint16_t)(cap - 1);
        g_unified_cache[cls].head = 0;
        g_unified_cache[cls].tail = 0;
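        // Note: mask = cap - 1 assumes cap is a power of two (e.g. the
        // default cap of 128 gives mask 0x7F), so ring indices wrap with a
        // single AND instead of a modulo: next = (idx + 1) & mask.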

#if !HAKMEM_BUILD_RELEASE
        fprintf(stderr, "[Unified-INIT] C%d: %zu slots (%zu bytes)\n",
                cls, cap, cap * sizeof(void*));
        fflush(stderr);
#endif
    }
}

// ============================================================================
// Shutdown (called at thread exit, optional)
// ============================================================================
void unified_cache_shutdown(void) {
    if (!unified_cache_enabled()) return;

    // TODO: Drain caches to SuperSlab before shutdown (prevent leak)
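    // A minimal drain sketch for the TODO above (assumption: a slow-path free
    // helper such as tiny_free_slow() exists to hand blocks back to their
    // owning SuperSlab; the name is illustrative and not wired up here):
    //
    //   for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
    //       TinyUnifiedCache* c = &g_unified_cache[cls];
    //       while (c->slots && c->head != c->tail) {
    //           void* base = c->slots[c->head];
    //           c->head = (uint16_t)((c->head + 1) & c->mask);
    //           if (base) tiny_free_slow(base, cls); // hypothetical helper
    //       }
    //   }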

    // Layer 2 Defensive Fix: Use __libc_free (symmetric with __libc_calloc in init)
    extern void __libc_free(void*);

    // Free cache buffers
    for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
        if (g_unified_cache[cls].slots) {
            __libc_free(g_unified_cache[cls].slots);
            g_unified_cache[cls].slots = NULL;
        }
    }

#if !HAKMEM_BUILD_RELEASE
    fprintf(stderr, "[Unified-SHUTDOWN] All caches freed\n");
    fflush(stderr);
#endif
}

// ============================================================================
// Stats (Phase 23 metrics)
// ============================================================================

void unified_cache_print_stats(void) {
    if (!unified_cache_enabled()) return;

#if !HAKMEM_BUILD_RELEASE || HAKMEM_UNIFIED_CACHE_STATS_COMPILED
fprintf(stderr, "\n[Unified-STATS] Unified Cache Metrics:\n");
|
|
|
|
|
|
|
2025-12-18 05:55:27 +09:00
|
|
|
|
// Phase 70-3: Consistency Check - calculate totals across all classes
|
|
|
|
|
|
    uint64_t total_allocs_all = 0;
    uint64_t total_frees_all = 0;
    for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
        total_allocs_all += g_unified_cache_hit[cls] + g_unified_cache_miss[cls];
        total_frees_all += g_unified_cache_push[cls] + g_unified_cache_full[cls];
    }

    // Print consistency check BEFORE individual class stats
    fprintf(stderr, "[Unified-STATS] Consistency Check:\n");
    fprintf(stderr, "[Unified-STATS] total_allocs (hit+miss) = %llu\n",
            (unsigned long long)total_allocs_all);
    fprintf(stderr, "[Unified-STATS] total_frees (push+full) = %llu\n",
            (unsigned long long)total_frees_all);

    // Phase 70-3: WARNING logic for inconsistent counters
    static int g_consistency_warned = 0;
    if (!g_consistency_warned && total_allocs_all > 0 && total_frees_all > total_allocs_all * 2) {
        fprintf(stderr, "[Unified-STATS-WARNING] total_frees >> total_allocs detected! "
                        "Alloc counters may not be wired.\n");
        g_consistency_warned = 1;
    }

    fprintf(stderr, "\n");
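    // Illustrative output (numbers are invented for the example; on a healthy
    // run the two totals stay within roughly one cache capacity per class):
    //   [Unified-STATS] total_allocs (hit+miss) = 1000000
    //   [Unified-STATS] total_frees (push+full) = 999872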

    for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
        uint64_t total_allocs = g_unified_cache_hit[cls] + g_unified_cache_miss[cls];
        uint64_t total_frees = g_unified_cache_push[cls] + g_unified_cache_full[cls];

        if (total_allocs == 0 && total_frees == 0) continue; // Skip unused classes

        double hit_rate = (total_allocs > 0) ? (100.0 * g_unified_cache_hit[cls] / total_allocs) : 0.0;
        double full_rate = (total_frees > 0) ? (100.0 * g_unified_cache_full[cls] / total_frees) : 0.0;

        // Current occupancy
        uint16_t count = (g_unified_cache[cls].tail >= g_unified_cache[cls].head)
                             ? (g_unified_cache[cls].tail - g_unified_cache[cls].head)
                             : (g_unified_cache[cls].capacity - g_unified_cache[cls].head + g_unified_cache[cls].tail);

        fprintf(stderr, " C%d: %u/%u slots occupied, hit=%llu miss=%llu (%.1f%% hit), push=%llu full=%llu (%.1f%% full)\n",
                cls,
                count, g_unified_cache[cls].capacity,
                (unsigned long long)g_unified_cache_hit[cls],
                (unsigned long long)g_unified_cache_miss[cls],
                hit_rate,
                (unsigned long long)g_unified_cache_push[cls],
                (unsigned long long)g_unified_cache_full[cls],
                full_rate);
    }
    fflush(stderr);

    // Also print warm pool stats if enabled
    tiny_warm_pool_print_stats();
#endif
}

__attribute__((destructor))
static void unified_cache_auto_stats(void) {
    unified_cache_print_stats();
}

// ============================================================================
// Warm Pool Stats (always compiled, ENV-gated at runtime)
// ============================================================================

static inline void tiny_warm_pool_print_stats(void) {
    // Check if warm pool stats are enabled via ENV
    static int g_print_stats = -1;
    if (__builtin_expect(g_print_stats == -1, 0)) {
        const char* e = getenv("HAKMEM_WARM_POOL_STATS");
        g_print_stats = (e && *e && *e != '0') ? 1 : 0;
    }

    if (!g_print_stats) return;

    fprintf(stderr, "\n[WarmPool-STATS] Warm Pool Metrics:\n");

    for (int i = 0; i < TINY_NUM_CLASSES; i++) {
        uint64_t total = g_warm_pool_stats[i].hits + g_warm_pool_stats[i].misses;
        float hit_rate = (total > 0)
                             ? (100.0 * g_warm_pool_stats[i].hits / total)
                             : 0.0;
fprintf(stderr, " C%d: hits=%llu misses=%llu hit_rate=%.1f%% prefilled=%llu\n",
|
|
|
|
|
|
i,
|
|
|
|
|
|
(unsigned long long)g_warm_pool_stats[i].hits,
|
|
|
|
|
|
(unsigned long long)g_warm_pool_stats[i].misses,
|
|
|
|
|
|
hit_rate,
|
|
|
|
|
|
(unsigned long long)g_warm_pool_stats[i].prefilled);
|
|
|
|
|
|
}
|

#if !HAKMEM_BUILD_RELEASE
    // Debug-only aggregated diagnostics for Warm Pool
    fprintf(stderr,
            " [DBG] prefill_attempts=%llu refill_fail=%llu push_ok=%llu push_full=%llu "
            "pop_attempts=%llu pop_hits=%llu pop_empty=%llu pop_carve_zero=%llu\n",
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_prefill_attempts, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_prefill_refill_fail, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_prefill_push_ok, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_prefill_push_full, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_pop_attempts, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_pop_hits, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_pop_empty, memory_order_relaxed),
            (unsigned long long)atomic_load_explicit(&g_dbg_warm_pop_carve_zero, memory_order_relaxed));

    uint64_t c7_attempts = warm_pool_dbg_c7_attempts();
    uint64_t c7_hits = warm_pool_dbg_c7_hits();
    uint64_t c7_carve = warm_pool_dbg_c7_carves();
    uint64_t c7_tls_attempts = warm_pool_dbg_c7_tls_attempts();
    uint64_t c7_tls_success = warm_pool_dbg_c7_tls_successes();
    uint64_t c7_tls_fail = warm_pool_dbg_c7_tls_failures();
    uint64_t c7_uc_warm = warm_pool_dbg_c7_uc_miss_warm_refills();
    uint64_t c7_uc_tls = warm_pool_dbg_c7_uc_miss_tls_refills();
    uint64_t c7_uc_shared = warm_pool_dbg_c7_uc_miss_shared_refills();
    if (c7_attempts || c7_hits || c7_carve ||
        c7_tls_attempts || c7_tls_success || c7_tls_fail ||
        c7_uc_warm || c7_uc_tls || c7_uc_shared) {
        fprintf(stderr,
                " [DBG_C7] warm_pop_attempts=%llu warm_pop_hits=%llu warm_pop_carve=%llu "
                "tls_carve_attempts=%llu tls_carve_success=%llu tls_carve_fail=%llu "
                "uc_miss_warm=%llu uc_miss_tls=%llu uc_miss_shared=%llu\n",
                (unsigned long long)c7_attempts,
                (unsigned long long)c7_hits,
                (unsigned long long)c7_carve,
                (unsigned long long)c7_tls_attempts,
                (unsigned long long)c7_tls_success,
                (unsigned long long)c7_tls_fail,
                (unsigned long long)c7_uc_warm,
                (unsigned long long)c7_uc_tls,
                (unsigned long long)c7_uc_shared);
    }
#endif

    fflush(stderr);
}

// Public wrapper for benchmarks
void tiny_warm_pool_print_stats_public(void) {
    tiny_warm_pool_print_stats();
}
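// Call-site sketch (assumption: a benchmark invokes this wrapper after its
// run, with HAKMEM_WARM_POOL_STATS=1 set in the environment so the gated
// printer actually emits output):
//   tiny_warm_pool_print_stats_public();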

// ============================================================================
// Phase 23-E: Direct SuperSlab Carve (TLS SLL Bypass)
// ============================================================================

// Fail-fast helper: verify that a candidate BASE pointer belongs to a valid
// Tiny slab within a SuperSlab. This is intentionally defensive and only
// compiled in debug builds to avoid hot-path overhead in release.
static inline int unified_refill_validate_base(int class_idx,
                                               TinyTLSSlab* tls,
                                               TinySlabMeta* meta,
                                               void* base,
                                               const char* stage)
{
#if HAKMEM_BUILD_RELEASE
    (void)class_idx; (void)tls; (void)base; (void)stage; (void)meta;
    return 1;
#else
    if (!base) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=NULL tls_ss=%p meta=%p\n",
                stage ? stage : "unified_refill",
                class_idx,
                (void*)(tls ? tls->ss : NULL),
                (void*)meta);
        abort();
    }

    SuperSlab* tls_ss = tls ? tls->ss : NULL;
    if (!tls_ss || tls_ss->magic != SUPERSLAB_MAGIC) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p tls_ss=%p meta=%p (invalid TLS ss)\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                (void*)tls_ss,
                (void*)meta);
        abort();
    }

    // Cross-check registry lookup for additional safety.
    SuperSlab* ss_lookup = hak_super_lookup(base);
    if (!ss_lookup || ss_lookup->magic != SUPERSLAB_MAGIC) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p tls_ss=%p lookup_ss=%p meta=%p\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                (void*)tls_ss,
                (void*)ss_lookup,
                (void*)meta);
        abort();
    }
    if (ss_lookup != tls_ss) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p tls_ss=%p lookup_ss=%p (mismatch)\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                (void*)tls_ss,
                (void*)ss_lookup);
        abort();
    }

    int slab_idx = tls ? (int)tls->slab_idx : -1;
    int cap = ss_slabs_capacity(tls_ss);
    if (slab_idx < 0 || slab_idx >= cap) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p tls_ss=%p slab_idx=%d cap=%d meta_cap=%u meta_used=%u meta_carved=%u\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                (void*)tls_ss,
                slab_idx,
                cap,
                meta ? meta->capacity : 0u,
                meta ? (unsigned)meta->used : 0u,
                meta ? (unsigned)meta->carved : 0u);
        abort();
    }

    // Ensure meta matches TLS view for this slab.
    TinySlabMeta* expected_meta = &tls_ss->slabs[slab_idx];
    if (meta && meta != expected_meta) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p tls_ss=%p slab_idx=%d meta=%p expected_meta=%p\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                (void*)tls_ss,
                slab_idx,
                (void*)meta,
                (void*)expected_meta);
        abort();
    }

    uint8_t* slab_base = tiny_slab_base_for_geometry(tls_ss, slab_idx);
    size_t stride = tiny_stride_for_class(class_idx);
    size_t usable = tiny_usable_bytes_for_slab(slab_idx);
    uint8_t* slab_end = slab_base + usable;

    if ((uint8_t*)base < slab_base || (uint8_t*)base >= slab_end) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p range=[%p,%p) stride=%zu meta_cap=%u meta_used=%u meta_carved=%u\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                (void*)slab_base,
                (void*)slab_end,
                stride,
                meta ? meta->capacity : 0u,
                meta ? (unsigned)meta->used : 0u,
                meta ? (unsigned)meta->carved : 0u);
        abort();
    }

    ptrdiff_t offset = (uint8_t*)base - slab_base;
    if (offset % (ptrdiff_t)stride != 0) {
        fprintf(stderr,
                "[UNIFIED_REFILL_CORRUPT] stage=%s cls=%d base=%p offset=%td stride=%zu (misaligned) meta_cap=%u meta_used=%u meta_carved=%u\n",
                stage ? stage : "unified_refill",
                class_idx,
                base,
                offset,
                stride,
                meta ? meta->capacity : 0u,
                meta ? (unsigned)meta->used : 0u,
                meta ? (unsigned)meta->carved : 0u);
        abort();
    }

    return 1;
#endif
}
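// Usage sketch (assumption: refill sites call this right after carving a
// candidate block; in release builds the call collapses to `return 1` and
// costs nothing on the hot path):
//   void* base = /* carve from tls->ss */;
//   unified_refill_validate_base(class_idx, tls, meta, base, "carve");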

// ============================================================================
// Warm Pool Enhanced: Direct carve from warm SuperSlab (bypass superslab_refill)
// ============================================================================
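// Carve sketch (assumption: the warm pool exposes a per-class pop such as
// tiny_warm_pool_pop(); the name is illustrative, the real helper lives in
// the warm-pool box elsewhere in this codebase):
//   SuperSlab* ss = tiny_warm_pool_pop(class_idx);   // hypothetical helper
//   if (ss) { /* carve directly into cache->slots[], skip superslab_refill */ }
//   else    { /* fall back to the superslab_refill() registry scan */ }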

// ============================================================================
// Phase 5 E5-2: Header Prefill at Refill Boundary
// ============================================================================
// Prefill headers for C1-C6 blocks stored in the unified cache.
// Called after blocks are placed in cache->slots[] during refill.
//
// Strategy:
//   - C1-C6: Write headers ONCE at refill (preserved in the freelist)
//   - C0, C7: Skip (headers are overwritten by the next pointer anyway)
//
// This eliminates redundant header writes in the hot allocation path.
static inline void unified_cache_prefill_headers(int class_idx, TinyUnifiedCache* cache, int start_tail, int count) {
#if HAKMEM_TINY_HEADER_CLASSIDX && HAKMEM_TINY_HEADER_WRITE_ONCE_COMPILED
    // Only prefill if write-once optimization is enabled
    if (!tiny_header_write_once_enabled()) return;

    // Only prefill for C1-C6 (classes that preserve headers)
    if (!tiny_class_preserves_header(class_idx)) return;

    // Prefill header byte (constant for this class)
    const uint8_t header_byte = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);

    // Prefill headers in cache slots (circular buffer)
    int tail_idx = start_tail;
    for (int i = 0; i < count; i++) {
        void* base = cache->slots[tail_idx];
        if (base) { // Safety: skip NULL slots
            *(uint8_t*)base = header_byte;
        }
        tail_idx = (tail_idx + 1) & cache->mask;
    }
#else
    (void)class_idx;
    (void)cache;
    (void)start_tail;
    (void)count;
#endif
}
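// Usage sketch (assumption: a refill site appends `n` carved blocks starting
// at the pre-append tail, then prefills their headers in one pass):
//   int old_tail = cache->tail;
//   /* ...carve n blocks into cache->slots[]... */
//   unified_cache_prefill_headers(class_idx, cache, old_tail, n);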
2025-12-04 23:39:02 +09:00
|
|
|
|
// ============================================================================
|
Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization
Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
- Direct SuperSlab carve (TLS SLL bypass)
- Self-contained pop-or-refill pattern
- ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128
2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
- Unified ON → direct cache access (skip all intermediate layers)
- Alloc: unified_cache_pop_or_refill() → immediate fail to slow
- Free: unified_cache_push() → fallback to SLL only if full
PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
- PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
- Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()
4. Measurement results (Random Mixed 500K / 256B):
- Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
- SSM: 512 pages (initialization footprint)
- MID/L25: 0 (unused in this workload)
- Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)
Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
- ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
- Conditional compilation cleanup
Documentation:
6. Analysis reports
- RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
- RANDOM_MIXED_SUMMARY.md: Phase 23 summary
- RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
- CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan
Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads
Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 02:47:58 +09:00
|
|
|
|
// ============================================================================
// unified_cache_refill - Batch refill from SuperSlab (called on cache miss)
// Returns: BASE pointer (first block, wrapped), or NULL-wrapped if failed
// Design: Direct carve from SuperSlab to array (no TLS SLL intermediate layer)
// Warm Pool Integration: PRIORITIZE warm pool, use superslab_refill as fallback
// ============================================================================
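//
// Call-pattern sketch (illustrative only, not compiled): the alloc fast path
// is expected to pop from the unified cache first and fall back to this batch
// refill on a miss. Names marked (hypothetical) are not defined in this file:
//
//   hak_base_ptr_t base = unified_cache_pop(class_idx);   // (hypothetical) O(1) hit path
//   if (HAK_BASE_IS_NULL(base)) {                         // (hypothetical) null check
//       base = unified_cache_refill(class_idx);           // this function: batch carve
//       // still NULL-wrapped -> caller fails over to the slow path
//   }
//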
hak_base_ptr_t unified_cache_refill(int class_idx) {
#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
    // Measure refill cost if enabled
    uint64_t start_cycles = 0;
    int measure = unified_cache_measure_enabled();
    if (measure) {
        start_cycles = read_tsc();
    }
#endif

    // Initialize warm pool on first use (per-thread)
    tiny_warm_pool_init_once();

    TinyUnifiedCache* cache = &g_unified_cache[class_idx];
    const TinyClassPolicy* policy = tiny_policy_get(class_idx);
    int warm_enabled = policy ? policy->warm_enabled : 0;
    int warm_cap     = policy ? policy->warm_cap : 0;
    int page_enabled = policy ? policy->page_box_enabled : 0;
    TinyTLSSlab* tls = &g_tls_slabs[class_idx];

    // ✅ Phase 11+: Ensure cache is initialized (lazy init for cold path)
    if (!cache->slots) {
        unified_cache_init();
        // Re-check after init (may fail due to alloc failure)
        if (!cache->slots) {
            // Keep the return type consistent: a wrapped NULL, not a raw NULL
            return HAK_BASE_FROM_RAW(NULL);
        }
    }

    // Calculate available room in unified cache
    int room = (int)cache->capacity - 1; // Leave 1 slot for full detection
    if (cache->head > cache->tail) {
        room = cache->head - cache->tail - 1;
    } else if (cache->head < cache->tail) {
        room = cache->capacity - (cache->tail - cache->head) - 1;
    }

    if (room <= 0) return HAK_BASE_FROM_RAW(NULL);
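
    // Worked example of the room math above (illustrative; assumes capacity = 8):
    //   head == tail          -> empty ring, room = 8 - 1 = 7
    //   head = 5, tail = 2    -> room = 5 - 2 - 1 = 2
    //   head = 2, tail = 5    -> room = 8 - (5 - 2) - 1 = 4
    // The reserved slot keeps "full" (tail + 1 == head) distinct from
    // "empty" (head == tail).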

    // Batch size limit (tuned per class)
    // - Default: 128
    // - C5-C6 (129B-512B): raised to 256
    // - C7 (~1KB): raised to 512 to further reduce refill frequency
    // - For safety, keep this consistent with the out[] array size (512) below
    int max_batch;
    if (class_idx == 7) {
        max_batch = 512;
    } else if (class_idx >= 5 && class_idx <= 6) {
        max_batch = 256;
    } else {
        max_batch = 128;
    }
    if (room > max_batch) room = max_batch;

    // NOTE:
    // - Because C7 raises max_batch to 512, the stack array also reserves 512 entries.
    // - This keeps room <= max_batch <= 512 at all times and prevents out[] overrun.
    void* out[512];
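
    // Sizing example (illustrative numbers): with a configured capacity of 1024
    // and an empty ring, room starts at 1023; for class 7 it is then clamped to
    // max_batch = 512, so at most 512 blocks are carved per refill and out[512]
    // can never overflow. (Capacity is whatever unified_cache_init() configured,
    // e.g. via HAKMEM_TINY_UNIFIED_C{0-7}.)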
    int produced = 0;
    int tls_carved = 0; // Debug bookkeeping: track TLS carve experiment hits
#if HAKMEM_BUILD_RELEASE
    (void)tls_carved;   // Only debug counters read this; silence unused-variable warnings
#endif

    // ========== PAGE BOX HOT PATH (Tiny-Plus layer): Try page box FIRST ==========
    // Eventually, C7-specific page-level freelist management will be integrated here.
    // For now the stub implementation always returns 0; only the Box-boundary wiring is in place.
    if (page_enabled && tiny_page_box_is_enabled(class_idx)) {
        int page_produced = tiny_page_box_refill(class_idx, tls, out, room);
        if (page_produced > 0) {
            // Store blocks into cache and return first
            void* first = out[0];
            int start_tail = cache->tail; // E5-2: Save tail position for header prefill
            for (int i = 1; i < page_produced; i++) {
                cache->slots[cache->tail] = out[i];
                cache->tail = (cache->tail + 1) & cache->mask;
            }

            // E5-2: Prefill headers for C1-C6 (write-once optimization)
            unified_cache_prefill_headers(class_idx, cache, start_tail, page_produced - 1);

#if !HAKMEM_BUILD_RELEASE
            g_unified_cache_miss[class_idx]++;
#endif
            tiny_class_stats_on_uc_miss(class_idx);

#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
            if (measure) {
                uint64_t end_cycles = read_tsc();
                uint64_t delta = end_cycles - start_cycles;
                atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_global,
                                          delta, memory_order_relaxed);
                atomic_fetch_add_explicit(&g_unified_cache_misses_global,
                                          1, memory_order_relaxed);
                atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_by_class[class_idx],
                                          delta, memory_order_relaxed);
                atomic_fetch_add_explicit(&g_unified_cache_misses_by_class[class_idx],
                                          1, memory_order_relaxed);
            }
#endif

            return HAK_BASE_FROM_RAW(first);
        }
    }
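
    // Reading the measurement counters (illustrative sketch; a stats dump of
    // this form is assumed, not defined in this file):
    //
    //   uint64_t cyc  = atomic_load_explicit(&g_unified_cache_refill_cycles_by_class[c],
    //                                        memory_order_relaxed);
    //   uint64_t miss = atomic_load_explicit(&g_unified_cache_misses_by_class[c],
    //                                        memory_order_relaxed);
    //   double avg_refill_cycles = miss ? (double)cyc / (double)miss : 0.0;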

    // ========== WARM POOL HOT PATH: Check warm pool FIRST ==========
    // This is the critical optimization - avoid the superslab_refill() registry scan
    if (warm_enabled) {
        if (class_idx == 7) {
            const TinyClassPolicy* pol = tiny_policy_get(7);
            static _Atomic int g_c7_policy_logged = 0;
            if (atomic_exchange_explicit(&g_c7_policy_logged, 1, memory_order_acq_rel) == 0) {
                fprintf(stderr,
                        "[C7_POLICY_AT_WARM] page=%u warm=%u cap=%u\n",
                        pol ? pol->page_box_enabled : 0,
                        pol ? pol->warm_enabled : 0,
                        pol ? pol->warm_cap : 0);
            }
        }
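
        // Warm-pool lifecycle sketch (illustrative; mirrors the flow below):
        //   SuperSlab* ss = tiny_warm_pool_pop(cls);        // O(1) pop, no registry scan
        //   if (ss) {
        //       n = slab_carve_from_ss(cls, ss, out, room); // bulk carve into out[]
        //       if (n > 0) tiny_warm_pool_push_with_cap(cls, ss, cap); // keep for next miss
        //   } else {
        //       /* cold path: superslab_refill() registry scan */
        //   }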

#if !HAKMEM_BUILD_RELEASE
        atomic_fetch_add_explicit(&g_dbg_warm_pop_attempts, 1, memory_order_relaxed);
        if (class_idx == 7) {
            warm_pool_dbg_c7_attempt();
        }
#endif

#if HAKMEM_BUILD_RELEASE
        if (class_idx == 7) {
            atomic_fetch_add_explicit(&g_rel_c7_warm_pop, 1, memory_order_relaxed);
        }
#endif

        SuperSlab* warm_ss = tiny_warm_pool_pop(class_idx);
        if (warm_ss) {
            int allow_tls_bind  = policy && policy->tls_carve_enabled;
            int allow_tls_carve = allow_tls_bind;
            int warm_mode = 0;
            if (class_idx == 7) {
#if !HAKMEM_BUILD_RELEASE
                warm_pool_dbg_c7_hit();
#endif
                // C7 bind mode: >= 1 enables TLS bind, == 2 additionally enables TLS carve
                warm_mode = warm_tls_bind_mode_c7();
                allow_tls_bind  = (warm_mode >= 1);
                allow_tls_carve = (warm_mode == 2);
            }

            if (allow_tls_bind) {
                int cap = ss_slabs_capacity(warm_ss);
                int slab_idx = -1;

                // Simple heuristic: first slab matching class
                for (int i = 0; i < cap; i++) {
                    if (tiny_get_class_from_ss(warm_ss, i) == class_idx) {
                        slab_idx = i;
                        break;
                    }
                }

                if (slab_idx >= 0) {
                    uint32_t tid = (uint32_t)(uintptr_t)pthread_self();
                    if (ss_tls_bind_one(class_idx, tls, warm_ss, slab_idx, tid)) {
                        if (class_idx == 7) {
                            warm_tls_bind_log_success(warm_ss, slab_idx);
                        }

                        // Mode 2: carve a single block via the TLS fast path (policy-enabled classes)
                        if (allow_tls_carve) {
#if !HAKMEM_BUILD_RELEASE
                            if (class_idx == 7) {
                                warm_pool_dbg_c7_tls_attempt();
                            }
#endif
                            TinyTLSCarveOneResult tls_carve =
                                tiny_tls_carve_one_block(tls, class_idx);
                            if (tls_carve.block) {
                                if (class_idx == 7) {
                                    warm_tls_bind_log_tls_carve(warm_ss, slab_idx, tls_carve.block);
#if !HAKMEM_BUILD_RELEASE
                                    warm_pool_dbg_c7_tls_success();
#endif
                                }
                                out[0] = tls_carve.block;
                                produced = 1;
                                tls_carved = 1;
                            } else {
                                if (class_idx == 7) {
                                    warm_tls_bind_log_tls_fail(warm_ss, slab_idx);
#if !HAKMEM_BUILD_RELEASE
                                    warm_pool_dbg_c7_tls_fail();
#endif
                                }
                            }
                        }
                    }
                }
            }
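
            // Mode summary (assumed semantics, mirroring the flags above):
            //   warm_mode 0: no TLS bind; fall through to the bulk carve below
            //   warm_mode 1: bind a slab to TLS, then still bulk-carve into the cache
            //   warm_mode 2: bind + carve exactly one block (out[0]) via TLS,
            //                which skips the bulk carve for this refill
            // tls_carved marks the mode-2 path for the debug counters below.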

#if !HAKMEM_BUILD_RELEASE
            atomic_fetch_add_explicit(&g_dbg_warm_pop_hits, 1, memory_order_relaxed);
#endif
            // HOT PATH: Warm pool hit, try to carve directly
            if (produced == 0) {
#if HAKMEM_BUILD_RELEASE
                if (class_idx == 7) {
                    warm_pool_rel_c7_carve_attempt();
                }
#endif
                produced = slab_carve_from_ss(class_idx, warm_ss, out, room);
#if HAKMEM_BUILD_RELEASE
                if (class_idx == 7) {
                    if (produced > 0) {
                        warm_pool_rel_c7_carve_success();
                    } else {
                        warm_pool_rel_c7_carve_zero();
                    }
                }
#endif
                if (produced > 0) {
                    // Update active counter for carved blocks
                    ss_active_add(warm_ss, (uint32_t)produced);
                }
            }

            if (produced > 0) {
#if !HAKMEM_BUILD_RELEASE
                if (class_idx == 7) {
                    warm_pool_dbg_c7_carve();
                    if (tls_carved) {
                        warm_pool_dbg_c7_uc_miss_tls();
                    } else {
                        warm_pool_dbg_c7_uc_miss_warm();
                    }
                }
#endif
                // Success! Return the SuperSlab to the warm pool for next use
#if HAKMEM_BUILD_RELEASE
                if (class_idx == 7) {
                    atomic_fetch_add_explicit(&g_rel_c7_warm_push, 1, memory_order_relaxed);
                }
#endif
                tiny_warm_pool_push_with_cap(class_idx, warm_ss, warm_cap);

                // Track warm pool hit (always compiled, ENV-gated printing)
                warm_pool_record_hit(class_idx);
                tiny_class_stats_on_warm_hit(class_idx);

                // Store blocks into cache and return first
                void* first = out[0];
                int start_tail = cache->tail; // E5-2: Save tail position for header prefill
                for (int i = 1; i < produced; i++) {
                    cache->slots[cache->tail] = out[i];
                    cache->tail = (cache->tail + 1) & cache->mask;
                }

                // E5-2: Prefill headers for C1-C6 (write-once optimization)
                unified_cache_prefill_headers(class_idx, cache, start_tail, produced - 1);

#if !HAKMEM_BUILD_RELEASE
                g_unified_cache_miss[class_idx]++;
#endif
                tiny_class_stats_on_uc_miss(class_idx);

#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
                if (measure) {
                    uint64_t end_cycles = read_tsc();
                    uint64_t delta = end_cycles - start_cycles;
                    atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_global,
                                              delta, memory_order_relaxed);
                    atomic_fetch_add_explicit(&g_unified_cache_misses_global,
                                              1, memory_order_relaxed);
                    // Per-class aggregation (makes the C5-C7 refill cost visible)
                    atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_by_class[class_idx],
                                              delta, memory_order_relaxed);
                    atomic_fetch_add_explicit(&g_unified_cache_misses_by_class[class_idx],
                                              1, memory_order_relaxed);
                }
#endif

                return HAK_BASE_FROM_RAW(first);
            }

            // SuperSlab carve failed (produced == 0)
#if !HAKMEM_BUILD_RELEASE
            atomic_fetch_add_explicit(&g_dbg_warm_pop_carve_zero, 1, memory_order_relaxed);
#endif
            // This slab is either exhausted or has no available capacity left.
            // The 'prefilled' statistics counter tracks how often we try to prefill.
            if (produced == 0 && tiny_warm_pool_count(class_idx) == 0) {
                // Pool is empty and the carve failed - prefill would help here
                warm_pool_record_prefilled(class_idx);
            }
|
2025-12-06 01:34:04 +09:00
|
|
|
|
} else {
|
|
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
|
|
|
|
|
atomic_fetch_add_explicit(&g_dbg_warm_pop_empty, 1, memory_order_relaxed);
|
|
|
|
|
|
#endif
|
        }

        // ========== COLD PATH: Warm pool miss, use superslab_refill ==========
        // Track warm pool miss (always compiled, ENV-gated printing).
        warm_pool_record_miss(class_idx);
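        // Note on the stats pattern (an inference from the commit message, which
        // names g_warm_pool_stats[].prefilled): the hit/miss/prefilled counters
        // are bumped unconditionally so the hot path stays branch-free, and only
        // the report is gated, by HAKMEM_WARM_POOL_STATS=1 at print time.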
    }

    // Step 1: Ensure SuperSlab available via normal refill.
    // Enhanced: Use Warm Pool Prefill Box for secondary prefill when pool is empty.
    if (warm_enabled) {
        if (warm_pool_do_prefill(class_idx, tls, warm_cap) < 0) {
            return HAK_BASE_FROM_RAW(NULL);
        }
        // After prefill: tls->ss has the final slab for carving.
        tls = &g_tls_slabs[class_idx]; // Reload (already done in prefill box)
    } else {
        if (!tls->ss) {
            if (!superslab_refill(class_idx)) {
                return HAK_BASE_FROM_RAW(NULL);
            }
            tls = &g_tls_slabs[class_idx];
        }
    }

    // Step 2: Direct carve from SuperSlab into local array (bypass TLS SLL!).
    TinySlabMeta* m = tls->meta;
    size_t bs = tiny_stride_for_class(class_idx);
    uint8_t* base = tls->slab_base
                        ? tls->slab_base
                        : tiny_slab_base_for_geometry(tls->ss, tls->slab_idx);

    while (produced < room) {
        if (m->freelist) {
            // Freelist pop
            void* p = m->freelist;
            void* next_node = tiny_next_read(class_idx, p);
            // ROOT CAUSE FIX: Write header BEFORE exposing block (but AFTER reading next).
            // For Class 0 (offset 0), next overlaps header, so we must read next first.
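            // Illustration of the overlap (assuming an 8-byte 'next' link stored
            // in the block body, as the comment above implies):
            //   on the freelist : bytes 0..7 of the block hold 'next'
            //   after allocation: byte 0 holds the header (0xa0 | class_idx)
            // For Class 0 the header lives at offset 0, so writing it first
            // would clobber the low byte of 'next'; hence the order here:
            // read next -> write header -> fence -> unlink.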
#if HAKMEM_TINY_HEADER_CLASSIDX
            *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
            // Prevent compiler from reordering header write after out[] assignment.
            __atomic_thread_fence(__ATOMIC_RELEASE);
#endif
            m->freelist = next_node;
            unified_refill_validate_base(class_idx, tls, m, p,
                                         "unified_refill_freelist");
            // PageFaultTelemetry: record page touch for this BASE.
            pagefault_telemetry_touch(class_idx, p);

            m->used++;
            out[produced++] = p;

        } else if (m->carved < m->capacity) {
            // Linear carve (fresh block, no freelist link).
            void* p = (void*)(base + ((size_t)m->carved * bs));

            unified_refill_validate_base(class_idx, tls, m, p,
                                         "unified_refill_carve");
            // PageFaultTelemetry: record page touch for this BASE.
            pagefault_telemetry_touch(class_idx, p);

            // ✅ CRITICAL: Write header (new block).
#if HAKMEM_TINY_HEADER_CLASSIDX
            *(uint8_t*)p = (uint8_t)(0xa0 | (class_idx & 0x0f));
#endif
            m->carved++;
            m->used++;
            out[produced++] = p;

        } else {
            // SuperSlab exhausted → refill and retry.
            if (!superslab_refill(class_idx)) break;

            // ✅ CRITICAL: Reload TLS pointers after refill (avoid stale pointer bug).
            tls = &g_tls_slabs[class_idx];
            m = tls->meta;
            base = tls->slab_base
                       ? tls->slab_base
                       : tiny_slab_base_for_geometry(tls->ss, tls->slab_idx);
        }
    }

    if (produced == 0) return HAK_BASE_FROM_RAW(NULL);
    // Step 4: Update active counter.
    // Guard: tls->ss can be NULL if all SuperSlab refills failed.
    if (tls->ss) {
        ss_active_add(tls->ss, (uint32_t)produced);
    }

    // Step 5: Store blocks into unified cache (skip first, return it).
    void* first = out[0];
Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc
Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median
Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)
Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)
Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)
Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance
Health Check: PASS (all profiles)
Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%
Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 06:22:25 +09:00
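For reference, a minimal sketch of the one-time ENV-gate pattern the E5-2 box describes (HAKMEM_TINY_HEADER_WRITE_ONCE, default OFF). The body is an assumption based on the commit message above, not the actual box code:

// Sketch only: one-shot ENV probe cached in a static flag (assumed shape).
#include <stdlib.h>

static inline int tiny_header_write_once_enabled_sketch(void) {
    static int g_cached = -1;                  // -1 = not probed yet
    if (g_cached < 0) {
        const char* e = getenv("HAKMEM_TINY_HEADER_WRITE_ONCE");
        g_cached = (e && e[0] == '1') ? 1 : 0; // default OFF per the A/B decision
    }
    return g_cached;
}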
    int start_tail = cache->tail; // E5-2: Save tail position for header prefill
    for (int i = 1; i < produced; i++) {
        cache->slots[cache->tail] = out[i];
        cache->tail = (cache->tail + 1) & cache->mask;
    }

    // E5-2: Prefill headers for C1-C6 (write-once optimization).
    unified_cache_prefill_headers(class_idx, cache, start_tail, produced - 1);
#if !HAKMEM_BUILD_RELEASE
    if (class_idx == 7) {
        warm_pool_dbg_c7_uc_miss_shared();
    }
    g_unified_cache_miss[class_idx]++;
#endif
    tiny_class_stats_on_uc_miss(class_idx);

    // Measure refill cycles.
#if HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
    if (measure) {
        uint64_t end_cycles = read_tsc();
        uint64_t delta = end_cycles - start_cycles;
        atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_global,
                                  delta, memory_order_relaxed);
        atomic_fetch_add_explicit(&g_unified_cache_misses_global,
                                  1, memory_order_relaxed);
        // Per-class aggregation.
        atomic_fetch_add_explicit(&g_unified_cache_refill_cycles_by_class[class_idx],
                                  delta, memory_order_relaxed);
        atomic_fetch_add_explicit(&g_unified_cache_misses_by_class[class_idx],
                                  1, memory_order_relaxed);
    }
#endif

    return HAK_BASE_FROM_RAW(first); // Return first block (BASE pointer)
}
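The Step 5 store above relies on the unified cache ring having a power-of-two capacity, so that 'tail = (tail + 1) & mask' wraps without a modulo. A self-contained sketch of that invariant follows (toy types under stated assumptions; the real cache layout may differ):

// Sketch only: power-of-two ring push, mirroring the '& mask' wrap used above.
#include <stddef.h>

typedef struct {
    void*    slots[128];   // capacity must be a power of two (cf. HAKMEM_TINY_UNIFIED_C{0-7}=128)
    unsigned head;         // pop side
    unsigned tail;         // push side
    unsigned mask;         // capacity - 1
} toy_ring_t;

static int toy_ring_push(toy_ring_t* r, void* p) {
    unsigned next = (r->tail + 1) & r->mask;
    if (next == r->head) return 0;   // full: one slot is sacrificed as a sentinel
    r->slots[r->tail] = p;
    r->tail = next;
    return 1;
}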

// ============================================================================
// Performance Measurement: Print Statistics
// ============================================================================
void unified_cache_print_measurements(void) {
#if !HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
    return;
#else
    if (!unified_cache_measure_enabled()) {
        return; // Measurement disabled, nothing to print.
    }

    uint64_t hits = atomic_load_explicit(&g_unified_cache_hits_global, memory_order_relaxed);
    uint64_t misses = atomic_load_explicit(&g_unified_cache_misses_global, memory_order_relaxed);
    uint64_t refill_cycles = atomic_load_explicit(&g_unified_cache_refill_cycles_global, memory_order_relaxed);

    uint64_t total = hits + misses;
    if (total == 0) {
        fprintf(stderr, "\n========================================\n");
        fprintf(stderr, "Unified Cache Statistics\n");
        fprintf(stderr, "========================================\n");
        fprintf(stderr, "No operations recorded (measurement may be disabled)\n");
        fprintf(stderr, "========================================\n\n");
        return;
    }

    double hit_rate = (100.0 * hits) / total;
    double avg_refill_cycles = misses > 0 ? (double)refill_cycles / misses : 0.0;

    // Estimate wall time assuming a 1GHz clock: 1 cycle = 1ns, so cycles / 1000 = us.
    // This is conservative; most modern CPUs run at 2-4GHz, so real time is lower.
    double avg_refill_us = avg_refill_cycles / 1000.0;

    fprintf(stderr, "\n========================================\n");
    fprintf(stderr, "Unified Cache Statistics\n");
    fprintf(stderr, "========================================\n");
    fprintf(stderr, "Hits: %llu\n", (unsigned long long)hits);
    fprintf(stderr, "Misses: %llu\n", (unsigned long long)misses);
    fprintf(stderr, "Hit Rate: %.1f%%\n", hit_rate);
    fprintf(stderr, "Avg Refill Cycles: %.0f (est. %.2fus @ 1GHz)\n",
            avg_refill_cycles, avg_refill_us);

    // Per-class breakdown (Tiny classes 0-7; C5-C7 are the ones to watch).
    fprintf(stderr, "\nPer-class Unified Cache (Tiny classes):\n");
    for (int cls = 0; cls < TINY_NUM_CLASSES; cls++) {
        uint64_t ch = atomic_load_explicit(&g_unified_cache_hits_by_class[cls],
                                           memory_order_relaxed);
        uint64_t cm = atomic_load_explicit(&g_unified_cache_misses_by_class[cls],
                                           memory_order_relaxed);
        uint64_t cc = atomic_load_explicit(&g_unified_cache_refill_cycles_by_class[cls],
                                           memory_order_relaxed);
        uint64_t ct = ch + cm;
        if (ct == 0 && cc == 0) {
            continue; // Skip unused classes.
        }
        double cls_hit_rate = ct > 0 ? (100.0 * (double)ch / (double)ct) : 0.0;
        double cls_avg_refill = cm > 0 ? (double)cc / (double)cm : 0.0;
        double cls_avg_us = cls_avg_refill / 1000.0;
        fprintf(stderr,
                "  C%d: hits=%llu miss=%llu hit=%.1f%% avg_refill=%.0f cyc (%.2fus @1GHz)\n",
                cls,
                (unsigned long long)ch,
                (unsigned long long)cm,
                cls_hit_rate,
                cls_avg_refill,
                cls_avg_us);
    }

    fprintf(stderr, "========================================\n\n");
#endif
}
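The cycle accounting above depends on read_tsc(); a typical x86-64 implementation looks like the sketch below. This is an assumption for illustration - the project's actual helper may differ (for example, it might serialize with rdtscp):

// Sketch only: assumed shape of a TSC reader, with a portable fallback.
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
static inline uint64_t read_tsc_sketch(void) {
    return __rdtsc();                        // raw cycle counter, not serialized
}
#else
#include <time.h>
static inline uint64_t read_tsc_sketch(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);     // nanoseconds as a cycle proxy
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif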