// hakmem_build_flags.h - Centralized compile-time feature switches
// Purpose: Define all build-time toggles in one place with safe defaults.
// Usage: Include from common public headers (e.g., hakmem.h / hakmem_tiny.h).
#ifndef HAKMEM_BUILD_FLAGS_H
#define HAKMEM_BUILD_FLAGS_H
// ------------------------------------------------------------
// Phase 2: Headerless Mode Override
// ------------------------------------------------------------
// If Headerless is enabled, force HEADER_CLASSIDX to 0
#if defined(HAKMEM_TINY_HEADERLESS) && HAKMEM_TINY_HEADERLESS
#undef HAKMEM_TINY_HEADER_CLASSIDX
#define HAKMEM_TINY_HEADER_CLASSIDX 0
#endif
// ------------------------------------------------------------
// Release/debug detection
// ------------------------------------------------------------
// HAKMEM_BUILD_RELEASE: 1 in release-like builds, 0 otherwise
#ifndef HAKMEM_BUILD_RELEASE
# if defined(NDEBUG)
# define HAKMEM_BUILD_RELEASE 1
# else
# define HAKMEM_BUILD_RELEASE 0
# endif
#endif
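// Example (illustrative sketch, not HAKMEM code): consumers can wrap
// debug-only validation in this flag so that it compiles out in release
// builds. validate_node() is a hypothetical helper used only for illustration.
#if 0
#include <assert.h>
#include <stddef.h>

static inline void validate_node(const void* node) {
#if !HAKMEM_BUILD_RELEASE
    assert(node != NULL);   // debug builds: check every node
#else
    (void)node;             // release builds: no validation code is emitted
#endif
}
#endif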
// ------------------------------------------------------------
// Phase 35-A: Benchmark Minimal Mode
// ------------------------------------------------------------
// HAKMEM_BENCH_MINIMAL: Eliminate gate function overhead for benchmarks
// When =1: Gate functions return compile-time constants (no lazy init check)
// When =0: Normal runtime gate behavior (default)
// Usage: Build with -DHAKMEM_BENCH_MINIMAL=1 for benchmark-only binaries
#ifndef HAKMEM_BENCH_MINIMAL
# define HAKMEM_BENCH_MINIMAL 0
#endif
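// Example (illustrative sketch): one way a gate function could honor this
// flag. example_gate_enabled() and the HAKMEM_EXAMPLE_GATE variable are
// hypothetical and not part of HAKMEM; the lazy init shown is not thread-safe.
#if 0
#include <stdlib.h>

static inline int example_gate_enabled(void) {
#if HAKMEM_BENCH_MINIMAL
    return 1;                          // compile-time constant, no init check
#else
    static int cached = -1;            // -1 = not yet initialized (lazy init)
    if (cached < 0) {
        const char* e = getenv("HAKMEM_EXAMPLE_GATE");
        cached = (e && *e != '0') ? 1 : 0;
    }
    return cached;
#endif
}
#endif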
// ------------------------------------------------------------
// Instrumentation & counters (compile-time)
// ------------------------------------------------------------
// Enable lightweight path/debug counters (compiled out when 0)
// Default: 0 in release builds (NDEBUG set), 1 in debug builds
#ifndef HAKMEM_DEBUG_COUNTERS
# if defined(NDEBUG)
# define HAKMEM_DEBUG_COUNTERS 0
# else
# define HAKMEM_DEBUG_COUNTERS 1
# endif
#endif
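// Example (illustrative sketch): a counter macro that vanishes when the flag
// is 0. EXAMPLE_COUNT_INC and g_example_refills are hypothetical names.
#if 0
#if HAKMEM_DEBUG_COUNTERS
static unsigned long g_example_refills;
#  define EXAMPLE_COUNT_INC(c) ((void)++(c))
#else
#  define EXAMPLE_COUNT_INC(c) ((void)0)   // argument is never evaluated
#endif
// Call sites stay unconditional: EXAMPLE_COUNT_INC(g_example_refills);
#endif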
// Enable extended memory profiling (compiled out when 0)
#ifndef HAKMEM_DEBUG_MEMORY
# define HAKMEM_DEBUG_MEMORY 0
#endif
// Tiny refill optimization helpers (header-only)
#ifndef HAKMEM_TINY_REFILL_OPT
# define HAKMEM_TINY_REFILL_OPT 1
#endif
// Batch refill P0 (can be toggled for A/B)
#ifndef HAKMEM_TINY_P0_BATCH_REFILL
# define HAKMEM_TINY_P0_BATCH_REFILL 0
#endif
// Box refactor (Phase 6-1.7) — usually injected from build system
#ifndef HAKMEM_TINY_PHASE6_BOX_REFACTOR
# define HAKMEM_TINY_PHASE6_BOX_REFACTOR 1
#endif
// SuperSlab backend toggle (compile-time)
// Default: 1 (ON) - SuperSlab is the core architecture.
// Set to 0 only for legacy/compat testing.
#ifndef HAKMEM_TINY_USE_SUPERSLAB
# define HAKMEM_TINY_USE_SUPERSLAB 1
#endif
// ------------------------------------------------------------
// Phase 7: Region-ID Direct Lookup (Header-based optimization)
// ------------------------------------------------------------
// Phase 7 Task 1: Header-based class_idx for O(1) free
// Default: ON (1); forced to 0 by the Headerless override above when HAKMEM_TINY_HEADERLESS=1
// Build: make HEADER_CLASSIDX=1 or make phase7
#ifndef HAKMEM_TINY_HEADER_CLASSIDX
# define HAKMEM_TINY_HEADER_CLASSIDX 1
#endif
// Phase 7 Task 2: Aggressive inline TLS cache access
// Default: OFF (enable after full validation in Task 5)
// Build: make AGGRESSIVE_INLINE=1 or make phase7
// Requires: HAKMEM_TINY_HEADER_CLASSIDX=1
#ifndef HAKMEM_TINY_AGGRESSIVE_INLINE
# define HAKMEM_TINY_AGGRESSIVE_INLINE 0
#endif
// Inline TLS SLL pop (experimental, A/B only)
// Default: OFF (HAKMEM_TINY_INLINE_SLL=0) to keep Box TLS-SLL API as the standard path.
// Enable explicitly via build flag: -DHAKMEM_TINY_INLINE_SLL=1 (bench/debug only).
#ifndef HAKMEM_TINY_INLINE_SLL
# define HAKMEM_TINY_INLINE_SLL 0
#endif
// Phase 1A3: Always-inline tiny_region_id_write_header()
// Default: OFF (HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE=0) - enable after A/B validation
// Purpose: Force inline expansion of header write to reduce alloc path overhead
// Expected impact: +0.5-2% on Mixed workloads
// Build: make EXTRA_CFLAGS=-DHAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE=1 [target]
#ifndef HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE
# define HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE 0
#endif
// Phase 7 Task 3: Pre-warm TLS cache at init
// Default: OFF (enable after implementation)
// Build: make PREWARM_TLS=1 or make phase7
#ifndef HAKMEM_TINY_PREWARM_TLS
# define HAKMEM_TINY_PREWARM_TLS 0
#endif
// ------------------------------------------------------------
// Phase 1: Headerless Optimization - TLS SuperSlab Hint Cache
// ------------------------------------------------------------
// Purpose: Accelerate ptr→SuperSlab lookup in Headerless mode
// Default: 0 (disabled during development and testing)
// Target: 1 (enabled after validation in Phase 1 rollout)
//
// Performance Impact:
// - Cache hit: 2-5 cycles (vs 10-50 cycles for hak_super_lookup)
// - Expected hit rate: 85-95% (single-threaded), 70-85% (multi-threaded)
// - Expected throughput improvement: 15-20%
//
// Memory Overhead:
// - 112 bytes per thread (TLS)
// - Negligible for typical workloads (1000 threads = 112KB)
//
// Dependencies:
// - Requires HAKMEM_TINY_HEADERLESS=1 (hint is no-op in header mode)
// - No other dependencies (self-contained Box)
//
// Build: make EXTRA_CFLAGS="-DHAKMEM_TINY_SS_TLS_HINT=1"
#ifndef HAKMEM_TINY_SS_TLS_HINT
# define HAKMEM_TINY_SS_TLS_HINT 0
#endif
// Validation: Hint Box only active in Headerless mode
#if HAKMEM_TINY_SS_TLS_HINT && !(defined(HAKMEM_TINY_HEADERLESS) && HAKMEM_TINY_HEADERLESS)
#warning "HAKMEM_TINY_SS_TLS_HINT enabled but HAKMEM_TINY_HEADERLESS is not enabled - hint will have no effect"
#endif
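// Example (illustrative sketch): the general shape of a per-thread range
// cache of this kind. ExampleHint, EXAMPLE_HINT_SLOTS and
// example_hint_lookup() are hypothetical; the slot count is an assumption and
// the real Box may differ.
#if 0
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_HINT_SLOTS 4

typedef struct { uintptr_t base, end; void* ss; } ExampleHint;
static __thread ExampleHint g_example_hint[EXAMPLE_HINT_SLOTS];

// Returns the cached owner of ptr, or NULL on a miss. A miss is always safe:
// the caller falls back to the slower global lookup.
static inline void* example_hint_lookup(const void* ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < EXAMPLE_HINT_SLOTS; i++) {
        if (p >= g_example_hint[i].base && p < g_example_hint[i].end)
            return g_example_hint[i].ss;   // hit: base <= ptr < end
    }
    return NULL;                           // empty slots (0,0) never match
}
#endif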
// Runtime verbosity (printf-heavy diagnostics). Keep OFF for benches.
#ifndef HAKMEM_DEBUG_VERBOSE
# define HAKMEM_DEBUG_VERBOSE 0
#endif
// Tiny/Mid safety checks on free path (mincore header validation).
// 0 = performance (boundary-only), 1 = strict (mincore for all)
#ifndef HAKMEM_TINY_SAFE_FREE
# define HAKMEM_TINY_SAFE_FREE 0
#endif
// Phase 10: Aggressive refill count defaults (tunable via env vars)
// Goal: Reduce backend transitions by refilling in larger batches
// HAKMEM_TINY_REFILL_COUNT: global default (default: 128)
// HAKMEM_TINY_REFILL_COUNT_HOT: class 0-3 (default: 128)
// HAKMEM_TINY_REFILL_COUNT_MID: class 4-7 (default: 96)
// Larson Fix (Priority 1): Increased from 64 to 128 to reduce lock contention
// Expected impact: Lock frequency reduction 19K → ~1.6K locks/sec (12x)
// NOTE: Multi-threaded Larson has pre-existing crash bug (not caused by this change)
#ifndef HAKMEM_TINY_REFILL_DEFAULT
# define HAKMEM_TINY_REFILL_DEFAULT 128
#endif
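// Example (illustrative sketch): resolving the effective refill count from
// the environment, with this macro as the fallback. example_refill_count() is
// a hypothetical helper and the clamp bounds are assumptions.
#if 0
#include <stdlib.h>

static inline int example_refill_count(void) {
    const char* e = getenv("HAKMEM_TINY_REFILL_COUNT");
    if (!e || !*e) return HAKMEM_TINY_REFILL_DEFAULT;          // unset: use default
    long v = strtol(e, NULL, 10);
    if (v < 1 || v > 4096) return HAKMEM_TINY_REFILL_DEFAULT;  // reject bad input
    return (int)v;
}
#endif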
// ------------------------------------------------------------
// Tiny front architecture toggles (compile-time defaults)
// ------------------------------------------------------------
// New 3-layer Tiny front (A/B via build flag)
#ifndef HAKMEM_TINY_USE_NEW_3LAYER
# define HAKMEM_TINY_USE_NEW_3LAYER 0
#endif
// Minimal/strict front variants (bench/debug only)
#ifndef HAKMEM_TINY_MINIMAL_FRONT
# define HAKMEM_TINY_MINIMAL_FRONT 1
#endif
#ifndef HAKMEM_TINY_STRICT_FRONT
# define HAKMEM_TINY_STRICT_FRONT 0
#endif
// ------------------------------------------------------------
// Phase 4-Step3: Tiny Front PGO Config Box
// ------------------------------------------------------------
// HAKMEM_TINY_FRONT_PGO:
//   0 = Normal build with runtime configuration (default, backward compatible)
//       Configuration checked via ENV variables at runtime (flexible)
//   1 = PGO-optimized build with compile-time configuration (performance)
//       Configuration fixed at compile time (dead code elimination)
//       Eliminates runtime branches for maximum performance.
// Use with: make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1" bench_random_mixed_hakmem
// Target benefit: +5-8% via dead code elimination (57.2 → 60-62 M ops/s).
//   The initial change (2 call sites of front_gate_unified_enabled()) measured +2.7-4.9%.
#ifndef HAKMEM_TINY_FRONT_PGO
# define HAKMEM_TINY_FRONT_PGO 0
#endif
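The dual-mode behaviour described for HAKMEM_TINY_FRONT_PGO can be pictured with a minimal sketch. This is not the contents of core/box/tiny_front_config_box.h: the gate function, the macro name `EXAMPLE_FRONT_GATE_ENABLED`, and the ENV variable `EXAMPLE_FRONT_GATE` are all hypothetical names chosen for illustration. Only the pattern (a constant in PGO mode, a cached runtime check otherwise) follows the description above.

```c
#include <stdlib.h>

#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

/* Hypothetical runtime gate: reads an ENV variable once and caches the
 * result. The variable name is invented for this sketch. */
static int example_front_gate_enabled(void) {
    static int cached = -1;                 /* -1 = not yet initialized */
    if (cached < 0) {
        const char *v = getenv("EXAMPLE_FRONT_GATE");
        cached = (v == NULL || v[0] != '0') ? 1 : 0;   /* default: ON */
    }
    return cached;
}

#if HAKMEM_TINY_FRONT_PGO
/* PGO mode: the check is a literal, so `if (1)` folds away at compile time. */
#  define EXAMPLE_FRONT_GATE_ENABLED 1
#else
/* Normal mode: the check is a call to the cached runtime gate. */
#  define EXAMPLE_FRONT_GATE_ENABLED example_front_gate_enabled()
#endif

/* Call site looks identical in both modes. */
static int example_alloc_path(void) {
    if (EXAMPLE_FRONT_GATE_ENABLED) {
        return 1;   /* stand-in for the fast path */
    }
    return 0;       /* stand-in for the fallback path */
}
```

Because the call site is written once against the macro, an A/B comparison only needs two builds that differ in `-DHAKMEM_TINY_FRONT_PGO`.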
// ------------------------------------------------------------
// Phase 5-Step3: Mid/Large PGO Config Box
// ------------------------------------------------------------
// HAKMEM_MID_LARGE_PGO:
//   0 = Normal build with runtime configuration (default, backward compatible)
//       Configuration checked via ENV variables at runtime (flexible)
//   1 = PGO-optimized build with compile-time configuration (performance)
//       Configuration fixed at compile time (dead code elimination)
//       Eliminates runtime branches for Mid/Large allocation paths.
// Use with: make EXTRA_CFLAGS="-DHAKMEM_MID_LARGE_PGO=1" bench_random_mixed_hakmem
// Expected benefit: +2-4% improvement via dead code elimination
#ifndef HAKMEM_MID_LARGE_PGO
# define HAKMEM_MID_LARGE_PGO 0
#endif
CRITICAL FIX: Fully resolve the 4T SEGV caused by uninitialized TLS

**Problem:**
- Larson 4T hits SEGV 100% of the time (1T completes at 2.09M ops/s)
- System/mimalloc run normally at 33.52M ops/s with 4T
- SEGV at 4T persists even with SS OFF + Remote OFF

**Root cause (findings from the Task agent ultrathink investigation):**
```
CRASH: mov (%r15),%r13
R15 = 0x6261 ← ASCII "ba" (garbage value, uninitialized TLS)
```
TLS variables in worker threads were uninitialized:
- `__thread void* g_tls_sll_head[TINY_NUM_CLASSES];` ← no initializer
- Not zero-initialized in threads created by pthread_create()
- NULL check passes (0x6261 != NULL) → dereference → SEGV

**Fix:** Add an explicit initializer `= {0}` to all TLS arrays:
1. **core/hakmem_tiny.c:**
   - `g_tls_sll_head[TINY_NUM_CLASSES] = {0}`
   - `g_tls_sll_count[TINY_NUM_CLASSES] = {0}`
   - `g_tls_live_ss[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bcur[TINY_NUM_CLASSES] = {0}`
   - `g_tls_bend[TINY_NUM_CLASSES] = {0}`
2. **core/tiny_fastcache.c:**
   - `g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}`
   - `g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}`
3. **core/hakmem_tiny_magazine.c:**
   - `g_tls_mags[TINY_NUM_CLASSES] = {0}`
4. **core/tiny_sticky.c:**
   - `g_tls_sticky_ss[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_idx[TINY_NUM_CLASSES][TINY_STICKY_RING] = {0}`
   - `g_tls_sticky_pos[TINY_NUM_CLASSES] = {0}`

**Effect:**
```
Before: 1T: 2.09M ✅ | 4T: SEGV 💀
After:  1T: 2.41M ✅ | 4T: 4.19M ✅ (+15% 1T, SEGV resolved)
```

**Test:**
```bash
# 1 thread: completes
./larson_hakmem 2 8 128 1024 1 12345 1
→ Throughput = 2,407,597 ops/s ✅
# 4 threads: completes (previously SEGV)
./larson_hakmem 2 8 128 1024 1 12345 4
→ Throughput = 4,192,155 ops/s ✅
```

**Investigation credit:** Root cause pinpointed by the Task agent (ultrathink mode)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-07 01:27:04 +09:00
// Route fingerprint (compile-time gate; runtime ENV still required)
#ifndef HAKMEM_ROUTE
# define HAKMEM_ROUTE 0
#endif
// Phase 63: FAST Profile-Fixed Build (compile-time constant gates)
// HAKMEM_FAST_PROFILE_FIXED: Fix all MIXED_TINYV3_C7_SAFE gates to compile-time constants
//   When =1: Top 5-8 gates (tiny_front_v3_enabled, front_fastlane_enabled, etc.)
//            return compile-time constants, learning layer disabled (DCE expected +5-10%)
//   When =0: Normal runtime gate behavior (default, backward compatible)
// Usage: Build with -DHAKMEM_FAST_PROFILE_FIXED=1 for speed-first FAST binaries
//        Only for FAST builds; Standard/OBSERVE keep runtime gates unchanged
#ifndef HAKMEM_FAST_PROFILE_FIXED
# define HAKMEM_FAST_PROFILE_FIXED 0
#endif
// Phase 64: Backend Pruning (compile-time unreachable code elimination)
// HAKMEM_FAST_PROFILE_PRUNE_BACKENDS: Disable unused backends in Mixed workload
//   When =1: Backend gates (mid_v3_enabled, pool_v2_enabled, etc.) return false at compile-time
//            LTO DCE eliminates unreachable code paths (expected +5-10%)
//   When =0: Normal runtime gate behavior (default, backward compatible)
// Usage: Build with -DHAKMEM_FAST_PROFILE_PRUNE_BACKENDS=1 for ultra-fast FAST binaries
//        Backends disabled: MID_V3, POOL_V2, SMALL_HEAP_V4, LEARNER, etc.
#ifndef HAKMEM_FAST_PROFILE_PRUNE_BACKENDS
# define HAKMEM_FAST_PROFILE_PRUNE_BACKENDS 0
#endif
// Bench-only knobs (default values; can be overridden via build flags)
#ifndef HAKMEM_TINY_BENCH_REFILL
# define HAKMEM_TINY_BENCH_REFILL 8
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL8
# define HAKMEM_TINY_BENCH_REFILL8 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL16
# define HAKMEM_TINY_BENCH_REFILL16 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL32
# define HAKMEM_TINY_BENCH_REFILL32 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL64
# define HAKMEM_TINY_BENCH_REFILL64 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_WARMUP8
# define HAKMEM_TINY_BENCH_WARMUP8 64
#endif
#ifndef HAKMEM_TINY_BENCH_WARMUP16
# define HAKMEM_TINY_BENCH_WARMUP16 96
#endif
#ifndef HAKMEM_TINY_BENCH_WARMUP32
# define HAKMEM_TINY_BENCH_WARMUP32 160
#endif
#ifndef HAKMEM_TINY_BENCH_WARMUP64
# define HAKMEM_TINY_BENCH_WARMUP64 192
#endif
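The refill knobs above form a small fallback chain: each per-size macro defaults to HAKMEM_TINY_BENCH_REFILL unless the build overrides it individually. The sketch below illustrates that chain under two stated assumptions: that the numeric suffixes (8/16/32/64) refer to block sizes in bytes, and that a helper like `example_bench_refill_for()` exists. Both are hypothetical and not taken from the hakmem sources. The leading `#define` stands in for passing `-DHAKMEM_TINY_BENCH_REFILL32=16` on the command line.

```c
/* Stand-in for -DHAKMEM_TINY_BENCH_REFILL32=16 (illustrative value only). */
#define HAKMEM_TINY_BENCH_REFILL32 16

#ifndef HAKMEM_TINY_BENCH_REFILL
#  define HAKMEM_TINY_BENCH_REFILL 8
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL8
#  define HAKMEM_TINY_BENCH_REFILL8 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL16
#  define HAKMEM_TINY_BENCH_REFILL16 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL32
#  define HAKMEM_TINY_BENCH_REFILL32 HAKMEM_TINY_BENCH_REFILL
#endif
#ifndef HAKMEM_TINY_BENCH_REFILL64
#  define HAKMEM_TINY_BENCH_REFILL64 HAKMEM_TINY_BENCH_REFILL
#endif

/* Hypothetical helper: choose the refill count for a block size,
 * assuming the macro suffix names the block size in bytes. */
static int example_bench_refill_for(unsigned block_size) {
    if (block_size <= 8)  return HAKMEM_TINY_BENCH_REFILL8;
    if (block_size <= 16) return HAKMEM_TINY_BENCH_REFILL16;
    if (block_size <= 32) return HAKMEM_TINY_BENCH_REFILL32;
    return HAKMEM_TINY_BENCH_REFILL64;
}
```

Only the overridden knob changes; the others keep following the shared default, which is what makes a single `-DHAKMEM_TINY_BENCH_REFILL=N` sufficient for uniform tuning.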
// ------------------------------------------------------------
// Phase 22: Research Box Prune (Compile-out default-OFF boxes)
// ------------------------------------------------------------
// Phase 14 Tcache: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need tcache experimentation
#ifndef HAKMEM_TINY_TCACHE_COMPILED
# define HAKMEM_TINY_TCACHE_COMPILED 0
#endif
// Phase 15 Unified LIFO: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need LIFO/FIFO mode switching
#ifndef HAKMEM_TINY_UNIFIED_LIFO_COMPILED
# define HAKMEM_TINY_UNIFIED_LIFO_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 23: Per-op Default-OFF Tax Prune (Compile-out per-op research knobs)
// ------------------------------------------------------------
// Phase E5-2 Header Write-Once: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need write-once header optimization
#ifndef HAKMEM_TINY_HEADER_WRITE_ONCE_COMPILED
# define HAKMEM_TINY_HEADER_WRITE_ONCE_COMPILED 0
#endif
// Unified Cache Measurement: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need cache measurement instrumentation
#ifndef HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED
# define HAKMEM_TINY_UNIFIED_CACHE_MEASURE_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 24: OBSERVE Tax Prune (Compile-out hot-path stats atomics)
// ------------------------------------------------------------
// Tiny Class Stats: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need per-class stats observation
#ifndef HAKMEM_TINY_CLASS_STATS_COMPILED
# define HAKMEM_TINY_CLASS_STATS_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 25: Tiny Free Stats Atomic Prune (Compile-out g_free_ss_enter)
// ------------------------------------------------------------
// Tiny Free Stats: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need free path telemetry
// Target: g_free_ss_enter atomic in core/tiny_superslab_free.inc.h
#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
# define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif
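The `*_COMPILED` gates in this part of the header all follow the same shape: a telemetry atomic exists only when the gate is 1, and the hot path pays nothing when it is 0. The following is a minimal sketch of that shape, not the code in core/tiny_superslab_free.inc.h. The counter `example_free_enter_count`, the macro `EXAMPLE_FREE_STAT_INC`, and the function `example_free()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_TINY_FREE_STATS_COMPILED
#  define HAKMEM_TINY_FREE_STATS_COMPILED 0
#endif

#if HAKMEM_TINY_FREE_STATS_COMPILED
/* Research build: the counter exists and is bumped with a relaxed atomic. */
static _Atomic uint64_t example_free_enter_count;
#  define EXAMPLE_FREE_STAT_INC() \
      ((void)atomic_fetch_add_explicit(&example_free_enter_count, 1, \
                                       memory_order_relaxed))
#else
/* Default build: the statement compiles to nothing. */
#  define EXAMPLE_FREE_STAT_INC() ((void)0)
#endif

/* Hypothetical free path: the telemetry never influences control flow,
 * so compiling it out cannot change behaviour. */
static int example_free(void *p) {
    EXAMPLE_FREE_STAT_INC();
    return p != NULL;   /* stand-in for the real free logic */
}
```

Keeping the counter purely observational is what makes the compile-out safe; a counter that feeds a branch would need to stay compiled in.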
// ------------------------------------------------------------
// Phase 26A: C7 Free Count Atomic Prune (Compile-out c7_free_count)
// ------------------------------------------------------------
// C7 Free Count: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need C7 free path diagnostics
// Target: c7_free_count atomic in core/tiny_superslab_free.inc.h:51
#ifndef HAKMEM_C7_FREE_COUNT_COMPILED
# define HAKMEM_C7_FREE_COUNT_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 26B: Header Mismatch Log Atomic Prune (Compile-out g_hdr_mismatch_log)
// ------------------------------------------------------------
// Header Mismatch Log: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need header validation diagnostics
// Target: g_hdr_mismatch_log atomic in core/tiny_superslab_free.inc.h:147
#ifndef HAKMEM_HDR_MISMATCH_LOG_COMPILED
# define HAKMEM_HDR_MISMATCH_LOG_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 26C: Header Meta Mismatch Atomic Prune (Compile-out g_hdr_meta_mismatch)
// ------------------------------------------------------------
// Header Meta Mismatch: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need metadata validation diagnostics
// Target: g_hdr_meta_mismatch atomic in core/tiny_superslab_free.inc.h:182
#ifndef HAKMEM_HDR_META_MISMATCH_COMPILED
# define HAKMEM_HDR_META_MISMATCH_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 26D: Metric Bad Class Atomic Prune (Compile-out g_metric_bad_class_once)
// ------------------------------------------------------------
// Metric Bad Class: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need bad class index diagnostics
// Target: g_metric_bad_class_once atomic in core/hakmem_tiny_alloc.inc:22
#ifndef HAKMEM_METRIC_BAD_CLASS_COMPILED
# define HAKMEM_METRIC_BAD_CLASS_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 26E: Header Meta Fast Atomic Prune (Compile-out g_hdr_meta_fast)
// ------------------------------------------------------------
// Header Meta Fast: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need fast-path metadata telemetry
// Target: g_hdr_meta_fast atomic in core/tiny_free_fast_v2.inc.h:181
#ifndef HAKMEM_HDR_META_FAST_COMPILED
# define HAKMEM_HDR_META_FAST_COMPILED 0
#endif
Phase 29: Pool Hotbox v2 Stats Prune - NO-OP (infrastructure ready)

Target: g_pool_hotbox_v2_stats atomics (12 total) in Pool v2
Result: 0.00% impact (code path inactive by default, ENV-gated)
Verdict: NO-OP - Maintain compile-out for future-proofing

Audit Results:
- Classification: 12/12 TELEMETRY (100% observational)
- Counters: alloc_calls, alloc_fast, alloc_refill, alloc_refill_fail, alloc_fallback_v1, free_calls, free_fast, free_fallback_v1, page_of_fail_* (4 failure counters)
- Verification: All stats/logging only, zero flow control usage
- Phase 28 lesson applied: Traced all usages, confirmed no CORRECTNESS

Key Finding: Pool v2 OFF by default
- Requires HAKMEM_POOL_V2_ENABLED=1 to activate
- Benchmark never executes Pool v2 code paths
- Compile-out has zero performance impact (code never runs)

Implementation (future-ready):
- Added HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED (default: 0)
- Wrapped 13 atomic write sites in core/hakmem_pool.c
- Pattern: #if HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED ... #endif
- Expected impact if Pool v2 enabled: +0.3~0.8% (HOT+WARM atomics)

A/B Test Results:
- Baseline (COMPILED=0): 52.98 M ops/s (±0.43M, 0.81% stdev)
- Research (COMPILED=1): 53.31 M ops/s (±0.80M, 1.50% stdev)
- Delta: -0.62% (noise, not real effect - code path not active)

Critical Lesson Learned (NEW):
Phase 29 revealed ENV-gated features can appear on hot paths but never execute. Updated audit checklist:
1. Classify atomics (CORRECTNESS vs TELEMETRY)
2. Verify no flow control usage
3. NEW: Verify code path is ACTIVE in benchmark (check ENV gates)
4. Implement compile-out
5. A/B test

Verification methods added to documentation:
- rg "getenv.*FEATURE" to check ENV gates
- perf record/report to verify execution
- Debug printf for quick validation

Cumulative Progress (Phase 24-29):
- Phase 24 (class stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL
- Phase 27 (unified cache): +0.74% GO
- Phase 28 (bg spill): NO-OP (all CORRECTNESS)
- Phase 29 (pool v2): NO-OP (inactive code path)
- Total: 17 atomics removed, +2.74% improvement

Documentation:
- PHASE29_POOL_HOTBOX_V2_AUDIT.md: Complete audit with TELEMETRY classification
- PHASE29_POOL_HOTBOX_V2_STATS_RESULTS.md: Results + new lesson learned
- ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md: Updated with Phase 29 + new checklist
- PHASE29_COMPLETE.md: Completion summary with recommendations

Decision: Keep compile-out despite NO-OP
- Code cleanliness (binary size reduction)
- Future-proofing (ready when Pool v2 enabled)
- Consistency with Phase 24-28 pattern

Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-16 06:33:41 +09:00
// ------------------------------------------------------------
// Phase 27: Unified Cache Stats Atomic Prune (Compile-out observation atomics)
// ------------------------------------------------------------
// Unified Cache Stats: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need cache telemetry
// Target: g_cache_unified_stats atomics in core/hakmem_tiny.c
#ifndef HAKMEM_UNIFIED_CACHE_STATS_COMPILED
# define HAKMEM_UNIFIED_CACHE_STATS_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 29: Pool Hotbox v2 Stats Prune (Compile-out telemetry atomics)
// ------------------------------------------------------------
// Pool Hotbox v2 Stats: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need Pool v2 telemetry
// Target: g_pool_hotbox_v2_stats[ci].* atomics in core/hakmem_pool.c
// Impact: 12 atomic counters on HOT+WARM path (alloc_fast, free_fast, etc.)
#ifndef HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED
# define HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED 0
#endif
Phase 30-31: Standard procedure + g_tiny_free_trace atomic prune

Phase 30: Standard Procedure Establishment
- Created 4-step standardized methodology (Step 0-3)
- Step 0: Execution Verification (NEW - Phase 29 lesson)
- Step 1: CORRECTNESS/TELEMETRY Classification (Phase 28 lesson)
- Step 2: Compile-Out Implementation (Phase 24-27 pattern)
- Step 3: A/B Test (build-level comparison)
- Executed audit_atomics.sh: 412 atomics analyzed
- Identified Phase 31 candidate: g_tiny_free_trace (HOT path, TOP PRIORITY)

Phase 31: g_tiny_free_trace Compile-Out (HOT Path TELEMETRY)
- Target: core/hakmem_tiny_free.inc:326 (trace-rate-limit atomic)
- Added HAKMEM_TINY_FREE_TRACE_COMPILED (default: 0)
- Classification: Pure TELEMETRY (trace output only, no flow control)
- A/B Result: NEUTRAL (baseline -0.35% mean, +0.19% median)
- Verdict: NEUTRAL → Adopted for code cleanliness (Phase 26 precedent)
- Rationale: HOT path TELEMETRY removal improves code quality

A/B Test Details:
- Baseline (COMPILED=0): 53.638M ops/s mean, 53.799M median
- Compiled-in (COMPILED=1): 53.828M ops/s mean, 53.697M median
- Conflicting signals within ±0.5% noise margin
- Phase 25 comparison: g_free_ss_enter (+1.07% GO) vs g_tiny_free_trace (NEUTRAL)
- Hypothesis: Rate-limited atomic (128 calls) optimized by compiler

Cumulative Progress (Phase 24-31):
- Phase 24 (class stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL
- Phase 27 (unified cache): +0.74% GO
- Phase 28 (bg spill): NO-OP (all CORRECTNESS)
- Phase 29 (pool v2): NO-OP (ENV-gated)
- Phase 30 (procedure): PROCEDURE
- Phase 31 (free trace): -0.35% NEUTRAL
- Total: 18 atomics removed, +2.74% net improvement

Documentation Created:
- PHASE30_STANDARD_PROCEDURE.md: Complete 4-step methodology
- ATOMIC_AUDIT_FULL.txt: 412 atomics comprehensive audit
- PHASE31_CANDIDATES_HOT/WARM.txt: Priority-sorted candidates
- PHASE31_RECOMMENDED_CANDIDATES.md: TOP 3 with Step 0 verification
- PHASE31_TINY_FREE_TRACE_ATOMIC_PRUNE_RESULTS.md: Complete A/B results
- ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md: Updated (Phase 30-31)
- CURRENT_TASK.md: Phase 32 candidate identified (g_hak_tiny_free_calls)

Key Lessons:
- Lesson 7 (Phase 30): Step 0 execution verification prevents wasted effort
- Lesson 8 (Phase 31): NEUTRAL + code cleanliness = valid adoption
- HOT path ≠ guaranteed performance win (rate-limited atomics may be optimized)

Next Phase: Phase 32 candidate (g_hak_tiny_free_calls)
- Location: core/hakmem_tiny_free.inc:335 (9 lines below Phase 31 target)
- Expected: +0.3~0.7% or NEUTRAL

Generated with Claude Code https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-16 07:31:15 +09:00
// ------------------------------------------------------------
// Phase 31: Tiny Free Trace Atomic Prune (Compile-out trace atomic)
// ------------------------------------------------------------
// Tiny Free Trace: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need free path trace diagnostics
// Target: g_tiny_free_trace atomic in core/hakmem_tiny_free.inc:326
// Impact: HOT path atomic (every free operation)
// Expected improvement: +0.5% to +1.0% (similar to Phase 25: +1.07%)
// Measured (Phase 31 A/B): NEUTRAL, within ±0.5% noise; adopted for code cleanliness
#ifndef HAKMEM_TINY_FREE_TRACE_COMPILED
# define HAKMEM_TINY_FREE_TRACE_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 32: Tiny Free Calls Atomic Prune (Compile-out diagnostic counter)
// ------------------------------------------------------------
// Tiny Free Calls: Compile gate (default OFF = compile-out)
// Set to 1 for research builds that need free path call counting
// Target: g_hak_tiny_free_calls atomic in core/hakmem_tiny_free.inc:335
// Impact: HOT path atomic (every free operation, unconditional)
// Expected improvement: +0.3% to +0.7% (diagnostic counter, less critical than Phase 25)
#ifndef HAKMEM_TINY_FREE_CALLS_COMPILED
# define HAKMEM_TINY_FREE_CALLS_COMPILED 0
#endif
// ------------------------------------------------------------
// Phase 34: Batch Atomic Prune (Compile-out remaining WARM path atomics)
// ------------------------------------------------------------
// Phase 34A: Splice Debug Counter (WARM path, refill)
// Target: g_splice_count in core/tiny_refill_opt.h:79
// Impact: WARM path atomic (every refill splice operation)
#ifndef HAKMEM_SPLICE_DEBUG_COMPILED
# define HAKMEM_SPLICE_DEBUG_COMPILED 0
#endif
// Phase 34B: Alloc Gate Class Mismatch (ERROR path, rare)
// Target: g_alloc_gate_cls_mis in core/box/tiny_alloc_gate_box.h:95
// Impact: ERROR path atomic (class mismatch detection, rare)
#ifndef HAKMEM_ALLOC_GATE_CLS_MIS_COMPILED
# define HAKMEM_ALLOC_GATE_CLS_MIS_COMPILED 0
#endif
// ------------------------------------------------------------
// Helper enum (for documentation / logging)
// ------------------------------------------------------------
typedef enum {
    HAK_FLAG_BUILD_RELEASE = HAKMEM_BUILD_RELEASE,
    HAK_FLAG_DEBUG_COUNTERS = HAKMEM_DEBUG_COUNTERS,
    HAK_FLAG_DEBUG_MEMORY = HAKMEM_DEBUG_MEMORY,
    HAK_FLAG_REFILL_OPT = HAKMEM_TINY_REFILL_OPT,
    HAK_FLAG_P0_BATCH = HAKMEM_TINY_P0_BATCH_REFILL,
    HAK_FLAG_BOX_REFACTOR = HAKMEM_TINY_PHASE6_BOX_REFACTOR,
    HAK_FLAG_NEW_3LAYER = HAKMEM_TINY_USE_NEW_3LAYER,
} hak_build_flags_t;
#endif // HAKMEM_BUILD_FLAGS_H
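The helper enum mirrors each flag as an enumerator, which makes the configured values visible to debuggers and easy to log. As a rough sketch of how such an enum might be used for a build report, the code below defines a reduced, hypothetical enum (`example_build_flags_t`) over two of the flags and formats them into a buffer. The function name and output format are assumptions for illustration, not part of hakmem.

```c
#include <stdio.h>

#ifndef HAKMEM_BUILD_RELEASE
#  if defined(NDEBUG)
#    define HAKMEM_BUILD_RELEASE 1
#  else
#    define HAKMEM_BUILD_RELEASE 0
#  endif
#endif
#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

/* Reduced, hypothetical counterpart of hak_build_flags_t: each enumerator
 * captures the value its flag had when this translation unit was compiled. */
typedef enum {
    EXAMPLE_FLAG_BUILD_RELEASE = HAKMEM_BUILD_RELEASE,
    EXAMPLE_FLAG_FRONT_PGO     = HAKMEM_TINY_FRONT_PGO,
} example_build_flags_t;

/* Hypothetical report helper: writes "release=<n> front_pgo=<n>" into buf
 * and returns snprintf's result (length that was or would be written). */
static int example_format_build_flags(char *buf, size_t cap) {
    return snprintf(buf, cap, "release=%d front_pgo=%d",
                    (int)EXAMPLE_FLAG_BUILD_RELEASE,
                    (int)EXAMPLE_FLAG_FRONT_PGO);
}
```

Enumerators may share a value in C, so it is fine that several flags resolve to the same 0 or 1. The enum serves as documentation and as debug-info symbols rather than as a set of distinct identifiers.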