84f5034e45
Phase 68: PGO training set diversification (seed/WS expansion)
...
Changes:
- scripts/box/pgo_fast_profile_config.sh: Expanded WS patterns (3→5) and seeds (1→3)
for reduced overfitting and better production workload representativeness
- PERFORMANCE_TARGETS_SCORECARD.md: Phase 68 baseline promoted (61.614M = 50.93%)
- CURRENT_TASK.md: Phase 68 marked complete, Phase 67a (layout tax forensics) set Active
Results:
- 10-run verification: +1.19% vs Phase 66 baseline (GO, >+1.0% threshold)
- M1 milestone: 50.93% of mimalloc (target 50%, exceeded by +0.93pp)
- Stability: 10-run mean/median with <2.1% CV
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-17 21:08:17 +09:00
f99ef77ad7
Phase 29: Pool Hotbox v2 Stats Prune - NO-OP (infrastructure ready)
...
Target: g_pool_hotbox_v2_stats atomics (12 total) in Pool v2
Result: 0.00% impact (code path inactive by default, ENV-gated)
Verdict: NO-OP - Maintain compile-out for future-proofing
Audit Results:
- Classification: 12/12 TELEMETRY (100% observational)
- Counters: alloc_calls, alloc_fast, alloc_refill, alloc_refill_fail,
alloc_fallback_v1, free_calls, free_fast, free_fallback_v1,
page_of_fail_* (4 failure counters)
- Verification: All stats/logging only, zero flow control usage
- Phase 28 lesson applied: Traced all usages, confirmed no CORRECTNESS
Key Finding: Pool v2 OFF by default
- Requires HAKMEM_POOL_V2_ENABLED=1 to activate
- Benchmark never executes Pool v2 code paths
- Compile-out has zero performance impact (code never runs)
Implementation (future-ready):
- Added HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED (default: 0)
- Wrapped 13 atomic write sites in core/hakmem_pool.c
- Pattern: #if HAKMEM_POOL_HOTBOX_V2_STATS_COMPILED ... #endif
- Expected impact if Pool v2 enabled: +0.3~0.8% (HOT+WARM atomics)
A/B Test Results:
- Baseline (COMPILED=0): 52.98 M ops/s (±0.43M, 0.81% stdev)
- Research (COMPILED=1): 53.31 M ops/s (±0.80M, 1.50% stdev)
- Delta: -0.62% (noise, not real effect - code path not active)
Critical Lesson Learned (NEW):
Phase 29 revealed ENV-gated features can appear on hot paths but never
execute. Updated audit checklist:
1. Classify atomics (CORRECTNESS vs TELEMETRY)
2. Verify no flow control usage
3. NEW: Verify code path is ACTIVE in benchmark (check ENV gates)
4. Implement compile-out
5. A/B test
Verification methods added to documentation:
- rg "getenv.*FEATURE" to check ENV gates
- perf record/report to verify execution
- Debug printf for quick validation
Cumulative Progress (Phase 24-29):
- Phase 24 (class stats): +0.93% GO
- Phase 25 (free stats): +1.07% GO
- Phase 26 (diagnostics): -0.33% NEUTRAL
- Phase 27 (unified cache): +0.74% GO
- Phase 28 (bg spill): NO-OP (all CORRECTNESS)
- Phase 29 (pool v2): NO-OP (inactive code path)
- Total: 17 atomics removed, +2.74% improvement
Documentation:
- PHASE29_POOL_HOTBOX_V2_AUDIT.md: Complete audit with TELEMETRY classification
- PHASE29_POOL_HOTBOX_V2_STATS_RESULTS.md: Results + new lesson learned
- ATOMIC_PRUNE_CUMULATIVE_SUMMARY.md: Updated with Phase 29 + new checklist
- PHASE29_COMPLETE.md: Completion summary with recommendations
Decision: Keep compile-out despite NO-OP
- Code cleanliness (binary size reduction)
- Future-proofing (ready when Pool v2 enabled)
- Consistency with Phase 24-28 pattern
Generated with Claude Code
https://claude.com/claude-code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-16 06:33:41 +09:00
19056282b6
Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]
...
Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)
Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)
Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings
Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)
Decision: FREEZE as research box (default OFF, regression confirmed)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 22:03:27 +09:00
acc64f2438
Phase ML1: Pool v1 memset 89.73% overhead 軽量化 (+15.34% improvement)
...
## Summary
- ChatGPT により bench_profile.h の setenv segfault を修正(RTLD_NEXT 経由に切り替え)
- core/box/pool_zero_mode_box.h 新設:ENV キャッシュ経由で ZERO_MODE を統一管理
- core/hakmem_pool.c で zero mode に応じた memset 制御(FULL/header/off)
- A/B テスト結果:ZERO_MODE=header で +15.34% improvement(1M iterations, C6-heavy)
## Files Modified
- core/box/pool_api.inc.h: pool_zero_mode_box.h include
- core/bench_profile.h: glibc setenv → malloc+putenv(segfault 回避)
- core/hakmem_pool.c: zero mode 参照・制御ロジック
- core/box/pool_zero_mode_box.h (新設): enum/getter
- CURRENT_TASK.md: Phase ML1 結果記載
## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K | 3.06 M ops/s | 3.17 M ops/s | +3.65% |
| 1M | 23.71 M ops/s | 27.34 M ops/s | **+15.34%** |
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-10 09:08:18 +09:00
a905e0ffdd
Guard madvise ENOMEM and stabilize pool/tiny front v3
2025-12-09 21:50:15 +09:00
bd5e97f38a
Save current state before investigating TLS_SLL_HDR_RESET
2025-12-03 10:34:39 +09:00
4ef0171bc0
feat: Add ACE allocation failure tracing and debug hooks
...
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.
Key changes include:
- **ACE Tracing Implementation**:
- Added environment variable to enable/disable detailed logging of allocation failures.
- Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
- Corrected to ensure is properly linked into , resolving an error.
- **LD_PRELOAD Wrapper Adjustments**:
- Investigated and understood the wrapper's behavior under , particularly its interaction with and checks.
- Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
- Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed.
- Created to facilitate testing of the tracing features.
This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00
03ba62df4d
Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
...
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization
Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
- Direct SuperSlab carve (TLS SLL bypass)
- Self-contained pop-or-refill pattern
- ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128
2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
- Unified ON → direct cache access (skip all intermediate layers)
- Alloc: unified_cache_pop_or_refill() → immediate fail to slow
- Free: unified_cache_push() → fallback to SLL only if full
PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
- PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
- Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()
4. Measurement results (Random Mixed 500K / 256B):
- Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
- SSM: 512 pages (initialization footprint)
- MID/L25: 0 (unused in this workload)
- Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)
Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
- ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
- Conditional compilation cleanup
Documentation:
6. Analysis reports
- RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
- RANDOM_MIXED_SUMMARY.md: Phase 23 summary
- RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
- CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan
Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads
Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-17 02:47:58 +09:00
1010a961fb
Tiny: fix header/stride mismatch and harden refill paths
...
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
header during allocation, but linear carve/refill and initial slab capacity
still used bare class block sizes. This mismatch could overrun slab usable
space and corrupt freelists, causing reproducible SEGV at ~100k iters.
Changes
- Superslab: compute capacity with effective stride (block_size + header for
classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
before splicing into freelist (already present).
Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.
Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
to isolate P0 batch carve, and continue reducing branch-miss as planned.
2025-11-09 18:55:50 +09:00