2025-12-14 16:28:23 +09:00
#pragma once

#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

#ifdef USE_HAKMEM
#include "box/wrapper_env_box.h" // wrapper_env_refresh_from_env (Phase 2 B4)
#include "box/tiny_static_route_box.h" // tiny_static_route_refresh_from_env (Phase 3 C3)
#include "box/hakmem_env_snapshot_box.h" // hakmem_env_snapshot_refresh_from_env (Phase 4 E1)
2025-12-14 18:49:08 +09:00
#include "box/tiny_free_route_cache_env_box.h" // tiny_free_static_route_refresh_from_env (Phase 8)
2025-12-15 00:32:25 +09:00
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
2025-12-15 01:28:50 +09:00
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
Phase 15 v1: UnifiedCache FIFO→LIFO NEUTRAL (-0.70% Mixed, +0.42% C7)
Transform existing array-based UnifiedCache from FIFO ring to LIFO stack.
A/B Results:
- Mixed (16-1024B): -0.70% (52,965,966 → 52,593,948 ops/s)
- C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s)
Verdict: NEUTRAL (both below +1.0% GO threshold) - freeze as research box
Implementation:
- L0 ENV gate: tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1)
- L1 LIFO ops: tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo)
- L2 integration: tiny_front_hot_box.h (mode check at entry)
- Reuses existing slots[] array (no intrusive pointers)
Root Causes:
1. Mode check overhead (tiny_unified_lifo_enabled() call)
2. Minimal LIFO vs FIFO locality delta in practice
3. Existing FIFO ring already well-optimized
Bonus Fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
- Converted static inline to extern + non-inline implementation
- Fixes undefined reference during LTO linking
Design: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md
Results: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 02:19:26 +09:00
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic difference vs binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed**: allocator difference negligible, layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)
## Phase 17 v2: FORCE_LIBC Gap Validation Fix
**Critical bug fix**: The Phase 17 v1 measurement was broken
**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 only took effect after the FastLane path,
so the same-binary A/B was effectively "hakmem vs hakmem" (the +0.39% was a mismeasurement)
**Fix**: Added an early bypass for g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171 and :645,
going straight to __libc_malloc/__libc_free before anything else
**Result**: Correct same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)
**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)
**Conclusion**: Case A confirmed (allocator dominant, NOT layout)
The Phase 17 v1 Case B verdict was wrong.
Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)
---
## Phase 19: FastLane Instruction Reduction Analysis
**Goal**: Reduce the instruction gap to libc (-35% instructions, -56% branches)
**perf stat analysis** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)
**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**
**Reduction candidates**:
- A: Remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate ENV snapshot (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)
Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md
---
## Phase 19-1b: FastLane Direct — GO (+5.88%)
**Strategy**: Bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()
**Why Phase 19-1 was NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired (unfair A/B)
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)
**Phase 19-1b fixes**:
1. Removed __builtin_expect()
2. Call free_tiny_fast() directly
**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)
**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)
**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs
**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
- HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
- Single _Atomic global (resolves the wrapper cache problem)
2. **Wrapper changes**: core/box/hak_wrappers.inc.h
- malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
- free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1
- Safety: the direct path is not used while !g_initialized; the fallback is retained
3. **Preset promotion**: core/bench_profile.h:88
- bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
- Comment: +5.88% proven on Mixed, 10-run
4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
- HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
- Promoted in the same way as Phase 9/10
**Verdict**: GO (adopted on mainline, preset promotion complete)
**Rollback**: Set HAKMEM_FASTLANE_DIRECT=0 to return to the existing FastLane path
Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md
---
## Cumulative Performance
- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**
Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.2%** (was +62.7% before Phase 19-1b)
Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
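The percentages quoted in this commit message are relative throughput deltas, i.e. (new / base - 1) × 100. The sketch below is only a convenience for re-deriving them from the ops/s figures above; `pct_delta` is a hypothetical name, not something defined in the repository.

```c
#include <assert.h>

/* Hypothetical helper: relative change of `value` against `base`, in percent.
 * Both inputs are throughputs (e.g. M ops/s); base must be positive. */
static double pct_delta(double base, double value) {
    assert(base > 0.0);
    return (value / base - 1.0) * 100.0;
}
```

For example, `pct_delta(49.17, 52.06)` comes out at roughly 5.88, which matches the Phase 19-1b delta, and `pct_delta(48.99, 79.72)` at roughly 62.7, which matches the hakmem vs libc gap.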
2025-12-15 11:28:40 +09:00
#include "box/fastlane_direct_env_box.h" // fastlane_direct_env_refresh_from_env (Phase 19-1)
2025-12-16 05:35:11 +09:00
#include "box/tiny_header_hotfull_env_box.h" // tiny_header_hotfull_env_refresh_from_env (Phase 21)
2025-12-14 16:28:23 +09:00
#endif

// Set the default value only when the env var is not already set
static inline void bench_setenv_default(const char* key, const char* val) {
    if (getenv(key) != NULL) return;
    static void* (*real_malloc)(size_t) = NULL;
    static int (*real_putenv)(char*) = NULL;
    if (!real_malloc) {
        real_malloc = (void* (*)(size_t))dlsym(RTLD_NEXT, "malloc");
        if (!real_malloc) real_malloc = malloc;
    }
    if (!real_putenv) {
        real_putenv = (int (*)(char*))dlsym(RTLD_NEXT, "putenv");
        if (!real_putenv) real_putenv = putenv;
    }
    size_t klen = strlen(key);
    size_t vlen = strlen(val);
    char* buf = (char*)real_malloc(klen + vlen + 2);
    if (!buf) return;
    memcpy(buf, key, klen);
    buf[klen] = '=';
    memcpy(buf + klen + 1, val, vlen);
    buf[klen + 1 + vlen] = '\0';
    {
        char msg[256];
        int n = snprintf(msg, sizeof(msg), "[bench_profile] set %s=%s\n", key, val);
        if (n > 0) {
            // snprintf returns the untruncated length; clamp to the bytes msg really holds (without the NUL)
            if (n >= (int)sizeof(msg)) n = (int)sizeof(msg) - 1;
            ssize_t w = write(2, msg, (size_t)n);
            (void)w;
        }
    }
    real_putenv(buf); // takes ownership; do not free
}
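The "set only if unset" behavior above can also be expressed with POSIX `setenv()` and `overwrite == 0`. The following is a minimal sketch of that alternative, not the approach this header takes: the original resolves `malloc` and `putenv` through `dlsym(RTLD_NEXT, ...)`, presumably to stay out of the wrapped allocator, whereas `setenv()` may allocate internally. The name `setenv_default_sketch` is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for setenv()/unsetenv() under strict -std modes */
#endif
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: set key=val only when key is not already present.
 * With overwrite == 0, setenv() leaves an existing value untouched.
 * Returns 0 on success, -1 on error (as setenv() does). */
static int setenv_default_sketch(const char* key, const char* val) {
    assert(key != NULL && val != NULL);
    return setenv(key, val, 0);
}
```

Unlike `putenv()`, `setenv()` copies the strings, so the caller keeps ownership of `key` and `val`.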

// Bench only: preset ENV according to HAKMEM_PROFILE
Phase 54-60: Memory-Lean mode, Balanced mode stabilization, M1 (50%) achievement
## Summary
Completed Phase 54-60 optimization work:
**Phase 54-56: Memory-Lean mode (LEAN+OFF prewarm suppression)**
- Implemented ss_mem_lean_env_box.h with ENV gates
- Balanced mode (LEAN+OFF) promoted as production default
- Result: +1.2% throughput, better stability, zero syscall overhead
- Added to bench_profile.h: MIXED_TINYV3_C7_BALANCED preset
**Phase 57: 60-min soak finalization**
- Balanced mode: 60-min soak, RSS drift 0%, CV 5.38%
- Speed-first mode: 60-min soak, RSS drift 0%, CV 1.58%
- Syscall budget: 1.25e-7/op (800× under target)
- Status: PRODUCTION-READY
**Phase 59: 50% recovery baseline rebase**
- hakmem FAST (Balanced): 59.184M ops/s, CV 1.31%
- mimalloc: 120.466M ops/s, CV 3.50%
- Ratio: 49.13% (M1 ACHIEVED within statistical noise)
- Superior stability: 2.68× better CV than mimalloc
**Phase 60: Alloc pass-down SSOT (NO-GO)**
- Implemented alloc_passdown_ssot_env_box.h
- Modified malloc_tiny_fast.h for SSOT pattern
- Result: -0.46% (NO-GO)
- Key lesson: SSOT not applicable where early-exit already optimized
## Key Metrics
- Performance: 49.13% of mimalloc (M1 effectively achieved)
- Stability: CV 1.31% (superior to mimalloc 3.50%)
- Syscall budget: 1.25e-7/op (excellent)
- RSS: 33MB stable, 0% drift over 60 minutes
## Files Added/Modified
New boxes:
- core/box/ss_mem_lean_env_box.h
- core/box/ss_release_policy_box.{h,c}
- core/box/alloc_passdown_ssot_env_box.h
Scripts:
- scripts/soak_mixed_single_process.sh
- scripts/analyze_epoch_tail_csv.py
- scripts/soak_mixed_rss.sh
- scripts/calculate_percentiles.py
- scripts/analyze_soak.py
Documentation: Phase 40-60 analysis documents
## Design Decisions
1. Profile separation (core/bench_profile.h):
- MIXED_TINYV3_C7_SAFE: Speed-first (no LEAN)
- MIXED_TINYV3_C7_BALANCED: Balanced mode (LEAN+OFF)
2. Box Theory compliance:
- All ENV gates reversible (HAKMEM_SS_MEM_LEAN, HAKMEM_ALLOC_PASSDOWN_SSOT)
- Single conversion points maintained
- No physical deletions (compile-out only)
3. Lessons learned:
- SSOT effective only where redundancy exists (Phase 60 showed limits)
- Branch prediction extremely effective (~0 cycles for well-predicted branches)
- Early-exit pattern valuable even when seemingly redundant
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-17 06:24:01 +09:00
static inline void bench_apply_mixed_tinyv3_c7_common(void) {
    bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
    bench_setenv_default("HAKMEM_TINY_C7_HOT", "1");
    bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
    bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
    bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
    bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "0");
    bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x0");
    bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED", "0");
    bench_setenv_default("HAKMEM_SMALL_SEGMENT_V4_ENABLED", "0");
    bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
    bench_setenv_default("HAKMEM_TINY_FRONT_V3_ENABLED", "1");
    bench_setenv_default("HAKMEM_TINY_FRONT_V3_LUT_ENABLED", "1");
    bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED", "1");
    // Phase FREE-TINY-FAST-DUALHOT-1: C0-C3 direct fast free (skip policy snapshot)
    bench_setenv_default("HAKMEM_FREE_TINY_FAST_HOTCOLD", "1");
    // Phase 2 B4: Wrapper hot/cold split (malloc/free wrapper shape)
    bench_setenv_default("HAKMEM_WRAP_SHAPE", "1");
    // Phase 4 E1: ENV Snapshot Consolidation (+3.92% proven on Mixed)
    bench_setenv_default("HAKMEM_ENV_SNAPSHOT", "1");
    // Phase 5 E4-1: Free wrapper ENV snapshot (+3.51% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT", "1");
    // Phase 5 E4-2: Malloc wrapper ENV snapshot (+21.83% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT", "1");
    // Phase 5 E5-1: Free Tiny Direct Path (+3.35% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FREE_TINY_DIRECT", "1");
    // Phase 6-1: Front FastLane (Layer Collapse) (+11.13% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
    // Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
    // Phase 21: Tiny Header HotFull (alloc header hot/cold split; opt-out with 0)
    bench_setenv_default("HAKMEM_TINY_HEADER_HOTFULL", "1");
    // Phase 19-1b: FastLane Direct (wrapper layer bypass, +5.88% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
    // Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
    // Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
    bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT", "1");
    // Phase 4-4: C6 ULTRA free+alloc integration (default OFF, manual opt-in)
    bench_setenv_default("HAKMEM_TINY_C6_ULTRA_FREE_ENABLED", "0");
    // Phase MID-V3: Mid/Pool HotBox v3
    // MID_V3 (C6) is much slower on Mixed (16–1024B), so it is pinned OFF by default.
    // Turning it ON is recommended only in the C6-heavy profiles (C6-heavy is the only optimization target).
    bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");
    bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
    // Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
    bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
    // Phase 3 C3: Static routing (policy_snapshot bypass, +2.2% proven)
    bench_setenv_default("HAKMEM_TINY_STATIC_ROUTE", "1");
    // Phase 3 D1: Free route cache (TLS cache for free path routing, +2.19% proven)
    bench_setenv_default("HAKMEM_FREE_STATIC_ROUTE", "1");
2025-12-18 01:55:27 +09:00
    // Phase 69-1: Warm Pool Size=16 (+3.26% Strong GO, ENV-only)
    bench_setenv_default("HAKMEM_WARM_POOL_SIZE", "16");
2025-12-17 06:24:01 +09:00
}
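The `*_CLASSES` values in these presets read like per-class bitmasks: elsewhere in this file `"0x40"` is annotated as "C6 only", and the C7 presets use `"0x80"`. The sketch below decodes such a value under the assumption that bit N selects tiny class CN and that there are eight classes (C0 to C7). Both helper names are hypothetical, and this is not hakmem's actual parser.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical: parse a class-mask string such as "0x80".
 * Base 0 lets strtoul() accept the 0x prefix; NULL or empty yields 0. */
static unsigned long class_mask_parse(const char* s) {
    if (s == NULL || *s == '\0') return 0UL;
    return strtoul(s, NULL, 0);
}

/* Hypothetical: returns 1 when class `cls` is selected by `mask`.
 * Assumes bit N corresponds to class CN, with classes C0..C7. */
static int class_mask_has(unsigned long mask, unsigned cls) {
    assert(cls < 8);
    return (int)((mask >> cls) & 1UL);
}
```

Under that assumption, `"0x80"` selects only C7 and `"0x40"` selects only C6, which agrees with the "C6 only" annotations in the experiment profiles.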
2025-12-14 16:28:23 +09:00
static inline void bench_apply_profile(void) {
    const char* p = getenv("HAKMEM_PROFILE");
    if (!p || !*p) return;

    if (strcmp(p, "MIXED_TINYV3_C7_SAFE") == 0) {
2025-12-17 06:24:01 +09:00
        // Speed-first default (Phase 57): do not set HAKMEM_SS_MEM_LEAN here.
        bench_apply_mixed_tinyv3_c7_common();
    } else if (strcmp(p, "MIXED_TINYV3_C7_BALANCED") == 0) {
        // Balanced mode (Phase 55/56): LEAN+OFF (prewarm suppression only).
        bench_apply_mixed_tinyv3_c7_common();
        bench_setenv_default("HAKMEM_SS_MEM_LEAN", "1");
        bench_setenv_default("HAKMEM_SS_MEM_LEAN_DECOMMIT", "OFF");
        bench_setenv_default("HAKMEM_SS_MEM_LEAN_TARGET_MB", "10");
2025-12-14 16:28:23 +09:00
    } else if (strcmp(p, "C6_HEAVY_LEGACY_POOLV1") == 0) {
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
        bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_ENABLED", "0");
        bench_setenv_default("HAKMEM_MID_DESC_CACHE_ENABLED", "1");
        // Phase 4-4: C6 ULTRA free+alloc integration (default OFF, manual opt-in)
        bench_setenv_default("HAKMEM_TINY_C6_ULTRA_FREE_ENABLED", "0");
        // Phase MID-V3: Mid/Pool HotBox v3 (257-768B, C6 only)
        bench_setenv_default("HAKMEM_MID_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x40");
2025-12-14 17:38:21 +09:00
        // Phase 6-1: Front FastLane (Layer Collapse) (+11.13% proven on Mixed, 10-run)
2025-12-14 16:28:23 +09:00
        bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
2025-12-14 17:38:21 +09:00
        // Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
2025-12-16 05:35:11 +09:00
        // Phase 21: Tiny Header HotFull (alloc header hot/cold split; opt-out with 0)
        bench_setenv_default("HAKMEM_TINY_HEADER_HOTFULL", "1");
2025-12-15 11:28:40 +09:00
        // Phase 19-1b: FastLane Direct (wrapper layer bypass)
        bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1");
2025-12-14 16:28:23 +09:00
|
|
|
|
// Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
|
|
|
|
|
|
} else if (strcmp(p, "C6_V7_STUB") == 0) {
|
|
|
|
|
|
// Phase v7-1: C6-only v7 stub 実験用(MID v3 fallback)
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_MID_V3_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x40");
|
|
|
|
|
|
// v7 stub ON (C6-only)
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V7_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V7_CLASSES", "0x40");
|
|
|
|
|
|
} else if (strcmp(p, "C6_HEAVY_LEGACY_POOLV1_FLATTEN") == 0) {
|
|
|
|
|
|
// LEGACY mid/smallmid ベンチ専用(C7_SAFE では使用しない)
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "LEGACY");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_STATS", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_ZERO_MODE", "header");
|
|
|
|
|
|
} else if (strcmp(p, "DEBUG_TINY_FRONT_PERF") == 0) {
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_C7_HOT", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_FRONT_V3_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_FRONT_V3_LUT_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED", "1");
|
|
|
|
|
|
} else if (strcmp(p, "C6_SMALL_HEAP_V3_EXPERIMENT") == 0) {
|
|
|
|
|
|
// Research profile: routes C6 through SmallObject v3 (not used by default)
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_C6_HOT", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x40"); // C6 only
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
|
|
|
|
|
|
} else if (strcmp(p, "C6_SMALL_HEAP_V4_EXPERIMENT") == 0) {
|
|
|
|
|
|
// Research profile: routes C6 through SmallObject v4 (not used by default)
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_C6_HOT", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x0");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "1");
|
|
|
|
|
|
bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x40"); // C6 only
|
|
|
|
|
|
bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
#ifdef USE_HAKMEM
|
|
|
|
|
|
// Phase 3 C3 Step 0: Ensure policy snapshot reflects final ENV after putenv defaults.
|
|
|
|
|
|
small_policy_v7_bump_version();
|
|
|
|
|
|
// Phase 2 B4: Sync wrapper ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
wrapper_env_refresh_from_env();
|
|
|
|
|
|
// Phase 3 C3: Sync static route cache after bench_profile putenv defaults.
|
|
|
|
|
|
tiny_static_route_refresh_from_env();
|
|
|
|
|
|
// Phase 4 E1: Sync ENV snapshot cache after bench_profile putenv defaults.
|
|
|
|
|
|
hakmem_env_snapshot_refresh_from_env();
|
2025-12-14 18:49:08 +09:00
|
|
|
|
// Phase 8: Sync free static route ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
tiny_free_static_route_refresh_from_env();
|
2025-12-15 00:32:25 +09:00
|
|
|
|
// Phase 13 v1: Sync C7 preserve header ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
tiny_c7_preserve_header_env_refresh_from_env();
|
2025-12-15 01:28:50 +09:00
|
|
|
|
// Phase 14 v1: Sync tcache ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
tiny_tcache_env_refresh_from_env();
|
Phase 15 v1: UnifiedCache FIFO→LIFO NEUTRAL (-0.70% Mixed, +0.42% C7)
Transform existing array-based UnifiedCache from FIFO ring to LIFO stack.
A/B Results:
- Mixed (16-1024B): -0.70% (52,965,966 → 52,593,948 ops/s)
- C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s)
Verdict: NEUTRAL (both below +1.0% GO threshold) - freeze as research box
Implementation:
- L0 ENV gate: tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1)
- L1 LIFO ops: tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo)
- L2 integration: tiny_front_hot_box.h (mode check at entry)
- Reuses existing slots[] array (no intrusive pointers)
Root Causes:
1. Mode check overhead (tiny_unified_lifo_enabled() call)
2. Minimal LIFO vs FIFO locality delta in practice
3. Existing FIFO ring already well-optimized
Bonus Fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
- Converted static inline to extern + non-inline implementation
- Fixes undefined reference during LTO linking
Design: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md
Results: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 02:19:26 +09:00
|
|
|
|
// Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
tiny_unified_lifo_env_refresh_from_env();
|
Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)
Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.
Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).
Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)
ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)
Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.
---
## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed
Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate the allocator logic difference from the binary layout penalty.
Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
Result: **Case B confirmed**. The allocator difference is negligible; the layout penalty dominates.
Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.
Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.
Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)
Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)
---
## Phase 18: Hot Text Isolation — Design Added
Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).
Strategy (v1 → v2 progression):
v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)
Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00
|
|
|
|
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
|
2025-12-16 05:35:11 +09:00
|
|
|
|
// Phase 19-1: Sync FastLane Direct ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
fastlane_direct_env_refresh_from_env();
|
|
|
|
|
|
// Phase 21: Sync Tiny Header HotFull ENV cache after bench_profile putenv defaults.
|
|
|
|
|
|
tiny_header_hotfull_env_refresh_from_env();
|
2025-12-14 16:28:23 +09:00
|
|
|
|
#endif
|
2025-12-16 05:35:11 +09:00
|
|
|
|
}
|
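The profile branches above rely on two conventions that a short sketch may make easier to follow: defaults are applied only when the user has not already set the variable, and the `*_CLASSES` values are hex bitmasks. Both helpers below are hypothetical stand-ins, not the project's code. `sketch_setenv_default` assumes `bench_setenv_default` behaves like `setenv` with `overwrite = 0`, and `sketch_class_in_mask` assumes bit `i` of the mask selects class `Ci`, which matches the `0x40` "C6 only" comment (and would make `0x80` mean C7).

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for bench_setenv_default (assumed semantics):
 * apply a default only when the variable is not already set. */
static void sketch_setenv_default(const char *name, const char *value) {
    assert(name != NULL && value != NULL);
    /* overwrite = 0 keeps any value the user exported beforehand */
    setenv(name, value, 0);
}

/* Hypothetical helper: returns 1 if class index cls (0..7) is selected
 * by a mask string such as "0x40", assuming bit i selects class Ci. */
static int sketch_class_in_mask(const char *mask_str, int cls) {
    assert(mask_str != NULL && cls >= 0 && cls < 8);
    unsigned long mask = strtoul(mask_str, NULL, 0); /* base 0 accepts the 0x prefix */
    return (int)((mask >> cls) & 1UL);
}
```

Under these assumptions, a value exported before the benchmark starts would win over the profile default, which is why the cached ENV views are refreshed only after all defaults have been applied.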