hakmem/core/bench_profile.h
Moe Charm (CI) f8e7cf05b4 Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).
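
The GO/NEUTRAL decision above can be sketched as a small helper. This is only an illustration: the +1.0% GO threshold comes from this commit message, while the type and function names and the rule that any negative delta counts as a regression are assumptions of the sketch.

```c
/* Verdict labels as used in the phase write-ups.
 * VERDICT_REGRESSION is an assumption of this sketch. */
typedef enum { VERDICT_GO, VERDICT_NEUTRAL, VERDICT_REGRESSION } ab_verdict;

/* Relative throughput change of candidate vs. baseline, in percent. */
static double ab_delta_percent(double baseline_ops, double candidate_ops) {
    return (candidate_ops / baseline_ops - 1.0) * 100.0;
}

/* go_threshold_pct is +1.0 for Phase 16 (from the commit message).
 * Anything between 0 and the threshold is NEUTRAL. */
static ab_verdict ab_classify(double delta_pct, double go_threshold_pct) {
    if (delta_pct >= go_threshold_pct) return VERDICT_GO;
    if (delta_pct < 0.0) return VERDICT_REGRESSION;
    return VERDICT_NEUTRAL;
}
```

With the Phase 16 figure, `ab_classify(0.62, 1.0)` yields `VERDICT_NEUTRAL`, which matches the verdict recorded here.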

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate the "system malloc faster" observation using same-binary
A/B testing to separate the allocator-logic difference from the binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
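
A same-binary toggle of this kind might look roughly like the sketch below. Only the variable name `HAKMEM_FORCE_LIBC_ALLOC` is taken from this commit message; `custom_alloc`, `force_libc_enabled`, and `bench_alloc` are hypothetical names, and the real hakmem wiring may differ.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the allocator under test. It forwards to
 * libc here only so the sketch stays self-contained. */
static void* custom_alloc(size_t n) { return malloc(n); }

/* Read the toggle once and cache it: -1 = not read yet, 0 = off, 1 = on. */
static int force_libc_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char* e = getenv("HAKMEM_FORCE_LIBC_ALLOC");
        cached = (e != NULL && strcmp(e, "1") == 0) ? 1 : 0;
    }
    return cached;
}

/* Both paths are compiled into the same binary, so text layout is
 * identical for the A and B runs; only the allocator logic changes. */
static void* bench_alloc(size_t n) {
    return force_libc_enabled() ? malloc(n) : custom_alloc(n);
}
```

Because layout is held constant, any throughput difference between the two settings can be attributed to allocator logic rather than to binary size.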

Result: **Case B confirmed**: the allocator difference is negligible; the layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)
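
To compare the two binaries per iteration, the counters above can be normalized by the 200M iteration count. The sketch below assumes the perf totals are dominated by the benchmark loop (process startup is ignored); the struct and function names are illustrative only.

```c
#include <stdio.h>

/* One perf-stat sample; values are whole-run totals. */
typedef struct {
    const char* name;
    double cycles;
    double instructions;
} perf_sample;

/* Counter total divided by the number of benchmark iterations. */
static double per_iter(double total, double iters) { return total / iters; }

/* Instructions retired per cycle. */
static double ipc(const perf_sample* s) { return s->instructions / s->cycles; }

/* Print per-iteration figures for one sample. */
static void perf_report(const perf_sample* s, double iters) {
    printf("%-8s %.1f instr/iter, %.1f cycles/iter, IPC %.2f\n",
           s->name,
           per_iter(s->instructions, iters),
           per_iter(s->cycles, iters),
           ipc(s));
}
```

Calling `perf_report` with `{"hakmem", 17.9e9, 41.3e9}` and `{"system", 10.2e9, 21.5e9}` at `200e6` iterations puts the instruction and cycle gaps on a per-operation basis.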

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)
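
As a rough illustration of the hot/cold attribute part of v1, the sketch below marks a fast path `hot` and a rarely taken fallback `cold, noinline` using GCC/Clang function attributes. The arena and all `sketch_*` names are hypothetical and not hakmem code; the sketch never frees and is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_ARENA_SIZE 4096
static _Alignas(16) unsigned char sketch_arena[SKETCH_ARENA_SIZE];
static size_t sketch_used = 0;

/* Cold path: rarely taken. noinline keeps its body out of the caller,
 * and cold lets the compiler place it away from the hot text. */
__attribute__((cold, noinline))
static void* sketch_alloc_slow(size_t n) {
    return malloc(n);
}

/* Hot path: bump allocation from a static arena. */
__attribute__((hot))
static void* sketch_alloc(size_t n) {
    size_t left = SKETCH_ARENA_SIZE - sketch_used;   /* always a multiple of 16 */
    if (__builtin_expect(n <= left, 1)) {
        size_t need = (n + 15u) & ~(size_t)15u;      /* round up to 16 bytes */
        void* p = sketch_arena + sketch_used;
        sketch_used += need;
        return p;
    }
    return sketch_alloc_slow(n);
}
```

For the optional gc-sections step, the usual GCC/binutils combination is `-ffunction-sections -fdata-sections` at compile time and `-Wl,--gc-sections` at link time; whether the `HOT_TEXT_ISOLATION` knob uses exactly these flags is not stated here.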

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement
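
The compile-out idea in v2 could be sketched as below. The macro `SKETCH_BENCH_MINIMAL` and the counter are hypothetical; the commit message only names the concept "BENCH_MINIMAL", so the real macro name and the set of removed features may differ.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical build knob: define to 1 (e.g. -DSKETCH_BENCH_MINIMAL=1)
 * to remove statistics from the hot path at compile time. */
#ifndef SKETCH_BENCH_MINIMAL
#define SKETCH_BENCH_MINIMAL 0
#endif

#if SKETCH_BENCH_MINIMAL
/* Expands to nothing observable: no load, no store, no branch. */
#  define SKETCH_STAT_INC(counter) ((void)0)
#else
static unsigned long sketch_alloc_calls;
#  define SKETCH_STAT_INC(counter) ((void)++(counter))
#endif

/* Hot path: the stats update disappears entirely in the minimal build. */
static void* sketch_malloc(size_t n) {
    SKETCH_STAT_INC(sketch_alloc_calls);
    return malloc(n);
}
```

Unlike a runtime ENV check, this removes the instructions themselves, which is why it targets instruction count rather than branch shape.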

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00

#pragma once
#include <dlfcn.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
#ifdef USE_HAKMEM
#include "box/wrapper_env_box.h" // wrapper_env_refresh_from_env (Phase 2 B4)
#include "box/tiny_static_route_box.h" // tiny_static_route_refresh_from_env (Phase 3 C3)
#include "box/hakmem_env_snapshot_box.h" // hakmem_env_snapshot_refresh_from_env (Phase 4 E1)
#include "box/tiny_free_route_cache_env_box.h" // tiny_free_static_route_refresh_from_env (Phase 8)
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
#endif
// Set a default value only when the environment variable is not already set.
static inline void bench_setenv_default(const char* key, const char* val) {
    if (getenv(key) != NULL) return;
    static void* (*real_malloc)(size_t) = NULL;
    static int (*real_putenv)(char*) = NULL;
    if (!real_malloc) {
        real_malloc = (void* (*)(size_t))dlsym(RTLD_NEXT, "malloc");
        if (!real_malloc) real_malloc = malloc;
    }
    if (!real_putenv) {
        real_putenv = (int (*)(char*))dlsym(RTLD_NEXT, "putenv");
        if (!real_putenv) real_putenv = putenv;
    }
    size_t klen = strlen(key);
    size_t vlen = strlen(val);
    char* buf = (char*)real_malloc(klen + vlen + 2); // "key=val" plus NUL
    if (!buf) return;
    memcpy(buf, key, klen);
    buf[klen] = '=';
    memcpy(buf + klen + 1, val, vlen);
    buf[klen + 1 + vlen] = '\0';
    {
        char msg[256];
        int n = snprintf(msg, sizeof(msg), "[bench_profile] set %s=%s\n", key, val);
        if (n > 0) {
            // snprintf returns the untruncated length; clamp to the bytes msg holds (excluding the NUL).
            if (n >= (int)sizeof(msg)) n = (int)sizeof(msg) - 1;
            ssize_t w = write(2, msg, (size_t)n);
            (void)w;
        }
    }
    real_putenv(buf); // takes ownership; do not free
}
// Bench only: preset ENV variables according to HAKMEM_PROFILE.
static inline void bench_apply_profile(void) {
    const char* p = getenv("HAKMEM_PROFILE");
    if (!p || !*p) return;
    if (strcmp(p, "MIXED_TINYV3_C7_SAFE") == 0) {
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C7_HOT", "1");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x0");
        bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED", "0");
        bench_setenv_default("HAKMEM_SMALL_SEGMENT_V4_ENABLED", "0");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
        bench_setenv_default("HAKMEM_TINY_FRONT_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_TINY_FRONT_V3_LUT_ENABLED", "1");
        bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED", "1");
        // Phase FREE-TINY-FAST-DUALHOT-1: C0-C3 direct fast free (skip policy snapshot)
        bench_setenv_default("HAKMEM_FREE_TINY_FAST_HOTCOLD", "1");
        // Phase 2 B4: Wrapper hot/cold split (malloc/free wrapper shape)
        bench_setenv_default("HAKMEM_WRAP_SHAPE", "1");
        // Phase 4 E1: ENV Snapshot Consolidation (+3.92% proven on Mixed)
        bench_setenv_default("HAKMEM_ENV_SNAPSHOT", "1");
        // Phase 5 E4-1: Free wrapper ENV snapshot (+3.51% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT", "1");
        // Phase 5 E4-2: Malloc wrapper ENV snapshot (+21.83% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT", "1");
        // Phase 5 E5-1: Free Tiny Direct Path (+3.35% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FREE_TINY_DIRECT", "1");
        // Phase 6-1: Front FastLane (Layer Collapse) (+11.13% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
        // Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
        // Phase 9: FREE-TINY-FAST MONO DUALHOT (+2.72% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_DUALHOT", "1");
        // Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (+1.89% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT", "1");
        // Phase 4-4: C6 ULTRA free+alloc integration (default OFF, manual opt-in)
        bench_setenv_default("HAKMEM_TINY_C6_ULTRA_FREE_ENABLED", "0");
        // Phase MID-V3: Mid/Pool HotBox v3
        // On Mixed (16-1024B), MID_V3 (C6) is much slower, so it is pinned OFF by default.
        // Turning it ON is recommended only in the C6-heavy profiles (only C6-heavy is an optimization target).
        bench_setenv_default("HAKMEM_MID_V3_ENABLED", "0");
        bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x0");
        // Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
        bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
        // Phase 3 C3: Static routing (policy_snapshot bypass, +2.2% proven)
        bench_setenv_default("HAKMEM_TINY_STATIC_ROUTE", "1");
        // Phase 3 D1: Free route cache (TLS cache for free path routing, +2.19% proven)
        bench_setenv_default("HAKMEM_FREE_STATIC_ROUTE", "1");
    } else if (strcmp(p, "C6_HEAVY_LEGACY_POOLV1") == 0) {
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
        bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_ENABLED", "0");
        bench_setenv_default("HAKMEM_MID_DESC_CACHE_ENABLED", "1");
        // Phase 4-4: C6 ULTRA free+alloc integration (default OFF, manual opt-in)
        bench_setenv_default("HAKMEM_TINY_C6_ULTRA_FREE_ENABLED", "0");
        // Phase MID-V3: Mid/Pool HotBox v3 (257-768B, C6 only)
        bench_setenv_default("HAKMEM_MID_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x40");
        // Phase 6-1: Front FastLane (Layer Collapse) (+11.13% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FRONT_FASTLANE", "1");
        // Phase 6-2: Front FastLane Free DeDup (+5.18% proven on Mixed, 10-run)
        bench_setenv_default("HAKMEM_FRONT_FASTLANE_FREE_DEDUP", "1");
        // Phase 2 B3: Routing branch shape optimization (LIKELY on LEGACY, cold helper for rare routes)
        bench_setenv_default("HAKMEM_TINY_ALLOC_ROUTE_SHAPE", "1");
    } else if (strcmp(p, "C6_V7_STUB") == 0) {
        // Phase v7-1: C6-only v7 stub experiment (MID v3 fallback)
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
        bench_setenv_default("HAKMEM_MID_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_MID_V3_CLASSES", "0x40");
        // v7 stub ON (C6-only)
        bench_setenv_default("HAKMEM_SMALL_HEAP_V7_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V7_CLASSES", "0x40");
    } else if (strcmp(p, "C6_HEAVY_LEGACY_POOLV1_FLATTEN") == 0) {
        // LEGACY mid/smallmid bench only (not used with C7_SAFE)
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "LEGACY");
        bench_setenv_default("HAKMEM_TINY_C6_HOT", "0");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
        bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_ENABLED", "1");
        bench_setenv_default("HAKMEM_POOL_V1_FLATTEN_STATS", "1");
        bench_setenv_default("HAKMEM_POOL_ZERO_MODE", "header");
    } else if (strcmp(p, "DEBUG_TINY_FRONT_PERF") == 0) {
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C7_HOT", "1");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x80");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
        bench_setenv_default("HAKMEM_TINY_FRONT_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_TINY_FRONT_V3_LUT_ENABLED", "1");
        bench_setenv_default("HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED", "1");
    } else if (strcmp(p, "C6_SMALL_HEAP_V3_EXPERIMENT") == 0) {
        // Research use: route C6 through SmallObject v3 (not used by default)
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C6_HOT", "1");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x40"); // C6 only
        bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x0");
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
    } else if (strcmp(p, "C6_SMALL_HEAP_V4_EXPERIMENT") == 0) {
        // Research use: route C6 through SmallObject v4 (not used by default)
        bench_setenv_default("HAKMEM_TINY_HEAP_PROFILE", "C7_SAFE");
        bench_setenv_default("HAKMEM_TINY_C6_HOT", "1");
        bench_setenv_default("HAKMEM_TINY_HOTHEAP_V2", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_ENABLED", "0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V3_CLASSES", "0x0");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V4_ENABLED", "1");
        bench_setenv_default("HAKMEM_SMALL_HEAP_V4_CLASSES", "0x40"); // C6 only
        bench_setenv_default("HAKMEM_POOL_V2_ENABLED", "0");
    }
#ifdef USE_HAKMEM
    // Phase 3 C3 Step 0: Ensure policy snapshot reflects final ENV after putenv defaults.
    small_policy_v7_bump_version();
    // Phase 2 B4: Sync wrapper ENV cache after bench_profile putenv defaults.
    wrapper_env_refresh_from_env();
    // Phase 3 C3: Sync static route cache after bench_profile putenv defaults.
    tiny_static_route_refresh_from_env();
    // Phase 4 E1: Sync ENV snapshot cache after bench_profile putenv defaults.
    hakmem_env_snapshot_refresh_from_env();
    // Phase 8: Sync free static route ENV cache after bench_profile putenv defaults.
    tiny_free_static_route_refresh_from_env();
    // Phase 13 v1: Sync C7 preserve header ENV cache after bench_profile putenv defaults.
    tiny_c7_preserve_header_env_refresh_from_env();
    // Phase 14 v1: Sync tcache ENV cache after bench_profile putenv defaults.
    tiny_tcache_env_refresh_from_env();
    // Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
    tiny_unified_lifo_env_refresh_from_env();
    // Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
    front_fastlane_alloc_legacy_direct_env_refresh_from_env();
#endif
}
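
The reason for the block of `*_refresh_from_env()` calls at the end of `bench_apply_profile` can be illustrated with a minimal sketch: a flag that caches `getenv` on first use goes stale if a default is installed afterwards, so the cache has to be re-read. Everything named `sketch_*` and the variable `SKETCH_FLAG` are hypothetical. The sketch uses POSIX `setenv(key, val, 0)` for the "only if unset" semantics, whereas the header above builds the string itself and calls `putenv` through `dlsym(RTLD_NEXT, ...)`, presumably to avoid going through the hooked allocator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical cached flag: -1 = not read yet, 0 = off, 1 = on. */
static int sketch_flag_cached = -1;

/* Re-read the environment and overwrite the cached value. */
static void sketch_flag_refresh_from_env(void) {
    const char* e = getenv("SKETCH_FLAG");
    sketch_flag_cached = (e != NULL && strcmp(e, "1") == 0) ? 1 : 0;
}

/* Hot-path accessor: reads the environment only on first use. */
static int sketch_flag_enabled(void) {
    if (sketch_flag_cached < 0) sketch_flag_refresh_from_env();
    return sketch_flag_cached;
}

/* Install a default only when the variable is unset.
 * With overwrite == 0, setenv leaves an existing value untouched. */
static void sketch_setenv_default(const char* key, const char* val) {
    setenv(key, val, 0);
}
```

If `sketch_flag_enabled()` runs before `sketch_setenv_default("SKETCH_FLAG", "1")`, it keeps returning 0 until `sketch_flag_refresh_from_env()` is called, which is the situation the refresh calls guard against.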