Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added

## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic差 vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem); see the sketch after this list
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)
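
A minimal sketch of the same-binary toggle pattern, under assumptions: `bench_alloc` and `hak_malloc_impl` are hypothetical names, and the real HAKMEM_FORCE_LIBC_ALLOC plumbing in bench_random_mixed_hakmem may be wired differently. The point is that one env check routes to libc malloc resolved via dlsym(RTLD_NEXT, ...), so both arms run in the same binary with identical code layout:

```c
/* Illustrative sketch only: env-gated libc/hakmem A/B toggle in one binary. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

void* hak_malloc_impl(size_t size);           /* hypothetical hakmem entry point */

static int force_libc(void) {
    static int cached = -1;                   /* -1 = unread, 0 = hakmem, 1 = libc */
    if (cached < 0) {
        const char* e = getenv("HAKMEM_FORCE_LIBC_ALLOC");
        cached = (e && e[0] == '1');
    }
    return cached;
}

void* bench_alloc(size_t size) {
    if (force_libc()) {
        /* Resolve the next (libc) malloc once; hakmem logic is bypassed, but
           binary size, link order, and I-cache footprint stay identical. */
        static void* (*libc_malloc)(size_t) = 0;
        if (!libc_malloc)
            libc_malloc = (void* (*)(size_t))dlsym(RTLD_NEXT, "malloc");
        return libc_malloc(size);
    }
    return hak_malloc_impl(size);
}
```

That layout-identical property is what lets the FORCE_LIBC=1 vs =0 gap isolate the allocator-logic contribution, while the separate 21K system binary exposes the layout penalty.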

Result: **Case B confirmed** — the allocator difference is negligible; the layout penalty dominates.

Gap breakdown (Mixed, 20M iters, ws=400; ratio check after the list):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem
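
For reference, the quoted percentages follow directly from the throughput ratios above (rounded to one decimal here):

$$\frac{48.31}{48.12}-1 \approx +0.4\%,\qquad \frac{83.85}{48.31}-1 \approx +73.6\%,\qquad \frac{83.85}{48.12}-1 \approx +74.3\%$$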

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the primary bet)
- Method: Conditional compilation removes stats/ENV/debug from the hot path (sketch below)
- Expected: Instruction count -30-40% → significant I-cache improvement
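
A minimal sketch of what such a compile-out gate could look like, assuming illustrative macro/counter names (FASTLANE_STAT_INC, g_fastlane_stats), not the existing FRONT_FASTLANE_STAT_INC definitions:

```c
/* Illustrative sketch only: BENCH_MINIMAL compiles observability out of the hot path. */
#include <stdatomic.h>
#include <stdio.h>

typedef struct {
    _Atomic unsigned long malloc_hit;
    _Atomic unsigned long malloc_miss;
} fastlane_stats_t;

extern fastlane_stats_t g_fastlane_stats;     /* illustrative counter block */

#ifdef BENCH_MINIMAL
  /* Hot path: stats/debug hooks collapse to no-ops (zero instructions emitted). */
  #define FASTLANE_STAT_INC(field)  ((void)0)
  #define FASTLANE_DEBUG_LOG(...)   ((void)0)
#else
  /* Full build: keep atomic counters and stderr debug logging. */
  #define FASTLANE_STAT_INC(field) \
      atomic_fetch_add_explicit(&g_fastlane_stats.field, 1, memory_order_relaxed)
  #define FASTLANE_DEBUG_LOG(...)   fprintf(stderr, __VA_ARGS__)
#endif
```

Building with -DBENCH_MINIMAL then removes the counter/log instructions entirely instead of branching around them at run time, which is where the expected instruction-count reduction would come from.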

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)
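
A minimal sketch of how the build gate could expand, assuming hypothetical macro names (HAK_TEXT_HOT/HAK_TEXT_COLD) and section name; the Makefile knob would pass -DHOT_TEXT_ISOLATION=1 and, for the optional gc-sections step, -ffunction-sections -Wl,--gc-sections:

```c
/* Illustrative sketch only: hot/cold text placement behind HOT_TEXT_ISOLATION. */
#include <stddef.h>

#if defined(HOT_TEXT_ISOLATION) && HOT_TEXT_ISOLATION
  /* Pack hot allocator entries into one section; keep cold init/refresh out of line. */
  #define HAK_TEXT_HOT   __attribute__((hot, section(".text.hakmem_hot")))
  #define HAK_TEXT_COLD  __attribute__((cold, noinline))
#else
  #define HAK_TEXT_HOT
  #define HAK_TEXT_COLD
#endif

/* Example placement (signatures assumed for illustration): */
HAK_TEXT_HOT  void* malloc_tiny_fast_for_class(size_t size, int class_idx);
HAK_TEXT_COLD void  front_fastlane_alloc_legacy_direct_env_refresh_from_env(void);
```

Grouping the hot entries into one small .text region is what targets the I-cache miss gap measured in Phase 17, without touching allocator logic.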

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Moe Charm (CI)
2025-12-15 05:25:47 +09:00
parent 87fa27518c
commit f8e7cf05b4
14 changed files with 1292 additions and 5 deletions

core/bench_profile.h

@@ -13,6 +13,7 @@
#include "box/tiny_c7_preserve_header_env_box.h" // tiny_c7_preserve_header_env_refresh_from_env (Phase 13 v1)
#include "box/tiny_tcache_env_box.h" // tiny_tcache_env_refresh_from_env (Phase 14 v1)
#include "box/tiny_unified_lifo_env_box.h" // tiny_unified_lifo_env_refresh_from_env (Phase 15 v1)
#include "box/front_fastlane_alloc_legacy_direct_env_box.h" // front_fastlane_alloc_legacy_direct_env_refresh_from_env (Phase 16 v1)
#endif
// Set default values only when the env var is unset
@@ -193,5 +194,7 @@ static inline void bench_apply_profile(void) {
tiny_tcache_env_refresh_from_env();
// Phase 15 v1: Sync LIFO ENV cache after bench_profile putenv defaults.
tiny_unified_lifo_env_refresh_from_env();
// Phase 16 v1: Sync LEGACY direct ENV cache after bench_profile putenv defaults.
front_fastlane_alloc_legacy_direct_env_refresh_from_env();
#endif
}

core/box/front_fastlane_alloc_legacy_direct_env_box.c

@@ -0,0 +1,63 @@
// ============================================================================
// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0) - Implementation
// ============================================================================
#include "front_fastlane_alloc_legacy_direct_env_box.h"
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>
// ============================================================================
// Global State
// ============================================================================
_Atomic int g_front_fastlane_alloc_legacy_direct_enabled = -1;
// ============================================================================
// Init (Cold Path)
// ============================================================================
int front_fastlane_alloc_legacy_direct_env_init(void) {
const char* env = getenv("HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT");
int enabled = 0; // default: OFF (opt-in)
if (env && (env[0] == '1' || strcmp(env, "true") == 0 || strcmp(env, "TRUE") == 0)) {
enabled = 1;
}
// Cache result
atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, enabled, memory_order_relaxed);
// Log once (stderr for immediate visibility)
if (enabled) {
const char msg[] = "[FRONT_FASTLANE_ALLOC_LEGACY_DIRECT] enabled\n";
ssize_t w = write(2, msg, sizeof(msg) - 1);
(void)w;
}
return enabled;
}
// ============================================================================
// Hot Path (LTO Fallback)
// ============================================================================
// LTO fallback: Non-inline version for cases where LTO can't inline
int front_fastlane_alloc_legacy_direct_enabled(void) {
int val = atomic_load_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, memory_order_relaxed);
if (__builtin_expect(val == -1, 0)) {
val = front_fastlane_alloc_legacy_direct_env_init();
}
return val;
}
// ============================================================================
// Refresh (Cold Path, called from bench_profile)
// ============================================================================
void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void) {
// Reset to uninitialized state (-1)
// Next call to front_fastlane_alloc_legacy_direct_enabled() will re-read ENV
atomic_store_explicit(&g_front_fastlane_alloc_legacy_direct_enabled, -1, memory_order_relaxed);
}

core/box/front_fastlane_alloc_legacy_direct_env_box.h

@@ -0,0 +1,63 @@
// ============================================================================
// Phase 16 v1: Front FastLane Alloc LEGACY Direct ENV Box (L0)
// ============================================================================
//
// Purpose: ENV gate for FastLane alloc LEGACY direct path
//
// Design: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_DESIGN.md
// Instructions: docs/analysis/PHASE16_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
//
// Strategy:
// - Reduce fixed route/policy costs on the alloc side
// - Take the LEGACY direct path at the FastLane entry (hot → cold → fallback)
// - Apply the free-side winning pattern from Phase 9/10 to the alloc side as well
//
// ENV:
// HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0/1 (default: 0, opt-in)
//
// API:
// front_fastlane_alloc_legacy_direct_enabled() -> int
// front_fastlane_alloc_legacy_direct_env_refresh_from_env()
//
// Box Theory:
// - L0: This file (ENV gate, reversible)
// - L1: front_fastlane_box.h (LEGACY direct early-exit)
// - L2: malloc_tiny_fast_for_class (existing fallback)
//
// Safety:
// - ENV-gated (default OFF, opt-in)
// - Reversible (ENV toggle)
// - Fail-Fast (falls back to the existing route when the direct-path conditions are not met)
//
// ============================================================================
#ifndef FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
#define FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H
#include <stdatomic.h>
// ============================================================================
// Global State (L0)
// ============================================================================
// Cached state: -1 (uninitialized), 0 (disabled), 1 (enabled)
extern _Atomic int g_front_fastlane_alloc_legacy_direct_enabled;
// ============================================================================
// Hot API (L0)
// ============================================================================
// Check if FastLane alloc LEGACY direct is enabled
// Returns: 1 if enabled, 0 if disabled
// Note: Implementation in .c file (non-inline for LTO compatibility)
extern int front_fastlane_alloc_legacy_direct_enabled(void);
// ============================================================================
// Cold API (L2)
// ============================================================================
// Refresh ENV cache (called from bench_profile after putenv)
// Pattern: Same as Phase 8/13/14/15
extern void front_fastlane_alloc_legacy_direct_env_refresh_from_env(void);
#endif // FRONT_FASTLANE_ALLOC_LEGACY_DIRECT_ENV_BOX_H

core/box/front_fastlane_box.h

@@ -42,6 +42,11 @@
#include "front_fastlane_stats_box.h"
#include "../hakmem_tiny.h" // hak_tiny_size_to_class, tiny_get_max_size
#include "../front/malloc_tiny_fast.h" // malloc_tiny_fast_for_class
#include "front_fastlane_alloc_legacy_direct_env_box.h" // Phase 16 v1: LEGACY direct
#include "tiny_static_route_box.h" // tiny_static_route_ready_fast, tiny_static_route_get_kind_fast
#include "tiny_front_hot_box.h" // tiny_hot_alloc_fast
#include "tiny_front_cold_box.h" // tiny_cold_refill_and_alloc
#include "smallobject_policy_v7_box.h" // SMALL_ROUTE_LEGACY
// FastLane is only safe after global init completes.
// Before init, wrappers must handle recursion guards + syscall init.
@@ -85,6 +90,34 @@ static inline void* front_fastlane_try_malloc(size_t size) {
return NULL; // Class not enabled → fallback
}
// Phase 16 v1: LEGACY direct path (early-exit optimization)
// Try direct allocation for LEGACY routes only (skip route/policy overhead)
// TEMPORARY SAFETY: Limit to C0-C3 (match dualhot pattern) until refill issue debugged
if (__builtin_expect(front_fastlane_alloc_legacy_direct_enabled() && (unsigned)class_idx <= 3u, 0)) {
// Condition 1: Static route must be ready (Learner interlock check)
// Condition 2: Route must be LEGACY (only when this can be determined with certainty)
if (tiny_static_route_ready_fast() &&
tiny_static_route_get_kind_fast(class_idx) == SMALL_ROUTE_LEGACY) {
// Hot path: Try UnifiedCache first
void* ptr = tiny_hot_alloc_fast(class_idx);
if (__builtin_expect(ptr != NULL, 1)) {
FRONT_FASTLANE_STAT_INC(malloc_hit);
return ptr; // Success (cache hit)
}
// Cold path: Refill UnifiedCache and retry
ptr = tiny_cold_refill_and_alloc(class_idx);
if (__builtin_expect(ptr != NULL, 1)) {
FRONT_FASTLANE_STAT_INC(malloc_hit);
return ptr; // Success (after refill)
}
// Fallback: Direct path failed → use existing route (safety)
// This handles edge cases (Learner transition, policy changes, etc.)
}
}
// Call existing hot handler (no duplication)
// This is the winning path from E5-4 / Phase 4 E2
void* ptr = malloc_tiny_fast_for_class(size, class_idx);