hakmem/core/box/hakmem_env_snapshot_box.h
Moe Charm (CI) bc2c5ded76 Phase 18 v2: BENCH_MINIMAL — NEUTRAL (+2.32% throughput, -5.06% instructions)
## Summary

Phase 18 v2 attempted instruction count reduction via conditional compilation:
- Stats collection → no-op
- ENV checks → constant propagation
- Binary size: 653K → 649K (-4K, -0.6%)
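
The two transformations listed above can be illustrated with a minimal sketch. It assumes a hypothetical build flag `DEMO_BENCH_MINIMAL` and hypothetical names (`DEMO_STAT_INC`, `demo_feature_enabled`, `DEMO_FEATURE`); these are not the actual hakmem macros, only an illustration of the pattern.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical build flag for this sketch (stands in for a bench-minimal switch).
#ifndef DEMO_BENCH_MINIMAL
#define DEMO_BENCH_MINIMAL 1
#endif

static unsigned long demo_alloc_count;  // hypothetical stats counter

#if DEMO_BENCH_MINIMAL
// Stats collection compiles to nothing.
#define DEMO_STAT_INC(counter) ((void)0)
// The ENV gate becomes a constant, so the optimizer can fold the branch away.
static inline bool demo_feature_enabled(void) { return true; }
#else
// Normal build: count events and consult the environment at runtime.
#define DEMO_STAT_INC(counter) ((counter)++)
static inline bool demo_feature_enabled(void) {
    const char* e = getenv("DEMO_FEATURE");  // hypothetical variable name
    return e && *e == '1';
}
#endif

static inline unsigned long demo_stat_read(void) { return demo_alloc_count; }

// Hot path: in the minimal build this reduces to `return x * 2;`.
static inline int demo_hot_path(int x) {
    DEMO_STAT_INC(demo_alloc_count);
    return demo_feature_enabled() ? x * 2 : x;
}
```

In the minimal build the counter is never touched and the gate check disappears, which is where a reduction in instructions and branches would come from.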

Result: NEUTRAL (below GO threshold)
- Throughput: +2.32% (target: +5% minimum) 
- Instructions: -5.06% (target: -15% minimum) 
- Cycles: -3.26% (positive signal)
- Branches: -8.67% (positive signal)
- Cache-misses: +30% (unexpected, likely layout)
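
The GO gate above can be read as a simple threshold check. A minimal sketch, assuming GO requires both targets to be met and that any other outcome is NEUTRAL (the project may define further verdicts); `DemoVerdict` and `demo_judge` are hypothetical names.

```c
#include <stdbool.h>

typedef enum { DEMO_VERDICT_NEUTRAL, DEMO_VERDICT_GO } DemoVerdict;

// Inputs are percentage deltas versus baseline:
// positive throughput is better, negative instructions is better.
static DemoVerdict demo_judge(double throughput_pct, double instructions_pct) {
    bool throughput_ok = throughput_pct >= 5.0;        // target: +5% minimum
    bool instructions_ok = instructions_pct <= -15.0;  // target: -15% minimum
    return (throughput_ok && instructions_ok) ? DEMO_VERDICT_GO
                                              : DEMO_VERDICT_NEUTRAL;
}
```

With the measured deltas, `demo_judge(2.32, -5.06)` fails both checks and returns `DEMO_VERDICT_NEUTRAL`.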

## Analysis

Positive signals:
- Implementation works as intended (branches -8.67%, instructions -5.06%)
- Binary size reduced (-4K)
- Modest throughput gain (+2.32%)
- Cycles and branch overhead reduced

Negative signals:
- Instruction reduction insufficient (-5.06%, well short of the -15% "smoking gun" target)
- Throughput gain below +5% threshold
- Cache-misses increased (+30%, layout noise?)

## Verdict

Freeze Phase 18 v2 (weak positive, insufficient for production).

Per user guidance: "If instructions don't drop clearly, continuation value is thin."
-5.06% instruction reduction is marginal. Allocator micro-optimization plateau confirmed.

## Key Insight

Phase 17 showed:
- IPC = 2.30 (consistent, memory-bound)
- I-cache gap: 55% (Phase 17: 153K → 68K)
- Instruction gap: 48% (Phase 17: 41.3B → 21.5B)
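
The gap percentages can be reproduced from the quoted before/after figures. A minimal sketch, assuming "gap" means the reduction relative to the first value of each pair; `demo_gap_percent` is a hypothetical helper.

```c
// Relative reduction from `before` to `after`, as a percentage of `before`.
static double demo_gap_percent(double before, double after) {
    return (before - after) / before * 100.0;
}
```

With the rounded inputs, `demo_gap_percent(153.0, 68.0)` is about 55.6 and `demo_gap_percent(41.3, 21.5)` is about 47.9, in line with the quoted 55% and 48%.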

Phase 18 v1/v2 results confirm:
- Layout tweaks are fragile (v1: I-cache +91%)
- Instruction removal yields only a modest benefit (v2: -5.06%)
- Allocator is NOT the bottleneck (IPC constant, memory-limited)

## Recommendation

Do NOT continue Phase 18 micro-optimizations.

The next frontier requires a different approach:
1. Architectural redesign (SIMD, lock-free, batching)
2. Memory layout optimization (cache-friendly structures)
3. Broader profiling (not allocator-focused)

Or: Accept that 48M → 85M (75% gap) is achievable with current architecture.

Files:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_AB_TEST_RESULTS.md (results)
- CURRENT_TASK.md (Phase 18 complete status)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 06:02:28 +09:00


// hakmem_env_snapshot_box.h - Phase 4 E1: ENV Snapshot Consolidation
//
// Purpose: Consolidate 3 hot ENV gate calls into 1 TLS snapshot read
// Target: tiny_c7_ultra_enabled_env (1.28%) + tiny_front_v3_enabled (1.01%) +
// tiny_metadata_cache_enabled (0.97%) = 3.26% combined ENV overhead
//
// Design:
// - ENV: HAKMEM_ENV_SNAPSHOT=0/1 (default 0, research box)
// - Single TLS snapshot struct containing all hot toggles
// - Lazy init with version-based refresh (follows tiny_front_v3_snapshot pattern)
// - Learner interlock: tiny_metadata_cache_eff = cache && !learner
//
// E3-4 Extension: Constructor init to eliminate lazy check overhead
// - ENV: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0)
// - When =1: Gate init runs in constructor (before main)
// - Eliminates 3.22% lazy init check overhead
//
// Benefits:
// - 3 TLS reads → 1 TLS read (66% reduction)
// - 3 lazy init checks → 1 lazy init check
// - E3-4: Lazy init check → no check (constructor init)
// - Expected gain: +1-3% (E1) + +0.5-1.5% (E3-4)
#ifndef HAK_ENV_SNAPSHOT_BOX_H
#define HAK_ENV_SNAPSHOT_BOX_H
#include <stdbool.h>
#include <stdlib.h>
// ENV snapshot struct: consolidates all hot ENV gates
typedef struct HakmemEnvSnapshot {
    bool tiny_c7_ultra_enabled;    // ENV: HAKMEM_TINY_C7_ULTRA (default 1)
    bool tiny_front_v3_enabled;    // ENV: HAKMEM_TINY_FRONT_V3_ENABLED (default 1)
    bool tiny_metadata_cache;      // ENV: HAKMEM_TINY_METADATA_CACHE (default 0)
    bool tiny_metadata_cache_eff;  // Effective: cache && !learner (for hot path)
} HakmemEnvSnapshot;
// Global snapshot state (implemented in hakmem_env_snapshot_box.c)
extern HakmemEnvSnapshot g_hakmem_env_snapshot;
extern int g_hakmem_env_snapshot_ready;
// Snapshot initializer (implemented in hakmem_env_snapshot_box.c)
void hakmem_env_snapshot_init(void);
// Refresh from ENV (for bench_profile putenv sync)
void hakmem_env_snapshot_refresh_from_env(void);
// Fast snapshot getter: lazy init + 1 TLS read
static inline const HakmemEnvSnapshot* hakmem_env_snapshot(void) {
    if (__builtin_expect(!g_hakmem_env_snapshot_ready, 0)) {
        hakmem_env_snapshot_init();
    }
    return &g_hakmem_env_snapshot;
}
// E3-4: Global gate state (defined in hakmem_env_snapshot_box.c)
extern int g_hakmem_env_snapshot_gate;
extern int g_hakmem_env_snapshot_ctor_mode;
// ENV gate: default OFF (research box, set =1 to enable)
// E3-4: Dual-mode - constructor init (fast) or legacy lazy init (fallback)
// Phase 18 v2: BENCH_MINIMAL conditional (constant return when HAKMEM_BENCH_MINIMAL=1)
#if HAKMEM_BENCH_MINIMAL
// In bench mode, snapshot is always enabled (one-time cost, compile-away benefit)
static inline bool hakmem_env_snapshot_enabled(void) {
    return true;
}
#else
// Normal mode: runtime check
static inline bool hakmem_env_snapshot_enabled(void) {
    // E3-4 Fast path: constructor mode (no lazy check, just global read).
    // Important: do not put a static LIKELY/UNLIKELY hint here.
    // - Default runs want ctor_mode==0 to be "fast"
    // - CTOR runs want ctor_mode==1 to be "fast"
    // Any fixed hint will be wrong for one of the modes and can induce steady-state mispredicts.
    int ctor_mode = g_hakmem_env_snapshot_ctor_mode;
    if (ctor_mode == 1) {
        return g_hakmem_env_snapshot_gate != 0;
    }
    // Legacy path: lazy init (fallback when HAKMEM_ENV_SNAPSHOT_CTOR=0)
    if (__builtin_expect(g_hakmem_env_snapshot_gate == -1, 0)) {
        const char* e = getenv("HAKMEM_ENV_SNAPSHOT");
        if (e && *e) {
            g_hakmem_env_snapshot_gate = (*e == '1') ? 1 : 0;
        } else {
            g_hakmem_env_snapshot_gate = 0;  // default: OFF (research box)
        }
    }
    return g_hakmem_env_snapshot_gate != 0;
}
#endif
#endif // HAK_ENV_SNAPSHOT_BOX_H
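
The header only declares the gate globals; the constructor-mode initialization described in the E3-4 notes lives in the `.c` file, which is not shown here. The following is a minimal sketch of how such a dual-mode gate could be set up, assuming the GCC/Clang `__attribute__((constructor))` extension. All `demo_*` names and the `DEMO_SNAPSHOT*` variables are hypothetical and are not the actual hakmem implementation.

```c
#include <stdbool.h>
#include <stdlib.h>

static int demo_gate = -1;      // -1 = not read yet, 0 = off, 1 = on
static int demo_ctor_mode = 0;  // 1 = gate was resolved before main()

// Treat a value starting with '1' as on; unset, empty, or anything else as off.
static int demo_parse_flag(const char* e) {
    return (e && *e == '1') ? 1 : 0;
}

// Runs before main() (GCC/Clang extension). Only resolves the gate when
// constructor mode is requested; otherwise the lazy path below is used.
__attribute__((constructor))
static void demo_gate_ctor(void) {
    if (demo_parse_flag(getenv("DEMO_SNAPSHOT_CTOR"))) {
        demo_gate = demo_parse_flag(getenv("DEMO_SNAPSHOT"));
        demo_ctor_mode = 1;
    }
}

// Dual-mode check: plain global read in constructor mode, lazy init otherwise.
static inline bool demo_enabled(void) {
    if (demo_ctor_mode == 1) {
        return demo_gate != 0;
    }
    if (demo_gate == -1) {
        demo_gate = demo_parse_flag(getenv("DEMO_SNAPSHOT"));
    }
    return demo_gate != 0;
}
```

In constructor mode the hot path never sees the `-1` sentinel, so the lazy-init comparison is skipped. As the comment in `hakmem_env_snapshot_enabled` notes, the mode branch itself is left without a static likelihood hint because either mode may be the steady state.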