Commit Graph

168 Commits

Author SHA1 Message Date
e1a4561992 Phase 19-3b: pass down env snapshot in hot paths 2025-12-15 12:50:16 +09:00
8f4ada5bbd Phase 19-3a: remove backwards UNLIKELY env-snapshot hints 2025-12-15 12:29:27 +09:00
ec87025da6 Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)
## Phase 17 v2: FORCE_LIBC Gap Validation Fix

**Critical bug fix**: The Phase 17 v1 measurement was broken

**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 only took effect after FastLane had already run,
so the same-binary A/B was effectively "hakmem vs hakmem" (the +0.39% result was a mismeasurement)

**Fix**: Added an early bypass for g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171 and :645,
so the wrappers go straight to __libc_malloc/__libc_free before anything else
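
A minimal sketch of the ordering fix, assuming a heavily simplified wrapper. Only `g_force_libc_alloc` is named in the commit; `wrapper_malloc`, `fastlane_try_malloc`, and the counters are hypothetical, and standard `malloc` stands in for `__libc_malloc`.

```c
#include <stdlib.h>

/* Hypothetical stand-ins: only g_force_libc_alloc is named in the commit. */
static int g_force_libc_alloc = 0;   /* mirrors HAKMEM_FORCE_LIBC_ALLOC */
static int g_libc_calls = 0;         /* illustration-only counters */
static int g_fastlane_calls = 0;

/* Stand-in for the FastLane front end (here it just forwards to malloc). */
static void *fastlane_try_malloc(size_t size) {
    g_fastlane_calls++;
    return malloc(size);
}

/* The force-libc check runs before any allocator front end, so
 * FORCE_LIBC=1 measures libc rather than "hakmem vs hakmem". */
static void *wrapper_malloc(size_t size) {
    if (g_force_libc_alloc == 1) {
        g_libc_calls++;
        return malloc(size);         /* stand-in for __libc_malloc */
    }
    return fastlane_try_malloc(size);
}
```

If the check sat after the FastLane call instead, both settings of the flag would exercise the same front end, which is the failure mode described above.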

**Result**: A correct same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)

**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)

**Conclusion**: Case A confirmed (allocator dominant, NOT layout)
The Phase 17 v1 Case B verdict was wrong.

Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)

---

## Phase 19: FastLane Instruction Reduction Analysis

**Goal**: Reduce the instruction gap to libc (libc runs 35% fewer instructions and 56% fewer branches)

**perf stat analysis** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)
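
The same gap reads as +53.8% or -35% depending on which side is the base. A small worked sketch of that arithmetic (the helper name `pct_delta` is hypothetical):

```c
/* Relative change of `value` against `base`, in percent. */
static double pct_delta(double base, double value) {
    return (value - base) / base * 100.0;
}
```

With libc as the base, `pct_delta(135.92, 209.09)` gives roughly +53.8 (hakmem executes that many more instructions per op). With hakmem as the base, `pct_delta(209.09, 135.92)` gives roughly -35.0, which is the figure quoted in the goal.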

**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**

**Reduction candidates**:
- A: Remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: Consolidate the ENV snapshot (-10.0 inst/op, +5-8%)
- C: Remove stats (-5.0 inst/op, +3-5%)
- D: Header inline (-4.0 inst/op, +2-3%)
- E: Route fast path (-3.5 inst/op, +2-3%)

Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md

---

## Phase 19-1b: FastLane Direct — GO (+5.88%)

**Strategy**: Bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()

**Why Phase 19-1 was NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired (it made the A/B unfair)
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)

**Phase 19-1b fixes**:
1. Removed __builtin_expect()
2. Call free_tiny_fast() directly
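
A sketch of why the hint can skew an A/B, assuming GCC or Clang. `__builtin_expect(x, 0)` asks the compiler to treat the `x != 0` arm as unlikely, which penalizes exactly the configuration under test. All names other than `fastlane_direct_enabled()` are hypothetical, and the gate here is a plain variable rather than the real ENV lookup.

```c
/* Hypothetical gate; in the commit this is fastlane_direct_enabled(). */
static int g_direct = 0;
static int fastlane_direct_enabled(void) { return g_direct; }

static int direct_path(int x)  { return x + 1; }  /* stand-in for the direct call */
static int wrapper_path(int x) { return x + 2; }  /* stand-in for the FastLane path */

/* Phase 19-1 style: the compiler is told the direct arm is unlikely. */
static int dispatch_hinted(int x) {
    if (__builtin_expect(fastlane_direct_enabled(), 0))
        return direct_path(x);
    return wrapper_path(x);
}

/* Phase 19-1b style: no hint, so no explicit bias against either arm. */
static int dispatch_unhinted(int x) {
    if (fastlane_direct_enabled())
        return direct_path(x);
    return wrapper_path(x);
}
```

Both variants compute the same result; only the branch layout the compiler is asked to prefer differs.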

**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)

**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)

**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs

**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
   - HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
   - Single _Atomic global (resolves the wrapper caching problem)

2. **Wrapper changes**: core/box/hak_wrappers.inc.h
   - malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
   - free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1
   - Safety: the direct path is not used while !g_initialized; the fallback is kept

3. **Preset promotion**: core/bench_profile.h:88
   - bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
   - Comment: +5.88% proven on Mixed, 10-run

4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
   - HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
   - Promoted in the same way as Phase 9/10
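
The gate-plus-safety-check shape described in the list above can be sketched as follows. This is an illustration under assumptions, not the code in hak_wrappers.inc.h: every identifier except `g_initialized` is hypothetical, and `malloc` stands in for both allocation paths.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration. */
static _Atomic int g_fastlane_direct = 0; /* single global gate */
static int g_initialized = 0;             /* allocator init flag */
static int g_direct_hits = 0, g_fallback_hits = 0;

static void *direct_alloc(size_t n)   { g_direct_hits++;   return malloc(n); }
static void *fallback_alloc(size_t n) { g_fallback_hits++; return malloc(n); }

static void *wrapper_alloc(size_t n) {
    /* Take the direct path only once initialization has finished. */
    if (g_initialized &&
        atomic_load_explicit(&g_fastlane_direct, memory_order_relaxed))
        return direct_alloc(n);
    return fallback_alloc(n);
}
```

Keeping the gate in one global means every call site observes the same value, rather than each wrapper holding its own cached copy.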

**Verdict**: GO. Adopted on the main line; preset promotion complete.

**Rollback**: Set HAKMEM_FASTLANE_DIRECT=0 to return to the existing FastLane path

Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md

---

## Cumulative Performance

- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**

Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.1%** (was +62.7% before Phase 19-1b)

Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 11:28:40 +09:00
ad346f7885 Phase 18 v2: BENCH_MINIMAL design + instructions (instruction removal strategy)
## Phase 18 v2: Next Phase Direction

After Phase 18 v1 failure (layout optimization caused I-cache regression),
shift to instruction count reduction via compile-time removal:

- Stats collection (FRONT_FASTLANE_STAT_INC → no-op)
- Environment checks (runtime lookup → constant)
- Debug logging (conditional compilation)

Expected impact: Instructions -30-40%, Throughput +10-20%

## Success Criteria (STRICT)

GO (must have ALL):
- Throughput: +5% minimum (+8% preferred)
- Instructions: -15% minimum (smoking gun)
- I-cache: automatic improvement from smaller footprint

NEUTRAL: throughput ±3%, instructions -5% to -15%
NO-GO: throughput below -2%, instruction reduction smaller than 5%

Key: If instructions do not drop by at least 15%, the allocator is not the bottleneck
and this phase should be abandoned.

## Implementation Strategy

1. Makefile knob: BENCH_MINIMAL=0/1 (default OFF, production-safe)
2. Conditional removal:
   - Stats: #if !HAKMEM_BENCH_MINIMAL
   - ENV checks: constant propagation
   - Debug: conditional includes

3. A/B test with perf stat (must measure instruction reduction)
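
A minimal sketch of the compile-out pattern, assuming the knob reaches the compiler as `-DHAKMEM_BENCH_MINIMAL=1`. The macro `STAT_INC` and the counter are hypothetical stand-ins for the real stats macros.

```c
/* Assumed default: the knob is OFF unless the build defines it to 1. */
#ifndef HAKMEM_BENCH_MINIMAL
#define HAKMEM_BENCH_MINIMAL 0
#endif

static unsigned long g_stat_hits = 0;  /* hypothetical counter */

#if !HAKMEM_BENCH_MINIMAL
#define STAT_INC(counter) ((counter)++)
#else
#define STAT_INC(counter) ((void)0)    /* expands to nothing at runtime */
#endif

static int hot_path(int x) {
    STAT_INC(g_stat_hits);  /* disappears from the hot path when the knob is 1 */
    return x * 2;
}
```

Because the removal happens in the preprocessor, the default build is unchanged and the instruction reduction shows up directly in `perf stat`.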

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_DESIGN.md (detailed design)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_2_NEXT_INSTRUCTIONS.md (step-by-step)

Modified:
- CURRENT_TASK.md (Phase 18 v1/v2 status)

## Key Learning from Phase 18 v1 Failure

Layout optimization is extremely fragile without strong ordering guarantees.
Section splitting alone (without symbol ordering, PGO, or linker script)
destroyed code locality and increased I-cache misses 91%.

Switching to direct instruction removal is safer and more predictable.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 05:55:22 +09:00
b1912d6587 Phase 18 v1: Hot Text Isolation — NO-GO (I-cache regression)
## Summary

Phase 18 v1 attempted layout optimization using section splitting + GC:
- `-ffunction-sections -fdata-sections -Wl,--gc-sections`

Result: **Catastrophic I-cache regression**
- Throughput: -0.87% (48.94M → 48.52M ops/s)
- I-cache misses: +91.06% (131K → 250K)
- Variance: +80% (σ=0.45M → σ=0.81M)

Root cause: Section-based splitting without explicit hot symbol ordering
fragments code locality, destroying natural compiler/LTO layout.

## Build Knob Safety

Makefile updated to separate concerns:
- `HOT_TEXT_ISOLATION=1` → attributes only (safe, but no perf gain)
- `HOT_TEXT_GC_SECTIONS=1` → section splitting (currently NO-GO)

Both kept as research boxes (default OFF).
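
For illustration, an "attributes only" build knob might look like the sketch below. The macro names `HAK_HOT` and `HAK_COLD` and both functions are hypothetical; the `hot`, `cold`, and `noinline` attributes are GCC/Clang extensions, and the default here is assumed to be OFF.

```c
/* Hypothetical macro names; the attributes are GCC/Clang extensions. */
#ifndef HOT_TEXT_ISOLATION
#define HOT_TEXT_ISOLATION 0
#endif

#if HOT_TEXT_ISOLATION && defined(__GNUC__)
#define HAK_HOT  __attribute__((hot))
#define HAK_COLD __attribute__((cold, noinline))
#else
#define HAK_HOT
#define HAK_COLD
#endif

/* Rare path: marked cold so it can be placed away from hot code. */
HAK_COLD static int slow_refill(int n) { return n * 100; }

/* Common path: returns the cached value when there is one. */
HAK_HOT static int fast_alloc(int cached, int n) {
    if (cached > 0) return cached;
    return slow_refill(n);
}
```

The attributes only annotate functions; they do not by themselves split sections or change link order, which is the separate step the second knob controls.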

## Verdict

Freeze Phase 18 v1:
- Do NOT use section-based linking without strong ordering strategy
- Keep hot/cold attributes as placeholder (currently unused)
- Proceed to Phase 18 v2: BENCH_MINIMAL compile-out

Expected impact v2: +10-20% via instruction count reduction
- GO threshold: +5% minimum, +8% preferred
- Only continue if instructions clearly drop

## Files

New:
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_AB_TEST_RESULTS.md

Modified:
- Makefile (build knob safety isolation)
- CURRENT_TASK.md (Phase 18 v1 verdict)
- docs/analysis/PHASE18_HOT_TEXT_ISOLATION_1_NEXT_INSTRUCTIONS.md

## Lessons

1. Layout optimization is extremely fragile without ordering guarantees
2. I-cache is a first-order performance factor (IPC=2.30 is memory-bound)
3. Compiler defaults may be better than manual section splitting
4. Next frontier: instruction count reduction (stats/ENV removal)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-15 05:53:58 +09:00
f8e7cf05b4 Phase 16 v1 NEUTRAL, Phase 17 Case B confirmed, Phase 18 design added
## Phase 16 v1: Front FastLane Alloc LEGACY Direct — NEUTRAL (+0.62%)

Target: Reduce alloc-side fixed costs by adding LEGACY direct path to
FastLane entry, mirroring Phase 9/10 free-side winning pattern.

Result: +0.62% on Mixed (below +1.0% GO threshold) → NEUTRAL, freeze as
research box (default OFF).

Critical issue: Initial impl crashed (segfault) for C4-C7. Root cause:
unified_cache_refill() incompatibility. Safety fix: Limited to C0-C3
only (matching existing dualhot pattern).

Files:
- core/box/front_fastlane_alloc_legacy_direct_env_box.{h,c} (new)
- core/box/front_fastlane_box.h (LEGACY direct path, lines 93-119)
- core/bench_profile.h (env refresh sync)
- Makefile (new obj)
- docs/analysis/PHASE16_*.md (design/results/instructions)

ENV: HAKMEM_FRONT_FASTLANE_ALLOC_LEGACY_DIRECT=0 (default OFF, opt-in)

Verdict: Research box frozen. Phase 14-16 plateau confirms dispatch/
routing optimization ROI is exhausted post-Phase-6 FastLane collapse.

---

## Phase 17: FORCE_LIBC Gap Validation — Case B Confirmed

Purpose: Validate "system malloc faster" observation using same-binary
A/B testing to isolate allocator logic difference vs binary layout penalty.

Method:
- Same-binary toggle: HAKMEM_FORCE_LIBC_ALLOC=0/1 (bench_random_mixed_hakmem)
- System binary: bench_random_mixed_system (21K separate binary)
- Perf stat: Hardware counter analysis (I-cache, cycles, instructions)

Result: **Case B confirmed** (allocator difference negligible, layout penalty dominates).

Gap breakdown (Mixed, 20M iters, ws=400):
- hakmem (FORCE_LIBC=0): 48.12M ops/s
- libc (FORCE_LIBC=1, same binary): 48.31M ops/s → +0.39% (noise level)
- system binary (21K): 83.85M ops/s → +73.57% vs libc, +74.26% vs hakmem

Perf stat (200M iters):
- I-cache misses: 153K (hakmem) → 68K (system) = -55% (smoking gun)
- Cycles: 17.9B → 10.2B = -43%
- Instructions: 41.3B → 21.5B = -48%
- Binary size: 653K → 21K (30x difference)

Root cause: Binary size (30x) causes I-cache thrashing. Code bloat >>
algorithmic efficiency.

Conclusion: Phase 12's "system malloc 1.6x faster" was real, but
misattributed. Gap is layout/I-cache, NOT allocator algorithm.

Files:
- docs/analysis/PHASE17_*.md (results/instructions)
- scripts/run_mixed_10_cleanenv.sh (Phase 9/10 defaults aligned)

Next: Phase 18 Hot Text Isolation (layout optimization, not algorithm opt)

---

## Phase 18: Hot Text Isolation — Design Added

Purpose: Reduce I-cache misses + instruction footprint via layout control
(binary optimization, not allocator algorithm changes).

Strategy (v1 → v2 progression):

v1 (TU split + hot/cold attrs + optional gc-sections):
- Target: +2% throughput (GO threshold, realistic for layout tweaks)
- Secondary: I-cache -10%, instructions -5% (direction confirmation)
- Risk: Low (reversible via build knob)
- Expected: +0-2% (NEUTRAL likely, but validates approach)

v2 (BENCH_MINIMAL compile-out):
- Target: +10-20% throughput (the main bet)
- Method: Conditional compilation removes stats/ENV/debug from hot path
- Expected: Instruction count -30-40% → significant I-cache improvement

Files:
- docs/analysis/PHASE18_*.md (design/instructions)
- CURRENT_TASK.md (Phase 17 complete, Phase 18 v1/v2 plan)

Build gate: HOT_TEXT_ISOLATION=0/1 (Makefile knob)

Next: Implement Phase 18 v1 (TU split first, BENCH_MINIMAL if v1 NEUTRAL)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 05:25:47 +09:00
87fa27518c Phase 15 v1: UnifiedCache FIFO→LIFO NEUTRAL (-0.70% Mixed, +0.42% C7)
Transform existing array-based UnifiedCache from FIFO ring to LIFO stack.

A/B Results:
- Mixed (16-1024B): -0.70% (52,965,966 → 52,593,948 ops/s)
- C7-only (1025-2048B): +0.42% (78,010,783 → 78,335,509 ops/s)

Verdict: NEUTRAL (both below +1.0% GO threshold) - freeze as research box

Implementation:
- L0 ENV gate: tiny_unified_lifo_env_box.{h,c} (HAKMEM_TINY_UNIFIED_LIFO=0/1)
- L1 LIFO ops: tiny_unified_lifo_box.h (unified_cache_try_pop/push_lifo)
- L2 integration: tiny_front_hot_box.h (mode check at entry)
- Reuses existing slots[] array (no intrusive pointers)
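
To make the FIFO-ring versus LIFO-stack distinction concrete, here is a sketch over a single `slots[]` array. The struct layout, capacity, and function names are hypothetical and far simpler than the real UnifiedCache.

```c
#include <stddef.h>

#define CACHE_CAP 8  /* hypothetical capacity; must be a power of two */

/* Hypothetical layout: one slots[] array, interpreted two ways. */
typedef struct {
    void    *slots[CACHE_CAP];
    unsigned head;  /* FIFO: next index to pop */
    unsigned tail;  /* FIFO: next index to push; LIFO: stack top */
} cache_t;

/* FIFO ring: push at tail, pop at head. */
static int fifo_push(cache_t *c, void *p) {
    if (c->tail - c->head == CACHE_CAP) return 0;   /* full */
    c->slots[c->tail++ & (CACHE_CAP - 1)] = p;
    return 1;
}
static void *fifo_pop(cache_t *c) {
    if (c->head == c->tail) return NULL;            /* empty */
    return c->slots[c->head++ & (CACHE_CAP - 1)];
}

/* LIFO stack: push and pop at the same end, so the most recently
 * freed block is the next one handed out. */
static int lifo_push(cache_t *c, void *p) {
    if (c->tail == CACHE_CAP) return 0;             /* full */
    c->slots[c->tail++] = p;
    return 1;
}
static void *lifo_pop(cache_t *c) {
    if (c->tail == 0) return NULL;                  /* empty */
    return c->slots[--c->tail];
}
```

After pushing blocks a, b, c in that order, the FIFO variant pops a first while the LIFO variant pops c first.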

Root Causes:
1. Mode check overhead (tiny_unified_lifo_enabled() call)
2. Minimal LIFO vs FIFO locality delta in practice
3. Existing FIFO ring already well-optimized

Bonus Fix: LTO bug for tiny_c7_preserve_header_enabled() (Phase 13/14 latent issue)
- Converted static inline to extern + non-inline implementation
- Fixes undefined reference during LTO linking

Design: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_DESIGN.md
Results: docs/analysis/PHASE15_UNIFIEDCACHE_LIFO_1_AB_TEST_RESULTS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 02:19:26 +09:00
b7e01a9419 Phase 14 v2: Hot Path Integration NEUTRAL (+0.08% Mixed, -0.39% C7-only)
Implementation:
- Patch 1: Add tcache pop to tiny_hot_alloc_fast() (try tcache first)
- Patch 2: Add tcache push to tiny_hot_free_fast() (try tcache first)
- Makefile fix: Add missing .o files to BENCH_HAKMEM_OBJS_BASE
- LTO fix: Restore static inline for tiny_c7_preserve_header_enabled()

A/B Test Results:
- Mixed (16-1024B): 51,287,515 → 51,330,213 ops/s (+0.08%)
- C7-only (1025-2048B): 80,975,651 → 80,660,283 ops/s (-0.39%)

Verdict: NEUTRAL (below +1.0% GO threshold)

Root Cause:
- LIFO/FIFO mixing degrades cache locality
- Hot path branch overhead
- Intrusive pointers add overhead vs array cache
- v2 worse than v1 (+0.20%)

Files:
- Modified: core/box/tiny_front_hot_box.h (tcache integration)
- Modified: Makefile (BENCH_HAKMEM_OBJS_BASE fix)
- Modified: core/box/tiny_c7_preserve_header_env_box.{h,c} (LTO fix)
- Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_AB_TEST_RESULTS.md

Decision: Freeze Phase 14 (v1+v2) as research box (HAKMEM_TINY_TCACHE=0 default)

Next: Phase 15 (UnifiedCache FIFO→LIFO) - optimize array cache structure

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 01:57:38 +09:00
f8fb05bc13 Phase 14 v1: Pointer-Chase Reduction (tcache) NEUTRAL (+0.20%)
Implementation:
- Intrusive LIFO tcache layer (L1) before UnifiedCache
- TLS per-class bins (head pointer + count)
- Intrusive next pointers (via tiny_next_store/load SSOT)
- Cap: 64 blocks per class (default)
- ENV: HAKMEM_TINY_TCACHE=0/1 (default: 0, OFF)
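
An intrusive LIFO bin of this shape can be sketched as below. This is an assumption-laden illustration: the type and function names are hypothetical, the link is stored with a plain pointer write rather than through tiny_next_store/load, and the bin is an ordinary variable rather than thread-local storage.

```c
#include <stddef.h>

#define TCACHE_CAP 64  /* default cap from the commit message */

/* Hypothetical bin layout: head pointer + count, one per size class. */
typedef struct {
    void    *head;
    unsigned count;
} tcache_bin_t;

/* The "next" link lives in the first word of the free block itself. */
static int tcache_push(tcache_bin_t *bin, void *block) {
    if (bin->count >= TCACHE_CAP) return 0;  /* full: caller falls back */
    *(void **)block = bin->head;
    bin->head = block;
    bin->count++;
    return 1;
}

static void *tcache_pop(tcache_bin_t *bin) {
    void *block = bin->head;
    if (block == NULL) return NULL;          /* miss: caller falls back */
    bin->head = *(void **)block;
    bin->count--;
    return block;
}
```

A hit touches only the head pointer and the block itself, with no array indexing, which is the pointer-chase saving this phase was testing for.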

A/B Test Results (Mixed 10-run):
- Baseline (TCACHE=0): 51,083,379 ops/s
- Optimized (TCACHE=1): 51,186,838 ops/s
- Mean delta: +0.20% (below +1.0% GO threshold)
- Median delta: +0.59%

Verdict: NEUTRAL - Freeze as research box (default OFF)

Root Cause (v1 wiring incomplete):
- Free side pushes to tcache via unified_cache_push()
- Alloc hot path (tiny_hot_alloc_fast) doesn't consume tcache
- tcache becomes "sink" without alloc-side pop → ROI not measurable

Files:
- Created: core/box/tiny_tcache_{env_box,box}.h, tiny_tcache_env_box.c
- Modified: core/front/tiny_unified_cache.h (integration)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (build integration)
- Results: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_AB_TEST_RESULTS.md
- v2 Instructions: docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_2_NEXT_INSTRUCTIONS.md

Next: Phase 14 v2 (connect tcache to tiny_front_hot_box alloc/free hot path)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 01:28:50 +09:00
0b306f72f4 Phase 14 kickoff: Pointer-chase reduction (tcache-style intrusive LIFO)
Design and implementation plan for Phase 14 v1:
- Target: Reduce pointer-chase overhead in TinyUnifiedCache
- Strategy: Add intrusive LIFO tcache layer before array-based cache
- Inspired by glibc tcache (per-bin head pointer, intrusive next)

Approach:
- L0: tiny_tcache_env_box (ENV gate: HAKMEM_TINY_TCACHE=0/1, default OFF)
- L1: tiny_tcache_box (intrusive LIFO: push/pop with cap=64)
- Integration: Inside unified_cache_push/pop (minimal call site changes)

Expected benefits:
- tcache hit: No array access, just head pointer + intrusive next
- Better locality (LIFO vs FIFO)
- Closer to system malloc tcache behavior

A/B plan:
- Test: HAKMEM_TINY_TCACHE=0/1 on Mixed 10-run
- GO threshold: +1.0% mean
- Rollback: ENV-gated, default OFF

Files added:
- docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE14_POINTER_CHASE_REDUCTION_1_NEXT_INSTRUCTIONS.md

Next: Implement Phase 14 v1 patches (ENV box → tcache box → integration)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 00:32:56 +09:00
cbb35ee27f Phase 13 v1 + E5-2 retest: Both NEUTRAL, freeze as research boxes
Phase 13 v1: Header Write Elimination (C7 preserve header)
- Verdict: NEUTRAL (+0.78%)
- Implementation: HAKMEM_TINY_C7_PRESERVE_HEADER ENV gate (default OFF)
- Makes C7 nextptr offset conditional (0→1 when enabled)
- 4-point matrix A/B test results:
  * Case A (baseline): 51.49M ops/s
  * Case B (WRITE_ONCE=1): 52.07M ops/s (+1.13%)
  * Case C (C7_PRESERVE=1): 51.36M ops/s (-0.26%)
  * Case D (both): 51.89M ops/s (+0.78% NEUTRAL)
- Action: Freeze as research box (default OFF, manual opt-in)

Phase 5 E5-2: Header Write-Once retest (promotion test)
- Verdict: NEUTRAL (+0.54%)
- Motivation: Phase 13 Case B showed +1.13%, re-tested with dedicated 20-run
- Results (20-run):
  * Case A (baseline): 51.10M ops/s
  * Case B (WRITE_ONCE=1): 51.37M ops/s (+0.54%)
- Previous test: +0.45% (consistent with NEUTRAL)
- Action: Keep as research box (default OFF, manual opt-in)

Key findings:
- Header write tax optimization shows consistent NEUTRAL results
- Neither Phase 13 v1 nor E5-2 reaches GO threshold (+1.0%)
- Both implemented as reversible ENV gates for future research

Files changed:
- New: core/box/tiny_c7_preserve_header_env_box.{c,h}
- Modified: core/box/tiny_layout_box.h (C7 offset conditional)
- Modified: core/tiny_nextptr.h, core/box/tiny_header_box.h (comments)
- Modified: core/bench_profile.h (refresh sync)
- Modified: Makefile (add new .o files)
- Modified: scripts/run_mixed_10_cleanenv.sh (add C7_PRESERVE ENV)
- Docs: PHASE13_*, PHASE5_E5_2_HEADER_WRITE_ONCE_* (design/results)

Next: Phase 14 (Pointer-chase reduction, tcache-style intrusive LIFO)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-15 00:32:25 +09:00
f88e51e45b Phase 12: Strategic Pause Results - Critical Finding
Completed Strategic Pause investigation with shocking discovery:
- System malloc (glibc ptmalloc2): 86.58M ops/s
- hakmem (Phase 10): 52.88M ops/s
- Gap: **+63.7%** 🚨

Baseline (Phase 10):
- Mean: 51.76M ops/s (10-run, CV 1.03%)
- Health check: PASS
- Perf stat: IPC 2.22, branch miss 2.48%, good cache locality

Allocator comparison:
- hakmem: 52.43M ops/s (RSS: 33.8MB)
- jemalloc: 48.60M ops/s (RSS: 35.6MB) [-7.3%]
- system malloc: 85.96M ops/s [+63.9%] 🚨

Gap analysis (5 hypotheses):
1. Header write overhead (400M writes) - Expected ROI: +10-20%
2. Thread cache implementation (tcache vs TinyUnifiedCache) - Expected ROI: +20-30%
3. Metadata access pattern (indirection overhead) - Expected ROI: +5-10%
4. Classification overhead (LUT + routing) - Expected ROI: +5%
5. Freelist management (header vs chunk placement) - Expected ROI: +5%

Recommendation: Proceed to Phase 13 (Header Write Elimination)
- Most direct overhead (400M writes per 200M iters)
- Measurable with perf
- Clear ROI (+10-20% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 21:17:48 +09:00
2b5e4ad576 docs: add Phase 12 strategic pause instructions 2025-12-14 20:59:23 +09:00
a6078a52b5 Phase 12: Strategic Options Analysis
Comprehensive analysis of next optimization options after Phase 6-10 (+24.6%):

Option A: Micro-Optimization (LOW PRIORITY)
- tiny_c7_ultra_alloc (3.75%): C7-specific, +1-2% ROI
- unified_cache_push (1.61%): Marginal ROI ~+1.0%
- High risk (20-30% NO-GO), diminishing returns

Option B: Workload-Specific Optimization (🔍 MEDIUM PRIORITY)
- C6-heavy optimization (+3-5% for specific workload)
- Mid/Large allocation optimization (requires investigation)

Option C: Strategic Pause (RECOMMENDED)
- Major milestone achieved (+24.6%)
- Diminishing returns (marginal ROI < +2%)
- Time to reassess project goals and explore new frontiers

Recommendation: Strategic Pause to:
- Benchmark vs mimalloc/jemalloc
- Validate production workloads
- Explore next optimization frontiers (footprint, multi-thread, fragmentation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:55:25 +09:00
37bb3ee63f Phase 6-10: Cumulative Results & Strategic Analysis (+24.6%)
Comprehensive analysis of Phases 6-10 achievements:
- Cumulative improvement: +24.6% (43.04M → 53.62M ops/s)
- Individual phases: 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%)
- Phase 7 NO-GO (-2.16%), Phase 11 NO-GO (-8.35%)

Winning patterns:
- Wrapper-level consolidation (Phase 6-1: largest single gain)
- Deduplication at layer boundaries (Phase 6-2)
- Monolithic early-exit (Phase 9, 10 vs Phase 7 function split)

Next strategic options:
A) Micro-optimizations (marginal ROI < +2%)
B) Alloc side deep dive (malloc 23.26% hotspot, high potential +5-10%)
C) Strategic pause (declare victory at +24.6%)

Recommendation: Alloc side investigation as highest-ROI next direction.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:48:34 +09:00
ad73ca5544 Phase 11: ENV Snapshot maybe-fast API - NO-GO (-8.35%)
Phase 11 attempted to consolidate ENV snapshot overhead by:
- Adding hakmem_env_snapshot_maybe_fast() API
- Caching front_v3_snap pointer in HakmemEnvSnapshot
- Replacing separate calls with single API at call sites

Result: -8.35% regression (51.65M → 47.33M ops/s)

Root cause:
- maybe_fast() called in inline hot path functions
- ctor_mode check accumulated on every call
- Compiler optimization inhibited
- Even 2-3 instructions are expensive at high frequency

Lesson: ENV gate optimization should target gate itself, not call sites.

All changes rolled back. Phase 11 FROZEN.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:44:42 +09:00
44273693cc docs: record Phase 10 GO promotion; add Phase 11 instructions
Phase 10 updates:
- Mark Phase 10 as promoted (GO +1.89%)
- Update CURRENT_TASK.md with Phase 10 results
- Clean up absolute paths in PHASE10 docs
- Add promotion status to PHASE10 docs

Phase 11 instructions:
- New: PHASE11_ENV_SNAPSHOT_MAYBE_FAST_1_NEXT_INSTRUCTIONS.md
- Target: Consolidate ENV snapshot overhead
- Strategy: Merge hakmem_env_snapshot* and tiny_front_v3_snapshot_get into single call
- ENV: HAKMEM_ENV_SNAPSHOT_MAYBE_FAST=0/1
- Expected: +1-2% (reduce TLS/ENV read overhead)

Files modified:
- CURRENT_TASK.md: Phase 10 GO record, Phase 11 next
- docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md

Files added:
- docs/analysis/PHASE11_ENV_SNAPSHOT_MAYBE_FAST_1_NEXT_INSTRUCTIONS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:29:28 +09:00
71b1354d32 Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (GO +1.89%)
Results:
- A/B test: +1.89% on Mixed (10-run, clean env)
- Baseline: 51.96M ops/s
- Optimized: 52.94M ops/s
- Improvement: +984K ops/s (+1.89%)
- C6-heavy verification: +7.86% (nonlegacy_mask works correctly, no misfires)

Strategy:
- Extend Phase 9 (C0-C3 DUALHOT) to C4-C7 LEGACY DIRECT
- Fail-Fast principle: Never misclassify MID/ULTRA/V7 as LEGACY
- nonlegacy_mask: Cached at init, hot path uses single bit operation

Success factors:
1. Performance improvement: +1.89% (1.9x GO threshold)
2. Safety verified: nonlegacy_mask prevents MID v3 misfire in C6-heavy
3. Phase 9 coexistence: C0-C3 (Phase 9) + C4-C7 (Phase 10) = full LEGACY coverage
4. Minimal overhead: Single bit operation in hot path (mask & (1u<<class))
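
The single-bit test in item 4 can be sketched as a small helper. The function name is hypothetical, and the bit meaning (set = class must not take the LEGACY direct path) is an assumption drawn from the name `nonlegacy_mask`.

```c
#include <stdint.h>

/* Hypothetical helper: bit i of the mask is set when class i must NOT
 * take the LEGACY direct path (for example a class routed to MID v3). */
static inline int class_is_legacy_direct(uint32_t nonlegacy_mask,
                                         unsigned class_idx) {
    return (nonlegacy_mask & (1u << class_idx)) == 0;
}
```

With only bit 6 set, class 6 is excluded while classes 0-5 and 7 remain eligible, matching the C6-heavy behaviour described in the safety verification.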

Implementation:
- Patch 1: ENV gate box (free_tiny_fast_mono_legacy_direct_env_box.h)
  - ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1 (default 0)
  - nonlegacy_mask cached (reuses free_policy_fast_v2_nonlegacy_mask())
  - Probe window: 64 (avoid bench_profile putenv race)
- Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h)
  - Conditions: !nonlegacy_mask, route==LEGACY, !LARSON_FIX, done==1
  - Direct call: tiny_legacy_fallback_free_base()
- Patch 3: Visibility (free_path_stats_box.h)
  - mono_legacy_direct_hit counter (compile-out in release)
- Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh)
  - ENV leak protection

Safety verification (C6-heavy):
- OFF: 19.75M ops/s
- ON: 21.30M ops/s (+7.86%)
- nonlegacy_mask correctly excludes C6 (MID v3 active)
- Improvement from C0-C5, C7 direct path acceleration

Files modified:
- core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset
- core/front/malloc_tiny_fast.h: early-exit insertion
- core/box/free_path_stats_box.h: counter
- core/box/free_tiny_fast_mono_legacy_direct_env_box.h: NEW (ENV gate + nonlegacy_mask)
- scripts/run_mixed_10_cleanenv.sh: ENV leak protection

Health check: PASSED (all profiles)

Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out)

Rollback: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:09:40 +09:00
720ae61007 docs: record Phase 9 GO promotion; add Phase 10 instructions
Phase 9 updates:
- Mark Phase 9 as promoted (GO +2.72%)
- Update CURRENT_TASK.md with Phase 9 results
- Update PHASE9 docs with promotion status

Phase 10 instructions:
- New: PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
- Target: Extend free_tiny_fast() "LEGACY direct" to C4-C7
- Strategy: Safe conditions + early-exit (similar to Phase 9 success pattern)
- ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1
- Expected: +1-3% (C4-C7 coverage expansion)

Files modified:
- CURRENT_TASK.md: Phase 9 GO record, Phase 10 next
- docs/analysis/PHASE9_FREE_TINY_FAST_MONO_DUALHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE9_FREE_TINY_FAST_MONO_DUALHOT_1_NEXT_INSTRUCTIONS.md

Files added:
- docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 19:59:15 +09:00
871034da1f Phase 9: FREE-TINY-FAST MONO DUALHOT (GO +2.72%)
Results:
- A/B test: +2.72% on Mixed (10-run, clean env)
- Baseline: 48.89M ops/s
- Optimized: 50.22M ops/s
- Improvement: +1.33M ops/s (+2.72%)
- Stability: Standard deviation reduced by 60.8% (2.44M → 955K ops/s)

Strategy:
- Transplant C0-C3 "second hot" path to monolithic free_tiny_fast()
- Early-exit within monolithic (no hot/cold split)
- FastLane free now benefits from C0-C3 direct path

Success factors:
1. Performance improvement: +2.72% (2.7x GO threshold)
2. Stability improvement: 2.6x more stable (stdev 60.8% reduction)
3. Learned from Phase 7 failure:
   - Phase 7: Function split (hot/cold) → NO-GO
   - Phase 9: Early-exit within monolithic → GO
4. FastLane free compatibility: C0-C3 direct path now works with FastLane
5. Policy snapshot overhead reduction: C0-C3 (48% of Mixed) skip route lookup

Implementation:
- Patch 1: ENV gate box (free_tiny_fast_mono_dualhot_env_box.h)
  - ENV: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0/1 (default 0)
  - Probe window: 64 (avoid bench_profile putenv race)
- Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h)
  - Conditions: class_idx <= 3, !LARSON_FIX, route==LEGACY
  - Direct call: tiny_legacy_fallback_free_base()
- Patch 3: Visibility (free_path_stats_box.h)
  - mono_dualhot_hit counter (compile-out in release)
- Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh)
  - ENV leak protection
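
The "early-exit within monolithic" pattern from Patch 2 can be sketched as follows. All identifiers are hypothetical stand-ins; only the three conditions (class_idx <= 3, !LARSON_FIX, route==LEGACY) come from the commit.

```c
/* Hypothetical route tags and counters for illustration. */
enum route { ROUTE_LEGACY, ROUTE_OTHER };
static int g_dualhot_hits = 0, g_general_hits = 0;
static int g_larson_fix = 0;  /* stand-in for the LARSON_FIX condition */

static void legacy_free_direct(void *p) { (void)p; g_dualhot_hits++; }
static void general_free(void *p, enum route r) {
    (void)p; (void)r;
    g_general_hits++;
}

/* One monolithic function: the hot classes return early, and
 * everything else continues into the general path in the same body. */
static void free_fast(void *p, unsigned class_idx, enum route r) {
    if (class_idx <= 3 && !g_larson_fix && r == ROUTE_LEGACY) {
        legacy_free_direct(p);
        return;
    }
    general_free(p, r);
}
```

Unlike a hot/cold function split, there is a single entry point, so the caller pays no extra call or dispatch to reach the fast case.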

Files modified:
- core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset
- core/front/malloc_tiny_fast.h: early-exit insertion
- core/box/free_path_stats_box.h: counter
- core/box/free_tiny_fast_mono_dualhot_env_box.h: NEW (ENV gate)
- scripts/run_mixed_10_cleanenv.sh: ENV leak protection

Health check: PASSED (all profiles)

Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out)

Rollback: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 19:16:49 +09:00
152b578f64 docs: record Phase 8 GO; add Phase 9 mono dualhot instructions 2025-12-14 19:05:50 +09:00
be723ca052 Phase 8: FREE-STATIC-ROUTE ENV Cache Hardening (GO +2.61%)
Results:
- A/B test: +2.61% on Mixed (10-run, clean env)
- Baseline: 49.26M ops/s
- Optimized: 50.55M ops/s
- Improvement: +1.29M ops/s (+2.61%)

Strategy:
- Fix the ENV cache accident (the value was cached before main())
- Add refresh mechanism to sync with bench_profile putenv
- Ensure Phase 3 D1 optimization works reliably

Success factors:
1. Performance improvement: +2.61% (existing win-box now reliable)
2. ENV cache accident fixed: refresh mechanism works correctly
3. Standard deviation improved: 867K → 336K ops/s (61% reduction)
4. Baseline quality improved: existing optimization now guaranteed

Implementation:
- Patch 1: Make ENV gate refreshable (tiny_free_route_cache_env_box.{h,c})
  - Changed static int to extern _Atomic int
  - Added tiny_free_static_route_refresh_from_env()
- Patch 2: Integrate refresh into bench_profile.h
  - Call refresh after bench_setenv_default() group
- Patch 3: Update Makefile for new .c file
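
A refreshable ENV gate along these lines might look like the sketch below. The names are hypothetical (the commit's API is tiny_free_static_route_refresh_from_env()), the variable is `static` here only to keep the example in one file, and the "first character is 1" parsing rule is an assumption.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical single-file version of a refreshable gate. */
static _Atomic int g_gate = -1;  /* -1 = not read yet */

static int gate_parse(const char *v) {
    return (v != NULL && v[0] == '1') ? 1 : 0;
}

/* Re-read the variable; call this after the profile code has set it. */
static void gate_refresh_from_env(const char *name) {
    atomic_store_explicit(&g_gate, gate_parse(getenv(name)),
                          memory_order_relaxed);
}

static int gate_enabled(const char *name) {
    int v = atomic_load_explicit(&g_gate, memory_order_relaxed);
    if (v < 0) {                 /* first use: may run before main() */
        gate_refresh_from_env(name);
        v = atomic_load_explicit(&g_gate, memory_order_relaxed);
    }
    return v;
}
```

Without the refresh call, a value cached on first use (possibly before the benchmark profile sets its defaults) would stay stale for the whole run.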

ENV cache fix verification:
- [FREE_STATIC_ROUTE] enabled appears twice (refresh working)
- bench_profile putenv now reliably reflected

Files modified:
- core/box/tiny_free_route_cache_env_box.h: extern + refresh API
- core/box/tiny_free_route_cache_env_box.c: NEW (global state + refresh)
- core/bench_profile.h: add refresh call
- Makefile: add new .o file

Health check: PASSED (all profiles)

Rollback: HAKMEM_FREE_STATIC_ROUTE=0 or revert Patch 1/2

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 18:49:08 +09:00
17aaa90981 docs: freeze Phase 7 FastLane free hot/cold; add Phase 8 env cache fix 2025-12-14 18:34:08 +09:00
7a3702e069 docs: Phase 7 FastLane free hot/cold alignment instructions 2025-12-14 17:52:55 +09:00
dcc1d42e7f Phase 6-2: Promote Front FastLane Free DeDup (default ON)
Results:
- A/B test: +5.18% on Mixed (10-run, clean env)
- Baseline: 46.68M ops/s
- Optimized: 49.10M ops/s
- Improvement: +2.42M ops/s (+5.18%)

Strategy:
- Eliminate duplicate header validation in front_fastlane_try_free()
- Direct call to free_tiny_fast() when dedup enabled
- Single validation path (no redundant checks)

Success factors:
1. Complete duplicate elimination (free path optimization)
2. Free path importance (50% of Mixed workload)
3. Improved execution stability (CV: 1.00% → 0.58%)

Phase 6 cumulative:
- Phase 6-1 FastLane: +11.13%
- Phase 6-2 Free DeDup: +5.18%
- Total: ~+16-17% from baseline (multiplicative effect)
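
The "multiplicative effect" can be checked with a short helper (the function name is hypothetical): sequential gains compound as a product of factors rather than a sum of percentages.

```c
/* Compound a list of percentage gains multiplicatively. */
static double compound_gain_pct(const double *gains_pct, int n) {
    double factor = 1.0;
    for (int i = 0; i < n; i++)
        factor *= 1.0 + gains_pct[i] / 100.0;
    return (factor - 1.0) * 100.0;
}
```

For the two Phase 6 steps, 1.1113 × 1.0518 ≈ 1.1689, so the compounded gain is about +16.9%, slightly above the +16.31% a plain sum would give and consistent with the "~+16-17%" stated above.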

Promotion:
- Default: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=1 (opt-out)
- Added to MIXED_TINYV3_C7_SAFE preset
- Added to C6_HEAVY_LEGACY_POOLV1 preset
- Rollback: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=0

Files modified:
- core/box/front_fastlane_env_box.h: default 0 → 1
- core/bench_profile.h: added to presets
- CURRENT_TASK.md: Phase 6-2 GO result

Health check: PASSED (all profiles)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 17:38:21 +09:00
c0d2f47f7d docs: Phase 6-2 FastLane free dedup instructions 2025-12-14 17:09:57 +09:00
ea221d057a Phase 6: promote Front FastLane (default ON) 2025-12-14 16:28:23 +09:00
e48cbff4b9 Phase 5 Complete: E7 NO-GO confirmed + ChatGPT Pro questionnaire
Summary:
- E7 frozen box prune: -3.20% regression (NO-GO) with clean ENV
- Keep E5-2/E5-4 (NEUTRAL) + E6 (NO-GO) as research boxes
- Regression due to build differences (LTO/layout/alignment), not logic

Results:
- Winning boxes: E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) → adopted
- Frozen boxes: E5-2, E5-4, E6, E7 → kept behind ENV gates (documented as research assets)
- Phase 5 cumulative progress: +6.43% on MIXED profile

Documentation updates:
- PHASE5_E7_FROZEN_BOX_PRUNE_AB_TEST_RESULTS.md: Final NO-GO record
- PHASE5_E7_FROZEN_BOX_PRUNE_NEXT_INSTRUCTIONS.md: E7 conclusion

Next phase planning:
- PHASE_ML2_CHATGPT_QUESTIONNAIRE_FASTLANE.md: Design consultation template
  - Candidates: dedup new boundaries, PGO/layout optimization feasibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-14 08:56:09 +09:00
6d38511318 Phase 5: E7 prune no-go (keep frozen boxes); add clean-env runner 2025-12-14 08:11:20 +09:00
f92be5f541 Phase 5: freeze E6 env snapshot shape (no-go) 2025-12-14 07:18:59 +09:00
4124c86d99 Phase 5: freeze E5-4 malloc tiny direct (neutral) 2025-12-14 06:59:35 +09:00
580e7f4fa3 Phase 5 E5-3: Candidate Analysis (All DEFERRED) + E5-4 Instructions
E5-3 Analysis Results:
- free_tiny_fast_cold (7.14%): DEFER - cold path, low ROI
- unified_cache_push (3.39%): DEFER - already optimized
- hakmem_env_snapshot_enabled (2.97%): DEFER - low headroom

Key Insight: perf self% is time-weighted, not frequency-weighted.
Cold paths appear hot but have low total impact.

Next: E5-4 (Malloc Tiny Direct Path)
- Apply E5-1 winning pattern to malloc side
- Target: tiny_alloc_gate_fast() gate tax elimination
- ENV gate: HAKMEM_MALLOC_TINY_DIRECT=0/1

Files added:
- docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
- docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- core/box/free_cold_shape_env_box.{h,c} (research box, not tested)
- core/box/free_cold_shape_stats_box.{h,c} (research box, not tested)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-14 06:44:04 +09:00
f7b18aaf13 Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc

Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median

Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)
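
The GO / NEUTRAL / NO-GO rule used in these A/B tests can be sketched as below. The ±1.0% band is the threshold quoted in this log; the function and enum names are illustrative, and treating everything at or below the negative threshold as NO-GO is an assumption inferred from the other entries.

```c
typedef enum { AB_NO_GO = -1, AB_NEUTRAL = 0, AB_GO = 1 } ab_decision_t;

/* Relative change of the optimized mean over the baseline mean, in percent. */
static double ab_gain_pct(double baseline_mops, double optimized_mops) {
    return (optimized_mops / baseline_mops - 1.0) * 100.0;
}

/* Classify a gain against a symmetric threshold (1.0 means +/-1.0%). */
static ab_decision_t ab_decide(double gain_pct, double threshold_pct) {
    if (gain_pct >= threshold_pct)  return AB_GO;
    if (gain_pct <= -threshold_pct) return AB_NO_GO;
    return AB_NEUTRAL;
}
```

For this commit, `ab_gain_pct(44.22, 44.42)` is about +0.45%, which falls inside the neutral band.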

Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)

Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)

Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance

Health Check: PASS (all profiles)

Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%

Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 06:22:25 +09:00
75e20b29cc Phase 5 E5-1: Promote to preset + next target instructions
E5-1 Promotion:
- Added HAKMEM_FREE_TINY_DIRECT=1 to MIXED_TINYV3_C7_SAFE preset
- Updated ENV_PROFILE_PRESETS.md with rollback instructions
- Rollback: HAKMEM_FREE_TINY_DIRECT=0

A/B Test Clarification:
- Documented bench_setenv_default vs export ENV=0 interaction
- bench_setenv_default only sets if ENV is unset
- To force OFF in A/B: use value that differs from default
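
The "only sets if ENV is unset" behaviour maps directly onto POSIX `setenv()` with `overwrite == 0`. The sketch below is a hypothetical stand-in, not the actual `bench_setenv_default()` from core/bench_profile.h, and the variable name used in any test of it would be a placeholder.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* exposes setenv() under strict C modes */
#endif
#include <stdlib.h>

/* Hypothetical stand-in for bench_setenv_default(): apply a preset value
 * only when the variable is not already present in the environment.
 * setenv() with overwrite == 0 leaves an existing value untouched. */
static int bench_setenv_default_sketch(const char *name, const char *value) {
    return setenv(name, value, 0);
}
```

If the real helper behaves this way, an exported value is left alone and only unset variables receive the preset.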

Next Target Selection (E5-2 vs E5-3):
- E5-2: Header write reduction (tiny_region_id_write_header)
- E5-3: ENV snapshot gate shape optimization
- Decision requires fresh perf profile on new baseline

Deliverables:
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md (updated)
- docs/analysis/ENV_PROFILE_PRESETS.md (E5-1 added)
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md (clarified)
- CURRENT_TASK.md (progress links)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (progress links)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:59:43 +09:00
8875132134 Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)
Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths
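
The three gates can be read as one short check sequence. The sketch below uses the masks and bounds listed above; the header location (one byte directly before the user pointer) and the class index living in the low nibble are assumptions made for illustration, as is the function name.

```c
#include <stdint.h>

/* Returns the Tiny class index (0..7) when every gate passes, or -1 so the
 * caller can fall back to the existing free path.
 * Assumed layout: 1-byte header at ptr[-1], high nibble = magic 0xA,
 * low nibble = class index. */
static int tiny_direct_class(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;

    /* Page boundary guard: a page-aligned pointer would place ptr[-1] on
     * the previous page, so it is rejected (this also rejects NULL). */
    if ((p & 0xFFF) == 0) return -1;

    uint8_t header = *((const uint8_t *)ptr - 1);

    /* Tiny magic validation. */
    if ((header & 0xF0) != 0xA0) return -1;

    /* Class bounds check. */
    unsigned class_idx = header & 0x0Fu;
    if (class_idx >= 8) return -1;

    return (int)class_idx;
}
```

Every failing gate returns the same sentinel, which keeps the fallback to the existing paths a single branch in the caller.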

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:52:32 +09:00
6cdbd815ab Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated)
Combined A/B Test Results (10-run Mixed):
- Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median)
- Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median)
- Improvement: +6.43% mean, +6.74% median

Interaction Analysis:
- E4-1 alone: +3.51% (measured in separate session)
- E4-2 alone: +21.83% (measured in separate session)
- Combined: +6.43% (measured in same binary)
- Pattern: SUBADDITIVE (overlapping bottlenecks)

Key Finding: Single-binary incremental gain is the accurate metric
- E4-1 and E4-2 target overlapping TLS/branch resources
- Individual measurements were from different baselines/sessions
- Combined measurement (same binary, both flags) shows true progress

Phase 5 Total Progress:
- Original baseline (session start): 35.74M ops/s
- Combined optimized: 47.34M ops/s
- Total gain: +32.4% (cross-session, reference only)
- Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF)

New Baseline Perf Profile (47.0M ops/s):
- free: 37.56% self% (still top hotspot)
- tiny_alloc_gate_fast: 13.73% (reduced from 19.50%)
- malloc: 12.95% (reduced from 16.13%)
- tiny_region_id_write_header: 6.97% (header write tax)
- hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible)

Health Check: PASS
- MIXED_TINYV3_C7_SAFE: 42.3M ops/s
- C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s

Phase 5 E5 Candidates (from perf profile):
- E5-1: free() path internals (37.56% self%)
- E5-2: Header write reduction (6.97% self%)
- E5-3: ENV snapshot overhead (4.29% self%)

Deliverables:
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
- CURRENT_TASK.md (E4 combined complete, E5 candidates)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer)
- perf.data.e4combined (perf profile data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
5528612f2a Phase 5 E4-2: Malloc Wrapper ENV Snapshot (+21.83% GO, ADOPTED)
Target: Consolidate malloc wrapper TLS reads + eliminate function calls
- malloc (16.13%) + tiny_alloc_gate_fast (19.50%) = 35.63% combined
- Strategy: E4-1 success pattern + function call elimination

Implementation:
- ENV gate: HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/malloc_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates multiple TLS reads → 1 TLS read
  - Pre-caches tiny_max_size() == 256 (eliminates function call)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in malloc() wrapper
- Makefile: Add malloc_wrapper_env_snapshot_box.o to all targets

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 35.74M ops/s (mean), 35.75M ops/s (median)
- Optimized (SNAPSHOT=1): 43.54M ops/s (mean), 43.92M ops/s (median)
- Improvement: +21.83% mean, +22.86% median (+7.80M ops/s)

Decision: GO (+21.83% >> +1.0% threshold, 21.8x over)
- Why 6.2x better than E4-1 (+3.51%)?
  - Higher malloc call frequency (allocation-heavy workload)
  - Function call elimination (tiny_max_size pre-cached)
  - Larger target: 35.63% vs free's 25.26%
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative (estimated):
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- E4-2 (Malloc Wrapper Snapshot): +21.83%
- Estimated combined: ~+30% (needs validation)

Next Steps:
- Combined A/B test (E4-1 + E4-2 simultaneously)
- Measure actual cumulative effect
- Profile new baseline for next optimization targets

Deliverables:
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_DESIGN.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-2 added)
- CURRENT_TASK.md (E4-2 complete)
- core/bench_profile.h (E4-2 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:13:29 +09:00
4a070d8a14 Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)
Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path

Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates 2 TLS reads → 1 TLS read (50% reduction)
  - Reduces 4 branches → 3 branches (25% reduction)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets
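
A rough sketch of the snapshot idea: the flags the wrapper needs are packed into one thread-local struct, so the hot path touches TLS once. The struct fields, function names, and `HAKMEM_SKETCH_*` variables are placeholders, and the probe-window logic mentioned above is not modelled.

```c
#include <stdlib.h>

/* Hypothetical snapshot; the real box's fields are not shown in this log. */
typedef struct {
    unsigned char ready;   /* 0 until the first lazy init on this thread */
    unsigned char flag_a;  /* placeholder flag */
    unsigned char flag_b;  /* placeholder flag */
} wrapper_snapshot_sketch_t;

static _Thread_local wrapper_snapshot_sketch_t t_snapshot;

/* True only for the exact value "1". */
static unsigned char env_is_one(const char *name) {
    const char *v = getenv(name);
    return (unsigned char)(v != NULL && v[0] == '1' && v[1] == '\0');
}

/* One TLS access yields every flag; getenv() is paid once per thread
 * instead of on every call. */
static const wrapper_snapshot_sketch_t *wrapper_snapshot_get(void) {
    wrapper_snapshot_sketch_t *s = &t_snapshot;
    if (!s->ready) {
        s->flag_a = env_is_one("HAKMEM_SKETCH_FLAG_A");
        s->flag_b = env_is_one("HAKMEM_SKETCH_FLAG_B");
        s->ready = 1;
    }
    return s;
}
```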

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median

Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%

E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
  - Branch prediction hint mismatch (UNLIKELY with always-true)
  - Retest confirmed -1.78% regression
  - Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer
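
The hint mismatch described here can be illustrated with a small sketch. `__builtin_expect` is the GCC/Clang builtin; the macro, variable, and function names are hypothetical and only mirror the shape of the problem.

```c
/* GCC/Clang builtin; plain conditions elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_LIKELY(x)   (x)
#define SKETCH_UNLIKELY(x) (x)
#endif

static int g_ctor_mode = 1;         /* assumed: 1 when the constructor path is active */
static int g_snapshot_enabled = 1;  /* placeholder for the cached gate value */

/* Stand-in for the legacy lazy-init path. */
static int lazy_init_enabled(void) { return g_snapshot_enabled; }

/* Mismatched: the branch is marked rare although ctor_mode == 1 makes it
 * the common case, so the compiler may lay out the hot path as cold code. */
static int enabled_mismatched(void) {
    if (SKETCH_UNLIKELY(g_ctor_mode == 1)) return g_snapshot_enabled;
    return lazy_init_enabled();
}

/* Matched: the hint agrees with the value the flag actually holds. */
static int enabled_matched(void) {
    if (SKETCH_LIKELY(g_ctor_mode == 1)) return g_snapshot_enabled;
    return lazy_init_enabled();
}
```

Both functions return the same value; the hint only influences code layout, which is why a mismatch of this kind shows up as a throughput change rather than a functional bug.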

Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 04:24:34 +09:00
21e2e4ac2b Phase 4 E3-4: ENV Constructor Init (+4.75% GO)
Target: Eliminate E1 lazy init check overhead (3.22% self%)
- E1 consolidated ENV gates but lazy check remained in hot path
- Strategy: __attribute__((constructor(101))) for pre-main init

Implementation:
- ENV gate: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0, research box)
- core/box/hakmem_env_snapshot_box.c: Constructor function added
  - Reads ENV before main() when CTOR=1
  - Refresh also syncs gate state for bench_profile putenv
- core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check
  - CTOR=1 fast path: direct global read (no lazy branch)
  - CTOR=0 fallback: legacy lazy init (rollback safe)
  - Branch hints adjusted for default OFF baseline
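
A minimal sketch of the constructor approach, assuming GCC or Clang. The variable and function names and the `HAKMEM_SKETCH_FLAG` variable are placeholders rather than the contents of hakmem_env_snapshot_box.c.

```c
#include <stdlib.h>

static int g_env_snapshot_ready;  /* set once, before main() */
static int g_env_flag;            /* cached value of a placeholder variable */

/* GCC/Clang extension. Priorities 0-100 are reserved for the implementation,
 * so 101 is the earliest slot available to user code. */
__attribute__((constructor(101)))
static void env_snapshot_ctor_sketch(void) {
    const char *v = getenv("HAKMEM_SKETCH_FLAG");
    g_env_flag = (v != NULL && v[0] == '1');
    g_env_snapshot_ready = 1;
}

/* Hot-path read: a plain global load with no "initialised yet?" branch. */
static inline int env_flag_fast(void) {
    return g_env_flag;
}
```

Because the value is fixed before `main()`, anything that changes the environment later (such as a bench profile calling `putenv`) needs an explicit refresh, which is the role of the refresh hook noted above.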

A/B Test Results (Mixed, 10-run, 20M iters, E1=1):
- Baseline (CTOR=0): 44.28M ops/s (mean), 44.60M ops/s (median)
- Optimized (CTOR=1): 46.38M ops/s (mean), 46.53M ops/s (median)
- Improvement: +4.75% mean, +4.35% median

Decision: GO (+4.75% >> +0.5% threshold)
- Expected +0.5-1.5%, achieved +4.75%
- Lazy init branch overhead was larger than expected
- Action: Keep as research box (default OFF), evaluate promotion

Phase 4 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
- E3-4 (Constructor Init): +4.75%
- Total Phase 4: ~+8.5%

Deliverables:
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
- docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
- scripts/verify_health_profiles.sh (sanity check script)
- CURRENT_TASK.md (E3-4 complete, next instructions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 02:57:35 +09:00
6a6744d065 Phase 4 E2: Alloc Per-Class FastPath - NEUTRAL (-0.21%)
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DUALHOT=0): 45.40M ops/s (mean), 45.51M ops/s (median)
- Optimized (DUALHOT=1): 45.30M ops/s (mean), 45.22M ops/s (median)
- Improvement: -0.21% mean, -0.62% median

Decision: NEUTRAL (within ±1.0% noise threshold)
Action: FREEZE as research box (default OFF, no promotion)

Key Findings:
- C0-C3 fast path adds branch overhead without measurable benefit
- Unlike FREE path (+13%), ALLOC path already has optimized route caching
- Phase 3 C3 static routing eliminated route lookup overhead
- Additional per-class specialization doesn't reduce existing cost

Root Cause:
- Free DUALHOT skips expensive policy_snapshot() + tiny_route_for_class()
- Alloc DUALHOT adds C0-C3 branch but route already cached (Phase 3 C3)
- Net effect: Branch cost ≈ Route savings → neutral

Conclusion: Alloc route optimization has reached diminishing returns

Cumulative Status:
- Phase 4 E1: +3.92% (GO, research box)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)

Files:
- CURRENT_TASK.md: Updated with E2 results
- docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md: Full A/B test report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-14 01:54:21 +09:00
7f3ff6c7e6 Phase 4: E1 docs + E2 next instructions 2025-12-14 01:46:18 +09:00
42ba23fbd0 Phase 4 E1: env snapshot consolidation docs 2025-12-14 00:48:03 +09:00
11b0e3f32b Phase 4 D3: alloc gate shape (env-gated) 2025-12-14 00:26:57 +09:00
b40aff290e Phase 4 D3 Design: Alloc Gate Shape 2025-12-14 00:05:11 +09:00
141cd8a5be Phase 3 Closure & Phase 4 Preparation
Summary:
- Phase 3 optimization complete (cumulative +8.93%)
- D1 promoted to default (HAKMEM_FREE_STATIC_ROUTE=1, +2.19%)
- D2 frozen (NO-GO, -1.44% regression)
- Phase 4 instructions prepared (D3/Alloc Gate Specialization)

Results:
  B3 (Routing shape): +2.89%
  B4 (Wrapper split): +1.47%
  C3 (Static routing): +2.20%
  C1 (TLS prefetch): NEUTRAL (-0.34%, research box)
  C2 (Metadata cache): NEUTRAL (-0.45%, research box)
  D1 (Free route cache): +2.19% (now default)
  D2 (Wrapper env cache): NO-GO (-1.44%, frozen)
  MID_V3 fix: +13% (structural)

Total Phase 2-3 gain: ~8.93% (37.5M → 51M ops/s)

Updated:
- CURRENT_TASK.md: Phase 3 final results + D3 conditions
- ENV_PROFILE_PRESETS.md: Active optimizations listed
- PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: Phase 3→4 transition
- PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md: D3 execution plan
- PHASE3_BASELINE_AND_CANDIDATES.md: Post-validation status

Next phase: Phase 4 D3 - Alloc Gate Specialization
- Requires: tiny_alloc_gate_fast self% ≥5% from perf
- Design SSOT: PHASE3_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md
- Execution: PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 23:47:19 +09:00
50bded8c85 Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established
Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
  - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
  - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
  - Mean gain: +2.19%, Median gain: +2.37%
  - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
  - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset

- D2 (Wrapper env cache): FROZEN
  - Previous result: -1.44% regression (TLS overhead > benefit)
  - Status: Research box (do not pursue further)
  - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)

- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)

Cumulative Gains (Phase 2-3):
  B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
  Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
  MID_V3 fix: +13% (structural change, Mixed OFF by default)

Documentation Updates:
  - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
  - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
  - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
  - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
  - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
  - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
  - CURRENT_TASK.md: Phase 3 complete summary

Next:
  - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
  - Or Phase 4 planning if no more D3-class targets
  - Current active optimizations: B3, B4, C3, D1, MID_V3 fix

Files Changed:
  - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
  - docs/analysis/*.md (6 files updated with D1/D2 results)
  - CURRENT_TASK.md (Phase 3 status update)
  - analyze_d1_results.py (statistical analysis script)
  - core/bench_profile.h (D1 promoted to default in MIXED preset)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
19056282b6 Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]
Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)

Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)

Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)

Decision: FREEZE as research box (default OFF, regression confirmed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:03:27 +09:00
f059c0ec83 Phase 3 D1: Free Path Route Cache - DECISION: GO (+1.06%)
Target: Eliminate tiny_route_for_class() overhead in free path
- Perf finding: 4.39% self + 24.78% children (free bottleneck)
- Approach: Use cached route_kind (like Phase 3 C3 for alloc)

Implementation:
- core/box/tiny_free_route_cache_env_box.h (new)
  * ENV gate: HAKMEM_FREE_STATIC_ROUTE=0/1 (default OFF)
  * Lazy initialization with sentinel value
- core/front/malloc_tiny_fast.h (modified)
  * Two call sites: free_tiny_fast_cold() + legacy_fallback path
  * Direct route lookup: g_tiny_route_class[class_idx]
  * Fallback safety: Check g_tiny_route_snapshot_done
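
The lookup-with-fallback shape can be sketched as follows. `g_tiny_route_class` and `g_tiny_route_snapshot_done` are named in this commit, but their types, the table size of 8, the route enum, and the slow-path mapping are all assumptions for illustration.

```c
#define SKETCH_NUM_CLASSES 8

typedef enum { ROUTE_LEGACY = 0, ROUTE_OTHER = 1 } route_kind_sketch_t;

/* Names follow the commit message; types and sizes are assumed. */
static route_kind_sketch_t g_tiny_route_class[SKETCH_NUM_CLASSES];
static int g_tiny_route_snapshot_done;

/* Stand-in for the slower per-call route computation (arbitrary mapping). */
static route_kind_sketch_t route_for_class_slow(unsigned class_idx) {
    return (class_idx == 7) ? ROUTE_OTHER : ROUTE_LEGACY;
}

/* Fill the table once so later lookups are a single indexed load. */
static void route_snapshot_build(void) {
    for (unsigned c = 0; c < SKETCH_NUM_CLASSES; c++) {
        g_tiny_route_class[c] = route_for_class_slow(c);
    }
    g_tiny_route_snapshot_done = 1;
}

/* Use the cached route only when the gate is on and the table is ready;
 * otherwise fall back to the slow computation. */
static route_kind_sketch_t free_route_lookup(unsigned class_idx, int static_route_on) {
    if (static_route_on && g_tiny_route_snapshot_done) {
        return g_tiny_route_class[class_idx];
    }
    return route_for_class_slow(class_idx);
}
```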

A/B Test Results (Mixed, 10-run):
- Baseline (D1=0): 45.13 M ops/s (avg), 45.76 M ops/s (median)
- Optimized (D1=1): 45.61 M ops/s (avg), 45.40 M ops/s (median)
- Improvement: +1.06% (avg), -0.77% (median)
- DECISION: GO (avg gain meets +1.0% threshold)

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06%
- Total: ~7.2% cumulative gain

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 21:44:00 +09:00
d0b931b197 Phase 3 C1: TLS Prefetch Implementation - NEUTRAL Result (Research Box)
Step 1 & 2 Complete:
- Implemented: core/front/malloc_tiny_fast.h prefetch (lines 264-267, 331-334)
  - LEGACY path prefetch of g_unified_cache[class_idx] to L1
  - ENV gate: HAKMEM_TINY_PREFETCH=0/1 (default OFF)
  - Conditional: only when prefetch enabled + route_kind == LEGACY

- A/B test (Mixed 10-run): PREFETCH=0 (39.33M) → =1 (39.20M) = -0.34% avg
  - Median: +1.28% (opposite sign to the mean; the mean is within the ±1.0% neutral range)
  - Result: 🔬 NEUTRAL (research box, default OFF)

Decision: FREEZE as research box
- Average -0.34% suggests prefetch overhead > benefit
- Prefetch timing too late (after route_kind selection)
- TLS cache access is already fast (head/tail indices)
- Actual memory wait happens at slots[] array access (after prefetch)

Technical Learning:
- Prefetch effectiveness depends on L1 miss rate at access time
- Inserting prefetch after route selection may be too late
- Future approach: move prefetch earlier or use different target
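
For reference, the gated prefetch described in this commit has roughly the following shape. `__builtin_prefetch` is the GCC/Clang builtin; the cache struct layout, the table size, and the function name are assumptions, and the return value exists only to make the sketch observable.

```c
#define SKETCH_NUM_CLASSES 8

/* Assumed layout: head/tail indices followed by a slots[] array. */
typedef struct {
    unsigned head;
    unsigned tail;
    void *slots[64];
} cache_sketch_t;

static _Thread_local cache_sketch_t g_unified_cache[SKETCH_NUM_CLASSES];

/* Issue a read prefetch with high temporal locality for the class's cache
 * entry, but only when the ENV gate is on and the route is LEGACY.
 * Returns 1 if a prefetch was requested, 0 otherwise. */
static inline int prefetch_class_cache(unsigned class_idx, int prefetch_on,
                                       int route_is_legacy) {
    if (prefetch_on && route_is_legacy) {
#if defined(__GNUC__) || defined(__clang__)
        __builtin_prefetch(&g_unified_cache[class_idx], 0, 3);
#endif
        return 1;
    }
    return 0;
}
```

A prefetch is only a hint: if the consuming load follows within a few instructions, as it does when the call sits after route selection, there is little latency left to hide.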

Next: Phase 3 C2 (Metadata Cache Optimization, expected +5-10%)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 19:01:57 +09:00
d54893ea1d Phase 3 C3: Static Routing A/B Test ADOPT (+2.20% Mixed gain)
Step 2 & 3 Complete:
- A/B test (Mixed 10-run): STATIC_ROUTE=0 (38.91M) → =1 (39.77M) = +2.20% avg
  - Median gain: +1.98%
  - Result: GO (exceeds +1.0% threshold)

- Decision: ADOPT into MIXED_TINYV3_C7_SAFE preset
  - bench_profile.h line 77: HAKMEM_TINY_STATIC_ROUTE=1 default
  - Learner auto-disables static route when HAKMEM_SMALL_LEARNER_V7_ENABLED=1

Implementation Summary:
- core/box/tiny_static_route_box.{h,c}: Research box (Step 1A)
- core/front/malloc_tiny_fast.h: Route lookup integration (Step 1B, lines 249-256)
- core/bench_profile.h: Bench sync + preset adoption

Cumulative Phase 2-3 Gains:
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- Total: ~6.8% (35.2M → ~39.8M ops/s)

Next: Phase 3 C1 (TLS Prefetch, expected +2-4%)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 18:46:11 +09:00