Commit Graph

112 Commits

Author SHA1 Message Date
f88e51e45b Phase 12: Strategic Pause Results - Critical Finding
Completed Strategic Pause investigation with shocking discovery:
- System malloc (glibc ptmalloc2): 86.58M ops/s
- hakmem (Phase 10): 52.88M ops/s
- Gap: **+63.7%** 🚨

Baseline (Phase 10):
- Mean: 51.76M ops/s (10-run, CV 1.03%)
- Health check: PASS
- Perf stat: IPC 2.22, branch miss 2.48%, good cache locality

Allocator comparison:
- hakmem: 52.43M ops/s (RSS: 33.8MB)
- jemalloc: 48.60M ops/s (RSS: 35.6MB) [-7.3%]
- system malloc: 85.96M ops/s [+63.9%] 🚨
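
The percentages in the comparison above are relative throughput differences against the hakmem figure. A minimal sketch of that arithmetic follows; the function name `relative_gap_percent` is hypothetical and not part of the repository.

```c
#include <assert.h>

/* Relative throughput difference of `other` versus `baseline`, in percent.
 * Positive means `other` is faster. Both inputs use the same unit
 * (here: M ops/s). Hypothetical helper, for illustration only. */
static inline double relative_gap_percent(double baseline, double other) {
    assert(baseline > 0.0);
    return (other / baseline - 1.0) * 100.0;
}
```

With the numbers listed above, `relative_gap_percent(52.43, 85.96)` comes out just under +64% (reported as +63.9%) and `relative_gap_percent(52.43, 48.60)` is about -7.3%.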

Gap analysis (5 hypotheses):
1. Header write overhead (400M writes) - Expected ROI: +10-20%
2. Thread cache implementation (tcache vs TinyUnifiedCache) - Expected ROI: +20-30%
3. Metadata access pattern (indirection overhead) - Expected ROI: +5-10%
4. Classification overhead (LUT + routing) - Expected ROI: +5%
5. Freelist management (header vs chunk placement) - Expected ROI: +5%

Recommendation: Proceed to Phase 13 (Header Write Elimination)
- Most direct overhead (400M writes per 200M iters)
- Measurable with perf
- Clear ROI (+10-20% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 21:17:48 +09:00
2b5e4ad576 docs: add Phase 12 strategic pause instructions 2025-12-14 20:59:23 +09:00
a6078a52b5 Phase 12: Strategic Options Analysis
Comprehensive analysis of next optimization options after Phase 6-10 (+24.6%):

Option A: Micro-Optimization (LOW PRIORITY)
- tiny_c7_ultra_alloc (3.75%): C7-specific, +1-2% ROI
- unified_cache_push (1.61%): Marginal ROI ~+1.0%
- High risk (20-30% NO-GO), diminishing returns

Option B: Workload-Specific Optimization (🔍 MEDIUM PRIORITY)
- C6-heavy optimization (+3-5% for specific workload)
- Mid/Large allocation optimization (requires investigation)

Option C: Strategic Pause (RECOMMENDED)
- Major milestone achieved (+24.6%)
- Diminishing returns (marginal ROI < +2%)
- Time to reassess project goals and explore new frontiers

Recommendation: Strategic Pause to:
- Benchmark vs mimalloc/jemalloc
- Validate production workloads
- Explore next optimization frontiers (footprint, multi-thread, fragmentation)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:55:25 +09:00
37bb3ee63f Phase 6-10: Cumulative Results & Strategic Analysis (+24.6%)
Comprehensive analysis of Phases 6-10 achievements:
- Cumulative improvement: +24.6% (43.04M → 53.62M ops/s)
- Individual phases: 6-1 (+11.13%), 6-2 (+5.18%), 8 (+2.61%), 9 (+2.72%), 10 (+1.89%)
- Phase 7 NO-GO (-2.16%), Phase 11 NO-GO (-8.35%)

Winning patterns:
- Wrapper-level consolidation (Phase 6-1: largest single gain)
- Deduplication at layer boundaries (Phase 6-2)
- Monolithic early-exit (Phase 9, 10 vs Phase 7 function split)

Next strategic options:
A) Micro-optimizations (marginal ROI < +2%)
B) Alloc side deep dive (malloc 23.26% hotspot, high potential +5-10%)
C) Strategic pause (declare victory at +24.6%)

Recommendation: Alloc side investigation as highest-ROI next direction.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:48:34 +09:00
ad73ca5544 Phase 11: ENV Snapshot maybe-fast API - NO-GO (-8.35%)
Phase 11 attempted to consolidate ENV snapshot overhead by:
- Adding hakmem_env_snapshot_maybe_fast() API
- Caching front_v3_snap pointer in HakmemEnvSnapshot
- Replacing separate calls with single API at call sites

Result: -8.35% regression (51.65M → 47.33M ops/s)

Root cause:
- maybe_fast() called in inline hot path functions
- ctor_mode check accumulated on every call
- Compiler optimization inhibited
- Even 2-3 instructions are expensive at high frequency

Lesson: ENV gate optimization should target the gate itself, not its call sites.

All changes rolled back. Phase 11 FROZEN.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:44:42 +09:00
44273693cc docs: record Phase 10 GO promotion; add Phase 11 instructions
Phase 10 updates:
- Mark Phase 10 as promoted (GO +1.89%)
- Update CURRENT_TASK.md with Phase 10 results
- Clean up absolute paths in PHASE10 docs
- Add promotion status to PHASE10 docs

Phase 11 instructions:
- New: PHASE11_ENV_SNAPSHOT_MAYBE_FAST_1_NEXT_INSTRUCTIONS.md
- Target: Consolidate ENV snapshot overhead
- Strategy: Merge hakmem_env_snapshot* and tiny_front_v3_snapshot_get into single call
- ENV: HAKMEM_ENV_SNAPSHOT_MAYBE_FAST=0/1
- Expected: +1-2% (reduce TLS/ENV read overhead)

Files modified:
- CURRENT_TASK.md: Phase 10 GO record, Phase 11 next
- docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md

Files added:
- docs/analysis/PHASE11_ENV_SNAPSHOT_MAYBE_FAST_1_NEXT_INSTRUCTIONS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:29:28 +09:00
71b1354d32 Phase 10: FREE-TINY-FAST MONO LEGACY DIRECT (GO +1.89%)
Results:
- A/B test: +1.89% on Mixed (10-run, clean env)
- Baseline: 51.96M ops/s
- Optimized: 52.94M ops/s
- Improvement: +984K ops/s (+1.89%)
- C6-heavy verification: +7.86% (nonlegacy_mask works correctly, no misfires)

Strategy:
- Extend Phase 9 (C0-C3 DUALHOT) to C4-C7 LEGACY DIRECT
- Fail-Fast principle: Never misclassify MID/ULTRA/V7 as LEGACY
- nonlegacy_mask: Cached at init, hot path uses single bit operation

Success factors:
1. Performance improvement: +1.89% (1.9x GO threshold)
2. Safety verified: nonlegacy_mask prevents MID v3 misfire in C6-heavy
3. Phase 9 coexistence: C0-C3 (Phase 9) + C4-C7 (Phase 10) = full LEGACY coverage
4. Minimal overhead: Single bit operation in hot path (mask & (1u<<class))
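
The `nonlegacy_mask` idea can be sketched as follows: one bit per size class is computed once at init, and the hot path tests a single bit. The helper names and the class count of 8 (C0-C7) are assumptions for illustration; the real mask comes from `free_policy_fast_v2_nonlegacy_mask()`, whose internals are not shown in this log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { SKETCH_TINY_CLASSES = 8 };  /* assumption: classes C0..C7 */

/* Init-time helper: mark class_idx as NOT served by the legacy route. */
static inline uint32_t mask_mark_nonlegacy(uint32_t mask, unsigned class_idx) {
    assert(class_idx < SKETCH_TINY_CLASSES);
    return mask | (1u << class_idx);
}

/* Hot-path check: true only when the class may take the LEGACY direct path.
 * Fail-fast: out-of-range classes are never treated as LEGACY. */
static inline bool class_is_legacy_direct(uint32_t nonlegacy_mask,
                                          unsigned class_idx) {
    return class_idx < SKETCH_TINY_CLASSES &&
           (nonlegacy_mask & (1u << class_idx)) == 0;
}
```

In the C6-heavy case described above, only bit 6 would be set (mask `0x40`), so C6 is excluded while C0-C5 and C7 still take the direct path.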

Implementation:
- Patch 1: ENV gate box (free_tiny_fast_mono_legacy_direct_env_box.h)
  - ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1 (default 0)
  - nonlegacy_mask cached (reuses free_policy_fast_v2_nonlegacy_mask())
  - Probe window: 64 (avoid bench_profile putenv race)
- Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h)
  - Conditions: !nonlegacy_mask, route==LEGACY, !LARSON_FIX, done==1
  - Direct call: tiny_legacy_fallback_free_base()
- Patch 3: Visibility (free_path_stats_box.h)
  - mono_legacy_direct_hit counter (compile-out in release)
- Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh)
  - ENV leak protection

Safety verification (C6-heavy):
- OFF: 19.75M ops/s
- ON: 21.30M ops/s (+7.86%)
- nonlegacy_mask correctly excludes C6 (MID v3 active)
- Improvement from C0-C5, C7 direct path acceleration

Files modified:
- core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset
- core/front/malloc_tiny_fast.h: early-exit insertion
- core/box/free_path_stats_box.h: counter
- core/box/free_tiny_fast_mono_legacy_direct_env_box.h: NEW (ENV gate + nonlegacy_mask)
- scripts/run_mixed_10_cleanenv.sh: ENV leak protection

Health check: PASSED (all profiles)

Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out)

Rollback: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 20:09:40 +09:00
720ae61007 docs: record Phase 9 GO promotion; add Phase 10 instructions
Phase 9 updates:
- Mark Phase 9 as promoted (GO +2.72%)
- Update CURRENT_TASK.md with Phase 9 results
- Update PHASE9 docs with promotion status

Phase 10 instructions:
- New: PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md
- Target: Extend free_tiny_fast() "LEGACY direct" to C4-C7
- Strategy: Safe conditions + early-exit (similar to Phase 9 success pattern)
- ENV: HAKMEM_FREE_TINY_FAST_MONO_LEGACY_DIRECT=0/1
- Expected: +1-3% (C4-C7 coverage expansion)

Files modified:
- CURRENT_TASK.md: Phase 9 GO record, Phase 10 next
- docs/analysis/PHASE9_FREE_TINY_FAST_MONO_DUALHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE9_FREE_TINY_FAST_MONO_DUALHOT_1_NEXT_INSTRUCTIONS.md

Files added:
- docs/analysis/PHASE10_FREE_TINY_FAST_MONO_LEGACY_DIRECT_1_NEXT_INSTRUCTIONS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 19:59:15 +09:00
871034da1f Phase 9: FREE-TINY-FAST MONO DUALHOT (GO +2.72%)
Results:
- A/B test: +2.72% on Mixed (10-run, clean env)
- Baseline: 48.89M ops/s
- Optimized: 50.22M ops/s
- Improvement: +1.33M ops/s (+2.72%)
- Stability: Standard deviation reduced by 60.8% (2.44M → 955K ops/s)

Strategy:
- Transplant C0-C3 "second hot" path to monolithic free_tiny_fast()
- Early-exit within monolithic (no hot/cold split)
- FastLane free now benefits from C0-C3 direct path

Success factors:
1. Performance improvement: +2.72% (2.7x GO threshold)
2. Stability improvement: 2.6x more stable (stdev 60.8% reduction)
3. Learned from Phase 7 failure:
   - Phase 7: Function split (hot/cold) → NO-GO
   - Phase 9: Early-exit within monolithic → GO
4. FastLane free compatibility: C0-C3 direct path now works with FastLane
5. Policy snapshot overhead reduction: C0-C3 (48% of Mixed) skip route lookup

Implementation:
- Patch 1: ENV gate box (free_tiny_fast_mono_dualhot_env_box.h)
  - ENV: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0/1 (default 0)
  - Probe window: 64 (avoid bench_profile putenv race)
- Patch 2: Early-exit in free_tiny_fast() (malloc_tiny_fast.h)
  - Conditions: class_idx <= 3, !LARSON_FIX, route==LEGACY
  - Direct call: tiny_legacy_fallback_free_base()
- Patch 3: Visibility (free_path_stats_box.h)
  - mono_dualhot_hit counter (compile-out in release)
- Patch 4: cleanenv extension (run_mixed_10_cleanenv.sh)
  - ENV leak protection
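
The "early-exit within monolithic" pattern from Patch 2 can be illustrated roughly as below. This is a sketch under stated assumptions, not the code in `malloc_tiny_fast.h`: the function, the enum, and the boolean parameters are hypothetical stand-ins for the real gate, `LARSON_FIX` state, and route lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Which path a free would take (hypothetical, for illustration). */
typedef enum { FREE_PATH_LEGACY_DIRECT, FREE_PATH_GENERAL } free_path_sketch_t;

/* One function, one early return: the C0-C3 fast case exits before the
 * general routing logic instead of being split into a separate hot/cold
 * function pair (the Phase 7 shape that was NO-GO). */
static inline free_path_sketch_t
free_tiny_fast_sketch(unsigned class_idx, bool dualhot_enabled,
                      bool larson_fix, bool route_is_legacy) {
    assert(class_idx < 8);  /* assumption: classes C0..C7 */
    if (dualhot_enabled && class_idx <= 3 && !larson_fix && route_is_legacy) {
        /* The real code calls tiny_legacy_fallback_free_base() here. */
        return FREE_PATH_LEGACY_DIRECT;
    }
    /* ... general policy snapshot and route dispatch would follow ... */
    return FREE_PATH_GENERAL;
}
```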

Files modified:
- core/bench_profile.h: add to MIXED_TINYV3_C7_SAFE preset
- core/front/malloc_tiny_fast.h: early-exit insertion
- core/box/free_path_stats_box.h: counter
- core/box/free_tiny_fast_mono_dualhot_env_box.h: NEW (ENV gate)
- scripts/run_mixed_10_cleanenv.sh: ENV leak protection

Health check: PASSED (all profiles)

Promotion: Added to MIXED_TINYV3_C7_SAFE preset (default ON, opt-out)

Rollback: HAKMEM_FREE_TINY_FAST_MONO_DUALHOT=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 19:16:49 +09:00
152b578f64 docs: record Phase 8 GO; add Phase 9 mono dualhot instructions 2025-12-14 19:05:50 +09:00
be723ca052 Phase 8: FREE-STATIC-ROUTE ENV Cache Hardening (GO +2.61%)
Results:
- A/B test: +2.61% on Mixed (10-run, clean env)
- Baseline: 49.26M ops/s
- Optimized: 50.55M ops/s
- Improvement: +1.29M ops/s (+2.61%)

Strategy:
- Fix ENV cache accident (the gate value was cached before main(), ahead of bench_profile's putenv)
- Add refresh mechanism to sync with bench_profile putenv
- Ensure Phase 3 D1 optimization works reliably

Success factors:
1. Performance improvement: +2.61% (existing win-box now reliable)
2. ENV cache accident fixed: refresh mechanism works correctly
3. Standard deviation improved: 867K → 336K ops/s (61% reduction)
4. Baseline quality improved: existing optimization now guaranteed

Implementation:
- Patch 1: Make ENV gate refreshable (tiny_free_route_cache_env_box.{h,c})
  - Changed static int to extern _Atomic int
  - Added tiny_free_static_route_refresh_from_env()
- Patch 2: Integrate refresh into bench_profile.h
  - Call refresh after bench_setenv_default() group
- Patch 3: Update Makefile for new .c file
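
A minimal sketch of a refreshable ENV gate in the spirit of Patch 1 is shown below. The variable and function names are hypothetical, the state is kept `static` here to stay self-contained (the commit uses `extern _Atomic int`), and treating an unset variable as OFF is an assumption about the default.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* setenv()/unsetenv() for callers */
#endif
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* -1 = not read yet, 0 = off, 1 = on. */
static _Atomic int g_gate_state_sketch = -1;

/* Read the gate from the environment; unset counts as off (assumption). */
static inline int gate_read_env_sketch(void) {
    const char *v = getenv("HAKMEM_FREE_STATIC_ROUTE");
    return (v != NULL && v[0] == '1') ? 1 : 0;
}

/* Hot-path query: one relaxed load once the value is cached. */
static inline int gate_enabled_sketch(void) {
    int s = atomic_load_explicit(&g_gate_state_sketch, memory_order_relaxed);
    if (s < 0) {
        s = gate_read_env_sketch();
        atomic_store_explicit(&g_gate_state_sketch, s, memory_order_relaxed);
    }
    return s;
}

/* Re-read the environment, e.g. after a profile has called putenv(). */
static inline void gate_refresh_from_env_sketch(void) {
    int s = gate_read_env_sketch();
    assert(s == 0 || s == 1);
    atomic_store_explicit(&g_gate_state_sketch, s, memory_order_relaxed);
}
```

Without the refresh call, a value cached before the profile sets the variable stays stale, which is the failure mode this phase addresses.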

ENV cache fix verification:
- [FREE_STATIC_ROUTE] enabled appears twice (refresh working)
- bench_profile putenv now reliably reflected

Files modified:
- core/box/tiny_free_route_cache_env_box.h: extern + refresh API
- core/box/tiny_free_route_cache_env_box.c: NEW (global state + refresh)
- core/bench_profile.h: add refresh call
- Makefile: add new .o file

Health check: PASSED (all profiles)

Rollback: HAKMEM_FREE_STATIC_ROUTE=0 or revert Patch 1/2

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 18:49:08 +09:00
17aaa90981 docs: freeze Phase 7 FastLane free hot/cold; add Phase 8 env cache fix 2025-12-14 18:34:08 +09:00
7a3702e069 docs: Phase 7 FastLane free hot/cold alignment instructions 2025-12-14 17:52:55 +09:00
dcc1d42e7f Phase 6-2: Promote Front FastLane Free DeDup (default ON)
Results:
- A/B test: +5.18% on Mixed (10-run, clean env)
- Baseline: 46.68M ops/s
- Optimized: 49.10M ops/s
- Improvement: +2.42M ops/s (+5.18%)

Strategy:
- Eliminate duplicate header validation in front_fastlane_try_free()
- Direct call to free_tiny_fast() when dedup enabled
- Single validation path (no redundant checks)

Success factors:
1. Complete duplicate elimination (free path optimization)
2. Free path importance (50% of Mixed workload)
3. Improved execution stability (CV: 1.00% → 0.58%)

Phase 6 cumulative:
- Phase 6-1 FastLane: +11.13%
- Phase 6-2 Free DeDup: +5.18%
- Total: ~+16-17% from baseline (multiplicative effect)
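
The "multiplicative effect" in the total above means the two gains compound rather than add. A small sketch of that arithmetic (the function name is hypothetical):

```c
#include <assert.h>

/* Combine two successive percentage gains multiplicatively:
 * (1 + a/100) * (1 + b/100) - 1, expressed in percent. */
static inline double compound_gain_percent(double a_pct, double b_pct) {
    assert(a_pct > -100.0 && b_pct > -100.0);
    return ((1.0 + a_pct / 100.0) * (1.0 + b_pct / 100.0) - 1.0) * 100.0;
}
```

`compound_gain_percent(11.13, 5.18)` is about 16.9, compared with 16.31 from simple addition, which is consistent with the "~+16-17%" range quoted above.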

Promotion:
- Default: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=1 (opt-out)
- Added to MIXED_TINYV3_C7_SAFE preset
- Added to C6_HEAVY_LEGACY_POOLV1 preset
- Rollback: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=0

Files modified:
- core/box/front_fastlane_env_box.h: default 0 → 1
- core/bench_profile.h: added to presets
- CURRENT_TASK.md: Phase 6-2 GO result

Health check: PASSED (all profiles)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 17:38:21 +09:00
c0d2f47f7d docs: Phase 6-2 FastLane free dedup instructions 2025-12-14 17:09:57 +09:00
ea221d057a Phase 6: promote Front FastLane (default ON) 2025-12-14 16:28:23 +09:00
e48cbff4b9 Phase 5 Complete: E7 NO-GO confirmed + ChatGPT Pro questionnaire
Summary:
- E7 frozen box prune: -3.20% regression (NO-GO) with clean ENV
- Keep E5-2/E5-4 (NEUTRAL) + E6 (NO-GO) as research boxes
- Regression due to build differences (LTO/layout/alignment), not logic

Results:
- Winning boxes: E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) → adopted
- Frozen boxes: E5-2, E5-4, E6, E7 → kept with ENV gates (documented as research assets)
- Phase 5 cumulative progress: +6.43% on MIXED profile

Documentation updates:
- PHASE5_E7_FROZEN_BOX_PRUNE_AB_TEST_RESULTS.md: Final NO-GO record
- PHASE5_E7_FROZEN_BOX_PRUNE_NEXT_INSTRUCTIONS.md: E7 conclusion

Next phase planning:
- PHASE_ML2_CHATGPT_QUESTIONNAIRE_FASTLANE.md: Design consultation template
  - Candidates: dedup new boundaries, PGO/layout optimization feasibility

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-14 08:56:09 +09:00
6d38511318 Phase 5: E7 prune no-go (keep frozen boxes); add clean-env runner 2025-12-14 08:11:20 +09:00
f92be5f541 Phase 5: freeze E6 env snapshot shape (no-go) 2025-12-14 07:18:59 +09:00
4124c86d99 Phase 5: freeze E5-4 malloc tiny direct (neutral) 2025-12-14 06:59:35 +09:00
580e7f4fa3 Phase 5 E5-3: Candidate Analysis (All DEFERRED) + E5-4 Instructions
E5-3 Analysis Results:
- free_tiny_fast_cold (7.14%): DEFER - cold path, low ROI
- unified_cache_push (3.39%): DEFER - already optimized
- hakmem_env_snapshot_enabled (2.97%): DEFER - low headroom

Key Insight: perf self% is time-weighted, not frequency-weighted.
Cold paths appear hot but have low total impact.

Next: E5-4 (Malloc Tiny Direct Path)
- Apply E5-1 winning pattern to malloc side
- Target: tiny_alloc_gate_fast() gate tax elimination
- ENV gate: HAKMEM_MALLOC_TINY_DIRECT=0/1

Files added:
- docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
- docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- core/box/free_cold_shape_env_box.{h,c} (research box, not tested)
- core/box/free_cold_shape_stats_box.{h,c} (research box, not tested)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-14 06:44:04 +09:00
f7b18aaf13 Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc

Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median

Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)

Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)

Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)

Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance

Health Check: PASS (all profiles)

Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%

Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 06:22:25 +09:00
75e20b29cc Phase 5 E5-1: Promote to preset + next target instructions
E5-1 Promotion:
- Added HAKMEM_FREE_TINY_DIRECT=1 to MIXED_TINYV3_C7_SAFE preset
- Updated ENV_PROFILE_PRESETS.md with rollback instructions
- Rollback: HAKMEM_FREE_TINY_DIRECT=0

A/B Test Clarification:
- Documented bench_setenv_default vs export ENV=0 interaction
- bench_setenv_default only sets if ENV is unset
- To force OFF in A/B: use value that differs from default
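
The "only sets if ENV is unset" behavior matches POSIX `setenv()` with `overwrite = 0`. The sketch below shows that behavior; `setenv_default_sketch` is a hypothetical stand-in, and whether `bench_setenv_default` is actually implemented this way is an assumption.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Apply a default: the value is installed only when `name` is not
 * already present in the environment (overwrite = 0). */
static inline void setenv_default_sketch(const char *name, const char *value) {
    assert(name != NULL && value != NULL);
    setenv(name, value, 0);
}

/* Convenience check: is `name` set to exactly `expected`? */
static inline int env_equals_sketch(const char *name, const char *expected) {
    const char *v = getenv(name);
    return v != NULL && strcmp(v, expected) == 0;
}
```

Under this behavior, a variable exported before launch survives the preset, while an unset variable picks up the preset default.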

Next Target Selection (E5-2 vs E5-3):
- E5-2: Header write reduction (tiny_region_id_write_header)
- E5-3: ENV snapshot gate shape optimization
- Decision requires fresh perf profile on new baseline

Deliverables:
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md (updated)
- docs/analysis/ENV_PROFILE_PRESETS.md (E5-1 added)
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md (clarified)
- CURRENT_TASK.md (progress links)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (progress links)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:59:43 +09:00
8875132134 Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)
Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths
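
The safety gates listed above can be sketched as a pure check on the pointer value and an already-loaded header byte. The function name is hypothetical, and two details are assumptions not stated in this log: that the header is the byte just before the user pointer, and that the class index lives in the header's low nibble.

```c
#include <assert.h>
#include <stdint.h>

enum { SKETCH_CLASS_COUNT = 8 };
static_assert(SKETCH_CLASS_COUNT <= 16, "class index must fit in a nibble");

/* Returns the class index (0..7) when every gate passes, or -1 to tell the
 * caller to fall back to the existing free paths. */
static inline int tiny_direct_gate_sketch(uintptr_t ptr, uint8_t header) {
    /* Page boundary guard: a page-aligned ptr would put the preceding
     * header byte on the previous page (assumed layout). */
    if ((ptr & 0xFFFu) == 0) return -1;
    /* Tiny magic validation: high nibble must be 0xA. */
    if ((header & 0xF0u) != 0xA0u) return -1;
    /* Class bounds check (low nibble as class index is an assumption). */
    unsigned class_idx = header & 0x0Fu;
    if (class_idx >= SKETCH_CLASS_COUNT) return -1;
    return (int)class_idx;
}
```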

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:52:32 +09:00
6cdbd815ab Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated)
Combined A/B Test Results (10-run Mixed):
- Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median)
- Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median)
- Improvement: +6.43% mean, +6.74% median

Interaction Analysis:
- E4-1 alone: +3.51% (measured in separate session)
- E4-2 alone: +21.83% (measured in separate session)
- Combined: +6.43% (measured in same binary)
- Pattern: SUBADDITIVE (overlapping bottlenecks)

Key Finding: Single-binary incremental gain is the accurate metric
- E4-1 and E4-2 target overlapping TLS/branch resources
- Individual measurements were from different baselines/sessions
- Combined measurement (same binary, both flags) shows true progress

Phase 5 Total Progress:
- Original baseline (session start): 35.74M ops/s
- Combined optimized: 47.34M ops/s
- Total gain: +32.4% (cross-session, reference only)
- Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF)

New Baseline Perf Profile (47.0M ops/s):
- free: 37.56% self% (still top hotspot)
- tiny_alloc_gate_fast: 13.73% (reduced from 19.50%)
- malloc: 12.95% (reduced from 16.13%)
- tiny_region_id_write_header: 6.97% (header write tax)
- hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible)

Health Check: PASS
- MIXED_TINYV3_C7_SAFE: 42.3M ops/s
- C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s

Phase 5 E5 Candidates (from perf profile):
- E5-1: free() path internals (37.56% self%)
- E5-2: Header write reduction (6.97% self%)
- E5-3: ENV snapshot overhead (4.29% self%)

Deliverables:
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
- CURRENT_TASK.md (E4 combined complete, E5 candidates)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer)
- perf.data.e4combined (perf profile data)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:36:57 +09:00
5528612f2a Phase 5 E4-2: Malloc Wrapper ENV Snapshot (+21.83% GO, ADOPTED)
Target: Consolidate malloc wrapper TLS reads + eliminate function calls
- malloc (16.13%) + tiny_alloc_gate_fast (19.50%) = 35.63% combined
- Strategy: E4-1 success pattern + function call elimination

Implementation:
- ENV gate: HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/malloc_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates multiple TLS reads → 1 TLS read
  - Pre-caches tiny_max_size() == 256 (eliminates function call)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in malloc() wrapper
- Makefile: Add malloc_wrapper_env_snapshot_box.o to all targets

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 35.74M ops/s (mean), 35.75M ops/s (median)
- Optimized (SNAPSHOT=1): 43.54M ops/s (mean), 43.92M ops/s (median)
- Improvement: +21.83% mean, +22.86% median (+7.80M ops/s)

Decision: GO (+21.83% >> +1.0% threshold, 21.8x over)
- Why 6.2x better than E4-1 (+3.51%)?
  - Higher malloc call frequency (allocation-heavy workload)
  - Function call elimination (tiny_max_size pre-cached)
  - Larger target: 35.63% vs free's 25.26%
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative (estimated):
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- E4-2 (Malloc Wrapper Snapshot): +21.83%
- Estimated combined: ~+30% (needs validation)

Next Steps:
- Combined A/B test (E4-1 + E4-2 simultaneously)
- Measure actual cumulative effect
- Profile new baseline for next optimization targets

Deliverables:
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_DESIGN.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-2 added)
- CURRENT_TASK.md (E4-2 complete)
- core/bench_profile.h (E4-2 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:13:29 +09:00
4a070d8a14 Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)
Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path

Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates 2 TLS reads → 1 TLS read (50% reduction)
  - Reduces 4 branches → 3 branches (25% reduction)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median

Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%

E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
  - Branch prediction hint mismatch (UNLIKELY with always-true)
  - Retest confirmed -1.78% regression
  - Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer
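
The hint mismatch described above can be illustrated with the GCC/Clang builtin `__builtin_expect`. Both hypothetical functions below return the same value; they differ only in which branch the compiler is told to treat as the common one, which affects code layout rather than semantics.

```c
#include <assert.h>

/* Mismatched hint: the ctor_mode == 1 branch is marked unlikely (0), yet it
 * is taken on every call when ctor_mode is 1. */
static inline int snapshot_enabled_cold_hint(int ctor_mode, int ctor_value,
                                             int lazy_value) {
    assert(ctor_mode == 0 || ctor_mode == 1);
    if (__builtin_expect(ctor_mode == 1, 0)) return ctor_value;
    return lazy_value;
}

/* Matching hint for a configuration where ctor_mode == 1 is the norm. */
static inline int snapshot_enabled_hot_hint(int ctor_mode, int ctor_value,
                                            int lazy_value) {
    assert(ctor_mode == 0 || ctor_mode == 1);
    if (__builtin_expect(ctor_mode == 1, 1)) return ctor_value;
    return lazy_value;
}
```

Because the results are identical, a wrong hint cannot be caught by a correctness test and only shows up in measurements such as the retest above.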

Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 04:24:34 +09:00
21e2e4ac2b Phase 4 E3-4: ENV Constructor Init (+4.75% GO)
Target: Eliminate E1 lazy init check overhead (3.22% self%)
- E1 consolidated ENV gates but lazy check remained in hot path
- Strategy: __attribute__((constructor(101))) for pre-main init

Implementation:
- ENV gate: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0, research box)
- core/box/hakmem_env_snapshot_box.c: Constructor function added
  - Reads ENV before main() when CTOR=1
  - Refresh also syncs gate state for bench_profile putenv
- core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check
  - CTOR=1 fast path: direct global read (no lazy branch)
  - CTOR=0 fallback: legacy lazy init (rollback safe)
  - Branch hints adjusted for default OFF baseline
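
A minimal sketch of the pre-main initialization mechanism, using the GCC/Clang `constructor` attribute with priority 101 as named above. The variable and function names are hypothetical, and the real box additionally supports refresh and a lazy fallback that are not shown here.

```c
#include <assert.h>
#include <stdlib.h>

static int g_ctor_ran_sketch = 0;          /* set once the constructor ran */
static int g_snapshot_ctor_on_sketch = 0;  /* cached HAKMEM_ENV_SNAPSHOT_CTOR */

/* Runs before main(); 101 is the lowest priority value not reserved for
 * the implementation. */
__attribute__((constructor(101)))
static void env_snapshot_ctor_sketch(void) {
    const char *v = getenv("HAKMEM_ENV_SNAPSHOT_CTOR");
    g_snapshot_ctor_on_sketch = (v != NULL && v[0] == '1');
    g_ctor_ran_sketch = 1;
}

/* Fast-path query: a plain global read, with no lazy-init branch. */
static inline int snapshot_ctor_enabled_sketch(void) {
    assert(g_ctor_ran_sketch);  /* constructor must already have run */
    return g_snapshot_ctor_on_sketch;
}
```

One consequence of this design is that anything placed in the environment after process start (such as a profile's putenv) is not seen unless the value is refreshed.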

A/B Test Results (Mixed, 10-run, 20M iters, E1=1):
- Baseline (CTOR=0): 44.28M ops/s (mean), 44.60M ops/s (median)
- Optimized (CTOR=1): 46.38M ops/s (mean), 46.53M ops/s (median)
- Improvement: +4.75% mean, +4.35% median

Decision: GO (+4.75% >> +0.5% threshold)
- Expected +0.5-1.5%, achieved +4.75%
- Lazy init branch overhead was larger than expected
- Action: Keep as research box (default OFF), evaluate promotion

Phase 4 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
- E3-4 (Constructor Init): +4.75%
- Total Phase 4: ~+8.5%

Deliverables:
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
- docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
- scripts/verify_health_profiles.sh (sanity check script)
- CURRENT_TASK.md (E3-4 complete, next instructions)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 02:57:35 +09:00
6a6744d065 Phase 4 E2: Alloc Per-Class FastPath - NEUTRAL (-0.21%)
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DUALHOT=0): 45.40M ops/s (mean), 45.51M ops/s (median)
- Optimized (DUALHOT=1): 45.30M ops/s (mean), 45.22M ops/s (median)
- Improvement: -0.21% mean, -0.62% median

Decision: NEUTRAL (within ±1.0% noise threshold)
Action: FREEZE as research box (default OFF, no promotion)

Key Findings:
- C0-C3 fast path adds branch overhead without measurable benefit
- Unlike FREE path (+13%), ALLOC path already has optimized route caching
- Phase 3 C3 static routing eliminated route lookup overhead
- Additional per-class specialization doesn't reduce existing cost

Root Cause:
- Free DUALHOT skips expensive policy_snapshot() + tiny_route_for_class()
- Alloc DUALHOT adds C0-C3 branch but route already cached (Phase 3 C3)
- Net effect: Branch cost ≈ Route savings → neutral

Conclusion: Alloc route optimization has reached diminishing returns

Cumulative Status:
- Phase 4 E1: +3.92% (GO, research box)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)

Files:
- CURRENT_TASK.md: Updated with E2 results
- docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md: Full A/B test report

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-14 01:54:21 +09:00
7f3ff6c7e6 Phase 4: E1 docs + E2 next instructions 2025-12-14 01:46:18 +09:00
42ba23fbd0 Phase 4 E1: env snapshot consolidation docs 2025-12-14 00:48:03 +09:00
11b0e3f32b Phase 4 D3: alloc gate shape (env-gated) 2025-12-14 00:26:57 +09:00
b40aff290e Phase 4 D3 Design: Alloc Gate Shape 2025-12-14 00:05:11 +09:00
141cd8a5be Phase 3 Closure & Phase 4 Preparation
Summary:
- Phase 3 optimization complete (cumulative +8.93%)
- D1 promoted to default (HAKMEM_FREE_STATIC_ROUTE=1, +2.19%)
- D2 frozen (NO-GO, -1.44% regression)
- Phase 4 instructions prepared (D3/Alloc Gate Specialization)

Results:
  B3 (Routing shape): +2.89%
  B4 (Wrapper split): +1.47%
  C3 (Static routing): +2.20%
  C1 (TLS prefetch): NEUTRAL (-0.34%, research box)
  C2 (Metadata cache): NEUTRAL (-0.45%, research box)
  D1 (Free route cache): +2.19% (now default)
  D2 (Wrapper env cache): NO-GO (-1.44%, frozen)
  MID_V3 fix: +13% (structural)

Total Phase 2-3 gain: ~8.93% (37.5M → 51M ops/s)

Updated:
- CURRENT_TASK.md: Phase 3 final results + D3 conditions
- ENV_PROFILE_PRESETS.md: Active optimizations listed
- PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: Phase 3→4 transition
- PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md: D3 execution plan
- PHASE3_BASELINE_AND_CANDIDATES.md: Post-validation status

Next phase: Phase 4 D3 - Alloc Gate Specialization
- Requires: tiny_alloc_gate_fast self% ≥5% from perf
- Design SSOT: PHASE3_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md
- Execution: PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 23:47:19 +09:00
50bded8c85 Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established
Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
  - Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
  - Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
  - Mean gain: +2.19%, Median gain: +2.37%
  - Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
  - Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset

- D2 (Wrapper env cache): FROZEN
  - Previous result: -1.44% regression (TLS overhead > benefit)
  - Status: Research box (do not pursue further)
  - Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)

- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)
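
The D1 promotion rule quoted above (mean >= +1.0% and median >= +0.0%) can be written as a small predicate. This is a minimal sketch, not the project's code: `ab_is_go` and both macro names are hypothetical, and only the two thresholds come from the commit text.

```c
#include <stdbool.h>

/* Thresholds as stated in the D1 decision: mean >= +1.0%, median >= +0.0%. */
#define AB_GO_MIN_MEAN_GAIN_PCT   1.0
#define AB_GO_MIN_MEDIAN_GAIN_PCT 0.0

/* Hypothetical helper: true only when both promotion criteria are met. */
static bool ab_is_go(double mean_gain_pct, double median_gain_pct) {
    return mean_gain_pct >= AB_GO_MIN_MEAN_GAIN_PCT &&
           median_gain_pct >= AB_GO_MIN_MEDIAN_GAIN_PCT;
}
```

With the 20-run D1 figures (+2.19% mean, +2.37% median) the predicate is true; with the D2 figures (-1.44%, -1.05%) it is false.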

Cumulative Gains (Phase 2-3):
  B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
  Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
  MID_V3 fix: +13% (structural change, Mixed OFF by default)

Documentation Updates:
  - PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
  - PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
  - PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
  - PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
  - ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
  - PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
  - CURRENT_TASK.md: Phase 3 complete summary

Next:
  - D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
  - Or Phase 4 planning if no more D3-class targets
  - Current active optimizations: B3, B4, C3, D1, MID_V3 fix

Files Changed:
  - docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
  - docs/analysis/*.md (6 files updated with D1/D2 results)
  - CURRENT_TASK.md (Phase 3 status update)
  - analyze_d1_results.py (statistical analysis script)
  - core/bench_profile.h (D1 promoted to default in MIXED preset)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:42:22 +09:00
19056282b6 Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]
Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)

Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Change: avg -1.44%, median -1.05% (DECISION: NO-GO)
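
The percentages in the A/B results above follow from the usual relative-change formula. A minimal sketch, assuming the reported values are plain relative differences against the baseline (`ab_gain_pct` is a hypothetical name):

```c
/* Relative change of `optimized` versus `baseline`, in percent.
 * Negative values indicate a regression. */
static double ab_gain_pct(double baseline, double optimized) {
    return (optimized - baseline) / baseline * 100.0;
}
```

For example, `ab_gain_pct(46.52, 45.85)` is approximately -1.44 and `ab_gain_pct(46.47, 45.98)` is approximately -1.05, which agrees with the average and median figures reported for D2.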

Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)

Decision: FREEZE as research box (default OFF, regression confirmed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:03:27 +09:00
f059c0ec83 Phase 3 D1: Free Path Route Cache - DECISION: GO (+1.06%)
Target: Eliminate tiny_route_for_class() overhead in free path
- Perf finding: 4.39% self + 24.78% children (free bottleneck)
- Approach: Use cached route_kind (like Phase 3 C3 for alloc)

Implementation:
- core/box/tiny_free_route_cache_env_box.h (new)
  * ENV gate: HAKMEM_FREE_STATIC_ROUTE=0/1 (default OFF)
  * Lazy initialization with sentinel value
- core/front/malloc_tiny_fast.h (modified)
  * Two call sites: free_tiny_fast_cold() + legacy_fallback path
  * Direct route lookup: g_tiny_route_class[class_idx]
  * Fallback safety: Check g_tiny_route_snapshot_done
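
The direct table lookup with a snapshot-done fallback described above might look roughly like the following. This is an illustrative sketch only: every `demo_` name, the eight-class count (C0..C7), and the route mapping are assumptions, not the contents of `malloc_tiny_fast.h`.

```c
#include <stdint.h>

enum { DEMO_NUM_CLASSES = 8 };  /* assumption: size classes C0..C7 */

/* Hypothetical route kinds; the real set is not shown in the commit. */
typedef enum { DEMO_ROUTE_LEGACY = 0, DEMO_ROUTE_ULTRA = 1 } demo_route_t;

static uint8_t g_demo_route_class[DEMO_NUM_CLASSES]; /* cached route per class */
static int g_demo_route_snapshot_done = 0;           /* 0 until the table is filled */

/* Stand-in for the slower per-call route computation (invented mapping). */
static demo_route_t demo_route_for_class(int class_idx) {
    return (class_idx == 7) ? DEMO_ROUTE_ULTRA : DEMO_ROUTE_LEGACY;
}

/* Fill the table once, then publish it by setting the done flag. */
static void demo_route_snapshot_build(void) {
    for (int i = 0; i < DEMO_NUM_CLASSES; i++)
        g_demo_route_class[i] = (uint8_t)demo_route_for_class(i);
    g_demo_route_snapshot_done = 1;
}

/* Free-path lookup: use the cached table when it is ready,
 * otherwise fall back to the slower computation. */
static demo_route_t demo_free_route(int class_idx) {
    if (g_demo_route_snapshot_done)
        return (demo_route_t)g_demo_route_class[class_idx];
    return demo_route_for_class(class_idx);
}
```

The fallback keeps the result correct before the snapshot exists, so enabling the cached path changes only the cost of the lookup, not its answer.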

A/B Test Results (Mixed, 10-run):
- Baseline (D1=0): 45.13 M ops/s (avg), 45.76 M ops/s (median)
- Optimized (D1=1): 45.61 M ops/s (avg), 45.40 M ops/s (median)
- Improvement: +1.06% (avg), -0.77% (median)
- DECISION: GO (avg gain meets +1.0% threshold)

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06%
- Total: ~7.2% cumulative gain

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 21:44:00 +09:00
d0b931b197 Phase 3 C1: TLS Prefetch Implementation - NEUTRAL Result (Research Box)
Step 1 & 2 Complete:
- Implemented: core/front/malloc_tiny_fast.h prefetch (lines 264-267, 331-334)
  - LEGACY path prefetch of g_unified_cache[class_idx] to L1
  - ENV gate: HAKMEM_TINY_PREFETCH=0/1 (default OFF)
  - Conditional: only when prefetch enabled + route_kind == LEGACY

- A/B test (Mixed 10-run): PREFETCH=0 (39.33M) → =1 (39.20M) = -0.34% avg
  - Median: +1.28% (within ±1.0% neutral range)
  - Result: 🔬 NEUTRAL (research box, default OFF)

Decision: FREEZE as research box
- Average -0.34% suggests prefetch overhead > benefit
- Prefetch timing too late (after route_kind selection)
- TLS cache access is already fast (head/tail indices)
- Actual memory wait happens at slots[] array access (after prefetch)
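
To make the timing argument concrete, here is a rough sketch of a gated prefetch in front of a head/tail ring cache. The structure layout, the FIFO order, and all `demo_` names are assumptions for illustration; `__builtin_prefetch` is the GCC/Clang builtin, and `_Thread_local` requires C11.

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_NUM_CLASSES = 8, DEMO_SLOTS = 64 };  /* assumed sizes */

/* Hypothetical per-class cache: head/tail indices plus a slots[] array. */
typedef struct {
    uint16_t head, tail;
    void *slots[DEMO_SLOTS];
} demo_cache_t;

static _Thread_local demo_cache_t g_demo_cache[DEMO_NUM_CLASSES];
static int g_demo_prefetch_enabled = 0;  /* mirrors "default OFF" */

static int demo_cache_push(int class_idx, void *p) {
    demo_cache_t *c = &g_demo_cache[class_idx];
    uint16_t next = (uint16_t)((c->tail + 1) % DEMO_SLOTS);
    if (next == c->head) return 0;       /* ring is full */
    c->slots[c->tail] = p;
    c->tail = next;
    return 1;
}

static void *demo_cache_pop(int class_idx) {
    demo_cache_t *c = &g_demo_cache[class_idx];
    /* The hint is issued immediately before the access it targets,
     * so it has almost no time to take effect. */
    if (g_demo_prefetch_enabled) __builtin_prefetch(c, 0, 3);
    if (c->head == c->tail) return NULL; /* ring is empty */
    void *p = c->slots[c->head];         /* the load that may actually stall */
    c->head = (uint16_t)((c->head + 1) % DEMO_SLOTS);
    return p;
}
```

In this shape the prefetch touches the header that is read on the very next line, while the potentially cold `slots[]` entry is not prefetched at all, which is consistent with the "too late, wrong target" reading above.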

Technical Learning:
- Prefetch effectiveness depends on L1 miss rate at access time
- Inserting prefetch after route selection may be too late
- Future approach: move prefetch earlier or use different target

Next: Phase 3 C2 (Metadata Cache Optimization, expected +5-10%)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 19:01:57 +09:00
d54893ea1d Phase 3 C3: Static Routing A/B Test ADOPT (+2.20% Mixed gain)
Step 2 & 3 Complete:
- A/B test (Mixed 10-run): STATIC_ROUTE=0 (38.91M) → =1 (39.77M) = +2.20% avg
  - Median gain: +1.98%
  - Result: GO (exceeds +1.0% threshold)

- Decision: ADOPT into MIXED_TINYV3_C7_SAFE preset
  - bench_profile.h line 77: HAKMEM_TINY_STATIC_ROUTE=1 default
  - Learner auto-disables static route when HAKMEM_SMALL_LEARNER_V7_ENABLED=1

Implementation Summary:
- core/box/tiny_static_route_box.{h,c}: Research box (Step 1A)
- core/front/malloc_tiny_fast.h: Route lookup integration (Step 1B, lines 249-256)
- core/bench_profile.h: Bench sync + preset adoption

Cumulative Phase 2-3 Gains:
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- Total: ~6.8% (35.2M → ~39.8M ops/s)

Next: Phase 3 C1 (TLS Prefetch, expected +2-4%)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 18:46:11 +09:00
4c4796a1f8 Phase 2 B4: Documentation & Instruction Creation (Phase 2→3 Transition)
Documentation Created:
- docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md: Phase 2 completion report (B3+B4 cumulative +4.4%)
- docs/analysis/PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: Phase 3 start instructions (C3 Static Routing prioritized)

Verification Completed:
- HAKMEM_WRAP_SHAPE=1 promoted to preset (core/bench_profile.h:67)
- wrapper_env_refresh_from_env() implemented (core/box/wrapper_env_box.c:49-64)
- malloc_cold() lock_depth symmetry confirmed (g_hakmem_lock_depth-- on every return path)
- A/B test result: Mixed +1.47% (≥+1.0% GO threshold)

Summary:
  B3 routing shape:  +2.89%
  B4 wrapper shape:  +1.47%
  ─────────────────
  Estimated total:   ~+4.4%

Next Phase: Phase 3 (Cache Locality, +12-22%)
- Priority: C3 (Static Routing) - bypass policy_snapshot, +5-8% expected
- Profile: recommend using perf top to identify the malloc/policy_snapshot hot spots

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 17:32:34 +09:00
d9991f39ff Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation

Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions

Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration

Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:35:46 +09:00
b2724e6f5d Phase ALLOC-TINY-FAST-DUALHOT-1: WIP (regression), FREE DUALHOT confirmed +13%
**ALLOC-TINY-FAST-DUALHOT-1** (this phase):
- Implementation: malloc_tiny_fast() C0-C3 early-exit with policy snapshot skip
- ENV: HAKMEM_TINY_ALLOC_DUALHOT=0/1 (default OFF)
- A/B Result: -1.17% median regression (Mixed, 10-run)
- Root Cause: Branch prediction penalty on C4-C7 outweighs policy skip benefit
- Decision: Freeze as research box (default OFF)
- Difference from FREE: ALLOC requires structural changes (per-class paths)

**FREE-TINY-FAST-DUALHOT-1** (verified):
- A/B Confirmation: +13.00% improvement (42.08M → 47.81M ops/s, Mixed, 10-run)
- Success Criteria: +2% target ACHIEVED
- Health Check: PASS (verify_health_profiles.sh, ENV OFF/ON)
- Safety: HAKMEM_TINY_LARSON_FIX guard in place
- Decision: Promotion to MIXED_TINYV3_C7_SAFE profile candidate

**Next Steps**:
- Profile adoption of FREE DUALHOT for MIXED workload
- No further deep-dive on ALLOC optimization (deferred to future phases)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:10:45 +09:00
0a7400d7d3 Phase ALLOC-TINY-FAST-DUALHOT-1: C0-C3 alloc direct path (WIP, -2% regression)
Add C0-C3 early-exit optimization to malloc_tiny_fast() similar to
FREE-TINY-FAST-DUALHOT-1. Skip policy snapshot for C0-C3 classes.

A/B Result (10-run, Mixed TINYV3_C7_SAFE):
- Baseline: 47.27M ops/s (median)
- Optimized: 46.10M ops/s (median)
- Result: -2.00% (regression, needs investigation)

ENV: HAKMEM_TINY_ALLOC_DUALHOT=0/1 (default OFF)

Implementation:
- core/front/malloc_tiny_fast.h: alloc_dualhot_enabled() + early-exit
- Design: docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md

Status: Research box (default OFF), needs root cause analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 04:28:52 +09:00
fe70e3baf5 Phase MID-V35-HOTPATH-OPT-1 complete: +7.3% on C6-heavy
Step 0: Geometry SSOT
  - New: core/box/smallobject_mid_v35_geom_box.h (L1/L2 consistency)
  - Fix: C6 slots/page 102→128 in L2 (smallobject_cold_iface_mid_v3.c)
  - Applied: smallobject_mid_v35.c, smallobject_segment_mid_v3.c

Step 1-3: ENV gates for hotpath optimizations
  - New: core/box/mid_v35_hotpath_env_box.h
    * HAKMEM_MID_V35_HEADER_PREFILL (default 0)
    * HAKMEM_MID_V35_HOT_COUNTS (default 1)
    * HAKMEM_MID_V35_C6_FASTPATH (default 0)
  - Implementation: smallobject_mid_v35.c
    * Header prefill at refill boundary (Step 1)
    * Gated alloc_count++ in hot path (Step 2)
    * C6 specialized fast path with constant slot_size (Step 3)
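
ENV gates of this kind (a 0/1 variable with a per-gate default) are commonly implemented with a one-time lazy read. The following is a minimal sketch under that assumption; the `demo_` helpers are hypothetical and are not taken from `mid_v35_hotpath_env_box.h`. Only the variable name and its default of 1 come from the list above.

```c
#include <stdlib.h>

/* Parse a 0/1 style value; NULL or an empty string keeps the default. */
static int demo_env_flag_parse(const char *value, int default_value) {
    if (value == NULL || value[0] == '\0') return default_value;
    return value[0] != '0';
}

/* -1 is the "not read yet" sentinel; 0 or 1 once resolved. */
static int g_demo_hot_counts = -1;

/* Reads HAKMEM_MID_V35_HOT_COUNTS once (default 1), then returns the cached value. */
static int demo_mid_v35_hot_counts_enabled(void) {
    if (g_demo_hot_counts < 0)
        g_demo_hot_counts =
            demo_env_flag_parse(getenv("HAKMEM_MID_V35_HOT_COUNTS"), 1);
    return g_demo_hot_counts;
}
```

After the first call the hot path pays only for one load and one predictable branch. The sketch is not synchronized; two threads racing on the first call would both compute the same value, but a real implementation may want an atomic or an explicit init step.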

A/B Results:
  C6-heavy (257–768B): 8.75M→9.39M ops/s (+7.3%, 5-run mean) 
  Mixed (16–1024B): 9.98M→9.96M ops/s (-0.2%, within noise) ✓

Decision: FROZEN - defaults OFF; ON recommended for C6-heavy, Mixed left unchanged
Documentation: ENV_PROFILE_PRESETS.md updated

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 19:19:25 +09:00
e95e61f0ff Phase POLICY-FAST-PATH-V2 complete + MID-V35-HOTPATH-OPT-1 design
## Phase POLICY-FAST-PATH-V2 (FROZEN)
- Implementation complete: free_policy_fast_v2_box.h + malloc_tiny_fast.h integration
- A/B Results:
  - Mixed (ws=400): -1.6% regression (branch cost > skip benefit)
  - C6-heavy (ws=200): +5.4% improvement
- Decision: Default OFF, FROZEN (ws<300 / C6-heavy research only)
- Learning: Large WS causes branch misprediction to dominate

## Phase 3-GRADUATE + ENV probe fix
- 64-probe retry for getenv() stability during bench_profile putenv()
- C6 ULTRA intrusive freelist: FROZEN (research box)

## Phase MID-V35-HOTPATH-OPT-1-DESIGN
- Design doc for next optimization target
- Target: MID v3.5 alloc/free hot path (C5-C6)
- Boxes: Stats Gate, TLS Layout, Boundary Check elimination
- Expected: +3-9% on Mixed mainline

Files:
- core/box/free_policy_fast_v2_box.h (new)
- core/box/free_path_stats_box.h/c (policy_fast_v2_skip counter)
- core/front/malloc_tiny_fast.h (fast-path integration)
- docs/analysis/MID_V35_HOTPATH_OPT_1_DESIGN.md (new)
- docs/analysis/PHASE_3_GRADUATE_*.md (new)
- CURRENT_TASK.md (phase status update)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-12 18:40:08 +09:00
1a8652a91a Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS
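
An intrusive LIFO freelist stores the link to the next free block inside the freed block itself, so no side array is needed. The sketch below illustrates the idea under explicit assumptions: the link is kept at byte offset 1 from the block pointer (one possible reading of "USER+1"), and `ifl_next_store`/`ifl_next_load` are stand-ins for `tiny_next_store`/`tiny_next_load`, whose actual definitions are not shown here.

```c
#include <stddef.h>
#include <string.h>

#define IFL_NEXT_OFFSET 1  /* assumed placement of the link inside a block */

typedef struct { void *head; } ifl_t;  /* hypothetical list head */

/* memcpy keeps the unaligned store and load well defined. */
static void ifl_next_store(void *block, void *next) {
    memcpy((unsigned char *)block + IFL_NEXT_OFFSET, &next, sizeof next);
}

static void *ifl_next_load(const void *block) {
    void *next;
    memcpy(&next, (const unsigned char *)block + IFL_NEXT_OFFSET, sizeof next);
    return next;
}

/* Push: the freed block becomes the new head and links to the old one. */
static void ifl_push(ifl_t *fl, void *block) {
    ifl_next_store(block, fl->head);
    fl->head = block;
}

/* Pop: return the head (or NULL) and advance to its stored successor. */
static void *ifl_pop(ifl_t *fl) {
    void *block = fl->head;
    if (block != NULL) fl->head = ifl_next_load(block);
    return block;
}

/* Small demo: returns 1 when blocks come back in LIFO order. */
static int ifl_demo(void) {
    static unsigned char a[512], b[512];  /* fake freed 512-byte slots */
    ifl_t fl = { NULL };
    ifl_push(&fl, a);
    ifl_push(&fl, b);
    return ifl_pop(&fl) == b && ifl_pop(&fl) == a && ifl_pop(&fl) == NULL;
}
```

Routing every link access through one store/load pair is what makes the offset a single point of change, which matches the "single source of truth" note above.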

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00
bf83612b97 Phase v11a-4: Add Mixed mainline benchmark results
Results:
- C6-heavy (257-512B): +5.1% (34.0M → 35.8M ops/s)
- Mixed 16-1024B:      +4.4% (38.6M → 40.3M ops/s)

Conclusion: On the Mixed mainline, C6→MID v3.5 is a candidate for adoption.
Confirmed a +4-5% improvement, exceeding the predicted +1-3%.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 07:17:52 +09:00
212739607a Phase v11a-3: MID v3.5 Activation (Build Complete)
Integrated MID v3.5 into active code path, making it available for C5/C6/C7 routing.

Key Changes:
- Policy Box: Added SMALL_ROUTE_MID_V35 with ENV gates (HAKMEM_MID_V35_ENABLED, HAKMEM_MID_V35_CLASSES)
- HotBox: Implemented small_mid_v35_alloc/free with TLS-cached page allocation
- Front Gate: Wired MID_V35 routing into malloc_tiny_fast.h (priority: ULTRA > MID_V35 > V7)
- Build: Added core/smallobject_mid_v35.o to all object lists

Architecture:
- Slot sizes: C5=384B, C6=512B, C7=1024B
- Page size: 64KB (170/128/64 slots)
- Integration: ColdIface v2 (refill/retire), Stats v2 (observation), Learner v2 (dormant)
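
The slot counts above are consistent with integer division of the page size by the slot size. A minimal sketch, assuming no per-page header is subtracted (`demo_slots_per_page` and `DEMO_PAGE_BYTES` are hypothetical names):

```c
#include <stddef.h>

#define DEMO_PAGE_BYTES ((size_t)64 * 1024)  /* 64KB page, as stated above */

/* Whole slots that fit in one page, ignoring any page header. */
static size_t demo_slots_per_page(size_t slot_bytes) {
    return DEMO_PAGE_BYTES / slot_bytes;
}
```

This gives 170 slots for C5 (384B), 128 for C6 (512B), and 64 for C7 (1024B), matching the 170/128/64 figures.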

Status: Build successful, ready for A/B benchmarking
Next: Performance validation (C6-heavy, C5+C6-only, Mixed benchmarks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 06:52:14 +09:00
57313f7822 Phase v11a: Architecture design and implementation roadmap documents
Create comprehensive design specifications for Phase v11a (MID v3.5):

1. PHASE_V11A_DESIGN_MID_V3.5.md
   - Decision rationale: Option A chosen (consolidation vs expansion)
   - MID v3.5 architecture: unified 257-1KiB box
   - Role clarification: v7 frozen as research preset
   - Learner v2 scope: multi-class tracking, C5 ratio primary decision
   - Segment design decision: shared segment (Design B) vs separate segments
   - Stats expansion: per-class efficiency metrics
   - API changes: minimal, backward compatible

2. PHASE_V11A_IMPLEMENTATION_ROADMAP.md
   - Detailed task breakdown for v11a-1, v11a-2, v11a-3
   - File structure: new boxes, implementation files, modified files
   - Concrete function signatures and integration points
   - Benchmark commands and expected performance
   - Dependency graph and implementation order
   - Build/Makefile changes needed
   - Testing strategy and regression checks

Key Design Decisions:
- Multi-class segment uses shared 2MiB segment (not separate)
- Per-class free page stacks for efficient refill
- Stats published per-page retire (for Learner ingestion)
- TLS version-based cache invalidation (atomic policy updates)
- Backward compatibility: Policy v2 extends v1 interface

Next Step: Phase v11a-2 (Core Implementation)
- Implement segment creation/alloc/free
- Add C7 support to existing MID_v3
- Stats recording during page retire
- Learner aggregation logic

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 06:20:14 +09:00
397aea0131 Phase v10: Freeze v7 as C5/C6-only research preset
Documentation: Baseline fixed per Phase v10
- HAKMEM_V2_GENERATION_SUMMARY.md:
  - v7 repositioned as a "C5/C6-only research box"
  - Mixed baseline: HAKMEM_SMALL_HEAP_V7_ENABLED=0 (OFF)
  - Added Phase v7-7 (Learner), Phase v10 (legacy removal)
  - Learner performance: +127% on C5/C6 workload
  - Size class table: segregated Mixed (v7 OFF) vs C5/C6 preset (v7 ON)

- ENV_PROFILE_PRESETS.md:
  - MIXED_TINYV3_C7_SAFE: explicitly v7 OFF (Mixed baseline)
  - NEW: C5_C6_SMALL_HEAP_V7_LEARNER profile
    - Learner dynamic route switching documentation
    - Test commands and expected performance (38-39M ops/s)
    - Phase v10 deprecation notice (v3/v4/v5 removed)

Purpose:
- Set clear baseline: v7 OFF for Mixed, ON for C5/C6 benchmarks
- Document Learner preset for future reference
- No code changes (docs-only checkpoint)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 06:13:15 +09:00