dcc1d42e7f
Phase 6-2: Promote Front FastLane Free DeDup (default ON)
...
Results:
- A/B test: +5.18% on Mixed (10-run, clean env)
- Baseline: 46.68M ops/s
- Optimized: 49.10M ops/s
- Improvement: +2.42M ops/s (+5.18%)
Strategy:
- Eliminate duplicate header validation in front_fastlane_try_free()
- Direct call to free_tiny_fast() when dedup enabled
- Single validation path (no redundant checks)
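A minimal sketch of the dedup shape described above. The helper names, the 1-byte header layout, and the free_tiny_fast() signature are assumptions for illustration, not the committed code:

```c
#include <stdint.h>
#include <stdbool.h>

extern bool free_tiny_fast(void* ptr, unsigned class_idx);   /* assumed signature */
extern bool front_fastlane_free_dedup_enabled(void);         /* assumed ENV gate reader */

/* Validate the tiny header exactly once, then call free_tiny_fast()
 * directly instead of routing through a second gate that re-checks it. */
static inline bool front_fastlane_try_free_sketch(void* ptr) {
    const uint8_t hdr = ((const uint8_t*)ptr)[-1];            /* assumed 1-byte header */
    if ((hdr & 0xF0) != 0xA0) return false;                   /* not a tiny block */
    if (!front_fastlane_free_dedup_enabled()) return false;   /* legacy gate handles it */
    return free_tiny_fast(ptr, hdr & 0x0F);                   /* single-validation path */
}
```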
Success factors:
1. Complete duplicate elimination (free path optimization)
2. Free path importance (50% of Mixed workload)
3. Improved execution stability (CV: 1.00% → 0.58%)
Phase 6 cumulative:
- Phase 6-1 FastLane: +11.13%
- Phase 6-2 Free DeDup: +5.18%
- Total: ~+16-17% from baseline (multiplicative effect)
Promotion:
- Default: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=1 (opt-out)
- Added to MIXED_TINYV3_C7_SAFE preset
- Added to C6_HEAVY_LEGACY_POOLV1 preset
- Rollback: HAKMEM_FRONT_FASTLANE_FREE_DEDUP=0
Files modified:
- core/box/front_fastlane_env_box.h: default 0 → 1
- core/bench_profile.h: added to presets
- CURRENT_TASK.md: Phase 6-2 GO result
Health check: PASSED (all profiles)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 17:38:21 +09:00
c0d2f47f7d
docs: Phase 6-2 FastLane free dedup instructions
2025-12-14 17:09:57 +09:00
f301ee4df3
chore: fix FastLane default comment
2025-12-14 16:30:32 +09:00
ea221d057a
Phase 6: promote Front FastLane (default ON)
2025-12-14 16:28:23 +09:00
e48cbff4b9
Phase 5 Complete: E7 NO-GO confirmed + ChatGPT Pro questionnaire
...
Summary:
- E7 frozen box prune: -3.20% regression (NO-GO) with clean ENV
- Keep E5-2/E5-4 (NEUTRAL) + E6 (NO-GO) as research boxes
- Regression due to build differences (LTO/layout/alignment), not logic
Results:
- Winning boxes: E4-1 (+3.51%), E4-2 (+21.83%), E5-1 (+3.35%) → adopted
- Frozen boxes: E5-2, E5-4, E6, E7 → kept with ENV gates (doc as assets)
- Phase 5 cumulative progress: +6.43% on MIXED profile
Documentation updates:
- PHASE5_E7_FROZEN_BOX_PRUNE_AB_TEST_RESULTS.md: Final NO-GO record
- PHASE5_E7_FROZEN_BOX_PRUNE_NEXT_INSTRUCTIONS.md: E7 conclusion
Next phase planning:
- PHASE_ML2_CHATGPT_QUESTIONNAIRE_FASTLANE.md: Design consultation template
- Candidates: dedup new boundaries, PGO/layout optimization feasibility
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-14 08:56:09 +09:00
6d38511318
Phase 5: E7 prune no-go (keep frozen boxes); add clean-env runner
2025-12-14 08:11:20 +09:00
f92be5f541
Phase 5: freeze E6 env snapshot shape (no-go)
2025-12-14 07:18:59 +09:00
4124c86d99
Phase 5: freeze E5-4 malloc tiny direct (neutral)
2025-12-14 06:59:35 +09:00
580e7f4fa3
Phase 5 E5-3: Candidate Analysis (All DEFERRED) + E5-4 Instructions
...
E5-3 Analysis Results:
- free_tiny_fast_cold (7.14%): DEFER - cold path, low ROI
- unified_cache_push (3.39%): DEFER - already optimized
- hakmem_env_snapshot_enabled (2.97%): DEFER - low headroom
Key Insight: perf self% is time-weighted, not frequency-weighted.
Cold paths appear hot but have low total impact.
Next: E5-4 (Malloc Tiny Direct Path)
- Apply E5-1 winning pattern to malloc side
- Target: tiny_alloc_gate_fast() gate tax elimination
- ENV gate: HAKMEM_MALLOC_TINY_DIRECT=0/1
Files added:
- docs/analysis/PHASE5_E5_3_ANALYSIS_AND_RECOMMENDATIONS.md
- docs/analysis/PHASE5_E5_4_MALLOC_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- core/box/free_cold_shape_env_box.{h,c} (research box, not tested)
- core/box/free_cold_shape_stats_box.{h,c} (research box, not tested)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-14 06:44:04 +09:00
f7b18aaf13
Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
...
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc
Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median
Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)
Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)
Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)
Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance
Health Check: PASS (all profiles)
Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%
Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 06:22:25 +09:00
75e20b29cc
Phase 5 E5-1: Promote to preset + next target instructions
...
E5-1 Promotion:
- Added HAKMEM_FREE_TINY_DIRECT=1 to MIXED_TINYV3_C7_SAFE preset
- Updated ENV_PROFILE_PRESETS.md with rollback instructions
- Rollback: HAKMEM_FREE_TINY_DIRECT=0
A/B Test Clarification:
- Documented bench_setenv_default vs export ENV=0 interaction
- bench_setenv_default only sets if ENV is unset
- To force OFF in A/B: use value that differs from default
Next Target Selection (E5-2 vs E5-3):
- E5-2: Header write reduction (tiny_region_id_write_header)
- E5-3: ENV snapshot gate shape optimization
- Decision requires fresh perf profile on new baseline
Deliverables:
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md (updated)
- docs/analysis/ENV_PROFILE_PRESETS.md (E5-1 added)
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md (clarified)
- CURRENT_TASK.md (progress links)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (progress links)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 05:59:43 +09:00
8875132134
Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)
...
Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%
Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (the header was previously validated twice)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny
Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)
Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths
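A hedged sketch of the wrapper-side shape implied by the strategy and safety gates above; free_fallback() and the exact free_tiny_fast() signature are assumptions:

```c
#include <stdint.h>
#include <stdbool.h>

extern bool free_tiny_fast(void* ptr, unsigned class_idx);   /* assumed signature */
extern void free_fallback(void* ptr);                        /* assumed existing slow path */

static inline void free_tiny_direct_sketch(void* ptr) {
    if (((uintptr_t)ptr & 0xFFF) == 0) {            /* page boundary guard */
        free_fallback(ptr);
        return;
    }
    const uint8_t hdr = ((const uint8_t*)ptr)[-1];
    const unsigned class_idx = hdr & 0x0F;
    if ((hdr & 0xF0) == 0xA0 && class_idx < 8) {    /* tiny magic + class bounds check */
        if (free_tiny_fast(ptr, class_idx))         /* single header check, direct call */
            return;
    }
    free_fallback(ptr);                             /* fail-fast fallback to existing paths */
}
```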
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median
Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)
Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%
Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 05:52:32 +09:00
6cdbd815ab
Phase 5 E4 Combined: E4-1 + E4-2 (+6.43% GO, baseline consolidated)
...
Combined A/B Test Results (10-run Mixed):
- Baseline (both OFF): 44.48M ops/s (mean), 44.39M ops/s (median)
- Optimized (both ON): 47.34M ops/s (mean), 47.38M ops/s (median)
- Improvement: +6.43% mean, +6.74% median
Interaction Analysis:
- E4-1 alone: +3.51% (measured in separate session)
- E4-2 alone: +21.83% (measured in separate session)
- Combined: +6.43% (measured in same binary)
- Pattern: SUBADDITIVE (overlapping bottlenecks)
Key Finding: Single-binary incremental gain is the accurate metric
- E4-1 and E4-2 target overlapping TLS/branch resources
- Individual measurements were from different baselines/sessions
- Combined measurement (same binary, both flags) shows true progress
Phase 5 Total Progress:
- Original baseline (session start): 35.74M ops/s
- Combined optimized: 47.34M ops/s
- Total gain: +32.4% (cross-session, reference only)
- Same-binary gain: +6.43% (E4-1+E4-2 both ON vs both OFF)
New Baseline Perf Profile (47.0M ops/s):
- free: 37.56% self% (still top hotspot)
- tiny_alloc_gate_fast: 13.73% (reduced from 19.50%)
- malloc: 12.95% (reduced from 16.13%)
- tiny_region_id_write_header: 6.97% (header write tax)
- hakmem_env_snapshot_enabled: 4.29% (ENV overhead visible)
Health Check: PASS
- MIXED_TINYV3_C7_SAFE: 42.3M ops/s
- C6_HEAVY_LEGACY_POOLV1: 20.9M ops/s
Phase 5 E5 Candidates (from perf profile):
- E5-1: free() path internals (37.56% self%)
- E5-2: Header write reduction (6.97% self%)
- E5-3: ENV snapshot overhead (4.29% self%)
Deliverables:
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E5_NEXT_INSTRUCTIONS.md
- CURRENT_TASK.md (E4 combined complete, E5 candidates)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md (E5 pointer)
- perf.data.e4combined (perf profile data)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 05:36:57 +09:00
5528612f2a
Phase 5 E4-2: Malloc Wrapper ENV Snapshot (+21.83% GO, ADOPTED)
...
Target: Consolidate malloc wrapper TLS reads + eliminate function calls
- malloc (16.13%) + tiny_alloc_gate_fast (19.50%) = 35.63% combined
- Strategy: E4-1 success pattern + function call elimination
Implementation:
- ENV gate: HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/malloc_wrapper_env_snapshot_box.{h,c}: New box
- Consolidates multiple TLS reads → 1 TLS read
- Pre-caches tiny_max_size() == 256 (eliminates function call)
- Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in malloc() wrapper
- Makefile: Add malloc_wrapper_env_snapshot_box.o to all targets
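A rough sketch of the snapshot pattern (lazy per-thread init, a single TLS read on the hot path, pre-cached tiny max size). Field names are invented for illustration, and the probe-window/putenv sync mentioned above is omitted:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint8_t  inited;
    uint8_t  snapshot_on;      /* HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT */
    uint16_t tiny_max_size;    /* pre-cached 256: no tiny_max_size() call per malloc */
} malloc_wrapper_snapshot_t;

static __thread malloc_wrapper_snapshot_t g_snap;

static inline const malloc_wrapper_snapshot_t* malloc_wrapper_snapshot(void) {
    if (!g_snap.inited) {      /* lazy init: first call on this thread */
        const char* e = getenv("HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT");
        g_snap.snapshot_on   = (uint8_t)(e && e[0] == '1');
        g_snap.tiny_max_size = 256;
        g_snap.inited = 1;
    }
    return &g_snap;            /* hot path: one TLS read instead of several */
}
```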
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 35.74M ops/s (mean), 35.75M ops/s (median)
- Optimized (SNAPSHOT=1): 43.54M ops/s (mean), 43.92M ops/s (median)
- Improvement: +21.83% mean, +22.86% median (+7.80M ops/s)
Decision: GO (+21.83% >> +1.0% threshold, 21.8x over)
- Why 6.2x better than E4-1 (+3.51%)?
- Higher malloc call frequency (allocation-heavy workload)
- Function call elimination (tiny_max_size pre-cached)
- Larger target: 35.63% vs free's 25.26%
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset
Phase 5 Cumulative (estimated):
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- E4-2 (Malloc Wrapper Snapshot): +21.83%
- Estimated combined: ~+30% (needs validation)
Next Steps:
- Combined A/B test (E4-1 + E4-2 simultaneously)
- Measure actual cumulative effect
- Profile new baseline for next optimization targets
Deliverables:
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_DESIGN.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-2 added)
- CURRENT_TASK.md (E4-2 complete)
- core/bench_profile.h (E4-2 promoted to default)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 05:13:29 +09:00
4a070d8a14
Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)
...
Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path
Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
- Consolidates 2 TLS reads → 1 TLS read (50% reduction)
- Reduces 4 branches → 3 branches (25% reduction)
- Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median
Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset
Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%
E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
- Branch prediction hint mismatch (UNLIKELY hint on an effectively always-true condition)
- Retest confirmed -1.78% regression
- Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer
Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 04:24:34 +09:00
21e2e4ac2b
Phase 4 E3-4: ENV Constructor Init (+4.75% GO)
...
Target: Eliminate E1 lazy init check overhead (3.22% self%)
- E1 consolidated ENV gates but lazy check remained in hot path
- Strategy: __attribute__((constructor(101))) for pre-main init
Implementation:
- ENV gate: HAKMEM_ENV_SNAPSHOT_CTOR=0/1 (default 0, research box)
- core/box/hakmem_env_snapshot_box.c: Constructor function added
- Reads ENV before main() when CTOR=1
- Refresh also syncs gate state for bench_profile putenv
- core/box/hakmem_env_snapshot_box.h: Dual-mode enabled check
- CTOR=1 fast path: direct global read (no lazy branch)
- CTOR=0 fallback: legacy lazy init (rollback safe)
- Branch hints adjusted for default OFF baseline
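Illustrative shape of the CTOR=1 mode: a pre-main constructor fills a global so the hot-path check is a plain load with no lazy-init branch. Variable names here are hypothetical:

```c
#include <stdlib.h>

static int g_env_snapshot_on;   /* filled before main() when CTOR mode is active */

__attribute__((constructor(101)))   /* 101: first user priority after the reserved range */
static void env_snapshot_ctor(void) {
    const char* e = getenv("HAKMEM_ENV_SNAPSHOT_CTOR");
    if (e && e[0] == '1')
        g_env_snapshot_on = 1;      /* read the ENV gate once, pre-main */
}

static inline int env_snapshot_enabled_fast(void) {
    return g_env_snapshot_on;       /* direct global read, no "initialized yet?" check */
}
```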
A/B Test Results (Mixed, 10-run, 20M iters, E1=1):
- Baseline (CTOR=0): 44.28M ops/s (mean), 44.60M ops/s (median)
- Optimized (CTOR=1): 46.38M ops/s (mean), 46.53M ops/s (median)
- Improvement: +4.75% mean, +4.35% median
Decision: GO (+4.75% >> +0.5% threshold)
- Expected +0.5-1.5%, achieved +4.75%
- Lazy init branch overhead was larger than expected
- Action: Keep as research box (default OFF), evaluate promotion
Phase 4 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E2 (Alloc Per-Class): -0.21% (NEUTRAL, frozen)
- E3-4 (Constructor Init): +4.75%
- Total Phase 4: ~+8.5%
Deliverables:
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_DESIGN.md
- docs/analysis/PHASE4_E3_ENV_CONSTRUCTOR_INIT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE4_COMPREHENSIVE_STATUS_ANALYSIS.md
- docs/analysis/PHASE4_EXECUTIVE_SUMMARY.md
- scripts/verify_health_profiles.sh (sanity check script)
- CURRENT_TASK.md (E3-4 complete, next instructions)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com >
2025-12-14 02:57:35 +09:00
6a6744d065
Phase 4 E2: Alloc Per-Class FastPath - NEUTRAL (-0.21%)
...
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DUALHOT=0): 45.40M ops/s (mean), 45.51M ops/s (median)
- Optimized (DUALHOT=1): 45.30M ops/s (mean), 45.22M ops/s (median)
- Improvement: -0.21% mean, -0.62% median
Decision: NEUTRAL (within ±1.0% noise threshold)
Action: FREEZE as research box (default OFF, no promotion)
Key Findings:
- C0-C3 fast path adds branch overhead without measurable benefit
- Unlike FREE path (+13%), ALLOC path already has optimized route caching
- Phase 3 C3 static routing eliminated route lookup overhead
- Additional per-class specialization doesn't reduce existing cost
Root Cause:
- Free DUALHOT skips expensive policy_snapshot() + tiny_route_for_class()
- Alloc DUALHOT adds C0-C3 branch but route already cached (Phase 3 C3)
- Net effect: Branch cost ≈ Route savings → neutral
Conclusion: Alloc route optimization has reached diminishing returns
Cumulative Status:
- Phase 4 E1: +3.92% (GO, research box)
- Phase 4 E2: -0.21% (NEUTRAL, frozen)
Files:
- CURRENT_TASK.md: Updated with E2 results
- docs/analysis/PHASE4_E2_ALLOC_PER_CLASS_FASTPATH_AB_TEST_RESULTS.md: Full A/B test report
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-14 01:54:21 +09:00
7f3ff6c7e6
Phase 4: E1 docs + E2 next instructions
2025-12-14 01:46:18 +09:00
88717a8737
Phase 4 E1: ENV Snapshot Consolidation - GO (+3.92% avg, +4.01% median)
...
Target: Consolidate 3 ENV gate TLS reads → 1 TLS read
- tiny_c7_ultra_enabled_env(): 1.28% self
- tiny_front_v3_enabled(): 1.01% self
- tiny_metadata_cache_enabled(): 0.97% self
- Total overhead: 3.26% self (perf profile analysis)
Implementation:
- core/box/hakmem_env_snapshot_box.h (new): ENV snapshot struct & API
- core/box/hakmem_env_snapshot_box.c (new): TLS snapshot implementation
- core/front/malloc_tiny_fast.h: Migrated 5 call sites to snapshot
- core/box/tiny_legacy_fallback_box.h: Migrated 2 call sites
- core/box/tiny_metadata_cache_hot_box.h: Migrated 1 call site
- core/bench_profile.h: Added hakmem_env_snapshot_refresh_from_env()
- Makefile: Added hakmem_env_snapshot_box.o to build
- ENV gate: HAKMEM_ENV_SNAPSHOT=0/1 (default: 0, research box)
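A sketch of the consolidation idea: the three existing gate functions (signatures assumed) are sampled once into a per-thread snapshot, so the hot path pays one TLS read instead of three gate calls. The struct layout is illustrative:

```c
#include <stdint.h>

extern int tiny_c7_ultra_enabled_env(void);    /* existing gates; signatures assumed */
extern int tiny_front_v3_enabled(void);
extern int tiny_metadata_cache_enabled(void);

typedef struct {
    uint8_t inited;
    uint8_t c7_ultra;
    uint8_t front_v3;
    uint8_t metadata_cache;
} hakmem_env_snapshot_t;

static __thread hakmem_env_snapshot_t g_env_snap;

static inline const hakmem_env_snapshot_t* hakmem_env_snapshot_get(void) {
    if (!g_env_snap.inited) {
        g_env_snap.c7_ultra       = (uint8_t)tiny_c7_ultra_enabled_env();
        g_env_snap.front_v3       = (uint8_t)tiny_front_v3_enabled();
        g_env_snap.metadata_cache = (uint8_t)tiny_metadata_cache_enabled();
        g_env_snap.inited = 1;
    }
    return &g_env_snap;   /* 3 gate reads folded into 1 TLS read */
}
```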
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (E1=0): 43,617,549 ops/s (avg), 43,562,895 ops/s (median)
- Optimized (E1=1): 45,327,239 ops/s (avg), 45,309,218 ops/s (median)
- Improvement: avg +3.92%, median +4.01%
Decision: GO (+3.92% >= +2.5% threshold)
- Action: Keep as research box (default OFF) for Phase 4
- Next: Consider promotion to default in MIXED_TINYV3_C7_SAFE preset
Design Rationale:
- Shape optimizations (B3, D3) reached saturation (+0.56% NEUTRAL)
- Shift to memory/TLS overhead optimization (new optimization frontier)
- Pattern: Similar to existing tiny_front_v3_snapshot (proven approach)
- Expected: +1-3% from 3.26% ENV overhead → Achieved: +3.92%
Technical Details:
- Consolidation: 3 TLS reads → 1 TLS read (66% reduction)
- Learner interlock: tiny_metadata_cache_eff pre-computed in snapshot
- Version sync: Refreshes on small_policy_v7_version_changed()
- Fallback safety: Existing ENV gates still available when E1=0
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-14 00:59:12 +09:00
42ba23fbd0
Phase 4 E1: env snapshot consolidation docs
2025-12-14 00:48:03 +09:00
11b0e3f32b
Phase 4 D3: alloc gate shape (env-gated)
2025-12-14 00:26:57 +09:00
b40aff290e
Phase 4 D3 Design: Alloc Gate Shape
2025-12-14 00:05:11 +09:00
141cd8a5be
Phase 3 Closure & Phase 4 Preparation
...
Summary:
- Phase 3 optimization complete (cumulative +8.93%)
- D1 promoted to default (HAKMEM_FREE_STATIC_ROUTE=1, +2.19%)
- D2 frozen (NO-GO, -1.44% regression)
- Phase 4 instructions prepared (D3/Alloc Gate Specialization)
Results:
B3 (Routing shape): +2.89%
B4 (Wrapper split): +1.47%
C3 (Static routing): +2.20%
C1 (TLS prefetch): NEUTRAL (-0.34%, research box)
C2 (Metadata cache): NEUTRAL (-0.45%, research box)
D1 (Free route cache): +2.19% (now default)
D2 (Wrapper env cache): NO-GO (-1.44%, frozen)
MID_V3 fix: +13% (structural)
Total Phase 2-3 gain: ~8.93% (37.5M → 51M ops/s)
Updated:
- CURRENT_TASK.md: Phase 3 final results + D3 conditions
- ENV_PROFILE_PRESETS.md: Active optimizations listed
- PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: Phase 3→4 transition
- PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md: D3 execution plan
- PHASE3_BASELINE_AND_CANDIDATES.md: Post-validation status
Next phase: Phase 4 D3 - Alloc Gate Specialization
- Requires: tiny_alloc_gate_fast self% ≥5% from perf
- Design SSOT: PHASE3_D3_ALLOC_GATE_SPECIALIZATION_1_DESIGN.md
- Execution: PHASE4_ALLOC_GATE_SPECIALIZATION_NEXT_INSTRUCTIONS.md
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 23:47:19 +09:00
50bded8c85
Phase 3 Finalization: D1 20-run validation, D2 frozen, baseline established
...
Summary:
- D1 (Free route cache): 20-run validation → PROMOTED TO DEFAULT
- Baseline (20-run, ROUTE=0): 46.30M ops/s (mean), 46.30M (median)
- Optimized (20-run, ROUTE=1): 47.32M ops/s (mean), 47.39M (median)
- Mean gain: +2.19%, Median gain: +2.37%
- Decision: GO (both criteria met: mean >= +1.0%, median >= +0.0%)
- Implementation: Added HAKMEM_FREE_STATIC_ROUTE=1 to MIXED preset
- D2 (Wrapper env cache): FROZEN
- Previous result: -1.44% regression (TLS overhead > benefit)
- Status: Research box (do not pursue further)
- Default: OFF (not included in MIXED_TINYV3_C7_SAFE preset)
- Baseline Phase 3: 46.04M ops/s (Mixed, 10-run, 2025-12-13)
Cumulative Gains (Phase 2-3):
B3: +2.89%, B4: +1.47%, C3: +2.20%, D1: +2.19%
Total: ~7.6-8.9% (conservative: 7.6%, multiplicative: 8.93%)
MID_V3 fix: +13% (structural change, Mixed OFF by default)
Documentation Updates:
- PHASE3_FINALIZATION_SUMMARY.md: Comprehensive Phase 3 report
- PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: D1/D2 final status
- PHASE3_D1_FREE_ROUTE_CACHE_1_DESIGN.md: 20-run validation results
- PHASE3_D2_WRAPPER_ENV_CACHE_1_DESIGN.md: FROZEN status
- ENV_PROFILE_PRESETS.md: D1 ADOPT, D2 FROZEN
- PHASE3_BASELINE_AND_CANDIDATES.md: Post-D1/D2 status
- CURRENT_TASK.md: Phase 3 complete summary
Next:
- D3 requires perf validation (tiny_alloc_gate_fast self% ≥5%)
- Or Phase 4 planning if no more D3-class targets
- Current active optimizations: B3, B4, C3, D1, MID_V3 fix
Files Changed:
- docs/analysis/PHASE3_FINALIZATION_SUMMARY.md (new, 580+ lines)
- docs/analysis/*.md (6 files updated with D1/D2 results)
- CURRENT_TASK.md (Phase 3 status update)
- analyze_d1_results.py (statistical analysis script)
- core/bench_profile.h (D1 promoted to default in MIXED preset)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 22:42:22 +09:00
742b474890
Update CURRENT_TASK: Phase 3 D2 Complete (NO-GO, -1.44% regression)
2025-12-13 22:04:28 +09:00
19056282b6
Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]
...
Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)
Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)
A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)
Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings
Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)
Decision: FREEZE as research box (default OFF, regression confirmed)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 22:03:27 +09:00
76ea5ad57d
Update CURRENT_TASK: Phase 3 D1 Complete (GO, +1.06%)
2025-12-13 21:44:52 +09:00
f059c0ec83
Phase 3 D1: Free Path Route Cache - DECISION: GO (+1.06%)
...
Target: Eliminate tiny_route_for_class() overhead in free path
- Perf finding: 4.39% self + 24.78% children (free bottleneck)
- Approach: Use cached route_kind (like Phase 3 C3 for alloc)
Implementation:
- core/box/tiny_free_route_cache_env_box.h (new)
* ENV gate: HAKMEM_FREE_STATIC_ROUTE=0/1 (default OFF)
* Lazy initialization with sentinel value
- core/front/malloc_tiny_fast.h (modified)
* Two call sites: free_tiny_fast_cold() + legacy_fallback path
* Direct route lookup: g_tiny_route_class[class_idx]
* Fallback safety: Check g_tiny_route_snapshot_done
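Sketch of the cached-route read with the fallback guard described above; the global names follow the commit text, the types and slow-lookup signature are assumed:

```c
#include <stdint.h>

extern uint8_t g_tiny_route_class[8];                      /* cached route kind per class */
extern int     g_tiny_route_snapshot_done;                 /* snapshot published? */
extern uint8_t tiny_route_for_class(unsigned class_idx);   /* assumed slow lookup */

static inline uint8_t free_route_lookup_sketch(unsigned class_idx) {
    if (g_tiny_route_snapshot_done)              /* fallback safety check */
        return g_tiny_route_class[class_idx];    /* direct table read */
    return tiny_route_for_class(class_idx);      /* slow path until the snapshot exists */
}
```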
A/B Test Results (Mixed, 10-run):
- Baseline (D1=0): 45.13 M ops/s (avg), 45.76 M ops/s (median)
- Optimized (D1=1): 45.61 M ops/s (avg), 45.40 M ops/s (median)
- Improvement: +1.06% (avg), -0.77% (median)
- DECISION: GO (avg gain meets +1.0% threshold)
Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06%
- Total: ~7.2% cumulative gain
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 21:44:00 +09:00
d43a3ce611
Update CURRENT_TASK: Phase 3 C2 Complete (NEUTRAL, research box)
2025-12-13 19:20:27 +09:00
deecda7336
Phase 3 C2: Slab Metadata Cache Optimization (3 patches) - NEUTRAL
...
Patch 1: Policy Hot Cache
- Add TinyPolicyHot struct (route_kind[8] cached in TLS)
- Eliminate policy_snapshot() calls (~2 memory ops saved)
- Safety: disabled when learner v7 active
- Files: tiny_metadata_cache_env_box.h, tiny_metadata_cache_hot_box.{h,c}
- Integration: malloc_tiny_fast.h route selection
Patch 2: First Page Inline Cache
- Cache current slab page pointer in TLS per-class
- Avoid superslab metadata lookup (1-2 memory ops)
- Fast-path in tiny_legacy_fallback_free_base()
- Files: tiny_first_page_cache.h, tiny_unified_cache.c
- Integration: tiny_legacy_fallback_box.h
Patch 3: Bounds Check Compile-out
- Hardcode unified_cache capacity as MACRO constant
- Eliminate modulo operation (constant fold)
- Macros: TINY_UNIFIED_CACHE_CAPACITY_POW2=11, CAPACITY=2048, MASK=2047
- File: tiny_unified_cache.h
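Patch 3 in isolation, as a sketch: with the capacity fixed at compile time to a power of two, the ring index becomes a constant mask instead of a runtime modulo. Macro names follow the commit text; the index helper is hypothetical:

```c
#define TINY_UNIFIED_CACHE_CAPACITY_POW2 11
#define TINY_UNIFIED_CACHE_CAPACITY      (1u << TINY_UNIFIED_CACHE_CAPACITY_POW2)   /* 2048 */
#define TINY_UNIFIED_CACHE_MASK          (TINY_UNIFIED_CACHE_CAPACITY - 1u)         /* 2047 */

static inline unsigned unified_cache_slot(unsigned pos) {
    return pos & TINY_UNIFIED_CACHE_MASK;   /* constant-folded; no div/mod instruction */
}
```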
A/B Test Results (Mixed, 10-run):
- Baseline (C2=0): 40.43M ops/s (avg), 40.72M ops/s (median)
- Optimized (C2=1): 40.25M ops/s (avg), 40.29M ops/s (median)
- Improvement: -0.45% (avg), -1.06% (median)
- DECISION: NEUTRAL (within ±1.0% threshold)
- Action: Keep as research box (ENV gate OFF by default)
Cumulative Gain (Phase 2-3):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- C2 (Metadata cache): -0.45%
- Total: ~6.1% (from baseline 37.5M → 39.8M ops/s)
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 19:19:42 +09:00
d0b931b197
Phase 3 C1: TLS Prefetch Implementation - NEUTRAL Result (Research Box)
...
Step 1 & 2 Complete:
- Implemented: core/front/malloc_tiny_fast.h prefetch (lines 264-267, 331-334)
- LEGACY path prefetch of g_unified_cache[class_idx] to L1
- ENV gate: HAKMEM_TINY_PREFETCH=0/1 (default OFF)
- Conditional: only when prefetch enabled + route_kind == LEGACY
- A/B test (Mixed 10-run): PREFETCH=0 (39.33M) → =1 (39.20M) = -0.34% avg
- Median: +1.28% (within ±1.0% neutral range)
- Result: 🔬 NEUTRAL (research box, default OFF)
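Sketch of the prefetch insertion; the unified-cache TLS layout and the gate reader are assumptions:

```c
#include <stdbool.h>

typedef struct { unsigned head, tail; void* slots[2048]; } unified_cache_t;   /* assumed layout */
extern __thread unified_cache_t g_unified_cache[8];                           /* assumed TLS array */
extern bool tiny_prefetch_enabled(void);                                      /* HAKMEM_TINY_PREFETCH */

static inline void maybe_prefetch_unified_cache(unsigned class_idx, int route_is_legacy) {
    if (route_is_legacy && tiny_prefetch_enabled())
        __builtin_prefetch(&g_unified_cache[class_idx], /*rw=*/0, /*locality=*/3);  /* pull to L1 */
}
```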
Decision: FREEZE as research box
- Average -0.34% suggests prefetch overhead > benefit
- Prefetch timing too late (after route_kind selection)
- TLS cache access is already fast (head/tail indices)
- Actual memory wait happens at slots[] array access (after prefetch)
Technical Learning:
- Prefetch effectiveness depends on L1 miss rate at access time
- Inserting prefetch after route selection may be too late
- Future approach: move prefetch earlier or use different target
Next: Phase 3 C2 (Metadata Cache Optimization, expected +5-10%)
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 19:01:57 +09:00
d54893ea1d
Phase 3 C3: Static Routing A/B Test ADOPT (+2.20% Mixed gain)
...
Step 2 & 3 Complete:
- A/B test (Mixed 10-run): STATIC_ROUTE=0 (38.91M) → =1 (39.77M) = +2.20% avg
- Median gain: +1.98%
- Result: ✅ GO (exceeds +1.0% threshold)
- Decision: ✅ ADOPT into MIXED_TINYV3_C7_SAFE preset
- bench_profile.h line 77: HAKMEM_TINY_STATIC_ROUTE=1 default
- Learner auto-disables static route when HAKMEM_SMALL_LEARNER_V7_ENABLED=1
Implementation Summary:
- core/box/tiny_static_route_box.{h,c}: Research box (Step 1A)
- core/front/malloc_tiny_fast.h: Route lookup integration (Step 1B, lines 249-256)
- core/bench_profile.h: Bench sync + preset adoption
Cumulative Phase 2-3 Gains:
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- Total: ~6.8% (35.2M → ~39.8M ops/s)
Next: Phase 3 C1 (TLS Prefetch, expected +2-4%)
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 18:46:11 +09:00
1798ed656d
Phase 3 C3: Tiny Static Routing Box Implementation (Step 1A)
...
Research Box Implementation:
- core/box/tiny_static_route_box.h: TinyStaticRoute struct & API
- core/box/tiny_static_route_box.c: Static route table management
- Makefile: Added tiny_static_route_box.o to 3 OBJS lists
Design:
- ENV gate: HAKMEM_TINY_STATIC_ROUTE=0/1 (default: 0)
- Learner auto-disable: If HAKMEM_TINY_LEARNER_ENABLED=1, force OFF
- Constructor priority: 102 (runs after wrapper_env_ctor at 101)
- Thread-safe: Atomic CAS for exactly-once initialization
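Sketch of the exactly-once snapshot init via CAS; apart from g_tiny_route_class and tiny_route_for_class (named elsewhere in this log), the names and state encoding are hypothetical:

```c
#include <stdatomic.h>
#include <stdint.h>

extern uint8_t tiny_route_for_class(unsigned class_idx);   /* assumed slow lookup */

static uint8_t    g_tiny_route_class[8];
static atomic_int g_route_init_state;    /* 0 = idle, 1 = claimed, 2 = published */

static void tiny_static_route_init_once(void) {
    int expected = 0;
    if (atomic_compare_exchange_strong(&g_route_init_state, &expected, 1)) {
        for (unsigned c = 0; c < 8; c++)                 /* CAS winner fills the table */
            g_tiny_route_class[c] = tiny_route_for_class(c);
        atomic_store(&g_route_init_state, 2);            /* publish */
    }
    /* Losers either wait for state == 2 or fall back to the slow lookup. */
}
```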
Baseline Profiling (Step 0 Complete):
- Throughput: 46.2M ops/s (10M iterations × 400 ws)
- Instructions/cycle: 2.11 insn/cycle
- Frontend stalls: 10.62% (memory latency bottleneck)
- Cache-misses: 3.46% of references
Expected C3 gain: +5-8% (policy_snapshot bypass)
Next Steps (Step 1B onwards):
1. Integrate static route into malloc_tiny_fast_for_class()
2. A/B test: Mixed 10-run, expect +1% minimum for GO
3. Decision: GO if +1%, NO-GO if -1%, else freeze
Status:
✅ Phase 2 (B3+B4): +4.4% cumulative
✅ Phase 3 planning & C3 Step 0-1A complete
⏳ Phase 3 C3 Step 1B-3 pending (malloc integration & testing)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 18:04:14 +09:00
4c4796a1f8
Phase 2 B4: Documentation & Instruction Creation (Phase 2→3 Transition)
...
Documentation Created:
- docs/analysis/PHASE2_STRUCTURAL_CHANGES_NEXT_INSTRUCTIONS.md: Phase 2 completion report (B3+B4 cumulative +4.4%)
- docs/analysis/PHASE3_CACHE_LOCALITY_NEXT_INSTRUCTIONS.md: Phase 3 kickoff instructions (C3 Static Routing first)
Verification Completed:
- ✅ HAKMEM_WRAP_SHAPE=1 promoted to preset (core/bench_profile.h:67)
- ✅ wrapper_env_refresh_from_env() implemented (core/box/wrapper_env_box.c:49-64)
- ✅ malloc_cold() lock_depth symmetry verified (g_hakmem_lock_depth-- on every return path)
- ✅ A/B test result: Mixed +1.47% (≥ +1.0% GO threshold)
Summary:
B3 routing shape: +2.89%
B4 wrapper shape: +1.47%
─────────────────
Estimated total: ~+4.4%
Next Phase: Phase 3 (Cache Locality, +12-22%)
- Priority: C3 (Static Routing) - bypass policy_snapshot, +5-8% expected
- Profile: identify the malloc/policy_snapshot hot spots with perf top (recommended)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 17:32:34 +09:00
c687673a99
Phase 2 B4: Wrapper Layer Hot/Cold Split (malloc/free) - ADOPT (+1.47%)
...
- Implement malloc_cold() helper (noinline,cold) for LD mode, jemalloc, force_libc
- Add malloc() hot/cold dispatch with HAKMEM_WRAP_SHAPE=1 ENV gate
- Implement free_cold() helper (noinline,cold) for classification, ownership checks
- Add free() hot/cold dispatch: hot path returns early, cold path delegates to free_cold()
- Lock_depth symmetry verified on all return paths (malloc: ++/--, free: consistent)
A/B Testing Results (Mixed 10-run):
WRAP_SHAPE=0 (default): 34,750,578 ops/s
WRAP_SHAPE=1 (optimized): 35,262,596 ops/s
Average gain: +1.47% (Median: +1.39%)
✓ Decision: GO (exceeds +1.0% threshold)
Implementation Strategy:
- Separate frequently-executed code from rare paths (LD, jemalloc, diagnostics)
- Keep hot path instruction count minimal (returns early on success)
- L1 I-cache pressure reduction via noinline,cold attributes
- Default OFF (HAKMEM_WRAP_SHAPE=0) maintains backward compatibility
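A sketch of the dispatch shape under the stated strategy; tiny_alloc_gate_fast() exists in the tree, but its signature, the rare-case helper, and the gate reader are assumptions:

```c
#include <stddef.h>
#include <stdbool.h>

extern void* tiny_alloc_gate_fast(size_t size);   /* hot allocation entry (assumed signature) */
extern void* malloc_rare_cases(size_t size);      /* assumed: LD mode, jemalloc, force_libc, ... */
extern bool  wrap_shape_enabled(void);            /* HAKMEM_WRAP_SHAPE gate (assumed reader) */

__attribute__((noinline, cold))
static void* malloc_cold_sketch(size_t size) {    /* out of line: keeps the hot I-cache footprint small */
    return malloc_rare_cases(size);
}

static void* malloc_hot_cold_sketch(size_t size) {
    if (wrap_shape_enabled()) {
        void* p = tiny_alloc_gate_fast(size);
        if (p) return p;                          /* hot path returns early on success */
    }
    return malloc_cold_sketch(size);              /* rare paths delegated to the cold helper */
}
```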
Files:
- core/box/hak_wrappers.inc.h: malloc_cold(), free_cold(), hot/cold dispatches
- core/box/wrapper_env_box.h/c: HAKMEM_WRAP_SHAPE ENV variable caching
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 17:08:24 +09:00
0feeccdcef
Phase 2 B1/B3/B4 preparation: Analysis & ENV gate setup
...
## Phase 2 Optimization Research Complete
### B1 (Header tax reduction v2) - NO-GO
- HAKMEM_TINY_HEADER_MODE=LIGHT: -2.54% regression on Mixed
- Decision: FREEZE as research box (ENV opt-in only)
### B3 (Routing branch shape optimization) - ADOPT
- Mixed: +2.89% (48.41M → 49.80M ops/s)
- C6-heavy: +9.13% (8.97M → 9.79M ops/s)
- Strategy: LIKELY on LEGACY (hot), noinline,cold helper for rare routes
- Implementation: Already in malloc_tiny_fast.h:252-267
- Profile updates: HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1 now default
### B4 (Wrapper Layer Hot/Cold Split) - Preparation
- Design memo: docs/analysis/PHASE2_B4_WRAPPER_SHAPE_1_DESIGN.md
- Goal: Split malloc/free into hot/cold paths, reduce I-cache pressure
- ENV gate: HAKMEM_WRAP_SHAPE=0/1 (added to wrapper_env_box)
- Expected gain: +2-5% Mixed, +1-3% C6-heavy
## Analysis Summary
- The overall picture is now clear: FREE DUALHOT + B3 routing optimizations work
- Code layering is clean: winning boxes promoted to presets, losing boxes frozen with ENV guards
- Remaining gap to mimalloc is wrapper layer + safety checks + policy snapshot
- Further +5-10% still realistically achievable
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 16:46:18 +09:00
cc398e4a0e
Phase 2 B1 & B3: Routing optimization research (NO-GO on B1, ADOPT B3)
...
## B1 (Header tax reduction v2) - NO-GO
- HAKMEM_TINY_HEADER_MODE=LIGHT: -2.54% on Mixed (regression)
- Decision: FREEZE as research box, ENV opt-in only
## B3 (Routing branch shape optimization) - ADOPT
- Mixed: +2.89% (48.41M → 49.80M ops/s)
- C6-heavy: +9.13% (8.97M → 9.79M ops/s)
- Strategy: LIKELY on LEGACY (hot path), cold helper for rare routes
- Implementation: Already in malloc_tiny_fast.h:252-267, now enabled by default
- Profile updates: bench_profile.h adds HAKMEM_TINY_ALLOC_ROUTE_SHAPE=1 to MIXED_TINYV3_C7_SAFE and C6_HEAVY_LEGACY_POOLV1
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 16:08:24 +09:00
150c3bddd4
Update CURRENT_TASK: Phase 1A3 Complete (NO-GO, research box)
...
Phase 1A3 always_inline test complete:
- A/B Result: Mixed -4.00% (I-cache pressure), C6-heavy +6.00%
- Decision: NO-GO - freeze as research box
- Commit: df37baa50
Phase 1 Summary:
- A1: FREE promotion ✅ DONE
- A2: observation-tax elimination ✅ DONE
- A3: always_inline ❌ NO-GO (I-cache issue)
Expected Phase 1 impact: +2-3% (A1 FREE +13% + A2 observe-tax reduction)
Next: Phase 2 structural changes, Phase 3 cache locality
2025-12-13 15:31:33 +09:00
df37baa505
Phase 1A3: tiny_region_id_write_header always_inline research box (NO-GO)
...
Add HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE build flag (default 0) to enable
always_inline on tiny_region_id_write_header().
A/B Results (HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE=0 vs 1):
- Mixed (10-run): 49.53M → 47.55M ops/s (-4.00% regression)
- C6-heavy (5-run): 23.49M → 24.93M ops/s (+6.00% improvement)
Decision: NO-GO - Mixed regression (-4%) exceeds threshold (-1%)
Status: Frozen as research box (default OFF)
Root Cause: I-cache pressure from forced inline expansion
- Mixed workload: higher code diversity → I-cache evictions
- C6-heavy workload: focused pattern → benefits from inlining
Patches:
1. core/hakmem_build_flags.h: Add HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE (default 0)
2. core/tiny_region_id.h: Add conditional __attribute__((always_inline)) gate
Build: make -j EXTRA_CFLAGS=-DHAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE=1 [target]
Recommendation: Keep compiler's natural inline heuristic (already optimal for Mixed)
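The patch shape, sketched: a build flag toggles the attribute, and the default leaves inlining to the compiler. The header write itself is shown with an assumed 1-byte layout and a hypothetical function name:

```c
#ifndef HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE
#define HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE 0     /* default: compiler heuristic */
#endif

#if HAKMEM_TINY_HEADER_WRITE_ALWAYS_INLINE
#define TINY_HEADER_WRITE_ATTR __attribute__((always_inline))
#else
#define TINY_HEADER_WRITE_ATTR /* no forced inlining */
#endif

static inline TINY_HEADER_WRITE_ATTR void
tiny_header_write_sketch(void* blk, unsigned char class_idx) {
    ((unsigned char*)blk)[-1] = (unsigned char)(0xA0 | class_idx);   /* assumed layout */
}
```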
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 15:31:08 +09:00
93b59ef414
Update CURRENT_TASK: ALLOC-GATE-SSOT-1 + DUALHOT-2 Complete
...
Phase 2 finished: 4 patches implement SSOT + branch optimization
Results:
- Mixed: -0.27% (neutral, SSOT cost absorbed by aggregate)
- C6-heavy: +1.68% (SSOT benefit: eliminate duplicate size→class)
Decision: ADOPT SSOT as structural foundation
- Enables future *_for_class specialization
- DUALHOT-2 as ENV feature (default OFF)
- No regression on default path
Commit: d0f939c2e
Next: Phase 1 Quick Wins (A1-A3: FREE promotion, observation tax, inline)
2025-12-13 06:51:11 +09:00
d0f939c2eb
Phase ALLOC-GATE-SSOT-1 + ALLOC-TINY-FAST-DUALHOT-2: Structure fixes for alloc path
...
4 patches to eliminate allocation overhead and enable research path:
Patch 1: Extract malloc_tiny_fast_for_class(size, class_idx)
- SSOT: size→class conversion happens once in gate
- malloc_tiny_fast() becomes thin wrapper
- Foundation for eliminating duplicate lookups
Patch 2: Update tiny_alloc_gate_fast() to call *_for_class
- Pass class_idx computed in gate to malloc_tiny_fast_for_class()
- Eliminates second hak_tiny_size_to_class() call
- Impact: +1-2% expected from reduced instruction count
Patch 3: Reposition DUALHOT branch (C0-C3 only)
- Move class_idx <= 3 check outside alloc_dualhot_enabled()
- C4-C7 no longer evaluate ENV gate (even when OFF)
- Impact: Maintains neutral performance on default path
Patch 4: Probe window for ENV gate
- Tolerate early putenv() before probe window exhausted (64 calls)
- Maintains correctness for bench_profile setenv timing
A/B Results (DUALHOT=0 vs DUALHOT=1):
- Mixed median: 48.75M → 48.62M ops/s (-0.27%, neutral within variance)
- C6-heavy median: 23.24M → 23.63M ops/s (+1.68%, SSOT benefit)
Decision: ADOPT with DUALHOT default OFF (research feature)
- SSOT provides structural improvement
- No regression on default configuration
- C6-heavy shows SSOT effectiveness (+1.68%)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 06:50:39 +09:00
c7facced06
Optimization Roadmap: mimalloc Gap Analysis & Phase 1-3 Plan
...
Add comprehensive mimalloc vs hakmem performance gap analysis (2.5x).
Gap sources (ranked by ROI):
1. Observation tax (stats macros): +2-3% overhead
2. Policy snapshot: +10-15% overhead (per-call TLS read + atomic sync)
3. Header management: +5-10% overhead (1-byte per block)
4. Wrapper layer: +5-10% overhead (LD_PRELOAD interception)
5. Routing switch: +3-5% overhead (5-way switch)
Optimization roadmap:
- Phase 1 (Quick Wins): +4-7% via FREE adoption + compile-out stats + inline
- Phase 2 (Structural): +5-10% via header tax removal + C0-C3 path + jump table
- Phase 3 (Cache): +12-22% via prefetch + cache optimization + static routing
Expected outcome: 52-68M ops/s (vs current 50.7M, gap from 2.5x → 1.9x)
Architectural reality: hakmem's 4-5 layer design adds 50-100x instruction
overhead vs mimalloc's 1-layer design. Gap closure caps at ~1.9x without
fundamental redesign.
Next immediate step: Implement Phase 1A (FREE adoption + compile-out stats)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-13 05:37:54 +09:00
d9991f39ff
Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
...
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation
Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions
Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration
Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-13 05:35:46 +09:00
b917357034
Update CURRENT_TASK: FREE DUALHOT confirmed +13%, ALLOC frozen as research box
2025-12-13 05:11:09 +09:00
b2724e6f5d
Phase ALLOC-TINY-FAST-DUALHOT-1: WIP (regression), FREE DUALHOT confirmed +13%
...
**ALLOC-TINY-FAST-DUALHOT-1** (this phase):
- Implementation: malloc_tiny_fast() C0-C3 early-exit with policy snapshot skip
- ENV: HAKMEM_TINY_ALLOC_DUALHOT=0/1 (default OFF)
- A/B Result: -1.17% median regression (Mixed, 10-run)
- Root Cause: Branch prediction penalty on C4-C7 outweighs policy skip benefit
- Decision: Freeze as research box (default OFF)
- Difference from FREE: ALLOC requires structural changes (per-class paths)
**FREE-TINY-FAST-DUALHOT-1** (verified):
- A/B Confirmation: +13.00% improvement (42.08M → 47.81M ops/s, Mixed, 10-run)
- Success Criteria: +2% target ACHIEVED
- Health Check: PASS (verify_health_profiles.sh, ENV OFF/ON)
- Safety: HAKMEM_TINY_LARSON_FIX guard in place
- Decision: Promotion to MIXED_TINYV3_C7_SAFE profile candidate
**Next Steps**:
- Profile adoption of FREE DUALHOT for MIXED workload
- No further deep-dive on ALLOC optimization (deferred to future phases)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-13 05:10:45 +09:00
0a7400d7d3
Phase ALLOC-TINY-FAST-DUALHOT-1: C0-C3 alloc direct path (WIP, -2% regression)
...
Add C0-C3 early-exit optimization to malloc_tiny_fast() similar to
FREE-TINY-FAST-DUALHOT-1. Skip policy snapshot for C0-C3 classes.
A/B Result (10-run, Mixed TINYV3_C7_SAFE):
- Baseline: 47.27M ops/s (median)
- Optimized: 46.10M ops/s (median)
- Result: -2.00% (regression, needs investigation)
ENV: HAKMEM_TINY_ALLOC_DUALHOT=0/1 (default OFF)
Implementation:
- core/front/malloc_tiny_fast.h: alloc_dualhot_enabled() + early-exit
- Design: docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md
Status: Research box (default OFF), needs root cause analysis
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-13 04:28:52 +09:00
2b567ac070
Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path
...
Treat C0-C3 classes (48% of calls) as "second hot path" instead of
cold path. Skip expensive policy snapshot and route determination,
direct to tiny_legacy_fallback_free_base().
Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed C0-C3 is NOT
rare (48.43% of all frees). Previous attempt to optimize via hot/cold
split failed (-13% regression) because noinline + function call on 48%
of workload hurt more than it helped.
This phase applies correct optimization: direct inline path for
frequent C0-C3 without policy snapshot overhead.
Implementation:
- Insert C0-C3 early-exit after C7 ULTRA check
- Skip tiny_front_v3_snapshot_get() for C0-C3 (saves 5-10 cycles)
- Skip route determination logic
- Safety: HAKMEM_TINY_LARSON_FIX=1 disables optimization
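Sketch of the early-exit insertion point; tiny_legacy_fallback_free_base() is named in this commit, while the other helpers, the ENV gate reader, and all signatures are assumptions:

```c
#include <stdbool.h>

extern bool free_c7_ultra_try(void* ptr);                              /* assumed C7 ULTRA check */
extern void tiny_legacy_fallback_free_base(void* ptr, unsigned cls);   /* named in this commit */
extern void free_tiny_full_route(void* ptr, unsigned cls);             /* assumed general path */
extern bool free_dualhot_enabled(void);                                /* assumed ENV gate reader (default OFF) */

static inline void free_tiny_fast_sketch(void* ptr, unsigned class_idx) {
    if (class_idx == 7 && free_c7_ultra_try(ptr))
        return;                                          /* existing C7 ULTRA early exit */
    if (class_idx <= 3 && free_dualhot_enabled()) {
        tiny_legacy_fallback_free_base(ptr, class_idx);  /* skip policy snapshot + route determination */
        return;
    }
    free_tiny_full_route(ptr, class_idx);                /* other classes: full routing */
}
```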
Benchmark Results (100M ops, 400 threads, MIXED_TINYV3_C7_SAFE):
- Baseline (optimization OFF): 44.50M ops/s (median)
- Optimized (DUALHOT ON): 48.74M ops/s (median)
- Improvement: +9.51% (+4.23M ops/s)
Perf Stats (optimized):
- Branch misses: 112.8M
- Cycles: 8.89B
- Instructions: 21.95B (2.47 IPC)
- Cache misses: 656K
Status: GO (significant improvement, no regression)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 03:46:36 +09:00
c503b212a3
Phase FREE-TINY-FAST-HOTCOLD-OPT-1: Hot/Cold split for free_tiny_fast [RESEARCH BOX - FREEZE]
...
Split free_tiny_fast() into hot and cold paths to reduce I-cache pressure:
- free_tiny_fast_hot(): always_inline, fast-path validation + ULTRA/MID/V7
- free_tiny_fast_cold(): noinline,cold, cross-thread + TinyHeap + legacy
ENV: HAKMEM_FREE_TINY_FAST_HOTCOLD=0/1 (default 0)
Stats: HAKMEM_FREE_TINY_FAST_HOTCOLD_STATS=0/1 (TLS only, exit dump)
## Benchmark Results (random mixed, 100M ops)
HOTCOLD=0 (legacy): 49.35M, 50.18M, 50.25M ops/s (median: 50.18M)
HOTCOLD=1 (split): 43.54M, 43.59M, 43.62M ops/s (median: 43.59M)
**Regression: -13.1%** (NO-GO)
## Stats Analysis (10M ops, HOTCOLD_STATS=1)
Hot path: 50.11% (C7 ULTRA early-exit)
Cold path: 48.43% (legacy fallback)
## Root Cause
Design assumption FAILED: "Cold path is rare"
Reality: Cold path is 48% (almost as common as hot path)
The split introduces:
1. Extra dispatch overhead in hot path
2. Function call overhead to cold for ~48% of frees
3. "Cold" is NOT rare - it's the legacy fallback for non-ULTRA classes
## Conclusion
**FREEZE as research box (default OFF)**
Box Theory value:
- Validated hot/cold distribution via TLS stats
- Confirmed that legacy fallback is NOT rare (48%)
- Demonstrated that naive hot/cold split hurts when "cold" is common
Alternative approaches for future work:
1. Inline the legacy fallback in hot path (no split)
2. Route-specific specialization (C7 vs non-C7 separate paths)
3. Policy-based early routing (before header validation)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com >
2025-12-13 03:16:54 +09:00
4e7870469c
POOL-MID-DN-BATCH: Add hash-based TLS page map (O(1) lookup)
...
Replace linear search (avg 16 iterations, -7.6% regression) with
open addressing hash table:
- Size: 64 slots (power-of-two)
- Collision: Linear probing, max 8 probes
- On probe limit: drain and retry (safe fallback)
- Hash function: Golden ratio with page-aligned shift
New ENV: HAKMEM_POOL_MID_INUSE_MAP_KIND=hash|linear (default: linear)
Implementation:
- Added hak_pool_mid_inuse_map_hash_enabled() ENV gate
- Extended MidInuseTlsPageMap with hash_pages[64], hash_counts[64], hash_used
- Added mid_inuse_hash_page() golden ratio hash function
- Added mid_inuse_dec_deferred_hash() O(1) insert with probing
- Updated mid_inuse_deferred_drain() to support hash mode
- Added decs_drained stats counter for batching metrics
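A sketch of the hash path under the parameters above (64 slots, max 8 probes, golden-ratio hash of the page-aligned address). Struct fields follow the commit text; the exact hash constant and page shift are assumptions:

```c
#include <stdint.h>

#define MID_MAP_SLOTS      64u
#define MID_MAP_MAX_PROBES 8u
#define PAGE_SHIFT         12u          /* assumed 4 KiB pages */

typedef struct {
    uintptr_t hash_pages[MID_MAP_SLOTS];   /* 0 = empty slot */
    uint32_t  hash_counts[MID_MAP_SLOTS];
    uint32_t  hash_used;
} mid_inuse_hash_map_t;

static inline uint32_t mid_inuse_hash_page(uintptr_t page) {
    /* Golden-ratio multiply over the page-aligned address bits. */
    return (uint32_t)(((page >> PAGE_SHIFT) * 0x9E3779B97F4A7C15ull) >> 32)
           & (MID_MAP_SLOTS - 1u);
}

/* Returns 1 on success, 0 when the probe limit is hit (caller drains and retries). */
static int mid_inuse_dec_deferred_hash_sketch(mid_inuse_hash_map_t* m, uintptr_t page) {
    const uint32_t idx = mid_inuse_hash_page(page);
    for (uint32_t probe = 0; probe < MID_MAP_MAX_PROBES; probe++) {
        const uint32_t slot = (idx + probe) & (MID_MAP_SLOTS - 1u);
        if (m->hash_pages[slot] == page) { m->hash_counts[slot]++; return 1; }
        if (m->hash_pages[slot] == 0) {                  /* empty slot: claim it */
            m->hash_pages[slot]  = page;
            m->hash_counts[slot] = 1;
            m->hash_used++;
            return 1;
        }
    }
    return 0;   /* probe limit reached: drain the map, then retry */
}
```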
Benchmark Results (10 runs each, bench_mid_large_mt_hakmem):
Baseline (DEFERRED=0): median=9,250,340 ops/s
Linear mode: median=8,159,240 ops/s (-11.80%)
Hash mode: median=8,262,982 ops/s (-10.67%)
Hash vs Linear: +1.27% improvement (eliminates linear search overhead)
Note: Both deferred modes still show regression vs baseline due to
other factors (TLS access overhead, drain cost). Hash mode successfully
eliminates the linear search penalty as designed.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-13 00:28:03 +09:00
6c849fd020
POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead
...
Root cause: Linear search in 32-entry TLS map averaged 16 iterations,
causing instruction overhead that exceeded mid_desc_lookup savings.
Fix implemented:
- Added last_idx field to MidInuseTlsPageMap for temporal locality
- Check last_idx before linear search (O(1) fast path)
- Update last_idx on hits and new entries
- Reset last_idx on drain
Changes:
1. pool_mid_inuse_tls_pagemap_box.h:
- Added uint32_t last_idx field to struct
2. pool_mid_inuse_deferred_box.h:
- Check last_idx before linear search (lines 90-94)
- Update last_idx on linear search hit (line 101)
- Set last_idx on new entry insert (line 117)
- Reset last_idx on drain (line 166)
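A sketch of the last_idx fast path over the 32-entry TLS map; field names follow the commit text, while the find/insert helper itself is hypothetical:

```c
#include <stdint.h>

#define MID_MAP_ENTRIES 32u
#define MID_MAP_FULL    UINT32_MAX

typedef struct {
    uintptr_t pages[MID_MAP_ENTRIES];
    uint32_t  counts[MID_MAP_ENTRIES];
    uint32_t  used;
    uint32_t  last_idx;                 /* index of the most recent hit / insert */
} mid_inuse_tls_page_map_t;

static uint32_t mid_map_find_or_insert(mid_inuse_tls_page_map_t* m, uintptr_t page) {
    const uint32_t li = m->last_idx;
    if (li < m->used && m->pages[li] == page)   /* O(1) temporal-locality fast path */
        return li;
    for (uint32_t i = 0; i < m->used; i++) {    /* fallback: linear scan */
        if (m->pages[i] == page) { m->last_idx = i; return i; }
    }
    if (m->used >= MID_MAP_ENTRIES)
        return MID_MAP_FULL;                    /* map full: caller drains, then retries */
    const uint32_t i = m->used++;               /* insert new entry */
    m->pages[i]  = page;
    m->counts[i] = 0;
    m->last_idx  = i;                           /* remember the new entry */
    return i;
}
```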
Benchmark results (bench_mid_large_mt_hakmem):
- Baseline (DEFERRED=0): median 9.08M ops/s, variance 300B
- Deferred with cache (DEFERRED=1): median 8.38M ops/s, variance 207B
- Performance: -7.6% regression (vs expected +2-4% gain)
- Stability: -31% variance (improvement as expected)
Analysis:
The last-match cache reduces variance but does not eliminate the
regression for this benchmark's random access pattern (2048 slots,
many pages). The temporal locality assumption (60-80% hit rate) is
not met by bench_mid_large_mt's allocation pattern.
Further optimization needed:
- Consider hash-based lookup for better than O(n) search
- OR reduce map size to decrease search iterations
- OR add drain triggers at better boundaries
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com >
2025-12-13 00:04:41 +09:00