Commit Graph

9 Commits

Author SHA1 Message Date
2013514f7b Working state before pushing to cyu remote 2025-12-19 03:45:01 +09:00
89a9212700 Phase 83-1 + Allocator Comparison: Switch dispatch fixed (NO-GO +0.32%), PROFILE correction, SCORECARD update
Key changes:
- Phase 83-1: Switch dispatch fixed mode (tiny_inline_slots_switch_dispatch_fixed_box) - NO-GO (marginal +0.32%, branch reduction negligible)
  Reason: lazy-init pattern already optimal, Phase 78-1 pattern shows diminishing returns

- Allocator comparison baseline update (10-run SSOT, WS=400, ITERS=20M):
  tcmalloc: 115.26M (92.33% of mimalloc)
  jemalloc: 97.39M (77.96% of mimalloc)
  system: 85.20M (68.24% of mimalloc)
  mimalloc: 124.82M (baseline)

- hakmem PROFILE correction: scripts/run_mixed_10_cleanenv.sh + run_allocator_quick_matrix.sh
  PROFILE explicitly set to MIXED_TINYV3_C7_SAFE for hakmem measurements
  Result: baseline stabilized to 55.53M (44.46% of mimalloc)
  Previous unstable measurement (35.57M) was due to profile leak

- Documentation:
  * PERFORMANCE_TARGETS_SCORECARD.md: Reference allocators + M1/M2 milestone status
  * PHASE83_1_SWITCH_DISPATCH_FIXED_RESULTS.md: Phase 83-1 analysis (NO-GO)
  * ALLOCATOR_COMPARISON_QUICK_RUNBOOK.md: Quick comparison procedure
  * ALLOCATOR_COMPARISON_SSOT.md: Detailed SSOT methodology

- M2 milestone status: 44.46% (target 55%, gap -10.54pp) - structural improvements needed

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-18 18:50:00 +09:00
043d34ad5a Phase 75-2: C5-only Inline Slots (P2) - GO (+1.10%)
Extends Phase 75-1 pattern to C5 class (28.5% of C4-C7 ops):
- Created 4 new boxes: env_box, tls_box, fast_path_api, TLS variable
- Integration: 2 minimal boundary points (alloc/free for C5)
- Test strategy: C5-only isolation (baseline C5=OFF+C6=ON, treatment C5=ON+C6=ON)
- Default OFF: zero overhead when disabled

Results (10-run Mixed SSOT, WS=400, C6 already enabled):
- Baseline (C5=OFF, C6=ON): 44.26 M ops/s (σ=0.37)
- Treatment (C5=ON, C6=ON): 44.74 M ops/s (σ=0.54)
- Delta: +0.49 M ops/s (+1.10%)

Status:  GO - C5 individual contribution confirmed
Cumulative since Phase 75-0: +2.87% (C6) + 1.10% (C5) = potential +3.97% combined
Next: Phase 75-3 (test C5+C6 interaction + non-additivity + promote to preset default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 08:39:48 +09:00
0009ce13b3 Phase 75-1: C6-only Inline Slots (P2) - GO (+2.87%)
Modular implementation of hot-class inline slots optimization:
- Created 5 new boxes: env_box, tls_box, fast_path_api, integration_box, test_script
- Single decision point at TLS init (ENV gate: HAKMEM_TINY_C6_INLINE_SLOTS=0/1)
- Integration: 2 minimal boundary points (alloc/free paths for C6 class)
- Default OFF: zero overhead when disabled (full backward compatibility)

Results (10-run Mixed SSOT, WS=400):
- Baseline (C6 inline OFF):  44.24 M ops/s
- Treatment (C6 inline ON):  45.51 M ops/s
- Delta: +1.27 M ops/s (+2.87%)

Status:  GO - Strong improvement via C6 ring buffer fast-path
Mechanism: Branch elimination on unified_cache_push/pop for C6 allocations
Next: Phase 75-2 (add C5 inline slots, target 85% C4-C7 coverage)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 08:22:09 +09:00
65f982aeec Phase 74-3: P0 (FASTAPI) - Free-path branch reduction optimization
Implements FASTAPI pattern to move enabled/init/stats checks outside hot loop:
- Added unified_cache_push_fast() fast-path API (no precondition checks)
- Modified tiny_legacy_fallback_free_base_with_env() to use FASTAPI with ENV gate
- Single boundary point: fallback to slow path if FULL or FASTAPI disabled
- ENV gate: HAKMEM_TINY_UC_FASTAPI=0/1 (default 0, research box)

Results (10-run Mixed SSOT, WS=400):
- Throughput: +0.32% (NEUTRAL, below +1.0% GO threshold)
- cache-misses: -16.31% (positive signal, but throughput gains insufficient)

Status: NEUTRAL, P0 FROZEN (insufficient ROI for structural change)
Next: Phase 74-4 (P2: Hot-class Inline Slots) pending per-class analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 08:07:46 +09:00
e1a4561992 Phase 19-3b: pass down env snapshot in hot paths 2025-12-15 12:50:16 +09:00
88717a8737 Phase 4 E1: ENV Snapshot Consolidation - GO (+3.92% avg, +4.01% median)
Target: Consolidate 3 ENV gate TLS reads → 1 TLS read
- tiny_c7_ultra_enabled_env():    1.28% self
- tiny_front_v3_enabled():        1.01% self
- tiny_metadata_cache_enabled():  0.97% self
- Total overhead: 3.26% self (perf profile analysis)

Implementation:
- core/box/hakmem_env_snapshot_box.h (new): ENV snapshot struct & API
- core/box/hakmem_env_snapshot_box.c (new): TLS snapshot implementation
- core/front/malloc_tiny_fast.h: Migrated 5 call sites to snapshot
- core/box/tiny_legacy_fallback_box.h: Migrated 2 call sites
- core/box/tiny_metadata_cache_hot_box.h: Migrated 1 call site
- core/bench_profile.h: Added hakmem_env_snapshot_refresh_from_env()
- Makefile: Added hakmem_env_snapshot_box.o to build
- ENV gate: HAKMEM_ENV_SNAPSHOT=0/1 (default: 0, research box)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (E1=0): 43,617,549 ops/s (avg), 43,562,895 ops/s (median)
- Optimized (E1=1): 45,327,239 ops/s (avg), 45,309,218 ops/s (median)
- Improvement: avg +3.92%, median +4.01%

Decision: GO (+3.92% >= +2.5% threshold)
- Action: Keep as research box (default OFF) for Phase 4
- Next: Consider promotion to default in MIXED_TINYV3_C7_SAFE preset

Design Rationale:
- Shape optimizations (B3, D3) reached saturation (+0.56% NEUTRAL)
- Shift to memory/TLS overhead optimization (new optimization frontier)
- Pattern: Similar to existing tiny_front_v3_snapshot (proven approach)
- Expected: +1-3% from 3.26% ENV overhead → Achieved: +3.92%

Technical Details:
- Consolidation: 3 TLS reads → 1 TLS read (66% reduction)
- Learner interlock: tiny_metadata_cache_eff pre-computed in snapshot
- Version sync: Refreshes on small_policy_v7_version_changed()
- Fallback safety: Existing ENV gates still available when E1=0

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-14 00:59:12 +09:00
deecda7336 Phase 3 C2: Slab Metadata Cache Optimization (3 patches) - NEUTRAL
Patch 1: Policy Hot Cache
- Add TinyPolicyHot struct (route_kind[8] cached in TLS)
- Eliminate policy_snapshot() calls (~2 memory ops saved)
- Safety: disabled when learner v7 active
- Files: tiny_metadata_cache_env_box.h, tiny_metadata_cache_hot_box.{h,c}
- Integration: malloc_tiny_fast.h route selection

Patch 2: First Page Inline Cache
- Cache current slab page pointer in TLS per-class
- Avoid superslab metadata lookup (1-2 memory ops)
- Fast-path in tiny_legacy_fallback_free_base()
- Files: tiny_first_page_cache.h, tiny_unified_cache.c
- Integration: tiny_legacy_fallback_box.h

Patch 3: Bounds Check Compile-out
- Hardcode unified_cache capacity as MACRO constant
- Eliminate modulo operation (constant fold)
- Macros: TINY_UNIFIED_CACHE_CAPACITY_POW2=11, CAPACITY=2048, MASK=2047
- File: tiny_unified_cache.h

A/B Test Results (Mixed, 10-run):
- Baseline (C2=0): 40.43M ops/s (avg), 40.72M ops/s (median)
- Optimized (C2=1): 40.25M ops/s (avg), 40.29M ops/s (median)
- Improvement: -0.45% (avg), -1.06% (median)
- DECISION: NEUTRAL (within ±1.0% threshold)
- Action: Keep as research box (ENV gate OFF by default)

Cumulative Gain (Phase 2-3):
- B3 (Routing shape): +2.89%
- B4 (Wrapper split): +1.47%
- C3 (Static routing): +2.20%
- C2 (Metadata cache): -0.45%
- Total: ~6.1% (from baseline 37.5M → 39.8M ops/s)

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 19:19:42 +09:00
0752688785 Phase REFACTOR-2: Legacy Fallback Logic Unification
Consolidate duplicated legacy free logic into a single reusable function.
Previously, hak_tiny_free_legacy_inline() and hak_tiny_free_legacy_impl()
contained identical implementations in malloc_tiny_fast.h and
tiny_c6_ultra_free_box.c.

Changes:
- NEW: core/box/tiny_legacy_fallback_box.h
  - tiny_legacy_fallback_free_base(): Unified legacy free implementation
  - Encapsulates: Unified Cache push + per-class stats + final fallback
  - Contract: BASE pointer input (already extracted from USER ptr)

- Modified: core/front/malloc_tiny_fast.h
  - Removed: hak_tiny_free_legacy_inline() (lines 96-111)
  - Replaced call: hak_tiny_free_legacy_inline → tiny_legacy_fallback_free_base

- Modified: core/box/tiny_c6_ultra_free_box.c
  - Removed: hak_tiny_free_legacy_impl() (lines 17-39)
  - Replaced call: hak_tiny_free_legacy_impl → tiny_legacy_fallback_free_base

Benefits:
- Single source of truth (DRY principle)
- Easier to maintain and test
- Consistent behavior across all free paths
- No performance impact (always_inline preserved)

No semantic changes - identical logic, just centralized.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:01:59 +09:00