Commit Graph

67 Commits

Author SHA1 Message Date
ea221d057a Phase 6: promote Front FastLane (default ON) 2025-12-14 16:28:23 +09:00
4124c86d99 Phase 5: freeze E5-4 malloc tiny direct (neutral) 2025-12-14 06:59:35 +09:00
f7b18aaf13 Phase 5 E5-2: Header Write-Once (NEUTRAL, FROZEN)
Target: tiny_region_id_write_header (3.35% self%)
- Hypothesis: Headers redundant for reused blocks
- Strategy: Write headers ONCE at refill boundary, skip in hot alloc

Implementation:
- ENV gate: HAKMEM_TINY_HEADER_WRITE_ONCE=0/1 (default 0)
- core/box/tiny_header_write_once_env_box.h: ENV gate
- core/box/tiny_header_write_once_stats_box.h: Stats counters
- core/box/tiny_header_box.h: Added tiny_header_finalize_alloc()
- core/front/tiny_unified_cache.c: Prefill at 3 refill sites
- core/box/tiny_front_hot_box.h: Use finalize function

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (WRITE_ONCE=0): 44.22M ops/s (mean), 44.53M ops/s (median)
- Optimized (WRITE_ONCE=1): 44.42M ops/s (mean), 44.36M ops/s (median)
- Improvement: +0.45% mean, -0.38% median

Decision: NEUTRAL (within ±1.0% threshold)
- Action: FREEZE as research box (default OFF, do not promote)

Root Cause Analysis:
- Header writes are NOT redundant - existing code writes only when needed
- Branch overhead (~4 cycles) cancels savings (~3-5 cycles)
- perf self% ≠ optimization ROI (3.35% target → +0.45% gain)

Key Lessons:
1. Verify assumptions before optimizing (inspect code paths)
2. Hot spot self% measures time IN function, not savings from REMOVING it
3. Branch overhead matters (even "simple" checks add cycles)

Positive Outcome:
- StdDev reduced 50% (0.96M → 0.48M) - more stable performance

Health Check: PASS (all profiles)

Next Candidates:
- free_tiny_fast_cold: 7.14% self%
- unified_cache_push: 3.39% self%
- hakmem_env_snapshot_enabled: 2.97% self%

Deliverables:
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_DESIGN.md
- docs/analysis/PHASE5_E5_2_HEADER_REFILL_ONCE_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-2 complete, FROZEN)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 06:22:25 +09:00
8875132134 Phase 5 E5-1: Free Tiny Direct Path (+3.35% GO)
Target: Consolidate free() wrapper overhead (29.56% combined)
- free() wrapper: 21.67% self%
- free_tiny_fast_cold(): 7.89% self%

Strategy: Single header check in wrapper → direct call to free_tiny_fast()
- Eliminates redundant header validation (validated twice before)
- Bypasses cold path routing for Tiny allocations
- High coverage: 48% of frees in Mixed workload are Tiny

Implementation:
- ENV gate: HAKMEM_FREE_TINY_DIRECT=0/1 (default 0)
- core/box/free_tiny_direct_env_box.h: ENV gate
- core/box/free_tiny_direct_stats_box.h: Stats counters
- core/box/hak_wrappers.inc.h: Wrapper integration (lines 593-625)

Safety gates:
- Page boundary guard ((ptr & 0xFFF) != 0)
- Tiny magic validation ((header & 0xF0) == 0xA0)
- Class bounds check (class_idx < 8)
- Fail-fast fallback to existing paths

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (DIRECT=0): 44.38M ops/s (mean), 44.45M ops/s (median)
- Optimized (DIRECT=1): 45.87M ops/s (mean), 45.95M ops/s (median)
- Improvement: +3.35% mean, +3.36% median

Decision: GO (+3.35% >= +1.0% threshold)
- 3rd consecutive success with consolidation/deduplication pattern
- E4-1: +3.51%, E4-2: +21.83%, E5-1: +3.35%
- Health check: PASS (all profiles)

Phase 5 Cumulative:
- E4 Combined: +6.43%
- E5-1: +3.35%
- Estimated total: ~+10%

Deliverables:
- docs/analysis/PHASE5_E5_COMPREHENSIVE_ANALYSIS.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_DESIGN.md
- docs/analysis/PHASE5_E5_1_FREE_TINY_DIRECT_1_AB_TEST_RESULTS.md
- CURRENT_TASK.md (E5-1 complete)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:52:32 +09:00
5528612f2a Phase 5 E4-2: Malloc Wrapper ENV Snapshot (+21.83% GO, ADOPTED)
Target: Consolidate malloc wrapper TLS reads + eliminate function calls
- malloc (16.13%) + tiny_alloc_gate_fast (19.50%) = 35.63% combined
- Strategy: E4-1 success pattern + function call elimination

Implementation:
- ENV gate: HAKMEM_MALLOC_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/malloc_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates multiple TLS reads → 1 TLS read
  - Pre-caches tiny_max_size() == 256 (eliminates function call)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in malloc() wrapper
- Makefile: Add malloc_wrapper_env_snapshot_box.o to all targets

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 35.74M ops/s (mean), 35.75M ops/s (median)
- Optimized (SNAPSHOT=1): 43.54M ops/s (mean), 43.92M ops/s (median)
- Improvement: +21.83% mean, +22.86% median (+7.80M ops/s)

Decision: GO (+21.83% >> +1.0% threshold, 21.8x over)
- Why 6.2x better than E4-1 (+3.51%)?
  - Higher malloc call frequency (allocation-heavy workload)
  - Function call elimination (tiny_max_size pre-cached)
  - Larger target: 35.63% vs free's 25.26%
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative (estimated):
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- E4-2 (Malloc Wrapper Snapshot): +21.83%
- Estimated combined: ~+30% (needs validation)

Next Steps:
- Combined A/B test (E4-1 + E4-2 simultaneously)
- Measure actual cumulative effect
- Profile new baseline for next optimization targets

Deliverables:
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_DESIGN.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_1_AB_TEST_RESULTS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_COMBINED_AB_TEST_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-2 added)
- CURRENT_TASK.md (E4-2 complete)
- core/bench_profile.h (E4-2 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 05:13:29 +09:00
4a070d8a14 Phase 5 E4-1: Free Wrapper ENV Snapshot (+3.51% GO, ADOPTED)
Target: Consolidate free wrapper TLS reads (2→1)
- free() is 25.26% self% (top hot spot)
- Strategy: Apply E1 success pattern (ENV snapshot) to free path

Implementation:
- ENV gate: HAKMEM_FREE_WRAPPER_ENV_SNAPSHOT=0/1 (default 0)
- core/box/free_wrapper_env_snapshot_box.{h,c}: New box
  - Consolidates 2 TLS reads → 1 TLS read (50% reduction)
  - Reduces 4 branches → 3 branches (25% reduction)
  - Lazy init with probe window (bench_profile putenv sync)
- core/box/hak_wrappers.inc.h: Integration in free() wrapper
- Makefile: Add free_wrapper_env_snapshot_box.o to all targets

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (SNAPSHOT=0): 45.35M ops/s (mean), 45.31M ops/s (median)
- Optimized (SNAPSHOT=1): 46.94M ops/s (mean), 47.15M ops/s (median)
- Improvement: +3.51% mean, +4.07% median

Decision: GO (+3.51% >= +1.0% threshold)
- Exceeded conservative estimate (+1.5% → +3.51%)
- Similar efficiency to E1 (+3.92%)
- Health check: PASS (all profiles)
- Action: PROMOTED to MIXED_TINYV3_C7_SAFE preset

Phase 5 Cumulative:
- E1 (ENV Snapshot): +3.92%
- E4-1 (Free Wrapper Snapshot): +3.51%
- Total Phase 4-5: ~+7.5%

E3-4 Correction:
- Phase 4 E3-4 (ENV Constructor Init): NO-GO / FROZEN
- Initial A/B showed +4.75%, but investigation revealed:
  - Branch prediction hint mismatch (UNLIKELY with always-true)
  - Retest confirmed -1.78% regression
  - Root cause: __builtin_expect(..., 0) with ctor_mode==1
- Decision: Freeze as research box (default OFF)
- Learning: Branch hints need careful tuning, TLS consolidation safer

Deliverables:
- docs/analysis/PHASE5_E4_FREE_GATE_OPTIMIZATION_1_DESIGN.md
- docs/analysis/PHASE5_E4_1_FREE_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md
- docs/analysis/PHASE5_E4_2_MALLOC_WRAPPER_ENV_SNAPSHOT_NEXT_INSTRUCTIONS.md (next)
- docs/analysis/PHASE5_POST_E1_NEXT_INSTRUCTIONS.md
- docs/analysis/ENV_PROFILE_PRESETS.md (E4-1 added, E3-4 corrected)
- CURRENT_TASK.md (E4-1 complete, E3-4 frozen)
- core/bench_profile.h (E4-1 promoted to default)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-14 04:24:34 +09:00
7f3ff6c7e6 Phase 4: E1 docs + E2 next instructions 2025-12-14 01:46:18 +09:00
11b0e3f32b Phase 4 D3: alloc gate shape (env-gated) 2025-12-14 00:26:57 +09:00
19056282b6 Phase 3 D2: Wrapper Env Cache - [DECISION: NO-GO]
Target: Reduce wrapper_env_cfg() overhead in malloc/free hot path
- Strategy: Cache wrapper env configuration pointer in TLS
- Approach: Fast pointer cache (TLS caches const wrapper_env_cfg_t*)

Implementation:
- core/box/wrapper_env_cache_env_box.h: ENV gate (HAKMEM_WRAP_ENV_CACHE)
- core/box/wrapper_env_cache_box.h: TLS cache layer (wrapper_env_cfg_fast)
- core/box/hak_wrappers.inc.h: Integration into malloc/free hot paths
- ENV gate: HAKMEM_WRAP_ENV_CACHE=0/1 (default OFF)

A/B Test Results (Mixed, 10-run, 20M iters):
- Baseline (D2=0): 46.52M ops/s (avg), 46.47M ops/s (median)
- Optimized (D2=1): 45.85M ops/s (avg), 45.98M ops/s (median)
- Improvement: avg -1.44%, median -1.05% (DECISION: NO-GO)

Analysis:
- Regression cause: TLS cache adds overhead (branch + TLS access)
- wrapper_env_cfg() is already minimal (pointer return after simple check)
- Adding TLS caching layer makes it worse, not better
- Branch prediction penalty outweighs any potential savings

Cumulative Phase 2-3:
- B3: +2.89%, B4: +1.47%, C3: +2.20%
- D1: +1.06% (opt-in), D2: -1.44% (NO-GO)
- Total: ~7.2% (excluding D2)

Decision: FREEZE as research box (default OFF, regression confirmed)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-13 22:03:27 +09:00
e95e61f0ff Phase POLICY-FAST-PATH-V2 complete + MID-V35-HOTPATH-OPT-1 design
## Phase POLICY-FAST-PATH-V2 (FROZEN)
- Implementation complete: free_policy_fast_v2_box.h + malloc_tiny_fast.h integration
- A/B Results:
  - Mixed (ws=400): -1.6% regression  (branch cost > skip benefit)
  - C6-heavy (ws=200): +5.4% improvement 
- Decision: Default OFF, FROZEN (ws<300 / C6-heavy research only)
- Learning: Large WS causes branch misprediction to dominate

## Phase 3-GRADUATE + ENV probe fix
- 64-probe retry for getenv() stability during bench_profile putenv()
- C6 ULTRA intrusive freelist: FROZEN (research box)

## Phase MID-V35-HOTPATH-OPT-1-DESIGN
- Design doc for next optimization target
- Target: MID v3.5 alloc/free hot path (C5-C6)
- Boxes: Stats Gate, TLS Layout, Boundary Check elimination
- Expected: +3-9% on Mixed mainline

Files:
- core/box/free_policy_fast_v2_box.h (new)
- core/box/free_path_stats_box.h/c (policy_fast_v2_skip counter)
- core/front/malloc_tiny_fast.h (fast-path integration)
- docs/analysis/MID_V35_HOTPATH_OPT_1_DESIGN.md (new)
- docs/analysis/PHASE_3_GRADUATE_*.md (new)
- CURRENT_TASK.md (phase status update)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-12 18:40:08 +09:00
1a8652a91a Phase TLS-UNIFY-3: C6 intrusive freelist implementation (完成)
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00
d5ffb3eeb2 Fix MID v3.5 activation bugs: policy loop + malloc recursion
Two critical bugs fixed:

1. Policy snapshot infinite loop (smallobject_policy_v7.c):
   - Condition `g_policy_v7_version == 0` caused reinit on every call
   - Fixed via CAS to set global version to 1 after first init

2. Malloc recursion (smallobject_segment_mid_v3.c):
   - Internal malloc() routed back through hakmem → MID v3.5 → segment
     creation → malloc → infinite recursion / stack overflow
   - Fixed by using mmap() directly for internal allocations:
     - Segment struct, pages array, page metadata block

Performance results (bench_random_mixed 257-512B):
- Baseline (LEGACY): 34.0M ops/s
- MID_V35 ON (C6):   35.8M ops/s
- Improvement:       +5.1% ✓

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 07:12:24 +09:00
8143e8b797 Phase v7-4: Policy Box 導入 (L3 層の明確化とフロント芯の作り直し)
- SmallPolicyV7 Box: L3 Policy layer に配置、route 決定を一元化
- Route kind enum: SMALL_ROUTE_ULTRA / V7 / MID_V3 / LEGACY
- ENV priority (fixed): ULTRA > v7 > MID_v3 > LEGACY
- Frontend integration: v7 routing を Policy Box 経由に変更 (段階移行)
- Legacy compatibility: 既存の tiny_route_env_box.h は併用維持

Box Theory layer structure:
- L0: ULTRA (C4-C7, FROZEN)
- L1: SmallObject v7 (research box)
- L1': MID_v3 / LEGACY (fallback)
- L2: Segment / RegionId
- L3: Policy / Stats / Learner ← Policy Box added here

Frontend now follows clean "size→class→route_kind→switch" pattern.
ENV variables read once at Policy init, not scattered across frontend.

Future: ULTRA/MID_v3/LEGACY consolidation, Learner integration, flexible priority.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-12 03:50:58 +09:00
39a3c53dbc Phase v7-2: SmallObject v7 C6-only implementation with RegionIdBox integration
- SmallSegment_v7: 2MiB segment with TLS slot and free page stack
- ColdIface_v7: Page refill/retire between HotBox and SegmentBox
- HotBox_v7: Full C6-only alloc/free with header writing (HEADER_MAGIC|class_idx)
- Free path early-exit: Check v7 route BEFORE ss_fast_lookup (separate mmap segment)
- RegionIdBox: Register v7 segment for ptr->region lookup
- Benchmark: v7 ON ~54.5M ops/s (-7% overhead vs 58.6M legacy baseline)

v7 correctly balances alloc/free counts and page lifecycle.
RegionIdBox overhead identified as primary cost driver.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 03:12:28 +09:00
df216b6901 Phase V6-HDR-3: SmallSegmentV6 実割り当て & RegionIdBox Registration
実装内容:
1. SmallSegmentV6のmmap割り当ては既に v6-0で実装済み
2. small_heap_ctx_v6() で segment 取得時に region_id_register_v6_segment() 呼び出し
3. region_id_v6.c に TLS スコープのセグメント登録ロジック実装:
   - 4つの static __thread 変数でセグメント情報をキャッシュ
   - region_id_register_v6_segment(): セグメント base/end を TLS に記録
   - region_id_lookup_v6(): TLS segment の range check を最初に実行
   - TLS cache 更新で O(1) lookup 実現
4. region_id_v6_box.h に SmallSegmentV6 type include & function 宣言追加
5. small_v6_region_observe_validate() に region_id_observe_lookup() 呼び出し追加

効果:
- HeaderlessデザインでRegionIdBoxが正式にSMALL_V6分類を返せるように
- TLS-scopedな簡潔な登録メカニズム (マルチスレッド対応)
- Fast path: TLS segment range check -> page_meta lookup
- Fall back path: 従来の small_page_meta_v6_of() による動的検出
- Latency: O(1) TLS cache hit rate がv6 alloc/free の大部分をカバー

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 23:51:48 +09:00
0f15adae4e Phase ALLOC-GATE-OPT-1: tiny_alloc_gate_fast 統計計測
- AllocGateStats 構造体追加(size2class/route/env/class分布)
- malloc_tiny_fast にカウンタ埋め込み
- ENV: HAKMEM_ALLOC_GATE_STATS (default 0)
- 挙動変更なし(計測のみ)

計測結果:
- Mixed: total=542k, size2class=0, route_calls=0, env_checks=275k, C4-C7=95.2%
  - size_to_class/route_for_class は完全削減済み(LUT 効果)
  - C4-C7 が 95% → ULTRA fast path が有効
  - env_checks ≈ c7_calls → C7 ULTRA の ENV gate が毎回呼ばれる
- C6-heavy: total=11 → malloc_tiny_fast はほぼ通らない(mid/pool 主体)

結論:
- alloc gate は既に十分最適化済み(LUT + ULTRA で削減済み)
- さらなる最適化余地は小さい(env_checks は軽量化済み、数%以下の効果)
- 次フェーズでは free dispatcher (29%) や C7 ULTRA refill (7%) など、他のボトルネックを狙う

詳細: docs/analysis/ALLOC_GATE_ANALYSIS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 21:32:40 +09:00
118c0e4857 Phase FREE-DISPATCHER-OPT-1: free dispatcher 統計計測
**目的**: free dispatcher(29%)の内訳を細分化して計測。

**実装内容**:
- FreeDispatchStats 構造体追加(ENV: HAKMEM_FREE_DISPATCH_STATS, default 0)
- カウンタ: total_calls / domain (tiny/mid/large) / route (ultra/legacy/pool/v6) / env_checks / route_for_class_calls
- hak_free_at / tiny_route_for_class / tiny_route_snapshot_init にカウンタ埋め込み
- 挙動変更なし(計測のみ、ENV OFF 時は overhead ゼロ)

**計測結果**:

Mixed 16-1024B (1M iter, ws=400):
- total=8,081, route_calls=267,967, env_checks=9
- BENCH_FAST_FRONT により大半は早期リターン
- route_for_class は主に alloc 側で呼ばれる(267k calls vs 8k frees)
- ENV check は初期化時の 9回のみ(snapshot 効果)

C6-heavy (257-768B, 1M iter, ws=400):
- total=500,099, route_calls=1,034, env_checks=9
- fg_classify_domain に到達する free が多い
- route_for_class 呼び出しは極小(snapshot 効果)

**結論**:
- ENV check は既に十分最適化されている(初期化時のみ)
- route_for_class は alloc 側での呼び出しが主で、free 側は snapshot で O(1)
- 次フェーズ(OPT-2)では別のアプローチを検討

**ドキュメント追加**:
- docs/analysis/FREE_DISPATCHER_ANALYSIS.md(新規)
- CURRENT_TASK.md に Phase FREE-DISPATCHER-OPT-1 セクション追加

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-11 21:21:40 +09:00
fb88725a43 Phase FREE-LEGACY-OPT-6: C4 ULTRA Implementation
Implement C4 ULTRA free TLS cache with parasitic free+alloc pattern,
achieving 99.7-99.9% elimination of C4 legacy fallback calls.

Key Features:
- TLS cache cap=64 (tuned for L1 cache fit, smaller than C5/C6's 128)
- Segment learning via ss_fast_lookup() on first free
- Free-side cache push + alloc-side TLS pop pattern
- ENV gate: HAKMEM_TINY_C4_ULTRA_FREE_ENABLED (default OFF)
- Full FREE_PATH_STATS instrumentation

Benchmark Results:
C4-heavy (65-128B range):
  - C4 legacy: 591,583 → 1,711 (-99.7%)
  - c4_ultra cache hits: ~599k (free) + ~599k (alloc)
  - Mixed load: 340,732 → 284 C4 legacy (-99.9%)

Legacy fallback reduction:
  - C4-heavy: 589,872 fewer legacy calls (-10.9% total)
  - Mixed: 340,448 fewer C4 legacy calls (-12.8% in mixed)

Performance note: ~2% throughput cost in isolated C4-heavy case,
acceptable tradeoff for 99%+ legacy elimination per class.

Files:
  NEW: core/box/tiny_c4_ultra_free_box.h/c
  NEW: core/box/tiny_c4_ultra_free_env_box.h
  MOD: core/box/tiny_ultra_classes_box.h (added C4 macros)
  MOD: core/box/free_path_stats_box.h/c (C4 ULTRA counters)
  MOD: core/front/malloc_tiny_fast.h (C4 alloc+free integration)
  MOD: Makefile (added C4 ULTRA object)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:38:27 +09:00
ea6ed1a6e4 Phase FREE-LEGACY-OPT-5-1/5-2: C5 ULTRA free+alloc integration
Summary:
========
Implemented C5 ULTRA TLS cache pattern following the successful C6 ULTRA design:
- Phase 5-1: Free-side TLS cache + segment learning
- Phase 5-2: Alloc-side TLS pop for complete free+alloc cycle integration

Targets C5 class (129-256B) as next legacy reduction after C6 completion.

Key Changes:
============

1. NEW FILES:
   - core/box/tiny_c5_ultra_free_box.h: C5 ULTRA TLS cache structure
   - core/box/tiny_c5_ultra_free_box.c: C5 free path implementation (same pattern as C6)
   - core/box/tiny_c5_ultra_free_env_box.h: ENV gating (HAKMEM_TINY_C5_ULTRA_FREE_ENABLED)

2. MODIFIED FILES:
   - core/front/malloc_tiny_fast.h:
     * Added C5 ULTRA includes
     * Added C5 alloc-side TLS pop at lines 186-194 (integrated with C6)
     * Added C5 free path at lines 333-337 (integrated with C6)

   - core/box/tiny_ultra_classes_box.h:
     * Added TINY_CLASS_C5 constant
     * Added tiny_class_is_c5() macro
     * Extended tiny_class_is_ultra() to include C5

   - core/box/free_path_stats_box.h:
     * Added c5_ultra_free_fast counter
     * Added c5_ultra_alloc_hit counter

   - core/box/free_path_stats_box.c:
     * Updated stats dump to output C5 counters

   - Makefile:
     * Added core/box/tiny_c5_ultra_free_box.o to all object lists

3. Design Rationale:
   - Exact copy of C6 ULTRA pattern (proven effective)
   - TLS cache capacity: 128 blocks (same as C6 for consistency)
   - Segment learning on first C5 free via ss_fast_lookup()
   - Alloc-side pop integrated directly in malloc_tiny_fast.h hotpath
   - Legacy fallback unification via tiny_legacy_fallback_free_base()

4. Expected Impact:
   - C5 legacy calls: 68,871 → 0 (100% elimination)
   - Total legacy reduction: ~53% of remaining 129,623
   - Mixed workload: Minimal regression (C5 is smaller class, fewer allocations)

5. Stats Collection:
   Run with: HAKMEM_TINY_C5_ULTRA_FREE_ENABLED=1 HAKMEM_FREE_PATH_STATS=1 ./bench_allocators_hakmem

   Expected output:
   [FREE_PATH_STATS] ... c5_ultra_free=68871 c5_ultra_alloc=68871 ... legacy_fb=60752 ...
   [FREE_PATH_STATS_LEGACY_BY_CLASS] ... c5=0 ...

Status:
=======
- Code:  COMPLETE (3 new files + 5 modified files)
- Compilation:  Verified (no errors, only unused variable warnings unrelated to C5)
- Functionality: Ready to benchmark (ENV gating: default OFF, opt-in via ENV)

Phase Progression:
==================
 Phase 4-4: C6 ULTRA free+alloc (legacy C6: 137,319 → 0)
 Phase 5-1/5-2: C5 ULTRA free+alloc (legacy C5: 68,871 → 0 expected)
 Phase 4.5: C4 ULTRA (34,727 remaining)
📋 Future: C3/C2 ULTRA if beneficial

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:26:51 +09:00
7b7de53167 Phase FREE-FRONT-V3-1: Free route snapshot infrastructure + build fix
Summary:
========
Implemented Phase FREE-FRONT-V3 infrastructure to optimize free hotpath by:
1. Creating snapshot-based route decision table (consolidating route logic)
2. Removing redundant ENV checks from hot path
3. Preparing for future integration into hak_free_at()

Key Changes:
============

1. NEW FILES:
   - core/box/free_front_v3_env_box.h: Route snapshot definition & API
   - core/box/free_front_v3_env_box.c: Snapshot initialization & caching

2. Infrastructure Details:
   - FreeRouteSnapshotV3: Maps class_idx → free_route_kind for all 8 classes
   - Routes defined: LEGACY, TINY_V3, CORE_V6_C6, POOL_V1
   - ENV-gated initialization (HAKMEM_TINY_FREE_FRONT_V3_ENABLED, default OFF)
   - Per-thread TLS caching to avoid repeated ENV reads

3. Design Goals:
   - Consolidate tiny_route_for_class() results into snapshot table
   - Remove C7 ULTRA / v4 / v5 / v6 ENV checks from hot path
   - Limit lookup (ss_fast_lookup/slab_index_for) to paths that truly need it
   - Clear ownership boundary: front v3 handles routing, downstream handles free

4. Phase Plan:
   - v3-1  COMPLETE: Infrastructure (snapshot table, ENV initialization, TLS cache)
   - v3-2 (INFRASTRUCTURE ONLY): Placeholder integration in hak_free_api.inc.h
   - v3-3 (FUTURE): Full integration + benchmark A/B to measure hotpath improvement

5. BUILD FIX:
   - Added missing core/box/c7_meta_used_counter_box.o to OBJS_BASE in Makefile
   - This symbol was referenced but not linked, causing undefined reference errors
   - Benchmark targets now build cleanly without LTO

Status:
=======
- Build:  PASS (bench_allocators_hakmem builds without errors)
- Integration: Currently DISABLED (default OFF, ready for v3-2 phase)
- No performance impact: Infrastructure-only, hotpath unchanged

Future Work:
============
- Phase v3-2: Integrate snapshot routing into hak_free_at() main path
- Phase v3-3: Measure free hotpath performance improvement (target: 1-2% less branch mispredict)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:17:30 +09:00
c60199182e Phase v6-1/2/3/4: SmallObject Core v6 - C6-only implementation + refactor
Phase v6-1: C6-only route stub (v1/pool fallback)
Phase v6-2: Segment v6 + ColdIface v6 + Core v6 HotPath implementation
  - 2MiB segment / 64KiB page allocation
  - O(1) ptr→page_meta lookup with segment masking
  - C6-heavy A/B: SEGV-free but -44% performance (15.3M ops/s)

Phase v6-3: Thin-layer optimization (TLS ownership check + batch header + refill batching)
  - TLS ownership fast-path skip page_meta for 90%+ of frees
  - Batch header writes during refill (32 allocs = 1 header write)
  - TLS batch refill (1/32 refill frequency)
  - C6-heavy A/B: v6-2 15.3M → v6-3 27.1M ops/s (±0% vs baseline) 

Phase v6-4: Mixed hang fix (segment metadata lookup correction)
  - Root cause: metadata lookup was reading mmap region instead of TLS slot
  - Fix: use TLS slot descriptor with in_use validation
  - Mixed health: 5M iterations SEGV-free, 35.8M ops/s 

Phase v6-refactor: Code quality improvements (macro unification + inline + docs)
  - Add SMALL_V6_* prefix macros (header, pointer conversion, page index)
  - Extract inline validation functions (small_page_v6_valid, small_ptr_in_segment_v6)
  - Doxygen-style comments for all public functions
  - Result: 0 compiler warnings, maintained +1.2% performance

Files:
- core/box/smallobject_core_v6_box.h (new, type & API definitions)
- core/box/smallobject_cold_iface_v6.h (new, cold iface API)
- core/box/smallsegment_v6_box.h (new, segment type definitions)
- core/smallobject_core_v6.c (new, C6 alloc/free implementation)
- core/smallobject_cold_iface_v6.c (new, refill/retire logic)
- core/smallsegment_v6.c (new, segment allocator)
- docs/analysis/SMALLOBJECT_CORE_V6_DESIGN.md (new, design document)
- core/box/tiny_route_env_box.h (modified, v6 route added)
- core/front/malloc_tiny_fast.h (modified, v6 case in route switch)
- Makefile (modified, v6 objects added)
- CURRENT_TASK.md (modified, v6 status added)

Status:
- C6-heavy: v6 OFF 27.1M → v6-3 ON 27.1M ops/s (±0%) 
- Mixed: v6 ON 35.8M ops/s (C6-only, other classes via v1) 
- Build: 0 warnings, fully documented 

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 15:29:59 +09:00
2a13478dc7 Optimize C6 heavy and C7 ultra performance analysis with refined design refinements
- Update environment profile presets and visibility analysis
- Enhance small object and tiny segment v4 box implementations
- Refine C7 ultra and C6 heavy allocation strategies
- Add comprehensive performance metrics and design documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 22:57:26 +09:00
acc64f2438 Phase ML1: Pool v1 memset 89.73% overhead 軽量化 (+15.34% improvement)
## Summary
- ChatGPT により bench_profile.h の setenv segfault を修正(RTLD_NEXT 経由に切り替え)
- core/box/pool_zero_mode_box.h 新設:ENV キャッシュ経由で ZERO_MODE を統一管理
- core/hakmem_pool.c で zero mode に応じた memset 制御(FULL/header/off)
- A/B テスト結果:ZERO_MODE=header で +15.34% improvement(1M iterations, C6-heavy)

## Files Modified
- core/box/pool_api.inc.h: pool_zero_mode_box.h include
- core/bench_profile.h: glibc setenv → malloc+putenv(segfault 回避)
- core/hakmem_pool.c: zero mode 参照・制御ロジック
- core/box/pool_zero_mode_box.h (新設): enum/getter
- CURRENT_TASK.md: Phase ML1 結果記載

## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K       | 3.06 M ops/s   | 3.17 M ops/s    | +3.65%     |
| 1M        | 23.71 M ops/s  | 27.34 M ops/s   | **+15.34%** |

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00
a905e0ffdd Guard madvise ENOMEM and stabilize pool/tiny front v3 2025-12-09 21:50:15 +09:00
8f18963ad5 Phase 36-37: TinyHotHeap v2 HotBox redesign and C7 current_page policy fixes
- Redefine TinyHotHeap v2 as per-thread Hot Box with clear boundaries
- Add comprehensive OS statistics tracking for SS allocations
- Implement route-based free handling for TinyHeap v2
- Add C6/C7 debugging and statistics improvements
- Update documentation with implementation guidelines and analysis
- Add new box headers for stats, routing, and front-end management
2025-12-08 21:30:21 +09:00
9502501842 Fix tiny lane success handling for TinyHeap routes 2025-12-07 23:06:50 +09:00
a6991ec9e4 Add TinyHeap class mask and extend routing 2025-12-07 22:49:28 +09:00
fda6cd2e67 Boxify superslab registry, add bench profile, and document C7 hotpath experiments 2025-12-07 03:12:27 +09:00
d5e6ed535c P-Tier + Tiny Route Policy: Aggressive Superslab Management + Safe Routing
## Phase 1: Utilization-Aware Superslab Tiering (案B実装済)

- Add ss_tier_box.h: Classify SuperSlabs into HOT/DRAINING/FREE based on utilization
  - HOT (>25%): Accept new allocations
  - DRAINING (≤25%): Drain only, no new allocs
  - FREE (0%): Ready for eager munmap

- Enhanced shared_pool_release_slab():
  - Check tier transition after each slab release
  - If tier→FREE: Force remaining slots to EMPTY and call superslab_free() immediately
  - Bypasses LRU cache to prevent registry bloat from accumulating DRAINING SuperSlabs

- Test results (bench_random_mixed_hakmem):
  - 1M iterations:  ~1.03M ops/s (previously passed)
  - 10M iterations:  ~1.15M ops/s (previously: registry full error)
  - 50M iterations:  ~1.08M ops/s (stress test)

## Phase 2: Tiny Front Routing Policy (新規Box)

- Add tiny_route_box.h/c: Single 8-byte table for class→routing decisions
  - ROUTE_TINY_ONLY: Tiny front exclusive (no fallback)
  - ROUTE_TINY_FIRST: Try Tiny, fallback to Pool if fails
  - ROUTE_POOL_ONLY: Skip Tiny entirely

- Profiles via HAKMEM_TINY_PROFILE ENV:
  - "hot": C0-C3=TINY_ONLY, C4-C6=TINY_FIRST, C7=POOL_ONLY
  - "conservative" (default): All TINY_FIRST
  - "off": All POOL_ONLY (disable Tiny)
  - "full": All TINY_ONLY (microbench mode)

- A/B test results (ws=256, 100k ops random_mixed):
  - Default (conservative): ~2.90M ops/s
  - hot: ~2.65M ops/s (more conservative)
  - off: ~2.86M ops/s
  - full: ~2.98M ops/s (slightly best)

## Design Rationale

### Registry Pressure Fix (案B)
- Problem: DRAINING tier SS occupied registry indefinitely
- Solution: When total_active_blocks→0, immediately free to clear registry slot
- Result: No more "registry full" errors under stress

### Routing Policy Box (新)
- Problem: Tiny front optimization scattered across ENV/branches
- Solution: Centralize routing in single table, select profiles via ENV
- Benefit: Safe A/B testing without touching hot path code
- Future: Integrate with RSS budget/learning layers for dynamic profile switching

## Next Steps (性能最適化)
- Profile Tiny front internals (TLS SLL, FastCache, Superslab backend latency)
- Identify bottleneck between current ~2.9M ops/s and mimalloc ~100M ops/s
- Consider:
  - Reduce shared pool lock contention
  - Optimize unified cache hit rate
  - Streamline Superslab carving logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 18:01:25 +09:00
984cca41ef P0 Optimization: Shared Pool fast path with O(1) metadata lookup
Performance Results:
- Throughput: 2.66M ops/s → 3.8M ops/s (+43% improvement)
- sp_meta_find_or_create: O(N) linear scan → O(1) direct pointer
- Stage 2 metadata scan: 100% → 10-20% (80-90% reduction via hints)

Core Optimizations:

1. O(1) Metadata Lookup (superslab_types.h)
   - Added `shared_meta` pointer field to SuperSlab struct
   - Eliminates O(N) linear search through ss_metadata[] array
   - First access: O(N) scan + cache | Subsequent: O(1) direct return

2. sp_meta_find_or_create Fast Path (hakmem_shared_pool.c)
   - Check cached ss->shared_meta first before linear scan
   - Cache pointer after successful linear scan for future lookups
   - Reduces 7.8% CPU hotspot to near-zero for hot paths

3. Stage 2 Class Hints Fast Path (hakmem_shared_pool_acquire.c)
   - Try class_hints[class_idx] FIRST before full metadata scan
   - Uses O(1) ss->shared_meta lookup for hint validation
   - __builtin_expect() for branch prediction optimization
   - 80-90% of acquire calls now skip full metadata scan

4. Proper Initialization (ss_allocation_box.c)
   - Initialize shared_meta = NULL in superslab_allocate()
   - Ensures correct NULL-check semantics for new SuperSlabs

Additional Improvements:
- Updated ptr_trace and debug ring for release build efficiency
- Enhanced ENV variable documentation and analysis
- Added learner_env_box.h for configuration management
- Various Box optimizations for reduced overhead

Thread Safety:
- All atomic operations use correct memory ordering
- shared_meta cached under mutex protection
- Lock-free Stage 2 uses proper CAS with acquire/release semantics

Testing:
- Benchmark: 1M iterations, 3.8M ops/s stable
- Build: Clean compile RELEASE=0 and RELEASE=1
- No crashes, memory leaks, or correctness issues

Next Optimization Candidates:
- P1: Per-SuperSlab free slot bitmap for O(1) slot claiming
- P2: Reduce Stage 2 critical section size
- P3: Page pre-faulting (MAP_POPULATE)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 16:21:54 +09:00
25cb7164c7 Comprehensive legacy cleanup and architecture consolidation
Summary of Changes:

MOVED TO ARCHIVE:
- core/hakmem_tiny_legacy_slow_box.inc → archive/
  * Slow path legacy code preserved for reference
  * Superseded by Gatekeeper Box architecture

- core/superslab_allocate.c → archive/superslab_allocate_legacy.c
  * Legacy SuperSlab allocation implementation
  * Functionality integrated into new Box system

- core/superslab_head.c → archive/superslab_head_legacy.c
  * Legacy slab head management
  * Refactored through Box architecture

REMOVED DEAD CODE:
- Eliminated unused allocation policy variants from ss_allocation_box.c
  * Reduced from 127+ lines of conditional logic to focused implementation
  * Removed: old policy branches, unused allocation strategies
  * Kept: current Box-based allocation path

ADDED NEW INFRASTRUCTURE:
- core/superslab_head_stub.c (41 lines)
  * Minimal stub for backward compatibility
  * Delegates to new architecture

- Enhanced core/superslab_cache.c (75 lines added)
  * Added missing API functions for cache management
  * Proper interface for SuperSlab cache integration

REFACTORED CORE SYSTEMS:
- core/hakmem_super_registry.c
  * Moved registration logic from scattered locations
  * Centralized SuperSlab registry management

- core/hakmem_tiny.c
  * Removed 27 lines of redundant initialization
  * Simplified through Box architecture

- core/hakmem_tiny_alloc.inc
  * Streamlined allocation path to use Gatekeeper
  * Removed legacy decision logic

- core/box/ss_allocation_box.c/h
  * Dramatically simplified allocation policy
  * Removed conditional branches for unused strategies
  * Focused on current Box-based approach

BUILD SYSTEM:
- Updated Makefile for archive structure
- Removed obsolete object file references
- Maintained build compatibility

SAFETY & TESTING:
- All deletions verified: no broken references
- Build verification: RELEASE=0 and RELEASE=1 pass
- Smoke tests: 100% pass rate
- Functional verification: allocation/free intact

Architecture Consolidation:
Before: Multiple overlapping allocation paths with legacy code branches
After:  Single unified path through Gatekeeper Boxes with clear architecture

Benefits:
- Reduced code size and complexity
- Improved maintainability
- Single source of truth for allocation logic
- Better diagnostic/observability hooks
- Foundation for future optimizations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 14:22:48 +09:00
a0a80f5403 Remove legacy redundant code after Gatekeeper Box consolidation
Summary of Deletions:
- Remove core/box/unified_batch_box.c (26 lines)
  * Legacy batch allocation logic superseded by Alloc Gatekeeper Box
  * unified_cache now handles allocation aggregation

- Remove core/box/unified_batch_box.h (29 lines)
  * Header declarations for deprecated unified_batch_box module

- Remove core/tiny_free_fast.inc.h (329 lines)
  * Legacy fast-path free implementation
  * Functionality consolidated into:
    - tiny_free_gate_box.h (Fail-Fast layer + diagnostics)
    - malloc_tiny_fast.h (Free path integration)
    - unified_cache (return to freelist)
  * Code path now routes through Gatekeeper Box for consistency

Build System Updates:
- Update Makefile
  * Remove unified_batch_box.o from OBJS_BASE
  * Remove unified_batch_box_shared.o from SHARED_OBJS
  * Remove unified_batch_box.o from BENCH_HAKMEM_OBJS_BASE

- Update core/hakmem_tiny_phase6_wrappers_box.inc
  * Remove unified_batch_box references
  * Simplify allocation wrapper to use new Gatekeeper architecture

Impact:
- Removes ~385 lines of redundant/superseded code
- Consolidates allocation logic through unified Gatekeeper entry points
- All functionality preserved via new Box-based architecture
- Simplifies codebase and reduces maintenance burden

Testing:
- Build verification: make clean && make RELEASE=0/1
- Smoke tests: All pass (simple_alloc, loop 10M, pool_tls)
- No functional regressions

Rationale:
After implementing Alloc/Free Gatekeeper Boxes with Fail-Fast layers
and Unified Cache type safety, the legacy separate implementations
became redundant. This commit completes the architectural consolidation
and simplifies the allocator codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 12:55:53 +09:00
0546454168 WIP: Add TLS SLL validation and SuperSlab registry fallback
ChatGPT's diagnostic changes to address TLS_SLL_HDR_RESET issue.
Current status: Partial mitigation, but root cause remains.

Changes Applied:
1. SuperSlab Registry Fallback (hakmem_super_registry.h)
   - Added legacy table probe when hash map lookup misses
   - Prevents NULL returns for valid SuperSlabs during initialization
   - Status:  Works but may hide underlying registration issues

2. TLS SLL Push Validation (tls_sll_box.h)
   - Reject push if SuperSlab lookup returns NULL
   - Reject push if class_idx mismatch detected
   - Added [TLS_SLL_PUSH_NO_SS] diagnostic message
   - Status:  Prevents list corruption (defensive)

3. SuperSlab Allocation Class Fix (superslab_allocate.c)
   - Pass actual class_idx to sp_internal_allocate_superslab
   - Prevents dummy class=8 causing OOB access
   - Status:  Root cause fix for allocation path

4. Debug Output Additions
   - First 256 push/pop operations traced
   - First 4 mismatches logged with details
   - SuperSlab registration state logged
   - Status:  Diagnostic tool (not a fix)

5. TLS Hint Box Removed
   - Deleted ss_tls_hint_box.{c,h} (Phase 1 optimization)
   - Simplified to focus on stability first
   - Status:  Can be re-added after root cause fixed

Current Problem (REMAINS UNSOLVED):
- [TLS_SLL_HDR_RESET] still occurs after ~60 seconds of sh8bench
- Pointer is 16 bytes offset from expected (class 1 → class 2 boundary)
- hak_super_lookup returns NULL for that pointer
- Suggests: Use-After-Free, Double-Free, or pointer arithmetic error

Root Cause Analysis:
- Pattern: Pointer offset by +16 (one class 1 stride)
- Timing: Cumulative problem (appears after 60s, not immediately)
- Location: Header corruption detected during TLS SLL pop

Remaining Issues:
⚠️ Registry fallback is defensive (may hide registration bugs)
⚠️ Push validation prevents symptoms but not root cause
⚠️ 16-byte pointer offset source unidentified

Next Steps for Investigation:
1. Full pointer arithmetic audit (Magazine ⇔ TLS SLL paths)
2. Enhanced logging at HDR_RESET point:
   - Expected vs actual pointer value
   - Pointer provenance (where it came from)
   - Allocation trace for that block
3. Verify Headerless flag is OFF throughout build
4. Check for double-offset application in conversions

Technical Assessment:
- 60% root cause fixes (allocation class, validation)
- 40% defensive mitigation (registry fallback, push rejection)

Performance Impact:
- Registry fallback: +10-30 cycles on cold path (negligible)
- Push validation: +5-10 cycles per push (acceptable)
- Overall: < 2% performance impact estimated

Related Issues:
- Phase 1 TLS Hint Box removed temporarily
- Phase 2 Headerless blocked until stability achieved

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 20:42:28 +09:00
2dc9d5d596 Fix include order in hakmem.c - move hak_kpi_util.inc.h before hak_core_init.inc.h
Problem: hak_core_init.inc.h references KPI measurement variables
(g_latency_histogram, g_latency_samples, g_baseline_soft_pf, etc.)
but hakmem.c was including hak_kpi_util.inc.h AFTER hak_core_init.inc.h,
causing undefined reference errors.

Solution: Reorder includes so hak_kpi_util.inc.h (definition) comes
before hak_core_init.inc.h (usage).

Build result:  Success (libhakmem.so 547KB, 0 errors)

Minor changes:
- Added extern __thread declarations for TLS SLL debug variables
- Added signal handler logging for debug_dump_last_push
- Improved hakmem_tiny.c structure for Phase 2 preparation

🤖 Generated with Claude Code + Task Agent

Co-Authored-By: Gemini <gemini@example.com>
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 13:28:44 +09:00
bd5e97f38a Save current state before investigating TLS_SLL_HDR_RESET 2025-12-03 10:34:39 +09:00
195c74756c Fix mid free routing and relax mid W_MAX 2025-12-01 22:06:10 +09:00
4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added  environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and  to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected  to ensure  is properly linked into , resolving an  error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for  environment to prevent unintended fallbacks to 's  for non-tiny allocations, allowing comprehensive testing of the  allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created  to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in  by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00
e769dec283 Refactor: Clean up SuperSlab shared pool code
- Removed unused/disabled L0 cache implementation from core/hakmem_shared_pool.c.
- Deleted stale backup file core/hakmem_tiny_superslab.c.bak.
- Removed untracked and obsolete shared_pool source files.
2025-11-30 15:27:53 +09:00
3f461ba25f Cleanup: Consolidate debug ENV vars to HAKMEM_DEBUG_LEVEL
Integrated 4 new debug environment variables added during bug fixes
into the existing unified HAKMEM_DEBUG_LEVEL system (expanded to 0-5 levels).

Changes:

1. Expanded HAKMEM_DEBUG_LEVEL from 0-3 to 0-5 levels:
   - 0 = OFF (production)
   - 1 = ERROR (critical errors)
   - 2 = WARN (warnings)
   - 3 = INFO (allocation paths, header validation, stats)
   - 4 = DEBUG (guard instrumentation, failfast)
   - 5 = TRACE (verbose tracing)

2. Integrated 4 environment variables:
   - HAKMEM_ALLOC_PATH_TRACE → HAKMEM_DEBUG_LEVEL >= 3 (INFO)
   - HAKMEM_TINY_SLL_VALIDATE_HDR → HAKMEM_DEBUG_LEVEL >= 3 (INFO)
   - HAKMEM_TINY_REFILL_FAILFAST → HAKMEM_DEBUG_LEVEL >= 4 (DEBUG)
   - HAKMEM_TINY_GUARD → HAKMEM_DEBUG_LEVEL >= 4 (DEBUG)

3. Kept 2 special-purpose variables (fine-grained control):
   - HAKMEM_TINY_GUARD_CLASS (target class for guard)
   - HAKMEM_TINY_GUARD_MAX (max guard events)

4. Backward compatibility:
   - Legacy ENV vars still work via hak_debug_check_level()
   - New code uses unified system
   - No behavior changes for existing users

Updated files:
- core/hakmem_debug_master.h (level 0-5 expansion)
- core/hakmem_tiny_superslab_internal.h (alloc path trace)
- core/box/tls_sll_box.h (header validation)
- core/tiny_failfast.c (failfast level)
- core/tiny_refill_opt.h (failfast guard)
- core/hakmem_tiny_ace_guard_box.inc (guard enable)
- core/hakmem_tiny.c (include hakmem_debug_master.h)

Impact:
- Simpler debug control: HAKMEM_DEBUG_LEVEL=3 instead of 4 separate ENVs
- Easier to discover/use
- Consistent debug levels across codebase
- Reduces ENV variable proliferation (43+ vars surveyed)

Future work:
- Consolidate remaining 39+ debug variables (documented in survey)
- Gradual migration over 2-3 releases

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 06:57:03 +09:00
20f8d6f179 Cleanup: Add tiny_debug_api.h to eliminate guard/failfast implicit warnings
Created central header for debug instrumentation API to fix implicit
function declaration warnings across the codebase.

Changes:
1. Created core/tiny_debug_api.h
   - Declares guard system API (3 functions)
   - Declares failfast debugging API (3 functions)
   - Uses forward declarations for SuperSlab/TinySlabMeta

2. Updated 3 files to include tiny_debug_api.h:
   - core/tiny_region_id.h (removed inline externs)
   - core/hakmem_tiny_tls_ops.h
   - core/tiny_superslab_alloc.inc.h

Warnings eliminated (6 of 11 total):
 tiny_guard_is_enabled()
 tiny_guard_on_alloc()
 tiny_guard_on_invalid()
 tiny_failfast_log()
 tiny_failfast_abort_ptr()
 tiny_refill_failfast_level()

Remaining warnings (deferred to P1):
- ss_active_add (2 occurrences)
- expand_superslab_head
- hkm_ace_set_tls_capacity
- smallmid_backend_free

Impact:
- Cleaner build output
- Better type safety for debug functions
- No behavior changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 06:47:13 +09:00
f8b0f38f78 ENV Cleanup Step 8: Gate HAKMEM_SUPER_LOOKUP_DEBUG in header
Gate HAKMEM_SUPER_LOOKUP_DEBUG environment variable behind
#if !HAKMEM_BUILD_RELEASE in hakmem_super_registry.h inline function.

Changes:
- Wrap s_dbg initialization in conditional compilation
- Release builds use constant s_dbg = 0 for complete elimination
- Debug logging in hak_super_lookup() now fully compiled out in release

Performance: 30.3M ops/s Larson (stable, no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 01:45:45 +09:00
6b791b97d4 ENV Cleanup: Delete Ultra HEAP & BG Remote dead code (-1,096 LOC)
Deleted files (11):
- core/ultra/ directory (6 files: tiny_ultra_heap.*, tiny_ultra_page_arena.*)
- core/front/tiny_ultrafront.h
- core/tiny_ultra_fast.inc.h
- core/hakmem_tiny_ultra_front.inc.h
- core/hakmem_tiny_ultra_simple.inc
- core/hakmem_tiny_ultra_batch_box.inc

Edited files (10):
- core/hakmem_tiny.c: Remove Ultra HEAP #includes, move ultra_batch_for_class()
- core/hakmem_tiny_tls_state_box.inc: Delete TinyUltraFront, g_ultra_simple
- core/hakmem_tiny_phase6_wrappers_box.inc: Delete ULTRA_SIMPLE block
- core/hakmem_tiny_alloc.inc: Delete Ultra-Front code block
- core/hakmem_tiny_init.inc: Delete ULTRA_SIMPLE ENV loading
- core/hakmem_tiny_remote_target.{c,h}: Delete g_bg_remote_enable/batch
- core/tiny_refill.h: Remove BG Remote check (always break)
- core/hakmem_tiny_background.inc: Delete BG Remote drain loop

Deleted ENV variables:
- HAKMEM_TINY_ULTRA_HEAP (build flag, undefined)
- HAKMEM_TINY_ULTRA_L0
- HAKMEM_TINY_ULTRA_HEAP_DUMP
- HAKMEM_TINY_ULTRA_PAGE_DUMP
- HAKMEM_TINY_ULTRA_FRONT
- HAKMEM_TINY_BG_REMOTE (no getenv, dead code)
- HAKMEM_TINY_BG_REMOTE_BATCH (no getenv, dead code)
- HAKMEM_TINY_ULTRA_SIMPLE (references only)

Impact:
- Code reduction: -1,096 lines
- Binary size: 305KB → 304KB (-1KB)
- Build: PASS
- Sanity: 15.69M ops/s (3 runs avg)
- Larson: 1 crash observed (seed 43, likely existing instability)

Notes:
- Ultra HEAP never compiled (#if HAKMEM_TINY_ULTRA_HEAP undefined)
- BG Remote variables never initialized (g_bg_remote_enable always 0)
- Ultra SLIM (ultra_slim_alloc_box.h) preserved (active 4-layer path)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-27 04:35:47 +09:00
6fadc74405 ENV cleanup: Remove obsolete ULTRAHOT variable + organize docs
Changes:
1. Removed HAKMEM_TINY_FRONT_ENABLE_ULTRAHOT variable
   - Deleted front_prune_ultrahot_enabled() function
   - UltraHot feature was removed in commit bcfb4f6b5
   - Variable was dead code, no longer referenced

2. Organized ENV cleanup analysis documents
   - Moved 5 ENV analysis docs to docs/analysis/
   - ENV_CLEANUP_PLAN.md - detailed file-by-file plan
   - ENV_CLEANUP_SUMMARY.md - executive summary
   - ENV_CLEANUP_ANALYSIS.md - categorized analysis
   - ENV_CONSOLIDATION_PLAN.md - consolidation proposals
   - ENV_QUICK_REFERENCE.md - quick reference guide

Impact:
- ENV variables: 221 → 220 (-1)
- Build:  Successful
- Risk: Zero (dead code removal)

Next steps (documented in ENV_CLEANUP_SUMMARY.md):
- 21 variables need verification (Ultra/HeapV2/BG/HotMag)
- SFC_DEBUG deduplication opportunity (7 callsites)

File: core/box/front_metrics_box.h
Status: SAVEPOINT - stable baseline for future ENV cleanup

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 17:12:41 +09:00
6b38bc840e Cleanup: Remove unused hakmem_libc.c (duplicate of hakmem_syscall.c)
- File was not included in Makefile OBJS_BASE
- Functions already implemented in hakmem_syscall.c
- Size: 361 bytes removed
2025-11-26 13:03:17 +09:00
bcfb4f6b59 Remove dead code: UltraHot, RingCache, FrontC23, Class5 Hotpath
(cherry-picked from 225b6fcc7, conflicts resolved)
2025-11-26 12:33:49 +09:00
d8168a2021 Fix C7 TLS SLL header restoration regression + Document Larson MT race condition
## Bug Fix: Restore C7 Exception in TLS SLL Push

**File**: `core/box/tls_sll_box.h:309`

**Problem**: Commit 25d963a4a (Code Cleanup) accidentally reverted the C7 fix by changing:
```c
if (class_idx != 0 && class_idx != 7) {  // CORRECT (commit 8b67718bf)
if (class_idx != 0) {                     // BROKEN (commit 25d963a4a)
```

**Impact**: C7 (1024B class) header restoration in TLS SLL push overwrote next pointer at base[0], causing corruption.

**Fix**: Restored `&& class_idx != 7` check to prevent header restoration for C7.

**Why C7 Needs Exception**:
- C7 uses offset=0 (stores next pointer at base[0])
- User pointer is at base+1
- Next pointer MUST NOT be overwritten by header restoration
- C1-C6 use offset=1 (next at base[1]), so base[0] header restoration is safe

## Investigation: Larson MT Race Condition (SEPARATE ISSUE)

**Finding**: Larson still crashes with 3+ threads due to UNRELATED multi-threading race condition in unified cache freelist management.

**Root Cause**: Non-atomic freelist operations in `TinySlabMeta`:
```c
typedef struct TinySlabMeta {
    void* freelist;    //  NOT ATOMIC
    uint16_t used;     //  NOT ATOMIC
} TinySlabMeta;
```

**Evidence**:
```
1 thread:   PASS (1.88M - 41.8M ops/s)
2 threads:  PASS (24.6M ops/s)
3 threads:  SEGV (race condition)
4+ threads:  SEGV (race condition)
```

**Status**: C7 fix is CORRECT. Larson crash is separate MT issue requiring atomic freelist implementation.

## Documentation Added

Created comprehensive investigation reports:
- `LARSON_CRASH_ROOT_CAUSE_REPORT.md` - Full technical analysis
- `LARSON_DIAGNOSTIC_PATCH.md` - Implementation guide
- `LARSON_INVESTIGATION_SUMMARY.md` - Executive summary
- `LARSON_QUICK_REF.md` - Quick reference
- `verify_race_condition.sh` - Automated verification script

## Next Steps

Implement atomic freelist operations for full MT safety (7-9 hour effort):
1. Make `TinySlabMeta.freelist` atomic with CAS loop
2. Audit 87 freelist access sites
3. Test with Larson 8+ threads

🔧 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-22 02:15:34 +09:00
25d963a4aa Code Cleanup: Remove false positives, redundant validations, and reduce verbose logging
Following the C7 stride upgrade fix (commit 23c0d9541), this commit performs
comprehensive cleanup to improve code quality and reduce debug noise.

## Changes

### 1. Disable False Positive Checks (tiny_nextptr.h)
- **Disabled**: NXT_MISALIGN validation block with `#if 0`
- **Reason**: Produces false positives due to slab base offsets (2048, 65536)
  not being stride-aligned, causing all blocks to appear "misaligned"
- **TODO**: Reimplement to check stride DISTANCE between consecutive blocks
  instead of absolute alignment to stride boundaries

### 2. Remove Redundant Geometry Validations

**hakmem_tiny_refill_p0.inc.h (P0 batch refill)**
- Removed 25-line CARVE_GEOMETRY_FIX validation block
- Replaced with NOTE explaining redundancy
- **Reason**: Stride table is now correct in tiny_block_stride_for_class(),
  defense-in-depth validation adds overhead without benefit

**ss_legacy_backend_box.c (legacy backend)**
- Removed 18-line LEGACY_FIX_GEOMETRY validation block
- Replaced with NOTE explaining redundancy
- **Reason**: Shared_pool validates geometry at acquisition time

### 3. Reduce Verbose Logging

**hakmem_shared_pool.c (sp_fix_geometry_if_needed)**
- Made SP_FIX_GEOMETRY logging conditional on `!HAKMEM_BUILD_RELEASE`
- **Reason**: Geometry fixes are expected during stride upgrades,
  no need to log in release builds

### 4. Verification
- Build:  Successful (LTO warnings expected)
- Test:  10K iterations (1.87M ops/s, no crashes)
- NXT_MISALIGN false positives:  Eliminated

## Files Modified
- core/tiny_nextptr.h - Disabled false positive NXT_MISALIGN check
- core/hakmem_tiny_refill_p0.inc.h - Removed redundant CARVE validation
- core/box/ss_legacy_backend_box.c - Removed redundant LEGACY validation
- core/hakmem_shared_pool.c - Made SP_FIX_GEOMETRY logging debug-only

## Impact
- **Code clarity**: Removed 43 lines of redundant validation code
- **Debug noise**: Reduced false positive diagnostics
- **Performance**: Eliminated overhead from redundant geometry checks
- **Maintainability**: Single source of truth for geometry validation

🧹 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 23:00:24 +09:00
9b0d746407 Phase 3d-B: TLS Cache Merge - Unified g_tls_sll[] structure (+12-18% expected)
Merge separate g_tls_sll_head[] and g_tls_sll_count[] arrays into unified
TinyTLSSLL struct to improve L1D cache locality. Expected performance gain:
+12-18% from reducing cache line splits (2 loads → 1 load per operation).

Changes:
- core/hakmem_tiny.h: Add TinyTLSSLL type (16B aligned, head+count+pad)
- core/hakmem_tiny.c: Replace separate arrays with g_tls_sll[8]
- core/box/tls_sll_box.h: Update Box API (13 sites) for unified access
- Updated 32+ files: All g_tls_sll_head[i] → g_tls_sll[i].head
- Updated 32+ files: All g_tls_sll_count[i] → g_tls_sll[i].count
- core/hakmem_tiny_integrity.h: Unified canary guards
- core/box/integrity_box.c: Simplified canary validation
- Makefile: Added core/box/tiny_sizeclass_hist_box.o to link

Build:  PASS (10K ops sanity test)
Warnings: Only pre-existing LTO type mismatches (unrelated)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 07:32:30 +09:00
03ba62df4d Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization

Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
   - Direct SuperSlab carve (TLS SLL bypass)
   - Self-contained pop-or-refill pattern
   - ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128

2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
   - Unified ON → direct cache access (skip all intermediate layers)
   - Alloc: unified_cache_pop_or_refill() → immediate fail to slow
   - Free: unified_cache_push() → fallback to SLL only if full

PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
   - PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
   - Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()

4. Measurement results (Random Mixed 500K / 256B):
   - Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
   - SSM: 512 pages (initialization footprint)
   - MID/L25: 0 (unused in this workload)
   - Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)

Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
   - ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
   - Conditional compilation cleanup

Documentation:
6. Analysis reports
   - RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
   - RANDOM_MIXED_SUMMARY.md: Phase 23 summary
   - RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
   - CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan

Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 02:47:58 +09:00
fdbdcdcdb3 Phase 21-1-B: Ring cache Alloc/Free 統合 - C2/C3 hot path integration
**統合内容**:
- Alloc path (tiny_alloc_fast.inc.h): Ring pop → HeapV2/UltraHot/SLL fallback
- Free path (tiny_free_fast_v2.inc.h): Ring push → HeapV2/SLL fallback
- Lazy init: 最初の alloc/free 時に自動初期化(thread-safe)

**設計**:
- Lazy init パターン(ENV control と同様)
- ring_cache_pop/push 内で slots == NULL チェック → ring_cache_init() 呼び出し
- Include 構造: ファイルトップレベルに #include 追加(関数内 include 禁止)

**Makefile 修正**:
- TINY_BENCH_OBJS_BASE に core/front/tiny_ring_cache.o 追加
- Link エラー修正: 4箇所の object list に追加

**動作確認**:
- Ring OFF (default): 83K ops/s (1K iterations) 
- Ring ON (HAKMEM_TINY_HOT_RING_ENABLE=1): 78K ops/s 
- クラッシュなし、正常動作確認

**次のステップ**: Phase 21-1-C (Refill/Cascade 実装)
2025-11-16 07:51:37 +09:00