Commit Graph

81 Commits

Author SHA1 Message Date
d9991f39ff Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation

Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions
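
As a rough illustration of what a 2-level page table lookup like ss_pt_lookup() can look like, here is a minimal C sketch; every constant, type, and name below is an assumption for illustration, not the actual ss_pt_* implementation.

```c
/* Sketch of a two-level page table keyed by address bits, in the spirit of
 * the ss_pt_* boxes above. All constants and names here are assumptions. */
#include <stdint.h>
#include <stddef.h>

#define SS_PT_L1_BITS    12          /* assumption: 4096 L1 entries */
#define SS_PT_L2_BITS    12          /* assumption: 4096 L2 entries */
#define SS_PT_PAGE_SHIFT 16          /* assumption: 64KiB granularity */

typedef struct { void *region; } SsPtEntry;
typedef struct { SsPtEntry e[1u << SS_PT_L2_BITS]; } SsPtL2;

static SsPtL2 *g_ss_pt_l1[1u << SS_PT_L1_BITS];

static inline void *ss_pt_lookup_sketch(const void *ptr) {
    uintptr_t p  = (uintptr_t)ptr >> SS_PT_PAGE_SHIFT;
    uintptr_t i1 = (p >> SS_PT_L2_BITS) & ((1u << SS_PT_L1_BITS) - 1);
    uintptr_t i2 = p & ((1u << SS_PT_L2_BITS) - 1);
    SsPtL2 *l2 = g_ss_pt_l1[i1];
    return l2 ? l2->e[i2].region : NULL;  /* NULL = unregistered address */
}
```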

Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration

Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:35:46 +09:00
1a8652a91a Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS
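
A minimal sketch of the intrusive LIFO idea above: the next pointer lives inside the freed block itself, so no side array is needed. The offset, TLS head, and helper names are illustrative stand-ins for tiny_next_store/tiny_next_load and the real C6 TLS state.

```c
/* Illustrative sketch only (assumed layout): an intrusive LIFO freelist that
 * stores the next pointer inside the freed block at a small byte offset
 * from the user pointer. */
#include <stddef.h>
#include <string.h>

#define C6_NEXT_OFFSET 1  /* assumption: next pointer stored at USER+1 */

static __thread void *c6_ifl_head = NULL;  /* per-thread LIFO head */

static inline void c6_next_store(void *blk, void *next) {
    memcpy((char *)blk + C6_NEXT_OFFSET, &next, sizeof(next));
}

static inline void *c6_next_load(void *blk) {
    void *next;
    memcpy(&next, (char *)blk + C6_NEXT_OFFSET, sizeof(next));
    return next;
}

static inline void c6_ifl_push(void *blk) {    /* free side */
    c6_next_store(blk, c6_ifl_head);
    c6_ifl_head = blk;
}

static inline void *c6_ifl_pop(void) {         /* alloc side */
    void *blk = c6_ifl_head;
    if (blk) c6_ifl_head = c6_next_load(blk);
    return blk;                                 /* NULL -> fall back to array path */
}
```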

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00
212739607a Phase v11a-3: MID v3.5 Activation (Build Complete)
Integrated MID v3.5 into active code path, making it available for C5/C6/C7 routing.

Key Changes:
- Policy Box: Added SMALL_ROUTE_MID_V35 with ENV gates (HAKMEM_MID_V35_ENABLED, HAKMEM_MID_V35_CLASSES)
- HotBox: Implemented small_mid_v35_alloc/free with TLS-cached page allocation
- Front Gate: Wired MID_V35 routing into malloc_tiny_fast.h (priority: ULTRA > MID_V35 > V7)
- Build: Added core/smallobject_mid_v35.o to all object lists

Architecture:
- Slot sizes: C5=384B, C6=512B, C7=1024B
- Page size: 64KB (170/128/64 slots)
- Integration: ColdIface v2 (refill/retire), Stats v2 (observation), Learner v2 (dormant)

Status: Build successful, ready for A/B benchmarking
Next: Performance validation (C6-heavy, C5+C6-only, Mixed benchmarks)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 06:52:14 +09:00
0dba67ba9d Phase v11a-2: Core MID v3.5 implementation - segment, cold iface, stats, learner
Implement 5-layer infrastructure for multi-class MID v3.5 (C5-C7, 257-1KiB):

1. SegmentBox_mid_v3 (L2 Physical)
   - core/smallobject_segment_mid_v3.c (9.5 KB)
   - 2MiB segments, 64KiB pages (32 per segment)
   - Per-class free page stacks (LIFO)
   - RegionIdBox registration
   - Slots: C5→170, C6→102, C7→64

2. ColdIface_mid_v3 (L2→L1)
   - core/box/smallobject_cold_iface_mid_v3_box.h (NEW)
   - core/smallobject_cold_iface_mid_v3.c (3.5 KB)
   - refill: get page from free stack or new segment
   - retire: calculate free_hit_ratio, publish stats, return to stack
   - Clean separation: TLS cache for hot path, ColdIface for cold path

3. StatsBox_mid_v3 (L2→L3)
   - core/smallobject_stats_mid_v3.c (7.2 KB)
   - Circular buffer history (1000 events)
   - Per-page metrics: class_idx, allocs, frees, free_hit_ratio_bps
   - Periodic aggregation (every 100 retires)
   - Learner notification callback

4. Learner v2 (L3)
   - core/smallobject_learner_v2.c (11 KB)
   - Multi-class aggregation: allocs[8], retire_count[8], avg_free_hit_bps[8]
   - Exponential smoothing (90% history + 10% new)
   - Per-class efficiency tracking
   - Stats snapshot API
   - Route decision disabled for v11a-2 (v11b feature)

5. Build Integration
   - Modified Makefile: added 4 new .o files (segment, cold_iface, stats, learner)
   - Updated box header prototypes
   - Clean compilation, all dependencies resolved
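
As a worked illustration of the exponential smoothing used by Learner v2 (item 4 above, 90% history + 10% new), assuming the averages are kept in basis points; the helper name is hypothetical.

```c
/* Sketch: exponential smoothing with 90% history / 10% new sample,
 * in basis points to stay integer-only. Names are illustrative. */
#include <stdint.h>

static inline uint32_t smooth_bps(uint32_t avg_bps, uint32_t sample_bps) {
    /* new_avg = 0.9 * old_avg + 0.1 * sample */
    return (uint32_t)((9u * avg_bps + 1u * sample_bps) / 10u);
}

/* e.g. avg=8000 bps (80%), sample=5000 bps (50%) -> 7700 bps (77%). */
```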

Architecture Decision Implementation:
- v7 remains frozen (C5/C6 research preset)
- MID v3.5 becomes unified 257-1KiB main path
- Multi-class isolation: per-class free stacks
- Dormant infrastructure: linked but not active (zero overhead)

Performance:
- Build: clean compilation
- Sanity benchmark: 27.3M ops/s (no regression vs v10)
- Memory: ~30MB RSS (baseline maintained)

Design Compliance:
- Layer separation: L2 (segment) → L2 (cold iface) → L3 (stats) → L3 (learner)
- Hot path clean: alloc/free never touch stats/learner
- Backward compatible: existing MID v3 routes unchanged
- Transparent: v11a-2 is dormant (no behavior change)

Next Phase (v11a-3):
- Activate C5/C6/C7 routing through MID v3.5
- Connect TLS cache to segment refill
- Verify performance under load
- Then Phase v11a-4: dynamic C5 ratio routing

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 06:37:06 +09:00
8143e8b797 Phase v7-4: Introduce Policy Box (clarify the L3 layer and rebuild the frontend core)
- SmallPolicyV7 Box: placed in the L3 Policy layer; centralizes route decisions
- Route kind enum: SMALL_ROUTE_ULTRA / V7 / MID_V3 / LEGACY
- ENV priority (fixed): ULTRA > v7 > MID_v3 > LEGACY
- Frontend integration: v7 routing now goes through the Policy Box (staged migration)
- Legacy compatibility: the existing tiny_route_env_box.h remains in use alongside it

Box Theory layer structure:
- L0: ULTRA (C4-C7, FROZEN)
- L1: SmallObject v7 (research box)
- L1': MID_v3 / LEGACY (fallback)
- L2: Segment / RegionId
- L3: Policy / Stats / Learner ← Policy Box added here

The frontend now follows a clean "size→class→route_kind→switch" pattern (see the sketch below).
ENV variables are read once at Policy init, not scattered across the frontend.

Future: ULTRA/MID_v3/LEGACY consolidation, Learner integration, flexible priority.
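
A minimal sketch of that "size→class→route_kind→switch" pattern, assuming a per-class snapshot filled at Policy init; the enum values mirror the ones named in this commit, while the helper functions and table are hypothetical stand-ins for the real Policy Box API.

```c
/* Sketch of the frontend pattern: ENV is read once into a per-class route
 * snapshot at init; the hot path only does a table lookup plus a switch. */
#include <stddef.h>

typedef enum {
    SMALL_ROUTE_ULTRA,
    SMALL_ROUTE_V7,
    SMALL_ROUTE_MID_V3,
    SMALL_ROUTE_LEGACY,
} small_route_kind;

static small_route_kind g_route_for_class[8];   /* filled once at policy init */

extern int   size_to_class(size_t size);        /* assumed existing helpers */
extern void *alloc_ultra(int cls, size_t size);
extern void *alloc_v7(int cls, size_t size);
extern void *alloc_mid_v3(int cls, size_t size);
extern void *alloc_legacy(int cls, size_t size);

static inline void *small_alloc_front(size_t size) {
    int cls = size_to_class(size);              /* caller guarantees cls in 0..7 */
    switch (g_route_for_class[cls]) {
    case SMALL_ROUTE_ULTRA:  return alloc_ultra(cls, size);
    case SMALL_ROUTE_V7:     return alloc_v7(cls, size);
    case SMALL_ROUTE_MID_V3: return alloc_mid_v3(cls, size);
    default:                 return alloc_legacy(cls, size);
    }
}
```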

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-12 03:50:58 +09:00
39a3c53dbc Phase v7-2: SmallObject v7 C6-only implementation with RegionIdBox integration
- SmallSegment_v7: 2MiB segment with TLS slot and free page stack
- ColdIface_v7: Page refill/retire between HotBox and SegmentBox
- HotBox_v7: Full C6-only alloc/free with header writing (HEADER_MAGIC|class_idx)
- Free path early-exit: Check v7 route BEFORE ss_fast_lookup (separate mmap segment)
- RegionIdBox: Register v7 segment for ptr->region lookup
- Benchmark: v7 ON ~54.5M ops/s (-7% overhead vs 58.6M legacy baseline)

v7 correctly balances alloc/free counts and page lifecycle.
RegionIdBox overhead identified as primary cost driver.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 03:12:28 +09:00
7bb179df6c Fix: Add core/mid_hotbox_v3.o to BENCH_HAKMEM_OBJS_BASE
core/mid_hotbox_v3.o was missing from BENCH_HAKMEM_OBJS_BASE, causing
linker errors. Added it after core/region_id_v6.o.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-12 01:06:30 +09:00
510cf338f3 MID-V3-6: hakmem.c integration (box modularization)
Integrate MID/Pool v3 into hakmem.c main allocation path using
box modularization pattern.

Changes:
- core/hakmem.c: Include MID v3 headers
- core/box/hak_alloc_api.inc.h: Add v3 allocation gate
  - C6 (145-256B) and C7 (769-1024B) size classes
  - ENV opt-in via HAKMEM_MID_V3_ENABLED + HAKMEM_MID_V3_CLASSES
  - Priority: v6 > v3 > v4 > pool
- core/box/hak_free_api.inc.h: Add v3 free path
  - RegionIdBox lookup based ownership check
- Makefile: Add core/mid_hotbox_v3.o to TINY_BENCH_OBJS_BASE

ENV controls (default OFF):
  HAKMEM_MID_V3_ENABLED=1
  HAKMEM_MID_V3_CLASSES=0x40  (C6)
  HAKMEM_MID_V3_CLASSES=0x80  (C7)
  HAKMEM_MID_V3_DEBUG=1

Verified with bench_mid_large_mt_hakmem (7-9M ops/s, no crashes)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 01:04:55 +09:00
710541b69e MID-V3 Phase 3-5: RegionId integration, alloc/free implementation
- MID-V3-3: RegionId integration (page registration at carve)
  - mid_segment_v3_carve_page(): Register with RegionIdBox
  - mid_segment_v3_return_page(): Unregister from RegionIdBox
  - Uses REGION_KIND_MID_V3 for region identification

- MID-V3-4: Allocation fast path implementation
  - mid_hot_v3_alloc_slow(): Slow path for lane miss
  - mid_cold_v3_refill_page(): Segment-based page allocation
  - mid_lane_refill_from_page(): Batch transfer (16 items default)
  - mid_page_build_freelist(): Initial freelist construction

- MID-V3-5: Free/cold path implementation
  - mid_hot_v3_free(): RegionIdBox lookup based free
  - mid_page_push_free(): Page freelist push
  - Local/remote page detection via lane ownership

ENV controls (default OFF):
  HAKMEM_MID_V3_ENABLED=1
  HAKMEM_MID_V3_CLASSES=0xC0 (C6+C7)
  HAKMEM_MID_V3_DEBUG=1

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:53:42 +09:00
df216b6901 Phase V6-HDR-3: SmallSegmentV6 real allocation & RegionIdBox registration
Implementation:
1. SmallSegmentV6 mmap allocation was already implemented in v6-0
2. small_heap_ctx_v6() now calls region_id_register_v6_segment() when acquiring a segment
3. TLS-scoped segment registration logic implemented in region_id_v6.c:
   - Segment info cached in four static __thread variables
   - region_id_register_v6_segment(): records segment base/end in TLS
   - region_id_lookup_v6(): runs the TLS segment range check first
   - TLS cache updates give O(1) lookup
4. region_id_v6_box.h: include the SmallSegmentV6 type & add function declarations
5. small_v6_region_observe_validate() now calls region_id_observe_lookup()

Effects:
- With the headerless design, RegionIdBox can now officially return the SMALL_V6 classification
- Concise TLS-scoped registration mechanism (multi-thread ready)
- Fast path: TLS segment range check -> page_meta lookup
- Fallback path: dynamic detection via the existing small_page_meta_v6_of()
- Latency: the O(1) TLS cache hit covers the bulk of v6 alloc/free calls

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 23:51:48 +09:00
9fb2240319 Fix: Add alloc_gate_stats_box.o to BENCH_HAKMEM_OBJS_BASE; Document PERF-ULTRA-REBASE-4 findings
Phase PERF-ULTRA-REBASE-4 confirmed:
- dispatcher (25.48%) and alloc gate (21.13%) already heavily optimized via snapshot
- New bottleneck: C7 ULTRA refill path (tiny_c7_ultra_page_of at 1.78%)
- Recommendation: Next optimize C7 ULTRA refill for +1-2% overall gain
2025-12-11 21:36:58 +09:00
0f15adae4e Phase ALLOC-GATE-OPT-1: tiny_alloc_gate_fast statistics instrumentation
- Added AllocGateStats struct (size2class/route/env/class distribution)
- Embedded counters in malloc_tiny_fast
- ENV: HAKMEM_ALLOC_GATE_STATS (default 0)
- No behavior change (measurement only)

Measurement results:
- Mixed: total=542k, size2class=0, route_calls=0, env_checks=275k, C4-C7=95.2%
  - size_to_class/route_for_class are fully eliminated (LUT effect)
  - C4-C7 accounts for 95% → the ULTRA fast path is effective
  - env_checks ≈ c7_calls → the C7 ULTRA ENV gate is hit on every call
- C6-heavy: total=11 → malloc_tiny_fast is almost never taken (mid/pool dominate)

Conclusion:
- The alloc gate is already well optimized (reduced via LUT + ULTRA)
- Little optimization headroom remains (env_checks are already lightweight; a few percent at best)
- The next phase should target other bottlenecks such as the free dispatcher (29%) and C7 ULTRA refill (7%)

Details: docs/analysis/ALLOC_GATE_ANALYSIS.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 21:32:40 +09:00
118c0e4857 Phase FREE-DISPATCHER-OPT-1: free dispatcher statistics instrumentation
**Goal**: break the free dispatcher (29%) down into finer-grained measurements.

**Implementation**:
- Added FreeDispatchStats struct (ENV: HAKMEM_FREE_DISPATCH_STATS, default 0)
- Counters: total_calls / domain (tiny/mid/large) / route (ultra/legacy/pool/v6) / env_checks / route_for_class_calls
- Embedded counters in hak_free_at / tiny_route_for_class / tiny_route_snapshot_init
- No behavior change (measurement only; zero overhead when the ENV is OFF)

**Measurement results**:

Mixed 16-1024B (1M iter, ws=400):
- total=8,081, route_calls=267,967, env_checks=9
- Most calls return early thanks to BENCH_FAST_FRONT
- route_for_class is called mainly on the alloc side (267k calls vs 8k frees)
- ENV checks happen only 9 times, at initialization (snapshot effect)

C6-heavy (257-768B, 1M iter, ws=400):
- total=500,099, route_calls=1,034, env_checks=9
- Many frees reach fg_classify_domain
- route_for_class calls are negligible (snapshot effect)

**Conclusion**:
- ENV checks are already well optimized (initialization only)
- route_for_class is mostly called on the alloc side; the free side is O(1) via the snapshot
- The next phase (OPT-2) should explore a different approach

**Documentation added**:
- docs/analysis/FREE_DISPATCHER_ANALYSIS.md (new)
- Added a Phase FREE-DISPATCHER-OPT-1 section to CURRENT_TASK.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-11 21:21:40 +09:00
fb88725a43 Phase FREE-LEGACY-OPT-6: C4 ULTRA Implementation
Implement C4 ULTRA free TLS cache with parasitic free+alloc pattern,
achieving 99.7-99.9% elimination of C4 legacy fallback calls.

Key Features:
- TLS cache cap=64 (tuned for L1 cache fit, smaller than C5/C6's 128)
- Segment learning via ss_fast_lookup() on first free
- Free-side cache push + alloc-side TLS pop pattern
- ENV gate: HAKMEM_TINY_C4_ULTRA_FREE_ENABLED (default OFF)
- Full FREE_PATH_STATS instrumentation
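
A minimal sketch of the free-side push / alloc-side pop pattern described above; the capacity of 64 comes from this message, everything else (struct layout, names) is an illustrative assumption.

```c
/* Sketch: fixed-capacity per-thread cache; free() pushes pointers, the
 * next alloc of the same class pops them, bypassing the legacy path. */
#include <stddef.h>

#define C4_ULTRA_CAP 64   /* per this commit: tuned for L1 fit */

typedef struct {
    void    *slots[C4_ULTRA_CAP];
    unsigned count;
} TinyC4UltraTLS;

static __thread TinyC4UltraTLS c4_tls;

static inline int c4_ultra_free_push(void *p) {   /* free side */
    if (c4_tls.count >= C4_ULTRA_CAP) return 0;   /* full -> legacy fallback */
    c4_tls.slots[c4_tls.count++] = p;
    return 1;
}

static inline void *c4_ultra_alloc_pop(void) {    /* alloc side */
    return c4_tls.count ? c4_tls.slots[--c4_tls.count] : NULL;
}
```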

Benchmark Results:
C4-heavy (65-128B range):
  - C4 legacy: 591,583 → 1,711 (-99.7%)
  - c4_ultra cache hits: ~599k (free) + ~599k (alloc)
  - Mixed load: 340,732 → 284 C4 legacy (-99.9%)

Legacy fallback reduction:
  - C4-heavy: 589,872 fewer legacy calls (-10.9% total)
  - Mixed: 340,448 fewer C4 legacy calls (-12.8% in mixed)

Performance note: ~2% throughput cost in isolated C4-heavy case,
acceptable tradeoff for 99%+ legacy elimination per class.

Files:
  NEW: core/box/tiny_c4_ultra_free_box.h/c
  NEW: core/box/tiny_c4_ultra_free_env_box.h
  MOD: core/box/tiny_ultra_classes_box.h (added C4 macros)
  MOD: core/box/free_path_stats_box.h/c (C4 ULTRA counters)
  MOD: core/front/malloc_tiny_fast.h (C4 alloc+free integration)
  MOD: Makefile (added C4 ULTRA object)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:38:27 +09:00
ea6ed1a6e4 Phase FREE-LEGACY-OPT-5-1/5-2: C5 ULTRA free+alloc integration
Summary:
========
Implemented C5 ULTRA TLS cache pattern following the successful C6 ULTRA design:
- Phase 5-1: Free-side TLS cache + segment learning
- Phase 5-2: Alloc-side TLS pop for complete free+alloc cycle integration

Targets C5 class (129-256B) as next legacy reduction after C6 completion.

Key Changes:
============

1. NEW FILES:
   - core/box/tiny_c5_ultra_free_box.h: C5 ULTRA TLS cache structure
   - core/box/tiny_c5_ultra_free_box.c: C5 free path implementation (same pattern as C6)
   - core/box/tiny_c5_ultra_free_env_box.h: ENV gating (HAKMEM_TINY_C5_ULTRA_FREE_ENABLED)

2. MODIFIED FILES:
   - core/front/malloc_tiny_fast.h:
     * Added C5 ULTRA includes
     * Added C5 alloc-side TLS pop at lines 186-194 (integrated with C6)
     * Added C5 free path at lines 333-337 (integrated with C6)

   - core/box/tiny_ultra_classes_box.h:
     * Added TINY_CLASS_C5 constant
     * Added tiny_class_is_c5() macro
     * Extended tiny_class_is_ultra() to include C5

   - core/box/free_path_stats_box.h:
     * Added c5_ultra_free_fast counter
     * Added c5_ultra_alloc_hit counter

   - core/box/free_path_stats_box.c:
     * Updated stats dump to output C5 counters

   - Makefile:
     * Added core/box/tiny_c5_ultra_free_box.o to all object lists

3. Design Rationale:
   - Exact copy of C6 ULTRA pattern (proven effective)
   - TLS cache capacity: 128 blocks (same as C6 for consistency)
   - Segment learning on first C5 free via ss_fast_lookup()
   - Alloc-side pop integrated directly in malloc_tiny_fast.h hotpath
   - Legacy fallback unification via tiny_legacy_fallback_free_base()

4. Expected Impact:
   - C5 legacy calls: 68,871 → 0 (100% elimination)
   - Total legacy reduction: ~53% of remaining 129,623
   - Mixed workload: Minimal regression (C5 is smaller class, fewer allocations)

5. Stats Collection:
   Run with: HAKMEM_TINY_C5_ULTRA_FREE_ENABLED=1 HAKMEM_FREE_PATH_STATS=1 ./bench_allocators_hakmem

   Expected output:
   [FREE_PATH_STATS] ... c5_ultra_free=68871 c5_ultra_alloc=68871 ... legacy_fb=60752 ...
   [FREE_PATH_STATS_LEGACY_BY_CLASS] ... c5=0 ...

Status:
=======
- Code: COMPLETE (3 new files + 5 modified files)
- Compilation: Verified (no errors, only unused variable warnings unrelated to C5)
- Functionality: Ready to benchmark (ENV gating: default OFF, opt-in via ENV)

Phase Progression:
==================
- Phase 4-4: C6 ULTRA free+alloc (legacy C6: 137,319 → 0)
- Phase 5-1/5-2: C5 ULTRA free+alloc (legacy C5: 68,871 → 0 expected)
- Phase 4.5: C4 ULTRA (34,727 remaining)
📋 Future: C3/C2 ULTRA if beneficial

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:26:51 +09:00
7b7de53167 Phase FREE-FRONT-V3-1: Free route snapshot infrastructure + build fix
Summary:
========
Implemented Phase FREE-FRONT-V3 infrastructure to optimize free hotpath by:
1. Creating snapshot-based route decision table (consolidating route logic)
2. Removing redundant ENV checks from hot path
3. Preparing for future integration into hak_free_at()

Key Changes:
============

1. NEW FILES:
   - core/box/free_front_v3_env_box.h: Route snapshot definition & API
   - core/box/free_front_v3_env_box.c: Snapshot initialization & caching

2. Infrastructure Details:
   - FreeRouteSnapshotV3: Maps class_idx → free_route_kind for all 8 classes
   - Routes defined: LEGACY, TINY_V3, CORE_V6_C6, POOL_V1
   - ENV-gated initialization (HAKMEM_TINY_FREE_FRONT_V3_ENABLED, default OFF)
   - Per-thread TLS caching to avoid repeated ENV reads

3. Design Goals:
   - Consolidate tiny_route_for_class() results into snapshot table
   - Remove C7 ULTRA / v4 / v5 / v6 ENV checks from hot path
   - Limit lookup (ss_fast_lookup/slab_index_for) to paths that truly need it
   - Clear ownership boundary: front v3 handles routing, downstream handles free

4. Phase Plan:
   - v3-1 COMPLETE: Infrastructure (snapshot table, ENV initialization, TLS cache)
   - v3-2 (INFRASTRUCTURE ONLY): Placeholder integration in hak_free_api.inc.h
   - v3-3 (FUTURE): Full integration + benchmark A/B to measure hotpath improvement

5. BUILD FIX:
   - Added missing core/box/c7_meta_used_counter_box.o to OBJS_BASE in Makefile
   - This symbol was referenced but not linked, causing undefined reference errors
   - Benchmark targets now build cleanly without LTO

Status:
=======
- Build: PASS (bench_allocators_hakmem builds without errors)
- Integration: Currently DISABLED (default OFF, ready for v3-2 phase)
- No performance impact: Infrastructure-only, hotpath unchanged

Future Work:
============
- Phase v3-2: Integrate snapshot routing into hak_free_at() main path
- Phase v3-3: Measure free hotpath performance improvement (target: 1-2% less branch mispredict)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:17:30 +09:00
1b196b3ac0 Phase FREE-LEGACY-OPT-4-2/4-3: C6 ULTRA-free TLS cache + segment learning
Phase 4-2:
- Add TinyC6UltraFreeTLS structure with 128-slot TLS freelist
- Implement tiny_c6_ultra_free_fast/slow for C6 free hot path
- Add c6_ultra_free_fast counter to FreePathStats
- ENV gate: HAKMEM_TINY_C6_ULTRA_FREE_ENABLED (default: OFF)

Phase 4-3:
- Add segment learning on first C6 free via ss_fast_lookup()
- Learn seg_base/seg_end from SuperSlab for range check
- Increase cache capacity from 32 to 128 blocks

Results:
- Segment learning works: fast path captures blocks in segment
- However, without alloc integration, cache fills up and overflows to legacy
- Net effect: +1-3% (within noise range)
- Drain strategy also tested: no benefit (equal overhead)

Conclusion:
- Free-only TLS cache is limited without alloc-side integration
- Core v6 already has alloc/free integrated TLS (but -12% slower)
- Keep as research box (ENV default OFF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 18:34:27 +09:00
c60199182e Phase v6-1/2/3/4: SmallObject Core v6 - C6-only implementation + refactor
Phase v6-1: C6-only route stub (v1/pool fallback)
Phase v6-2: Segment v6 + ColdIface v6 + Core v6 HotPath implementation
  - 2MiB segment / 64KiB page allocation
  - O(1) ptr→page_meta lookup with segment masking
  - C6-heavy A/B: SEGV-free but -44% performance (15.3M ops/s)

Phase v6-3: Thin-layer optimization (TLS ownership check + batch header + refill batching)
  - TLS ownership fast-path skip page_meta for 90%+ of frees
  - Batch header writes during refill (32 allocs = 1 header write)
  - TLS batch refill (1/32 refill frequency)
  - C6-heavy A/B: v6-2 15.3M → v6-3 27.1M ops/s (±0% vs baseline) 

Phase v6-4: Mixed hang fix (segment metadata lookup correction)
  - Root cause: metadata lookup was reading mmap region instead of TLS slot
  - Fix: use TLS slot descriptor with in_use validation
  - Mixed health: 5M iterations SEGV-free, 35.8M ops/s 

Phase v6-refactor: Code quality improvements (macro unification + inline + docs)
  - Add SMALL_V6_* prefix macros (header, pointer conversion, page index)
  - Extract inline validation functions (small_page_v6_valid, small_ptr_in_segment_v6)
  - Doxygen-style comments for all public functions
  - Result: 0 compiler warnings, maintained +1.2% performance

Files:
- core/box/smallobject_core_v6_box.h (new, type & API definitions)
- core/box/smallobject_cold_iface_v6.h (new, cold iface API)
- core/box/smallsegment_v6_box.h (new, segment type definitions)
- core/smallobject_core_v6.c (new, C6 alloc/free implementation)
- core/smallobject_cold_iface_v6.c (new, refill/retire logic)
- core/smallsegment_v6.c (new, segment allocator)
- docs/analysis/SMALLOBJECT_CORE_V6_DESIGN.md (new, design document)
- core/box/tiny_route_env_box.h (modified, v6 route added)
- core/front/malloc_tiny_fast.h (modified, v6 case in route switch)
- Makefile (modified, v6 objects added)
- CURRENT_TASK.md (modified, v6 status added)

Status:
- C6-heavy: v6 OFF 27.1M → v6-3 ON 27.1M ops/s (±0%) 
- Mixed: v6 ON 35.8M ops/s (C6-only, other classes via v1) 
- Build: 0 warnings, fully documented 

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 15:29:59 +09:00
e0fb7d550a Phase v5-2: SmallObject v5 C6-only full implementation (WIP - header fix)
Implementation fixes:
- Added tiny_region_id_write_header(): now returns the USER pointer correctly
- Segment lookup from the TLS slot (page_meta_of)
- Segment reuse via page-level allocation
- 2MiB alignment guaranteed (reserve 4MiB + align)
- Fixed the free-path route (removed the v4 → v5 fallthrough)

Verification:
- SEGV gone: basic alloc/free works
- Performance: ~18-20M ops/s (about 40-45% of the 43-47M baseline)
- Regression cause: O(n) linear TLS slot search, O(n) find_page

Remaining tasks:
- O(1) segment lookup optimization (hash or direct array indexing)
- Remove find_page when the segment lookup succeeds
- Optimize partial_count/list management

ENV default is OFF, so the main path is unaffected.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 04:14:51 +09:00
9c24bebf08 Phase v5-1: Wire up the SmallObject v5 C6-only route stub
- tiny_route_env_box.h: added TINY_ROUTE_SMALL_HEAP_V5 enum; route snapshot branches C6→v5
- malloc_tiny_fast.h: added a v5 case to the alloc/free switch (v1/pool fallback)
- smallobject_hotbox_v5.c: stub implementation (alloc returns NULL, free is a no-op)
- smallobject_hotbox_v5_box.h: added a ctx parameter to function signatures
- Makefile: added core/smallobject_hotbox_v5.o to the link lists
- ENV_PROFILE_PRESETS.md: added the v5-1 preset
- CURRENT_TASK.md: recorded Phase v5-1 completion

**Characteristics**:
- ENV: opt-in via HAKMEM_SMALL_HEAP_V5_ENABLED=1 / HAKMEM_SMALL_HEAP_V5_CLASSES=0x40
- Test results: C6-heavy (v5 OFF 15.5M → v5 ON 16.4M ops/s, healthy), Mixed 47.2M ops/s, no SEGV/asserts
- Behavior matches the v1/pool fallback (the real implementation lands in v5-2)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 03:25:37 +09:00
bbb55b018a Add C7 ULTRA segment skeleton and TLS freelist 2025-12-10 22:19:32 +09:00
cbd33511eb Phase v4-3.1: reuse C7 v4 pages and record prep calls 2025-12-10 17:58:42 +09:00
acc64f2438 Phase ML1: Reduce the 89.73% Pool v1 memset overhead (+15.34% improvement)
## Summary
- Fixed the setenv segfault in bench_profile.h (from ChatGPT: switched to the RTLD_NEXT route)
- Added core/box/pool_zero_mode_box.h: ZERO_MODE managed in one place via an ENV cache
- core/hakmem_pool.c now controls memset according to the zero mode (full/header/off); see the sketch below
- A/B test result: ZERO_MODE=header gives +15.34% improvement (1M iterations, C6-heavy)

## Files Modified
- core/box/pool_api.inc.h: include pool_zero_mode_box.h
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault)
- core/hakmem_pool.c: zero-mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: recorded Phase ML1 results

## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K       | 3.06 M ops/s   | 3.17 M ops/s    | +3.65%     |
| 1M        | 23.71 M ops/s  | 27.34 M ops/s   | **+15.34%** |
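
A minimal sketch of the zero-mode control described in the summary; the three modes come from this commit, while the enum, getter, and function shape are assumptions.

```c
/* Sketch: the ENV-cached zero mode decides how much of a freshly carved
 * pool block gets memset. Enum and function names are illustrative. */
#include <string.h>
#include <stddef.h>

typedef enum { POOL_ZERO_FULL, POOL_ZERO_HEADER, POOL_ZERO_OFF } pool_zero_mode;

extern pool_zero_mode pool_zero_mode_get(void);   /* cached ENV lookup (assumed) */

static inline void pool_zero_block(void *blk, size_t size, size_t header_size) {
    switch (pool_zero_mode_get()) {
    case POOL_ZERO_FULL:   memset(blk, 0, size);        break;
    case POOL_ZERO_HEADER: memset(blk, 0, header_size); break;
    case POOL_ZERO_OFF:    /* caller accepts stale data in the body */ break;
    }
}
```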

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00
a905e0ffdd Guard madvise ENOMEM and stabilize pool/tiny front v3 2025-12-09 21:50:15 +09:00
fda6cd2e67 Boxify superslab registry, add bench profile, and document C7 hotpath experiments 2025-12-07 03:12:27 +09:00
03538055ae Restore C7 Warm/TLS carve for release and add policy scaffolding 2025-12-06 01:34:04 +09:00
3e1d7c3798 Fix debug build after clean reset 2025-12-05 20:43:14 +09:00
093f362231 Add Page Box layer for C7 class optimization
- Implement tiny_page_box.c/h: per-thread page cache between UC and Shared Pool
- Integrate Page Box into Unified Cache refill path
- Remove legacy SuperSlab implementation (merged into smallmid)
- Add HAKMEM_TINY_PAGE_BOX_CLASSES env var for selective class enabling
- Update bench_random_mixed.c with Page Box statistics

Current status: Implementation safe, no regressions.
Page Box ON/OFF shows minimal difference - pool strategy needs tuning.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-05 15:31:44 +09:00
860991ee50 Performance Measurement Framework: Unified Cache, TLS SLL, Shared Pool Analysis
## Summary

Implemented production-grade measurement infrastructure to quantify top 3 bottlenecks:
- Unified cache hit/miss rates + refill cost
- TLS SLL usage patterns
- Shared pool lock contention distribution

## Changes

### 1. Unified Cache Metrics (tiny_unified_cache.h/c)
- Added atomic counters:
  - g_unified_cache_hits_global: successful cache pops
  - g_unified_cache_misses_global: refill triggers
  - g_unified_cache_refill_cycles_global: refill cost in CPU cycles (rdtsc)
- Instrumented `unified_cache_pop_or_refill()` to count hits
- Instrumented `unified_cache_refill()` with cycle measurement
- ENV-gated: HAKMEM_MEASURE_UNIFIED_CACHE=1 (default: off)
- Added unified_cache_print_measurements() output function

### 2. TLS SLL Metrics (tls_sll_box.h)
- Added atomic counters:
  - g_tls_sll_push_count_global: total pushes
  - g_tls_sll_pop_count_global: successful pops
  - g_tls_sll_pop_empty_count_global: empty list conditions
- Instrumented push/pop paths
- Added tls_sll_print_measurements() output function

### 3. Shared Pool Contention (hakmem_shared_pool_acquire.c)
- Added atomic counters:
  - g_sp_stage2_lock_acquired_global: Stage 2 locks
  - g_sp_stage3_lock_acquired_global: Stage 3 allocations
  - g_sp_alloc_lock_contention_global: total lock acquisitions
- Instrumented all pthread_mutex_lock calls in hot paths
- Added shared_pool_print_measurements() output function

### 4. Benchmark Integration (bench_random_mixed.c)
- Called all 3 print functions after benchmark loop
- Functions active only when HAKMEM_MEASURE_UNIFIED_CACHE=1 set

## Design Principles

- **Zero overhead when disabled**: Inline checks with __builtin_expect hints
- **Atomic relaxed memory order**: Minimal synchronization overhead
- **ENV-gated**: Single flag controls all measurements
- **Production-safe**: Compiles in release builds, no functional changes
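
A small sketch of the counter pattern these principles describe; the ENV name comes from this commit, and the counter and helper names are illustrative.

```c
/* Sketch: ENV-gated statistics counter with relaxed atomics, so the
 * disabled case costs one predictable branch and nothing else. */
#include <stdatomic.h>
#include <stdlib.h>

static atomic_ulong g_unified_cache_hits;
static int g_measure_enabled = -1;               /* -1 = not yet read */

static inline int measure_enabled(void) {
    if (__builtin_expect(g_measure_enabled < 0, 0)) {
        const char *e = getenv("HAKMEM_MEASURE_UNIFIED_CACHE");
        g_measure_enabled = (e && *e == '1') ? 1 : 0;
    }
    return g_measure_enabled;
}

static inline void count_cache_hit(void) {
    if (__builtin_expect(measure_enabled(), 0))
        atomic_fetch_add_explicit(&g_unified_cache_hits, 1,
                                  memory_order_relaxed);
}
```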

## Usage

```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```

Output (when enabled):
```
========================================
Unified Cache Statistics
========================================
Hits:        1234567
Misses:      56789
Hit Rate:    95.6%
Avg Refill Cycles: 1234

========================================
TLS SLL Statistics
========================================
Total Pushes:     1234567
Total Pops:       345678
Pop Empty Count:  12345
Hit Rate:         98.8%

========================================
Shared Pool Contention Statistics
========================================
Stage 2 Locks:    123456 (33%)
Stage 3 Locks:    234567 (67%)
Total Contention: 357 locks per 1M ops
```

## Next Steps

1. **Enable measurements** and run benchmarks to gather data
2. **Analyze miss rates**: Which bottleneck dominates?
3. **Profile hottest stage**: Focus optimization on top contributor
4. Possible targets:
   - Increase unified cache capacity if miss rate >5%
   - Profile if TLS SLL is unused (potential legacy code removal)
   - Analyze if Stage 2 lock can be replaced with CAS

## Makefile Updates

Added core/box/tiny_route_box.o to:
- OBJS_BASE (test build)
- SHARED_OBJS (shared library)
- BENCH_HAKMEM_OBJS_BASE (benchmark)
- TINY_BENCH_OBJS_BASE (tiny benchmark)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 18:26:39 +09:00
d5e6ed535c P-Tier + Tiny Route Policy: Aggressive Superslab Management + Safe Routing
## Phase 1: Utilization-Aware Superslab Tiering (Plan B, implemented)

- Add ss_tier_box.h: Classify SuperSlabs into HOT/DRAINING/FREE based on utilization
  - HOT (>25%): Accept new allocations
  - DRAINING (≤25%): Drain only, no new allocs
  - FREE (0%): Ready for eager munmap
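
A minimal sketch of the tier classification, assuming utilization is computed from active vs total blocks; the enum values and the 25% threshold come from this commit, the function shape is an assumption.

```c
/* Sketch: classify a SuperSlab by utilization. FREE -> eager munmap,
 * DRAINING -> no new allocs, HOT -> accepts new allocations. */
typedef enum { SS_TIER_FREE, SS_TIER_DRAINING, SS_TIER_HOT } ss_tier;

static inline ss_tier ss_tier_classify(unsigned active_blocks,
                                       unsigned total_blocks) {
    if (active_blocks == 0) return SS_TIER_FREE;             /* ready for munmap */
    /* DRAINING at <=25% utilization, HOT above. */
    return (active_blocks * 4u <= total_blocks) ? SS_TIER_DRAINING : SS_TIER_HOT;
}
```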

- Enhanced shared_pool_release_slab():
  - Check tier transition after each slab release
  - If tier→FREE: Force remaining slots to EMPTY and call superslab_free() immediately
  - Bypasses LRU cache to prevent registry bloat from accumulating DRAINING SuperSlabs

- Test results (bench_random_mixed_hakmem):
  - 1M iterations:  ~1.03M ops/s (previously passed)
  - 10M iterations:  ~1.15M ops/s (previously: registry full error)
  - 50M iterations:  ~1.08M ops/s (stress test)

## Phase 2: Tiny Front Routing Policy (new Box)

- Add tiny_route_box.h/c: Single 8-byte table for class→routing decisions
  - ROUTE_TINY_ONLY: Tiny front exclusive (no fallback)
  - ROUTE_TINY_FIRST: Try Tiny, fallback to Pool if fails
  - ROUTE_POOL_ONLY: Skip Tiny entirely

- Profiles via HAKMEM_TINY_PROFILE ENV:
  - "hot": C0-C3=TINY_ONLY, C4-C6=TINY_FIRST, C7=POOL_ONLY
  - "conservative" (default): All TINY_FIRST
  - "off": All POOL_ONLY (disable Tiny)
  - "full": All TINY_ONLY (microbench mode)

- A/B test results (ws=256, 100k ops random_mixed):
  - Default (conservative): ~2.90M ops/s
  - hot: ~2.65M ops/s (more conservative)
  - off: ~2.86M ops/s
  - full: ~2.98M ops/s (slightly best)

## Design Rationale

### Registry Pressure Fix (Plan B)
- Problem: DRAINING tier SS occupied registry indefinitely
- Solution: When total_active_blocks→0, immediately free to clear registry slot
- Result: No more "registry full" errors under stress

### Routing Policy Box (new)
- Problem: Tiny front optimization scattered across ENV/branches
- Solution: Centralize routing in single table, select profiles via ENV
- Benefit: Safe A/B testing without touching hot path code
- Future: Integrate with RSS budget/learning layers for dynamic profile switching

## Next Steps (performance optimization)
- Profile Tiny front internals (TLS SLL, FastCache, Superslab backend latency)
- Identify bottleneck between current ~2.9M ops/s and mimalloc ~100M ops/s
- Consider:
  - Reduce shared pool lock contention
  - Optimize unified cache hit rate
  - Streamline Superslab carving logic

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 18:01:25 +09:00
25cb7164c7 Comprehensive legacy cleanup and architecture consolidation
Summary of Changes:

MOVED TO ARCHIVE:
- core/hakmem_tiny_legacy_slow_box.inc → archive/
  * Slow path legacy code preserved for reference
  * Superseded by Gatekeeper Box architecture

- core/superslab_allocate.c → archive/superslab_allocate_legacy.c
  * Legacy SuperSlab allocation implementation
  * Functionality integrated into new Box system

- core/superslab_head.c → archive/superslab_head_legacy.c
  * Legacy slab head management
  * Refactored through Box architecture

REMOVED DEAD CODE:
- Eliminated unused allocation policy variants from ss_allocation_box.c
  * Reduced from 127+ lines of conditional logic to focused implementation
  * Removed: old policy branches, unused allocation strategies
  * Kept: current Box-based allocation path

ADDED NEW INFRASTRUCTURE:
- core/superslab_head_stub.c (41 lines)
  * Minimal stub for backward compatibility
  * Delegates to new architecture

- Enhanced core/superslab_cache.c (75 lines added)
  * Added missing API functions for cache management
  * Proper interface for SuperSlab cache integration

REFACTORED CORE SYSTEMS:
- core/hakmem_super_registry.c
  * Moved registration logic from scattered locations
  * Centralized SuperSlab registry management

- core/hakmem_tiny.c
  * Removed 27 lines of redundant initialization
  * Simplified through Box architecture

- core/hakmem_tiny_alloc.inc
  * Streamlined allocation path to use Gatekeeper
  * Removed legacy decision logic

- core/box/ss_allocation_box.c/h
  * Dramatically simplified allocation policy
  * Removed conditional branches for unused strategies
  * Focused on current Box-based approach

BUILD SYSTEM:
- Updated Makefile for archive structure
- Removed obsolete object file references
- Maintained build compatibility

SAFETY & TESTING:
- All deletions verified: no broken references
- Build verification: RELEASE=0 and RELEASE=1 pass
- Smoke tests: 100% pass rate
- Functional verification: allocation/free intact

Architecture Consolidation:
Before: Multiple overlapping allocation paths with legacy code branches
After:  Single unified path through Gatekeeper Boxes with clear architecture

Benefits:
- Reduced code size and complexity
- Improved maintainability
- Single source of truth for allocation logic
- Better diagnostic/observability hooks
- Foundation for future optimizations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 14:22:48 +09:00
a0a80f5403 Remove legacy redundant code after Gatekeeper Box consolidation
Summary of Deletions:
- Remove core/box/unified_batch_box.c (26 lines)
  * Legacy batch allocation logic superseded by Alloc Gatekeeper Box
  * unified_cache now handles allocation aggregation

- Remove core/box/unified_batch_box.h (29 lines)
  * Header declarations for deprecated unified_batch_box module

- Remove core/tiny_free_fast.inc.h (329 lines)
  * Legacy fast-path free implementation
  * Functionality consolidated into:
    - tiny_free_gate_box.h (Fail-Fast layer + diagnostics)
    - malloc_tiny_fast.h (Free path integration)
    - unified_cache (return to freelist)
  * Code path now routes through Gatekeeper Box for consistency

Build System Updates:
- Update Makefile
  * Remove unified_batch_box.o from OBJS_BASE
  * Remove unified_batch_box_shared.o from SHARED_OBJS
  * Remove unified_batch_box.o from BENCH_HAKMEM_OBJS_BASE

- Update core/hakmem_tiny_phase6_wrappers_box.inc
  * Remove unified_batch_box references
  * Simplify allocation wrapper to use new Gatekeeper architecture

Impact:
- Removes ~385 lines of redundant/superseded code
- Consolidates allocation logic through unified Gatekeeper entry points
- All functionality preserved via new Box-based architecture
- Simplifies codebase and reduces maintenance burden

Testing:
- Build verification: make clean && make RELEASE=0/1
- Smoke tests: All pass (simple_alloc, loop 10M, pool_tls)
- No functional regressions

Rationale:
After implementing Alloc/Free Gatekeeper Boxes with Fail-Fast layers
and Unified Cache type safety, the legacy separate implementations
became redundant. This commit completes the architectural consolidation
and simplifies the allocator codebase.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 12:55:53 +09:00
94f9ea5104 Implement Phase 1: TLS SuperSlab Hint Box for Headerless performance
Design: Cache recently-used SuperSlab references in TLS to accelerate
ptr→SuperSlab resolution in Headerless mode free() path.

## Implementation

### New Box: core/box/tls_ss_hint_box.h
- Header-only Box (4-slot FIFO cache per thread)
- Functions: tls_ss_hint_init(), tls_ss_hint_update(), tls_ss_hint_lookup(), tls_ss_hint_clear()
- Memory overhead: 112 bytes per thread (negligible)
- Statistics API for debug builds (hit/miss counters)
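
A minimal sketch of the 4-slot FIFO hint cache; only the function names and the base ≤ ptr < end invariant come from this commit, the struct layout is assumed.

```c
/* Sketch: per-thread FIFO of recently used SuperSlabs; a hit is valid
 * when base <= ptr < end, a miss falls back to hak_super_lookup(). */
#include <stdint.h>
#include <stddef.h>

typedef struct { uintptr_t base, end; void *ss; } TlsSsHintSlot;
typedef struct { TlsSsHintSlot slot[4]; unsigned next; } TlsSsHintCache;

static __thread TlsSsHintCache g_tls_ss_hint;

static inline void tls_ss_hint_update(void *ss, void *base, size_t len) {
    TlsSsHintSlot *s = &g_tls_ss_hint.slot[g_tls_ss_hint.next];
    s->base = (uintptr_t)base;
    s->end  = (uintptr_t)base + len;
    s->ss   = ss;
    g_tls_ss_hint.next = (g_tls_ss_hint.next + 1) & 3;   /* FIFO rotation */
}

static inline void *tls_ss_hint_lookup(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (unsigned i = 0; i < 4; i++) {
        const TlsSsHintSlot *s = &g_tls_ss_hint.slot[i];
        if (p >= s->base && p < s->end) return s->ss;    /* hit */
    }
    return NULL;                                         /* miss */
}
```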

### Integration Points

1. **Free path** (core/hakmem_tiny_free.inc):
   - Lines 477-481: Fast path hint lookup before hak_super_lookup()
   - Lines 550-555: Second lookup location (fallback path)
   - Expected savings: 10-50 cycles → 2-5 cycles on cache hit

2. **Allocation path** (core/tiny_superslab_alloc.inc.h):
   - Lines 115-122: Linear allocation return path
   - Lines 179-186: Freelist allocation return path
   - Cache update on successful allocation

3. **TLS variable** (core/hakmem_tiny_tls_state_box.inc):
   - `__thread TlsSsHintCache g_tls_ss_hint = {0};`

### Build System

- **Build flag** (core/hakmem_build_flags.h):
  - HAKMEM_TINY_SS_TLS_HINT (default: 0, disabled)
  - Validation: requires HAKMEM_TINY_HEADERLESS=1

- **Makefile**:
  - Removed old ss_tls_hint_box.o (conflicting implementation)
  - Header-only design eliminates compiled object files

### Testing

- **Unit tests** (tests/test_tls_ss_hint.c):
  - 6 test functions covering init, lookup, FIFO rotation, duplicates, clear, stats
  - All tests PASSING

- **Build validation**:
  - Compiles with hint disabled (default)
  - Compiles with hint enabled (HAKMEM_TINY_SS_TLS_HINT=1)

### Documentation

- **Benchmark report** (docs/PHASE1_TLS_HINT_BENCHMARK.md):
  - Implementation summary
  - Build validation results
  - Benchmark methodology (to be executed)
  - Performance analysis framework

## Expected Performance

- **Hit rate**: 85-95% (single-threaded), 70-85% (multi-threaded)
- **Cycle savings**: 80-95% on cache hit (10-50 cycles → 2-5 cycles)
- **Target improvement**: 15-20% throughput increase vs Headerless baseline
- **Memory overhead**: 112 bytes per thread

## Box Theory

**Mission**: Cache hot SuperSlabs to avoid global registry lookup

**Boundary**: ptr → SuperSlab* or NULL (miss)

**Invariant**: hint.base ≤ ptr < hint.end → hit is valid

**Fallback**: Always safe to miss (triggers hak_super_lookup)

**Thread Safety**: TLS storage, no synchronization required

**Risk**: Low (read-only cache, fail-safe fallback, magic validation)

## Next Steps

1. Run full benchmark suite (sh8bench, cfrac, larson)
2. Measure actual hit rate with stats enabled
3. If performance target met (15-20% improvement), enable by default
4. Consider increasing cache slots if hit rate < 80%

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 18:06:24 +09:00
c91602f181 Fix ptr_user_to_base_blind regression: use class-aware base calculation and correct slab index lookup 2025-12-03 12:29:31 +09:00
c2716f5c01 Implement Phase 2: Headerless Allocator Support (Partial)
- Feature: Added HAKMEM_TINY_HEADERLESS toggle (A/B testing)
- Feature: Implemented Headerless layout logic (Offset=0)
- Refactor: Centralized layout definitions in tiny_layout_box.h
- Refactor: Abstracted pointer arithmetic in free path via ptr_conversion_box.h
- Verification: sh8bench passes in Headerless mode (No TLS_SLL_HDR_RESET)
- Known Issue: Regression in Phase 1 mode due to blind pointer conversion logic
2025-12-03 12:11:27 +09:00
49969d2e0f feat(Phase 1-1): Complete getenv elimination from malloc/free hot paths (+39-42% perf)
## Summary
Eliminated all getenv() calls from malloc/free wrappers and allocator hot paths by implementing
constructor-based environment variable caching. This achieves 39-42% performance improvement
(36s → 22s on sh8bench single-thread).

## Performance Impact
- sh8bench 1 thread: 35-36s → 21-22s (+39-42% improvement) 🚀
- sh8bench 8 threads: ~15s (maintained)
- getenv overhead: 36.32% → 0% (completely eliminated)

## Changes

### New Files
- **core/box/tiny_env_box.{c,h}**: Centralized environment variable cache for Tiny allocator
  - Caches 43 environment variables (HAKMEM_TINY_*, HAKMEM_SLL_*, HAKMEM_SS_*, etc.)
  - Constructor-based initialization with atomic CAS for thread safety
  - Inline accessor tiny_env_cfg() for hot path access

- **core/box/wrapper_env_box.{c,h}**: Environment cache for malloc/free wrappers
  - Caches 3 wrapper variables (HAKMEM_STEP_TRACE, HAKMEM_LD_SAFE, HAKMEM_FREE_WRAP_TRACE)
  - Constructor priority 101 ensures early initialization
  - Replaces all lazy-init patterns in wrapper code

### Modified Files
- **Makefile**: Added tiny_env_box.o and wrapper_env_box.o to OBJS_BASE and SHARED_OBJS

- **core/box/hak_wrappers.inc.h**:
  - Removed static lazy-init variables (g_step_trace, ld_safe_mode cache)
  - Replaced with wrapper_env_cfg() lookups (wcfg->step_trace, wcfg->ld_safe_mode)
  - All getenv() calls eliminated from malloc/free hot paths

- **core/hakmem.c**:
  - Added hak_ld_env_init() with constructor for LD_PRELOAD caching
  - Added hak_force_libc_ctor() for HAKMEM_FORCE_LIBC_ALLOC* caching
  - Simplified hak_ld_env_mode() to return cached value only
  - Simplified hak_force_libc_alloc() to use cached values
  - Eliminated all getenv/atoi calls from hot paths

## Technical Details

### Constructor Initialization Pattern
All environment variables are now read once at library load time using __attribute__((constructor)):
```c
__attribute__((constructor(101)))
static void wrapper_env_ctor(void) {
    wrapper_env_init_once();  // Atomic CAS ensures exactly-once init
}
```

### Thread Safety
- Atomic compare-and-swap (CAS) ensures single initialization
- Spin-wait for initialization completion in multi-threaded scenarios
- Memory barriers (memory_order_acq_rel) ensure visibility
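
A minimal sketch of the exactly-once CAS + spin-wait initialization described here, under hypothetical names (the real code lives in wrapper_env_box / tiny_env_box):

```c
/* Sketch: exactly-once initialization with atomic CAS plus a spin-wait,
 * matching the pattern described above. Names are illustrative. */
#include <stdatomic.h>

static atomic_int g_env_init_state;   /* 0 = uninit, 1 = initializing, 2 = ready */

static void wrapper_env_init_once_sketch(void) {
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(&g_env_init_state, &expected, 1,
                                                memory_order_acq_rel,
                                                memory_order_acquire)) {
        /* Winner: read all getenv() values into the cache here. */
        atomic_store_explicit(&g_env_init_state, 2, memory_order_release);
    } else {
        /* Loser: spin until the winner publishes the cache. */
        while (atomic_load_explicit(&g_env_init_state, memory_order_acquire) != 2)
            ;
    }
}
```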

### Hot Path Impact
Before: Every malloc/free → getenv("LD_PRELOAD") + getenv("HAKMEM_STEP_TRACE") + ...
After:  Every malloc/free → Single pointer dereference (wcfg->field)

## Next Optimization Target (Phase 1-2)
Perf analysis reveals libc fallback accounts for ~51% of cycles:
- _int_malloc: 15.04%
- malloc: 9.81%
- _int_free: 10.07%
- malloc_consolidate: 9.27%
- unlink_chunk: 6.82%

Reducing libc fallback from 51% → 10% could yield additional +25-30% improvement.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-12-02 16:16:51 +09:00
f1b7964ef9 Remove unused Mid MT layer 2025-12-01 23:43:44 +09:00
4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added an environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented the ACE allocation paths to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the build configuration so the tracing code is properly linked, resolving a link error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the wrapper's behavior under LD_PRELOAD, in particular its interaction with the safety checks.
  - Enabled debugging flags to prevent unintended fallbacks to libc for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution-flow issues in interception and routing; these temporary logs have since been removed.
  - Created a test program to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues by providing clear insight into the failure pathways.
2025-12-01 16:37:59 +09:00
3a040a545a Refactor: Split monolithic hakmem_shared_pool.c into acquire/release modules
- Split core/hakmem_shared_pool.c into acquire/release modules for maintainability.
- Introduced core/hakmem_shared_pool_internal.h for shared internal API.
- Fixed incorrect function name usage (superslab_alloc -> superslab_allocate).
- Increased SUPER_REG_SIZE to 1M to support large working sets (Phase 9-2 fix).
- Updated Makefile.
- Verified with benchmarks.
2025-11-30 18:11:08 +09:00
87b7d30998 Phase 9: SuperSlab optimization & EMPTY slab recycling (WIP)
Phase 9-1: O(1) SuperSlab lookup optimization
- Created ss_addr_map_box: Hash table (8192 buckets) for O(1) SuperSlab lookup
- Created ss_tls_hint_box: TLS caching layer for SuperSlab hints
- Integrated hash table into registry (init, insert, remove, lookup)
- Modified hak_super_lookup() to use new hash table
- Expected: 50-80 cycles → 10-20 cycles (not verified - SuperSlab disabled by default)
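
A minimal sketch of an 8192-bucket address map for O(1) SuperSlab lookup; the bucket count comes from this commit, while the alignment shift, struct, and names are assumptions.

```c
/* Sketch: open-hashed address map for O(1) ptr -> SuperSlab lookup.
 * Bucket count comes from this commit; alignment shift and chaining
 * are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>

#define SS_MAP_BUCKETS 8192u
#define SS_ALIGN_SHIFT 21            /* assumption: 2MiB-aligned SuperSlabs */

typedef struct SsMapNode {
    uintptr_t base;
    void *ss;
    struct SsMapNode *next;          /* chain for the rare collision */
} SsMapNode;

static SsMapNode *g_ss_map[SS_MAP_BUCKETS];

static inline unsigned ss_map_bucket(uintptr_t base) {
    return (unsigned)((base >> SS_ALIGN_SHIFT) & (SS_MAP_BUCKETS - 1));
}

static inline void *ss_map_lookup(const void *ptr) {
    uintptr_t base = (uintptr_t)ptr & ~(((uintptr_t)1 << SS_ALIGN_SHIFT) - 1);
    for (SsMapNode *n = g_ss_map[ss_map_bucket(base)]; n; n = n->next)
        if (n->base == base) return n->ss;
    return NULL;                      /* miss -> registry scan fallback */
}
```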

Phase 9-2: EMPTY slab recycling implementation
- Created slab_recycling_box: SLAB_TRY_RECYCLE() macro following Box pattern
- Integrated into remote drain (superslab_slab.c)
- Integrated into TLS SLL drain (tls_sll_drain_box.h) with touched slab tracking
- Observable: Debug tracing via HAKMEM_SLAB_RECYCLE_TRACE
- Updated Makefile: Added new box objects to 3 build targets

Known Issues:
- SuperSlab registry exhaustion still occurs (unregistration not working)
- shared_pool_release_slab() may not be removing from g_super_reg[]
- Needs investigation before Phase 9-2 can be completed

Expected Impact (when fixed):
- Stage 1 hit rate: 0% → 80%
- shared_fail events: 4 → 0
- Kernel overhead: 55% → 15%
- Throughput: 16.5M → 25-30M ops/s (+50-80%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 07:16:50 +09:00
181e448b76 Phase 7-Step2: Enable PGO mode for bench builds (compile-time unified gate)
Performance Results (bench_random_mixed, ws=256):
- Step 1 baseline: 80.6 M ops/s (branch hint reversal)
- Step 2 result:   80.3 M ops/s (-0.37%, within noise margin)

Implementation:
- Added -DHAKMEM_TINY_FRONT_PGO=1 to bench_random_mixed_hakmem.o build
- Triggers compile-time mode in tiny_front_config_box.h:
  - TINY_FRONT_UNIFIED_GATE_ENABLED = 1 (constant, not function call)
  - Enables dead code elimination: if (1) { ... } → always taken

Why No Performance Change:
- Step 1 branch hint already optimized the path
- CPU branch predictor learns runtime behavior quickly
- Compile-time constant mainly helps code size, not hot path speed
- Legacy paths already cold after Step 1

Benefits (Non-Performance):
- Cleaner code (compile-time constants vs runtime checks)
- Binary size reduction (dead code elimination possible)
- Foundation for future optimizations (Step 3+)

Code Changes:
- Makefile:606 - Added -DHAKMEM_TINY_FRONT_PGO=1 flag

Expected Impact:
- Current: Neutral performance (within noise)
- Future: Enables legacy path removal (Step 3-7 from Task plan)

Next Steps:
- Step 3+: Remove legacy layers (FastCache/SFC/HeapV2/TLS SLL)
- Expected: Additional 5-10% from dead code elimination

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 16:19:53 +09:00
3daf75e57f Phase 5-Step2: Mid Free Route Box (+28.9x free perf, 1.53x faster than system)
Fix critical 19x free() slowdown in Mid MT allocator (1KB-8KB range).

Root Cause:
- Mid MT registers chunks in MidGlobalRegistry
- Free path searches Pool's mid_desc registry (different registry!)
- Result: 100% lookup failure → 4x cascading lookups → libc fallback

Solution (Box Pattern):
- Created core/box/mid_free_route_box.h
- Try Mid MT registry BEFORE classify_ptr() in free()
- Direct route to mid_mt_free() if found
- Fall through to existing path if not found
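
A minimal sketch of the try-registry-first routing; mid_free_route_try() and mid_mt_free() are named in this commit, the registry and classifier helpers are assumed externals.

```c
/* Sketch of the routing fix: consult the Mid MT registry before the
 * generic classifier, so Mid MT chunks never miss into libc fallback. */
#include <stdbool.h>

extern bool mid_mt_registry_contains(const void *ptr);  /* assumed */
extern void mid_mt_free(void *ptr);
extern void free_via_classify_ptr(void *ptr);           /* existing path */

static inline bool mid_free_route_try(void *ptr) {
    if (mid_mt_registry_contains(ptr)) {   /* found: direct route */
        mid_mt_free(ptr);
        return true;
    }
    return false;                          /* not found: fall through */
}

static void hak_free_sketch(void *ptr) {
    if (ptr && !mid_free_route_try(ptr))
        free_via_classify_ptr(ptr);
}
```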

Performance Results (bench_mid_mt_gap, 1KB-8KB allocs):
- Before: 1.49 M ops/s (19x slower than system malloc)
- After:  41.0 M ops/s (+28.9x improvement)
- vs System malloc: 1.53x faster (41.0 vs 26.8 M ops/s)

Files:
- core/box/mid_free_route_box.h (NEW) - Mid Free Route Box
- core/box/hak_wrappers.inc.h - Add mid_free_route_try() call
- core/hakmem_mid_mt.h - Fix mid_get_min_size() (1024 not 2048)
- bench_mid_mt_gap.c (NEW) - Targeted 1KB-8KB benchmark
- Makefile - Add bench_mid_mt_gap targets

Box Pattern: Single responsibility, clear contract, testable, minimal change

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 14:18:20 +09:00
04186341c1 Phase 4-Step2: Add Hot/Cold Path Box (+7.3% performance)
Implemented Hot/Cold Path separation using Box pattern for Tiny allocations:

Performance Improvement (without PGO):
- Baseline (Phase 26-A):     53.3 M ops/s
- Hot/Cold Box (Phase 4-Step2): 57.2 M ops/s
- Gain: +7.3% (+3.9 M ops/s)

Implementation:
1. core/box/tiny_front_hot_box.h - Ultra-fast hot path (1 branch)
   - Removed range check (caller guarantees valid class_idx)
   - Inline cache hit path with branch prediction hints
   - Debug metrics with zero overhead in Release builds

2. core/box/tiny_front_cold_box.h - Slow cold path (noinline, cold)
   - Refill logic (batch allocation from SuperSlab)
   - Drain logic (batch free to SuperSlab)
   - Error reporting and diagnostics

3. core/front/malloc_tiny_fast.h - Updated to use Hot/Cold Boxes
   - Hot path: tiny_hot_alloc_fast() (1 branch: cache empty check)
   - Cold path: tiny_cold_refill_and_alloc() (noinline, cold attribute)
   - Clear separation improves i-cache locality
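
A minimal sketch of the hot/cold split: a single branch in the hot path, with refill pushed into a noinline, cold function. Names are modeled on this commit but simplified, and the cache layout is an assumption.

```c
/* Sketch: hot path does only the cache-empty check and pop; the cold path
 * (refill, diagnostics) is kept out of line to preserve i-cache locality. */
#include <stddef.h>

#define TINY_HOT_LIKELY(x) __builtin_expect(!!(x), 1)

typedef struct { void **top; void **bottom; } TinyCache;
static __thread TinyCache g_cache[8];

__attribute__((noinline, cold))
static void *tiny_cold_refill_and_alloc(int class_idx) {
    /* Cold path: batch refill from the SuperSlab backend would go here. */
    (void)class_idx;
    return NULL;
}

static inline void *tiny_hot_alloc_fast(int class_idx) {
    TinyCache *c = &g_cache[class_idx];
    if (TINY_HOT_LIKELY(c->top != c->bottom))   /* the single hot-path branch */
        return *--c->top;                       /* cache hit: pop */
    return NULL;                                /* miss: caller goes cold */
}

static void *tiny_alloc_front(int class_idx) {
    void *p = tiny_hot_alloc_fast(class_idx);
    return p ? p : tiny_cold_refill_and_alloc(class_idx);
}
```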

Branch Analysis:
- Baseline: 4-5 branches in hot path (range check + cache check + refill logic mixed)
- Hot/Cold Box: 1 branch in hot path (cache empty check only)
- Reduction: 3-4 branches eliminated from hot path

Design Principles (Box Pattern):
- Single Responsibility: Hot path = cache hit only, Cold path = refill/errors
- Clear Contract: Hot returns NULL on miss, Cold handles miss
- Observable: Debug metrics (TINY_HOT_METRICS_*) gated by NDEBUG
- Safe: Branch prediction hints (TINY_HOT_LIKELY/UNLIKELY)
- Testable: Isolated hot/cold paths, easy A/B testing

PGO Status:
- Temporarily disabled (build issues with __gcov_merge_time_profile)
- Will re-enable PGO in future commit after resolving gcc/lto issues
- Current benchmarks are without PGO (fair A/B comparison)

Other Changes:
- .gitignore: Added *.d files (dependency files, auto-generated)
- Makefile: PGO targets temporarily disabled (show informational message)
- build_pgo.sh: Temporarily disabled (show "PGO paused" message)

Next: Phase 4-Step3 (Front Config Box, target +5-8%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:58:37 +09:00
b51b600e8d Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
Implemented automated Profile-Guided Optimization workflow using Box pattern:

Performance Improvement:
- Baseline:      57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)

Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
   - pgo-tiny-profile: Build instrumented binaries
   - pgo-tiny-collect: Collect .gcda profile data
   - pgo-tiny-build:   Build optimized binaries
   - pgo-tiny-full:    Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability

Design:
- Box化: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)

Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths

Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design

Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
d78baf41ce Phase 3: Remove mincore() syscall completely
Problem:
- mincore() was already disabled by default (DISABLE_MINCORE=1)
- Phase 1b/2 registry-based validation made mincore obsolete
- Dead code (~60 lines) remained with complex #ifdef guards

Solution:
Complete removal of mincore() syscall and related infrastructure:

1. Makefile:
   - Removed DISABLE_MINCORE configuration (lines 167-177)
   - Added Phase 3 comment documenting removal rationale

2. core/box/hak_free_api.inc.h:
   - Removed ~60 lines of mincore logic with TLS page cache
   - Simplified to: int is_mapped = 1;
   - Added comprehensive history comment

3. core/box/external_guard_box.h:
   - Simplified external_guard_is_mapped() from 20 lines to 4 lines
   - Always returns 1 (assume mapped)
   - Added Phase 3 comment

Safety:
Trust internal metadata for all validation:
- SuperSlab registry: validates Tiny allocations (Phase 1b/2)
- AllocHeader: validates Mid/Large allocations
- FrontGate classifier: routes external allocations

Testing:
✓ Build: Clean compilation (no warnings)
✓ Stability: 100/100 test iterations passed (0% crash rate)
✓ Performance: No regression (mincore already disabled)

History:
- Phase 9: Used mincore() for safety
- 2025-11-14: Added DISABLE_MINCORE flag (+10.3% perf improvement)
- Phase 1b/2: Registry-based validation (0% crash rate)
- Phase 3: Dead code cleanup (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 09:04:32 +09:00
6ac6f5ae1b Refactor: Split hakmem_tiny_superslab.c + unified backend exit point
Major refactoring to improve maintainability and debugging:

1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
   - superslab_allocate.c: SuperSlab allocation/deallocation
   - superslab_backend.c: Backend allocation paths (legacy, shared)
   - superslab_ace.c: ACE (Adaptive Cache Engine) logic
   - superslab_slab.c: Slab initialization and bitmap management
   - superslab_cache.c: LRU cache and prewarm cache management
   - superslab_head.c: SuperSlabHead management and expansion
   - superslab_stats.c: Statistics tracking and debugging

2. Created hakmem_tiny_superslab_internal.h for shared declarations

3. Added superslab_return_block() as single exit point for header writing:
   - All backend allocations now go through this helper
   - Prevents bugs where headers are forgotten in some paths
   - Makes future debugging easier

4. Updated Makefile for new file structure

5. Added header writing to ss_legacy_backend_box.c and
   ss_unified_backend_box.c (though not currently linked)

Note: Header corruption bug in Larson benchmark still exists.
Class 1-6 allocations go through TLS refill/carve paths, not backend.
Further investigation needed.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 05:13:04 +09:00
ccbeb935c5 Perf optimization: Disable mincore syscall by default
Problem: mincore() syscall in hak_free_api caused performance overhead.
Perf analysis showed mincore syscall overhead in hot path.

Solution: Change DISABLE_MINCORE default from 0 to 1 in Makefile.
This disables mincore() checks in core/box/hak_free_api.inc.h by default.

Benchmark results (10M iterations × 5 runs, ws=256):
- Before (mincore enabled):  64.61M ops/s (avg)
- After (mincore disabled): 71.30M ops/s (avg)
- Improvement: +10.3% (+6.69M ops/s)

This exceeds Task agent's prediction of +2-3%, showing significant
impact in real-world allocation patterns.

Note: Set DISABLE_MINCORE=0 to re-enable if debugging invalid pointers.

Location: Makefile:173
Perf analysis: commit 53bc92842
2025-11-28 18:00:22 +09:00
dc9e650db3 Tiny Pool redesign: P0.1, P0.3, P1.1, P1.2 - Out-of-band class_idx lookup
This commit implements the first phase of Tiny Pool redesign based on
ChatGPT architecture review. The goal is to eliminate Header/Next pointer
conflicts by moving class_idx lookup out-of-band (to SuperSlab metadata).

## P0.1: C0(8B) class upgraded to 16B
- Size table changed: {16,32,64,128,256,512,1024,2048} (8 classes)
- LUT updated: 1..16 → class 0, 17..32 → class 1, etc.
- tiny_next_off: C0 now uses offset 1 (header preserved)
- Eliminates edge cases for 8B allocations

## P0.3: Slab reuse guard Box (tls_slab_reuse_guard_box.h)
- New Box for draining TLS SLL before slab reuse
- ENV gate: HAKMEM_TINY_SLAB_REUSE_GUARD=1
- Prevents stale pointers when slabs are recycled
- Follows Box theory: single responsibility, minimal API

## P1.1: SuperSlab class_map addition
- Added uint8_t class_map[SLABS_PER_SUPERSLAB_MAX] to SuperSlab
- Maps slab_idx → class_idx for out-of-band lookup
- Initialized to 255 (UNASSIGNED) on SuperSlab creation
- Set correctly on slab initialization in all backends

## P1.2: Free fast path uses class_map
- ENV gate: HAKMEM_TINY_USE_CLASS_MAP=1
- Free path can now get class_idx from class_map instead of Header
- Falls back to Header read if class_map returns invalid value
- Fixed Legacy Backend dynamic slab initialization bug
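
A minimal sketch of the out-of-band class lookup with header fallback; class_map, the 255 sentinel, and SLABS_PER_SUPERSLAB_MAX come from this commit, while the struct shape, array size, and header format are assumptions.

```c
/* Sketch: class_idx comes from the SuperSlab's class_map; fall back to the
 * in-block header when the map entry is still UNASSIGNED. */
#include <stdint.h>

#define SLABS_PER_SUPERSLAB_MAX 64        /* assumption for the sketch */
#define CLASS_MAP_UNASSIGNED    255

typedef struct {
    uint8_t class_map[SLABS_PER_SUPERSLAB_MAX];   /* slab_idx -> class_idx */
    /* ... other SuperSlab metadata ... */
} SuperSlabSketch;

static inline int class_idx_of(const SuperSlabSketch *ss, unsigned slab_idx,
                               const uint8_t *header_byte) {
    uint8_t c = ss->class_map[slab_idx];
    if (c != CLASS_MAP_UNASSIGNED) return (int)c;  /* out-of-band hit */
    return (int)(*header_byte & 0x0F);             /* fallback: header read (assumed format) */
}
```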

## Documentation added
- HAKMEM_ARCHITECTURE_OVERVIEW.md: 4-layer architecture analysis
- TLS_SLL_ARCHITECTURE_INVESTIGATION.md: Root cause analysis
- PTR_LIFECYCLE_TRACE_AND_ROOT_CAUSE_ANALYSIS.md: Pointer tracking
- TINY_REDESIGN_CHECKLIST.md: Implementation roadmap (P0-P3)

## Test results
- Baseline: 70% success rate (30% crash - pre-existing issue)
- class_map enabled: 70% success rate (same as baseline)
- Performance: ~30.5M ops/s (unchanged)

## Next steps (P1.3, P2, P3)
- P1.3: Add meta->active for accurate TLS/freelist sync
- P2: TLS SLL redesign with Box-based counting
- P3: Complete Header out-of-band migration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 13:42:39 +09:00
a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV
- core/tiny_refill.h: BG remote turned into a fixed value
- core/hakmem_tiny_slow.inc: removed BG refs

fprintf debug guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: lock stats (~18 fprintf)
- core/page_arena.c: init/shutdown/stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- Deleted 328 markdown files (old reports and duplicate docs)

Performance check:
- Larson: 52.35M ops/s (52.8M previously; stable)
- No functional impact from the ENV cleanup
- Some debug output remains (to be handled in the next phase)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00
bcfb4f6b59 Remove dead code: UltraHot, RingCache, FrontC23, Class5 Hotpath
(cherry-picked from 225b6fcc7, conflicts resolved)
2025-11-26 12:33:49 +09:00