d9991f39ff
Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
...
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation
Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions
Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration
Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-13 05:35:46 +09:00
1a8652a91a
Phase TLS-UNIFY-3: C6 intrusive freelist implementation (complete)
...
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Singly linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS
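The intrusive LIFO described above can be sketched as follows. This is a minimal illustration, not the code from this commit: `NEXT_OFFSET`, `next_store`, `next_load`, `IntrusiveLifo`, `lifo_push` and `lifo_pop` are hypothetical names, and the use of `memcpy` is an assumption made because a link stored at USER+1 is not pointer-aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constant: the link lives one byte past the USER pointer. */
#define NEXT_OFFSET 1

/* Single place that knows where the link is stored inside a free block.
 * memcpy is used because USER+1 is not suitably aligned for a void*. */
static inline void next_store(void *user, void *next) {
    memcpy((unsigned char *)user + NEXT_OFFSET, &next, sizeof next);
}

static inline void *next_load(const void *user) {
    void *next;
    memcpy(&next, (const unsigned char *)user + NEXT_OFFSET, sizeof next);
    return next;
}

/* The list needs only a head pointer; the links live in the free blocks. */
typedef struct {
    void  *head;
    size_t count;
} IntrusiveLifo;

static inline void lifo_push(IntrusiveLifo *l, void *user) {
    assert(user != NULL);
    next_store(user, l->head);   /* new block points at the old head */
    l->head = user;
    l->count++;
}

static inline void *lifo_pop(IntrusiveLifo *l) {
    void *user = l->head;
    if (user == NULL) return NULL;   /* empty: caller takes the fallback path */
    l->head = next_load(user);
    l->count--;
    return user;
}
```

Compared with an array-backed cache, this design keeps one pointer per class in TLS, at the cost of touching the freed block on every push and pop.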
Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters
A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00
212739607a
Phase v11a-3: MID v3.5 Activation (Build Complete)
...
Integrated MID v3.5 into active code path, making it available for C5/C6/C7 routing.
Key Changes:
- Policy Box: Added SMALL_ROUTE_MID_V35 with ENV gates (HAKMEM_MID_V35_ENABLED, HAKMEM_MID_V35_CLASSES)
- HotBox: Implemented small_mid_v35_alloc/free with TLS-cached page allocation
- Front Gate: Wired MID_V35 routing into malloc_tiny_fast.h (priority: ULTRA > MID_V35 > V7)
- Build: Added core/smallobject_mid_v35.o to all object lists
Architecture:
- Slot sizes: C5=384B, C6=512B, C7=1024B
- Page size: 64KB (170/128/64 slots)
- Integration: ColdIface v2 (refill/retire), Stats v2 (observation), Learner v2 (dormant)
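The slot counts listed under Architecture are consistent with integer division of the page size by the slot size. A small worked sketch, assuming the full 64KB page is carved into slots with no in-page header (`MID_PAGE_BYTES` and `slots_per_page` are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

/* 64KB page, as stated in the commit message. */
#define MID_PAGE_BYTES ((size_t)64 * 1024)

/* Whole slots that fit in one page; any remainder is left unused. */
static size_t slots_per_page(size_t slot_bytes) {
    assert(slot_bytes > 0);
    return MID_PAGE_BYTES / slot_bytes;
}
```

With the stated slot sizes this gives 65536/384 = 170 (256 bytes unused), 65536/512 = 128 and 65536/1024 = 64.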
Status: Build successful, ready for A/B benchmarking
Next: Performance validation (C6-heavy, C5+C6-only, Mixed benchmarks)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 06:52:14 +09:00
0dba67ba9d
Phase v11a-2: Core MID v3.5 implementation - segment, cold iface, stats, learner
...
Implement 5-layer infrastructure for multi-class MID v3.5 (C5-C7, 257B-1KiB):
1. SegmentBox_mid_v3 (L2 Physical)
- core/smallobject_segment_mid_v3.c (9.5 KB)
- 2MiB segments, 64KiB pages (32 per segment)
- Per-class free page stacks (LIFO)
- RegionIdBox registration
- Slots: C5→170, C6→102, C7→64
2. ColdIface_mid_v3 (L2→L1)
- core/box/smallobject_cold_iface_mid_v3_box.h (NEW)
- core/smallobject_cold_iface_mid_v3.c (3.5 KB)
- refill: get page from free stack or new segment
- retire: calculate free_hit_ratio, publish stats, return to stack
- Clean separation: TLS cache for hot path, ColdIface for cold path
3. StatsBox_mid_v3 (L2→L3)
- core/smallobject_stats_mid_v3.c (7.2 KB)
- Circular buffer history (1000 events)
- Per-page metrics: class_idx, allocs, frees, free_hit_ratio_bps
- Periodic aggregation (every 100 retires)
- Learner notification callback
4. Learner v2 (L3)
- core/smallobject_learner_v2.c (11 KB)
- Multi-class aggregation: allocs[8], retire_count[8], avg_free_hit_bps[8]
- Exponential smoothing (90% history + 10% new)
- Per-class efficiency tracking
- Stats snapshot API
- Route decision disabled for v11a-2 (v11b feature)
5. Build Integration
- Modified Makefile: added 4 new .o files (segment, cold_iface, stats, learner)
- Updated box header prototypes
- Clean compilation, all dependencies resolved
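The Learner's smoothing rule in item 4 (90% history + 10% new) can be sketched with integer basis points. This is an illustration only: `LearnerSketch` and `learner_observe` are hypothetical names, and seeding the average with the first sample is an assumption that the commit message does not confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Per-class running averages in basis points (0..10000). */
typedef struct {
    uint32_t avg_free_hit_bps[8];
    uint32_t retire_count[8];
} LearnerSketch;

/* Fold one retired page's free-hit ratio into the class average. */
static void learner_observe(LearnerSketch *l, unsigned class_idx,
                            uint32_t sample_bps) {
    assert(class_idx < 8 && sample_bps <= 10000);
    if (l->retire_count[class_idx]++ == 0) {
        /* Assumption: the first sample seeds the average directly. */
        l->avg_free_hit_bps[class_idx] = sample_bps;
    } else {
        uint32_t old = l->avg_free_hit_bps[class_idx];
        /* 90% history + 10% new sample, in integer arithmetic. */
        l->avg_free_hit_bps[class_idx] = (old * 9u + sample_bps) / 10u;
    }
}
```

Because this runs on page retire, which is a cold-path event, it stays off the alloc/free hot path as the Design Compliance list requires.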
Architecture Decision Implementation:
- v7 remains frozen (C5/C6 research preset)
- MID v3.5 becomes unified 257B-1KiB main path
- Multi-class isolation: per-class free stacks
- Dormant infrastructure: linked but not active (zero overhead)
Performance:
- Build: clean compilation
- Sanity benchmark: 27.3M ops/s (no regression vs v10)
- Memory: ~30MB RSS (baseline maintained)
Design Compliance:
✅ Layer separation: L2 (segment) → L2 (cold iface) → L3 (stats) → L3 (learner)
✅ Hot path clean: alloc/free never touch stats/learner
✅ Backward compatible: existing MID v3 routes unchanged
✅ Transparent: v11a-2 is dormant (no behavior change)
Next Phase (v11a-3):
- Activate C5/C6/C7 routing through MID v3.5
- Connect TLS cache to segment refill
- Verify performance under load
- Then Phase v11a-4: dynamic C5 ratio routing
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 06:37:06 +09:00
8143e8b797
Phase v7-4: Introduce Policy Box (clarify the L3 layer and rebuild the frontend core)
...
- SmallPolicyV7 Box: placed in the L3 Policy layer; centralizes route decisions
- Route kind enum: SMALL_ROUTE_ULTRA / V7 / MID_V3 / LEGACY
- ENV priority (fixed): ULTRA > v7 > MID_v3 > LEGACY
- Frontend integration: v7 routing now goes through the Policy Box (staged migration)
- Legacy compatibility: the existing tiny_route_env_box.h remains in use alongside it
Box Theory layer structure:
- L0: ULTRA (C4-C7, FROZEN)
- L1: SmallObject v7 (research box)
- L1': MID_v3 / LEGACY (fallback)
- L2: Segment / RegionId
- L3: Policy / Stats / Learner ← Policy Box added here
Frontend now follows clean "size→class→route_kind→switch" pattern.
ENV variables read once at Policy init, not scattered across frontend.
Future: ULTRA/MID_v3/LEGACY consolidation, Learner integration, flexible priority.
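The fixed priority and the "decide once, then switch" pattern can be sketched as a per-class table filled at init. This is a hedged illustration: `PolicySketch`, `policy_init` and `policy_route` are hypothetical names, and the per-route class bitmasks are passed in as parameters (an assumption standing in for whatever the Policy Box reads from ENV).

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    ROUTE_LEGACY = 0,
    ROUTE_MID_V3,
    ROUTE_V7,
    ROUTE_ULTRA
} RouteKind;

enum { NUM_CLASSES = 8 };

/* One byte per size class, written once at init and only read afterwards. */
typedef struct {
    uint8_t route_kind[NUM_CLASSES];
} PolicySketch;

/* Each mask has bit i set when class i is enabled for that route.
 * The priority ULTRA > v7 > MID_v3 > LEGACY is resolved here, not per call. */
static void policy_init(PolicySketch *p, uint8_t ultra_mask,
                        uint8_t v7_mask, uint8_t mid_v3_mask) {
    for (unsigned c = 0; c < NUM_CLASSES; c++) {
        uint8_t bit = (uint8_t)(1u << c);
        if (ultra_mask & bit)       p->route_kind[c] = ROUTE_ULTRA;
        else if (v7_mask & bit)     p->route_kind[c] = ROUTE_V7;
        else if (mid_v3_mask & bit) p->route_kind[c] = ROUTE_MID_V3;
        else                        p->route_kind[c] = ROUTE_LEGACY;
    }
}

/* Hot-path query: a single table load, ready for a switch on the result. */
static RouteKind policy_route(const PolicySketch *p, unsigned class_idx) {
    assert(class_idx < NUM_CLASSES);
    return (RouteKind)p->route_kind[class_idx];
}
```

A mask such as 0x40 selects class 6 only and 0xC0 selects classes 6 and 7, matching the `*_CLASSES` values that appear elsewhere in this log.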
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-12 03:50:58 +09:00
39a3c53dbc
Phase v7-2: SmallObject v7 C6-only implementation with RegionIdBox integration
...
- SmallSegment_v7: 2MiB segment with TLS slot and free page stack
- ColdIface_v7: Page refill/retire between HotBox and SegmentBox
- HotBox_v7: Full C6-only alloc/free with header writing (HEADER_MAGIC|class_idx)
- Free path early-exit: Check v7 route BEFORE ss_fast_lookup (separate mmap segment)
- RegionIdBox: Register v7 segment for ptr->region lookup
- Benchmark: v7 ON ~54.5M ops/s (about 7% slower than the 58.6M legacy baseline)
v7 correctly balances alloc/free counts and page lifecycle.
RegionIdBox overhead identified as primary cost driver.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 03:12:28 +09:00
7bb179df6c
Fix: Add core/mid_hotbox_v3.o to BENCH_HAKMEM_OBJS_BASE
...
core/mid_hotbox_v3.o was missing from BENCH_HAKMEM_OBJS_BASE, causing
linker errors. Added it after core/region_id_v6.o.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-12 01:06:30 +09:00
510cf338f3
MID-V3-6: hakmem.c integration (box modularization)
...
Integrate MID/Pool v3 into hakmem.c main allocation path using
box modularization pattern.
Changes:
- core/hakmem.c: Include MID v3 headers
- core/box/hak_alloc_api.inc.h: Add v3 allocation gate
- C6 (145-256B) and C7 (769-1024B) size classes
- ENV opt-in via HAKMEM_MID_V3_ENABLED + HAKMEM_MID_V3_CLASSES
- Priority: v6 > v3 > v4 > pool
- core/box/hak_free_api.inc.h: Add v3 free path
- RegionIdBox lookup based ownership check
- Makefile: Add core/mid_hotbox_v3.o to TINY_BENCH_OBJS_BASE
ENV controls (default OFF):
HAKMEM_MID_V3_ENABLED=1
HAKMEM_MID_V3_CLASSES=0x40 (C6)
HAKMEM_MID_V3_CLASSES=0x80 (C7)
HAKMEM_MID_V3_DEBUG=1
Verified with bench_mid_large_mt_hakmem (7-9M ops/s, no crashes)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 01:04:55 +09:00
710541b69e
MID-V3 Phase 3-5: RegionId integration, alloc/free implementation
...
- MID-V3-3: RegionId integration (page registration at carve)
- mid_segment_v3_carve_page(): Register with RegionIdBox
- mid_segment_v3_return_page(): Unregister from RegionIdBox
- Uses REGION_KIND_MID_V3 for region identification
- MID-V3-4: Allocation fast path implementation
- mid_hot_v3_alloc_slow(): Slow path for lane miss
- mid_cold_v3_refill_page(): Segment-based page allocation
- mid_lane_refill_from_page(): Batch transfer (16 items default)
- mid_page_build_freelist(): Initial freelist construction
- MID-V3-5: Free/cold path implementation
- mid_hot_v3_free(): RegionIdBox lookup based free
- mid_page_push_free(): Page freelist push
- Local/remote page detection via lane ownership
ENV controls (default OFF):
HAKMEM_MID_V3_ENABLED=1
HAKMEM_MID_V3_CLASSES=0xC0 (C6+C7)
HAKMEM_MID_V3_DEBUG=1
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-12 00:53:42 +09:00
df216b6901
Phase V6-HDR-3: SmallSegmentV6 actual allocation & RegionIdBox Registration
...
Implementation:
1. The mmap allocation of SmallSegmentV6 was already implemented in v6-0
2. small_heap_ctx_v6() calls region_id_register_v6_segment() when it acquires a segment
3. Implemented TLS-scoped segment registration logic in region_id_v6.c:
- Cache segment info in four static __thread variables
- region_id_register_v6_segment(): records the segment base/end in TLS
- region_id_lookup_v6(): performs the TLS segment range check first
- O(1) lookup achieved by updating the TLS cache
4. Added the SmallSegmentV6 type include and function declarations to region_id_v6_box.h
5. Added a region_id_observe_lookup() call to small_v6_region_observe_validate()
Effects:
- In the headerless design, RegionIdBox can now officially return the SMALL_V6 classification
- Simple TLS-scoped registration mechanism (multi-thread capable)
- Fast path: TLS segment range check -> page_meta lookup
- Fallback path: dynamic detection via the existing small_page_meta_v6_of()
- Latency: O(1) TLS cache hits cover most v6 alloc/free calls
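The TLS range check on the fast path can be sketched as below. This is an illustration under assumptions, not the contents of region_id_v6.c: it keeps only two thread-local variables (the commit mentions four), and `RegionKind`, `region_register_segment` and `region_lookup` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { REGION_UNKNOWN = 0, REGION_SMALL_V6 } RegionKind;

/* Per-thread cache of the registered segment's address range. */
static _Thread_local uintptr_t tls_seg_base;
static _Thread_local uintptr_t tls_seg_end;

/* Called when the thread acquires its segment. */
static void region_register_segment(void *base, size_t bytes) {
    tls_seg_base = (uintptr_t)base;
    tls_seg_end  = (uintptr_t)base + bytes;
}

/* O(1) check: is ptr inside this thread's registered segment? */
static RegionKind region_lookup(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    /* One unsigned comparison covers both bounds; with nothing registered
     * the range is empty and every pointer misses. */
    if (p - tls_seg_base < tls_seg_end - tls_seg_base)
        return REGION_SMALL_V6;
    return REGION_UNKNOWN;   /* caller falls back to the slower lookup */
}
```

A miss here does not mean the pointer is foreign to v6, only that it is not in this thread's cached segment, which is why a fallback path remains necessary.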
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 23:51:48 +09:00
9fb2240319
Fix: Add alloc_gate_stats_box.o to BENCH_HAKMEM_OBJS_BASE; Document PERF-ULTRA-REBASE-4 findings
...
Phase PERF-ULTRA-REBASE-4 confirmed:
- dispatcher (25.48%) and alloc gate (21.13%) already heavily optimized via snapshot
- New bottleneck: C7 ULTRA refill path (tiny_c7_ultra_page_of at 1.78%)
- Recommendation: Next optimize C7 ULTRA refill for +1-2% overall gain
2025-12-11 21:36:58 +09:00
0f15adae4e
Phase ALLOC-GATE-OPT-1: tiny_alloc_gate_fast statistics instrumentation
...
- Added AllocGateStats struct (size2class/route/env/class distribution)
- Embedded counters in malloc_tiny_fast
- ENV: HAKMEM_ALLOC_GATE_STATS (default 0)
- No behavior change (measurement only)
Measurement results:
- Mixed: total=542k, size2class=0, route_calls=0, env_checks=275k, C4-C7=95.2%
- size_to_class/route_for_class calls are fully eliminated (LUT effect)
- C4-C7 account for 95% → the ULTRA fast path is effective
- env_checks ≈ c7_calls → the C7 ULTRA ENV gate is called every time
- C6-heavy: total=11 → malloc_tiny_fast is almost never reached (mid/pool dominate)
Conclusion:
- The alloc gate is already sufficiently optimized (reduced via LUT + ULTRA)
- Little room remains for further optimization (env_checks is already lightweight; a few percent at most)
- The next phase targets other bottlenecks such as the free dispatcher (29%) and C7 ULTRA refill (7%)
Details: docs/analysis/ALLOC_GATE_ANALYSIS.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 21:32:40 +09:00
118c0e4857
Phase FREE-DISPATCHER-OPT-1: free dispatcher statistics instrumentation
...
**Goal**: Break down and measure the free dispatcher (29%).
**Implementation**:
- Added FreeDispatchStats struct (ENV: HAKMEM_FREE_DISPATCH_STATS, default 0)
- Counters: total_calls / domain (tiny/mid/large) / route (ultra/legacy/pool/v6) / env_checks / route_for_class_calls
- Embedded counters in hak_free_at / tiny_route_for_class / tiny_route_snapshot_init
- No behavior change (measurement only; zero overhead when ENV is OFF)
**Measurement results**:
Mixed 16-1024B (1M iter, ws=400):
- total=8,081, route_calls=267,967, env_checks=9
- Most calls return early due to BENCH_FAST_FRONT
- route_for_class is called mainly on the alloc side (267k calls vs 8k frees)
- ENV checks occur only 9 times, at initialization (snapshot effect)
C6-heavy (257-768B, 1M iter, ws=400):
- total=500,099, route_calls=1,034, env_checks=9
- Many frees reach fg_classify_domain
- route_for_class calls are minimal (snapshot effect)
**Conclusion**:
- ENV checks are already sufficiently optimized (initialization only)
- route_for_class is called mainly on the alloc side; the free side is O(1) via the snapshot
- The next phase (OPT-2) will consider a different approach
**Documentation added**:
- docs/analysis/FREE_DISPATCHER_ANALYSIS.md (new)
- Added a Phase FREE-DISPATCHER-OPT-1 section to CURRENT_TASK.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-11 21:21:40 +09:00
fb88725a43
Phase FREE-LEGACY-OPT-6: C4 ULTRA Implementation
...
Implement C4 ULTRA free TLS cache with parasitic free+alloc pattern,
achieving 99.7-99.9% elimination of C4 legacy fallback calls.
Key Features:
- TLS cache cap=64 (tuned for L1 cache fit, smaller than C5/C6's 128)
- Segment learning via ss_fast_lookup() on first free
- Free-side cache push + alloc-side TLS pop pattern
- ENV gate: HAKMEM_TINY_C4_ULTRA_FREE_ENABLED (default OFF)
- Full FREE_PATH_STATS instrumentation
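The free-side push plus alloc-side pop pattern with a capacity of 64 can be sketched as a small per-thread array stack. This is a simplified illustration with hypothetical names (`UltraCacheSketch`, `g_c4_cache`, `ultra_free_push`, `ultra_alloc_pop`); the segment learning step and the statistics counters are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

enum { C4_CACHE_CAP = 64 };   /* capacity stated in the commit message */

typedef struct {
    void    *slots[C4_CACHE_CAP];
    unsigned count;
} UltraCacheSketch;

/* One cache per thread, so no locking is needed. */
static _Thread_local UltraCacheSketch g_c4_cache;

/* Free side: returns 1 if the block was cached, or 0 if the cache is full
 * and the caller must take the legacy free path instead. */
static int ultra_free_push(UltraCacheSketch *c, void *block) {
    if (c->count >= C4_CACHE_CAP) return 0;
    c->slots[c->count++] = block;
    return 1;
}

/* Alloc side: reuses the most recently freed block, or returns NULL so the
 * caller falls through to the normal allocation path. */
static void *ultra_alloc_pop(UltraCacheSketch *c) {
    if (c->count == 0) return NULL;
    return c->slots[--c->count];
}
```

The alloc-side pop is what keeps the cache from saturating: with only the push half, the 65th consecutive free already overflows to the legacy path.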
Benchmark Results:
C4-heavy (65-128B range):
- C4 legacy: 591,583 → 1,711 (-99.7%)
- c4_ultra cache hits: ~599k (free) + ~599k (alloc)
- Mixed load: 340,732 → 284 C4 legacy (-99.9%)
Legacy fallback reduction:
- C4-heavy: 589,872 fewer legacy calls (-10.9% total)
- Mixed: 340,448 fewer C4 legacy calls (-12.8% in mixed)
Performance note: ~2% throughput cost in isolated C4-heavy case,
acceptable tradeoff for 99%+ legacy elimination per class.
Files:
NEW: core/box/tiny_c4_ultra_free_box.h/c
NEW: core/box/tiny_c4_ultra_free_env_box.h
MOD: core/box/tiny_ultra_classes_box.h (added C4 macros)
MOD: core/box/free_path_stats_box.h/c (C4 ULTRA counters)
MOD: core/front/malloc_tiny_fast.h (C4 alloc+free integration)
MOD: Makefile (added C4 ULTRA object)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:38:27 +09:00
ea6ed1a6e4
Phase FREE-LEGACY-OPT-5-1/5-2: C5 ULTRA free+alloc integration
...
Summary:
========
Implemented C5 ULTRA TLS cache pattern following the successful C6 ULTRA design:
- Phase 5-1: Free-side TLS cache + segment learning
- Phase 5-2: Alloc-side TLS pop for complete free+alloc cycle integration
Targets C5 class (129-256B) as next legacy reduction after C6 completion.
Key Changes:
============
1. NEW FILES:
- core/box/tiny_c5_ultra_free_box.h: C5 ULTRA TLS cache structure
- core/box/tiny_c5_ultra_free_box.c: C5 free path implementation (same pattern as C6)
- core/box/tiny_c5_ultra_free_env_box.h: ENV gating (HAKMEM_TINY_C5_ULTRA_FREE_ENABLED)
2. MODIFIED FILES:
- core/front/malloc_tiny_fast.h:
* Added C5 ULTRA includes
* Added C5 alloc-side TLS pop at lines 186-194 (integrated with C6)
* Added C5 free path at lines 333-337 (integrated with C6)
- core/box/tiny_ultra_classes_box.h:
* Added TINY_CLASS_C5 constant
* Added tiny_class_is_c5() macro
* Extended tiny_class_is_ultra() to include C5
- core/box/free_path_stats_box.h:
* Added c5_ultra_free_fast counter
* Added c5_ultra_alloc_hit counter
- core/box/free_path_stats_box.c:
* Updated stats dump to output C5 counters
- Makefile:
* Added core/box/tiny_c5_ultra_free_box.o to all object lists
3. Design Rationale:
- Exact copy of C6 ULTRA pattern (proven effective)
- TLS cache capacity: 128 blocks (same as C6 for consistency)
- Segment learning on first C5 free via ss_fast_lookup()
- Alloc-side pop integrated directly in malloc_tiny_fast.h hotpath
- Legacy fallback unification via tiny_legacy_fallback_free_base()
4. Expected Impact:
- C5 legacy calls: 68,871 → 0 (100% elimination)
- Total legacy reduction: ~53% of remaining 129,623
- Mixed workload: Minimal regression (C5 is smaller class, fewer allocations)
5. Stats Collection:
Run with: HAKMEM_TINY_C5_ULTRA_FREE_ENABLED=1 HAKMEM_FREE_PATH_STATS=1 ./bench_allocators_hakmem
Expected output:
[FREE_PATH_STATS] ... c5_ultra_free=68871 c5_ultra_alloc=68871 ... legacy_fb=60752 ...
[FREE_PATH_STATS_LEGACY_BY_CLASS] ... c5=0 ...
Status:
=======
- Code: ✅ COMPLETE (3 new files + 5 modified files)
- Compilation: ✅ Verified (no errors, only unused variable warnings unrelated to C5)
- Functionality: Ready to benchmark (ENV gating: default OFF, opt-in via ENV)
Phase Progression:
==================
✅ Phase 4-4: C6 ULTRA free+alloc (legacy C6: 137,319 → 0)
✅ Phase 5-1/5-2: C5 ULTRA free+alloc (legacy C5: 68,871 → 0 expected)
⏳ Phase 4.5: C4 ULTRA (34,727 remaining)
📋 Future: C3/C2 ULTRA if beneficial
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:26:51 +09:00
7b7de53167
Phase FREE-FRONT-V3-1: Free route snapshot infrastructure + build fix
...
Summary:
========
Implemented Phase FREE-FRONT-V3 infrastructure to optimize free hotpath by:
1. Creating snapshot-based route decision table (consolidating route logic)
2. Removing redundant ENV checks from hot path
3. Preparing for future integration into hak_free_at()
Key Changes:
============
1. NEW FILES:
- core/box/free_front_v3_env_box.h: Route snapshot definition & API
- core/box/free_front_v3_env_box.c: Snapshot initialization & caching
2. Infrastructure Details:
- FreeRouteSnapshotV3: Maps class_idx → free_route_kind for all 8 classes
- Routes defined: LEGACY, TINY_V3, CORE_V6_C6, POOL_V1
- ENV-gated initialization (HAKMEM_TINY_FREE_FRONT_V3_ENABLED, default OFF)
- Per-thread TLS caching to avoid repeated ENV reads
3. Design Goals:
- Consolidate tiny_route_for_class() results into snapshot table
- Remove C7 ULTRA / v4 / v5 / v6 ENV checks from hot path
- Limit lookup (ss_fast_lookup/slab_index_for) to paths that truly need it
- Clear ownership boundary: front v3 handles routing, downstream handles free
4. Phase Plan:
- v3-1 ✅ COMPLETE: Infrastructure (snapshot table, ENV initialization, TLS cache)
- v3-2 (INFRASTRUCTURE ONLY): Placeholder integration in hak_free_api.inc.h
- v3-3 (FUTURE): Full integration + benchmark A/B to measure hotpath improvement
5. BUILD FIX:
- Added missing core/box/c7_meta_used_counter_box.o to OBJS_BASE in Makefile
- This symbol was referenced but not linked, causing undefined reference errors
- Benchmark targets now build cleanly without LTO
Status:
=======
- Build: ✅ PASS (bench_allocators_hakmem builds without errors)
- Integration: Currently DISABLED (default OFF, ready for v3-2 phase)
- No performance impact: Infrastructure-only, hotpath unchanged
Future Work:
============
- Phase v3-2: Integrate snapshot routing into hak_free_at() main path
- Phase v3-3: Measure free hotpath performance improvement (target: 1-2% fewer branch mispredictions)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:17:30 +09:00
1b196b3ac0
Phase FREE-LEGACY-OPT-4-2/4-3: C6 ULTRA-free TLS cache + segment learning
...
Phase 4-2:
- Add TinyC6UltraFreeTLS structure with 128-slot TLS freelist
- Implement tiny_c6_ultra_free_fast/slow for C6 free hot path
- Add c6_ultra_free_fast counter to FreePathStats
- ENV gate: HAKMEM_TINY_C6_ULTRA_FREE_ENABLED (default: OFF)
Phase 4-3:
- Add segment learning on first C6 free via ss_fast_lookup()
- Learn seg_base/seg_end from SuperSlab for range check
- Increase cache capacity from 32 to 128 blocks
Results:
- Segment learning works: fast path captures blocks in segment
- However, without alloc integration, cache fills up and overflows to legacy
- Net effect: +1-3% (within noise range)
- Drain strategy also tested: no benefit (equal overhead)
Conclusion:
- Free-only TLS cache is limited without alloc-side integration
- Core v6 already has alloc/free integrated TLS (but -12% slower)
- Keep as research box (ENV default OFF)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 18:34:27 +09:00
c60199182e
Phase v6-1/2/3/4: SmallObject Core v6 - C6-only implementation + refactor
...
Phase v6-1: C6-only route stub (v1/pool fallback)
Phase v6-2: Segment v6 + ColdIface v6 + Core v6 HotPath implementation
- 2MiB segment / 64KiB page allocation
- O(1) ptr→page_meta lookup with segment masking
- C6-heavy A/B: SEGV-free but -44% performance (15.3M ops/s)
Phase v6-3: Thin-layer optimization (TLS ownership check + batch header + refill batching)
- TLS ownership fast-path skip page_meta for 90%+ of frees
- Batch header writes during refill (32 allocs = 1 header write)
- TLS batch refill (1/32 refill frequency)
- C6-heavy A/B: v6-2 15.3M → v6-3 27.1M ops/s (±0% vs baseline) ✅
Phase v6-4: Mixed hang fix (segment metadata lookup correction)
- Root cause: metadata lookup was reading mmap region instead of TLS slot
- Fix: use TLS slot descriptor with in_use validation
- Mixed health: 5M iterations SEGV-free, 35.8M ops/s ✅
Phase v6-refactor: Code quality improvements (macro unification + inline + docs)
- Add SMALL_V6_* prefix macros (header, pointer conversion, page index)
- Extract inline validation functions (small_page_v6_valid, small_ptr_in_segment_v6)
- Doxygen-style comments for all public functions
- Result: 0 compiler warnings, maintained +1.2% performance
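The O(1) ptr→page_meta lookup from Phase v6-2 relies on segments being 2MiB-aligned, so the segment base and page index fall out of mask and shift operations. A minimal sketch with hypothetical names (`SEG_BYTES`, `PAGE_SHIFT`, `seg_base_of`, `page_index_of`), assuming the 2MiB segment / 64KiB page geometry stated above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_BYTES  ((uintptr_t)2 << 20)   /* 2MiB segment, 2MiB-aligned */
#define PAGE_SHIFT 16                     /* 64KiB pages, 32 per segment */

/* Clear the low 21 bits to get the base of the segment containing p. */
static uintptr_t seg_base_of(const void *p) {
    return (uintptr_t)p & ~(SEG_BYTES - 1);
}

/* Offset within the segment, divided by the page size. */
static size_t page_index_of(const void *p) {
    return (size_t)(((uintptr_t)p & (SEG_BYTES - 1)) >> PAGE_SHIFT);
}
```

The page index (0..31) would then select an entry in a per-segment metadata array. The result is only meaningful for pointers known to belong to such a segment, so an ownership check has to come first.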
Files:
- core/box/smallobject_core_v6_box.h (new, type & API definitions)
- core/box/smallobject_cold_iface_v6.h (new, cold iface API)
- core/box/smallsegment_v6_box.h (new, segment type definitions)
- core/smallobject_core_v6.c (new, C6 alloc/free implementation)
- core/smallobject_cold_iface_v6.c (new, refill/retire logic)
- core/smallsegment_v6.c (new, segment allocator)
- docs/analysis/SMALLOBJECT_CORE_V6_DESIGN.md (new, design document)
- core/box/tiny_route_env_box.h (modified, v6 route added)
- core/front/malloc_tiny_fast.h (modified, v6 case in route switch)
- Makefile (modified, v6 objects added)
- CURRENT_TASK.md (modified, v6 status added)
Status:
- C6-heavy: v6 OFF 27.1M → v6-3 ON 27.1M ops/s (±0%) ✅
- Mixed: v6 ON 35.8M ops/s (C6-only, other classes via v1) ✅
- Build: 0 warnings, fully documented ✅
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 15:29:59 +09:00
e0fb7d550a
Phase v5-2: SmallObject v5 C6-only full implementation (WIP - header fix)
...
Implementation fixes:
- Added tiny_region_id_write_header(): correctly returns the USER pointer
- Segment search from the TLS slot (page_meta_of)
- Segment reuse via page-level allocation
- Guaranteed 2MiB alignment (reserve 4MiB, then align)
- Fixed the free path route (removed the fallthrough from v4 to v5)
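The "reserve 4MiB, then align" technique for obtaining a 2MiB-aligned segment can be sketched with POSIX mmap. This is an illustration under assumptions: `segment_map_aligned` is a hypothetical name, and trimming the unused head and tail with munmap is a choice made here; the commit does not say whether the slack is returned.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define SEG_BYTES ((size_t)2 << 20)   /* 2MiB */

/* Map twice the segment size, pick the 2MiB-aligned address inside the
 * mapping, and unmap whatever lies before and after the segment. */
static void *segment_map_aligned(void) {
    size_t reserve = 2 * SEG_BYTES;
    unsigned char *raw = mmap(NULL, reserve, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    uintptr_t start   = (uintptr_t)raw;
    uintptr_t aligned = (start + SEG_BYTES - 1) & ~((uintptr_t)SEG_BYTES - 1);
    size_t head = (size_t)(aligned - start);     /* slack before the segment */
    size_t tail = reserve - head - SEG_BYTES;    /* slack after the segment */

    if (head) munmap(raw, head);
    if (tail) munmap((unsigned char *)aligned + SEG_BYTES, tail);
    return (void *)aligned;
}
```

Over-reserving is needed because mmap only guarantees page alignment; a 4MiB mapping always contains at least one fully aligned 2MiB window.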
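Verification:
- SEGV gone: basic alloc/free works
- Performance: ~18-20M ops/s (about 40-45% of the 43-47M baseline)
- Regression cause: O(n) linear search of TLS slots, O(n) find_page
Remaining tasks:
- O(1) segment lookup optimization (hash or direct array reference)
- Remove find_page (when segment lookup succeeds)
- Optimize partial_count/list management
ENV default is OFF, so the mainline is unaffected.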
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 04:14:51 +09:00
9c24bebf08
Phase v5-1: SmallObject v5 C6-only route stub wiring
...
- tiny_route_env_box.h: added TINY_ROUTE_SMALL_HEAP_V5 enum; route snapshot branches C6→v5
- malloc_tiny_fast.h: added v5 case to the alloc/free switch (v1/pool fallback)
- smallobject_hotbox_v5.c: stub implementation (alloc returns NULL, free is a no-op)
- smallobject_hotbox_v5_box.h: added ctx parameter to function signatures
- Makefile: added core/smallobject_hotbox_v5.o to the link list
- ENV_PROFILE_PRESETS.md: added v5-1 preset
- CURRENT_TASK.md: recorded Phase v5-1 completion
**Characteristics**:
- ENV: opt in with HAKMEM_SMALL_HEAP_V5_ENABLED=1 / HAKMEM_SMALL_HEAP_V5_CLASSES=0x40
- Test results: C6-heavy (v5 OFF 15.5M → v5 ON 16.4M ops/s, OK), Mixed 47.2M ops/s, no SEGV/assert
- Behavior is the same as the v1/pool fallback (the real implementation comes in v5-2)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 03:25:37 +09:00
bbb55b018a
Add C7 ULTRA segment skeleton and TLS freelist
2025-12-10 22:19:32 +09:00
cbd33511eb
Phase v4-3.1: reuse C7 v4 pages and record prep calls
2025-12-10 17:58:42 +09:00
acc64f2438
Phase ML1: Reduce Pool v1 memset overhead (89.73%) (+15.34% improvement)
...
## Summary
- Fixed the setenv segfault in bench_profile.h with help from ChatGPT (switched to going through RTLD_NEXT)
- New core/box/pool_zero_mode_box.h: unified ZERO_MODE management via an ENV cache
- core/hakmem_pool.c: memset control according to zero mode (FULL/header/off)
- A/B test result: ZERO_MODE=header gives +15.34% improvement (1M iterations, C6-heavy)
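The three zero modes (FULL/header/off) can be sketched as a switch around memset. This is an illustration only: `ZeroMode`, `HEADER_BYTES` and `pool_zero_block` are hypothetical names, and the 16-byte header size is an assumption, not a value taken from hakmem_pool.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { ZERO_FULL, ZERO_HEADER, ZERO_OFF } ZeroMode;

/* Assumed header size, for illustration only. */
#define HEADER_BYTES ((size_t)16)

/* Clear as much of a freshly obtained block as the mode asks for. */
static void pool_zero_block(void *block, size_t bytes, ZeroMode mode) {
    switch (mode) {
    case ZERO_FULL:
        memset(block, 0, bytes);   /* whole block: safest, most expensive */
        break;
    case ZERO_HEADER:
        /* Only the leading header region; the payload is left untouched. */
        memset(block, 0, bytes < HEADER_BYTES ? bytes : HEADER_BYTES);
        break;
    case ZERO_OFF:
        break;                     /* caller accepts uninitialized contents */
    }
}
```

The header mode trades a full-block memset for a fixed small one, which is presumably where the gain reported in the A/B result comes from.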
## Files Modified
- core/box/pool_api.inc.h: pool_zero_mode_box.h include
- core/bench_profile.h: glibc setenv → malloc+putenv (avoids segfault)
- core/hakmem_pool.c: zero mode lookup and control logic
- core/box/pool_zero_mode_box.h (new): enum/getter
- CURRENT_TASK.md: recorded Phase ML1 results
## Test Results
| Iterations | ZERO_MODE=full | ZERO_MODE=header | Improvement |
|-----------|----------------|-----------------|------------|
| 10K | 3.06 M ops/s | 3.17 M ops/s | +3.65% |
| 1M | 23.71 M ops/s | 27.34 M ops/s | **+15.34%** |
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 09:08:18 +09:00
a905e0ffdd
Guard madvise ENOMEM and stabilize pool/tiny front v3
2025-12-09 21:50:15 +09:00
fda6cd2e67
Boxify superslab registry, add bench profile, and document C7 hotpath experiments
2025-12-07 03:12:27 +09:00
03538055ae
Restore C7 Warm/TLS carve for release and add policy scaffolding
2025-12-06 01:34:04 +09:00
3e1d7c3798
Fix debug build after clean reset
2025-12-05 20:43:14 +09:00
093f362231
Add Page Box layer for C7 class optimization
...
- Implement tiny_page_box.c/h: per-thread page cache between UC and Shared Pool
- Integrate Page Box into Unified Cache refill path
- Remove legacy SuperSlab implementation (merged into smallmid)
- Add HAKMEM_TINY_PAGE_BOX_CLASSES env var for selective class enabling
- Update bench_random_mixed.c with Page Box statistics
Current status: Implementation safe, no regressions.
Page Box ON/OFF shows minimal difference - pool strategy needs tuning.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-05 15:31:44 +09:00
860991ee50
Performance Measurement Framework: Unified Cache, TLS SLL, Shared Pool Analysis
...
## Summary
Implemented production-grade measurement infrastructure to quantify top 3 bottlenecks:
- Unified cache hit/miss rates + refill cost
- TLS SLL usage patterns
- Shared pool lock contention distribution
## Changes
### 1. Unified Cache Metrics (tiny_unified_cache.h/c)
- Added atomic counters:
- g_unified_cache_hits_global: successful cache pops
- g_unified_cache_misses_global: refill triggers
- g_unified_cache_refill_cycles_global: refill cost in CPU cycles (rdtsc)
- Instrumented `unified_cache_pop_or_refill()` to count hits
- Instrumented `unified_cache_refill()` with cycle measurement
- ENV-gated: HAKMEM_MEASURE_UNIFIED_CACHE=1 (default: off)
- Added unified_cache_print_measurements() output function
### 2. TLS SLL Metrics (tls_sll_box.h)
- Added atomic counters:
- g_tls_sll_push_count_global: total pushes
- g_tls_sll_pop_count_global: successful pops
- g_tls_sll_pop_empty_count_global: empty list conditions
- Instrumented push/pop paths
- Added tls_sll_print_measurements() output function
### 3. Shared Pool Contention (hakmem_shared_pool_acquire.c)
- Added atomic counters:
- g_sp_stage2_lock_acquired_global: Stage 2 locks
- g_sp_stage3_lock_acquired_global: Stage 3 allocations
- g_sp_alloc_lock_contention_global: total lock acquisitions
- Instrumented all pthread_mutex_lock calls in hot paths
- Added shared_pool_print_measurements() output function
### 4. Benchmark Integration (bench_random_mixed.c)
- Called all 3 print functions after benchmark loop
- Functions active only when HAKMEM_MEASURE_UNIFIED_CACHE=1 set
## Design Principles
- **Zero overhead when disabled**: Inline checks with __builtin_expect hints
- **Atomic relaxed memory order**: Minimal synchronization overhead
- **ENV-gated**: Single flag controls all measurements
- **Production-safe**: Compiles in release builds, no functional changes
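The design principles above can be sketched as an ENV-gated counter using relaxed atomics. This is a hedged illustration rather than the code in tiny_unified_cache.c: `g_hits`, `g_measure_enabled`, `measure_enabled`, `count_hit` and `hits_snapshot` are hypothetical names, and `__builtin_expect` is a GCC/Clang builtin.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

static atomic_uint_fast64_t g_hits;
static atomic_int g_measure_enabled = -1;   /* -1: ENV not read yet */

/* Read the ENV flag once and cache the result for later calls. */
static inline int measure_enabled(void) {
    int v = atomic_load_explicit(&g_measure_enabled, memory_order_relaxed);
    if (__builtin_expect(v < 0, 0)) {
        const char *e = getenv("HAKMEM_MEASURE_UNIFIED_CACHE");
        v = (e != NULL && e[0] == '1');
        atomic_store_explicit(&g_measure_enabled, v, memory_order_relaxed);
    }
    return v;
}

/* Hot-path hook: a predicted-not-taken branch when measurement is off. */
static inline void count_hit(void) {
    if (__builtin_expect(measure_enabled(), 0))
        atomic_fetch_add_explicit(&g_hits, 1, memory_order_relaxed);
}

/* Relaxed ordering is enough for statistics read after the benchmark. */
static inline uint64_t hits_snapshot(void) {
    return (uint64_t)atomic_load_explicit(&g_hits, memory_order_relaxed);
}
```

Strictly speaking the disabled case still costs one load and one well-predicted branch, so "zero overhead" is best read as "negligible overhead".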
## Usage
```bash
HAKMEM_MEASURE_UNIFIED_CACHE=1 ./bench_allocators_hakmem bench_random_mixed_hakmem 1000000 256 42
```
Output (when enabled):
```
========================================
Unified Cache Statistics
========================================
Hits: 1234567
Misses: 56789
Hit Rate: 95.6%
Avg Refill Cycles: 1234
========================================
TLS SLL Statistics
========================================
Total Pushes: 1234567
Total Pops: 345678
Pop Empty Count: 12345
Hit Rate: 98.8%
========================================
Shared Pool Contention Statistics
========================================
Stage 2 Locks: 123456 (33%)
Stage 3 Locks: 234567 (67%)
Total Contention: 357 locks per 1M ops
```
## Next Steps
1. **Enable measurements** and run benchmarks to gather data
2. **Analyze miss rates**: Which bottleneck dominates?
3. **Profile hottest stage**: Focus optimization on top contributor
4. Possible targets:
- Increase unified cache capacity if miss rate >5%
- Profile if TLS SLL is unused (potential legacy code removal)
- Analyze if Stage 2 lock can be replaced with CAS
## Makefile Updates
Added core/box/tiny_route_box.o to:
- OBJS_BASE (test build)
- SHARED_OBJS (shared library)
- BENCH_HAKMEM_OBJS_BASE (benchmark)
- TINY_BENCH_OBJS_BASE (tiny benchmark)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 18:26:39 +09:00
d5e6ed535c
P-Tier + Tiny Route Policy: Aggressive Superslab Management + Safe Routing
...
## Phase 1: Utilization-Aware Superslab Tiering (Plan B, implemented)
- Add ss_tier_box.h: Classify SuperSlabs into HOT/DRAINING/FREE based on utilization
- HOT (>25%): Accept new allocations
- DRAINING (≤25%): Drain only, no new allocs
- FREE (0%): Ready for eager munmap
- Enhanced shared_pool_release_slab():
- Check tier transition after each slab release
- If tier→FREE: Force remaining slots to EMPTY and call superslab_free() immediately
- Bypasses LRU cache to prevent registry bloat from accumulating DRAINING SuperSlabs
- Test results (bench_random_mixed_hakmem):
- 1M iterations: ✅ ~1.03M ops/s (previously passed)
- 10M iterations: ✅ ~1.15M ops/s (previously: registry full error)
- 50M iterations: ✅ ~1.08M ops/s (stress test)
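The HOT/DRAINING/FREE thresholds from Phase 1 can be sketched as a pure classification function. This is an illustration with hypothetical names (`SsTier`, `ss_tier_classify`), and it assumes utilization is measured as active blocks over total capacity, which ss_tier_box.h may define differently.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { SS_TIER_FREE, SS_TIER_DRAINING, SS_TIER_HOT } SsTier;

/* Classify a SuperSlab by utilization:
 *   0%      -> FREE      (ready for eager munmap)
 *   <= 25%  -> DRAINING  (no new allocations, wait for frees)
 *   >  25%  -> HOT       (accepts new allocations) */
static SsTier ss_tier_classify(uint32_t active_blocks, uint32_t capacity) {
    assert(capacity > 0 && active_blocks <= capacity);
    if (active_blocks == 0) return SS_TIER_FREE;
    /* active/capacity > 1/4, rewritten without division or floats. */
    if ((uint64_t)active_blocks * 4 > capacity) return SS_TIER_HOT;
    return SS_TIER_DRAINING;
}
```

Checking the tier after each slab release is what lets the release path notice the transition to FREE and unmap immediately.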
## Phase 2: Tiny Front Routing Policy (new Box)
- Add tiny_route_box.h/c: Single 8-byte table for class→routing decisions
- ROUTE_TINY_ONLY: Tiny front exclusive (no fallback)
- ROUTE_TINY_FIRST: Try Tiny, fallback to Pool if fails
- ROUTE_POOL_ONLY: Skip Tiny entirely
- Profiles via HAKMEM_TINY_PROFILE ENV:
- "hot": C0-C3=TINY_ONLY, C4-C6=TINY_FIRST, C7=POOL_ONLY
- "conservative" (default): All TINY_FIRST
- "off": All POOL_ONLY (disable Tiny)
- "full": All TINY_ONLY (microbench mode)
- A/B test results (ws=256, 100k ops random_mixed):
- Default (conservative): ~2.90M ops/s
- hot: ~2.65M ops/s (more conservative)
- off: ~2.86M ops/s
- full: ~2.98M ops/s (slightly best)
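The profile-to-table mapping from Phase 2 can be sketched as a function that fills the 8-byte routing table from a profile name. This is an illustration: `tiny_route_apply_profile` is a hypothetical name, the profile string is passed in rather than read from HAKMEM_TINY_PROFILE, and treating unknown names as "conservative" is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { ROUTE_TINY_ONLY, ROUTE_TINY_FIRST, ROUTE_POOL_ONLY } TinyRoute;

/* Fill one byte per size class (C0..C7) according to the named profile. */
static void tiny_route_apply_profile(uint8_t table[8], const char *profile) {
    for (int c = 0; c < 8; c++) {
        TinyRoute r = ROUTE_TINY_FIRST;   /* "conservative" default */
        if (profile != NULL) {
            if (strcmp(profile, "hot") == 0) {
                /* C0-C3 Tiny only, C4-C6 Tiny first, C7 Pool only. */
                r = (c <= 3) ? ROUTE_TINY_ONLY
                  : (c <= 6) ? ROUTE_TINY_FIRST
                  :            ROUTE_POOL_ONLY;
            } else if (strcmp(profile, "off") == 0) {
                r = ROUTE_POOL_ONLY;
            } else if (strcmp(profile, "full") == 0) {
                r = ROUTE_TINY_ONLY;
            }
        }
        table[c] = (uint8_t)r;
    }
}
```

Because the table is computed once, switching profiles for an A/B run changes only this initialization and leaves the hot path as a single byte load per allocation.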
## Design Rationale
### Registry Pressure Fix (Plan B)
- Problem: DRAINING tier SS occupied registry indefinitely
- Solution: When total_active_blocks→0, immediately free to clear registry slot
- Result: No more "registry full" errors under stress
### Routing Policy Box (new)
- Problem: Tiny front optimization scattered across ENV/branches
- Solution: Centralize routing in single table, select profiles via ENV
- Benefit: Safe A/B testing without touching hot path code
- Future: Integrate with RSS budget/learning layers for dynamic profile switching
## Next Steps (performance optimization)
- Profile Tiny front internals (TLS SLL, FastCache, Superslab backend latency)
- Identify bottleneck between current ~2.9M ops/s and mimalloc ~100M ops/s
- Consider:
- Reduce shared pool lock contention
- Optimize unified cache hit rate
- Streamline Superslab carving logic
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 18:01:25 +09:00
25cb7164c7
Comprehensive legacy cleanup and architecture consolidation
...
Summary of Changes:
MOVED TO ARCHIVE:
- core/hakmem_tiny_legacy_slow_box.inc → archive/
* Slow path legacy code preserved for reference
* Superseded by Gatekeeper Box architecture
- core/superslab_allocate.c → archive/superslab_allocate_legacy.c
* Legacy SuperSlab allocation implementation
* Functionality integrated into new Box system
- core/superslab_head.c → archive/superslab_head_legacy.c
* Legacy slab head management
* Refactored through Box architecture
REMOVED DEAD CODE:
- Eliminated unused allocation policy variants from ss_allocation_box.c
* Reduced from 127+ lines of conditional logic to focused implementation
* Removed: old policy branches, unused allocation strategies
* Kept: current Box-based allocation path
ADDED NEW INFRASTRUCTURE:
- core/superslab_head_stub.c (41 lines)
* Minimal stub for backward compatibility
* Delegates to new architecture
- Enhanced core/superslab_cache.c (75 lines added)
* Added missing API functions for cache management
* Proper interface for SuperSlab cache integration
REFACTORED CORE SYSTEMS:
- core/hakmem_super_registry.c
* Moved registration logic from scattered locations
* Centralized SuperSlab registry management
- core/hakmem_tiny.c
* Removed 27 lines of redundant initialization
* Simplified through Box architecture
- core/hakmem_tiny_alloc.inc
* Streamlined allocation path to use Gatekeeper
* Removed legacy decision logic
- core/box/ss_allocation_box.c/h
* Dramatically simplified allocation policy
* Removed conditional branches for unused strategies
* Focused on current Box-based approach
BUILD SYSTEM:
- Updated Makefile for archive structure
- Removed obsolete object file references
- Maintained build compatibility
SAFETY & TESTING:
- All deletions verified: no broken references
- Build verification: RELEASE=0 and RELEASE=1 pass
- Smoke tests: 100% pass rate
- Functional verification: allocation/free intact
Architecture Consolidation:
Before: Multiple overlapping allocation paths with legacy code branches
After: Single unified path through Gatekeeper Boxes with clear architecture
Benefits:
- Reduced code size and complexity
- Improved maintainability
- Single source of truth for allocation logic
- Better diagnostic/observability hooks
- Foundation for future optimizations
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 14:22:48 +09:00
a0a80f5403
Remove legacy redundant code after Gatekeeper Box consolidation
...
Summary of Deletions:
- Remove core/box/unified_batch_box.c (26 lines)
* Legacy batch allocation logic superseded by Alloc Gatekeeper Box
* unified_cache now handles allocation aggregation
- Remove core/box/unified_batch_box.h (29 lines)
* Header declarations for deprecated unified_batch_box module
- Remove core/tiny_free_fast.inc.h (329 lines)
* Legacy fast-path free implementation
* Functionality consolidated into:
- tiny_free_gate_box.h (Fail-Fast layer + diagnostics)
- malloc_tiny_fast.h (Free path integration)
- unified_cache (return to freelist)
* Code path now routes through Gatekeeper Box for consistency
Build System Updates:
- Update Makefile
* Remove unified_batch_box.o from OBJS_BASE
* Remove unified_batch_box_shared.o from SHARED_OBJS
* Remove unified_batch_box.o from BENCH_HAKMEM_OBJS_BASE
- Update core/hakmem_tiny_phase6_wrappers_box.inc
* Remove unified_batch_box references
* Simplify allocation wrapper to use new Gatekeeper architecture
Impact:
- Removes ~385 lines of redundant/superseded code
- Consolidates allocation logic through unified Gatekeeper entry points
- All functionality preserved via new Box-based architecture
- Simplifies codebase and reduces maintenance burden
Testing:
- Build verification: make clean && make RELEASE=0/1
- Smoke tests: All pass (simple_alloc, loop 10M, pool_tls)
- No functional regressions
Rationale:
After implementing Alloc/Free Gatekeeper Boxes with Fail-Fast layers
and Unified Cache type safety, the legacy separate implementations
became redundant. This commit completes the architectural consolidation
and simplifies the allocator codebase.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 12:55:53 +09:00
94f9ea5104
Implement Phase 1: TLS SuperSlab Hint Box for Headerless performance
...
Design: Cache recently-used SuperSlab references in TLS to accelerate
ptr→SuperSlab resolution in Headerless mode free() path.
## Implementation
### New Box: core/box/tls_ss_hint_box.h
- Header-only Box (4-slot FIFO cache per thread)
- Functions: tls_ss_hint_init(), tls_ss_hint_update(), tls_ss_hint_lookup(), tls_ss_hint_clear()
- Memory overhead: 112 bytes per thread (negligible)
- Statistics API for debug builds (hit/miss counters)
### Integration Points
1. **Free path** (core/hakmem_tiny_free.inc):
- Lines 477-481: Fast path hint lookup before hak_super_lookup()
- Lines 550-555: Second lookup location (fallback path)
- Expected savings: 10-50 cycles → 2-5 cycles on cache hit
2. **Allocation path** (core/tiny_superslab_alloc.inc.h):
- Lines 115-122: Linear allocation return path
- Lines 179-186: Freelist allocation return path
- Cache update on successful allocation
3. **TLS variable** (core/hakmem_tiny_tls_state_box.inc):
- `__thread TlsSsHintCache g_tls_ss_hint = {0};`
### Build System
- **Build flag** (core/hakmem_build_flags.h):
- HAKMEM_TINY_SS_TLS_HINT (default: 0, disabled)
- Validation: requires HAKMEM_TINY_HEADERLESS=1
- **Makefile**:
- Removed old ss_tls_hint_box.o (conflicting implementation)
- Header-only design eliminates compiled object files
### Testing
- **Unit tests** (tests/test_tls_ss_hint.c):
- 6 test functions covering init, lookup, FIFO rotation, duplicates, clear, stats
- All tests PASSING
- **Build validation**:
- ✅ Compiles with hint disabled (default)
- ✅ Compiles with hint enabled (HAKMEM_TINY_SS_TLS_HINT=1)
### Documentation
- **Benchmark report** (docs/PHASE1_TLS_HINT_BENCHMARK.md):
- Implementation summary
- Build validation results
- Benchmark methodology (to be executed)
- Performance analysis framework
## Expected Performance
- **Hit rate**: 85-95% (single-threaded), 70-85% (multi-threaded)
- **Cycle savings**: 80-95% on cache hit (10-50 cycles → 2-5 cycles)
- **Target improvement**: 15-20% throughput increase vs Headerless baseline
- **Memory overhead**: 112 bytes per thread
## Box Theory
**Mission**: Cache hot SuperSlabs to avoid global registry lookup
**Boundary**: ptr → SuperSlab* or NULL (miss)
**Invariant**: hint.base ≤ ptr < hint.end → hit is valid
**Fallback**: Always safe to miss (triggers hak_super_lookup)
**Thread Safety**: TLS storage, no synchronization required
**Risk**: Low (read-only cache, fail-safe fallback, magic validation)
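The Box contract above (range check is the hit test, a miss is always safe) can be sketched as follows. This is an illustration under assumptions: the entry layout, the hint_* names, and the duplicate-skip rule are hypothetical and need not match tls_ss_hint_box.h or its 112-byte footprint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_SS_HINT_SLOTS 4   /* 4-slot FIFO, as described above */

typedef struct {
    uintptr_t base;   /* first byte of the SuperSlab */
    uintptr_t end;    /* one past the last byte */
    void     *ss;     /* opaque SuperSlab pointer */
} HintEntry;

typedef struct {
    HintEntry slot[TLS_SS_HINT_SLOTS];
    uint32_t  next;   /* FIFO write cursor */
} HintCache;

/* Thread-local and zero-initialized: no synchronization needed. */
static _Thread_local HintCache g_hint;

/* Hit iff base <= ptr < end. Empty slots have base == end == 0,
 * so they can never match. NULL means "miss, use the slow lookup". */
static void *hint_lookup(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++) {
        const HintEntry *e = &g_hint.slot[i];
        if (p >= e->base && p < e->end) return e->ss;
    }
    return NULL;
}

/* Record a SuperSlab; the oldest entry is overwritten (FIFO). */
static void hint_update(void *ss, uintptr_t base, size_t size) {
    assert(ss != NULL && size > 0);
    for (int i = 0; i < TLS_SS_HINT_SLOTS; i++)
        if (g_hint.slot[i].ss == ss) return;   /* already cached */
    HintEntry *e = &g_hint.slot[g_hint.next];
    e->base = base;
    e->end  = base + size;
    e->ss   = ss;
    g_hint.next = (g_hint.next + 1) % TLS_SS_HINT_SLOTS;
}
```

In the free path the caller would try hint_lookup() first and fall back to hak_super_lookup() on NULL, so a stale or empty cache costs only the extra range checks.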
## Next Steps
1. Run full benchmark suite (sh8bench, cfrac, larson)
2. Measure actual hit rate with stats enabled
3. If performance target met (15-20% improvement), enable by default
4. Consider increasing cache slots if hit rate < 80%
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 18:06:24 +09:00
c91602f181
Fix ptr_user_to_base_blind regression: use class-aware base calculation and correct slab index lookup
2025-12-03 12:29:31 +09:00
c2716f5c01
Implement Phase 2: Headerless Allocator Support (Partial)
...
- Feature: Added HAKMEM_TINY_HEADERLESS toggle (A/B testing)
- Feature: Implemented Headerless layout logic (Offset=0)
- Refactor: Centralized layout definitions in tiny_layout_box.h
- Refactor: Abstracted pointer arithmetic in free path via ptr_conversion_box.h
- Verification: sh8bench passes in Headerless mode (No TLS_SLL_HDR_RESET)
- Known Issue: Regression in Phase 1 mode due to blind pointer conversion logic
2025-12-03 12:11:27 +09:00
49969d2e0f
feat(Phase 1-1): Complete getenv elimination from malloc/free hot paths (+39-42% perf)
...
## Summary
Eliminated all getenv() calls from malloc/free wrappers and allocator hot paths by implementing
constructor-based environment variable caching. This achieves 39-42% performance improvement
(36s → 22s on sh8bench single-thread).
## Performance Impact
- sh8bench 1 thread: 35-36s → 21-22s (+39-42% improvement) 🚀
- sh8bench 8 threads: ~15s (maintained)
- getenv overhead: 36.32% → 0% (completely eliminated)
## Changes
### New Files
- **core/box/tiny_env_box.{c,h}**: Centralized environment variable cache for Tiny allocator
- Caches 43 environment variables (HAKMEM_TINY_*, HAKMEM_SLL_*, HAKMEM_SS_*, etc.)
- Constructor-based initialization with atomic CAS for thread safety
- Inline accessor tiny_env_cfg() for hot path access
- **core/box/wrapper_env_box.{c,h}**: Environment cache for malloc/free wrappers
- Caches 3 wrapper variables (HAKMEM_STEP_TRACE, HAKMEM_LD_SAFE, HAKMEM_FREE_WRAP_TRACE)
- Constructor priority 101 ensures early initialization
- Replaces all lazy-init patterns in wrapper code
### Modified Files
- **Makefile**: Added tiny_env_box.o and wrapper_env_box.o to OBJS_BASE and SHARED_OBJS
- **core/box/hak_wrappers.inc.h**:
- Removed static lazy-init variables (g_step_trace, ld_safe_mode cache)
- Replaced with wrapper_env_cfg() lookups (wcfg->step_trace, wcfg->ld_safe_mode)
- All getenv() calls eliminated from malloc/free hot paths
- **core/hakmem.c**:
- Added hak_ld_env_init() with constructor for LD_PRELOAD caching
- Added hak_force_libc_ctor() for HAKMEM_FORCE_LIBC_ALLOC* caching
- Simplified hak_ld_env_mode() to return cached value only
- Simplified hak_force_libc_alloc() to use cached values
- Eliminated all getenv/atoi calls from hot paths
## Technical Details
### Constructor Initialization Pattern
All environment variables are now read once at library load time using __attribute__((constructor)):
```c
__attribute__((constructor(101)))
static void wrapper_env_ctor(void) {
    wrapper_env_init_once();  // Atomic CAS ensures exactly-once init
}
```
### Thread Safety
- Atomic compare-and-swap (CAS) ensures single initialization
- Spin-wait for initialization completion in multi-threaded scenarios
- Memory barriers (memory_order_acq_rel) ensure visibility
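A sketch of what a CAS-guarded exactly-once init with a spin-wait could look like in C11 atomics. The names wrapper_env_init_once(), wrapper_env_cfg(), step_trace, and ld_safe_mode appear in this message; the struct layout, the free_wrap_trace field name, the three-state flag, and the env_flag() helper are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Assumed layout; only step_trace and ld_safe_mode are named above. */
typedef struct {
    int step_trace;
    int ld_safe_mode;
    int free_wrap_trace;   /* hypothetical field name */
} wrapper_env_cfg_t;

static wrapper_env_cfg_t g_wcfg;
static atomic_int g_wcfg_state;   /* 0 = uninit, 1 = initializing, 2 = ready */

/* Hypothetical helper: treat any non-zero integer value as "on". */
static int env_flag(const char *name) {
    assert(name != NULL);
    const char *v = getenv(name);
    return v != NULL && atoi(v) != 0;
}

static void wrapper_env_init_once(void) {
    int expected = 0;
    if (atomic_compare_exchange_strong_explicit(
            &g_wcfg_state, &expected, 1,
            memory_order_acq_rel, memory_order_acquire)) {
        /* This thread won the CAS: it is the only one reading the ENV. */
        g_wcfg.step_trace      = env_flag("HAKMEM_STEP_TRACE");
        g_wcfg.ld_safe_mode    = env_flag("HAKMEM_LD_SAFE");
        g_wcfg.free_wrap_trace = env_flag("HAKMEM_FREE_WRAP_TRACE");
        atomic_store_explicit(&g_wcfg_state, 2, memory_order_release);
        return;
    }
    /* Lost the race: spin until the winner publishes the config. */
    while (atomic_load_explicit(&g_wcfg_state, memory_order_acquire) != 2) {
        /* busy-wait */
    }
}

/* Accessor: after init this is one acquire load plus a pointer return. */
static inline const wrapper_env_cfg_t *wrapper_env_cfg(void) {
    if (atomic_load_explicit(&g_wcfg_state, memory_order_acquire) != 2)
        wrapper_env_init_once();
    return &g_wcfg;
}
```

With the constructor running at load time, the state check in the accessor is a safety net for calls that arrive before constructors have run; the real accessor may omit it.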
### Hot Path Impact
Before: Every malloc/free → getenv("LD_PRELOAD") + getenv("HAKMEM_STEP_TRACE") + ...
After: Every malloc/free → Single pointer dereference (wcfg->field)
## Next Optimization Target (Phase 1-2)
Perf analysis reveals libc fallback accounts for ~51% of cycles:
- _int_malloc: 15.04%
- malloc: 9.81%
- _int_free: 10.07%
- malloc_consolidate: 9.27%
- unlink_chunk: 6.82%
Reducing libc fallback from 51% → 10% could yield additional +25-30% improvement.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: ChatGPT <chatgpt@openai.com>
2025-12-02 16:16:51 +09:00
f1b7964ef9
Remove unused Mid MT layer
2025-12-01 23:43:44 +09:00
4ef0171bc0
feat: Add ACE allocation failure tracing and debug hooks
...
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.
Key changes include:
- **ACE Tracing Implementation**:
- Added environment variable to enable/disable detailed logging of allocation failures.
- Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
- Corrected to ensure is properly linked into , resolving an error.
- **LD_PRELOAD Wrapper Adjustments**:
- Investigated and understood the wrapper's behavior under , particularly its interaction with and checks.
- Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
- Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed.
- Created to facilitate testing of the tracing features.
This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00
3a040a545a
Refactor: Split monolithic hakmem_shared_pool.c into acquire/release modules
...
- Split core/hakmem_shared_pool.c into acquire/release modules for maintainability.
- Introduced core/hakmem_shared_pool_internal.h for shared internal API.
- Fixed incorrect function name usage (superslab_alloc -> superslab_allocate).
- Increased SUPER_REG_SIZE to 1M to support large working sets (Phase 9-2 fix).
- Updated Makefile.
- Verified with benchmarks.
2025-11-30 18:11:08 +09:00
87b7d30998
Phase 9: SuperSlab optimization & EMPTY slab recycling (WIP)
...
Phase 9-1: O(1) SuperSlab lookup optimization
- Created ss_addr_map_box: Hash table (8192 buckets) for O(1) SuperSlab lookup
- Created ss_tls_hint_box: TLS caching layer for SuperSlab hints
- Integrated hash table into registry (init, insert, remove, lookup)
- Modified hak_super_lookup() to use new hash table
- Expected: 50-80 cycles → 10-20 cycles (not verified - SuperSlab disabled by default)
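For orientation, a sketch of an address-keyed hash map with 8192 buckets. Only the bucket count comes from this message. The 2 MiB SuperSlab alignment, the intrusive chaining, and all names are assumptions; ss_addr_map_box may be organized quite differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SS_MAP_BUCKETS 8192u   /* bucket count from the commit message */
#define SS_SHIFT       21      /* assumption: 2 MiB-aligned SuperSlabs */
#define SS_MASK        (((uintptr_t)1 << SS_SHIFT) - 1)

typedef struct SsNode {
    uintptr_t      base;       /* aligned SuperSlab base address */
    struct SsNode *next;       /* intrusive chain: no allocation needed */
} SsNode;

static SsNode *g_buckets[SS_MAP_BUCKETS];

static inline uint32_t ss_hash(uintptr_t base) {
    return (uint32_t)((base >> SS_SHIFT) & (SS_MAP_BUCKETS - 1));
}

static void ss_map_insert(SsNode *n) {
    assert(n != NULL && (n->base & SS_MASK) == 0);
    uint32_t h = ss_hash(n->base);
    n->next = g_buckets[h];
    g_buckets[h] = n;
}

/* Mask any interior pointer down to its SuperSlab base, then walk
 * one (usually very short) bucket chain. */
static SsNode *ss_map_lookup(const void *ptr) {
    uintptr_t base = (uintptr_t)ptr & ~SS_MASK;
    for (SsNode *n = g_buckets[ss_hash(base)]; n != NULL; n = n->next)
        if (n->base == base) return n;
    return NULL;
}

static void ss_map_remove(SsNode *n) {
    assert(n != NULL);
    for (SsNode **pp = &g_buckets[ss_hash(n->base)]; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == n) { *pp = n->next; return; }
    }
}
```

The sketch is single-threaded; a shared registry would need locking or atomic chain updates around insert and remove.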
Phase 9-2: EMPTY slab recycling implementation
- Created slab_recycling_box: SLAB_TRY_RECYCLE() macro following Box pattern
- Integrated into remote drain (superslab_slab.c)
- Integrated into TLS SLL drain (tls_sll_drain_box.h) with touched slab tracking
- Observable: Debug tracing via HAKMEM_SLAB_RECYCLE_TRACE
- Updated Makefile: Added new box objects to 3 build targets
Known Issues:
- SuperSlab registry exhaustion still occurs (unregistration not working)
- shared_pool_release_slab() may not be removing from g_super_reg[]
- Needs investigation before Phase 9-2 can be completed
Expected Impact (when fixed):
- Stage 1 hit rate: 0% → 80%
- shared_fail events: 4 → 0
- Kernel overhead: 55% → 15%
- Throughput: 16.5M → 25-30M ops/s (+50-80%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-30 07:16:50 +09:00
181e448b76
Phase 7-Step2: Enable PGO mode for bench builds (compile-time unified gate)
...
Performance Results (bench_random_mixed, ws=256):
- Step 1 baseline: 80.6 M ops/s (branch hint reversal)
- Step 2 result: 80.3 M ops/s (-0.37%, within noise margin)
Implementation:
- Added -DHAKMEM_TINY_FRONT_PGO=1 to bench_random_mixed_hakmem.o build
- Triggers compile-time mode in tiny_front_config_box.h:
- TINY_FRONT_UNIFIED_GATE_ENABLED = 1 (constant, not function call)
- Enables dead code elimination: if (1) { ... } → always taken
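A sketch of the compile-time gate idea, assuming the macro switches between a literal constant and a runtime check. HAKMEM_TINY_FRONT_PGO and TINY_FRONT_UNIFIED_GATE_ENABLED are named above; the runtime variable, the helper, and alloc_path() are hypothetical stand-ins for whatever tiny_front_config_box.h actually does.

```c
#include <assert.h>

#ifndef HAKMEM_TINY_FRONT_PGO
#define HAKMEM_TINY_FRONT_PGO 0
#endif

#if HAKMEM_TINY_FRONT_PGO
/* Bench/PGO build: a literal 1, so `if (...)` folds to "always taken"
 * and the legacy branch becomes dead code the compiler can drop. */
#define TINY_FRONT_UNIFIED_GATE_ENABLED 1
#else
/* Normal build: a runtime decision (hypothetical variable and helper). */
static int g_unified_gate_runtime = 1;
static inline int tiny_front_unified_gate_runtime(void) {
    assert(g_unified_gate_runtime == 0 || g_unified_gate_runtime == 1);
    return g_unified_gate_runtime;
}
#define TINY_FRONT_UNIFIED_GATE_ENABLED tiny_front_unified_gate_runtime()
#endif

/* Returns 1 for the unified path, 0 for the legacy path. */
static int alloc_path(void) {
    if (TINY_FRONT_UNIFIED_GATE_ENABLED) {
        return 1;   /* unified gate */
    }
    return 0;       /* legacy path: unreachable when the macro is 1 */
}
```

Building the same file with -DHAKMEM_TINY_FRONT_PGO=1 changes only the macro expansion; call sites stay identical, which is what makes later removal of the legacy branch mechanical.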
Why No Performance Change:
- Step 1 branch hint already optimized the path
- CPU branch predictor learns runtime behavior quickly
- Compile-time constant mainly helps code size, not hot path speed
- Legacy paths already cold after Step 1
Benefits (Non-Performance):
✅ Cleaner code (compile-time constants vs runtime checks)
✅ Binary size reduction (dead code elimination possible)
✅ Foundation for future optimizations (Step 3+)
Code Changes:
- Makefile:606 - Added -DHAKMEM_TINY_FRONT_PGO=1 flag
Expected Impact:
- Current: Neutral performance (within noise)
- Future: Enables legacy path removal (Step 3-7 from Task plan)
Next Steps:
- Step 3+: Remove legacy layers (FastCache/SFC/HeapV2/TLS SLL)
- Expected: Additional 5-10% from dead code elimination
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 16:19:53 +09:00
3daf75e57f
Phase 5-Step2: Mid Free Route Box (+28.9x free perf, 1.53x faster than system)
...
Fix critical 19x free() slowdown in Mid MT allocator (1KB-8KB range).
Root Cause:
- Mid MT registers chunks in MidGlobalRegistry
- Free path searches Pool's mid_desc registry (different registry!)
- Result: 100% lookup failure → 4x cascading lookups → libc fallback
Solution (Box Pattern):
- Created core/box/mid_free_route_box.h
- Try Mid MT registry BEFORE classify_ptr() in free()
- Direct route to mid_mt_free() if found
- Fall through to existing path if not found
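The ordering fix can be sketched as below. mid_free_route_try() is named in the file list of this commit, but its signature here is assumed, and the registry, the stubs, and free_entry() are simplified stand-ins invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for MidGlobalRegistry: a small table of [base, base+size) chunks. */
typedef struct { uintptr_t base; size_t size; } MidChunk;
static MidChunk g_mid_chunks[16];
static int g_mid_count;

/* Counters so the example can show which route handled each free. */
static int g_mid_frees, g_fallback_frees;

static void mid_register(uintptr_t base, size_t size) {
    assert(g_mid_count < 16);
    g_mid_chunks[g_mid_count].base = base;
    g_mid_chunks[g_mid_count].size = size;
    g_mid_count++;
}

static bool mid_registry_contains(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (int i = 0; i < g_mid_count; i++)
        if (p >= g_mid_chunks[i].base && p - g_mid_chunks[i].base < g_mid_chunks[i].size)
            return true;
    return false;
}

static void mid_mt_free_stub(void *ptr)        { (void)ptr; g_mid_frees++; }
static void existing_free_path_stub(void *ptr) { (void)ptr; g_fallback_frees++; }

/* Returns true if the pointer belonged to Mid MT and was freed there. */
static bool mid_free_route_try(void *ptr) {
    if (!mid_registry_contains(ptr)) return false;
    mid_mt_free_stub(ptr);
    return true;
}

static void free_entry(void *ptr) {
    if (ptr == NULL) return;
    if (mid_free_route_try(ptr)) return;   /* checked BEFORE classification */
    existing_free_path_stub(ptr);          /* classify_ptr() etc. in the real code */
}
```

The point of the ordering is that a Mid MT pointer never reaches the Pool's registry lookup, which is where the 100% miss rate came from.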
Performance Results (bench_mid_mt_gap, 1KB-8KB allocs):
- Before: 1.49 M ops/s (19x slower than system malloc)
- After: 41.0 M ops/s (+28.9x improvement)
- vs System malloc: 1.53x faster (41.0 vs 26.8 M ops/s)
Files:
- core/box/mid_free_route_box.h (NEW) - Mid Free Route Box
- core/box/hak_wrappers.inc.h - Add mid_free_route_try() call
- core/hakmem_mid_mt.h - Fix mid_get_min_size() (1024 not 2048)
- bench_mid_mt_gap.c (NEW) - Targeted 1KB-8KB benchmark
- Makefile - Add bench_mid_mt_gap targets
Box Pattern: ✅ Single responsibility, clear contract, testable, minimal change
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 14:18:20 +09:00
04186341c1
Phase 4-Step2: Add Hot/Cold Path Box (+7.3% performance)
...
Implemented Hot/Cold Path separation using Box pattern for Tiny allocations:
Performance Improvement (without PGO):
- Baseline (Phase 26-A): 53.3 M ops/s
- Hot/Cold Box (Phase 4-Step2): 57.2 M ops/s
- Gain: +7.3% (+3.9 M ops/s)
Implementation:
1. core/box/tiny_front_hot_box.h - Ultra-fast hot path (1 branch)
- Removed range check (caller guarantees valid class_idx)
- Inline cache hit path with branch prediction hints
- Debug metrics with zero overhead in Release builds
2. core/box/tiny_front_cold_box.h - Slow cold path (noinline, cold)
- Refill logic (batch allocation from SuperSlab)
- Drain logic (batch free to SuperSlab)
- Error reporting and diagnostics
3. core/front/malloc_tiny_fast.h - Updated to use Hot/Cold Boxes
- Hot path: tiny_hot_alloc_fast() (1 branch: cache empty check)
- Cold path: tiny_cold_refill_and_alloc() (noinline, cold attribute)
- Clear separation improves i-cache locality
Branch Analysis:
- Baseline: 4-5 branches in hot path (range check + cache check + refill logic mixed)
- Hot/Cold Box: 1 branch in hot path (cache empty check only)
- Reduction: 3-4 branches eliminated from hot path
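A simplified sketch of the hot/cold split. tiny_hot_alloc_fast(), tiny_cold_refill_and_alloc(), and TINY_HOT_LIKELY are named in this message; their signatures, the macro definition, the cache layout, and the backing array are assumptions, and the attributes are GCC/Clang extensions.

```c
#include <assert.h>
#include <stddef.h>

#define TINY_HOT_LIKELY(x) __builtin_expect(!!(x), 1)   /* assumed definition */
#define CACHE_CAP 8

typedef struct { void *slot[CACHE_CAP]; int count; } TinyCache;

static char g_backing[CACHE_CAP][64];   /* stand-in for SuperSlab memory */

/* Cold path: batch-refill the cache, then hand out one block.
 * noinline + cold keeps this code away from the hot instruction stream. */
__attribute__((noinline, cold))
static void *tiny_cold_refill_and_alloc(TinyCache *c) {
    for (int i = 0; i < CACHE_CAP; i++) c->slot[i] = g_backing[i];
    c->count = CACHE_CAP;
    return c->slot[--c->count];
}

/* Hot path: a single branch (is the cache empty?). NULL means miss. */
static inline void *tiny_hot_alloc_fast(TinyCache *c) {
    if (TINY_HOT_LIKELY(c->count > 0)) return c->slot[--c->count];
    return NULL;
}

/* Caller guarantees a valid cache, so no range check on the hot path. */
static inline void *tiny_alloc(TinyCache *c) {
    assert(c != NULL);
    void *p = tiny_hot_alloc_fast(c);
    return p ? p : tiny_cold_refill_and_alloc(c);
}
```

The contract mirrors the list above: the hot function only ever returns a cached block or NULL, and everything slower lives behind the cold call.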
Design Principles (Box Pattern):
✅ Single Responsibility: Hot path = cache hit only, Cold path = refill/errors
✅ Clear Contract: Hot returns NULL on miss, Cold handles miss
✅ Observable: Debug metrics (TINY_HOT_METRICS_*) gated by NDEBUG
✅ Safe: Branch prediction hints (TINY_HOT_LIKELY/UNLIKELY)
✅ Testable: Isolated hot/cold paths, easy A/B testing
PGO Status:
- Temporarily disabled (build issues with __gcov_merge_time_profile)
- Will re-enable PGO in future commit after resolving gcc/lto issues
- Current benchmarks are without PGO (fair A/B comparison)
Other Changes:
- .gitignore: Added *.d files (dependency files, auto-generated)
- Makefile: PGO targets temporarily disabled (show informational message)
- build_pgo.sh: Temporarily disabled (show "PGO paused" message)
Next: Phase 4-Step3 (Front Config Box, target +5-8%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:58:37 +09:00
b51b600e8d
Phase 4-Step1: Add PGO workflow automation (+6.25% performance)
...
Implemented automated Profile-Guided Optimization workflow using Box pattern:
Performance Improvement:
- Baseline: 57.0 M ops/s
- PGO-optimized: 60.6 M ops/s
- Gain: +6.25% (within expected +5-10% range)
Implementation:
1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads
2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection
3. Makefile PGO targets:
- pgo-tiny-profile: Build instrumented binaries
- pgo-tiny-collect: Collect .gcda profile data
- pgo-tiny-build: Build optimized binaries
- pgo-tiny-full: Complete workflow (profile → collect → build → test)
4. Makefile help target: Added PGO instructions for discoverability
Design:
- Box pattern: Single responsibility, clear contracts
- Deterministic: Fixed seeds (42) for reproducibility
- Safe: Validation, error detection, timeout protection (30s/workload)
- Observable: Progress reporting, .gcda verification (33 files generated)
Workload Coverage:
- Random mixed: 3 working set sizes (128/256/512 slots)
- Tiny hot: 2 size classes (16B/64B)
- Total: 5 workloads covering hot/cold paths
Documentation:
- PHASE4_STEP1_COMPLETE.md - Completion report
- CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓)
- docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design
Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 11:28:38 +09:00
d78baf41ce
Phase 3: Remove mincore() syscall completely
...
Problem:
- mincore() was already disabled by default (DISABLE_MINCORE=1)
- Phase 1b/2 registry-based validation made mincore obsolete
- Dead code (~60 lines) remained with complex #ifdef guards
Solution:
Complete removal of mincore() syscall and related infrastructure:
1. Makefile:
- Removed DISABLE_MINCORE configuration (lines 167-177)
- Added Phase 3 comment documenting removal rationale
2. core/box/hak_free_api.inc.h:
- Removed ~60 lines of mincore logic with TLS page cache
- Simplified to: int is_mapped = 1;
- Added comprehensive history comment
3. core/box/external_guard_box.h:
- Simplified external_guard_is_mapped() from 20 lines to 4 lines
- Always returns 1 (assume mapped)
- Added Phase 3 comment
Safety:
Trust internal metadata for all validation:
- SuperSlab registry: validates Tiny allocations (Phase 1b/2)
- AllocHeader: validates Mid/Large allocations
- FrontGate classifier: routes external allocations
Testing:
✓ Build: Clean compilation (no warnings)
✓ Stability: 100/100 test iterations passed (0% crash rate)
✓ Performance: No regression (mincore already disabled)
History:
- Phase 9: Used mincore() for safety
- 2025-11-14: Added DISABLE_MINCORE flag (+10.3% perf improvement)
- Phase 1b/2: Registry-based validation (0% crash rate)
- Phase 3: Dead code cleanup (this commit)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 09:04:32 +09:00
6ac6f5ae1b
Refactor: Split hakmem_tiny_superslab.c + unified backend exit point
...
Major refactoring to improve maintainability and debugging:
1. Split hakmem_tiny_superslab.c (1521 lines) into 7 focused files:
- superslab_allocate.c: SuperSlab allocation/deallocation
- superslab_backend.c: Backend allocation paths (legacy, shared)
- superslab_ace.c: ACE (Adaptive Cache Engine) logic
- superslab_slab.c: Slab initialization and bitmap management
- superslab_cache.c: LRU cache and prewarm cache management
- superslab_head.c: SuperSlabHead management and expansion
- superslab_stats.c: Statistics tracking and debugging
2. Created hakmem_tiny_superslab_internal.h for shared declarations
3. Added superslab_return_block() as single exit point for header writing:
- All backend allocations now go through this helper
- Prevents bugs where headers are forgotten in some paths
- Makes future debugging easier
4. Updated Makefile for new file structure
5. Added header writing to ss_legacy_backend_box.c and
ss_unified_backend_box.c (though not currently linked)
Note: Header corruption bug in Larson benchmark still exists.
Class 1-6 allocations go through TLS refill/carve paths, not backend.
Further investigation needed.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 05:13:04 +09:00
ccbeb935c5
Perf optimization: Disable mincore syscall by default
...
Problem: mincore() syscall in hak_free_api caused performance overhead.
Perf analysis showed mincore syscall overhead in hot path.
Solution: Change DISABLE_MINCORE default from 0 to 1 in Makefile.
This disables mincore() checks in core/box/hak_free_api.inc.h by default.
Benchmark results (10M iterations × 5 runs, ws=256):
- Before (mincore enabled): 64.61M ops/s (avg)
- After (mincore disabled): 71.30M ops/s (avg)
- Improvement: +10.3% (+6.69M ops/s)
This exceeds Task agent's prediction of +2-3%, showing significant
impact in real-world allocation patterns.
Note: Set DISABLE_MINCORE=0 to re-enable if debugging invalid pointers.
Location: Makefile:173
Perf analysis: commit 53bc92842
2025-11-28 18:00:22 +09:00
dc9e650db3
Tiny Pool redesign: P0.1, P0.3, P1.1, P1.2 - Out-of-band class_idx lookup
...
This commit implements the first phase of Tiny Pool redesign based on
ChatGPT architecture review. The goal is to eliminate Header/Next pointer
conflicts by moving class_idx lookup out-of-band (to SuperSlab metadata).
## P0.1: C0(8B) class upgraded to 16B
- Size table changed: {16,32,64,128,256,512,1024,2048} (8 classes)
- LUT updated: 1..16 → class 0, 17..32 → class 1, etc.
- tiny_next_off: C0 now uses offset 1 (header preserved)
- Eliminates edge cases for 8B allocations
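A small sketch of the size→class mapping implied by the new table. The size table is taken from the bullet above; the function name, the -1 return for out-of-range sizes, and the loop (instead of the real LUT) are assumptions, and any header overhead per block is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Size table from this commit: 8 classes, smallest is now 16 bytes. */
static const size_t k_class_size[8] = {16, 32, 64, 128, 256, 512, 1024, 2048};

/* Returns the class index 0..7, or -1 for size 0 or sizes above 2048.
 * A loop is used for clarity; the real code uses a lookup table. */
static int tiny_size_to_class(size_t size) {
    if (size == 0 || size > 2048) return -1;
    int cls = 0;
    while (k_class_size[cls] < size) cls++;
    assert(cls < 8);
    return cls;
}
```

With this table a request for 8 bytes lands in class 0 (16 B), which is how the former 8 B edge cases disappear.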
## P0.3: Slab reuse guard Box (tls_slab_reuse_guard_box.h)
- New Box for draining TLS SLL before slab reuse
- ENV gate: HAKMEM_TINY_SLAB_REUSE_GUARD=1
- Prevents stale pointers when slabs are recycled
- Follows Box theory: single responsibility, minimal API
## P1.1: SuperSlab class_map addition
- Added uint8_t class_map[SLABS_PER_SUPERSLAB_MAX] to SuperSlab
- Maps slab_idx → class_idx for out-of-band lookup
- Initialized to 255 (UNASSIGNED) on SuperSlab creation
- Set correctly on slab initialization in all backends
## P1.2: Free fast path uses class_map
- ENV gate: HAKMEM_TINY_USE_CLASS_MAP=1
- Free path can now get class_idx from class_map instead of Header
- Falls back to Header read if class_map returns invalid value
- Fixed Legacy Backend dynamic slab initialization bug
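The lookup order described in P1.1 and P1.2 can be sketched like this. class_map, SLABS_PER_SUPERSLAB_MAX, and the 255 sentinel come from this message; the value 32 for the constant, the struct, and the function names are assumptions for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SLABS_PER_SUPERSLAB_MAX 32   /* assumption: the real value may differ */
#define CLASS_UNASSIGNED        255  /* sentinel from the commit message */
#define TINY_NUM_CLASSES        8

typedef struct {
    uint8_t class_map[SLABS_PER_SUPERSLAB_MAX];   /* slab_idx -> class_idx */
} SuperSlabSketch;

/* On SuperSlab creation every slab starts out UNASSIGNED. */
static void ss_sketch_init(SuperSlabSketch *ss) {
    memset(ss->class_map, CLASS_UNASSIGNED, sizeof ss->class_map);
}

/* Out-of-band lookup first; header_class (decoded from the in-band
 * header by the caller) is used only when the map entry is invalid. */
static int free_class_idx(const SuperSlabSketch *ss, int slab_idx, int header_class) {
    assert(slab_idx >= 0 && slab_idx < SLABS_PER_SUPERSLAB_MAX);
    uint8_t c = ss->class_map[slab_idx];
    if (c < TINY_NUM_CLASSES) return c;
    return header_class;
}
```

Once every backend sets class_map on slab initialization, the header fallback should become unreachable, which is the precondition for the later out-of-band migration steps.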
## Documentation added
- HAKMEM_ARCHITECTURE_OVERVIEW.md: 4-layer architecture analysis
- TLS_SLL_ARCHITECTURE_INVESTIGATION.md: Root cause analysis
- PTR_LIFECYCLE_TRACE_AND_ROOT_CAUSE_ANALYSIS.md: Pointer tracking
- TINY_REDESIGN_CHECKLIST.md: Implementation roadmap (P0-P3)
## Test results
- Baseline: 70% success rate (30% crash - pre-existing issue)
- class_map enabled: 70% success rate (same as baseline)
- Performance: ~30.5M ops/s (unchanged)
## Next steps (P1.3, P2, P3)
- P1.3: Add meta->active for accurate TLS/freelist sync
- P2: TLS SLL redesign with Box-based counting
- P3: Complete Header out-of-band migration
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 13:42:39 +09:00
a9ddb52ad4
ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
...
Phase 1 complete: environment variable cleanup + fprintf debug guards
ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: HotMag ENV removed (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV removed
- core/tiny_refill.h: BG remote replaced with fixed values
- core/hakmem_tiny_slow.inc: BG refs removed
fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message
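Documentation cleanup:
- 328 markdown files deleted (old reports and duplicate docs)
Performance check:
- Larson: 52.35M ops/s (previously 52.8M; stable ✅)
- No functional impact from the ENV cleanup
- Some debug output remains (to be addressed in the next phase)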
🤖 Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00
bcfb4f6b59
Remove dead code: UltraHot, RingCache, FrontC23, Class5 Hotpath
...
(cherry-picked from 225b6fcc7, conflicts resolved)
2025-11-26 12:33:49 +09:00