hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	b2724e6f5d	Phase ALLOC-TINY-FAST-DUALHOT-1: WIP (regression), FREE DUALHOT confirmed +13% ALLOC-TINY-FAST-DUALHOT-1 (this phase): - Implementation: malloc_tiny_fast() C0-C3 early-exit with policy snapshot skip - ENV: HAKMEM_TINY_ALLOC_DUALHOT=0/1 (default OFF) - A/B Result: -1.17% median regression (Mixed, 10-run) - Root Cause: Branch prediction penalty on C4-C7 outweighs policy skip benefit - Decision: Freeze as research box (default OFF) - Difference from FREE: ALLOC requires structural changes (per-class paths) FREE-TINY-FAST-DUALHOT-1 (verified): - A/B Confirmation: +13.00% improvement (42.08M → 47.81M ops/s, Mixed, 10-run) - Success Criteria: +2% target ACHIEVED - Health Check: PASS (verify_health_profiles.sh, ENV OFF/ON) - Safety: HAKMEM_TINY_LARSON_FIX guard in place - Decision: Promotion to MIXED_TINYV3_C7_SAFE profile candidate Next Steps: - Profile adoption of FREE DUALHOT for MIXED workload - No further deep-dive on ALLOC optimization (deferred to future phases) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-13 05:10:45 +09:00
Moe Charm (CI)	0a7400d7d3	Phase ALLOC-TINY-FAST-DUALHOT-1: C0-C3 alloc direct path (WIP, -2% regression) Add C0-C3 early-exit optimization to malloc_tiny_fast() similar to FREE-TINY-FAST-DUALHOT-1. Skip policy snapshot for C0-C3 classes. A/B Result (10-run, Mixed TINYV3_C7_SAFE): - Baseline: 47.27M ops/s (median) - Optimized: 46.10M ops/s (median) - Result: -2.00% (regression, needs investigation) ENV: HAKMEM_TINY_ALLOC_DUALHOT=0/1 (default OFF) Implementation: - core/front/malloc_tiny_fast.h: alloc_dualhot_enabled() + early-exit - Design: docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md Status: Research box (default OFF), needs root cause analysis 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-13 04:28:52 +09:00
Moe Charm (CI)	2b567ac070	Phase FREE-TINY-FAST-DUALHOT-1: Optimize C0-C3 direct free path Treat C0-C3 classes (48% of calls) as "second hot path" instead of cold path. Skip expensive policy snapshot and route determination, direct to tiny_legacy_fallback_free_base(). Measurements from FREE-TINY-FAST-HOTCOLD-OPT-1 revealed C0-C3 is NOT rare (48.43% of all frees). Previous attempt to optimize via hot/cold split failed (-13% regression) because noinline + function call on 48% of workload hurt more than it helped. This phase applies correct optimization: direct inline path for frequent C0-C3 without policy snapshot overhead. Implementation: - Insert C0-C3 early-exit after C7 ULTRA check - Skip tiny_front_v3_snapshot_get() for C0-C3 (saves 5-10 cycles) - Skip route determination logic - Safety: HAKMEM_TINY_LARSON_FIX=1 disables optimization Benchmark Results (100M ops, 400 threads, MIXED_TINYV3_C7_SAFE): - Baseline (optimization OFF): 44.50M ops/s (median) - Optimized (DUALHOT ON): 48.74M ops/s (median) - Improvement: +9.51% (+4.23M ops/s) Perf Stats (optimized): - Branch misses: 112.8M - Cycles: 8.89B - Instructions: 21.95B (2.47 IPC) - Cache misses: 656K Status: GO (significant improvement, no regression) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-13 03:46:36 +09:00
Moe Charm (CI)	c503b212a3	Phase FREE-TINY-FAST-HOTCOLD-OPT-1: Hot/Cold split for free_tiny_fast [RESEARCH BOX - FREEZE] Split free_tiny_fast() into hot and cold paths to reduce I-cache pressure: - free_tiny_fast_hot(): always_inline, fast-path validation + ULTRA/MID/V7 - free_tiny_fast_cold(): noinline,cold, cross-thread + TinyHeap + legacy ENV: HAKMEM_FREE_TINY_FAST_HOTCOLD=0/1 (default 0) Stats: HAKMEM_FREE_TINY_FAST_HOTCOLD_STATS=0/1 (TLS only, exit dump) ## Benchmark Results (random mixed, 100M ops) HOTCOLD=0 (legacy): 49.35M, 50.18M, 50.25M ops/s (median: 50.18M) HOTCOLD=1 (split): 43.54M, 43.59M, 43.62M ops/s (median: 43.59M) Regression: -13.1% (NO-GO) ## Stats Analysis (10M ops, HOTCOLD_STATS=1) Hot path: 50.11% (C7 ULTRA early-exit) Cold path: 48.43% (legacy fallback) ## Root Cause Design assumption FAILED: "Cold path is rare" Reality: Cold path is 48% (almost as common as hot path) The split introduces: 1. Extra dispatch overhead in hot path 2. Function call overhead to cold for ~48% of frees 3. "Cold" is NOT rare - it's the legacy fallback for non-ULTRA classes ## Conclusion FREEZE as research box (default OFF) Box Theory value: - Validated hot/cold distribution via TLS stats - Confirmed that legacy fallback is NOT rare (48%) - Demonstrated that naive hot/cold split hurts when "cold" is common Alternative approaches for future work: 1. Inline the legacy fallback in hot path (no split) 2. Route-specific specialization (C7 vs non-C7 separate paths) 3. Policy-based early routing (before header validation) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-13 03:16:54 +09:00
Moe Charm (CI)	4e7870469c	POOL-MID-DN-BATCH: Add hash-based TLS page map (O(1) lookup) Replace linear search (avg 16 iterations, -7.6% regression) with open addressing hash table: - Size: 64 slots (power-of-two) - Collision: Linear probing, max 8 probes - On probe limit: drain and retry (safe fallback) - Hash function: Golden ratio with page-aligned shift New ENV: HAKMEM_POOL_MID_INUSE_MAP_KIND=hash\|linear (default: linear) Implementation: - Added hak_pool_mid_inuse_map_hash_enabled() ENV gate - Extended MidInuseTlsPageMap with hash_pages[64], hash_counts[64], hash_used - Added mid_inuse_hash_page() golden ratio hash function - Added mid_inuse_dec_deferred_hash() O(1) insert with probing - Updated mid_inuse_deferred_drain() to support hash mode - Added decs_drained stats counter for batching metrics Benchmark Results (10 runs each, bench_mid_large_mt_hakmem): Baseline (DEFERRED=0): median=9,250,340 ops/s Linear mode: median=8,159,240 ops/s (-11.80%) Hash mode: median=8,262,982 ops/s (-10.67%) Hash vs Linear: +1.27% improvement (eliminates linear search overhead) Note: Both deferred modes still show regression vs baseline due to other factors (TLS access overhead, drain cost). Hash mode successfully eliminates the linear search penalty as designed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-13 00:28:03 +09:00
Moe Charm (CI)	6c849fd020	POOL-MID-DN-BATCH: Add last-match cache to reduce linear search overhead Root cause: Linear search in 32-entry TLS map averaged 16 iterations, causing instruction overhead that exceeded mid_desc_lookup savings. Fix implemented: - Added last_idx field to MidInuseTlsPageMap for temporal locality - Check last_idx before linear search (O(1) fast path) - Update last_idx on hits and new entries - Reset last_idx on drain Changes: 1. pool_mid_inuse_tls_pagemap_box.h: - Added uint32_t last_idx field to struct 2. pool_mid_inuse_deferred_box.h: - Check last_idx before linear search (lines 90-94) - Update last_idx on linear search hit (line 101) - Set last_idx on new entry insert (line 117) - Reset last_idx on drain (line 166) Benchmark results (bench_mid_large_mt_hakmem): - Baseline (DEFERRED=0): median 9.08M ops/s, variance 300B - Deferred with cache (DEFERRED=1): median 8.38M ops/s, variance 207B - Performance: -7.6% regression (vs expected +2-4% gain) - Stability: -31% variance (improvement as expected) Analysis: The last-match cache reduces variance but does not eliminate the regression for this benchmark's random access pattern (2048 slots, many pages). The temporal locality assumption (60-80% hit rate) is not met by bench_mid_large_mt's allocation pattern. Further optimization needed: - Consider hash-based lookup for better than O(n) search - OR reduce map size to decrease search iterations - OR add drain triggers at better boundaries 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-13 00:04:41 +09:00
Moe Charm (CI)	16b415f5a2	Phase POOL-MID-DN-BATCH Step 5: Integrate deferred API into pool_free_v1	2025-12-12 23:00:06 +09:00
Moe Charm (CI)	cba444b943	Phase POOL-MID-DN-BATCH Step 4: Deferred API implementation with thread cleanup	2025-12-12 23:00:00 +09:00
Moe Charm (CI)	d45729f063	Phase POOL-MID-DN-BATCH Step 3: Statistics counters for deferred inuse_dec	2025-12-12 22:59:56 +09:00
Moe Charm (CI)	b381515b16	Phase POOL-MID-DN-BATCH Step 2: TLS page map for batched inuse_dec	2025-12-12 22:59:50 +09:00
Moe Charm (CI)	f5f03ef68c	Phase POOL-MID-DN-BATCH Step 1: ENV gate for deferred inuse_dec	2025-12-12 22:59:45 +09:00
Moe Charm (CI)	506d8f2e5e	Phase: Pool API Modularization - Step 8 (FINAL): Extract pool_alloc_v1_box.h Extract 288 lines: hak_pool_try_alloc_v1_impl() - LARGEST SIZE - New box: core/box/pool_alloc_v1_box.h (v1 alloc baseline, no hotbox_v2) - Updated: pool_api.inc.h (add include, remove extracted function) - Build: OK, bench_mid_large_mt_hakmem: 8.01M ops/s (baseline ~8M, within ±2%) - Risk: MEDIUM (simpler than v2 but large function, validated) - Result: pool_api.inc.h reduced from 909 lines to ~40 lines (95% reduction) ALL 5 STEPS COMPLETE (Steps 4-8): - Step 4: pool_block_to_user_box.h (30 lines) - helpers - Step 5: pool_free_v2_box.h (121 lines) - v2 free with hotbox - Step 6: pool_alloc_v1_flat_box.h (103 lines) - v1 flatten TLS - Step 7: pool_alloc_v2_box.h (277 lines) - v2 alloc with hotbox - Step 8: pool_alloc_v1_box.h (288 lines) - v1 alloc baseline Total extracted: 819 lines Final pool_api.inc.h size: ~40 lines (public wrappers only) Performance: MAINTAINED (8M ops/s baseline) Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 22:28:13 +09:00
Moe Charm (CI)	76a5bb568a	Phase: Pool API Modularization - Step 7: Extract pool_alloc_v2_box.h Extract 277 lines: hak_pool_try_alloc_v2_impl() - LARGEST COMPLEXITY - New box: core/box/pool_alloc_v2_box.h (v2 alloc with hotbox, MF2, TC drain, TLS) - Updated: pool_api.inc.h (add include, remove extracted function) - Build: OK, bench_mid_large_mt_hakmem: 8.86M ops/s (baseline ~8M, within ±2%) - Risk: MEDIUM (complex function with 30+ dependencies, validated) - Note: Avoided forward declarations for types/macros already in compilation unit Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 22:24:21 +09:00
Moe Charm (CI)	5f069e08bf	Phase: Pool API Modularization - Step 6: Extract pool_alloc_v1_flat_box.h Extract 103 lines: hak_pool_try_alloc_v1_flat() + hak_pool_free_v1_flat() - New box: core/box/pool_alloc_v1_flat_box.h (v1 flatten TLS-only fast path) - Updated: pool_api.inc.h (add include, remove extracted functions) - Build: OK, bench_mid_large_mt_hakmem: 9.17M ops/s (baseline ~8M, within ±2%) - Risk: MINIMAL (TLS-only path, well-isolated) - Note: Added forward declarations for v1_impl functions (defined later) Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 22:20:19 +09:00
Moe Charm (CI)	0ad9c57aca	Phase: Pool API Modularization - Step 5: Extract pool_free_v2_box.h Extract 121 lines: hak_pool_free_v2_impl() + hak_pool_mid_lookup_v2_impl() + hak_pool_free_fast_v2_impl() - New box: core/box/pool_free_v2_box.h (v2 free with hotbox support) - Updated: pool_api.inc.h (add include, remove extracted functions) - Build: OK, bench_mid_large_mt_hakmem: 8.58M ops/s (baseline ~8M, within ±2%) - Risk: LOW-MEDIUM (hotbox_v2 integration, well-isolated) Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 22:17:53 +09:00
Moe Charm (CI)	0da8a63fa5	Phase: Pool API Modularization - Step 4: Extract pool_block_to_user_box.h Extract 30 lines: hak_pool_block_to_user() + hak_pool_block_to_user_legacy() - New box: core/box/pool_block_to_user_box.h (helpers for block→user conversion) - Updated: pool_api.inc.h (add include, remove extracted functions) - Build: OK, bench_mid_large_mt_hakmem: 9.17M ops/s (baseline ~8M) - Risk: MINIMAL (simple extraction, no dependencies) Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 22:15:21 +09:00
Moe Charm (CI)	a92f3e52c3	Phase: Pool API Modularization - Step 3: Extract pool_free_v1_box.h Extracted pool v1 free implementation into separate box module: - hak_pool_free_v1_fast_impl(): L1-FastBox (TLS-only path, no mid_desc_lookup) - hak_pool_free_v1_slow_impl(): L1-SlowBox (full impl with lookup) - hak_pool_free_v1_impl(): L0-SplitBox (fast predicate router) Benefits: - Reduced pool_api.inc.h from ~950 to ~840 lines - Clear separation of concern (fast vs slow paths) - Enables future phase extensions (e.g., POOL-MID-DN-BATCH) - Maintains zero-cost abstraction (all inline) Testing: - Build: ✓ (no errors) - Benchmark: ✓ (7.99M ops/s, consistent with baseline) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 21:46:26 +09:00
Moe Charm (CI)	b01c99f209	Phase: Pool API Modularization - Steps 1-2 Extract configuration, statistics, and caching boxes from pool_api.inc.h Step 1: pool_config_box.h (60 lines) - All ENV gate predicates (hak_pool_v2_enabled, hak_pool_v1_flatten_enabled, etc) - Lazy static int cache pattern (matches tiny_heap_env_box.h style) - Zero dependencies (lowest-level box) Step 2a: pool_stats_box.h (90 lines) - PoolV1FlattenStats structure with multi-phase support - pool_v1_flat_stats_dump() with phase-aware output - Destructor hook for automatic dumping on exit - Multi-phase design: supports future phases without refactoring Step 2b: pool_mid_desc_cache_box.h (60 lines) - MidDescCache structure (TLS-local single-entry LRU) - mid_desc_lookup_cached() with fast TLS hit path - Minimal external dependency: mid_desc_lookup from pool_mid_desc.inc.h Result: pool_api.inc.h reduced from 1050+ lines to ~950 lines Still contains: alloc/free implementations, helpers (next steps) Build: ✅ Clean (no warnings) Test: ✅ Benchmark passes (8.5M ops/s) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 21:39:18 +09:00
Moe Charm (CI)	c86a59159b	Phase POOL-FREE-V1-OPT Step 2: Fast/Slow split for v1 free Implement L0-SplitBox + L1-FastBox/SlowBox architecture for pool v1 free: L0-SplitBox (hak_pool_free_v1_impl): - Fast predicate: header-based same-thread detection - Requires g_hdr_light_enabled == 0, tls_free_enabled - Routes to fast or slow box based on predicate L1-FastBox (hak_pool_free_v1_fast_impl): - Same-thread TLS free path only (ring → lo_head → spill) - Skips mid_desc_lookup for validation (uses header) - Still calls mid_page_inuse_dec_and_maybe_dn at end L1-SlowBox (hak_pool_free_v1_slow_impl): - Full v1 impl with mid_desc_lookup for validation - Handles cross-thread, TC lookup, etc. ENV gate: HAKMEM_POOL_V1_FREE_FASTSPLIT (default OFF) Stats tracking: - fastsplit_fast_hit: Fast path taken (>99% typically) - fastsplit_slow_hit: Slow path taken (predicate failed) Benchmark result (FLATTEN OFF, Mixed profile): - Baseline: ~8.3M ops/s (high variance) - FASTSPLIT ON: ~8.1M ops/s (high variance) - Performance neutral (savings limited by inuse_dec still calling mid_desc_lookup) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 19:52:36 +09:00
Moe Charm (CI)	dbdd2e0e0e	Phase POOL-FREE-V1-OPT Step 1: Add v2 reject stats tracking Add reject reason counters for v2 free path to understand fallback patterns: - v2_reject_total: Total v2 free rejects - v2_reject_ptr_null: ptr == NULL - v2_reject_not_init: pool not initialized - v2_reject_desc_null: mid_desc_lookup returned NULL - v2_reject_mf2_null: MF2 path but mf2_addr_to_page returned NULL ENV gate: HAKMEM_POOL_FREE_V1_REJECT_STATS (default OFF) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 19:43:03 +09:00
Moe Charm (CI)	fe70e3baf5	Phase MID-V35-HOTPATH-OPT-1 complete: +7.3% on C6-heavy Step 0: Geometry SSOT - New: core/box/smallobject_mid_v35_geom_box.h (L1/L2 consistency) - Fix: C6 slots/page 102→128 in L2 (smallobject_cold_iface_mid_v3.c) - Applied: smallobject_mid_v35.c, smallobject_segment_mid_v3.c Step 1-3: ENV gates for hotpath optimizations - New: core/box/mid_v35_hotpath_env_box.h * HAKMEM_MID_V35_HEADER_PREFILL (default 0) * HAKMEM_MID_V35_HOT_COUNTS (default 1) * HAKMEM_MID_V35_C6_FASTPATH (default 0) - Implementation: smallobject_mid_v35.c * Header prefill at refill boundary (Step 1) * Gated alloc_count++ in hot path (Step 2) * C6 specialized fast path with constant slot_size (Step 3) A/B Results: C6-heavy (257–768B): 8.75M→9.39M ops/s (+7.3%, 5-run mean) ✅ Mixed (16–1024B): 9.98M→9.96M ops/s (-0.2%, within noise) ✓ Decision: FROZEN - defaults OFF, C6-heavy推奨ON, Mixed現状維持 Documentation: ENV_PROFILE_PRESETS.md updated 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 19:19:25 +09:00
Moe Charm (CI)	e95e61f0ff	Phase POLICY-FAST-PATH-V2 complete + MID-V35-HOTPATH-OPT-1 design ## Phase POLICY-FAST-PATH-V2 (FROZEN) - Implementation complete: free_policy_fast_v2_box.h + malloc_tiny_fast.h integration - A/B Results: - Mixed (ws=400): -1.6% regression ❌ (branch cost > skip benefit) - C6-heavy (ws=200): +5.4% improvement ✅ - Decision: Default OFF, FROZEN (ws<300 / C6-heavy research only) - Learning: Large WS causes branch misprediction to dominate ## Phase 3-GRADUATE + ENV probe fix - 64-probe retry for getenv() stability during bench_profile putenv() - C6 ULTRA intrusive freelist: FROZEN (research box) ## Phase MID-V35-HOTPATH-OPT-1-DESIGN - Design doc for next optimization target - Target: MID v3.5 alloc/free hot path (C5-C6) - Boxes: Stats Gate, TLS Layout, Boundary Check elimination - Expected: +3-9% on Mixed mainline Files: - core/box/free_policy_fast_v2_box.h (new) - core/box/free_path_stats_box.h/c (policy_fast_v2_skip counter) - core/front/malloc_tiny_fast.h (fast-path integration) - docs/analysis/MID_V35_HOTPATH_OPT_1_DESIGN.md (new) - docs/analysis/PHASE_3_GRADUATE_*.md (new) - CURRENT_TASK.md (phase status update) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-12 18:40:08 +09:00
Moe Charm (CI)	0c8583f91e	Phase TLS-UNIFY-3+: Refactoring - Unified ENV gates for C6 ULTRA Consolidate C6 ULTRA ENV gate functions: - tiny_c6_ultra_intrusive_env_box.h now contains both: - tiny_c6_ultra_free_enabled() - C6 ULTRA routing (policy gate) - tiny_c6_ultra_intrusive_enabled() - intrusive LIFO mode (TLS optimization) - Simplified ENV gate management with clear separation of concerns Removes code duplication by centralizing environment checks in single header. Performance verified: ENV_OFF=56.4 Mop/s, ENV_ON=57.6 Mop/s (parity maintained) Note: Avoided macro-based segment learning consolidation (C4/C5/C6) as it would hinder compiler optimizations. Current inline approach is optimal. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 16:31:14 +09:00
Moe Charm (CI)	1a8652a91a	Phase TLS-UNIFY-3: C6 intrusive freelist implementation (完成) Implement C6 ULTRA intrusive LIFO freelist with ENV gating: - Single-linked LIFO using next pointer at USER+1 offset - tiny_next_store/tiny_next_load for pointer access (single source of truth) - Segment learning via ss_fast_lookup (per-class seg_base/seg_end) - ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF) - Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS Files: - core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO - core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6) - core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new) - core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new) - core/tiny_debug_ring.h: C6_IFL events - core/box/free_path_stats_box.h/c: c6_ifl_* counters A/B Test Results (1M iterations, ws=200, 257-512B): - ENV_OFF (array): 56.6 Mop/s avg - ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise) - Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 16:26:42 +09:00
Moe Charm (CI)	d5ffb3eeb2	Fix MID v3.5 activation bugs: policy loop + malloc recursion Two critical bugs fixed: 1. Policy snapshot infinite loop (smallobject_policy_v7.c): - Condition `g_policy_v7_version == 0` caused reinit on every call - Fixed via CAS to set global version to 1 after first init 2. Malloc recursion (smallobject_segment_mid_v3.c): - Internal malloc() routed back through hakmem → MID v3.5 → segment creation → malloc → infinite recursion / stack overflow - Fixed by using mmap() directly for internal allocations: - Segment struct, pages array, page metadata block Performance results (bench_random_mixed 257-512B): - Baseline (LEGACY): 34.0M ops/s - MID_V35 ON (C6): 35.8M ops/s - Improvement: +5.1% ✓ 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 07:12:24 +09:00
Moe Charm (CI)	212739607a	Phase v11a-3: MID v3.5 Activation (Build Complete) Integrated MID v3.5 into active code path, making it available for C5/C6/C7 routing. Key Changes: - Policy Box: Added SMALL_ROUTE_MID_V35 with ENV gates (HAKMEM_MID_V35_ENABLED, HAKMEM_MID_V35_CLASSES) - HotBox: Implemented small_mid_v35_alloc/free with TLS-cached page allocation - Front Gate: Wired MID_V35 routing into malloc_tiny_fast.h (priority: ULTRA > MID_V35 > V7) - Build: Added core/smallobject_mid_v35.o to all object lists Architecture: - Slot sizes: C5=384B, C6=512B, C7=1024B - Page size: 64KB (170/128/64 slots) - Integration: ColdIface v2 (refill/retire), Stats v2 (observation), Learner v2 (dormant) Status: Build successful, ready for A/B benchmarking Next: Performance validation (C6-heavy, C5+C6-only, Mixed benchmarks) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 06:52:14 +09:00
Moe Charm (CI)	0dba67ba9d	Phase v11a-2: Core MID v3.5 implementation - segment, cold iface, stats, learner Implement 5-layer infrastructure for multi-class MID v3.5 (C5-C7, 257-1KiB): 1. SegmentBox_mid_v3 (L2 Physical) - core/smallobject_segment_mid_v3.c (9.5 KB) - 2MiB segments, 64KiB pages (32 per segment) - Per-class free page stacks (LIFO) - RegionIdBox registration - Slots: C5→170, C6→102, C7→64 2. ColdIface_mid_v3 (L2→L1) - core/box/smallobject_cold_iface_mid_v3_box.h (NEW) - core/smallobject_cold_iface_mid_v3.c (3.5 KB) - refill: get page from free stack or new segment - retire: calculate free_hit_ratio, publish stats, return to stack - Clean separation: TLS cache for hot path, ColdIface for cold path 3. StatsBox_mid_v3 (L2→L3) - core/smallobject_stats_mid_v3.c (7.2 KB) - Circular buffer history (1000 events) - Per-page metrics: class_idx, allocs, frees, free_hit_ratio_bps - Periodic aggregation (every 100 retires) - Learner notification callback 4. Learner v2 (L3) - core/smallobject_learner_v2.c (11 KB) - Multi-class aggregation: allocs[8], retire_count[8], avg_free_hit_bps[8] - Exponential smoothing (90% history + 10% new) - Per-class efficiency tracking - Stats snapshot API - Route decision disabled for v11a-2 (v11b feature) 5. Build Integration - Modified Makefile: added 4 new .o files (segment, cold_iface, stats, learner) - Updated box header prototypes - Clean compilation, all dependencies resolved Architecture Decision Implementation: - v7 remains frozen (C5/C6 research preset) - MID v3.5 becomes unified 257-1KiB main path - Multi-class isolation: per-class free stacks - Dormant infrastructure: linked but not active (zero overhead) Performance: - Build: clean compilation - Sanity benchmark: 27.3M ops/s (no regression vs v10) - Memory: ~30MB RSS (baseline maintained) Design Compliance: ✅ Layer separation: L2 (segment) → L2 (cold iface) → L3 (stats) → L3 (learner) ✅ Hot path clean: alloc/free never touch stats/learner ✅ Backward compatible: existing MID v3 routes unchanged ✅ Transparent: v11a-2 is dormant (no behavior change) Next Phase (v11a-3): - Activate C5/C6/C7 routing through MID v3.5 - Connect TLS cache to segment refill - Verify performance under load - Then Phase v11a-4: dynamic C5 ratio routing 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 06:37:06 +09:00
Moe Charm (CI)	babd884b96	Phase v11a-1: Infrastructure - Multi-class segment and learner v2 box definitions Create core box definitions for MID v3.5 consolidation (Phase v11a): 1. smallobject_segment_mid_v3_box.h - Multi-class unified segment (2MiB, C5-C7) - Per-class free page stacks - SmallHeapCtx_MID_v3 for TLS caching - Refill/retire/validation APIs 2. smallobject_stats_mid_v3_box.h - SmallPageStatsMID_v3: per-page lifetime stats - Aggregation for Learner input - Free hit ratio tracking (basis points) 3. smallobject_learner_v2_box.h - SmallLearnerStatsV2: multi-class and global metrics - Extended from v7 (C5-only ratio) to full workload analysis - Per-class retire efficiency, global free hit ratio - Decision API for route optimization 4. smallobject_policy_v2_box.h - SmallPolicyV2: routing with Learner integration - Version-based TLS cache invalidation - Route update from Learner stats - Backward compatible with v1 interface Dependency graph: segment → stats → learner → policy → malloc routing Architecture Decision: Option A (MID v3.5 consolidation) - v7 frozen as C5/C6-only research preset - MID v3.5 becomes 257-1KiB main implementation - Learner scope: multi-class tracking (C5 ratio primary, Phase v11a) - Future (v11b): multi-dimensional optimization 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 06:20:01 +09:00
Moe Charm (CI)	bbc4b66a22	Phase v10: Enable Learner v7 by default Change: Learner now defaults to ON (when v7 is enabled) - Old behavior: Learner only enabled if explicitly requested - New behavior: Learner always ON (can disable with ENV=0) - Learner is optional dependency of v7 (not intrusive) Configuration: - HAKMEM_SMALL_HEAP_V7_ENABLED=1: enables v7 + Learner - HAKMEM_SMALL_LEARNER_V7_ENABLED=0: disable Learner only (keeps v7) Benefits: - Automatic workload detection without user configuration - C5 allocation ratio monitored by default - Route optimization happens transparently Performance: v7+Learner C5/C6 workload = 39M ops/s (maintained) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 06:09:53 +09:00
Moe Charm (CI)	79674c9390	Phase v10: Remove legacy v3/v4/v5 implementations Removal strategy: Deprecate routes by disabling ENV-based routing - v3/v4/v5 enum types kept for binary compatibility - small_heap_v3/v4/v5_enabled() always return 0 - small_heap_v3/v4/v5_class_enabled() always return 0 - Any v3/v4/v5 ENVs are silently ignored, routes to LEGACY Changes: - core/box/smallobject_hotbox_v3_env_box.h: stub functions - core/box/smallobject_hotbox_v4_env_box.h: stub functions - core/box/smallobject_v5_env_box.h: stub functions - core/front/malloc_tiny_fast.h: remove alloc/free cases (20+ lines) Benefits: - Cleaner routing logic (v6/v7 only for SmallObject) - 20+ lines deleted from hot path validation - No behavioral change (routes were rarely used) Performance: No regression expected (v3/v4/v5 already disabled by default) Next: Set Learner v7 default ON, production testing 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 06:09:12 +09:00
Moe Charm (CI)	540230c301	v7-7: Modularize Learner into separate box Refactoring: Separate Learner API and types from Policy Box - New: core/box/smallobject_learner_v7_box.h - SmallLearnerStatsV7 type definition - Learner recording API (record_refill, record_retire) - Learner evaluation and stats snapshot - Learner configuration constants - Updated: core/box/smallobject_policy_v7_box.h - Removed Learner API (moved to Learner Box) - Removed SmallLearnerStatsV7 type (moved to Learner Box) - Added include of smallobject_learner_v7_box.h - Kept small_policy_v7_update_from_learner() (L3 integration) - Updated: core/smallobject_policy_v7.c - Added include of smallobject_learner_v7_box.h Benefits: - Clearer module boundaries (Policy vs Learner) - Easier testing and debugging (stats isolation) - Reduced coupling between components Performance: No regression (v7+Learner: 41M ops/s on C5/C6) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 06:06:44 +09:00
Moe Charm (CI)	6c8c7b7f6c	v7-5b/v7-7: Fix free path for C5 and Learner route switching Bug fixes: - Free path now handles C5 (not just C6) for v7 routing - After Learner route switch, old V7 pointers are correctly freed via V7 (instead of being misrouted to legacy) Change: Always try V7 free for SMALL_V7_CLASS_SUPPORTED classes (C5/C6). V7 returns false if ptr is not in V7 segment, allowing proper fallback to legacy for non-V7 pointers. This fix is essential because Learner may dynamically switch C5 from V7→MID_V3, but pointers allocated before the switch still reside in V7 segments and must be freed via V7. Performance (C5/C6 workload 200-500B): - v7 OFF: ~19M ops/s - v7+Learner: ~43M ops/s (+126%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 06:02:13 +09:00
Moe Charm (CI)	6f559e1a1d	v7-7: Implement Learner for dynamic C5 route switching - Add SmallLearnerStatsV7 type + API to policy box - Hook ColdIface refill/retire to collect stats (capacity-based) - Implement C5 route switching: if C5 ratio < 30%, switch to MID_V3 - Version-based TLS cache invalidation for policy updates - Evaluation interval: every 100 refills Tested with c6heavy scenario: C5 ratio=12% triggers V7 → MID_V3 switch 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 05:51:27 +09:00
Moe Charm (CI)	d5aa3110c6	Phase v7-5b: C5+C6 multi-class expansion (+4.3% improvement) - Add C5 (256B blocks) support alongside C6 (512B blocks) - Same segment shared between C5/C6 (page_meta.class_idx distinguishes) - SMALL_V7_CLASS_SUPPORTED() macro for class validation - Extend small_v7_block_size() for C5 (switch statement) A/B Result: C6-only v7 avg 7.64M ops/s → C5+C6 v7 avg 7.97M ops/s (+4.3%) Criteria: C6 protected ✅, C5 net positive ✅, TLS bloat none ✅ ENV: HAKMEM_SMALL_HEAP_V7_CLASSES=0x60 (bit5+bit6) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 05:11:02 +09:00
Moe Charm (CI)	17ceed619c	Phase v7-5a: Hot path stats removal (C6 v7 極限最適化) - Remove per-page stats from hot path (alloc_count, free_count, live_current) - Add ENV-gated global atomic stats (HAKMEM_V7_HOT_STATS) - Stats now collected only at retire time (cold path) - Header write kept at alloc time (freelist overlaps block[0]) A/B Result: -4.3% overhead → ±0% (target: legacy ±2%) v7 OFF avg: 9.26M ops/s, v7 ON avg: 9.27M ops/s (+0.15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 04:51:17 +09:00
Moe Charm (CI)	8143e8b797	Phase v7-4: Policy Box 導入 (L3 層の明確化とフロント芯の作り直し) - SmallPolicyV7 Box: L3 Policy layer に配置、route 決定を一元化 - Route kind enum: SMALL_ROUTE_ULTRA / V7 / MID_V3 / LEGACY - ENV priority (fixed): ULTRA > v7 > MID_v3 > LEGACY - Frontend integration: v7 routing を Policy Box 経由に変更 (段階移行) - Legacy compatibility: 既存の tiny_route_env_box.h は併用維持 Box Theory layer structure: - L0: ULTRA (C4-C7, FROZEN) - L1: SmallObject v7 (research box) - L1': MID_v3 / LEGACY (fallback) - L2: Segment / RegionId - L3: Policy / Stats / Learner ← Policy Box added here Frontend now follows clean "size→class→route_kind→switch" pattern. ENV variables read once at Policy init, not scattered across frontend. Future: ULTRA/MID_v3/LEGACY consolidation, Learner integration, flexible priority. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-12 03:50:58 +09:00
Moe Charm (CI)	2bdf29a9ed	Phase v7-3: TLS segment fast path optimization (RegionIdBox overhead reduction) - SmallHeapCtx_v7: Add TLS segment hints (tls_seg_base/end) for fast bounds check - free fast path: TLS segment hit → skip RegionIdBox binary search - Simplified control flow: removed same-page cache (negligible benefit vs branch cost) - Optimization: O(1) page_idx calculation via bit shift vs O(log N) RegionIdBox lookup Performance improvement: - Phase v7-2: 54.5M ops/s (-7.0% vs 58.6M legacy) - Phase v7-3: 56.3M ops/s (-4.3% vs legacy) - Overhead reduction: 38% (from -7.0% to -4.3%) TLS segment hit path bypasses RegionIdBox for most C6 frees. Remaining -4.3% overhead acceptable for modular v7 architecture. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-12 03:38:39 +09:00
Moe Charm (CI)	39a3c53dbc	Phase v7-2: SmallObject v7 C6-only implementation with RegionIdBox integration - SmallSegment_v7: 2MiB segment with TLS slot and free page stack - ColdIface_v7: Page refill/retire between HotBox and SegmentBox - HotBox_v7: Full C6-only alloc/free with header writing (HEADER_MAGIC\|class_idx) - Free path early-exit: Check v7 route BEFORE ss_fast_lookup (separate mmap segment) - RegionIdBox: Register v7 segment for ptr->region lookup - Benchmark: v7 ON ~54.5M ops/s (-7% overhead vs 58.6M legacy baseline) v7 correctly balances alloc/free counts and page lifecycle. RegionIdBox overhead identified as primary cost driver. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-12 03:12:28 +09:00
Moe Charm (CI)	a8d0ab06fc	MID-V3: Specialize to 257-768B, exclude C7 (ULTRA handles 1KB) Role separation based on ultrathink analysis: - MID v3: 257-768B専用 (C6 only, HAKMEM_MID_V3_CLASSES=0x40) - C7 ULTRA: 769-1024B専用 (existing optimized path) Changes: - core/box/hak_alloc_api.inc.h: Remove C7 route, restrict to 257-768B - core/box/mid_hotbox_v3_env_box.h: Update ENV comments - docs/analysis/MID_POOL_V3_DESIGN.md: Add performance results & role - CURRENT_TASK.md: Document MID-V3 completion & role separation Verified: - 257-768B with v3 ON: 1,199,526 ops/s (+1.7% vs baseline) - 769-1024B with v3 ON: 1,181,254 ops/s (same as baseline, C7 excluded) - C7 correctly routes to ULTRA instead of MID v3 Rationale: C7-only showed -11% regression, but C6/mixed showed +11-19% improvement. Specializing to mid-range (257-768B) leverages v3 strengths while keeping C7 on the proven ULTRA path. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-12 01:14:13 +09:00
Moe Charm (CI)	510cf338f3	MID-V3-6: hakmem.c integration (box modularization) Integrate MID/Pool v3 into hakmem.c main allocation path using box modularization pattern. Changes: - core/hakmem.c: Include MID v3 headers - core/box/hak_alloc_api.inc.h: Add v3 allocation gate - C6 (145-256B) and C7 (769-1024B) size classes - ENV opt-in via HAKMEM_MID_V3_ENABLED + HAKMEM_MID_V3_CLASSES - Priority: v6 > v3 > v4 > pool - core/box/hak_free_api.inc.h: Add v3 free path - RegionIdBox lookup based ownership check - Makefile: Add core/mid_hotbox_v3.o to TINY_BENCH_OBJS_BASE ENV controls (default OFF): HAKMEM_MID_V3_ENABLED=1 HAKMEM_MID_V3_CLASSES=0x40 (C6) HAKMEM_MID_V3_CLASSES=0x80 (C7) HAKMEM_MID_V3_DEBUG=1 Verified with bench_mid_large_mt_hakmem (7-9M ops/s, no crashes) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 01:04:55 +09:00
Moe Charm (CI)	710541b69e	MID-V3 Phase 3-5: RegionId integration, alloc/free implementation - MID-V3-3: RegionId integration (page registration at carve) - mid_segment_v3_carve_page(): Register with RegionIdBox - mid_segment_v3_return_page(): Unregister from RegionIdBox - Uses REGION_KIND_MID_V3 for region identification - MID-V3-4: Allocation fast path implementation - mid_hot_v3_alloc_slow(): Slow path for lane miss - mid_cold_v3_refill_page(): Segment-based page allocation - mid_lane_refill_from_page(): Batch transfer (16 items default) - mid_page_build_freelist(): Initial freelist construction - MID-V3-5: Free/cold path implementation - mid_hot_v3_free(): RegionIdBox lookup based free - mid_page_push_free(): Page freelist push - Local/remote page detection via lane ownership ENV controls (default OFF): HAKMEM_MID_V3_ENABLED=1 HAKMEM_MID_V3_CLASSES=0xC0 (C6+C7) HAKMEM_MID_V3_DEBUG=1 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 00:53:42 +09:00
Moe Charm (CI)	2b35de2123	MID-V3 Phase 0-2: Design doc, type skeleton, and RegionIdBox API - MID-V3-0: Create design doc (docs/analysis/MID_POOL_V3_DESIGN.md) - Lane vs Page role clarification - Phase plan and checklist - MID-V3-1: Type skeleton + ENV - MidHotBoxV3, MidLaneV3, MidPageDescV3 structures - ENV controls (HAKMEM_MID_V3_ENABLED, HAKMEM_MID_V3_CLASSES) - Cold interface declarations - MID-V3-2 (V6-HDR-2): RegionIdBox Registration API completion - RegionEntry structure with sorted array storage - Binary search lookup implementation - region_id_register_v6() / region_id_unregister_v6() - REGION_KIND_MID_V3 added to enum 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 00:46:25 +09:00
Moe Charm (CI)	ce372cfc7e	Phase V6-HDR-4: Headerless 最適化 (P0 + P1) ## P0: Double validation 排除 - region_id_lookup_v6() で TLS segment 登録済み + 範囲内なら small_page_meta_v6_of() を呼ばずに直接 page_meta を計算 - 削除された重複チェック: - slot->in_use (TLS登録で保証) - small_ptr_in_segment_v6() (addr範囲で既にチェック済み) - 関数呼び出しオーバーヘッド - 推定効果: +1-2% (6-8 instructions 削減) ## P1: TLS cache に page_meta キャッシュ追加 - RegionIdTlsCache に追加: - last_page_base / last_page_end (ページ範囲) - last_page (SmallPageMetaV6* 直接ポインタ) - region_id_lookup_cached_v6() で same-page hit 時は page_meta lookup を完全スキップ - 推定効果: +1.5-2.5% (10-12 instructions 削減) ## ベンチマーク結果 (揺れあり) - V6-HDR-3 (P0/P1 前): -3.5% ~ -8.3% 回帰 - V6-HDR-4 (P0+P1 後): +2.7% ~ +12% 改善 (一部の run で) 設計原則: - RegionIdBox は薄く保つ (分類のみ) - キャッシュは TLS 側に寄せる - same-page 判定で last_page_base/end を使用 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-12 00:16:32 +09:00
Moe Charm (CI)	df216b6901	Phase V6-HDR-3: SmallSegmentV6 実割り当て & RegionIdBox Registration 実装内容: 1. SmallSegmentV6のmmap割り当ては既に v6-0で実装済み 2. small_heap_ctx_v6() で segment 取得時に region_id_register_v6_segment() 呼び出し 3. region_id_v6.c に TLS スコープのセグメント登録ロジック実装: - 4つの static __thread 変数でセグメント情報をキャッシュ - region_id_register_v6_segment(): セグメント base/end を TLS に記録 - region_id_lookup_v6(): TLS segment の range check を最初に実行 - TLS cache 更新で O(1) lookup 実現 4. region_id_v6_box.h に SmallSegmentV6 type include & function 宣言追加 5. small_v6_region_observe_validate() に region_id_observe_lookup() 呼び出し追加効果: - HeaderlessデザインでRegionIdBoxが正式にSMALL_V6分類を返せるように - TLS-scopedな簡潔な登録メカニズム (マルチスレッド対応) - Fast path: TLS segment range check -> page_meta lookup - Fall back path: 従来の small_page_meta_v6_of() による動的検出 - Latency: O(1) TLS cache hit rate がv6 alloc/free の大部分をカバー 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-11 23:51:48 +09:00
Moe Charm (CI)	406835feb3	Phase V6-HDR-0: C6-only headerless core 設計確定 - CURRENT_TASK.md: V6-HDR-0 セクション追加（4層 Box Theory） - SMALLOBJECT_CORE_V6_DESIGN.md: V6-HDR-0 設計方針追加 - REGIONID_V6_DESIGN.md: RegionIdBox 設計書新規作成 - smallobject_core_v6_box.h: SmallTlsLaneV6 型＋TLS API 追加 - smallobject_core_v6.c: OBSERVE モード追加 - region_id_v6_box.h: RegionIdBox 型スケルトン - page_stats_v6_box.h: PageStatsV6 箱スケルトン - AGENTS.md: v6 研究箱ルールセクション追加サニティベンチ: Mixed 42.1M, C6-heavy 25.0M（挙動不変確認） 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2025-12-11 23:07:26 +09:00
Moe Charm (CI)	2d684ffd25	Phase SO-BACKEND-OPT-1: v3 backend 分解＆Tiny/ULTRA 完成世代宣言 === 実装内容 === 1. v3 backend 詳細計測 - ENV: HAKMEM_SO_V3_STATS で alloc/free パス内訳計測 - 追加 stats: alloc_current_hit, alloc_partial_hit, free_current, free_partial, free_retire - so_alloc_fast / so_free_fast に埋め込み - デストラクタで [ALLOC_DETAIL] / [FREE_DETAIL] 出力 2. v3 backend ボトルネック分析完了 - C7-only: alloc_current_hit=99.99%, alloc_refill=0.9%, free_retire=0.1%, page_of_fail=0 - Mixed: alloc_current_hit=100%, alloc_refill=0.85%, free_retire=0.07%, page_of_fail=0 - 結論: v3 ロジック部分（ページ選択・retire）は完全最適化済み - 残り 5% overhead は内部コスト（header write, memcpy, 分岐） 3. Tiny/ULTRA 層「完成世代」宣言 - 総括ドキュメント作成: docs/analysis/PERF_EXEC_SUMMARY_ULTRA_PHASE_20251211.md - CURRENT_TASK.md に Phase ULTRA 総括セクション追加 - AGENTS.md に Tiny/ULTRA 完成世代宣言追加 - 最終成果: Mixed 16–1024B = 43.9M ops/s (baseline 30.6M → +43.5%) === ボトルネック地図 === \| 層 \| 関数 \| overhead \| \|-----\|------\|----------\| \| Front \| malloc/free dispatcher \| ~40–45% \| \| ULTRA \| C4–C7 alloc/free/refill \| ~12% \| \| v3 backend \| so_alloc/so_free \| ~5% \| \| mid/pool \| hak_super_lookup \| 3–5% \| === フェーズ履歴（Phase ULTRA cycle） === - Phase PERF-ULTRA-FREE-OPT-1: C4–C7 ULTRA統合 → +9.3% - Phase REFACTOR: Code quality (60行削減) - Phase PERF-ULTRA-REFILL-OPT-1a/1b: C7 ULTRA refill最適化 → +11.1% - Phase SO-BACKEND-OPT-1: v3 backend分解 → 設計限界確認 === 次フェーズ（独立ライン） === 1. Phase SO-BACKEND-OPT-2: v3 header write削減 (1-2%) 2. Headerless/v6系: out-of-band header (1-2%) 3. mid/pool v3新設計: C6-heavy 10M → 20–25M 本フェーズでTiny/ULTRA層は「完成世代」として基盤固定。今後の大きい変更はHeaderless/mid系の独立ラインで検討。 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-11 22:45:14 +09:00
Moe Charm (CI)	fc1c47043c	Phase PERF-ULTRA-REFILL-OPT-1a/1b: C7 ULTRA refill パス最適化実装内容: - Phase 1a: Page size macro化 - TINY_C7_ULTRA_PAGE_SHIFT (16) を定義 - tiny_c7_ultra_page_of で division → bit shift に変更 - refill/free での seg_end 計算を multiplication → bit shift に最適化 - Phase 1b: Segment learning を移動 - segment learning を free初回 → alloc refill時に移動 - free側での unlikely segment_from_ptr call を削除 - normal pattern (alloc → free) での segment既学習を前提ベンチマーク結果（Mixed 16-1024B, 1M iter, ws=400）: - Baseline: 39.5M ops/s - Phase 1a: 39.5M ops/s (誤差範囲) - Phase 1b: 42.3M ops/s - 最終平均: 43.9M ops/s (+11.1% = +4.4M ops/s) tiny_c7_ultra_page_of は計測では同じ値だが、実際には以下が改善: - division コスト削減（数cycle/call） - free時のsegment learning削除（per-thread 1回削減） - refill での計算簡素化これにより全体の refill パス最適化が達成できました。	2025-12-11 22:16:07 +09:00
Moe Charm (CI)	0f15adae4e	Phase ALLOC-GATE-OPT-1: tiny_alloc_gate_fast 統計計測 - AllocGateStats 構造体追加（size2class/route/env/class分布） - malloc_tiny_fast にカウンタ埋め込み - ENV: HAKMEM_ALLOC_GATE_STATS (default 0) - 挙動変更なし（計測のみ）計測結果: - Mixed: total=542k, size2class=0, route_calls=0, env_checks=275k, C4-C7=95.2% - size_to_class/route_for_class は完全削減済み（LUT 効果） - C4-C7 が 95% → ULTRA fast path が有効 - env_checks ≈ c7_calls → C7 ULTRA の ENV gate が毎回呼ばれる - C6-heavy: total=11 → malloc_tiny_fast はほぼ通らない（mid/pool 主体）結論: - alloc gate は既に十分最適化済み（LUT + ULTRA で削減済み） - さらなる最適化余地は小さい（env_checks は軽量化済み、数%以下の効果） - 次フェーズでは free dispatcher (29%) や C7 ULTRA refill (7%) など、他のボトルネックを狙う詳細: docs/analysis/ALLOC_GATE_ANALYSIS.md 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-11 21:32:40 +09:00
Moe Charm (CI)	118c0e4857	Phase FREE-DISPATCHER-OPT-1: free dispatcher 統計計測目的: free dispatcher（29%）の内訳を細分化して計測。実装内容: - FreeDispatchStats 構造体追加（ENV: HAKMEM_FREE_DISPATCH_STATS, default 0） - カウンタ: total_calls / domain (tiny/mid/large) / route (ultra/legacy/pool/v6) / env_checks / route_for_class_calls - hak_free_at / tiny_route_for_class / tiny_route_snapshot_init にカウンタ埋め込み - 挙動変更なし（計測のみ、ENV OFF 時は overhead ゼロ）計測結果: Mixed 16-1024B (1M iter, ws=400): - total=8,081, route_calls=267,967, env_checks=9 - BENCH_FAST_FRONT により大半は早期リターン - route_for_class は主に alloc 側で呼ばれる（267k calls vs 8k frees） - ENV check は初期化時の 9回のみ（snapshot 効果） C6-heavy (257-768B, 1M iter, ws=400): - total=500,099, route_calls=1,034, env_checks=9 - fg_classify_domain に到達する free が多い - route_for_class 呼び出しは極小（snapshot 効果）結論: - ENV check は既に十分最適化されている（初期化時のみ） - route_for_class は alloc 側での呼び出しが主で、free 側は snapshot で O(1) - 次フェーズ（OPT-2）では別のアプローチを検討ドキュメント追加: - docs/analysis/FREE_DISPATCHER_ANALYSIS.md（新規） - CURRENT_TASK.md に Phase FREE-DISPATCHER-OPT-1 セクション追加 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-11 21:21:40 +09:00
Moe Charm (CI)	11dc9d390a	Phase PERF-ULTRA-FREE-OPT-1: C4-C7 ULTRA free 薄型化 - C4-C7 ULTRA free を pure TLS push + cold segment learning に統一 - C7 ULTRA free を同じパターンに整列（likely/unlikely + FREE_PATH_STAT_INC） - C4/C5/C6 ULTRA は既に最適化済み（統一 legacy fallback 経由） - base/user 変換を tiny_ptr_convert_box.h マクロで統一実測値 (Mixed 16-1024B, 1M iter, ws=400): - Baseline (C7 のみ): 42.0M ops/s, legacy=266,943 (49.2%) - Optimized (C4-C7): 46.5M ops/s, legacy=26,025 (4.8%) - 改善: +9.3% (+4M ops/s) FREE_PATH_STATS: - C6 ULTRA: 137,319 free + 137,241 alloc (100% カバー) - C5 ULTRA: 68,871 free + 68,827 alloc (100% カバー) - C4 ULTRA: 34,727 free + 34,696 alloc (100% カバー) - Legacy: 266,943 → 26,025 (−90.2%, C2/C3 のみ) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-11 20:49:39 +09:00

1 2 3 4 5 ...

440 Commits