hakmem

Author	SHA1	Message	Date
Moe Charm (CI)	bdfa32d869	Phase v4-mid-SEGV: Switch C6 v4 to a dedicated SmallSegment and resolve the TinyHeap SEGV. Problem: Because C6 v4 shared pages with TinyHeap, the freelist was corrupted at iters >= 800k → SEGV. Fix: - c6_segment_alloc_page_direct(): C6-dedicated page allocation (via SmallSegment v4, not shared with TinyHeap) - c6_segment_release_page_direct(): C6-dedicated page release - cold_refill_page_v4() branches on C6: uses SmallSegment directly - cold_retire_page_v4() branches on C6: returns pages directly to SmallSegment - Added fastlist state reset handling (L392-399) Results: ✅ SEGV gone at iters=1M, ws <= 390 ✅ C6-only: v4 OFF ~47M → v4 ON ~43M ops/s (−8.5%, stable) ✅ Mixed: no SEGV with v4 ON (small regression accepted) Policy: C6 v4 is now stabilized as a research box. It will not go into the mainline (the existing mid/pool v1 remains in use). 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-11 02:39:32 +09:00
Moe Charm (CI)	e486dd2c55	Phase v4-mid-6: Implement C6 v4 TLS Fastlist (Gated) - Implemented TLS fastlist logic for C6 in smallobject_hotbox_v4.c (alloc/free). - Added SmallC6FastState struct and g_small_c6_fast TLS variable. - Gated the fastlist logic with HAKMEM_SMALL_HEAP_V4_FASTLIST (default OFF) due to observed instability in mixed workloads. - Fixed a memory leak in small_heap_free_fast_v4 fallback path by calling hak_pool_free. - Updated CURRENT_TASK.md with phase report.	2025-12-11 01:44:08 +09:00
Moe Charm (CI)	dd974b49c5	Phase v4-mid-2, v4-mid-3, v4-mid-5: SmallObject HotBox v4 implementation and docs update Implementation: - SmallObject HotBox v4 (core/smallobject_hotbox_v4.c) now fully implements C6-only allocations and frees, including current/partial management and freelist operations. - Cold Iface (tiny_heap based) for page refill/retire is integrated. - Stats instrumentation (v4-mid-5) added to small_heap_alloc_fast_v4 and small_heap_free_fast_v4, with a new header file core/box/smallobject_hotbox_v4_stats_box.h and atexit dump function. Updates: - CURRENT_TASK.md has been condensed and updated with summaries of Phase v4-mid-2 (C6-only v4), Phase v4-mid-3 (C5-only v4 pilot), and the stats implementation (v4-mid-5). - docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md updated with A/B results and conclusions for C6-only and C5-only v4 implementations. - The previous CURRENT_TASK.md content has been archived to CURRENT_TASK_ARCHIVE_20251210.md.	2025-12-11 01:01:15 +09:00
Moe Charm (CI)	3b4449d773	Phase v4-mid-1: C6-only v4 route + page_meta_of() Fail-Fast validation Implementation: - SMALL_SEGMENT_V4_* constants (SIZE=2MiB, PAGE_SIZE=64KiB, MAGIC=0xDEADBEEF) - smallsegment_v4_page_meta_of(): O(1) mask+shift lookup with magic validation - Computes segment base: addr & ~(2MiB - 1) - Verifies SmallSegment magic number - Calculates page_idx: (addr - seg_base) >> PAGE_SHIFT (16) - Returns non-NULL sentinel for now (full page_meta[] in Phase v4-mid-2) Stubs for C6-only phase: - small_heap_alloc_fast_v4(): C6 returns NULL → pool v1 fallback - small_heap_free_fast_v4(): C6 calls page_meta_of() for Fail-Fast, then pool v1 fallback Documentation: - ENV_PROFILE_PRESETS.md: Add "C6_ONLY_SMALLOBJECT_V4" research profile - HAKMEM_SMALL_HEAP_V4_ENABLED=1, HAKMEM_SMALL_HEAP_V4_CLASSES=0x40 - Expected: Throughput ≈ 28–29M ops/s (same as v1) Build: - Build succeeded (warnings only) - Backward compatible, alloc/free stubs fall back to pool v1 Sanity: - C6-heavy with v4 opt-in: no segv/assert - page_meta_of() lookup working correctly - Performance unchanged (expected for stub phase) Status: - C6-only v4 route now available via ENV opt-in - Phase v4-mid-2: SmallHeapCtx v4 full implementation with A/B 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 23:37:45 +09:00
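The mask+shift lookup described in commit 3b4449d773 can be sketched roughly as follows. This is only an illustration under stated assumptions: the 2 MiB segment size, 64 KiB page size and 0xDEADBEEF magic come from the commit message, while the `Sketch*` type, the `sketch_*` functions and the header layout are hypothetical and are not the repository's actual `smallsegment_v4_page_meta_of()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>   /* aligned_alloc/free, used only by callers that build a test segment */

/* Constants as stated in the commit message. */
#define SKETCH_SEG_SIZE   ((uintptr_t)2 << 20)  /* 2 MiB segment */
#define SKETCH_PAGE_SHIFT 16                    /* 64 KiB pages  */
#define SKETCH_SEG_MAGIC  0xDEADBEEFu

/* Hypothetical header placed at the segment base; the real SmallSegment
 * carries more fields (base/num_pages/owner_tid/magic). */
typedef struct {
    uint32_t magic;
} SketchSegmentHeader;

/* Segment base: clear the low 21 bits of the address. */
static inline uintptr_t sketch_segment_base(uintptr_t addr) {
    return addr & ~(SKETCH_SEG_SIZE - 1);
}

/* Page index inside the segment: byte offset from the base, shifted by 16. */
static inline size_t sketch_page_index(uintptr_t addr) {
    return (size_t)((addr - sketch_segment_base(addr)) >> SKETCH_PAGE_SHIFT);
}

/* Fail-fast lookup: returns the page index, or -1 if the magic does not match.
 * The pointer must lie inside mapped memory whose 2 MiB-aligned base is readable. */
static inline int sketch_page_lookup(const void *ptr) {
    assert(ptr != NULL);
    uintptr_t addr = (uintptr_t)ptr;
    const SketchSegmentHeader *seg =
        (const SketchSegmentHeader *)sketch_segment_base(addr);
    if (seg->magic != SKETCH_SEG_MAGIC) {
        return -1;  /* not a segment created by this allocator */
    }
    return (int)sketch_page_index(addr);
}
```

A caller can try this out by obtaining a 2 MiB-aligned block (for example with C11 `aligned_alloc`), writing the magic at its base, and passing any interior pointer to `sketch_page_lookup()`.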
Moe Charm (CI)	52c65da783	Phase v4-mid-0: Small-object v4 type and interface scaffolding (boxed modularization) - Add SmallHeapCtx/SmallPageMeta/SmallClassHeap typedef aliases - Define SmallSegment struct (base/num_pages/owner_tid/magic) in smallsegment_v4_box.h - SmallColdIface_v4 direct function prototypes (refill/retire/remote_push/drain) - Separate internal/public API in smallobject_hotbox_v4.c (small_segment_v4_internal) - Implement direct function stubs (SmallColdIfaceV4 delegate style) - ENV OFF by default (ENABLED=0/CLASSES=0), existing behavior 100% unchanged - Build succeeded and sanity checked (mixed/C6-heavy, no segv/assert) - Recorded Phase v4-mid-0 in CURRENT_TASK.md 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 23:23:07 +09:00
Moe Charm (CI)	2a13478dc7	Optimize C6 heavy and C7 ultra performance analysis with refined design refinements - Update environment profile presets and visibility analysis - Enhance small object and tiny segment v4 box implementations - Refine C7 ultra and C6 heavy allocation strategies - Add comprehensive performance metrics and design documentation 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 22:57:26 +09:00
Moe Charm (CI)	9460785bd6	Enable C7 ULTRA segment path by default	2025-12-10 22:25:24 +09:00
Moe Charm (CI)	bbb55b018a	Add C7 ULTRA segment skeleton and TLS freelist	2025-12-10 22:19:32 +09:00
Moe Charm (CI)	f2ce7256cd	Add v4 C7/C6 fast classify and small-segment v4 scaffolding	2025-12-10 19:14:38 +09:00
Moe Charm (CI)	3261025995	Phase v4-4: pilot C6 v4 route with opt-in gate	2025-12-10 18:18:05 +09:00
Moe Charm (CI)	cbd33511eb	Phase v4-3.1: reuse C7 v4 pages and record prep calls	2025-12-10 17:58:42 +09:00
Moe Charm (CI)	677030d699	Document new Mixed baseline and C7 header dedup A/B	2025-12-10 14:38:49 +09:00
Moe Charm (CI)	d576116484	Document current Mixed baseline throughput and ENV profile	2025-12-10 14:12:13 +09:00
Moe Charm (CI)	406a2f4d26	Incremental improvements: mid_desc cache, pool hotpath optimization, and doc updates Changes: - core/box/pool_api.inc.h: Code organization and micro-optimizations - CURRENT_TASK.md: Updated Phase MD1 (mid_desc TLS cache: +3.2% for C6-heavy) - docs/analysis files: Various analysis and documentation updates - AGENTS.md: Agent role clarifications - TINY_FRONT_V3_FLATTENING_GUIDE.md: Flattening strategy documentation Verification: - random_mixed_hakmem: 44.8M ops/s (1M iterations, 400 working set) - No segfaults or assertions across all benchmark variants - Stable performance across multiple runs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 14:00:57 +09:00
Moe Charm (CI)	0e5a2634bc	Phase 82 Final: Documentation of mid_desc race fix and comprehensive A/B results Implementation Summary: - Early `mid_desc_init_once()` in `hak_pool_init_impl()` prevents uninitialized mutex crash - Eliminates race condition that caused C7_SAFE + flatten crashes - Enables safe operation across all profiles (C7_SAFE, LEGACY) Benchmark Results (C6_HEAVY_LEGACY_POOLV1, Release): - Phase 1 (Baseline): 3.03M / 14.86M / 26.67M ops/s (10K/100K/1M) - Phase 2 (Zero Mode): +5.0% / -2.7% / -0.2% - Phase 3 (Flatten): +3.7% / +6.1% / -5.0% - Phase 4 (Combined): -5.1% / +8.8% / +2.0% (best at 100K: +8.8%) - Phase 5 (C7_SAFE Safety): NO CRASH ✅ (all iterations stable) Mainline Policy: - mid_desc initialization: Always enabled (crash prevention) - Flatten: Default OFF (bench opt-in via HAKMEM_POOL_V1_FLATTEN_ENABLED=1) - Zero Mode: Default FULL (bench opt-in via HAKMEM_POOL_ZERO_MODE=header) - Workload-specific: Medium (100K) benefits most (+8.8%) Documentation Updated: - CURRENT_TASK.md: Added Phase 82 conclusions with benchmark table - MID_LARGE_CPU_HOTPATH_ANALYSIS.md: Added Phase 82 Final with workload analysis 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 09:35:18 +09:00
Moe Charm (CI)	acc64f2438	Phase ML1: Reduce Pool v1 memset overhead of 89.73% (+15.34% improvement) ## Summary - ChatGPT fixed the setenv segfault in bench_profile.h (switched to going through RTLD_NEXT) - New core/box/pool_zero_mode_box.h: unified ZERO_MODE management via an ENV cache - core/hakmem_pool.c controls memset according to zero mode (FULL/header/off) - A/B test result: +15.34% improvement with ZERO_MODE=header (1M iterations, C6-heavy) ## Files Modified - core/box/pool_api.inc.h: include pool_zero_mode_box.h - core/bench_profile.h: glibc setenv → malloc+putenv (avoids the segfault) - core/hakmem_pool.c: zero mode lookup and control logic - core/box/pool_zero_mode_box.h (new): enum/getter - CURRENT_TASK.md: recorded Phase ML1 results ## Test Results \| Iterations \| ZERO_MODE=full \| ZERO_MODE=header \| Improvement \| \|-----------\|----------------\|-----------------\|------------\| \| 10K \| 3.06 M ops/s \| 3.17 M ops/s \| +3.65% \| \| 1M \| 23.71 M ops/s \| 27.34 M ops/s \| +15.34% \| 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-10 09:08:18 +09:00
Moe Charm (CI)	a905e0ffdd	Guard madvise ENOMEM and stabilize pool/tiny front v3	2025-12-09 21:50:15 +09:00
Moe Charm (CI)	e274d5f6a9	pool v1 flatten: break down free fallback causes and normalize mid_desc keys	2025-12-09 19:34:54 +09:00
Moe Charm (CI)	8f18963ad5	Phase 36-37: TinyHotHeap v2 HotBox redesign and C7 current_page policy fixes - Redefine TinyHotHeap v2 as per-thread Hot Box with clear boundaries - Add comprehensive OS statistics tracking for SS allocations - Implement route-based free handling for TinyHeap v2 - Add C6/C7 debugging and statistics improvements - Update documentation with implementation guidelines and analysis - Add new box headers for stats, routing, and front-end management	2025-12-08 21:30:21 +09:00
Moe Charm (CI)	34a8fd69b6	C7 v2: add lease helpers and v2 page reset	2025-12-08 14:40:03 +09:00
Moe Charm (CI)	9502501842	Fix tiny lane success handling for TinyHeap routes	2025-12-07 23:06:50 +09:00
Moe Charm (CI)	a6991ec9e4	Add TinyHeap class mask and extend routing	2025-12-07 22:49:28 +09:00
Moe Charm (CI)	9c68073557	C7 meta-light delta flush threshold and clamp	2025-12-07 22:42:02 +09:00
Moe Charm (CI)	fda6cd2e67	Boxify superslab registry, add bench profile, and document C7 hotpath experiments	2025-12-07 03:12:27 +09:00
Moe Charm (CI)	03538055ae	Restore C7 Warm/TLS carve for release and add policy scaffolding	2025-12-06 01:34:04 +09:00
Moe Charm (CI)	d17ec46628	Fix C7 warm/TLS Release path and unify debug instrumentation	2025-12-05 23:41:01 +09:00
Moe Charm (CI)	093f362231	Add Page Box layer for C7 class optimization - Implement tiny_page_box.c/h: per-thread page cache between UC and Shared Pool - Integrate Page Box into Unified Cache refill path - Remove legacy SuperSlab implementation (merged into smallmid) - Add HAKMEM_TINY_PAGE_BOX_CLASSES env var for selective class enabling - Update bench_random_mixed.c with Page Box statistics Current status: Implementation safe, no regressions. Page Box ON/OFF shows minimal difference - pool strategy needs tuning. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-12-05 15:31:44 +09:00
Moe Charm (CI)	7e3c3d6020	Update CURRENT_TASK after Mid MT removal	2025-12-02 00:53:26 +09:00
Moe Charm (CI)	4ef0171bc0	feat: Add ACE allocation failure tracing and debug hooks This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations. Key changes include: - ACE Tracing Implementation: - Added environment variable to enable/disable detailed logging of allocation failures. - Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure). - Build System Fixes: - Corrected to ensure is properly linked into , resolving an error. - LD_PRELOAD Wrapper Adjustments: - Investigated and understood the wrapper's behavior under , particularly its interaction with and checks. - Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator. - Debugging & Verification: - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed. - Created to facilitate testing of the tracing features. This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.	2025-12-01 16:37:59 +09:00
Moe Charm (CI)	f32d996edb	Update CURRENT_TASK.md: Phase 9-2 Complete (50M ops/s), Phase 10 Planned (Type Safety)	2025-12-01 13:50:46 +09:00
Moe Charm (CI)	3a040a545a	Refactor: Split monolithic hakmem_shared_pool.c into acquire/release modules - Split core/hakmem_shared_pool.c into acquire/release modules for maintainability. - Introduced core/hakmem_shared_pool_internal.h for shared internal API. - Fixed incorrect function name usage (superslab_alloc -> superslab_allocate). - Increased SUPER_REG_SIZE to 1M to support large working sets (Phase 9-2 fix). - Updated Makefile. - Verified with benchmarks.	2025-11-30 18:11:08 +09:00
Moe Charm (CI)	f7d2348751	Update current task for Phase 9-2 SuperSlab unification	2025-11-30 11:02:39 +09:00
Moe Charm (CI)	4ad3223f5b	docs: Update CURRENT_TASK.md and claude.md for Phase 8 completion Phase 8 Complete: BenchFast crash root cause fixes Documentation updates: 1. CURRENT_TASK.md: - Phase 8 complete (TLS→Atomic + Header write fixes) - Box Theory root cause analysis (3 critical bugs) - Next phase recommendations (Option C: BenchFast pool expansion) - Detailed technical explanations for each layer 2. .claude/claude.md: - Phase 8 achievement summary - Box Theory 4-principle validation - Commit references (`191e65983`, `da8f4d2c8`) Key Fixes Documented: - TLS→Atomic: Cross-thread guard variable (pthread_once bug) - Header Write: Direct write bypasses P3 optimization (free routing) - Infrastructure Isolation: __libc_calloc for cache arrays - Design Fix: Removed unified_cache_init() call 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-30 05:50:43 +09:00
Moe Charm (CI)	d2d4737d1c	Update CURRENT_TASK.md: Phase 7-Step4 complete (+55.5% total improvement!) Updated: - Status: Phase 7 Step 1-3 → Step 1-4 (complete) - Achievement: +54.2% → +55.5% total (+1.1% from Step 4) - Performance: 52.3M → 81.5M ops/s (+29.2M ops/s total) Phase 7-Step4 Summary: - Replace 3 runtime checks with config macros in hot path - Dead code elimination in PGO mode (bench builds) - Performance: 80.6M → 81.5M ops/s (+1.1%, +0.9M ops/s) Macro Replacements: 1. `g_fastcache_enable` → `TINY_FRONT_FASTCACHE_ENABLED` (line 421) 2. `tiny_heap_v2_enabled()` → `TINY_FRONT_HEAP_V2_ENABLED` (line 809) 3. `ultra_slim_mode_enabled()` → `TINY_FRONT_ULTRA_SLIM_ENABLED` (line 757) Dead Code Eliminated (PGO mode): - FastCache path: fastcache_pop() + hit/miss tracking - Heap V2 path: tiny_heap_v2_alloc_by_class() + metrics - Ultra SLIM path: ultra_slim_alloc_with_refill() early return Cumulative Phase 7 Results: - Step 1: Branch hint reversal (+54.2%) - Step 2: PGO mode infrastructure (neutral) - Step 3: Config box integration (neutral) - Step 4: Macro replacement (+1.1%) - Total: +55.5% improvement (52.3M → 81.5M ops/s) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 17:05:54 +09:00
Moe Charm (CI)	09942d5a08	Update CURRENT_TASK.md: Phase 7-Step3 complete (config box integration) Updated: - Status: Phase 7 Step 1-2 → Step 1-3 (complete) - Completed Steps: Added Step 3 (Config box integration) - Benchmark Results: Added Step 3 result (80.6 M ops/s, maintained) - Technical Details: Added Phase 7-Step3 section with implementation details Phase 7-Step3 Summary: - Include tiny_front_config_box.h (dead code elimination infrastructure) - Add wrapper functions: tiny_fastcache_enabled(), sfc_cascade_enabled() - Performance: 80.6 M ops/s (no regression, infrastructure-only change) - Foundation for Steps 4-7 (replace runtime checks with compile-time macros) Remaining Steps (updated): - Step 4: Replace runtime checks → config macros (~20 lines) - Step 5: Compile library with PGO flag (Makefile change) - Step 6: Verify dead code elimination in assembly - Step 7: Measure performance (+5-10% expected) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 16:35:29 +09:00
Moe Charm (CI)	0e191113ed	Update CURRENT_TASK.md: Phase 7 complete (+54.2% improvement!)	2025-11-29 16:20:58 +09:00
Moe Charm (CI)	1468efadd7	Update CURRENT_TASK.md: Phase 6 complete, next phase selection	2025-11-29 15:53:05 +09:00
Moe Charm (CI)	d4d415115f	Phase 5: Documentation & Task Update (COMPLETE) Phase 5 Mid/Large Allocation Optimization complete with major success. Achievement: - Mid MT allocations (1KB-8KB): +28.9x improvement (1.49M → 41.0M ops/s) - vs System malloc: 1.53x faster (41.0 vs 26.8 M ops/s) - Mid Free Route Box: Fixed 19x free() slowdown via dual-registry routing Files: - PHASE5_COMPLETION_REPORT.md (NEW) - Full completion report with technical details - CURRENT_TASK.md - Updated with Phase 5 completion and next phase options Completed Steps: - Step 1: Mid MT Verification (range bug identified) - Step 2: Mid Free Route Box (+28.9x improvement) - Step 3: Mid/Large Config Box (future workload infrastructure) - Step 4: Deferred (MT workload needed) - Step 5: Documentation (this commit) Next Phase Options: - Option A: Investigate bench_random_mixed regression - Option B: PGO re-enablement (recommended, +6.25% proven) - Option C: Expand Tiny Front Config Box - Option D: Production readiness & benchmarking - Option E: Multi-threaded optimization See PHASE5_COMPLETION_REPORT.md for full technical details and CURRENT_TASK.md for next phase recommendations. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 14:46:54 +09:00
Moe Charm (CI)	3cc7b675df	docs: Start Phase 5 - Mid/Large Allocation Optimization Update CURRENT_TASK.md with Phase 5 roadmap: - Goal: +10-26% improvement (57.2M → 63-72M ops/s) - Strategy: Fix allocation gap + Config Box + Mid MT optimization - Duration: 12 days / 2 weeks Phase 5 Steps: 1. Mid MT Verification (2 days) 2. Allocation Gap Elimination (3 days) - Priority 1 3. Mid/Large Config Box (3 days) 4. Mid Registry Pre-allocation (2 days) 5. Documentation & Benchmark (2 days) Critical Issue Found: - 1KB-8KB allocations fall through to mmap() when ACE disabled - Impact: 1000-5000x slower than O(1) allocation - Fix: Route through existing Mid MT allocator Phase 4 Complete: - Result: 53.3M → 57.2M ops/s (+7.3%) - PGO deferred to final optimization phase 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 12:30:29 +09:00
Moe Charm (CI)	9bc26be3bb	docs: Add Phase 4-Step3 completion report Document Config Box implementation results: - Performance: +2.7-4.9% (50.3 → 52.8 M ops/s) - Scope: 1 config function, 2 call sites - Target: Partially achieved (below +5-8% due to limited scope) Updated CURRENT_TASK.md: - Marked Step 3 as complete ✅ - Documented actual results vs. targets - Listed next action options 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 12:20:34 +09:00
Moe Charm (CI)	14e781cf60	docs: Add Phase 4-Step2 completion report Documented Hot/Cold Path Box implementation and results: - Performance: +7.3% improvement (53.3 → 57.2 M ops/s) - Branch reduction: 4-5 → 1 (hot path) - Design principles, benchmarks, technical analysis included Updated CURRENT_TASK.md with Step 2 completion status. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 12:00:27 +09:00
Moe Charm (CI)	b51b600e8d	Phase 4-Step1: Add PGO workflow automation (+6.25% performance) Implemented automated Profile-Guided Optimization workflow using Box pattern: Performance Improvement: - Baseline: 57.0 M ops/s - PGO-optimized: 60.6 M ops/s - Gain: +6.25% (within expected +5-10% range) Implementation: 1. scripts/box/pgo_tiny_profile_config.sh - 5 representative workloads 2. scripts/box/pgo_tiny_profile_box.sh - Automated profile collection 3. Makefile PGO targets: - pgo-tiny-profile: Build instrumented binaries - pgo-tiny-collect: Collect .gcda profile data - pgo-tiny-build: Build optimized binaries - pgo-tiny-full: Complete workflow (profile → collect → build → test) 4. Makefile help target: Added PGO instructions for discoverability Design: - Boxified: Single responsibility, clear contracts - Deterministic: Fixed seeds (42) for reproducibility - Safe: Validation, error detection, timeout protection (30s/workload) - Observable: Progress reporting, .gcda verification (33 files generated) Workload Coverage: - Random mixed: 3 working set sizes (128/256/512 slots) - Tiny hot: 2 size classes (16B/64B) - Total: 5 workloads covering hot/cold paths Documentation: - PHASE4_STEP1_COMPLETE.md - Completion report - CURRENT_TASK.md - Phase 4 roadmap (Step 1 complete ✓) - docs/design/PHASE4_TINY_FRONT_BOX_DESIGN.md - Complete Phase 4 design Next: Phase 4-Step2 (Hot/Cold Path Box, target +10-15%) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 11:28:38 +09:00
Moe Charm (CI)	a9ddb52ad4	ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s) Phase 1 complete: environment variable cleanup + fprintf debug guards ENV variables removed (BG/HotMag family): - core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines) - core/hakmem_tiny_bg_spill.c: removed BG spill ENV - core/tiny_refill.h: BG remote changed to fixed values - core/hakmem_tiny_slow.inc: removed BG refs fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE): - core/hakmem_shared_pool.c: Lock stats (~18 fprintf) - core/page_arena.c: Init/Shutdown/Stats (~27 fprintf) - core/hakmem.c: SIGSEGV init message Documentation cleanup: - Deleted 328 markdown files (old reports and duplicate docs) Performance check: - Larson: 52.35M ops/s (previously 52.8M, stable ✅) - No functional impact from the ENV cleanup - Some debug output remains (to be addressed in the next phase) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-26 14:45:26 +09:00
Moe Charm (CI)	6b38bc840e	Cleanup: Remove unused hakmem_libc.c (duplicate of hakmem_syscall.c) - File was not included in Makefile OBJS_BASE - Functions already implemented in hakmem_syscall.c - Size: 361 bytes removed	2025-11-26 13:03:17 +09:00
Moe Charm (CI)	bcfb4f6b59	Remove dead code: UltraHot, RingCache, FrontC23, Class5 Hotpath (cherry-picked from 225b6fcc7, conflicts resolved)	2025-11-26 12:33:49 +09:00
Moe Charm (CI)	6baf63a1fb	Documentation: Phase 12-1.1 Results + Phase 19 Frontend Strategy ## Phase 12-1.1 Summary (Box Theory + EMPTY Slab Reuse) ### Box Theory Refactoring (Complete) - hakmem_tiny.c: 2081行 → 562行 (-73%) - 12 modules extracted across 3 phases - Commit: `4c33ccdf8` ### Phase 12-1.1: EMPTY Slab Detection (Complete) - Implementation: empty_mask + immediate detection on free - Performance: +1.3% average, +14.9% max (22.9M → 23.2M ops/s) - Commit: `6afaa5703` ### Key Findings Stage Statistics (HAKMEM_SHARED_POOL_STAGE_STATS=1): ``` Class 6 (256B): Stage 1 (EMPTY): 95.1% ← Already super-efficient! Stage 2 (UNUSED): 4.7% Stage 3 (new SS): 0.2% ← Bottleneck already resolved ``` Conclusion: Backend optimization (SS-Reuse) is saturated. Task-sensei's assumption (Stage 3: 87-95%) does not hold. Phase 12 Shared Pool already works. Next bottleneck: Frontend fast path (31ns vs mimalloc 9ns = 3.4x slower) --- ## Phase 19: Frontend Fast Path Optimization (Next Implementation) ### Strategy Shift ChatGPT-sensei Priority 2 → Priority 1 (promoted based on Phase 12-1.1 results) ### Target - Current: 31ns (HAKMEM) vs 9ns (mimalloc) - Goal: 31ns → 15ns (-50%) for 22M → 40M ops/s ### Hit Rate Analysis (Premise) ``` HeapV2: 88-99% (primary) UltraHot: 0-12% (limited) FC/SFC: 0% (unused) ``` → Layers other than HeapV2 are prune candidates --- ## Phase 19-1: Quick Prune (Branch Pruning) - 🚀 Highest Priority Goal: Skip unused frontend layers, simplify to HeapV2 → SLL → SS path Implementation: - File: `core/tiny_alloc_fast.inc.h` - Method: Early return gate at front entry point - ENV: `HAKMEM_TINY_FRONT_SLIM=1` Features: - ✅ Existing code unchanged (bypass only) - ✅ A/B gate (ENV=0 instant rollback) - ✅ Minimal risk Expected: 22M → 27-30M ops/s (+22-36%) --- ## Phase 19-2: Front-V2 (tcache Single-Layer) - ⚡ Main Event Goal: Unify frontend to tcache-style (1-layer per-class magazine) Design: ```c // New file: core/front/tiny_heap_v2.h typedef struct { void* items[32]; // cap 32 
(tunable) uint8_t top; // stack top index uint8_t class_idx; // bound class } TinyFrontV2; // Ultra-fast pop (1 branch + 1 array lookup + 1 instruction) static inline void* front_v2_pop(int class_idx); static inline int front_v2_push(int class_idx, void* ptr); static inline int front_v2_refill(int class_idx); ``` Fast Path Flow: ``` ptr = front_v2_pop(class_idx) // 1 branch + 1 array lookup → empty? → front_v2_refill() → retry → miss? → backend fallback (SLL/SS) ``` Target: C0-C3 (hot classes), C4-C5 off ENV: `HAKMEM_TINY_FRONT_V2=1`, `HAKMEM_FRONT_V2_CAP=32` Expected: 30M → 40M ops/s (+33%) --- ## Phase 19-3: A/B Testing & Metrics Metrics: - `g_front_v2_hits[TINY_NUM_CLASSES]` - `g_front_v2_miss[TINY_NUM_CLASSES]` - `g_front_v2_refill_count[TINY_NUM_CLASSES]` ENV: `HAKMEM_TINY_FRONT_METRICS=1` Benchmark Order: 1. Short run (100K) - SEGV/regression check 2. Latency measurement (500K) - 31ns → 15ns goal 3. Larson short run - MT stability check --- ## Implementation Timeline ``` Week 1: Phase 19-1 Quick Prune - Add gate to tiny_alloc_fast.inc.h - Implement HAKMEM_TINY_FRONT_SLIM=1 - 100K short test - Performance measurement (expect: 22M → 27-30M) Week 2: Phase 19-2 Front-V2 Design - Create core/front/tiny_heap_v2.{h,c} - Implement front_v2_pop/push/refill - C0-C3 integration test Week 3: Phase 19-2 Front-V2 Integration - Add Front-V2 path to tiny_alloc_fast.inc.h - Implement HAKMEM_TINY_FRONT_V2=1 - A/B benchmark Week 4: Phase 19-3 Optimization - Magazine capacity tuning (16/32/64) - Refill batch size adjustment - Larson/MT stability confirmation ``` --- ## Expected Final Performance ``` Baseline (Phase 12-1.1): 22M ops/s Phase 19-1 (Slim): 27-30M ops/s (+22-36%) Phase 19-2 (V2): 40M ops/s (+82%) ← Goal System malloc: 78M ops/s (reference) Gap closure: 28% → 51% (major improvement!) ``` --- ## Summary Today's Achievements (2025-11-21): 1. ✅ Box Theory Refactoring (3 phases, -73% code size) 2. ✅ Phase 12-1.1 EMPTY Slab Reuse (+1-15% improvement) 3. 
✅ Stage statistics analysis (identified frontend as true bottleneck) 4. ✅ Phase 19 strategy documentation (ChatGPT-sensei plan) Next Session: - Phase 19-1 Quick Prune implementation - ENV gate + early return in tiny_alloc_fast.inc.h - 100K short test + performance measurement --- 📝 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: ChatGPT (Phase 19 strategy design) Co-Authored-By: Task-sensei (Phase 12-1.1 investigation)	2025-11-21 05:16:35 +09:00
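The single-layer magazine planned for Phase 19-2 in commit 6baf63a1fb might look roughly like the sketch below. It follows the `TinyFrontV2` layout quoted in the message (a 32-entry pointer array plus a stack-top index), but the `Sketch*` and `sketch_*` names and the pointer-based signatures are assumptions for illustration; the commit only declares `front_v2_pop/push/refill` taking a `class_idx`, and refill is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_FRONT_CAP 32  /* capacity from the commit's design note (tunable there) */

/* Hypothetical per-class magazine mirroring the TinyFrontV2 fields. */
typedef struct {
    void   *items[SKETCH_FRONT_CAP];
    uint8_t top;        /* number of cached pointers; items[top-1] is the next pop */
    uint8_t class_idx;  /* size class this magazine is bound to */
} SketchFrontV2;

/* Pop: one emptiness branch plus one array load. NULL means the caller
 * should refill or fall back to the backend. */
static inline void *sketch_front_pop(SketchFrontV2 *f) {
    assert(f != NULL);
    if (f->top == 0) {
        return NULL;
    }
    return f->items[--f->top];
}

/* Push: returns 1 on success, 0 when the magazine is full so the caller
 * can spill the pointer to the backend instead. */
static inline int sketch_front_push(SketchFrontV2 *f, void *ptr) {
    assert(f != NULL);
    if (f->top >= SKETCH_FRONT_CAP) {
        return 0;
    }
    f->items[f->top++] = ptr;
    return 1;
}
```

The LIFO order is a deliberate choice in this sketch: the most recently freed block is handed out first, which tends to keep it cache-warm.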
Moe Charm (CI)	6afaa5703a	Phase 12-1.1: EMPTY Slab Detection + Immediate Reuse (+13% improvement, 10.2M→11.5M ops/s) Implementation of Task-sensei Priority 1 recommendation: Add empty_mask to SuperSlab for immediate EMPTY slab detection and reuse, reducing Stage 3 (mmap) overhead. ## Changes ### 1. SuperSlab Structure (core/superslab/superslab_types.h) - Added `empty_mask` (uint32_t): Bitmap for EMPTY slabs (used==0) - Added `empty_count` (uint8_t): Quick check for EMPTY slab availability ### 2. EMPTY Detection API (core/box/ss_hot_cold_box.h) - Added `ss_is_slab_empty()`: Returns true if slab is completely EMPTY - Added `ss_mark_slab_empty()`: Marks slab as EMPTY (highest reuse priority) - Added `ss_clear_slab_empty()`: Removes EMPTY state when reactivated - Updated `ss_update_hot_cold_indices()`: Classify EMPTY/Hot/Cold slabs - Updated `ss_init_hot_cold()`: Initialize empty_mask/empty_count ### 3. Free Path Integration (core/box/free_local_box.c) - After `meta->used--`, check if `meta->used == 0` - If true, call `ss_mark_slab_empty()` to update empty_mask - Enables immediate EMPTY detection on every free operation ### 4. 
Shared Pool Stage 0.5 (core/hakmem_shared_pool.c) - New Stage 0.5 before Stage 1: Scan existing SuperSlabs for EMPTY slabs - Iterate over `g_super_reg_by_class[class_idx][]` (first 16 entries) - Check `ss->empty_count > 0` → scan `empty_mask` with `__builtin_ctz()` - Reuse EMPTY slab directly, avoiding Stage 3 (mmap/lock overhead) - ENV control: `HAKMEM_SS_EMPTY_REUSE=1` (default OFF for A/B testing) - ENV tunable: `HAKMEM_SS_EMPTY_SCAN_LIMIT=N` (default 16 SuperSlabs) ## Performance Results ``` Benchmark: Random Mixed 256B (100K iterations) OFF (default): 10.2M ops/s (baseline) ON (ENV=1): 11.5M ops/s (+13.0% improvement) ✅ ``` ## Expected Impact (from Task-sensei analysis) Current bottleneck: - Stage 1: 2-5% hit rate (free list broken) - Stage 2: 3-8% hit rate (rare UNUSED) - Stage 3: 87-95% hit rate (lock + mmap overhead) ← bottleneck Expected with Phase 12-1.1: - Stage 0.5: 20-40% hit rate (EMPTY scan) - Stage 1-2: 20-30% hit rate (combined) - Stage 3: 30-50% hit rate (significantly reduced) Theoretical max: 25M → 55-70M ops/s (+120-180%) ## Current Gap Analysis Observed: 11.5M ops/s (+13%) Expected: 55-70M ops/s (+120-180%) Gap: Performance regression or missing complementary optimizations Possible causes: 1. Phase 3d-C (25.1M→10.2M) regression - unrelated to this change 2. EMPTY scan overhead (16 SuperSlabs × empty_count check) 3. Missing Priority 2-5 optimizations (Lazy SS deallocation, etc.) 4. Stage 0.5 too conservative (scan_limit=16, should be higher?) 
## Usage ```bash # Enable EMPTY reuse optimization export HAKMEM_SS_EMPTY_REUSE=1 # Optional: increase scan limit (trade-off: throughput vs latency) export HAKMEM_SS_EMPTY_SCAN_LIMIT=32 ./bench_random_mixed_hakmem 100000 256 42 ``` ## Next Steps Priority 1-A: Investigate Phase 3d-C→12-1.1 regression (25.1M→10.2M) Priority 1-B: Implement Phase 12-1.2 (Lazy SS deallocation) for complementary effect Priority 1-C: Profile Stage 0.5 overhead (scan_limit tuning) ## Files Modified Core implementation: - `core/superslab/superslab_types.h` - empty_mask/empty_count fields - `core/box/ss_hot_cold_box.h` - EMPTY detection/marking API - `core/box/free_local_box.c` - Free path EMPTY detection - `core/hakmem_shared_pool.c` - Stage 0.5 EMPTY scan Documentation: - `CURRENT_TASK.md` - Task-sensei investigation report --- 🎯 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> Co-Authored-By: Task-sensei (investigation & design analysis)	2025-11-21 04:56:48 +09:00
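The Stage 0.5 scan from commit 6afaa5703a (check `empty_count`, then find a set bit in `empty_mask` with `__builtin_ctz()`) can be sketched as below. The two field names and the use of `__builtin_ctz()` (a GCC/Clang builtin) come from the commit message; the `SketchSuperSlab` struct, the `sketch_*` helpers and the flat registry array are hypothetical simplifications, not the repository's `ss_mark_slab_empty()` or shared-pool code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced SuperSlab: only the two fields the commit adds. */
typedef struct {
    uint32_t empty_mask;   /* bit i set => slab i has used == 0 */
    uint8_t  empty_count;  /* number of set bits, for a cheap "any empty?" test */
} SketchSuperSlab;

/* Mark a slab EMPTY (called from the free path when used drops to 0). */
static inline void sketch_mark_empty(SketchSuperSlab *ss, unsigned slab_idx) {
    assert(slab_idx < 32);
    uint32_t bit = UINT32_C(1) << slab_idx;
    if (!(ss->empty_mask & bit)) {
        ss->empty_mask |= bit;
        ss->empty_count++;
    }
}

/* Clear the EMPTY state when the slab is reactivated. */
static inline void sketch_clear_empty(SketchSuperSlab *ss, unsigned slab_idx) {
    assert(slab_idx < 32);
    uint32_t bit = UINT32_C(1) << slab_idx;
    if (ss->empty_mask & bit) {
        ss->empty_mask &= ~bit;
        ss->empty_count--;
    }
}

/* Scan at most scan_limit registry entries. On a hit, claim the lowest-numbered
 * EMPTY slab, store the SuperSlab index in *ss_idx_out and return the slab index;
 * return -1 if nothing is found within the limit. */
static inline int sketch_find_empty(SketchSuperSlab *reg, int n, int scan_limit,
                                    int *ss_idx_out) {
    int limit = n < scan_limit ? n : scan_limit;
    for (int i = 0; i < limit; i++) {
        if (reg[i].empty_count == 0) {
            continue;  /* quick reject; the mask is not inspected */
        }
        /* empty_count > 0 implies a non-zero mask, so ctz is well defined. */
        int slab = __builtin_ctz(reg[i].empty_mask);
        sketch_clear_empty(&reg[i], (unsigned)slab);
        *ss_idx_out = i;
        return slab;
    }
    return -1;
}
```

The `scan_limit` parameter plays the role the commit assigns to `HAKMEM_SS_EMPTY_SCAN_LIMIT`: a larger limit finds more reusable slabs at the cost of a longer scan on every miss.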
Moe Charm (CI)	4c33ccdf86	Box Theory Refactoring - Phase 1-3 Complete: hakmem_tiny.c 73% reduction (2081→562 lines)

ULTRATHINK SUMMARY: 3-phase systematic refactoring of monolithic hakmem_tiny.c using Box Theory modular design principles. Achieved 73% size reduction while maintaining build stability and functional correctness.

## Achievement Summary

- Total Reduction: 2081 lines → 562 lines (-1519 lines, -73%)
- Modules Extracted: 12 box modules (config, publish, globals, legacy_slow, slab_lookup, ss_active, eventq, sll_cap, ultra_batch + 3 more from Phase 1-2)
- Build Success: 100% (all phases, all modules)
- Performance Impact: -10% (Phase 1 only, acceptable for design phase)
- Stability: No crashes, all tests passing

## Phase Breakdown

### Phase 1: ChatGPT Initial Split (2081 → 1456 lines, -30%)

Extracted foundational modules:
- config_box.inc (211 lines): Size class tables, debug counters, benchmark macros
- publish_box.inc (419 lines): Publish/Adopt stats, TLS helpers, live cap mgmt

Commit: `6b6ad69ac`
Strategy: Low-risk infrastructure modules first

### Phase 2: Claude Conservative Extraction (1456 → 616 lines, -58%)

Extracted core architectural modules:
- globals_box.inc (256 lines): Global pool, TLS vars, adopt_gate_try()
- legacy_slow_box.inc (96 lines): Legacy slab allocation (cold/unused path)
- slab_lookup_box.inc (77 lines): O(1) registry lookup, owner slab discovery

Commit: `922eaac79`
Strategy: Dependency-light core modules, build verification after each

### Phase 3: Task-Sensei Analysis + Conservative Extraction (616 → 562 lines, -9%)

Extracted helper modules based on rigorous dependency analysis:
- ss_active_box.inc (6 lines): SuperSlab active counter helpers (LOW risk)
- eventq_box.inc (32 lines): Event queue push, thread ID compression (LOW risk)
- sll_cap_box.inc (12 lines): SLL capacity policy (hot/cold classes) (LOW risk)
- ultra_batch_box.inc (20 lines): Ultra batch size policy + override (LOW risk)

Commit: `287845913`
Strategy: Task-sensei risk analysis, extract LOW-risk only, skip MEDIUM-risk

## Box Theory Implementation Pattern

Extraction follows consistent pattern:
1. Identify coherent functional block (e.g., active counter helpers)
2. Extract to .inc file (preserves static/TLS linkage in same translation unit)
3. Replace with #include directive in hakmem_tiny.c
4. Add forward declarations as needed for circular dependencies
5. Build + verify before next extraction

Example:
```c
// Before (hakmem_tiny.c)
static inline void ss_active_add(SuperSlab* ss, uint32_t n) {
    atomic_fetch_add_explicit(&ss->total_active_blocks, n, memory_order_relaxed);
}

// After (hakmem_tiny.c)
#include "hakmem_tiny_ss_active_box.inc"
```

Benefits:
- ✅ Same translation unit (.inc) → static/TLS variables work correctly
- ✅ Forward declarations resolve circular dependencies
- ✅ Clear module boundaries (future .c migration possible)
- ✅ Incremental refactoring maintains build stability

## Lessons Learned (Failed Attempts)

### Attempt 1: lifecycle.inc → lifecycle.c separation

Problem: Complex dependencies (g_tls_lists, g_empty_lock), massive helper copying
Resolution: Reverted, .inc pattern is correct for high-dependency modules

### Attempt 2: Aggressive 6-module extraction (Phase 3 first try)

Problem: helpers_box undefined symbols (g_use_superslab), dependency ordering
Resolution: Reverted, requested Task-sensei analysis → extract LOW-risk only

### Key Lessons:

1. Dependency analysis first - Task-sensei risk assessment prevents failures
2. Small batch extraction - 1-4 modules at a time, verify each build
3. .inc pattern validity - Don't force .c separation, prioritize boundary clarity

## Remaining Work (Deferred)

MEDIUM-risk candidates identified by Task-sensei (skipped this round):
- Candidate 5: Hot/Cold judgment helpers (12 lines) - is_hot_class()
- Candidate 6: Frontend helpers (18 lines) - tiny_optional_push()

Recommendation: Extract after performance optimization phase completes (currently in design refinement stage, prioritize functionality over structure)

## Impact Assessment

Readability: ✅ Major improvement (2081 → 562 lines, clear module boundaries)
Maintainability: ✅ Improved (change sites easy to locate)
Build Time: No impact (.inc = same translation unit)
Performance: -10% Phase 1 only, Phases 2-3 no impact (acceptable for design)
Stability: ✅ All builds successful, no crashes

## Methodology Highlights

Collaboration: ChatGPT (Phase 1) + Claude (Phase 2-3) + Task-sensei (analysis)
Verification: Build after every extraction, no batch commits without verification
Risk Management: Task-sensei dependency analysis → LOW-risk priority queue
Rollback Strategy: Git revert for failed attempts, learn and retry conservatively

## Files Modified

Core extractions:
- core/hakmem_tiny.c (2081 → 562 lines, -73%)
- core/hakmem_tiny_config_box.inc (211 lines, new)
- core/hakmem_tiny_publish_box.inc (419 lines, new)
- core/hakmem_tiny_globals_box.inc (256 lines, new)
- core/hakmem_tiny_legacy_slow_box.inc (96 lines, new)
- core/hakmem_tiny_slab_lookup_box.inc (77 lines, new)
- core/hakmem_tiny_ss_active_box.inc (6 lines, new)
- core/hakmem_tiny_eventq_box.inc (32 lines, new)
- core/hakmem_tiny_sll_cap_box.inc (12 lines, new)
- core/hakmem_tiny_ultra_batch_box.inc (20 lines, new)

Documentation:
- CURRENT_TASK.md (comprehensive refactoring summary added)

## Next Steps

Priority 1: Phase 3d-D alternative (Hot-priority refill optimization)
Priority 2: Phase 12 Shared SuperSlab Pool (fundamental performance fix)
Priority 3: Remaining MEDIUM-risk module extraction (post-optimization)

---

🎨 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: ChatGPT (Phase 1 initial extraction)	2025-11-21 03:42:36 +09:00
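The `ss_active_add` helper quoted in the commit above depends on a `SuperSlab` type that is not shown. As a minimal sketch, the helper can be compiled on its own by assuming a one-field stand-in struct. That struct layout and the `demo_ss_active_sub` counterpart are assumptions made for illustration, not code from the repository.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed, reduced stand-in: the real SuperSlab has many more fields. */
typedef struct SuperSlab {
    _Atomic uint32_t total_active_blocks;
} SuperSlab;

/* Helper as quoted in the commit message: relaxed atomic increment. */
static inline void ss_active_add(SuperSlab* ss, uint32_t n) {
    atomic_fetch_add_explicit(&ss->total_active_blocks, n, memory_order_relaxed);
}

/* Hypothetical counterpart: relaxed decrement, returns the new count. */
static inline uint32_t demo_ss_active_sub(SuperSlab* ss, uint32_t n) {
    uint32_t prev = atomic_fetch_sub_explicit(&ss->total_active_blocks, n,
                                              memory_order_relaxed);
    assert(prev >= n);  /* the counter must never underflow */
    return prev - n;
}
```

Because both helpers are `static inline`, placing them in a `.inc` file that is textually included keeps them in the same translation unit as their callers, which is the property the commit relies on.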
Moe Charm (CI)	5b36c1c908	Phase 26: Front Gate Unification - Tiny allocator fast path (+12.9%)

Implementation:
- New single-layer malloc/free path for Tiny (≤1024B) allocations
- Bypasses 3-layer overhead: malloc → hak_alloc_at (236 lines) → wrapper → tiny_alloc_fast
- Leverages Phase 23 Unified Cache (tcache-style, 2-3 cache misses)
- Safe fallback to normal path on Unified Cache miss

Performance (Random Mixed 256B, 100K iterations):
- Baseline (Phase 26 OFF): 11.33M ops/s
- Phase 26 ON: 12.79M ops/s (+12.9%)
- Prediction (ChatGPT): +10-15% → Actual: +12.9% (perfect match!)

Bug fixes:
- Initialization bug: Added hak_init() call before fast path
- Page boundary SEGV: Added guard for offset_in_page == 0

Also includes Phase 23 debug log fixes:
- Guard C2_CARVE logs with #if !HAKMEM_BUILD_RELEASE
- Guard prewarm logs with #if !HAKMEM_BUILD_RELEASE
- Set Hot_2048 as default capacity (C2/C3=2048, others=64)

Files:
- core/front/malloc_tiny_fast.h: Phase 26 implementation (145 lines)
- core/box/hak_wrappers.inc.h: Fast path integration (+28 lines)
- core/front/tiny_unified_cache.h: Hot_2048 default
- core/tiny_refill_opt.h: C2_CARVE log guard
- core/box/ss_hot_prewarm_box.c: Prewarm log guard
- CURRENT_TASK.md: Phase 26 completion documentation

ENV variables:
- HAKMEM_FRONT_GATE_UNIFIED=1 (enable Phase 26, default: OFF)
- HAKMEM_TINY_UNIFIED_CACHE=1 (Phase 23, required)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 05:29:08 +09:00
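The Phase 26 commit above describes a single-layer path for Tiny (≤1024B) requests that falls back to the normal path on a cache miss. One way this shape could look is sketched below. Every `demo_*` name is hypothetical, the one-slot "cache" merely stands in for the Unified Cache, and the slow path is plain `malloc`. None of this is taken from `core/front/malloc_tiny_fast.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_TINY_MAX 1024  /* Tiny threshold named in the commit message */

static int   demo_fast_hits;     /* requests served by the fast path */
static int   demo_slow_calls;    /* requests routed to the slow path */
static void* demo_cached_block;  /* one-slot stand-in for the cache;
                                    must point at >= DEMO_TINY_MAX bytes */

/* Pop the cached block, or return NULL on a miss. */
static void* demo_cache_pop(void) {
    void* p = demo_cached_block;
    demo_cached_block = NULL;
    return p;
}

/* Stand-in for the multi-layer "normal" allocation path. */
static void* demo_slow_alloc(size_t size) {
    demo_slow_calls++;
    return malloc(size);
}

/* Front gate: try the cache for Tiny sizes, otherwise fall back. */
static void* demo_malloc(size_t size) {
    if (size != 0 && size <= DEMO_TINY_MAX) {
        void* p = demo_cache_pop();
        if (p) {
            demo_fast_hits++;
            return p;
        }
    }
    return demo_slow_alloc(size);  /* miss or non-Tiny size */
}
```

The point of the sketch is that the hit path touches only the size check and the cache, while misses and larger sizes reach the same slow path they would have used without the gate.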
Moe Charm (CI)	03ba62df4d	Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified

Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization

Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
   - Direct SuperSlab carve (TLS SLL bypass)
   - Self-contained pop-or-refill pattern
   - ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128
2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
   - Unified ON → direct cache access (skip all intermediate layers)
   - Alloc: unified_cache_pop_or_refill() → immediate fail to slow
   - Free: unified_cache_push() → fallback to SLL only if full

PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
   - PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
   - Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()
4. Measurement results (Random Mixed 500K / 256B):
   - Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
   - SSM: 512 pages (initialization footprint)
   - MID/L25: 0 (unused in this workload)
   - Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)

Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
   - ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
   - Conditional compilation cleanup

Documentation:
6. Analysis reports
   - RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
   - RANDOM_MIXED_SUMMARY.md: Phase 23 summary
   - RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
   - CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan

Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 02:47:58 +09:00
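The Phase 23 commit above names a "self-contained pop-or-refill pattern" and a push that falls back only when the cache is full. As an illustrative sketch only, a fixed-capacity array cache with a bump-pointer carve region could implement that pattern as follows. `DemoCache`, `DEMO_CAP`, `DEMO_BLOCK`, and both functions are hypothetical and do not reproduce `unified_cache_pop_or_refill()` or `unified_cache_push()`.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CAP   4   /* cache capacity (assumed value, for illustration) */
#define DEMO_BLOCK 64  /* block size of this hypothetical size class */

typedef struct {
    void*          slots[DEMO_CAP];  /* stack of cached free blocks */
    size_t         count;            /* number of valid entries in slots */
    unsigned char* carve_next;       /* next uncarved byte of the region */
    unsigned char* carve_end;        /* one past the end of the region */
} DemoCache;

/* Push a freed block; returns 0 when full so the caller can fall back. */
static int demo_cache_push(DemoCache* c, void* p) {
    if (c->count == DEMO_CAP) return 0;
    c->slots[c->count++] = p;
    return 1;
}

/* Pop a block; when empty, refill by carving fresh blocks first.
 * Returns NULL only when the cache is empty and the region is exhausted. */
static void* demo_cache_pop_or_refill(DemoCache* c) {
    if (c->count == 0) {
        while (c->count < DEMO_CAP &&
               (size_t)(c->carve_end - c->carve_next) >= DEMO_BLOCK) {
            c->slots[c->count++] = c->carve_next;
            c->carve_next += DEMO_BLOCK;
        }
        if (c->count == 0) return NULL;  /* caller goes to the slow path */
    }
    return c->slots[--c->count];
}
```

Keeping refill inside the pop function means the allocation path needs a single call and a single NULL check before deciding to take the slow path.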


86 Commits