# HAKMEM Tiny Allocator Super Refactoring Plan

## Executive Summary

### Current State

- **hakmem_tiny.c (1584 lines)**: a container that aggregates multiple .inc files
- **hakmem_tiny_free.inc (1470 lines)**: the largest mixed-responsibility file
  - Free path (lines 33-558)
  - SuperSlab Allocation (lines 559-998)
  - SuperSlab Free (lines 999-1369)
  - Query API (commented out, extracted to hakmem_tiny_query.c)

**Problems**:
1. A single mega-file (1470 lines)
2. Free and Allocation code are mixed together
3. Responsibilities are unclear
4. Static inline functions are deeply nested

### Goal

**"Split into files of 500 lines or fewer, based on Box Theory"**
- Each file has a single responsibility (SRP)
- Boundaries are zero-cost via `static inline`
- Dependencies are made explicit
- The refactoring order is optimized

---

## Phase 1: Current State Analysis

### Top 10 Largest Files

| Rank | File | Lines | Responsibility |
|------|------|-------|----------------|
| 1 | hakmem_pool.c | 2592 | Mid/Large allocator (out of scope) |
| 2 | hakmem_tiny.c | 1584 | Tiny aggregator (under analysis) |
| 3 | **hakmem_tiny_free.inc** | **1470** | Free + SS Alloc + Query (must split) |
| 4 | hakmem.c | 1449 | Top-level allocator (out of scope) |
| 5 | hakmem_l25_pool.c | 1195 | L25 pool (out of scope) |
| 6 | hakmem_tiny_intel.inc | 863 | Intel optimizations (split candidate) |
| 7 | hakmem_tiny_superslab.c | 810 | SuperSlab (kept, already hardened) |
| 8 | hakmem_tiny_stats.c | 697 | Statistics (kept) |
| 9 | tiny_remote.c | 645 | Remote queue (kept, split candidate) |
| 10 | hakmem_learner.c | 603 | Learning (out of scope) |

### Tiny-related files over 500 lines

```
hakmem_tiny_free.inc     1470  ← must split (top priority)
hakmem_tiny_intel.inc     863  ← split candidate
hakmem_tiny_init.inc      544  ← split candidate
tiny_remote.c             645  ← split candidate
```

### .inc files included by hakmem_tiny.c (44 files)

**Largest (over 300 lines):**
- hakmem_tiny_free.inc (1470) ← **top priority**
- hakmem_tiny_intel.inc (863)
- hakmem_tiny_init.inc (544)

**Medium (150-300 lines):**
- hakmem_tiny_refill.inc.h (410)
- hakmem_tiny_alloc_new.inc (275)
- hakmem_tiny_background.inc (261)
- hakmem_tiny_alloc.inc (249)
- hakmem_tiny_lifecycle.inc (244)
- hakmem_tiny_metadata.inc (226)

**Small (50-150 lines):**
- hakmem_tiny_ultra_simple.inc (176)
- hakmem_tiny_slab_mgmt.inc (163)
- hakmem_tiny_fastcache.inc.h (149)
- hakmem_tiny_hotmag.inc.h (147)
- hakmem_tiny_smallmag.inc.h (139)
- hakmem_tiny_hot_pop.inc.h (118)
- hakmem_tiny_bump.inc.h (107)

---

## Phase 2: Responsibility Classification by Box Theory

### Box 1: 
Atomic Ops (lowest layer, 50-100 lines)

**Responsibility**: CAS/Exchange/Fetch wrappers, memory-order management

**New file**:
- `tiny_atomic.h` (80 lines)

**Contents**:
```c
// Atomics for remote queue, owner_tid, refcount
- tiny_atomic_cas()
- tiny_atomic_exchange()
- tiny_atomic_load/store()
- Memory order wrapper
```

---

### Box 2: Remote Queue & Ownership (lower layer, 500-700 lines)

#### 2.1: Remote Queue Operations (`tiny_remote_queue.inc.h`, 250-350 lines)

**Responsibility**: MPSC stack ops, guard check, node management

**Source**: extracted from the remote queue portion of hakmem_tiny_free.inc

```c
- tiny_remote_queue_contains_guard()
- tiny_remote_queue_push()
- tiny_remote_queue_pop()
- tiny_remote_drain_owner()  // from hakmem_tiny_free.inc:170
```

#### 2.2: Remote Drain Logic (`tiny_remote_drain.inc.h`, 200-250 lines)

**Responsibility**: drain logic, TLS cleanup

**Source**: the drain logic in hakmem_tiny_free.inc

```c
- tiny_remote_drain_batch()
- tiny_remote_process_mailbox()
```

#### 2.3: Ownership (Owner TID) (`tiny_owner.inc.h`, 100-150 lines)

**Responsibility**: acquire/release of owner_tid, slab ownership

**Existing**: slab_handle.h (295 lines, kept) + enhancements
**New**: tiny_owner.inc.h

```c
- tiny_owner_acquire()
- tiny_owner_release()
- tiny_owner_self()
```

**Depends on**: Box 1 (Atomic)

---

### Box 3: SuperSlab Core (`hakmem_tiny_superslab.c` + `hakmem_tiny_superslab.h`, kept)

**Responsibility**: SuperSlab allocation, cache, registry

**Current state**: 810 lines (already well-structured)

**Enhancements**: integrate with the following Boxes
- Publish/Adopt in Box 4
- Remote ops in Box 2

---

### Box 4: Publish/Adopt (upper layer, 400-500 lines)

#### 4.1: Publish (`tiny_publish.c/h`, kept, 34 lines)

**Responsibility**: publish freelist changes

**Existing**: tiny_publish.c (34 lines) ← already tiny

#### 4.2: Mailbox (`tiny_mailbox.c/h`, kept, 252 lines)

**Responsibility**: adopt requests from other threads

**Existing**: tiny_mailbox.c (252 lines) → consider splitting

```c
- tiny_mailbox_push()      // 50 lines
- tiny_mailbox_drain()     // 150 lines
```

**Proposed split**:
- `tiny_mailbox_push.inc.h` (50 lines)
- `tiny_mailbox_drain.inc.h` (150 lines)

#### 4.3: Adopt Logic (`tiny_adopt.inc.h`, 200-300 lines)

**Responsibility**: logic for adopting a slab from a SuperSlab

**Source**: extracted from the adoption logic in hakmem_tiny_free.inc

```c
- tiny_adopt_request()
- tiny_adopt_select()
- 
tiny_adopt_cooldown()
```

**Depends on**: Box 3 (SuperSlab), Box 4.2 (Mailbox), Box 2 (Ownership)

---

### Box 5: Allocation Path (cross-cutting, 600-800 lines)

#### 5.1: Fast Path (`tiny_alloc_fast.inc.h`, 200-300 lines)

**Responsibility**: 3-4 instruction fast path (TLS cache direct pop)

**Source**: hakmem_tiny_ultra_simple.inc (176 lines) + hakmem_tiny_fastcache.inc.h (149 lines)

```c
// Ultra-simple fast (SRP):
static inline void* tiny_fast_alloc(int class_idx) {
    void** head = &g_tls_cache[class_idx];
    void* ptr = *head;
    if (ptr) *head = *(void**)ptr;  // Pop: next pointer is stored in the block
    return ptr;
}

// Fast push:
static inline int tiny_fast_push(int class_idx, void* ptr) {
    int cap = g_tls_cache_cap[class_idx];
    int cnt = atomic_load(&g_tls_cache_count[class_idx]);
    if (cnt < cap) {
        void** head = &g_tls_cache[class_idx];
        *(void**)ptr = *head;  // Link the block to the current head
        *head = ptr;
        atomic_fetch_add(&g_tls_cache_count[class_idx], 1);
        return 1;
    }
    return 0;  // Slow path
}
```

#### 5.2: Refill Logic (`tiny_refill.inc.h`, 410 lines, existing)

**Responsibility**: cache refill

**Current state**: hakmem_tiny_refill.inc.h (410 lines) ← already well-sized

#### 5.3: Slow Path (`tiny_alloc_slow.inc.h`, 250-350 lines)

**Responsibility**: SuperSlab → New Slab → Refill

**Source**: superslab_refill + allocation logic in hakmem_tiny_free.inc, plus hakmem_tiny_alloc.inc (249 lines)

```c
- tiny_alloc_slow()
- tiny_refill_from_superslab()
- tiny_new_slab_alloc()
```

**Depends on**: Box 3 (SuperSlab), Box 5.2 (Refill)

---

### Box 6: Free Path (cross-cutting, 600-800 lines)

#### 6.1: Fast Free (`tiny_free_fast.inc.h`, 200-250 lines)

**Responsibility**: same-thread free, TLS cache push

**Source**: the fast-path free logic in hakmem_tiny_free.inc

```c
// Fast same-thread free:
static inline int tiny_free_fast(void* ptr, int class_idx) {
    // Owner check + Cache push
    uint32_t self_tid = tiny_self_u32();
    TinySlab* slab = hak_tiny_owner_slab(ptr);
    if (!slab || slab->owner_tid != self_tid)
        return 0;  // Slow path
    return tiny_fast_push(class_idx, ptr);
}
```

#### 6.2: Cross-Thread Free (`tiny_free_remote.inc.h`, 250-300 lines)

**Responsibility**: remote queue push, publish

**Source**: the cross-thread logic + remote push in hakmem_tiny_free.inc

```c
- 
tiny_free_remote()
- tiny_free_remote_queue_push()
```

**Depends on**: Box 2 (Remote Queue), Box 4.1 (Publish)

#### 6.3: Guard/Safety (`tiny_free_guard.inc.h`, 100-150 lines)

**Responsibility**: guard sentinel check, bounds validation

**Source**: the guard logic in hakmem_tiny_free.inc

```c
- tiny_free_guard_check()
- tiny_free_validate_ptr()
```

---

### Box 7: Statistics & Query (analysis layer, 700-900 lines)

#### Existing (kept):
- hakmem_tiny_stats.c (697 lines) - Stats aggregate
- hakmem_tiny_stats_api.h (103 lines) - Stats API
- hakmem_tiny_stats.h (278 lines) - Stats internal
- hakmem_tiny_query.c (72 lines) - Query API

#### Split consideration:
hakmem_tiny_stats.c (697 lines) is dedicated to the statistics engine, so it can stay as is.

---

### Box 8: Lifecycle (initialization and cleanup, 544 lines)

#### Existing:
- hakmem_tiny_init.inc (544 lines) - Initialization
- hakmem_tiny_lifecycle.inc (244 lines) - Lifecycle
- hakmem_tiny_slab_mgmt.inc (163 lines) - Slab management

**Split consideration**:
- `tiny_init_globals.inc.h` (150 lines) - Global vars
- `tiny_init_config.inc.h` (150 lines) - Config from env
- `tiny_init_pools.inc.h` (150 lines) - Pool allocation
- `tiny_lifecycle_trim.inc.h` (120 lines) - Trim logic
- `tiny_lifecycle_shutdown.inc.h` (120 lines) - Shutdown

---

### Box 9: Intel Specific (863 lines)

**Proposed split**:
- `tiny_intel_fast.inc.h` (300 lines) - Prefetch + PAUSE
- `tiny_intel_cache.inc.h` (200 lines) - Cache tuning
- `tiny_intel_cfl.inc.h` (150 lines) - CFL-specific
- `tiny_intel_skl.inc.h` (150 lines) - SKL-specific (to be unified)

---

## Phase 3: Split Execution Plan

### Priority 1: Critical Path (1 week)

**Goal**: reduce the fast path to the 3-4 instruction level

1. **Box 1: tiny_atomic.h** (80 lines) ✨
   - `atomic_load_explicit()` wrapper
   - `atomic_store_explicit()` wrapper
   - `atomic_cas()` wrapper
   - Depends on: `<stdatomic.h>` only

2. **Box 5.1: tiny_alloc_fast.inc.h** (250 lines) ✨
   - Ultra-simple TLS cache pop
   - Depends on: Box 1

3. **Box 6.1: tiny_free_fast.inc.h** (200 lines) ✨
   - Same-thread fast free
   - Depends on: Box 1, Box 5.1

4. 
**Extract from hakmem_tiny_free.inc**:
   - Fast path logic (500 lines) → into the files above
   - SuperSlab path (400 lines) → Box 5.3, 6.2
   - Remote logic (250 lines) → Box 2
   - Cleanup → hakmem_tiny_free.inc shrinks to 300 lines

**Effect**: optimizes the fast path to be on par with the system tcache

---

### Priority 2: Remote & Ownership (1 week)

5. **Box 2.1: tiny_remote_queue.inc.h** (300 lines)
   - Remote queue ops
   - Depends on: Box 1

6. **Box 2.3: tiny_owner.inc.h** (120 lines)
   - Owner TID management
   - Depends on: Box 1, slab_handle.h (existing)

7. **Cleanup of tiny_remote.c** (645 lines)
   - `tiny_remote_queue_ops()` → move to tiny_remote_queue.inc.h
   - `tiny_remote_side_*()` → kept
   - Resize: 645 → 350 lines

**Effect**: modularizes remote ops

---

### Priority 3: SuperSlab Integration (1-2 weeks)

8. **Box 3 enhancements**: hakmem_tiny_superslab.c (810 lines, kept)
   - Publish/Adopt integration
   - Depends on: Box 2, Box 4

9. **Box 4.1-4.3: Publish/Adopt Path** (400-500 lines)
   - `tiny_publish.c` (34 lines, existing)
   - `tiny_mailbox.c` → split
   - `tiny_adopt.inc.h` (new)

**Effect**: fully integrates SuperSlab adoption

---

### Priority 4: Allocation/Free Slow Path (1 week)

10. **Box 5.2-5.3: Refill & Slow Allocation** (650 lines)
    - hakmem_tiny_refill.inc.h (410 lines, existing)
    - `tiny_alloc_slow.inc.h` (new, 300 lines)

11. **Box 6.2-6.3: Cross-thread Free** (400 lines)
    - `tiny_free_remote.inc.h` (new)
    - `tiny_free_guard.inc.h` (new)

**Effect**: cleanly separates the slow path

---

### Priority 5: Lifecycle & Config (1-2 weeks)

12. **Box 8: Splitting Lifecycle** (400-500 lines)
    - hakmem_tiny_init.inc (544 lines) → 150 + 150 + 150
    - hakmem_tiny_lifecycle.inc (244 lines) → 120 + 120
    - Remove duplication

13. 
**Box 9: Intel-specific cleanup** (863 lines)
    - `tiny_intel_fast.inc.h` (300 lines)
    - `tiny_intel_cache.inc.h` (200 lines)
    - `tiny_intel_common.inc.h` (150 lines)
    - Deduplicate × 3 architectures

**Effect**: unifies configuration management

---

## Phase 4: Proposed New File Layout

### Final Layout

```
core/
├─ Box 1: Atomic Ops
│  └─ tiny_atomic.h (80 lines)
│
├─ Box 2: Remote & Ownership
│  ├─ tiny_remote.h (80 lines, existing, slimmed down)
│  ├─ tiny_remote_queue.inc.h (300 lines, new)
│  ├─ tiny_remote_drain.inc.h (150 lines, new)
│  ├─ tiny_owner.inc.h (120 lines, new)
│  └─ slab_handle.h (295 lines, existing, kept)
│
├─ Box 3: SuperSlab Core
│  ├─ hakmem_tiny_superslab.h (500 lines, existing)
│  └─ hakmem_tiny_superslab.c (810 lines, existing)
│
├─ Box 4: Publish/Adopt
│  ├─ tiny_publish.h (6 lines, existing)
│  ├─ tiny_publish.c (34 lines, existing)
│  ├─ tiny_mailbox.h (11 lines, existing)
│  ├─ tiny_mailbox.c (252 lines, existing) → can be split
│  ├─ tiny_mailbox_push.inc.h (80 lines, new)
│  ├─ tiny_mailbox_drain.inc.h (150 lines, new)
│  └─ tiny_adopt.inc.h (300 lines, new)
│
├─ Box 5: Allocation
│  ├─ tiny_alloc_fast.inc.h (250 lines, new)
│  ├─ hakmem_tiny_refill.inc.h (410 lines, existing)
│  └─ tiny_alloc_slow.inc.h (300 lines, new)
│
├─ Box 6: Free
│  ├─ tiny_free_fast.inc.h (200 lines, new)
│  ├─ tiny_free_remote.inc.h (300 lines, new)
│  ├─ tiny_free_guard.inc.h (120 lines, new)
│  └─ hakmem_tiny_free.inc (1470 lines, existing) → reduced to 300 lines
│
├─ Box 7: Statistics
│  ├─ hakmem_tiny_stats.c (697 lines, existing)
│  ├─ hakmem_tiny_stats.h (278 lines, existing)
│  ├─ hakmem_tiny_stats_api.h (103 lines, existing)
│  └─ hakmem_tiny_query.c (72 lines, existing)
│
├─ Box 8: Lifecycle
│  ├─ tiny_init_globals.inc.h (150 lines, new)
│  ├─ tiny_init_config.inc.h (150 lines, new)
│  ├─ tiny_init_pools.inc.h (150 lines, new)
│  ├─ tiny_lifecycle_trim.inc.h (120 lines, new)
│  └─ tiny_lifecycle_shutdown.inc.h (120 lines, new)
│
├─ Box 9: Intel-specific
│  ├─ tiny_intel_common.inc.h (150 lines, new)
│  ├─ tiny_intel_fast.inc.h (300 lines, new)
│  └─ tiny_intel_cache.inc.h (200 lines, new)
│
└─ Integration
   └─ hakmem_tiny.c (1584 lines, existing, include aggregator)
      └─ New format:
         1. includes Box 1-9
         2. 
Minimal glue code only
```

---

## Phase 5: Optimizing Include Order

### Safe include dependencies

```mermaid
graph TD
    A[Box 1: tiny_atomic.h] --> B[Box 2: tiny_remote.h]
    A --> C[Box 5/6: Alloc/Free]
    B --> D[Box 2.1: tiny_remote_queue.inc.h]
    D --> E[tiny_remote.c]
    A --> F[Box 4: Publish/Adopt]
    E --> F
    C --> G[Box 3: SuperSlab]
    F --> G
    G --> H[Box 5.3/6.2: Slow Path]
    I[Box 8: Lifecycle] --> H
    J[Box 9: Intel] --> C
```

### New format for hakmem_tiny.c

```c
#include "hakmem_tiny.h"
#include "hakmem_tiny_config.h"

// ============================================================
// LAYER 0: Atomic + Ownership (lowest)
// ============================================================
#include "tiny_atomic.h"
#include "tiny_owner.inc.h"
#include "slab_handle.h"

// ============================================================
// LAYER 1: Remote Queue + SuperSlab Core
// ============================================================
#include "hakmem_tiny_superslab.h"
#include "tiny_remote_queue.inc.h"
#include "tiny_remote_drain.inc.h"
#include "tiny_remote.inc"  // tiny_remote_side_*
#include "tiny_remote.c"    // Link-time

// ============================================================
// LAYER 2: Publish/Adopt (publication mechanism)
// ============================================================
#include "tiny_publish.h"
#include "tiny_publish.c"
#include "tiny_mailbox.h"
#include "tiny_mailbox_push.inc.h"
#include "tiny_mailbox_drain.inc.h"
#include "tiny_mailbox.c"
#include "tiny_adopt.inc.h"

// ============================================================
// LAYER 3: Fast Path (allocation + free)
// ============================================================
#include "tiny_alloc_fast.inc.h"
#include "tiny_free_fast.inc.h"

// ============================================================
// LAYER 4: Slow Path (refill + cross-thread free)
// ============================================================
#include "hakmem_tiny_refill.inc.h"
#include "tiny_alloc_slow.inc.h"
#include "tiny_free_remote.inc.h"
#include "tiny_free_guard.inc.h"

// ============================================================
// LAYER 5: Statistics + Query + Metadata
// ============================================================
#include "hakmem_tiny_stats.h"
#include "hakmem_tiny_query.c"
#include "hakmem_tiny_metadata.inc"

// ============================================================
// LAYER 6: Lifecycle + Init
// ============================================================
#include "tiny_init_globals.inc.h"
#include "tiny_init_config.inc.h"
#include "tiny_init_pools.inc.h"
#include "tiny_lifecycle_trim.inc.h"
#include "tiny_lifecycle_shutdown.inc.h"

// ============================================================
// LAYER 7: Intel-specific optimizations
// ============================================================
#include "tiny_intel_common.inc.h"
#include "tiny_intel_fast.inc.h"
#include "tiny_intel_cache.inc.h"

// ============================================================
// LAYER 8: Legacy/Experimental (kept for compat)
// ============================================================
#include "hakmem_tiny_ultra_simple.inc"
#include "hakmem_tiny_alloc.inc"
#include "hakmem_tiny_slow.inc"

// ============================================================
// LAYER 9: Old free.inc (minimal, mostly extracted)
// ============================================================
#include "hakmem_tiny_free.inc"  // Now just cleanup

#include "hakmem_tiny_background.inc"
#include "hakmem_tiny_magazine.h"
#include "tiny_refill.h"
#include "tiny_mmap_gate.h"
```

---

## Phase 6: Implementation Guide

### Key Principles

1. **SRP (Single Responsibility Principle)**
   - Each file: one responsibility, 500 lines or fewer
   - No sideways dependencies

2. **Zero-Cost Abstraction**
   - All boundaries via `static inline`
   - No function pointer indirection
   - Compiler inlines aggressively

3. **Cyclic Dependency Prevention**
   - Layer 1 → Layer 2 → ... → Layer 9
   - Avoid backward dependencies

4. 
**Backward Compatibility**
   - Legacy .inc files are kept (for compatibility)
   - Migrate to the new files incrementally

### Where to Use Static Inline

#### ✅ Use `static inline`:
```c
// tiny_atomic.h
static inline void tiny_atomic_store(volatile int* p, int v) {
    atomic_store_explicit((_Atomic int*)p, v, memory_order_release);
}

// tiny_alloc_fast.inc.h
static inline void* tiny_fast_pop_alloc(int class_idx) {
    void** head = &g_tls_cache[class_idx];
    void* ptr = *head;
    if (ptr) *head = *(void**)ptr;
    return ptr;
}

// tiny_alloc_slow.inc.h
static inline void* tiny_refill_from_superslab(int class_idx) {
    SuperSlab* ss = g_tls_current_ss[class_idx];
    if (ss) return superslab_alloc_from_slab(ss, ...);
    return NULL;
}
```

#### ❌ Don't use `static inline` for:
- Large functions (>20 lines)
- Slow path logic
- Setup/teardown code

#### ✅ Use regular functions:
```c
// tiny_remote.c
void tiny_remote_drain_batch(int class_idx) {
    // 50+ lines: slow path → regular function
}

// hakmem_tiny_superslab.c
SuperSlab* superslab_refill(int class_idx) {
    // Complex allocation → regular function
}
```

### Macro Usage

#### Use Macros for:
```c
// tiny_atomic.h
#define TINY_ATOMIC_LOAD(ptr, order) \
    atomic_load_explicit((_Atomic typeof(*(ptr))*)(ptr), order)

#define TINY_ATOMIC_CAS(ptr, expected, desired) \
    atomic_compare_exchange_strong_explicit( \
        (_Atomic typeof(*(ptr))*)(ptr), expected, desired, \
        memory_order_release, memory_order_relaxed)
```

#### Don't over-use for:
- Complex logic (use functions)
- Multiple statements (hard to debug)

---

## Phase 7: Testing Strategy

### Per-File Unit Tests

```c
// test_tiny_alloc_fast.c
void test_tiny_alloc_fast_pop_empty() {
    g_tls_cache[0] = NULL;
    assert(tiny_fast_pop_alloc(0) == NULL);
}

void test_tiny_alloc_fast_push_pop() {
    void* ptr = malloc(8);
    tiny_fast_push_alloc(0, ptr);
    assert(tiny_fast_pop_alloc(0) == ptr);
}
```

### Integration Tests

```c
// test_tiny_alloc_free_cycle.c
void test_alloc_free_single_thread() {
    void* p1 = hak_tiny_alloc(8);
    void* p2 = hak_tiny_alloc(8);
    hak_tiny_free(p1);
    hak_tiny_free(p2);
    // Verify no memory leak
}

void test_alloc_free_cross_thread() {
    // Thread A allocs, Thread B frees
    // Verify remote queue works
}
```

---

## Expected Effects

### Performance

| Metric | Current | Target | Effect |
|--------|---------|--------|--------|
| Fast path instruction count | 20+ | 3-4 | -80% cycles |
| Branch misprediction | 50-100 cycles | 15-20 cycles | -70% |
| TLS cache hit rate | 70% | 85% | +15% throughput |

### Maintainability

| Metric | Current | Target | Effect |
|--------|---------|--------|--------|
| Max file size | 1470 lines | 300-400 lines | -70% complexity |
| Cyclic dependencies | many | 0 | 100% clarified |
| Code review time | 3h | 30min | -90% |

### Development Speed

| Task | Current | After refactoring |
|------|---------|-------------------|
| Bug fix | 2-4h | 30min |
| Optimization | 4-6h | 1-2h |
| Feature add | 6-8h | 2-3h |

---

## Timeline

| Week | Task | Owner | Status |
|------|------|-------|--------|
| 1 | Box 1,5,6 (Fast path) | Claude | TODO |
| 2 | Box 2,3 (Remote/SS) | Claude | TODO |
| 3 | Box 4 (Publish/Adopt) | Claude | TODO |
| 4 | Box 8,9 (Lifecycle/Intel) | Claude | TODO |
| 5 | Testing + Integration | Claude | TODO |
| 6 | Benchmark + Tuning | Claude | TODO |

---

## Rollback Strategy

If performance regresses:
1. Keep all old .inc files (legacy compatibility)
2. hakmem_tiny.c can include either old or new
3. Gradual migration: one Box at a time
4. Benchmark after each Box

---

## Known Risks

1. **Include order sensitivity**: the order of the new Boxes is critical → test carefully
2. **Inlining threshold**: the compiler may not inline all static inline functions → profiling needed
3. **TLS cache contention**: simplifying the fast path may turn TLS synchronization into a bottleneck → monitor g_tls_cache_count
4. **RemoteQueue scalability**: the Box 2 remote queue is weak under high contention → consider making it lock-free

---

## Success Criteria

- ✅ All tests pass (unit + integration + larson)
- ✅ Fast path = 3-4 instructions (assembly analysis)
- ✅ +10-15% throughput on Tiny allocations
- ✅ All files <= 500 lines
- ✅ Zero cyclic dependencies
- ✅ Documentation complete
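
### Appendix: Sketch of the Box 1 / Box 2.1 Boundary

As a rough illustration of how Box 2.1 (the MPSC remote queue) could sit on top of Box 1 (atomic wrappers), and of the lock-free direction mentioned under Known Risks, the following is a minimal sketch. All names here (`sketch_atomic_cas`, `sketch_atomic_exchange`, `SketchRemoteQueue`, `sketch_remote_push`, `sketch_remote_drain`) are hypothetical and are not the actual HAKMEM API. The sketch assumes that a freed block is at least pointer-sized, so the next link can be stored in the block itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the real HAKMEM code. */

/* Box 1 (sketch): thin wrappers that fix the memory order in one place. */
static inline int sketch_atomic_cas(_Atomic(void*)* p, void** expected, void* desired) {
    /* Release on success publishes the link written before the CAS. */
    return atomic_compare_exchange_weak_explicit(
        p, expected, desired, memory_order_release, memory_order_relaxed);
}

static inline void* sketch_atomic_exchange(_Atomic(void*)* p, void* v) {
    /* Acquire pairs with the release CAS performed by pushers. */
    return atomic_exchange_explicit(p, v, memory_order_acquire);
}

/* Box 2.1 (sketch): multi-producer, single-consumer intrusive stack. */
typedef struct {
    _Atomic(void*) head;
} SketchRemoteQueue;

/* Any thread may push a freed block onto the owner's remote queue. */
static inline void sketch_remote_push(SketchRemoteQueue* q, void* block) {
    assert(block != NULL);
    void* old = atomic_load_explicit(&q->head, memory_order_relaxed);
    do {
        *(void**)block = old;  /* link to the current head before publishing */
    } while (!sketch_atomic_cas(&q->head, &old, block));
}

/* Only the owner thread drains: detach the whole chain with one exchange.
 * Returns the number of blocks in the detached chain. */
static inline size_t sketch_remote_drain(SketchRemoteQueue* q, void** out_chain) {
    void* chain = sketch_atomic_exchange(&q->head, NULL);
    size_t n = 0;
    for (void* p = chain; p != NULL; p = *(void**)p) n++;
    *out_chain = chain;
    return n;
}
```

Draining with a single exchange means the owner never competes with producers on a per-node basis, which is one plausible way to keep the drain side cheap. Whether this fits the real `tiny_remote_queue.inc.h` depends on guard checks and node management that the sketch leaves out.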