From cbd33511eb690a831d74d1841ab9efa90892e7bd Mon Sep 17 00:00:00 2001
From: "Moe Charm (CI)"
Date: Wed, 10 Dec 2025 17:58:42 +0900
Subject: [PATCH] Phase v4-3.1: reuse C7 v4 pages and record prep calls

---
 CURRENT_TASK.md                            |  64 +++++
 Makefile                                   |   2 +-
 core/box/smallobject_cold_iface_v4.h       |  21 ++
 core/box/smallobject_hotbox_v4_box.h       |  48 ++++
 core/box/smallobject_hotbox_v4_env_box.h   |  45 +++
 core/box/tiny_route_env_box.h              |  11 +-
 core/front/malloc_tiny_fast.h              |  21 +-
 core/smallobject_hotbox_v4.c               | 304 +++++++++++++++++++++
 docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md |  40 +++
 9 files changed, 551 insertions(+), 5 deletions(-)
 create mode 100644 core/box/smallobject_cold_iface_v4.h
 create mode 100644 core/box/smallobject_hotbox_v4_box.h
 create mode 100644 core/box/smallobject_hotbox_v4_env_box.h
 create mode 100644 core/smallobject_hotbox_v4.c
 create mode 100644 docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md

diff --git a/CURRENT_TASK.md b/CURRENT_TASK.md
index bc66b64c..a6c608ab 100644
--- a/CURRENT_TASK.md
+++ b/CURRENT_TASK.md
@@ -997,3 +997,67 @@ export HAKMEM_POOL_ZERO_MODE=header
 ### Header Light v3.1 (suppress C7 header rewrites, bench only)
 - ENV: `HAKMEM_TINY_C7_HEADER_DEDUP_ENABLED` (default 0, effective only when C7 v3 is ON). Skips the store when the header is identical to the previous one; validation on the free side is unchanged.
 - A/B (Mixed 16–1024B, ws=400, iters=1M, MIXED_TINYV3_C7_SAFE): OFF 44.38M ops/s, ON 44.30M ops/s (within noise, no regression). Kept as a safe experiment box (default OFF).
+
+### Phase v4-1: SmallObjectHotBox v4 design consultation (ChatGPT Pro) and next-phase plan
+- Background: even with C7-only SmallObject v3 plus Tiny front v3 (LUT + fast classify) in place, Mixed 16–1024B reaches only about 30-40% of mimalloc. Most v2-generation experiment boxes (Tiny v2 / Pool v2 / C6 v3, etc.) were no good perf-wise, so the small-object heap as a whole needs another structural rework.
+- Summary of the proposals from the design consultation (ChatGPT Pro):
+  - Box structure of small-object heap v4 (16-1024B, up to 2KiB):
+    - HotBox_v4 (per-thread SmallHeapCtx / SmallClassHeap / SmallPageMeta): unify all small-object classes on a page-based freelist with current/partial/full.
+    - ColdIface_v4: a thin wrapper that consolidates the boundary into a few APIs such as `refill_page` / `retire_page` / `remote_push` / `remote_drain` and calls Superslab/Warm/Remote internally.
+    - SuperslabBox/RemoteBox: reuse the existing Superslab/WarmPool/Remote as Cold-side boxes (never touched directly from Hot).
+    - PolicyBox/LearningBox: build a `SmallPolicySnapshot` for small objects so that route_kind, per-class block_size, and so on can be A/B tested (Hot only reads the snapshot).
+  - The "big moves" for getting closer to mimalloc:
+    1. Build a proper per-thread small-object heap v4 (generalize the successful C7-only v3 pattern).
+    2. Segment/Page/Block layout and pf reduction (retune page placement and WarmPool for v4).
+    3. Straighten the front/gate one step further for small objects (keep Tiny/front v3, but converge the small-object route to V4 vs legacy/pool).
+  - Handling of v3:
+    - Reuse only the structure of C7-only SmallObject v3 / front v3 as the "v4 prototype"; HotBox_v4 is essentially a new design.
+    - For the v2 wrappers (TinyHotHeap v2, etc.), keep only the interface ideas and treat the implementations as archived.
+- Plan for the next implementation phase (v4-1):
+  - First add only the "types and entry points" of HotBox_v4 / ColdIface_v4 (no behavior change):
+    - Add only the struct definitions of `SmallPageMeta` / `SmallClassHeap` / `SmallHeapCtx` and the TLS accessor API declaration to `core/box/smallobject_hotbox_v4_box.h`.
+    - Provide only the interface declarations for `SmallColdIface`, `refill_page/retire_page`, etc. in `core/box/smallobject_cold_iface_v4.h` (bodies come in later phases).
+    - Add `TINY_ROUTE_SMALL_HEAP_V4` to the enum as a route kind and add a case to the front v3 switch; for now it stays a stub that immediately falls back to legacy/v3.
+  - docs:
+    - Add `docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md`, summarizing the box structure above and the Phase v4-1 to v4-4 roadmap.
+  - Behavior check:
+    - After adding the code, confirm that the health-check runs for `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE` / `C6_HEAVY_LEGACY_POOLV1` still pass with no segv/assert and nearly unchanged throughput (v4 is still a stub, so no perf impact is expected).
+- Implementation notes (2025-12-10):
+  - Added: `core/box/smallobject_hotbox_v4_box.h`, `core/box/smallobject_cold_iface_v4.h`, `docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md`. Added `TINY_ROUTE_SMALL_HEAP_V4` to `tiny_route_env_box.h` and stub cases to the alloc/free switches in `malloc_tiny_fast.h` (C7 goes through v3, everything else falls back to v1).
+  - Health-check runs: Mixed 37.16M ops/s (MIXED_TINYV3_C7_SAFE, ws=400, iters=1M), C6-heavy 27.62M ops/s (C6_HEAVY_LEGACY_POOLV1, ws=400, iters=1M). No segv/assert. Further perf headroom is tracked separately.
+
+### Phase v4-2: C7-only v4 route (v3-compatible behavior, A/B via ENV)
+- ENV:
+  - `HAKMEM_SMALL_HEAP_V4_ENABLED` (default 0, research box)
+  - `HAKMEM_SMALL_HEAP_V4_CLASSES` (bit7=0x80 is used for C7). If both v4 and v3 are specified, v4 takes priority.
+  - v4 OFF (existing v3): `SMALL_HEAP_V3_ENABLED=1, V3_CLASSES=0x80, V4_ENABLED=0`
+  - v4 ON (C7-only v4): `SMALL_HEAP_V3_ENABLED=0, V3_CLASSES=0, V4_ENABLED=1, V4_CLASSES=0x80`
+- Changes:
+  - Added `smallobject_hotbox_v4_env_box.h` (default OFF).
+  - Added `smallobject_hotbox_v4.c`; the C7 route goes to v4, which for now is a stub delegating to the v3 implementation (an independent implementation is planned for a later phase). The TLS ctx (`small_heap_ctx_v4_get`) returns a stub.
+  - Updated the route snapshot in `tiny_route_env_box.h` to prefer v4. Added the `TINY_ROUTE_SMALL_HEAP_V4` case to alloc/free in `malloc_tiny_fast.h` (classes other than C7 fall back to v1/v3).
+- Health-check runs (v4 still OFF): Mixed 37.16M ops/s, C6-heavy 27.62M ops/s (no segv/assert).
+
+### Phase v4-3: C7-only v4 freelist implementation (Cold goes through Tiny v1)
+- Changes:
+  - Implemented a C7-only current/partial/full + freelist in `smallobject_hotbox_v4.c`. page_of is a search of the per-class lists. ColdIface v4 refill/retire are wired to Tiny v1 (`tiny_heap_prepare_page` / `tiny_heap_page_becomes_empty`).
+  - Updated `smallobject_cold_iface_v4.h` so that retire takes class_idx.
+  - Added block_size/base to `smallobject_hotbox_v4_box.h`.
+  - route/LUT prefers v4; if v4 is disabled it uses v3/v1 as before.
+- A/B (Mixed 16–1024B, ws=400, iters=1M):
+  - v3: `V3_ENABLED=1 V3_CLASSES=0x80 V4_ENABLED=0` → 39.23M ops/s
+  - v4: `V3_ENABLED=0 V3_CLASSES=0 V4_ENABLED=1 V4_CLASSES=0x80` → 38.01M ops/s
+  - No segv/assert. v4 now runs its own implementation but is still slightly slower than v3. Next: consider optimizing C7 page management and improving pf.
+
+### Phase v4-3.1: suppress excessive prepare calls in C7 v4 (stronger current/partial reuse)
+- Changes:
+  - Changed `smallobject_hotbox_v4.c` to keep current/partial pages instead of discarding them. current is not set to NULL even when its freelist is empty; the slow path picks up a partial page and refills only when nothing is left. On the free side, if there is no current, the returning page is re-adopted as current; partial_count is tracked and partial pages are capped at 2.
+  - C7 alloc now calls `tiny_region_id_write_header` to stay consistent with v3.
+  - Added partial_count to `smallobject_hotbox_v4_box.h`.
+- A/B (ws=400, iters=1M, fixed size=1024, stats ON):
+  - v3: prepare_calls=5,077, Throughput=41,673,129 ops/s
+  - v4: prepare_calls=4,701 (down from 17,191 to 4.7k), Throughput=42,130,607 ops/s (+1% vs v3)
+- Mixed 16–1024B (MIXED_TINYV3_C7_SAFE):
+  - v3 route: 40,661,560 ops/s
+  - v4 route: 40,010,302 ops/s (within -1.6%, no regression)
+- Takeaways: on C7-only, v4 now overtakes v3 and the prepare-count growth is resolved. Mixed is also back in the healthy range. Next: consider pf/partial reuse for C7 v4, or extending to C6/C5.
diff --git a/Makefile b/Makefile
index 341f1b84..3b76fded 100644
--- a/Makefile
+++ b/Makefile
@@ -427,7 +427,7 @@ test-box-refactor: box-refactor
 	./larson_hakmem 10 8 128 1024 1 12345 4
 
 # Phase 4: Tiny Pool benchmarks (properly linked with hakmem)
-TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o 
hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o +TINY_BENCH_OBJS_BASE = hakmem.o hakmem_config.o hakmem_tiny_config.o hakmem_ucb1.o hakmem_bigcache.o hakmem_pool.o hakmem_l25_pool.o hakmem_site_rules.o hakmem_tiny.o core/box/ss_allocation_box.o superslab_stats.o superslab_cache.o superslab_ace.o superslab_slab.o superslab_backend.o core/superslab_head_stub.o hakmem_smallmid.o core/box/superslab_expansion_box.o core/box/integrity_box.o core/box/mailbox_box.o core/box/front_gate_box.o core/box/front_gate_classifier.o core/box/free_publish_box.o core/box/capacity_box.o core/box/carve_push_box.o core/box/prewarm_box.o core/box/ss_hot_prewarm_box.o core/box/front_metrics_box.o core/box/bench_fast_box.o core/box/ss_addr_map_box.o core/box/slab_recycling_box.o core/box/pagefault_telemetry_box.o core/box/tiny_sizeclass_hist_box.o core/box/tiny_env_box.o core/box/tiny_route_box.o core/box/tiny_page_box.o core/box/tiny_class_policy_box.o core/box/tiny_class_stats_box.o core/box/tiny_policy_learner_box.o core/box/ss_budget_box.o core/box/tiny_mem_stats_box.o core/box/c7_meta_used_counter_box.o core/box/wrapper_env_box.o core/box/madvise_guard_box.o core/box/libm_reloc_guard_box.o core/box/ptr_trace_box.o core/box/link_missing_stubs.o core/box/super_reg_box.o core/box/shared_pool_box.o core/box/remote_side_box.o core/page_arena.o core/front/tiny_unified_cache.o tiny_sticky.o tiny_remote.o tiny_publish.o tiny_debug_ring.o 
hakmem_tiny_magazine.o hakmem_tiny_stats.o hakmem_tiny_sfc.o hakmem_tiny_query.o hakmem_tiny_rss.o hakmem_tiny_registry.o hakmem_tiny_remote_target.o hakmem_tiny_bg_spill.o tiny_adaptive_sizing.o hakmem_super_registry.o hakmem_shared_pool.o hakmem_shared_pool_acquire.o hakmem_shared_pool_release.o hakmem_elo.o hakmem_batch.o hakmem_p2.o hakmem_sizeclass_dist.o hakmem_evo.o hakmem_debug.o hakmem_sys.o hakmem_whale.o hakmem_policy.o hakmem_ace.o hakmem_ace_stats.o hakmem_prof.o hakmem_learner.o hakmem_size_hist.o hakmem_learn_log.o hakmem_syscall.o hakmem_ace_metrics.o hakmem_ace_ucb1.o hakmem_ace_controller.o tiny_fastcache.o core/tiny_alloc_fast_push.o core/link_stubs.o core/tiny_failfast.o core/tiny_destructors.o core/smallobject_hotbox_v3.o core/smallobject_hotbox_v4.o
 TINY_BENCH_OBJS = $(TINY_BENCH_OBJS_BASE)
 ifeq ($(POOL_TLS_PHASE1),1)
 TINY_BENCH_OBJS += pool_tls.o pool_refill.o core/pool_tls_arena.o pool_tls_registry.o pool_tls_remote.o
diff --git a/core/box/smallobject_cold_iface_v4.h b/core/box/smallobject_cold_iface_v4.h
new file mode 100644
index 00000000..75d4ae70
--- /dev/null
+++ b/core/box/smallobject_cold_iface_v4.h
@@ -0,0 +1,21 @@
+// smallobject_cold_iface_v4.h - SmallObject HotHeap v4 Cold Interface (boundary API only)
+//
+// Role:
+// - Defines the box of function pointers that connects HotBox_v4 to Superslab/Warm/Remote.
+// - Implementations come in later phases; for now only the types and declarations live here.
+#pragma once
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "smallobject_hotbox_v4_box.h"
+
+typedef struct SmallColdIfaceV4 {
+    small_page_v4* (*refill_page)(small_heap_ctx_v4*, uint32_t class_idx);
+    void (*retire_page)(small_heap_ctx_v4*, uint32_t class_idx, small_page_v4* page);
+    bool (*remote_push)(small_page_v4* page, void* ptr, uint32_t self_tid);
+    void (*remote_drain)(small_heap_ctx_v4*);
+} SmallColdIfaceV4;
+
+// Cold iface accessor (implemented in a later phase)
+const SmallColdIfaceV4* small_cold_iface_v4_get(void);
diff --git a/core/box/smallobject_hotbox_v4_box.h b/core/box/smallobject_hotbox_v4_box.h
new file mode 100644
index 00000000..e799a1ae
--- /dev/null
+++ b/core/box/smallobject_hotbox_v4_box.h
@@ -0,0 +1,48 @@
+// smallobject_hotbox_v4_box.h - SmallObject HotHeap v4 (type skeleton)
+//
+// Role:
+// - Box that defines only the v4 page / class / TLS context types and API declarations.
+// - Behavior is still v3/v1. The alloc/free bodies are implemented in later phases.
+#pragma once
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#include "tiny_geometry_box.h"
+
+#ifndef SMALLOBJECT_NUM_CLASSES
+#define SMALLOBJECT_NUM_CLASSES TINY_NUM_CLASSES
+#endif
+
+// Page metadata for v4 HotBox
+typedef struct small_page_v4 {
+    void* freelist;
+    uint16_t used;
+    uint16_t capacity;
+    uint8_t class_idx;
+    uint8_t flags;
+    uint32_t block_size;
+    uint8_t* base;
+    void* slab_ref; // Superslab / lease token (handled at the box boundary)
+    struct small_page_v4* next;
+} small_page_v4;
+
+// Per-class heap state (current / partial / full lists)
+typedef struct small_class_heap_v4 {
+    small_page_v4* current;
+    small_page_v4* partial_head;
+    small_page_v4* full_head;
+    uint32_t partial_count;
+} small_class_heap_v4;
+
+// TLS heap context (per-thread)
+typedef struct small_heap_ctx_v4 {
+    small_class_heap_v4 cls[SMALLOBJECT_NUM_CLASSES];
+} small_heap_ctx_v4;
+
+// TLS accessor (implementation added in a later phase)
+small_heap_ctx_v4* small_heap_ctx_v4_get(void);
+
+// Hot path API (C7-only stub; later phases will expand)
+void* small_heap_alloc_fast_v4(small_heap_ctx_v4* ctx, int class_idx);
+void small_heap_free_fast_v4(small_heap_ctx_v4* ctx, int class_idx, void* ptr);
diff --git a/core/box/smallobject_hotbox_v4_env_box.h b/core/box/smallobject_hotbox_v4_env_box.h
new file mode 100644
index 00000000..366a22aa
--- /dev/null
+++ b/core/box/smallobject_hotbox_v4_env_box.h
@@ -0,0 +1,45 @@
+// smallobject_hotbox_v4_env_box.h - ENV gate for SmallObject HotHeap v4
+// Default: OFF (explicit opt-in for research)
+#pragma once
+
+#include <stdint.h>
+#include <stdlib.h>
+
+#include "../hakmem_tiny_config.h"
+
+static inline int small_heap_v4_enabled(void) {
+    static int g_enable = -1;
+    if (__builtin_expect(g_enable == -1, 0)) {
+        const char* e = getenv("HAKMEM_SMALL_HEAP_V4_ENABLED");
+        if (e && *e) {
+            g_enable = (*e != '0') ? 1 : 0;
+        } else {
+            // v4 is a research box: OFF unless explicitly enabled
+            g_enable = 0;
+        }
+    }
+    return g_enable;
+}
+
+static inline int small_heap_v4_class_enabled(uint8_t class_idx) {
+    static int g_parsed = 0;
+    static unsigned g_mask = 0;
+    if (__builtin_expect(!g_parsed, 0)) {
+        const char* e = getenv("HAKMEM_SMALL_HEAP_V4_CLASSES");
+        if (e && *e) {
+            unsigned v = (unsigned)strtoul(e, NULL, 0);
+            g_mask = v & 0xFFu;
+        } else {
+            // Default: all classes OFF
+            g_mask = 0;
+        }
+        g_parsed = 1;
+    }
+    if (!small_heap_v4_enabled()) return 0;
+    if (class_idx >= TINY_NUM_CLASSES) return 0;
+    return (g_mask & (1u << class_idx)) != 0;
+}
+
+static inline int small_heap_v4_c7_enabled(void) {
+    return small_heap_v4_class_enabled(7);
+}
diff --git a/core/box/tiny_route_env_box.h b/core/box/tiny_route_env_box.h
index a223b1a6..a0f5f27a 100644
--- a/core/box/tiny_route_env_box.h
+++ b/core/box/tiny_route_env_box.h
@@ -10,12 +10,14 @@
 
 #include "tiny_heap_env_box.h"
 #include "smallobject_hotbox_v3_env_box.h"
+#include "smallobject_hotbox_v4_env_box.h"
 
 typedef enum {
     TINY_ROUTE_LEGACY = 0,
     TINY_ROUTE_HEAP = 1, // TinyHeap v1
     TINY_ROUTE_HOTHEAP_V2 = 2, // TinyHotHeap v2
     TINY_ROUTE_SMALL_HEAP_V3 = 3, // SmallObject HotHeap v3 (C7-first, research box)
+    TINY_ROUTE_SMALL_HEAP_V4 = 4, // SmallObject HotHeap v4 (stub, route unused)
 } tiny_route_kind_t;
 
 extern tiny_route_kind_t g_tiny_route_class[TINY_NUM_CLASSES];
@@ -23,7 +25,9 @@ extern int g_tiny_route_snapshot_done;
 
 static inline void tiny_route_snapshot_init(void) {
     for (int i = 0; i < TINY_NUM_CLASSES; i++) {
-        if (small_heap_v3_class_enabled((uint8_t)i)) {
+        if (small_heap_v4_class_enabled((uint8_t)i)) {
+            g_tiny_route_class[i] = TINY_ROUTE_SMALL_HEAP_V4;
+        } else if (small_heap_v3_class_enabled((uint8_t)i)) {
             g_tiny_route_class[i] = TINY_ROUTE_SMALL_HEAP_V3;
         } else if (tiny_hotheap_v2_class_enabled((uint8_t)i)) {
             g_tiny_route_class[i] = TINY_ROUTE_HOTHEAP_V2;
@@ -47,7 +51,10 @@ static inline tiny_route_kind_t tiny_route_for_class(uint8_t ci) {
 }
 
 static inline 
int tiny_route_is_heap_kind(tiny_route_kind_t route) {
-    return route == TINY_ROUTE_HEAP || route == TINY_ROUTE_HOTHEAP_V2 || route == TINY_ROUTE_SMALL_HEAP_V3;
+    return route == TINY_ROUTE_HEAP ||
+           route == TINY_ROUTE_HOTHEAP_V2 ||
+           route == TINY_ROUTE_SMALL_HEAP_V3 ||
+           route == TINY_ROUTE_SMALL_HEAP_V4;
 }
 
 // Whether the C7 front uses TinyHeap (decided via the route snapshot)
diff --git a/core/front/malloc_tiny_fast.h b/core/front/malloc_tiny_fast.h
index 9ba96cfd..dee8a5e0 100644
--- a/core/front/malloc_tiny_fast.h
+++ b/core/front/malloc_tiny_fast.h
@@ -41,6 +41,7 @@
 #include "../box/tiny_heap_box.h" // TinyHeap generic Box
 #include "../box/tiny_hotheap_v2_box.h" // TinyHotHeap v2 (Phase31 A/B)
 #include "../box/smallobject_hotbox_v3_box.h" // SmallObject HotHeap v3 skeleton
+#include "../box/smallobject_hotbox_v4_box.h" // SmallObject HotHeap v4 (C7 stub)
 #include "../box/tiny_front_v3_env_box.h" // Tiny front v3 snapshot gate
 #include "../box/tiny_heap_env_box.h" // ENV gate for TinyHeap front (A/B)
 #include "../box/tiny_route_env_box.h" // Route snapshot (Heap vs Legacy)
@@ -132,7 +133,8 @@ static inline void* malloc_tiny_fast(size_t size) {
         route_trusted = false;
     } else if (!route_trusted &&
                route != TINY_ROUTE_LEGACY && route != TINY_ROUTE_HEAP &&
-               route != TINY_ROUTE_HOTHEAP_V2 && route != TINY_ROUTE_SMALL_HEAP_V3) {
+               route != TINY_ROUTE_HOTHEAP_V2 && route != TINY_ROUTE_SMALL_HEAP_V3 &&
+               route != TINY_ROUTE_SMALL_HEAP_V4) {
         route = tiny_route_for_class((uint8_t)class_idx);
     }
 
@@ -148,6 +150,15 @@ static inline void* malloc_tiny_fast(size_t size) {
             // fallthrough to v2/v1
             __attribute__((fallthrough));
         }
+        case TINY_ROUTE_SMALL_HEAP_V4: {
+            void* v4p = small_heap_alloc_fast_v4(small_heap_ctx_v4_get(), class_idx);
+            if (TINY_HOT_LIKELY(v4p != NULL)) {
+                return v4p;
+            }
+            so_v3_record_alloc_fallback((uint8_t)class_idx);
+            // fallthrough to v2/v1
+            __attribute__((fallthrough));
+        }
         case TINY_ROUTE_HOTHEAP_V2: {
             void* v2p = tiny_hotheap_v2_alloc((uint8_t)class_idx);
             if (TINY_HOT_LIKELY(v2p != NULL)) {
@@ -307,6 +318,12 @@ static inline int free_tiny_fast(void* ptr) {
         // Same-thread + TinyHeap route → route-based free
         if (__builtin_expect(use_tiny_heap, 0)) {
             switch (route) {
+                case TINY_ROUTE_SMALL_HEAP_V4:
+                    if (class_idx == 7) {
+                        small_heap_free_fast_v4(small_heap_ctx_v4_get(), class_idx, base);
+                        return 1;
+                    }
+                    __attribute__((fallthrough));
                 case TINY_ROUTE_SMALL_HEAP_V3:
                     so_free((uint32_t)class_idx, base);
                     return 1;
@@ -332,7 +349,7 @@ static inline int free_tiny_fast(void* ptr) {
         // fallback: lookup failed but TinyHeap front is ON → use generic TinyHeap free
         if (route == TINY_ROUTE_HOTHEAP_V2) {
             tiny_hotheap_v2_record_free_fallback((uint8_t)class_idx);
-        } else if (route == TINY_ROUTE_SMALL_HEAP_V3) {
+        } else if (route == TINY_ROUTE_SMALL_HEAP_V3 || route == TINY_ROUTE_SMALL_HEAP_V4) {
             so_v3_record_free_fallback((uint8_t)class_idx);
         }
         tiny_heap_free_class_fast(tiny_heap_ctx_for_thread(), class_idx, ptr);
diff --git a/core/smallobject_hotbox_v4.c b/core/smallobject_hotbox_v4.c
new file mode 100644
index 00000000..6f3ddd21
--- /dev/null
+++ b/core/smallobject_hotbox_v4.c
@@ -0,0 +1,304 @@
+// smallobject_hotbox_v4.c - SmallObject HotHeap v4 (C7-only real path)
+//
+// Phase v4-3: for the C7 class, everything is handled by v4's own freelist/current/partial.
+
+#include <stdlib.h>
+#include <string.h>
+
+#include "box/smallobject_hotbox_v4_box.h"
+#include "box/smallobject_hotbox_v4_env_box.h"
+#include "box/smallobject_cold_iface_v4.h"
+#include "box/smallobject_hotbox_v3_env_box.h"
+#include "box/tiny_heap_box.h"
+#include "box/tiny_cold_iface_v1.h"
+#include "box/tiny_geometry_box.h"
+#include "tiny_region_id.h"
+
+// TLS context
+static __thread small_heap_ctx_v4 g_ctx_v4;
+
+#define V4_MAX_PARTIAL_PAGES 2
+
+small_heap_ctx_v4* small_heap_ctx_v4_get(void) {
+    return &g_ctx_v4;
+}
+
+// -----------------------------------------------------------------------------
+// helpers
+// -----------------------------------------------------------------------------
+
+static inline void v4_page_push_partial(small_class_heap_v4* h, small_page_v4* page) {
+    if (!h || !page) return;
+    page->next = h->partial_head;
+    h->partial_head = page;
+    h->partial_count++;
+}
+
+static inline small_page_v4* v4_page_pop_partial(small_class_heap_v4* h) {
+    if (!h) return NULL;
+    small_page_v4* p = h->partial_head;
+    if (p) {
+        h->partial_head = p->next;
+        p->next = NULL;
+        if (h->partial_count > 0) {
+            h->partial_count--;
+        }
+    }
+    return p;
+}
+
+static inline void v4_page_push_full(small_class_heap_v4* h, small_page_v4* page) {
+    if (!h || !page) return;
+    page->next = h->full_head;
+    h->full_head = page;
+}
+
+static inline int v4_ptr_in_page(const small_page_v4* page, const uint8_t* ptr) {
+    if (!page || !ptr) return 0;
+    uint8_t* base = page->base;
+    size_t span = (size_t)page->block_size * (size_t)page->capacity;
+    if (ptr < base || ptr >= base + span) return 0;
+    size_t off = (size_t)(ptr - base);
+    return (off % page->block_size) == 0;
+}
+
+static inline void* v4_build_freelist(uint8_t* base, uint16_t capacity, size_t stride) {
+    void* head = NULL;
+    for (int i = capacity - 1; i >= 0; i--) {
+        uint8_t* blk = base + ((size_t)i * stride);
+        void* next = head;
+        head = blk;
+        memcpy(blk, &next, sizeof(void*));
+    }
+    return head;
+}
+
+typedef enum {
+    V4_LOC_NONE = 0,
+    V4_LOC_CURRENT,
+    V4_LOC_PARTIAL,
+    V4_LOC_FULL,
+} v4_loc_t;
+
+static small_page_v4* v4_find_page(small_class_heap_v4* h, const uint8_t* ptr, v4_loc_t* loc, small_page_v4** prev_out) {
+    if (loc) *loc = V4_LOC_NONE;
+    if (prev_out) *prev_out = NULL;
+    if (!h || !ptr) return NULL;
+
+    if (h->current && v4_ptr_in_page(h->current, ptr)) {
+        if (loc) *loc = V4_LOC_CURRENT;
+        return h->current;
+    }
+    small_page_v4* prev = NULL;
+    for (small_page_v4* p = h->partial_head; p; prev = p, p = p->next) {
+        if (v4_ptr_in_page(p, ptr)) {
+            if (loc) *loc = V4_LOC_PARTIAL;
+            if (prev_out) *prev_out = prev;
+            return p;
+        }
+    }
+    prev = NULL;
+    for (small_page_v4* p = h->full_head; p; prev = p, p 
= p->next) {
+        if (v4_ptr_in_page(p, ptr)) {
+            if (loc) *loc = V4_LOC_FULL;
+            if (prev_out) *prev_out = prev;
+            return p;
+        }
+    }
+    return NULL;
+}
+
+// -----------------------------------------------------------------------------
+// Cold iface (C7-only, via Tiny v1)
+// -----------------------------------------------------------------------------
+
+static small_page_v4* cold_refill_page_v4(small_heap_ctx_v4* hot_ctx, uint32_t class_idx) {
+    if (__builtin_expect(class_idx != 7, 0)) return NULL;
+    (void)hot_ctx;
+    tiny_heap_ctx_t* tctx = tiny_heap_ctx_for_thread();
+    if (!tctx) return NULL;
+
+    tiny_heap_page_t* lease = tiny_heap_prepare_page(tctx, (int)class_idx);
+    if (!lease) return NULL;
+
+    small_page_v4* page = (small_page_v4*)malloc(sizeof(small_page_v4));
+    if (!page) {
+        return NULL;
+    }
+    memset(page, 0, sizeof(*page));
+    page->class_idx = (uint8_t)class_idx;
+    page->capacity = lease->capacity;
+    page->used = 0;
+    page->block_size = (uint32_t)tiny_stride_for_class((int)class_idx);
+    page->base = lease->base;
+    page->slab_ref = lease;
+    page->freelist = v4_build_freelist(lease->base, lease->capacity, page->block_size);
+    if (!page->freelist) {
+        free(page);
+        return NULL;
+    }
+    page->next = NULL;
+    page->flags = 0;
+    return page;
+}
+
+static void cold_retire_page_v4(small_heap_ctx_v4* hot_ctx, uint32_t class_idx, small_page_v4* page) {
+    (void)hot_ctx;
+    if (!page) return;
+    tiny_heap_ctx_t* tctx = tiny_heap_ctx_for_thread();
+    tiny_heap_page_t* lease = (tiny_heap_page_t*)page->slab_ref;
+    if (tctx && lease) {
+        tiny_heap_page_becomes_empty(tctx, (int)class_idx, lease);
+    }
+    free(page);
+}
+
+static const SmallColdIfaceV4 g_cold_iface_v4 = {
+    .refill_page = cold_refill_page_v4,
+    .retire_page = cold_retire_page_v4,
+    .remote_push = NULL,
+    .remote_drain = NULL,
+};
+
+const SmallColdIfaceV4* small_cold_iface_v4_get(void) {
+    return &g_cold_iface_v4;
+}
+
+// -----------------------------------------------------------------------------
+// alloc/free
+// -----------------------------------------------------------------------------
+
+static small_page_v4* small_alloc_slow_v4(small_heap_ctx_v4* ctx, int class_idx) {
+    small_class_heap_v4* h = &ctx->cls[class_idx];
+    small_page_v4* cur = h->current;
+    if (cur && cur->freelist) {
+        return cur; // usable current
+    }
+    if (cur && !cur->freelist) {
+        // Park the exhausted current on the full list; free brings it back
+        v4_page_push_full(h, cur);
+        h->current = NULL;
+    }
+
+    // Bring back exactly one page from partial
+    small_page_v4* from_partial = v4_page_pop_partial(h);
+    if (from_partial) {
+        h->current = from_partial;
+        return from_partial;
+    }
+
+    const SmallColdIfaceV4* cold = small_cold_iface_v4_get();
+    if (!cold || !cold->refill_page) return NULL;
+    small_page_v4* page = cold->refill_page(ctx, (uint32_t)class_idx);
+    if (!page) return NULL;
+    h->current = page;
+    return page;
+}
+
+void* small_heap_alloc_fast_v4(small_heap_ctx_v4* ctx, int class_idx) {
+    if (__builtin_expect(class_idx != 7, 0)) {
+        return NULL; // C7 only
+    }
+    if (!small_heap_v4_c7_enabled()) return NULL;
+    small_class_heap_v4* h = &ctx->cls[class_idx];
+    small_page_v4* page = h->current;
+
+    if (!page || !page->freelist) {
+        page = small_alloc_slow_v4(ctx, class_idx);
+    }
+    if (!page || !page->freelist) {
+        return NULL;
+    }
+
+    void* blk = page->freelist;
+    void* next = NULL;
+    memcpy(&next, blk, sizeof(void*));
+    page->freelist = next;
+    page->used++;
+
+    return tiny_region_id_write_header(blk, class_idx);
+}
+
+static void v4_unlink_from_list(small_class_heap_v4* h, v4_loc_t loc, small_page_v4* prev, small_page_v4* page) {
+    if (!h || !page) return;
+    switch (loc) {
+    case V4_LOC_CURRENT:
+        h->current = NULL;
+        break;
+    case V4_LOC_PARTIAL:
+        if (prev) prev->next = page->next;
+        else h->partial_head = page->next;
+        if (h->partial_count > 0) {
+            h->partial_count--;
+        }
+        break;
+    case V4_LOC_FULL:
+        if (prev) prev->next = page->next;
+        else h->full_head = page->next;
+        break;
+    default:
+        break;
+    }
+    page->next = NULL;
+}
+
+void 
small_heap_free_fast_v4(small_heap_ctx_v4* ctx, int class_idx, void* ptr) {
+    if (__builtin_expect(class_idx != 7, 0)) {
+        return;
+    }
+    if (!small_heap_v4_c7_enabled()) return;
+    if (!ptr) return;
+
+    small_class_heap_v4* h = &ctx->cls[class_idx];
+    small_page_v4* prev = NULL;
+    v4_loc_t loc = V4_LOC_NONE;
+    small_page_v4* page = v4_find_page(h, (const uint8_t*)ptr, &loc, &prev);
+    if (!page) return;
+
+    // freelist push
+    void* head = page->freelist;
+    memcpy(ptr, &head, sizeof(void*));
+    page->freelist = ptr;
+    if (page->used > 0) {
+        page->used--;
+    }
+
+    if (page->used == 0) {
+        const SmallColdIfaceV4* cold = small_cold_iface_v4_get();
+        if (loc != V4_LOC_CURRENT) {
+            v4_unlink_from_list(h, loc, prev, page);
+        }
+        if (!h->current) {
+            h->current = page;
+            page->next = NULL;
+            return;
+        }
+        if (h->current == page) {
+            page->next = NULL;
+            return;
+        }
+        if (h->partial_count < V4_MAX_PARTIAL_PAGES) {
+            v4_page_push_partial(h, page);
+            return;
+        }
+        if (cold && cold->retire_page) {
+            cold->retire_page(ctx, (uint32_t)class_idx, page);
+        } else {
+            free(page);
+        }
+        return;
+    }
+
+    if (!h->current) {
+        // Install this page as current
+        if (loc != V4_LOC_CURRENT) {
+            v4_unlink_from_list(h, loc, prev, page);
+        }
+        h->current = page;
+        page->next = NULL;
+    } else if (loc == V4_LOC_FULL && page->freelist) {
+        // Move full → partial
+        v4_unlink_from_list(h, loc, prev, page);
+        v4_page_push_partial(h, page);
+    }
+}
diff --git a/docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md b/docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md
new file mode 100644
index 00000000..bead6afa
--- /dev/null
+++ b/docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md
@@ -0,0 +1,40 @@
+# SmallObject HotBox v4 Box Design (Phase v4-1 skeleton)
+
+## Overview
+- Goal: provide the v4 box that unifies small objects of 16-1024B (up to 2KiB), as a foothold for generalizing the successful v3 pattern (C7-only).
+- Mid-term target: approach 70-80% of mimalloc. Currently, with C7-only v3 + front v3 + fast classify, Mixed 16–1024B is still around 30-40%.
+- Positioning: v3 = prototype, v2 = archive (interface reference only). v4 tidies the structure and makes the hot/cold boundary explicit.
+
+## Box structure
+- **HotBox_v4**: the per-thread `SmallHeapCtx` holds `SmallClassHeap[current/partial/full]` and `SmallPageMeta`. The hot path is confined here.
+- **ColdIface_v4**: a thin wrapper that gathers `refill_page` / `retire_page` / `remote_push` / `remote_drain` into one box and calls Superslab / Warm / Remote internally.
+- **SuperslabBox / RemoteBox**: reuse the existing Superslab/WarmPool/Remote as Cold-side boxes (never touched directly from Hot).
+- **PolicyBox / LearningBox**: the `SmallPolicySnapshot` for small objects (block_size/route_kind, etc.) is updated at a higher layer; Hot only reads the snapshot.
+
+## Phase roadmap
+- **v4-1**: add only the types and entry points. Behavior stays v3/v1; this provides a foothold that compiles.
+- **v4-2**: move C7-only onto v4 and run it with v3-compatible behavior (ENV-gated, v4 takes priority).
+- **v4-3**: run C7-only on v4's own freelist/current/partial (Cold goes through Tiny v1). v3 stays for benchmarking, with A/B via ENV.
+- **v4-3.1 (this change)**: strengthen current/partial reuse in C7 v4 and hold prepare_calls to the v3 level. On the C7-only bench, v4 recovers to about +1% vs v3.
+- **v4-4**: migrate all small-object classes, including C5 to C7, to v4 in stages. Allow the route LUT to return v4.
+- **v4-5**: tune Segment/Page/Block layout, pf reduction, and WarmPool for v4.
+
+## Handling of the current v3/v2
+- v3: reuse only its structure, as the C7-only front v3 prototype. It is the baseline for performance and stability.
+- v2: kept as an archive; only the interface ideas are referenced. Not mixed into the hot path.
+
+## Entry points for the next steps
+- `core/box/smallobject_hotbox_v4_box.h`: v4 page/class/TLS types and the TLS accessor declaration.
+- `core/box/smallobject_cold_iface_v4.h`: the ColdIface function-pointer box (C7-only refill/retire wired to Tiny v1).
+- `core/box/tiny_route_env_box.h`: adds `TINY_ROUTE_SMALL_HEAP_V4`; C7 can be put on the v4 route via ENV `HAKMEM_SMALL_HEAP_V4_ENABLED` / `HAKMEM_SMALL_HEAP_V4_CLASSES` (OFF if unset).
+- `core/front/malloc_tiny_fast.h`: adds the v4 case to the route switch. When C7 v4 is ON it takes the v4 path (currently C7 uses its own freelist, everything else falls back to v1/v3); when OFF, the existing v3/v1.
+
+## A/B and operation
+- Health check as of Phase v4-3.1:
+  - C7-only A/B (ws=400, iters=1M, fixed size=1024):
+    - v3: 41.67M ops/s, prepare_calls=5,077
+    - v4: 42.13M ops/s, prepare_calls=4,701 (current/partial reuse brought it from about 3.4x to about 0.9x of the v3 count)
+  - Mixed 16–1024B (MIXED_TINYV3_C7_SAFE, ws=400, iters=1M):
+    - v3 route: 40.66M ops/s
+    - v4 route: 40.01M ops/s (within -1.6%, no regression)
+- No segv/assert in either. The prepare-count growth in C7 v4 is resolved. On Mixed, v3 is still slightly ahead, but within the acceptable range.
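
Reviewer note: the intrusive freelist that the C7 v4 hot path relies on (`v4_build_freelist` plus the pop in `small_heap_alloc_fast_v4` and the push in `small_heap_free_fast_v4`) can be sketched in isolation as below. This is a minimal sketch, not the patch's code: `demo_page`, `demo_page_init`, `demo_page_alloc` and `demo_page_free` are hypothetical names, the page is a caller-supplied buffer instead of a Tiny v1 lease, and header writing, page lookup and the current/partial/full lists are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for small_page_v4: only the fields this sketch needs. */
typedef struct demo_page {
    void*    freelist;   /* head of the intrusive free list */
    uint16_t used;       /* blocks currently handed out */
    uint16_t capacity;   /* total blocks in the page */
    uint32_t block_size; /* stride between blocks, in bytes */
    uint8_t* base;       /* first byte of the page */
} demo_page;

/* Thread every block into a singly linked list. The "next" pointer is stored
 * in the first bytes of each free block (via memcpy, so no alignment
 * assumption). Building back to front leaves block 0 at the head. */
static void demo_page_init(demo_page* pg, uint8_t* base,
                           uint16_t capacity, uint32_t block_size) {
    assert(block_size >= sizeof(void*)); /* a block must be able to hold a pointer */
    void* head = NULL;
    for (int i = (int)capacity - 1; i >= 0; i--) {
        uint8_t* blk = base + (size_t)i * block_size;
        memcpy(blk, &head, sizeof(void*));
        head = blk;
    }
    pg->freelist = head;
    pg->used = 0;
    pg->capacity = capacity;
    pg->block_size = block_size;
    pg->base = base;
}

/* Pop the head block; returns NULL when the page is exhausted. */
static void* demo_page_alloc(demo_page* pg) {
    void* blk = pg->freelist;
    if (!blk) return NULL;
    void* next = NULL;
    memcpy(&next, blk, sizeof(void*));
    pg->freelist = next;
    pg->used++;
    return blk;
}

/* Push a block back onto the head of the list (LIFO reuse). */
static void demo_page_free(demo_page* pg, void* blk) {
    void* head = pg->freelist;
    memcpy(blk, &head, sizeof(void*));
    pg->freelist = blk;
    if (pg->used > 0) pg->used--;
}
```

Because frees push onto the head, the most recently freed block is the next one handed out, which is what lets a page that just received a free serve as `current` again without a new prepare call.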