Boxify superslab registry, add bench profile, and document C7 hotpath experiments
docs/analysis/C7_FREE_HOTPATH.md (Normal file, +35)
@@ -0,0 +1,35 @@

C7 Free Hotpath (design memo)
=============================

Goals
-----
- Flatten the dominant C7 free path to minimise branches and helper hops.
- Keep safety checks boxed; keep the hot lane minimal.

Current typical path (C7)
-------------------------
1. size→class LUT → `class_idx = 7`.
2. Free gate / route box decides Tiny vs Pool.
3. Tiny free fast v2:
   - Policy/env checks,
   - TLS SLL push,
   - Warm/UC interaction as needed.
4. Multiple helper calls along the way (gate, policy, SLL push).

Target hot lane
---------------
1. Single policy snapshot for C7 (warm/page/tls on).
2. Straight to TLS SLL push with minimal bookkeeping.
3. Optional UC/Warm stats only in sampled mode.
4. Rare branches (remote/free-list edge cases) stay in the boxed slow path.

Ideas to explore
----------------
- Add an inline `hak_tiny_free_fast_v2_c7()` that is used when `class_idx==7`.
- Fold gate/policy reads into one branch per free call.
- Keep the TLS SLL push inline; push remote/cross-thread cases behind unlikely branches.

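The ideas above can be sketched roughly as follows. This is a minimal illustration, not the hakmem implementation: every `sketch_*` / `Sketch*` name, the list capacity of 64, and the `owned_by_this_thread` parameter are assumptions made up for the example, and `__builtin_expect` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread singly linked free list for class 7. */
typedef struct SketchSll {
    void    *head;   /* most recently freed block */
    unsigned count;  /* blocks currently on the list */
    unsigned cap;    /* push limit before spilling to the slow path */
} SketchSll;

static _Thread_local SketchSll sketch_c7_sll = { NULL, 0, 64 };
static unsigned sketch_slow_calls; /* demo counter only, not thread safe */

/* Stand-in for the boxed slow path (remote frees, full list, checks). */
static void sketch_c7_free_slow(void *p) {
    (void)p;
    sketch_slow_calls++;
}

/* Hot lane: one combined unlikely branch, then an intrusive push. */
static inline void sketch_c7_free_fast(void *p, int owned_by_this_thread) {
    SketchSll *s = &sketch_c7_sll;
    if (__builtin_expect(!owned_by_this_thread || s->count >= s->cap, 0)) {
        sketch_c7_free_slow(p);
        return;
    }
    *(void **)p = s->head;  /* next pointer lives inside the freed block */
    s->head = p;
    s->count++;
}
```

The point of the shape is that the common case pays for a single predictable branch, while everything rare is funnelled through one out-of-line call that can keep all of its guards.
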
Validation
----------
- Compare C7-only ops/s before/after.
- Ensure remote/free-list invariants stay enforced in the slow path.

docs/analysis/C7_HOTPATH_FLATTENING.md (Normal file, +39)
@@ -0,0 +1,39 @@

C7 Alloc Hotpath Flattening (design memo)
=========================================

Goals
-----
- Make C7 alloc as close to a straight line as possible.
- Minimise branches/indirections on the steady hit path (UC/TLS/Warm already stable).
- Keep Box boundaries intact; isolate feature gates to one lookup.

Current shape (simplified)
--------------------------
1. size→class LUT → `class_idx = 7` for the 1024B path.
2. Route/Policy checks (`tiny_route_get`, `tiny_policy_get`) → gate UC/Warm/Page.
3. UC pop: the hit path shares code with miss/refill and includes stats/guards.
4. TLS/Warm engagement happens behind the UC miss boundary.
5. Multiple helper calls on the hit path (gate box, policy box, UC helpers).

Target shape
------------
1. size→class LUT (unchanged).
2. One policy snapshot: `const TinyClassPolicy* pol = tiny_policy_get(7);`
3. One route decision: the C7 fast path assumes Tiny→UC→TLS/Warm is enabled.
4. Hit path specialised:
   - Inline `tiny_unified_cache_pop_fast_c7()` that only touches the hot cache lines.
   - Stats optional/sampled (avoid an atomic on every hit).
   - No feature/env reads.
5. Miss path remains boxed and guarded; it enters the existing refill flow unchanged.

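One way to read "one policy snapshot, one route decision" is a single flag word tested with a single compare. The sketch below assumes a bitmask layout; `SketchClassPolicy`, `sketch_policy_get()` and the `SKETCH_POL_*` flags are hypothetical stand-ins, and the real `TinyClassPolicy` layout is not shown in this memo.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits; the real policy fields may differ. */
enum {
    SKETCH_POL_WARM    = 1u << 0,
    SKETCH_POL_PAGE    = 1u << 1,
    SKETCH_POL_TLS     = 1u << 2,
    SKETCH_POL_C7_FAST = SKETCH_POL_WARM | SKETCH_POL_PAGE | SKETCH_POL_TLS
};

typedef struct SketchClassPolicy {
    uint32_t flags;  /* snapshot of the per-class feature gates */
} SketchClassPolicy;

static SketchClassPolicy sketch_policies[8];  /* one slot per tiny class */

static inline const SketchClassPolicy *sketch_policy_get(int class_idx) {
    return &sketch_policies[class_idx];
}

/* One load and one compare decide whether the C7 fast lane applies. */
static inline int sketch_c7_fast_lane_ok(void) {
    const SketchClassPolicy *pol = sketch_policy_get(7);
    return (pol->flags & SKETCH_POL_C7_FAST) == SKETCH_POL_C7_FAST;
}
```

If any required feature is off, the caller would simply fall through to the existing boxed route, so the Box boundaries stay where they are.
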
Possible refactors
------------------
- Add `malloc_tiny_fast_c7_inline(...)` as a static inline used only when class==7.
- Precompute `pol->warm_enabled/page_box_enabled` once per thread and reuse.
- Split UC helpers into `*_hit_fast` vs `*_miss` to keep the hit CFG tiny.

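The `*_hit_fast` / `*_miss` split combined with sampled stats might look like the sketch below. It is an illustration under assumptions: the cache layout, the capacity of 8, the 1-in-256 sampling mask and all `sketch_*` names are invented here, the miss function is a stub that returns `NULL` instead of refilling, and `__builtin_expect` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_UC_CAP           8u     /* hypothetical cache capacity */
#define SKETCH_STAT_SAMPLE_MASK 0xFFu  /* record 1 hit in 256 */

typedef struct SketchUnifiedCache {
    void    *slots[SKETCH_UC_CAP];
    uint32_t count;         /* cached blocks available to pop */
    uint32_t tick;          /* thread-local, non-atomic hit counter */
    uint32_t sampled_hits;  /* bumped only on sampled hits */
} SketchUnifiedCache;

/* Miss path stand-in: the real one would enter the boxed refill flow. */
static void *sketch_uc_pop_miss(SketchUnifiedCache *uc) {
    (void)uc;
    return NULL;
}

/* Hit path: a pop plus a cheap sampled counter, no atomics. */
static inline void *sketch_uc_pop_hit_fast(SketchUnifiedCache *uc) {
    if (__builtin_expect(uc->count == 0, 0))
        return sketch_uc_pop_miss(uc);
    void *p = uc->slots[--uc->count];
    if (__builtin_expect((++uc->tick & SKETCH_STAT_SAMPLE_MASK) == 0, 0))
        uc->sampled_hits++;
    return p;
}
```

Keeping the miss function out of the inline body is what keeps the hit CFG small: the compiler only has to inline a load, a decrement and two unlikely branches.
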
Trade-offs / checks
-------------------
- Keep the Box boundaries (Gate/Route/Policy) but allow an inline “fast lane” for C7.
- Ensure Debug/Policy logging stays in the slow/miss path only.
- Validate with IPC/ops after implementation; target +10–15% for C7-heavy mixes.

docs/analysis/CPU_HOTPATH_OVERVIEW.md (Normal file, +53)
@@ -0,0 +1,53 @@

CPU Hotpath Overview (bench profile)
====================================

Context
-------
- Build/profile: `HAKMEM_PROFILE=bench`, `HAKMEM_TINY_PROFILE=full`, `HAKMEM_WARM_TLS_BIND_C7=2`.
- Workloads sampled:
  - 16–1024B (`./bench_random_mixed_hakmem 1000000 256 42`)
  - 129–1024B (`HAKMEM_BENCH_MIN_SIZE=129 HAKMEM_BENCH_MAX_SIZE=1024 ./bench_random_mixed_hakmem 1000000 256 42`)
- Target: identify user-space hot spots to guide the C7 flattening work.

Sampling attempt (perf)
-----------------------
- `perf record -g -e cycles:u` and `perf record -g -e cpu-clock:u` both fall back to `page-faults` on this host (likely due to the `perf_event_paranoid` setting). The captures show:
  - ~97% of page-fault samples in `__memset_avx2_unaligned_erms` during warmup/zeroing.
  - Callers were `tiny_tls_sll_drain.part.0.constprop.0` and `adaptive_sizing_init` (warmup path).
- No steady-state cycle profile was available without elevated perf permissions. `perf.data` was removed after inspection (`rm perf.data`) to keep the tree clean.

What we can infer despite the limitation
----------------------------------------
- Warmup zeroing dominates the page-fault samples; steady-state alloc/free is not represented.
- Hot candidates for the next pass (from previous code inspection and bench intuition):
  - `tiny_alloc_fast` / `malloc_tiny_fast` (C7 fast path)
  - `hak_tiny_free_fast_v2`
  - `tiny_unified_cache` hit path helpers
  - `tls_sll_pop_impl` / `tiny_tls_sll_drain`

Next measurement options
------------------------
- If perf cycles are still blocked:
  - Use `perf stat -e cycles,instructions,branches,branch-misses -r 5 -- ...` to get aggregate IPC per workload.
  - Add temporary user-space counters (Box-guarded) around C7 alloc/free hot sections to estimate per-op cycles.
- Run perf with elevated permissions or lower `perf_event_paranoid` if available.

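A temporary user-space counter of the kind mentioned above could be as small as the sketch below. It assumes a POSIX host with `clock_gettime(CLOCK_MONOTONIC)` and measures nanoseconds rather than cycles, so a cycle estimate would still need the core clock frequency (or a cycle counter such as the TSC on x86). The `SketchHotCounter` type and the `sketch_*` helpers are hypothetical names, not existing hakmem symbols.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Accumulates elapsed time over batches of operations. */
typedef struct SketchHotCounter {
    uint64_t total_ns;  /* summed elapsed nanoseconds */
    uint64_t samples;   /* number of operations covered */
} SketchHotCounter;

/* Monotonic timestamp in nanoseconds. */
static inline uint64_t sketch_now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Record one timed batch; batching amortises the clock overhead. */
static inline void sketch_counter_add(SketchHotCounter *c,
                                      uint64_t t0, uint64_t t1,
                                      uint64_t n_ops) {
    c->total_ns += t1 - t0;
    c->samples  += n_ops;
}

/* Average nanoseconds per operation, or 0 when nothing was recorded. */
static inline double sketch_counter_ns_per_op(const SketchHotCounter *c) {
    return c->samples ? (double)c->total_ns / (double)c->samples : 0.0;
}
```

Timing a batch of, say, a few thousand C7 alloc/free pairs between two `sketch_now_ns()` calls keeps the timer cost out of the per-op figure; timing every single operation would mostly measure the clock call itself.
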
Aggregate perf stat snapshot (bench profile)
--------------------------------------------
- Env: `HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2`
- Workloads (Release, 3× perf stat):
  - 16–1024B: cycles≈109.8M, instructions≈233.2M → IPC≈2.12, branches≈49.4M, branch-miss≈2.89%
  - 129–1024B: cycles≈109.6M, instructions≈230.3M → IPC≈2.10, branches≈48.8M, branch-miss≈2.90%
  - 16–1024B with `HAKMEM_TINY_C7_HOT=1` (UC hit-only + flat TLS→UC→cold):
    - cycles≈111.8M, instructions≈242.1M → IPC≈2.16, branches≈52.0M, branch-miss≈2.75%
- RSS≈7.1MB; throughput ≈47.4–47.6M ops/s (hot=1) vs ≈47.2M (hot=0) on the same runset.

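For reference, the derived figures in this snapshot are plain ratios of the raw counters. The helpers below are a hypothetical sketch for reproducing them; they are not part of the bench harness.

```c
#include <assert.h>

/* Instructions per cycle; returns 0 when no cycles were counted. */
static inline double sketch_ipc(double instructions, double cycles) {
    return cycles > 0.0 ? instructions / cycles : 0.0;
}

/* Relative throughput change of a trial run against a baseline, in percent. */
static inline double sketch_ops_delta_pct(double base_ops, double trial_ops) {
    return base_ops > 0.0 ? (trial_ops - base_ops) / base_ops * 100.0 : 0.0;
}
```

For the 16–1024B row, `sketch_ipc(233.2e6, 109.8e6)` comes out at about 2.12, and the gap between 47.2M and 47.6M ops/s is under 1%.
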
Action items flowing from this note
-----------------------------------
- Proceed with design notes for C7 alloc/free flattening and UC hit simplification based on code structure.
- Keep warmup zeroing out of the steady-state loop when profiling (consider `HAKMEM_BENCH_FAST_MODE` for future captures).

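Conclusion (current state)
--------------------------
- `HAKMEM_TINY_C7_HOT` stays in the tree as an experimental flag and remains OFF by default. Turning it ON improves branch-miss only slightly, and ops/s is on par to slightly lower.
- For now, the current "safe and reasonably fast" path is the baseline; further flattening will be considered separately if a need arises.
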
docs/analysis/LARGE_GLOBALS_OVERVIEW.md (Normal file, +84)
@@ -0,0 +1,84 @@

LARGE_GLOBALS_OVERVIEW
======================

Overview
--------
- Notes on the largest BSS/static symbols reported by `nm -S --size-sort bench_random_mixed_hakmem`.
- Each entry lists the symbol's apparent role and rough element count, together with the gap against a recent SS_STATS sample (short run, ws=64, iters=10k, `HAKMEM_SS_STATS_DUMP=1`).
- Purpose: input for the next-phase design of making SuperReg/SharedPool/Remote dynamic or smaller.

Commands
--------
```bash
nm -S --size-sort bench_random_mixed_hakmem | tail -n 120
HAKMEM_SS_STATS_DUMP=1 ./bench_random_mixed_hakmem 10000 64 1 2> /tmp/ss_stats_sample.log
```

Large symbols observed
----------------------

| Symbol | Size | Role / Box | Notes / gap |
| --- | --- | --- | --- |
| `g_super_reg` | 0x1800000 ≈ 24.0 MB | Entire Super Registry | SS_STATS shows only C2=1, C7=1 live. Mostly an unused fixed-size array. |
| `g_rem_side` | 0x1000000 ≈ 16.0 MB | Remote Queue side buffer | Oversized for the thread and node counts. Almost unused in bench. |
| `g_shared_pool` | 0x238140 ≈ 2.22 MB | Shared Pool table | Large capacity for only 2 live SS. Room to shrink per class. |
| `g_super_reg_by_class` | 0x100000 ≈ 1.0 MB | Per-class SuperReg index | Fixed 1 MB for 8 classes. Could be compressed by making it dynamic. |
| `g_free_node_pool` | 0xC0000 ≈ 0.75 MB | Free node pool | Used by Remote/Pool. Not small, but well below the top entries. |
| `g_mf2_page_registry.lto_priv.0` | 0x82810 ≈ 0.51 MB | MF2 page registry | Used by the MF2 path. |
| `g_tls_mags` | 0x40040 ≈ 0.25 MB | TLS Magazine array | Sized for the maximum thread count. Actual use is small. |
| `g_site_rules` | 0x40040 ≈ 0.25 MB | Site rule table | Fixed length. |
| `g_mid_desc_mu` | 0x14000 ≈ 80 KB | Mid-size desc | Medium. |
| `g_mid_tc_mu` | 0xA000 ≈ 40 KB | Mid-size TC | Medium. |
| `g_pool.lto_priv.0` | 0x9680 ≈ 37.6 KB | Pool array | Medium. |
| `g_tiny_page_box` | 0xC40 ≈ 3.1 KB | Tiny Page Box array | Tiny Front family. Negligible. |
| `g_tls_hot_mag` | 0x2040 ≈ 8.1 KB | TLS Hot Magazine | Negligible. |
| `g_fast_cache` | 0x2200 ≈ 8.5 KB | Fast cache | Negligible. |

Supplement (SS_STATS sample)
----------------------------
- Result of a short run (ws=64, iters=10k, Release, `HAKMEM_SS_STATS_DUMP=1`):
  - `[SS_STATS] class live empty_events slab_live_events`
  - `C2: live=1 empty=0 slab_live=0`
  - `C7: live=1 empty=1 slab_live=0`
  - `[RSS] max_kb=29568`
- Against the capacity of the huge arrays, only 2 Superslabs are actually live. The fixed-size BSS arrays, not live Superslabs, evidently dominate RSS.

Definitions and roles (code locations)
--------------------------------------
- Super Registry (`core/hakmem_super_registry.{h,c}`)
  - `g_super_reg[SUPER_REG_SIZE]`: hash registration (default 1,048,576 entries = 24 MB; adjustable via `SUPER_REG_SIZE`)
  - `g_super_reg_by_class[TINY_NUM_CLASSES][SUPER_REG_PER_CLASS]`: for per-class scans (default 8×16384 = 128K slots ≈ 1 MB)
  - `g_super_reg_class_size[]`: per-class live count
  - `g_ss_lru_cache`: LRU reuse cache (small memory footprint)
- Shared Pool (`core/hakmem_shared_pool.{h,c}` + `_acquire.c` + `_release.c`)
  - `g_shared_pool`: a largish struct that co-locates the Superslab array with per-class hints, active state, free lists, and metadata arrays (≈2.2 MB)
  - `g_shared_pool.ss_metadata[]`: per-Superslab metadata array
- Remote Queue (`core/tiny_remote.c`)
  - `g_rem_side[REM_SIDE_SIZE]`: hash for cross-thread frees (`REM_SIDE_LOG2=20` → 1M entries ≈ 16 MB)
  - The debug-only `g_rem_track[]` is dropped in release builds, so it has no size impact
- Free Node Pool (`core/pool_refill.c` and others)
  - `g_free_node_pool`: node stock for pool refilling (≈0.75 MB)
- TLS / MF2
  - `g_tls_mags` (`core/hakmem_tiny_magazine.c`): TLS magazine array (≈0.25 MB, sized for the thread count)
  - `g_mf2_page_registry` (`core/mf2*`): page registry used when MF2 is enabled (≈0.5 MB)
  - `g_ss_addr_map` (`core/box/ss_addr_map_box.h`): Superslab address lookup hash (medium size)

Shrink targets for bench (proposal)
-----------------------------------
- SuperReg
  - Current: `SUPER_REG_SIZE=1,048,576` (24 MB), `SUPER_REG_PER_CLASS=16384` (1 MB)
  - Bench target: `SUPER_REG_SIZE_BENCH=65,536` (~1.5 MB), `SUPER_REG_PER_CLASS_BENCH=1024` (~64 KB)
- Shared Pool
  - Current: capacity grows dynamically, but the initial size is large (about 2.2 MB)
  - Bench target: cap the initial capacity at 64 to 128 and shrink the per-class slots as well
- Remote Queue
  - Current: `REM_SIDE_LOG2=20` → 1M entries (16 MB)
  - Bench target: reduce to around `REM_SIDE_LOG2=16` (64K entries ≈ 1 MB)
- Free Node Pool / TLS Mag / MF2
  - In bench there is room to defer initialization or halve the fixed arrays, depending on the thread count and whether MF2 is on.

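The sizes quoted above follow from entries × bytes per entry. The sketch below reproduces that arithmetic; the per-entry sizes (24, 8 and 16 bytes) are not stated in this note and are inferred here from the totals (for example 24 MB / 1,048,576 entries), so treat them as assumptions. All `sketch_*` / `SKETCH_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Inferred per-entry sizes (assumptions derived from the totals above). */
#define SKETCH_SUPER_REG_ENTRY_BYTES 24u  /* 24 MB / 1,048,576 entries */
#define SKETCH_BY_CLASS_SLOT_BYTES    8u  /* 1 MB / (8 x 16384) slots */
#define SKETCH_REM_SIDE_ENTRY_BYTES  16u  /* 16 MB / 2^20 entries */

/* Footprint of a fixed table in bytes. */
static inline size_t sketch_table_bytes(size_t entries, size_t entry_bytes) {
    return entries * entry_bytes;
}

/* Entry count of a table sized as a power of two. */
static inline size_t sketch_entries_from_log2(unsigned log2_entries) {
    return (size_t)1 << log2_entries;
}
```

With these assumptions, 65,536 registry entries come to 1.5 MB and `REM_SIDE_LOG2=16` comes to 1 MB, matching the bench targets listed above.
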
Next design steps (direction for Box-ification)
-----------------------------------------------
- Box SuperReg/SharedPool/Remote so that their capacities can be switched via `HAKMEM_PROFILE` (prod/full/bench/larson_guard, etc.).
- Add a small bench profile (registry/pool/remote at 1/4 to 1/8) and compare against mimalloc/system with RSS held down.
- Combine with the Superslab Budget Box to manage the live-count cap (budget) separately from the empty-SS reuse policy.

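A profile-driven capacity switch could be sketched as a small lookup from profile name to table sizes. This is only an illustration: the struct, the function and the choice to resolve the profile at runtime are assumptions, and only the numeric values are taken from the current and bench figures in this note.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity bundle selected per profile. */
typedef struct SketchCapacities {
    size_t   super_reg_entries;    /* SUPER_REG_SIZE equivalent */
    size_t   super_reg_per_class;  /* SUPER_REG_PER_CLASS equivalent */
    unsigned rem_side_log2;        /* REM_SIDE_LOG2 equivalent */
} SketchCapacities;

/* Map a profile name to capacities; unknown or NULL falls back to full size. */
static SketchCapacities sketch_capacities_for(const char *profile) {
    const SketchCapacities full  = { 1048576, 16384, 20 };
    const SketchCapacities bench = {   65536,  1024, 16 };
    if (profile != NULL && strcmp(profile, "bench") == 0)
        return bench;
    return full;
}
```

A caller might pass `getenv("HAKMEM_PROFILE")` once at init and allocate the tables from the returned capacities, which would keep the sizing decision inside a single Box.
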
docs/analysis/SUPERSLAB_STATS_SNAPSHOT.md (Normal file, +40)
@@ -0,0 +1,40 @@

# Superslab Stats Snapshot (larson_guard, 2025-12-06)

Command:
`HAKMEM_TINY_PROFILE=larson_guard HAKMEM_SS_STATS_DUMP=1 ./bench_allocators_hakmem larson 1 10000 1`

Log excerpt:
```
[SS_STATS] class live empty_events slab_live_events
C2: live=1 empty=0 slab_live=0
```

Note: with larson_guard, the Superslab count plateaus near the budget and the run completes without runaway growth.

# Superslab Stats Snapshot (bench profile, 2025-12-06)

Command:
`HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2 HAKMEM_SS_STATS_DUMP=1 ./bench_random_mixed_hakmem 1000000 256 42`

Log excerpt:
```
[SS_STATS] class live empty_events slab_live_events
C2: live=1 empty=0 slab_live=0
C7: live=1 empty=1 slab_live=0
[RSS] max_kb=7168
```

Note: even with the bench profile (the variant with shrunken SuperReg/Remote arrays), live Superslabs stay at C2=1, C7=1, and RSS drops to ~7 MB.

# Tiny Mem Stats Snapshot (bench profile, 2025-12-06)

Command:
`HAKMEM_PROFILE=bench HAKMEM_TINY_PROFILE=full HAKMEM_WARM_TLS_BIND_C7=2 HAKMEM_TINY_MEM_DUMP=1 ./bench_random_mixed_hakmem 1000 8 1`

Log excerpt:
```
[TINY_MEM_STATS] unified_cache=36KB warm_pool=2KB page_box=3KB tls_mag=0KB policy_stats=0KB total=41KB
[RSS] max_kb=7040
```

Note: the Tiny layer alone (UC/Warm/Page/TLS/Policy) takes only a few tens of KB; the RSS reduction in the bench profile comes mainly from shrinking the SuperReg/Remote arrays.