hakmem/CURRENT_TASK.md

# Current Task: Phase 7 + Pool TLS: Step 4.x Integration & Validation (Tiny P0: default ON)

**Date**: 2025-11-09
**Status**: 🚀 In Progress (Step 4.x)
**Priority**: HIGH

---

## 🎯 Goal

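Following Box Theory, push forward "syscall thinning" (fewer syscalls per operation) and "single-boundary consolidation" (one place where the allocator descends to the OS), centered on Pool TLS, to make Tiny/Mid/Larson consistently faster.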

### **Why This Works**
Phase 7 Task 3 achieved **+180-280% improvement** by pre-warming:
- **Before**: First allocation → TLS miss → SuperSlab refill (100+ cycles)
- **After**: First allocation → TLS hit (15 cycles, pre-populated cache)

**Same bottleneck exists in Pool TLS**:
- First 8KB allocation → TLS miss → Arena carve → mmap (1000+ cycles)
- Pre-warm eliminates this cold-start penalty

---

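## 📊 Current Status (main progress through Step 4)

### Implementation Summary (Tiny + Pool TLS)
- ✅ Tiny 1024B special case (headerless) plus lightweight adaptive refill for class7 (cuts off the main cause of frequent mmap calls)
- ✅ Single boundary for descending to the OS (`hak_os_map_boundary()`): all mmap calls go through one place
- ✅ Pool TLS Arena (exponential growth 1→2→4→8MB, tunable via ENV): mmap calls are consolidated into the arena
- ✅ Page Registry (resolves the owner through chunk registration/lookup)
- ✅ Remote Queue (for Pool, mutex-bucket version) plus a lightweight drain wired in before alloc

#### Tiny P0 (Batch Refill)
- ✅ Fixed a fatal P0 bug (`meta->used += from_freelist` was missing after the bulk freelist→SLL transfer)
- ✅ Fail-fast guards for linear carve (all paths: simple, general, TLS bump)
- ✅ Runtime A/B switch implemented:
  - Default ON (`HAKMEM_TINY_P0_ENABLE` unset or ≠0)
  - Kill: `HAKMEM_TINY_P0_DISABLE=1`; drain toggle: `HAKMEM_TINY_P0_NO_DRAIN=1`; logging: `HAKMEM_TINY_P0_LOG=1`
- ✅ Benchmark: at 100k×256B (1T), P0 ON is fastest (~2.76M ops/s); P0 OFF is ~2.73M ops/s (stable)
- ⚠️ Known issue: the `[P0_COUNTER_MISMATCH]` warning (a difference between active_delta and taken) still appears rarely, but the SEGV is resolved (audit ongoing)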
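The A/B switch semantics listed above can be sketched as a small pure function. This is a minimal sketch, not the code in the tree: the helper name `tiny_p0_enabled` is hypothetical, and the precedence (the kill switch wins over the enable variable) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

// Hypothetical helper (not from the hakmem sources).
// enable  = value of HAKMEM_TINY_P0_ENABLE,  or NULL when unset
// disable = value of HAKMEM_TINY_P0_DISABLE, or NULL when unset
static bool tiny_p0_enabled(const char* enable, const char* disable) {
    // Assumed precedence: the kill switch overrides everything else.
    if (disable != NULL && strcmp(disable, "1") == 0) {
        return false;
    }
    // Default ON: only an explicit "0" turns P0 off.
    return enable == NULL || strcmp(enable, "0") != 0;
}
```

A caller would pass `getenv("HAKMEM_TINY_P0_ENABLE")` and `getenv("HAKMEM_TINY_P0_DISABLE")` once at init and cache the result, so the hot path never touches the environment.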

##### NEW: Root cause and fix for the P0 carve loop (SEGV resolved)
- 🔴 Root cause: inside the P0 batch carve loop, `superslab_refill(class_idx)` makes the TLS point to a new SuperSlab, but the code updated only `meta=tls->meta` without reloading `tls`. As a result, `ss_active_add(tls->ss, batch)` credited the old SuperSlab, which corrupted the active counter and led to the SEGV.
- 🛠 Fix: reload `tls = &g_tls_slabs[class_idx]; meta = tls->meta;` after `superslab_refill()` (core/hakmem_tiny_refill_p0.inc.h).
- 🧪 Verification: fixed-size 256B/1KB runs (200k iters) complete with no SEGV reproduction; active_delta=0 confirmed. RS improved slightly (0.8–0.9%; still a target for further optimization).

Details: docs/TINY_P0_BATCH_REFILL.md
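The stale-pointer pattern behind this bug can be illustrated with a simplified model. Everything below is a mock written for this note (types, names, and the single-refill limit are assumptions); it does not reproduce the real code in `core/hakmem_tiny_refill_p0.inc.h`. It only shows why every pointer cached from the TLS slot must be reloaded after a refill rebinds it.

```c
#include <assert.h>
#include <stddef.h>

// Mock types for illustration only; the real structures differ.
typedef struct { int active; } MockSuperSlab;
typedef struct { int used; int capacity; } MockMeta;
typedef struct { MockSuperSlab* ss; MockMeta* meta; } MockTlsSlab;

static MockTlsSlab g_mock_tls;  // stands in for g_tls_slabs[class_idx]

// Stand-in for superslab_refill(): rebinds the TLS slot to a fresh slab.
static void mock_refill(MockSuperSlab* fresh_ss, MockMeta* fresh_meta) {
    g_mock_tls.ss = fresh_ss;
    g_mock_tls.meta = fresh_meta;
}

// Carves up to `want` blocks and returns how many were carved.
static int mock_carve(int want, MockSuperSlab* fresh_ss, MockMeta* fresh_meta) {
    MockSuperSlab* ss = g_mock_tls.ss;
    MockMeta* meta = g_mock_tls.meta;
    int carved = 0;
    int refilled = 0;

    while (carved < want) {
        int room = meta->capacity - meta->used;
        if (room == 0) {
            if (refilled) break;  // at most one refill in this sketch
            mock_refill(fresh_ss, fresh_meta);
            refilled = 1;
            // The fix: reload BOTH cached pointers after the refill.
            // Reloading only `meta` would keep crediting the old `ss`.
            ss = g_mock_tls.ss;
            meta = g_mock_tls.meta;
            continue;
        }
        int left = want - carved;
        int batch = left < room ? left : room;
        meta->used += batch;
        ss->active += batch;  // stand-in for ss_active_add(tls->ss, batch)
        carved += batch;
    }
    return carved;
}
```

With an old slab that has 2 free blocks and a request for 5, the old SuperSlab is credited 2 and the fresh one 3, so each active counter matches its own `used` delta.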

---

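## 🚀 Next Steps (Actions)

1) Integrate the Remote Queue drain with the Pool TLS refill boundary as well (at low water: drain→refill→bind)
- Current: a drain at the pool_alloc entry and an additional drain at low water after pop are already implemented
- To add: also attempt a drain on the refill path (immediately before calling `pool_refill_and_alloc`), and skip the refill when the drain succeeds

2) Confirm the syscall reduction with strace (turn it into a metric)
- RandomMixed: 256 / 1024B, `mmap/madvise/munmap` counts for each (-c totals)
- PoolTLS: compare the `mmap/madvise/munmap` reduction at 1T/4T (before vs. after introducing the Arena)

3) Performance A/B (ENV: INIT/MAX/GROWTH) to find the tuning sweet spots
- Evaluate combinations of `HAKMEM_POOL_TLS_ARENA_MB_INIT`, `HAKMEM_POOL_TLS_ARENA_MB_MAX`, and `HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS`
- Goal: reduce syscalls while keeping memory usage within an acceptable range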

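4) Speed up the Remote Queue (next phase)
- Start with mutex → split locks / lightweight spinning, then per-class queues if needed
- Make the Page Registry O(1) (per-page table); in the future, move to per-arena IDs

5) Direct-fill optimization for Tiny 256B/1KB (performance)
- Using the single-round-trip P0→FC direct-fill design, apply the following in stages (A/B switches already in place)
  - Sweep the FC cap/batch upper limits (class5/7)
  - Tune the remote drain threshold (reduce drain frequency)
  - Enforce adopt-first (retry before map)
  - Light unrolling of the array fill and a review of branch hints (reduce branch misses)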
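Step 1 above (attempt a drain on the refill path and skip the refill when the drain succeeds) can be sketched with a mock pool. All names below are hypothetical stand-ins, the batch size of 8 is arbitrary, and the real Remote Queue uses mutex buckets rather than a plain counter; the sketch only shows the control flow.

```c
#include <assert.h>
#include <stdbool.h>

// Mock pool state; every name here is hypothetical.
typedef struct {
    int tls_free;        // blocks on the thread-local freelist
    int remote_pending;  // blocks freed by other threads, not yet drained
    int refills;         // how many times the (expensive) refill path ran
} MockPool;

// Moves all remotely freed blocks onto the local freelist.
static int mock_remote_drain(MockPool* p) {
    int drained = p->remote_pending;
    p->tls_free += drained;
    p->remote_pending = 0;
    return drained;
}

// Stand-in for pool_refill_and_alloc(): carves a fresh batch.
static void mock_pool_refill(MockPool* p, int batch) {
    p->tls_free += batch;
    p->refills++;
}

// Allocation with "drain first, refill only if the drain found nothing".
static bool mock_pool_alloc(MockPool* p) {
    if (p->tls_free == 0 && mock_remote_drain(p) == 0) {
        mock_pool_refill(p, 8);
    }
    if (p->tls_free == 0) return false;
    p->tls_free--;
    return true;
}
```

When remote frees are pending, the allocation is served from the drained blocks and the refill counter stays at zero, which is the syscall-avoiding behavior the step aims for.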

### NEW: Today's changes and measurement snapshot (Ryzen 7 5825U)
- Changes (for Tiny 256B/1KB)
  - FastCache effective capacity is now applied strictly per class (`tiny_fc_room/push_bulk` use `g_fast_cap[c]`)
  - Default caps revised: class5=96, class7=48 (overridable via ENV: `HAKMEM_TINY_FAST_CAP_C{5,7}`)
  - Direct-FC drain threshold default changed from 32 to 64 (ENV: `HAKMEM_TINY_P0_DRAIN_THRESH`)
  - Direct-FC for class7 is OFF by default (enable explicitly with `HAKMEM_TINY_P0_DIRECT_FC_C7=1`)

- Fixed-size benchmark (release, 200k iters)
  - 256B: 4.49–4.54M ops/s, branch-miss ≈ 8.89% (improved from the previous ≈11%)
  - 1KB: currently SEGVs (reproduces even with Direct-FC OFF) → possibly a remaining defect in the P0 general path
  - Results saved to: benchmarks/results/<date>_ryzen7-5825U_fixed/

- Recommendation: for now, stop P0 for class7 via A/B (`HAKMEM_TINY_P0_DISABLE=1`, or introduce a class7-only guard) and tune 256B first.

**Challenge**: Pool blocks are LARGE (8KB-52KB) vs Tiny (128B-1KB)

**Memory Budget Analysis**:
```
Phase 7 Tiny:
- 16 blocks × 1KB = 16KB per class
- 7 classes × 16KB = 112KB total ✅ Acceptable

Pool TLS (Naive):
- 16 blocks × 8KB = 128KB (class 0)
- 16 blocks × 52KB = 832KB (class 6)
- Total: ~3.4MB (16 blocks × 220KB summed over the 7 classes) ❌ Too much!
```

**Smart Strategy**: Variable pre-warm counts based on expected usage
```c
// Hot classes (8-24KB) - common in real workloads
Class 0 (8KB):  16 blocks = 128KB
Class 1 (16KB): 16 blocks = 256KB
Class 2 (24KB): 12 blocks = 288KB

// Warm classes (32-40KB)
Class 3 (32KB): 8 blocks = 256KB
Class 4 (40KB): 8 blocks = 320KB

// Cold classes (48-52KB) - rare
Class 5 (48KB): 4 blocks = 192KB
Class 6 (52KB): 4 blocks = 208KB

Total: ~1.6MB ✅ Acceptable
```

**Rationale**:
1. Smaller classes are used more frequently (Pareto principle)
2. Total memory: 1.6MB (reasonable for 8-52KB allocations)
3. Covers most real-world workload patterns
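The budget arithmetic above can be checked with a few lines of C. This is a small sketch for verification only; the constant and function names are made up for this note, and the class sizes are the ones listed in the table.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_POOL_CLASSES 7  // assumed: matches the seven classes listed above

// Block sizes in KB for classes 0..6, taken from the table above.
static const size_t CLASS_SIZE_KB[NUM_POOL_CLASSES] = {8, 16, 24, 32, 40, 48, 52};

// Sums counts[i] * size[i] over all classes, in KB.
static size_t prewarm_total_kb(const int counts[NUM_POOL_CLASSES]) {
    size_t total = 0;
    for (int i = 0; i < NUM_POOL_CLASSES; i++) {
        total += (size_t)counts[i] * CLASS_SIZE_KB[i];
    }
    return total;
}
```

With the variable counts `{16, 16, 12, 8, 8, 4, 4}` this returns 1648 KB (about 1.6MB); a uniform 16 blocks per class gives 3520 KB of raw block payload.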

---

## ENV (Arena-related)
```
# Initial chunk size in MB (default: 1)
export HAKMEM_POOL_TLS_ARENA_MB_INIT=2

# Maximum chunk size in MB (default: 8)
export HAKMEM_POOL_TLS_ARENA_MB_MAX=16

# Number of growth levels (default: 3 → 1→2→4→8MB)
export HAKMEM_POOL_TLS_ARENA_GROWTH_LEVELS=4
```
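How the three variables might combine into a growth schedule is sketched below. The helper name and the exact interaction are assumptions (the size doubles once per growth step, for at most `levels` doublings, and is always clamped to the maximum); the actual arena code may differ.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper: chunk size in MB for the n-th arena growth step.
// Assumption: the size doubles per step for at most `levels` steps and is
// always clamped to `max_mb`.
static size_t arena_chunk_mb(size_t init_mb, size_t max_mb,
                             unsigned levels, unsigned step) {
    unsigned doublings = step < levels ? step : levels;
    size_t mb = init_mb << doublings;
    return mb < max_mb ? mb : max_mb;
}
```

Under these assumptions the defaults (init 1, max 8, 3 levels) give 1, 2, 4, 8, 8, ... MB, and the example settings above (init 2, max 16, 4 levels) give 2, 4, 8, 16, 16, ... MB.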

**Location**: `core/pool_tls.c`

**Code**:
```c
// Pre-warm counts optimized for memory usage
static const int PREWARM_COUNTS[POOL_SIZE_CLASSES] = {
    16, 16, 12,  // Hot: 8KB, 16KB, 24KB
    8, 8,        // Warm: 32KB, 40KB
    4, 4         // Cold: 48KB, 52KB
};
#define PREWARM_MAX_COUNT 16  // Largest entry in PREWARM_COUNTS

void pool_tls_prewarm(void) {
    void* held[PREWARM_MAX_COUNT];

    for (int class_idx = 0; class_idx < POOL_SIZE_CLASSES; class_idx++) {
        int count = PREWARM_COUNTS[class_idx];
        size_t size = POOL_CLASS_SIZES[class_idx];
        int got = 0;

        // Hold all blocks first: an alloc/free pair per iteration can
        // just recycle the same block and leave the cache under-filled.
        while (got < count) {
            void* ptr = pool_alloc(size);
            if (!ptr) break;  // OOM during pre-warm (rare, but handle gracefully)
            held[got++] = ptr;
        }

        // Free everything so the blocks land on this thread's TLS freelist
        while (got > 0) {
            pool_free(held[--got]);
        }
    }
}
```

**Header Addition** (`core/pool_tls.h`):
```c
// Pre-warm TLS cache (call once at thread init)
void pool_tls_prewarm(void);
```

---

## Quick Check (recommended)
```
# PoolTLS
./build.sh bench_pool_tls_hakmem
./bench_pool_tls_hakmem 1 100000 256 42
./bench_pool_tls_hakmem 4 50000 256 42

# syscall measurement (check that the mmap/madvise/munmap totals have gone down)
strace -e trace=mmap,madvise,munmap -c ./bench_pool_tls_hakmem 1 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 256 42
strace -e trace=mmap,madvise,munmap -c ./bench_random_mixed_hakmem 100000 1024 42
```

**Location**: `core/hakmem.c` (or wherever Pool TLS init happens)

**Code**:
```c
#ifdef HAKMEM_POOL_TLS_PHASE1
    // Initialize Pool TLS
    pool_thread_init();

    // Pre-warm cache (Phase 1.5b optimization)
    #ifdef HAKMEM_POOL_TLS_PREWARM
    pool_tls_prewarm();
    #endif
#endif
```

**Makefile Addition**:
```makefile
# Pool TLS Phase 1.5b - Pre-warm optimization
ifeq ($(POOL_TLS_PREWARM),1)
CFLAGS += -DHAKMEM_POOL_TLS_PREWARM=1
endif
```

**Update `build.sh`**:
```bash
# NEW: POOL_TLS_PREWARM=1 (a comment after a trailing backslash would break the line continuation)
make \
  POOL_TLS_PHASE1=1 \
  POOL_TLS_PREWARM=1 \
  HEADER_CLASSIDX=1 \
  AGGRESSIVE_INLINE=1 \
  PREWARM_TLS=1 \
  "${TARGET}"
```

---

### **Step 4: Build & Smoke Test** ⏳ 10 min

```bash
# Build with pre-warm enabled
./build_pool_tls.sh bench_mid_large_mt_hakmem

# Quick smoke test
./dev_pool_tls.sh test

# Expected: No crashes, similar or better performance
```

---

### **Step 5: Benchmark** ⏳ 15 min

```bash
# Full benchmark vs System malloc
./run_pool_bench.sh

# Expected results:
# Before (1.5a): 1.79M ops/s
# After (1.5b):  5-15M ops/s (about 3-8x)
```

**Additional benchmarks**:
```bash
# Different sizes
./bench_mid_large_mt_hakmem 1 100000 256 42   # 8-32KB mixed
./bench_mid_large_mt_hakmem 1 100000 1024 42  # Larger workset

# Multi-threaded
./bench_mid_large_mt_hakmem 4 100000 256 42   # 4T
```

---

### **Step 6: Measure & Analyze** ⏳ 10 min

**Metrics to collect**:
1. ops/s improvement (target: about 3-8x)
2. Memory overhead (should be ~1.6MB per thread)
3. Cold-start penalty reduction (first allocation latency)

**Success Criteria**:
- ✅ No crashes or stability issues
- ✅ +180% or better improvement (5M ops/s minimum)
- ✅ Memory overhead < 2MB per thread
- ✅ No performance regression on small workloads

---

### **Step 7: Tune (if needed)** ⏳ 15 min (optional)

**If results are suboptimal**, adjust pre-warm counts:

**Too slow** (< 5M ops/s):
- Increase hot class pre-warm (16 → 24)
- More aggressive: Pre-warm all classes to 16

**Memory too high** (> 2MB):
- Reduce cold class pre-warm (4 → 2)
- Lazy pre-warm: Only hot classes initially

**Adaptive approach**:
```c
// Pre-warm based on runtime heuristics
void pool_tls_prewarm_adaptive(void) {
    // Start with minimal pre-warm
    static const int MIN_PREWARM[7] = {8, 8, 4, 4, 2, 2, 2};

    // TODO: Track usage patterns and adjust dynamically
}
```

---

## 📋 **Implementation Checklist**

### **Phase 1.5b: Pre-warm Optimization**

- [ ] **Step 1**: Design pre-warm strategy (15 min)
  - [ ] Analyze memory budget
  - [ ] Decide pre-warm counts per class
  - [ ] Document rationale

- [ ] **Step 2**: Implement `pool_tls_prewarm()` (20 min)
  - [ ] Add PREWARM_COUNTS array
  - [ ] Write pre-warm function
  - [ ] Add to pool_tls.h

- [ ] **Step 3**: Integrate with init (10 min)
  - [ ] Add call to hakmem.c init
  - [ ] Add Makefile flag
  - [ ] Update build.sh

- [ ] **Step 4**: Build & smoke test (10 min)
  - [ ] Build with pre-warm enabled
  - [ ] Run dev_pool_tls.sh test
  - [ ] Verify no crashes

- [ ] **Step 5**: Benchmark (15 min)
  - [ ] Run run_pool_bench.sh
  - [ ] Test different sizes
  - [ ] Test multi-threaded

- [ ] **Step 6**: Measure & analyze (10 min)
  - [ ] Record performance improvement
  - [ ] Measure memory overhead
  - [ ] Validate success criteria

- [ ] **Step 7**: Tune (optional, 15 min)
  - [ ] Adjust pre-warm counts if needed
  - [ ] Re-benchmark
  - [ ] Document final configuration

**Total Estimated Time**: about 1.5 hours (80 minutes, or 95 with the optional tuning step)

---

## 🎯 **Expected Outcomes**

### **Performance Targets**
```
Phase 1.5a (current): 1.79M ops/s
Phase 1.5b (target):  5-15M ops/s (about 3-8x)

Conservative: 5M ops/s   (+180%)
Expected:     8M ops/s   (+350%)
Optimistic:   15M ops/s  (+740%)
```

### **Comparison to Phase 7**
```
Phase 7 Task 3 (Tiny):
  Before: 21M → After: 59M ops/s (+181%)

Phase 1.5b (Pool):
  Before: 1.79M → After: 5-15M ops/s (+180-740%)

Similar or better improvement expected!
```

### **Risk Assessment**
- **Technical Risk**: LOW (proven pattern from Phase 7)
- **Stability Risk**: LOW (simple, non-invasive change)
- **Memory Risk**: LOW (1.6MB is negligible for Pool workloads)
- **Complexity Risk**: LOW (< 50 LOC change)

---

## 📁 **Related Documents**

- `CLAUDE.md` - Development history (Phase 1.5a documented)
- `POOL_TLS_QUICKSTART.md` - Quick start guide
- `POOL_TLS_INVESTIGATION_FINAL.md` - Phase 1.5a debugging journey
- `PHASE7_TASK3_RESULTS.md` - Pre-warm success pattern (Tiny)

---

## 🚀 **Next Actions**

**NOW**: Start Step 1 - Design pre-warm strategy
**NEXT**: Implement pool_tls_prewarm() function
**THEN**: Build, test, benchmark

**Estimated Completion**: 1.5 hours from start
**Success Probability**: 90% (proven technique)

---

**Status**: Ready to implement - awaiting user confirmation to proceed! 🚀

---

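## NEW 2025-11-11: Tiny L1-miss increase and UB fix (FastCache/Free chain)

Structural policy (confirmed)
- Conclusion: the structure can stay as it is. The box layout that centralizes next handling in `tiny_nextptr.h` secures safety and consistency.
- On that basis, continue A/B testing and parameter tuning, and move to a redesign such as "class-limited headers" only when necessary.

Symptoms (reported values + reproduced measurements)
- Average throughput: 56.7M → 55.95M ops/s (-1.3%, within noise)
- L1-dcache-miss: 335M → 501M (+49.5%)
- In this environment, `bench_random_mixed_hakmem 100000 256 42` also shows L1 miss ≈ 3.7–4.0% (stable)
- mimalloc under the same conditions: 98–110M ops/s (a large gap)

Root-cause hypotheses (high confidence)
1) Alignment breakage caused by the header scheme (the main suspect)
   - The 1-byte header shifts the user ptr by +1, so stride = size+1 and many classes lose 16B alignment.
   - Example: a 256B → 257B stride leaves 15 of every 16 blocks unaligned. This is the main cause of the increase in L1 misses/μops.
2) Dereferencing an unaligned next through void** (UB)
   - C0–C6 store/read next at base+1, which is an unaligned access and UB in C.
   - This may hurt compiler optimization or increase spills.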
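The "15 of 16 blocks unaligned" figure follows from 257 mod 16 = 1: each successive block start shifts by one byte relative to the 16B grid. A short check, assuming the slab base itself is 16B-aligned (the function name is made up for this note):

```c
#include <assert.h>
#include <stddef.h>

// Counts how many of the first `nblocks` block starts fall on an
// `align`-byte boundary, assuming the slab base itself is aligned.
static int count_aligned_blocks(size_t stride, size_t align, int nblocks) {
    int aligned = 0;
    for (int i = 0; i < nblocks; i++) {
        if (((size_t)i * stride) % align == 0) aligned++;
    }
    return aligned;
}
```

For a 257B stride only 1 of the first 16 block starts is 16B-aligned, while a 256B stride keeps all 16 aligned.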

Fix (applied: minimal patch to remove the UB)
- Added: a small safe-next-access box, `core/tiny_nextptr.h:1`
  - `tiny_next_off(int)`, `tiny_next_load(void*, cls)`, `tiny_next_store(void*, cls, void*)`
  - memcpy-based implementation that avoids undefined behavior even when unaligned
- Applied to (hot-path replacements)
  - `core/hakmem_tiny_fastcache.inc.h:76,108`
  - `core/tiny_free_magazine.inc.h:83,94`
  - `core/tiny_alloc_fast_inline.h:54` and the push side
  - `core/hakmem_tiny_tls_list.h:63,76,109,115` and others (pop/push/bulk)
  - `core/hakmem_tiny_bg_spill.c` (loop split/relink sections)
  - `core/hakmem_tiny_bg_spill.h` (spill push path)
  - `core/tiny_alloc_fast_sfc.inc.h` (pop/push)
  - `core/hakmem_tiny_lifecycle.inc` (drain handling for the SLL/Fast layers)
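A memcpy-based accessor of the kind described above might look like the following. This is a sketch, not the contents of `core/tiny_nextptr.h`: the `_sketch` names are made up, and the offset rule (1 for C0–C6, 0 for the headerless class 7) is inferred from the notes in this file rather than confirmed against the header.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// ASSUMPTION: classes 0-6 carry a 1-byte header, class 7 is headerless.
static inline size_t tiny_next_off_sketch(int cls) {
    return cls == 7 ? 0 : 1;
}

// Reads the next pointer with memcpy, so an unaligned address is fine.
static inline void* tiny_next_load_sketch(const void* base, int cls) {
    void* next;
    memcpy(&next, (const unsigned char*)base + tiny_next_off_sketch(cls),
           sizeof next);
    return next;
}

// Writes the next pointer with memcpy instead of a void** store.
static inline void tiny_next_store_sketch(void* base, int cls, void* next) {
    memcpy((unsigned char*)base + tiny_next_off_sketch(cls), &next,
           sizeof next);
}
```

Compilers typically lower a fixed-size `memcpy` like this to a single unaligned load or store on x86-64, so the accessor removes the UB without a measurable cost, which matches the flat throughput reported in this section.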

Release log suppression (made harmless)
- `[DEBUG ss_remote_push]` at `core/superslab/superslab_inline.h:208` moved under the
  `!HAKMEM_BUILD_RELEASE && HAKMEM_DEBUG_VERBOSE` guard
- `[C7_FIRST_FREE]` at `core/tiny_superslab_free.inc.h:36` likewise prints only under
  `!HAKMEM_BUILD_RELEASE && HAKMEM_DEBUG_VERBOSE`
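The guard condition above can be wrapped in a logging macro so that call sites compile to nothing in release builds. The macro names below are hypothetical; only the `!HAKMEM_BUILD_RELEASE && HAKMEM_DEBUG_VERBOSE` condition comes from the notes, and defaulting undefined switches to 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stdio.h>

// Default both switches to 0 when the build does not define them.
#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0
#endif
#ifndef HAKMEM_DEBUG_VERBOSE
#define HAKMEM_DEBUG_VERBOSE 0
#endif

// Hypothetical macro names; only the guard condition comes from the notes.
#if !HAKMEM_BUILD_RELEASE && HAKMEM_DEBUG_VERBOSE
#define TINY_DEBUG_LOG_ENABLED 1
#define TINY_DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define TINY_DEBUG_LOG_ENABLED 0
#define TINY_DEBUG_LOG(...) ((void)0)
#endif
```

Routing every debug print through one macro also helps when auditing for translation units that were built with mismatched flags, since the guard lives in a single place.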

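Effect
- Throughput and miss rate are within noise (the gain is mainly in correctness)
- Removed the UB from unaligned next access, leaving the code less likely to regress under future optimizations
- The gap to mimalloc remains large; the root cause is judged to be mainly "alignment breakage + cache design differences"

Measurement results (excerpt)
- hakmem Tiny:
  - `./bench_random_mixed_hakmem 100000 256 42`
    - Throughput: ≈8.8–9.1M ops/s
    - L1-dcache-load-misses: ≈1.50–1.60M (3.7–4.0%)
- mimalloc:
  - `LD_LIBRARY_PATH=... ./bench_random_mixed_mi 100000 256 42`
    - Throughput: ≈98–110M ops/s
- Fixed 256B (header ON/OFF comparison):
  - `./bench_fixed_size_hakmem 100000 256 42`
    - Header ON: ~3.86M ops/s, L1D miss ≈4.07%
    - Header OFF: ~4.00M ops/s, L1D miss ≈4.12% (noise-level)

Newly identified concerns and proposed responses
- Alignment breakage (most likely)
  - The 1B header makes stride = size+1, so many classes break 16B alignment (e.g. 256→257B).
  - A simple header ON/OFF comparison shows only a small difference, so this is treated as a combined effect with other factors and investigation continues.
- UB (undefined behavior)
  - Unaligned void** loads/stores have been replaced with the safe accessors in `tiny_nextptr.h`.
- Missing release guards
  - `[C7_FIRST_FREE]` / `[DEBUG ss_remote_push]` have been fixed so that release builds
    do not print them when `HAKMEM_DEBUG_VERBOSE` is unset.

Success criteria (Tiny side)
- A/B (header OFF or class-limited headers) lowers L1 misses and improves ops/s at fixed 256B
- Narrow the gap to mimalloc in stages (first to about 2–3x, eventually within 1.5x)

Tracking (reference files/lines)
- Safe next box:
  - `core/tiny_nextptr.h:1`
- Call-site replacements:
  - `core/hakmem_tiny_fastcache.inc.h:76,108`
  - `core/tiny_free_magazine.inc.h:83,94`
  - `core/tiny_alloc_fast_inline.h:54` and others
  - `core/hakmem_tiny_tls_list.h:63,76,109,115`
  - `core/hakmem_tiny_bg_spill.c` / `core/hakmem_tiny_bg_spill.h`
  - `core/tiny_alloc_fast_sfc.inc.h`
  - `core/hakmem_tiny_lifecycle.inc`
- Release log guards:
  - `core/superslab/superslab_inline.h:208`
  - `core/tiny_superslab_free.inc.h:36`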

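Fix (applied: minimal patch to remove the UB)
- Added: a small safe-next-access box, `core/tiny_nextptr.h:1`
  - Provides `tiny_next_load()/tiny_next_store()` on top of memcpy (no UB even when unaligned)
- Applied to (hot path)
  - `core/hakmem_tiny_fastcache.inc.h:76,108` (tiny_fast_pop/push)
  - `core/tiny_free_magazine.inc.h:83,94` (BG spill chain construction)

Effect (short-term measurement)
- Throughput and L1 misses are flat within noise (the main gain is correctness; performance is unchanged for now)
- The real issue is "alignment breakage" → to be confirmed by A/B with the next countermeasure

Open concerns (follow-up needed)
- Possible missing release guards: `[C7_FIRST_FREE]`/`[DEBUG ss_remote_push]` are printed once even in release
  - Locations: `core/tiny_superslab_free.inc.h:36`, `core/superslab/superslab_inline.h:208`
  - The Makefile passes `-DHAKMEM_BUILD_RELEASE=1` (also confirmed with print-flags). Audit for CFLAGS mismatches between TUs.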

Next actions (A/B to verify Tiny alignment)
1) A/B with headers fully disabled (immediate)
```
# A: current (header ON)
./build.sh bench_random_mixed_hakmem
perf stat -r 5 \
  -e cycles,instructions,branches,branch-misses,cache-references,cache-misses \
  -e L1-dcache-loads,L1-dcache-load-misses \
  -- ./bench_random_mixed_hakmem 100000 256 42

# B: header OFF (all classes)
EXTRA_MAKEFLAGS="HEADER_CLASSIDX=0" ./build.sh bench_random_mixed_hakmem
perf stat -r 5 \
  -e cycles,instructions,branches,branch-misses,cache-references,cache-misses \
  -e L1-dcache-loads,L1-dcache-load-misses \
  -- ./bench_random_mixed_hakmem 100000 256 42
```
2) Fixed-size 256B comparison (aimed at surfacing the alignment effect)
```
./build.sh bench_fixed_size_hakmem
perf stat -e cycles,instructions,cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses \
  -r 5 -- ./bench_fixed_size_hakmem 100000 256 42
```
3) Confirm FastCache is active (make the C0–C3 hit rates visible)
```
HAKMEM_TINY_FAST_STATS=1 ./bench_random_mixed_hakmem 100000 256 42
```

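Mid-term measures (Box design guidelines)
- Option A (simple, high impact): restrict headers to the small classes (C0–C3); C4–C6 prioritize alignment (no header).
  - Implementation: first confirm the effect of turning all headers OFF via A/B → if the effect is large, roll out "class-limited headers" in stages.
- Option B (advanced): move to an identification scheme that preserves alignment, such as footers or bit tags.
  - Example: keep class_idx in padding/tags that preserve 16B alignment (the trade-off against RSS and complexity needs verification).

Tracking (files/lines)
- Safe next box: `core/tiny_nextptr.h:1`
- Replaced: `core/hakmem_tiny_fastcache.inc.h:76,108`, `core/tiny_free_magazine.inc.h:83,94`
- Additional audit targets (not yet fixed, but they touch next directly)
  - `core/tiny_alloc_fast_inline.h:54,297`, `core/hakmem_tiny_tls_list.h:63,76,109,115`, and others

Success criteria (Tiny)
- A/B (header OFF) lowers L1 misses and raises ops/s at fixed 256B (±20–50% expected)
- The gap to mimalloc narrows substantially (first to 2–3x → within 1.5x with continued improvement)

Latest A/B snapshot (this environment, RandomMixed 256B)
- HEADER_CLASSIDX=1 (current): average ≈ 8.16M ops/s, L1D miss ≈ 3.79%
- HEADER_CLASSIDX=0 (all OFF): average ≈ 9.12M ops/s, L1D miss ≈ 3.74%
- Delta: an improvement of around +11.7% (the alignment effect is small to moderate; further tuning continues)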