Optimize C6 heavy and C7 ultra performance analysis with design refinements

- Update environment profile presets and visibility analysis
- Enhance small object and tiny segment v4 box implementations
- Refine C7 ultra and C6 heavy allocation strategies
- Add comprehensive performance metrics and design documentation

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 22:57:26 +09:00
parent 9460785bd6
commit 2a13478dc7
25 changed files with 718 additions and 41 deletions
--- a/docs/analysis/C6_HEAVY_VISIBILITY_ANALYSIS_PHASE_C6H.md
+++ b/docs/analysis/C6_HEAVY_VISIBILITY_ANALYSIS_PHASE_C6H.md
@ -0,0 +1,414 @@
+# C6-Heavy (257-768B) Visibility Analysis - Phase C6-H
+
+**Date**: 2025-12-10
+**Benchmark**: `./bench_mid_large_mt_hakmem 1 1000000 400 1` (1 thread, ws=400, iters=1M)
+**Size Range**: 257-768B (Class 6: 512B allocations)
+**Configuration**: C6_HEAVY_LEGACY_POOLV1 profile (C7_SAFE + C6_HOT=1)
+
+---
+
+## Executive Summary
+
+### Performance Gap Analysis
+- **HAKMEM**: 9.84M ops/s (baseline)
+- **mimalloc**: 51.3M ops/s
+- **Performance Gap**: **5.2x** (mimalloc is 421% faster)
+
+This represents a **critical performance deficit** in the C6-heavy allocation path, where HAKMEM achieves only **19% of mimalloc's throughput**.
+
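The gap figures above follow directly from the two throughput numbers. A minimal sketch of the arithmetic, using the rounded values quoted in this summary (the variable names are illustrative, not part of HAKMEM):

```python
# Rounded throughput figures from the Executive Summary (M ops/s).
hakmem_mops = 9.84
mimalloc_mops = 51.3

gap = mimalloc_mops / hakmem_mops            # how many times faster mimalloc is
pct_faster = (gap - 1.0) * 100.0             # the same ratio as "X% faster"
share = hakmem_mops / mimalloc_mops * 100.0  # HAKMEM throughput as % of mimalloc

print(f"gap={gap:.1f}x  faster={pct_faster:.0f}%  share={share:.0f}%")
```

Note that "5.2x" and "421% faster" are two statements of the same ratio, since 421% faster means 1 + 4.21 times the throughput.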
+### Key Findings
+1. **C6 does NOT use Pool flatten path** - With `HAKMEM_TINY_C6_HOT=1`, allocations route through TinyHeap v1, bypassing pool flatten entirely
+2. **Address lookup dominates CPU time** - `hak_super_lookup` (9.3%) + `mid_desc_lookup` (8.2%) + `classify_ptr` (5.8%) = **23.3% of cycles**
+3. **Pool operations are expensive** - Despite not using flatten, pool alloc/free combined still consume ~15% of cycles (14.8% in the hotspot category breakdown)
+4. **Mid_desc cache provides modest gains** - +6.4% improvement (9.8M → 10.4M ops/s)
+
+---
+
+## Phase C6-H1: Baseline Metrics
+
+### Test Configuration
+```bash
+export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
+export HAKMEM_BENCH_MIN_SIZE=257
+export HAKMEM_BENCH_MAX_SIZE=768
+```
+
+### Baseline Results
+
+| Configuration | Throughput (ops/s) | vs mimalloc | Notes |
+|---------------|-------------------|-------------|-------|
+| **Baseline (C6_HOT=1, mid_desc_cache=1)** | 9,836,420 | 19.2% | Default profile |
+| **C6_HOT=1, mid_desc_cache=0** | 9,805,954 | 19.1% | Without cache |
+| **C6_HOT=1, mid_desc_cache=1** | 10,435,480 | 20.3% | With cache (+6.4%) |
+| **C6_HOT=0 (pure legacy pool)** | 9,938,473 | 19.4% | Pool path ~same as TinyHeap |
+| **mimalloc baseline** | 51,297,877 | 100.0% | Reference |
+
+### Key Observations
+1. **Mid_desc cache effect**: +6.4% improvement, but far from closing the gap
+2. **C6_HOT vs pool path**: Nearly identical performance (~9.8M-9.9M ops/s), suggesting the bottleneck is in common infrastructure (address lookup, classification)
+3. **Size class routing**: 257-768B → Class 6 (512B) as expected
+
+---
+
+## Phase C6-H2: Pool Flatten and Cache Analysis
+
+### Pool Flatten Test (ATTEMPTED)
+
+**Finding**: Pool v1 flatten path is **NOT USED** for C6 allocations with `HAKMEM_TINY_C6_HOT=1`.
+
+```bash
+# Test with flatten enabled
+export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
+export HAKMEM_POOL_V1_FLATTEN_STATS=1
+# Result: [POOL_V1_FLAT] alloc_tls_hit=0 alloc_fb=0 free_tls_hit=0 free_fb=0
+```
+
+**Root Cause**:
+- With `HAKMEM_TINY_C6_HOT=1`, class 6 routes to `TINY_ROUTE_HEAP` (TinyHeap v1)
+- TinyHeap v1 uses its own allocation path via `tiny_heap_box.h`, not the pool flatten path
+- Pool flatten optimizations (Phase 80-82) only apply to **legacy pool path** (when C6_HOT=0)
+
+### Mid_Desc Cache Analysis
+
+| Metric | Without Cache | With Cache | Delta |
+|--------|--------------|------------|-------|
+| Throughput | 9.81M ops/s | 10.44M ops/s | +6.4% |
+| Expected self% of `mid_desc_lookup` | 8.2% | ~6-7% (estimated) | ~1-2 percentage points lower (estimated) |
+
+**Conclusion**: Mid_desc cache provides measurable but insufficient improvement. The 8.2% CPU time in `mid_desc_lookup` is reduced, but other lookup costs (hak_super_lookup, classify_ptr) remain.
+
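The +6.4% figure and the gap that remains after enabling the cache can be rechecked from the raw throughput numbers in the baseline table. This is only a sketch of the arithmetic; `pct_delta` is a hypothetical helper name:

```python
def pct_delta(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

cache_off = 9_805_954    # ops/s with mid_desc_cache=0
cache_on = 10_435_480    # ops/s with mid_desc_cache=1
mimalloc = 51_297_877    # ops/s, reference allocator

gain = pct_delta(cache_off, cache_on)
remaining_gap = mimalloc / cache_on  # mimalloc is still this many times faster

print(f"cache gain: {gain:+.1f}%  remaining gap: {remaining_gap:.1f}x")
```

Even with the cache enabled, mimalloc remains close to 5x faster, which is why the cache alone does not change the conclusion.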
+---
+
+## Phase C6-H3: CPU Hotspot Analysis
+
+### Perf Stat Results
+
+```
+Benchmark: 9,911,926 ops/s (0.101s runtime)
+Cycles:      398,766,361 cycles:u
+Instructions: 1,054,643,524 instructions:u
+IPC:         2.64
+Page Faults: 7,131
+Task Clock:  119.08 ms
+```
+
+**Analysis**:
+- **IPC 2.64**: The core retires instructions at a healthy rate, so the cost lies mainly in how many instructions each operation executes rather than in stalls
+- **Cycles per operation**: 398,766,361 / 1,000,000 ≈ **399 cycles/op**
+- **Instructions per operation**: 1,054,643,524 / 1,000,000 ≈ **1,055 instructions/op**
+
+**Comparison estimate** (mimalloc at 51.3M ops/s):
+- Estimated cycles/op for mimalloc: ~76 cycles/op (5.2x faster)
+- HAKMEM uses **5.2x more cycles** per allocation/free pair
+
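The per-operation figures can be derived from the `perf stat` counters as follows. This is a rough sketch: it attributes all counted cycles to the 1M benchmark iterations (process startup is included in the counters), and the mimalloc estimate assumes the same clock frequency, which this report does not verify.

```python
cycles = 398_766_361
instructions = 1_054_643_524
ops = 1_000_000  # iteration count passed to the benchmark

ipc = instructions / cycles
cycles_per_op = cycles / ops
instr_per_op = instructions / ops

# Assumption: cycles per op scale inversely with throughput (5.2x gap).
mimalloc_cycles_per_op = cycles_per_op / 5.2

print(f"IPC={ipc:.2f}  cycles/op={cycles_per_op:.0f}  instr/op={instr_per_op:.0f}")
print(f"estimated mimalloc cycles/op: {mimalloc_cycles_per_op:.0f}")
```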
+### Perf Record Hotspots (Top 20 Functions)
+
+| Function | Self % | Category | Description |
+|----------|--------|----------|-------------|
+| `hak_super_lookup` | 9.32% | Address Lookup | Superslab registry lookup (largest single cost) |
+| `mid_desc_lookup` | 8.23% | Address Lookup | Mid-size descriptor lookup |
+| `hak_pool_get_class_index` | 5.87% | Classification | Size→class mapping |
+| `classify_ptr` | 5.76% | Classification | Pointer classification for free |
+| `hak_pool_free_v1_impl` | 5.52% | Pool Free | Pool free implementation |
+| `hak_pool_try_alloc_v1_impl` | 5.46% | Pool Alloc | Pool allocation implementation |
+| `free` | 4.54% | Front Gate | glibc free wrapper |
+| `worker_run` | 4.47% | Benchmark | Benchmark driver |
+| `ss_map_lookup` | 4.35% | Address Lookup | Superslab map lookup |
+| `super_reg_effective_mask` | 4.32% | Address Lookup | Registry mask computation |
+| `mid_desc_hash` | 3.69% | Address Lookup | Hash computation for mid_desc |
+| `mid_set_header` | 3.27% | Metadata | Header initialization |
+| `mid_page_inuse_dec_and_maybe_dn` | 3.17% | Metadata | Page occupancy tracking |
+| `mid_desc_init_once` | 2.71% | Initialization | Descriptor initialization |
+| `malloc` | 2.60% | Front Gate | glibc malloc wrapper |
+| `hak_free_at` | 2.53% | Front Gate | Internal free dispatcher |
+| `hak_pool_mid_lookup_v1_impl` | 2.17% | Pool Lookup | Pool-specific descriptor lookup |
+| `super_reg_effective_size` | 1.87% | Address Lookup | Registry size computation |
+| `hak_pool_free_fast_v1_impl` | 1.77% | Pool Free | Fast path for pool free |
+| `hak_pool_init` | 1.44% | Initialization | Pool initialization |
+
+### Hotspot Category Breakdown
+
+| Category | Combined Self % | Functions |
+|----------|----------------|-----------|
+| **Address Lookup & Classification** | **41.5%** | hak_super_lookup, mid_desc_lookup, hak_pool_get_class_index, classify_ptr, ss_map_lookup, super_reg_effective_mask, mid_desc_hash (adding super_reg_effective_size and hak_pool_mid_lookup_v1_impl would bring the total to 45.6%) |
+| **Pool Operations** | **14.8%** | hak_pool_try_alloc_v1_impl, hak_pool_free_v1_impl, hak_pool_free_fast_v1_impl (these three alone sum to 12.8%) |
+| **Metadata Management** | **9.2%** | mid_set_header, mid_page_inuse_dec_and_maybe_dn, mid_desc_init_once |
+| **Front Gate** | **9.7%** | malloc, free, hak_free_at |
+| **Benchmark Driver** | **4.5%** | worker_run |
+| **Other** | **20.3%** | Various helpers, initialization, etc. |
+
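Category totals like these can be rechecked by summing the Self % column of the hotspot table. The sketch below does that for four categories; the grouping of functions is an assumption made for illustration, not an official mapping. With this grouping, the seven largest lookup and classification functions account for the 41.5% figure, while the three pool functions listed sum to 12.75%.

```python
# Self % values copied from the perf hotspot table in this report.
SELF_PCT = {
    "hak_super_lookup": 9.32,
    "mid_desc_lookup": 8.23,
    "hak_pool_get_class_index": 5.87,
    "classify_ptr": 5.76,
    "hak_pool_free_v1_impl": 5.52,
    "hak_pool_try_alloc_v1_impl": 5.46,
    "free": 4.54,
    "ss_map_lookup": 4.35,
    "super_reg_effective_mask": 4.32,
    "mid_desc_hash": 3.69,
    "mid_set_header": 3.27,
    "mid_page_inuse_dec_and_maybe_dn": 3.17,
    "mid_desc_init_once": 2.71,
    "malloc": 2.60,
    "hak_free_at": 2.53,
    "hak_pool_free_fast_v1_impl": 1.77,
}

# Assumed grouping, chosen for this check only.
CATEGORIES = {
    "address_lookup": [
        "hak_super_lookup", "mid_desc_lookup", "hak_pool_get_class_index",
        "classify_ptr", "ss_map_lookup", "super_reg_effective_mask",
        "mid_desc_hash",
    ],
    "pool_ops": [
        "hak_pool_try_alloc_v1_impl", "hak_pool_free_v1_impl",
        "hak_pool_free_fast_v1_impl",
    ],
    "metadata": [
        "mid_set_header", "mid_page_inuse_dec_and_maybe_dn",
        "mid_desc_init_once",
    ],
    "front_gate": ["malloc", "free", "hak_free_at"],
}

def category_totals() -> dict:
    """Sum Self % per category, rounded to two decimals."""
    return {
        name: round(sum(SELF_PCT[f] for f in funcs), 2)
        for name, funcs in CATEGORIES.items()
    }

for name, total in category_totals().items():
    print(f"{name:15s} {total:5.2f}%")
```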
+---
+
+## Root Cause Analysis
+
+### 1. Address Lookup Dominates (41.5% of CPU)
+
+The single largest performance killer is **address→metadata lookup infrastructure**:
+
+- **hak_super_lookup** (9.3%): Superslab registry lookup to find which allocator owns a pointer
+- **mid_desc_lookup** (8.2%): Hash-based descriptor lookup for mid-size allocations
+- **ss_map_lookup** (4.3%): Secondary map lookup within superslab
+- **classify_ptr** (5.8%): Pointer classification during free
+- **hak_pool_get_class_index** (5.9%): Size→class index computation
+
+**Why this matters**: Every allocation AND free requires multiple lookups:
+- Alloc: size → class_idx → descriptor → block
+- Free: ptr → superslab → descriptor → classification → free handler
+
+**Comparison to mimalloc**: mimalloc likely uses:
+- Thread-local caching with minimal lookup
+- Direct pointer arithmetic from block headers
+- Segment-based organization reducing lookup depth
+
+### 2. Pool Operations Still Expensive (14.8%)
+
+Despite C6 routing through TinyHeap (not pool flatten), pool operations still consume significant cycles:
+- `hak_pool_try_alloc_v1_impl` (5.5%)
+- `hak_pool_free_v1_impl` (5.5%)
+
+**Why**: TinyHeap v1 likely calls into pool infrastructure for:
+- Page allocation from mid/smallmid pool
+- Descriptor management
+- Cross-thread handling
+
+### 3. Metadata Overhead (9.2%)
+
+Mid-size allocations carry significant metadata overhead:
+- Header initialization: `mid_set_header` (3.3%)
+- Occupancy tracking: `mid_page_inuse_dec_and_maybe_dn` (3.2%)
+- Descriptor init: `mid_desc_init_once` (2.7%)
+
+### 4. Front Gate Overhead (9.7%)
+
+The malloc/free wrappers add non-trivial cost:
+- Route determination
+- Cross-allocator checks (jemalloc, system)
+- Lock depth checks
+- Initialization checks
+
+---
+
+## Recommendations for Next Phase
+
+### Priority 1: Address Lookup Reduction (Highest Impact)
+**Target**: 41.5% → 20-25% of cycles
+
+**Strategies**:
+1. **TLS Descriptor Cache**: Extend mid_desc_cache to cache full allocation context (class_idx + descriptor + page_info)
+2. **Fast Path Header**: Embed class_idx in allocation header for instant classification on free (similar to tiny allocations)
+3. **Segment-Based Addressing**: Consider segment-style addressing (like mimalloc) where ptr→metadata is direct pointer arithmetic
+4. **Superslab Lookup Bypass**: For C6-heavy workloads, skip superslab lookup when we know it's mid-size
+
+**Expected Gain**: 10-15M ops/s (+100-150%)
+
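To illustrate the "Fast Path Header" strategy, the toy model below stores the class index in a small header in front of each block, so that free-side classification becomes a single load at a fixed offset instead of a registry or hash lookup. This is a conceptual sketch in Python, not HAKMEM code: the `Arena` class, the 8-byte header size, and the bump allocation are all assumptions made for illustration.

```python
HEADER_SIZE = 8  # hypothetical header size in bytes

class Arena:
    """Toy heap backed by a bytearray with bump allocation (illustration only)."""

    def __init__(self, size: int) -> None:
        self.mem = bytearray(size)
        self.top = 0

    def alloc(self, payload_size: int, class_idx: int) -> int:
        base = self.top
        self.mem[base] = class_idx                 # class index lives in the header
        self.top = base + HEADER_SIZE + payload_size
        return base + HEADER_SIZE                  # user "pointer" (an offset here)

    def class_of(self, ptr: int) -> int:
        # Free-side classification: one load at a fixed negative offset.
        return self.mem[ptr - HEADER_SIZE]

arena = Arena(4096)
p = arena.alloc(512, class_idx=6)
q = arena.alloc(1024, class_idx=7)
print(arena.class_of(p), arena.class_of(q))
```

The trade-off is that every block pays for the header in space and in a write at allocation time, which is the cost the metadata section of this report already tracks under `mid_set_header`.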
+### Priority 2: Pool Path Streamlining (Medium Impact)
+**Target**: 14.8% → 8-10% of cycles
+
+**Strategies**:
+1. **Dedicated C6 Fast Path**: Create a specialized alloc/free path for class 6 that skips pool generality
+2. **TLS Block Cache**: Implement TLS-local block cache for C6 (bypass pool ring buffer overhead)
+3. **Inline Critical Helpers**: Force-inline `hak_pool_get_class_index` and other hot helpers
+
+**Expected Gain**: 3-5M ops/s (+30-50%)
+
+### Priority 3: Metadata Streamlining (Lower Impact)
+**Target**: 9.2% → 5-6% of cycles
+
+**Strategies**:
+1. **Lazy Header Init**: Only initialize headers when necessary (debug mode, cross-thread)
+2. **Batch Occupancy Updates**: Combine multiple inuse_dec calls
+3. **Cached Descriptors**: Reduce descriptor initialization overhead
+
+**Expected Gain**: 1-2M ops/s (+10-20%)
+
+### Priority 4: Front Gate Thinning (Lower Impact)
+**Target**: 9.7% → 6-7% of cycles
+
+**Strategies**:
+1. **Size-Based Fast Path**: For mid-size range (257-768B), skip most gate checks
+2. **Compile-Time Routing**: When jemalloc/system allocators are not used, eliminate checks
+
+**Expected Gain**: 1-2M ops/s (+10-20%)
+
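An Amdahl-style check helps calibrate the cycle targets above. The sketch below assumes that the Self % figures translate directly into removable cycles and that all other costs stay unchanged; both are simplifications, and `speedup_if_share_becomes` is a hypothetical helper.

```python
def speedup_if_share_becomes(old_share: float, new_share: float) -> float:
    """Overall speedup when one component's share of total cycles drops
    from old_share to new_share while all other cycles stay the same."""
    rest = 1.0 - old_share                 # cycles outside the component (old total = 1)
    new_total = rest / (1.0 - new_share)   # `rest` is (1 - new_share) of the new total
    return 1.0 / new_total

baseline_mops = 9.8  # current C6-heavy throughput, M ops/s
for target in (0.25, 0.20):
    s = speedup_if_share_becomes(0.415, target)
    print(f"lookup 41.5% -> {target:.0%}: {s:.2f}x, ~{baseline_mops * s:.1f}M ops/s")
```

Under these assumptions, hitting the Priority 1 cycle target alone gives roughly a 1.3x to 1.4x speedup, so the larger throughput gains listed above would also require the remaining categories to shrink.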
+---
+
+## Comparison to Historical Baselines
+
+| Phase | Configuration | Throughput | vs Current | Notes |
+|-------|--------------|------------|------------|-------|
+| **Phase 54** | C7_SAFE, mixed 16-1024B | 28.1M ops/s | 2.9x | Mixed workload |
+| **Phase 80** | C6-heavy, flatten OFF | 23.1M ops/s | 2.4x | Legacy baseline |
+| **Phase 81** | C6-heavy, flatten ON | 25.9M ops/s | 2.6x | +10% from flatten |
+| **Phase 82** | C6-heavy, flatten ON | 26.7M ops/s | 2.7x | +13% from flatten |
+| **Current (C6-H)** | C6-heavy, C6_HOT=1 | 9.8M ops/s | 1.0x | **REGRESSION** |
+
+**CRITICAL FINDING**: Current baseline (9.8M ops/s) is **2.4-2.7x SLOWER** than historical C6-heavy baselines (23-27M ops/s).
+
+**Possible Causes**:
+1. **Configuration difference**: Historical tests may have used different profile (LEGACY vs C7_SAFE)
+2. **Routing change**: C6_HOT=1 may be forcing a slower path through TinyHeap
+3. **Build/compiler difference**: Flags or LTO settings may have changed
+4. **Benchmark variance**: Different workload characteristics
+
+**Action Required**: Replicate historical Phase 80-82 configurations exactly to identify regression point.
+
+---
+
+## Verification of Historical Configuration
+
+The exact configuration used in Phase 80-82 is compared against the current one below.
+
+**Phase 80-82 Configuration** (from CURRENT_TASK.md):
+```bash
+HAKMEM_BENCH_MIN_SIZE=257
+HAKMEM_BENCH_MAX_SIZE=768
+HAKMEM_TINY_HEAP_PROFILE=LEGACY  # ← Different!
+HAKMEM_TINY_HOTHEAP_V2=0
+HAKMEM_POOL_V2_ENABLED=0
+HAKMEM_POOL_V1_FLATTEN_ENABLED=1
+HAKMEM_POOL_V1_FLATTEN_STATS=1
+```
+
+**Current Configuration**:
+```bash
+HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1  # Sets TINY_HEAP_PROFILE=C7_SAFE
+HAKMEM_TINY_C6_HOT=1  # ← Adds TinyHeap routing
+HAKMEM_POOL_V1_FLATTEN_ENABLED=0  # ← Flatten OFF by default
+```
+
+**Key Difference**: Historical tests used `TINY_HEAP_PROFILE=LEGACY`, which likely routes C6 through pure pool path (no TinyHeap). Current `C6_HEAVY_LEGACY_POOLV1` profile sets `TINY_HEAP_PROFILE=C7_SAFE` + `TINY_C6_HOT=1`, routing C6 through TinyHeap.
+
+---
+
+## Action Items for Phase C6-H+1
+
+1. **Replicate Historical Baseline** (URGENT)
+   ```bash
+   export HAKMEM_BENCH_MIN_SIZE=257
+   export HAKMEM_BENCH_MAX_SIZE=768
+   export HAKMEM_TINY_HEAP_PROFILE=LEGACY
+   export HAKMEM_TINY_HOTHEAP_V2=0
+   export HAKMEM_POOL_V2_ENABLED=0
+   export HAKMEM_POOL_V1_FLATTEN_ENABLED=0
+   # Expected: ~23M ops/s
+   ```
+
+2. **Test Flatten ON with Historical Config**
+   ```bash
+   # Same as above, but:
+   export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
+   export HAKMEM_POOL_V1_FLATTEN_STATS=1
+   # Expected: ~26M ops/s with active flatten stats
+   ```
+
+3. **Profile Comparison Matrix**
+   - LEGACY vs C7_SAFE profile
+   - C6_HOT=0 vs C6_HOT=1
+   - Flatten OFF vs ON
+   - Identify which combination yields best performance
+
+4. **Address Lookup Prototype**
+   - Implement TLS allocation context cache (class_idx + descriptor + page)
+   - Measure impact on lookup overhead (target: 41.5% → 25%)
+
+5. **Update ENV_PROFILE_PRESETS.md**
+   - Clarify that `C6_HEAVY_LEGACY_POOLV1` uses C7_SAFE profile (not pure LEGACY)
+   - Add note about C6_HOT routing implications
+   - Document performance differences between profile choices
+
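The profile comparison matrix in item 3 has three binary axes, so it expands to eight runs. A small generator such as the following can print the command lines; it is a sketch that only uses environment variables and the benchmark command already named in this report, and the helper names are hypothetical.

```python
from itertools import product

# Axes of the comparison matrix; values are taken from this report.
AXES = {
    "HAKMEM_TINY_HEAP_PROFILE": ["LEGACY", "C7_SAFE"],
    "HAKMEM_TINY_C6_HOT": ["0", "1"],
    "HAKMEM_POOL_V1_FLATTEN_ENABLED": ["0", "1"],
}
FIXED = {"HAKMEM_BENCH_MIN_SIZE": "257", "HAKMEM_BENCH_MAX_SIZE": "768"}
BENCH = "./bench_mid_large_mt_hakmem 1 1000000 400 1"

def matrix():
    """Yield one environment dict per combination of axis values."""
    keys = list(AXES)
    for values in product(*(AXES[k] for k in keys)):
        yield {**FIXED, **dict(zip(keys, values))}

def command_line(env: dict) -> str:
    """Render the environment as a shell prefix followed by the benchmark."""
    prefix = " ".join(f"{k}={v}" for k, v in env.items())
    return f"{prefix} {BENCH}"

for env in matrix():
    print(command_line(env))
```

Some combinations may be collapsed by the allocator itself (for example, if flatten is forced off outside the LEGACY profile), so the flatten stats should be checked before comparing numbers.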
+---
+
+## Success Criteria for Phase C6-H+1
+
+- **Reproduce historical baseline**: Achieve 23-27M ops/s with LEGACY profile
+- **Understand routing impact**: Quantify C6_HOT=0 vs C6_HOT=1 difference
+- **Identify optimization path**: Choose between:
+  - Optimizing TinyHeap C6 path (if C6_HOT=1 is strategic)
+  - Optimizing pool flatten path (if LEGACY/C6_HOT=0 is preferred)
+  - Hybrid approach with runtime selection
+
+**Target**: Reach about **30M ops/s**, closing roughly half of the current gap to the 51.3M ops/s mimalloc baseline, by the end of the next phase.
+
+---
+
+## Appendix A: Full Perf Report Output
+
+```
+# Samples: 656  of event 'cycles:u'
+# Event count (approx.): 409,174,521
+#
+# Overhead  Symbol
+# ........  .....................................
+     9.32%  [.] hak_super_lookup
+     8.23%  [.] mid_desc_lookup
+     5.87%  [.] hak_pool_get_class_index
+     5.76%  [.] classify_ptr
+     5.52%  [.] hak_pool_free_v1_impl
+     5.46%  [.] hak_pool_try_alloc_v1_impl
+     4.54%  [.] free
+     4.47%  [.] worker_run
+     4.35%  [.] ss_map_lookup
+     4.32%  [.] super_reg_effective_mask
+     3.69%  [.] mid_desc_hash
+     3.27%  [.] mid_set_header
+     3.17%  [.] mid_page_inuse_dec_and_maybe_dn
+     2.71%  [.] mid_desc_init_once
+     2.60%  [.] malloc
+     2.53%  [.] hak_free_at
+     2.17%  [.] hak_pool_mid_lookup_v1_impl
+     1.87%  [.] super_reg_effective_size
+     1.77%  [.] hak_pool_free_fast_v1_impl
+     1.64%  [k] 0xffffffffae200ba0 (kernel)
+     1.44%  [.] hak_pool_init
+     1.42%  [.] hak_pool_is_poolable
+     1.21%  [.] should_sample
+     1.12%  [.] hak_pool_free
+     1.11%  [.] hak_super_hash
+     1.09%  [.] hak_pool_try_alloc
+     0.95%  [.] mid_desc_lookup_cached
+     0.93%  [.] hak_pool_v1_flatten_enabled
+     0.76%  [.] hak_pool_v2_route
+     0.57%  [.] ss_map_hash
+     0.55%  [.] hak_in_wrapper
+```
+
+---
+
+## Appendix B: Test Commands Summary
+
+```bash
+# Baseline
+export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
+export HAKMEM_BENCH_MIN_SIZE=257
+export HAKMEM_BENCH_MAX_SIZE=768
+./bench_mid_large_mt_hakmem 1 1000000 400 1
+# Result: 9,836,420 ops/s
+
+# Mimalloc comparison
+./bench_mid_large_mt_mi 1 1000000 400 1
+# Result: 51,297,877 ops/s (5.2x faster)
+
+# Mid_desc cache OFF
+export HAKMEM_MID_DESC_CACHE_ENABLED=0
+./bench_mid_large_mt_hakmem 1 1000000 400 1
+# Result: 9,805,954 ops/s
+
+# Mid_desc cache ON
+export HAKMEM_MID_DESC_CACHE_ENABLED=1
+./bench_mid_large_mt_hakmem 1 1000000 400 1
+# Result: 10,435,480 ops/s (+6.4%)
+
+# Perf stat
+perf stat -e cycles:u,instructions:u,task-clock,page-faults:u \
+  ./bench_mid_large_mt_hakmem 1 1000000 400 1
+# Result: 398M cycles, 1.05B instructions, IPC=2.64
+
+# Perf record
+perf record -F 5000 --call-graph dwarf -e cycles:u \
+  -o perf.data.c6_flat ./bench_mid_large_mt_hakmem 1 1000000 400 1
+perf report -i perf.data.c6_flat --stdio --no-children
+```
+
+---
+
+**End of Report**
--- a/docs/analysis/ENV_PROFILE_PRESETS.md
+++ b/docs/analysis/ENV_PROFILE_PRESETS.md
@ -10,7 +10,9 @@
 ### Purpose
 - For the standard Mixed 16–1024B benchmark.
 - C7-only SmallObject v3 + Tiny front v3 + LUT + fast classify ON.
- Tiny/Pool v2 are all OFF.
+- The v4 family (C6/C7 v4, fast classify v4, small segment v4) is all OFF.
+- Tiny/Pool v2 are also all OFF.
+- C6 is frozen (no special handling in Tiny/SmallObject); it is left to the normal mid/pool path.

 ### Minimal ENV set (Release)
 ```sh
@ -21,6 +23,19 @@ HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
 HAKMEM_BENCH_MIN_SIZE=16
 HAKMEM_BENCH_MAX_SIZE=1024
 ```
+Main ENV variables set automatically by the preset:
+- `HAKMEM_TINY_HEAP_PROFILE=C7_SAFE`
+- `HAKMEM_TINY_C7_HOT=1`
+- `HAKMEM_TINY_HOTHEAP_V2=0`
+- `HAKMEM_SMALL_HEAP_V3_ENABLED=1`
+- `HAKMEM_SMALL_HEAP_V3_CLASSES=0x80` (C7-only v3)
+- `HAKMEM_SMALL_HEAP_V4_ENABLED=0` / `HAKMEM_SMALL_HEAP_V4_CLASSES=0x0`
+- `HAKMEM_TINY_PTR_FAST_CLASSIFY_ENABLED=1`
+- `HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED=0`
+- `HAKMEM_SMALL_SEGMENT_V4_ENABLED=0`
+- `HAKMEM_POOL_V2_ENABLED=0`
+- `HAKMEM_TINY_FRONT_V3_ENABLED=1`
+- `HAKMEM_TINY_FRONT_V3_LUT_ENABLED=1`
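
The `*_CLASSES` values are bitmasks. Judging from the values used across these documents (`0x80` for C7, `0x40` for C6, `0x20` for C5, `0xC0` for C7+C6), bit n selects class n. A small decoder, written as a sketch under that assumption (`classes_in_mask` is a hypothetical helper, not part of HAKMEM):

```python
def classes_in_mask(mask: int) -> list[str]:
    """Return the size classes selected by a *_CLASSES bitmask (bit n = class n)."""
    return [f"C{n}" for n in range(mask.bit_length()) if mask & (1 << n)]

for value in (0x80, 0x40, 0xC0, 0x20, 0x0):
    print(f"{value:#04x} -> {classes_in_mask(value)}")
```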

 ### Optional settings
 - To view stats:
@ -39,6 +54,10 @@ HAKMEM_SS_MADVISE_STRICT=0
 HAKMEM_FREE_POLICY=batch
 HAKMEM_THP=auto
 ```
+- Reference (current state of the v4 research boxes):
+  - C7/C6 v4 + fast classify v4 ON (v3 OFF, segment OFF): **≈32.0–32.5M ops/s** (MIXED 1M/ws=400, Release).
+  - C7-only v4 (C6 v1, v3 OFF): **≈33.0M ops/s**.
+  - The v3 configuration is currently the fastest, so the standard profile pins the whole v4 family to OFF.

 ---

@ -46,14 +65,14 @@ HAKMEM_THP=auto

 ### Purpose
 - For C6-heavy mid/smallmid benchmarks.
- C6 is pinned to v1 (C6 v3 OFF), Pool v2 OFF. Pool v1 flatten is opt-in for benchmarks.
+- C6 is pinned to v1 (C6 v3/v4/ULTRA are research boxes only). Pool v2 OFF. Pool v1 flatten is opt-in for benchmarks.

 ### ENV (v1 baseline)
 ```sh
 HAKMEM_BENCH_MIN_SIZE=257
 HAKMEM_BENCH_MAX_SIZE=768
 HAKMEM_TINY_HEAP_PROFILE=C7_SAFE
-HAKMEM_TINY_C6_HOT=1
+HAKMEM_TINY_C6_HOT=0
 HAKMEM_TINY_HOTHEAP_V2=0
 HAKMEM_SMALL_HEAP_V3_ENABLED=1
 HAKMEM_SMALL_HEAP_V3_CLASSES=0x80   # C7-only v3, C6 v3 is OFF
@ -69,6 +88,7 @@ HAKMEM_TINY_HEAP_PROFILE=LEGACY
 HAKMEM_POOL_V2_ENABLED=0
 HAKMEM_POOL_V1_FLATTEN_ENABLED=1
 HAKMEM_POOL_V1_FLATTEN_STATS=1
+```

 ## Profile 2b: C6_HEAVY_LEGACY_POOLV1_FLATTEN (mid/smallmid LEGACY flatten benchmark only)

@ -84,9 +104,35 @@ HAKMEM_POOL_ZERO_MODE=header
 HAKMEM_POOL_V1_FLATTEN_STATS=1
 ```
 Note: LEGACY only. Do not use this preset with C7_SAFE / C7_ULTRA_BENCH.
-```
 - flatten is LEGACY-only. It is assumed to be forced OFF in code under C7_SAFE / C7_ULTRA_BENCH.

+### C6 research presets (must not affect the standard line)
+
+- C6 v3 research (only when placing C6 on Tiny/SmallObject)
+```sh
+HAKMEM_PROFILE=C6_SMALL_HEAP_V3_EXPERIMENT
+HAKMEM_BENCH_MIN_SIZE=257
+HAKMEM_BENCH_MAX_SIZE=768
+# bench_profile auto-injects the following (existing ENV values are not overwritten):
+# HAKMEM_TINY_C6_HOT=1
+# HAKMEM_SMALL_HEAP_V3_ENABLED=1
+# HAKMEM_SMALL_HEAP_V3_CLASSES=0x40   # C6 only v3
+```
+
+- C6 v4 research (only when placing C6 on v4)
+```sh
+HAKMEM_PROFILE=C6_SMALL_HEAP_V4_EXPERIMENT
+HAKMEM_BENCH_MIN_SIZE=257
+HAKMEM_BENCH_MAX_SIZE=768
+# bench_profile auto-injects the following (existing ENV values are not overwritten):
+# HAKMEM_TINY_C6_HOT=1
+# HAKMEM_SMALL_HEAP_V3_ENABLED=0
+# HAKMEM_SMALL_HEAP_V4_ENABLED=1
+# HAKMEM_SMALL_HEAP_V4_CLASSES=0x40   # C6 only v4
+```
+
+Note: Both are "research boxes". Do not use them for standard Mixed/C6-heavy evaluation; opt in explicitly only when regressions or segfaults are acceptable.
+
 ---

 ## Profile 3: DEBUG_TINY_FRONT_PERF (DEBUG profile for perf)
--- a/docs/analysis/PF_STATUS_V4_202502.md
+++ b/docs/analysis/PF_STATUS_V4_202502.md
@ -1,3 +1,23 @@
+# PF/OS baseline
+
+# BASELINE-LOCK (Mixed 16–1024B v3 vs v4, Release)
+- Common command settings (ws=400, iters=1M):
+  ```
+  HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
+  HAKMEM_BENCH_MIN_SIZE=16
+  HAKMEM_BENCH_MAX_SIZE=1024
+  ```
+- v3 primary configuration (C7-only v3, v4/segment all OFF, fast classify v3 ON):
+  - `HAKMEM_SMALL_HEAP_V3_ENABLED=1 HAKMEM_SMALL_HEAP_V3_CLASSES=0x80 HAKMEM_SMALL_HEAP_V4_ENABLED=0 HAKMEM_SMALL_HEAP_V4_CLASSES=0 HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED=0 HAKMEM_SMALL_SEGMENT_V4_ENABLED=0`
+  - Throughput: **33.7–33.9M ops/s** (2 runs, no segv/assert)
+- v4 forced (C7+C6 v4 + fast classify v4, v3 OFF, segment OFF):
+  - `HAKMEM_SMALL_HEAP_V3_ENABLED=0 HAKMEM_SMALL_HEAP_V3_CLASSES=0 HAKMEM_SMALL_HEAP_V4_ENABLED=1 HAKMEM_SMALL_HEAP_V4_CLASSES=0xC0 HAKMEM_TINY_PTR_FAST_CLASSIFY_V4_ENABLED=1`
+  - Throughput: **32.0–32.5M ops/s**
+- C7-only v4 (C6 v1, v3 OFF, fast classify v4 ON):
+  - `HAKMEM_SMALL_HEAP_V4_CLASSES=0x80 HAKMEM_SMALL_HEAP_V3_ENABLED=0`
+  - Throughput: **≈33.0M ops/s**
+- Decision: the primary configuration for the current Mixed workload is the v3 configuration above. The v4 family remains opt-in as a research box.
+
 # PF/OS baseline (PF2, small-object v4 state)

 - Command (Release, v4: C7+C6 forced to v4, v3 OFF):
@ -20,6 +40,29 @@
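  - pf/OS reference values with v4 (C7+C6) forced. Slower than the v3 reference (~40M), but the pf numbers and OS stats are fixed as the PF2 starting point.
  - In future A/B runs that wire in SmallSegmentBox_v4, the metric is how far page-faults/SS_OS_STATS can be lowered from these values.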

+## PF3: smallsegment_v4 gate A/B (C7+C6 v4 forced)
+
+- Command (Release, v4: C7+C6, v3 OFF):
+  ```
+  export HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE
+  export HAKMEM_BENCH_MIN_SIZE=16
+  export HAKMEM_BENCH_MAX_SIZE=1024
+  export HAKMEM_SMALL_HEAP_V4_ENABLED=1
+  export HAKMEM_SMALL_HEAP_V4_CLASSES=0xC0
+  export HAKMEM_SMALL_HEAP_V3_ENABLED=0
+  export HAKMEM_SMALL_HEAP_V3_CLASSES=0
+  # A: segment gate OFF (env sets the variable for the benchmark process only)
+  perf stat -e cycles,instructions,task-clock,page-faults \
+    env HAKMEM_SMALL_SEGMENT_V4_ENABLED=0 ./bench_random_mixed_hakmem 1000000 400 1
+  # B: segment gate ON
+  perf stat -e cycles,instructions,task-clock,page-faults \
+    env HAKMEM_SMALL_SEGMENT_V4_ENABLED=1 ./bench_random_mixed_hakmem 1000000 400 1
+  ```
+- Results (ws=400, iters=1M):
+  - OFF: Throughput **28,890,266 ops/s**, page-faults=6,744, task-clock=54.84ms
+  - ON : Throughput **28,849,781 ops/s**, page-faults=6,746, task-clock=61.49ms
+- Observations:
+  - Going through the smallsegment_v4 gate leaves pf/ops essentially unchanged (it is currently a thin implementation on top of the Tiny v1 lease).
+  - The "entry point via Segment" now exists, so from PF4 onward a dedicated mmap/segment split will be implemented and the A/B rerun.
+
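The PF3 A/B deltas can be computed directly from the reported counters. This sketch uses only the numbers listed in the results above; `pct_change` is a hypothetical helper.

```python
off = {"ops": 28_890_266, "page_faults": 6_744, "task_clock_ms": 54.84}
on = {"ops": 28_849_781, "page_faults": 6_746, "task_clock_ms": 61.49}

def pct_change(before: float, after: float) -> float:
    """Relative change from the OFF run to the ON run, in percent."""
    return (after - before) / before * 100.0

deltas = {key: pct_change(off[key], on[key]) for key in off}

for key, value in deltas.items():
    print(f"{key:14s} {value:+.2f}%")
```

Throughput and page faults are flat, while task-clock differs by about 12% between the two single runs; with one run per side, that difference may be noise and would need repeated runs to interpret.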
 ## DEBUG perf (cycles:u, -O0/-g, v4=C7+C6)

 - Build:
--- a/docs/analysis/SMALLOBJECT_SEGMENT_V4_DESIGN.md
+++ b/docs/analysis/SMALLOBJECT_SEGMENT_V4_DESIGN.md
@ -24,6 +24,13 @@
 - **PF3**: Implement SmallSegmentBox_v4 and run an A/B that tries a small-object-dedicated Segment with C7/C6 v4.
 - **PF4**: Tune Segment size/policy and improve visibility of pf/OS stats. If successful, reflect the result in the ENV presets.

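+## PF3 progress notes
+- Connected smallsegment_v4_box to the hot code; the ENV variable `HAKMEM_SMALL_SEGMENT_V4_ENABLED` now switches between the Tiny v1 path and the segment path (at this stage it is a thin wrapper around the Tiny v1 lease).
+- A/B on Mixed 16–1024B (v4 forced, ws=400, iters=1M):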
+  - OFF: 28.89M ops/s, page-faults=6,744
+  - ON : 28.85M ops/s, page-faults=6,746
+- pf/ops is still unchanged. The next phase adds the actual small-object-dedicated mmap/segment carve and reruns the A/B.
+
 ## Notes
 - C5 v4 is still a research box (C5-heavy only). Mixed is planned to stay on C5 v1.
 - C6 v4 shows +4–5% on C6-heavy and is OFF by default in Mixed (research box).
--- a/docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md
+++ b/docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md
@ -32,22 +32,10 @@
 - `core/front/malloc_tiny_fast.h`: adds a v4 case to the route switch. When C7 v4 is ON it takes the v4 path (currently C7 uses its own freelist and everything else falls back to v1/v3); when OFF it uses the existing v3/v1 path.
 - `core/box/smallsegment_v4_box.h` / `core/box/smallsegment_v4_env_box.h`: scaffolding for the small-object Segment Box added in PF2 (types and ENV only, no behavior change). Design notes are collected in `docs/analysis/SMALLOBJECT_SEGMENT_V4_DESIGN.md`.

-## A/B and operation
- Health check as of Phase v4-3.1:
-  - C7-only A/B (ws=400, iters=1M, size fixed at 1024):
-    - v3: 41.67M ops/s, prepare_calls=5,077
-    - v4: 42.13M ops/s, prepare_calls=4,701 (improved from 3.4x to about 1.0x by reusing current/partial)
-  - Mixed 16–1024B (MIXED_TINYV3_C7_SAFE, ws=400, iters=1M):
-    - v3 route: 40.66M ops/s
-    - v4 route: 40.01M ops/s (within -1.6%, no regression)
- No segv/assert in either case. The C7 v4 prepare increase is resolved. v3 is still slightly ahead on Mixed, but within tolerance.
- Phase v4-4 (C6 v4 pilot):
-  - ENV: `HAKMEM_SMALL_HEAP_V4_ENABLED=1`, `HAKMEM_SMALL_HEAP_V4_CLASSES=0x40` (C6-only v4). OFF by default in Mixed (0x80 = C7-only).
-  - C6-heavy benchmark (ws=400, iters=1M, size 257–768):
-    - C6 v1: 28.69M ops/s
-    - C6 v4: 30.07M ops/s (+4.8%), no segv/assert
-  - Mixed 16–1024B stays on C6 v1 by default (C6 v4 is a research box). Expansion is planned as C6 v4 stability is confirmed.
- Phase v4-5 (C5 v4 pilot; opt-in for C5-heavy only):
-  - ENV: `HAKMEM_SMALL_HEAP_V4_ENABLED=1`, `HAKMEM_SMALL_HEAP_V4_CLASSES=0x20` (C5-only v4). Switched by bit, independently of C7 v4 / C6 v4.
-  - Goal: check whether v4 beats v1 on C5-heavy workloads. The Mixed standard stays on C5 v1 (C5 v4 is a research box).
-  - Status: implemented. The C5-heavy / Mixed A/B has not been run yet. Promotion will be decided after checking for segv/assert and measuring throughput.
+## A/B and operation (summary as of 2025-12)
+- v4 C7/C6/C5 are all **research boxes**. The standard Mixed line is pinned to C7-only v3 + C7 ULTRA (UF-3 segment), and the v4 family is available only via ENV opt-in.
+- Under the C6/FREEZE policy, C6 v4 / C5 v4 stay off the main line until the mid/pool redesign progresses (C6 is handled on the pool/mid side as an "ordinary mid class").
+- When small-object v4 is pursued in the future, proceed in this order:
+  - first, consolidate the design hardened in C7 ULTRA (Segment + Page + TLS freelist + mask free) as the "common pattern for all small objects",
+  - then move the 16 B to 2 KiB range onto SmallHeapCtx v4 (unifying headerless operation and lookup reduction across C7 and mid).
--- a/docs/analysis/TINY_C7_ULTRA_DESIGN.md
+++ b/docs/analysis/TINY_C7_ULTRA_DESIGN.md
@ -40,8 +40,13 @@
  - Managed → obtain page_meta via page_idx = (p - seg_base) >> PAGE_SHIFT and push to the freelist without a header.
 - Remote/cross-thread free is still unsupported in UF-3 (remains same-thread C7 only).
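
The ptr→segment mask and page index computation can be shown with plain integer arithmetic. The sketch below assumes power-of-two, size-aligned segments; the 2 MiB segment size and 64 KiB page size are hypothetical values chosen for illustration, not the sizes used by C7UltraSegmentBox.

```python
SEG_SHIFT = 21    # assumed 2 MiB segments (hypothetical)
PAGE_SHIFT = 16   # assumed 64 KiB pages (hypothetical)
SEG_SIZE = 1 << SEG_SHIFT
SEG_MASK = ~(SEG_SIZE - 1)

def locate(p: int) -> tuple[int, int]:
    """Map a pointer to (segment base, page index) with mask and shift only."""
    seg_base = p & SEG_MASK                  # clear the low SEG_SHIFT bits
    page_idx = (p - seg_base) >> PAGE_SHIFT  # page slot inside the segment
    return seg_base, page_idx

base = 0x7F00_0020_0000                      # a 2 MiB-aligned example address
ptr = base + 3 * (1 << PAGE_SHIFT) + 100     # 100 bytes into page 3
print(locate(ptr) == (base, 3))
```

Because both steps are pure arithmetic on the pointer, no per-block header and no hash or registry lookup is needed on the free path, provided the pointer is already known to belong to a managed segment.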

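+## UF-4: C7 ULTRA header light (research box)
+- Goal: remove the per-call execution of tiny_region_id_write_header from C7 ULTRA alloc/free and confine it to carve time.
+- Approach: store the freelist next pointer immediately after the header so that the header is preserved, and write headers in bulk at carve time only when ENV `HAKMEM_TINY_C7_ULTRA_HEADER_LIGHT` (default 0) is ON. alloc skips the write if the header is already in place.
+- Fail-Fast: pointers not managed by ULTRA fall through to the v3 free path as before.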
+
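 ## Phases
 - UF-1: Add only the box, ENV, and front hook as stubs (internally routed through v3 C7, no behavior change).
 - UF-2: Implement the ULTRA TLS freelist (holds one C7 page in TLS; same thread only). C7 pages are supplied via v3/v4 for now.
 - UF-3: Implement C7UltraSegmentBox and move toward headerless free via ptr→segment mask (a single segment is acceptable).
- UF-4: Adjust integration with pf/segment/learning layer and run a full A/B on Mixed.
+- UF-4: Add C7 ULTRA header light as a research box and evaluate it with an ON/OFF A/B (both Mixed and C7-only).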