# CURRENT TASK – Performance Optimization Status
**Last Updated**: 2025-11-25
**Scope**: Random Mixed 16-1024B / Arena Allocator / Architecture Limit Analysis

---

## 🎯 Current Status Summary

### ✅ Arena Allocator Implemented - mmap/munmap Syscalls Reduced ~95%
| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| mmap syscalls | 401 | 32 | -92% |
| munmap syscalls | 378 | 3 | -99% |
| Performance (10M) | ~60M ops/s | **68-70M ops/s** | +15% |

### Current Performance Comparison (10M iterations)
```
System malloc: 93M ops/s (baseline)
HAKMEM: 68-70M ops/s (73-76% of system malloc)
Gap: ~25% (structural overhead)
```

---

## 🔬 Phase 27 Findings: Confirming the Architecture Limit

### Optimizations Attempted (All Failed)
| Optimization | Result | Effect |
|---------|------|------|
| C5 TLS capacity doubled (1024→2048) | 68-69M | No change |
| Registry lookup removed | 68-70M | No change |
| Ultra SLIM 4-layer | ~69M | No change |
| **Phase 27-A: Ultra-Inline (all sizes)** | **56-61M** | **-15% regression** ❌ |
| **Phase 27-B: Ultra-Inline (9-512B)** | **61-62M** | **-10% regression** ❌ |

### Why Phase 27 Failed
- ~52% of the workload lands in headerless classes (cls 0: 1-8B, cls 7: 513-1024B)
- The conditional branch that filters out headerless classes is itself overhead
- The gain on classes 1-6 is smaller than the cost of that branch

### Sources of the Remaining ~25% Gap (Structural Overhead)
1. **Header byte overhead** - one byte written/read on every alloc/free
2. **TLS SLL counter** - count++ / count-- on every operation (vs tcache: pointer only)
3. **Multi-layer dispatch** - 4-5 dispatch layers (vs tcache: 2-3)

### Conclusion
**68-70M ops/s is the practical ceiling of the current architecture.** Reaching system malloc's 93M ops/s would require:
- A full redesign toward a header-free layout
- Mimicking tcache (drop the counter, cut dispatch layers)

However, the return on investment is currently low.

---

## 📁 Key Modified Files (Arena Allocator Implementation)
- `core/box/ss_cache_box.inc:138-229` - SSArena allocator added
- `core/box/tls_sll_box.h:509-561` - recycle check made optional in release mode
- `core/tiny_free_fast_v2.inc.h:113-148` - cross-check removed in release mode
- `core/hakmem_tiny_sll_cap_box.inc:8-25` - C5 raised to full capacity
- `core/hakmem_policy.c:24-30` - min_keep tuning
- `core/tiny_alloc_fast_sfc.inc.h:18-26` - SFC defaults tuning

---

## 🗃 Past Problems and Solutions (Reference)

### State Before the Arena Allocator
- **Random Mixed (5M ops)**: ~56-60M ops/s, **418 mmap calls** (26x mimalloc)
- **Root cause**: design mismatch - the SuperSlab served as both the allocation unit and the cache unit
- **Problem**: at ws=256, slabs stall at 5-15% utilization → never become fully EMPTY → the LRU cache never fires → mmap/munmap on every cycle

### How the Arena Allocator Solved It
- Implemented an Arena allocator that treats the SuperSlab as the OS-level unit
- mmap calls 418 → 32 (-92%), munmap calls 378 → 3 (-99%)
- Performance 60M → 68-70M ops/s (+15%)

---

## 📊 Architecture Mapping to Other Allocators (Reference)
| HAKMEM | mimalloc | tcmalloc | jemalloc |
|--------|----------|----------|----------|
| SuperSlab (2MB) | Segment (~2MiB) | PageHeap | Extent |
| Slab (64KB) | Page (~64KiB) | Span | Run/slab |
| per-class freelist | pages_queue | Central freelist | bin/slab lists |
| Arena allocator | segment cache | PageHeap | extent_avail |

---

## 🚀 Future Possibilities (Long-Term)

### Slab-Level EMPTY Recycling (Not Implemented)
- **Goal**: make slabs reusable across size classes
- **Design**: keep EMPTY slabs on a lock-free stack; assign class_idx dynamically at alloc time
- **Expected benefit**: better memory efficiency (throughput gains likely limited)

### Abandoned SuperSlab (for MT, Not Implemented)
- **Goal**: let surviving threads reclaim memory after a thread exits
- **Design**: equivalent of mimalloc's abandoned segments
- **When**: once MT workloads actually need it

---

## ✅ Completed Milestones
1. **Arena Allocator implementation** - mmap/munmap syscalls cut by ~95% ✅
2. **Phase 27 investigation** - architecture limit confirmed ✅
3. **Performance 68-70M ops/s** - 73-76% of system malloc ✅
**Current recommendation**: accept 68-70M ops/s as the baseline and focus optimization effort on other workloads (Mid-Large, Larson, etc.).