hakmem/core/box/ss_hot_cold_box.h
Moe Charm (CI) 6afaa5703a Phase 12-1.1: EMPTY Slab Detection + Immediate Reuse (+13% improvement, 10.2M→11.5M ops/s)
Implementation of Task-sensei Priority 1 recommendation: Add empty_mask to SuperSlab
for immediate EMPTY slab detection and reuse, reducing Stage 3 (mmap) overhead.

## Changes

### 1. SuperSlab Structure (core/superslab/superslab_types.h)
- Added `empty_mask` (uint32_t): Bitmap for EMPTY slabs (used==0)
- Added `empty_count` (uint8_t): Quick check for EMPTY slab availability

### 2. EMPTY Detection API (core/box/ss_hot_cold_box.h)
- Added `ss_is_slab_empty()`: Returns true if slab is completely EMPTY
- Added `ss_mark_slab_empty()`: Marks slab as EMPTY (highest reuse priority)
- Added `ss_clear_slab_empty()`: Removes EMPTY state when reactivated
- Updated `ss_update_hot_cold_indices()`: Classify EMPTY/Hot/Cold slabs
- Updated `ss_init_hot_cold()`: Initialize empty_mask/empty_count

### 3. Free Path Integration (core/box/free_local_box.c)
- After `meta->used--`, check if `meta->used == 0`
- If true, call `ss_mark_slab_empty()` to update empty_mask
- Enables immediate EMPTY detection on every free operation
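
The free-path hook described above can be sketched as follows. This is a minimal illustration, not the code in `free_local_box.c`: the `Demo*` types, `DEMO_SLABS_MAX`, and `demo_free_block()` are hypothetical stand-ins, and only the `used`, `capacity`, `empty_mask`, and `empty_count` field names come from this commit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the real hakmem types. */
#define DEMO_SLABS_MAX 32

typedef struct {
    uint16_t used;      /* blocks currently allocated from this slab */
    uint16_t capacity;  /* total blocks in this slab (0 = uninitialized) */
} DemoSlabMeta;

typedef struct {
    uint32_t     empty_mask;   /* bit i set => slab i is EMPTY */
    uint8_t      empty_count;  /* number of bits set in empty_mask */
    DemoSlabMeta slabs[DEMO_SLABS_MAX];
} DemoSuperSlab;

/* Same shape as ss_mark_slab_empty(): idempotent set of one mask bit. */
static void demo_mark_slab_empty(DemoSuperSlab* ss, int slab_idx) {
    if (!ss || slab_idx < 0 || slab_idx >= DEMO_SLABS_MAX) return;
    uint32_t bit = UINT32_C(1) << slab_idx;
    if (!(ss->empty_mask & bit)) {
        ss->empty_mask |= bit;
        ss->empty_count++;
    }
}

/* Frees one block from a slab. Returns true only when this free
 * moved the slab to the EMPTY state (used dropped to 0). */
static bool demo_free_block(DemoSuperSlab* ss, int slab_idx) {
    if (!ss || slab_idx < 0 || slab_idx >= DEMO_SLABS_MAX) return false;
    DemoSlabMeta* meta = &ss->slabs[slab_idx];
    if (meta->used == 0) return false;  /* nothing to free; avoid underflow */
    meta->used--;
    if (meta->used == 0) {
        demo_mark_slab_empty(ss, slab_idx);
        return true;
    }
    return false;
}
```

Because the mark helper is idempotent, calling it again for an already-EMPTY slab leaves `empty_count` unchanged, so the counter keeps matching the number of bits set in `empty_mask`.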

### 4. Shared Pool Stage 0.5 (core/hakmem_shared_pool.c)
- New Stage 0.5 before Stage 1: Scan existing SuperSlabs for EMPTY slabs
- Iterate over `g_super_reg_by_class[class_idx][]` (first 16 entries)
- Check `ss->empty_count > 0` → scan `empty_mask` with `__builtin_ctz()`
- Reuse EMPTY slab directly, avoiding Stage 3 (mmap/lock overhead)
- ENV control: `HAKMEM_SS_EMPTY_REUSE=1` (default OFF for A/B testing)
- ENV tunable: `HAKMEM_SS_EMPTY_SCAN_LIMIT=N` (default 16 SuperSlabs)
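
A rough sketch of the Stage 0.5 scan, under stated assumptions: the registry is modeled as a plain array of pointers, and `DemoSuperSlab`, `DemoEmptyHit`, and `demo_find_empty_slab()` are hypothetical names. The real code in `hakmem_shared_pool.c` walks `g_super_reg_by_class[class_idx][]` and may handle locking and slab reinitialization differently. `__builtin_ctz()` is a GCC/Clang builtin whose result is undefined for 0, so the mask is checked first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: only the two EMPTY-tracking fields are modeled. */
typedef struct {
    uint32_t empty_mask;   /* bit i set => slab i is EMPTY */
    uint8_t  empty_count;  /* number of bits set in empty_mask */
} DemoSuperSlab;

typedef struct {
    int ss_idx;    /* index into the registry, or -1 on a miss */
    int slab_idx;  /* slab index inside that SuperSlab, or -1 on a miss */
} DemoEmptyHit;

/* Scans at most scan_limit registry entries and claims the first EMPTY
 * slab found by clearing its mask bit. Returns {-1, -1} on a miss. */
static DemoEmptyHit demo_find_empty_slab(DemoSuperSlab* const* registry,
                                         size_t registry_len,
                                         size_t scan_limit) {
    DemoEmptyHit miss = { -1, -1 };
    size_t n = registry_len < scan_limit ? registry_len : scan_limit;
    for (size_t i = 0; i < n; i++) {
        DemoSuperSlab* ss = registry[i];
        /* Cheap reject first: empty_count avoids touching the mask. */
        if (!ss || ss->empty_count == 0 || ss->empty_mask == 0) continue;
        int slab_idx = __builtin_ctz(ss->empty_mask);  /* lowest set bit */
        ss->empty_mask &= ~(UINT32_C(1) << slab_idx);
        ss->empty_count--;
        DemoEmptyHit hit = { (int)i, slab_idx };
        return hit;
    }
    return miss;
}
```

The scan cost is bounded by `scan_limit`, which is why a larger `HAKMEM_SS_EMPTY_SCAN_LIMIT` can raise the hit rate while also adding work to every miss.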

## Performance Results

```
Benchmark: Random Mixed 256B (100K iterations)

OFF (default):  10.2M ops/s (baseline)
ON  (ENV=1):    11.5M ops/s (+13.0% improvement) 
```

## Expected Impact (from Task-sensei analysis)

**Current bottleneck**:
- Stage 1: 2-5% hit rate (free list broken)
- Stage 2: 3-8% hit rate (rare UNUSED)
- Stage 3: 87-95% hit rate (lock + mmap overhead) ← bottleneck

**Expected with Phase 12-1.1**:
- Stage 0.5: 20-40% hit rate (EMPTY scan)
- Stage 1-2: 20-30% hit rate (combined)
- Stage 3: 30-50% hit rate (significantly reduced)

**Theoretical max**: 25M → 55-70M ops/s (+120-180%)

## Current Gap Analysis

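**Observed**: 11.5M ops/s (+13% over the 10.2M baseline)
**Expected**: 55-70M ops/s (+120-180% over the 25M baseline assumed in the analysis)
**Gap**: The measured baseline (10.2M) is already far below the assumed 25M, so the shortfall points to a performance regression or to missing complementary optimizations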

Possible causes:
1. Phase 3d-C (25.1M→10.2M) regression - unrelated to this change
2. EMPTY scan overhead (16 SuperSlabs × empty_count check)
3. Missing Priority 2-5 optimizations (Lazy SS deallocation, etc.)
4. Stage 0.5 too conservative (scan_limit=16, should be higher?)

## Usage

```bash
# Enable EMPTY reuse optimization
export HAKMEM_SS_EMPTY_REUSE=1

# Optional: increase scan limit (trade-off: throughput vs latency)
export HAKMEM_SS_EMPTY_SCAN_LIMIT=32

./bench_random_mixed_hakmem 100000 256 42
```
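
For illustration, the two environment variables above could be parsed along these lines. This is a sketch under assumptions, not the parsing code in hakmem: `demo_parse_flag()`, `demo_parse_limit()`, the "exactly `1` enables" rule, and the clamp bounds are all hypothetical. Only the variable names and the defaults (OFF, 16) come from this commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Assumption: the feature is enabled only by the exact string "1". */
static bool demo_parse_flag(const char* value) {
    return value != NULL && value[0] == '1' && value[1] == '\0';
}

/* Parses a decimal integer, falling back to a default on missing or
 * malformed input and clamping the result to [lo, hi]. */
static int demo_parse_limit(const char* value, int fallback, int lo, int hi) {
    if (value == NULL || *value == '\0') return fallback;
    char* end = NULL;
    errno = 0;
    long n = strtol(value, &end, 10);
    if (errno != 0 || *end != '\0') return fallback;  /* overflow or junk */
    if (n < lo) return lo;
    if (n > hi) return hi;
    return (int)n;
}

/* Reads both settings from the environment. The bounds 1..1024 are an
 * illustrative assumption; the default of 16 is from the commit text. */
static void demo_load_empty_reuse_config(bool* enabled, int* scan_limit) {
    *enabled    = demo_parse_flag(getenv("HAKMEM_SS_EMPTY_REUSE"));
    *scan_limit = demo_parse_limit(getenv("HAKMEM_SS_EMPTY_SCAN_LIMIT"),
                                   16, 1, 1024);
}
```

Reading the values once at startup and caching them would keep `getenv()` off the allocation hot path.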

## Next Steps

**Priority 1-A**: Investigate Phase 3d-C→12-1.1 regression (25.1M→10.2M)
**Priority 1-B**: Implement Phase 12-1.2 (Lazy SS deallocation) for complementary effect
**Priority 1-C**: Profile Stage 0.5 overhead (scan_limit tuning)

## Files Modified

Core implementation:
- `core/superslab/superslab_types.h` - empty_mask/empty_count fields
- `core/box/ss_hot_cold_box.h` - EMPTY detection/marking API
- `core/box/free_local_box.c` - Free path EMPTY detection
- `core/hakmem_shared_pool.c` - Stage 0.5 EMPTY scan

Documentation:
- `CURRENT_TASK.md` - Task-sensei investigation report

---

🎯 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task-sensei (investigation & design analysis)
2025-11-21 04:56:48 +09:00

139 lines
4.6 KiB
C

// ss_hot_cold_box.h - Phase 3d-C: Hot/Cold Slab Split Box
// Purpose: Cache locality optimization via hot/cold slab separation
// License: MIT
// Date: 2025-11-20
#ifndef SS_HOT_COLD_BOX_H
#define SS_HOT_COLD_BOX_H
#include "../superslab/superslab_types.h"
#include <stdbool.h>
#include <stdint.h>  // uint32_t / uint8_t used by the EMPTY mask helpers
// ============================================================================
// Phase 3d-C: Hot/Cold Split Box API
// ============================================================================
//
// Goal: Improve L1D cache hit rate by separating hot (high utilization) and
// cold (low utilization) slabs within a SuperSlab.
//
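// Strategy:
// - EMPTY slabs (used == 0): Highest reuse priority (Phase 12-1.1), tracked in empty_mask
// - Hot slabs (used > 50%): Prioritized for allocation → better cache locality
// - Cold slabs (0 < used ≤ 50%): Used as fallback → delayed deallocation
//
// Expected: +8-12% throughput from improved cache line locality
//
// Box Contract:
// - ss_is_slab_empty(): Returns true if slab is completely EMPTY (used == 0)
// - ss_is_slab_hot(): Returns true if slab should be considered "hot"
// - ss_mark_slab_empty() / ss_clear_slab_empty(): Set / clear a slab's EMPTY bit
// - ss_update_hot_cold_indices(): Rebuilds EMPTY mask and hot/cold index arrays
// - ss_init_hot_cold(): Initializes hot/cold/EMPTY fields on SuperSlab creation
//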
// ============================================================================
// Phase 3d-C: Hot/Cold classification threshold
#define HOT_UTILIZATION_THRESHOLD 50 // Slabs above 50% utilization are classified as hot
// Phase 12-1.1: EMPTY detection logic (highest-priority reuse)
// Returns: true if slab is completely EMPTY (used == 0, highest reuse priority)
static inline bool ss_is_slab_empty(const TinySlabMeta* meta) {
    return (meta->capacity > 0 && meta->used == 0);
}
// Phase 3d-C: Hot detection logic
// Returns: true if slab is "hot" (high utilization, should be prioritized)
static inline bool ss_is_slab_hot(const TinySlabMeta* meta) {
    // Heuristic: utilization > 50% → hot
    // Rationale: high utilization = frequently accessed = worth keeping in cache
    if (meta->capacity == 0) {
        return false; // Uninitialized slab
    }
    return (meta->used * 100 / meta->capacity) > HOT_UTILIZATION_THRESHOLD;
}
// Phase 12-1.1: EMPTY mask update helper
// Marks a slab as EMPTY (highest reuse priority)
static inline void ss_mark_slab_empty(SuperSlab* ss, int slab_idx) {
    if (!ss || slab_idx < 0 || slab_idx >= SLABS_PER_SUPERSLAB_MAX) return;
    uint32_t bit = (1u << slab_idx);
    if (!(ss->empty_mask & bit)) {
        ss->empty_mask |= bit;
        ss->empty_count++;
    }
}
// Phase 12-1.1: EMPTY mask clear helper
// Removes a slab from EMPTY state (when reactivated)
static inline void ss_clear_slab_empty(SuperSlab* ss, int slab_idx) {
    if (!ss || slab_idx < 0 || slab_idx >= SLABS_PER_SUPERSLAB_MAX) return;
    uint32_t bit = (1u << slab_idx);
    if (ss->empty_mask & bit) {
        ss->empty_mask &= ~bit;
        ss->empty_count--;
    }
}
// Phase 3d-C: Hot/Cold index update
// Rebuilds hot_indices[] and cold_indices[] arrays based on current slab state
static inline void ss_update_hot_cold_indices(SuperSlab* ss) {
    if (!ss) return;
    ss->hot_count = 0;
    ss->cold_count = 0;
    // Phase 12-1.1: Reset empty tracking
    ss->empty_mask = 0;
    ss->empty_count = 0;
    uint32_t max_slabs = (1u << ss->lg_size) / SLAB_SIZE;
    if (max_slabs > SLABS_PER_SUPERSLAB_MAX) {
        max_slabs = SLABS_PER_SUPERSLAB_MAX;
    }
    // Scan active slabs and classify as EMPTY / hot / cold
    for (uint32_t i = 0; i < max_slabs && i < ss->active_slabs; i++) {
        TinySlabMeta* meta = &ss->slabs[i];
        // Skip uninitialized slabs (capacity == 0)
        if (meta->capacity == 0) {
            continue;
        }
        // Phase 12-1.1: EMPTY slabs have highest reuse priority
        if (ss_is_slab_empty(meta)) {
            ss_mark_slab_empty(ss, (int)i);
            continue; // Don't add to hot/cold arrays
        }
        if (ss_is_slab_hot(meta)) {
            // Hot slab: high utilization
            if (ss->hot_count < 16) {
                ss->hot_indices[ss->hot_count++] = (uint8_t)i;
            }
        } else {
            // Cold slab: low utilization
            if (ss->cold_count < 16) {
                ss->cold_indices[ss->cold_count++] = (uint8_t)i;
            }
        }
    }
}
// Phase 3d-C: Zero-clear the hot/cold fields when a SuperSlab is initialized
static inline void ss_init_hot_cold(SuperSlab* ss) {
    if (!ss) return;
    ss->hot_count = 0;
    ss->cold_count = 0;
    // Phase 12-1.1: Initialize EMPTY tracking
    ss->empty_mask = 0;
    ss->empty_count = 0;
    // Initialize index arrays to 0 (defensive programming)
    for (int i = 0; i < 16; i++) {
        ss->hot_indices[i] = 0;
        ss->cold_indices[i] = 0;
    }
}
#endif // SS_HOT_COLD_BOX_H