|
|
6afaa5703a
|
Phase 12-1.1: EMPTY Slab Detection + Immediate Reuse (+13% improvement, 10.2M→11.5M ops/s)
Implementation of Task-sensei Priority 1 recommendation: Add empty_mask to SuperSlab
for immediate EMPTY slab detection and reuse, reducing Stage 3 (mmap) overhead.
## Changes
### 1. SuperSlab Structure (core/superslab/superslab_types.h)
- Added `empty_mask` (uint32_t): Bitmap for EMPTY slabs (used==0)
- Added `empty_count` (uint8_t): Quick check for EMPTY slab availability
### 2. EMPTY Detection API (core/box/ss_hot_cold_box.h)
- Added `ss_is_slab_empty()`: Returns true if slab is completely EMPTY
- Added `ss_mark_slab_empty()`: Marks slab as EMPTY (highest reuse priority)
- Added `ss_clear_slab_empty()`: Removes EMPTY state when reactivated
- Updated `ss_update_hot_cold_indices()`: Classify EMPTY/Hot/Cold slabs
- Updated `ss_init_hot_cold()`: Initialize empty_mask/empty_count
### 3. Free Path Integration (core/box/free_local_box.c)
- After `meta->used--`, check if `meta->used == 0`
- If true, call `ss_mark_slab_empty()` to update empty_mask
- Enables immediate EMPTY detection on every free operation
### 4. Shared Pool Stage 0.5 (core/hakmem_shared_pool.c)
- New Stage 0.5 before Stage 1: Scan existing SuperSlabs for EMPTY slabs
- Iterate over `g_super_reg_by_class[class_idx][]` (first 16 entries)
- Check `ss->empty_count > 0` → scan `empty_mask` with `__builtin_ctz()`
- Reuse EMPTY slab directly, avoiding Stage 3 (mmap/lock overhead)
- ENV control: `HAKMEM_SS_EMPTY_REUSE=1` (default OFF for A/B testing)
- ENV tunable: `HAKMEM_SS_EMPTY_SCAN_LIMIT=N` (default 16 SuperSlabs)
## Performance Results
```
Benchmark: Random Mixed 256B (100K iterations)
OFF (default): 10.2M ops/s (baseline)
ON (ENV=1): 11.5M ops/s (+13.0% improvement) ✅
```
## Expected Impact (from Task-sensei analysis)
**Current bottleneck**:
- Stage 1: 2-5% hit rate (free list broken)
- Stage 2: 3-8% hit rate (rare UNUSED)
- Stage 3: 87-95% hit rate (lock + mmap overhead) ← bottleneck
**Expected with Phase 12-1.1**:
- Stage 0.5: 20-40% hit rate (EMPTY scan)
- Stage 1-2: 20-30% hit rate (combined)
- Stage 3: 30-50% hit rate (significantly reduced)
**Theoretical max**: 25M → 55-70M ops/s (+120-180%)
## Current Gap Analysis
**Observed**: 11.5M ops/s (+13%)
**Expected**: 55-70M ops/s (+120-180%)
**Gap**: Performance regression or missing complementary optimizations
Possible causes:
1. Phase 3d-C (25.1M→10.2M) regression - unrelated to this change
2. EMPTY scan overhead (16 SuperSlabs × empty_count check)
3. Missing Priority 2-5 optimizations (Lazy SS deallocation, etc.)
4. Stage 0.5 too conservative (scan_limit=16, should be higher?)
## Usage
```bash
# Enable EMPTY reuse optimization
export HAKMEM_SS_EMPTY_REUSE=1
# Optional: increase scan limit (trade-off: throughput vs latency)
export HAKMEM_SS_EMPTY_SCAN_LIMIT=32
./bench_random_mixed_hakmem 100000 256 42
```
## Next Steps
**Priority 1-A**: Investigate Phase 3d-C→12-1.1 regression (25.1M→10.2M)
**Priority 1-B**: Implement Phase 12-1.2 (Lazy SS deallocation) for complementary effect
**Priority 1-C**: Profile Stage 0.5 overhead (scan_limit tuning)
## Files Modified
Core implementation:
- `core/superslab/superslab_types.h` - empty_mask/empty_count fields
- `core/box/ss_hot_cold_box.h` - EMPTY detection/marking API
- `core/box/free_local_box.c` - Free path EMPTY detection
- `core/hakmem_shared_pool.c` - Stage 0.5 EMPTY scan
Documentation:
- `CURRENT_TASK.md` - Task-sensei investigation report
---
🎯 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task-sensei (investigation & design analysis)
|
2025-11-21 04:56:48 +09:00 |
|
|
|
03ba62df4d
|
Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization
Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
- Direct SuperSlab carve (TLS SLL bypass)
- Self-contained pop-or-refill pattern
- ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128
2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
- Unified ON → direct cache access (skip all intermediate layers)
- Alloc: unified_cache_pop_or_refill() → immediate fail to slow
- Free: unified_cache_push() → fallback to SLL only if full
PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
- PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
- Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()
4. Measurement results (Random Mixed 500K / 256B):
- Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
- SSM: 512 pages (initialization footprint)
- MID/L25: 0 (unused in this workload)
- Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)
Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
- ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
- Conditional compilation cleanup
Documentation:
6. Analysis reports
- RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
- RANDOM_MIXED_SUMMARY.md: Phase 23 summary
- RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
- CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan
Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads
Generated with Claude Code
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-17 02:47:58 +09:00 |
|