P-Tier + Tiny Route Policy: Aggressive Superslab Management + Safe Routing
## Phase 1: Utilization-Aware Superslab Tiering (案B実装済)
- Add ss_tier_box.h: Classify SuperSlabs into HOT/DRAINING/FREE based on utilization
- HOT (>25%): Accept new allocations
- DRAINING (≤25%): Drain only, no new allocs
- FREE (0%): Ready for eager munmap
- Enhanced shared_pool_release_slab():
- Check tier transition after each slab release
- If tier→FREE: Force remaining slots to EMPTY and call superslab_free() immediately
- Bypasses LRU cache to prevent registry bloat from accumulating DRAINING SuperSlabs
- Test results (bench_random_mixed_hakmem):
- 1M iterations: ✅ ~1.03M ops/s (previously passed)
- 10M iterations: ✅ ~1.15M ops/s (previously: registry full error)
- 50M iterations: ✅ ~1.08M ops/s (stress test)
## Phase 2: Tiny Front Routing Policy (新規Box)
- Add tiny_route_box.h/c: Single 8-byte table for class→routing decisions
- ROUTE_TINY_ONLY: Tiny front exclusive (no fallback)
- ROUTE_TINY_FIRST: Try Tiny, fallback to Pool if fails
- ROUTE_POOL_ONLY: Skip Tiny entirely
- Profiles via HAKMEM_TINY_PROFILE ENV:
- "hot": C0-C3=TINY_ONLY, C4-C6=TINY_FIRST, C7=POOL_ONLY
- "conservative" (default): All TINY_FIRST
- "off": All POOL_ONLY (disable Tiny)
- "full": All TINY_ONLY (microbench mode)
- A/B test results (ws=256, 100k ops random_mixed):
- Default (conservative): ~2.90M ops/s
- hot: ~2.65M ops/s (more conservative)
- off: ~2.86M ops/s
- full: ~2.98M ops/s (slightly best)
## Design Rationale
### Registry Pressure Fix (案B)
- Problem: DRAINING tier SS occupied registry indefinitely
- Solution: When total_active_blocks→0, immediately free to clear registry slot
- Result: No more "registry full" errors under stress
### Routing Policy Box (新)
- Problem: Tiny front optimization scattered across ENV/branches
- Solution: Centralize routing in single table, select profiles via ENV
- Benefit: Safe A/B testing without touching hot path code
- Future: Integrate with RSS budget/learning layers for dynamic profile switching
## Next Steps (性能最適化)
- Profile Tiny front internals (TLS SLL, FastCache, Superslab backend latency)
- Identify bottleneck between current ~2.9M ops/s and mimalloc ~100M ops/s
- Consider:
- Reduce shared pool lock contention
- Optimize unified cache hit rate
- Streamline Superslab carving logic
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>