# Phase POOL-MID-DN-BATCH: Deferred inuse_dec Design
## Goal
- Completely eliminate `mid_desc_lookup*` from the `hak_pool_free_v1_fast_impl` hot path
- Target: Mixed median +2-4%, plus tail/variance reduction (as seen in the cache A/B)
## Background
### A/B Benchmark Results (2025-12-12)
| Metric | Baseline | Cache ON | Improvement |
|--------|----------|----------|-------------|
| Median throughput | 8.72M ops/s | 8.93M ops/s | +2.3% |
| Worst-case | 6.46M ops/s | 7.52M ops/s | **+16.5%** |
| CV (variance) | 13.3% | 7.2% | **-46%** |
**Insight**: The cache improves stability more than raw speed. Deferred batching should do even better, because it removes the lookup from the hot path entirely.
## Box Theory Design
### L0: MidInuseDeferredBox
```c
// Hot API (lookup/atomic/lock PROHIBITED)
static inline void mid_inuse_dec_deferred(void* raw);
// Cold API (ONLY lookup boundary)
static inline void mid_inuse_deferred_drain(void);
```
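To make the hot/cold split concrete, here is a minimal usage sketch of the hot API replacing the direct dec call in the pool free tail (patch step 4 below). The old call comes from this document; the wrapper name and the exact tail structure of `pool_free_v1_box.h` are assumptions.

```c
// Sketch only: how the free tail might adopt the deferred API when the ENV
// gate is ON. The real pool_free_v1_box.h structure is assumed, not quoted.
static inline void hak_pool_free_v1_tail(void* raw) {
    // Before: lookup + atomic dec on every free (hot-path lookup)
    //   mid_page_inuse_dec_and_maybe_dn(raw);
    // After: O(1) TLS bookkeeping; the lookup moves into the batched drain
    mid_inuse_dec_deferred(raw);
}
```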
### L1: MidInuseTlsPageMapBox
```c
// TLS fixed-size map (32 or 64 entries)
// Single responsibility: "bundle page → dec_count"
#ifndef MID_INUSE_TLS_MAP_SIZE
#define MID_INUSE_TLS_MAP_SIZE 32  // 32 or 64, fixed at compile time
#endif

typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
} MidInuseTlsPageMap;

static __thread MidInuseTlsPageMap g_mid_inuse_tls_map;
```
## Algorithm
### mid_inuse_dec_deferred(raw) - HOT
```c
static inline void mid_inuse_dec_deferred(void* raw) {
    if (!hak_pool_mid_inuse_deferred_enabled()) {
        mid_page_inuse_dec_and_maybe_dn(raw);  // Fallback
        return;
    }
    void* page = (void*)((uintptr_t)raw & ~(uintptr_t)(POOL_PAGE_SIZE - 1));
    // Find or insert in TLS map
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        if (g_mid_inuse_tls_map.pages[i] == page) {
            g_mid_inuse_tls_map.counts[i]++;
            STAT_INC(mid_inuse_deferred_hit);
            return;
        }
    }
    // New page entry
    if (g_mid_inuse_tls_map.used >= MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_deferred_drain();  // Flush when full
    }
    uint32_t idx = g_mid_inuse_tls_map.used++;
    g_mid_inuse_tls_map.pages[idx] = page;
    g_mid_inuse_tls_map.counts[idx] = 1;
    STAT_INC(mid_inuse_deferred_hit);
}
```
### mid_inuse_deferred_drain() - COLD (only lookup boundary)
```c
static inline void mid_inuse_deferred_drain(void) {
    STAT_INC(mid_inuse_deferred_drain_calls);
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        void*    page = g_mid_inuse_tls_map.pages[i];
        uint32_t n    = g_mid_inuse_tls_map.counts[i];
        // ONLY lookup happens here (batched: one per unique page)
        MidPageDesc* d = mid_desc_lookup(page);
        if (d) {
            uint64_t old = atomic_fetch_sub(&d->in_use, n);
            STAT_ADD(mid_inuse_deferred_decs_drained, n);  // total decrements applied
            STAT_INC(mid_inuse_deferred_pages_drained);    // unique pages processed
            // Empty transition: in_use reached 0 (old == n)
            if (old == n) {
                STAT_INC(mid_inuse_deferred_empty_transitions);
                // pending_dn logic (existing)
                if (d->pending_dn == 0) {
                    d->pending_dn = 1;
                    hak_batch_add_page(page);
                }
            }
        }
    }
    g_mid_inuse_tls_map.used = 0;  // Clear map
}
```
## Drain Boundaries (Critical)
**DO NOT drain in the hot path.** Drain only at these cold/rare points:
1. **TLS map full** - inside `mid_inuse_dec_deferred()` (once per overflow)
2. **Refill/slow boundary** - add one call in the pool alloc refill or the slow free tail (see the sketch below)
3. **Thread exit** - if thread cleanup exists (optional)
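A minimal sketch of boundary 2. The wrapper name and the refill entry point `pool_mid_refill_and_alloc` are hypothetical placeholders; only `mid_inuse_deferred_drain()` comes from this design.

```c
// Hypothetical refill-boundary hook: the allocator has already left the hot
// path to refill, so the batched (lookup-bearing) drain is cheap here.
static void* pool_mid_alloc_refill_slow(size_t size) {
    mid_inuse_deferred_drain();              // cold boundary: flush TLS map
    return pool_mid_refill_and_alloc(size);  // assumed existing slow path
}
```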
## ENV Gate
```c
// HAKMEM_POOL_MID_INUSE_DEFERRED=1 (default 0); needs <stdlib.h> for getenv()
static inline int hak_pool_mid_inuse_deferred_enabled(void) {
    static int g = -1;  // -1 = not yet read; cached after first call
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_POOL_MID_INUSE_DEFERRED");
        g = (e && *e == '1') ? 1 : 0;
    }
    return g;
}
```
Related knobs:
- `HAKMEM_POOL_MID_INUSE_MAP_KIND=linear|hash` (default `linear`)
  - Selects the TLS page-map implementation used by the hot path (a sketch of the `hash` kind follows below).
- `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0/1` (default `0`)
  - Enables debug counters + exit dump. Keep OFF for perf runs.
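For reference, a minimal sketch of what the `hash` map kind could look like, assuming the same TLS struct, a power-of-two `MID_INUSE_TLS_MAP_SIZE`, open addressing keyed on the page address, and `NULL` as the empty-slot marker. The helper name is invented for illustration.

```c
// Illustrative only: open-addressed find-or-insert for the `hash` map kind.
// Assumes MID_INUSE_TLS_MAP_SIZE is a power of two and empty slots hold NULL
// (the drain would then scan all slots and memset the map afterwards).
static inline int mid_inuse_map_find_or_insert(MidInuseTlsPageMap* m, void* page) {
    uintptr_t h = (uintptr_t)page >> 12;  // pages are page-size aligned
    h ^= h >> 16;                         // cheap mix of higher bits
    for (uint32_t probe = 0; probe < MID_INUSE_TLS_MAP_SIZE; probe++) {
        uint32_t i = (uint32_t)(h + probe) & (MID_INUSE_TLS_MAP_SIZE - 1);
        if (m->pages[i] == page) return (int)i;  // existing entry: hit
        if (m->pages[i] == NULL) {               // empty slot: claim it
            m->pages[i]  = page;
            m->counts[i] = 0;                    // caller increments
            m->used++;
            return (int)i;
        }
    }
    return -1;  // table full: caller drains, then retries
}
```

The trade-off versus `linear` is fewer comparisons per miss at the cost of a full-map clear on drain and slightly worse locality, which is presumably why `linear` is the default for a 32-entry map.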
## Implementation Patches (Order)
| Step | File | Description |
|------|------|-------------|
| 1 | `pool_mid_inuse_deferred_env_box.h` | ENV gate |
| 2 | `pool_mid_inuse_tls_pagemap_box.h` | TLS map box |
| 3 | `pool_mid_inuse_deferred_box.h` | deferred API (dec + drain) |
| 4 | `pool_free_v1_box.h` | Replace tail with deferred (ENV ON only) |
| 5 | `pool_mid_inuse_deferred_stats_box.h` | Counters |
| 6 | A/B benchmark | Validate |
## Stats Counters
```c
typedef struct {
    _Atomic uint64_t mid_inuse_deferred_hit;  // deferred dec calls (hot)
    _Atomic uint64_t drain_calls;             // drain invocations (cold)
    _Atomic uint64_t pages_drained;           // unique pages processed
    _Atomic uint64_t decs_drained;            // total decrements applied
    _Atomic uint64_t empty_transitions;       // pages whose in_use reached 0
} MidInuseDeferredStats;
```
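With `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=1`, an exit dump could look like the following sketch. The global instance, the gate helper, and the output format are assumptions; only the struct fields come from the design above.

```c
#include <inttypes.h>
#include <stdatomic.h>
#include <stdio.h>

extern MidInuseDeferredStats g_mid_inuse_deferred_stats;  // assumed global

// Hypothetical exit dump; hak_pool_mid_inuse_deferred_stats_enabled() is an
// assumed gate reading HAKMEM_POOL_MID_INUSE_DEFERRED_STATS.
__attribute__((destructor))
static void mid_inuse_deferred_stats_dump(void) {
    if (!hak_pool_mid_inuse_deferred_stats_enabled()) return;
    MidInuseDeferredStats* s = &g_mid_inuse_deferred_stats;
    fprintf(stderr,
        "[mid_inuse_deferred] hit=%" PRIu64 " drains=%" PRIu64
        " pages=%" PRIu64 " decs=%" PRIu64 " empty=%" PRIu64 "\n",
        atomic_load(&s->mid_inuse_deferred_hit),
        atomic_load(&s->drain_calls),
        atomic_load(&s->pages_drained),
        atomic_load(&s->decs_drained),
        atomic_load(&s->empty_transitions));
}
```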
**Goal**: With fastsplit ON + deferred ON:
- fast path lookup = 0
- drain calls = rare (low frequency)
## Safety Analysis
| Concern | Analysis |
|---------|----------|
| Race condition | dec delayed → in_use appears larger → DONTNEED delayed (safe direction) |
| Double free | No change (header check still in place) |
| Early release | Impossible (dec is delayed, not advanced) |
| Memory pressure | Slightly delayed DONTNEED, acceptable |
## Acceptance Gates
| Workload | Metric | Criteria |
|----------|--------|----------|
| Mixed (MIXED_TINYV3_C7_SAFE) | Median | No regression |
| Mixed | CV | Clear reduction (matches cache trend) |
| C6-heavy (C6_HEAVY_LEGACY_POOLV1) | Throughput | <2% regression, ideally +2% |
| pending_dn | Timing | May fire later than today; must never fire earlier |
## Expected Result
After this phase, pool free hot path becomes:
```
header check → TLS push → deferred bookkeeping (O(1), no lookup)
```
This is very close to mimalloc's O(1) fast free design.
## Files to Modify
- `core/box/pool_mid_inuse_deferred_env_box.h` (NEW)
- `core/box/pool_mid_inuse_tls_pagemap_box.h` (NEW)
- `core/box/pool_mid_inuse_deferred_box.h` (NEW)
- `core/box/pool_free_v1_box.h` (MODIFY - add deferred call)
- `core/box/pool_mid_inuse_deferred_stats_box.h` (NEW)