# Phase POOL-MID-DN-BATCH: Deferred inuse_dec Design

## Goal

- Completely eliminate `mid_desc_lookup*` from the `hak_pool_free_v1_fast_impl` hot path
- Target: Mixed median +2-4%, plus tail/variance reduction (as seen in the cache A/B)

## Background

### A/B Benchmark Results (2025-12-12)

| Metric | Baseline | Cache ON | Improvement |
|--------|----------|----------|-------------|
| Median throughput | 8.72M ops/s | 8.93M ops/s | +2.3% |
| Worst-case | 6.46M ops/s | 7.52M ops/s | **+16.5%** |
| CV (variance) | 13.3% | 7.2% | **-46%** |

**Insight**: The cache improves stability more than raw speed. The deferred design should do even better, since it removes the lookup from the hot path entirely.

## Box Theory Design

### L0: MidInuseDeferredBox

```c
// Hot API (lookup/atomic/lock PROHIBITED)
static inline void mid_inuse_dec_deferred(void* raw);

// Cold API (the ONLY lookup boundary)
static inline void mid_inuse_deferred_drain(void);
```

### L1: MidInuseTlsPageMapBox

```c
// TLS fixed-size map (32 or 64 entries)
// Single responsibility: "bundle page→dec_count"
typedef struct {
    void*    pages[MID_INUSE_TLS_MAP_SIZE];
    uint32_t counts[MID_INUSE_TLS_MAP_SIZE];
    uint32_t used;
} MidInuseTlsPageMap;

static __thread MidInuseTlsPageMap g_mid_inuse_tls_map;
```

## Algorithm

### mid_inuse_dec_deferred(raw) - HOT

```c
static inline void mid_inuse_dec_deferred(void* raw) {
    if (!hak_pool_mid_inuse_deferred_enabled()) {
        mid_page_inuse_dec_and_maybe_dn(raw);  // Fallback: immediate dec
        return;
    }

    void* page = (void*)((uintptr_t)raw & ~(uintptr_t)(POOL_PAGE_SIZE - 1));

    // Find or insert in the TLS map
    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        if (g_mid_inuse_tls_map.pages[i] == page) {
            g_mid_inuse_tls_map.counts[i]++;
            STAT_INC(mid_inuse_deferred_hit);
            return;
        }
    }

    // New page entry
    if (g_mid_inuse_tls_map.used >= MID_INUSE_TLS_MAP_SIZE) {
        mid_inuse_deferred_drain();  // Flush when full (resets used to 0)
    }

    uint32_t idx = g_mid_inuse_tls_map.used++;
    g_mid_inuse_tls_map.pages[idx]  = page;
    g_mid_inuse_tls_map.counts[idx] = 1;
    STAT_INC(mid_inuse_deferred_hit);
}
```

### mid_inuse_deferred_drain() - COLD (the only lookup boundary)

```c
static inline void mid_inuse_deferred_drain(void) {
    STAT_INC(mid_inuse_deferred_drain_calls);

    for (uint32_t i = 0; i < g_mid_inuse_tls_map.used; i++) {
        void*    page = g_mid_inuse_tls_map.pages[i];
        uint32_t n    = g_mid_inuse_tls_map.counts[i];

        // The ONLY lookup happens here (batched, one per unique page)
        MidPageDesc* d = mid_desc_lookup(page);
        if (d) {
            uint64_t old = atomic_fetch_sub(&d->in_use, n);
            STAT_INC(mid_inuse_deferred_pages_drained);
            STAT_ADD(mid_inuse_deferred_decs_drained, n);

            // Check for the empty transition (existing logic):
            // old == n means this batch took in_use down to exactly 0.
            if (old == n) {
                STAT_INC(mid_inuse_deferred_empty_transitions);
                // pending_dn logic (existing)
                if (d->pending_dn == 0) {
                    d->pending_dn = 1;
                    hak_batch_add_page(page);
                }
            }
        }
    }

    g_mid_inuse_tls_map.used = 0;  // Clear map
}
```

## Drain Boundaries (Critical)

**DO NOT drain in the hot path.** Drain only at these cold/rare points:

1. **TLS map full** - inside `mid_inuse_dec_deferred()` (once per overflow)
2. **Refill/slow boundary** - add one call in the pool alloc refill or slow free tail
3. **Thread exit** - if thread cleanup exists (optional)

## ENV Gate

```c
// HAKMEM_POOL_MID_INUSE_DEFERRED=1 (default 0)
static inline int hak_pool_mid_inuse_deferred_enabled(void) {
    static int g = -1;
    if (__builtin_expect(g == -1, 0)) {
        const char* e = getenv("HAKMEM_POOL_MID_INUSE_DEFERRED");
        g = (e && *e == '1') ? 1 : 0;
    }
    return g;
}
```

Related knobs:

- `HAKMEM_POOL_MID_INUSE_MAP_KIND=linear|hash` (default `linear`)
  - TLS page-map implementation used by the hot path.
- `HAKMEM_POOL_MID_INUSE_DEFERRED_STATS=0/1` (default `0`)
  - Enables debug counters + exit dump. Keep OFF for perf runs.
## Implementation Patches (Order)

| Step | File | Description |
|------|------|-------------|
| 1 | `pool_mid_inuse_deferred_env_box.h` | ENV gate |
| 2 | `pool_mid_inuse_tls_pagemap_box.h` | TLS map box |
| 3 | `pool_mid_inuse_deferred_box.h` | Deferred API (dec + drain) |
| 4 | `pool_free_v1_box.h` | Replace tail with deferred (ENV ON only) |
| 5 | `pool_mid_inuse_deferred_stats_box.h` | Counters |
| 6 | A/B benchmark | Validate |
## Stats Counters

```c
typedef struct {
    _Atomic uint64_t mid_inuse_deferred_hit;  // deferred dec calls (hot)
    _Atomic uint64_t drain_calls;             // drain invocations (cold)
    _Atomic uint64_t pages_drained;           // unique pages processed
    _Atomic uint64_t decs_drained;            // total decrements applied
    _Atomic uint64_t empty_transitions;       // pages that reached 0
} MidInuseDeferredStats;
```

**Goal**: With fastsplit ON + deferred ON:

- fast-path lookups = 0
- drain calls = rare (low frequency)
## Safety Analysis

| Concern | Analysis |
|---------|----------|
| Race condition | dec is delayed → `in_use` appears larger → DONTNEED is delayed (the safe direction) |
| Double free | No change (header check still in place) |
| Early release | Impossible (dec is delayed, never advanced) |
| Memory pressure | DONTNEED slightly delayed; acceptable |
## Acceptance Gates

| Workload | Metric | Criteria |
|----------|--------|----------|
| Mixed (MIXED_TINYV3_C7_SAFE) | Median | No regression |
| Mixed | CV | Clear reduction (matching the cache trend) |
| C6-heavy (C6_HEAVY_LEGACY_POOLV1) | Throughput | <2% regression, ideally +2% |
| pending_dn | Timing | Delayed is OK; earlier is not acceptable |
## Expected Result

After this phase, the pool free hot path becomes:

```
header check → TLS push → deferred bookkeeping (O(1), no lookup)
```

This is very close to mimalloc's O(1) fast-free design.
## Files to Modify

- `core/box/pool_mid_inuse_deferred_env_box.h` (NEW)
- `core/box/pool_mid_inuse_tls_pagemap_box.h` (NEW)
- `core/box/pool_mid_inuse_deferred_box.h` (NEW)
- `core/box/pool_free_v1_box.h` (MODIFY - add deferred call)
- `core/box/pool_mid_inuse_deferred_stats_box.h` (NEW)