Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte header during allocation, but linear carve/refill and initial slab capacity still used bare class block sizes. This mismatch could overrun slab usable space and corrupt freelists, causing reproducible SEGV at ~100k iters. Changes - Superslab: compute capacity with effective stride (block_size + header for classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a debug-only bound check in superslab_alloc_from_slab() to fail fast if carve would exceed usable bytes. - Refill (non-P0 and P0): use header-aware stride for all linear carving and TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h also uses stride, not raw class size. - Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes before splicing into freelist (already present). Notes - This unifies the memory layout across alloc/linear-carve/refill with a single stride definition and keeps class7 (1024B) headerless as designed. - Debug builds add fail-fast checks; release builds remain lean. Next - Re-run Tiny benches (256/1024B) in debug to confirm stability, then in release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0 to isolate P0 batch carve, and continue reducing branch-miss as planned.
This commit is contained in:
325
ACE_POOL_ARCHITECTURE_INVESTIGATION.md
Normal file
325
ACE_POOL_ARCHITECTURE_INVESTIGATION.md
Normal file
@ -0,0 +1,325 @@
|
||||
# ACE-Pool Architecture Investigation Report
|
||||
|
||||
## Executive Summary
|
||||
|
||||
**Root Cause Found:** Bridge classes (40KB, 52KB) are disabled at initialization due to conflicting code paths. The Pool init code expects them from Policy, but Policy disabled them in Phase 6.21. **Fix is trivial: Don't overwrite hardcoded Bridge classes with 0.**
|
||||
|
||||
## Part 1: Root Cause Analysis
|
||||
|
||||
### The Bug Chain
|
||||
|
||||
1. **Policy Phase 6.21 Change:**
|
||||
```c
|
||||
// core/hakmem_policy.c:53-55
|
||||
pol->mid_dyn1_bytes = 0; // Disabled (Bridge classes now hardcoded)
|
||||
pol->mid_dyn2_bytes = 0; // Disabled
|
||||
```
|
||||
|
||||
2. **Pool Init Overwrites Bridge Classes:**
|
||||
```c
|
||||
// core/box/pool_init_api.inc.h:9-17
|
||||
if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
|
||||
g_class_sizes[5] = pol->mid_dyn1_bytes;
|
||||
} else {
|
||||
g_class_sizes[5] = 0; // ← BRIDGE CLASS 5 (40KB) DISABLED!
|
||||
}
|
||||
```
|
||||
|
||||
3. **Pool Has Bridge Classes Hardcoded:**
|
||||
```c
|
||||
// core/hakmem_pool.c:810-817
|
||||
static size_t g_class_sizes[POOL_NUM_CLASSES] = {
|
||||
POOL_CLASS_2KB, // 2 KB
|
||||
POOL_CLASS_4KB, // 4 KB
|
||||
POOL_CLASS_8KB, // 8 KB
|
||||
POOL_CLASS_16KB, // 16 KB
|
||||
POOL_CLASS_32KB, // 32 KB
|
||||
POOL_CLASS_40KB, // 40 KB (Bridge class 0) ← GETS OVERWRITTEN TO 0!
|
||||
POOL_CLASS_52KB // 52 KB (Bridge class 1) ← GETS OVERWRITTEN TO 0!
|
||||
};
|
||||
```
|
||||
|
||||
4. **Result: 33KB Allocation Fails:**
|
||||
- ACE rounds 33KB → 40KB (Bridge class 5)
|
||||
- Pool lookup: `g_class_sizes[5] = 0` → class disabled
|
||||
- Pool returns NULL
|
||||
- Fallback to mmap (1.03M ops/s instead of 50-80M)
|
||||
|
||||
### Why Pre-allocation Code Never Runs
|
||||
|
||||
```c
|
||||
// core/box/pool_init_api.inc.h:101-106
|
||||
if (g_class_sizes[5] != 0) { // ← FALSE because g_class_sizes[5] = 0
|
||||
// Pre-allocation code NEVER executes
|
||||
for (int s = 0; s < prewarm_pages && s < POOL_NUM_SHARDS; s++) {
|
||||
refill_freelist(5, s);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
The pre-allocation code is correct but never runs because the Bridge classes are disabled!
|
||||
|
||||
## Part 2: Boxing Analysis
|
||||
|
||||
### Current Architecture Problems
|
||||
|
||||
**1. Conflicting Ownership:**
|
||||
- Policy thinks it owns Bridge class configuration (DYN1/DYN2)
|
||||
- Pool has Bridge classes hardcoded
|
||||
- Pool init overwrites hardcoded values with Policy's 0s
|
||||
|
||||
**2. Invisible Failures:**
|
||||
- No error when Bridge classes get disabled
|
||||
- No warning when Pool returns NULL
|
||||
- No trace showing why allocation failed
|
||||
|
||||
**3. Mixed Responsibilities:**
|
||||
- `pool_init_api.inc.h` does both init AND policy configuration
|
||||
- ACE does rounding AND allocation AND fallback
|
||||
- No clear separation of concerns
|
||||
|
||||
### Data Flow Tracing
|
||||
|
||||
```
|
||||
33KB allocation request
|
||||
→ hkm_ace_alloc()
|
||||
→ round_to_mid_class(33KB, wmax=1.33) → 40KB ✓
|
||||
→ hak_pool_try_alloc(40KB)
|
||||
→ hak_pool_init() (pthread_once)
|
||||
→ hak_pool_get_class_index(40KB)
|
||||
→ Check g_class_sizes[5] = 0 ✗
|
||||
→ Return -1 (not found)
|
||||
→ Pool returns NULL
|
||||
→ ACE tries Large rounding (fails)
|
||||
→ Fallback to mmap ✗
|
||||
```
|
||||
|
||||
### Missing Boxes
|
||||
|
||||
1. **Configuration Validator Box:**
|
||||
- Should verify Bridge classes are enabled
|
||||
- Should warn if Policy conflicts with Pool
|
||||
|
||||
2. **Allocation Router Box:**
|
||||
- Central decision point for allocation strategy
|
||||
- Clear logging of routing decisions
|
||||
|
||||
3. **Pool Health Check Box:**
|
||||
- Verify all classes are properly configured
|
||||
- Check if pre-allocation succeeded
|
||||
|
||||
## Part 3: Central Checker Box Design
|
||||
|
||||
### Proposed Architecture
|
||||
|
||||
```c
|
||||
// core/box/ace_pool_checker.h
|
||||
typedef struct {
|
||||
bool ace_enabled;
|
||||
bool pool_initialized;
|
||||
bool bridge_classes_enabled;
|
||||
bool pool_has_pages[POOL_NUM_CLASSES];
|
||||
size_t class_sizes[POOL_NUM_CLASSES];
|
||||
const char* last_error;
|
||||
} AcePoolHealthStatus;
|
||||
|
||||
// Central validation point
|
||||
AcePoolHealthStatus* hak_ace_pool_health_check(void);
|
||||
|
||||
// Routing with validation
|
||||
void* hak_ace_pool_route_alloc(size_t size, uintptr_t site_id) {
|
||||
// 1. Check health
|
||||
AcePoolHealthStatus* health = hak_ace_pool_health_check();
|
||||
if (!health->ace_enabled) {
|
||||
LOG("ACE disabled, fallback to system");
|
||||
return NULL;
|
||||
}
|
||||
|
||||
// 2. Validate Pool
|
||||
if (!health->pool_initialized) {
|
||||
LOG("Pool not initialized!");
|
||||
hak_pool_init();
|
||||
health = hak_ace_pool_health_check(); // Re-check
|
||||
}
|
||||
|
||||
// 3. Check Bridge classes
|
||||
size_t rounded = round_to_mid_class(size, 1.33, NULL);
|
||||
int class_idx = hak_pool_get_class_index(rounded);
|
||||
if (class_idx >= 0 && health->class_sizes[class_idx] == 0) {
|
||||
LOG("ERROR: Class %d disabled (size=%zu)", class_idx, rounded);
|
||||
return NULL;
|
||||
}
|
||||
|
||||
// 4. Try allocation with logging
|
||||
LOG("Routing %zu → class %d (size=%zu)", size, class_idx, rounded);
|
||||
void* ptr = hak_pool_try_alloc(rounded, site_id);
|
||||
if (!ptr) {
|
||||
LOG("Pool allocation failed for class %d", class_idx);
|
||||
}
|
||||
return ptr;
|
||||
}
|
||||
```
|
||||
|
||||
### Integration Points
|
||||
|
||||
1. **Replace silent failures with logged checker:**
|
||||
```c
|
||||
// Before: Silent failure
|
||||
void* p = hak_pool_try_alloc(r, site_id);
|
||||
|
||||
// After: Checked and logged
|
||||
void* p = hak_ace_pool_route_alloc(size, site_id);
|
||||
```
|
||||
|
||||
2. **Add health check command:**
|
||||
```c
|
||||
// In main() or benchmark
|
||||
if (getenv("HAKMEM_HEALTH_CHECK")) {
|
||||
AcePoolHealthStatus* h = hak_ace_pool_health_check();
|
||||
fprintf(stderr, "ACE: %s\n", h->ace_enabled ? "ON" : "OFF");
|
||||
fprintf(stderr, "Pool: %s\n", h->pool_initialized ? "OK" : "NOT INIT");
|
||||
for (int i = 0; i < POOL_NUM_CLASSES; i++) {
|
||||
fprintf(stderr, "Class %d: %zu KB %s\n",
|
||||
i, h->class_sizes[i]/1024,
|
||||
h->class_sizes[i] ? "ENABLED" : "DISABLED");
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Part 4: Immediate Fix
|
||||
|
||||
### Quick Fix #1: Don't Overwrite Bridge Classes
|
||||
|
||||
```diff
|
||||
// core/box/pool_init_api.inc.h:9-17
|
||||
- if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
|
||||
- g_class_sizes[5] = pol->mid_dyn1_bytes;
|
||||
- } else {
|
||||
- g_class_sizes[5] = 0;
|
||||
- }
|
||||
+ // Phase 6.21: Bridge classes are hardcoded, don't overwrite with 0
|
||||
+ if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
|
||||
+ g_class_sizes[5] = pol->mid_dyn1_bytes; // Only override if Policy provides valid value
|
||||
+ }
|
||||
+ // Otherwise keep the hardcoded POOL_CLASS_40KB
|
||||
```
|
||||
|
||||
### Quick Fix #2: Force Bridge Classes (Simpler)
|
||||
|
||||
```diff
|
||||
// core/box/pool_init_api.inc.h:7 (in hak_pool_init_impl)
|
||||
static void hak_pool_init_impl(void) {
|
||||
const FrozenPolicy* pol = hkm_policy_get();
|
||||
+
|
||||
+ // Phase 6.21 CRITICAL FIX: Bridge classes are hardcoded, not from Policy
|
||||
+ // DO NOT overwrite them with 0!
|
||||
+ /*
|
||||
if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
|
||||
g_class_sizes[5] = pol->mid_dyn1_bytes;
|
||||
} else {
|
||||
g_class_sizes[5] = 0;
|
||||
}
|
||||
if (pol && pol->mid_dyn2_bytes >= POOL_MIN_SIZE && pol->mid_dyn2_bytes <= POOL_MAX_SIZE) {
|
||||
g_class_sizes[6] = pol->mid_dyn2_bytes;
|
||||
} else {
|
||||
g_class_sizes[6] = 0;
|
||||
}
|
||||
+ */
|
||||
+ // Bridge classes stay as initialized in g_class_sizes (40KB, 52KB)
|
||||
```
|
||||
|
||||
### Quick Fix #3: Add Debug Logging (For Verification)
|
||||
|
||||
```diff
|
||||
// core/box/pool_init_api.inc.h:84-95
|
||||
g_pool.initialized = 1;
|
||||
HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");
|
||||
+ HAKMEM_LOG("[Pool] Class sizes after init:\n");
|
||||
+ for (int i = 0; i < POOL_NUM_CLASSES; i++) {
|
||||
+ HAKMEM_LOG(" Class %d: %zu KB %s\n",
|
||||
+ i, g_class_sizes[i]/1024,
|
||||
+ g_class_sizes[i] ? "ENABLED" : "DISABLED");
|
||||
+ }
|
||||
```
|
||||
|
||||
## Recommended Actions
|
||||
|
||||
### Immediate (NOW):
|
||||
1. Apply Quick Fix #2 (comment out the overwrite code)
|
||||
2. Rebuild with debug logging
|
||||
3. Test: `HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem`
|
||||
4. Expected: 50-80M ops/s (vs current 1.03M)
|
||||
|
||||
### Short-term (1-2 days):
|
||||
1. Implement Central Checker Box
|
||||
2. Add health check API
|
||||
3. Add allocation routing logs
|
||||
|
||||
### Long-term (1 week):
|
||||
1. Refactor Pool/Policy bridge class ownership
|
||||
2. Separate init from configuration
|
||||
3. Add comprehensive boxing tests
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```
|
||||
Current (BROKEN):
|
||||
================
|
||||
[Policy]
|
||||
↓ mid_dyn1=0, mid_dyn2=0
|
||||
[Pool Init]
|
||||
↓ Overwrites g_class_sizes[5]=0, [6]=0
|
||||
[Pool]
|
||||
↓ Bridge classes DISABLED
|
||||
[ACE Alloc]
|
||||
↓ 33KB → 40KB (class 5)
|
||||
[Pool Lookup]
|
||||
↓ g_class_sizes[5]=0 → FAIL
|
||||
[mmap fallback] ← 1.03M ops/s
|
||||
|
||||
Proposed (FIXED):
|
||||
================
|
||||
[Policy]
|
||||
↓ (Bridge config ignored)
|
||||
[Pool Init]
|
||||
↓ Keep hardcoded g_class_sizes
|
||||
[Central Checker] ← NEW
|
||||
↓ Validate all components
|
||||
[Pool]
|
||||
↓ Bridge classes ENABLED (40KB, 52KB)
|
||||
[ACE Alloc]
|
||||
↓ 33KB → 40KB (class 5)
|
||||
[Pool Lookup]
|
||||
↓ g_class_sizes[5]=40KB → SUCCESS
|
||||
[Pool Pages] ← 50-80M ops/s
|
||||
```
|
||||
|
||||
## Test Commands
|
||||
|
||||
```bash
|
||||
# Before fix (current broken state)
|
||||
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
|
||||
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
|
||||
# Result: 1.03M ops/s (mmap fallback)
|
||||
|
||||
# After fix (comment out lines 9-17)
|
||||
vim core/box/pool_init_api.inc.h
|
||||
# Comment out lines 9-17
|
||||
make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
|
||||
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
|
||||
# Expected: 50-80M ops/s (Pool working!)
|
||||
|
||||
# With debug verification
|
||||
HAKMEM_LOG_LEVEL=3 HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem 2>&1 | grep "Class 5"
|
||||
# Should show: "Class 5: 40 KB ENABLED"
|
||||
```
|
||||
|
||||
## Conclusion
|
||||
|
||||
**The bug is trivial:** Pool init code overwrites hardcoded Bridge classes with 0 because Policy disabled them in Phase 6.21.
|
||||
|
||||
**The fix is trivial:** Don't overwrite them. Comment out 9 lines.
|
||||
|
||||
**The impact is massive:** 50-80x performance improvement (1.03M → 50-80M ops/s).
|
||||
|
||||
**The lesson:** When two components (Policy and Pool) both think they own configuration, silent failures occur. Need better boxing with clear ownership boundaries and validation points.
|
||||
Reference in New Issue
Block a user