- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte header during allocation, but linear carve/refill and initial slab capacity still used bare class block sizes. This mismatch could overrun slab usable space and corrupt freelists, causing reproducible SEGV at ~100k iters. Changes - Superslab: compute capacity with effective stride (block_size + header for classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a debug-only bound check in superslab_alloc_from_slab() to fail fast if carve would exceed usable bytes. - Refill (non-P0 and P0): use header-aware stride for all linear carving and TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h also uses stride, not raw class size. - Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes before splicing into freelist (already present). Notes - This unifies the memory layout across alloc/linear-carve/refill with a single stride definition and keeps class7 (1024B) headerless as designed. - Debug builds add fail-fast checks; release builds remain lean. Next - Re-run Tiny benches (256/1024B) in debug to confirm stability, then in release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0 to isolate P0 batch carve, and continue reducing branch-miss as planned.
9.7 KiB
ACE-Pool Architecture Investigation Report
Executive Summary
Root Cause Found: Bridge classes (40KB, 52KB) are disabled at initialization due to conflicting code paths. The Pool init code expects them from Policy, but Policy disabled them in Phase 6.21. Fix is trivial: Don't overwrite hardcoded Bridge classes with 0.
Part 1: Root Cause Analysis
The Bug Chain
-
Policy Phase 6.21 Change:
// core/hakmem_policy.c:53-55 pol->mid_dyn1_bytes = 0; // Disabled (Bridge classes now hardcoded) pol->mid_dyn2_bytes = 0; // Disabled -
Pool Init Overwrites Bridge Classes:
// core/box/pool_init_api.inc.h:9-17 if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) { g_class_sizes[5] = pol->mid_dyn1_bytes; } else { g_class_sizes[5] = 0; // ← BRIDGE CLASS 5 (40KB) DISABLED! } -
Pool Has Bridge Classes Hardcoded:
// core/hakmem_pool.c:810-817 static size_t g_class_sizes[POOL_NUM_CLASSES] = { POOL_CLASS_2KB, // 2 KB POOL_CLASS_4KB, // 4 KB POOL_CLASS_8KB, // 8 KB POOL_CLASS_16KB, // 16 KB POOL_CLASS_32KB, // 32 KB POOL_CLASS_40KB, // 40 KB (Bridge class 0) ← GETS OVERWRITTEN TO 0! POOL_CLASS_52KB // 52 KB (Bridge class 1) ← GETS OVERWRITTEN TO 0! }; -
Result: 33KB Allocation Fails:
- ACE rounds 33KB → 40KB (Bridge class 5)
- Pool lookup:
g_class_sizes[5] = 0→ class disabled - Pool returns NULL
- Fallback to mmap (1.03M ops/s instead of 50-80M)
Why Pre-allocation Code Never Runs
// core/box/pool_init_api.inc.h:101-106
if (g_class_sizes[5] != 0) { // ← FALSE because g_class_sizes[5] = 0
// Pre-allocation code NEVER executes
for (int s = 0; s < prewarm_pages && s < POOL_NUM_SHARDS; s++) {
refill_freelist(5, s);
}
}
The pre-allocation code is correct but never runs because the Bridge classes are disabled!
Part 2: Boxing Analysis
Current Architecture Problems
1. Conflicting Ownership:
- Policy thinks it owns Bridge class configuration (DYN1/DYN2)
- Pool has Bridge classes hardcoded
- Pool init overwrites hardcoded values with Policy's 0s
2. Invisible Failures:
- No error when Bridge classes get disabled
- No warning when Pool returns NULL
- No trace showing why allocation failed
3. Mixed Responsibilities:
pool_init_api.inc.hdoes both init AND policy configuration- ACE does rounding AND allocation AND fallback
- No clear separation of concerns
Data Flow Tracing
33KB allocation request
→ hkm_ace_alloc()
→ round_to_mid_class(33KB, wmax=1.33) → 40KB ✓
→ hak_pool_try_alloc(40KB)
→ hak_pool_init() (pthread_once)
→ hak_pool_get_class_index(40KB)
→ Check g_class_sizes[5] = 0 ✗
→ Return -1 (not found)
→ Pool returns NULL
→ ACE tries Large rounding (fails)
→ Fallback to mmap ✗
Missing Boxes
-
Configuration Validator Box:
- Should verify Bridge classes are enabled
- Should warn if Policy conflicts with Pool
-
Allocation Router Box:
- Central decision point for allocation strategy
- Clear logging of routing decisions
-
Pool Health Check Box:
- Verify all classes are properly configured
- Check if pre-allocation succeeded
Part 3: Central Checker Box Design
Proposed Architecture
// core/box/ace_pool_checker.h
typedef struct {
bool ace_enabled;
bool pool_initialized;
bool bridge_classes_enabled;
bool pool_has_pages[POOL_NUM_CLASSES];
size_t class_sizes[POOL_NUM_CLASSES];
const char* last_error;
} AcePoolHealthStatus;
// Central validation point
AcePoolHealthStatus* hak_ace_pool_health_check(void);
// Routing with validation
void* hak_ace_pool_route_alloc(size_t size, uintptr_t site_id) {
// 1. Check health
AcePoolHealthStatus* health = hak_ace_pool_health_check();
if (!health->ace_enabled) {
LOG("ACE disabled, fallback to system");
return NULL;
}
// 2. Validate Pool
if (!health->pool_initialized) {
LOG("Pool not initialized!");
hak_pool_init();
health = hak_ace_pool_health_check(); // Re-check
}
// 3. Check Bridge classes
size_t rounded = round_to_mid_class(size, 1.33, NULL);
int class_idx = hak_pool_get_class_index(rounded);
if (class_idx >= 0 && health->class_sizes[class_idx] == 0) {
LOG("ERROR: Class %d disabled (size=%zu)", class_idx, rounded);
return NULL;
}
// 4. Try allocation with logging
LOG("Routing %zu → class %d (size=%zu)", size, class_idx, rounded);
void* ptr = hak_pool_try_alloc(rounded, site_id);
if (!ptr) {
LOG("Pool allocation failed for class %d", class_idx);
}
return ptr;
}
Integration Points
-
Replace silent failures with logged checker:
// Before: Silent failure void* p = hak_pool_try_alloc(r, site_id); // After: Checked and logged void* p = hak_ace_pool_route_alloc(size, site_id); -
Add health check command:
// In main() or benchmark if (getenv("HAKMEM_HEALTH_CHECK")) { AcePoolHealthStatus* h = hak_ace_pool_health_check(); fprintf(stderr, "ACE: %s\n", h->ace_enabled ? "ON" : "OFF"); fprintf(stderr, "Pool: %s\n", h->pool_initialized ? "OK" : "NOT INIT"); for (int i = 0; i < POOL_NUM_CLASSES; i++) { fprintf(stderr, "Class %d: %zu KB %s\n", i, h->class_sizes[i]/1024, h->class_sizes[i] ? "ENABLED" : "DISABLED"); } }
Part 4: Immediate Fix
Quick Fix #1: Don't Overwrite Bridge Classes
// core/box/pool_init_api.inc.h:9-17
- if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
- g_class_sizes[5] = pol->mid_dyn1_bytes;
- } else {
- g_class_sizes[5] = 0;
- }
+ // Phase 6.21: Bridge classes are hardcoded, don't overwrite with 0
+ if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
+ g_class_sizes[5] = pol->mid_dyn1_bytes; // Only override if Policy provides valid value
+ }
+ // Otherwise keep the hardcoded POOL_CLASS_40KB
Quick Fix #2: Force Bridge Classes (Simpler)
// core/box/pool_init_api.inc.h:7 (in hak_pool_init_impl)
static void hak_pool_init_impl(void) {
const FrozenPolicy* pol = hkm_policy_get();
+
+ // Phase 6.21 CRITICAL FIX: Bridge classes are hardcoded, not from Policy
+ // DO NOT overwrite them with 0!
+ /*
if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
g_class_sizes[5] = pol->mid_dyn1_bytes;
} else {
g_class_sizes[5] = 0;
}
if (pol && pol->mid_dyn2_bytes >= POOL_MIN_SIZE && pol->mid_dyn2_bytes <= POOL_MAX_SIZE) {
g_class_sizes[6] = pol->mid_dyn2_bytes;
} else {
g_class_sizes[6] = 0;
}
+ */
+ // Bridge classes stay as initialized in g_class_sizes (40KB, 52KB)
Quick Fix #3: Add Debug Logging (For Verification)
// core/box/pool_init_api.inc.h:84-95
g_pool.initialized = 1;
HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");
+ HAKMEM_LOG("[Pool] Class sizes after init:\n");
+ for (int i = 0; i < POOL_NUM_CLASSES; i++) {
+ HAKMEM_LOG(" Class %d: %zu KB %s\n",
+ i, g_class_sizes[i]/1024,
+ g_class_sizes[i] ? "ENABLED" : "DISABLED");
+ }
Recommended Actions
Immediate (NOW):
- Apply Quick Fix #2 (comment out the overwrite code)
- Rebuild with debug logging
- Test:
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem - Expected: 50-80M ops/s (vs current 1.03M)
Short-term (1-2 days):
- Implement Central Checker Box
- Add health check API
- Add allocation routing logs
Long-term (1 week):
- Refactor Pool/Policy bridge class ownership
- Separate init from configuration
- Add comprehensive boxing tests
Architecture Diagram
Current (BROKEN):
================
[Policy]
↓ mid_dyn1=0, mid_dyn2=0
[Pool Init]
↓ Overwrites g_class_sizes[5]=0, [6]=0
[Pool]
↓ Bridge classes DISABLED
[ACE Alloc]
↓ 33KB → 40KB (class 5)
[Pool Lookup]
↓ g_class_sizes[5]=0 → FAIL
[mmap fallback] ← 1.03M ops/s
Proposed (FIXED):
================
[Policy]
↓ (Bridge config ignored)
[Pool Init]
↓ Keep hardcoded g_class_sizes
[Central Checker] ← NEW
↓ Validate all components
[Pool]
↓ Bridge classes ENABLED (40KB, 52KB)
[ACE Alloc]
↓ 33KB → 40KB (class 5)
[Pool Lookup]
↓ g_class_sizes[5]=40KB → SUCCESS
[Pool Pages] ← 50-80M ops/s
Test Commands
# Before fix (current broken state)
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
# Result: 1.03M ops/s (mmap fallback)
# After fix (comment out lines 9-17)
vim core/box/pool_init_api.inc.h
# Comment out lines 9-17
make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
# Expected: 50-80M ops/s (Pool working!)
# With debug verification
HAKMEM_LOG_LEVEL=3 HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem 2>&1 | grep "Class 5"
# Should show: "Class 5: 40 KB ENABLED"
Conclusion
The bug is trivial: Pool init code overwrites hardcoded Bridge classes with 0 because Policy disabled them in Phase 6.21.
The fix is trivial: Don't overwrite them. Comment out 9 lines.
The impact is massive: 50-80x performance improvement (1.03M → 50-80M ops/s).
The lesson: When two components (Policy and Pool) both think they own configuration, silent failures occur. Need better boxing with clear ownership boundaries and validation points.