# ACE-Pool Architecture Investigation Report
## Executive Summary
**Root Cause Found:** Bridge classes (40KB, 52KB) are disabled at initialization by two conflicting code paths. The Pool init code still takes the Bridge class sizes from Policy, but Policy has set them to 0 since Phase 6.21, so init overwrites the hardcoded sizes with 0. **The fix is trivial: don't overwrite the hardcoded Bridge classes with 0.**
## Part 1: Root Cause Analysis
### The Bug Chain
1. **Policy Phase 6.21 Change:**
```c
// core/hakmem_policy.c:53-55
pol->mid_dyn1_bytes = 0; // Disabled (Bridge classes now hardcoded)
pol->mid_dyn2_bytes = 0; // Disabled
```
2. **Pool Init Overwrites Bridge Classes:**
```c
// core/box/pool_init_api.inc.h:9-17
if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
    g_class_sizes[5] = pol->mid_dyn1_bytes;
} else {
    g_class_sizes[5] = 0;  // ← BRIDGE CLASS 5 (40KB) DISABLED!
}
```
3. **Pool Has Bridge Classes Hardcoded:**
```c
// core/hakmem_pool.c:810-817
static size_t g_class_sizes[POOL_NUM_CLASSES] = {
    POOL_CLASS_2KB,   // 2 KB
    POOL_CLASS_4KB,   // 4 KB
    POOL_CLASS_8KB,   // 8 KB
    POOL_CLASS_16KB,  // 16 KB
    POOL_CLASS_32KB,  // 32 KB
    POOL_CLASS_40KB,  // 40 KB (Bridge class 0) ← GETS OVERWRITTEN TO 0!
    POOL_CLASS_52KB   // 52 KB (Bridge class 1) ← GETS OVERWRITTEN TO 0!
};
```
4. **Result: 33KB Allocation Fails:**
- ACE rounds 33KB → 40KB (the Bridge class at index 5)
- Pool lookup: `g_class_sizes[5] == 0` → class disabled
- Pool returns NULL
- ACE falls back to mmap (1.03M ops/s instead of the expected 50-80M)
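
The chain above can be reproduced in isolation. The following is a minimal sketch, not the HAKMEM source: `demo_class_sizes`, `demo_init_buggy()`, and `demo_class_index()` are hypothetical stand-ins for `g_class_sizes`, the init logic in `pool_init_api.inc.h`, and `hak_pool_get_class_index()`, and the `DEMO_POOL_MIN_SIZE` / `DEMO_POOL_MAX_SIZE` bounds are assumed values.

```c
#include <stddef.h>

#define DEMO_KB(n)         ((size_t)(n) * 1024u)
#define DEMO_NUM_CLASSES   7
#define DEMO_POOL_MIN_SIZE DEMO_KB(2)   /* assumed bound, not from the source */
#define DEMO_POOL_MAX_SIZE DEMO_KB(52)  /* assumed bound, not from the source */

/* Stand-in for g_class_sizes: the Bridge classes sit at indices 5 and 6. */
size_t demo_class_sizes[DEMO_NUM_CLASSES] = {
    DEMO_KB(2), DEMO_KB(4), DEMO_KB(8), DEMO_KB(16),
    DEMO_KB(32), DEMO_KB(40), DEMO_KB(52)
};

/* Mirrors the buggy init: an out-of-range Policy value zeroes the class. */
void demo_init_buggy(size_t mid_dyn1_bytes, size_t mid_dyn2_bytes)
{
    if (mid_dyn1_bytes >= DEMO_POOL_MIN_SIZE && mid_dyn1_bytes <= DEMO_POOL_MAX_SIZE) {
        demo_class_sizes[5] = mid_dyn1_bytes;
    } else {
        demo_class_sizes[5] = 0;
    }
    if (mid_dyn2_bytes >= DEMO_POOL_MIN_SIZE && mid_dyn2_bytes <= DEMO_POOL_MAX_SIZE) {
        demo_class_sizes[6] = mid_dyn2_bytes;
    } else {
        demo_class_sizes[6] = 0;
    }
}

/* Exact-match lookup; a disabled (zero) class can never match. */
int demo_class_index(size_t size)
{
    for (int i = 0; i < DEMO_NUM_CLASSES; i++) {
        if (demo_class_sizes[i] != 0 && demo_class_sizes[i] == size) {
            return i;
        }
    }
    return -1;
}
```

Calling `demo_init_buggy(0, 0)` models Policy after Phase 6.21. A 40KB lookup that resolved to index 5 before the call returns -1 afterwards, while the non-Bridge classes are unaffected.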
### Why Pre-allocation Code Never Runs
```c
// core/box/pool_init_api.inc.h:101-106
if (g_class_sizes[5] != 0) {  // ← FALSE because g_class_sizes[5] = 0
    // Pre-allocation code NEVER executes
    for (int s = 0; s < prewarm_pages && s < POOL_NUM_SHARDS; s++) {
        refill_freelist(5, s);
    }
}
```
The pre-allocation code is correct but never runs because the Bridge classes are disabled!
## Part 2: Boxing Analysis
### Current Architecture Problems
**1. Conflicting Ownership:**
- Policy thinks it owns Bridge class configuration (DYN1/DYN2)
- Pool has Bridge classes hardcoded
- Pool init overwrites hardcoded values with Policy's 0s
**2. Invisible Failures:**
- No error when Bridge classes get disabled
- No warning when Pool returns NULL
- No trace showing why allocation failed
**3. Mixed Responsibilities:**
- `pool_init_api.inc.h` does both init AND policy configuration
- ACE does rounding AND allocation AND fallback
- No clear separation of concerns
### Data Flow Tracing
```
33KB allocation request
  → hkm_ace_alloc()
    → round_to_mid_class(33KB, wmax=1.33) → 40KB ✓
    → hak_pool_try_alloc(40KB)
      → hak_pool_init() (pthread_once)
      → hak_pool_get_class_index(40KB)
        → Check g_class_sizes[5] = 0 ✗
        → Return -1 (not found)
      → Pool returns NULL
    → ACE tries Large rounding (fails)
    → Fallback to mmap ✗
```
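
The rounding step in the trace can be checked by hand: 40KB / 33KB ≈ 1.21, which is within `wmax=1.33`, whereas the next class up, 52KB, would give ≈ 1.58 and be rejected. The sketch below encodes one assumed reading of that rule (pick the smallest class at or above the request whose size stays within `wmax` times the request). `demo_round_to_class()` and its class table are hypothetical, and the real `round_to_mid_class()` may work differently.

```c
#include <stddef.h>

#define DEMO_KB(n) ((size_t)(n) * 1024u)

/*
 * Returns the smallest enabled class >= size whose waste ratio
 * (class / size) does not exceed wmax, or 0 if no class qualifies.
 * Assumes the non-zero entries of `classes` are in ascending order.
 */
size_t demo_round_to_class(const size_t *classes, int n, size_t size, double wmax)
{
    if (size == 0) {
        return 0;
    }
    for (int i = 0; i < n; i++) {
        size_t c = classes[i];
        if (c == 0 || c < size) {
            continue;  /* disabled, or too small for the request */
        }
        if ((double)c <= (double)size * wmax) {
            return c;  /* first fit within the waste budget */
        }
        return 0;      /* ascending order: larger classes waste even more */
    }
    return 0;
}
```

With the seven class sizes from `g_class_sizes`, a 33KB request rounds to 40KB under this rule, and an exact 32KB request stays at 32KB.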
### Missing Boxes
1. **Configuration Validator Box:**
- Should verify Bridge classes are enabled
- Should warn if Policy conflicts with Pool
2. **Allocation Router Box:**
- Central decision point for allocation strategy
- Clear logging of routing decisions
3. **Pool Health Check Box:**
- Verify all classes are properly configured
- Check if pre-allocation succeeded
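
As an illustration of what the Configuration Validator Box could check, here is a minimal sketch. The type `DemoPoolConfigReport` and the function `demo_validate_classes()` are hypothetical names, and the rules it enforces (no class with size 0, sizes strictly ascending) are assumptions about what a valid table looks like rather than documented HAKMEM invariants.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool ok;              /* true when every class passed */
    int first_bad_class;  /* index of the first failing class, -1 when ok */
    const char *reason;   /* static description of the failure */
} DemoPoolConfigReport;

/* Scans the class table once and reports the first violation found. */
DemoPoolConfigReport demo_validate_classes(const size_t *sizes, int n)
{
    DemoPoolConfigReport report = { true, -1, "ok" };
    size_t prev = 0;

    for (int i = 0; i < n; i++) {
        if (sizes[i] == 0) {
            report = (DemoPoolConfigReport){ false, i, "class disabled (size 0)" };
            return report;
        }
        if (sizes[i] <= prev) {
            report = (DemoPoolConfigReport){ false, i, "class sizes not strictly ascending" };
            return report;
        }
        prev = sizes[i];
    }
    return report;
}
```

Run against the table left behind by the current init code, a check like this would flag index 5 immediately instead of letting the failure surface later as an mmap fallback.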
## Part 3: Central Checker Box Design
### Proposed Architecture
```c
// core/box/ace_pool_checker.h
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    bool ace_enabled;
    bool pool_initialized;
    bool bridge_classes_enabled;
    bool pool_has_pages[POOL_NUM_CLASSES];
    size_t class_sizes[POOL_NUM_CLASSES];
    const char* last_error;
} AcePoolHealthStatus;

// Central validation point
AcePoolHealthStatus* hak_ace_pool_health_check(void);

// Routing with validation
void* hak_ace_pool_route_alloc(size_t size, uintptr_t site_id) {
    // 1. Check health
    AcePoolHealthStatus* health = hak_ace_pool_health_check();
    if (!health->ace_enabled) {
        LOG("ACE disabled, fallback to system");
        return NULL;
    }

    // 2. Validate Pool
    if (!health->pool_initialized) {
        LOG("Pool not initialized!");
        hak_pool_init();
        health = hak_ace_pool_health_check();  // Re-check
    }

    // 3. Check Bridge classes
    size_t rounded = round_to_mid_class(size, 1.33, NULL);
    int class_idx = hak_pool_get_class_index(rounded);
    if (class_idx < 0) {
        // The lookup returns -1 for a disabled class (see Data Flow Tracing),
        // so this is the branch the current bug would take.
        LOG("ERROR: No enabled class for rounded size %zu", rounded);
        return NULL;
    }
    if (health->class_sizes[class_idx] == 0) {
        LOG("ERROR: Class %d disabled (size=%zu)", class_idx, rounded);
        return NULL;
    }

    // 4. Try allocation with logging
    LOG("Routing %zu → class %d (size=%zu)", size, class_idx, rounded);
    void* ptr = hak_pool_try_alloc(rounded, site_id);
    if (!ptr) {
        LOG("Pool allocation failed for class %d", class_idx);
    }
    return ptr;
}
```
### Integration Points
1. **Replace silent failures with logged checker:**
```c
// Before: Silent failure
void* p = hak_pool_try_alloc(r, site_id);
// After: Checked and logged
void* p = hak_ace_pool_route_alloc(size, site_id);
```
2. **Add health check command:**
```c
// In main() or benchmark
if (getenv("HAKMEM_HEALTH_CHECK")) {
    AcePoolHealthStatus* h = hak_ace_pool_health_check();
    fprintf(stderr, "ACE: %s\n", h->ace_enabled ? "ON" : "OFF");
    fprintf(stderr, "Pool: %s\n", h->pool_initialized ? "OK" : "NOT INIT");
    for (int i = 0; i < POOL_NUM_CLASSES; i++) {
        fprintf(stderr, "Class %d: %zu KB %s\n",
                i, h->class_sizes[i]/1024,
                h->class_sizes[i] ? "ENABLED" : "DISABLED");
    }
}
```
## Part 4: Immediate Fix
### Quick Fix #1: Don't Overwrite Bridge Classes
```diff
 // core/box/pool_init_api.inc.h:9-17
-    if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
-        g_class_sizes[5] = pol->mid_dyn1_bytes;
-    } else {
-        g_class_sizes[5] = 0;
-    }
+    // Phase 6.21: Bridge classes are hardcoded, don't overwrite with 0
+    if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
+        g_class_sizes[5] = pol->mid_dyn1_bytes;  // Only override if Policy provides valid value
+    }
+    // Otherwise keep the hardcoded POOL_CLASS_40KB
+    // Apply the same change to g_class_sizes[6] / mid_dyn2_bytes (keeps POOL_CLASS_52KB)
```
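
The behavior Quick Fix #1 aims for (override only when Policy supplies an in-range value, otherwise keep the hardcoded default) can be exercised in isolation. This is a minimal sketch: `demo_apply_policy_override()` is a hypothetical helper, not an existing HAKMEM function, and the bounds are passed in as parameters because the real `POOL_MIN_SIZE` / `POOL_MAX_SIZE` values are not shown in this report.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Overwrites *class_size only when policy_bytes lies in [min_size, max_size].
 * Returns true if the override was applied, false if the default was kept.
 */
bool demo_apply_policy_override(size_t *class_size, size_t policy_bytes,
                                size_t min_size, size_t max_size)
{
    if (policy_bytes >= min_size && policy_bytes <= max_size) {
        *class_size = policy_bytes;
        return true;
    }
    return false;  /* out of range (including 0): keep the hardcoded value */
}
```

With a slot initialized to 40KB, passing 0 (the Phase 6.21 Policy value) leaves the slot at 40KB, while passing an in-range size such as 48KB replaces it. The same helper would apply to both Bridge class slots.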
### Quick Fix #2: Force Bridge Classes (Simpler)
```diff
 // core/box/pool_init_api.inc.h:7 (in hak_pool_init_impl)
 static void hak_pool_init_impl(void) {
     const FrozenPolicy* pol = hkm_policy_get();
+
+    // Phase 6.21 CRITICAL FIX: Bridge classes are hardcoded, not from Policy
+    // DO NOT overwrite them with 0!
+    /*
     if (pol && pol->mid_dyn1_bytes >= POOL_MIN_SIZE && pol->mid_dyn1_bytes <= POOL_MAX_SIZE) {
         g_class_sizes[5] = pol->mid_dyn1_bytes;
     } else {
         g_class_sizes[5] = 0;
     }
     if (pol && pol->mid_dyn2_bytes >= POOL_MIN_SIZE && pol->mid_dyn2_bytes <= POOL_MAX_SIZE) {
         g_class_sizes[6] = pol->mid_dyn2_bytes;
     } else {
         g_class_sizes[6] = 0;
     }
+    */
+    // Bridge classes stay as initialized in g_class_sizes (40KB, 52KB)
```
### Quick Fix #3: Add Debug Logging (For Verification)
```diff
 // core/box/pool_init_api.inc.h:84-95
     g_pool.initialized = 1;
     HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");
+    HAKMEM_LOG("[Pool] Class sizes after init:\n");
+    for (int i = 0; i < POOL_NUM_CLASSES; i++) {
+        HAKMEM_LOG("  Class %d: %zu KB %s\n",
+                   i, g_class_sizes[i]/1024,
+                   g_class_sizes[i] ? "ENABLED" : "DISABLED");
+    }
```
## Recommended Actions
### Immediate (NOW):
1. Apply Quick Fix #2 (comment out the overwrite code)
2. Rebuild with debug logging
3. Test: `HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem`
4. Expected: 50-80M ops/s (vs current 1.03M)
### Short-term (1-2 days):
1. Implement Central Checker Box
2. Add health check API
3. Add allocation routing logs
### Long-term (1 week):
1. Refactor Pool/Policy bridge class ownership
2. Separate init from configuration
3. Add comprehensive boxing tests
## Architecture Diagram
```
Current (BROKEN):
================
[Policy]
↓ mid_dyn1=0, mid_dyn2=0
[Pool Init]
↓ Overwrites g_class_sizes[5]=0, [6]=0
[Pool]
↓ Bridge classes DISABLED
[ACE Alloc]
↓ 33KB → 40KB (class 5)
[Pool Lookup]
↓ g_class_sizes[5]=0 → FAIL
[mmap fallback] ← 1.03M ops/s

Proposed (FIXED):
================
[Policy]
↓ (Bridge config ignored)
[Pool Init]
↓ Keep hardcoded g_class_sizes
[Central Checker] ← NEW
↓ Validate all components
[Pool]
↓ Bridge classes ENABLED (40KB, 52KB)
[ACE Alloc]
↓ 33KB → 40KB (class 5)
[Pool Lookup]
↓ g_class_sizes[5]=40KB → SUCCESS
[Pool Pages] ← 50-80M ops/s
```
## Test Commands
```bash
# Before fix (current broken state)
make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
# Result: 1.03M ops/s (mmap fallback)
# After fix (comment out lines 9-17)
vim core/box/pool_init_api.inc.h
# Comment out lines 9-17
make clean && make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
# Expected: 50-80M ops/s (Pool working!)
# With debug verification
HAKMEM_LOG_LEVEL=3 HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem 2>&1 | grep "Class 5"
# Should show: "Class 5: 40 KB ENABLED"
```
## Conclusion
**The bug is trivial:** Pool init code overwrites hardcoded Bridge classes with 0 because Policy disabled them in Phase 6.21.
**The fix is trivial:** Don't overwrite them. Comment out 9 lines.
**The expected impact is massive:** roughly 50-80x higher throughput (1.03M → 50-80M ops/s), to be confirmed by rerunning the benchmark after the fix.
**The lesson:** When two components (Policy and Pool) both think they own the same configuration, silent failures occur. The design needs better boxing, with clear ownership boundaries and explicit validation points.