# ACE Investigation Report: Mid-Large MT Performance Recovery

## Executive Summary

ACE (Adaptive Cache Engine) is the central L1 allocator for Mid-Large (2KB-1MB) allocations in HAKMEM. The investigation shows that ACE is **disabled by default**, so every Mid-Large allocation falls back to slow mmap operations, producing a -88% regression vs System malloc. Enabling ACE via the `HAKMEM_ACE_ENABLED=1` environment variable is necessary but not sufficient: testing shows that ACE still returns NULL when enabled, which indicates that the underlying pools (MidPool/LargePool) are not properly initialized or have no memory available. Recovering performance therefore requires both enabling ACE and fixing pool initialization.

## ACE Mechanism Explanation

ACE (Adaptive Cache Engine) is HAKMEM's intelligent caching layer for Mid-Large allocations (2KB-1MB). It acts as an intermediary between the main allocation path and the underlying memory pools. ACE's primary function is to round allocation sizes to optimal size classes using "W_MAX" rounding policies, then attempt allocation from two specialized pools: MidPool (2-52KB) and LargePool (64KB-1MB). The rounding strategy allows trading small amounts of internal fragmentation for significantly faster allocation performance by fitting requests into pre-sized cache buckets.
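
The rounding step can be illustrated with a minimal sketch. It assumes that W_MAX is the maximum allowed ratio of class size to requested size, which this report does not confirm. Only the 40KB and 52KB Bridge classes come from this report; the other table entries and the names `k_class_sizes` and `ace_round_to_class` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class table. Only the 40KB and 52KB Bridge classes
 * (indices 5 and 6) are taken from this report; the power-of-two entries
 * are assumptions made for illustration. */
static const size_t k_class_sizes[] = {
    2048, 4096, 8192, 16384, 32768, 40960, 53248
};
enum { K_NUM_CLASSES = sizeof(k_class_sizes) / sizeof(k_class_sizes[0]) };

/* Return the index of the smallest class that holds `size` without exceeding
 * size * w_max bytes, or -1 if no class qualifies (caller falls back).
 * Assumption: W_MAX bounds the ratio class_size / requested_size. */
static int ace_round_to_class(size_t size, double w_max)
{
    assert(w_max >= 1.0);
    for (int i = 0; i < K_NUM_CLASSES; i++) {
        size_t cls = k_class_sizes[i];
        if (cls < size)
            continue;                      /* too small for the request */
        if ((double)cls <= (double)size * w_max)
            return i;                      /* fits within the waste budget */
        return -1;                         /* smallest fit is too wasteful */
    }
    return -1;                             /* larger than every class */
}
```

Under these assumptions, a 33296-byte request with `w_max = 1.5` maps to the 40KB class, while a 9000-byte request with `w_max = 1.2` is rejected because the smallest class that fits (16KB) would waste more than the budget allows.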

The ACE architecture consists of three main components: (1) The allocation router (`hkm_ace_alloc`) which maps sizes to appropriate pools, (2) The ACE controller which manages background threads for cache maintenance and statistics collection, and (3) The UCB1 (Upper Confidence Bound) learning algorithm which optimizes allocation strategies based on observed patterns. When ACE successfully allocates from its pools, the request is served in user space in constant time, avoiding the per-call kernel work that the mmap path incurs.

ACE improves performance by eliminating system call overhead. Without ACE, every Mid-Large allocation requires an mmap system call, kernel page table updates, and, in multi-threaded scenarios, TLB shootdowns, which together are estimated at roughly 1100-2900 cycles. With ACE enabled and pools populated, allocations are served from pre-mapped memory with simple pointer arithmetic (~20-50 cycles), making the allocation fast path one to two orders of magnitude cheaper.

## Current State Diagnosis

**ACE is currently DISABLED by default.**

Evidence from debug output:
```
[ACE] ACE disabled (HAKMEM_ACE_ENABLED=0)
[HAKMEM] INFO: Using mmap for mid-range size=33296 (ACE disabled or failed)
```

The ACE enable/disable mechanism is controlled by:
- **Environment variable:** `HAKMEM_ACE_ENABLED` (default: 0)
- **Initialization:** `core/hakmem_ace_controller.c:42`
- **Check location:** The controller reads `getenv_int("HAKMEM_ACE_ENABLED", 0)`

When disabled, ACE immediately returns from initialization without starting background threads or initializing the underlying pools. This was likely a conservative default during development to avoid potential instability from the learning layer.
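
The enable check relies on an integer-valued environment lookup. The actual `getenv_int` implementation is not shown in this report, so the following is only a sketch of how such a helper might behave; the name `getenv_int_sketch` and its fallback rules are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-in for the getenv_int() helper named above: parse an
 * integer environment variable, returning `fallback` when the variable is
 * unset, empty, non-numeric, or out of int range. */
static int getenv_int_sketch(const char *name, int fallback)
{
    assert(name != NULL);
    const char *raw = getenv(name);
    if (raw == NULL || *raw == '\0')
        return fallback;

    char *end = NULL;
    errno = 0;
    long value = strtol(raw, &end, 10);
    if (errno != 0 || end == raw || *end != '\0' ||
        value < INT_MIN || value > INT_MAX)
        return fallback;                   /* reject garbage and overflow */
    return (int)value;
}
```

With a default of 0, an unset `HAKMEM_ACE_ENABLED` leaves ACE off, which matches the `ACE disabled (HAKMEM_ACE_ENABLED=0)` line in the debug output.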

## Root Cause Analysis

### Allocation Path Analysis

**With ACE disabled:**
1. Allocation request (e.g., 33KB) enters `hak_alloc`
2. Falls into Mid-Large range check (1KB < size < 2MB threshold)
3. Calls `hkm_ace_alloc()` which checks if ACE controller is enabled
4. Since disabled, ACE immediately returns NULL
5. Falls back to mmap in `hak_alloc_api.inc.h:145`
6. Every allocation incurs the full mmap cost (an estimated 1100-2900 cycles in the multi-threaded case, itemized under Performance Impact Quantification)

**With ACE enabled (but pools empty):**
1. ACE controller initializes and starts background thread
2. `hkm_ace_alloc()` rounds 33KB → 40KB (Bridge class)
3. Calls `hak_pool_try_alloc(40KB, site_id)`
4. Pool has no pages allocated (never refilled)
5. Returns NULL
6. Still falls back to mmap
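
Both failure modes end in the same place: the pool-backed call returns NULL and the caller pays for an mmap. A minimal sketch of that control flow follows. `ace_alloc_stub` is a hypothetical stand-in for `hkm_ace_alloc()`, and `mid_large_alloc_sketch` and `g_mmap_fallbacks` are illustrative names, not HAKMEM code.

```c
#define _DEFAULT_SOURCE                    /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical stub standing in for hkm_ace_alloc(): returns NULL to model
 * both cases above (ACE disabled, or enabled with empty pools). */
static void *ace_alloc_stub(size_t size)
{
    (void)size;
    return NULL;
}

/* Counts slow-path hits so the fallback is observable. Not thread-safe;
 * a real counter would need to be atomic or per-thread. */
static unsigned long g_mmap_fallbacks;

/* Try the pool-backed path first; fall back to an anonymous mmap on a miss. */
static void *mid_large_alloc_sketch(size_t size)
{
    assert(size > 0);
    void *p = ace_alloc_stub(size);
    if (p != NULL)
        return p;                          /* fast path: served from a pool */

    g_mmap_fallbacks++;                    /* slow path: one syscall per call */
    p = mmap(NULL, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}
```

As long as the stub returns NULL, every call increments the fallback counter, which is the behaviour the `Using mmap for mid-range size=33296` log line reports.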

### Performance Impact Quantification

**mmap overhead per allocation:**
- System call entry/exit: ~200 cycles
- Kernel page allocation: ~300-500 cycles
- Page table updates: ~100-200 cycles
- TLB flush (MT): ~500-2000 cycles
- **Total: 1100-2900 cycles per alloc**

**Pool allocation (when working):**
- TLS cache check: ~5 cycles
- Pointer pop: ~10 cycles
- Header write: ~5 cycles
- **Total: 20-50 cycles**

**Performance delta:** 22-145x slower with mmap fallback (1100/50 = 22x at best, 2900/20 = 145x at worst)

For the `bench_mid_large_mt` workload (33KB allocations):
- Expected with ACE: ~50-80M ops/s
- Current (mmap): ~1M ops/s
- **Consistent with the observed -88% regression vs System malloc**
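
The figures in this section can be cross-checked with a few lines of arithmetic. The helper names below are illustrative, and the implied System malloc throughput is an inference from the -88% figure and the 1.05M ops/s measurement reported under Testing, not a measured value.

```c
#include <assert.h>

/* Slowdown of a slow path relative to a fast path, given per-allocation
 * cycle estimates for each. */
static double slowdown(double slow_cycles, double fast_cycles)
{
    assert(fast_cycles > 0.0);
    return slow_cycles / fast_cycles;
}

/* Baseline throughput implied by an observed throughput and a relative
 * regression, e.g. regression = -0.88 for "-88% vs baseline". */
static double implied_baseline(double observed_ops, double regression)
{
    assert(regression > -1.0);
    return observed_ops / (1.0 + regression);
}
```

Pairing the cheapest mmap estimate with the most expensive pool estimate gives `slowdown(1100, 50)` = 22, and the opposite pairing gives `slowdown(2900, 20)` = 145. `implied_baseline(1.05e6, -0.88)` comes to about 8.75M ops/s, which is the System malloc throughput that the -88% figure implies.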

## Proposed Solution

### Solution: Enable ACE + Fix Pool Initialization

### Approach
Enable ACE via environment variable and ensure pools are properly initialized with pre-allocated pages to serve requests immediately.

### Implementation Steps

1. **Enable ACE at runtime** (Immediate workaround)
   ```bash
   export HAKMEM_ACE_ENABLED=1
   ./bench_mid_large_mt_hakmem
   ```

2. **Fix pool initialization** (`core/box/pool_init_api.inc.h`)
   - Add pre-allocation of pages for Bridge classes (40KB, 52KB)
   - Ensure `g_class_sizes[5]` and `g_class_sizes[6]` are properly set
   - Pre-populate each class with at least 2-4 pages

3. **Verify L2.5 Large Pool init** (`core/hakmem_l25_pool.c`)
   - Check lazy initialization is working
   - Pre-allocate pages for 64KB-1MB classes

4. **Add ACE health check**
   - Log successful pool allocations
   - Track hit/miss rates
   - Alert if pools are consistently empty

### Code Changes

**File:** `core/box/hak_core_init.inc.h:75` (after `mid_mt_init()`)
```c
// OLD
    // NEW Phase Hybrid: Initialize Mid Range MT allocator (8-32KB, mimalloc-style)
    mid_mt_init();

// NEW
    // NEW Phase Hybrid: Initialize Mid Range MT allocator (8-32KB, mimalloc-style)
    mid_mt_init();

    // Initialize MidPool for ACE (2-52KB allocations)
    hak_pool_init();

    // Initialize LargePool for ACE (64KB-1MB allocations)
    hak_l25_pool_init();
```

**File:** `core/box/pool_init_api.inc.h:96` (in `hak_pool_init_impl`)
```c
// OLD
    g_pool.initialized = 1;
    HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");

// NEW
    g_pool.initialized = 1;
    HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");

    // Pre-allocate pages for Bridge classes to avoid cold start
    if (g_class_sizes[5] != 0) {  // 40KB Bridge class
        for (int s = 0; s < 4; s++) {
            refill_freelist(5, s);
        }
        HAKMEM_LOG("[Pool] Pre-allocated 40KB Bridge class pages\n");
    }
    if (g_class_sizes[6] != 0) {  // 52KB Bridge class
        for (int s = 0; s < 4; s++) {
            refill_freelist(6, s);
        }
        HAKMEM_LOG("[Pool] Pre-allocated 52KB Bridge class pages\n");
    }
```

**File:** `core/hakmem_ace_controller.c:42` (change default)
```c
// OLD
    ctrl->enabled = getenv_int("HAKMEM_ACE_ENABLED", 0);

// NEW (Option A - Enable by default)
    ctrl->enabled = getenv_int("HAKMEM_ACE_ENABLED", 1);

// OR (Option B - Keep disabled but add warning)
    ctrl->enabled = getenv_int("HAKMEM_ACE_ENABLED", 0);
    if (!ctrl->enabled) {
        ACE_LOG_WARN(ctrl, "ACE disabled - Mid-Large performance will be degraded. Set HAKMEM_ACE_ENABLED=1 to enable.");
    }
```

### Testing
- Build command: `make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1`
- Test command: `HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem`
- Expected result: 50-80M ops/s (vs current 1.05M)

### Effort Estimate
- Implementation: 2-4 hours (mostly testing)
- Testing: 2-3 hours (verify all size classes)
- Total: 4-7 hours

### Risk Level
**MEDIUM** - ACE has been disabled for a while, so enabling it may expose latent bugs. However, the code exists and was previously tested. Main risks:
- Pool exhaustion under high load
- Thread safety issues in ACE controller
- Memory leaks if pools don't properly free

## Risk Assessment

### Primary Risks

1. **Pool Memory Exhaustion** (Medium)
   - Pools may not have sufficient pages for high concurrency
   - Mitigation: Implement dynamic page allocation on demand

2. **ACE Thread Safety** (Low-Medium)
   - Background thread may have race conditions
   - Mitigation: Code review of ACE controller threading

3. **Memory Fragmentation** (Low)
   - Bridge classes (40KB, 52KB) may cause fragmentation
   - Mitigation: Monitor fragmentation metrics

4. **Learning Algorithm Instability** (Low)
   - UCB1 algorithm may make poor decisions initially
   - Mitigation: Conservative initial parameters

## Alternative Approaches

### Alternative 1: Remove ACE, Direct Pool Access
Skip ACE layer entirely and call pools directly from main allocation path. This removes the learning layer but simplifies the code.

**Pros:** Simpler, fewer components
**Cons:** Loses adaptive optimization potential
**Effort:** 8-10 hours

### Alternative 2: Adjust mmap Threshold
Change the threshold from 2MB to 32KB so only truly large allocations use mmap.

**Pros:** Simple config change
**Cons:** Doesn't fix the core problem, just shifts it
**Effort:** 1 hour

### Alternative 3: Implement Simple Cache
Replace ACE with a basic per-thread cache without learning.

**Pros:** Predictable performance
**Cons:** Loses adaptation benefits
**Effort:** 12-16 hours

## Testing Strategy

1. **Unit Tests**
   - Verify ACE returns non-NULL for each size class
   - Test pool refill logic
   - Validate Bridge class allocation

2. **Integration Tests**
   - Run full benchmark suite with ACE enabled
   - Compare against baseline (System malloc)
   - Monitor memory usage

3. **Stress Tests**
   - High concurrency (32+ threads)
   - Mixed size allocations
   - Long-running stability test (1+ hour)

4. **Performance Validation**
   - Target: 50-80M ops/s for bench_mid_large_mt
   - Must maintain Tiny performance gains
   - No regression in other benchmarks
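
A stress test along the lines of item 3 might look like the sketch below. It is not the `bench_mid_large_mt` benchmark. It calls plain `malloc`/`free`, so it exercises whichever allocator the binary is linked or preloaded with, and the names `stress_worker` and `run_mid_large_stress` are illustrative.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

enum { STRESS_MAX_THREADS = 64 };

typedef struct {
    size_t size;        /* allocation size in bytes */
    int iterations;     /* alloc/free pairs to attempt */
    long completed;     /* successful pairs, written only by the worker */
} stress_arg_t;

/* Worker: allocate, touch both ends of the block, free, repeat. */
static void *stress_worker(void *opaque)
{
    stress_arg_t *arg = opaque;
    for (int i = 0; i < arg->iterations; i++) {
        volatile char *p = malloc(arg->size);
        if (p == NULL)
            continue;                      /* count only successful pairs */
        p[0] = 1;                          /* volatile writes keep the */
        p[arg->size - 1] = 1;              /* allocation from being elided */
        free((void *)p);
        arg->completed++;
    }
    return NULL;
}

/* Run `threads` workers and return the total number of completed pairs. */
static long run_mid_large_stress(int threads, int iterations, size_t size)
{
    assert(threads > 0 && threads <= STRESS_MAX_THREADS);
    assert(size >= 1);
    pthread_t tids[STRESS_MAX_THREADS];
    stress_arg_t args[STRESS_MAX_THREADS];
    int started = 0;
    for (int t = 0; t < threads; t++) {
        args[t] = (stress_arg_t){ size, iterations, 0 };
        if (pthread_create(&tids[t], NULL, stress_worker, &args[t]) != 0)
            break;                         /* join only what was started */
        started++;
    }
    long total = 0;
    for (int t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
        total += args[t].completed;
    }
    return total;
}
```

Calling `run_mid_large_stress(32, 1000000, 33296)` and dividing the result by the elapsed wall-clock time gives a rough ops/s figure for the 33KB case; timing is left out to keep the sketch short.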

## Effort Estimate

**Immediate Fix (Enable ACE):** 1 hour
- Set environment variable
- Verify basic functionality
- Document in README

**Full Solution (Initialize Pools):** 4-7 hours
- Code changes: 2-3 hours
- Testing: 2-3 hours
- Documentation: up to 1 hour

**Production Hardening:** 8-12 hours (optional)
- Add monitoring/metrics
- Implement auto-tuning
- Stress testing

## Recommendations

1. **Immediate Action:** Enable ACE via environment variable for testing
   ```bash
   export HAKMEM_ACE_ENABLED=1
   ```

2. **Short-term Fix:** Implement pool initialization fixes (4-7 hours)
   - Priority: HIGH
   - Impact: Recovers the -88% Mid-Large regression
   - Risk: Medium (needs thorough testing)

3. **Long-term:** Consider making ACE enabled by default after validation
   - Add comprehensive tests
   - Monitor production metrics
   - Document tuning parameters

4. **Configuration:** Add startup configuration to set optimal defaults
   ```bash
   # Recommended .hakmemrc or startup script
   export HAKMEM_ACE_ENABLED=1
   export HAKMEM_ACE_FAST_INTERVAL_MS=100  # More aggressive adaptation
   export HAKMEM_ACE_LOG_LEVEL=2           # Verbose logging initially
   ```

## Conclusion

The -88% Mid-Large MT regression is caused by ACE being disabled by default, which forces all Mid-Large allocations through slow mmap. Enabling ACE alone is not enough, because the pools behind it are empty; the fix is to enable ACE and ensure the pools are properly initialized. This should recover the +171% performance advantage HAKMEM previously demonstrated for Mid-Large allocations. With an estimated 4-7 hours of work, HAKMEM's competitive advantage in this critical size range can be restored.