hakmem/ACE_INVESTIGATION_REPORT.md
Moe Charm (CI) 1010a961fb Tiny: fix header/stride mismatch and harden refill paths
- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate P0 batch carve, and continue reducing branch-miss as planned.
2025-11-09 18:55:50 +09:00


ACE Investigation Report: Mid-Large MT Performance Recovery

Executive Summary

ACE (Adaptive Cache Engine) is the central L1 allocator for Mid-Large (2KB-1MB) allocations in HAKMEM. The investigation shows that ACE is disabled by default, so every Mid-Large allocation falls back to slow mmap operations, producing a -88% regression vs System malloc. Setting the HAKMEM_ACE_ENABLED=1 environment variable enables ACE, but this alone is not sufficient: testing shows ACE still returns NULL when enabled, which indicates that the underlying pools (MidPool/LargePool) are not properly initialized or have no available memory. A deeper fix is required to initialize the pools correctly.

ACE Mechanism Explanation

ACE (Adaptive Cache Engine) is HAKMEM's intelligent caching layer for Mid-Large allocations (2KB-1MB). It acts as an intermediary between the main allocation path and the underlying memory pools. ACE's primary function is to round allocation sizes to optimal size classes using "W_MAX" rounding policies, then attempt allocation from two specialized pools: MidPool (2-52KB) and LargePool (64KB-1MB). The rounding strategy allows trading small amounts of internal fragmentation for significantly faster allocation performance by fitting requests into pre-sized cache buckets.
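The rounding step can be illustrated with a minimal sketch. It assumes that W_MAX caps the ratio of class size to requested size; that interpretation, the names `ace_round_size` and `k_classes`, and every table entry other than the 40KB and 52KB Bridge classes are assumptions made for illustration, not taken from the HAKMEM source.

```c
#include <stddef.h>

/* Hypothetical MidPool class table. Only the 40KB and 52KB Bridge classes
 * are named in this report; the other entries are placeholders. */
static const size_t k_classes[] = {
    2u << 10, 4u << 10, 8u << 10, 16u << 10, 32u << 10, 40u << 10, 52u << 10
};
#define K_NUM_CLASSES (sizeof(k_classes) / sizeof(k_classes[0]))

/* Return the smallest class >= size whose internal fragmentation stays
 * within w_max (class_size <= size * w_max), or 0 if no class qualifies. */
static size_t ace_round_size(size_t size, double w_max) {
    if (size == 0) return 0;
    for (size_t i = 0; i < K_NUM_CLASSES; i++) {
        if (k_classes[i] < size) continue;          /* too small, keep looking */
        if ((double)k_classes[i] <= (double)size * w_max) return k_classes[i];
        return 0;  /* the next class up would waste more than w_max allows */
    }
    return 0;      /* larger than every class in this table */
}
```

With a bound of 1.5, the 33296-byte request seen in the debug log rounds to the 40KB class, while a 2049-byte request is rejected because the next class (4KB) would nearly double its footprint.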

The ACE architecture consists of three main components: (1) The allocation router (hkm_ace_alloc) which maps sizes to appropriate pools, (2) The ACE controller which manages background threads for cache maintenance and statistics collection, and (3) The UCB1 (Upper Confidence Bound) learning algorithm which optimizes allocation strategies based on observed patterns. When ACE successfully allocates from its pools, the allocation is an O(1) user-space operation and avoids the system call and kernel work that every mmap requires.

ACE significantly improves performance by eliminating system call overhead. Without ACE, every Mid-Large allocation requires an mmap system call (~500-1000 cycles), kernel page table updates, and TLB shootdowns in multi-threaded scenarios. With ACE enabled and pools populated, allocations are served from pre-mapped memory with simple pointer arithmetic (~20-50 cycles), achieving 10-50x speedup for the allocation fast path.

Current State Diagnosis

ACE is currently DISABLED by default.

Evidence from debug output:

[ACE] ACE disabled (HAKMEM_ACE_ENABLED=0)
[HAKMEM] INFO: Using mmap for mid-range size=33296 (ACE disabled or failed)

The ACE enable/disable mechanism is controlled by:

  • Environment variable: HAKMEM_ACE_ENABLED (default: 0)
  • Initialization: core/hakmem_ace_controller.c:42
  • Check location: The controller reads getenv_int("HAKMEM_ACE_ENABLED", 0)
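For readers unfamiliar with this pattern, the following is a rough sketch of how an integer environment switch with a default is typically read. `env_int_or` and `parse_int_or` are hypothetical stand-ins; the actual `getenv_int` in HAKMEM may handle malformed values differently.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a decimal integer, returning def when s is NULL, empty,
 * malformed, or out of int range. */
static int parse_int_or(const char *s, int def) {
    if (s == NULL || *s == '\0') return def;
    errno = 0;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < INT_MIN || v > INT_MAX) return def;
    return (int)v;
}

/* Hypothetical stand-in for getenv_int(): read an integer environment
 * variable, falling back to def when it is unset or unusable. */
static int env_int_or(const char *name, int def) {
    return parse_int_or(getenv(name), def);
}
```

Under this sketch, `env_int_or("HAKMEM_ACE_ENABLED", 0)` yields 0 unless the variable is set to a valid non-zero integer, which matches the disabled-by-default behavior described above.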

When disabled, ACE immediately returns from initialization without starting background threads or initializing the underlying pools. This was likely a conservative default during development to avoid potential instability from the learning layer.

Root Cause Analysis

Allocation Path Analysis

With ACE disabled:

  1. Allocation request (e.g., 33KB) enters hak_alloc
  2. Falls into Mid-Large range check (1KB < size < 2MB threshold)
  3. Calls hkm_ace_alloc() which checks if ACE controller is enabled
  4. Since disabled, ACE immediately returns NULL
  5. Falls back to mmap in hak_alloc_api.inc.h:145
  6. Every allocation incurs ~500-1000 cycle syscall overhead

With ACE enabled (but pools empty):

  1. ACE controller initializes and starts background thread
  2. hkm_ace_alloc() rounds 33KB → 40KB (Bridge class)
  3. Calls hak_pool_try_alloc(40KB, site_id)
  4. Pool has no pages allocated (never refilled)
  5. Returns NULL
  6. Still falls back to mmap

Performance Impact Quantification

mmap overhead per allocation:

  • System call entry/exit: ~200 cycles
  • Kernel page allocation: ~300-500 cycles
  • Page table updates: ~100-200 cycles
  • TLB flush (MT): ~500-2000 cycles
  • Total: 1100-2900 cycles per alloc

Pool allocation (when working):

  • TLS cache check: ~5 cycles
  • Pointer pop: ~10 cycles
  • Header write: ~5 cycles
  • Total: 20-50 cycles
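The three pool steps listed above (TLS cache check, pointer pop, header write) correspond to only a few instructions. The sketch below shows the general shape under an assumed block layout; the `block` type, the 8-byte header, and the function names are hypothetical and do not reflect HAKMEM's actual structures.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block layout: a small header followed by the payload. */
typedef struct block {
    union {
        struct block *next;    /* link while the block sits on the freelist */
        uint64_t      header;  /* metadata once the block is handed out */
    } u;
} block;

static _Thread_local block *tls_free_head;  /* per-thread cache, no locking */

/* Return a block to the calling thread's cache. */
static void tls_push(block *b) {
    b->u.next = tls_free_head;
    tls_free_head = b;
}

/* Fast path: TLS cache check, pointer pop, header write. */
static void *tls_pop(uint64_t class_tag) {
    block *b = tls_free_head;
    if (b == NULL) return NULL;   /* cache miss: caller takes the slow path */
    tls_free_head = b->u.next;    /* pointer pop */
    b->u.header = class_tag;      /* header write */
    return (void *)(b + 1);       /* payload starts right after the header */
}
```

No system call, lock, or shared cache line is touched on this path, which is where the cycle estimate above comes from.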

Performance delta: mmap fallback is roughly 22-145x slower (1100/50 = 22 in the best case, 2900/20 = 145 in the worst)

For the bench_mid_large_mt workload (33KB allocations):

  • Expected with ACE: ~50-80M ops/s
  • Current (mmap): ~1M ops/s
  • Consistent with the observed -88% regression vs System malloc

Proposed Solution

Solution: Enable ACE + Fix Pool Initialization

Approach

Enable ACE via environment variable and ensure pools are properly initialized with pre-allocated pages to serve requests immediately.

Implementation Steps

  1. Enable ACE at runtime (Immediate workaround)

    export HAKMEM_ACE_ENABLED=1
    ./bench_mid_large_mt_hakmem
    
  2. Fix pool initialization (core/box/pool_init_api.inc.h)

    • Add pre-allocation of pages for Bridge classes (40KB, 52KB)
    • Ensure g_class_sizes[5] and g_class_sizes[6] are properly set
    • Pre-populate each class with at least 2-4 pages
  3. Verify L2.5 Large Pool init (core/hakmem_l25_pool.c)

    • Check lazy initialization is working
    • Pre-allocate pages for 64KB-1MB classes
  4. Add ACE health check

    • Log successful pool allocations
    • Track hit/miss rates
    • Alert if pools are consistently empty
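Step 4 could start as something as small as the sketch below: a pair of counters and a threshold check. All names and thresholds are hypothetical, and the counters are plain integers, so a real version would need per-thread or atomic counters.

```c
#include <stddef.h>

typedef struct {
    size_t hits;    /* allocations served from a pool */
    size_t misses;  /* allocations that fell back to mmap */
} ace_health;

/* Record the outcome of one Mid-Large allocation. */
static void ace_health_record(ace_health *h, int served_from_pool) {
    if (served_from_pool) h->hits++;
    else                  h->misses++;
}

/* Hit rate in percent (0-100); 0 when nothing has been recorded yet. */
static unsigned ace_health_hit_pct(const ace_health *h) {
    size_t total = h->hits + h->misses;
    return total == 0 ? 0u : (unsigned)((h->hits * 100u) / total);
}

/* Flag pools that look consistently empty: enough samples, low hit rate. */
static int ace_health_should_alert(const ace_health *h,
                                   size_t min_samples, unsigned min_hit_pct) {
    size_t total = h->hits + h->misses;
    return total >= min_samples && ace_health_hit_pct(h) < min_hit_pct;
}
```

Requiring a minimum sample count keeps the check from firing during startup, before the pools have had a chance to serve anything.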

Code Changes

File: core/box/hak_core_init.inc.h:75 (after mid_mt_init())

// OLD
    // NEW Phase Hybrid: Initialize Mid Range MT allocator (8-32KB, mimalloc-style)
    mid_mt_init();

// NEW
    // NEW Phase Hybrid: Initialize Mid Range MT allocator (8-32KB, mimalloc-style)
    mid_mt_init();

    // Initialize MidPool for ACE (2-52KB allocations)
    hak_pool_init();

    // Initialize LargePool for ACE (64KB-1MB allocations)
    hak_l25_pool_init();

File: core/box/pool_init_api.inc.h:96 (in hak_pool_init_impl)

// OLD
    g_pool.initialized = 1;
    HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");

// NEW
    g_pool.initialized = 1;
    HAKMEM_LOG("[Pool] Initialized (L2 Hybrid Pool)\n");

    // Pre-allocate pages for Bridge classes to avoid cold start
    if (g_class_sizes[5] != 0) {  // 40KB Bridge class
        for (int s = 0; s < 4; s++) {
            refill_freelist(5, s);
        }
        HAKMEM_LOG("[Pool] Pre-allocated 40KB Bridge class pages\n");
    }
    if (g_class_sizes[6] != 0) {  // 52KB Bridge class
        for (int s = 0; s < 4; s++) {
            refill_freelist(6, s);
        }
        HAKMEM_LOG("[Pool] Pre-allocated 52KB Bridge class pages\n");
    }

File: core/hakmem_ace_controller.c:42 (change default)

// OLD
    ctrl->enabled = getenv_int("HAKMEM_ACE_ENABLED", 0);

// NEW (Option A - Enable by default)
    ctrl->enabled = getenv_int("HAKMEM_ACE_ENABLED", 1);

// OR (Option B - Keep disabled but add warning)
    ctrl->enabled = getenv_int("HAKMEM_ACE_ENABLED", 0);
    if (!ctrl->enabled) {
        ACE_LOG_WARN(ctrl, "ACE disabled - Mid-Large performance will be degraded. Set HAKMEM_ACE_ENABLED=1 to enable.");
    }

Testing

  • Build command: make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1 PREWARM_TLS=1
  • Test command: HAKMEM_ACE_ENABLED=1 ./bench_mid_large_mt_hakmem
  • Expected result: 50-80M ops/s (vs current 1.05M)

Effort Estimate

  • Implementation: 2-4 hours (mostly testing)
  • Testing: 2-3 hours (verify all size classes)
  • Total: 4-7 hours

Risk Level

MEDIUM - ACE has been disabled for a while, so enabling it may expose latent bugs. However, the code exists and was previously tested. Main risks:

  • Pool exhaustion under high load
  • Thread safety issues in ACE controller
  • Memory leaks if pools don't properly free

Risk Assessment

Primary Risks

  1. Pool Memory Exhaustion (Medium)

    • Pools may not have sufficient pages for high concurrency
    • Mitigation: Implement dynamic page allocation on demand
  2. ACE Thread Safety (Low-Medium)

    • Background thread may have race conditions
    • Mitigation: Code review of ACE controller threading
  3. Memory Fragmentation (Low)

    • Bridge classes (40KB, 52KB) may cause fragmentation
    • Mitigation: Monitor fragmentation metrics
  4. Learning Algorithm Instability (Low)

    • UCB1 algorithm may make poor decisions initially
    • Mitigation: Conservative initial parameters
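The mitigation for the first risk, dynamic page allocation on demand, might look roughly like the following. This is a single-threaded sketch under stated assumptions: `size_class_pool`, `pool_refill`, and `pool_alloc` are hypothetical names, `malloc()` stands in for whatever page source the real pool uses, and the backing chunk is never returned.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct free_node { struct free_node *next; } free_node;

typedef struct {
    size_t     block_size;  /* e.g. 40KB for a Bridge class */
    free_node *free_head;   /* singly linked list of free blocks */
    size_t     free_count;
} size_class_pool;

/* Carve one backing chunk into `blocks` blocks and push them on the
 * freelist. Returns 1 on success, 0 on failure. */
static int pool_refill(size_class_pool *p, size_t blocks) {
    if (blocks == 0 || p->block_size < sizeof(free_node)) return 0;
    if (p->block_size % sizeof(void *) != 0) return 0;  /* keep nodes aligned */
    if (blocks > (size_t)-1 / p->block_size) return 0;  /* overflow guard */
    char *chunk = malloc(p->block_size * blocks);
    if (chunk == NULL) return 0;
    for (size_t i = 0; i < blocks; i++) {
        free_node *n = (free_node *)(chunk + i * p->block_size);
        n->next = p->free_head;
        p->free_head = n;
        p->free_count++;
    }
    return 1;
}

/* Pop a block, refilling on demand instead of returning NULL when empty. */
static void *pool_alloc(size_class_pool *p, size_t refill_blocks) {
    if (p->free_head == NULL && !pool_refill(p, refill_blocks)) return NULL;
    free_node *n = p->free_head;
    p->free_head = n->next;
    p->free_count--;
    return n;
}
```

The difference from the failing path described earlier is the first line of `pool_alloc`: an empty freelist triggers a refill rather than an immediate NULL, so the caller only falls back to mmap when the page source itself is exhausted.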

Alternative Approaches

Alternative 1: Remove ACE, Direct Pool Access

Skip ACE layer entirely and call pools directly from main allocation path. This removes the learning layer but simplifies the code.

Pros: Simpler, fewer components
Cons: Loses adaptive optimization potential
Effort: 8-10 hours

Alternative 2: Increase mmap Threshold

Lower the threshold from 2MB to 32KB so only truly large allocations use mmap.

Pros: Simple config change
Cons: Doesn't fix the core problem, just shifts it
Effort: 1 hour

Alternative 3: Implement Simple Cache

Replace ACE with a basic per-thread cache without learning.

Pros: Predictable performance
Cons: Loses adaptation benefits
Effort: 12-16 hours

Testing Strategy

  1. Unit Tests

    • Verify ACE returns non-NULL for each size class
    • Test pool refill logic
    • Validate Bridge class allocation
  2. Integration Tests

    • Run full benchmark suite with ACE enabled
    • Compare against baseline (System malloc)
    • Monitor memory usage
  3. Stress Tests

    • High concurrency (32+ threads)
    • Mixed size allocations
    • Long-running stability test (1+ hour)
  4. Performance Validation

    • Target: 50-80M ops/s for bench_mid_large_mt
    • Must maintain Tiny performance gains
    • No regression in other benchmarks
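The first unit test in the list could be built on a helper like the one below. It is a sketch that is independent of any particular allocator: the helper name and signature are hypothetical, and it only checks that each size yields a usable non-NULL block, not which internal path (pool or mmap) served the request.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef void *(*alloc_fn)(size_t);
typedef void  (*free_fn)(void *);

/* Allocate one block per size, touch every byte, then free it.
 * Returns the number of sizes for which the allocator returned NULL. */
static size_t count_alloc_failures(alloc_fn alloc, free_fn release,
                                   const size_t *sizes, size_t n) {
    size_t failures = 0;
    for (size_t i = 0; i < n; i++) {
        void *p = alloc(sizes[i]);
        if (p == NULL) { failures++; continue; }
        memset(p, 0xA5, sizes[i]);  /* fault in the pages; under a sanitizer
                                       this also catches undersized blocks */
        release(p);
    }
    return failures;
}
```

Passing `malloc` and `free` gives a baseline run; swapping in the HAKMEM entry points and combining the result with pool hit counters would show whether the blocks actually came from the pools.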

Effort Estimate

Immediate Fix (Enable ACE): 1 hour

  • Set environment variable
  • Verify basic functionality
  • Document in README

Full Solution (Initialize Pools): 4-7 hours

  • Code changes: 2-3 hours
  • Testing: 2-3 hours
  • Documentation: 1 hour

Production Hardening: 8-12 hours (optional)

  • Add monitoring/metrics
  • Implement auto-tuning
  • Stress testing

Recommendations

  1. Immediate Action: Enable ACE via environment variable for testing

    export HAKMEM_ACE_ENABLED=1
    
  2. Short-term Fix: Implement pool initialization fixes (4-7 hours)

    • Priority: HIGH
    • Impact: Recovers the -88% Mid-Large regression
    • Risk: Medium (needs thorough testing)
  3. Long-term: Consider making ACE enabled by default after validation

    • Add comprehensive tests
    • Monitor production metrics
    • Document tuning parameters
  4. Configuration: Add startup configuration to set optimal defaults

    # Recommended .hakmemrc or startup script
    export HAKMEM_ACE_ENABLED=1
    export HAKMEM_ACE_FAST_INTERVAL_MS=100  # More aggressive adaptation
    export HAKMEM_ACE_LOG_LEVEL=2           # Verbose logging initially
    

Conclusion

The -88% Mid-Large MT regression has two causes: ACE is disabled by default, and the MidPool/LargePool pools are not populated even when ACE is enabled. Either condition forces all Mid-Large allocations through slow mmap. The fix is to enable ACE and ensure the pools are properly initialized. This should recover the +171% performance advantage HAKMEM previously demonstrated for Mid-Large allocations. With 4-7 hours of work, we can restore HAKMEM's competitive advantage in this critical size range.