hakmem/CURRENT_TASK.md
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component, so that the root cause of Out-Of-Memory (OOM) issues in ACE allocations can be identified precisely.

Key changes include:
- **ACE Tracing Implementation**:
  - Added an environment variable to enable or disable detailed logging of allocation failures.
  - Instrumented the allocation paths to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the build so that a missing object file is properly linked into the shared library, resolving a linkage error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the LD_PRELOAD wrapper's behavior, particularly its interaction with the wrapper's fallback checks.
  - Enabled debugging flags in the test environment to prevent unintended fallbacks to the system allocator for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within allocation interception and routing. These temporary logs have been removed.
  - Created a test program to facilitate testing of the tracing features.

This feature will aid in diagnosing and resolving allocation-related OOM issues by exposing which failure pathway was taken.
2025-12-01 16:37:59 +09:00


HAKMEM Bug Investigation: OOM Spam (ACE 33KB) - December 1, 2025

Objective

Investigate and provide a mechanism to diagnose "OOM spam caused by continuous NULL returns for ACE 33KB allocations." The goal is to distinguish between:

  1. Threshold issues (size class rounding)
  2. Cache exhaustion (pool empty)
  3. Mapping failures (OS mmap failure)

Work Performed & Resolution

  1. Implemented ACE Tracing:

    • Added a runtime-controlled tracing mechanism via the HAKMEM_ACE_TRACE=1 environment variable.
    • Instrumentation was added to core/hakmem_ace.c, core/hakmem_pool.c, and core/hakmem_l25_pool.c to log specific failure reasons to stderr.
    • Log messages distinguish between [ACE-FAIL] Threshold, [ACE-FAIL] Exhaustion, and [ACE-FAIL] MapFail.
  2. Resolved Build & Linkage Issues:

    • Undefined Symbol classify_ptr: Identified that core/box/front_gate_classifier.c was not correctly linked into libhakmem.so. The Makefile was updated to include core/box/front_gate_classifier_shared.o in the SHARED_OBJS list.
    • Removed Temporary Debug Logs: All interim write(2, ...) and fprintf(stderr, ...) debug statements introduced during the investigation have been removed to restore a clean code state.
  3. Clarified malloc Wrapper Behavior:

    • Discovered that libhakmem.so's malloc wrapper had logic to force fallback to libc's malloc for larger allocations (> TINY_MAX_SIZE) and when jemalloc was detected, especially under LD_PRELOAD.
    • This was preventing 33KB allocations from reaching the hakmem ACE layer.
    • Solution: Identified the necessary environment variables to disable these bypasses for testing purposes: HAKMEM_LD_SAFE=0 and HAKMEM_LD_BLOCK_JEMALLOC=0.
  4. Verified Trace Functionality:

    • A test program (test_ace_trace.c) was used to allocate 33KB.
    • With HAKMEM_WMAX_MID=1.01 and HAKMEM_WMAX_LARGE=1.01 set to force threshold failures, the run produced [ACE-FAIL] Threshold logs, confirming that the tracing mechanism works as intended.
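
Items 1 and 4 above can be sketched roughly as follows. This is a minimal illustration, not the code in core/hakmem_ace.c: the function names, the enum, the exact log line format, and the W_MAX rule (a size class may serve a request only if class_size <= wmax * size) are all assumptions made for the example. Only the HAKMEM_ACE_TRACE variable and the three [ACE-FAIL] tags come from the notes above.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical failure reasons mirroring the three [ACE-FAIL] tags. */
typedef enum {
    ACE_FAIL_THRESHOLD,   /* size class rounding exceeds W_MAX */
    ACE_FAIL_EXHAUSTION,  /* pool is empty */
    ACE_FAIL_MAPFAIL      /* OS mapping (mmap) failed */
} ace_fail_reason;

const char *ace_fail_name(ace_fail_reason r) {
    switch (r) {
    case ACE_FAIL_THRESHOLD:  return "Threshold";
    case ACE_FAIL_EXHAUSTION: return "Exhaustion";
    case ACE_FAIL_MAPFAIL:    return "MapFail";
    }
    return "Unknown";
}

/* Returns 1 only when HAKMEM_ACE_TRACE is set to exactly "1".
 * A real allocator would likely cache this instead of calling getenv
 * on every failure; that is omitted here for brevity. */
int ace_trace_enabled(void) {
    const char *v = getenv("HAKMEM_ACE_TRACE");
    return v != NULL && strcmp(v, "1") == 0;
}

/* Assumed W_MAX rule: the class is acceptable only if its size does not
 * exceed wmax times the requested size. */
int ace_within_wmax(size_t size, size_t class_size, double wmax) {
    return (double)class_size <= wmax * (double)size;
}

/* Formats one trace line into buf (assumed format); returns snprintf's result. */
int ace_format_fail(char *buf, size_t cap, ace_fail_reason r,
                    size_t size, double wmax) {
    return snprintf(buf, cap, "[ACE-FAIL] %s: size=%zu wmax=%.2f\n",
                    ace_fail_name(r), size, wmax);
}

/* Writes the trace line to stderr when tracing is enabled, else does nothing. */
void ace_trace_fail(ace_fail_reason r, size_t size, double wmax) {
    char line[128];
    if (!ace_trace_enabled())
        return;
    ace_format_fail(line, sizeof line, r, size, wmax);
    fputs(line, stderr);
}
```

Under the assumed rule, a 33000-byte request against a hypothetical 40960-byte class passes with wmax=1.60 (limit 52800) but fails with wmax=1.01 (limit about 33330), which is why lowering the W_MAX values forces the Threshold path.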

How to Use the Trace Feature (for Users)

To diagnose the 33KB OOM spam issue in your application:

  1. Ensure Correct libhakmem.so Build: Make sure libhakmem.so is built without POOL_TLS_PHASE1 enabled (e.g., make shared POOL_TLS_PHASE1=0). The current libhakmem.so reflects this.

  2. Run Your Application with Specific Environment Variables:

    export HAKMEM_FRONT_GATE_UNIFIED=0
    export HAKMEM_SMALLMID_ENABLE=0
    export HAKMEM_FORCE_LIBC_ALLOC=0
    export HAKMEM_LD_SAFE=0            # Disable the LD_PRELOAD safe-mode bypass (see item 3 above)
    export HAKMEM_LD_BLOCK_JEMALLOC=0
    export HAKMEM_ACE_TRACE=1          # Crucial for seeing the logs
    export HAKMEM_WMAX_MID=1.60        # Use default or adjust as needed for W_MAX analysis
    export HAKMEM_WMAX_LARGE=1.30      # Use default or adjust as needed for W_MAX analysis
    export LD_PRELOAD=/path/to/hakmem/libhakmem.so
    
    ./your_application 2> stderr.log   # Redirect stderr to a file for analysis
    
  3. Analyze stderr.log: Look for [ACE-FAIL] messages to determine whether each failure is a Threshold (e.g., size=33000 wmax=...), Exhaustion (pool empty), or MapFail (OS allocation error) case. The dominant category points to the root cause of the OOM spam.

This setup will allow for precise diagnosis of 33KB allocation failures within the hakmem ACE component.
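
For step 3, a small tally over stderr.log is often enough to see which category dominates. The helper below is a sketch under assumptions: the struct and function names are hypothetical, and it relies only on each trace line containing the [ACE-FAIL] tag followed by one of the three reason words described above.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical per-category counters for [ACE-FAIL] lines. */
typedef struct {
    unsigned long threshold;
    unsigned long exhaustion;
    unsigned long mapfail;
    unsigned long other;      /* tagged lines with an unrecognized reason */
} ace_fail_counts;

/* Updates the counters if the line carries an [ACE-FAIL] tag;
 * lines without the tag are ignored. */
void ace_count_line(const char *line, ace_fail_counts *c) {
    const char *tag = strstr(line, "[ACE-FAIL]");
    if (tag == NULL)
        return;
    if (strstr(tag, "Threshold") != NULL)
        c->threshold++;
    else if (strstr(tag, "Exhaustion") != NULL)
        c->exhaustion++;
    else if (strstr(tag, "MapFail") != NULL)
        c->mapfail++;
    else
        c->other++;
}

/* Tallies every line of the log file at path.
 * Returns 0 on success, -1 if the file cannot be opened.
 * Lines longer than the buffer are read in chunks, so a tag split
 * across a chunk boundary would be missed. */
int ace_count_file(const char *path, ace_fail_counts *c) {
    char line[1024];
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL)
        ace_count_line(line, c);
    fclose(f);
    return 0;
}
```

Zero-initialize an ace_fail_counts, call ace_count_file("stderr.log", &counts), and compare the three counters: a large threshold count suggests revisiting the W_MAX settings, while exhaustion or mapfail counts point at pool sizing or OS-level mapping failures respectively.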