This commit introduces a tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. It allows the root cause of Out-Of-Memory (OOM) issues related to ACE allocations to be identified.

Key changes include:

- **ACE Tracing Implementation**:
  - Added the `HAKMEM_ACE_TRACE` environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented `core/hakmem_ace.c`, `core/hakmem_pool.c`, and `core/hakmem_l25_pool.c` to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the `Makefile` to ensure `core/box/front_gate_classifier.c` is properly linked into `libhakmem.so`, resolving an undefined symbol (`classify_ptr`) error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the `malloc` wrapper's behavior under `LD_PRELOAD`, particularly its fallback checks for allocation size and jemalloc detection.
  - Identified the environment flags (`HAKMEM_LD_SAFE=0`, `HAKMEM_LD_BLOCK_JEMALLOC=0`) that prevent unintended fallbacks to `libc`'s `malloc` for non-tiny allocations, so that the `hakmem` allocator can be tested end to end.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed.
  - Created `test_ace_trace.c` to facilitate testing of the tracing features.

This feature will aid in diagnosing and resolving allocation-related OOM issues in `hakmem` by showing which failure pathway was taken.
HAKMEM Bug Investigation: OOM Spam (ACE 33KB) - December 1, 2025
Objective
Investigate and provide a mechanism to diagnose "OOM spam caused by continuous NULL returns for ACE 33KB allocations." The goal is to distinguish between:
- Threshold issues (size class rounding)
- Cache exhaustion (pool empty)
- Mapping failures (OS mmap failure)
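To make the three categories concrete, here is a minimal sketch of how an environment-gated trace helper might label them. Only the `HAKMEM_ACE_TRACE` variable and the `[ACE-FAIL]` prefix with its three labels come from this report; the enum, the function names, and the `size=` message layout are hypothetical and are not taken from the hakmem sources.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical failure categories mirroring the three cases listed above. */
typedef enum {
    ACE_FAIL_THRESHOLD,   /* size class rounding rejected the request */
    ACE_FAIL_EXHAUSTION,  /* the pool had no free block */
    ACE_FAIL_MAPFAIL      /* the OS mapping call failed */
} ace_fail_reason;

/* Returns the label printed after the [ACE-FAIL] prefix. */
const char *ace_fail_label(ace_fail_reason r) {
    switch (r) {
    case ACE_FAIL_THRESHOLD:  return "Threshold";
    case ACE_FAIL_EXHAUSTION: return "Exhaustion";
    case ACE_FAIL_MAPFAIL:    return "MapFail";
    }
    return "Unknown";
}

/* Tracing is on only when HAKMEM_ACE_TRACE is exactly "1" (assumed rule). */
int ace_trace_enabled(void) {
    const char *v = getenv("HAKMEM_ACE_TRACE");
    return v != NULL && strcmp(v, "1") == 0;
}

/* Formats one trace line into buf; the "size=" layout is an assumption. */
int ace_format_fail(char *buf, size_t cap, ace_fail_reason r, size_t size) {
    return snprintf(buf, cap, "[ACE-FAIL] %s: size=%zu",
                    ace_fail_label(r), size);
}

/* Writes the trace line to stderr when tracing is enabled. */
void ace_trace_fail(ace_fail_reason r, size_t size) {
    char line[128];
    if (!ace_trace_enabled()) {
        return;
    }
    ace_format_fail(line, sizeof line, r, size);
    fprintf(stderr, "%s\n", line);
}
```

A real allocator would probably cache the `getenv` result and avoid stdio calls that may themselves allocate; this sketch ignores both concerns for brevity.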
Work Performed & Resolution
- Implemented ACE Tracing:
  - Added a runtime-controlled tracing mechanism via the HAKMEM_ACE_TRACE=1 environment variable.
  - Instrumentation was added to core/hakmem_ace.c, core/hakmem_pool.c, and core/hakmem_l25_pool.c to log specific failure reasons to stderr.
  - Log messages distinguish between [ACE-FAIL] Threshold, [ACE-FAIL] Exhaustion, and [ACE-FAIL] MapFail.
- Resolved Build & Linkage Issues:
  - Undefined Symbol classify_ptr: Identified that core/box/front_gate_classifier.c was not correctly linked into libhakmem.so. The Makefile was updated to include core/box/front_gate_classifier_shared.o in the SHARED_OBJS list.
  - Removed Temporary Debug Logs: All interim write(2, ...) and fprintf(stderr, ...) debug statements introduced during the investigation have been removed to restore a clean code state.
- Clarified malloc Wrapper Behavior:
  - Discovered that libhakmem.so's malloc wrapper had logic to force fallback to libc's malloc for larger allocations (> TINY_MAX_SIZE) and when jemalloc was detected, especially under LD_PRELOAD.
  - This was preventing 33KB allocations from reaching the hakmem ACE layer.
  - Solution: Identified the necessary environment variables to disable these bypasses for testing purposes: HAKMEM_LD_SAFE=0 and HAKMEM_LD_BLOCK_JEMALLOC=0.
- Verified Trace Functionality:
  - A test program (test_ace_trace.c) was used to allocate 33KB.
  - By setting HAKMEM_WMAX_MID=1.01 and HAKMEM_WMAX_LARGE=1.01 (to force threshold failures), the [ACE-FAIL] Threshold logs were successfully generated, confirming the tracing mechanism works as intended.
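The contents of test_ace_trace.c are not shown in this report. As a rough illustration of what such a probe could look like, the sketch below allocates a block of a given size, touches it, and frees it. The function names `probe_alloc` and `probe_many` are hypothetical, and the 33000-byte figure is borrowed from the example log message later in this report.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Allocates `size` bytes, writes to them, and frees them.
 * Returns 1 on success and 0 if malloc returned NULL. */
int probe_alloc(size_t size) {
    void *p = malloc(size);
    if (p == NULL) {
        fprintf(stderr, "probe: malloc(%zu) returned NULL\n", size);
        return 0;
    }
    memset(p, 0xA5, size);  /* touch the block so its pages are really used */
    free(p);
    return 1;
}

/* Repeats the probe `rounds` times and returns the number of failures. */
int probe_many(size_t size, int rounds) {
    int failures = 0;
    for (int i = 0; i < rounds; i++) {
        if (!probe_alloc(size)) {
            failures++;
        }
    }
    return failures;
}
```

Calling `probe_many(33000, 100)` from a `main` function and running the binary with libhakmem.so preloaded and HAKMEM_ACE_TRACE=1 set should route the requests through whichever allocator path is active, so any `[ACE-FAIL]` lines would appear on stderr next to the probe's own messages.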
How to Use the Trace Feature (for Users)
To diagnose the 33KB OOM spam issue in your application:
- Ensure Correct libhakmem.so Build: Make sure libhakmem.so is built without POOL_TLS_PHASE1 enabled (e.g., make shared POOL_TLS_PHASE1=0). The current libhakmem.so reflects this.
- Run Your Application with Specific Environment Variables:

      export HAKMEM_FRONT_GATE_UNIFIED=0
      export HAKMEM_SMALLMID_ENABLE=0
      export HAKMEM_FORCE_LIBC_ALLOC=0
      export HAKMEM_LD_BLOCK_JEMALLOC=0
      export HAKMEM_ACE_TRACE=1        # Crucial for seeing the logs
      export HAKMEM_WMAX_MID=1.60      # Use default or adjust as needed for W_MAX analysis
      export HAKMEM_WMAX_LARGE=1.30    # Use default or adjust as needed for W_MAX analysis
      export LD_PRELOAD=/path/to/hakmem/libhakmem.so
      ./your_application 2> stderr.log # Redirect stderr to a file for analysis

- Analyze stderr.log: Look for [ACE-FAIL] messages to determine if the issue is a Threshold (e.g., size=33000 wmax=...), Exhaustion (pool empty), or MapFail (OS allocation error). This will provide the necessary data to pinpoint the root cause of the OOM spam.
This setup will allow for precise diagnosis of 33KB allocation failures within the hakmem ACE component.
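Once stderr.log has been captured, a quick tally of the three categories shows which failure pathway dominates. The sketch below counts the three `[ACE-FAIL]` prefixes in a text buffer; the struct and function names are hypothetical, and the sketch assumes only that each log line starts with one of the prefixes named in this report.

```c
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of `needle` in `text`. */
size_t count_occurrences(const char *text, const char *needle) {
    size_t n = 0;
    size_t len = strlen(needle);
    if (len == 0) {
        return 0;
    }
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + len, needle)) {
        n++;
    }
    return n;
}

/* Per-category totals for one captured log. */
typedef struct {
    size_t threshold;
    size_t exhaustion;
    size_t mapfail;
} ace_fail_counts;

/* Tallies the three [ACE-FAIL] categories found in the log text. */
ace_fail_counts tally_ace_failures(const char *log_text) {
    ace_fail_counts c;
    c.threshold  = count_occurrences(log_text, "[ACE-FAIL] Threshold");
    c.exhaustion = count_occurrences(log_text, "[ACE-FAIL] Exhaustion");
    c.mapfail    = count_occurrences(log_text, "[ACE-FAIL] MapFail");
    return c;
}
```

To use it, read stderr.log into a NUL-terminated buffer and pass that buffer to `tally_ace_failures`. A log dominated by Threshold entries points at the W_MAX settings, while Exhaustion or MapFail entries point at pool sizing or OS-level memory limits.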