hakmem/CURRENT_TASK.md
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added the `HAKMEM_ACE_TRACE` environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented `core/hakmem_ace.c`, `core/hakmem_pool.c`, and `core/hakmem_l25_pool.c` to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the `Makefile` to ensure `core/box/front_gate_classifier.c` is properly linked into `libhakmem.so`, resolving an undefined symbol (`classify_ptr`) error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for the `LD_PRELOAD` environment to prevent unintended fallbacks to `libc`'s `malloc` for non-tiny allocations, allowing comprehensive testing of the `hakmem` allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created `test_ace_trace.c` to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in `hakmem` by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00

## HAKMEM Bug Investigation: OOM Spam (ACE 33KB) - December 1, 2025
### Objective
Investigate and provide a mechanism to diagnose "OOM spam caused by continuous NULL returns for ACE 33KB allocations." The goal is to distinguish between:
1. Threshold issues (size class rounding)
2. Cache exhaustion (pool empty)
3. Mapping failures (OS mmap failure)
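
As a rough illustration of what distinguishing these three cases can look like in code, the sketch below tags each failed allocation with a reason and prints it only when `HAKMEM_ACE_TRACE=1` is set. The enum, the function names, and the exact line format are assumptions made for illustration; they are not taken from the hakmem sources.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reason codes mirroring the three cases listed above. */
typedef enum {
    ACE_FAIL_THRESHOLD,   /* size class rounding rejected the request */
    ACE_FAIL_EXHAUSTION,  /* the pool had no free block */
    ACE_FAIL_MAPFAIL      /* the OS mapping (mmap) failed */
} ace_fail_reason;

/* Map a reason code to the label used in the log line. */
const char *ace_fail_label(ace_fail_reason r) {
    switch (r) {
    case ACE_FAIL_THRESHOLD:  return "Threshold";
    case ACE_FAIL_EXHAUSTION: return "Exhaustion";
    case ACE_FAIL_MAPFAIL:    return "MapFail";
    }
    return "Unknown";
}

/* Read the HAKMEM_ACE_TRACE switch once and cache the result.
 * Note: this simple cache is not thread-safe. */
int ace_trace_enabled(void) {
    static int cached = -1;  /* -1 means "not read yet" */
    if (cached < 0) {
        const char *v = getenv("HAKMEM_ACE_TRACE");
        cached = (v != NULL && strcmp(v, "1") == 0);
    }
    return cached;
}

/* Format one trace line into buf (assumed format, for illustration only). */
int ace_format_fail(char *buf, size_t cap, ace_fail_reason r, size_t size) {
    return snprintf(buf, cap, "[ACE-FAIL] %s size=%zu\n", ace_fail_label(r), size);
}

/* Emit the line to stderr only when tracing is switched on. */
void ace_trace_fail(ace_fail_reason r, size_t size) {
    char line[96];
    if (!ace_trace_enabled()) return;
    ace_format_fail(line, sizeof line, r, size);
    fputs(line, stderr);
}
```

A failing path would call, for example, `ace_trace_fail(ACE_FAIL_EXHAUSTION, size)` just before returning NULL. Inside a real allocator, `stdio` calls may themselves allocate, so a raw `write(2, ...)` may be the safer choice there.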
### Work Performed & Resolution
1. **Implemented ACE Tracing**:
   * Added a runtime-controlled tracing mechanism via the `HAKMEM_ACE_TRACE=1` environment variable.
   * Instrumentation was added to `core/hakmem_ace.c`, `core/hakmem_pool.c`, and `core/hakmem_l25_pool.c` to log specific failure reasons to `stderr`.
   * Log messages distinguish between `[ACE-FAIL] Threshold`, `[ACE-FAIL] Exhaustion`, and `[ACE-FAIL] MapFail`.
2. **Resolved Build & Linkage Issues**:
   * **Undefined Symbol `classify_ptr`**: Identified that `core/box/front_gate_classifier.c` was not correctly linked into `libhakmem.so`. The `Makefile` was updated to include `core/box/front_gate_classifier_shared.o` in the `SHARED_OBJS` list.
   * **Removed Temporary Debug Logs**: All interim `write(2, ...)` and `fprintf(stderr, ...)` debug statements introduced during the investigation have been removed to restore a clean code state.
3. **Clarified `malloc` Wrapper Behavior**:
   * Discovered that `libhakmem.so`'s `malloc` wrapper had logic to force fallback to `libc`'s `malloc` for larger allocations (`> TINY_MAX_SIZE`) and when `jemalloc` was detected, especially under `LD_PRELOAD`.
   * This was preventing 33KB allocations from reaching the `hakmem` ACE layer.
   * **Solution**: Identified the necessary environment variables to disable these bypasses for testing purposes: `HAKMEM_LD_SAFE=0` and `HAKMEM_LD_BLOCK_JEMALLOC=0`.
4. **Verified Trace Functionality**:
   * A test program (`test_ace_trace.c`) was used to allocate 33KB.
   * By setting `HAKMEM_WMAX_MID=1.01` and `HAKMEM_WMAX_LARGE=1.01` (to force threshold failures), the `[ACE-FAIL] Threshold` logs were successfully generated, confirming the tracing mechanism works as intended.
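
For readers who want to reproduce the verification step, a test driver along these lines may be enough. This is a minimal sketch and not the contents of `test_ace_trace.c`: the function name, the 33000-byte request size, and the choice to keep all blocks live before freeing them are assumptions.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: 33000 bytes stands in for the "33KB" request size. */
#define ACE_TEST_SIZE ((size_t)33000)

/* Allocate `rounds` blocks of `size` bytes, touch each one, then free them all.
 * Returns how many malloc calls returned NULL, or -1 if the bookkeeping
 * array itself could not be allocated. */
int count_failed_allocs(size_t size, int rounds) {
    int failed = 0;
    void **blocks;

    if (rounds <= 0) return 0;
    blocks = calloc((size_t)rounds, sizeof *blocks);
    if (blocks == NULL) return -1;

    for (int i = 0; i < rounds; i++) {
        blocks[i] = malloc(size);
        if (blocks[i] == NULL) {
            failed++;                       /* a NULL return is what the trace explains */
        } else {
            memset(blocks[i], 0xA5, size);  /* write to the block so it is really used */
        }
    }

    /* Release everything; free(NULL) is a no-op, so failed slots are safe. */
    for (int i = 0; i < rounds; i++) free(blocks[i]);
    free(blocks);
    return failed;
}
```

Calling `count_failed_allocs(ACE_TEST_SIZE, 1000)` from a `main` and running the binary with `libhakmem.so` preloaded and `HAKMEM_ACE_TRACE=1` set should route the requests through the wrapper, provided the bypasses described above are disabled.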
### How to Use the Trace Feature (for Users)
To diagnose the 33KB OOM spam issue in your application:
1. **Ensure Correct `libhakmem.so` Build**:
Make sure `libhakmem.so` is built without `POOL_TLS_PHASE1` enabled (e.g., `make shared POOL_TLS_PHASE1=0`). The current `libhakmem.so` reflects this.
2. **Run Your Application with Specific Environment Variables**:
```bash
export HAKMEM_FRONT_GATE_UNIFIED=0
export HAKMEM_SMALLMID_ENABLE=0
export HAKMEM_FORCE_LIBC_ALLOC=0
export HAKMEM_LD_SAFE=0           # Listed above as required to disable the wrapper's libc bypass
export HAKMEM_LD_BLOCK_JEMALLOC=0
export HAKMEM_ACE_TRACE=1         # Crucial for seeing the logs
export HAKMEM_WMAX_MID=1.60       # Use default or adjust as needed for W_MAX analysis
export HAKMEM_WMAX_LARGE=1.30     # Use default or adjust as needed for W_MAX analysis
# Set LD_PRELOAD for this command only, so other commands in the shell are not preloaded.
# Redirect stderr to a file for analysis.
LD_PRELOAD=/path/to/hakmem/libhakmem.so ./your_application 2> stderr.log
```
3. **Analyze `stderr.log`**:
Look for `[ACE-FAIL]` messages to determine whether the failure is a `Threshold` (e.g., `size=33000 wmax=...`), `Exhaustion` (pool empty), or `MapFail` (OS allocation error) case. This provides the data needed to pinpoint the root cause of the 33KB OOM spam within the hakmem ACE component.
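
When the log is large, tallying the three categories can make the dominant failure mode obvious at a glance. The helper below is a hedged sketch: the struct and function names are hypothetical, and it assumes only that each trace line contains `[ACE-FAIL] ` followed directly by the category name, as in the messages quoted above.

```c
#include <stdio.h>
#include <string.h>

/* Counts of each [ACE-FAIL] category seen in a log. */
typedef struct {
    long threshold;
    long exhaustion;
    long mapfail;
} ace_fail_counts;

/* Classify one log line; lines without an [ACE-FAIL] tag are ignored. */
void ace_count_line(const char *line, ace_fail_counts *c) {
    const char *tag = strstr(line, "[ACE-FAIL] ");
    if (tag == NULL) return;
    tag += strlen("[ACE-FAIL] ");  /* skip past the tag to the category name */
    if (strncmp(tag, "Threshold", 9) == 0)        c->threshold++;
    else if (strncmp(tag, "Exhaustion", 10) == 0) c->exhaustion++;
    else if (strncmp(tag, "MapFail", 7) == 0)     c->mapfail++;
}

/* Read a log file line by line and add to the counts in *c.
 * Returns 0 on success, -1 if the file cannot be opened.
 * Lines longer than the buffer are processed in chunks, so a tag that
 * straddles a chunk boundary would be missed. */
int ace_count_file(const char *path, ace_fail_counts *c) {
    char line[1024];
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        ace_count_line(line, c);
    }
    fclose(f);
    return 0;
}
```

A caller would zero-initialize an `ace_fail_counts`, pass it to `ace_count_file("stderr.log", &counts)`, and print the three fields.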