feat: Add ACE allocation failure tracing and debug hooks

This commit adds tracing for allocation failures in the Adaptive Cache Engine (ACE) component, so that the root cause of Out-Of-Memory (OOM) issues related to ACE allocations can be identified.

Key changes include:
- **ACE Tracing Implementation**:
  - Added the `HAKMEM_ACE_TRACE` environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented `core/hakmem_ace.c`, `core/hakmem_pool.c`, and `core/hakmem_l25_pool.c` to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected the `Makefile` to ensure `core/box/front_gate_classifier.c` is properly linked into `libhakmem.so`, resolving an undefined-symbol error (`classify_ptr`).
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated the `malloc` wrapper's behavior under `LD_PRELOAD`, particularly its interaction with the size-limit and `jemalloc` detection checks.
  - Enabled debugging flags for the `LD_PRELOAD` environment to prevent unintended fallbacks to `libc`'s `malloc` for non-tiny allocations, allowing comprehensive testing of the `hakmem` allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within `malloc` interception and allocation routing. These temporary logs have been removed.
  - Created `test_ace_trace.c` to facilitate testing of the tracing features.

This feature helps diagnose and resolve allocation-related OOM issues in `hakmem` by showing which failure path each failed allocation took.
Moe Charm (CI)
2025-12-01 16:37:59 +09:00
parent 2bd8da9267
commit 4ef0171bc0
85 changed files with 5930 additions and 479 deletions


@@ -1,50 +1,53 @@
# Current Task: Phase 9-2 Refactoring (Complete) & Phase 10 Preparation
## HAKMEM Bug Investigation: OOM Spam (ACE 33KB) - December 1, 2025
**Date**: 2025-12-01
**Status**: **COMPLETE** (Phase 9-2) / **PLANNING** (Phase 10)
**Goal**: Legacy Backend Removal, Shared Pool Unification, and Type Safety
### Objective
Investigate and provide a mechanism to diagnose "OOM spam caused by continuous NULL returns for ACE 33KB allocations." The goal is to distinguish between:
1. Threshold issues (size class rounding)
2. Cache exhaustion (pool empty)
3. Mapping failures (OS mmap failure)
---
### Work Performed & Resolution
## Phase 9-2 Achievements (Completed)
1. **Implemented ACE Tracing**:
* Added a runtime-controlled tracing mechanism via the `HAKMEM_ACE_TRACE=1` environment variable.
* Instrumentation was added to `core/hakmem_ace.c`, `core/hakmem_pool.c`, and `core/hakmem_l25_pool.c` to log specific failure reasons to `stderr`.
* Log messages distinguish between `[ACE-FAIL] Threshold`, `[ACE-FAIL] Exhaustion`, and `[ACE-FAIL] MapFail`.
1. **Legacy Backend Removal & Unification (2025-12-01)**
* **Eliminated Fallback**: Removed `hak_tiny_alloc_superslab_backend_legacy` fallback. Shared Pool is now the sole backend (`hak_tiny_alloc_superslab_box` -> `hak_tiny_alloc_superslab_backend_shared`).
 * **Soft Cap Removed**: Removed the artificial "Soft Cap" limit in Shared Pool Stage 3, allowing it to handle the full workload.
* **EMPTY Recycling**: Implemented `SLAB_TRY_RECYCLE` with atomic batch decrement of `meta->used` in `_ss_remote_drain_to_freelist_unsafe`. This ensures EMPTY slabs are immediately returned to the global pool.
* **Race Condition Fix**: Moved `remove_superslab_from_legacy_head(ss)` to the *start* of `shared_pool_release_slab` to prevent Legacy Backend from allocating from a slab being recycled. Added `total_active_blocks` check before freeing.
* **Performance**: **50.3 M ops/s** in WS8192 benchmark (vs 16.5 M baseline). OOM/Crash issues resolved.
2. **Resolved Build & Linkage Issues**:
* **Undefined Symbol `classify_ptr`**: Identified that `core/box/front_gate_classifier.c` was not correctly linked into `libhakmem.so`. The `Makefile` was updated to include `core/box/front_gate_classifier_shared.o` in the `SHARED_OBJS` list.
* **Removed Temporary Debug Logs**: All interim `write(2, ...)` and `fprintf(stderr, ...)` debug statements introduced during the investigation have been removed to restore a clean code state.
2. **Critical Fixes (Deadlock & OOM)**
* **Deadlock**: `shared_pool_acquire_slab` releases `alloc_lock` before `superslab_allocate`.
* **Is Empty Return**: `tiny_free_local_box` now returns `int is_empty` status to allow safe, race-free recycling by the caller.
3. **Clarified `malloc` Wrapper Behavior**:
* Discovered that `libhakmem.so`'s `malloc` wrapper had logic to force fallback to `libc`'s `malloc` for larger allocations (`> TINY_MAX_SIZE`) and when `jemalloc` was detected, especially under `LD_PRELOAD`.
* This was preventing 33KB allocations from reaching the `hakmem` ACE layer.
* **Solution**: Identified the necessary environment variables to disable these bypasses for testing purposes: `HAKMEM_LD_SAFE=0` and `HAKMEM_LD_BLOCK_JEMALLOC=0`.
3. **Code Refactoring**
* Modularized `hakmem_shared_pool.c` into `acquire/release/internal` components.
4. **Verified Trace Functionality**:
* A test program (`test_ace_trace.c`) was used to allocate 33KB.
* By setting `HAKMEM_WMAX_MID=1.01` and `HAKMEM_WMAX_LARGE=1.01` (to force threshold failures), the `[ACE-FAIL] Threshold` logs were successfully generated, confirming the tracing mechanism works as intended.
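
The runtime-gated tracing described above can be pictured with a minimal sketch. This is not the code in `core/hakmem_ace.c`: the names `ace_fail_reason_t`, `ace_trace_enabled`, `ace_format_fail`, and `ace_trace_fail` are hypothetical, and the exact message layout after the `[ACE-FAIL]` tag is an assumption. Only the `HAKMEM_ACE_TRACE=1` switch and the three reason tags come from the notes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical reason codes mirroring the three [ACE-FAIL] tags. */
typedef enum {
    ACE_FAIL_THRESHOLD,   /* size class rounding exceeded W_MAX */
    ACE_FAIL_EXHAUSTION,  /* pool had no free block */
    ACE_FAIL_MAPFAIL      /* OS mapping call failed */
} ace_fail_reason_t;

/* Reads HAKMEM_ACE_TRACE once and caches the answer.
 * The cache is not synchronized; a real allocator would need to
 * make this initialization thread-safe. */
int ace_trace_enabled(void) {
    static int cached = -1;
    if (cached < 0) {
        const char *v = getenv("HAKMEM_ACE_TRACE");
        cached = (v != NULL && strcmp(v, "1") == 0) ? 1 : 0;
    }
    return cached;
}

const char *ace_fail_name(ace_fail_reason_t r) {
    switch (r) {
    case ACE_FAIL_THRESHOLD:  return "Threshold";
    case ACE_FAIL_EXHAUSTION: return "Exhaustion";
    case ACE_FAIL_MAPFAIL:    return "MapFail";
    }
    return "Unknown";
}

/* Formats one trace line (without newline) into buf.
 * The "size=" field layout is assumed, not taken from the source. */
int ace_format_fail(char *buf, size_t cap, ace_fail_reason_t r, size_t size) {
    assert(buf != NULL && cap > 0);
    return snprintf(buf, cap, "[ACE-FAIL] %s size=%zu", ace_fail_name(r), size);
}

/* Emits the line with write(2) from a stack buffer, so the trace path
 * itself does not go through stdio buffering inside the allocator. */
void ace_trace_fail(ace_fail_reason_t r, size_t size) {
    if (!ace_trace_enabled()) return;
    char line[128];
    int n = ace_format_fail(line, sizeof line - 1, r, size);
    if (n < 0) return;
    if ((size_t)n > sizeof line - 2) n = (int)(sizeof line - 2);
    line[n] = '\n';
    ssize_t w = write(STDERR_FILENO, line, (size_t)n + 1);
    (void)w; /* best effort: tracing must never fail the allocation path */
}
```

In a layout like this, each failure site would call `ace_trace_fail` with its own reason code just before returning NULL, which keeps the cost at one cached integer check when tracing is off.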
---
### How to Use the Trace Feature (for Users)
## Next Phase: Phase 10 - Type Safety & Hardening
To diagnose the 33KB OOM spam issue in your application:
### 1. Pointer Type Safety (Debug Only)
* **Issue**: Occasional `[TLS_SLL_HDR_RESET]` warnings indicate confusion between `BasePtr` (header start) and `UserPtr` (payload start).
* **Solution**: Implement "Phantom Type" checking macros enabled only in debug builds.
* Define `hak_base_ptr_t` and `hak_user_ptr_t` structs in debug.
* Define strict conversion macros (`hak_base_to_user`, `hak_user_to_base`).
* Apply incrementally to `tls_sll_box`, `free_local_box`, and `remote_free_box`.
* **Goal**: Catch pointer arithmetic errors at compile time in debug mode.
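
One possible shape for the phantom-type idea is sketched below, under several assumptions: the `HAK_PTR_TYPES_STRICT` switch, the `HAK_*_WRAP` and `HAK_PTR_RAW` helpers, and the header size are all hypothetical, and the conversions are written as inline functions rather than macros for readability. Only the names `hak_base_ptr_t`, `hak_user_ptr_t`, `hak_base_to_user`, and `hak_user_to_base` come from the plan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder header size; the real value is not given in these notes. */
#define HAK_HEADER_SIZE 8u

/* Hypothetical switch: 1 = debug build with distinct struct types. */
#ifndef HAK_PTR_TYPES_STRICT
#define HAK_PTR_TYPES_STRICT 1
#endif

#if HAK_PTR_TYPES_STRICT
/* Two distinct struct types: passing one where the other is expected
 * is a compile error, even though both only hold a void pointer. */
typedef struct { void *p; } hak_base_ptr_t;
typedef struct { void *p; } hak_user_ptr_t;
#define HAK_BASE_WRAP(raw) ((hak_base_ptr_t){ (raw) })
#define HAK_USER_WRAP(raw) ((hak_user_ptr_t){ (raw) })
#define HAK_PTR_RAW(x)     ((x).p)
#else
/* Release build: plain pointers, no checking and no overhead. */
typedef void *hak_base_ptr_t;
typedef void *hak_user_ptr_t;
#define HAK_BASE_WRAP(raw) ((void *)(raw))
#define HAK_USER_WRAP(raw) ((void *)(raw))
#define HAK_PTR_RAW(x)     (x)
#endif

/* Header start -> payload start. */
static inline hak_user_ptr_t hak_base_to_user(hak_base_ptr_t b) {
    return HAK_USER_WRAP((uint8_t *)HAK_PTR_RAW(b) + HAK_HEADER_SIZE);
}

/* Payload start -> header start. */
static inline hak_base_ptr_t hak_user_to_base(hak_user_ptr_t u) {
    return HAK_BASE_WRAP((uint8_t *)HAK_PTR_RAW(u) - HAK_HEADER_SIZE);
}
```

With the strict variant, a call such as `hak_user_to_base(b)` where `b` is a `hak_base_ptr_t` fails to compile, which is the class of mix-up the `[TLS_SLL_HDR_RESET]` warnings point to.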
1. **Ensure Correct `libhakmem.so` Build**:
Make sure `libhakmem.so` is built without `POOL_TLS_PHASE1` enabled (e.g., `make shared POOL_TLS_PHASE1=0`). The current `libhakmem.so` reflects this.
### 2. Header Protection Hardening
* **Goal**: Reinforce header integrity checks in `tiny_free_local_box` and `tls_sll_pop` using the new type system.
2. **Run Your Application with Specific Environment Variables**:
```bash
export HAKMEM_FRONT_GATE_UNIFIED=0
export HAKMEM_SMALLMID_ENABLE=0
export HAKMEM_FORCE_LIBC_ALLOC=0
export HAKMEM_LD_BLOCK_JEMALLOC=0
export HAKMEM_ACE_TRACE=1 # Crucial for seeing the logs
export HAKMEM_WMAX_MID=1.60 # Use default or adjust as needed for W_MAX analysis
export HAKMEM_WMAX_LARGE=1.30 # Use default or adjust as needed for W_MAX analysis
export LD_PRELOAD=/path/to/hakmem/libhakmem.so
### 3. Fast Path Optimization
* **Goal**: Re-evaluate hot path performance (Stage 1 lock-free) after Phase 9-2 stabilization.
./your_application 2> stderr.log # Redirect stderr to a file for analysis
```
---
3. **Analyze `stderr.log`**:
Look for `[ACE-FAIL]` messages to determine if the issue is a `Threshold` (e.g., `size=33000 wmax=...`), `Exhaustion` (pool empty), or `MapFail` (OS allocation error). This will provide the necessary data to pinpoint the root cause of the OOM spam.
## Current Status
* **Build**: Passing (Clean build verified).
* **Benchmarks**:
* WS8192: **50.3 M ops/s** (Shared Pool ONLY).
* Crash/OOM: Resolved.
* **Pending**: Phase 10 implementation (Type Safety).
This setup will allow for precise diagnosis of 33KB allocation failures within the hakmem ACE component.
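
To summarize a captured `stderr.log`, a small counter along the following lines may help. It is a sketch, not part of the commit: `ace_fail_counts_t`, `ace_count_line`, and `ace_count_file` are hypothetical names, and it assumes each `[ACE-FAIL]` line carries its reason word (`Threshold`, `Exhaustion`, or `MapFail`) after the tag.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Tally of trace lines per failure reason. */
typedef struct {
    unsigned long threshold;
    unsigned long exhaustion;
    unsigned long mapfail;
    unsigned long other;      /* [ACE-FAIL] lines with an unknown reason */
} ace_fail_counts_t;

/* Classifies one log line; lines without the [ACE-FAIL] tag are ignored. */
void ace_count_line(const char *line, ace_fail_counts_t *c) {
    assert(line != NULL && c != NULL);
    const char *tag = strstr(line, "[ACE-FAIL]");
    if (tag == NULL) return;
    if (strstr(tag, "Threshold") != NULL)       c->threshold++;
    else if (strstr(tag, "Exhaustion") != NULL) c->exhaustion++;
    else if (strstr(tag, "MapFail") != NULL)    c->mapfail++;
    else                                        c->other++;
}

/* Reads the log line by line; returns 0 on success, -1 if the file
 * cannot be opened. Lines longer than the buffer are processed in
 * chunks, so a tag split across chunks may be miscounted. */
int ace_count_file(const char *path, ace_fail_counts_t *c) {
    FILE *f = fopen(path, "r");
    if (f == NULL) return -1;
    char line[1024];
    while (fgets(line, sizeof line, f) != NULL) {
        ace_count_line(line, c);
    }
    fclose(f);
    return 0;
}
```

A caller would zero-initialize an `ace_fail_counts_t`, pass it to `ace_count_file("stderr.log", &counts)`, and compare the three totals: a log dominated by one reason points to the matching case in the Objective list.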