Phase 9-2 Benchmark Report: WS8192 Performance Analysis
Date: 2025-11-30
Test Configuration: WS8192 (Working Set = 8192 allocations)
Benchmark: bench_random_mixed_hakmem 10000000 8192
Status: Baseline measurements complete, optimization not yet implemented
Executive Summary
The WS8192 benchmark was measured with the correct parameters. Results:
- SuperSlab OFF vs ON: essentially identical performance (16.23M vs 16.15M ops/s, -0.51%)
- Gap vs expectations: Phase 9-2 projected 25-30M ops/s (+50-80%); measurements show no improvement
- Root cause: the Phase 9-2 fix (EMPTY→Freelist recycling) turned out to be unimplemented
- Next step: implement Phase 9-2 Option A
1. Benchmark Results
1.1 SuperSlab OFF (Baseline)
HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
| Run | Throughput (ops/s) | Time (s) |
|---|---|---|
| 1 | 16,468,918 | 0.607 |
| 2 | 16,192,733 | 0.618 |
| 3 | 16,035,542 | 0.624 |
| Average | 16,232,398 | 0.616 |
| Std Dev | 178,517 (±1.1%) | 0.007 |
Key Observations:
- Consistent performance (±1.1% variance)
- 4x [SS_BACKEND] shared_fail→legacy cls=7 warnings
- TLS_SLL errors present (header corruption warnings)
1.2 SuperSlab ON (Current State)
HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
| Run | Throughput (ops/s) | Time (s) |
|---|---|---|
| 1 | 16,231,848 | 0.616 |
| 2 | 16,305,843 | 0.613 |
| 3 | 15,910,918 | 0.628 |
| Average | 16,149,536 | 0.619 |
| Std Dev | 171,766 (±1.1%) | 0.007 |
Key Observations:
- No performance improvement (-0.51% vs baseline)
- Same shared_fail→legacy warnings (4x Class 7 fallbacks)
- Same TLS_SLL errors
- SuperSlab enabled but not providing benefits
1.3 Improvement Analysis
Baseline (SuperSlab OFF): 16.23 M ops/s
Current (SuperSlab ON): 16.15 M ops/s
Improvement: -0.51% (REGRESSION, within noise)
Expected (Phase 9-2): 25-30 M ops/s
Gap: -8.85 to -13.85 M ops/s (-35% to -46%)
Verdict: SuperSlab is enabled but not functional due to missing EMPTY recycling.
2. Problem Analysis
2.1 Why SuperSlab Has No Effect
From PHASE9_2_SUPERSLAB_BACKEND_INVESTIGATION.md investigation:
Root Cause: Shared pool Stage 3 soft cap blocks new SuperSlab allocation, but EMPTY slabs are not recycled to Stage 1 freelist.
Flow:
1. Benchmark allocates ~820 Class 7 blocks (10% of WS=8192)
2. Shared pool allocates 2 SuperSlabs (512KB each = 1022 blocks total)
3. class_active_slots[7] = 2 (soft cap reached)
4. Next allocation request:
- Stage 0.5 (EMPTY scan): Finds nothing (only 2 SS, both ACTIVE)
- Stage 1 (freelist): Empty (no EMPTY→ACTIVE transitions)
- Stage 2 (UNUSED claim): Exhausted (first pass only)
- Stage 3 (new SS alloc): FAIL (soft cap: current=2 >= limit=2)
5. shared_pool_acquire_slab() returns -1
6. Falls back to legacy backend
7. Legacy backend uses system malloc → kernel overhead
Result: SuperSlab backend is bypassed 4 times during benchmark → falls back to legacy system malloc.
2.2 Observable Evidence
Log Snippet:
[SS_BACKEND] shared_fail→legacy cls=7 ← SuperSlab failed, using legacy
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
What This Means:
- SuperSlab attempted allocation → hit soft cap → failed
- Fell back to hak_tiny_alloc_superslab_backend_legacy()
- Legacy backend uses system malloc (not SuperSlab)
- Kernel overhead: mmap/munmap syscalls → 55% CPU in kernel
Why No Performance Difference:
- SuperSlab ON: Uses legacy backend (same as SuperSlab OFF)
- SuperSlab OFF: Uses legacy backend (expected)
- Both configurations → same code path → same performance
3. Missing Implementation: EMPTY→Freelist Recycling
3.1 What Needs to Be Implemented
Phase 9-2 Option A (from investigation report):
Step 1: Add EMPTY Detection to Remote Drain
File: core/superslab_slab.c (after line 109)
void _ss_remote_drain_to_freelist_unsafe(SuperSlab* ss, int slab_idx, TinySlabMeta* meta) {
    // ... existing drain logic ...
    meta->freelist = prev;
    atomic_store(&ss->remote_counts[slab_idx], 0);

    // ✅ NEW: Check if slab is now EMPTY
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx); // Set empty_mask bit
        // Notify shared pool: push to per-class freelist
        int class_idx = (int)meta->class_idx;
        if (class_idx >= 0 && class_idx < TINY_NUM_CLASSES_SS) {
            shared_pool_release_slab(ss, slab_idx);
        }
    }
    // ... update masks ...
}
Step 2: Add EMPTY Detection to TLS SLL Drain
File: core/box/tls_sll_drain_box.c
uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size) {
    // ... existing drain logic ...
    // After draining N blocks from the TLS SLL back to the owning slab's
    // freelist, look up the owning SuperSlab (ss), slab index (slab_idx),
    // and metadata (meta) for the drained slab, then:
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);
        shared_pool_release_slab(ss, slab_idx);
    }
    // ... return number of blocks drained ...
}
3.2 Expected Impact (After Implementation)
Performance Prediction (from Phase 9-2 investigation, Section 9.2):
| Configuration | Throughput | Kernel Overhead | Stage 1 Hit Rate |
|---|---|---|---|
| Current (no recycling) | 16.5 M ops/s | 55% | 0% |
| Option A (EMPTY recycling) | 25-28 M ops/s | 15% | 80% |
| Option A+B (+ 2MB SS) | 30-35 M ops/s | 12% | 85% |
Why +50-70% Improvement:
- EMPTY slabs recycle instantly via lock-free Stage 1
- Soft cap never hit (slots reused, not created)
- Eliminates mmap/munmap overhead from legacy fallback
- SuperSlab backend becomes fully functional
4. Comparison with Phase 9-1
4.1 Phase 9-1 Status
From PHASE9_1_PROGRESS.md:
Phase 9-1 Goal: Optimize SuperSlab lookup (50-80 cycles → 8-12 cycles)
Status: Infrastructure complete (4/6 steps), migration not started
- ✅ Step 1-4: Hash table + TLS hints implementation
- ⏸️ Step 5: Migration (IN PROGRESS)
- ⏸️ Step 6: Benchmark (PENDING)
Key Point: Phase 9-1 optimizations are not yet integrated into hot path.
4.2 Phase 9-2 Status
Phase 9-2 Goal: Fix SuperSlab backend (eliminate legacy fallbacks)
Status: Investigation complete, implementation not started
- ✅ Root cause identified (EMPTY recycling missing)
- ✅ 4 fix options proposed (Option A recommended)
- ⏸️ Implementation: NOT STARTED
- ⏸️ Benchmark: NOT STARTED
Key Point: Phase 9-2 is still in planning phase.
5. Performance Budget Analysis
5.1 Current Bottlenecks (WS8192)
Total: 212 cycles/op budget (measured 16.5 M ops/s @ 2.8 GHz ≈ 170 cycles/op)
- SuperSlab Lookup: 50-80 cycles ← Phase 9-1 target
- Legacy Fallback: 30-50 cycles ← Phase 9-2 target
- Fragmentation: 30-50 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
Kernel Overhead: 55% (mmap/munmap from legacy fallback)
5.2 Expected After Phase 9-1 + 9-2
After Phase 9-1 (lookup optimization):
Total: 152 cycles/op (18.4 M ops/s baseline)
- SuperSlab Lookup: 8-12 cycles ✅ Fixed (hash + TLS hints)
- Legacy Fallback: 30-50 cycles ← Still broken
- Fragmentation: 30-50 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
Expected: 16.5M → 23-25M ops/s (+39-52%)
After Phase 9-1 + 9-2 (lookup + backend):
Total: 95 cycles/op (29.5 M ops/s baseline)
- SuperSlab Lookup: 8-12 cycles ✅ Fixed (Phase 9-1)
- Legacy Fallback: 0 cycles ✅ Fixed (Phase 9-2)
- SuperSlab Backend: 15-20 cycles ✅ Stage 1 reuse
- Fragmentation: 20-30 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
Expected: 16.5M → 30-35M ops/s (+80-110%) Kernel Overhead: 55% → 12-15%
6. Diagnostic Output Analysis
6.1 Repeated Warnings
TLS_SLL_POP_POST_INVALID:
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x7 last_writer=pop
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6 count=0
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x5b last_writer=pop
Analysis (from Phase 9-2 investigation, Section 2):
- cls=6: Class 6 (512-byte blocks)
- got=0x00: Header corrupted/zeroed
- count=0: One-time event (not recurring)
- Hypothesis: Use-after-free or slab reuse race
- Mitigation: Existing guards (tiny_tls_slab_reuse_guard()) should prevent this
- Verdict: Not critical (one-time event, guards in place)
- Action: Monitor with HAKMEM_SUPER_REG_DEBUG=1 for recurrence
6.2 Shared Fail Events
[SS_BACKEND] shared_fail→legacy cls=7
Count: 4 events per benchmark run
Class: Class 7 (2048-byte allocations, 1024-1040B range in benchmark)
Reason: Soft cap reached (Stage 3 blocked)
Impact: Falls back to system malloc → kernel overhead
This is the PRIMARY bottleneck that Phase 9-2 Option A will fix.
7. Verification of Test Configuration
7.1 Benchmark Parameters
Command Used:
./bench_random_mixed_hakmem 10000000 8192
Breakdown:
- 10000000: 10M cycles (steady-state measurement)
- 8192: Working set size (WS8192)
From bench_random_mixed.c (line 45-46):
int cycles = (argc>1)? atoi(argv[1]) : 10000000; // total ops
int ws = (argc>2)? atoi(argv[2]) : 8192; // working-set slots
Allocation Pattern (line 116):
size_t sz = 16u + (r & 0x3FFu); // 16..1040 bytes (approx 16..1024)
Class Distribution (estimated):
16-64B → Classes 0-3 (~40%)
64-256B → Classes 4-5 (~30%)
256-512B → Class 6 (~20%)
512-1040B → Class 7 (~10% = ~820 live allocations)
Why Class 7 Exhausts:
- 820 live allocations ÷ 511 blocks/SuperSlab = 1.6 SuperSlabs (rounded to 2)
- Soft cap = 2 → any additional allocation fails → legacy fallback
7.2 Comparison with Phase 9-1 Baseline
From PHASE9_1_PROGRESS.md (line 142):
./bench_random_mixed_hakmem 10000000 8192 # ~16.5 M ops/s
Current Measurement:
- SuperSlab OFF: 16.23 M ops/s
- SuperSlab ON: 16.15 M ops/s
Match: ✅ Values align with Phase 9-1 baseline (16.5M vs 16.2M, within variance)
8. Next Steps
8.1 Immediate Actions
- Implement Phase 9-2 Option A (EMPTY→Freelist recycling)
  - Modify core/superslab_slab.c (remote drain)
  - Modify core/box/tls_sll_drain_box.c (TLS SLL drain)
  - Add EMPTY detection: if (meta->used == 0) { shared_pool_release_slab(...) }
- Run Debug Build to verify EMPTY recycling
  make clean
  make CFLAGS="-O2 -g -DHAKMEM_BUILD_RELEASE=0" bench_random_mixed_hakmem
  HAKMEM_TINY_USE_SUPERSLAB=1 \
  HAKMEM_SS_ACQUIRE_DEBUG=1 \
  HAKMEM_SHARED_POOL_STAGE_STATS=1 \
  ./bench_random_mixed_hakmem 100000 256 42
- Verify Stage 1 Hits in debug output
  - Look for [SP_ACQUIRE_STAGE1_LOCKFREE] logs
  - Confirm freelist population: [SP_SLOT_FREELIST_LOCKFREE]
  - Verify zero shared_fail→legacy events
8.2 Performance Validation
- Re-run WS8192 Benchmark (after Option A implementation)
  # Baseline (should be same as before)
  HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
  # Optimized (should show +50-70% improvement)
  HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
- Success Criteria (from Phase 9-2 Section 11.2):
  - ✅ Throughput: 16.5M → 25-30M ops/s (+50-80%)
  - ✅ Zero shared_fail→legacy events
  - ✅ Stage 1 hit rate: 70-80% (after warmup)
  - ✅ Kernel overhead: 55% → <15%
8.3 Optional Enhancements
- Implement Option B (revert to 2MB SuperSlab)
  - Change SUPERSLAB_LG_DEFAULT from 19 → 21
  - Expected additional gain: +10-15% (30-35M ops/s total)
- Implement Option D (expand EMPTY scan limit)
  - Change HAKMEM_SS_EMPTY_SCAN_LIMIT default from 16 → 64
  - Expected additional gain: +3-8% (marginal)
9. Risk Assessment
9.1 Implementation Risks (Option A)
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Double-free in EMPTY detection | Low | Critical | Assert meta->used == 0 and empty_mask bit not already set before shared_pool_release_slab() |
| Race: EMPTY→ACTIVE→EMPTY | Medium | Medium | Use atomic meta->used reads; Stage 1 CAS prevents double-activation |
| Deadlock in release_slab | Low | Medium | Use lock-free push (already implemented) |
Overall: Low risk (Box boundaries well-defined, guards in place)
9.2 Performance Risks
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Improvement less than expected | Medium | Medium | Profile with perf, check Stage 1 hit rate, consider Option B |
| Regression in other workloads | Low | Medium | Run full benchmark suite (WS256, cache_thrash, larson) |
| Memory leak from freelist | Low | High | Monitor RSS growth, verify EMPTY detection logic |
Overall: Medium risk (new feature, but small code change)
10. Lessons Learned
10.1 Benchmark Parameter Confusion
Issue: The initial request claimed the measurement had used default parameters, making the workload too light.
Reality: Default parameters ARE WS8192 (line 46 in bench_random_mixed.c)
int ws = (argc>2)? atoi(argv[2]) : 8192; // default: 8192
Takeaway: Always check source code to verify default behavior (documentation may be outdated).
10.2 SuperSlab Enabled ≠ SuperSlab Functional
Issue: HAKMEM_TINY_USE_SUPERSLAB=1 enables SuperSlab code, but doesn't guarantee it's used.
Reality: Legacy fallback is triggered when SuperSlab backend fails (soft cap, OOM, etc.)
Takeaway: Check for shared_fail→legacy warnings in output to verify SuperSlab is actually being used.
10.3 Phase Dependencies
Issue: Assumed Phase 9-2 was complete (based on PHASE9_2_*.md files)
Reality: Phase 9-2 investigation is complete, but implementation is not started
Takeaway: Check document status header (e.g., "Status: Root Cause Analysis Complete" vs "Status: Implementation Complete")
11. Conclusion
Current State: WS8192 benchmark correctly measured at 16.2-16.3 M ops/s, consistent across SuperSlab ON/OFF.
Root Cause: SuperSlab backend falls back to legacy system malloc due to missing EMPTY→Freelist recycling (Phase 9-2 Option A).
Expected Improvement: After implementing Option A, expect 25-30 M ops/s (+50-80%) by eliminating legacy fallbacks and enabling lock-free Stage 1 EMPTY reuse.
Next Action: Implement Phase 9-2 Option A (2-3 hour task), then re-benchmark WS8192 to verify +50-70% improvement.
Report Prepared By: Claude (Sonnet 4.5)
Benchmark Date: 2025-11-30
Total Test Time: ~6 seconds (6 runs × 0.6s average)
Status: Baseline established, awaiting Phase 9-2 implementation