feat: Add ACE allocation failure tracing and debug hooks

This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:

- **ACE Tracing Implementation**:
  - Added environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected to ensure is properly linked into , resolving an error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the wrapper's behavior under , particularly its interaction with and checks.
  - Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed.
  - Created to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00
parent 2bd8da9267
commit 4ef0171bc0
85 changed files with 5930 additions and 479 deletions
--- a/PHASE8_VISUAL_SUMMARY.md
+++ b/PHASE8_VISUAL_SUMMARY.md
@@ -0,0 +1,246 @@
+# Phase 8 Comprehensive Benchmark - Visual Summary
+
+## Performance Comparison Charts
+
+### Working Set 256 (Hot Cache) - Bar Chart
+
+```
+HAKMEM    ████████████████████████████████████████ 79.2 M ops/s (1.00x)
+System    ███████████████████████████████████████████ 86.7 M ops/s (1.09x) ↑ 9%
+mimalloc  ██████████████████████████████████████████████████████████ 114.9 M ops/s (1.45x) ↑ 45%
+```
+
+### Working Set 8192 (Realistic Workload) - Bar Chart
+
+```
+HAKMEM    ████ 16.5 M ops/s (1.00x)
+System    ██████████████ 57.1 M ops/s (3.46x) ↑ 246%
+mimalloc  ████████████████████████ 96.5 M ops/s (5.85x) ↑ 485%
+```
+
+## Scalability Comparison
+
+### Performance Degradation (WS256 → WS8192)
+
+```
+mimalloc  ████ 1.19x degradation  [EXCELLENT]
+System    ██████ 1.52x degradation  [GOOD]
+HAKMEM    ███████████████████ 4.80x degradation  [CRITICAL ISSUE]
+```
+
+## Performance Gap Analysis
+
+### Cycle Budget (Estimated at 3.5 GHz)
+
+| Allocator | Cycles/Op | Extra Cycles vs Best |
+|-----------|-----------|---------------------|
+| mimalloc  | 36        | 0 (baseline)        |
+| System    | 61        | +25 (+69%)          |
+| HAKMEM    | 212       | +176 (+489%)        |
+
+**HAKMEM uses 176 extra cycles per operation compared to mimalloc!**
+
+### Where Are The Cycles Going?
+
+```
+Estimated cycle breakdown for HAKMEM WS8192:
+
+SuperSlab Lookup:      ████████████████ 50-80 cycles
+Legacy Fallback:       ██████████████ 30-50 cycles (when triggered)
+Fragmentation:         ███████████ 30-50 cycles
+TLS Drain Logic:       ███ 10-15 cycles
+Actual Work:           ████████ 30-40 cycles
+                       ─────────────────────────
+Total:                 ~212 cycles/operation
+
+mimalloc for comparison:
+Optimized Fast Path:   ████████ 36 cycles total
+```
+
+## Priority Ranking
+
+### Critical Issues (Must Fix)
+
+```
+1. SuperSlab Scaling          Priority: CRITICAL    Impact: 3.46x slower than System at WS8192
+   └─ 4.8x degradation vs 1.5x for System malloc
+   └─ "shared_fail→legacy" messages indicate capacity issues
+
+2. Fragmentation             Priority: HIGH        Impact: 30-50 cycles/op
+   └─ SuperSlab list becomes inefficient at scale
+
+3. TLB Pressure              Priority: HIGH        Impact: Unknown, likely high
+   └─ Many 512KB SuperSlabs → TLB misses
+```
+
+### Important Issues (Should Fix)
+
+```
+4. TLS Drain Overhead        Priority: MEDIUM      Impact: 9.4% on hot cache
+   └─ Affects even best-case performance
+
+5. Fast Path Efficiency      Priority: MEDIUM      Impact: 9.4% on hot cache
+   └─ Need more aggressive inlining
+```
+
+### Nice-to-Have
+
+```
+6. Metadata Optimization     Priority: LOW         Impact: Unknown
+   └─ Reduce cache pollution from slab metadata
+```
+
+## Competitive Position
+
+### Current Status: Phase 8
+
+```
+Tier 1 (Production-Ready):
+  mimalloc   ████████████████████████ 96.5 M ops/s
+  System     ██████████████ 57.1 M ops/s
+
+Tier 2 (Needs Work):
+  (empty)
+
+Tier 3 (Experimental):
+  HAKMEM     ████ 16.5 M ops/s  ← YOU ARE HERE
+```
+
+### Target for Phase 12 (6 months)
+
+```
+Tier 1 (Production-Ready):
+  mimalloc   ████████████████████████ 96.5 M ops/s
+  HAKMEM     ████████████████████ 80+ M ops/s  ← TARGET
+  System     ██████████████ 57.1 M ops/s
+
+Goal: Match or exceed System malloc, get within 20% of mimalloc
+```
+
+## Decision Matrix for Phase 9
+
+### Option A: Fix SuperSlab Architecture (Recommended)
+
+**Pros**:
+- Preserve existing work
+- Targeted fixes may yield big gains
+- Debug logs provide clear direction
+
+**Cons**:
+- May be fundamentally flawed architecture
+- Risk of incremental fixes not solving core issue
+
+**Time estimate**: 2-3 weeks
+**Success probability**: 60%
+
+### Option B: Hybrid Architecture
+
+**Pros**:
+- Keep TLS fast path (working well)
+- Replace SuperSlab backend with proven design
+- Best of both worlds
+
+**Cons**:
+- Major refactoring required
+- Lose SuperSlab work
+- Integration complexity
+
+**Time estimate**: 4-6 weeks
+**Success probability**: 75%
+
+### Option C: Start Over (Not Recommended Yet)
+
+**Pros**:
+- Clean slate
+- Can copy proven designs (mimalloc, jemalloc)
+
+**Cons**:
+- Lose all current work
+- No learning from mistakes
+- 3+ months delay
+
+**Time estimate**: 3-4 months
+**Success probability**: 85% (but high cost)
+
+## Recommended Path Forward
+
+### Phase 9: SuperSlab Deep Dive (2 weeks)
+
+**Week 1: Investigation**
+- Add comprehensive profiling
+- Measure cache/TLB misses
+- Analyze fragmentation patterns
+- Understand "shared_fail→legacy" root cause
+
+**Week 2: Targeted Fixes**
+- Implement hash table for SuperSlab lookup
+- Experiment with larger SuperSlabs (1-2MB)
+- Optimize fragmentation handling
+- Add better capacity management
+
+**Success criteria**:
+- WS8192: 16.5 → 35+ M ops/s (2x improvement)
+- Understand root cause even if fix incomplete
+
+### Phase 10: Decision Point
+
+**If Phase 9 successful (>35 M ops/s)**:
+- Continue with SuperSlab optimizations
+- Focus on fast path improvements
+- Target: 50 M ops/s by Phase 12
+
+**If Phase 9 unsuccessful (<30 M ops/s)**:
+- Switch to Hybrid Architecture (Option B)
+- Keep TLS layer, replace backend
+- Target: 60 M ops/s by Phase 14
+
+## Key Metrics to Track
+
+### Performance Metrics
+- [ ] WS256 throughput (target: 85+ M ops/s)
+- [ ] WS8192 throughput (target: 35+ M ops/s)
+- [ ] Degradation ratio (target: <2.5x)
+
+### Architecture Metrics
+- [ ] SuperSlab lookup latency (target: <20 cycles)
+- [ ] Cache miss rate (target: <5%)
+- [ ] TLB miss rate (target: <1%)
+- [ ] Fragmentation ratio (target: <20%)
+
+### Debug Metrics
+- [ ] "shared_fail→legacy" events (target: 0)
+- [ ] TLS_SLL_HDR_RESET events (target: 0)
+- [ ] Average SuperSlab count (target: <10 at WS8192)
+
+## Conclusion
+
+**Phase 8 Status**: COMPLETE
+- ✓ Comprehensive benchmarks executed
+- ✓ Statistical analysis completed
+- ✓ Root cause hypotheses identified
+- ✓ Clear path forward defined
+
+**Phase 9 Ready**: YES
+- Clear investigation targets
+- Specific metrics to measure
+- Decision criteria established
+
+**Confidence Level**: HIGH
+- Data is robust (low variance)
+- Gaps are well-understood
+- Multiple viable paths forward
+
+---
+
+**Next Action**: Begin Phase 9 - SuperSlab Deep Dive and Profiling
+
+**Timeline**:
+- Phase 9: 2 weeks (investigation + targeted fixes)
+- Phase 10: 1 week (decision point + planning)
+- Phase 11-12: 3-4 weeks (major optimizations)
+- Target completion: 6-8 weeks to production-ready
+
+**Risk Level**: MEDIUM
+- SuperSlab may be unfixable → fallback to Hybrid (Option B)
+- Hybrid adds 2-3 weeks but higher success probability
+- Total timeline stays within 10 weeks worst case
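
The derived figures in the report (the Cycle Budget table and the WS256 → WS8192 degradation ratios) follow directly from the measured throughputs. The sketch below is not part of the commit; it reproduces that arithmetic as a sanity check, assuming the 3.5 GHz clock stated in the report and the throughput values copied from the bar charts. The names `THROUGHPUT`, `cycles_per_op`, and `summarize` are illustrative only.

```python
CLOCK_HZ = 3.5e9  # assumed clock frequency, as stated in the report

# Throughput in M ops/s copied from the charts: (WS256, WS8192)
THROUGHPUT = {
    "HAKMEM": (79.2, 16.5),
    "System": (86.7, 57.1),
    "mimalloc": (114.9, 96.5),
}


def cycles_per_op(mops: float, clock_hz: float = CLOCK_HZ) -> float:
    """Estimated cycles per operation for a throughput given in M ops/s."""
    return clock_hz / (mops * 1e6)


def summarize() -> dict:
    """Recompute cycles/op, extra cycles vs. the best allocator, and degradation."""
    # The fastest allocator at WS8192 has the fewest cycles per operation.
    best = round(min(cycles_per_op(ws8192) for _, ws8192 in THROUGHPUT.values()))
    rows = {}
    for name, (ws256, ws8192) in THROUGHPUT.items():
        cycles = round(cycles_per_op(ws8192))
        rows[name] = {
            "cycles": cycles,
            "extra": cycles - best,
            "degradation": round(ws256 / ws8192, 2),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(f"{name:9s} {row['cycles']:4d} cycles/op  "
              f"+{row['extra']:3d} extra  {row['degradation']:.2f}x degradation")
```

Rerunning this with new throughput numbers after each phase gives the same derived metrics the report tracks (cycles per operation and degradation ratio) without recomputing them by hand.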