
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks

This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added  environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and  to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected  to ensure  is properly linked into , resolving an  error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for  environment to prevent unintended fallbacks to 's  for non-tiny allocations, allowing comprehensive testing of the  allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created  to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in  by providing clear insights into the failure pathways.

2025-12-01 16:37:59 +09:00


Phase 9-1 Implementation Complete

Date: 2025-11-30 06:40 JST
Status: Infrastructure Complete, Benchmarking In Progress
Completion: 5/6 steps done

Summary

Phase 9-1 successfully implemented a hash table-based SuperSlab lookup system to replace the linear probing registry. The infrastructure is complete and integrated, but initial benchmarks show unexpected results that require investigation.

Completed Work ✅

1. SuperSlabMap Box (Phase 9-1-1) ✅

Files Created:

core/box/ss_addr_map_box.h (149 lines)
core/box/ss_addr_map_box.c (262 lines)

Implementation:

Hash table with 8192 buckets
Chaining collision resolution
O(1) amortized lookup
Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)
Uses __libc_malloc/__libc_free to avoid recursion
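
The properties listed above can be illustrated with a minimal sketch. It is not the code in ss_addr_map_box.c: every `demo_*` name is hypothetical, the hash function is an assumption, and plain `malloc`/`free` stand in for `__libc_malloc`/`__libc_free`. It shows one way a chained table keyed by SuperSlab base address could resolve an arbitrary pointer when slabs may be 512KB, 1MB, or 2MB aligned: mask the pointer at each alignment and check that it falls inside the slab that was found.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* All demo_* names are hypothetical, not the ss_addr_map_box API. */
#define DEMO_BUCKETS   8192u   /* bucket count taken from the report */
#define DEMO_MIN_SHIFT 19u     /* log2(512KB), the smallest alignment */

typedef struct demo_node {
    uintptr_t base;            /* SuperSlab base address (the key) */
    size_t size;               /* SuperSlab size in bytes */
    void *ss;                  /* payload, e.g. a SuperSlab pointer */
    struct demo_node *next;    /* next entry in the same bucket */
} demo_node;

static demo_node *demo_table[DEMO_BUCKETS];

/* Drop the low bits that are zero for every aligned base, then mask. */
static size_t demo_hash(uintptr_t base) {
    return (size_t)((base >> DEMO_MIN_SHIFT) & (DEMO_BUCKETS - 1u));
}

/* Register a slab; returns 0 on success, -1 if the node cannot be allocated. */
static int demo_insert(uintptr_t base, size_t size, void *ss) {
    /* Sketch assumption: size is a power of two and base is size-aligned. */
    assert(size != 0 && (base & (size - 1u)) == 0);
    demo_node *n = malloc(sizeof *n);  /* real code would bypass the wrapper */
    if (n == NULL) return -1;
    size_t h = demo_hash(base);
    n->base = base;
    n->size = size;
    n->ss = ss;
    n->next = demo_table[h];           /* push onto the head of the chain */
    demo_table[h] = n;
    return 0;
}

/* Map any address inside a registered slab to its payload, or NULL. */
static void *demo_lookup(uintptr_t addr) {
    static const unsigned shifts[] = { 19u, 20u, 21u };  /* 512KB, 1MB, 2MB */
    for (size_t i = 0; i < sizeof shifts / sizeof shifts[0]; i++) {
        uintptr_t base = addr & ~(((uintptr_t)1 << shifts[i]) - 1u);
        for (demo_node *n = demo_table[demo_hash(base)]; n != NULL; n = n->next) {
            /* The range check rejects addresses past the end of a smaller slab. */
            if (n->base == base && addr - base < n->size) return n->ss;
        }
    }
    return NULL;
}

/* Unregister a slab by base; returns 0 if found, -1 otherwise. */
static int demo_remove(uintptr_t base) {
    for (demo_node **pp = &demo_table[demo_hash(base)]; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->base == base) {
            demo_node *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}
```

In this sketch a lookup costs at most three short chain walks, one per supported alignment, which stays O(1) as long as the load factor is low.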

2. TLS Hints (Phase 9-1-4) ✅

Files Created:

core/box/ss_tls_hint_box.h (238 lines)
core/box/ss_tls_hint_box.c (22 lines)

Implementation:

__thread SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES]
Fast path: TLS cache check (5-10 cycles expected)
Slow path: Hash table fallback + cache update
Debug statistics tracking
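
The fast path / slow path split described above might look roughly like the following. This is a sketch under stated assumptions, not the contents of ss_tls_hint_box.h: `DemoSlab`, `demo_slow_lookup`, and the other `demo_*` names are hypothetical, a tiny linear registry stands in for the hash table fallback, and `__thread` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NUM_CLASSES 8     /* stand-in for TINY_NUM_CLASSES */

typedef struct {
    uintptr_t base;            /* first byte of the slab */
    size_t size;               /* slab length in bytes */
} DemoSlab;

/* One cached slab per size class, private to each thread. */
static __thread DemoSlab *demo_tls_hint[DEMO_NUM_CLASSES];
static __thread unsigned long demo_hint_hits;
static __thread unsigned long demo_hint_misses;

/* Stand-in for the hash table fallback: a small linear registry. */
static DemoSlab demo_registry[4];
static size_t demo_registry_len;

static DemoSlab *demo_slow_lookup(uintptr_t addr) {
    for (size_t i = 0; i < demo_registry_len; i++) {
        if (addr - demo_registry[i].base < demo_registry[i].size)
            return &demo_registry[i];
    }
    return NULL;
}

/* Try the per-class TLS hint first; fall back and refresh the hint on a miss. */
static DemoSlab *demo_lookup_with_hint(uintptr_t addr, int cls) {
    DemoSlab *hint = demo_tls_hint[cls];
    /* Unsigned subtraction wraps for addr < base, so one compare covers both bounds. */
    if (hint != NULL && addr - hint->base < hint->size) {
        demo_hint_hits++;
        return hint;
    }
    demo_hint_misses++;
    DemoSlab *ss = demo_slow_lookup(addr);
    if (ss != NULL) demo_tls_hint[cls] = ss;   /* cache update */
    return ss;
}
```

A design along these lines also has to clear or validate hints when a SuperSlab is unregistered, otherwise a thread could keep a stale pointer. This sketch leaves that step out.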

3. Debug Macros (Phase 9-1-3) ✅

Implemented:

SS_MAP_LOOKUP() - Trace lookups
SS_MAP_INSERT() - Trace registrations
SS_MAP_REMOVE() - Trace unregistrations
ss_map_print_stats() - Collision/load stats
Environment-gated: HAKMEM_SS_MAP_TRACE=1
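
An environment-gated trace macro of this kind can be sketched as follows. Only the variable name `HAKMEM_SS_MAP_TRACE` comes from this report. The macro name `DEMO_SS_TRACE`, the compile-time switch `DEMO_RELEASE_BUILD`, the output format, and the rule that only a value starting with `1` enables tracing are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Cached gate state: -1 = not checked yet, 0 = off, 1 = on. */
static int demo_trace_state = -1;

/* Read the environment once and cache the answer for later calls. */
static int demo_trace_enabled(void) {
    if (demo_trace_state < 0) {
        const char *v = getenv("HAKMEM_SS_MAP_TRACE");
        demo_trace_state = (v != NULL && v[0] == '1');
    }
    return demo_trace_state;
}

#ifndef DEMO_RELEASE_BUILD   /* hypothetical flag: tracing exists in debug builds only */
#define DEMO_SS_TRACE(op, ptr)                                          \
    do {                                                                \
        if (demo_trace_enabled())                                       \
            fprintf(stderr, "[SS_MAP] %s %p\n", (op), (const void *)(ptr)); \
    } while (0)
#else
#define DEMO_SS_TRACE(op, ptr) ((void)0)   /* compiled out entirely */
#endif
```

A call site would then read `DEMO_SS_TRACE("lookup", p);`. The cached check is not synchronized, which is acceptable for a debug aid only if a benign race on first use is tolerable.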

4. Integration (Phase 9-1-5) ✅

Modified Files:

core/hakmem_tiny_lazy_init.inc.h - Initialize ss_map_init()
core/hakmem_super_registry.c - Hook ss_map_insert/remove()
core/hakmem_super_registry.h - Replace hak_super_lookup() implementation
Makefile - Add new modules to build

Changes:

ss_map_init() called at SuperSlab subsystem initialization
ss_map_insert() called when registering SuperSlabs
ss_map_remove() called when unregistering SuperSlabs
hak_super_lookup() now uses ss_map_lookup() instead of linear probing

Benchmark Results 🔍

WS256 (Hot Cache)

Phase 8 Baseline:  79.2 M ops/s
Phase 9-1 Result:  79.2 M ops/s  (no change)

Status: ✅ No regression in hot cache performance

WS8192 (Realistic)

Phase 8 Baseline:  16.5 M ops/s
Phase 9-1 Result:  16.2 M ops/s  (no improvement; about 1.8% below baseline)

Status: ⚠️ No improvement observed

Investigation Needed 🔍

Observation

The hash table optimization did not deliver the expected WS8192 improvement. Possible reasons:

1. SuperSlab Not Used in Benchmark
   - Default bench settings may disable SuperSlab path
   - Test with: HAKMEM_TINY_USE_SUPERSLAB=1
   - When enabled, performance drops to 15M ops/s
2. Different Bottleneck
   - Phase 8 analysis identified SuperSlab lookup as 50-80 cycle bottleneck
   - Actual bottleneck may be elsewhere (fragmentation, TLS drain, etc.)
   - Need profiling to confirm actual hot path
3. Hash Table Not Exercised
   - Benchmark may be hitting TLS fast path entirely
   - SuperSlab lookups may not happen in hot path
   - Need to verify with profiling/tracing

Next Steps for Investigation

1. Profile Actual Bottleneck

       perf record -g ./bench_random_mixed_hakmem 10000000 8192
       perf report

2. Enable SuperSlab and Measure

       HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192

3. Check Lookup Statistics
   - Build debug version without RELEASE flag
   - Enable HAKMEM_SS_MAP_TRACE=1
   - Count actual lookup calls
4. Verify TLS vs SuperSlab Split
   - Check what percentage of allocations hit TLS vs SuperSlab
   - Benchmark may be 100% TLS (fast path) with no SuperSlab lookups
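
One way to measure the TLS versus SuperSlab split is a pair of process-wide counters bumped at the two decision points, with the ratio reported at exit. The sketch below assumes C11 atomics are available. The `demo_*` names and the idea of hooking the counters into the lookup paths are illustrative assumptions, not existing hakmem instrumentation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong demo_tls_hits;      /* requests served by the TLS fast path */
static atomic_ulong demo_map_lookups;   /* requests that reached the hash table */

/* Relaxed ordering is enough: these are statistics, not synchronization. */
static void demo_count_tls_hit(void) {
    atomic_fetch_add_explicit(&demo_tls_hits, 1, memory_order_relaxed);
}

static void demo_count_map_lookup(void) {
    atomic_fetch_add_explicit(&demo_map_lookups, 1, memory_order_relaxed);
}

/* Share of requests that reached the hash table, in percent (0 with no samples). */
static double demo_map_share_percent(void) {
    unsigned long t = atomic_load_explicit(&demo_tls_hits, memory_order_relaxed);
    unsigned long m = atomic_load_explicit(&demo_map_lookups, memory_order_relaxed);
    unsigned long total = t + m;
    return total ? 100.0 * (double)m / (double)total : 0.0;
}

/* Print a one-line summary; suitable for registration with atexit(). */
static void demo_report(void) {
    fprintf(stderr, "tls_hits=%lu map_lookups=%lu map_share=%.2f%%\n",
            atomic_load_explicit(&demo_tls_hits, memory_order_relaxed),
            atomic_load_explicit(&demo_map_lookups, memory_order_relaxed),
            demo_map_share_percent());
}
```

Registering `demo_report` with `atexit` during initialization would print the split after a benchmark run. A share near 0% would support the hypothesis that the hash table is never exercised on the hot path.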

Code Quality ✅

All new code follows Box pattern:

✅ Single Responsibility
✅ Clear Contracts
✅ Observable (debug macros)
✅ Composable (coexists with legacy)
✅ No compilation warnings
✅ No runtime crashes

Files Modified/Created

New Files (4)

core/box/ss_addr_map_box.h
core/box/ss_addr_map_box.c
core/box/ss_tls_hint_box.h
core/box/ss_tls_hint_box.c

Modified Files (4)

core/hakmem_tiny_lazy_init.inc.h - Added init call
core/hakmem_super_registry.c - Added insert/remove hooks
core/hakmem_super_registry.h - Replaced lookup implementation
Makefile - Added new modules

Documentation (2)

PHASE9_1_PROGRESS.md - Detailed progress tracking
PHASE9_1_COMPLETE.md - This file

Lessons Learned

1. Premature Optimization
   - Phase 8 analysis identified bottleneck without profiling
   - Assumed SuperSlab lookup was the problem
   - Should have profiled first before implementing solution
2. Benchmark Configuration
   - Default benchmark may not exercise the optimized path
   - Need to verify assumptions about what code paths are executed
   - Environment variables can dramatically change behavior
3. Infrastructure Still Valuable
   - Even if not the current bottleneck, O(1) lookup is correct design
   - Future workloads may benefit (more SuperSlabs, different patterns)
   - Clean Box-based architecture enables future optimization

Recommendations

Option 1: Profile and Re-Target

1. Run perf profiling on WS8192 benchmark
2. Identify actual bottleneck (may not be SuperSlab lookup)
3. Implement targeted fix for real bottleneck
4. Re-benchmark

Timeline: 1-2 days
Risk: Low
Expected: 20-30M ops/s at WS8192

Option 2: Enable SuperSlab and Optimize

1. Configure benchmark to force SuperSlab usage
2. Measure hash table effectiveness with SuperSlab enabled
3. Optimize SuperSlab fragmentation/capacity issues
4. Re-benchmark

Timeline: 2-3 days
Risk: Medium
Expected: 18-22M ops/s at WS8192

Option 3: Accept Baseline and Move Forward

- Keep hash table infrastructure (no harm, better design)
- Focus on other optimization opportunities
- Return to this if profiling shows it's needed later

Timeline: 0 days (done)
Risk: Low
Expected: 16-17M ops/s at WS8192 (status quo)

Conclusion

Phase 9-1 successfully delivered clean, well-architected infrastructure for O(1) SuperSlab lookups. The code compiles, runs without crashes, and follows all Box pattern principles.

However, benchmark results show no improvement, suggesting one of the following:

- The identified bottleneck was incorrect
- The benchmark doesn't exercise the optimized path
- A different bottleneck dominates performance

Recommended Next Step: Profile with perf to identify actual bottleneck before further optimization work.

Prepared by: Claude (Sonnet 4.5)
Timestamp: 2025-11-30 06:40 JST
Status: Infrastructure complete, performance investigation needed