This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.
Key changes include:
- **ACE Tracing Implementation**:
  - Added environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected to ensure is properly linked into , resolving an error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the wrapper's behavior under , particularly its interaction with and checks.
  - Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed.
  - Created to facilitate testing of the tracing features.
This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
Phase 9-1 Implementation Complete
Date: 2025-11-30 06:40 JST
Status: Infrastructure Complete, Benchmarking In Progress
Completion: 5/6 steps done
Summary
Phase 9-1 successfully implemented a hash table-based SuperSlab lookup system to replace the linear probing registry. The infrastructure is complete and integrated, but initial benchmarks show unexpected results that require investigation.
Completed Work ✅
1. SuperSlabMap Box (Phase 9-1-1) ✅
Files Created:
- core/box/ss_addr_map_box.h (149 lines)
- core/box/ss_addr_map_box.c (262 lines)
Implementation:
- Hash table with 8192 buckets
- Chaining collision resolution
- O(1) amortized lookup
- Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)
- Uses __libc_malloc/__libc_free to avoid recursion
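The list above can be illustrated with a minimal sketch of a chained hash table keyed by SuperSlab base address. This is not the code in ss_addr_map_box.c: the names MapEntry, map_hash, map_insert, map_lookup, and map_remove are hypothetical, the hash function is an arbitrary choice, and the sketch assumes every SuperSlab is aligned to its own size. It also uses plain malloc/free for brevity, where the source states that __libc_malloc/__libc_free are used to avoid recursion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAP_BUCKETS 8192u                 /* bucket count listed above */

typedef struct MapEntry {
    uintptr_t        base;                /* SuperSlab base address (key) */
    size_t           size;                /* 512KB, 1MB, or 2MB */
    void            *ss;                  /* opaque SuperSlab pointer */
    struct MapEntry *next;                /* collision chain */
} MapEntry;

static MapEntry *g_buckets[MAP_BUCKETS];

/* Hypothetical hash: drop the 19 low bits (zero for 512KB alignment), then mix. */
static size_t map_hash(uintptr_t base) {
    uint64_t h = ((uint64_t)base >> 19) * 0x9E3779B97F4A7C15ull;
    return (size_t)(h >> 32) & (MAP_BUCKETS - 1u);
}

static int map_insert(void *ss, uintptr_t base, size_t size) {
    assert((base & (size - 1)) == 0);     /* assumption: base aligned to size */
    MapEntry *e = malloc(sizeof *e);      /* sketch only; see lead-in */
    if (!e) return -1;
    size_t b = map_hash(base);
    e->base = base;
    e->size = size;
    e->ss = ss;
    e->next = g_buckets[b];               /* push onto the bucket's chain */
    g_buckets[b] = e;
    return 0;
}

/* Try each supported alignment: mask the address down to a candidate base,
 * then walk that bucket's chain and confirm the address is in range. */
static void *map_lookup(const void *p) {
    static const size_t aligns[] = { 512u << 10, 1u << 20, 2u << 20 };
    uintptr_t addr = (uintptr_t)p;
    for (size_t i = 0; i < sizeof aligns / sizeof aligns[0]; i++) {
        uintptr_t base = addr & ~(uintptr_t)(aligns[i] - 1);
        for (MapEntry *e = g_buckets[map_hash(base)]; e; e = e->next)
            if (e->base == base && addr - base < e->size)
                return e->ss;
    }
    return NULL;
}

static int map_remove(uintptr_t base) {
    for (MapEntry **pp = &g_buckets[map_hash(base)]; *pp; pp = &(*pp)->next) {
        if ((*pp)->base == base) {
            MapEntry *dead = *pp;
            *pp = dead->next;             /* unlink from the chain */
            free(dead);
            return 0;
        }
    }
    return -1;
}
```

Each lookup probes at most three buckets, one per alignment, so the cost stays constant as long as chains remain short. The range check rejects addresses that mask down to a registered base but lie past the end of a smaller SuperSlab.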
2. TLS Hints (Phase 9-1-4) ✅
Files Created:
- core/box/ss_tls_hint_box.h (238 lines)
- core/box/ss_tls_hint_box.c (22 lines)
Implementation:
- __thread SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES]
- Fast path: TLS cache check (5-10 cycles expected)
- Slow path: Hash table fallback + cache update
- Debug statistics tracking
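The fast-path/slow-path split described above might look roughly like the following sketch. It is an illustration under assumptions, not the contents of ss_tls_hint_box.h: SlabStub stands in for the real SuperSlab type, NUM_CLASSES for TINY_NUM_CLASSES, and slow_lookup for the hash table fallback; all three are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CLASSES 8                      /* hypothetical stand-in for TINY_NUM_CLASSES */

typedef struct SlabStub {                  /* hypothetical stand-in for SuperSlab */
    uintptr_t base;
    size_t    size;
} SlabStub;

/* One cached hint per size class, private to each thread. */
static __thread SlabStub *tls_hint[NUM_CLASSES];
static __thread unsigned long tls_hits, tls_misses;   /* debug statistics */

/* Stand-in for the hash table lookup: scans a small fixed array. */
static SlabStub g_slabs[4];
static size_t   g_slab_count;

static int slab_contains(const SlabStub *s, uintptr_t addr) {
    /* Unsigned subtraction wraps for addr < base, so one compare covers both ends. */
    return s != NULL && addr - s->base < s->size;
}

static SlabStub *slow_lookup(uintptr_t addr) {
    for (size_t i = 0; i < g_slab_count; i++)
        if (slab_contains(&g_slabs[i], addr))
            return &g_slabs[i];
    return NULL;
}

static SlabStub *lookup_with_hint(int cls, const void *p) {
    assert(cls >= 0 && cls < NUM_CLASSES);
    uintptr_t addr = (uintptr_t)p;

    SlabStub *hint = tls_hint[cls];
    if (slab_contains(hint, addr)) {       /* fast path: hint still covers p */
        tls_hits++;
        return hint;
    }

    tls_misses++;                          /* slow path: fall back, then refresh hint */
    SlabStub *found = slow_lookup(addr);
    if (found != NULL)
        tls_hint[cls] = found;
    return found;
}
```

The sketch does not address invalidating hints held by other threads when a SuperSlab is unregistered; a real implementation would need some answer to that.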
3. Debug Macros (Phase 9-1-3) ✅
Implemented:
- SS_MAP_LOOKUP() - Trace lookups
- SS_MAP_INSERT() - Trace registrations
- SS_MAP_REMOVE() - Trace unregistrations
- ss_map_print_stats() - Collision/load stats
- Environment-gated: HAKMEM_SS_MAP_TRACE=1
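One common way to gate trace macros on an environment variable is sketched below. Only the variable name HAKMEM_SS_MAP_TRACE comes from this document; the helper names trace_value_enables and ss_trace_enabled, the macro DEMO_SS_TRACE, and the log format are hypothetical, and the actual macros may be implemented differently.

```c
#include <stdio.h>
#include <stdlib.h>

/* Pure helper: tracing is on only when the value is exactly "1". */
static int trace_value_enables(const char *value) {
    return value != NULL && value[0] == '1' && value[1] == '\0';
}

/* Reads HAKMEM_SS_MAP_TRACE once and caches the answer so the hot path
 * pays for getenv() only on first use. The cache is not synchronized;
 * concurrent first calls would simply compute the same value twice. */
static int ss_trace_enabled(void) {
    static int cached = -1;                /* -1 = not read yet */
    if (cached < 0)
        cached = trace_value_enables(getenv("HAKMEM_SS_MAP_TRACE"));
    return cached;
}

/* Hypothetical trace macro: prints the operation name and pointer to stderr
 * when tracing is enabled, and costs one cached check otherwise. */
#define DEMO_SS_TRACE(op, ptr)                                         \
    do {                                                               \
        if (ss_trace_enabled())                                        \
            fprintf(stderr, "[SS_MAP] %s %p\n", (op), (void *)(ptr));  \
    } while (0)
```

A call site would read `DEMO_SS_TRACE("lookup", ptr);` and the benchmark would be run with `HAKMEM_SS_MAP_TRACE=1` set in the environment to see the output.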
4. Integration (Phase 9-1-5) ✅
Modified Files:
- core/hakmem_tiny_lazy_init.inc.h - Initialize ss_map_init()
- core/hakmem_super_registry.c - Hook ss_map_insert/remove()
- core/hakmem_super_registry.h - Replace hak_super_lookup() implementation
- Makefile - Add new modules to build
Changes:
- ss_map_init() called at SuperSlab subsystem initialization
- ss_map_insert() called when registering SuperSlabs
- ss_map_remove() called when unregistering SuperSlabs
- hak_super_lookup() now uses ss_map_lookup() instead of linear probing
Benchmark Results 🔍
WS256 (Hot Cache)
Phase 8 Baseline: 79.2 M ops/s
Phase 9-1 Result: 79.2 M ops/s (no change)
Status: ✅ No regression in hot cache performance
WS8192 (Realistic)
Phase 8 Baseline: 16.5 M ops/s
Phase 9-1 Result: 16.2 M ops/s (no improvement; about 1.8% below baseline)
Status: ⚠️ No improvement observed
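For reference, the relative change behind these figures can be computed as below. The helper name percent_change is hypothetical; the inputs are the throughput numbers reported above. Whether a difference of this size exceeds run-to-run noise is not established here.

```c
/* Relative change of `result` versus `baseline`, in percent.
 * Negative values mean the result is slower than the baseline. */
static double percent_change(double baseline, double result) {
    return (result - baseline) / baseline * 100.0;
}

/* WS256:  percent_change(79.2, 79.2) is 0.
 * WS8192: percent_change(16.5, 16.2) is (16.2 - 16.5) / 16.5 * 100,
 *         roughly -1.8. */
```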
Investigation Needed 🔍
Observation
The hash table optimization did NOT improve WS8192 performance as expected. Possible reasons:
- SuperSlab Not Used in Benchmark
  - Default bench settings may disable SuperSlab path
  - Test with: HAKMEM_TINY_USE_SUPERSLAB=1
  - When enabled, performance drops to 15M ops/s
- Different Bottleneck
  - Phase 8 analysis identified SuperSlab lookup as 50-80 cycle bottleneck
  - Actual bottleneck may be elsewhere (fragmentation, TLS drain, etc.)
  - Need profiling to confirm actual hot path
- Hash Table Not Exercised
  - Benchmark may be hitting TLS fast path entirely
  - SuperSlab lookups may not happen in hot path
  - Need to verify with profiling/tracing
Next Steps for Investigation
- Profile Actual Bottleneck
    perf record -g ./bench_random_mixed_hakmem 10000000 8192
    perf report
- Enable SuperSlab and Measure
    HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
- Check Lookup Statistics
  - Build debug version without RELEASE flag
  - Enable HAKMEM_SS_MAP_TRACE=1
  - Count actual lookup calls
- Verify TLS vs SuperSlab Split
  - Check what percentage of allocations hit TLS vs SuperSlab
  - Benchmark may be 100% TLS (fast path) with no SuperSlab lookups
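The TLS versus SuperSlab split in the last step could be measured with a pair of counters, as in this rough sketch. The LookupStats struct and both function names are hypothetical; in practice the counters would be incremented on the TLS fast path and on each hash table lookup, then printed at exit.

```c
#include <stdio.h>

typedef struct {
    unsigned long tls_hits;       /* requests satisfied by the TLS fast path */
    unsigned long map_lookups;    /* requests that reached the hash table */
} LookupStats;

/* Share of requests served from TLS, in percent; 0 when nothing was counted. */
static double tls_hit_percent(const LookupStats *s) {
    unsigned long total = s->tls_hits + s->map_lookups;
    return total ? 100.0 * (double)s->tls_hits / (double)total : 0.0;
}

static void lookup_stats_print(const LookupStats *s) {
    fprintf(stderr, "tls_hits=%lu map_lookups=%lu tls_share=%.1f%%\n",
            s->tls_hits, s->map_lookups, tls_hit_percent(s));
}
```

A result close to 100% would support the hypothesis that the benchmark never exercises the hash table, in which case the lookup change could not be expected to move WS8192 throughput.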
Code Quality ✅
All new code follows Box pattern:
- ✅ Single Responsibility
- ✅ Clear Contracts
- ✅ Observable (debug macros)
- ✅ Composable (coexists with legacy)
- ✅ No compilation warnings
- ✅ No runtime crashes
Files Modified/Created
New Files (4)
- core/box/ss_addr_map_box.h
- core/box/ss_addr_map_box.c
- core/box/ss_tls_hint_box.h
- core/box/ss_tls_hint_box.c
Modified Files (4)
- core/hakmem_tiny_lazy_init.inc.h - Added init call
- core/hakmem_super_registry.c - Added insert/remove hooks
- core/hakmem_super_registry.h - Replaced lookup implementation
- Makefile - Added new modules
Documentation (2)
- PHASE9_1_PROGRESS.md - Detailed progress tracking
- PHASE9_1_COMPLETE.md - This file
Lessons Learned
- Premature Optimization
  - Phase 8 analysis identified bottleneck without profiling
  - Assumed SuperSlab lookup was the problem
  - Should have profiled first before implementing solution
- Benchmark Configuration
  - Default benchmark may not exercise the optimized path
  - Need to verify assumptions about what code paths are executed
  - Environment variables can dramatically change behavior
- Infrastructure Still Valuable
  - Even if not the current bottleneck, O(1) lookup is correct design
  - Future workloads may benefit (more SuperSlabs, different patterns)
  - Clean Box-based architecture enables future optimization
Recommendations
Option 1: Profile and Re-Target
- Run perf profiling on WS8192 benchmark
- Identify actual bottleneck (may not be SuperSlab lookup)
- Implement targeted fix for real bottleneck
- Re-benchmark
Timeline: 1-2 days
Risk: Low
Expected: 20-30M ops/s at WS8192
Option 2: Enable SuperSlab and Optimize
- Configure benchmark to force SuperSlab usage
- Measure hash table effectiveness with SuperSlab enabled
- Optimize SuperSlab fragmentation/capacity issues
- Re-benchmark
Timeline: 2-3 days
Risk: Medium
Expected: 18-22M ops/s at WS8192
Option 3: Accept Baseline and Move Forward
- Keep hash table infrastructure (no harm, better design)
- Focus on other optimization opportunities
- Return to this if profiling shows it's needed later
Timeline: 0 days (done)
Risk: Low
Expected: 16-17M ops/s at WS8192 (status quo)
Conclusion
Phase 9-1 successfully delivered clean, well-architected infrastructure for O(1) SuperSlab lookups. The code compiles, runs without crashes, and follows all Box pattern principles.
However, benchmark results show no improvement, suggesting either:
- The identified bottleneck was incorrect
- The benchmark doesn't exercise the optimized path
- A different bottleneck dominates performance
Recommended Next Step: Profile with perf to identify actual bottleneck before further optimization work.
Prepared by: Claude (Sonnet 4.5)
Timestamp: 2025-11-30 06:40 JST
Status: Infrastructure complete, performance investigation needed