hakmem/PHASE9_1_COMPLETE.md

Phase 9-1 Implementation Complete

Date: 2025-11-30 06:40 JST
Status: Infrastructure Complete, Benchmarking In Progress
Completion: 5/6 steps done

Summary

Phase 9-1 successfully implemented a hash table-based SuperSlab lookup system to replace the linear probing registry. The infrastructure is complete and integrated, but initial benchmarks show unexpected results that require investigation.

Completed Work

1. SuperSlabMap Box (Phase 9-1-1)

Files Created:

  • core/box/ss_addr_map_box.h (149 lines)
  • core/box/ss_addr_map_box.c (262 lines)

Implementation:

  • Hash table with 8192 buckets
  • Chaining collision resolution
  • O(1) average-case lookup (assuming chains stay short)
  • Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)
  • Uses __libc_malloc/__libc_free to avoid recursion
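
A minimal sketch of the design listed above, for illustration only. All names (`SketchEntry`, `sketch_insert`, `sketch_lookup`, `sketch_remove`), the hash function, and the stored `size` field are assumptions and are not taken from ss_addr_map_box.c. It shows one plausible way to handle several alignments: mask the pointer with each candidate alignment and probe the chained table for that base, then confirm the pointer falls inside the slab.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BUCKETS 8192u   /* bucket count taken from the list above */

/* Hypothetical entry layout; the real ss_addr_map_box.h may differ. */
typedef struct SketchEntry {
    uintptr_t base;            /* SuperSlab base address (the key) */
    size_t size;               /* SuperSlab size in bytes */
    void *ss;                  /* value: the SuperSlab itself */
    struct SketchEntry *next;  /* chain of entries sharing a bucket */
} SketchEntry;

static SketchEntry *g_sketch_buckets[SKETCH_BUCKETS];

/* Bases are at least 512KB-aligned, so the low 19 bits carry no information. */
static size_t sketch_hash(uintptr_t base) {
    uint64_t h = (uint64_t)base >> 19;
    h ^= h >> 13;
    return (size_t)(h & (SKETCH_BUCKETS - 1u));
}

/* Register a SuperSlab; returns 0 on success, -1 if allocation fails. */
static int sketch_insert(uintptr_t base, size_t size, void *ss) {
    assert(size > 0 && (base & (512u * 1024u - 1u)) == 0);
    SketchEntry *e = malloc(sizeof *e);  /* the real code uses __libc_malloc */
    if (e == NULL) return -1;
    size_t b = sketch_hash(base);
    e->base = base;
    e->size = size;
    e->ss = ss;
    e->next = g_sketch_buckets[b];       /* push onto the front of the chain */
    g_sketch_buckets[b] = e;
    return 0;
}

/* Try each candidate alignment, smallest first, and range-check the hit. */
static void *sketch_lookup(const void *ptr) {
    static const uintptr_t aligns[] = { 512u * 1024u, 1024u * 1024u, 2048u * 1024u };
    uintptr_t p = (uintptr_t)ptr;
    for (size_t i = 0; i < sizeof aligns / sizeof aligns[0]; i++) {
        uintptr_t base = p & ~(aligns[i] - 1u);
        for (SketchEntry *e = g_sketch_buckets[sketch_hash(base)]; e; e = e->next) {
            if (e->base == base && p - base < e->size) return e->ss;
        }
    }
    return NULL;
}

/* Unregister a SuperSlab by base; returns 0 if found, -1 otherwise. */
static int sketch_remove(uintptr_t base) {
    SketchEntry **link = &g_sketch_buckets[sketch_hash(base)];
    for (; *link; link = &(*link)->next) {
        if ((*link)->base == base) {
            SketchEntry *dead = *link;
            *link = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}
```

The range check matters in this sketch: without it, a pointer just past a 1MB slab that sits on a 2MB boundary would mask back to that slab's base and be misattributed. The sketch is single-threaded; any locking the real module uses is not shown.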

2. TLS Hints (Phase 9-1-4)

Files Created:

  • core/box/ss_tls_hint_box.h (238 lines)
  • core/box/ss_tls_hint_box.c (22 lines)

Implementation:

  • __thread SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES]
  • Fast path: TLS cache check (5-10 cycles expected)
  • Slow path: Hash table fallback + cache update
  • Debug statistics tracking
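
The fast-path/slow-path split above could look roughly like the following. This is a sketch under stated assumptions: `SketchSuperSlab`, `SKETCH_NUM_CLASSES`, `sketch_hint_lookup`, and the demo slow path are hypothetical stand-ins, not the contents of ss_tls_hint_box.h. Only the `__thread` per-class hint array mirrors the declaration listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_CLASSES 8   /* stand-in for TINY_NUM_CLASSES (value assumed) */

/* Hypothetical SuperSlab descriptor: just an address range. */
typedef struct {
    uintptr_t base;
    size_t size;
} SketchSuperSlab;

/* One cached SuperSlab per size class, per thread. */
static __thread SketchSuperSlab *g_sketch_hint[SKETCH_NUM_CLASSES];
static __thread unsigned long g_sketch_hits, g_sketch_misses;

typedef SketchSuperSlab *(*sketch_slow_fn)(const void *ptr);

/* Fast path: range-check the cached hint. Slow path: fall back and refresh. */
static SketchSuperSlab *sketch_hint_lookup(const void *ptr, int cls,
                                           sketch_slow_fn slow) {
    assert(cls >= 0 && cls < SKETCH_NUM_CLASSES);
    uintptr_t p = (uintptr_t)ptr;
    SketchSuperSlab *h = g_sketch_hint[cls];
    /* Unsigned subtraction also rejects p < base (it wraps to a huge value). */
    if (h != NULL && p - h->base < h->size) {
        g_sketch_hits++;
        return h;
    }
    g_sketch_misses++;
    SketchSuperSlab *ss = slow(ptr);        /* e.g. the hash table lookup */
    if (ss != NULL) g_sketch_hint[cls] = ss;
    return ss;
}

/* Demo slow path over a single fake slab, so the sketch is self-contained. */
static SketchSuperSlab g_demo_slab = { (uintptr_t)0x40000000u, (size_t)1 << 20 };
static int g_demo_slow_calls;

static SketchSuperSlab *demo_slow_lookup(const void *ptr) {
    g_demo_slow_calls++;
    uintptr_t p = (uintptr_t)ptr;
    return (p - g_demo_slab.base < g_demo_slab.size) ? &g_demo_slab : NULL;
}
```

In this sketch the first lookup for a class pays for the slow path once, and later lookups in the same slab are served from the thread-local hint. A real implementation would also need to invalidate hints when a SuperSlab is unregistered, which is omitted here.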

3. Debug Macros (Phase 9-1-3)

Implemented:

  • SS_MAP_LOOKUP() - Trace lookups
  • SS_MAP_INSERT() - Trace registrations
  • SS_MAP_REMOVE() - Trace unregistrations
  • ss_map_print_stats() - Collision/load stats
  • Environment-gated: HAKMEM_SS_MAP_TRACE=1
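
Environment gating of this kind is commonly done by reading the variable once and caching the result. The sketch below assumes that approach; `sketch_flag_is_on`, `sketch_trace_enabled`, `sketch_trace`, and `SKETCH_MAP_LOOKUP` are hypothetical names, and the rule that only a leading `1` enables tracing is an assumption. Only the variable name HAKMEM_SS_MAP_TRACE comes from the list above.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Interpret an environment value: a leading '1' enables tracing (assumed rule). */
static int sketch_flag_is_on(const char *value) {
    return value != NULL && value[0] == '1';
}

/* Read HAKMEM_SS_MAP_TRACE once and cache the answer for later calls. */
static int sketch_trace_enabled(void) {
    static int cached = -1;   /* -1 means "not read yet" */
    if (cached < 0) cached = sketch_flag_is_on(getenv("HAKMEM_SS_MAP_TRACE"));
    return cached;
}

/* Emit one trace line; returns 1 if it logged, 0 if tracing is off. */
static int sketch_trace(const char *op, const void *ptr) {
    assert(op != NULL);
    if (!sketch_trace_enabled()) return 0;
    fprintf(stderr, "[SS_MAP] %s %p\n", op, (void *)ptr);
    return 1;
}

/* Macro wrapper in the style of SS_MAP_LOOKUP(); the real macro may differ. */
#define SKETCH_MAP_LOOKUP(ptr) sketch_trace("lookup", (ptr))
```

The cached flag keeps the disabled case to a single integer compare per call. The unsynchronized cache is a benign race in this sketch (every thread computes the same value), but a real allocator may prefer to read the variable during initialization.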

4. Integration (Phase 9-1-5)

Modified Files:

  • core/hakmem_tiny_lazy_init.inc.h - Initialize ss_map_init()
  • core/hakmem_super_registry.c - Hook ss_map_insert/remove()
  • core/hakmem_super_registry.h - Replace hak_super_lookup() implementation
  • Makefile - Add new modules to build

Changes:

  1. ss_map_init() called at SuperSlab subsystem initialization
  2. ss_map_insert() called when registering SuperSlabs
  3. ss_map_remove() called when unregistering SuperSlabs
  4. hak_super_lookup() now uses ss_map_lookup() instead of linear probing

Benchmark Results 🔍

WS256 (Hot Cache)

Phase 8 Baseline:  79.2 M ops/s
Phase 9-1 Result:  79.2 M ops/s  (no change)

Status: No regression in hot cache performance

WS8192 (Realistic)

Phase 8 Baseline:  16.5 M ops/s
Phase 9-1 Result:  16.2 M ops/s  (no improvement; about 1.8% below baseline)

Status: ⚠️ No improvement observed
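
For reference, the relative change between two throughput figures can be computed as below. `sketch_percent_change` is a hypothetical helper, not part of the benchmark harness. Applied to the numbers above, it gives 0% for WS256 and roughly -1.8% for WS8192.

```c
#include <assert.h>

/* Relative change of result versus baseline, in percent. */
static double sketch_percent_change(double baseline, double result) {
    assert(baseline > 0.0);
    return (result - baseline) / baseline * 100.0;
}
```

Whether a change of this size is meaningful depends on run-to-run variance, which this report does not state.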

Investigation Needed 🔍

Observation

The hash table optimization did NOT deliver the expected WS8192 improvement. Possible reasons:

  1. SuperSlab Not Used in Benchmark

    • Default bench settings may disable SuperSlab path
    • Test with: HAKMEM_TINY_USE_SUPERSLAB=1
    • When enabled, performance drops to 15M ops/s
  2. Different Bottleneck

    • Phase 8 analysis identified SuperSlab lookup as a 50-80 cycle bottleneck
    • The actual bottleneck may be elsewhere (fragmentation, TLS drain, etc.)
    • Profiling is needed to confirm the actual hot path
  3. Hash Table Not Exercised

    • Benchmark may be hitting TLS fast path entirely
    • SuperSlab lookups may not happen in hot path
    • Need to verify with profiling/tracing

Next Steps for Investigation

  1. Profile Actual Bottleneck

    perf record -g ./bench_random_mixed_hakmem 10000000 8192
    perf report
    
  2. Enable SuperSlab and Measure

    HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
    
  3. Check Lookup Statistics

    • Build a debug version (without the RELEASE flag)
    • Enable HAKMEM_SS_MAP_TRACE=1
    • Count actual lookup calls
  4. Verify TLS vs SuperSlab Split

    • Check what percentage of allocations hit TLS vs SuperSlab
    • Benchmark may be 100% TLS (fast path) with no SuperSlab lookups
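
One way to measure the split in step 4 is a pair of counters bumped on each path and summarized at exit. This is a sketch, not existing hakmem instrumentation: `SketchPathStats`, `sketch_tls_share`, and `sketch_print_split` are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counters: one bump per operation on each path. */
typedef struct {
    unsigned long tls_hits;      /* served from the TLS fast path */
    unsigned long map_lookups;   /* fell through to the SuperSlab lookup */
} SketchPathStats;

/* Percentage of operations served by TLS; 0 when nothing was recorded. */
static double sketch_tls_share(const SketchPathStats *s) {
    assert(s != NULL);
    unsigned long total = s->tls_hits + s->map_lookups;
    if (total == 0) return 0.0;
    return 100.0 * (double)s->tls_hits / (double)total;
}

/* Print a one-line summary to stderr, e.g. from an atexit() handler. */
static void sketch_print_split(const SketchPathStats *s) {
    fprintf(stderr, "TLS %.1f%% / SuperSlab lookup %.1f%%\n",
            sketch_tls_share(s), 100.0 - sketch_tls_share(s));
}
```

A TLS share close to 100% would support the hypothesis that the benchmark never reaches the hash table on its hot path.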

Code Quality

All new code follows Box pattern:

  • Single Responsibility
  • Clear Contracts
  • Observable (debug macros)
  • Composable (coexists with legacy)
  • No compilation warnings
  • No runtime crashes

Files Modified/Created

New Files (4)

  1. core/box/ss_addr_map_box.h
  2. core/box/ss_addr_map_box.c
  3. core/box/ss_tls_hint_box.h
  4. core/box/ss_tls_hint_box.c

Modified Files (4)

  1. core/hakmem_tiny_lazy_init.inc.h - Added init call
  2. core/hakmem_super_registry.c - Added insert/remove hooks
  3. core/hakmem_super_registry.h - Replaced lookup implementation
  4. Makefile - Added new modules

Documentation (2)

  1. PHASE9_1_PROGRESS.md - Detailed progress tracking
  2. PHASE9_1_COMPLETE.md - This file

Lessons Learned

  1. Premature Optimization

    • Phase 8 analysis named a bottleneck without profiling
    • It assumed SuperSlab lookup was the problem
    • Profiling should have come before implementing a solution
  2. Benchmark Configuration

    • Default benchmark may not exercise the optimized path
    • Need to verify assumptions about what code paths are executed
    • Environment variables can dramatically change behavior
  3. Infrastructure Still Valuable

    • Even if not the current bottleneck, O(1) lookup is correct design
    • Future workloads may benefit (more SuperSlabs, different patterns)
    • Clean Box-based architecture enables future optimization

Recommendations

Option 1: Profile and Re-Target

  1. Run perf profiling on WS8192 benchmark
  2. Identify actual bottleneck (may not be SuperSlab lookup)
  3. Implement targeted fix for real bottleneck
  4. Re-benchmark

Timeline: 1-2 days
Risk: Low
Expected: 20-30M ops/s at WS8192

Option 2: Enable SuperSlab and Optimize

  1. Configure benchmark to force SuperSlab usage
  2. Measure hash table effectiveness with SuperSlab enabled
  3. Optimize SuperSlab fragmentation/capacity issues
  4. Re-benchmark

Timeline: 2-3 days
Risk: Medium
Expected: 18-22M ops/s at WS8192

Option 3: Accept Baseline and Move Forward

  1. Keep hash table infrastructure (no harm, better design)
  2. Focus on other optimization opportunities
  3. Return to this if profiling shows it's needed later

Timeline: 0 days (done)
Risk: Low
Expected: 16-17M ops/s at WS8192 (status quo)

Conclusion

Phase 9-1 successfully delivered clean, well-architected infrastructure for O(1) SuperSlab lookups. The code compiles, runs without crashes, and follows all Box pattern principles.

However, benchmark results show no improvement, suggesting either:

  1. The identified bottleneck was incorrect
  2. The benchmark doesn't exercise the optimized path
  3. A different bottleneck dominates performance

Recommended Next Step: Profile with perf to identify the actual bottleneck before doing further optimization work.


Prepared by: Claude (Sonnet 4.5)
Timestamp: 2025-11-30 06:40 JST
Status: Infrastructure complete, performance investigation needed