Phase 9-2 Benchmark Report: WS8192 Performance Analysis

Date: 2025-11-30
Test Configuration: WS8192 (Working Set = 8192 allocations)
Benchmark: bench_random_mixed_hakmem 10000000 8192
Status: Baseline measurements complete, optimization not yet implemented


Executive Summary

The WS8192 benchmark was measured with the correct parameters. Results:

  1. SuperSlab OFF vs ON: nearly identical performance (16.23M vs 16.15M ops/s, -0.51%)
  2. Gap vs expectations: Phase 9-2 projected 25-30M ops/s (+50-80%), but the measurements show no improvement
  3. Root cause: the Phase 9-2 fix (EMPTY→Freelist recycling) turns out to be unimplemented
  4. Next step: implementation of Phase 9-2 Option A is required

1. Benchmark Results

1.1 SuperSlab OFF (Baseline)

HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
Run      Throughput (ops/s)  Time (s)
1        16,468,918          0.607
2        16,192,733          0.618
3        16,035,542          0.624
Average  16,232,398          0.616
Std Dev  178,517 (±1.1%)     0.007

Key Observations:

  • Consistent performance (±1.1% variance)
  • 4x [SS_BACKEND] shared_fail→legacy cls=7 warnings
  • TLS_SLL errors present (header corruption warnings)

1.2 SuperSlab ON (Current State)

HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
Run      Throughput (ops/s)  Time (s)
1        16,231,848          0.616
2        16,305,843          0.613
3        15,910,918          0.628
Average  16,149,536          0.619
Std Dev  171,766 (±1.1%)     0.007

Key Observations:

  • No performance improvement (-0.51% vs baseline)
  • Same shared_fail→legacy warnings (4x Class 7 fallbacks)
  • Same TLS_SLL errors
  • SuperSlab enabled but not providing benefits

1.3 Improvement Analysis

Baseline (SuperSlab OFF): 16.23 M ops/s
Current (SuperSlab ON):   16.15 M ops/s
Improvement:              -0.51% (REGRESSION, within noise)

Expected (Phase 9-2):     25-30 M ops/s
Gap:                      -8.85 to -13.85 M ops/s (-35% to -46%)

Verdict: SuperSlab is enabled but not functional due to missing EMPTY recycling.


2. Problem Analysis

2.1 Why SuperSlab Has No Effect

From PHASE9_2_SUPERSLAB_BACKEND_INVESTIGATION.md investigation:

Root Cause: Shared pool Stage 3 soft cap blocks new SuperSlab allocation, but EMPTY slabs are not recycled to Stage 1 freelist.

Flow:

1. Benchmark allocates ~820 Class 7 blocks (10% of WS=8192)
2. Shared pool allocates 2 SuperSlabs (512KB each = 1022 blocks total)
3. class_active_slots[7] = 2 (soft cap reached)
4. Next allocation request:
   - Stage 0.5 (EMPTY scan): Finds nothing (only 2 SS, both ACTIVE)
   - Stage 1 (freelist): Empty (no EMPTY→ACTIVE transitions)
   - Stage 2 (UNUSED claim): Exhausted (first pass only)
   - Stage 3 (new SS alloc): FAIL (soft cap: current=2 >= limit=2)
5. shared_pool_acquire_slab() returns -1
6. Falls back to legacy backend
7. Legacy backend uses system malloc → kernel overhead

Result: SuperSlab backend is bypassed 4 times during benchmark → falls back to legacy system malloc.
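
For reference, the staged path described above can be condensed into a short sketch. This is an illustrative C rendering of the flow, not the actual implementation: the stage helpers and class_soft_cap() are hypothetical names; shared_pool_acquire_slab() and class_active_slots[] are the names this report uses.

int shared_pool_acquire_slab(int class_idx) {
    int slot;
    if ((slot = stage0_scan_empty(class_idx))   >= 0) return slot;  // Stage 0.5: EMPTY scan (finds nothing here)
    if ((slot = stage1_pop_freelist(class_idx)) >= 0) return slot;  // Stage 1: empty without EMPTY recycling
    if ((slot = stage2_claim_unused(class_idx)) >= 0) return slot;  // Stage 2: exhausted after the first pass
    if (class_active_slots[class_idx] >= class_soft_cap(class_idx))
        return -1;                                  // Stage 3 blocked by soft cap → caller falls back to legacy
    return stage3_alloc_new_superslab(class_idx);   // Stage 3: map a new SuperSlab
}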

2.2 Observable Evidence

Log Snippet:

[SS_BACKEND] shared_fail→legacy cls=7  ← SuperSlab failed, using legacy
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7
[SS_BACKEND] shared_fail→legacy cls=7

What This Means:

  • SuperSlab attempted allocation → hit soft cap → failed
  • Fell back to hak_tiny_alloc_superslab_backend_legacy()
  • Legacy backend uses system malloc (not SuperSlab)
  • Kernel overhead: mmap/munmap syscalls → 55% CPU in kernel

Why No Performance Difference:

  • SuperSlab ON: Uses legacy backend (same as SuperSlab OFF)
  • SuperSlab OFF: Uses legacy backend (expected)
  • Both configurations → same code path → same performance

3. Missing Implementation: EMPTY→Freelist Recycling

3.1 What Needs to Be Implemented

Phase 9-2 Option A (from investigation report):

Step 1: Add EMPTY Detection to Remote Drain

File: core/superslab_slab.c (after line 109)

void _ss_remote_drain_to_freelist_unsafe(SuperSlab* ss, int slab_idx, TinySlabMeta* meta) {
    // ... existing drain logic ...

    meta->freelist = prev;
    atomic_store(&ss->remote_counts[slab_idx], 0);

    // ✅ NEW: Check if slab is now EMPTY
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);  // Set empty_mask bit

        // Notify shared pool: push to per-class freelist
        int class_idx = (int)meta->class_idx;
        if (class_idx >= 0 && class_idx < TINY_NUM_CLASSES_SS) {
            shared_pool_release_slab(ss, slab_idx);
        }
    }

    // ... update masks ...
}

Step 2: Add EMPTY Detection to TLS SLL Drain

File: core/box/tls_sll_drain_box.c

uint32_t tiny_tls_sll_drain(int class_idx, uint32_t batch_size) {
    // ... existing drain logic: pops up to batch_size blocks from the TLS SLL
    //     and pushes them onto the owning slab's freelist (ss, slab_idx, and
    //     meta are resolved during the drain) ...

    // ✅ NEW: after draining blocks from TLS SLL to freelist, check for EMPTY
    if (meta->used == 0 && meta->capacity > 0) {
        ss_mark_slab_empty(ss, slab_idx);
        shared_pool_release_slab(ss, slab_idx);
    }

    return drained;  // number of blocks actually drained
}
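
For context on where these released slots go, the Stage 1 freelist can be a per-class lock-free (Treiber) stack. Below is a self-contained sketch of the push that shared_pool_release_slab() might perform; SlotNode, slots[], freelist_head[], and stage1_push() are illustrative assumptions, not hakmem's actual structures.

#include <stdatomic.h>
#include <stdint.h>

#ifndef TINY_NUM_CLASSES_SS
#define TINY_NUM_CLASSES_SS 8          // assumed class count
#endif
#define SLOT_NONE UINT32_MAX           // empty-list sentinel

typedef struct { _Atomic uint32_t next; } SlotNode;   // intrusive freelist link

static SlotNode slots[1024];                                 // assumed slot table
static _Atomic uint32_t freelist_head[TINY_NUM_CLASSES_SS];  // init all to SLOT_NONE

// Lock-free push: makes an EMPTY slot visible to Stage 1 without taking a lock.
static void stage1_push(int cls, uint32_t slot_idx) {
    uint32_t head = atomic_load_explicit(&freelist_head[cls], memory_order_relaxed);
    do {
        atomic_store_explicit(&slots[slot_idx].next, head, memory_order_relaxed);
    } while (!atomic_compare_exchange_weak_explicit(
                 &freelist_head[cls], &head, slot_idx,
                 memory_order_release, memory_order_relaxed));
}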

3.2 Expected Impact (After Implementation)

Performance Prediction (from Phase 9-2 investigation, Section 9.2):

Configuration               Throughput     Kernel Overhead  Stage 1 Hit Rate
Current (no recycling)      16.5 M ops/s   55%              0%
Option A (EMPTY recycling)  25-28 M ops/s  15%              80%
Option A+B (+ 2MB SS)       30-35 M ops/s  12%              85%

Why +50-70% Improvement:

  • EMPTY slabs recycle instantly via lock-free Stage 1
  • Soft cap never hit (slots reused, not created)
  • Eliminates mmap/munmap overhead from legacy fallback
  • SuperSlab backend becomes fully functional
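
To make the first bullet concrete, here is the matching Stage 1 pop, reusing the illustrative slots[]/freelist_head[] structures sketched under Section 3.1. The winning CAS is what prevents two threads from activating the same slot; a production version would add a tag or epoch to the head to rule out ABA.

// Lock-free pop: returns a recycled slot index, or -1 if Stage 1 is empty.
static int32_t stage1_pop(int cls) {
    uint32_t head = atomic_load_explicit(&freelist_head[cls], memory_order_acquire);
    while (head != SLOT_NONE) {
        uint32_t next = atomic_load_explicit(&slots[head].next, memory_order_relaxed);
        if (atomic_compare_exchange_weak_explicit(
                &freelist_head[cls], &head, next,
                memory_order_acquire, memory_order_acquire))
            return (int32_t)head;   // won the CAS: this slot is ours
        // CAS failure reloaded head; loop retries
    }
    return -1;
}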

4. Comparison with Phase 9-1

4.1 Phase 9-1 Status

From PHASE9_1_PROGRESS.md:

Phase 9-1 Goal: Optimize SuperSlab lookup (50-80 cycles → 8-12 cycles)
Status: Infrastructure complete (4/6 steps), migration not started

  • ✅ Steps 1-4: Hash table + TLS hints implementation
  • ⏸️ Step 5: Migration (IN PROGRESS)
  • ⏸️ Step 6: Benchmark (PENDING)

Key Point: Phase 9-1 optimizations are not yet integrated into hot path.

4.2 Phase 9-2 Status

Phase 9-2 Goal: Fix SuperSlab backend (eliminate legacy fallbacks)
Status: Investigation complete, implementation not started

  • ✅ Root cause identified (EMPTY recycling missing)
  • ✅ 4 fix options proposed (Option A recommended)
  • ⏸️ Implementation: NOT STARTED
  • ⏸️ Benchmark: NOT STARTED

Key Point: Phase 9-2 is still in planning phase.


5. Performance Budget Analysis

5.1 Current Bottlenecks (WS8192)

Total: ~170 cycles/op (16.5 M ops/s @ 2.8 GHz)
  - SuperSlab Lookup:   50-80 cycles  ← Phase 9-1 target
  - Legacy Fallback:    30-50 cycles  ← Phase 9-2 target
  - Fragmentation:      30-50 cycles
  - TLS Drain:          10-15 cycles
  - Actual Work:        30-40 cycles

Kernel Overhead: 55% (mmap/munmap from legacy fallback)
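
As a sanity check on these budgets: at a fixed clock, throughput is just frequency divided by the per-op cycle count. A back-of-the-envelope helper (2.8 GHz is this report's assumed clock):

// ops/s = clock_hz / cycles_per_op
static double ops_per_sec(double cycles_per_op) { return 2.8e9 / cycles_per_op; }
// ops_per_sec(170) ≈ 16.5M, ops_per_sec(152) ≈ 18.4M, ops_per_sec(95) ≈ 29.5M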

5.2 Expected After Phase 9-1 + 9-2

After Phase 9-1 (lookup optimization):

Total: 152 cycles/op (18.4 M ops/s baseline)
  - SuperSlab Lookup:   8-12 cycles   ✅ Fixed (hash + TLS hints)
  - Legacy Fallback:    30-50 cycles  ← Still broken
  - Fragmentation:      30-50 cycles
  - TLS Drain:          10-15 cycles
  - Actual Work:        30-40 cycles

Expected: 16.5M → 23-25M ops/s (+39-52%)

After Phase 9-1 + 9-2 (lookup + backend):

Total: 95 cycles/op (29.5 M ops/s baseline)
  - SuperSlab Lookup:   8-12 cycles   ✅ Fixed (Phase 9-1)
  - Legacy Fallback:    0 cycles      ✅ Fixed (Phase 9-2)
  - SuperSlab Backend:  15-20 cycles  ✅ Stage 1 reuse
  - Fragmentation:      20-30 cycles
  - TLS Drain:          10-15 cycles
  - Actual Work:        30-40 cycles

Expected: 16.5M → 30-35M ops/s (+80-110%)
Kernel Overhead: 55% → 12-15%


6. Diagnostic Output Analysis

6.1 Repeated Warnings

TLS_SLL_POP_POST_INVALID:

[TLS_SLL_POP_POST_INVALID] cls=6 next=0x7 last_writer=pop
[TLS_SLL_HDR_RESET] cls=6 base=0x... got=0x00 expect=0xa6 count=0
[TLS_SLL_POP_POST_INVALID] cls=6 next=0x5b last_writer=pop

Analysis (from Phase 9-2 investigation, Section 2):

  • cls=6: Class 6 (512-byte blocks)
  • got=0x00: Header corrupted/zeroed
  • count=0: One-time event (not recurring)
  • Hypothesis: Use-after-free or slab reuse race
  • Mitigation: Existing guards (tiny_tls_slab_reuse_guard()) should prevent
  • Verdict: Not critical (one-time event, guards in place)
  • Action: Monitor with HAKMEM_SUPER_REG_DEBUG=1 for recurrence
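
The got/expect values suggest a per-class header magic. A minimal sketch of the check these messages imply, assuming a 0xa0 | class_idx magic byte (inferred from expect=0xa6 at cls=6, not confirmed against the source):

#include <stdint.h>

// Validate the per-class header byte before trusting a popped SLL pointer.
static inline int tls_sll_header_ok(const uint8_t* block, int cls) {
    return block[0] == (uint8_t)(0xa0 | (unsigned)cls);
}
// got=0x00 in the log means this check failed: the header was zeroed,
// consistent with a use-after-free or slab-reuse race.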

6.2 Shared Fail Events

[SS_BACKEND] shared_fail→legacy cls=7

Count: 4 events per benchmark run
Class: Class 7 (2048-byte allocations, 1024-1040B range in benchmark)
Reason: Soft cap reached (Stage 3 blocked)
Impact: Falls back to system malloc → kernel overhead

This is the PRIMARY bottleneck that Phase 9-2 Option A will fix.


7. Verification of Test Configuration

7.1 Benchmark Parameters

Command Used:

./bench_random_mixed_hakmem 10000000 8192

Breakdown:

  • 10000000: 10M cycles (steady-state measurement)
  • 8192: Working set size (WS8192)

From bench_random_mixed.c (line 45-46):

int cycles = (argc>1)? atoi(argv[1]) : 10000000; // total ops
int ws     = (argc>2)? atoi(argv[2]) : 8192;    // working-set slots

Allocation Pattern (line 116):

size_t sz = 16u + (r & 0x3FFu); // 16..1040 bytes (approx 16..1024)

Class Distribution (estimated):

16-64B   → Classes 0-3 (~40%)
64-256B  → Classes 4-5 (~30%)
256-512B → Class 6      (~20%)
512-1040B → Class 7     (~10% = ~820 live allocations)

Why Class 7 Exhausts:

  • 820 live allocations ÷ 511 blocks/SuperSlab = 1.6 SuperSlabs (rounded to 2)
  • Soft cap = 2 → any additional allocation fails → legacy fallback
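
The same arithmetic as a one-liner (values from this report; 511 usable blocks per 512 KiB class-7 SuperSlab):

int superslabs_needed = (820 + 511 - 1) / 511;  // ceil(820/511) = 2 → exactly at the soft cap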

7.2 Comparison with Phase 9-1 Baseline

From PHASE9_1_PROGRESS.md (line 142):

./bench_random_mixed_hakmem 10000000 8192  # ~16.5 M ops/s

Current Measurement:

  • SuperSlab OFF: 16.23 M ops/s
  • SuperSlab ON: 16.15 M ops/s

Match: Values align with Phase 9-1 baseline (16.5M vs 16.2M, within variance)


8. Next Steps

8.1 Immediate Actions

  1. Implement Phase 9-2 Option A (EMPTY→Freelist recycling)

    • Modify core/superslab_slab.c (remote drain)
    • Modify core/box/tls_sll_drain_box.c (TLS SLL drain)
    • Add EMPTY detection: if (meta->used == 0) { shared_pool_release_slab(...) }
  2. Run Debug Build to verify EMPTY recycling

    make clean
    make CFLAGS="-O2 -g -DHAKMEM_BUILD_RELEASE=0" bench_random_mixed_hakmem
    
    HAKMEM_TINY_USE_SUPERSLAB=1 \
    HAKMEM_SS_ACQUIRE_DEBUG=1 \
    HAKMEM_SHARED_POOL_STAGE_STATS=1 \
      ./bench_random_mixed_hakmem 100000 256 42
    
  3. Verify Stage 1 Hits in debug output

    • Look for [SP_ACQUIRE_STAGE1_LOCKFREE] logs
    • Confirm freelist population: [SP_SLOT_FREELIST_LOCKFREE]
    • Verify zero shared_fail→legacy events

8.2 Performance Validation

  1. Re-run WS8192 Benchmark (after Option A implementation)

    # Baseline (should be same as before)
    HAKMEM_TINY_USE_SUPERSLAB=0 ./bench_random_mixed_hakmem 10000000 8192
    
    # Optimized (should show +50-70% improvement)
    HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
    
  2. Success Criteria (from Phase 9-2 Section 11.2):

    • Throughput: 16.5M → 25-30M ops/s (+50-80%)
    • Zero shared_fail→legacy events
    • Stage 1 hit rate: 70-80% (after warmup)
    • Kernel overhead: 55% → <15%

8.3 Optional Enhancements

  1. Implement Option B (revert to 2MB SuperSlab)

    • Change SUPERSLAB_LG_DEFAULT from 19 → 21
    • Expected additional gain: +10-15% (30-35M ops/s total)
  2. Implement Option D (expand EMPTY scan limit)

    • Change HAKMEM_SS_EMPTY_SCAN_LIMIT default from 16 → 64
    • Expected additional gain: +3-8% (marginal)

9. Risk Assessment

9.1 Implementation Risks (Option A)

Risk                            Likelihood  Impact    Mitigation
Double-free in EMPTY detection  Low         Critical  Add meta->used > 0 assertion before shared_pool_release_slab()
Race: EMPTY→ACTIVE→EMPTY        Medium      Medium    Use atomic meta->used reads; Stage 1 CAS prevents double-activation
Deadlock in release_slab        Low         Medium    Use lock-free push (already implemented)

Overall: Low risk (Box boundaries well-defined, guards in place)
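
A hedged sketch combining the first two mitigations (an acquire-ordered read of meta->used plus a double-release check). SuperSlab and TinySlabMeta are the report's types; ss_slab_is_empty_marked() is a hypothetical helper, and meta->used is assumed to be (or be made) atomic.

#include <assert.h>
#include <stdatomic.h>

static void maybe_release_empty(SuperSlab* ss, int slab_idx, TinySlabMeta* meta) {
    // Acquire ordering makes the remote frees that dropped used to 0 visible here.
    uint32_t used = atomic_load_explicit(&meta->used, memory_order_acquire);
    if (used == 0 && meta->capacity > 0) {
        assert(!ss_slab_is_empty_marked(ss, slab_idx)); // guard against double release
        ss_mark_slab_empty(ss, slab_idx);
        shared_pool_release_slab(ss, slab_idx);
    }
}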

9.2 Performance Risks

Risk                            Likelihood  Impact  Mitigation
Improvement less than expected  Medium      Medium  Profile with perf; check Stage 1 hit rate; consider Option B
Regression in other workloads   Low         Medium  Run full benchmark suite (WS256, cache_thrash, larson)
Memory leak from freelist       Low         High    Monitor RSS growth; verify EMPTY detection logic

Overall: Medium risk (new feature, but small code change)


10. Lessons Learned

10.1 Benchmark Parameter Confusion

Issue: The initial request stated that "the measurement used default parameters, so the workload was too light."
Reality: The default parameters ARE WS8192 (line 46 in bench_random_mixed.c):

int ws = (argc>2)? atoi(argv[2]) : 8192;  // default: 8192

Takeaway: Always check source code to verify default behavior (documentation may be outdated).

10.2 SuperSlab Enabled ≠ SuperSlab Functional

Issue: HAKMEM_TINY_USE_SUPERSLAB=1 enables the SuperSlab code path, but does not guarantee it is actually used.
Reality: The legacy fallback is triggered whenever the SuperSlab backend fails (soft cap, OOM, etc.).

Takeaway: Check for shared_fail→legacy warnings in output to verify SuperSlab is actually being used.

10.3 Phase Dependencies

Issue: Phase 9-2 was assumed complete (based on the PHASE9_2_*.md files).
Reality: The Phase 9-2 investigation is complete, but implementation has not started.

Takeaway: Check document status header (e.g., "Status: Root Cause Analysis Complete" vs "Status: Implementation Complete")


11. Conclusion

Current State: WS8192 benchmark correctly measured at 16.2-16.3 M ops/s, consistent across SuperSlab ON/OFF.

Root Cause: SuperSlab backend falls back to legacy system malloc due to missing EMPTY→Freelist recycling (Phase 9-2 Option A).

Expected Improvement: After implementing Option A, expect 25-30 M ops/s (+50-80%) by eliminating legacy fallbacks and enabling lock-free Stage 1 EMPTY reuse.

Next Action: Implement Phase 9-2 Option A (2-3 hour task), then re-benchmark WS8192 to verify +50-70% improvement.


Report Prepared By: Claude (Sonnet 4.5)
Benchmark Date: 2025-11-30
Total Test Time: ~6 seconds (6 runs × 0.6s average)
Status: Baseline established, awaiting Phase 9-2 implementation