hakmem/PHASE9_1_PROGRESS.md
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added  environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and  to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected  to ensure  is properly linked into , resolving an  error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for  environment to prevent unintended fallbacks to 's  for non-tiny allocations, allowing comprehensive testing of the  allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created  to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in  by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00

Phase 9-1 Progress Report: SuperSlab Lookup Optimization

Date: 2025-11-30
Status: Infrastructure Complete (4/6 steps done)
Next: Integration and Benchmarking

Summary

Phase 9-1 aims to fix the critical SuperSlab lookup bottleneck identified in Phase 8:

  • Current: 50-80 cycles per lookup (linear probing in registry)
  • Target: 10-20 cycles average (hash table + TLS hints)
  • Expected Impact: 16.5M → 23-25M ops/s at WS8192 (+39-52%)

Completed Steps

Phase 9-1-1: SuperSlabMap Box Design

Files Created:

  • core/box/ss_addr_map_box.h (143 lines)
  • core/box/ss_addr_map_box.c (262 lines)

Design:

  • Hash table with 8192 buckets (2^13)
  • Chaining for collision resolution
  • Hash function: (ptr >> 19) & (SS_MAP_HASH_SIZE - 1)
  • Uses __libc_malloc/__libc_free to avoid recursion
  • Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)

Box Pattern Compliance:

  • Single Responsibility: Address→SuperSlab mapping ONLY
  • Clear Contract: O(1) amortized lookup
  • Observable: Debug macros (SS_MAP_LOOKUP, SS_MAP_INSERT, SS_MAP_REMOVE)
  • Composable: Can coexist with legacy registry

Performance Contract:

  • Insert: O(1) amortized
  • Lookup: O(1) amortized (tries 3 alignments, hash + chain traversal)
  • Remove: O(1) amortized
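
The design above can be sketched roughly as follows. This is a minimal illustration, not the contents of ss_addr_map_box.c: the entry layout, the explicit size argument, and the range check are assumptions, and plain malloc/free stand in for __libc_malloc/__libc_free so the sketch compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SS_MAP_HASH_SIZE 8192u   /* 2^13 buckets, as in the design */
#define SS_MAP_SHIFT     19      /* 512KB granularity */

struct SuperSlab { int id; };    /* stand-in; the real layout is not shown here */

typedef struct SSMapEntry {
    uintptr_t base;              /* SuperSlab base address (the key) */
    size_t size;                 /* assumed: span covered by this SuperSlab */
    struct SuperSlab *ss;
    struct SSMapEntry *next;     /* collision chain */
} SSMapEntry;

typedef struct { SSMapEntry *buckets[SS_MAP_HASH_SIZE]; } SSAddrMap;

static size_t ss_map_hash(uintptr_t p) {
    return (size_t)((p >> SS_MAP_SHIFT) & (SS_MAP_HASH_SIZE - 1));
}

/* Insert at the head of the bucket chain. Returns 0 on success. */
int ss_map_insert(SSAddrMap *m, void *base, size_t size, struct SuperSlab *ss) {
    assert(m && base && ss);
    SSMapEntry *e = malloc(sizeof *e);   /* real code would use __libc_malloc */
    if (!e) return -1;
    size_t h = ss_map_hash((uintptr_t)base);
    e->base = (uintptr_t)base;
    e->size = size;
    e->ss = ss;
    e->next = m->buckets[h];
    m->buckets[h] = e;
    return 0;
}

/* Round ptr down to each supported alignment and probe that bucket. */
struct SuperSlab *ss_map_lookup(const SSAddrMap *m, const void *ptr) {
    static const uintptr_t aligns[3] = { 512u << 10, 1u << 20, 2u << 20 };
    for (int i = 0; i < 3; i++) {
        uintptr_t base = (uintptr_t)ptr & ~(aligns[i] - 1);
        for (SSMapEntry *e = m->buckets[ss_map_hash(base)]; e; e = e->next)
            if (e->base == base && (uintptr_t)ptr - base < e->size)
                return e->ss;
    }
    return NULL;
}

/* Unlink and free the entry for base. Returns 0 if it was present. */
int ss_map_remove(SSAddrMap *m, void *base) {
    SSMapEntry **pp = &m->buckets[ss_map_hash((uintptr_t)base)];
    for (; *pp; pp = &(*pp)->next) {
        if ((*pp)->base == (uintptr_t)base) {
            SSMapEntry *dead = *pp;
            *pp = dead->next;
            free(dead);                  /* real code would use __libc_free */
            return 0;
        }
    }
    return -1;
}
```

The range check matters because rounding an address just past a 512KB SuperSlab down to 1MB can land on that SuperSlab's base; without the size comparison the lookup would return a false match.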

Phase 9-1-3: Debug Macros

Implemented:

// Environment-gated tracing: HAKMEM_SS_MAP_TRACE=1
#define SS_MAP_LOOKUP(map, ptr)   // Logs: ptr=%p -> ss=%p
#define SS_MAP_INSERT(map, base, ss)  // Logs: base=%p ss=%p
#define SS_MAP_REMOVE(map, base)      // Logs: base=%p

Statistics Functions (Debug builds):

  • ss_map_print_stats() - collision rate, load factor, longest chain
  • ss_map_collision_rate() - for performance tuning
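
One way environment-gated tracing of this kind could be wired up is sketched below. The helper names (ss_trace_flag_from, ss_map_trace_enabled, SS_MAP_TRACE_*) are hypothetical, and treating only a value starting with '1' as "on" is an assumption; the actual macros in ss_addr_map_box.h may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: interpret an environment value as an on/off flag.
 * Assumption: only a value beginning with '1' enables tracing. */
int ss_trace_flag_from(const char *value) {
    return (value != NULL && value[0] == '1') ? 1 : 0;
}

/* Reads HAKMEM_SS_MAP_TRACE once and caches the answer.
 * The cache is a plain static, so this sketch is not thread-safe. */
int ss_map_trace_enabled(void) {
    static int cached = -1;
    if (cached < 0)
        cached = ss_trace_flag_from(getenv("HAKMEM_SS_MAP_TRACE"));
    assert(cached == 0 || cached == 1);
    return cached;
}

/* Trace-only macros: they log to stderr when enabled and do nothing else. */
#define SS_MAP_TRACE(...) \
    do { if (ss_map_trace_enabled()) fprintf(stderr, __VA_ARGS__); } while (0)
#define SS_MAP_TRACE_INSERT(base, ss) \
    SS_MAP_TRACE("[SS_MAP_INSERT] base=%p ss=%p\n", (void *)(base), (void *)(ss))
#define SS_MAP_TRACE_REMOVE(base) \
    SS_MAP_TRACE("[SS_MAP_REMOVE] base=%p\n", (void *)(base))
```

Caching the getenv result keeps the disabled case to a single integer compare per call site.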

Phase 9-1-4: TLS Hints

Files Created:

  • core/box/ss_tls_hint_box.h (238 lines)
  • core/box/ss_tls_hint_box.c (22 lines)

Design:

__thread struct SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES];

// Fast path: Check TLS hint (5-10 cycles)
// Slow path: Hash table lookup + update hint (15-25 cycles)
struct SuperSlab* ss_tls_hint_lookup(int class_idx, void* ptr);

Performance Contract:

  • Hit case: 5-10 cycles (TLS load + range check)
  • Miss case: 15-25 cycles (hash table + hint update)
  • Expected hit rate: 80-95% (locality of reference)
  • Net improvement: 50-80 cycles → 10-15 cycles average

Statistics (Debug builds):

typedef struct {
    uint64_t total_lookups;
    uint64_t hint_hits;      // TLS cache hits
    uint64_t hint_misses;    // Fallback to hash table
    uint64_t hash_hits;      // Hash table successes
    uint64_t hash_misses;    // NULL returns
} SSTLSHintStats;

// Environment-gated: HAKMEM_SS_TLS_HINT_TRACE=1
void ss_tls_hint_print_stats(void);

API Functions:

  • ss_tls_hint_init() - Initialize TLS cache
  • ss_tls_hint_lookup(class_idx, ptr) - Main lookup with caching
  • ss_tls_hint_update(class_idx, ss) - Prefill hint (hot path)
  • ss_tls_hint_invalidate(class_idx, ss) - Clear hint on SuperSlab free
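
The fast-path/slow-path split could look roughly like this. It is a sketch under stated assumptions: the SuperSlab fields (base, size), the value of TINY_NUM_CLASSES, and slow_lookup (a linear scan standing in for the hash table) are all placeholders, not the code in ss_tls_hint_box.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8   /* assumed value for the sketch */

struct SuperSlab { uintptr_t base; size_t size; };   /* assumed fields */

static __thread struct SuperSlab *g_tls_ss_hint[TINY_NUM_CLASSES];

static struct SuperSlab *g_known[4];   /* stand-in for the hash table */
static int g_slow_calls;               /* counts slow-path entries */

/* Range check; unsigned wrap-around rejects ptr below base. */
static int ss_contains(const struct SuperSlab *ss, const void *ptr) {
    return ss != NULL && (uintptr_t)ptr - ss->base < ss->size;
}

/* Placeholder for ss_map_lookup(): scan a tiny table. */
static struct SuperSlab *slow_lookup(const void *ptr) {
    g_slow_calls++;
    for (size_t i = 0; i < 4; i++)
        if (ss_contains(g_known[i], ptr)) return g_known[i];
    return NULL;
}

struct SuperSlab *ss_tls_hint_lookup(int class_idx, void *ptr) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    struct SuperSlab *hint = g_tls_ss_hint[class_idx];
    if (ss_contains(hint, ptr)) return hint;      /* fast path: hint hit */
    struct SuperSlab *ss = slow_lookup(ptr);      /* slow path: full lookup */
    if (ss) g_tls_ss_hint[class_idx] = ss;        /* refresh hint for next time */
    return ss;
}

/* Clear the hint only if it still points at the SuperSlab being freed. */
void ss_tls_hint_invalidate(int class_idx, struct SuperSlab *ss) {
    assert(class_idx >= 0 && class_idx < TINY_NUM_CLASSES);
    if (g_tls_ss_hint[class_idx] == ss) g_tls_ss_hint[class_idx] = NULL;
}
```

Because the hint is thread-local, invalidation as written only clears the calling thread's slot; how other threads' hints are handled on SuperSlab free is not covered by this sketch.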

Pending Steps ⏸️

Phase 9-1-2: O(1) Lookup (2-tier page table) ⏸️

Status: DEFERRED - Hash table is sufficient for Phase 1

Rationale:

  • Current hash table already provides O(1) amortized lookup
  • 2-tier page table would be O(1) worst-case but more complex
  • Benchmark first, optimize only if needed

Potential Future Enhancement:

// 2-tier page table (if hash table shows high collision rate)
// Level 1: (ptr >> 30) & 0x3 = 4 entries (cover 4GB address space)
// Level 2: (ptr >> 19) & 0x7FF = 2048 entries per L1
// Total: 4 × 2048 = 8K pointers (64KB overhead)
// Lookup: Always 2 cache misses (predictable, no chains)
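
For comparison, the deferred 2-tier table could be sketched as below. The type and function names (SSPageTable, ss_pt_set, ss_pt_get) are hypothetical, L2 arrays are allocated lazily with calloc for simplicity, and addresses at or above 4GB are rejected because the layout in the comment only covers 4GB.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SS_PT_L1_SIZE 4u       /* (addr >> 30) & 0x3 */
#define SS_PT_L2_SIZE 2048u    /* (addr >> 19) & 0x7FF */
#define SS_PT_SHIFT   19       /* one slot per 512KB */

struct SuperSlab { int id; };  /* stand-in; the real layout is not shown here */

/* Level 1 holds pointers to level-2 arrays of SuperSlab pointers. */
typedef struct { struct SuperSlab **l1[SS_PT_L1_SIZE]; } SSPageTable;

/* Map the 512KB slot containing addr to ss. Returns 0 on success. */
int ss_pt_set(SSPageTable *t, uint64_t addr, struct SuperSlab *ss) {
    assert(t != NULL);
    if (addr >> 32) return -1;                      /* outside the covered 4GB */
    size_t i1 = (size_t)((addr >> 30) & (SS_PT_L1_SIZE - 1));
    size_t i2 = (size_t)((addr >> SS_PT_SHIFT) & (SS_PT_L2_SIZE - 1));
    if (t->l1[i1] == NULL) {
        t->l1[i1] = calloc(SS_PT_L2_SIZE, sizeof *t->l1[i1]);
        if (t->l1[i1] == NULL) return -1;
    }
    t->l1[i1][i2] = ss;
    return 0;
}

/* Two dependent loads, no chain traversal. */
struct SuperSlab *ss_pt_get(const SSPageTable *t, uint64_t addr) {
    assert(t != NULL);
    if (addr >> 32) return NULL;
    struct SuperSlab **l2 = t->l1[(addr >> 30) & (SS_PT_L1_SIZE - 1)];
    return l2 ? l2[(addr >> SS_PT_SHIFT) & (SS_PT_L2_SIZE - 1)] : NULL;
}
```

In this layout a 1MB or 2MB SuperSlab would need ss_pt_set called for each 512KB slot it spans, which is one of the complexity costs weighed against the hash table.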

Phase 9-1-5: Migration (move existing code over to ss_map_lookup) 🚧

Status: IN PROGRESS - Next task

Plan:

  1. Initialize ss_addr_map at startup

    • Call ss_map_init(&g_ss_addr_map) in hak_init_impl()
  2. Register SuperSlabs on creation

    • Modify hak_super_register() to also call ss_map_insert()
    • Keep old registry for compatibility during migration
  3. Unregister SuperSlabs on free

    • Modify hak_super_unregister() to also call ss_map_remove()
  4. Replace lookup calls

    • Find all hak_super_lookup() calls
    • Replace with ss_tls_hint_lookup(class_idx, ptr)
    • Use ss_map_lookup() where class_idx is unknown
  5. Test dual-mode operation

    • Both old registry and new hash table active
    • Compare results for correctness
    • Gradual rollout: can fall back if issues found
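
The dual-mode comparison in step 5 could be sketched as a thin wrapper that runs both lookups and counts disagreements. Everything here is hypothetical scaffolding: SSDualMode, ss_dual_lookup, and the demo_* stubs are illustrative names, and the two function pointers stand in for hak_super_lookup() and the new map lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct SuperSlab { int id; };   /* stand-in; the real layout is not shown here */

typedef struct SuperSlab *(*ss_lookup_fn)(void *ptr);

typedef struct {
    ss_lookup_fn legacy;        /* old linear-probe registry lookup */
    ss_lookup_fn candidate;     /* new hash-table lookup */
    unsigned long checked;
    unsigned long mismatches;
} SSDualMode;

/* Run both lookups, log any disagreement, and return the legacy answer
 * so behaviour is unchanged while the new path is being validated. */
struct SuperSlab *ss_dual_lookup(SSDualMode *d, void *ptr) {
    assert(d && d->legacy && d->candidate);
    struct SuperSlab *old_ss = d->legacy(ptr);
    struct SuperSlab *new_ss = d->candidate(ptr);
    d->checked++;
    if (old_ss != new_ss) {
        d->mismatches++;
        fprintf(stderr, "[ss dual-mode] mismatch ptr=%p old=%p new=%p\n",
                ptr, (void *)old_ss, (void *)new_ss);
    }
    return old_ss;
}

/* Demo stubs: one lookup that always resolves, one that never does. */
static struct SuperSlab g_demo_ss;
static struct SuperSlab *demo_legacy(void *p) { return p ? &g_demo_ss : NULL; }
static struct SuperSlab *demo_broken(void *p) { (void)p; return NULL; }
```

Returning the legacy result keeps rollback trivial: a nonzero mismatch count flags a problem without changing allocator behaviour.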

Phase 9-1-6: Benchmark (verify the Phase 1 effect) ⏸️

Status: PENDING - After migration

Test Plan:

# Phase 8 baseline (before optimization)
./bench_random_mixed_hakmem 10000000 256   # ~79.2 M ops/s
./bench_random_mixed_hakmem 10000000 8192  # ~16.5 M ops/s

# Phase 9-1 target (after optimization)
./bench_random_mixed_hakmem 10000000 256   # >85 M ops/s (+7%)
./bench_random_mixed_hakmem 10000000 8192  # >23 M ops/s (+39%)

# Debug mode (measure hit rates)
HAKMEM_SS_TLS_HINT_TRACE=1 ./bench_random_mixed_hakmem 10000 256
HAKMEM_SS_MAP_TRACE=1 ./bench_random_mixed_hakmem 10000 8192

Success Criteria:

  • Minimum: WS8192 reaches 23 M ops/s (+39% from 16.5M)
  • Stretch: WS8192 reaches 25 M ops/s (+52% from 16.5M)
  • TLS hint hit rate: >80%
  • Hash table collision rate: <20%

Failure Plan:

  • If <20 M ops/s: Investigate with profiling
    • Check TLS hint hit rate (should be >80%)
    • Check hash table collision rate
    • Consider Phase 9-1-2 (2-tier page table) if needed
  • If 20-23 M ops/s: Acceptable, proceed to Phase 9-2
  • If >23 M ops/s: Excellent, proceed to Phase 9-2

File Summary

New Files Created (4 files)

  1. core/box/ss_addr_map_box.h - Hash table interface
  2. core/box/ss_addr_map_box.c - Hash table implementation
  3. core/box/ss_tls_hint_box.h - TLS cache interface
  4. core/box/ss_tls_hint_box.c - TLS cache implementation

Modified Files (1 file)

  1. Makefile - Added new modules to build
    • OBJS_BASE: Added ss_addr_map_box.o, ss_tls_hint_box.o
    • TINY_BENCH_OBJS_BASE: Added the same two objects
    • SHARED_OBJS: Added _shared.o variants

Compilation Status

  • ss_addr_map_box.o - 17KB (compiled, no warnings except unused function)
  • ss_tls_hint_box.o - 6.0KB (compiled, no warnings)
  • bench_random_mixed_hakmem - Links successfully with both modules

Architecture Overview

┌─────────────────────────────────────────────────────┐
│ Phase 9-1: SuperSlab Lookup Optimization            │
└─────────────────────────────────────────────────────┘

Lookup Path (Before Phase 9-1):
  ptr → hak_super_lookup() → Linear probe (32 iterations)
                           → 50-80 cycles

Lookup Path (After Phase 9-1):
  ptr → ss_tls_hint_lookup(class_idx, ptr)
      ↓
      ├─ Fast path (80-95%): TLS hint hit
      │  └─ ss_contains(hint, ptr) → 5-10 cycles ✅
      │
      └─ Slow path (5-20%): TLS hint miss
         └─ ss_map_lookup(ptr) → Hash table
            └─ 10-20 cycles (hash + chain traversal) ✅

Expected average: 0.85 × 7 + 0.15 × 15 = 8.2 cycles
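
The expected-average figure is a weighted mean of the two paths. A small helper (hypothetical name ss_expected_cycles) makes it easy to try other hit rates and costs:

```c
#include <assert.h>

/* Weighted average lookup cost in cycles.
 * hit_rate is the fraction of lookups served by the TLS hint (0..1). */
double ss_expected_cycles(double hit_rate, double hit_cost, double miss_cost) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
}
```

With the values used above, ss_expected_cycles(0.85, 7, 15) gives 5.95 + 2.25 = 8.2 cycles. Plugging in the extremes of the TLS hint performance contract gives a range of about 5.5 cycles (95% hits, 5-cycle hit, 15-cycle miss) to 13.0 cycles (80% hits, 10-cycle hit, 25-cycle miss).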

Performance Budget Analysis

Phase 8 Baseline (WS8192):

Total: 212 cycles/op
  - SuperSlab Lookup: 50-80 cycles ← BOTTLENECK
  - Legacy Fallback:  30-50 cycles
  - Fragmentation:    30-50 cycles
  - TLS Drain:        10-15 cycles
  - Actual Work:      30-40 cycles

Phase 9-1 Target (WS8192):

Total: 152 cycles/op (60 cycle improvement)
  - SuperSlab Lookup: 8-12 cycles ← OPTIMIZED (hash + TLS)
  - Legacy Fallback:  30-50 cycles
  - Fragmentation:    30-50 cycles
  - TLS Drain:        10-15 cycles
  - Actual Work:      30-40 cycles

Throughput: 2.8 GHz / 152 cycles = 18.4M ops/s (cycle-model estimate)
            Measured 16.5M × 212/152 ≈ 23.0M ops/s → 23-25M ops/s (expected)
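
The throughput arithmetic can be checked with two small helpers (hypothetical names). The first converts a cycle budget into ops/s at a given clock; the second scales a measured throughput by the ratio of cycle budgets, which is one way to relate the 16.5M ops/s baseline to the 23M ops/s target.

```c
#include <assert.h>

/* Millions of ops/s for a given clock (GHz) and cycle cost per op. */
double ss_model_mops(double ghz, double cycles_per_op) {
    assert(cycles_per_op > 0.0);
    return ghz * 1000.0 / cycles_per_op;
}

/* Scale a measured throughput by the ratio of old to new cycle budgets. */
double ss_scaled_mops(double measured_mops, double cycles_before,
                      double cycles_after) {
    assert(cycles_before > 0.0 && cycles_after > 0.0);
    return measured_mops * cycles_before / cycles_after;
}
```

ss_model_mops(2.8, 152) is about 18.4 and ss_model_mops(2.8, 212) is about 13.2, so the pure cycle model sits below the measured 16.5M ops/s baseline. ss_scaled_mops(16.5, 212, 152) is about 23.0, which matches the +39% minimum target; reaching the 25M stretch goal would correspond to roughly 140 cycles/op under the same scaling.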

Risk Assessment

Low Risk

  • Hash table design is proven (similar to jemalloc/mimalloc)
  • TLS hints are simple and well-contained
  • Can run dual-mode (old + new) during migration
  • Easy rollback if issues found

Medium Risk ⚠️

  • Collision rate: If >30%, performance may degrade
    • Mitigation: Measured in stats, can increase bucket count
  • TLS hit rate: If <70%, benefit reduced
    • Mitigation: Measured in stats, can tune hint invalidation

High Risk

  • None identified

Next Steps

  1. Immediate: Start Phase 9-1-5 migration

    • Initialize ss_addr_map in hak_init_impl()
    • Add ss_map_insert/remove to registration paths
    • Find and replace hak_super_lookup() calls
  2. After Migration: Run Phase 9-1-6 benchmarks

    • Compare Phase 8 vs Phase 9-1 performance
    • Measure TLS hit rate and collision rate
    • Validate success criteria
  3. If Successful: Proceed to Phase 9-2

    • Remove old linear-probe registry (cleanup)
    • Optimize hot paths further
    • Consider additional TLS optimizations
  4. If Unsuccessful: Root cause analysis

    • Profile with perf/cachegrind
    • Check TLS hit rate (expect >80%)
    • Check collision rate (expect <20%)
    • Consider Phase 9-1-2 (2-tier page table) if needed

Prepared by: Claude (Sonnet 4.5)
Last Updated: 2025-11-30 06:32 JST
Status: 4/6 steps complete, migration starting