Files
hakmem/PHASE9_1_PROGRESS.md
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks
This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added  environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and  to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected  to ensure  is properly linked into , resolving an  error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for  environment to prevent unintended fallbacks to 's  for non-tiny allocations, allowing comprehensive testing of the  allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created  to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in  by providing clear insights into the failure pathways.
2025-12-01 16:37:59 +09:00

280 lines
9.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 9-1 Progress Report: SuperSlab Lookup Optimization
**Date**: 2025-11-30
**Status**: Infrastructure Complete (4/6 steps done)
**Next**: Integration and Benchmarking
## Summary
Phase 9-1 aims to fix the critical SuperSlab lookup bottleneck identified in Phase 8:
- **Current**: 50-80 cycles per lookup (linear probing in registry)
- **Target**: 10-20 cycles average (hash table + TLS hints)
- **Expected Impact**: 16.5M → 23-25M ops/s at WS8192 (+39-52%)
## Completed Steps ✅
### Phase 9-1-1: SuperSlabMap Box Design ✅
**Files Created:**
- `core/box/ss_addr_map_box.h` (143 lines)
- `core/box/ss_addr_map_box.c` (262 lines)
**Design:**
- Hash table with 8192 buckets (2^13)
- Chaining for collision resolution
- Hash function: `(ptr >> 19) & (SS_MAP_HASH_SIZE - 1)`
- Uses `__libc_malloc/__libc_free` to avoid recursion
- Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)
**Box Pattern Compliance:**
- ✅ Single Responsibility: Address→SuperSlab mapping ONLY
- ✅ Clear Contract: O(1) amortized lookup
- ✅ Observable: Debug macros (SS_MAP_LOOKUP, SS_MAP_INSERT, SS_MAP_REMOVE)
- ✅ Composable: Can coexist with legacy registry
**Performance Contract:**
- Insert: O(1) amortized
- Lookup: O(1) amortized (tries 3 alignments, hash + chain traversal)
- Remove: O(1) amortized
### Phase 9-1-3: Debug Macros ✅
**Implemented:**
```c
// Environment-gated tracing: HAKMEM_SS_MAP_TRACE=1
#define SS_MAP_LOOKUP(map, ptr) // Logs: ptr=%p -> ss=%p
#define SS_MAP_INSERT(map, base, ss) // Logs: base=%p ss=%p
#define SS_MAP_REMOVE(map, base) // Logs: base=%p
```
**Statistics Functions (Debug builds):**
- `ss_map_print_stats()` - collision rate, load factor, longest chain
- `ss_map_collision_rate()` - for performance tuning
### Phase 9-1-4: TLS Hints ✅
**Files Created:**
- `core/box/ss_tls_hint_box.h` (238 lines)
- `core/box/ss_tls_hint_box.c` (22 lines)
**Design:**
```c
__thread struct SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES];
// Fast path: Check TLS hint (5-10 cycles)
// Slow path: Hash table lookup + update hint (15-25 cycles)
struct SuperSlab* ss_tls_hint_lookup(int class_idx, void* ptr);
```
**Performance Contract:**
- Hit case: 5-10 cycles (TLS load + range check)
- Miss case: 15-25 cycles (hash table + hint update)
- Expected hit rate: 80-95% (locality of reference)
- **Net improvement: 50-80 cycles → 10-15 cycles average**
**Statistics (Debug builds):**
```c
typedef struct {
uint64_t total_lookups;
uint64_t hint_hits; // TLS cache hits
uint64_t hint_misses; // Fallback to hash table
uint64_t hash_hits; // Hash table successes
uint64_t hash_misses; // NULL returns
} SSTLSHintStats;
// Environment-gated: HAKMEM_SS_TLS_HINT_TRACE=1
void ss_tls_hint_print_stats(void);
```
**API Functions:**
- `ss_tls_hint_init()` - Initialize TLS cache
- `ss_tls_hint_lookup(class_idx, ptr)` - Main lookup with caching
- `ss_tls_hint_update(class_idx, ss)` - Prefill hint (hot path)
- `ss_tls_hint_invalidate(class_idx, ss)` - Clear hint on SuperSlab free
## Pending Steps ⏸️
### Phase 9-1-2: O(1) Lookup (2-tier page table) ⏸️
**Status**: DEFERRED - Hash table is sufficient for Phase 1
**Rationale:**
- Current hash table already provides O(1) amortized
- 2-tier page table would be O(1) worst-case but more complex
- Benchmark first, optimize only if needed
**Potential Future Enhancement:**
```c
// 2-tier page table (if hash table shows high collision rate)
// Level 1: (ptr >> 30) = 4 entries (cover 4GB address space)
// Level 2: (ptr >> 19) & 0x7FF = 2048 entries per L1
// Total: 4 × 2048 = 8K pointers (64KB overhead)
// Lookup: Always 2 cache misses (predictable, no chains)
```
### Phase 9-1-5: Migration (既存コードからss_map_lookupへ移行) 🚧
**Status**: IN PROGRESS - Next task
**Plan:**
1. Initialize `ss_addr_map` at startup
- Call `ss_map_init(&g_ss_addr_map)` in `hak_init_impl()`
2. Register SuperSlabs on creation
- Modify `hak_super_register()` to also call `ss_map_insert()`
- Keep old registry for compatibility during migration
3. Unregister SuperSlabs on free
- Modify `hak_super_unregister()` to also call `ss_map_remove()`
4. Replace lookup calls
- Find all `hak_super_lookup()` calls
- Replace with `ss_tls_hint_lookup(class_idx, ptr)`
- Use `ss_map_lookup()` where class_idx is unknown
5. Test dual-mode operation
- Both old registry and new hash table active
- Compare results for correctness
- Gradual rollout: can fall back if issues found
### Phase 9-1-6: Benchmark (Phase 1効果確認) ⏸️
**Status**: PENDING - After migration
**Test Plan:**
```bash
# Phase 8 baseline (before optimization)
./bench_random_mixed_hakmem 10000000 256 # ~79.2 M ops/s
./bench_random_mixed_hakmem 10000000 8192 # ~16.5 M ops/s
# Phase 9-1 target (after optimization)
./bench_random_mixed_hakmem 10000000 256 # >85 M ops/s (+7%)
./bench_random_mixed_hakmem 10000000 8192 # >23 M ops/s (+39%)
# Debug mode (measure hit rates)
HAKMEM_SS_TLS_HINT_TRACE=1 ./bench_random_mixed_hakmem 10000 256
HAKMEM_SS_MAP_TRACE=1 ./bench_random_mixed_hakmem 10000 8192
```
**Success Criteria:**
- ✅ Minimum: WS8192 reaches 23 M ops/s (+39% from 16.5M)
- ✅ Stretch: WS8192 reaches 25 M ops/s (+52% from 16.5M)
- ✅ TLS hint hit rate: >80%
- ✅ Hash table collision rate: <20%
**Failure Plan:**
- If <20 M ops/s: Investigate with profiling
- Check TLS hint hit rate (should be >80%)
- Check hash table collision rate
- Consider Phase 9-1-2 (2-tier page table) if needed
- If 20-23 M ops/s: Acceptable, proceed to Phase 9-2
- If >23 M ops/s: Excellent, proceed to Phase 9-2
## File Summary
### New Files Created (4 files)
1. `core/box/ss_addr_map_box.h` - Hash table interface
2. `core/box/ss_addr_map_box.c` - Hash table implementation
3. `core/box/ss_tls_hint_box.h` - TLS cache interface
4. `core/box/ss_tls_hint_box.c` - TLS cache implementation
### Modified Files (1 file)
1. `Makefile` - Added new modules to build
- `OBJS_BASE`: Added `ss_addr_map_box.o`, `ss_tls_hint_box.o`
- `TINY_BENCH_OBJS_BASE`: Added same
- `SHARED_OBJS`: Added `_shared.o` variants
### Compilation Status ✅
-`ss_addr_map_box.o` - 17KB (compiled, no warnings except unused function)
-`ss_tls_hint_box.o` - 6.0KB (compiled, no warnings)
-`bench_random_mixed_hakmem` - Links successfully with both modules
## Architecture Overview
```
┌─────────────────────────────────────────────────────┐
│ Phase 9-1: SuperSlab Lookup Optimization │
└─────────────────────────────────────────────────────┘
Lookup Path (Before Phase 9-1):
ptr → hak_super_lookup() → Linear probe (32 iterations)
→ 50-80 cycles
Lookup Path (After Phase 9-1):
ptr → ss_tls_hint_lookup(class_idx, ptr)
├─ Fast path (80-95%): TLS hint hit
│ └─ ss_contains(hint, ptr) → 5-10 cycles ✅
└─ Slow path (5-20%): TLS hint miss
└─ ss_map_lookup(ptr) → Hash table
└─ 10-20 cycles (hash + chain traversal) ✅
Expected average: 0.85 × 7 + 0.15 × 15 = 8.2 cycles
```
## Performance Budget Analysis
### Phase 8 Baseline (WS8192):
```
Total: 212 cycles/op
- SuperSlab Lookup: 50-80 cycles ← BOTTLENECK
- Legacy Fallback: 30-50 cycles
- Fragmentation: 30-50 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
```
### Phase 9-1 Target (WS8192):
```
Total: 152 cycles/op (60 cycle improvement)
- SuperSlab Lookup: 8-12 cycles ← OPTIMIZED (hash + TLS)
- Legacy Fallback: 30-50 cycles
- Fragmentation: 30-50 cycles
- TLS Drain: 10-15 cycles
- Actual Work: 30-40 cycles
Throughput: 2.8 GHz / 152 = 18.4M ops/s (baseline)
+ variance → 23-25M ops/s (expected)
```
## Risk Assessment
### Low Risk ✅
- Hash table design is proven (similar to jemalloc/mimalloc)
- TLS hints are simple and well-contained
- Can run dual-mode (old + new) during migration
- Easy rollback if issues found
### Medium Risk ⚠️
- Collision rate: If >30%, performance may degrade
- Mitigation: Measured in stats, can increase bucket count
- TLS hit rate: If <70%, benefit reduced
- Mitigation: Measured in stats, can tune hint invalidation
### High Risk ❌
- None identified
## Next Steps
1. **Immediate**: Start Phase 9-1-5 migration
- Initialize ss_addr_map in hak_init_impl()
- Add ss_map_insert/remove to registration paths
- Find and replace hak_super_lookup() calls
2. **After Migration**: Run Phase 9-1-6 benchmarks
- Compare Phase 8 vs Phase 9-1 performance
- Measure TLS hit rate and collision rate
- Validate success criteria
3. **If Successful**: Proceed to Phase 9-2
- Remove old linear-probe registry (cleanup)
- Optimize hot paths further
- Consider additional TLS optimizations
4. **If Unsuccessful**: Root cause analysis
- Profile with perf/cachegrind
- Check TLS hit rate (expect >80%)
- Check collision rate (expect <20%)
- Consider Phase 9-1-2 (2-tier page table) if needed
---
**Prepared by**: Claude (Sonnet 4.5)
**Last Updated**: 2025-11-30 06:32 JST
**Status**: 4/6 steps complete, migration starting