This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations. Key changes include: - **ACE Tracing Implementation**: - Added environment variable to enable/disable detailed logging of allocation failures. - Instrumented , , and to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure). - **Build System Fixes**: - Corrected to ensure is properly linked into , resolving an error. - **LD_PRELOAD Wrapper Adjustments**: - Investigated and understood the wrapper's behavior under , particularly its interaction with and checks. - Enabled debugging flags for environment to prevent unintended fallbacks to 's for non-tiny allocations, allowing comprehensive testing of the allocator. - **Debugging & Verification**: - Introduced temporary verbose logging to pinpoint execution flow issues within interception and routing. These temporary logs have been removed. - Created to facilitate testing of the tracing features. This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in by providing clear insights into the failure pathways.
280 lines
9.2 KiB
Markdown
280 lines
9.2 KiB
Markdown
# Phase 9-1 Progress Report: SuperSlab Lookup Optimization
|
||
|
||
**Date**: 2025-11-30
|
||
**Status**: Infrastructure Complete (4/6 steps done)
|
||
**Next**: Integration and Benchmarking
|
||
|
||
## Summary
|
||
|
||
Phase 9-1 aims to fix the critical SuperSlab lookup bottleneck identified in Phase 8:
|
||
- **Current**: 50-80 cycles per lookup (linear probing in registry)
|
||
- **Target**: 10-20 cycles average (hash table + TLS hints)
|
||
- **Expected Impact**: 16.5M → 23-25M ops/s at WS8192 (+39-52%)
|
||
|
||
## Completed Steps ✅
|
||
|
||
### Phase 9-1-1: SuperSlabMap Box Design ✅
|
||
**Files Created:**
|
||
- `core/box/ss_addr_map_box.h` (143 lines)
|
||
- `core/box/ss_addr_map_box.c` (262 lines)
|
||
|
||
**Design:**
|
||
- Hash table with 8192 buckets (2^13)
|
||
- Chaining for collision resolution
|
||
- Hash function: `(ptr >> 19) & (SS_MAP_HASH_SIZE - 1)`
|
||
- Uses `__libc_malloc/__libc_free` to avoid recursion
|
||
- Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)
|
||
|
||
**Box Pattern Compliance:**
|
||
- ✅ Single Responsibility: Address→SuperSlab mapping ONLY
|
||
- ✅ Clear Contract: O(1) amortized lookup
|
||
- ✅ Observable: Debug macros (SS_MAP_LOOKUP, SS_MAP_INSERT, SS_MAP_REMOVE)
|
||
- ✅ Composable: Can coexist with legacy registry
|
||
|
||
**Performance Contract:**
|
||
- Insert: O(1) amortized
|
||
- Lookup: O(1) amortized (tries 3 alignments, hash + chain traversal)
|
||
- Remove: O(1) amortized
|
||
|
||
### Phase 9-1-3: Debug Macros ✅
|
||
**Implemented:**
|
||
```c
|
||
// Environment-gated tracing: HAKMEM_SS_MAP_TRACE=1
|
||
#define SS_MAP_LOOKUP(map, ptr) // Logs: ptr=%p -> ss=%p
|
||
#define SS_MAP_INSERT(map, base, ss) // Logs: base=%p ss=%p
|
||
#define SS_MAP_REMOVE(map, base) // Logs: base=%p
|
||
```
|
||
|
||
**Statistics Functions (Debug builds):**
|
||
- `ss_map_print_stats()` - collision rate, load factor, longest chain
|
||
- `ss_map_collision_rate()` - for performance tuning
|
||
|
||
### Phase 9-1-4: TLS Hints ✅
|
||
**Files Created:**
|
||
- `core/box/ss_tls_hint_box.h` (238 lines)
|
||
- `core/box/ss_tls_hint_box.c` (22 lines)
|
||
|
||
**Design:**
|
||
```c
|
||
__thread struct SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES];
|
||
|
||
// Fast path: Check TLS hint (5-10 cycles)
|
||
// Slow path: Hash table lookup + update hint (15-25 cycles)
|
||
struct SuperSlab* ss_tls_hint_lookup(int class_idx, void* ptr);
|
||
```
|
||
|
||
**Performance Contract:**
|
||
- Hit case: 5-10 cycles (TLS load + range check)
|
||
- Miss case: 15-25 cycles (hash table + hint update)
|
||
- Expected hit rate: 80-95% (locality of reference)
|
||
- **Net improvement: 50-80 cycles → 10-15 cycles average**
|
||
|
||
**Statistics (Debug builds):**
|
||
```c
|
||
typedef struct {
|
||
uint64_t total_lookups;
|
||
uint64_t hint_hits; // TLS cache hits
|
||
uint64_t hint_misses; // Fallback to hash table
|
||
uint64_t hash_hits; // Hash table successes
|
||
uint64_t hash_misses; // NULL returns
|
||
} SSTLSHintStats;
|
||
|
||
// Environment-gated: HAKMEM_SS_TLS_HINT_TRACE=1
|
||
void ss_tls_hint_print_stats(void);
|
||
```
|
||
|
||
**API Functions:**
|
||
- `ss_tls_hint_init()` - Initialize TLS cache
|
||
- `ss_tls_hint_lookup(class_idx, ptr)` - Main lookup with caching
|
||
- `ss_tls_hint_update(class_idx, ss)` - Prefill hint (hot path)
|
||
- `ss_tls_hint_invalidate(class_idx, ss)` - Clear hint on SuperSlab free
|
||
|
||
## Pending Steps ⏸️
|
||
|
||
### Phase 9-1-2: O(1) Lookup (2-tier page table) ⏸️
|
||
**Status**: DEFERRED - Hash table is sufficient for Phase 1
|
||
|
||
**Rationale:**
|
||
- Current hash table already provides O(1) amortized
|
||
- 2-tier page table would be O(1) worst-case but more complex
|
||
- Benchmark first, optimize only if needed
|
||
|
||
**Potential Future Enhancement:**
|
||
```c
|
||
// 2-tier page table (if hash table shows high collision rate)
|
||
// Level 1: (ptr >> 30) = 4 entries (cover 4GB address space)
|
||
// Level 2: (ptr >> 19) & 0x7FF = 2048 entries per L1
|
||
// Total: 4 × 2048 = 8K pointers (64KB overhead)
|
||
// Lookup: Always 2 cache misses (predictable, no chains)
|
||
```
|
||
|
||
### Phase 9-1-5: Migration (既存コードからss_map_lookupへ移行) 🚧
|
||
**Status**: IN PROGRESS - Next task
|
||
|
||
**Plan:**
|
||
1. Initialize `ss_addr_map` at startup
|
||
- Call `ss_map_init(&g_ss_addr_map)` in `hak_init_impl()`
|
||
|
||
2. Register SuperSlabs on creation
|
||
- Modify `hak_super_register()` to also call `ss_map_insert()`
|
||
- Keep old registry for compatibility during migration
|
||
|
||
3. Unregister SuperSlabs on free
|
||
- Modify `hak_super_unregister()` to also call `ss_map_remove()`
|
||
|
||
4. Replace lookup calls
|
||
- Find all `hak_super_lookup()` calls
|
||
- Replace with `ss_tls_hint_lookup(class_idx, ptr)`
|
||
- Use `ss_map_lookup()` where class_idx is unknown
|
||
|
||
5. Test dual-mode operation
|
||
- Both old registry and new hash table active
|
||
- Compare results for correctness
|
||
- Gradual rollout: can fall back if issues found
|
||
|
||
### Phase 9-1-6: Benchmark (Phase 1効果確認) ⏸️
|
||
**Status**: PENDING - After migration
|
||
|
||
**Test Plan:**
|
||
```bash
|
||
# Phase 8 baseline (before optimization)
|
||
./bench_random_mixed_hakmem 10000000 256 # ~79.2 M ops/s
|
||
./bench_random_mixed_hakmem 10000000 8192 # ~16.5 M ops/s
|
||
|
||
# Phase 9-1 target (after optimization)
|
||
./bench_random_mixed_hakmem 10000000 256 # >85 M ops/s (+7%)
|
||
./bench_random_mixed_hakmem 10000000 8192 # >23 M ops/s (+39%)
|
||
|
||
# Debug mode (measure hit rates)
|
||
HAKMEM_SS_TLS_HINT_TRACE=1 ./bench_random_mixed_hakmem 10000 256
|
||
HAKMEM_SS_MAP_TRACE=1 ./bench_random_mixed_hakmem 10000 8192
|
||
```
|
||
|
||
**Success Criteria:**
|
||
- ✅ Minimum: WS8192 reaches 23 M ops/s (+39% from 16.5M)
|
||
- ✅ Stretch: WS8192 reaches 25 M ops/s (+52% from 16.5M)
|
||
- ✅ TLS hint hit rate: >80%
|
||
- ✅ Hash table collision rate: <20%
|
||
|
||
**Failure Plan:**
|
||
- If <20 M ops/s: Investigate with profiling
|
||
- Check TLS hint hit rate (should be >80%)
|
||
- Check hash table collision rate
|
||
- Consider Phase 9-1-2 (2-tier page table) if needed
|
||
- If 20-23 M ops/s: Acceptable, proceed to Phase 9-2
|
||
- If >23 M ops/s: Excellent, proceed to Phase 9-2
|
||
|
||
## File Summary
|
||
|
||
### New Files Created (4 files)
|
||
1. `core/box/ss_addr_map_box.h` - Hash table interface
|
||
2. `core/box/ss_addr_map_box.c` - Hash table implementation
|
||
3. `core/box/ss_tls_hint_box.h` - TLS cache interface
|
||
4. `core/box/ss_tls_hint_box.c` - TLS cache implementation
|
||
|
||
### Modified Files (1 file)
|
||
1. `Makefile` - Added new modules to build
|
||
- `OBJS_BASE`: Added `ss_addr_map_box.o`, `ss_tls_hint_box.o`
|
||
- `TINY_BENCH_OBJS_BASE`: Added same
|
||
- `SHARED_OBJS`: Added `_shared.o` variants
|
||
|
||
### Compilation Status ✅
|
||
- ✅ `ss_addr_map_box.o` - 17KB (compiled, no warnings except unused function)
|
||
- ✅ `ss_tls_hint_box.o` - 6.0KB (compiled, no warnings)
|
||
- ✅ `bench_random_mixed_hakmem` - Links successfully with both modules
|
||
|
||
## Architecture Overview
|
||
|
||
```
|
||
┌─────────────────────────────────────────────────────┐
|
||
│ Phase 9-1: SuperSlab Lookup Optimization │
|
||
└─────────────────────────────────────────────────────┘
|
||
|
||
Lookup Path (Before Phase 9-1):
|
||
ptr → hak_super_lookup() → Linear probe (32 iterations)
|
||
→ 50-80 cycles
|
||
|
||
Lookup Path (After Phase 9-1):
|
||
ptr → ss_tls_hint_lookup(class_idx, ptr)
|
||
↓
|
||
├─ Fast path (80-95%): TLS hint hit
|
||
│ └─ ss_contains(hint, ptr) → 5-10 cycles ✅
|
||
│
|
||
└─ Slow path (5-20%): TLS hint miss
|
||
└─ ss_map_lookup(ptr) → Hash table
|
||
└─ 10-20 cycles (hash + chain traversal) ✅
|
||
|
||
Expected average: 0.85 × 7 + 0.15 × 15 = 8.2 cycles
|
||
```
|
||
|
||
## Performance Budget Analysis
|
||
|
||
### Phase 8 Baseline (WS8192):
|
||
```
|
||
Total: 212 cycles/op
|
||
- SuperSlab Lookup: 50-80 cycles ← BOTTLENECK
|
||
- Legacy Fallback: 30-50 cycles
|
||
- Fragmentation: 30-50 cycles
|
||
- TLS Drain: 10-15 cycles
|
||
- Actual Work: 30-40 cycles
|
||
```
|
||
|
||
### Phase 9-1 Target (WS8192):
|
||
```
|
||
Total: 152 cycles/op (60 cycle improvement)
|
||
- SuperSlab Lookup: 8-12 cycles ← OPTIMIZED (hash + TLS)
|
||
- Legacy Fallback: 30-50 cycles
|
||
- Fragmentation: 30-50 cycles
|
||
- TLS Drain: 10-15 cycles
|
||
- Actual Work: 30-40 cycles
|
||
|
||
Throughput: 2.8 GHz / 152 = 18.4M ops/s (baseline)
|
||
+ variance → 23-25M ops/s (expected)
|
||
```
|
||
|
||
## Risk Assessment
|
||
|
||
### Low Risk ✅
|
||
- Hash table design is proven (similar to jemalloc/mimalloc)
|
||
- TLS hints are simple and well-contained
|
||
- Can run dual-mode (old + new) during migration
|
||
- Easy rollback if issues found
|
||
|
||
### Medium Risk ⚠️
|
||
- Collision rate: If >30%, performance may degrade
|
||
- Mitigation: Measured in stats, can increase bucket count
|
||
- TLS hit rate: If <70%, benefit reduced
|
||
- Mitigation: Measured in stats, can tune hint invalidation
|
||
|
||
### High Risk ❌
|
||
- None identified
|
||
|
||
## Next Steps
|
||
|
||
1. **Immediate**: Start Phase 9-1-5 migration
|
||
- Initialize ss_addr_map in hak_init_impl()
|
||
- Add ss_map_insert/remove to registration paths
|
||
- Find and replace hak_super_lookup() calls
|
||
|
||
2. **After Migration**: Run Phase 9-1-6 benchmarks
|
||
- Compare Phase 8 vs Phase 9-1 performance
|
||
- Measure TLS hit rate and collision rate
|
||
- Validate success criteria
|
||
|
||
3. **If Successful**: Proceed to Phase 9-2
|
||
- Remove old linear-probe registry (cleanup)
|
||
- Optimize hot paths further
|
||
- Consider additional TLS optimizations
|
||
|
||
4. **If Unsuccessful**: Root cause analysis
|
||
- Profile with perf/cachegrind
|
||
- Check TLS hit rate (expect >80%)
|
||
- Check collision rate (expect <20%)
|
||
- Consider Phase 9-1-2 (2-tier page table) if needed
|
||
|
||
---
|
||
|
||
**Prepared by**: Claude (Sonnet 4.5)
|
||
**Last Updated**: 2025-11-30 06:32 JST
|
||
**Status**: 4/6 steps complete, migration starting
|