# Phase 9-1 Implementation Complete

**Date**: 2025-11-30 06:40 JST
**Status**: Infrastructure Complete, Benchmarking In Progress
**Completion**: 5/6 steps done

## Summary

Phase 9-1 implemented a hash table-based SuperSlab lookup system to replace the linear probing registry. The infrastructure is complete and integrated, but initial benchmarks show no throughput gain (WS256 unchanged at 79.2 M ops/s; WS8192 at 16.2 M ops/s against a 16.5 M ops/s baseline), and the cause still needs investigation.

## Completed Work ✅

### 1. SuperSlabMap Box (Phase 9-1-1) ✅
**Files Created:**
- `core/box/ss_addr_map_box.h` (149 lines)
- `core/box/ss_addr_map_box.c` (262 lines)

**Implementation:**
- Hash table with 8192 buckets
- Chaining collision resolution
- O(1) expected lookup (average case, while bucket chains stay short)
- Handles multiple SuperSlab alignments (512KB, 1MB, 2MB)
- Uses `__libc_malloc`/`__libc_free` to avoid recursing into the allocator

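The design listed above can be illustrated with a minimal sketch. It is not the contents of `ss_addr_map_box.c`: every name prefixed `sketch_`/`Sketch`, the hash function, and the per-entry `size` field are assumptions made for illustration, and plain `malloc`/`free` stand in for `__libc_malloc`/`__libc_free` so the example stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_BUCKETS 8192u /* bucket count stated in the report */

/* Hypothetical entry: one registered SuperSlab, keyed by its base address. */
typedef struct SketchEntry {
    uintptr_t base;           /* aligned start address */
    size_t size;              /* 512KB, 1MB, or 2MB */
    void *ss;                 /* opaque handle to the SuperSlab */
    struct SketchEntry *next; /* chaining: next entry in the same bucket */
} SketchEntry;

static SketchEntry *g_buckets[SKETCH_BUCKETS];

/* Alignments the lookup has to try, smallest first. */
static const uintptr_t k_aligns[] = {
    (uintptr_t)512 << 10, (uintptr_t)1 << 20, (uintptr_t)2 << 20
};

static size_t sketch_hash(uintptr_t base) {
    uintptr_t h = base >> 19; /* low 19 bits are zero for 512KB+ alignment */
    h ^= h >> 13;             /* fold higher bits into the bucket index */
    return (size_t)(h & (SKETCH_BUCKETS - 1u));
}

static int sketch_insert(uintptr_t base, size_t size, void *ss) {
    /* Plain malloc keeps the sketch portable (assumption, see lead-in). */
    SketchEntry *e = malloc(sizeof *e);
    if (e == NULL) return -1;
    size_t i = sketch_hash(base);
    e->base = base;
    e->size = size;
    e->ss = ss;
    e->next = g_buckets[i]; /* push onto the front of the bucket chain */
    g_buckets[i] = e;
    return 0;
}

static void *sketch_lookup(uintptr_t addr) {
    /* The owning slab's alignment is unknown, so try each candidate base. */
    for (size_t a = 0; a < sizeof k_aligns / sizeof k_aligns[0]; a++) {
        uintptr_t base = addr & ~(k_aligns[a] - 1u);
        for (SketchEntry *e = g_buckets[sketch_hash(base)]; e; e = e->next) {
            /* Range check rejects addresses past the end of a smaller slab. */
            if (e->base == base && addr - base < e->size) return e->ss;
        }
    }
    return NULL;
}

static int sketch_remove(uintptr_t base) {
    SketchEntry **pp = &g_buckets[sketch_hash(base)];
    for (; *pp != NULL; pp = &(*pp)->next) {
        if ((*pp)->base == base) {
            SketchEntry *dead = *pp;
            *pp = dead->next; /* unlink from the chain */
            free(dead);
            return 0;
        }
    }
    return -1; /* base was never registered */
}
```

In this sketch a miss costs at most three bucket probes, one per alignment, which is where the fixed upper bound on lookup cost comes from.
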
### 2. TLS Hints (Phase 9-1-4) ✅
**Files Created:**
- `core/box/ss_tls_hint_box.h` (238 lines)
- `core/box/ss_tls_hint_box.c` (22 lines)

**Implementation:**
- `__thread SuperSlab* g_tls_ss_hint[TINY_NUM_CLASSES]`
- Fast path: TLS cache check (5-10 cycles expected)
- Slow path: Hash table fallback + cache update
- Debug statistics tracking

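The fast-path/slow-path split described above might look roughly like the following. This is a sketch under stated assumptions, not the code in `ss_tls_hint_box.h`: `SketchSlab` is a simplified stand-in for `SuperSlab`, `SKETCH_NUM_CLASSES` is an arbitrary stand-in for `TINY_NUM_CLASSES`, and the slow path is a small linear scan standing in for the hash table lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NUM_CLASSES 8 /* stand-in for TINY_NUM_CLASSES (value assumed) */

/* Simplified stand-in for SuperSlab: just an address range. */
typedef struct {
    uintptr_t base;
    size_t size;
} SketchSlab;

/* One cached hint per size class, private to each thread. */
static __thread SketchSlab *t_hint[SKETCH_NUM_CLASSES];

/* Stand-in slow path: a tiny table scanned linearly. In the report this
 * role is played by the hash table lookup. */
static SketchSlab g_slabs[4];
static size_t g_nslabs;
static unsigned long g_slow_calls; /* counts how often the slow path runs */

static SketchSlab *sketch_slow_lookup(uintptr_t addr) {
    g_slow_calls++;
    for (size_t i = 0; i < g_nslabs; i++) {
        /* Unsigned subtraction wraps for addr < base, so one compare suffices. */
        if (addr - g_slabs[i].base < g_slabs[i].size) return &g_slabs[i];
    }
    return NULL;
}

static SketchSlab *sketch_lookup_hinted(int cls, uintptr_t addr) {
    SketchSlab *h = t_hint[cls];
    /* Fast path: the cached slab for this class still contains addr. */
    if (h != NULL && addr - h->base < h->size) return h;
    /* Slow path: full lookup, then refresh the hint on success. */
    h = sketch_slow_lookup(addr);
    if (h != NULL) t_hint[cls] = h;
    return h;
}
```

Once a thread has resolved one pointer in a slab, later pointers in the same slab and class are answered from the hint without touching shared state, which is why a benchmark dominated by such hits would barely exercise the hash table.
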
### 3. Debug Macros (Phase 9-1-3) ✅
**Implemented:**
- `SS_MAP_LOOKUP()` - Trace lookups
- `SS_MAP_INSERT()` - Trace registrations
- `SS_MAP_REMOVE()` - Trace unregistrations
- `ss_map_print_stats()` - Collision/load stats
- Environment-gated: `HAKMEM_SS_MAP_TRACE=1`

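One common way to build an environment-gated trace macro of this kind is sketched below. Only the variable name `HAKMEM_SS_MAP_TRACE` comes from this report; the `SKETCH_` names, the `SKETCH_RELEASE` build flag, and the output format are hypothetical and do not describe the actual `SS_MAP_*` macros.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Interpret the variable's value: only a leading '1' enables tracing. */
static int sketch_parse_flag(const char *value) {
    return value != NULL && value[0] == '1';
}

/* Read the environment once and cache the answer. The unsynchronized cache
 * is a simplification; an allocator would typically set it during init. */
static int sketch_trace_enabled(void) {
    static int cached = -1; /* -1 means "not read yet" */
    if (cached < 0) cached = sketch_parse_flag(getenv("HAKMEM_SS_MAP_TRACE"));
    return cached;
}

#if !defined(SKETCH_RELEASE) /* hypothetical release-build flag */
#define SKETCH_TRACE(op, ptr)                                          \
    do {                                                               \
        if (sketch_trace_enabled())                                    \
            fprintf(stderr, "[ss_map] %s %p\n", (op), (void *)(ptr));  \
    } while (0)
#else
#define SKETCH_TRACE(op, ptr) ((void)0) /* compiles to nothing in release */
#endif
```

With this shape, release builds pay nothing for tracing, and debug builds pay one cached integer test per call unless the variable is set.
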
### 4. Integration (Phase 9-1-5) ✅
**Modified Files:**
- `core/hakmem_tiny_lazy_init.inc.h` - Call `ss_map_init()`
- `core/hakmem_super_registry.c` - Hook `ss_map_insert()`/`ss_map_remove()`
- `core/hakmem_super_registry.h` - Replace `hak_super_lookup()` implementation
- `Makefile` - Add new modules to build

**Changes:**
1. `ss_map_init()` is called at SuperSlab subsystem initialization
2. `ss_map_insert()` is called when registering SuperSlabs
3. `ss_map_remove()` is called when unregistering SuperSlabs
4. `hak_super_lookup()` now uses `ss_map_lookup()` instead of linear probing

## Benchmark Results 🔍

### WS256 (Hot Cache)
```
Phase 8 Baseline: 79.2 M ops/s
Phase 9-1 Result: 79.2 M ops/s (no change)
```
**Status**: ✅ No regression in hot cache performance

### WS8192 (Realistic)
```
Phase 8 Baseline: 16.5 M ops/s
Phase 9-1 Result: 16.2 M ops/s (no improvement)
```
**Status**: ⚠️ No improvement observed (0.3 M ops/s, about 1.8%, below baseline)

## Investigation Needed 🔍

### Observation
The hash table optimization did not improve WS8192 performance, contrary to expectations. Possible reasons:

1. **SuperSlab Not Used in Benchmark**
   - Default bench settings may disable the SuperSlab path
   - Test with: `HAKMEM_TINY_USE_SUPERSLAB=1`
   - With it enabled, throughput drops to 15 M ops/s

2. **Different Bottleneck**
   - Phase 8 analysis identified SuperSlab lookup as a 50-80 cycle bottleneck
   - The actual bottleneck may be elsewhere (fragmentation, TLS drain, etc.)
   - Profiling is needed to confirm the actual hot path

3. **Hash Table Not Exercised**
   - Benchmark may be hitting the TLS fast path entirely
   - SuperSlab lookups may not happen in the hot path
   - Needs verification with profiling/tracing

### Next Steps for Investigation

1. **Profile Actual Bottleneck**
   ```bash
   perf record -g ./bench_random_mixed_hakmem 10000000 8192
   perf report
   ```

2. **Enable SuperSlab and Measure**
   ```bash
   HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000000 8192
   ```

3. **Check Lookup Statistics**
   - Build a debug version (without the RELEASE flag)
   - Run with `HAKMEM_SS_MAP_TRACE=1`
   - Count actual lookup calls

4. **Verify TLS vs SuperSlab Split**
   - Check what percentage of allocations hit TLS vs SuperSlab
   - Benchmark may be 100% TLS (fast path) with no SuperSlab lookups

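Steps 3 and 4 come down to counting where each lookup is resolved and reporting the ratio. A minimal sketch of such counters follows; the struct and function names are hypothetical and are not the statistics printed by `ss_map_print_stats()`. A multi-threaded build would need per-thread or atomic counters, which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counters, incremented at the two lookup exits. */
typedef struct {
    uint64_t tls_hits;    /* lookups answered by the TLS hint */
    uint64_t map_lookups; /* lookups that fell through to the hash table */
} SketchLookupStats;

/* Share of lookups served by the TLS fast path, in percent. */
static double sketch_tls_hit_pct(const SketchLookupStats *s) {
    uint64_t total = s->tls_hits + s->map_lookups;
    if (total == 0) return 0.0; /* nothing recorded: avoid dividing by zero */
    return 100.0 * (double)s->tls_hits / (double)total;
}

/* Print one summary line, for example at process exit. */
static void sketch_print_stats(const SketchLookupStats *s, FILE *out) {
    fprintf(out, "tls_hits=%llu map_lookups=%llu tls_hit_pct=%.1f\n",
            (unsigned long long)s->tls_hits,
            (unsigned long long)s->map_lookups,
            sketch_tls_hit_pct(s));
}
```

If a WS8192 run reported `map_lookups` near zero, that would support the hypothesis that the benchmark never reaches the hash table, independently of what `perf` shows.
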
## Code Quality ✅

All new code follows the Box pattern:
- ✅ Single Responsibility
- ✅ Clear Contracts
- ✅ Observable (debug macros)
- ✅ Composable (coexists with legacy)
- ✅ No compilation warnings
- ✅ No runtime crashes

## Files Modified/Created

### New Files (4)
1. `core/box/ss_addr_map_box.h`
2. `core/box/ss_addr_map_box.c`
3. `core/box/ss_tls_hint_box.h`
4. `core/box/ss_tls_hint_box.c`

### Modified Files (4)
1. `core/hakmem_tiny_lazy_init.inc.h` - Added init call
2. `core/hakmem_super_registry.c` - Added insert/remove hooks
3. `core/hakmem_super_registry.h` - Replaced lookup implementation
4. `Makefile` - Added new modules

### Documentation (2)
1. `PHASE9_1_PROGRESS.md` - Detailed progress tracking
2. `PHASE9_1_COMPLETE.md` - This file

## Lessons Learned

1. **Premature Optimization**
   - Phase 8 analysis named a bottleneck without profiling
   - It assumed SuperSlab lookup was the problem
   - Profiling should have come before implementing a solution

2. **Benchmark Configuration**
   - The default benchmark may not exercise the optimized path
   - Assumptions about which code paths execute need to be verified
   - Environment variables can dramatically change behavior

3. **Infrastructure Still Valuable**
   - Even if lookup is not the current bottleneck, O(1) lookup is the right design
   - Future workloads may benefit (more SuperSlabs, different patterns)
   - Clean Box-based architecture enables future optimization

## Recommendations

### Option 1: Profile and Re-Target
1. Run perf profiling on WS8192 benchmark
2. Identify actual bottleneck (may not be SuperSlab lookup)
3. Implement targeted fix for real bottleneck
4. Re-benchmark

**Timeline**: 1-2 days
**Risk**: Low
**Expected**: 20-30 M ops/s at WS8192

### Option 2: Enable SuperSlab and Optimize
1. Configure benchmark to force SuperSlab usage
2. Measure hash table effectiveness with SuperSlab enabled
3. Optimize SuperSlab fragmentation/capacity issues
4. Re-benchmark

**Timeline**: 2-3 days
**Risk**: Medium
**Expected**: 18-22 M ops/s at WS8192

### Option 3: Accept Baseline and Move Forward
1. Keep hash table infrastructure (no harm, better design)
2. Focus on other optimization opportunities
3. Return to this if profiling shows it's needed later

**Timeline**: 0 days (done)
**Risk**: Low
**Expected**: 16-17 M ops/s at WS8192 (status quo)

## Conclusion

Phase 9-1 delivered clean, well-architected infrastructure for O(1) SuperSlab lookups. The code compiles, runs without crashes, and follows all Box pattern principles.

However, **benchmark results show no improvement**, suggesting either:
1. The identified bottleneck was incorrect
2. The benchmark doesn't exercise the optimized path
3. A different bottleneck dominates performance

**Recommended Next Step**: Profile with `perf` to identify the actual bottleneck before further optimization work.

---

**Prepared by**: Claude (Sonnet 4.5)
**Timestamp**: 2025-11-30 06:40 JST
**Status**: Infrastructure complete, performance investigation needed