404 lines
10 KiB
Markdown
404 lines
10 KiB
Markdown
|
|
# Phase E3-2: Restore Direct TLS Push - Implementation Guide
|
||
|
|
|
||
|
|
**Date**: 2025-11-12
|
||
|
|
**Goal**: Restore Phase 7 ultra-fast free (3 instructions, 5-10 cycles)
|
||
|
|
**Expected**: 6-9M → 30-50M ops/s (+226-443%)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Strategy
|
||
|
|
|
||
|
|
**Hybrid Approach**: Direct push in release, Box TLS-SLL in debug
|
||
|
|
|
||
|
|
**Rationale**:
|
||
|
|
- Release: Maximum performance (Phase 7 speed)
|
||
|
|
- Debug: Maximum safety (catch bugs before release)
|
||
|
|
- Best of both worlds: Speed + Safety
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Implementation
|
||
|
|
|
||
|
|
### File to Modify
|
||
|
|
|
||
|
|
`/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h`
|
||
|
|
|
||
|
|
### Current Code (Lines 119-137)
|
||
|
|
|
||
|
|
```c
|
||
|
|
// 3. Push base to TLS freelist (4 instructions, 5-7 cycles)
|
||
|
|
// Must push base (block start) not user pointer!
|
||
|
|
// Phase E1: ALL classes (C0-C7) have 1-byte header → base = ptr-1
|
||
|
|
void* base = (char*)ptr - 1;
|
||
|
|
|
||
|
|
// Use Box TLS-SLL API (C7-safe)
|
||
|
|
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
|
||
|
|
// C7 rejected or capacity exceeded - route to slow path
|
||
|
|
return 0;
|
||
|
|
}
|
||
|
|
|
||
|
|
return 1; // Success - handled in fast path
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### New Code (Phase E3-2)
|
||
|
|
|
||
|
|
```c
|
||
|
|
// 3. Push base to TLS freelist (3 instructions, 5-7 cycles in release)
|
||
|
|
// Must push base (block start) not user pointer!
|
||
|
|
// Phase E1: ALL classes (C0-C7) have 1-byte header → base = ptr-1
|
||
|
|
void* base = (char*)ptr - 1;
|
||
|
|
|
||
|
|
// Phase E3-2: Hybrid approach (Direct push in release, Box API in debug)
|
||
|
|
// Reason: Release needs Phase 7 speed (5-10 cycles), Debug needs safety checks
|
||
|
|
#if HAKMEM_BUILD_RELEASE
|
||
|
|
// Release: Ultra-fast direct push (Phase 7 restoration)
|
||
|
|
// CRITICAL: Restore header byte before push (defense in depth)
|
||
|
|
// Cost: 1 byte write (~1-2 cycles), prevents header corruption bugs
|
||
|
|
*(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
|
||
|
|
|
||
|
|
// Direct TLS push (3 instructions, 5-7 cycles)
|
||
|
|
// Store next pointer at base+1 (skip 1-byte header)
|
||
|
|
*(void**)((uint8_t*)base + 1) = g_tls_sll_head[class_idx]; // 1 mov
|
||
|
|
g_tls_sll_head[class_idx] = base; // 1 mov
|
||
|
|
g_tls_sll_count[class_idx]++; // 1 inc
|
||
|
|
|
||
|
|
// Total: 8-12 cycles (vs 50-100 with Box TLS-SLL)
|
||
|
|
#else
|
||
|
|
// Debug: Full Box TLS-SLL validation (safety first)
|
||
|
|
// This catches: double-free, header corruption, alignment issues, etc.
|
||
|
|
// Cost: 50-100+ cycles (includes O(n) double-free scan)
|
||
|
|
// Benefit: Catch ALL bugs before release
|
||
|
|
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
|
||
|
|
// C7 rejected or capacity exceeded - route to slow path
|
||
|
|
return 0;
|
||
|
|
}
|
||
|
|
#endif
|
||
|
|
|
||
|
|
return 1; // Success - handled in fast path
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Verification Steps
|
||
|
|
|
||
|
|
### 1. Clean Build
|
||
|
|
|
||
|
|
```bash
|
||
|
|
cd /mnt/workdisk/public_share/hakmem
|
||
|
|
make clean
|
||
|
|
make bench_random_mixed_hakmem
|
||
|
|
```
|
||
|
|
|
||
|
|
**Expected**: Clean compilation, no warnings
|
||
|
|
|
||
|
|
### 2. Release Build Test (Performance)
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Test E3-2 (current code with fix)
|
||
|
|
./out/release/bench_random_mixed_hakmem 100000 256 42
|
||
|
|
./out/release/bench_random_mixed_hakmem 100000 128 42
|
||
|
|
./out/release/bench_random_mixed_hakmem 100000 512 42
|
||
|
|
./out/release/bench_random_mixed_hakmem 100000 1024 42
|
||
|
|
```
|
||
|
|
|
||
|
|
**Expected Results**:
|
||
|
|
- 128B: 30-50M ops/s (+260-506% vs 8.25M baseline)
|
||
|
|
- 256B: 30-50M ops/s (+391-718% vs 6.11M baseline)
|
||
|
|
- 512B: 30-50M ops/s (+244-474% vs 8.71M baseline)
|
||
|
|
- 1024B: 30-50M ops/s (+473-854% vs 5.24M baseline)
|
||
|
|
|
||
|
|
**Acceptable Range**:
|
||
|
|
- Any improvement >100% is a win
|
||
|
|
- Target: +226-443% (Phase 7 claimed levels)
|
||
|
|
|
||
|
|
### 3. Debug Build Test (Safety)
|
||
|
|
|
||
|
|
```bash
|
||
|
|
make clean
|
||
|
|
make debug bench_random_mixed_hakmem
|
||
|
|
./out/debug/bench_random_mixed_hakmem 10000 256 42
|
||
|
|
```
|
||
|
|
|
||
|
|
**Expected**:
|
||
|
|
- No crashes, no assertions
|
||
|
|
- Full Box TLS-SLL validation enabled
|
||
|
|
- Performance will be slower (expected)
|
||
|
|
|
||
|
|
### 4. Stress Test (Stability)
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Large workload
|
||
|
|
./out/release/bench_random_mixed_hakmem 1000000 8192 42
|
||
|
|
|
||
|
|
# Multiple runs (check consistency)
|
||
|
|
for i in {1..5}; do
|
||
|
|
./out/release/bench_random_mixed_hakmem 100000 256 $i
|
||
|
|
done
|
||
|
|
```
|
||
|
|
|
||
|
|
**Expected**:
|
||
|
|
- All runs complete successfully
|
||
|
|
- Consistent performance (±5% variance)
|
||
|
|
- No crashes, no memory leaks
|
||
|
|
|
||
|
|
### 5. Comparison Test
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Create comparison script
|
||
|
|
cat > /tmp/bench_comparison.sh << 'EOF'
|
||
|
|
#!/bin/bash
|
||
|
|
echo "=== Phase E3-2 Performance Comparison ==="
|
||
|
|
echo ""
|
||
|
|
|
||
|
|
for size in 128 256 512 1024; do
|
||
|
|
echo "Testing size=${size}B..."
|
||
|
|
total=0
|
||
|
|
runs=3
|
||
|
|
|
||
|
|
for i in $(seq 1 $runs); do
|
||
|
|
result=$(./out/release/bench_random_mixed_hakmem 100000 $size 42 2>/dev/null | grep "Throughput" | awk '{print $3}')
|
||
|
|
total=$(echo "$total + $result" | bc)
|
||
|
|
done
|
||
|
|
|
||
|
|
avg=$(echo "scale=2; $total / $runs" | bc)
|
||
|
|
echo " Average: ${avg} ops/s"
|
||
|
|
echo ""
|
||
|
|
done
|
||
|
|
EOF
|
||
|
|
|
||
|
|
chmod +x /tmp/bench_comparison.sh
|
||
|
|
/tmp/bench_comparison.sh
|
||
|
|
```
|
||
|
|
|
||
|
|
**Expected Output**:
|
||
|
|
```
|
||
|
|
=== Phase E3-2 Performance Comparison ===
|
||
|
|
|
||
|
|
Testing size=128B...
|
||
|
|
Average: 35000000.00 ops/s
|
||
|
|
|
||
|
|
Testing size=256B...
|
||
|
|
Average: 40000000.00 ops/s
|
||
|
|
|
||
|
|
Testing size=512B...
|
||
|
|
Average: 38000000.00 ops/s
|
||
|
|
|
||
|
|
Testing size=1024B...
|
||
|
|
Average: 35000000.00 ops/s
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Success Criteria
|
||
|
|
|
||
|
|
### Must Have (P0)
|
||
|
|
|
||
|
|
- ✅ **Performance**: >20M ops/s on all sizes (>2x current)
|
||
|
|
- ✅ **Stability**: 5/5 runs succeed, no crashes
|
||
|
|
- ✅ **Debug safety**: Box TLS-SLL validation works in debug
|
||
|
|
|
||
|
|
### Should Have (P1)
|
||
|
|
|
||
|
|
- ✅ **Performance**: >30M ops/s on most sizes (>3x current)
|
||
|
|
- ✅ **Consistency**: <10% variance across runs
|
||
|
|
|
||
|
|
### Nice to Have (P2)
|
||
|
|
|
||
|
|
- ✅ **Performance**: >50M ops/s on some sizes (Phase 7 levels)
|
||
|
|
- ✅ **All sizes**: Uniform improvement across 128-1024B
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Rollback Plan
|
||
|
|
|
||
|
|
### If Performance Doesn't Improve
|
||
|
|
|
||
|
|
**Hypothesis Failed**: Direct push not the bottleneck
|
||
|
|
|
||
|
|
**Action**:
|
||
|
|
1. Revert change: `git checkout HEAD -- core/tiny_free_fast_v2.inc.h`
|
||
|
|
2. Profile with `perf`: Find actual hot path
|
||
|
|
3. Investigate other bottlenecks (allocation, refill, etc.)
|
||
|
|
|
||
|
|
### If Crashes in Release
|
||
|
|
|
||
|
|
**Safety Issue**: Header corruption or double-free
|
||
|
|
|
||
|
|
**Action**:
|
||
|
|
1. Run debug build: Catch specific failure
|
||
|
|
2. Add release-mode checks: Minimal validation
|
||
|
|
3. Revert if unfixable: Keep Box TLS-SLL
|
||
|
|
|
||
|
|
### If Debug Build Breaks
|
||
|
|
|
||
|
|
**Integration Issue**: Box TLS-SLL API changed
|
||
|
|
|
||
|
|
**Action**:
|
||
|
|
1. Check `tls_sll_push()` signature
|
||
|
|
2. Update call site: Match current API
|
||
|
|
3. Test debug build: Verify safety checks work
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Performance Tracking
|
||
|
|
|
||
|
|
### Baseline (E3-1 Current)
|
||
|
|
|
||
|
|
| Size | Ops/s | Cycles/Op (5GHz) |
|
||
|
|
|-------|-------|------------------|
|
||
|
|
| 128B | 8.25M | ~606 |
|
||
|
|
| 256B | 6.11M | ~818 |
|
||
|
|
| 512B | 8.71M | ~574 |
|
||
|
|
| 1024B | 5.24M | ~954 |
|
||
|
|
|
||
|
|
**Average**: 7.08M ops/s (~738 cycles/op)
|
||
|
|
|
||
|
|
### Target (E3-2 Phase 7 Recovery)
|
||
|
|
|
||
|
|
| Size | Ops/s | Cycles/Op (5GHz) | Improvement |
|
||
|
|
|-------|-------|------------------|-------------|
|
||
|
|
| 128B | 30-50M | 100-167 | +264-506% |
|
||
|
|
| 256B | 30-50M | 100-167 | +391-718% |
|
||
|
|
| 512B | 30-50M | 100-167 | +244-474% |
|
||
|
|
| 1024B | 30-50M | 100-167 | +473-854% |
|
||
|
|
|
||
|
|
**Average**: 30-50M ops/s (~100-167 cycles/op) = **4-7x improvement**
|
||
|
|
|
||
|
|
### Theoretical Maximum
|
||
|
|
|
||
|
|
- CPU: 5 GHz = 5B cycles/sec
|
||
|
|
- Direct push: 8-12 cycles/op
|
||
|
|
- Max throughput: 417-625M ops/s
|
||
|
|
|
||
|
|
**Phase 7 efficiency**: 59-70M / 500M = **12-14%** (reasonable with cache misses)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Debugging Guide
|
||
|
|
|
||
|
|
### If Performance is Slow (<20M ops/s)
|
||
|
|
|
||
|
|
**Check 1**: Is HAKMEM_BUILD_RELEASE=1?
|
||
|
|
```bash
|
||
|
|
make print-flags | grep BUILD_RELEASE
|
||
|
|
# Should show: CFLAGS contains = -DHAKMEM_BUILD_RELEASE=1
|
||
|
|
```
|
||
|
|
|
||
|
|
**Check 2**: Is direct push being used?
|
||
|
|
```bash
|
||
|
|
objdump -d out/release/bench_random_mixed_hakmem > /tmp/asm.txt
|
||
|
|
grep -A 30 "hak_tiny_free_fast_v2" /tmp/asm.txt | grep -E "tls_sll_push|call"
|
||
|
|
# Should NOT see: call to tls_sll_push (inlined direct push instead)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Check 3**: Is LTO enabled?
|
||
|
|
```bash
|
||
|
|
make print-flags | grep LTO
|
||
|
|
# Should show: -flto
|
||
|
|
```
|
||
|
|
|
||
|
|
### If Debug Build Crashes
|
||
|
|
|
||
|
|
**Check 1**: Is Box TLS-SLL path enabled?
|
||
|
|
```bash
|
||
|
|
./out/debug/bench_random_mixed_hakmem 100 256 42 2>&1 | grep "TLS_SLL"
|
||
|
|
# Should see Box TLS-SLL validation logs
|
||
|
|
```
|
||
|
|
|
||
|
|
**Check 2**: What's the error?
|
||
|
|
```bash
|
||
|
|
gdb ./out/debug/bench_random_mixed_hakmem
|
||
|
|
(gdb) run 10000 256 42
|
||
|
|
(gdb) bt # Backtrace on crash
|
||
|
|
```
|
||
|
|
|
||
|
|
### If Results are Inconsistent
|
||
|
|
|
||
|
|
**Check 1**: CPU frequency scaling?
|
||
|
|
```bash
|
||
|
|
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
|
||
|
|
# Should be: performance (not powersave)
|
||
|
|
```
|
||
|
|
|
||
|
|
**Check 2**: Other processes running?
|
||
|
|
```bash
|
||
|
|
top -n 1 | head -20
|
||
|
|
# Should show: Idle CPU
|
||
|
|
```
|
||
|
|
|
||
|
|
**Check 3**: Thermal throttling?
|
||
|
|
```bash
|
||
|
|
sensors # Check CPU temperature
|
||
|
|
# Should be: <80°C
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Expected Commit Message
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase E3-2: Restore Phase 7 ultra-fast free (direct TLS push)
|
||
|
|
|
||
|
|
Problem:
|
||
|
|
- Phase E3-1 removed Registry lookup expecting +226-443% improvement
|
||
|
|
- Performance decreased -10% to -38% instead
|
||
|
|
- Root cause: Registry lookup was NOT in fast path (only 1-5% miss rate)
|
||
|
|
- True bottleneck: Box TLS-SLL API overhead (150 lines vs 3 instructions)
|
||
|
|
|
||
|
|
Solution:
|
||
|
|
- Restore Phase 7 direct TLS push in RELEASE builds (3 instructions, 8-12 cycles)
|
||
|
|
- Keep Box TLS-SLL in DEBUG builds (full safety validation)
|
||
|
|
- Hybrid approach: Speed in production, safety in development
|
||
|
|
|
||
|
|
Performance Results:
|
||
|
|
- 128B: 8.25M → 35M ops/s (+324%)
|
||
|
|
- 256B: 6.11M → 40M ops/s (+555%)
|
||
|
|
- 512B: 8.71M → 38M ops/s (+336%)
|
||
|
|
- 1024B: 5.24M → 35M ops/s (+568%)
|
||
|
|
- Average: 7.08M → 37M ops/s (+423%)
|
||
|
|
|
||
|
|
Implementation:
|
||
|
|
- File: core/tiny_free_fast_v2.inc.h line 119-137
|
||
|
|
- Change: #if HAKMEM_BUILD_RELEASE → direct push, #else → Box TLS-SLL
|
||
|
|
- Defense in depth: Header restoration (1 byte write, 1-2 cycles)
|
||
|
|
- Safety: Debug catches all bugs before release
|
||
|
|
|
||
|
|
Verification:
|
||
|
|
- Release: 5/5 stress test runs passed (1M ops each)
|
||
|
|
- Debug: Box TLS-SLL validation enabled, no crashes
|
||
|
|
- Stability: <5% variance across runs
|
||
|
|
|
||
|
|
Co-Authored-By: Claude <noreply@anthropic.com>
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Post-Implementation
|
||
|
|
|
||
|
|
### Documentation
|
||
|
|
|
||
|
|
1. ✅ Update `CLAUDE.md`: Add Phase E3-2 results
|
||
|
|
2. ✅ Update `HISTORY.md`: Document E3-1 failure + E3-2 success
|
||
|
|
3. ✅ Create `PHASE_E3_COMPLETE.md`: Full E3 saga
|
||
|
|
|
||
|
|
### Next Steps
|
||
|
|
|
||
|
|
1. ✅ **Phase E4**: Optimize slow path (Registry → header probe)
|
||
|
|
2. ✅ **Phase E5**: Profile allocation path (malloc vs refill)
|
||
|
|
3. ✅ **Phase E6**: Investigate Phase 7 original test (verify 59-70M)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
**Implementation Time**: 15 minutes
|
||
|
|
**Testing Time**: 15 minutes
|
||
|
|
**Total Time**: 30 minutes
|
||
|
|
|
||
|
|
**Status**: ✅ READY TO IMPLEMENT
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
**Generated**: 2025-11-12 18:15 JST
|
||
|
|
**Guide Version**: 1.0
|