445 lines
19 KiB
Markdown
445 lines
19 KiB
Markdown
|
|
# Phase E2: Visual Performance Comparison
|
||
|
|
|
||
|
|
**Date**: 2025-11-12
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Performance Timeline
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase 7 Peak (Nov 8) Phase E1 (Nov 12) Phase E3 Target
|
||
|
|
↓ ↓ ↓
|
||
|
|
┌─────────┐ ┌─────────┐ ┌─────────┐
|
||
|
|
│ 59-70M │ ──────────────→ │ 9M │ ──────────→ │ 59-70M │
|
||
|
|
│ ops/s │ Regression │ ops/s │ Phase E3 │ ops/s │
|
||
|
|
└─────────┘ 85% └─────────┘ +541-674% └─────────┘
|
||
|
|
🏆 😱 🎯
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Free Path Cycle Comparison
|
||
|
|
|
||
|
|
### Phase 7-1.3 (FAST - 5-10 cycles)
|
||
|
|
|
||
|
|
```
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ hak_tiny_free_fast_v2(ptr) │
|
||
|
|
├─────────────────────────────────────────────────────────────┤
|
||
|
|
│ │
|
||
|
|
│ 1. NULL check [1 cycle] │
|
||
|
|
│ 2. Page boundary check [1-2 cycles] ← 99.9% skip │
|
||
|
|
│ 3. Read header (ptr-1) [2-3 cycles] ← L1 cache │
|
||
|
|
│ 4. Validate magic [included] │
|
||
|
|
│ 5. TLS freelist push [3-5 cycles] ← 4 instructions │
|
||
|
|
│ │
|
||
|
|
│ TOTAL: 5-10 cycles ✅ │
|
||
|
|
│ │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
```
|
||
|
|
|
||
|
|
### Current (SLOW - 55-110 cycles)
|
||
|
|
|
||
|
|
```
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ hak_tiny_free_fast_v2(ptr) │
|
||
|
|
├─────────────────────────────────────────────────────────────┤
|
||
|
|
│ │
|
||
|
|
│ 1. NULL check [1 cycle] │
|
||
|
|
│ ❌ 2. Registry lookup [50-100 cycles] ← O(log N) │
|
||
|
|
│ └─> hak_super_lookup() │
|
||
|
|
│ └─> RB-tree search │
|
||
|
|
│ └─> Multiple pointer dereferences │
|
||
|
|
│ └─> Cache misses likely │
|
||
|
|
│ 3. Page boundary check [1-2 cycles] │
|
||
|
|
│ 4. Read header (ptr-1) [2-3 cycles] │
|
||
|
|
│ 5. Validate magic [included] │
|
||
|
|
│ 6. TLS freelist push [3-5 cycles] │
|
||
|
|
│ │
|
||
|
|
│ TOTAL: 55-110 cycles ❌ (10x slower!) │
|
||
|
|
│ │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## The Problem Visualized
|
||
|
|
|
||
|
|
### Commit 5eabb89ad9 Added This:
|
||
|
|
|
||
|
|
```c
|
||
|
|
// Lines 54-62 in core/tiny_free_fast_v2.inc.h
|
||
|
|
|
||
|
|
static inline int hak_tiny_free_fast_v2(void* ptr) {
|
||
|
|
if (!ptr) return 0;
|
||
|
|
|
||
|
|
┌──────────────────────────────────────────────────────┐
|
||
|
|
│ // ❌ THE BOTTLENECK (50-100 cycles) │
|
||
|
|
│ extern struct SuperSlab* hak_super_lookup(void* ptr);│
|
||
|
|
│ struct SuperSlab* ss = hak_super_lookup(ptr); │
|
||
|
|
│ if (ss && ss->size_class == 7) { │
|
||
|
|
│ return 0; // C7 detected → slow path │
|
||
|
|
│ } │
|
||
|
|
└──────────────────────────────────────────────────────┘
|
||
|
|
↑
|
||
|
|
└── This is UNNECESSARY because Phase E1
|
||
|
|
already added headers to C7!
|
||
|
|
|
||
|
|
// ... rest of function (fast path) ...
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### Why It's Unnecessary:
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase E1 (Commit baaf815c9):
|
||
|
|
┌─────────────────────────────────────────────────────────────┐
|
||
|
|
│ ALL classes (C0-C7) now have 1-byte header │
|
||
|
|
├─────────────────────────────────────────────────────────────┤
|
||
|
|
│ │
|
||
|
|
│ C0 (16B): [0xA0] [user data: 15B] │
|
||
|
|
│ C1 (32B): [0xA1] [user data: 31B] │
|
||
|
|
│ C2 (64B): [0xA2] [user data: 63B] │
|
||
|
|
│ C3 (128B): [0xA3] [user data: 127B] │
|
||
|
|
│ C4 (256B): [0xA4] [user data: 255B] │
|
||
|
|
│ C5 (512B): [0xA5] [user data: 511B] │
|
||
|
|
│ C6 (768B): [0xA6] [user data: 767B] │
|
||
|
|
│ C7 (1024B): [0xA7] [user data: 1023B] ← HAS HEADER NOW! │
|
||
|
|
│ │
|
||
|
|
│ Header magic 0xA0 distinguishes from: │
|
||
|
|
│ - Pool TLS: 0xB0 │
|
||
|
|
│ - Mid/Large: no header (magic check fails) │
|
||
|
|
│ │
|
||
|
|
└─────────────────────────────────────────────────────────────┘
|
||
|
|
|
||
|
|
Therefore: Registry lookup is REDUNDANT!
|
||
|
|
Header validation (2-3 cycles) is SUFFICIENT!
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Performance Impact by Size
|
||
|
|
|
||
|
|
### 128B Allocations
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase 7: ████████████████████████████████████████████████████████ 59M ops/s
|
||
|
|
Current: ████████ 9.2M ops/s
|
||
|
|
Phase E3: ████████████████████████████████████████████████████████ 59M ops/s (target)
|
||
|
|
|
||
|
|
Regression: -85% | Recovery: +541%
|
||
|
|
```
|
||
|
|
|
||
|
|
### 256B Allocations
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase 7: ██████████████████████████████████████████████████████████████ 70M ops/s
|
||
|
|
Current: ████████ 9.4M ops/s
|
||
|
|
Phase E3: ██████████████████████████████████████████████████████████████ 70M ops/s (target)
|
||
|
|
|
||
|
|
Regression: -87% | Recovery: +645%
|
||
|
|
```
|
||
|
|
|
||
|
|
### 512B Allocations
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase 7: ███████████████████████████████████████████████████████████ 68M ops/s
|
||
|
|
Current: ███████ 8.4M ops/s
|
||
|
|
Phase E3: ███████████████████████████████████████████████████████████ 68M ops/s (target)
|
||
|
|
|
||
|
|
Regression: -88% | Recovery: +710%
|
||
|
|
```
|
||
|
|
|
||
|
|
### 1024B Allocations (C7)
|
||
|
|
|
||
|
|
```
|
||
|
|
Phase 7: █████████████████████████████████████████████████████████ 65M ops/s
|
||
|
|
Current: ███████ 8.4M ops/s
|
||
|
|
Phase E3: █████████████████████████████████████████████████████████ 65M ops/s (target)
|
||
|
|
|
||
|
|
Regression: -87% | Recovery: +674%
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Call Graph Comparison
|
||
|
|
|
||
|
|
### Phase 7 (Fast Path - 95-99% hit rate)
|
||
|
|
|
||
|
|
```
|
||
|
|
hak_free_at()
|
||
|
|
└─> hak_tiny_free_fast_v2() [5-10 cycles]
|
||
|
|
├─> Page boundary check [1-2 cycles, 99.9% skip]
|
||
|
|
├─> Header read (ptr-1) [2-3 cycles, L1 hit]
|
||
|
|
├─> Magic validation [included in read]
|
||
|
|
└─> TLS freelist push [3-5 cycles]
|
||
|
|
└─> *(void**)base = head
|
||
|
|
└─> head = base
|
||
|
|
└─> count++
|
||
|
|
```
|
||
|
|
|
||
|
|
### Current (Bottlenecked - 95-99% hit rate, but SLOW)
|
||
|
|
|
||
|
|
```
|
||
|
|
hak_free_at()
|
||
|
|
└─> hak_tiny_free_fast_v2() [55-110 cycles] ❌
|
||
|
|
├─> Registry lookup [50-100 cycles] ❌
|
||
|
|
│ └─> hak_super_lookup()
|
||
|
|
│ ├─> RB-tree search (O(log N))
|
||
|
|
│ ├─> Multiple dereferences
|
||
|
|
│ └─> Cache misses
|
||
|
|
├─> Page boundary check [1-2 cycles]
|
||
|
|
├─> Header read (ptr-1) [2-3 cycles]
|
||
|
|
├─> Magic validation [included]
|
||
|
|
└─> TLS freelist push [3-5 cycles]
|
||
|
|
```
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Cycle Budget Breakdown
|
||
|
|
|
||
|
|
### Phase 7-1.3 (Target)
|
||
|
|
|
||
|
|
```
|
||
|
|
Operation Cycles Frequency Weighted
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
NULL check 1 100% 1
|
||
|
|
Page boundary check 1-2 0.1% 0.002
|
||
|
|
Header read 2-3 100% 3
|
||
|
|
TLS freelist push 3-5 100% 4
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
TOTAL (Fast Path) 5-10 95-99% 8
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
Slow path fallback 500+ 1-5% 5-25
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
WEIGHTED AVERAGE ~13-33 cycles/free
|
||
|
|
```
|
||
|
|
|
||
|
|
**Throughput** (3.0 GHz CPU):
|
||
|
|
- Free latency: ~13-33 cycles = 4-11 ns
|
||
|
|
- Mixed (50% alloc/free): ~8-22 ns per op
|
||
|
|
- Throughput: ~45-125M ops/s per core
|
||
|
|
- Multi-core (4 cores, 50% efficiency): **45-60M ops/s** ✅
|
||
|
|
|
||
|
|
### Current (Bottlenecked)
|
||
|
|
|
||
|
|
```
|
||
|
|
Operation Cycles Frequency Weighted
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
NULL check 1 100% 1
|
||
|
|
Registry lookup ❌ 50-100 100% 75
|
||
|
|
Page boundary check 1-2 0.1% 0.002
|
||
|
|
Header read 2-3 100% 3
|
||
|
|
TLS freelist push 3-5 100% 4
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
TOTAL (Fast Path) 55-110 95-99% 83
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
Slow path fallback 500+ 1-5% 5-25
|
||
|
|
────────────────────────────────────────────────────────────
|
||
|
|
WEIGHTED AVERAGE ~88-108 cycles/free ❌
|
||
|
|
```
|
||
|
|
|
||
|
|
**Throughput** (3.0 GHz CPU):
|
||
|
|
- Free latency: ~88-108 cycles = 29-36 ns
|
||
|
|
- Mixed (50% alloc/free): ~58-72 ns per op
|
||
|
|
- Throughput: ~14-17M ops/s per core
|
||
|
|
- Multi-core (4 cores, 50% efficiency): **7-9M ops/s** ❌
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Memory Layout: Why Header Validation Is Sufficient
|
||
|
|
|
||
|
|
### Tiny Allocation (C0-C7)
|
||
|
|
|
||
|
|
```
|
||
|
|
Base ptr User ptr (returned)
|
||
|
|
↓ ↓
|
||
|
|
┌────────┬──────────────────────────────────────┐
|
||
|
|
│ Header │ User Data │
|
||
|
|
│ 0xAX │ (N-1 bytes) │
|
||
|
|
└────────┴──────────────────────────────────────┘
|
||
|
|
1 byte User allocation
|
||
|
|
|
||
|
|
Header format: 0xAX where X = class_idx (0-7)
|
||
|
|
- C0: 0xA0 (16B)
|
||
|
|
- C1: 0xA1 (32B)
|
||
|
|
- ...
|
||
|
|
- C7: 0xA7 (1024B) ← HAS HEADER SINCE PHASE E1!
|
||
|
|
```
|
||
|
|
|
||
|
|
### Pool TLS Allocation (8KB-52KB)
|
||
|
|
|
||
|
|
```
|
||
|
|
Base ptr User ptr (returned)
|
||
|
|
↓ ↓
|
||
|
|
┌────────┬──────────────────────────────────────┐
|
||
|
|
│ Header │ User Data │
|
||
|
|
│ 0xBX │ (N-1 bytes) │
|
||
|
|
└────────┴──────────────────────────────────────┘
|
||
|
|
1 byte User allocation
|
||
|
|
|
||
|
|
Header format: 0xBX where X = pool class (0-15)
|
||
|
|
```
|
||
|
|
|
||
|
|
### Mid/Large Allocation (64KB+)
|
||
|
|
|
||
|
|
```
|
||
|
|
Base ptr User ptr (returned)
|
||
|
|
↓ ↓
|
||
|
|
┌────────────────┬─────────────────────────────┐
|
||
|
|
│ AllocHeader │ User Data │
|
||
|
|
│ (16 bytes) │ (N bytes) │
|
||
|
|
│ magic = 0x... │ │
|
||
|
|
└────────────────┴─────────────────────────────┘
|
||
|
|
16 bytes User allocation
|
||
|
|
```
|
||
|
|
|
||
|
|
### External Allocation (libc malloc)
|
||
|
|
|
||
|
|
```
|
||
|
|
User ptr (returned)
|
||
|
|
↓
|
||
|
|
┌────────────────────────────────────┐
|
||
|
|
│ User Data │
|
||
|
|
│ (no header) │
|
||
|
|
└────────────────────────────────────┘
|
||
|
|
|
||
|
|
Header at ptr-1: Random data (NOT 0xA0)
|
||
|
|
```
|
||
|
|
|
||
|
|
### Classification Logic
|
||
|
|
|
||
|
|
```c
|
||
|
|
// Read header at ptr-1
|
||
|
|
uint8_t header = *(uint8_t*)(ptr - 1);
|
||
|
|
uint8_t magic = header & 0xF0;
|
||
|
|
|
||
|
|
if (magic == 0xA0) {
|
||
|
|
// Tiny allocation (C0-C7)
|
||
|
|
int class_idx = header & 0x0F;
|
||
|
|
return TINY_HEADER; // Fast path: 2-3 cycles ✅
|
||
|
|
}
|
||
|
|
|
||
|
|
if (magic == 0xB0) {
|
||
|
|
// Pool TLS allocation
|
||
|
|
return POOL_TLS; // Slow path: fallback
|
||
|
|
}
|
||
|
|
|
||
|
|
// No valid header
|
||
|
|
return UNKNOWN; // Slow path: check 16-byte AllocHeader
|
||
|
|
```
|
||
|
|
|
||
|
|
**Result**: Header magic alone is sufficient! No registry lookup needed!
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## The Fix: Before vs After
|
||
|
|
|
||
|
|
### Before (Lines 51-90 in tiny_free_fast_v2.inc.h)
|
||
|
|
|
||
|
|
```c
|
||
|
|
static inline int hak_tiny_free_fast_v2(void* ptr) {
|
||
|
|
if (__builtin_expect(!ptr, 0)) return 0;
|
||
|
|
|
||
|
|
// ╔══════════════════════════════════════════════════════╗
|
||
|
|
// ║ ❌ DELETE THIS BLOCK (50-100 cycles overhead) ║
|
||
|
|
// ╠══════════════════════════════════════════════════════╣
|
||
|
|
// ║ extern struct SuperSlab* hak_super_lookup(void*); ║
|
||
|
|
// ║ struct SuperSlab* ss = hak_super_lookup(ptr); ║
|
||
|
|
// ║ if (ss && ss->size_class == 7) { ║
|
||
|
|
// ║ return 0; ║
|
||
|
|
// ║ } ║
|
||
|
|
// ╚══════════════════════════════════════════════════════╝
|
||
|
|
|
||
|
|
void* header_addr = (char*)ptr - 1;
|
||
|
|
|
||
|
|
// Page boundary check (1-2 cycles)
|
||
|
|
if (((uintptr_t)ptr & 0xFFF) == 0) {
|
||
|
|
if (!hak_is_memory_readable(header_addr)) return 0;
|
||
|
|
}
|
||
|
|
|
||
|
|
// Read header (2-3 cycles) - includes magic validation
|
||
|
|
int class_idx = tiny_region_id_read_header(ptr);
|
||
|
|
if (class_idx < 0) return 0;
|
||
|
|
|
||
|
|
// TLS capacity check (1 cycle)
|
||
|
|
if (g_tls_sll_count[class_idx] >= cap) return 0;
|
||
|
|
|
||
|
|
// Push to TLS freelist (3-5 cycles)
|
||
|
|
void* base = (char*)ptr - 1;
|
||
|
|
tls_sll_push(class_idx, base, UINT32_MAX);
|
||
|
|
|
||
|
|
return 1; // TOTAL: 55-110 cycles ❌
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
### After (Phase E3-1 - Simple deletion!)
|
||
|
|
|
||
|
|
```c
|
||
|
|
static inline int hak_tiny_free_fast_v2(void* ptr) {
|
||
|
|
if (__builtin_expect(!ptr, 0)) return 0;
|
||
|
|
|
||
|
|
// Phase E3: C7 now has header (Phase E1), registry lookup removed!
|
||
|
|
// Header magic validation (2-3 cycles) distinguishes:
|
||
|
|
// - Tiny (0xA0-0xA7): valid header → fast path
|
||
|
|
// - Pool TLS (0xB0): different magic → slow path
|
||
|
|
// - Mid/Large: no header → slow path
|
||
|
|
|
||
|
|
void* header_addr = (char*)ptr - 1;
|
||
|
|
|
||
|
|
// Page boundary check (1-2 cycles)
|
||
|
|
if (((uintptr_t)ptr & 0xFFF) == 0) {
|
||
|
|
if (!hak_is_memory_readable(header_addr)) return 0;
|
||
|
|
}
|
||
|
|
|
||
|
|
// Read header (2-3 cycles) - includes magic validation
|
||
|
|
int class_idx = tiny_region_id_read_header(ptr);
|
||
|
|
if (class_idx < 0) return 0;
|
||
|
|
|
||
|
|
// TLS capacity check (1 cycle)
|
||
|
|
if (g_tls_sll_count[class_idx] >= cap) return 0;
|
||
|
|
|
||
|
|
// Push to TLS freelist (3-5 cycles)
|
||
|
|
void* base = (char*)ptr - 1;
|
||
|
|
tls_sll_push(class_idx, base, UINT32_MAX);
|
||
|
|
|
||
|
|
return 1; // TOTAL: 5-10 cycles ✅
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
**Diff**:
|
||
|
|
- **Lines deleted**: 9 (registry lookup block)
|
||
|
|
- **Lines added**: 5 (explanatory comments)
|
||
|
|
- **Net change**: -4 lines
|
||
|
|
- **Cycle savings**: -50 to -100 cycles per free
|
||
|
|
- **Throughput improvement**: +541-674%
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
## Summary: Why This Fix Works
|
||
|
|
|
||
|
|
### Phase E1 Guarantees
|
||
|
|
|
||
|
|
✅ **ALL classes have headers** (C0-C7 including C7)
|
||
|
|
✅ **Header magic distinguishes allocators** (0xA0 vs 0xB0 vs none)
|
||
|
|
✅ **No C7 special cases needed** (unified code path)
|
||
|
|
|
||
|
|
### Current Code Problems
|
||
|
|
|
||
|
|
❌ **Registry lookup redundant** (50-100 cycles for nothing)
|
||
|
|
❌ **Header validation sufficient** (already done in 2-3 cycles)
|
||
|
|
❌ **No performance benefit** (safety already guaranteed by headers)
|
||
|
|
|
||
|
|
### Phase E3-1 Solution
|
||
|
|
|
||
|
|
✅ **Remove registry lookup** (revert to Phase 7-1.3)
|
||
|
|
✅ **Keep header validation** (2-3 cycles, sufficient)
|
||
|
|
✅ **Restore performance** (5-10 cycles per free)
|
||
|
|
✅ **Maintain safety** (Phase E1 headers guarantee correctness)
|
||
|
|
|
||
|
|
---
|
||
|
|
|
||
|
|
**Ready to implement Phase E3!** 🚀
|
||
|
|
|
||
|
|
The fix is trivial (delete 9 lines), low-risk (revert to proven code), and high-impact (+541-674% throughput).
|