# Phase E2: Visual Performance Comparison **Date**: 2025-11-12 --- ## Performance Timeline ``` Phase 7 Peak (Nov 8) Phase E1 (Nov 12) Phase E3 Target ↓ ↓ ↓ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ 59-70M │ ──────────────→ │ 9M │ ──────────→ │ 59-70M │ │ ops/s │ Regression │ ops/s │ Phase E3 │ ops/s │ └─────────┘ 85% └─────────┘ +541-674% └─────────┘ 🏆 😱 🎯 ``` --- ## Free Path Cycle Comparison ### Phase 7-1.3 (FAST - 5-10 cycles) ``` ┌─────────────────────────────────────────────────────────────┐ │ hak_tiny_free_fast_v2(ptr) │ ├─────────────────────────────────────────────────────────────┤ │ │ │ 1. NULL check [1 cycle] │ │ 2. Page boundary check [1-2 cycles] ← 99.9% skip │ │ 3. Read header (ptr-1) [2-3 cycles] ← L1 cache │ │ 4. Validate magic [included] │ │ 5. TLS freelist push [3-5 cycles] ← 4 instructions │ │ │ │ TOTAL: 5-10 cycles ✅ │ │ │ └─────────────────────────────────────────────────────────────┘ ``` ### Current (SLOW - 55-110 cycles) ``` ┌─────────────────────────────────────────────────────────────┐ │ hak_tiny_free_fast_v2(ptr) │ ├─────────────────────────────────────────────────────────────┤ │ │ │ 1. NULL check [1 cycle] │ │ ❌ 2. Registry lookup [50-100 cycles] ← O(log N) │ │ └─> hak_super_lookup() │ │ └─> RB-tree search │ │ └─> Multiple pointer dereferences │ │ └─> Cache misses likely │ │ 3. Page boundary check [1-2 cycles] │ │ 4. Read header (ptr-1) [2-3 cycles] │ │ 5. Validate magic [included] │ │ 6. TLS freelist push [3-5 cycles] │ │ │ │ TOTAL: 55-110 cycles ❌ (10x slower!) │ │ │ └─────────────────────────────────────────────────────────────┘ ``` --- ## The Problem Visualized ### Commit 5eabb89ad9 Added This: ```c // Lines 54-62 in core/tiny_free_fast_v2.inc.h static inline int hak_tiny_free_fast_v2(void* ptr) { if (!ptr) return 0; ┌──────────────────────────────────────────────────────┐ │ // ❌ THE BOTTLENECK (50-100 cycles) │ │ extern struct SuperSlab* hak_super_lookup(void* ptr);│ │ struct SuperSlab* ss = hak_super_lookup(ptr); │ │ if (ss && ss->size_class == 7) { │ │ return 0; // C7 detected → slow path │ │ } │ └──────────────────────────────────────────────────────┘ ↑ └── This is UNNECESSARY because Phase E1 already added headers to C7! // ... rest of function (fast path) ... } ``` ### Why It's Unnecessary: ``` Phase E1 (Commit baaf815c9): ┌─────────────────────────────────────────────────────────────┐ │ ALL classes (C0-C7) now have 1-byte header │ ├─────────────────────────────────────────────────────────────┤ │ │ │ C0 (16B): [0xA0] [user data: 15B] │ │ C1 (32B): [0xA1] [user data: 31B] │ │ C2 (64B): [0xA2] [user data: 63B] │ │ C3 (128B): [0xA3] [user data: 127B] │ │ C4 (256B): [0xA4] [user data: 255B] │ │ C5 (512B): [0xA5] [user data: 511B] │ │ C6 (768B): [0xA6] [user data: 767B] │ │ C7 (1024B): [0xA7] [user data: 1023B] ← HAS HEADER NOW! │ │ │ │ Header magic 0xA0 distinguishes from: │ │ - Pool TLS: 0xB0 │ │ - Mid/Large: no header (magic check fails) │ │ │ └─────────────────────────────────────────────────────────────┘ Therefore: Registry lookup is REDUNDANT! Header validation (2-3 cycles) is SUFFICIENT! ``` --- ## Performance Impact by Size ### 128B Allocations ``` Phase 7: ████████████████████████████████████████████████████████ 59M ops/s Current: ████████ 9.2M ops/s Phase E3: ████████████████████████████████████████████████████████ 59M ops/s (target) Regression: -85% | Recovery: +541% ``` ### 256B Allocations ``` Phase 7: ██████████████████████████████████████████████████████████████ 70M ops/s Current: ████████ 9.4M ops/s Phase E3: ██████████████████████████████████████████████████████████████ 70M ops/s (target) Regression: -87% | Recovery: +645% ``` ### 512B Allocations ``` Phase 7: ███████████████████████████████████████████████████████████ 68M ops/s Current: ███████ 8.4M ops/s Phase E3: ███████████████████████████████████████████████████████████ 68M ops/s (target) Regression: -88% | Recovery: +710% ``` ### 1024B Allocations (C7) ``` Phase 7: █████████████████████████████████████████████████████████ 65M ops/s Current: ███████ 8.4M ops/s Phase E3: █████████████████████████████████████████████████████████ 65M ops/s (target) Regression: -87% | Recovery: +674% ``` --- ## Call Graph Comparison ### Phase 7 (Fast Path - 95-99% hit rate) ``` hak_free_at() └─> hak_tiny_free_fast_v2() [5-10 cycles] ├─> Page boundary check [1-2 cycles, 99.9% skip] ├─> Header read (ptr-1) [2-3 cycles, L1 hit] ├─> Magic validation [included in read] └─> TLS freelist push [3-5 cycles] └─> *(void**)base = head └─> head = base └─> count++ ``` ### Current (Bottlenecked - 95-99% hit rate, but SLOW) ``` hak_free_at() └─> hak_tiny_free_fast_v2() [55-110 cycles] ❌ ├─> Registry lookup [50-100 cycles] ❌ │ └─> hak_super_lookup() │ ├─> RB-tree search (O(log N)) │ ├─> Multiple dereferences │ └─> Cache misses ├─> Page boundary check [1-2 cycles] ├─> Header read (ptr-1) [2-3 cycles] ├─> Magic validation [included] └─> TLS freelist push [3-5 cycles] ``` --- ## Cycle Budget Breakdown ### Phase 7-1.3 (Target) ``` Operation Cycles Frequency Weighted ──────────────────────────────────────────────────────────── NULL check 1 100% 1 Page boundary check 1-2 0.1% 0.002 Header read 2-3 100% 3 TLS freelist push 3-5 100% 4 ──────────────────────────────────────────────────────────── TOTAL (Fast Path) 5-10 95-99% 8 ──────────────────────────────────────────────────────────── Slow path fallback 500+ 1-5% 5-25 ──────────────────────────────────────────────────────────── WEIGHTED AVERAGE ~13-33 cycles/free ``` **Throughput** (3.0 GHz CPU): - Free latency: ~13-33 cycles = 4-11 ns - Mixed (50% alloc/free): ~8-22 ns per op - Throughput: ~45-125M ops/s per core - Multi-core (4 cores, 50% efficiency): **45-60M ops/s** ✅ ### Current (Bottlenecked) ``` Operation Cycles Frequency Weighted ──────────────────────────────────────────────────────────── NULL check 1 100% 1 Registry lookup ❌ 50-100 100% 75 Page boundary check 1-2 0.1% 0.002 Header read 2-3 100% 3 TLS freelist push 3-5 100% 4 ──────────────────────────────────────────────────────────── TOTAL (Fast Path) 55-110 95-99% 83 ──────────────────────────────────────────────────────────── Slow path fallback 500+ 1-5% 5-25 ──────────────────────────────────────────────────────────── WEIGHTED AVERAGE ~88-108 cycles/free ❌ ``` **Throughput** (3.0 GHz CPU): - Free latency: ~88-108 cycles = 29-36 ns - Mixed (50% alloc/free): ~58-72 ns per op - Throughput: ~14-17M ops/s per core - Multi-core (4 cores, 50% efficiency): **7-9M ops/s** ❌ --- ## Memory Layout: Why Header Validation Is Sufficient ### Tiny Allocation (C0-C7) ``` Base ptr User ptr (returned) ↓ ↓ ┌────────┬──────────────────────────────────────┐ │ Header │ User Data │ │ 0xAX │ (N-1 bytes) │ └────────┴──────────────────────────────────────┘ 1 byte User allocation Header format: 0xAX where X = class_idx (0-7) - C0: 0xA0 (16B) - C1: 0xA1 (32B) - ... - C7: 0xA7 (1024B) ← HAS HEADER SINCE PHASE E1! ``` ### Pool TLS Allocation (8KB-52KB) ``` Base ptr User ptr (returned) ↓ ↓ ┌────────┬──────────────────────────────────────┐ │ Header │ User Data │ │ 0xBX │ (N-1 bytes) │ └────────┴──────────────────────────────────────┘ 1 byte User allocation Header format: 0xBX where X = pool class (0-15) ``` ### Mid/Large Allocation (64KB+) ``` Base ptr User ptr (returned) ↓ ↓ ┌────────────────┬─────────────────────────────┐ │ AllocHeader │ User Data │ │ (16 bytes) │ (N bytes) │ │ magic = 0x... │ │ └────────────────┴─────────────────────────────┘ 16 bytes User allocation ``` ### External Allocation (libc malloc) ``` User ptr (returned) ↓ ┌────────────────────────────────────┐ │ User Data │ │ (no header) │ └────────────────────────────────────┘ Header at ptr-1: Random data (NOT 0xA0) ``` ### Classification Logic ```c // Read header at ptr-1 uint8_t header = *(uint8_t*)(ptr - 1); uint8_t magic = header & 0xF0; if (magic == 0xA0) { // Tiny allocation (C0-C7) int class_idx = header & 0x0F; return TINY_HEADER; // Fast path: 2-3 cycles ✅ } if (magic == 0xB0) { // Pool TLS allocation return POOL_TLS; // Slow path: fallback } // No valid header return UNKNOWN; // Slow path: check 16-byte AllocHeader ``` **Result**: Header magic alone is sufficient! No registry lookup needed! --- ## The Fix: Before vs After ### Before (Lines 51-90 in tiny_free_fast_v2.inc.h) ```c static inline int hak_tiny_free_fast_v2(void* ptr) { if (__builtin_expect(!ptr, 0)) return 0; // ╔══════════════════════════════════════════════════════╗ // ║ ❌ DELETE THIS BLOCK (50-100 cycles overhead) ║ // ╠══════════════════════════════════════════════════════╣ // ║ extern struct SuperSlab* hak_super_lookup(void*); ║ // ║ struct SuperSlab* ss = hak_super_lookup(ptr); ║ // ║ if (ss && ss->size_class == 7) { ║ // ║ return 0; ║ // ║ } ║ // ╚══════════════════════════════════════════════════════╝ void* header_addr = (char*)ptr - 1; // Page boundary check (1-2 cycles) if (((uintptr_t)ptr & 0xFFF) == 0) { if (!hak_is_memory_readable(header_addr)) return 0; } // Read header (2-3 cycles) - includes magic validation int class_idx = tiny_region_id_read_header(ptr); if (class_idx < 0) return 0; // TLS capacity check (1 cycle) if (g_tls_sll_count[class_idx] >= cap) return 0; // Push to TLS freelist (3-5 cycles) void* base = (char*)ptr - 1; tls_sll_push(class_idx, base, UINT32_MAX); return 1; // TOTAL: 55-110 cycles ❌ } ``` ### After (Phase E3-1 - Simple deletion!) ```c static inline int hak_tiny_free_fast_v2(void* ptr) { if (__builtin_expect(!ptr, 0)) return 0; // Phase E3: C7 now has header (Phase E1), registry lookup removed! // Header magic validation (2-3 cycles) distinguishes: // - Tiny (0xA0-0xA7): valid header → fast path // - Pool TLS (0xB0): different magic → slow path // - Mid/Large: no header → slow path void* header_addr = (char*)ptr - 1; // Page boundary check (1-2 cycles) if (((uintptr_t)ptr & 0xFFF) == 0) { if (!hak_is_memory_readable(header_addr)) return 0; } // Read header (2-3 cycles) - includes magic validation int class_idx = tiny_region_id_read_header(ptr); if (class_idx < 0) return 0; // TLS capacity check (1 cycle) if (g_tls_sll_count[class_idx] >= cap) return 0; // Push to TLS freelist (3-5 cycles) void* base = (char*)ptr - 1; tls_sll_push(class_idx, base, UINT32_MAX); return 1; // TOTAL: 5-10 cycles ✅ } ``` **Diff**: - **Lines deleted**: 9 (registry lookup block) - **Lines added**: 5 (explanatory comments) - **Net change**: -4 lines - **Cycle savings**: -50 to -100 cycles per free - **Throughput improvement**: +541-674% --- ## Summary: Why This Fix Works ### Phase E1 Guarantees ✅ **ALL classes have headers** (C0-C7 including C7) ✅ **Header magic distinguishes allocators** (0xA0 vs 0xB0 vs none) ✅ **No C7 special cases needed** (unified code path) ### Current Code Problems ❌ **Registry lookup redundant** (50-100 cycles for nothing) ❌ **Header validation sufficient** (already done in 2-3 cycles) ❌ **No performance benefit** (safety already guaranteed by headers) ### Phase E3-1 Solution ✅ **Remove registry lookup** (revert to Phase 7-1.3) ✅ **Keep header validation** (2-3 cycles, sufficient) ✅ **Restore performance** (5-10 cycles per free) ✅ **Maintain safety** (Phase E1 headers guarantee correctness) --- **Ready to implement Phase E3!** 🚀 The fix is trivial (delete 9 lines), low-risk (revert to proven code), and high-impact (+541-674% throughput).