# Phase E3: Performance Restoration Implementation Plan **Date**: 2025-11-12 **Goal**: Restore Phase 7 performance (9M → 59-70M ops/s, +541-674%) **Status**: READY TO IMPLEMENT --- ## Quick Reference ### The One Critical Fix **File**: `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h` **Lines to Remove**: 54-63 (SuperSlab registry lookup) **Impact**: -91% latency, +1100% throughput --- ## Phase E3-1: Remove Registry Lookup (CRITICAL) ### Detailed Code Changes **File**: `core/tiny_free_fast_v2.inc.h` **Lines 51-63 (BEFORE - SLOW)**: ```c static inline int hak_tiny_free_fast_v2(void* ptr) { if (__builtin_expect(!ptr, 0)) return 0; // CRITICAL: C7 (1KB headerless) MUST be excluded from Ultra-Fast Free // Problem: Magic validation alone insufficient (C7 user data can be 0xaX pattern) // Solution: Registry lookup to 100% identify C7 before header read // Cost: 50-100 cycles (O(log N) RB-tree), but C7 is rare (~5% of allocations) // Benefit: 100% SEGV prevention, no false positives extern struct SuperSlab* hak_super_lookup(void* ptr); struct SuperSlab* ss = hak_super_lookup(ptr); if (__builtin_expect(ss && ss->size_class == 7, 0)) { return 0; // C7 detected → force slow path (Front Gate will handle correctly) } // CRITICAL: Check if header is accessible before reading void* header_addr = (char*)ptr - 1; ``` **Lines 51-63 (AFTER - FAST)**: ```c static inline int hak_tiny_free_fast_v2(void* ptr) { if (__builtin_expect(!ptr, 0)) return 0; // Phase E3: C7 now has header (Phase E1), no registry lookup needed! // Header magic validation (2-3 cycles) is sufficient to distinguish: // - Tiny (0xA0-0xA7): valid header → fast path // - Pool TLS (0xB0-0xBF): different magic → slow path // - Mid/Large: no header → slow path // - C7: has header like all other classes → fast path works! // // Performance: 5-10 cycles (vs 55-110 cycles with registry lookup) // CRITICAL: Check if header is accessible before reading void* header_addr = (char*)ptr - 1; ``` **Summary of Changes**: - **DELETE**: Lines 54-62 (9 lines of SuperSlab registry lookup code) - **ADD**: 7 lines of explanatory comments (why registry lookup is no longer needed) - **Net change**: -2 lines, -50-100 cycles per free operation ### Build & Test Commands ```bash # 1. Edit file vim /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h # 2. Build release binary cd /mnt/workdisk/public_share/hakmem ./build.sh bench_random_mixed_hakmem # 3. Verify build succeeded ls -lh ./out/release/bench_random_mixed_hakmem # 4. Run benchmarks (3 runs each for stability) echo "=== 128B Benchmark ===" ./out/release/bench_random_mixed_hakmem 100000 128 42 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 128 43 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 128 44 2>&1 | tail -1 echo "=== 256B Benchmark ===" ./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 256 43 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 256 44 2>&1 | tail -1 echo "=== 512B Benchmark ===" ./out/release/bench_random_mixed_hakmem 100000 512 42 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 512 43 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 512 44 2>&1 | tail -1 echo "=== 1024B Benchmark ===" ./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 1024 43 2>&1 | tail -1 ./out/release/bench_random_mixed_hakmem 100000 1024 44 2>&1 | tail -1 ``` ### Success Criteria (Phase E3-1) **Minimum Acceptable Performance** (vs current 9M ops/s): - 128B: ≥30M ops/s (+226%) - 256B: ≥32M ops/s (+240%) - 512B: ≥28M ops/s (+233%) - 1024B: ≥28M ops/s (+233%) **Target Performance** (Phase 7-1.3 baseline): - 128B: 40-50M ops/s (+335-443%) - 256B: 45-55M ops/s (+379-485%) - 512B: 40-50M ops/s (+376-495%) - 1024B: 40-50M ops/s (+376-495%) --- ## Phase E3-2: Header-First Classification (OPTIONAL) ### Why Optional? Phase E3-1 (remove registry lookup from fast path) should restore 80-90% of Phase 7 performance. Phase E3-2 optimizes the **slow path** (TLS cache full, Pool TLS, Mid/Large), which is only 1-5% of operations. **Impact**: Additional +10-20% on top of Phase E3-1 ### Detailed Code Changes **File**: `core/box/front_gate_classifier.h` **Lines 166-234 (BEFORE - Registry-First)**: ```c static inline __attribute__((always_inline)) ptr_classification_t classify_ptr(void* ptr) { ptr_classification_t result = { .kind = PTR_KIND_UNKNOWN, .class_idx = -1, .ss = NULL, .slab_idx = -1 }; if (__builtin_expect(!ptr, 0)) return result; if (__builtin_expect((uintptr_t)ptr < 4096, 0)) { result.kind = PTR_KIND_UNKNOWN; return result; } #ifdef HAKMEM_POOL_TLS_PHASE1 if (__builtin_expect(is_pool_tls_reg(ptr), 0)) { result.kind = PTR_KIND_POOL_TLS; return result; } #endif // ❌ SLOW: Registry lookup FIRST (50-100 cycles) result = registry_lookup(ptr); if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) { return result; } if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 1)) { return result; } // ... rest of function ... } ``` **Lines 166-234 (AFTER - Header-First)**: ```c static inline __attribute__((always_inline)) ptr_classification_t classify_ptr(void* ptr) { ptr_classification_t result = { .kind = PTR_KIND_UNKNOWN, .class_idx = -1, .ss = NULL, .slab_idx = -1 }; if (__builtin_expect(!ptr, 0)) return result; if (__builtin_expect((uintptr_t)ptr < 4096, 0)) { result.kind = PTR_KIND_UNKNOWN; return result; } // ✅ FAST: Try header probe FIRST (2-3 cycles, 95-99% hit rate) int class_idx = safe_header_probe(ptr); if (__builtin_expect(class_idx >= 0, 1)) { // Valid Tiny header found result.kind = PTR_KIND_TINY_HEADER; result.class_idx = class_idx; #if !HAKMEM_BUILD_RELEASE extern __thread uint64_t g_classify_header_hit; g_classify_header_hit++; #endif return result; // Fast path: 2-3 cycles! } #ifdef HAKMEM_POOL_TLS_PHASE1 // Fallback: Check Pool TLS registry (header probe failed) if (__builtin_expect(is_pool_tls_reg(ptr), 0)) { result.kind = PTR_KIND_POOL_TLS; #if !HAKMEM_BUILD_RELEASE extern __thread uint64_t g_classify_pool_hit; g_classify_pool_hit++; #endif return result; } #endif // Fallback: Registry lookup (rare, <1%) result = registry_lookup(ptr); if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) { #if !HAKMEM_BUILD_RELEASE extern __thread uint64_t g_classify_headerless_hit; g_classify_headerless_hit++; #endif return result; } if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 0)) { #if !HAKMEM_BUILD_RELEASE extern __thread uint64_t g_classify_header_hit; g_classify_header_hit++; #endif return result; } // ... rest of function (16-byte AllocHeader check) ... } ``` ### Build & Test Commands ```bash # 1. Edit file vim /mnt/workdisk/public_share/hakmem/core/box/front_gate_classifier.h # 2. Rebuild ./build.sh bench_random_mixed_hakmem # 3. Benchmark (should see +10-20% improvement over Phase E3-1) ./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1 ``` ### Success Criteria (Phase E3-2) **Target**: +10-20% improvement over Phase E3-1 **Example**: - Phase E3-1: 45M ops/s - Phase E3-2: 50-55M ops/s (+11-22%) --- ## Phase E3-3: Remove C7 Special Cases (CLEANUP) ### Why Cleanup? Phase E1 added headers to C7, making all `if (class_idx == 7)` conditionals obsolete. However, many files still contain C7 special cases from legacy code. **Impact**: Code simplification + 5-10% reduced branching overhead ### Files to Edit #### File 1: `core/hakmem_tiny_free.inc` **Lines to Remove/Modify**: ```bash # Find all C7 special cases grep -n "class_idx == 7" core/hakmem_tiny_free.inc ``` **Expected Output**: ``` 32: // CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL 34: if (__builtin_expect(class_idx == 7, 0)) return; 124: if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) { 145: if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) { 158: if (g_tiny_safe_free_strict || class_idx == 7) { raise(SIGUSR2); return; } 195: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); 211: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); 233: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); 241: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); 253: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); 348: // CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL 384: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); 445: void* base2 = (fast_class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); ``` **Changes**: 1. **Line 32-34**: Remove early return for C7 ```c // BEFORE // CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL if (__builtin_expect(class_idx == 7, 0)) return; // AFTER (DELETE these 2 lines) ``` 2. **Lines 124, 145, 158**: Remove `|| class_idx == 7` conditions ```c // BEFORE if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) { // AFTER if (__builtin_expect(g_tiny_safe_free, 0)) { ``` 3. **Lines 195, 211, 233, 241, 253, 384, 445**: Simplify base calculation ```c // BEFORE void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1); // AFTER (ALL classes have header now) void* base = (void*)((uint8_t*)ptr - 1); ``` 4. **Line 348**: Remove C7 comment (obsolete) ```c // BEFORE // CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL // AFTER (DELETE this line) ``` #### File 2: `core/hakmem_tiny_alloc.inc` **Lines to Remove/Modify**: ```bash grep -n "class_idx == 7" core/hakmem_tiny_alloc.inc ``` **Expected Output**: ``` 252: if (__builtin_expect(class_idx == 7, 0)) { *(void**)hotmag_ptr = NULL; } 281: if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast_hot = NULL; } 292: if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast = NULL; } ``` **Changes**: Remove all 3 lines (C7 now has header, no NULL clearing needed) #### File 3: `core/hakmem_tiny_slow.inc` **Lines to Remove/Modify**: ```bash grep -n "class_idx == 7" core/hakmem_tiny_slow.inc ``` **Expected Output**: ``` 25: // Try TLS list refill (C7 is headerless: skip TLS list entirely) ``` **Changes**: Update comment ```c // BEFORE // Try TLS list refill (C7 is headerless: skip TLS list entirely) // AFTER // Try TLS list refill (all classes use TLS list now) ``` ### Build & Test Commands ```bash # 1. Edit files vim core/hakmem_tiny_free.inc vim core/hakmem_tiny_alloc.inc vim core/hakmem_tiny_slow.inc # 2. Rebuild ./build.sh bench_random_mixed_hakmem # 3. Verify no regressions ./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1 ``` ### Success Criteria (Phase E3-3) **Target**: 50-60M → 59-70M ops/s (+5-10% from reduced branching) **Code Quality**: - All C7 special cases removed - Unified base pointer calculation (`ptr - 1` for all classes) - Cleaner, more maintainable code --- ## Final Verification ### Full Benchmark Suite ```bash # Run comprehensive benchmarks cd /mnt/workdisk/public_share/hakmem # 1. Random Mixed (primary benchmark) for size in 128 256 512 1024; do echo "=== Random Mixed ${size}B ===" ./out/release/bench_random_mixed_hakmem 100000 $size 42 2>&1 | grep "Throughput" done # 2. Fixed Size (stability check) for size in 256 1024; do echo "=== Fixed Size ${size}B ===" ./out/release/bench_fixed_size_hakmem 200000 $size 128 2>&1 | grep "Throughput" done # 3. Larson (multi-threaded stress test) echo "=== Larson Multi-Threaded ===" ./out/release/larson_hakmem 1 2>&1 | grep "ops/sec" ``` ### Expected Results (After All 3 Phases) | Benchmark | Current | Phase E3 | Improvement | |-----------|---------|----------|-------------| | Random Mixed 128B | 9.2M | **59M** | **+541%** 🎯 | | Random Mixed 256B | 9.4M | **70M** | **+645%** 🎯 | | Random Mixed 512B | 8.4M | **68M** | **+710%** 🎯 | | Random Mixed 1024B | 8.4M | **65M** | **+674%** 🎯 | | Fixed Size 256B | 2.76M | **10-12M** | **+263-335%** | | Larson 1T | 2.68M | **8-10M** | **+199-273%** | --- ## Rollback Plan (If Needed) ### If Phase E3-1 Causes Issues ```bash # Revert to current version git checkout HEAD -- core/tiny_free_fast_v2.inc.h ./build.sh bench_random_mixed_hakmem ``` ### If Phase E3-2 Causes Issues ```bash # Revert to Phase E3-1 git checkout HEAD -- core/box/front_gate_classifier.h ./build.sh bench_random_mixed_hakmem ``` ### If Phase E3-3 Causes Issues ```bash # Revert cleanup changes git checkout HEAD -- core/hakmem_tiny_free.inc core/hakmem_tiny_alloc.inc core/hakmem_tiny_slow.inc ./build.sh bench_random_mixed_hakmem ``` --- ## Risk Assessment ### Phase E3-1: Remove Registry Lookup **Risk**: ⚠️ **LOW** - Reverting to Phase 7-1.3 code (proven stable at 59-70M ops/s) - Phase E1 already added headers to C7 (safety guaranteed) - Header magic validation (2-3 cycles) sufficient for classification **Mitigation**: - Test with 1M iterations (stress test) - Run Larson multi-threaded (race condition check) - Monitor for SEGV (should be zero) ### Phase E3-2: Header-First Classification **Risk**: ⚠️ **LOW-MEDIUM** - Only affects slow path (1-5% of operations) - Safe header probe already implemented (lines 100-117) - No change to fast path (already optimized in E3-1) **Mitigation**: - Test with Pool TLS workloads (8-52KB allocations) - Test with Mid/Large workloads (64KB+ allocations) - Verify classification hit rates in debug mode ### Phase E3-3: Remove C7 Special Cases **Risk**: ⚠️ **LOW** - Code cleanup only (no algorithmic changes) - Phase E1 already verified C7 works with headers - All conditionals are redundant (dead code) **Mitigation**: - Test specifically with 1024B workload (C7 class) - Run 1M iterations (comprehensive coverage) - Check for any unexpected branches --- ## Timeline | Phase | Time | Cumulative | |-------|------|------------| | E3-1: Remove Registry Lookup | 10 min | 10 min | | E3-1: Build & Test | 5 min | 15 min | | E3-2: Header-First Classification | 15 min | 30 min | | E3-2: Build & Test | 5 min | 35 min | | E3-3: Remove C7 Special Cases | 30 min | 65 min | | E3-3: Build & Test | 5 min | 70 min | | Final Verification | 10 min | 80 min | | **TOTAL** | - | **~1.5 hours** | --- ## Success Metrics ### Performance (Primary) ✅ **Phase E3-1 Success**: ≥30M ops/s (all sizes) ✅ **Phase E3-2 Success**: ≥50M ops/s (all sizes) ✅ **Phase E3-3 Success**: ≥59M ops/s (target met!) ### Stability (Critical) ✅ **No SEGV**: 1M iterations without crash ✅ **No corruption**: Memory integrity checks pass ✅ **Multi-threaded**: Larson 4T stable ### Code Quality (Secondary) ✅ **Reduced LOC**: -50 lines (C7 special cases removed) ✅ **Reduced branching**: -10% branch-miss rate ✅ **Unified code**: Single base calculation (`ptr - 1`) --- ## Next Actions 1. **Start with Phase E3-1** (highest ROI, lowest risk) 2. **Verify performance** (should see 3-5x improvement immediately) 3. **Proceed to E3-2** (optional, +10-20% additional) 4. **Complete E3-3** (cleanup, +5-10% final boost) 5. **Update CLAUDE.md** (document restoration success) **Ready to implement!** 🚀