## Root Cause Analysis (GPT5) **Physical Layout Constraints**: - Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed = ❌ IMPOSSIBLE - Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 = ✅ POSSIBLE - Class 7: 1KB → offset 0 (compatibility) **Correct Specification**: - HAKMEM_TINY_HEADER_CLASSIDX != 0: - Class 0, 7: next at offset 0 (overwrites header when on freelist) - Class 1-6: next at offset 1 (after header) - HAKMEM_TINY_HEADER_CLASSIDX == 0: - All classes: next at offset 0 **Previous Bug**: - Attempted "ALL classes offset 1" unification - Class 0 with offset 1 caused immediate SEGV (9B > 8B block size) - Mixed 2-arg/3-arg API caused confusion ## Fixes Applied ### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h) ```c // Correct signatures void tiny_next_write(int class_idx, void* base, void* next_value) void* tiny_next_read(int class_idx, const void* base) // Correct offset calculation size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1; ``` ### 2. Updated 123+ Call Sites Across 34 Files - hakmem_tiny_hot_pop_v4.inc.h (4 locations) - hakmem_tiny_fastcache.inc.h (3 locations) - hakmem_tiny_tls_list.h (12 locations) - superslab_inline.h (5 locations) - tiny_fastcache.h (3 locations) - ptr_trace.h (macro definitions) - tls_sll_box.h (2 locations) - + 27 additional files Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)` Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)` ### 3. Added Sentinel Detection Guards - tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next - tls_list_push(): Block nodes with sentinel in ptr or ptr->next - Defense-in-depth against remote free sentinel leakage ## Verification (GPT5 Report) **Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000` **Results**: - ✅ Main loop completed successfully - ✅ Drain phase completed successfully - ✅ NO SEGV (previous crash at iteration 66151 is FIXED) - ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers **Analysis**: - Class 0 immediate SEGV: ✅ RESOLVED (correct offset 0 now used) - 66K iteration crash: ✅ RESOLVED (offset consistency fixed) - Box API conflicts: ✅ RESOLVED (unified 3-arg API) ## Technical Details ### Offset Logic Justification ``` Class 0: 8B block → next pointer (8B) fits ONLY at offset 0 Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header) Class 2: 32B block → next pointer (8B) fits at offset 1 ... Class 6: 512B block → next pointer (8B) fits at offset 1 Class 7: 1024B block → offset 0 for legacy compatibility ``` ### Files Modified (Summary) - Core API: `box/tiny_next_ptr_box.h` - Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h` - TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h` - SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h` - Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h` - Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h` - Documentation: Multiple Phase E3 reports ## Remaining Work None for Box API offset bugs - all structural issues resolved. Future enhancements (non-critical): - Periodic `grep -R '*(void**)' core/` to detect direct pointer access violations - Enforce Box API usage via static analysis - Document offset rationale in architecture docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
15 KiB
Phase E3: Performance Restoration Implementation Plan
Date: 2025-11-12 Goal: Restore Phase 7 performance (9M → 59-70M ops/s, +541-674%) Status: READY TO IMPLEMENT
Quick Reference
The One Critical Fix
File: /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h
Lines to Remove: 54-63 (SuperSlab registry lookup)
Impact: -91% latency, +1100% throughput
Phase E3-1: Remove Registry Lookup (CRITICAL)
Detailed Code Changes
File: core/tiny_free_fast_v2.inc.h
Lines 51-63 (BEFORE - SLOW):
static inline int hak_tiny_free_fast_v2(void* ptr) {
if (__builtin_expect(!ptr, 0)) return 0;
// CRITICAL: C7 (1KB headerless) MUST be excluded from Ultra-Fast Free
// Problem: Magic validation alone insufficient (C7 user data can be 0xaX pattern)
// Solution: Registry lookup to 100% identify C7 before header read
// Cost: 50-100 cycles (O(log N) RB-tree), but C7 is rare (~5% of allocations)
// Benefit: 100% SEGV prevention, no false positives
extern struct SuperSlab* hak_super_lookup(void* ptr);
struct SuperSlab* ss = hak_super_lookup(ptr);
if (__builtin_expect(ss && ss->size_class == 7, 0)) {
return 0; // C7 detected → force slow path (Front Gate will handle correctly)
}
// CRITICAL: Check if header is accessible before reading
void* header_addr = (char*)ptr - 1;
Lines 51-63 (AFTER - FAST):
static inline int hak_tiny_free_fast_v2(void* ptr) {
if (__builtin_expect(!ptr, 0)) return 0;
// Phase E3: C7 now has header (Phase E1), no registry lookup needed!
// Header magic validation (2-3 cycles) is sufficient to distinguish:
// - Tiny (0xA0-0xA7): valid header → fast path
// - Pool TLS (0xB0-0xBF): different magic → slow path
// - Mid/Large: no header → slow path
// - C7: has header like all other classes → fast path works!
//
// Performance: 5-10 cycles (vs 55-110 cycles with registry lookup)
// CRITICAL: Check if header is accessible before reading
void* header_addr = (char*)ptr - 1;
Summary of Changes:
- DELETE: Lines 54-62 (9 lines of SuperSlab registry lookup code)
- ADD: 7 lines of explanatory comments (why registry lookup is no longer needed)
- Net change: -2 lines, -50-100 cycles per free operation
Build & Test Commands
# 1. Edit file
vim /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h
# 2. Build release binary
cd /mnt/workdisk/public_share/hakmem
./build.sh bench_random_mixed_hakmem
# 3. Verify build succeeded
ls -lh ./out/release/bench_random_mixed_hakmem
# 4. Run benchmarks (3 runs each for stability)
echo "=== 128B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 128 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 128 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 128 44 2>&1 | tail -1
echo "=== 256B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 256 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 256 44 2>&1 | tail -1
echo "=== 512B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 512 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 512 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 512 44 2>&1 | tail -1
echo "=== 1024B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 1024 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 1024 44 2>&1 | tail -1
Success Criteria (Phase E3-1)
Minimum Acceptable Performance (vs current 9M ops/s):
- 128B: ≥30M ops/s (+226%)
- 256B: ≥32M ops/s (+240%)
- 512B: ≥28M ops/s (+233%)
- 1024B: ≥28M ops/s (+233%)
Target Performance (Phase 7-1.3 baseline):
- 128B: 40-50M ops/s (+335-443%)
- 256B: 45-55M ops/s (+379-485%)
- 512B: 40-50M ops/s (+376-495%)
- 1024B: 40-50M ops/s (+376-495%)
Phase E3-2: Header-First Classification (OPTIONAL)
Why Optional?
Phase E3-1 (remove registry lookup from fast path) should restore 80-90% of Phase 7 performance. Phase E3-2 optimizes the slow path (TLS cache full, Pool TLS, Mid/Large), which is only 1-5% of operations.
Impact: Additional +10-20% on top of Phase E3-1
Detailed Code Changes
File: core/box/front_gate_classifier.h
Lines 166-234 (BEFORE - Registry-First):
static inline __attribute__((always_inline))
ptr_classification_t classify_ptr(void* ptr) {
ptr_classification_t result = {
.kind = PTR_KIND_UNKNOWN,
.class_idx = -1,
.ss = NULL,
.slab_idx = -1
};
if (__builtin_expect(!ptr, 0)) return result;
if (__builtin_expect((uintptr_t)ptr < 4096, 0)) {
result.kind = PTR_KIND_UNKNOWN;
return result;
}
#ifdef HAKMEM_POOL_TLS_PHASE1
if (__builtin_expect(is_pool_tls_reg(ptr), 0)) {
result.kind = PTR_KIND_POOL_TLS;
return result;
}
#endif
// ❌ SLOW: Registry lookup FIRST (50-100 cycles)
result = registry_lookup(ptr);
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) {
return result;
}
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 1)) {
return result;
}
// ... rest of function ...
}
Lines 166-234 (AFTER - Header-First):
static inline __attribute__((always_inline))
ptr_classification_t classify_ptr(void* ptr) {
ptr_classification_t result = {
.kind = PTR_KIND_UNKNOWN,
.class_idx = -1,
.ss = NULL,
.slab_idx = -1
};
if (__builtin_expect(!ptr, 0)) return result;
if (__builtin_expect((uintptr_t)ptr < 4096, 0)) {
result.kind = PTR_KIND_UNKNOWN;
return result;
}
// ✅ FAST: Try header probe FIRST (2-3 cycles, 95-99% hit rate)
int class_idx = safe_header_probe(ptr);
if (__builtin_expect(class_idx >= 0, 1)) {
// Valid Tiny header found
result.kind = PTR_KIND_TINY_HEADER;
result.class_idx = class_idx;
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_header_hit;
g_classify_header_hit++;
#endif
return result; // Fast path: 2-3 cycles!
}
#ifdef HAKMEM_POOL_TLS_PHASE1
// Fallback: Check Pool TLS registry (header probe failed)
if (__builtin_expect(is_pool_tls_reg(ptr), 0)) {
result.kind = PTR_KIND_POOL_TLS;
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_pool_hit;
g_classify_pool_hit++;
#endif
return result;
}
#endif
// Fallback: Registry lookup (rare, <1%)
result = registry_lookup(ptr);
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) {
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_headerless_hit;
g_classify_headerless_hit++;
#endif
return result;
}
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 0)) {
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_header_hit;
g_classify_header_hit++;
#endif
return result;
}
// ... rest of function (16-byte AllocHeader check) ...
}
Build & Test Commands
# 1. Edit file
vim /mnt/workdisk/public_share/hakmem/core/box/front_gate_classifier.h
# 2. Rebuild
./build.sh bench_random_mixed_hakmem
# 3. Benchmark (should see +10-20% improvement over Phase E3-1)
./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1
Success Criteria (Phase E3-2)
Target: +10-20% improvement over Phase E3-1
Example:
- Phase E3-1: 45M ops/s
- Phase E3-2: 50-55M ops/s (+11-22%)
Phase E3-3: Remove C7 Special Cases (CLEANUP)
Why Cleanup?
Phase E1 added headers to C7, making all if (class_idx == 7) conditionals obsolete. However, many files still contain C7 special cases from legacy code.
Impact: Code simplification + 5-10% reduced branching overhead
Files to Edit
File 1: core/hakmem_tiny_free.inc
Lines to Remove/Modify:
# Find all C7 special cases
grep -n "class_idx == 7" core/hakmem_tiny_free.inc
Expected Output:
32: // CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL
34: if (__builtin_expect(class_idx == 7, 0)) return;
124: if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
145: if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
158: if (g_tiny_safe_free_strict || class_idx == 7) { raise(SIGUSR2); return; }
195: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
211: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
233: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
241: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
253: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
348: // CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL
384: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
445: void* base2 = (fast_class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
Changes:
- Line 32-34: Remove early return for C7
// BEFORE
// CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL
if (__builtin_expect(class_idx == 7, 0)) return;
// AFTER (DELETE these 2 lines)
- Lines 124, 145, 158: Remove
|| class_idx == 7conditions
// BEFORE
if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
// AFTER
if (__builtin_expect(g_tiny_safe_free, 0)) {
- Lines 195, 211, 233, 241, 253, 384, 445: Simplify base calculation
// BEFORE
void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
// AFTER (ALL classes have header now)
void* base = (void*)((uint8_t*)ptr - 1);
- Line 348: Remove C7 comment (obsolete)
// BEFORE
// CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL
// AFTER (DELETE this line)
File 2: core/hakmem_tiny_alloc.inc
Lines to Remove/Modify:
grep -n "class_idx == 7" core/hakmem_tiny_alloc.inc
Expected Output:
252: if (__builtin_expect(class_idx == 7, 0)) { *(void**)hotmag_ptr = NULL; }
281: if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast_hot = NULL; }
292: if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast = NULL; }
Changes: Remove all 3 lines (C7 now has header, no NULL clearing needed)
File 3: core/hakmem_tiny_slow.inc
Lines to Remove/Modify:
grep -n "class_idx == 7" core/hakmem_tiny_slow.inc
Expected Output:
25: // Try TLS list refill (C7 is headerless: skip TLS list entirely)
Changes: Update comment
// BEFORE
// Try TLS list refill (C7 is headerless: skip TLS list entirely)
// AFTER
// Try TLS list refill (all classes use TLS list now)
Build & Test Commands
# 1. Edit files
vim core/hakmem_tiny_free.inc
vim core/hakmem_tiny_alloc.inc
vim core/hakmem_tiny_slow.inc
# 2. Rebuild
./build.sh bench_random_mixed_hakmem
# 3. Verify no regressions
./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1
Success Criteria (Phase E3-3)
Target: 50-60M → 59-70M ops/s (+5-10% from reduced branching)
Code Quality:
- All C7 special cases removed
- Unified base pointer calculation (
ptr - 1for all classes) - Cleaner, more maintainable code
Final Verification
Full Benchmark Suite
# Run comprehensive benchmarks
cd /mnt/workdisk/public_share/hakmem
# 1. Random Mixed (primary benchmark)
for size in 128 256 512 1024; do
echo "=== Random Mixed ${size}B ==="
./out/release/bench_random_mixed_hakmem 100000 $size 42 2>&1 | grep "Throughput"
done
# 2. Fixed Size (stability check)
for size in 256 1024; do
echo "=== Fixed Size ${size}B ==="
./out/release/bench_fixed_size_hakmem 200000 $size 128 2>&1 | grep "Throughput"
done
# 3. Larson (multi-threaded stress test)
echo "=== Larson Multi-Threaded ==="
./out/release/larson_hakmem 1 2>&1 | grep "ops/sec"
Expected Results (After All 3 Phases)
| Benchmark | Current | Phase E3 | Improvement |
|---|---|---|---|
| Random Mixed 128B | 9.2M | 59M | +541% 🎯 |
| Random Mixed 256B | 9.4M | 70M | +645% 🎯 |
| Random Mixed 512B | 8.4M | 68M | +710% 🎯 |
| Random Mixed 1024B | 8.4M | 65M | +674% 🎯 |
| Fixed Size 256B | 2.76M | 10-12M | +263-335% |
| Larson 1T | 2.68M | 8-10M | +199-273% |
Rollback Plan (If Needed)
If Phase E3-1 Causes Issues
# Revert to current version
git checkout HEAD -- core/tiny_free_fast_v2.inc.h
./build.sh bench_random_mixed_hakmem
If Phase E3-2 Causes Issues
# Revert to Phase E3-1
git checkout HEAD -- core/box/front_gate_classifier.h
./build.sh bench_random_mixed_hakmem
If Phase E3-3 Causes Issues
# Revert cleanup changes
git checkout HEAD -- core/hakmem_tiny_free.inc core/hakmem_tiny_alloc.inc core/hakmem_tiny_slow.inc
./build.sh bench_random_mixed_hakmem
Risk Assessment
Phase E3-1: Remove Registry Lookup
Risk: ⚠️ LOW
- Reverting to Phase 7-1.3 code (proven stable at 59-70M ops/s)
- Phase E1 already added headers to C7 (safety guaranteed)
- Header magic validation (2-3 cycles) sufficient for classification
Mitigation:
- Test with 1M iterations (stress test)
- Run Larson multi-threaded (race condition check)
- Monitor for SEGV (should be zero)
Phase E3-2: Header-First Classification
Risk: ⚠️ LOW-MEDIUM
- Only affects slow path (1-5% of operations)
- Safe header probe already implemented (lines 100-117)
- No change to fast path (already optimized in E3-1)
Mitigation:
- Test with Pool TLS workloads (8-52KB allocations)
- Test with Mid/Large workloads (64KB+ allocations)
- Verify classification hit rates in debug mode
Phase E3-3: Remove C7 Special Cases
Risk: ⚠️ LOW
- Code cleanup only (no algorithmic changes)
- Phase E1 already verified C7 works with headers
- All conditionals are redundant (dead code)
Mitigation:
- Test specifically with 1024B workload (C7 class)
- Run 1M iterations (comprehensive coverage)
- Check for any unexpected branches
Timeline
| Phase | Time | Cumulative |
|---|---|---|
| E3-1: Remove Registry Lookup | 10 min | 10 min |
| E3-1: Build & Test | 5 min | 15 min |
| E3-2: Header-First Classification | 15 min | 30 min |
| E3-2: Build & Test | 5 min | 35 min |
| E3-3: Remove C7 Special Cases | 30 min | 65 min |
| E3-3: Build & Test | 5 min | 70 min |
| Final Verification | 10 min | 80 min |
| TOTAL | - | ~1.5 hours |
Success Metrics
Performance (Primary)
✅ Phase E3-1 Success: ≥30M ops/s (all sizes) ✅ Phase E3-2 Success: ≥50M ops/s (all sizes) ✅ Phase E3-3 Success: ≥59M ops/s (target met!)
Stability (Critical)
✅ No SEGV: 1M iterations without crash ✅ No corruption: Memory integrity checks pass ✅ Multi-threaded: Larson 4T stable
Code Quality (Secondary)
✅ Reduced LOC: -50 lines (C7 special cases removed)
✅ Reduced branching: -10% branch-miss rate
✅ Unified code: Single base calculation (ptr - 1)
Next Actions
- Start with Phase E3-1 (highest ROI, lowest risk)
- Verify performance (should see 3-5x improvement immediately)
- Proceed to E3-2 (optional, +10-20% additional)
- Complete E3-3 (cleanup, +5-10% final boost)
- Update CLAUDE.md (document restoration success)
Ready to implement! 🚀