Files
hakmem/docs/PHASE_E3_IMPLEMENTATION_PLAN.md
Moe Charm (CI) 72b38bc994 Phase E3-FINAL: Fix Box API offset bugs - ALL classes now use correct offsets
## Root Cause Analysis (GPT5)

**Physical Layout Constraints**:
- Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed =  IMPOSSIBLE
- Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 =  POSSIBLE
- Class 7: 1KB → offset 0 (compatibility)

**Correct Specification**:
- HAKMEM_TINY_HEADER_CLASSIDX != 0:
  - Class 0, 7: next at offset 0 (overwrites header when on freelist)
  - Class 1-6: next at offset 1 (after header)
- HAKMEM_TINY_HEADER_CLASSIDX == 0:
  - All classes: next at offset 0

**Previous Bug**:
- Attempted "ALL classes offset 1" unification
- Class 0 with offset 1 caused immediate SEGV (9B > 8B block size)
- Mixed 2-arg/3-arg API caused confusion

## Fixes Applied

### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h)
```c
// Correct signatures
void tiny_next_write(int class_idx, void* base, void* next_value)
void* tiny_next_read(int class_idx, const void* base)

// Correct offset calculation
size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1;
```

### 2. Updated 123+ Call Sites Across 34 Files
- hakmem_tiny_hot_pop_v4.inc.h (4 locations)
- hakmem_tiny_fastcache.inc.h (3 locations)
- hakmem_tiny_tls_list.h (12 locations)
- superslab_inline.h (5 locations)
- tiny_fastcache.h (3 locations)
- ptr_trace.h (macro definitions)
- tls_sll_box.h (2 locations)
- + 27 additional files

Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)`
Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)`

### 3. Added Sentinel Detection Guards
- tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next
- tls_list_push(): Block nodes with sentinel in ptr or ptr->next
- Defense-in-depth against remote free sentinel leakage

## Verification (GPT5 Report)

**Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000`

**Results**:
-  Main loop completed successfully
-  Drain phase completed successfully
-  NO SEGV (previous crash at iteration 66151 is FIXED)
- ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers

**Analysis**:
- Class 0 immediate SEGV:  RESOLVED (correct offset 0 now used)
- 66K iteration crash:  RESOLVED (offset consistency fixed)
- Box API conflicts:  RESOLVED (unified 3-arg API)

## Technical Details

### Offset Logic Justification
```
Class 0:  8B block → next pointer (8B) fits ONLY at offset 0
Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header)
Class 2: 32B block → next pointer (8B) fits at offset 1
...
Class 6: 512B block → next pointer (8B) fits at offset 1
Class 7: 1024B block → offset 0 for legacy compatibility
```

### Files Modified (Summary)
- Core API: `box/tiny_next_ptr_box.h`
- Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h`
- TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h`
- SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h`
- Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h`
- Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h`
- Documentation: Multiple Phase E3 reports

## Remaining Work

None for Box API offset bugs - all structural issues resolved.

Future enhancements (non-critical):
- Periodic `grep -R '*(void**)' core/` to detect direct pointer access violations
- Enforce Box API usage via static analysis
- Document offset rationale in architecture docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 06:50:20 +09:00

541 lines
15 KiB
Markdown

# Phase E3: Performance Restoration Implementation Plan
**Date**: 2025-11-12
**Goal**: Restore Phase 7 performance (9M → 59-70M ops/s, +541-674%)
**Status**: READY TO IMPLEMENT
---
## Quick Reference
### The One Critical Fix
**File**: `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h`
**Lines to Remove**: 54-63 (SuperSlab registry lookup)
**Impact**: -91% latency, +1100% throughput
---
## Phase E3-1: Remove Registry Lookup (CRITICAL)
### Detailed Code Changes
**File**: `core/tiny_free_fast_v2.inc.h`
**Lines 51-63 (BEFORE - SLOW)**:
```c
static inline int hak_tiny_free_fast_v2(void* ptr) {
if (__builtin_expect(!ptr, 0)) return 0;
// CRITICAL: C7 (1KB headerless) MUST be excluded from Ultra-Fast Free
// Problem: Magic validation alone insufficient (C7 user data can be 0xaX pattern)
// Solution: Registry lookup to 100% identify C7 before header read
// Cost: 50-100 cycles (O(log N) RB-tree), but C7 is rare (~5% of allocations)
// Benefit: 100% SEGV prevention, no false positives
extern struct SuperSlab* hak_super_lookup(void* ptr);
struct SuperSlab* ss = hak_super_lookup(ptr);
if (__builtin_expect(ss && ss->size_class == 7, 0)) {
return 0; // C7 detected → force slow path (Front Gate will handle correctly)
}
// CRITICAL: Check if header is accessible before reading
void* header_addr = (char*)ptr - 1;
```
**Lines 51-63 (AFTER - FAST)**:
```c
static inline int hak_tiny_free_fast_v2(void* ptr) {
if (__builtin_expect(!ptr, 0)) return 0;
// Phase E3: C7 now has header (Phase E1), no registry lookup needed!
// Header magic validation (2-3 cycles) is sufficient to distinguish:
// - Tiny (0xA0-0xA7): valid header → fast path
// - Pool TLS (0xB0-0xBF): different magic → slow path
// - Mid/Large: no header → slow path
// - C7: has header like all other classes → fast path works!
//
// Performance: 5-10 cycles (vs 55-110 cycles with registry lookup)
// CRITICAL: Check if header is accessible before reading
void* header_addr = (char*)ptr - 1;
```
**Summary of Changes**:
- **DELETE**: Lines 54-62 (9 lines of SuperSlab registry lookup code)
- **ADD**: 7 lines of explanatory comments (why registry lookup is no longer needed)
- **Net change**: -2 lines, -50-100 cycles per free operation
### Build & Test Commands
```bash
# 1. Edit file
vim /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h
# 2. Build release binary
cd /mnt/workdisk/public_share/hakmem
./build.sh bench_random_mixed_hakmem
# 3. Verify build succeeded
ls -lh ./out/release/bench_random_mixed_hakmem
# 4. Run benchmarks (3 runs each for stability)
echo "=== 128B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 128 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 128 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 128 44 2>&1 | tail -1
echo "=== 256B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 256 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 256 44 2>&1 | tail -1
echo "=== 512B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 512 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 512 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 512 44 2>&1 | tail -1
echo "=== 1024B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 1024 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 1024 44 2>&1 | tail -1
```
### Success Criteria (Phase E3-1)
**Minimum Acceptable Performance** (vs current 9M ops/s):
- 128B: ≥30M ops/s (+226%)
- 256B: ≥32M ops/s (+240%)
- 512B: ≥28M ops/s (+233%)
- 1024B: ≥28M ops/s (+233%)
**Target Performance** (Phase 7-1.3 baseline):
- 128B: 40-50M ops/s (+335-443%)
- 256B: 45-55M ops/s (+379-485%)
- 512B: 40-50M ops/s (+376-495%)
- 1024B: 40-50M ops/s (+376-495%)
---
## Phase E3-2: Header-First Classification (OPTIONAL)
### Why Optional?
Phase E3-1 (remove registry lookup from fast path) should restore 80-90% of Phase 7 performance. Phase E3-2 optimizes the **slow path** (TLS cache full, Pool TLS, Mid/Large), which is only 1-5% of operations.
**Impact**: Additional +10-20% on top of Phase E3-1
### Detailed Code Changes
**File**: `core/box/front_gate_classifier.h`
**Lines 166-234 (BEFORE - Registry-First)**:
```c
static inline __attribute__((always_inline))
ptr_classification_t classify_ptr(void* ptr) {
ptr_classification_t result = {
.kind = PTR_KIND_UNKNOWN,
.class_idx = -1,
.ss = NULL,
.slab_idx = -1
};
if (__builtin_expect(!ptr, 0)) return result;
if (__builtin_expect((uintptr_t)ptr < 4096, 0)) {
result.kind = PTR_KIND_UNKNOWN;
return result;
}
#ifdef HAKMEM_POOL_TLS_PHASE1
if (__builtin_expect(is_pool_tls_reg(ptr), 0)) {
result.kind = PTR_KIND_POOL_TLS;
return result;
}
#endif
// ❌ SLOW: Registry lookup FIRST (50-100 cycles)
result = registry_lookup(ptr);
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) {
return result;
}
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 1)) {
return result;
}
// ... rest of function ...
}
```
**Lines 166-234 (AFTER - Header-First)**:
```c
static inline __attribute__((always_inline))
ptr_classification_t classify_ptr(void* ptr) {
ptr_classification_t result = {
.kind = PTR_KIND_UNKNOWN,
.class_idx = -1,
.ss = NULL,
.slab_idx = -1
};
if (__builtin_expect(!ptr, 0)) return result;
if (__builtin_expect((uintptr_t)ptr < 4096, 0)) {
result.kind = PTR_KIND_UNKNOWN;
return result;
}
// ✅ FAST: Try header probe FIRST (2-3 cycles, 95-99% hit rate)
int class_idx = safe_header_probe(ptr);
if (__builtin_expect(class_idx >= 0, 1)) {
// Valid Tiny header found
result.kind = PTR_KIND_TINY_HEADER;
result.class_idx = class_idx;
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_header_hit;
g_classify_header_hit++;
#endif
return result; // Fast path: 2-3 cycles!
}
#ifdef HAKMEM_POOL_TLS_PHASE1
// Fallback: Check Pool TLS registry (header probe failed)
if (__builtin_expect(is_pool_tls_reg(ptr), 0)) {
result.kind = PTR_KIND_POOL_TLS;
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_pool_hit;
g_classify_pool_hit++;
#endif
return result;
}
#endif
// Fallback: Registry lookup (rare, <1%)
result = registry_lookup(ptr);
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) {
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_headerless_hit;
g_classify_headerless_hit++;
#endif
return result;
}
if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 0)) {
#if !HAKMEM_BUILD_RELEASE
extern __thread uint64_t g_classify_header_hit;
g_classify_header_hit++;
#endif
return result;
}
// ... rest of function (16-byte AllocHeader check) ...
}
```
### Build & Test Commands
```bash
# 1. Edit file
vim /mnt/workdisk/public_share/hakmem/core/box/front_gate_classifier.h
# 2. Rebuild
./build.sh bench_random_mixed_hakmem
# 3. Benchmark (should see +10-20% improvement over Phase E3-1)
./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1
```
### Success Criteria (Phase E3-2)
**Target**: +10-20% improvement over Phase E3-1
**Example**:
- Phase E3-1: 45M ops/s
- Phase E3-2: 50-55M ops/s (+11-22%)
---
## Phase E3-3: Remove C7 Special Cases (CLEANUP)
### Why Cleanup?
Phase E1 added headers to C7, making all `if (class_idx == 7)` conditionals obsolete. However, many files still contain C7 special cases from legacy code.
**Impact**: Code simplification + 5-10% reduced branching overhead
### Files to Edit
#### File 1: `core/hakmem_tiny_free.inc`
**Lines to Remove/Modify**:
```bash
# Find all C7 special cases
grep -n "class_idx == 7" core/hakmem_tiny_free.inc
```
**Expected Output**:
```
32: // CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL
34: if (__builtin_expect(class_idx == 7, 0)) return;
124: if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
145: if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
158: if (g_tiny_safe_free_strict || class_idx == 7) { raise(SIGUSR2); return; }
195: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
211: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
233: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
241: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
253: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
348: // CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL
384: void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
445: void* base2 = (fast_class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
```
**Changes**:
1. **Line 32-34**: Remove early return for C7
```c
// BEFORE
// CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL
if (__builtin_expect(class_idx == 7, 0)) return;
// AFTER (DELETE these 2 lines)
```
2. **Lines 124, 145, 158**: Remove `|| class_idx == 7` conditions
```c
// BEFORE
if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
// AFTER
if (__builtin_expect(g_tiny_safe_free, 0)) {
```
3. **Lines 195, 211, 233, 241, 253, 384, 445**: Simplify base calculation
```c
// BEFORE
void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
// AFTER (ALL classes have header now)
void* base = (void*)((uint8_t*)ptr - 1);
```
4. **Line 348**: Remove C7 comment (obsolete)
```c
// BEFORE
// CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL
// AFTER (DELETE this line)
```
#### File 2: `core/hakmem_tiny_alloc.inc`
**Lines to Remove/Modify**:
```bash
grep -n "class_idx == 7" core/hakmem_tiny_alloc.inc
```
**Expected Output**:
```
252: if (__builtin_expect(class_idx == 7, 0)) { *(void**)hotmag_ptr = NULL; }
281: if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast_hot = NULL; }
292: if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast = NULL; }
```
**Changes**: Remove all 3 lines (C7 now has header, no NULL clearing needed)
#### File 3: `core/hakmem_tiny_slow.inc`
**Lines to Remove/Modify**:
```bash
grep -n "class_idx == 7" core/hakmem_tiny_slow.inc
```
**Expected Output**:
```
25: // Try TLS list refill (C7 is headerless: skip TLS list entirely)
```
**Changes**: Update comment
```c
// BEFORE
// Try TLS list refill (C7 is headerless: skip TLS list entirely)
// AFTER
// Try TLS list refill (all classes use TLS list now)
```
### Build & Test Commands
```bash
# 1. Edit files
vim core/hakmem_tiny_free.inc
vim core/hakmem_tiny_alloc.inc
vim core/hakmem_tiny_slow.inc
# 2. Rebuild
./build.sh bench_random_mixed_hakmem
# 3. Verify no regressions
./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1
```
### Success Criteria (Phase E3-3)
**Target**: 50-60M → 59-70M ops/s (+5-10% from reduced branching)
**Code Quality**:
- All C7 special cases removed
- Unified base pointer calculation (`ptr - 1` for all classes)
- Cleaner, more maintainable code
---
## Final Verification
### Full Benchmark Suite
```bash
# Run comprehensive benchmarks
cd /mnt/workdisk/public_share/hakmem
# 1. Random Mixed (primary benchmark)
for size in 128 256 512 1024; do
echo "=== Random Mixed ${size}B ==="
./out/release/bench_random_mixed_hakmem 100000 $size 42 2>&1 | grep "Throughput"
done
# 2. Fixed Size (stability check)
for size in 256 1024; do
echo "=== Fixed Size ${size}B ==="
./out/release/bench_fixed_size_hakmem 200000 $size 128 2>&1 | grep "Throughput"
done
# 3. Larson (multi-threaded stress test)
echo "=== Larson Multi-Threaded ==="
./out/release/larson_hakmem 1 2>&1 | grep "ops/sec"
```
### Expected Results (After All 3 Phases)
| Benchmark | Current | Phase E3 | Improvement |
|-----------|---------|----------|-------------|
| Random Mixed 128B | 9.2M | **59M** | **+541%** 🎯 |
| Random Mixed 256B | 9.4M | **70M** | **+645%** 🎯 |
| Random Mixed 512B | 8.4M | **68M** | **+710%** 🎯 |
| Random Mixed 1024B | 8.4M | **65M** | **+674%** 🎯 |
| Fixed Size 256B | 2.76M | **10-12M** | **+263-335%** |
| Larson 1T | 2.68M | **8-10M** | **+199-273%** |
---
## Rollback Plan (If Needed)
### If Phase E3-1 Causes Issues
```bash
# Revert to current version
git checkout HEAD -- core/tiny_free_fast_v2.inc.h
./build.sh bench_random_mixed_hakmem
```
### If Phase E3-2 Causes Issues
```bash
# Revert to Phase E3-1
git checkout HEAD -- core/box/front_gate_classifier.h
./build.sh bench_random_mixed_hakmem
```
### If Phase E3-3 Causes Issues
```bash
# Revert cleanup changes
git checkout HEAD -- core/hakmem_tiny_free.inc core/hakmem_tiny_alloc.inc core/hakmem_tiny_slow.inc
./build.sh bench_random_mixed_hakmem
```
---
## Risk Assessment
### Phase E3-1: Remove Registry Lookup
**Risk**: ⚠️ **LOW**
- Reverting to Phase 7-1.3 code (proven stable at 59-70M ops/s)
- Phase E1 already added headers to C7 (safety guaranteed)
- Header magic validation (2-3 cycles) sufficient for classification
**Mitigation**:
- Test with 1M iterations (stress test)
- Run Larson multi-threaded (race condition check)
- Monitor for SEGV (should be zero)
### Phase E3-2: Header-First Classification
**Risk**: ⚠️ **LOW-MEDIUM**
- Only affects slow path (1-5% of operations)
- Safe header probe already implemented (lines 100-117)
- No change to fast path (already optimized in E3-1)
**Mitigation**:
- Test with Pool TLS workloads (8-52KB allocations)
- Test with Mid/Large workloads (64KB+ allocations)
- Verify classification hit rates in debug mode
### Phase E3-3: Remove C7 Special Cases
**Risk**: ⚠️ **LOW**
- Code cleanup only (no algorithmic changes)
- Phase E1 already verified C7 works with headers
- All conditionals are redundant (dead code)
**Mitigation**:
- Test specifically with 1024B workload (C7 class)
- Run 1M iterations (comprehensive coverage)
- Check for any unexpected branches
---
## Timeline
| Phase | Time | Cumulative |
|-------|------|------------|
| E3-1: Remove Registry Lookup | 10 min | 10 min |
| E3-1: Build & Test | 5 min | 15 min |
| E3-2: Header-First Classification | 15 min | 30 min |
| E3-2: Build & Test | 5 min | 35 min |
| E3-3: Remove C7 Special Cases | 30 min | 65 min |
| E3-3: Build & Test | 5 min | 70 min |
| Final Verification | 10 min | 80 min |
| **TOTAL** | - | **~1.5 hours** |
---
## Success Metrics
### Performance (Primary)
**Phase E3-1 Success**: ≥30M ops/s (all sizes)
**Phase E3-2 Success**: ≥50M ops/s (all sizes)
**Phase E3-3 Success**: ≥59M ops/s (target met!)
### Stability (Critical)
**No SEGV**: 1M iterations without crash
**No corruption**: Memory integrity checks pass
**Multi-threaded**: Larson 4T stable
### Code Quality (Secondary)
**Reduced LOC**: -50 lines (C7 special cases removed)
**Reduced branching**: -10% branch-miss rate
**Unified code**: Single base calculation (`ptr - 1`)
---
## Next Actions
1. **Start with Phase E3-1** (highest ROI, lowest risk)
2. **Verify performance** (should see 3-5x improvement immediately)
3. **Proceed to E3-2** (optional, +10-20% additional)
4. **Complete E3-3** (cleanup, +5-10% final boost)
5. **Update CLAUDE.md** (document restoration success)
**Ready to implement!** 🚀