Files
hakmem/docs/PHASE_E3_IMPLEMENTATION_PLAN.md
Moe Charm (CI) 72b38bc994 Phase E3-FINAL: Fix Box API offset bugs - ALL classes now use correct offsets
## Root Cause Analysis (GPT5)

**Physical Layout Constraints**:
- Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed =  IMPOSSIBLE
- Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 =  POSSIBLE
- Class 7: 1KB → offset 0 (compatibility)

**Correct Specification**:
- HAKMEM_TINY_HEADER_CLASSIDX != 0:
  - Class 0, 7: next at offset 0 (overwrites header when on freelist)
  - Class 1-6: next at offset 1 (after header)
- HAKMEM_TINY_HEADER_CLASSIDX == 0:
  - All classes: next at offset 0

**Previous Bug**:
- Attempted "ALL classes offset 1" unification
- Class 0 with offset 1 caused immediate SEGV (9B > 8B block size)
- Mixed 2-arg/3-arg API caused confusion

## Fixes Applied

### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h)
```c
// Correct signatures
void tiny_next_write(int class_idx, void* base, void* next_value)
void* tiny_next_read(int class_idx, const void* base)

// Correct offset calculation
size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1;
```

### 2. Updated 123+ Call Sites Across 34 Files
- hakmem_tiny_hot_pop_v4.inc.h (4 locations)
- hakmem_tiny_fastcache.inc.h (3 locations)
- hakmem_tiny_tls_list.h (12 locations)
- superslab_inline.h (5 locations)
- tiny_fastcache.h (3 locations)
- ptr_trace.h (macro definitions)
- tls_sll_box.h (2 locations)
- + 27 additional files

Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)`
Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)`

### 3. Added Sentinel Detection Guards
- tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next
- tls_list_push(): Block nodes with sentinel in ptr or ptr->next
- Defense-in-depth against remote free sentinel leakage

## Verification (GPT5 Report)

**Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000`

**Results**:
-  Main loop completed successfully
-  Drain phase completed successfully
-  NO SEGV (previous crash at iteration 66151 is FIXED)
- ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers

**Analysis**:
- Class 0 immediate SEGV:  RESOLVED (correct offset 0 now used)
- 66K iteration crash:  RESOLVED (offset consistency fixed)
- Box API conflicts:  RESOLVED (unified 3-arg API)

## Technical Details

### Offset Logic Justification
```
Class 0:  8B block → next pointer (8B) fits ONLY at offset 0
Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header)
Class 2: 32B block → next pointer (8B) fits at offset 1
...
Class 6: 512B block → next pointer (8B) fits at offset 1
Class 7: 1024B block → offset 0 for legacy compatibility
```

### Files Modified (Summary)
- Core API: `box/tiny_next_ptr_box.h`
- Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h`
- TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h`
- SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h`
- Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h`
- Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h`
- Documentation: Multiple Phase E3 reports

## Remaining Work

None for Box API offset bugs - all structural issues resolved.

Future enhancements (non-critical):
- Periodic `grep -R '*(void**)' core/` to detect direct pointer access violations
- Enforce Box API usage via static analysis
- Document offset rationale in architecture docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 06:50:20 +09:00

15 KiB

Phase E3: Performance Restoration Implementation Plan

Date: 2025-11-12 Goal: Restore Phase 7 performance (9M → 59-70M ops/s, +541-674%) Status: READY TO IMPLEMENT


Quick Reference

The One Critical Fix

File: /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h Lines to Remove: 54-63 (SuperSlab registry lookup) Impact: -91% latency, +1100% throughput


Phase E3-1: Remove Registry Lookup (CRITICAL)

Detailed Code Changes

File: core/tiny_free_fast_v2.inc.h

Lines 51-63 (BEFORE - SLOW):

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return 0;

    // CRITICAL: C7 (1KB headerless) MUST be excluded from Ultra-Fast Free
    // Problem: Magic validation alone insufficient (C7 user data can be 0xaX pattern)
    // Solution: Registry lookup to 100% identify C7 before header read
    // Cost: 50-100 cycles (O(log N) RB-tree), but C7 is rare (~5% of allocations)
    // Benefit: 100% SEGV prevention, no false positives
    extern struct SuperSlab* hak_super_lookup(void* ptr);
    struct SuperSlab* ss = hak_super_lookup(ptr);
    if (__builtin_expect(ss && ss->size_class == 7, 0)) {
        return 0;  // C7 detected → force slow path (Front Gate will handle correctly)
    }

    // CRITICAL: Check if header is accessible before reading
    void* header_addr = (char*)ptr - 1;

Lines 51-63 (AFTER - FAST):

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return 0;

    // Phase E3: C7 now has header (Phase E1), no registry lookup needed!
    // Header magic validation (2-3 cycles) is sufficient to distinguish:
    // - Tiny (0xA0-0xA7): valid header → fast path
    // - Pool TLS (0xB0-0xBF): different magic → slow path
    // - Mid/Large: no header → slow path
    // - C7: has header like all other classes → fast path works!
    //
    // Performance: 5-10 cycles (vs 55-110 cycles with registry lookup)

    // CRITICAL: Check if header is accessible before reading
    void* header_addr = (char*)ptr - 1;

Summary of Changes:

  • DELETE: Lines 54-62 (9 lines of SuperSlab registry lookup code)
  • ADD: 7 lines of explanatory comments (why registry lookup is no longer needed)
  • Net change: -2 lines, -50-100 cycles per free operation

Build & Test Commands

# 1. Edit file
vim /mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h

# 2. Build release binary
cd /mnt/workdisk/public_share/hakmem
./build.sh bench_random_mixed_hakmem

# 3. Verify build succeeded
ls -lh ./out/release/bench_random_mixed_hakmem

# 4. Run benchmarks (3 runs each for stability)
echo "=== 128B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 128 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 128 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 128 44 2>&1 | tail -1

echo "=== 256B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 256 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 256 44 2>&1 | tail -1

echo "=== 512B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 512 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 512 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 512 44 2>&1 | tail -1

echo "=== 1024B Benchmark ==="
./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 1024 43 2>&1 | tail -1
./out/release/bench_random_mixed_hakmem 100000 1024 44 2>&1 | tail -1

Success Criteria (Phase E3-1)

Minimum Acceptable Performance (vs current 9M ops/s):

  • 128B: ≥30M ops/s (+226%)
  • 256B: ≥32M ops/s (+240%)
  • 512B: ≥28M ops/s (+233%)
  • 1024B: ≥28M ops/s (+233%)

Target Performance (Phase 7-1.3 baseline):

  • 128B: 40-50M ops/s (+335-443%)
  • 256B: 45-55M ops/s (+379-485%)
  • 512B: 40-50M ops/s (+376-495%)
  • 1024B: 40-50M ops/s (+376-495%)

Phase E3-2: Header-First Classification (OPTIONAL)

Why Optional?

Phase E3-1 (remove registry lookup from fast path) should restore 80-90% of Phase 7 performance. Phase E3-2 optimizes the slow path (TLS cache full, Pool TLS, Mid/Large), which is only 1-5% of operations.

Impact: Additional +10-20% on top of Phase E3-1

Detailed Code Changes

File: core/box/front_gate_classifier.h

Lines 166-234 (BEFORE - Registry-First):

static inline __attribute__((always_inline))
ptr_classification_t classify_ptr(void* ptr) {
    ptr_classification_t result = {
        .kind = PTR_KIND_UNKNOWN,
        .class_idx = -1,
        .ss = NULL,
        .slab_idx = -1
    };

    if (__builtin_expect(!ptr, 0)) return result;
    if (__builtin_expect((uintptr_t)ptr < 4096, 0)) {
        result.kind = PTR_KIND_UNKNOWN;
        return result;
    }

#ifdef HAKMEM_POOL_TLS_PHASE1
    if (__builtin_expect(is_pool_tls_reg(ptr), 0)) {
        result.kind = PTR_KIND_POOL_TLS;
        return result;
    }
#endif

    // ❌ SLOW: Registry lookup FIRST (50-100 cycles)
    result = registry_lookup(ptr);
    if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) {
        return result;
    }
    if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 1)) {
        return result;
    }

    // ... rest of function ...
}

Lines 166-234 (AFTER - Header-First):

static inline __attribute__((always_inline))
ptr_classification_t classify_ptr(void* ptr) {
    ptr_classification_t result = {
        .kind = PTR_KIND_UNKNOWN,
        .class_idx = -1,
        .ss = NULL,
        .slab_idx = -1
    };

    if (__builtin_expect(!ptr, 0)) return result;
    if (__builtin_expect((uintptr_t)ptr < 4096, 0)) {
        result.kind = PTR_KIND_UNKNOWN;
        return result;
    }

    // ✅ FAST: Try header probe FIRST (2-3 cycles, 95-99% hit rate)
    int class_idx = safe_header_probe(ptr);
    if (__builtin_expect(class_idx >= 0, 1)) {
        // Valid Tiny header found
        result.kind = PTR_KIND_TINY_HEADER;
        result.class_idx = class_idx;
#if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_classify_header_hit;
        g_classify_header_hit++;
#endif
        return result;  // Fast path: 2-3 cycles!
    }

#ifdef HAKMEM_POOL_TLS_PHASE1
    // Fallback: Check Pool TLS registry (header probe failed)
    if (__builtin_expect(is_pool_tls_reg(ptr), 0)) {
        result.kind = PTR_KIND_POOL_TLS;
#if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_classify_pool_hit;
        g_classify_pool_hit++;
#endif
        return result;
    }
#endif

    // Fallback: Registry lookup (rare, <1%)
    result = registry_lookup(ptr);
    if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADERLESS, 0)) {
#if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_classify_headerless_hit;
        g_classify_headerless_hit++;
#endif
        return result;
    }
    if (__builtin_expect(result.kind == PTR_KIND_TINY_HEADER, 0)) {
#if !HAKMEM_BUILD_RELEASE
        extern __thread uint64_t g_classify_header_hit;
        g_classify_header_hit++;
#endif
        return result;
    }

    // ... rest of function (16-byte AllocHeader check) ...
}

Build & Test Commands

# 1. Edit file
vim /mnt/workdisk/public_share/hakmem/core/box/front_gate_classifier.h

# 2. Rebuild
./build.sh bench_random_mixed_hakmem

# 3. Benchmark (should see +10-20% improvement over Phase E3-1)
./out/release/bench_random_mixed_hakmem 100000 256 42 2>&1 | tail -1

Success Criteria (Phase E3-2)

Target: +10-20% improvement over Phase E3-1

Example:

  • Phase E3-1: 45M ops/s
  • Phase E3-2: 50-55M ops/s (+11-22%)

Phase E3-3: Remove C7 Special Cases (CLEANUP)

Why Cleanup?

Phase E1 added headers to C7, making all if (class_idx == 7) conditionals obsolete. However, many files still contain C7 special cases from legacy code.

Impact: Code simplification + 5-10% reduced branching overhead

Files to Edit

File 1: core/hakmem_tiny_free.inc

Lines to Remove/Modify:

# Find all C7 special cases
grep -n "class_idx == 7" core/hakmem_tiny_free.inc

Expected Output:

32:    // CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL
34:    if (__builtin_expect(class_idx == 7, 0)) return;
124:        if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
145:        if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {
158:                if (g_tiny_safe_free_strict || class_idx == 7) { raise(SIGUSR2); return; }
195:            void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
211:            void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
233:                void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
241:                void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
253:                void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
348:        // CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL
384:                    void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);
445:        void* base2 = (fast_class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);

Changes:

  1. Line 32-34: Remove early return for C7
// BEFORE
// CRITICAL: C7 (1KB) is headerless - MUST NOT drain to TLS SLL
if (__builtin_expect(class_idx == 7, 0)) return;

// AFTER (DELETE these 2 lines)
  1. Lines 124, 145, 158: Remove || class_idx == 7 conditions
// BEFORE
if (__builtin_expect(g_tiny_safe_free || class_idx == 7, 0)) {

// AFTER
if (__builtin_expect(g_tiny_safe_free, 0)) {
  1. Lines 195, 211, 233, 241, 253, 384, 445: Simplify base calculation
// BEFORE
void* base = (class_idx == 7) ? ptr : (void*)((uint8_t*)ptr - 1);

// AFTER (ALL classes have header now)
void* base = (void*)((uint8_t*)ptr - 1);
  1. Line 348: Remove C7 comment (obsolete)
// BEFORE
// CRITICAL: C7 (1KB) is headerless - MUST NOT use TLS SLL

// AFTER (DELETE this line)

File 2: core/hakmem_tiny_alloc.inc

Lines to Remove/Modify:

grep -n "class_idx == 7" core/hakmem_tiny_alloc.inc

Expected Output:

252:                if (__builtin_expect(class_idx == 7, 0)) { *(void**)hotmag_ptr = NULL; }
281:                if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast_hot = NULL; }
292:            if (__builtin_expect(class_idx == 7, 0)) { *(void**)fast = NULL; }

Changes: Remove all 3 lines (C7 now has header, no NULL clearing needed)

File 3: core/hakmem_tiny_slow.inc

Lines to Remove/Modify:

grep -n "class_idx == 7" core/hakmem_tiny_slow.inc

Expected Output:

25:    // Try TLS list refill (C7 is headerless: skip TLS list entirely)

Changes: Update comment

// BEFORE
// Try TLS list refill (C7 is headerless: skip TLS list entirely)

// AFTER
// Try TLS list refill (all classes use TLS list now)

Build & Test Commands

# 1. Edit files
vim core/hakmem_tiny_free.inc
vim core/hakmem_tiny_alloc.inc
vim core/hakmem_tiny_slow.inc

# 2. Rebuild
./build.sh bench_random_mixed_hakmem

# 3. Verify no regressions
./out/release/bench_random_mixed_hakmem 100000 1024 42 2>&1 | tail -1

Success Criteria (Phase E3-3)

Target: 50-60M → 59-70M ops/s (+5-10% from reduced branching)

Code Quality:

  • All C7 special cases removed
  • Unified base pointer calculation (ptr - 1 for all classes)
  • Cleaner, more maintainable code

Final Verification

Full Benchmark Suite

# Run comprehensive benchmarks
cd /mnt/workdisk/public_share/hakmem

# 1. Random Mixed (primary benchmark)
for size in 128 256 512 1024; do
    echo "=== Random Mixed ${size}B ==="
    ./out/release/bench_random_mixed_hakmem 100000 $size 42 2>&1 | grep "Throughput"
done

# 2. Fixed Size (stability check)
for size in 256 1024; do
    echo "=== Fixed Size ${size}B ==="
    ./out/release/bench_fixed_size_hakmem 200000 $size 128 2>&1 | grep "Throughput"
done

# 3. Larson (multi-threaded stress test)
echo "=== Larson Multi-Threaded ==="
./out/release/larson_hakmem 1 2>&1 | grep "ops/sec"

Expected Results (After All 3 Phases)

Benchmark Current Phase E3 Improvement
Random Mixed 128B 9.2M 59M +541% 🎯
Random Mixed 256B 9.4M 70M +645% 🎯
Random Mixed 512B 8.4M 68M +710% 🎯
Random Mixed 1024B 8.4M 65M +674% 🎯
Fixed Size 256B 2.76M 10-12M +263-335%
Larson 1T 2.68M 8-10M +199-273%

Rollback Plan (If Needed)

If Phase E3-1 Causes Issues

# Revert to current version
git checkout HEAD -- core/tiny_free_fast_v2.inc.h
./build.sh bench_random_mixed_hakmem

If Phase E3-2 Causes Issues

# Revert to Phase E3-1
git checkout HEAD -- core/box/front_gate_classifier.h
./build.sh bench_random_mixed_hakmem

If Phase E3-3 Causes Issues

# Revert cleanup changes
git checkout HEAD -- core/hakmem_tiny_free.inc core/hakmem_tiny_alloc.inc core/hakmem_tiny_slow.inc
./build.sh bench_random_mixed_hakmem

Risk Assessment

Phase E3-1: Remove Registry Lookup

Risk: ⚠️ LOW

  • Reverting to Phase 7-1.3 code (proven stable at 59-70M ops/s)
  • Phase E1 already added headers to C7 (safety guaranteed)
  • Header magic validation (2-3 cycles) sufficient for classification

Mitigation:

  • Test with 1M iterations (stress test)
  • Run Larson multi-threaded (race condition check)
  • Monitor for SEGV (should be zero)

Phase E3-2: Header-First Classification

Risk: ⚠️ LOW-MEDIUM

  • Only affects slow path (1-5% of operations)
  • Safe header probe already implemented (lines 100-117)
  • No change to fast path (already optimized in E3-1)

Mitigation:

  • Test with Pool TLS workloads (8-52KB allocations)
  • Test with Mid/Large workloads (64KB+ allocations)
  • Verify classification hit rates in debug mode

Phase E3-3: Remove C7 Special Cases

Risk: ⚠️ LOW

  • Code cleanup only (no algorithmic changes)
  • Phase E1 already verified C7 works with headers
  • All conditionals are redundant (dead code)

Mitigation:

  • Test specifically with 1024B workload (C7 class)
  • Run 1M iterations (comprehensive coverage)
  • Check for any unexpected branches

Timeline

Phase Time Cumulative
E3-1: Remove Registry Lookup 10 min 10 min
E3-1: Build & Test 5 min 15 min
E3-2: Header-First Classification 15 min 30 min
E3-2: Build & Test 5 min 35 min
E3-3: Remove C7 Special Cases 30 min 65 min
E3-3: Build & Test 5 min 70 min
Final Verification 10 min 80 min
TOTAL - ~1.5 hours

Success Metrics

Performance (Primary)

Phase E3-1 Success: ≥30M ops/s (all sizes) Phase E3-2 Success: ≥50M ops/s (all sizes) Phase E3-3 Success: ≥59M ops/s (target met!)

Stability (Critical)

No SEGV: 1M iterations without crash No corruption: Memory integrity checks pass Multi-threaded: Larson 4T stable

Code Quality (Secondary)

Reduced LOC: -50 lines (C7 special cases removed) Reduced branching: -10% branch-miss rate Unified code: Single base calculation (ptr - 1)


Next Actions

  1. Start with Phase E3-1 (highest ROI, lowest risk)
  2. Verify performance (should see 3-5x improvement immediately)
  3. Proceed to E3-2 (optional, +10-20% additional)
  4. Complete E3-3 (cleanup, +5-10% final boost)
  5. Update CLAUDE.md (document restoration success)

Ready to implement! 🚀