hakmem/docs/PHASE_E2_VISUAL_COMPARISON.md
Moe Charm (CI) 72b38bc994 Phase E3-FINAL: Fix Box API offset bugs - ALL classes now use correct offsets
## Root Cause Analysis (GPT5)

**Physical Layout Constraints**:
- Class 0: 8B = [1B header][7B payload] → an 8B next pointer at offset 1 needs 9B = ❌ IMPOSSIBLE
- Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 = ✅ POSSIBLE
- Class 7: 1KB → offset 0 (compatibility)

**Correct Specification**:
- HAKMEM_TINY_HEADER_CLASSIDX != 0:
  - Class 0, 7: next at offset 0 (overwrites header when on freelist)
  - Class 1-6: next at offset 1 (after header)
- HAKMEM_TINY_HEADER_CLASSIDX == 0:
  - All classes: next at offset 0

**Previous Bug**:
- Attempted "ALL classes offset 1" unification
- Class 0 with offset 1 caused immediate SEGV (9B > 8B block size)
- Mixed 2-arg/3-arg API caused confusion

## Fixes Applied

### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h)
```c
// Correct signatures
void tiny_next_write(int class_idx, void* base, void* next_value);
void* tiny_next_read(int class_idx, const void* base);

// Correct offset calculation (used inside both functions)
size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1;
```
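
As a rough illustration of the specification above, the 3-argument API could look like the following. This is a minimal sketch, not the contents of `tiny_next_ptr_box.h`: the `_sketch` names are hypothetical, the default of `1` for `HAKMEM_TINY_HEADER_CLASSIDX` is an assumption, and `memcpy` is used here only because a link stored at offset 1 is not pointer-aligned.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: header mode is enabled unless the build defines otherwise. */
#ifndef HAKMEM_TINY_HEADER_CLASSIDX
#define HAKMEM_TINY_HEADER_CLASSIDX 1
#endif

/* Byte offset of the freelist link inside a block, per the specification. */
static inline size_t tiny_next_offset_sketch(int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
    return (class_idx == 0 || class_idx == 7) ? 0u : 1u;
#else
    (void)class_idx;
    return 0u;
#endif
}

/* Store next_value as the link of the block starting at base. */
static inline void tiny_next_write_sketch(int class_idx, void *base, void *next_value) {
    unsigned char *p = (unsigned char *)base + tiny_next_offset_sketch(class_idx);
    memcpy(p, &next_value, sizeof next_value);   /* safe for unaligned offsets */
}

/* Load the link of the block starting at base. */
static inline void *tiny_next_read_sketch(int class_idx, const void *base) {
    const unsigned char *p = (const unsigned char *)base + tiny_next_offset_sketch(class_idx);
    void *next;
    memcpy(&next, p, sizeof next);
    return next;
}
```

Passing `class_idx` on every call is what lets a single helper choose offset 0 or 1, which the 2-argument form could not do.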

### 2. Updated 123+ Call Sites Across 34 Files
- hakmem_tiny_hot_pop_v4.inc.h (4 locations)
- hakmem_tiny_fastcache.inc.h (3 locations)
- hakmem_tiny_tls_list.h (12 locations)
- superslab_inline.h (5 locations)
- tiny_fastcache.h (3 locations)
- ptr_trace.h (macro definitions)
- tls_sll_box.h (2 locations)
- + 27 additional files

Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)`
Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)`

### 3. Added Sentinel Detection Guards
- tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next
- tls_list_push(): Block nodes with sentinel in ptr or ptr->next
- Defense-in-depth against remote free sentinel leakage
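
The guard described above might be expressed as a small predicate that a push function calls before linking a node. This is a sketch under assumptions: the commit does not state the sentinel value, so `REMOTE_SENTINEL_SKETCH` and the function name are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel: the real remote-free sentinel value is not given here. */
#define REMOTE_SENTINEL_SKETCH ((void *)(uintptr_t)0xBADA55u)

/* Returns true only when ptr is non-NULL and neither ptr nor its stored
   link equals the sentinel; a push function would reject the node otherwise. */
static bool node_is_pushable_sketch(const void *ptr, const void *next) {
    if (ptr == NULL) return false;
    if (ptr == REMOTE_SENTINEL_SKETCH) return false;
    if (next == REMOTE_SENTINEL_SKETCH) return false;
    return true;
}
```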

## Verification (GPT5 Report)

**Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000`

**Results**:
- ✅ Main loop completed successfully
- ✅ Drain phase completed successfully
- ✅ NO SEGV (previous crash at iteration 66151 is FIXED)
- ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers

**Analysis**:
- Class 0 immediate SEGV: ✅ RESOLVED (correct offset 0 now used)
- 66K iteration crash: ✅ RESOLVED (offset consistency fixed)
- Box API conflicts: ✅ RESOLVED (unified 3-arg API)

## Technical Details

### Offset Logic Justification
```
Class 0:  8B block → next pointer (8B) fits ONLY at offset 0
Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header)
Class 2: 32B block → next pointer (8B) fits at offset 1
...
Class 6: 512B block → next pointer (8B) fits at offset 1
Class 7: 1024B block → offset 0 for legacy compatibility
```
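
The fit rule behind this table can be written as a one-line check. This is a sketch that assumes 8-byte pointers, as the table does. The helper name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 8-byte next pointers, matching the table above. */
enum { NEXT_PTR_BYTES = 8 };

/* A link stored at `offset` fits when it ends inside the block. */
static bool next_ptr_fits_sketch(size_t block_size, size_t offset) {
    return offset + NEXT_PTR_BYTES <= block_size;
}
```

For Class 0, `next_ptr_fits_sketch(8, 1)` is false (1 + 8 = 9 > 8), which is the overrun behind the immediate SEGV, while `next_ptr_fits_sketch(8, 0)` is true.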

### Files Modified (Summary)
- Core API: `box/tiny_next_ptr_box.h`
- Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h`
- TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h`
- SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h`
- Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h`
- Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h`
- Documentation: Multiple Phase E3 reports

## Remaining Work

None for Box API offset bugs - all structural issues resolved.

Future enhancements (non-critical):
- Periodic `grep -RF '*(void**)' core/` (fixed-string match) to detect direct pointer access that bypasses the Box API
- Enforce Box API usage via static analysis
- Document offset rationale in architecture docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 06:50:20 +09:00


Phase E2: Visual Performance Comparison

Date: 2025-11-12


Performance Timeline

Phase 7 Peak (Nov 8)          Phase E1 (Nov 12)         Phase E3 Target
     ↓                             ↓                          ↓
 ┌─────────┐                   ┌─────────┐               ┌─────────┐
 │  59-70M │ ──────────────→   │   9M    │ ──────────→   │  59-70M │
 │  ops/s  │     Regression    │  ops/s  │   Phase E3    │  ops/s  │
 └─────────┘        85%        └─────────┘   +541-710%   └─────────┘
     🏆                             😱                         🎯

Free Path Cycle Comparison

Phase 7-1.3 (FAST - 5-10 cycles)

┌─────────────────────────────────────────────────────────────┐
│  hak_tiny_free_fast_v2(ptr)                                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. NULL check              [1 cycle]                       │
│  2. Page boundary check     [1-2 cycles]  ← 99.9% skip     │
│  3. Read header (ptr-1)     [2-3 cycles]  ← L1 cache       │
│  4. Validate magic          [included]                      │
│  5. TLS freelist push       [3-5 cycles]  ← 4 instructions │
│                                                             │
│  TOTAL: 5-10 cycles ✅                                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Current (SLOW - 55-110 cycles)

┌─────────────────────────────────────────────────────────────┐
│  hak_tiny_free_fast_v2(ptr)                                 │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. NULL check              [1 cycle]                       │
│  ❌ 2. Registry lookup       [50-100 cycles] ← O(log N)     │
│     └─> hak_super_lookup()                                  │
│         └─> RB-tree search                                  │
│         └─> Multiple pointer dereferences                   │
│         └─> Cache misses likely                             │
│  3. Page boundary check     [1-2 cycles]                    │
│  4. Read header (ptr-1)     [2-3 cycles]                    │
│  5. Validate magic          [included]                      │
│  6. TLS freelist push       [3-5 cycles]                    │
│                                                             │
│  TOTAL: 55-110 cycles ❌ (10x slower!)                      │
│                                                             │
└─────────────────────────────────────────────────────────────┘

The Problem Visualized

Commit 5eabb89ad9 Added This:

// Lines 54-62 in core/tiny_free_fast_v2.inc.h

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (!ptr) return 0;

    // ❌ THE BOTTLENECK (50-100 cycles)
    // This block is UNNECESSARY because Phase E1 already added headers to C7!
    extern struct SuperSlab* hak_super_lookup(void* ptr);
    struct SuperSlab* ss = hak_super_lookup(ptr);
    if (ss && ss->size_class == 7) {
        return 0;  // C7 detected → slow path
    }

    // ... rest of function (fast path) ...
}

Why It's Unnecessary:

Phase E1 (Commit baaf815c9):
┌─────────────────────────────────────────────────────────────┐
│  ALL classes (C0-C7) now have 1-byte header                │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  C0 (16B):   [0xA0] [user data: 15B]                       │
│  C1 (32B):   [0xA1] [user data: 31B]                       │
│  C2 (64B):   [0xA2] [user data: 63B]                       │
│  C3 (128B):  [0xA3] [user data: 127B]                      │
│  C4 (256B):  [0xA4] [user data: 255B]                      │
│  C5 (512B):  [0xA5] [user data: 511B]                      │
│  C6 (768B):  [0xA6] [user data: 767B]                      │
│  C7 (1024B): [0xA7] [user data: 1023B] ← HAS HEADER NOW!   │
│                                                             │
│  Header magic 0xA0 distinguishes from:                     │
│  - Pool TLS: 0xB0                                           │
│  - Mid/Large: no header (magic check fails)                │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Therefore: Registry lookup is REDUNDANT!
           Header validation (2-3 cycles) is SUFFICIENT!

Performance Impact by Size

128B Allocations

Phase 7:    ████████████████████████████████████████████████████████  59M ops/s
Current:    ████████  9.2M ops/s
Phase E3:   ████████████████████████████████████████████████████████  59M ops/s (target)

Regression: -84% | Recovery: +541%

256B Allocations

Phase 7:    ██████████████████████████████████████████████████████████████  70M ops/s
Current:    ████████  9.4M ops/s
Phase E3:   ██████████████████████████████████████████████████████████████  70M ops/s (target)

Regression: -87% | Recovery: +645%

512B Allocations

Phase 7:    ███████████████████████████████████████████████████████████  68M ops/s
Current:    ███████  8.4M ops/s
Phase E3:   ███████████████████████████████████████████████████████████  68M ops/s (target)

Regression: -88% | Recovery: +710%

1024B Allocations (C7)

Phase 7:    █████████████████████████████████████████████████████████  65M ops/s
Current:    ███████  8.4M ops/s
Phase E3:   █████████████████████████████████████████████████████████  65M ops/s (target)

Regression: -87% | Recovery: +674%
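
The regression and recovery figures in this section use different baselines, which is why the same pair of numbers gives roughly -85% one way and +541% or more the other. A minimal sketch of the arithmetic (helper names are hypothetical, throughput in M ops/s):

```c
/* Percentage drop, measured against the earlier (higher) throughput. */
static double regression_pct_sketch(double before_mops, double after_mops) {
    return (before_mops - after_mops) / before_mops * 100.0;
}

/* Percentage gain needed, measured against the current (lower) throughput. */
static double recovery_pct_sketch(double current_mops, double target_mops) {
    return (target_mops / current_mops - 1.0) * 100.0;
}
```

For the 128B case, `recovery_pct_sketch(9.2, 59.0)` evaluates to about 541, and for 1024B, `recovery_pct_sketch(8.4, 65.0)` evaluates to about 674.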

Call Graph Comparison

Phase 7 (Fast Path - 95-99% hit rate)

hak_free_at()
  └─> hak_tiny_free_fast_v2()      [5-10 cycles]
        ├─> Page boundary check    [1-2 cycles, 99.9% skip]
        ├─> Header read (ptr-1)    [2-3 cycles, L1 hit]
        ├─> Magic validation       [included in read]
        └─> TLS freelist push      [3-5 cycles]
           └─> *(void**)base = head
           └─> head = base
           └─> count++
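
The three statements under "TLS freelist push" amount to a LIFO push onto a singly linked list whose link lives inside the freed block. The sketch below is self-contained and stores the link at offset 0, exactly as the call graph writes it. The struct, its field names, and the capacity check are illustrative assumptions, not the HAKMEM definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-thread, per-class list state. */
typedef struct {
    void    *head;
    uint32_t count;
    uint32_t cap;
} tls_list_sketch;

/* Push: store the old head inside the block, then make the block the head. */
static int tls_push_sketch(tls_list_sketch *l, void *base) {
    if (l->count >= l->cap) return 0;         /* full: caller takes the slow path */
    memcpy(base, &l->head, sizeof l->head);   /* *(void**)base = head */
    l->head = base;                           /* head = base */
    l->count++;                               /* count++ */
    return 1;
}

/* Pop: the inverse, returning NULL when the list is empty. */
static void *tls_pop_sketch(tls_list_sketch *l) {
    void *base = l->head;
    if (base == NULL) return NULL;
    memcpy(&l->head, base, sizeof l->head);   /* head = *(void**)base */
    l->count--;
    return base;
}
```

Because the push touches only the block itself and two thread-local fields, it needs no locks or atomics, which is consistent with the 3-5 cycle estimate above.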

Current (Bottlenecked - 95-99% hit rate, but SLOW)

hak_free_at()
  └─> hak_tiny_free_fast_v2()      [55-110 cycles] ❌
        ├─> Registry lookup        [50-100 cycles] ❌
        │     └─> hak_super_lookup()
        │           ├─> RB-tree search (O(log N))
        │           ├─> Multiple dereferences
        │           └─> Cache misses
        ├─> Page boundary check    [1-2 cycles]
        ├─> Header read (ptr-1)    [2-3 cycles]
        ├─> Magic validation       [included]
        └─> TLS freelist push      [3-5 cycles]

Cycle Budget Breakdown

Phase 7-1.3 (Target)

Operation                  Cycles    Frequency    Weighted
────────────────────────────────────────────────────────────
NULL check                 1         100%         1
Page boundary check        1-2       0.1%         0.002
Header read                2-3       100%         3
TLS freelist push          3-5       100%         4
────────────────────────────────────────────────────────────
TOTAL (Fast Path)          5-10      95-99%       8
────────────────────────────────────────────────────────────
Slow path fallback         500+      1-5%         5-25
────────────────────────────────────────────────────────────
WEIGHTED AVERAGE           ~13-33 cycles/free

Throughput (3.0 GHz CPU):

  • Free latency: ~13-33 cycles = 4-11 ns
  • Mixed (50% alloc/free): ~8-22 ns per op
  • Throughput: ~45-125M ops/s per core
  • Multi-core (4 cores, 50% efficiency): 45-60M ops/s

Current (Bottlenecked)

Operation                  Cycles    Frequency    Weighted
────────────────────────────────────────────────────────────
NULL check                 1         100%         1
Registry lookup ❌          50-100    100%         75
Page boundary check        1-2       0.1%         0.002
Header read                2-3       100%         3
TLS freelist push          3-5       100%         4
────────────────────────────────────────────────────────────
TOTAL (Fast Path)          55-110    95-99%       83
────────────────────────────────────────────────────────────
Slow path fallback         500+      1-5%         5-25
────────────────────────────────────────────────────────────
WEIGHTED AVERAGE           ~88-108 cycles/free ❌

Throughput (3.0 GHz CPU):

  • Free latency: ~88-108 cycles = 29-36 ns
  • Mixed (50% alloc/free): ~58-72 ns per op
  • Throughput: ~14-17M ops/s per core
  • Multi-core (4 cores, 50% efficiency): 7-9M ops/s
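
The weighted averages and latencies in the two budget tables can be reproduced with a few lines of arithmetic. This sketch follows the tables' own simplification (fast-path cost plus slow-path cost times slow-path rate). The helper names are hypothetical.

```c
/* Weighted cycles per free: fast-path cost plus the slow-path cost
   scaled by how often the slow path is taken. */
static double weighted_cycles_sketch(double fast_cycles, double slow_cycles, double slow_rate) {
    return fast_cycles + slow_cycles * slow_rate;
}

/* Convert a cycle count to nanoseconds at a clock given in GHz. */
static double cycles_to_ns_sketch(double cycles, double ghz) {
    return cycles / ghz;
}

/* Millions of operations per second for a given latency per operation. */
static double mops_from_ns_sketch(double ns_per_op) {
    return 1000.0 / ns_per_op;
}
```

With an 8-cycle fast path and a 500-cycle slow path taken 1-5% of the time, `weighted_cycles_sketch` gives 13-33 cycles. With the 83-cycle bottlenecked path it gives 88-108 cycles, or about 29-36 ns at 3.0 GHz.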

Memory Layout: Why Header Validation Is Sufficient

Tiny Allocation (C0-C7)

  Base ptr          User ptr (returned)
     ↓                    ↓
┌────────┬──────────────────────────────────────┐
│ Header │        User Data                     │
│  0xAX  │        (N-1 bytes)                   │
└────────┴──────────────────────────────────────┘
   1 byte          User allocation

Header format: 0xAX where X = class_idx (0-7)
- C0: 0xA0 (16B)
- C1: 0xA1 (32B)
- ...
- C7: 0xA7 (1024B) ← HAS HEADER SINCE PHASE E1!

Pool TLS Allocation (8KB-52KB)

  Base ptr          User ptr (returned)
     ↓                    ↓
┌────────┬──────────────────────────────────────┐
│ Header │        User Data                     │
│  0xBX  │        (N-1 bytes)                   │
└────────┴──────────────────────────────────────┘
   1 byte          User allocation

Header format: 0xBX where X = pool class (0-15)

Mid/Large Allocation (64KB+)

  Base ptr          User ptr (returned)
     ↓                    ↓
┌────────────────┬─────────────────────────────┐
│  AllocHeader   │        User Data             │
│   (16 bytes)   │        (N bytes)             │
│  magic = 0x... │                              │
└────────────────┴─────────────────────────────┘
   16 bytes            User allocation

External Allocation (libc malloc)

  User ptr (returned)
     ↓
┌────────────────────────────────────┐
│        User Data                   │
│        (no header)                 │
└────────────────────────────────────┘

Byte at ptr-1: arbitrary data, not a HAKMEM header (expected NOT to match 0xA0)

Classification Logic

// Read the header byte at ptr-1 (cast first: arithmetic on void* is a GNU extension)
uint8_t header = *((const uint8_t*)ptr - 1);
uint8_t magic = header & 0xF0;

if (magic == 0xA0) {
    // Tiny allocation (C0-C7); the low nibble is the class index
    int class_idx = header & 0x0F;
    return TINY_HEADER;  // Fast path: 2-3 cycles ✅
}

if (magic == 0xB0) {
    // Pool TLS allocation
    return POOL_TLS;  // Slow path: fallback
}

// No valid header
return UNKNOWN;  // Slow path: check 16-byte AllocHeader

Result: Header magic alone is sufficient! No registry lookup needed!
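
To make the layout and the classification concrete, here is a self-contained round trip: write a tiny header at the base, hand out `base + 1`, and classify the user pointer from the byte before it. It is a sketch only. The enum and function names are hypothetical, and it does not reject class nibbles above 7.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { KIND_TINY, KIND_POOL_TLS, KIND_UNKNOWN } alloc_kind_sketch;

/* Write the 1-byte tiny header at base and return the user pointer (base + 1). */
static void *tiny_write_header_sketch(void *base, int class_idx) {
    uint8_t *b = (uint8_t *)base;
    b[0] = (uint8_t)(0xA0u | ((unsigned)class_idx & 0x0Fu));
    return b + 1;
}

/* Classify a user pointer by the byte just before it.
   On a tiny hit, *class_idx_out (if non-NULL) receives the low nibble. */
static alloc_kind_sketch classify_sketch(const void *user_ptr, int *class_idx_out) {
    uint8_t header = *((const uint8_t *)user_ptr - 1);
    uint8_t magic  = (uint8_t)(header & 0xF0u);
    if (magic == 0xA0u) {
        if (class_idx_out) *class_idx_out = header & 0x0F;
        return KIND_TINY;
    }
    if (magic == 0xB0u) return KIND_POOL_TLS;
    return KIND_UNKNOWN;
}
```

A caller would take the fast path only on `KIND_TINY` and fall back to the slower checks for the other two results.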


The Fix: Before vs After

Before (Lines 51-90 in tiny_free_fast_v2.inc.h)

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return 0;

    // ╔══════════════════════════════════════════════════════╗
    // ║  ❌ DELETE THIS BLOCK (50-100 cycles overhead)       ║
    // ╠══════════════════════════════════════════════════════╣
    // ║  extern struct SuperSlab* hak_super_lookup(void*);   ║
    // ║  struct SuperSlab* ss = hak_super_lookup(ptr);       ║
    // ║  if (ss && ss->size_class == 7) {                    ║
    // ║      return 0;                                       ║
    // ║  }                                                   ║
    // ╚══════════════════════════════════════════════════════╝

    void* header_addr = (char*)ptr - 1;

    // Page boundary check (1-2 cycles)
    if (((uintptr_t)ptr & 0xFFF) == 0) {
        if (!hak_is_memory_readable(header_addr)) return 0;
    }

    // Read header (2-3 cycles) - includes magic validation
    int class_idx = tiny_region_id_read_header(ptr);
    if (class_idx < 0) return 0;

    // TLS capacity check (1 cycle)
    if (g_tls_sll_count[class_idx] >= cap) return 0;

    // Push to TLS freelist (3-5 cycles)
    void* base = (char*)ptr - 1;
    tls_sll_push(class_idx, base, UINT32_MAX);

    return 1;  // TOTAL: 55-110 cycles ❌
}

After (Phase E3-1 - Simple deletion!)

static inline int hak_tiny_free_fast_v2(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return 0;

    // Phase E3: C7 now has header (Phase E1), registry lookup removed!
    // Header magic validation (2-3 cycles) distinguishes:
    // - Tiny (0xA0-0xA7): valid header → fast path
    // - Pool TLS (0xB0): different magic → slow path
    // - Mid/Large: no header → slow path

    void* header_addr = (char*)ptr - 1;

    // Page boundary check (1-2 cycles)
    if (((uintptr_t)ptr & 0xFFF) == 0) {
        if (!hak_is_memory_readable(header_addr)) return 0;
    }

    // Read header (2-3 cycles) - includes magic validation
    int class_idx = tiny_region_id_read_header(ptr);
    if (class_idx < 0) return 0;

    // TLS capacity check (1 cycle)
    if (g_tls_sll_count[class_idx] >= cap) return 0;

    // Push to TLS freelist (3-5 cycles)
    void* base = (char*)ptr - 1;
    tls_sll_push(class_idx, base, UINT32_MAX);

    return 1;  // TOTAL: 5-10 cycles ✅
}

Diff:

  • Lines deleted: 9 (registry lookup block)
  • Lines added: 5 (explanatory comments)
  • Net change: -4 lines
  • Cycle savings: -50 to -100 cycles per free
  • Throughput improvement: +541-710% (per-size figures above)
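
Both versions keep the page boundary check, which is what makes reading `ptr - 1` safe: the header byte can sit on a different (possibly unmapped) page only when `ptr` is the first byte of a page. The sketch below, assuming 4 KiB pages as the `0xFFF` mask implies, shows the cheap mask test next to the property it stands in for. The names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, matching the 0xFFF mask used above. */
#define PAGE_SHIFT_SKETCH 12u
#define PAGE_MASK_SKETCH  (((uintptr_t)1 << PAGE_SHIFT_SKETCH) - 1)

/* The cheap test from the fast path: is ptr the first byte of a page? */
static bool is_page_start_sketch(uintptr_t ptr) {
    return (ptr & PAGE_MASK_SKETCH) == 0;
}

/* The property it stands in for: does ptr - 1 live on a different page? */
static bool header_crosses_page_sketch(uintptr_t ptr) {
    return ((ptr - 1) >> PAGE_SHIFT_SKETCH) != (ptr >> PAGE_SHIFT_SKETCH);
}
```

The two predicates agree for every nonzero address, so the fast path only needs the single AND-and-compare and calls the readability probe in the rare aligned case.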

Summary: Why This Fix Works

Phase E1 Guarantees

  • ALL classes have headers (C0-C7, including C7)
  • Header magic distinguishes allocators (0xA0 vs 0xB0 vs none)
  • No C7 special cases needed (unified code path)

Current Code Problems

  • Registry lookup redundant (50-100 cycles for nothing)
  • Header validation sufficient (already done in 2-3 cycles)
  • No performance benefit (safety already guaranteed by headers)

Phase E3-1 Solution

  • Remove registry lookup (revert to Phase 7-1.3)
  • Keep header validation (2-3 cycles, sufficient)
  • Restore performance (5-10 cycles per free)
  • Maintain safety (Phase E1 headers guarantee correctness)


Ready to implement Phase E3! 🚀

The fix is trivial (delete 9 lines), low-risk (revert to proven code), and high-impact (+541-710% throughput, depending on size class).