hakmem/core/tiny_alloc_fast.inc.h
Moe Charm (CI) · 4983352812 · Perf: Phase 7-1.3 - Hybrid mincore + Macro fix (+194-333%)
## Summary
Fixed a CRITICAL bottleneck (mincore() syscall overhead) and a macro definition-order bug.
Result: a 2-3x performance improvement across all benchmarks.

## Performance Results
- Larson 1T: 631K → 2.73M ops/s (+333%) 🚀
- bench_random_mixed (128B): 768K → 2.26M ops/s (+194%) 🚀
- bench_random_mixed (512B): → 1.43M ops/s (new)
- [HEADER_INVALID] messages: Many → ~Zero 

## Changes

### 1. Hybrid mincore Optimization (317-396x faster)
**Problem**: `hak_is_memory_readable()` calls the mincore() syscall on EVERY free
- Cost: 634 cycles/call
- Impact: 40x slower than System malloc

**Solution**: Check alignment BEFORE calling mincore()
- Step 1 (1-byte header): `if ((ptr & 0xFFF) == 0)` → only 0.1% of frees call mincore()
- Step 2 (16-byte header): `if ((ptr & 0xFFF) < HEADER_SIZE)` → only 0.4% of frees call mincore()
- Result: 634 → 1-2 effective cycles (99.6-99.9% of frees skip mincore(); see the sketch below)
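
A minimal sketch of the hybrid gate, assuming `hak_is_memory_readable(ptr, len)` takes a pointer and length (the real checks live in the files listed under **Files** below):

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed helper (named in this commit); the ~634-cycle mincore() probe. */
extern int hak_is_memory_readable(const void* p, size_t len);

/* Sketch only: header assumed to occupy the HEADER_SIZE bytes just below ptr. */
static inline int tiny_header_probe(const void* ptr, size_t header_size) {
    uintptr_t off = (uintptr_t)ptr & 0xFFF;       /* offset within the 4 KiB page */
    if (__builtin_expect(off >= header_size, 1))
        return 1;                                  /* common case: header is on the same page, skip the syscall */
    /* Rare case: the header may straddle into a previous, possibly unmapped page. */
    return hak_is_memory_readable((const char*)ptr - header_size, header_size);
}
```

The alignment test costs 1-2 cycles and almost always short-circuits, so the 634-cycle probe survives only for pointers that land within `HEADER_SIZE` of a page start.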

**Files**:
- core/tiny_free_fast_v2.inc.h:53-71 - Step 1 hybrid check
- core/box/hak_free_api.inc.h:94-107 - Step 2 hybrid check
- core/hakmem_internal.h:281-312 - Performance warning added

### 2. HAK_RET_ALLOC Macro Fix (CRITICAL BUG)
**Problem**: Macro definition order prevented the Phase 7 header write
- hakmem_tiny.c:130 defined the legacy macro (no header write)
- tiny_alloc_fast.inc.h:67 had an `#ifndef` guard → the Phase 7 definition was skipped!
- Result: headers were NEVER written → every free failed header validation → slow path

**Solution**: Force the Phase 7 macro to override the legacy one
- hakmem_tiny.c:119 - Added an `#ifndef HAK_RET_ALLOC` guard
- tiny_alloc_fast.inc.h:69-72 - Added `#undef` before the redefine
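
A sketch of the fix; the `#undef` half mirrors the macro block now in tiny_alloc_fast.inc.h, while the guarded legacy definition is illustrative:

```c
/* hakmem_tiny.c - legacy macro is now guarded, so it only applies if nothing else defined it */
#ifndef HAK_RET_ALLOC
#define HAK_RET_ALLOC(cls, ptr) return (ptr)  /* legacy: returns without writing a header */
#endif

/* tiny_alloc_fast.inc.h - Phase 7 macro force-overrides whatever was defined first */
#ifdef HAK_RET_ALLOC
#undef HAK_RET_ALLOC
#endif
#define HAK_RET_ALLOC(cls, ptr) return tiny_region_id_write_header((ptr), (cls))
```

With both changes, include order no longer matters: the Phase 7 version is the definition in effect at every allocation return.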

### 3. Magic Byte Fix
**Problem**: Release builds don't write the magic byte, but free ALWAYS checks it
- Result: every header was flagged as invalid in release builds

**Solution**: ALWAYS write the magic byte (the same 1-byte write, no extra overhead)
- tiny_region_id.h:50-54 - Removed the `#if !HAKMEM_BUILD_RELEASE` guard
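
A minimal sketch of the corrected write path, assuming a 1-byte header just below the user pointer and a hypothetical `TINY_MAGIC` pattern (the real layout and constant live in tiny_region_id.h):

```c
#include <stdint.h>

#define TINY_MAGIC 0xA0u  /* hypothetical magic pattern, for illustration only */

static inline void* tiny_write_header_sketch(void* ptr, int cls) {
    uint8_t* header = (uint8_t*)ptr - 1;             /* assumed 1-byte header slot below ptr */
    *header = (uint8_t)(TINY_MAGIC | (uint8_t)cls);  /* ALWAYS written: no release-build #if */
    return ptr;                                      /* returned so HAK_RET_ALLOC can chain it */
}
```

Because it is the same single-byte store the debug build already performed, keeping it in release builds adds no measurable overhead, and free's magic-byte validation now passes.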

## Technical Details

### Hybrid mincore Effectiveness
| Case | Frequency | Cost (cycles) | Weighted (cycles) |
|------|-----------|---------------|-------------------|
| Normal (Step 1) | 99.9% | 1-2 | 1-2 |
| Page boundary | 0.1% | 634 | 0.6 |
| **Total** | - | - | **1.6-2.6** |

**Improvement**: 634 → 1.6-2.6 effective cycles = **~317-396x faster!**

### Macro Fix Impact
**Before**: `HAK_RET_ALLOC(cls, ptr)` → `return (ptr)` (no header write)
**After**: `HAK_RET_ALLOC(cls, ptr)` → `return tiny_region_id_write_header((ptr), (cls))`

**Result**: Headers properly written → fast path works → +194-333% performance

## Investigation
Task Agent Ultrathink analysis identified:
1. mincore() syscall overhead (634 cycles)
2. Macro definition order conflict
3. Release/Debug build mismatch (magic byte)

Full report: PHASE7_DESIGN_REVIEW.md (23KB, 758 lines)

## Related
- Phase 7-1.0: PoC implementation (+39%~+436%)
- Phase 7-1.1: Dual-header dispatch (Task Agent)
- Phase 7-1.2: Page boundary SEGV fix (100% crash-free)
- Phase 7-1.3: Hybrid mincore + Macro fix (this commit)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 04:50:41 +09:00


// tiny_alloc_fast.inc.h - Box 5: Allocation Fast Path (3-4 instructions)
// Purpose: Ultra-fast TLS freelist pop (inspired by System tcache & Mid-Large HAKX +171%)
// Invariant: Hit rate > 95% → 3-4 instructions, Miss → refill from backend
// Design: "Simple Front + Smart Back" - Front is dumb & fast, Back is smart
//
// Box 5-NEW: SFC (Super Front Cache) Integration
// Architecture: SFC (Layer 0, 128-256 slots) → SLL (Layer 1, unlimited) → SuperSlab (Layer 2+)
// Cascade Refill: SFC ← SLL (one-way, safe)
// Goal: +200% performance (4.19M → 12M+ ops/s)
#pragma once
#include "tiny_atomic.h"
#include "hakmem_tiny.h"
#include "tiny_route.h"
#include "tiny_alloc_fast_sfc.inc.h" // Box 5-NEW: SFC Layer
#include "tiny_region_id.h" // Phase 7: Header-based class_idx lookup
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
#include "box/front_gate_box.h"
#endif
#include <stdio.h>
#include <stdlib.h> // getenv() / abort() used below
// ========== Debug Counters (compile-time gated) ==========
#if HAKMEM_DEBUG_COUNTERS
// Refill-stage counters (defined in hakmem_tiny.c)
extern unsigned long long g_rf_total_calls[];
extern unsigned long long g_rf_hit_bench[];
extern unsigned long long g_rf_hit_hot[];
extern unsigned long long g_rf_hit_mail[];
extern unsigned long long g_rf_hit_slab[];
extern unsigned long long g_rf_hit_ss[];
extern unsigned long long g_rf_hit_reg[];
extern unsigned long long g_rf_mmap_calls[];
// Publish hits (defined in hakmem_tiny.c)
extern unsigned long long g_pub_mail_hits[];
extern unsigned long long g_pub_bench_hits[];
extern unsigned long long g_pub_hot_hits[];
// Free pipeline (defined in hakmem_tiny.c)
extern unsigned long long g_free_via_tls_sll[];
#endif
// ========== Box 5: Allocation Fast Path ==========
// Box Theory's Fast Allocation layer. Pops directly from the TLS freelist (3-4 instructions).
// Invariants:
// - If the TLS freelist is non-empty, return immediately (no lock, no sync)
// - On miss, delegate to the Backend (Box 3: SuperSlab)
// - Cross-thread allocation is not handled here (the Backend takes care of it)
// External TLS variables (defined in hakmem_tiny.c)
extern __thread void* g_tls_sll_head[TINY_NUM_CLASSES];
extern __thread uint32_t g_tls_sll_count[TINY_NUM_CLASSES];
// External backend functions
extern int sll_refill_small_from_ss(int class_idx, int max_take);
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);
extern int hak_tiny_size_to_class(size_t size);
extern int tiny_refill_failfast_level(void);
extern const size_t g_tiny_class_sizes[];
// Global Front refill config (parsed at init; defined in hakmem_tiny.c)
extern int g_refill_count_global;
extern int g_refill_count_hot;
extern int g_refill_count_mid;
extern int g_refill_count_class[TINY_NUM_CLASSES];
// External macros
// Phase 7: Write header before returning (if enabled)
// CRITICAL: Undefine legacy macro to ensure Phase 7 version is used
#ifdef HAK_RET_ALLOC
#undef HAK_RET_ALLOC
#endif
#define HAK_RET_ALLOC(cls, ptr) return tiny_region_id_write_header((ptr), (cls))
// ========== RDTSC Profiling (lightweight) ==========
#ifdef __x86_64__
static inline uint64_t tiny_fast_rdtsc(void) {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
#else
static inline uint64_t tiny_fast_rdtsc(void) { return 0; }
#endif
// Per-thread profiling counters (enable with HAKMEM_TINY_PROFILE=1)
static __thread uint64_t g_tiny_alloc_hits = 0;
static __thread uint64_t g_tiny_alloc_cycles = 0;
static __thread uint64_t g_tiny_refill_calls = 0;
static __thread uint64_t g_tiny_refill_cycles = 0;
static int g_tiny_profile_enabled = -1; // -1: uninitialized
static inline int tiny_profile_enabled(void) {
    if (__builtin_expect(g_tiny_profile_enabled == -1, 0)) {
        const char* env = getenv("HAKMEM_TINY_PROFILE");
        g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
    }
    return g_tiny_profile_enabled;
}
// Print profiling results at exit
static void tiny_fast_print_profile(void) __attribute__((destructor));
static void tiny_fast_print_profile(void) {
    if (!tiny_profile_enabled()) return;
    if (g_tiny_alloc_hits == 0 && g_tiny_refill_calls == 0) return;
    fprintf(stderr, "\n========== Box Theory Fast Path Profile ==========\n");
    if (g_tiny_alloc_hits > 0) {
        fprintf(stderr, "[ALLOC HIT] count=%lu, avg_cycles=%lu\n",
                (unsigned long)g_tiny_alloc_hits,
                (unsigned long)(g_tiny_alloc_cycles / g_tiny_alloc_hits));
    }
    if (g_tiny_refill_calls > 0) {
        fprintf(stderr, "[REFILL] count=%lu, avg_cycles=%lu\n",
                (unsigned long)g_tiny_refill_calls,
                (unsigned long)(g_tiny_refill_cycles / g_tiny_refill_calls));
    }
    fprintf(stderr, "===================================================\n\n");
}
// ========== Fast Path: TLS Freelist Pop (3-4 instructions) ==========
// External SFC control (defined in hakmem_tiny_sfc.c)
extern int g_sfc_enabled;
// Allocation fast path (inline for zero-cost)
// Returns: pointer on success, NULL on miss (caller should try refill/slow)
//
// Box 5-NEW Architecture:
// Layer 0: SFC (128-256 slots, high hit rate) [if enabled]
// Layer 1: SLL (unlimited, existing)
// Cascade: SFC miss → try SLL → refill
//
// Assembly (x86-64, optimized):
// mov rax, QWORD PTR g_sfc_head[class_idx] ; SFC: Load head
// test rax, rax ; Check NULL
// jne .sfc_hit ; If not empty, SFC hit!
// mov rax, QWORD PTR g_tls_sll_head[class_idx] ; SLL: Load head
// test rax, rax ; Check NULL
// je .miss ; If empty, miss
// mov rdx, QWORD PTR [rax] ; Load next
// mov QWORD PTR g_tls_sll_head[class_idx], rdx ; Update head
// ret ; Return ptr
// .sfc_hit:
// mov rdx, QWORD PTR [rax] ; Load next
// mov QWORD PTR g_sfc_head[class_idx], rdx ; Update head
// ret
// .miss:
// ; Fall through to refill
//
// Expected: 3-4 instructions on SFC hit, 6-8 on SLL hit
static inline void* tiny_alloc_fast_pop(int class_idx) {
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
    void* out = NULL;
    if (front_gate_try_pop(class_idx, &out)) {
        return out;
    }
    return NULL;
#else
    uint64_t start = tiny_profile_enabled() ? tiny_fast_rdtsc() : 0;
    // Box 5-NEW: Layer 0 - Try SFC first (if enabled)
    // Cache g_sfc_enabled in TLS to avoid a global load on every allocation
    static __thread int sfc_check_done = 0;
    static __thread int sfc_is_enabled = 0;
    if (__builtin_expect(!sfc_check_done, 0)) {
        sfc_is_enabled = g_sfc_enabled;
        sfc_check_done = 1;
    }
    if (__builtin_expect(sfc_is_enabled, 1)) {
        void* ptr = sfc_alloc(class_idx);
        if (__builtin_expect(ptr != NULL, 1)) {
            // Front Gate: SFC hit
            extern unsigned long long g_front_sfc_hit[];
            g_front_sfc_hit[class_idx]++;
            // 🚀 SFC HIT! (Layer 0)
            if (start) {
                g_tiny_alloc_cycles += (tiny_fast_rdtsc() - start);
                g_tiny_alloc_hits++;
            }
            return ptr;
        }
        // SFC miss → try SLL (Layer 1)
    }
    // Box Boundary: Layer 1 - pop the head of the TLS SLL freelist (can be disabled via env)
    extern int g_tls_sll_enable; // set at init via HAKMEM_TINY_TLS_SLL
    if (__builtin_expect(g_tls_sll_enable, 1)) {
        void* head = g_tls_sll_head[class_idx];
        if (__builtin_expect(head != NULL, 1)) {
            // CORRUPTION DEBUG: Validate TLS SLL head before popping
            if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) {
                size_t blk = g_tiny_class_sizes[class_idx];
                // Check alignment (must be a multiple of the block size)
                if (((uintptr_t)head % blk) != 0) {
                    fprintf(stderr, "[TLS_SLL_CORRUPT] cls=%d head=%p misaligned (blk=%zu offset=%zu)\n",
                            class_idx, head, blk, (uintptr_t)head % blk);
                    fprintf(stderr, "[TLS_SLL_CORRUPT] TLS freelist head is corrupted!\n");
                    abort();
                }
            }
            // Front Gate: SLL hit (fast path, 3 instructions)
            extern unsigned long long g_front_sll_hit[];
            g_front_sll_hit[class_idx]++;
            // CORRUPTION DEBUG: Validate the next pointer before updating head
            void* next = *(void**)head;
            if (__builtin_expect(tiny_refill_failfast_level() >= 2, 0)) {
                size_t blk = g_tiny_class_sizes[class_idx];
                if (next != NULL && ((uintptr_t)next % blk) != 0) {
                    fprintf(stderr, "[ALLOC_POP_CORRUPT] Reading next from head=%p got corrupted next=%p!\n",
                            head, next);
                    fprintf(stderr, "[ALLOC_POP_CORRUPT] cls=%d blk=%zu next_offset=%zu (expected 0)\n",
                            class_idx, blk, (uintptr_t)next % blk);
                    fprintf(stderr, "[ALLOC_POP_CORRUPT] TLS SLL head block was corrupted (use-after-free/double-free)!\n");
                    abort();
                }
                fprintf(stderr, "[ALLOC_POP] cls=%d head=%p next=%p\n", class_idx, head, next);
            }
            g_tls_sll_head[class_idx] = next; // Pop: next = *head
            // Optional: update count (for stats, can be disabled)
            if (g_tls_sll_count[class_idx] > 0) {
                g_tls_sll_count[class_idx]--;
            }
#if HAKMEM_DEBUG_COUNTERS
            // Track TLS freelist hits (compile-time gated, zero runtime cost when disabled)
            g_free_via_tls_sll[class_idx]++;
#endif
            if (start) {
                g_tiny_alloc_cycles += (tiny_fast_rdtsc() - start);
                g_tiny_alloc_hits++;
            }
            return head;
        }
    }
    // Fast path miss → NULL (caller should refill)
    return NULL;
#endif
}
// ========== Cascade Refill: SFC ← SLL (Box Theory boundary) ==========
// Cascade refill: Transfer blocks from SLL to SFC (one-way, safe)
// Returns: number of blocks transferred
//
// Contract:
// - Transfer ownership: SLL → SFC
// - No circular dependency: one-way only
// - Boundary clear: SLL pop → SFC push
// - Fallback safe: if SFC full, stop (no overflow)
static inline int sfc_refill_from_sll(int class_idx, int target_count) {
    int transferred = 0;
    uint32_t cap = g_sfc_capacity[class_idx];
    while (transferred < target_count && g_tls_sll_count[class_idx] > 0) {
        // Check SFC capacity before transfer
        if (g_sfc_count[class_idx] >= cap) {
            break; // SFC full, stop
        }
        // Pop from SLL (Layer 1)
        void* ptr = g_tls_sll_head[class_idx];
        if (!ptr) break; // SLL empty
        g_tls_sll_head[class_idx] = *(void**)ptr;
        g_tls_sll_count[class_idx]--;
        // Push to SFC (Layer 0)
        *(void**)ptr = g_sfc_head[class_idx];
        g_sfc_head[class_idx] = ptr;
        g_sfc_count[class_idx]++;
        transferred++;
    }
    return transferred;
}
// ========== Refill Path: Backend Integration ==========
// Refill TLS freelist from backend (SuperSlab/ACE/Learning layer)
// Returns: number of blocks refilled
//
// Box 5-NEW Architecture:
// SFC enabled: SuperSlab → SLL → SFC (cascade)
// SFC disabled: SuperSlab → SLL (direct, old path)
//
// This integrates with existing HAKMEM infrastructure:
// - SuperSlab provides memory chunks
// - ACE provides adaptive capacity learning
// - L25 provides mid-large integration
//
// Refill count is tunable via HAKMEM_TINY_REFILL_COUNT (default: 32)
// - Smaller count (8-16): better for diverse workloads, faster warmup
// - Larger count (64-128): better for homogeneous workloads, fewer refills
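// Usage sketch (the benchmark binary name is hypothetical; the env var is the one parsed at init):
//   HAKMEM_TINY_REFILL_COUNT=64 ./bench_random_mixed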
static inline int tiny_alloc_fast_refill(int class_idx) {
    uint64_t start = tiny_profile_enabled() ? tiny_fast_rdtsc() : 0;
    // Tunable refill count (cached per-class in TLS for performance)
    static __thread int s_refill_count[TINY_NUM_CLASSES] = {0};
    int cnt = s_refill_count[class_idx];
    if (__builtin_expect(cnt == 0, 0)) {
        int def = 16; // Default: 16 (smaller = less overhead per refill)
        int v = def;
        // Resolve precedence without getenv on the hot path (values parsed at init)
        if (g_refill_count_class[class_idx] > 0) {
            v = g_refill_count_class[class_idx];
        } else if (class_idx <= 3 && g_refill_count_hot > 0) {
            v = g_refill_count_hot;
        } else if (class_idx >= 4 && g_refill_count_mid > 0) {
            v = g_refill_count_mid;
        } else if (g_refill_count_global > 0) {
            v = g_refill_count_global;
        }
        // Clamp to a sane range (avoid pathological cases)
        if (v < 8) v = 8;     // Minimum: avoid thrashing
        if (v > 256) v = 256; // Maximum: avoid excessive TLS memory
        s_refill_count[class_idx] = v;
        cnt = v;
    }
#if HAKMEM_DEBUG_COUNTERS
    // Track refill calls (compile-time gated)
    g_rf_total_calls[class_idx]++;
#endif
    // Box Boundary: Delegate to Backend (Box 3: SuperSlab)
    // This gives us ACE, Learning layer, L25 integration for free!
    // Note: g_rf_hit_slab counter is incremented inside sll_refill_small_from_ss()
    int refilled = sll_refill_small_from_ss(class_idx, cnt);
    // Box 5-NEW: Cascade refill SFC ← SLL (if SFC enabled)
    // This happens AFTER the SuperSlab → SLL refill, so SLL has blocks
    static __thread int sfc_check_done_refill = 0;
    static __thread int sfc_is_enabled_refill = 0;
    if (__builtin_expect(!sfc_check_done_refill, 0)) {
        sfc_is_enabled_refill = g_sfc_enabled;
        sfc_check_done_refill = 1;
    }
    if (sfc_is_enabled_refill && refilled > 0) {
        // Transfer half of the refilled blocks to SFC (keep half in SLL for the future)
        int sfc_target = refilled / 2;
        if (sfc_target > 0) {
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
            front_gate_after_refill(class_idx, refilled);
#else
            int transferred = sfc_refill_from_sll(class_idx, sfc_target);
            (void)transferred; // Unused, but could track stats
#endif
        }
    }
    if (start) {
        g_tiny_refill_cycles += (tiny_fast_rdtsc() - start);
        g_tiny_refill_calls++;
    }
    return refilled;
}
// ========== Combined Fast Path (Alloc + Refill) ==========
// Complete fast path allocation (inline for zero-cost)
// Returns: pointer on success, NULL on failure (OOM or size too large)
//
// Flow:
// 1. TLS freelist pop (3-4 instructions) - Hit rate ~95%
// 2. Miss → Refill from backend (~5% cases)
// 3. Refill success → Retry pop
// 4. Refill failure → Slow path (OOM or new SuperSlab allocation)
//
// Example usage:
// void* ptr = tiny_alloc_fast(64);
// if (!ptr) {
// // OOM handling
// }
static inline void* tiny_alloc_fast(size_t size) {
    // 1. Size → class index (inline, fast)
    int class_idx = hak_tiny_size_to_class(size);
    if (__builtin_expect(class_idx < 0, 0)) {
        return NULL; // Size > 1KB, not Tiny
    }
    ROUTE_BEGIN(class_idx);
    // 2. Fast path: TLS freelist pop (3-4 instructions, 95% hit rate)
    void* ptr = tiny_alloc_fast_pop(class_idx);
    if (__builtin_expect(ptr != NULL, 1)) {
        HAK_RET_ALLOC(class_idx, ptr);
    }
    // 3. Miss: Refill from backend (Box 3: SuperSlab)
    int refilled = tiny_alloc_fast_refill(class_idx);
    if (__builtin_expect(refilled > 0, 1)) {
        // Refill success → retry pop
        ptr = tiny_alloc_fast_pop(class_idx);
        if (ptr) {
            HAK_RET_ALLOC(class_idx, ptr);
        }
    }
    // 4. Refill failure or still empty → slow path (OOM or new SuperSlab)
    // Box Boundary: Delegate to Slow Path (Box 3 backend)
    ptr = hak_tiny_alloc_slow(size, class_idx);
    if (ptr) {
        HAK_RET_ALLOC(class_idx, ptr);
    }
    return ptr; // NULL if OOM
}
// ========== Push to TLS Freelist (for free path) ==========
// Push block to TLS freelist (used by free fast path)
// This is a "helper" for Box 6 (Free Fast Path)
//
// Invariant: ptr must belong to current thread (no ownership check here)
// Caller (Box 6) is responsible for ownership verification
static inline void tiny_alloc_fast_push(int class_idx, void* ptr) {
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
    front_gate_push_tls(class_idx, ptr);
#else
    // Box Boundary: Push to TLS freelist
    *(void**)ptr = g_tls_sll_head[class_idx];
    g_tls_sll_head[class_idx] = ptr;
    g_tls_sll_count[class_idx]++;
#endif
}
// ========== Statistics & Diagnostics ==========
// Get TLS freelist stats (for debugging/profiling)
typedef struct {
    int class_idx;
    void* head;
    uint32_t count;
} TinyAllocFastStats;
static inline TinyAllocFastStats tiny_alloc_fast_stats(int class_idx) {
    TinyAllocFastStats stats = {
        .class_idx = class_idx,
        .head = g_tls_sll_head[class_idx],
        .count = g_tls_sll_count[class_idx]
    };
    return stats;
}
// Reset TLS freelist (for testing/benchmarking)
// WARNING: This leaks memory! Only use in controlled test environments.
static inline void tiny_alloc_fast_reset(int class_idx) {
    g_tls_sll_head[class_idx] = NULL;
    g_tls_sll_count[class_idx] = 0;
}
// ========== Performance Notes ==========
//
// Expected metrics (based on System tcache & HAKX +171% results):
// - Fast path hit rate: 95%+ (workload dependent)
// - Fast path latency: 3-4 instructions (1-2 cycles on modern CPUs)
// - Miss penalty: ~20-50 instructions (refill from SuperSlab)
// - Throughput improvement: +10-25% vs current multi-layer design
//
// Key optimizations:
// 1. `__builtin_expect` for branch prediction (hot path first)
// 2. `static inline` for zero-cost abstraction
// 3. TLS variables (no atomic ops, no locks)
// 4. Minimal work in fast path (defer stats/accounting to backend)
//
// Comparison with current design:
// - Current: 20+ instructions (Magazine → SuperSlab → ACE → ...)
// - New: 3-4 instructions (TLS freelist pop only)
// - Reduction: -80% instructions in hot path
//
// Inspired by:
// - System tcache (glibc malloc) - 3-4 instruction fast path
// - HAKX Mid-Large (+171%) - "Simple Front + Smart Back"
// - Box Theory - Clear boundaries, minimal coupling