2025-11-05 12:31:14 +09:00
|
|
|
|
// tiny_alloc_fast.inc.h - Box 5: Allocation Fast Path (3-4 instructions)
|
|
|
|
|
|
// Purpose: Ultra-fast TLS freelist pop (inspired by System tcache & Mid-Large HAKX +171%)
|
|
|
|
|
|
// Invariant: Hit rate > 95% → 3-4 instructions, Miss → refill from backend
|
|
|
|
|
|
// Design: "Simple Front + Smart Back" - Front is dumb & fast, Back is smart
|
2025-11-07 01:27:04 +09:00
|
|
|
|
//
|
|
|
|
|
|
// Box 5-NEW: SFC (Super Front Cache) Integration
|
|
|
|
|
|
// Architecture: SFC (Layer 0, 128-256 slots) → SLL (Layer 1, unlimited) → SuperSlab (Layer 2+)
|
|
|
|
|
|
// Cascade Refill: SFC ← SLL (one-way, safe)
|
|
|
|
|
|
// Goal: +200% performance (4.19M → 12M+ ops/s)
|
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00
|
|
|
|
//
|
|
|
|
|
|
// Phase 2b: Adaptive TLS Cache Sizing
|
|
|
|
|
|
// Hot classes grow to 2048 slots, cold classes shrink to 16 slots
|
|
|
|
|
|
// Expected: +3-10% performance, -30-50% TLS cache memory overhead
|
2025-11-05 12:31:14 +09:00
|
|
|
|
#pragma once
|
|
|
|
|
|
#include "tiny_atomic.h"
|
|
|
|
|
|
#include "hakmem_tiny.h"
|
2025-11-07 01:27:04 +09:00
|
|
|
|
#include "tiny_route.h"
|
|
|
|
|
|
#include "tiny_alloc_fast_sfc.inc.h" // Box 5-NEW: SFC Layer
|
2025-11-08 03:18:17 +09:00
|
|
|
|
#include "tiny_region_id.h" // Phase 7: Header-based class_idx lookup
|
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00
|
|
|
|
#include "tiny_adaptive_sizing.h" // Phase 2b: Adaptive sizing
|
2025-11-10 16:48:20 +09:00
|
|
|
|
#include "box/tls_sll_box.h" // Box TLS-SLL: C7-safe push/pop/splice
|
2025-11-07 01:27:04 +09:00
|
|
|
|
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
|
|
|
|
|
|
#include "box/front_gate_box.h"
|
|
|
|
|
|
#endif
|
2025-11-05 17:45:11 +09:00
|
|
|
|
#include <stdio.h>
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// Phase 7 Task 2: Aggressive inline TLS cache access
|
|
|
|
|
|
// Enable with: make HEADER_CLASSIDX=1 AGGRESSIVE_INLINE=1
|
|
|
|
|
|
#ifndef HAKMEM_TINY_AGGRESSIVE_INLINE
|
|
|
|
|
|
#define HAKMEM_TINY_AGGRESSIVE_INLINE 0
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
|
|
|
#if HAKMEM_TINY_AGGRESSIVE_INLINE
|
|
|
|
|
|
#include "tiny_alloc_fast_inline.h"
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// ========== Debug Counters (compile-time gated) ==========
|
|
|
|
|
|
#if HAKMEM_DEBUG_COUNTERS
|
|
|
|
|
|
// Refill-stage counters (defined in hakmem_tiny.c)
|
|
|
|
|
|
extern unsigned long long g_rf_total_calls[];
|
|
|
|
|
|
extern unsigned long long g_rf_hit_bench[];
|
|
|
|
|
|
extern unsigned long long g_rf_hit_hot[];
|
|
|
|
|
|
extern unsigned long long g_rf_hit_mail[];
|
|
|
|
|
|
extern unsigned long long g_rf_hit_slab[];
|
|
|
|
|
|
extern unsigned long long g_rf_hit_ss[];
|
|
|
|
|
|
extern unsigned long long g_rf_hit_reg[];
|
|
|
|
|
|
extern unsigned long long g_rf_mmap_calls[];
|
|
|
|
|
|
|
|
|
|
|
|
// Publish hits (defined in hakmem_tiny.c)
|
|
|
|
|
|
extern unsigned long long g_pub_mail_hits[];
|
|
|
|
|
|
extern unsigned long long g_pub_bench_hits[];
|
|
|
|
|
|
extern unsigned long long g_pub_hot_hits[];
|
|
|
|
|
|
|
|
|
|
|
|
// Free pipeline (defined in hakmem_tiny.c)
|
|
|
|
|
|
extern unsigned long long g_free_via_tls_sll[];
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Box 5: Allocation Fast Path ==========
|
|
|
|
|
|
// 箱理論の Fast Allocation 層。TLS freelist から直接 pop(3-4命令)。
|
|
|
|
|
|
// 不変条件:
|
|
|
|
|
|
// - TLS freelist が非空なら即座に return (no lock, no sync)
|
|
|
|
|
|
// - Miss なら Backend (Box 3: SuperSlab) に委譲
|
|
|
|
|
|
// - Cross-thread allocation は考慮しない(Backend が処理)
|
|
|
|
|
|
|
|
|
|
|
|
// External TLS variables (defined in hakmem_tiny.c)
|
|
|
|
|
|
extern __thread void* g_tls_sll_head[TINY_NUM_CLASSES];
|
|
|
|
|
|
extern __thread uint32_t g_tls_sll_count[TINY_NUM_CLASSES];
|
|
|
|
|
|
|
|
|
|
|
|
// External backend functions
|
2025-11-09 22:12:34 +09:00
|
|
|
|
// P0 Fix: Use appropriate refill function based on P0 status
|
|
|
|
|
|
#if HAKMEM_TINY_P0_BATCH_REFILL
|
|
|
|
|
|
extern int sll_refill_batch_from_ss(int class_idx, int max_take);
|
|
|
|
|
|
#else
|
2025-11-05 12:31:14 +09:00
|
|
|
|
extern int sll_refill_small_from_ss(int class_idx, int max_take);
|
2025-11-09 22:12:34 +09:00
|
|
|
|
#endif
|
2025-11-05 12:31:14 +09:00
|
|
|
|
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);
|
|
|
|
|
|
extern int hak_tiny_size_to_class(size_t size);
|
2025-11-08 01:18:37 +09:00
|
|
|
|
extern int tiny_refill_failfast_level(void);
|
|
|
|
|
|
extern const size_t g_tiny_class_sizes[];
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
2025-11-05 17:45:11 +09:00
|
|
|
|
// Global Front refill config (parsed at init; defined in hakmem_tiny.c)
|
|
|
|
|
|
extern int g_refill_count_global;
|
|
|
|
|
|
extern int g_refill_count_hot;
|
|
|
|
|
|
extern int g_refill_count_mid;
|
|
|
|
|
|
extern int g_refill_count_class[TINY_NUM_CLASSES];
|
|
|
|
|
|
|
2025-11-08 11:49:21 +09:00
|
|
|
|
// HAK_RET_ALLOC macro is now defined in core/hakmem_tiny.c
|
|
|
|
|
|
// See lines 116-152 for single definition point based on HAKMEM_TINY_HEADER_CLASSIDX
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
2025-11-05 06:35:03 +00:00
|
|
|
|
// ========== RDTSC Profiling (lightweight) ==========
|
|
|
|
|
|
#ifdef __x86_64__
|
|
|
|
|
|
static inline uint64_t tiny_fast_rdtsc(void) {
|
|
|
|
|
|
unsigned int lo, hi;
|
|
|
|
|
|
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
|
|
|
|
|
|
return ((uint64_t)hi << 32) | lo;
|
|
|
|
|
|
}
|
|
|
|
|
|
#else
|
|
|
|
|
|
static inline uint64_t tiny_fast_rdtsc(void) { return 0; }
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
|
|
|
// Per-thread profiling counters (enable with HAKMEM_TINY_PROFILE=1)
|
|
|
|
|
|
static __thread uint64_t g_tiny_alloc_hits = 0;
|
|
|
|
|
|
static __thread uint64_t g_tiny_alloc_cycles = 0;
|
|
|
|
|
|
static __thread uint64_t g_tiny_refill_calls = 0;
|
|
|
|
|
|
static __thread uint64_t g_tiny_refill_cycles = 0;
|
|
|
|
|
|
static int g_tiny_profile_enabled = -1; // -1: uninitialized
|
|
|
|
|
|
|
|
|
|
|
|
static inline int tiny_profile_enabled(void) {
|
|
|
|
|
|
if (__builtin_expect(g_tiny_profile_enabled == -1, 0)) {
|
|
|
|
|
|
const char* env = getenv("HAKMEM_TINY_PROFILE");
|
|
|
|
|
|
g_tiny_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
|
|
|
|
|
|
}
|
|
|
|
|
|
return g_tiny_profile_enabled;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// Print profiling results at exit
|
|
|
|
|
|
static void tiny_fast_print_profile(void) __attribute__((destructor));
|
|
|
|
|
|
static void tiny_fast_print_profile(void) {
|
|
|
|
|
|
if (!tiny_profile_enabled()) return;
|
|
|
|
|
|
if (g_tiny_alloc_hits == 0 && g_tiny_refill_calls == 0) return;
|
|
|
|
|
|
|
|
|
|
|
|
fprintf(stderr, "\n========== Box Theory Fast Path Profile ==========\n");
|
|
|
|
|
|
if (g_tiny_alloc_hits > 0) {
|
|
|
|
|
|
fprintf(stderr, "[ALLOC HIT] count=%lu, avg_cycles=%lu\n",
|
|
|
|
|
|
(unsigned long)g_tiny_alloc_hits,
|
|
|
|
|
|
(unsigned long)(g_tiny_alloc_cycles / g_tiny_alloc_hits));
|
|
|
|
|
|
}
|
|
|
|
|
|
if (g_tiny_refill_calls > 0) {
|
|
|
|
|
|
fprintf(stderr, "[REFILL] count=%lu, avg_cycles=%lu\n",
|
|
|
|
|
|
(unsigned long)g_tiny_refill_calls,
|
|
|
|
|
|
(unsigned long)(g_tiny_refill_cycles / g_tiny_refill_calls));
|
|
|
|
|
|
}
|
|
|
|
|
|
fprintf(stderr, "===================================================\n\n");
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// ========== Fast Path: TLS Freelist Pop (3-4 instructions) ==========
|
|
|
|
|
|
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// External SFC control (defined in hakmem_tiny_sfc.c)
|
|
|
|
|
|
extern int g_sfc_enabled;
|
|
|
|
|
|
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// Allocation fast path (inline for zero-cost)
|
|
|
|
|
|
// Returns: pointer on success, NULL on miss (caller should try refill/slow)
|
|
|
|
|
|
//
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Box 5-NEW Architecture:
|
|
|
|
|
|
// Layer 0: SFC (128-256 slots, high hit rate) [if enabled]
|
|
|
|
|
|
// Layer 1: SLL (unlimited, existing)
|
|
|
|
|
|
// Cascade: SFC miss → try SLL → refill
|
|
|
|
|
|
//
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// Assembly (x86-64, optimized):
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// mov rax, QWORD PTR g_sfc_head[class_idx] ; SFC: Load head
|
|
|
|
|
|
// test rax, rax ; Check NULL
|
|
|
|
|
|
// jne .sfc_hit ; If not empty, SFC hit!
|
|
|
|
|
|
// mov rax, QWORD PTR g_tls_sll_head[class_idx] ; SLL: Load head
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// test rax, rax ; Check NULL
|
|
|
|
|
|
// je .miss ; If empty, miss
|
|
|
|
|
|
// mov rdx, QWORD PTR [rax] ; Load next
|
|
|
|
|
|
// mov QWORD PTR g_tls_sll_head[class_idx], rdx ; Update head
|
|
|
|
|
|
// ret ; Return ptr
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// .sfc_hit:
|
|
|
|
|
|
// mov rdx, QWORD PTR [rax] ; Load next
|
|
|
|
|
|
// mov QWORD PTR g_sfc_head[class_idx], rdx ; Update head
|
|
|
|
|
|
// ret
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// .miss:
|
|
|
|
|
|
// ; Fall through to refill
|
|
|
|
|
|
//
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Expected: 3-4 instructions on SFC hit, 6-8 on SLL hit
|
2025-11-05 12:31:14 +09:00
|
|
|
|
static inline void* tiny_alloc_fast_pop(int class_idx) {
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// CRITICAL: C7 (1KB) is headerless - delegate to slow path completely
|
|
|
|
|
|
// Reason: Fast path uses SLL which stores next pointer in user data area
|
|
|
|
|
|
// C7's headerless design is incompatible with fast path assumptions
|
|
|
|
|
|
// Solution: Force C7 to use slow path for both alloc and free
|
|
|
|
|
|
if (__builtin_expect(class_idx == 7, 0)) {
|
|
|
|
|
|
return NULL; // Force slow path
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-07 01:27:04 +09:00
|
|
|
|
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
|
|
|
|
|
|
void* out = NULL;
|
|
|
|
|
|
if (front_gate_try_pop(class_idx, &out)) {
|
|
|
|
|
|
return out;
|
|
|
|
|
|
}
|
|
|
|
|
|
return NULL;
|
|
|
|
|
|
#else
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// Phase 7 Task 3: Profiling overhead removed in release builds
|
|
|
|
|
|
// In release mode, compiler can completely eliminate profiling code
|
|
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
2025-11-05 06:35:03 +00:00
|
|
|
|
uint64_t start = tiny_profile_enabled() ? tiny_fast_rdtsc() : 0;
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#endif
|
2025-11-05 06:35:03 +00:00
|
|
|
|
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Box 5-NEW: Layer 0 - Try SFC first (if enabled)
|
|
|
|
|
|
// Cache g_sfc_enabled in TLS to avoid global load on every allocation
|
|
|
|
|
|
static __thread int sfc_check_done = 0;
|
|
|
|
|
|
static __thread int sfc_is_enabled = 0;
|
|
|
|
|
|
if (__builtin_expect(!sfc_check_done, 0)) {
|
|
|
|
|
|
sfc_is_enabled = g_sfc_enabled;
|
|
|
|
|
|
sfc_check_done = 1;
|
|
|
|
|
|
}
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
2025-11-07 01:27:04 +09:00
|
|
|
|
if (__builtin_expect(sfc_is_enabled, 1)) {
|
|
|
|
|
|
void* ptr = sfc_alloc(class_idx);
|
|
|
|
|
|
if (__builtin_expect(ptr != NULL, 1)) {
|
|
|
|
|
|
// Front Gate: SFC hit
|
|
|
|
|
|
extern unsigned long long g_front_sfc_hit[];
|
|
|
|
|
|
g_front_sfc_hit[class_idx]++;
|
|
|
|
|
|
// 🚀 SFC HIT! (Layer 0)
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
2025-11-07 01:27:04 +09:00
|
|
|
|
if (start) {
|
|
|
|
|
|
g_tiny_alloc_cycles += (tiny_fast_rdtsc() - start);
|
|
|
|
|
|
g_tiny_alloc_hits++;
|
|
|
|
|
|
}
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#endif
|
2025-11-07 01:27:04 +09:00
|
|
|
|
return ptr;
|
2025-11-05 12:31:14 +09:00
|
|
|
|
}
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// SFC miss → try SLL (Layer 1)
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// Box Boundary: Layer 1 - TLS SLL freelist の先頭を pop(envで無効化可)
|
|
|
|
|
|
extern int g_tls_sll_enable; // set at init via HAKMEM_TINY_TLS_SLL
|
|
|
|
|
|
if (__builtin_expect(g_tls_sll_enable, 1)) {
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// Use Box TLS-SLL API (C7-safe pop)
|
|
|
|
|
|
// CRITICAL: Pop FIRST, do NOT read g_tls_sll_head directly (race condition!)
|
|
|
|
|
|
// Reading head before pop causes stale read → rbp=0xa0 SEGV
|
|
|
|
|
|
void* head = NULL;
|
|
|
|
|
|
if (tls_sll_pop(class_idx, &head)) {
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Front Gate: SLL hit (fast path 3 instructions)
|
|
|
|
|
|
extern unsigned long long g_front_sll_hit[];
|
|
|
|
|
|
g_front_sll_hit[class_idx]++;
|
2025-11-08 01:18:37 +09:00
|
|
|
|
|
2025-11-05 12:31:14 +09:00
|
|
|
|
#if HAKMEM_DEBUG_COUNTERS
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Track TLS freelist hits (compile-time gated, zero runtime cost when disabled)
|
|
|
|
|
|
g_free_via_tls_sll[class_idx]++;
|
2025-11-05 12:31:14 +09:00
|
|
|
|
#endif
|
|
|
|
|
|
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
|
|
|
|
|
// Debug: Track profiling (release builds skip this overhead)
|
2025-11-07 01:27:04 +09:00
|
|
|
|
if (start) {
|
|
|
|
|
|
g_tiny_alloc_cycles += (tiny_fast_rdtsc() - start);
|
|
|
|
|
|
g_tiny_alloc_hits++;
|
|
|
|
|
|
}
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#endif
|
2025-11-07 01:27:04 +09:00
|
|
|
|
return head;
|
2025-11-05 06:35:03 +00:00
|
|
|
|
}
|
2025-11-05 12:31:14 +09:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// Fast path miss → NULL (caller should refill)
|
|
|
|
|
|
return NULL;
|
2025-11-07 01:27:04 +09:00
|
|
|
|
#endif
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Cascade Refill: SFC ← SLL (Box Theory boundary) ==========
|
|
|
|
|
|
|
|
|
|
|
|
// Cascade refill: Transfer blocks from SLL to SFC (one-way, safe)
|
|
|
|
|
|
// Returns: number of blocks transferred
|
|
|
|
|
|
//
|
|
|
|
|
|
// Contract:
|
|
|
|
|
|
// - Transfer ownership: SLL → SFC
|
|
|
|
|
|
// - No circular dependency: one-way only
|
|
|
|
|
|
// - Boundary clear: SLL pop → SFC push
|
|
|
|
|
|
// - Fallback safe: if SFC full, stop (no overflow)
|
|
|
|
|
|
static inline int sfc_refill_from_sll(int class_idx, int target_count) {
|
|
|
|
|
|
int transferred = 0;
|
|
|
|
|
|
uint32_t cap = g_sfc_capacity[class_idx];
|
|
|
|
|
|
|
|
|
|
|
|
while (transferred < target_count && g_tls_sll_count[class_idx] > 0) {
|
|
|
|
|
|
// Check SFC capacity before transfer
|
|
|
|
|
|
if (g_sfc_count[class_idx] >= cap) {
|
|
|
|
|
|
break; // SFC full, stop
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// Pop from SLL (Layer 1) using Box TLS-SLL API (C7-safe)
|
|
|
|
|
|
void* ptr = NULL;
|
|
|
|
|
|
if (!tls_sll_pop(class_idx, &ptr)) {
|
|
|
|
|
|
break; // SLL empty
|
|
|
|
|
|
}
|
2025-11-07 01:27:04 +09:00
|
|
|
|
|
2025-11-10 18:04:08 +09:00
|
|
|
|
// Push to SFC (Layer 0) — header-aware
|
|
|
|
|
|
#if HAKMEM_TINY_HEADER_CLASSIDX
|
|
|
|
|
|
const size_t sfc_next_off = (class_idx == 7) ? 0 : 1;
|
|
|
|
|
|
#else
|
|
|
|
|
|
const size_t sfc_next_off = 0;
|
|
|
|
|
|
#endif
|
|
|
|
|
|
*(void**)((uint8_t*)ptr + sfc_next_off) = g_sfc_head[class_idx];
|
2025-11-07 01:27:04 +09:00
|
|
|
|
g_sfc_head[class_idx] = ptr;
|
|
|
|
|
|
g_sfc_count[class_idx]++;
|
|
|
|
|
|
|
|
|
|
|
|
transferred++;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
return transferred;
|
2025-11-05 12:31:14 +09:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Refill Path: Backend Integration ==========
|
|
|
|
|
|
|
|
|
|
|
|
// Refill TLS freelist from backend (SuperSlab/ACE/Learning layer)
|
|
|
|
|
|
// Returns: number of blocks refilled
|
|
|
|
|
|
//
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Box 5-NEW Architecture:
|
|
|
|
|
|
// SFC enabled: SuperSlab → SLL → SFC (cascade)
|
|
|
|
|
|
// SFC disabled: SuperSlab → SLL (direct, old path)
|
|
|
|
|
|
//
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// This integrates with existing HAKMEM infrastructure:
|
|
|
|
|
|
// - SuperSlab provides memory chunks
|
|
|
|
|
|
// - ACE provides adaptive capacity learning
|
|
|
|
|
|
// - L25 provides mid-large integration
|
|
|
|
|
|
//
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// Refill count is tunable via HAKMEM_TINY_REFILL_COUNT (default: 16)
|
2025-11-05 12:31:14 +09:00
|
|
|
|
// - Smaller count (8-16): better for diverse workloads, faster warmup
|
|
|
|
|
|
// - Larger count (64-128): better for homogeneous workloads, fewer refills
|
|
|
|
|
|
static inline int tiny_alloc_fast_refill(int class_idx) {
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// CRITICAL: C7 (1KB) is headerless - skip refill completely, force slow path
|
|
|
|
|
|
// Reason: Refill pushes blocks to TLS SLL which stores next pointer in user data
|
|
|
|
|
|
// C7's headerless design is incompatible with this mechanism
|
|
|
|
|
|
if (__builtin_expect(class_idx == 7, 0)) {
|
|
|
|
|
|
return 0; // Skip refill, force slow path allocation
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// Phase 7 Task 3: Profiling overhead removed in release builds
|
|
|
|
|
|
// In release mode, compiler can completely eliminate profiling code
|
|
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
2025-11-05 06:35:03 +00:00
|
|
|
|
uint64_t start = tiny_profile_enabled() ? tiny_fast_rdtsc() : 0;
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#endif
|
2025-11-05 06:35:03 +00:00
|
|
|
|
|
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00
|
|
|
|
// Phase 2b: Check available capacity before refill
|
|
|
|
|
|
int available_capacity = get_available_capacity(class_idx);
|
|
|
|
|
|
if (available_capacity <= 0) {
|
|
|
|
|
|
// Cache is full, don't refill
|
|
|
|
|
|
return 0;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// Phase 7 Task 3: Simplified refill count (cached per-class in TLS)
|
|
|
|
|
|
// Previous: Complex precedence logic on every miss (5-10 cycles overhead)
|
|
|
|
|
|
// Now: Simple TLS cache lookup (1-2 cycles)
|
2025-11-05 17:45:11 +09:00
|
|
|
|
static __thread int s_refill_count[TINY_NUM_CLASSES] = {0};
|
2025-11-09 18:55:50 +09:00
|
|
|
|
// Simple adaptive booster: bump per-class refill size when refills are frequent.
|
|
|
|
|
|
static __thread uint8_t s_refill_calls[TINY_NUM_CLASSES] = {0};
|
2025-11-05 17:45:11 +09:00
|
|
|
|
int cnt = s_refill_count[class_idx];
|
|
|
|
|
|
if (__builtin_expect(cnt == 0, 0)) {
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// First miss: Initialize from globals (parsed at init time)
|
|
|
|
|
|
int v = HAKMEM_TINY_REFILL_DEFAULT; // Default from hakmem_build_flags.h
|
|
|
|
|
|
|
|
|
|
|
|
// Precedence: per-class > hot/mid > global
|
2025-11-05 17:45:11 +09:00
|
|
|
|
if (g_refill_count_class[class_idx] > 0) {
|
|
|
|
|
|
v = g_refill_count_class[class_idx];
|
|
|
|
|
|
} else if (class_idx <= 3 && g_refill_count_hot > 0) {
|
|
|
|
|
|
v = g_refill_count_hot;
|
|
|
|
|
|
} else if (class_idx >= 4 && g_refill_count_mid > 0) {
|
|
|
|
|
|
v = g_refill_count_mid;
|
|
|
|
|
|
} else if (g_refill_count_global > 0) {
|
|
|
|
|
|
v = g_refill_count_global;
|
|
|
|
|
|
}
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
2025-11-08 12:54:52 +09:00
|
|
|
|
// Clamp to sane range (min: 8, max: 256)
|
2025-11-05 12:31:14 +09:00
|
|
|
|
if (v < 8) v = 8; // Minimum: avoid thrashing
|
|
|
|
|
|
if (v > 256) v = 256; // Maximum: avoid excessive TLS memory
|
|
|
|
|
|
|
2025-11-05 17:45:11 +09:00
|
|
|
|
s_refill_count[class_idx] = v;
|
|
|
|
|
|
cnt = v;
|
2025-11-05 12:31:14 +09:00
|
|
|
|
}
|
|
|
|
|
|
|
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00
|
|
|
|
// Phase 2b: Clamp refill count to available capacity
|
|
|
|
|
|
if (cnt > available_capacity) {
|
|
|
|
|
|
cnt = available_capacity;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-05 12:31:14 +09:00
|
|
|
|
#if HAKMEM_DEBUG_COUNTERS
|
|
|
|
|
|
// Track refill calls (compile-time gated)
|
|
|
|
|
|
g_rf_total_calls[class_idx]++;
|
|
|
|
|
|
#endif
|
|
|
|
|
|
|
|
|
|
|
|
// Box Boundary: Delegate to Backend (Box 3: SuperSlab)
|
|
|
|
|
|
// This gives us ACE, Learning layer, L25 integration for free!
|
2025-11-09 22:12:34 +09:00
|
|
|
|
// P0 Fix: Use appropriate refill function based on P0 status
|
|
|
|
|
|
#if HAKMEM_TINY_P0_BATCH_REFILL
|
|
|
|
|
|
int refilled = sll_refill_batch_from_ss(class_idx, cnt);
|
|
|
|
|
|
#else
|
2025-11-05 17:45:11 +09:00
|
|
|
|
int refilled = sll_refill_small_from_ss(class_idx, cnt);
|
2025-11-09 22:12:34 +09:00
|
|
|
|
#endif
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
2025-11-09 18:55:50 +09:00
|
|
|
|
// Lightweight adaptation: if refills keep happening, increase per-class refill.
|
|
|
|
|
|
// Focus on class 7 (1024B) to reduce mmap/refill frequency under Tiny-heavy loads.
|
|
|
|
|
|
if (refilled > 0) {
|
|
|
|
|
|
uint8_t c = ++s_refill_calls[class_idx];
|
|
|
|
|
|
if (class_idx == 7) {
|
|
|
|
|
|
// Every 4 refills, increase target by +16 up to 128 (unless overridden).
|
|
|
|
|
|
if ((c & 0x03u) == 0) {
|
|
|
|
|
|
int target = s_refill_count[class_idx];
|
|
|
|
|
|
if (target < 128) {
|
|
|
|
|
|
target += 16;
|
|
|
|
|
|
if (target > 128) target = 128;
|
|
|
|
|
|
s_refill_count[class_idx] = target;
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
} else {
|
|
|
|
|
|
// No refill performed (capacity full): slowly decay the counter.
|
|
|
|
|
|
if (s_refill_calls[class_idx] > 0) s_refill_calls[class_idx]--;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓
Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
Result: +180-280% improvement, 85-146% of System malloc
Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)
Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
Result: 50% → 95% stability (19/20 4T success)
Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
Files: core/tiny_adaptive_sizing.c/h (new)
Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
Files: core/hakmem_bigcache.c/h
Expected: +10-20% cache hit rate
Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)
Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis
Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files
Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)
🤖 Generated with Claude Code (https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00
|
|
|
|
// Phase 2b: Track refill and adapt cache size
|
|
|
|
|
|
if (refilled > 0) {
|
|
|
|
|
|
track_refill_for_adaptation(class_idx);
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-07 01:27:04 +09:00
|
|
|
|
// Box 5-NEW: Cascade refill SFC ← SLL (if SFC enabled)
|
|
|
|
|
|
// This happens AFTER SuperSlab → SLL refill, so SLL has blocks
|
|
|
|
|
|
static __thread int sfc_check_done_refill = 0;
|
|
|
|
|
|
static __thread int sfc_is_enabled_refill = 0;
|
|
|
|
|
|
if (__builtin_expect(!sfc_check_done_refill, 0)) {
|
|
|
|
|
|
sfc_is_enabled_refill = g_sfc_enabled;
|
|
|
|
|
|
sfc_check_done_refill = 1;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
if (sfc_is_enabled_refill && refilled > 0) {
|
|
|
|
|
|
// Transfer half of refilled blocks to SFC (keep half in SLL for future)
|
|
|
|
|
|
int sfc_target = refilled / 2;
|
|
|
|
|
|
if (sfc_target > 0) {
|
|
|
|
|
|
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
|
|
|
|
|
|
front_gate_after_refill(class_idx, refilled);
|
|
|
|
|
|
#else
|
|
|
|
|
|
int transferred = sfc_refill_from_sll(class_idx, sfc_target);
|
|
|
|
|
|
(void)transferred; // Unused, but could track stats
|
|
|
|
|
|
#endif
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
|
|
|
|
|
// Debug: Track profiling (release builds skip this overhead)
|
2025-11-05 06:35:03 +00:00
|
|
|
|
if (start) {
|
|
|
|
|
|
g_tiny_refill_cycles += (tiny_fast_rdtsc() - start);
|
|
|
|
|
|
g_tiny_refill_calls++;
|
|
|
|
|
|
}
|
2025-11-08 12:54:52 +09:00
|
|
|
|
#endif
|
2025-11-05 06:35:03 +00:00
|
|
|
|
|
2025-11-05 12:31:14 +09:00
|
|
|
|
return refilled;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Combined Fast Path (Alloc + Refill) ==========
|
|
|
|
|
|
|
|
|
|
|
|
// Complete fast path allocation (inline for zero-cost)
|
|
|
|
|
|
// Returns: pointer on success, NULL on failure (OOM or size too large)
|
|
|
|
|
|
//
|
|
|
|
|
|
// Flow:
|
|
|
|
|
|
// 1. TLS freelist pop (3-4 instructions) - Hit rate ~95%
|
|
|
|
|
|
// 2. Miss → Refill from backend (~5% cases)
|
|
|
|
|
|
// 3. Refill success → Retry pop
|
|
|
|
|
|
// 4. Refill failure → Slow path (OOM or new SuperSlab allocation)
|
|
|
|
|
|
//
|
|
|
|
|
|
// Example usage:
|
|
|
|
|
|
// void* ptr = tiny_alloc_fast(64);
|
|
|
|
|
|
// if (!ptr) {
|
|
|
|
|
|
// // OOM handling
|
|
|
|
|
|
// }
|
|
|
|
|
|
static inline void* tiny_alloc_fast(size_t size) {
|
|
|
|
|
|
// 1. Size → class index (inline, fast)
|
|
|
|
|
|
int class_idx = hak_tiny_size_to_class(size);
|
|
|
|
|
|
if (__builtin_expect(class_idx < 0, 0)) {
|
|
|
|
|
|
return NULL; // Size > 1KB, not Tiny
|
|
|
|
|
|
}
|
2025-11-07 01:27:04 +09:00
|
|
|
|
ROUTE_BEGIN(class_idx);
|
2025-11-05 12:31:14 +09:00
|
|
|
|
|
|
|
|
|
|
// 2. Fast path: TLS freelist pop (3-4 instructions, 95% hit rate)
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// CRITICAL: Use Box TLS-SLL API (static inline, same performance as macro but SAFE!)
|
|
|
|
|
|
// The old macro had race condition: read head before pop → rbp=0xa0 SEGV
|
|
|
|
|
|
void* ptr = NULL;
|
|
|
|
|
|
tls_sll_pop(class_idx, &ptr);
|
2025-11-05 12:31:14 +09:00
|
|
|
|
if (__builtin_expect(ptr != NULL, 1)) {
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// C7 (1024B, headerless): clear embedded next pointer before returning to user
|
|
|
|
|
|
if (__builtin_expect(class_idx == 7, 0)) {
|
|
|
|
|
|
*(void**)ptr = NULL;
|
|
|
|
|
|
}
|
2025-11-05 12:31:14 +09:00
|
|
|
|
HAK_RET_ALLOC(class_idx, ptr);
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// 3. Miss: Refill from backend (Box 3: SuperSlab)
|
|
|
|
|
|
int refilled = tiny_alloc_fast_refill(class_idx);
|
|
|
|
|
|
if (__builtin_expect(refilled > 0, 1)) {
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// Refill success → retry pop using safe Box TLS-SLL API
|
|
|
|
|
|
ptr = NULL;
|
|
|
|
|
|
tls_sll_pop(class_idx, &ptr);
|
2025-11-05 12:31:14 +09:00
|
|
|
|
if (ptr) {
|
2025-11-10 16:48:20 +09:00
|
|
|
|
if (__builtin_expect(class_idx == 7, 0)) {
|
|
|
|
|
|
*(void**)ptr = NULL;
|
|
|
|
|
|
}
|
2025-11-05 12:31:14 +09:00
|
|
|
|
HAK_RET_ALLOC(class_idx, ptr);
|
|
|
|
|
|
}
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// 4. Refill failure or still empty → slow path (OOM or new SuperSlab)
|
|
|
|
|
|
// Box Boundary: Delegate to Slow Path (Box 3 backend)
|
|
|
|
|
|
ptr = hak_tiny_alloc_slow(size, class_idx);
|
|
|
|
|
|
if (ptr) {
|
|
|
|
|
|
HAK_RET_ALLOC(class_idx, ptr);
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
return ptr; // NULL if OOM
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Push to TLS Freelist (for free path) ==========
|
|
|
|
|
|
|
|
|
|
|
|
// Push block to TLS freelist (used by free fast path)
|
|
|
|
|
|
// This is a "helper" for Box 6 (Free Fast Path)
|
|
|
|
|
|
//
|
|
|
|
|
|
// Invariant: ptr must belong to current thread (no ownership check here)
|
|
|
|
|
|
// Caller (Box 6) is responsible for ownership verification
|
|
|
|
|
|
static inline void tiny_alloc_fast_push(int class_idx, void* ptr) {
|
2025-11-07 01:27:04 +09:00
|
|
|
|
#ifdef HAKMEM_TINY_FRONT_GATE_BOX
|
|
|
|
|
|
front_gate_push_tls(class_idx, ptr);
|
|
|
|
|
|
#else
|
2025-11-10 16:48:20 +09:00
|
|
|
|
// Box Boundary: Push to TLS freelist using Box TLS-SLL API (C7-safe)
|
|
|
|
|
|
uint32_t capacity = UINT32_MAX; // Unlimited for helper function
|
|
|
|
|
|
if (!tls_sll_push(class_idx, ptr, capacity)) {
|
|
|
|
|
|
// C7 rejected or SLL somehow full (should not happen)
|
|
|
|
|
|
// In release builds, this is a no-op (caller expects success)
|
|
|
|
|
|
#if !HAKMEM_BUILD_RELEASE
|
|
|
|
|
|
fprintf(stderr, "[WARN] tls_sll_push failed in tiny_alloc_fast_push cls=%d ptr=%p\n",
|
|
|
|
|
|
class_idx, ptr);
|
|
|
|
|
|
#endif
|
|
|
|
|
|
}
|
2025-11-07 01:27:04 +09:00
|
|
|
|
#endif
|
2025-11-05 12:31:14 +09:00
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Statistics & Diagnostics ==========
|
|
|
|
|
|
|
|
|
|
|
|
// Get TLS freelist stats (for debugging/profiling)
|
|
|
|
|
|
typedef struct {
|
|
|
|
|
|
int class_idx;
|
|
|
|
|
|
void* head;
|
|
|
|
|
|
uint32_t count;
|
|
|
|
|
|
} TinyAllocFastStats;
|
|
|
|
|
|
|
|
|
|
|
|
static inline TinyAllocFastStats tiny_alloc_fast_stats(int class_idx) {
|
|
|
|
|
|
TinyAllocFastStats stats = {
|
|
|
|
|
|
.class_idx = class_idx,
|
|
|
|
|
|
.head = g_tls_sll_head[class_idx],
|
|
|
|
|
|
.count = g_tls_sll_count[class_idx]
|
|
|
|
|
|
};
|
|
|
|
|
|
return stats;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// Reset TLS freelist (for testing/benchmarking)
|
|
|
|
|
|
// WARNING: This leaks memory! Only use in controlled test environments.
|
|
|
|
|
|
static inline void tiny_alloc_fast_reset(int class_idx) {
|
|
|
|
|
|
g_tls_sll_head[class_idx] = NULL;
|
|
|
|
|
|
g_tls_sll_count[class_idx] = 0;
|
|
|
|
|
|
}
|
|
|
|
|
|
|
|
|
|
|
|
// ========== Performance Notes ==========
|
|
|
|
|
|
//
|
|
|
|
|
|
// Expected metrics (based on System tcache & HAKX +171% results):
|
|
|
|
|
|
// - Fast path hit rate: 95%+ (workload dependent)
|
|
|
|
|
|
// - Fast path latency: 3-4 instructions (1-2 cycles on modern CPUs)
|
|
|
|
|
|
// - Miss penalty: ~20-50 instructions (refill from SuperSlab)
|
|
|
|
|
|
// - Throughput improvement: +10-25% vs current multi-layer design
|
|
|
|
|
|
//
|
|
|
|
|
|
// Key optimizations:
|
|
|
|
|
|
// 1. `__builtin_expect` for branch prediction (hot path first)
|
|
|
|
|
|
// 2. `static inline` for zero-cost abstraction
|
|
|
|
|
|
// 3. TLS variables (no atomic ops, no locks)
|
|
|
|
|
|
// 4. Minimal work in fast path (defer stats/accounting to backend)
|
|
|
|
|
|
//
|
|
|
|
|
|
// Comparison with current design:
|
|
|
|
|
|
// - Current: 20+ instructions (Magazine → SuperSlab → ACE → ...)
|
|
|
|
|
|
// - New: 3-4 instructions (TLS freelist pop only)
|
|
|
|
|
|
// - Reduction: -80% instructions in hot path
|
|
|
|
|
|
//
|
|
|
|
|
|
// Inspired by:
|
|
|
|
|
|
// - System tcache (glibc malloc) - 3-4 instruction fast path
|
|
|
|
|
|
// - HAKX Mid-Large (+171%) - "Simple Front + Smart Back"
|
|
|
|
|
|
// - Box Theory - Clear boundaries, minimal coupling
|