Files
hakmem/core/tiny_fastcache.c
Moe Charm (CI) 72b38bc994 Phase E3-FINAL: Fix Box API offset bugs - ALL classes now use correct offsets
## Root Cause Analysis (GPT5)

**Physical Layout Constraints**:
- Class 0: 8B = [1B header][7B payload] → offset 1 = 9B needed =  IMPOSSIBLE
- Class 1-6: >=16B = [1B header][15B+ payload] → offset 1 =  POSSIBLE
- Class 7: 1KB → offset 0 (compatibility)

**Correct Specification**:
- HAKMEM_TINY_HEADER_CLASSIDX != 0:
  - Class 0, 7: next at offset 0 (overwrites header when on freelist)
  - Class 1-6: next at offset 1 (after header)
- HAKMEM_TINY_HEADER_CLASSIDX == 0:
  - All classes: next at offset 0

**Previous Bug**:
- Attempted "ALL classes offset 1" unification
- Class 0 with offset 1 caused immediate SEGV (9B > 8B block size)
- Mixed 2-arg/3-arg API caused confusion

## Fixes Applied

### 1. Restored 3-Argument Box API (core/box/tiny_next_ptr_box.h)
```c
// Correct signatures
void tiny_next_write(int class_idx, void* base, void* next_value)
void* tiny_next_read(int class_idx, const void* base)

// Correct offset calculation
size_t offset = (class_idx == 0 || class_idx == 7) ? 0 : 1;
```

### 2. Updated 123+ Call Sites Across 34 Files
- hakmem_tiny_hot_pop_v4.inc.h (4 locations)
- hakmem_tiny_fastcache.inc.h (3 locations)
- hakmem_tiny_tls_list.h (12 locations)
- superslab_inline.h (5 locations)
- tiny_fastcache.h (3 locations)
- ptr_trace.h (macro definitions)
- tls_sll_box.h (2 locations)
- + 27 additional files

Pattern: `tiny_next_read(base)` → `tiny_next_read(class_idx, base)`
Pattern: `tiny_next_write(base, next)` → `tiny_next_write(class_idx, base, next)`

### 3. Added Sentinel Detection Guards
- tiny_fast_push(): Block nodes with sentinel in ptr or ptr->next
- tls_list_push(): Block nodes with sentinel in ptr or ptr->next
- Defense-in-depth against remote free sentinel leakage

## Verification (GPT5 Report)

**Test Command**: `./out/release/bench_random_mixed_hakmem --iterations=70000`

**Results**:
-  Main loop completed successfully
-  Drain phase completed successfully
-  NO SEGV (previous crash at iteration 66151 is FIXED)
- ℹ️ Final log: "tiny_alloc(1024) failed" is normal fallback to Mid/ACE layers

**Analysis**:
- Class 0 immediate SEGV:  RESOLVED (correct offset 0 now used)
- 66K iteration crash:  RESOLVED (offset consistency fixed)
- Box API conflicts:  RESOLVED (unified 3-arg API)

## Technical Details

### Offset Logic Justification
```
Class 0:  8B block → next pointer (8B) fits ONLY at offset 0
Class 1: 16B block → next pointer (8B) fits at offset 1 (after 1B header)
Class 2: 32B block → next pointer (8B) fits at offset 1
...
Class 6: 512B block → next pointer (8B) fits at offset 1
Class 7: 1024B block → offset 0 for legacy compatibility
```

### Files Modified (Summary)
- Core API: `box/tiny_next_ptr_box.h`
- Hot paths: `hakmem_tiny_hot_pop*.inc.h`, `tiny_fastcache.h`
- TLS layers: `hakmem_tiny_tls_list.h`, `hakmem_tiny_tls_ops.h`
- SuperSlab: `superslab_inline.h`, `tiny_superslab_*.inc.h`
- Refill: `hakmem_tiny_refill.inc.h`, `tiny_refill_opt.h`
- Free paths: `tiny_free_magazine.inc.h`, `tiny_superslab_free.inc.h`
- Documentation: Multiple Phase E3 reports

## Remaining Work

None for Box API offset bugs - all structural issues resolved.

Future enhancements (non-critical):
- Periodic `grep -R '*(void**)' core/` to detect direct pointer access violations
- Enforce Box API usage via static analysis
- Document offset rationale in architecture docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 06:50:20 +09:00

307 lines
12 KiB
C
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

// tiny_fastcache.c - Slow path for Tiny Fast Cache (refill/drain)
// Phase 6-3: Refill from Magazine/SuperSlab when fast cache misses
#include "tiny_fastcache.h"
#include "hakmem_tiny.h"
#include "hakmem_tiny_superslab.h"
#include "box/tiny_next_ptr_box.h" // Phase E1-CORRECT: Box API
#include <stdio.h>
#include <stdlib.h>
// ========== TLS Cache Definitions ==========
// (Declared as extern in tiny_fastcache.h)
// CRITICAL FIX: Explicit initializers prevent SEGV from uninitialized TLS in worker threads
__thread void* g_tiny_fast_cache[TINY_FAST_CLASS_COUNT] = {0};
__thread uint32_t g_tiny_fast_count[TINY_FAST_CLASS_COUNT] = {0};
__thread int g_tiny_fast_initialized = 0;
// ========== Phase 6-7: Dual Free Lists (Phase 2) ==========
// Inspired by mimalloc's local/remote split design
// Separate alloc/free paths to reduce cache line bouncing
__thread void* g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT] = {0}; // Free staging area
__thread uint32_t g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT] = {0}; // Free count
// ========== External References ==========
// External references to existing Tiny infrastructure (from hakmem_tiny.c)
extern __thread void* g_tls_sll_head[];
extern __thread uint32_t g_tls_sll_count[];
extern int g_use_superslab;
// From hakmem_tiny.c
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);
// ========== Batch Refill Configuration ==========
// How many blocks to refill per miss (batch amortization)
#ifndef TINY_FAST_REFILL_BATCH
#define TINY_FAST_REFILL_BATCH 16
#endif
// ========== Debug Counters ==========
static __thread uint64_t g_tiny_fast_refill_count = 0;
static __thread uint64_t g_tiny_fast_drain_count = 0;
// ========== RDTSC Cycle Profiling ==========
// Ultra-lightweight profiling using CPU Time-Stamp Counter (~10 cycles overhead)
#ifdef __x86_64__
static inline uint64_t rdtsc(void) {
unsigned int lo, hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
#else
static inline uint64_t rdtsc(void) { return 0; } // Fallback for non-x86
#endif
// Per-thread cycle counters (gated by HAKMEM_TINY_PROFILE env var)
// Declared as extern in tiny_fastcache.h for inline functions
__thread uint64_t g_tiny_malloc_count = 0;
__thread uint64_t g_tiny_malloc_cycles = 0;
__thread uint64_t g_tiny_free_count = 0;
__thread uint64_t g_tiny_free_cycles = 0;
__thread uint64_t g_tiny_refill_cycles = 0;
__thread uint64_t g_tiny_migration_count = 0;
__thread uint64_t g_tiny_migration_cycles = 0;
// Refill failure tracking
static __thread uint64_t g_refill_success_count = 0;
static __thread uint64_t g_refill_partial_count = 0; // Some blocks allocated
static __thread uint64_t g_refill_fail_count = 0; // Zero blocks allocated
static __thread uint64_t g_refill_total_blocks = 0; // Total blocks actually allocated
int g_profile_enabled = -1; // -1: uninitialized, 0: off, 1: on (extern in header)
static inline int profile_enabled(void) {
if (__builtin_expect(g_profile_enabled == -1, 0)) {
const char* env = getenv("HAKMEM_TINY_PROFILE");
g_profile_enabled = (env && *env && *env != '0') ? 1 : 0;
}
return g_profile_enabled;
}
// Forward declarations for atexit registration
void tiny_fast_print_stats(void);
void tiny_fast_print_profile(void);
// ========== Slow Path: Refill from Magazine/SuperSlab ==========
void* tiny_fast_refill(int class_idx) {
uint64_t start = profile_enabled() ? rdtsc() : 0;
if (class_idx < 0 || class_idx >= TINY_FAST_CLASS_COUNT) {
return NULL;
}
g_tiny_fast_refill_count++;
// Register stats printer on first refill (once per thread)
static __thread int stats_registered = 0;
if (!stats_registered) {
atexit(tiny_fast_print_stats);
if (profile_enabled()) {
atexit(tiny_fast_print_profile);
}
stats_registered = 1;
}
// ========================================================================
// Phase 6-6: Batch Refill Optimization (Phase 3)
// Inspired by mimalloc's page-based refill and glibc's tcache batch refill
//
// OLD: 16 individual allocations + 16 individual pushes (16 × 100 cycles = 1,600 cycles)
// NEW: Batch allocate + link in one pass (~200 cycles, -87% cost)
// ========================================================================
// Get size from class mapping
static const size_t class_sizes[] = {16, 24, 32, 40, 48, 56, 64, 80, 96, 112, 128, 144, 160, 176, 192, 256};
size_t size = (class_idx < 16) ? class_sizes[class_idx] : 16;
// Step 1: Batch allocate into temporary array
void* batch[TINY_FAST_REFILL_BATCH];
int count = 0;
extern void* hak_tiny_alloc(size_t size);
for (int i = 0; i < TINY_FAST_REFILL_BATCH; i++) {
void* ptr = hak_tiny_alloc(size);
if (!ptr) break; // OOM or allocation failed
batch[count++] = ptr;
}
// Track refill results
if (count == 0) {
g_refill_fail_count++;
return NULL; // Complete failure
} else if (count < TINY_FAST_REFILL_BATCH) {
g_refill_partial_count++;
} else {
g_refill_success_count++;
}
g_refill_total_blocks += count;
// Step 2: Link all blocks into freelist in one pass (batch linking)
// This is the key optimization: N individual pushes → 1 batch link
for (int i = 0; i < count - 1; i++) {
tiny_next_write(class_idx, batch[i], batch[i + 1]);
}
tiny_next_write(class_idx, batch[count - 1], NULL); // Terminate list
// Step 3: Attach batch to cache head
g_tiny_fast_cache[class_idx] = batch[0];
g_tiny_fast_count[class_idx] = count;
// Step 4: Pop one for the caller
void* result = g_tiny_fast_cache[class_idx];
g_tiny_fast_cache[class_idx] = tiny_next_read(class_idx, result);
g_tiny_fast_count[class_idx]--;
// Profile: Record refill cycles
if (start) {
g_tiny_refill_cycles += (rdtsc() - start);
}
return result;
}
// ========== Slow Path: Drain to Magazine/SuperSlab ==========
void tiny_fast_drain(int class_idx) {
if (class_idx < 0 || class_idx >= TINY_FAST_CLASS_COUNT) {
return;
}
g_tiny_fast_drain_count++;
// ========================================================================
// Phase 6-7: Drain from free_head (Phase 2)
// Since frees go to free_head, drain from there when capacity exceeded
// ========================================================================
// Drain half of the free_head to Magazine/SuperSlab
// TODO: For now, we just reduce the count limit
// In a full implementation, we'd push blocks back to Magazine freelist
// Simple approach: just drop half the cache (temporary, for testing)
// A full implementation would return blocks to SuperSlab freelist
uint32_t target = TINY_FAST_CACHE_CAP / 2;
while (g_tiny_fast_free_count[class_idx] > target) {
void* ptr = g_tiny_fast_free_head[class_idx];
if (!ptr) break;
g_tiny_fast_free_head[class_idx] = tiny_next_read(class_idx, ptr);
g_tiny_fast_free_count[class_idx]--;
// TODO: Return to Magazine/SuperSlab
// For now, we'll just re-push it (no-op, but prevents loss)
// In production, call hak_tiny_free_slow(ptr, class_idx)
}
}
// ========== Debug Stats ==========
void tiny_fast_print_stats(void) {
static const char* env = NULL;
static int checked = 0;
if (!checked) {
env = getenv("HAKMEM_TINY_FAST_STATS");
checked = 1;
}
if (env && *env && *env != '0') {
fprintf(stderr, "[TINY_FAST] refills=%lu drains=%lu\n",
(unsigned long)g_tiny_fast_refill_count,
(unsigned long)g_tiny_fast_drain_count);
}
}
// ========== RDTSC Cycle Profiling Output ==========
// External routing counters from hakmem.c
extern __thread uint64_t g_malloc_total_calls;
extern __thread uint64_t g_malloc_tiny_size_match;
extern __thread uint64_t g_malloc_fast_path_tried;
extern __thread uint64_t g_malloc_fast_path_null;
extern __thread uint64_t g_malloc_slow_path;
void tiny_fast_print_profile(void) {
#ifndef HAKMEM_FORCE_LIBC_ALLOC_BUILD
if (!profile_enabled()) return;
if (g_tiny_malloc_count == 0 && g_tiny_free_count == 0) return; // No data
fprintf(stderr, "\n========== HAKMEM Tiny Fast Path Profile (RDTSC cycles) ==========\n");
// Routing statistics first
if (g_malloc_total_calls > 0) {
fprintf(stderr, "\n[ROUTING]\n");
fprintf(stderr, " Total malloc() calls: %lu\n", (unsigned long)g_malloc_total_calls);
fprintf(stderr, " Size <= %d (tiny range): %lu (%.1f%%)\n",
TINY_FAST_THRESHOLD,
(unsigned long)g_malloc_tiny_size_match,
100.0 * g_malloc_tiny_size_match / g_malloc_total_calls);
fprintf(stderr, " Fast path tried: %lu (%.1f%%)\n",
(unsigned long)g_malloc_fast_path_tried,
100.0 * g_malloc_fast_path_tried / g_malloc_total_calls);
fprintf(stderr, " Fast path returned NULL: %lu (%.1f%% of tried)\n",
(unsigned long)g_malloc_fast_path_null,
g_malloc_fast_path_tried > 0 ? 100.0 * g_malloc_fast_path_null / g_malloc_fast_path_tried : 0);
fprintf(stderr, " Slow path entered: %lu (%.1f%%)\n\n",
(unsigned long)g_malloc_slow_path,
100.0 * g_malloc_slow_path / g_malloc_total_calls);
}
if (g_tiny_malloc_count > 0) {
uint64_t avg_malloc = g_tiny_malloc_cycles / g_tiny_malloc_count;
fprintf(stderr, "[MALLOC] count=%lu, total_cycles=%lu, avg_cycles=%lu\n",
(unsigned long)g_tiny_malloc_count,
(unsigned long)g_tiny_malloc_cycles,
(unsigned long)avg_malloc);
}
if (g_tiny_free_count > 0) {
uint64_t avg_free = g_tiny_free_cycles / g_tiny_free_count;
fprintf(stderr, "[FREE] count=%lu, total_cycles=%lu, avg_cycles=%lu\n",
(unsigned long)g_tiny_free_count,
(unsigned long)g_tiny_free_cycles,
(unsigned long)avg_free);
}
if (g_tiny_fast_refill_count > 0) {
uint64_t avg_refill = g_tiny_refill_cycles / g_tiny_fast_refill_count;
fprintf(stderr, "[REFILL] count=%lu, total_cycles=%lu, avg_cycles=%lu\n",
(unsigned long)g_tiny_fast_refill_count,
(unsigned long)g_tiny_refill_cycles,
(unsigned long)avg_refill);
// Refill success/failure breakdown
fprintf(stderr, "[REFILL SUCCESS] count=%lu (%.1f%%) - full batch\n",
(unsigned long)g_refill_success_count,
100.0 * g_refill_success_count / g_tiny_fast_refill_count);
fprintf(stderr, "[REFILL PARTIAL] count=%lu (%.1f%%) - some blocks\n",
(unsigned long)g_refill_partial_count,
100.0 * g_refill_partial_count / g_tiny_fast_refill_count);
fprintf(stderr, "[REFILL FAIL] count=%lu (%.1f%%) - zero blocks\n",
(unsigned long)g_refill_fail_count,
100.0 * g_refill_fail_count / g_tiny_fast_refill_count);
fprintf(stderr, "[REFILL AVG BLOCKS] %.1f per refill (target=%d)\n",
(double)g_refill_total_blocks / g_tiny_fast_refill_count,
TINY_FAST_REFILL_BATCH);
}
if (g_tiny_migration_count > 0) {
uint64_t avg_migration = g_tiny_migration_cycles / g_tiny_migration_count;
fprintf(stderr, "[MIGRATE] count=%lu, total_cycles=%lu, avg_cycles=%lu\n",
(unsigned long)g_tiny_migration_count,
(unsigned long)g_tiny_migration_cycles,
(unsigned long)avg_migration);
}
fprintf(stderr, "===================================================================\n\n");
#endif // !HAKMEM_FORCE_LIBC_ALLOC_BUILD
}