Moe Charm (CI) 707056b765 feat: Phase 7 + Phase 2 - Massive performance & stability improvements
Performance Achievements:
- Tiny allocations: +180-280% (21M → 59-70M ops/s random mixed)
- Single-thread: +24% (2.71M → 3.36M ops/s Larson)
- 4T stability: 0% → 95% (19/20 success rate)
- Overall: 91.3% of System malloc average (target was 40-55%) ✓

Phase 7 (Tasks 1-3): Core Optimizations
- Task 1: Header validation removal (Region-ID direct lookup)
- Task 2: Aggressive inline (TLS cache access optimization)
- Task 3: Pre-warm TLS cache (eliminate cold-start penalty)
  Result: +180-280% improvement, 85-146% of System malloc

Critical Bug Fixes:
- Fix 64B allocation crash (size-to-class +1 for header)
- Fix 4T wrapper recursion bugs (BUG #7, #8, #10, #11)
- Remove malloc fallback (30% → 50% stability)

Phase 2a: SuperSlab Dynamic Expansion (CRITICAL)
- Implement mimalloc-style chunk linking
- Unlimited slab expansion (no more OOM at 32 slabs)
- Fix chunk initialization bug (bitmap=0x00000001 after expansion)
  Files: core/hakmem_tiny_superslab.c/h, core/superslab/superslab_types.h
  Result: 50% → 95% stability (19/20 4T success)

Phase 2b: TLS Cache Adaptive Sizing
- Dynamic capacity: 16-2048 slots based on usage
- High-water mark tracking + exponential growth/shrink
- Expected: +3-10% performance, -30-50% memory
  Files: core/tiny_adaptive_sizing.c/h (new)

Phase 2c: BigCache Dynamic Hash Table
- Migrate from fixed 256×8 array to dynamic hash table
- Auto-resize: 256 → 512 → 1024 → 65,536 buckets
- Improved hash function (FNV-1a) + collision chaining
  Files: core/hakmem_bigcache.c/h
  Expected: +10-20% cache hit rate

Design Flaws Analysis:
- Identified 6 components with fixed-capacity bottlenecks
- SuperSlab (CRITICAL), TLS Cache (HIGH), BigCache/L2.5 (MEDIUM)
- Report: DESIGN_FLAWS_ANALYSIS.md (11 chapters)

Documentation:
- 13 comprehensive reports (PHASE*.md, DESIGN_FLAWS*.md)
- Implementation guides, test results, production readiness
- Bug fix reports, root cause analysis

Build System:
- Makefile: phase7 targets, PREWARM_TLS flag
- Auto dependency generation (-MMD -MP) for .inc files

Known Issues:
- 4T stability: 19/20 (95%) - investigating 1 failure for 100%
- L2.5 Pool dynamic sharding: design only (needs 2-3 days integration)

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 17:08:00 +09:00


# Phase 2b: TLS Cache Adaptive Sizing
**Date**: 2025-11-08
**Priority**: 🟡 HIGH - Performance optimization
**Estimated Effort**: 3-5 days
**Status**: Ready for implementation
**Depends on**: Phase 2a (not blocking, can run in parallel)
---
## Executive Summary
**Problem**: TLS Cache has fixed capacity (256-768 slots) → Cannot adapt to workload
**Solution**: Implement adaptive sizing with high-water mark tracking
**Expected Result**: Hot classes get more cache → Better hit rate → Higher throughput
---
## Current Architecture (INEFFICIENT)
### Fixed Capacity
```c
// core/hakmem_tiny.c or similar
#define TLS_SLL_CAP_DEFAULT 256
static __thread int g_tls_sll_count[TINY_NUM_CLASSES];
static __thread void* g_tls_sll_head[TINY_NUM_CLASSES];
// Fixed capacity for all classes!
// Hot class (e.g., class 4 in Larson) → cache thrashes
// Cold class (e.g., class 0 rarely used) → wastes memory
```
### Why This is Inefficient
**Scenario 1: Hot class (class 4 - 128B allocations)**
```
Larson 4T: 4000+ concurrent 128B allocations
TLS cache capacity: 256 slots
Hit rate: ~6% (256/4000)
Result: Constant refill overhead → poor performance
```
**Scenario 2: Cold class (class 0 - 16B allocations)**
```
Usage: ~10 allocations per minute
TLS cache capacity: 256 slots
Waste: 246 idle slots × 16B = 3936B per thread
```
---
## Proposed Architecture (ADAPTIVE)
### High-Water Mark Tracking
```c
typedef struct TLSCacheStats {
    size_t   capacity;          // Current capacity
    size_t   high_water_mark;   // Peak usage in recent window
    size_t   refill_count;      // Number of refills in recent window
    uint64_t last_adapt_time;   // Timestamp of last adaptation
} TLSCacheStats;

static __thread TLSCacheStats g_tls_cache_stats[TINY_NUM_CLASSES];
```
### Adaptive Sizing Logic
```c
// Periodically adapt cache size based on usage
void adapt_tls_cache_size(int class_idx) {
    TLSCacheStats* stats = &g_tls_cache_stats[class_idx];

    // Update high-water mark
    if (g_tls_sll_count[class_idx] > stats->high_water_mark) {
        stats->high_water_mark = g_tls_sll_count[class_idx];
    }

    // Adapt every N refills or M seconds
    uint64_t now = get_timestamp_ns();
    if (stats->refill_count < ADAPT_REFILL_THRESHOLD &&
        (now - stats->last_adapt_time) < ADAPT_TIME_THRESHOLD_NS) {
        return;  // Too soon to adapt
    }

    // Decide: grow, shrink, or keep
    if (stats->high_water_mark > stats->capacity * 0.8) {
        // High usage → grow cache (2x)
        grow_tls_cache(class_idx);
    } else if (stats->high_water_mark < stats->capacity * 0.2) {
        // Low usage → shrink cache (0.5x)
        shrink_tls_cache(class_idx);
    }

    // Reset stats for next window
    stats->high_water_mark = g_tls_sll_count[class_idx];
    stats->refill_count = 0;
    stats->last_adapt_time = now;
}
```
---
## Implementation Tasks
### Task 1: Add Adaptive Sizing Stats (1-2 hours)
**File**: `core/hakmem_tiny.c` or TLS cache code
```c
// Per-class TLS cache statistics
typedef struct TLSCacheStats {
    size_t   capacity;          // Current capacity
    size_t   high_water_mark;   // Peak usage in recent window
    size_t   refill_count;      // Refills since last adapt
    size_t   shrink_count;      // Shrinks (for debugging)
    size_t   grow_count;        // Grows (for debugging)
    uint64_t last_adapt_time;   // Timestamp of last adaptation
} TLSCacheStats;

static __thread TLSCacheStats g_tls_cache_stats[TINY_NUM_CLASSES];

// Configuration
#define TLS_CACHE_MIN_CAPACITY      16    // Minimum cache size
#define TLS_CACHE_MAX_CAPACITY      2048  // Maximum cache size
#define TLS_CACHE_INITIAL_CAPACITY  64    // Initial size (reduced from 256)
#define ADAPT_REFILL_THRESHOLD      10    // Adapt every 10 refills
#define ADAPT_TIME_THRESHOLD_NS     (1000000000ULL)  // Or every 1 second

// Growth thresholds
#define GROW_THRESHOLD   0.8  // Grow if usage > 80% of capacity
#define SHRINK_THRESHOLD 0.2  // Shrink if usage < 20% of capacity
```
### Task 2: Implement Grow/Shrink Functions (2-3 hours)
**File**: `core/hakmem_tiny.c`
```c
// Forward declaration (defined below, used by shrink_tls_cache)
static void drain_excess_blocks(int class_idx, int count);

// Grow TLS cache capacity (2x)
static void grow_tls_cache(int class_idx) {
    TLSCacheStats* stats = &g_tls_cache_stats[class_idx];
    size_t old_capacity = stats->capacity;
    size_t new_capacity = old_capacity * 2;
    if (new_capacity > TLS_CACHE_MAX_CAPACITY) {
        new_capacity = TLS_CACHE_MAX_CAPACITY;
    }
    if (new_capacity == old_capacity) {
        return;  // Already at max
    }
    stats->capacity = new_capacity;
    stats->grow_count++;
    fprintf(stderr, "[TLS_CACHE] Grow class %d: %zu → %zu slots (grow_count=%zu)\n",
            class_idx, old_capacity, new_capacity, stats->grow_count);
}

// Shrink TLS cache capacity (0.5x)
static void shrink_tls_cache(int class_idx) {
    TLSCacheStats* stats = &g_tls_cache_stats[class_idx];
    size_t old_capacity = stats->capacity;
    size_t new_capacity = old_capacity / 2;
    if (new_capacity < TLS_CACHE_MIN_CAPACITY) {
        new_capacity = TLS_CACHE_MIN_CAPACITY;
    }
    if (new_capacity == old_capacity) {
        return;  // Already at min
    }
    // Evict excess blocks if current count > new_capacity
    if ((size_t)g_tls_sll_count[class_idx] > new_capacity) {
        // Drain excess blocks back to SuperSlab
        int excess = g_tls_sll_count[class_idx] - (int)new_capacity;
        drain_excess_blocks(class_idx, excess);
    }
    stats->capacity = new_capacity;
    stats->shrink_count++;
    fprintf(stderr, "[TLS_CACHE] Shrink class %d: %zu → %zu slots (shrink_count=%zu)\n",
            class_idx, old_capacity, new_capacity, stats->shrink_count);
}

// Drain excess blocks back to SuperSlab
static void drain_excess_blocks(int class_idx, int count) {
    void** head = &g_tls_sll_head[class_idx];
    int drained = 0;
    while (*head && drained < count) {
        void* block = *head;
        *head = *(void**)block;  // Pop from TLS list
        // Return to SuperSlab (or freelist)
        return_block_to_superslab(block, class_idx);
        drained++;
        g_tls_sll_count[class_idx]--;
    }
    fprintf(stderr, "[TLS_CACHE] Drained %d excess blocks from class %d\n", drained, class_idx);
}
```
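The grow/shrink/drain paths above log unconditionally to stderr. Since Test 1 below drives debug output via `HAKMEM_LOG=1`, a minimal sketch of an env-gated logging macro is shown here; the macro name `TLS_CACHE_LOG` and the exact semantics of `HAKMEM_LOG` are assumptions, not existing project API.
```c
// Sketch (assumption): gate the [TLS_CACHE] diagnostics behind the HAKMEM_LOG
// environment variable used in Test 1, so the grow/shrink paths stay silent
// by default. TLS_CACHE_LOG is a hypothetical macro name.
#include <stdio.h>
#include <stdlib.h>

static int tls_cache_log_enabled(void) {
    static int cached = -1;  // -1 = not yet checked
    if (cached < 0) {
        const char* env = getenv("HAKMEM_LOG");
        cached = (env != NULL && env[0] == '1') ? 1 : 0;
    }
    return cached;
}

#define TLS_CACHE_LOG(...) \
    do { if (tls_cache_log_enabled()) fprintf(stderr, __VA_ARGS__); } while (0)
```
With such a gate, the `fprintf(stderr, ...)` calls in grow/shrink/drain could become `TLS_CACHE_LOG(...)` without changing the logged format that Test 1 greps for.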
### Task 3: Integrate Adaptation into Refill Path (2-3 hours)
**File**: `core/tiny_alloc_fast.inc.h` or refill code
```c
static inline int tiny_alloc_fast_refill(int class_idx) {
    // ... existing refill logic ...

    // Track refill for adaptive sizing
    TLSCacheStats* stats = &g_tls_cache_stats[class_idx];
    stats->refill_count++;

    // Update high-water mark
    if (g_tls_sll_count[class_idx] > stats->high_water_mark) {
        stats->high_water_mark = g_tls_sll_count[class_idx];
    }

    // Periodically adapt cache size
    adapt_tls_cache_size(class_idx);

    // ... rest of refill ...
}
```
### Task 4: Implement Adaptation Logic (2-3 hours)
**File**: `core/hakmem_tiny.c`
```c
// Helper: Get timestamp in nanoseconds
// (defined first so it is declared before use; requires <time.h> for clock_gettime)
static inline uint64_t get_timestamp_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

// Adapt TLS cache size based on usage patterns (requires <stdbool.h>)
static void adapt_tls_cache_size(int class_idx) {
    TLSCacheStats* stats = &g_tls_cache_stats[class_idx];

    // Adapt every N refills or M seconds
    uint64_t now = get_timestamp_ns();
    bool should_adapt = (stats->refill_count >= ADAPT_REFILL_THRESHOLD) ||
                        ((now - stats->last_adapt_time) >= ADAPT_TIME_THRESHOLD_NS);
    if (!should_adapt) {
        return;  // Too soon to adapt
    }

    // Calculate usage ratio
    double usage_ratio = (double)stats->high_water_mark / (double)stats->capacity;

    // Decide: grow, shrink, or keep
    if (usage_ratio > GROW_THRESHOLD) {
        // High usage (>80%) → grow cache
        grow_tls_cache(class_idx);
    } else if (usage_ratio < SHRINK_THRESHOLD) {
        // Low usage (<20%) → shrink cache
        shrink_tls_cache(class_idx);
    } else {
        // Moderate usage (20-80%) → keep current size
        fprintf(stderr, "[TLS_CACHE] Keep class %d at %zu slots (usage=%.1f%%)\n",
                class_idx, stats->capacity, usage_ratio * 100.0);
    }

    // Reset stats for next window
    stats->high_water_mark = g_tls_sll_count[class_idx];
    stats->refill_count = 0;
    stats->last_adapt_time = now;
}
```
### Task 5: Initialize Adaptive Stats (1 hour)
**File**: `core/hakmem_tiny.c`
```c
void hak_tiny_init(void) {
    // ... existing init ...

    // Initialize TLS cache stats for each class
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        TLSCacheStats* stats = &g_tls_cache_stats[class_idx];
        stats->capacity        = TLS_CACHE_INITIAL_CAPACITY;  // Start with 64 slots
        stats->high_water_mark = 0;
        stats->refill_count    = 0;
        stats->shrink_count    = 0;
        stats->grow_count      = 0;
        stats->last_adapt_time = get_timestamp_ns();

        // Initialize TLS cache head/count
        g_tls_sll_head[class_idx]  = NULL;
        g_tls_sll_count[class_idx] = 0;
    }
}
```
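One caveat worth flagging: `g_tls_cache_stats` is `__thread`, so the loop in `hak_tiny_init()` only initializes the stats of the thread that calls it; worker threads would otherwise start with zero-initialized stats (capacity 0). A minimal sketch of a lazy per-thread guard is shown below; `tls_cache_stats_ensure_init()` is a hypothetical helper, and calling it at the top of the refill path is an assumption.
```c
// Sketch (assumption): lazy per-thread initialization so every thread gets a
// sane starting capacity. tls_cache_stats_ensure_init() is a hypothetical
// helper name; invoking it at the top of tiny_alloc_fast_refill() is one option.
static __thread int g_tls_cache_stats_ready = 0;

static inline void tls_cache_stats_ensure_init(void) {
    if (__builtin_expect(g_tls_cache_stats_ready, 1)) {
        return;  // Fast path: already initialized for this thread
    }
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        TLSCacheStats* stats = &g_tls_cache_stats[class_idx];
        stats->capacity        = TLS_CACHE_INITIAL_CAPACITY;
        stats->last_adapt_time = get_timestamp_ns();
        // high_water_mark / refill_count / grow_count / shrink_count are
        // already zero thanks to TLS zero-initialization.
    }
    g_tls_cache_stats_ready = 1;
}
```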
### Task 6: Add Capacity Enforcement (2-3 hours)
**File**: `core/tiny_alloc_fast.inc.h`
```c
static inline int tiny_alloc_fast_refill(int class_idx) {
    TLSCacheStats* stats = &g_tls_cache_stats[class_idx];

    // Don't refill beyond current capacity
    int current_count   = g_tls_sll_count[class_idx];
    int available_slots = (int)stats->capacity - current_count;
    if (available_slots <= 0) {
        // Cache is full, don't refill
        fprintf(stderr, "[TLS_CACHE] Class %d cache full (%d/%zu), skipping refill\n",
                class_idx, current_count, stats->capacity);
        return -1;  // Signal caller to try again or use slow path
    }

    // Refill only up to capacity
    int want_count   = HAKMEM_TINY_REFILL_DEFAULT;  // e.g., 16
    int refill_count = (want_count < available_slots) ? want_count : available_slots;

    // ... existing refill logic with refill_count ...
}
```
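For context, a sketch of how the caller might react to the new `-1` return is shown below. This is an assumption about the fast-path structure in `core/tiny_alloc_fast.inc.h`; `tiny_alloc_fast()` here is illustrative and the real fast path may be organized differently.
```c
// Sketch (assumption): caller-side handling of the capped refill. A NULL
// return here means "fall back to the slow path"; the actual fast path in
// tiny_alloc_fast.inc.h may differ.
static inline void* tiny_alloc_fast(int class_idx) {
    void* block = g_tls_sll_head[class_idx];
    if (!block) {
        if (tiny_alloc_fast_refill(class_idx) < 0) {
            return NULL;  // Cache full or refill failed → slow path
        }
        block = g_tls_sll_head[class_idx];
        if (!block) {
            return NULL;  // Refill produced nothing → slow path
        }
    }
    g_tls_sll_head[class_idx] = *(void**)block;  // Pop from the TLS free list
    g_tls_sll_count[class_idx]--;
    return block;
}
```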
---
## Testing Strategy
### Test 1: Adaptive Behavior Verification
```bash
# Enable debug logging
HAKMEM_LOG=1 ./larson_hakmem 10 8 128 1024 1 12345 4 2>&1 | grep "TLS_CACHE"
# Should see:
# [TLS_CACHE] Grow class 4: 64 → 128 slots (grow_count=1)
# [TLS_CACHE] Grow class 4: 128 → 256 slots (grow_count=2)
# [TLS_CACHE] Grow class 4: 256 → 512 slots (grow_count=3)
# [TLS_CACHE] Keep class 0 at 64 slots (usage=5.2%)
```
### Test 2: Performance Improvement
```bash
# Before (fixed capacity)
./larson_hakmem 1 1 128 1024 1 12345 1
# Baseline: 2.71M ops/s
# After (adaptive capacity)
./larson_hakmem 1 1 128 1024 1 12345 1
# Expected: 2.8-3.0M ops/s (+3-10%)
```
### Test 3: Memory Efficiency
```bash
# Run with memory profiling
valgrind --tool=massif ./larson_hakmem 1 1 128 1024 1 12345 1
# Compare peak memory usage
# Fixed: 256 slots × 8 classes × 8B = ~16KB per thread
# Adaptive: ~8KB per thread (cold classes shrink to 16 slots)
```
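To make the memory comparison and the "per-class capacity evolution" deliverable easier to capture, a small dump helper could be added. This is a sketch: the function name and the idea of calling it from an `atexit()` hook or a periodic debug timer are assumptions.
```c
// Sketch (assumption): dump per-class TLS cache capacities for the report.
// tls_cache_dump_capacities() is a hypothetical helper name.
static void tls_cache_dump_capacities(void) {
    size_t total_slots = 0;
    for (int class_idx = 0; class_idx < TINY_NUM_CLASSES; class_idx++) {
        const TLSCacheStats* stats = &g_tls_cache_stats[class_idx];
        fprintf(stderr,
                "[TLS_CACHE] class %d: capacity=%zu hwm=%zu grow=%zu shrink=%zu\n",
                class_idx, stats->capacity, stats->high_water_mark,
                stats->grow_count, stats->shrink_count);
        total_slots += stats->capacity;
    }
    fprintf(stderr, "[TLS_CACHE] total TLS cache capacity: %zu slots\n", total_slots);
}
```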
---
## Success Criteria
- **Adaptive behavior**: Logs show grow/shrink based on usage
- **Hot class expansion**: Class 4 grows to 512+ slots under load
- **Cold class shrinkage**: Class 0 shrinks to 16-32 slots
- **Performance improvement**: +3-10% on Larson benchmark
- **Memory efficiency**: -30-50% TLS cache memory usage
---
## Deliverable
**Report file**: `/mnt/workdisk/public_share/hakmem/PHASE2B_IMPLEMENTATION_REPORT.md`
**Required sections**:
1. **Adaptive sizing behavior** (logs showing grow/shrink)
2. **Performance comparison** (before/after)
3. **Memory usage comparison** (TLS cache overhead)
4. **Per-class capacity evolution** (graph if possible)
5. **Production readiness** (YES/NO verdict)
---
**Let's make TLS cache adaptive! 🎯**