Tiny Pool Optimization Roadmap
Quick reference for implementing mimalloc-style optimizations in hakmem's Tiny Pool.
Current Performance: 83 ns/op for 8-64B allocations
Target Performance: 30-50 ns/op (realistic with these optimizations)
Gap to mimalloc: still 2-3.5x slower (a fundamental architecture difference)
Quick Wins (10-20 ns improvement)
1. Lookup Table Size Classification
Effort: 30 minutes | Gain: 3-5 ns
Replace if-chain with table lookup:
```c
// Before
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)  return 0;
    if (size <= 16) return 1;
    if (size <= 32) return 2;
    // ... sequential if-chain up to 64
}

// After: indices follow the if-chain boundaries (size 8 -> class 0, etc.)
static const uint8_t g_size_to_class[65] = {
    0,                 // index 0 (size 0 is never queried)
    0,0,0,0,0,0,0,0,   // 1-8   -> class 0
    1,1,1,1,1,1,1,1,   // 9-16  -> class 1
    2,2,2,2,2,2,2,2,   // 17-24 -> class 2
    2,2,2,2,2,2,2,2,   // 25-32 -> class 2
    3,3,3,3,3,3,3,3,   // 33-40 -> class 3
    // ... continue to 64
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    return __builtin_expect(size <= 64, 1) ? g_size_to_class[size] : -1;
}
```
Implementation:
- File: `hakmem_tiny.h`
- Add the static const array after line 36 (after `g_tiny_blocks_per_slab`)
- Update `hak_tiny_size_to_class()` to use the table
- Add `__builtin_expect()` for the fast path (a verification sketch follows this list)
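Because the table and the if-chain must agree exactly at every boundary, a cheap exhaustive check is worth keeping around. A minimal sketch (the test file name is hypothetical):

```c
// test_size_class.c - confirm the LUT matches the legacy if-chain for
// every size it covers; catches off-by-one boundary drift.
#include <assert.h>
#include <stddef.h>
#include "hakmem_tiny.h"

int main(void) {
    for (size_t s = 1; s <= 64; s++) {
        assert(hak_tiny_size_to_class(s) == hak_tiny_size_to_class_fast(s));
    }
    return 0;
}
```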
2. Remove Statistics from Critical Path
Effort: 1 hour | Gain: 10-15 ns
Move sampled counter updates to separate tracking:
```c
// Before (hot path)
void* p = mag->items[--mag->top].ptr;
t_tiny_rng ^= t_tiny_rng << 13;  // xorshift sampling: ~3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << g_tiny_count_sample_exp) - 1u)) == 0u)
    g_tiny_pool.alloc_count[class_idx]++;
return p;

// After (hot path)
void* p = mag->items[--mag->top].ptr;
// Stats update deferred (see hak_tiny_lazy_counter_update below)
return p;

// New: lazy counter accumulation (cold path)
static __thread uint32_t g_tls_alloc_counter[TINY_NUM_CLASSES];

static void hak_tiny_lazy_counter_update(int class_idx) {
    if (++g_tls_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += g_tls_alloc_counter[class_idx];
        g_tls_alloc_counter[class_idx] = 0;
    }
}
```
Implementation:
- File: `hakmem_tiny.c`
- Remove the sampled XOR code from `hak_tiny_alloc()` (lines 656-659)
- Replace it with the simple per-thread counter
- Call `hak_tiny_lazy_counter_update()` in the slow path only
- Update `hak_tiny_get_stats()` to account for lazy counters (see the flush sketch below)
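Reading global stats then requires folding in the pending per-thread counts. A minimal sketch of the flush, reusing `g_tls_alloc_counter` from the change above; note it can only flush the calling thread's counters, so counts from other threads land on their next threshold crossing:

```c
// Called from hak_tiny_get_stats(): fold this thread's pending counts
// into the global totals before they are reported.
static void hak_tiny_flush_tls_counters(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (g_tls_alloc_counter[c] != 0) {
            g_tiny_pool.alloc_count[c] += g_tls_alloc_counter[c];
            g_tls_alloc_counter[c] = 0;
        }
    }
}
```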
3. Inline Fast Path
Effort: 1 hour | Gain: 5-10 ns
Create separate inlined fast-path function:
```c
// New file: hakmem_tiny_alloc_fast.h
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Magazine fast path only (no TLS active slab, no locks)
    if (size > TINY_MAX_SIZE) return NULL;
    int class_idx = g_size_to_class[size];
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        return mag->items[--mag->top].ptr;
    }
    // Fall through to the slow path (refill, slab scan, locks)
    extern void* hak_tiny_alloc_slow(size_t size);
    return hak_tiny_alloc_slow(size);
}
```
Implementation:
- Create: `hakmem_tiny_alloc_fast.h`
- Move the pure magazine fast path there
- Declare it `__attribute__((always_inline))`
- Include it from `hakmem.h` for the public API (see the dispatch sketch below)
- Keep `hakmem_tiny.c` for the slow path
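For the always-inline function to pay off, the public entry point must reach it without an extra call. A sketch of the `hakmem.h` side; `hak_malloc` and `hak_alloc_large` are assumed names, not confirmed by this document:

```c
// In hakmem.h: route tiny sizes through the inlined hot path so the
// common case compiles down to a magazine pop at the call site.
#include "hakmem_tiny_alloc_fast.h"

static inline void* hak_malloc(size_t size) {
    if (__builtin_expect(size <= TINY_MAX_SIZE, 1))
        return hak_tiny_alloc_hot(size);  // lock-free magazine pop
    return hak_alloc_large(size);         // medium/large pools (assumed name)
}
```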
Medium Effort (1-15 ns improvement each)
4. Combine TLS Reads into Single Structure
Effort: 2 hours | Gain: 2-3 ns
```c
// Before
TinyTLSMag* mag  = &g_tls_mags[class_idx];
TinySlab* slab_a = &g_tls_active_slab_a[class_idx];
TinySlab* slab_b = &g_tls_active_slab_b[class_idx];
// 3 separate TLS reads + prefetch misses

// After
typedef struct {
    TinyTLSMag mag;     // all magazine data
    TinySlab*  slab_a;  // active slab pointers
    TinySlab*  slab_b;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

TinyTLSCache* cache = &g_tls_cache[class_idx];  // 1 TLS read
// All 3 data structures prefetched together
```
Benefits:
- Single TLS read instead of 3
- All related data on same cache line
- Better prefetcher behavior
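Whether all three really share a line depends on field order and alignment; the struct above gives no guarantee. A sketch of one possible layout, assuming `TinyTLSMag` embeds a large `items[]` array (the field order is illustrative, not from the source):

```c
// Cache-line-align the per-class entry and keep the hot words adjacent:
// with a large items[] inside TinyTLSMag, putting the slab pointers first
// keeps them on the same 64-byte line as mag's leading fields (e.g. top).
typedef struct {
    TinySlab*  slab_a;
    TinySlab*  slab_b;
    TinyTLSMag mag;
} __attribute__((aligned(64))) TinyTLSCache;
```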
5. Hardware Prefetching Hints
Effort: 30 minutes | Gain: 1-2 ns (cumulative)
```c
// In the hot-path loop (e.g., bench_tiny_mt.c)
void* p = mag->items[--mag->top].ptr;
if (mag->top > 0) {
    // Pop uses pre-decrement, so the next block to be popped is at top - 1.
    __builtin_prefetch(mag->items[mag->top - 1].ptr, 0, 3);
    // 0 = read, 3 = high temporal locality (keep close to L1)
}
use_allocation(p);
```
6. Branchless Fallback Logic
Effort: 1.5 hours | Gain: 10-15 ns
Use conditional moves instead of branches:
```c
// Before (2+ branches on the hot path, misprediction-prone)
if (mag->top > 0) {
    return mag->items[--mag->top].ptr;
}
if (slab_a && slab_a->free_count > 0) {
    // ... allocation from slab
}
// Fall through to slow path

// After: one merged exit; the final select is cmov-friendly
void* p = NULL;
if (mag->top > 0) {
    p = mag->items[--mag->top].ptr;
}
// If still NULL, the slab_a handler gets a chance
if (!p && slab_a && slab_a->free_count > 0) {
    // ... allocation from slab, storing the block into p
}
return p != NULL ? p : hak_tiny_alloc_slow(size);
```
Advanced Optimizations (2-5 ns improvement)
7. Code Layout (Hot/Cold Separation)
Effort: 2 hours | Gain: 2-5 ns
Use compiler pragmas to place fast path in hot section:
```c
// In hakmem_tiny_alloc_fast.h
__attribute__((section(".text.hot")))
__attribute__((aligned(64)))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast path only
}

// In hakmem_tiny.c (non-static: the inlined fast path calls it)
__attribute__((section(".text.cold")))
void* hak_tiny_alloc_slow(size_t size) {
    // Slow path: locks, scanning, refill
}
```
Benefits:
- Hot path packed in contiguous instruction cache
- Fewer I-cache misses
- Better CPU prefetching
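Making the linker honor the grouping usually takes one extra compile flag; a minimal Makefile sketch, assuming GCC/Clang with the default GNU linker script (which already places `.text.hot` ahead of plain `.text`):

```make
# Emit each function in its own section so hot/cold grouping can take effect.
CFLAGS += -O3 -march=native -ffunction-sections
```

Alternatively, `__attribute__((hot))` and `__attribute__((cold))` let the compiler pick the sections (`.text.hot` / `.text.unlikely`) without hand-written section names.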
Implementation Priority (Time Investment vs Gain)
| Priority | Optimization | Effort | Gain | ROI |
|---|---|---|---|---|
| P0 | Lookup table classification | 30 min | 3-5 ns | 10x |
| P1 | Remove stats overhead | 1 hr | 10-15 ns | 15x |
| P2 | Inline fast path | 1 hr | 5-10 ns | 7x |
| P3 | Branch elimination | 1.5 hr | 10-15 ns | 7x |
| P4 | Combined TLS reads | 2 hr | 2-3 ns | 1.5x |
| P5 | Code layout | 2 hr | 2-5 ns | 2x |
| P6 | Prefetching hints | 30 min | 1-2 ns | 3x |
Testing Strategy
After each optimization:
```sh
# Rebuild
make clean && make -j4

# Single-threaded benchmark; should show improvement in latency_ns
./bench_tiny --iterations=1000000 --threads=1

# Multi-threaded verification; must keep thread-safety and hit rate
./bench_tiny --iterations=100000 --threads=4

# Compare against baseline
./docs/bench_compare.sh baseline optimized
```
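Latency numbers alone won't say why a change helped; hardware counters will. A sketch, assuming Linux `perf` is available (event names vary by CPU):

```sh
# Branch elimination (P3) should lower branch-misses; hot/cold layout (P5)
# should lower L1-icache-load-misses.
perf stat -e cycles,instructions,branch-misses,L1-icache-load-misses \
    ./bench_tiny --iterations=1000000 --threads=1
```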
Expected Results Timeline
Phase 1 (P0+P1, ~2 hours):
- Current: 83 ns/op
- Expected: 65-70 ns/op
- Gain: ~18% improvement
Phase 2 (P0+P1+P2, ~3 hours):
- Expected: 55-65 ns/op
- Gain: ~25-30% improvement
Phase 3 (P0+P1+P2+P3, ~4.5 hours):
- Expected: 45-55 ns/op
- Gain: ~40-45% improvement
- Still 3-4x slower than mimalloc (fundamental difference)
Why still slower than mimalloc (14 ns)?
Irreducible architectural gaps:
- Bitmap lookup [5 ns] vs free list head [1 ns]
- Magazine validation [3-5 ns] vs implicit ownership [0 ns]
- Thread ownership tracking [2-3 ns] vs per-thread pages [0 ns]
Total irreducible gap: 10-13 ns, which puts hakmem's realistic floor near 14 + (10 to 13) ≈ 24-27 ns and is why the target above is 30-50 ns rather than parity.
Code Files to Modify
| File | Change | Priority |
|---|---|---|
| `hakmem_tiny.h` | Add size_to_class LUT | P0 |
| `hakmem_tiny.c` | Remove stats from hot path | P1 |
| `hakmem_tiny_alloc_fast.h` | NEW: inlined fast path | P2 |
| `hakmem_tiny.c` | Branchless fallback | P3 |
| `hakmem_tiny.h` | Combine TLS structure | P4 |
| `bench_tiny.c` | Add prefetch hints | P6 |
| `Makefile` | Hot/cold sections (`-ffunction-sections`) | P5 |
Rollback Plan
Each optimization is independent and can be reverted:
```sh
# If an optimization causes a regression:
git diff <file>       # see the changes
git checkout <file>   # revert the single file
./bench_tiny          # re-verify
```
All optimizations preserve semantics - no behavior changes.
Success Criteria
- P0 Implementation: 80-82 ns/op (no regression)
- P1 Implementation: 68-72 ns/op (+15% from current)
- P2 Implementation: 60-65 ns/op (+22% from current)
- P3 Implementation: 50-55 ns/op (+35% from current)
- All changes compile with `-O3 -march=native`
- All benchmarks pass thread-safety verification
- No regressions on medium/large allocations (L2 pool)
Last Updated: 2025-10-26
Status: Ready for implementation
Owner: [Team]
Target: Implement P0-P2 in next sprint