Tiny Pool Optimization Strategy

Date: 2025-10-26
Goal: Reduce the 5.9x performance gap with mimalloc for small allocations (8-64 bytes)
Current: 83 ns/op (hakmem) vs 14 ns/op (mimalloc)
Target: 50-55 ns/op (35-40% improvement, ~3.5x gap remaining)
Status: Ready for implementation


Executive Summary

The 5.9x Gap: Root Cause Analysis

Based on comprehensive mimalloc analysis (see ANALYSIS_SUMMARY.md), the performance gap stems from architectural differences, not bugs:

| Component | mimalloc | hakmem | Impact |
| --- | --- | --- | --- |
| Primary data structure | LIFO free list (intrusive) | Bitmap + magazine | +20 ns |
| State location | Thread-local only | Thread-local + global | +10 ns |
| Cache validation | Implicit (per-thread pages) | Explicit (ownership tracking) | +5 ns |
| Statistics overhead | Batched/deferred | Per-allocation sampled | +10 ns |
| Control flow | 1 branch | 3-4 branches | +5 ns |

Total measured gap: 69 ns (83 - 14)
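To make the free-list vs bitmap contrast above concrete, here is a toy sketch (our illustration, not hakmem code; `Block`, `freelist_pop`, and `bitmap_pop` are hypothetical names) of the two allocation primitives. Both are O(1), but with different constant factors:

```c
#include <stdint.h>
#include <stddef.h>

/* mimalloc-style intrusive LIFO free list: one dependent load + one store. */
typedef struct Block { struct Block* next; } Block;

static Block* freelist_pop(Block** head) {
    Block* b = *head;          /* single pointer read */
    if (b) *head = b->next;    /* single store */
    return b;
}

/* hakmem-style bitmap slab: find-first-set + index extraction + bit clear. */
static int bitmap_pop(uint64_t* bits) {
    if (*bits == 0) return -1;           /* slab full */
    int idx = __builtin_ctzll(*bits);    /* find-first-set (GCC/Clang builtin) */
    *bits &= *bits - 1;                  /* clear lowest set bit */
    return idx;
}
```

The extra find-first-set and mask work, plus the magazine indirection wrapped around it in the real code, is roughly where the table's +20 ns estimate comes from.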

Why We Can't Match mimalloc's 14 ns

Irreducible architectural gaps (10-13 ns total):

  1. Bitmap lookup [5 ns]: Find-first-set + bit extraction vs single pointer read
  2. Magazine validation [3-5 ns]: Ownership tracking for diagnostics vs implicit ownership
  3. Statistics integration [2-3 ns]: Per-class stats require bookkeeping vs atomic counters

What We Can Realistically Achieve

Addressable overhead (30-35 ns):

  • P0: Lookup table classification → +3-5 ns
  • P1: Remove stats from hot path → +10-15 ns
  • P2: Inline fast path → +5-10 ns
  • P3: Branch elimination → +10-15 ns

Expected result: 83 ns → 50-55 ns (35-40% improvement)


Strategic Approach: Three-Phase Implementation

Phase I: Quick Wins (P0 + P1) - 90 minutes

Target: 83 ns → 65-70 ns (~20% improvement)
ROI: Highest impact per time invested

Why Start Here:

  • P0 and P1 are independent (no code conflicts)
  • Combined gain: 13-20 ns
  • Low risk (simple, localized changes)
  • Immediate validation possible

Phase II: Fast Path Optimization (P2) - 60 minutes

Target: 65-70 ns → 55-60 ns (~30% cumulative improvement)
ROI: High impact, moderate complexity

Why Second:

  • Depends on P1 completion (stats removed from hot path)
  • Creates foundation for P3 (branch elimination)
  • More complex but well-documented in roadmap

Phase III: Advanced Optimization (P3) - 90 minutes

Target: 55-60 ns → 50-55 ns (~40% cumulative improvement)
ROI: Moderate, requires careful testing

Why Last:

  • Most complex (branchless logic)
  • Highest risk of subtle bugs
  • Marginal improvement (diminishing returns)

Detailed Implementation Plan

P0: Lookup Table Size Classification

File: hakmem_tiny.h
Effort: 30 minutes
Gain: +3-5 ns
Risk: Very Low

Current Implementation (If-Chain)

static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8) return 0;
    if (size <= 16) return 1;
    if (size <= 32) return 2;
    if (size <= 64) return 3;
    if (size <= 128) return 4;
    if (size <= 256) return 5;
    if (size <= 512) return 6;
    if (size <= 1024) return 7;
    return -1;
}
// Branches: up to 8 comparisons (worst case), ~4 executed on average;
// misprediction-prone on mixed size streams
// Cost: 5-8 ns

Target Implementation (LUT)

// Add after line 36 in hakmem_tiny.h (after g_tiny_blocks_per_slab)
static const uint8_t g_tiny_size_to_class[1025] = {
    // 0-8: class 0
    0,0,0,0,0,0,0,0,0,
    // 9-16: class 1
    1,1,1,1,1,1,1,1,
    // 17-32: class 2
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    // 33-64: class 3
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    // 65-128: class 4
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    // 129-256: class 5
    // ... (continue pattern for all 1025 entries)
    // 513-1024: class 7
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    // Fast path: direct table lookup
    if (__builtin_expect(size <= 1024, 1)) {
        return g_tiny_size_to_class[size];
    }
    // Slow path: out of Tiny Pool range
    return -1;
}
// Branches: 1 (predictable)
// Cost: 0.5-1 ns (L1 cache hit)
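Hand-writing 1025 initializers is error-prone. An alternative sketch (ours; `tiny_lut_init` is a hypothetical name, and the constructor attribute assumes GCC/Clang) fills the same table once at startup from the class boundaries, guaranteeing it matches the if-chain:

```c
#include <stdint.h>
#include <stddef.h>

static uint8_t g_tiny_size_to_class_rt[1025];

__attribute__((constructor))
static void tiny_lut_init(void) {
    /* Class upper bounds mirror the if-chain: 8, 16, 32, ..., 1024. */
    static const size_t bounds[8] = {8, 16, 32, 64, 128, 256, 512, 1024};
    int cls = 0;
    for (size_t sz = 0; sz <= 1024; sz++) {
        while (sz > bounds[cls]) cls++;   /* sizes are visited in order,
                                             so cls only ever advances */
        g_tiny_size_to_class_rt[sz] = (uint8_t)cls;
    }
}
```

This trades a one-time init cost for an initializer that cannot drift from the boundary list; a unit test comparing it against the if-chain for all 1025 sizes is cheap.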

Implementation Steps

  1. Add g_tiny_size_to_class[1025] table to hakmem_tiny.h (after line 36)
  2. Replace hak_tiny_size_to_class() with hak_tiny_size_to_class_fast()
  3. Update all call sites in hakmem_tiny.c
  4. Compile and verify correctness

Testing

# Build and verify correctness
make clean && make -j4

# Unit test: verify classification accuracy
./test_tiny_size_class  # (create if needed)

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 78-80ns

Rollback

git diff hakmem_tiny.h  # Verify changes
git checkout hakmem_tiny.h  # If regression detected

P1: Remove Statistics from Critical Path

File: hakmem_tiny.c
Effort: 60 minutes
Gain: +10-15 ns
Risk: Low (statistics semantics preserved)

Current Implementation (Per-Allocation Sampling)

// hakmem_tiny.c:656-659 (hot path)
void* p = mag->items[--mag->top].ptr;

// Sampled counter update (XOR-based PRNG)
t_tiny_rng ^= t_tiny_rng << 13;         // 3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u<<g_tiny_count_sample_exp)-1u)) == 0u) {
    g_tiny_pool.alloc_count[class_idx]++;  // Atomic increment (contention!)
}

return p;
// Total stats overhead: 10-15 ns

Target Implementation (Batched Lazy Update)

// New TLS counter (add to hakmem_tiny.c TLS section)
static __thread uint64_t t_tiny_alloc_counter[TINY_NUM_CLASSES] = {0};

// Hot path (remove all stats code)
void* p = mag->items[--mag->top].ptr;
// NO STATS HERE!
return p;
// Stats overhead: 0 ns

// Cold path: lazy accumulation (called during magazine refill).
// Credit the whole refill batch, not the refill event, so the counter
// still tracks allocations; flush to the global counter every ~100.
static void hak_tiny_lazy_counter_update(int class_idx, int n_refilled) {
    t_tiny_alloc_counter[class_idx] += (uint64_t)n_refilled;
    if (t_tiny_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += t_tiny_alloc_counter[class_idx];
        t_tiny_alloc_counter[class_idx] = 0;
    }
}

// Call from slow path (magazine refill function)
void hak_tiny_refill_magazine(...) {
    // ... existing refill logic ...
    hak_tiny_lazy_counter_update(class_idx, n_refilled);
}

Implementation Steps

  1. Add t_tiny_alloc_counter[TINY_NUM_CLASSES] TLS array
  2. Remove XOR PRNG code from hak_tiny_alloc() hot path (lines 656-659)
  3. Add hak_tiny_lazy_counter_update() function
  4. Call lazy update in slow path (magazine refill, slab allocation)
  5. Update hak_tiny_get_stats() to flush pending TLS counters

Statistics Accuracy Trade-off

  • Before: Sampled (1/16 allocations counted)
  • After: Batched (accumulated every 100 allocations)
  • Impact: Both are approximations; batching is more accurate and faster
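The batching claim can be checked with a small single-threaded simulation (a sketch with hypothetical names; real code uses TLS arrays and atomics): the global counter plus the pending per-thread remainder always equals the true allocation count, so a stats read that flushes pending counters, as step 5 suggests, is exact rather than sampled.

```c
#include <stdint.h>

#define NUM_CLASSES 8
#define BATCH 100

static uint64_t tls_pending[NUM_CLASSES];   /* per-thread, flushed in batches */
static uint64_t global_count[NUM_CLASSES];  /* shared, updated rarely */

static void count_alloc(int cls) {
    if (++tls_pending[cls] >= BATCH) {      /* flush every BATCH allocations */
        global_count[cls] += tls_pending[cls];
        tls_pending[cls] = 0;
    }
}

/* Stats-read side: fold in the pending remainder for an exact total. */
static uint64_t read_count(int cls) {
    return global_count[cls] + tls_pending[cls];
}
```

After 12,345 simulated allocations of one class, the global counter holds 12,300 and the pending remainder 45, so the flushed read is exact; sampling at 1/16 would only estimate this total.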

Testing

# Build
make clean && make -j4

# Functional test: verify stats still work
./test_mf2  # Should show non-zero alloc_count

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 78-80ns → 63-70ns

Rollback

git diff hakmem_tiny.c
git checkout hakmem_tiny.c  # If stats broken

P2: Inline Fast Path

Files: hakmem_tiny_alloc_fast.h (new), hakmem.h, hakmem_tiny.c
Effort: 60 minutes
Gain: +5-10 ns
Risk: Moderate (function splitting, ABI considerations)

Current Implementation (Unified Function)

// hakmem_tiny.c: Single function for fast + slow path
void* hak_tiny_alloc(size_t size) {
    // Size classification
    int class_idx = hak_tiny_size_to_class(size);

    // Magazine check
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;  // Fast path
    }

    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (slow path with locks)
    // ...
}
// Problem: Compiler can't inline due to size/complexity
// Call overhead: 5-10 ns

Target Implementation (Split Fast/Slow)

// New file: hakmem_tiny_alloc_fast.h
#ifndef HAKMEM_TINY_ALLOC_FAST_H
#define HAKMEM_TINY_ALLOC_FAST_H

#include "hakmem_tiny.h"

// External slow path declaration
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);

// Inlined fast path (magazine-only)
__attribute__((always_inline))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast bounds check
    if (__builtin_expect(size > TINY_MAX_SIZE, 0)) {
        return NULL;  // Not Tiny Pool range
    }

    // O(1) size classification (uses P0 LUT)
    int class_idx = g_tiny_size_to_class[size];

    // TLS magazine check (fast path only)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        return mag->items[--mag->top].ptr;
    }

    // Fall through to slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

#endif
// hakmem_tiny.c: Slow path only
void* hak_tiny_alloc_slow(size_t size, int class_idx) {
    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (locks)
    // ...
}
// hakmem.h: Update public API
#include "hakmem_tiny_alloc_fast.h"

// Use inlined fast path for tiny allocations
static inline void* hak_alloc_at(size_t size, void* site) {
    if (size <= TINY_MAX_SIZE) {
        return hak_tiny_alloc_hot(size);
    }
    // ... L2/L2.5 pools ...
}

Benefits

  • Zero call overhead: Compiler inlines directly into caller
  • Better register allocation: Hot path uses minimal stack
  • Improved branch prediction: Fast path separate from cold code

Implementation Steps

  1. Create hakmem_tiny_alloc_fast.h
  2. Move magazine-only logic to hak_tiny_alloc_hot()
  3. Rename hak_tiny_alloc() → hak_tiny_alloc_slow()
  4. Update hakmem.h to include new header
  5. Update all call sites

Testing

# Build
make clean && make -j4

# Functional test: verify all paths work
./test_mf2
./test_mf2_warmup

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 63-70ns → 55-65ns

Rollback

git rm hakmem_tiny_alloc_fast.h
git checkout hakmem.h hakmem_tiny.c

Testing Strategy

Phase-by-Phase Validation

After P0 (Lookup Table)

# Correctness
./test_tiny_size_class  # Verify all 1025 entries correct

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 83ns → 78-80ns

# No regressions on other sizes
./bench_allocators_hakmem --scenario json  # L2.5 64KB
./bench_allocators_hakmem --scenario mir   # L2.5 256KB

After P1 (Remove Stats)

# Correctness: Stats still work
./test_mf2
hak_tiny_get_stats()  # Should show non-zero counts

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 78-80ns → 63-70ns

# Multi-threaded: Verify TLS counters work
./bench_tiny_mt --iterations=100000 --threads=4
# Should see reasonable alloc_count

After P2 (Inline Fast Path)

# Correctness: All paths work
./test_mf2
./test_mf2_warmup

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 63-70ns → 55-65ns

# Code inspection: Verify inlining
objdump -d bench_tiny | grep -A 20 'hak_alloc_at'
# Should show inline expansion, not call instruction

Comprehensive Validation

# After all P0-P2 complete
make clean && make -j4

# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1  # Single-threaded
./bench_tiny_mt --iterations=100000 --threads=4  # Multi-threaded
./bench_allocators_hakmem --scenario json  # L2.5 64KB
./bench_allocators_hakmem --scenario mir   # L2.5 256KB

# Git commit
git add .
git commit -m "Tiny Pool optimization P0-P2: 83ns → 55-65ns

- P0: Lookup table size classification (+3-5ns)
- P1: Remove statistics from hot path (+10-15ns)
- P2: Inline fast path (+5-10ns)

Cumulative improvement: ~30-35%
Remaining gap to mimalloc (14ns): 3.5-4x (irreducible)"

Risk Mitigation

Low-Risk Approach

  1. Incremental changes: One optimization at a time
  2. Validation at each step: Benchmark + test after each P0, P1, P2
  3. Git commits: Separate commit for each optimization
  4. Rollback ready: git checkout if regression detected

Potential Issues and Mitigation

Issue 1: LUT Size (1 KB)

  • Risk: L1 cache pressure
  • Mitigation: 1 KB fits in L1 (32 KB typical), negligible impact
  • Fallback: Use smaller LUT (65 entries, round up to power-of-2)
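The fallback can be sketched with 8-byte granularity (our choice; the names are hypothetical): index by ceil(size/8), shrinking the table to 129 bytes at the cost of one shift-add per lookup while keeping every class boundary exact. A 65-entry table would need 16-byte steps, which blurs the class 0/1 boundary at size 8.

```c
#include <stdint.h>
#include <stddef.h>

/* 129-entry fallback LUT: one entry per 8-byte step, sizes 0..1024. */
static uint8_t g_tiny_class_by_step[129];

static void tiny_small_lut_init(void) {
    static const size_t bounds[8] = {8, 16, 32, 64, 128, 256, 512, 1024};
    int cls = 0;
    for (size_t step = 0; step <= 128; step++) {
        size_t max_size = step * 8;        /* largest size mapping to this step */
        while (max_size > bounds[cls]) cls++;
        g_tiny_class_by_step[step] = (uint8_t)cls;
    }
}

static inline int tiny_class_small_lut(size_t size) {
    if (size > 1024) return -1;
    return g_tiny_class_by_step[(size + 7) >> 3];  /* index = ceil(size/8) */
}
```

Because every class boundary is a multiple of 8, each 8-byte step maps to exactly one class, so this variant agrees with the full 1025-entry table for all sizes.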

Issue 2: Inlining Bloat

  • Risk: Code size increase from inlining
  • Mitigation: Only inline magazine path (~10 instructions)
  • Fallback: Use __attribute__((hot)) instead of always_inline

Issue 3: Stats Accuracy

  • Risk: Batched counters less accurate than sampled
  • Mitigation: Actually MORE accurate (no sampling variance)
  • Fallback: Keep TLS counters but flush more frequently

Success Criteria

Performance Targets

| Optimization | Current | Target | Achieved? |
| --- | --- | --- | --- |
| Baseline | 83 ns/op | - | ✓ (established) |
| P0 Complete | 83 ns | 78-80 ns | [ ] |
| P1 Complete | 78-80 ns | 63-70 ns | [ ] |
| P2 Complete | 63-70 ns | 55-65 ns | [ ] |
| P0-P2 Combined | 83 ns | 50-55 ns | [ ] |

Stretch Goal: 50 ns/op (~40% improvement, 3.5x gap to mimalloc)

Functional Requirements

  • All existing tests pass
  • No regressions on L2/L2.5 pools
  • Statistics still functional (batched counters work)
  • Multi-threaded safety maintained
  • Zero hard page faults (memory reuse preserved)

Code Quality

  • Clean compilation with -Wall -Wextra
  • No compiler warnings
  • Documented trade-offs (stats batching, LUT size)
  • Git commit messages reference this strategy doc

Timeline and Effort

Phase I: Quick Wins (P0 + P1)

  • P0 Implementation: 30 minutes
  • P0 Testing: 15 minutes
  • P1 Implementation: 60 minutes
  • P1 Testing: 15 minutes
  • Total Phase I: 2 hours

Phase II: Fast Path (P2)

  • P2 Implementation: 60 minutes
  • P2 Testing: 15 minutes
  • Total Phase II: 1.25 hours

Phase III: Validation

  • Comprehensive testing: 30 minutes
  • Documentation update: 15 minutes
  • Git commit: 15 minutes
  • Total Phase III: 1 hour

Grand Total: 4.25 hours

Realistic estimate with contingency: 5-6 hours


Beyond P0-P2: Future Optimization (Optional)

P3: Branch Elimination (Not Included in This Strategy)

Effort: 90 minutes
Gain: +10-15 ns
Risk: High (complex, subtle bugs possible)

Why Deferred:

  • Diminishing returns (30% improvement already achieved)
  • Complexity vs gain trade-off unfavorable
  • Can be addressed in future iteration

Alternative: NEXT_STEPS.md Approach

After P0-P2 complete, consider NEXT_STEPS.md optimizations:

  • MPSC opportunistic drain during alloc slow path
  • Immediate full→free slab promotion after drain
  • Adaptive magazine capacity per site

These may yield better ROI than P3.


Comparison to Alternatives

Alternative 1: Implement NEXT_STEPS.md First

Pros: Novel features (adaptive magazines, ELO learning)
Cons: Higher complexity, uncertain performance gain
Why Not: mimalloc analysis shows stats overhead is 10-15 ns; addressing a known bottleneck first is safer

Alternative 2: Copy mimalloc's Free List Architecture

Pros: Would match mimalloc's 14 ns performance
Cons: Abandons bitmap approach, loses diagnostics/ownership tracking
Why Not: Violates hakmem's research goals (flexible architecture)

Alternative 3: Do Nothing

Pros: Zero effort
Cons: 5.9x gap remains, no learning
Why Not: P0-P2 are low-risk, high-ROI quick wins


Conclusion

This strategy prioritizes high-impact, low-risk optimizations to achieve a 30-40% performance improvement in 4-6 hours of work.

Key Principles:

  1. Incremental validation: Test after each step
  2. Focus on hot path: Remove overhead from critical path
  3. Preserve semantics: No behavior changes, only optimization
  4. Accept trade-offs: 50-55ns is excellent; chasing 14ns abandons research goals

Next Steps:

  1. Review and approve this strategy
  2. Implement P0 (Lookup Table) - 30 minutes
  3. Validate P0 - 15 minutes
  4. Implement P1 (Remove Stats) - 60 minutes
  5. Validate P1 - 15 minutes
  6. Implement P2 (Inline Fast Path) - 60 minutes
  7. Validate P2 - 15 minutes
  8. Comprehensive testing and documentation - 1 hour

Expected Outcome: Tiny Pool performance improves from 83ns to 50-55ns, closing 40% of the gap with mimalloc while preserving hakmem's research architecture.


Last Updated: 2025-10-26
Status: Ready for implementation
Approval: Pending user confirmation
Implementation Start: After approval