# Tiny Pool Optimization Strategy

**Date**: 2025-10-26
**Goal**: Reduce 5.9x performance gap with mimalloc for small allocations (8-64 bytes)
**Current**: 83 ns/op (hakmem) vs 14 ns/op (mimalloc)
**Target**: 50-55 ns/op (35-40% improvement, ~3.5x gap remaining)
**Status**: Ready for implementation

---

## Executive Summary
### The 5.9x Gap: Root Cause Analysis

Based on comprehensive mimalloc analysis (see ANALYSIS_SUMMARY.md), the performance gap stems from **architectural differences**, not bugs:

| Component | mimalloc | hakmem | Impact |
|-----------|----------|--------|--------|
| **Primary data structure** | LIFO free list (intrusive) | Bitmap + magazine | +20 ns |
| **State location** | Thread-local only | Thread-local + global | +10 ns |
| **Cache validation** | Implicit (per-thread pages) | Explicit (ownership tracking) | +5 ns |
| **Statistics overhead** | Batched/deferred | Per-allocation sampled | +10 ns |
| **Control flow** | 1 branch | 3-4 branches | +5 ns |

**Total measured gap**: 69 ns (83 - 14)

### Why We Can't Match mimalloc's 14 ns

**Irreducible architectural gaps** (10-13 ns total; contrasted in the sketch after this list):

1. **Bitmap lookup** [5 ns]: Find-first-set + bit extraction vs single pointer read
2. **Magazine validation** [3-5 ns]: Ownership tracking for diagnostics vs implicit ownership
3. **Statistics integration** [2-3 ns]: Per-class stats require bookkeeping vs atomic counters

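For reference, here is a schematic contrast of the two hot paths; the types and field names below are illustrative, not hakmem's or mimalloc's actual structures. The intrusive free list pops a block with a single pointer load, while the bitmap path needs a bit scan, a bit clear, and address arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

// mimalloc-style pop: the free block itself stores the next pointer,
// so allocation is one load plus one store on thread-local state.
typedef struct Block { struct Block* next; } Block;

static inline void* freelist_pop(Block** head) {
    Block* b = *head;
    if (b) *head = b->next;
    return b;
}

// Bitmap-style pop (hakmem-like sketch): find a free bit, clear it,
// then compute the block address from the bit index.
typedef struct {
    uint64_t bitmap;    // 1 = free slot (real slabs use an array of words)
    char*    base;      // start of the slab payload
    uint32_t blk_size;  // block size for this size class
} MiniSlab;

static inline void* bitmap_pop(MiniSlab* s) {
    if (s->bitmap == 0) return NULL;
    int idx = __builtin_ctzll(s->bitmap);        // find-first-set
    s->bitmap &= s->bitmap - 1;                  // clear the lowest set bit
    return s->base + (size_t)idx * s->blk_size;  // bit index -> address
}
```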
### What We Can Realistically Achieve

**Addressable overhead** (30-35 ns):

- P0: Lookup table classification → +3-5 ns
- P1: Remove stats from hot path → +10-15 ns
- P2: Inline fast path → +5-10 ns
- P3: Branch elimination → +10-15 ns

**Expected result**: 83 ns → 50-55 ns (35-40% improvement)

---

## Strategic Approach: Three-Phase Implementation

### Phase I: Quick Wins (P0 + P1) - 90 minutes

**Target**: 83 ns → 65-70 ns (~20% improvement)
**ROI**: Highest impact per time invested

**Why Start Here**:

- P0 and P1 are independent (no code conflicts)
- Combined gain: 13-20 ns
- Low risk (simple, localized changes)
- Immediate validation possible
### Phase II: Fast Path Optimization (P2) - 60 minutes

**Target**: 65-70 ns → 55-60 ns (~30% cumulative improvement)
**ROI**: High impact, moderate complexity

**Why Second**:

- Depends on P1 completion (stats removed from hot path)
- Creates foundation for P3 (branch elimination)
- More complex but well-documented in roadmap

### Phase III: Advanced Optimization (P3) - 90 minutes

**Target**: 55-60 ns → 50-55 ns (~40% cumulative improvement)
**ROI**: Moderate, requires careful testing

**Why Last**:

- Most complex (branchless logic)
- Highest risk of subtle bugs
- Marginal improvement (diminishing returns)

---

## Detailed Implementation Plan
### P0: Lookup Table Size Classification

**File**: `hakmem_tiny.h`
**Effort**: 30 minutes
**Gain**: +3-5 ns
**Risk**: Very Low

#### Current Implementation (If-Chain)

```c
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)    return 0;
    if (size <= 16)   return 1;
    if (size <= 32)   return 2;
    if (size <= 64)   return 3;
    if (size <= 128)  return 4;
    if (size <= 256)  return 5;
    if (size <= 512)  return 6;
    if (size <= 1024) return 7;
    return -1;
}
// Branches: 8 (worst case), avg 4 mispredictions
// Cost: 5-8 ns
```
#### Target Implementation (LUT)

```c
// Add after line 36 in hakmem_tiny.h (after g_tiny_blocks_per_slab)
static const uint8_t g_tiny_size_to_class[1025] = {
    // 0-8: class 0
    0,0,0,0,0,0,0,0,0,
    // 9-16: class 1
    1,1,1,1,1,1,1,1,
    // 17-32: class 2
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    // 33-64: class 3
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,
    // 65-128: class 4
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,
    // 129-256: class 5
    // ... (continue pattern for all 1025 entries; generator sketch below)
    // 513-1024: class 7
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    // Fast path: direct table lookup
    if (__builtin_expect(size <= 1024, 1)) {
        return g_tiny_size_to_class[size];
    }
    // Slow path: out of Tiny Pool range
    return -1;
}
// Branches: 1 (predictable)
// Cost: 0.5-1 ns (L1 cache hit)
```
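Hand-writing 1,025 entries is error-prone. A throwaway generator along the lines of the sketch below can emit the initializer for pasting (or `#include`-ing) into `hakmem_tiny.h`; the class boundaries are taken from the if-chain above, and `gen_tiny_lut.c` is a hypothetical file name.

```c
/* gen_tiny_lut.c - throwaway generator for g_tiny_size_to_class.
 * Build and run once, e.g. `cc gen_tiny_lut.c -o gen_tiny_lut && ./gen_tiny_lut`,
 * then paste the output into hakmem_tiny.h. */
#include <stdio.h>

int main(void) {
    const int bounds[] = {8, 16, 32, 64, 128, 256, 512, 1024};
    printf("static const uint8_t g_tiny_size_to_class[1025] = {\n");
    for (int size = 0; size <= 1024; size++) {
        int cls = 0;
        while (size > bounds[cls]) cls++;              // sizes 0..8 -> class 0, etc.
        printf("%d,%c", cls, (size % 32 == 31) ? '\n' : ' ');
    }
    printf("\n};\n");
    return 0;
}
```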
#### Implementation Steps

1. Add `g_tiny_size_to_class[1025]` table to `hakmem_tiny.h` (after line 36)
2. Replace `hak_tiny_size_to_class()` with `hak_tiny_size_to_class_fast()`
3. Update all call sites in `hakmem_tiny.c`
4. Compile and verify correctness
#### Testing

```bash
# Build and verify correctness
make clean && make -j4

# Unit test: verify classification accuracy
./test_tiny_size_class   # (create if needed; sketch below)

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 83ns → 78-80ns
```
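A minimal sketch of the `test_tiny_size_class` harness referenced above. It assumes the legacy if-chain is temporarily kept under a reference name (`hak_tiny_size_to_class_ref()` here is an illustrative name, not an existing function) so the LUT can be cross-checked before the old classifier is removed.

```c
/* test_tiny_size_class.c - sketch; assumes hakmem_tiny.h exposes both the
 * LUT-based classifier and a renamed copy of the legacy if-chain. */
#include <stdio.h>
#include "hakmem_tiny.h"

int main(void) {
    for (size_t size = 0; size <= 1024; size++) {
        int lut = hak_tiny_size_to_class_fast(size);
        int ref = hak_tiny_size_to_class_ref(size);   /* legacy if-chain (illustrative name) */
        if (lut != ref) {
            fprintf(stderr, "mismatch at size %zu: lut=%d ref=%d\n", size, lut, ref);
            return 1;
        }
    }
    if (hak_tiny_size_to_class_fast(1025) != -1) {    /* out-of-range guard */
        fprintf(stderr, "size 1025 should be rejected\n");
        return 1;
    }
    printf("test_tiny_size_class: all 1025 entries OK\n");
    return 0;
}
```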
#### Rollback

```bash
git diff hakmem_tiny.h       # Verify changes
git checkout hakmem_tiny.h   # If regression detected
```

---

### P1: Remove Statistics from Critical Path

**File**: `hakmem_tiny.c`
**Effort**: 60 minutes
**Gain**: +10-15 ns
**Risk**: Low (statistics semantics preserved)
#### Current Implementation (Per-Allocation Sampling)

```c
// hakmem_tiny.c:656-659 (hot path)
void* p = mag->items[--mag->top].ptr;

// Sampled counter update (XOR-based PRNG)
t_tiny_rng ^= t_tiny_rng << 13;   // 3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << g_tiny_count_sample_exp) - 1u)) == 0u) {
    g_tiny_pool.alloc_count[class_idx]++;   // Atomic increment (contention!)
}

return p;
// Total stats overhead: 10-15 ns
```
#### Target Implementation (Batched Lazy Update)

```c
// New TLS counter (add to hakmem_tiny.c TLS section)
static __thread uint64_t t_tiny_alloc_counter[TINY_NUM_CLASSES] = {0};

// Hot path (remove all stats code)
void* p = mag->items[--mag->top].ptr;
// NO STATS HERE!
return p;
// Stats overhead: 0 ns

// Cold path: lazy accumulation (called during magazine refill).
// `refilled` is the number of blocks just handed to the magazine, so
// allocations are accounted for in batches instead of one at a time.
static void hak_tiny_lazy_counter_update(int class_idx, uint32_t refilled) {
    t_tiny_alloc_counter[class_idx] += refilled;
    // Flush to the shared counter once ~100 allocations have accumulated
    if (t_tiny_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += t_tiny_alloc_counter[class_idx];
        t_tiny_alloc_counter[class_idx] = 0;
    }
}

// Call from slow path (magazine refill function)
void hak_tiny_refill_magazine(...) {
    // ... existing refill logic ...
    hak_tiny_lazy_counter_update(class_idx, refilled);  // refilled = blocks just added
}
```
#### Implementation Steps

1. Add `t_tiny_alloc_counter[TINY_NUM_CLASSES]` TLS array
2. Remove XOR PRNG code from `hak_tiny_alloc()` hot path (lines 656-659)
3. Add `hak_tiny_lazy_counter_update()` function
4. Call lazy update in slow path (magazine refill, slab allocation), passing the number of blocks refilled
5. Update `hak_tiny_get_stats()` to flush pending TLS counters (sketched below)

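One possible shape for step 5 (a sketch; the actual `hak_tiny_get_stats()` signature is not shown in this document): drain the calling thread's pending TLS counts into the shared counters before reading them. Counts still pending in other threads' TLS arrays stay invisible until those threads hit their next refill, which is the accepted imprecision of the batched scheme.

```c
// Sketch: flush this thread's pending counts into the shared pool counters.
// Call at the top of hak_tiny_get_stats() (and optionally at thread exit).
static void hak_tiny_flush_tls_counters(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (t_tiny_alloc_counter[c] != 0) {
            g_tiny_pool.alloc_count[c] += t_tiny_alloc_counter[c];
            t_tiny_alloc_counter[c] = 0;
        }
    }
}
```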
#### Statistics Accuracy Trade-off

- **Before**: Sampled (1/16 allocations counted)
- **After**: Batched (accumulated every 100 allocations)
- **Impact**: Both are approximations; batching is more accurate and faster
#### Testing

```bash
# Build
make clean && make -j4

# Functional test: verify stats still work
./test_mf2   # Should show non-zero alloc_count

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 78-80ns → 63-70ns
```

#### Rollback

```bash
git diff hakmem_tiny.c
git checkout hakmem_tiny.c   # If stats broken
```

---
### P2: Inline Fast Path

**Files**: `hakmem_tiny_alloc_fast.h` (new), `hakmem.h`, `hakmem_tiny.c`
**Effort**: 60 minutes
**Gain**: +5-10 ns
**Risk**: Moderate (function splitting, ABI considerations)

#### Current Implementation (Unified Function)

```c
// hakmem_tiny.c: Single function for fast + slow path
void* hak_tiny_alloc(size_t size) {
    // Size classification
    int class_idx = hak_tiny_size_to_class(size);

    // Magazine check
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (mag->top > 0) {
        return mag->items[--mag->top].ptr;  // Fast path
    }

    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (slow path with locks)
    // ...
}
// Problem: Compiler can't inline due to size/complexity
// Call overhead: 5-10 ns
```
#### Target Implementation (Split Fast/Slow)

```c
// New file: hakmem_tiny_alloc_fast.h
#ifndef HAKMEM_TINY_ALLOC_FAST_H
#define HAKMEM_TINY_ALLOC_FAST_H

#include "hakmem_tiny.h"

// External slow path declaration
extern void* hak_tiny_alloc_slow(size_t size, int class_idx);

// Inlined fast path (magazine-only)
__attribute__((always_inline))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast bounds check
    if (__builtin_expect(size > TINY_MAX_SIZE, 0)) {
        return NULL;  // Not Tiny Pool range
    }

    // O(1) size classification (uses P0 LUT)
    int class_idx = g_tiny_size_to_class[size];

    // TLS magazine check (fast path only)
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (__builtin_expect(mag->top > 0, 1)) {
        return mag->items[--mag->top].ptr;
    }

    // Fall through to slow path
    return hak_tiny_alloc_slow(size, class_idx);
}

#endif
```

```c
// hakmem_tiny.c: Slow path only
void* hak_tiny_alloc_slow(size_t size, int class_idx) {
    // TLS active slab A
    TinySlab* slab_a = g_tls_active_slab_a[class_idx];
    if (slab_a && slab_a->free_count > 0) {
        // ... bitmap scan ...
    }

    // TLS active slab B
    // ...

    // Global pool (locks)
    // ...
}
```

```c
// hakmem.h: Update public API
#include "hakmem_tiny_alloc_fast.h"

// Use inlined fast path for tiny allocations
static inline void* hak_alloc_at(size_t size, void* site) {
    if (size <= TINY_MAX_SIZE) {
        return hak_tiny_alloc_hot(size);
    }
    // ... L2/L2.5 pools ...
}
```
#### Benefits

- **Zero call overhead**: Compiler inlines directly into caller
- **Better register allocation**: Hot path uses minimal stack
- **Improved branch prediction**: Fast path separate from cold code
#### Implementation Steps

1. Create `hakmem_tiny_alloc_fast.h`
2. Move magazine-only logic to `hak_tiny_alloc_hot()`
3. Rename `hak_tiny_alloc()` → `hak_tiny_alloc_slow()`
4. Update `hakmem.h` to include the new header
5. Update all call sites (or keep a thin compatibility wrapper; see the sketch below)

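If updating every call site in one pass feels risky, one option (an editor's suggestion, not part of the original plan) is to keep the old symbol as a thin wrapper in `hakmem_tiny.c` while the inlined path is rolled out:

```c
// Optional compatibility wrapper (sketch): preserves the hak_tiny_alloc()
// symbol for callers not yet switched to the inlined fast path.
// Requires hakmem_tiny.c to include hakmem_tiny_alloc_fast.h.
void* hak_tiny_alloc(size_t size) {
    return hak_tiny_alloc_hot(size);
}
```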
#### Testing

```bash
# Build
make clean && make -j4

# Functional test: verify all paths work
./test_mf2
./test_mf2_warmup

# Performance test
./bench_tiny --iterations=1000000 --threads=1
# Expected: 63-70ns → 55-65ns
```

#### Rollback

```bash
git rm hakmem_tiny_alloc_fast.h
git checkout hakmem.h hakmem_tiny.c
```

---
## Testing Strategy

### Phase-by-Phase Validation

#### After P0 (Lookup Table)

```bash
# Correctness
./test_tiny_size_class   # Verify all 1025 entries correct

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 83ns → 78-80ns

# No regressions on other sizes
./bench_allocators_hakmem --scenario json   # L2.5 64KB
./bench_allocators_hakmem --scenario mir    # L2.5 256KB
```

#### After P1 (Remove Stats)

```bash
# Correctness: stats still work
./test_mf2   # hak_tiny_get_stats() should show non-zero counts

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 78-80ns → 63-70ns

# Multi-threaded: verify TLS counters work
./bench_tiny_mt --iterations=100000 --threads=4
# Should see reasonable alloc_count
```
#### After P2 (Inline Fast Path)

```bash
# Correctness: all paths work
./test_mf2
./test_mf2_warmup

# Performance
./bench_tiny --iterations=1000000 --threads=1
# Target: 63-70ns → 55-65ns

# Code inspection: verify inlining
objdump -d bench_tiny | grep -A 20 'hak_alloc_at'
# Should show inline expansion, not call instruction
```

### Comprehensive Validation

```bash
# After all P0-P2 complete
make clean && make -j4

# Full benchmark suite
./bench_tiny --iterations=1000000 --threads=1     # Single-threaded
./bench_tiny_mt --iterations=100000 --threads=4   # Multi-threaded
./bench_allocators_hakmem --scenario json         # L2.5 64KB
./bench_allocators_hakmem --scenario mir          # L2.5 256KB

# Git commit
git add .
git commit -m "Tiny Pool optimization P0-P2: 83ns → 55-65ns

- P0: Lookup table size classification (+3-5ns)
- P1: Remove statistics from hot path (+10-15ns)
- P2: Inline fast path (+5-10ns)

Cumulative improvement: ~30-35%
Remaining gap to mimalloc (14ns): 3.5-4x (irreducible)"
```

---
## Risk Mitigation

### Low-Risk Approach

1. **Incremental changes**: One optimization at a time
2. **Validation at each step**: Benchmark + test after each of P0, P1, P2
3. **Git commits**: Separate commit for each optimization
4. **Rollback ready**: `git checkout` if a regression is detected

### Potential Issues and Mitigation

#### Issue 1: LUT Size (1 KB)

- **Risk**: L1 cache pressure
- **Mitigation**: 1 KB fits in L1 (32 KB typical), negligible impact
- **Fallback**: Use a smaller LUT (65 entries, round up to power-of-2); see the sketch below

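On that fallback: a 65-entry table at 16-byte granularity cannot separate class 0 (≤8 B) from class 1 (≤16 B), so a table-free variant of the same round-up-to-power-of-two idea is sketched here instead (an editor's sketch, not existing hakmem code):

```c
// Sketch: classify by rounding the size up to the next power of two and
// reading its bit width; matches the if-chain for sizes 0..1024.
static inline int hak_tiny_size_to_class_clz(size_t size) {
    if (size > 1024) return -1;
    size_t v = (size <= 8) ? 8 : size;   // sizes 0..8 all map to class 0
    // 64 - clz(v - 1) == ceil(log2(v)); subtract 3 because class 0 is 8 bytes
    return 64 - __builtin_clzll((unsigned long long)(v - 1)) - 3;
}
```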
#### Issue 2: Inlining Bloat

- **Risk**: Code size increase from inlining
- **Mitigation**: Only inline the magazine path (~10 instructions)
- **Fallback**: Use `__attribute__((hot))` instead of `always_inline`; see the sketch below

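A sketch of the attribute placement if bloat does show up (GCC/Clang syntax): keep the slow path out of line and cold so callers never pull it in, and demote the hot path from `always_inline` to `hot` if the compiler's own inlining judgment is preferred.

```c
// Slow path: never inlined, kept out of the hot instruction-cache footprint
__attribute__((noinline, cold))
void* hak_tiny_alloc_slow(size_t size, int class_idx);

// Fast path fallback: drop always_inline, just bias layout toward hot code
__attribute__((hot))
static inline void* hak_tiny_alloc_hot(size_t size);  /* body as in the P2 header */
```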
#### Issue 3: Stats Accuracy

- **Risk**: Batched counters less accurate than sampled
- **Mitigation**: Actually MORE accurate (no sampling variance)
- **Fallback**: Keep TLS counters but flush more frequently; see the sketch below

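Making the flush threshold a tunable keeps "flush more frequently" a one-line change; `TINY_STATS_FLUSH_THRESHOLD` is a name invented here for illustration.

```c
// Sketch: tunable flush threshold for the batched counters from P1.
#ifndef TINY_STATS_FLUSH_THRESHOLD
#define TINY_STATS_FLUSH_THRESHOLD 100   /* default matches the P1 sketch */
#endif

// Inside hak_tiny_lazy_counter_update():
//   if (t_tiny_alloc_counter[class_idx] >= TINY_STATS_FLUSH_THRESHOLD) { ... }
```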
---

## Success Criteria

### Performance Targets

| Optimization | Current | Target | Achieved? |
|--------------|---------|--------|-----------|
| **Baseline** | 83 ns/op | - | ✓ (established) |
| **P0 Complete** | 83 ns | 78-80 ns | [ ] |
| **P1 Complete** | 78-80 ns | 63-70 ns | [ ] |
| **P2 Complete** | 63-70 ns | 55-65 ns | [ ] |
| **P0-P2 Combined** | 83 ns | **50-55 ns** | [ ] |

**Stretch Goal**: 50 ns/op (~40% improvement, 3.5x gap to mimalloc)
### Functional Requirements

- [ ] All existing tests pass
- [ ] No regressions on L2/L2.5 pools
- [ ] Statistics still functional (batched counters work)
- [ ] Multi-threaded safety maintained
- [ ] Zero hard page faults (memory reuse preserved)

### Code Quality

- [ ] Clean compilation with `-Wall -Wextra`
- [ ] No compiler warnings
- [ ] Documented trade-offs (stats batching, LUT size)
- [ ] Git commit messages reference this strategy doc
---

## Timeline and Effort

### Phase I: Quick Wins (P0 + P1)

- **P0 Implementation**: 30 minutes
- **P0 Testing**: 15 minutes
- **P1 Implementation**: 60 minutes
- **P1 Testing**: 15 minutes
- **Total Phase I**: **2 hours**

### Phase II: Fast Path (P2)

- **P2 Implementation**: 60 minutes
- **P2 Testing**: 15 minutes
- **Total Phase II**: **1.25 hours**

### Phase III: Validation

- **Comprehensive testing**: 30 minutes
- **Documentation update**: 15 minutes
- **Git commit**: 15 minutes
- **Total Phase III**: **1 hour**

### Grand Total: 4.25 hours

**Realistic estimate with contingency**: **5-6 hours**

---
## Beyond P0-P2: Future Optimization (Optional)

### P3: Branch Elimination (Not Included in This Strategy)

**Effort**: 90 minutes
**Gain**: +10-15 ns
**Risk**: High (complex, subtle bugs possible)

**Why Deferred**:

- Diminishing returns (30% improvement already achieved)
- Complexity vs gain trade-off unfavorable
- Can be addressed in a future iteration

### Alternative: NEXT_STEPS.md Approach

After P0-P2 are complete, consider the NEXT_STEPS.md optimizations:

- MPSC opportunistic drain during the alloc slow path
- Immediate full→free slab promotion after drain
- Adaptive magazine capacity per site

**These may yield better ROI than P3.**

---
## Comparison to Alternatives

### Alternative 1: Implement NEXT_STEPS.md First

**Pros**: Novel features (adaptive magazines, ELO learning)
**Cons**: Higher complexity, uncertain performance gain
**Why Not**: The mimalloc analysis shows stats overhead is 10-15 ns; addressing the known bottleneck first is safer

### Alternative 2: Copy mimalloc's Free List Architecture

**Pros**: Would match mimalloc's 14 ns performance
**Cons**: Abandons the bitmap approach, loses diagnostics/ownership tracking
**Why Not**: Violates hakmem's research goals (flexible architecture)

### Alternative 3: Do Nothing

**Pros**: Zero effort
**Cons**: The 5.9x gap remains, nothing is learned
**Why Not**: P0-P2 are low-risk, high-ROI quick wins
## Conclusion

This strategy prioritizes **high-impact, low-risk optimizations** to achieve a **30-40% performance improvement** in 4-6 hours of work.

**Key Principles**:

1. **Incremental validation**: Test after each step
2. **Focus on hot path**: Remove overhead from the critical path
3. **Preserve semantics**: No behavior changes, only optimization
4. **Accept trade-offs**: 50-55 ns is excellent; chasing 14 ns abandons research goals

**Next Steps**:

1. Review and approve this strategy
2. Implement P0 (Lookup Table) - 30 minutes
3. Validate P0 - 15 minutes
4. Implement P1 (Remove Stats) - 60 minutes
5. Validate P1 - 15 minutes
6. Implement P2 (Inline Fast Path) - 60 minutes
7. Validate P2 - 15 minutes
8. Comprehensive testing and documentation - 1 hour

**Expected Outcome**: Tiny Pool performance improves from 83 ns to 50-55 ns, closing roughly 40% of the gap with mimalloc while preserving hakmem's research architecture.

---

**Last Updated**: 2025-10-26
**Status**: Ready for implementation
**Approval**: Pending user confirmation
**Implementation Start**: After approval