# Tiny Pool Optimization Roadmap
Quick reference for implementing mimalloc-style optimizations in hakmem's Tiny Pool.
**Current Performance**: 83 ns/op for 8-64B allocations
**Target Performance**: 30-50 ns/op (realistic with optimizations)
**Gap to mimalloc**: Still 2-3.5x slower (fundamental architecture difference)
---
## Quick Wins (10-20 ns improvement)
### 1. Lookup Table Size Classification
**Effort**: 30 minutes | **Gain**: 3-5 ns
Replace if-chain with table lookup:
```c
// Before
static inline int hak_tiny_size_to_class(size_t size) {
    if (size <= 8)  return 0;
    if (size <= 16) return 1;
    if (size <= 32) return 2;
    // ... sequential if-chain up to 64
}

// After: table indexed directly by size, matching the if-chain above
// (sizes 0-8 -> class 0, 9-16 -> 1, 17-32 -> 2, 33-64 -> 3)
static const uint8_t g_size_to_class[65] = {
    0,0,0,0,0,0,0,0,0,      /* 0-8   */
    1,1,1,1,1,1,1,1,        /* 9-16  */
    2,2,2,2,2,2,2,2,        /* 17-24 */
    2,2,2,2,2,2,2,2,        /* 25-32 */
    3,3,3,3,3,3,3,3,        /* 33-40 */
    // ... continue with 3 up to index 64
};

static inline int hak_tiny_size_to_class_fast(size_t size) {
    return likely(size <= 64) ? g_size_to_class[size] : -1;
}
```
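Because the table and the if-chain can drift apart, it is worth verifying them against each other once in a debug build. A throwaway sketch, not existing hakmem code (`hak_tiny_verify_size_table` is a hypothetical name):

```c
#include <assert.h>
#include <stddef.h>

// Debug-only self-check: the lookup table must agree with the old
// if-chain for every size it covers.
static void hak_tiny_verify_size_table(void) {
    for (size_t s = 0; s <= 64; s++)
        assert(g_size_to_class[s] == hak_tiny_size_to_class(s));
}
```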
**Implementation**:
1. File: `hakmem_tiny.h`
2. Add static const array after line 36 (after `g_tiny_blocks_per_slab`)
3. Update `hak_tiny_size_to_class()` to use table
4. Add `__builtin_expect()` for the fast path (a minimal `likely()` macro sketch follows below)
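If the codebase does not already define branch-hint macros, a minimal `likely()`/`unlikely()` pair is the usual convention; this is an assumed definition, not taken from hakmem:

```c
// Branch-prediction hint macros built on __builtin_expect (GCC/Clang).
#ifndef likely
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#endif
```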
---
### 2. Remove Statistics from Critical Path
**Effort**: 1 hour | **Gain**: 10-15 ns
Move sampled counter updates to separate tracking:
```c
// Before (hot path)
void* p = mag->items[--mag->top].ptr;
t_tiny_rng ^= t_tiny_rng << 13;   // xorshift sampling: ~3 ns overhead
t_tiny_rng ^= t_tiny_rng >> 17;
t_tiny_rng ^= t_tiny_rng << 5;
if ((t_tiny_rng & ((1u << g_tiny_count_sample_exp) - 1u)) == 0u)
    g_tiny_pool.alloc_count[class_idx]++;
return p;

// After (hot path)
void* p = mag->items[--mag->top].ptr;
// Stats update deferred (see hak_tiny_lazy_counter_update below)
return p;

// New: lazy counter accumulation (cold path)
static __thread uint32_t g_tls_alloc_counter[TINY_NUM_CLASSES]; // per-thread pending counts

static void hak_tiny_lazy_counter_update(int class_idx) {
    if (++g_tls_alloc_counter[class_idx] >= 100) {
        g_tiny_pool.alloc_count[class_idx] += g_tls_alloc_counter[class_idx];
        g_tls_alloc_counter[class_idx] = 0;
    }
}
```
**Implementation**:
1. File: `hakmem_tiny.c`
2. Remove sampled XOR code from `hak_tiny_alloc()` lines 656-659
3. Replace with simple per-thread counter
4. Call `hak_tiny_lazy_counter_update()` in slow path only
5. Update `hak_tiny_get_stats()` to account for the lazy counters (one possible flush scheme is sketched below)
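A sketch of step 5, under the assumption that `hak_tiny_get_stats()` currently just reads the global counters; the flush helper is an illustrative name, not an existing hakmem function:

```c
// Called from the slow path (and optionally at thread exit) to publish the
// per-thread pending counts into the shared counters.
static void hak_tiny_flush_tls_counters(void) {
    for (int c = 0; c < TINY_NUM_CLASSES; c++) {
        if (g_tls_alloc_counter[c] != 0) {
            __atomic_fetch_add(&g_tiny_pool.alloc_count[c],
                               g_tls_alloc_counter[c], __ATOMIC_RELAXED);
            g_tls_alloc_counter[c] = 0;
        }
    }
}
// Between flushes, hak_tiny_get_stats() undercounts each class by at most
// the flush threshold (100) per live thread.
```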
---
### 3. Inline Fast Path
**Effort**: 1 hour | **Gain**: 5-10 ns
Create separate inlined fast-path function:
```c
// New file: hakmem_tiny_alloc_fast.h
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Magazine fast path only (no TLS active slab, no locks)
    if (size > TINY_MAX_SIZE) return NULL;
    int class_idx = g_size_to_class[size];
    TinyTLSMag* mag = &g_tls_mags[class_idx];
    if (likely(mag->top > 0)) {
        return mag->items[--mag->top].ptr;
    }
    // Fall through to slow path
    extern void* hak_tiny_alloc_slow(size_t size);
    return hak_tiny_alloc_slow(size);
}
```
**Implementation**:
1. Create: `hakmem_tiny_alloc_fast.h`
2. Move pure magazine fast path here
3. Declare as `__attribute__((always_inline))`
4. Include from `hakmem.h` for public API
5. Keep the slow path in `hakmem_tiny.c` (a possible `hakmem.h` wiring is sketched below)
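One way the public entry point might dispatch through the new header; `hakmem_malloc` and `hak_l2_alloc` are placeholder names here, since the actual public API is not shown in this document:

```c
// hakmem.h (sketch)
#include <stddef.h>
#include "hakmem_tiny_alloc_fast.h"

void* hak_l2_alloc(size_t size);   // existing medium/large path (placeholder name)

static inline void* hakmem_malloc(size_t size) {
    // Tiny sizes take the inlined magazine fast path; everything else
    // goes straight to the larger pools.
    if (likely(size <= TINY_MAX_SIZE))
        return hak_tiny_alloc_hot(size);
    return hak_l2_alloc(size);
}
```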
---
## Medium Effort (1-15 ns improvement each)
### 4. Combine TLS Reads into Single Structure
**Effort**: 2 hours | **Gain**: 2-3 ns
```c
// Before
TinyTLSMag* mag  = &g_tls_mags[class_idx];
TinySlab* slab_a = &g_tls_active_slab_a[class_idx];
TinySlab* slab_b = &g_tls_active_slab_b[class_idx];
// 3 separate TLS reads + prefetch misses

// After
typedef struct {
    TinyTLSMag mag;    // All magazine data
    TinySlab*  slab_a; // Active slab pointers
    TinySlab*  slab_b;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];

TinyTLSCache* cache = &g_tls_cache[class_idx]; // 1 TLS read
// All three data structures prefetched together
```
**Benefits**:
- Single TLS read instead of 3
- All related data on same cache line
- Better prefetcher behavior
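For the single-cache-line benefit to hold, each per-class entry should not straddle a line boundary. One hedged way to enforce that, using the same attribute style as the rest of this document (whether it fits in one line depends on the size of `TinyTLSMag`, which is not shown here):

```c
// Align each per-class cache entry to a cache line so magazine metadata and
// slab pointers are fetched together rather than split across two lines.
typedef struct __attribute__((aligned(64))) {
    TinyTLSMag mag;
    TinySlab*  slab_a;
    TinySlab*  slab_b;
} TinyTLSCache;

static __thread TinyTLSCache g_tls_cache[TINY_NUM_CLASSES];
```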
---
### 5. Hardware Prefetching Hints
**Effort**: 30 minutes | **Gain**: 1-2 ns (cumulative)
```c
// In hot path loop (e.g., bench_tiny_mt.c)
void* p = mag->items[--mag->top].ptr;
if (mag->top > 0) {
    // Prefetch the block that the *next* pop will return
    __builtin_prefetch(mag->items[mag->top - 1].ptr, 0, 3);
    // args: 0 = prefetch for read, 3 = high temporal locality (keep in L1)
}
use_allocation(p);
```
---
### 6. Branchless Fallback Logic
**Effort**: 1.5 hours | **Gain**: 10-15 ns
Use conditional moves instead of branches:
```c
// Before: 2+ unpredictable branches on the fall-through path
if (mag->top > 0) {
    return mag->items[--mag->top].ptr;
}
if (slab_a && slab_a->free_count > 0) {
    // ... allocation from slab
}
// Fall through to slow path

// After: accumulate into one pointer so the compiler can emit conditional
// moves and a single final branch
void* p = NULL;
if (mag->top > 0) {
    p = mag->items[--mag->top].ptr;
}
if (!p && slab_a && slab_a->free_count > 0) {
    // ... allocate from slab_a, storing the block pointer into p
}
return p != NULL ? p : hak_tiny_alloc_slow(size);
```
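Whether the compiler actually emits conditional moves for this shape depends on the target and optimization level; check the generated assembly of the hot function (for example with `gcc -O3 -S` or `objdump -d`) rather than assuming it.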
---
## Advanced Optimizations (2-5 ns improvement)
### 7. Code Layout (Hot/Cold Separation)
**Effort**: 2 hours | **Gain**: 2-5 ns
Use compiler pragmas to place fast path in hot section:
```c
// In hakmem_tiny_alloc_fast.h
__attribute__((section(".text.hot")))
__attribute__((aligned(64)))
static inline void* hak_tiny_alloc_hot(size_t size) {
    // Fast path only
}

// In hakmem_tiny.c
__attribute__((section(".text.cold")))
static void* hak_tiny_alloc_slow(size_t size) {
    // Slow path: locks, scanning, etc.
}
```
**Benefits**:
- Hot path packed in contiguous instruction cache
- Fewer I-cache misses
- Better CPU prefetching
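An alternative to naming sections by hand is to let the compiler place the functions; this sketch assumes GCC/Clang, which honor function-level hot/cold attributes (GCC puts hot-marked functions in `.text.hot` and cold-marked ones in `.text.unlikely`):

```c
// Compiler-managed placement: attributes instead of explicit section names.
// "cold" additionally biases the function toward size optimization.
__attribute__((hot))  static inline void* hak_tiny_alloc_hot(size_t size);
__attribute__((cold)) void* hak_tiny_alloc_slow(size_t size);
```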
---
## Implementation Priority (Time Investment vs Gain)
| Priority | Optimization | Effort | Gain | Ratio |
|----------|--------------|--------|------|-------|
| **P0** | Lookup table classification | 30min | 3-5ns | 10x ROI |
| **P1** | Remove stats overhead | 1hr | 10-15ns | 15x ROI |
| **P2** | Inline fast path | 1hr | 5-10ns | 7x ROI |
| P3 | Branch elimination | 1.5hr | 10-15ns | 7x ROI |
| P4 | Combined TLS reads | 2hr | 2-3ns | 1.5x ROI |
| P5 | Code layout | 2hr | 2-5ns | 2x ROI |
| P6 | Prefetching hints | 30min | 1-2ns | 3x ROI |
---
## Testing Strategy
After each optimization:
```bash
# Rebuild
make clean && make -j4
# Single-threaded benchmark
./bench_tiny --iterations=1000000 --threads=1
# Should show improvement in latency_ns
# Multi-threaded verification
./bench_tiny --iterations=100000 --threads=4
# Should maintain thread-safety and hit rate
# Compare against baseline
./docs/bench_compare.sh baseline optimized
```
---
## Expected Results Timeline
**Phase 1** (P0+P1, ~2 hours):
- Current: 83 ns/op
- Expected: 65-70 ns/op
- Gain: ~18% improvement
**Phase 2** (P0+P1+P2, ~3 hours):
- Expected: 55-65 ns/op
- Gain: ~25-30% improvement
**Phase 3** (P0+P1+P2+P3, ~4.5 hours):
- Expected: 45-55 ns/op
- Gain: ~40-45% improvement
- Still 3-4x slower than mimalloc (fundamental difference)
**Why still slower than mimalloc (14 ns)?**
Irreducible architectural gaps:
1. Bitmap lookup [5 ns] vs free list head [1 ns]
2. Magazine validation [3-5 ns] vs implicit ownership [0 ns]
3. Thread ownership tracking [2-3 ns] vs per-thread pages [0 ns]
**Total irreducible gap: 10-13 ns**
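To make gap 1 concrete, here is an illustrative comparison (neither snippet is hakmem or mimalloc code): popping an intrusive free-list head is one dependent load plus a store, while a bitmap allocator pays for a find-first-set, a mask update, and an address computation before touching the block.

```c
#include <stdint.h>
#include <stddef.h>

// Free-list style (mimalloc-like): the next pointer lives in the free block.
static inline void* freelist_pop(void** head) {
    void* p = *head;
    if (p) *head = *(void**)p;          // one load + one store
    return p;
}

// Bitmap style (slab-like): locate a free bit, clear it, derive the address.
static inline void* bitmap_pop(uint64_t* bitmap, char* base, size_t block_size) {
    if (*bitmap == 0) return NULL;
    unsigned idx = (unsigned)__builtin_ctzll(*bitmap);  // find first set bit
    *bitmap &= *bitmap - 1;                             // clear that bit
    return base + (size_t)idx * block_size;
}
```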
---
## Code Files to Modify
| File | Change | Priority |
|------|--------|----------|
| `hakmem_tiny.h` | Add size_to_class LUT | P0 |
| `hakmem_tiny.c` | Remove stats from hot path | P1 |
| `hakmem_tiny_alloc_fast.h` | NEW - Inlined fast path | P2 |
| `hakmem_tiny.c` | Branchless fallback | P3 |
| `hakmem_tiny.h` | Combine TLS structure | P4 |
| `bench_tiny.c` | Add prefetch hints | P6 |
| `Makefile` | Hot/cold sections (-ffunction-sections) | P5 |
---
## Rollback Plan
Each optimization is independent and can be reverted:
```bash
# If optimization causes regression:
git diff <file> # See changes
git checkout <file> # Revert single file
./bench_tiny # Re-verify
```
All optimizations preserve semantics - no behavior changes.
---
## Success Criteria
- [ ] P0 Implementation: 80-82 ns/op (no regression)
- [ ] P1 Implementation: 68-72 ns/op (+15% from current)
- [ ] P2 Implementation: 60-65 ns/op (+22% from current)
- [ ] P3 Implementation: 50-55 ns/op (+35% from current)
- [ ] All changes compile with `-O3 -march=native`
- [ ] All benchmarks pass thread-safety verification
- [ ] No regressions on medium/large allocations (L2 pool)
---
**Last Updated**: 2025-10-26
**Status**: Ready for implementation
**Owner**: [Team]
**Target**: Implement P0-P2 in next sprint