hakmem/docs/design/ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md
# Atomic Freelist Implementation Strategy
## Executive Summary
**Good News**: Only **90 freelist access sites** (not 589), making full conversion feasible in 4-6 hours.
**Recommendation**: **Hybrid Approach** - Convert hot paths to lock-free atomic operations, use relaxed ordering for cold paths, skip debug/stats sites entirely.
**Expected Performance Impact**: <3% regression for atomic operations in hot paths.
---
## 1. Accessor Function Design
### Core API (in `core/box/slab_freelist_atomic.h`)
```c
#ifndef SLAB_FREELIST_ATOMIC_H
#define SLAB_FREELIST_ATOMIC_H
#include <stdatomic.h>
#include "../superslab/superslab_types.h"
// ============================================================================
// HOT PATH: Lock-Free Operations (use CAS for push/pop)
// ============================================================================
// Atomic POP (lock-free, for refill hot path)
// Returns NULL if freelist empty
static inline void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
    if (!head) return NULL;
    void* next = tiny_next_read(class_idx, head);
    while (!atomic_compare_exchange_weak_explicit(
               &meta->freelist,
               &head,                // Expected value (updated on failure)
               next,                 // Desired value
               memory_order_acquire, // Success ordering (we consume the node; nothing to publish)
               memory_order_acquire  // Failure ordering (reload head)
           )) {
        // CAS failed (another thread modified freelist)
        if (!head) return NULL;                 // List became empty
        next = tiny_next_read(class_idx, head); // Reload next pointer
    }
    return head;
}
// Atomic PUSH (lock-free, for free hot path)
static inline void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_relaxed);
    do {
        tiny_next_write(class_idx, node, head); // Link node->next = head
    } while (!atomic_compare_exchange_weak_explicit(
                 &meta->freelist,
                 &head,                // Expected value (updated on failure)
                 node,                 // Desired value
                 memory_order_release, // Success ordering (publish node->next)
                 memory_order_relaxed  // Failure ordering
             ));
}
// ============================================================================
// WARM PATH: Relaxed Load/Store (single-threaded or low contention)
// ============================================================================
// Simple load (relaxed ordering for checks/prefetch)
static inline void* slab_freelist_load_relaxed(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed);
}
// Simple store (relaxed ordering for init/cleanup)
static inline void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value) {
    atomic_store_explicit(&meta->freelist, value, memory_order_relaxed);
}
// NULL check (relaxed ordering)
static inline bool slab_freelist_is_empty(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed) == NULL;
}
static inline bool slab_freelist_is_nonempty(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed) != NULL;
}
// ============================================================================
// COLD PATH: Direct Access (for debug/stats - already atomic type)
// ============================================================================
// For printf/debugging: cast to void* for printing
#define SLAB_FREELIST_DEBUG_PTR(meta) \
    ((void*)atomic_load_explicit(&(meta)->freelist, memory_order_relaxed))
#endif // SLAB_FREELIST_ATOMIC_H
```
---
## 2. Critical Site List (Top 20 - MUST Convert)
### Tier 1: Ultra-Hot Paths (5-10 ops/allocation)
1. **`core/tiny_superslab_alloc.inc.h:118-145`** - Fast alloc freelist pop
2. **`core/hakmem_tiny_refill_p0.inc.h:252-253`** - P0 batch refill check
3. **`core/box/carve_push_box.c:33-34, 120-121, 128-129`** - Carve rollback push
4. **`core/hakmem_tiny_tls_ops.h:77-85`** - TLS freelist drain
### Tier 2: Hot Paths (1-2 ops/allocation)
5. **`core/tiny_refill_opt.h:199-230`** - Refill chain pop
6. **`core/tiny_free_magazine.inc.h:135-136`** - Magazine free push
7. **`core/box/carve_push_box.c:172-180`** - Freelist carve with push
### Tier 3: Warm Paths (0.1-1 ops/allocation)
8. **`core/refill/ss_refill_fc.h:151-153`** - FC refill pop
9. **`core/hakmem_tiny_tls_ops.h:203`** - TLS freelist init
10. **`core/slab_handle.h:211, 259, 308`** - Slab handle ops
**Total Critical Sites**: ~40-50 (out of 90 total)
---
## 3. Non-Critical Site Strategy
### Skip Entirely (10-15 sites)
- **Debug/Stats**: `core/box/ss_stats_box.c:79`, `core/tiny_debug.h:48`
- **Reason**: Already atomic type, simple load for printing is fine
- **Action**: Change `meta->freelist` → `SLAB_FREELIST_DEBUG_PTR(meta)`
- **Initialization** (already protected by single-threaded setup):
- `core/box/ss_allocation_box.c:66` - Initial freelist setup
- `core/hakmem_tiny_superslab.c` - SuperSlab init
### Use Relaxed Load/Store (20-30 sites)
- **Condition checks**: `if (meta->freelist)` → `if (slab_freelist_is_nonempty(meta))`
- **Prefetch**: `__builtin_prefetch(&meta->freelist, 0, 3)` → keep as-is (atomic type is fine)
- **Init/cleanup**: `meta->freelist = NULL` → `slab_freelist_store_relaxed(meta, NULL)`
### Convert to Lock-Free (10-20 sites)
- **All POP operations** in hot paths
- **All PUSH operations** in free paths
- **Carve rollback** operations
---
## 4. Phased Implementation Plan
### Phase 1: Hot Paths Only (2-3 hours) 🔥
**Goal**: Fix Larson 8T crash with minimal changes
**Files to modify** (5 files, ~25 sites):
1. `core/tiny_superslab_alloc.inc.h` (fast alloc pop)
2. `core/hakmem_tiny_refill_p0.inc.h` (P0 batch refill)
3. `core/box/carve_push_box.c` (carve/rollback push)
4. `core/hakmem_tiny_tls_ops.h` (TLS drain)
5. Create `core/box/slab_freelist_atomic.h` (accessor API)
**Testing**:
```bash
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 10000000 256 42 # Single-threaded baseline
./build.sh larson_hakmem
./out/release/larson_hakmem 8 100000 256 # 8 threads (expect no crash)
```
**Expected Result**: Larson 8T stable, <5% regression on single-threaded
---
### Phase 2: All TLS Paths (2-3 hours) ⚡
**Goal**: Full MT safety for all allocation paths
**Files to modify** (10 files, ~40 sites):
- All files from Phase 1 (complete conversion)
- `core/tiny_refill_opt.h` (refill chain ops)
- `core/tiny_free_magazine.inc.h` (magazine push)
- `core/refill/ss_refill_fc.h` (FC refill)
- `core/slab_handle.h` (slab handle ops)
**Testing**:
```bash
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 10000000 256 42 # Baseline check
./build.sh stress_test_mt_hakmem
./out/release/stress_test_mt_hakmem 16 100000 # 16 threads stress test
```
**Expected Result**: All MT tests pass, <3% regression
---
### Phase 3: Cleanup (1-2 hours) 🧹
**Goal**: Convert/document remaining sites
**Files to modify** (5 files, ~25 sites):
- Debug/stats sites: Add `SLAB_FREELIST_DEBUG_PTR()` macro
- Init/cleanup sites: Use `slab_freelist_store_relaxed()`
- Add comments explaining MT safety assumptions
**Testing**:
```bash
make clean && make all # Full rebuild
./run_all_tests.sh # Comprehensive test suite
```
**Expected Result**: Clean build, all tests pass
---
## 5. Automated Conversion Script
### Semi-Automated Sed Script
```bash
#!/bin/bash
# atomic_freelist_convert.sh - Phase 1 conversion helper
set -e
# Backup
git stash
git checkout -b atomic-freelist-phase1
# Step 1: Convert plain NULL checks (read-only, safe).
# Only the exact pattern is converted; compound conditions such as
# `if (x && meta->freelist)` are deliberately left for manual review.
find core \( -name "*.c" -o -name "*.h" \) -print0 | xargs -0 sed -i \
  's/if (meta->freelist)/if (slab_freelist_is_nonempty(meta))/g'
# Step 2: Convert plain condition checks in while loops
find core \( -name "*.c" -o -name "*.h" \) -print0 | xargs -0 sed -i \
  's/while (meta->freelist)/while (slab_freelist_is_nonempty(meta))/g'
# Step 3: Show remaining manual conversions needed
echo "=== REMAINING MANUAL CONVERSIONS ==="
grep -rn "meta->freelist" core/ --include="*.c" --include="*.h" | \
  grep -v "slab_freelist_" | wc -l
echo "Review changes:"
git diff --stat
echo ""
echo "If good: git commit -am 'Phase 1: Convert freelist NULL checks'"
echo "If bad: git checkout . && git checkout master"
```
**Limitations**:
- Cannot auto-convert POP operations (need CAS loop)
- Cannot auto-convert PUSH operations (need tiny_next_write + CAS)
- Manual review required for all changes
---
## 6. Performance Projection
### Single-Threaded Impact
| Operation | Before | After (Relaxed) | After (CAS) | Overhead |
|-----------|--------|-----------------|-------------|----------|
| Load | 1 cycle | 1 cycle | 1 cycle | 0% |
| Store | 1 cycle | 1 cycle | - | 0% |
| POP (freelist) | 3-5 cycles | - | 8-12 cycles | +60-140% |
| PUSH (freelist) | 3-5 cycles | - | 8-12 cycles | +60-140% |
**Expected Regression**:
- Best case: 0-1% (mostly relaxed loads)
- Worst case: 3-5% (CAS overhead in hot paths)
- Realistic: 2-3% (good branch prediction, low contention)
**Mitigation**: Lock-free CAS is still faster than mutex (20-30 cycles)
### Multi-Threaded Impact
| Metric | Before (Non-Atomic) | After (Atomic) | Change |
|--------|---------------------|----------------|--------|
| Larson 8T | CRASH | Stable | ✅ FIXED |
| Throughput (1T) | 25.1M ops/s | 24.4-24.8M ops/s | -1.2% to -2.8% |
| Throughput (8T) | CRASH | ~18-20M ops/s | ✅ NEW |
| Scalability | 0% (crashes) | 70-80% | ✅ GAIN |
**Expected Benefit**: Stability + MT scalability >> 2-3% single-threaded cost
---
## 7. Implementation Example (Phase 1)
### Before: `core/tiny_superslab_alloc.inc.h:117-145`
```c
if (__builtin_expect(meta->freelist != NULL, 0)) {
    void* block = meta->freelist;
    if (meta->class_idx != class_idx) {
        meta->freelist = NULL;
        goto bump_path;
    }
    // ... pop logic ...
    meta->freelist = tiny_next_read(meta->class_idx, block);
    return (void*)((uint8_t*)block + 1);
}
```
### After: `core/tiny_superslab_alloc.inc.h:117-145`
```c
if (__builtin_expect(slab_freelist_is_nonempty(meta), 0)) {
    void* block = slab_freelist_pop_lockfree(meta, class_idx);
    if (!block) {
        // Another thread won the race, fall through to bump path
        goto bump_path;
    }
    if (meta->class_idx != class_idx) {
        // Wrong class, return to freelist and go to bump path
        slab_freelist_push_lockfree(meta, class_idx, block);
        goto bump_path;
    }
    return (void*)((uint8_t*)block + 1);
}
```
**Changes**:
- NULL check → `slab_freelist_is_nonempty()`
- Manual pop → `slab_freelist_pop_lockfree()`
- Handle CAS race (block == NULL case)
- Simpler logic (CAS handles next pointer atomically)
---
## 8. Risk Assessment
### Low Risk ✅
- **Phase 1**: Only 5 files, ~25 sites, well-tested patterns
- **Rollback**: Easy (`git checkout master`)
- **Testing**: Can A/B test with env variable
### Medium Risk ⚠️
- **Performance**: 2-3% regression possible
- **Subtle bugs**: CAS retry loops need careful review
- **ABA problem**: mitigated by pointer tagging (already in codebase)
### High Risk ❌
- **None**: Atomic type already declared, no ABI changes
---
## 9. Alternative Approaches (Considered)
### Option A: Mutex per Slab (rejected)
**Pros**: Simple, guaranteed correctness
**Cons**: 40-byte overhead per slab, 10-20x performance hit
### Option B: Global Lock (rejected)
**Pros**: Zero code changes, 1-line fix
**Cons**: Serializes all allocation, kills MT performance
### Option C: TLS-Only (rejected)
**Pros**: No atomics needed
**Cons**: Cannot handle remote free (required for MT)
### Option D: Hybrid (SELECTED) ✅
**Pros**: Best performance, incremental implementation
**Cons**: More complex, requires careful memory ordering
---
## 10. Memory Ordering Rationale
### Relaxed (`memory_order_relaxed`)
**Use case**: Single-threaded or benign races (e.g., stats)
**Cost**: 0 cycles (no fence)
**Example**: `if (meta->freelist)` - checking emptiness
### Acquire (`memory_order_acquire`)
**Use case**: Loading pointer before dereferencing
**Cost**: 1-2 cycles (read fence on some architectures)
**Example**: POP freelist head before reading `next` pointer
### Release (`memory_order_release`)
**Use case**: Publishing pointer after setup
**Cost**: 1-2 cycles (write fence on some architectures)
**Example**: PUSH node to freelist after writing `next` pointer
### AcqRel (`memory_order_acq_rel`)
**Use case**: CAS success path (acquire+release)
**Cost**: 2-4 cycles (full fence on some architectures)
**Example**: Not used (separate acquire/release in CAS)
### SeqCst (`memory_order_seq_cst`)
**Use case**: Total ordering required
**Cost**: 5-10 cycles (expensive fence)
**Example**: Not needed for freelist (per-slab ordering sufficient)
**Chosen**: Acquire/Release for CAS, Relaxed for checks (optimal trade-off)
---
## 11. Testing Strategy
### Phase 1 Tests
```bash
# Baseline (before conversion)
./out/release/bench_random_mixed_hakmem 10000000 256 42
# Record: 25.1M ops/s
# After conversion (expect: 24.4-24.8M ops/s)
./out/release/bench_random_mixed_hakmem 10000000 256 42
# MT stability (expect: no crash)
./out/release/larson_hakmem 8 100000 256
# Correctness (expect: 0 errors)
./out/release/bench_fixed_size_hakmem 100000 256 128
./out/release/bench_fixed_size_hakmem 100000 1024 128
```
### Phase 2 Tests
```bash
# Stress test all sizes
for size in 128 256 512 1024; do
  ./out/release/bench_random_mixed_hakmem 1000000 $size 42
done
# MT scaling test
for threads in 1 2 4 8 16; do
  ./out/release/larson_hakmem $threads 100000 256
done
```
### Phase 3 Tests
```bash
# Full test suite
./run_all_tests.sh
# ASan build (detect races)
./build.sh asan bench_random_mixed_hakmem
./out/asan/bench_random_mixed_hakmem 100000 256 42
# TSan build (detect data races)
./build.sh tsan larson_hakmem
./out/tsan/larson_hakmem 8 10000 256
```
---
## 12. Success Criteria
### Phase 1 (Hot Paths)
- ✅ Larson 8T runs without crash (100K iterations)
- ✅ Single-threaded regression <5% (24.0M+ ops/s)
- ✅ No ASan/TSan warnings
- ✅ Clean build with no warnings
### Phase 2 (All Paths)
- ✅ All MT tests pass (1T, 2T, 4T, 8T, 16T)
- ✅ Single-threaded regression <3% (24.4M+ ops/s)
- ✅ MT scaling 70%+ (8T = 5.6x+ speedup)
- ✅ No memory leaks (Valgrind clean)
### Phase 3 (Complete)
- ✅ All 90 sites converted or documented
- ✅ Full test suite passes (100% pass rate)
- ✅ Code review approved
- ✅ Documentation updated
---
## 13. Rollback Plan
If Phase 1 fails (>5% regression or instability):
```bash
# Revert to master
git checkout master
git branch -D atomic-freelist-phase1
# Try alternative: Per-slab spinlock (medium overhead)
# Add uint8_t lock field to TinySlabMeta
# Use __sync_lock_test_and_set() for 1-byte spinlock
# Expected: 5-10% overhead, but guaranteed correctness
```
---
## 14. Next Steps
1. **Create accessor header** (`core/box/slab_freelist_atomic.h`) - 30 min
2. **Phase 1 conversion** (5 files, ~25 sites) - 2-3 hours
3. **Test Phase 1** (single + MT tests) - 1 hour
4. **If pass**: Continue to Phase 2
5. **If fail**: Review, fix, or rollback
**Estimated Total Time**: 4-6 hours for full implementation (all 3 phases)
---
## 15. Code Review Checklist
Before merging:
- [ ] All CAS loops handle retry correctly
- [ ] Memory ordering documented for each site
- [ ] No direct `meta->freelist` access remains (except debug)
- [ ] All tests pass (single + MT)
- [ ] ASan/TSan clean
- [ ] Performance regression <3%
- [ ] Documentation updated (CLAUDE.md)
---
## Summary
**Approach**: Hybrid - Lock-free CAS for hot paths, relaxed atomics for cold paths
**Effort**: 4-6 hours (3 phases)
**Risk**: Low (incremental, easy rollback)
**Performance**: 2-3% single-threaded regression, in exchange for MT stability and scalability
**Benefit**: Unlocks MT performance without sacrificing single-threaded speed
**Recommendation**: Proceed with Phase 1 (2-3 hours) and evaluate results before committing to full implementation.