# Atomic Freelist Implementation Strategy

## Executive Summary

**Good News**: Only **90 freelist access sites** (not 589), making full conversion feasible in 4-6 hours.

**Recommendation**: **Hybrid Approach** - Convert hot paths to lock-free atomic operations, use relaxed ordering for cold paths, skip debug/stats sites entirely.

**Expected Performance Impact**: <3% regression for atomic operations in hot paths.

---

## 1. Accessor Function Design

### Core API (in `core/box/slab_freelist_atomic.h`)

```c
#ifndef SLAB_FREELIST_ATOMIC_H
#define SLAB_FREELIST_ATOMIC_H

#include <stdatomic.h>
#include <stdbool.h>

#include "../superslab/superslab_types.h"

// ============================================================================
// HOT PATH: Lock-Free Operations (use CAS for push/pop)
// ============================================================================

// Atomic POP (lock-free, for refill hot path)
// Returns NULL if freelist empty
static inline void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
    if (!head) return NULL;

    void* next = tiny_next_read(class_idx, head);

    while (!atomic_compare_exchange_weak_explicit(
            &meta->freelist,
            &head,                 // Expected value (updated on failure)
            next,                  // Desired value
            memory_order_release,  // Success ordering
            memory_order_acquire   // Failure ordering (reload head)
    )) {
        // CAS failed (another thread modified freelist)
        if (!head) return NULL;                   // List became empty
        next = tiny_next_read(class_idx, head);   // Reload next pointer
    }

    return head;
}

// Atomic PUSH (lock-free, for free hot path)
static inline void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_relaxed);
    do {
        tiny_next_write(class_idx, node, head);  // Link node->next = head
    } while (!atomic_compare_exchange_weak_explicit(
            &meta->freelist,
            &head,                 // Expected value (updated on failure)
            node,                  // Desired value
            memory_order_release,  // Success ordering
            memory_order_relaxed   // Failure ordering
    ));
}

// ============================================================================
// WARM PATH: Relaxed Load/Store (single-threaded or low contention)
// ============================================================================

// Simple load (relaxed ordering for checks/prefetch)
static inline void* slab_freelist_load_relaxed(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed);
}

// Simple store (relaxed ordering for init/cleanup)
static inline void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value) {
    atomic_store_explicit(&meta->freelist, value, memory_order_relaxed);
}

// NULL check (relaxed ordering)
static inline bool slab_freelist_is_empty(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed) == NULL;
}

static inline bool slab_freelist_is_nonempty(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed) != NULL;
}

// ============================================================================
// COLD PATH: Direct Access (for debug/stats - already atomic type)
// ============================================================================

// For printf/debugging: cast to void* for printing
#define SLAB_FREELIST_DEBUG_PTR(meta) \
    ((void*)atomic_load_explicit(&(meta)->freelist, memory_order_relaxed))

#endif // SLAB_FREELIST_ATOMIC_H
```

---

## 2. Critical Site List (Top 20 - MUST Convert)

### Tier 1: Ultra-Hot Paths (5-10 ops/allocation)

1. **`core/tiny_superslab_alloc.inc.h:118-145`** - Fast alloc freelist pop
2. **`core/hakmem_tiny_refill_p0.inc.h:252-253`** - P0 batch refill check
3. **`core/box/carve_push_box.c:33-34, 120-121, 128-129`** - Carve rollback push
4. **`core/hakmem_tiny_tls_ops.h:77-85`** - TLS freelist drain

### Tier 2: Hot Paths (1-2 ops/allocation)

5. **`core/tiny_refill_opt.h:199-230`** - Refill chain pop
6. **`core/tiny_free_magazine.inc.h:135-136`** - Magazine free push
7. **`core/box/carve_push_box.c:172-180`** - Freelist carve with push

### Tier 3: Warm Paths (0.1-1 ops/allocation)

8. **`core/refill/ss_refill_fc.h:151-153`** - FC refill pop
9. **`core/hakmem_tiny_tls_ops.h:203`** - TLS freelist init
10. **`core/slab_handle.h:211, 259, 308`** - Slab handle ops

**Total Critical Sites**: ~40-50 (out of 90 total)

---

## 3. Non-Critical Site Strategy

### Skip Entirely (10-15 sites)

- **Debug/Stats**: `core/box/ss_stats_box.c:79`, `core/tiny_debug.h:48`
  - **Reason**: Already an atomic type; a plain load for printing is fine
  - **Action**: Change `meta->freelist` → `SLAB_FREELIST_DEBUG_PTR(meta)`
- **Initialization** (already protected by single-threaded setup):
  - `core/box/ss_allocation_box.c:66` - Initial freelist setup
  - `core/hakmem_tiny_superslab.c` - SuperSlab init

### Use Relaxed Load/Store (20-30 sites)

- **Condition checks**: `if (meta->freelist)` → `if (slab_freelist_is_nonempty(meta))`
- **Prefetch**: `__builtin_prefetch(&meta->freelist, 0, 3)` → keep as-is (atomic type is fine)
- **Init/cleanup**: `meta->freelist = NULL` → `slab_freelist_store_relaxed(meta, NULL)`

### Convert to Lock-Free (10-20 sites)

- **All POP operations** in hot paths
- **All PUSH operations** in free paths
- **Carve rollback** operations

---

## 4. Phased Implementation Plan

### Phase 1: Hot Paths Only (2-3 hours) 🔥

**Goal**: Fix the Larson 8T crash with minimal changes

**Files to modify** (5 files, ~25 sites):

1. `core/tiny_superslab_alloc.inc.h` (fast alloc pop)
2. `core/hakmem_tiny_refill_p0.inc.h` (P0 batch refill)
3. `core/box/carve_push_box.c` (carve/rollback push)
4. `core/hakmem_tiny_tls_ops.h` (TLS drain)
5. Create `core/box/slab_freelist_atomic.h` (accessor API)

**Testing**:

```bash
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 10000000 256 42   # Single-threaded baseline
./build.sh larson_hakmem
./out/release/larson_hakmem 8 100000 256                  # 8 threads (expect no crash)
```

**Expected Result**: Larson 8T stable, <5% regression on single-threaded

---

### Phase 2: All TLS Paths (2-3 hours) ⚡

**Goal**: Full MT safety for all allocation paths

**Files to modify** (10 files, ~40 sites):

- All files from Phase 1 (complete conversion)
- `core/tiny_refill_opt.h` (refill chain ops)
- `core/tiny_free_magazine.inc.h` (magazine push)
- `core/refill/ss_refill_fc.h` (FC refill)
- `core/slab_handle.h` (slab handle ops)

**Testing**:

```bash
./build.sh bench_random_mixed_hakmem
./out/release/bench_random_mixed_hakmem 10000000 256 42   # Baseline check
./build.sh stress_test_mt_hakmem
./out/release/stress_test_mt_hakmem 16 100000             # 16-thread stress test
```

**Expected Result**: All MT tests pass, <3% regression

---

### Phase 3: Cleanup (1-2 hours) 🧹

**Goal**: Convert/document remaining sites

**Files to modify** (5 files, ~25 sites):

- Debug/stats sites: Add `SLAB_FREELIST_DEBUG_PTR()` macro
- Init/cleanup sites: Use `slab_freelist_store_relaxed()`
- Add comments explaining MT safety assumptions

**Testing**:

```bash
make clean && make all   # Full rebuild
./run_all_tests.sh       # Comprehensive test suite
```

**Expected Result**: Clean build, all tests pass

---
## 5. Automated Conversion Script

### Semi-Automated Sed Script

```bash
#!/bin/bash
# atomic_freelist_convert.sh - Phase 1 conversion helper

set -e

# Backup
git stash
git checkout -b atomic-freelist-phase1

# Step 1: Convert NULL checks (read-only, safe)
find core \( -name "*.c" -o -name "*.h" \) | xargs sed -i \
    's/if (\([^)]*\)meta->freelist)/if (slab_freelist_is_nonempty(\1meta))/g'

# Step 2: Convert condition checks in while loops
find core \( -name "*.c" -o -name "*.h" \) | xargs sed -i \
    's/while (\([^)]*\)meta->freelist)/while (slab_freelist_is_nonempty(\1meta))/g'

# Step 3: Show remaining manual conversions needed
echo "=== REMAINING MANUAL CONVERSIONS ==="
grep -rn "meta->freelist" core/ --include="*.c" --include="*.h" | \
    grep -v "slab_freelist_" | wc -l

echo "Review changes:"
git diff --stat
echo ""
echo "If good: git commit -am 'Phase 1: Convert freelist NULL checks'"
echo "If bad:  git checkout . && git checkout master"
```

**Limitations**:

- Cannot auto-convert POP operations (need CAS loop)
- Cannot auto-convert PUSH operations (need `tiny_next_write` + CAS)
- Manual review required for all changes

---
## 6. Performance Projection

### Single-Threaded Impact

| Operation | Before | After (Relaxed) | After (CAS) | Overhead |
|-----------|--------|-----------------|-------------|----------|
| Load | 1 cycle | 1 cycle | 1 cycle | 0% |
| Store | 1 cycle | 1 cycle | - | 0% |
| POP (freelist) | 3-5 cycles | - | 8-12 cycles | +60-140% |
| PUSH (freelist) | 3-5 cycles | - | 8-12 cycles | +60-140% |

**Expected Regression**:

- Best case: 0-1% (mostly relaxed loads)
- Worst case: 3-5% (CAS overhead in hot paths)
- Realistic: 2-3% (good branch prediction, low contention)

**Mitigation**: Lock-free CAS is still far cheaper than a mutex (20-30 cycles).

### Multi-Threaded Impact

| Metric | Before (Non-Atomic) | After (Atomic) | Change |
|--------|---------------------|----------------|--------|
| Larson 8T | CRASH | Stable | ✅ FIXED |
| Throughput (1T) | 25.1M ops/s | 24.4-24.8M ops/s | -1.2-2.8% |
| Throughput (8T) | CRASH | ~18-20M ops/s | ✅ NEW |
| Scalability | 0% (crashes) | 70-80% | ✅ GAIN |

**Expected Benefit**: Stability + MT scalability >> 2-3% single-threaded cost

---

## 7. Implementation Example (Phase 1)

### Before: `core/tiny_superslab_alloc.inc.h:117-145`

```c
if (__builtin_expect(meta->freelist != NULL, 0)) {
    void* block = meta->freelist;
    if (meta->class_idx != class_idx) {
        meta->freelist = NULL;
        goto bump_path;
    }
    // ... pop logic ...
    meta->freelist = tiny_next_read(meta->class_idx, block);
    return (void*)((uint8_t*)block + 1);
}
```

### After: `core/tiny_superslab_alloc.inc.h:117-145`

```c
if (__builtin_expect(slab_freelist_is_nonempty(meta), 0)) {
    void* block = slab_freelist_pop_lockfree(meta, class_idx);
    if (!block) {
        // Another thread won the race, fall through to bump path
        goto bump_path;
    }
    if (meta->class_idx != class_idx) {
        // Wrong class, return block to freelist and go to bump path
        slab_freelist_push_lockfree(meta, class_idx, block);
        goto bump_path;
    }
    return (void*)((uint8_t*)block + 1);
}
```

**Changes**:

- NULL check → `slab_freelist_is_nonempty()`
- Manual pop → `slab_freelist_pop_lockfree()`
- Handle the CAS race (`block == NULL` case)
- Simpler logic (CAS handles the next pointer atomically)

---

## 8. Risk Assessment

### Low Risk ✅

- **Phase 1**: Only 5 files, ~25 sites, well-tested patterns
- **Rollback**: Easy (`git checkout master`)
- **Testing**: Can A/B test with an environment variable

### Medium Risk ⚠️

- **Performance**: 2-3% regression possible
- **Subtle bugs**: CAS retry loops need careful review
- **ABA problem**: Mitigated by pointer tagging (already in the codebase)

### High Risk ❌

- **None**: The atomic type is already declared; no ABI changes

---

## 9. Alternative Approaches (Considered)

### Option A: Mutex per Slab (rejected)

**Pros**: Simple, guaranteed correctness
**Cons**: 40-byte overhead per slab, 10-20x performance hit

### Option B: Global Lock (rejected)

**Pros**: Zero code changes, 1-line fix
**Cons**: Serializes all allocation, kills MT performance

### Option C: TLS-Only (rejected)

**Pros**: No atomics needed
**Cons**: Cannot handle remote free (required for MT)

### Option D: Hybrid (SELECTED) ✅

**Pros**: Best performance, incremental implementation
**Cons**: More complex, requires careful memory ordering

---
## 10. Memory Ordering Rationale

### Relaxed (`memory_order_relaxed`)

**Use case**: Single-threaded or benign races (e.g., stats)
**Cost**: 0 cycles (no fence)
**Example**: `if (meta->freelist)` - checking emptiness

### Acquire (`memory_order_acquire`)

**Use case**: Loading a pointer before dereferencing it
**Cost**: 1-2 cycles (read fence on some architectures)
**Example**: POP loads the freelist head before reading its `next` pointer

### Release (`memory_order_release`)

**Use case**: Publishing a pointer after setup
**Cost**: 1-2 cycles (write fence on some architectures)
**Example**: PUSH publishes the node to the freelist after writing its `next` pointer

### AcqRel (`memory_order_acq_rel`)

**Use case**: CAS success path (acquire+release)
**Cost**: 2-4 cycles (full fence on some architectures)
**Example**: Not used (separate acquire/release orderings in CAS)

### SeqCst (`memory_order_seq_cst`)

**Use case**: Total ordering required
**Cost**: 5-10 cycles (expensive fence)
**Example**: Not needed for the freelist (per-slab ordering is sufficient)

**Chosen**: Acquire/release for CAS, relaxed for checks (optimal trade-off)

---
## 11. Testing Strategy

### Phase 1 Tests

```bash
# Baseline (before conversion)
./out/release/bench_random_mixed_hakmem 10000000 256 42
# Record: 25.1M ops/s

# After conversion (expect: 24.4-24.8M ops/s)
./out/release/bench_random_mixed_hakmem 10000000 256 42

# MT stability (expect: no crash)
./out/release/larson_hakmem 8 100000 256

# Correctness (expect: 0 errors)
./out/release/bench_fixed_size_hakmem 100000 256 128
./out/release/bench_fixed_size_hakmem 100000 1024 128
```

### Phase 2 Tests

```bash
# Stress test all sizes
for size in 128 256 512 1024; do
    ./out/release/bench_random_mixed_hakmem 1000000 $size 42
done

# MT scaling test
for threads in 1 2 4 8 16; do
    ./out/release/larson_hakmem $threads 100000 256
done
```

### Phase 3 Tests

```bash
# Full test suite
./run_all_tests.sh

# ASan build (detect memory errors)
./build.sh asan bench_random_mixed_hakmem
./out/asan/bench_random_mixed_hakmem 100000 256 42

# TSan build (detect data races)
./build.sh tsan larson_hakmem
./out/tsan/larson_hakmem 8 10000 256
```

---

## 12. Success Criteria

### Phase 1 (Hot Paths)

- ✅ Larson 8T runs without crash (100K iterations)
- ✅ Single-threaded regression <5% (24.0M+ ops/s)
- ✅ No ASan/TSan warnings
- ✅ Clean build with no warnings

### Phase 2 (All Paths)

- ✅ All MT tests pass (1T, 2T, 4T, 8T, 16T)
- ✅ Single-threaded regression <3% (24.4M+ ops/s)
- ✅ MT scaling 70%+ (8T = 5.6x+ speedup)
- ✅ No memory leaks (Valgrind clean)

### Phase 3 (Complete)

- ✅ All 90 sites converted or documented
- ✅ Full test suite passes (100% pass rate)
- ✅ Code review approved
- ✅ Documentation updated

---

## 13. Rollback Plan

If Phase 1 fails (>5% regression or instability):

```bash
# Revert to master
git checkout master
git branch -D atomic-freelist-phase1

# Then try the alternative: per-slab spinlock (medium overhead)
# - Add a uint8_t lock field to TinySlabMeta
# - Use __sync_lock_test_and_set() for a 1-byte spinlock
# - Expected: 5-10% overhead, but guaranteed correctness
```

---

## 14. Next Steps
1. **Create accessor header** (`core/box/slab_freelist_atomic.h`) - 30 min
2. **Phase 1 conversion** (5 files, ~25 sites) - 2-3 hours
3. **Test Phase 1** (single + MT tests) - 1 hour
4. **If pass**: Continue to Phase 2
5. **If fail**: Review, fix, or rollback

**Estimated Total Time**: 4-6 hours for full implementation (all 3 phases)

---

## 15. Code Review Checklist

Before merging:

- [ ] All CAS loops handle retry correctly
- [ ] Memory ordering documented for each site
- [ ] No direct `meta->freelist` access remains (except debug)
- [ ] All tests pass (single + MT)
- [ ] ASan/TSan clean
- [ ] Performance regression <3%
- [ ] Documentation updated (CLAUDE.md)

---

## Summary

**Approach**: Hybrid - lock-free CAS for hot paths, relaxed atomics for cold paths
**Effort**: 4-6 hours (3 phases)
**Risk**: Low (incremental, easy rollback)
**Performance**: -2-3% single-threaded, +MT stability and scalability
**Benefit**: Unlocks MT performance without sacrificing single-threaded speed

**Recommendation**: Proceed with Phase 1 (2-3 hours) and evaluate results before committing to the full implementation.