hakmem/docs/archive/ATOMIC_FREELIST_SUMMARY.md

# Atomic Freelist Implementation - Executive Summary

## Investigation Results

### Good News

**Actual site count**: **90 sites** (not 589!)
- The original estimate of 589 counted every `.freelist` member access
- Actual `meta->freelist` accesses: 90 sites in 21 files
- Manageable in 5-8 hours of conversion work with a phased approach (9-11 hours including testing)

### Analysis Breakdown

| Category | Count | Effort |
|----------|-------|--------|
| **Phase 1 (Critical Hot Paths)** | 25 sites in 5 files | 2-3 hours |
| **Phase 2 (Important Paths)** | 40 sites in 10 files | 2-3 hours |
| **Phase 3 (Debug/Cleanup)** | 25 sites in 6 files | 1-2 hours |
| **Total** | **90 sites in 21 files** | **5-8 hours** |

### Operation Breakdown

The categories below overlap, so the counts sum to more than 90 sites:

- **NULL checks** (if/while conditions): 16 sites
- **Direct assignments** (store): 32 sites
- **POP operations** (load + next): 8 sites
- **PUSH operations** (write + assign): 14 sites
- **Read operations** (checks/loads): 29 sites
- **Write operations** (assignments): 32 sites

---

## Implementation Strategy

### Recommended Approach: Hybrid

**Hot Paths** (10-20 sites):
- Lock-free CAS operations
- `slab_freelist_pop_lockfree()` / `slab_freelist_push_lockfree()`
- Memory ordering: acquire/release
- Cost: 6-10 cycles per operation

**Cold Paths** (40-50 sites):
- Relaxed atomic loads/stores
- `slab_freelist_load_relaxed()` / `slab_freelist_store_relaxed()`
- Memory ordering: relaxed
- Cost: no added cycles (on x86-64 a relaxed atomic load/store of an aligned pointer compiles to a plain load/store)

**Debug/Stats** (10-15 sites):
- Skip conversion entirely
- Use `SLAB_FREELIST_DEBUG_PTR(meta)` macro
- The field is already an atomic type; the macro only casts it for `printf`

---

## Key Design Decisions

### 1. Accessor Function API

Created centralized atomic operations in `core/box/slab_freelist_atomic.h`:

```c
// Lock-free operations (hot paths)
void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx);
void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node);

// Relaxed operations (cold paths)
void* slab_freelist_load_relaxed(TinySlabMeta* meta);
void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value);

// NULL checks
bool slab_freelist_is_empty(TinySlabMeta* meta);
bool slab_freelist_is_nonempty(TinySlabMeta* meta);

// Debug
#define SLAB_FREELIST_DEBUG_PTR(meta) ...
```
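
As a rough illustration of the relaxed accessors and NULL checks declared above, the following is a minimal sketch using C11 `<stdatomic.h>`. `DemoSlabMeta` and every `demo_` name are hypothetical stand-ins; the real `TinySlabMeta` layout and the template header may differ.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for TinySlabMeta: only the freelist head is modeled. */
typedef struct DemoSlabMeta {
    _Atomic(void*) freelist;
} DemoSlabMeta;

/* Relaxed load: atomic (no torn reads) but imposes no ordering on other memory. */
static inline void* demo_freelist_load_relaxed(DemoSlabMeta* meta) {
    assert(meta != NULL);
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed);
}

/* Relaxed store: suitable for init/cleanup where no other thread can observe the slab. */
static inline void demo_freelist_store_relaxed(DemoSlabMeta* meta, void* value) {
    assert(meta != NULL);
    atomic_store_explicit(&meta->freelist, value, memory_order_relaxed);
}

/* NULL checks are thin wrappers around the relaxed load. */
static inline bool demo_freelist_is_empty(DemoSlabMeta* meta) {
    return demo_freelist_load_relaxed(meta) == NULL;
}

static inline bool demo_freelist_is_nonempty(DemoSlabMeta* meta) {
    return !demo_freelist_is_empty(meta);
}
```

Routing every NULL check through one relaxed load keeps the memory-ordering choice in a single place, so it can be tightened later without touching call sites.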

### 2. Memory Ordering Rationale

**Relaxed** (most sites):
- No synchronization needed
- No added cycles over a plain load/store on x86-64
- Safe for: NULL checks, init, debug

**Acquire** (POP operations):
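- Must observe the next pointer written by the pushing thread before unlinking the head
- 1-2 cycles overhead
- Pairs with the release in PUSH, so a popped block's next pointer is never read stale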

**Release** (PUSH operations):
- Must publish next pointer before freelist update
- 1-2 cycles overhead
- Ensures visibility to other threads

**NOT using seq_cst**:
- Total ordering not needed
- 5-10 cycles overhead (too expensive)
- Per-slab ordering sufficient
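
The acquire/release pairing described above can be sketched as a pair of CAS loops. This is an illustration under stated assumptions, not the contents of the template header: `DemoSlabMeta`, `demo_next_read()` and `demo_next_write()` are hypothetical, and the next link is assumed to live in the first word of a free block. The sketch does not address the ABA problem, which a production lock-free pop with concurrent poppers would have to consider.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for TinySlabMeta. */
typedef struct DemoSlabMeta {
    _Atomic(void*) freelist;
} DemoSlabMeta;

/* Hypothetical stand-ins for tiny_next_read() and its write counterpart. */
static inline void* demo_next_read(void* block) { return *(void**)block; }
static inline void demo_next_write(void* block, void* next) { *(void**)block = next; }

/* PUSH: write the next link first, then publish the node with release ordering. */
static inline void demo_freelist_push(DemoSlabMeta* meta, void* node) {
    assert(node != NULL);
    void* head = atomic_load_explicit(&meta->freelist, memory_order_relaxed);
    do {
        demo_next_write(node, head);  /* must be visible before the CAS publishes node */
    } while (!atomic_compare_exchange_weak_explicit(
        &meta->freelist, &head, node,
        memory_order_release, memory_order_relaxed));
}

/* POP: acquire ordering pairs with the release in push, so the next link is visible. */
static inline void* demo_freelist_pop(DemoSlabMeta* meta) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
    while (head != NULL) {
        void* next = demo_next_read(head);
        if (atomic_compare_exchange_weak_explicit(
                &meta->freelist, &head, next,
                memory_order_acquire, memory_order_acquire)) {
            return head;
        }
        /* A failed CAS reloaded head with the current value; retry with it. */
    }
    return NULL;  /* list was empty, or was emptied by another thread */
}
```

The weak CAS may fail spuriously, which is why both operations retry in a loop rather than treating a single failure as contention.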

### 3. Critical Pattern Conversions

**Before** (direct access):
```c
if (meta->freelist != NULL) {
    void* block = meta->freelist;
    meta->freelist = tiny_next_read(class_idx, block);
    use(block);
}
```

**After** (lock-free atomic):
```c
if (slab_freelist_is_nonempty(meta)) {
    void* block = slab_freelist_pop_lockfree(meta, class_idx);
    if (!block) goto fallback;  // Handle race
    use(block);
}
```

**Key differences**:
1. NULL check uses relaxed atomic load
2. POP operation uses CAS loop internally
3. Must handle the race where another thread empties the list between the check and the POP (block == NULL)
4. `tiny_next_read()` is called inside the accessor, so call sites must not call it again
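
To make the third point concrete, here is a small sketch of a converted call site that treats the NULL check as a hint and falls back when the pop loses the race. All `demo_` names, including the slow path, are hypothetical; the real fallback would be whatever the call site already does when the freelist is empty.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for TinySlabMeta. */
typedef struct DemoSlabMeta {
    _Atomic(void*) freelist;
} DemoSlabMeta;

/* Hypothetical slow path; a real allocator would refill or carve here. */
static unsigned char demo_fallback_block[64];
static inline void* demo_slow_path_alloc(void) { return demo_fallback_block; }

/* Minimal CAS pop; assumes the next link is stored in the block's first word. */
static inline void* demo_pop(DemoSlabMeta* meta) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
    while (head != NULL) {
        void* next = *(void**)head;
        if (atomic_compare_exchange_weak_explicit(
                &meta->freelist, &head, next,
                memory_order_acquire, memory_order_acquire)) {
            return head;
        }
    }
    return NULL;
}

static inline void* demo_alloc(DemoSlabMeta* meta) {
    assert(meta != NULL);
    /* The relaxed check is only a hint: the list may be emptied before the pop. */
    if (atomic_load_explicit(&meta->freelist, memory_order_relaxed) != NULL) {
        void* block = demo_pop(meta);
        if (block != NULL) {
            return block;
        }
        /* Lost the race: fall through to the slow path instead of using NULL. */
    }
    return demo_slow_path_alloc();
}
```

The structured fall-through plays the same role as the `goto fallback` in the snippet above; which form fits better depends on the surrounding function.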

---

## Performance Analysis

### Single-Threaded Impact

| Operation | Before (cycles) | After Relaxed | After CAS | Overhead |
|-----------|-----------------|---------------|-----------|----------|
| NULL check | 1 | 1 | - | 0% |
| Load/Store | 1 | 1 | - | 0% |
| POP/PUSH | 3-5 | - | 8-12 | +60-300% |

**Overall Expected**:
- Relaxed sites (~70%): 0% overhead
- CAS sites (~30%): +60-300% per operation
- **Net regression**: 2-3% (due to good branch prediction)

**Baseline**: 25.1M ops/s (Random Mixed 256B)
**Expected**: 24.4-24.8M ops/s (Random Mixed 256B)
**Acceptable**: >24.0M ops/s (<5% regression)
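
The thresholds above follow from simple ratios against the baseline. A minimal sketch of that arithmetic, with hypothetical helper names, assuming throughput is expressed in the same unit (here M ops/s) on both sides:

```c
#include <assert.h>

/* Percent slowdown of `measured` relative to `baseline` (same throughput unit). */
static inline double demo_regression_pct(double baseline, double measured) {
    assert(baseline > 0.0);
    return (baseline - measured) / baseline * 100.0;
}

/* Lowest throughput that still stays within `max_regression_pct` of `baseline`. */
static inline double demo_min_throughput(double baseline, double max_regression_pct) {
    assert(max_regression_pct >= 0.0 && max_regression_pct <= 100.0);
    return baseline * (1.0 - max_regression_pct / 100.0);
}
```

Against the 25.1M ops/s baseline, 24.4M ops/s works out to roughly a 2.8% regression and 24.8M ops/s to roughly 1.2%. A strict 5% budget corresponds to about 23.85M ops/s, so the 24.0M ops/s threshold (about 4.4%) is slightly conservative.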

### Multi-Threaded Impact

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| Larson 8T | **CRASH** | ~18-20M ops/s | **FIXED** |
| MT Scaling (8T) | 0% | 70-80% | **NEW** |
| Throughput (1T) | 25.1M ops/s | 24.4-24.8M ops/s | -1.2-2.8% |

**Benefit**: Stability + MT scalability >> 2-3% single-threaded cost

---

## Risk Assessment

### Low Risk ✅

- **Incremental implementation**: 3 phases, test after each
- **Easy rollback**: `git checkout master`
- **Well-tested patterns**: Existing atomic operations in codebase (563 sites)
- **No ABI changes**: Atomic type already declared

### Medium Risk ⚠️

- **Performance regression**: 2-3% expected (acceptable)
- **Subtle bugs**: CAS retry loops need careful review
- **Complexity**: 90 sites to convert (but well-documented)

### High Risk ❌

- **None identified**

### Mitigation Strategies

1. **Phase 1 focus**: Fix Larson crash first (25 sites, 2-3 hours)
2. **Test early**: Compile and test after each file
3. **A/B testing**: Keep old code in branches for comparison
4. **Rollback plan**: Alternative spinlock approach if needed

---

## Implementation Plan

### Phase 1: Critical Hot Paths (2-3 hours) 🔥

**Goal**: Fix Larson 8T crash with minimal changes

**Files** (5 files, 25 sites):
1. `core/box/slab_freelist_atomic.h` (CREATE new accessor API)
2. `core/tiny_superslab_alloc.inc.h` (fast alloc pop)
3. `core/hakmem_tiny_refill_p0.inc.h` (P0 batch refill)
4. `core/box/carve_push_box.c` (carve/rollback push)
5. `core/hakmem_tiny_tls_ops.h` (TLS drain)

**Testing**:
```bash
./out/release/larson_hakmem 8 100000 256  # Expect: no crash
./out/release/bench_random_mixed_hakmem 10000000 256 42  # Expect: >24.0M ops/s
```

**Success Criteria**:
- ✅ Larson 8T stable (no crashes)
- ✅ Regression <5% (>24.0M ops/s)
- ✅ No ASan/TSan warnings

---

### Phase 2: Important Paths (2-3 hours) ⚡

**Goal**: Full MT safety for all allocation paths

**Files** (10 files, 40 sites):
- `core/tiny_refill_opt.h`
- `core/tiny_free_magazine.inc.h`
- `core/refill/ss_refill_fc.h`
- `core/slab_handle.h`
- 6 additional files

**Testing**:
```bash
for t in 1 2 4 8 16; do ./out/release/larson_hakmem $t 100000 256; done
```

**Success Criteria**:
- ✅ All MT tests pass
- ✅ Regression <3% (>24.4M ops/s)
- ✅ MT scaling 70%+

---

### Phase 3: Cleanup (1-2 hours) 🧹

**Goal**: Convert/document remaining sites

**Files** (6 files, 25 sites):
- Debug/stats sites: Add `SLAB_FREELIST_DEBUG_PTR()`
- Init/cleanup sites: Use `slab_freelist_store_relaxed()`
- Add comments for MT safety assumptions

**Testing**:
```bash
make clean && make all
./run_all_tests.sh
```

**Success Criteria**:
- ✅ All 90 sites converted or documented
- ✅ Zero direct accesses (except in `slab_freelist_atomic.h`)
- ✅ Full test suite passes

---

## Tools and Scripts

Created comprehensive implementation support:

### 1. Strategy Document
**File**: `ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md`
- Accessor function design
- Memory ordering rationale
- Performance projections
- Risk assessment
- Alternative approaches

### 2. Site-by-Site Guide
**File**: `ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md`
- Detailed conversion instructions (line-by-line)
- Common pitfalls and solutions
- Testing checklist per file
- Quick reference card

### 3. Quick Start Guide
**File**: `ATOMIC_FREELIST_QUICK_START.md`
- Step-by-step implementation
- Time budget breakdown
- Success metrics
- Rollback procedures

### 4. Accessor Header Template
**File**: `core/box/slab_freelist_atomic.h.TEMPLATE`
- Complete implementation (80 lines)
- Extensive comments and examples
- Performance notes
- Testing strategy

### 5. Analysis Script
**File**: `scripts/analyze_freelist_sites.sh`
- Counts sites by category
- Shows hot/warm/cold paths
- Estimates conversion effort
- Checks for lock-protected sites

### 6. Verification Script
**File**: `scripts/verify_atomic_freelist_conversion.sh`
- Tracks conversion progress
- Detects potential bugs (double-POP/PUSH)
- Checks compile status
- Provides recommendations

---

## Usage Instructions

### Quick Start

```bash
# 1. Review documentation (15 min)
cat ATOMIC_FREELIST_QUICK_START.md

# 2. Run analysis (5 min)
./scripts/analyze_freelist_sites.sh

# 3. Create accessor header (30 min)
cp core/box/slab_freelist_atomic.h.TEMPLATE core/box/slab_freelist_atomic.h
make bench_random_mixed_hakmem  # Test compile

# 4. Start Phase 1 (2-3 hours)
git checkout -b atomic-freelist-phase1
# Follow ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md

# 5. Verify progress
./scripts/verify_atomic_freelist_conversion.sh

# 6. Test Phase 1
./out/release/larson_hakmem 8 100000 256
```

### Incremental Progress Tracking

```bash
# Check conversion progress
./scripts/verify_atomic_freelist_conversion.sh

# Output example:
# Progress: 30% (27/90 sites)
# [============----------------------------]
# Currently working on: Phase 1 (Critical Hot Paths)
```

---

## Expected Timeline

| Day | Activity | Hours | Cumulative |
|-----|----------|-------|------------|
| **Day 1** | Setup + Phase 1 | 3h | 3h |
| | Test Phase 1 | 1h | 4h |
| **Day 2** | Phase 2 conversion | 2-3h | 6-7h |
| | Test Phase 2 | 1h | 7-8h |
| **Day 3** | Phase 3 cleanup | 1-2h | 8-10h |
| | Final testing | 1h | 9-11h |

**Realistic Total**: 9-11 hours (including testing and documentation)
**Minimal Viable**: 3-4 hours (Phase 1 only, fixes Larson crash)

---

## Success Metrics

### Phase 1 Success
- ✅ Larson 8T runs for 100K iterations without crash
- ✅ Single-threaded regression <5% (>24.0M ops/s)
- ✅ No data races detected (TSan clean)

### Phase 2 Success
- ✅ All MT tests pass (1T, 2T, 4T, 8T, 16T)
- ✅ Single-threaded regression <3% (>24.4M ops/s)
- ✅ MT scaling 70%+ (8T = 5.6x+ speedup)

### Phase 3 Success
- ✅ All 90 sites converted or documented
- ✅ Zero direct `meta->freelist` accesses (except in `slab_freelist_atomic.h`)
- ✅ Full test suite passes
- ✅ Documentation updated

---

## Rollback Plan

If Phase 1 fails (>5% regression or instability):

### Option A: Revert and Debug
```bash
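git stash                             # Set aside uncommitted changes
git checkout master                   # Return to the known-good baseline
git branch -D atomic-freelist-phase1  # Deletes the branch and its commits; skip to keep them for debugging
# Review logs, fix issues, retry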
```

### Option B: Alternative Approach (Spinlock)
If lock-free proves too complex:

```c
// Add to TinySlabMeta
typedef struct TinySlabMeta {
    uint8_t freelist_lock;  // 1-byte spinlock
    void* freelist;         // Back to non-atomic
    // ... rest unchanged
} TinySlabMeta;

// Lock with __sync_lock_test_and_set(), unlock with __sync_lock_release()
// Expected overhead: 5-10% (vs 2-3% for lock-free)
```

**Trade-off**: Simpler implementation that is easier to verify, at the cost of higher overhead (5-10% vs 2-3%)
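
For comparison with the lock-free path, the spinlock fallback could look roughly like the sketch below. It relies on the GCC/Clang `__sync` builtins named above; `DemoLockedSlabMeta` and the `demo_` functions are hypothetical, and the next link is again assumed to sit in the first word of a free block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the spinlock variant of TinySlabMeta. */
typedef struct DemoLockedSlabMeta {
    uint8_t freelist_lock;  /* 0 = unlocked, 1 = locked */
    void*   freelist;       /* plain pointer, guarded by freelist_lock */
} DemoLockedSlabMeta;

static inline void demo_lock(DemoLockedSlabMeta* meta) {
    /* Atomically writes 1 and returns the previous value (acquire barrier);
     * keep spinning while another thread already holds the lock. */
    while (__sync_lock_test_and_set(&meta->freelist_lock, 1)) { }
}

static inline void demo_unlock(DemoLockedSlabMeta* meta) {
    __sync_lock_release(&meta->freelist_lock);  /* writes 0 with release semantics */
}

static inline void demo_locked_push(DemoLockedSlabMeta* meta, void* node) {
    assert(node != NULL);
    demo_lock(meta);
    *(void**)node = meta->freelist;  /* link node in front of the current head */
    meta->freelist = node;
    demo_unlock(meta);
}

static inline void* demo_locked_pop(DemoLockedSlabMeta* meta) {
    demo_lock(meta);
    void* block = meta->freelist;
    if (block != NULL) {
        meta->freelist = *(void**)block;  /* unlink the head */
    }
    demo_unlock(meta);
    return block;
}
```

Because every access happens under the lock, the pop cannot observe a half-updated list, which is what makes this variant easier to review than the CAS loops.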

---

## Alternatives Considered

### Option A: Mutex per Slab (REJECTED)
**Pros**: Simple, guaranteed correctness
**Cons**: 40-byte overhead, 10-20x performance hit
**Reason**: Too expensive for per-slab locking

### Option B: Global Lock (REJECTED)
**Pros**: Near one-line fix, no call-site changes
**Cons**: Serializes all allocation, kills MT performance
**Reason**: Defeats purpose of MT allocator

### Option C: TLS-Only (REJECTED)
**Pros**: No atomics needed, simplest
**Cons**: Cannot handle remote free (required for MT)
**Reason**: Would break existing functionality

### Option D: Hybrid Lock-Free + Relaxed (SELECTED) ✅
**Pros**: Best performance, incremental implementation, minimal overhead
**Cons**: More complex, requires careful memory ordering
**Reason**: Optimal balance of performance, safety, and maintainability

---

## Conclusion

### Feasibility: HIGH ✅

- Only 90 sites (not 589)
- Well-understood patterns
- Existing atomic operations in codebase (563 sites as reference)
- Incremental phased approach
- Easy rollback

### Risk: LOW ✅

- Phase 1 focus (25 sites) minimizes risk
- Test after each file
- Alternative approaches available
- No ABI changes

### Benefit: HIGH ✅

- Fixes Larson 8T crash (critical bug)
- Enables MT performance (70-80% scaling)
- Future-proof architecture
- Only 2-3% single-threaded cost

### Recommendation: PROCEED ✅

**Start with Phase 1 (2-3 hours)** and evaluate:
- If stable + <5% regression: Continue to Phase 2
- If unstable or >5% regression: Rollback and review

**Expected outcome**: 9-11 hours for full MT safety with <3% single-threaded regression

---

## Files Created

1. `ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md` (comprehensive strategy)
2. `ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md` (detailed conversion guide)
3. `ATOMIC_FREELIST_QUICK_START.md` (quick start instructions)
4. `ATOMIC_FREELIST_SUMMARY.md` (this file)
5. `core/box/slab_freelist_atomic.h.TEMPLATE` (accessor API template)
6. `scripts/analyze_freelist_sites.sh` (site analysis tool)
7. `scripts/verify_atomic_freelist_conversion.sh` (progress tracker)

**Total**: 7 files, ~3000 lines of documentation and tooling

---

## Next Actions

1. **Review** `ATOMIC_FREELIST_QUICK_START.md` (15 min)
2. **Run** `./scripts/analyze_freelist_sites.sh` (5 min)
3. **Create** accessor header from template (30 min)
4. **Start** Phase 1 conversion (2-3 hours)
5. **Test** Larson 8T stability (30 min)
6. **Evaluate** results and proceed or rollback

**First milestone**: Larson 8T stable (3-4 hours total)
**Final goal**: Full MT safety in 9-11 hours