Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)

Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV
- core/tiny_refill.h: BG remote replaced with fixed values
- core/hakmem_tiny_slow.inc: removed BG refs

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

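Documentation cleanup:
- Deleted 328 markdown files (old reports, duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; stable ✅)
- No functional impact from the ENV cleanup
- Some debug output remains (to be addressed in the next phase)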

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:45:26 +09:00


Atomic Freelist Implementation - Executive Summary

Investigation Results

Good News

Actual site count: 90 sites (not 589!)

Original estimate was based on all .freelist member accesses
Actual meta->freelist accesses: 90 sites in 21 files
Fully manageable in 5-8 hours with phased approach

Analysis Breakdown

Category	Count	Effort
Phase 1 (Critical Hot Paths)	25 sites in 5 files	2-3 hours
Phase 2 (Important Paths)	40 sites in 10 files	2-3 hours
Phase 3 (Debug/Cleanup)	25 sites in 6 files	1-2 hours
Total	90 sites in 21 files	5-8 hours

Operation Breakdown

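NULL checks (if/while conditions): 16 sites
Direct assignments (store): 32 sites
POP operations (load + next): 8 sites
PUSH operations (write + assign): 14 sites
Read operations (checks/loads): 29 sites
Write operations (assignments): 32 sites

These counts sum to 131, more than the 90 unique sites, so the categories evidently overlap (a PUSH also counts as a write, for example).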

Implementation Strategy

Recommended Approach: Hybrid

Hot Paths (10-20 sites):

Lock-free CAS operations
slab_freelist_pop_lockfree() / slab_freelist_push_lockfree()
Memory ordering: acquire/release
Cost: 6-10 cycles per operation

Cold Paths (40-50 sites):

Relaxed atomic loads/stores
slab_freelist_load_relaxed() / slab_freelist_store_relaxed()
Memory ordering: relaxed
Cost: 0 cycles overhead

Debug/Stats (10-15 sites):

Skip conversion entirely
Use SLAB_FREELIST_DEBUG_PTR(meta) macro
Already atomic type, just cast for printf

Key Design Decisions

1. Accessor Function API

Created centralized atomic operations in core/box/slab_freelist_atomic.h:

// Lock-free operations (hot paths)
void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx);
void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node);

// Relaxed operations (cold paths)
void* slab_freelist_load_relaxed(TinySlabMeta* meta);
void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value);

// NULL checks
bool slab_freelist_is_empty(TinySlabMeta* meta);
bool slab_freelist_is_nonempty(TinySlabMeta* meta);

// Debug
#define SLAB_FREELIST_DEBUG_PTR(meta) ...
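
As an illustration of the cold-path half of this API, the relaxed accessors, NULL checks, and debug macro could be sketched with C11 atomics as follows. This is only a minimal sketch: the one-field `TinySlabMeta` is a hypothetical stand-in for the real struct, and the macro body is an assumption, since the summary elides it.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in: the real TinySlabMeta has more fields. */
typedef struct TinySlabMeta {
    _Atomic(void*) freelist;   /* head of the per-slab free list */
} TinySlabMeta;

/* Relaxed load: atomic, but imposes no ordering on other memory accesses. */
static inline void* slab_freelist_load_relaxed(TinySlabMeta* meta) {
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed);
}

/* Relaxed store: intended for init/cleanup paths where no other thread
 * depends on previously written data becoming visible. */
static inline void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value) {
    atomic_store_explicit(&meta->freelist, value, memory_order_relaxed);
}

/* NULL checks are a snapshot only; the list may change right afterwards. */
static inline bool slab_freelist_is_empty(TinySlabMeta* meta) {
    return slab_freelist_load_relaxed(meta) == NULL;
}

static inline bool slab_freelist_is_nonempty(TinySlabMeta* meta) {
    return slab_freelist_load_relaxed(meta) != NULL;
}

/* Assumed macro body: yields a plain void* suitable for printf("%p"). */
#define SLAB_FREELIST_DEBUG_PTR(meta) \
    ((void*)atomic_load_explicit(&(meta)->freelist, memory_order_relaxed))
```

Because the NULL checks are only snapshots, a caller that sees a non-empty list still has to handle a NULL result from the subsequent pop.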

2. Memory Ordering Rationale

Relaxed (most sites):

No synchronization needed
0 cycles overhead
Safe for: NULL checks, init, debug

Acquire (POP operations):

Must see next pointer before unlinking
1-2 cycles overhead
Prevents use-after-free

Release (PUSH operations):

Must publish next pointer before freelist update
1-2 cycles overhead
Ensures visibility to other threads

NOT using seq_cst:

Total ordering not needed
5-10 cycles overhead (too expensive)
Per-slab ordering sufficient
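
The acquire/release pairing described above can be sketched as a CAS loop over the list head. This is a hedged illustration, not the template's actual code: `TinySlabMeta` is reduced to one field, and `sketch_next_read()` / `sketch_next_write()` are hypothetical stand-ins for `tiny_next_read()` and its writer that simply keep the next pointer in the first word of each free block. The sketch also does not address the ABA problem, so it assumes a usage pattern (for example a single popping owner with remote pushers) where ABA cannot occur.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in: the real TinySlabMeta has more fields. */
typedef struct TinySlabMeta {
    _Atomic(void*) freelist;
} TinySlabMeta;

/* Hypothetical helpers: next pointer lives in the block's first word. */
static inline void* sketch_next_read(int class_idx, void* block) {
    (void)class_idx;
    return *(void**)block;
}
static inline void sketch_next_write(int class_idx, void* block, void* next) {
    (void)class_idx;
    *(void**)block = next;
}

/* POP: acquire ensures the block's next pointer, published by the pusher,
 * is visible before we read it and unlink the block. */
static inline void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
    while (head != NULL) {
        void* next = sketch_next_read(class_idx, head);
        /* On failure, head is refreshed with the current list head. */
        if (atomic_compare_exchange_weak_explicit(&meta->freelist, &head, next,
                                                  memory_order_acquire,
                                                  memory_order_acquire)) {
            return head;
        }
    }
    return NULL;  /* list was empty or drained by another thread */
}

/* PUSH: release publishes the node's next pointer before the new head
 * becomes visible to other threads. */
static inline void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx,
                                               void* node) {
    void* head = atomic_load_explicit(&meta->freelist, memory_order_relaxed);
    do {
        sketch_next_write(class_idx, node, head);
    } while (!atomic_compare_exchange_weak_explicit(&meta->freelist, &head, node,
                                                    memory_order_release,
                                                    memory_order_relaxed));
}
```

The weak CAS is used because both operations already retry in a loop, so a spurious failure costs only one extra iteration.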

3. Critical Pattern Conversions

Before (direct access):

if (meta->freelist != NULL) {
    void* block = meta->freelist;
    meta->freelist = tiny_next_read(class_idx, block);
    use(block);
}

After (lock-free atomic):

if (slab_freelist_is_nonempty(meta)) {
    void* block = slab_freelist_pop_lockfree(meta, class_idx);
    if (!block) goto fallback;  // Handle race
    use(block);
}

Key differences:

NULL check uses relaxed atomic load
POP operation uses CAS loop internally
Must handle race condition (block == NULL)
tiny_next_read() called inside accessor (no double-conversion)

Performance Analysis

Single-Threaded Impact

Operation	Before (cycles)	After Relaxed	After CAS	Overhead
NULL check	1	1	-	0%
Load/Store	1	1	-	0%
POP/PUSH	3-5	-	8-12	+60-140%

Overall Expected:

Relaxed sites (~70%): 0% overhead
CAS sites (~30%): +60-140% per operation
Net regression: 2-3% (due to good branch prediction)

Baseline: 25.1M ops/s (Random Mixed 256B)
Expected: 24.4-24.8M ops/s (Random Mixed 256B)
Acceptable: >24.0M ops/s (<5% regression)

Multi-Threaded Impact

Metric	Before	After	Change
Larson 8T	CRASH	~18-20M ops/s	FIXED
MT Scaling (8T)	0%	70-80%	NEW
Throughput (1T)	25.1M ops/s	24.4-24.8M ops/s	-1.2-2.8%

Benefit: Stability + MT scalability >> 2-3% single-threaded cost
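
The percentages above follow from simple arithmetic on the quoted throughput figures. The two helpers below are hypothetical, written only to make that arithmetic checkable; they are not part of the project's tooling.

```c
/* Percent slowdown of `after` relative to `baseline` (both in M ops/s). */
static double regression_pct(double baseline, double after) {
    return (baseline - after) / baseline * 100.0;
}

/* Scaling efficiency in percent: measured speedup divided by thread count. */
static double scaling_efficiency_pct(double speedup, int threads) {
    return speedup / (double)threads * 100.0;
}
```

With the 25.1M ops/s baseline, 24.4M ops/s works out to roughly a 2.8% regression and 24.8M ops/s to roughly 1.2%, while the 24.0M ops/s acceptance floor corresponds to about 4.4%, slightly stricter than the nominal 5%. A 5.6x speedup on 8 threads corresponds to 70% scaling efficiency.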

Risk Assessment

Low Risk ✅

Incremental implementation: 3 phases, test after each
Easy rollback: git checkout master
Well-tested patterns: Existing atomic operations in codebase (563 sites)
No ABI changes: Atomic type already declared

Medium Risk ⚠️

Performance regression: 2-3% expected (acceptable)
Subtle bugs: CAS retry loops need careful review
Complexity: 90 sites to convert (but well-documented)

High Risk ❌

None identified

Mitigation Strategies

Phase 1 focus: Fix Larson crash first (25 sites, 2-3 hours)
Test early: Compile and test after each file
A/B testing: Keep old code in branches for comparison
Rollback plan: Alternative spinlock approach if needed

Implementation Plan

Phase 1: Critical Hot Paths (2-3 hours) 🔥

Goal: Fix Larson 8T crash with minimal changes

Files (5 files, 25 sites):

core/box/slab_freelist_atomic.h (CREATE new accessor API)
core/tiny_superslab_alloc.inc.h (fast alloc pop)
core/hakmem_tiny_refill_p0.inc.h (P0 batch refill)
core/box/carve_push_box.c (carve/rollback push)
core/hakmem_tiny_tls_ops.h (TLS drain)

Testing:

./out/release/larson_hakmem 8 100000 256  # Expect: no crash
./out/release/bench_random_mixed_hakmem 10000000 256 42  # Expect: >24.0M ops/s

Success Criteria:

✅ Larson 8T stable (no crashes)
✅ Regression <5% (>24.0M ops/s)
✅ No ASan/TSan warnings

Phase 2: Important Paths (2-3 hours) ⚡

Goal: Full MT safety for all allocation paths

Files (10 files, 40 sites):

core/tiny_refill_opt.h
core/tiny_free_magazine.inc.h
core/refill/ss_refill_fc.h
core/slab_handle.h
6 additional files

Testing:

for t in 1 2 4 8 16; do ./out/release/larson_hakmem $t 100000 256; done

Success Criteria:

✅ All MT tests pass
✅ Regression <3% (>24.4M ops/s)
✅ MT scaling 70%+

Phase 3: Cleanup (1-2 hours) 🧹

Goal: Convert/document remaining sites

Files (6 files, 25 sites):

Debug/stats sites: Add SLAB_FREELIST_DEBUG_PTR()
Init/cleanup sites: Use slab_freelist_store_relaxed()
Add comments for MT safety assumptions

Testing:

make clean && make all
./run_all_tests.sh

Success Criteria:

✅ All 90 sites converted or documented
✅ Zero direct accesses (except in atomic.h)
✅ Full test suite passes

Tools and Scripts

Created comprehensive implementation support:

1. Strategy Document

File: ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md

Accessor function design
Memory ordering rationale
Performance projections
Risk assessment
Alternative approaches

2. Site-by-Site Guide

File: ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md

Detailed conversion instructions (line-by-line)
Common pitfalls and solutions
Testing checklist per file
Quick reference card

3. Quick Start Guide

File: ATOMIC_FREELIST_QUICK_START.md

Step-by-step implementation
Time budget breakdown
Success metrics
Rollback procedures

4. Accessor Header Template

File: core/box/slab_freelist_atomic.h.TEMPLATE

Complete implementation (80 lines)
Extensive comments and examples
Performance notes
Testing strategy

5. Analysis Script

File: scripts/analyze_freelist_sites.sh

Counts sites by category
Shows hot/warm/cold paths
Estimates conversion effort
Checks for lock-protected sites

6. Verification Script

File: scripts/verify_atomic_freelist_conversion.sh

Tracks conversion progress
Detects potential bugs (double-POP/PUSH)
Checks compile status
Provides recommendations

Usage Instructions

Quick Start

# 1. Review documentation (15 min)
cat ATOMIC_FREELIST_QUICK_START.md

# 2. Run analysis (5 min)
./scripts/analyze_freelist_sites.sh

# 3. Create accessor header (30 min)
cp core/box/slab_freelist_atomic.h.TEMPLATE core/box/slab_freelist_atomic.h
make bench_random_mixed_hakmem  # Test compile

# 4. Start Phase 1 (2-3 hours)
git checkout -b atomic-freelist-phase1
# Follow ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md

# 5. Verify progress
./scripts/verify_atomic_freelist_conversion.sh

# 6. Test Phase 1
./out/release/larson_hakmem 8 100000 256

Incremental Progress Tracking

# Check conversion progress
./scripts/verify_atomic_freelist_conversion.sh

# Output example:
# Progress: 30% (27/90 sites)
# [============----------------------------]
# Currently working on: Phase 1 (Critical Hot Paths)

Expected Timeline

Day	Activity	Hours	Cumulative
Day 1	Setup + Phase 1	3h	3h
	Test Phase 1	1h	4h
Day 2	Phase 2 conversion	2-3h	6-7h
	Test Phase 2	1h	7-8h
Day 3	Phase 3 cleanup	1-2h	8-10h
	Final testing	1h	9-11h

Realistic Total: 9-11 hours (including testing and documentation)
Minimal Viable: 3-4 hours (Phase 1 only, fixes Larson crash)

Success Metrics

Phase 1 Success

✅ Larson 8T runs for 100K iterations without crash
✅ Single-threaded regression <5% (>24.0M ops/s)
✅ No data races detected (TSan clean)

Phase 2 Success

✅ All MT tests pass (1T, 2T, 4T, 8T, 16T)
✅ Single-threaded regression <3% (>24.4M ops/s)
✅ MT scaling 70%+ (8T = 5.6x+ speedup)

Phase 3 Success

✅ All 90 sites converted or documented
✅ Zero direct meta->freelist accesses (except atomic.h)
✅ Full test suite passes
✅ Documentation updated

Rollback Plan

If Phase 1 fails (>5% regression or instability):

Option A: Revert and Debug

git stash
git checkout master
git branch -D atomic-freelist-phase1
# Review logs, fix issues, retry

Option B: Alternative Approach (Spinlock)

If lock-free proves too complex:

// Add to TinySlabMeta
typedef struct TinySlabMeta {
    uint8_t freelist_lock;  // 1-byte spinlock
    void* freelist;         // Back to non-atomic
    // ... rest unchanged
} TinySlabMeta;

// Acquire with __sync_lock_test_and_set(), release with __sync_lock_release()
// Expected overhead: 5-10% (vs 2-3% for lock-free)

Trade-off: Simpler implementation, guaranteed correctness, slightly higher overhead
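
A minimal sketch of this fallback, assuming GCC/Clang `__sync` builtins and a next pointer stored in the first word of each free block. The helper names `slab_lock()`, `slab_unlock()`, `slab_freelist_pop_locked()`, and `slab_freelist_push_locked()` are hypothetical, and the struct is reduced to the two fields shown above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: only the fields relevant to the spinlock variant. */
typedef struct TinySlabMeta {
    uint8_t freelist_lock;  /* 0 = unlocked, 1 = locked */
    void*   freelist;       /* plain pointer, protected by freelist_lock */
} TinySlabMeta;

static inline void slab_lock(TinySlabMeta* meta) {
    /* Atomically writes 1 and returns the previous value (acquire barrier);
     * spin until the previous value was 0, meaning we took the lock. */
    while (__sync_lock_test_and_set(&meta->freelist_lock, 1)) {
        /* busy-wait */
    }
}

static inline void slab_unlock(TinySlabMeta* meta) {
    /* Writes 0 with release semantics. */
    __sync_lock_release(&meta->freelist_lock);
}

static inline void* slab_freelist_pop_locked(TinySlabMeta* meta) {
    slab_lock(meta);
    void* block = meta->freelist;
    if (block != NULL) {
        meta->freelist = *(void**)block;  /* unlink: head = head->next */
    }
    slab_unlock(meta);
    return block;
}

static inline void slab_freelist_push_locked(TinySlabMeta* meta, void* node) {
    slab_lock(meta);
    *(void**)node = meta->freelist;       /* node->next = old head */
    meta->freelist = node;
    slab_unlock(meta);
}
```

Because every access happens under the lock, this variant needs no CAS retry loop, which is where its simplicity comes from.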

Alternatives Considered

Option A: Mutex per Slab (REJECTED)

Pros: Simple, guaranteed correctness
Cons: 40-byte overhead, 10-20x performance hit
Reason: Too expensive for per-slab locking

Option B: Global Lock (REJECTED)

Pros: 1-line fix, zero code changes
Cons: Serializes all allocation, kills MT performance
Reason: Defeats purpose of MT allocator

Option C: TLS-Only (REJECTED)

Pros: No atomics needed, simplest
Cons: Cannot handle remote free (required for MT)
Reason: Breaks existing functionality

Option D: Hybrid Lock-Free + Relaxed (SELECTED) ✅

Pros: Best performance, incremental implementation, minimal overhead
Cons: More complex, requires careful memory ordering
Reason: Optimal balance of performance, safety, and maintainability

Conclusion

Feasibility: HIGH ✅

Only 90 sites (not 589)
Well-understood patterns
Existing atomic operations in codebase (563 sites as reference)
Incremental phased approach
Easy rollback

Risk: LOW ✅

Phase 1 focus (25 sites) minimizes risk
Test after each file
Alternative approaches available
No ABI changes

Benefit: HIGH ✅

Fixes Larson 8T crash (critical bug)
Enables MT performance (70-80% scaling)
Future-proof architecture
Only 2-3% single-threaded cost

Recommendation: PROCEED ✅

Start with Phase 1 (2-3 hours) and evaluate:

If stable + <5% regression: Continue to Phase 2
If unstable or >5% regression: Rollback and review

Expected outcome: 9-11 hours for full MT safety with <3% single-threaded regression

Files Created

ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md (comprehensive strategy)
ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md (detailed conversion guide)
ATOMIC_FREELIST_QUICK_START.md (quick start instructions)
ATOMIC_FREELIST_SUMMARY.md (this file)
core/box/slab_freelist_atomic.h.TEMPLATE (accessor API template)
scripts/analyze_freelist_sites.sh (site analysis tool)
scripts/verify_atomic_freelist_conversion.sh (progress tracker)

Total: 7 files, ~3000 lines of documentation and tooling

Next Actions

Review ATOMIC_FREELIST_QUICK_START.md (15 min)
Run ./scripts/analyze_freelist_sites.sh (5 min)
Create accessor header from template (30 min)
Start Phase 1 conversion (2-3 hours)
Test Larson 8T stability (30 min)
Evaluate results and proceed or rollback

First milestone: Larson 8T stable (3-4 hours total)
Final goal: Full MT safety in 9-11 hours
