Atomic Freelist Implementation - Executive Summary
Investigation Results
Good News
Actual site count: 90 sites (not 589!)
- Original estimate was based on all .freelist member accesses
- Actual meta->freelist accesses: 90 sites in 21 files
- Fully manageable in 5-8 hours with phased approach
Analysis Breakdown
| Category | Count | Effort |
|---|---|---|
| Phase 1 (Critical Hot Paths) | 25 sites in 5 files | 2-3 hours |
| Phase 2 (Important Paths) | 40 sites in 10 files | 2-3 hours |
| Phase 3 (Debug/Cleanup) | 25 sites in 6 files | 1-2 hours |
| Total | 90 sites in 21 files | 5-8 hours |
Operation Breakdown
- NULL checks (if/while conditions): 16 sites
- Direct assignments (store): 32 sites
- POP operations (load + next): 8 sites
- PUSH operations (write + assign): 14 sites
- Read operations (checks/loads): 29 sites
- Write operations (assignments): 32 sites
Implementation Strategy
Recommended Approach: Hybrid
Hot Paths (10-20 sites):
- Lock-free CAS operations
- slab_freelist_pop_lockfree() / slab_freelist_push_lockfree()
- Memory ordering: acquire/release
- Cost: 6-10 cycles per operation
Cold Paths (40-50 sites):
- Relaxed atomic loads/stores
- slab_freelist_load_relaxed() / slab_freelist_store_relaxed()
- Memory ordering: relaxed
- Cost: 0 cycles overhead
Debug/Stats (10-15 sites):
- Skip conversion entirely
- Use SLAB_FREELIST_DEBUG_PTR(meta) macro
- Already atomic type, just cast for printf
Key Design Decisions
1. Accessor Function API
Created centralized atomic operations in core/box/slab_freelist_atomic.h:
// Lock-free operations (hot paths)
void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx);
void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node);
// Relaxed operations (cold paths)
void* slab_freelist_load_relaxed(TinySlabMeta* meta);
void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value);
// NULL checks
bool slab_freelist_is_empty(TinySlabMeta* meta);
bool slab_freelist_is_nonempty(TinySlabMeta* meta);
// Debug
#define SLAB_FREELIST_DEBUG_PTR(meta) ...
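The relaxed accessors, NULL checks, and debug macro declared above could look roughly like the following. This is a minimal sketch using C11 `<stdatomic.h>`, not the contents of the actual header: the one-field `TinySlabMeta` is a stand-in for the real struct, and the body of `SLAB_FREELIST_DEBUG_PTR` is an assumption, since the original elides it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for illustration only: the real TinySlabMeta has more fields. */
typedef struct TinySlabMeta {
    _Atomic(void*) freelist;
} TinySlabMeta;

/* Cold-path read: atomic, but imposes no ordering on other memory accesses. */
static inline void* slab_freelist_load_relaxed(TinySlabMeta* meta) {
    assert(meta != NULL);
    return atomic_load_explicit(&meta->freelist, memory_order_relaxed);
}

/* Cold-path write (init/cleanup): atomic store without ordering guarantees. */
static inline void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value) {
    assert(meta != NULL);
    atomic_store_explicit(&meta->freelist, value, memory_order_relaxed);
}

/* NULL checks are only a hint: the list may change right after the load. */
static inline bool slab_freelist_is_empty(TinySlabMeta* meta) {
    return slab_freelist_load_relaxed(meta) == NULL;
}

static inline bool slab_freelist_is_nonempty(TinySlabMeta* meta) {
    return slab_freelist_load_relaxed(meta) != NULL;
}

/* Assumed expansion: snapshot the pointer as void* for printf("%p", ...). */
#define SLAB_FREELIST_DEBUG_PTR(meta) \
    ((void*)atomic_load_explicit(&(meta)->freelist, memory_order_relaxed))
```

Because the NULL checks are relaxed, a caller that sees a non-empty list still has to cope with a later pop returning NULL.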
2. Memory Ordering Rationale
Relaxed (most sites):
- No synchronization needed
- 0 cycles overhead
- Safe for: NULL checks, init, debug
Acquire (POP operations):
- Must see next pointer before unlinking
- 1-2 cycles overhead
- Prevents use-after-free
Release (PUSH operations):
- Must publish next pointer before freelist update
- 1-2 cycles overhead
- Ensures visibility to other threads
NOT using seq_cst:
- Total ordering not needed
- 5-10 cycles overhead (too expensive)
- Per-slab ordering sufficient
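The acquire/release pairing described above can be sketched as a CAS loop over the freelist head. This is an illustration under assumptions, not the project's implementation: `demo_next_read()` and `demo_next_write()` are hypothetical stand-ins for `tiny_next_read()` and its write counterpart, and they simply keep the next pointer in the first word of each free block. A plain CAS pop like this is also exposed to the ABA problem when several threads pop concurrently, which is one reason the CAS loops need careful review.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for illustration only: the real TinySlabMeta has more fields. */
typedef struct TinySlabMeta {
    _Atomic(void*) freelist;
} TinySlabMeta;

/* Hypothetical helpers: next pointer stored in the block's first word. */
static inline void* demo_next_read(int class_idx, void* block) {
    (void)class_idx;
    return *(void**)block;
}
static inline void demo_next_write(int class_idx, void* block, void* next) {
    (void)class_idx;
    *(void**)block = next;
}

/* POP: acquire so the next pointer written by the pusher is visible. */
void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx) {
    assert(meta != NULL);
    void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
    while (head != NULL) {
        void* next = demo_next_read(class_idx, head);
        /* On failure, head is refreshed with the current list head. */
        if (atomic_compare_exchange_weak_explicit(&meta->freelist, &head, next,
                memory_order_acquire, memory_order_acquire)) {
            return head;
        }
    }
    return NULL; /* empty, or another thread drained the list first */
}

/* PUSH: release so the node's next pointer is published before the head. */
void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node) {
    assert(meta != NULL && node != NULL);
    void* head = atomic_load_explicit(&meta->freelist, memory_order_relaxed);
    do {
        demo_next_write(class_idx, node, head);
    } while (!atomic_compare_exchange_weak_explicit(&meta->freelist, &head, node,
                 memory_order_release, memory_order_relaxed));
}
```

The weak compare-exchange is used because both loops retry anyway, so a spurious failure only costs one extra iteration.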
3. Critical Pattern Conversions
Before (direct access):
if (meta->freelist != NULL) {
    void* block = meta->freelist;
    meta->freelist = tiny_next_read(class_idx, block);
    use(block);
}
After (lock-free atomic):
if (slab_freelist_is_nonempty(meta)) {
    void* block = slab_freelist_pop_lockfree(meta, class_idx);
    if (!block) goto fallback;  // Handle race
    use(block);
}
Key differences:
- NULL check uses relaxed atomic load
- POP operation uses CAS loop internally
- Must handle race condition (block == NULL)
- tiny_next_read() called inside accessor (no double-conversion)
Performance Analysis
Single-Threaded Impact
| Operation | Before (cycles) | After Relaxed | After CAS | Overhead |
|---|---|---|---|---|
| NULL check | 1 | 1 | - | 0% |
| Load/Store | 1 | 1 | - | 0% |
| POP/PUSH | 3-5 | - | 8-12 | +60-140% |
Overall Expected:
- Relaxed sites (~70%): 0% overhead
- CAS sites (~30%): +60-140% per operation
- Net regression: 2-3% (due to good branch prediction)
- Baseline: 25.1M ops/s (Random Mixed 256B)
- Expected: 24.4-24.8M ops/s (Random Mixed 256B)
- Acceptable: >24.0M ops/s (<5% regression)
Multi-Threaded Impact
| Metric | Before | After | Change |
|---|---|---|---|
| Larson 8T | CRASH | ~18-20M ops/s | FIXED |
| MT Scaling (8T) | 0% | 70-80% | NEW |
| Throughput (1T) | 25.1M ops/s | 24.4-24.8M ops/s | -1.2-2.8% |
Benefit: Stability + MT scalability >> 2-3% single-threaded cost
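The percentages in the tables above follow from simple ratios, which the sketch below makes explicit. The helper names are hypothetical. The scaling formula is an assumed reading of "MT scaling" (speedup over 1T divided by thread count), chosen because the Success Metrics section equates 70% at 8T with a 5.6x speedup.

```c
#include <assert.h>

/* Percentage drop from baseline to measured throughput (same units). */
static double regression_pct(double baseline, double measured) {
    assert(baseline > 0.0);
    return (baseline - measured) / baseline * 100.0;
}

/* Assumed definition: speedup over 1T divided by thread count, in percent. */
static double scaling_efficiency_pct(double ops_1t, double ops_nt, int threads) {
    assert(ops_1t > 0.0 && threads > 0);
    return ops_nt / ops_1t / threads * 100.0;
}
```

With the document's numbers, `regression_pct(25.1, 24.4)` is about 2.8 and `regression_pct(25.1, 24.8)` is about 1.2, matching the -1.2-2.8% row. The acceptance floor of 24.0M ops/s corresponds to roughly 4.4%, which is inside the 5% budget.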
Risk Assessment
Low Risk ✅
- Incremental implementation: 3 phases, test after each
- Easy rollback: git checkout master
- Well-tested patterns: Existing atomic operations in codebase (563 sites)
- No ABI changes: Atomic type already declared
Medium Risk ⚠️
- Performance regression: 2-3% expected (acceptable)
- Subtle bugs: CAS retry loops need careful review
- Complexity: 90 sites to convert (but well-documented)
High Risk ❌
- None identified
Mitigation Strategies
- Phase 1 focus: Fix Larson crash first (25 sites, 2-3 hours)
- Test early: Compile and test after each file
- A/B testing: Keep old code in branches for comparison
- Rollback plan: Alternative spinlock approach if needed
Implementation Plan
Phase 1: Critical Hot Paths (2-3 hours) 🔥
Goal: Fix Larson 8T crash with minimal changes
Files (5 files, 25 sites):
- core/box/slab_freelist_atomic.h (CREATE new accessor API)
- core/tiny_superslab_alloc.inc.h (fast alloc pop)
- core/hakmem_tiny_refill_p0.inc.h (P0 batch refill)
- core/box/carve_push_box.c (carve/rollback push)
- core/hakmem_tiny_tls_ops.h (TLS drain)
Testing:
./out/release/larson_hakmem 8 100000 256 # Expect: no crash
./out/release/bench_random_mixed_hakmem 10000000 256 42 # Expect: >24.0M ops/s
Success Criteria:
- ✅ Larson 8T stable (no crashes)
- ✅ Regression <5% (>24.0M ops/s)
- ✅ No ASan/TSan warnings
Phase 2: Important Paths (2-3 hours) ⚡
Goal: Full MT safety for all allocation paths
Files (10 files, 40 sites):
- core/tiny_refill_opt.h
- core/tiny_free_magazine.inc.h
- core/refill/ss_refill_fc.h
- core/slab_handle.h
- 6 additional files
Testing:
for t in 1 2 4 8 16; do ./out/release/larson_hakmem $t 100000 256; done
Success Criteria:
- ✅ All MT tests pass
- ✅ Regression <3% (>24.4M ops/s)
- ✅ MT scaling 70%+
Phase 3: Cleanup (1-2 hours) 🧹
Goal: Convert/document remaining sites
Files (6 files, 25 sites):
- Debug/stats sites: Add SLAB_FREELIST_DEBUG_PTR()
- Init/cleanup sites: Use slab_freelist_store_relaxed()
- Add comments for MT safety assumptions
Testing:
make clean && make all
./run_all_tests.sh
Success Criteria:
- ✅ All 90 sites converted or documented
- ✅ Zero direct accesses (except in atomic.h)
- ✅ Full test suite passes
Tools and Scripts
Created comprehensive implementation support:
1. Strategy Document
File: ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md
- Accessor function design
- Memory ordering rationale
- Performance projections
- Risk assessment
- Alternative approaches
2. Site-by-Site Guide
File: ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md
- Detailed conversion instructions (line-by-line)
- Common pitfalls and solutions
- Testing checklist per file
- Quick reference card
3. Quick Start Guide
File: ATOMIC_FREELIST_QUICK_START.md
- Step-by-step implementation
- Time budget breakdown
- Success metrics
- Rollback procedures
4. Accessor Header Template
File: core/box/slab_freelist_atomic.h.TEMPLATE
- Complete implementation (80 lines)
- Extensive comments and examples
- Performance notes
- Testing strategy
5. Analysis Script
File: scripts/analyze_freelist_sites.sh
- Counts sites by category
- Shows hot/warm/cold paths
- Estimates conversion effort
- Checks for lock-protected sites
6. Verification Script
File: scripts/verify_atomic_freelist_conversion.sh
- Tracks conversion progress
- Detects potential bugs (double-POP/PUSH)
- Checks compile status
- Provides recommendations
Usage Instructions
Quick Start
# 1. Review documentation (15 min)
cat ATOMIC_FREELIST_QUICK_START.md
# 2. Run analysis (5 min)
./scripts/analyze_freelist_sites.sh
# 3. Create accessor header (30 min)
cp core/box/slab_freelist_atomic.h.TEMPLATE core/box/slab_freelist_atomic.h
make bench_random_mixed_hakmem # Test compile
# 4. Start Phase 1 (2-3 hours)
git checkout -b atomic-freelist-phase1
# Follow ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md
# 5. Verify progress
./scripts/verify_atomic_freelist_conversion.sh
# 6. Test Phase 1
./out/release/larson_hakmem 8 100000 256
Incremental Progress Tracking
# Check conversion progress
./scripts/verify_atomic_freelist_conversion.sh
# Output example:
# Progress: 30% (27/90 sites)
# [============----------------------------]
# Currently working on: Phase 1 (Critical Hot Paths)
Expected Timeline
| Day | Activity | Hours | Cumulative |
|---|---|---|---|
| Day 1 | Setup + Phase 1 | 3h | 3h |
| | Test Phase 1 | 1h | 4h |
| Day 2 | Phase 2 conversion | 2-3h | 6-7h |
| | Test Phase 2 | 1h | 7-8h |
| Day 3 | Phase 3 cleanup | 1-2h | 8-10h |
| | Final testing | 1h | 9-11h |
- Realistic Total: 9-11 hours (including testing and documentation)
- Minimal Viable: 3-4 hours (Phase 1 only, fixes Larson crash)
Success Metrics
Phase 1 Success
- ✅ Larson 8T runs for 100K iterations without crash
- ✅ Single-threaded regression <5% (>24.0M ops/s)
- ✅ No data races detected (TSan clean)
Phase 2 Success
- ✅ All MT tests pass (1T, 2T, 4T, 8T, 16T)
- ✅ Single-threaded regression <3% (>24.4M ops/s)
- ✅ MT scaling 70%+ (8T = 5.6x+ speedup)
Phase 3 Success
- ✅ All 90 sites converted or documented
- ✅ Zero direct meta->freelist accesses (except atomic.h)
- ✅ Full test suite passes
- ✅ Documentation updated
Rollback Plan
If Phase 1 fails (>5% regression or instability):
Option A: Revert and Debug
git stash
git checkout master
git branch -D atomic-freelist-phase1
# Review logs, fix issues, retry
Option B: Alternative Approach (Spinlock)
If lock-free proves too complex:
// Add to TinySlabMeta
typedef struct TinySlabMeta {
    uint8_t freelist_lock;  // 1-byte spinlock (0 = unlocked)
    void* freelist;         // Back to non-atomic
    // ... rest unchanged
} TinySlabMeta;
// Lock with __sync_lock_test_and_set(), unlock with __sync_lock_release()
// Expected overhead: 5-10% (vs 2-3% for lock-free)
Trade-off: Simpler implementation, guaranteed correctness, slightly higher overhead
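A minimal sketch of how the spinlock fallback might be wired up with the GCC/Clang `__sync` builtins. The helper names (`slab_freelist_lock`, `slab_freelist_unlock`, `slab_freelist_pop_locked`, `slab_freelist_push_locked`) are hypothetical, the struct is reduced to the two fields shown above, and the next pointer is assumed to live in the first word of each free block rather than going through `tiny_next_read()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for illustration only: the real TinySlabMeta has more fields. */
typedef struct TinySlabMeta {
    uint8_t freelist_lock;  /* 0 = unlocked, 1 = locked */
    void*   freelist;       /* plain pointer, guarded by freelist_lock */
} TinySlabMeta;

static inline void slab_freelist_lock(TinySlabMeta* meta) {
    assert(meta != NULL);
    /* Returns the previous value; nonzero means someone else holds it. */
    while (__sync_lock_test_and_set(&meta->freelist_lock, 1)) {
        /* spin until the holder releases */
    }
}

static inline void slab_freelist_unlock(TinySlabMeta* meta) {
    /* Writes 0 with release semantics. */
    __sync_lock_release(&meta->freelist_lock);
}

static void* slab_freelist_pop_locked(TinySlabMeta* meta) {
    slab_freelist_lock(meta);
    void* block = meta->freelist;
    if (block != NULL) {
        meta->freelist = *(void**)block;  /* assumed layout: next in word 0 */
    }
    slab_freelist_unlock(meta);
    return block;
}

static void slab_freelist_push_locked(TinySlabMeta* meta, void* node) {
    slab_freelist_lock(meta);
    *(void**)node = meta->freelist;
    meta->freelist = node;
    slab_freelist_unlock(meta);
}
```

Since every list access happens while the lock is held, this variant avoids the ABA concern of a CAS-based pop. The price is that contending threads spin, which is where the higher expected overhead comes from.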
Alternatives Considered
Option A: Mutex per Slab (REJECTED)
- Pros: Simple, guaranteed correctness
- Cons: 40-byte overhead, 10-20x performance hit
- Reason: Too expensive for per-slab locking
Option B: Global Lock (REJECTED)
- Pros: 1-line fix, zero code changes
- Cons: Serializes all allocation, kills MT performance
- Reason: Defeats purpose of MT allocator
Option C: TLS-Only (REJECTED)
- Pros: No atomics needed, simplest
- Cons: Cannot handle remote free (required for MT)
- Reason: Breaking existing functionality
Option D: Hybrid Lock-Free + Relaxed (SELECTED) ✅
- Pros: Best performance, incremental implementation, minimal overhead
- Cons: More complex, requires careful memory ordering
- Reason: Optimal balance of performance, safety, and maintainability
Conclusion
Feasibility: HIGH ✅
- Only 90 sites (not 589)
- Well-understood patterns
- Existing atomic operations in codebase (563 sites as reference)
- Incremental phased approach
- Easy rollback
Risk: LOW ✅
- Phase 1 focus (25 sites) minimizes risk
- Test after each file
- Alternative approaches available
- No ABI changes
Benefit: HIGH ✅
- Fixes Larson 8T crash (critical bug)
- Enables MT performance (70-80% scaling)
- Future-proof architecture
- Only 2-3% single-threaded cost
Recommendation: PROCEED ✅
Start with Phase 1 (2-3 hours) and evaluate:
- If stable + <5% regression: Continue to Phase 2
- If unstable or >5% regression: Rollback and review
Expected outcome: 9-11 hours for full MT safety with <3% single-threaded regression
Files Created
- ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md (comprehensive strategy)
- ATOMIC_FREELIST_SITE_BY_SITE_GUIDE.md (detailed conversion guide)
- ATOMIC_FREELIST_QUICK_START.md (quick start instructions)
- ATOMIC_FREELIST_SUMMARY.md (this file)
- core/box/slab_freelist_atomic.h.TEMPLATE (accessor API template)
- scripts/analyze_freelist_sites.sh (site analysis tool)
- scripts/verify_atomic_freelist_conversion.sh (progress tracker)
Total: 7 files, ~3000 lines of documentation and tooling
Next Actions
- Review ATOMIC_FREELIST_QUICK_START.md (15 min)
- Run ./scripts/analyze_freelist_sites.sh (5 min)
- Create accessor header from template (30 min)
- Start Phase 1 conversion (2-3 hours)
- Test Larson 8T stability (30 min)
- Evaluate results and proceed or rollback
- First milestone: Larson 8T stable (3-4 hours total)
- Final goal: Full MT safety in 9-11 hours