hakmem/RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: Warm pool had 0% hit rate (only 1 hit per 3976 misses) despite being
implemented, causing all cache misses to go through expensive superslab_refill
registry scans.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refills and
prefill the pool with 3 additional HOT SuperSlabs before attempting to carve.
This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra SuperSlabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


HAKMEM Architectural Restructuring Analysis - Complete Package

2025-12-04


📦 What Has Been Delivered

Documents Created (4 files)

  1. ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md (5,000 words)

    • Comprehensive analysis of current architecture
    • Current performance bottlenecks identified
    • Proposed three-tier (HOT/WARM/COLD) architecture
    • Detailed implementation plan with phases
    • Risk analysis and mitigation strategies
  2. WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md (3,500 words)

    • Visual explanation of warm pool concept
    • Performance modeling with numbers
    • Data flow diagrams
    • Complexity vs gain analysis (3 phases)
    • Implementation roadmap with decision framework
  3. WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md (2,500 words)

    • Step-by-step implementation instructions
    • Code snippets for each change
    • Testing checklist
    • Success criteria
    • Debugging tips and common pitfalls
  4. This Summary Document

    • Overview of all findings and recommendations
    • Quick decision matrix
    • Next steps and approval paths

🎯 Key Findings

Current State Analysis

Performance Breakdown (Random Mixed: 1.06M ops/s):

Hot path (95% allocations):      950,000 ops @ ~25 cycles = 23.75M cycles
Warm path (5% cache misses):      50,000 batches @ ~1000 cycles = 50M cycles
Other overhead:                                                   15M cycles
─────────────────────────────────────────────────────────────────────────
Total:                                                           70.4M cycles

Root Cause of Bottleneck: Registry scan on every cache miss (O(N) operation, 50-100 cycles per miss)


💡 Proposed Solution: Warm Pool

The Concept

Add per-thread warm SuperSlab pools to eliminate registry scan:

BEFORE:
  Cache miss → Registry scan (50-100 cycles) → Find HOT → Carve → Return

AFTER:
  Cache miss → Warm pool pop (O(1), 5-10 cycles) → Already HOT → Carve → Return

Expected Performance Gain

Current:    1.06M ops/s
After:      1.5M+ ops/s  (+40-50% improvement)
Effort:     ~300 lines of code, 2-3 developer-days
Risk:       Low (fallback to proven registry scan path)

📊 Architectural Analysis

Current Architecture (Already in Place)

HAKMEM already has two-tier routing:

  • HOT tier: Unified Cache hit (95%+ allocations)
  • COLD tier: Everything else (errors, special cases)

Missing: WARM tier for efficient cache miss handling

Three-Tier Proposed Architecture

HOT TIER (95%+ allocations):
  Unified Cache pop → 2-3 cache misses, ~20-30 cycles
  No registry access, no locks

WARM TIER (1-5% cache misses):  ← NEW!
  Warm pool pop → O(1), ~50 cycles per batch (5 per object)
  No registry scan, pre-qualified SuperSlabs

COLD TIER (<0.1% special cases):
  Full allocation path → Mmap, registry insert, etc.
  Only on warm pool exhaustion or errors

Why This Works

1. Thread-Local Storage (No Locks)

  • Warm pools are per-thread (__thread keyword)
  • No atomic operations needed
  • No synchronization overhead
  • Safe for concurrent access

2. Pre-Qualified SuperSlabs

  • Only HOT SuperSlabs go into warm pool
  • Tier checks already done when added to pool
  • Fallback: Registry scan (existing code) always works

3. Batching Amortization

  • Warm pool refill cost amortized over 64+ allocations
  • Batch tier checks (once per N operations, not per-op)
  • Reduces per-allocation overhead

4. Fallback Safety

  • If warm pool empty → Registry scan (proven path)
  • If registry empty → Cold alloc (mmap, normal path)
  • Correctness always preserved

🔍 Implementation Scope

Phase 1: Warm Pool (RECOMMENDED)

What to change:

  1. Create core/front/tiny_warm_pool.h (~80 lines)
  2. Modify unified_cache_refill() (~50 lines)
  3. Add initialization (~20 lines)
  4. Add cleanup (~15 lines)

Total: ~300 lines of code

Effort: 2-3 development days

Performance gain: +40-50% (1.06M → 1.5M+ ops/s)

Risk: Low (additive, fallback always works)

Phase 2: Advanced Optimizations (OPTIONAL)

Lock-free pools, batched tier checks, per-thread refill threads

Effort: 1-2 weeks
Gain: Additional +20-30% (1.5M → 1.8-2.0M ops/s)
Risk: Medium

Phase 3: Full Redesign (NOT PLANNED)

Major rewrite with three separate pools per thread

Effort: 3-4 weeks
Gain: +100%+ vs. baseline, but marginal over Phase 2 (diminishing returns)
Risk: High (complexity, potential correctness issues)


📈 Performance Model

Conservative Estimate (Phase 1)

Registry scan overhead: ~500-1000 cycles per miss
Warm pool hit: ~50-100 cycles per miss
Improvement per miss: 80-95%

Applied to 5% of operations:
  50,000 misses × 900 cycles saved = 45M cycles saved
  70.4M baseline - 45M = 25.4M cycles
  Speedup: 70.4M / 25.4M = 2.77x
  But: Diminishing returns on other overhead = +40-50% realistic

Result: 1.06M × 1.45 = ~1.54M ops/s

Optimistic Estimate (Phase 2)

With additional optimizations:
  - Lock-free pools
  - Batched tier checks
  - Per-thread allocation threads

Result: 1.8-2.0M ops/s (+70-90%)

⚠️ Risks & Mitigations

Risk                   Severity   Mitigation
TLS memory bloat       Low        Allocate lazily, limit to 4 slots/class
Warm pool stale data   Low        Periodic tier validation, registry fallback
Cache invalidation     Low        LRU-based eviction, TTL tracking
Thread safety issues   Very Low   TLS is thread-safe by design

All risks are manageable and low-severity.


🎓 Why Not 10x Improvement?

The Fundamental Gap

Random Mixed:  1.06M ops/s  (real-world: 256 sizes, page faults)
Tiny Hot:      89M ops/s    (ideal case: 1 size, hot cache)
Gap:           83x

Why unbridgeable?
  1. Size class diversity (256 classes vs 1)
  2. Page faults (7,600 unavoidable)
  3. Working set (large, requires memory traffic)
  4. Routing overhead (necessary for correctness)
  5. Tier management (needed for utilization tracking)

Realistic ceiling with all optimizations:
  - Phase 1 (warm pool): 1.5M ops/s (+40%)
  - Phase 2 (advanced): 2.0M ops/s (+90%)
  - Phase 3 (redesign): ~2.5M ops/s (+135%)

Still 35x below Tiny Hot (architectural, not a bug)

📋 Decision Framework

Should We Implement Warm Pool?

YES if:

  • Current 1.06M ops/s is a bottleneck for users
  • 40-50% improvement (1.5M ops/s) would be valuable
  • We have 2-3 days to spend on implementation
  • We want incremental improvement without full redesign
  • Risk of regressions is acceptable (low)

NO if:

  • Performance is already acceptable
  • 10x improvement is required (not realistic)
  • We need to wait for full redesign (high effort, uncertain timeline)
  • We want to avoid any code changes

Recommendation

STRONGLY RECOMMEND Phase 1 (Warm Pool)

Rationale:

  • High ROI (40-50% gain for ~300 lines)
  • Low risk (fallback always works)
  • Incremental approach (doesn't block other work)
  • Clear success criteria (measurable ops/s improvement)
  • Foundation for future optimizations

🚀 Next Steps

Immediate Actions

  1. Review & Approval (Today)

    • Read all four documents
    • Agree on Phase 1 scope
    • Approve implementation plan
  2. Implementation Setup (Tomorrow)

    • Create core/front/tiny_warm_pool.h
    • Write unit tests
    • Set up benchmarking infrastructure
  3. Core Implementation (Day 2-3)

    • Modify unified_cache_refill()
    • Integrate warm pool initialization
    • Add cleanup on SuperSlab free
    • Compile and verify
  4. Testing & Validation (Day 3-4)

    • Run Random Mixed benchmark
    • Measure ops/s improvement (target: 1.5M+)
    • Verify warm pool hit rate (target: > 90%)
    • Regression testing on other workloads
  5. Profiling & Optimization (Optional)

    • Profile CPU cycles (target: 40-50% reduction)
    • Identify remaining bottlenecks
    • Consider Phase 2 optimizations

Timeline

Phase 1 (Warm Pool):    2-3 days    → Expected +40-50% gain
Phase 2 (Optional):     1-2 weeks   → Additional +20-30% gain
Phase 3 (Not planned):  3-4 weeks   → Marginal additional gain

📚 Documentation Package

For Developers

  1. WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md

    • Step-by-step code changes
    • Copy-paste ready implementation
    • Testing checklist
    • Debugging guide
  2. WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md

    • Visual explanations
    • Performance models
    • Decision framework
    • Risk analysis

For Architects

  1. ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
    • Complete analysis
    • Current bottlenecks identified
    • Three-tier design
    • Implementation phases

For Project Managers

  1. This Document
    • Executive summary
    • Decision matrix
    • Timeline and effort estimates
    • Success criteria

🎯 Success Criteria

Functional Requirements

  • Warm pool correctly stores/retrieves SuperSlabs
  • No memory corruption or access violations
  • Thread-safe for concurrent allocations
  • All existing tests pass

Performance Requirements

  • Random Mixed: 1.5M+ ops/s (from 1.06M, +40%)
  • Warm pool hit rate: > 90%
  • Tiny Hot: 89M ops/s (no regression)
  • Memory overhead: < 200KB per thread

Quality Requirements

  • Code compiles without warnings
  • All benchmarks pass validation
  • Documentation is complete
  • Commit message follows conventions

💾 Deliverables Summary

Documents:

  • Comprehensive architectural analysis (5,000 words)
  • Warm pool design summary (3,500 words)
  • Implementation guide (2,500 words)
  • This executive summary

Code References:

  • Current codebase analyzed (file locations documented)
  • Bottlenecks identified (registry scan, tier checks)
  • Integration points mapped (unified_cache_refill, etc.)
  • Test scenarios planned

Ready for:

  • Developer implementation
  • Architecture review
  • Project planning
  • Performance measurement

🎓 Key Learnings

From Previous Analysis Session

  1. User-Space Limitations: Can't control kernel page fault handler
  2. Syscall Overhead: Can negate theoretical gains (lazy zeroing -0.5%)
  3. Profiling Pitfalls: Not all % in profile are controllable

From This Session

  1. Batch Amortization: Most effective optimization technique
  2. Thread-Local Design: Perfect fit for warm pools (no contention)
  3. Fallback Paths: Enable safe incremental improvements
  4. Architecture Matters: 10x gap is unbridgeable without redesign

📂 Related Documents

From Previous Session:

  • FINAL_SESSION_REPORT_20251204.md - Performance profiling results
  • LAZY_ZEROING_IMPLEMENTATION_RESULTS_20251204.md - Why lazy zeroing failed
  • COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md - Initial analysis

New Documents (This Session):

  • ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md - Full proposal
  • WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md - Visual guide
  • WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md - Code guide
  • RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md - This summary

Approval Checklist

Before starting implementation, please confirm:

  • Scope: Approved Phase 1 (warm pool) implementation
  • Timeline: 2-3 days is acceptable
  • Success Criteria: 1.5M+ ops/s improvement is acceptable
  • Risk: Low risk is acceptable
  • Resource: Developer time available
  • Testing: Benchmarking infrastructure ready

📞 Questions?

Common questions anticipated:

Q: Why not implement Phase 2/3 from the start?
A: Phase 1 gives 40-50% gain with low risk and quick delivery. Phase 2/3 have diminishing returns and higher risk. Better to ship Phase 1, measure, then plan Phase 2 if needed.

Q: Will warm pool affect memory usage significantly?
A: No. Per-thread overhead is ~256-512KB (4 SuperSlabs × 32 classes). Acceptable even for highly multithreaded apps.

Q: What if warm pool doesn't deliver 40% gain?
A: Registry scan fallback always works. Worst case: small overhead from warm pool initialization (minimal). More likely: gain is real but measurement noise (±5%).

Q: Can we reach 10x with warm pool?
A: No. The 10x gap is architectural (256 size classes, 7,600 page faults, etc.). Warm pool helps with cache miss overhead, but can't fix fundamental differences from Tiny Hot.

Q: What about thread safety?
A: Warm pools are per-thread (__thread), so no locks needed. Thread-safe by design. No synchronization complexity.


🎯 Conclusion

What We Know

  1. HAKMEM has clear performance bottleneck: Registry scan on cache miss
  2. Warm pool is an elegant solution that fits the architecture
  3. Implementation is straightforward: ~300 lines, 2-3 days
  4. Expected gain is realistic: +40-50% (1.06M → 1.5M+ ops/s)
  5. Risks are low: Fallback always works, correctness preserved

What We Recommend

Implement Phase 1 (Warm Pool) to achieve:

  • +40-50% performance improvement
  • Low risk, quick delivery
  • Foundation for future optimizations
  • Demonstrates feasibility of architectural changes

Next Action

  1. Stakeholder Review: Approve Phase 1 scope
  2. Developer Assignment: Start implementation
  3. Weekly Check-in: Measure progress and performance

Analysis Complete: 2025-12-04
Status: Ready for implementation
Recommendation: PROCEED with Phase 1


📖 How to Use These Documents

  1. Start here: This summary (executive overview)
  2. Understand: WARM_POOL_ARCHITECTURE_SUMMARY (visual explanation)
  3. Implement: WARM_POOL_IMPLEMENTATION_GUIDE (code changes)
  4. Deep dive: ARCHITECTURAL_RESTRUCTURING_PROPOSAL (full analysis)

Generated by Claude Code
Date: 2025-12-04
Status: Complete and ready for review