Problem: The warm pool had a 0% hit rate (only 1 hit per 3976 misses) despite being implemented, so every cache miss went through the expensive superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- The next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now do multiple superslab_refills and prefill the pool with 3 additional HOT superslabs before attempting to carve. This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After: C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (tunable via env var if needed later)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+ pending registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
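The secondary prefill described in the commit message can be sketched roughly as below. This is a toy model, not the HAKMEM code: `WarmPool`, `fake_superslab_refill`, `warm_pool_push`, `warm_pool_pop` and `refill_get_slab` are hypothetical names, and the registry scan is replaced by a stub. Only the pool capacity (16) and the prefill budget (3) are taken from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Values taken from the commit message. */
#define TINY_WARM_POOL_MAX_PER_CLASS 16
#define WARM_POOL_PREFILL_BUDGET 3

/* Hypothetical stand-in for a SuperSlab; the real type is not shown here. */
typedef struct SuperSlab { int id; } SuperSlab;

/* Hypothetical per-class warm pool: a small fixed-capacity LIFO stack. */
typedef struct WarmPool {
    SuperSlab *slabs[TINY_WARM_POOL_MAX_PER_CLASS];
    int count;
    unsigned long hits, misses, prefilled;
} WarmPool;

/* Stub for the expensive registry scan (assumption: it returns a HOT slab). */
static SuperSlab g_fake_slabs[64];
static int g_fake_next = 0;
static SuperSlab *fake_superslab_refill(void) {
    if (g_fake_next >= 64) return NULL;
    g_fake_slabs[g_fake_next].id = g_fake_next;
    return &g_fake_slabs[g_fake_next++];
}

/* Push a slab; returns 1 on success, 0 if the slab is NULL or the pool is full. */
static int warm_pool_push(WarmPool *p, SuperSlab *ss) {
    assert(p != NULL);
    if (ss == NULL || p->count >= TINY_WARM_POOL_MAX_PER_CLASS) return 0;
    p->slabs[p->count++] = ss;
    return 1;
}

/* Pop the most recently pushed slab, or NULL if the pool is empty. */
static SuperSlab *warm_pool_pop(WarmPool *p) {
    assert(p != NULL);
    return p->count > 0 ? p->slabs[--p->count] : NULL;
}

/* Cold path of a cache refill: try the warm pool first. On an empty pool,
 * fetch one slab for immediate carving and prefill the pool with extras. */
static SuperSlab *refill_get_slab(WarmPool *p) {
    SuperSlab *ss = warm_pool_pop(p);
    if (ss != NULL) { p->hits++; return ss; }
    p->misses++;
    ss = fake_superslab_refill();   /* slab kept for immediate carving */
    if (ss == NULL) return NULL;
    for (int i = 0; i < WARM_POOL_PREFILL_BUDGET; i++) {
        if (warm_pool_push(p, fake_superslab_refill())) p->prefilled++;
    }
    return ss;
}
```

Starting from an empty pool, the first call to `refill_get_slab` misses and leaves 3 slabs in the pool, so the next three calls are hits. That 3-in-4 pattern belongs to this toy model only and is not a prediction of the measured hit rate.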
HAKMEM Architectural Restructuring Analysis - Complete Index
2025-12-04
📋 Document Overview
This is your complete guide to the HAKMEM architectural restructuring analysis and warm pool implementation proposal. Start here to navigate all documents.
🎯 Quick Start (5 minutes)
Read this first:
RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md (THIS DOCUMENT POINTS TO IT)
Then decide:
- Should we implement warm pool? ✓ YES, low risk, +40-50% gain
- Do we have time? ✓ YES, 2-3 days
- Is it worth it? ✓ YES, quick ROI
📚 Document Structure
Level 1: Executive Summary (START HERE)
File: RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
Length: ~3,000 words
Time to read: 15-20 minutes
Audience: Project managers, decision makers
Contains:
- High-level problem analysis
- Warm pool concept overview
- Performance expectations
- Decision framework
- Timeline and effort estimates
Level 2: Architecture & Design (FOR ARCHITECTS)
File: WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
Length: ~3,500 words
Time to read: 20-30 minutes
Audience: System architects, senior engineers
Contains:
- Visual diagrams of warm pool concept
- Data flow analysis
- Performance modeling with numbers
- Comparison: current vs proposed vs optional
- Risk analysis and mitigation
- Implementation phases explained
Level 3: Implementation Guide (FOR DEVELOPERS)
File: WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
Length: ~2,500 words
Time to read: 30-45 minutes (while implementing)
Audience: Developers, implementation engineers
Contains:
- Step-by-step code changes
- Code snippets (copy-paste ready)
- Testing checklist
- Debugging guide
- Common pitfalls and solutions
- Build & test commands
Level 4: Deep Technical Analysis (FOR REFERENCE)
File: ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
Length: ~5,000 words
Time to read: 45-60 minutes
Audience: Technical leads, code reviewers
Contains:
- Current architecture in detail
- Bottleneck analysis
- Three-tier design specification
- Implementation plan with phases
- Risk assessment
- Integration checklist
- Success metrics
🗺️ Reading Paths
Path 1: Decision Maker (15 minutes)
1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
↓ Read "Key Findings" section
↓ Read "Decision Framework"
↓ Ready to approve/reject
Path 2: Architect (45 minutes)
1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
↓ Full document
2. WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
↓ Focus on "Implementation Complexity vs Gain"
↓ Understand phases and trade-offs
Path 3: Developer (2-3 hours including implementation)
1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
↓ Skim entire document
2. WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
↓ Understand overall architecture
3. WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
↓ Follow step-by-step
↓ Implement code changes
↓ Run tests
4. ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
↓ Reference for edge cases
↓ Review integration checklist
Path 4: Code Reviewer (60 minutes)
1. ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
↓ "Implementation Plan" section
↓ Understand what changes are needed
2. WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
↓ Section "Step 3" through "Step 6"
↓ Verify code changes against checklist
3. Code inspection
↓ Verify warm pool operations (thread safety, correctness)
↓ Verify integration points (cache refill, cleanup)
🎯 Key Decision Points
Should We Implement Warm Pool?
Decision Checklist:
- Is +40-50% performance improvement valuable? (YES → Proceed)
- Do we have 2-3 days to spend? (YES → Proceed)
- Is low risk acceptable? (YES → Proceed)
- Can we commit to testing/profiling? (YES → Proceed)
Conclusion: If all YES → IMPLEMENT PHASE 1
What About Phase 2/3?
Phase 2 (Advanced Optimizations):
- Effort: 1-2 weeks
- Gain: Additional +20-30%
- Decision: Implement AFTER Phase 1 if performance still insufficient
Phase 3 (Architectural Redesign):
- Effort: 3-4 weeks
- Gain: Marginal +100% (diminishing returns)
- Decision: NOT RECOMMENDED (defer unless critical)
📊 Performance Summary
Current Performance
Random Mixed: 1.06M ops/s
- Bottleneck: Registry scan on cache miss (O(N), expensive)
- Profile: 70.4M cycles per 1M allocations
- Gap to Tiny Hot: 83x
After Phase 1 (Warm Pool)
Expected: 1.5M+ ops/s (+40-50%)
- Improvement: Registry scan eliminated (90% warm pool hits)
- Profile: ~45-50M cycles (30% reduction)
- Gap to Tiny Hot: Still ~59x at 1.5M ops/s (architectural)
After Phase 2 (If Done)
Estimated: 1.8-2.0M ops/s (+70-90%)
- Additional improvements from lock-free pools, batched tier checks
- Gap to Tiny Hot: Still ~45-49x
Why Not 10x?
Gap to Tiny Hot (89M ops/s) is ARCHITECTURAL:
- 256 size classes (Tiny Hot has 1)
- 7,600 page faults (unavoidable)
- Working set requirements (memory bound)
- Routing overhead (necessary for correctness)
Realistic ceiling: 2.0-2.5M ops/s (2-2.5x improvement max)
This is NORMAL, not a bug. Different workload patterns.
🔧 Implementation Overview
Phase 1: Basic Warm Pool (RECOMMENDED)
Files to Create:
- core/front/tiny_warm_pool.h (NEW, ~80 lines)
Files to Modify:
- core/front/tiny_unified_cache.h (add warm pool pop, ~50 lines)
- core/front/malloc_tiny_fast.h (init warm pool, ~20 lines)
- core/hakmem_super_registry.h or similar (cleanup integration, ~15 lines)
Total: ~300 lines of code
Timeline: 2-3 developer-days
Testing:
- Unit tests for warm pool operations
- Benchmark Random Mixed (target: 1.5M+ ops/s)
- Regression tests for other workloads
- Profiling to verify hit rate (target: > 90%)
Phase 2: Advanced Optimizations (OPTIONAL)
See WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md section "Implementation Phases"
✅ Success Criteria
Phase 1 Success Metrics
| Metric | Target | Measurement |
|---|---|---|
| Random Mixed ops/s | 1.5M+ | bench_allocators_hakmem |
| Warm pool hit rate | > 90% | Add debug counters |
| Tiny Hot regression | 0% | Run Tiny Hot benchmark |
| Memory overhead | < 200KB/thread | Profile TLS usage |
| All tests pass | 100% | Run test suite |
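The "Add debug counters" measurement in the table could look roughly like the sketch below. The struct and function names (`WarmPoolStats`, `warm_pool_hit_rate`, `warm_pool_stats_enabled`, `warm_pool_stats_print`) are hypothetical. The `HAKMEM_WARM_POOL_STATS=1` gate and the hits/misses/prefilled counters are the ones mentioned in the commit message at the top of this page.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-class counters (the commit message names a
 * g_warm_pool_stats[] array with a .prefilled field). */
typedef struct WarmPoolStats {
    unsigned long hits;
    unsigned long misses;
    unsigned long prefilled;
} WarmPoolStats;

/* Hit rate in percent; 0.0 when nothing has been recorded yet. */
static double warm_pool_hit_rate(const WarmPoolStats *s) {
    assert(s != NULL);
    unsigned long total = s->hits + s->misses;
    return total == 0 ? 0.0 : 100.0 * (double)s->hits / (double)total;
}

/* Counters are always updated; only printing is gated on the env var. */
static int warm_pool_stats_enabled(void) {
    const char *v = getenv("HAKMEM_WARM_POOL_STATS");
    return v != NULL && strcmp(v, "1") == 0;
}

/* Print one line per size class to stderr when the gate is enabled. */
static void warm_pool_stats_print(int cls, const WarmPoolStats *s) {
    if (!warm_pool_stats_enabled()) return;
    fprintf(stderr, "C%d hits=%lu, misses=%lu, hit_rate=%.1f%%\n",
            cls, s->hits, s->misses, warm_pool_hit_rate(s));
}
```

With hits=3929 and misses=3143, `warm_pool_hit_rate` returns about 55.6, which matches the figure reported in the commit message and can be compared directly against the > 90% target.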
🚀 How to Get Started
For Project Managers
- Read: RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
- Approve: Phase 1 implementation
- Assign: Developer and 2-3 days
- Schedule: Follow-up in 4 days
For Architects
- Read: WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
- Review: ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
- Approve: Implementation approach
- Plan: Optional Phase 2 after Phase 1
For Developers
- Read: WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
- Start: Step 1 (create tiny_warm_pool.h)
- Follow: Steps 2-6 in order
- Test: After each step
- Reference: ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md for edge cases
For QA/Testers
- Read: "Testing Checklist" in WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
- Prepare: Benchmark infrastructure (if not ready)
- Execute: Tests after implementation
- Validate: Performance metrics (target: 1.5M+ ops/s)
📞 FAQ
Q: How long will this take?
A: 2-3 developer-days for Phase 1. 1-2 weeks for Phase 2 (optional).
Q: What's the risk level?
A: Low. Warm pool is additive. Fallback to registry scan always works.
Q: Can we reach 10x performance?
A: No. That's architectural. Realistic gain: 2-2.5x maximum.
Q: Do we need to rewrite the entire allocator?
A: No. Phase 1 is ~300 lines, minimal disruption.
Q: Will warm pool work with multithreading?
A: Yes. It's thread-local, so no locks needed.
Q: What if we implement Phase 1 and it doesn't work?
A: The warm pool can be disabled (zero overhead), with full fallback to the registry scan.
Q: Should we plan Phase 2 now or after Phase 1?
A: After Phase 1. Measure first, then decide whether more optimization is needed.
🔗 Quick Links to Sections
In RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
- Key Findings: Performance analysis
- Solution Overview: Warm pool concept
- Why This Works: Technical justification
- Implementation Scope: Phases overview
- Performance Model: Numbers and estimates
- Decision Framework: Should we do it?
- Next Steps: Timeline and actions
In WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
- The Core Problem: What's slow
- Warm Pool Solution: How it works
- Performance Model: Before/after numbers
- Warm Pool Data Flow: Visual explanation
- Implementation Phases: Effort vs gain
- Safety & Correctness: Thread safety analysis
- Success Metrics: What to measure
In WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
- Step-by-Step Implementation: Code changes
- Testing Checklist: What to verify
- Build & Test: Commands to run
- Debugging Tips: Common issues
- Success Criteria: Acceptance tests
- Implementation Checklist: Verification items
In ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
- Current Architecture: Existing design
- Performance Bottlenecks: Root causes
- Three-Tier Architecture: Proposed design
- Implementation Plan: All phases
- Risk Assessment: Potential issues
- Integration Checklist: All tasks
- Files to Create/Modify: Complete list
📈 Metrics Dashboard
Before Implementation
Random Mixed: 1.06M ops/s [BASELINE]
CPU cycles: 70.4M [BASELINE]
L1 misses: 763K [BASELINE]
Page faults: 7,674 [BASELINE]
Warm pool hits: N/A [N/A]
After Phase 1 (Target)
Random Mixed: 1.5M ops/s [+40-50%]
CPU cycles: 45-50M [30% reduction]
L1 misses: Similar [Unchanged]
Page faults: 7,674 [Unchanged]
Warm pool hits: > 90% [Success]
🎓 Key Concepts Explained
Warm Pool
Per-thread cache of pre-allocated SuperSlabs. Eliminates registry scan on cache miss.
Registry Scan
Linear search through per-class registry to find HOT SuperSlab. Expensive (50-100 cycles).
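As an illustration of why this scan costs more than a pool pop, a linear search of one class's registry might look like the sketch below. The `SuperSlab` and `SlabTier` types and the `registry_scan_hot` function are hypothetical stand-ins, not the HAKMEM definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Tier names follow this document; the enum itself is hypothetical. */
typedef enum { TIER_FREE, TIER_DRAINING, TIER_HOT } SlabTier;

/* Hypothetical stand-in for a SuperSlab registry entry. */
typedef struct SuperSlab {
    int id;
    SlabTier tier;
} SuperSlab;

/* Linear scan over one class's registry: return the first HOT SuperSlab,
 * or NULL if none is found. Cost grows with the number of entries. */
static SuperSlab *registry_scan_hot(SuperSlab *const *entries, size_t count) {
    assert(entries != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (entries[i] != NULL && entries[i]->tier == TIER_HOT)
            return entries[i];
    }
    return NULL;
}
```

Every entry before the first HOT slab is visited and tested, which is the work a warm pool hit skips.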
Cache Miss
When Unified Cache (TLS) is empty. Happens ~1-5% of the time.
Three-Tier Architecture
HOT (Unified Cache) + WARM (Warm Pool) + COLD (Full allocation)
Thread-Local Storage (__thread)
Per-thread data, no synchronization needed. Perfect for warm pools.
Batch Amortization
Spreading cost over multiple operations. E.g., 64 objects share SuperSlab lookup cost.
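The arithmetic behind this is a single division. The helper below is a hypothetical illustration, using the batch of 64 objects from the example above and the 50-100 cycle registry scan estimate quoted earlier in this section.

```c
#include <assert.h>

/* Per-object share of a one-time lookup cost spread over a batch. */
static double amortized_cycles(double lookup_cycles, unsigned batch_size) {
    assert(batch_size > 0);
    return lookup_cycles / (double)batch_size;
}
```

For a 100-cycle lookup shared by 64 objects, `amortized_cycles(100.0, 64)` gives 1.5625 cycles per object.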
Tier System
Classification of SuperSlabs: HOT (>25% used), DRAINING (≤25%), FREE (0%)
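A minimal classification function for these thresholds might look like the following. It assumes that "used" means occupied blocks out of the SuperSlab's total capacity, which this document does not spell out. `SlabTier` and `classify_tier` are hypothetical names.

```c
#include <assert.h>

/* Tier names follow this document; the enum itself is hypothetical. */
typedef enum { TIER_FREE, TIER_DRAINING, TIER_HOT } SlabTier;

/* Classify by occupancy: FREE at 0%, HOT above 25%, DRAINING otherwise. */
static SlabTier classify_tier(unsigned used, unsigned capacity) {
    assert(capacity > 0 && used <= capacity);
    if (used == 0) return TIER_FREE;
    /* used/capacity > 25% is the same as 4*used > capacity; the widening
     * cast avoids overflow and keeps the comparison in integers. */
    if ((unsigned long long)used * 4ULL > capacity) return TIER_HOT;
    return TIER_DRAINING;
}
```

A slab at exactly 25% occupancy (for example 16 of 64 blocks) is DRAINING, while 17 of 64 is HOT.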
🔄 Review & Approval Process
Step 1: Executive Review (15 mins)
- Read RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
- Approve Phase 1 scope and timeline
- Assign developer resources
Step 2: Architecture Review (30 mins)
- Review WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
- Approve design and integration points
- Confirm risk mitigation strategies
Step 3: Implementation Review (During coding)
- Use WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md for step-by-step verification
- Check against ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md Integration Checklist
- Verify thread safety, correctness
Step 4: Testing & Validation (After coding)
- Run full test suite (all tests pass)
- Benchmark Random Mixed (1.5M+ ops/s)
- Measure warm pool hit rate (> 90%)
- Verify no regressions (Tiny Hot, etc.)
📝 File Manifest
Analysis Documents (This Package)
- ANALYSIS_INDEX_20251204.md ← YOU ARE HERE
- RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md (Executive summary)
- WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md (Architecture guide)
- WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md (Code guide)
- ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md (Deep analysis)
Previous Session Documents
- FINAL_SESSION_REPORT_20251204.md (Performance profiling results)
- LAZY_ZEROING_IMPLEMENTATION_RESULTS_20251204.md (Why lazy zeroing failed)
- COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md (Initial analysis)
- Plus 6+ analysis reports from profiling session
Code to Create (Phase 1)
- core/front/tiny_warm_pool.h ← NEW FILE
Code to Modify (Phase 1)
- core/front/tiny_unified_cache.h
- core/front/malloc_tiny_fast.h
- core/hakmem_super_registry.h or equivalent
✨ Summary
What We Found:
- HAKMEM has a clear bottleneck: the registry scan on cache miss
- The warm pool is an elegant solution that fits the existing architecture
What We Propose:
- Phase 1: Implement warm pool (~300 lines, 2-3 days)
- Expected: +40-50% performance (1.06M → 1.5M+ ops/s)
- Risk: Low (fallback always works)
What You Should Do:
- Read RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
- Approve Phase 1 implementation
- Assign 1 developer for 2-3 days
- Follow WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md for implementation
- Benchmark and measure improvement
Next Review:
- Check back in 4 days for Phase 1 completion
- Measure performance improvement
- Decide on Phase 2 (optional)
Status: ✅ Analysis complete and ready for implementation
Generated by: Claude Code
Date: 2025-12-04
Documents: 5 comprehensive guides + index
Ready for: Developer implementation, architecture review, performance validation
Recommendation: PROCEED with Phase 1 implementation