hakmem/ANALYSIS_INDEX_20251204.md
Moe Charm (CI) 5685c2f4c9 Implement Warm Pool Secondary Prefill Optimization (Phase B-2c Complete)
Problem: The warm pool had an effectively 0% hit rate (1 hit against 3,976 misses)
despite being implemented, so every cache miss went through the expensive
superslab_refill registry scan.

Root Cause Analysis:
- Warm pool was initialized once and pushed a single slab after each refill
- When that slab was exhausted, it was discarded (not pushed back)
- Next refill would push another single slab, which was immediately exhausted
- Pool would oscillate between 0 and 1 items, yielding 0% hit rate

Solution: Secondary Prefill on Cache Miss
When the warm pool becomes empty, we now perform multiple superslab_refill calls
and prefill the pool with 3 additional HOT superslabs before attempting to carve.
This builds a working set of slabs that can sustain allocation pressure.

Implementation Details:
- Modified unified_cache_refill() cold path to detect empty pool
- Added prefill loop: when pool count == 0, load 3 extra superslabs
- Store extra slabs in warm pool, keep 1 in TLS for immediate carving
- Track prefill events in g_warm_pool_stats[].prefilled counter
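
The prefill behaviour described above can be sketched as follows. This is a minimal illustration, not the actual unified_cache_refill() code: the demo_* types and functions, the fixed array standing in for the registry, and the single-class pool are all assumptions made for the example. Only the prefill budget of 3 and the pool capacity of 16 come from this commit message.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_WARM_POOL_CAP  16  /* mirrors TINY_WARM_POOL_MAX_PER_CLASS after this change */
#define DEMO_PREFILL_BUDGET 3   /* extra slabs loaded when the pool is empty */

/* Hypothetical stand-in for a superslab; the real type is not shown here. */
typedef struct demo_slab { int id; } demo_slab_t;

typedef struct {
    demo_slab_t *slots[DEMO_WARM_POOL_CAP];
    int          count;
    unsigned     hits, misses, prefilled;
} demo_warm_pool_t;

/* Hypothetical expensive path standing in for superslab_refill(). */
static demo_slab_t g_demo_slabs[64];
static int         g_demo_next;

demo_slab_t *demo_registry_refill(void) {
    if (g_demo_next >= 64) return NULL;
    g_demo_slabs[g_demo_next].id = g_demo_next;
    return &g_demo_slabs[g_demo_next++];
}

/* Called on a unified-cache miss; returns a slab to carve from. */
demo_slab_t *demo_cache_miss(demo_warm_pool_t *p) {
    assert(p != NULL && p->count <= DEMO_WARM_POOL_CAP);
    if (p->count > 0) {                 /* warm hit: no registry scan */
        p->hits++;
        return p->slots[--p->count];
    }
    p->misses++;
    demo_slab_t *tls = demo_registry_refill();  /* kept for immediate carving */
    if (tls == NULL) return NULL;
    /* Secondary prefill: stock the pool so the next misses become hits. */
    for (int i = 0; i < DEMO_PREFILL_BUDGET && p->count < DEMO_WARM_POOL_CAP; i++) {
        demo_slab_t *extra = demo_registry_refill();
        if (extra == NULL) break;
        p->slots[p->count++] = extra;
        p->prefilled++;
    }
    return tls;
}
```

In this toy model one miss is followed by up to three hits, instead of the pool oscillating between 0 and 1 entries.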

Results (1M Random Mixed 256B allocations):
- Before: C7 hits=1, misses=3976, hit_rate=0.0%
- After:  C7 hits=3929, misses=3143, hit_rate=55.6%
- Throughput: 4.055M ops/s (maintained vs 4.07M baseline)
- Stability: Consistent 55.6% hit rate at 5M allocations (4.102M ops/s)
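
The hit rates quoted above follow from hits / (hits + misses). A small helper, with a hypothetical name, makes the arithmetic explicit:

```c
#include <assert.h>
#include <stdint.h>

/* Hit rate in percent; returns 0.0 when there were no lookups at all.
 * demo_hit_rate_pct is a hypothetical helper, not a HAKMEM function. */
double demo_hit_rate_pct(uint64_t hits, uint64_t misses) {
    assert(hits <= UINT64_MAX - misses);  /* guard against overflow of the sum */
    uint64_t total = hits + misses;
    if (total == 0) return 0.0;
    return 100.0 * (double)hits / (double)total;
}
```

With the "After" counters, 3929 / (3929 + 3143) = 3929 / 7072, which rounds to 55.6%. With the "Before" counters, 1 / 3977 is about 0.03%, which rounds to 0.0%.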

Performance Impact:
- No regression: throughput remained stable at ~4.1M ops/s
- Registry scan avoided in 55.6% of cache misses (significant savings)
- Warm pool now functioning as intended with strong locality

Configuration:
- TINY_WARM_POOL_MAX_PER_CLASS increased from 4 to 16 to support prefill
- Prefill budget hardcoded to 3 (could be exposed through an env var later if tuning is needed)
- All statistics always compiled, ENV-gated printing via HAKMEM_WARM_POOL_STATS=1
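
A minimal sketch of ENV-gated printing is shown below. Only the variable name HAKMEM_WARM_POOL_STATS comes from this commit message. The demo_* helpers and the exact parsing rule (on when the value is non-empty and does not start with '0') are assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed semantics: enabled when the value is non-empty and does not start with '0'. */
int demo_flag_is_on(const char *value) {
    return value != NULL && value[0] != '\0' && value[0] != '0';
}

/* Reads the environment on every call; a real allocator would likely cache the result. */
int demo_env_flag(const char *name) {
    assert(name != NULL);
    return demo_flag_is_on(getenv(name));
}

/* The counters are always maintained; only the reporting is gated. */
int demo_should_print_warm_pool_stats(void) {
    return demo_env_flag("HAKMEM_WARM_POOL_STATS");
}
```

Keeping the counters unconditional and gating only the output means a benchmark can be re-run with statistics enabled without rebuilding.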

Next Steps:
- Monitor for further optimization opportunities (prefill budget tuning)
- Consider adaptive prefill budget based on class-specific hit rates
- Validate at larger allocation counts (10M+, pending the registry size fix)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-04 23:31:54 +09:00


HAKMEM Architectural Restructuring Analysis - Complete Index

2025-12-04


📋 Document Overview

This is your complete guide to the HAKMEM architectural restructuring analysis and warm pool implementation proposal. Start here to navigate all documents.


🎯 Quick Start (5 minutes)

Read this first:

  1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md (the executive summary)

Then decide:

  • Should we implement warm pool? ✓ YES, low risk, +40-50% gain
  • Do we have time? ✓ YES, 2-3 days
  • Is it worth it? ✓ YES, quick ROI

📚 Document Structure

Level 1: Executive Summary (START HERE)

File: RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
Length: ~3,000 words
Time to read: 15-20 minutes
Audience: Project managers, decision makers
Contains:

  • High-level problem analysis
  • Warm pool concept overview
  • Performance expectations
  • Decision framework
  • Timeline and effort estimates

Level 2: Architecture & Design (FOR ARCHITECTS)

File: WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
Length: ~3,500 words
Time to read: 20-30 minutes
Audience: System architects, senior engineers
Contains:

  • Visual diagrams of warm pool concept
  • Data flow analysis
  • Performance modeling with numbers
  • Comparison: current vs proposed vs optional
  • Risk analysis and mitigation
  • Implementation phases explained

Level 3: Implementation Guide (FOR DEVELOPERS)

File: WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
Length: ~2,500 words
Time to read: 30-45 minutes (while implementing)
Audience: Developers, implementation engineers
Contains:

  • Step-by-step code changes
  • Code snippets (copy-paste ready)
  • Testing checklist
  • Debugging guide
  • Common pitfalls and solutions
  • Build & test commands

Level 4: Deep Technical Analysis (FOR REFERENCE)

File: ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
Length: ~5,000 words
Time to read: 45-60 minutes
Audience: Technical leads, code reviewers
Contains:

  • Current architecture in detail
  • Bottleneck analysis
  • Three-tier design specification
  • Implementation plan with phases
  • Risk assessment
  • Integration checklist
  • Success metrics

🗺️ Reading Paths

Path 1: Decision Maker (15 minutes)

1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
   ↓ Read "Key Findings" section
   ↓ Read "Decision Framework"
   ↓ Ready to approve/reject

Path 2: Architect (45 minutes)

1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
   ↓ Full document
2. WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
   ↓ Focus on "Implementation Complexity vs Gain"
   ↓ Understand phases and trade-offs

Path 3: Developer (2-3 hours including implementation)

1. RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
   ↓ Skim entire document
2. WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
   ↓ Understand overall architecture
3. WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
   ↓ Follow step-by-step
   ↓ Implement code changes
   ↓ Run tests
4. ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
   ↓ Reference for edge cases
   ↓ Review integration checklist

Path 4: Code Reviewer (60 minutes)

1. ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
   ↓ "Implementation Plan" section
   ↓ Understand what changes are needed
2. WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
   ↓ Section "Step 3" through "Step 6"
   ↓ Verify code changes against checklist
3. Code inspection
   ↓ Verify warm pool operations (thread safety, correctness)
   ↓ Verify integration points (cache refill, cleanup)

🎯 Key Decision Points

Should We Implement Warm Pool?

Decision Checklist:

  • Is +40-50% performance improvement valuable? (YES → Proceed)
  • Do we have 2-3 days to spend? (YES → Proceed)
  • Is low risk acceptable? (YES → Proceed)
  • Can we commit to testing/profiling? (YES → Proceed)

Conclusion: If all YES → IMPLEMENT PHASE 1

What About Phase 2/3?

Phase 2 (Advanced Optimizations):

  • Effort: 1-2 weeks
  • Gain: Additional +20-30%
  • Decision: Implement AFTER Phase 1 if performance still insufficient

Phase 3 (Architectural Redesign):

  • Effort: 3-4 weeks
  • Gain: Marginal +100% (diminishing returns)
  • Decision: NOT RECOMMENDED (defer unless critical)

📊 Performance Summary

Current Performance

Random Mixed:  1.06M ops/s
  - Bottleneck: Registry scan on cache miss (O(N), expensive)
  - Profile: 70.4M cycles per 1M allocations
  - Gap to Tiny Hot: 83x

After Phase 1 (Warm Pool)

Expected:      1.5M+ ops/s  (+40-50%)
  - Improvement: Registry scan avoided on warm pool hits (target: 90% hit rate)
  - Profile: ~45-50M cycles (30% reduction)
  - Gap to Tiny Hot: Still ~50x (architectural)

After Phase 2 (If Done)

Estimated:     1.8-2.0M ops/s  (+70-90%)
  - Additional improvements from lock-free pools, batched tier checks
  - Gap to Tiny Hot: Still ~40x

Why Not 10x?

Gap to Tiny Hot (89M ops/s) is ARCHITECTURAL:
  - 256 size classes (Tiny Hot has 1)
  - 7,600 page faults (unavoidable)
  - Working set requirements (memory bound)
  - Routing overhead (necessary for correctness)

Realistic ceiling: 2.0-2.5M ops/s (2-2.5x improvement max)
This is NORMAL, not a bug. Different workload patterns.

🔧 Implementation Overview

Files to Create:

  • core/front/tiny_warm_pool.h (NEW, ~80 lines)

Files to Modify:

  • core/front/tiny_unified_cache.h (add warm pool pop, ~50 lines)
  • core/front/malloc_tiny_fast.h (init warm pool, ~20 lines)
  • core/hakmem_super_registry.h or similar (cleanup integration, ~15 lines)

Total: ~300 lines of code

Timeline: 2-3 developer-days

Testing:

  1. Unit tests for warm pool operations
  2. Benchmark Random Mixed (target: 1.5M+ ops/s)
  3. Regression tests for other workloads
  4. Profiling to verify hit rate (target: > 90%)

Phase 2: Advanced Optimizations (OPTIONAL)

See WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md section "Implementation Phases"


Success Criteria

Phase 1 Success Metrics

| Metric | Target | Measurement |
| --- | --- | --- |
| Random Mixed ops/s | 1.5M+ | bench_allocators_hakmem |
| Warm pool hit rate | > 90% | Add debug counters |
| Tiny Hot regression | 0% | Run Tiny Hot benchmark |
| Memory overhead | < 200KB/thread | Profile TLS usage |
| All tests pass | 100% | Run test suite |

🚀 How to Get Started

For Project Managers

  1. Read: RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
  2. Approve: Phase 1 implementation
  3. Assign: Developer and 2-3 days
  4. Schedule: Follow-up in 4 days

For Architects

  1. Read: WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
  2. Review: ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md
  3. Approve: Implementation approach
  4. Plan: Optional Phase 2 after Phase 1

For Developers

  1. Read: WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
  2. Start: Step 1 (create tiny_warm_pool.h)
  3. Follow: Steps 2-6 in order
  4. Test: After each step
  5. Reference: ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md for edge cases

For QA/Testers

  1. Read: "Testing Checklist" in WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md
  2. Prepare: Benchmark infrastructure (if not ready)
  3. Execute: Tests after implementation
  4. Validate: Performance metrics (target: 1.5M+ ops/s)

📞 FAQ

Q: How long will this take?

A: 2-3 developer-days for Phase 1. 1-2 weeks for Phase 2 (optional).

Q: What's the risk level?

A: Low. Warm pool is additive. Fallback to registry scan always works.

Q: Can we reach 10x performance?

A: No. That's architectural. Realistic gain: 2-2.5x maximum.

Q: Do we need to rewrite the entire allocator?

A: No. Phase 1 is ~300 lines, minimal disruption.

Q: Will warm pool work with multithreading?

A: Yes. It's thread-local, so no locks needed.

Q: What if we implement Phase 1 and it doesn't work?

A: The warm pool can be disabled (zero overhead), leaving a full fallback to the registry scan.

Q: Should we plan Phase 2 now or after Phase 1?

A: After Phase 1. Measure first, then decide whether more optimization is needed.


In RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md

  • Key Findings: Performance analysis
  • Solution Overview: Warm pool concept
  • Why This Works: Technical justification
  • Implementation Scope: Phases overview
  • Performance Model: Numbers and estimates
  • Decision Framework: Should we do it?
  • Next Steps: Timeline and actions

In WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md

  • The Core Problem: What's slow
  • Warm Pool Solution: How it works
  • Performance Model: Before/after numbers
  • Warm Pool Data Flow: Visual explanation
  • Implementation Phases: Effort vs gain
  • Safety & Correctness: Thread safety analysis
  • Success Metrics: What to measure

In WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md

  • Step-by-Step Implementation: Code changes
  • Testing Checklist: What to verify
  • Build & Test: Commands to run
  • Debugging Tips: Common issues
  • Success Criteria: Acceptance tests
  • Implementation Checklist: Verification items

In ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md

  • Current Architecture: Existing design
  • Performance Bottlenecks: Root causes
  • Three-Tier Architecture: Proposed design
  • Implementation Plan: All phases
  • Risk Assessment: Potential issues
  • Integration Checklist: All tasks
  • Files to Create/Modify: Complete list

📈 Metrics Dashboard

Before Implementation

Random Mixed:    1.06M ops/s    [BASELINE]
CPU cycles:      70.4M          [BASELINE]
L1 misses:       763K           [BASELINE]
Page faults:     7,674          [BASELINE]
Warm pool hits:  N/A            [N/A]

After Phase 1 (Target)

Random Mixed:    1.5M ops/s     [+40-50%]
CPU cycles:      45-50M         [30% reduction]
L1 misses:       Similar        [Unchanged]
Page faults:     7,674          [Unchanged]
Warm pool hits:  > 90%          [Success]

🎓 Key Concepts Explained

Warm Pool

Per-thread cache of pre-allocated SuperSlabs. Eliminates registry scan on cache miss.

Registry Scan

Linear search through per-class registry to find HOT SuperSlab. Expensive (50-100 cycles).

Cache Miss

When Unified Cache (TLS) is empty. Happens ~1-5% of the time.

Three-Tier Architecture

HOT (Unified Cache) + WARM (Warm Pool) + COLD (Full allocation)

Thread-Local Storage (__thread)

Per-thread data, no synchronization needed. Perfect for warm pools.
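
As a rough illustration of why no locking is needed, here is a minimal per-thread LIFO using __thread (a GCC/Clang extension). The demo_* names are hypothetical, and the sketch uses a single pool rather than the one-pool-per-size-class layout the real warm pool would need.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_WARM_CAP 16  /* illustrative capacity per thread */

/* Each thread gets its own copy of these variables, so no locks are required. */
static __thread void *t_demo_slots[DEMO_WARM_CAP];
static __thread int   t_demo_count;

/* Returns 1 on success, 0 when the pool is full (caller falls back to the slow path). */
int demo_warm_push(void *slab) {
    assert(slab != NULL);  /* NULL is reserved as the "empty" result of pop */
    if (t_demo_count >= DEMO_WARM_CAP) return 0;
    t_demo_slots[t_demo_count++] = slab;
    return 1;
}

/* Returns the most recently pushed slab, or NULL when the pool is empty. */
void *demo_warm_pop(void) {
    if (t_demo_count == 0) return NULL;
    return t_demo_slots[--t_demo_count];
}
```

Both operations are a bounds check plus an array access, which is what makes a warm pool hit so much cheaper than a registry scan.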

Batch Amortization

Spreading cost over multiple operations. E.g., 64 objects share SuperSlab lookup cost.
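
The arithmetic behind amortization is a single division. The helper below is a hypothetical illustration. Plugging in the 50-100 cycle registry scan figure from this glossary as the fixed cost is an assumption made for the example, not a measured result.

```c
#include <assert.h>

/* Per-object share of a fixed cost that is paid once per batch. */
double demo_amortized_cycles(double fixed_cycles, unsigned batch_size) {
    assert(batch_size > 0);
    return fixed_cycles / (double)batch_size;
}
```

For example, a 100-cycle lookup shared by a batch of 64 objects costs 100 / 64 = 1.5625 cycles per object.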

Tier System

Classification of SuperSlabs: HOT (>25% used), DRAINING (≤25%), FREE (0%)
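
The thresholds above can be expressed as a small classifier. The enum and function names are hypothetical, and treating a zero-capacity slab as FREE is an assumption; only the 25% and 0% boundaries come from the definition above.

```c
#include <assert.h>

typedef enum { DEMO_TIER_FREE, DEMO_TIER_DRAINING, DEMO_TIER_HOT } demo_tier_t;

/* Classifies a slab by utilization: HOT (>25% used), DRAINING (<=25%), FREE (0%). */
demo_tier_t demo_classify_tier(unsigned used, unsigned capacity) {
    assert(used <= capacity);
    if (capacity == 0 || used == 0) return DEMO_TIER_FREE;
    /* used / capacity > 25% is equivalent to 4 * used > capacity (integer-only). */
    if (4ULL * used > capacity) return DEMO_TIER_HOT;
    return DEMO_TIER_DRAINING;
}
```

Note that exactly 25% utilization falls in DRAINING, since HOT requires strictly more than 25%.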


🔄 Review & Approval Process

Step 1: Executive Review (15 mins)

  • Read RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
  • Approve Phase 1 scope and timeline
  • Assign developer resources

Step 2: Architecture Review (30 mins)

  • Review WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md
  • Approve design and integration points
  • Confirm risk mitigation strategies

Step 3: Implementation Review (During coding)

  • Use WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md for step-by-step verification
  • Check against ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md Integration Checklist
  • Verify thread safety, correctness

Step 4: Testing & Validation (After coding)

  • Run full test suite (all tests pass)
  • Benchmark Random Mixed (1.5M+ ops/s)
  • Measure warm pool hit rate (> 90%)
  • Verify no regressions (Tiny Hot, etc.)

📝 File Manifest

Analysis Documents (This Package)

  • ANALYSIS_INDEX_20251204.md ← YOU ARE HERE
  • RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md (Executive summary)
  • WARM_POOL_ARCHITECTURE_SUMMARY_20251204.md (Architecture guide)
  • WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md (Code guide)
  • ARCHITECTURAL_RESTRUCTURING_PROPOSAL_20251204.md (Deep analysis)

Previous Session Documents

  • FINAL_SESSION_REPORT_20251204.md (Performance profiling results)
  • LAZY_ZEROING_IMPLEMENTATION_RESULTS_20251204.md (Why lazy zeroing failed)
  • COMPREHENSIVE_PROFILING_ANALYSIS_20251204.md (Initial analysis)
  • Plus 6+ analysis reports from profiling session

Code to Create (Phase 1)

  • core/front/tiny_warm_pool.h ← NEW FILE

Code to Modify (Phase 1)

  • core/front/tiny_unified_cache.h
  • core/front/malloc_tiny_fast.h
  • core/hakmem_super_registry.h or equivalent

Summary

What We Found:

  • HAKMEM has a clear bottleneck: the registry scan on cache miss
  • The warm pool is an elegant solution that fits the existing architecture

What We Propose:

  • Phase 1: Implement warm pool (~300 lines, 2-3 days)
  • Expected: +40-50% performance (1.06M → 1.5M+ ops/s)
  • Risk: Low (fallback always works)

What You Should Do:

  1. Read RESTRUCTURING_ANALYSIS_COMPLETE_20251204.md
  2. Approve Phase 1 implementation
  3. Assign 1 developer for 2-3 days
  4. Follow WARM_POOL_IMPLEMENTATION_GUIDE_20251204.md for implementation
  5. Benchmark and measure improvement

Next Review:

  • Check back in 4 days for Phase 1 completion
  • Measure performance improvement
  • Decide on Phase 2 (optional)

Status: Analysis complete and ready for implementation

Generated by: Claude Code
Date: 2025-12-04
Documents: 5 comprehensive guides + index
Ready for: Developer implementation, architecture review, performance validation

Recommendation: PROCEED with Phase 1 implementation