Phase 6-7: Dual Free Lists (Phase 2) - Mixed results

Implementation:
Separate alloc/free paths to reduce cache line bouncing (mimalloc's strategy).

Changes:
1. Added g_tiny_fast_free_head[] - separate free staging area
2. Modified tiny_fast_alloc() - lazy migration from free_head
3. Modified tiny_fast_free() - push to free_head (separate cache line)
4. Modified tiny_fast_drain() - drain from free_head

Key design (inspired by mimalloc):
- alloc_head: Hot allocation path (g_tiny_fast_cache)
- free_head: Local frees staging (g_tiny_fast_free_head)
- Migration: Pointer swap when alloc_head empty (zero-cost batching)
- Benefit: alloc/free touch different cache lines → reduce bouncing

Results (Larson 2s 8-128B 1024):
- Phase 3 baseline: ST 0.474M, MT 1.712M ops/s
- Phase 2: ST 0.600M, MT 1.624M ops/s
- Change: **+27% ST, -5% MT** ⚠️

Analysis - Mixed results:
- Single-thread: +27% improvement
  - Better cache locality (alloc/free separated)
  - No contention; pure memory-access-pattern win
- Multi-thread: -5% regression (expected +30-50%)
  - Migration logic overhead (extra branches)
  - Dual arrays increase TLS size → more cache misses?
  - Pointer-swap cost on the migration path
  - May not help Larson's specific access pattern

Comparison to system malloc:
- Current: 1.624M ops/s (MT)
- System: ~7.2M ops/s (MT)
- **Gap: Still 4.4x slower**

Key insights:
1. mimalloc's dual free lists help with *cross-thread* frees
2. Larson may be mostly *same-thread* frees → less benefit
3. Migration overhead > cache line bouncing reduction
4. ST improvement shows memory locality matters
5. Need to profile actual malloc/free patterns in Larson

Why mimalloc succeeds but HAKMEM doesn't:
- mimalloc has sophisticated remote free queue (lock-free MPSC)
- HAKMEM's simple dual lists don't handle cross-thread well
- Larson's workload may differ from mimalloc's target benchmarks

Next considerations:
- Verify Larson's same-thread vs cross-thread free ratio
- Consider combining all 3 phases (may have synergy)
- Profile with actual counters (malloc vs free hotspots)
- May need fundamentally different approach
Author: Claude
Date: 2025-11-05 05:35:06 +00:00
Parent: e3514e7fa9
Commit: 3429ed4457
2 changed files with 61 additions and 13 deletions


@@ -14,6 +14,13 @@ __thread void* g_tiny_fast_cache[TINY_FAST_CLASS_COUNT];
 __thread uint32_t g_tiny_fast_count[TINY_FAST_CLASS_COUNT];
 __thread int g_tiny_fast_initialized = 0;
+// ========== Phase 6-7: Dual Free Lists (Phase 2) ==========
+// Inspired by mimalloc's local/remote split design
+// Separate alloc/free paths to reduce cache line bouncing
+__thread void* g_tiny_fast_free_head[TINY_FAST_CLASS_COUNT];      // Free staging area
+__thread uint32_t g_tiny_fast_free_count[TINY_FAST_CLASS_COUNT];  // Free count
 // ========== External References ==========
 // External references to existing Tiny infrastructure (from hakmem_tiny.c)
@@ -108,7 +115,12 @@ void tiny_fast_drain(int class_idx) {
     g_tiny_fast_drain_count++;
-    // Drain half of the cache to Magazine/SuperSlab
+    // ========================================================================
+    // Phase 6-7: Drain from free_head (Phase 2)
+    // Since frees go to free_head, drain from there when capacity exceeded
+    // ========================================================================
+    // Drain half of the free_head to Magazine/SuperSlab
     // TODO: For now, we just reduce the count limit
     // In a full implementation, we'd push blocks back to Magazine freelist
@@ -116,12 +128,12 @@ void tiny_fast_drain(int class_idx) {
     // A full implementation would return blocks to SuperSlab freelist
     uint32_t target = TINY_FAST_CACHE_CAP / 2;
-    while (g_tiny_fast_count[class_idx] > target) {
-        void* ptr = g_tiny_fast_cache[class_idx];
+    while (g_tiny_fast_free_count[class_idx] > target) {
+        void* ptr = g_tiny_fast_free_head[class_idx];
         if (!ptr) break;
-        g_tiny_fast_cache[class_idx] = *(void**)ptr;
-        g_tiny_fast_count[class_idx]--;
+        g_tiny_fast_free_head[class_idx] = *(void**)ptr;
+        g_tiny_fast_free_count[class_idx]--;
         // TODO: Return to Magazine/SuperSlab
         // For now, we'll just re-push it (no-op, but prevents loss)