Add malloc routing analysis and refill success tracking

### Changes:
- **Routing Counters**: Added per-thread counters in hakmem.c to track:
  - g_malloc_total_calls: Total malloc() invocations
  - g_malloc_tiny_size_match: Calls within tiny size range (<=128B)
  - g_malloc_fast_path_tried: Calls that attempted fast path
  - g_malloc_fast_path_null: Fast path returned NULL
  - g_malloc_slow_path: Calls routed to slow path

- **Refill Success Tracking**: Added counters in tiny_fastcache.c (see the sketch after this list):
  - g_refill_success_count: Full batch (16 blocks)
  - g_refill_partial_count: Partial batch (<16 blocks)
  - g_refill_fail_count: Zero blocks allocated
  - g_refill_total_blocks: Total blocks across all refills
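
  A minimal sketch of how the refill path could drive these counters. Only the counter names come from this commit; the function name, batch constant, and block source below are assumptions for illustration:

  ```c
  /* Sketch only: counter names are from the commit; everything else
   * (function name, batch constant, block source) is assumed. */
  #include <stdint.h>
  #include <stdlib.h>

  #define TINY_FAST_REFILL_BATCH 16               /* commit reports 16-block refills */

  __thread uint64_t g_refill_success_count = 0;   /* full batch (16 blocks)        */
  __thread uint64_t g_refill_partial_count = 0;   /* partial batch (<16 blocks)    */
  __thread uint64_t g_refill_fail_count    = 0;   /* zero blocks allocated         */
  __thread uint64_t g_refill_total_blocks  = 0;   /* blocks across all refills     */

  /* Stand-in for the real block source (the actual code refills from hakmem's slabs). */
  static void* refill_block_source(size_t size) { return malloc(size); }

  int tiny_fast_refill(void** slots, size_t block_size) {
      int got = 0;
      while (got < TINY_FAST_REFILL_BATCH) {
          void* blk = refill_block_source(block_size);
          if (!blk) break;
          slots[got++] = blk;
      }
      /* Classify the outcome along exactly the three counters above. */
      if (got == TINY_FAST_REFILL_BATCH) g_refill_success_count++;
      else if (got > 0)                  g_refill_partial_count++;
      else                               g_refill_fail_count++;
      g_refill_total_blocks += (uint64_t)got;
      return got;
  }
  ```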

- **Profile Output Enhanced**: tiny_fast_print_profile() now shows (sketched after this list):
  - Routing statistics (which path allocations take)
  - Refill success/failure breakdown
  - Average blocks per refill
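
  The routing/refill reporting could be computed roughly as follows. Only the counter names and tiny_fast_print_profile() itself appear in this commit; the helper name, format strings, and percentage helper are assumptions:

  ```c
  /* Sketch of the added reporting; the output format is illustrative, not the actual one. */
  #include <stdio.h>
  #include <stdint.h>

  extern __thread uint64_t g_malloc_total_calls, g_malloc_tiny_size_match,
                           g_malloc_fast_path_tried, g_malloc_fast_path_null,
                           g_malloc_slow_path;
  extern __thread uint64_t g_refill_success_count, g_refill_partial_count,
                           g_refill_fail_count, g_refill_total_blocks;

  static double pct(uint64_t part, uint64_t whole) {
      return whole ? 100.0 * (double)part / (double)whole : 0.0;
  }

  /* Hypothetical helper called from tiny_fast_print_profile(). */
  void tiny_fast_print_routing_stats(void) {
      printf("malloc calls:      %llu\n", (unsigned long long)g_malloc_total_calls);
      printf("  tiny-size match: %llu (%.1f%%)\n",
             (unsigned long long)g_malloc_tiny_size_match,
             pct(g_malloc_tiny_size_match, g_malloc_total_calls));
      printf("  fast path tried: %llu, returned NULL: %llu\n",
             (unsigned long long)g_malloc_fast_path_tried,
             (unsigned long long)g_malloc_fast_path_null);
      printf("  slow path:       %llu\n", (unsigned long long)g_malloc_slow_path);

      uint64_t refills = g_refill_success_count + g_refill_partial_count + g_refill_fail_count;
      printf("refills: %llu (full %llu / partial %llu / fail %llu), avg blocks/refill: %.2f\n",
             (unsigned long long)refills,
             (unsigned long long)g_refill_success_count,
             (unsigned long long)g_refill_partial_count,
             (unsigned long long)g_refill_fail_count,
             refills ? (double)g_refill_total_blocks / (double)refills : 0.0);
  }
  ```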

### Key Findings:
✅ Fast path routing: 100% success (20,479/20,480 calls per thread)
✅ Refill success: 100% (1,285 refills, all 16 blocks each)
⚠️  Performance: Still only 1.68M ops/s vs System's 8.06M (20.8%)

**Root Cause Confirmed**:
- NOT a routing problem (100% reach fast path)
- NOT a refill failure (100% success)
- IS a structural performance issue (2,418 cycles avg for malloc)

**Bottlenecks Identified**:
1. Fast path cache hits: ~2,418 cycles (vs tcache ~100 cycles)
2. Refill operations: ~39,938 cycles (expensive but infrequent)
3. Overall throughput: 4.8x slower than system malloc (rough consistency check below)
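
As a rough consistency check (the ~4 GHz core clock is an assumption, not part of the measurements), the per-call cycle cost lines up with the observed throughput:

```
4.0e9 cycles/s ÷ 2,418 cycles per malloc ≈ 1.65M ops/s   (measured: 1.68M ops/s)
```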

**Next Steps** (per LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md):
- Option B: Refill efficiency (batch allocation from SuperSlab)
- Option C: Ultra-fast path redesign (tcache-equivalent; see the sketch below)
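
For Option C, the target shape is roughly a tcache-style per-thread, per-size-class free list where the hot path is a single predictable branch plus a pointer pop. A minimal sketch of that direction follows; the names, class layout, and slow_refill_and_alloc() helper are illustrative assumptions, not hakmem's actual design:

```c
/* Sketch of a tcache-equivalent ultra-fast path (Option C direction).
 * Only the goal (~1 hot-path branch, tcache-style) comes from the analysis. */
#include <stddef.h>

#define TINY_CLASSES 8   /* assumed: 16B..128B in 16B steps */

typedef struct tiny_free_block { struct tiny_free_block* next; } tiny_free_block;

/* Per-thread, per-size-class LIFO free lists (the "tcache"). */
static __thread tiny_free_block* t_free_list[TINY_CLASSES];

/* Hypothetical miss handler: refills the list from the slow path, returns one block. */
void* slow_refill_and_alloc(size_t size);

static inline void* ultra_fast_alloc(size_t size) {
    size_t cls = (size - 1) >> 4;              /* 1..128B -> class 0..7, branch-free */
    tiny_free_block* blk = t_free_list[cls];
    if (__builtin_expect(blk != NULL, 1)) {    /* single hot-path branch */
        t_free_list[cls] = blk->next;          /* pop the list head */
        return blk;
    }
    return slow_refill_and_alloc(size);        /* rare miss: go refill */
}

static inline void ultra_fast_free(void* p, size_t size) {
    size_t cls = (size - 1) >> 4;
    tiny_free_block* blk = (tiny_free_block*)p;
    blk->next = t_free_list[cls];              /* push back onto the class list */
    t_free_list[cls] = blk;
}
```

Relative to the current fast path, this reduces the per-allocation work to roughly a load, a compare, and a store, which is the kind of cost the tcache ~100-cycle figure reflects.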

Related: LARSON_PERFORMANCE_ANALYSIS_2025_11_05.md
Author: Claude
Date:   2025-11-05 05:56:02 +00:00
Parent: 872622b78b
Commit: 31af3eab27
2 changed files with 69 additions and 1 deletion

hakmem.c:

@@ -1247,7 +1247,16 @@ void* realloc(void* ptr, size_t size) {
 #else
 // malloc wrapper - intercepts system malloc() calls
+// Debug counters for malloc routing (Phase 6-6 analysis)
+__thread uint64_t g_malloc_total_calls = 0;
+__thread uint64_t g_malloc_tiny_size_match = 0;
+__thread uint64_t g_malloc_fast_path_tried = 0;
+__thread uint64_t g_malloc_fast_path_null = 0;
+__thread uint64_t g_malloc_slow_path = 0;
 void* malloc(size_t size) {
+    g_malloc_total_calls++;
     // ========================================================================
     // Phase 6-5: ULTRA-FAST PATH FIRST (mimalloc/tcache style)
     // Inspired by research: tcache has 0 branches, mimalloc has 1-2 branches
@@ -1256,6 +1265,7 @@ void* malloc(size_t size) {
 #ifdef HAKMEM_TINY_FAST_PATH
     // Branch 1: Size check (predicted taken for tiny allocations)
     if (__builtin_expect(size <= TINY_FAST_THRESHOLD, 1)) {
+        g_malloc_tiny_size_match++;
         extern void* tiny_fast_alloc(size_t);
         extern void tiny_fast_init(void);
         extern __thread int g_tiny_fast_initialized;
@@ -1263,10 +1273,12 @@ void* malloc(size_t size) {
         // Branch 2: Initialization check (predicted taken after first call)
         if (__builtin_expect(g_tiny_fast_initialized, 1)) {
             // Branch 3: Cache hit check (predicted taken ~90% of time)
+            g_malloc_fast_path_tried++;
             void* ptr = tiny_fast_alloc(size);
             if (__builtin_expect(ptr != NULL, 1)) {
                 return ptr; // ✅ FAST PATH: 3 branches total (vs tcache's 0, mimalloc's 1-2)
             }
+            g_malloc_fast_path_null++;
             // Cache miss: fall through to slow path refill
         } else {
             // Cold path: initialize once per thread (rare)
@@ -1279,6 +1291,7 @@ void* malloc(size_t size) {
     // ========================================================================
     // SLOW PATH: All guards moved here (only executed on fast path miss)
     // ========================================================================
+    g_malloc_slow_path++;
     // Recursion guard: if we're inside the allocator already, fall back to libc
     if (g_hakmem_lock_depth > 0) {