Refactor: Unified allocation macros + header validation

1. Archive unused backend files (ss_legacy/unified_backend_box.c/h)
   - These files were not linked in the build
   - Moved to archive/ to reduce confusion

2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
   - Replaces superslab_return_block() function
   - Consistent with existing HAK_RET_ALLOC pattern
   - Single source of truth for header writing
   - Defined in hakmem_tiny_superslab_internal.h

3. Added header validation on TLS SLL push
   - Detects blocks pushed without proper header
   - Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
   - Always on in debug builds
   - Logs first 10 violations with backtraces

Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design

Note: Larson bug still reproduces - header corruption occurs before push validation can catch it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 05:37:24 +09:00
parent 6ac6f5ae1b
commit 5582cbc22c
9 changed files with 1733 additions and 28 deletions
--- a/PHASE2_PERF_ANALYSIS.md
+++ b/PHASE2_PERF_ANALYSIS.md
@@ -0,0 +1,159 @@
 # HAKMEM Allocator - Phase 2 Performance Analysis
 ## Quick Summary
 | Metric | Phase 1 | Phase 2 | Change |
 |--------|---------|---------|--------|
 | **Throughput** | 72M ops/s | 79.8M ops/s | **+10.8%** ✓ |
 | Cycles | 78.6M | 72.2M | -8.1% ✓ |
 | Instructions | 167M | 153M | -8.4% ✓ |
 | Branches | 36M | 23M | **-36%** ✓ |
 | Branch Misses | 921K (2.56%) | 1.02M (4.43%) | +11% count, +73% rate ✗ |
 | L3 Cache Misses | 173K (9.28%) | 216K (10.28%) | +25% count, +11% rate ✗ |
 | dTLB Misses | N/A | 41 (0.01%) | **Excellent!** ✓ |
 ## Top 5 Hotspots (Phase 2, 628 samples)
 1. **malloc()** - 36.51% CPU time
   - Function overhead (prologue/epilogue): ~18%
   - Lock operations: 5.05%
   - Initialization checks: ~15%
 2. **main()** - 30.51% CPU time
   - Benchmark loop overhead (not allocator)
 3. **free()** - 19.66% CPU time
   - Lock operations: 3.29%
   - Cached variable checks: ~15%
   - Function overhead: ~10%
 4. **clear_page_erms (kernel)** - 9.31% CPU time
   - Page fault handling
 5. **irqentry_exit_to_user_mode (kernel)** - 5.33% CPU time
   - Kernel exit overhead
 ## Phase 3 Optimization Targets (Ranked by Impact)
 ### 🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)
 **Target**: Reduce malloc/free from 56% → ~33% CPU time
 - Inline hot paths to eliminate function call overhead
 - Remove stats counters from production builds
 - Cache initialization state in TLS
 ### 🔥 Priority 2: Branch Optimization (Expected: +3-5%)
 **Target**: Reduce branch misses from 1.02M → <700K
 - Apply Profile-Guided Optimization (PGO)
 - Add LIKELY/UNLIKELY hints
 - Reduce branches in fast path from ~15 to 5-7
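As an illustration of the branch-hint item above, here is a minimal sketch of `LIKELY`/`UNLIKELY` built on the GCC/Clang builtin `__builtin_expect`. The `demo_node` list and `demo_pop()` are hypothetical stand-ins and are not taken from the HAKMEM sources; the point is only that the empty-list case is marked cold so the hot path stays straight-line.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang builtin; degrades to a plain expression on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

/* Hypothetical singly linked free list, for illustration only. */
typedef struct demo_node { struct demo_node* next; } demo_node;

/* Pop the list head; the empty case is hinted as the cold path. */
static inline void* demo_pop(demo_node** head) {
    assert(head != NULL);
    demo_node* n = *head;
    if (UNLIKELY(n == NULL)) {
        return NULL;        /* cold: caller would fall back to a slow refill */
    }
    *head = n->next;        /* hot: no further branches */
    return n;
}
```

Hints only influence code layout and static prediction, so any gain should be confirmed with `perf stat` rather than assumed.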
 ### 🔥 Priority 3: Cache Optimization (Expected: +2-4%)
 **Target**: Reduce L3 misses from 216K → <180K
 - Align hot structures to cache lines
 - Add prefetching in allocation path
 - Compact metadata structures
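The alignment and prefetching items above could look roughly like the sketch below. It assumes 64-byte cache lines, and `demo_slab_meta`, `demo_block`, and `demo_pop_prefetch()` are hypothetical names rather than HAKMEM types. `__builtin_prefetch` is the GCC/Clang builtin; its second argument selects read (0) or write (1) and the third is a locality hint from 0 to 3.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64  /* assumption: 64-byte cache lines */

/* Hypothetical hot metadata, aligned so it occupies exactly one line. */
typedef struct {
    alignas(DEMO_CACHE_LINE) void* freelist;
    uint16_t used;
    uint16_t capacity;
} demo_slab_meta;

static_assert(sizeof(demo_slab_meta) == DEMO_CACHE_LINE,
              "hot metadata should fill exactly one cache line");

typedef struct demo_block { struct demo_block* next; } demo_block;

/* Pop one block and prefetch the next one for the following allocation. */
static inline void* demo_pop_prefetch(demo_slab_meta* m) {
    demo_block* b = m->freelist;
    if (b == NULL) {
        return NULL;
    }
    m->freelist = b->next;
#if defined(__GNUC__) || defined(__clang__)
    if (b->next != NULL) {
        __builtin_prefetch(b->next, 1, 3);  /* will be written soon, keep in cache */
    }
#endif
    m->used++;
    return b;
}
```

Aligning to a full line trades memory for fewer shared lines, which pulls against the "compact metadata" item, so the two are worth measuring together.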
 ### 🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)
 - Cache g_initialized/g_enable checks in TLS
 - Use constructor attributes more aggressively
 ### 🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)
 - Move stats to TLS, aggregate periodically
 - Eliminate atomic ops from fast path
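One way to read "move stats to TLS, aggregate periodically" is sketched below: each thread bumps a plain thread-local counter and only touches the shared atomic once per batch. All names and the batch size of 1024 are assumptions chosen for illustration, not values from the HAKMEM code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_FLUSH_EVERY 1024u  /* assumed batch size */

static _Atomic uint64_t g_demo_alloc_total;         /* shared, touched rarely */
static _Thread_local uint32_t t_demo_alloc_pending; /* per-thread, no atomics */

/* Publish this thread's pending count to the shared total. */
static inline void demo_flush_stats(void) {
    if (t_demo_alloc_pending != 0) {
        atomic_fetch_add_explicit(&g_demo_alloc_total, t_demo_alloc_pending,
                                  memory_order_relaxed);
        t_demo_alloc_pending = 0;
    }
}

/* Fast path: one thread-local increment, one atomic op per batch. */
static inline void demo_count_alloc(void) {
    if (++t_demo_alloc_pending >= DEMO_FLUSH_EVERY) {
        demo_flush_stats();
    }
}

/* Read the shared total; it may lag each thread by up to one batch. */
static inline uint64_t demo_alloc_total(void) {
    return atomic_load_explicit(&g_demo_alloc_total, memory_order_relaxed);
}
```

The reported total is approximate until every thread has flushed, so a thread-exit hook or explicit `demo_flush_stats()` call is needed where exact numbers matter.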
 ### 🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)
 - Reduce TLS reads/writes from ~10 to ~4 per operation
 - Cache TLS values in registers
 ## Expected Phase 3 Results
 **Target Throughput**: 87-95M ops/s (+9-19% improvement)
 | Metric | Phase 2 | Phase 3 Target | Change |
 |--------|---------|----------------|--------|
 | Throughput | 79.8M ops/s | 87-95M ops/s | +9-19% |
 | malloc CPU | 36.51% | ~22% | -40% |
 | free CPU | 19.66% | ~11% | -44% |
 | Branch misses | 4.43% | <3% | -32% |
 | L3 cache misses | 10.28% | <8% | -22% |
 ## Key Insights
 ### ✅ What Worked in Phase 2
 1. **SuperSlab size increase** (64KB → 512KB): Dramatically reduced branches (-36%)
 2. **Amortized initialization**: memset overhead dropped from 6.41% → 1.77%
 3. **Virtual memory optimization**: TLB miss rate is excellent (0.01%)
 ### ❌ What Needs Work
 1. **Branch prediction**: Miss rate rose from 2.56% to 4.43% (+73%) despite 36% fewer branches
 2. **Cache pressure**: Larger SuperSlabs increased L3 misses
 3. **Function overhead**: malloc/free dominate CPU time (56%)
 ### 🤔 Surprising Findings
 1. **Cross-calling pattern**: malloc/free call each other 8-12% of the time
   - Thread-local cache flushing
   - Deferred release operations
   - May benefit from batching
 2. **Kernel overhead increased**: clear_page_erms went from 2.23% → 9.31%
   - May need page pre-faulting strategy
 3. **Main loop visible**: 30.51% CPU time
   - Benchmark overhead, not allocator
   - Real allocator overhead is ~56% (malloc + free)
 ## Files Generated
 - `perf_phase2_stats.txt` - perf stat -d output
 - `perf_phase2_symbols.txt` - Symbol-level hotspots
 - `perf_phase2_callgraph.txt` - Call graph analysis
 - `perf_phase2_detailed.txt` - Detailed counter breakdown
 - `perf_malloc_annotate.txt` - Assembly annotation for malloc()
 - `perf_free_annotate.txt` - Assembly annotation for free()
 - `perf_analysis_summary.txt` - Detailed comparison with Phase 1
 - `phase3_recommendations.txt` - Complete optimization roadmap
 ## How to Use This Data
 ### For Quick Reference
 ```bash
 cat perf_phase2_stats.txt        # See overall metrics
 cat perf_phase2_symbols.txt      # See top functions
 ```
 ### For Deep Analysis
 ```bash
 cat perf_malloc_annotate.txt     # See assembly-level hotspots in malloc
 cat perf_free_annotate.txt       # See assembly-level hotspots in free
 cat perf_analysis_summary.txt    # See Phase 1 vs Phase 2 comparison
 ```
 ### For Planning Phase 3
 ```bash
 cat phase3_recommendations.txt   # See ranked optimization opportunities
 ```
 ### To Re-run Analysis
 ```bash
 # Quick stat
 perf stat -d ./bench_random_mixed_hakmem 1000000 256 42
 # Detailed profiling
 perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
 perf report --stdio --no-children --sort symbol
 ```
 ## Next Steps
 1. **Week 1**: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
 2. **Week 2**: Apply PGO + branch hints (Expected: +3-5%)
 3. **Week 3**: Cache line alignment + prefetching (Expected: +2-4%)
 4. **Week 4**: TLS optimization + polish (Expected: +1-3%)
 **Total Expected**: +14-22% improvement (sum of the weekly estimates) → **Target: 91-97M ops/s**, above the 87-95M ops/s range given under Expected Phase 3 Results
 ---
 Generated: 2025-11-28
 Phase: 2 → 3 transition
 Baseline: 72M ops/s → Current: 79.8M ops/s → Target: 87-95M ops/s
--- a/core/box/ss_legacy_backend_box.c
+++ b/core/box/ss_legacy_backend_box.c
--- a/core/box/ss_legacy_backend_box.h
+++ b/core/box/ss_legacy_backend_box.h
--- a/core/box/ss_unified_backend_box.c
+++ b/core/box/ss_unified_backend_box.c
--- a/core/box/ss_unified_backend_box.h
+++ b/core/box/ss_unified_backend_box.h
--- a/core/box/tls_sll_box.h
+++ b/core/box/tls_sll_box.h
@@ -294,6 +294,40 @@ static inline bool tls_sll_push_impl(int class_idx, void* ptr, uint32_t capacity
        }
    } while (0);
 #if HAKMEM_TINY_HEADER_CLASSIDX
    // Validate header on push - detect blocks pushed without header write
    // Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 or in debug builds
    do {
        static int g_validate_hdr = -1;
        if (__builtin_expect(g_validate_hdr == -1, 0)) {
 #if !HAKMEM_BUILD_RELEASE
            g_validate_hdr = 1;  // Always on in debug
 #else
            const char* env = getenv("HAKMEM_TINY_SLL_VALIDATE_HDR");
            g_validate_hdr = (env && *env && *env != '0') ? 1 : 0;
 #endif
        }
        if (__builtin_expect(g_validate_hdr, 0)) {
            static _Atomic uint32_t g_tls_sll_push_bad_hdr = 0;
            uint8_t hdr = *(uint8_t*)ptr;
            uint8_t expected = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
            if (hdr != expected) {
                uint32_t n = atomic_fetch_add_explicit(&g_tls_sll_push_bad_hdr, 1, memory_order_relaxed);
                if (n < 10) {
                    fprintf(stderr,
                            "[TLS_SLL_PUSH_BAD_HDR] cls=%d base=%p got=0x%02x expect=0x%02x from=%s\n",
                            class_idx, ptr, hdr, expected, where ? where : "(null)");
                    void* bt[8];
                    int frames = backtrace(bt, 8);
                    backtrace_symbols_fd(bt, frames, fileno(stderr));
                    fflush(stderr);
                }
            }
        }
    } while (0);
 #endif
 #if !HAKMEM_BUILD_RELEASE
    // Minimal range guard before we touch memory.
    if (!validate_ptr_range(ptr, "tls_sll_push_base")) {
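For readers who want to try the push-time check in isolation, the sketch below reproduces its core logic outside the allocator: compute the expected header byte from the class index, compare it with the first byte of the block, and log only the first ten mismatches. The magic value `0xA0` and mask `0x0F` are assumed for illustration (the real `HEADER_MAGIC` and `HEADER_CLASS_MASK` are defined elsewhere in HAKMEM), and the backtrace step is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values, for illustration only. */
#define DEMO_HEADER_MAGIC      0xA0u
#define DEMO_HEADER_CLASS_MASK 0x0Fu
#define DEMO_MAX_REPORTS       10u

static _Atomic uint32_t g_demo_bad_hdr;  /* total mismatches seen */

/* Header byte a block of the given class is expected to carry. */
static inline uint8_t demo_expected_header(int class_idx) {
    return (uint8_t)(DEMO_HEADER_MAGIC |
                     ((unsigned)class_idx & DEMO_HEADER_CLASS_MASK));
}

/* Returns 1 if the block's first byte matches, 0 otherwise.
 * Only the first DEMO_MAX_REPORTS mismatches are logged. */
static int demo_check_header(const void* base, int class_idx) {
    uint8_t got = *(const uint8_t*)base;
    uint8_t want = demo_expected_header(class_idx);
    if (got == want) {
        return 1;
    }
    uint32_t n = atomic_fetch_add_explicit(&g_demo_bad_hdr, 1,
                                           memory_order_relaxed);
    if (n < DEMO_MAX_REPORTS) {
        fprintf(stderr, "[DEMO_BAD_HDR] cls=%d base=%p got=0x%02x expect=0x%02x\n",
                class_idx, (void*)base, got, want);
    }
    return 0;
}
```

As in the hunk above, the check only reports. It does not reject the push, which matches the commit note that corruption occurring before the push is seen but not prevented.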
--- a/core/hakmem_tiny_superslab.c.bak
+++ b/core/hakmem_tiny_superslab.c.bak
--- a/core/hakmem_tiny_superslab_internal.h
+++ b/core/hakmem_tiny_superslab_internal.h
@@ -26,6 +26,22 @@
 #include <pthread.h>
 #include <unistd.h>
 // ============================================================================
 // Unified Return Macros
 // ============================================================================
 // HAK_RET_ALLOC_BLOCK - Single exit point for SuperSlab allocations
 // Purpose: Ensures consistent header writing across all SuperSlab allocation paths
 // Usage: HAK_RET_ALLOC_BLOCK(class_idx, base_ptr);
 // Note: Must be used in function context (macro contains return statement)
 #if HAKMEM_TINY_HEADER_CLASSIDX
    #define HAK_RET_ALLOC_BLOCK(cls, base_ptr) \
        return tiny_region_id_write_header((base_ptr), (cls))
 #else
    #define HAK_RET_ALLOC_BLOCK(cls, base_ptr) \
        return (void*)(base_ptr)
 #endif
 // ============================================================================
 // Global Variables (defined in superslab_stats.c)
 // ============================================================================
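The return-macro pattern added in this hunk can be exercised standalone with the sketch below. `demo_write_header()` is a hypothetical stand-in for `tiny_region_id_write_header()`: it stores a one-byte tag at the block base and returns `base + 1`, matching the "user pointer" behaviour described in the removed `superslab_return_block()` comment. The tag encoding (`0xA0 | class`) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HEADERS_ENABLED 1  /* stand-in for HAKMEM_TINY_HEADER_CLASSIDX */

/* Hypothetical header writer: tag byte at base, user pointer at base + 1. */
static inline void* demo_write_header(void* base, int cls) {
    uint8_t* p = (uint8_t*)base;
    p[0] = (uint8_t)(0xA0u | ((unsigned)cls & 0x0Fu));  /* assumed encoding */
    return p + 1;
}

/* Same shape as HAK_RET_ALLOC_BLOCK: the macro itself contains `return`. */
#if DEMO_HEADERS_ENABLED
#  define DEMO_RET_ALLOC_BLOCK(cls, base_ptr) \
       return demo_write_header((base_ptr), (cls))
#else
#  define DEMO_RET_ALLOC_BLOCK(cls, base_ptr) \
       return (void*)(base_ptr)
#endif

/* Hypothetical allocation path: compute the block base, then exit via the macro. */
static void* demo_alloc(uint8_t* slab, size_t block_size, size_t index, int cls) {
    void* base = slab + index * block_size;
    DEMO_RET_ALLOC_BLOCK(cls, base);  /* returns from demo_alloc here */
}
```

Because the macro expands to a `return` statement, it can only appear as a statement in a function returning `void*`, and any code placed after it in the same block is unreachable.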
--- a/core/superslab_backend.c
+++ b/core/superslab_backend.c
@@ -5,31 +5,6 @@
 #include "hakmem_tiny_superslab_internal.h"
 /*
 * superslab_return_block() - Single exit point for all SuperSlab allocations
 *
 * Purpose: Ensures consistent header writing across all allocation paths.
 * This prevents bugs where headers are written in some paths but not others.
 *
 * Parameters:
 *   base      - Block start address from SuperSlab geometry
 *   class_idx - Tiny class index (0-7)
 *
 * Returns:
 *   User pointer (base + 1 if headers enabled, base otherwise)
 *
 * Header writing behavior:
 *   - If HAKMEM_TINY_HEADER_CLASSIDX=1: Writes header via tiny_region_id_write_header()
 *   - If HAKMEM_TINY_HEADER_CLASSIDX=0: Returns base directly (no header)
 */
 static inline void* superslab_return_block(void* base, int class_idx) {
 #if HAKMEM_TINY_HEADER_CLASSIDX
    return tiny_region_id_write_header(base, class_idx);
 #else
    return (void*)base;
 #endif
 }
 /*
 * Legacy backend for hak_tiny_alloc_superslab_box().
 *
@@ -87,7 +62,7 @@ void* hak_tiny_alloc_superslab_backend_legacy(int class_idx)
                meta->used++;
                atomic_fetch_add_explicit(&chunk->total_active_blocks, 1, memory_order_relaxed);
-                return superslab_return_block(base, class_idx);
+                HAK_RET_ALLOC_BLOCK(class_idx, base);
            }
        }
        chunk = chunk->next_chunk;
@@ -127,7 +102,7 @@ void* hak_tiny_alloc_superslab_backend_legacy(int class_idx)
            meta->used++;
            atomic_fetch_add_explicit(&new_chunk->total_active_blocks, 1, memory_order_relaxed);
-            return superslab_return_block(base, class_idx);
+            HAK_RET_ALLOC_BLOCK(class_idx, base);
        }
    }
@@ -230,7 +205,7 @@ void* hak_tiny_alloc_superslab_backend_shared(int class_idx)
    meta->used++;
    atomic_fetch_add_explicit(&ss->total_active_blocks, 1, memory_order_relaxed);
-    return superslab_return_block(base, class_idx);
+    HAK_RET_ALLOC_BLOCK(class_idx, base);
 }
 /*