Refactor: Unified allocation macros + header validation

1. Archived unused backend files (ss_legacy/unified_backend_box.c/h)
   - These files were not linked in the build
   - Moved to archive/ to reduce confusion

2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
   - Replaces superslab_return_block() function
   - Consistent with existing HAK_RET_ALLOC pattern
   - Single source of truth for header writing
   - Defined in hakmem_tiny_superslab_internal.h

3. Added header validation on TLS SLL push
   - Detects blocks pushed without proper header
   - Opt-in for release builds via HAKMEM_TINY_SLL_VALIDATE_HDR=1
   - Always on in debug builds
   - Logs first 10 violations with backtraces

Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design

Note: the Larson bug still reproduces; the header corruption occurs
before push validation can catch it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Moe Charm (CI)
2025-11-29 05:37:24 +09:00
parent 6ac6f5ae1b
commit 5582cbc22c
9 changed files with 1733 additions and 28 deletions

PHASE2_PERF_ANALYSIS.md (new file, 159 lines)

@@ -0,0 +1,159 @@
# HAKMEM Allocator - Phase 2 Performance Analysis
## Quick Summary
| Metric | Phase 1 | Phase 2 | Change |
|--------|---------|---------|--------|
| **Throughput** | 72M ops/s | 79.8M ops/s | **+10.8%** ✓ |
| Cycles | 78.6M | 72.2M | -8.1% ✓ |
| Instructions | 167M | 153M | -8.4% ✓ |
| Branches | 36M | 23M | **-36%** ✓ |
| Branch Misses | 921K (2.56%) | 1.02M (4.43%) | +73% ✗ |
| L3 Cache Misses | 173K (9.28%) | 216K (10.28%) | +25% ✗ |
| dTLB Misses | N/A | 41 (0.01%) | **Excellent!** ✓ |
## Top 5 Hotspots (Phase 2, 628 samples)
1. **malloc()** - 36.51% CPU time
- Function overhead (prologue/epilogue): ~18%
- Lock operations: 5.05%
- Initialization checks: ~15%
2. **main()** - 30.51% CPU time
- Benchmark loop overhead (not allocator)
3. **free()** - 19.66% CPU time
- Lock operations: 3.29%
- Cached variable checks: ~15%
- Function overhead: ~10%
4. **clear_page_erms (kernel)** - 9.31% CPU time
- Page fault handling
5. **irqentry_exit_to_user_mode (kernel)** - 5.33% CPU time
- Kernel exit overhead
## Phase 3 Optimization Targets (Ranked by Impact)
### 🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)
**Target**: Reduce malloc/free from 56% → ~33% CPU time
- Inline hot paths to eliminate function call overhead
- Remove stats counters from production builds
- Cache initialization state in TLS
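A minimal sketch of what the inlining above could look like, assuming a per-class TLS free list; the names (`hak_tiny_alloc_fast`, `t_sll_head`) are illustrative, not HAKMEM's actual API:
```c
/* Hedged sketch: force-inlined TLS fast path; names are illustrative. */
#include <stddef.h>

static __thread void* t_sll_head[8];              /* per-class TLS free lists */

static inline __attribute__((always_inline))
void* hak_tiny_alloc_fast(int class_idx) {
    void* blk = t_sll_head[class_idx];
    if (__builtin_expect(blk != NULL, 1)) {       /* common case: TLS hit */
        t_sll_head[class_idx] = *(void**)blk;     /* pop the list head */
        return blk;
    }
    return NULL;                                  /* caller falls back to the slow path */
}
```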
### 🔥 Priority 2: Branch Optimization (Expected: +3-5%)
**Target**: Reduce branch misses from 1.02M → <700K
- Apply Profile-Guided Optimization (PGO)
- Add LIKELY/UNLIKELY hints
- Reduce branches in fast path from ~15 to 5-7
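The branch hints above typically look like the following; `HAK_LIKELY`/`HAK_UNLIKELY` are illustrative names, not necessarily ones HAKMEM already defines:
```c
/* Hedged sketch: branch-hint macros (illustrative names). */
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#  define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#  define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define HAK_LIKELY(x)   (x)
#  define HAK_UNLIKELY(x) (x)
#endif

/* Example: in free(), a NULL pointer is the rare case. */
void hak_free_example(void* p) {
    if (HAK_UNLIKELY(p == NULL))
        return;
    /* ... fast-path free ... */
}
```
PGO itself is a build step rather than a code change (e.g. GCC/Clang `-fprofile-generate`, run the benchmark, then rebuild with `-fprofile-use`).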
### 🔥 Priority 3: Cache Optimization (Expected: +2-4%)
**Target**: Reduce L3 misses from 216K → <180K
- Align hot structures to cache lines
- Add prefetching in allocation path
- Compact metadata structures
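A minimal sketch of the alignment and prefetching ideas above; the struct layout is illustrative, not HAKMEM's real metadata:
```c
/* Hedged sketch: cache-line-aligned hot data plus a prefetch on pop. */
#define HAK_CACHELINE 64

struct hak_tiny_hot {
    void* head[8];                        /* hot per-class free-list heads */
} __attribute__((aligned(HAK_CACHELINE)));

static inline void* hak_pop_with_prefetch(void** headp) {
    void* blk = *headp;
    if (blk) {
        void* next = *(void**)blk;        /* next link lives at the block start */
        __builtin_prefetch(next, 0, 3);   /* warm the next block for the next alloc */
        *headp = next;
    }
    return blk;
}
```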
### 🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)
- Cache g_initialized/g_enable checks in TLS
- Use constructor attributes more aggressively
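One way to cache those checks in TLS, assuming `g_initialized`/`g_enable` are plain globals; the helper names here are made up for the sketch:
```c
/* Hedged sketch: thread-local cache of the global init/enable checks. */
extern int g_initialized;
extern int g_enable;

static __thread int t_hak_ready;          /* 0 until this thread has seen init done */

static inline int hak_ready(void) {
    if (__builtin_expect(t_hak_ready, 1))
        return 1;                         /* fast path: one TLS read, no globals */
    if (g_initialized && g_enable) {
        t_hak_ready = 1;
        return 1;
    }
    return 0;                             /* caller runs the init / fallback path */
}

/* A constructor can finish global init before main(), so threads mostly
 * take the fast branch above. */
__attribute__((constructor)) static void hak_ctor(void) { /* ... init ... */ }
```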
### 🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)
- Move stats to TLS, aggregate periodically
- Eliminate atomic ops from fast path
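A sketch of the TLS-stats idea above: each thread increments a plain counter and flushes it into one global atomic only every N operations (names and the flush threshold are illustrative):
```c
/* Hedged sketch: per-thread counter, aggregated into a global periodically. */
#include <stdatomic.h>

static _Atomic unsigned long g_alloc_total;       /* global, touched rarely     */
static __thread unsigned long t_alloc_local;      /* fast path: plain increment */

static inline void hak_count_alloc(void) {
    if (__builtin_expect(++t_alloc_local >= 1024, 0)) {   /* flush every 1024 ops */
        atomic_fetch_add_explicit(&g_alloc_total, t_alloc_local,
                                  memory_order_relaxed);
        t_alloc_local = 0;
    }
}
```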
### 🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)
- Reduce TLS reads/writes from ~10 to ~4 per operation
- Cache TLS values in registers
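One common way to cut TLS traffic, per the items above: compute the TLS address once per call and work through a local pointer so the compiler can keep it in a register (`hak_tls_t` and its fields are illustrative):
```c
/* Hedged sketch: one TLS address computation per operation. */
#include <stdint.h>

typedef struct { void* head[8]; uint32_t count[8]; } hak_tls_t;
static __thread hak_tls_t t_tiny;

static inline void* hak_alloc_one_tls_read(int cls) {
    hak_tls_t* tls = &t_tiny;             /* single TLS access             */
    void* blk = tls->head[cls];
    if (blk) {
        tls->head[cls] = *(void**)blk;    /* pop head                      */
        tls->count[cls]--;                /* all further work on the local */
    }
    return blk;
}
```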
## Expected Phase 3 Results
**Target Throughput**: 87-95M ops/s (+9-19% improvement)
| Metric | Phase 2 | Phase 3 Target | Change |
|--------|---------|----------------|--------|
| Throughput | 79.8M ops/s | 87-95M ops/s | +9-19% |
| malloc CPU | 36.51% | ~22% | -40% |
| free CPU | 19.66% | ~11% | -44% |
| Branch misses | 4.43% | <3% | -32% |
| L3 cache misses | 10.28% | <8% | -22% |
## Key Insights
### ✅ What Worked in Phase 2
1. **SuperSlab size increase** (64KB → 512KB): Dramatically reduced branches (-36%)
2. **Amortized initialization**: memset overhead dropped from 6.41% → 1.77%
3. **Virtual memory optimization**: TLB miss rate is excellent (0.01%)
### ❌ What Needs Work
1. **Branch prediction**: Miss rate doubled despite fewer branches
2. **Cache pressure**: Larger SuperSlabs increased L3 misses
3. **Function overhead**: malloc/free dominate CPU time (56%)
### 🤔 Surprising Findings
1. **Cross-calling pattern**: malloc/free call each other 8-12% of the time
- Thread-local cache flushing
- Deferred release operations
- May benefit from batching
2. **Kernel overhead increased**: clear_page_erms went from 2.23% → 9.31%
- May need page pre-faulting strategy
3. **Main loop visible**: 30.51% CPU time
- Benchmark overhead, not allocator
- Real allocator overhead is ~56% (malloc + free)
## Files Generated
- `perf_phase2_stats.txt` - perf stat -d output
- `perf_phase2_symbols.txt` - Symbol-level hotspots
- `perf_phase2_callgraph.txt` - Call graph analysis
- `perf_phase2_detailed.txt` - Detailed counter breakdown
- `perf_malloc_annotate.txt` - Assembly annotation for malloc()
- `perf_free_annotate.txt` - Assembly annotation for free()
- `perf_analysis_summary.txt` - Detailed comparison with Phase 1
- `phase3_recommendations.txt` - Complete optimization roadmap
## How to Use This Data
### For Quick Reference
```bash
cat perf_phase2_stats.txt # See overall metrics
cat perf_phase2_symbols.txt # See top functions
```
### For Deep Analysis
```bash
cat perf_malloc_annotate.txt # See assembly-level hotspots in malloc
cat perf_free_annotate.txt # See assembly-level hotspots in free
cat perf_analysis_summary.txt # See Phase 1 vs Phase 2 comparison
```
### For Planning Phase 3
```bash
cat phase3_recommendations.txt # See ranked optimization opportunities
```
### To Re-run Analysis
```bash
# Quick stat
perf stat -d ./bench_random_mixed_hakmem 1000000 256 42
# Detailed profiling
perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
perf report --stdio --no-children --sort symbol
```
## Next Steps
1. **Week 1**: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
2. **Week 2**: Apply PGO + branch hints (Expected: +3-5%)
3. **Week 3**: Cache line alignment + prefetching (Expected: +2-4%)
4. **Week 4**: TLS optimization + polish (Expected: +1-3%)
**Total Expected**: +14-22% improvement → **Target: 91-97M ops/s**
---
Generated: 2025-11-28
Phase: 2 → 3 transition
Baseline: 72M ops/s | Current: 79.8M ops/s | Target: 87-95M ops/s

@@ -294,6 +294,40 @@ static inline bool tls_sll_push_impl(int class_idx, void* ptr, uint32_t capacity
}
} while (0);
#if HAKMEM_TINY_HEADER_CLASSIDX
// Validate header on push - detect blocks pushed without header write
// Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 or in debug builds
do {
static int g_validate_hdr = -1;
if (__builtin_expect(g_validate_hdr == -1, 0)) {
#if !HAKMEM_BUILD_RELEASE
g_validate_hdr = 1; // Always on in debug
#else
const char* env = getenv("HAKMEM_TINY_SLL_VALIDATE_HDR");
g_validate_hdr = (env && *env && *env != '0') ? 1 : 0;
#endif
}
if (__builtin_expect(g_validate_hdr, 0)) {
static _Atomic uint32_t g_tls_sll_push_bad_hdr = 0;
uint8_t hdr = *(uint8_t*)ptr;
uint8_t expected = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
if (hdr != expected) {
uint32_t n = atomic_fetch_add_explicit(&g_tls_sll_push_bad_hdr, 1, memory_order_relaxed);
if (n < 10) {
fprintf(stderr,
"[TLS_SLL_PUSH_BAD_HDR] cls=%d base=%p got=0x%02x expect=0x%02x from=%s\n",
class_idx, ptr, hdr, expected, where ? where : "(null)");
void* bt[8];
int frames = backtrace(bt, 8);
backtrace_symbols_fd(bt, frames, fileno(stderr));
fflush(stderr);
}
}
}
} while (0);
#endif
#if !HAKMEM_BUILD_RELEASE
// Minimal range guard before we touch memory.
if (!validate_ptr_range(ptr, "tls_sll_push_base")) {

File diff suppressed because it is too large.

@@ -26,6 +26,22 @@
#include <pthread.h>
#include <unistd.h>
// ============================================================================
// Unified Return Macros
// ============================================================================
// HAK_RET_ALLOC_BLOCK - Single exit point for SuperSlab allocations
// Purpose: Ensures consistent header writing across all SuperSlab allocation paths
// Usage: HAK_RET_ALLOC_BLOCK(class_idx, base_ptr);
// Note: Must be used in function context (macro contains return statement)
#if HAKMEM_TINY_HEADER_CLASSIDX
#define HAK_RET_ALLOC_BLOCK(cls, base_ptr) \
return tiny_region_id_write_header((base_ptr), (cls))
#else
#define HAK_RET_ALLOC_BLOCK(cls, base_ptr) \
return (void*)(base_ptr)
#endif
// ============================================================================
// Global Variables (defined in superslab_stats.c)
// ============================================================================

@@ -5,31 +5,6 @@
#include "hakmem_tiny_superslab_internal.h"
/*
* superslab_return_block() - Single exit point for all SuperSlab allocations
*
* Purpose: Ensures consistent header writing across all allocation paths.
* This prevents bugs where headers are written in some paths but not others.
*
* Parameters:
* base - Block start address from SuperSlab geometry
* class_idx - Tiny class index (0-7)
*
* Returns:
* User pointer (base + 1 if headers enabled, base otherwise)
*
* Header writing behavior:
* - If HAKMEM_TINY_HEADER_CLASSIDX=1: Writes header via tiny_region_id_write_header()
* - If HAKMEM_TINY_HEADER_CLASSIDX=0: Returns base directly (no header)
*/
static inline void* superslab_return_block(void* base, int class_idx) {
#if HAKMEM_TINY_HEADER_CLASSIDX
return tiny_region_id_write_header(base, class_idx);
#else
return (void*)base;
#endif
}
/*
* Legacy backend for hak_tiny_alloc_superslab_box().
*
@@ -87,7 +62,7 @@ void* hak_tiny_alloc_superslab_backend_legacy(int class_idx)
meta->used++;
atomic_fetch_add_explicit(&chunk->total_active_blocks, 1, memory_order_relaxed);
-return superslab_return_block(base, class_idx);
+HAK_RET_ALLOC_BLOCK(class_idx, base);
}
}
chunk = chunk->next_chunk;
@@ -127,7 +102,7 @@ void* hak_tiny_alloc_superslab_backend_legacy(int class_idx)
meta->used++;
atomic_fetch_add_explicit(&new_chunk->total_active_blocks, 1, memory_order_relaxed);
-return superslab_return_block(base, class_idx);
+HAK_RET_ALLOC_BLOCK(class_idx, base);
}
}
@@ -230,7 +205,7 @@ void* hak_tiny_alloc_superslab_backend_shared(int class_idx)
meta->used++;
atomic_fetch_add_explicit(&ss->total_active_blocks, 1, memory_order_relaxed);
-return superslab_return_block(base, class_idx);
+HAK_RET_ALLOC_BLOCK(class_idx, base);
}
/*