Refactor: Unified allocation macros + header validation

1. Archive unused backend files (ss_legacy/unified_backend_box.c/h)
   - These files were not linked in the build
   - Moved to archive/ to reduce confusion

2. Created HAK_RET_ALLOC_BLOCK macro for SuperSlab allocations
   - Replaces superslab_return_block() function
   - Consistent with existing HAK_RET_ALLOC pattern
   - Single source of truth for header writing
   - Defined in hakmem_tiny_superslab_internal.h

3. Added header validation on TLS SLL push
   - Detects blocks pushed without proper header
   - Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 (release)
   - Always on in debug builds
   - Logs first 10 violations with backtraces

Benefits:
- Easier to track allocation paths
- Catches header bugs at push time
- More maintainable macro-based design

Note: Larson bug still reproduces - header corruption occurs before push
validation can catch it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
159
PHASE2_PERF_ANALYSIS.md
Normal file
@@ -0,0 +1,159 @@
# HAKMEM Allocator - Phase 2 Performance Analysis

## Quick Summary

| Metric | Phase 1 | Phase 2 | Change |
|--------|---------|---------|--------|
| **Throughput** | 72M ops/s | 79.8M ops/s | **+10.8%** ✓ |
| Cycles | 78.6M | 72.2M | -8.1% ✓ |
| Instructions | 167M | 153M | -8.4% ✓ |
| Branches | 36M | 23M | **-36%** ✓ |
| Branch Misses | 921K (2.56%) | 1.02M (4.43%) | +73% (miss rate) ✗ |
| L3 Cache Misses | 173K (9.28%) | 216K (10.28%) | +25% (miss count) ✗ |
| dTLB Misses | N/A | 41 (0.01%) | **Excellent!** ✓ |
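
The Change column mixes two bases: the branch-miss figure follows the miss rate (2.56% → 4.43%), while the L3 figure follows the raw miss count (173K → 216K). A minimal sketch of the arithmetic, using a hypothetical helper `rel_change_pct` that is not part of HAKMEM:

```c
#include <assert.h>

/* Percent change from `before` to `after`.
 * Illustrative helper only; `before` must be non-zero. */
static double rel_change_pct(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

With the table's values, `rel_change_pct(2.56, 4.43)` is roughly 73 (miss rate), whereas `rel_change_pct(921e3, 1.02e6)` is roughly 10.7 (miss count), so the two bases tell noticeably different stories for the same counter.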

## Top 5 Hotspots (Phase 2, 628 samples)

1. **malloc()** - 36.51% CPU time
   - Function overhead (prologue/epilogue): ~18%
   - Lock operations: 5.05%
   - Initialization checks: ~15%

2. **main()** - 30.51% CPU time
   - Benchmark loop overhead (not allocator)

3. **free()** - 19.66% CPU time
   - Lock operations: 3.29%
   - Cached variable checks: ~15%
   - Function overhead: ~10%

4. **clear_page_erms (kernel)** - 9.31% CPU time
   - Page fault handling

5. **irqentry_exit_to_user_mode (kernel)** - 5.33% CPU time
   - Kernel exit overhead

## Phase 3 Optimization Targets (Ranked by Impact)

### 🔥 Priority 1: Fast-Path Inlining (Expected: +5-8%)
**Target**: Reduce malloc/free from 56% → ~33% CPU time
- Inline hot paths to eliminate function call overhead
- Remove stats counters from production builds
- Cache initialization state in TLS

### 🔥 Priority 2: Branch Optimization (Expected: +3-5%)
**Target**: Reduce branch misses from 1.02M → <700K
- Apply Profile-Guided Optimization (PGO)
- Add LIKELY/UNLIKELY hints
- Reduce branches in fast path from ~15 to 5-7
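
A minimal sketch of the branch-hint idea, assuming a GCC or Clang toolchain (the validation code in this commit already relies on `__builtin_expect`). The `HAK_LIKELY`/`HAK_UNLIKELY` macro names and the `demo_node`/`demo_pop` free list are hypothetical and only illustrate the pattern; they are not taken from the HAKMEM sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hint macros; fall back to plain expressions elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define HAK_LIKELY(x)   __builtin_expect(!!(x), 1)
#define HAK_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define HAK_LIKELY(x)   (x)
#define HAK_UNLIKELY(x) (x)
#endif

/* Illustrative singly linked free list, not HAKMEM's TLS SLL layout. */
typedef struct demo_node {
    struct demo_node* next;
} demo_node;

/* Pop the head block; the empty case is marked as the cold path so the
 * compiler can lay out the common case as straight-line code. */
static inline void* demo_pop(demo_node** head) {
    assert(head != NULL);           /* programming error, not a runtime state */
    demo_node* n = *head;
    if (HAK_UNLIKELY(n == NULL)) {
        return NULL;                /* cold: caller would refill the list */
    }
    *head = n->next;
    return n;
}
```

Hints like these only pay off when they match the measured behaviour, which is why the list above pairs them with PGO.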

### 🔥 Priority 3: Cache Optimization (Expected: +2-4%)
**Target**: Reduce L3 misses from 216K → <180K
- Align hot structures to cache lines
- Add prefetching in allocation path
- Compact metadata structures
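
One way the alignment and prefetch items could look, as a sketch under stated assumptions: a 64-byte cache line, a C11 compiler with `__builtin_prefetch` (GCC/Clang), and a free list that stores the next pointer in the first word of each free block. The `demo_tls_cache` type and its fields are hypothetical, not HAKMEM's real metadata.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64  /* assumed line size; typical on x86-64 */

/* Hypothetical per-class hot state, padded to exactly one cache line
 * so that neighbouring classes never share a line. */
typedef struct {
    alignas(DEMO_CACHE_LINE) void* head;  /* free-list head */
    uint32_t count;                       /* blocks currently cached */
    uint32_t capacity;                    /* maximum blocks to cache */
} demo_tls_cache;

static_assert(sizeof(demo_tls_cache) == DEMO_CACHE_LINE, "one line per class");
static_assert(alignof(demo_tls_cache) == DEMO_CACHE_LINE, "line-aligned");

/* Pop a block and prefetch its successor, which the next allocation
 * is likely to touch. Prefetching a NULL address is harmless. */
static inline void* demo_pop_prefetch(demo_tls_cache* c) {
    void* blk = c->head;
    if (blk == NULL) {
        return NULL;
    }
    void* next = *(void**)blk;        /* next pointer lives in the free block */
    __builtin_prefetch(next, 1, 3);   /* expect a write, keep in all levels */
    c->head = next;
    c->count--;
    return blk;
}
```

Whether the prefetch helps depends on the allocation pattern, so it would need to be checked against the L3 miss counter rather than assumed.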

### 🎯 Priority 4: Remove Init Overhead (Expected: +2-3%)
- Cache g_initialized/g_enable checks in TLS
- Use constructor attributes more aggressively

### 🎯 Priority 5: Reduce Lock Contention (Expected: +1-2%)
- Move stats to TLS, aggregate periodically
- Eliminate atomic ops from fast path
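
The TLS-stats item can be sketched as follows, assuming C11 `_Thread_local` and `<stdatomic.h>`. All names (`demo_count_alloc`, `DEMO_FLUSH_EVERY`, and so on) and the batch size of 1024 are hypothetical choices for illustration, not values from the HAKMEM code base.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_FLUSH_EVERY 1024u  /* assumed batch size */
static_assert(DEMO_FLUSH_EVERY > 0, "batch size must be positive");

static _Atomic uint64_t g_demo_alloc_total;        /* shared, touched rarely */
static _Thread_local uint32_t t_demo_alloc_local;  /* per-thread, plain int */

/* Fast path: a plain thread-local increment. The shared atomic is only
 * touched once per DEMO_FLUSH_EVERY calls. */
static inline void demo_count_alloc(void) {
    if (++t_demo_alloc_local >= DEMO_FLUSH_EVERY) {
        atomic_fetch_add_explicit(&g_demo_alloc_total, t_demo_alloc_local,
                                  memory_order_relaxed);
        t_demo_alloc_local = 0;
    }
}

/* Push any remainder, e.g. at thread exit or before printing stats. */
static inline void demo_flush_stats(void) {
    atomic_fetch_add_explicit(&g_demo_alloc_total, t_demo_alloc_local,
                              memory_order_relaxed);
    t_demo_alloc_local = 0;
}

static inline uint64_t demo_read_total(void) {
    return atomic_load_explicit(&g_demo_alloc_total, memory_order_relaxed);
}
```

The trade-off is that the global total lags each thread by up to one batch until `demo_flush_stats()` runs, which is usually acceptable for statistics.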

### 🎯 Priority 6: Optimize TLS Operations (Expected: +1-2%)
- Reduce TLS reads/writes from ~10 to ~4 per operation
- Cache TLS values in registers

## Expected Phase 3 Results

**Target Throughput**: 87-95M ops/s (+9-19% improvement)

| Metric | Phase 2 | Phase 3 Target | Change |
|--------|---------|----------------|--------|
| Throughput | 79.8M ops/s | 87-95M ops/s | +9-19% |
| malloc CPU | 36.51% | ~22% | -40% |
| free CPU | 19.66% | ~11% | -44% |
| Branch misses | 4.43% | <3% | -32% |
| L3 cache misses | 10.28% | <8% | -22% |

## Key Insights

### ✅ What Worked in Phase 2
1. **SuperSlab size increase** (64KB → 512KB): Dramatically reduced branches (-36%)
2. **Amortized initialization**: memset overhead dropped from 6.41% → 1.77%
3. **Virtual memory optimization**: TLB miss rate is excellent (0.01%)

### ❌ What Needs Work
1. **Branch prediction**: Miss rate rose from 2.56% to 4.43% (+73%) despite fewer branches
2. **Cache pressure**: Larger SuperSlabs increased L3 misses
3. **Function overhead**: malloc/free dominate CPU time (56%)

### 🤔 Surprising Findings
1. **Cross-calling pattern**: malloc/free call each other 8-12% of the time
   - Thread-local cache flushing
   - Deferred release operations
   - May benefit from batching

2. **Kernel overhead increased**: clear_page_erms went from 2.23% → 9.31%
   - May need page pre-faulting strategy

3. **Main loop visible**: 30.51% CPU time
   - Benchmark overhead, not allocator
   - Real allocator overhead is ~56% (malloc + free)

## Files Generated

- `perf_phase2_stats.txt` - perf stat -d output
- `perf_phase2_symbols.txt` - Symbol-level hotspots
- `perf_phase2_callgraph.txt` - Call graph analysis
- `perf_phase2_detailed.txt` - Detailed counter breakdown
- `perf_malloc_annotate.txt` - Assembly annotation for malloc()
- `perf_free_annotate.txt` - Assembly annotation for free()
- `perf_analysis_summary.txt` - Detailed comparison with Phase 1
- `phase3_recommendations.txt` - Complete optimization roadmap

## How to Use This Data

### For Quick Reference
```bash
cat perf_phase2_stats.txt              # See overall metrics
cat perf_phase2_symbols.txt            # See top functions
```

### For Deep Analysis
```bash
cat perf_malloc_annotate.txt           # See assembly-level hotspots in malloc
cat perf_free_annotate.txt             # See assembly-level hotspots in free
cat perf_analysis_summary.txt          # See Phase 1 vs Phase 2 comparison
```

### For Planning Phase 3
```bash
cat phase3_recommendations.txt         # See ranked optimization opportunities
```

### To Re-run Analysis
```bash
# Quick stat
perf stat -d ./bench_random_mixed_hakmem 1000000 256 42

# Detailed profiling
perf record -F 9999 -g ./bench_random_mixed_hakmem 5000000 256 42
perf report --stdio --no-children --sort symbol
```

## Next Steps

1. **Week 1**: Implement fast-path inlining + remove stats locks (Expected: +8-10%)
2. **Week 2**: Apply PGO + branch hints (Expected: +3-5%)
3. **Week 3**: Cache line alignment + prefetching (Expected: +2-4%)
4. **Week 4**: TLS optimization + polish (Expected: +1-3%)

**Total Expected**: +14-22% improvement → **Target: 91-97M ops/s**

---

Generated: 2025-11-28
Phase: 2 → 3 transition
Baseline: 72M ops/s → Current: 79.8M ops/s → Target: 87-95M ops/s
@@ -294,6 +294,40 @@ static inline bool tls_sll_push_impl(int class_idx, void* ptr, uint32_t capacity
         }
     } while (0);

+#if HAKMEM_TINY_HEADER_CLASSIDX
+    // Validate header on push - detect blocks pushed without header write
+    // Enabled via HAKMEM_TINY_SLL_VALIDATE_HDR=1 or in debug builds
+    do {
+        static int g_validate_hdr = -1;
+        if (__builtin_expect(g_validate_hdr == -1, 0)) {
+#if !HAKMEM_BUILD_RELEASE
+            g_validate_hdr = 1;  // Always on in debug
+#else
+            const char* env = getenv("HAKMEM_TINY_SLL_VALIDATE_HDR");
+            g_validate_hdr = (env && *env && *env != '0') ? 1 : 0;
+#endif
+        }
+
+        if (__builtin_expect(g_validate_hdr, 0)) {
+            static _Atomic uint32_t g_tls_sll_push_bad_hdr = 0;
+            uint8_t hdr = *(uint8_t*)ptr;
+            uint8_t expected = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
+            if (hdr != expected) {
+                uint32_t n = atomic_fetch_add_explicit(&g_tls_sll_push_bad_hdr, 1, memory_order_relaxed);
+                if (n < 10) {
+                    fprintf(stderr,
+                            "[TLS_SLL_PUSH_BAD_HDR] cls=%d base=%p got=0x%02x expect=0x%02x from=%s\n",
+                            class_idx, ptr, hdr, expected, where ? where : "(null)");
+                    void* bt[8];
+                    int frames = backtrace(bt, 8);
+                    backtrace_symbols_fd(bt, frames, fileno(stderr));
+                    fflush(stderr);
+                }
+            }
+        }
+    } while (0);
+#endif
+
 #if !HAKMEM_BUILD_RELEASE
     // Minimal range guard before we touch memory.
     if (!validate_ptr_range(ptr, "tls_sll_push_base")) {
1521
core/hakmem_tiny_superslab.c.bak
Normal file
File diff suppressed because it is too large
@@ -26,6 +26,22 @@
 #include <pthread.h>
 #include <unistd.h>

+// ============================================================================
+// Unified Return Macros
+// ============================================================================
+
+// HAK_RET_ALLOC_BLOCK - Single exit point for SuperSlab allocations
+// Purpose: Ensures consistent header writing across all SuperSlab allocation paths
+// Usage: HAK_RET_ALLOC_BLOCK(class_idx, base_ptr);
+// Note: Must be used in function context (macro contains return statement)
+#if HAKMEM_TINY_HEADER_CLASSIDX
+#define HAK_RET_ALLOC_BLOCK(cls, base_ptr) \
+    return tiny_region_id_write_header((base_ptr), (cls))
+#else
+#define HAK_RET_ALLOC_BLOCK(cls, base_ptr) \
+    return (void*)(base_ptr)
+#endif
+
 // ============================================================================
 // Global Variables (defined in superslab_stats.c)
 // ============================================================================
@@ -5,31 +5,6 @@

 #include "hakmem_tiny_superslab_internal.h"

-/*
- * superslab_return_block() - Single exit point for all SuperSlab allocations
- *
- * Purpose: Ensures consistent header writing across all allocation paths.
- * This prevents bugs where headers are written in some paths but not others.
- *
- * Parameters:
- *   base      - Block start address from SuperSlab geometry
- *   class_idx - Tiny class index (0-7)
- *
- * Returns:
- *   User pointer (base + 1 if headers enabled, base otherwise)
- *
- * Header writing behavior:
- *   - If HAKMEM_TINY_HEADER_CLASSIDX=1: Writes header via tiny_region_id_write_header()
- *   - If HAKMEM_TINY_HEADER_CLASSIDX=0: Returns base directly (no header)
- */
-static inline void* superslab_return_block(void* base, int class_idx) {
-#if HAKMEM_TINY_HEADER_CLASSIDX
-    return tiny_region_id_write_header(base, class_idx);
-#else
-    return (void*)base;
-#endif
-}
-
 /*
  * Legacy backend for hak_tiny_alloc_superslab_box().
  *
@@ -87,7 +62,7 @@ void* hak_tiny_alloc_superslab_backend_legacy(int class_idx)

                 meta->used++;
                 atomic_fetch_add_explicit(&chunk->total_active_blocks, 1, memory_order_relaxed);
-                return superslab_return_block(base, class_idx);
+                HAK_RET_ALLOC_BLOCK(class_idx, base);
             }
         }
         chunk = chunk->next_chunk;
@@ -127,7 +102,7 @@ void* hak_tiny_alloc_superslab_backend_legacy(int class_idx)

             meta->used++;
             atomic_fetch_add_explicit(&new_chunk->total_active_blocks, 1, memory_order_relaxed);
-            return superslab_return_block(base, class_idx);
+            HAK_RET_ALLOC_BLOCK(class_idx, base);
         }
     }

@@ -230,7 +205,7 @@ void* hak_tiny_alloc_superslab_backend_shared(int class_idx)
     meta->used++;
     atomic_fetch_add_explicit(&ss->total_active_blocks, 1, memory_order_relaxed);

-    return superslab_return_block(base, class_idx);
+    HAK_RET_ALLOC_BLOCK(class_idx, base);
 }

 /*