406 lines
12 KiB
Markdown
406 lines
12 KiB
Markdown
|
|
# Region-ID Direct Lookup Design for Ultra-Fast Free Path
|
|||
|
|
|
|||
|
|
**Date:** 2025-11-08
|
|||
|
|
**Author:** Claude (Ultrathink Analysis)
|
|||
|
|
**Goal:** Eliminate SuperSlab lookup bottleneck (52.63% CPU) to achieve 40-80M ops/s free throughput
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Executive Summary
|
|||
|
|
|
|||
|
|
The HAKMEM free() path is currently **47x slower** than System malloc (1.2M vs 56M ops/s) due to expensive SuperSlab registry lookups that consume over 50% of CPU time. The root cause is the need to determine `class_idx` from a pointer to know which TLS freelist to use.
|
|||
|
|
|
|||
|
|
**Recommendation:** Implement **Option 1B: Inline Header with Class Index** - a hybrid approach that embeds a 1-byte class index in a header while maintaining backward compatibility. This approach offers:
|
|||
|
|
- **3-5 instruction free path** (vs current 330+ lines)
|
|||
|
|
- **Expected 30-50x speedup** (1.2M → 40-60M ops/s)
|
|||
|
|
- **Minimal memory overhead** (1 byte per allocation)
|
|||
|
|
- **Simple implementation** (200-300 LOC changes)
|
|||
|
|
- **Full compatibility** with existing Box Theory design
|
|||
|
|
|
|||
|
|
The key insight: We already have 2048 bytes of header space in SuperSlab's slab[0] that's currently wasted as padding. We can repurpose this for inline headers with zero additional memory cost for the first slab.
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Detailed Comparison Table
|
|||
|
|
|
|||
|
|
| Criteria | Option 1: Header Embedding | Option 2: Address Range | Option 3: TLS Cache | Hybrid 1B |
|
|||
|
|
|----------|----------------------------|------------------------|-------------------|-----------|
|
|||
|
|
| **Latency (cycles)** | 2-3 (best) | 5-10 (good) | 1-2 hit / 100+ miss | 2-3 |
|
|||
|
|
| **Memory Overhead** | 1-4 bytes/block | 0 bytes | 0 bytes | 1 byte/block |
|
|||
|
|
| **Implementation Complexity** | 3/10 (simple) | 7/10 (complex) | 4/10 (moderate) | 4/10 |
|
|||
|
|
| **Correctness** | Perfect (embedded) | Good (math-based) | Probabilistic | Perfect |
|
|||
|
|
| **Cache Friendliness** | Excellent (inline) | Good | Variable | Excellent |
|
|||
|
|
| **Thread Safety** | Perfect | Perfect | Good | Perfect |
|
|||
|
|
| **UAF Detection** | Yes (can add magic) | No | No | Yes |
|
|||
|
|
| **Debug Support** | Excellent | Moderate | Poor | Excellent |
|
|||
|
|
| **Backward Compat** | Needs flag | Complex | Easy | Easy |
|
|||
|
|
| **Score** | **9/10** ⭐ | 6/10 | 5/10 | **9.5/10** ⭐⭐⭐ |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Option 1: Header Embedding
|
|||
|
|
|
|||
|
|
### Concept
|
|||
|
|
Store `class_idx` directly in a small header (1-4 bytes) before each allocation.
|
|||
|
|
|
|||
|
|
### Implementation Design
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
// Header structure (1 byte minimal, 4 bytes with safety)
|
|||
|
|
typedef struct {
|
|||
|
|
uint8_t class_idx; // 0-7 for tiny classes
|
|||
|
|
#ifdef HAKMEM_DEBUG
|
|||
|
|
uint8_t magic; // 0xAB for validation
|
|||
|
|
uint16_t guard; // Canary for overflow detection
|
|||
|
|
#endif
|
|||
|
|
} TinyHeader;
|
|||
|
|
|
|||
|
|
// Ultra-fast free (3-5 instructions)
|
|||
|
|
void hak_tiny_free_fast(void* ptr) {
|
|||
|
|
// 1. Get class from header (1 instruction)
|
|||
|
|
uint8_t class_idx = *((uint8_t*)ptr - 1);
|
|||
|
|
|
|||
|
|
// 2. Validate (debug only, compiled out in release)
|
|||
|
|
#ifdef HAKMEM_DEBUG
|
|||
|
|
if (class_idx >= TINY_NUM_CLASSES) {
|
|||
|
|
hak_tiny_free_slow(ptr); // Fallback
|
|||
|
|
return;
|
|||
|
|
}
|
|||
|
|
#endif
|
|||
|
|
|
|||
|
|
// 3. Push to TLS freelist (2-3 instructions)
|
|||
|
|
void** head = &g_tls_sll_head[class_idx];
|
|||
|
|
*(void**)ptr = *head; // ptr->next = head
|
|||
|
|
*head = ptr; // head = ptr
|
|||
|
|
g_tls_sll_count[class_idx]++;
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Memory Layout
|
|||
|
|
```
|
|||
|
|
[Header|Block] [Header|Block] [Header|Block] ...
|
|||
|
|
1B 8B 1B 16B 1B 32B
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Performance Analysis
|
|||
|
|
- **Best case:** 2 cycles (L1 hit, no validation)
|
|||
|
|
- **Average:** 3 cycles (with increment)
|
|||
|
|
- **Worst case:** 5 cycles (with debug checks)
|
|||
|
|
- **Memory overhead:** 1 byte × 1M blocks = 1MB (for 1M allocations)
|
|||
|
|
- **Cache impact:** Excellent (header is inline with data)
|
|||
|
|
|
|||
|
|
### Pros
|
|||
|
|
- ✅ **Fastest possible lookup** (single byte read)
|
|||
|
|
- ✅ **Perfect correctness** (no race conditions)
|
|||
|
|
- ✅ **UAF detection capability** (can check magic on free)
|
|||
|
|
- ✅ **Simple implementation** (~200 LOC)
|
|||
|
|
- ✅ **Debug friendly** (can validate everything)
|
|||
|
|
|
|||
|
|
### Cons
|
|||
|
|
- ❌ Memory overhead (12.5% for 8-byte blocks, 0.1% for 1KB blocks)
|
|||
|
|
- ❌ Requires allocation path changes
|
|||
|
|
- ❌ Not compatible with existing allocations (needs migration)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Option 2: Address Range Mapping
|
|||
|
|
|
|||
|
|
### Concept
|
|||
|
|
Calculate `class_idx` from the SuperSlab base address and slab index using bit manipulation.
|
|||
|
|
|
|||
|
|
### Implementation Design
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
// Precomputed mapping table (built at SuperSlab creation)
|
|||
|
|
typedef struct {
|
|||
|
|
uintptr_t base; // SuperSlab base (2MB aligned)
|
|||
|
|
uint8_t class_idx; // Size class for this SuperSlab
|
|||
|
|
uint8_t slab_map[32]; // Per-slab class (for mixed SuperSlabs)
|
|||
|
|
} SSClassMap;
|
|||
|
|
|
|||
|
|
// Global registry (similar to current, but simpler)
|
|||
|
|
SSClassMap g_ss_class_map[4096]; // Covers 8GB address space
|
|||
|
|
|
|||
|
|
// Address to class lookup (5-10 instructions)
|
|||
|
|
uint8_t ptr_to_class_idx(void* ptr) {
|
|||
|
|
// 1. Get 2MB-aligned base (1 instruction)
|
|||
|
|
uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);
|
|||
|
|
|
|||
|
|
// 2. Hash lookup (2-3 instructions)
|
|||
|
|
uint32_t hash = (base >> 21) & 4095;
|
|||
|
|
SSClassMap* map = &g_ss_class_map[hash];
|
|||
|
|
|
|||
|
|
// 3. Validate and return (2-3 instructions)
|
|||
|
|
if (map->base == base) {
|
|||
|
|
// Optional: per-slab lookup for mixed classes
|
|||
|
|
uint32_t slab_idx = ((uintptr_t)ptr - base) / SLAB_SIZE;
|
|||
|
|
return map->slab_map[slab_idx];
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// 4. Linear probe on miss (expensive fallback)
|
|||
|
|
return lookup_with_probe(base, ptr);
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Performance Analysis
|
|||
|
|
- **Best case:** 5 cycles (direct hit)
|
|||
|
|
- **Average:** 8 cycles (with validation)
|
|||
|
|
- **Worst case:** 50+ cycles (linear probing)
|
|||
|
|
- **Memory overhead:** 0 (uses existing structures)
|
|||
|
|
- **Cache impact:** Good (map is compact)
|
|||
|
|
|
|||
|
|
### Pros
|
|||
|
|
- ✅ **Zero memory overhead** per allocation
|
|||
|
|
- ✅ **Works with existing allocations**
|
|||
|
|
- ✅ **Thread-safe** (read-only lookup)
|
|||
|
|
|
|||
|
|
### Cons
|
|||
|
|
- ❌ **Hash collisions** cause slowdown
|
|||
|
|
- ❌ **Complex implementation** (hash table maintenance)
|
|||
|
|
- ❌ **No UAF detection**
|
|||
|
|
- ❌ Still requires memory loads (not as fast as inline header)
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Option 3: TLS Last-Class Cache
|
|||
|
|
|
|||
|
|
### Concept
|
|||
|
|
Cache the last freed class per thread, betting on temporal locality.
|
|||
|
|
|
|||
|
|
### Implementation Design
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
// TLS cache (per-thread)
|
|||
|
|
__thread struct {
|
|||
|
|
void* last_base; // Last SuperSlab base
|
|||
|
|
uint8_t last_class; // Last class index
|
|||
|
|
uint32_t hit_count; // Statistics
|
|||
|
|
} g_tls_class_cache;
|
|||
|
|
|
|||
|
|
// Speculative fast path
|
|||
|
|
void hak_tiny_free_cached(void* ptr) {
|
|||
|
|
// 1. Speculative check (2-3 instructions)
|
|||
|
|
uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);
|
|||
|
|
if (base == (uintptr_t)g_tls_class_cache.last_base) {
|
|||
|
|
// Hit! Use cached class (1-2 instructions)
|
|||
|
|
uint8_t class_idx = g_tls_class_cache.last_class;
|
|||
|
|
tiny_free_to_tls(ptr, class_idx);
|
|||
|
|
g_tls_class_cache.hit_count++;
|
|||
|
|
return;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// 2. Miss - full lookup (expensive)
|
|||
|
|
SuperSlab* ss = hak_super_lookup(ptr); // 50-100 cycles
|
|||
|
|
if (ss) {
|
|||
|
|
// Update cache
|
|||
|
|
g_tls_class_cache.last_base = (void*)ss;
|
|||
|
|
g_tls_class_cache.last_class = ss->size_class;
|
|||
|
|
hak_tiny_free_superslab(ptr, ss);
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Performance Analysis
|
|||
|
|
- **Hit case:** 2-3 cycles (excellent)
|
|||
|
|
- **Miss case:** 100+ cycles (terrible)
|
|||
|
|
- **Hit rate:** 40-80% (workload dependent)
|
|||
|
|
- **Effective average:** 20-60 cycles
|
|||
|
|
- **Memory overhead:** 16 bytes per thread
|
|||
|
|
|
|||
|
|
### Pros
|
|||
|
|
- ✅ **Zero per-allocation overhead**
|
|||
|
|
- ✅ **Simple implementation** (~100 LOC)
|
|||
|
|
- ✅ **Works with existing allocations**
|
|||
|
|
|
|||
|
|
### Cons
|
|||
|
|
- ❌ **Unpredictable performance** (hit rate varies)
|
|||
|
|
- ❌ **Poor for mixed-size workloads**
|
|||
|
|
- ❌ **No correctness guarantee** (must validate)
|
|||
|
|
- ❌ **Thread-local state pollution**
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Recommended Design: Hybrid Option 1B - Smart Header
|
|||
|
|
|
|||
|
|
### Architecture
|
|||
|
|
|
|||
|
|
The key insight: **Reuse existing wasted space for headers with zero memory cost**.
|
|||
|
|
|
|||
|
|
```
|
|||
|
|
SuperSlab Layout (2MB):
|
|||
|
|
[SuperSlab Header: 1088 bytes]
|
|||
|
|
[WASTED PADDING: 960 bytes] ← Repurpose for headers!
|
|||
|
|
[Slab 0 Data: 63488 bytes]
|
|||
|
|
[Slab 1: 65536 bytes]
|
|||
|
|
...
|
|||
|
|
[Slab 31: 65536 bytes]
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Implementation Strategy
|
|||
|
|
|
|||
|
|
1. **Phase 1: Header in Padding (Slab 0 only)**
|
|||
|
|
- Use the 960 bytes of padding for class headers
|
|||
|
|
- Supports 960 allocations with zero overhead
|
|||
|
|
- Perfect for hot allocations
|
|||
|
|
|
|||
|
|
2. **Phase 2: Inline Headers (All slabs)**
|
|||
|
|
- Add 1-byte header for slabs 1-31
|
|||
|
|
- Minimal overhead (1.5% average)
|
|||
|
|
|
|||
|
|
3. **Phase 3: Adaptive Mode**
|
|||
|
|
- Hot classes use headers
|
|||
|
|
- Cold classes use fallback
|
|||
|
|
- Best of both worlds
|
|||
|
|
|
|||
|
|
### Code Design
|
|||
|
|
|
|||
|
|
```c
|
|||
|
|
// Configuration flag
|
|||
|
|
#define HAKMEM_FAST_FREE_HEADERS 1
|
|||
|
|
|
|||
|
|
// Allocation with header
|
|||
|
|
void* tiny_alloc_with_header(int class_idx) {
|
|||
|
|
void* ptr = tiny_alloc_raw(class_idx);
|
|||
|
|
if (ptr) {
|
|||
|
|
// Store class just before the block
|
|||
|
|
*((uint8_t*)ptr - 1) = class_idx;
|
|||
|
|
}
|
|||
|
|
return ptr;
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// Ultra-fast free path (4-5 instructions total)
|
|||
|
|
void hak_free_fast(void* ptr) {
|
|||
|
|
// 1. Check header mode (compile-time eliminated)
|
|||
|
|
if (HAKMEM_FAST_FREE_HEADERS) {
|
|||
|
|
// 2. Read class (1 instruction)
|
|||
|
|
uint8_t class_idx = *((uint8_t*)ptr - 1);
|
|||
|
|
|
|||
|
|
// 3. Validate (debug only)
|
|||
|
|
if (class_idx < TINY_NUM_CLASSES) {
|
|||
|
|
// 4. Push to TLS (3 instructions)
|
|||
|
|
void** head = &g_tls_sll_head[class_idx];
|
|||
|
|
*(void**)ptr = *head;
|
|||
|
|
*head = ptr;
|
|||
|
|
return;
|
|||
|
|
}
|
|||
|
|
}
|
|||
|
|
|
|||
|
|
// 5. Fallback to slow path
|
|||
|
|
hak_tiny_free_slow(ptr);
|
|||
|
|
}
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
### Memory Calculation
|
|||
|
|
|
|||
|
|
For 1M allocations across all classes:
|
|||
|
|
```
|
|||
|
|
Class 0 (8B): 125K blocks × 1B = 125KB overhead (12.5%)
|
|||
|
|
Class 1 (16B): 125K blocks × 1B = 125KB overhead (6.25%)
|
|||
|
|
Class 2 (32B): 125K blocks × 1B = 125KB overhead (3.13%)
|
|||
|
|
Class 3 (64B): 125K blocks × 1B = 125KB overhead (1.56%)
|
|||
|
|
Class 4 (128B): 125K blocks × 1B = 125KB overhead (0.78%)
|
|||
|
|
Class 5 (256B): 125K blocks × 1B = 125KB overhead (0.39%)
|
|||
|
|
Class 6 (512B): 125K blocks × 1B = 125KB overhead (0.20%)
|
|||
|
|
Class 7 (1KB): 125K blocks × 1B = 125KB overhead (0.10%)
|
|||
|
|
|
|||
|
|
Average overhead: ~1.5% (acceptable)
|
|||
|
|
```
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Implementation Plan
|
|||
|
|
|
|||
|
|
### Phase 1: Proof of Concept (1-2 days)
|
|||
|
|
1. **Add header field** to allocation path
|
|||
|
|
2. **Implement fast free** with header lookup
|
|||
|
|
3. **Benchmark** against current implementation
|
|||
|
|
4. **Files to modify:**
|
|||
|
|
- `core/tiny_alloc_fast.inc.h` - Add header write
|
|||
|
|
- `core/tiny_free_fast.inc.h` - Add header read
|
|||
|
|
- `core/hakmem_tiny_superslab.h` - Adjust offsets
|
|||
|
|
|
|||
|
|
### Phase 2: Production Integration (2-3 days)
|
|||
|
|
1. **Add feature flag** `HAKMEM_REGION_ID_MODE`
|
|||
|
|
2. **Implement fallback** for non-header allocations
|
|||
|
|
3. **Add debug validation** (magic bytes, bounds checks)
|
|||
|
|
4. **Files to create:**
|
|||
|
|
- `core/tiny_region_id.h` - Region ID API
|
|||
|
|
- `core/tiny_region_id.c` - Implementation
|
|||
|
|
|
|||
|
|
### Phase 3: Testing & Optimization (1-2 days)
|
|||
|
|
1. **Unit tests** for correctness
|
|||
|
|
2. **Stress tests** for thread safety
|
|||
|
|
3. **Performance tuning** (alignment, prefetch)
|
|||
|
|
4. **Benchmarks:**
|
|||
|
|
- `larson_hakmem` - Multi-threaded
|
|||
|
|
- `bench_random_mixed` - Mixed sizes
|
|||
|
|
- `bench_freelist_lifo` - Pure free benchmark
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Performance Projection
|
|||
|
|
|
|||
|
|
### Current State (Baseline)
|
|||
|
|
- **Free throughput:** 1.2M ops/s
|
|||
|
|
- **CPU time:** 52.63% in free path
|
|||
|
|
- **Bottleneck:** SuperSlab lookup (100+ cycles)
|
|||
|
|
|
|||
|
|
### With Region-ID Headers
|
|||
|
|
- **Free throughput:** 40-60M ops/s (33-50x improvement)
|
|||
|
|
- **CPU time:** <2% in free path
|
|||
|
|
- **Fast path:** 3-5 cycles
|
|||
|
|
|
|||
|
|
### Comparison
|
|||
|
|
| Allocator | Free ops/s | Relative |
|
|||
|
|
|-----------|------------|----------|
|
|||
|
|
| System malloc | 56M | 1.00x |
|
|||
|
|
| **HAKMEM+Headers** | **40-60M** | **0.7-1.1x** ⭐ |
|
|||
|
|
| mimalloc | 45M | 0.80x |
|
|||
|
|
| HAKMEM current | 1.2M | 0.02x |
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Risk Analysis
|
|||
|
|
|
|||
|
|
### Risks
|
|||
|
|
1. **Memory overhead** for small allocations (12.5% for 8-byte blocks)
|
|||
|
|
- **Mitigation:** Use only for classes 2+ (32+ bytes)
|
|||
|
|
|
|||
|
|
2. **Backward compatibility** with existing allocations
|
|||
|
|
- **Mitigation:** Feature flag + gradual migration
|
|||
|
|
|
|||
|
|
3. **Corruption** if header is overwritten
|
|||
|
|
- **Mitigation:** Magic byte validation in debug mode
|
|||
|
|
|
|||
|
|
4. **Alignment issues** on some architectures
|
|||
|
|
- **Mitigation:** Ensure headers are properly aligned
|
|||
|
|
|
|||
|
|
### Rollback Plan
|
|||
|
|
- Feature flag `HAKMEM_REGION_ID_MODE=0` disables completely
|
|||
|
|
- Existing slow path remains as fallback
|
|||
|
|
- No changes to allocation unless flag is set
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
## Conclusion
|
|||
|
|
|
|||
|
|
**Recommendation: Implement Option 1B (Smart Headers)**
|
|||
|
|
|
|||
|
|
This hybrid approach provides:
|
|||
|
|
- **Near-optimal performance** (3-5 cycles)
|
|||
|
|
- **Acceptable memory overhead** (~1.5% average)
|
|||
|
|
- **Perfect correctness** (no races, no misses)
|
|||
|
|
- **Simple implementation** (200-300 LOC)
|
|||
|
|
- **Full compatibility** via feature flags
|
|||
|
|
|
|||
|
|
The dramatic speedup (30-50x) will bring HAKMEM's free performance in line with System malloc while maintaining all existing safety features. The implementation is straightforward and can be completed in 4-6 days with full testing.
|
|||
|
|
|
|||
|
|
### Next Steps
|
|||
|
|
1. Review this design with the team
|
|||
|
|
2. Implement Phase 1 proof-of-concept
|
|||
|
|
3. Measure actual performance improvement
|
|||
|
|
4. Decide on production rollout strategy
|
|||
|
|
|
|||
|
|
---
|
|||
|
|
|
|||
|
|
**End of Design Document**
|