

# Region-ID Direct Lookup Design for Ultra-Fast Free Path
**Date:** 2025-11-08
**Author:** Claude (Ultrathink Analysis)
**Goal:** Eliminate SuperSlab lookup bottleneck (52.63% CPU) to achieve 40-80M ops/s free throughput

---
## Executive Summary
The HAKMEM free() path is currently **47x slower** than System malloc (1.2M vs 56M ops/s) because SuperSlab registry lookups consume over 50% of CPU time. The root cause is that free() must determine `class_idx` from a bare pointer before it knows which TLS freelist to push onto.

**Recommendation:** Implement **Option 1B: Inline Header with Class Index**, a hybrid approach that embeds a 1-byte class index in a header while maintaining backward compatibility. This approach offers:
- **3-5 instruction free path** (vs the current 330+ lines of code)
- **Expected 33-50x speedup** (1.2M → 40-60M ops/s)
- **Minimal memory overhead** (1 byte per allocation)
- **Simple implementation** (200-300 LOC changes)
- **Full compatibility** with existing Box Theory design

The key insight: SuperSlab's slab[0] already reserves 2048 bytes of header space, of which 960 bytes are currently wasted as padding. We can repurpose this padding for headers with zero additional memory cost for the first slab.

---
## Detailed Comparison Table
| Criteria | Option 1: Header Embedding | Option 2: Address Range | Option 3: TLS Cache | Hybrid 1B |
|----------|----------------------------|------------------------|-------------------|-----------|
| **Latency (cycles)** | 2-3 (best) | 5-10 (good) | 1-2 hit / 100+ miss | 2-3 |
| **Memory Overhead** | 1-4 bytes/block | 0 bytes | 0 bytes | 1 byte/block |
| **Implementation Complexity** | 3/10 (simple) | 7/10 (complex) | 4/10 (moderate) | 4/10 |
| **Correctness** | Perfect (embedded) | Good (math-based) | Probabilistic | Perfect |
| **Cache Friendliness** | Excellent (inline) | Good | Variable | Excellent |
| **Thread Safety** | Perfect | Perfect | Good | Perfect |
| **UAF Detection** | Yes (can add magic) | No | No | Yes |
| **Debug Support** | Excellent | Moderate | Poor | Excellent |
| **Backward Compat** | Needs flag | Complex | Easy | Easy |
| **Score** | **9/10** ⭐ | 6/10 | 5/10 | **9.5/10** ⭐⭐⭐ |
---
## Option 1: Header Embedding
### Concept
Store `class_idx` directly in a small header (1-4 bytes) before each allocation.
### Implementation Design
```c
#include <stdint.h>

// Header structure (1 byte minimal, 4 bytes with safety).
// class_idx is declared last so it always sits at ptr - 1,
// in both release and debug builds.
typedef struct {
#ifdef HAKMEM_DEBUG
    uint16_t guard;      // Canary for overflow detection
    uint8_t  magic;      // 0xAB for validation
#endif
    uint8_t  class_idx;  // 0-7 for tiny classes
} TinyHeader;

// Ultra-fast free (3-5 instructions)
void hak_tiny_free_fast(void* ptr) {
    // 1. Get class from header (1 instruction)
    uint8_t class_idx = *((uint8_t*)ptr - 1);

    // 2. Validate (debug only, compiled out in release)
#ifdef HAKMEM_DEBUG
    if (class_idx >= TINY_NUM_CLASSES) {
        hak_tiny_free_slow(ptr);  // Fallback
        return;
    }
#endif

    // 3. Push to TLS freelist (2-3 instructions)
    void** head = &g_tls_sll_head[class_idx];
    *(void**)ptr = *head;  // ptr->next = head
    *head = ptr;           // head = ptr
    g_tls_sll_count[class_idx]++;
}
```
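Because the fast path reads the class index at `ptr - 1`, the class byte has to be the last byte of the header regardless of which debug fields are compiled in. The following is a minimal sketch of how that could be checked at compile time; `TinyHeaderDebug` and `tiny_header_class` are hypothetical names used only for illustration, not part of HAKMEM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical debug-build header: class_idx is declared last so that it
 * occupies the byte immediately before the user pointer. */
typedef struct {
    uint16_t guard;      /* canary for overflow detection */
    uint8_t  magic;      /* 0xAB for validation */
    uint8_t  class_idx;  /* 0-7 for tiny classes */
} TinyHeaderDebug;

/* Compile-time checks: 4-byte header, class index in the final byte. */
static_assert(sizeof(TinyHeaderDebug) == 4, "debug header must be 4 bytes");
static_assert(offsetof(TinyHeaderDebug, class_idx) == sizeof(TinyHeaderDebug) - 1,
              "class_idx must be the last byte of the header");

/* Read the class index the same way the fast free path does. */
static inline uint8_t tiny_header_class(const void *user_ptr) {
    return *((const uint8_t *)user_ptr - 1);
}
```

With this ordering, a release build that keeps only `class_idx` and a debug build that adds `guard` and `magic` can share the same one-byte read in the free path.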
### Memory Layout
```
[Header|Block] [Header|Block] [Header|Block] ...
   1B    8B       1B   16B       1B   32B
```
### Performance Analysis
- **Best case:** 2 cycles (L1 hit, no validation)
- **Average:** 3 cycles (with increment)
- **Worst case:** 5 cycles (with debug checks)
- **Memory overhead:** 1 byte × 1M blocks = 1MB (for 1M allocations)
- **Cache impact:** Excellent (header is inline with data)
### Pros
- ✅ **Fastest possible lookup** (single byte read)
- ✅ **Perfect correctness** (no race conditions)
- ✅ **UAF detection capability** (can check magic on free)
- ✅ **Simple implementation** (~200 LOC)
- ✅ **Debug friendly** (can validate everything)
### Cons
- ❌ Memory overhead (12.5% for 8-byte blocks, 0.1% for 1KB blocks)
- ❌ Requires allocation path changes
- ❌ Not compatible with existing allocations (needs migration)
---
## Option 2: Address Range Mapping
### Concept
Calculate `class_idx` from the SuperSlab base address and slab index using bit manipulation.
### Implementation Design
```c
#include <stdint.h>

// Precomputed mapping table (built at SuperSlab creation)
typedef struct {
    uintptr_t base;          // SuperSlab base (2MB aligned)
    uint8_t   class_idx;     // Size class for this SuperSlab
    uint8_t   slab_map[32];  // Per-slab class (for mixed SuperSlabs)
} SSClassMap;

// Global registry (similar to current, but simpler)
SSClassMap g_ss_class_map[4096];  // Covers 8GB address space

// Address to class lookup (5-10 instructions)
uint8_t ptr_to_class_idx(void* ptr) {
    // 1. Get 2MB-aligned base (1 instruction)
    uintptr_t base = (uintptr_t)ptr & ~(((uintptr_t)2 << 20) - 1);

    // 2. Hash lookup (2-3 instructions)
    uint32_t hash = (base >> 21) & 4095;
    SSClassMap* map = &g_ss_class_map[hash];

    // 3. Validate and return (2-3 instructions)
    if (map->base == base) {
        // Optional: per-slab lookup for mixed classes
        uint32_t slab_idx = ((uintptr_t)ptr - base) / SLAB_SIZE;
        return map->slab_map[slab_idx];
    }

    // 4. Linear probe on miss (expensive fallback)
    return lookup_with_probe(base, ptr);
}
```
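The address arithmetic in this lookup can be tried in isolation. The sketch below assumes 2MB SuperSlabs, 64KB slabs (taken from the SuperSlab layout in this document), and a 4096-entry table; the `DEMO_*` constants and `demo_*` helpers are hypothetical names, not HAKMEM APIs.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SS_SIZE     (2ul << 20)   /* 2MB SuperSlab */
#define DEMO_SLAB_SIZE   (64ul << 10)  /* 64KB slab */
#define DEMO_MAP_ENTRIES 4096u         /* entries in the class map */

/* The mask trick below only works for power-of-two sizes. */
static_assert((DEMO_SS_SIZE & (DEMO_SS_SIZE - 1)) == 0,
              "SuperSlab size must be a power of two");

/* 2MB-aligned base of the SuperSlab containing addr. */
static inline uintptr_t demo_ss_base(uintptr_t addr) {
    return addr & ~((uintptr_t)DEMO_SS_SIZE - 1);
}

/* Table slot for a SuperSlab base: its 2MB index, wrapped to the table size. */
static inline uint32_t demo_ss_hash(uintptr_t base) {
    return (uint32_t)((base >> 21) & (DEMO_MAP_ENTRIES - 1));
}

/* Index of the 64KB slab inside the SuperSlab that contains addr. */
static inline uint32_t demo_slab_index(uintptr_t addr) {
    return (uint32_t)((addr - demo_ss_base(addr)) / DEMO_SLAB_SIZE);
}
```

For example, an address 40 bytes into slab 3 of the sixth 2MB region (offset `5 * 2MB + 3 * 64KB + 40`) maps to base `5 * 2MB`, table slot 5, and slab index 3. Two bases that are a multiple of 8GB apart land in the same slot, which is where the probing fallback comes in.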
### Performance Analysis
- **Best case:** 5 cycles (direct hit)
- **Average:** 8 cycles (with validation)
- **Worst case:** 50+ cycles (linear probing)
- **Memory overhead:** 0 bytes per allocation (the 4096-entry class map is a fixed-size global table)
- **Cache impact:** Good (map is compact)
### Pros
- ✅ **Zero memory overhead** per allocation
- ✅ **Works with existing allocations**
- ✅ **Thread-safe** (read-only lookup)
### Cons
- ❌ **Hash collisions** cause slowdown
- ❌ **Complex implementation** (hash table maintenance)
- ❌ **No UAF detection**
- ❌ Still requires memory loads (not as fast as inline header)
---
## Option 3: TLS Last-Class Cache
### Concept
Cache the last freed class per thread, betting on temporal locality.
### Implementation Design
```c
#include <stdint.h>

// TLS cache (per-thread)
__thread struct {
    void*    last_base;   // Last SuperSlab base
    uint8_t  last_class;  // Last class index
    uint32_t hit_count;   // Statistics
} g_tls_class_cache;

// Speculative fast path
void hak_tiny_free_cached(void* ptr) {
    // 1. Speculative check (2-3 instructions)
    uintptr_t base = (uintptr_t)ptr & ~(((uintptr_t)2 << 20) - 1);
    if (base == (uintptr_t)g_tls_class_cache.last_base) {
        // Hit! Use cached class (1-2 instructions)
        uint8_t class_idx = g_tls_class_cache.last_class;
        tiny_free_to_tls(ptr, class_idx);
        g_tls_class_cache.hit_count++;
        return;
    }

    // 2. Miss - full lookup (expensive)
    SuperSlab* ss = hak_super_lookup(ptr);  // 50-100 cycles
    if (ss) {
        // Update cache
        g_tls_class_cache.last_base = (void*)ss;
        g_tls_class_cache.last_class = ss->size_class;
        hak_tiny_free_superslab(ptr, ss);
    }
    // ss == NULL: not a SuperSlab pointer; handling is outside this sketch
}
```
### Performance Analysis
- **Hit case:** 2-3 cycles (excellent)
- **Miss case:** 100+ cycles (terrible)
- **Hit rate:** 40-80% (workload dependent)
- **Effective average:** 20-60 cycles
- **Memory overhead:** 16 bytes per thread
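The effective average follows from weighting the hit and miss costs by the hit rate. A minimal sketch of that arithmetic, using a hypothetical helper name and taking roughly 3 cycles per hit and 100 cycles per miss from the figures above:

```c
#include <assert.h>

/* Expected cost of one free, in cycles, for a given cache hit rate.
 * hit_rate is a fraction in [0, 1]. */
static inline double expected_free_cycles(double hit_rate,
                                          double hit_cycles,
                                          double miss_cycles) {
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

At an 80% hit rate this gives 0.8 × 3 + 0.2 × 100 = 22.4 cycles, and at 40% it gives 0.4 × 3 + 0.6 × 100 = 61.2 cycles, which is where the 20-60 cycle range comes from. The miss cost dominates the result even at high hit rates.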
### Pros
- ✅ **Zero per-allocation overhead**
- ✅ **Simple implementation** (~100 LOC)
- ✅ **Works with existing allocations**
### Cons
- ❌ **Unpredictable performance** (hit rate varies)
- ❌ **Poor for mixed-size workloads**
- ❌ **No correctness guarantee** (must validate)
- ❌ **Thread-local state pollution**
---
## Recommended Design: Hybrid Option 1B - Smart Header
### Architecture
The key insight: **Reuse existing wasted space for headers with zero memory cost**.
```
SuperSlab Layout (2MB):
[SuperSlab Header: 1088 bytes]
[WASTED PADDING: 960 bytes] ← Repurpose for headers!
[Slab 0 Data: 63488 bytes]
[Slab 1: 65536 bytes]
...
[Slab 31: 65536 bytes]
```
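The byte counts in this layout are mutually consistent, and that can be pinned down at compile time. The sketch below uses hypothetical `SS_*` constant names and a hypothetical helper; only the numeric values come from the layout above.

```c
#include <assert.h>
#include <stddef.h>

enum {
    SS_HEADER_BYTES  = 1088,   /* SuperSlab header */
    SS_PADDING_BYTES = 960,    /* currently unused padding */
    SS_SLAB0_DATA    = 63488,  /* usable bytes in slab 0 */
    SS_SLAB_BYTES    = 65536,  /* full size of every slab */
    SS_NUM_SLABS     = 32
};

/* Header plus padding fill the first 2048 bytes of slab 0. */
static_assert(SS_HEADER_BYTES + SS_PADDING_BYTES == 2048,
              "header + padding must be 2048 bytes");
/* Those 2048 bytes plus slab 0 data make up exactly one 64KB slab. */
static_assert(2048 + SS_SLAB0_DATA == SS_SLAB_BYTES,
              "slab 0 must total 64KB");
/* 32 slabs of 64KB make up the 2MB SuperSlab. */
static_assert(SS_NUM_SLABS * SS_SLAB_BYTES == 2 * 1024 * 1024,
              "SuperSlab must total 2MB");

/* Byte offset of a slab's data region from the SuperSlab base. */
static inline size_t ss_slab_data_offset(unsigned slab_idx) {
    assert(slab_idx < SS_NUM_SLABS);
    return slab_idx == 0 ? (size_t)(SS_HEADER_BYTES + SS_PADDING_BYTES)
                         : (size_t)slab_idx * SS_SLAB_BYTES;
}
```

Keeping these as static assertions means a later change to the header size fails the build instead of silently shrinking the padding that Phase 1 relies on.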
### Implementation Strategy
1. **Phase 1: Header in Padding (Slab 0 only)**
   - Use the 960 bytes of padding for class headers
   - Supports 960 allocations with zero overhead
   - Perfect for hot allocations
2. **Phase 2: Inline Headers (All slabs)**
   - Add 1-byte header for slabs 1-31
   - Small overhead (0.1-12.5% per block, depending on size class)
3. **Phase 3: Adaptive Mode**
   - Hot classes use headers
   - Cold classes use fallback
   - Best of both worlds
### Code Design
```c
#include <stdint.h>

// Configuration flag
#define HAKMEM_FAST_FREE_HEADERS 1

// Allocation with header
// (tiny_alloc_raw must leave one writable byte before the returned block)
void* tiny_alloc_with_header(int class_idx) {
    void* ptr = tiny_alloc_raw(class_idx);
    if (ptr) {
        // Store class just before the block
        *((uint8_t*)ptr - 1) = (uint8_t)class_idx;
    }
    return ptr;
}

// Ultra-fast free path (4-5 instructions total)
void hak_free_fast(void* ptr) {
    // 1. Check header mode (compile-time eliminated)
    if (HAKMEM_FAST_FREE_HEADERS) {
        // 2. Read class (1 instruction)
        uint8_t class_idx = *((uint8_t*)ptr - 1);
        // 3. Validate (always on; an out-of-range value takes the slow path)
        if (class_idx < TINY_NUM_CLASSES) {
            // 4. Push to TLS (3 instructions)
            void** head = &g_tls_sll_head[class_idx];
            *(void**)ptr = *head;
            *head = ptr;
            return;
        }
    }
    // 5. Fallback to slow path
    hak_tiny_free_slow(ptr);
}
```
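To see the header round trip end to end without the rest of HAKMEM, here is a self-contained sketch. Everything prefixed `demo_` is a hypothetical stand-in (including the freelist arrays), and the next pointer is stored with `memcpy` as an assumption of this sketch so that it stays well-defined even when the 1-byte header leaves the user pointer unaligned.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NUM_CLASSES 8

/* Stand-ins for the per-thread freelist heads and counts. */
static _Thread_local void    *demo_sll_head[DEMO_NUM_CLASSES];
static _Thread_local uint32_t demo_sll_count[DEMO_NUM_CLASSES];

/* Stamp the class byte at raw[0] and return the user pointer raw + 1.
 * raw must have room for the header byte plus the block. */
static void *demo_stamp_header(void *raw, uint8_t class_idx) {
    assert(class_idx < DEMO_NUM_CLASSES);
    uint8_t *user = (uint8_t *)raw + 1;
    user[-1] = class_idx;
    return user;
}

/* Read the class byte and push the block onto that class's list.
 * Returns 0 on the fast path, -1 if the header is out of range
 * (a real allocator would take its slow path here). */
static int demo_free_fast(void *ptr) {
    uint8_t class_idx = *((const uint8_t *)ptr - 1);
    if (class_idx >= DEMO_NUM_CLASSES)
        return -1;
    void *old_head = demo_sll_head[class_idx];
    memcpy(ptr, &old_head, sizeof old_head);  /* ptr->next = old head */
    demo_sll_head[class_idx] = ptr;           /* head = ptr */
    demo_sll_count[class_idx]++;
    return 0;
}
```

Freeing two class-2 blocks in sequence leaves the second block at the head of list 2 with its next pointer referring to the first, and a block whose header byte has been overwritten with an out-of-range value is rejected instead of being pushed.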
### Memory Calculation
For 1M allocations across all classes:
```
Class 0 (8B): 125K blocks × 1B = 125KB overhead (12.5%)
Class 1 (16B): 125K blocks × 1B = 125KB overhead (6.25%)
Class 2 (32B): 125K blocks × 1B = 125KB overhead (3.13%)
Class 3 (64B): 125K blocks × 1B = 125KB overhead (1.56%)
Class 4 (128B): 125K blocks × 1B = 125KB overhead (0.78%)
Class 5 (256B): 125K blocks × 1B = 125KB overhead (0.39%)
Class 6 (512B): 125K blocks × 1B = 125KB overhead (0.20%)
Class 7 (1KB): 125K blocks × 1B = 125KB overhead (0.10%)
Average overhead: ~3.1% as an unweighted mean of the per-class percentages; ~0.4% of payload bytes (1MB of headers on 255MB of blocks)
```
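The per-class percentages and their aggregates can be recomputed directly. This sketch assumes eight classes with block sizes 8 B through 1 KB, one header byte per block, and an equal number of blocks in every class, as in the table above; the `ovh_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define OVH_NUM_CLASSES 8u

/* Block size for a class: 8, 16, 32, ... 1024 bytes. */
static inline size_t ovh_block_size(unsigned class_idx) {
    assert(class_idx < OVH_NUM_CLASSES);
    return (size_t)8 << class_idx;
}

/* Overhead of a 1-byte header as a percentage of the block size. */
static inline double ovh_class_percent(unsigned class_idx) {
    return 100.0 / (double)ovh_block_size(class_idx);
}

/* Unweighted mean of the per-class percentages. */
static double ovh_mean_percent(void) {
    double sum = 0.0;
    for (unsigned c = 0; c < OVH_NUM_CLASSES; c++)
        sum += ovh_class_percent(c);
    return sum / OVH_NUM_CLASSES;
}

/* Header bytes as a percentage of payload bytes, with the same
 * number of blocks in every class. */
static double ovh_bytes_percent(void) {
    double header_bytes = 0.0, payload_bytes = 0.0;
    for (unsigned c = 0; c < OVH_NUM_CLASSES; c++) {
        header_bytes += 1.0;
        payload_bytes += (double)ovh_block_size(c);
    }
    return 100.0 * header_bytes / payload_bytes;
}
```

Under these assumptions the unweighted mean is about 3.1% and the byte-weighted figure is about 0.39%, so the headline number depends heavily on which weighting is used and on the real size-class distribution of the workload.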
---
## Implementation Plan
### Phase 1: Proof of Concept (1-2 days)
1. **Add header field** to allocation path
2. **Implement fast free** with header lookup
3. **Benchmark** against current implementation
4. **Files to modify:**
   - `core/tiny_alloc_fast.inc.h` - Add header write
   - `core/tiny_free_fast.inc.h` - Add header read
   - `core/hakmem_tiny_superslab.h` - Adjust offsets
### Phase 2: Production Integration (2-3 days)
1. **Add feature flag** `HAKMEM_REGION_ID_MODE`
2. **Implement fallback** for non-header allocations
3. **Add debug validation** (magic bytes, bounds checks)
4. **Files to create:**
   - `core/tiny_region_id.h` - Region ID API
   - `core/tiny_region_id.c` - Implementation
### Phase 3: Testing & Optimization (1-2 days)
1. **Unit tests** for correctness
2. **Stress tests** for thread safety
3. **Performance tuning** (alignment, prefetch)
4. **Benchmarks:**
   - `larson_hakmem` - Multi-threaded
   - `bench_random_mixed` - Mixed sizes
   - `bench_freelist_lifo` - Pure free benchmark
---
## Performance Projection
### Current State (Baseline)
- **Free throughput:** 1.2M ops/s
- **CPU time:** 52.63% in free path
- **Bottleneck:** SuperSlab lookup (100+ cycles)
### With Region-ID Headers
- **Free throughput:** 40-60M ops/s (33-50x improvement)
- **CPU time:** <2% in free path
- **Fast path:** 3-5 cycles
### Comparison
| Allocator | Free ops/s | Relative |
|-----------|------------|----------|
| System malloc | 56M | 1.00x |
| **HAKMEM+Headers** | **40-60M** | **0.7-1.1x** |
| mimalloc | 45M | 0.80x |
| HAKMEM current | 1.2M | 0.02x |
---
## Risk Analysis
### Risks
1. **Memory overhead** for small allocations (12.5% for 8-byte blocks)
   - **Mitigation:** Use only for classes 2+ (32+ bytes)
2. **Backward compatibility** with existing allocations
   - **Mitigation:** Feature flag + gradual migration
3. **Corruption** if header is overwritten
   - **Mitigation:** Magic byte validation in debug mode
4. **Alignment issues** on some architectures
   - **Mitigation:** Ensure headers are properly aligned
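The magic-byte mitigation for header corruption could look roughly like the following. This is a sketch under assumptions: a 2-byte debug header with the magic at `ptr - 2` and the class at `ptr - 1`, the `0xAB` value mentioned in Option 1, and hypothetical `dbg_*` helper names.

```c
#include <assert.h>
#include <stdint.h>

#define DBG_MAGIC       0xABu
#define DBG_NUM_CLASSES 8u

/* Write a 2-byte debug header: magic byte, then class index at user - 1. */
static inline void dbg_header_write(uint8_t *user, uint8_t class_idx) {
    assert(class_idx < DBG_NUM_CLASSES);
    user[-2] = DBG_MAGIC;
    user[-1] = class_idx;
}

/* Return the class index, or -1 if the header looks corrupted. */
static inline int dbg_header_check(const uint8_t *user) {
    if (user[-2] != DBG_MAGIC)
        return -1;                 /* overwritten, foreign, or already freed */
    if (user[-1] >= DBG_NUM_CLASSES)
        return -1;                 /* class index out of range */
    return user[-1];
}

/* Clear the magic on free so a second free of the same block is caught. */
static inline void dbg_header_poison(uint8_t *user) {
    user[-2] = 0;
}
```

A single magic byte only catches most accidental overwrites (a random byte matches with probability 1/256), so this is a debugging aid and not a security boundary.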
### Rollback Plan
- Feature flag `HAKMEM_REGION_ID_MODE=0` disables completely
- Existing slow path remains as fallback
- No changes to allocation unless flag is set
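One way the rollback switch could be wired is as a compile-time macro that removes the header read entirely when set to 0. The sketch below assumes `HAKMEM_REGION_ID_MODE` is consumed as a preprocessor flag, which the plan above does not specify; `FreePath` and `choose_free_path` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef HAKMEM_REGION_ID_MODE
#define HAKMEM_REGION_ID_MODE 1   /* build with -DHAKMEM_REGION_ID_MODE=0 to disable */
#endif

typedef enum { FREE_PATH_HEADER, FREE_PATH_LEGACY } FreePath;

/* Decide which free path a pointer would take. With the flag off, the
 * header byte is never read and every pointer goes to the legacy path. */
static inline FreePath choose_free_path(const void *ptr, unsigned num_classes) {
    assert(ptr != NULL);
#if HAKMEM_REGION_ID_MODE
    uint8_t class_idx = *((const uint8_t *)ptr - 1);
    if (class_idx < num_classes)
        return FREE_PATH_HEADER;
#else
    (void)num_classes;
#endif
    return FREE_PATH_LEGACY;
}
```

Because the legacy path remains the fallback in both configurations, flipping the flag changes only which pointers take the fast path, not whether they can be freed.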
---
## Conclusion
**Recommendation: Implement Option 1B (Smart Headers)**
This hybrid approach provides:
- **Near-optimal performance** (3-5 cycles)
- **Acceptable memory overhead** (0.1-12.5% per block, depending on size class)
- **Perfect correctness** (no races, no misses)
- **Simple implementation** (200-300 LOC)
- **Full compatibility** via feature flags
The projected speedup (33-50x) would bring HAKMEM's free performance in line with System malloc while maintaining all existing safety features. The implementation is straightforward and, per the phase estimates above, can be completed in 4-7 days with full testing.
### Next Steps
1. Review this design with the team
2. Implement Phase 1 proof-of-concept
3. Measure actual performance improvement
4. Decide on production rollout strategy
---
**End of Design Document**