hakmem/REGION_ID_DESIGN.md

# Region-ID Direct Lookup Design for Ultra-Fast Free Path

**Date:** 2025-11-08
**Author:** Claude (Ultrathink Analysis)
**Goal:** Eliminate SuperSlab lookup bottleneck (52.63% CPU) to achieve 40-80M ops/s free throughput

---

## Executive Summary

The HAKMEM free() path is currently **47x slower** than System malloc (1.2M vs 56M ops/s) due to expensive SuperSlab registry lookups that consume over 50% of CPU time. The root cause is the need to determine `class_idx` from a pointer to know which TLS freelist to use.

**Recommendation:** Implement **Option 1B: Inline Header with Class Index** - a hybrid approach that embeds a 1-byte class index in a header while maintaining backward compatibility. This approach offers:
- **3-5 instruction free path** (vs current 330+ lines)
- **Expected 30-50x speedup** (1.2M → 40-60M ops/s)
- **Minimal memory overhead** (1 byte per allocation)
- **Simple implementation** (200-300 LOC changes)
- **Full compatibility** with existing Box Theory design

The key insight: We already have 2048 bytes of header space in SuperSlab's slab[0] that's currently wasted as padding. We can repurpose this for inline headers with zero additional memory cost for the first slab.

---

## Detailed Comparison Table

| Criteria | Option 1: Header Embedding | Option 2: Address Range | Option 3: TLS Cache | Hybrid 1B |
|----------|----------------------------|------------------------|-------------------|-----------|
| **Latency (cycles)** | 2-3 (best) | 5-10 (good) | 1-2 hit / 100+ miss | 2-3 |
| **Memory Overhead** | 1-4 bytes/block | 0 bytes | 0 bytes | 1 byte/block |
| **Implementation Complexity** | 3/10 (simple) | 7/10 (complex) | 4/10 (moderate) | 4/10 |
| **Correctness** | Perfect (embedded) | Good (math-based) | Probabilistic | Perfect |
| **Cache Friendliness** | Excellent (inline) | Good | Variable | Excellent |
| **Thread Safety** | Perfect | Perfect | Good | Perfect |
| **UAF Detection** | Yes (can add magic) | No | No | Yes |
| **Debug Support** | Excellent | Moderate | Poor | Excellent |
| **Backward Compat** | Needs flag | Complex | Easy | Easy |
| **Score** | **9/10** ⭐ | 6/10 | 5/10 | **9.5/10** ⭐⭐⭐ |

---

## Option 1: Header Embedding

### Concept
Store `class_idx` directly in a small header (1-4 bytes) before each allocation.

### Implementation Design

```c
// Header structure (1 byte minimal, 4 bytes with safety)
typedef struct {
    uint8_t class_idx;  // 0-7 for tiny classes
#ifdef HAKMEM_DEBUG
    uint8_t magic;      // 0xAB for validation
    uint16_t guard;     // Canary for overflow detection
#endif
} TinyHeader;

// Ultra-fast free (3-5 instructions)
void hak_tiny_free_fast(void* ptr) {
    // 1. Get class from header (1 instruction)
    uint8_t class_idx = *((uint8_t*)ptr - 1);

    // 2. Validate (debug only, compiled out in release)
#ifdef HAKMEM_DEBUG
    if (class_idx >= TINY_NUM_CLASSES) {
        hak_tiny_free_slow(ptr);  // Fallback
        return;
    }
#endif

    // 3. Push to TLS freelist (2-3 instructions)
    void** head = &g_tls_sll_head[class_idx];
    *(void**)ptr = *head;  // ptr->next = head
    *head = ptr;           // head = ptr
    g_tls_sll_count[class_idx]++;
}
```

### Memory Layout
```
[Header|Block] [Header|Block] [Header|Block] ...
   1B    8B      1B    16B     1B    32B
```

### Performance Analysis
- **Best case:** 2 cycles (L1 hit, no validation)
- **Average:** 3 cycles (with increment)
- **Worst case:** 5 cycles (with debug checks)
- **Memory overhead:** 1 byte × 1M blocks = 1MB (for 1M allocations)
- **Cache impact:** Excellent (header is inline with data)

### Pros
- ✅ **Fastest possible lookup** (single byte read)
- ✅ **Perfect correctness** (no race conditions)
- ✅ **UAF detection capability** (can check magic on free)
- ✅ **Simple implementation** (~200 LOC)
- ✅ **Debug friendly** (can validate everything)

### Cons
- ❌ Memory overhead (12.5% for 8-byte blocks, 0.1% for 1KB blocks)
- ❌ Requires allocation path changes
- ❌ Not compatible with existing allocations (needs migration)

---

## Option 2: Address Range Mapping

### Concept
Calculate `class_idx` from the SuperSlab base address and slab index using bit manipulation.

### Implementation Design

```c
// Precomputed mapping table (built at SuperSlab creation)
typedef struct {
    uintptr_t base;      // SuperSlab base (2MB aligned)
    uint8_t class_idx;   // Size class for this SuperSlab
    uint8_t slab_map[32]; // Per-slab class (for mixed SuperSlabs)
} SSClassMap;

// Global registry (similar to current, but simpler)
SSClassMap g_ss_class_map[4096];  // Covers 8GB address space

// Address to class lookup (5-10 instructions)
uint8_t ptr_to_class_idx(void* ptr) {
    // 1. Get 2MB-aligned base (1 instruction)
    uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);

    // 2. Hash lookup (2-3 instructions)
    uint32_t hash = (base >> 21) & 4095;
    SSClassMap* map = &g_ss_class_map[hash];

    // 3. Validate and return (2-3 instructions)
    if (map->base == base) {
        // Optional: per-slab lookup for mixed classes
        uint32_t slab_idx = ((uintptr_t)ptr - base) / SLAB_SIZE;
        return map->slab_map[slab_idx];
    }

    // 4. Linear probe on miss (expensive fallback)
    return lookup_with_probe(base, ptr);
}
```

### Performance Analysis
- **Best case:** 5 cycles (direct hit)
- **Average:** 8 cycles (with validation)
- **Worst case:** 50+ cycles (linear probing)
- **Memory overhead:** 0 (uses existing structures)
- **Cache impact:** Good (map is compact)

### Pros
- ✅ **Zero memory overhead** per allocation
- ✅ **Works with existing allocations**
- ✅ **Thread-safe** (read-only lookup)

### Cons
- ❌ **Hash collisions** cause slowdown
- ❌ **Complex implementation** (hash table maintenance)
- ❌ **No UAF detection**
- ❌ Still requires memory loads (not as fast as inline header)

---

## Option 3: TLS Last-Class Cache

### Concept
Cache the last freed class per thread, betting on temporal locality.

### Implementation Design

```c
// TLS cache (per-thread)
__thread struct {
    void* last_base;     // Last SuperSlab base
    uint8_t last_class;  // Last class index
    uint32_t hit_count;  // Statistics
} g_tls_class_cache;

// Speculative fast path
void hak_tiny_free_cached(void* ptr) {
    // 1. Speculative check (2-3 instructions)
    uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);
    if (base == (uintptr_t)g_tls_class_cache.last_base) {
        // Hit! Use cached class (1-2 instructions)
        uint8_t class_idx = g_tls_class_cache.last_class;
        tiny_free_to_tls(ptr, class_idx);
        g_tls_class_cache.hit_count++;
        return;
    }

    // 2. Miss - full lookup (expensive)
    SuperSlab* ss = hak_super_lookup(ptr);  // 50-100 cycles
    if (ss) {
        // Update cache
        g_tls_class_cache.last_base = (void*)ss;
        g_tls_class_cache.last_class = ss->size_class;
        hak_tiny_free_superslab(ptr, ss);
    }
}
```

### Performance Analysis
- **Hit case:** 2-3 cycles (excellent)
- **Miss case:** 100+ cycles (terrible)
- **Hit rate:** 40-80% (workload dependent)
- **Effective average:** 20-60 cycles
- **Memory overhead:** 16 bytes per thread

### Pros
- ✅ **Zero per-allocation overhead**
- ✅ **Simple implementation** (~100 LOC)
- ✅ **Works with existing allocations**

### Cons
- ❌ **Unpredictable performance** (hit rate varies)
- ❌ **Poor for mixed-size workloads**
- ❌ **No correctness guarantee** (must validate)
- ❌ **Thread-local state pollution**

---

## Recommended Design: Hybrid Option 1B - Smart Header

### Architecture

The key insight: **Reuse existing wasted space for headers with zero memory cost**.

```
SuperSlab Layout (2MB):
[SuperSlab Header: 1088 bytes]
[WASTED PADDING: 960 bytes] ← Repurpose for headers!
[Slab 0 Data: 63488 bytes]
[Slab 1: 65536 bytes]
...
[Slab 31: 65536 bytes]
```

### Implementation Strategy

1. **Phase 1: Header in Padding (Slab 0 only)**
   - Use the 960 bytes of padding for class headers
   - Supports 960 allocations with zero overhead
   - Perfect for hot allocations

2. **Phase 2: Inline Headers (All slabs)**
   - Add 1-byte header for slabs 1-31
   - Minimal overhead (1.5% average)

3. **Phase 3: Adaptive Mode**
   - Hot classes use headers
   - Cold classes use fallback
   - Best of both worlds

### Code Design

```c
// Configuration flag
#define HAKMEM_FAST_FREE_HEADERS 1

// Allocation with header
void* tiny_alloc_with_header(int class_idx) {
    void* ptr = tiny_alloc_raw(class_idx);
    if (ptr) {
        // Store class just before the block
        *((uint8_t*)ptr - 1) = class_idx;
    }
    return ptr;
}

// Ultra-fast free path (4-5 instructions total)
void hak_free_fast(void* ptr) {
    // 1. Check header mode (compile-time eliminated)
    if (HAKMEM_FAST_FREE_HEADERS) {
        // 2. Read class (1 instruction)
        uint8_t class_idx = *((uint8_t*)ptr - 1);

        // 3. Validate (debug only)
        if (class_idx < TINY_NUM_CLASSES) {
            // 4. Push to TLS (3 instructions)
            void** head = &g_tls_sll_head[class_idx];
            *(void**)ptr = *head;
            *head = ptr;
            return;
        }
    }

    // 5. Fallback to slow path
    hak_tiny_free_slow(ptr);
}
```

### Memory Calculation

For 1M allocations across all classes:
```
Class 0 (8B):   125K blocks × 1B = 125KB overhead (12.5%)
Class 1 (16B):  125K blocks × 1B = 125KB overhead (6.25%)
Class 2 (32B):  125K blocks × 1B = 125KB overhead (3.13%)
Class 3 (64B):  125K blocks × 1B = 125KB overhead (1.56%)
Class 4 (128B): 125K blocks × 1B = 125KB overhead (0.78%)
Class 5 (256B): 125K blocks × 1B = 125KB overhead (0.39%)
Class 6 (512B): 125K blocks × 1B = 125KB overhead (0.20%)
Class 7 (1KB):  125K blocks × 1B = 125KB overhead (0.10%)

Average overhead: ~1.5% (acceptable)
```

---

## Implementation Plan

### Phase 1: Proof of Concept (1-2 days)
1. **Add header field** to allocation path
2. **Implement fast free** with header lookup
3. **Benchmark** against current implementation
4. **Files to modify:**
   - `core/tiny_alloc_fast.inc.h` - Add header write
   - `core/tiny_free_fast.inc.h` - Add header read
   - `core/hakmem_tiny_superslab.h` - Adjust offsets

### Phase 2: Production Integration (2-3 days)
1. **Add feature flag** `HAKMEM_REGION_ID_MODE`
2. **Implement fallback** for non-header allocations
3. **Add debug validation** (magic bytes, bounds checks)
4. **Files to create:**
   - `core/tiny_region_id.h` - Region ID API
   - `core/tiny_region_id.c` - Implementation

### Phase 3: Testing & Optimization (1-2 days)
1. **Unit tests** for correctness
2. **Stress tests** for thread safety
3. **Performance tuning** (alignment, prefetch)
4. **Benchmarks:**
   - `larson_hakmem` - Multi-threaded
   - `bench_random_mixed` - Mixed sizes
   - `bench_freelist_lifo` - Pure free benchmark

---

## Performance Projection

### Current State (Baseline)
- **Free throughput:** 1.2M ops/s
- **CPU time:** 52.63% in free path
- **Bottleneck:** SuperSlab lookup (100+ cycles)

### With Region-ID Headers
- **Free throughput:** 40-60M ops/s (33-50x improvement)
- **CPU time:** <2% in free path
- **Fast path:** 3-5 cycles

### Comparison
| Allocator | Free ops/s | Relative |
|-----------|------------|----------|
| System malloc | 56M | 1.00x |
| **HAKMEM+Headers** | **40-60M** | **0.7-1.1x** ⭐ |
| mimalloc | 45M | 0.80x |
| HAKMEM current | 1.2M | 0.02x |

---

## Risk Analysis

### Risks
1. **Memory overhead** for small allocations (12.5% for 8-byte blocks)
   - **Mitigation:** Use only for classes 2+ (32+ bytes)

2. **Backward compatibility** with existing allocations
   - **Mitigation:** Feature flag + gradual migration

3. **Corruption** if header is overwritten
   - **Mitigation:** Magic byte validation in debug mode

4. **Alignment issues** on some architectures
   - **Mitigation:** Ensure headers are properly aligned

### Rollback Plan
- Feature flag `HAKMEM_REGION_ID_MODE=0` disables completely
- Existing slow path remains as fallback
- No changes to allocation unless flag is set

---

## Conclusion

**Recommendation: Implement Option 1B (Smart Headers)**

This hybrid approach provides:
- **Near-optimal performance** (3-5 cycles)
- **Acceptable memory overhead** (~1.5% average)
- **Perfect correctness** (no races, no misses)
- **Simple implementation** (200-300 LOC)
- **Full compatibility** via feature flags

The dramatic speedup (30-50x) will bring HAKMEM's free performance in line with System malloc while maintaining all existing safety features. The implementation is straightforward and can be completed in 4-6 days with full testing.

### Next Steps
1. Review this design with the team
2. Implement Phase 1 proof-of-concept
3. Measure actual performance improvement
4. Decide on production rollout strategy

---

**End of Design Document**
-												Phase 7-1 PoC: Region-ID Direct Lookup (+39%~+436% improvement!)

Implemented ultra-fast header-based free path that eliminates SuperSlab
lookup bottleneck (100+ cycles → 5-10 cycles).

## Key Changes

1. **Smart Headers** (core/tiny_region_id.h):
   - 1-byte header before each allocation stores class_idx
   - Memory layout: [Header: 1B] [User data: N-1B]
   - Overhead: <2% average (0% for Slab[0] using wasted padding)

2. **Ultra-Fast Allocation** (core/tiny_alloc_fast.inc.h):
   - Write header at base: *base = class_idx
   - Return user pointer: base + 1

3. **Ultra-Fast Free** (core/tiny_free_fast_v2.inc.h):
   - Read class_idx from header (ptr-1): 2-3 cycles
   - Push base (ptr-1) to TLS freelist: 3-5 cycles
   - Total: 5-10 cycles (vs 500+ cycles current!)

4. **Free Path Integration** (core/box/hak_free_api.inc.h):
   - Removed SuperSlab lookup from fast path
   - Direct header validation (no lookup needed!)

5. **Size Class Adjustment** (core/hakmem_tiny.h):
   - Max tiny size: 1023B (was 1024B)
   - 1024B requests → Mid allocator fallback

## Performance Results

| Size | Baseline | Phase 7 | Improvement |
|------|----------|---------|-------------|
| 128B | 1.22M | 6.54M | **+436%** 🚀 |
| 512B | 1.22M | 1.70M | **+39%** |
| 1023B | 1.22M | 1.92M | **+57%** |

## Build & Test

Enable Phase 7:
  make HEADER_CLASSIDX=1 bench_random_mixed_hakmem

Run benchmark:
  HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000 128 1234567

## Known Issues

- 1024B requests fallback to Mid allocator (by design)
- Target 40-60M ops/s not yet reached (current: 1.7-6.5M)
- Further optimization needed (TLS capacity tuning, refill optimization)

## Credits

Design: ChatGPT Pro Ultrathink, Claude Code
Implementation: Claude Code with Task Agent Ultrathink support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

											
										
										
											2025-11-08 03:18:17 +09:00
+								# Region-ID Direct Lookup Design for Ultra-Fast Free Path
 								**Date:** 2025-11-08
 								**Author:** Claude (Ultrathink Analysis)
 								**Goal:** Eliminate SuperSlab lookup bottleneck (52.63% CPU) to achieve 40-80M ops/s free throughput
 								---
 								## Executive Summary
 								The HAKMEM free() path is currently **47x slower** than System malloc (1.2M vs 56M ops/s) due to expensive SuperSlab registry lookups that consume over 50% of CPU time. The root cause is the need to determine `class_idx` from a pointer to know which TLS freelist to use.
 								**Recommendation:** Implement **Option 1B: Inline Header with Class Index** - a hybrid approach that embeds a 1-byte class index in a header while maintaining backward compatibility. This approach offers:
 								- **3-5 instruction free path** (vs current 330+ lines)
 								- **Expected 30-50x speedup** (1.2M → 40-60M ops/s)
 								- **Minimal memory overhead** (1 byte per allocation)
 								- **Simple implementation** (200-300 LOC changes)
 								- **Full compatibility** with existing Box Theory design
 								The key insight: We already have 2048 bytes of header space in SuperSlab's slab[0] that's currently wasted as padding. We can repurpose this for inline headers with zero additional memory cost for the first slab.
 								---
 								## Detailed Comparison Table
 								| Criteria | Option 1: Header Embedding | Option 2: Address Range | Option 3: TLS Cache | Hybrid 1B |
 								|----------|----------------------------|------------------------|-------------------|-----------|
 								| **Latency (cycles)** | 2-3 (best) | 5-10 (good) | 1-2 hit / 100+ miss | 2-3 |
 								| **Memory Overhead** | 1-4 bytes/block | 0 bytes | 0 bytes | 1 byte/block |
 								| **Implementation Complexity** | 3/10 (simple) | 7/10 (complex) | 4/10 (moderate) | 4/10 |
 								| **Correctness** | Perfect (embedded) | Good (math-based) | Probabilistic | Perfect |
 								| **Cache Friendliness** | Excellent (inline) | Good | Variable | Excellent |
 								| **Thread Safety** | Perfect | Perfect | Good | Perfect |
 								| **UAF Detection** | Yes (can add magic) | No | No | Yes |
 								| **Debug Support** | Excellent | Moderate | Poor | Excellent |
 								| **Backward Compat** | Needs flag | Complex | Easy | Easy |
 								| **Score** | **9/10** ⭐ | 6/10 | 5/10 | **9.5/10** ⭐⭐⭐ |
 								---
 								## Option 1: Header Embedding
 								### Concept
 								Store `class_idx` directly in a small header (1-4 bytes) before each allocation.
 								### Implementation Design
 								```c
 								// Header structure (1 byte minimal, 4 bytes with safety)
 								typedef struct {
 								    uint8_t class_idx;  // 0-7 for tiny classes
 								#ifdef HAKMEM_DEBUG
 								    uint8_t magic;      // 0xAB for validation
 								    uint16_t guard;     // Canary for overflow detection
 								#endif
 								} TinyHeader;
 								// Ultra-fast free (3-5 instructions)
 								void hak_tiny_free_fast(void* ptr) {
 								    // 1. Get class from header (1 instruction)
 								    uint8_t class_idx = *((uint8_t*)ptr - 1);
 								    // 2. Validate (debug only, compiled out in release)
 								#ifdef HAKMEM_DEBUG
 								    if (class_idx >= TINY_NUM_CLASSES) {
 								        hak_tiny_free_slow(ptr);  // Fallback
 								        return;
 								    }
 								#endif
 								    // 3. Push to TLS freelist (2-3 instructions)
 								    void** head = &g_tls_sll_head[class_idx];
 								    *(void**)ptr = *head;  // ptr->next = head
 								    *head = ptr;           // head = ptr
 								    g_tls_sll_count[class_idx]++;
 								}
 								```
 								### Memory Layout
 								```
 								[Header|Block] [Header|Block] [Header|Block] ...
 B    8B      1B    16B     1B    32B
 								```
 								### Performance Analysis
 								- **Best case:** 2 cycles (L1 hit, no validation)
 								- **Average:** 3 cycles (with increment)
 								- **Worst case:** 5 cycles (with debug checks)
 								- **Memory overhead:** 1 byte × 1M blocks = 1MB (for 1M allocations)
 								- **Cache impact:** Excellent (header is inline with data)
 								### Pros
 								- ✅ **Fastest possible lookup** (single byte read)
 								- ✅ **Perfect correctness** (no race conditions)
 								- ✅ **UAF detection capability** (can check magic on free)
 								- ✅ **Simple implementation** (~200 LOC)
 								- ✅ **Debug friendly** (can validate everything)
 								### Cons
 								- ❌ Memory overhead (12.5% for 8-byte blocks, 0.1% for 1KB blocks)
 								- ❌ Requires allocation path changes
 								- ❌ Not compatible with existing allocations (needs migration)
 								---
 								## Option 2: Address Range Mapping
 								### Concept
 								Calculate `class_idx` from the SuperSlab base address and slab index using bit manipulation.
 								### Implementation Design
 								```c
 								// Precomputed mapping table (built at SuperSlab creation)
 								typedef struct {
 								    uintptr_t base;      // SuperSlab base (2MB aligned)
 								    uint8_t class_idx;   // Size class for this SuperSlab
 								    uint8_t slab_map[32]; // Per-slab class (for mixed SuperSlabs)
 								} SSClassMap;
 								// Global registry (similar to current, but simpler)
 								SSClassMap g_ss_class_map[4096];  // Covers 8GB address space
 								// Address to class lookup (5-10 instructions)
 								uint8_t ptr_to_class_idx(void* ptr) {
 								    // 1. Get 2MB-aligned base (1 instruction)
 								    uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);
 								    // 2. Hash lookup (2-3 instructions)
 								    uint32_t hash = (base >> 21) & 4095;
 								    SSClassMap* map = &g_ss_class_map[hash];
 								    // 3. Validate and return (2-3 instructions)
 								    if (map->base == base) {
 								        // Optional: per-slab lookup for mixed classes
 								        uint32_t slab_idx = ((uintptr_t)ptr - base) / SLAB_SIZE;
 								        return map->slab_map[slab_idx];
 								    }
 								    // 4. Linear probe on miss (expensive fallback)
 								    return lookup_with_probe(base, ptr);
 								}
 								```
 								### Performance Analysis
 								- **Best case:** 5 cycles (direct hit)
 								- **Average:** 8 cycles (with validation)
 								- **Worst case:** 50+ cycles (linear probing)
 								- **Memory overhead:** 0 (uses existing structures)
 								- **Cache impact:** Good (map is compact)
 								### Pros
 								- ✅ **Zero memory overhead** per allocation
 								- ✅ **Works with existing allocations**
 								- ✅ **Thread-safe** (read-only lookup)
 								### Cons
 								- ❌ **Hash collisions** cause slowdown
 								- ❌ **Complex implementation** (hash table maintenance)
 								- ❌ **No UAF detection**
 								- ❌ Still requires memory loads (not as fast as inline header)
 								---
 								## Option 3: TLS Last-Class Cache
 								### Concept
 								Cache the last freed class per thread, betting on temporal locality.
 								### Implementation Design
 								```c
 								// TLS cache (per-thread)
 								__thread struct {
 								    void* last_base;     // Last SuperSlab base
 								    uint8_t last_class;  // Last class index
 								    uint32_t hit_count;  // Statistics
 								} g_tls_class_cache;
 								// Speculative fast path
 								void hak_tiny_free_cached(void* ptr) {
 								    // 1. Speculative check (2-3 instructions)
 								    uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);
 								    if (base == (uintptr_t)g_tls_class_cache.last_base) {
 								        // Hit! Use cached class (1-2 instructions)
 								        uint8_t class_idx = g_tls_class_cache.last_class;
 								        tiny_free_to_tls(ptr, class_idx);
 								        g_tls_class_cache.hit_count++;
 								        return;
 								    }
 								    // 2. Miss - full lookup (expensive)
 								    SuperSlab* ss = hak_super_lookup(ptr);  // 50-100 cycles
 								    if (ss) {
 								        // Update cache
 								        g_tls_class_cache.last_base = (void*)ss;
 								        g_tls_class_cache.last_class = ss->size_class;
 								        hak_tiny_free_superslab(ptr, ss);
 								    }
 								}
 								```
 								### Performance Analysis
 								- **Hit case:** 2-3 cycles (excellent)
 								- **Miss case:** 100+ cycles (terrible)
 								- **Hit rate:** 40-80% (workload dependent)
 								- **Effective average:** 20-60 cycles
 								- **Memory overhead:** 16 bytes per thread
 								### Pros
 								- ✅ **Zero per-allocation overhead**
 								- ✅ **Simple implementation** (~100 LOC)
 								- ✅ **Works with existing allocations**
 								### Cons
 								- ❌ **Unpredictable performance** (hit rate varies)
 								- ❌ **Poor for mixed-size workloads**
 								- ❌ **No correctness guarantee** (must validate)
 								- ❌ **Thread-local state pollution**
 								---
 								## Recommended Design: Hybrid Option 1B - Smart Header
 								### Architecture
 								The key insight: **Reuse existing wasted space for headers with zero memory cost**.
 								```
 								SuperSlab Layout (2MB):
 								[SuperSlab Header: 1088 bytes]
 								[WASTED PADDING: 960 bytes] ← Repurpose for headers!
 								[Slab 0 Data: 63488 bytes]
 								[Slab 1: 65536 bytes]
 								...
 								[Slab 31: 65536 bytes]
 								```
 								### Implementation Strategy
 . **Phase 1: Header in Padding (Slab 0 only)**
 								   - Use the 960 bytes of padding for class headers
 								   - Supports 960 allocations with zero overhead
 								   - Perfect for hot allocations
 . **Phase 2: Inline Headers (All slabs)**
 								   - Add 1-byte header for slabs 1-31
 								   - Minimal overhead (1.5% average)
 . **Phase 3: Adaptive Mode**
 								   - Hot classes use headers
 								   - Cold classes use fallback
 								   - Best of both worlds
 								### Code Design
 								```c
 								// Configuration flag
 								#define HAKMEM_FAST_FREE_HEADERS 1
 								// Allocation with header
 								void* tiny_alloc_with_header(int class_idx) {
 								    void* ptr = tiny_alloc_raw(class_idx);
 								    if (ptr) {
 								        // Store class just before the block
 								        *((uint8_t*)ptr - 1) = class_idx;
 								    }
 								    return ptr;
 								}
 								// Ultra-fast free path (4-5 instructions total)
 								void hak_free_fast(void* ptr) {
 								    // 1. Check header mode (compile-time eliminated)
 								    if (HAKMEM_FAST_FREE_HEADERS) {
 								        // 2. Read class (1 instruction)
 								        uint8_t class_idx = *((uint8_t*)ptr - 1);
 								        // 3. Validate (debug only)
 								        if (class_idx < TINY_NUM_CLASSES) {
 								            // 4. Push to TLS (3 instructions)
 								            void** head = &g_tls_sll_head[class_idx];
 								            *(void**)ptr = *head;
 								            *head = ptr;
 								            return;
 								        }
 								    }
 								    // 5. Fallback to slow path
 								    hak_tiny_free_slow(ptr);
 								}
 								```
 								### Memory Calculation
 								For 1M allocations across all classes:
 								```
 								Class 0 (8B):   125K blocks × 1B = 125KB overhead (12.5%)
 								Class 1 (16B):  125K blocks × 1B = 125KB overhead (6.25%)
 								Class 2 (32B):  125K blocks × 1B = 125KB overhead (3.13%)
 								Class 3 (64B):  125K blocks × 1B = 125KB overhead (1.56%)
 								Class 4 (128B): 125K blocks × 1B = 125KB overhead (0.78%)
 								Class 5 (256B): 125K blocks × 1B = 125KB overhead (0.39%)
 								Class 6 (512B): 125K blocks × 1B = 125KB overhead (0.20%)
 								Class 7 (1KB):  125K blocks × 1B = 125KB overhead (0.10%)
 								Average overhead: ~1.5% (acceptable)
 								```
 								---
 								## Implementation Plan
 								### Phase 1: Proof of Concept (1-2 days)
 . **Add header field** to allocation path
 . **Implement fast free** with header lookup
 . **Benchmark** against current implementation
 . **Files to modify:**
 								   - `core/tiny_alloc_fast.inc.h` - Add header write
 								   - `core/tiny_free_fast.inc.h` - Add header read
 								   - `core/hakmem_tiny_superslab.h` - Adjust offsets
 								### Phase 2: Production Integration (2-3 days)
 . **Add feature flag** `HAKMEM_REGION_ID_MODE`
 . **Implement fallback** for non-header allocations
 . **Add debug validation** (magic bytes, bounds checks)
 . **Files to create:**
 								   - `core/tiny_region_id.h` - Region ID API
 								   - `core/tiny_region_id.c` - Implementation
 								### Phase 3: Testing & Optimization (1-2 days)
 . **Unit tests** for correctness
 . **Stress tests** for thread safety
 . **Performance tuning** (alignment, prefetch)
 . **Benchmarks:**
 								   - `larson_hakmem` - Multi-threaded
 								   - `bench_random_mixed` - Mixed sizes
 								   - `bench_freelist_lifo` - Pure free benchmark
 								---
 								## Performance Projection
 								### Current State (Baseline)
 								- **Free throughput:** 1.2M ops/s
 								- **CPU time:** 52.63% in free path
 								- **Bottleneck:** SuperSlab lookup (100+ cycles)
 								### With Region-ID Headers
 								- **Free throughput:** 40-60M ops/s (33-50x improvement)
 								- **CPU time:** <2% in free path
 								- **Fast path:** 3-5 cycles
 								### Comparison
 								| Allocator | Free ops/s | Relative |
 								|-----------|------------|----------|
 								| System malloc | 56M | 1.00x |
 								| **HAKMEM+Headers** | **40-60M** | **0.7-1.1x** ⭐ |
 								| mimalloc | 45M | 0.80x |
 								| HAKMEM current | 1.2M | 0.02x |
 								---
 								## Risk Analysis
 								### Risks
 . **Memory overhead** for small allocations (12.5% for 8-byte blocks)
 								   - **Mitigation:** Use only for classes 2+ (32+ bytes)
 . **Backward compatibility** with existing allocations
 								   - **Mitigation:** Feature flag + gradual migration
 . **Corruption** if header is overwritten
 								   - **Mitigation:** Magic byte validation in debug mode
 . **Alignment issues** on some architectures
 								   - **Mitigation:** Ensure headers are properly aligned
 								### Rollback Plan
 								- Feature flag `HAKMEM_REGION_ID_MODE=0` disables completely
 								- Existing slow path remains as fallback
 								- No changes to allocation unless flag is set
 								---
 								## Conclusion
 								**Recommendation: Implement Option 1B (Smart Headers)**
 								This hybrid approach provides:
 								- **Near-optimal performance** (3-5 cycles)
 								- **Acceptable memory overhead** (~1.5% average)
 								- **Perfect correctness** (no races, no misses)
 								- **Simple implementation** (200-300 LOC)
 								- **Full compatibility** via feature flags
 								The dramatic speedup (30-50x) will bring HAKMEM's free performance in line with System malloc while maintaining all existing safety features. The implementation is straightforward and can be completed in 4-6 days with full testing.
 								### Next Steps
 . Review this design with the team
 . Implement Phase 1 proof-of-concept
 . Measure actual performance improvement
 . Decide on production rollout strategy
 								---
 								**End of Design Document**