Files
hakmem/REGION_ID_DESIGN.md
Moe Charm (CI) 6b1382959c Phase 7-1 PoC: Region-ID Direct Lookup (+39%~+436% improvement!)
Implemented ultra-fast header-based free path that eliminates SuperSlab
lookup bottleneck (100+ cycles → 5-10 cycles).

## Key Changes

1. **Smart Headers** (core/tiny_region_id.h):
   - 1-byte header before each allocation stores class_idx
   - Memory layout: [Header: 1B] [User data: N-1B]
   - Overhead: <2% average (0% for Slab[0] using wasted padding)

2. **Ultra-Fast Allocation** (core/tiny_alloc_fast.inc.h):
   - Write header at base: *base = class_idx
   - Return user pointer: base + 1

3. **Ultra-Fast Free** (core/tiny_free_fast_v2.inc.h):
   - Read class_idx from header (ptr-1): 2-3 cycles
   - Push base (ptr-1) to TLS freelist: 3-5 cycles
   - Total: 5-10 cycles (vs 500+ cycles current!)

4. **Free Path Integration** (core/box/hak_free_api.inc.h):
   - Removed SuperSlab lookup from fast path
   - Direct header validation (no lookup needed!)

5. **Size Class Adjustment** (core/hakmem_tiny.h):
   - Max tiny size: 1023B (was 1024B)
   - 1024B requests → Mid allocator fallback

## Performance Results

| Size | Baseline | Phase 7 | Improvement |
|------|----------|---------|-------------|
| 128B | 1.22M | 6.54M | **+436%** 🚀 |
| 512B | 1.22M | 1.70M | **+39%** |
| 1023B | 1.22M | 1.92M | **+57%** |

## Build & Test

Enable Phase 7:
  make HEADER_CLASSIDX=1 bench_random_mixed_hakmem

Run benchmark:
  HAKMEM_TINY_USE_SUPERSLAB=1 ./bench_random_mixed_hakmem 10000 128 1234567

## Known Issues

- 1024B requests fallback to Mid allocator (by design)
- Target 40-60M ops/s not yet reached (current: 1.7-6.5M)
- Further optimization needed (TLS capacity tuning, refill optimization)

## Credits

Design: ChatGPT Pro Ultrathink, Claude Code
Implementation: Claude Code with Task Agent Ultrathink support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-08 03:18:17 +09:00

12 KiB
Raw Blame History

Region-ID Direct Lookup Design for Ultra-Fast Free Path

Date: 2025-11-08 Author: Claude (Ultrathink Analysis) Goal: Eliminate SuperSlab lookup bottleneck (52.63% CPU) to achieve 40-80M ops/s free throughput


Executive Summary

The HAKMEM free() path is currently 47x slower than System malloc (1.2M vs 56M ops/s) due to expensive SuperSlab registry lookups that consume over 50% of CPU time. The root cause is the need to determine class_idx from a pointer to know which TLS freelist to use.

Recommendation: Implement Option 1B: Inline Header with Class Index - a hybrid approach that embeds a 1-byte class index in a header while maintaining backward compatibility. This approach offers:

  • 3-5 instruction free path (vs current 330+ lines)
  • Expected 30-50x speedup (1.2M → 40-60M ops/s)
  • Minimal memory overhead (1 byte per allocation)
  • Simple implementation (200-300 LOC changes)
  • Full compatibility with existing Box Theory design

The key insight: We already have 2048 bytes of header space in SuperSlab's slab[0] that's currently wasted as padding. We can repurpose this for inline headers with zero additional memory cost for the first slab.


Detailed Comparison Table

Criteria Option 1: Header Embedding Option 2: Address Range Option 3: TLS Cache Hybrid 1B
Latency (cycles) 2-3 (best) 5-10 (good) 1-2 hit / 100+ miss 2-3
Memory Overhead 1-4 bytes/block 0 bytes 0 bytes 1 byte/block
Implementation Complexity 3/10 (simple) 7/10 (complex) 4/10 (moderate) 4/10
Correctness Perfect (embedded) Good (math-based) Probabilistic Perfect
Cache Friendliness Excellent (inline) Good Variable Excellent
Thread Safety Perfect Perfect Good Perfect
UAF Detection Yes (can add magic) No No Yes
Debug Support Excellent Moderate Poor Excellent
Backward Compat Needs flag Complex Easy Easy
Score 9/10 6/10 5/10 9.5/10

Option 1: Header Embedding

Concept

Store class_idx directly in a small header (1-4 bytes) before each allocation.

Implementation Design

// Header structure (1 byte minimal, 4 bytes with safety)
typedef struct {
    uint8_t class_idx;  // 0-7 for tiny classes
#ifdef HAKMEM_DEBUG
    uint8_t magic;      // 0xAB for validation
    uint16_t guard;     // Canary for overflow detection
#endif
} TinyHeader;

// Ultra-fast free (3-5 instructions)
void hak_tiny_free_fast(void* ptr) {
    // 1. Get class from header (1 instruction)
    uint8_t class_idx = *((uint8_t*)ptr - 1);

    // 2. Validate (debug only, compiled out in release)
#ifdef HAKMEM_DEBUG
    if (class_idx >= TINY_NUM_CLASSES) {
        hak_tiny_free_slow(ptr);  // Fallback
        return;
    }
#endif

    // 3. Push to TLS freelist (2-3 instructions)
    void** head = &g_tls_sll_head[class_idx];
    *(void**)ptr = *head;  // ptr->next = head
    *head = ptr;           // head = ptr
    g_tls_sll_count[class_idx]++;
}

Memory Layout

[Header|Block] [Header|Block] [Header|Block] ...
   1B    8B      1B    16B     1B    32B

Performance Analysis

  • Best case: 2 cycles (L1 hit, no validation)
  • Average: 3 cycles (with increment)
  • Worst case: 5 cycles (with debug checks)
  • Memory overhead: 1 byte × 1M blocks = 1MB (for 1M allocations)
  • Cache impact: Excellent (header is inline with data)

Pros

  • Fastest possible lookup (single byte read)
  • Perfect correctness (no race conditions)
  • UAF detection capability (can check magic on free)
  • Simple implementation (~200 LOC)
  • Debug friendly (can validate everything)

Cons

  • Memory overhead (12.5% for 8-byte blocks, 0.1% for 1KB blocks)
  • Requires allocation path changes
  • Not compatible with existing allocations (needs migration)

Option 2: Address Range Mapping

Concept

Calculate class_idx from the SuperSlab base address and slab index using bit manipulation.

Implementation Design

// Precomputed mapping table (built at SuperSlab creation)
typedef struct {
    uintptr_t base;      // SuperSlab base (2MB aligned)
    uint8_t class_idx;   // Size class for this SuperSlab
    uint8_t slab_map[32]; // Per-slab class (for mixed SuperSlabs)
} SSClassMap;

// Global registry (similar to current, but simpler)
SSClassMap g_ss_class_map[4096];  // Covers 8GB address space

// Address to class lookup (5-10 instructions)
uint8_t ptr_to_class_idx(void* ptr) {
    // 1. Get 2MB-aligned base (1 instruction)
    uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);

    // 2. Hash lookup (2-3 instructions)
    uint32_t hash = (base >> 21) & 4095;
    SSClassMap* map = &g_ss_class_map[hash];

    // 3. Validate and return (2-3 instructions)
    if (map->base == base) {
        // Optional: per-slab lookup for mixed classes
        uint32_t slab_idx = ((uintptr_t)ptr - base) / SLAB_SIZE;
        return map->slab_map[slab_idx];
    }

    // 4. Linear probe on miss (expensive fallback)
    return lookup_with_probe(base, ptr);
}

Performance Analysis

  • Best case: 5 cycles (direct hit)
  • Average: 8 cycles (with validation)
  • Worst case: 50+ cycles (linear probing)
  • Memory overhead: 0 (uses existing structures)
  • Cache impact: Good (map is compact)

Pros

  • Zero memory overhead per allocation
  • Works with existing allocations
  • Thread-safe (read-only lookup)

Cons

  • Hash collisions cause slowdown
  • Complex implementation (hash table maintenance)
  • No UAF detection
  • Still requires memory loads (not as fast as inline header)

Option 3: TLS Last-Class Cache

Concept

Cache the last freed class per thread, betting on temporal locality.

Implementation Design

// TLS cache (per-thread)
__thread struct {
    void* last_base;     // Last SuperSlab base
    uint8_t last_class;  // Last class index
    uint32_t hit_count;  // Statistics
} g_tls_class_cache;

// Speculative fast path
void hak_tiny_free_cached(void* ptr) {
    // 1. Speculative check (2-3 instructions)
    uintptr_t base = (uintptr_t)ptr & ~(2*1024*1024 - 1);
    if (base == (uintptr_t)g_tls_class_cache.last_base) {
        // Hit! Use cached class (1-2 instructions)
        uint8_t class_idx = g_tls_class_cache.last_class;
        tiny_free_to_tls(ptr, class_idx);
        g_tls_class_cache.hit_count++;
        return;
    }

    // 2. Miss - full lookup (expensive)
    SuperSlab* ss = hak_super_lookup(ptr);  // 50-100 cycles
    if (ss) {
        // Update cache
        g_tls_class_cache.last_base = (void*)ss;
        g_tls_class_cache.last_class = ss->size_class;
        hak_tiny_free_superslab(ptr, ss);
    }
}

Performance Analysis

  • Hit case: 2-3 cycles (excellent)
  • Miss case: 100+ cycles (terrible)
  • Hit rate: 40-80% (workload dependent)
  • Effective average: 20-60 cycles
  • Memory overhead: 16 bytes per thread

Pros

  • Zero per-allocation overhead
  • Simple implementation (~100 LOC)
  • Works with existing allocations

Cons

  • Unpredictable performance (hit rate varies)
  • Poor for mixed-size workloads
  • No correctness guarantee (must validate)
  • Thread-local state pollution

Architecture

The key insight: Reuse existing wasted space for headers with zero memory cost.

SuperSlab Layout (2MB):
[SuperSlab Header: 1088 bytes]
[WASTED PADDING: 960 bytes] ← Repurpose for headers!
[Slab 0 Data: 63488 bytes]
[Slab 1: 65536 bytes]
...
[Slab 31: 65536 bytes]

Implementation Strategy

  1. Phase 1: Header in Padding (Slab 0 only)

    • Use the 960 bytes of padding for class headers
    • Supports 960 allocations with zero overhead
    • Perfect for hot allocations
  2. Phase 2: Inline Headers (All slabs)

    • Add 1-byte header for slabs 1-31
    • Minimal overhead (1.5% average)
  3. Phase 3: Adaptive Mode

    • Hot classes use headers
    • Cold classes use fallback
    • Best of both worlds

Code Design

// Configuration flag
#define HAKMEM_FAST_FREE_HEADERS 1

// Allocation with header
void* tiny_alloc_with_header(int class_idx) {
    void* ptr = tiny_alloc_raw(class_idx);
    if (ptr) {
        // Store class just before the block
        *((uint8_t*)ptr - 1) = class_idx;
    }
    return ptr;
}

// Ultra-fast free path (4-5 instructions total)
void hak_free_fast(void* ptr) {
    // 1. Check header mode (compile-time eliminated)
    if (HAKMEM_FAST_FREE_HEADERS) {
        // 2. Read class (1 instruction)
        uint8_t class_idx = *((uint8_t*)ptr - 1);

        // 3. Validate (debug only)
        if (class_idx < TINY_NUM_CLASSES) {
            // 4. Push to TLS (3 instructions)
            void** head = &g_tls_sll_head[class_idx];
            *(void**)ptr = *head;
            *head = ptr;
            return;
        }
    }

    // 5. Fallback to slow path
    hak_tiny_free_slow(ptr);
}

Memory Calculation

For 1M allocations across all classes:

Class 0 (8B):   125K blocks × 1B = 125KB overhead (12.5%)
Class 1 (16B):  125K blocks × 1B = 125KB overhead (6.25%)
Class 2 (32B):  125K blocks × 1B = 125KB overhead (3.13%)
Class 3 (64B):  125K blocks × 1B = 125KB overhead (1.56%)
Class 4 (128B): 125K blocks × 1B = 125KB overhead (0.78%)
Class 5 (256B): 125K blocks × 1B = 125KB overhead (0.39%)
Class 6 (512B): 125K blocks × 1B = 125KB overhead (0.20%)
Class 7 (1KB):  125K blocks × 1B = 125KB overhead (0.10%)

Average overhead: ~1.5% (acceptable)

Implementation Plan

Phase 1: Proof of Concept (1-2 days)

  1. Add header field to allocation path
  2. Implement fast free with header lookup
  3. Benchmark against current implementation
  4. Files to modify:
    • core/tiny_alloc_fast.inc.h - Add header write
    • core/tiny_free_fast.inc.h - Add header read
    • core/hakmem_tiny_superslab.h - Adjust offsets

Phase 2: Production Integration (2-3 days)

  1. Add feature flag HAKMEM_REGION_ID_MODE
  2. Implement fallback for non-header allocations
  3. Add debug validation (magic bytes, bounds checks)
  4. Files to create:
    • core/tiny_region_id.h - Region ID API
    • core/tiny_region_id.c - Implementation

Phase 3: Testing & Optimization (1-2 days)

  1. Unit tests for correctness
  2. Stress tests for thread safety
  3. Performance tuning (alignment, prefetch)
  4. Benchmarks:
    • larson_hakmem - Multi-threaded
    • bench_random_mixed - Mixed sizes
    • bench_freelist_lifo - Pure free benchmark

Performance Projection

Current State (Baseline)

  • Free throughput: 1.2M ops/s
  • CPU time: 52.63% in free path
  • Bottleneck: SuperSlab lookup (100+ cycles)

With Region-ID Headers

  • Free throughput: 40-60M ops/s (33-50x improvement)
  • CPU time: <2% in free path
  • Fast path: 3-5 cycles

Comparison

Allocator Free ops/s Relative
System malloc 56M 1.00x
HAKMEM+Headers 40-60M 0.7-1.1x
mimalloc 45M 0.80x
HAKMEM current 1.2M 0.02x

Risk Analysis

Risks

  1. Memory overhead for small allocations (12.5% for 8-byte blocks)

    • Mitigation: Use only for classes 2+ (32+ bytes)
  2. Backward compatibility with existing allocations

    • Mitigation: Feature flag + gradual migration
  3. Corruption if header is overwritten

    • Mitigation: Magic byte validation in debug mode
  4. Alignment issues on some architectures

    • Mitigation: Ensure headers are properly aligned

Rollback Plan

  • Feature flag HAKMEM_REGION_ID_MODE=0 disables completely
  • Existing slow path remains as fallback
  • No changes to allocation unless flag is set

Conclusion

Recommendation: Implement Option 1B (Smart Headers)

This hybrid approach provides:

  • Near-optimal performance (3-5 cycles)
  • Acceptable memory overhead (~1.5% average)
  • Perfect correctness (no races, no misses)
  • Simple implementation (200-300 LOC)
  • Full compatibility via feature flags

The dramatic speedup (30-50x) will bring HAKMEM's free performance in line with System malloc while maintaining all existing safety features. The implementation is straightforward and can be completed in 4-6 days with full testing.

Next Steps

  1. Review this design with the team
  2. Implement Phase 1 proof-of-concept
  3. Measure actual performance improvement
  4. Decide on production rollout strategy

End of Design Document